Thread crashing in movej

csaba · February 11, 2021, 3:55pm

Hi,
We are experiencing random errors on CB-series robots with Polyscope version 3.13.x and 3.14.x. The scenario is the following:

- The main program is performing some relatively heavy calculations of the next waypoints, while another thread is moving the robot.
- The thread writes a log message before and after every movel and movej command (for debugging).
- Sometimes (once or twice a day) there is one movej that starts but never ends (thread crashing?).
- There is absolutely no error message anywhere.
- The robot movement stops, and the main thread detects a timeout after some time (but no error message from the system).
There are no race conditions and no critical sections either. The structure is clean, but there is still an error somewhere.
Are there any known restrictions on controlling the robot from a thread, or any changes in the latest firmware versions 3.13.x and 3.14.x that may cause this to happen?
Our code structure has not changed in the past 2 years, and suddenly we are getting this error on more than one CB robot installation. However, we haven’t heard of the same issue happening on E-series.
@Ebbe do you have any suggestions?

Ebbe · February 11, 2021, 9:00pm

Hi @csaba,

I have no knowledge of an issue like this having been introduced. But if it is related to controller execution timing, things might have changed marginally.

Is the timeout detected by your own logic or by Polyscope?
What kind of synchronization do you have between the different threads? And are any of the other threads controlling the robot movements?
Have you diagnosed whether the thread is still running by calling join from a thread that is working properly?

One simple try could be to create a critical section for each of your movements.

csaba · February 12, 2021, 9:48am

Hi @Ebbe thanks for the quick reply.
The timeout is only detected by the program logic.
The main program is calculating waypoints, and the other thread is performing the movements. There is no other thread trying to control the robot at that time. The communication between the threads is only done through some variables: the main program sets a variable (move target position) and the thread sets another variable (movement status, integer) - a little bit more complex, but something like this.
I suspected it could be a deadlock in the system when using get_inverse_kin, is_within_safety_limits, or get_inverse_kin_has_solution in the main program that is doing the calculations while a thread is moving the robot. We use these functions in our calculations and probably the move() commands use them internally as well.
The error occurs on robots in production, so we haven’t been able to diagnose whether the thread is still running, and we have not used the join command either. I’ll try to create a sample program on a test robot to reproduce the error and upload it as soon as possible.

csaba.mucsi · February 26, 2021, 10:11am

UPDATE:
I was able to modify the program on a customer robot to log the joint positions at the moment when such a “timeout” was detected:

 [1.32164,-1.11475,2.12934,-2.55896,-1.57589,-1.03973]
 [1.32235,-1.26394,2.08554,-2.36594,-1.57599,-1.03765]
 [1.31811,-1.26711,2.08393,-2.36111,-1.57618,-1.04171]
 [1.31624,-1.27076,2.0822,-2.35558,-1.5761,-1.04354]
 [1.32139,-1.1144,2.12951,-2.55944,-1.57578,-1.03994] 
 [1.31526,-1.41115,1.98306,-2.11634,-1.57633,-1.04267]
 [1.32198,-1.31644,2.05832,-2.28616,-1.57597,-1.03736]
 [1.31737,-1.21514,2.10431,-2.4334,-1.57614,-1.04298] 
 [1.31874,-1.14916,2.12263,-2.51764,-1.57602,-1.04218]
 [1.31732,-1.21527,2.10437,-2.43339,-1.57615,-1.04297]
 [1.31713,-1.40786,1.98644,-2.12288,-1.57628,-1.04092]
 [1.31842,-1.1149,2.12928,-2.5586,-1.57599,-1.04269]
 [1.31848,-1.11495,2.12929,-2.55864,-1.57602,-1.04266]
 [1.38735,-1.30148,2.05694,-2.29986,-1.57425,-0.9721] 
 [1.31873,-1.14928,2.12258,-2.51774,-1.57602,-1.0422] 
 [1.31838,-1.21209,2.10537,-2.43751,-1.57608,-1.04182]
 [1.32089,-1.20565,2.10748,-2.44627,-1.57601,-1.03966]
 [1.31938,-1.40514,1.98931,-2.12848,-1.57614,-1.0388] 
 [1.31696,-1.4068,1.98753,-2.1249,-1.57633,-1.04099]
 [1.31878,-1.11424,2.12937,-2.55947,-1.57603,-1.04245]
 [1.31685,-1.15163,2.12208,-2.5145,-1.5761,-1.04404]
 [1.31861,-1.11466,2.12937,-2.5589,-1.57603,-1.04257] 
 [1.31864,-1.11461,2.12936,-2.55882,-1.57605,-1.04252]
 [1.31872,-1.11422,2.12935,-2.55947,-1.57598,-1.04243]
 [1.31876,-1.14667,2.12316,-2.52086,-1.57598,-1.04224]
 [1.31876,-1.14741,2.12298,-2.51984,-1.57602,-1.04214]
 [1.31602,-1.41144,1.98258,-2.1154,-1.57634,-1.04185] 
 [1.32009,-1.31647,2.05823,-2.28605,-1.57601,-1.03922]
 [1.31975,-1.36334,2.0263,-2.20709,-1.57614,-1.0389]
 [1.31846,-1.20968,2.10618,-2.44072,-1.57612,-1.04185]
 [1.31882,-1.11427,2.12941,-2.55948,-1.57602,-1.04236]

The joint positions seem to be varying, although the target position is hard-coded on this robot.
I was also able to test the join command, which returned immediately.

I did not find any corresponding entries at the time of the failure in urcontrol.log
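
To put a number on how much the logged joint positions vary, the per-joint spread can be computed offline. The following is only a minimal Python sketch: it uses four of the joint vectors listed above, and the helper name joint_spread is made up for this illustration.

```python
# Four of the joint vectors (radians) copied from the incident list above.
samples = [
    [1.32164, -1.11475, 2.12934, -2.55896, -1.57589, -1.03973],
    [1.31526, -1.41115, 1.98306, -2.11634, -1.57633, -1.04267],
    [1.38735, -1.30148, 2.05694, -2.29986, -1.57425, -0.9721],
    [1.31882, -1.11427, 2.12941, -2.55948, -1.57602, -1.04236],
]


def joint_spread(vectors):
    """Return (max - min) for each joint across all given joint vectors."""
    columns = list(zip(*vectors))  # one tuple of values per joint
    return [max(col) - min(col) for col in columns]


spread = joint_spread(samples)
for index, value in enumerate(spread):
    # Joint indices are zero-based here (0 = first element of the vector).
    print(f"joint {index}: spread {value:.5f} rad")
```

In this small sample the second to fourth values differ by a few tenths of a radian between incidents, while the fifth value stays within roughly 0.002 rad.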

Ebbe · March 5, 2021, 2:17pm

Hi @csaba ,

Is the list of joint angles from one incident, or is it one set from each incident?

@mmi do you have any input on what could cause @csaba’s thread issue?

csaba.mucsi · March 8, 2021, 2:38pm

Hi @Ebbe ,
The above list of joint angles was collected over ca. one week of operation, and each line corresponds to one incident.
Here is the original log, filtered by the message type:

3.5 :: 0037d02h15m42.184s :: 2021-02-22 07:32:26.392 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.32164,-1.11475,2.12934,-2.55896,-1.57589,-1.03973] :: :: null
3.5 :: 0037d02h23m35.296s :: 2021-02-22 07:40:19.493 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.32235,-1.26394,2.08554,-2.36594,-1.57599,-1.03765] :: :: null
3.5 :: 0037d02h45m38.968s :: 2021-02-22 08:02:23.059 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31811,-1.26711,2.08393,-2.36111,-1.57618,-1.04171] :: :: null
3.5 :: 0037d02h46m43.216s :: 2021-02-22 08:03:27.381 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31624,-1.27076,2.0822,-2.35558,-1.5761,-1.04354] :: :: null
3.5 :: 0037d03h01m11.408s :: 2021-02-22 08:17:55.512 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.32139,-1.1144,2.12951,-2.55944,-1.57578,-1.03994] :: :: null
3.5 :: 0037d03h03m55.808s :: 2021-02-22 08:20:39.866 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31526,-1.41115,1.98306,-2.11634,-1.57633,-1.04267] :: :: null
3.5 :: 0037d03h15m00.960s :: 2021-02-22 08:31:45.016 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.32198,-1.31644,2.05832,-2.28616,-1.57597,-1.03736] :: :: null
3.5 :: 0037d04h48m55.360s :: 2021-02-22 10:05:39.111 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31737,-1.21514,2.10431,-2.4334,-1.57614,-1.04298] :: :: null
3.5 :: 0037d05h32m14.144s :: 2021-02-22 10:48:57.727 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31874,-1.14916,2.12263,-2.51764,-1.57602,-1.04218] :: :: null
3.5 :: 0037d05h35m48.064s :: 2021-02-22 10:52:31.650 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31732,-1.21527,2.10437,-2.43339,-1.57615,-1.04297] :: :: null
3.5 :: 0037d05h43m14.976s :: 2021-02-22 10:59:58.560 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31713,-1.40786,1.98644,-2.12288,-1.57628,-1.04092] :: :: null
3.5 :: 0037d06h11m46.848s :: 2021-02-22 11:28:30.259 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31842,-1.1149,2.12928,-2.5586,-1.57599,-1.04269] :: :: null
3.5 :: 0037d06h51m17.792s :: 2021-02-22 12:08:01.099 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31848,-1.11495,2.12929,-2.55864,-1.57602,-1.04266] :: :: null
3.5 :: 0037d09h11m11.712s :: 2021-02-22 14:27:54.529 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.38735,-1.30148,2.05694,-2.29986,-1.57425,-0.9721] :: :: null
3.5 :: 0037d10h54m40.784s :: 2021-02-23 08:21:58.044 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31873,-1.14928,2.12258,-2.51774,-1.57602,-1.0422] :: :: null
3.5 :: 0037d10h56m54.536s :: 2021-02-23 08:24:11.793 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31838,-1.21209,2.10537,-2.43751,-1.57608,-1.04182] :: :: null
3.5 :: 0037d12h33m36.864s :: 2021-02-23 10:00:53.823 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.32089,-1.20565,2.10748,-2.44627,-1.57601,-1.03966] :: :: null
3.5 :: 0037d12h53m48.048s :: 2021-02-23 10:21:04.911 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31938,-1.40514,1.98931,-2.12848,-1.57614,-1.0388] :: :: null
3.5 :: 0037d14h35m07.720s :: 2021-02-23 12:02:24.255 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31696,-1.4068,1.98753,-2.1249,-1.57633,-1.04099] :: :: null
3.5 :: 0037d15h32m17.752s :: 2021-02-23 12:59:34.153 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31878,-1.11424,2.12937,-2.55947,-1.57603,-1.04245] :: :: null
3.5 :: 0037d18h26m26.888s :: 2021-02-24 08:12:38.207 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31685,-1.15163,2.12208,-2.5145,-1.5761,-1.04404] :: :: null
3.5 :: 0037d18h55m15.528s :: 2021-02-24 08:41:26.800 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31861,-1.11466,2.12937,-2.5589,-1.57603,-1.04257] :: :: null
3.5 :: 0037d19h04m36.368s :: 2021-02-24 08:50:47.586 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31864,-1.11461,2.12936,-2.55882,-1.57605,-1.04252] :: :: null
3.5 :: 0037d19h41m04.896s :: 2021-02-24 09:27:15.971 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31872,-1.11422,2.12935,-2.55947,-1.57598,-1.04243] :: :: null
3.5 :: 0037d19h44m29.336s :: 2021-02-24 09:30:40.368 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31876,-1.14667,2.12316,-2.52086,-1.57598,-1.04224] :: :: null
3.5 :: 0037d19h45m05.640s :: 2021-02-24 09:31:16.689 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31876,-1.14741,2.12298,-2.51984,-1.57602,-1.04214] :: :: null
3.5 :: 0037d22h12m09.328s :: 2021-02-24 11:58:19.961 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31602,-1.41144,1.98258,-2.1154,-1.57634,-1.04185] :: :: null
3.5 :: 0038d19h44m54.160s :: 2021-02-25 09:31:00.445 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.32009,-1.31647,2.05823,-2.28605,-1.57601,-1.03922] :: :: null
3.5 :: 0038d23h36m11.952s :: 2021-02-25 13:22:17.554 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31975,-1.36334,2.0263,-2.20709,-1.57614,-1.0389] :: :: null
3.5 :: 0039d00h29m55.160s :: 2021-02-25 14:16:00.549 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31846,-1.20968,2.10618,-2.44072,-1.57612,-1.04185] :: :: null
3.5 :: 0039d01h13m05.576s :: 2021-02-25 14:59:10.794 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31882,-1.11427,2.12941,-2.55948,-1.57602,-1.04236] :: :: null

Sometimes there are only a few minutes between two errors.

The log file also shows that there is no direct relation to the calculations performed by the main thread: sometimes the main thread is idle when the move command starts, and there are no heavy calculations during the movement.
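
For anyone who wants to check the spacing between incidents, the filtered log can be parsed offline. This is a rough Python sketch, not an official log parser: it assumes the fields are separated by "::", that the third field is the wall-clock timestamp and that the eighth field is the message text, which is inferred only from the lines shown above. The function name parse_line is hypothetical.

```python
from datetime import datetime

# Four consecutive lines copied from the filtered log above.
LOG = """\
3.5 :: 0037d02h15m42.184s :: 2021-02-22 07:32:26.392 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.32164,-1.11475,2.12934,-2.55896,-1.57589,-1.03973] :: :: null
3.5 :: 0037d02h23m35.296s :: 2021-02-22 07:40:19.493 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.32235,-1.26394,2.08554,-2.36594,-1.57599,-1.03765] :: :: null
3.5 :: 0037d02h45m38.968s :: 2021-02-22 08:02:23.059 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31811,-1.26711,2.08393,-2.36111,-1.57618,-1.04171] :: :: null
3.5 :: 0037d02h46m43.216s :: 2021-02-22 08:03:27.381 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31624,-1.27076,2.0822,-2.35558,-1.5761,-1.04354] :: :: null
"""


def parse_line(line):
    """Split one log line into (timestamp, joint positions in radians)."""
    fields = [field.strip() for field in line.split("::")]
    # Assumption: field 2 is the wall-clock time, field 7 is the message.
    timestamp = datetime.strptime(fields[2], "%Y-%m-%d %H:%M:%S.%f")
    message = fields[7]
    joints_text = message[message.index("[") + 1 : message.index("]")]
    joints = [float(value) for value in joints_text.split(",")]
    return timestamp, joints


entries = [parse_line(line) for line in LOG.splitlines() if line.strip()]

# Seconds elapsed between each incident and the next one.
gaps = [
    (later[0] - earlier[0]).total_seconds()
    for earlier, later in zip(entries, entries[1:])
]

for (timestamp, _), gap in zip(entries[1:], gaps):
    print(f"{timestamp}  {gap:8.1f} s after the previous incident")
```

In this excerpt the last two incidents are only about 64 seconds apart.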

tle · April 8, 2021, 9:45am

Hi @csaba.mucsi, have you tried updating to the latest release, 3.15.1? It includes some improvements related to URCaps behavior; please take a look at the release notes: Release Notes 3.15
Please let us know. Thank you.

csaba.mucsi · April 9, 2021, 8:15am

Hi @tle, the new Polyscope version 3.15.1 does not solve the issue.
UPDATE
We have localized the error: it seems that a movej does not respond when it is called immediately after a movel (all of this happens in a separate thread), but the thread is not actually crashing.
As a temporary workaround, one can extend the program as shown below:

thread myThread():
  movel(waypointA)
  movel(waypointB)
  ... 
  # stability fix >>
  while (not is_steady()):
    sleep(0.1)
  end
  # << stability fix
  movej(waypointX)
end

mmi · April 9, 2021, 8:48am

We’ve been able to reproduce the issue. I’m happy that you could find a workaround.

csaba.mucsi · April 9, 2021, 11:02am

Hi @mmi thanks for the feedback, it’s good to hear the issue can be reproduced.
