Thread crashing in movej

Hi,
We are experiencing random errors on CB-series robots with Polyscope versions 3.13.x and 3.14.x. The scenario is the following:

  • the main program is performing some relatively heavy calculations of the next waypoints, while another thread is moving the robot
  • the thread writes a log message before and after every movel and movej command (for debugging)
  • sometimes (once or twice a day) there is one movej that starts but never ends (thread crashing?)
  • there is absolutely no error message anywhere
  • the robot movement stops, and the main thread detects a timeout after some time (but no error message from the system)

There are no race conditions and no critical sections either. The structure is clean, but there is still an error somewhere.
Are there any known restrictions on controlling the robot from a thread, or any changes in the latest firmware versions 3.13.x and 3.14.x that may cause this to happen?
Our code structure has not changed in the past 2 years, and suddenly we are getting this error on more than one CB robot installation. We haven’t heard of the same issue happening on E-series.
@Ebbe do you have any suggestions?

Hi @csaba,

I am not aware of an issue like this having been introduced. But if it is related to the timing of controller execution, things might have changed marginally.

  • Is the timeout detected by your own logic or by Polyscope?
  • What kind of synchronization do you have between the different threads? Are any of the other threads controlling the robot movements?
  • Have you checked whether the thread is still running by using join in a thread that is known to be working?

One simple thing to try would be to wrap each of your movements in a critical section.

Hi @Ebbe thanks for the quick reply.
The timeout is only detected by the program logic.
The main program is calculating waypoints, and the other thread is performing the movements. There is no other thread trying to control the robot at that time. The communication between the threads is only done through some variables: the main program sets a variable (move target position) and the thread sets another variable (movement status, integer). It is a little more complex in practice, but this is the general idea.
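
For readers following along, the variable-based handshake described above could look roughly like the sketch below. This is not the original program: the names move_target, move_status, mover and move_with_timeout, the status values, and the 0.1 s polling interval are all assumptions made for illustration.

```urscript
# Hypothetical sketch of the handshake; all names and values are assumptions.
global move_target = get_actual_joint_positions()  # placeholder until the main program sets a target
global move_status = 0                             # 0 = idle, 1 = requested, 2 = moving, 3 = done

thread mover():
  while True:
    if move_status == 1:
      move_status = 2
      textmsg("movej start")   # debug log before the move
      movej(move_target)
      textmsg("movej end")     # debug log after the move
      move_status = 3
    end
    sync()                     # give up the rest of this time slice
  end
end

# Called from the main program: request a move and poll the status variable.
def move_with_timeout(target, timeout_s):
  move_target = target
  move_status = 1
  waited = 0.0
  while move_status != 3 and waited < timeout_s:
    sleep(0.1)
    waited = waited + 0.1
  end
  if move_status != 3:
    # The timeout is detected by program logic only, not by the controller.
    textmsg("Joint positions when timeout detected: ", get_actual_joint_positions())
    return False
  end
  return True
end

thrd = run mover()
```

In a structure like this, a movej that starts but never returns leaves move_status at 2, so the only visible symptom is the timeout reported by the main program.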
I suspected it could be a deadlock in the system when using get_inverse_kin, is_within_safety_limits, or get_inverse_kin_has_solution in the main program that is doing the calculations while a thread is moving the robot. We use these functions in our calculations and probably the move() commands use them internally as well.
The error occurs on robots in production, so we haven’t been able to diagnose whether the thread is still running, and we have not used the join command either. I’ll try to create a sample program on a test robot to reproduce the error and upload it as soon as possible.

UPDATE:
I was able to modify the program on a customer robot to get the joint positions at the moment when such a “timeout” was detected:

 [1.32164,-1.11475,2.12934,-2.55896,-1.57589,-1.03973]
 [1.32235,-1.26394,2.08554,-2.36594,-1.57599,-1.03765]
 [1.31811,-1.26711,2.08393,-2.36111,-1.57618,-1.04171]
 [1.31624,-1.27076,2.0822,-2.35558,-1.5761,-1.04354]
 [1.32139,-1.1144,2.12951,-2.55944,-1.57578,-1.03994] 
 [1.31526,-1.41115,1.98306,-2.11634,-1.57633,-1.04267]
 [1.32198,-1.31644,2.05832,-2.28616,-1.57597,-1.03736]
 [1.31737,-1.21514,2.10431,-2.4334,-1.57614,-1.04298] 
 [1.31874,-1.14916,2.12263,-2.51764,-1.57602,-1.04218]
 [1.31732,-1.21527,2.10437,-2.43339,-1.57615,-1.04297]
 [1.31713,-1.40786,1.98644,-2.12288,-1.57628,-1.04092]
 [1.31842,-1.1149,2.12928,-2.5586,-1.57599,-1.04269]
 [1.31848,-1.11495,2.12929,-2.55864,-1.57602,-1.04266]
 [1.38735,-1.30148,2.05694,-2.29986,-1.57425,-0.9721] 
 [1.31873,-1.14928,2.12258,-2.51774,-1.57602,-1.0422] 
 [1.31838,-1.21209,2.10537,-2.43751,-1.57608,-1.04182]
 [1.32089,-1.20565,2.10748,-2.44627,-1.57601,-1.03966]
 [1.31938,-1.40514,1.98931,-2.12848,-1.57614,-1.0388] 
 [1.31696,-1.4068,1.98753,-2.1249,-1.57633,-1.04099]
 [1.31878,-1.11424,2.12937,-2.55947,-1.57603,-1.04245]
 [1.31685,-1.15163,2.12208,-2.5145,-1.5761,-1.04404]
 [1.31861,-1.11466,2.12937,-2.5589,-1.57603,-1.04257] 
 [1.31864,-1.11461,2.12936,-2.55882,-1.57605,-1.04252]
 [1.31872,-1.11422,2.12935,-2.55947,-1.57598,-1.04243]
 [1.31876,-1.14667,2.12316,-2.52086,-1.57598,-1.04224]
 [1.31876,-1.14741,2.12298,-2.51984,-1.57602,-1.04214]
 [1.31602,-1.41144,1.98258,-2.1154,-1.57634,-1.04185] 
 [1.32009,-1.31647,2.05823,-2.28605,-1.57601,-1.03922]
 [1.31975,-1.36334,2.0263,-2.20709,-1.57614,-1.0389]
 [1.31846,-1.20968,2.10618,-2.44072,-1.57612,-1.04185]
 [1.31882,-1.11427,2.12941,-2.55948,-1.57602,-1.04236]

The joint positions vary from incident to incident, although the target position is hard-coded on this robot.
I was also able to test the join command, which returned immediately.
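
A join check of this kind might look like the sketch below. It is not the code used on the customer robot; move_once, target_q and the log texts are hypothetical names chosen for illustration.

```urscript
# Hypothetical sketch; names and messages are assumptions.
global target_q = get_actual_joint_positions()  # placeholder target

thread move_once():
  textmsg("movej start")
  movej(target_q)
  textmsg("movej end")
end

thrd = run move_once()

# ... later, once the program logic has detected a timeout ...
textmsg("calling join")
join thrd   # blocks until the thread has finished
textmsg("join returned")
```

Since join waits for the thread to finish, an immediate return indicates that the thread is no longer running, while a join that never returns would point to a thread that is still alive and stuck.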

I did not find any corresponding entries in urcontrol.log at the time of the failure.

Hi @csaba ,

Is the list of joint angle sets from one incident, or is it one set from each incident?

@mmi do you have any input on what could cause @csaba’s thread issue?

Hi @Ebbe ,
The above list of joint angles was collected over ca. one week of operation, and each line corresponds to one incident.
Here is the original log, filtered by the message type:

3.5 :: 0037d02h15m42.184s :: 2021-02-22 07:32:26.392 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.32164,-1.11475,2.12934,-2.55896,-1.57589,-1.03973] :: :: null
3.5 :: 0037d02h23m35.296s :: 2021-02-22 07:40:19.493 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.32235,-1.26394,2.08554,-2.36594,-1.57599,-1.03765] :: :: null
3.5 :: 0037d02h45m38.968s :: 2021-02-22 08:02:23.059 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31811,-1.26711,2.08393,-2.36111,-1.57618,-1.04171] :: :: null
3.5 :: 0037d02h46m43.216s :: 2021-02-22 08:03:27.381 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31624,-1.27076,2.0822,-2.35558,-1.5761,-1.04354] :: :: null
3.5 :: 0037d03h01m11.408s :: 2021-02-22 08:17:55.512 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.32139,-1.1144,2.12951,-2.55944,-1.57578,-1.03994] :: :: null
3.5 :: 0037d03h03m55.808s :: 2021-02-22 08:20:39.866 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31526,-1.41115,1.98306,-2.11634,-1.57633,-1.04267] :: :: null
3.5 :: 0037d03h15m00.960s :: 2021-02-22 08:31:45.016 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.32198,-1.31644,2.05832,-2.28616,-1.57597,-1.03736] :: :: null
3.5 :: 0037d04h48m55.360s :: 2021-02-22 10:05:39.111 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31737,-1.21514,2.10431,-2.4334,-1.57614,-1.04298] :: :: null
3.5 :: 0037d05h32m14.144s :: 2021-02-22 10:48:57.727 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31874,-1.14916,2.12263,-2.51764,-1.57602,-1.04218] :: :: null
3.5 :: 0037d05h35m48.064s :: 2021-02-22 10:52:31.650 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31732,-1.21527,2.10437,-2.43339,-1.57615,-1.04297] :: :: null
3.5 :: 0037d05h43m14.976s :: 2021-02-22 10:59:58.560 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31713,-1.40786,1.98644,-2.12288,-1.57628,-1.04092] :: :: null
3.5 :: 0037d06h11m46.848s :: 2021-02-22 11:28:30.259 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31842,-1.1149,2.12928,-2.5586,-1.57599,-1.04269] :: :: null
3.5 :: 0037d06h51m17.792s :: 2021-02-22 12:08:01.099 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31848,-1.11495,2.12929,-2.55864,-1.57602,-1.04266] :: :: null
3.5 :: 0037d09h11m11.712s :: 2021-02-22 14:27:54.529 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.38735,-1.30148,2.05694,-2.29986,-1.57425,-0.9721] :: :: null
3.5 :: 0037d10h54m40.784s :: 2021-02-23 08:21:58.044 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31873,-1.14928,2.12258,-2.51774,-1.57602,-1.0422] :: :: null
3.5 :: 0037d10h56m54.536s :: 2021-02-23 08:24:11.793 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31838,-1.21209,2.10537,-2.43751,-1.57608,-1.04182] :: :: null
3.5 :: 0037d12h33m36.864s :: 2021-02-23 10:00:53.823 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.32089,-1.20565,2.10748,-2.44627,-1.57601,-1.03966] :: :: null
3.5 :: 0037d12h53m48.048s :: 2021-02-23 10:21:04.911 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31938,-1.40514,1.98931,-2.12848,-1.57614,-1.0388] :: :: null
3.5 :: 0037d14h35m07.720s :: 2021-02-23 12:02:24.255 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31696,-1.4068,1.98753,-2.1249,-1.57633,-1.04099] :: :: null
3.5 :: 0037d15h32m17.752s :: 2021-02-23 12:59:34.153 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31878,-1.11424,2.12937,-2.55947,-1.57603,-1.04245] :: :: null
3.5 :: 0037d18h26m26.888s :: 2021-02-24 08:12:38.207 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31685,-1.15163,2.12208,-2.5145,-1.5761,-1.04404] :: :: null
3.5 :: 0037d18h55m15.528s :: 2021-02-24 08:41:26.800 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31861,-1.11466,2.12937,-2.5589,-1.57603,-1.04257] :: :: null
3.5 :: 0037d19h04m36.368s :: 2021-02-24 08:50:47.586 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31864,-1.11461,2.12936,-2.55882,-1.57605,-1.04252] :: :: null
3.5 :: 0037d19h41m04.896s :: 2021-02-24 09:27:15.971 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31872,-1.11422,2.12935,-2.55947,-1.57598,-1.04243] :: :: null
3.5 :: 0037d19h44m29.336s :: 2021-02-24 09:30:40.368 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31876,-1.14667,2.12316,-2.52086,-1.57598,-1.04224] :: :: null
3.5 :: 0037d19h45m05.640s :: 2021-02-24 09:31:16.689 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31876,-1.14741,2.12298,-2.51984,-1.57602,-1.04214] :: :: null
3.5 :: 0037d22h12m09.328s :: 2021-02-24 11:58:19.961 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31602,-1.41144,1.98258,-2.1154,-1.57634,-1.04185] :: :: null
3.5 :: 0038d19h44m54.160s :: 2021-02-25 09:31:00.445 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.32009,-1.31647,2.05823,-2.28605,-1.57601,-1.03922] :: :: null
3.5 :: 0038d23h36m11.952s :: 2021-02-25 13:22:17.554 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31975,-1.36334,2.0263,-2.20709,-1.57614,-1.0389] :: :: null
3.5 :: 0039d00h29m55.160s :: 2021-02-25 14:16:00.549 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31846,-1.20968,2.10618,-2.44072,-1.57612,-1.04185] :: :: null
3.5 :: 0039d01h13m05.576s :: 2021-02-25 14:59:10.794 :: -3 :: C0A0:0 :: null :: 1 :: Joint positions when timeout detected: [1.31882,-1.11427,2.12941,-2.55948,-1.57602,-1.04236] :: :: null

Sometimes there are only a few minutes between two errors.

The log file also shows that there is no direct relation to the calculations performed by the main thread: sometimes the main thread is idle when the move command starts, and there are no heavy calculations during the movement.

Hi @csaba.mucsi, have you tried updating to the latest release, 3.15.1? There are some improvements related to URCaps behavior; please take a look at the release notes: Release Notes 3.15
Please let us know. Thank you.

Hi @tle, the new Polyscope version 3.15.1 does not solve the issue.
UPDATE
We have localized the error: it seems that a movej does not respond when it is called immediately after a movel (all of this happens in a separate thread). The thread itself is not actually crashing.
As a temporary workaround, one can extend the program as shown below:

thread myThread():
  movel(waypointA)
  movel(waypointB)
  ... 
  # stability fix >>
  while (not is_steady()):
    sleep(0.1)
  end
  # << stability fix
  movej(waypointX)
end
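
If the same wait is needed before several movej calls, it could be factored into a helper with an upper bound on the waiting time. This is only a sketch under assumptions: wait_until_steady and its timeout_s parameter are hypothetical and not part of the original workaround.

```urscript
# Hypothetical helper: wait until the robot is at rest, but no longer than timeout_s seconds.
def wait_until_steady(timeout_s):
  waited = 0.0
  while (not is_steady()) and waited < timeout_s:
    sleep(0.1)
    waited = waited + 0.1
  end
  return is_steady()   # True if the robot came to rest within the time limit
end

thread myThread():
  movel(waypointA)
  movel(waypointB)
  if not wait_until_steady(5.0):
    textmsg("robot not steady before movej")
  end
  movej(waypointX)
end
```

The bound keeps the thread from waiting indefinitely if is_steady() never becomes true; the 5.0 s value is arbitrary and would need tuning for a real cell.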

We’ve been able to reproduce the issue. I’m happy that you could find a workaround.

Hi @mmi thanks for the feedback, it’s good to hear the issue can be reproduced.