April 2021: "Hanging" subscriptions in Device Control

Documentation and notes about a bug we found and fixed in Device Control.

Bug Presentation

The bug presented itself as many devices being displayed as "not connected" (read, plug symbol): They never received a response from their respective subscriptions, not even an initial update.

It reproduceably affected the same devices, e.g. YR09VV1T was always "not connected", while YR09VV2T was always displaying its proper status (green).

In other applications (e.g. FESA Explorer) or test snippets, all the devices could be reached via subscription and delivered a proper status.

The problem appeared suddenly on Wednesday evening (April 21) with Device Control 16.0.22.

Documented in OLOG 327239 .

Initial Analysis and Hotfix

  • Christian and Anne analyzed the problem on Friday afternoon, as Sigi was not available.
  • They realized that the affected devices never received a response from their respective subscriptions, not even an initial update.
  • The problem could be reproduced with older versions of Device Control, e.g. 16.0.10, which was the first version to support the VV devices.
  • The subscriptions were setup with a PULL period, which means that JAPC performs a GET periodically.
  • Debugging and extra logging showed that the GETs for the affected devices were never performed.
  • We assumed that for some reason, the PullingMonitoringTasks that should run on JapcExecutor.getScheduledExecutor() had not been executed because the executor had been terminated, it's threads had crashed, or something similar.
  • Hotfix: With the system property JapcExecutor.SYSPROP_SCHEDULED_THREADS="japc.scheduled.executor.threads", we could raise the amount of available threads in the executor from its default of 2. With 4 threads, more devices than before showed the proper status, but not yet all. With 8 and 16 threads, all devices displayed the proper status. We therefore released a hotfix in which the amount of threads was set to 8.

Follow-up Analysis and Solution

  • Anne and Ben analyzed the problem some more in the following week.
  • We noticed that from the 8 threads configured with the system property, only 3 were actively processing data. The other 5 seemed to be "stuck" on a GET call. Even when the program was restarted, the threads always got stuck on the same 5 devices: YR09DSAVO, YR09DSBHA, YR09DSBVO, YR09DSAHA, YR09DD1IS. They all belonged to the same Server MotorSlit_DU.yr09ci01, same FEC yr09ci01, same SCU (190 ?) .
  • For those devices, single asynchronous GET calls were performed to their properties Configuration and NamedPositions .
  • Since Wednesday evening, the SCU the 5 devices belonged to was not reachable. The async GET calls therefore returned errors.
  • We discovered a bug in an overridden exceptionOccurred method that led to an infinite loop of method calls:
    • DevCtrlSlitJapc.exceptionOccured called super.exceptionOccured, which is AbstractJAPCDevice.exceptionOccured
    • AbstractJAPCDevice.exceptionOccured called AbstractJAPCDevice.exceptionOccuredInternal
    • AbstractJAPCDevice.exceptionOccuredInternal was overridden by AbstractDevCtrlCommonDrivesJapc.exceptionOccuredInternal
    • AbstractDevCtrlCommonDrivesJapc.exceptionOccuredInternal erroneously called super.exceptionOccured, leading back to DevCtrlSlitJapc.exceptionOccured
  • Solution: The bug could easily be fixed by making AbstractDevCtrlCommonDrivesJapc.exceptionOccuredInternal call super.exceptionOccuredInternal instead of super.exceptionOccured . With this fix, all subscriptions started working normally again, and all devices reported their proper status.

  • Afterwards, Anne and Vitaliy investigated why the infinite call loop in the async GET handler for a property of device caused subscriptions (both normal subscriptions as well as PULL subscriptions) for other properties of the same device to hang.
    • The infinite call loop caused a StackOverflowError.
    • JAPC catches and logs Exception in value handlers and does not re-throw them, e.g. in ValueListenerHandler#onDataReceived and TaskSchedulerImpl$TaskExecutorDecorator#execute . Therefore, normal exceptions that might occur in value handlers, e.g. NullPointerException, are properly handled.
    • JAPC does not catch Error/Throwable. This is the recommended approach for Java programs and not a bug in JAPC. However, that means that such errors (e.g. OutOfMemoryError and the StackOverflowError which occured in this bug) are not logged, and they fall through all JAPC try-catch handlers until they eventuall reach the JAPC connection to the device.
    • As both JAPC and CMW contain optimizations to only have one true connection to a device, the unhandled exception reaching the connection probably broke that connection permanently (e.g. by crashing a thread that processed the requests and updates for that connection), meaning that all further interactions with the affected device could not be handled and got stuck.

  • This means that the hotfix only seemed to work because only 5 devices were affected - a thread count of 6 or more threads would have been sufficient. However, if more devices were affected, we would have had to increase the thread count even more. Also, the 5 affected devices were still broken even with the hotfix, because the underlying issue had not yet been fixed.

Summary / Things to Remember stargold

  • Java Errors (e.g. OutOfMemoryError and StackOverflowError) that occur in value handlers for GET/SET actions or subscriptions are not caught or logged by JAPC. When they occur on a property of a device D, they can permanently break the connection to that device D. This will cause all further interactions with any property of the same device D to become "stuck"/not return any values,
    • => If you investigate problems with "hanging" GET/SET calls or subscriptions, you might want to look into earlier interactions that occurred with the same device, even if those interactions are on a different property than you think you are investigating. Did those calls/subscriptions return/behave as expected? If not, try to find and fix the cause. It might solve the problem you were looking at as well.

-- AnnekeWalter - 12 May 2021
Topic revision: r1 - 12 May 2021, AnnekeWalter
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback