BSS behavior when MASP disables Storage Ring Mode context / API between LSA and BSS for updating Patterns and PatternGroups

Date: 2019-11-01

Participants: Hanno Hüther, Stefan Krepp, Raphael Müller, Andreas Schaller, Anneke Walter

Minutes: Anneke Walter

1 BSS behavior when MASP disables Storage Ring Mode context

1.1 Current behavior

If the MASP reports the status of the Chain in a Storage Ring Mode Pattern to be not-ok, BSS currently deactivates any breakpoints internally (signal states are unchanged), and otherwise leaves the scheduling untouched. When the Pattern eventually starts over (after MASP reports the status of the Chain to be ok again), enabled breakpoints no longer have any effect (which is a bug).

1.2 Expected behavior

Anne talked to Markus Steck about the expected/previous behavior of cycles at ESR when devices had interlocks. Markus Steck reported that device interlocks did not cause a running cycle to be changed, e.g. a paused cycle would remain paused, configured repetitions for a virtual machine would repeat as configured, etc. There are two cases: either the affected device is not needed for the remaining cycle (e.g. an injection device goes to an interlock state after the beam has already been injected), in which case the remainder of the cycle should be able to finish successfully; or the device is critical for the success of the running cycle, in which case the beam would probably be lost and the ESR operators would decide how to handle that (e.g. stop the cycle while skipping as many virtual machines as possible).

For the new Storage Ring Mode, running Storage Ring Mode Patterns should likewise not be affected when MASP notifies the BSS that a Chain should not be executed (anymore): If breakpoints/manipulations are active, the Pattern should pause there until the operators decide to disable the breakpoint(s)/manipulation(s). Subchain repetitions and skipping for Subchains should be performed as configured. In general, the BSS should treat Storage Ring Mode Patterns like any other Pattern: It should abort any remaining Pattern repetitions (if any are configured), and should prevent the Pattern from starting over once it reaches its end if the MASP status is still not ok.

It was discussed whether aborting Pattern repetitions should be avoided when the MASP status changes back to ok before the end of the Pattern is reached. While that behavior could be nice, there is no easy way to implement it, so we will keep the current behavior (aborting Pattern repetitions) unless/until the users request a different behavior.

1.3 Decisions

TODO Stefan will change the behavior of the BSS to treat Storage Ring Mode Patterns like any other Pattern: A running execution of a Pattern is not affected by the BSS at all, but any configured remaining Pattern repetitions will be aborted and the Pattern will not start over until the MASP status is ok again.
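
The decided behavior can be illustrated with a small state holder. This is only a sketch under assumptions, not the BSS implementation: the class PatternRun and all of its method names are hypothetical and were chosen for this example.

```java
/** Hypothetical sketch of the decided MASP handling; not the actual BSS code. */
final class PatternRun {
    private boolean maspOk = true;
    private int remainingRepetitions;

    PatternRun(int configuredRepetitions) {
        this.remainingRepetitions = configuredRepetitions;
    }

    /** Called when MASP reports a new status for the Pattern's Chain. */
    void onMaspStatus(boolean ok) {
        maspOk = ok;
        if (!ok) {
            // The running execution itself is left untouched (breakpoints,
            // Subchain repetitions and skipping behave as configured).
            // Only the remaining Pattern repetitions are aborted, and they
            // stay aborted even if the status returns to ok later.
            remainingRepetitions = 0;
        }
    }

    /** True if another configured repetition should follow the current execution. */
    boolean shouldRepeat() {
        if (maspOk && remainingRepetitions > 0) {
            remainingRepetitions--;
            return true;
        }
        return false;
    }

    /** True if the Pattern may start over once it has reached its end. */
    boolean mayStartOver() {
        return maspOk;
    }
}
```

In this sketch, a not-ok status followed by an ok status leaves the Pattern eligible to start over, but without its previously configured repetitions, which matches the behavior kept in section 1.2.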

The fix shall be rolled out on PRO for the engineering run (2019-11-12), and we shall test it on INT before rolling it out to PRO.

2 API between LSA and BSS for updating Patterns and Pattern Groups

We discussed once more the API for updating the BSS regarding Pattern schedules and Pattern Groups to make sure that we are all on the same page, and the Storage Ring Mode Working Group (SRM WG) realized that they could not specify timeout durations and would prefer to combine the add/removeInternalNodes operations into a single atomic operation.

2.1 Current and old proposed API

2.1.1 Current API

  • addToGlobalSchedule(String dotSchedule)
  • removePatternInternalNodes(String patternName)
  • waitUntilSafeToRemove(Collection<Pattern> patterns, long timeoutInMillis)

2.1.2 Old proposed API

  • addPatternInternalNodes(String dotSchedule)
  • setPatternGroups(Set<PatternGroup> patternGroups, Map<String,String> schedulesForNewPatterns)
    • the Map contains the names of Patterns that were not previously part of any PatternGroup as keys, and their respective dot schedules as values. It may be empty if the Pattern Group update does not include new Patterns (e.g. deletion of a Pattern Group, or Pattern order change within an existing Pattern Group).
  • waitUntilSafeToRemove(Collection<Pattern> patterns, long timeoutInMillis)
  • removal of addToGlobalSchedule

2.3 Discussion

2.3.1 Various simple points

  • Naming improvement: "updatePatternGroups" instead of "setPatternGroups"
  • SRM WG wants to combine waitUntilSafeToRemove + removePatternInternalNodes + addPatternInternalNodes into a single atomic operation => "updatePatternSchedule"
  • SRM WG wants to specify timeout durations for waitUntilSafeToRemove calls performed internally by BSS in "updatePatternSchedule" and "updatePatternGroups"
  • Simplification: waitUntilSafeToRemove shall receive a Set of Pattern names instead of a Collection of Pattern objects, as the BSS only needs to know the pattern names.
  • waitUntilSafeToRemove will remain part of the API, because when a coupled SIS18 Pattern is trimmed, LSA disables the coupled ESR Pattern, waits until the ESR Pattern is safe-to-remove, and only then disables the SIS18 Pattern. We do it like that because the SIS18 Pattern will only become safe-to-remove if the ESR Pattern is safe-to-remove. If the SIS18 Pattern were disabled first, the ESR Pattern might wait for a long time (timeout) for beam transfer from SIS18, which in turn would prolong the time the SIS18 Pattern needs to wait until it can be trimmed.
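
The ordering described for coupled Patterns can be sketched as follows. The interface PatternControl, the method disablePattern, and the boolean return value of waitUntilSafeToRemove are assumptions made for this illustration; the minutes only fix the name and parameters of waitUntilSafeToRemove.

```java
import java.util.Set;

/** Hypothetical stand-in for the calls LSA uses; only waitUntilSafeToRemove is named in the minutes. */
interface PatternControl {
    /** Assumed helper: marks a Pattern as disabled-by-LSA. */
    void disablePattern(String patternName);

    /** Assumed to return true if all Patterns became safe-to-remove within the timeout. */
    boolean waitUntilSafeToRemove(Set<String> patternNames, long timeoutInMillis);
}

final class CoupledTrim {
    /**
     * Disables the coupled ESR Pattern first, waits until it is safe-to-remove,
     * and only then disables the SIS18 Pattern.
     */
    static boolean prepareSis18Trim(PatternControl control, String sis18Pattern,
                                    String esrPattern, long timeoutInMillis) {
        control.disablePattern(esrPattern);
        if (!control.waitUntilSafeToRemove(Set.of(esrPattern), timeoutInMillis)) {
            return false; // ESR Pattern not yet safe-to-remove: leave SIS18 untouched
        }
        control.disablePattern(sis18Pattern);
        return control.waitUntilSafeToRemove(Set.of(sis18Pattern), timeoutInMillis);
    }
}
```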

2.3.2 How to handle errors when adding internal nodes of Patterns?

  • Right now, if BSS successfully removed a Pattern's internal nodes but, for whatever reason, could not add the internal nodes provided by LSA again, only the Pattern entry and exit nodes remain in the graph. They are unconnected and we would crash the Generator/Data Master if the Pattern is allowed to run again.
    • Note: If the internal nodes for Pattern A could not be added, LSA would leave Pattern A disabled-by-LSA, so it should not run. However, if a user made a change to the Pattern Group that contains Pattern A, its disabled-by-LSA might be reset, so A would become eligible for running again.
  • A simple solution could be that, if adding a Pattern's internal nodes fails, the BSS creates a dummy node between Pattern entry and exit before throwing an exception: Entry -> DummyNode -> Exit. If the Pattern starts running again, it would effectively do nothing, but it also would not crash the Generator/Data Master. Users can try to rectify the situation by performing e.g. another drive, which would again remove the internal nodes (which in this case would be only the dummy node) and add the correct internal nodes.
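
A minimal sketch of the dummy-node fallback, assuming a heavily simplified graph model. The class PatternGraph, its edge representation, and the rule that a blank schedule fails are all hypothetical and only serve to show the control flow (remove, try to add, fall back, rethrow).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of one Pattern's part of the schedule graph. */
final class PatternGraph {
    /** Edges stored as "from -> to"; the entry and exit nodes are assumed to always exist. */
    final List<String> edges = new ArrayList<>();

    void removeInternalNodes() {
        edges.clear();
    }

    /** Stand-in for adding a dot schedule; here a blank schedule simulates a failure. */
    void addInternalNodes(String dotSchedule) {
        if (dotSchedule == null || dotSchedule.trim().isEmpty()) {
            throw new IllegalArgumentException("invalid dot schedule");
        }
        edges.add("Entry -> " + dotSchedule);
        edges.add(dotSchedule + " -> Exit");
    }

    /** Replaces the internal nodes; on failure, connects entry and exit via a dummy node and rethrows. */
    void replaceInternalNodes(String dotSchedule) {
        removeInternalNodes();
        try {
            addInternalNodes(dotSchedule);
        } catch (RuntimeException e) {
            edges.clear(); // drop anything that was only partially added
            edges.add("Entry -> DummyNode");
            edges.add("DummyNode -> Exit");
            throw e; // the caller still learns about the failure
        }
    }
}
```

A later successful call to replaceInternalNodes removes the dummy node again, which corresponds to users rectifying the situation with another drive.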

2.3.3 Method signature of updatePatternGroups

The method signature of updatePatternGroups is not very nice/easily understandable, but it was chosen for the following reasons:
  • To update the Pattern Groups as an atomic operation, the BSS needs at least the following data (possibly incomplete):
    • For the Pattern Groups:
      • Pattern Group name (for logging)
      • All Pattern (names) of the group in order
      • Information about all Pattern Groups is needed to handle the removal/deletion of Pattern Groups.
    • For the Patterns:
      • Names
      • Execution Conditions
      • Contained Chains or Chain Indices (for being able to map MASP status information for chains to the correct Pattern)
      • Dot schedule for Patterns that were not part of a Pattern Group before (Dot schedules for other Patterns are already known to BSS)
  • Right now, the Pattern Group objects contain the Pattern objects, which provide all necessary information except for the Dot schedules. The Dot schedule information for new Patterns is therefore provided in an additional Map.
  • It would be possible to create a specific Data Transfer Object (DTO) that describes exactly the information needed, without including additional unnecessary data. The DTO could also perform some consistency checks (e.g. Dot schedules may only be provided for Patterns that are part of a specified Pattern Group).
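
As an illustration of the DTO idea, the sketch below carries only Pattern Group names, the ordered Pattern names, and the Dot schedules for new Patterns, and performs the consistency check mentioned above in its constructor. The class name PatternGroupUpdate and its fields are hypothetical; Execution Conditions and Chain indices are left out to keep the example short.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical DTO sketch; names and fields are assumptions, not an agreed design. */
final class PatternGroupUpdate {
    /** Pattern Group name -> Pattern names of that group, in order. */
    final Map<String, List<String>> patternNamesByGroup;
    /** Pattern name -> Dot schedule, only for Patterns that are new to the BSS. */
    final Map<String, String> dotSchedulesForNewPatterns;

    PatternGroupUpdate(Map<String, List<String>> patternNamesByGroup,
                       Map<String, String> dotSchedulesForNewPatterns) {
        Set<String> groupedPatterns = new HashSet<>();
        patternNamesByGroup.values().forEach(groupedPatterns::addAll);
        // Consistency check: Dot schedules may only be provided for Patterns
        // that are part of one of the specified Pattern Groups.
        for (String patternName : dotSchedulesForNewPatterns.keySet()) {
            if (!groupedPatterns.contains(patternName)) {
                throw new IllegalArgumentException(
                        "Dot schedule given for Pattern outside any Pattern Group: " + patternName);
            }
        }
        this.patternNamesByGroup = Map.copyOf(patternNamesByGroup);
        this.dotSchedulesForNewPatterns = Map.copyOf(dotSchedulesForNewPatterns);
    }
}
```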

2.4 Decisions

2.4.1 Agreed-upon API

  • updatePatternSchedule(String patternName, String newDotSchedule, long timeoutForWaitUntilSafeToRemoveInMillis)
    • combines functionalities waitUntilSafeToRemove + removePatternInternalNodes + addPatternInternalNodes
    • In case of an error when trying to add the Pattern's internal nodes (new Dot schedule), BSS shall create a dummy node between Pattern Entry and Pattern Exit before throwing an exception (for details see section 2.3.2).

  • updatePatternGroups(Set<PatternGroup> patternGroups, Map<String,String> schedulesForNewPatterns, long timeoutForWaitUntilSafeToRemoveInMillis)
    • the Map contains the names of Patterns that were not previously part of any Pattern Group as keys, and their respective dot schedules as values. It may be empty if the Pattern Group update does not include new Patterns (e.g. deletion of a Pattern Group, or Pattern order change within an existing Pattern Group).
    • the implementation of this method shall have sanity checking for the Map contents: The Map may contain only Patterns that are now part of a Pattern Group, but were not known to the BSS before
    • the definition of this method shall have Javadoc explaining what the arguments mean/what they are used for
    • We will start with the signature as defined here, and may move to a more specific DTO later (clean-up).

  • waitUntilSafeToRemove(Set<String> patternNames, long timeoutInMillis)

  • removal of addToGlobalSchedule and removePatternInternalNodes

TODO Stefan shall implement the API as defined above. He shall also decide whether he prefers to provide the new methods in parallel with the old methods (so the API would provide both old (possibly deprecated) and new methods, and LSA would call the old API on PRO and the new API on master/INT), or whether he wants to switch directly to only the new API (so he would have to support the old API on PRO, and the new one on master/INT).
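
For orientation, the agreed-upon methods of section 2.4.1 could be written down as a Java interface roughly as follows. Only the method names and parameter lists come from these minutes. The interface name BssScheduleApi, the return types, the placeholder PatternGroup type, and the helper NewPatternScheduleCheck (which sketches the required sanity check for the Map contents) are assumptions.

```java
import java.util.Map;
import java.util.Set;

/** Placeholder for the existing PatternGroup type; its real definition is not part of these minutes. */
interface PatternGroup { }

/** Sketch of the agreed-upon API; interface name and return types are assumptions. */
interface BssScheduleApi {
    /**
     * Waits until the Pattern is safe-to-remove, removes its internal nodes and adds the
     * nodes from the new Dot schedule. If adding fails, a dummy node is created between
     * Pattern Entry and Pattern Exit before the exception is thrown (see section 2.3.2).
     */
    void updatePatternSchedule(String patternName, String newDotSchedule,
                               long timeoutForWaitUntilSafeToRemoveInMillis);

    /**
     * @param patternGroups           all Pattern Groups after the update
     * @param schedulesForNewPatterns Pattern name -> Dot schedule, only for Patterns that were
     *                                not previously part of any Pattern Group; may be empty
     */
    void updatePatternGroups(Set<PatternGroup> patternGroups,
                             Map<String, String> schedulesForNewPatterns,
                             long timeoutForWaitUntilSafeToRemoveInMillis);

    /** Assumed to return true if all named Patterns became safe-to-remove within the timeout. */
    boolean waitUntilSafeToRemove(Set<String> patternNames, long timeoutInMillis);
}

/** Hypothetical helper for the sanity check of the Map passed to updatePatternGroups. */
final class NewPatternScheduleCheck {
    static void check(Map<String, String> schedulesForNewPatterns,
                      Set<String> patternsInGroups, Set<String> patternsKnownToBss) {
        for (String name : schedulesForNewPatterns.keySet()) {
            if (!patternsInGroups.contains(name)) {
                throw new IllegalArgumentException("Pattern is not part of any Pattern Group: " + name);
            }
            if (patternsKnownToBss.contains(name)) {
                throw new IllegalArgumentException("Pattern is already known to the BSS: " + name);
            }
        }
    }
}
```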

2.4.2 Time Frame

The API update shall not be rolled out for the engineering run (2019-11-12), and probably also not for the beam time at the start of 2020 (if the feature is ready then, we can reconsider it, however we probably want to go into the beam time with the stable and well-tested old version of the API). We shall test the new API on INT before rolling it out to PRO.

TODO For the engineering run (2019-11-12), Stefan shall add the error handling for failed additions of a Pattern's internal nodes to the existing API method addToGlobalSchedule: In case of an error when trying to add the Pattern's internal nodes (new Dot schedule), BSS shall create a dummy node between Pattern Entry and Pattern Exit before throwing an exception (for details see section 2.3.2).

Amendment

https://www-acc.gsi.de/wiki/Service/Intern/BssMinutes-2020-02-24

-- AnnekeWalter - 01 Nov 2019
Topic revision: r5 - 25 Feb 2020, StefanKrepp