Email exchange with CERN
Only the important excerpts are included here.
19.05.2011 - Mark Buttner
Hi Bert,
My comments appear in the middle of your message below ...
Regards
Mark
On 05/12/2011 10:21 AM, Besser, Bert wrote:
> Hello Mark,
>
> Jutta and Raphael told me that they had a quick chat with you during
> the visit and about your concerns that our ideas could be overly
> generic. We would very much appreciate your opinion. Perhaps I can call you?
Yes.
>
>
> We also fear that the following points could turn out to be too much to
> start with, and we try to find (intermediate) solutions.
>
> = rules to generate alarms
> How expressive must rules be to cover all alarm circumstances? Are
> and-or-not trees with leaves like bitmask, threshold, etc. too much
> (e.g. "Is magnet offline AND its beamline online?")?
> Strategy:
> Existing rule engines for C++ seem to be far too complex for what
> we need. We should implement a fitting solution ourselves.
I fully agree with that. The rules engine you describe is exactly what we will have in our middle tier. About existing rule engines, I came to the same conclusion as you.
> = avoid multiple acquisition of same data
> e.g. beamline status for a range of magnets all handled on the same
> frontend or device status for different rules
> Strategy:
> Open. An event mechanism to re-evaluate all rules that depend on a
> (changed) datum would definitely be nice but perhaps too much to
> develop ourselves.
Obviously, you should avoid acquiring the same data multiple times. At the frontend-to-server level, we use "proxies": the frontend publishes the data to a proxy, to which clients subscribe. The frontend sees only one client (the proxy). At the server level, the rules are defined on "data tags", which are the software representation of a single acquired value. Rules refer to data tags; when a data tag changes, all rules based on it are re-evaluated. In the solution we are about to implement, rules are re-evaluated "on change", but events are filtered using deadbands and delays (to avoid CPU overload on the server).
> = deployment maintenance of alarm generation processes
> FAIR will have around 2000 frontend computers. Unless the
> generation process' implementation is very reliable, error tracking,
> manual restarts, etc. could cause much overhead.
> Strategy:
> Start small with only a few server-side processes, bugfix and take
> the time to make it stable enough to be deployed on frontends.
From experience, this is a critical point: whatever you plan to change in frontend software, the update will be somewhere between difficult and impossible. On our side, if the LHC is in operation, we simply cannot deploy important changes to our frontend components. I am very much in favor of limiting the frontend activity to data acquisition and placing the "algorithms" on the server side, so that they can evolve even during operation ...
> = verification of correct rule placement in sources
> We do not want one frontend to evaluate rules that acquire data
> from a device on another frontend. There must be server-side
> (config-time and
> run-time) checks that verify that the desired placement of rules in
> source processes is "right".
> Strategy:
> This point is not so urgent if we start with few server side alarm
> generation processes.
... and this is exactly what worried me a bit in your plans: having the rules evaluation on the frontends and also on the server results in a more complex system. Our policy is that either frontend software is in a position to clearly identify and announce an error (without rules having to be configured), or it has to publish the data that allows a clever server to do so. At CERN, I would not go for a "distributed rules" system. In your organization, you perhaps have fewer people involved and more control over your project.
> = access different data sources
> With FAIR there will be "FESA" and "DeviceAccess" frontends. Data
> acquisition from these should be transparent to a source process, i.e.
> through a unified API. Perhaps access to machine state should also be
> hidden under that API. In the future if one wanted also to monitor
> infrastructure like switches this could perhaps also be handled
> through this API. All in all this sounds very much like
JAPC for C++
> and definitely does not belong in the project.
> Strategy:
> Focus on device access frameworks and machine state first, perhaps
> without a unifying API.
As far as I understand the point, we would create a "japc-ext-whatsoever" plugin for our JAPC framework, so that the data acquisition at the middle-tier level is completely unified. We already have that for SNMP, JMX, YAMI, the FESA/CMW/RDA family ... but I do not know enough about your "DeviceAccess" to evaluate whether it is suitable for a JAPC extension.
>
>
> I have also attached the slides that I had prepared for the visit.
> There is an illustration in it that shows how we would like to
> distribute the alarm gen processes as far as possible while respecting
> restrictions like e.g. available resources and access to machine state.
>
> Best regards,
> Bert
>
17.11.2010 - Mark Buttner
Hello,
My comments are inserted in the middle of your message below.
Regards
Mark
Le 16/11/2010 12:23, Bert Besser a écrit :
> Dear Mark,
>
> thank you very much for all the information. We are glad to learn from your experiences. Still we would like to better understand certain aspects of your plans.
>
>
>
> 1) Do you suspect that the computation efforts in the future middle-tier (subscription, normalization, rule application) will need to be distributed? If one process is not enough, how are you going to split the work? Are you free to assign the supervision and alarm generation to server processes per device?
>
We are actually doing scalability tests on the middle tier. Figures will be available by the end of the year. What we already know is that we will split the middle tier into an acquisition layer (device subscription; deadband, on-delay and off-delay filtering) and a "server part" (processing rules and triggering alarms). The question now is to verify that we do not need to split the rule processor into several processes. The first test results indicate that there will be no need for that. Note that we do not foresee doing alarm generation in the acquisition layer. However, one can easily imagine an acquisition process producing a boolean data tag, which is turned into an alarm by the server using a very trivial rule (true = alarm, false = no alarm).
> Would you like to share your immediate thoughts (preferably concerns)
> about the idea of alarm sources that
> - run in their own processes,
> - are manageable from a central point
> - to subscribe to devices
> - and generate alarms for the configured conditions
> - so that several such sources can be deployed
> - as near to frontend computers as possible
> - but only in environments where control over the processes can be ensured (redeploy, etc.)?
>
>
As far as I understand your design proposal, you are missing a central place for your GUIs to subscribe to. Of course, you can decide that a message broker is this place, but then you cannot combine data elements from different devices to produce a single alarm (use case: a given machine mode + a given equipment state like a valve position = alarm).
If you split your alarm generation, you must then ensure that each process subscribes to all devices and external conditions needed to compute such combined alarms.
> As you can see, we are still thinking/talking about how to decentralize alarm generation while avoiding the pitfalls you told us about (of course, a single source is now conceptually no different from central acquisition, and one still has to care about the different platforms to run sources on).
>
>
>
> 2) How do you plan to handle processes that raise alarms but that are non-device processes, i.e. processes that cannot be monitored like a FESA device for example? Do these still need to use a library provided by your team?
>
If some work has to be done on such processes, we will recommend using a standard communication library used at CERN (CMW, YAMI, ...), not a LASER-specific one. In this way, we follow the same scheme as for a device. For existing stuff, we will have a gateway process, which translates the input from our current LASER source lib to the new scheme.
>
>
> 3) How extensive are your plans to raise alarm quality? I read about the wish to do statistical analysis of the monitored data. Do you plan, for example, to detect jitter automatically or on request? Would this be part of L3, so that suitable deadband widths could be suggested to operators, or would this be a separate system?
>
>
I do not know yet to what extent we will do things automatically. Doing data analysis is a possibility, but the starting point will be based on standard alarm metrics: which alarms come too often? Which alarms stay active for a long time without blocking the system (an indicator of a false alarm definition)? Alarms matching such criteria should be investigated and configured differently (by an operator, automatically, with or without suggestions ... we will decide this later).
> Again, thanks in advance!
> Regards,
> Bert
02.11.2010 - Mark Buttner
Hello,
Hereafter some information related to your questions ...
1. Java Enterprise
=============
Apart from my personal opinion (which doesn't matter so much), the main reason is to comply with the technical standards of our group. LASER is more or less the last product in CERN's accelerator controls infrastructure using EJB and OC4J. More than enterprise Java, the OC4J application server is a bad choice (migration from one version to the next is difficult, and the future of the software is unclear). But to be honest, for the technical aspects, we simply decided not to invest time and just take the accepted in-house standards.
Your second question is simple: the current LASER is maintained, but no longer developed. We accept implementing improvements in the console, but not much more. The development effort will go directly into DIAMON/LASER (internally called "L3").
2. Re-design
=========
More than just the problem of data acquisition, experience in our environment has shown that too many people are involved. Even if we improve our libraries, it would be very difficult to have all information providers migrate to them and use their features. Also very important: alarm sources often run on critical computers and are often embedded into critical software. Updates need prior notice and are sometimes simply impossible for a very long time. In addition, alarm definitions do change (thresholds, delays, priorities, external conditions), while the underlying data does not, or changes less often (the alarm is still based on a temperature, voltage or equipment state).
Where DIAMON comes in: Considering all this, the best approach seems to be to monitor everything in a central place and generate the alarms from the monitored data. In this view, a temperature or voltage is a "real" data element (available for monitoring), and an alarm becomes a kind of "derived" data element (derived from, for instance, a temperature combined with a machine state).
To summarize: many places, very different equipment, alarm source software developed by many people ... we believe that it is difficult to push all actors to provide a good level of functionality. Taking the data is simpler, and it is needed anyway for monitoring. Having the business logic in a central place will allow a single piece of software to be used for both monitoring and alarms, and at the same time allow alarm processing to be improved without updating critical software components on critical computers.
3. Usability
=========
a. Alarm quality: LASER is for the moment very poor in terms of alarm quality measurement. But we know that we have many alarms defined in our database which were never ever activated, not even for a test. We have many alarms staying active for long periods (let's say more than one week), and CERN survives without anybody doing something about these alarms. In certain areas, we have more alarms than an operator can reasonably read ... all these elements are indicators of insufficient alarm quality management. The new system has to provide tools to track such alarms.
b. Declare alarms before use: This is (at least in my mind) absolutely essential. When an alarm is activated, someone has to do something, otherwise things will go wrong (if they don't, it is NOT a real alarm). How can you ensure that the operator on shift knows how to react if your alarm definition was not declared, validated and approved prior to the first real event in production?
So, I think that was a valid choice. However, LASER is missing some facilities to make the work of declaring/checking/approving easier. Because of weaknesses in this area, we had to implement a special category of alarms, which are created "on the fly" (when a certain alarm source sends something, we trust it and add a corresponding definition to our database). But that is not a good idea at all.
c. Impact of outdated alarm definitions: This is not that critical. It's like having software with lots of lines commented out, blocks of code that are never used, business rules that are no longer valid ... it slows down the system and makes investigations difficult. It's just an unclean situation.
d. (Non-)intrusive approach to alarm sources: If you ask people to compile your library into realtime software, you take the risk that the realtime task (the one driving some equipment, or reacting to events) is blocked or killed by your component. Even if there is no incident, you will use the memory they need, load their CPU, and you will have bugs. Whatever happens, you will be suspected as soon as a problem occurs ... there are many good reasons to clearly separate the control tasks from their monitoring.
4. Other
======
a. LASER development process: Maintaining the APIs is not a problem, but deploying the libraries is. Another problem is that database requests from the consoles are processed by the same application server instance as the incoming alarms. The result (to keep the answer short): database querying could kill the alarm process. We never had an incident for which this was the proven cause, but: an additional field needed in the result of a search query requires redeploying the critical process that handles the incoming alarms. And ...
b. ... in LASER, it is impossible to lock out sources: It happened more than once that a developer "forgot" that, for testing, the output of an alarm source should be sent to the JMS brokers of the test environment, not the production environment. Once, we received test alarms from one source at the rate of one backup message with 1'500 alarms per millisecond. The production server crashed. That's why in L3 we really want to go for subscription, so that we can decide about communication.
Hopefully, my answers will help you a bit in your decision-making process. Do not hesitate if you have other questions or need more details.
Note also that we have again a public website (was missing since our wikis are behind the firewall):
http://cern.ch/lhc-alarms-service
For the moment there is not much more than what you already have, but other documents will be published there when available.
Regards
Mark
Bert Besser wrote:
>
> Dear Mark,
>
> We are answering a bit late because of vacation- and illness-related delays.
>
> We have now gathered (quite some) questions, mainly on three topics, and would really appreciate your answers / opinion on them. Permit us to send you a list of them.
>
>
> Regarding Java Enterprise:
> What are your main reasons for turning away from J2EE? Would EJB 3 have been an option, or do you in general not favor the need for an application server?
>
> Are you going for a next (intermediate) version of LASER that introduces Spring, ActiveMQ, etc. while staying an independent product, or are you immediately developing the new LASER/DIAMON without any further developments on LASER?
>
>
> Regarding the redesign:
> Did you encounter circumstances that (flat out) could not be covered with the current LASER, i.e. where the data acquisition approach is really needed? Or is it the case that you want to redesign because the profits you expect to gain are large?
>
> To what extent is the unification with DIAMON a reason for the redesign, i.e. would you have considered a (similar) redesign even if no such step had been necessary?
>
> Would it have been an option to extend the alarm source library to be manageable, i.e. place alarm detection methods there and control them from the core service? Or do the circumstances at your site not allow for this setup (e.g. multitude of environments for alarm sources, weak frontend hardware)?
>
>
> Regarding LASER's usability:
> In what terms do you measure low alarm quality and alarm inconsistency? To what extent did both imply an overly complicated alarm definition process?
>
> In retrospect, do you consider LASER's approach of fixed alarms - and the implied need for an alarm definition process - a wrong choice?
>
> Did you ever encounter the need to connect new devices to the alarm system in an ad-hoc fashion by making them able to send custom alarms?
>
> How large was the impact of outdated alarm definitions / a cluttered config database on the usability of the LASER configuration services?
>
> What do the slides mean by 'intrusive' alarm sources? Were the device developers upset by having to program against another API?
>
>
> Regarding other topics:
> What are the main drawbacks of LASER's design regarding development (e.g. the need to maintain alarm source APIs instead of relying only on JAPC)?
>
> What are situations in which you would want to shut off LASER from incoming messages (from specific devices)?
>
>
> Thank you very much for your time and effort,
>
> Bert
Hello,
Please find attached various documents about our LASER alarm software. I recommend that you start with LaserOverview and LaserIntro, then read the ICALEPCS papers from 2003, 2005 and 2007. You should finish with the TC* documents: these are two presentations given to our technical committee in May 2010. One explains some good reasons and options for a major change (or even rewrite) of the system ("plans"); the other is more technically oriented and proposes a design as well as a way to move there.
To summarize things a bit:
- we have a solution that is reliable and suitable (especially if you do not plan to use a huge number of alarms)
- technically, it depends on Oracle and SonicMQ JMS brokers. The server process runs in an OC4J container; the GUI is a Java Swing based application. You can build (or interface existing ...) alarm sources in either Java or C/C++. Note: The "alarm sources" are processes running somewhere, doing data acquisition, checking if there is an error and sending the alarm to the LASER server. Clients (i.e. applications reading incoming alarm events) can be built in Java only.
- the main weaknesses are:
(1) no control over the data underlying the alarms; you have to trust the alarm sources
(2) it is not a packaged product, one where you could take the CD, install it and play with it.
The last point is certainly an issue for your prototyping stage. To install the system somewhere, we would first need a bit of preparation on our side, and then organize either some training for you at our site, or a visit by one of our team members to your site.
I suggest that you go through the documentation and then contact me again for further discussions.
Regards
Mark
Bert Besser wrote:
> Dear Mr. Buttner,
>
> for the future FAIR operations at GSI there is a need for a new alarming solution. Our investigations of how to fulfill this requirement include the evaluation of existing systems.
>
> Having heard about the successful use of LASER at CERN and its usage also in other institutions we would be interested in having a closer look at it and evaluating its applicability here, if this is possible.
>
> For this reason we would like to learn about LASER's features as well as its dependencies, technology it uses, its runtime requirements, etc. We think that the best way to learn is to set up a running (prototype) installation and then investigate deeper the question if LASER could be the right point to start from.
>
> Therefore we are interested in the current stable release and in hints and/or documentation that would help us get LASER running in a test setup. Are there current developments worth delaying our efforts for, so that we could evaluate a coming version?
>
> Thank you very much in advance for your help, Bert Besser
>
>
Attachments