Reading the DM status

To get an overview of what's running on a DM, you can get the basic status by issuing

[root@tsl017 ~]# dm-cmd dev/wbm0

or

[root@tsl017 ~]# dm-cmd dev/wbm0 -v

Verbose mode is very similar to the normal status output, except that it shows the activity of each CPU split by worker threads. We will only cover the normal status report here. The result will look something like the following output. Let's go over each point in turn. First, there is the free space per CPU RAM. If usage on any core is close to 100% (free space near zero), this is a likely reason if LSA is seeing error messages when uploading a schedule. The solution is to move schedules to another core or to free up unused schedules.
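
As a sketch of the second option (the .dot file name is hypothetical, and the exact dm-sched syntax should be verified against its --help), an unused subschedule can be removed with dm-sched:

[root@tsl017 ~]# dm-sched dev/wbm0 remove unused_subschedule.dot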

Next is a list of safe2remove covenants. These are agreements made when requesting the deletion of part of a schedule during runtime. The DM needs to ensure that you cannot delete a schedule that is currently being executed, and does that by tracing all possible paths that would lead into territory marked for deletion. If such a path can be found, the given subgraph cannot be safely deleted at the moment. However, it is possible to change the path by issuing corresponding commands so it no longer leads into that area. To speed up the process, these commands do not even need to be executed for a path change to be recognised, they just need to be issued. This generates a safe2remove covenant: the DM will allow you to delete the subgraph, under the condition that you do not change or override the command causing the necessary path change. This covenant is stored until the command is executed. The DM will enforce active covenants, which can lead to locking of command queues.
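
A hedged sketch of how such a covenant arises (pattern and file names are hypothetical, and the exact subcommand syntax should be checked against dm-cmd --help): first issue the command that redirects the flow away from the doomed subgraph, then request its removal:

[root@tsl017 ~]# dm-cmd dev/wbm0 flowpattern SIS18_OLD SIS18_NEW
[root@tsl017 ~]# dm-sched dev/wbm0 remove sis18_old.dot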

Last is the activity monitor. It is headed by the device name and the current ECA time. Then, for each CPU, it shows whether it is currently running and what the current message count is (this should increase each time you check on a running schedule). It also shows the name of the pattern running on that CPU; the names are usually telltale, like SIS_something or CRYRING_Ne.... Patterns starting with SA_ are idle patterns, which usually generate no messages, or at most a slow heartbeat. The last column is the current node, which is more of a debug aid. Don't worry if it does not appear to be changing; it usually changes too fast to snapshot.

Space       Free
      97448        75%
      97448        86%
      97448        89%
      97448        97%
Active Safe2remove-Covenants:

╔═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║ DataMaster: dev/wbm0                                                                         │ ECA-Time: 0x16649169442c7ea8 ns │ 2021-02-17 15:07:27 ║
╠═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ Cpu │ Thr │ Running │  MsgCount │                                                 Pattern │                                                    Node ║
╠═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║  0  │  0  │   yes   │  18925681 │                          SIS18_SLOW_HFS_20210215_213626 │                      SIS18_SLOW_HFS_20210215_213626_058 ║
║  1  │  0  │   yes   │  71565351 │                                        CRYRING_Ne_local │                                    CRYRING_Ne_local_166 ║
║  2  │  0  │   yes   │   1747141 │                                       ESR_ISO_3_13Feb21 │                                   ESR_ISO_3_13Feb21_054 ║
║  3  │  0  │   yes   │    260656 │                            SA_20210216144529021_DEFAULT │                       SA_20210216144529021_DEFAULT_EXIT ║
╚═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝
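
To watch the message counters increase, you can simply repeat the status query, e.g. every 10 seconds:

[root@tsl017 ~]# watch -n 10 dm-cmd dev/wbm0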

Obtaining a DM diagnostic report

The current layout is cramped due to an 80-character restriction stemming from the old terminals in the HKR. TODO: rework the formatting and make the output look decent again.

Log into DM Pro (tsl017). Issue the following command to get the report:
[root@tsl017 ~]# dm-cmd dev/wbm0 diag

In order to reset, stop or start statistics, use
[root@tsl017 ~]# dm-cmd dev/wbm0 --help
to get a list of all possible commands and their parameters.
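
For example, resetting the statistics should look like the following (subcommand name as listed by --help; verify before use):

[root@tsl017 ~]# dm-cmd dev/wbm0 cleardiag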

The diagnostics report will look something similar to this:



╔════════════════════════════════════════════════════════════════════════════════╗
║ DataMaster:                       dev/wbm0                                     ║
╠════════════════════════════════════════════════════════════════════════════════╣
║ ECA-Time: 0x16648d98a192b1c0 │ 2021-02-17 13:57:33                             ║
╠════════════════════════════════════════════════════════════════════════════════╣
║ Cpu │            BootTime │    CPU Msg Cnt │      State │    Bad Wait-Time Cnt ║
╠════════════════════════════════════════════════════════════════════════════════╣
║   0 │ 2021-02-05 13:31:26 │       18818208 │ 0x0000000f │                    0 ║
║   1 │ 2021-02-05 13:31:26 │       71364918 │ 0x0000000f │                    0 ║
║   2 │ 2021-02-05 13:31:26 │        1717715 │ 0x0000000f │                    0 ║
║   3 │ 2021-02-05 13:31:26 │         260656 │ 0x0000000f │                    0 ║
╠════════════════════════════════════════════════════════════════════════════════╣
║ Cpu │    Schedule ModTime │   Issuer │     Host │    Op Type │                 ║
╠════════════════════════════════════════════════════════════════════════════════╣
║   0 │ 2021-02-17 13:53:32 416054432ns │          │   tsl017 │        Add │                 ║
║   1 │ 2021-02-17 13:57:21 160203272ns │          │   tsl017 │        Add │                 ║
║   2 │ 2021-02-17 13:53:32 416054432ns │          │   tsl017 │        Add │                 ║
║   3 │ 2021-02-17 13:57:21 160203272ns │          │   tsl017 │        Add │                 ║
╠════════════════════════════════════════════════════════════════════════════════╣
║ Cpu │     Command ModTime │   Issuer │     Host │    Op Type │                 ║
╠════════════════════════════════════════════════════════════════════════════════╣
║   0 │ 2021-02-17 12:35:21 637097360ns │          │   tsl017 │       Flow │                 ║
║   1 │ 2021-02-16 14:44:13 470083048ns │          │   tsl017 │      Start │                 ║
║   2 │ 2021-02-16 13:42:21 567593472ns │          │   tsl017 │      Start │                 ║
║   3 │ 2021-02-16 13:46:11 397133128ns │          │   tsl017 │      Start │                 ║
╠════════════════════════════════════════════════════════════════════════════════╣
║ Cpu │    Min dT │    Max dT │    Avg dT │   Thrs dT │  Warnings │ Max Backlog  ║
╠════════════════════════════════════════════════════════════════════════════════╣
║   0 │    882200 │    998728 │    976999 │    500000 │         0 │        16    ║
║   1 │    897624 │    998728 │    970618 │    500000 │         0 │        14    ║
║   2 │    958416 │    998728 │    992704 │    500000 │         0 │         5    ║
║   3 │    976000 │    998728 │    993066 │    500000 │         0 │         3    ║
╠════════════════════════════════════════════════════════════════════════════════╣
║ Cpu │         1st Warning │                                           Location ║
╠════════════════════════════════════════════════════════════════════════════════╣
║   0 │ 1970-01-01 00:00:00 0ns │                                                  ? ║
║   1 │ 1970-01-01 00:00:00 0ns │                                                  ? ║
║   2 │ 1970-01-01 00:00:00 0ns │                                                  ? ║
║   3 │ 1970-01-01 00:00:00 0ns │                                                  ? ║
╠════════════════════════════════════════════════════════════════════════════════╣
║   T observ │ maxPosDif │        MaxPosUpdate │ minNegDif │        minNegUpdate ║
╠════════════════════════════════════════════════════════════════════════════════╣
║         16 │        16 │ 2021-02-05 13:33:54 411099984ns │         0 │ 2021-02-05 13:33:54 411099952ns ║
╠════════════════════════════════════════════════════════════════════════════════╣
║ Cpu │  STL obs. │ MaxStreak │   Current │     MaxStreakUpdate │                ║
╠════════════════════════════════════════════════════════════════════════════════╣
║   0 │     50000 │     16674 │      7536 │ 2021-02-05 13:38:10 691203072ns │                ║
║   1 │     50000 │     16674 │     13399 │ 2021-02-05 13:38:10 691203072ns │                ║
║   2 │     50000 │     16674 │      3631 │ 2021-02-05 13:38:12 509327382ns │                ║
║   3 │     50000 │     16674 │     16080 │ 2021-02-05 13:38:10 691203072ns │                ║
╚════════════════════════════════════════════════════════════════════════════════╝

Checking the result

System Overview

The topmost part of the output contains the interface and the current ECA time in ns. Make sure this is actually the server you wanted to check, and compare the time against a reference for plausibility. 1970 is not plausible.

╔════════════════════════════════════════════════════════════════════════════════╗
║ DataMaster:                       dev/wbm0                                     ║
╠════════════════════════════════════════════════════════════════════════════════╣
║ ECA-Time: 0x16648d98a192b1c0 │ 2021-02-17 13:57:33                             ║
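
Since the ECA time is nanoseconds since the epoch, a quick plausibility check is possible with shell arithmetic and GNU date (leap-second offsets aside):

[root@tsl017 ~]# date -u -d @$(( 0x16648d98a192b1c0 / 1000000000 ))
Wed Feb 17 13:57:33 UTC 2021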

CPU Message Generation Tracker

The report continues with the runtime statistics of each CPU core in the DM hardware (FPGA). Boot time shows the WR time at the last eb-reset, the message counter is the per-CPU sum of all sent timing messages. The state should always be 0x0000000f if a CPU runs as expected. The bad wait-time counter records all occurrences where a message to be sent was already late inside the DM CPU. If this count is not zero, it is a very bad sign of congestion, caused either by a faulty schedule, a time slip or a downed link.

║ Cpu │            BootTime │    CPU Msg Cnt │      State │    Bad Wait-Time Cnt ║
╠════════════════════════════════════════════════════════════════════════════════╣
║   0 │ 2021-02-05 13:31:26 │       18818208 │ 0x0000000f │                    0 ║
║   1 │ 2021-02-05 13:31:26 │       71364918 │ 0x0000000f │                    0 ║
║   2 │ 2021-02-05 13:31:26 │        1717715 │ 0x0000000f │                    0 ║
║   3 │ 2021-02-05 13:31:26 │         260656 │ 0x0000000f │                    0 ║
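
A crude scripted check (just a sketch; it greps the human-readable output, which may change between versions) is to confirm that all four CPUs report the expected state:

[root@tsl017 ~]# dm-cmd dev/wbm0 diag | grep -c '0x0000000f'
4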

Schedule and Command Log

The next section logs the last change to the schedules in each CPU. First comes the time of the last change, followed by the issuer ID. The ID is only recognisable if an actual user issued a schedule modification; the field is blank when the change came from the Generator FESA class. The host field shows the ID of the computer from which the change was issued. This is used to track unauthorised or accidental changes to the schedule over TCP Etherbone. If anything except the PRO DM host (tsl017) is found in this field, someone is messing with the DM. The last field logs the type of operation. In the depicted example, this is the addition of a schedule or subschedule. Other valid ops would be clear, remove or keep.

╠════════════════════════════════════════════════════════════════════════════════╣
║ Cpu │    Schedule ModTime │   Issuer │     Host │    Op Type │                 ║
╠════════════════════════════════════════════════════════════════════════════════╣
║   0 │ 2021-02-17 13:53:32 416054432ns │          │   tsl017 │        Add │                 ║
║   1 │ 2021-02-17 13:57:21 160203272ns │          │   tsl017 │        Add │                 ║
║   2 │ 2021-02-17 13:53:32 416054432ns │          │   tsl017 │        Add │                 ║
║   3 │ 2021-02-17 13:57:21 160203272ns │          │   tsl017 │        Add │                 ║

The command log section is almost identical to the schedule log section, except that it tracks realtime command ops rather than schedule ops.

╠════════════════════════════════════════════════════════════════════════════════╣
║ Cpu │     Command ModTime │   Issuer │     Host │    Op Type │                 ║
╠════════════════════════════════════════════════════════════════════════════════╣
║   0 │ 2021-02-17 12:35:21 637097360ns │          │   tsl017 │       Flow │                 ║
║   1 │ 2021-02-16 14:44:13 470083048ns │          │   tsl017 │      Start │                 ║
║   2 │ 2021-02-16 13:42:21 567593472ns │          │   tsl017 │      Start │                 ║
║   3 │ 2021-02-16 13:46:11 397133128ns │          │   tsl017 │      Start │                 ║
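
Such entries correspond to dm-cmd operations issued at runtime; for instance, a Start entry would result from starting a pattern (shown here with the pattern name from the example above; verify the subcommand against dm-cmd --help):

[root@tsl017 ~]# dm-cmd dev/wbm0 startpattern CRYRING_Ne_local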

Dispatch time statistics

Next up are the dispatch statistics, tracking the remaining time for each message leaving the DM. dT (delta T) means the time left until a message's deadline at the moment the DM hands it over to the hardware priority queue for network transfer. The most important field is Min dT, as it tells what the worst case ever observed was. Since the network transfer can take up to 500us worst case over 5 switches and the dispatch window is set to 1000us, Min dT must never go below 500us (1000us - 500us) or late messages are likely to occur. The dispatch window minus Max dT shows the best-case processing time inside the DM; Avg dT shows whether schedules are on average relaxed or too crowded with traffic spikes. The closer to Max dT, the better. Thrs dT is the value of dT below which late messages are likely downstream; it should always be set to 500us. The warnings counter tracks how often dT < Thrs dT. And lastly, the Max Backlog field tracks the number of messages arising inside the same dispatch window. Higher values mean higher traffic spikes. Since processing time is around 10us on average, this value should be < 50 (500us Thrs dT / 10us processing time). See the worked example after the table.

╠════════════════════════════════════════════════════════════════════════════════╣
║ Cpu │    Min dT │    Max dT │    Avg dT │   Thrs dT │  Warnings │ Max Backlog  ║
╠════════════════════════════════════════════════════════════════════════════════╣
║   0 │    882200 │    998728 │    976999 │    500000 │         0 │        16    ║
║   1 │    897624 │    998728 │    970618 │    500000 │         0 │        14    ║
║   2 │    958416 │    998728 │    992704 │    500000 │         0 │         5    ║
║   3 │    976000 │    998728 │    993066 │    500000 │         0 │         3    ║
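
As a worked check of the budget for CPU 0 above, using plain shell arithmetic:

[root@tsl017 ~]# echo $(( 882200 - 500000 ))    # Min dT minus Thrs dT: worst-case margin in ns
382200
[root@tsl017 ~]# echo $(( 1000000 - 998728 ))   # dispatch window minus Max dT: best-case DM processing time in ns
1272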

The warnings section refers to the dT warnings and details when the first warning for a CPU occurred and what the name of the involved schedule node was. This is helpful for picking a time range to search in graylog and for giving LSA information about the area of the schedule where the trouble began. If no warning has occurred yet, the time is still 1970 and the location shows a question mark.

╠════════════════════════════════════════════════════════════════════════════════╣
║ Cpu │         1st Warning │                                           Location ║
╠════════════════════════════════════════════════════════════════════════════════╣
║   0 │ 1970-01-01 00:00:00 0ns │                                                  ? ║
║   1 │ 1970-01-01 00:00:00 0ns │                                                  ? ║
║   2 │ 1970-01-01 00:00:00 0ns │                                                  ? ║
║   3 │ 1970-01-01 00:00:00 0ns │                                                  ? ║

WR time continuity tracker

Following up is the time section. This tracks the monotonicity of the WR time. T observ is the interval in ns at which the WR time is sampled. maxPosDif must thus equal T observ once the TR is in track phase; otherwise the WR time jumped forward at some point, causing a discontinuity. The MaxPosUpdate time shows when the biggest such jump happened. It should equal the time the TR went into track phase, i.e. the boot time; otherwise there were fluctuations in the WR time. If these are big (ms to seconds), the link was lost; if they are small, something is wrong with the WR time sync. minNegDif shows jumps backward in time; it must equal zero for the WR time to be okay. minNegUpdate, akin to MaxPosUpdate, shows the absolute time when the biggest backward jump happened.

╠════════════════════════════════════════════════════════════════════════════════╣
║   T observ │ maxPosDif │        MaxPosUpdate │ minNegDif │        minNegUpdate ║
╠════════════════════════════════════════════════════════════════════════════════╣
║         16 │        16 │ 2021-02-05 13:33:54 411099984ns │         0 │ 2021-02-05 13:33:54 411099952ns ║

WB bus congestion tracker

The last topic is flow control on the internal WB bus. This tracks the stall lines as visible to the CPUs, giving an indication of how well they were able to offload their timing messages. STL obs is the number of cycles to be observed, MaxStreak shows the maximum number of stalls detected within the STL obs period. Because the LM32 CPUs work on non-pipelined wishbone, the MaxStreak value should always be around 1/3 of STL obs. If it is significantly higher, the DM is struggling to keep up. If it reaches STL obs, the WR link is most likely dead. The Current field is updated after each measurement and should be different each time it is read. If not, the DM is either not generating messages or is stalled indefinitely. And lastly, MaxStreakUpdate denotes the time when the MaxStreak field was last increased.

╠════════════════════════════════════════════════════════════════════════════════╣
║ Cpu │  STL obs. │ MaxStreak │   Current │     MaxStreakUpdate │                ║
╠════════════════════════════════════════════════════════════════════════════════╣
║   0 │     50000 │     16674 │      7536 │ 2021-02-05 13:38:10 691203072ns │                ║
║   1 │     50000 │     16674 │     13399 │ 2021-02-05 13:38:10 691203072ns │                ║
║   2 │     50000 │     16674 │      3631 │ 2021-02-05 13:38:12 509327382ns │                ║
║   3 │     50000 │     16674 │     16080 │ 2021-02-05 13:38:10 691203072ns │                ║
╚════════════════════════════════════════════════════════════════════════════════╝
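
A quick sanity check of the expected 1/3 ratio with shell arithmetic:

[root@tsl017 ~]# echo $(( 50000 / 3 ))    # expected MaxStreak for STL obs = 50000
16666

The observed MaxStreak of 16674 is right at this mark, so the bus is healthy in this example.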

-- MathiasKreider - 17 Feb 2021