FRC and BC Monitoring

Modified: 31 March, 2004
  1. Displays & GUIs
  2. Definitions of Error & Warning Conditions
  3. Errors, Warnings & What to do about them
  4. Definitions of Warning Conditions
  5. Description of Monitoring Variables
  6. Monitoring Protocol

Proposal for Displays & GUIs

Monitoring displays should (i) allow the shifter to quickly localize problems as they occur and (ii) simultaneously help experts debug the system. To facilitate point (i), monitoring GUIs should be tiered, with each display containing boxes that turn yellow or red if a particular warning or error condition occurs. Proposed tiers are listed below.
  1. Top-level STT-wide status display which contains an array of boxes for each card in each crate. A given card's box turns yellow or red depending on whether a set of specified warning/error conditions are present in that card.
    A proposed format for this display is given here.
  2. A set of FRC-level status displays, one for each FRC. These should display the values of all (or most) of the monitoring information available in the FRC/BC. The shifter should be able to easily see which particular variable is causing problems by its box turning yellow or red.
    A proposed format for this display is given here.
  3. A set of BC-level status displays, one for each crate containing information about all the BCs in the crate.
    A proposed format for this display is given here.
  4. Note: the proposed displays linked above are temporarily shown in HTML format because I don't have access to the monitoring GUI.
  5. The current FRC monitoring GUI looks like this
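
The tiering proposed above amounts to rolling the worst condition of each variable up to its card and then to its crate. The following is only a sketch of that idea, not the monitoring GUI's actual code: the `Severity` type, the `roll_up` function and the nested-dictionary layout are all hypothetical names chosen for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; higher means worse."""
    OK = 0
    WARNING = 1  # yellow box
    ERROR = 2    # red box


def worst(severities):
    """Return the most severe state in an iterable (OK if it is empty)."""
    return max(severities, default=Severity.OK)


def roll_up(crates):
    """Roll variable-level states up to card level and crate level.

    crates has the assumed layout {crate: {card: {variable: Severity}}}.
    Returns (card_status, crate_status).
    """
    card_status = {
        crate: {card: worst(variables.values()) for card, variables in cards.items()}
        for crate, cards in crates.items()
    }
    crate_status = {
        crate: worst(cards.values()) for crate, cards in card_status.items()
    }
    return card_status, crate_status
```

With this layout the top-level display only needs `card_status`, while the FRC-level and BC-level displays show the per-variable states that produced it.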

Definitions of Error & Warning Conditions

Monitoring Errors (red boxes) are generally produced for events where fast action is required. If the recommended steps don't solve the problem, the STT expert should be paged. Examples are:
  1. Conditions that cause the entire STT to hang. Displaying an error in these cases makes it easier to identify the problem quickly.
  2. Conditions that cause all subsequent data from the STT to be corrupt or out of synch.
Monitoring Warnings (yellow boxes) are produced for events that the shifter should pay attention to but for which immediate action may not be required. If warnings persist, the STT expert should be informed by email. Examples are:
  1. Problems with the data of a single event that do not mean that all subsequent data will be corrupt.

See the Troubleshooting section for ideas on how to deal with these problems.

FRC Error/Warning Definitions

E/W | Name | Condition on EPICS Variables
E | FRC Firmware Error | TRDFVER != TRDF Ver. Ref. (*) OR SCLFVER != SCLF Ver. Ref. (*) OR BMVER != BM Ver. Ref. (*) OR PCI1VER != PCI-1 Ver. Ref. OR PCI2VER != PCI-2 Ver. Ref. OR PCI3VER != PCI-3 Ver. Ref.
E | FRC Status Error | (TRDFST & TRDF bit mask) > 0 OR (SCLFST & SCLF bit mask) > 0 OR (BM0ST & BM-0 bit mask) > 0 OR (BM1ST & BM-1 bit mask) > 0 OR (PCI1ST & PCI-1 bit mask) > 0 OR (PCI2ST & PCI-2 bit mask) > 0 OR (PCI3ST & PCI-3 bit mask) > 0
W | FRC Status Warning | Defined as above but with different bit masks
E | CTT Error | lCTT-ERR > 5
W | CTT Warning | lCTT-ERR > 0
E | Event Error | lL1BXERR > 5 OR lTURNERR > 5
W | Event Warning | lL1BXERR > 0 OR lTURNERR > 0
(*) Not yet available in EPICS.
Current firmware version references are coming soon.
Bit masks are built from the individual status register bits that signal errors.
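
As an illustration of how the FRC conditions above could be evaluated on one monitoring snapshot, here is a minimal sketch. It assumes the EPICS values arrive as a dictionary keyed by variable name. The function names are hypothetical, and the firmware references and bit masks must be supplied by the caller, because this page does not define them yet.

```python
def count_severity(count, warn_above=0, error_above=5):
    """Map a last-cycle count to "E", "W" or None using the thresholds in the table."""
    if count > error_above:
        return "E"
    if count > warn_above:
        return "W"
    return None


def masked(values, masks):
    """True if any status word has a bit set that is also set in its mask."""
    return any((values[name] & mask) > 0 for name, mask in masks.items())


def label(prefix, severity):
    """Build the condition name used in the table, e.g. "CTT Error"."""
    return prefix + (" Error" if severity == "E" else " Warning")


def frc_conditions(values, version_refs, error_masks, warning_masks):
    """Return (severity, name) pairs for one FRC snapshot.

    version_refs, error_masks and warning_masks are placeholders supplied by
    the caller; their real contents are not given in this document.
    """
    found = []
    # Version variables that are not yet available in EPICS are skipped.
    if any(values[n] != ref for n, ref in version_refs.items() if n in values):
        found.append(("E", "FRC Firmware Error"))
    if masked(values, error_masks):
        found.append(("E", "FRC Status Error"))
    if masked(values, warning_masks):
        found.append(("W", "FRC Status Warning"))
    # For the counts only the worse of error/warning is reported.
    ctt = count_severity(values["lCTT-ERR"])
    if ctt:
        found.append((ctt, label("CTT", ctt)))
    event = count_severity(max(values["lL1BXERR"], values["lTURNERR"]))
    if event:
        found.append((event, label("Event", event)))
    return found
```

Taking the maximum of `lL1BXERR` and `lTURNERR` is equivalent to the OR of the two threshold tests in the table.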

BC Error/Warning Definitions

E/W | Name | Condition on EPICS Variables
E | BC Firmware Error | BCVER != BC Ver. Ref. (*)
E | BC Status Error | (BCST & BC bit mask) > 0
W | BC Status Warning | Defined as above but with different bit masks
(*) Not yet available in EPICS?
Current firmware version references are coming soon.
Bit masks are built from the individual status register bits that signal errors.

Errors, Warnings & What to Do About Them

See also Global STT System Troubleshooting

FRC Troubleshooting

E/W | Class | Symptom | Known Cause | Solution
E | Firmware | Version Incorrect | download script | Call STT expert
E | TRDF | No CTT Data | Corrupted data from CTT | CTT Problems; Reseat STT cards/cables
E | TRDF | RR FIFO Full (Latched) | Corrupted data from CTT | CTT Problems; PCI Hang
W | TRDF | BOE or EOE Missing | Corrupted data from CTT | SCL Init
W | TRDF | BX or TURN Mismatch | Corrupted data from CTT | SCL Init
E | SCLF | SCL Mezz Data Err | Bad transaction with mezzanine card | SCL Init
E | SCLF | SCL Sync Err | loss of synch with SCL | SCL Init
E | PCI1/2 | L1 FIFO Full | Bad PCI transaction | SCL Init; PCI Hang if persistent
E | PCI1/2 | EOE PCI 33 | Bad PCI transaction (likely Master Abort) | SCL Init; PCI Hang if persistent
E | PCI1/2 | Timeout Latch | Bad PCI transaction | SCL Init; PCI Hang if persistent
W | PCI3 | ??? |  | SCL Init if persistent
E | BM-0 | Get Done Timeout | internal L3 readout problem | SCL Init
E | BM-0 | Put Done Timeout | L3 readout hung in a board | SCL Init
E | BM-0 | L1 FIFO Full | BM not reading out - many possible causes | SCL Init
E | BM-0 | L2 FIFO Full | BM not reading out - many possible causes | SCL Init
E | BM-0 | PCI3 L3 FIFO Full | BM not reading out - many possible causes | SCL Init
E | BM-0 | L3 XFER Number FIFO Full | BM not reading out - many possible causes | SCL Init
E | BM-1 | L1/L2 Busy | generally indicates STT is hung | SCL Init
E | BM-1 | Overflow Error (L1/L2) | Busy was ignored by framework | SCL Init
E | BM-1 | STT Output FIFO Full | BC output FIFO overflowed; all subsequent events garbage | SCL Init
W | BM-1 | L1/L2 Error | error sent back to SCL hub | keep track of occurrences
E | Counts | > 25 Error Counts in last cycle |  | SCL Init; call expert if persistent
W | Counts | > 0 Error Counts in last cycle |  | SCL Init if persistent
W | Trk-Cnt | all cumulative events in bin 0 | CTT data problem | ask for CTT Fix
E | Timer | SCL Int Count > huge value | system stuck in SCL Init | ???
W | Timers | timer > 10*average |  | SCL Init if persistent
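
The "timer > 10*average" warning in the last row needs a definition of the average and of "persistent", neither of which is fixed on this page. The sketch below assumes a running average over recent normal cycles and treats a warning as persistent after a configurable number of consecutive cycles. The class name `TimerWatch` and its parameters are hypothetical.

```python
from collections import deque


class TimerWatch:
    """Flag a timer that exceeds factor * (running average of recent normal cycles)."""

    def __init__(self, history=20, factor=10.0, persist=3):
        self.samples = deque(maxlen=history)  # recent non-warning values
        self.factor = factor                  # 10 in the table above
        self.persist = persist                # assumed meaning of "persistent"
        self.streak = 0                       # consecutive warning cycles

    def update(self, value):
        """Return (warning, persistent) for this cycle's timer value."""
        warning = False
        if self.samples:
            average = sum(self.samples) / len(self.samples)
            warning = value > self.factor * average
        if warning:
            self.streak += 1
        else:
            self.streak = 0
            # Only normal cycles feed the average, so one spike cannot hide the next.
            self.samples.append(value)
        return warning, self.streak >= self.persist
```

A shifter-facing display would show the yellow box while `warning` is true and suggest an SCL Init once `persistent` becomes true.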

BC Troubleshooting

E/W | Class | Symptom | Known Cause | Solution
E | Status | DB Timeout | L3 Data lost in DB | STT Missing
E | Status | DB Busy/Error | a board is hung | STT Missing
E | Status | LM DB L3 Wait (when data not flowing) | L3 readout is hung on a board | STT Missing
E | Status | PCI LM TSR6 & TSR7 = 1 | both should never be set | Call STT expert
W | Status | L1/L2 BX Errors > 1 |  | keep track of when these happen


Monitoring Variable Descriptions

FRC Monitoring variables come in several classes:
  1. Event Variables: give the state of the system for the current event.
  2. Last Cycle Counts: give the number of occurrences of the object in question since the last monitoring request. These variables are cleared in the FRC firmware after each monitoring request.
  3. Cumulative Counts: give the number of occurrences of the object in question since the last clear monitoring request was issued (usually at the start/end of run). These variables are constructed in the CPU as the running sum of the Last Cycle Count variables.
  4. Histograms: collections of related Last Cycle or Cumulative variables.
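
The relation between Last Cycle Counts and Cumulative Counts can be sketched as follows. This is an illustration under assumptions, not the CPU code: the class name `CumulativeCounts` is hypothetical, and the only convention taken from this page is that a cumulative variable carries the prefix `c` where its last-cycle counterpart carries `l`.

```python
class CumulativeCounts:
    """Running sums of last-cycle counts, reset by a clear monitoring request."""

    def __init__(self):
        self.totals = {}

    def add_cycle(self, last_cycle):
        """Add one monitoring cycle's counts, e.g. {"lCTT-ERR": 2}, and return the totals."""
        for name, count in last_cycle.items():
            # lCTT-ERR accumulates into cCTT-ERR, and so on.
            cumulative_name = "c" + name[1:] if name.startswith("l") else name
            self.totals[cumulative_name] = self.totals.get(cumulative_name, 0) + count
        return dict(self.totals)

    def clear(self):
        """Handle a clear monitoring request (usually at the start/end of run)."""
        self.totals.clear()
```

Because the firmware clears the last-cycle counters after every monitoring request, each cycle's values can be added directly without subtracting a previous reading.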
The following table lists the FRC monitoring variables that are currently being passed to EPICS. Lines highlighted in yellow correspond to variables that we may want to add to the list.
Variables from FRC Firmware
No. | Bits | EPICS Name | FRC Name | Src:Class | Description
0-48 | 31..0 | lCTTTRnn | CTT_Track | TRDF:Last (hist) | Histogram of No. of tracks in CTT data since last mon cycle. These are 16-bit counters.
49a | 15..0 | lCTT-ERR | CTT_EVENT_ERR | TRDF:Last | No. of CTT event errors (MISSING DATA, BOE, or EOE) since prev mon cycle.
49b | 31..16 | lL1BXERR | L1_BX_ERR | TRDF:Last | No. of L1 BX mismatches detected since prev mon cycle.
50a | 15..0 | lC-ERR-A | CTT_EVENT_ERR_ACCUM | TRDF:spec | No. of accumulated CTT event errors since prev. CPU_CLR issued to TRDF.
50b | 31..16 | lTURNERR | TURN_ERR | TRDF:Last | No. of L1 TURN mismatches detected since prev mon cycle.
51 | 31..0 | lC-DELAY | MAX_CTT_DELAY | TRDF:Last | Max delay between L1 ACC and beginning of CTT data input (in units of 30 ns). This is an 8-bit counter.
52-67 | 31..0 | lL1QULnn | QUAL0_MON | SCLF:Last (hist) | Histogram of L1 Qualifier Bits since prev mon cycle. These are 16-bit counters.
68 | 31..0 | lL1-PER | L1_PERIOD | SCLF:Last | No. of L1 Periods since prev mon cycle. This is a 16-bit counter.
69 | 31..0 | lL2-REJ | L2_REJECT | SCLF:Last | No. of L2 Rejects since prev mon cycle. This is a 16-bit counter.
70 | 31..0 | lL2-ACC | L2_ACCEPT | SCLF:Last | No. of L2 Accepts since prev mon cycle. This is a 16-bit counter.
71 | 31..0 | lL2-PER | L2_PERIOD | SCLF:Last | No. of L2 Periods since prev mon cycle. This is a 16-bit counter.
72 | 31..0 | lMONTIME | RAW_MON_COUNT | BM:Last | Time from previous monitoring cycle (in units of 30 ns).
73 | 31..0 | lMON-LEN | RAW_INT_COUNT | BM:Event | Length of monitoring cycle (in units of 30 ns).
74 | 31..0 | lL1-PROC | L1_PROC_COUNT | BM:Last | Cumulative time of L1 processing since prev mon cycle (in units of 30 ns).
75 | 31..0 | lL2-PROC | L2_PROC_COUNT | BM:Last | Cumulative time of L2 processing since prev mon cycle (in units of 30 ns).
76 | 31..0 | lSLV-RDY | SLV_RDY_COUNT | BM:Last | Cumulative time of L3 readout via SBC since prev mon cycle (in units of 30 ns).
xx | 31..0 | TRDFST | TRDF_STATUS | TRDF:Event | TRDF Status Registers (details)
xx | 31..0 | SCLFST | SCLF_STATUS | SCLF:Event | SCLF Status Registers (details)
xx | 31..0 | BM0ST | BM0_STATUS | BM:Event | BM-0 Status Registers (details)
xx | 31..0 | BM1ST | BM1_STATUS | BM:Event | BM-1 Status Registers (details)
xx | 31..0 | PCI1ST | PCI1_STATUS | PCI1:Event | PCI-1 bus Status Registers (details)
xx | 31..0 | PCI2ST | PCI2_STATUS | PCI2:Event | PCI-2 bus Status Registers (details)
xx | 31..0 | PCI3ST | PCI3_STATUS | PCI3:Event | PCI-3 bus Status Registers (details)
xx | 31..0 | TRDFVER | TRDF_FIRMWARE_VER | TRDF:Event | TRDF Firmware version
xx | 31..0 | SCLFVER | SCLF_FIRMWARE_VER | SCLF:Event | SCLF Firmware version
xx | 31..0 | BMVER | BM_FIRMWARE_VER | BM:Event | BM Firmware version
xx | 31..0 | PCI1VER | PCI1_FIRMWARE_VER | PCI1:Event | PCI1 Firmware version
xx | 31..0 | PCI2VER | PCI2_FIRMWARE_VER | PCI2:Event | PCI2 Firmware version
xx | 31..0 | PCI3VER | PCI3_FIRMWARE_VER | PCI3:Event | PCI3 Firmware version
xx | 31..0 | lSCL-INT | SCL_INT_COUNT | BM:Last | Cumulative time with SCL Init interrupt set since prev mon cycle (in units of 30 ns). Is this possible/sensible?
xx | 31..0 | lTSR6-7 | TSR6_AND_TSR7 | BM:Last | Cumulative time with both TSR6 and TSR7 = 1 since prev mon cycle (in units of 30 ns).
Variables Accumulated in the CPU
No. | Bits | EPICS Name | FRC Name | Src:Class | Description
xx-xx | 31..0 | cCTTTRnn |  | CPU:Cumul (hist) | Cumulative version of lCTTTRnn
xx | 31..0 | cCTT-ERR |  | CPU:Cumul | Cumulative version of lCTT-ERR
xx | 31..0 | cL1BXERR |  | CPU:Cumul | Cumulative version of lL1BXERR
xx | 31..0 | cTURNERR |  | CPU:Cumul | Cumulative version of lTURNERR
xx | 31..0 | cC-DELAY |  | CPU:Cumul | Cumulative version of lC-DELAY
xx-xx | 31..0 | cL1QULnn |  | CPU:Cumul (hist) | Cumulative version of lL1QULnn
xx | 31..0 | cL1-PER |  | CPU:Cumul | Cumulative version of lL1-PER
xx | 31..0 | cL2-REJ |  | CPU:Cumul | Cumulative version of lL2-REJ
xx | 31..0 | cL2-ACC |  | CPU:Cumul | Cumulative version of lL2-ACC
xx | 31..0 | cL2-PER |  | CPU:Cumul | Cumulative version of lL2-PER
xx | 31..0 | cMONTIME |  | CPU:Cumul | Cumulative version of lMONTIME
xx | 31..0 | cMON-LEN |  | CPU:Cumul | Cumulative version of lMON-LEN
xx | 31..0 | cL1-PROC |  | CPU:Cumul | Cumulative version of lL1-PROC
xx | 31..0 | cL2-PROC |  | CPU:Cumul | Cumulative version of lL2-PROC
xx | 31..0 | cSLV-RDY |  | CPU:Cumul | Cumulative version of lSLV-RDY
xx | 31..0 | cTSR6-7 |  | CPU:Cumul | Cumulative version of lTSR6-7
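
Two details of the table lend themselves to a short worked example: entries 49 and 50 each pack two 16-bit counters into one 32-bit word, and the timer variables count in units of 30 ns. The sketch below assumes that the "No." column is the index of a 32-bit word in the monitoring block. The function names are hypothetical.

```python
TICK_NS = 30  # timer unit quoted in the table


def unpack_halves(word):
    """Split a 32-bit word into (bits 15..0, bits 31..16)."""
    low = word & 0xFFFF
    high = (word >> 16) & 0xFFFF
    return low, high


def decode_error_words(word49, word50):
    """Decode entries 49a/49b and 50a/50b into their EPICS names."""
    ctt_err, l1bx_err = unpack_halves(word49)
    ctt_err_accum, turn_err = unpack_halves(word50)
    return {
        "lCTT-ERR": ctt_err,        # 49a, bits 15..0
        "lL1BXERR": l1bx_err,       # 49b, bits 31..16
        "lC-ERR-A": ctt_err_accum,  # 50a, bits 15..0
        "lTURNERR": turn_err,       # 50b, bits 31..16
    }


def ticks_to_seconds(ticks):
    """Convert a counter in 30 ns ticks (e.g. lMONTIME) to seconds."""
    return ticks * TICK_NS / 1e9
```

For example, an lMONTIME value of 1,000,000 ticks corresponds to about 0.03 s between monitoring cycles.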