MONITORING THE SLIC

1. BASIC STRUCTURE

During operation, monitoring and troubleshooting of the SLICs will happen at four levels of increasing complexity.

  1. Standard monitoring displays constructed from collect status data. The goal here is for shifters to be able to anticipate problems from changes in some basic histograms and to be able to quickly identify a problem as coming from the SLICs when it occurs. For this to occur efficiently the shifter must not be bombarded with too much information from the SLICs.
  2. Expert monitoring displays constructed from collect status data. These would be used mainly by L2muon experts to keep an eye on the detailed performance of the SLICs without having to stop the run and check the SLICs explicitly. Obviously, the more information that can go in here, the better.
  3. A slic_alive program that shifters could run to localize an existing problem to a given SLIC. This program would be run on the entire collection of SLICs with the run stopped and would quickly test all major components of each board for basic functionality. The goal of this test would be to identify as quickly as possible a problem SLIC so that it could be replaced and the run continued.
  4. The full SLIC test suite, which would be run only by experts, either when all other avenues of investigation had failed or to localize problems during the repair of broken SLICs.

The slic_alive program will be a subset of the full SLIC test suite and is under development now. The rest of this note will address constraints on data obtained on collect status events due to the SLIC architecture.

Note that Arthur has made a proposal for SLIC output formats. This proposal includes monitoring information in a very sensible way. The purpose of this note is not to make another such proposal, but to lay out the constraints due to the SLIC design that lead to Arthur's scheme. To do this we will present several possible monitoring schemes and discuss their strengths and weaknesses.

2. SLIC ELEMENTS

Structurally, the SLIC consists of six basic elements, shown in the table below with their components. Also shown is how information about the status of each element can be obtained over VME. More detailed information about the SLIC architecture is available in the overview.
Element      Components             VME Access
Inputs       Hotlink Receivers      limited - through fpga
             Input FPGAs            status
             Input FIFOs            limited - through fpga
Link         Link FPGAs             status
DSP Input    DSP FPGAs              status
             DSP FIFOs              only almost-full
Worker DSP   DSPs 1-4               DSP cmds via fpga
Admin DSP    DSP 5                  DSP cmds via fpga
Output       Buffer FPGA            status
             Output FIFOs           limited - through fpga
             Output FPGA            no
             Hotlink Transmitters   no status
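
For tools such as slic_alive, the element table above could be captured as a small lookup structure. The following is a minimal sketch in Python, not part of any existing SLIC software: the names SLIC_ELEMENTS and components_with_access are hypothetical, and the pairing of each component with its access mode is read off the row order of the table.

```python
# Hypothetical encoding of the element table:
# element -> list of (component, VME access) pairs.
# The pairing follows the row order of the table and is an interpretation.
SLIC_ELEMENTS = {
    "Inputs": [
        ("Hotlink Receivers", "limited - through fpga"),
        ("Input FPGAs", "status"),
        ("Input FIFOs", "limited - through fpga"),
    ],
    "Link": [("Link FPGAs", "status")],
    "DSP Input": [
        ("DSP FPGAs", "status"),
        ("DSP FIFOs", "only almost-full"),
    ],
    "Worker DSP": [("DSPs 1-4", "DSP cmds via fpga")],
    "Admin DSP": [("DSP 5", "DSP cmds via fpga")],
    "Output": [
        ("Buffer FPGA", "status"),
        ("Output FIFOs", "limited - through fpga"),
        ("Output FPGA", "no"),
        ("Hotlink Transmitters", "no status"),
    ],
}


def components_with_access(access):
    """Return (element, component) pairs whose VME access equals `access`."""
    return [
        (element, component)
        for element, rows in SLIC_ELEMENTS.items()
        for component, acc in rows
        if acc == access
    ]
```

A call such as components_with_access("status") would list the components that answer a VME status command, which is the set a quick board check could poll first.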

3. AVAILABLE INFORMATION

Given below is a list of information from each element that could be available for online monitoring (collect status). This list is intended to be exhaustive for the FPGAs, so anything not on it should be considered as impossible (or at least very difficult) to access. The main sources of information in the table below are the input trailer word added to the end of each data block by the input FPGA, the input, link and output status commands (VME) and the DSP FPGA status register (VME).

DSP monitoring information is limited only by inventiveness in coding and by processing power, so only an abbreviated list is given.
Element    Information                Where
Input      Input Word Count           trailer / status
           Local Event Count          trailer / inp status
           FIFO Full or Almost-Full   trailer / inp status
           Error Flags                trailer / inp status
           Configuration              inp status
Link       FIFO empty                 link status
           Local Event Count          link status
           Configuration              link status
DSP Input  FIFO Almost-full           dspfpga status
           Configuration              dspfpga status
DSPs       Processing Times           code
           Data Sizes                 code
           Errors                     code
Output     FIFO Full or Almost-full   out status
           Local Event Count          out status
           Configuration              out status
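
To illustrate how the input quantities listed above might be unpacked from a trailer word, here is a minimal Python sketch. The bit layout is an assumption made purely for illustration, since this note does not specify the real trailer format. The names InputTrailer and decode_trailer are likewise hypothetical.

```python
from dataclasses import dataclass

# ASSUMED layout of a 32-bit input trailer word (illustration only; the
# actual SLIC trailer format is not given in this note):
#   bits  0-11  input word count
#   bits 12-23  local event count
#   bit  24     FIFO full
#   bit  25     FIFO almost-full
#   bits 26-31  error flags


@dataclass
class InputTrailer:
    word_count: int
    event_count: int
    fifo_full: bool
    fifo_almost_full: bool
    error_flags: int


def decode_trailer(word):
    """Unpack the assumed bit fields from one trailer word."""
    return InputTrailer(
        word_count=word & 0xFFF,
        event_count=(word >> 12) & 0xFFF,
        fifo_full=bool((word >> 24) & 1),
        fifo_almost_full=bool((word >> 25) & 1),
        error_flags=(word >> 26) & 0x3F,
    )
```

Whatever the real layout is, the point is that one trailer word per input per event carries the counts, FIFO flags and error flags in the table, so a DSP can gather them without any VME traffic.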

4. ONLINE MONITORING SCHEMES

The basic constraint imposed by the design of the SLIC on monitoring data available on collect status events is that there is limited provision for event-asynchronous information on the inputs and outputs. For example, there is no free-running clock that samples the occupancy of the input FIFOs at some low rate. (In fact, the FIFOs we use don't support this function anyway.) All status information on input and output data (except for FIFO-full and almost-full) is assembled in the respective FPGAs and is available only after receipt of an end-of-event character.
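
The distinction between event-synchronous and asynchronous status can be pictured with a toy model. The Python sketch below is an illustration of the constraint described above, not a description of the FPGA logic, and the class and method names are hypothetical.

```python
class InputStatusModel:
    """Toy model: counters are latched on end-of-event, FIFO flag is live."""

    def __init__(self):
        self._words_in_progress = 0    # running count, not visible to readers
        self.word_count = 0            # latched at end of event
        self.event_count = 0           # latched at end of event
        self.fifo_almost_full = False  # asynchronous flag, always current

    def receive_word(self, almost_full=False):
        """Accept one data word; only the FIFO flag changes visibly."""
        self._words_in_progress += 1
        self.fifo_almost_full = almost_full

    def end_of_event(self):
        """Latch the per-event counters, as on an end-of-event character."""
        self.word_count = self._words_in_progress
        self.event_count += 1
        self._words_in_progress = 0
```

A reader polling this model in the middle of an event sees the previous event's word count but the current FIFO flag, which is the behavior a monitoring scheme has to allow for.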

For the DSPs, asynchronous time-in-state counters can be implemented in the code at the expense of extra processing time. The speed loss will depend heavily on how ambitious this monitoring is.
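
The bookkeeping behind a time-in-state counter can be sketched as follows. This is a Python illustration of the idea only. Real DSP code would be written for the DSP itself, and the state names and the TimeInState class are hypothetical.

```python
import time


class TimeInState:
    """Accumulate the total time spent in each named state."""

    def __init__(self, initial_state, clock=time.perf_counter):
        self._clock = clock          # injectable clock, for testing
        self._state = initial_state
        self._entered = clock()      # time at which the current state began
        self.totals = {}             # state name -> accumulated time

    def transition(self, new_state):
        """Charge the elapsed time to the old state, then switch states."""
        now = self._clock()
        elapsed = now - self._entered
        self.totals[self._state] = self.totals.get(self._state, 0.0) + elapsed
        self._state = new_state
        self._entered = now
```

Each transition costs one clock read and one accumulation, so the overhead grows with the number of states tracked and how often they change. That is the speed loss referred to above.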

The main questions to be addressed in the SLIC online monitoring scheme are then where the information will be stored and who will access it. Several possibilities suggest themselves and are listed below in approximate order of increasing dependence on having functional DSPs.

  1. Direct VME Access to each Element
    In this scheme some external agent (the Alpha, TCC, ...) that is in charge of collecting monitoring data loops over a list of all elements in each SLIC and reads status information from them directly over VME.
    Summary:
    This type of scheme would probably involve too much overhead. Directly accessing each SLIC element is more appropriate to do outside of normal running as part of slic_alive and the test suite.

  2. VME Access to each DSP
    Here each of the five DSPs collects status information from the inputs through the input trailer word. They add their own processing statistics to this and put it all in a convenient location. An external agent then accesses this location over VME and reads the information.
    Summary:
    This scheme will probably also prove to be too disruptive to normal data taking to be implemented.

  3. VME Access to DSP-5
    In this scheme the administrator DSP serves as the collection point for all monitoring information in a SLIC. Each worker DSP collects input channel status via trailer words and adds its own status information. This data is included in the normal worker DSP output, as in Arthur's proposal, and sent to DSP-5 over the link after it has finished processing each event. DSP-5 then assembles the information from all the worker DSPs into a convenient location which is accessed over VME by an external agent.
    Summary:
    This scheme is attractive in that it balances efficiency of SLIC operation with autonomy from other boards in the system. We need to study how VME accesses to the SLIC during data-taking will really affect the system though.

  4. Monitoring Data as part of SLIC output
    The SLIC can simply pass the buck on monitoring upstream to a worker Alpha by sending monitoring data, collected as in scheme 3, out with its normal data output. This could conceivably be streamlined by adding the monitoring block only on collect status events. (Is this possible?)
    Summary:
    This would certainly be the easiest scheme to implement (for SLIC people).
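
The data flow shared by the last two schemes can be sketched in a few lines of Python. This is a sketch under stated assumptions: the field names, the dictionary representation and the three function names are all hypothetical, and the actual block contents are those of Arthur's proposal, which is not reproduced here.

```python
def worker_block(dsp_id, trailers, processing_time_us):
    """Monitoring block a worker DSP attaches to its output (assumed fields)."""
    return {
        "dsp": dsp_id,
        "input_errors": sum(1 for t in trailers if t["error_flags"]),
        "input_words": sum(t["word_count"] for t in trailers),
        "processing_time_us": processing_time_us,
    }


def admin_collect(blocks):
    """DSP-5 role: merge the worker blocks into one per-SLIC summary."""
    return {
        "workers": sorted(b["dsp"] for b in blocks),
        "input_errors": sum(b["input_errors"] for b in blocks),
        "input_words": sum(b["input_words"] for b in blocks),
        "max_processing_time_us": max(
            (b["processing_time_us"] for b in blocks), default=0
        ),
    }


def build_output(event_data, summary, collect_status):
    """Append the monitoring summary only on collect status events."""
    out = {"data": event_data}
    if collect_status:
        out["monitoring"] = summary
    return out
```

In the third scheme the summary produced by admin_collect would sit in DSP-5 memory for an external agent to read over VME. In the fourth it would travel with the event as in build_output, so no VME access is needed during data-taking.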