From: "Denis Calvet"
To: "Hal Evans"
Cc: "Ken Johns", "Maris Abolins", "Patrick LeDu"
Subject: Re: latency
Date: Tue, 25 Jun 2002 17:51:22 +0200

Dear Hal,

I agree that the latency issue is crucial, but I see little margin for improvement on the ADF side, at least unless the cost is raised substantially. Here is a detailed description of the ADF chain, how much latency is consumed at each stage and how it could be reduced. I feel that we should be extremely careful not to forget anything in these latency calculations, because forgetting an item means giving zero value to something that is non-zero in reality, and this will cause bad surprises later.
You can distribute this text to whoever might be interested in it. It is a good topic of discussion for those who will work while I am on vacation (I'll be away from July 2nd till July 23rd).

Best Regards,
Denis.

0/ Time reference

First we need to know the origin of time when speaking about latency. On the ADF side, I count latency from the moment the ADC chip of a trigger tower sees the peak of the pickoff signal. So all the latency figures that I give below are offset by (i.e. do not include) the time it takes for the real signals to develop in the detector, travel through the analog summers and the cables, and go through the analog chain placed in front of the analog-to-digital converter chip.

1/ Analog cables and system synchronisation

The ADF system needs some source of synchronisation: the idea is to use the 7.57 MHz (132 ns) beam-crossing clock for that purpose. This clock will be demultiplexed from the Serial Control Link (SCL) and distributed to all ADF crates and boards via a fanout tree with matched cable lengths.

Because there are disparities in the length of the cables transporting the analog signals from the BLS, a given beam crossing will produce signals with different peaking times from tower to tower. If I remember correctly the figure quoted by Dan, the spread is ~50 ns. For each given beam crossing, it is highly desirable that the signal produced by the "earliest" tower and that produced by the "latest" tower are guaranteed to occur during the same period of the synchronization signal, i.e. the BC clock transported by the SCL. The convention so far is to say that a beam crossing "starts" on the rising edge of the clock signal and lasts until the next rising edge of that clock.
For correct synchronisation, the signals derived from the SCL will go through a programmable delay line adjusted so that:
- the earliest tower produces its peak after the rising edge of the BC clock
- the latest tower produces its peak before the next rising edge of the BC clock

Because the spread of cable delay (50 ns) is less than the BC period (132 ns), there is no ambiguity in assigning each trigger tower signal to the correct beam crossing.

There are several options for the value of the delay of the synchronization signal:
- It can be tuned so that "average delay" towers produce their peak around the falling edge of the BC clock. This provides the best immunity against assignment of energy to the wrong beam crossing caused by time jitter in the peaking time of the analog signal. The delay introduced in this case is 0 at best and 132/2 + 50/2 = 91 ns at worst.
- Or it can be tuned with 0 (or little) tolerance on peaking time jitter. In this case, the delay is 0 at best and 50 ns at worst.

I haven't figured out yet the method for setting the delay correctly in the final system, but it is clear that we are forced to take the worst case. Because I do not have any measurement of jitter for the various signals, I would recommend the most conservative tuning, i.e. 91 ns.

2/ ADC Chip

There are at least 2 types of ADCs suited to our application: flash ADCs and pipelined ADCs. An N-bit flash ADC consists of an array of 2**N comparators followed by a priority encoder. These devices are extremely fast and have very short latency, but are limited to 8 bit precision because integrating more than 256 comparators with precise resistor trimming is challenging and expensive. For conversion rates of several tens of MHz, pipelined ADCs offer a much cheaper alternative. These devices have a high conversion rate and good precision (e.g. 12 bits), but exhibit a latency proportional to the sampling period (typically 5 to 7 periods).
This latency is not the most stringent requirement for the applications targeted by these devices (video, medical imaging), where the primary concerns are conversion rate, precision, low power consumption and low cost. From what I saw at TI, Datel, Maxim and Analog Devices, 8 bit flash ADCs in the 10-100 MHz range are becoming uncommon, if not obsolete, and are being replaced by pipelined ADCs with 8-12 bit precision. Datel has some choices of true flash ADCs, e.g. ADC 30720, 8 bit, 20 MHz, 25 ns latency (I called but could not get a price, however...).

As far as the ADF is concerned, there are a number of benefits in digitizing signals with 10 bits instead of 8: a simpler analog chain before the ADC, a "real" 8-bit precision... So the idea is to use 10 bit pipelined ADCs and over-sample, in order to cut latency. Sampling at BCx4 (31 MHz) seems a good compromise between cost and power consumption. The choice is the AD 9218: dual 10 bit, 40 MHz, 350 mW, 165 ns latency @31 MHz (10 $ per 1k units, 15 $ per unit).

A faster ADC would allow a shorter latency; the same model exists in 80 MHz (550 mW, 14 $ per 1k) and 105 MHz (17 $ per 1k) versions. Sampling at BCx10 would cost ~5 k$ in total and the ADC latency would be 66 ns, i.e. a latency reduction of ~100 ns. However, sampling at this speed may force us to use faster logic and bigger power supplies, so the additional cost is probably underestimated.

3/ Digital samples decimation and phase adjustment

If the ADCs do oversample, a decimation stage is needed to select which samples to process among all those that are converted. By the proper selection of samples on a per tower basis, coarse delay compensation can be made.

In my current design, sampling is done at BCx4 while the sampling rate for digital processing is BCx2. All logic is run at BCx8 (61 MHz) to cut latency. By clocking each ADC on the rising or falling edge of a BCx4 clock, 4 values of delay compensation can be obtained.
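As a cross-check of the clock figures used in sections 2 and 3, here is a minimal sketch that derives the period of each BC clock multiple and the corresponding pipelined ADC latency. The pipeline depth of 5 sampling periods is an assumption inferred from the 165 ns quoted at BCx4 (5 x 33 ns); it is not taken from a datasheet. The function names are illustrative only.

```python
BC_NS = 132.0  # beam-crossing period in ns, as quoted in the text

# Assumption: the ADC pipeline depth is 5 sampling periods. This is inferred
# from the 165 ns quoted at BCx4 (5 x 33 ns); it is not a datasheet value.
ADC_PIPELINE_PERIODS = 5


def period_ns(multiple: int) -> float:
    """Period in ns of a clock running at `multiple` times the BC frequency."""
    return BC_NS / multiple


def frequency_mhz(multiple: int) -> float:
    """Frequency in MHz of a clock running at `multiple` times the BC frequency."""
    return 1000.0 / period_ns(multiple)


def adc_latency_ns(multiple: int, periods: int = ADC_PIPELINE_PERIODS) -> float:
    """Latency of a pipelined ADC sampled at BC x `multiple`."""
    return periods * period_ns(multiple)


if __name__ == "__main__":
    for m in (2, 4, 8, 10):
        print(
            f"BCx{m:<2d}: {frequency_mhz(m):6.2f} MHz, "
            f"period {period_ns(m):5.2f} ns, "
            f"ADC latency if sampled at this rate {adc_latency_ns(m):6.1f} ns"
        )
```

Under that assumption the sketch gives the 165 ns (BCx4) and 66 ns (BCx10) ADC latencies quoted in section 2, and the 16.5 ns logic clock period that the text rounds to 16 ns.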
Because I did not want logic whose latency depends on the input decimator parameters, my design performs a re-synch after decimation. The latency introduced by this stage is the largest value of the delay compensation parameter (1/2 BC, i.e. 66 ns) plus one clock period of the logic (BCx8, i.e. 16 ns); that is 82 ns in total. By suppressing the capability to make any phase correction, the sample decimation stage could be done in 16 ns, a gain of 66 ns compared to the current design. The question is therefore whether the delay compensation is needed or not. I have no clear answer: it is equally possible that a proper tuning of the digital filter parameters can do the job, or that the phase adjustment is a real plus.

4/ Input to the digital filter

In my present design, the digital filter can take its input from the ADC chip or from a memory that stores pre-loaded values. A synchronous multiplexer stage is used at that level; the latency added is 16 ns. If the test mode is suppressed, the corresponding delay can be suppressed.

5/ Digital filter

At this stage, one must consider both the irreducible latency of an algorithm and its effective latency. The only algorithm that gave me decent results so far is a matched filter followed by a peak detector. A matched filter has an irreducible latency that depends on the temporal window extracted from the original signal that is used to determine the impulse response of the filter. Hence, the choice of coefficients will change the intrinsic latency of the filter.

I found that a filter with 5 to 8 taps is OK; I am less confident with 4 taps, but I have not explored all the parameter space. So I dimensioned the logic to go up to 8 taps. As the sampling rate for the filter is BCx2, the impulse response of the filter corresponds to 4 BC periods. On which tap does one put the largest coefficient? I.e. how much of the rising/flat-top/falling edge of the signal do we put in the impulse response?
Again, I have not explored all options, so I put the largest coefficient in the "middle" of the temporal impulse response. This means that the output of the filter will peak ~2 BCs after the peak of its input, no matter how fast the computations are made. Changing the position of the peak is possible and can lead to a longer or shorter delay, but the latency/performance tradeoff needs to be assessed. Being optimistic, we can imagine gaining 1 BC of latency at that level.

The intrinsic latency is only a theoretical consideration; when implemented in logic, computations take time... An 8-tap filter running at BCx2 (15 MHz) requires performing 120 million Multiply-ACcumulate operations per second. I could not achieve this speed in the highest speed grade Virtex II (these include hardwired multipliers). The option I took is to make operations in parallel and use 2 multipliers per filter, each running at 61 MHz. The latency is 10 clock periods, i.e. 160 ns (8 Multiply + Accumulate + sum of the two partial convolutions).

The only possible increase in speed at that level would be to use the next generation of FPGAs (e.g. Altera Stratix), which are probably not yet on the market, or to use a higher level of parallelism. Using 8 multipliers per filter, one could imagine doing the convolution in 3 clock periods, i.e. 48 ns, a gain of 112 ns compared to the current design. Adding the intrinsic latency to the computation time, we have 292-424 ns latency for the current design and 180-312 ns for the 8-multiplier per filter option. The additional cost is four 500K-gate FPGAs per ADF card, i.e. 4 x 80 x 130 = 41.6 k$. Again, this excludes the potential need for more expensive power supplies.

6/ Peak detector and output decimator

The current algorithm is a 3 point peak-detector. The irreducible latency of this operator is 2 (sampling) periods, i.e. 132 ns in the present design. The computation itself takes 1 clock period (61 MHz).
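To illustrate the principle of a 3 point peak detector, here is a minimal software sketch. It is not the VHDL of the ADF design: the function name, the choice of outputting 0 when no peak is found, and the tie-breaking rule (strictly greater than the previous sample, greater than or equal to the next one) are all assumptions made for the illustration.

```python
def three_point_peak(samples):
    """Return one output per input sample.

    The output is the value of the middle sample of the last three samples
    when that middle sample is a local maximum, and 0 otherwise.
    Assumed convention: strictly above the earlier sample, and at least
    equal to the later one, so that a flat top is reported only once.
    """
    out = []
    prev2 = prev1 = 0  # the two most recent samples, oldest first
    for x in samples:
        is_peak = prev2 < prev1 >= x
        out.append(prev1 if is_peak else 0)
        prev2, prev1 = prev1, x
    return out


if __name__ == "__main__":
    filtered = [0, 1, 3, 7, 4, 2, 0, 0]  # made-up filter output, peak value 7
    print(three_point_peak(filtered))
```

The decision on a sample can only be taken once the sample that follows it has arrived, which is where the irreducible part of the latency comes from. This sketch does not model the registered hardware pipeline, so it does not reproduce the 2-period figure quoted above.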
Because the digital filter/peak detector is run at BCx2 (partly to cut latency) while only one sample per BC is needed, a decimation by a factor of 2 is needed. There are two possible choices for the sample to keep; in order to have a constant latency independently of that choice, a re-synch is needed. It consumes 1 sampling period + 1 (61 MHz) clock period. The capability to select the correct sample among the 2 computed per BC is mandatory; the peak detector will fail if this choice is not provided. In total the peak detector + decimation operator takes 164 ns, with no prospect of significant gain at this level.

7/ Convolution scaling, Et calibration look-up table, saturation and clipping

The peak detector is followed by a stage that performs an arithmetic shift of the peak before driving a look-up table, which performs the final conversion to transverse energy and implements saturation and clipping of the lowest-energy signals. The look-up table could be suppressed in principle, though the result scaling is mandatory. If the LUT is suppressed, the computation of filter coefficients would need to include the energy to transverse energy scaling. Saturation would be undetected and no clipping would be done. I see no possible latency gain at that level, only savings in logic and blocks of RAM.

8/ Choice of stream to serialize

In the current design, the output to be sent to the TAB can be selected from one of the 3 following sources:
- A pseudo-random generator, in order to test the ADF to TAB link independently of the digital filter
- A register that contains a programmable value, in order to turn off a channel or perform tests
- The output of the Et look-up table, for normal operation

This stage takes 1 (61 MHz) clock period. Reducing latency at that level would mean suppressing the previous functionality. In any case, one needs 1 clock tick to load the parallel-load serial-out shift register of each channel.
9/ Delay Equalization

One must be absolutely sure that the output of all digital filters corresponds to the same beam crossing under all circumstances. The delay compensation circuitry included in the input and output decimators plays some role in achieving that, but this is not sufficient because the intrinsic latency of each filter can vary, depending on the choice of its coefficients. It is likely that the determination of the coefficients of each filter will need to include a "collective" constraint to make sure that the latency difference between any 2 channels is zero. If this cannot be achieved, delay compensation circuitry must be included.

The current design incorporates for each channel a shift register of programmable length (1-33 bit) clocked at BCx8 (61 MHz), corresponding to a delay of 1 (61 MHz) clock tick plus 0 to 4 BC units. The 2 MSBs of the shift register length are programmable on a per channel basis. A normal value would be a delay of 1 BC, so that a "late" channel can be given a "negative" delay to bring it in phase with the other ones. The 3 LSBs of the shift register length are also programmable, but are common to all channels within a chip. This facility can be used to adjust latency from 1/8 to 7/8 of a BC unit if we find it useful.

It is not yet clear to me whether this whole delay compensation circuitry will be mandatory or not, so I have included it in the design but set the length of the shift register to its minimum, i.e. 1 tap. If we remove it, we could gain 16 ns of latency; but if it turns out that it is needed, then it will add 132 ns of latency.

10/ Serializer source and checksum calculation

The ADF will output 8 bit Et calibrated samples most of the time, except following an L1 accept, where the raw ADC chip samples (10 bit) will be sent. A pipeline stage is needed at that level to switch between the 8 bit serial stream from the digital filter and the 10 bit stream from the memory that stores the raw ADC values.
To detect transmission errors, my current design calculates a parity bit; the computation takes 1 (61 MHz) clock period. The latency for this stage is 32 ns.

11/ FPGA to Channel link serializer and cable

At present, my VHDL simulation does not go beyond the FPGA. The datasheet for the Channel link has some latency figures in it; for the transmitter, we have 1.5 x 16 + 6 = 30 ns. For the cable, 4 ns/m is a good number, so 12 ns for the ADF to TAB cable seems realistic. The total at that level is 42 ns.

To summarize, I have in total for a typical channel on the current ADF design:

Time (in 1/8 of BC units) | Comment
   0   : peak of the analog signal at ADC input
  10   : ADC value valid
  10.5 : ADC pins sensed
  15.5 : digital input decimated
  41.5 : peak at convolver output
  50.5 : peak detection and decimation done
  51.5 : Et LUT done
  52.5 : serial stream selected
  52.5 : delay adjustment (set to minimum here)
  53.5 : parity calculation / selection of 8 bit or 10 bit stream
  54   : FPGA output valid - LVDS serializer clocked
  57   : LSB of each channel at the TAB end of the cable
  64   : MSB of each channel at the TAB end of the cable

That is ~8 BC or 1056 ns with the 0 reference I chose. We should NOT forget to add to that number the delay in the analog chain and cables, and also part of the latency introduced by the need to delay the BC clock to achieve proper synchronisation.

Keeping only the vital functions in the ADF, one might be able to bring the latency down to ~6 BCs. With a significant increase of cost (50 k$), ~5 BCs could be achieved; going below that would require suppressing the digital filter or making major changes. In any case, the potential gain in latency must be balanced against the reduction of performance/functionality.
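For those who want to replay the budget, here is a minimal sketch that converts the milestones of the summary table from 1/8 BC units to ns and prints the time spent between consecutive milestones. The milestone values are copied from the table; the names `MILESTONES`, `to_ns` and `stage_table` are illustrative only, and the 16.5 ns unit is simply 132 ns / 8.

```python
BC_NS = 132.0        # beam-crossing period in ns, as quoted in the text
TICK_NS = BC_NS / 8  # one 1/8 BC unit, i.e. 16.5 ns

# (time in 1/8 BC units, comment), copied from the summary table
MILESTONES = [
    (0.0, "peak of the analog signal at ADC input"),
    (10.0, "ADC value valid"),
    (10.5, "ADC pins sensed"),
    (15.5, "digital input decimated"),
    (41.5, "peak at convolver output"),
    (50.5, "peak detection and decimation done"),
    (51.5, "Et LUT done"),
    (52.5, "serial stream selected"),
    (52.5, "delay adjustment (set to minimum here)"),
    (53.5, "parity calculation / 8 bit or 10 bit stream selection"),
    (54.0, "FPGA output valid, LVDS serializer clocked"),
    (57.0, "LSB of each channel at the TAB end of the cable"),
    (64.0, "MSB of each channel at the TAB end of the cable"),
]


def to_ns(ticks: float) -> float:
    """Convert a time expressed in 1/8 BC units to ns."""
    return ticks * TICK_NS


def stage_table(milestones):
    """Return (comment, absolute time in ns, time since previous milestone in ns)."""
    rows = []
    previous = 0.0
    for ticks, comment in milestones:
        rows.append((comment, to_ns(ticks), to_ns(ticks - previous)))
        previous = ticks
    return rows


if __name__ == "__main__":
    rows = stage_table(MILESTONES)
    for comment, at_ns, delta_ns in rows:
        print(f"{at_ns:7.1f} ns (+{delta_ns:6.1f} ns) : {comment}")
    total_ns = rows[-1][1]
    print(f"total: {total_ns:.0f} ns = {total_ns / BC_NS:.1f} BC")
```

The total only covers the chain from the ADC input onwards; the analog chain, the cables and the BC clock delay discussed in sections 0 and 1 have to be added separately.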
----- Original Message -----
From: "Hal Evans"
To: "Denis Calvet"
Cc: "Ken Johns"; "Maris Abolins"; "Patrick LeDu"
Sent: Monday, June 24, 2002 3:33 PM
Subject: latency

> Hello Denis,
>
> Ken Johns and I have been discussing the latency in transmission of
> L1Cal signals to the Cal-Trk match system. It looks like the estimates of
> this latency that I have been presenting so far are too small (see Ken's
> mail below). We're probably over budget by at least 500 ns and maybe as
> much as 1000 ns. So, clearly, we need to do something.
>
> Some reduction in latency in the TABs may be possible, but it will
> certainly not be enough to cover the entire problem. Are there any
> possibilities on the ADF side? I seem to remember that the digital
> filtering algorithm you're using is the one with the largest latency. Is
> this true, and if so how much will we lose by dropping back to a less
> performant algorithm?
>
> We're definitely going to have to think hard about this issue or the
> Cal-Trk system will be in real danger.
>
> One other thing. We have to submit an updated TDR for L1Cal by the end of
> July. Do you think that you will have time to make changes to the ADF part
> that you wrote for the last version before you leave for vacation? If not,
> do you know of anyone at Saclay who could take this up while you're away.
>
> Thanks very much for your help.
>
> Regards - Hal
>
> --
> ^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v
> | Hal Evans                                           |
> | evans@nevis1.columbia.edu                           |
> | Physics Dept.                Nevis Labs             |
> | Columbia University          PO Box 137             |
> | 538 W 120th Mailcode 5215    136 S Broadway         |
> | New York, NY 10027           Irvington, NY 10533    |
> | Tel: (212)854-3334           (914)591-2815          |
> | Fax: (212)854-3379           (914)591-8120          |
> v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^
>
> ---------- Forwarded message ----------
> Date: Sat, 22 Jun 2002 12:48:02 -0700 (MST)
> From: Ken Johns
> To: Hal Evans
> Cc: darien wood, rob mccroskey
> Subject: Re: re cal-track latency
>
> hi,
> i will try to put together another spreadsheet tomorrow or
> monday. but a good estimate is that for the "as-is" design
> (mtxxx+mtcm+mtm) and for the specified tf decision time
> (which we are still not at yet, msu folks) the inputs must arrive
> to l1mu around 1250 ns after bc (not counting the deserial time
> at our end). from your spreadsheet this number is 2245, so the
> problem is big: on the order of 1000ns.
>
> now what can be done. of course we dont have to send to the
> tf from the collision hall now so we get back 250 ns (a little less
> since there is still some cable needed). if we push the tf latency time
> back one more big clock tick (present run 2a plan? to accommodate l1cft
> being late) this gains us 132ns. if we go to the "sans mtcm cal-trk"
> design we can save perhaps 250 ns. this means we are still
> 1000-250-132-250=370 ns over budget with few if any knobs left to turn.
> maybe?? there is 132 in the trigger logic but i havent looked at this yet.
> but ~400ns is huge.
>
> there may be one more clock tick on the muon front ends so that
> we could go 2 big ticks beyond the spec but i have to talk with boris
> about this. i know he is reluctant to go to the last buffer. possibly
> the muon front-ends could change their timing to buy more buffer space
> but this would involve fair effort on their part to run at different
> clock speeds than present and the loss of resolution needs to be
> considered.
>
> so that is why it extremely important that the adf guys give
> the cal-track trigger some consideration in their design if possible. and
> given our experience with the l1cft we probably need a cushion of order
> 100ns so our total is more like 500 than 400ns.
>
> will send a spreadsheet merging your numbers and mine
> tomorrow/monday. ps, it might be useful if could you put your
> l1cal link on the d0 run2b web page sometime. i will be out to
> fermi again new thu-fri so we can butt heads then. thanks.
>
> have fun..kj
>