Message-ID: <004a01c21c60$27614740$1f25a684@saclay.cea.fr>
From: "Denis Calvet" <calvet@hep.saclay.cea.fr>
To: "Hal Evans" <evans@nevis1.nevis.columbia.edu>
Cc: "Ken Johns" <johns@fnal.gov>, "Maris Abolins" <abolins@pa.msu.edu>,
   "Patrick LeDu" <ledu@hep.saclay.cea.fr>, <eperez@hep.saclay.cea.fr>,
   <mandjavi@hep.saclay.cea.fr>, <mur@hep.saclay.cea.fr>
References: <Pine.SGI.4.44.0206240923130.4588338-100000@nevis1.nevis.columbia.edu>
Subject: Re: latency
Date: Tue, 25 Jun 2002 17:51:22 +0200

Dear Hal,

I agree that the latency issue is crucial, but I see
little margin for improvement on the ADF side unless the
cost is allowed to rise substantially. Here is a
detailed description of the ADF chain, how much latency is
consumed at each stage and how it could be reduced.
I feel that we should be extremely careful not to forget
anything in these latency calculations: forgetting an item
means giving it zero latency when it is non-zero in reality,
and this will cause bad surprises later.
You can distribute this text to whoever might be interested
in it. It is a good topic of discussion for those who will
work while I am on vacation (I'll be away from July 2nd
till July 23rd).

Best Regards,
Denis.


0/ Time reference
First we need to know the origin of time when speaking about
latency. On the ADF side, I count latency from the moment the
ADC chip of a trigger tower sees the peak of the pickoff signal.
So all the figures of latency that I give below are offset by the
amount of time it takes for the real signals to develop in the detector,
travel to the analog summers and the cables and go through the analog
chain placed in front of the analog to digital converter chip.

1/ Analog cables and system synchronisation
The ADF system needs some source of synchronisation: the
idea is to use the 7.57 MHz (132 ns) beam-crossing clock for that
purpose. This clock will be demultiplexed from the Serial
Control Link (SCL), and distributed to all ADF crates and
boards via a fanout tree with matched cable length.
Because there are disparities in the length of the cables
transporting the analog signals from the BLS, a given beam
crossing will produce signals with different peaking times
from tower to tower. If I remember correctly the figure quoted
by Dan, the spread is ~50 ns. For each given beam crossing,
it is highly desirable that the signal produced by the "earliest"
tower and that produced by the "latest" tower are guaranteed
to occur during the same period of the synchronization signal
i.e. the BC clock transported by the SCL. The convention so far
is to say that a beam crossing "starts" on the rising edge of
the clock signal and lasts until the next rising edge of that
clock. For correct synchronisation, the signals derived from the
SCL will go through a programmable delay line adjusted so that:
- the earliest tower produces its peak after the rising edge
of the BC clock
- the latest tower produces its peak before the next rising
edge of the BC clock
Because the spread of cable delay (50 ns) is less than the BC
period (132 ns), there is no ambiguity in assigning each trigger
tower signal to the correct beam crossing. There are several options
for the value of the delay of the synchronization signal:
- It can be tuned so that "average delay" towers will produce their
peak around the falling edge of the BC clock. This provides the
best immunity against assignment of energy to the wrong beam
crossing caused by time jitter in the peaking time of the analog
signal. The delay introduced in this case is 0 at best and
132/2 + 50/2 = 91 ns at worst.
- Or it can be tuned with 0 (or little) tolerance on peaking time
jitter. In this case, the delay is 0 at best and 50 ns at worst.

I haven't figured out yet the method for setting the delay correctly
in the final system, but it is clear that we are forced to take the
worst case. Because I do not have any measurement of jitter
for the various signals, I would recommend the most conservative
tuning, i.e. 91 ns.
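
As a cross-check of the two worst-case figures above, here is a minimal
Python sketch of the arithmetic. It uses only the 132 ns period and the
~50 ns spread quoted above; the function names are illustrative, and the
margin figure is simply what falls out of centering the spread in the
period, not a measured tolerance.

```python
BC_PERIOD_NS = 132.0     # beam-crossing period quoted above
CABLE_SPREAD_NS = 50.0   # tower-to-tower spread in peaking time (~50 ns)


def worst_case_sync_delay(bc_period_ns, spread_ns, centered):
    """Worst-case delay added by the programmable delay line.

    centered=True : "average" towers peak near the falling edge (mid-period).
    centered=False: no margin is kept for peaking-time jitter.
    """
    if centered:
        return bc_period_ns / 2 + spread_ns / 2
    return spread_ns


def jitter_margin(bc_period_ns, spread_ns):
    """Margin left on each side of the period when the spread is centered."""
    return (bc_period_ns - spread_ns) / 2


print(worst_case_sync_delay(BC_PERIOD_NS, CABLE_SPREAD_NS, True))   # 91.0
print(worst_case_sync_delay(BC_PERIOD_NS, CABLE_SPREAD_NS, False))  # 50.0
print(jitter_margin(BC_PERIOD_NS, CABLE_SPREAD_NS))                 # 41.0
```

With the centered tuning, the earliest and latest towers each sit 41 ns
away from a rising edge, which is the jitter tolerance bought by the
extra 41 ns of worst-case delay.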

2/ ADC Chip
There are at least 2 types of ADC suited to our application: flash ADCs
and pipelined ADCs. An N-bit flash ADC consists of an array of 2**N - 1
comparators followed by a priority encoder. These devices are extremely
fast and have very short latency, but are limited to 8-bit precision
because integrating more than 255 comparators with precise resistor
trimming is challenging and expensive. For conversion rates of several
tens of MHz, pipelined ADCs offer a much cheaper alternative. These
devices have a high conversion rate and good precision (e.g. 12 bits), but
exhibit a latency proportional to the sampling period (typically
5 to 7 periods). Latency is not the most stringent requirement
for the applications targeted by these devices (video, medical imaging),
where the primary concerns are conversion rate, precision, low power
consumption and low cost. From what I saw at TI, Datel, Maxim and Analog
Devices, 8-bit flash ADCs in the 10-100 MHz range are becoming
uncommon, if not obsolete, and are being replaced by pipelined ADCs
with 8-12 bit precision. Datel has some true-flash
ADCs, e.g. ADC 30720, 8 bit, 20 MHz, 25 ns latency (I called but could
not get a price, however). As far as the ADF is concerned, there are
a number of benefits in digitizing signals with 10 bits instead of
8: a simpler analog chain before the ADC, a "real" 8-bit precision...
So the idea is to use 10-bit pipelined ADCs and over-sample
in order to cut latency. Sampling at BCx4 (30 MHz)
seems a good compromise between cost and power consumption. The
choice is the AD 9218: dual, 10 bit, 40 MHz, 350 mW, 165 ns latency
@30 MHz (10 $ apiece in 1k quantities, 15 $ for single units). A faster
ADC would allow a shorter latency; the same model exists in 80 MHz
(550 mW, 14 $ per 1k) and 105 MHz (17 $ per 1k) versions. Sampling at
BCx10 would cost ~5 k$ in total and the ADC latency would be 66 ns,
i.e. a latency reduction of ~100 ns. However, sampling at this speed
may force us to use faster logic and bigger power supplies, so the
additional cost is probably underestimated.
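
The two ADC latency figures above can be reproduced with a short Python
sketch. It assumes a pipeline depth of 5 sampling periods, the low end
of the "typically 5 to 7" range, chosen here because it matches both
quoted numbers; the function name is illustrative, and the datasheet
remains the reference for the real pipeline delay.

```python
BC_PERIOD_NS = 132.0   # beam-crossing period
PIPELINE_PERIODS = 5   # assumption: pipeline latency in sampling periods


def adc_latency_ns(oversampling, pipeline_periods=PIPELINE_PERIODS):
    """Latency of a pipelined ADC sampled at BC x oversampling."""
    sampling_period_ns = BC_PERIOD_NS / oversampling
    return pipeline_periods * sampling_period_ns


for factor in (4, 10):
    print(f"BCx{factor}: {adc_latency_ns(factor):.0f} ns")

gain_ns = adc_latency_ns(4) - adc_latency_ns(10)
print(f"gain from BCx4 to BCx10: {gain_ns:.0f} ns")
```

The sketch only captures the scaling with sampling period; it says
nothing about the extra cost of faster logic and power supplies.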

3/ Digital samples decimation and phase adjustment
If the ADCs do oversample, a decimation stage is needed to select
which samples to process among all that are converted. By the proper
selection of samples on a per tower basis, coarse delay compensation
can be made. In my current design, sampling is done at BCx4 while the
sampling rate for digital processing is BCx2. All logic is run at
BCx8 (61 MHz) to cut latency. By clocking each ADC on the rising
or falling edge of a BCx4 clock, 4 values of delay compensation can
be obtained. Because I did not want logic whose latency
depends on the input decimator parameters, my design re-synchronizes
after decimation. The latency introduced by this stage is the largest
value of the delay compensation parameter (1/2 BC, i.e. 66 ns) plus one
clock period of the logic (BCx8, i.e. 16 ns); that is 82 ns in total.
Without the capability to make any phase correction, the sample
decimation stage could be done in 16 ns, a gain of 66 ns compared to
the current design. The question is therefore whether the delay
compensation is needed or not. I have no clear answer: it is
equally possible that a proper tuning of the digital filter parameters
can do the job, or that the phase adjustment is a real plus.
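
One possible reading of the constant-latency re-synchronization, as a
rough Python sketch. Two things are assumptions made for illustration
and are not spelled out above: that the four delay settings are 0 to 3
ticks of the BCx8 clock, and that every channel is padded up to the
1/2 BC boundary (4 ticks) before the one-tick logic stage. Times are
counted in exact BCx8 ticks of 16.5 ns, which the text rounds to 16 ns.

```python
TICK_NS = 132.0 / 8         # one period of the BCx8 logic clock (16.5 ns)
RESYNC_BOUNDARY_TICKS = 4   # 1/2 BC, the largest delay compensation quoted
LOGIC_TICKS = 1             # one BCx8 clock period for the decimation logic


def decimator_ticks(delay_setting):
    """Return (compensation, padding, total) in BCx8 ticks for one tower.

    delay_setting is assumed to be 0..3 (clock edge x sample phase).
    """
    if delay_setting not in range(4):
        raise ValueError("delay_setting must be 0, 1, 2 or 3")
    padding = RESYNC_BOUNDARY_TICKS - delay_setting
    total = delay_setting + padding + LOGIC_TICKS
    return delay_setting, padding, total


for setting in range(4):
    _, padding, total = decimator_ticks(setting)
    print(f"setting {setting}: padding {padding} ticks, "
          f"total {total} ticks = {total * TICK_NS} ns")
```

Whatever the setting, the stage costs the same 5 ticks, which is the
price paid for a latency that does not depend on the decimator
parameters.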

4/ Input to the digital filter
In my present design, the digital filter can take its input from
the ADC chip or from a memory that stores pre-loaded values. A
synchronous multiplexer stage is used at that level; the latency
added is 16 ns. If the test mode is suppressed, the corresponding
delay can be suppressed.

5/ Digital filter
At this stage, one must distinguish the irreducible latency of an
algorithm from its effective latency. The only algorithm that has given
me decent results so far is a matched filter followed by a peak
detector. A matched filter has an irreducible latency that depends
on the temporal window extracted from the original signal and used
to determine the impulse response of the filter. Hence, the choice
of coefficients changes the intrinsic latency of the filter.
I found that a filter with 5 to 8 taps is OK; I am less confident
with 4 taps, but I have not explored all the parameter space. So
I dimensioned the logic to go up to 8 taps. As the sampling rate
for the filter is BCx2, the impulse response of the filter spans
4 BC periods. On which tap does one put the largest coefficient? I.e.
how much of the rising/flat-top/falling edge of the signal do we
put in the impulse response? Again, I have not explored all options,
so I put the largest coefficient in the "middle" of the temporal
impulse response. This means that the output of the filter will
peak ~2 BCs after the peak of its input, no matter how fast
computations are made. Changing the position of the peak is possible
and can lead to a longer or shorter delay, but the latency/performance
trade-off needs to be assessed.
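
To illustrate the intrinsic latency, here is a minimal Python sketch,
not the ADF firmware. The pulse shape is invented for the example
(8 samples at BCx2, peak on the 4th sample) and the helper names are
hypothetical. The coefficients are the time-reversed pulse, so the
largest one lands on tap 4, i.e. in the "middle" of the 8 taps.

```python
def fir(samples, coeffs):
    """Causal FIR filter: y[n] = sum over k of coeffs[k] * x[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out


def peak_index(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical pulse window sampled at BCx2; not a measured shape.
pulse = [0.1, 0.4, 0.8, 1.0, 0.7, 0.3, 0.1, 0.0]
coeffs = pulse[::-1]          # matched filter = time-reversed pulse

signal = [0.0] * 4 + pulse + [0.0] * 12
filtered = fir(signal, coeffs)

delay_samples = peak_index(filtered) - peak_index(signal)
print(peak_index(coeffs))     # tap holding the largest coefficient: 4
print(delay_samples)          # output peak lags input peak by 4 samples
print(delay_samples / 2)      # 2.0 BC, since the filter runs at BCx2
```

In this sketch the lag equals the index of the largest tap, so moving
the pulse peak later in the extraction window (largest coefficient on
an earlier tap) shortens the delay, at the price of including less of
the falling edge in the impulse response.
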
Being optimistic, we can imagine gaining 1 BC of latency at that
level. The intrinsic latency is only a theoretical consideration;
when implemented in logic, computations take time. An 8-tap
filter running at BCx2 (15 MHz) requires 120 million
multiply-accumulate (MAC) operations per second. I could not achieve
this speed in the highest speed grade Virtex II (these include
hardwired multipliers). The option I took is to make operations
in parallel and use 2 multipliers per filter, each running at 61 MHz.
The latency is 10 clock periods, i.e. 160 ns (8 Multiply + Accumulate
+ sum of the two partial convolutions). The only possible increase in
speed at that level would be to use the next generation of FPGAs
(e.g. Altera Stratix), which are probably not yet on the market, or
a higher level of parallelism. Using 8 multipliers per filter,
one could imagine doing the convolution in 3 clock periods, i.e.
48 ns, a gain of 112 ns compared to the current design.
Adding the intrinsic latency to the computation time, we
have 292-424 ns latency for the current design and 180-312 ns
for the 8-multiplier-per-filter option. The additional cost
is four 500K-gate FPGAs per ADF card, i.e. 4 x 80 x 130 = 41.6 k$.
Again, this excludes the potential need for more expensive power
supplies.
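
The figures in this section can be re-derived with a few lines of
Python. The sketch assumes that the intrinsic latency ranges from 1 BC
(the optimistic case) to 2 BC (largest coefficient in the middle), and
it reads the cost product as 4 FPGAs x 80 cards x 130 $ per FPGA; both
readings are interpretations, and the names are illustrative.

```python
BC_NS = 132    # beam-crossing period
TICK_NS = 16   # BCx8 clock period, rounded down from 16.5 ns as above


def mac_rate_millions(taps, samples_per_bc):
    """Multiply-accumulate operations per second, in millions."""
    sample_rate_mhz = samples_per_bc * 1e3 / BC_NS
    return taps * sample_rate_mhz


def latency_range_ns(compute_ticks, intrinsic_bc=(1, 2)):
    """(best, worst) filter latency: intrinsic part plus computation."""
    compute_ns = compute_ticks * TICK_NS
    return tuple(n * BC_NS + compute_ns for n in intrinsic_bc)


print(round(mac_rate_millions(8, 2)))   # 121, quoted as 120 million
print(latency_range_ns(10))             # (292, 424): 2 multipliers
print(latency_range_ns(3))              # (180, 312): 8 multipliers

FPGAS_PER_CARD, N_CARDS, FPGA_COST = 4, 80, 130   # assumed reading
print(FPGAS_PER_CARD * N_CARDS * FPGA_COST)       # 41600
```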

6/ Peak detector and output decimator
The current algorithm is a 3 point peak-detector. The irreducible
latency of this operator is 2 (sampling) periods; i.e. 132 ns in
the present design. The computation itself takes 1 clock period
(61 MHz). Because the digital filter/peak detector is run at BCx2
(partly to cut latency) while only one sample per BC is needed,
a decimation by a factor of 2 is needed. There are two possible choices
for the sample to keep; in order to have a constant latency
independent of that choice, a re-synchronization is needed. It consumes
1 sampling period + 1 (61 MHz) clock period. The capability to
select the correct sample among the 2 computed per BC is mandatory;
the peak detector will fail if this choice is not provided.
In total the peak detector + decimation operator takes 164 ns, with
no perspective of significant gain at this level.
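
A minimal Python sketch of a 3-point peak detector followed by the
decimation by 2 shows why the choice of sample matters. The comparison
rule (strictly greater than the previous sample, greater than or equal
to the next one) and the input values are assumptions made for the
example; the actual firmware rule may differ.

```python
def peak_detect_3pt(samples):
    """Output the middle sample of each 3-sample window if it is a peak.

    The result for the window (n-2, n-1, n) is written at index n;
    every other output is 0.
    """
    out = [0] * len(samples)
    for n in range(2, len(samples)):
        before, middle, after = samples[n - 2], samples[n - 1], samples[n]
        if before < middle >= after:
            out[n] = middle
    return out


def decimate_by_2(samples, phase):
    """Keep one sample out of two, starting at index `phase` (0 or 1)."""
    if phase not in (0, 1):
        raise ValueError("phase must be 0 or 1")
    return samples[phase::2]


# Hypothetical filter output at BCx2 (two samples per beam crossing).
filtered = [0, 1, 3, 7, 10, 8, 4, 1, 0, 0]
peaks = peak_detect_3pt(filtered)

print(peaks)                     # [0, 0, 0, 0, 0, 10, 0, 0, 0, 0]
print(decimate_by_2(peaks, 1))   # [0, 0, 10, 0, 0]: peak kept
print(decimate_by_2(peaks, 0))   # [0, 0, 0, 0, 0]: peak lost
```

With the wrong phase the only non-zero sample is thrown away, which is
the failure mode described above.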

7/ Convolution scaling, Et calibration look-up table, saturation and
clipping
The peak detector is followed by a stage that performs an arithmetic
shift of the peak before driving a look-up table, which performs
the final conversion to transverse energy and implements saturation
and clipping of the lowest energy signals. The look-up table could be
suppressed in principle, though the result scaling is mandatory.
If the LUT is suppressed, the computation of filter coefficients
would need to include the energy to transverse energy scaling.
Saturation would be undetected and no clipping would be done.
I see no possible latency gain at that level; only savings in logic and
blocks of RAM.
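
The scaling, conversion, clipping and saturation steps can be sketched
in Python as follows. Every constant here (shift, LUT size, threshold,
scale factor) is a placeholder invented for the example; only the 8-bit
output width comes from the design.

```python
SHIFT = 2         # hypothetical right shift applied to the peak value
LUT_BITS = 10     # hypothetical LUT address width
THRESHOLD = 4     # hypothetical: addresses below this are clipped to 0
ET_SCALE = 0.5    # hypothetical energy to transverse energy factor
ET_MAX = 255      # 8-bit Et output


def build_et_lut(scale=ET_SCALE, threshold=THRESHOLD, addr_bits=LUT_BITS):
    """Precompute Et for every LUT address, with clipping and saturation."""
    lut = []
    for addr in range(1 << addr_bits):
        if addr < threshold:
            lut.append(0)                                 # clip small signals
        else:
            lut.append(min(ET_MAX, int(addr * scale)))    # saturate at 8 bits
    return lut


def to_et(peak, lut, shift=SHIFT):
    """Scale the peak by an arithmetic shift, then look up its Et."""
    addr = min(peak >> shift, len(lut) - 1)   # keep the address in range
    return lut[addr]


lut = build_et_lut()
print(to_et(8, lut))      # 0   (below threshold, clipped)
print(to_et(400, lut))    # 50  (normal conversion)
print(to_et(4000, lut))   # 255 (saturated)
```

Because the table is filled once per tower, the per-sample cost is a
single memory read, whatever the conversion function.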

8/ Choice of stream to serialize
In the current design, the output to be sent to the TAB can be
selected from one of the 3 following sources:
- A pseudo-random generator in order to test the ADF to TAB link
independently of the digital filter
- A register that contains a programmable value, in order to
turn off a channel or perform tests
- The output of the Et Look Up Table for normal operation
This stage takes 1 (61 MHz) clock period. Reducing latency
at that level would mean suppressing the previous functionality.
One needs anyway 1 clock tick to load the parallel-load serial-out
shift register of each channel.

9/ Delay Equalization
One must be absolutely sure that the outputs of all digital filters
correspond to the same beam crossing under all circumstances. The delay
compensation circuitry included in the input and output decimators
plays some role in achieving that, but this is not sufficient because
the intrinsic latency of each filter can vary, depending on the choice
of its coefficients. It is likely that the determination of the coefficients
of each filter will need to include a "collective" constraint to
make sure that the difference in latency between any 2 channels
is zero. If this cannot be achieved, delay compensation circuitry
must be included. The current design incorporates for each channel
a shift register of programmable length (1-33 bit) clocked at BCx8
(61 MHz) corresponding to a delay of 1 (61 MHz) clock tick plus 0
to 4 BC units. The 2 MSBs of shift register length are programmable
on a per channel basis. A normal value would be a delay of 1 BC; so
that a "late" channel can be given a "negative" delay to bring it
in phase with the other ones. The 3 LSBs of the shift register length
are also programmable, but are common to all channels within a chip.
This facility can be used to adjust latency by 1/8 to 7/8 of a BC
unit if we find it useful. It is not yet clear to me whether
this whole delay compensation circuitry will be mandatory or not;
so I have included it in the design, but set the length of the
shift register to its minimum, i.e. 1 tap. If we remove it, we
could gain 16 ns of latency; but if it appears that it is needed,
then it will add 132 ns of latency.

10/ Serializer source and checksum calculation
The ADF will output 8 bit Et calibrated samples most of the time
except following a L1 accept where the raw ADC chip samples (10 bit)
will be sent. A pipeline stage is needed at that level to switch
between the 8 bit serial stream from the digital filter and the
10 bit stream from the memory that stores the raw ADC values.
To detect transmission errors, my current design calculates a parity
bit; computation takes 1 (61 MHz) clock period. The latency for
this stage is 32 ns.
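
As an illustration of the error check, here is a minimal Python sketch
of a parity bit computed over one 8-bit sample. Even parity is assumed
(the polarity is not stated above) and the function names are
illustrative.

```python
def parity_bit(word, width=8):
    """Even parity: 1 when the number of set bits in `word` is odd."""
    word &= (1 << width) - 1
    parity = 0
    while word:
        parity ^= word & 1
        word >>= 1
    return parity


def parity_ok(word, received_parity, width=8):
    """True if the received parity bit matches the received word."""
    return parity_bit(word, width) == received_parity


sample = 0b10110010                 # four bits set
sent_parity = parity_bit(sample)
print(sent_parity)                  # 0

corrupted = sample ^ 0b00000100     # one bit flipped in transmission
print(parity_ok(sample, sent_parity))      # True
print(parity_ok(corrupted, sent_parity))   # False
```

A single parity bit catches any odd number of flipped bits per word
but misses an even number, which is the usual limit of this scheme.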

11/ FPGA to Channel link serializer and cable
At present, my VHDL simulation does not go beyond the FPGA.
The datasheet for the Channel Link has some latency figures
in it. For the transmitter, we have 1.5 x 16 + 6 = 30 ns.
For the cable, 4 ns/m is a good number, so 12 ns for the ADF to
TAB cable seems realistic. The total at that level is 42 ns.


To summarize, I have in total for a typical channel on
the current ADF design:

Time (in 1/8 BC units) : Comment
0             : peak of the analog signal at ADC input
10            : ADC value valid
10.5          : ADC pins sensed
15.5          : digital input decimated
41.5          : peak at convolver output
50.5          : peak detection and decimation done
51.5          : Et LUT done
52.5          : serial stream selected
52.5          : delay adjustment (set to minimum here)
53.5          : parity calculation / selection of 8 bit or 10 bit stream
54            : FPGA output valid - LVDS serializer clocked
57            : LSB of each channel at the TAB end of the cable
64            : MSB of each channel at the TAB end of the cable

That is ~8 BC or 1056 ns with the 0 reference I chose. We
should NOT forget to add to that number the delay in the analog
chain and cables, and also part of the latency introduced by the
need to delay the BC clock to achieve proper synchronization.
Keeping only the vital functions in the ADF, one might be
able to bring the latency down to ~6 BCs. With a significant
increase in cost (50 k$), ~5 BCs could be achieved;
going below that would require suppressing the digital
filter or making major changes. In any case, the potential
gain in latency must be balanced against the reduction in
performance/functionality.
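
For convenience, the timeline above can be converted from 1/8 BC units
to nanoseconds with a short Python sketch. The milestone values are
copied from the table (labels shortened); the only other input is the
132 ns BC period, and the names are illustrative.

```python
TICK_NS = 132.0 / 8   # one 1/8 BC unit = one BCx8 clock period = 16.5 ns

TIMELINE = [
    (0.0, "peak of the analog signal at ADC input"),
    (10.0, "ADC value valid"),
    (10.5, "ADC pins sensed"),
    (15.5, "digital input decimated"),
    (41.5, "peak at convolver output"),
    (50.5, "peak detection and decimation done"),
    (51.5, "Et LUT done"),
    (52.5, "serial stream selected"),
    (52.5, "delay adjustment (set to minimum here)"),
    (53.5, "parity calculation / stream selection"),
    (54.0, "FPGA output valid"),
    (57.0, "LSB at the TAB end of the cable"),
    (64.0, "MSB at the TAB end of the cable"),
]


def to_ns(units):
    """Convert a time in 1/8 BC units to nanoseconds."""
    return units * TICK_NS


previous = 0.0
for units, label in TIMELINE:
    step_ns = to_ns(units - previous)
    print(f"{to_ns(units):7.1f} ns  (+{step_ns:5.1f})  {label}")
    previous = units

total_ns = to_ns(TIMELINE[-1][0])
total_bc = TIMELINE[-1][0] / 8
print(total_ns, total_bc)   # 1056.0 8.0
```

The per-step column makes it easy to see that the convolver (26 units,
429 ns) dominates the budget.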


----- Original Message -----
From: "Hal Evans" <evans@nevis1.nevis.columbia.edu>
To: "Denis Calvet" <calvet@hep.saclay.cea.fr>
Cc: "Ken Johns" <johns@fnal.gov>; "Maris Abolins" <abolins@pa.msu.edu>;
"Patrick LeDu" <ledu@hep.saclay.cea.fr>
Sent: Monday, June 24, 2002 3:33 PM
Subject: latency


> Hello Denis,
>
> Ken Johns and I have been discussing the latency in transmission of
> L1Cal signals to the Cal-Trk match system. It looks like the estimates of
> this latency that I have been presenting so far are too small (see Ken's
> mail below). We're probably over budget by at least 500 ns and maybe as
> much as 1000 ns. So, clearly, we need to do something.
>
> Some reduction in latency in the TABs may be possible, but it will
> certainly not be enough to cover the entire problem. Are there any
> possibilities on the ADF side? I seem to remember that the digital
> filtering algorithm you're using is the one with the largest latency. Is
> this true, and if so how much will we lose by dropping back to a less
> performant algorithm?
>
> We're definitely going to have to think hard about this issue or the
> Cal-Trk system will be in real danger.
>
> One other thing. We have to submit an updated TDR for L1Cal by the end of
> July. Do you think that you will have time to make changes to the ADF part
> that you wrote for the last version before you leave for vacation? If not,
> do you know of anyone at Saclay who could take this up while you're away.
>
> Thanks very much for your help.
>
> Regards  -  Hal
>
> --
> ^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v
> |                    Hal Evans                       |
> |              evans@nevis1.columbia.edu             |
> | Physics Dept.                Nevis Labs            |
> | Columbia University          PO Box 137            |
> | 538 W 120th  Mailcode 5215   136 S Broadway        |
> | New York, NY 10027           Irvington, NY 10533   |
> | Tel: (212)854-3334           (914)591-2815         |
> | Fax: (212)854-3379           (914)591-8120         |
> v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^
>
> ---------- Forwarded message ----------
> Date: Sat, 22 Jun 2002 12:48:02 -0700 (MST)
> From: Ken Johns <johns@hep0.physics.arizona.edu>
> To: Hal Evans <evans@nevis1.nevis.columbia.edu>
> Cc: darien wood <darien@fnal.gov>,
>     rob mccroskey <robmcc@physics.arizona.edu>
> Subject: Re: re cal-track latency
>
> hi,
> i will try to put together another spreadsheet tomorrow or
> monday.  but a good estimate is that for the "as-is" design
> (mtxxx+mtcm+mtm) and for the specified tf decision time
> (which we are still not at yet, msu folks) the inputs must arrive
> to l1mu around 1250 ns after bc (not counting the deserial time
> at our end).  from your spreadsheet this number is 2245, so the
> problem is big: on the order of 1000ns.
>
> now what can be done.  of course we dont have to send to the
> tf from the collision hall now so we get back 250 ns (a little less
> since there is still some cable needed).  if we push the tf latency time
> back one more big clock tick (present run 2a plan? to accommodate l1cft
> being late) this gains us 132ns.  if we go to the "sans mtcm cal-trk"
> design we can save perhaps 250 ns.  this means we are still
> 1000-250-132-250=370 ns over budget with few if any knobs left to turn.
> maybe?? there is 132 in the trigger logic but i havent looked at this yet.
> but ~400ns is huge.
>
> there may be one more clock tick on the muon front ends so that
> we could go 2 big ticks beyond the spec but i have to talk with boris
> about this.  i know he is reluctant to go to the last buffer.  possibly
> the muon front-ends could change their timing to buy more buffer space
> but this would involve fair effort on their part to run at different
> clock speeds than present and the loss of resolution needs to be
> considered.
>
> so that is why it is extremely important that the adf guys give
> the cal-track trigger some consideration in their design if possible.  and
> given our experience with the l1cft we probably need a cushion of order
> 100ns so our total is more like 500 than 400ns.
>
> will send a spreadsheet merging your numbers and mine
> tomorrow/monday.  ps, it might be useful if you could put your
> l1cal link on the d0 run2b web page sometime.  i will be out to
> fermi again next thu-fri so we can butt heads then.  thanks.
>
> have fun..kj
>
>