At a recent job interview I was asked about my preferred clocking approach for designing digital systems. Xilinx best practice is to use only a positive-edge, single-edge clocking scheme. In my discussion, however, I made the case for dual-edge clocking, i.e. changing data on the positive edge of the clock and latching that data on the next negative edge.
In this blog post, we do a simple static timing analysis to investigate which strategy is suitable, and when. The overall concept is to ensure that the setup and hold times of the subsequent flip-flops (FFs) in the design are satisfied.
Tsu = Setup Time (min): This is the time between a change of a data bit/bus and the clock edge that latches it. It is the time the signal must be stable before that edge to guarantee correct latching of the data. In effect this is related to the capacitive loading a gate presents on that signal, in that the data bit/bus must be asserted for long enough that the internal nodes of the next FF are able to charge/discharge to the correct value before that value is latched.
Th = Hold Time (min): This is the time a data bit/bus must remain stable between the clock edge and the next time the input data is modified. When latched, the clock passes the input value to the output, but of course this incurs the charge/discharge time of that output node. If we bring the data low prematurely, then the output will also prematurely stop charging/discharging to the new level.
Tck = Clock Period: In a synchronous design, we are always interested in the fastest rate at which a circuit can be reliably clocked. The clock period is obviously the reciprocal of the clock frequency.
Tsus = Setup Time Slack: This is the excess setup time over and above the minimum setup time. Let's say, for example, that the time between the data-changing edge and the next latching edge is 5ns. As the minimum setup time is 0.5ns, we have an excess of 4.5ns. In order to make the circuit both fast and robust, we would design this slack to be: positive (satisfying the minimum setup time), as small as possible (for efficient use of the time within Tck) and with a small design margin of perhaps 10-20% of the setup time.
Ths = Hold Time Slack: As with Tsus, this is the time over and above the minimum required hold time, Th. For example, let's say the time between the latching edge and the next time the data changes is 5ns. As with the Tsus calculation, we then have an excess hold time of 5ns-0.5ns = 4.5ns. Again we want to minimize this time to give fast clocking, it must remain positive so that we do not violate our hold condition, and we must provide a small design margin of 10-20% to cover our worst-case temperature, supply and process corners.
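The slack arithmetic in the two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names slack_ns and meets_margin are hypothetical, and the design margin is read here as a fraction of the minimum setup or hold time, which is an assumption based on the wording above.

```python
def slack_ns(available_ns: float, required_ns: float) -> float:
    """Slack is the time available minus the minimum time required."""
    return available_ns - required_ns


def meets_margin(slack: float, required_ns: float, margin_fraction: float = 0.2) -> bool:
    """True if the slack is at least margin_fraction of the minimum time.

    Assumption: the 10-20% design margin is a fraction of Tsu (or Th).
    """
    return slack >= margin_fraction * required_ns


# Example from the text: 5 ns between the two edges, minimum time of 0.5 ns.
tsus = slack_ns(5.0, 0.5)
print(tsus)                     # 4.5
print(meets_margin(tsus, 0.5))  # True
```

A slack of 4.5ns passes comfortably, which is exactly why the text argues it is larger than it needs to be for an efficient use of Tck.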
Tlp = Longest Path Time: This is the delay on a data line between when a signal is launched and when it is received. A number of factors contribute to this, but to first order it can be assumed to consist of the routing delays and the gate delays of any combinatorial (unclocked) logic. [Other contributions may be the Ohm/sq of different metal layers, different CL loading of different layers or with other nearby lines (power/ground), and the characteristic impedance if treated as a transmission line (structures on the routing layers above and below)]. Obviously, if we place two registers a significant distance apart there will be some delay as the data propagates along the data lines, and if we have a large set of AND/OR/NOR gates, or circuits that need to ripple through a carry term (ripple adder), then the delay becomes an issue.
Latency: This is measured in units of clock cycles. Typically a computation, let's say a single flip-flop latching data, takes 1 clock cycle, leading to a 1 clock cycle latency. However, when we are trying to match two parallel data paths, we may need to equalize the latencies of both operations so that both arrive at a destination at the same time. We are also interested in reducing latency where possible. This becomes evident when we cannot easily pipeline or parallelize a computation and instead must wait for its answer. We may find that the total latency of a string of computations is too long for some other process we wish to perform.
Basic Equation #1:
Single-Edge Clocking (Pos-Pos)
So why might Xilinx and other IC design houses prefer single-edge clocking? To answer this we must think of the most efficient clock distribution available. This tends to be a clock H-tree distributing a low-skew, timing-balanced, buffered clock over a wide area. In FPGAs this single clock can use a dedicated clock route, where all logic cells receive a clock from their nearest dedicated clock line. As the clock path is split into multiple segments, the RC load per segment is low (allowing fast designs). Likewise, with a small segment RC load, the required toggling dV/dt can be provided by clock buffers of minimum current drive strength (high speed, low latency, small area).
The clock signal to two registers (RegA and RegB) can therefore be almost identical especially if matched length routing is used (i.e. clock buffer is equidistant between both registers).
In Figure #1 (below), we have held the setup and hold times both at 0.5ns (for a Xilinx Spartan 6 -2 grade FPGA, the data sheet, DS162, shows that a fabric logic CLB (configurable logic block) has a setup time of Tsu = 0.47ns and a hold time of Th = 0.39ns). The example clock frequency is 100MHz, leading to a Tck of 10ns. To illustrate the role of path delay in the single-edge clocking scheme, we have swept the routing/path delay from 0ns to 10ns (0% to 100% of Tck). Two vertical lines are presented that give the out-of-bound regions for setup and hold time violation. The vertical ('y') axis is the slack time for both setup and hold. Ideally these would be equal, but for a robust digital design the slack need only be positive and have a suitable design margin to ensure robustness despite temperature, power and process variations.
From Figure #1 (above), it is clear that with single-edge clocking we have a wide window of time in which to achieve timing closure (i.e. satisfy all minimum values, delays and design margins). In this example we can accommodate, without design margins, a path delay between 0.5ns and 9.5ns, with the optimum path delay lying between 4ns and 6ns. It should be noted that where these lines cross, i.e. setup slack = hold slack, and when setup time = hold time, the sampling point is exactly in the center of the data eye, and is thus robust with respect to phase jitter of the clock.
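The behaviour described for Figure #1 can be reproduced with a simple first-order model. The sketch below is an assumption rather than the exact equation used to draw the figure: it takes the setup slack as Tck - Tlp - Tsu and the hold slack as Tlp - Th, and it ignores clock-to-output delay, skew and jitter. The function name single_edge_slacks is hypothetical.

```python
def single_edge_slacks(tlp_ns, tck_ns=10.0, tsu_ns=0.5, th_ns=0.5):
    """Return (setup_slack, hold_slack) in ns for pos-pos clocking.

    Assumed first-order model: data launched at t = 0 arrives at Tlp and is
    captured on the next positive edge at t = Tck.
    """
    setup_slack = tck_ns - tlp_ns - tsu_ns
    hold_slack = tlp_ns - th_ns
    return setup_slack, hold_slack


# Sweep the path delay from 0 ns to 10 ns in 0.5 ns steps.
for step in range(21):
    tlp = step * 0.5
    setup, hold = single_edge_slacks(tlp)
    status = "ok" if setup >= 0 and hold >= 0 else "violation"
    print(f"Tlp={tlp:4.1f} ns  setup={setup:5.1f} ns  hold={hold:5.1f} ns  {status}")
```

Under this model both slacks are non-negative for path delays from 0.5ns to 9.5ns, and they are equal (4.5ns each) at a path delay of 5ns, which matches the numbers quoted from the figure.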
This has the advantage that we can be flexible with:
However the disadvantage comes when we consider either system latency or very short (very close layout) path delays. Single-edge clocking therefore has:
A New Scenario:
Let us take an example scenario that may skew our opinion as to why we would not use the preferred single-edge clocking strategy.
Let us say for the sake of argument, that we have two registers RegA and RegB that are physically close on an ASIC, that there is no combinatorial logic between the two, and that through MET1 routing the distance between RegA output node and RegB input node is 1um. Depending on the relationship between the very small routing delay (caused by propagation of 1um) and the setup time of the FFs in a particular CMOS technology, we may be able to satisfy typical minimum setup times, but we may not be able to implement a big enough slack design margin to adequately cover all temperature, power and process variations, and hence would need to relax the clock frequency to ensure all manufactured ICs clock correctly. For example what if the actual setup time was slightly longer than the typical case due to a high temperature, slow process corner and a low power supply?
Let us also say, for the sake of argument, that RegA and RegB are indicative of a path within a wider block that must operate in as few clock cycles as possible, perhaps because it must perform a computation between two external events (hard real time), or before or at the same time as another IC block returns its value. Now, we may not be able to increase the clock frequency of the circuit due to some other clocking issue. It may be external, such as PCB traces, or it could be that the other IC block would also speed up in step with an increased clock frequency. Hence latency could only be corrected by using different clock frequencies (generally a bad idea).
We would therefore like some clocking strategy that is able to satisfy setup, hold and design margins even with very short routing delays, and a strategy that could help reduce the latency of a block to bring it in line with another block. In effect we want some control of design factors, other than the strategy of adding registers into a block to artificially (and at the expense of power and area) increase its latency to be in line with another block.
Dual-Edge Clocking (Pos-Neg):
In order to implement dual edge clocking, we must either:
Distribute two clocks, with positive active edges but with 180 degrees phase offset, or
Implement at-Register clock inversion.
Both of these strategies increase area and power use and make the design more susceptible to clock skew and jitter. Further, an inversion at the site of use of a @negedge will cause a delay between the data-launching clock and the data-receiving clock, even if the majority of the clock is on the dedicated clock lines of the clock tree.
In Figure #2 (below), the same parameters for Tck, Tsu and Th are used, however as can be seen there is now the capability of sampling correctly, when the data path delay is comparable to the setup time. Further the latching on the following clock edge allows the design to pass the same data with a latency of only 0.5 clock cycles.
In Figure #2 (above), timing closure is achieved (discounting clock delay, jitter and skew), even with a data delay of 0ns. As the delay increases we approach the setup time violation whereby the data only just arrives before the negative edge of the clock (a time of Tck/2).
One way in which positive-negative clocking architectures are achieved, particularly in FPGA technologies, is at-site clock inversion for a particular register. In FPGAs the in-CLB register itself will likely have a positive-edge clock sensitivity, necessitating the CLB's LUT to be configured as an inverter. While not ideal, as this forces the clock off the dedicated clock route, the gate delay associated with this inversion in both FPGA and ASIC technologies can act as an advantage (Figure #3), in that it helps offset the data path delay by a unit gate delay (let's assume that this gives a clock delay time, Tcd, of 0.5ns). Let's put the increase in silicon area, power and jitter aside for the moment...
In Figure #3 (above), a clock delay of approximately 0.5ns is included to model a local at-site clock inversion for the second (negative edge) register, RegB. This delay offsets the latching clock, partially counteracting some of the delay on the data line, Tlp. This allows a little more latitude with respect to data path routing delays. The crossing over of the setup slack (blue) and hold slack (red), with a data path delay of 0.5ns shows that the data sampling point is positioned at the center of the eye diagram for the data line.
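As a rough check on Figures #2 and #3, the same kind of first-order model can be written for pos-neg clocking. The expressions below are assumptions inferred from the behaviour described in the text, not stated equations: the capturing edge is taken to be at Tck/2 + Tcd, the data at RegB is assumed to change next at Tck + Tlp, and an ideal 50% duty cycle is assumed. The function name dual_edge_slacks is hypothetical.

```python
def dual_edge_slacks(tlp_ns, tcd_ns=0.0, tck_ns=10.0, tsu_ns=0.5, th_ns=0.5):
    """Return (setup_slack, hold_slack) in ns for pos-neg clocking.

    Assumed model: data launched on the positive edge at t = 0 arrives at Tlp
    and is captured on the negative edge, delayed by the local inverter (Tcd).
    """
    capture_edge = tck_ns / 2 + tcd_ns                      # 50% duty cycle assumed
    setup_slack = capture_edge - tlp_ns - tsu_ns
    hold_slack = (tck_ns + tlp_ns) - capture_edge - th_ns   # next data arrives at Tck + Tlp
    return setup_slack, hold_slack


print(dual_edge_slacks(0.0))              # (4.5, 4.5)  no clock delay, zero path delay
print(dual_edge_slacks(4.5))              # (0.0, 9.0)  setup limit at 45% of Tck
print(dual_edge_slacks(0.5, tcd_ns=0.5))  # (4.5, 4.5)  crossing moves to Tlp = Tcd
```

With Tcd = 0 the two slacks are equal at a path delay of 0ns, and with Tcd = 0.5ns the crossing point moves to a path delay of 0.5ns, which agrees with the description of Figure #3.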
One significant issue with dual edge clock architectures is that the duty cycle of the ideal digital clock MUST be well bounded within the region of a 50% duty cycle. Depending on the setup and hold times and delays on the IC, duty cycles of between 40% and 60% may be permissible, however a duty cycle of 25% would run the risk of violating timing closure for operations within that half of the clock period. For example if the time between the positive and negative edges was (25/100)*10ns = 2.5ns this will only be suitable for a maximum path delay of 2.5ns-0.5ns-0.5ns-0.5ns = 1ns, if we assume the setup slack is 100% of Tsu. In comparison a correct 50% duty cycle would allow maximum path delays of 5ns-0.5ns-0.5ns-0.5ns = 3.5ns.
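The duty-cycle arithmetic above can be parameterized as follows. This is a sketch only: the three 0.5ns deductions are taken as given from the worked example (the text identifies a setup slack equal to 100% of Tsu among them but does not itemize all three), and the function name max_path_delay_ns is hypothetical.

```python
def max_path_delay_ns(duty_cycle, tck_ns=10.0, deductions_ns=(0.5, 0.5, 0.5)):
    """Maximum path delay that fits between the positive and negative edges.

    duty_cycle is the fraction of Tck between the launching (positive) edge
    and the capturing (negative) edge. deductions_ns holds the three 0.5 ns
    terms subtracted in the worked example, taken as given.
    """
    edge_to_edge_ns = duty_cycle * tck_ns
    return edge_to_edge_ns - sum(deductions_ns)


print(max_path_delay_ns(0.25))  # 1.0
print(max_path_delay_ns(0.50))  # 3.5
```

Sweeping duty_cycle between 0.4 and 0.6 with the same deductions shows how quickly the usable path delay shrinks as the clock departs from 50%.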
Latency and Data Throughput:
Above, we discussed that positive to negative (dual-edge) clocking is able to reduce the latency of a register-to-register data flow to 0.5 clock cycles rather than 1.0 clock cycles, and that this may be appropriate if we need to halve the latency of some block to bring it into line with another block.
However, there is another reason why we may want to take advantage of dual-edge clocking. Let us treat the negative edge of the clock, in effect, the same as the positive edge, in that it is simply another active edge within the digital synchronous design. The time between active edges is then Tck/2, e.g. 10ns/2 = 5ns. This can be interpreted as a local doubling of the clock frequency.
With this in mind then, dual-edge clocking can be used to clock a block at an effective rate of 100MHz, but with a clock speed of 50MHz. This can be a significant advantage if we are restricted in terms of clock speeds by some external factor. Likewise, as the clock tree is likely to constitute a significant power draw on an IC (large drivers, on-IC PLL, multiple buffers in the H-tree, long distances and additive RCL loading), it is in our interest to reduce the clock power by reducing the value of P = fCV^2. If this is combined with low-voltage clock distribution, then we have succeeded in clocking the circuit at the desired effective clock rate with the minimum of clock distribution power overheads.
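To make the power argument concrete, here is a minimal sketch of P = fCV^2 for a full-rate and a half-rate clock. The capacitance and supply voltage are hypothetical placeholders, not figures from any process mentioned in this post, so only the ratio between the two results is meaningful.

```python
def clock_power_w(freq_hz, cap_f, vdd_v):
    """Dynamic clock power, P = f * C * V^2, in watts."""
    return freq_hz * cap_f * vdd_v ** 2


C_CLOCK_F = 100e-12  # hypothetical total clock-tree capacitance (100 pF)
VDD_V = 1.2          # hypothetical supply voltage

p_full = clock_power_w(100e6, C_CLOCK_F, VDD_V)  # single-edge, 100 MHz clock
p_half = clock_power_w(50e6, C_CLOCK_F, VDD_V)   # dual-edge, 50 MHz clock

print(f"power ratio (half-rate / full-rate): {p_half / p_full:.2f}")  # 0.50
```

Because P is linear in f, halving the distributed clock frequency halves the clock power for the same effective data rate, before any extra cost of at-site inversion or a second clock line is counted.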
As a direct example, the ST Microelectronics 130nm (90nm metalization) IMG175 CMOS process had standard IO pads rated up to 100Mb/s for reliable digital communications (although their dv/dt was suitable for IO of single-photon avalanche diode (SPAD) toggling and jitter experiments). So when these were used for the clock input to a SPAD communications IC, the IC's system clock and IO clocks were instantly constrained to 100MHz or slower. However, through simulation of the core digital logic with extracted parasitic capacitances, routing resistances and delays, it was clear that much of the digital core could be clocked at 500MHz (typical simulation), giving a likely 250MHz to 280MHz maximum clock rate for the worst-case manufactured IC.
With this in mind, is it not prudent to clock the digital core at a frequency related to its setup, hold and routing delay times rather than at the much lower IO frequency, i.e. to make efficient use of the speed of the CMOS technology? Indeed many commercial ICs do exactly this, where internally computation may be at 2.5 to 3.0GHz, but IO and memory interfaces run at a much slower data rate dictated by the specifics of their IO loading/specifications.
To conclude then, single-edge clocking is an efficient method when we consider setup, hold, and routing times along with design margins and minimum area clock distribution. However it is not the only clocking approach. Overall it allows:
Latency = 1 clock cycle
Computation Per Clock = 1
Setup Slack = Maximum (~40% Tck)
Hold Slack = Maximum (~40% Tck)
Path Delay Coverage = 10% of Tck to 90% Tck
Skew/Jitter = Maximal Control
Duty Cycle = Minimal Impact
Area Required = Minimal
Dual-edge clocking can be efficient when we consider:
Latency = 0.5 clock cycles
Computation Per Clock = 2
Setup Slack = Halved (~20% Tck)
Hold Slack = Halved (~20% Tck)
Path Delay Coverage = 0% of Tck to 45% Tck
Skew/Jitter = Complicated by added at-site inversion
Duty Cycle = High Impact
Area Required = Either extra clock lines, or at-site inversion
Therefore, the choice of positive-only (single-edge) or positive-negative (dual-edge) clocking within an IC is a matter of cost-benefit or sensitivity analysis, influenced by the specifications and the constraints of the project. I would expect that for an FPGA design, use of a single clock and a single clock edge would be preferred (it is), notably because the minimum register-to-register routing delay will likely be greater than the setup time anyway, and local inversion will act to increase clock jitter and resource use.
However, for the SPAD communications IC the situation would be the reverse, where limiting the clock to the IO clock frequency would under-utilize the possible clock speeds of the ASIC logic. Likewise, for a communications IC we wished to minimize the system latency and reduce the clock distribution power usage. It is clear that, for the SPAD IC, a 200MHz clock could neither be supported by the IO pads, nor could the communications energy-per-bit metric tolerate an increase in clock power consumption from 20mW to 40mW.
That is all for now.