Physical and timing constraints are critical to achieving timing closure and robust build-to-build functionality in FPGA-based systems.
As part of my work on the FLITES project, we need to constrain our FPGA and embedded microprocessor design, which combines a variety of highly heterogeneous blocks running across multiple clock domains, while of course keeping a close eye on the pads that interface each block to the real world.
In this blog I want to discuss how to prioritise the physical constraints, using the FLITES firmware design as a case study.
- - - - - - -
Typically a fabric-only FPGA design will synthesise, translate, map and route with no problems if the compiler is left to its own devices. It will try to ensure: a) minimum distances to all pads, b) minimum combinatorial path delay and c) maximum achievable system clock. That is all well and good but sometimes the compiler needs some added help, or we wish to constrain it for the later addition of other blocks.
Problem 1) Let us say we have a block such as an SPI or I2C transceiver. Clearly it has its outputs CSB, SCLK and SIO (in the case of SPI), so logically we should place the core near these pins on the FPGA's pad ring. Let's say, however, that the system clock is not in the same IO region as these SPI outputs, but clear on the other side of the FPGA. This may only be a matter of 1mm, but that can be significant in terms of delay, skew, signal degradation etc.
With no constraints, the compiler will globally optimise for the declared system clock, but some logic will of course want to be placed near those SPI IO pins. You end up with utilised CLBs that are functional but scattered unpredictably around the FPGA, and if no other timing constraints are declared the design may or may not operate at the target clock rate. It might be functional at 25 or 50MHz but entirely non-functional at 100MHz.
Problem 2) The unconstrained placement above gives another problem: rebuild the design and the tools will run the placement and routing optimisation again, and by the nature of optimisation algorithms may settle on a different placement. This time it may indeed work at 75MHz, but there may be some skew between the SPI SCLK and SIO lines that is not particularly good for the connected peripheral IC.
Problem 3) In an unconstrained system, timing might be met by default thanks to a nice slow system clock requirement, or skew and signal integrity may be no problem for the particular peripheral IC we are talking to over SPI. Great, so it is functional! But what about power? Clearly a long clock line from A to B will have an associated lumped RCL time constant (it is worth remembering that the switches in an FPGA are blind, so they do not automatically add buffers or registers to a signal unless the designer specifies them). That time constant presents an issue from a signal integrity point of view, but the long line also increases power dissipation. The dynamic power of digital logic switching is approximately given by P = fCV^2, so if we have a CMOS buffer driving a lengthy un-buffered line, the total capacitance of that line increases and hence so does the power.
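To put rough numbers on the power argument, here is a minimal Python sketch of the P = fCV^2 approximation. The capacitances, the supply voltage and the activity factor are all assumed values picked for illustration, not measurements from a real device.

```python
def dynamic_power_w(freq_hz, cap_f, vdd_v, activity=1.0):
    """Approximate dynamic switching power in watts: P = activity * f * C * V^2.

    The activity factor is an added assumption (1.0 = the net toggles every cycle).
    """
    return activity * freq_hz * cap_f * vdd_v ** 2


# Hypothetical values, for illustration only.
F_CLK_HZ = 100e6       # 100MHz clock
VDD_V = 1.2            # assumed core supply voltage
C_SHORT_F = 2e-12      # assumed capacitance of a short, local route
C_LONG_F = 10e-12      # assumed capacitance of a long cross-chip route

p_short = dynamic_power_w(F_CLK_HZ, C_SHORT_F, VDD_V)
p_long = dynamic_power_w(F_CLK_HZ, C_LONG_F, VDD_V)

print(f"short route: {p_short * 1e6:.0f} uW")
print(f"long route:  {p_long * 1e6:.0f} uW")
print(f"ratio:       {p_long / p_short:.1f}x")
```

Since power scales linearly with capacitance, the ratio between the two routes is simply the ratio of their capacitances, whatever the clock frequency or supply voltage.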
Solutions: So by constraining a system in terms of the physical placement of the logic and the timing of various nets in the design, we can give the compiler more information to work with when it does its optimisation.
We could for example constrain the SPI core to an area only 10% larger than it requires. This would mean that the core itself has minimal delays between logic elements, achieves a fast core-only clock speed, and stays the same from build to build.
We could also constrain it to be near the SPI IO pads meaning that the SPI signals should always have the same relative delays/skew/jitter and these would be as low as possible.
We could also constrain the core's system clock to use the FPGA's global clock lines and clock buffers. These dedicated clock routes are designed to enable high speed, low skew and low jitter systems.
So with the above the only remaining freedom the compiler has is the routing of the system clock and the core's outputs to the nearby pads. With the compiler's goals in mind, the optimisation has only a few solutions so we achieve both high speed operation and build to build robustness.
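In the Xilinx ISE flow, constraints like these are usually written into a user constraints file (UCF). As a rough sketch, the Python below sizes a slice rectangle about 10% larger than the core needs and prints matching AREA_GROUP and LOC lines. The instance name, group name, net names, pin names, slice coordinates and slice count are all hypothetical placeholders.

```python
def area_group_ucf(inst, group, x0, y0, cols, slices_needed, margin_percent=10):
    """Return UCF lines for a slice rectangle with roughly margin_percent headroom."""
    # Integer ceiling division avoids floating point rounding surprises.
    wanted = slices_needed * (100 + margin_percent)
    rows = -(-wanted // (100 * cols))
    x1 = x0 + cols - 1
    y1 = y0 + rows - 1
    return [
        f'INST "{inst}" AREA_GROUP = "{group}";',
        f'AREA_GROUP "{group}" RANGE = SLICE_X{x0}Y{y0}:SLICE_X{x1}Y{y1};',
    ]


def pad_loc_ucf(pins):
    """Return one LOC constraint per (net name, package pin) pair."""
    return [f'NET "{net}" LOC = "{pin}";' for net, pin in pins.items()]


# Hypothetical SPI core needing 40 slices, placed in a 4-column stripe.
lines = area_group_ucf("spi_core/*", "AG_spi", x0=0, y0=0, cols=4, slices_needed=40)
# Hypothetical package pins for the SPI signals.
lines += pad_loc_ucf({"spi_csb": "A2", "spi_sclk": "A3", "spi_sio": "A4"})

print("\n".join(lines))
```

A real design would pick the rectangle's origin so that it sits next to the IO bank holding those pads.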
- - - - - - -
Case Study [FLITES Firmware]:
The FLITES project (see other blogs etc) is implementing a large design on a Spartan-6 LX45 FPGA. It includes a Micro-Blaze soft embedded microprocessor core used for creating Ethernet packets of data, grabbing data from the fabric using direct memory access (DMA) and obtaining configuration and command packets during the experiment. The firmware also includes interfaces to peripherals using SPI and I2C communications, MII communications to the Ethernet PHY and SerDes communications (LVDS) for high speed data transfer from sixteen 40MS/s 14-bit ADCs on the board. The firmware is also expected to perform real time digital lock-in (DLI) signal processing using direct digital synthesis (DDS) of a reference signal. It is also expected to manage the timing of data acquisition relative to other units in our distributed system and temporarily store the data in memory while the Micro-Blaze creates Ethernet headers, CRC check footers etc.
The FLITES firmware has multiple clock regions:
SerDes running at 260MHz obtaining DDR data at 580Mb/s from sixteen ADCs,
DLI and FIFOs running at 40MHz matching the ADC 40MS/s sample rate,
a Micro-Blaze processor running at 100MHz to ensure it has enough time to read the FIFOs using DMA between DLI measurement windows and of course,
a 25MHz clock region with respect to the 25MHz network clock of 100BASE-TX Ethernet.
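Each of these domains needs its own period constraint. As a small sketch, the Python below converts the four clock rates listed above into periods and prints UCF-style TNM_NET and TIMESPEC lines for them. The net names are hypothetical, and a real design may instead constrain only the input clock and let the tools derive the PLL outputs.

```python
# Clock rates from the list above, in MHz. The net names are hypothetical.
CLOCKS_MHZ = {
    "clk_serdes": 260.0,
    "clk_mb": 100.0,
    "clk_dli": 40.0,
    "clk_eth": 25.0,
}


def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def timespec_lines(clocks_mhz):
    """Return a TNM_NET line and a PERIOD TIMESPEC line for each clock net."""
    lines = []
    for net, mhz in clocks_mhz.items():
        lines.append(f'NET "{net}" TNM_NET = "{net}";')
        lines.append(
            f'TIMESPEC "TS_{net}" = PERIOD "{net}" {period_ns(mhz):.3f} ns HIGH 50%;'
        )
    return lines


ucf = timespec_lines(CLOCKS_MHZ)
print("\n".join(ucf))
```

The spread is large: the 260MHz SerDes logic has under 4ns per cycle, while the 25MHz Ethernet logic has 40ns, which is why a single global target is a poor fit.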
So, in view of all these different blocks and timing requirements, clearly some form of constraint system is needed. Otherwise we may find one flip-flop of a control FSM placed not next to its logical neighbour, but right next to an I2C shift register. Likewise we might find that the 25MHz and 40MHz clocks run fine but the 100MHz and 260MHz logic is unable to function at those speeds and is only functional at 75MHz.
In the image below (Fig.1), I've declared some system physical constraints for the Micro-Blaze, the system clock, the system's main PLL, the peripheral interfaces and the experiment timing/synchronisation module.
In the centre the Micro-Blaze is declared as a vertical stripe of slices, DSP slices and Block-RAMs. The PLL of the Micro-Blaze is in orange near the centre of the FPGA and the centre of the Micro-Blaze slice constraint. The two separate (left and right) vertical stripes define the Micro-Blaze Block-RAM constraints, and the right hand side mid-FPGA block is the ADC clocking PLL's I2C configuration transceiver.
The SerDes blocks for the ADCs exist (in Xilinx FPGAs) within the pad ring of the FPGA, so they are logically placed as close as possible to the differential LVDS pads.
The digital lock-in (DLI) blocks use lots of the hard-core DSP48A1 slices, the stripes of green blocks, but very few logic slices. As such these will be constrained in the different clock regions (X0Y1 for example). As the Micro-Blaze exists in half of some of these clock regions, the DLIs will exist in the remaining area.
The direct digital synthesis (DDS) units are to be constrained to clock regions X0Y3 and X1Y3, right in the middle of the FPGA, and right in the middle of the Micro-Blaze. Luckily they a) use DSP48A1 slices while the Micro-Blaze uses DSP slices elsewhere and b) can use the free slices around those areas. But there is a problem, and hence some form of priority is needed.
In the above figure (Fig. 2), the DLI and DDS blocks are constrained into 14 of the 16 clock regions. The two clock regions at the top are not used here, as the black parts represent a reduction in the resources available per clock region. This is presumably due to the presence of hard-IPcores such as PCI cores or Gigabit transceivers (available in various models of this FPGA line).
As you can see, this separate front-end-only firmware (i.e. not the Micro-Blaze project), from the FLITES project's Ph.D. student, shows a resource use problem when we come to combine it with the top level design shown in Fig.1.
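One way to catch this kind of clash before running a full build is to tally what each block asks for in each clock region and compare it with what the region holds. The Python below is only a sketch of that bookkeeping: the per-region capacities and every per-block figure are invented for illustration and are not taken from the FLITES design or the device datasheet.

```python
from collections import defaultdict

# Hypothetical resources available in one clock region.
CAPACITY = {"slices": 400, "dsp": 4}

# (block, clock region, requested resources), all numbers hypothetical.
REQUESTS = [
    ("microblaze", "X0Y3", {"slices": 250, "dsp": 0}),
    ("dds", "X0Y3", {"slices": 200, "dsp": 3}),
    ("microblaze", "X1Y3", {"slices": 150, "dsp": 0}),
    ("dds", "X1Y3", {"slices": 200, "dsp": 3}),
]


def oversubscribed(requests, capacity):
    """Return {region: {resource: shortfall}} for regions asked for too much."""
    used = defaultdict(lambda: defaultdict(int))
    for _block, region, resources in requests:
        for kind, count in resources.items():
            used[region][kind] += count

    problems = {}
    for region, totals in used.items():
        over = {k: n - capacity[k] for k, n in totals.items() if n > capacity[k]}
        if over:
            problems[region] = over
    return problems


clashes = oversubscribed(REQUESTS, CAPACITY)
for region, over in clashes.items():
    print(region, "is short of", over)
```

With these made-up numbers only X0Y3 is over budget, and only in slices, which mirrors the situation described above where the DSP slices do not clash but the surrounding logic does.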
- - - - - - -
Priorities and Constraint Hierarchies:
From a design view, what must we achieve?
We must achieve timing closure on all clock domains,
We must maintain data integrity over those domains,
We must ensure correct interfacing to the external world (pad locations, signal integrity etc),
We must ensure a high hit/miss ratio (i.e. a low miss rate) in our Micro-Blaze working memory and DMA blocks, and
We must ensure reliable zero-glitch data acquisition suitable for our spectroscopy colleagues on the FLITES project.
With these in mind we can start making a priority list.
Prioritise high-speed above lower-speed (i.e. 100MHz Micro-Blaze must be prioritised above 400kHz rate I2C)
Prioritise real-time command functionality above on-power-up or one-off configuration (i.e. the Micro-Blaze command receive, and fabric block configuration registers must be prioritised above the system clock on-power-up state machine)
Prioritise mission critical blocks above resettable non-critical blocks (i.e. Micro-Blaze hit/miss ratio and timing to its .text, .data, .heap, .stack memory locations must be prioritised in comparison to the DDS/DLIs as those can be remotely reset, while a catastrophic Micro-Blaze error can only be corrected remotely by a power off and re-initialisation cycle)
Blocks using hard-IPcores such as the LVDS pads, the SerDes blocks, clocking resources (SerDes clocks, system PLL, clock managers), or memory controller blocks (MCBs) should have a higher priority than soft-IPcores.
PADs, being fixed locations due to the fixed nature of the PCB, always remain the ultimate constraint. They can be moved easily in the constraint file, but absolutely cannot be moved on the PCB unless the PCB is re-spun and re-populated (i.e. high cost factor).
The higher the speed the tighter the core should be constrained and the closer it should be placed to critical clock or pad resources, (i.e. we would expect a PCIe interface to be directly adjacent to its IO pads).
All clock nets/domains should include a timing constraint allowing the compiler to optimise them appropriately (i.e. there is no need for the 25MHz clock domain to actually be capable of running at 75MHz, if this means that the 100MHz domain ends up only capable of 96MHz).
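The rules above can be turned into a rough ranking of which block gets first pick of placement resources. The Python sketch below is one possible interpretation, not a definitive ordering: the relative weight of the rules (hard IP first, then mission critical, then real time, then clock speed) is my assumption, as are the attributes given to each block. Pad locations are treated as fixed inputs rather than ranked.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    clock_mhz: float
    hard_ip: bool            # tied to hard-IPcores (SerDes, PLLs, MCBs etc)
    mission_critical: bool   # cannot simply be reset remotely
    real_time: bool          # active during the experiment, not only at power-up


def priority_key(block):
    """Tuple compared left to right; this rule ordering is an assumption."""
    return (block.hard_ip, block.mission_critical, block.real_time, block.clock_mhz)


# Illustrative attribute values only.
blocks = [
    Block("i2c_config", 0.4, hard_ip=False, mission_critical=False, real_time=False),
    Block("dli", 40.0, hard_ip=False, mission_critical=False, real_time=True),
    Block("microblaze", 100.0, hard_ip=False, mission_critical=True, real_time=True),
    Block("adc_serdes", 260.0, hard_ip=True, mission_critical=False, real_time=True),
]

ranked = sorted(blocks, key=priority_key, reverse=True)
for position, block in enumerate(ranked, start=1):
    print(position, block.name)
```

Changing the order of the fields in priority_key is an easy way to try out a different weighting of the rules.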
- - - - - - -
So, what does this mean for the FLITES project? It means we will have an interplay between blocks as they are added to the system. The Micro-Blaze constraints will need to move and change if other blocks need resources locally. The DLI's logic will be very close to the DLI's DSP slices, but the DDS units will require thought in terms of their placement and may well need to be moved/distributed.
That is all for now. When we have progressed the development further I might write a follow-up blog, so stay tuned folks.