Over-Constrained FPGA Physical Constraints
Some time ago I wrote a small blog post on physical constraints for FPGAs (HERE). The idea was to constrain a block to a finite area of slices or resources in order to promote high-speed design through short routing paths between CLBs, even if this means that clocks must travel a reasonable distance from their input pads. This, of course, keeps in mind that clocks have numerous dedicated routes and options for on-FPGA re-timing.
In this blog post we look at over-constraint, i.e. whether a physical constraint that is too tight can by itself cause design instability or probabilistic failure. The short answer is yes, it can.
Previously, we discussed this idea in relation to an SPI module, physically constraining it to an area only 10% larger than its required resources and placed close to its SCK, SDI and SDO input/output pads. The idea is that a defined, finite area must by definition have quite short internal path lengths, and therefore the ability to meet, say, a 150MHz system clock requirement. At the same time, the area has enough room that the place-and-route step of compilation can move logic around as required: to best utilise the available resources, to reach hard IP cores such as BRAM or DSP blocks, or to optimise the speed or delay to the IO pads.
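As a rough sketch of what this looks like in practice, a Vivado XDC pblock along these lines could be written as follows; the pblock name, cell filter and slice ranges here are hypothetical placeholders, to be adapted to the actual design hierarchy and device:

```tcl
## Hypothetical pblock for the SPI block (names and ranges are placeholders)
create_pblock pblock_spi
add_cells_to_pblock [get_pblocks pblock_spi] \
    [get_cells -hierarchical -filter {NAME =~ *spi_inst*}]
## A slice range sized ~10% above the block's resource requirement,
## chosen to sit physically near the SCK/SDI/SDO pads
resize_pblock [get_pblocks pblock_spi] -add {SLICE_X0Y0:SLICE_X7Y19}
```

(In the older ISE flow, the equivalent would be an AREA_GROUP constraint in the UCF file.)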
I stand by this advice, but with a word of warning: that 10% may need to be increased to 20% or even 25% when dealing with more complex blocks. To see why, let's take an example...
Let us take a MicroBlaze, Xilinx's soft-core 32-bit RISC micro-processor, and assume it has i) all of its BRAM memory blocks, ii) memory controllers for FLASH and SDRAM, iii) controllers for peripherals such as general-purpose IO and an Ethernet MAC, and iv) a DMA engine to efficiently move data from the FIFOs of our front-end firmware to the micro-processor's heap/stack memory.
In the images below, the physical constraints for both the MicroBlaze (back-end) design and the FLITES project-specific (front-end) design are shown. In red are the resources used by the MicroBlaze, while yellow represents the SerDes, DDS, DLIA and FIFOs of the front-end. In green are blocks such as the ADC-control SPI and PLL I2C, along with a synchronisation FSM and a boot-up clock-management FSM.
SLICEs = 2,510 of 3,434 (74% Util)
LUTs = 10,032 of 13,736 (74% Util)
REGs = 7,216 of 27,472 (27% Util)
While 74% resource use is not particularly bad considering the size of the constrained area, a recent Doulos webinar run with the Xilinx training scheme highlighted that as a design approaches 80 to 90% utilisation, it begins to show signs of instability. 74% is below this region, and the 27% utilisation of registers is a good deal below it; however, I suspect that once the fixed locations of DSP and BRAM blocks are added into the mix, that 80-to-90% threshold drops to perhaps the 70 to 80% region.
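For reference, per-pblock utilisation figures like those above can be pulled from the Vivado Tcl console after implementation; the pblock name below is a placeholder:

```tcl
## Report slice/LUT/register utilisation within one pblock
## (pblock name is a placeholder for the actual constrained region)
report_utilization -pblocks [get_pblocks pblock_microblaze]
```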
It was found that many of the functions written as part of our embedded C worked perfectly, so one can assume that both the processor and its working memory structure were being clocked correctly and met timing closure. However, other functions became unstable, particularly those dealing with the memory controllers for FLASH and SDRAM or with bulk IO such as the Ethernet MAC. Previous testing made it clear that this was not an issue with the embedded C code, nor with the setup and design of the firmware; it was clearly correlated with an update to the blocks included in the top-level design, and therefore with the physical placement and utilisation of the FPGA's fabric logic.
For one self-test, namely an Ethernet MAC and PHY loop-back self-test, an initial run of the previously working code simply failed. Subsequent runs failed and succeeded probabilistically, demonstrating a) that the module and test were working, but b) that some timing or perhaps memory-miss issue was preventing this simple loop-back test from succeeding 100% of the time. Over 10 runs, the self-test succeeded approximately 60% of the time, with the remaining 40% being total failures.
A second self-test booted the AXI Quad SPI peripheral memory controller, ran initialisation and loop-back tests, queried the ID and size of the attached SPI FLASH, and performed an erase, write and read-back self-test. Over multiple runs, this also failed approximately 50% of the time.
Together, these two probabilistically failing tests suggested that timing closure to the memory controllers was not adequate, despite the tools reporting no violations of the system's timing constraints on either the clocks or the IO.
In the images below, the physical constraint on the MicroBlaze fabric logic slices has been relaxed, from 3,434 to 4,714 available slices. The other constraints, i.e. the DSP slice constraints and the sub-block constraints, are kept the same, as are the constraints on the other top-level blocks in the design (front-end in yellow and misc in green).
SLICEs = 2,510 of 4,714 (54% Util)
LUTs = 10,032 of 18,856 (54% Util)
REGs = 7,216 of 37,712 (20% Util)
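In XDC terms, this kind of relaxation is simply a matter of growing the pblock's slice range while leaving its DSP/BRAM ranges and the other pblocks untouched; again, the name and slice coordinates below are hypothetical:

```tcl
## Swap the MicroBlaze pblock's SLICE range for a wider one, leaving any
## DSP48/RAMB ranges in the same pblock, and all other pblocks, unchanged
## (pblock name and coordinates are placeholders)
resize_pblock [get_pblocks pblock_microblaze] -remove {SLICE_X0Y0:SLICE_X49Y49}
resize_pblock [get_pblocks pblock_microblaze] -add    {SLICE_X0Y0:SLICE_X59Y49}
```

Because `resize_pblock -remove` only touches the named range type, ranges for hard resources (DSP48, RAMB) previously added to the same pblock are unaffected.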
The utilised slices and LUTs are now a much smaller percentage of those available, allowing the PAR step greater freedom to move resources around and, as can be seen from the above images, increased placement of slices in the regions of the BRAMs and memory controllers.
Clearly, the relaxed constraint has allowed the required slices to be fitted closer to the fixed BRAM and DSP resources. The stability issue observed above has also been resolved, principally because certain paths from the memory blocks or controllers to their preceding or succeeding fabric CLBs are shortened. I suspect that previously, while the design met its timing constraints, the slack on these paths was quite small; here the slack has likely been lengthened.
While I would say that "required area + 10%" is reasonable for front-end blocks, or for blocks with relaxed timing constraints or a slower clock, I would revise this rule of thumb to "required area + 30%" when the block is more complex and hard IP core resources enter the scenario.
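The sizing rule itself is trivial arithmetic, but for completeness it can be captured as a small Tcl helper (a hypothetical convenience proc, not a Vivado command):

```tcl
## Sketch: slice budget for a pblock, given the required slice count
## and a fractional margin (0.10 for simple blocks, 0.30 for complex ones)
proc pblock_budget {required margin} {
    return [expr {int(round($required * (1.0 + $margin)))}]
}
## e.g. pblock_budget 2510 0.30 -> 3263 slices for the MicroBlaze above
```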
A probable route forward, then, is to set initial constraints using the two rules of thumb above, and then revise them as required to achieve both the required floorplanning and the required design stability.
That's all for now.