RTL coding phase
The following aspects are critical during the RTL coding phase:
1. Logic delay: Although a logic delay of around 50% of the cycle time may be adequate, it is desirable to keep the logic delay on high-speed paths lower, say 20-30%. FPGAs usually have abundant resources such as flip-flops (normally one flip-flop per look-up table), RAMs, and multipliers. Wherever throughput is not affected, additional pipeline stages can be introduced judiciously, keeping routing congestion in mind.
2. Device mapping efficiency: The RTL code shall enable the best FPGA mapping by exploiting the device architecture. For example, the Xilinx Virtex2 FPGA has an additional 2:1 MUX (F5) between two LUTs, with dedicated routes. If a 4:1 MUX is coded as a single entity, it maps well into one slice using two LUTs and the F5 MUX. If instead the 4:1 MUX is built with a pipeline stage after the 2:1 MUXes, it cannot be mapped to the F5 MUX and an additional slice is needed. As another example, a long register-based shift register can be mapped to the SRL configuration of a LUT, provided none of its registers requires a reset.
3. Fan-out: Although synthesis tools can control fan-out automatically, manual control is needed, especially for signals that interface to hard macros, because tools treat all signals in the same manner and the macros are often black boxes.
4. Vendor-specific structures and instantiations: Create a hierarchy around them so that the design can be migrated from one technology to another.
5. Macro interface: All inputs and outputs of macros shall be registered, because macros have fixed locations.
6. Gated clocks: Avoid gated clocks and use clock enables instead.
7. Critical logic: Place critical logic in a separate hierarchy.
8. Critical paths: Make sure that critical paths do not cross the hierarchy boundary of a block, by registering all block outputs.
9. Tri-state buffers: For low-speed paths, it is desirable to use tri-state buffers to save logic cells.
10. Unused hard macros: Unused RAMs can be used as register sets or to implement state machines coded as look-up tables. This also avoids large multiplexers in the read path. Unused multipliers can likewise be used as long shifters.
11. False and multi-cycle paths: False and multi-cycle paths shall not be pipelined; they shall be identified during design and passed on to the synthesis tool.
12. Trial synthesis and P&R: Each module designer shall perform module-level synthesis and P&R with the given floorplan, and optimize the RTL code while it is being developed. If the IO requirement of a module exceeds the number of physical IOs on the device, dummy logic can be added to demultiplex a few pins to many and/or multiplex many pins to a few, using shift-register and/or OR-gate structures as shown in Figure 2. As also shown in that figure, insert additional flip-flops on the interfaces between the selected module and other modules, leaving the actual IO interfaces unchanged. This eliminates skewed timing results caused by the dummy logic and connections. Black-box timing information shall also be used during synthesis to avoid skewed timing results.
13. Module-level floorplanning: Within the given floorplan area, it is often desirable to do sub-module-level floorplanning. At this level it is often necessary to floorplan only the critical parts of the design. The timing-critical sub-modules being floorplanned must also be compiled individually during synthesis, which prevents hierarchy loss (as shown in Figure 3) and thereby inefficient placement.
14. Logic compression: Although maximum packing of unrelated logic is preferable from an area standpoint (for example, using COMPRESSION in the Xilinx flow), it has an adverse impact on timing. The packing level for unrelated logic shall therefore be set according to the timing criticality of each sub-module.
15. IO allocation: The IOs of each module shall be assigned according to the pin sequence of the IO ring on the die rather than the pin sequence on the package.
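The dummy IO structures mentioned in item 12 can be illustrated with a small behavioural model. This is only a sketch in Python, not the circuit of Figure 2: the function names `shift_in` and `or_reduce`, the bit ordering, and the group size are assumptions made for illustration. It shows how one physical pin can feed many internal inputs through a shift register, and how many internal outputs can be folded onto a few pins with OR gates.

```python
from functools import reduce
from operator import or_


def shift_in(serial_bits, width):
    """Few-pins-to-more-pins: one physical pin feeds `width` internal inputs.

    On each clock the new bit enters stage 0 and older bits move one stage on.
    (Hypothetical model; the stage ordering is an assumption.)
    """
    stages = [0] * width
    for bit in serial_bits:
        stages = [bit] + stages[:-1]
    return stages


def or_reduce(internal_outputs, group_size):
    """More-pins-to-few-pins: OR each group of internal outputs onto one pin."""
    return [
        reduce(or_, internal_outputs[i:i + group_size], 0)
        for i in range(0, len(internal_outputs), group_size)
    ]
```

For instance, `or_reduce` applied to 64 internal outputs with a group size of 8 yields 8 values, one per physical pin. The dummy logic exists only to make the module fit the device for trial P&R, which is why item 12 isolates it from the real interfaces with extra flip-flops.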
Chip level Synthesis phase
During the chip-level synthesis phase, the following information shall be collected from the individual module designers:
1. Area constraints with unrelated logic compression information
2. Timing constraints, including false and multi-cycle paths
3. IO assignments
4. Black-box timing information
5. Synthesis compile hierarchy
6. Timing critical sub-module information
Module-level synthesis has to be carried out with the information gathered from the designers. Merely meeting the target frequency at the synthesis stage is not good enough, because routing delay estimates are inaccurate at that stage. Instead, if the logic delay achieved is 50% of the cycle time, synthesis can be considered to have delivered the best possible results, and the flow can move on to the next steps.
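The 50% criterion can be checked with simple arithmetic once the post-synthesis logic delay of the worst path is known. The sketch below is illustrative only: the function names and the example figures (a 200 MHz clock and a 2.5 ns logic delay) are assumptions, and the 0.5 default simply restates the threshold given above.

```python
def logic_delay_fraction(logic_delay_ns, clock_mhz):
    """Return the logic delay as a fraction of the clock period."""
    period_ns = 1000.0 / clock_mhz  # e.g. 200 MHz -> 5.0 ns
    return logic_delay_ns / period_ns


def meets_budget(logic_delay_ns, clock_mhz, budget=0.5):
    """True if the logic delay is within `budget` of the cycle time.

    budget=0.5 is the general criterion; a tighter value such as 0.3
    could be used for the high-speed paths (assumed usage).
    """
    return logic_delay_fraction(logic_delay_ns, clock_mhz) <= budget
```

With these assumed figures, a 2.5 ns logic delay at 200 MHz sits exactly at the 50% limit, leaving the other half of the 5 ns period for routing. The same path would fail a 30% budget.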
The resource sharing and fan-out control options of the synthesis tool can be enabled for non-timing-critical sub-modules, whereas options such as register replication, fan-out control, and retiming can be enabled for timing-critical sub-modules. Because each module is synthesized separately with its own options, all modules are black boxes in the chip top-level synthesis compilation. Automated push-button physical synthesis has yielded only a 10-15% overall improvement in performance after P&R. There are, however, physical synthesis tools (e.g. Synplify Premier) that support floorplanning at the synthesis stage. The methodology described in this paper is equally applicable to netlist-based floorplanning and to physical-synthesis-based floorplanning.
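The per-sub-module choice of synthesis options described above can be captured in a small lookup, for example when generating compile scripts. This is a sketch only: the option strings are hypothetical labels for the features named in the text, not actual flags of any synthesis tool, and the sub-module names in the usage note are invented.

```python
# Hypothetical labels for the synthesis features named in the text.
NON_CRITICAL_OPTIONS = frozenset({"resource_sharing", "fanout_control"})
CRITICAL_OPTIONS = frozenset(
    {"register_replication", "fanout_control", "retiming"}
)


def synthesis_options(timing_critical):
    """Pick the option set for one sub-module from its timing criticality."""
    return CRITICAL_OPTIONS if timing_critical else NON_CRITICAL_OPTIONS


def plan(submodules):
    """Map {sub-module name: is_timing_critical} to sorted option lists."""
    return {
        name: sorted(synthesis_options(critical))
        for name, critical in submodules.items()
    }
```

Calling `plan({"dma": False, "serdes_if": True})` would assign resource sharing and fan-out control to the first sub-module, and register replication, fan-out control, and retiming to the second.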