Thursday, August 28, 2025

FPGA RTL coding

RTL coding phase

The following are critical aspects that need to be considered during the RTL coding phase:

1.      Logic delay: Though a logic delay of around 50% of the cycle time may be adequate, it is desirable to keep the high-speed paths in the design lower than that, say at 20-30%. FPGAs usually have abundant resources such as flip-flops (normally one flip-flop per look-up table), RAMs, multipliers, etc. Wherever it does not affect throughput, additional pipeline stages can be introduced judiciously, keeping routing congestion issues in mind.

2.      Device mapping efficiency: The RTL code shall enable the best FPGA mapping by exploiting the device architecture. One example: in the Xilinx Virtex-II FPGA there is an additional 2:1 MUX (F5) between two LUTs, with dedicated routes. If a 4:1 MUX is coded as a single entity, it maps well into one slice using two LUTs and the F5 MUX. If instead the 4:1 MUX is built with a pipeline stage after the 2:1 MUXes, it cannot be mapped to the F5 MUX and an additional slice is needed. Another example: a long register-based shift register can be mapped to the SRL configuration of a LUT, provided none of these registers needs a reset.

3.      Fan-out: Though synthesis tools can do automatic fan-out control, manual control is needed, especially for signals interfacing to hard-macros: the tools treat everything in the same manner, and hard-macros are often black-boxes.

4.      Vendor-specific structures and instantiations: Create a hierarchy around them, to retain the freedom to migrate from one technology to another.

5.      Macro interface: All inputs/outputs of macros shall be registered, because macro locations are fixed.

6.      Gated clocks: Avoid gated clocks and use clock enables instead.

7.      Critical logic: Place critical logic in a separate hierarchy.

8.      Critical paths: Make sure that critical paths do not cross the block hierarchy, by registering all outputs.

9.      Tri-state buffers: For low-speed paths, it is desirable to use tri-state buffers to save logic cells.

10.  Unused hard-macros: Unused RAMs can be used as a register set, or to map state machines coded as look-up tables; this also avoids large multiplexers in the read path. Unused multipliers can be used as long shifters.

11.  False and multi-cycle paths: False and multi-cycle paths shall not be pipelined; they shall be identified by design and passed on to the synthesis tool.

12.  Trial synthesis and P&R: Each module-level designer shall perform individual module-level synthesis and P&R with the given floorplan, and optimize the RTL code as it is being developed. If the I/O requirement of a module exceeds the device's physical I/Os, dummy logic can be added to demultiplex/multiplex few-pins-to-more-pins and/or more-pins-to-few-pins using shift-register structures and/or OR-gate structures, as shown in Figure 2. Also, as shown in that figure, insert additional flip-flops on the interfaces from the selected module to other modules, leaving the actual I/O interfaces the same; this eliminates skewed timing results due to the dummy logic and connections. Black-box timing information shall also be used during synthesis to avoid skewed timing results.

13.  Module-level floorplanning: Within the given floorplan area, it is often desirable to do sub-module-level floorplanning, and in this sub-module-level floorplanning it is often necessary to floorplan only the critical parts of the design. It is also necessary to do an individual synthesis compile of the timing-critical sub-modules being floorplanned, which prevents hierarchy loss (as shown in Figure 3), and thereby inefficient placement.

14.  Logic compression: Though from an area standpoint it is preferable to do maximum packing of unrelated logic (for example, using COMPRESSION in the Xilinx flow), this has an adverse impact on timing. Thus the unrelated-logic packing level shall be set based on the timing criticality of each sub-module.

15.  IO allocation: Module IO fixing shall be done based on the IO-ring pin sequence on the die, rather than the pin sequence on the package.

Chip level Synthesis phase

During the chip-level synthesis phase, the following information shall be collected from the individual module designers:

 

1.      Area constraints with unrelated logic compression information

 

2.      Timing constraints including false and multi-cycle paths

 

3.      IO assignments

 

 

4.      Black-box timing information

 

5.      Synthesis compile hierarchy

 

6.      Timing critical sub-module information

 

Module-level synthesis has to be carried out with the information gathered from the designers. Merely meeting the target frequency at the synthesis stage is not good enough, since route estimates are inaccurate. Instead, if the logic delay achieved is 50% of the cycle time, we can say we have achieved the best possible results out of synthesis and move on to further steps.

The resource-sharing and fan-out control options in the synthesis tool can be enabled for non-timing-critical sub-modules, whereas options such as register replication, fan-out control and retiming can be enabled for timing-critical sub-modules. In the chip top-level synthesis compilation, all modules will be black-boxes. Automated push-button physical synthesis has yielded only a 10-15% overall improvement in performance after P&R. However, there are physical synthesis tools (e.g. Synplify Premier) which support floorplanning at the synthesis stage. The methodology described in this paper is equally applicable to netlist-based floorplanning or physical-synthesis-based design floorplanning.

 

FPGA critical aspects

 


 

During the micro-architecture or detailed design phase, FPGA resource requirements shall be estimated. Module designers shall have a "detailed view" of the design down to the function/major-component level for near-accurate estimates. At the end of this phase, the exact FPGA part to be used shall be finalized from the chosen family.

The following are critical aspects that need to be considered during this phase:

1.      FPGA device architecture: Detailed investigation and understanding of the FPGA device architecture and capabilities, including logic cells, RAMs, multipliers, DLL/PLL and IOs.

2.      Module boundaries: All module interfaces shall be on register boundaries.

3.      Internal bus structure: A well-defined internal point-to-point bus structure is preferred to routing all signals back and forth.

4.      Clocks: Clock multiplexing and gating shall be avoided; if required, they shall be done based on device capabilities.

5.      Resets: The number of resets in the system shall be optimized based on the dedicated reset routing resources available.

6.      Register file: Instead of creating one common register file and routing register values to all modules, it is better to have registers where they are used; if needed, registers may even be duplicated. Note that though the write path may be a multi-cycle path, the read path may not be. Also, registers shall be implemented in RAM wherever possible.

7.      Selection of memories/multipliers: The memory size requirement shall decide whether to use hard-macros or to build memories with logic. For small memories it is not preferred to map to large memory hard-macros, even though building them in logic takes additional resources: hard-macro memory locations are fixed, and placing the driving/receiving logic next to the memories is not always possible. Similarly, it is not advantageous to map a small multiplier (such as 3x3) to an 18x18 hard-macro multiplier.

8.      Data/Control path mixing: Often it is advantageous to store control signals along with the data bits in memories and pass them on to other modules. For example, consider 16 data bits and 2 control bits to be transferred from one module to another through memory: all 18 bits can be stored as data in an available block memory of size, say, 1Kx18. This method is further advantageous if the handshake is asynchronous.

9.      Big multiplexer structures: It is not preferred to build very big multiplexer structures (say 256:1), especially in timing-critical paths. Instead, smaller multiplexers can be built, which are more controllable.

10.  High-level floorplan: A high-level floorplan including IO planning shall be worked out (as shown in Figure 1) based on the gate count and other macro estimates. Spare area shall also be planned for future/field upgrades. At this stage it is not necessary to fix the IO locations, but it is necessary to fix the IO banks in the FPGA. Once the high-level floorplan is done, the budgeted area is known to the module-level designers. The interfacing modules' floorplan locations are also known to them, which enables them to further floorplan the allocated area if necessary. Some high-level floorplanning considerations are:

a.       Controlling congestion along with proximity

b.      Draw the data-flow diagram of the design, with the memories used to terminate the data paths, and do module-level area allocation

c.       Interdependent modules should be placed closer together

d.      The area allocated to a module shall be close to the macros it interfaces with

e.       Leave free area (rows and columns) between module area allocations, to aid inter-module routing in the full chip

f.       Consider clock resources and routing limitations, if any

11.  Module output replication: Based on the initial floorplan, each module's outputs might have to be replicated if the modules receiving the data are located in different corners of the chip.

12.  Best practices: RTL coding guidelines shall be passed on to module level designers.
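The data/control-path mixing described in item 8 above can be sketched in software (a Python sketch; the 16-bit data + 2-bit control split follows the example in the text, and the function names are mine):

```python
def pack(data16, ctrl2):
    """Pack 16 data bits and 2 control bits into one 18-bit memory word."""
    assert 0 <= data16 < (1 << 16) and 0 <= ctrl2 < (1 << 2)
    return (ctrl2 << 16) | data16

def unpack(word18):
    """Recover the data and control fields from an 18-bit word."""
    return word18 & 0xFFFF, word18 >> 16

word = pack(0xBEEF, 0b10)
print(hex(word))     # 0x2beef
print(unpack(word))  # (48879, 2)
```

Each 18-bit word fits one location of a 1Kx18 block memory, so the control bits travel with their data and need no separate synchronization path.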

 


ARM protocol numericals

 

The start and end address of a 4 KB memory in Hex

 

We must represent addresses in hexadecimal format.

Now, 1 kilobyte = 1024 bytes = 2^10 bytes

So 4 KB will have 2^10 * 2^2 = 2^12 bytes.

 

2^12 bytes of memory means 2^12 locations can be accessed in the memory.

For this, we require 12 bits to access all the locations.

In hexadecimal, each digit represents 4 binary bits. So 3 hex digits will cover 12 bits.

Memory addresses in hex: 000 to FFF for a 4 KB memory.

Start address = 0x000

End address = 0xFFF
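The arithmetic above can be checked with a short Python snippet (added here only as a sanity check):

```python
# A 4 KB memory is 2^12 bytes, so 12 address bits are needed
# and byte addresses run from 0x000 to 0xFFF.
SIZE_BYTES = 4 * 1024

addr_bits = SIZE_BYTES.bit_length() - 1   # 4096 = 2^12 -> 12 bits
start, end = 0, SIZE_BYTES - 1            # last addressable byte

print(addr_bits)                   # 12
print(f"{start:#05x} {end:#05x}")  # 0x000 0xfff
```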

 

 

Divide a 4 GB memory space equally among 8 slave devices. What is the address range for each device?

 

4 GB = 2^32 bytes, so 32 bits are needed to represent the memory space.

In hex, 0000 0000 to FFFF FFFF represents 4 GB space.

Dividing by 8, each slave gets 512 MB = 2^29 bytes, i.e., an increment of 2000 0000 (hex) per slave; each range spans offsets 0000 0000 to 1FFF FFFF.

 

Address spacing for each slave:

0000 0000 to 1FFF FFFF

2000 0000 to 3FFF FFFF

4000 0000 to 5FFF FFFF

6000 0000 to 7FFF FFFF

8000 0000 to 9FFF FFFF

A000 0000 to BFFF FFFF

C000 0000 to DFFF FFFF

E000 0000 to FFFF FFFF
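The eight ranges above can be generated programmatically (a Python sketch, as a sanity check):

```python
# Split a 4 GB (2^32-byte) space equally among 8 slaves:
# each gets 512 MB = 2^29 bytes, an increment of 0x2000_0000.
TOTAL, N = 1 << 32, 8
step = TOTAL // N                    # 0x2000_0000

for i in range(N):
    lo, hi = i * step, (i + 1) * step - 1
    print(f"slave {i}: {lo:08X} to {hi:08X}")
# slave 0: 00000000 to 1FFFFFFF
# ...
# slave 7: E0000000 to FFFFFFFF
```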

 

 

Now let us consider Wrap operation.

 

Concept:

A wrap operation performs reads/writes starting from a start address, incrementing by the transfer size until it reaches the wrap boundary.

After this, the address moves back to the lower wrap boundary.

A wrap transfer is defined by its burst length and transfer size.

Note the following:

1. Start address of a wrap burst must be aligned to the size of the transfer.

2. The length of the burst must be 2,4,8 or 16.

 

Example:

Consider a 4-beat burst of 4-byte transfers.

(meaning length = 4 and size = 4 bytes)

Total size = 4*4 = 16 bytes. 

So address must wrap at every 16 byte boundary.

16 = 2^4. 

So the wrap boundary addresses end with 0000 (four zero bits at the end).

Note: 4-byte size transfers must be aligned to 4-byte boundaries (two bits zeroes at the end).

 

E.g. 0x20, 0x24, 0x28, 0x2C is a valid sequence of addresses for a 4-beat burst of 4-byte transfers. After reaching 0x2C, the address wraps back to 0x20.
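The wrap-address rules above can be sketched as a small helper (Python; the function name is mine):

```python
def wrap_addresses(start, size_bytes, length):
    """Generate the address sequence of a wrapping burst.

    start must be aligned to size_bytes; length must be 2, 4, 8 or 16.
    The address wraps at a (size_bytes * length)-byte boundary.
    """
    assert start % size_bytes == 0, "start must be aligned to transfer size"
    assert length in (2, 4, 8, 16), "illegal burst length"
    total = size_bytes * length
    lower = (start // total) * total      # lower wrap boundary
    return [lower + (start - lower + i * size_bytes) % total
            for i in range(length)]

# 4-beat burst of 4-byte transfers starting at 0x20:
print([hex(a) for a in wrap_addresses(0x20, 4, 4)])
# ['0x20', '0x24', '0x28', '0x2c']

# Starting mid-boundary at 0x28, the burst wraps back:
print([hex(a) for a in wrap_addresses(0x28, 4, 4)])
# ['0x28', '0x2c', '0x20', '0x24']
```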

Skills to acquire for a VLSI professional in AI

 

 

To become a VLSI professional specializing in AI, you should consider acquiring the following skills:

 

VLSI Design: Develop a solid understanding of VLSI circuit design, digital logic, and chip architecture.

 

AI and Machine Learning: Learn about various AI and machine learning algorithms and their implementation in hardware.

 

FPGA and ASIC Design: Familiarize yourself with Field Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs) for AI acceleration.

 

Hardware Description Languages (HDLs): Gain proficiency in HDLs like Verilog or VHDL for designing and simulating VLSI circuits.

 

Deep Learning Accelerators: Study the design and optimization of specialized hardware for deep learning tasks.

 

Neural Network Architectures: Understand different neural network architectures like CNNs, RNNs, and Transformers for AI applications.

 

Low-Power Design Techniques: Learn techniques to minimize power consumption in VLSI circuits, especially crucial for AI devices.

 

Hardware-Software Co-design: Explore the integration of AI algorithms with hardware to optimize performance and efficiency.

 

System-on-Chip (SoC) Design: Get familiar with designing complete AI systems on a single chip, including processor cores and accelerators.

 

Verification and Validation: Understand methodologies for verifying and validating VLSI designs to ensure their correctness.

 

Parallel Processing: Learn about parallel computing techniques to maximize the computational efficiency of AI algorithms.

 

Signal Processing: Study signal processing techniques, as they are often used in AI-related applications.

 

Remember, the field of AI and VLSI is continually evolving, so staying up-to-date with the latest advancements is crucial for success in this domain.

Linux Device driver

 





1)      Any Linux driver has a constructor and a destructor.

2)      The module's constructor is called when the module is successfully loaded into the kernel, and the destructor when rmmod succeeds in unloading the module.

These two are like normal functions in the driver, except that they are specified as the init and exit functions, respectively, by the macros module_init() and module_exit(), which are defined in the kernel header module.h.

1) Simple driver (OFD)

Program for ofd:

#include <linux/module.h>
#include <linux/version.h>
#include <linux/kernel.h>

static int __init ofd_init(void) /* Constructor */
{
    printk(KERN_INFO "Namaskar: ofd registered\n");
    return 0;
}

static void __exit ofd_exit(void) /* Destructor */
{
    printk(KERN_INFO "Alvida: ofd unregistered\n");
}

module_init(ofd_init);
module_exit(ofd_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("");
MODULE_DESCRIPTION("Our First Driver");

Makefile:

# Makefile – makefile of our first driver

# If KERNELRELEASE is defined, we've been invoked from the
# kernel build system and can use its language.
ifneq (${KERNELRELEASE},)
    obj-m := ofd.o
# Otherwise we were called directly from the command line.
# Invoke the kernel build system.
else
    KERNEL_SOURCE := /usr/src/linux
    PWD := $(shell pwd)

default:
	${MAKE} -C ${KERNEL_SOURCE} SUBDIRS=${PWD} modules

clean:
	${MAKE} -C ${KERNEL_SOURCE} SUBDIRS=${PWD} clean
endif

 

With the C code (ofd.c) and Makefile ready, all we need to do is invoke make to build our first driver (ofd.ko).

$ make
make -C /usr/src/linux SUBDIRS=... modules
make[1]: Entering directory `/usr/src/linux'
  CC [M]  .../ofd.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      .../ofd.mod.o
  LD [M]  .../ofd.ko
make[1]: Leaving directory `/usr/src/linux'

 

To dynamically load or unload a driver, use these commands, which reside in the /sbin directory, and must be executed with root privileges:

1.      insmod <module_file> — inserts/loads the specified module file

2.      The message printed by the init function can be seen using the command dmesg

3.      rmmod <module_name> — removes/unloads the module

4.      The message printed by the exit function can again be seen using dmesg


Wednesday, July 10, 2024

FIFO

 

A FIFO is needed when data must be transferred between two different clock domains. We must also use a FIFO when fast incoming data has to be buffered to match the speed of the receiver. We come across different cases; let us understand them one by one.

One of the most common questions in interviews is how to calculate the depth of a FIFO. A FIFO is used as a buffering or queuing element in a system, which, by common sense, is required only when reading is slower than writing.

        So the size of the FIFO basically implies the amount of data that must be buffered, which depends upon the rate at which data is written and the rate at which data is read.

        Statistically, the data rate in a system varies mainly with the system load. So, to obtain a safe FIFO size, we need to consider the worst-case scenario for the data transfer across the FIFO under consideration.

         For the worst-case scenario, the difference between the write and read data rates should be at its maximum. Hence, the maximum data rate should be considered for the write operation and the minimum data rate for the read operation. In such questions, the read data rate is typically specified via a number of idle cycles, while the write operation is assumed to run at its maximum rate with no idle cycles.
         So, for the write operation, data rate = number of data words x clock rate. The writing side is the source and the reading side is the sink; the read-side data rate is Frd/Idle_cycle_rd.

In order to know the write-side data rate, we need to know the number of data words in a burst, which we assume to be B.

So, following the equation as explained below:

FIFO size = amount to be buffered = B - B x Frd/(Fwr x Idle_cycle_rd)

That is, FIFO_DEPTH = Burst_length - Burst_length x rd_clk/(wr_clk x Idle_cycle_rd).

Here we have not considered the synchronizing latency when the write and read clocks are asynchronous. The greater the synchronizing latency, the larger the FIFO needed to buffer the additional data written.
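The depth formula above can be wrapped in a small helper (a Python sketch; the function and parameter names are mine):

```python
import math

def fifo_depth(burst_len, f_wr, f_rd, idle_cycle_rd=1):
    """Minimum FIFO depth for a burst of burst_len words.

    Words are written on every write clock (frequency f_wr), and one
    word is read every idle_cycle_rd read clocks (frequency f_rd):
        depth = B - B * f_rd / (f_wr * idle_cycle_rd)
    Synchronizer latency for asynchronous clocks is NOT included.
    """
    return math.ceil(burst_len - burst_len * f_rd / (f_wr * idle_cycle_rd))

# e.g. a 100-word burst written at 40 MHz and read at 25 MHz:
print(fifo_depth(100, 40e6, 25e6))   # 38
```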

 

Write clock: wr_clk; on the write side, A data words are written into the FIFO every B clock cycles.

Read clock: r_clk; on the read side, X data words are read from the FIFO every Y clock cycles.

Burst_len for reading and writing: the same, both are Burst_len. What is the minimum depth required by the FIFO?

Over a sufficiently long period (T), the amount of data written must equal the amount of data read, that is, (A/B) x wr_clk = (X/Y) x r_clk.

To calculate the maximum burst length (burst_length) of the written data, consider the worst case.

For example, if 80 data words are written into the FIFO every 100 clocks, then in the back-to-back case, burst_length = 2 x 80 = 160.

If the packet transmission pattern is clearly given in the question (for example, one package is far from the next, so they do not affect each other), this back-to-back situation will not occur.

The minimum depth of the final FIFO: fifo_depth = Burst_len - Burst_len x (X/Y) x (r_clk/w_clk)

BTW: usually, for safety's sake, some extra depth is left.

The derivation process of FIFO minimum depth calculation formula:

The core guiding condition is the one mentioned above:

FIFO depth / (write rate - read rate) > write data volume / write rate

And in order to work under harsh conditions, we need to consider the situation when writing is at its fastest and reading is at its slowest. (If the read and write data widths differ, the two rates must first be scaled by their respective data widths.) Under normal circumstances the read and write data widths are the same, so:

            fifo_depth/(wr_clk - rd_clk x X/Y) > Burst_len/wr_clk

Sorted out: fifo_depth > (Burst_len/wr_clk) x (wr_clk - rd_clk x X/Y)

           i.e. fifo_depth > Burst_len - Burst_len x (rd_clk/wr_clk) x (X/Y)

 

If 100 write clock cycles can write 80 data, and 10 read clock cycles can read 8 data, let wclk = rclk and consider the back-to-back case (20 clk idle + 80 clk of data + 80 clk of data + 20 clk idle, a total of 200 clk) to calculate the FIFO depth:

fifo_depth = 160 - 160 x 80% = 160 - 128 = 32.

If instead wclk = 200 MHz with 40 data written in every 100 wclk, and rclk = 100 MHz with 8 data read in every 10 rclk, the back-to-back burst is 2 x 40 = 80, and the minimum FIFO depth is:

fifo_depth = 80 - 80 x (100/200) x (8/10) = 80 - 32 = 48.

When reading and writing FIFO are not performed at the same time

If the reading and writing of the FIFO are not performed at the same time, the FIFO depth must be set to the maximum write burst size.

Asynchronous FIFO minimum depth calculation example

Setting the depth of FIFO depends on the application scenario. 

Read and write FIFO used in SDRAM

     In SDRAM applications, setting the FIFO depth to double the operation data size is generally sufficient. For example, for SDRAM full-page reads and writes of 256 words, the corresponding depth is 512. Because the SDRAM read/write speed is faster than both the write speed of the FIFO in front of it and the read speed of the FIFO after it, the overall rate before and after the SDRAM operation stays the same.

Asynchronous clock data interface

     This is used for asynchronous clock-domain data interfaces. If reading and writing are performed at the same time, the typical case is a write clock faster than the read clock. The FIFO depth must then be set from the two clocks and the largest write burst.

Assume that the write clock frequency is 40 MHz, the read clock is 25 MHz, and the maximum burst of write data is 100 words.
Then the depth is calculated as: 100 - 100 x (25/40) = 37.5, so the corresponding depth is at least 38.

FIFO 

 An 8-bit-wide AFIFO has an input clock of 100 MHz and an output clock of 95 MHz; a package is 4 Kbit, and the sending distance between two packages is large enough. Find the depth of the AFIFO.

If the distance is large enough, two adjacent transmissions will not affect each other, and only one package needs to be considered;

Burst-length calculation: if 1 Kbit is taken as 1000 bits, burst_length = 4000/8 = 500; if 1 Kbit = 1024 bits, burst_length = 4 x 1024/8 = 512;

Calculation: fifo_depth = burst_length - burst_length x (X/Y) x (r_clk/w_clk)

Because the values of X and Y are not given, both default to 1.

With burst_length = 500: depth = 4000/8 - (4000/8) x (95/100) = 25.

With burst_length = 512: depth = 512 - 512 x (95/100) = 25.6, so the minimum fifo_depth is 26.
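Both interpretations of the burst length can be checked quickly (a Python sanity check):

```python
import math

F_WR, F_RD = 100, 95          # MHz: write and read clocks
for bits_per_k, label in ((1000, "1 Kbit = 1000 bits"),
                          (1024, "1 Kbit = 1024 bits")):
    burst = 4 * bits_per_k // 8            # 4 Kbit package, 8-bit words
    depth = math.ceil(burst - burst * F_RD / F_WR)
    print(f"{label}: burst = {burst}, depth = {depth}")
# 1 Kbit = 1000 bits: burst = 500, depth = 25
# 1 Kbit = 1024 bits: burst = 512, depth = 26
```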

The scenario can be one of two:
If writing is faster than reading, there is a possibility of overflow.
If writing is slower than reading, there is a possibility of underflow.
In FIFO depth calculation we always have to consider the worst case. The size of the FIFO basically implies how much data needs to be buffered, and it depends entirely on the data rates of reading and writing.
Transfer time = Number of data x Clock period
Difference = the longer transfer time - the shorter one
Depth = Difference / Time period of the faster clock
The writing side is the source and the reading side is the sink. If the write-side data rate is higher than the read-side data rate, the FIFO will overflow. Another type of depth calculation can be done by this method:
Consider F1 as the writing frequency and F2 as the reading frequency (F1 > F2), with a data size of D words.
Time taken to write D words to the FIFO = D/F1
Data read from the FIFO in the same time = (D/F1) x F2
Excess data in the FIFO (backlog) = D - (D/F1) x F2
The read side takes extra time to read out this backlog; that time is called the mop-up time:
Mop-up time = Backlog/F2 = [D - (D/F1) x F2]/F2
In general:
Depth = Wmax - (Wmax x Fread x Wread)/(Fwrite x Wwrite)
Fwrite = clock frequency of the write clock domain
Fread = clock frequency of the read clock domain
Wmax = maximum number of words that can be written
Wwrite = number of writes per clock
Wread = number of reads per clock
Following are some cases of FIFO depth calculation, each with an explanation.



Writing Side = 30 MHz => 33.33 ns Time Period
Reading Side = 40 MHz = > 25.0 ns Time Period



Consider the data size = 10
Time to write the data = 10 x 33.33 ns = 333.3 ns
Time to read the data = 10 x 25.0 ns = 250.0 ns
Difference between the two = 333.3 - 250.0 = 83.3 ns
Now divide by the faster clock's time period: 83.3/25 ns = 3.33, i.e. 4 (approx.)
The depth of the FIFO should be 4
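The time-difference method used above, in Python (a sketch; rounding up at the end):

```python
import math

T_WR, T_RD = 1 / 30e6, 1 / 40e6   # 33.33 ns and 25 ns clock periods
N = 10                            # words in the burst

t_write = N * T_WR                # ~333.3 ns to write the burst
t_read = N * T_RD                 #  250.0 ns to read it back
slack = t_write - t_read          #  ~83.3 ns
depth = math.ceil(slack / T_RD)   # divide by the faster (25 ns) period

print(depth)   # 4
```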



Writing side = 80 data per 100 clocks at 100 MHz => 10 ns time period
Reading side = 80 data per 80 clocks at 80 MHz => 12.5 ns time period
No randomization



Data size = 80
Time to write the data = 80 x 10 ns = 800 ns
Time to read the data = 80 x 12.5 ns = 1000 ns
Difference = 1000 - 800 = 200 ns
Now divide by the faster clock's time period: 200/10 ns = 20
The depth of the FIFO should be 20



Writing Side = 10 MHz => 100 ns
Reading Side = 2.5 MHz=>400 ns
Word Size = 2



Time to write the data = 2 x 100 ns = 200 ns
Time to read the data = 2 x 400 ns = 800 ns
Difference = 800 - 200 = 600 ns
Now divide by the slower clock's time period: 600/400 = 1.5, i.e. 2
(the reading clock is slower than the writing clock)



Writing data = 80 data per 100 clocks (randomization of 20 data)
Outgoing data = 8 data per 10 clocks

The above shows that the writing frequency is equal to the reading frequency:
20 idle + 80 valid data + 80 valid data + 20 idle

We will consider worst case

So we consider 200 cycles. In these 200 cycles, 160 data are written; the worst case is 160 data written continuously in 160 clocks.
On the reading side, we read 8 x 16 = 128 data in 16 x 10 = 160 clock cycles.
So the difference between the data written and the data read back = 160 - 128 = 32.
The FIFO depth should be 32.
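The 200-cycle worst case can also be checked with a small cycle-count simulation (a Python sketch; a simplified model that allows a write and a read in the same cycle):

```python
def max_occupancy(n_cycles, write_en, read_slot):
    """Track worst-case FIFO occupancy for a common write/read clock.

    write_en(c)  -> True if a word is written in cycle c
    read_slot(c) -> True if the reader accepts a word in cycle c
    """
    count = peak = 0
    for c in range(n_cycles):
        if write_en(c):
            count += 1
        if read_slot(c) and count > 0:
            count -= 1
        peak = max(peak, count)
    return peak

# 160 back-to-back writes (two 80-data bursts), reader takes 8 of every 10 clocks:
depth = max_occupancy(400,
                      write_en=lambda c: 20 <= c < 180,
                      read_slot=lambda c: c % 10 < 8)
print(depth)   # 32
```

During each group of 10 clocks inside the burst, 10 words are written but only 8 are read, so the occupancy grows by 2 per group; over 16 groups that accumulates to the 32 computed in the text.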

1. Calculate the depth of an async FIFO. The FIFO parameters are as follows:

Write Clk Freq = 60 MHz.
Read Clk Freq = 100 MHz.
Maximum Write Burst Size = 1024.
Delay between writes in burst = 4 clk.
Read Delay = 2 clk.

2. An A/D converter samples at 50 MHz and a DSP reads from it at 40 MHz. To send 100,000 samples to the DSP without loss, what is the minimum capacity (depth) of the FIFO to be added between the A/D and the DSP?