Implementation of random logic, prototyping, reconfigurable hardware, hardware dedicated to solving one problem
• DSP48 can perform mathematical functions such as:
  – Add/Subtract
  – Accumulate
  – Multiply
  – Multiply-Accumulate
  – Multiplexer
  – Barrel Shifter
  – Counter
  – Divide (multi-cycle)
  – Square Root (multi-cycle)
• Can also create filters such as:
  – Serial FIR Filter (Xilinx calls these MACC filters)
  – Parallel FIR Filter
  – Semi-Parallel FIR Filter
  – Multi-rate FIR Filters
High-performance FPGA families
• Virtex (220 nm)
• Virtex-E, Virtex-EM (180 nm)
• Virtex-II (130 nm)
• Virtex-II Pro (130 nm)
• Virtex-4 (90 nm)
• Virtex-5 (65 nm)
• Virtex-6 (40 nm)
• Virtex-7 (28 nm)
• Low-Cost Family
  – Spartan/XL – derived from XC4000
  – Spartan-II – derived from Virtex
  – Spartan-IIE – derived from Virtex-E
  – Spartan-3 (90 nm)
  – Spartan-3E (90 nm) – logic optimized
  – Spartan-3A (90 nm) – I/O optimized
  – Spartan-3AN (90 nm) – non-volatile
  – Spartan-3A DSP (90 nm) – DSP optimized
  – Spartan-6 (45 nm)
  – Artix-7 (28 nm)
• Starting with the Virtex-4 family, Xilinx introduced the DSP48 block for high-speed DSP on FPGAs
• Essentially a multiply-accumulate core with many other features
• Variants now also appear in Spartan-3A DSP, Spartan-6, Virtex-5, and Virtex-6
Embedded memories
The M4K memory blocks support the following modes:
• Single-port RAM (RAM: 1-Port)
• Simple dual-port RAM (RAM: 2-Port)
• True dual-port RAM (RAM: 2-Port)
• Tri-port RAM (RAM: 3-Port)
• Single-port ROM (ROM: 1-Port)
• Dual-port ROM (ROM: 2-Port)
Change in Supply Voltages
Year  Technology (nm)  Core Supply Voltage (V)
1998  350   3.3
1999  250   2.5
2000  180   1.8
2001  150   1.5
2003  130   1.2
2008  65    1.0
2009  40    0.9
2011  28    0.9
Multiple Standards for High-Speed Serial Communication
Fibre Channel
InfiniBand
PCI Express (developed by Intel)
RapidIO
SkyRail (developed by MindSpeed Technologies)
10-gigabit Ethernet
Field Programmable Gate Array
'Simple' programmable logic blocks
Massive fabric of programmable interconnects
Standard CMOS integrated circuit fabrication process, as for memory chips (Moore's Law)
An FPGA contains a set of programmable logic gates and rich interconnect resources, making it possible to implement complex digital circuits.
To simplify the design of complex systems in FPGAs, there exist libraries of pre-defined complex functions and circuits (IP cores) that have been tested and optimized to speed up the design process.
All done by a CAD system (e.g. Xilinx ISE/Vivado, Altera Quartus):
Chop up the circuit into little pieces of logic
Each piece goes into a separate logic element (LE)
Hook them together with the programmable routing
Logic functions implemented in Look-Up Tables (LUTs)
Flip-flops/registers: clocked storage elements
Multiplexers (select 1 of N inputs)
Huge density of logic block 'islands' (1,000s to 100,000s) in a 'sea' of interconnects
FPGA Features
Large complex functions
Re-programmability, flexibility
Massively parallel architecture: processing many channels simultaneously, cf. a microprocessor
Fast turnaround designs, standard IC manufacturing processes, Moore's Law
Mass produced, inexpensive, many variants, sizes, features
Not radiation hard
Power hungry, no analogue
LUT
A LUT contains memory cells to implement small logic functions.
Each cell holds '0' or '1'.
Programmed with the outputs of a truth table.
The inputs select the content of one of the cells as the output.
Larger logic functions are built up by connecting many logic blocks together.
Contents are determined by SRAM cells.
LUT AS DISTRIBUTED RAM
• CLB LUT configurable as Distributed RAM
  – A LUT equals 16x1 RAM
  – Implements single- and dual-port RAM
  – Cascade LUTs to increase RAM size
• Synchronous write
• Synchronous/asynchronous read
  – Accompanying flip-flops used for synchronous read
• When the CLB LUT is configured as memory, it can implement a 16x1 synchronous RAM. One LUT can implement a 16x1 single-port RAM. Two LUTs are used to implement a 16x1 dual-port RAM. The LUTs can be cascaded for the desired memory depth and width.
• The write operation is synchronous. The read operation is asynchronous and can be made synchronous by using the accompanying flip-flops of the CLB LUT.
• The distributed RAM is compact and fast, which makes it ideal for small RAM-based functions.
LUT AS SHIFT REGISTER
• Each LUT can be configured as a shift register
  – Serial in, serial out
• Dynamically addressable delay up to 16 cycles
• For programmable pipelines
• Cascade for greater cycle delays
• Use CLB flip-flops to add depth
The LUT can be configured as a shift register (serial in, serial out) with its length programmable from 1 to 16 bits.
For example, DEPTH[3:0] = 0010 (binary) means that the shift register is 3 bits deep. In the simplest case, a 16-bit shift register can be implemented in one LUT, eliminating the need for 16 flip-flops and also eliminating the extra routing resources that would otherwise have lowered performance.
FAST CARRY LOGIC
Each CLB contains separate logic and routing for the fast generation of sum and carry signals
– Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
Carry logic is independent of normal logic and routing resources
All major synthesis tools can infer carry logic for arithmetic functions
– Addition (SUM = A + B)
– Subtraction (DIFF = A - B)
– Comparators (if A < B then ...)
– Counters (count = count + 1)
CLB
Each slice contains two sets of the following:
– Four-input LUT
  • Any 4-input logic function
  • or 16-bit x 1 sync RAM (SLICEM only)
  • or 16-bit shift register (SLICEM only)
– Carry & Control
  • Fast arithmetic logic
  • Multiplier logic
  • Multiplexer logic
– Storage element
  • Latch or flip-flop
  • Set and reset
  • True or inverted inputs
  • Sync. or async. control
– Each Virtex-II CLB contains four slices
– Local routing provides feedback between slices in the same CLB, and it provides routing to neighboring CLBs
– A switch matrix provides access to general routing resources
– Each slice has four outputs
  – Two registered outputs, two non-registered outputs
– Two BUFTs associated with each CLB, accessible by all 16 CLB outputs
– Carry logic runs vertically, up only
  – Two independent carry chains per CLB
IOB
• The IOB provides the interface between the package pins and the CLBs
• Each IOB can work as a uni- or bi-directional I/O
• Outputs can be forced into high impedance
• Inputs and outputs can be registered
  – advised for high-performance I/O
• Inputs can be delayed
SELECTIO
• Allows direct connections to external signals of varied voltages and thresholds
  – Optimizes the speed/noise tradeoff
  – Saves having to place interface components onto your board
• Differential signaling standards
  – LVDS (Low Voltage Differential Signaling), BLVDS, ULVDS
  – LDT (Lightning Data Transport)
  – LVPECL (Low Voltage Pseudo Emitter Coupled Logic)
• Single-ended I/O standards
  – LVTTL, LVCMOS (3.3V, 2.5V, 1.8V, and 1.5V)
  – PCI-X at 133 MHz, PCI (3.3V at 33 MHz and 66 MHz)
  – GTL, GTLP
  – and more!
MEMORY AND CLOCKING
• Distributed RAM and Block RAM
  – Distributed RAM uses the CLB resources
  – Block RAM is a dedicated resource on the device (18 Kb blocks)
• Dedicated 18 x 18 multipliers next to the block RAMs
• Clock management resources
  – 16 dedicated global clock multiplexers
  – Digital Clock Managers (DCMs)
DISTRIBUTED SELECTRAM
• Uses a LUT in a slice as memory
• Synchronous write
• Asynchronous read
  – Accompanying flip-flops can be used to create a synchronous read
• Emulated dual-port RAM
  – One read/write port, one read-only port
BRAM
• The Block RAM is true dual port, which means it has two independent read/write ports, and these ports can be read and/or written simultaneously, independent of each other.
• All control logic is implemented within the RAM, so no additional CLB logic is required to implement the dual-port configuration.
• Most efficient memory implementation
• Dedicated blocks of memory
• Ideal for most memory requirements
• 4 to 104 memory blocks
• 18 Kbits = 18,432 bits per block (16 K without parity bits)
• Use multiple blocks for larger memories
• Builds both single- and true dual-port RAMs
• Block SelectRAM™ resources are dedicated resources on the silicon.
• RAMs can be given an initial value. Many "initialization" attributes are associated with the Block SelectRAM resources:
  • INIT_xx: Numbered attributes (00 - 3F) that specify the initial memory data contents. Each INIT_xx attribute is a 64-digit hex number.
  • INITP_xx: Numbered attributes (00 - 07) that specify the initial memory parity contents. Each INITP_xx attribute is a 64-digit hex number.
  • INIT_A/INIT_B: Specifies the initial value of the RAM output latches after configuration.
  • SRVAL_A/SRVAL_B: Specifies the value of the RAM output latches after SSRA/SSRB is asserted.
  • INIT and SRVAL attributes are specified as hex numbers.
• Up to 3.5 Mb of RAM in 18 Kb blocks
• Synchronous read and write
• True dual-port memory
• Each port has synchronous read and write capability
• Different clocks for each port
• Supports initial values
• Synchronous reset on output latches
• Supports parity bits
• One parity bit per eight data bits
DUAL PORTS
• Each port can be configured with a different data bus width
• Provides easy data width conversion without any additional logic
• Because the RAM blocks are true dual port, each port can be configured for a different width. This example shows port A configured as 1K x 4 and port B configured as 256 x 16. This feature can be used for applications requiring different bus widths for two applications.
• Note that the Altera FLEX 10K and ACEX 1K families do not have this feature, as they do not have true dual-port capability.
ADDED ADVANTAGE OF TRUE DUAL PORT
• Added advantage of true dual port
  – No wasted RAM bits
• Can split a dual-port 16K RAM into two single-port 8K RAMs with simultaneous independent access
• To access the lower RAM
  – Tie the MSB address bit to logic low
• To access the upper RAM
  – Tie the MSB address bit to logic high
TWO INDEPENDENT SINGLE PORTS
• Here, a single 4K-bit memory block is split into two independent 2K-bit single-port blocks. This feature allows efficient utilization of memory bits. The upper 2K-bit block is accessed by tying the ADDR11 bit to Vcc, whereas the lower 2K-bit block is accessed by tying it to GND instead.
DEDICATED MULTIPLIER BLOCKS
• 18-bit two's complement signed operation
• Optimized to implement multiply and accumulate functions
• Multipliers are physically located next to the Block SelectRAM™ memory
GLOBAL CLOCK ROUTING RESOURCES
• Sixteen dedicated global clock multiplexers
• Eight on the top-center of the die, eight on the bottom-center
• Driven by a clock input pad, a DCM, or local routing
• Global clock multiplexers provide the following:
  • Traditional clock buffer (BUFG) function
  • Global clock enable capability (BUFGCE)
  • Glitch-free switching between clock signals (BUFGMUX)
• Up to eight clock nets can be used in each clock region of the device
Each device contains four or more clock regions
DCM
• Up to twelve DCMs per device
  – Located on the top and bottom edges of the die
  – Driven by clock input pads
• DCMs provide the following:
  – Delay-Locked Loop (DLL)
  – Digital Frequency Synthesizer (DFS)
  – Digital Phase Shifter (DPS)
• Up to four outputs of each DCM can drive onto global clock buffers
All DCM outputs can drive general routing
CLOCKED LOGIC:
Registers on outputs. Clocked storage elements.
Synchronous FPGA logic design, pipelined logic.
FPGA fabric pulsed from a global clock (e.g. the LHC BX frequency)
FPGA Classification
Based on functional unit / logic cell structure:
• Transistor pairs
• Basic logic gates: NAND/NOR
• MUX
• Look-up Tables (LUTs)
• Wide fan-in AND-OR gates
Programming Technology
• Anti-fuse technology
• SRAM technology
• EPROM technology
• Varies from vendor to vendor. All share the common property: configurable in one of two positions, 'ON' or 'OFF'
• Can be classified into three categories:
  – SRAM based
  – Fuse based
  – EPROM/EEPROM/Flash based
• Desired properties:
  • Minimum area consumption
  • Low ON resistance; high OFF resistance
  • Low parasitic capacitance to the attached wire
  • Reliability in volume production
– SRAM cells are used
  • As Look-Up Tables (LUTs) to implement logic (as truth tables)
  • As embedded RAM blocks (for buffer storage etc.)
  • As control for routing and configuration switches
– Advantages
  • Allows In-System Programming (ISP)
  • Suitable for reconfigurable HW
– Disadvantages
  • Volatile, needs power all the time / use a PROM to download the configuration data
ANTIFUSE
• Invented at Stanford and developed by Actel
• Opposite to regular fuse technology: normally an open circuit until a programming current (about 5 mA) is forced through it
• Two types:
  1) Actel's PLICE [Programmable Low-Impedance Circuit Element]: a high-resistance poly-diffusion antifuse
  2) QuickLogic's low-resistance metal-metal antifuse [ViaLink] technology
     – Direct metal-to-metal connections
     – Higher programming currents reduce antifuse resistance
• Disadvantages:
  1) Unwanted long delay
EPROM
• EPROM Programming Technology
  – Two gates: floating and select
  – Normal mode:
    • No charge on the floating gate
    • Transistor behaves as a normal n-channel transistor
  – Floating gate charged by applying a high voltage
    • Threshold of the transistor (as seen by the gate) increases
    • Transistor turned off permanently
  – Re-programmable by exposing to UV radiation
    • Not in-system re-programmable
    • Re-programming is a time-consuming task
EEPROM/Flash
• Two gates: floating and select
• Functionally equivalent to EPROM; construction and structure differ
• Electrically erasable: re-programmable by applying a high voltage (no UV radiation exposure!)
• When un-programmed, the threshold (as seen by the select gate) is negative!
• Re-programmable; in general, in-system re-programmable
• Re-programming takes less time compared to EPROM technology
• Multiple voltage sources may be required
• Area occupied is twice that of EPROM!
All FPGAs contain the same basic resources
– Slices (grouped into CLBs)
  • Contain combinatorial logic and register resources
– IOBs
  • Interface between the FPGA and the outside world
– Programmable interconnect
– Other resources
  • Memory
  • Multipliers
  • Global clock buffers
  • Boundary scan logic
Configuring of FPGA
Millions of SRAM cells holding LUTs and interconnect routing.
Volatile memory: the configuration is lost when board power is turned off.
Keep the bit pattern describing the SRAM cells in non-volatile memory, e.g. a PROM.
Configuration takes ~ seconds.
FPGA Trends
State of the art is 40 nm on 300 mm wafers
Top of range: >500,000 logic blocks, >1,000 pins (fine-pitched BGA)
Logic block cost ~ $1 in 1990; today < 0.1 cent
Problems
Power. Leakage currents.
Design Gap
CAE Tool
FPGA design where a reset acts as a synchronization signal that sets all the storage elements to a known state. In a digital design, designers normally implement a global reset as an external pin to initialize the design on power-up. The global reset pin is similar to any other input pin and is often applied asynchronously to the FPGA.
Designers can then choose to use this signal to reset their design asynchronously or synchronously inside the FPGA. An optimal reset structure will enhance device utilization, timing and power consumption in an FPGA.
Devices in the Xilinx 7 series architecture contain eight registers per slice, and all these registers are D-type flip-flops. All of these flip-flops share a common control set. The control set of a flip-flop is the clock input (CLK), the active-high chip enable (CE) and the active-high SR port. The SR port in a flip-flop can serve as a synchronous set/reset or an asynchronous preset/clear port.
The RTL code that infers the flip-flop also infers the type of reset a flip-flop will use. The code will infer an asynchronous reset when the reset signal is present in the sensitivity list of an RTL process; the synthesis tool will then infer a flip-flop with an SR port configured as a preset or clear port. When the SR port is asserted, the flip-flop output is immediately forced to the SRVAL attribute of the flip-flop. In the case of synchronous resets, the synthesis tool will infer a flip-flop whose SR port is configured as a set or reset port. In addition, you can initialize the flip-flop output to the value the INIT attribute specifies. The INIT value is loaded into the flip-flop during configuration and when the global set/reset (GSR) signal is asserted.
The flip-flops in Xilinx FPGAs can support both asynchronous and synchronous reset and set controls. However, the underlying flip-flop can natively implement only one set / reset / preset / clear at a time. Coding for more than one set / reset / preset / clear condition in the RTL code will result in the implementation of one condition using the SR port of the flip-flop and the other conditions in fabric logic, thus using more FPGA resources.
If one of the conditions is synchronous and the other is asynchronous, the asynchronous condition will be implemented using the SR port and the synchronous condition in fabric logic. In general, it’s best to avoid more than one set/reset/preset/clear condition. Furthermore, only one attribute for each group of four flip-flops in a slice determines if the SR ports of flip-flops are synchronous or asynchronous.
Regardless of the reset type used (synchronous or asynchronous), you will generally need to synchronize the reset with the clock. As long as the duration of the global reset pulse is long enough, all the device flip-flops will enter the reset state. However, the deassertion of the reset signal must satisfy the timing requirements of the flip-flops to ensure that the flip-flops transition cleanly from their reset state to their normal state. Failure to meet this requirement can result in flip-flops entering a metastable state. Furthermore, for correct operation of some subsystems, like state machines and counters, all flip-flops must come out of reset on the same clock edge. If different bits of the same state machine come out of reset on different clocks, the state machine may transition into an illegal state. This reinforces the need to make the deassertion of reset synchronous to the clock.
For designs that use a synchronous reset methodology for a given clock domain, it is sufficient to use a standard metastability resolution circuit (two back-to-back flip-flops) to synchronize the global reset pin onto a particular clock domain. This synchronized reset signal can then initialize all storage elements in the clock domain by using the synchronous SR port on the flip-flops. Because both the synchronizer and the flip-flops to be reset are on the same clock domain, the standard PERIOD constraint of the clock covers the timing of the paths between them. Each clock domain in the device needs to use a separate synchronizer to generate a synchronized version of the global reset for that clock domain.
Sometimes a portion of a design is not guaranteed to have a valid clock. This can occur in systems that use recovered clocks or clocks that are sourced by a hot-pluggable module. In such cases, the storage elements in the design may need to be initialized with an asynchronous reset using the asynchronous SR port on the flip-flops. Even though the storage elements use an asynchronous SR port, the deasserting edge of the reset must still be synchronous to the clock. This requirement is characterized by the reset-recovery timing arc of the flip-flops, which is similar to a setup requirement of the deasserting edge of an asynchronous SR to the rising edge of the clock. Failure to meet this timing arc can cause flip-flops to enter a metastable state and synchronous subsystems to enter unwanted states.
You can see the change in slack.
Description:
Generate a new clock object from an existing physical clock object in the design.
Clocks can be added to a design in one of three ways:
* Primary physical or virtual clocks defined with the create_clock command.
* Derived clocks defined with the create_generated_clock command, generated from a primary physical clock.
* Derived clocks automatically generated by the Vivado Design Suite when a clock propagates to an MMCM/PLL/BUFR.
You can also use the create_generated_clock command to change the name of clocks that the Vivado tool has auto-derived from an MMCM/PLL/BUFR. In this case, a new clock is not created, but an existing clock defined on the specified source object is renamed to the provided name. This requires -name and <object> to be specified, and supports the use of -source and/or -master_clock to further identify the clock to rename when multiple clocks exist on the source object. Refer to the Vivado Design Suite User Guide: Using Constraints (UG903) for more information on renaming auto-derived clocks.
Note: You cannot rename a clock that is already in use by other constraints at the time of renaming. You must rename the clock prior to any other appearance or use of the clock in an XDC file.
This command returns the name of the clock object that is created, or returns an error if it fails.
Arguments:
-name <arg> - (Optional) The name of the generated clock to create on the specified object, or the name to assign to an existing clock on the specified object. If no name is specified, the generated clock will be given the name of the <object> it is assigned to. If assigned to multiple <objects>, the name will be that of the first object in the list.
-source <arg> - (Optional) The pin or port of the master clock from which to derive the generated clock. The master clock must be a previously defined physical clock, not a virtual clock, but it can be a primary clock or another generated clock. If the source pin or port currently has multiple clocks defined, the -master_clock option must be used to identify which clock on the source is to be used to define the generated clock.
-edges <arg> - (Optional) Specifies the edges of the master clock to use in defining transitions on the generated clock. Specify transitions on the generated clock in a sequence of 1, 2, 3, by referencing the appropriate edge count from the master clock in numerical order, counting from the first edge. The sequence of transitions on the generated clock defines the period and duty cycle of the clock: position 1 is the first rising edge of the generated clock, position 2 is the first falling edge of the generated clock and so defines the duty cycle, position 3 is the second rising edge of the generated clock and so defines the clock period. Enclose multiple edge numbers in braces {}. See the example below for specifying edge numbers.
-divide_by <arg> - (Optional) Divide the frequency of the master clock by the specified value to establish the frequency of the generated clock object. The value specified must be >= 1, and must be specified as an integer.
-multiply_by <arg> - (Optional) Multiply the frequency of the master clock by the specified value to establish the frequency of the generated clock object. The value specified must be >= 1, and must be specified as an integer.
-combinational - (Optional) Define a combinational path to create a "-divide_by 1" generated clock.
-duty_cycle <arg> - (Optional) The duty cycle of the generated clock, defined as a percentage of the new clock period when used with the -multiply_by argument. The value is specified as a percentage from 0.0 to 100.
-invert - (Optional) Create a generated clock with the phase inverted from the master clock.
-edge_shift <arg> - (Optional) Shift the edges of the generated clock by the specified values relative to the master clock. See the example below for specifying edge shift.
-add - (Optional) Add the generated clock object to an existing clock group specified by <objects>. Note: the -master_clock and -name options must be specified with -add.
-master_clock <arg> - (Optional) If there are multiple clocks found on the source pin or port, the specified clock object is the one to use as the master for the generated clock object. Note: the -add and -name options must be specified with -master_clock.
-quiet - (Optional) Execute the command quietly, returning no messages from the command. The command also returns TCL_OK regardless of any errors encountered during execution. Note: Any errors encountered on the command line, while launching the command, will be returned. Only errors occurring inside the command will be trapped.
-verbose - (Optional) Temporarily override any message limits and return all messages from this command. Note: Message limits can be defined with the set_msg_config command.
<objects> - (Required) The pin or port objects to which the generated clock should be assigned. If the specified objects already have a clock defined, use the -add option to add the new generated clock and not overwrite any existing clocks on the object.
Examples:
The following example defines a generated clock that is divided from the master clock found on the specified CLK pin. Since -name is not specified, the generated clock is assigned the same name as the pin it is assigned to:
create_generated_clock -divide_by 2 -source [get_pins clkgen/sysClk] fftEngine/clk
The following example defines a generated clock named CLK1 from the specified source clock, specifying the edges of the master clock to use as transition points for the generated clock, with edges shifted by the specified amount. In this example, the -edges option indicates that the second edge of the source clock is the first rising edge of the generated clock, the third edge of the source clock is the first falling edge of the generated clock, and the eighth edge of the source clock is the second rising edge of the generated clock. These values determine the period of the generated clock as the time from edge 2 to edge 8 of the source clock, and the duty cycle as the percentage of the period between edge 2 and edge 3 of the source clock. In addition, each edge of the generated clock is shifted by the specified amount:
create_generated_clock -name CLK1 -source CMB/CLKIN -edges {2 3 8} -edge_shift {0 -1.0 -2.0} CMB/CLKOUT
Note: The waveform pattern of the generated clock is repeated based on the transitions defined by the -edges argument.
This example creates two generated clocks from the output of a MUX, using -master_clock to identify which clock to use, using -add to assign the generated clocks to the Q pin of a flip-flop, and using -name to define a name for the generated clock, since the object it is assigned to has multiple clocks assigned:
create_generated_clock -source [get_pins muxOut] -master_clock M_CLKA -divide_by 2 -add -name gen_CLKA [get_pins flop_Q]
create_generated_clock -source [get_pins muxOut] -master_clock M_CLKB -divide_by 2 -add -name gen_CLKB [get_pins flop_Q]
The following example renames the automatically named clock that is derived by the Vivado Design Suite on the MMCM clock output:
create_generated_clock -name CLK_DIV2 [get_pins mmcm/CLKOUT1]
See Also:
* check_timing
* create_clock
* get_generated_clocks
* get_pins
* report_clocks
* set_clock_latency
* set_clock_uncertainty
* set_propagated_clock
Generated clocks are driven inside the design by special cells called Clock Modifying Blocks (for example, an MMCM), or by some user logic.
The XDC command "create_generated_clock" is used to create a generated clock object.
Syntax:
create_generated_clock [-name <arg>] [-source <args>] [-edges <args>]
    [-divide_by <arg>] [-multiply_by <arg>] [-combinational]
    [-duty_cycle <arg>] [-invert] [-edge_shift <args>]
    [-add] [-master_clock <arg>] [-quiet] [-verbose] <objects>
This article discusses the common use cases of creating a generated clock. For more information on create_generated_clock, please refer to (UG903).
Solution
Generated clocks are associated with a master clock from which they are derived. The master clock can be a primary clock or another generated clock.
Please ensure you define all primary clocks first. They are required for defining the generated clocks.
Use Case 1: Automatically Derived Clocks
For Clock Modifying Blocks (CMBs) such as MMCMx, PLLx, IBUFDS_GTE2, BUFR and PHASER_x primitives, you do not need to manually create the generated clocks.
Vivado automatically creates these clocks, provided the associated master clock has already been defined. You only need to create the primary clock that is feeding into the CMB.
The auto-generated clock names can be reported by the report_clocks command in the synthesized or implemented design so that you can use them in other commands or constraints.
It is possible to force the name of the generated clock that is automatically created by the tool. See "Use Case 2: Renaming Auto-derived Clocks" below.
An auto-generated clock is not created if a user-defined clock (primary or generated) is also defined on the same netlist object, that is, on the same definition point (net or pin).
Vivado gives the following warning message when an existing primary or generated clock prevents auto-generated clock propagation:
Warning: [Timing 38-3] User defined clock exists on pin <pin_name> and will prevent any subsequent automatic derivation.
Automatically Derived Clock Example
The following automatically derived clock example is a clock generated by an MMCM.
XDC constraint:
create_clock -name clkin -period 10.000 [get_ports clkin]
The report_clocks command prints the following information:
Clock    Period    Waveform            Attributes  Sources
clkin    10.00000  {0.00000 5.00000}   P           {clkin}
cpuClk   10.00000  {0.00000 5.00000}   P,G         {clkip/mmcm0/CLKOUT}
Use Case 2: Renaming Auto-derived Clocks
It is possible to force the name of the generated clock that is automatically created by the tool.
The renaming process consists of calling the create_generated_clock command with a limited number of parameters:
create_generated_clock -name new_name [-source source_pin] [-master_clock master_clk] source_object
A single create_generated_clock command has to specify a unique auto-derived clock to rename. A user-defined generated clock cannot be renamed.
Renaming Auto-derived Clock Example
Same example as in Use Case 1:
XDC constraints:
create_clock -name clkin -period 10.000 [get_ports clkin]
#renaming auto-derived clock cpuClk
create_generated_clock -name user_clk [get_pins clkip/mmcm0/CLKOUT]
Then the report_clocks command prints the following information:
Clock     Period    Waveform            Attributes  Sources
clkin     10.00000  {0.00000 5.00000}   P           {clkin}
user_clk  10.00000  {0.00000 5.00000}   P,G         {clkip/mmcm0/CLKOUT}
......
Use Case 3: User Defined Generated Clocks
When no automatic generation occurs, you will need to manually create the clock modifications.
For example, for clock divider logic that consists of LUTs and FFs, Vivado is not aware of the period relationship between the source clock and the divided clock. As a result, a user-defined generated clock is required for the divided clock.
This type of clock divider is not recommended in an FPGA. We recommend using an MMCM or a PLL to divide the clock.
Specify the master source using the -source option. This indicates a pin or port in the design through which the master clock propagates. It is common to use the master clock source point or the input clock pin of the generated clock source cell.
User Defined Generated Clock Example
The primary clock drives a register divider to create a divide-by-2 clock at the register output. Two equivalent constraints are provided below:
create_clock -name clkin -period 10 [get_ports clkin]
# Option 1: master clock source is the primary clock source point with a 'divide by' value of the circuit.
create_generated_clock -name clkdiv2 -source [get_ports clkin] -divide_by 2 [get_pins REGA/Q]
# Option 2: master clock source is the REGA clock pin with a 'divide by' value of the circuit.
create_generated_clock -name clkdiv2 -source [get_pins REGA/C] -divide_by 2 [get_pins REGA/Q]
Use Case 4: Forwarded Clock through ODDR
In the Source Synchronous application, the clock is regenerated in the source device and forwarded to the destination device along with the data. A common method is to use clock forwarding via a double data-rate register.
In the following example, the ODDR instance in the source device is used to generate the forwarded clock for the Source Synchronous interface.
A user-defined generated clock needs to be created for the forwarded clock in order to be used in the set_output_delay constraint for the Source Synchronous interface.
Example of Creating a Generated Clock at a Clock Output Port:
create_generated_clock -name fwd_clk -multiply_by 1 -source [get_pins ODDR_inst/C] [get_ports CLKOUT]
The generated clock can then be referenced in the set_output_delay command. For more information on the set_output_delay command, please refer to (UG903).
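As a sketch, the forwarded clock defined above can then time the interface outputs; the data port name and delay values below are hypothetical and would come from the destination device's setup/hold requirements and board delays:
# Hypothetical data port and delay values; fwd_clk is the generated clock created above
set_output_delay -clock [get_clocks fwd_clk] -max 2.000 [get_ports DATA_OUT]
set_output_delay -clock [get_clocks fwd_clk] -min -0.500 [get_ports DATA_OUT]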
Use Case 5: Overlapping Clocks Driven by a Clock Multiplexer
When two or more clocks drive into a multiplexer (or, more generally, a combinatorial cell), they all propagate through and become overlapped on the fanout of the cell.
For this reason, you must review the CDC paths and add new constraints to exclude false paths due to overlapping. The correct constraints are dictated by how and where the clocks interact in the design.
In some scenarios, user-defined generated clocks need to be created for the multiplexed clock in order to correctly constrain the CDC paths.
Multiplexed Clock Example:
If clk0 and clk1 only interact in the fanout of the multiplexer (FDM0 and FDM1), i.e. the paths A, B and C do not exist, it is safe to apply the clock groups constraint to clk0 and clk1 directly:
set_clock_groups -logically_exclusive -group clk0 -group clk1
If clk0 and/or clk1 directly interact with the multiplexed clock (i.e. the paths A or B or C exist), then in order to keep timing for paths A, B and C, the constraint cannot be applied to clk0 and clk1 directly. Instead, it must be applied to the portion of the clocks in the fanout of the multiplexer, which requires additional clock definitions.
In this case, two generated clocks are created at the multiplexer output pin and paths crossing the generated clock domains are ignored:
create_generated_clock -name clk0mux -divide_by 1 -source [get_pins mux/I0] [get_pins mux/O]
create_generated_clock -name clk1mux -divide_by 1 -add -master_clock clk1 -source [get_pins mux/I1] [get_pins mux/O]
set_clock_groups -physically_exclusive -group clk0mux -group clk1mux
I have a design consisting of a Clocking Wizard IP (MMCM) with a 100 MHz input clock at "clk_in1". I then generate a 50 MHz clock at output "clk_out1".
As per UG903 (pages 88 and 89), Xilinx automatically derives "create_generated_clock" constraints for the clocks generated using a PLL, MMCM, etc.
But when I checked my design, I could not see any "create_generated_clock" constraint defined automatically for "clk_out1".
You only need to create_clock for the input port of the MMCM; the output clock of the MMCM will then be automatically generated. You don't need to run create_generated_clock on the output of the MMCM manually.
You can check the result of report_clocks to see the auto-generated clocks.
Example: clk_pin_p is the input clock for the MMCM; clk_rx_clk_core/clk_tx_clk_core are the outputs of the MMCM.
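For the Clocking Wizard case above, a minimal sketch (assuming clk_in1 is a top-level port, as in the question, and 100 MHz means a 10 ns period):
create_clock -name clk_in1 -period 10.000 [get_ports clk_in1]
# after synthesis, list the auto-derived MMCM output clocks (e.g. clk_out1)
report_clocks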
The following example shows the use of multiple through points to define both a specific path (through state_reg1) and alternate paths (through count_3 or count_4), and writes the timing results to the specified file:
report_timing -from go -through {state_reg1} -through {count_3 count_4} -to done -path_type summary -file C:/Data/timing1.txt
report_timing -from [get_pins clk_in_IBUF_BUFG_inst/clk_in_IBUF_inst] -to [get_pins {freq_cnt_reg[24]/R}]
During the micro-architecture or detailed design phase, FPGA resource requirements shall be estimated. Module designers shall have a "detailed view" of the design down to the function/major component level for near-accurate estimates. At the end of this phase, the exact FPGA part to be used shall be finalized from the chosen family.
Following are critical aspects that need to be considered during this phase:
1. FPGA device Architecture: Detailed investigation and understanding of FPGA device architecture/capabilities including logic cells, RAMs, multipliers, DLL/PLL and IOs
2. Module boundaries: All module interfaces shall be on a register boundary.
3. Internal bus structure: A well-defined internal point-to-point bus structure is preferred to routing all signals back and forth.
4. Clocks: Clock multiplexing and gating shall be avoided and if required shall be done based on device capabilities
5. Resets: The number of resets in the system shall be optimized based on the dedicated reset routing resources available
6. Register file: Instead of creating one common register file and routing register values to all modules, it is better to have registers wherever they are used. If needed, registers may even be duplicated. It should be noted that though the write path may be a multi-cycle path, the read path may not be. Also, registers shall be implemented in RAM wherever possible
7. Selection of memories/multipliers: The memory size requirement shall decide whether to use hard-macros or to build with logic. For small memories, it is not preferred to map to large memory hard-macros, even though building them from logic might take additional logic resources. The primary reason for this is that hard-macro memory locations are fixed, and placing driving/receiving logic next to memories is not always possible. Similarly, it is not advantageous to map a small multiplier (such as 3x3) to an 18x18 hard-macro multiplier.
8. Data/Control path mixing: Often it is advantageous to store control signals along with data bits in memories and pass-on to other modules. For example let us consider 16 data bits and 2 control bits to be transferred from one module to another through memory. These 18 bits can be stored as data bits in available block-memory of size say 1kx18 block memories. Also this method will be further advantageous if the hand-shake is asynchronous.
9. Big multiplexer structures: It is not preferred to build very big multiplexer structures (say 256:1) especially for timing critical paths. Instead smaller multiplexers can be built, which are more controllable.
10. High-level Floorplan: A high-level floorplan including IO planning shall be worked out (as shown in Figure 1) based on the gate count and other macro estimates. Also, spare area shall be planned for future/field upgrades. At this stage it is not necessary to fix the IO locations, but it is necessary to fix the IO banks in the FPGA. Having done the high-level floorplan, the budgeted area shall be known to the module-level designers. Also, interface module floorplan locations shall be known to the module-level designers, which will enable them to further floorplan the allocated area if necessary (a floorplanning constraint sketch is given after this list). Some of the high-level floorplanning considerations are:
a. Controlling congestion along with proximity
b. Draw the data flow diagram of the design with the memories that are used to terminate the data paths and do module level area allocation
c. Interdependent modules should be closer
d. The area allocated to a module shall be close to the macros it interfaces with
e. Free area (rows and columns) between module area allocations will aid inter-module routing in the full chip
f. Clock resources and routing limitations, if any
11. Module output replication: Based on the initial floorplan, each module's outputs might have to be replicated if the modules receiving this data are located in different corners of the chip.
12. Best practices: RTL coding guidelines shall be passed on to module level designers.
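As referenced in item 10, high-level floorplanning can be captured with Pblock constraints. A minimal XDC sketch, in which the Pblock name, cell path and site ranges are hypothetical and depend on the chosen device:
# Hypothetical region for one module; adjust the cell path and site ranges to the actual device
create_pblock pblock_rx_engine
add_cells_to_pblock [get_pblocks pblock_rx_engine] [get_cells top/rx_engine]
resize_pblock [get_pblocks pblock_rx_engine] -add {SLICE_X0Y50:SLICE_X35Y99}
resize_pblock [get_pblocks pblock_rx_engine] -add {RAMB36_X0Y10:RAMB36_X1Y19}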
RTL coding phase
Following are critical aspects which need to be considered during RTL coding phase:
1. Logic delay: Though it may be adequate to maintain a logic delay of around 50% of the cycle time, it is desirable to keep high-speed paths in the design lower than that, say at 20-30%. Usually there are abundant resources such as flip-flops (normally one flip-flop for each look-up table), RAMs, multipliers, etc. Wherever it doesn't affect throughput, additional pipeline stages can be introduced judiciously, keeping in mind routing congestion issues.
2. Device mapping efficiency: The RTL code shall enable the best FPGA mapping by exploiting the device architecture. One such example: in the Xilinx Virtex-II FPGA there is an additional 2:1 MUX (F5) between two LUTs with dedicated routes. If a 4:1 MUX is coded as a single entity, it will map well into one slice with two LUTs and an F5 MUX. If instead the 4:1 MUX is built with pipelining after the 2:1 MUXes, it cannot be mapped to the F5 MUX and an additional slice is needed. Another example: a long register-based shift register can be mapped to the SRL configuration of a LUT, provided none of those registers needs a reset.
3. Fan-out: Though synthesis tools can do automatic fan-out control, manual control is needed especially for signals interfacing to hard-macros, as tools will treat everything in the same manner and the hard-macros are often black-boxes.
4. Vendor specific structures and instantiations: Create hierarchy around them to give freedom to migrate from one technology to another.
5. Macro interface: All the inputs/outputs of macros shall be registered due to their fixed locations.
6. Gated clocks: Avoid gated clocks and use clock enables instead.
7. Critical logic: Place critical logic in separate hierarchy
8. Critical paths: Make sure that they are not crossing hierarchy of the block by registering all the outputs.
9. Tri-state buffers: For low speed paths, it is desirable to use tri-state buffers to save logic cells
10. Unused hard-macros: Unused RAMs can be used as register set or to map state machines coded as look up tables. This will also avoid large multiplexers in the read path. Also unused multipliers can be used as long shifters.
11. False and multi-cycle paths: False and multicycle paths shall not be pipelined; they shall be identified by design and passed on to the synthesis tool.
12. Trial synthesis and P&R: Each module-level designer shall perform individual module-level synthesis and P&R of the design with the given floorplan and optimize the RTL code while it is being developed (a module-level synthesis sketch is given after this list). If the IO requirement of a module exceeds the device's physical IOs, dummy logic can be added to demultiplex/multiplex few-pins-to-more-pins and/or more-pins-to-few-pins using shift-register structures and/or OR-gate structures, as shown in Figure 2. Also, as shown in this figure, insert additional flip-flops on the interfaces from the selected module to other modules while leaving the actual IO interfaces the same. This will eliminate skewed timing results due to the dummy logic and connections. Also, black-box timing information shall be used during synthesis to avoid skewed timing results.
13. Module-level floorplanning: Within the given floorplan area, it is often desirable to do sub-module-level floorplanning. In this sub-module-level floorplanning it is often necessary to floorplan only the critical parts of the design. It is also necessary to do an individual synthesis compile of the timing-critical sub-modules being floorplanned, which will prevent hierarchy loss (as shown in Figure 3) and thereby inefficient placement.
14. Logic compression: Though from area standpoint it is preferred to do maximum level packing of unrelated logic (for example using COMPRESSION with Xilinx flow), it will have adverse impact on timing. Thus unrelated logic packing level shall be set based on timing criticality of each sub-module.
15. IO allocation: The respective module IO fixing shall be done based on IO ring pin sequence on the die rather than pin sequence on the package.
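As mentioned in item 12, module-level trial synthesis can be scripted. A minimal non-project Vivado sketch, assuming a hypothetical module name, source file and part; out-of-context mode keeps the module IO unconstrained by package pins:
# Hypothetical module and part numbers
read_verilog rx_engine.v
synth_design -top rx_engine -part xc7k325tffg900-2 -mode out_of_context
report_utilization
report_timing_summary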
Chip level Synthesis phase
During the chip level synthesis phase, following information shall be collected from individual module designers:
1. Area constraints with unrelated logic compression information
2. Timing constraints including false and multicycle paths
3. IO assignments
4. Black-box timing information
5. Synthesis compile hierarchy
6. Timing critical sub-module information
Module-level synthesis has to be carried out with the information gathered from the designers. Merely meeting the target frequency at the synthesis stage is not good enough, as route estimates are inaccurate. Instead, if the logic delay achieved is 50% of the cycle time, we can say we have achieved the best possible results out of synthesis and can move on to further steps.
The resource sharing and fan-out control options in the synthesis tool can be enabled for non-timing-critical sub-modules, whereas synthesis tool options such as register replication, fan-out control and retiming can be enabled for timing-critical sub-modules. Thus, in the chip top-level synthesis compilation, all modules will be black-boxes. Automated push-button physical synthesis has yielded only 10-15% overall improvement in performance after P&R. However, there are physical synthesis tools (e.g. Synplify Premier) which support floorplanning at the synthesis stage. The methodology described in this paper is equally applicable to netlist-based floorplanning or physical-synthesis-based design floorplanning.
TIMING ANALYSIS
Timing Assertions Section
Primary clocks
Virtual clocks
Generated clocks
Clock Groups
Input and output delay constraints
Timing Exceptions Section
False Paths
Max Delay / Min Delay
Multicycle Paths
Case Analysis
Disable Timing
Physical Constraints Section: can be located anywhere in the file, preferably before or after the timing constraints, or stored in a separate constraint file.
Start with the clock definitions.
The clocks must be created before they can be used by any subsequent constraints.
Any reference to a clock before it has been declared results in an error and the corresponding constraint is ignored. This is true within an individual constraint file, as well as across all the XDC files (or Tcl scripts) in your design.
The order of the constraint files matters. You must be sure that the constraints in each file do not rely on the constraints of another file.
If this is the case, you must read the file that contains the constraint dependencies last.
If two constraint files have interdependencies, you must either:
• Merge them manually into one file that contains the proper sequence, or
• Divide the files into several separate files, and order them correctly.
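As an illustration, a minimal ordering sketch (all names are hypothetical) that respects these dependencies within one XDC file:
# 1. Primary clocks first
create_clock -name sys_clk -period 10.000 [get_ports sys_clk_p]
# 2. Generated clocks that reference a clock already defined
create_generated_clock -name clk_div2 -source [get_ports sys_clk_p] -divide_by 2 [get_pins div_reg/Q]
# 3. Constraints that reference those clocks (I/O delays, exceptions, ...)
set_input_delay -clock [get_clocks sys_clk] -max 3.000 [get_ports data_in]
set_output_delay -clock [get_clocks sys_clk] -max 2.000 [get_ports data_out]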
Start with no timing constraints.
Use the IDE to view the clock networks.
Create basic clocks.
Review timing reports for coverage.
Open the synthesized design; see the schematic.
Report clock networks; click constraints; see the unconstrained clocks.
Before editing, run report_clocks.
Edit the timing constraints.
Click the generated clock.
Description:
Create a generated clock object.
Syntax:
create_generated_clock [-name <arg>] [-source <args>] [-edges <args>]
    [-divide_by <arg>] [-multiply_by <arg>] [-combinational]
    [-duty_cycle <arg>] [-invert] [-edge_shift <args>]
    [-add] [-master_clock <arg>] [-quiet] [-verbose] <objects>
Returns:
new clock object
Usage:
Name              Description
-----------------------------
[-name]           Generated clock name
[-source]         Master clock source object pin/port
[-edges]          Edge specification
[-divide_by]      Period division factor: Value >= 1 (Default: 1)
[-multiply_by]    Period multiplication factor: Value >= 1 (Default: 1)
[-combinational]  Create a divide_by 1 clock through combinational logic
[-duty_cycle]     Duty cycle for period multiplication: Range: 0.0 to 100.0 (Default: 0.0)
[-invert]         Invert the signal
[-edge_shift]     Edge shift specification
[-add]            Add to the existing clock in source_objects
[-master_clock]   Use this clock if multiple clocks present at master pin
[-quiet]          Ignore command errors
[-verbose]        Suspend message limits during command execution
<objects>         List of clock source ports, pins, or nets
Categories:
SDC, XDC
report_timing
If the start/end points are ports, they can be referenced directly; if they are pins, use Tcl get_pins queries and run:
report_timing -from [get_pins ...] -to [get_pins ...]
View the contents of the report.
Timing reports always start from the primary clock, propagate to the generated clocks and then on to the clocked elements.
Observe the destination path clock timing.
See the requirements of the generated clock.
See the destination clock: the start of the next clock edge, and on to the destination register.
Slack is the required time minus the arrival time.
Open the schematic of the netlist, select clk_gen in the window and zoom in.
Observe the divider in the schematic, which is periodically enabled to generate the destination clocks.
In this case the generated clocks do not have a predefined relationship with the primary clock's clk pin.
As a result, create the Tcl command create_generated_clock (see the sketch at the end of this walkthrough).
Save the constraints.
Enter the Tcl command report_clocks.
Observe the new generated clock included in the timing reports.
Click Report Timing Summary.
Select the new generated clock.
See the intra-clock path; see the setup. Double-click any path to view the path properties.
See the source clock path.
The source clock delay starts at the primary clock and propagates to the generated clocks, both automatically derived and manually created ones.
See the destination clock path: it starts from the primary clock and propagates to the generated clocks.
Close the properties.
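A hedged sketch of that command, assuming the divider register inside clk_gen is called clk_div_reg and divides the clock by 4 (the names and ratio are hypothetical; read them from your own schematic):
create_generated_clock -name clk_div -source [get_pins clk_gen/clk_div_reg/C] -divide_by 4 [get_pins clk_gen/clk_div_reg/Q]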
set input_delay
report_timing -from [all_inputs]
See the reports.
set_input_delay -clock [get_clocks phyclk0] -max 3 $my_input
report_timing -from $my_input -setup
See the reports; see the slack and the input delay.
Insert the Tcl below:
set_input_delay -clock [get_clocks phyclk0] -min 1 $my_input
report_timing -from $my_input -hold
The report shows the actual slack and the input delay.
In addition, see the Edit Timing Constraints dialog, where you can add set_input_delay.
set output_delay
report_timing -to [all_outputs]
See the reports.
set_output_delay -clock [get_clocks sysclk0] $my_output
report_timing -to $my_output -hold
See the reports; see the actual slack and the output delay.
In addition, see the Edit Timing Constraints dialog, where you can add set_output_delay.
set clock groups
Open the synthesized design.
Report clock interaction.
Analyze the timing paths from one clock domain to another clock domain.
The report shows a grid of clock interactions. Each cell of the grid is colored to indicate the timing and constraint status of the inter-clock paths.
If two clock domains cannot meet timing and their clock frequencies are not integer multiples, it is impossible to find a fixed phase relationship between them; they should be treated as asynchronous.
report_clock_interaction -delay_type min_max -significant_digits 3 -name timing_1
See the WNS and choose max.
Suppose you have two clocks such as clk_out1_clk_core and clk_out2_clk_core:
join [get_timing_paths -from [get_clocks clk_out1_clk_core] -to [get_clocks clk_out2_clk_core] -max_paths 200]
join [get_timing_paths -from [get_clocks clk_out2_clk_core] -to [get_clocks clk_out1_clk_core] -max_paths 200]
Two clocks marked as asynchronous:
Launch Timing Constraints - Edit.
Double-click Set Clock Groups.
You need to specify two clock groups.
Add the first clock group.
Add the second by clicking the + sign.
Note that asynchronous is chosen by default.
Save the constraints.
Report clock interaction.
Observe that the inter-clock path between clock group 1 and clock group 2 is now shown in blue, confirming that the groups are treated as asynchronous.
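The equivalent Tcl constraint, illustrated here with the clock names used in the earlier report (substitute the actual members of each group):
set_clock_groups -name async_grp1_grp2 -asynchronous -group [get_clocks clk_out1_clk_core] -group [get_clocks clk_out2_clk_core]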
Set false path
report_timing_summary
See the inter-clock path.
See the setup.
set_false_path -from [get_pins ...] -to [get_pins ...]
report_timing -from [get_pins ...] -to [get_pins ...]
If the slack is infinite, the false path is successful; if not, check the false path.
Set multicycle path
report_timing
See the inter-clock path.
See the setup.
See the violated slack, if any.
See the requirement.
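If the path genuinely has more than one clock cycle available, the exception can be entered as sketched below; the start and end points are placeholders, and the hold multiplier that normally accompanies a setup multiplier is shown as well:
# Hypothetical start/end points; setup relaxed to 2 cycles, hold kept at the original edge
set_multicycle_path -setup 2 -from [get_pins {src_reg[*]/C}] -to [get_pins {dst_reg[*]/D}]
set_multicycle_path -hold 1 -from [get_pins {src_reg[*]/C}] -to [get_pins {dst_reg[*]/D}]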
What are the guidelines followed for a good floorplan?
Steps followed with the aim of generating an optimum floorplan:
1. Defining the core area using the Specify Floor Plan form.
2. Defining the ports specified by the top-level engineer.
3. Placing the macros inside the core area.
4. Placing the macros which communicate with each other together, with the help of fly lines.
5. Color by hierarchy and data flow diagrams.
6. Avoiding the placement of macros in front of ports.
7. Arranging the macros to get a contiguous core area.
8. Defining halos.
9. Defining placement and routing blockages.
3.5 Place & Route phase
During the chip-level P&R phase, the following information shall be collected from the synthesis stage along with the netlists:
1. Area constraints with unrelated logic compression information
2. Timing constraints including false and multicycle paths
3. IO assignments
4. Timing-critical sub-module information
As a first step in P&R process, fix the locations of top level primitives such as global clock buffers, DLL/PLL and IOs. After first pass P&R, usually on a complex design the route delays are high and there are many violator paths even after doing module level floorplanning.
At this stage it is better to look for area regions where the route delays are higher rather than trying to solve timing issues one at a time.
Top one or two timing violator areas shall be found and unrelated logic compression shall be reduced and/or area of such region shall be increased. The area left out during high-level floorplan will be helpful here. This process shall be continued till number of timing violators come down to few in number.
At this stage it might be better to go for the option of constructing relationally placed macros (RPMs) for hard-to-meet timing paths, such as huge multiplexers, FIFOs, etc.
The core generator tools (such as Coregen from Xilinx) can generate components with RPM. It is also possible to generate custom RPM structures based on floorplan by instantiating FPGA primitives. These macros will be placed as a group of elements in the defined area region.
Especially this RPM structure method is extremely useful for short timing critical paths. After performing all these steps, there may be still last few timing violators.
One way of handling these violators is to open the placed and routed design in FPGA editing tools and fix them by moving elements around. This is often a tedious manual process, which might even impact other paths. It would be of great help if P&R tools could automate this process, even under manual guidance, by doing what-if analysis.
Clocking topics:
– Distribution
– Generation (frequency synthesis)
– Deskew
– Multiplexing
– Runtime programming
– Domain crossing
Clock Distribution
Device split into halves (top/bottom) and clock regions (rectangular, uniform size)
Different clock buffer types:
– BUFG can clock any FF within the top or bottom of the device (top/bottom BUFGs)
– BUFR clocks a single clock region
– BUFMR clocks up to 3 clock regions
– BUFH clocks a horizontal clock spine within a region
FPGA CLOCKING
Clock generation (frequency synthesis)
Uses "Clock Management Tiles", which consist of:
• PLL/DCM (frequency synthesis)
• MMCM (advanced PLL with phase control)
– The clock input from the PCB must use the "clock capable" pins of the FPGA
• Differential pairs
• Single-ended clocks always use the P side
FREQUENCY SYNTHESIS
Common use case: generate all design clocks from a single input clock (crystal oscillator)
CLOCK MUX
Many applications require clock multiplexing:
– In-circuit debugging (to avoid domain crossing)
– Dynamic frequency scaling
– Clock redundancy (switch away from a dead clock)
FPGA clock multiplexers (2:1) are implemented with the BUFGMUX_CTRL primitive
Clock switch latency: max 3 clock cycles of the slower clock
Glitchless output
CLOCK RELATED CONSTRAINTS
Timing constraints:
– PERIOD: guides timing analysis with regard to a periodic signal (clock)
  NET "<clock_net>" TNM_NET = "<timing_group>";
  TIMESPEC "TS_<name>" = PERIOD "<timing_group>" <period> HIGH/LOW <percent>%;
– OFFSET IN/OUT
– FROM:TO
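For reference, the equivalent of the UCF PERIOD constraint in current Vivado XDC flows is a create_clock on the clock port or net; a minimal sketch with hypothetical names:
# 100 MHz clock, 50% duty cycle, on a hypothetical port sys_clk
create_clock -name sys_clk -period 10.000 -waveform {0.000 5.000} [get_ports sys_clk]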
CLOCK DOMAIN CROSSING
FPGAs have rapid average metastability recovery (ps), but recovery is unbounded (the probability is non-zero that the metastable state will last a given time T).
Design goal: achieve a desired MTBF, given the recovery parameters of the flip-flop.
Synchronizer: multiple FFs at the receiving end guard against metastability.
The number of FFs is a function of the desired MTBF and the ratio of clock frequencies. Register the output from the source domain for more predictable timing (increased MTBF).
2FF synchronizers work well when the destination clock is faster than the source clock (the signal will remain stable for at least one destination clock cycle). A fast-to-slow crossing requires a closed-loop synchronizer with handshake.
Closed loop: the control signal crosses into clock domain 2, then back into clock domain 1, and is checked against the reference (high latency).
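A minimal XDC sketch for constraining such a 2FF synchronizer, with hypothetical register names: mark the synchronizer flip-flops with ASYNC_REG so the tools keep them together and preserve them, and bound the crossing by datapath delay rather than by the (unrelated) clock phase:
# Hypothetical synchronizer register names
set_property ASYNC_REG TRUE [get_cells {sync_reg0_reg sync_reg1_reg}]
set_max_delay -datapath_only -from [get_cells src_data_reg] -to [get_cells sync_reg0_reg] 5.000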
GTX TRANSCEIVER
The transceiver is referred to as GTX (Gigabit Transceiver), but other variants of transceivers, for instance GTH and GTZ, are to a large extent the same components with different bandwidth capabilities.
The 7 series FPGAs GTX and GTH transceivers are power-efficient transceivers, supporting line rates from 500 Mb/s to 12.5 Gb/s for GTX transceivers and 13.1 Gb/s for GTH transceivers.
The GTX/GTH transceiver is highly configurable and tightly integrated with the programmable logic resources of the FPGA.
The GTX/GTH transceiver supports these use modes:
• PCI Express, Revision 1.1/2.0/3.0
• 10GBASE-R
• Interlaken
• 10 Gb Attachment Unit Interface (XAUI), Reduced Pin eXtended Attachment Unit Interface (RXAUI), 100 Gb Attachment Unit Interface (CAUI), 40 Gb Attachment Unit Interface (XLAUI)
• Common Packet Radio Interface (CPRI™)/Open Base Station Architecture Initiative (OBSAI)
• OC-48/192
• OTU-1, OTU-2, OTU-3, OTU-4
• Serial RapidIO (SRIO)
• Serial Advanced Technology Attachment (SATA)/Serial Attached SCSI (SAS)
• Serial Digital Interface (SDI)
• SFF-8431 (SFP+)
Overview
GTXs, which are the basic building block of common interface protocols (e.g. PCIe and SATA), are becoming an increasingly popular solution for communication between FPGAs.
As the GTX instance presents a clock and parallel data interface, it's easy to mistake it for a simple channel that moves the data to the other end in a failsafe manner.
A more realistic view of the GTX is as the front end of a modem, with possible bit errors and a need to synchronize serial-to-parallel data alignment at the receiver.
Designing with the GTX also requires attention to classic communication-related topics, e.g. the use of data encoding, equalizers and scramblers. As a result, there are a few application-dependent pieces of logic that need to be developed to support the channel:
The possibility of bit errors on the channel must be handled
The alignment from a bit stream to a parallel word must be taken care of (which bit is the LSB of the parallel word in the serial stream?)
If the transmitter and receiver aren't based on a common clock, a protocol that injects and tolerates idle periods in the data stream must be used, or the clock difference will cause data underflows or overflows. Sending the data in packets is a common solution. In the pauses between these packets, special skip symbols must be inserted into the data stream, so that the GTX receiver's clock correction mechanism can remove or add such symbols in the stream presented to the application logic, which runs at a clock slightly different from the received data stream.
Odds are that a scrambler needs to be applied on the channel. This requires logic that creates the scrambling sequence as well as synchronizes the receiver. The reason is that an equalizer assumes that the bit stream is uncorrelated on the average. Any average correlation between bit positions is considered ISI and is “fixed”.
It's not uncommon that no bit errors are ever observed on a GTX channel, even at very high rates, and possibly with no equalization enabled. This can't be relied on, however, as there is in fact no express guarantee for the actual error probability of the channel.
Enhanced Features in Virtex7 :
· 2-byte and 4-byte internal datapath to support different line rate requirements.
· Quad-based LC tank PLL (QPLL) for best jitter performance and channel-based ring oscillator PLL.
· Power-efficient, adaptive linear equalizer mode called the low-power mode (LPM) and a high-performance, adaptive decision feedback equalization (DFE) mode to compensate for high frequency losses in the channel while providing maximum flexibility.
· RX margin analysis feature to provide non-destructive, 2-D post-equalization eye scan.
Clocking
The clocking of the GTXs is an issue in itself. Unlike the logic fabric, each GTX has a limited number of possible sources for its reference clock.
It’s mandatory to ensure that the reference clock(s) are present in one of the allowed dedicated inputs.
Each clock pin can function as the reference clock of up to 12 particular GTXs.
It’s also important to pay attention to the generation of the serial data clocks for each GTX from the reference clock(s).
It’s not only a matter of what multiplication ratios are allowed, but also how to allocate PLL resources and their access to the required reference clocks.
QPLL vs. CPLL
Two types of PLLs are available for producing the serial data clock, typically running at several GHz: QPLLs and CPLLs.
The GTXs are organized in groups of four (“quads”). Each quad shares a single QPLL (Quad PLL), which is instantiated separately (as a GTXE2_COMMON). In addition, each GTX has a dedicated CPLL (Channel PLL), which can generate the serial clock for that GTX only.
Each GTX may select its clock source from either the (common) QPLL or its dedicated CPLL. The main difference between these is that the QPLL covers higher frequencies.
High-rate applications are hence forced to use the QPLL. The downside is that all GTXs sharing the same QPLL must have the same data rate (except that each GTX may divide the QPLL's clock by a different ratio).
The CPLLs allow greater flexibility in clock rates, as each GTX can pick its clock independently, but with a limited frequency range.
Jitter
Jitter on the reference clock(s) is the silent killer of GTX links. It is often neglected by designers because "it works anyhow", but jitter on the reference clock has a disastrous effect on the channel's quality, which can be far worse than that of a poor PCB layout.
As both jitter and poor PCB layout (and/or cabling) contribute to the bit error rate and the channel's instability, the PCB design is often blamed when things go bad. This leads to the belief that GTX links are extremely sensitive to every via or curve in the PCB trace, which is not the case at all. The link is, on the other hand, very sensitive to the reference clock's jitter; a poorly chosen reference clock cannot be compensated for with a very clean PCB trace.
Jitter is commonly modeled as a noise component which is added to the timing of the clock transition, i.e. t=kT+n (n is the noise). Consequently, it is often defined in terms of the RMS of this noise component, or a maximal value which is crossed at a sufficiently low probability.
The treatment of a GTX's reference clock requires a slightly different approach; the RMS figures are not necessarily a relevant measure. In particular, clock sources with excellent RMS jitter may turn out inadequate, while other sources, with less impressive RMS figures, may work better.
Since the QPLL or CPLL locks on this reference clock, jitter on the reference clock results in jitter in the serial data clock. The prevailing effect is on the transmitter, which relies on this serial data clock; the receiver is mainly based on the clock it recovers from the incoming data stream, and is therefore less sensitive to jitter.
Some of the jitter – in particular “slow” jitter (based upon low frequency components) is fairly harmless, as the other side’s receiver clock synchronization loop will cancel its effect by tracking the random shifts of the clock.
On the other hand, very fast jitter in the reference clock may not be picked up by the QPLL/CPLL, and is hence harmless as well.
All in all, there’s a certain band of frequency components in the clock’s timing noise spectrum, which remains relevant:
The band that causes jitter components which are slow enough for the QPLL/CPLL to track and hence present on the serial data clock, and too fast for the receiver’s tracking loop to follow.
The measurable expression for this selective jitter requirement is given in terms of phase noise frequency masks, or sometimes as the RMS jitter in bandwidth segments. The spectral behavior of clock sources is often more difficult to predict: Even when noise spectra are published in datasheets, they are commonly given only for certain scenarios as typical figures.
8b/10b encoding
Several standardized uses of MGT channels (SATA, PCIe, DisplayPort etc.) involve a specific encoding scheme between payload bytes for transmission and the actual bit sequence on the channel.
Each (8-bit) byte is mapped to a 10-bit word, based upon a rather peculiar encoding table.
The purpose of this encoding is to ensure a balance between the number of 0's and 1's on the physical channel, allowing AC-coupling of the electrical signal. This encoding also ensures frequent toggling between 0's and 1's, which ensures proper bit synchronization at the receiver by virtue of the clock recovery loop ("CDR").
As there are 1024 possible code words covering 256 possible input bytes, some of the excess code words are allocated as control characters. In particular, a control character designated K.28.5 is often referred to as "comma", and is used for synchronization.
The 8b/10b encoding is not an error correction code despite its redundancy, but it does detect some errors, if the received code word is not decodable. On the other hand, a single bit error may lead to a completely different decoded word, without any indication that an error occurred.
Scrambling
To put it short and concise: If an equalizer is applied, the user-supplied data stream must be random.
If the data payload can’t be ensured to be random itself (this is almost always the case), a scrambler must be defined in the communication protocol, and applied in the logic design.
Applying a scrambler on the channel is a tedious task, as it requires a synchronization mechanism between the transmitter and receiver.
It’s often quite tempting to skip it, as the channel will work quite well even in the absence of a scrambler, even where it’s needed. However in the long run, occasional channel errors are typically experienced.
The problem equalizers solve is the filtering effect of the electrical media (the “channel”) through which the bit stream travels.
Both cables and PCBs reduce the strength of the signal, but even worse: The attenuation depends on the frequency, and reflections occur along the metal trace. As a result, the signal doesn’t just get smaller in magnitude, but it’s also smeared over time.
A perfect, sharp, step-like transition from -1200 mV to +1200mV at the transmitter’s pins may end up as a slow and round rise from -100mV to +100mV. Because of this slow motion of the transitions at the receiver, the clear boundaries between the bits are broken.
Each transmitted bit keeps leaving its traces well after its time period. This is called Inter-Symbol Interference (ISI): the received voltage at the sampling time for the bit at t=0 depends on the bits at t=-T, t=-2T and so on.
Each bit effectively produces noise for the bits coming after it. This is where the equalizer comes in. The input of this machine is the time sample of the bit at t=0, but also a number of measured voltage samples of the bits before and after it.
By making a weighted sum of these inputs, the equalizer manages, to a large extent, to cancel the Inter-Symbol Interference. In a way, it implements a reverse filter of the channel.
There are different techniques for training an equalizer to work effectively against the channel’s filtering. For example, cellular phones do their training based upon a sequence of bits on each burst, which is known in advance. But when the data stream runs continuously, and the channel may change slightly over time (e.g. a cable is being bent) the training has to be continuous as well. The chosen method for the equalizers in GTXs is therefore continuous.
The Decision Feedback Equalizer, for example, starts with making a decision on whether each input bit is a '0' or '1'. It then calculates the noise signal for this bit by subtracting the expected voltage for a '0' or '1' (whichever was decided upon) from the measured voltage.
The algorithm then slightly alters the weighted sums in a way that removes any statistical correlation between the noise and the previous samples.
This works well when the bit sequence is completely random: There is no expected correlation between any input sample, and if such exists, it’s rightfully removed. Also, the adaptation converges into a compromise that works on the average best for all bit sequences.
But what happens if there is a certain statistical correlation between the bits in the data itself?
The equalizer will specialize in reducing the ISI for the bit patterns occurring more often, possibly doing very bad on the less occurring patterns.
The equalizer’s role is to compensate for the channel’s filtering effect, but instead, it adds an element of filtering of its own, based upon the common bit patterns. In particular, note that if a constant pattern runs through the channel when there’s no data for transmission (zeros, idle packets etc.) the equalizer will specialize in getting that no-data through, and mess up with the actual data.
One could be led to think that the 8b/10b encoding plays a role in this context, but it doesn't. Even though it cancels out DC on the channel, it does nothing about the correlation between the bits. For example, if the payload for transmission consists of zeros only, the encoded words on the channel will be either 1001110100 or 0110001011. The DC on the channel will remain zero, but the statistical correlation between the bits is far from zero. So unless the data is inherently random (e.g. an encrypted stream), using an equalizer means that the data which is supplied by the application to the transmitter must be randomized.
The common solution is a scrambler:
XORing the payload data by a pseudo-random sequence of bits, generated by a simple state machine. The receiver must XOR the incoming data with the same sequence in order to retrieve the payload data. The comma (K28.5) symbol is often used to synchronize both state machines.
In GTX applications, the (by far) most commonly used scrambler is the G(X) = X^16 + X^5 + X^4 + X^3 + 1 LFSR, which is defined in a friendly manner in the PCIe standard.
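A bit-serial sketch of such a scrambler is shown below (one payload bit per clock; a byte- or word-wide version unrolls the same recurrence). The module name, seed value and resynchronization strategy are illustrative rather than taken from any particular standard, so the exact tap/seed conventions should be checked against the governing spec:

module scrambler_1b (
    input  wire clk,
    input  wire rst,    // e.g. re-seed on a received comma so both ends stay in step
    input  wire din,    // payload bit
    output wire dout    // scrambled bit = din XOR LFSR output
);
    reg [15:0] lfsr = 16'hFFFF;

    assign dout = din ^ lfsr[15];

    always @(posedge clk) begin
        if (rst)
            lfsr <= 16'hFFFF;
        else begin
            // Galois-style update for G(X) = X^16 + X^5 + X^4 + X^3 + 1:
            // the bit shifted out of position 15 is fed back into the tap positions.
            lfsr[0]    <= lfsr[15];
            lfsr[2:1]  <= lfsr[1:0];
            lfsr[3]    <= lfsr[2] ^ lfsr[15];
            lfsr[4]    <= lfsr[3] ^ lfsr[15];
            lfsr[5]    <= lfsr[4] ^ lfsr[15];
            lfsr[15:6] <= lfsr[14:5];
        end
    end
endmodule

The receiver runs an identical LFSR and XORs the incoming bits with its output; the comma (K28.5) symbol, typically sent unscrambled, is the usual point at which both LFSRs are re-seeded.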
TX/RXUSRCLK and TX/RXUSRCLK2
Almost all signals between the FPGA logic fabric and the GTX are clocked with TXUSRCLK2 (for transmission) and RXUSRCLK2 (for reception). These signals are supplied by the user application logic, without any special restriction, except that the frequency must match the GTX’ data rate so as to avoid overflows or underflows. A common solution for generating this clock is therefore to drive the GTX’ RX/TXOUTCLK through a BUFG.
The logic fabric is required to supply a second clock in each direction, TXUSRCLK and RXUSRCLK (without the “2” suffix). These two clocks are the parallel data clocks in a deeper position of the GTX.
The rationale is that sometimes it's desired to let the logic fabric work with a word width twice as wide as the GTX's internal word width. For example, in a high data rate application, the GTX's word width may be set to 40 bits with 8b/10b, so the logic fabric would interface with the GTX through a 32-bit data vector. But because of the high rate, the clock frequency may still be too high for the logic fabric, in which case the GTX allows halving the clock and applying the data through an 80-bit word. In this case, the logic fabric supplies the 80-bit word clocked with TXUSRCLK2, and is also required to supply a second clock, TXUSRCLK, having twice the frequency and being phase aligned with TXUSRCLK2. TXUSRCLK is for the GTX's internal use.
A similar arrangement applies for reception. Unless the required data clock rate is too high for the logic fabric (which is usually not the case), this dual-clock arrangement is best avoided, as it requires an MMCM or PLL to generate two phase-aligned clocks; the lower clock presented to the logic fabric is its only benefit.
Word alignment
On the transmitting side, the GTX receives a vector of bits, which forms a word for transmission. The width of this word is one of the parameters set when the GTX is instantiated, and so is whether 8b/10b encoding is applied. Either way, some format of parallel words is transmitted over the channel in a serialized manner, bit after bit. Unless explicitly required, there is nothing in this serial bitstream to indicate the words' boundaries. Hence the receiver has no way, a priori, to recover the word alignment.
The receiving GTX's output consists of a parallel vector of bits, typically with the same width as the transmitter's. Unless a mechanism is employed by the user logic, the GTX has no way to recover the correct alignment. Without such alignment, the organization into parallel words arrives wrong at the receiver, possibly as complete garbage, as an incorrect alignment prevents 8b/10b decoding (if employed). It's up to the application logic to implement a mechanism for synchronizing the receiver's word alignment.
There are two methodologies for this: Moving the alignment one bit at a time at the receiver’s side (“bit slipping”) until the data arrives properly, or transmitting a predefined pattern (a “comma”) periodically, and synchronize the receiver when this pattern is detected.
Bit slipping is the less recommended practice, even though it is simpler to understand. It keeps most of the responsibility in the application logic's domain. The application logic monitors the arriving data, and issues a bit slip request when it has gathered enough errors to conclude that the alignment is out of sync.
However most well-established GTX-based protocols use commas for alignment. This method is easier in the way that the GTX aligns the word automatically when a comma is detected (if the GTX is configured to do so). If injecting comma characters periodically into the data stream fits well in the protocol, this is probably the preferred solution. The comma character can also be used to synchronize other mechanisms, in particular the scrambler (if employed).
Comma detection may also have false positives, resulting from errors in the raw data channel. As these data channels usually have a very low bit error probability (BER), this possibility can be overlooked in applications where a short-term false alignment resulting from a false comma detected is acceptable. When this is not acceptable, the application logic should monitor the incoming data, and disable the GTX automatic comma alignment through the rxpcommaalignen and/or rxmcommaalignen inputs of the GTX.
Tx buffer: to use or not to use
The Tx buffer is a small dual-clock ("asynchronous") FIFO in the transmitter's data path, plus some logic that makes sure it starts off half full. The underlying problem, which the Tx buffer potentially solves, is that the serializer inside the GTX runs on a certain clock (XCLK) while the application logic is exposed to another clock (TXUSRCLK). The frequency of these clocks must be exactly the same to prevent overflow or underflow inside the GTX. This is fairly simple to achieve. Ensuring proper timing relationships between these two clocks is, however, less trivial.
There are hence two possibilities:
Not requiring a timing relationship between these clocks (just the same frequency). Instead, use a dual-clock FIFO which interfaces between these two clock domains. This small FIFO is referred to as the "Tx buffer". Since it's part of the GTX's internal logic, going this path doesn't require any additional resources from the logic fabric.
Making sure that the clocks are aligned, by virtue of a state machine implemented in the logic fabric.
The first solution is simpler and requires fewer resources from the FPGA's logic fabric. Its main drawback is the latency of the Tx buffer, which is typically around 30 TXUSRCLK cycles.
While this delay is usually negligible from a functional point of view, it’s not possible to predict its exact magnitude. It’s therefore not possible to use the Tx buffer on several parallel lanes of data, if the protocol requires a known alignment between the data in these lanes, or when an extremely low latency is required.
The second solution requires some extra logic, but there is no significant design effort: the logic that aligns the clocks is included automatically by the IP core generator in Vivado 2014.1 and later, when the "Tx/Rx buffer off" mode is chosen.
Only note that gtN_tx_fsm_reset_done_out may take a bit longer to assert after a reset (something like 1 ms on a 10 Gb/s lane).
Rx buffer
The Rx buffer (also called “Rx elastic buffer”) is also a dual-clock FIFO, which is placed in the same clock domain gap as the Tx buffer, and has the same function. Bypassing it requires the same kind of alignment mechanism in the logic fabric.
As with its Tx counterpart, bypassing the Rx buffer makes the latency short and deterministic. It's however less common that such a bypass is practically justified:
While a deterministic Tx latency may be required to ensure data alignment between parallel lanes in order to meet certain standard protocol requirements, there are almost always fairly easy methods to compensate for the unknown latency in user logic. Either way, it's preferable not to rely on the transmitter to meet requirements on data alignment, and to align the data, if required, by virtue of user logic.
sysclk_in must be stable when the FPGA wakes up from configuration. A state machine that brings up the transceivers is based upon this clock. It’s referred to as the DRP clock in the wizard
It’s important to declare the DRP clock’s frequency correctly, as certain required delays which are measured in nanoseconds are implemented by dwelling for a number of clocks, which is calculated from this frequency.
In order to transmit a comma, set the txcharisk to 1 (since it’s a vector, it sets the LSB) and the value of the 8 LSBs of the data to 0xBC, which is the code for K.28.5.
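A sketch of what that looks like for a 2-byte (16-bit) fabric interface; send_comma and payload are illustrative signals, and txdata/txcharisk are assumed to feed the GTX's TXDATA/TXCHARISK ports:

always @(posedge txusrclk2) begin
    if (send_comma) begin
        txdata    <= {8'h00, 8'hBC};  // LSB byte = 0xBC (K28.5); upper byte = whatever the protocol sends with it
        txcharisk <= 2'b01;           // mark only the LSB byte as a K-character
    end else begin
        txdata    <= payload;
        txcharisk <= 2'b00;           // plain data bytes
    end
end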
2. What is your contribution to the Encryption algorithm? How did you modify it and test it?
3. Explain the tree of the encryption algorithm.
4. How does the self manuring system work?
5. What is your contribution?
6. How does a DAC work?
7. Which FPGA did you work on?
8. Is it SRAM-based or fuse-based?
9. Can you define the components in one cell of an Artix-7 FPGA?
10. Xilinx tool flow (ISE & Vivado)?
11. How do you debug the board (techniques)?
12. How does an ILA work?
13. Various types of constraints?
14. Write a counter program in VHDL.
15. How do you debug a high-intense IO board?
1. If there are 4 inputs in
decoder..no of output lines?
Application of decoder.
2. Fpga design flow in detail.
3. Signal vs variable
4. Fsm types.. difference between the two with
block diagram
5. No of inputs in 4 bit parallel adder
6. Full adder
7. Buffer using XNOR
8. Setup and hold time.
9. If a number 101 is left shifted 2 times,
what is the value of new number in decimal system. Ans. 20 (10100)
10. Difference between block ram and
distributed ram
11. Difference between flip flops and latch
12. Libraries used in vhdl...why
ieee.std_logic library is used?
13. UART protocol
1) What are the different kinds of memory available in an FPGA?
A: Distributed RAM, Block RAM and optionally UltraRAM
2) Name some resources generated in a netlist.
A: LUTs, flip-flops, DSPs, BRAMs, LUT RAMs, URAM, I/O etc.
3) Name some debugging tools you have used in FPGA.
A: ILA, ChipScope, VIO
4) Name some Verilog coding constructs that are not synthesisable.
A: Time delay, initial, fork/join, force and release
5) How do you fix setup time violations?
A: Registering, reduce the clock frequency, use stronger cells to drive, adjust skew of the violating path etc.
6) How do you fix hold time violations?
A: Insert cells, reduce drive strength of the cells, skew the clock etc.
7) How does an IF statement in VHDL/Verilog code manifest into real hardware?
A: MUX / LUT
8) How would you design clock domain crossing (CDC)?
A: Using FIFOs, synchronisers
9) What happens when you don't add a default statement to a "case" construct?
A: An inferred latch is generated
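A small illustration of that (signal names are arbitrary): in the combinational block below, sel = 2'b11 assigns nothing to y, so y must remember its previous value and synthesis infers a level-sensitive latch.

reg y;
always @* begin
    case (sel)
        2'b00: y = a;
        2'b01: y = b;
        2'b10: y = c;
        // no default and no 2'b11 branch -> latch inferred for y
    endcase
end

Adding "default: y = 1'b0;" (or covering every case) keeps the logic purely combinational.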
10) Name a few signals in AXI-Stream.
A: TVALID, TDATA, TREADY, TKEEP, TLAST, TID etc.
11) Difference between AXI4 / AXI-Lite and AXI-Stream?
A: AXI4/AXI-Lite are memory mapped; AXI-Stream does not access memory locations
12) Maximum burst size supported by AXI4 MM?
A: 256
13) Name clocking resources in an FPGA.
A: MMCM, PLL
14) What should the value of worst negative slack be for a design to pass timing constraints?
A: Greater than zero
15) When a slave does not give TREADY in an AXI-Stream communication, what is the expected behaviour of the master?
A: It should retain the value, keeping TVALID high
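A sketch of that rule on the master side (generic AXI-Stream names; the surrounding next_beat_* signals are illustrative): once TVALID is high, TDATA/TVALID are held until TREADY is seen.

always @(posedge aclk) begin
    if (!aresetn) begin
        m_tvalid <= 1'b0;
    end else if (!m_tvalid || m_tready) begin
        // Safe to present a new beat: either nothing is pending, or the
        // current beat was just accepted (TVALID && TREADY on this edge).
        m_tvalid <= next_beat_available;
        m_tdata  <= next_beat_data;
    end
    // When m_tvalid is high and m_tready is low, nothing changes:
    // the beat is held stable until the slave is ready.
end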
16) What memory does the ILA use for capturing the packets?
A: BRAM
17) What does clock gating do in an FPGA?
A: It reduces power consumption of logic cells
18) Can a CLB be configured as RAM? If yes, what kind of RAM?
A: Yes, distributed RAM
19) What are the different channels in an AXI4 interface?
A: Write address, write data, write response, read address, read data
20) Can a slave stream data to a master in AXI-Stream?
A: No
21) How does an ILA work? What resources does it use?
22) What are MMCM and PLL?
23) How do you reduce utilization of an FPGA?
24) Types of constraints?
25) What is AXI?
26) Difference between DSP and LUT?
27) What clock frequency have you worked with?
28) What are the complexities you faced during FPGA design?
29) How do you resolve a hold time violation?
30) What are MMCM and PLL?
FPGA architecture
VHDL /verilog code.
Interfaces
(such as PCIE, DDR3..)
Embedded ARM
What do you mean by JTAG OR USB ?
Write an RTL Block that implements 4x4 matrix
multiplication. Then they asked me to optimize the critical path. Then check if
we can reduce the chip area using less number of Multipliers.
How do you optimize your ASIC/FPGA design
What is speed grade and how do you select FPGA as
per requirements?
What is the maximum possible speed achievable for a
given device?
How do you code to reduce power in FPGA design
Answers
Avoid reset for FPGA
Clock Gate
Use synchronous design
Avoid over constraining
Reduce Device temperature (cooling solution)
Use clk_en and control enable for all Memory
Use LUT for smaller memory. BRAM takes more power
What is clock gating and how do you do it in FPGA
What is Clock domain crossing?
Why 2 blocks may need to work on different clocks?
What problems may arise due to clock domain crossing
of signals?
How to resolve issues arising because of clock domain crossing ?
How do you manage multiple clocks and how do you
route them?
CDC tools can help this like Spyglass, etc. But
asynchronous transfers must be handled carefully in design and later they can
be assigned false path for the tool to go easy on compilation
How do you do IO planning and can you explain few
types of IO types?
How do add debug probes for chipscope and what are
the benefits inserting it and inferring it?
Answers:Inferring it along with your RTL gives you
freedom of debugging whenever required. Adding it as a probe requires you to
sort the netlist file and break up your head in finding the net names.
Sometimes some logic needs to be always put on JTAG for status/debug.
How do you program multiple FPGA images in flash and
load?
How do you control reset logic?
1. FPGA used in my project. Its Architecture
2. What is LUT?
3. Difference between BRAM and Distributed RAM
4. Difference between FIFO and RAM
5. If a FIFO has 30 MHz clock for writing and 20 MHz for reading, is the reading operation synchronous/ asynchronous.
6. FPGA design flow
7. Metastability
8. What will the debugging process, if the functional simulation is correct.
9. Will the test bench for functional simulation and GLS simulation be same?
For FPGA, a global reset is sufficient. Use an async reset for internal logic and synchronize that reset to the main clk (if they are in the same clock domain; else use accordingly). Always assert the reset asynchronously and de-assert it synchronously with the clock.
How do you do STA?
1. Setup time and hold time, equations for these and how to fix them in case of violation
2. What is skew?
3. No. of output bits after performing 8*8
4. Blocking and Non-blocking outputs
5. Verilog code for 2:1 MUX
6. Implement AND gate and XOR gate using 2:1 MUX
7. Half Adder and Full Adder with equations. Implementing multiplier using adder
8. FPGA Architecture
9. CDC
10. Realization of 3 input AND gate using 4 input LUT
What are the timing constraints that you do for a
typical design having synchronous and asynchronous logic?
How does the RTL logic convert to logic gate, say a
comparator or counter, please describe
Can you write a code in RTL for a debounce logic or
9-bit counter?
RTL coding for FPGA primitive components, what are
the primitive components and what have you used.
What are the following: BUFGMUX, IBUF, OBUF?
How can we find the max clock for a given FPGA
program?
What is Metastability and what are its effects?
What is setup and hold time?
What is pipelining?
Why do FPGAs have dedicated clock pins? What's so
special about them?
Describe a design that you think that will best fit
an FPGA.
Describe a design that you think that will best fit
a Microcontroller.
What are the difficulties in asynchronous FIFO
design?
Differences between FPGA and ASICs, clock and reset
schemes
A signal derived in clock domain A is used in a
sequential process in clock domain B. What can potentially happen if you use
the signal directly in clock domain B? What is the proper way to handle the
signal before using it in clock domain B?
Why can't we use a Clock Source directly rather
than using FF divider circuits, which also need a clock source for their operation?
Internal workings of FPGA's such as block ram, dsp,
logic cell, LUT
What is the most basic way of meeting or checking to
see if you are meeting timing?
Tell me some of constraints you used and their
purpose during your design?
1. Have you worked on the HAPS prototype board?
2. Asked about my project.
3. What type of issues have you faced in your project?
4. How can you reduce area if all the FPGA resources are used?
5. Asked about FPGA structure.
6. How will you debug your design if there is no error/timing violation, but your design is not working?
7. What is the difference between a Function and a Task?
8. Do you know VHDL and Verilog? Asked for 1-bit latch Verilog code.
What are different types of FPGA programming modes?
What are you currently using? How do you change from one to another?
Can you list out some of synthesizable and non
synthesizable constructs?
Can you draw general structure of fpga?
Difference between FPGA and CPLD?
MMCM Vs. DCM
What is slice,clb,lut?
What is FPGA you are currently using and some of
main reasons for choosing it?
Draw a rough diagram of how clock is routed through
out FPGA?
How many global buffers are there in your current
FPGA, what is their significance?
What is frequency of operation and equivalent gate
count of project?
Compare PLL & DLL.
1. How can you resolve a routing issue?
1. Implementing f =(a^b) & c using 4:1 Mux.
2. Sine multiplier.
3. How many output bits we will get if we multiply 8 bits with 8 bits?
4. Highest and lowest value of n bit 2’s complement number.
5. Verilog code for active low asynchronous set, en (high priority), reset(synchronous).
6. Dual port RAM code.
7. ROM, PAL.
8. Design ROM using 6-LUT.
2. What is the difference between a Function and a Task?
3. What is the purpose of the DSP48 block?
4. Xilinx ISE flow
1. Blocking and non-blocking assignment output.
2. Edge detector
3. Setup and hold related questions, e.g. when we get violations how we can remove them, and whether hold depends on frequency or not.
4. Verilog code for 4:1 Mux.
5. RAM Verilog code where writing has the highest priority.
6. How to flush memory to a default/reset value in a single clock.
7. In your design, which type of reset signals are used? Why?
8. Why is CDC required? CDC techniques.
9. What is MMCM in FPGA?
Suggest some ways to increase clock frequency?
What is LV and why do we do that. What is the
difference between LVS and DRC?
Clock management blocks in FPGAs
What are the transceiver types and speeds in Xilinx
FPGAs?
Why are there hard macros in FPGAs? What are the
downsides of including them?
Why is FPGA performance not usually as good as an
ASIC or mask programmed gate array? (may not even be true anymore)
What are some of the basic differences between FPGA
families?
Verilog 2001 vs 95
Latches
clock gating
resource sharing
operator overloading
operator balancing
power gating
flip flop vs latches
counters
registers
memories
fifo
fifo depth
meta stability resolve of it
overlap
skew
slack
setup
hold fix
encoder
decoder
throughput latency calculation
bufg
power reduction in rtl
Fsm verilog mealy moore
Clock schemes
pattern gen
defparam
local parameter
parameter
dynamic RAM, SRAM and its types, memory calculation
shift register types, all digital mux, encoder, flip-flop
conversions, race around condition, glitches, static hazard and dynamic hazards
1010 11001 overlap or non-overlap
Booth multiplier verilog
Full subt adder carry
Mealy conversion to moore.
Booth encoding
For loop generate while loop forever
: Encoding schemes .clocking schemes. Advantages
Carry save adder
Verilog rtl synthesis
Task and functions
Mcp false path sta dta
: Generate statements for multiplication
Parity error check and correction clock crossing in verilog
verilog renerta web and tidbits fully
Setup and pulse width
Face
detection of verilog in Xilinx
Neural
network in fpga
Steps
for handover protocol on FPGA using C
How
can I physically convert high speed camera/image sensor to a fpga
No of
frames and frame length in Spartan 6,virtex7,kintex7
Xilinx
system generator hardware co-simulation compilation problem
16
bit multiplier using xilinx’
Semicustom
design without FPGA Or CPLD on Xilinx
Face
recognition and speech recognition in FPGA
how
to write HDL code, to compare images. For example, i need to compare two
512x512 image, If i compared pixel by pixel with its value, how to do?
Verilog
to detect circles
Verilog
for EMD algorithm
Generate
variable frequency sinusoidal signal using Xilinx system generator’
FPGA
Signal integrity for SERDES
FPGA
issues in Timing
SPI
active edge
Version
control
Board
bring up
FPGA
partitioning
ARM
Modes
Which
mode arm will boot up? Why this mode. uses?
Interrupts
Booting
procedures
When
and where the modes can be used?
How
ARM boots?
How
interrupt handling while boots?
Difference
between NAND and NOR FLASH
How
C code runs after process come out of reset
Pin
mux in FPGA
Gated
logic implementation in FPGA
Hard
IP in FPGA worked
Pull
up RESISTORS in FPGA
pull-up
built into the I/O is not sufficient and for sure you need external pullup
resistors.
Tools
used in bitgen
I2C
pin mapping in FPGA
USB
multiplexer via FPGA
Possible
signal integrity problem in 100mhz clock
4
core processor on FPGA
Relay
control in fpga
Routing
through TCL
Reset
FPGA by PLL
PS to
PL interrupts.
Generate
saif file in ISIM or MODELSIM simulation’
Simulate
sdf file in modelsim
Good
scaling scheme for radix FFT on FPGA
Copy
contents of array in memory editor Xilinx Isim
Miller
Decoder state machine
Update
.mif file without compiling TCL
Generate
4 phase shifted clock signals
Implement
ADC in FPGA where FPGA runs at 100MHz clock.
Interface
SRAM to microcontroller using FPGA.
Clock
skew problem in serdes
Interface
of FPGA to DSO
Data
string length to be sent out from FPGA.
Uart
TX signaling from another processor in FPGA
How
to read back the configuration memory of SRAM based FPGA for SEU mitigation?
Read/Write
to Virtex7 through USB/jtag port.
Generate
vcd from .v file and add it in synthesis to perform power analysis.
configure cordic divider in Xilinx system generator
What
are the issues if the duty cycle of the clock in a digital ckt is changed from
50%?
Challenges faced while transfer of data between 2 different clock domains
DFF with 1 delay.what happens?
If
interface to FPGA and uart are same but with different sources will it
affects?
If my
receiver clock having some tolerance with transmitter clock will it
affect?
If
I2C is replaced by APB What happens?
Analog
IO pad and digital IO pad?
What
are the different ways for optimizing a timing path in the VLSI design?
Interface
pins of uart to fpga
Interface
pins of usb to fpga
Interface
pins of DDR,PCI to FPGA
DDR3,DDR4,USB,PCI
Constraints
after synthesis or before synthesis
Floorplan,placement,routing
methods
Script
for basic clock gating
ASIC
to FPGA conversion and vice versa
Ethernet
interface to FPGA
Ethernet
frames
TCP/IP
protocol
Axi
protocol all
AMBA
protocol all
Qualify
STA of a chip
Factors
affecting delay and skew
Boot
Linux on zynq fpga
Detect
whether transparent latch is inferred during the elaboration
USB
transfer types
CDC
Strategies
100gb
ethernet
Usb
for mobile
SRIO
CPRI
Images
,audio and video algorithms in FPGA
Floating
point in fpga
ELF
vs HDF
AXI
tapeout to peripheral
USB
PETALINUX
COMPILE
UART
IN ZED BOARD
LVDS
STDS
LVDS
PORT MAP
PCS,PMA
IN GTX
USB
3.0 validation
Link
training for pci
AURORA
IP
Difference
between root complex and endpoint
ARM
processor bringup
Design
margin
Construct
memory using LUT
64X1
MUX USING LUT
Dsp48
Maximum
frequency of the design
Maximum
frequency Fmax
ADC
interface to FPGA
DRC
in vivado
DRP
in vivado
Axi
tapeout to peripheral
Video
IP
Pixel
size,image resolution in video
Testbench
of 100Mhz with 1 Mhz clk reset 100ns
Design
for high throughput
Latency
at low end
Eyescan
flow diagram
RJ45
pins to FPGA
In
I2C if master request is to be read and if it fails .what happens?
FSBL
Cpu
interfaces with FPGA
Banking
rule of FPGA
Which
buffer used for violation?
If
60ps is used,how to fix violation?
ARM
addressing modes
AWLEN
10
mhz and 100mhz working with single pulse
GATE
COUNT in vivado
In
CLB,GATE COUNT
Fault
simulation
Register
balancing
RAM
extraction
Shift
register extraction
Register
duplication
Throughput
and latency calculations in design
In
GTX if transmitter frequency is of high w r to receiver,..what happens?
Differential
clock to a single clock in vc707
Multipliers
used in 64 coefficients
FIR
BLOCK
Ahb
Timing Diagram
AHB
block
32
bit reg to AHB.RTL CODE
APB
Timing diagram
Sdk
environment
Metastability
Fix
Metastability
waveform?
A
circuit give,what is the hardware?
DDR
interfaces,PCIE,USB,SERDES
Flip
flop as 4 bit counter
High
speed Transceiver
DLL,PLL,DPLL
ADC
to FPGA
DRC
In which stage of the design flow do we get an idea about the false paths and multicycle paths in the design? Which tool will report them?
At the RTL level, we are left with the logic of the design only, and we have no idea how the tool will synthesize different paths in the design. The tool will synthesize the design according to its algorithm. From the thousands of R2R paths the tool synthesizes, how can we know which are false paths or multicycle paths?
The best way to use an asynchronous reset, if one has to have it: use a BUFGCE for the clock domain, and the reset sequence must be as follows (a minimal sketch appears after the list):
1. Disable the clock
2. Assert the reset
3. Deassert the reset, let the system wait for something like 32 clocks
4. Enable the clock
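A minimal sketch of that sequence, with the controller running on a free-running clock and the domain's clock passed through a BUFGCE. Signal names, the start_reset trigger and the 32-cycle wait are illustrative; only the ordering of the steps is the point:

reg       clk_en   = 1'b0;
reg       arst     = 1'b1;
reg [5:0] wait_cnt = 6'd0;

BUFGCE u_bufgce (.I(clk_free), .CE(clk_en), .O(clk_gated));  // clk_gated feeds the reset domain

always @(posedge clk_free) begin
    if (start_reset) begin
        clk_en   <= 1'b0;          // 1. disable the clock
        arst     <= 1'b1;          // 2. assert the (asynchronous) reset
        wait_cnt <= 6'd0;
    end else if (arst) begin
        arst <= 1'b0;              // 3. deassert the reset...
    end else if (!clk_en) begin
        wait_cnt <= wait_cnt + 1'b1;
        if (wait_cnt == 6'd31)
            clk_en <= 1'b1;        // 4. ...wait ~32 clocks, then re-enable the clock
    end
end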
If clock gating fails?
If you want a clock gate pushed into a BUFGCE, the register that creates the gate becomes part of the BUFGCE structure. If the two gating terms come from registers on different clocks, or if they come from somewhere outside the hierarchy being examined by the synthesizer, then the tools may not be able to automatically create the gated clock. Just taking the gate term as written and using it as the CE input of a BUFGCE would cause a gating delay that doesn't match the RTL. It also seems that the tools are smart enough to look for other solutions like divide-by-2, but couldn't match those to the RTL either.
Challenges with clock gating:
1. Circuit timing changes
2. Area penalty (additional logic is added)
3. Equivalency checking (especially for sequential clock gating)
4. Clock domain crossing issues
5. Reset domain crossing issues
6. Complicates timing closure due to additional delay in the clock network
7. As these changes are local to power domains, I don't think any impact can come from power states. I am not sure of any impact if the clock gating is applied at an architectural level.
8. DFT methodology changes. The test enable mux can be inserted pre or post of the latch in the clock gating cell.
Bypass and in the clock gating
It can be done, with care and thorough understanding of the possible consequences. These include metastability when taking clocked signals into the gated clock domain, and worse results from timing-driven synthesis/layout. But there are nearly always other ways to achieve the same control over a circuit as gating the clock, without all such risks and penalties.
Resource sharing
Pipelining
How do I avoid glitches in generating multiple clocks and muxing the clocks into one interface clock based on selection lines in Verilog?
If an input is dumping into the FPGA, define the constraints to be added.
When setting up constraints in Vivado Design Suite flows, be sure to do the following:
· Define all primary clocks on input ports or nets connected to input ports.
· Define clocks on black box output pins.
· Define generated clocks on nets.
· Don't define gated clocks.
· Provide correct clock constraints – don't over-constrain, and be sure to place unrelated (aka asynchronous) clocks in separate clock groups.
· Define timing exceptions such as false paths and multicycle paths.
Clock gating not used in FPGA. Why?
Whenever a clock signal is passed through a gate, it adds skew to the clock signal. An ASIC has the flexibility to correct setup and hold time violations occurring due to such gating; it is rectified in the design layout process. FPGAs do not have such flexibility to correct the problems. One can calculate and predict such violations but cannot rectify them using buffers as in ASICs. Even if a buffer is introduced, the delay, location & number of buffers are completely out of our hands. So normal gating can be done in an FPGA, but highly constrained gating where timing is very important can produce undesirable results. So it is NOT RECOMMENDED in FPGAs.
What are the different tests you would do to verify your Verilog code?
How to generate a non-overlapping clock
Why do FPGAs use flash?
Flash-based FPGAs' non-volatile memory cells hold the configuration pattern right on the chip, and even if power is removed the contents of the flash cells stay intact. Thus when the system restarts, the FPGAs power up in microseconds, saving time and allowing the system to recover quickly from a power failure or a restart.
How to build SPI Flash for FPGA?
How to fix recovery time of an asynchronous reset
Make sure that you are using the asynchronous reset pin of the FF correctly; use the reset bridge described in that post for synchronizing the reset, and then you won't have a reset recovery failure on your "real" FFs. However, you will/might get one on the asynchronous input to the reset bridge. Since the reset bridge is a synchronizer, the reset input to the synchronizer can be declared as a false path.
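A minimal reset-bridge sketch (asserted asynchronously, released synchronously); module and signal names are illustrative:

module reset_bridge (
    input  wire clk,
    input  wire async_rst_in,  // raw asynchronous reset, active high
    output wire rst_out        // drives the async reset pins of the "real" FFs
);
    (* ASYNC_REG = "TRUE" *) reg rst_meta = 1'b1, rst_sync = 1'b1;

    always @(posedge clk or posedge async_rst_in) begin
        if (async_rst_in) begin
            rst_meta <= 1'b1;      // assertion is immediate (asynchronous)
            rst_sync <= 1'b1;
        end else begin
            rst_meta <= 1'b0;      // de-assertion ripples through two FFs,
            rst_sync <= rst_meta;  // so the release is aligned to the clock
        end
    end

    assign rst_out = rst_sync;
endmodule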
In asynchronous FIFO, add the delay. How?
DMA
What is the frequency of waveform in FPGA?
frequency_step = 2^N * frequency_hz / sample_clock_rate_hz
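A sketch of the phase accumulator this relation belongs to (module and port names are illustrative); the MSB of the accumulator toggles at approximately frequency_hz:

module dds_square #(
    parameter integer N = 32
)(
    input  wire         sample_clk,
    input  wire [N-1:0] phase_inc,  // = 2^N * frequency_hz / sample_clock_rate_hz
    output wire         out_wave
);
    reg [N-1:0] phase_acc = {N{1'b0}};

    always @(posedge sample_clk)
        phase_acc <= phase_acc + phase_inc;  // wraps around naturally

    // Square-wave output; feeding the top bits of phase_acc into a sine
    // lookup table instead gives a sinusoid.
    assign out_wave = phase_acc[N-1];
endmodule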
Vivado TCL Flow
=============
Use a makefile and Tcl to run simulation, synthesis, implementation and bit generation.
# Makefile
==========
sim:
	vivado -mode batch -source sim.tcl
imp:
	vivado -mode batch -source non-proj.tcl
proj:
	vivado -mode batch -source proj.tcl
------------------------------------------------------------------------------------
# sim.tcl
==========
set path [file dirname [info script]]
puts "script is invoked from $path"
source [file join $path modelsim.tcl]
------------------------------------------------------------------------------------
# modelsim.tcl
================
set file_dir [file normalize [file dirname [info script]]]
puts "== Unit Test directory: $file_dir"
#set ::env(XILINX_TCLAPP_REPO) [file normalize [file join $file_dir .. .. .. ]]
#puts "== Application directory: $::env(XILINX_TCLAPP_REPO)"
#lappend auto_path $::env(XILINX_TCLAPP_REPO)
set name "modelsim"
create_project $name ./$name -force
add_files -fileset sources_1 "$file_dir/q1.vhd"
#add_files -fileset sources_1 "$file_dir/.vhd"
add_files -fileset sim_1 "$file_dir/tb.vhd"
update_compile_order -fileset sources_1
update_compile_order -fileset sim_1
launch_simulation -batch
close_project
------------------------------------------------------------------------------------
# non-proj.tcl
==============
# STEP#1: setup design sources and constraints
read_vhdl ./q1.vhd
#read_vhdl ./src/AND_TEST.vhd
#read_vhdl ./src/OR_GATE.vhd
# read_vhdl
# read_verilog
read_xdc ./q1.xdc
#
# STEP#2: define the output directory area.
set outputDir ./output
file mkdir $outputDir
#
# STEP#3: run synthesis, write design checkpoint, report timing,
# and utilization estimates
#
synth_design -top q1 -part xc7z020clg484-1
write_checkpoint -force $outputDir/post_synth.dcp
report_timing_summary -file $outputDir/post_synth_timing_summary.rpt
report_utilization -file $outputDir/post_synth_util.rpt
#
# Run custom script to report critical timing paths
#reportCriticalPaths $outputDir/post_synth_critpath_report.csv
#
# STEP#4: run logic optimization, placement and physical logic optimization,
# write design checkpoint, report utilization and timing estimates
#
opt_design
#reportCriticalPaths $outputDir/post_opt_critpath_report.csv
place_design
report_clock_utilization -file $outputDir/clock_util.rpt
#
# Optionally run optimization if there are timing violations after placement
if {[get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -setup]] < 0} {
    puts "Found setup timing violations => running physical optimization"
    phys_opt_design
}
write_checkpoint -force $outputDir/post_place.dcp
report_utilization -file $outputDir/post_place_util.rpt
report_timing_summary -file $outputDir/post_place_timing_summary.rpt
#
# STEP#5: run the router, write the post-route design checkpoint, report the routing
# status, report timing, power, and DRC, and finally save the Verilog netlist.
#
route_design
write_checkpoint -force $outputDir/post_route.dcp
report_route_status -file $outputDir/post_route_status.rpt
report_timing_summary -file $outputDir/post_route_timing_summary.rpt
report_power -file $outputDir/post_route_power.rpt
report_drc -file $outputDir/post_imp_drc.rpt
write_verilog -force $outputDir/cpu_impl_netlist.v -mode timesim -sdf_anno true
#
# STEP#6: generate a bitstream
#
write_bitstream -force $outputDir/proj.bit
------------------------------------------------------------------------------------
# proj.tcl
=========
#
# STEP#1: define the output directory area.
#
set outputDir ./project
file mkdir $outputDir
create_project project_cpu_project ./project -part xc7z020clg484-1 -force
#
# STEP#2: setup design sources and constraints
#
# example:
#add_files -fileset sim_1 ./Sources/hdl/cpu_tb.v
#add_files [ glob ./Sources/hdl/bftLib/*.vhdl ]
#add_files ./Sources/hdl/bft.vhdl
#add_files [ glob ./Sources/hdl/*.v ]
#add_files [ glob ./Sources/hdl/mgt/*.v ]
#add_files [ glob ./Sources/hdl/or1200/*.v ]
#add_files [ glob ./Sources/hdl/usbf/*.v ]
#add_files [ glob ./Sources/hdl/wb_conmax/*.v ]
#add_files -fileset constrs_1 ./Sources/top_full.xdc
#set_property library bftLib [ get_files [ glob ./Sources/hdl/bftLib/*.vhdl]]
add_files -fileset sim_1 ./tb.vhd
add_files ./q1.vhd
add_files -fileset constrs_1 ./q1.xdc
#
# Physically import the files under project_cpu.srcs/sources_1/imports directory
import_files -force -norecurse
#
# Physically import bft_full.xdc under project_cpu.srcs/constrs_1/imports directory
import_files -fileset constrs_1 -force -norecurse ./q1.xdc
# Update compile order for the fileset 'sources_1'
set_property top top [current_fileset]
update_compile_order -fileset sources_1
update_compile_order -fileset sim_1
#
# STEP#3: run synthesis and the default utilization report.
#
launch_runs synth_1
wait_on_run synth_1
#
# STEP#4: run logic optimization, placement, physical logic optimization, route and
# bitstream generation. Generates design checkpoints, utilization and timing
# reports, plus custom reports.
set_property STEPS.PHYS_OPT_DESIGN.IS_ENABLED true [get_runs impl_1]
set_property STEPS.OPT_DESIGN.TCL.PRE [pwd]/pre_opt_design.tcl [get_runs impl_1]
set_property STEPS.OPT_DESIGN.TCL.POST [pwd]/post_opt_design.tcl [get_runs impl_1]
set_property STEPS.PLACE_DESIGN.TCL.POST [pwd]/post_place_design.tcl [get_runs impl_1]
set_property STEPS.PHYS_OPT_DESIGN.TCL.POST [pwd]/post_phys_opt_design.tcl [get_runs impl_1]
set_property STEPS.ROUTE_DESIGN.TCL.POST [pwd]/post_route_design.tcl [get_runs impl_1]
launch_runs impl_1 -to_step write_bitstream
wait_on_run impl_1
puts "Implementation done!"
------------------------------------------------------------------------------------
# pre_opt_design.tcl
############## pre_opt_design.tcl ##################
set outputDir [file dirname [info script]]/project
source [file dirname [info script]]/reportCriticalPaths.tcl
#
report_timing_summary -file $outputDir/post_synth_timing_summary.rpt
report_utilization -file $outputDir/post_synth_util.rpt
reportCriticalPaths $outputDir/post_synth_critpath_report.csv
------------------------------------------------------------------------------------
# post_opt_design.tcl
############## post_opt_design.tcl ##################
# Run custom script to report critical timing paths
reportCriticalPaths $outputDir/post_opt_critpath_report.csv
------------------------------------------------------------------------------------
# post_place_design.tcl
############## post_place_design.tcl ##################
report_clock_utilization -file $outputDir/clock_util.rpt
------------------------------------------------------------------------------------
# post_phys_opt_design.tcl
############## post_phys_opt_design.tcl ##################
report_utilization -file $outputDir/post_phys_opt_util.rpt
report_timing_summary -file $outputDir/post_phys_opt_timing_summary.rpt
------------------------------------------------------------------------------------
# post_route_design.tcl
############## post_route_design.tcl ##################
report_route_status -file $outputDir/post_route_status.rpt
report_timing_summary -file $outputDir/post_route_timing_summary.rpt
report_power -file $outputDir/post_route_power.rpt
report_drc -file $outputDir/post_imp_drc.rpt
write_verilog -force $outputDir/cpu_impl_netlist.v -mode timesim -sdf_anno true
------------------------------------------------------------------------------------
# reportCriticalPaths.tcl
#------------------------------------------------------------------------
# This function generates a CSV file that provides a summary of the first
# 50 violations for both Setup and Hold analysis. So a maximum number of
# 100 paths are reported.
#------------------------------------------------------------------------
proc reportCriticalPaths { fileName } {
    # Open the specified output file in write mode
    set FH [open $fileName w]
    # Write the current date and CSV format to a file header
    puts $FH "#\n# File created on [clock format [clock seconds]]\n#\n"
    puts $FH "Startpoint,Endpoint,DelayType,Slack,#Levels,#LUTs"
    # Iterate through both Min and Max delay types
    foreach delayType {max min} {
        # Collect details from the 50 worst timing paths for the current analysis
        # (max = setup/recovery, min = hold/removal)
        # The $path variable contains a Timing Path object.
        foreach path [get_timing_paths -delay_type $delayType -max_paths 50 -nworst 1] {
            # Get the LUT cells of the timing paths
            set luts [get_cells -filter {REF_NAME =~ LUT*} -of_object $path]
            # Get the startpoint of the Timing Path object
            set startpoint [get_property STARTPOINT_PIN $path]
            # Get the endpoint of the Timing Path object
            set endpoint [get_property ENDPOINT_PIN $path]
            # Get the slack on the Timing Path object
            set slack [get_property SLACK $path]
            # Get the number of logic levels between startpoint and endpoint
            set levels [get_property LOGIC_LEVELS $path]
            # Save the collected path details to the CSV file
            puts $FH "$startpoint,$endpoint,$delayType,$slack,$levels,[llength $luts]"
        }
    }
    # Close the output file
    close $FH
    puts "CSV file $fileName has been created.\n"
    return 0
}; # End PROC
======================================================================
FPGA
=====
Write an RTL block that implements 4x4 matrix multiplication. Then they asked me to optimize the critical path. Then, what if the elements are floating point numbers?
Then check if we can reduce the chip area using less number of Multipliers.
How do you optimize your ASIC/FPGA design
What is speed grade and how do you select FPGA as per requirements
Speed grade is what determines how fast a clock can run in the FPGA. Companies use different values; -1, -2 etc. indicate the scale. The higher the grade, the higher the cost of the FPGA.
What is the maximum possible speed achievable for a given device, say a Virtex-6 device (some speed grade)?
The Fmax is determined by flop-to-flop timing using the shortest route (CLB) with the least clock skew. To put it simply, this is usually calculated based on the logic levels between a source-to-destination path.
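For reference, the usual first-order register-to-register relation is:
Fmax ≈ 1 / (Tclk-to-q + Tlogic + Trouting + Tsetup − Tclock_skew)
so every extra logic level adds its LUT delay (plus routing) to the denominator and lowers the achievable clock.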
What is a logic level?
The number of logic levels in your combinational circuit. Google for more info.
How do you code to reduce power in FPGA design
Avoid reset for FPGA
Clock Gate
Use synchronous design
Avoid overconstraining
Reduce Device temperature (cooling solution)
Use clk_en and control enable for all Memory
Use LUT for smaller memory. BRAM takes more power
What is clock gating and how do you do it in FPGA
Xilinx and Altera provide primitive components for this. Check them.
How do you manage multiple clocks and how do you route them?
CDC tools like Spyglass, etc. can help with this. But asynchronous transfers must be handled carefully in design, and later they can be assigned false paths for the tool to go easy on compilation.
How do you do IO planning and can you explain few types of IO types
how do add debug probes for chipscope and what are the benefits inserting it and inferring it?
Inferring it along with your RTL gives you freedom of debugging whenever required. Adding it as a probe requires you to sort the netlist file and break up your head in finding the net names.
Sometimes some logic needs to be always put on JTAG for status/debug.
How do you program multiple FPGA images in flash and load?
Check google on this
How do you control reset logic
For FPGA, global reset is sufficient. Use async reset for internal logic and sync that reset in main clk (if they are in same clock, else use accordingly)
Always assert the reset asynchronously and de-assert synchronously with clock
How do you do STA
What are the timing constraints that you do for a typical design having synchronous and asynchronous logic.
For async, false path or TIG constraints. We do not do any constraint for async path if it is taken care in RTL
How does the RTL logic convert to logic gate, say a comparator or counter, please describe
Can you write a code in RTL for a debounce logic or 9-bit counter?
Some questions about RTL coding for FPGA primitive components, what are the primitive components and what have you used.
BUFGMUX, IBUF, OBUF, etc.
How can we find the max clock for a given FPGA program?
FPGA Emulation
------------------
FPGA Architecture, CLB, LUT, SLICES, BRAM, Distributed RAM, clock buffers and clock resources, synthesis, placement, routing, timing analysis, constraints for FPGA, setup, hold violation, equations and examples for setup, hold, slack, skew, positive slack, negative slack, equation for positive as well as negative slack.
Fixing the violations, STA, DTA, synchroniser, metastability, CDC, Vivado tool flow, dcp, drp, Xilinx ISE flow, migration of ISE to Vivado, FPGA IO resources, high speed interfaces, MIG, DMA, AXI-DMA, Zynq SoC, ARM Cortex, Vivado reports vs Xilinx ISE reports, PCIe, ddr3, ddr4, synchronous reset vs asynchronous reset, synchronous memories vs asynchronous memories, Chipscope-ILA, MMCM vs PLL, DCM, bufg, bufgctrl, IBUF, BUFR, GTX transceiver, eye diagram debug, baud rate, shell, tcl, perl, python
FPGA -rtos in fpga,interrupts,bsp,boot soc,sdk,application program,Vivado HLS,RTL debug,integration methods,LINTING
Protocols-AXI,APB,AXI-APB,AXI lite ,AXI stream,USB 2.0,migration of USB 2.0 -3.0,,i2c,spi,can
BUS-PCIe internals,DDR3 internals
=======================================================================================================================
1. fpga architecture
2. VHDL /verilog code.
3. interfaces (such as PCIE, DDR3..)
4. embedded ARM
what do you mean by JTAG OR USB ?
JTAG works through USB or parallel port.
I mean you will use JTAG anyway; the question should be USB or parallel.
In this case, I think if you want to design your own dev board, use parallel –
it would be easier...
Anyway, on the Altera site search for BYTEBLASTER and you will see
a schematic of the JTAG cable, so you'll be able just to duplicate it.
=======================================================================================================================
1. What is Metastability and what are its effects?
2. What is setup and hold time?
3. What is pipelining?
4. Why do FPGAs have dedicated clock pins? What's so special about them?
5. Describe a design that you think that will best fit an FPGA.
6. Describe a design that you think that will best fit a Microcontroller.
7. What are the difficulties in asynchronous FIFO design?
========================================================================
For example: convert 14 into hex, binary, and octal (about a third of candidates cannot do that)
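(Worked out: 14 decimal = 0xE hexadecimal = 1110 binary = 16 octal.)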
- Understanding of digital logic: setup and hold, pipelines, latency and throughput, etc.
- FPGA architecture questions: how different circuits are synthesized, process of timing closure,
differences between FPGA and ASICs, clock and reset schemes
- Good knowledge of hardware description language (SystemVerilog)
A question about clock domain crossing is typical in an FPGA interview. For example, a signal derived in clock domain A is used in a sequential
process in clock domain B. What can potentially happen if you use the signal directly in clock domain B?
What is the proper way to handle the signal before using it in clock domain B?
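One common answer, sketched in Verilog for a single-bit signal (names are illustrative; the ASYNC_REG attribute is Xilinx-specific): pass the signal through two flip-flops clocked in domain B. Multi-bit buses need a handshake, gray-coded pointers, or an asynchronous FIFO instead.

// Two-flop synchronizer for a single-bit CDC signal
module sync_2ff (
    input  wire clk_b,     // destination clock (domain B)
    input  wire sig_a,     // signal generated in domain A
    output wire sig_b      // safe to use in domain B
);
    (* ASYNC_REG = "TRUE" *) reg [1:0] meta;

    always @(posedge clk_b)
        meta <= {meta[0], sig_a};   // first flop may go metastable, second filters it

    assign sig_b = meta[1];
endmodule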
=====MAX FREQ OF THE CODE ----
An FPGA can be programmed with HDL languages such as VHDL and Verilog. For the program itself, frankly, there is no frequency limit at all. However, when it comes to a particular FPGA, the program gets limited by that FPGA's architecture.
For example
A small program, say a timer, can run flawlessly at a 400 MHz clock on a modern Virtex-7 class device, but the same design is simply not possible on a decade-old Spartan-3. The program I am referring to here, a timer, could be as simple as it gets; the limiting factor is not just the program but the technology. Now let's re-write the timer code with the logic levels kept at 1. The Spartan-3 can now reach a max clock of about 200 MHz, and that would be the limit of that FPGA.
→ The key point here is that every FPGA device, or any silicon, has a physical limit on conducting electricity from one node to another (a node being your FF), and the delay between the two determines the maximum possible frequency.
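A rough rule of thumb for this register-to-register limit:

Fmax ≈ 1 / (Tclk-to-q + Tlogic + Trouting + Tsetup)

where the terms are the clock-to-output delay of the source FF, the combinational logic delay, the routing delay, and the setup time of the destination FF.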
Now let's take another program, a memory controller. This program is complex, so achieving 200 MHz on Virtex-7 is hard and you get 190 MHz, whereas on Spartan-3 you get 100 MHz if it is written well. Not a big difference, eh? This is mainly because as the number of logic levels increases, the processing time per clock increases proportionally. By strictly keeping 1 logic level throughout your code, you can achieve 500+ MHz for your design on devices like Virtex-7. Now isn't that amazing? But wait: did I mention that the same will eat up all your resources? So the designer must keep that in mind and design in such a way that the logic levels are kept low, around 1 to 3, which helps get good speeds.
So I would say that a program can be written:
To work at the best possible speeds
To work at the worst possible speeds as well
So what is the limiting factor that prevents my program from executing at max frequency?
The type of FPGA and its silicon technology
The slice interconnect wire length, and hence the fab technology that was used
The ability of the tools that compile/synthesize/place-and-route your design. Synopsys or Cadence tools will show you how much better they are compared with Xilinx or Altera
Finally, the ability of the design engineer who writes the HDL code
So, for your question, my answer is → NO, you can't find the max frequency from just the HDL code, and theoretically the code can run at any frequency. Still don't agree? Try your simulation and provide an input clock period of 1 ns, which is 1 GHz.
Your major limiting factor here is the Silicon chip that you are targeting.
------
Why can't we use a clock source directly rather than using FF divider circuits, which also need a clock source for their operation?
Is it so? If so my answer is below,
The first reason is to reduce the number of clock sources in a system or a design.
Using a single clock source, we can produce multiple clock frequencies with the help of a frequency divider.
For instance, if a processor needs a 48 MHz crystal clock for its core operation, whereas other devices like SRAM and
additional peripherals work at different frequencies below 12 MHz, then using FFs
we can easily derive the required frequencies without using separate crystals for each peripheral.
Second, by doing so as mentioned above, we can avoid timing issues like rise/fall mismatch, setup/hold,
jitter and so on. (Using multiple clock sources may give the above issues.)
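A minimal Verilog sketch of such an FF-based divider (the divide ratio and names are illustrative). Note that on an FPGA it is usually preferable to keep a single clock and use clock enables or an MMCM/PLL/DCM rather than clocking logic from a divided flop.

// Divide-by-4 clock divider built from a small counter
module clk_div4 (
    input  wire clk_in,    // e.g. 48 MHz source
    input  wire rst_n,
    output reg  clk_out    // clk_in / 4 (here 12 MHz)
);
    reg [1:0] cnt;

    always @(posedge clk_in or negedge rst_n) begin
        if (!rst_n) begin
            cnt     <= 2'd0;
            clk_out <= 1'b0;
        end else begin
            cnt <= cnt + 2'd1;
            if (cnt == 2'd1) begin   // toggle every 2 input cycles -> divide by 4
                cnt     <= 2'd0;
                clk_out <= ~clk_out;
            end
        end
    end
endmodule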
========================================================================
verify Rtl design using FPGA
============================
First, an RTL design needs to be verified using a simulator. We usually use the terms this way:
Verify = simulation in EDA tools
Validation = test on the board using a debugger, etc.
When you say verify, you mean the functionality, which is verified using simulator tools first; then come the other stages of compilation. When you say validate, you mean verifying the functionality on the board, probably running on the FPGA. There are many ways to validate in hardware; some are:
Use JTAG to capture your internal registers or IOs
Use VIO core (say from Xilinx FPGA) and drive some set of internal registers to do DFT style testing
Use a debugger module to communicate through UART or so and run various internal test cases on board (like running in simulation) through scripts. This is how modern day ASICs are verified (but a bit complex).
With devices like FPGAs, time-consuming gate-level simulations have largely been phased out. These days people do:
Functional simulation and do coverage
synthesis and verify netlist
Check timing and area
Do PAR and check for DRC issues and others
Go for validation
Simple RTL syntax of your choice
Test bench structure
What is verification? Ways of verifying things.
Internal workings of FPGA's such as block ram, dsp, logic cell, LUT
What is the most basic way of meeting or checking to see if you are meeting timing? ( hint: clock constraints)
Implement some digital logic
You don't necessarily need to know what a hash table is or how to implement a FIR filter.
Others have listed very specific skills that are useful but unlikely to be asked during an interview for an entry-level position. More general skills are very important:
A firm grasp of digital logic, basically everything covered in digital and advanced digital design courses
Math - (DSP is a bonus but just solid algebra and signals and systems knowledge). Being familiar with fixed point arithmetic is a huge boon.
EE knowledge - Understanding the electrical properties of both the FPGA and the devices it interfaces to is a huge plus.
Scripting - Scripting is used to automate many tasks from builds to testbenches to analysis. TCL is widely used by the FPGA Tools and Python is just about everywhere for the rest.
Understanding Trade Offs - Accuracy for speed for example comes up all the time in FPGA designs.
What kind of toolchains, IDEs, and software do you use for your own projects and work projects?
What is the biggest pain you experience ?
What is the top feature/tool/thing you wish you had/could use that would make your life so much easier when designing with HDL?
How do I get this ripple Moore multiplier to work?
I am looking for an FPGA (IC, not dev board) for a small project involving HDMI signal processing. I have been looking for an FPGA with the following specifications:
Enough transceiver capacity and I/O for HDMI input and output (doesn't need to be full HD but I'd like at least 720p 30 Hz)
Support for manufacturer's HDMI IP (don't want to reinvent the wheel)
Supported by free version of Quartus/Vivado/etc (not critical, but I would prefer not having to be restricted to the trial period of whatever software I choose) (apparently the trial does not support generating the programming file so that's not an option)
Ideally not BGA, as it would be harder to design the board and especially solder it by hand
Need advice: Altera DE0 from Digilent or Altera DE10 from Terasic
========================================================================
race around ----
In a JK flip-flop, a condition called race-around occurs when we set both J and K to 1.
In the race-around condition, as long as the clock is high the output toggles continuously from 0 to 1 and 1 to 0. This is undesirable because the change in output is uncontrolled; however, the change is useful if we have control over it. To avoid the race-around condition, or rather to turn it into controlled toggling, we use the master-slave configuration, in which the output changes only when a negative edge occurs. This controlled change in output helps in constructing counters, which rely on the toggling condition of the JK flip-flop. In short, to convert the race-around condition of a JK flip-flop into a toggling condition, we use the master-slave configuration.
Let us understand the mechanism of edge triggering in this case. The master-slave configuration actually has two JK flip-flops, one of which receives an inverted clock with respect to the other. These two are named the MASTER and SLAVE flip-flops. While the master's clock is high it is active, and during this time the slave's clock is low, which puts it in the memory (hold) state. Say the output Q is 1 and Q' is 0. Consider one clock pulse: while it is high, the slave remains in the memory state, so the output does not change. The master, however, is active during this interval (the time the pulse remains high), so its J receives the slave's output Q', i.e. 0, and it produces an output, and similarly K produces an output; these become the inputs of the slave. When the clock goes low, the master becomes inactive (memory state) and the slave becomes active, and with inputs 0 and 1 it produces an output of Q = 0 and Q' = 1. So the output changes when the negative edge appears, which is why master-slave flip-flops are described as negative-edge triggered; there is no special mechanism beyond this. Also, the output of this flip-flop changes only once per clock pulse.
If you know the reason for using a JK over an SR flip-flop,
then the answer to this question would be:
in a plain JK flip-flop the feedback changes more than once in a single clock pulse, whereas in a master-slave JK flip-flop
the feedback holds a constant value, which restricts the output from changing more than once and hence removes the racing between 0 and 1 (race-around) condition.
It also becomes independent of the propagation-delay condition.
Either the master works or the slave works at a time, not both simultaneously.
The slave always follows the action of the master. The master changes its output in one half cycle; the slave makes the same change in the next half. So one of them is inactive at any time, preventing the serious problem of race-around.
Edge triggering
Regarding edge triggering: fine, it can be done that way, and some such devices are available. What is required is a small time constant in the CR circuit to convert these levels into fast edges.
A master-slave FF overcomes the race-around condition by connecting two identical circuits together, one for working and the other for delay. What happens in the race-around condition is that you can't predict the output even though you know the previous state, because it changes so fast; the slave in the master-slave arrangement holds the circuit, i.e. provides some delay. This is how I understand it; the proper working is already explained in some other answers.
We can use edge triggering in these situations, but we prefer the master-slave circuit because in an edge-triggered JK flip-flop there are chances of glitches occurring at the output, which do not occur in a master-slave.
When we stop providing the clock signal, there won't be any further toggling process but we won't be able to determine the output. It could be either 1 or 0.
This is the race around condition. Elimination of this condition can be done by implementing a JK flip-flop, which has two stages. Since the clock signal is inverted before providing it to the slave-stage, hence at a particular instant only one among the two stages will be active.
So, as a result, whenever the output is generated, it is not immediately fed back as input to the input side, but it remains in between the two stages, that is because the slave-stage will be inactive. This makes the output predictable.
Steps to avoid the racing condition in a JK flip-flop:
If the clock-high time is less than the propagation delay of the flip-flop, then racing can be avoided.
This is done by using edge triggering rather than level triggering.
If the flip-flop is made to toggle only once over one clock period, then racing can be avoided.
This introduced the concept of the master-slave JK flip-flop.
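A behavioural Verilog sketch of the resulting negative-edge-triggered JK behaviour (module and port names are illustrative): because the output is updated only on the falling edge, J = K = 1 produces a clean toggle once per clock instead of race-around.

// Negative-edge-triggered JK flip-flop (master-slave style behaviour)
module jk_ff_negedge (
    input  wire clk,
    input  wire j,
    input  wire k,
    output reg  q
);
    always @(negedge clk) begin
        case ({j, k})
            2'b00: q <= q;     // hold
            2'b01: q <= 1'b0;  // reset
            2'b10: q <= 1'b1;  // set
            2'b11: q <= ~q;    // toggle once per clock, no race-around
        endcase
    end
endmodule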
========================================================================
http://www.design4silicon.com/2016/05/fpga-interview-questions-and-answers.html
========================================================================
Digital design (many topics within digital design),
CMOS circuit design & underlying concepts (sizing, chain of inverters,
logical effort, combinational & sequential logic design, drive strength,
power considerations & repercussions , derating for PVT…) ,
FPGA’s , Board related concepts , HDL coding (Systemverilog , Verilog , VHDL),
Good HDL coding practices, Synthesis concepts, ASIC design concepts,
Verification methodologies, Basic serial communication protocols & their concepts,
Basic & Advanced bus theory & knowledge of prominent bus protocols (AMBA - APB / AHB / AXI … ).
Working level knowledge of a particular protocol or domain - storage (SSD’s / SATA )
/ communication (Ethernet / USB …) / Memory interfaces / DSP knowledge .
Knowledge of computer architecture (many sub-topics within this topic)
is central to all concepts involved as everything is connected to a processor core.
Good working level knowledge of Verification methodologies - UVM / OVM.
These are just technical skills; besides these, you need to know how to use the tools and be hands-on with the Unix environment / scripting for automation of tasks.
There are so many VLSI blogs that it's easy to get lost. Chalk out your study plan,
start with the basics, and ask a senior manager / colleague in the domain of your interest to mentor
/ guide you with your career plan.
1. Start with the Samir Palnitkar or J. Bhasker book and design small systems like gates, counters, etc.
2. Improve your design skills by reading the code of projects on OpenCores; use a good editor for understanding the code.
3. Learn to use the VIO and ILA IP cores and improve your debugging skills.
4. With the help of the Vivado videos on the Xilinx website, improve your placement and routing skills.
Practice makes perfect: design new subsystems and interact actively in the Xilinx forums.
Hope this helps you get started.
========================================================================
1) What are the minimum and maximum frequencies of the DCM in Spartan-3 series FPGAs?
Spartan-3 series DCMs have a minimum frequency of 24 MHz and a maximum of 248 MHz.
2) Tell me some of the constraints you used and their purpose during your design.
There are a lot of constraints and they vary from tool to tool. I am listing some Xilinx constraints:
a) translate_off and translate_on: the Verilog code between translate_off and translate_on is ignored for synthesis.
b) CLOCK_SIGNAL: a synthesis constraint. In the case where a clock signal goes through combinatorial logic before being connected to the clock input of a flip-flop, XST cannot identify which input pin or internal net is the real clock signal. This constraint allows you to define the clock net.
c) XOR_COLLAPSE: a synthesis constraint. It controls whether cascaded XORs should be collapsed into a single XOR.
For detailed descriptions of more constraints, refer to the Constraints Guide.
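As an illustration of (a), a minimal hedged sketch (module name is made up); the exact pragma spelling depends on the synthesis tool, but XST accepts the form shown:

// Everything between the pragmas is simulated but ignored by synthesis
module pragma_demo (input wire clk, input wire a, output reg y);
    always @(posedge clk)
        y <= a;

    // synthesis translate_off
    initial $display("simulation-only message, stripped at synthesis");
    // synthesis translate_on
endmodule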
3) Suppose the equivalent gate count for one piece of code is 600 and for another it is 50,000. Will the size of the bitstream change? In other words, does the bitstream size change if the gate count changes?
The size of the bitstream is independent of resource utilization; it is always the same. For the Spartan XC3S5000 it is 1.56 MB and will never change.
4) What are different types of FPGA programming modes?what are you currently using ?how to change from one to another?
Before powering on the FPGA, configuration data is stored externally in a PROM or some other nonvolatile medium, either on or off the board. After applying power, the configuration data is written to the FPGA using any of five different modes: Master Parallel, Slave Parallel, Master Serial, Slave Serial, and Boundary Scan (JTAG). The Master and Slave Parallel modes transfer the configuration data a byte at a time.
Mode selecting pins can be set to select the mode, refer data sheet for further details.
5) Tell me some of features of FPGA you are currently using?
I am taking the XC3S5000 as an example to answer this question.
Very low cost, high-performance logic solution for
high-volume, consumer-oriented applications
- Densities as high as 74,880 logic cells
- Up to 784 I/O pins
- 622 Mb/s data transfer rate per I/O
- 18 single-ended signal standards
- 6 differential I/O standards including LVDS, RSDS
- Termination by Digitally Controlled Impedance
- Signal swing ranging from 1.14V to 3.45V
- Double Data Rate (DDR) support
• Logic resources
- Abundant logic cells with shift register capability
- Wide multiplexers
- Fast look-ahead carry logic
- Dedicated 18 x 18 multipliers
- Up to 1,872 Kbits of total block RAM
- Up to 520 Kbits of total distributed RAM
• Digital Clock Manager (up to four DCMs)
- Clock skew elimination
• Eight global clock lines and abundant routing
6) What is gate count of your project?
Well, mine was 3.2 million; I don't know yours!
7) Can you list out some of synthesizable and non synthesizable constructs?
Not synthesizable ->
initial: ignored for synthesis.
delays: ignored for synthesis.
events: not supported.
real: real data type not supported.
time: time data type not supported.
force and release: force and release of data types not supported.
fork/join: use nonblocking assignments to get the same effect.
user-defined primitives: only gate-level primitives are supported.
Synthesizable constructs ->
assign, for loop, gate-level primitives, repeat with constant value...
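A small illustration of the split above (module and signal names are made up for this sketch): the counter uses only synthesizable constructs, while its testbench relies on simulation-only ones such as initial blocks and # delays.

module count4 (                    // synthesizable
    input  wire       clk,
    input  wire       rst,
    output reg  [3:0] q
);
    always @(posedge clk)
        if (rst) q <= 4'd0;
        else     q <= q + 4'd1;
endmodule

module count4_tb;                  // simulation only
    reg clk = 0, rst = 1;
    wire [3:0] q;
    count4 dut (.clk(clk), .rst(rst), .q(q));
    always #5 clk = ~clk;          // delay: not synthesizable
    initial begin                  // initial block: simulation only
        #20 rst = 0;
        #200 $finish;
    end
endmodule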
8) Can you explain what stuck-at-zero means?
These stuck-at faults appear in ASICs. Sometimes a node will be permanently tied to 1 or 0 because of some fault. To detect that, we need to provide testability in the RTL. If it is permanently 1 it is called stuck-at-1; if it is permanently 0 it is called stuck-at-0.
9) Can you draw general structure of fpga?
10) Difference between FPGA and CPLD?
FPGA:
a)SRAM based technology.
b)Segmented connection between elements.
c)Usually used for complex logic circuits.
d)Must be reprogrammed once the power is off.
e)Costly
CPLD:
a)Flash or EPROM based technology.
b)Continuous connection between elements.
c)Usually used for simpler or moderately complex logic circuits.
d)Need not be reprogrammed once the power is off.
e)Cheaper
11) What are DCMs? Why are they used?
Digital clock manager (DCM) is a fully digital control system that
uses feedback to maintain clock signal characteristics with a
high degree of precision despite normal variations in operating
temperature and voltage.
That is, the clock output of the DCM is stable over a wide range of temperature and voltage, the skew associated with the DCM is minimal, and all phases of the input clock can be obtained. The output of the DCM coming from a global buffer can handle more load.
12) FPGA design flow?
13) What is a slice, CLB, LUT?
I am taking the XC3S500 as an example to answer this question.
The Configurable Logic Blocks (CLBs) constitute the main logic resource for implementing synchronous as well as combinatorial circuits.
CLBs are configurable logic blocks and can be configured as combinational logic, RAM, or ROM depending on the coding style.
A CLB consists of 4 slices, and each slice consists of two 4-input LUTs (look-up tables), the F-LUT and the G-LUT.
14) Can a CLB be configured as RAM?
YES.
The memory assignment is a clocked behavioral assignment, reads from the memory are asynchronous, and all the address lines are shared by the read and write statements.
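A hedged Verilog sketch of RTL that matches this description and typically infers LUT-based distributed RAM (parameter and signal names are illustrative): synchronous write, asynchronous read, one shared address.

// Single-port distributed RAM inference template
module dist_ram #(
    parameter AW = 4,
    parameter DW = 8
) (
    input  wire          clk,
    input  wire          we,
    input  wire [AW-1:0] addr,
    input  wire [DW-1:0] din,
    output wire [DW-1:0] dout
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    always @(posedge clk)          // clocked behavioural write
        if (we) mem[addr] <= din;

    assign dout = mem[addr];       // asynchronous read -> distributed (LUT) RAM
endmodule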
15) What is the purpose of a constraint file, and what is its extension?
The UCF file (extension .ucf) is an ASCII file specifying constraints on the logical design. You create this file and enter your constraints with a text editor. You can also use the Xilinx Constraints Editor to create constraints within a UCF file. These constraints affect how the logical design is implemented in the target device. You can use the file to override constraints specified during design entry.
16) Which FPGA are you currently using, and what are some of the main reasons for choosing it?
17) Draw a rough diagram of how the clock is routed throughout the FPGA.
18) How many global buffers are there in your current FPGA? What is their significance?
There are 8 of them in the XC3S5000.
An external clock source enters the FPGA using a Global Clock Input Buffer (IBUFG), which directly accesses the global clock network or an Input Buffer (IBUF). Clock signals within the FPGA drive a global clock net using a Global Clock Multiplexer Buffer (BUFGMUX). The global clock net connects directly to the CLKIN input.
19) What is the frequency of operation and equivalent gate count of your project?
20) Tell me some of the timing constraints you have used.
21) Why is the MAP -timing option used?
Timing-driven packing and placement is recommended to improve design performance, timing, and packing for highly utilized designs.
22)What are different types of timing verifications?
Dynamic timing:
a. The design is simulated in full timing mode.
b. Not all possibilities tested as it is dependent on the input test vectors.
c. Simulations in full timing mode are slow and require a lot of memory.
d. Best method to check asynchronous interfaces or interfaces between different timing domains.
Static timing:
a. The delays over all paths are added up.
b. All possibilities, including false paths, verified without the need for test vectors.
c. Much faster than simulations, hours as opposed to days.
d. Not good with asynchronous interfaces or interfaces between different timing domains.
23) Compare PLL & DLL ?
PLL:
PLLs have disadvantages that make their use in high-speed designs problematic, particularly when both high performance and high reliability are required.
The PLL voltage-controlled oscillator (VCO) is the greatest source of problems. Variations in temperature, supply voltage, and manufacturing process affect the stability and operating performance of PLLs.
DLLs, however, are immune to these problems. A DLL in its simplest form inserts a variable delay line between the external clock and the internal clock. The clock tree distributes the clock to all registers and then back to the feedback pin of the DLL.
The control circuit of the DLL adjusts the delays so that the rising edges of the feedback clock align with the input clock. Once the edges of the clocks are aligned, the DLL is locked, and both the input buffer delay and the clock skew are reduced to zero.
Advantages of the DLL approach include:
· precision
· stability
· power management
· lower noise sensitivity
· jitter performance
24) Given two ASICs. one has setup violation and the other has hold violation. how can they be made to work together without modifying the design?
Slow the clock down on the one with setup violations,
and add redundant logic (delay) in the path where you have hold violations.
25) Suggest some ways to increase clock frequency.
· Check the critical path and optimize it.
· Add more timing constraints (over-constrain).
· Pipeline the architecture to the maximum possible extent, keeping latency requirements in mind.
26) What is the purpose of DRC?
DRC is used to check whether a particular schematic and the corresponding layout (especially the mask sets involved) cater to a pre-defined rule set, depending on the technology used for the design. These are parameters set by the semiconductor manufacturer with respect to how the masks should be placed, connected, and routed, keeping in mind that variations in the fab process do not affect normal functionality. It usually denotes the minimum allowable configuration.
27) What is LVS and why do we do it? What is the difference between LVS and DRC?
The layout must be drawn according to certain strict design rules. DRC helps in the layout of designs by checking whether the layout abides by those rules.
After the layout is complete, we extract the netlist. LVS compares the netlist extracted from the layout with the schematic to ensure that the layout is an identical match to the cell schematic.
28) What is DFT?
DFT means Design For Testability: a methodology that ensures a design can be tested properly after manufacturing, which later facilitates failure analysis and detection of faulty parts.
Other than the functional logic, you need to add some DFT logic to your design. This will help you in testing the chip for manufacturing defects after it comes back from the fab. Scan, MBIST, LBIST, IDDQ testing, etc. are all part of this. (This is a hot field with lots of opportunities.)
29) There are two major FPGA companies: Xilinx and Altera. Xilinx tends to promote its hard processor cores and Altera tends to promote its soft processor cores. What is the difference between a hard processor core and a soft processor core?
A hard processor core is a pre-designed block that is embedded onto the device. In the Xilinx Virtex II-Pro, some of the logic blocks have been removed, and the space that was used for these logic blocks is used to implement a processor. The Altera Nios, on the other hand, is a design that can be compiled to the normal FPGA logic.
30)What is the significance of contamination delay in sequential circuit timing?
31)When are DFT and Formal verification used?
DFT:
· manufacturing defects like stuck at "0" or "1".
· test for set of rules followed during the initial design stage.
Formal verification:
· Verification of the operation of the design, i.e, to see if the design follows spec.
· gate netlist == RTL ?
· using mathematics and statistical analysis to check for equivalence.
32)What is Synthesis?
Synthesis is the stage in the design flow which is concerned with translating your Verilog code into gates - and that's putting it very simply! First of all, the Verilog must be written in a particular way for the synthesis tool that you are using. Of course, a synthesis tool doesn't actually produce gates - it will output a netlist of the design that you have synthesised that represents the chip which can be fabricated through an ASIC or FPGA vendor.
33) We need to sample an input or output something at different rates, and the rate needs to vary. What's a clean way to do this?
Many, many problems have this sort of variable-rate requirement, yet we are usually constrained to a constant clock frequency. One trick is to implement a digital NCO (Numerically Controlled Oscillator). An NCO is actually very simple and, while it is most naturally understood as hardware, it can also be constructed in software. The NCO, quite simply, is an accumulator where you keep adding a fixed value on every clock (i.e. at a constant clock frequency). When the NCO "wraps", you sample your input or do your action. By adjusting the value added to the accumulator each clock, you finely tune the AVERAGE frequency of that wrap event. Now, you may have realized that the wrap event can have lots of jitter on it. True, but you can use the wrap to increment yet another counter, where each additional divide-by-2 bit reduces this jitter. The DDS is a related technique. I have two examples showing both an NCO and a DDS in my File Archive. This is tricky to grasp at first, but tremendously powerful once you have it in your bag of tricks. NCOs also relate to digital PLLs, timing recovery, TDMA and other "variable rate" phenomena.
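A minimal Verilog sketch of that accumulator (widths and names are illustrative): the carry out of the addition is the wrap event, and the average tick rate is f_clk * phase_inc / 2^32.

// Numerically Controlled Oscillator: phase accumulator with wrap "tick"
module nco (
    input  wire        clk,
    input  wire        rst,
    input  wire [31:0] phase_inc,  // tuning word, added every clock
    output reg         tick        // one-cycle pulse on accumulator wrap
);
    reg  [31:0] acc;
    wire [32:0] sum = {1'b0, acc} + {1'b0, phase_inc};

    always @(posedge clk) begin
        if (rst) begin
            acc  <= 32'd0;
            tick <= 1'b0;
        end else begin
            acc  <= sum[31:0];
            tick <= sum[32];       // carry out of the add = wrap event
        end
    end
endmodule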
==========================================================================================================================
intel
======
An hour on semiconductor physics, an hour on digital logic design, an hour on process technology, and an hour on CPUs and computers
(I was interviewing for a CPU design job.)
========================================================================
What is a CLB?
CLB refers to the “Configurable Logic Block”s in Xilinx FPGAs
Spartan3 CLB:
4 Slices, each slice has 2 LUTs, 2 FFs
LUTs have 4 inputs
Spartan6 CLB:
2 Slices, each slice has 4 LUTs, 8 FFs
LUTs have 6 inputs. These basic slices are called SliceXs. Some slices also add wide multiplexers and carry logic; they are called SliceLs. And some slices additionally have the capability to use their LUTs as distributed RAM and as variable-length shift registers; they are called SliceMs.
Virtex6 CLB:
Same as Spartan6. But basic slices are SliceLs. And it also has SliceMs.
7 Series CLB:
LUTs can be configured as 1 6-input LUT, or 2 5-input LUTs. Each slice has 4 6-input LUTs, and 8 FFs. Each CLB has 2 slices. 2/3 of slices are SliceLs, others are SliceMs.
Ultrascale CLB:
Each CLB has one slice, but two slices of the 7 series are combined into one cohesive slice. So each slice has 8 LUTs and 16 FFs.
Altera has ALM instead of CLB
ALM: Adaptive Logic Module
It has 8 inputs to its LUT, which can implement a full 6-input LUT or certain 7-input functions. Each ALM has one of these LUTs and 2 FFs.
5. Clock management blocks in FPGAs
In Xilinx FPGAs:
Older families like Spartan 3 had DCMs (digital clock manager). New families have CMTs (clock management tile). CMTs have MMCMs and PLLs in them.
6. What are the transceiver types and speeds in Xilinx FPGAs?
Transceivers:
GTP: 6 gbps
GTX: 12.5 gbps
GTH: 13.1 gbps
GTY: 28 gbps on ultrascale, 32 gbps on ultrascale +
UltraScale+ GTR (6.0 Gb/s): Easiest integration of common protocols to the Zynq Processor Subsystem
UltraScale+ GTH (16.3 Gb/s): Low power & high performance for the toughest backplanes
UltraScale+ GTY (32.75 Gb/s): Maximum performance for the fastest optical and backplane applications; 33G transceivers for chip-to-chip, chip-to-optics, and 28G backplanes
UltraScale GTH (16.3 Gb/s): Low power & high performance for the toughest backplanes
UltraScale GTY (30.5 Gb/s): High performance for optical and backplane applications; 30G transceivers for chip-to-chip, chip-to-optics, and 28G backplanes
7 Series GTP (6.6 Gb/s): Power optimized transceiver for consumer and legacy serial standards
7 Series GTX (12.5 Gb/s): Lowest jitter and strongest equalization in a mid-range transceiver
7 Series GTH (13.1 Gb/s): Backplane and optical performance through world class jitter and equalization
7 Series GTZ (28.05 Gb/s): Highest rate, lowest jitter 28G transceiver in a 28nm FPGA
How do they work? What are the tradeoffs between a large, complex logic cell and a simple one?
Why are there hard macros in FPGAs? What are the downsides of including them?
Why is FPGA performance not usually as good as an ASIC or mask programmed gate array? (may not even be true anymore)
What are some of the basic differences between FPGA families? Be able to talk about setup and hold times, too.
Also, this being the modern era, I'd expect questions about HDLs. If you say you know Verilog and not VHDL then they're not going to hit you with VHDL questions
for spite. But be sure to be able to talk about Verilog. Understand behavioral, RTL,
and gate-level coding styles. Understand what synthesis tools can do for you and what they can't, etc.
What happens to gate capacitance as gate oxide thickness is reduced?
Why is electron mobility higher than hole mobility? Why do they say a NAND gate is functionally complete?
Explain 1's and 2's complement arithmetic. What is the difference between a "Moore" and "Mealy" machine.
How do you make a flip-flop from latches. Design a state machine for a traffic light. Why would you want to use a Grey code?
What is cache for? What is the difference between Harvard and Von Neuman architecture?
If a read from cache takes 2 cycles but a miss takes 500 cycles, what ratio of hits must be maintained to keep reads on average to 3 cycles?
What is the difference between a const pointer to an int and a pointer to a const int?
You said you are interviewing for an FPGA design position? I'd be ready to talk about FPGAs.
Most of the questions were oriented towards timing constraint problems.
There was a question I couldn't answer, related to how to make a design run at 50 MHz when the synthesis and routing tools say
"I made my best effort, but it will only be able to run at 25 MHz". Since I have never had to face that kind of problem (most of the modules I have made run at 10 MHz tops), I told him with all honesty that I needed to research that. He told me that was OK, and advised me to run the examples in the Xilinx application notes just to have that extra knowledge, since they face that kind of problem daily. Since I graduate in May, he told me to send him a reminder email around April so that he can set up
a more formal personal interview. What do you think?
========================================================================
CDC
=====
Clock Domain Crossing
What is clock domain crossing?
When a signal or a set of signals needs (for functionality, data transfer, control information transfer, etc.)
to traverse from one block (working in one clock domain) to another block (working in another clock domain),
clock domain crossing of the signal(s) takes place.
Why might two blocks need to work on different clocks?
There can be different practical reasons, such as:
1. Inside a chip, some IP may be custom designed (all steps in the VLSI design flow already done, leaving a good working IP) to run at one particular frequency to meet that IP's timing requirements. But it is quite possible that the IPs it interacts with work fine on either faster or slower clocks, so they will be running at different frequencies. Clock domain crossing scenarios arise in such a case.
2. Some IPs are bought from other companies, and these IPs are also custom designed to work at some particular frequency only.
What problems may arise due to clock domain crossing of signals ?
1. Metastability (Discussed earlier)
How to resolve issues arising because of clock domain crossing ?
Using different types of synchronizers at the boundary.
========================================================================
Gated clocks in older FPGA’s
A decade ago, ASIC’s needed clock gating to save dynamic power (not static, dynamic!!!) by disabling the clock so the flops inside part of the chip are not triggered when that part is not used. This posed a problem in FPGA prototyping of that ASIC because the clock gating puts logic on the clock network inside the FPGA. Clocks have their own optimized low-skew wire buses which cannot be used easily with the introduction of logic between clock source and the clock lines. Moreover, the system clock is no longer one clock anymore. Several parts of the chip clocked by system clock can be gated off. The ASIC clocktree is balanced specifically for staying synchronous between these parts, but an FPGA cannot guarantee the synchronous assumption of asynchronous clocks even though they have the same frequency.
Gated clocks today
Today, the complex FPGAs used for system-on-chip prototyping have these basic blocks of logic. They contain logic gates and flip-flops, the basic blocks used to construct the digital circuit you want. These flip-flops are connected to special globally routed clock lines (dedicated to clocks, and therefore low-skew). And the flops can be gated locally. Instead of one gater gating off the whole clock, the clock is always toggling, but locally, in each logic block, the gate-off can be used to disable parts of the logic while still keeping everything synchronous (because they all use the same clock).
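A hedged Verilog sketch of that FPGA-friendly style (names are illustrative): instead of putting a gate on the clock, keep one global clock and let synthesis map the enable onto the flip-flop's CE pin.

// Clock-enable register: functional equivalent of local "clock gating" on FPGA
module ce_reg (
    input  wire       clk,     // single global clock (e.g. via BUFG)
    input  wire       en,      // the would-be clock-gating condition
    input  wire [7:0] d,
    output reg  [7:0] q
);
    always @(posedge clk)
        if (en) q <= d;        // held when en = 0; the clock itself is never gated
endmodule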
Importance
An ASIC chip is prototyped in FPGA to allow software developers to start developing their code. But also to verify the digital circuitry with the real software. In the past, digital designers needed to change the code for FPGA to not include these gaters. In the ideal world, the ASIC code should not be changed one single bit because the FPGA must reflect the ASIC behaviour as close as possible. Today, we can avoid changing the ASIC code to prototype it on FPGA because the gating is now possible on FPGA.
You might think that a small change like this, removing clock gaters, would not matter much. But every assumption you make, however small or insignificant you think it is, might turn your multi-million dollar chip into a useless thing. Especially for clocks: I have seen a chip that assumed the clock had started to run while the external oscillator was expecting to be enabled by a pin from the chip. It was before my time with that company, but the silicon was dead after reset and it took a while before they figured out what went wrong. And for every new project, all designers and project leaders were reminded of that project and told never to assume anything. Not in simulation, not in FPGA, nowhere. A valuable lesson I always keep in mind, and something that today's digital designers lack. They force their simulation into the right mode of operation for their test cases. I do understand that some IP needs to start up and calibrate, but I see a lot of unnecessary forces (I call them laziness forces) that affect other parts of the chip in ways they can't even understand. I know that there is no cure for stupidity, but sometimes I wonder why humans are called the intelligent race.
Conclusion
In ASICs, clock gating is done for power consumption, and in the FPGA prototype this makes less sense, since the logic and reconfigurability of the FPGA make the power consumption numbers hardly relevant. But it is extremely important that as much of the ASIC code as possible goes unchanged when the ASIC is prototyped in an FPGA. And today, those big FPGAs (I didn't look at small ones) support clock gating, and that is important for ASIC prototyping.
For low-volume products, where an ASIC is not commercially viable, it can be useful to clock-gate parts of the FPGA so that consumption can be reduced. Less power means less cooling (lower fan RPM for active cooling), so the product specification and requirements may call for clock gating even if the design stays in an FPGA, which always consumes more than the same design in an ASIC on the same technology node as the FPGA.
Q. Can a CLB configured as ram?
Q. What is the purpose of DRC?
Q. FPGA design flow?
The on-board high-speed USB2 port provides board power, FPGA programming, and user-data transfers at rates up to 38Mbytes/sec.
Q. What is LUT?
Q. How to generate clocks on FPGA?
File Type | Input / Output | Definition |
NCD | Input/Output | MAP or PAR generates the .ncd file. FPGA Editor uses this file with the New or Open command from the File menu. You can edit the NCD file in the FPGA Editor. |
PCF | Input/Output | A .pcf file is an ASCII file containing physical constraints created by the MAP program as well as physical constraints entered by you. You can edit the PCF file in the FPGA Editor. |
NMC | Input/Output | An .nmc file contains a physical hard macro which can be created or viewed with the FPGA Editor. |
ELF | Input | An .elf file (pronounced “elf”) is a binary data file that contains an executable CPU code image, ready for running on a CPU |
DRF | Input | A .drf file (pronounced “dwarf”) is a binary data file that also contains the executable CPU code image, plus debug information required by symbolic source-level debuggers. |
MEM | Input | A .mem file (memory) is a simple text file that describes contiguous blocks of data. |
BIT | Output | A .bit file contains location information for logic on a device, including the placement of CLBs, IOBs, TBUFs, pins, and routing elements. The bitstream also includes empty placeholders that are filled with the logical states sent by the device during a readback. Only the memory elements, such as flip-flops, RAMs, and CLB outputs, are mapped to these placeholders, because their contents are likely to change from one state to another. When downloaded to a device, a bitstream configures the logic of a device and programs the device so that the states of that device can be read back. |
CDC | Output | A .cdc file can be generated from the ILA command. |
Q. How do you implement DCM?
Q. Why is map-timing option used?
1. Static timing: | 2. Dynamic timing: |
a. The delays over all paths are added up. | a. The design is simulated in full timing mode. |
b. All possibilities, including false paths, verified without the need for test vectors. | b. Not all possibilities tested as it is dependent on the input test vectors. |
c. Much faster than simulations, hours as opposed to days. | c. Simulations in full timing mode are slow and require a lot of memory. |
d. Not good with asynchronous interfaces or interfaces between different timing domains. | d. Best method to check asynchronous interfaces or interfaces between different timing domains. |
- Different clocks within a single design
- Many possible reasons, here are a few:
- Dangers when signals cross domains
- Each flop has a setup and hold time
- Create systems/designs using 1 clk, 1 edge when possible
- If multiple clocks are required, try to use 1 designer for both clock domains, and use coding guidelines
- Use signal naming conventions
- Many clock domain errors come from design changes, not the initial design
- Some CDC checks may add assertions
- Simulation: inject random metastability
- Useful, but hard to get right
- Structural: Identify asynchronous CDCs
- Functional: Assert signal held long enough
- Assertion Based Verification.
- Netlist Analysis.
- User Specification of Design Intent.
- Verifying CDC Monitors Using Simulation.
https://www.seeedstudio.com/Spartan-Edge-Accelerator-Board-p-4261.html
http://corevlsi.blogspot.com/2014/11/fpga-interview-questions_4.html
http://www.design4silicon.com/2016/05/fpga-interview-questions-and-answers.html
FPGA architecture
VHDL /verilog code.
Interfaces
(such as PCIE, DDR3..)
Embedded ARM
What do you mean by JTAG OR USB ?
Write an RTL block that implements 4x4 matrix multiplication. Then they asked me to optimize the critical path, and then to check whether we can reduce the chip area by using fewer multipliers.
How do you optimize your ASIC/FPGA design?
What is speed grade, and how do you select an FPGA per the requirements?
What is the maximum possible speed achievable for a given device?
How do you code to reduce power in an FPGA design?
Answers:
Avoid unnecessary resets in the FPGA
Clock gating
Use synchronous design
Avoid over-constraining
Reduce device temperature (cooling solution)
Use clk_en and control the enable for all memories
Use LUTs for smaller memories; BRAM takes more power
What is clock gating and how do you do it in an FPGA?
What is clock domain crossing?
Why might two blocks need to work on different clocks?
What problems may arise due to clock domain crossing of signals?
How do you resolve issues arising because of clock domain crossing?
How do you manage multiple clocks and how do you route them?
CDC tools like Spyglass can help with this, but asynchronous transfers must be handled carefully in the design; later they can be assigned false paths so the tool can go easy on compilation.
How do you do IO planning, and can you explain a few IO types?
How do you add debug probes for ChipScope, and what are the benefits of inserting them versus inferring them?
Answer: Inferring them along with your RTL gives you the freedom of debugging whenever required. Adding them as probes requires you to sort through the netlist file and rack your brain finding the net names. Sometimes some logic needs to be permanently put on JTAG for status/debug.
How do you program multiple FPGA images into flash and load them?
How do you control reset logic?
For FPGA, a global reset is sufficient. Use an async reset for internal logic and synchronize that reset in the main clock (if they are in the same clock domain; otherwise synchronize accordingly). Always assert the reset asynchronously and de-assert it synchronously with the clock.
How do you do STA?
What are the timing constraints that you apply for a typical design having synchronous and asynchronous logic?
How does RTL logic convert to logic gates, say a comparator or counter? Please describe.
Can you write RTL code for a debounce logic or a 9-bit counter?
RTL coding for FPGA primitive components: what are the primitive components and which have you used?
What are the following: BUFGMUX, IBUF, OBUF?
How can we find the max clock for a given FPGA program?
What is Metastability and what are its effects?
What is setup and hold time?
What is pipelining?
Why do FPGAs have dedicated clock pins? What's so special about them?
Describe a design that you think will best fit an FPGA.
Describe a design that you think will best fit a microcontroller.
What are the difficulties in asynchronous FIFO design?
Differences between FPGAs and ASICs, clock and reset schemes.
A signal derived in clock domain A is used in a sequential process in clock domain B. What can potentially happen if you use the signal directly in clock domain B? What is the proper way to handle the signal before using it in clock domain B?
Why can't we use a clock source directly rather than using FF divider circuits, which also need a clock source for their operation?
Internal workings of FPGAs such as block RAM, DSP, logic cell, LUT.
What is the most basic way of meeting or checking to see if you are meeting timing?
Tell me some of the constraints you used and their purpose during your design.
What are the different types of FPGA programming modes? Which are you currently using? How do you change from one to another?
Can you list some synthesizable and non-synthesizable constructs?
Can you draw the general structure of an FPGA?
Difference between FPGA and CPLD?
MMCM vs. DCM
What is a slice, CLB, LUT?
Which FPGA are you currently using, and what are some of the main reasons for choosing it?
Draw a rough diagram of how a clock is routed throughout an FPGA.
How many global buffers are there in your current FPGA? What is their significance?
What is the frequency of operation and equivalent gate count of your project?
Compare PLL & DLL.
Suggest some ways to increase clock frequency.
What is LVS and why do we do it? What is the difference between LVS and DRC?
Clock management blocks in FPGAs
What are the transceiver types and speeds in Xilinx FPGAs?
Why are there hard macros in FPGAs? What are the downsides of including them?
Why is FPGA performance not usually as good as an ASIC or mask-programmed gate array? (may not even be true anymore)
What are some of the basic differences between FPGA families?
I placed a timing constraint on a path, but the constraint has errors and misses the goal by 10% to 15%. What can I do to make this timing constraint pass?
If your constraint misses its timing requirement by 5% to 10%, and the logic delay is less than 60%, you can try selecting a higher placement effort (4 or 5) and constraining the I/Os, especially data buses.
I placed a timing constraint on a path, but the constraint has errors and too many levels of logic.
How can I make this timing constraint pass?
This is a case where logic exceeds some percentage of the total path delay, implying that there is too much logic between timing end points; the amount of logic must be reduced in order to meet timing requirements.
This number was traditionally around 50% for older architectures; it would need to be quantified for Virtex families (60%). There are exceptions to this rule for carry chain paths, in which the logic delays are much smaller and would allow for a higher number of logic levels or a lower component percentage.
To reduce the levels of logic, return to the source and try the following:
I have a path that is failing in my PERIOD constraint, and it does not matter to me that this path is valid every clock cycle.
How can I constrain this path to avoid errors?
A path that is allowed to take multiple clock cycles to be valid in a design is called a multi-cycle path. These types of paths are typically covered by a PERIOD constraint by default, and might cause errors since a PERIOD is a one-cycle constraint.
To eliminate these errors, you can remove the path from the PERIOD constraint by putting a specific multi-cycle constraint on the path. A multi-cycle constraint is applied by using a "FROM:TO" constraint. A "FROM:TO" constraint has higher priority than a PERIOD constraint, and it removes the specified path(s) from the PERIOD to the "FROM:TO" constraint.
Example syntax:
NET "clk" TNM_NET = "clk";
TIMESPEC TS_clk = PERIOD clk 10 ns;
INST "source_inst_name" TNM = "source_group";
INST "destination_inst_name" TNM = "destination_group";
TIMESPEC TS_01 = FROM "source_group" TO "destination_group" TS_clk*3;
The TS_01 timespec constrains a specific path (or paths) that are only valid every three clock cycles. The "FROM:TO" in TS_01 is constrained to three times the TS_clk timespec (that is, three clock cycles or 30 ns).
For more information on "FROM:TO" constraints, see the Constraints Guide at:
http://toolbox.xilinx.com/docsan/xilinx5/data/docs/cgd/cgd0092_15.html
#Net TIG;
NET reset_net TIG;
#Path TIG;
TIMESPEC TS_TIG = FROM source_group TO destination_group TIG;
I placed a timing constraint on a path, but the constraint has errors. What can I do to make this timing constraint pass?
Possible suggestions for high fanout signals:
The BootROM will fail booting from 0x0, will fallback and will boot from 0x0 +32KB offset (see UG585 Zynq-7000-TRM for Boot Partition Search).
The BootROM will use the Image Header at 0x0 + 16MB offset and then will boot with the boot image programmed at 0x0.
In the normal QSPI boot mode (not XIP), the first 4 bytes of the BootROM header are not used, and the checksum is only calculated from 0x020 to 0x044. So the incorrect data can be ignored. QSPI will boot without issues. However, if QSPI boots with XIP, the first word is used to remap the flash linear address space. The following applies only when RSA Authentication is used on QSPI flashes larger than 16MB in single x2 or x4, dual-stacked x4, and dual-parallel x4 configurations. The boot image cannot be placed at 0x0 offset in the flash when using a QSPI device larger than 16MB with the optional Execute-in-Place mode.
The BootROM will fail booting from 0x0, then will fallback and boot from 0x0 +32KB offset (see UG585 Zynq-7000-TRM for Boot Partition Search).
The BootROM will use the Image Header at 0x0 + 16MB offset and will then boot with the boot image programmed at 0x0.
QSPI boot time consideration with larger QSPI memory
QSPI Programming/Booting Checklist
Solution
1) Is the QSPI flash and configuration supported by Xilinx?
2) Is Zynq Production Silicon?
3) Is the JTAG chain operating properly?
4) In which phase of booting Zynq is failing? BootROM or FSBL?
5) Are SDK and iMPACT failing to program?
- See (Xilinx Answer 59275) for SDK / iMPACT QSPI programming known issues and workarounds.
- For debug purposes the Debug Environmental Variable XIL_CSE_ZYNQ_DISPLAY_UBOOT_MESSAGES can be set to 1. See (Xilinx Answer 59272) for more details.
6) Is it working using u-boot?
7) Is the board design to support the QSPI frequency used for programming?
8) Is the Xilinx standalone example working?
SD Programming/Booting Checklist
Solution
1) SD booting.
2) Is Zynq Production Silicon?
3) In which phase of booting is Zynq failing? BootROM or FSBL?
4) Are SD_CD and SD_WP properly connected on the board?
- FSBL - Fails to boot from SD if the card is write protected (WP is active)
- EMIO SD Write Protect and Card Detect Signals Not Properly Configured In Zynq FSBL
5) Is the SD running at a supported frequency?
6) Is the JTAG chain operating properly?
7) Is the Xilinx standalone example working?
NAND Programming/Booting Checklist
Note: Only On-Die ECC (Micron) and 1-bit ECC (Spansion) NAND devices can be used with the Zynq-7000 SoC. The Zynq NAND flash controller only provides 1-bit ECC and a single chip-select, which means that if the customer's NAND requires multiple bits of ECC or multiple chip-selects, that NAND can't be used. Micron On-Die ECC NAND, which uses multiple bits of ECC on-chip instead of relying on the ECC bit from the controller, can provide better ECC support.
Provide the full NAND flash name, the configuration mode and the voltage. If the configuration is not "standard" (muxes, level shifters or other), provide also the board schematics.
Provide a schematic of NAND part and monitor the data signals on the board to verify what kind of words are read.
- Use XMD to read and report the PS_VERSION from 0xF8007080.
for Silicon Version Register Values
-If it is not a production silicon, check for the silicon revision differences.
Provide Silicon Version reporting register 0xF8007080
Use XMD to try to connect to the CPU.
Provide JTAG chain description (how many devices on the chain, how many Zynq, Zynq in cascade or independent JTAG, any level shifter in the chain). Report any XMD error.
There are some issues related to the FSBL; first check the following answer records:
- for booting a monolithic Linux bin image
- for booting in secure boot mode
If some printing comes out on the UART during boot:
Provide a log of the FSBL print out on the UART. FSBL is a user application and can be easily debugged using SDK. Try to do a brief investigation before filing a Service Request.
- If nothing comes out on the UART during boot, first double check the UART baudrate.
- Check if the boot image is put to the first 128MB in NAND, the BootROM only searches a limited address.
Provide the status of INIT_B (high or low or blinking), REBOOT_STATUS and BOOT_MODE registers after the boot failure. Most likely the boot image was not programmed properly (continue to step 5).
If there are multiple devices on the JTAG chain and Zynq is not the first device:
For debug purposes, the debug environment variable XIL_CSE_ZYNQ_DISPLAY_UBOOT_MESSAGES can be set on your Windows or Linux machine.
Example of setting a Debug Environmental Variable for Windows:
Example of setting a Debug Environmental Variable for Linux:
NAND programming from Flash Writer requires DDR. Flash Writer needs a FSBL to initialize the board including DDR.
Be sure the FSBL is the same used in the Boot image
Provide the version of the tool used. Be sure your image was built with the same version of the tool used to program.
Provide the boot mode settings used for programming (booting from JTAG or NAND).
Provide the log obtained using the XIL_CSE_ZYNQ_DISPLAY_UBOOT_MESSAGES variable.
Use the u-boot.elf pre-built from the latest released image on the wiki, and follow the CTT guide (UG873) that includes the instructions under "Program QSPI Flash With the Boot Image Using JTAG and U-Boot Command".
The flow above for NAND is similar, but the u-boot commands differ from those for QSPI.
An example below shows
- nand info
- nand erase 0 0x100000
- nand write 0x800 0 0xE1000
If you need this support on existing zc70x configuration, please make the below changes in include/config/zynq_zc70x.h
#define CONFIG_NAND_ZYNQ
And then compile the zc70x as
$ make zynq_zc70x_config
$ make
Provide the log of the programming using pre-built u-boot image from the wiki. Specify the u-boot version used.
Use u-boot and double check the clock settings to verify the NAND clock frequency.
The NAND controller is based on ARM SMC PL353, refer to ARM PrimeCell Static Memory Controller (PL350 series) Technical Reference Manual r2p1 for more timing details.
Check that the NAND timing parameters have been set correctly according to the SMC timing calculation.
This setting will reflect to the SMC set_cycles register.
A NAND boot might not work due to an incorrect setting here.
Refer to the NAND device AC Characteristics to input the relevant timing parameters to the CS0 column, the unit is nanoseconds.
The cycles will be calculated based on the operating frequency.
NAND examples are provided under the SDK install directory
sw\XilinxProcessorIPLib\drivers\nandps_v1_0X_a\examples
Some Debug is needed to understand where the example is failing (through the SDK debugger or by adding debug prints).
Report the kind of failure found in the Xilinx standalone example.
Make sure there is a pull-up on the NAND BUSY signal.
The internal pull-up in Zynq is disabled in the BootROM (refer to TRM table 6-11).
QSPI flash programming now requires that you specify an FSBL
Register Name | Register Address | Register Value | |
ARM_PLL_CFG | 0xF8000110 | 0x00177EA0 | (default values) |
ARM_PLL_CTRL | 0xF8000100 | 0x0001A000 | ARM_PLL = 866 MHz (not bypassed) |
ARM_CLK_CTRL | 0xF8000120 | 0x1F000400 | CPU_6x4x = 866 / 4 = 216 MHz |
IO_PLL_CFG | 0xF8000118 | 0x00177EA0 | (default values) |
IO_PLL_CTRL | 0xF8000108 | 0x0001A000 | IO_PLL = 866 MHz (not bypassed) |
PLL_STATUS | 0xF800010C | 0x0000003F | ARM_PLL and IO_PLL are LOCKED and STABLE. |
SDK is not able to program the NOR if in NOR boot mode on a production silicon
- A ZC702 with 7020 production silicon.
- A ZC706 with 7045 production silicon.
It then goes into a non-secure lockdown and puts all of the MIO into 3-state.
As a result flash_writer cannot configure the NOR properly.
Because flash_writer requires an FSBL to configure NOR, you will need to add XSmc_NorInit() to the nor.c in FSBL
NAND programming requires the board to be in JTAG mode
Instead of reading the bootmode from the MIO:
Programming can fail for QSPI flash devices with 256 KB or 512 KB erase sector sizes; for example, an image size of 2228224 bytes is not a multiple of 256 KB.
2013.3 / 2013.4 /14.7 SDK, iMPACT, Zynq-7000 - Indirect Flash program does not work with multiple Zynq devices on JTAG chain
This solution uses the uboot API to copy the image to the QSPI.
- Connect to 2nd device in the JTAG chain
- Init the PS7
- Place the BOOT.bin into DDR (you can use any address, but don't overwrite the uboot at 0x04000000)
- Download the uboot ELF
- Execute
source ps7_init.tcl
ps7_init
dow -data BOOT.bin 0x08000000
Next, the u-boot sf commands are used to copy the image (BIN file) from DDR onto the QSPI flash.
The commands to do this are seen below:
sf erase 0 0x200000
sf write 0x08000000 0 0x200000
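Note: depending on the u-boot build, you may first need to run sf probe (for example, sf probe 0 0 0) so that the QSPI device is detected before the erase and write commands; treat the exact arguments as board-specific.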
The boot image in the QSPI will now be used.
=======================================================================
- Add phys_opt_design to the implementation flow. This will do timing based physical optimization which can help with congestion. Multiple iterations of phys_opt_design can also help, with each using different options.
Also, there is the option to use phys_opt_design post-placement or post-routing. See phys_opt_design -help for more information
- Vivado has several congestion specific Strategies that can be used (Tools Options -> Strategies). From these Strategies, specific directives for sub-steps such as place_design & route_design can be found that can be useful for congestion.
Also, the -Explore directive will generally give better results at a cost of increased run-time.
- Try using the Vivado Synthesis AlternateRoutability directive.
- Try iterating through different place_design directives (found with place_design -help). Specifically, the SpreadLogic_high/medium/low and AltSpreadLogic_high/medium/low directives are meant to spread logic to prevent congestion.
Comparing the results with different place_design directives and running through route_design will give you an idea of which directives perform better for a specific design.
Please note that the best performing directive can change as the design changes. report_timing_summary and report_design_analysis -congestion can be used to compare the different directives.
- Finding a suitable opt_design directive is also helpful for a congested design. Run report_utilization after each iteration to see which yields the lowest LUT and FF count, depending on which element is highly utilized.
- Try setting the Vivado Synthesis option -resource_sharing to "on". This can share arithmetic operators; the default is "auto".
- Try taking certain timing critical paths and over constraining them only during place_design and phys_opt_design. This prioritizes these paths which can lead to better QOR.
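A minimal Tcl sketch of this idea, assuming a clock named clk_core and a 0.3 ns margin (both are assumptions, not values from this Answer Record):
# Over-constrain before placement by adding extra setup uncertainty
set_clock_uncertainty -setup 0.300 [get_clocks clk_core]
place_design
phys_opt_design
# Remove the extra margin before routing so the router sees the real requirement
set_clock_uncertainty -setup 0.000 [get_clocks clk_core]
route_design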
- Try floorplanning block RAMs or DSPs. It can be helpful to floorplan block RAMs or DSPs using the best timing results from iterating through different place_design directives, and fixing these LOC constraints for further implementation runs.
There are a few options to obtain these constraints:
- From a routed design, select the Block RAMs or DSPs that you wish to fix, and right-click within the Device Window and select "Fix Cells". Saving the design at this point will save the fixed constraints.
- To print out LOC constraints for the current block RAM or DSP placement, use the below commands.
The results can be copied into an XDC file.
Similar syntax can be used for DSPs, and the search pattern can be found from Vivado IDE Find window.
set BRAMS [get_cells -hierarchical -filter { PRIMITIVE_TYPE =~ BLOCKRAM.BRAM.* } ]
foreach i $BRAMS {puts "set_property LOC [get_property LOC [get_cells $i]] \[get_cells \{${i}\}\]"}
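A hedged sketch of the DSP variant mentioned above (the ARITHMETIC.DSP.* pattern is an assumption; confirm the exact search pattern from the Vivado IDE Find window):
set DSPS [get_cells -hierarchical -filter { PRIMITIVE_TYPE =~ ARITHMETIC.DSP.* } ]
foreach i $DSPS {puts "set_property LOC [get_property LOC [get_cells $i]] \[get_cells \{${i}\}\]"}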
- Use the -force_replication_on_nets option of phys_opt_design. This is a good option for reducing fanout as a post-place phys_opt_design will be able to use the placement information to decide which nets are driven from replicated drivers so that the path length is not excessive.
Example command: "phys_opt_design -force_replication_on_nets ${hi_fanout_nets}" where hi_fanout_nets is a list of nets you would like to reduce the fanout of.
See phys_opt_design -help for more information. A Tcl example is attached to this Answer Record that finds synchronously driven high fanout nets and creates a variable with a list of nets that can be used in the phys_opt_design command.
- Use the -fanout_opt place_design option. Available with 2017.x, -fanout_opt performs high fanout replication of critical nets. See place_design -help for more information on the option.
- Use global buffers on non-clock high-fanout nets. The opt_design command can automatically insert BUFGs on high fanout nets.
Using global clocking resources can help congestion due to high fanout nets. Consult the report_high_fanout report from the routed design to see if there are potential candidates. Also, automatic BUFG insertion by opt_design can be adjusted. See (Xilinx Answer 54177) for more information.
- Try reducing/removing LUT combining from synthesis (-no_lc). This can reduce the number of nets entering CLBs that become congested due to LUT inputs.
- Try enlarging or removing pblock constraints if the logic constrained is related to the congested region. This gives the placer more flexibility to avoid the congestion.
- Analyze the necessity of resets. Some reset signals might not be necessary if the GSR can provide the necessary initialization. Consult Chapter 4 (RTL Coding Guidelines) of (UG949) for more information
===============================================================
Use a Makefile and Tcl scripts to run simulation, synthesis, implementation, and bitstream generation.
# Makefile
sim:
	vivado -mode batch -source sim.tcl
imp:
	vivado -mode batch -source non-proj.tcl
proj:
	vivado -mode batch -source proj.tcl
------------------------------------------------------------------------------------
# sim.tcl
set path [file dirname [info script]]
puts "script is invoked from $path"
source [file join $path modelsim.tcl]
------------------------------------------------------------------------------------
# modelsim.tcl
set file_dir [file normalize [file dirname [info script]]]
puts "== Unit Test directory: $file_dir"
#set ::env(XILINX_TCLAPP_REPO) [file normalize [file join $file_dir .. .. .. ]]
#puts "== Application directory: $::env(XILINX_TCLAPP_REPO)"
#lappend auto_path $::env(XILINX_TCLAPP_REPO)
set name "modelsim"
create_project $name ./$name -force
add_files -fileset sources_1 "$file_dir/q1.vhd"
#add_files -fileset sources_1 "$file_dir/.vhd"
add_files -fileset sim_1 "$file_dir/tb.vhd"
update_compile_order -fileset sources_1
update_compile_order -fileset sim_1
launch_simulation -batch
close_project
------------------------------------------------------------------------------------
# non-proj.tcl
# STEP#1: setup design sources and constraints
read_vhdl ./q1.vhd
#read_vhdl ./src/AND_TEST.vhd
#read_vhdl ./src/OR_GATE.vhd
# read_vhdl
# read_verilog
read_xdc ./q1.xdc
#
# STEP#2: define the output directory area.
set outputDir ./output
file mkdir $outputDir
#
# STEP#3: run synthesis, write design checkpoint, report timing,
# and utilization estimates
#
synth_design -top q1 -part xc7z020clg484-1
write_checkpoint -force $outputDir/post_synth.dcp
report_timing_summary -file $outputDir/post_synth_timing_summary.rpt
report_utilization -file $outputDir/post_synth_util.rpt
#
# Run custom script to report critical timing paths
#reportCriticalPaths $outputDir/post_synth_critpath_report.csv
#
# STEP#4: run logic optimization, placement and physical logic optimization,
# write design checkpoint, report utilization and timing estimates
#
opt_design
#reportCriticalPaths $outputDir/post_opt_critpath_report.csv
place_design
report_clock_utilization -file $outputDir/clock_util.rpt
#
# Optionally run optimization if there are timing violations after placement
if {[get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -setup]] < 0} {
    puts "Found setup timing violations => running physical optimization"
    phys_opt_design
}
write_checkpoint -force $outputDir/post_place.dcp
report_utilization -file $outputDir/post_place_util.rpt
report_timing_summary -file $outputDir/post_place_timing_summary.rpt
#
# STEP#5: run the router, write the post-route design checkpoint, report the
# routing status, report timing, power, and DRC, and finally save the Verilog netlist.
#
route_design
write_checkpoint -force $outputDir/post_route.dcp
report_route_status -file $outputDir/post_route_status.rpt
report_timing_summary -file $outputDir/post_route_timing_summary.rpt
report_power -file $outputDir/post_route_power.rpt
report_drc -file $outputDir/post_imp_drc.rpt
write_verilog -force $outputDir/cpu_impl_netlist.v -mode timesim -sdf_anno true
#
# STEP#6: generate a bitstream
#
write_bitstream -force $outputDir/proj.bit
------------------------------------------------------------------------------------
# proj.tcl
#
# STEP#1: define the output directory area.
#
set outputDir ./project
file mkdir $outputDir
create_project project_cpu_project ./project -part xc7z020clg484-1 -force
#
# STEP#2: setup design sources and constraints
#
# example:
#add_files -fileset sim_1 ./Sources/hdl/cpu_tb.v
#add_files [ glob ./Sources/hdl/bftLib/*.vhdl ]
#add_files ./Sources/hdl/bft.vhdl
#add_files [ glob ./Sources/hdl/*.v ]
#add_files [ glob ./Sources/hdl/mgt/*.v ]
#add_files [ glob ./Sources/hdl/or1200/*.v ]
#add_files [ glob ./Sources/hdl/usbf/*.v ]
#add_files [ glob ./Sources/hdl/wb_conmax/*.v ]
#add_files -fileset constrs_1 ./Sources/top_full.xdc
#set_property library bftLib [ get_files [ glob ./Sources/hdl/bftLib/*.vhdl]]
add_files -fileset sim_1 ./tb.vhd
add_files ./q1.vhd
add_files -fileset constrs_1 ./q1.xdc
#
# Physically import the files under project_cpu.srcs/sources_1/imports directory
import_files -force -norecurse
#
# Physically import bft_full.xdc under project_cpu.srcs/constrs_1/imports directory
import_files -fileset constrs_1 -force -norecurse ./q1.xdc
# Update compile order for the fileset 'sources_1'
set_property top top [current_fileset]
update_compile_order -fileset sources_1
update_compile_order -fileset sim_1
#
# STEP#3: run synthesis and the default utilization report.
#
launch_runs synth_1
wait_on_run synth_1
#
# STEP#4: run logic optimization, placement, physical logic optimization, route and
# bitstream generation. Generates design checkpoints, utilization and timing
# reports, plus custom reports.
set_property STEPS.PHYS_OPT_DESIGN.IS_ENABLED true [get_runs impl_1]
set_property STEPS.OPT_DESIGN.TCL.PRE [pwd]/pre_opt_design.tcl [get_runs impl_1]
set_property STEPS.OPT_DESIGN.TCL.POST [pwd]/post_opt_design.tcl [get_runs impl_1]
set_property STEPS.PLACE_DESIGN.TCL.POST [pwd]/post_place_design.tcl [get_runs impl_1]
set_property STEPS.PHYS_OPT_DESIGN.TCL.POST [pwd]/post_phys_opt_design.tcl [get_runs impl_1]
set_property STEPS.ROUTE_DESIGN.TCL.POST [pwd]/post_route_design.tcl [get_runs impl_1]
launch_runs impl_1 -to_step write_bitstream
wait_on_run impl_1
puts "Implementation done!"
------------------------------------------------------------------------------------
# pre_opt_design.tcl
############## pre_opt_design.tcl ##################
set outputDir [file dirname [info script]]/project
source [file dirname [info script]]/reportCriticalPaths.tcl
#
report_timing_summary -file $outputDir/post_synth_timing_summary.rpt
report_utilization -file $outputDir/post_synth_util.rpt
reportCriticalPaths $outputDir/post_synth_critpath_report.csv
------------------------------------------------------------------------------------
# post_opt_design.tcl
############## post_opt_design.tcl ##################
# Run custom script to report critical timing paths
reportCriticalPaths $outputDir/post_opt_critpath_report.csv
------------------------------------------------------------------------------------
# post_place_design.tcl
############## post_place_design.tcl ##################
report_clock_utilization -file $outputDir/clock_util.rpt
------------------------------------------------------------------------------------
# post_phys_opt_design.tcl
############## post_phys_opt_design.tcl ##################
report_utilization -file $outputDir/post_phys_opt_util.rpt
report_timing_summary -file $outputDir/post_phys_opt_timing_summary.rpt
------------------------------------------------------------------------------------
# post_route_design.tcl
############## post_route_design.tcl ##################
report_route_status -file $outputDir/post_route_status.rpt
report_timing_summary -file $outputDir/post_route_timing_summary.rpt
report_power -file $outputDir/post_route_power.rpt
report_drc -file $outputDir/post_imp_drc.rpt
write_verilog -force $outputDir/cpu_impl_netlist.v -mode timesim -sdf_anno true
------------------------------------------------------------------------------------
# reportCriticalPaths.tcl
#------------------------------------------------------------------------
# This function generates a CSV file that provides a summary of the first
# 50 violations for both Setup and Hold analysis. So a maximum number of
# 100 paths are reported.
#------------------------------------------------------------------------
proc reportCriticalPaths { fileName } {
    # Open the specified output file in write mode
    set FH [open $fileName w]
    # Write the current date and CSV format to a file header
    puts $FH "#\n# File created on [clock format [clock seconds]]\n#\n"
    puts $FH "Startpoint,Endpoint,DelayType,Slack,#Levels,#LUTs"
    # Iterate through both Min and Max delay types
    foreach delayType {max min} {
        # Collect details from the 50 worst timing paths for the current analysis
        # (max = setup/recovery, min = hold/removal)
        # The $path variable contains a Timing Path object.
        foreach path [get_timing_paths -delay_type $delayType -max_paths 50 -nworst 1] {
            # Get the LUT cells of the timing paths
            set luts [get_cells -filter {REF_NAME =~ LUT*} -of_object $path]
            # Get the startpoint of the Timing Path object
            set startpoint [get_property STARTPOINT_PIN $path]
            # Get the endpoint of the Timing Path object
            set endpoint [get_property ENDPOINT_PIN $path]
            # Get the slack on the Timing Path object
            set slack [get_property SLACK $path]
            # Get the number of logic levels between startpoint and endpoint
            set levels [get_property LOGIC_LEVELS $path]
            # Save the collected path details to the CSV file
            puts $FH "$startpoint,$endpoint,$delayType,$slack,$levels,[llength $luts]"
        }
    }
    # Close the output file
    close $FH
    puts "CSV file $fileName has been created.\n"
    return 0
}; # End PROC
Start with no timing constraints.
Use the IDE to view the clock networks.
Create basic clocks (see the sketch after these steps).
Review timing reports for coverage.
Open the synthesized design and view the schematic.
Run report_clock_networks and click Constraints to see the unconstrained clocks.
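A minimal sketch of a basic clock constraint, assuming a 100 MHz clock on a port named sysclk (both the port name and the period are assumptions for illustration):
create_clock -name sysclk -period 10.000 [get_ports sysclk]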
-------------------------------------------------------------------------------------------------------------
Before editing, run report_clocks.
Edit the timing constraints and click the generated clock entry.
Description:
Create a generated clock object
Syntax:
create_generated_clock [-name <arg>] [-source <args>] [-edges <args>]
                       [-divide_by <arg>] [-multiply_by <arg>]
                       [-combinational] [-duty_cycle <arg>] [-invert]
                       [-edge_shift <args>] [-add] [-master_clock <arg>]
                       [-quiet] [-verbose] <objects>
Returns:
new clock object
Usage:
Name              Description
-----------------------------
[-name]           Generated clock name
[-source]         Master clock source object pin/port
[-edges]          Edge specification
[-divide_by]      Period division factor: Value >= 1 (Default: 1)
[-multiply_by]    Period multiplication factor: Value >= 1 (Default: 1)
[-combinational]  Create a divide_by 1 clock through combinational logic
[-duty_cycle]     Duty cycle for period multiplication: Range: 0.0 to 100.0 (Default: 0.0)
[-invert]         Invert the signal
[-edge_shift]     Edge shift specification
[-add]            Add to the existing clock in source_objects
[-master_clock]   Use this clock if multiple clocks present at master pin
[-quiet]          Ignore command errors
[-verbose]        Suspend message limits during command execution
<objects>         List of clock source ports, pins, or nets
Categories:
SDC, XDC
report_timing
If the endpoints are ports, use the generic form; if they are pins, use get_pins in Tcl and run:
report_timing -from [get_pins ...] -to [get_pins ...]
View the contents of the report.
Timing reports always start from the primary clock, propagate to the generated clocks, and then on to the clocked elements.
Observe the destination path clock timing.
See the requirements of the generated clock.
See the destination clock: the start of the next clock edge and on to the destination register.
Slack is the required time minus the arrival time.
Open the schematic of the netlist, select clk_gen in the window, and zoom in.
Observe in the schematic the logic that is periodically enabled to generate the destination clocks.
In this case, the generated clock does not have a predefined relationship with the primary clock's clk pin.
As a result, create it with the Tcl command create_generated_clock (a sketch is shown below).
Save the constraints, enter the Tcl command report_clocks, and observe the new generated clock included in the timing reports.
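A minimal sketch of such a user-defined generated clock, assuming a register-based divide-by-2 inside a clk_gen block (all instance and pin names here are assumptions for illustration):
create_generated_clock -name clk_div2 -source [get_pins clk_gen/clk_div_reg/C] -divide_by 2 [get_pins clk_gen/clk_div_reg/Q]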
Click Report Timing Summary and select the new generated clock.
See the intra-clock paths and the setup timing; double-click any path to view the path properties.
See the source clock path: the source clock delay starts at the primary clock and propagates to the generated clocks, both automatically derived and manually created.
See the destination clock path: it also starts from the primary clock and propagates to the generated clocks.
Close the properties.
Set input delay:
report_timing -from [all_inputs]
See the reports.
set_input_delay -clock [get_clocks phyclk0] -max 3 $my_input
report_timing -from $my_input -setup
See the reports; check the slack and the input delay.
Insert the Tcl below:
set_input_delay -clock [get_clocks phyclk0] -min 1 $my_input
report_timing -from $my_input -hold
The report shows the actual slack and the input delay.
In addition, in the Edit Timing Constraints dialog you can add set_input_delay.
Set output delay:
report_timing -to [all_outputs]
See the reports.
set_output_delay -clock [get_clocks sysclk0] $my_output
report_timing -to $my_output -hold
See the reports; check the actual slack and the output delay.
In addition, in the Edit Timing Constraints dialog you can add set_output_delay.
Set clock groups:
Open the synthesized design and report clock interaction.
Analyze the timing paths from one clock domain to another clock domain.
The report shows a grid of clock interactions; each cell is colored to indicate the timing and constraint status of the inter-clock paths.
If two clocks are unrelated, they should be treated as asynchronous: when their frequencies are not integer multiples, it is impossible to find a fixed phase relationship between them.
report_clock_interaction -delay_type min_max -significant_digits 3 -name timing_1
See the WNS and choose max.
Suppose you have two clocks such as clk_out1_clk_core and clk_out2_clk_core; then:
join [get_timing_paths -from [get_clocks clk_out1_clk_core] -to [get_clocks clk_out2_clk_core] -max_paths 200]
join [get_timing_paths -from [get_clocks clk_out2_clk_core] -to [get_clocks clk_out1_clk_core] -max_paths 200]
To mark the two clocks as asynchronous:
Launch the timing constraints editor and double-click Set Clock Groups.
You need to specify two clock groups: add the first clock group, then add the second by clicking the + sign.
Note that asynchronous is chosen by default.
Save the constraints and report clock interaction again.
Observe that the inter-clock path between clock group 1 and clock group 2 is now shown in blue, confirming it is treated as asynchronous.
The equivalent Tcl constraint is sketched below.
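A minimal Tcl sketch of the equivalent constraint, assuming the two clock names used above:
set_clock_groups -asynchronous -group [get_clocks clk_out1_clk_core] -group [get_clocks clk_out2_clk_core]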
Set false path:
report_timing_summary
See the inter-clock paths and the setup timing.
set_false_path -from [get_pins ...] -to [get_pins ...]
report_timing -from [get_pins ...] -to [get_pins ...]
If the slack is infinite, the false path was applied successfully; if not, check the false path constraint.
Set multicycle path:
report_timing
See the inter-clock paths and the setup timing, then repeat for hold.
See the violated slack, if any, and the requirement.
After applying the multicycle path constraint (sketched below), you can see the change in slack.
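A minimal sketch of a multicycle path constraint, assuming a 2-cycle setup relationship between two registers (the cell names and the number of cycles are assumptions for illustration):
set_multicycle_path 2 -setup -from [get_pins src_reg/C] -to [get_pins dst_reg/D]
set_multicycle_path 1 -hold -from [get_pins src_reg/C] -to [get_pins dst_reg/D]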
Description:
Generate a new clock object from an existing
physical clock object in the
design.
Clocks can be added to a design in one of
three ways:
* Primary physical or virtual clocks defined with the create_clock command.
* Derived clocks defined with the create_generated_clock command generated from a primary physical clock.
* Derived clocks automatically generated by the Vivado Design Suite when a clock propagates to an MMCM/PLL/BUFR.
You can also use the create_generated_clock
command to change the name of clocks
that the Vivado tool has auto-derived from an MMCM/PLL/BUFR. In this case, a new clock is not created, but an
existing clock defined on the specified
source object is renamed to the provided name. This requires -name and <object> to be specified, and
supports the use of -source and/or -master_clock to further identify the clock
to rename when multiple clocks exist on
the source object. Refer to the Vivado Design Suite User Guide:
Using Constraints (UG903) for more
information on renaming auto-derived clocks.
Note: You cannot rename a clock that is
already in use by other constraints at
the time of renaming. You must rename the clock prior to any other appearance or use of the clock in an XDC file
This command returns the name of the clock
object that is created, or returns an
error if it fails.
Arguments:
-name <arg> - (Optional) The name of
the generated clock to create on the specified object, or the name to assign to an
existing clock on the specified object.
If no name is specified, the generated clock will be given the name of the <object> it is
assigned to. If assigned to multiple
<objects>, the name will be the first
object in the list.
-source <arg> - (Optional) The pin or
port of the master clock from which to
derive the generated clock. The master clock must be a previously defined physical clock, not a virtual clock;
but can be a primary clock or another
generated clock. If the source pin or port currently has multiple clocks defined, the -master_clock option must
be used to identify which clock on the
source is to be used to define the generated clock.
-edges <arg> - (Optional) Specifies
the edges of the master clock to use in
defining transitions on the generated clock. Specify transitions on the generated clock in a sequence of 1, 2, 3, by
referencing the appropriate edge count
from the master clock in numerical order, counting from the first edge. The sequence of transitions on
the generated clock defines the period
and duty cycle of the clock: position 1 is the first rising edge of the generated clock, position 2 is the first
falling edge of the generated clock and
so defines the duty cycle, position 3 is the second rising edge of the generated clock and so defines the
clock period. Enclose multiple edge
numbers in braces {}. See the example below for specifying edge numbers.
-divide_by <arg> - (Optional) Divide the
frequency of the master clock by the
specified value to establish the frequency of the generated clock object. The value specified must be >= 1,
and must be specified as an integer.
-multiply_by <arg> - (Optional)
Multiply the frequency of the master clock by the specified value to establish the
frequency of the generated clock object. The value specified must be >= 1,
and must be specified as an integer.
-combinational - (Optional) Define a
combinational path to create a "-divide_by 1" generated clock.
-duty_cycle <arg> - (Optional) The
duty cycle of the generated clock defined as a percentage of the new clock
period when used with the -multiply_by
argument. The value is specified as a percentage from 0.0 to 100.
-invert - (Optional) Create a generated clock
with the phase inverted from the master
clock.
-edge_shift <arg> - (Optional) Shift
the edges of the generated clock by the
specified values relative to the master clock. See the example below for specifying edge shift.
-add - (Optional) Add the generated clock
object to an existing clock group specified by <objects>.
Note: -master_clock and -name options must be specified with -add.
-master_clock <arg> - (Optional) If there are multiple clocks found on the source pin or port, the specified clock object is the one to use as the master for the generated clock object.
Note: -add and -name options must be specified with -master_clock.
-quiet - (Optional) Execute the command quietly, returning no messages from the command. The command also returns TCL_OK regardless of any errors encountered during execution.
Note: Any errors encountered on the
command-line, while launching the command, will be returned. Only errors
occurring inside the command will be trapped.
-verbose - (Optional) Temporarily override
any message limits and return all
messages from this command.
Note: Message limits can be defined with the
set_msg_config command.
<objects> - (Required) The pin or port
objects to which the generated clock should be assigned. If the specified objects
already have a clock defined, use the
-add option to add the new generated clock and not overwrite any existing clocks on the object.
Examples:
The following example defines a generated
clock that is divided from the master
clock found on the specified CLK pin. Since -name is not specified, the generated clock is assigned the same name
as the pin it is assigned to:
create_generated_clock -divide_by 2 -source [get_pins clkgen/sysClk] fftEngine/clk
The following example defines a generated
clock named CLK1 from the specified
source clock, specifying the edges of the master clock to use as transition points for the generated clock, with
edges shifted by the specified amount.
In this example, the -edges option indicates that the second edge of the source clock is the first
rising edge of the generated clock, the
third edge of the source clock is the first falling edge of the generated clock, and the eighth edge of the
source clock is the second rising edge
of the generated clock. These values determine the period of the generated clock as the time from edge 2
to edge 8 of the source clock,
and the duty cycle as the percentage of the
period between edge 2 and edge
3 of the source clock. In addition, each edge
of the generated clock is
shifted by the specified amount:
create_generated_clock -name CLK1 -source CMB/CLKIN -edges {2 3 8} \
-edge_shift {0 -1.0 -2.0} CMB/CLKOUT
Note: The waveform pattern of the generated
clock is repeated based on the
transitions defined by the -edges argument.
This example creates two generated clocks
from the output of a MUX, using
-master_clock to identify which clock to use,
using -add to assign the
generated clocks to the Q pin of a flip flop,
and using -name to define a
name for the generated clock, since the
object it is assigned to has
multiple clocks assigned:
create_generated_clock -source [get_pins muxOut] -master_clock M_CLKA \
-divide_by 2 -add -name gen_CLKA [get_pins flop_Q]
create_generated_clock -source [get_pins muxOut] -master_clock M_CLKB \
-divide_by 2 -add -name gen_CLKB [get_pins flop_Q]
The following example renames the
automatically named clock that is derived
by the Vivado Design Suite on the MMCM clock
output:
create_generated_clock -name CLK_DIV2 [get_pins mmcm/CLKOUT1]
See Also:
* check_timing
* create_clock
* get_generated_clocks
* get_pins
* report_clock
* set_clock_latency
* set_clock_uncertainty
* set_propagated_clock
Generated clocks are driven inside
the design by special cells called Clock Modifying Blocks (for example, an
MMCM), or by some user logic.
The XDC command
"create_generated_clock" is used to create a generated clock object.
Syntax:
create_generated_clock [-name
<arg>] [-source <args>] [-edges <args>]
[-divide_by <arg>]
[-multiply_by <arg>]
[-combinational] [-duty_cycle
<arg>] [-invert]
[-edge_shift <args>]
[-add] [-master_clock <arg>]
[-quiet] [-verbose]
<objects>
This article discusses the common
use cases of creating a generated clock.
For more information on
create_generated_clock, please refer to (UG903).
Solution
Generated clocks are associated with
a master clock from which they are derived.
The master clock can be a primary
clock or another generated clock.
Please ensure you define all primary
clocks first.
They are required for defining the
generated clocks.
Use Case 1: Automatically Derived
Clocks
For Clock Modifying Blocks (CMB)
such as MMCMx, PLLx,IBUFDS_GTE2, BUFR and PHASER_x primitives, you do not need
to manually create the generated clocks.
Vivado automatically creates these
clocks, provided the associated master clock has already been defined.
You only need to create the primary
clock that is feeding into the CMB.
The auto-generated clock names can
be reported by the report_clocks command in the synthesized or implemented
design so that you can use them in other commands or constraints.
It is possible to force the name of
the generated clock that is automatically created by the tool.
See "Use Case 2: Renaming
Auto-derived Clocks" below.
An auto-generated clock is not
created if a user-defined clock (primary or generated) is also defined on the
same netlist object, that is, on the same definition point (net or pin).
Vivado gives the following warning
message when an existing primary or generated clock prevents auto-generated
clock propagation:
Warning:[Timing 38-3] User defined
clock exists on pin <pin_name> and will prevent any subsequent automatic
derivation.
Automatically Derived Clock Example
The following automatically derived
clock example is a clock generated by an MMCM.
XDC constraint:
create_clock -name clkin -period 10.000 [get_ports clkin]
The report_clocks command prints the
following information:
Clock    Period    Waveform            Attributes  Sources
clkin    10.00000  {0.00000 5.00000}   P           {clkin}
cpuClk   10.00000  {0.00000 5.00000}   P,G         {clkip/mmcm0/CLKOUT}
Use Case 2: Renaming Auto-derived
Clocks
It is possible to force the name of
the generated clock that is automatically created by the tool.
The renaming process consists of
calling the create_generated_clock command with a limited number of parameters.
create_generated_clock -name new_name [-source source_pin]
[-master_clock master_clk] source_object
A single create_generated_clock
command has to specify a unique auto-derived clock to rename.
A user-defined generated clock
cannot be renamed.
Renaming Auto-derived Clock Example
Same example in Use Case 1:
XDC constraint:
create_clock -name clkin -period
10.000 [get_ports clkin]
#renaming auto-derived clock
create_generated_clock -name user_clk [get_pins clkip/mmcm0/CLKOUT]
Then the report_clocks command
prints the following information:
Clock     Period    Waveform            Attributes  Sources
clkin     10.00000  {0.00000 5.00000}   P           {clkin}
user_clk  10.00000  {0.00000 5.00000}   P,G         {clkip/mmcm0/CLKOUT}
......
Use Case 3: User Defined Generated
Clocks
When no automatic generation occurs,
you will need to manually create clock modifications.
For example, for a clock divider
logic that consists of LUTs and FFs, Vivado is not aware of the period
relationship between the source clock and the divided clock.
As a result, a user-defined
generated clock is required for the divided clock.
This type of clock divider is not recommended in an FPGA. We recommend using an MMCM or a PLL to divide the clock.
Specify the master source using the
-source option.
This indicates a pin or port in the design through which the master clock
propagates.
It is common to use the master clock source point or the input clock pin of a
generated clock source cell.
User Defined Generated Clock Example
The primary clock drives a register
divider to create a divide-by-2 clock at the register output.
Two equivalent constraints are
provided below:
create_clock -name clkin -period 10 [get_ports clkin]
# Option 1: master clock source is the primary clock source point with a 'divide by' value of the circuit.
create_generated_clock -name clkdiv2 -source [get_ports clkin] -divide_by 2 [get_pins REGA/Q]
# Option 2: master clock source is the REGA clock pin with a 'divide by' value of the circuit.
create_generated_clock -name clkdiv2 -source [get_pins REGA/C] -divide_by 2 [get_pins REGA/Q]
Use Case 4: Forwarded Clock through
ODDR
In the Source Synchronous
application, the clock is regenerated in the source device and forwarded to the
destination device along with data.
A common method is to use clock
forwarding via a double data-rate register.
In the following example, the ODDR
instance in the source device is used to generate the forwarding clock for the
Source Synchronous interface.
A user-defined generated clock needs
to be created for the forwarding clock in order to be used in the
set_output_delay constraint for the Source Synchronous interface.
Example of Creating Generated Clock
at Clock Output Port:
create_generated_clock -name fwd_clk -multiply_by 1 -source [get_pins ODDR_inst/C] [get_ports CLKOUT]
The generated clock can then be referenced in the set_output_delay command.
For more information on set_output_delay
command, please refer to (UG903).
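A minimal sketch of using the forwarded clock in an output delay constraint, assuming a data output port named DOUT and 2 ns max / -1 ns min delays at the receiver (the port name and delay values are assumptions for illustration):
set_output_delay -clock [get_clocks fwd_clk] -max 2.000 [get_ports DOUT]
set_output_delay -clock [get_clocks fwd_clk] -min -1.000 [get_ports DOUT]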
Use Case 5: Overlapping Clocks
Driven by a Clock Multiplexer
When two or more clocks drive into a
multiplexer (or more generally a combinatorial cell), they all propagate
through and become overlapped on the fanout of the cell.
For this reason, you must review the
CDC paths and add new constraints to exclude false paths due to overlapping.
The correct constraints are dictated
by how and where the clocks interact in the design.
In some scenarios, user-defined
generated clocks need to be created for the multiplexed clock in order to
correctly constrain the CDC paths.
Multiplexed Clock Example:
If clk0 and clk1 only interact in
the fanout of the multiplexer (FDM0 and FDM1), (i.e. the paths A, B and C do not
exist), it is safe to apply the clock groups constraint to clk0 and clk1
directly.
set_clock_groups -logically_exclusive -group clk0 -group clk1
If clk0 and/or clk1 directly interact with the multiplexed clock (i.e. the
paths A or B or C exist), then in order to keep timing for paths A, B and C,
the constraint cannot be applied to clk0 and clk1 directly.
Instead, it must be applied to the
portion of the clocks in the fanout of the multiplexer, which requires
additional clock definitions.
In this case, two generated clocks
are created at the Multiplexer output pin and paths crossing the generated
clock domains are ignored.
create_generated_clock -name clk0mux -divide_by 1 -source [get_pins mux/I0] [get_pins mux/O]
create_generated_clock -name clk1mux -divide_by 1 -add -master_clock clk1 -source [get_pins mux/I1] [get_pins mux/O]
set_clock_groups -physically_exclusive -group clk0mux -group clk1mux
Question: I have a design consisting of the Clocking Wizard IP (MMCM) with a 100 MHz input clock at "clk_in1", and I generate a 50 MHz clock at output "clk_out1".
As per UG903 (pages 88 and 89), Xilinx automatically derives "create_generated_clock" constraints for clocks generated using a PLL, MMCM, etc.
But when I check my design, I cannot see any "create_generated_clock" constraint defined automatically for "clk_out1".
Answer: You only need to run create_clock on the input port of the MMCM; the output clocks of the MMCM are then generated automatically. You do not need to run create_generated_clock on the MMCM outputs manually.
You can check the result of report_clocks to see the auto-generated clocks.
For example: clk_pin_p is the input clock of the MMCM, and clk_rx_clk_core / clk_tx_clk_core are the output clocks of the MMCM.
The following example
shows the use of the multiple through points to
define both a specific path (through
state_reg1) and alternate paths
(through count_3 or count_4), and writes the
timing results to the
specified file:
report_timing -from go -through {state_reg1} -through {count_3 count_4} \
-to done -path_type summary -file C:/Data/timing1.txt
report_timing -from [get_pins clk_in_IBUF_BUFG_inst/clk_in_IBUF_inst] -to [get_pins freq_cnt_reg[24]/R]
In a 2-stage pipeline, you break a task down into two sub-tasks and execute them in a pipeline. Let's say each stage takes 1 cycle to complete.
That means each task still takes 2 cycles to complete (the latency), but once the pipeline is full, one task completes every cycle, which is twice the throughput of the unpipelined 2-cycle version.
In a 3-stage pipeline, you break a task down into three sub-tasks and execute them in a pipeline.
Assuming again that each stage takes 1 cycle, the latency is higher (3 cycles per task), while the throughput is again one task per cycle, three times that of the unpipelined 3-cycle version.
The number of stages is chosen to obtain maximum throughput at an acceptable latency.
https://www.hardwaresecrets.com/inside-pentium-4-architecture/2/
Data movement
Data movement remains the challenge it is for all system designs, and therefore truly remarkable results involve careful attention to communication with memory, and with the interconnect between processors and FPGAs. Given such an interconnect with asymmetric transmission costs, data movement energy can be reduced by encoding the transmitted data such that the number of 1s in each transmitted codeword is minimized. To perform any computation, we must communicate data between the point in time and space where each intermediate data item is computed and where it is consumed. This communication can occur either through interconnect wires, if the operators are spatially located at different places, or through memories, if the operators are sequentialized on a common physical operator.
When the embedded memory perfectly matches the size and organization needed by the application, an FPGA embedded memory can be as energy efficient as the same memory in a custom ASIC. Another point of mismatch between architecture and application is the width of the data written to or read from the memory block; memory energy scales with the data width.
Because data is growing in volume, velocity, complexity, and variability, among other dimensions, data analytics applications increasingly require results from analysis in real time.
Catapult
The Catapult cloud field-programmable gate array architecture accelerates both cloud services and the Azure cloud platform; Azure is an open, flexible, enterprise-grade cloud computing platform.
Project Catapult employs an elastic architecture that links FPGAs together in a 6x8 ring network that provides 20 Gbps of peak bidirectional bandwidth at sub-microsecond latency. Microsoft is putting FPGAs on PCI Express networking cards in every new server deployed in its data centers. FPGAs handle compression, encryption, packet inspection, and other rapidly changing tasks for data center networks whose data rates have jumped to 50 Gbps within six years. They can score, filter, rank, and measure the relevancy of text and image queries on Bing. Using the Catapult system design, Microsoft Azure and Bing servers are being deployed with Intel Xeon processor-based nodes, where each node offers 64 GB of memory and an Intel/Altera Stratix V D5 FPGA with 8 GB of local DDR3 memory. The I/O engine on the FPGA transfers data of a specified length to/from contiguous regions in the communication buffer. The buffer is split into two parts; the request specifies the source and destination buffer addresses and the amount of data to be transferred to/from these buffers. As the computation for successive tiles is overlapped in the FPGA design, the overhead associated with making the compute request is hidden when a large enough number of tiles are processed successively. The FPGA reads from the source buffer, writes the result tiles to the destination buffer, and sets the completion flag.
SDAccel
SDAccel is a framework for developing and delivering FPGA-accelerated data center applications. The environment includes a familiar software development flow with an Eclipse-based IDE and an architecturally optimizing compiler that makes efficient use of FPGA resources. Developers of accelerated applications use a familiar software programming work flow. The SDAccel compiler provides the capabilities of the FPGA fabric through the processes of scheduling, pipelining, and dataflow.
FPGAs have the ability to be dynamically reconfigured. Just as a compiled program can be loaded into a processor, reconfiguring the FPGA at run time can re-purpose its resources to implement additional kernels as the accelerated application runs. The host application build process uses GCC: each host application source file is compiled to an object file (.o), and the object files (.o) are linked with the Xilinx SDAccel runtime shared library to create the executable (.exe). Each kernel is independently compiled to a Xilinx object (.xo) file. RTL kernels are compiled using the package_xo utility.
The kernel files are linked with the hardware platform to create the FPGA binary (.xclbin). xocc generates the simulation models of the device contents. The build target defines the nature of the FPGA binary. SDAccel provides three different build targets: two emulation targets used for debug and validation purposes, and the default hardware target used to generate the actual FPGA binary.
For hardware emulation, the kernel code is compiled into a hardware model which is run in a dedicated simulator. This build-and-run loop takes longer, but the target is useful for testing the functionality of the logic that will go into the FPGA and for getting initial performance estimates. For the hardware target, the kernel code is compiled into a hardware model (RTL) and is then implemented on the FPGA device, resulting in a binary that will run on the actual FPGA. In the host application, set up the environment to identify the OpenCL platform and the device IDs, specify a context, create a command queue, build a program, and spawn one or more kernels.
The issues facing systems today are the impact of moving data over long distances to CPUs and the inherent difficulty of optimizing the performance and power efficiency of data processing. An FPGA provides inherently flexible acceleration and offload capabilities. FPGA integration with high memory capacity minimizes data movement by bringing processing resources to the data. The SDA engine is a flexible environment which allows engineers to experiment with near-data processing while exploring the interaction between application software, drivers, firmware, FPGA bitfiles, and memory.
Energy Efficient Data Encoding Techniques
Differential encoding
Differential
encoding is an example of algebraic encoding and it transmits the bit-wise XOR
between successive words. In many cases, the bus values show high temporal
correlation. Due to this, the HD between successive values is either very small
or very large.
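A minimal Tcl sketch (not from the source) of the idea: each transmitted word is the XOR of the current and previous data words, so temporally correlated traffic produces transmitted words with very few 1s.
# Differential (XOR) encoding of a word stream.
proc diff_encode {words} {
    set prev 0
    set out {}
    foreach w $words {
        # Transmit the bit-wise difference from the previous word
        lappend out [expr {$w ^ $prev}]
        set prev $w
    }
    return $out
}
# Example: highly correlated successive 8-bit values
puts [diff_encode {0xA0 0xA1 0xA3 0xA3}]   ;# prints: 160 1 2 0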
Sparse encoding and limited-weight codes
Sparse encoding schemes sacrifice bandwidth to reduce the number of 1s. A K-LWC is an example of a sparse code and refers to a group of CWs (codewords) having a weight of at most K.
The limitation of LWCs is that their encoder and decoder have high logic complexity, so they are not suitable for on-chip interconnects.
Redundancy in encoding techniques
The encoding
techniques that do not use any redundancy or metadata are termed as irredundant
encoding techniques. By comparison, several encoding techniques use redundancy
in either space or time. Space redundancy implies use of additional bus lines
and time redundancy refers to use of additional transfer cycles.
Techniques based on value similarity
The first
technique works on the observation that in GPU applications, nearly twenty-two
out of thirty-two bits in a word are 0s and a data-word has on-average nine
leading 0s. The first technique flips all the bits of only positive data
values. This is realized by XNORing the leading “sign bit” with all the other
bits.
The second
technique works based on value similarity, but uses HD instead of arithmetic distance.
Here, one lane of a warp is chosen as a pivot and the remaining lanes are
termed as non-pivot. Then, for every block of data accessed by a warp, all the
“non-pivot lanes” perform XNOR with the “pivot lane”. Thus, the bits of those
lanes that match the “pivot lane” are changed to 1. As for the choice of the
“pivot lane”, they note that both the arithmetic and Hamming distances of
lane-0 from other lanes are generally larger than that of the middle-positioned
lanes. They choose lane-20 as the pivot lane since it shows the smallest
average HD from other lanes.
The third
technique works on the observation that in the 64b instruction, some positions
favor a bit value, e.g., 25th position favors ‘0’. Based on this
preference-information, they generate an “average mask” at compile time, such
that for bit positions favoring 0, the mask-bit at those positions are set to
0, and then, the remaining positions are set to 1. This mask is XNORed with all
the instructions to increase the frequency of 1s.
► Saving
processor cycles by offloading the computation
► High
performance of the PL-based accelerator itself
► Lower latency
► Higher
throughput
► Several times
faster compared to software-based computation
► Ensure that
data transfer delays between PS and accelerator do not eliminate the
performance gain from the accelerator
Techniques based on value predictors
Value predictors (VPs) can be used for reducing bit transitions (BTs) on on-chip buses. Their technique runs the same VP on
both the sender and the receiver. Since both VPs work synchronously and based
on identical values, their predictions are identical. The prediction of VP at
the sender side is compared with the actual value to be sent over the bus. In
case of a match, nothing is transmitted over bus, and the value is obtained
from the VP at the receiver. In case of a mismatch, the original value itself
is transmitted along with a redundant control bit.
With increasing accuracy of the VPs, their technique saves increasing amounts of energy. By using
prediction confidence, their technique can be improved further. They assume
that the set of possible predicted values are sorted by confidence. Also, let
the bus-width be W-bits. Firstly, the value with largest confidence is mapped
to a CW with least energy overhead. For TS scheme, the CW with least energy
overhead is the all-zero symbol which leads to zero BTs. Then, the next W
values are mapped to symbols with only single bit set, i.e., having an HD of
one. After this, symbols of higher HD are used. When a new input word is sent
to the predictor, it checks the existing encoded values. In case of a hit, the
corresponding CW is sent, otherwise, either the original data or the inverted
original data are sent.
Techniques based on storing frequent values in a table
FV (frequent value) encoding techniques work by saving frequent values in tables at both the sender and the receiver, and transmitting only the index of the value in the table, which reduces the number of BTs significantly.
Energy Consumption of Data Movement
Recent DDR3/4,
GDDR4/5, and LPDDR4 interfaces all support ODT in different forms. The IO
interface consumes energy when transmitting a 0 as the current flows from VDD
to GND; transmitting a 1 is effectively free. This asymmetric energy cost
provides the opportunity for coding techniques to reduce the energy consumption
of DDR4 and GDDR5 interfaces by reducing the number of transmitted 0s.
The energy consumption of the unterminated LPDDR3 IO interface is caused by charging and discharging the load capacitance of the data bus. Unlike the asymmetric energy consumption of a terminated interface, the energy consumption of the LPDDR3 IO interface is context-dependent and is proportional to the number of 0->1 and 1->0 transitions on the data bus. The Hybrid Memory Cube (HMC) employs SerDes links in its IO interface to provide high bandwidth and energy efficiency. In addition, the high static interface power prevents energy-efficient data encoding techniques from effectively reducing data movement energy. Photonic interconnects require significant hardware changes and also dissipate high static power.
FPGA for defense applications
The semiconductor industry has played a large role over the last two decades in the strides radar has made. In today's modern radar systems, the Active Electronically Scanned Array (AESA) is the most popular architecture. An FPGA is an ideal, and in some cases necessary, solution for addressing these challenges.
VPX is an ANSI standard, defined by the VMEbus community, that addresses the shortcomings in scalability and performance of parallel-bus and bus-to-bus bridging technology. VPX provides support for serial switch fabrics over a new high-speed connector and is designed to operate in harsh environments. VPX was largely focused at the board level, and industry leaders saw a need to develop a system-level standard to improve interoperability and reduce customization, testing, cost, and risk.
The VPX form factor is used today for a wide range of applications and deployed environments. Applications range from rail management systems to active electronically scanned array radars on high-performance fighter aircraft, and environments vary from lab-benign to tracked vehicles and unpressurized bays in high-altitude unmanned aerial vehicles. The ruggedness and huge bandwidth provided by VPX make it a natural choice for many projects.
Next-generation radar architectures such as digital phased arrays and synthetic aperture radar with ground moving target indication are the emerging technologies. Parameters such as high-performance data processing, ultra-wide bandwidth, high dynamic range, and adaptive systems needed for diverse mission requirements are some of the most common challenges system designers face in achieving this. Using floating-point technology with the Stratix FPGA series and variable-precision digital signal processing (DSP) allows the designer to define the needed precision for each stage of the design. Logic and DSP resources are used efficiently while reducing power consumption.
The Front-End Cluster is composed of N FPGA resources and a Control Processor. Data paths in and out of the FPGAs are implemented using serial interconnects and routed over a VPX backplane. The Front-End Cluster is responsible for the initial data acquisition and preprocessing of the raw input data. The Back-End Cluster comprises a graphics card with an NVIDIA GPGPU or an ATI GPGPU. The processor in slot one of the Back-End Cluster can be a Core 2 Duo or Core i5 dual-core processor, or a Core i7 quad-core processor. The Back-End Cluster may be connected to the Front-End Cluster over the control plane by Ethernet for post-processing, or it can be connected over the data plane, via PCIe through shared memory for instance.
The control processor is associated with the FPGA elements; it can be used to manage the FPGA cards and provides a multi-core processor resource. The Control Processor is connected via a PCIe switch to other system elements. Very often the system application data flows will involve multiple data streams that are assigned to different FPGA resources. In a radar application, the radar array antenna may be divided into regions, each associated with a sensor area to be processed by a specific FPGA resource. A switch is used to connect the SBC to the FPGAs. Once data is received and processed by the Front-End Cluster, it can be passed down to another resource for further processing. Post-processing involves final computations and processes, or potentially a display process. The two processing clusters are connected together via a hybrid switch, consisting of a PCIe Gen 2 switch fabric on the data plane and a GbE switch fabric on the control plane.
The switch allows for transparent and non-transparent connection of the two domains. It provides 6 ports, or 24 PCIe lanes, with transparent or non-transparent switching, and can support multiple root complexes. In addition to the PCIe fabric for the data plane, the switch supports gigabit Ethernet for implementation of the control plane via 1000BT or 1000BX Ethernet. The switch provides rear-expansion Ethernet ports and two copper or fiber ports out the front. It is fully managed, allowing Layer 2 and Layer 3 functionality.
The modular architecture enables users to add high performance FPGA and I/O to the base configuration, addressing many application requirements such as digital RF memory (DRFM), synchronous multi-channel MIMO systems, software defined radio (SDR), and more. The VPX370 was designed with flexibility in mind, both in the ability to scale from lab to field and in application versatility. All 4DSP FPGA, I/O technology, and backplanes used in the VPX370 can be configured for rugged conduction-cooled form factors, making the system an ideal platform for developing IP and technology with an easy migration path to a deployed rugged system. The blade is suited to on-platform cognitive electronic warfare (EW), next-generation radar, machine learning, and AI applications that require small, powerful, and scalable processing engines. Each blade combines an Intel Xeon D server-class processor, a Xilinx UltraScale field-programmable gate array (FPGA), and a mezzanine.
The FEP board is designed to handle the upfront processing of incoming high-definition video, electro-optical, infrared, radar, and other extremely high data rate signals. When used in conjunction with a dual-cluster system, single or multiple TIC-FEP-VPX3b boards form the basis of a highly efficient front-end processing system with configurable signal input options via a selection of FMC modules.
A DSP/FPGA-based application that benefits from 3U VPX is signals intelligence, which is often deployed on small platforms with low-power constraints. Ground vehicle sensors, software radio, and vetronics also benefit from 3U VPX. Military ground vehicles, while large, often have very limited space available for electronics. Integrators also want to take advantage of higher-performance systems by fitting new capabilities into a fixed space. Similarly, UAVs typically have severe SWaP limitations that can make 3U VPX a perfect fit, especially for smaller UAVs. One additional application area where we've seen a lot of interest in 3U VPX is electronic warfare and countermeasure systems. These systems are often deployed in pods or located in out-of-the-way places on an aircraft, such as on wing roots, where there simply is not a lot of space on the airframe.
OpenVPX and associated standards
define interfaces between Plug-In Modules and chassis for products intended to
be deployed in harsh environments
Changes to support data acquisition and RF subsystems – Radial clocking for high-precision clocking of A/Ds and D/As – Blind-mate backplane optical and coax connectors to support 2-level maintenance
OpenVPX builds on VPX to add
system thinking – VPX (VITA 46) has dot specifications for each protocol and
some others: VME, RapidIO, PCIe, Ethernet – With OpenVPX there are profiles
which make use of multiple protocols – OpenVPX profiles spell out how multiple
VPX dot specifications are to be used together
OpenVPX Modules are frequently adapted for particular I/O needs using mezzanine cards – XMCs used on processing modules or carrier cards to add I/O and other features
• PMC – PCI Mezzanine Cards – IEEE 1386.1-2001 – Use PCI with a data bus of 32 or 64 bit – XMC is becoming more common
• XMC – ANSI/VITA 42.0 (Approved 2008)
• PMC and XMC based on IEEE 1386-2001 Common Mezzanine Card (CMC) – Single-width is 149.0 x 74.0 mm (5.87 x 2.91 inches); double-width is 149.0 x 149.0 mm
• There are
dot specifications for various protocol options – ANSI/VITA 42.1-2006 for
Parallel RapidIO 8/16 LP-LVDS – ANSI/VITA 42.2-2006 for Serial RapidIO –
ANSI/VITA 42.3-2006 for PCI Express (PCIe) – most popular for current systems
OpenVPX Modules are frequently adapted for particular I/O needs using mezzanine cards – FMCs used on FPGA boards to add things like:
• Analog to digital and digital to analog converters
• Fiber-optic transceivers.
High
Frequency Trading
High Frequency Trading (HFT) has over the past years become an increasingly important element of financial markets. HFT describes a set of techniques within the electronic trading of stocks and derivatives, where a large number of orders are injected into the market at sub-millisecond round-trip execution times. High frequency traders aim to end the trading day "flat", without holding any significant positions, and utilize several strategies to generate revenue by buying and selling stock at very high speed. In fact, studies show that a high frequency trader holds stock for only 22 seconds on average. According to the Aite Group, the impact of HFT on the financial markets is substantial, accounting for more than 50% of all trades in 2010 on the US equity market with a growth rate of 70% in 2009. High frequency traders utilize a number of different strategies, including liquidity-providing strategies, statistical arbitrage strategies, and liquidity detection strategies.
In liquidity-providing strategies, high frequency traders try to earn the bid-ask spread, which represents the difference between what buyers are willing to pay and what sellers are willing to accept for trading a stock. High volatility and large bid-ask spreads can be turned into profits for the high frequency trader, while in return he provides liquidity to the market and lowers the bid-ask spread for other participants, adopting the role of a market maker. Liquidity and low bid-ask spreads are desirable as they reduce trading costs and improve the informational efficiency of asset prices. Traders that employ arbitrage strategies, on the other hand, try to correlate pricing information between related stocks or derivatives and their underlying prices.
Liquidity detection comprises
strategies that seek to discover large orders by sending out small orders which
can be leveraged by the traders. All strategies have in common that they
require absolute lowest round-trip latencies as only the fastest HFT firm will
be able to benefit from an existing opportunity. Electronic trading of stocks
is conducted by sending orders in electronic form to a stock exchange. Bid and
ask orders are then matched by the exchange to execute a trade. Outstanding
orders are made visible to the market participants through so-called feeds. A
feed is a compressed or uncompressed real time data stream provided by an
independent institution like the Options Price Reporting Authority (OPRA). A
feed carries pricing information of stocks and is multicasted to the market
participants using standardized protocols which are generally transmitted over
UDP over Ethernet. The standard protocol that is applied is the Financial
Information Exchange (FIX) protocol Adapted for Streaming (FAST), which is used by multiple stock exchanges to distribute their market data.
A typical HFT system consists of four main building blocks: network stack, financial protocol parsing, order book handling, and a custom application layer. Financial exchanges broadcast market updates over an Ethernet connection at typical line rates of 10 Gb/s. The network stack receives the messages sent by the financial exchange and performs the initial packet processing.
The packets are usually compressed in a domain-specific format to save on bandwidth; a prominent example is FAST (FIX Adapted for STreaming), which is an adaptation of FIX (Financial Information Exchange).
The financial protocol parsing block changes the compressed packets into meaningful limit and market orders that are used to build the order book. The order book gives a view of the current market price by ordering bids (buying offers) and asks (selling offers) according to their prices, with the highest bidding price and lowest asking price at the top of the book. Finally, the top bid and ask entries are used by the custom application layer to analyze the market and consequently issue buy/sell orders. These orders are then encoded using the same financial protocol and sent back over the network. The time interval between receiving the incoming packets of an order into the system and sending out the triggered response packets is defined as the time-to-trade or the round-trip latency. Due to the importance of having low-latency HFT systems, traders and hardware vendors have been in an arms race to lower the total round-trip latency.
Typical high-end processor-based systems with specialized Network Interface Controllers (NICs) can react to market orders in a few microseconds. But due to the need to decrease latency further beyond that, designing application-specific hardware accelerators has started to gain more attention in the HFT domain, especially FPGA-based accelerators due to their flexibility and re-programmability. FPGA-based systems have proved to achieve far lower latency, approaching a four-fold reduction compared to conventional NIC solutions, often with more deterministic response times.