High Level Synthesis: from Algorithm to Digital Circuit

36 T. Bollaert
Fig. 3.3 The Gantt chart
to complete. In the Catapult flow, the generation of RTL is accomplished in a matter
of minutes.
Catapult generates VHDL, Verilog or SystemC netlists, based on user settings.
Various reports are also produced providing both hardware-centric and algorithm-
centric information about the design’s characteristics.
Finally, Catapult provides an integrated verification flow that automates the pro-
cess of validating the HDL netlist(s) output from Catapult against the original
C/C++ input. This is accomplished by wrapping the netlist output with a SystemC
“foreign module” class and instantiating it along with the original C/C++ code and
testbench in a SystemC design. The same input stimuli are applied to both the orig-
inal and the synthesized code and a comparator at each output validates that the
output from both are identical (Fig. 3.4). The flow automatically generates all of
the SystemC code to provide interconnect and synchronization signals, Makefiles to
perform compilation, as well as scripts to drive the simulation.
3.3 Coding and Optimizing a Design with Catapult Synthesis
This section provides an overview of the various controls users can leverage to
efficiently synthesize their designs.
3 Catapult Synthesis: A Practical Introduction to Interactive C Synthesis 37
Fig. 3.4 Catapult synthesis’ automatic verification flow
3.3.1 Coding C/C++ for Synthesis
The coding style used for functional specification is plain C++ that provides a
sequential implementation of the behavior without any notion of timing or concur-
rency. Both the syntax and the semantics of the C++ language are fully preserved.
3.3.1.1 General Constructs and Statements
Catapult supports a very broad subset of the ANSI C++ language. The C/C++
synthesized top-level function may call other sub-functions, which may be inlined
or may be kept as a level of hierarchy. The design may also contain static variables
that keep some state between invocations of the function. "if" and "switch"
conditional statements are supported, as well as "for," "do" and "while" looping
statements. "break," "continue" and "return" branching statements are synthesizable
as well. The only noticeable restriction is that the code must be statically
determinable, meaning that all of its properties must be defined at compile time. As
such, dynamic memory allocation and deallocation (malloc, free, new, delete) are not
supported.
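As a minimal sketch of this style (names and arithmetic are illustrative, not taken from the book), the following function uses a statically bounded loop, an "if" conditional, a sub-function and a static variable holding state between invocations:

```cpp
// Illustrative sketch of the synthesizable subset described above.
// A sub-function keeping state in a static variable between calls:
static int accumulate(int x) {
    static int acc = 0;  // state preserved across invocations
    acc += x;
    return acc;
}

// Top-level function: statically bounded loop, "if" conditional,
// and no dynamic memory allocation anywhere.
int sum_positive(const int in[4]) {
    int sum = 0;
    for (int i = 0; i < 4; ++i) {
        if (in[i] > 0)
            sum = accumulate(in[i]);
    }
    return sum;
}
```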
3.3.1.2 Pointers
Pointers are synthesizable if they point to statically allocated objects and therefore
can be converted into array indexes. Pointer arithmetic is also supported, and
a pointer can point to several objects inside an array.
Fig. 3.5 Coding style example
3.3.1.3 Classes and Templates
Compound data types such as classes, structs and arrays are fully supported for
synthesis. Furthermore, parameterization through C++ templates is also supported.
The combination of classes and templates provides a powerful mechanism facilitat-
ing design re-use.
The example in Fig. 3.5 gives an overview of some of the coding possibilities
allowed by the Catapult synthesizable subset. A struct is defined to model an RGB
pixel. The struct is templatized so users can define the actual bit-width of the R,
G and B fields. Additionally, a method is defined which returns a grayscale value
from the RGB pixel. The synthesized design is the "convert_to_gray" function. It
is implemented as a loop which reads RGB pixels one by one from an input array,
calls the "to_gray" method to compute the result and assigns it to the output array
using pointer arithmetic.
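A compilable sketch in the spirit of Fig. 3.5 is shown below. Since the actual listing is in the figure, the field names and grayscale weights here are assumptions; plain unsigned integers stand in for the bit-accurate fields so the sketch builds without the Algorithmic C headers.

```cpp
#include <cstddef>

// Templated RGB struct; in the real example the fields would be
// bit-accurate types parameterized by W.
template <int W>
struct rgb_t {
    unsigned r, g, b;

    // Integer approximation of the luma weights (0.299, 0.587, 0.114).
    unsigned to_gray() const {
        return (299u * r + 587u * g + 114u * b) / 1000u;
    }
};

// Synthesized top-level function: reads pixels one by one and writes
// the grayscale result through pointer arithmetic.
template <int W, std::size_t N>
void convert_to_gray(const rgb_t<W> (&in)[N], unsigned *out) {
    for (std::size_t i = 0; i < N; ++i) {
        *(out + i) = in[i].to_gray();  // pointer arithmetic on the output
    }
}
```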
3.3.1.4 Bit-Accurate Data Types
Hardware designers are accustomed to bit-accurate datatypes in hardware design
languages such as VHDL and Verilog. Similarly, bit-accurate data types are needed
to synthesize area-efficient hardware from C models. The arbitrary-length bit-accurate
integer and fixed-point "Algorithmic C" datatypes provide an easy way
to model static bit-precision with minimal runtime overhead. Operators and methods
on both the integer and fixed-point types are clearly and consistently defined so
that they have well-defined simulation and synthesis semantics.
The precision of the integer type ac_int<W,S> is determined by the template
parameters W (an integer giving the bit-width) and S (a boolean that determines
whether the integer is signed or unsigned).
The fixed-point type ac_fixed<W,I,S,Q,O> has five template parameters which
determine its bit-width, the location of the fixed point, whether it is signed or
unsigned, and the quantization and overflow modes that are applied when
constructing or assigning to objects of its type.
The advantages of the Algorithmic C datatypes over the existing integer and
fixed-point datatypes are the following:
• Arbitrary-Length: this allows a clean definition of the semantics for all operators
that are not tied to an implementation limit. It is also important for writing general
IP algorithms that don’t have artificial (and often hard to quantify and document)
limits for precision.
• Precise Definition of Semantics: special attention has been paid to define and
verify the simulation semantics and to make sure that the semantics are appropriate
for synthesis. No simulation behavior has been left compiler-dependent.
Also, asserts have been introduced to catch invalid code during
simulation.
• Simulation Speed: the implementation of ac_int uses sophisticated template
specialization techniques so that a regular C++ compiler can generate optimized
assembly language that runs much faster than the equivalent SystemC
datatypes. For example, ac_int of bit-widths in the range 1–32 can run 100×
faster than the corresponding sc_bigint/sc_biguint datatypes and 3× faster than
the corresponding sc_int/sc_uint datatypes.
• Correctness: the simulation and synthesis semantics have been verified for many
size combinations using a combination of simulation and equivalence checking.
• Compilation Speed and Smaller Executables: code written using ac_int datatypes
compiles 5× faster, even with compiler optimizations turned on (required for
fast simulation). It also produces smaller binary executables.
• Consistency: the semantics of ac_int and ac_fixed are consistent with each other.
In addition to the Algorithmic C datatypes, Catapult Synthesis also supports the
C++ native types (bool, char, short, int and long) as well as the SystemC sc_int,
sc_bigint and sc_fixed types and their unsigned versions.
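The defining behavior of these types is that results are kept to the declared width. The toy class below mimics the modulo-2^W behavior that assigning to an unsigned W-bit integer type exhibits; it is a teaching sketch in plain C++, not the Algorithmic C implementation (which lives in ac_int.h):

```cpp
#include <cstdint>

// Teaching sketch only: keeps values modulo 2^W, mimicking the
// wrap-around behavior of an unsigned W-bit integer datatype.
template <int W>
struct uint_w {
    static_assert(W > 0 && W < 32, "sketch supports 1..31 bits");
    uint32_t v;
    uint_w(uint32_t x = 0) : v(x & mask()) {}          // mask on construction
    static constexpr uint32_t mask() { return (1u << W) - 1u; }
    uint_w operator+(uint_w o) const { return uint_w(v + o.v); }
};
```

With W = 4, adding 15 + 1 wraps to 0, just as a 4-bit hardware adder would.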
3.3.2 Synthesizing the Design Interface
3.3.2.1 Hardware Interface View of the Algorithm
The design interface is how a hardware design communicates with the rest of the
world. In the C/C++ source code, the arguments passed to the top-level function
infer the interface ports. Catapult can infer three types of interface ports:
• Input Ports transfer data from the rest of the world to the design. All inputs are
either non-pointer arguments passed to the function or pointer arguments that are
read only.
• Output Ports transfer data from the design to the rest of the world. Structure or
pointer arguments infer output ports if the design writes to them but does not
read from them.
• Inout Ports transfer data both to and from the design. These are pointer arguments
that are both written and read.
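A hypothetical top-level function showing all three port kinds (the function and its behavior are illustrative, not from the book):

```cpp
// Argument-to-port mapping sketch:
//   coeff : passed by value, read only   -> input port
//   in    : pointer argument, read only  -> input port
//   out   : written but never read       -> output port
//   acc   : both read and written        -> inout port
void scale_and_sum(int coeff, const int in[4], int out[4], int *acc) {
    for (int i = 0; i < 4; ++i) {
        out[i] = in[i] * coeff;  // output: write only
        *acc += in[i] * coeff;   // inout: read-modify-write
    }
}
```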
3.3.2.2 Interface Synthesis
Catapult builds a correspondence between the arguments of the C/C++ function
and the I/Os of the hardware design. Once this correspondence is established, the
designer uses interface synthesis constraints to specify the properties of each
hardware port.
With this approach, designers can target and build any kind of hardware interface.
Interface synthesis directives give users control over parameters such as bandwidth,
timing, handshakes and other protocol aspects.
This way the synthesized C/C++ algorithm remains purely functional and does
not have to embed any interface-specific information. The same code can be
retargeted to any interface requirement (bandwidth, protocol, etc.).
Amongst other transformations and constraints, the user can for instance:
• Define full, partial or no handshake on interface signals
• Map arrays to wires, memories, busses or streams
• Control the bandwidth (bitwidth) of the hardware ports
• Add optional start/done flags to the design
• Define custom interface protocols
Hardware-specific I/O signals such as clock, reset, enable and handshaking
signals do not need to be modeled either; they are added automatically based on user
constraints.
3.3.3 Loop Controls
3.3.3.1 Loop Unrolling
Loop unrolling exposes parallelism that exists across different subsequent iterations
of a loop by partially or fully unrolling the loop.
The example in Fig. 3.6 features a simple loop summing two vectors of four
values. If the loop is kept rolled, then Catapult will generate a serial architecture. As
shown on the left, a single adder will be allocated to implement the four additions.
The adder is therefore time-shared, and dedicated control logic is built accordingly.
Assuming the mux, add and demux logic can fit in the desired clock period, four
cycles are needed to compute the results.
On the right-hand side, the same design is synthesized with its loop fully
unrolled. Unrolling is applied by setting a synthesis constraint and has the same
effect as copying four times the loop body. Catapult can now exploit the operation-
level parallelism to build a fully parallel implementation of the same algorithm. The
resulting architecture necessitates four adders to implement the four additions and
has a latency of one clock cycle.
Partial unrolling may also be used to trade off the area, power and performance of
the resulting design. In the above example, an unrolling factor of 2 would cause
the loop body to be copied twice, and the number of loop iterations to be halved. The
Fig. 3.6 Unrolling defines how many times to copy the body of a loop
synthesized solution would therefore be built with two adders, and have a latency of
two cycles.
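The effect of unrolling can be sketched directly in C. The rolled loop below matches the spirit of the Fig. 3.6 example; the second function is what a full unroll conceptually turns it into (in Catapult the transformation is applied as a constraint, not by editing the source):

```cpp
// Rolled form: four additions can share a single, time-shared adder
// in a serial schedule.
void vsum_rolled(const int a[4], const int b[4], int out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}

// Fully unrolled equivalent: four independent additions are exposed,
// enabling a fully parallel, single-cycle schedule with four adders.
void vsum_unrolled(const int a[4], const int b[4], int out[4]) {
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
}
```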
3.3.3.2 Loop Merging
Loop merging exploits loop-level parallelism. This technique applies to sequential
loops and creates a single loop with the same functionality as the original loops.
This transformation is used to reduce latency and area consumption in a design by
allowing parallel execution, where possible, of loops that would normally execute
in series.
With loop merging, algorithm designers can develop their application in a very
natural way, without having to worry about potential parallelism in the hardware
implementation.
In Fig. 3.7, the code contains sequential loops. Sequential loops are very con-
venient to model the various processing stages of an algorithm. By enabling or
disabling loop merging, the designer decides if in the generated hardware, the loops
should run in parallel (merging enabled) or sequentially (merging disabled). With
this technique, the designer maintains the readability and hardware independence
of the source code. The transformation and optimization techniques in Catapult can
produce a parallelized design which would otherwise have required a much more
convoluted source description, as shown on the right-hand side.
It should also be noted in this example that Catapult is able to appropriately
optimize the intermediate data storage. When sequentially processing the two loops,
intermediate storage is needed to store the values of “a.” When parallelizing the two
loops, values of “a” produced in the first loop can directly be consumed by the
second loop, removing the need for storage.
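This storage optimization can be sketched as follows (the stage bodies are placeholders, not taken from Fig. 3.7):

```cpp
// Sequential form: two processing stages communicating through an
// intermediate array "a", which requires storage.
void stages_sequential(const int in[4], int out[4]) {
    int a[4];                  // intermediate storage
    for (int i = 0; i < 4; ++i)
        a[i] = in[i] * 2;      // stage 1 produces "a"
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + 1;     // stage 2 consumes "a"
}

// Merged form: each value of "a" is consumed as soon as it is
// produced, so the intermediate array disappears.
void stages_merged(const int in[4], int out[4]) {
    for (int i = 0; i < 4; ++i) {
        int a = in[i] * 2;     // produced...
        out[i] = a + 1;        // ...and immediately consumed
    }
}
```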
Fig. 3.7 Merging parallelizes sequential loops
Fig. 3.8 Pipelining defines when to initiate the next iteration of a loop
3.3.3.3 Loop Pipelining
Loop pipelining provides a way to increase the throughput of a loop (or to decrease
its overall latency) by initiating the next iteration of the loop before the current
iteration has completed. Overlapping the execution of subsequent iterations of a
loop exploits parallelism across loop iterations. The number of cycles between iter-
ations of the loop is called the initiation interval. In many cases loop pipelining may
improve the resource utilization thus increasing the performance/area metric of the
design.
In the example of Fig. 3.8, a loop iteration consists of four operations: an I/O read
of in[i], a multiplication by coef1, an addition of coef2, and finally an I/O
write to out[i]. Assuming that each of these operations executes in a clock cycle,
and if no loop constraints are applied, the design schedule will look as shown on
the left hand side. Each operation happens sequentially, and the start of a loop
iteration (shown here with the red triangle) happens after the previous iteration
completes. Conceptually, the pipeline initiation interval is equal to the latency of
a loop iteration, in this case, four cycles.
By constraining the initiation interval with loop pipelining, designers determine
when to start each loop iteration, relative to the previous one. The schedule on the
right hand side illustrates the same loop, pipelined with an initiation interval of one
cycle: the second loop iteration starts one cycle after the first one.
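The loop itself stays the same in the source; only the initiation-interval constraint changes. A sketch of the Fig. 3.8 loop follows (the array size and exact expression are assumptions):

```cpp
// Fig. 3.8-style loop: per iteration, one read of in[i], a multiply
// by coef1, an add of coef2 and one write to out[i]. Whether these
// iterations overlap in time is decided by the pipelining constraint
// in the tool, not by the C code.
void mac_loop(const int in[8], int out[8], int coef1, int coef2) {
    for (int i = 0; i < 8; ++i)
        out[i] = in[i] * coef1 + coef2;
}
```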

Pipelining a design directly impacts the data rate of the resulting hardware
implementation. The first solution makes one I/O access every four cycles, while the
second one makes I/O accesses every cycle. Some applications may require a given
throughput, thereby dictating the initiation interval constraint. Other designs
may tolerate some flexibility, allowing designers to explore different pipelining
scenarios, trading off area, bandwidth utilization and power consumption.
3.3.4 Hierarchical Synthesis
The proper integration of individual blocks into a sub-system is one of the major
challenges in chip design. With its hierarchical synthesis capability Catapult Syn-
thesis can greatly simplify the design and integration tasks, building complex
multi-block systems correct-by-construction.
While loop unrolling exploits instruction level parallelism and loop merging
exploits loop level parallelism, hierarchy exploits function level (task-level) paral-
lelism. In Catapult, the user can specify which function calls should be synthesized
as hierarchical units. The arguments of the hierarchical function define the data
flow of the system, and Catapult will build all the inter-block communication and
synchronization logic.
Hierarchy generalizes the notion of pipelining, allowing different functions to run
in a parallel and pipelined manner. In complex systems consisting of various pro-
cessing stages, hierarchy is very useful to meet design throughput constraints. When
pipelining hierarchical systems, Catapult builds a design where the execution of the
various functions overlap in time. As shown in Fig. 3.9, in the sequential source
code, the three functions (stage1, stage2 and stage3) execute one after the other. In
Fig. 3.9 Task-overlapping with hierarchical synthesis
the resulting hierarchical system, the second occurrence of stage1 can start together
with the first occurrence of stage2, as soon as the first occurrence of stage1 ends.
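A hypothetical three-stage top level of the kind Fig. 3.9 describes; the stage bodies are placeholders, and in Catapult each stage call would be marked as a hierarchical unit:

```cpp
// Three processing stages passing arrays to each other; the arrays
// define the data flow between the hierarchical units.
void stage1(const int in[4], int t1[4])  { for (int i = 0; i < 4; ++i) t1[i] = in[i] + 1; }
void stage2(const int t1[4], int t2[4])  { for (int i = 0; i < 4; ++i) t2[i] = t1[i] * 2; }
void stage3(const int t2[4], int out[4]) { for (int i = 0; i < 4; ++i) out[i] = t2[i] - 3; }

// Sequential source code: in the generated hardware, Catapult can
// overlap the stages so that stage1 starts its next block of data
// while stage2 still works on the previous one.
void pipe_top(const int in[4], int out[4]) {
    int t1[4], t2[4];
    stage1(in, t1);
    stage2(t1, t2);
    stage3(t2, out);
}
```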
3.3.5 Technology-Driven Scheduling and Allocation
Scheduling and allocation is the process of building and optimizing the design given
all the user constraints, including the specific clock period and target technology.

With the clock period defining the maximum register-to-register path, the tech-
nology defines the logic delay for each design operation. The design schedule is
therefore intimately tied to these clock and technology constraints (Fig. 3.10). This
is fundamental to building optimized RTL implementations, allowing efficient
retargeting of algorithmic specifications from one ASIC process to another, or even to
FPGAs, with consistently optimal results.
This capability opens new possibilities in the field of IP and reuse. While RTL
reuse can provide a quick path to the desired functionality, it often comes at the
expense of suboptimal results. RTL IPs may be reused over many years. Developed
on older processes, these IPs will certainly work on newer ones, but without taking
advantage of higher speeds and better densities, resulting in bigger and slower
implementations than needed. In contrast, Catapult can build optimized RTL
designs from functional IPs for each process generation, taking reuse to a new level
of efficiency.
Fig. 3.10 Technology-driven scheduling and allocation
3.4 Case Study: JPEG Encoder
In this section we will show how a sub-system such as a JPEG encoder can be
synthesized with Catapult Synthesis.
We chose a JPEG encoder design for this case study, as we felt that the application
would be sufficiently familiar to most readers to be easily understood
without extensive explanations. Moreover, such an encoder features a pedagogical
mix of datapath and control blocks, giving a good overview of Catapult Synthesis'
capabilities.
3.4.1 JPEG Encoder Overview
The pixel pipe (Fig. 3.11) of the encoder can be broken down into four main
stages: first, the RGB to YCbCr color space conversion block; second, the DCT
(discrete cosine transform); third, zigzag reordering combined with quantization;
and last, the Huffman encoder.
3.4.2 The Top Level Function

The top level function synthesized by Catapult (Fig. 3.12) closely resembles the
system block diagram. Four sub-functions implement the four processing stages of
the algorithm. The sub-functions simply pass on arrays to each other, mimicking the
system data flow.
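The shape of such a top level can be sketched as below; since the actual listing is in Fig. 3.12, the stage bodies here are identity placeholders and the array sizes are arbitrary assumptions:

```cpp
// Four sub-functions passing arrays to each other, mimicking the
// system data flow. The real processing is elided: each stage body
// is a placeholder.
void color_conversion(const int in[8], int a[8]) { for (int i = 0; i < 8; ++i) a[i] = in[i]; }
void dct_stage(const int a[8], int b[8])         { for (int i = 0; i < 8; ++i) b[i] = a[i]; }
void quantize_zigzag(const int b[8], int c[8])   { for (int i = 0; i < 8; ++i) c[i] = b[i]; }
void huffman_encode(const int c[8], int out[8])  { for (int i = 0; i < 8; ++i) out[i] = c[i]; }

void jpeg_top(const int in[8], int out[8]) {
    int a[8], b[8], c[8];            // inter-stage arrays
    color_conversion(in, a);
    dct_stage(a, b);
    quantize_zigzag(b, c);
    huffman_encode(c, out);
}
```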
3.4.3 The Color Space Conversion Block
The color space conversion unit is implemented as a relatively straightforward
vector multiplication. Different sets of coefficients are used for Y, Cb and Cr
components.
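As an illustration, the luma component of such a conversion can be written as a dot product with fixed-point coefficients; the Q10 coefficients below are a common BT.601 approximation, assumed here rather than taken from the book:

```cpp
// One component of the color space conversion: a dot product of the
// RGB vector with a coefficient set, followed by a fixed-point shift.
int csc_component(int r, int g, int b, const int coef[3]) {
    return (coef[0] * r + coef[1] * g + coef[2] * b) >> 10;  // Q10 scale
}

int rgb_to_y(int r, int g, int b) {
    // ~0.299, 0.587, 0.114 scaled by 1024 (the weights sum to exactly 1024)
    static const int y_coef[3] = {306, 601, 117};
    return csc_component(r, g, b, y_coef);
}
```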
Fig. 3.11 JPEG encoder block diagram
