the high degree of chip integration made possible by Moore's law. For example,
a cell-phone chip now contains multiple modems, an imaging pipeline for a camera,
video codecs, music players, etc. A video codec that was a whole chip a few years
ago is now a small part of the chip. Second, there is relentless pressure to
reduce time-to-market and lower prices.
It is clear that automation is the key to success. Automatic application engine
synthesis (AES) from a high level algorithmic description significantly reduces both
design time and design cost. There is a growing consensus in the design community
that hardware/software co-design, high level synthesis, and high level IP reuse are
together necessary to close the design productivity gap.
4.1.2 Application Engine Design Space
Application engines like multi-standard video codecs are large, complex systems
containing a significant number of processing blocks with complex dataflow and
control flow among them. Externally, these engines interact with the system CPU,
the system bus, and other application engines. The ability to synthesize complex application
engines from C algorithms automatically requires a careful examination of the type
of architectures that lend themselves well to such automation techniques.
Broadly speaking, there are three main approaches for designing application
engines [4] (see Fig. 4.2).
1. Dedicated hardware accelerators: They provide the highest performance and
the lowest power. Typically, they are 2–3 orders of magnitude better in power
and performance than a general purpose processor. They are non-programmable
but can provide a limited amount of multi-modal execution based on configuration
parameters. There are two approaches for automatic synthesis of dedicated
hardware blocks:
Fig. 4.2 The application engine design space: behavioral synthesis of accelerators/FPGAs,
architectural synthesis of accelerators/FPGAs, customizable or configurable processors,
and hybrid application engines
(a) Behavioral synthesis: This is a bottom-up approach in which individual
blocks are designed separately. C statements and basic blocks are mapped to
a datapath, leading to a potentially irregular datapath and interconnect. The
datapath is controlled by a monolithic state machine that reflects the control
flow between the basic blocks and can be fairly complex.
(b) Architectural synthesis: This is a top-down approach with two distinguishing
characteristics. First, it takes a global view of the whole application and can
optimize across blocks in order to provide high performance. Second, it uses
an efficient, high performance architecture template to design datapath and
control leading to more predictable results. PICO’s approach for designing
dedicated hardware accelerators falls in this category.
2. Customizable or configurable processors: Custom or application-specific pro-
cessors can give an order of magnitude better performance and power than a
general-purpose processor while still maintaining a level of programmability.
This approach is well-suited for the following two cases:
(a) The performance requirements are not very high and power requirements are
not very stringent.
(b) Standards or algorithms are still in flux, and flexibility to make algorithmic
changes after fabrication is needed.
3. Hybrid approach: In our view, this is the right approach for synthesizing complex
application engines. An efficient architecture for these engines is a combina-
tion of
(a) Programmable processor(s), typically a custom embedded processor, for parts
of the application that don’t require high performance
(b) Dedicated hardware blocks to get high performance at low power and low
area
(c) Local buffers and memories for high bandwidth
This approach allows a full spectrum of designs to be explored that trade off among
multiple dimensions of cost, performance, power and programmability.
4.1.3 Requirements of a Production AES System
In addition to generating competitive hardware, a high level synthesis system needs
to fit in a SoC design flow for it to be practically useful and of significant benefit
to designers. We can identify a number of steps in the SoC design process. These
steps, along with the capabilities that the synthesis system must provide for each
step, are described below.
1. Architecture exploration for application engines: Architecture and micro-
architecture choices have a great impact on the power, performance and area
of a design, but there is no way to reliably predict this impact without actually
doing the design. A high level synthesis system makes it possible to do design
space exploration to find an optimal design. However, the system must be struc-
tured to make it easy to explore multiple designs from the same C source code.
For example, a system that requires users to control the scheduling of individual
operations in order to get good results is not very useful for architectural explo-
ration because of the amount of time it takes to do one design. Therefore, design
automation is the key to effective exploration.
2. High level, multi-block IP design and implementation: This is, of course, the
main purpose of a high level synthesis system. It must be able to generate
designs that are competitive with manual designs for it to be widely acceptable
in production environments.
3. RTL verification: It is unrealistic to expect that designers would write test-
benches for the RTL generated by a synthesis system. They should verify their
design at the C level using a C test bench. The synthesis system should then auto-
matically generate either an RTL test bench including test vectors or a C-RTL
co-simulation test bench. In addition, the synthesis system should provide a
mechanism to test corner cases in the RTL that cannot be exercised using the
C test bench.
4. System modeling and validation (virtual platform) support: Currently, designers
have to manually write transaction level models (TLM) for IP they are designing
in order to incorporate them in system level platforms. This is in addition to
implementing designs in RTL. Generating transaction level models directly from
a C algorithm will significantly reduce the development time for building these
models.
5. SoC integration: To simplify the hardware integration of the generated IP into an
SoC, the system should support a set of standard interfaces that remain invariant
over designs. In addition, the synthesis system should provide software device
drivers for easy integration into a CPU based system.
6. RTL to GDSII design flow integration: The generated RTL should seamlessly go
through the existing RTL flows and methodologies. In addition, the RTL should
close timing in the first pass and shouldn’t present any layout problems because
it is unrealistic to expect that designers will be able to debug these problems for
RTL they didn’t write.
7. Soft IP reuse and design derivatives: One of the promised benefits of high level
synthesis system is the ability to reuse the same C source for different designs.
Examples include designs at different performance points (low-end vs. high-end)
across a product family or design migration from one process node to another
process node. As an example of the requirement placed on the tool, support for
process migration requires that there is a methodology to characterize the process
and then feed the relevant information to the tool so that it is retargeted to that
process.
4.2 Overview of AES Methodology
Figure 4.3 shows the high level flow for synthesis of application engines following
the hybrid approach outlined in Sect. 4.1.2. Typically, the first step in the appli-
cation engine design process is high level partitioning of the desired functionality
into hardware and software components. Depending on the application, an engine
may consist of a control processor (custom or off-the-shelf) and one or more cus-
tom accelerator blocks that help to meet one or more design objectives such as
cost, performance, and power. Traditionally, the accelerator IP is designed block by
block either by reusing blocks designed previously or by designing new hardware
blocks by hand, keeping in view the budgets for area, cycle-time and power. Then
the engine is assembled together, verified, and integrated with the rest of the SoC
platform, which usually takes up a significant fraction of the overall product cycle.
The bottlenecks and the risks in this process clearly are in doing the design, verifi-
cation and integration of the various accelerator blocks in order to meet the overall
functionality specification and the design objectives. In the rest of the paper, we will
focus our attention on these issues.
In traditional hardware design flows, substantial initial investment is made to
define a detailed architectural specification of various accelerator blocks and their
interactions within the application engine. These specifications help to drive the
manual design and implementation of new RTL blocks and their verification test
benches. In addition, a functional executable model of the entire design may be
used to test algorithmic coverage and serve as an independent reference for RTL
verification.
Fig. 4.3 Application engine design flow
In design flows based on high level synthesis, on the other hand, an automatic
path to RTL implementation and verification is possible starting from a high level,
synthesizable specification of functionality together with architectural information
that helps in meeting the desired area, performance and power metrics. The addi-
tional architectural information may be provided to an HLS tool in various ways.
One possible approach is to combine the hardware and implementation specific
information together with the input specification. Some tools based on SystemC [5]
require the user to model the desired hardware partitioning and interfaces directly in
the input specification. Other tools require the user to specify detailed architectural
information about various components of the hardware being designed using a GUI
or a separate design file. This has the advantage of giving the user full control of
their hardware design but it increases the burden of input specification and makes
the specification less general and portable across various implementation targets.
It also leaves the tool with very little freedom to make changes and optimizations
in the design in order to meet the overall design goals. Often, multi-block hardware
integration and verification becomes solely the responsibility of the user because the
tool has little or no control over the interfaces being designed and their connectivity.
4.2.1 The PICO Approach
PICO [6] provides a fully automated, performance-driven, application engine syn-
thesis methodology that enables true algorithmic level input specification and yet
is sensitive to physical design constraints. PICO not only produces a cost-effective
C-to-RTL mapping but also guarantees its performance in terms of throughput and
cycle-time. In addition, multiple implementations at different cost and performance
tradeoffs may be generated from the same functional specification, effectively
reusing the input description as flexible algorithmic IP. This methodology also
reduces design verification time by creating customized verification test benches
automatically and by providing a correct-by-construction guarantee for both RTL
functionality and timing closure. Lastly, this methodology generates a standard set
of interfaces, which reduces the complexity of assembling blocks into an application
engine and final integration into the SoC platform.
The key to PICO’s approach is to use an advanced parallelizing compiler in
conjunction with an optimized, compile-time configurable architecture template to
generate hardware as shown in Fig. 4.4. The behavioral specification is provided
using a subset of ANSI C, along with additional design constraints, such as through-
put and clock frequency. The RTL design creation can then be viewed as a two-step
process. In the first step, a retargetable, optimizing compiler analyzes the high level
algorithmic input, exposing and exploiting enough parallelism to meet the required
throughput. In the second step, an architectural synthesizer configures the architec-
tural template according to the needs of the application and the desired physical
design objectives such as cycle-time, routability and cost.
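
To make this concrete, the sketch below shows the flavor of such an input: a small filter kernel in plain ANSI C together with its throughput and clock-frequency constraints. The kernel and the comment-style constraint annotations are illustrative assumptions for exposition only, not PICO's actual constraint syntax.

/* Illustrative AES input (assumed syntax, not PICO's): a small
 * downsampling kernel in ANSI C. The annotations below stand in for
 * the separately specified design constraints mentioned in the text:
 *   throughput:      one output sample per cycle (assumed)
 *   clock frequency: 200 MHz (assumed)                         */

#define N 256

void downsample2x(const unsigned char in[N], unsigned char out[N / 2])
{
    int i;
    for (i = 0; i < N / 2; i++) {
        /* average each pair of input samples, rounding to nearest */
        out[i] = (unsigned char)((in[2 * i] + in[2 * i + 1] + 1) / 2);
    }
}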
Fig. 4.4 PICO's approach to high level synthesis: an ANSI C algorithm (e.g., FDE) and design
constraints (throughput, clock frequency) drive the application engine synthesis step, which
combines an advanced parallelizing compiler with a configurable architectural template to
produce Verilog RTL hardware and SystemC models
Fig. 4.5 System level design flow using PICO
4.2.2 PICO’s Integrated AES Flow
Figure 4.5 shows the overall design flow for creating RTL blocks using PICO.
The user provides a C description of their algorithm along with performance
requirements and functional test inputs. The PICO system automatically generates
the synthesizable RTL, customized test benches, synthesis and simulation scripts,
as well as software integration drivers to run on the host processor. The RTL imple-
mentation is cost-efficient and is guaranteed to be functionally equivalent to the
algorithmic C input description by construction. The generated RTL can then be
taken through standard simulation, synthesis, place and route tools and integrated
into the SoC through automatically configured scripts.
Along with the hardware RTL and its related software, PICO also produces
SystemC-based TLM models of the hardware at various levels of abstraction:
an untimed programmer's view (PV) and a timed programmer's view (PV+T). The PV
model can be easily integrated into the user’s virtual SoC platform enabling fast
validation of the hardware functionality and its interfaces in the system context,
whereas the PV+T model enables early verification of the performance, the paral-
lelism and the resources used by the hardware in the system context.
The knowledge of the target technology and its design trade-offs is embed-
ded as part of a macrocell library which the PICO system uses as a database
of hardware building blocks. This library consists of pre-verified, parameterized,
synthesizable RTL components such as registers, adders, multipliers, and intercon-
nect elements that are carefully hand-crafted to provide the best cost-performance
tradeoff. These macrocells are then independently characterized for various target
technology libraries to obtain a family of cost-performance tradeoff curves for var-
ious parametric settings. PICO uses this characterization data for its internal delay
and area estimation.
4.2.3 PICO Results and Experience
The PICO Express™ tool incorporating our approach has been used extensively
in production environments. Table 4.1 shows a representative set of designs done
Table 4.1 Some example designs created using PICO Express™

| Product | Design | Area | Performance | Time vs. hand design |
|---|---|---|---|---|
| DVD | Horizontal–vertical filter | 60–49 K gates, 40% smaller than target | Met cycle budget and frequency target | v1: 1 month; v2: 3 days vs. 2–3 months |
| Digital camera | Pixel scaler | Met the target | Multiple versions designed at different targets | 2–3 weeks; multiple revisions within hours |
| Set-top box | HD video codec | 200 K gates, 5% smaller than hand design | Same as hand design | <2 months to design and verify |
| Camcorder | High-perf. video compression | 1 M gates, met the target | Same as hand design | Same design time with significantly less resources |
| Video processing | Multi-standard deblocking, deringing and chroma conversion | Same as hand design | 30% higher than hand design | 3–4× productivity improvement |
| Multimedia cell phone | High bandwidth 3G wireless baseband | 400 K gates, same as hand design | Same as hand design | 2 months vs. >9 months |
| Wireless LAN | LDPC encoder for 802.11n | 60 K gates, 6% over hand design | Same as hand design, low power | <1 month to design and verify |
using PICO Express. These designs range from a relatively small horizontal–vertical
filter for a DVD player with ∼49 K gates to large designs with more than 1 M gates
for high performance video compression. In all cases, designs generated using PICO
Express met the desired performance targets with an area within 5–10% of the hand-
design except in one case where the PICO design had significantly less area. In all
cases, PICO Express provided significant productivity improvements ranging from
3–5× for the initial design and more than 20× for derivative designs. As far as we
know, no other HLS tool can handle many of these designs because of their com-
plexity and the amount of parallelism needed to meet performance requirements.
Users’ experience with PICO Express is described in these papers [7,8].
4.3 The PICO Technology
In this section, we will describe the key ingredients of the PICO technology that
help to meet the application engine design challenges and the requirements of a
high level synthesis tool as outlined in Sect. 4.1.
4.3.1 The Programming Model
The foremost goal of PICO has been to make the programming model for design-
ing hardware to be as simple as possible for a large class of designs. PICO has
chosen C/C++ languages as the preferred mode of input specification at the
algorithmic level. The goal is not to replace Verilog or VHDL as hardware spec-
ification languages necessarily, but to raise the level of specification to a point
where the essential algorithmic content can be easily manipulated and explored
without worrying about the details of hardware allocation, mapping, and scheduling
decisions.
Another important goal for PICO’s programming model is to allow the user to
specify the hardware functionality as a sequential program. PICO automatically
extracts parallelism from the input specification to meet the desired performance
based on its analysis of program dependences and external resource constraints.
However, the functional semantics of the hardware generated still corresponds to
the input sequential program. On one hand, this has obvious advantages for under-
standability and ease of design and debugging, while on the other hand, this allows
the tool to explore and throttle the parallelism as desired since the input specifica-
tion becomes largely independent of performance requirements. This approach also
helps in verifying the final design against the input functional specification in an
automated way.
Fig. 4.6 Multiple levels of parallelism exploited by PICO: a purely sequential execution of
loop-nests L1, L2, L3 is contrasted with loop-level parallelism, task-level parallelism, and
combined task-level and loop-level parallelism across tasks, with iteration-level and
instruction-level parallelism exploited inside each loop-nest
4.3.1.1 Sources of Parallelism
A sequential programming model may appear to place a severe restriction on the
class of hardware one can generate or the kind of parallelism one can exploit in those
hardware blocks. However, this is not actually so. A very large class of consumer
data-processing applications, such as those in the fields of audio, video, imaging,
security, wireless, and networking, can be expressed as sequential C programs that
process and transform arrays or streams of data. There is a tremendous amount of
parallelism in these applications at various levels of granularity, and PICO is able to
exploit it all using various techniques.
As shown in Fig. 4.6, many of these applications consist of a sequence of trans-
formations expressed as multiple loop-nests encapsulated in a C procedure that is
designated to become hardware. One invocation of this top level C procedure is
called a task; it processes one block of data by executing each loop-nest once.
This is followed by the next invocation of the code processing another
block of data. PICO, however, converts the C procedure code to a hardware pipeline
where each loop-nest executes on a different hardware block. This enables proce-
dure level task parallelism to be exploited by pipelining a sequence of tasks through
this system, increasing overall throughput considerably.
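
A minimal sketch of this structure is shown below: a top level C procedure containing two loop-nests, each of which would become a separate hardware block in the generated pipeline. The function name, data sizes, and the transforms themselves are illustrative assumptions, not taken from an actual PICO design.

#define W 64
#define H 64

/* One invocation of this top level procedure is one task: it processes
 * one block of data by executing each loop-nest once. Under the mapping
 * described above, loop-nest 1 and loop-nest 2 would each become a
 * hardware block, and successive tasks would be pipelined through them. */
void engine_task(const short in[H][W], short out[H][W])
{
    short tmp[H][W];
    int r, c;

    /* Loop-nest 1: horizontal difference -> hardware block 1 */
    for (r = 0; r < H; r++) {
        tmp[r][0] = in[r][0];
        for (c = 1; c < W; c++)
            tmp[r][c] = (short)(in[r][c] - in[r][c - 1]);
    }

    /* Loop-nest 2: vertical difference -> hardware block 2 */
    for (c = 0; c < W; c++) {
        out[0][c] = tmp[0][c];
        for (r = 1; r < H; r++)
            out[r][c] = (short)(tmp[r][c] - tmp[r - 1][c]);
    }
}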
At the loop-nest level, PICO provides a mechanism to express streaming data that
is synchronized with two-way handshake and flow control in the hardware. In the C
program, this manifests itself simply as an intrinsic function call that writes data to
a stream and another intrinsic function call that reads data from that stream. Streams
may be used to communicate data between any pair of loop-nests as long as temporal
causality between the production and the consumption of data is maintained during
sequential execution. The advantage of the fully synchronized communication in
hardware is that the loop-nests can be executed in parallel with local transaction
level flow control, which exploits producer–consumer parallelism at the loop level.
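
The sketch below illustrates this style of specification, assuming hypothetical stream_write()/stream_read() intrinsics (PICO's actual intrinsic names may differ); here they are modeled sequentially by an array-backed FIFO so the fragment remains executable C. In the synthesized hardware, the stream would instead become a flow-controlled FIFO between the two loop-nests.

#include <assert.h>

#define N 1024

/* Sequential model of the assumed stream intrinsics: an array-backed
 * FIFO. In hardware these calls map to a FIFO with a two-way handshake,
 * so the two loop-nests below can run in parallel with
 * producer-consumer flow control. */
static int fifo[N];
static int wr_idx = 0, rd_idx = 0;

static void stream_write(int v) { assert(wr_idx < N); fifo[wr_idx++] = v; }
static int  stream_read(void)   { assert(rd_idx < wr_idx); return fifo[rd_idx++]; }

void stream_task(const int in[N], int out[N])
{
    int i;

    /* Producer loop-nest: scale each sample and push it to the stream */
    for (i = 0; i < N; i++)
        stream_write(2 * in[i]);

    /* Consumer loop-nest: pop each sample and form a running sum.
     * Production precedes consumption in sequential order, so temporal
     * causality is maintained, as the text requires. */
    out[0] = stream_read();
    for (i = 1; i < N; i++)
        out[i] = out[i - 1] + stream_read();
}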
Within a single hardware block implementing a loop-nest, PICO exploits iter-
ation level parallelism by doing detailed dependence analysis of the iteration
space and transforming the loop-nest to run multiple iterations in parallel even
in the presence of tight recurrences. Subsequently, the transformed loop iteration
code is scheduled using software-pipelining techniques that exploit instruction level
parallelism to provide the shortest possible schedule while meeting the desired
throughput.
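
As an illustration of what the compiler must handle, consider the loop below (our own example, not one from the PICO literature): the accumulation forms a tight recurrence, but the multiply in each iteration is independent and can be overlapped across iterations by software pipelining. No source change is needed; the parallelism is extracted by the compiler.

#define N 256

/* A loop with a tight recurrence: acc depends on the previous
 * iteration, but each product a[i]*b[i] is independent. A software
 * pipeliner overlaps the independent loads and multiplies of later
 * iterations with the dependent accumulates of earlier ones, so the
 * rate at which new iterations start (the initiation interval) is
 * bounded by the recurrence, not by the full latency of the loop body. */
int dot_product(const int a[N], const int b[N])
{
    int acc = 0;
    int i;
    for (i = 0; i < N; i++)
        acc += a[i] * b[i];
    return acc;
}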
4.3.1.2 The Execution Model
Given the parallelism available in consumer applications at various levels, the PICO
compiler attempts to exploit this parallelism without violating the sequential seman-
tics of the application. This is accomplished by following the well-defined, parallel
execution model of Kahn process networks [9], where a set of sequential pro-
cesses communicate via streams with block-on-read semantics and unbounded
buffering. Kahn process networks have the advantage that they provide determin-
istic parallelism, i.e., the computation done by the process network is unchanged
under different scheduling of the processes. This property enables PICO to par-
allelize a sequential program with multiple loop-nests to a Kahn process network
implemented in hardware where each loop-nest computation is performed by a cor-
responding hardware block that communicates with other such blocks via streams.
Since the process network is derived from a sequential program, it still retains the
original sequential semantics even under different parallel executions of its hardware
blocks. Each hardware block, in turn, runs a statically parallelized implementa-
tion of the corresponding loop-nest that is consistent with its sequential semantics
using software-pipelining techniques. In this manner, iteration level and instruc-
tion level parallelism are exploited at compile-time within each hardware block,
and producer–consumer and task level parallelism are exploited dynamically across
blocks without violating the original sequential semantics.
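
The sketch below models this execution discipline in ordinary C with POSIX threads: two sequential processes connected by a channel with block-on-read semantics (and, anticipating the bounded buffering discussed next, block-on-write when full). It is our own construction to model the semantics, not PICO output; in hardware the channel is a handshaking FIFO, not a thread. Whatever the thread scheduling, the consumer observes the same value sequence, which is the determinism property of Kahn networks that the text relies on.

#include <pthread.h>
#include <stdio.h>

#define DEPTH 16
#define COUNT 32

/* A bounded channel with block-on-read (and block-on-write) semantics,
 * modeling one stream link of the process network. */
static int buf[DEPTH];
static int count = 0, head = 0, tail = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t nonfull = PTHREAD_COND_INITIALIZER;

static void chan_write(int v)
{
    pthread_mutex_lock(&lock);
    while (count == DEPTH)                 /* bounded buffering */
        pthread_cond_wait(&nonfull, &lock);
    buf[tail] = v;
    tail = (tail + 1) % DEPTH;
    count++;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

static int chan_read(void)
{
    int v;
    pthread_mutex_lock(&lock);
    while (count == 0)                     /* block on read */
        pthread_cond_wait(&nonempty, &lock);
    v = buf[head];
    head = (head + 1) % DEPTH;
    count--;
    pthread_cond_signal(&nonfull);
    pthread_mutex_unlock(&lock);
    return v;
}

/* Process 1: plays the role of a loop-nest producing a stream. */
static void *producer(void *arg)
{
    int i;
    (void)arg;
    for (i = 0; i < COUNT; i++)
        chan_write(i * i);
    return NULL;
}

/* Process 2: plays the role of the consuming loop-nest. The values it
 * reads are identical under every interleaving of the two threads. */
static void *consumer(void *arg)
{
    int i;
    (void)arg;
    for (i = 0; i < COUNT; i++)
        printf("%d\n", chan_read());
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}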
The original formulation of Kahn process networks captured infinite computa-
tion using unbounded FIFOs on each of the stream links. However, PICO is able
to restrict the size of computation and buffering provided on each link by impos-
ing additional constraints on the execution model. These constraints are described
below:
• Single-task execution: Each process in a PICO-generated process network
executes one complete invocation to completion without restarting. This
corresponds to the single task invocation of the top level C procedure in the
input specification, where each loop-nest in that procedure executes once and
the procedure terminates. In actual hardware execution, multiple tasks may be
overlapped in a pipelined manner depending on resource availability, but this