Nicolescu/Model-Based Design for Embedded Systems 67842_C011 Finals Page 326 2009-10-2
326 Model-Based Design for Embedded Systems
FIGURE 11.1
A heterogeneous SoC template: a 4 × 4 grid of heterogeneous tiles (GPP, DSP,
FPGA, ASIC, and DSRC cores).
• A multicore organization can contribute to the energy efficiency of a
SoC. The best energy savings can be obtained by simply switching
off cores that are not used, which also helps in reducing the static
power consumption. Furthermore, the processing of local data in small
autonomous cores abides by the locality of reference principle. More-
over, a core processor can be adaptive; it does not have to run at
full clock speed to achieve the required QoS at a particular moment
in time.
• When one of the cores is found to be defective (either because of a
manufacturing fault or detected during operation by the built-in
diagnosis), the defective core can be switched off and isolated from
the rest of the design.
• A multicore approach also eases verification of an integrated circuit
design, since the design of identical cores has to be verified only once.
The design of a single core is relatively simple and therefore a lot of
effort can be put in (area/power) optimizations on the physical level
of integrated circuit design.
• The computational power of a multicore architecture scales linearly
with the number of cores. The more cores there are on a chip, the
more computations can be done in parallel (provided that the network
capacity scales with the number of cores and there is sufficient paral-
lelism in the application).
• Although cores operate together in a complex system, an individual
tile operates quite autonomously. In a reconfigurable multicore archi-
tecture, every processing core is configured independently. In fact, a
core is a natural unit of partial reconfiguration. Unused cores can be
configured for a new task, while at the same time other cores continue
performing their tasks. That is to say, a multicore architecture can be
reconfigured partly and dynamically.
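The linear-scaling caveat in the list above (sufficient application parallelism and network capacity) can be made concrete with Amdahl's law; the sketch below is an illustrative aside, not a model from this chapter, and the 90% parallel fraction is an assumed figure.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on multicore speedup when only part of an
    application can be parallelized (Amdahl's law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Even a 90% parallel application falls well short of linear scaling:
print(round(amdahl_speedup(0.9, 16), 2))    # 6.4
print(round(amdahl_speedup(0.9, 1024), 2))  # 9.91, approaching the 1/0.1 = 10 limit
```

This is why the text conditions linear scaling on "sufficient parallelism in the application": the serial fraction, not the core count, eventually dominates.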
Reconfigurable MultiCore Architectures 327
11.1.2.1 Heterogeneous Multicore SoC
The reason for heterogeneity in a SoC is efficiency, because typically some
algorithms run more efficiently on bit-level reconfigurable architectures (e.g.,
PN-code generation), some on DSP-like architectures, and some perform
optimally on word-level reconfigurable platforms (e.g., FIR filters or FFT
algorithms). We distinguish four processor types: “GPP,” “fine-grained
reconfigurable” hardware (e.g., FPGA), “coarse-grained” reconfigurable hardware,
and “dedicated” hardware (e.g., ASIC). The different tile processors (TPs) in
the SoC are interconnected by a NoC. Both SoC and NoC are dynamically
reconfigurable, which means that the programs running on the processing
tiles as well as the communication channels are configured at run-time. The
idea of heterogeneous processing elements (PEs) is that one can match the
granularity of the algorithms with the granularity of the hardware. Appli-
cation designers or high-level compilers can choose the most efficient pro-
cessing core for the type of processing needed for a given application task.
Such an approach combines performance, flexibility, and energy efficiency.
It supports high performance through massive parallelism, it matches the
computational model of the algorithm with the granularity and capabilities
of the processing entity, and it can operate at minimum supply voltage and
clock frequency and hence provides energy efficiency and flexibility at the
right granularity only when and where needed and desirable. A thorough
understanding of the algorithm domain is crucial for the design of
an (energy-)efficient reconfigurable architecture. The architecture should
impose little overhead to execute the algorithms in its domain. Streaming
applications form a rather good match with multicore architectures: the com-
putation kernels can be mapped on cores and the streams to the NoC links.
Interprocessor communication is in essence also overhead, as it does not
contribute to the computation of an algorithm. Therefore, there needs to be
a sound balance between computation and interprocessor communication.
These are again motivations for a holistic approach.
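The computation/communication balance discussed above can be quantified as a computation-to-communication ratio for a kernel mapped on a core; the sketch below is an illustrative first-order model, and all cycle counts in the example are assumptions rather than figures from this chapter.

```python
def comp_comm_ratio(ops_per_block, cycles_per_op, words_moved, cycles_per_word):
    """Cycles spent computing a block of stream data per cycle spent
    moving it over the NoC. A ratio well above 1 means interprocessor
    communication overhead does not dominate the mapping."""
    compute_cycles = ops_per_block * cycles_per_op
    comm_cycles = words_moved * cycles_per_word
    return compute_cycles / comm_cycles

# A kernel doing 10 single-cycle ops per sample on 1024-sample blocks,
# streaming each block in and out once over single-cycle NoC links:
print(comp_comm_ratio(1024 * 10, 1, 2 * 1024, 1))  # 5.0
```

A mapping with a ratio near or below 1 would spend as many cycles on transport as on computation, which is exactly the imbalance the holistic approach is meant to avoid.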
11.1.3 Design Criteria for Streaming Applications
In this section, the key design criteria of multicore architectures for streaming
applications are introduced.
11.1.3.1 Predictable and Composable
To manage the complexity of streaming DSP applications, predictable tech-
niques are needed. For example, the NoC as well as the core processors
should provide latency and throughput guarantees. One reason for
predictability is that the data rates in streaming DSP applications are so
high that even a large buffer would be too small to compensate for
unpredictably behaving components, and the latency such buffers would
introduce is not acceptable in typical streaming DSP applications. A second
reason for using predictable techniques is composability. This means that in
case multiple applications are mapped on the same platform, the behavior
of one application should not influence another application. Furthermore,
in streaming applications, there are often hard deadlines at the beginning
of the chain (e.g., sampling rate of an A/D converter) or at the end of the
chain (e.g., fixed rate of the D/A converter, or update rate of the screen).
In other applications such as phased array applications, individual paths of
signals should be exactly timed before they can be combined. Also in these
applications the data rate is so high (e.g., 100 M samples/s) that buffering of
data is not useful. Unfortunately, future semiconductor technologies intro-
duce more uncertainty. Design techniques will have to include resiliency at
the circuit and microarchitecture level to deal with these uncertainties and
the variability at the device technology level. One of the future challenges is
to design predictable systems with unpredictable components.
11.1.3.2 Energy Efficiency
Energy efficiency is an important design issue in streaming DSP applica-
tions. Because portable devices rely on batteries, the functionality of these
devices is strictly limited by the energy consumption. There is an exponen-
tial increase in demand for streaming communication and processing for
wireless protocol baseband processing and multimedia applications, but the
energy content of batteries is increasing at a pace of 10% per year. Also for
high-performance computing there is a need for energy-efficient architec-
tures to reduce the costs for cooling and packaging. In addition to that, there
are also environmental concerns that urge for more efficient architectures in
particular for systems that run 24 h per day, such as wireless base stations and
search engines (e.g., Google has an estimated server park of 1 million servers
that run 24 h per day).
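The widening gap described above can be illustrated with a compound-growth sketch; the 10% battery figure is from the text, while the assumed yearly doubling of processing demand is only an illustrative stand-in for "exponential increase."

```python
def growth(initial, rate, years):
    """Compound growth: value after `years` at a fixed annual `rate`."""
    return initial * (1.0 + rate) ** years

# Illustrative: demand doubling yearly versus battery energy growing 10%/year.
demand = growth(1.0, 1.0, 5)    # 32.0
battery = growth(1.0, 0.10, 5)  # ~1.61
print(round(demand / battery, 1))  # 19.9
```

After only five years the assumed demand outruns battery capacity by roughly a factor of 20, which is why the energy budget, not raw performance, constrains portable streaming devices.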
Today, most components are fabricated using CMOS technology. The
dominant component of energy consumption (85%–90%) in 130 nm CMOS
technology is dynamic power consumption. However, when technology
scales to lower dimensions, the static power consumption will become more
and more pronounced. A first-order approximation of the dynamic power
consumption of CMOS circuitry is given by the formula (see [13]):
P_d = α · C_eff · f · V²  (11.1)

where
P_d is the power in Watts
C_eff is the effective switch capacitance in Farads
V is the supply voltage in Volts
α is the activity factor
f is the frequency of operations in Hertz
Equation 11.1 suggests that there are basically four ways to reduce power:
reduce the capacitive load C_eff, reduce the supply voltage V, reduce the
switching frequency f, and/or reduce the activity α. In the context of this
chapter, we will mainly address reducing the capacitance.
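Equation 11.1 is easy to check numerically; the sketch below implements the formula directly, and the operating point in the example is an illustrative assumption, not a figure from the chapter.

```python
def dynamic_power(activity, c_eff, freq_hz, v_supply):
    """First-order dynamic CMOS power, Eq. 11.1: P_d = alpha * C_eff * f * V^2."""
    return activity * c_eff * freq_hz * v_supply ** 2

# Illustrative operating point: alpha = 0.15, C_eff = 1 nF, f = 100 MHz, V = 1.2 V.
p = dynamic_power(0.15, 1e-9, 100e6, 1.2)
print(round(p, 4), "W")  # 0.0216 W

# The V^2 term: halving the supply voltage alone cuts dynamic power by 4x.
print(dynamic_power(0.15, 1e-9, 100e6, 0.6) / p)  # 0.25
```

The quadratic dependence on V is what makes voltage scaling the single most effective knob among the four listed above.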
As shown in Equation 11.1, energy consumption in CMOS circuitry is
proportional to capacitance. Therefore energy consumption can be reduced
by minimizing the capacitance. This can be achieved not only at the
technological level; much can also be gained by an architecture that
exploits locality of reference. Connections to external components typically
have much higher capacitance than connections to on-chip resources. There-
fore, to save energy, the designer should use few off-chip wires, and have
them toggle as infrequently as possible. Consequently, it is beneficial to use
on-chip memories such as caches, scratchpads, and registers.
References to memory typically display a high degree of temporal and
spatial locality of reference. Temporal locality of reference refers to the obser-
vation that referenced data is often referenced again in the near future.
Spatial locality of reference refers to the observation that once a particular
location is referenced, a nearby location is often referenced in the near future.
Accessing a small and local memory is much more energy efficient than
accessing a large distant memory. Transporting a signal over a 1 mm wire in
a 45 nm technology requires more than 50 times the energy of a 32-bit oper-
ation in the same technology (the off-chip interconnect consumes more than
1000 times the energy of an on-chip 32-bit operation). A multicore
architecture intrinsically encourages the use of small and local on-core
memories. Extensively exploiting the locality of reference principle
substantially improves energy efficiency. Because of the locality of reference
principle, communications within a core are more frequent than those between cores.
11.1.3.3 Programmability
Design automation tools form the bridge between processing hardware and
application software. Design tools are the most important requirement for
the viability of multicore platform chips. Such tools reduce the design cycle
(i.e., cost and time-to-market) of new applications. The application program-
mer should be provided with a set of tools that on the one hand hides the
architecture details but on the other hand gives an efficient mapping of the
applications onto the target architecture. High-level language compilers for
(DSP) domain-specific architectures are far more complex than compilers for
general-purpose superscalar architectures because of the data dependency
analysis, instruction scheduling, and allocation. Besides tooling for application
development, tooling for functional verification and debugging is required
for programming multicore architectures. In general, such tooling comprises
• General HDL simulation software that provides full insight into the
hardware state, but is extremely slow and not suited for software engi-
neers
• Dedicated simulation software that provides reasonable insight into
the hardware state, performs better than general hardware simulation
software, and can be used by software engineers
• Hardware prototyping boards that achieve great simulation speeds,
but provide poor insight into the hardware state and are not suited
for software engineers
By employing the tiled SoC approach, as proposed in Figure 11.1, various
kinds of parallelism can be exploited. Depending on the core architecture, one
or more levels of parallelism are supported.
• Thread-level parallelism is explicitly addressed by the multicore
approach as different tiles can run different threads.
• Data-level parallelism is achieved by processing cores that employ par-
allelism in the data path.
• Instruction-level parallelism is addressed by processing cores when mul-
tiple data path instructions can be executed concurrently.
11.1.3.4 Dependability
With every new generation of CMOS technology (i.e., 65 nm and beyond)
the yield and reliability of manufactured chips deteriorate. To effectively
deal with the increased defect density, efficient methods for fault detec-
tion, localization, and fault recovery are needed. Besides yield improvement,
such techniques also improve the long-term reliability and dependability of
silicon-implemented embedded systems. In the ITRS 2003 roadmap (see [1]),
it is indicated that “Potential solutions are adaptive and self-correcting, self-
repairing circuits and the use of on-chip reconfigurability.” Modern static
and dynamic fault detection and localization techniques and design-for-test
(DFT) techniques are needed for advanced multicore designs. Yield and reli-
ability can be improved by (dynamically) circumventing the faulty hardware
in deep-submicron chips. The latter requires run-time systems software. This
software detects defective cores and network elements and deactivates these
resources at run-time. The tests are performed while the chip is already in the
field. These self-diagnosis and self-repair hardware and software resources
need to be on chip.
11.2 Classification
Different hardware architectures are available in the embedded systems
domain to perform DSP functions and algorithms: “GPP, DSP,
(re-)configurable hardware, and application-specific hardware.” The
application-specific hardware is designed for a dedicated function and is usually referred
to as ASIC. The ASIC is, as its name suggests, an application-specific proces-
sor that has been implemented in an IC.
These hardware architectures have different characteristics in relation to
“performance, flexibility” or “programmability,” and “energy efficiency.”
Figure 11.2 depicts the trade-off in flexibility and performance for differ-
ent hardware architectures. Generally, more flexibility implies a less energy-
efficient solution.
FIGURE 11.2
Flexibility versus performance trade-off for different hardware architectures
(in order of decreasing flexibility and increasing performance: GPP, DSP,
fine-grained reconfigurable hardware, coarse-grained reconfigurable hardware,
and ASIC).
Crucial for the fast and efficient realization of a multiprocessor system-
on-chip (MP-SoC) is the use of predesigned modules, the so-called build-
ing blocks. In this section, we will first classify these building blocks, and
then classify the MP-SoCs that can be designed using these building blocks
together with the interconnection structures between these blocks.
A basic classification of MP-SoC building blocks is given in Figure 11.3.
The basic PEs of an MP-SoC are run-time reconfigurable cores and fixed
cores. The functionality of a run-time reconfigurable core is fixed for a rela-
tively long period in relation to the clock frequency of the cores. Fine-grained
reconfigurable cores are reconfigurable at bit level while coarse-grained
reconfigurable cores are reconfigurable at word level (8 bit, 16 bit, etc.).
Two other essential building blocks are memory and I/O blocks. Designs of
MP-SoCs can be reused to build larger MP-SoCs, increasing the designer's
productivity.
A classification of MP-SoCs is given in Figure 11.4. An MP-SoC basi-
cally consists of multiple building blocks connected by means of an intercon-
nect. If an MP-SoC consists of multiple building blocks of a single type, the
MP-SoC is referred to as “homogeneous.” The homogeneous MP-SoC
architectures can be subdivided into single instruction multiple data
(SIMD), multiple instruction multiple data (MIMD), and array architectures.
FIGURE 11.3
Classification of MP-SoC building blocks for streaming applications: fixed
cores (general purpose, e.g., ARM, SPARC; design-time reconfigurable cores,
e.g., Silicon Hive, Tensilica; ASIC), run-time reconfigurable cores (fine
grain, e.g., FPGA; coarse grain, e.g., MONTIUM), memory, and I/O.
FIGURE 11.4
Classification of MP-SoC architectures and interconnect structures for
streaming applications: homogeneous (SIMD, MIMD, array) and heterogeneous
architectures; bus, network-on-chip (packet switched, circuit switched), and
dedicated interconnects.
Examples of these architectures will be given below. If multiple types of
building blocks are used, the MP-SoC is called “heterogeneous.”
To interconnect the different building blocks, three basic classes can be
identified: bus, NoC, and dedicated interconnects. A bus is shared between
different processing cores and is a notorious cause of unpredictability.
Unpredictability can be circumvented by an NoC [9]. Two types can be iden-
tified: packet-switched and circuit-switched. Besides the use of these more
or less standardized communication structures, dedicated interconnects are
still widely used. Some examples of different MP-SoC architectures are pre-
sented in Table 11.1.
TABLE 11.1
Examples of Different MP-SoC Architectures

Class                        Example
Homogeneous    SIMD          Linedancer (see Section 11.3.2)
                             Geforce G80 [3]
                             Xetal [19]
               MIMD          Tilera (see Section 11.3.4)
                             Cell [21]
                             Intel Tflop processor [25]
               Array         PACT (see Section 11.3.3)
                             ADDRESS [2]
Heterogeneous                ANNABELLE (see Section 11.3.1)
                             Silicon Hive [12]
11.3 Sample Architectures
11.3.1 MONTIUM/ANNABELLE System-on-Chip
11.3.1.1 MONTIUM Reconfigurable Processing Core
The MONTIUM is an example of a coarse-grained reconfigurable processing
core and targets the 16-bit DSP algorithm domain. The MONTIUM architecture
originates from research at the University of Twente [18,22]. The MONTIUM
processing core has been further developed by Recore Systems [23]. A single
MONTIUM processing tile is depicted in Figure 11.5. At first glance the
MONTIUM architecture bears a resemblance to a very long instruction word
(VLIW) processor. However, the control structure of the MONTIUM is very
different. The lower part of Figure 11.5 shows the communication and
configuration unit (CCU) and the upper part shows the coarse-grained
reconfigurable MONTIUM TP.
11.3.1.1.1 Communication and Configuration Unit
The CCU implements the network interface controller between the NoC and
the MONTIUM TP. The definition of the network interface depends on the
NoC technology that is used in the SoC in which the MONTIUM processing
tile is integrated [11]. The CCU enables the MONTIUM TP to run in “streaming”
as well as in “block” mode. In “streaming” mode the CCU and the
MONTIUM TP run in parallel; hence, communication and computation overlap
in time. In “block” mode, the CCU first reads a block of data, then starts
the MONTIUM TP, and finally, after completion of the MONTIUM TP, the CCU
sends the results to the next processing unit in the SoC (e.g., another MONTIUM
processing tile or external memory). Hence, communication and computation
are sequenced in time.

FIGURE 11.5
The MONTIUM coarse-grained reconfigurable processing tile: five ALUs
(ALU1 through ALU5) with input registers A–D, ten local memories (M01
through M10), an interconnect, and memory/interconnect/register/ALU
decoders driven by a sequencer, on top of the communication and
configuration unit.
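The difference between the two CCU modes can be captured in a small timing model: in “block” mode the communication and computation phases are sequenced, while in “streaming” mode they overlap and throughput is set by the slowest stage. The per-block cycle counts below are illustrative assumptions, not MONTIUM measurements.

```python
def block_mode_time(t_comm_in, t_compute, t_comm_out):
    """Block mode: CCU reads a block, the TP computes, the CCU writes
    results; the three phases run strictly in sequence."""
    return t_comm_in + t_compute + t_comm_out

def streaming_mode_time(t_comm_in, t_compute, t_comm_out):
    """Streaming mode: CCU and TP run in parallel, so the steady-state
    time per block is bounded by the slowest stage."""
    return max(t_comm_in, t_compute, t_comm_out)

# Illustrative per-block cycle counts:
print(block_mode_time(200, 500, 200))      # 900
print(streaming_mode_time(200, 500, 200))  # 500
```

Under these assumed numbers, overlapping communication with computation hides the transfer time entirely, which is why streaming mode is the natural fit for streaming DSP applications.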
11.3.1.1.2 MONTIUM Tile Processor
The TP is the computing part of the MONTIUM processing tile. The MONTIUM TP
can be configured to implement a particular DSP algorithm. DSP algorithms
that have been implemented on the MONTIUM are, for instance, all power-of-2
FFTs up to 2048 points, non-power-of-2 FFTs up to 1920 points, FIR filters, IIR
filters, matrix–vector multiplication, DCT decoding, Viterbi decoders, and
Turbo (SISO) decoders. Figure 11.5 reveals that the hardware organization
of the MONTIUM TP is very regular. The five identical arithmetic logic units
(ALU1 through ALU5) in a tile can exploit data-level parallelism to enhance
performance. This type of parallelism demands a very high memory bandwidth,
which is obtained by having 10 local memories (M01 through M10) in
parallel. The small local memories are also motivated by the locality of
reference principle. The data path has a width of 16 bit and the ALUs support
both signed integer and signed fixed-point arithmetic. The ALU input
registers provide an even more local level of storage. Locality of reference is
one of the guiding principles applied to obtain energy efficiency in the
MONTIUM TP.
A vertical segment that contains one ALU together with its associated input
register files, a part of the interconnect, and two local memories is called
a processing part (PP). The five PPs together are called the processing part
array (PPA).
A relatively simple sequencer controls the entire PPA. The sequencer
selects configurable PPA instructions that are stored in the decoder blocks
of Figure 11.5. For (energy) efficiency it is imperative to minimize the
control overhead. The PPA instructions, which comprise ALU, AGU, memory,
register file, and interconnect instructions, are determined by a DSP
application designer at design time. All MONTIUM TP instructions are
scheduled at design time and arranged into a MONTIUM sequencer program.
By statically scheduling the instructions as much as possible at compile time,
the MONTIUM sequencer does not require any sophisticated control logic,
which minimizes the control overhead of the reconfigurable architecture.
The MONTIUM TP has no fixed instruction set; instead, the instructions are
configured at configuration time. During configuration of the MONTIUM TP,
the CCU writes the configuration data (i.e., the instructions of the ALUs,
memories, and interconnects, as well as the sequencer and decoder
instructions) into the configuration memory of the MONTIUM TP. The size of
the total configuration memory of the MONTIUM TP is about 2.6 kB. However,
configuration sizes of DSP algorithms mapped on the MONTIUM TP are
typically in the order of 1 kB. For example, a 64-point fast Fourier transform
(FFT) has a configuration size of 946 bytes. By sending a configuration file
containing configuration RAM addresses and data values to the CCU, the
MONTIUM TP can be configured via the NoC interface. The configuration
memory of the MONTIUM TP is implemented as a 16-bit wide SRAM memory
that can be written by the CCU. By updating only certain locations of the
configuration memory, the MONTIUM TP can be partially reconfigured. In the
considered MONTIUM TP implementation, each local SRAM is 16-bit wide and
has a depth of 1024 addresses, which results in a storage capacity of 2 kB
per local memory. The total data memory inside the MONTIUM TP adds up
to a size of 20 kB. A reconfigurable address generation unit (AGU) is
integrated into each local memory in the PPA of the MONTIUM TP. It is also
possible to use a local memory as a look-up table (LUT) for complicated
functions that cannot be calculated using an ALU, such as sine or division
(with one constant). The memory can be used in either integer or fixed-point
LUT mode.
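The memory sizes quoted above are easy to cross-check; the sketch below only restates arithmetic on figures from the text.

```python
# MONTIUM TP local data memories: 10 memories, each 1024 addresses x 16 bit.
memories = 10
depth = 1024
width_bits = 16

per_memory_bytes = depth * width_bits // 8   # 2048 bytes = 2 kB per memory
total_bytes = memories * per_memory_bytes    # 20480 bytes = 20 kB in total
print(per_memory_bytes, total_bytes)  # 2048 20480

# The 946-byte 64-point FFT configuration uses roughly a third of the
# ~2.6 kB configuration memory.
print(round(946 / 2600, 2))  # 0.36
```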
11.3.1.2 Design Methodology
Development tools are essential for quick implementation of applications in
reconfigurable architectures. The MONTIUM development tools start with a
high-level description of an application (in C/C++ or MATLAB®) and
translate this description to a MONTIUM TP configuration [16]. Applications can
be implemented on the MONTIUM TP using an embedded C language, called