High Level Synthesis: from Algorithm to Digital Circuit – P28

260 M.C. Molina et al.
Table 14.1 Schedule and binding based on operation fragmentations (fragments scheduled over cycles 1–3)

Operation   Fragmentation                        FU
E = A×B     –                                    ⊗4×4
F = C×D     –                                    ⊗4×4
G = E+F     –                                    ⊕8
I = H×G     I1 = H × G[3..0]                     ⊗8×4
            I2 = H[3..0] × G[7..4]               ⊗4×4
            I3 = H[7..4] × G[7..4]               ⊗4×4
            I4 = I1[11..4] + I2                  ⊕8
            I5 = ('000' & I4[8..4]) + I3         ⊕8
L = J+K     –                                    ⊕8
N = L×M     N1 = L × M[11..8]                    ⊗8×4
            N2 = L × M[3..0]                     ⊗8×4
            N3 = L[3..0] × M[7..4]               ⊗4×4
            N4 = L[7..4] × M[7..4]               ⊗4×4
            N5 = N3 + N2[11..4]                  ⊕8
            N6 = N4 + ('000' & N5[8..4])         ⊕8
            N7 = N1 + ('000' & N6)               ⊕12
R = P+Q     R1 = P[11..0] + Q[11..0]             ⊕12
            R2 = P[23..12] + Q[23..12]           ⊕12
Fig. 14.2 Execution of operation N = L ×M in cycle 3
14 Exploiting Bit-Level Design Techniques in Behavioural Synthesis 261

Table 14.2 Main features of the synthesized implementations

               Conventional           Proposed                   Saved (%)
Datapath FUs   ⊗12×8, ⊗4×4, ⊕24       ⊗8×4, 2 ⊗4×4, ⊕12, 2 ⊕8    –
FUs' area      1,401 inverters        911 inverters              35
Circuit area   3,324 inverters        2,298 inverters            30.86
Cycle length   23.1 ns                21.43 ns                   7.22
In the second implementation of this example, every datapath FU is used once per cycle to execute one operation of its own width, so the HW waste present in the conventional implementation has been completely removed. The circuit area (including FUs, registers, multiplexers, and controller) and the cycle length have been reduced by around 31% and 7%, respectively, as summarized in Table 14.2.
14.2 Design Techniques to Reduce the HW Waste
The first step towards reducing the HW waste present in conventional implementations is the extraction of the common operative kernel of the operations in the behavioural specification. This implies the fragmentation of compatible operations (those that can be executed on FUs of the same type) into their common operative kernel plus some glue logic. These transformations can be applied not only to trivial cases, like the compatibility between additions and subtractions, but also to more complex ones, like the compatibility between additions and multiplications. The extraction of the common operative kernel increases the number of operations that can be executed on the same HW resources and, in consequence, helps increase FU reuse.
In addition to the extraction of the common operative kernels, several other
advanced design techniques can be applied during the scheduling or allocation
phases to further reduce the HW waste. These new methods handle specification
operations at the bit level, and produce datapaths where the number, type, and width
of the HW resources do not depend on the number, type, and width of the specifica-
tion operations and variables. This independence from the description style used in
the specification brings the applicability of HLS closer to inexperienced designers,
and offers expert ones more freedom to specify behaviours.
14.2.1 Kernel Extraction of Compatible Operations

Compatible operations can be executed on the same FUs in order to improve HW reuse, provided that their common operative kernel has been extracted in advance. This kernel extraction can be applied to the behavioural specification as a preprocessing phase prior to the synthesis process. In this way, additive operations such as subtractions, comparisons, maximums, minimums, absolute values, data format converters, or data limiters can be replaced by additions and logic operations.
More complex operations, like multiplication, division, MAC, factorial, or power operations, can be decomposed into several multiplications, additions, and some glue logic.
In order to increase the number of operations that may share one FU, it is also desirable to unify the different representation formats used in the specification. Signed operations can be transformed into several unsigned ones: e.g. a two's complement signed multiplication of m × n bits (with m ≥ n) is transformed, using a variant of the Baugh and Wooley algorithm [1], into one multiplication of (m−1) × (n−1) bits and two additions of m and n+1 bits. Some of the transformations that can be applied to homogenize operation types and data representation formats have been described by Molina et al. [2].
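As a minimal sketch of kernel extraction (hypothetical helper names; an 8-bit width chosen purely for illustration), a subtraction can be reduced to the common adder kernel plus glue logic, here inverters on one operand and a carry-in of one:

```python
MASK = 0xFF  # 8-bit datapath

def adder8(a, b, cin=0):
    """Generic 8-bit adder kernel: returns (sum, carry-out)."""
    total = (a & MASK) + (b & MASK) + cin
    return total & MASK, total >> 8

def sub8(a, b):
    """Subtraction expressed over the adder kernel plus glue logic:
    a - b = a + ~b + 1 (two's complement)."""
    s, _ = adder8(a, (~b) & MASK, cin=1)
    return s
```

Because both operations now share the same adder kernel, a scheduler may bind an addition and a subtraction from the specification to the very same FU in different cycles.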
14.2.2 Scheduling Techniques
The HW waste present in conventional implementations can be reduced in the scheduling phase by balancing the computational cost of the operations executed per cycle (the number of bits of every different operation type) instead of the number of operations. In order to obtain homogeneous distributions, some specification operations must be transformed into a set of new ones, whose types and widths may differ from those of the original.
Operation fragments inherit the mobility of the original operation and are scheduled separately, in order to balance the computational cost of the operations executed per cycle. Therefore, one original operation may be executed across a set of cycles, not necessarily consecutive ones (saving the partial results and carry information calculated in every cycle). This implies that every result bit is available to be used as an input operand in the cycle it is calculated, even if the operation finishes in a later cycle. Also, each fragment can begin its execution once its input operands are available, even if its predecessors finish in a later cycle. It can also occur that the first operation fragment executed is not the one that uses the LSB of the input operands. This feature does not imply, however, that the MSB of the result can be calculated before its LSB in the case of operations with a rippling effect.
The set of different fragmentations to be applied in every case mainly depends on the type and width of the operation. In the following sections a detailed analysis of the different ways to fragment additions and multiplications is presented. Other types of operations could also be fragmented in a similar way. However, their description has been omitted, as most of them can be reduced to multiplications, additions, and logic operations, and the remaining ones are less common in behavioural specifications.
14.2.2.1 Fragmentation of Additions
In order to improve the quality of circuits, some specification additions can be fragmented into several smaller additions and executed across several, not necessarily consecutive, cycles. In this case the new data dependences among operation fragments (the propagation of carry signals) oblige us to calculate the LSB of the operation in the earliest cycle and the MSB in the latest one. The application of this design technique provides many advantages in comparison with previous ones, such as chaining, bit-level chaining, multicycle, or non-integer multicycle. Some of them are summarized below:
(a) Area reduction of the required FUs. The execution of the new operation fragments can be performed using a single adder. The adder width must equal the width of the biggest addition fragment, which is always narrower than the original operation. In general, the routing resources needed to share the adder among all the addition fragments do not outweigh the benefits achieved by FU reuse. Leaving aside the remaining additions present in the specification, the best area reduction is obtained when all the fragments have similar widths, due to the minimization of the HW waste, which also applies to the routing resources.
In the particular case where the addition fragments are scheduled in consecutive cycles, this design technique may seem similar to the multicycle technique. However, the main difference resides in the required HW: the width of the multicycle FU must match the operation width, whereas in our case it must only equal the widest fragment.
(b) Area reduction of the required storage units. Just as not all the input operand bits of a fragmented addition are required in the first of its execution cycles, the operand bits already added are no longer necessary in the later ones. That is, every input operand bit is needed in just one cycle, and it has to be stored only until that cycle. The storage requirements are considerably smaller than when the entire operation is executed in a later cycle, because that implies saving the complete operands during several cycles. In the case of multicycle operators, the input operands are needed during all the cycles that the execution lasts.
(c) Cycle length reduction. The reduction of the addition widths, a consequence of the operation fragmentations, directly involves a reduction of their delays and, in many cases, also results in considerable cycle length reductions. However, this reduction is not achieved in a straightforward manner, because it also depends on both the schedule of the remaining specification operations and the set of operations chained in every cycle.
(d) Chaining of uncompleted operations. Every result fragment calculated can be used as an input operand in the cycle it has been computed in. This allows a successor operation to begin in the same cycle. In this way, the calculation of addition fragments is less restrictive than the calculation of complete operations, and therefore extends the design space explored by synthesis algorithms.
Figure 14.3 graphically shows the schedule of one n-bit addition in two inconsecutive cycles. The m LSB are calculated in cycle i, and the remaining bits in cycle i+j. The input operand bits computed in cycle i are not required in the following ones, and thus the only saved operands are A[n−1..m] and B[n−1..m]. A successor operation requiring (A+B)[m−1..0] can begin as soon as cycle i. Although this example corresponds to the most basic fragmentation of one addition into two smaller ones, many others are also possible, including those that produce a bigger number of new operations.
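This basic fragmentation can be sketched as follows (hypothetical function names; n = 8 and m = 4 chosen for illustration). An n-bit addition is split into an m-bit LSB fragment executed in cycle i and an (n−m)-bit MSB fragment executed in a later cycle, linked only by the saved carry:

```python
def add_fragmented(a, b, n=8, m=4):
    """Execute an n-bit addition as two smaller additions.

    Cycle i:   add the m LSB; save the partial sum and the carry.
    Cycle i+j: add the remaining n-m bits plus the saved carry.
    """
    lo_mask = (1 << m) - 1
    # Cycle i: m-bit adder; its result bits are already usable by successors.
    lo = (a & lo_mask) + (b & lo_mask)
    lo_sum, carry = lo & lo_mask, lo >> m
    # Cycle i+j: (n-m)-bit adder consumes only the upper operand bits.
    hi = (a >> m) + (b >> m) + carry
    return ((hi << m) | lo_sum) & ((1 << n) - 1)
```

Note that between the two cycles only the carry bit and the upper operand bits need to be stored, which is the storage saving described in point (b) above.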
Fig. 14.3 Execution of one addition in two inconsecutive cycles
14.2.2.2 Fragmentation of Multiplications
Multiplications can also be fragmented into smaller operations to be executed in several, not necessarily consecutive, cycles, with the same aim of reducing the circuit area. In this case, two important differences from the addition fragmentation arise:
– The fragmentation of one multiplication produces not only smaller multiplications but also additions, needed to compose the final result from the multiplication fragments.
– The new data dependences among the operation fragments also oblige us to calculate the LSB before the MSB. However, in this case this need does not imply that the LSB must be calculated in the first cycle. In fact, the multiplication fragments can be computed in any order, as there are no data dependences among them. Only the addition fragments that produce the result bits of the original multiplication must keep their order, to propagate the carry signals adequately.
The advantages of this design technique are quite similar to the ones mentioned in
the above section dealing with additions. For this reason just the differences between
them are cited in the following paragraphs.
(a) Area reduction in the required FUs. The execution of the new operation fragments can be performed using one multiplier and one adder, whose widths must equal the widths of the biggest multiplication and addition fragments, respectively. The total area of these two FUs, plus the routing resources required to share them across the execution cycles, is usually considerably smaller than the area of one multiplier of the original multiplication width.
(b) Area reduction in the required storage units. Although several multiplication fragments may require the same input operands, these are not needed during every cycle of the execution. Every input operand bit may be used in several cycles, but it only needs to be stored until the last cycle in which it is used as an input operand. The storage requirements are reduced compared to the execution of the entire operation in a later cycle, or to the use of a multicycle multiplier, where the operand bits must be kept stable during all the cycles needed to complete the operation.
Fig. 14.4 Executions of one multiplication in three inconsecutive cycles
Figure 14.4 illustrates two different schedules of one m×n-bit multiplication in three inconsecutive cycles. The LSB are calculated in cycle i+j, and the remaining ones in cycle i+j+k. Note that in both cases the multiplication fragment scheduled in cycle i does not produce any result bit. However, if the execution cycles of the fragments assigned to cycles i and i+j were exchanged, the schedules would produce result bits in cycles i and i+j+k. In both schedules some of the input operand bits computed in cycle i are not required in cycle i+j, and none of them are required in cycle i+j+k. Although these examples correspond to basic fragmentations of one multiplication into two smaller ones and one addition, other more complex ones (that produce more fragments) are also possible.
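A sketch of such a fragmentation, following the decomposition of N = L×M in Table 14.1 (hypothetical function name; an 8-bit × 12-bit product built from 8×4 and 4×4 multiplication fragments plus 8- and 12-bit addition fragments):

```python
def mul_8x12_fragmented(L, M):
    """Compute the 8-bit x 12-bit product L*M from 8x4 and 4x4
    multiplication fragments plus addition fragments, mirroring
    the decomposition of N = L*M in Table 14.1."""
    N1 = L * ((M >> 8) & 0xF)          # 8x4 multiplier: L x M[11..8]
    N2 = L * (M & 0xF)                 # 8x4 multiplier: L x M[3..0]
    N3 = (L & 0xF) * ((M >> 4) & 0xF)  # 4x4 multiplier: L[3..0] x M[7..4]
    N4 = (L >> 4) * ((M >> 4) & 0xF)   # 4x4 multiplier: L[7..4] x M[7..4]
    N5 = N3 + (N2 >> 4)                # 8-bit adder: N3 + N2[11..4]
    N6 = N4 + (N5 >> 4)                # 8-bit adder: N4 + ('000' & N5[8..4])
    N7 = N1 + N6                       # 12-bit adder: N1 + ('000' & N6)
    # Compose the final result from the fragment outputs: the low 4 bits
    # of N2 and N5 are already final result bits of the product.
    return (N7 << 8) | ((N5 & 0xF) << 4) | (N2 & 0xF)
```

The multiplication fragments N1 to N4 have no data dependences among themselves and can be scheduled in any order; only N5, N6, and N7 must respect their order to propagate the carries.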
14.2.3 Allocation and Binding Techniques
The allocation techniques to reduce the HW waste focus on increasing the bit-level reuse of FUs. The maximum reuse of functional resources occurs when every datapath FU executes an operation of its same type and width in every cycle (no datapath resource is wasted). If the allocation phase takes place after scheduling, then the maximum degree of FU reuse achievable depends on the given schedule. Note that the maximum bit-level reuse of FUs can only be reached once a totally homogeneous distribution of operation bits among cycles is provided.
In order to get the maximum FU reuse for a given schedule, the specification operations must be fragmented to obtain the same number of operations of equal type and width in every cycle. These fragmentations include the transformation of operations into several simpler ones whose types, representations, and widths may differ from those of the original. These transformations imply that one original operation will be executed distributed over several linked FUs. Its operands will also be transmitted and stored distributed over several routing and storage resources, respectively. The type of fragmentation performed is different for every operation type.
Fig. 14.5 Execution of one addition in two adders, and two additions in one adder
14.2.3.1 Fragmentation of Additions
The minimum HW waste in datapath adders occurs when all of them execute additions of their same width in at least one cycle. For every different schedule there may exist several implementations with the minimum HW waste, and thus with similar adder area. The main difference resides in the fragmentation degree of the implementations. These circuits may present some of the following features:
(a) Execution of one addition over several adders. The set of adders used to execute every addition must be linked to propagate the carry signals.
(b) Execution of one addition over a wider adder. This requires the extension of the input operands and the discarding of some result bits. This technique is used to avoid excessive fragmentation in cycles with smaller computational cost.
(c) Execution of several additions in the same adder. Input operands of different operations must be separated with zeros to avoid carry propagation, and the corresponding bits of the adder result discarded. At least one zero must be used to separate every two operations, such that the minimum adder width equals the sum of the addition widths plus the number of operations minus one.
Figure 14.5 shows the execution of one addition over two adders, as well as the execution of two additions over the same adder. These trivial examples give an insight into how these two design techniques can be applied to datapaths with more adders and operations.
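Technique (c) can be sketched as follows (hypothetical function name; w = 4 chosen for illustration). Two w-bit additions share one wider adder, with a single zero bit separating the packed operands so that the carry of the low addition is absorbed instead of corrupting the high one:

```python
def two_adds_in_one_adder(a, b, c, d, w=4):
    """Execute two w-bit additions, a+b and c+d, in one (2w+1)-bit adder.
    One zero bit separates the two operand pairs, absorbing the carry
    of the low addition."""
    x = (a << (w + 1)) | c            # packed operand: 'a 0 c'
    y = (b << (w + 1)) | d            # packed operand: 'b 0 d'
    s = x + y                         # single wide addition
    low = s & ((1 << (w + 1)) - 1)    # c + d (w+1 result bits)
    high = s >> (w + 1)               # a + b (w+1 result bits)
    return high, low
```

The adder is 2w+1 bits wide, matching the rule stated in point (c): the sum of the addition widths plus the number of operations minus one.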
14.2.3.2 Fragmentation of Multiplications
As with additions, the minimum HW waste in datapath multipliers occurs when all of them calculate multiplications of their same width in at least one clock cycle. Some of the desirable features of an implementation with minimum HW waste are:
(a) Execution of one multiplication over several multipliers and adders. The set of multipliers used to execute every multiplication must be connected to adders in some way (either chained or via intermediate registers) in order to sum the partial results and obtain the final one.
(b) Execution of one multiplication over a wider multiplier. This technique is used to avoid the excessive fragmentation of multiplications in cycles with smaller computational cost.
(c) Execution of several multiplications in the same multiplier. The input operands of different operations must be separated with several zeros to avoid interferences in the calculation matrix of the different multiplications. As the amount of zeros needed equals the operand widths, this technique seems appropriate only when the computational costs of the multiplications scheduled in every cycle are quite different.

Fig. 14.6 Execution of one multiplication over two multipliers and one adder, and two multiplications in one multiplier
Figure 14.6 shows the execution of one multiplication over two multipliers and one adder, as well as the execution of two multiplications over the same multiplier. Note here that the unification of operations to be executed in the same FU produces an excessive internal waste of the functional resource, which makes this technique unviable in most designs. By contrast, the fragmentation does not produce any additional HW waste, and can be used to reduce the area of circuits with some HW waste in all the cycles.
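Technique (c) for multipliers can be sketched in the same style (hypothetical function name; w = 4 for illustration). Enough zeros are inserted between the packed operands so that the cross partial products cannot reach the bit positions holding either result, which also shows why the internal waste of this technique is so large:

```python
def two_muls_in_one_multiplier(a, b, c, d, w=4):
    """Execute two w-bit multiplications, a*b and c*d, in one wide
    multiplier.  A separation of w+1 zero bits keeps the cross
    products away from the bits holding a*b and c*d."""
    s = 2 * w + 1                     # separation between packed operands
    x = (a << s) | c                  # packed operand: 'a 00..0 c'
    y = (b << s) | d                  # packed operand: 'b 00..0 d'
    p = x * y                         # single wide multiplication
    high = p >> (2 * s)               # a * b
    low = p & ((1 << s) - 1)          # c * d
    return high, low
```

The multiplier must be 2(2w+1) bits wide to compute two w×w products, so most of its partial-product matrix computes discarded cross terms; this is the excessive internal waste noted above.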
14.3 Applications to Scheduling Algorithms

The scheduling techniques proposed to reduce the HW waste can be easily implemented in most HLS algorithms. This requires the common kernel extraction of specification operations and their successive transformation into several simpler ones, in order to obtain balanced distributions of the number of bits computed per cycle.
This section presents a variant of the classical force-directed scheduling algorithm proposed by Paulin and Knight [4] that includes some bit-level design techniques to reduce the HW waste [3]. The intent of the original method is to minimize the HW cost subject to a given time constraint by balancing the number of operations executed per cycle. For every different operation type, the algorithm successively selects, among all operations and all execution cycles, an (operation, cycle) pair according to an estimate of the circuit cost called force.
By contrast, the intent of the proposed variant is to minimize the HW cost by balancing the number of bits of every different operation type calculated per cycle. This method successively selects, among all operations (multiplications first and additions afterwards) and all execution cycles, a pair formed by an operation (or an operation fragment) and an execution cycle, according to a new force definition that takes the widths of operations into account. It also uses a new parameter, called bound, to decide whether an operation should be scheduled complete, or fragmented with just one fragment scheduled in the selected cycle.
14.3.1 Force Calculation
Our redefinition of the force measure used in the classical force-directed algorithm does not consider all operations equally. Instead, it gives a different weight to every operation as a function of its computational cost, leading to a more uniform distribution of the computational costs of operations among cycles.
The proposed algorithm calculates the forces using a set of new distribution graphs (DGs) that represent the distribution of the computational cost of operations.
For each DG, the distribution in clock cycle c is given by:

DG(τ,c) = Σ_{op ∈ EOPτ(c)} COST(op) · P(op,c)

where
• COST(op): computational cost of operation op. For additions it is defined as the width of the widest operand, and for multiplications as the product of the widths of its operands:

COST(op) = max(width(x), width(y))  if op ≡ x + y
COST(op) = width(x) · width(y)      if op ≡ x × y

• P(op,c): probability of scheduling operation op in cycle c, defined as in the classical method.
• EOPτ(c) (estimated operations of type τ in cycle c): set of operations of type τ whose mobility makes their scheduling in cycle c possible.
Except for the DG redefinition, the force associated with an operation-to-cycle binding is calculated as in the classical method. The smaller or more negative the force expression is, the more desirable an operation-to-cycle binding becomes. In each iteration, the algorithm selects the operation and cycle with the lowest force, thus scheduling first the operations with less mobility in the cycles with the smallest sum of computational costs, i.e. with the smallest DG(τ,c).
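The redefined distribution graph can be sketched as follows (hypothetical data structures; a uniform scheduling probability over each operation's mobility interval is assumed, as in the classical method):

```python
def cost(op):
    """COST(op): widest operand for '+', product of widths for '*'."""
    return max(op["wx"], op["wy"]) if op["kind"] == "+" else op["wx"] * op["wy"]

def dg(ops, kind, c):
    """DG(kind, c): expected number of bits of type `kind` computed in
    cycle c.  Each op carries a mobility interval [asap, alap]; P(op, c)
    is taken uniform over that interval."""
    total = 0.0
    for op in ops:
        if op["kind"] != kind or not (op["asap"] <= c <= op["alap"]):
            continue                        # op cannot be scheduled in cycle c
        p = 1.0 / (op["alap"] - op["asap"] + 1)
        total += cost(op) * p
    return total

ops = [
    {"kind": "*", "wx": 8, "wy": 4, "asap": 1, "alap": 2},
    {"kind": "*", "wx": 4, "wy": 4, "asap": 2, "alap": 2},
    {"kind": "+", "wx": 8, "wy": 8, "asap": 1, "alap": 3},
]
```

With this data, dg(ops, "*", 2) weighs the 8×4 multiplication at half its 32-bit cost (mobility over two cycles) plus the full 16-bit cost of the fixed 4×4 multiplication.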
14.3.2 Bound Definition
Our algorithm binds operations (or fragments) to cycles subject to an upper bound that represents the most uniform distribution of operation bits of a certain type among cycles reachable at every moment. It is used to decide whether an operation should be fragmented, and how. Its initial value corresponds to the most uniform distribution of operation bits of type τ among cycles, ignoring data dependences:

bound = ( Σ_{op ∈ OPτ} COST(op) ) / λ

where
• OPτ: set of specification operations of type τ.
• λ: circuit latency.

The perfect distribution defined by the bound is not always reachable, due to data dependences among operations and the given time constraints. Therefore it is updated during scheduling in the way explained below.
If the number of bits already scheduled in the selected cycle c, plus the computational cost of the selected operation op, is equal to or lower than the bound, then the operation is scheduled there:

CCS(τ,c) + COST(op) ≤ bound

where
• CCS(τ,c): sum of the computational costs of the type τ operations scheduled in cycle c:

CCS(τ,c) = Σ_{op ∈ SOPτ(c)} COST(op)

• SOPτ(c): set of operations of type τ scheduled in cycle c.

Otherwise the operation is fragmented, and one fragment is scheduled in the selected cycle (the remaining fragments continue unscheduled). The computational cost of the fragment to be scheduled, CCostFrag, equals the value needed to reach the bound in that cycle:

CCostFrag = bound − CCS(τ,c)
In order to avoid a reduction in the mobility of the predecessors and successors of
the fragmented operation, they are also fragmented.
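The bound rule above can be sketched as a small decision function (hypothetical names; costs and latency chosen for illustration):

```python
def schedule_decision(costs_in_cycle, op_cost, bound):
    """Decide whether the selected operation is scheduled whole or
    fragmented.  Returns the number of cost units (bits) placed in
    the selected cycle."""
    ccs = sum(costs_in_cycle)            # CCS(tau, c): cost already scheduled
    if ccs + op_cost <= bound:
        return op_cost                   # schedule the complete operation
    return bound - ccs                   # CCostFrag: schedule one fragment only

# Initial bound: total cost of the type-tau operations over lambda cycles.
costs = [16, 16, 32]                     # COST(op) for all multiplications
lam = 4                                  # circuit latency (cycles)
bound = sum(costs) / lam                 # most uniform distribution: 16 bits/cycle
```

With an empty cycle, a 16-bit-cost operation fits whole; once 8 bits are scheduled, the same operation is fragmented and only an 8-bit-cost fragment is placed.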
