12 Conclusions
Fig. 12.2 Larrabee architecture from Intel: several in-order cores, each multi-threaded with a wide SIMD unit and its own instruction and data caches (I$, D$), connected through a shared L2 cache to memory controllers, fixed-function texture logic, and the display and system interfaces
Fig. 12.3 Fermi architecture from NVIDIA: an array of shared multiprocessors (SMs), each built from multiple cores, surrounding a common L2 cache, with the GigaThread scheduler, the host interface, and six DRAM interfaces at the periphery
In Fermi, the cores are grouped into units called shared multiprocessors (SMs). The block diagram of a single SM is shown in Fig. 12.4, and the block diagram of a core within an SM is shown in Fig. 12.5.
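To make this SM-level organization concrete, the following minimal sketch (not taken from the monograph) queries the SM count, warp size, and per-block resource limits of the installed GPU through the CUDA runtime API. It assumes a CUDA-capable device with index 0 is present.

// Minimal sketch (assumption: a CUDA-capable device with index 0 is present).
// Queries the SM organization of the installed GPU at run time.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
    std::fprintf(stderr, "no CUDA device found\n");
    return 1;
  }
  std::printf("device                  : %s\n", prop.name);
  std::printf("shared multiprocessors  : %d\n", prop.multiProcessorCount);
  std::printf("warp size               : %d threads\n", prop.warpSize);
  std::printf("max threads per block   : %d\n", prop.maxThreadsPerBlock);
  std::printf("shared memory per block : %zu bytes\n", (size_t)prop.sharedMemPerBlock);
  return 0;
}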
With these upcoming architectures, newer approaches to hardware acceleration of algorithms would become viable, since they could exploit the more general computing paradigm that these architectures offer. For example, the close coupling between the GPU and the CPU (which reside on the same die) would reduce the cost of communication between the two.
Fig. 12.4 Block diagram of a single shared multiprocessor (SM) in Fermi: 32 cores, an instruction cache, a register file, two schedulers with two dispatch units, 16 load/store units, 4 special function units, an interconnect network, 64 KB of configurable shared memory/cache, and a uniform cache
Further, in these upcoming architectures the instruction dispatch unit is distributed and the instruction set is more general purpose. These enhancements would enable a more general computing paradigm (in comparison to the SIMD paradigm of current GPUs), which in turn would open up acceleration opportunities for a broader set of EDA applications.
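The benefit of closer CPU-GPU coupling can already be approximated on current CUDA devices through mapped (zero-copy) host memory, where a kernel dereferences host-resident buffers directly instead of staging them through explicit transfers. The sketch below is only an illustration of that idea; the kernel and buffer size are hypothetical and not part of the monograph.

// Minimal sketch of mapped (zero-copy) host memory: the kernel reads and writes
// host-resident buffers directly, so no explicit cudaMemcpy calls are needed.
// The kernel and the buffer size below are hypothetical, for illustration only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(const float* in, float* out, int n, float k) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = k * in[i];
}

int main() {
  const int n = 1 << 20;
  cudaSetDeviceFlags(cudaDeviceMapHost);              // allow mapping host memory

  float *h_in = 0, *h_out = 0;                        // host-resident, GPU-visible
  cudaHostAlloc((void**)&h_in,  n * sizeof(float), cudaHostAllocMapped);
  cudaHostAlloc((void**)&h_out, n * sizeof(float), cudaHostAllocMapped);
  for (int i = 0; i < n; ++i) h_in[i] = (float)i;

  float *d_in = 0, *d_out = 0;                        // device views of the same buffers
  cudaHostGetDevicePointer((void**)&d_in,  h_in,  0);
  cudaHostGetDevicePointer((void**)&d_out, h_out, 0);

  scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
  cudaDeviceSynchronize();                            // results are already in h_out

  std::printf("h_out[42] = %f\n", h_out[42]);
  cudaFreeHost(h_in);
  cudaFreeHost(h_out);
  return 0;
}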
The approaches presented in this monograph are collectively intended to help the CAD community accelerate EDA algorithms on modern hardware platforms. Our work demonstrates techniques for rearchitecting several EDA algorithms so that they maximally harness the performance of the alternative platforms under consideration.
Fig. 12.5 Block diagram of a single processor (CUDA core) in an SM: a dispatch port and operand collector feed a floating-point (FP) unit and an integer (INT) unit, whose results go to a result queue
Index
A
Accelerators, 9
ACML-GPU, 15
Activity, 93
Algorithm parallel, 120, 121, 134
Amdahl’s Law, 158, 170
Application specific, 64
Arrival time, 110
Assignment, 31, 37, 40
B
Backtracking, 32
Bandwidth, 13
Bandwidth minimization, 52
Bank conflict, 27
BCP, 32, 37, 40
Bias
survey propagation, 89
Bins, 64
Bin packing, 52, 70
Bin utilization, 74
Bit parallel, 135, 146
Block, 28
Block-based
SSTA, 108
Board test, 15
Boolean Constraint Propagation, see BCP
Boolean Satisfiability, see SAT
Box-Muller, 101
BRAM, 11, 14, 32, 63, 66, 72, 78
Brook+, 15
BSIM3
SPICE, 158
BSIM4
SPICE, 158
Bulldog Fortran, 171
C
Capacity, 31, 35
CDFG, 160
Clause, 31
Clock speed, 11
CNF, 31, 34
Co-processors, 9
Compilers, 16
Complete
SAT, 83, 85
Conflict, 37, 40, 42, 44, 71
Conflict clause, 31
Conflict clause generation, 33, 64
Conjunctive Normal Form, 34
Constant Memory, 26, 161
Control and dataflow graph, 173
Control dominated
EDA, 3
Control plus data parallel
EDA, 3
Core, 185
Critical line
critical path tracing, 138
Critical path tracing, 138
CUBLAS, 15
CUDA, 15, 24
CUFFT, 15
Cumulative detectability, 138
Custom IC, 7, 10, 33
D
Data parallel, 28, 106, 120, 122, 134
Debuggers, 16
Decision engine, 37, 39, 49, 70
Decision level, 39, 67
Decisions
SAT, 32
Detectability, 138
DFF, 11
DIMACS, 45
Dimblock, 29
Dimensionality, 29
Dimgrid, 29
Divide, 12
Dominator, 138
DPLL, 85
DRAM, 14, 66, 184
Dropped
fault table, 134
Dynamic
power, 10
Dynamic bulk modulation, 10
E
EDA, 3
Embedded processor, 10
F
Factor Graph, 87
Fault detection, 134
Fault diagnosis, 134
Fault dropping, 134
Fault grading, 102, 120
Fault injection, 135
Fault parallel
data parallel, 120
Fault simulation, 4, 119
Fault table, 4, 134
Fermi, 184
Fingerprinting, 19
FPGA, 3, 7, 10, 32
Function
Factor Graph, 87
G
Global Memory, 13, 27, 110, 159
GPGPU, 3
Graphics Processors, see GPU
GRASP, 35, 64, 85
Grid, see dimgrid
GridSAT, 87
GSAT, 85
H
Hardware
IP cores, 15
HDL, 10, 14, 19
Hybrid
SAT, 85
I
Immediate dominator
dominator, 138
Implication, 37, 40, 44
Implication graph, 31, 33, 37, 50, 64
Infringement
security, 19
Input vector control, 10
Instance specific, 64
Inter-bin
non-chronological, 32
Intra-bin
non-chronological, 32
IP cores, 15
K
Kernel, 28, 167, 184
L
Larrabee, 184
Latency, 11, 13
Leakage
power, 10
Levelize, 112
Literal, 37
free literal, 41
Local memory, 12, 27
Logic analyzers, 15
Lookup table, 11, 106, 120
LUT, 12
M
Memory bandwidth, 1, 13
Memory wall, 1
Mersenne Twister, 101, 106, 112
MIMD, 171
Minimum unsatisfiable core, 31, 33, 53
MiniSAT, 85
MNA
SPICE, 154
Model evaluation, 154
Model parallel, 122, 134
Monte Carlo, 4
SSTA, 101, 106
Moore’s Law, 24
MOPs, 17
MOPs per watt
MOPs, 17
Multi-GPU, 16
Multi-port
memory, 20
Multiprocessor, 12, 24
MUX, 11
N
Newton-Raphson, 154
NMOS
passgates, 11
Non-chronological backtrack, 32, 43, 45, 64,
68, 85
Non-recurring engineering, 10, 18
Non-volatile
memory, 20
O
Off-chip, 14
On-chip, 14
OPB, 67, 72
P
Paging, 12
Parafrase, 170
Parallel
SAT, 85
Partition, 32, 35, 63, 78
Pass/fail fault dictionary, 134
Path-based
SSTA, 108
Pattern parallel
data parallel, 120
PCI, 15
PCI-X
PCI, 15
Pipeline, 11
Piracy
security, 19
PLB, 67, 72
PLB2OPB bridge, 72
Power, 10, 56
average power, 58
Power delay product, 18
Power gating, 10
Power wall, 1
PowerPC, 32
Precharged, 39
Predischarged, 39
Process variations, 106
Processor, 24
Profiling
code, 16
Programmable, 12
Prototyping, 16
Q
QuickPath Interconnect, 18
R
Random
variations, 106
Re-programmability, 19
Reconfigurable logic
FPGA, 11
Reconfigure, 12
Reduced OR, 144
Register, 26, 172
Resolution, 36
Reuse-based design, 19
S
Sample parallelism, 106
SAT, 4, 31, 33, 34, 36
3-SAT, 36
Scalability, 15, 31, 35, 66
Scattered reads, 29
SEE, 18, 114
Self-test, 15
Sensitive input, 138
Shared Memory, 26, 27, 110
Shared multiprocessor, 185
SIMD, 3, 18, 29
Software
IP cores, 15
Span, 69
Speedup, 31
SPICE, 31, 153
Square root, 12
SRAM, 11
SSTA, 4, 101, 106
STA, 101, 106
SPICE, 154
Stem, 137
Stem region, 138, 143
Stochastic
SAT, 83, 85
Subroutine, 167
Subsumption
resolution, 56
Successive chord, 156
Supply voltage, 10
Survey propagation, 84
Surveys
survey propagation, 88
Synchronization points, 29
Synchronize, 28
System test, 15
Systematic
variations, 106
T
Termination cell, 40
Texture fetching
Texture Memory, 27
Texture Memory, 26, 110, 155, 160
Thread, 28, 146
Thread block, 28
Thread parallel, 135
Thread scheduler, 29
Throughput, 11
Time slicing, 29
Tree Factor Graph, 87
U
Unate covering, 134
V
Variable, 31, 37
Factor Graph, 87
Variable ordering
SAT, 32
Variable Vt, 10
Variations, 106
Virtual memory, 12
VLIW, 171
VLSI, 106
VSIDS, 93
W
WalkSAT, 85, 90, 96
Warp size, 29
Warps, 29
Watermarking, 19
X
XC2VP30
FPGA, 32