
7.5 Cholesky Factorization for Sparse Matrices
The data dependences between columns can be expressed using the relation parent(j) [124, 118]. For each column j, 0 ≤ j < n, we define

parent(j) = min { i | i ∈ Struct(L∗j) }   if Struct(L∗j) ≠ ∅,

i.e., parent(j) is the row index of the first off-diagonal non-zero of column j. If Struct(L∗j) = ∅, then parent(j) = j. The element parent(j) is the first column i > j which depends on j. A column l with j < l < i does not depend on j, since j ∉ Struct(Ll∗) and no cmod(l, j) is executed. Moreover, we define for 0 ≤ i < n

children(i) = { j < i | parent(j) = i },

i.e., children(i) contains all columns j that have their first off-diagonal non-zero in row i.
The directed graph G = (V, E) has a set of nodes V = {0, ..., n−1} with one node for each column and a set of edges E, where (i, j) ∈ E if i = parent(j) and i ≠ j. It can be shown that G is a tree if matrix A is irreducible. (A matrix A is called reducible if A can be permuted such that it is block-diagonal. For a reducible matrix, the blocks can be factorized independently.) In the following, we assume an irreducible matrix. Figure 7.25 shows a matrix and its corresponding elimination tree.
In the following, we denote the subtree with root j by G[j]. For sparse Cholesky factorization, an important property of the elimination tree G is that it specifies the order in which the columns must be evaluated: The definition of parent implies that column i must be evaluated before column j if j = parent(i). Thus, all the children of column j must be completely evaluated before the computation of j. Moreover, column j does not depend on any column that is not in the subtree G[j]. Hence, columns i and j can be computed in parallel if G[i] and G[j] are disjoint subtrees. In particular, all leaves of the elimination tree can be computed in parallel, and the computation does not need to start with column 0. Thus, the sparsity structure determines the parallelism that can be exploited. For a given matrix, elimination trees of smaller height usually represent a larger degree of parallelism than trees of larger height [77].
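Both relations, and with them the elimination tree, can be computed directly from the column structures Struct(L∗j). The following minimal C sketch assumes that the non-zero row indices of each column of L are given in compressed column storage with the row indices of each column sorted in increasing order; the array and function names are assumptions made for this illustration, not the book's implementation.

#include <stdio.h>

/* Sketch: compute parent(j) and children(i) of the elimination tree
 * from the column structures Struct(L_{*j}), assuming the non-zero
 * row indices of each column j are stored in compressed column
 * storage (col_ptr, row_idx) with rows sorted in increasing order. */
void elimination_tree(int n, const int *col_ptr, const int *row_idx,
                      int *parent, int *first_child, int *next_sibling)
{
    for (int j = 0; j < n; j++) {
        parent[j] = j;              /* default: Struct(L_{*j}) is empty */
        first_child[j] = -1;
        next_sibling[j] = -1;
        /* first off-diagonal (below-diagonal) non-zero of column j */
        for (int p = col_ptr[j]; p < col_ptr[j + 1]; p++) {
            if (row_idx[p] > j) { parent[j] = row_idx[p]; break; }
        }
    }
    /* children(i) = { j < i | parent(j) = i }, stored as linked lists */
    for (int j = n - 1; j >= 0; j--) {
        if (parent[j] != j) {
            next_sibling[j] = first_child[parent[j]];
            first_child[parent[j]] = j;
        }
    }
}

int main(void)
{
    /* Tiny hypothetical 4x4 pattern: column 0 has a non-zero in row 2,
     * column 1 in row 2, column 2 in row 3, column 3 only on the diagonal. */
    int col_ptr[] = {0, 2, 4, 6, 7};
    int row_idx[] = {0, 2, 1, 2, 2, 3, 3};
    int parent[4], first_child[4], next_sibling[4];
    elimination_tree(4, col_ptr, row_idx, parent, first_child, next_sibling);
    for (int j = 0; j < 4; j++)
        printf("parent(%d) = %d\n", j, parent[j]);
    return 0;
}

For the hypothetical 4 × 4 pattern used in main(), the sketch reports parent(0) = parent(1) = 2 and parent(2) = 3, so columns 0 and 1 are leaves of the elimination tree and could be processed in parallel.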
Fig. 7.25 Sparse matrix and the corresponding elimination tree
7.5.3.1 Parallel Left-Looking Algorithms
The parallel implementation of the left-looking algorithm (III) is based on n column tasks Tcol(0), ..., Tcol(n−1), where task Tcol(j), 0 ≤ j < n, comprises the execution of cmod(j, k) for all k ∈ Struct(Lj∗) and the execution of cdiv(j); this is the loop body of the for loop in algorithm (III). These tasks are not independent of each other but have dependences due to the non-zero elements. The parallel implementation manages the tasks with a central task pool, which can be accessed by every processor. Each processor is responsible for performing a subset of the column tasks. The assignment of tasks to processors for execution is dynamic, i.e., when a processor is idle, it takes a task from the central task pool.
The dynamic implementation has the advantage that the workload is distributed
evenly although the tasks might have different execution times due to the sparsity
structure. The concurrent accesses of the processors to the central task pool have to
be conflict-free so that the unique assignment of a task to a processor for execution
is guaranteed. This can be implemented by a locking mechanism so that only one
processor accesses the task pool at a specific time.
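As a minimal sketch, such a lock-protected central task pool might look as follows in C with Pthreads; the fixed-size array representation and the function names pool_init(), pool_put(), and pool_get() are assumptions made for this illustration, not the data structure used in the book.

#include <pthread.h>

/* Minimal sketch of a lock-protected central task pool that stores the
 * indices of column tasks that are ready for execution. Only one thread
 * can access the pool at a time, so each task is assigned exactly once. */
typedef struct {
    int tasks[1024];          /* column indices of tasks ready to run  */
    int count;                /* number of tasks currently in the pool */
    pthread_mutex_t lock;
} task_pool_t;

void pool_init(task_pool_t *tp) {
    tp->count = 0;
    pthread_mutex_init(&tp->lock, NULL);
}

void pool_put(task_pool_t *tp, int col) {
    pthread_mutex_lock(&tp->lock);
    tp->tasks[tp->count++] = col;
    pthread_mutex_unlock(&tp->lock);
}

/* returns a column index or -1 if the pool is currently empty */
int pool_get(task_pool_t *tp) {
    pthread_mutex_lock(&tp->lock);
    int col = (tp->count > 0) ? tp->tasks[--tp->count] : -1;
    pthread_mutex_unlock(&tp->lock);
    return col;
}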
There are several parallel implementation variants of the left-looking algorithm, differing in the way the column tasks are inserted into the task pool. We consider three variants:
• Variant L1 inserts column task Tcol(j) into the task pool only after all column tasks Tcol(k) with k ∈ Struct(Lj∗) have been finished. The task pool can be initialized to the leaves of the elimination tree. The degree of parallelism is limited by the number of independent nodes of the tree, since tasks that depend on each other are executed in sequential order. On the other hand, a processor that has accessed a task Tcol(j) can execute it without waiting for other tasks to be finished, since all its dependences are already satisfied.
• Variant L2 allows the execution of Tcol(j) to start without requiring that it can be executed to completion immediately. The task pool is initialized to all column tasks available. The column tasks are accessed by the processors dynamically from left to right, i.e., an idle processor accesses the next column that has not yet been assigned to a processor.
The computation of a column task Tcol(j) may therefore start before all tasks Tcol(k) with k ∈ Struct(Lj∗) have been finished. In this case, not all operations cmod(j, k) of Tcol(j) can be executed immediately; the task can perform only those cmod(j, k) operations with k ∈ Struct(Lj∗) for which the corresponding tasks have already been executed. Thus, the task might have to wait during its execution for other tasks to be finished.
To control the execution of a single column task Tcol(j), each column j is assigned a data structure S_j containing all columns k ∈ Struct(Lj∗) for which cmod(j, k) can already be executed. When a processor finishes the execution of the column task Tcol(k) (by executing cdiv(k)), it pushes k onto the data structures S_j for each j ∈ Struct(L∗k). Because different processors might try to access the same stack at the same time, a locking mechanism has to be used to avoid access conflicts.
Fig. 7.26 Parallel left-looking algorithm according to variant L2. The implicit task pool is implemented in the while loop and the function get_unique_index(). The stacks S_1, ..., S_n implement the bookkeeping about the dependent columns already finished
The processor executing Tcol(j) pops column indices k from S_j and executes the corresponding cmod(j, k) operations. If S_j is empty, the processor waits for another processor to insert new column indices. When |Struct(Lj∗)| column indices have been retrieved from S_j, the task Tcol(j) can execute the final cdiv(j) operation.
Figure 7.26 shows the corresponding implementation. The central task pool is realized implicitly as a parallel loop; the operation get_unique_index() ensures a conflict-free assignment of tasks, so that processors accessing the pool at the same time get different unique loop indices representing column tasks. The loop body of the while loop implements one task Tcol(j). The data structures S_1, ..., S_n are stacks; pop(S_j) retrieves an element from a stack and push(j, S_i) inserts an element onto a stack. (A sketch of these synchronization primitives is given after this list.)
• Variant L3 is a variation of L2 that takes the structure of the elimination tree into
consideration. The columns are not assigned strictly from left to right to the pro-
cessors, but according to their height in the elimination tree, i.e., the children of a
column j in the elimination tree are assigned to processors before their parent j.
This variant tries to complete the column tasks in the order in which the columns
are needed for the completion of the other columns, thus exploiting the additional
parallelism that is provided by the sparsity structure of the matrix.
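As a minimal sketch, the synchronization primitives of variant L2, i.e., the conflict-free index assignment and the lock-protected stacks S_j, might be realized as follows with Pthreads. Only the name get_unique_index() is taken from Fig. 7.26; the remaining names and the data layout are assumptions made for this illustration.

#include <pthread.h>

/* Conflict-free assignment of column tasks: a shared counter hands out
 * each column index exactly once; -1 signals that all tasks are assigned. */
static int next_column = 0;
static pthread_mutex_t column_lock = PTHREAD_MUTEX_INITIALIZER;

int get_unique_index(int n) {
    pthread_mutex_lock(&column_lock);
    int j = (next_column < n) ? next_column++ : -1;
    pthread_mutex_unlock(&column_lock);
    return j;
}

/* One lock-protected stack S_j per column j; it records the columns k
 * for which cmod(j,k) may already be executed. */
typedef struct {
    int *elems;               /* column indices k pushed so far */
    int top;
    pthread_mutex_t lock;
} col_stack_t;

void push(int k, col_stack_t *S) {   /* called after cdiv(k) has finished */
    pthread_mutex_lock(&S->lock);
    S->elems[S->top++] = k;
    pthread_mutex_unlock(&S->lock);
}

int pop(col_stack_t *S) {            /* returns -1 if S is currently empty */
    pthread_mutex_lock(&S->lock);
    int k = (S->top > 0) ? S->elems[--S->top] : -1;
    pthread_mutex_unlock(&S->lock);
    return k;
}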
7.5.3.2 Parallel Right-Looking Algorithm
The parallel implementation of the right-looking algorithm (IV) is also based on a task pool and on column tasks. These column tasks are defined differently than the tasks of the parallel left-looking algorithm: A column task Tcol(j), 0 ≤ j < n, comprises the execution of cdiv(j) and cmod(k, j) for all k ∈ Struct(L∗j), i.e., a column task comprises the final computation for column j and the modifications of all columns k > j to the right of column j that depend on j. The task pool is initialized to all column tasks corresponding to the leaves of the elimination tree. A task Tcol(j) that is not a leaf is inserted into the task pool as soon as the operations cmod(j, k) for all k ∈ Struct(Lj∗) have been executed and the final cdiv(j) operation is possible.
Figure 7.27 sketches a parallel implementation of the right-looking algorithm. The task assignment is implemented by maintaining a counter c_j for each column j. The counter is initialized to 0 and is incremented after the execution of each cmod(j, ∗) operation by the corresponding processor using the conflict-free procedure add_counter().
Fig. 7.27 Parallel right-looking algorithm. The column tasks are managed by a task pool TP. Column tasks are inserted into the task pool by add_column() and retrieved from the task pool by get_column(). The function initialize_task_pool() initializes the task pool TP with the leaves of the elimination tree. The condition of the outer while loop assigns column indices j to processors. The processor retrieves the corresponding column task as soon as the call filled_pool(TP, j) returns that the column task exists in the task pool
For the execution of a cmod(k, j) operation of a task Tcol(j), column k must be locked to prevent other tasks from modifying the same column at the same time. A task Tcol(j) is inserted into the task pool when the counter c_j has reached the value |Struct(Lj∗)|.
The differences between this right-looking implementation and the left-looking variant L2 lie in the execution order of the cmod() operations and in the executing processor. In the L2 variant, the operation cmod(j, k) is initiated by the processor computing column k, which pushes k onto stack S_j, but the operation is executed by the processor computing column j. This execution need not be performed immediately after the initiation of the operation. In the right-looking variant, the operation cmod(j, k) is not only initiated but also executed by the processor that computes column k.
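As a minimal sketch, the counter mechanism might be implemented as follows; the names c_j and add_counter() follow the description above, whereas the struct layout and the add_column() callback are assumptions made for this illustration.

#include <pthread.h>

/* Sketch of the counter mechanism of the parallel right-looking
 * algorithm: counter[j] counts the cmod(j,*) operations already
 * executed for column j; once it reaches needed[j] = |Struct(L_{j*})|,
 * the task Tcol(j) is ready and is inserted into the task pool. */
typedef struct {
    int             *counter;   /* c_j, initialized to 0              */
    int             *needed;    /* |Struct(L_{j*})| for each column   */
    pthread_mutex_t *lock;      /* one lock per column                */
    void           (*add_column)(int j);  /* inserts Tcol(j) into TP  */
} counters_t;

void add_counter(counters_t *c, int j)
{
    int ready;
    pthread_mutex_lock(&c->lock[j]);
    c->counter[j]++;                        /* one more cmod(j,*) done */
    ready = (c->counter[j] == c->needed[j]);
    pthread_mutex_unlock(&c->lock[j]);
    if (ready)
        c->add_column(j);    /* Tcol(j) can now be executed (cdiv(j)) */
}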
7.5.3.3 Parallel Supernodal Algorithm
The parallel implementation of the supernodal algorithm uses a partition into fundamental supernodes. A supernode I(p) = {p, p+1, ..., p+q−1} is a fundamental supernode if, for each i with 0 ≤ i ≤ q−2, we have children(p+i+1) = {p+i}, i.e., node p+i is the only child of p+i+1 in the elimination tree [124]. In Fig. 7.23, supernode I(2) = {2, 3, 4} is a fundamental supernode, whereas supernodes I(6) = {6, 7} and I(8) = {8, 9} are not fundamental. In a partition into fundamental supernodes, all columns of a supernode can be computed as soon as the first column can be computed; waiting for the computation of columns outside the supernode is not needed. In the following, we assume that all supernodes are fundamental, which can be achieved by splitting supernodes into smaller ones. A supernode consisting of a single column is fundamental.
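Whether a given supernode is fundamental can be checked directly from the elimination tree, as in the following minimal C sketch; parent[] and num_children[] correspond to the relations parent() and children() defined above, and the function name is an assumption made for this illustration.

/* Minimal sketch: check whether supernode I(p) = {p,...,p+q-1} is
 * fundamental, i.e., whether children(p+i+1) = {p+i} holds for all
 * 0 <= i <= q-2. parent[j] is the elimination tree relation parent(j)
 * and num_children[j] = |children(j)|; both can be computed as in the
 * elimination tree sketch above. */
int is_fundamental(int p, int q, const int *parent, const int *num_children)
{
    for (int i = 0; i <= q - 2; i++) {
        /* p+i must be the only child of p+i+1 */
        if (parent[p + i] != p + i + 1 || num_children[p + i + 1] != 1)
            return 0;
    }
    return 1;   /* q == 1 also yields 1: a single column is fundamental */
}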
The parallel implementation of the supernodal algorithm (VI) is based on supernode tasks Tsup(J), where task Tsup(J) for 0 ≤ J < N comprises the execution of smod(j, J) and cdiv(j) for each j ∈ J from left to right and the execution of smod(k, J) for all k ∈ Struct(L∗last(J)); N is the number of supernodes. Tasks Tsup(J) that are ready for execution are held in a central task pool that is accessed by the idle processors. The pool is initialized to the supernodes that are ready for completion; these are the supernodes whose first column is a leaf in the elimination tree. If the execution of an smod(k, J) operation by a task Tsup(J) finishes the modification of another supernode K > J, the corresponding task Tsup(K) is inserted into the task pool.
The assignment of supernode tasks is again implemented by maintaining a counter c_j for each column j of a supernode. Each counter is initialized to 0 and is incremented for each modification that is executed for column j. Ignoring the modifications with columns inside a supernode, a supernode task Tsup(J) is ready for execution if the counters of the columns j ∈ J have reached the value |Struct(Lj∗)|. The updates of the counters as well as the manipulation of the columns have to be protected, e.g., by a locking mechanism. For the manipulation of a column k ∉ J by an smod(k, J) operation, column k is locked to avoid concurrent manipulation by different processors. Figure 7.28 shows the corresponding implementation.
Fig. 7.28 Parallel supernodal algorithm
7.6 Exercises for Chap. 7
Exercise 7.1 For an n × m matrix A, a vector a of length n, and a vector b of length m, write a parallel MPI program which computes the rank-1 update A = A − a · b^T, which can be computed sequentially by

for (i=0; i<n; i++)
  for (j=0; j<m; j++)
    A[i][j] = A[i][j] - a[i] * b[j];
For the parallel MPI implementation assume that A is distributed among the p pro-
cessors in a column-cyclic way. The vectors a and b are available at the process
with rank 0 only and must be distributed appropriately before the computation.
After the update operation, the matrix A should again be distributed in a column-
cyclic way.
Exercise 7.2 Implement the rank-1 update in OpenMP. Use a parallel for loop to
express the parallel execution.
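As a minimal sketch, the OpenMP version of the rank-1 update might look as follows; the row-major storage of A in one contiguous array is an assumption made for this illustration. The iterations of the outer loop are independent, so the parallel for loop needs no further synchronization.

#include <omp.h>

/* Rank-1 update A = A - a * b^T for an n x m matrix A stored row-major
 * in a contiguous array; the rows are distributed among the threads. */
void rank1_update(int n, int m, double *A, const double *a, const double *b)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            A[i * m + j] -= a[i] * b[j];
}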
Exercise 7.3 Extend the program piece in Fig. 7.2 for performing the Gaussian
elimination with a row-cyclic data distribution to a full MPI program. To do so,
all helper functions used and described in the text must be implemented. Measure
the resulting execution times for different matrix sizes and different numbers of
processors.
Exercise 7.4 Similar to the previous exercise, transform the program piece in Fig. 7.6
with a total cyclic data distribution to a full MPI program. Compare the resulting
execution times for different matrix sizes and different numbers of processors. For
which scenarios does a significant difference occur? Try to explain the observed
behavior.
Exercise 7.5 Develop a parallel implementation of Gaussian elimination for shared
address spaces using OpenMP. The MPI implementation from Fig. 7.2 can be
used as an orientation. Explain how the available parallelism is expressed in your
OpenMP implementation. Also explain where synchronization is needed when
accessing shared data. Measure the resulting execution times for different matrix
sizes and different numbers of processors.

Exercise 7.6 Develop a parallel implementation of Gaussian elimination using Java
threads. Define a new class Gaussian which is structured similarly to the Java
program in Fig. 6.23 for a matrix multiplication. Explain which synchronization
is needed in the program. Measure the resulting execution times for different matrix
sizes and different numbers of processors.
Exercise 7.7 Develop a parallel MPI program for Gaussian elimination using a
column-cyclic data distribution. An implementation with a row-cyclic distribution
has been given in Fig. 7.2. Explain which communication is needed for a column-
cyclic distribution and include this communication in your program. Compute
the resulting speedup values for different matrix sizes and different numbers of
processors.
Exercise 7.8 For n = 8 consider the following tridiagonal equation system:

    [ 1  1                ]         [ 1 ]
    [ 1  2  1             ]         [ 2 ]
    [    1  2  1          ]         [ 3 ]
    [       .  .  .       ]  · x =  [ : ]
    [          .  .  .    ]         [ : ]
    [            1  2  1  ]         [ 7 ]
    [               1  2  ]         [ 8 ]
Use the recursive doubling technique from Sect. 7.2.2, p. 385, to solve this equation
system.
Exercise 7.9 Develop a sequential implementation of the cyclic reduction algorithm for solving tridiagonal equation systems, see Sect. 7.2.2, p. 385. Measure the resulting sequential execution times for different matrix sizes starting with size n = 100 up to size n = 10^7.
Exercise 7.10 Transform the sequential implementation of the cyclic reduction
algorithm from the last exercise into a parallel implementation for a shared address
space using OpenMP. Use an appropriate parallel for loop to express the paral-
lel execution. Measure the resulting parallel execution times for different numbers
of processors for the same matrix sizes as in the previous exercise. Compute the
resulting speedup values and show the speedup values in a diagram.
Exercise 7.11 Develop a parallel MPI implementation of the cyclic reduction algo-
rithm for a distributed address space based on the description in Sect. 7.2.2, p. 385.
Measure the resulting parallel execution times for different numbers of processors
and compute the resulting speedup values.
Exercise 7.12 Specify the data dependence graph for the cyclic reduction algorithm
for n = 12 equations according to Fig. 7.11. For p = 3 processors, illustrate the
three phases according to Fig. 7.12 and show which dependences lead to communi-
cation.
Exercise 7.13 Implement a parallel Jacobi iteration with a pointer-based storage
scheme of the matrix A such that global indices are used in the implementation.

Exercise 7.14 Consider the parallel implementation of the Jacobi iteration in
Fig. 7.13 and provide a corresponding shared memory program using OpenMP
operations.
Exercise 7.15 Implement a parallel SOR method for a dense linear equation system
by modifying the parallel program in Fig. 7.14.
Exercise 7.16 Provide a shared memory implementation of the Gauss–Seidel method
for the discretized Poisson equation.
Exercise 7.17 Develop a shared memory implementation for Cholesky factorization A = L L^T for a dense matrix A using the basic algorithm.
Exercise 7.18 Develop a message-passing implementation for dense Cholesky factorization A = L L^T.
Exercise 7.19 Consider a matrix with the following non-zero entries:
(10 × 10 matrix with rows and columns numbered 0, ..., 9 and a given non-zero pattern; see the original figure)
(a) Specify all supernodes of this matrix.
(b) Consider a supernode J with at least three entries. Specify the sequence of
cmod and cdiv operations that are executed for this supernode in the right-looking
supernode Cholesky factorization algorithm.
(c) Determine the elimination tree resulting for this matrix.
(d) Explain the role of the elimination tree for a parallel execution.
Exercise 7.20 Derive the parallel execution time of a message-passing program of
the CG method for a distributed memory machine with a linear array as intercon-
nection network.
Exercise 7.21 Consider a parallel implementation of the CG method in which the
computation step (3) is executed in parallel to the computation step (4). Given a row-
blockwise distribution of matrix A and a blockwise distribution of the vector, derive
the data distributions for this implementation variant and give the corresponding
parallel execution time.
Exercise 7.22 Implement the CG algorithm given in Fig. 7.20 with the blockwise
distribution as message-passing program using MPI.