Parallel Programming for Multicore and Cluster Systems
7.1 Gaussian Elimination
5. Computation of the elimination factors: The function compute_elim_fact_loc() is used to compute the elimination factors l_{ik} for all elements a_{ik} owned by the processor. The elimination factors are stored in the buffer elim_buf.
5a. Distribution of the elimination factors: A single-broadcast operation is used
to send the elimination factors to all processors in the same row group Rop(q);
the corresponding communicator comm(Rop(q)) is used. The number (rank)
of the root processor q for this broadcast operation in a group G is determined
by the function rank(q,G).
6. Computation of the matrix elements: The computation of the matrix elements by compute_local_entries() and the backward substitution performed by backward_substitution() are similar to the pseudocode in Fig. 7.2.
The main differences from the program in Fig. 7.2 are that more communication is required and that almost all collective communication operations are performed on subgroups of the set of processors and not on the entire set of processors.
7.1.4 Analysis of the Parallel Execution Time
The analysis of the parallel execution time of the Gaussian elimination uses functions expressing the computation and communication times depending on the characteristics of the parallel machine, see also Sect. 4.4. The function describing the parallel execution time of the program in Fig. 7.6 additionally contains the parameters p_1, p_2, b_1, and b_2 of the parameterized data distribution in Formula (7.7). In the following, we model the parallel execution time of the Gaussian elimination with checkerboard distribution, neglecting pivoting and backward substitution for simplicity, see also [147]. These are the phases 4, 5, 5a, and 6 of the Gaussian elimination. For the derivation of functions reflecting the parallel execution time, these four SPMD computation phases can be considered separately, since there is a barrier synchronization after each phase.
For a communication phase, a formula describing the time of a collective communication operation is used which describes the communication time as a function of the number of processors and the message size. For the Gaussian elimination (without pivoting), phases 4 and 5a implement communication with a single-broadcast operation. The communication time for a single-broadcast with p processors and message size m is denoted as T_{sb}(p, m). We assume that independent communication operations on disjoint processor sets can be done in parallel. The values for p and m have to be determined for the specific situation. These parameters depend on the data distribution and the corresponding sizes of row and column groups as well as on the step k, k = 1, ..., n, of the Gaussian elimination, since messages get smaller for increasing k.
Also, the modeling of the computation times of phases 5 and 6 depends on the step number k, since fewer elimination factors or matrix elements have to be computed for increasing k and thus the number of arithmetic operations decreases with increasing k. The time for an arithmetic operation is denoted as t_{op}. Since the processors perform an SPMD program, the processor computing the most arithmetic operations determines the computation time for the specific computation phase. The following modeling of communication and computation times for one step k uses the index sets

I_q = {(i, j) ∈ {1, ..., n} × {1, ..., n} | P_q owns a_{ij}} ,

which contain the indices of the matrix elements stored locally in the memory of processor P_q:
• The broadcasting of the pivot row k in phase 4 of step k sends the elements of row k with column index ≥ k to the processors needing the data for computations in rows ≥ k. Since the pivot row is distributed across the processors of the row group Ro(k), all the processors of Ro(k) send their part of row k. The amount of data sent by one processor q ∈ Ro(k) is the number of elements of row k with column indices ≥ k (i.e., with indices (k, k), ..., (k, n)) owned by processor q. This is the number

  N_q^{row≥k} := #{(k, j) ∈ I_q | j ≥ k} .   (7.8)

(The symbol #X for a set X denotes the number of elements of the set X.) The processor q ∈ Ro(k) sends its data to those processors owning elements in the rows with row index ≥ k which have the same column indices as the elements of processor q. These are the processors in the column group Cop(q) of the processor q, and thus these processors are the recipients of the single-broadcast operation of processor q. Since all column groups of the processors q ∈ Ro(k) are disjoint, the broadcast operations can be done in parallel and the communication time is

  max_{q∈Ro(k)} T_{sb}(#Cop(q), N_q^{row≥k}) .
• In phase 5 of step k, the elimination factors are computed from the element a_{kk}^{(k)} and the elements a_{ik}^{(k)} for i > k by the processors owning these elements of column k, i.e., by the processors q ∈ Co(k), according to Formula (7.2). Each of the processors computes the elimination factors of its part, which are

  N_q^{col>k} := #{(i, k) ∈ I_q | i > k}   (7.9)

elimination factors for processor q ∈ Co(k). Since the computations are done in parallel, this results in the computation time

  max_{q∈Co(k)} N_q^{col>k} · t_{op} .
• In phase 5a the elimination factors are sent to all processors which recompute the matrix elements with indices (i, j), i > k, j > k. Since the elimination factors l_{ik}^{(k)}, i = k + 1, ..., n, are needed within the same row i, a row-oriented single-broadcast operation is used to send the data to the processors owning parts of row i. A processor q ∈ Co(k) sends its data to the processors in its row group Rop(q). These are the data elements computed in the previous phase, i.e., N_q^{col>k} data elements, and the communication time is

  max_{q∈Co(k)} T_{sb}(#Rop(q), N_q^{col>k}) .
• In phase 6 of step k, all matrix elements in the lower right rectangular area are recomputed. Each processor q recomputes the entries it owns; these are the number of elements per column for rows with indices > k (i.e., N_q^{col>k}) multiplied by the number of elements per row for columns with indices > k (i.e., N_q^{row>k}). Since two arithmetic operations are performed for one entry according to Formula (7.4), the computation time is

  max_{q∈P} N_q^{col>k} · N_q^{row>k} · 2t_{op} .
In total, the parallel execution time for all phases and all steps is

  T(n, p) = Σ_{k=1}^{n−1} [ max_{q∈Ro(k)} T_{sb}(#Cop(q), N_q^{row≥k})
                          + max_{q∈Co(k)} N_q^{col>k} · t_{op}            (7.10)
                          + max_{q∈Co(k)} T_{sb}(#Rop(q), N_q^{col>k})
                          + max_{q∈P} N_q^{col>k} · N_q^{row>k} · 2t_{op} ] .
This parallel execution time can be expressed in terms of the parameters of the data distribution ((p_1, b_1), (p_2, b_2)), the problem size n, and the step number k by estimating the sizes of messages and the number of arithmetic operations. For the estimation, larger blocks of data, called superblocks, are considered. A superblock consists of p_1 × p_2 consecutive blocks of size b_1 × b_2, i.e., it has p_1 b_1 rows and p_2 b_2 columns. There are ⌈n/(p_1 b_1)⌉ superblocks in the row direction and ⌈n/(p_2 b_2)⌉ in the column direction. Each of the p processors owns one data block of size b_1 × b_2 of a superblock. The two-dimensional matrix A is covered by these superblocks and from this covering, it can be estimated how many elements of the smaller matrices A^{(k)} are owned by a specific processor.
The number of elements owned by a processor q in row k for column indices ≥ k can be estimated by

  N_q^{row≥k} ≤ ⌈(n−k+1)/(p_2 b_2)⌉ · b_2 ≤ ((n−k+1)/(p_2 b_2) + 1) · b_2 = (n−k+1)/p_2 + b_2 ,   (7.11)

where ⌈(n−k+1)/(p_2 b_2)⌉ is the number of superblocks covering row k for column indices ≥ k, which are n−k+1 indices, and b_2 is the number of column elements that each processor of Ro(k) owns in a complete superblock. For the covering of one row, the number of columns p_2 b_2 of a superblock is needed. Analogously, the number of elements owned by a processor q in column k for row indices > k can be estimated by
  N_q^{col>k} ≤ ⌈(n−k)/(p_1 b_1)⌉ · b_1 ≤ ((n−k)/(p_1 b_1) + 1) · b_1 = (n−k)/p_1 + b_1 ,   (7.12)

where ⌈(n−k)/(p_1 b_1)⌉ is the number of superblocks covering column k for row indices > k, which are n−k row indices, and b_1 is the number of row elements that each processor of Co(k) owns in a complete superblock. Using these estimations, the parallel execution time in Formula (7.10) can be approximated by
  T(n, p) ≈ Σ_{k=1}^{n−1} [ T_{sb}(p_1, (n−k+1)/p_2 + b_2) + ((n−k)/p_1 + b_1) · t_{op}
                          + T_{sb}(p_2, (n−k)/p_1 + b_1)
                          + ((n−k)/p_1 + b_1) · ((n−k)/p_2 + b_2) · 2t_{op} ] .
Suitable parameters leading to a good performance can be derived from this modeling. For the communication time of a single-broadcast operation, we assume a communication time

  T_{sb}(p, m) = log p · (τ + m · t_c)

with a startup time τ and a transfer time t_c. This formula models the communication time in many interconnection networks, like a hypercube. Using the summation formula Σ_{k=1}^{n−1} (n−k+1) = Σ_{k=2}^{n} k = (Σ_{k=1}^{n} k) − 1 = n(n+1)/2 − 1, the communication time in phase 4 results in
  Σ_{k=1}^{n−1} T_{sb}(p_1, (n−k+1)/p_2 + b_2)
    = Σ_{k=1}^{n−1} log p_1 · (τ + ((n−k+1)/p_2 + b_2) · t_c)
    = log p_1 · [ (n(n+1)/2 − 1) · t_c/p_2 + (n−1) b_2 t_c + (n−1) τ ] .
For the second and third terms the summation formula Σ_{k=1}^{n−1} (n−k) = n(n−1)/2 is used, so that the computation time

  Σ_{k=1}^{n−1} ((n−k)/p_1 + b_1) · t_{op} = (n(n−1)/(2p_1) + (n−1) b_1) · t_{op}
and the communication time

  Σ_{k=1}^{n−1} T_{sb}(p_2, (n−k)/p_1 + b_1)
    = Σ_{k=1}^{n−1} log p_2 · (τ + ((n−k)/p_1 + b_1) · t_c)
    = log p_2 · [ n(n−1)/2 · t_c/p_1 + (n−1) b_1 t_c + (n−1) τ ]
result. For the last term, the summation formula Σ_{k=1}^{n−1} (n−k)/p_1 · (n−k)/p_2 = (1/p) · Σ_{k=1}^{n−1} k² = (1/p) · n(n−1)(2n−1)/6 is used. The total parallel execution time is
  T(n, p) = log p_1 · [ (n(n+1)/2 − 1) · t_c/p_2 + (n−1) b_2 t_c + (n−1) τ ]
          + (n(n−1)/2 · 1/p_1 + (n−1) b_1) · t_{op}
          + log p_2 · [ n(n−1)/2 · t_c/p_1 + (n−1) b_1 t_c + (n−1) τ ]
          + [ n(n−1)(2n−1)/(6p) + n(n−1)/2 · (b_2/p_1 + b_1/p_2) + (n−1) b_1 b_2 ] · 2t_{op} .
The block sizes b_i, 1 ≤ b_i ≤ n/p_i, for i = 1, 2 are contained in the execution time as factors and, thus, the minimal execution time is achieved for b_1 = b_2 = 1. In the resulting formula the terms (log p_1 + log p_2) · ((n−1)(τ + t_c)) = log p · ((n−1)(τ + t_c)), (n−1) · 3t_{op}, and n(n−1)(2n−1)/(3p) · t_{op} are independent of the specific choice of p_1 and p_2 and need not be considered. The terms n(n−1)/2 · 1/p_1 · t_{op} and t_c/p_2 · (n−1) log p_1 are asymmetric in p_1 and p_2. For simplicity we ignore these terms in the analysis, which is justified since these terms are small compared to the remaining terms; the first term has t_{op} as operand, which is usually small, and the second term with t_c as operand has a factor only linear in n. The remaining terms of the execution time are symmetric in p_1 and p_2 and have constants quadratic in n.
Using p_2 = p/p_1 this time can be expressed as

  T_S(p_1) = n(n−1)/2 · ( p_1 log p_1 / p + (log p − log p_1)/p_1 ) · t_c
           + n(n−1)/2 · ( 1/p_1 + p_1/p ) · 2t_{op} .
The first derivative is

  T_S'(p_1) = n(n−1)/2 · ( 1/(p · ln 2) + (log p_1)/p − (log p)/p_1² + (log p_1)/p_1² − 1/(p_1² · ln 2) ) · t_c
            + n(n−1)/2 · ( 1/p − 1/p_1² ) · 2t_{op} .

For p_1 = √p it is T_S'(p_1) = 0, since 1/p − 1/p_1² = 1/p − 1/p = 0, 1/(p · ln 2) − 1/(p_1² · ln 2) = 0, and (log p_1)/p − (log p)/p_1² + (log p_1)/p_1² = 0. The second derivative T_S''(p_1) is positive for p_1 = √p and, thus, there is a minimum at p_1 = p_2 = √p.
In summary, the analysis of the most influential parts of the parallel execution time of the Gaussian elimination has shown that p_1 = p_2 = √p, b_1 = b_2 = 1 is the best choice. For an implementation, the values for p_1 and p_2 have to be adapted to integer values.
7.2 Direct Methods for Linear Systems with Banded Structure
Large linear systems with banded structure often arise when discretizing partial differential equations. The coefficient matrix of a banded system is sparse with non-zero elements in the main diagonal of the matrix and a few further diagonals. As a motivation, we first present the discretization of a two-dimensional Poisson equation resulting in such a banded system in Sect. 7.2.1. In Sect. 7.2.2, the solution methods recursive doubling and cyclic reduction are applied to the solution of tridiagonal systems, i.e., banded systems with only three non-zero diagonals, and the parallel implementation is discussed. General banded matrices are treated with cyclic reduction in Sect. 7.2.3 and the discretized Poisson equation is used as an example in Sect. 7.2.4.
7.2.1 Discretization of the Poisson Equation
As a typical example of an elliptic partial differential equation we consider the Poisson equation with Dirichlet boundary conditions. This equation is often called the model problem since its structure is simple but the numerical solution is very similar to that of many other more complicated partial differential equations, see [60, 79, 166]. The two-dimensional Poisson equation has the form

  −Δu(x, y) = f(x, y) for all (x, y) ∈ Ω   (7.13)

with domain Ω ⊂ R². The function u : R² → R is the unknown solution function and the function f : R² → R is the right-hand side, which is continuous in Ω and on its boundary. The operator Δ is the two-dimensional Laplace operator
  Δ = ∂²/∂x² + ∂²/∂y²

containing the second partial derivatives with respect to x or y. (∂/∂x and ∂/∂y denote the first partial derivatives with respect to x or y, and ∂²/∂x² and ∂²/∂y² denote the second partial derivatives with respect to x or y, respectively.) Using this notation, the Poisson equation (7.13) can also be written as

  −∂²u/∂x² − ∂²u/∂y² = f(x, y) .
The model problem (7.13) uses the unit square Ω = (0, 1) × (0, 1) and assumes a Dirichlet boundary condition

  u(x, y) = ϕ(x, y) for all (x, y) ∈ ∂Ω ,   (7.14)

where ϕ is a given function and ∂Ω is the boundary of the domain Ω, which is

  ∂Ω = {(x, y) | 0 ≤ x ≤ 1, y = 0 or y = 1} ∪ {(x, y) | 0 ≤ y ≤ 1, x = 0 or x = 1} .

The boundary condition uniquely determines the solution u of the model problem. Figure 7.7 (left) illustrates the domain and the boundary of the model problem.
An example of the Poisson equation from electrostatics is the equation

  Δu = −ρ/ε_0 ,

where ρ is the charge density, ε_0 is a constant, and u is the unknown potential to be determined [97].
For the numerical solution of equation −Δu(x, y) = f(x, y), the method of finite differences can be used, which is based on a discretization of the domain Ω ∪ ∂Ω in both directions.

Fig. 7.7 Left: Poisson equation with Dirichlet boundary condition on the unit square Ω = (0, 1) × (0, 1). Right: The numerical solution discretizes the Poisson equation on a mesh with equidistant mesh points with distance 1/(N + 1). The mesh has N² inner mesh points and additional mesh points on the boundary

The discretization is given by a regular mesh with N + 2 mesh points in x-direction and in y-direction, where N points are in the inner part and 2 points are on the boundary. The distance between points in the x- or y-direction is h = 1/(N+1). The mesh points are

  (x_i, y_j) = (ih, jh) for i, j = 0, 1, ..., N + 1 .
The points on the boundary are the points with x_0 = 0, y_0 = 0, x_{N+1} = 1, or y_{N+1} = 1. The unknown solution function u is determined at the points (x_i, y_j) of this mesh, which means that values u_{ij} := u(x_i, y_j) for i, j = 0, 1, ..., N + 1 are to be found.
For the inner part of the mesh, these values are determined by solving a linear equation system with N² equations which is based on the Poisson equation in the following way. For each mesh point (x_i, y_j), i, j = 1, ..., N, a Taylor expansion is used for the x- or y-direction. The Taylor expansion in x-direction is

  u(x_i + h, y_j) = u(x_i, y_j) + h · u_x(x_i, y_j) + h²/2 · u_xx(x_i, y_j) + h³/6 · u_xxx(x_i, y_j) + O(h⁴) ,
  u(x_i − h, y_j) = u(x_i, y_j) − h · u_x(x_i, y_j) + h²/2 · u_xx(x_i, y_j) − h³/6 · u_xxx(x_i, y_j) + O(h⁴) ,

where u_x denotes the partial derivative in x-direction (i.e., u_x = ∂u/∂x) and u_xx denotes the second partial derivative in x-direction (i.e., u_xx = ∂²u/∂x²). Adding these two Taylor expansions results in

  u(x_i + h, y_j) + u(x_i − h, y_j) = 2u(x_i, y_j) + h² u_xx(x_i, y_j) + O(h⁴) .
Analogously, the Taylor expansion for the y-direction can be used to get

  u(x_i, y_j + h) + u(x_i, y_j − h) = 2u(x_i, y_j) + h² u_yy(x_i, y_j) + O(h⁴) .

From the last two equations, an approximation for the Laplace operator Δu = u_xx + u_yy at the mesh points can be derived

  Δu(x_i, y_j) = −(1/h²) · (4u_{ij} − u_{i+1,j} − u_{i−1,j} − u_{i,j+1} − u_{i,j−1}) ,

where the higher order terms O(h⁴) are neglected. This approximation uses the mesh point (x_i, y_j) itself and its four neighbor points; see Fig. 7.8. This pattern is known as five-point stencil. Using the approximation of Δu and the notation f_{ij} := f(x_i, y_j)
Fig. 7.8 Five-point stencil resulting from the discretization of the Laplace operator with a finite difference scheme. The computation at one mesh point (i, j) uses values at the four neighbor mesh points (i−1, j), (i+1, j), (i, j−1), and (i, j+1)
for the values of the right-hand side, the discretized Poisson equation or five-point formula results:

  (1/h²) · (4u_{ij} − u_{i+1,j} − u_{i−1,j} − u_{i,j+1} − u_{i,j−1}) = f_{ij}   (7.15)

for 1 ≤ i, j ≤ N. For the points on the boundary, the values of u_{ij} result from the boundary condition (7.14) and are given by

  u_{ij} = ϕ(x_i, y_j)   (7.16)

for i = 0, N+1 and j = 0, ..., N+1 or j = 0, N+1 and i = 0, ..., N+1. The inner mesh points which are immediate neighbors of the boundary, i.e., the mesh points with i = 1, i = N, j = 1, or j = N, use the boundary values in their five-point stencil; the four mesh points in the corners use two boundary values and all other points use one boundary value. For all points with i = 1, i = N, j = 1, or j = N, the values of u_{ij} in the formulas (7.15) are replaced by the values (7.16).
For the mesh point (x_1, y_1), for example, the equation

  (1/h²) · (4u_{11} − u_{21} − u_{12}) = f_{11} + (1/h²) · ϕ(0, y_1) + (1/h²) · ϕ(x_1, 0)

results. The five-point formula (7.15) including boundary values represents a linear equation system with N² equations, N² unknown values, and a coefficient matrix A ∈ R^{N²×N²}. In order to write the equation system (7.15) with boundary values (7.16) in matrix form Az = d, the N² unknowns u_{ij}, i, j = 1, ..., N, are arranged in row-oriented order in a one-dimensional vector z of size n = N² which has the form
  z = (u_{11}, u_{21}, ..., u_{N1}, u_{12}, u_{22}, ..., u_{N2}, ..., u_{1N}, u_{2N}, ..., u_{NN}) .

The mapping of values u_{ij} to vector elements z_k is

  z_k := u_{ij} with k = i + (j − 1)N for i, j = 1, ..., N .
Using the vector z, the five-point formula has the form

  (1/h²) · (4z_{i+(j−1)N} − z_{i+1+(j−1)N} − z_{i−1+(j−1)N} − z_{i+jN} − z_{i+(j−2)N}) = d_{i+(j−1)N}

with d_{i+(j−1)N} = f_{ij} and a corresponding mapping of the values f_{ij} to a one-dimensional vector d. Replacing the indices by k = 1, ..., n with k = i + (j − 1)N results in

  (1/h²) · (4z_k − z_{k+1} − z_{k−1} − z_{k+N} − z_{k−N}) = d_k .   (7.17)
Thus, row k of the coefficient matrix contains five non-zero entries, which are a_{kk} = 4 and a_{k,k+1} = a_{k,k−1} = a_{k,k+N} = a_{k,k−N} = −1.

The building of the vector d and the coefficient matrix A = (a_{ij}), i, j = 1, ..., N², can be performed by the following algorithm, see [79]. The loops over i and j, i, j = 1, ..., N, visit the mesh points (i, j) and build one row of the matrix A of size N² × N². When (i, j) is an inner point of the mesh, i.e., i, j ≠ 1, N, the corresponding row of A contains five elements at the positions k, k+1, k−1, k+N, k−N for k = i + (j − 1)N. When (i, j) is at the boundary of the inner part, i.e., i = 1, j = 1, i = N, or j = N, the boundary values for ϕ are used.
/* Algorithm for building the matrix A and the vector d */
Initialize all entries of A with 0;
for (j = 1; j <= N; j++)
  for (i = 1; i <= N; i++) {
    /* Build d_k and row k of A with k = i + (j − 1)N */
    k = i + (j − 1) · N;
    a_{k,k} = 4/h²;
    d_k = f_{ij};
    if (i > 1) a_{k,k−1} = −1/h²; else d_k = d_k + 1/h² · ϕ(0, y_j);
    if (i < N) a_{k,k+1} = −1/h²; else d_k = d_k + 1/h² · ϕ(1, y_j);
    if (j > 1) a_{k,k−N} = −1/h²; else d_k = d_k + 1/h² · ϕ(x_i, 0);
    if (j < N) a_{k,k+N} = −1/h²; else d_k = d_k + 1/h² · ϕ(x_i, 1);
  }
The linear equation system resulting from this algorithm has the block tridiagonal structure

         ⎡  B  −I          0 ⎤
  1/h² · ⎢ −I   B   ⋱        ⎥ · z = d ,   (7.18)
         ⎢      ⋱   ⋱   −I   ⎥
         ⎣ 0       −I    B   ⎦