Proceedings of the Supercomputing '95 Conference, San Diego, CA, December 1995.
Panayotis A. Skordos
Massachusetts Institute of Technology
545 Technology Square, Cambridge, MA 02139
pas@ai.mit.edu
The mechanics of wind instruments such as the flute and the recorder depend on the nonlinear interaction between the hydrodynamic flow of air and the acoustic waves inside the wind instrument. In order to simulate this interaction accurately, we must use the compressible Navier-Stokes equations. We have performed such simulations with the help of a parallel distributed system that we have developed on a cluster of non-dedicated Hewlett-Packard workstations. As far as we know, these are the first direct simulations of wind instruments using the compressible Navier-Stokes equations. Furthermore, physical measurements of the acoustic signal generated by a soprano recorder are in qualitative agreement with the predictions of our simulations (reference [1]). The major difficulties that have discouraged previous attempts at this problem are the very small time steps required to follow the fast-moving acoustic waves and the numerical instabilities of subsonic compressible flow (references [2,3,4]). In this report, we discuss the performance and the cost of our simulations on a cluster of 20 Hewlett-Packard workstations.
To solve the compressible Navier-Stokes equations numerically, we use a recently developed numerical method, the lattice Boltzmann method, in combination with a fourth-order artificial-viscosity filter (references [4,5]) on a uniform numerical grid. The filter is necessary to control the numerical instabilities of subsonic compressible flow. The lattice Boltzmann method is an explicit numerical method for subsonic compressible flow (reference [6]) with slightly better stability properties than explicit finite difference methods.
The use of an explicit numerical method is a good match for the physical problem (flow and acoustics). In particular, the problem requires small time steps in order to follow the acoustic waves, and explicit methods require small time steps for reasons of numerical stability (specific numbers are given in references [4,5]). Thus, there is a match between the problem and the numerical method. Regarding the distributed simulation system, parallelization is achieved by decomposing the problem into subregions which are assigned to different workstations. Because the numerical method is explicit and involves only local interactions, only the boundaries between subregions must be communicated between parallel processors. Thus, we have coarse-grain parallel computing with small communication requirements (reference [5]), as the sketch below illustrates. Overall, there is a good match between the problem, the numerical method, and the distributed system at hand.
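To make the coarse-grain claim concrete, the following back-of-the-envelope sketch (a stand-alone illustration, not part of the simulation code) estimates the communication-to-computation ratio of one subregion per integration step. The 200x200 subregion size, the 9 lattice Boltzmann variables per node, and the 281 floating-point operations per node are taken from the measurements reported below; the assumption that one layer of boundary nodes is exchanged on each of the four sides is ours and is only illustrative.

#include <stdio.h>

/* Illustrative estimate only: for a square subregion of side N, the work
   per integration step grows as N*N, while the data exchanged with the
   neighboring subregions grows only as N.                                 */
int main(void)
{
    const int    N          = 200;     /* fluid nodes per subregion side    */
    const int    vars       = 9;       /* lattice Boltzmann variables Fi    */
    const double flops_node = 281.0;   /* 163M + 118A, measured below       */

    double flops_per_step  = (double)N * N * flops_node;
    double values_per_step = 4.0 * N * vars;   /* one boundary layer, 4 sides */
    double bytes_per_step  = values_per_step * sizeof(double);

    printf("computation per step  : %.0f flops\n", flops_per_step);
    printf("communication per step: %.0f bytes\n", bytes_per_step);
    printf("bytes per flop        : %.4f\n", bytes_per_step / flops_per_step);
    return 0;
}

Under these assumptions, a 200x200 subregion exchanges about 57,600 bytes of boundary data against about 11 million floating-point operations per step, which is consistent with the Ethernet-based efficiency reported below.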
The distributed simulation system is built directly on top of UNIX (HP-UX) and TCP/IP, and is specifically designed for local-interaction, spatially-organized problems (reference [5]). Further, the system includes automatic process migration from busy hosts to free hosts. Typically, about 20 of the 25 workstations in the cluster are employed, so that free workstations are available to migrate to when necessary. Over a 24-hour period there is on average one migration every 40 minutes, and each migration lasts about 30 seconds, while the total runtime of a typical simulation is 2-3 days.
In order to measure the performance of the system, we count the number of floating-point operations (flops) in terms of additions (A) and multiplications (M), and we count divisions as equivalent to multiplications. In two-dimensional simulations each fluid node corresponds to 9 lattice Boltzmann variables Fi (i=0,1,2,...,8) and 3 flow variables V0,V1,V2 (V0 represents the fluid density and V1,V2 represent the fluid velocity). The parallel subregions, which are assigned to different workstations, contain about 200x200 fluid nodes. The computation in each subregion during an integration step is as follows (reference [6]):
The relaxation step is the most expensive calculation in terms of floating-point operations. The code executed at each fluid node (a simplified version) is listed in the appendix; it amounts to 124M and 60A.
The update step computes the flow variables V0,V1,V2 from the Fi as follows (reference [6]):

V0 = Sum(Fi)
V1 = Sum(EXi * Fi) / V0
V2 = Sum(EYi * Fi) / V0

where EXi,EYi are constants for each i=0,1,2,...,8 and each sum is over i. The implementation of the above equations gives 25A and 18M. We also perform 3A to subtract the new values of V0,V1,V2 from the old values (the change of V0,V1,V2 per integration step is used to monitor the calculation). Overall, we get 18M and 28A for the update step.
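As an illustration, a minimal C sketch of the update step at a single fluid node is given below. The D2Q9 values shown for the constants EXi,EYi are an assumption of this sketch (the text above only states that they are constants), and the real code of course loops over all the fluid nodes of a 200x200 subregion.

#include <stdio.h>

/* Assumed D2Q9 lattice directions; only their role as the constants
   EXi,EYi of the equations above matters for this sketch.               */
static const double EX[9] = { 0., 1., 0., -1., 0., 1., -1., -1., 1. };
static const double EY[9] = { 0., 0., 1., 0., -1., 1., 1., -1., -1. };

/* Compute V0,V1,V2 of one node from its nine Fi and return the change
   of V0, which is used to monitor the calculation.                      */
static double update_node(const double Fi[9], double *V0, double *V1, double *V2)
{
    double sum = 0.0, mx = 0.0, my = 0.0, dV0;
    int i;
    for (i = 0; i < 9; i++) {
        sum += Fi[i];                 /* V0 = Sum(Fi)            */
        mx  += EX[i] * Fi[i];         /* Sum(EXi * Fi)           */
        my  += EY[i] * Fi[i];         /* Sum(EYi * Fi)           */
    }
    dV0 = sum - *V0;                  /* change per step         */
    *V0 = sum;
    *V1 = mx / sum;                   /* V1 = Sum(EXi*Fi) / V0   */
    *V2 = my / sum;                   /* V2 = Sum(EYi*Fi) / V0   */
    return dV0;
}

int main(void)
{
    double Fi[9] = { 0.44, 0.11, 0.11, 0.11, 0.11, 0.03, 0.03, 0.03, 0.03 };
    double V0 = 1.0, V1 = 0.0, V2 = 0.0;
    double dV0 = update_node(Fi, &V0, &V1, &V2);
    printf("V0=%g V1=%g V2=%g dV0=%g\n", V0, V1, V2, dV0);
    return 0;
}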
The filtering step (the fourth-order artificial-viscosity filter, see reference [4]) is applied to the 3 variables V0,V1,V2 in the following manner:

Tx = V(i-2) - 4*V(i-1) + 6*V - 4*V(i+1) + V(i+2)
Ty = V(j-2) - 4*V(j-1) + 6*V - 4*V(j+1) + V(j+2)
V^(n+1) = V^n - alpha * (Tx + Ty)

where V stands for any of V0,V1,V2, where i,j are the horizontal-vertical indices of the orthogonal numerical grid, n is the integration step, and alpha is a filtering coefficient. We get 21M and 30A for the filtering step.
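A minimal sketch of the filtering step for one of the variables is shown below. The treatment of the two outermost rows and columns is omitted and the value of alpha is an illustrative choice (neither is specified above); the indexing oo = i + j*NX follows the appendix.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Apply the fourth-order artificial-viscosity filter to the interior of
   an NX x NY array V, writing the result into Vnew.                     */
static void filter_interior(const double *V, double *Vnew,
                            int NX, int NY, double alpha)
{
    int i, j;
    for (j = 2; j < NY - 2; j++)
        for (i = 2; i < NX - 2; i++) {
            int oo = i + j*NX;
            double Tx = V[oo-2]    - 4.*V[oo-1]  + 6.*V[oo] - 4.*V[oo+1]  + V[oo+2];
            double Ty = V[oo-2*NX] - 4.*V[oo-NX] + 6.*V[oo] - 4.*V[oo+NX] + V[oo+2*NX];
            Vnew[oo] = V[oo] - alpha * (Tx + Ty);  /* V^(n+1) = V^n - alpha*(Tx+Ty) */
        }
}

int main(void)
{
    int NX = 8, NY = 8;                          /* tiny grid, for illustration   */
    double *V    = calloc((size_t)NX * NY, sizeof *V);
    double *Vnew = calloc((size_t)NX * NY, sizeof *Vnew);
    V[3 + 3*NX] = 1.0;                           /* a single spike to be smoothed */
    memcpy(Vnew, V, (size_t)NX * NY * sizeof *V);
    filter_interior(V, Vnew, NX, NY, 1.0/16.0);  /* alpha value is illustrative   */
    printf("filtered value at the spike: %g\n", Vnew[3 + 3*NX]);
    free(V); free(Vnew);
    return 0;
}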
Totaling the above, we get 163M and 118A, or 281 floating-point operations, per fluid node per integration step. Each workstation (HP9000, model 715/50 and model 712/60) integrates about 39,000 fluid nodes per second (reference [5]). Therefore, each workstation computes the above algorithm for subsonic flow at about 11 Mflops (281 x 39,000 is approximately 11 million operations per second). This is good performance considering that Hewlett-Packard rates these workstations at 13 Mflops for linear algebra calculations. Moreover, our program does much more than floating-point arithmetic; for example, it performs many memory-copy operations and examines the geometry (a logical test at each fluid node) of an arbitrary fluid flow simulation, the boundary conditions, and so on.
In terms of parallel computing, we consider 20 workstations and 0.8 million fluid nodes, so that 40,000 fluid nodes are assigned to each workstation. Detailed measurements in reference [5] show that 20 workstations achieve a parallel efficiency of 80%, or a speedup of 16 over a single workstation. In particular, each workstation spends about 0.8 of its time on computation and 0.2 on communication. (A newer version of the system achieves 90% efficiency for 20 workstations by overlapping the communication with the computation; however, we quote here the 80% efficiency that was used in the original submission of the paper.) This result is very good considering that a shared-bus Ethernet network is used for communication. Overall, the 20 workstations achieve 16 x 11 Mflops = 176 Mflops.
Finally, another way of examining performance is to compare the 20 Hewlett-Packard workstations (HP9000 models 715/50 and 712/60) with a 64-node CM-5 supercomputer with vector units. An approximate comparison has shown that the same problem (for example, an 800x600 grid and 10,000 integration time steps) is computed in approximately the same time (a couple of hours) by the 20 workstations and by the 64-node CM-5. In this comparison, the CM-5 was programmed in the C* programming language, and the size and the geometry of the problem were fixed and known at compile time. The result of this comparison is not surprising because each workstation is 3-4 times faster than the individual processors of the CM-5. Therefore, the 20 workstations have computing power comparable to the 64-node CM-5 when the ratio of communication to computation is small, as in the present case.
The cost of using non-dedicated workstations is in some sense zero, which implies an infinitely large performance-price ratio. To obtain a finite number for comparison purposes, we estimate the cost assuming that the workstations were dedicated.
We quote the list price of a Hewlett-Packard HP9000 model 712/60 workstation as 2,474 US dollars. This price includes the CPU, 16 MBytes of memory, and the network connection, which are needed, but it does not include items such as a hard disk and a monitor, which are not needed here. (Note that about 6 MBytes of memory are used by the distributed simulation when the subregion per processor is 200x200 fluid nodes.) Using the base price of 2,474 US dollars, the cost of 20 workstations is 49,480 US dollars. Dividing 176 Mflops by this cost, we obtain about 3.6 Gflops per million dollars.
The floating-point performance benefits from the multiply-add instruction of the Hewlett-Packard HP-PA processor. In particular, the relaxation step of the lattice Boltzmann method contains a long sequence of multiplications followed by additions, which a good compiler can convert into multiply-add instructions.
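For example, the inner relaxation loop of the appendix reduces to the arithmetic pattern sketched below, where each assignment is a multiplication plus a product, the shape that a compiler can map onto the processor's multiply-add instruction. Whether the fusion actually happens depends on the compiler and the optimization level; the function name and the test values here are illustrative.

#include <stdio.h>

/* Same arithmetic shape as the 1/tau relaxation loop of the appendix:
   a multiplication and an addition of another product per element.      */
static void relax(double *Fi, const double *Feq, int n,
                  double one_tau, double _tau)
{
    int i;
    for (i = 0; i < n; i++)
        Fi[i] = one_tau * Fi[i] + _tau * Feq[i];
}

int main(void)
{
    double Fi[3]  = { 1.0, 2.0, 3.0 };
    double Feq[3] = { 0.5, 0.5, 0.5 };
    relax(Fi, Feq, 3, 0.9, 0.1);     /* illustrative relaxation weights */
    printf("%g %g %g\n", Fi[0], Fi[1], Fi[2]);
    return 0;
}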
An interesting issue is that it is necessary to pad the arrays of the C program (that is, to lengthen the arrays with dummy entries) in order to avoid "cache mismatch" when the length of the arrays is a near multiple of 4096 bytes. If the arrays are not padded, the performance of the HP9000 workstations degrades significantly, probably because of cache misses.
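A minimal sketch of the kind of padding meant here is given below; the array sizes match the 200x200 subregions, and the pad amount is an arbitrary illustrative value.

#define NX  200
#define NY  200
#define PAD 16      /* illustrative number of dummy entries */

/* Lengthen each array with a few dummy entries whenever its natural size
   works out to a near multiple of 4096 bytes; the dummy entries are never
   referenced, they only shift the memory layout.                          */
double F0[NX*NY + PAD];
double F1[NX*NY + PAD];
double F2[NX*NY + PAD];
/* ...and similarly for the remaining Fi and for V0, V1, V2 */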
Another interesting issue is the avoidance of underflow errors at the start of the simulation, which we achieve by initializing the flow variables to arbitrarily small but non-zero values. Otherwise, the passage of the first acoustic wave through the simulation region causes a sequence of underflow exceptions in each processor successively, until the acoustic wave has reached every processor of the distributed simulation. The underflow exceptions in each processor delay not only the affected processor but all the processors.
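A minimal sketch of such an initialization is given below; the particular value 1.0e-30 is an illustrative choice, since only the property of being small and non-zero matters.

#include <stddef.h>

#define NX 200
#define NY 200

/* Initialize the flow variables to an arbitrarily small but non-zero value
   instead of exactly zero, so that the arrival of the first acoustic wave
   does not trigger a burst of underflow exceptions.                       */
static void init_flow(double *V0, double *V1, double *V2, size_t n)
{
    size_t k;
    for (k = 0; k < n; k++) {
        V0[k] = 1.0e-30;
        V1[k] = 1.0e-30;
        V2[k] = 1.0e-30;
    }
}

int main(void)
{
    static double V0[NX*NY], V1[NX*NY], V2[NX*NY];
    init_flow(V0, V1, V2, (size_t)NX * NY);
    return 0;
}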
A performance-price ratio of 3.6 Gflops per million dollars has been achieved in simulations of wind instruments using a cluster of non-dedicated Hewlett-Packard workstations (the performance-price ratio is calculated assuming that the workstations were dedicated). As far as we know, these are the first direct simulations of wind instruments using the compressible Navier Stokes equations.
The code for the relaxation step looks as follows:

{
  oo = J + K*NX;                       /* array index of current fluid node */
  vv0 = V0[oo];  vv1 = V1[oo];  vv2 = V2[oo];
  rho = (vv0 + massrho);
  nv1 = (vv1 * rho);
  nv2 = (vv2 * rho);
  TTnvv = ((nv1*vv1) + (nv2*vv2));     /*---constants during Feq-loop---*/
  TTnv1vv1 = (nv1*vv1);
  TTnv2vv2 = (nv2*vv2);
  TT2nv1vv2 = (2.*nv1*vv2);

  Feq[0] = (z0*(vv0) + z21*TTnvv);

  for (index = 1; index <= 8; index++)               /* calculate Feq */
    Feq[index] = (C0*(vv0)
                  + C1*(E1*nv1 + E2*nv2)
                  + C20*(E1*E1*TTnv1vv1 + E2*E2*TTnv2vv2 + E1*E2*TT2nv1vv2)
                  + C21*TTnvv);

  for (index = 0; index <= 8; index++)               /* 1/tau relaxation */
    {
      Fi = ss->Fi[index];
      Fi[oo] = ((one_tau * Fi[oo]) + (_tau * Feq[index]));
    }
}