Proceedings of Supercomputing 95 Conference, San Diego CA, December 1995.

Economical High Performance Parallel Simulations of Wind Instruments

Panayotis A. Skordos
Massachusetts Institute of Technology
545 Technology Square, Cambridge, MA 02139
pas@ai.mit.edu

Abstract:

A performance-price ratio of 3.6 Gflops per million dollars is demonstrated in simulations of wind instruments using a cluster of 20 non-dedicated Hewlett-Packard workstations and a shared-bus Ethernet network (the performance-price ratio is calculated assuming that the workstations were dedicated). The present simulations of wind instruments are based on the compressible Navier Stokes equations, and have only recently become possible through the use of parallel computing and through developments in numerical methods. The distributed simulation system has been built directly on top of UNIX and TCP/IP, and includes automatic process migration from busy workstations to free workstations.

Introduction

The mechanics of wind instruments such as the flute and the recorder depend on the nonlinear interaction between the hydrodynamic flow of air and the acoustic waves inside the wind instruments. In order to simulate this interaction accurately, we must use the compressible Navier Stokes equations. We have performed such simulations with the help of a parallel distributed system that we have developed on a cluster of non-dedicated Hewlett-Packard workstations. As far as we know, these are the first direct simulations of wind instruments using the compressible Navier Stokes equations. Furthermore, physical measurements of the acoustic signal generated by a soprano recorder are in qualitative agreement with the predictions of our simulations (reference [1]). The major difficulties that have discouraged previous attempts at this problem include the requirement of very small time steps in order to follow the fast-moving acoustic waves, and the numerical instabilities of subsonic compressible flow (references [2,3,4]). In this report, we discuss the performance and the cost of our simulations using a cluster of 20 Hewlett-Packard workstations.

Simulation system

To solve numerically the compressible Navier Stokes equations, we use a recently-developed numerical method, called the lattice Boltzmann method, in combination with a fourth-order artificial-viscosity filter (references [4,5]) on a uniform numerical grid. The filter is necessary to control the numerical instabilities of subsonic compressible flow. The lattice Boltzmann method is an explicit numerical method for subsonic compressible flow (reference [6]) which has slightly better stability properties than explicit finite difference methods.

The use of an explicit numerical method is a good match for the physical problem (flow and acoustics). In particular, the problem requires small time steps in order to follow the acoustic waves, and explicit methods require small time steps for numerical stability reasons (specific numbers are given in references [4,5]). Thus, we have a match between the problem and the numerical method. Regarding the distributed simulation system, parallelization is achieved by decomposing the problem into subregions which are assigned to different workstations. The use of explicit numerical methods (local interactions) means that only the boundaries between subregions must be communicated between parallel processors. Thus, we have coarse-grain parallel computing with small communication requirements (reference [5]). Overall, there is a good match between the problem, the numerical method, and the distributed system at hand.

The distributed simulation system is built directly on top of UNIX (HP-UX) and TCP/IP, and is specifically designed for local-interaction and spatially-organized problems (reference [5]). Further, the system includes automatic process migration from busy hosts to free hosts. Typically, about 20 workstations are employed out of a total 25 workstations in the cluster so that there are free workstations to migrate to when necessary. There is 1 migration every 40 minutes on the average during 24 hours, and each migration lasts about 30 s, while the total runtime of a typical simulation is 2-3 days.

Performance

In order to measure the performance of the system, we count the number of floating-point operations (flops) in terms of additions (A) and multiplications (M), and we also count divisions as being equivalent to multiplications. In two-dimensional simulations each fluid node corresponds to 9 lattice Boltzmann variables Fi (i=0,1,2,...,8) and 3 flow variables V0,V1,V2 (the fluid density is represented by V0 and and the fluid velocity is represented by V1,V2). The parallel subregions, which are assigned to different workstations, contain about 200x200 fluid nodes. The computation in each subregion during an integration step is as follows (reference [6]):

Relaxation step of Fi
Advection step of Fi (copy Fi to neighboring fluid nodes)
Communication of Fi at the boundaries of neighboring subregions
Update step of V0,V1,V2 from Fi
Filtering step of V0,V1,V2

The communication enforces a loose synchronization between the parallel processors because each processor must block until the new boundary data is received from its neighbors during each integration step. For the purpose of counting floating-point operations, neither the communication nor the advection contribute any numbers because they involve integer operations only (copying bytes between different memory locations).

The relaxation step is the most expensive calculation in terms of floating-point operations. The code which is executed at each fluid node (a simplified version) is listed in the appendix. We get 124M and 60A.

The update step computes the flow variables V0,V1,V2 from the Fi as follows (reference [6]),
V0 = Sum(Fi)
V1 = (Sum( EXi * Fi)) / V_0
V2 = (Sum( EYi * Fi)) / V_0
where EXi,EYi are constants for each i=0,1,2,...,8 and the sum is over i. The implementation of the above equations gives 25A and 18M. We also perform 3A to subtract the new values of V0,V1,V2 from the old values (the change of V0,V1,V2 per integration step is used to monitor the calculation). Overall, we get 18M and 28A for the update step.

The filtering step (fourth-order artificial-viscosity filter, see reference [4]) is applied to the 3 variables V0,V1,V2 in the following manner,
define Tx = V(i-2) - 4*V(i-1) + 6*V - 4*V(i+1) + V(i+2)
define Ty = V(j-2) - 4*V(j-1) + 6*V - 4*V(j+1) + V(j+2)
then V^n+1 = V^n - alpha * (Tx + Ty)
where V stands for any V0,V1,V2, where i,j are the horizontal-vertical indices of the orthogonal numerical grid, and alpha is a filtering coefficient. We get 21M and 30A operations for the filtering step.

Totaling the above, we get 163M and 118A per fluid node per integration step. Each workstation (HP9000, model 715/50 and model 712/60) integrates about 39,000 fluid nodes per second (reference [5]). Therefore, each workstation computes the above algorithm for subsonic flow at 11 Mflops. This is good performance considering that the Hewlett-Packard company rates the workstations at 13 Mflops for linear algebra calculations. Our program does a lot more than floating-point operations; for example, the program does many memory-copy operations and examines the geometry (logical test at each fluid node) of an arbitrary fluid flow simulation, the boundary conditions, etc.

In terms of parallel computing, we consider 20 workstations and 0.8 million fluid nodes so that 40,000 fluid nodes are assigned to each workstation. Detailed measurements in reference [5] show that 20 workstations achieve a parallel efficiency of 80% or a speedup of 16 versus the performance of a single workstation. In particular, each workstation spends about 0.8 of the time for computation, and 0.2 of the time for communication (a new version of the system actually achieves 90% efficiency for 20 workstations by overlapping the communication with the computation; however, we quote here the 80% efficiency as it was used in the original submission of the paper). This result is very good considering that a shared-bus Ethernet network is used for communication. Overall, the 20 workstations achieve
16 * 11 Mflops = 176 Mflops
Finally, another way of examining performance is to compare the 20 Hewlett-Packard workstations (HP9000 models 715/50 and 712/60) with a 64-node CM-5 supercomputer with vector units. An approximate comparison has shown that the same problem (for example, a 800x600 grid and 10,000 integration time steps) is computed at approximately the same time (a couple of hours) by the 20 workstations and by the 64-node CM-5. In this comparison, the CM-5 was programmed using the C* programming language, and the size and the geometry of the problem were fixed and known at compile time. The result of this comparison is not surprising because each workstation is 3-4 times faster than the individual processors of the CM-5. Therefore, the 20 workstations have comparable computing power with the 64-node CM-5 when the ratio of communication to computation is small as in the present case.

Cost

The cost of using non-dedicated workstations is in some sense zero, which implies an infinitely large performance-price ratio. To obtain a finite number for comparison purposes, we estimate the cost assuming that the workstations were dedicated.

We quote the list price of a Hewlett-Packard workstation HP9000 model 712/60 as 2474 US dollars. This price includes CPU, 16MBytes memory, and network connection which are needed, but it does not include other things such as hard disk and monitor which are not needed here. (note that about 6MBytes are used by the distributed simulation when the subregion per processor is 200x200 fluid nodes). Using the base price of 2474 US dollars, the cost of 20 workstations is 49,480. Therefore, we obtain 3.6 Gflops per million dollars.

Programming notes

The floating-point performance can benefit from the multiply-add instruction of the Hewlett-Packard HP-PA processor. In particular, the relaxation step of the lattice Boltzmann method contains a long sequence of multiplications followed by additions which a good compiler can convert into multiply-add instructions.

An interesting issue is that it is necessary to pad the arrays of the C-program (must lengthen the arrays with dumb entries) to avoid ``cache mismatch'' when the length of the arrays is a near multiple of 4096 bytes. If the arrays are not padded, the performance of the HP9000 workstations degrades significantly probably because of cache misses.

Another interesting issue is to avoid underflow errors at the start of the simulation by initializing the flow variables to be arbitrarily small but non-zero. Otherwise, the passage of the first acoustic wave through the simulation region causes a sequence of underflow exceptions in each processor successively until the acoustic wave has reached every processor of the distributed simulation. The underflow exceptions in each processor delay not only the affected processor but all the processors.

Conclusion

A performance-price ratio of 3.6 Gflops per million dollars has been achieved in simulations of wind instruments using a cluster of non-dedicated Hewlett-Packard workstations (the performance-price ratio is calculated assuming that the workstations were dedicated). As far as we know, these are the first direct simulations of wind instruments using the compressible Navier Stokes equations.

Appendix

The code for the relaxation step looks as follows,

{
  oo=J+K*NX;   /* array index of current fluid node */
  vv0=V0[oo]; vv1=V1[oo]; vv2=V2[oo];
  rho=(vv0+massrho);
  nv1=(vv1*rho); 
  nv2=(vv2*rho); 
      
  TTnvv=((nv1*vv1)+(nv2*vv2));   /*---constants during Feq-loop---*/
  TTnv1vv1=(nv1*vv1);
  TTnv2vv2=(nv2*vv2);
  TT2nv1vv2=(2.*nv1*vv2);

  Feq[0]=(z0*(vv0) + z21*TTnvv);
	    
  for(index=1;index<=8;index++)        /* calculate Feq */
    Feq[index]=(C0*(vv0) + 
                C1*(E1*nv1 + E2*nv2) + 
                C20*(E1*E1*TTnv1vv1 + 
                     E2*E2*TTnv2vv2 + 
                     E1*E2*TT2nv1vv2) + 
                C21*TTnvv); 

  for(index=0;index<=8;index++)        /* 1/tau relaxation */
    { Fi=ss->Fi[index];
      Fi[oo]=((one_tau * Fi[oo]) + (_tau * Feq[index])); }
}

References

1: P. Skordos and G. Sussman, ``Comparison between subsonic flow simulation and physical measurements of flue pipes,'' in Proceedings of ISMA 95, International Symposium on Musical Acoustics, Le Normont, France, July, 1995.
2: A. Hirschberg, Wind Instruments. Eindhoven Institute of Technology, Report R-1290-D, 1994.
3: J. Hardin, ``Regarding numerical considerations for computational aeroacoustics,'' in Computational Aeroacoustics, pp. 216-228, February 1993.
4: P. Skordos, Modeling flue pipes: subsonic flow, lattice Boltzmann, and parallel distributed computers. Department of Electrical Engineering and Computer Science, MIT, Ph.D. Dissertation, February 1995.
5: P. Skordos, ``Parallel simulation of subsonic fluid dynamics on a cluster of workstations,'' in Proceedings of High Performance Distributed Computing 95, 4th IEEE Int'l Symposium, Pentagon City, Virginia, August, 1995.
6: P. Skordos, ``Initial and boundary conditions for the lattice Boltzmann method,'' Physical Review E, vol. 48, no. 6, pp. 4823-4842, December 1993.

CFD-simulation of a soprano recorder using 20 workstations. Each black rectangle corresponds to one workstation. The brown areas are solid walls and are not simulated. The green and the red lines correspond to positive and negative isocontours of vorticity. The recorder is located at the bottom of the picture and is 20cm long. Air is entering the flue of the recorder (narrow channel) with a speed of 1200cm/sec, impinges the sharp edge of the recorder, and exits from the top of the picture. At the present blowing speed, the recorder generates tones near 400Hz and 1100Hz. The snapshot is taken 25 milliseconds after startup.

DATE: 10/22/95 TIME: 17:33:06