The efficiency of our algorithm is shown by the simulation of a benchmark problem with N particles and P processors. In a 3-dimensional domain, we simulate a Lennard-Jones system in equilibrium with a reduced density and temperature of and , respectively. The truncation distance was . The timing runs were performed in single-precision on Intel Paragon computers located at SUNY Stony Brook and Oak Ridge National Laboratory. For our timing runs, the MP capability of the Paragon was not enabled; thus, the timing values shown reflect the speed when using only one CPU per processor node for computations.
Table 1: Average CPU seconds per simulation step for system sizes N and numbers of processors P.
The values shown in Table 1 reflect all work involved for one time step of the MD calculation, including the CPU time for all distance calculations, force evaluations, and communication messages. The time spent performing communication was on average 10% of the total step time. We observe that our algorithm scales perfectly for large values of N and P for fixed values of . This demonstrates the effectiveness of our asynchronous algorithm --- there are no degradation effects when using larger processor configurations. There are two figures of merit which may be used to compare the performance of different calculations. The ``particle-step time'' (the clock time per time step per particle simulated) is associated with a particular configuration of processors. It determines how long a given simulation would take with that configuration. The other useful measure is the ``time-step per particle per processor'', which allows comparison of performance between runs using different numbers of processors.
For the 1024 processor Paragon runs, our asynchronous algorithm gives a particle-step time of approximately .4 microseconds. While we have made no optimization efforts which are specific to our machine, this is comparable to the reported times of other researches for the same benchmark problem with similar values of N and P. For example, [Lomdahl et al. 1993] reports .33 micro-seconds for N = 180 M and P=1024 using double precision arithmetic on a CM-5, having encoded the distance and force calculations into assembly language specific to the CM-5; and [Plimpton 1993] obtained .32 micro-seconds (for only the force and distance calculations) for N=100 M and P=1904 on an Intel Paragon. The time step per particle per processor is in the neighborhood of .4 milliseconds for all the runs.
In Figure 2, we show the results of applying the load balance algorithm to our MD code. To create an imbalanced case for comparison, we start with 1 million Argon atoms equally distributed throughout a three-dimensional periodic domain of dimensions 1066x1066x266 angstroms. Starting with a uniform (reduced) density of , we place two attracting points of unequal strength in the domain and allow the particles to interact for a simulation time of 1.5 picoseconds, which required 4300 simulation steps. The ratio of the two attracting strengths is 2:1. As the simulation proceeds, the particles converge towards the attracting points, creating a load imbalance. The domain was partitioned using a two-dimensional grid of processors. We distribute 1024 subdomains evenly across the 64 processors, so that each processor initially has an array of subdomains. The load balance algorithm is performed once every 100 steps, and the subdomains are re-distributed only when %. For comparison, we also performed the same simulation and partitioning with no load balancing.
The non-balanced case shows a steady increase in the amount of time for each MD step; the rate of increase falls off when the region around the attracting point becomes saturated. The non-monotonic behavior of the load balanced case occurs because the subdomains are only moved when the imbalance is above the threshold. There is minimal overhead due to the use of multiple subdomains per node. In our implementation, the time required to concurrently relocate subdomains was less than 2 CPU seconds, and we observe that this time is recovered in just a few time steps by the increase in load balance efficiency.