The efficiency of our algorithm is shown by the
simulation of a benchmark problem with **N** particles and **P**
processors. In a 3-dimensional domain, we simulate a Lennard-Jones
system in equilibrium with a reduced density and temperature of
and , respectively. The truncation distance
was . The timing runs were performed in
single-precision on Intel Paragon computers located at
SUNY Stony Brook and Oak Ridge National Laboratory. For
our timing runs, the MP capability of the Paragon
was not enabled; thus, the timing values shown reflect the
speed when using only one CPU per processor node for
computations.

**Table 1:** Average CPU seconds per simulation step for system
sizes **N** and numbers of processors **P**.

The values shown in Table 1 reflect all work involved for one time step of
the MD calculation, including the CPU time for all distance
calculations, force evaluations, and communication messages. The time
spent performing communication was on average 10% of the total step
time. We observe that our algorithm scales perfectly for large values
of **N** and **P** for fixed values of . This demonstrates
the effectiveness of our asynchronous algorithm --- there are no
degradation effects when using larger processor configurations. There
are two figures of merit which may be used to compare the performance
of different calculations. The ``particle-step time'' (the clock time
per time step per particle simulated) is associated with a particular
configuration of processors. It determines how long a given simulation would
take with that configuration. The other useful measure is the ``time-step
per particle per processor'', which allows comparison of performance
between runs using different numbers of processors.

For the 1024 processor Paragon runs, our asynchronous algorithm gives
a particle-step time of approximately .4 microseconds. While we have
made no optimization efforts which are specific to our machine, this is
comparable to the reported times of other researches for the same
benchmark problem with similar values of **N** and **P**. For example,
[Lomdahl * et al.* 1993] reports .33 micro-seconds
for **N = 180 M** and **P=1024**
using double precision arithmetic on a CM-5, having encoded the
distance and force calculations into assembly language specific to the
CM-5; and [Plimpton 1993] obtained .32 micro-seconds (for only the
force and distance calculations) for **N=100 M** and **P=1904** on an
Intel Paragon. The time step per particle per processor is in the
neighborhood of .4 milliseconds for all the runs.

In Figure 2, we show the results of applying the load balance algorithm to our MD code. To create an imbalanced case for comparison, we start with 1 million Argon atoms equally distributed throughout a three-dimensional periodic domain of dimensions 1066x1066x266 angstroms. Starting with a uniform (reduced) density of , we place two attracting points of unequal strength in the domain and allow the particles to interact for a simulation time of 1.5 picoseconds, which required 4300 simulation steps. The ratio of the two attracting strengths is 2:1. As the simulation proceeds, the particles converge towards the attracting points, creating a load imbalance. The domain was partitioned using a two-dimensional grid of processors. We distribute 1024 subdomains evenly across the 64 processors, so that each processor initially has an array of subdomains. The load balance algorithm is performed once every 100 steps, and the subdomains are re-distributed only when %. For comparison, we also performed the same simulation and partitioning with no load balancing.

The non-balanced case shows a steady increase in the amount of time for each MD step; the rate of increase falls off when the region around the attracting point becomes saturated. The non-monotonic behavior of the load balanced case occurs because the subdomains are only moved when the imbalance is above the threshold. There is minimal overhead due to the use of multiple subdomains per node. In our implementation, the time required to concurrently relocate subdomains was less than 2 CPU seconds, and we observe that this time is recovered in just a few time steps by the increase in load balance efficiency.

Thu Jul 27 16:50:35 EDT 1995