There are a number of different approaches to large-scale, short-range
classical MD calculations [Allen and Tildesley 1987; Deng *et al.* 1995b;
Lomdahl *et al.* 1993; Plimpton 1993]. The basic
approach we have adopted is the link-cell method with spatial
decomposition. The three dimensional domain being simulated is
partitioned into subdomains, with each subdomain assigned to a
different processor. If *r_c* is the range of the force,
i.e. if the force can be neglected at distances beyond *r_c*, then
each subdomain is divided into a regular grid of cells whose size is
*r_c*, or slightly larger. To compute short-range forces on each
particle, only the current and adjacent cells of the grid are searched
for particles within the truncation distance *r_c*. This reduces the
number of operations on non-interacting particles, leading to a
smaller coefficient **c** in the complexity **cN** (where **N** is the
number of particles). The force is assumed to act for a finite
time-step, the resulting particle motions are calculated, and any
necessary reassignment to cells is carried out. The studies reported
here adopt a pair-wise 6-12 Lennard-Jones potential.
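The link-cell search with a truncated 6-12 Lennard-Jones potential can be sketched as follows (a minimal illustration, not the paper's code: the function names, the periodic cubic box, and the minimum-image convention are our assumptions):

```python
def lj_force_coef(r2, eps=1.0, sigma=1.0):
    """Return F(r)/r for the 6-12 Lennard-Jones pair force at squared
    separation r2 (a negative value means attraction)."""
    sr6 = (sigma * sigma / r2) ** 3
    return 24.0 * eps * sr6 * (2.0 * sr6 - 1.0) / r2

def min_image(d, box):
    """Minimum-image convention for a periodic cubic box of side `box`."""
    return d - box * round(d / box)

def forces(positions, box, rc):
    """Link-cell force computation: bin particles into a grid of cells of
    side >= rc, then search only the current and adjacent cells for pairs
    within the truncation distance rc."""
    n = int(box / rc)                  # cells per dimension; assumes n >= 3
    size = box / n
    cells = {}
    for i, p in enumerate(positions):
        cells.setdefault(tuple(int(c / size) % n for c in p), []).append(i)
    rc2 = rc * rc
    f = [[0.0, 0.0, 0.0] for _ in positions]
    for (cx, cy, cz), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    key = ((cx + dx) % n, (cy + dy) % n, (cz + dz) % n)
                    for i in members:
                        for j in cells.get(key, []):
                            if j <= i:
                                continue   # count each pair exactly once
                            d = [min_image(positions[i][k] - positions[j][k], box)
                                 for k in range(3)]
                            r2 = sum(x * x for x in d)
                            if r2 < rc2:
                                c = lj_force_coef(r2)
                                for k in range(3):
                                    f[i][k] += c * d[k]
                                    f[j][k] -= c * d[k]
    return f
```

Each pair is visited once and Newton's third law is applied, so only the 27 cells around a particle's own cell are ever inspected, giving the small coefficient **c** discussed above.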

The complication introduced by the parallel decomposition arises because at the boundary of each subdomain, there are adjacent cells which lie in different subdomains and therefore belong to different processors. Communicating information about such adjacent cells, or relocating particles into such cells, introduces a significant overhead, and also makes the code more complex. In the traditional implementation the nodes are globally synchronized at each time step, all exchanging information with their neighbors in the same sequence and at the same time, each processor sending data to and receiving data from at least eight neighbors. Each processor maintains ``ghost cells'' imaging the boundary cells of its neighbors.
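To make the neighbor-exchange pattern concrete, here is a small sketch (the periodic Cartesian processor grid and row-major rank mapping are our assumptions, not details from the paper) that enumerates the ranks a subdomain must communicate with; a 2D decomposition gives the eight neighbors mentioned above, a 3D one gives 26:

```python
from itertools import product

def neighbor_ranks(coords, grid):
    """Ranks of every processor whose subdomain is adjacent (including
    diagonally) to the subdomain at `coords` in a periodic Cartesian
    processor grid: 8 neighbors in 2D, 26 in 3D."""
    ranks = set()
    for off in product((-1, 0, 1), repeat=len(grid)):
        if all(o == 0 for o in off):
            continue                     # skip our own subdomain
        nbr = tuple((c + o) % g for c, o, g in zip(coords, off, grid))
        r = 0
        for c, g in zip(nbr, grid):      # row-major rank within the grid
            r = r * g + c
        ranks.add(r)
    return ranks
```

Every rank in this set must hold ghost cells imaging our boundary cells, and vice versa.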

The new ingredient we introduce into the computation is to send information destined for neighboring processors as soon as it has been computed. We consider the computational work for each time-step to be held on an ordered list (in a link-cell algorithm the cells are typically traversed in a specified order, with the particles in each cell held in an array or linked-list structure). For each time-step a traversal of the list is made, performing computations for each particle within each cell; the result is a new list for the subsequent time-step. We take advantage of the fact that the list can be traversed linearly, so the new list can be built while the current list is being processed. Messages containing data for the new list can therefore be sent before the current time-step completes; there is no need to wait for synchronization before communicating data for the subsequent time-step. Upon finishing the list, a message indicating that all data for the current time-step have been processed must be sent to the neighboring processors (those with adjacent physical boundaries) so that they can proceed to the subsequent step.

The algorithm is represented by the pseudocode in Figure 1.

**Figure 1:** *Pseudocode for MD algorithm.*
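The figure itself is not reproduced in this text. From the description above, the per-step structure might look roughly as follows (our reconstruction of the idea, not the authors' actual pseudocode):

```
for each time-step:
    new_list := empty
    for each cell in current_list:                 # linear traversal
        for each particle in cell:
            compute forces from current and adjacent cells
            integrate motion; determine particle's new cell
            if new cell belongs to this processor:
                append particle to new_list
            else:
                send particle to the owning neighbor   # immediately, mid-step
    send "step complete" to all adjacent neighbors
    wait until "step complete" received from all adjacent neighbors
    current_list := new_list plus particles received from neighbors
```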

There are three points to note about this algorithm.

- The communication is distributed throughout the time-step rather than
concentrated at the end. On some systems there may be an overhead penalty
because the messages are smaller (in most cases a whole cell rather than a
whole boundary, and occasionally a single particle).
However, this scheme is much less sensitive to the latency of the network,
since both nodes continue to compute, except perhaps at the end of a step.
- Because the new information is placed at the head of the new
list, it doesn't matter when it arrives. Note that there are ``send''
operations, but no ``receive'' operations in the pseudocode. By using
IPX [Marr *et al.* 1994], a package developed at BNL, it is possible for
code on one node to asynchronously manipulate data on another.
- The synchronization is explicit in the algorithm and is local, not
global. This is important, because as systems scale, true global
synchronization of a very large number of processors will become costly
and fragile.
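The last point can be illustrated with a minimal bookkeeping sketch (the class and message shapes are our invention, not the paper's implementation): each node tracks ``step complete'' announcements only from its adjacent neighbors, so the decision to advance is purely local and involves no barrier over all processors.

```python
from collections import defaultdict

class LocalSync:
    """Local (neighbor-only) synchronization: a node may enter step t+1 as
    soon as every *adjacent* node has announced completion of step t."""

    def __init__(self, neighbors):
        self.neighbors = set(neighbors)
        self.done = defaultdict(set)   # step -> ranks that reported completion
        self.step = 0

    def on_done(self, rank, step):
        """Record a 'step complete' message from another node."""
        if rank in self.neighbors:     # completions of distant nodes are irrelevant
            self.done[step].add(rank)

    def may_advance(self):
        """True once all adjacent neighbors have finished the current step."""
        return self.done[self.step] >= self.neighbors   # superset test

    def advance(self):
        assert self.may_advance()
        del self.done[self.step]       # discard bookkeeping for the finished step
        self.step += 1
```

A node surrounded by fast neighbors can run ahead of distant, slower parts of the machine, which is precisely what a global barrier would forbid.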
