next up previous
Next: 3 LOAD BALANCING Up: MOLECULAR DYNAMICS FOR Previous: 1 INTRODUCTION

2 ASYNCHRONOUS ALGORITHM

There are a number of different approaches to large-scale, short-range classical MD calculations [Allen and Tildesley 1987; Deng et al. 1995b; Lomdahl et al. 1993; Plimpton 1993]. The basic approach we have adopted is the link-cell method with spatial decomposition. The three dimensional domain being simulated is partitioned into subdomains, with each subdomain assigned to a different processor. If is the range of the force, i.e. if the force can be neglected at distances beyond , then each subdomain is divided into a regular grid of cells whose size is , or slightly larger. To compute short-range forces on each particle, only the current and adjacent cells of the grid are searched for particles within the truncation distance . This reduces the number of operations on non-interacting particles, leading to a smaller coefficient c in the complexity cN (where N is the number of particles.) The force is assumed to act for a finite time-step, the resulting particle motions are calculated, and any necessary reassignment to cells is carried out. The studies reported here adopt a pair-wise 6-12 Lennard-Jones potential.

The complication introduced by the parallel decomposition arises because at the boundary of each subdomain, there are adjacent cells which lie in different subdomains and therefore belong to different processors. Communicating information about such adjacent cells, or relocating particles into such cells, introduces a significant overhead, and also makes the code more complex. In the traditional implementation the nodes are globally synchronized at each time step, all exchanging information with their neighbors in the same sequence and at the same time, each processor sending data to and receiving data from at least eight neighbors. Each processor maintains ``ghost cells'' imaging the boundary cells of its neighbors.

The new ingredient we introduce into the computation is to send the information around the required neighboring cells as soon as it has been computed. We consider the computational work for each time-step to be held on an ordered list (in a link-cell algorithm the cells are typically traversed in a specified order, with the particles in each cell held in an array or linked-list structure). For each time-step a traversal of the list is made, performing computations for each particle within each cell. The result is a new list for the subsequent time-step. We take advantage of the fact that the list can be traversed linearly; and therefore, the new list can be built while the current list is being processed. Therefore, messages containing data for the new list can be sent prior to completion of the current time-step; there is no need to wait for synchronization before communicating data for the subsequent time-step. Upon finishing the list, a message indicating that all data has been processed for the current time-step must be sent to neighboring processors (those which have adjacent physical boundaries) so that they can proceed to the subsequent step.

The algorithm is represented by the pseudocode in Figure 1.

 
Figure 1:   Pseudocode for MD algorithm.

There are three points to note about this algorithm.

  1. The communication is distributed throughout the time-step, not all concentrated at the end. There may be some overhead penalty for some systems because the messages are smaller (in most cases a whole cell rather than a whole boundary, but occasionally a single particle). However, it is much less sensitive to the latency of the network since both nodes continue to compute, except perhaps at the end of a step.

  2. Because the new information is placed at the head of the new list, it doesn't matter when it arrives. Note that there are ``send'' operations, but no ``receive'' operations in the pseudocode. By using IPX [Marr et al. 1994], a package developed at BNL, it is possible for code on one node to asynchronously manipulate data on another.

  3. The synchronization is explicit in the algorithm and is local, not global. This is important, because as systems scale, true global synchronization of a very large number of processors will become costly and fragile.



next up previous
Next: 3 LOAD BALANCING Up: MOLECULAR DYNAMICS FOR Previous: 1 INTRODUCTION



Osman Yasar
Thu Jul 27 16:50:35 EDT 1995