A critical factor in the performance of this type of algorithm is the choice of how subdomains are defined. If the particles are uniformly distributed, then a uniform decomposition is appropriate. In practice, however, the particles are not uniformly distributed, so such a decomposition would force processors handling subdomains with fewer particles to wait at each time-step for the others to catch up, wasting resources. To complicate matters, the distribution changes dynamically, and in complex situations where not all particles are identical, the force calculations can differ, creating still more potential for load imbalance.
We have developed a dynamic load balancer, which requires very little overhead, as a main component of our MD code. A unique feature of our algorithm is the preservation of geometric locality to avoid increased communication costs. As such, it is general enough to apply to any grid-based algorithm which involves strictly local communications; furthermore, it is easily implemented and can be transparent to the applications programmer. We describe here the algorithm as developed for two-dimensional decompositions, but its extension to higher dimensions is straightforward.
We start by decomposing the spatial domain into many more subdomains than processors, so that a number of subdomains will be held on each processor. Thus the domain can be seen as partitioned into two grids: a coarse grid corresponding to the processor decomposition, and a fine grid corresponding to the subdomain decomposition. The vertices of the coarse grid coincide with a subset of the vertices of the finer grid. To allow balancing, the vertices of the coarse grid are allowed to move along the edges of the fine grid; the regions defined by the coarse grid determine the subdomain-to-node correspondence. A subdomain may move to another processor to obtain better balance; and by construction, the nearest-neighbor relationship among processors is preserved. Hence, better balance can be achieved without excessive fragmentation of the domain.
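As a concrete illustration of the two-level grid, each coarse-grid vertex can be stored as an index into the fine grid. The sketch below is our own (the names and grid sizes are illustrative assumptions, not from the text); it sets up the initial uniform layout for a 2x2 processor grid over an 8x8 subdomain grid:

```python
# Hypothetical sketch of the two-level decomposition: an 8 x 8 fine grid
# of subdomains overlaid by a 2 x 2 coarse grid of processors. Each
# coarse-grid vertex is stored as the index (i, j) of the fine-grid
# vertex it currently coincides with; load balancing moves these
# indices along fine-grid edges.
ncells = 8   # fine grid: ncells x ncells subdomains
nprocs = 2   # coarse grid: nprocs x nprocs processors

vertex = {(p, q): (p * ncells // nprocs, q * ncells // nprocs)
          for p in range(nprocs + 1) for q in range(nprocs + 1)}

# The single interior vertex starts at the midpoint of the fine grid;
# moving it one fine-grid edge east, for example, enlarges the two
# western processor regions by a column of subdomains.
```

Because vertices only hop between fine-grid vertices, every reassignment moves whole subdomains, and each processor region stays bounded by its original neighbors.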
The movement of a coarse grid vertex is determined by examining the loads of the four processors which have the vertex in common. Let $\ell_i$ be the load on the $i$-th subdomain, and $L_N$ the total load on processor $N$:
\[
L_N = \sum_{i \in N} \ell_i .
\]
The load imbalance, $\Delta$, can be defined as:
\[
\Delta = \frac{L_{\max} - \bar{L}}{\bar{L}} ,
\]
where $L_{\max}$ is the maximum, and $\bar{L}$ is the average, of all the $L_N$.
Let $P_N = \bar{L} - L_N$. We can think of $P_N$ as a ``pressure'' of unused resources tending to expand the boundaries of the coarse grid cell $N$. For those processors whose resources are saturated, $P_N$ will be negative, tending to shrink the boundary.
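A small worked example may help fix the definitions. The load values below are invented for illustration, and the formulas used ($L_N = \sum_i \ell_i$, $\Delta = (L_{\max} - \bar{L})/\bar{L}$, $P_N = \bar{L} - L_N$) are the natural reading of the definitions above:

```python
# Invented subdomain loads: processor -> loads of the subdomains it holds.
loads = {
    0: [5.0, 4.0, 6.0],   # heavily loaded processor
    1: [1.0, 2.0, 1.0],
    2: [3.0, 3.0, 2.0],
    3: [2.0, 1.0, 2.0],
}

L = {n: sum(sub) for n, sub in loads.items()}  # total load L_N per processor
L_avg = sum(L.values()) / len(L)               # average load (L-bar)
L_max = max(L.values())                        # maximum load
delta = (L_max - L_avg) / L_avg                # load imbalance Delta
P = {n: L_avg - L[n] for n in L}               # "pressure" P_N per processor
```

Here processor 0 is saturated ($L_0 = 15$ against an average of $8$), so its pressure is negative and its boundaries tend to shrink, while the underloaded processors exert positive, expanding pressure.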
Our algorithm uses this physical model, treating the boundaries as rigid lines connecting the vertices, with the pressure generating a ``force'' normal to each line. For each coarse grid vertex we compute the total force at the vertex by summing the net force across each of the boundary lines emanating from the vertex. If the force exceeds a suitable threshold, the coarse grid vertex is moved to one of the neighboring fine grid vertices, the choice depending on the direction of the net force. This is carried out independently at each vertex, and the procedure is iterated until either it converges (all the $L_N$ become approximately equal) or a limiting number of iterations has been reached. Typically, a better-balanced state is obtained within very few iterations. Once the new mapping is determined, the subdomains are redistributed to correspond to the new grid arrangement.
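In two dimensions, the force at an interior vertex can be assembled from the pressures of the four surrounding processors. The sketch below is our own reading of this step (the sign conventions, threshold value, and function names are illustrative assumptions): each of the four boundary lines meeting at the vertex feels the pressure difference across it, and the vertex hops one fine-grid step along the dominant force component.

```python
def vertex_force(p_nw, p_ne, p_sw, p_se):
    """Net (fx, fy) force on the vertex shared by four coarse grid cells.

    The north and south boundary lines are vertical, so the pressure
    difference across them pushes the vertex horizontally (west cell
    pushes east, +x); the east and west lines are horizontal, so their
    pressure difference pushes vertically (south cell pushes north, +y).
    """
    fx = (p_nw - p_ne) + (p_sw - p_se)
    fy = (p_sw - p_nw) + (p_se - p_ne)
    return fx, fy

def move_vertex(ij, force, threshold=1.0):
    """Hop one fine-grid vertex along the dominant force component,
    if that component exceeds the threshold; otherwise stay put."""
    i, j = ij
    fx, fy = force
    if max(abs(fx), abs(fy)) < threshold:
        return ij            # force too small: vertex does not move
    if abs(fx) >= abs(fy):
        return (i + (1 if fx > 0 else -1), j)
    return (i, j + (1 if fy > 0 else -1))
```

For example, a strongly underloaded south-west processor (large positive $P_{SW}$) produces a force toward the north-east, enlarging its region at the expense of its saturated neighbors.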
For a general problem, each subdomain runs an independent copy of the parallel code. If the subdomains are made too small, then there can be significant overhead costs. If the subdomains are too large, then good balance cannot be achieved. For the special case of MD calculations, there is a natural fine grid structure already established by the cell partitioning required by the algorithm. This can be used directly for the load balancing as described above, without the need to introduce another hierarchy of subdomains.
As before, the link-cell algorithm is applied to each subdomain; the parallel application code does not have to be changed to incorporate the load balance algorithm. The load balance procedure can be called either when a processor has idle time, or perhaps every few hundred steps of the MD algorithm.