A domain decomposition strategy is followed to assign many independently traceable particles to each processor. To trace a particle, one needs the sum of the forces acting on it through its contacts with other particles. Each processor is assigned a volume of the physical domain and calculates the new coordinates of the particles contained in that volume. To calculate contact forces on particles touching or crossing subdomain boundaries, processors must communicate with their adjacent processors. Such particles are called overlapping particles, and their treatment constitutes the bulk of our parallel algorithm.
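The decision of which particles must be communicated can be sketched as a simple geometric test. The following is a minimal illustration for a one-dimensional decomposition along x; the names (Particle, x_lo, x_hi) are ours, not from the original implementation, and real DEM codes decompose in two or three dimensions:

```python
# Hypothetical sketch: classify a particle against one processor's
# subdomain [x_lo, x_hi] in a 1-D domain decomposition.
from dataclasses import dataclass

@dataclass
class Particle:
    x: float       # centre coordinate along the decomposition axis
    radius: float

def classify(p: Particle, x_lo: float, x_hi: float) -> str:
    """Return which neighbour processor (if any) must learn about p."""
    if p.x - p.radius < x_lo:
        return "left-overlapping"    # touches or crosses the left boundary
    if p.x + p.radius > x_hi:
        return "right-overlapping"   # touches or crosses the right boundary
    return "non-overlapping"
```

Particles classified as left- or right-overlapping are the ones whose state must be exchanged with the corresponding neighbour before contact forces can be summed.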
The first step of the particle simulation is to generate particles with the desired properties (density, stiffness, friction, etc.) in the defined node area. The particles are treated as three different types: non-overlapping balls, right-side overlapping balls, and left-side overlapping balls. After overlapping balls are detected, information about them (coordinates, velocities, accelerations, etc.) is sent to the nodes involved. Every shared ball carries two ID numbers, one per node, so that both nodes can recognize it when information about that ball is exchanged. With complete information about both non-overlapping and overlapping balls, the nodes calculate new coordinates, velocities, and all other pertinent quantities before reboxing the particles for the next time step. Particles that have moved into the left or right node areas are marked as lost, and their memory allocation is reused for newly arrived particles. Almost all messages are sent and received asynchronously in order to overlap computation with communication. Most messages carry force information on the overlapping particles, and more than forty types of contact patterns must be differentiated. In fact, contact detection and force calculation constitute the largest portion of the simulation; contact detection is computationally intensive even for balls residing on the same node. One way to reduce this cost is to group particles into adjacent boxes (just as the domain is divided among nodes) and check for contacts only among particles in the same or neighboring boxes.
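The boxing idea above is the standard cell-list technique: if the box side is at least twice the largest radius, two balls can only touch when they sit in the same or adjacent boxes, so the all-pairs check is avoided. The sketch below shows this in 2-D; all names are illustrative and the original code's data layout is not assumed:

```python
# Sketch of boxed contact detection (cell lists) in 2-D.
# Assumes box_size >= 2 * max particle radius, so contacts can occur
# only between particles in the same or in adjacent boxes.
from collections import defaultdict
from itertools import product
import math

def find_contacts(positions, radii, box_size):
    """Return pairs (i, j), i < j, whose disks touch or overlap."""
    boxes = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        boxes[(int(x // box_size), int(y // box_size))].append(i)

    contacts = set()
    for (bx, by), members in boxes.items():
        # candidates live in this box or one of its 8 neighbours
        candidates = []
        for dx, dy in product((-1, 0, 1), repeat=2):
            candidates.extend(boxes.get((bx + dx, by + dy), ()))
        for i in members:
            for j in candidates:
                if j <= i:          # count each pair once
                    continue
                xi, yi = positions[i]
                xj, yj = positions[j]
                if math.hypot(xi - xj, yi - yj) <= radii[i] + radii[j]:
                    contacts.add((i, j))
    return contacts
```

With N particles spread roughly uniformly, each ball is tested only against the few particles in nine boxes rather than against all N - 1 others, which is the saving the text describes.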
Table 1: Response times for parallel DEM simulations.
Because the parallel algorithm deviates significantly from the scalar DEM algorithm owing to the complexities near node boundaries, the parallel version has been tested and verified to give the same results as the scalar one. Preliminary performance measurements have been taken for different problem parameters. The communication overhead proves to be significant for small problems, i.e., those with fewer than 1000 particles. Table 1 lists the response times for 300, 600, 1200, and 2400 particles in total. These results indicate that the potential of parallelism is best exploited when the total number of particles exceeds 1000. It is clear that the communication overhead must be minimized to achieve better results. Since communication is done asynchronously, we can assume there is no performance degradation due to synchronization; load balancing, however, may be examined to improve the results.
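Response times such as those in Table 1 are commonly summarized as speedup and parallel efficiency. The helper below shows the standard definitions; the sample timings in the usage line are hypothetical placeholders, not the measured values from Table 1:

```python
# Standard speedup/efficiency definitions for interpreting timing tables.
def speedup_and_efficiency(t_serial: float, t_parallel: float, n_procs: int):
    """Speedup S = T_serial / T_parallel; efficiency E = S / p."""
    s = t_serial / t_parallel
    return s, s / n_procs

# Hypothetical example: 10 s on one processor vs 3 s on 4 processors.
s, e = speedup_and_efficiency(10.0, 3.0, 4)
```

An efficiency well below 1 for the small particle counts would quantify the communication overhead discussed above, while efficiency approaching 1 beyond 1000 particles would confirm where the parallelism pays off.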