GROMACS on the Altix ia64 HPC (parallel, using MPI)
It is quite easy to convert a pre-existing GROMACS system for use on multiple CPUs, such as on a grid-based supercomputer. In the simplest case, the only modifications necessary are to the options passed to grompp and mdrun for the “full md” stage:
grompp -np N -shuffle -sort [original options]
mpirun /full/path/to/gromacs/bin/mdrun -np N [original options]
The -np N option sets both grompp and mdrun to use N processors. The -shuffle and -sort options to grompp optimise the numbering scheme of the atoms over the nodes.
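When running the same system at several processor counts, it can help to generate the two command lines rather than edit them by hand. The following is a minimal sketch in Python, not part of GROMACS: the function name parallel_commands and the file names in the usage example are hypothetical, and only the -np, -shuffle and -sort options described above are added to whatever options were already in use.

```python
import shlex


def parallel_commands(n, grompp_opts, mdrun_opts,
                      mdrun_path="/full/path/to/gromacs/bin/mdrun"):
    """Return (grompp command, mpirun command) for n processors.

    grompp_opts and mdrun_opts are the original option lists,
    passed through unchanged.
    """
    grompp = ["grompp", "-np", str(n), "-shuffle", "-sort", *grompp_opts]
    mdrun = ["mpirun", mdrun_path, "-np", str(n), *mdrun_opts]
    # shlex.join quotes any argument that contains shell-special characters
    return shlex.join(grompp), shlex.join(mdrun)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only
    for line in parallel_commands(4, ["-f", "full.mdp"], ["-s", "topol.tpr"]):
        print(line)
```

The mdrun path is left as the placeholder used above and must be replaced with the real installation path.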
The system is divided spatially. Since the box that contains the atoms is almost always a rectangular prism aligned along the x, y and z axes, the box is divided by planes normal to the x axis.
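The idea of dividing the box by planes normal to the x axis can be illustrated with a short Python sketch. This is not GROMACS code: it assumes equal-width slabs and a periodic box, and the names slab_index and decompose are hypothetical. The actual assignment made by grompp may balance the atoms differently.

```python
def slab_index(x, box_x, n_nodes):
    """Index of the slab (node) containing x-coordinate x.

    The box spans [0, box_x) along x and is cut into n_nodes
    equal-width slabs by planes normal to the x axis (an assumption).
    """
    x = x % box_x                    # wrap into the periodic box
    idx = int(x / box_x * n_nodes)
    return min(idx, n_nodes - 1)     # guard against rounding at the upper edge


def decompose(xs, box_x, n_nodes):
    """Group atom indices by slab, given each atom's x-coordinate."""
    nodes = [[] for _ in range(n_nodes)]
    for atom, x in enumerate(xs):
        nodes[slab_index(x, box_x, n_nodes)].append(atom)
    return nodes


if __name__ == "__main__":
    # Five atoms in a box of length 6.0 along x, split over 3 nodes
    print(decompose([0.1, 1.9, 3.5, 5.9, 6.1], 6.0, 3))
```

Under this scheme, atoms that belong to the same molecule but lie on opposite sides of a dividing plane are assigned to different nodes.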
There is one major caveat, however. With a system containing one large molecule, an error will be generated if grompp tries to distribute the atoms of the large molecule over several nodes. The problem is caused by the SHAKE algorithm, a constraint algorithm that “nudges” the atoms back to their prescribed bond lengths (and, where constrained, angles) after each step of the simulation. Another method, called LINCS, is available; it is less accurate than SHAKE, but it works over multiple nodes when SHAKE doesn't. To set which algorithm to use in a system, add the following line to the .mdp file for the full md simulation:
constraint_algorithm = [ lincs | shake ]
The GROMACS manual (the section on run parameters, section 7.3.14 in the GROMACS 3.2 manual) explains the usage of both of these algorithms in full.
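If several .mdp files need to be switched between the two algorithms, the edit can be scripted. The following is a minimal Python sketch, assuming the usual "name = value" layout of .mdp files with ";" starting a comment; the helper name set_mdp_option is hypothetical and is not part of GROMACS.

```python
import re


def set_mdp_option(mdp_text, key, value):
    """Return mdp_text with `key = value` set.

    An existing, uncommented setting of key is replaced in place;
    otherwise the setting is appended as a new line.
    """
    # Match a whole line of the form "key = anything"; commented lines
    # (starting with ';') do not match because key must come first.
    pattern = re.compile(r"^[ \t]*" + re.escape(key) + r"[ \t]*=.*$",
                         re.MULTILINE)
    new_line = f"{key} = {value}"
    if pattern.search(mdp_text):
        return pattern.sub(lambda m: new_line, mdp_text)
    sep = "" if mdp_text.endswith("\n") or not mdp_text else "\n"
    return mdp_text + sep + new_line + "\n"


if __name__ == "__main__":
    original = "integrator = md\nconstraint_algorithm = shake\n"
    print(set_mdp_option(original, "constraint_algorithm", "lincs"))
```

The function works on the file contents as a string, so reading and writing the .mdp file is left to the caller.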
Another problem, which occurred during the testing of GROMACS on the Altix systems, arises from the use of fast Fourier transforms to calculate long-range electrostatic forces with the Particle Mesh Ewald (PME) summation method. Using this method crashes GROMACS (mdrun, to be specific) with a segmentation fault. While alternative methods are available (and can be selected in the .mdp file, in the same way as the bond constraint algorithm), the PME method is the most computationally efficient. Long-range electrostatic interactions are frequently the most computationally intensive part of a simulation, accounting for approximately 70% of the processing time in the DPPC simulations.
The problem is probably specific to the system configuration: Altix ia64 architecture, running Linux with GROMACS 3.3b compiled in single-precision mode, and linked to the Intel Math Kernel Library (MKL) for the FFT algorithm. The GROMACS developers recommend that installations of GROMACS utilise the FFTW library (http://www.fftw.org). It is unclear whether the problem lies with the fact that GROMACS was compiled to use single-precision floats, or the MKL installation, or some specific quirk of the ia64 architecture, or a combination of the three.
GROMACS includes a number of tutorial and example systems, which can be used to run performance benchmarks. The example system used in these performance benchmarks is the DPPC system (dipalmitoylphosphatidylcholine, an amphiphilic molecule that self-assembles into a membrane in a water solution). The example system is available from the GROMACS website (http://www.gromacs.org). The DPPC system was run with no modifications except the addition of the parallelisation options. The system was run using 1 to 8 nodes on gust (one of the QPSF Altix HPC systems), and the results are shown below. The images were rendered in Mathematica. The data was obtained from the summary displayed by mdrun at the end of the simulation.

Figure 1: computation speed of parallel GROMACS for the DPPC demonstration system
The steep drop in performance for 8 processors occurred because only 7 nodes were available at the time the benchmarks were run. GROMACS achieves almost linear speedup over this range of nodes. The most computationally efficient number of nodes for this particular system was 3:

Figure 2: relative computational performance per node for the DPPC demonstration system
The system remained at linear or better than linear speedup for up to 6 nodes.
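For readers who want to repeat the analysis with their own runs, speedup and per-node performance can be computed as in the Python sketch below. It assumes the usual definitions, speedup S(N) = T(1) / T(N) and per-node efficiency S(N) / N, where T(N) is the wall-clock time on N nodes. The function name is hypothetical, and the timings in the usage example are placeholders, not the measurements behind the figures above.

```python
def speedup_and_efficiency(times):
    """Compute speedup and per-node efficiency from wall-clock times.

    times maps node count -> wall-clock time (any consistent unit)
    and must include an entry for 1 node.
    Returns a dict mapping node count -> (speedup, efficiency).
    """
    t1 = times[1]
    result = {}
    for n in sorted(times):
        speedup = t1 / times[n]           # S(N) = T(1) / T(N)
        result[n] = (speedup, speedup / n)  # efficiency > 1 is superlinear
    return result


if __name__ == "__main__":
    # Placeholder timings in seconds, NOT measured benchmark data
    placeholder = {1: 1200.0, 2: 600.0, 4: 320.0}
    for n, (s, e) in speedup_and_efficiency(placeholder).items():
        print(f"{n} nodes: speedup {s:.2f}, efficiency {e:.2f}")
```

An efficiency of 1 corresponds to linear speedup, and a value above 1 to better than linear speedup.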
