Adding water and proteins to your cluster (virtually)
GROMACS is a powerful and versatile package designed to help scientists simulate the behavior of large molecules (like proteins, lipids, and even polymers). Naturally, calculations of such magnitude require the computing horsepower provided by HPC clusters.
Superficially, molecular simulation is really a strikingly simple idea. Given a collection of atoms like a box of water or a protein, we would like to calculate how they interact and move under realistic laboratory conditions. The catch is that it can require truly insane amounts of computational power. Fortunately, relatively few processes (apart from bond breaking/formation) require any quantum mechanical treatment, so we can model atoms as simple classical particles and apply Newton's equations of motion. The expensive part is to determine the force on each atom, since it depends on the positions of all other atoms in the system -- even for a small water box this can involve millions of floating-point operations. Once the forces are known it is straightforward to calculate how the atoms move and where they will be at the next step. Sounds fairly easy, right?
Ah, but the problem is that a step only simulates femtoseconds of real time, so we need to perform five million steps like this to reach 10 nanoseconds -- and that's still the very low end of chemically interesting timescales! This is where the clusters enter -- by parallelizing the algorithm to use multiple processors we can reduce the runtime from months to weeks or even days. Figure One shows a rough sketch of how the parallelization works. This article will step you through installation and basic use of the Gromacs molecular simulation and analysis package, and provide some tips on how you can improve parallel performance. Enough smalltalk, let's get started instead!
tar xzf fftw-2.1.5.tar.gz
Change to the new directory and configure it as
./configure --prefix=/home/joe/software \ --enable-float \ --enable-type-prefix \ --enable-mpi
The backslashes just mean everything should be on a single line. Replace /home/joe with the location of your own home directory, or skip the prefix option if you have root access and want to install FFTW under /usr/local. When the configuration is ready, issue make and make install to compile and install everything.
export CPPFLAGS=-I/home/joe/software/include export LDFLAGS=-L/home/joe/software/lib
or, if you use the csh/tcsh shells:
setenv CPPFLAGS -I/home/joe/software/include setenv LDFLAGS -L/home/joe/software/lib
If you already had FFTW installed in /usr or /usr/local you don't need to do anything. Continue and configure Gromacs as
./configure --prefix=/home/joe/software \ --enable-mpi \ --without-x
Many clusters only have X libraries installed on the frontend, so if we don't disable X support the binaries might not work on the compute nodes. You can also try to add the option --enable-all-static to link all libraries (e.g. FFTW) statically, but this will only work if static system libraries are available. Build and finish the installation with the standard make followed by make install. By default the Gromacs binaries and libraries will be placed in a subdirectory named after the current architecture, so if you have a mixed cluster you can install all different versions in the same place. You might want to add this binary directory to your path.
Puh! That was the boring part, now we can start to play with our new software toys!
qsub -l nodes=4,walltime=2:00:00 -I
The normal usage is to provide a script file with commands to execute instead of the -I option, but for testing it is nice not having to wait in the queue again each time you make a small mistake. The first thing to do on the nodes is to start the MPI environment. For LAM/MPI it is done as
lamboot -v $PBS_NODEFILE
PBS_NODEFILE is an environment variable provided by PBS and Torque, it is the full path to a file that contains the hostnames of the processors allocated to your job. If you just have a very small cluster without any queue system you replace this with your own hostfile where you list the hostname of the nodes to use once for each CPU on them.
You now have shell completion both for Gromacs executables, options, and data files, as well as manual pages (try âman pdb2gmxâ). It works for all normal Unix shells, so you probably want to add it to your login file. You can get also instant help with the -h option to any Gromacs command.
pdb2gmx -f 2LZM.pdb
Select the OPLS-AA/L force field when asked, and press return. You can set the output file names with other flags, but for now we'll stick to the default names. The topology will be called topol.top, coordinates conf.gro, and the file posre.itp contains something called position restraint data (we will use it later). A protein in vacuum isn't particularly realistic though, so we want to use periodic boundary conditions and add solvent water around it. Gromacs supports all types of triclinic boxes, but let's keep it simple. We center the protein in a rectangular box with at least 0.5 nm from the atoms to the box edge, and write it to a new file:
editconf -f conf.gro -d 0.5 -o newbox.gro
Now we can add water to our modified coordinates with the program genbox. We use the SPC water model here -- the coordinates spc216.gro will be read from the Gromacs library directory, and if you provide the topology file it will automatically be edited to add the new water. We will also show another trick: all Gromacs programs detect a number of file formats automatically, so it is just as easy to write the coordinates in PDB format instead of .gro (PDB files cannot store velocities, but we don't have any yet):
genbox -cp newbox.gro -cs spc216.gro -p topol.top -o solvated.pdb
Genbox will output a lot of information and echo that it added about 4400 water molecules in your box. Are you itching to see what the coordinates look like? Copy solvated.pdb to your workstation and display it with pymol solvated.pdb.
integrator = cg nsteps = 250 nstxout = 0 nstvout = 0 nstlog = 0 nstenergy = 0 rlist = 1.0 coulombtype = cut-off rcoulomb = 1.0 vdwtype = cut-off rvdw = 1.0 constraints = none define = -DFLEXIBLE
In short, this means we will use a conjugate-gradients minimizer, use standard interactions with cut-offs at 1.0 nm, but don't output any data since we just want the final structure. Since we just want to get rid of bad contacts and aren't interested in any true energy minimum we just run 250 steps, it is not even worth doing in parallel, although it would work. Prepare a run input file with the command
grompp -f em.mdp -p topol.top -c solvated.pdb -o em.tpr
If you had forgotten e.g. what you called your coordinate file you could just hit the tab key after an option to get a list of the files in the current directory that are compatible with it, provided you sourced GMXRC as we recommended! The file em.tpr contains everything we need for the simulation, so you could also prepare everything on your own workstation and just copy run input files to the cluster or supercomputer (portable across different endian architectures and precisions). It will warn you about the system having non-zero charge since we did not add any counter ions, but we'll live with that for the tutorial. We perform the minimization with the mdrun command:
mdrun -v -deffnm em
The -v flag enables verbose output, and the -deffnm flag is a nice shortcut to use em.* for all filename options. The run input file will be read from em.tpr, final coordinates written to em.gro, the energy file to em.edr, etc. The minimization will take about a minute on a modern CPU, and finish with a warning that it did not converge in 250 steps -- that's quite OK.
integrator = md nsteps = 5000 dt = 0.002 nstxout = 0 nstvout = 0 nstlog = 0 nstenergy = 0 rlist = 1.0 coulombtype = cut-off rcoulomb = 1.0 vdwtype = cut-off rvdw = 1.0 tcoupl = berendsen tc-grps = system tau-t = 0.1 ref-t = 298 gen-vel = yes gen-temp = 298 constraints = h-bonds define = -DPOSRES
We are now using the molecular dynamics integrator, and taking 5000 steps of length 2 fs each. This step length requires bonds involving hydrogens to be kept constant. Atom velocities are generated from a Maxwell distribution at 298K, and during the simulation the temperature will be coupled to 298K using the Berendsen algorithm. Finally we enable position restraints. This time we will prepare a parallel run input file for 4 CPUs:
grompp -f pr.mdp-p topol.top -c em.gro -o pr.tpr -np 4
Time to put that cluster to use! Request four processors for interactive use, and start MPI on them as we described above, or use the instructions for your own cluster. Since we need to start mdrun on all machines you must provide the full path to the command unless it is in your path, sourcing GMXRC only affects the node you are currently logged in to. Almost all MPI implementations use mpirun to start parallel jobs, so the command to use is normally:
mpirun -np 4 /path/to/mdrun -v -deffnm pr
We still have verbose output, so you will get a message every 10 steps telling you when the simulation is expected to finish -- it should only take a couple of minutes.
integrator = md nsteps = 50000 dt = 0.002 nstxout = 10000 nstvout = 10000 nstxtcout = 100 nstlog = 1000 nstenergy = 100 energygrps = protein sol rlist = 1.0 coulombtype = cut-off rcoulomb = 1.0 vdwtype = cut-off rvdw = 1.0 tcoupl = berendsen tc-grps = system tau-t = 0.1 ref-t = 298 constraints = h-bonds
The total simulation time is 100 ps (50,000 steps), which is still very short but enough for a tutorial. Every 10,000 steps we save full precision coordinates and velocities to the trajectory run.trr so we can restart the simulation if something goes wrong, and every 100 steps we will write compressed coordinates to run.xtc (10 times smaller than general trajectories). Energies and similar data like pressure is written every 100 steps to run.edr, and energy terms are written separately for all combinations between protein and solvent groups. Generate the run input file:
grompp -f run.mdp -p topol.top -c pr.gro -o run.tpr -np 4
This simulation will take a while, so let's create a script so it can be submitted as a batch job. The PBS/Torque options can be embedded as a comment in the script, and since most clusters have SMP nodes we ask for 2 nodes and 2 processors per node:
#PBS -l nodes=2:ppn=2 lamboot -v $PBS_NODEFILE cd /your/project/directory mpirun -np 4 /path/to/mdrun -deffnm run wipe
Make the scriptfile executable, and submit the job as qsub scriptfile. Go and have a cup of coffee while you swear over other users competing for resources, and when you are back the output files should be ready in your project directory. All Gromacs binary files are portable, so if you want to you can copy the results and do all post-run analysis on your workstation if Gromacs is installed there.
g_energy -f run.edr
You will be presented with a table of different energy terms, pressures, temperature, etc. Select a couple of terms you are interested in and finish with 0. g_energy will write the average, fluctuation and systematic drift to the screen, and generate the plotfile energy.xvg. Remember that we talked about the Grace program before? If you installed it you can simply type
xmgrace -nxy energy.xvg
and you will get a finished graph with captions and legends. Figure Two shows the total potential energy, all we did in Grace was to make the graph a little wider. You can even use the command line options to xmgrace and produce postscript or PDF plots automatically!
g_rms -s em.tpr -f run.xtc
The reference structure is taken from the tpr file, so we use the energy minimization run input file that started from the PDB structure. You will first be asked which group should be used for the fitting (select C-alpha), then how many groups you want to calculate RMSD for, and which these groups should be. The result will be written to rmsd.xvg, and just as for the energy file plot you can directly fire it up in xmgrace. For Figure Three we selected the three groups C-alpha, the whole protein (except for hydrogens), and only sidechains (except for hydrogens).
g_hbond -s run.tpr -f run.xtc
Select âproteinâ for one group and "SOLA" for the other. Rename the output file hbnum.xvg, and then run the program again, but this time with "protein" for both groups. If you open one file in Grace and then import the other one as an additional set you will get something like Figure Four. That wasn't too hard, was it? Don't forget that most Gromacs programs have a huge list of options to alter their behavior, and if the predefined groups don't match your needs you can easily create a custom index file with the program make_ndx!
g_filter -s run.tpr -f run.xtc -ol lowpass.xtc -nx 10 -all
This time we won't tell you what all the options do, find out yourself instead by executing g_filter -h! Pymol cannot read Gromacs xtc trajectories (yet), and we also want to remove all the water in the trajectory to concentrate on the proteins. This is easy to fix by using another Gromacs program to convert the trajectory to PDB format and only select one group:
trjconv -s run.tpr -f lowpass.xtc -o lzmtraj.pdb
Select âproteinâ when asked for an output group, and visualize the resulting trajectory with the command pymol lzmtraj.pdb. You can push the VCR-style buttons in Pymol to play back the trajectory, and if you really want to impress the audience at a talk you should choose to raytrace/render all frames from the movie menu. If you are using a Macintosh, Pymol can save the movie directly in quicktime format. Otherwise you will need to export a series of images and use an external program to create a movie. Figure Five shows the last frame of a movie we made of the tutorial simulation!
For SMP Xeons, it is important to turn OFF hyperthreading in the bios. This might seem like incredibly stupid advice, since four processors should be better than two, right? Unfortunately, the Linux kernel (including version 2.6) still has problems to do proper scheduling with hyperthreading: If you run two jobs, they sometimes end up on the two virtual CPUs sharing inside one physical CPU, while the other physical CPU is idle. You can imagine what this does to your performance, so just say no to hyperthreading for now. Even if the Linux scheduler did a perfect job you would not see much difference for CPU-bound programs like Gromacs - the integer and floating-point units are not duplicated in current Xeon implementations.
It is a good idea to have some sort of network between the nodes in your cluster. If you have a deep wallet or can apply for time at existing supercomputer facilities, the best solution is to use a dedicated high-performance interconnect such as Myrinet, Infiniband or Scali that provide both low latency and high bandwidth. All these vendors provide their own MPI libraries and documentation, so in that case you should stop reading this and instead ask the vendor how to get the best possible performance. If you are spending your own funds the situation is a little more complex. Myrinet or the alternatives might still be worth it in some cases, but due to the fairly high price we often accept standard Gigabit Ethernet so we can buy more nodes instead.
Unfortunately the special libraries do not support the gigabit cards in our nodes, so we have to stick with LAM/MPI, which we have found to be a tiny bit faster than MPICH. You may want to explore the next generation MPI's such as MPICH2 and Open MPI as well. However, even in the case of LAM/MPI it is well worth recompiling the library with better options. By default TCP/IP communication is used between all processes, but you most certainly want to use shared memory instead when two processes are on the same node. You should also increase the âsmall messageâ buffer size significantly. Download the source from the LAM site and configure it as
./configure -prefix=/home/joe/software \ -with-rpi=usysv \ -with-tcp-short=524288 \ -with-shm-short=524288
Here we assumed you do not have root privileges, but want to install the software under your home directory /home/joe. Issue make and then make install, and your new tweaked MPI library is ready to use once you add /home/joe/software/bin to your path.
qsub -l nodes=4:rack1:ppn=2 script
This command only works if the property ârack1â is assigned to all nodes in rack1. If it's not done on your cluster, bug the administrator until it is fixed -- this is a big performance gain compared to running on single processors per node spread over two or more racks. Figure Six shows some results for the standard Gromacs scaling benchmark -- a 130,000-atom DPPC membrane system with long electrostatics cut-offs.
When you start your own simulations, remember to be gentle with your proteins and don't neglect warning messages -- there is a reason why they are there. Unfortunately simulations are actually very much like laboratory experiments: it is very easy to destroy months of work with a moment of recklessness. There is a wealth of information available at the Gromacs website, including a paper manual and more important: several very active user mailing list. We're looking forward to see your posts there!
This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.
Erik Lindahl is a Gromacs co-author and assistant professor at the Stockholm Bioinformatics Institute, Sweden.