Adding water and proteins to your cluster (virtually)
GROMACS is a powerful and versatile package designed to help scientists simulate the behavior of large molecules (like proteins, lipids, and even polymers). Naturally, calculations of such magnitude require the computing horsepower provided by HPC clusters.
Superficially, molecular simulation is really a strikingly simple idea. Given a collection of atoms like a box of water or a protein, we would like to calculate how they interact and move under realistic laboratory conditions. The catch is that it can require truly insane amounts of computational power. Fortunately, relatively few processes (apart from bond breaking/formation) require any quantum mechanical treatment, so we can model atoms as simple classical particles and apply Newton's equations of motion. The expensive part is to determine the force on each atom, since it depends on the positions of all other atoms in the system -- even for a small water box this can involve millions of floating-point operations. Once the forces are known it is straightforward to calculate how the atoms move and where they will be at the next step. Sounds fairly easy, right?
Ah, but the problem is that a step only simulates femtoseconds of real time, so we need to perform five million steps like this to reach 10 nanoseconds -- and that's still the very low end of chemically interesting timescales! This is where the clusters enter -- by parallelizing the algorithm to use multiple processors we can reduce the runtime from months to weeks or even days. Figure One shows a rough sketch of how the parallelization works. This article will step you through installation and basic use of the Gromacs molecular simulation and analysis package, and provide some tips on how you can improve parallel performance. Enough smalltalk, let's get started instead!
Installation & SetupWe'll assume that you have an account on a cluster, and that your home directory is mounted on all nodes. You will also need an MPI library to do parallel communication; the command which mpicc should echo the location of the MPI-enabled compiler if it is in your path. If you can't find it, read the performance tuning section at the end of this article for instructions on how to compile your own optimized MPI library -- you will probably end up doing that pretty soon anyway. Most users don't have root access, so instead of using the RPM packages from the Gromacs Site, we'll show how to compile the software yourself and install it in your home directory.
Compiling FFTWSome options in Gromacs (notably PME electrostatics) only work if we install the free FFTW library to provide Fourier transforms. Some Linux distributions ship FFTW by default, but you might need to recompile if you change the MPI library as described elsewhere in this article. Download the source code fftw-2.1.5.tar.gz from the FFTW site or our website. Note that FFTW version 3.x won't work since it doesn't support MPI parallel transforms yet. Unpack the tarball with
tar xzf fftw-2.1.5.tar.gz
Change to the new directory and configure it as
./configure --prefix=/home/joe/software \ --enable-float \ --enable-type-prefix \ --enable-mpi
The backslashes just mean everything should be on a single line. Replace /home/joe with the location of your own home directory, or skip the prefix option if you have root access and want to install FFTW under /usr/local. When the configuration is ready, issue make and make install to compile and install everything.
Compiling GromacsJust like FFTW, Gromacs uses standard GNU autoconf scripts, so the installation is straightforward, even when compiling for a parallel architecture. Download the latest source code from the Gromacs website, and unpack it just like we did for FFTW above. If you installed FFTW in your own directory we need to set some environment variables to tell the configure script where to find the headers and libraries. If your shell is bash or csh (echo $SHELL to find out) this is done as
export CPPFLAGS=-I/home/joe/software/include export LDFLAGS=-L/home/joe/software/lib
or, if you use the csh/tcsh shells:
setenv CPPFLAGS -I/home/joe/software/include setenv LDFLAGS -L/home/joe/software/lib
If you already had FFTW installed in /usr or /usr/local you don't need to do anything. Continue and configure Gromacs as
./configure --prefix=/home/joe/software \ --enable-mpi \ --without-x
Many clusters only have X libraries installed on the frontend, so if we don't disable X support the binaries might not work on the compute nodes. You can also try to add the option --enable-all-static to link all libraries (e.g. FFTW) statically, but this will only work if static system libraries are available. Build and finish the installation with the standard make followed by make install. By default the Gromacs binaries and libraries will be placed in a subdirectory named after the current architecture, so if you have a mixed cluster you can install all different versions in the same place. You might want to add this binary directory to your path.
Other ProgramsIn this tutorial we will use two very nice visualization programs that interface well with Gromacs. We use them a lot in our daily simulation work, so it is a good idea to install them locally on your workstation (not the cluster, since we need OpenGL support):
- Grace, a 2D graphing program for the X Window System and Motif.
- Pymol, a powerful OpenGL-based molecular viewer by Warren DeLano.
Puh! That was the boring part, now we can start to play with our new software toys!
Starting Parallel JobsThe exact way to get a parallel environment depends on whether your cluster has any batch queue system, and which MPI library is installed. The most common free batch system is PBS or Torque (descendant of PBS), where you can request e.g. 4 interactive-use nodes for two hours with the command
qsub -l nodes=4,walltime=2:00:00 -I
The normal usage is to provide a script file with commands to execute instead of the -I option, but for testing it is nice not having to wait in the queue again each time you make a small mistake. The first thing to do on the nodes is to start the MPI environment. For LAM/MPI it is done as
lamboot -v $PBS_NODEFILE
PBS_NODEFILE is an environment variable provided by PBS and Torque, it is the full path to a file that contains the hostnames of the processors allocated to your job. If you just have a very small cluster without any queue system you replace this with your own hostfile where you list the hostname of the nodes to use once for each CPU on them.
A Simulation ProjectOur first task is to find something to simulate. Gromacs includes coordinates for water and some other simple molecules, but we will download a hen eggwhite Lysozyme structure to have something slightly more interesting, it is a well-known protein with 164 residues. Before we execute any Gromacs programs we'll do something neat. Issue the command:
You now have shell completion both for Gromacs executables, options, and data files, as well as manual pages (try âman pdb2gmxâ). It works for all normal Unix shells, so you probably want to add it to your login file. You can get also instant help with the -h option to any Gromacs command.
Preparing a Protein StructureGo to the Protein Data Bank, and enter 2LZM in the search box. Tick the field for âPDB IDâ and hit the button. Select download, and get the uncompressed PDB file (upper left entry in the second table). Create a new directory under your home directory on the cluster and copy the file 2LZM.pdb to it. This PDB file is just a collection of atomic coordinates and names, but it doesn't contain any definitions of which parameters to use in a simulation. We will use the Gromacs program pdb2gmx to select a set of parameters (force field), and automatically translate the PDB file to (1) a topology file that describes all static properties like connectivity, bond parameters, names, etc., and (2) a coordinate file. This program doesn't need any parallel environment, so just execute
pdb2gmx -f 2LZM.pdb
Select the OPLS-AA/L force field when asked, and press return. You can set the output file names with other flags, but for now we'll stick to the default names. The topology will be called topol.top, coordinates conf.gro, and the file posre.itp contains something called position restraint data (we will use it later). A protein in vacuum isn't particularly realistic though, so we want to use periodic boundary conditions and add solvent water around it. Gromacs supports all types of triclinic boxes, but let's keep it simple. We center the protein in a rectangular box with at least 0.5 nm from the atoms to the box edge, and write it to a new file:
editconf -f conf.gro -d 0.5 -o newbox.gro
Now we can add water to our modified coordinates with the program genbox. We use the SPC water model here -- the coordinates spc216.gro will be read from the Gromacs library directory, and if you provide the topology file it will automatically be edited to add the new water. We will also show another trick: all Gromacs programs detect a number of file formats automatically, so it is just as easy to write the coordinates in PDB format instead of .gro (PDB files cannot store velocities, but we don't have any yet):
genbox -cp newbox.gro -cs spc216.gro -p topol.top -o solvated.pdb
Genbox will output a lot of information and echo that it added about 4400 water molecules in your box. Are you itching to see what the coordinates look like? Copy solvated.pdb to your workstation and display it with pymol solvated.pdb.
Energy MinimizationBefore we start any simulations we need to remove possible clashes between atoms by energy minimization, otherwise the structure would just explode if there are overlapping atoms. Create this parameter file called em.mdp in your favorite editor:
integrator = cg nsteps = 250 nstxout = 0 nstvout = 0 nstlog = 0 nstenergy = 0 rlist = 1.0 coulombtype = cut-off rcoulomb = 1.0 vdwtype = cut-off rvdw = 1.0 constraints = none define = -DFLEXIBLE
In short, this means we will use a conjugate-gradients minimizer, use standard interactions with cut-offs at 1.0 nm, but don't output any data since we just want the final structure. Since we just want to get rid of bad contacts and aren't interested in any true energy minimum we just run 250 steps, it is not even worth doing in parallel, although it would work. Prepare a run input file with the command
grompp -f em.mdp -p topol.top -c solvated.pdb -o em.tpr
If you had forgotten e.g. what you called your coordinate file you could just hit the tab key after an option to get a list of the files in the current directory that are compatible with it, provided you sourced GMXRC as we recommended! The file em.tpr contains everything we need for the simulation, so you could also prepare everything on your own workstation and just copy run input files to the cluster or supercomputer (portable across different endian architectures and precisions). It will warn you about the system having non-zero charge since we did not add any counter ions, but we'll live with that for the tutorial. We perform the minimization with the mdrun command:
mdrun -v -deffnm em
The -v flag enables verbose output, and the -deffnm flag is a nice shortcut to use em.* for all filename options. The run input file will be read from em.tpr, final coordinates written to em.gro, the energy file to em.edr, etc. The minimization will take about a minute on a modern CPU, and finish with a warning that it did not converge in 250 steps -- that's quite OK.
Position Restrained DynamicsNext we will start a real simulation, but we will be a bit careful and restrain the protein atoms while we give our new water molecules a chance to relax around the protein. Create a new parameter file pr.mdp:
integrator = md nsteps = 5000 dt = 0.002 nstxout = 0 nstvout = 0 nstlog = 0 nstenergy = 0 rlist = 1.0 coulombtype = cut-off rcoulomb = 1.0 vdwtype = cut-off rvdw = 1.0 tcoupl = berendsen tc-grps = system tau-t = 0.1 ref-t = 298 gen-vel = yes gen-temp = 298 constraints = h-bonds define = -DPOSRES
We are now using the molecular dynamics integrator, and taking 5000 steps of length 2 fs each. This step length requires bonds involving hydrogens to be kept constant. Atom velocities are generated from a Maxwell distribution at 298K, and during the simulation the temperature will be coupled to 298K using the Berendsen algorithm. Finally we enable position restraints. This time we will prepare a parallel run input file for 4 CPUs:
grompp -f pr.mdp-p topol.top -c em.gro -o pr.tpr -np 4
Time to put that cluster to use! Request four processors for interactive use, and start MPI on them as we described above, or use the instructions for your own cluster. Since we need to start mdrun on all machines you must provide the full path to the command unless it is in your path, sourcing GMXRC only affects the node you are currently logged in to. Almost all MPI implementations use mpirun to start parallel jobs, so the command to use is normally:
mpirun -np 4 /path/to/mdrun -v -deffnm pr
We still have verbose output, so you will get a message every 10 steps telling you when the simulation is expected to finish -- it should only take a couple of minutes.
- Next >>