Running MULTICLUSTAL on Big Red at IU
General information
MULTICLUSTAL searches for parameters that optimize the alignment of a set of sequences. It does so by maximizing a quality function that rewards identical amino acids and conservative substitutions and penalizes gaps and islands. Small islands are penalized more heavily than large islands. MULTICLUSTAL tries several substitution matrices, a range of gap-open penalties, and a range of gap-extension penalties. Details are available in Yuan et al. (1999, Bioinformatics 15:862-863).
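The search described above can be pictured as a grid search over three parameters. The following Python sketch is illustrative only and is not MULTICLUSTAL's code: the matrix names, penalty values, and the toy_quality function are hypothetical placeholders, and the real quality function is the one described in Yuan et al.

```python
from itertools import product

# Hypothetical values for illustration only; the matrices and penalty
# ranges that MULTICLUSTAL actually tries are not listed on this page.
MATRICES = ["blosum", "pam", "gonnet"]
GAP_OPEN = [5.0, 10.0, 15.0]
GAP_EXT = [0.05, 0.10, 0.20]


def parameter_grid(matrices, gap_open, gap_ext):
    """Return every (matrix, gap-open, gap-extension) combination."""
    return list(product(matrices, gap_open, gap_ext))


def best_parameters(grid, quality):
    """Return the combination with the highest quality score.

    `quality` is a caller-supplied stand-in for the real quality
    function, which would align the sequences with the given
    parameters and score the resulting alignment.
    """
    return max(grid, key=quality)


def toy_quality(params):
    """Placeholder score that prefers a gap-open penalty near 10."""
    _matrix, gap_open, gap_ext = params
    return -abs(gap_open - 10.0) - gap_ext


grid = parameter_grid(MATRICES, GAP_OPEN, GAP_EXT)
print(len(grid), "combinations; best:", best_parameters(grid, toy_quality))
```

Because every combination requires a separate alignment run, the number of runs grows with the product of the three list lengths, which is why a finer step size increases run time.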
On Big Red at Indiana University, MULTICLUSTAL uses a parallel version of ClustalW. MULTICLUSTAL is installed at:
/N/soft/whatami/multiclustal-1.1
The script multiclustaljob is available for submitting
parallel batch jobs. This page describes how to use MULTICLUSTAL on
Big Red.
Note: MULTICLUSTAL is copyrighted by Merck & Co., Inc. It has been modified with permission to run the parallel version of ClustalW in a batch scheduling environment. Neither IU nor users of MULTICLUSTAL on Big Red are at liberty to distribute the modified version.
For more information about the availability of software on the Indiana University shared central systems, see At IU, what software is available on the research computing systems, and how may I request that software be added?
For more information about TeraGrid software, see the following pages in the TeraGrid User Support documentation:
- Coordinated TeraGrid Software and Services (CTSS)
- TeraGrid Software Repository
- Visualization software
Preparation
The data file that contains sequences to be aligned must be alone in a directory. MULTICLUSTAL creates and reads from many files. By placing the data file in its own directory, you reduce clutter and prevent MULTICLUSTAL from reading inappropriate files.
If you have used MULTICLUSTAL to align sequences in a data file and you wish to run it again, delete all files other than the data file before rerunning MULTICLUSTAL. If you would like to keep the old output, copy it to some other directory.
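The cleanup before a rerun can be scripted. The following is a minimal Python sketch, not a tool supplied with MULTICLUSTAL; the function name reset_run_directory and the optional archive directory are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def reset_run_directory(data_file, archive_dir=None):
    """Delete every file next to data_file, optionally copying each first.

    Returns the names of the files that were removed. Subdirectories
    are left untouched.
    """
    data_file = Path(data_file).resolve()
    removed = []
    for entry in sorted(data_file.parent.iterdir()):
        # Keep the data file itself and skip anything that is not a file.
        if entry.resolve() == data_file or not entry.is_file():
            continue
        if archive_dir is not None:
            # Keep a copy of the old output before deleting it.
            Path(archive_dir).mkdir(parents=True, exist_ok=True)
            shutil.copy2(entry, Path(archive_dir) / entry.name)
        entry.unlink()
        removed.append(entry.name)
    return removed
```

For example, reset_run_directory("aaseqs", "../aaseqs-old-output") would copy the previous results to a sibling directory and leave only aaseqs behind. Both names here are placeholders.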
Note: The name of the data file may not contain an
underscore ( _ ).
The data file must contain sequences in FASTA format. Names of sequences in the file must be alphanumeric (i.e., letters and numbers only). Short names are ideal. Long names may cause problems. A sign of trouble with sequence names is that sequences are lost in the analysis (i.e., the result file contains fewer sequences than the data file).
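The preparation rules above can be checked before submitting a job. The following Python sketch is one possible pre-flight check, not part of MULTICLUSTAL. It treats the whole FASTA header line as the sequence name, which is a conservative assumption, and the max_name_length threshold of 10 is a hypothetical value, since this page does not state a specific limit.

```python
from pathlib import Path


def check_input(data_file, max_name_length=10):
    """Return a list of problems found with a MULTICLUSTAL data file."""
    path = Path(data_file)
    problems = []

    # Rule: the data file name may not contain an underscore.
    if "_" in path.name:
        problems.append(f"file name contains an underscore: {path.name}")

    # Rule: the data file must be alone in its directory.
    others = sorted(p.name for p in path.parent.iterdir() if p.name != path.name)
    if others:
        problems.append(f"directory contains other entries: {', '.join(others)}")

    # Rule: sequence names must contain letters and numbers only.
    for line in path.read_text().splitlines():
        if not line.startswith(">"):
            continue
        name = line[1:].strip()
        if not (name.isascii() and name.isalnum()):
            problems.append(f"sequence name is not alphanumeric: {name!r}")
        elif len(name) > max_name_length:
            # Assumed threshold; long names are only described as risky.
            problems.append(f"sequence name may be too long: {name!r}")
    return problems
```

An empty list means no problems were detected by these checks. It does not guarantee that the file is valid FASTA.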
Running multiclustaljob
Use the multiclustaljob script to submit jobs that run
MULTICLUSTAL. The multiclustaljob script should be in
your path by default, and its manual page should be in your
default path for manual pages. Syntax for multiclustaljob
is:
Replace options_to_multiclustal with command line
options, n with the number of processors to use, and
h with the maximum amount of time the job should be
allowed to run. If you omit the CPUS option, 4 processors
will be used. To request more than 4 processors, specify an integer
value that is a multiple of 4. If you specify a value that is not a
multiple of 4, the value will be increased to the next multiple of
4. If you omit the -wallhours option, your job will be
allowed to run for two hours. For example, to use 16 processors to
align amino acid sequences from a file aaseqs for up to
three hours, run:
When you run multiclustaljob, you'll receive a message
when your job is submitted to the queue, and another when the job
finishes. To check the status of your job, use the llq
command.
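The processor-count rule described above (default of 4, with other values raised to the next multiple of 4) can be expressed as a short calculation. The following Python sketch only illustrates that arithmetic; the function name effective_cpus is hypothetical, and this is not the code that multiclustaljob itself runs.

```python
def effective_cpus(requested=None):
    """Return the processor count a job would receive under the stated rule."""
    if requested is None:
        # No processor count given: the default of 4 applies.
        return 4
    if requested < 1:
        raise ValueError("processor count must be a positive integer")
    # Round up to the next multiple of 4 using ceiling division.
    return -(-requested // 4) * 4


for n in (None, 4, 5, 16, 18):
    print(n, "->", effective_cpus(n))
```

Requesting a multiple of 4 directly avoids any surprise about how many processors the job occupies.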
The -deep option
MULTICLUSTAL has only one option (-deep). The
-deep option decreases the sizes of steps that
MULTICLUSTAL uses to traverse the range of gap-open penalties and the
range of gap-extension penalties as it searches for optimum
parameters. If you use the -deep option, it must follow
the name of the data file, for example:
Output
MULTICLUSTAL produces a large amount of output. During the parameter
search, it runs ClustalW many times and keeps all of the output. The
file Final_score contains a running summary of progress and
identifies which parameter set (and file) is associated with the
highest-scoring alignment. In addition to its output files,
MULTICLUSTAL creates a file of system and error messages with a
filename similar to multiclustaljob.99999.0.err, where
99999 is the number of your job. That file's contents are
especially useful if it is the only file your job produces.
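When a job produces little or no output, collecting the error files is a quick first step. The following Python sketch assumes that the files match the pattern multiclustaljob.*.err, which is inferred from the example filename above, and that you pass the directory in which they were written; the function name job_error_logs is hypothetical.

```python
from pathlib import Path


def job_error_logs(directory="."):
    """Return {filename: contents} for each multiclustaljob error file."""
    logs = {}
    # Pattern assumed from the example name multiclustaljob.99999.0.err.
    for path in sorted(Path(directory).glob("multiclustaljob.*.err")):
        logs[path.name] = path.read_text(errors="replace")
    return logs


if __name__ == "__main__":
    for name, text in job_error_logs().items():
        print(f"== {name} ==")
        print(text)
```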
Known bug
MULTICLUSTAL is known to hang at times because of an issue with a program named BOXSHADE, which it uses to parse the output of ClustalW. If it hangs, your job will sit idle until the job scheduler kills it for exceeding the allotted time. You can watch for early signs of a hang by monitoring the accumulation of output files in your data file directory; if no new files appear for an extended period, the job may have stalled.
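Monitoring the data file directory can be automated. The following Python sketch reports how long it has been since any file in a directory was last modified. It is a heuristic rather than a definitive hang detector, and the default quiet_seconds threshold of 30 minutes is an arbitrary assumption, because this page does not say how long a healthy job can go between output files.

```python
import time
from pathlib import Path


def seconds_since_last_output(directory, now=None):
    """Return seconds since the newest file changed, or None if no files."""
    mtimes = [p.stat().st_mtime for p in Path(directory).iterdir() if p.is_file()]
    if not mtimes:
        return None
    now = time.time() if now is None else now
    return now - max(mtimes)


def looks_stalled(directory, quiet_seconds=1800, now=None):
    """Return True if no file has changed within quiet_seconds (assumed limit)."""
    idle = seconds_since_last_output(directory, now)
    return idle is not None and idle > quiet_seconds
```

Calling looks_stalled on the data file directory every few minutes while the job runs gives an early warning. If it returns True, compare the result with the job's status in the queue before deciding whether to cancel the job.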
This document was developed with support from the National Science Foundation (NSF) under Grant No. 0503697 to the University of Chicago and subcontracted to Indiana University. Additional support was provided by IU through its participation in the TeraGrid, which is supported by the NSF under Grants No. 0833618, SCI451237, SCI535258, and SCI504075. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
Last modified on January 07, 2009.







