This is the User Guide for the msInspect/AMT
platform, a set of tools for:
- creating and managing Accurate Mass and Time
databases from high-quality MS/MS identifications,
- matching high-resolution LC-MS peptide features
to those databases, and
- integrating the AMT results with MS/MS
identifications from the same LC-MS runs.
The msInspect/AMT platform takes the Accurate Mass and Time approach
first described by the Smith laboratory and implements those ideas in
an open-source, platform-independent Java software suite, with many
original features that we have described in our publications.
The instructions here are intended to
help you use msInspect/AMT flexibly, depending on the configuration of
your data. We highly recommend beginning with the more
specific msInspect/AMT Tutorial
before using msInspect/AMT on your own data. The brief tutorial will
give you
a good feel for the sequence of commands involved, and will help you
get msInspect up and running. We also recommend using "interactive
mode" (the --interactive command) if you're having trouble
providing the correct arguments to the various commands, since
"interactive mode" provides
quite a bit of online help.
If you have questions about msInspect/AMT that you
can't find answers
to in this guide or in the tutorial, please ask them on our msInspect
Support Forum. Also, please check the msInspect/AMT
FAQ.
This User Guide explains the use of the
msInspect/AMT platform for
Accurate Mass and Time analysis. The purpose of msInspect/AMT is to
locate LC-MS peptide features by their mass and retention time,
assign peptide sequence IDs to those features using a database of
identifications from MS/MS database
searches, assign probability scores to each sequence assignment,
and finally (if desired) use these identified peptides as part of a
protein inference workflow along with peptide
identifications from MS/MS.
In order to use msInspect/AMT, you will need to be
able to provide:
- pepXML files containing the
results of an MS/MS database search on all runs that you would like to
include in your database (and, ideally, on all the runs that you will
match to the database)
- mzXML files containing the
LC-MS scans from all the runs in the AMT database, and from all the
runs you will match to the database
In order to roll AMT results up to the protein
level, you will need to
have the Trans-Proteomic
Pipeline (or at least RefreshParser and ProteinProphet) installed
and be familiar with their use.
The algorithms that support this analysis workflow
will not be
discussed in detail here. For details on
the algorithms used, please see May D, Fitzgibbon M, Liu Y, Holzman T,
Eng J, Whiteaker J, Paulovich A, McIntosh M.
A Platform for Accurate Mass and Time Analyses of Mass Spectrometry Data,
Journal of Proteome Research 2007.
msInspect/AMT is built on the msInspect platform,
which is open-source and
largely written in Java and R. msInspect
details and source code are available from our website.
Running msInspect
msInspect/AMT requires Java 1.5.
If you already have a later version of Java on your machine, try that
first; if you encounter problems, you
may need to install 1.5.
It also requires R version 2.2 or
later (some optional functionality
requires 2.5.1 or later). Make sure that R is installed and is on
your path, i.e., when you
type "R" (or "RCMD" on Windows) from a command prompt, the correct
version of R starts. msInspect/AMT
requires the "quantreg" package, which is not installed by default with
R and which may be installed
from the R command prompt as follows:
source("http://bioconductor.org/biocLite.R"); biocLite(c("quantreg"));
To run msInspect, download the msInspect
JAR file
from our website. All of the msInspect/AMT functionality is available
only on the command line, so you
will not be able to use the Java WebStart version of msInspect.
Set up Java and R as described above and in the
tutorial. To run msInspect, use this command:
java -Xmx1024M -jar <msinspect_path>
where 1024 is the number of megabytes of memory to
provide to
msInspect, and <msinspect_path>
is the path to the JAR file you downloaded.
Throughout this guide, we will use the shorthand msinspect
in place of the above command.
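The shorthand can also be captured in a small helper when msInspect is driven from a script. The following is a minimal sketch in Python that only assembles the argument list shown above; the function name "msinspect_command" and the JAR path are hypothetical, and nothing is executed.

```python
import shlex


def msinspect_command(jar_path, *args, memory_mb=1024):
    """Build the full java command that the 'msinspect' shorthand stands for."""
    return ["java", f"-Xmx{memory_mb}M", "-jar", jar_path, *args]


# Hypothetical JAR location; substitute the path to the file you downloaded.
command = msinspect_command("/path/to/msinspect.jar", "--help", "findpeptides")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) to launch msInspect, with memory_mb raised for large runs.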
For help with any msInspect command, you can:
- run "msinspect --help <command_name>" for
textual help on the command and all arguments
- run "msinspect --interactive
<command_name>" for a graphical window that allows you to specify
all arguments for the command, and shows all of the same help
information as the "help" command in a friendlier layout
- run "msinspect --usermanual
<command_name>" for an HTML-formatted manual entry on the command
Creating an AMT Database
The first step in the msInspect/AMT workflow is to
create an AMT
database, a collection of observations
of peptides from MS/MS database searches of multiple runs. An AMT
database may be small, with data from only 5-10 runs, or it may
contain hundreds of related MS/MS runs. An AMT database contains mass
and Normalized Retention Time (NRT) information about every peptide
observation in the database,
broken down by the different modifications with which the peptide was
observed.
The AMT database file created by msInspect uses an
XML format called
"amtXML".
- The top of the file contains summary
information about all of the runs in the file.
- The body of the file is a collection of peptide_entry
elements, one for each unique peptide sequence.
- Each peptide_entry element contains summary
information for that peptide, including the median
NRT with which the peptide was observed. This is the information used
in matching.
- Each peptide_entry element contains a separate modification_state
entry for each "modification state" in which the peptide was observed,
e.g., with oxidized methionine in position 7 and
acetylated Cysteine in position 3. Summary information for all the
observations at the
modification state level is maintained.
- Each modification_state element contains a
separate observation element for each time the peptide was
observed in any run. This observation element retains the original
retention time of
the peptide within the run, and an indication of which run it was
observed in.
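The nesting described above can be checked on a real database by counting elements. The sketch below is in Python and relies only on the three element names given in this guide; the root element name in the sample document is hypothetical, and no attribute names are assumed.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_amt_elements(source):
    """Count peptide_entry, modification_state and observation elements."""
    wanted = {"peptide_entry", "modification_state", "observation"}
    counts = Counter()
    # iterparse streams the document instead of loading it all at once
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = elem.tag.rsplit("}", 1)[-1]  # drop any XML namespace prefix
        if name in wanted:
            counts[name] += 1
        if name == "peptide_entry":
            elem.clear()  # release the finished entry and its children
    return dict(counts)


# Hypothetical minimal document; a real amtXML file carries more detail.
SAMPLE = b"""<amt_database>
  <peptide_entry>
    <modification_state>
      <observation/>
      <observation/>
    </modification_state>
  </peptide_entry>
</amt_database>"""

print(count_amt_elements(io.BytesIO(SAMPLE)))
```

The function accepts either a file path or an open binary file, so it can be pointed directly at an amtXML file on disk.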
Preparing All Necessary Files
In order to create an AMT database, you will need
a separate pepXML
file for each run you wish to include
in the database.
If you have a single pepXML file containing
multiple runs (e.g.,
fractions), you will
first need to break it up into a separate file for each run (these
separate files are also needed in later steps). msInspect provides a
tool for this:
msinspect --extractrunsfrompepxml
<input_file> <output_directory>
This tool will create a separate file for each individual run in
input_file, and write them all to output_directory.
Retention Time
Retention times for each peptide observation are
crucial for developing an
accurate mapping from retention time
to NRT within each run, and for storing the observations
themselves. You have the option of gathering retention times from
MS/MS data themselves, or from MS1 features that match uniquely to
MS/MS peptides by mass and time (this is discussed below). In
either case, the MS/MS identifications must have retention time
information.
Some versions of the converters that create
pepXML files do not carry this information forward. You may need to
provide retention times for each scan
within the pepXML file, by transferring this information from the mzXML
file from the same run.
You can tell whether your pepXML files have
retention time information by looking for the retention_time_sec
attribute on the spectrum_query elements within the files (or
by waiting for the matchamt command to fail with an error message about
missing retention times).
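This check can be scripted rather than done by eye. The following is a minimal sketch in Python that counts how many spectrum_query elements carry the retention_time_sec attribute; the function name and the sample fragment are illustrative only.

```python
import io
import xml.etree.ElementTree as ET


def retention_time_coverage(source):
    """Return (queries_with_retention_time, total_queries) for a pepXML file."""
    with_rt = 0
    total = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] == "spectrum_query":
            total += 1
            if "retention_time_sec" in elem.attrib:
                with_rt += 1
            elem.clear()  # keep memory use flat on large files
    return with_rt, total


# Illustrative fragment: one query with a retention time, one without.
SAMPLE = b"""<msms_run_summary>
  <spectrum_query spectrum="run1.00001.00001.2" retention_time_sec="1204.5"/>
  <spectrum_query spectrum="run1.00002.00002.2"/>
</msms_run_summary>"""

print(retention_time_coverage(io.BytesIO(SAMPLE)))
```

If the first number is smaller than the second, some identifications lack retention times and the file needs the extra step described next.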
If the information is not already there, it must
be extracted from the mzXML files associated with each
pepXML file. To provide this information to msInspect, run the
"populatems2times" msInspect command to create a text version
of
each pepXML file that contains the retention time information pulled from the mzXML
files with the same names (other than extension) as the pepXML files:
msinspect --populatems2times
--mzxmldir=<mzxml_directory> <input_files>
--outdir=<output_directory>
These files may then be used exactly like pepXML files for the commands
that follow.
Filtering pepXML files
For creating the AMT database, and also for
aligning retention times
between MS runs and the AMT database, you will only want to consider
very high-quality MS/MS features. You can filter your pepXML files
however you like. msInspect provides a utility to filter by
PeptideProphet probability:
msinspect --filter --minpprophet=.95
--outdir=<output_directory> <input_files...>
Running the "createamt" Command
Now that you have assembled all of the necessary
files to create an AMT
database, run the msInspect
command that creates the database. There are several arguments to this
command; here is the most
common use case:
msinspect --createamt
--out=<output_file> --mode=pepxmlfiles
--ms2dir=<ms2_directory> --ms1dir=<ms1_directory>
Note: the MS2
directory referenced in the 'ms2dir' argument can contain
either pepXML files or .tsv files generated through the "--filter"
command described above. All files in the directory will be added to
the database, so the directory should contain only those files you want
in the database.
If the "--ms1dir" argument is not specified, the observations
stored in the database will come directly from high-quality MS/MS
peptide identifications. If the "--ms1dir" argument is specified, then the MS1 feature
files (see below) encountered in the "--ms1dir" directory will be
matched to MS/MS peptide identifications by mass and retention time
(you can control the size of the matching window using the
"--deltamassppm" and "--deltatime" arguments), and the times stored in
the database will be the retention times from MS1 features that match uniquely to MS/MS peptide
identifications. msInspect will search the "--ms1dir" directory
for files that match the MS/MS files from the "--ms2dir" directory by
name (they may have different extensions).
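The idea of a unique match inside a mass and time window can be illustrated as follows. This is a sketch in Python, not msInspect's implementation: the function names are hypothetical, the time tolerance is assumed to be in the same units as the retention times, and features are reduced to (mass, time) pairs.

```python
def ppm_difference(observed_mass, reference_mass):
    """Signed mass difference in parts per million."""
    return (observed_mass - reference_mass) / reference_mass * 1e6


def unique_ms1_match(ms2_mass, ms2_time, ms1_features, delta_mass_ppm, delta_time):
    """Return the single MS1 feature inside the window, or None.

    None is returned both when nothing matches and when several features
    match, since only unique matches are of interest here.
    """
    hits = [
        (mass, time)
        for mass, time in ms1_features
        if abs(ppm_difference(mass, ms2_mass)) <= delta_mass_ppm
        and abs(time - ms2_time) <= delta_time
    ]
    return hits[0] if len(hits) == 1 else None


# Made-up features as (mass, retention time) pairs.
features = [(1000.000, 600.0), (1000.004, 605.0), (1200.000, 600.0)]
print(unique_ms1_match(1000.001, 602.0, features, 2.0, 10.0))  # one feature in window
print(unique_ms1_match(1000.001, 602.0, features, 5.0, 10.0))  # two features: ambiguous
```

Widening the window raises the number of candidate features but also the number of ambiguous cases that are discarded, which is the trade-off to keep in mind when adjusting the two arguments.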
In general, if you are building your AMT database
from runs with high-resolution data, we recommend that you supply the
"--ms1dir" argument. This is because the retention time from MS1
is a much more accurate predictor of future MS1 retention time than
MS/MS retention time is. However, if you do not have
high-resolution MS1 data for your runs, do not supply this
argument.
If you supply the "--ms1dir" argument, you will
first need to find and filter LC-MS peptide
features within each run,
as described below.
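Because the MS1 and MS/MS files are paired by name, it can help to check the pairing before building the database. The sketch below is in Python and assumes that "name" means the text before the first dot of the file name; that convention and the function names are assumptions, not documented msInspect behavior.

```python
from pathlib import Path


def base_name(path):
    """Text before the first dot of the file name (assumed naming convention)."""
    return Path(path).name.split(".", 1)[0]


def unmatched_ms2_files(ms2_files, ms1_files):
    """List the MS/MS files that have no MS1 feature file with the same base name."""
    ms1_names = {base_name(f) for f in ms1_files}
    return [f for f in ms2_files if base_name(f) not in ms1_names]


# Made-up file names for illustration.
print(unmatched_ms2_files(
    ["ms2/run1.pep.xml", "ms2/run2.pep.xml"],
    ["ms1/run1.peptides.tsv"],
))
```

An empty result means every MS/MS file has a candidate MS1 partner under this naming assumption.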
Combining Multiple AMT Databases
If you have created multiple AMT databases and wish to combine them
(e.g., to combine information from multiple experiments), you may do so
with this command:
msinspect --createamt --out=<output_file>
--mode=amtxmls <amtxml_files>
Note: do not combine AMT databases whose
retention times are from MS1 with AMT databases whose retention times
are from MS2! If you need to include some runs where
high-resolution data are not available, just use MS2 retention times
for all runs.
Aligning Database Runs
Once you have created an AMT database that
contains all the runs
relevant to your analysis, you should nonlinearly align all runs in the
database to each other.
This step eliminates nonlinear variations in gradient that are not
removed by the linear mapping that is
done when the database is created. You only need to do this step once,
on the final AMT database you wish to match against. You can do
this as follows:
msinspect --manageamt --mode=alignallruns
--out=<output_amt_file> <input_amt_file>
The msInspect "amtdiagnostic" command with "mode=basicinfo" will give
you summary information about
your AMT database: how many peptide entries, runs, etc.
Finding and Filtering LC-MS Peptide Features
Now that you have an AMT database, you need to
locate the LC-MS peptide
features that you want to match
to the database. You can do this with msInspect, or with any other
feature-finding program whose format
is supported by msInspect/AMT. msInspect/AMT supports the following
LC-MS feature file formats:
- msInspect .tsv (tab-separated values)
- APML, an XML standard file format for LC-MS data
- SpecArray's text format
- Hardklor's text format
SpecArray and Hardklor formats are supported through the
"convertfeaturefile" msInspect command, which
can convert them to the msInspect text format.
To find your own peptide features using msInspect,
you can use the "findpeptides" command:
msinspect --findpeptides
--outdir=<output_directory> <input_file> <input_file>
...
The above command operates on multiple files at once (which can be
specified with a "*" wildcard), but you
can also run the "findpeptides" command on one file at a time.
By default, msInspect finds all possible peptide
features that it can within each file. For AMT, it
is important that only high-quality features are retained, so
filter your feature files before matching. A loose set of
filtering criteria might be applied as follows:
msinspect --filter
--outdir=<output_directory> --minpeaks=2 --maxkl=3
<input_file> <input_file> ...
where "minpeaks" is the minimum number of isotopic peaks identified in
each feature, and "maxkl" is
the maximum value for the "K/L score", an LC-MS feature quality metric.
More or less strict filtering
criteria may be used depending on whether you're having more trouble
with specificity or with sensitivity
in your AMT matching.
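The effect of the two thresholds can be illustrated with a short sketch in Python. The dictionary keys "peaks" and "kl" and the inclusive comparisons are assumptions made for illustration; the actual filtering is done by the "filter" command.

```python
def passes_filter(feature, min_peaks=2, max_kl=3.0):
    """True if a feature has enough isotopic peaks and a low enough K/L score."""
    return feature["peaks"] >= min_peaks and feature["kl"] <= max_kl


def filter_features(features, min_peaks=2, max_kl=3.0):
    """Keep only the features that satisfy both thresholds."""
    return [f for f in features if passes_filter(f, min_peaks, max_kl)]


# Made-up features for illustration.
features = [
    {"mass": 1000.5, "peaks": 3, "kl": 0.8},  # kept
    {"mass": 1200.7, "peaks": 1, "kl": 0.5},  # too few peaks
    {"mass": 1500.2, "peaks": 4, "kl": 3.5},  # K/L score too high
]
print(len(filter_features(features)))
```

Raising min_peaks or lowering max_kl favors specificity; relaxing them favors sensitivity.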
You can also find and filter LC-MS features one
run at a time using msInspect's graphical viewer mode
(simply run the msInspect command with no arguments, see msInspect
user's guide for details).
Matching to the AMT Database
Now it's finally time to match your LC-MS features
to the AMT database
and assign peptide IDs. In order to
do this, you'll need:
- An AMT database, preferably having had the
"alignallruns" command run on it
- One or more LC-MS feature files, preferably
filtered
- If you have them, pepXML-formatted MS/MS search
results from each LC-MS run you want to match to the database
This last item is used in order to develop the mapping from retention
time in the LC-MS run to the NRT
scale represented in the database. This mapping is done by doing an
initial pairing of LC-MS features
with peptide entries in the AMT database, and then performing modal
regression on the retention times of
the LC-MS features and the NRT values of the associated AMT database
entries. If you supply a set of MS/MS search results, then the initial
pairing is done using the peptide IDs of the MS/MS results. If not,
the initial pairing is done using a tight mass-only match between LC-MS
features and peptide entries. In
general, the pairing using MS/MS peptide IDs is much more reliable.
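The shape of this computation can be sketched in Python. The example pairs run retention times with database NRT values through shared peptide IDs and then fits a straight line by ordinary least squares. The least-squares line is a deliberate simplification standing in for the modal regression that msInspect performs, and all names and values are made up.

```python
def pair_by_peptide(run_times, database_nrt):
    """Pair run retention times with database NRT values for shared peptides."""
    shared = sorted(set(run_times) & set(database_nrt))
    return [(run_times[p], database_nrt[p]) for p in shared]


def fit_line(pairs):
    """Ordinary least-squares fit of nrt = slope * rt + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def to_nrt(retention_time, slope, intercept):
    """Map a run retention time onto the database NRT scale."""
    return slope * retention_time + intercept


# Made-up values: peptide -> retention time in the run, peptide -> NRT in the database.
run_times = {"PEPTIDEK": 600.0, "SAMPLER": 1200.0, "THIRDPEPR": 900.0, "ONLYINRUNK": 50.0}
database_nrt = {"PEPTIDEK": 0.2, "SAMPLER": 0.4, "THIRDPEPR": 0.3, "ONLYINDBK": 0.9}

slope, intercept = fit_line(pair_by_peptide(run_times, database_nrt))
print(to_nrt(750.0, slope, intercept))
```

Peptides seen on only one side are ignored, which is why a larger overlap between the run and the database gives a better-constrained mapping.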
To perform a match between the database and a
single file of LC-MS features, use this command:
msinspect --matchamt --mode=singlems1
--ms1=<filtered_features_file>
--embeddedms2=<MSMS_search_result_file> --out=<output_file>
<amtXML_file> --showcharts
The "--showcharts" option will display a number of charts that describe
the performance of the AMT match.
The charts will be displayed in a series of tabs in a single dialog
window. These charts are explained
below. Note: if you are running msInspect on a remote computer,
displaying charts may take a very long time. In that case you can use
the "--savechartsdir" option instead of --showcharts. This will not
display the charts on-screen, but it will save them to files in a
directory you specify, which can be much faster.
The output of the "matchamt" command is an LC-MS
feature file identical to the input feature file except
that peptide sequence assignments, with probabilities, are made for
some peptides.
You can perform AMT matching on an entire
directory full of LC-MS
feature files, one after another.
In addition to being more convenient, this also saves a good deal of
time, since the AMT database only
needs to be loaded into memory once.
msinspect --matchamt --mode=ms1dir
--ms1dir=<filtered_features_directory>
--ms2dir=<MSMS_search_result_dir>
--outdir=<output_directory> <amtXML_file>
If you use the "--showcharts" argument when
matching multiple runs, a
separate set of tabs will be
created containing the charts related to each matched file. This can
become quite memory-intensive and is not recommended for large numbers
of files.
There are a large number of arguments to the
"matchamt" command that
you may wish to adjust. The "--help" command and the graphical
"--interactive" mode are particularly helpful for the AMT matching
command, describing all matching arguments in great detail.
The most important non-required argument to pay
attention to is "modifications".
This argument
lets the matching algorithm know what modifications to expect on
peptides in the LC-MS feature file(s). By default, a static
modification on Cysteine of 57.021 Da (representing carbamidomethylation) and a
variable modification on Methionine of 15.995 Da (representing
oxidation) are assumed. If your data have other static
or potential modifications (e.g., isotopic labeling, see below), you
should declare them here.
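When the argument is assembled by a script, a small formatter avoids typing mistakes. The sketch below is in Python and assumes the syntax seen in the acrylamide example in the Labeled Quantitation section of this guide: residue letter, then mass, then a trailing "V" for a variable modification, with entries separated by commas. The function name is hypothetical.

```python
def modifications_argument(modifications):
    """Format (residue, mass, is_variable) tuples as a --modifications argument."""
    parts = [
        f"{residue}{mass}{'V' if is_variable else ''}"
        for residue, mass, is_variable in modifications
    ]
    return "--modifications=" + ",".join(parts)


# The default modifications described above, written out explicitly.
print(modifications_argument([("C", 57.021, False), ("M", 15.995, True)]))
```

Use "msinspect --help matchamt" to confirm the accepted syntax before relying on generated arguments.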
Note that the modifications declared here do not
necessarily need to be
the same modifications that were used on the LC-MS/MS data that make up
the database. As long as the modification itself doesn't have a large
impact on peptide retention time, you can, for instance, use unlabeled
LC-MS data with an AMT database generated from isotopically labeled
data.
Note: if you
used MS/MS retention times in building the AMT database, rather than
MS1 retention times, you would add the argument
"--usems1foralignment". This would instruct msInspect to use MS2
retention times, rather than MS1, to create the alignment between the
run and the database.
Interpreting the Results
There are several charts displayed via the
"--showcharts" option during
matching that will help you
to understand how well the AMT matching algorithm performed.
- T->H Map: the datapoints used for the
modal regression to develop the nonlinear map from RT to
NRT. The mapping itself is indicated as a line on this chart.
- Before Calibration: describes the mass
calibration of the LC-MS features, prior to recalibration. LC-MS
features are calibrated using the initial match to the AMT database, in
order
to ensure a normal distribution of mass error. A line indicates the
regression result used for calibration.
- After Calibration: describes the mass
calibration of the LC-MS features after recalibration
- Loose match error data: shows all AMT
matches within a wide tolerance window
- Decoy data: shows all AMT matches to a
decoy AMT database created by adding a fixed value
to AMT feature masses
- EM dist analysis: plots describing how
well the estimated distribution fits the data, one for each dimension.
In the charts on the left, the estimated distributions in each
dimension are
overlaid with the actual data density; ideally there should be little
difference. The quantile-quantile
plots on the right are derived from the same data and should ideally be
a 1:1 line.
- Distribution: 3D perspective plot of the
raw match data density (gray) with the estimated distribution
superimposed in a red mesh. Ideally the two distributions should lie
very closely atop
one another. This plot requires R version >= 2.5.1
- EM Parameters: shows the convergence of
all the parameters estimated by the EM algorithm
- All probabilities: shows the same
datapoints from the "Loose match error data" plot, color-coded by the
probability assigned with the EM model. High-probability points are
blue.
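The decoy comparison mentioned in the chart list can be illustrated with a short sketch in Python. It shifts every database mass by a fixed offset and counts how many features still match by mass alone. The offset value, the mass-only matching and the function names are assumptions for illustration; the real matching also uses retention time.

```python
def decoy_masses(database_masses, offset):
    """Create decoy masses by adding a fixed value to each database mass."""
    return [mass + offset for mass in database_masses]


def count_mass_matches(feature_masses, database_masses, tolerance_ppm):
    """Count features that lie within the ppm tolerance of any database mass."""
    return sum(
        any(abs(f - d) / d * 1e6 <= tolerance_ppm for d in database_masses)
        for f in feature_masses
    )


# Made-up masses; 11.0 is a hypothetical offset, not the value msInspect uses.
features = [1000.000, 1500.000]
database = [1000.001, 1500.002]
print(count_mass_matches(features, database, 5.0))
print(count_mass_matches(features, decoy_masses(database, 11.0), 5.0))
```

Matches against the shifted masses can only arise by chance, so their number gives a rough sense of how many matches to the real database may be false.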
Combining AMT Results with MS/MS results
Once you have performed AMT matching and assigned
peptide sequences,
with probabilities, to many LC-MS
features, you can use these identifications on their own or combine
them with existing LC-MS/MS data for the same runs, if available. If
you want to work with AMT-only data, you can skip
the combining step described next.
First, if you intend to combine the AMT data with
MS/MS data, you will
need to combine the two types
of files:
msinspect --combineamtms2
--outdir=<output_directory>
--indir=<pepXML_directory> --indir2=<amt_matched_dir>
--fasta=<FASTA_database> --restrictcharge --refreshparser
--guessproteins
This command will identify pairs of MS/MS and AMT
files and combine
each pair, given certain constraints
such as the maximum charge to carry forward. You can also run a version
of this command on individual pairs
of files. The output files will be in pepXML format.
The "guessproteins" argument uses the provided
FASTA database to guess a single protein for each peptide
identification, which is time-consuming but necessary in order to run
RefreshParser (part of the TPP) on the files. The --refreshparser
argument runs RefreshParser itself; this can be run separately (below)
if RefreshParser is not on your path.
If you are not combining your AMT results with
MS/MS data, you can
simply run the "convertfeaturefile"
msInspect command to transform your AMT matching results to pepXML
format.
Because ProteinProphet (see below) requires all possible proteins to be
identified for each peptide,
you will need to run RefreshParser, which will interrogate the FASTA
file again, on all of the combined
pepXML files. If you have RefreshParser on your path, using the
"refreshparser" argument as described above will do this for you.
However, if you need to run RefreshParser separately, unfortunately
there is no built-in mechanism for running it on a number of files at
once. Here is an example of a command for
running it on a number of files at once, under Linux (this assumes that
RefreshParser is on your path):
find merged_unfiltered -name "*pep.xml" -exec
RefreshParser {}
/mnt/cpl/data/fasta/ipi.HUMAN.fasta.20060823 \;
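On systems without the "find" utility, the same loop can be sketched in Python. The example builds one RefreshParser command per file, using the same argument order as the command above (pepXML file first, then FASTA file). The function name is hypothetical, and the commands are only built, not run.

```python
from pathlib import Path


def refreshparser_commands(directory, fasta_path, pattern="*pep.xml"):
    """Build one RefreshParser command per matching file under directory."""
    files = sorted(Path(directory).rglob(pattern))  # recursive, like find
    return [["RefreshParser", str(f), str(fasta_path)] for f in files]


# Directory and FASTA names below are placeholders.
for command in refreshparser_commands("merged_unfiltered", "database.fasta"):
    print(" ".join(command))
```

Each command list can be passed to subprocess.run(command, check=True) once RefreshParser is on your path.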
Labeled Quantitation
If the data you are analyzing contain isotopic
labels, you can infer
isotopic ratios from the AMT results
in which either light or heavy labeled states (or both) are matched via
AMT. Be sure to declare the light version
of the isotopic label as a static modification during matching, and the
difference between the light and
heavy label weights as a variable modification.
For instance, acrylamide labeling has a 3.0101-Da
separation between light and heavy labels on Cysteine, and the light
label weighs 71.0366 Da and takes the place
of the "normal" iodoacetamide modification used to protect Cysteine.
Oxidized Methionine is still a possibility. So the correct
"modifications" argument becomes
"--modifications=C71.0366,M15.995V,C3.0101V".
Once you have performed matching with any variable
modification, you
can use that modification to identify abundance ratios between light
and heavy isotopes with the Q3 algorithm, using the "q3" command in
msInspect:
msinspect --q3new --labeledResidue=C
--massDiff=3.01006449
--out=<output_file> <input_file> -d<mzxml_directory>
--maxFracDeltaMass=20ppm --minPeptideProphet=.75 --forceoutput
This command will produce new LC-MS feature files
with AMT peptide IDs,
in which isotopic pairs are
associated with each other and their intensity ratios are calculated.
These ratios will be rolled forward in protein inference. The command
above is appropriate for the Q3 algorithm, but any quantitation
algorithm making use of PepXML files (e.g. XPress) can be used at this
point.
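To see where the heavy partner of a light peptide is expected, the mass arithmetic can be sketched in Python. The example assumes that each labeled residue contributes one mass difference, and uses the residue and massDiff value from the Q3 command above; the function name is hypothetical.

```python
def expected_heavy_mass(light_mass, peptide, labeled_residue="C", mass_diff=3.01006449):
    """Mass of the heavy-labeled form, assuming one label per labeled residue."""
    return light_mass + peptide.count(labeled_residue) * mass_diff


# Made-up peptide with two cysteines, so the pair is separated by two labels.
print(expected_heavy_mass(1000.0, "ACDCK"))
```

Peptides without the labeled residue have no heavy partner, so no ratio can be computed for them.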
Using AMT Results with ProteinProphet
Note: If you are not combining MS2 and AMT
data
as described above, you will need to perform a couple more steps before
proceeding to ProteinProphet. See below.
Now you are ready to use these data in protein
inference. These steps involve the Trans-Proteomic Pipeline (TPP).
The end result of this pipeline is a protXML file containing
information about protein
identifications and their associated peptide identifications and (if
present) quantitation ratios.
To run ProteinProphet using only your combined AMT-MS/MS files and not
the original
MS/MS pepXML files:
ProteinProphet <pepXML_file>
<pepXML_file> ...
<output_protXML_file>
Finally, for isotopically-labeled data you will
need to run a
quantitation algorithm to identify intensity ratios for all
quantifiable
proteins by combining peptide-level ratios. For a 3-Dalton-separation
label such as acrylamide, for instance, you can run the Q3 algorithm as
follows:
Q3ProteinRatioParser <protXML_file>
Preparing AMT-Only Data for ProteinProphet
If you are only using AMT
data, and not combining it with MS2 data as
described above, you'll need to perform a few more steps before running
ProteinProphet:
- Convert your matching output files to pepXML,
specifying a FASTA database: msinspect --convertfeaturefile
--outdir=<output_dir> --outformat=pepxml
--fasta=<FASTA_FILE> <input_files...>
- Guess an initial protein for each peptide match
(necessary for RefreshParser, as described above): msinspect
--guessproteinsfromfasta --fasta=<FASTA_file>
- Run RefreshParser, as described above
- If the data are isotopically labeled, run Q3,
as described above