msInspect/AMT Tutorial
This is a step-by-step tutorial for msInspect/AMT. In this
tutorial, you will install the necessary dependencies for running
msInspect/AMT on your machine, create an AMT database from example
pepXML files, match MS1 features to the AMT database, and combine the
AMT matches with MS/MS data and perform protein inference. All output
files from the commands described here are already provided in the
tutorial bundle, so that you can see the output with or without running
the commands yourself.
If you are already familiar with running msInspect
and have the latest version installed on your machine, feel free to use
your regular msInspect command in place of the "msinspect.sh" or
"msinspect.bat" script mentioned below.
Please note: Running time of the scripts in this
tutorial will depend on the speed of your processor and amount of
memory available. We recommend at least 1GB of RAM.
Step 1: Download the Tutorial Bundle
First, download the msInspect/AMT
Tutorial Bundle (~30MB). Choose a directory on your machine (either
a Windows or a Linux/Unix machine) and unzip the bundle. Change your
directory to the amt_tutorial (top) directory of the bundle, which
contains viewerApp.jar, README, etc.
Step 2: msInspect/AMT Dependencies
Next, install the dependencies that msInspect needs to run, if they are
not already installed on your machine. These dependencies are Java 1.5
(or later) and the R statistical language.
Java
To run msInspect, you must have a Java 1.5 (or later) JVM installed and
on your PATH.
Please see the Java website for
download and installation instructions
for "J2SE 5.0", which includes the required version.
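A quick way to confirm which JVM is on your PATH is to inspect the output of "java -version". The helper below is only a sketch: the function name java_major_version is made up for this tutorial, and it assumes the version line has the usual form (for example: java version "1.5.0_22").

```shell
# Hypothetical helper: extract the Java major version from `java -version` output.
# Handles the old "1.x" numbering (1.5.0_22 -> 5) and the newer one (17.0.2 -> 17).
java_major_version() {
    # Pull out the quoted version string from the first matching line
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    [ -n "$ver" ] || return 1
    case $ver in
        1.*) ver=${ver#1.}; echo "${ver%%[!0-9]*}" ;;  # old scheme: drop the leading "1."
        *)   echo "${ver%%[!0-9]*}" ;;                 # new scheme: leading number
    esac
}
```

To use it, run: java_major_version "$(java -version 2>&1)". Any result of 5 or higher satisfies the requirement above.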
R
msInspect/AMT also requires the R
statistical language, version 2.2 or later (some optional
functionality requires 2.5.1 or later). Make sure that R is installed
and is on your path; that is, when you type "R" (on Linux; "RCMD" on
Windows) at a command prompt, the correct version of R should start.
msInspect/AMT requires the "quantreg" package,
which is not installed by default with R and which may be installed
from the R command prompt as follows:
source("http://bioconductor.org/biocLite.R");
biocLite(c("quantreg"));
You may be informed that you don't have write access to
the R library directory. If this happens, set an environment
variable called R_LIBS to a directory that you can write to, and
restart R.
Running msInspect
Now that you have set up these dependencies, ensure that you can run
msInspect successfully. We have provided two scripts to run msInspect,
one appropriate for Windows (msinspect.bat) and one for Linux
(msinspect.sh). These scripts simply run Java on the JAR file
viewerApp.jar and pass along whatever arguments you specify. If you run
the appropriate script with no arguments, you should see the msInspect
splash image and then the msInspect viewer, with a dialog asking you to
choose a file to open. Cancel out of this and close msInspect.
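For reference, a wrapper of the kind described above can be very small. The following is a sketch written for this tutorial, not the contents of the bundled msinspect.sh (which may pass additional JVM options). The JAVA and MSINSPECT_JAR variables are hypothetical overrides introduced here.

```shell
# Sketch of a minimal wrapper in the spirit of msinspect.sh.
# JAVA and MSINSPECT_JAR are hypothetical override variables; by default the
# function runs "java" on viewerApp.jar in the current directory.
run_msinspect() {
    # "$@" forwards every argument unchanged, preserving quoting
    "${JAVA:-java}" -jar "${MSINSPECT_JAR:-viewerApp.jar}" "$@"
}
```

Calling run_msinspect with no arguments is then equivalent to running "java -jar viewerApp.jar" from the amt_tutorial directory.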
Step 3: Find and Filter MS1 Peptide Features
At this point, you would normally run msInspect to find MS1 peptide
features in the set of runs you wish to include in the AMT database.
This step is not necessary, and would not be appropriate for databases
containing runs for which high-resolution MS1 data are not
available. But if high-resolution data are available, it is
preferable to use those data in building the AMT database.
Unfortunately, the mzXML files containing MS1 spectra are far too large
to include in this tutorial. If those files were available, you would
run a command like this:
Windows:
.\msinspect.bat --findpeptides --outdir=features mzXML\*.mzXML
Linux:
./msinspect.sh --findpeptides --outdir=features mzXML/*.mzXML
In this tutorial, the resulting feature files have
already been created
for you, in the features directory. It is important to filter these
feature files, so that they contain only high-quality features:
Windows:
.\msinspect.bat --filter --outdir=features\filtered features\*.tsv --minpeaks=2 --maxkl=3
Linux:
./msinspect.sh --filter --outdir=features/filtered features/*.tsv --minpeaks=2 --maxkl=3
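To see how much the filter removed, you can compare feature counts before and after. The sketch below assumes, without confirmation from this tutorial, that a feature file is tab-separated text in which lines starting with "#" are comments and the first remaining line is a column header. The function name count_features is made up here.

```shell
# Hypothetical helper: count data rows in a feature .tsv file.
# Assumption: '#' lines are comments and the first non-comment line is a header.
count_features() {
    grep -v '^#' "$1" | awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}
```

Run it on a file in the features directory and on its counterpart in features/filtered; the second count should be no larger than the first.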
Step 4: Create AMT Database
In this step, you will create an AMT database. In the pepxml directory,
there are several files with extension .pep.xml. These files,
in the standard pepXML
format, are the result of an MS/MS database search on
acrylamide-labeled data from complex samples. In order to keep this
tutorial bundle small, these files represent only a few fractions of a
single sample -- normally you would build an AMT database from many
more fractions from multiple samples. For demonstration purposes, you
will create an AMT database from just these few files.
Filter pepXML files
It is important that the AMT database contain only high-quality peptide
entries. To filter the pepXML files, writing the filtered files to the
directory pepxml/filtered, use the following command:
Windows:
.\msinspect.bat --filter --outdir=pepxml\filtered --minpprophet=.95 pepxml\*pep.xml
Linux:
./msinspect.sh --filter --outdir=pepxml/filtered --minpprophet=.95 pepxml/*pep.xml
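A rough check on the effect of the probability filter is to count search results in a pepXML file before and after filtering. The sketch below counts "spectrum_query" elements, which is how pepXML records each searched spectrum. The function name count_spectrum_queries is made up here, and the count is a plain text match rather than a real XML parse.

```shell
# Hypothetical helper: count <spectrum_query> opening tags in a pepXML file.
# This is a text match, not an XML parser; it is meant only as a sanity check.
count_spectrum_queries() {
    grep -o '<spectrum_query[ >]' "$1" | wc -l | tr -d ' '
}
```

Comparing the count for a file in pepxml with its counterpart in pepxml/filtered shows how many results fell below the 0.95 threshold.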
Create amtXML file
Next, we will create an amtXML file to store all the entries in the AMT
database. Note: the AMT
database built here will contain retention times derived from
high-resolution MS1 peptide features. If you were not using MS1
retention times in your database, you would omit the "--ms1dir"
argument.
Windows:
.\msinspect.bat --createamt --mode=directories --out=amt\tutorial.amt.xml --ms2dir=pepxml\filtered --ms1dir=features\filtered
Linux:
./msinspect.sh --createamt --mode=directories --out=amt/tutorial.amt.xml --ms2dir=pepxml/filtered --ms1dir=features/filtered
Align runs in database
Now that the AMT database contains all the runs you want, all of the
different runs (in this case, fractions) in the AMT database will need
to be aligned to each other to resolve any nonlinear differences in
retention time:
Windows:
.\msinspect.bat --manageamt --mode=alignallruns amt\tutorial.amt.xml --out=amt\tutorial_aligned.amt.xml --showcharts
Linux:
./msinspect.sh --manageamt --mode=alignallruns amt/tutorial.amt.xml --out=amt/tutorial_aligned.amt.xml --showcharts
The "--showcharts" argument, which is available
for many commands, will display charts relevant to the command. In this
case, a chart will be displayed that shows the relationship between
retention time and normalized retention time for each of the runs in
the AMT database, as they are added. A good alignment follows
the data very closely. Displaying charts with "--showcharts"
becomes impractical for very large databases.
The chromatography in the four sample runs is very
similar, and so the alignment does not have much of an effect. In
a larger database, or a database made up of experiments from different
labs or instruments, you would see much larger changes.
Step 5: Match MS1 Features with AMT Database
Next you will perform AMT matching between the MS1 features and the AMT
database. msInspect/AMT will nonlinearly align each run to the AMT
database using the MS2 peptides as a guide, match features to the
database within wide mass and RT tolerances, and then construct a model
for the probabilities of the matches based on match error.
First you will perform a match on just one MS1
feature file, to demonstrate the process. msInspect will produce
several charts to help you assess the success of the matching process.
These charts are described in detail here.
Windows:
.\msinspect.bat --matchamt amt\tutorial_aligned.amt.xml --outdir=matched --mode=singlems1 --ms1=features\filtered\frac1.filtered.tsv --ms2dir=pepxml --modifications=C71.0366,M15.995V,C3.0101V --minmatchprob=.9 --showcharts
Linux:
./msinspect.sh --matchamt amt/tutorial_aligned.amt.xml --outdir=matched --mode=singlems1 --ms1=features/filtered/frac1.filtered.tsv --ms2dir=pepxml --modifications=C71.0366,M15.995V,C3.0101V --minmatchprob=.9 --showcharts
There are a few important things to note about the
match:
- The probability model relies on adequate
information in the AMT database. In this toy example, the AMT database
is relatively sparse, so the model is not a perfect fit. It errs on the
conservative side, assigning a lower probability to matches than is
likely warranted.
- Another reason for the imperfect fit is that
many of the peptides are represented by exactly one observation, which
is the same observation that is being matched! These peptides
will not follow the normal distribution of error assumed by the model,
since their retention time error will be exactly zero. In real, large
datasets, this effect is almost completely absent, as peptides are
observed multiple times.
- The "modifications" argument is the most
important optional argument (see the AMT User's Guide). The value
provided here is appropriate for acrylamide-labeled data. If no value
is provided, the system defaults to values appropriate for unlabeled
data.
- The "minmatchprob" argument sets the minimum
AMT match probability that will be kept in the output files. The AMT
match probability is stored as the PeptideProphet probability in the
output files. It is advisable to keep only high-probability matches,
but you could also keep all matches initially (by leaving out this
argument) and later filter the output on "minpprophet" as described
above.
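The "modifications" value used above packs three entries into one comma-separated string. The sketch below splits such a string into a small table. It reflects one reading of the value shown in this tutorial: each entry is a residue letter followed by a mass, and a trailing "V" marks the modification as variable, otherwise it is static. That reading is an assumption, and the AMT User's Guide remains the authority on the format. The function name describe_modifications is made up here.

```shell
# Hypothetical helper: print residue, mass and kind for each modification entry.
# Assumed format (not confirmed here): <residue><mass>[V], comma-separated.
describe_modifications() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r mod; do
        [ -n "$mod" ] || continue
        rest=${mod#?}              # everything after the first character
        residue=${mod%"$rest"}     # the first character
        case $rest in
            *V) kind=variable; mass=${rest%V} ;;
            *)  kind=static;   mass=$rest ;;
        esac
        printf '%s\t%s\t%s\n' "$residue" "$mass" "$kind"
    done
}
```

Under this reading, describe_modifications C71.0366,M15.995V,C3.0101V lists a static modification of 71.0366 on C, a variable 15.995 on M, and a variable 3.0101 on C.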
Now that you have seen a single match in detail, you can run a
single command to match all files at once. Note: typically, you
will be matching many files at once with this command, so the
--showcharts argument is left out, but it is possible to see charts for
all matched files by providing this argument, if you want.
Windows:
.\msinspect.bat --matchamt amt\tutorial_aligned.amt.xml --outdir=matched --mode=ms1dir --ms1dir=features\filtered --ms2dir=pepxml --minmatchprob=.9 --modifications=C71.0366,M15.995V,C3.0101V
Linux:
./msinspect.sh --matchamt amt/tutorial_aligned.amt.xml --outdir=matched --mode=ms1dir --ms1dir=features/filtered --ms2dir=pepxml --minmatchprob=.9 --modifications=C71.0366,M15.995V,C3.0101V
AMT matching should succeed for all of these
runs. If matching failed (meaning the EM algorithm did not
converge in the maximum number of iterations), details about the
failure would be displayed as output.
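After a batch match it can be useful to confirm that every feature file produced an output file. The sketch below lists feature files with no counterpart in the output directory. The naming rule it uses (NAME.tsv becomes NAME.matched.tsv) is inferred from the file names that appear in this tutorial, such as frac1.filtered.tsv and frac1.filtered.matched.tsv. The function name missing_matches is made up here.

```shell
# Hypothetical helper: list feature files in MS1_DIR with no matched output.
# Naming rule NAME.tsv -> NAME.matched.tsv is inferred from this tutorial's files.
missing_matches() {
    ms1_dir=$1
    matched_dir=$2
    for f in "$ms1_dir"/*.tsv; do
        [ -e "$f" ] || continue            # directory contains no .tsv files
        base=$(basename "$f" .tsv)
        if [ ! -e "$matched_dir/$base.matched.tsv" ]; then
            echo "$base.tsv"
        fi
    done
}
```

On Linux, missing_matches features/filtered matched prints nothing when every run has an output file.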
You may compare the AMT matching results with the
MS2 search results using msInspect's "peptidecompare" command. This
command gives you many options for comparing multiple files, peptide by
peptide. For a simple comparison of peptide overlap between the AMT and
MS2 versions of the same run, try this:
Windows:
.\msinspect.bat --peptidecompare --mode=showoverlap pepxml\frac1.pep.xml matched\frac1.filtered.matched.tsv --minpprophet=.9
Linux:
./msinspect.sh --peptidecompare --mode=showoverlap pepxml/frac1.pep.xml matched/frac1.filtered.matched.tsv --minpprophet=.9
Bear in mind that the AMT database used in this
tutorial contained very few runs, and so the number of new peptides
confidently matched by AMT is relatively small.
Step 6: Merge AMT Data with MS2 Data
The next step is to augment the MS2 data with your AMT matching data.
Of course, you may not want to combine these two types of data -- you
may wish to deal with MS2 data and AMT data completely separately.
Instructions for that type of workflow are provided in the AMT User's
Guide. This tutorial will concentrate on the combined data use case.
There are actually three different steps
associated with the merge of AMT and MS2 data, and msInspect/AMT
includes a command to perform them all at once. They are:
- Add AMT matches as additional results in the
MS2 pepXML files
- For each AMT peptide, guess a single protein
that it may have come from (using a FASTA database)
- Run RefreshParser (part of the Trans-Proteomic
Pipeline) to guess the rest.
The last two steps are only necessary if you intend to continue with
protein inference via ProteinProphet. The reasons for these last two
steps, and directions on installing the Trans-Proteomic Pipeline, are
provided in the AMT User's Guide. As described there, you will need to
make sure that RefreshParser is on your path before running the next
command.
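On Linux, a quick way to confirm that the external tools are on your path is the shell's built-in "command -v". The sketch below prints the name of each requested tool that cannot be found. The function name missing_tools is made up here.

```shell
# Hypothetical helper: print each named tool that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For the remaining steps of this tutorial you could run: missing_tools java R RefreshParser ProteinProphet Q3ProteinRatioParser. Empty output means everything was found.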
For the last two steps, you will need access to
the same FASTA database that was used in the MS/MS database search. For
the search that created these pepXML files, that database is the August
23, 2006 version (version 3.20) of the human IPI database. The file is
prohibitively large to include with this tutorial. As of this time, the
database can be downloaded from EBI's FTP site, here: ftp://ftp.ebi.ac.uk/pub/databases/IPI/old/HUMAN/ipi.HUMAN.v3.20.fasta.gz.
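Download this file into the amt_tutorial directory of the tutorial
bundle, where you have been working, and unzip it: on Linux you can use
"gunzip ipi.HUMAN.v3.20.fasta.gz"; on Windows you can use WinZip or a
similar utility. The merge command below refers to the database as
ipi.HUMAN.fasta.20060823, so either rename the unzipped file to that
name or change the "--fasta" argument to match the name of your file.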
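Because the download is large, it is worth checking that the unzipped file is a complete, readable FASTA file before continuing. The sketch below simply counts header lines (lines beginning with ">"), one per protein entry. The function name count_fasta_entries is made up here.

```shell
# Hypothetical helper: count entries in a FASTA file by counting '>' header lines.
count_fasta_entries() {
    grep -c '^>' "$1"
}
```

A result of zero, or an error from grep, suggests that the download or the unzip step did not finish correctly.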
Now you are ready to merge the AMT and MS2
results:
Windows:
.\msinspect.bat --combineamtms2 --outdir=merged_ms2_amt\ --ms2dir=pepxml --amtdir=matched --fasta=ipi.HUMAN.fasta.20060823 --restrictcharge --refreshparser --guessproteins
Linux:
./msinspect.sh --combineamtms2 --outdir=merged_ms2_amt/ --ms2dir=pepxml --amtdir=matched --fasta=ipi.HUMAN.fasta.20060823 --restrictcharge --refreshparser --guessproteins
The resulting files in the merged_ms2_amt
directory contain both the AMT and the MS2 results. If you run the
peptidecompare command from the last step on one of the merged files
and its corresponding MS2 file, you will see that the merged file
contains all of the peptides in the MS2 file, and more.
Step 7: Run Q3 for Labeled Quantitation
Since these particular data are isotopically labeled, you will want to
run a quantitation algorithm to determine abundance ratios for each
labeled peptide. A good quantitation algorithm for acrylamide-labeled
data like these is Q3, which is included with msInspect. Unfortunately,
Q3 requires access to the original mzXML
files, which are too large to include in this tutorial, so you will
not be able to run this command. If the mzXML files were available,
you would run Q3 on the merged files as follows:
Windows:
.\msinspect.bat --q3new --labeledResidue=C --massDiff=3.01006449 --outdir=merged_ms2_amt\q3 merged_ms2_amt\*pep.xml -dmzXML --stripoldq3 --maxFracDeltaMass=20ppm --minPeptideProphet=.75 --forceoutput
Linux:
./msinspect.sh --q3new --labeledResidue=C --massDiff=3.01006449 --outdir=merged_ms2_amt/q3 merged_ms2_amt/*pep.xml -dmzXML --stripoldq3 --maxFracDeltaMass=20ppm --minPeptideProphet=.75 --forceoutput
We have provided the output of this command in
the directory merged_ms2_amt/q3 as a demonstration.
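Two of the numeric arguments in the Q3 command can be checked with simple arithmetic. The "massDiff" value of 3.01006449 is numerically equal to three times a 13C minus 12C mass difference of 1.00335483 Da, which suggests (although this tutorial does not say so) a heavy label that differs from the light one by three 13C atoms. The sketch below reproduces that product and also converts a ppm tolerance into an absolute mass at a given mass, assuming that "20ppm" is meant as a relative mass tolerance. The function names and the constant are introduced here for illustration.

```shell
# Assumed 13C - 12C mass difference in Da (rounded; an assumption of this sketch)
C13_MINUS_C12=1.00335483

# Mass difference for a label carrying N 13C atoms in place of 12C
label_mass_diff() {
    awk -v n="$1" -v d="$C13_MINUS_C12" 'BEGIN { printf "%.8f\n", n * d }'
}

# Absolute tolerance in Da for a relative tolerance in ppm at a given mass
ppm_to_da() {
    awk -v ppm="$1" -v m="$2" 'BEGIN { printf "%.4f\n", m * ppm / 1000000 }'
}
```

Here label_mass_diff 3 prints 3.01006449, the value passed to "--massDiff", and ppm_to_da 20 1000 prints 0.0200, so 20 ppm corresponds to 0.02 Da at a mass of 1000.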
Step 8: Protein Inference
The final step in this workflow is protein inference. For this step,
you will need ProteinProphet available and on your path. Note that this
step uses the files in the "q3" directory, which were provided in the
tutorial; you can run this on the pep.xml files that you created in the
merged_ms2_amt directory if you prefer, but quantitation information
will not be available.
Windows:
ProteinProphet merged_ms2_amt\q3\*pep.xml protxml\merged.prot.xml
Linux:
ProteinProphet merged_ms2_amt/q3/*pep.xml protxml/merged.prot.xml
This will create a protXML file containing all
proteins inferred from the peptide evidence in your merged files. Since
these are acrylamide-labeled data, we will need to run Q3ProteinRatioParser
(part of the TPP) to process the Q3 quantitative information.
Windows:
Q3ProteinRatioParser protxml\merged.prot.xml
Linux:
Q3ProteinRatioParser protxml/merged.prot.xml
You may now work with this protXML file just as
you would a protXML derived only from MS2 data. For comparison
purposes, we have provided a protXML file that was created using only
the MS2 data, and not the AMT data. msInspect contains tools to compare
two protXML files.
Windows:
.\msinspect.bat --protxmlcompare protxml\ms2_only.prot.xml protxml\merged.prot.xml
Linux:
./msinspect.sh --protxmlcompare protxml/ms2_only.prot.xml protxml/merged.prot.xml
Again, bear in mind that the AMT database used in
this tutorial contained very few runs, and so the gains from the AMT
data are minimal at best. A real AMT match using a database from dozens
or hundreds of fractions would provide a significant boost to
identified peptides and proteins.
This concludes the msInspect/AMT tutorial. Please
provide feedback on the tutorial process, including any errors you may
have encountered, in order to help us make the tutorial as useful as
possible.