msInspect/AMT User Guide



Contents

  1. Introduction
  2. Running msInspect
  3. Creating an AMT Database
  4. Finding and Filtering LC-MS Peptide Features
  5. Matching to the AMT Database
  6. Interpreting the Results
  7. Combining AMT Results with MS/MS results
  8. Labeled Quantitation
  9. Using AMT Results with ProteinProphet

Introduction

This is the User Guide for the msInspect/AMT platform, a set of tools for:

  • creating and managing Accurate Mass and Time databases from high-quality MS/MS identifications,
  • matching high-resolution LC-MS peptide features to those data, and
  • integrating the AMT results with MS/MS identifications from the same LC-MS runs.

The msInspect/AMT platform takes the Accurate Mass and Time approach first described by the Smith laboratory and implements those ideas in an open-source, platform-independent Java software suite, with many original features that we have described in our publications.

The instructions here are intended to help you use msInspect/AMT flexibly, depending on the configuration of your data. We highly recommend beginning with the more specific msInspect/AMT Tutorial before using msInspect/AMT on your own data. The brief tutorial will give you a good feel for the sequence of commands involved, and will help you get msInspect up and running. We also recommend using "interactive mode" (the --interactive command) if you're having trouble providing the correct arguments to the various commands, since "interactive mode" provides quite a bit of online help.

If you have questions about msInspect/AMT that you can't find answers to in this guide or in the tutorial, please ask them on our msInspect Support Forum. Also, please check the msInspect/AMT FAQ.

This User Guide explains the use of the msInspect/AMT platform for Accurate Mass and Time analysis. The purpose of msInspect/AMT is to locate LC-MS peptide features by their mass and retention time, assign peptide sequence IDs to those features using a database of identifications from MS/MS database searches, assign a probability score to each sequence assignment, and finally (if desired) use the identified peptides, along with peptide identifications from MS/MS, in a protein inference workflow.

In order to use msInspect/AMT, you will need to be able to provide:

  1. pepXML files containing the results of an MS/MS database search on all runs that you would like to include in your database (and, ideally, on all the runs that you will match to the database)
  2. mzXML files containing the LC-MS scans from all the runs in the AMT database, and from all the runs you will match to the database

In order to roll AMT results up to the protein level, you will need to have the Trans-Proteomic Pipeline (or at least RefreshParser and ProteinProphet) installed and be familiar with their use.

The algorithms that support this analysis workflow will not be discussed in detail here. For details on the algorithms used, please see May D, Fitzgibbon M, Liu Y, Holzman T, Eng J, Whiteaker J, Paulovich A, McIntosh M. A Platform for Accurate Mass and Time Analyses of Mass Spectrometry Data, Journal of Proteome Research 2007.

msInspect/AMT is built on the msInspect platform, which is open-source and written largely in Java and R. Details and source code for msInspect are available from the msInspect website.

Running msInspect

msInspect/AMT requires Java 1.5. If you already have a later version of Java on your machine, try that first; if you run into problems, you may need to install 1.5.

It also requires R version 2.2 or later (some optional functionality requires 2.5.1 or later). Make sure that R is installed and is on your path, i.e., that typing "R" (or "RCMD" on Windows) at a command prompt starts the correct version of R. msInspect/AMT requires the "quantreg" package, which is not installed by default with R. quantreg is also distributed on CRAN, so install.packages("quantreg") is an alternative; the installation originally recommended for msInspect/AMT, run from the R command prompt, is as follows:

source("http://bioconductor.org/biocLite.R");
biocLite(c("quantreg"));

To run msInspect, download the msInspect JAR file from our website. All of the msInspect/AMT functionality is available only on the command line, so you will not be able to use the Java WebStart version of msInspect.

Set up Java and R as described above. To run msInspect, use this command:

    java -Xmx1024M -jar <msinspect_path>

where 1024 is the number of megabytes of memory to provide to msInspect, and <msinspect_path> is the path to the JAR file you downloaded.

Throughout this guide, we will use the shorthand msinspect in place of the above command.
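
If you drive msInspect from scripts, the shorthand can be captured in a small wrapper. The following Python sketch shows one way to assemble the command line described above; the helper names and the JAR path are illustrative assumptions and are not part of msInspect.

```python
import subprocess


def build_msinspect_command(jar_path, args, memory_mb=1024):
    """Return the argument list for: java -Xmx<memory_mb>M -jar <jar_path> <args...>"""
    return ["java", f"-Xmx{memory_mb}M", "-jar", jar_path, *args]


def run_msinspect(jar_path, args, memory_mb=1024):
    """Run msInspect; raises CalledProcessError if the process exits with an error."""
    return subprocess.run(build_msinspect_command(jar_path, args, memory_mb), check=True)


if __name__ == "__main__":
    # Hypothetical JAR location: substitute the path to the file you downloaded.
    command = build_msinspect_command("/path/to/msInspect.jar", ["--help", "createamt"])
    print(" ".join(command))
```

Keeping the command as a list (rather than one string) avoids quoting problems when paths contain spaces.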

For help with any msInspect command, you can:

  1. run "msinspect --help <command_name>" for textual help on the command and all arguments
  2. run "msinspect --interactive <command_name>" for a graphical window that allows you to specify all arguments for the command, and shows all of the same help information as the "help" command in a friendlier layout
  3. run "msinspect --usermanual <command_name>" for an HTML-formatted manual entry on the command

Creating an AMT Database

The first step in the msInspect/AMT workflow is to create an AMT database, a collection of observations of peptides from MS/MS database searches of multiple runs. An AMT database may be small, with data from only 5-10 runs, or it may contain hundreds of related MS/MS runs. An AMT database contains mass and Normalized Retention Time (NRT) information about every peptide observation in the database, broken down by the different modifications with which the peptide was observed.

The AMT database file created by msInspect uses an XML format called "amtXML".

  • The top of the file contains summary information about all of the runs in the file.
  • The body of the file is a collection of peptide_entry elements, one for each unique peptide sequence.
  • Each peptide_entry element contains summary information for that peptide, including the median NRT with which the peptide was observed. This is the information used in matching.
  • Each peptide_entry element contains a separate modification_state entry for each "modification state" in which the peptide was observed, e.g., with oxidized methionine in position 7 and carbamidomethylated cysteine in position 3. Summary information for all the observations at the modification state level is maintained.
  • Each modification_state element contains a separate observation element for each time the peptide was observed in any run. This observation element retains the original retention time of the peptide within the run, and an indication of which run it was observed in.
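
To make the nesting concrete, here is a minimal Python sketch that walks a toy document shaped like the description above. Only the element names peptide_entry, modification_state and observation come from this guide; the root element and every attribute name in the toy document are hypothetical placeholders, and a real amtXML file may also use XML namespaces that this sketch does not handle.

```python
import statistics
import xml.etree.ElementTree as ET

# Toy document. The root element and all attribute names are hypothetical;
# only the three nested element names follow the description in this guide.
TOY_AMTXML = """
<amt_database>
  <peptide_entry peptide_sequence="PEPTIDEK">
    <modification_state>
      <observation nrt="21.0" run="1"/>
      <observation nrt="23.0" run="2"/>
    </modification_state>
    <modification_state>
      <observation nrt="22.0" run="2"/>
    </modification_state>
  </peptide_entry>
</amt_database>
"""


def summarize(xml_text):
    """Map each peptide to (modification states, observations, median NRT)."""
    root = ET.fromstring(xml_text)
    summary = {}
    for entry in root.iter("peptide_entry"):
        states = entry.findall("modification_state")
        # Observations live one level below the modification states.
        nrts = [float(obs.get("nrt")) for obs in entry.iter("observation")]
        summary[entry.get("peptide_sequence")] = (
            len(states), len(nrts), statistics.median(nrts))
    return summary


print(summarize(TOY_AMTXML))
```

The median is recomputed here from the individual observations only for illustration; in a real database the peptide-level summary is already stored on the peptide_entry element.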

Preparing All Necessary Files

In order to create an AMT database, you will need a separate pepXML file for each run you wish to include in the database.

If you have a single pepXML file containing multiple runs (e.g., fractions), you will first need to break it up into a separate file for each run (these separate files are also needed in later steps). msInspect provides a tool for this:

msinspect --extractrunsfrompepxml <input_file> <output_directory>

This tool will create a separate file for each individual run in input_file, and write them all to output_directory.

Retention Time

Retention times for each peptide observation are crucial for developing an accurate mapping from retention time to NRT within each run, and for storing the observations themselves.  You have the option of gathering retention times from MS/MS data themselves, or from MS1 features that match uniquely to MS/MS peptides by mass and time (this is discussed below).  In either case, the MS/MS identifications must have retention time information.

Some versions of the converters that create pepXML files do not carry this information forward. You may need to provide retention times for each scan within the pepXML file, by transferring this information from the mzXML file from the same run.

You can tell whether your pepXML files have retention time information by looking for the retention_time_sec attribute on the spectrum_query elements within the files (or by waiting for the matchamt command to fail with an error message about missing retention times).
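
For a quick check without opening the file by hand, a short script can count the spectrum_query elements that lack the attribute. This is a minimal sketch using only the Python standard library; the function names and the sample document are illustrative, and the sample is far smaller than a real pepXML file.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def count_missing_retention_times(pepxml_source):
    """Return (total spectrum_query elements, number lacking retention_time_sec).

    pepxml_source may be a file name or a binary file-like object.
    """
    total = missing = 0
    for _, elem in ET.iterparse(pepxml_source, events=("end",)):
        if local_name(elem.tag) == "spectrum_query":
            total += 1
            if elem.get("retention_time_sec") is None:
                missing += 1
            elem.clear()  # release memory; pepXML files can be large
    return total, missing


# Tiny illustrative document, not a complete pepXML file.
SAMPLE = b"""<msms_pipeline_analysis>
  <msms_run_summary>
    <spectrum_query spectrum="run1.00001.00001.2" retention_time_sec="612.4"/>
    <spectrum_query spectrum="run1.00002.00002.2"/>
  </msms_run_summary>
</msms_pipeline_analysis>"""

print(count_missing_retention_times(io.BytesIO(SAMPLE)))  # (2, 1)
```

A non-zero second number means at least some identifications need their retention times populated from the mzXML files.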

If the information is not already there, it must be extracted from the mzXML files associated with each pepXML file. To provide this information to msInspect, run the "populatems2times" msInspect command to create a text version of each pepXML file that contains the RT information pulled from the mzXML files with the same names (other than extension) as the pepXML files:

msinspect --populatems2times --mzxmldir=<mzxml_directory> <input_files> --outdir=<output_directory>

These files may then be used exactly like pepXML files for the commands that follow.

Filtering pepXML files

For creating the AMT database, and also for aligning retention times between MS runs and the AMT database, you will only want to consider very high-quality MS/MS features. You can filter your pepXML files however you like. msInspect provides a utility to filter by PeptideProphet probability:

msinspect --filter --minpprophet=.95 --outdir=<output_directory> <input_files...>
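
If you prefer to filter with your own tooling, the thresholding idea is simple to reproduce. The sketch below is not the msInspect implementation; it only illustrates keeping search hits whose PeptideProphet probability meets a cutoff, assuming the usual pepXML layout in which a peptideprophet_result element with a probability attribute is nested inside each search_hit. The function names and sample data are illustrative.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def peptides_passing(pepxml_text, min_probability=0.95):
    """Return peptide sequences of search hits at or above the probability cutoff."""
    root = ET.fromstring(pepxml_text)
    kept = []
    for hit in root.iter():
        if local_name(hit.tag) != "search_hit":
            continue
        for result in hit.iter():
            if local_name(result.tag) == "peptideprophet_result":
                if float(result.get("probability")) >= min_probability:
                    kept.append(hit.get("peptide"))
                break  # one PeptideProphet result per hit is assumed
    return kept


# Tiny illustrative document, not a complete pepXML file.
SAMPLE = """<msms_pipeline_analysis>
 <msms_run_summary>
  <spectrum_query spectrum="run1.00010.00010.2">
   <search_result>
    <search_hit hit_rank="1" peptide="LVNELTEFAK">
     <analysis_result analysis="peptideprophet">
      <peptideprophet_result probability="0.99"/>
     </analysis_result>
    </search_hit>
   </search_result>
  </spectrum_query>
  <spectrum_query spectrum="run1.00011.00011.2">
   <search_result>
    <search_hit hit_rank="1" peptide="AEFVEVTK">
     <analysis_result analysis="peptideprophet">
      <peptideprophet_result probability="0.62"/>
     </analysis_result>
    </search_hit>
   </search_result>
  </spectrum_query>
 </msms_run_summary>
</msms_pipeline_analysis>"""

print(peptides_passing(SAMPLE))  # ['LVNELTEFAK']
```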

Running the "createamt" Command

Now that you have assembled all of the necessary files to create an AMT database, run the msInspect command that creates the database. There are several arguments to this command; here is the most common use case:

msinspect --createamt --out=<output_file> --mode=pepxmlfiles --ms2dir=<ms2_directory> --ms1dir=<ms1_directory>

Note: the MS2 directory referenced in the 'ms2dir' argument can contain either pepXML files or .tsv files generated through the "--filter" command described above. All files in the directory will be added to the database, so the directory should contain only those files you want in the database.

If the "--ms1dir" argument is not specified, the observations stored in the database will come directly from high-quality MS/MS peptide identifications.  If the "--ms1dir" argument is specified, then the MS1 feature files (see below) encountered in the "--ms1dir" directory will be matched to MS/MS peptide identifications by mass and retention time (you can control the size of the matching window using the "--deltamassppm" and "--deltatime" arguments), and the times stored in the database will be the retention times from MS1 features that match uniquely to MS/MS peptide identifications.  msInspect will search the "--ms1dir" directory for files that match the MS/MS files from the "--ms2dir" directory by name (they may have different extensions).
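
The idea of a unique match inside a mass and time window can be sketched in a few lines of Python. This is an illustration of the concept only, not msInspect's matching code; the dictionary keys, the tolerance values and the example masses are all assumptions chosen for the demonstration.

```python
def ppm_error(observed_mass, reference_mass):
    """Signed mass error in parts per million relative to the reference mass."""
    return (observed_mass - reference_mass) / reference_mass * 1e6


def unique_ms1_match(ms2_id, ms1_features, delta_mass_ppm, delta_time):
    """Return the single MS1 feature inside both windows, or None.

    None is returned both when nothing matches and when the match is
    ambiguous (more than one feature falls inside the windows).
    """
    hits = [
        feature for feature in ms1_features
        if abs(ppm_error(feature["mass"], ms2_id["mass"])) <= delta_mass_ppm
        and abs(feature["time"] - ms2_id["time"]) <= delta_time
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical values: masses in Da, times in seconds.
identification = {"mass": 1000.0, "time": 600.0}
features = [
    {"mass": 1000.005, "time": 605.0},  # about 5 ppm away, 5 s apart
    {"mass": 1000.050, "time": 602.0},  # about 50 ppm away: outside the mass window
]
print(unique_ms1_match(identification, features, delta_mass_ppm=10, delta_time=30))
```

Widening either window raises the chance that several features qualify, in which case no retention time would be taken from MS1 for that identification.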

In general, if you are building your AMT database from runs with high-resolution data, we recommend that you supply the "--ms1dir" argument.  This is because the retention time from MS1 is a much more accurate predictor of future MS1 retention time than MS/MS retention time is.  However, if you do not have high-resolution MS1 data for your runs, do not supply this argument. 

If you supply the "--ms1dir" argument, you will first need to find and filter LC-MS peptide features within each run, as described below.

Combining Multiple AMT Databases

If you have created multiple AMT databases and wish to combine them (e.g., to combine information from multiple experiments), you may do so with this command:

msinspect --createamt --out=<output_file> --mode=amtxmls <amtxml_files>

Note: do not combine AMT databases whose retention times are from MS1 with AMT databases whose retention times are from MS2!  If you need to include some runs where high-resolution data are not available, just use MS2 retention times for all runs.

Aligning Database Runs

Once you have created an AMT database that contains all the runs relevant to your analysis, you should nonlinearly align all runs in the database to each other. This step eliminates nonlinear variations in gradient that are not removed by the linear mapping that is done when the database is created. You only need to do this step once, on the final AMT database you wish to match against.  You can do this as follows: 

msinspect --manageamt --mode=alignallruns --out=<output_amt_file> <input_amt_file>

The msInspect "amtdiagnostic" command with "mode=basicinfo" will give you summary information about your AMT database: how many peptide entries, runs, etc.

Finding and Filtering LC-MS Peptide Features

Now that you have an AMT database, you need to locate the LC-MS peptide features that you want to match to the database. You can do this with msInspect, or with any other feature-finding program whose format is supported by msInspect/AMT. msInspect/AMT supports the following LC-MS feature file formats:

  • msInspect .tsv (tab-separated values)
  • APML, an XML standard file format for LC-MS data
  • SpecArray's text format
  • Hardklor's text format

SpecArray and Hardklor formats are supported through the "convertfeaturefile" msInspect command, which can convert them to the msInspect text format.

To find your own peptide features using msInspect, you can use the "findpeptides" command:

msinspect --findpeptides --outdir=<output_directory> <input_file> <input_file> ...

The above command operates on multiple files at once (which can be specified with a "*" wildcard), but you can also run the "findpeptides" command on one file at a time.

By default, msInspect reports every peptide feature it can find in each file. For AMT matching it is important to retain only high-quality features, so the feature files should be filtered before matching. A loose set of filtering criteria might be applied as follows:

msinspect --filter --outdir=<output_directory> --minpeaks=2 --maxkl=3 <input_file> <input_file> ...

where "minpeaks" is the minimum number of isotopic peaks identified in each feature, and "maxkl" is the maximum value for the "K/L score", an LC-MS feature quality metric. More or less strict filtering criteria may be used depending on whether you're having more trouble with specificity or with sensitivity in your AMT matching.
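
The effect of these two criteria can be reproduced on any tab-separated feature table. The Python sketch below assumes the table has columns named "peaks" and "kl" and that comment lines start with "#"; both are assumptions about the file layout, so check the header of your own feature files before relying on it.

```python
import csv


def filter_features(tsv_text, min_peaks=2, max_kl=3.0):
    """Keep rows with at least min_peaks isotopic peaks and a K/L score of at most max_kl."""
    # Skip blank lines and comment lines (assumed to start with '#').
    lines = [line for line in tsv_text.splitlines()
             if line.strip() and not line.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    return [row for row in reader
            if int(row["peaks"]) >= min_peaks and float(row["kl"]) <= max_kl]


# Small hypothetical feature table; column names are assumptions.
ROWS = [
    ("mz", "charge", "kl", "peaks"),
    ("500.27", "2", "0.4", "4"),   # kept
    ("612.81", "2", "5.1", "3"),   # dropped: K/L score too high
    ("433.90", "3", "1.2", "1"),   # dropped: too few peaks
]
SAMPLE_TSV = "# example comment line\n" + "\n".join("\t".join(row) for row in ROWS)

for feature in filter_features(SAMPLE_TSV):
    print(feature["mz"], feature["kl"], feature["peaks"])
```

Tightening min_peaks or lowering max_kl trades sensitivity for specificity, which mirrors the choice described above.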

You can also find and filter LC-MS features one run at a time using msInspect's graphical viewer mode (simply run the msInspect command with no arguments, see msInspect user's guide for details).

Matching to the AMT Database

Now it's finally time to match your LC-MS features to the AMT database and assign peptide IDs. In order to do this, you'll need:

  • An AMT database, preferably having had the "alignallruns" command run on it
  • One or more LC-MS feature files, preferably filtered
  • If you have them, pepXML-formatted MS/MS search results from each LC-MS run you want to match to the database

This last item is used in order to develop the mapping from retention time in the LC-MS run to the NRT scale represented in the database. This mapping is done by doing an initial pairing of LC-MS features with peptide entries in the AMT database, and then performing modal regression on the retention times of the LC-MS features and the NRT values of the associated AMT database entries. If you supply a set of MS/MS search results, then the initial pairing is done using the peptide IDs of the MS/MS results. If not, the initial pairing is done using a tight mass-only match between LC-MS features and peptide entries. In general, the pairing using MS/MS peptide IDs is much more reliable.

To perform a match between the database and a single file of LC-MS features, use this command:

msinspect --matchamt --mode=singlems1 --ms1=<filtered_features_file> --embeddedms2=<MSMS_search_result_file> --out=<output_file> <amtXML_file> --showcharts

The "--showcharts" option will display a number of charts that describe the performance of the AMT match. The charts will be displayed in a series of tabs in a single dialog window. These charts are explained below. Note: if you are running msInspect on a remote computer, displaying charts may take a very long time. In that case you can use the "--savechartsdir" option instead of --showcharts. This will not display the charts on-screen, but it will save them to files in a directory you specify, which can be much faster.

The output of the "matchamt" command is an LC-MS feature file identical to the input feature file, except that some features now carry peptide sequence assignments with probabilities.

You can perform AMT matching on an entire directory full of LC-MS feature files, one after another. In addition to being more convenient, this also saves a good deal of time, since the AMT database only needs to be loaded into memory once.

msinspect --matchamt --mode=ms1dir --ms1dir=<filtered_features_directory> --ms2dir=<MSMS_search_result_dir> --outdir=<output_directory> <amtXML_file>

If you use the "--showcharts" argument when matching multiple runs, a separate set of tabs will be created containing the charts related to each matched file. This can become quite memory-intensive and is not recommended for large numbers of files.

There are a large number of arguments to the "matchamt" command that you may wish to adjust. The "--help" command and the graphical "--interactive" mode are particularly helpful for the AMT matching command, describing all matching arguments in great detail.

The most important non-required argument to pay attention to is "modifications". This argument lets the matching algorithm know what modifications to expect on peptides in the LC-MS feature file(s). By default, a static modification on Cysteine of 57.021 Da (representing carbamidomethylation by iodoacetamide) and a variable modification on Methionine of 15.995 Da (representing oxidation) are assumed. If your data have other static or potential modifications (e.g., isotopic labeling, see below), you should declare them here.

Note that the modifications declared here do not necessarily need to be the same modifications that were used on the LC-MS/MS data that make up the database. As long as the modification itself doesn't have a large impact on peptide retention time, you can, for instance, use unlabeled LC-MS data with an AMT database generated from isotopically labeled data.

Note: if you used MS/MS retention times in building the AMT database, rather than MS1 retention times, you would add the argument "--usems1foralignment".  This would instruct msInspect to use MS2 retention times, rather than MS1, to create the alignment between the run and the database.

Interpreting the Results

There are several charts displayed via the "--showcharts" option during matching that will help you to understand how well the AMT matching algorithm performed.

  • T->H Map: the datapoints used for the modal regression to develop the nonlinear map from RT to NRT. The mapping itself is indicated as a line on this chart.
  • Before Calibration: describes the mass calibration of the LC-MS features, prior to recalibration. LC-MS features are calibrated using the initial match to the AMT database, in order to ensure a normal distribution of mass error. A line indicates the regression result used for calibration.
  • After Calibration: describes the mass calibration of the LC-MS features after recalibration
  • Loose match error data: shows all AMT matches within a wide tolerance window
  • Decoy data: shows all AMT matches to a decoy AMT database created by adding a fixed value to AMT feature masses
  • EM dist analysis: plots describing how well the estimated distribution fits the data, one for each dimension. In the charts on the left, the estimated distributions in each dimension are overlaid with the actual data density; ideally there should be little difference. The quantile-quantile plots on the right are derived from the same data and should ideally be a 1:1 line.
  • Distribution: 3D perspective plot of the raw match data density (gray) with the estimated distribution superimposed in a red mesh. Ideally the two distributions should lie very closely atop one another. This plot requires R version >= 2.5.1
  • EM Parameters: shows the convergence of all the parameters estimated by the EM algorithm
  • All probabilities: shows the same datapoints from the "Loose match error data" plot, color-coded by the probability assigned with the EM model. High-probability points are blue.
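
The reasoning behind the "Decoy data" chart can be illustrated with a small Python sketch: shifting every database mass by a fixed offset produces entries that should match real features only by chance, so the number of decoy matches gives a feel for how many matches to the real database might be spurious. This is a conceptual illustration only. The offset, the tolerance, the example masses, and the choice to shift the database masses are all assumptions; this guide does not state the values msInspect uses.

```python
def count_mass_matches(feature_masses, database_masses, tolerance_ppm):
    """Count features that fall within tolerance_ppm of at least one database mass."""
    count = 0
    for feature_mass in feature_masses:
        if any(abs(feature_mass - db_mass) / db_mass * 1e6 <= tolerance_ppm
               for db_mass in database_masses):
            count += 1
    return count


def decoy_masses(database_masses, offset):
    """Build decoy entries by adding a fixed offset to every database mass."""
    return [mass + offset for mass in database_masses]


# Hypothetical masses in Da and a hypothetical decoy offset of 11 Da.
database = [1000.0, 1500.0, 2000.0]
features = [1000.002, 1500.001, 1511.003, 1811.9]

target_hits = count_mass_matches(features, database, tolerance_ppm=10)
decoy_hits = count_mass_matches(features, decoy_masses(database, 11.0), tolerance_ppm=10)
print(target_hits, decoy_hits)  # 2 1
```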

Combining AMT Results with MS/MS results

Once you have performed AMT matching and assigned peptide sequences, with probabilities, to many LC-MS features, you can use these identifications on their own or combine them with existing LC-MS/MS data for the same runs, if available. If you want to work with AMT-only data, you can skip ahead to the "convertfeaturefile" step described later in this section.

First, if you intend to combine the AMT data with MS/MS data, you will need to combine the two types of files:

msinspect --combineamtms2 --outdir=<output_directory> --indir=<pepXML_directory> --indir2=<amt_matched_dir> --fasta=<FASTA_database> --restrictcharge --refreshparser --guessproteins

This command will identify pairs of MS/MS and AMT files and combine each pair, given certain constraints such as the maximum charge to carry forward. You can also run a version of this command on individual pairs of files. The output files will be in pepXML format.

The "guessproteins" argument uses the provided FASTA database to guess a single protein for each peptide identification, which is time-consuming but necessary in order to run RefreshParser (part of the TPP) on the files. The --refreshparser argument runs RefreshParser itself; this can be run separately (below) if RefreshParser is not on your path.

If you are not combining your AMT results with MS/MS data, you can simply run the "convertfeaturefile" msInspect command to transform your AMT matching results to pepXML format.

Because ProteinProphet (see below) requires all possible proteins to be identified for each peptide, you will need to run RefreshParser, which will interrogate the FASTA file again, on all of the combined pepXML files. If you have RefreshParser on your path, using the "refreshparser" argument as described above will do this for you. However, if you need to run RefreshParser separately, unfortunately there is no built-in mechanism for running it on a number of files at once. Here is an example of a command for running it on a number of files at once, under Linux (this assumes that RefreshParser is on your path):

find merged_unfiltered -name "*pep.xml" -exec RefreshParser {} /mnt/cpl/data/fasta/ipi.HUMAN.fasta.20060823 \;

Labeled Quantitation

If the data you are analyzing contain isotopic labels, you can infer isotopic ratios from the AMT results in which either light or heavy labeled states (or both) are matched via AMT. Be sure to declare the light version of the isotopic label as a static modification during matching, and the difference between the light and heavy label weights as a variable modification.

For instance, acrylamide labeling has a 3.0101-Da separation between light and heavy labels on Cysteine, and the light label weighs 71.0366 Da and takes the place of the "normal" iodoacetamide modification used to protect Cysteine. Oxidized Methionine is still a possibility. So the correct "modifications" argument becomes "--modifications=C71.0366,M15.995V,C3.0101V".
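
To see what such a declaration implies for a given peptide, the following Python sketch parses a modifications string of the form shown above and lists the total mass shifts it allows. The parsing rules are inferred from the examples in this guide (residue letter, then mass, then an optional trailing "V" for variable) and are an assumption about the format, not msInspect's own parser; the function names are illustrative.

```python
def parse_modifications(spec):
    """Parse e.g. 'C71.0366,M15.995V' into (residue, mass, is_variable) tuples."""
    mods = []
    for token in spec.split(","):
        token = token.strip()
        is_variable = token.endswith("V")  # assumed marker for a variable modification
        if is_variable:
            token = token[:-1]
        mods.append((token[0], float(token[1:]), is_variable))
    return mods


def possible_mass_shifts(peptide, mods):
    """List the distinct total mass shifts the declared modifications allow."""
    shifts = {0.0}
    for residue in peptide:
        # Static modifications always apply; each variable one may or may not.
        static = sum(mass for r, mass, var in mods if r == residue and not var)
        variable = [mass for r, mass, var in mods if r == residue and var]
        next_shifts = set()
        for shift in shifts:
            base = shift + static
            next_shifts.add(round(base, 4))
            for extra in variable:
                next_shifts.add(round(base + extra, 4))
        shifts = next_shifts
    return sorted(shifts)


mods = parse_modifications("C71.0366,M15.995V,C3.0101V")
print(mods)
# A peptide with one Cysteine: light label only, or light label plus the heavy difference.
print(possible_mass_shifts("ACK", mods))
```

For a peptide with one Cysteine and no Methionine this yields two candidate shifts, corresponding to the light and heavy forms of the label.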

Once you have performed matching with any variable modification, you can use that modification to identify abundance ratios between light and heavy isotopes with the Q3 algorithm, which is available as an msInspect command (invoked as "--q3new" in the example below):

msinspect --q3new --labeledResidue=C --massDiff=3.01006449 --out=<output_file> <input_file> -d<mzxml_directory> --maxFracDeltaMass=20ppm --minPeptideProphet=.75 --forceoutput

This command will produce new LC-MS feature files with AMT peptide IDs, in which isotopic pairs are associated with each other and their intensity ratios are calculated. These ratios will be rolled forward in protein inference. The command above is appropriate for the Q3 algorithm, but any quantitation algorithm that works from pepXML files (e.g., XPress) can be used at this point.

Using AMT Results with ProteinProphet

Note: If you are not combining MS2 and AMT data as described above, you will need to perform a few more steps before proceeding to ProteinProphet. See "Preparing AMT-Only Data for ProteinProphet" below.

Now you are ready to use these data in protein inference. These steps involve the Trans-Proteomic Pipeline (TPP). The end result of this pipeline is a protXML file containing information about protein identifications and their associated peptide identifications and (if present) quantitation ratios. To run ProteinProphet using only your combined AMT-MS/MS files and not the original MS/MS pepXML files:

ProteinProphet <pepXML_file> <pepXML_file> ... <output_protXML_file>

Finally, for isotopically-labeled data you will need to run a quantitation algorithm to identify intensity ratios for all quantifiable proteins by combining peptide-level ratios. For a 3-Dalton-separation label such as acrylamide, for instance, you can run the Q3 algorithm as follows:

Q3ProteinRatioParser <protXML_file>

Preparing AMT-Only Data for ProteinProphet

If you are only using AMT data, and not combining it with MS2 data as described above, you'll need to perform a few more steps before running ProteinProphet:

  1. Convert your matching output files to pepXML, specifying a FASTA database: msinspect --convertfeaturefile --outdir=<output_dir> --outformat=pepxml --fasta=<FASTA_file> <input_files...>
  2. Guess an initial protein for each peptide match (necessary for RefreshParser, as described above): msinspect --guessproteinsfromfasta --fasta=<FASTA_file>
  3. Run RefreshParser, as described above
  4. If the data are isotopically labeled, run Q3, as described above