Introduction
Like genomics, proteomics uses a variety of data formats to store information about biological samples. Some are shared between the two fields: FASTA, for example, stores sequence data in both. Others capture information produced by the measurement technology itself, such as the mass-to-charge (m/z) ratios recorded by mass spectrometry. These formats structure data so that it can be analyzed computationally and shared among researchers. Here, we explore three key formats used in proteomics: FASTA, mzML, and pepXML.
FASTA
The FASTA format is foundational for storing sequences of proteins or nucleotides. Each entry in a FASTA file begins with a description line that starts with a ">" symbol and carries metadata about the sequence, such as its identifier, organism, and gene name. The lines that follow contain the sequence itself in single-letter code (amino acids or nucleotides), usually wrapped at a fixed width.
For instance, consider the entry:
>sp|P01308|INS_HUMAN Insulin OS=Homo sapiens OX=9606 GN=INS PE=1 SV=1
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
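Here, "sp|P01308|INS_HUMAN" identifies human insulin in UniProt: "sp" marks a reviewed Swiss-Prot entry, "P01308" is the accession number, and "INS_HUMAN" is the entry name. The remaining fields give the organism (OS), taxonomy ID (OX), gene name (GN), protein existence level (PE), and sequence version (SV). The lines after the header contain the amino acid sequence of the protein.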
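As an illustration of the structure described above, the following Python sketch reads FASTA records and splits a UniProt-style identifier into its parts. The function names read_fasta and parse_uniprot_header are hypothetical helpers written for this example, not part of any library, and the header parsing assumes the "db|accession|entry_name" layout shown in the insulin entry.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_uniprot_header(header: str) -> Dict[str, str]:
    """Split a UniProt-style header (assumed layout: db|accession|entry_name)."""
    identifier, _, description = header.partition(" ")
    db, accession, entry_name = identifier.split("|")
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


example = """\
>sp|P01308|INS_HUMAN Insulin OS=Homo sapiens OX=9606 GN=INS PE=1 SV=1
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
"""

records = list(read_fasta(example.splitlines()))
header, sequence = records[0]
fields = parse_uniprot_header(header)
print(fields["accession"], fields["entry_name"], len(sequence))
```

Joining the wrapped lines recovers the full sequence regardless of the line width used in the file. Headers from other databases follow different conventions, so the identifier split would need adjusting for them.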
mzML
mzML is an open, XML-based format for storing the spectra recorded in mass spectrometry experiments, together with metadata about the experiment setup and acquisition parameters. Each <spectrum> element within an mzML file represents a single spectrum acquired during the experiment. A spectrum contains m/z (mass-to-charge ratio) values and the corresponding intensity values.
A simplified example of an mzML file:
<?xml version="1.0" encoding="UTF-8"?>
<mzML xmlns="http://psi.hupo.org/ms/mzml" version="1.1.0">
  <run id="my_experiment">
    <spectrumList count="1">
      <spectrum id="scan=1" index="0">
        <cvParam accession="MS:1000511" name="ms level" value="2"/>
        <binaryDataArrayList count="2">
          <binaryDataArray>
            <cvParam accession="MS:1000514" name="m/z array"/>
            <binary><!-- encoded m/z values --></binary>
          </binaryDataArray>
          <binaryDataArray>
            <cvParam accession="MS:1000515" name="intensity array"/>
            <binary><!-- encoded intensity values --></binary>
          </binaryDataArray>
        </binaryDataArrayList>
      </spectrum>
    </spectrumList>
  </run>
</mzML>
In this example, <spectrum> carries metadata (e.g., ms level, where a value of 2 indicates a tandem MS/MS scan) and two binary data arrays that store the encoded m/z and intensity values.
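To make the "encoded" arrays concrete: mzML stores each array as base64 text, and real files declare the numeric precision and any compression through additional cvParam entries. The Python sketch below assumes uncompressed 64-bit little-endian floats unless told otherwise, builds a tiny mzML-like document with made-up peak values, and decodes it again. The helpers encode_array and decode_array and the sample values are hypothetical, and the embedded document is far from a complete, schema-valid mzML file.

```python
import base64
import struct
import xml.etree.ElementTree as ET
import zlib

NS = {"mz": "http://psi.hupo.org/ms/mzml"}
# Controlled-vocabulary accessions for the two array types.
ARRAY_TYPES = {"MS:1000514": "mz", "MS:1000515": "intensity"}


def encode_array(values, compress=False):
    """Pack floats as little-endian 64-bit values, then base64-encode them."""
    raw = struct.pack(f"<{len(values)}d", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compressed=False):
    """Reverse encode_array: base64-decode, optionally inflate, unpack."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // 8  # 8 bytes per 64-bit float
    return list(struct.unpack(f"<{count}d", raw))


# Made-up peak list, used only to have something to encode.
mz = [100.5, 200.25, 300.125]
intensity = [10.0, 250.0, 35.5]

doc = f"""<mzML xmlns="http://psi.hupo.org/ms/mzml" version="1.1.0">
  <run id="my_experiment">
    <spectrumList count="1">
      <spectrum id="scan=1" index="0">
        <binaryDataArrayList count="2">
          <binaryDataArray>
            <cvParam accession="MS:1000514" name="m/z array"/>
            <binary>{encode_array(mz)}</binary>
          </binaryDataArray>
          <binaryDataArray>
            <cvParam accession="MS:1000515" name="intensity array"/>
            <binary>{encode_array(intensity)}</binary>
          </binaryDataArray>
        </binaryDataArrayList>
      </spectrum>
    </spectrumList>
  </run>
</mzML>"""

root = ET.fromstring(doc)
arrays = {}
for bda in root.iterfind(".//mz:binaryDataArray", NS):
    for cv in bda.iterfind("mz:cvParam", NS):
        kind = ARRAY_TYPES.get(cv.get("accession"))
        if kind is not None:
            arrays[kind] = decode_array(bda.find("mz:binary", NS).text.strip())

for m, i in zip(arrays["mz"], arrays["intensity"]):
    print(m, i)
```

Looking arrays up by accession rather than by name keeps the lookup stable if the human-readable label changes. For real files, a dedicated reader is usually a better choice than hand-rolled decoding, since it also handles the precision and compression declarations.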
pepXML
pepXML is designed specifically for storing peptide-spectrum match (PSM) results obtained from database search engines used in proteomics. It is also XML-based and includes information about identified peptides, their scores, modifications, and associated proteins.
An excerpt from a pepXML file:
<?xml version="1.0" encoding="UTF-8"?>
<msms_pipeline_analysis date="2024-07-03T10:00:00" xmlns="http://regis-web.systemsbiology.net/pepXML">
  <msms_run_summary>
    <spectrum_query spectrum="scan=1000" start_scan="1000" end_scan="1000" precursor_neutral_mass="1500.7" assumed_charge="2">
      <search_result>
        <search_hit hit_rank="1" peptide="PEPTIDESEQUENCE" protein="sp|P01308|INS_HUMAN" num_matched_ions="8" tot_num_ions="10" calc_neutral_pep_mass="1500.6" massdiff="0.1" num_tol_term="2" num_missed_cleavages="0" is_rejected="0">
          <search_score name="xcorr" value="2.5"/>
          <search_score name="deltacn" value="0.3"/>
        </search_hit>
      </search_result>
    </spectrum_query>
  </msms_run_summary>
</msms_pipeline_analysis>
This excerpt describes a single peptide-spectrum match. The top-ranked hit (hit_rank="1") for scan 1000 assigns the illustrative peptide sequence "PEPTIDESEQUENCE" to the human insulin entry (P01308). The peptide matched 8 of 10 theoretical fragment ions, and its calculated neutral mass (1500.6) differs from the observed precursor neutral mass (1500.7) by 0.1, as recorded in massdiff. The search scores (xcorr and deltacn) indicate the quality of the match.
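The fields discussed above can be pulled out of a pepXML document with Python's standard XML parser. The sketch below is a minimal illustration, not a full pepXML reader: read_psms is a hypothetical helper, the embedded document repeats the excerpt (without the XML declaration), and the code assumes that every hit carries the attributes shown there.

```python
import xml.etree.ElementTree as ET

NS = {"pep": "http://regis-web.systemsbiology.net/pepXML"}

PEPXML = """<msms_pipeline_analysis date="2024-07-03T10:00:00" xmlns="http://regis-web.systemsbiology.net/pepXML">
  <msms_run_summary>
    <spectrum_query spectrum="scan=1000" start_scan="1000" end_scan="1000" precursor_neutral_mass="1500.7" assumed_charge="2">
      <search_result>
        <search_hit hit_rank="1" peptide="PEPTIDESEQUENCE" protein="sp|P01308|INS_HUMAN" num_matched_ions="8" tot_num_ions="10" calc_neutral_pep_mass="1500.6" massdiff="0.1" num_tol_term="2" num_missed_cleavages="0" is_rejected="0">
          <search_score name="xcorr" value="2.5"/>
          <search_score name="deltacn" value="0.3"/>
        </search_hit>
      </search_result>
    </spectrum_query>
  </msms_run_summary>
</msms_pipeline_analysis>"""


def read_psms(xml_text):
    """Return one dict per search_hit, combining query and hit attributes."""
    root = ET.fromstring(xml_text)
    psms = []
    for query in root.iterfind(".//pep:spectrum_query", NS):
        for hit in query.iterfind("pep:search_result/pep:search_hit", NS):
            # Collect all named scores reported for this hit.
            scores = {
                s.get("name"): float(s.get("value"))
                for s in hit.iterfind("pep:search_score", NS)
            }
            matched = int(hit.get("num_matched_ions"))
            total = int(hit.get("tot_num_ions"))
            psms.append({
                "spectrum": query.get("spectrum"),
                "charge": int(query.get("assumed_charge")),
                "rank": int(hit.get("hit_rank")),
                "peptide": hit.get("peptide"),
                "protein": hit.get("protein"),
                "matched_fraction": matched / total,
                "massdiff": float(hit.get("massdiff")),
                "scores": scores,
            })
    return psms


psms = read_psms(PEPXML)
for psm in psms:
    print(psm["spectrum"], psm["peptide"], psm["matched_fraction"], psm["scores"])
```

Every element lookup is qualified with the pepXML namespace, because an unqualified name such as "spectrum_query" would not match elements in a document that declares a default namespace.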
Conclusion
Understanding these data formats is essential for working with proteomics data. FASTA stores protein sequences, mzML stores mass spectrometry spectra and their acquisition metadata, and pepXML stores peptide identification results. Each format has its own structure and content, allowing researchers to store and share complex proteomics data efficiently.