Process GC-MS data for metabolomics and volatile compound analysis including peak detection, deconvolution, and NIST library matching. Use when: user has GC-MS data, needs retention index calculation, wants to identify volatiles, match EI spectra against NIST, or process derivatized metabolites. Triggers: GC-MS, gas chromatography, volatile analysis, NIST library, retention index, Kovats index, EI spectrum, electron ionization, deconvolution, AMDIS, TMS derivatives, volatile profiling, headspace analysis, SPME.
GC-MS metabolomics requires specialized workflows distinct from LC-MS: electron ionization (EI) fragmentation produces reproducible spectra suitable for library matching, but introduces challenges like derivatization artifacts, column bleed interference, and co-eluting peaks requiring deconvolution. This skill covers the full pipeline from raw data through metabolite identification.
Reference examples tested with: matchms 0.18-0.26, pyopenms 3.x, xcms 4.x
The matchms API for Scores objects and CosineGreedy.pair() return types changed
across versions. If code throws TypeError or KeyError, introspect the return value
(e.g., print(type(result), result)) and adapt the unpacking accordingly.
uv pip install matchms pyopenms numpy pandas scipy
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("xcms", "MSnbase"))
import matchms
import pyopenms as ms
print(f"matchms: {matchms.__version__}")
print(f"pyopenms: {ms.__version__}")
Read raw GC-MS files (mzML, mzXML, CDF) and inspect TIC vs EIC.
import pyopenms as ms
import numpy as np
exp = ms.MSExperiment()
ms.MzMLFile().load("gcms_sample.mzML", exp)
# Build total ion chromatogram (TIC)
retention_times = []
tic_intensities = []
for spectrum in exp:
if spectrum.getMSLevel() == 1:
retention_times.append(spectrum.getRT())
mz, intensity = spectrum.get_peaks()
tic_intensities.append(np.sum(intensity))
print(f"Spectra: {len(retention_times)}")
print(f"RT range: {min(retention_times):.1f} - {max(retention_times):.1f} s")
Extract an extracted ion chromatogram (EIC) for a target m/z:
def extract_eic(experiment, target_mz, tolerance=0.5):
rts = []
intensities = []
for spectrum in experiment:
if spectrum.getMSLevel() == 1:
rts.append(spectrum.getRT())
mz_array, int_array = spectrum.get_peaks()
mask = np.abs(mz_array - target_mz) <= tolerance
intensities.append(np.sum(int_array[mask]) if np.any(mask) else 0.0)
return np.array(rts), np.array(intensities)
# EIC for m/z 73 (common TMS fragment)
rt_arr, eic_arr = extract_eic(exp, target_mz=73.0, tolerance=0.5)
CentWave parameters tuned for GC-MS: narrower peak widths, wider ppm tolerance for unit-mass-resolution instruments.
library(xcms)
library(MSnbase)
raw_files <- list.files("gcms_data", pattern = "\\.mzML$", full.names = TRUE)
raw_data <- readMSData(raw_files, mode = "onDisk")
# CentWave parameters optimized for GC-MS
# - peakwidth: GC peaks are typically 2-10 seconds (narrower than LC)
# - ppm: unit-mass GC-MS needs wider tolerance (~30-40 ppm);
# for HR-GC-MS (e.g., GC-Orbitrap), use 5-10 ppm instead
# - noise: adjust to instrument baseline
cwp <- CentWaveParam(
peakwidth = c(2, 10),
ppm = 40,
snthresh = 5,
prefilter = c(3, 500),
mzdiff = 0.5,
noise = 500,
integrate = 2
)
xdata <- findChromPeaks(raw_data, param = cwp)
cat("Peaks detected:", nrow(chromPeaks(xdata)), "\n")
Retention indices normalize retention times against an alkane ladder, enabling cross-instrument comparison.
Kovats Retention Index (isothermal GC, uses logarithmic interpolation):
import numpy as np
def kovats_ri(rt_compound, alkane_rts, alkane_carbons):
"""Calculate Kovats retention index for isothermal GC.
Uses the original Kovats logarithmic formula:
RI = 100 * (n + (log(tR(x)) - log(tR(n))) / (log(tR(n+1)) - log(tR(n))))
Args:
rt_compound: retention time of the target compound (seconds)
alkane_rts: list of alkane retention times (seconds), sorted ascending
alkane_carbons: list of carbon numbers matching alkane_rts
"""
alkane_rts = np.array(alkane_rts, dtype=np.float64)
alkane_carbons = np.array(alkane_carbons)
idx = np.searchsorted(alkane_rts, rt_compound) - 1
idx = max(0, min(idx, len(alkane_rts) - 2))
n_z = alkane_carbons[idx]
rt_z = alkane_rts[idx]
rt_z1 = alkane_rts[idx + 1]
ri = 100 * (n_z + (np.log(rt_compound) - np.log(rt_z)) / (np.log(rt_z1) - np.log(rt_z)))
return ri
Linear Retention Index (LTPRI) (temperature-programmed GC, most common in metabolomics):
def linear_ri(rt_compound, alkane_rts, alkane_carbons):
"""Calculate linear (temperature-programmed) retention index.
For temperature-programmed GC, retention times increase linearly with
carbon number, so the simpler linear interpolation formula is used:
RI = 100 * (n + (tR(x) - tR(n)) / (tR(n+1) - tR(n)))
Args:
rt_compound: retention time of the target compound (seconds)
alkane_rts: list of alkane retention times (seconds), sorted ascending
alkane_carbons: list of carbon numbers matching alkane_rts
"""
alkane_rts = np.array(alkane_rts, dtype=np.float64)
alkane_carbons = np.array(alkane_carbons)
idx = np.searchsorted(alkane_rts, rt_compound) - 1
idx = max(0, min(idx, len(alkane_rts) - 2))
n_z = alkane_carbons[idx]
rt_z = alkane_rts[idx]
rt_z1 = alkane_rts[idx + 1]
ri = 100 * (n_z + (rt_compound - rt_z) / (rt_z1 - rt_z))
return ri
# Alkane ladder C8-C30
alkane_rts = [180.2, 240.5, 310.1, 385.7, 462.3, 540.8, 621.4, 703.9,
788.1, 874.2, 962.0, 1051.5, 1142.8, 1235.6, 1330.1,
1426.0, 1523.4, 1622.1, 1722.3, 1823.8, 1926.5, 2030.7, 2136.2]
alkane_carbons = list(range(8, 31))
ri = linear_ri(rt_compound=450.0, alkane_rts=alkane_rts, alkane_carbons=alkane_carbons)
print(f"Linear RI: {ri:.0f}")
matchms supports GC-MS spectral comparison against reference libraries such as NIST and MassBank.
from matchms.importing import load_from_msp
from matchms.similarity import CosineGreedy
from matchms import calculate_scores
from matchms.filtering import (
default_filters,
normalize_intensities,
select_by_mz,
select_by_relative_intensity,
)
def process_spectrum(spectrum):
spectrum = default_filters(spectrum)
spectrum = normalize_intensities(spectrum)
spectrum = select_by_mz(spectrum, mz_from=35, mz_to=500)
spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01)
return spectrum
# Load reference library (NIST-format MSP export)
references = list(load_from_msp("nist_ei_library.msp"))
references = [process_spectrum(s) for s in references if s is not None]
# Load query spectra (deconvolved peaks exported as MSP)
queries = list(load_from_msp("sample_deconvolved.msp"))
queries = [process_spectrum(s) for s in queries if s is not None]
# Calculate cosine similarity scores
similarity = CosineGreedy(tolerance=0.5)
scores = calculate_scores(references, queries, similarity)
# Extract best matches per query
for i, query in enumerate(queries):
best = scores.scores_by_query(query, "CosineGreedy_score", sort=True)
# Each entry is (reference, score_object); access style depends on matchms version
top_matches = []
for ref, score_obj in best[:3]:
sc = float(score_obj["score"] if isinstance(score_obj, dict) else score_obj[0])
nm = int(score_obj["matches"] if isinstance(score_obj, dict) else score_obj[1])
if sc > 0.7 and nm >= 6:
top_matches.append((ref, sc, nm))
if top_matches:
ref, score, n_matches = top_matches[0]
name = ref.get("compound_name", "Unknown")
print(f"Query RT={query.get('retention_time')}: {name} "
f"(score={score:.3f}, matched_peaks={n_matches})")
For production-grade NIST matching, combine RI filtering with spectral similarity:
def nist_match_with_ri(query_spectrum, query_ri, references, ri_tolerance=20):
"""Filter candidates by RI window, then rank by spectral similarity."""
similarity = CosineGreedy(tolerance=0.5)
candidates = []
for ref in references:
ref_ri = ref.get("retention_index")
if ref_ri is not None and abs(float(ref_ri) - query_ri) <= ri_tolerance:
score_pair = similarity.pair(ref, query_spectrum)
if score_pair is not None:
# matchms >= 0.18 returns a dict-like object; older versions return a tuple
score_val = float(score_pair["score"] if isinstance(score_pair, dict) else score_pair[0])
n_matches = int(score_pair["matches"] if isinstance(score_pair, dict) else score_pair[1])
if score_val > 0.6 and n_matches >= 5:
candidates.append({
"name": ref.get("compound_name", "Unknown"),
"score": score_val,
"matched_peaks": n_matches,
"ri_diff": abs(float(ref_ri) - query_ri),
})
return sorted(candidates, key=lambda x: x["score"], reverse=True)
import pyopenms as ms
exp = ms.MSExperiment()
ms.MzMLFile().load("gcms_sample.mzML", exp)
# Smooth spectra
gaussian = ms.GaussFilter()
params = gaussian.getParameters()
params.setValue("gaussian_width", 0.15)
gaussian.setParameters(params)
gaussian.filterExperiment(exp)
# Peak picking (centroiding)
picker = ms.PeakPickerHiRes()
pick_params = picker.getParameters()
pick_params.setValue("signal_to_noise", 3.0)
picker.setParameters(pick_params)
centroided = ms.MSExperiment()
picker.pickExperiment(exp, centroided)
print(f"Centroided spectra: {centroided.getNrSpectra()}")
TMS (trimethylsilyl) derivatization is the most common approach for GC-MS metabolomics. Common reagents include MSTFA (N-Methyl-N-(trimethylsilyl)trifluoroacetamide) and BSTFA (N,O-Bis(trimethylsilyl)trifluoroacetamide), often with 1% TMCS as catalyst. TMS derivatization produces multiple derivatives per metabolite:
# Common TMS artifacts to flag and exclude
TMS_ARTIFACTS = {
"hexamethylcyclotrisiloxane": 207, # D3 siloxane, column bleed
"octamethylcyclotetrasiloxane": 281, # D4 siloxane, column bleed
"trimethylsilanol": 75, # TMS reagent byproduct
}
def flag_artifacts(peak_list, artifact_mz_values, tolerance=0.5):
"""Flag peaks matching known derivatization artifacts."""
flagged = []
for peak in peak_list:
base_peak_mz = peak.get("base_peak_mz", 0)
is_artifact = any(
abs(base_peak_mz - mz) <= tolerance
for mz in artifact_mz_values
)
flagged.append({**peak, "is_artifact": is_artifact})
return flagged
Column bleed produces characteristic siloxane fragments that increase with temperature:
COLUMN_BLEED_MZ = [207.0, 281.0, 355.0, 429.0]
def detect_column_bleed(spectrum_mz, spectrum_intensity, threshold=0.05):
"""Check if column bleed fragments dominate the spectrum."""
total_intensity = np.sum(spectrum_intensity)
if total_intensity == 0:
return False
bleed_intensity = sum(
spectrum_intensity[np.argmin(np.abs(spectrum_mz - mz))]
for mz in COLUMN_BLEED_MZ
if np.min(np.abs(spectrum_mz - mz)) < 0.5
)
return (bleed_intensity / total_intensity) > threshold
# 1. Load raw data
exp = ms.MSExperiment()
ms.MzMLFile().load("gcms_sample.mzML", exp)
# 2. Smooth and centroid
gaussian = ms.GaussFilter()
g_params = gaussian.getParameters()
g_params.setValue("gaussian_width", 0.15)
gaussian.setParameters(g_params)
gaussian.filterExperiment(exp)
picker = ms.PeakPickerHiRes()
centroided = ms.MSExperiment()
picker.pickExperiment(exp, centroided)
# 3. Export spectra for library matching
ms.MzMLFile().store("gcms_centroided.mzML", centroided)
# 4. Calculate RIs for detected peaks using alkane ladder
# 5. Match against NIST/MassBank using matchms with RI filtering
# 6. Export annotated feature table