Genomics Glossary
Comprehensive glossary of genomic and bioinformatics terms used in variant calling and cancer genomics
A
Alignment
Process of mapping sequencing reads to a reference genome to determine their genomic location.
Allele
One of two or more alternative forms of a gene or genetic locus. For example, at a SNP position, the reference might be 'A' and the alternate allele 'T'.
Allele Frequency (AF)
The proportion of chromosomes in a population that carry a specific allele. In cancer, refers to the frequency of a variant allele in the sample.
Aneuploidy
Abnormal number of chromosomes in a cell (e.g., trisomy 21 = an extra copy of chromosome 21). Common in cancer cells, which frequently gain or lose whole chromosomes.
Annotation
Adding functional information to variants, such as gene names, predicted effects, population frequencies, and clinical significance.
B
BAI File
Index file for BAM files (.bam.bai) that enables fast random access to specific genomic regions.
BAM File
Binary Alignment Map - compressed binary format storing aligned sequencing reads. Standard format for NGS data.
Base Quality (BQ)
Phred-scaled confidence score for each sequenced base, indicating probability of sequencing error. BQ=30 means 99.9% confidence.
Batch Effect
Systematic technical variation between samples processed in different batches. Can cause false positive variants if not corrected.
BED File
Browser Extensible Data - tab-delimited format for genomic regions (chr, start, end). Used for target regions and high-confidence intervals.
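BED intervals use 0-based, half-open coordinates, while VCF positions are 1-based. The sketch below shows one way to parse the three required BED columns and convert them for comparison with VCF positions. The function names are illustrative and not taken from any library, and the example line is made up.

```python
def parse_bed_line(line):
    """Parse the first three columns of a BED line: chrom, start, end."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    return chrom, start, end


def bed_to_one_based(chrom, start, end):
    """Convert a 0-based, half-open BED interval to 1-based, inclusive."""
    # Only the start shifts: the exclusive 0-based end equals the
    # inclusive 1-based end.
    return chrom, start + 1, end


def interval_length(start, end):
    """Number of bases covered by a half-open interval."""
    return end - start


chrom, start, end = parse_bed_line("chr1\t999\t1100\ttarget_1")
print(bed_to_one_based(chrom, start, end))  # ('chr1', 1000, 1100)
print(interval_length(start, end))          # 101
```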
Benchmarking
Comparing variant caller performance against truth sets to measure accuracy (recall, precision, F1).
BWA (Burrows-Wheeler Aligner)
Popular alignment tool for mapping short reads to reference genomes.
C
Call Variants
The process of identifying genetic differences between a sample and reference genome.
CIGAR String
Compact Idiosyncratic Gapped Alignment Report - encodes alignment with matches (M), insertions (I), deletions (D), etc.
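As a rough illustration of how a CIGAR string is read, the following sketch splits it into (length, operation) pairs and sums how many bases the alignment covers on the reference and on the read. The operation sets follow the SAM specification. The helper names and the example string are assumptions made for this example.

```python
import re

# Operations that advance along the reference or the read (SAM specification).
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject input containing anything the pattern did not consume.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar):
    """Number of read bases described by the CIGAR (soft clips included)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


cigar = "5S45M2I48M3D5M"
print(parse_cigar(cigar))
# [(5, 'S'), (45, 'M'), (2, 'I'), (48, 'M'), (3, 'D'), (5, 'M')]
print(reference_span(cigar))  # 101
print(query_length(cigar))    # 105
```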
Clonal
Mutations present in all or most tumor cells. High VAF indicates clonal variants.
CNV (Copy Number Variant)
Gain or loss of genomic segments. Includes amplifications (gains) and deletions (losses).
Coverage (Depth)
Number of sequencing reads overlapping a genomic position. Higher coverage = more confident variant calls.
COSMIC
Catalogue Of Somatic Mutations In Cancer - comprehensive database of cancer mutations.
CRAM File
Compressed alternative to BAM that requires the reference genome for decompression. Typically around 40% smaller than BAM, depending on the data and compression settings.
D
dbSNP
Database of Single Nucleotide Polymorphisms - NCBI catalog of short genetic variants (SNVs and small indels). Commonly used to flag known germline variants.
DeepSomatic
Google's deep learning-based somatic variant caller used by Omics807.
DeepVariant
Google's deep learning germline variant caller, foundation for DeepSomatic.
Depth (DP)
Total read depth at a genomic position. In VCF FORMAT field, indicates coverage.
E
Exome
All protein-coding sequences in the genome (~1-2% of total genome, ~30Mb in humans).
Exon
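Region of a gene that is retained in the mature mRNA after splicing. Most exons contain protein-coding sequence, although exons can also include untranslated regions (UTRs).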
F
FASTA
Text format for nucleotide or protein sequences. Reference genomes stored as FASTA.
FASTQ
Format for raw sequencing reads, includes sequence and quality scores. Input for alignment.
False Discovery Rate (FDR)
Proportion of false positives among all positive calls. Lower is better.
FFPE (Formalin-Fixed Paraffin-Embedded)
Common method for preserving tissue. Causes DNA damage (C→T artifacts) that specialized models handle.
FILTER Column
VCF field indicating if variant passed quality filters (PASS) or reasons for failure.
Frameshift
Insertion or deletion whose length is not a multiple of 3, shifting the reading frame of a coding sequence. Often has a severe impact on the protein.
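The length rule can be checked directly from VCF-style REF and ALT alleles. This is a minimal sketch that assumes the variant lies entirely within coding sequence; `is_frameshift` is a hypothetical helper name.

```python
def is_frameshift(ref, alt):
    """Return True if a coding indel changes length by a non-multiple of 3.

    Assumes ref and alt are VCF-style allele strings and that the variant
    falls inside coding sequence.
    """
    length_change = len(alt) - len(ref)
    return length_change % 3 != 0


print(is_frameshift("A", "AT"))    # True: 1 bp insertion
print(is_frameshift("ACGT", "A"))  # False: 3 bp deletion stays in frame
print(is_frameshift("A", "T"))     # False: SNV, no length change
```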
G
Germline Variant
Inherited genetic variant present in all cells, passed from parents.
Genotype (GT)
Allele combination at a locus. 0/0 = homozygous reference, 0/1 = heterozygous, 1/1 = homozygous alternate.
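A short sketch of how a diploid GT string might be classified. It accepts both unphased (`/`) and phased (`|`) separators and treats `.` as a missing call. The function name and the returned labels are choices made for this example.

```python
import re


def classify_genotype(gt):
    """Classify a diploid VCF GT string such as '0/1' or '1|1'."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"
    if len(alleles) != 2:
        raise ValueError(f"expected a diploid genotype, got {gt!r}")
    first, second = alleles
    if first != second:
        return "heterozygous"
    return "homozygous reference" if first == "0" else "homozygous alternate"


for gt in ("0/0", "0/1", "1|1", "1/2", "./."):
    print(gt, "->", classify_genotype(gt))
```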
Genotype Quality (GQ)
Phred-scaled confidence in genotype assignment. GQ=40 means 99.99% confident.
GRCh37 (hg19)
Human reference genome version from 2009. Older but still widely used.
GRCh38 (hg38)
Current human reference genome from 2013. Improved accuracy and completeness.
H
Haplotype
Set of DNA variants inherited together from one parent.
HCC1395
Breast cancer cell line widely used as reference material for somatic variant calling.
Heterozygous
Having two different alleles at a locus (0/1). VAF ~50% for germline heterozygous variants.
Homozygous
Having two identical alleles at a locus. 0/0 = reference, 1/1 = alternate.
Hotspot
Genomic position with recurrent mutations across many cancer samples (e.g., BRAF V600E).
I
IGV (Integrative Genomics Viewer)
Popular genome browser for visualizing BAM files and variants.
Indel
Insertion or deletion variant. Generally harder to call accurately than SNVs.
Insertion
Addition of one or more nucleotides relative to reference.
Intron
Non-coding region between exons. Spliced out during mRNA processing.
L
LOH (Loss of Heterozygosity)
Loss of one allele in the tumor. A germline heterozygous site (VAF ~50%) appears homozygous (VAF approaching 100%) in the tumor.
Long-read Sequencing
Technologies producing reads >10kb (PacBio, Nanopore). Better for structural variants.
M
MAF (Mutation Annotation Format)
Format used by TCGA for annotated somatic mutations.
MAPQ (Mapping Quality)
Phred-scaled probability that the read is mapped to the wrong position. MAPQ=60 corresponds to 99.9999% confidence that the alignment is correct.
MNV (Multi-Nucleotide Variant)
Multiple consecutive nucleotide changes (e.g., CA→TG).
Mutation
Change in DNA sequence. Often used interchangeably with "variant" in cancer genomics.
N
NGS (Next-Generation Sequencing)
High-throughput sequencing technologies (Illumina, PacBio, Nanopore).
Nonsense Mutation
SNV creating a premature stop codon (e.g., CAG→TAG). Truncates protein.
O
ONT (Oxford Nanopore Technologies)
Long-read sequencing platform using nanopores to detect DNA bases.
Oncogene
Gene that can cause cancer when mutated or overexpressed (e.g., KRAS, MYC).
P
PacBio (Pacific Biosciences)
Long-read sequencing platform using Single Molecule Real-Time (SMRT) technology.
Panel of Normals (PoN)
Collection of normal samples used to filter common artifacts in tumor-only calling.
PASS
VCF FILTER value indicating variant passed all quality filters. High-confidence call.
Phred Score
Quality score scale: Q = -10 × log₁₀(P_error). Q30 = 99.9% accuracy, Q60 = 99.9999%.
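The formula converts in both directions. The sketch below applies it to the scores quoted in this glossary (BQ, GQ, and MAPQ all use the same scale); the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """P_error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p_error):
    """Q = -10 * log10(P_error)."""
    if not 0 < p_error <= 1:
        raise ValueError("p_error must be in (0, 1]")
    return -10 * math.log10(p_error)


for q in (20, 30, 40, 60):
    p = phred_to_error_prob(q)
    print(f"Q{q}: P_error = {p:.0e}, accuracy = {1 - p:.4%}")
```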
Pileup
Vertical alignment of reads at a genomic position. DeepSomatic creates "pileup images".
Precision (PPV)
Proportion of called variants that are true positives. TP / (TP + FP).
Q
QUAL
VCF quality score - Phred-scaled probability that the variant call is wrong. Higher = more confident that the variant exists.
R
Read
DNA fragment sequence produced by sequencing machine. Typically 100-300bp for Illumina.
Read Depth
See Coverage/Depth.
Recall (Sensitivity)
Proportion of true variants successfully detected. TP / (TP + FN).
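The benchmarking metrics in this glossary all derive from the same three counts. The following sketch computes precision, recall, F1 (the harmonic mean of precision and recall), and FDR from true positives, false positives, and false negatives. The counts in the example are made up, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def benchmark_metrics(tp, fp, fn):
    """Compute precision, recall, F1 and FDR from benchmark counts."""
    called = tp + fp   # variants the caller reported
    truth = tp + fn    # variants in the truth set
    precision = tp / called if called else 0.0
    recall = tp / truth if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    fdr = fp / called if called else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fdr": fdr}


metrics = benchmark_metrics(tp=90, fp=10, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision is 0.900, recall is 0.750, F1 is about 0.818, and FDR is 0.100.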
Reference Genome
Standard DNA sequence used as coordinate system. Human: GRCh38 or GRCh37.
RefCall
VCF FILTER indicating no variant detected - position matches reference.
S
SAM File
Sequence Alignment Map - text format for aligned reads. BAM is its compressed binary equivalent.
Sanger Sequencing
Traditional sequencing method. Gold standard for validating NGS variants.
SEQC2
FDA consortium providing validated truth sets for variant calling benchmarks.
Short-read Sequencing
Sequencing producing reads 50-300bp (e.g., Illumina). High accuracy but limited for complex regions.
SNP (Single Nucleotide Polymorphism)
Common germline single-base variant (>1% frequency in population).
SNV (Single Nucleotide Variant)
Single base change. Can be germline (SNP) or somatic (mutation).
Somatic Variant
Mutation acquired in tumor cells, not inherited. Absent in normal tissue.
Strand Bias
Variant preferentially on forward or reverse strand. Often indicates artifact.
Subclonal
Mutation present in only a subset of tumor cells. A VAF well below what tumor purity would predict for a clonal variant suggests a subclonal variant.
SV (Structural Variant)
Large genomic rearrangement (>50bp): deletions, duplications, inversions, translocations.
T
TCGA (The Cancer Genome Atlas)
Large-scale cancer genomics project with thousands of tumor/normal pairs.
Tumor Purity
Proportion of tumor cells in sample. Affects variant allele frequencies.
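To make the effect of purity on VAF concrete, here is a hedged sketch of the usual mixture model: variant-carrying alleles from tumor cells divided by all alleles from tumor and normal cells. The function and parameter names are hypothetical, and the model assumes a clonal somatic variant and a single copy-number state at the locus.

```python
def expected_vaf(purity, mutated_copies=1, tumor_copy_number=2,
                 normal_copy_number=2):
    """Expected VAF of a clonal somatic variant under a simple mixture model.

    purity: fraction of cells in the sample that are tumor cells (0 to 1).
    mutated_copies: copies of the locus carrying the variant per tumor cell.
    """
    if not 0 <= purity <= 1:
        raise ValueError("purity must be between 0 and 1")
    variant_alleles = purity * mutated_copies
    total_alleles = (purity * tumor_copy_number
                     + (1 - purity) * normal_copy_number)
    return variant_alleles / total_alleles


# Heterozygous clonal variant in a diploid region: VAF is about purity / 2.
for purity in (1.0, 0.6, 0.2):
    print(f"purity {purity:.1f} -> expected VAF {expected_vaf(purity):.2f}")
```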
Tumor-Normal Pair
Matched tumor and normal samples from same patient. Gold standard for somatic calling.
Tumor-Only
Analysis of a tumor sample without a matched normal. Typically relies on a Panel of Normals and population databases to filter germline variants and artifacts.
V
VAF (Variant Allele Frequency)
Proportion of reads supporting variant allele. VAF = Alt reads / Total reads.
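A minimal sketch of the calculation, with a guard for positions that have no coverage. The second helper assumes an `AD`-style "ref,alt" allelic depth string, as many callers write in the VCF FORMAT column; whether a given VCF provides that field is an assumption to check against the file's header. Both function names are illustrative.

```python
def variant_allele_frequency(alt_reads, total_reads):
    """VAF = alt reads / total reads; None when there is no coverage."""
    if alt_reads < 0 or total_reads < alt_reads:
        raise ValueError("need 0 <= alt_reads <= total_reads")
    if total_reads == 0:
        return None
    return alt_reads / total_reads


def vaf_from_allelic_depths(ad):
    """Compute VAF from an AD-style string 'ref,alt', e.g. '68,12'."""
    ref_reads, alt_reads = (int(x) for x in ad.split(",")[:2])
    return variant_allele_frequency(alt_reads, ref_reads + alt_reads)


print(variant_allele_frequency(12, 80))  # 0.15
print(vaf_from_allelic_depths("68,12"))  # 0.15
print(variant_allele_frequency(0, 0))    # None
```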
Variant
Difference between sample and reference genome. Includes SNVs, indels, SVs.
Variant Calling
Computational process of identifying genetic variants from sequencing data.
VCF (Variant Call Format)
Standard text format for storing genetic variants. Includes position, alleles, quality, annotations.
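For orientation, the sketch below splits one VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The record and the `SOMATIC` flag are made up for illustration, and the helper names are hypothetical. For real files a dedicated VCF library is usually the safer choice, since this ignores headers and per-sample columns.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info):
    """Parse 'KEY=VALUE;FLAG' pairs; flags without a value become True."""
    if info == ".":
        return {}
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def parse_vcf_record(line):
    """Parse the eight fixed columns of a tab-delimited VCF data line."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])        # 1-based position
    record["ALT"] = record["ALT"].split(",")  # one or more alternate alleles
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    record["INFO"] = parse_info(record["INFO"])
    return record


# Made-up record for illustration only.
line = "chr1\t12345\t.\tA\tT\t45.2\tPASS\tDP=120;SOMATIC"
record = parse_vcf_record(line)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
print(record["FILTER"], record["INFO"])
```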
VEP (Variant Effect Predictor)
Tool from Ensembl for annotating variant functional consequences.
W
WES (Whole Exome Sequencing)
Sequencing only protein-coding regions (~1-2% of genome). Cheaper than WGS.
WGS (Whole Genome Sequencing)
Sequencing entire genome including non-coding regions.
Z
Zygosity
Whether the two alleles at a locus are identical or different. Homozygous (2 copies of the same allele) or heterozygous (1 copy each of two different alleles).
Acronyms Quick Reference
| Acronym | Full Name |
|---|---|
| AF | Allele Frequency |
| BAM | Binary Alignment Map |
| BED | Browser Extensible Data |
| CNV | Copy Number Variant |
| DP | Depth |
| FFPE | Formalin-Fixed Paraffin-Embedded |
| GQ | Genotype Quality |
| GT | Genotype |
| IGV | Integrative Genomics Viewer |
| LOH | Loss of Heterozygosity |
| MAPQ | Mapping Quality |
| MNV | Multi-Nucleotide Variant |
| NGS | Next-Generation Sequencing |
| ONT | Oxford Nanopore Technologies |
| PoN | Panel of Normals |
| PPV | Positive Predictive Value (Precision) |
| SAM | Sequence Alignment Map |
| SNP | Single Nucleotide Polymorphism |
| SNV | Single Nucleotide Variant |
| SV | Structural Variant |
| VAF | Variant Allele Frequency |
| VCF | Variant Call Format |
| VEP | Variant Effect Predictor |
| WES | Whole Exome Sequencing |
| WGS | Whole Genome Sequencing |
Related Resources
- Genomics 101 - Fundamental concepts explained
- Understanding Results - Apply terminology to real results
- Model Guide - Technical details