Genomics 101
Fundamental genomics concepts - BAM files, VCF format, somatic variants, DNA sequencing, and variant calling explained
Genomics 101: Essential Concepts
A beginner-friendly guide to understanding the genomics concepts behind Omics807 and somatic variant calling.
DNA Sequencing Basics
What is DNA Sequencing?
DNA sequencing is the process of determining the exact order of nucleotides (A, T, G, C) in a DNA molecule. Modern sequencing technologies can read billions of DNA fragments from a biological sample.
The Sequencing Process:
1. Extract DNA from tissue sample
2. Fragment DNA into small pieces
3. Sequence each fragment (generates "reads")
4. Align reads to reference genome
5. Call variants by comparing to reference
Sequencing Technologies
Short-Read Sequencing (Illumina)
- Read length: 100-300 base pairs
- Very high accuracy (>99.9%)
- Best for: SNVs, small indels
- Limitations: Structural variants, repetitive regions
Long-Read Sequencing
- PacBio HiFi: 10-25kb reads, high accuracy
- Oxford Nanopore (ONT): Ultra-long reads up to 2Mb
- Best for: Structural variants, phasing, complex regions
- Trade-off: Lower per-base accuracy than short reads, mainly for ONT (HiFi accuracy is comparable to short reads)
BAM Files Explained
What are BAM Files?
BAM = Binary Alignment Map
A BAM file contains all the sequencing reads from your sample aligned to a reference genome. Think of it as a digital record of where each DNA fragment from your sample maps to the genome.
Key Components:
- Header: Metadata about the sequencing run
- Alignment records: Each read's position, sequence, and quality
- Index file (.bai): Allows fast access to specific regions
BAM File Structure
Read 1: chr1:1000-1150 ATCGATCG... MAPQ=60
Read 2: chr1:1050-1200 GCTAGCTA... MAPQ=42
Read 3: chr1:1100-1250 TTAACCGG... MAPQ=55
...
Important Fields:
- RNAME: Reference sequence (chromosome) name, e.g., chr1
- POS: 1-based leftmost position of the aligned read
- MAPQ: Mapping quality (Phred-scaled, typically 0-60 for aligners such as BWA-MEM; higher = better)
- CIGAR: Alignment description (matches, insertions, deletions)
- SEQ: Read sequence
- QUAL: Base quality scores
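As a rough illustration of these fields, the sketch below splits one alignment record in SAM text form (the human-readable equivalent of BAM) into its core columns and uses the CIGAR string to work out where the read ends on the reference. The record and the helper names `parse_sam_line` and `reference_span` are made up for this example; real pipelines would normally read BAM files through samtools or a library built on htslib.

```python
import re
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapped position
    mapq: int
    cigar: str
    seq: str
    qual: str

def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-separated SAM alignment line into its core fields."""
    f = line.rstrip("\n").split("\t")
    # Columns 7-9 (RNEXT, PNEXT, TLEN) describe the mate and are skipped here.
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9], f[10])

def reference_span(cigar: str) -> int:
    """Count reference bases covered by the alignment.

    Only the M, D, N, = and X operations consume reference positions.
    """
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(length) for length, op in ops if op in "MDN=X")

# Hypothetical record: an 8-base read with a 1-base deletion relative to the reference.
line = "read1\t0\tchr1\t1000\t60\t4M1D4M\t*\t0\t0\tATCGATCG\tIIIIIIII"
rec = parse_sam_line(line)
end = rec.pos + reference_span(rec.cigar) - 1
print(rec.rname, rec.pos, end, rec.mapq)
```

Because the deletion consumes a reference base without consuming a read base, the read spans nine reference positions even though it is only eight bases long.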
Coverage and Depth
Read Depth (DP): Number of reads covering a genomic position
Position: 1000 1001 1002 1003 1004
Read 1: A T C G A
Read 2: A T C G A
Read 3: A T T G A <- variant at 1002
Read 4: A T C G A
Depth: 4 4 4 4 4
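To make the depth calculation concrete, here is a minimal sketch that stacks the four reads from the example above and counts, for each position, how many reads cover it and how many disagree with the reference. It assumes every read starts at position 1000 and has no indels; the function name `pileup_summary` is illustrative only.

```python
from collections import Counter

def pileup_summary(reference: str, reads: list[str], start: int):
    """Return (position, depth, non_reference_reads) for each reference base.

    Assumes all reads begin at `start` and align without gaps.
    """
    summary = []
    for offset, ref_base in enumerate(reference):
        # Bases from every read that is long enough to reach this position.
        column = [read[offset] for read in reads if offset < len(read)]
        counts = Counter(column)
        depth = len(column)
        summary.append((start + offset, depth, depth - counts[ref_base]))
    return summary

reference = "ATCGA"                            # reference bases at 1000-1004
reads = ["ATCGA", "ATCGA", "ATTGA", "ATCGA"]   # the four reads shown above

for pos, depth, alt_reads in pileup_summary(reference, reads, 1000):
    print(pos, "depth:", depth, "non-reference reads:", alt_reads)
```

At position 1002 one of the four reads carries a T instead of the reference C, which corresponds to an allele fraction of 1/4 at a depth of 4.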
Typical Coverage Requirements:
- Germline WGS: 30-50x
- Tumor WGS: 60-100x
- Exome: 100-150x
- Low-frequency variants: 200x+
VCF Format Explained
What is a VCF File?
VCF = Variant Call Format
A VCF file is the standard format for storing genetic variants. It lists positions where the sample differs from the reference genome.
VCF Structure
##fileformat=VCFv4.2
##reference=GRCh38
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR
chr1 12345 . A T 60 PASS DP=100;VAF=0.3 GT:GQ:DP 0/1:45:100
chr1 67890 . GC G 42 PASS DP=80;VAF=0.25 GT:GQ:DP 0/1:38:80
Column Definitions:
| Column | Name | Description |
|---|---|---|
| CHROM | Chromosome | Chromosome name (chr1-chr22, chrX, chrY) |
| POS | Position | 1-based genomic position |
| ID | Identifier | dbSNP ID if known (or '.') |
| REF | Reference | Reference allele from genome |
| ALT | Alternate | Variant allele observed |
| QUAL | Quality | Phred-scaled quality score |
| FILTER | Filter | PASS or reason for failure |
| INFO | Information | Additional annotations |
| FORMAT | Format | Genotype field definitions |
| SAMPLE | Sample | Per-sample genotype data; the column is named after the sample (TUMOR in the example above) |
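The column layout can be sketched in a few lines of Python. This is a simplified reader for a single data line, assuming tab-separated fields as in a real VCF file; `parse_vcf_line` is a hypothetical helper, and production code would typically rely on an established VCF library instead.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]

def parse_vcf_line(line: str, sample_names: list[str]) -> dict:
    """Split one VCF data line into its fixed columns plus per-sample fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])   # 1-based position
    # FORMAT lists the keys; each sample column lists values in the same order.
    keys = record["FORMAT"].split(":")
    record["samples"] = {
        name: dict(zip(keys, values.split(":")))
        for name, values in zip(sample_names, fields[9:])
    }
    return record

# First data line of the example above, with tabs between columns.
line = "chr1\t12345\t.\tA\tT\t60\tPASS\tDP=100;VAF=0.3\tGT:GQ:DP\t0/1:45:100"
rec = parse_vcf_line(line, ["TUMOR"])
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["samples"]["TUMOR"]["GT"])
```

Values are kept as strings except for POS; converting GQ or DP to numbers is left to the caller.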
Understanding FILTER Status
DeepSomatic uses specific FILTER values:
- PASS: High-confidence somatic variant
- GERMLINE: Likely germline, not tumor-specific
- RefCall: Reference call, no variant
- LowQual: Quality below threshold
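A common first step when inspecting a call set is to tally records by FILTER value. The sketch below does this for a handful of made-up records (positions 20000 and 30000 are invented for illustration); `count_filters` is a hypothetical helper that treats the FILTER column as a single label.

```python
from collections import Counter

def count_filters(vcf_lines):
    """Count data records per FILTER value, skipping header lines."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        filter_value = line.rstrip("\n").split("\t")[6]   # 7th column is FILTER
        counts[filter_value] += 1
    return counts

vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tT\t60\tPASS\tDP=100",
    "chr1\t20000\t.\tC\tG\t55\tGERMLINE\tDP=90",
    "chr1\t30000\t.\tG\tA\t3\tRefCall\tDP=70",
    "chr1\t67890\t.\tGC\tG\t42\tPASS\tDP=80",
]

counts = count_filters(vcf_lines)
for name, n in counts.most_common():
    print(name, n)
```

Keeping only the records whose FILTER is PASS would give the high-confidence somatic calls described in the list above.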
Quality Scores
QUAL (Variant Quality)
- Phred-scaled probability that the variant call is wrong: QUAL = -10 × log10(P(error))
- QUAL=30 → 99.9% confidence
- QUAL=60 → 99.9999% confidence
GQ (Genotype Quality)
- Confidence in the genotype assignment, on the same Phred scale
- GQ=40 → 99.99% confident in genotype
DP (Depth)
- Total read depth at position
- Higher depth = more reliable calls
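The confidence figures above follow from the Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). A small sketch of the conversion in both directions (the function names are illustrative):

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled score to the probability that the call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    return -10 * math.log10(p)

for q in (30, 40, 60):
    p_error = phred_to_error_prob(q)
    print(f"Q={q}: error probability {p_error:.0e}, confidence {1 - p_error:.4%}")
```

Each step of 10 on the Phred scale is a tenfold drop in error probability, so a jump from 30 to 60 means a thousandfold more confident call.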
INFO Field Annotations
Common DeepSomatic INFO fields:
DP=100 # Total depth
VAF=0.3 # Variant allele frequency (30%)
SOR=0.8 # Strand bias odds ratio
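The INFO column packs these annotations into one semicolon-separated string. A minimal sketch of unpacking it into a dictionary follows; `parse_info` is a hypothetical helper, values are left as strings, and keys without a value are treated as boolean flags.

```python
def parse_info(info: str) -> dict:
    """Turn an INFO string such as 'DP=100;VAF=0.3' into a dictionary."""
    result = {}
    if info == ".":          # '.' means no INFO annotations
        return result
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # A key without '=' is a flag: present means true.
        result[key] = value if sep else True
    return result

info = parse_info("DP=100;VAF=0.3;SOR=0.8")
print(int(info["DP"]), float(info["VAF"]), float(info["SOR"]))
```

The caller decides how to convert each value, since the expected type of a key is declared in the VCF header rather than in the data line.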
Variant Types
SNV (Single Nucleotide Variant)
A single base change:
Reference: ...ATCGATCG...
Variant: ...ATCAATCG...
^
SNV: G→A
Indel (Insertion/Deletion)
Insertion: Added bases
Reference: ...ATCG---ATCG...
Variant: ...ATCGTTAATCG...
^^^
+TTA insertion
Deletion: Removed bases
Reference: ...ATCGTTAATCG...
Variant: ...ATCG---ATCG...
^^^
-TTA deletion
MNV (Multi-Nucleotide Variant)
Multiple consecutive base changes:
Reference: ...ATCGATCG...
Variant: ...ATTAATCG...
^^
MNV: CG→TA
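In a VCF record these variant types can usually be told apart from the lengths of the REF and ALT alleles alone. The sketch below is a simplified rule of thumb, not how any particular caller labels variants: it ignores multi-allelic sites and symbolic alleles, and it relies on the VCF convention that indels include the base preceding the event. `classify_variant` is a made-up name.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label a single REF/ALT pair by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"            # same length, more than one base
    if len(ref) < len(alt):
        return "insertion"      # ALT carries extra bases
    return "deletion"           # ALT is missing bases

examples = [("G", "A"), ("G", "GTTA"), ("GTTA", "G"), ("CG", "TA")]
for ref, alt in examples:
    print(ref, alt, classify_variant(ref, alt))
```

The four pairs correspond to the SNV, insertion, deletion, and MNV examples shown above, with the indels written in VCF style.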
Somatic vs Germline Variants
Germline Variants
Definition: Inherited from parents, present in all cells
Characteristics:
- Present in both tumor and normal tissue
- Variant allele frequency (VAF) ≈ 50% (heterozygous) or 100% (homozygous)
- Found in blood/normal tissue
Example:
Normal tissue: VAF = 50%
Tumor tissue: VAF = 50%
→ Germline variant
Somatic Variants
Definition: Acquired mutations, only in tumor cells
Characteristics:
- Absent in normal tissue
- VAF depends on tumor purity and ploidy
- May include cancer-driving mutations, although most somatic variants are passengers
Example:
Normal tissue: VAF = 0%
Tumor tissue: VAF = 30%
→ Somatic variant
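The dependence of somatic VAF on purity and copy number can be written as a simple expectation. The sketch below uses a commonly used simplified model, not a formula taken from Omics807 or DeepSomatic: mutated copies contributed by tumor cells divided by all copies of the locus in the sample. The parameter names, including `cancer_cell_fraction`, are assumptions made for this example, and sequencing noise is ignored.

```python
def expected_vaf(purity: float,
                 mutated_copies: int = 1,
                 tumor_cn: int = 2,
                 normal_cn: int = 2,
                 cancer_cell_fraction: float = 1.0) -> float:
    """Expected VAF of a somatic variant under a simple mixture model.

    purity: fraction of cells in the sample that are tumor cells
    mutated_copies: copies of the locus carrying the variant per mutated cell
    tumor_cn / normal_cn: total copy number of the locus in tumor / normal cells
    cancer_cell_fraction: fraction of tumor cells carrying the variant
    """
    mutant = purity * cancer_cell_fraction * mutated_copies
    total = purity * tumor_cn + (1.0 - purity) * normal_cn
    return mutant / total

# Clonal heterozygous variant, diploid locus, 60% tumor purity.
print(round(expected_vaf(0.6), 3))
# Same variant present in only half of the tumor cells.
print(round(expected_vaf(0.6, cancer_cell_fraction=0.5), 3))
```

Under this model a clonal heterozygous variant in a 60% pure tumor is expected at about 30% VAF, consistent with the example above.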
Mixed Scenarios
Loss of Heterozygosity (LOH):
Normal: A/T (50% VAF for T)
Tumor: T/T (100% VAF for T)
→ Lost normal A allele
Subclonal Mutations:
Tumor VAF = 15%
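→ Only present in a subset of tumor cells
→ Typically acquired later in tumor evolution
→ Low tumor purity can also lower VAF, so check purity before calling a mutation subclonal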
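The scenarios in this section can be summarized as a toy decision rule on the paired normal and tumor VAFs. This is only a teaching sketch with arbitrary thresholds chosen for illustration; real somatic callers model sequencing error, contamination, purity, and copy number rather than applying fixed cutoffs. `classify_by_vaf` is a hypothetical function.

```python
def classify_by_vaf(normal_vaf: float, tumor_vaf: float, min_vaf: float = 0.05) -> str:
    """Toy labeling of a variant from matched normal and tumor VAFs."""
    in_normal = normal_vaf >= min_vaf
    in_tumor = tumor_vaf >= min_vaf
    if not in_normal and in_tumor:
        return "somatic"
    # Heterozygous in normal but near 100% in tumor suggests the other allele was lost.
    if in_normal and normal_vaf < 0.75 and tumor_vaf >= 0.9:
        return "germline with possible LOH"
    if in_normal:
        return "germline"
    return "no variant"

for normal, tumor in [(0.5, 0.5), (0.0, 0.3), (0.5, 1.0), (0.0, 0.15)]:
    print(normal, tumor, classify_by_vaf(normal, tumor))
```

Note that the rule cannot tell a subclonal mutation from a clonal one in a low-purity sample; both simply come out as somatic.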
Variant Calling Process
What is Variant Calling?
Variant calling is the computational process of identifying genetic differences between your sample and the reference genome.
The Variant Calling Pipeline
1. Preprocessing
- Quality control of reads
- Mark duplicates
- Base quality recalibration
2. Candidate Identification
- Find positions that differ from reference
- Filter for minimum quality and depth
3. Variant Classification
- Statistical tests or machine learning
- DeepSomatic uses deep neural networks
4. Filtering
- Remove low-quality calls
- Classify as somatic/germline
- Apply confidence thresholds
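Step 2, candidate identification, is the easiest part of the pipeline to sketch. The example below scans a made-up pileup (the reference base and the observed read bases at each position) and keeps positions with enough depth and enough reads supporting a non-reference base. The thresholds and the `find_candidates` helper are assumptions for illustration and do not reflect DeepSomatic's actual candidate selection.

```python
from collections import Counter

def find_candidates(pileup, min_depth=10, min_alt_reads=3):
    """Return (pos, ref, alt, depth, vaf) for positions that pass simple thresholds.

    pileup maps position -> (reference base, string of bases observed in reads).
    """
    candidates = []
    for pos, (ref_base, bases) in sorted(pileup.items()):
        depth = len(bases)
        if depth < min_depth:
            continue                      # too shallow to evaluate
        alt_counts = Counter(b for b in bases if b != ref_base)
        for alt, n in alt_counts.most_common():
            if n >= min_alt_reads:
                candidates.append((pos, ref_base, alt, depth, n / depth))
    return candidates

# Hypothetical pileup columns.
pileup = {
    1000: ("A", "A" * 20),               # all reads match the reference
    1001: ("C", "C" * 14 + "T" * 6),     # 6 of 20 reads support T
    1002: ("G", "G" * 4 + "A" * 3),      # depth 7, below min_depth
    1003: ("T", "T" * 19 + "G"),         # a single alternate read, likely an error
}

for candidate in find_candidates(pileup):
    print(candidate)
```

Only position 1001 survives; the classification step would then decide whether such a candidate is a real somatic variant.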
Traditional vs Deep Learning Approaches
Traditional Methods (e.g., Mutect2, Strelka):
- Hand-crafted statistical models
- Feature engineering
- Fixed thresholds
Deep Learning (DeepSomatic):
- Learned from millions of examples
- Pileup images as input
- Reported accuracy gains over traditional methods, especially for indels
Reference Genomes
What is a Reference Genome?
A reference genome is the standard DNA sequence used as a coordinate system for mapping variants.
Common References:
- GRCh38/hg38: Current human reference (2013)
- GRCh37/hg19: Previous version (2009)
- T2T-CHM13: Telomere-to-telomere assembly (2022)
Chromosome Naming
UCSC style (hg19 and hg38; typical for GRCh38 files): chr1, chr2, ..., chrX, chrY
Ensembl/NCBI style (typical for GRCh37 files): 1, 2, ..., X, Y
Omics807 uses GRCh38 by default.
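Mismatched chromosome names between a BAM or VCF file and the reference are a frequent source of errors. The sketch below converts between the chr-prefixed and unprefixed styles for the primary chromosomes; it is a simplification that does not handle alternate contigs or decoys, and the helper names are made up. The mitochondrial genome is the one irregular case (chrM versus MT).

```python
def to_prefixed(name: str) -> str:
    """Convert '1', 'X', 'MT' to 'chr1', 'chrX', 'chrM'."""
    if name.startswith("chr"):
        return name
    return "chrM" if name == "MT" else "chr" + name

def to_unprefixed(name: str) -> str:
    """Convert 'chr1', 'chrX', 'chrM' to '1', 'X', 'MT'."""
    if not name.startswith("chr"):
        return name
    return "MT" if name == "chrM" else name[3:]

for name in ["1", "X", "MT", "chr7"]:
    print(name, "->", to_prefixed(name))
```

Renaming only changes labels; coordinates from different genome builds still differ and need a liftover step rather than a rename.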
Key Terms Summary
| Term | Definition |
|---|---|
| BAM | Binary format for aligned sequencing reads |
| VCF | Standard format for genetic variants |
| SNV | Single nucleotide change |
| Indel | Insertion or deletion |
| VAF | Variant Allele Frequency - fraction of reads supporting the variant allele |
| DP | Read depth at a position |
| GQ | Genotype quality score |
| Somatic | Tumor-specific mutation |
| Germline | Inherited variant |
| WGS | Whole Genome Sequencing |
| WES | Whole Exome Sequencing |
Next Steps
Now that you understand the basics:
- Learn about DeepSomatic Models
- Understand How to Interpret Results
- Explore Example Datasets
- Read the Glossary for more terms