Genomics 101: Essential Concepts

A beginner-friendly guide to understanding the genomics concepts behind Omics807 and somatic variant calling.

DNA Sequencing Basics

What is DNA Sequencing?

DNA sequencing is the process of determining the exact order of nucleotides (A, T, G, C) in a DNA molecule. Modern sequencing technologies can read billions of DNA fragments from a biological sample.

The Sequencing Process:

  1. Extract DNA from the tissue sample
  2. Fragment the DNA into small pieces
  3. Sequence each fragment (generates "reads")
  4. Align reads to the reference genome
  5. Call variants by comparing the aligned reads to the reference

Sequencing Technologies

Short-Read Sequencing (Illumina)

  • Read length: 100-300 base pairs
  • Very high accuracy (>99.9%)
  • Best for: SNVs, small indels
  • Limitations: Structural variants, repetitive regions

Long-Read Sequencing

  • PacBio HiFi: 10-25kb reads, high accuracy
  • Oxford Nanopore (ONT): Ultra-long reads up to 2Mb
  • Best for: Structural variants, phasing, complex regions
  • Trade-off: Per-base accuracy has historically been lower than short reads, mainly for ONT; PacBio HiFi reads largely close that gap

BAM Files Explained

What are BAM Files?

BAM = Binary Alignment/Map (the compressed binary form of the text-based SAM format)

A BAM file contains all the sequencing reads from your sample aligned to a reference genome. Think of it as a digital record of where each DNA fragment from your sample maps to the genome.

Key Components:

  • Header: Metadata about the reference sequences, samples, and processing steps
  • Alignment records: Each read's position, sequence, and quality
  • Index file (.bai): A separate companion file that allows fast access to specific regions

BAM File Structure

Read 1: chr1:1000-1150 ATCGATCG... MAPQ=60
Read 2: chr1:1050-1200 GCTAGCTA... MAPQ=42
Read 3: chr1:1100-1250 TTAACCGG... MAPQ=55
...

Important Fields:

  • RNAME: Chromosome name (e.g., chr1)
  • POS: 1-based leftmost start position of the alignment
  • MAPQ: Mapping quality (higher = better; common aligners such as BWA report 0-60)
  • CIGAR: Alignment description (matches, insertions, deletions)
  • SEQ: Read sequence
  • QUAL: Base quality scores
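To make these fields concrete, here is a minimal sketch that parses one record in SAM form (the tab-separated text equivalent of a BAM record) using only the Python standard library. The names `SamRecord`, `parse_sam_line`, and `reference_span` are hypothetical, and the example read is invented for illustration; real pipelines would normally use a dedicated library instead.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The mandatory SAM fields discussed above (hypothetical helper type)."""
    qname: str
    flag: int
    rname: str
    pos: int    # 1-based leftmost position
    mapq: int
    cigar: str
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse the 11 mandatory tab-separated fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    # Order: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9], fields[10])


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar):
    """Count reference bases covered: M, D, N, = and X consume the reference."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")


# Invented example read: 5 matches, a 1-base insertion, then 4 matches.
line = "read1\t0\tchr1\t1000\t60\t5M1I4M\t*\t0\t0\tATCGATCGAT\tIIIIIIIIII"
rec = parse_sam_line(line)
end = rec.pos + reference_span(rec.cigar) - 1
print(rec.rname, rec.pos, end, rec.mapq)
```

The inserted base does not consume reference positions, so this 10-base read covers only 9 reference bases.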

Coverage and Depth

Read Depth (DP): Number of reads covering a genomic position

Position:  1000  1001  1002  1003  1004
Read 1:     A     T     C     G     A
Read 2:     A     T     C     G     A
Read 3:     A     T     T     G     A  <- variant at 1002
Read 4:     A     T     C     G     A
Depth:      4     4     4     4     4
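The pileup above can be reproduced with a short sketch that counts depth and the fraction of non-reference reads at each position. It assumes that all four reads start at position 1000 and that the reference sequence is ATCGA (inferred from the majority base, not stated above); `pileup_summary` is a hypothetical helper name.

```python
from collections import Counter

# Assumption: every read starts at position 1000 and the reference is ATCGA.
reads = ["ATCGA", "ATCGA", "ATTGA", "ATCGA"]
reference = "ATCGA"
start = 1000


def pileup_summary(reads, reference, start):
    """Return (position, depth, alt_reads, vaf) for each reference position."""
    summary = []
    for offset, ref_base in enumerate(reference):
        bases = Counter(read[offset] for read in reads if offset < len(read))
        depth = sum(bases.values())
        alt_reads = depth - bases[ref_base]
        vaf = alt_reads / depth if depth else 0.0
        summary.append((start + offset, depth, alt_reads, vaf))
    return summary


summary = pileup_summary(reads, reference, start)
for position, depth, alt_reads, vaf in summary:
    print(position, depth, alt_reads, vaf)
```

At position 1002, one of four reads carries the alternate base, so the depth is 4 and the variant allele frequency is 0.25.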

Typical Coverage Requirements:

  • Germline WGS: 30-50x
  • Tumor WGS: 60-100x
  • Exome: 100-150x
  • Low-frequency variants: 200x+

VCF Format Explained

What is a VCF File?

VCF = Variant Call Format

A VCF file is the standard format for storing genetic variants. It lists positions where the sample differs from the reference genome.

VCF Structure

##fileformat=VCFv4.2
##reference=GRCh38
#CHROM  POS     ID      REF  ALT   QUAL  FILTER  INFO           FORMAT      TUMOR
chr1    12345   .       A    T     60    PASS    DP=100;VAF=0.3  GT:GQ:DP    0/1:45:100
chr1    67890   .       GC   G     42    PASS    DP=80;VAF=0.25  GT:GQ:DP    0/1:38:80

Column Definitions:

Column   Name          Description
CHROM    Chromosome    Chromosome name (chr1-chr22, chrX, chrY)
POS      Position      1-based genomic position
ID       Identifier    dbSNP ID if known (or '.')
REF      Reference     Reference allele from genome
ALT      Alternate     Variant allele observed
QUAL     Quality       Phred-scaled quality score
FILTER   Filter        PASS or reason for failure
INFO     Information   Additional annotations
FORMAT   Format        Genotype field definitions
SAMPLE   Sample        Per-sample genotype data (the column is named after the sample, e.g. TUMOR above)
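As an illustration of how these columns fit together, the sketch below splits one VCF data line into named columns and pairs the FORMAT keys with the sample values. `parse_vcf_line` is a hypothetical helper, and the line is the first example record rewritten with tab separators, as real VCF files are tab-delimited.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_line(line, sample_names):
    """Split one tab-delimited VCF data line into a dictionary of columns."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several ALT alleles are comma-separated
    keys = record["FORMAT"].split(":")
    # Pair each FORMAT key (GT, GQ, DP, ...) with the value in each sample column.
    record["samples"] = {
        name: dict(zip(keys, values.split(":")))
        for name, values in zip(sample_names, fields[9:])
    }
    return record


line = "chr1\t12345\t.\tA\tT\t60\tPASS\tDP=100;VAF=0.3\tGT:GQ:DP\t0/1:45:100"
record = parse_vcf_line(line, ["TUMOR"])
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
print(record["samples"]["TUMOR"])
```

Sample values are kept as strings here; a fuller parser would convert them using the types declared in the VCF header.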

Understanding FILTER Status

DeepSomatic uses specific FILTER values:

  • PASS: High-confidence somatic variant
  • GERMLINE: Likely germline, not tumor-specific
  • RefCall: Reference call, no variant
  • LowQual: Quality below threshold

Quality Scores

QUAL (Variant Quality)

  • Phred-scaled probability that the variant call is wrong: QUAL = -10 × log10(P(error))
  • QUAL=30 → 99.9% confidence
  • QUAL=60 → 99.9999% confidence

GQ (Genotype Quality)

  • Confidence in genotype assignment, on the same Phred scale
  • GQ=40 → 99.99% confident in genotype

DP (Depth)

  • Total read depth at position
  • Higher depth generally gives more reliable calls
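The Phred scale used by QUAL and GQ is a simple logarithmic transform of the error probability. A minimal sketch of the conversion in both directions, with hypothetical function names:

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled score to the probability that the call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    return -10 * math.log10(p)


for q in (30, 40, 60):
    p_error = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p_error:.6f}, confidence {100 * (1 - p_error):.4f}%")
```

Every 10 points on the scale corresponds to a tenfold drop in error probability, which is why Q30 means 1 error in 1,000 and Q60 means 1 in 1,000,000.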

INFO Field Annotations

Common DeepSomatic INFO fields:

DP=100        # Total depth
VAF=0.3       # Variant allele frequency (30%)
SOR=0.8       # Strand bias odds ratio
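In the file itself these annotations are packed into a single semicolon-separated INFO string. A minimal sketch of unpacking it follows; `parse_info` is a hypothetical helper, and the type guessing is a simplification because a real parser would read the types declared in the VCF header.

```python
def parse_info(info):
    """Turn an INFO string such as 'DP=100;VAF=0.3' into a dictionary."""
    result = {}
    if info in ("", "."):
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if not sep:
            result[key] = True  # flag-style entry with no value
            continue
        # Simplification: guess int, then float, else keep the raw string.
        try:
            result[key] = int(value)
        except ValueError:
            try:
                result[key] = float(value)
            except ValueError:
                result[key] = value
    return result


info = parse_info("DP=100;VAF=0.3;SOR=0.8")
print(info)
```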

Variant Types

SNV (Single Nucleotide Variant)

A single base change:

Reference:  ...ATCGATCG...
Variant:    ...ATCAATCG...
                  ^
            SNV: G→A

Indel (Insertion/Deletion)

Insertion: Added bases

Reference:  ...ATCG---ATCG...
Variant:    ...ATCGTTAATCG...
                   ^^^
            +TTA insertion

Deletion: Removed bases

Reference:  ...ATCGTTAATCG...
Variant:    ...ATCG---ATCG...
                   ^^^
            -TTA deletion

MNV (Multi-Nucleotide Variant)

Multiple consecutive base changes:

Reference:  ...ATCGATCG...
Variant:    ...ATTAATCG...
                 ^^
            MNV: CG→TA
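In a VCF record these types can usually be told apart from the lengths of the REF and ALT alleles. The sketch below shows one simple way to do that; `classify_variant` is a hypothetical helper, and it ignores complex substitutions where both alleles are longer than one base and differ in length for reasons other than a plain insertion or deletion. VCF anchors indels on the preceding base, so the TTA insertion after a G would appear as REF=G, ALT=GTTA.

```python
def classify_variant(ref, alt):
    """Classify a single REF/ALT pair by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"
    if len(ref) < len(alt):
        return "insertion"
    return "deletion"


examples = [("G", "A"), ("G", "GTTA"), ("GTTA", "G"), ("CG", "TA")]
for ref, alt in examples:
    print(ref, alt, classify_variant(ref, alt))
```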

Somatic vs Germline Variants

Germline Variants

Definition: Inherited from parents, present in all cells

Characteristics:

  • Present in both tumor and normal tissue
  • Variant allele frequency (VAF) ≈ 50% (heterozygous) or 100% (homozygous)
  • Found in blood/normal tissue

Example:

Normal tissue: VAF = 50%
Tumor tissue:  VAF = 50%
→ Germline variant

Somatic Variants

Definition: Acquired mutations, only in tumor cells

Characteristics:

  • Absent in normal tissue
  • VAF depends on tumor purity and ploidy
  • Include cancer-driving mutations, although most somatic variants are passengers with no known effect on tumor growth

Example:

Normal tissue: VAF = 0%
Tumor tissue:  VAF = 30%
→ Somatic variant

Mixed Scenarios

Loss of Heterozygosity (LOH):

Normal: A/T (50% VAF for T)
Tumor:  T/T (100% VAF for T)
→ Lost normal A allele

Subclonal Mutations:

Tumor VAF = 15%
→ Only present in subset of tumor cells
→ Later acquired mutation
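The dependence of somatic VAF on tumor purity and ploidy can be sketched with a commonly used expectation: the share of mutant copies among all copies sampled from the mixed tumor and normal cells. The function name, the parameter names, and the 60% purity used below are assumptions chosen for illustration, and the model ignores sequencing noise.

```python
def expected_vaf(purity, tumor_copy_number=2, mutated_copies=1,
                 cancer_cell_fraction=1.0, normal_copy_number=2):
    """Expected VAF of a somatic variant in a tumor/normal cell mixture.

    purity: fraction of cells in the sample that are tumor cells
    cancer_cell_fraction: fraction of tumor cells carrying the variant
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    mutant = purity * cancer_cell_fraction * mutated_copies
    total = purity * tumor_copy_number + (1.0 - purity) * normal_copy_number
    return mutant / total


# Assumed 60% purity, diploid tumor, heterozygous variant in every tumor cell.
clonal = expected_vaf(0.6)
# Same sample, but the variant is present in only half of the tumor cells.
subclonal = expected_vaf(0.6, cancer_cell_fraction=0.5)
print(round(clonal, 2), round(subclonal, 2))
```

Under these assumptions a clonal heterozygous variant lands near 30% VAF and a variant in half of the tumor cells near 15%, matching the two examples above.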

Variant Calling Process

What is Variant Calling?

Variant calling is the computational process of identifying genetic differences between your sample and the reference genome.

The Variant Calling Pipeline

1. Preprocessing

  • Quality control of reads
  • Mark duplicates
  • Base quality recalibration

2. Candidate Identification

  • Find positions that differ from reference
  • Filter for minimum quality and depth

3. Variant Classification

  • Statistical tests or machine learning
  • DeepSomatic uses deep neural networks

4. Filtering

  • Remove low-quality calls
  • Classify as somatic/germline
  • Apply confidence thresholds
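The filtering step can be pictured as a small set of rules applied to each candidate call. The sketch below is only an illustration: the dictionary keys, the threshold values, and the rule order are assumptions, not the logic or defaults of DeepSomatic or any other caller.

```python
def apply_filters(call, min_qual=30.0, min_depth=20, max_normal_vaf=0.02):
    """Assign a FILTER label to one candidate call (hypothetical thresholds)."""
    # Remove low-quality calls first.
    if call["qual"] < min_qual or call["tumor_dp"] < min_depth:
        return "LowQual"
    # Variant reads in the matched normal suggest an inherited variant.
    if call["normal_vaf"] > max_normal_vaf:
        return "GERMLINE"
    return "PASS"


candidates = [
    {"pos": 12345, "qual": 60, "tumor_dp": 100, "normal_vaf": 0.0},
    {"pos": 23456, "qual": 55, "tumor_dp": 90, "normal_vaf": 0.5},
    {"pos": 34567, "qual": 12, "tumor_dp": 80, "normal_vaf": 0.0},
]
labels = [apply_filters(call) for call in candidates]
print(labels)
```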

Traditional vs Deep Learning Approaches

Traditional Methods (e.g., Mutect2, Strelka):

  • Hand-crafted statistical models
  • Feature engineering
  • Fixed thresholds

Deep Learning (DeepSomatic):

  • Learned from millions of examples
  • Pileup images as input
  • Reported to improve accuracy, especially for indels

Reference Genomes

What is a Reference Genome?

A reference genome is the standard DNA sequence used as a coordinate system for mapping variants.

Common References:

  • GRCh38/hg38: Current human reference (2013)
  • GRCh37/hg19: Previous version (2009)
  • T2T-CHM13: Telomere-to-telomere (2022)

Chromosome Naming

UCSC style (hg38, hg19): chr1, chr2, ..., chrX, chrY
Ensembl/NCBI style (common in GRCh37 files): 1, 2, ..., X, Y

Omics807 uses GRCh38 by default.
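Mixing files with and without the "chr" prefix is a common reason that no variants or reads are found in a region. Below is a minimal sketch of converting between the two conventions. The function names are hypothetical, and only the mitochondrial special case (chrM vs MT) is handled; alternate contigs and decoys are left as-is apart from the prefix.

```python
def add_chr_prefix(name):
    """Convert an unprefixed name such as '1' or 'MT' to 'chr1' or 'chrM'."""
    if name.startswith("chr"):
        return name
    return "chrM" if name == "MT" else "chr" + name


def strip_chr_prefix(name):
    """Convert a prefixed name such as 'chr1' or 'chrM' to '1' or 'MT'."""
    if not name.startswith("chr"):
        return name
    return "MT" if name == "chrM" else name[3:]


for name in ("1", "X", "MT", "chr7"):
    print(name, add_chr_prefix(name), strip_chr_prefix(name))
```

Renaming only changes the labels; coordinates still differ between GRCh37 and GRCh38, so moving between builds requires a liftover rather than a rename.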

Key Terms Summary

Term       Definition
BAM        Binary format for aligned sequencing reads
VCF        Standard format for genetic variants
SNV        Single nucleotide change
Indel      Insertion or deletion
VAF        Variant Allele Frequency - % of reads with variant
DP         Read depth at a position
GQ         Genotype quality score
Somatic    Tumor-specific mutation
Germline   Inherited variant
WGS        Whole Genome Sequencing
WES        Whole Exome Sequencing

Next Steps

Now that you understand the basics:

Further Reading