Genomics 101: Essential Concepts

A beginner-friendly guide to understanding the genomics concepts behind Omics807 and somatic variant calling.

DNA Sequencing Basics

What is DNA Sequencing?

DNA sequencing is the process of determining the exact order of nucleotides (A, T, G, C) in a DNA molecule. Modern sequencing technologies can read billions of DNA fragments from a biological sample.

The Sequencing Process:
1. Extract DNA from the tissue sample
2. Fragment the DNA into small pieces
3. Sequence each fragment (generates "reads")
4. Align reads to the reference genome
5. Call variants by comparing to the reference

Sequencing Technologies

Short-Read Sequencing (Illumina)
- Read length: 100-300 base pairs
- Very high accuracy (>99.9%)
- Best for: SNVs, small indels
- Limitations: Structural variants, repetitive regions

Long-Read Sequencing
- PacBio HiFi: 10-25kb reads, high accuracy
- Oxford Nanopore (ONT): Ultra-long reads up to 2Mb
- Best for: Structural variants, phasing, complex regions
- Trade-off: Lower per-base accuracy than short reads, mainly for ONT (HiFi reads are much closer to short-read accuracy)

BAM Files Explained

What are BAM Files?

BAM = Binary Alignment Map

A BAM file contains all the sequencing reads from your sample aligned to a reference genome. Think of it as a digital record of where each DNA fragment from your sample maps to the genome.

Key Components:
- Header: Metadata about the sequencing run
- Alignment records: Each read's position, sequence, and quality
- Index file (.bai): Allows fast access to specific regions

BAM File Structure

Read 1: chr1:1000-1150 ATCGATCG... MAPQ=60
Read 2: chr1:1050-1200 GCTAGCTA... MAPQ=42
Read 3: chr1:1100-1250 TTAACCGG... MAPQ=55
...

Important Fields:
- RNAME: Chromosome name (e.g., chr1)
- POS: 1-based leftmost start position of the alignment
- MAPQ: Mapping quality (typically 0-60, depending on the aligner; higher = better)
- CIGAR: Alignment description (matches, insertions, deletions)
- SEQ: Read sequence
- QUAL: Base quality scores
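To make these fields concrete, here is a minimal sketch that parses one alignment record in SAM text form, the human-readable equivalent of BAM. Real BAM files are binary and compressed, so in practice a library such as pysam or the samtools command line does this work. The `SamRecord` type, the `parse_sam_line` helper, and the example record are made up for illustration.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags (strand, pairing, ...)
    rname: str   # chromosome / contig name
    pos: int     # 1-based leftmost position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    seq: str     # read sequence
    qual: str    # base qualities, ASCII-encoded

def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory tab-separated SAM fields (optional tags are ignored)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM record needs at least 11 fields")
    # Fields 7-9 (RNEXT, PNEXT, TLEN) describe the mate and are skipped here.
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9], f[10])

# Hypothetical record, not taken from a real file
example = "read1\t0\tchr1\t1000\t60\t8M\t*\t0\t0\tATCGATCG\tIIIIIIII"
rec = parse_sam_line(example)
print(rec.rname, rec.pos, rec.mapq, rec.cigar)
```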

Coverage and Depth

Read Depth (DP): Number of reads covering a genomic position

Position:  1000  1001  1002  1003  1004
Read 1:     A     T     C     G     A
Read 2:     A     T     C     G     A
Read 3:     A     T     T     G     A  <- variant at 1002
Read 4:     A     T     C     G     A
Depth:      4     4     4     4     4
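The depth row above can be reproduced with a few lines of Python. This is only a sketch: it assumes every read aligns without gaps (no insertions, deletions, or clipping), and the `pileup` helper and the read tuples are illustrative names, not part of any tool.

```python
from collections import Counter, defaultdict

def pileup(reads):
    """Count bases per reference position.

    reads: iterable of (start, sequence) pairs, where start is the 1-based
    position of the first base. Assumes gapless alignments.
    """
    columns = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    return columns

# The four reads from the diagram above, all starting at position 1000
reads = [(1000, "ATCGA"), (1000, "ATCGA"), (1000, "ATTGA"), (1000, "ATCGA")]

columns = pileup(reads)
depth = {pos: sum(counts.values()) for pos, counts in columns.items()}

print(depth[1002])          # 4
print(dict(columns[1002]))  # {'C': 3, 'T': 1}
```

One of the four reads at position 1002 carries T instead of C, which is the evidence a variant caller would weigh.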

Typical Coverage Requirements:
- Germline WGS: 30-50x
- Tumor WGS: 60-100x
- Exome: 100-150x
- Low-frequency variants: 200x+

VCF Format Explained

What is a VCF File?

VCF = Variant Call Format

A VCF file is the standard format for storing genetic variants. It lists positions where the sample differs from the reference genome.

VCF Structure

##fileformat=VCFv4.2
##reference=GRCh38
#CHROM  POS     ID      REF  ALT   QUAL  FILTER  INFO           FORMAT      TUMOR
chr1    12345   .       A    T     60    PASS    DP=100;VAF=0.3  GT:GQ:DP    0/1:45:100
chr1    67890   .       GC   G     42    PASS    DP=80;VAF=0.25  GT:GQ:DP    0/1:38:80

Column Definitions:

Column	Name	Description
CHROM	Chromosome	Chromosome name (chr1-chr22, chrX, chrY)
POS	Position	1-based genomic position
ID	Identifier	dbSNP ID if known (or '.')
REF	Reference	Reference allele from genome
ALT	Alternate	Variant allele observed
QUAL	Quality	Phred-scaled quality score
FILTER	Filter	PASS or reason for failure
INFO	Information	Additional annotations
FORMAT	Format	Genotype field definitions
SAMPLE	Sample	Per-sample genotype data (the column header is the sample name, e.g. TUMOR in the example above)
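The column layout can be illustrated by splitting one data line by hand. This is a minimal sketch, assuming a tab-delimited line (as in real VCF files) and a known list of sample names; `parse_info` and `parse_vcf_line` are hypothetical helpers, and production code would normally rely on an established VCF library instead.

```python
def parse_info(info: str) -> dict:
    """Split 'DP=100;VAF=0.3' into a dict; bare flags map to True."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result

def parse_vcf_line(line: str, sample_names: list) -> dict:
    """Parse one VCF data line into a plain dict (values are kept as strings)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    record = {
        "chrom": fields[0],
        "pos": int(fields[1]),        # 1-based position
        "id": fields[2],
        "ref": fields[3],
        "alt": fields[4].split(","),  # several ALT alleles are comma-separated
        "qual": None if fields[5] == "." else float(fields[5]),
        "filter": fields[6],
        "info": parse_info(fields[7]),
        "samples": {},
    }
    if len(fields) > 9:
        keys = fields[8].split(":")   # FORMAT column, e.g. GT:GQ:DP
        for name, column in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(keys, column.split(":")))
    return record

# First data line of the example above, written with tabs
line = "chr1\t12345\t.\tA\tT\t60\tPASS\tDP=100;VAF=0.3\tGT:GQ:DP\t0/1:45:100"
record = parse_vcf_line(line, ["TUMOR"])
print(record["pos"], record["ref"], record["alt"], record["info"]["VAF"])
print(record["samples"]["TUMOR"]["GT"])
```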

Understanding FILTER Status

DeepSomatic uses specific FILTER values:

PASS: High-confidence somatic variant
GERMLINE: Likely germline, not tumor-specific
RefCall: Reference call, no variant
LowQual: Quality below threshold

Quality Scores

QUAL (Variant Quality)
- Phred-scaled probability that the variant call is wrong: QUAL = -10 × log10(P(error))
- QUAL=30 → 99.9% confidence
- QUAL=60 → 99.9999% confidence

GQ (Genotype Quality)
- Phred-scaled confidence in the genotype assignment
- GQ=40 → 99.99% confident in genotype

DP (Depth)
- Total read depth at position
- Higher depth = more reliable calls
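The confidence figures above follow from the Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). A short sketch of the conversion in both directions (the function names are illustrative):

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Probability that the call is wrong for a Phred-scaled score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred-scaled score for an error probability p, with 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)

for q in (30, 40, 60):
    confidence = 1 - phred_to_error_prob(q)
    print(f"Q{q}: error probability {phred_to_error_prob(q):g}, confidence {confidence:.4%}")
```

Every 10 points on the scale cuts the error probability by a factor of ten, which is why Q60 is a thousand times more stringent than Q30.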

INFO Field Annotations

Common DeepSomatic INFO fields:

DP=100        # Total depth
VAF=0.3       # Variant allele frequency (30%)
SOR=0.8       # Strand bias odds ratio
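VAF is the fraction of reads at a position that support the variant allele. A minimal sketch, assuming the number of variant-supporting reads and the total depth are already known (the function name and the read counts are illustrative):

```python
def variant_allele_frequency(alt_reads: int, depth: int) -> float:
    """Fraction of reads supporting the variant allele."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must be between 0 and depth")
    return alt_reads / depth

# 30 variant-supporting reads out of 100 gives the VAF=0.3 shown above
vaf = variant_allele_frequency(30, 100)
print(f"VAF = {vaf:.2f}")
```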

Variant Types

SNV (Single Nucleotide Variant)

A single base change:

Reference:  ...ATCGATCG...
Variant:    ...ATCAATCG...
                  ^
            SNV: G→A

Indel (Insertion/Deletion)

Insertion: Added bases

Reference:  ...ATCG---ATCG...
Variant:    ...ATCGTTAATCG...
                   ^^^
            +TTA insertion

Deletion: Removed bases

Reference:  ...ATCGTTAATCG...
Variant:    ...ATCG---ATCG...
                   ^^^
            -TTA deletion

MNV (Multi-Nucleotide Variant)

Multiple consecutive base changes:

Reference:  ...ATCGATCG...
Variant:    ...ATTAATCG...
                 ^^
            MNV: CG→TA
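In a VCF, these types can usually be told apart from the lengths of the REF and ALT alleles. The sketch below is a simplified illustration: `classify_variant` is a hypothetical helper, it handles one ALT allele at a time, and it labels any length change as an insertion or deletion even when bases are also substituted.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a single REF/ALT pair by allele length."""
    if not ref or not alt:
        raise ValueError("REF and ALT must be non-empty")
    if ref == alt:
        raise ValueError("REF and ALT are identical, so there is no variant")
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"
    if len(ref) < len(alt):
        return "insertion"
    return "deletion"

# VCF indels include the preceding anchor base in both alleles
examples = [("G", "A"), ("G", "GTTA"), ("GTTA", "G"), ("CG", "TA")]
for ref, alt in examples:
    print(f"{ref} -> {alt}: {classify_variant(ref, alt)}")
```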

Somatic vs Germline Variants

Germline Variants

Definition: Inherited from parents, present in all cells

Characteristics:
- Present in both tumor and normal tissue
- Variant allele frequency (VAF) ≈ 50% (heterozygous) or 100% (homozygous)
- Found in blood/normal tissue

Example:

Normal tissue: VAF = 50%
Tumor tissue:  VAF = 50%
→ Germline variant

Somatic Variants

Definition: Acquired mutations, only in tumor cells

Characteristics:
- Absent in normal tissue
- VAF depends on tumor purity and ploidy
- May include cancer-driving mutations, although most somatic mutations are passengers

Example:

Normal tissue: VAF = 0%
Tumor tissue:  VAF = 30%
→ Somatic variant
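The two examples above boil down to asking whether the variant is visible in the normal sample, the tumor sample, or both. The toy rule below encodes only that idea. The `classify_by_vaf` function and its 5% cutoff are arbitrary assumptions for illustration; real callers model sequencing noise and sample contamination, and DeepSomatic uses a neural network rather than a fixed threshold.

```python
def classify_by_vaf(normal_vaf: float, tumor_vaf: float, min_vaf: float = 0.05) -> str:
    """Toy classification from matched normal/tumor VAFs.

    min_vaf is an illustrative cutoff below which the variant is
    treated as absent from a sample.
    """
    in_normal = normal_vaf >= min_vaf
    in_tumor = tumor_vaf >= min_vaf
    if in_tumor and in_normal:
        return "germline"
    if in_tumor:
        return "somatic"
    return "no tumor variant"

print(classify_by_vaf(normal_vaf=0.50, tumor_vaf=0.50))  # germline
print(classify_by_vaf(normal_vaf=0.00, tumor_vaf=0.30))  # somatic
```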

Mixed Scenarios

Loss of Heterozygosity (LOH):

Normal: A/T (50% VAF for T)
Tumor:  T/T (100% VAF for T)
→ Lost normal A allele

Subclonal Mutations:

Tumor VAF = 15%
→ Only present in subset of tumor cells
→ Later acquired mutation
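The dependence of VAF on tumor purity and copy number can be written as a simple expectation. The sketch below uses a simplified model (a tumor population mixed with diploid normal cells, no sequencing noise). The function name, its parameters, and the 60% purity are assumptions chosen for illustration.

```python
def expected_vaf(purity: float, tumor_cn: int = 2, mutated_copies: int = 1,
                 ccf: float = 1.0, normal_cn: int = 2) -> float:
    """Expected VAF of a somatic mutation under a simple mixture model.

    purity:         fraction of cells in the sample that are tumor cells
    tumor_cn:       total copy number at the locus in tumor cells
    mutated_copies: copies carrying the mutation in a mutated tumor cell
    ccf:            cancer cell fraction (1.0 = clonal, < 1.0 = subclonal)
    normal_cn:      copy number in normal cells (2 on autosomes)
    """
    mutant_alleles = purity * ccf * mutated_copies
    total_alleles = purity * tumor_cn + (1 - purity) * normal_cn
    return mutant_alleles / total_alleles

# Clonal heterozygous mutation in a diploid tumor at 60% purity
clonal = expected_vaf(purity=0.6)
# Same locus, but the mutation is present in only half of the tumor cells
subclonal = expected_vaf(purity=0.6, ccf=0.5)
print(f"clonal: {clonal:.2f}, subclonal: {subclonal:.2f}")
```

Under these assumptions the clonal case gives a VAF of about 30% and the subclonal case about 15%, in line with the examples above. A low VAF alone does not prove subclonality, since low purity or a copy-number gain produces the same effect.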

Variant Calling Process

What is Variant Calling?

Variant calling is the computational process of identifying genetic differences between your sample and the reference genome.

The Variant Calling Pipeline

1. Preprocessing
- Quality control of reads
- Mark duplicates
- Base quality recalibration

2. Candidate Identification
- Find positions that differ from reference
- Filter for minimum quality and depth

3. Variant Classification
- Statistical tests or machine learning
- DeepSomatic uses deep neural networks

4. Filtering
- Remove low-quality calls
- Classify as somatic/germline
- Apply confidence thresholds
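Step 2 of the pipeline above can be sketched as a scan over per-position allele counts. This is a simplified illustration, not how DeepSomatic selects candidates: the `find_candidates` helper, the input dictionaries, and the thresholds (`min_depth`, `min_alt_reads`) are all hypothetical.

```python
def find_candidates(pileup: dict, reference: dict,
                    min_depth: int = 10, min_alt_reads: int = 3) -> list:
    """Return (pos, ref_base, alt_base, vaf) for positions that differ from the reference.

    pileup:    position -> {base: read count}
    reference: position -> reference base
    """
    candidates = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little evidence to evaluate this position
        ref_base = reference[pos]
        for base, n in counts.items():
            if base != ref_base and n >= min_alt_reads:
                candidates.append((pos, ref_base, base, n / depth))
    return candidates

# Made-up counts: 1002 has a well-supported T, 1003 is too shallow
pileup = {
    1001: {"T": 40},
    1002: {"C": 28, "T": 12},
    1003: {"G": 5, "A": 4},
}
reference = {1001: "T", 1002: "C", 1003: "G"}

for pos, ref_base, alt_base, vaf in find_candidates(pileup, reference):
    print(f"{pos} {ref_base}>{alt_base} VAF={vaf:.2f}")
```

Candidates found this way would then go to the classification step, where a statistical model or neural network decides which ones are real.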

Traditional vs Deep Learning Approaches

Traditional Methods (e.g., Mutect2, Strelka):
- Hand-crafted statistical models
- Feature engineering
- Fixed thresholds

Deep Learning (DeepSomatic):
- Learned from millions of examples
- Pileup images as input
- Higher reported accuracy, especially for indels

Reference Genomes

What is a Reference Genome?

A reference genome is the standard DNA sequence used as a coordinate system for mapping variants.

Common References:
- GRCh38/hg38: Current human reference (2013)
- GRCh37/hg19: Previous version (2009)
- T2T-CHM13: Telomere-to-telomere (2022)

Chromosome Naming

UCSC style (hg38, hg19, and most GRCh38 analysis sets): chr1, chr2, ..., chrX, chrY
Ensembl/NCBI style (common for GRCh37 files): 1, 2, ..., X, Y

Omics807 uses GRCh38 by default.
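Mixing the two naming styles is a common reason for tools reporting that no reads or variants overlap. A minimal sketch of converting between them for the primary chromosomes follows. The helper names are illustrative, the mitochondrial mapping (MT versus chrM) is the usual convention, and alternate or decoy contigs are not handled.

```python
def to_chr_style(name: str) -> str:
    """Convert '1', 'X', 'MT' to 'chr1', 'chrX', 'chrM'."""
    if name.startswith("chr"):
        return name
    if name == "MT":
        return "chrM"
    return "chr" + name

def to_plain_style(name: str) -> str:
    """Convert 'chr1', 'chrX', 'chrM' to '1', 'X', 'MT'."""
    if name == "chrM":
        return "MT"
    if name.startswith("chr"):
        return name[3:]
    return name

for name in ("1", "X", "MT", "chr7"):
    print(name, "->", to_chr_style(name))
```

Renaming only fixes the labels. Coordinates still differ between GRCh37 and GRCh38, so moving data between builds requires a liftover step or realignment.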

Key Terms Summary

Term	Definition
BAM	Binary format for aligned sequencing reads
VCF	Standard format for genetic variants
SNV	Single nucleotide change
Indel	Insertion or deletion
VAF	Variant Allele Frequency - % of reads with variant
DP	Read depth at a position
GQ	Genotype quality score
Somatic	Tumor-specific mutation
Germline	Inherited variant
WGS	Whole Genome Sequencing
WES	Whole Exome Sequencing

Next Steps

Now that you understand the basics:

Learn about DeepSomatic Models
Understand How to Interpret Results
Explore Example Datasets
Read the Glossary for more terms
