DeepSomatic: The Science Behind Omics807

Learn about Google's DeepSomatic, the deep learning technology that powers Omics807's variant calling.

What is DeepSomatic?

DeepSomatic is a deep learning-based somatic variant caller developed by Google Health. It identifies cancer-specific mutations by distinguishing true variants from sequencing errors and germline polymorphisms.

Key Innovation

Traditional Approach: Hand-crafted statistical models with fixed rules

DeepSomatic Approach: Neural networks trained on millions of validated examples

Result: Higher accuracy, especially for challenging variants like indels and low-frequency mutations

How DeepSomatic Extends DeepVariant

DeepVariant Foundation

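DeepVariant (first described in a 2016 preprint, published in Nature Biotechnology in 2018) reframed germline variant calling as an image-classification problem solved with CNNs:

- Treats variant calling as image classification
- Creates "pileup images" from aligned reads
- Uses the Inception-v3 architecture for classification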

Somatic Challenges

Somatic variant calling adds complexity:

- Tumor heterogeneity: Multiple cell populations with different mutations
- Low variant allele frequency: Variants in small tumor subclones
- Normal contamination: Mixed tumor/normal tissue
- Artifacts: FFPE damage, PCR errors, alignment issues

DeepSomatic Innovations

1. Paired Sample Analysis
   - Processes tumor and normal simultaneously
   - Learns patterns distinguishing somatic from germline

2. Enhanced Training Data
   - SEQC2 consortium truth sets
   - Multiple sequencing technologies
   - Diverse tumor types and purities

3. Specialized Models
   - Technology-specific models (WGS, WES, PacBio, ONT)
   - FFPE-aware models for degraded DNA
   - Tumor-only mode with panel of normals

4. Three-Channel Architecture
   - Read base (sequence information)
   - Base quality (confidence scores)
   - Mapping quality (alignment confidence)

Research Publication

Preprint Details

Title: "DeepSomatic: Accurate somatic small variant discovery for multiple sequencing technologies"

Authors: Google Health AI team

Publication: bioRxiv preprint (2024)

DOI: 10.1101/2024.08.16.608331

Key Findings:

- Outperforms existing callers (MuTect2, Strelka2) on indels
- Superior performance on FFPE samples
- Maintains high precision across diverse datasets
- Generalizes well to unseen sequencing platforms

Performance Highlights

WGS (Illumina):

- SNV: 95.1% recall, 98.9% precision
- Indel: 93.0% recall, 84.8% precision

WES (Illumina):

- SNV: 94.4% recall, 99.1% precision
- Indel: 89.6% recall, 93.5% precision

FFPE WGS:

- SNV: 81.7% recall, 94.5% precision
- Handles C→T artifacts effectively
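Recall and precision are easier to compare across rows when they are combined into one number. The sketch below derives F1 (the harmonic mean of the two) from the figures quoted above. F1 is not part of the reported figures; it is computed here purely for illustration, and the function name is hypothetical.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) pairs copied from the figures above, as fractions.
reported = {
    "WGS SNV": (0.951, 0.989),
    "WGS indel": (0.930, 0.848),
    "WES SNV": (0.944, 0.991),
    "WES indel": (0.896, 0.935),
    "FFPE WGS SNV": (0.817, 0.945),
}

for label, (recall, precision) in reported.items():
    print(f"{label}: F1 = {f1_score(recall, precision):.3f}")
```

Because F1 is a harmonic mean, it is pulled toward the weaker of the two metrics, which is why the WGS indel row scores noticeably lower than its recall alone would suggest.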

See Model Guide for complete metrics.

The 3-Stage Pipeline

DeepSomatic consists of three sequential stages:

Stage 1: make_examples

Purpose: Create pileup images from aligned reads

Process:

  1. Candidate identification
     - Scan BAM files for potential variants
     - Compare tumor vs normal coverage
     - Filter obvious germline/artifacts

  2. Pileup generation
     - Create image for each candidate
     - Encode reads as pixel rows
     - Color-code bases, qualities, mapping info

  3. Example creation
     - Generate TensorFlow examples
     - Include both tumor and normal pileups
     - Add metadata (position, ref/alt alleles)

Output: TFRecord files with pileup images

Runtime: 30-40% of total time

Parallelization: Highly parallelizable by genomic region
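To illustrate why this stage parallelizes well, the sketch below cuts contigs into fixed-size windows and deals them out to shards round-robin, so each shard can be processed independently. It is a minimal illustration, not DeepSomatic's actual sharding logic; the function names, the toy contig lengths, and the window size are all assumptions.

```python
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (contig, start, end), 0-based half-open


def window_regions(contig_lengths: Dict[str, int], window: int) -> List[Region]:
    """Cut each contig into consecutive windows of at most `window` bases."""
    regions = []
    for contig, length in contig_lengths.items():
        for start in range(0, length, window):
            regions.append((contig, start, min(start + window, length)))
    return regions


def assign_to_shards(regions: List[Region], num_shards: int) -> List[List[Region]]:
    """Distribute regions round-robin so shards receive similar amounts of work."""
    shards: List[List[Region]] = [[] for _ in range(num_shards)]
    for i, region in enumerate(regions):
        shards[i % num_shards].append(region)
    return shards


# Toy contig lengths (hypothetical, far shorter than real chromosomes).
toy_lengths = {"chr1": 2500, "chr2": 1200}
regions = window_regions(toy_lengths, window=1000)
shards = assign_to_shards(regions, num_shards=4)

for shard_id, shard in enumerate(shards):
    print(shard_id, shard)
```

Since no window depends on another, each shard can run on its own core or machine and the outputs can be concatenated afterwards.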

Stage 2: call_variants

Purpose: Classify variants using deep neural network

Process:

  1. Load model
     - Technology-specific CNN (WGS, WES, etc.)
     - Pre-trained on millions of examples

  2. Inference
     - Process pileup images through network
     - Output: Genotype probabilities
     - For each candidate: P(0/0), P(0/1), P(1/1)

  3. Quality assignment
     - Convert probabilities to Phred scores
     - Generate genotype quality (GQ)
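The quality-assignment step can be sketched with the standard Phred transform, Q = -10 * log10(p_error). The example below takes the three genotype probabilities, picks the most likely genotype, and reports GQ as the Phred-scaled probability that this choice is wrong. The cap of 99 and the function names are assumptions for illustration; the exact formula DeepSomatic applies is not stated here.

```python
import math


def phred(error_prob: float, cap: int = 99) -> int:
    """Phred-scale an error probability, capping the score when p approaches 0."""
    if error_prob <= 0.0:
        return cap
    return min(cap, int(round(-10.0 * math.log10(error_prob))))


def genotype_quality(probs):
    """Return (index of most likely genotype, GQ) for a probability vector."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, phred(1.0 - probs[best])


labels = ["0/0", "0/1", "1/1"]
best, gq = genotype_quality([0.001, 0.990, 0.009])
print(labels[best], gq)
```

A probability of 0.99 for the chosen genotype leaves a 1% chance of error, which corresponds to GQ 20; each additional factor of ten in confidence adds 10 to the score.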

Output: CallVariantsOutput (intermediate format)

Runtime: 50-60% of total time

GPU acceleration: Can reduce this stage by 50-70%
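The effect of GPU acceleration on end-to-end runtime follows from the two ranges above: only the call_variants share of the run gets faster. The short calculation below multiplies the stage's share of total time by the reduction within that stage. It assumes the other stages are unaffected, and the function name is hypothetical.

```python
def overall_time_saved(stage_fraction: float, stage_reduction: float) -> float:
    """Fraction of total runtime saved when a single stage is sped up."""
    return stage_fraction * stage_reduction


# Lower bound: stage is 50% of the run and is cut by 50%.
low = overall_time_saved(0.50, 0.50)
# Upper bound: stage is 60% of the run and is cut by 70%.
high = overall_time_saved(0.60, 0.70)

print(f"End-to-end runtime saved: {low:.0%} to {high:.0%}")
```

Under these assumptions a GPU shortens the whole pipeline by roughly a quarter to two fifths, after which make_examples becomes the dominant cost.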

Stage 3: postprocess_variants

Purpose: Convert to VCF and apply filtering

Process:

  1. VCF generation
     - Convert predictions to VCF format
     - Add INFO/FORMAT fields
     - Calculate additional metrics (VAF, DP)

  2. Quality filtering
     - Apply QUAL thresholds
     - Mark germline variants (GERMLINE filter)
     - Flag low-quality calls

  3. Variant merging
     - Combine multi-allelic sites
     - Left-align indels
     - Normalize representations

  4. gVCF creation (optional)
     - Include reference calls
     - Useful for joint calling

Output: Final VCF/gVCF files

Runtime: 5-10% of total time
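Two of the postprocessing steps, computing VAF and assigning a filter label, can be sketched in a few lines. This is an illustration only: the `GERMLINE` label comes from the description above, while the `LowQual` label, the QUAL threshold of 10, and both function names are assumptions rather than DeepSomatic's actual values.

```python
def vaf(alt_depth: int, total_depth: int) -> float:
    """Variant allele frequency: alt-supporting reads over total depth (DP)."""
    return alt_depth / total_depth if total_depth > 0 else 0.0


def assign_filter(qual: float, is_germline: bool, min_qual: float = 10.0) -> str:
    """Pick a FILTER value; the threshold and 'LowQual' label are hypothetical."""
    if is_germline:
        return "GERMLINE"
    if qual < min_qual:
        return "LowQual"
    return "PASS"


# 12 of 80 reads support the alternate allele.
print(vaf(12, 80), assign_filter(qual=35.0, is_germline=False))
```

Checking the germline flag first means a confidently called germline variant is never reported as a passing somatic call, regardless of its QUAL.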

Deep Learning Architecture

Pileup Image Encoding

DeepSomatic encodes aligned reads as RGB images:

Image Dimensions:

- Width: 221 pixels (genomic window)
- Height: Variable (max reads, typically 100)
- Channels: 3 (RGB)

Encoding Scheme:

Channel 1 (Red): Read bases

A = 250 (red)
C = 180 (orange)
G = 100 (yellow)
T = 30 (blue)
N = 0 (black)

Channel 2 (Green): Base quality

Quality 0-60 → Grayscale 0-255
Higher quality = Brighter

Channel 3 (Blue): Mapping quality

MAPQ 0-60 → Grayscale 0-255
Higher MAPQ = Brighter

Special Encodings:

- Deletions: Gap in image
- Insertions: Wider pixel
- Strand: Top half (forward), Bottom half (reverse)

Example Pileup:

Reference:  A C G T A C G T
Read 1:     A C G T A C G T  (MAPQ=60, high quality)
Read 2:     A C - T A C G T  (deletion at pos 3)
Read 3:     A C G T A T G T  (C→T variant at pos 6)
Read 4:     A C G T A C G T  (low quality at pos 5)

→ Encoded as a 221×4 RGB image
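The encoding scheme above can be sketched in plain Python by turning the four example reads into rows of (base, base quality, mapping quality) pixels. This toy version is only 8 pixels wide rather than 221, encodes a deletion as an all-zero pixel, and uses made-up base qualities (40 for "high", 5 for "low"); the function names and those values are assumptions, not DeepSomatic's implementation.

```python
BASE_VALUES = {"A": 250, "C": 180, "G": 100, "T": 30, "N": 0}


def scale_quality(q: int, max_q: int = 60) -> int:
    """Map a quality in [0, max_q] linearly onto [0, 255], clamping outliers."""
    q = max(0, min(q, max_q))
    return round(q * 255 / max_q)


def encode_read(bases: str, base_quals, mapq: int):
    """Return one pixel row: a list of (base, base_quality, mapq) triples."""
    row = []
    for base, qual in zip(bases, base_quals):
        if base == "-":
            # Deletion: leave a gap, represented here as an all-zero pixel.
            row.append((0, 0, 0))
        else:
            row.append(
                (BASE_VALUES.get(base, 0), scale_quality(qual), scale_quality(mapq))
            )
    return row


# The four reads from the example pileup; qualities are hypothetical.
reads = [
    ("ACGTACGT", [40] * 8, 60),
    ("AC-TACGT", [40] * 8, 60),
    ("ACGTATGT", [40] * 8, 60),
    ("ACGTACGT", [40, 40, 40, 40, 5, 40, 40, 40], 60),
]
image = [encode_read(bases, quals, mapq) for bases, quals, mapq in reads]

for row in image:
    print(row)
```

In the resulting rows, the deletion in read 2, the T at position 6 of read 3, and the dim quality value at position 5 of read 4 each stand out against the otherwise identical pixels, which is the kind of pattern the classifier learns from.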

Neural Network Architecture

Base Model: Inception-v3 (modified)

Modifications for Somatic Calling:

  1. Dual-input branch
     - Tumor pileup branch
     - Normal pileup branch
     - Shared feature extraction

  2. Attention mechanism
     - Focus on discordant regions
     - Weight tumor-specific patterns

  3. Output layer
     - 3-class classifier: 0/0, 0/1, 1/1
     - Softmax activation
     - Calibrated probabilities
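The output layer's softmax step turns three raw network scores into probabilities that sum to 1. A minimal sketch follows, using the usual max-subtraction trick for numerical stability; the logit values are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


GENOTYPES = ["0/0", "0/1", "1/1"]

# Hypothetical raw scores from the final layer for one candidate.
probs = softmax([0.2, 3.1, -1.0])
call = GENOTYPES[probs.index(max(probs))]

print(call, [round(p, 3) for p in probs])
```

Softmax alone guarantees only that the outputs are non-negative and sum to 1; whether they are well calibrated depends on how the model was trained and evaluated.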

Training:

- Loss: Cross-entropy with class weighting
- Optimizer: Adam with learning rate scheduling
- Regularization: Dropout, batch normalization
- Data augmentation: Read shuffling, quality perturbation

Model Training

Training Data Sources:

  1. SEQC2 Consortium
     - HCC1395 truth sets
     - Multiple technologies
     - Validated variants

  2. Synthetic Data
     - BAMSurgeon spike-ins
     - Controlled VAF variants
     - Edge case coverage

  3. TCGA Samples (for diversity)
     - Multiple cancer types
     - Various tumor purities
     - Real-world complexity

Training Strategy:

1. Pre-training on DeepVariant germline data
2. Fine-tuning on somatic training set
3. Technology-specific refinement
4. Validation on held-out chromosomes

Key Hyperparameters:

- Batch size: 512 examples
- Learning rate: 0.001 (initial)
- Epochs: 100-200
- Validation: Chromosome 1 held out
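Holding out a whole chromosome keeps validation examples from overlapping the genomic regions seen in training. The sketch below shows one way such a split could be done; the example records and the function name are hypothetical, and only the choice of chromosome 1 comes from the text above.

```python
def split_by_chromosome(examples, held_out=("chr1",)):
    """Send examples on held-out chromosomes to validation, the rest to training."""
    held_out = set(held_out)
    train, validation = [], []
    for example in examples:
        target = validation if example["chrom"] in held_out else train
        target.append(example)
    return train, validation


# Toy labelled examples (hypothetical positions and labels).
examples = [
    {"chrom": "chr1", "pos": 100, "label": 1},
    {"chrom": "chr2", "pos": 200, "label": 0},
    {"chrom": "chr1", "pos": 300, "label": 0},
    {"chrom": "chr3", "pos": 400, "label": 1},
]
train, validation = split_by_chromosome(examples)

print(len(train), "training /", len(validation), "validation")
```

Splitting by chromosome rather than at random avoids placing near-identical pileups from neighbouring positions on both sides of the split.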

Comparison to Other Methods

DeepSomatic vs MuTect2

MuTect2 (Broad Institute):

- Bayesian statistical model
- Hand-crafted features
- Widely used in TCGA

DeepSomatic Advantages:

- 15-20% better indel recall
- More robust to low VAF (<10%)
- Better FFPE performance
- No manual parameter tuning

MuTect2 Advantages:

- Faster runtime
- Lower memory usage
- More interpretable models

DeepSomatic vs Strelka2

Strelka2 (Illumina):

- Empirical Bayesian model
- Optimized for speed
- Good germline filtering

DeepSomatic Advantages:

- Higher precision (fewer FPs)
- Better complex variant calling
- Multi-platform support

Strelka2 Advantages:

- Faster (2-3× speed)
- Lower resource requirements
- Well-established in clinical use

Ensemble Approaches

Recommended Strategy:

- Run DeepSomatic + MuTect2
- Take the consensus (intersection) of the two call sets
- Rescue high-confidence calls made by only one caller
- This tends to give the best precision/recall balance
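The consensus-plus-rescue strategy can be sketched with set operations on variant keys. The example below assumes each caller's output has already been reduced to a mapping from (chrom, pos, ref, alt) to a quality score. Applying one shared rescue threshold to both callers is a simplification, since their scores are not on the same scale; the threshold of 30, the function name, and the toy calls are all assumptions.

```python
from typing import Dict, Set, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def ensemble_calls(
    calls_a: Dict[VariantKey, float],
    calls_b: Dict[VariantKey, float],
    rescue_qual: float = 30.0,
) -> Tuple[Set[VariantKey], Set[VariantKey]]:
    """Return (consensus, rescued): shared calls, plus confident unique calls."""
    consensus = set(calls_a) & set(calls_b)
    rescued = set()
    for calls in (calls_a, calls_b):
        for key, qual in calls.items():
            if key not in consensus and qual >= rescue_qual:
                rescued.add(key)
    return consensus, rescued


# Toy call sets (hypothetical variants and scores).
deepsomatic_calls = {
    ("chr1", 1000, "A", "T"): 45.0,
    ("chr2", 500, "G", "GA"): 38.0,
    ("chr3", 42, "C", "T"): 8.0,
}
mutect2_calls = {
    ("chr1", 1000, "A", "T"): 60.0,
    ("chr4", 77, "T", "C"): 12.0,
}

consensus, rescued = ensemble_calls(deepsomatic_calls, mutect2_calls)
final_calls = consensus | rescued
print(sorted(final_calls))
```

Exact key matching only works if both VCFs have been normalized the same way (left-aligned indels, split multi-allelic sites) before comparison.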

Technical Requirements

Computational Resources

Minimum:

- CPU: 8 cores
- RAM: 32GB
- Storage: 500GB
- Time: ~6 hours (WGS)

Recommended:

- CPU: 32+ cores
- RAM: 128GB
- Storage: 2TB SSD
- GPU: Optional (Nvidia T4 or better)

Optimal (Google's setup):

- CPU: 96 cores (n2-standard-96)
- RAM: 384GB
- GPU: Nvidia A100 (for call_variants)
- Time: ~3 hours (WGS)

Scaling Strategies

Parallelization:

# make_examples: Split by region
--regions=chr1,chr2,chr3
--num_shards=32

# call_variants: Batch processing
--batch_size=512

# postprocess_variants: Single-threaded (fast)

Cloud Deployment:

  1. Google Cloud
     - Preemptible VMs (70% cost savings)
     - Cloud Life Sciences API
     - Batch processing

  2. AWS
     - Spot instances
     - AWS Batch
     - S3 for data storage

  3. Azure
     - Azure Batch
     - Low-priority VMs
     - Blob storage

Open Source and Community

GitHub Repository

URL: https://github.com/google/deepsomatic

Contents:

- Source code (C++, Python)
- Pre-trained models
- Case studies and tutorials
- Issue tracking

License: BSD 3-Clause

Docker Images

Available versions:

# CPU version (default)
google/deepsomatic:1.9.0

# GPU version
google/deepsomatic:1.9.0-gpu

# Specific commits
google/deepsomatic:1.9.0-gpu-d5f8c9e

Image size: ~5GB (includes models)

Community Contributions

Ways to contribute:

- Report bugs and issues
- Submit performance benchmarks
- Add new model types
- Improve documentation
- Share case studies

Future Directions

Planned Improvements

Version 1.10+:

- Multi-sample joint calling
- Improved tumor-only filtering
- Long-read model enhancements
- Real-time analysis support

Research Areas:

- Subclonal reconstruction
- Copy number integration
- Mutational signature calling
- RNA-seq variant detection

Integration Opportunities

Potential add-ons:

- SV calling (DeepSomatic-SV)
- Methylation analysis
- Gene fusion detection
- Tumor purity estimation

Best Practices from Google

Input Data Quality

Pre-processing checklist:

- ✅ Mark duplicates (Picard)
- ✅ Base quality recalibration (GATK)
- ✅ Proper read groups
- ✅ Consistent reference genome

Quality metrics:

- Insert size distribution
- Coverage uniformity
- Contamination check (<5%)

Model Selection

Decision tree:

1. Identify the sequencing technology
2. Check for FFPE characteristics
3. Determine whether a matched normal is available
4. Select the appropriate model

Avoid common mistakes:

- ❌ Don't use the WGS model on exome data
- ❌ Don't ignore FFPE artifacts
- ❌ Don't skip the normal sample when one is available
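The decision tree can be written down as a small function so that the same rules are applied to every sample. The model labels it returns (for example `FFPE_WES_TUMOR_ONLY`) are illustrative names built from the categories above and are an assumption; check them against the model list of the release you are running. The function name is also hypothetical.

```python
SHORT_READ = {"WGS", "WES"}
SUPPORTED = SHORT_READ | {"PACBIO", "ONT"}


def select_model(technology: str, ffpe: bool, has_normal: bool) -> str:
    """Map sample properties to an illustrative model label."""
    tech = technology.upper()
    if tech not in SUPPORTED:
        raise ValueError(f"Unsupported technology: {technology}")
    if ffpe and tech not in SHORT_READ:
        # Assumption: FFPE-aware models are only sketched for short-read data.
        raise ValueError("FFPE models are only sketched here for WGS/WES")
    name = f"FFPE_{tech}" if ffpe else tech
    if not has_normal:
        name += "_TUMOR_ONLY"
    return name


print(select_model("WGS", ffpe=False, has_normal=True))
print(select_model("WES", ffpe=True, has_normal=False))
```

Raising an error for unsupported combinations, rather than falling back to a default model, guards against the first mistake listed above: silently running the wrong model on a sample.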

Validation Strategy

Three-tier validation:

  1. In silico (computational):
     - Compare to SEQC2 truth set
     - Check recall/precision metrics

  2. Orthogonal platform:
     - Sanger sequencing (gold standard)
     - ddPCR for VAF validation

  3. Functional validation:
     - Protein expression (IHC)
     - Pathway activation assays
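The in silico tier comes down to counting true positives, false positives, and false negatives against a truth set. The sketch below does this with exact matching on (chrom, pos, ref, alt) keys. The variants are made up, the function name is hypothetical, and real benchmarking tools also reconcile different representations of the same variant, which this example does not attempt.

```python
def benchmark(calls, truth):
    """Compare a call set to a truth set by exact variant-key match."""
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)   # called and in the truth set
    fp = len(calls - truth)   # called but not in the truth set
    fn = len(truth - calls)   # in the truth set but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Toy truth and call sets (hypothetical variants).
truth = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 40, "CT", "C"),
    ("chr2", 90, "T", "G"),
}
calls = {
    ("chr1", 100, "A", "T"),
    ("chr1", 250, "G", "C"),
    ("chr2", 40, "CT", "C"),
    ("chr3", 7, "A", "G"),
}
metrics = benchmark(calls, truth)
print(metrics)
```

Restricting both sets to the truth set's high-confidence regions before counting keeps calls outside the benchmarked regions from being scored as false positives.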

Learn More

References

Key Publications

  1. DeepSomatic preprint (bioRxiv, 2024)
    https://doi.org/10.1101/2024.08.16.608331

  2. DeepVariant (Nature Biotechnology, 2018)
    Poplin et al. "A universal SNP and small-indel variant caller using deep neural networks"

  3. SEQC2 truth sets (Nature Biotechnology, 2021)
    Fang et al. "Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing"