DeepSomatic Research
DeepSomatic: The Science Behind Omics807
Learn about Google's DeepSomatic, the deep learning technology that powers Omics807's variant calling.
What is DeepSomatic?
DeepSomatic is a deep learning-based somatic variant caller developed by Google Health. It identifies cancer-specific mutations by distinguishing true variants from sequencing errors and germline polymorphisms.
Key Innovation
Traditional Approach: Hand-crafted statistical models with fixed rules
DeepSomatic Approach: Neural networks trained on millions of validated examples
Result: Higher accuracy, especially for challenging variants like indels and low-frequency mutations
How DeepSomatic Extends DeepVariant
DeepVariant Foundation
DeepVariant (2016) revolutionized germline variant calling using convolutional neural networks (CNNs):
- Treats variant calling as image classification
- Creates "pileup images" from aligned reads
- Uses the Inception-v3 architecture for classification
Somatic Challenges
Somatic variant calling adds complexity:
- Tumor heterogeneity: Multiple cell populations with different mutations
- Low variant allele frequency (VAF): Variants in small tumor subclones
- Normal contamination: Mixed tumor/normal tissue
- Artifacts: FFPE damage, PCR errors, alignment issues
DeepSomatic Innovations
1. Paired Sample Analysis
   - Processes tumor and normal simultaneously
   - Learns patterns distinguishing somatic from germline
2. Enhanced Training Data
   - SEQC2 consortium truth sets
   - Multiple sequencing technologies
   - Diverse tumor types and purities
3. Specialized Models
   - Technology-specific models (WGS, WES, PacBio, ONT)
   - FFPE-aware models for degraded DNA
   - Tumor-only mode with panel of normals
4. Three-Channel Architecture
   - Read base (sequence information)
   - Base quality (confidence scores)
   - Mapping quality (alignment confidence)
Research Publication
Preprint Details
Title: "DeepSomatic: Accurate somatic small variant discovery for multiple sequencing technologies"
Authors: Google Health AI team
Publication: bioRxiv preprint (2024)
DOI: 10.1101/2024.08.16.608331
Key Findings:
- Outperforms existing callers (MuTect2, Strelka2) on indels
- Superior performance on FFPE samples
- Maintains high precision across diverse datasets
- Generalizes well to unseen sequencing platforms
Performance Highlights
WGS (Illumina):
- SNV: 95.1% recall, 98.9% precision
- Indel: 93.0% recall, 84.8% precision
WES (Illumina):
- SNV: 94.4% recall, 99.1% precision
- Indel: 89.6% recall, 93.5% precision
FFPE WGS:
- SNV: 81.7% recall, 94.5% precision
- Handles C→T artifacts effectively
See Model Guide for complete metrics.
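Recall and precision can be folded into a single F1 score (their harmonic mean) to compare the rows above at a glance. The following is a minimal sketch using only the figures quoted in this section; the `f1_score` helper and the `metrics` dictionary are names introduced here for illustration and are not part of DeepSomatic.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision, both given as fractions in [0, 1]."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) pairs copied from the performance highlights above.
metrics = {
    ("WGS", "SNV"): (0.951, 0.989),
    ("WGS", "Indel"): (0.930, 0.848),
    ("WES", "SNV"): (0.944, 0.991),
    ("WES", "Indel"): (0.896, 0.935),
    ("FFPE WGS", "SNV"): (0.817, 0.945),
}

f1_by_row = {key: f1_score(r, p) for key, (r, p) in metrics.items()}

if __name__ == "__main__":
    for (assay, variant_type), f1 in f1_by_row.items():
        print(f"{assay:9s} {variant_type:6s} F1 = {f1:.3f}")
```

Because F1 is a harmonic mean, it is pulled toward the weaker of the two numbers, which is why the WGS indel row scores noticeably lower than the WGS SNV row.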
The 3-Stage Pipeline
DeepSomatic consists of three sequential stages:
Stage 1: make_examples
Purpose: Create pileup images from aligned reads
Process:
1. Candidate identification
   - Scan BAM files for potential variants
   - Compare tumor vs normal coverage
   - Filter obvious germline/artifacts
2. Pileup generation
   - Create image for each candidate
   - Encode reads as pixel rows
   - Color-code bases, qualities, mapping info
3. Example creation
   - Generate TensorFlow examples
   - Include both tumor and normal pileups
   - Add metadata (position, ref/alt alleles)
Output: TFRecord files with pileup images
Runtime: 30-40% of total time
Parallelization: Highly parallelizable by genomic region
Stage 2: call_variants
Purpose: Classify variants using deep neural network
Process:
1. Load model
   - Technology-specific CNN (WGS, WES, etc.)
   - Pre-trained on millions of examples
2. Inference
   - Process pileup images through network
   - Output: Genotype probabilities
   - For each candidate: P(0/0), P(0/1), P(1/1)
3. Quality assignment
   - Convert probabilities to Phred scores
   - Generate genotype quality (GQ)
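The quality assignment step can be illustrated with the standard Phred formula, Q = -10 * log10(p_error). The sketch below is an assumption about how such a conversion is typically done, not DeepSomatic's actual code; the function names and the cap of 99 are chosen here for illustration.

```python
import math


def phred(error_probability: float, cap: float = 99.0) -> float:
    """Convert an error probability to a Phred-scaled score, capped as p approaches 0."""
    if error_probability <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(error_probability))


def genotype_quality(probabilities: dict) -> tuple:
    """Return the most likely class and a Phred-scaled quality for it.

    The quality is based on the probability that the chosen class is wrong,
    i.e. one minus the probability of the best class.
    """
    best = max(probabilities, key=probabilities.get)
    error = 1.0 - probabilities[best]
    return best, int(round(phred(error)))


# Hypothetical network output for one candidate.
call, gq = genotype_quality({"0/0": 0.001, "0/1": 0.989, "1/1": 0.010})

if __name__ == "__main__":
    print(call, gq)
```

The cap avoids an infinite score when the network assigns a probability of exactly 1.0 to one class.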
Output: CallVariantsOutput (intermediate format)
Runtime: 50-60% of total time
GPU acceleration: Can reduce this stage by 50-70%
Stage 3: postprocess_variants
Purpose: Convert to VCF and apply filtering
Process:
1. VCF generation
   - Convert predictions to VCF format
   - Add INFO/FORMAT fields
   - Calculate additional metrics (VAF, DP)
2. Quality filtering
   - Apply QUAL thresholds
   - Mark germline variants (GERMLINE filter)
   - Flag low-quality calls
3. Variant merging
   - Combine multi-allelic sites
   - Left-align indels
   - Normalize representations
4. gVCF creation (optional)
   - Include reference calls
   - Useful for joint calling
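The metric and filtering steps above can be sketched in a few lines. This is an illustrative example only: the `Call` record, the `LowQual` label, and the QUAL threshold of 10 are assumptions made here, while the `GERMLINE` filter name comes from the list above.

```python
from dataclasses import dataclass


@dataclass
class Call:
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float
    alt_depth: int  # reads supporting the alternate allele
    depth: int      # total reads at the site (DP)
    predicted_germline: bool = False


def vaf(call: Call) -> float:
    """Variant allele frequency: alt-supporting reads divided by total depth."""
    return call.alt_depth / call.depth if call.depth else 0.0


def assign_filter(call: Call, min_qual: float = 10.0) -> str:
    """Return a FILTER value; threshold and 'LowQual' label are hypothetical."""
    if call.predicted_germline:
        return "GERMLINE"
    if call.qual < min_qual:
        return "LowQual"
    return "PASS"


example = Call("chr1", 100, "A", "T", qual=35.0, alt_depth=12, depth=80)

if __name__ == "__main__":
    print(f"VAF={vaf(example):.2f} FILTER={assign_filter(example)}")
```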
Output: Final VCF/gVCF files
Runtime: 5-10% of total time
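The stage shares quoted above give a rough bound on what GPU acceleration can do for the whole pipeline, since only call_variants gets faster. The sketch below is simple arithmetic on those quoted ranges; `remaining_fraction` is a helper name introduced here.

```python
def remaining_fraction(stage_share: float, stage_reduction: float) -> float:
    """Fraction of the original wall time left after speeding up one stage.

    stage_share: share of total runtime spent in the stage (0-1).
    stage_reduction: fractional reduction of that stage's runtime (0-1).
    """
    return 1.0 - stage_share * stage_reduction


# call_variants takes 50-60% of runtime; a GPU is quoted as cutting it by 50-70%.
best_case = remaining_fraction(0.60, 0.70)
worst_case = remaining_fraction(0.50, 0.50)

if __name__ == "__main__":
    print(f"Total runtime with GPU: {best_case:.0%} to {worst_case:.0%} of CPU-only")
```

Under these assumptions the whole pipeline shrinks by roughly a quarter to two fifths, because make_examples and postprocess_variants are unaffected.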
Deep Learning Architecture
Pileup Image Encoding
DeepSomatic encodes aligned reads as RGB images:
Image Dimensions:
- Width: 221 pixels (genomic window)
- Height: Variable (max reads, typically 100)
- Channels: 3 (RGB)
Encoding Scheme:
Channel 1 (Red): Read bases
A = 250 (red)
C = 180 (orange)
G = 100 (yellow)
T = 30 (blue)
N = 0 (black)
Channel 2 (Green): Base quality
Quality 0-60 → Grayscale 0-255
Higher quality = Brighter
Channel 3 (Blue): Mapping quality
MAPQ 0-60 → Grayscale 0-255
Higher MAPQ = Brighter
Special Encodings:
- Deletions: Gap in image
- Insertions: Wider pixel
- Strand: Top half (forward), Bottom half (reverse)
Example Pileup:
Reference: A C G T A C G T
Read 1: A C G T A C G T (MAPQ=60, high quality)
Read 2: A C - T A C G T (deletion at pos 3)
Read 3: A C G T A T G T (C→T variant at pos 6)
Read 4: A C G T A C G T (low quality at pos 5)
→ Encoded as 221×4 RGB image
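The encoding scheme can be sketched for the four example reads. This is a simplified illustration of the values listed above, not DeepSomatic's implementation: it covers only the 8 example columns rather than a 221-pixel window, assumes a base quality of 40 unless stated, and represents a deletion gap as an all-zero pixel.

```python
# Channel 1 intensities for each base, taken from the encoding scheme above.
BASE_INTENSITY = {"A": 250, "C": 180, "G": 100, "T": 30, "N": 0}
MAX_QUALITY = 60  # base quality and MAPQ are both clipped to the 0-60 range


def scale_quality(value: int) -> int:
    """Map a quality in 0-60 linearly onto 0-255."""
    clipped = max(0, min(MAX_QUALITY, value))
    return round(clipped * 255 / MAX_QUALITY)


def encode_read(bases: str, base_quals: list, mapq: int) -> list:
    """Return one pixel row as a list of (base, base quality, MAPQ) triples."""
    row = []
    for base, qual in zip(bases, base_quals):
        if base == "-":
            row.append((0, 0, 0))  # assumption: a deletion gap is an empty pixel
        else:
            row.append((BASE_INTENSITY.get(base, 0),
                        scale_quality(qual),
                        scale_quality(mapq)))
    return row


reads = [
    ("ACGTACGT", [40] * 8, 60),                          # Read 1
    ("AC-TACGT", [40] * 8, 60),                          # Read 2: deletion at pos 3
    ("ACGTATGT", [40] * 8, 60),                          # Read 3: C→T at pos 6
    ("ACGTACGT", [40, 40, 40, 40, 5, 40, 40, 40], 60),   # Read 4: low quality at pos 5
]
image = [encode_read(*read) for read in reads]

if __name__ == "__main__":
    for row in image:
        print(row)
```

Each row of `image` corresponds to one read and each triple to one pixel, so the variant in Read 3 shows up as a different channel 1 value in column 6 and the low-quality base in Read 4 as a dimmer channel 2 value in column 5.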
Neural Network Architecture
Base Model: Inception-v3 (modified)
Modifications for Somatic Calling:
1. Dual-input branch
   - Tumor pileup branch
   - Normal pileup branch
   - Shared feature extraction
2. Attention mechanism
   - Focus on discordant regions
   - Weight tumor-specific patterns
3. Output layer
   - 3-class classifier: 0/0, 0/1, 1/1
   - Softmax activation
   - Calibrated probabilities
Training:
- Loss: Cross-entropy with class weighting
- Optimizer: Adam with learning rate scheduling
- Regularization: Dropout, batch normalization
- Data augmentation: Read shuffling, quality perturbation
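Class-weighted cross-entropy scales the loss of each example by a weight for its true class, so that rare classes contribute more to training. The following is a minimal pure-Python sketch of the idea for a 3-class softmax output; the weights shown are hypothetical and are not DeepSomatic's actual values.

```python
import math


def softmax(logits: list) -> list:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(probs: list, true_class: int, class_weights: list) -> float:
    """Cross-entropy for one example, scaled by the weight of its true class."""
    p = max(probs[true_class], 1e-12)  # guard against log(0)
    return -class_weights[true_class] * math.log(p)


# Hypothetical weights: up-weight the two non-reference classes.
CLASS_WEIGHTS = [1.0, 5.0, 5.0]

probs = softmax([2.0, 1.0, 0.1])
loss_if_class0 = weighted_cross_entropy(probs, 0, CLASS_WEIGHTS)
loss_if_class1 = weighted_cross_entropy(probs, 1, CLASS_WEIGHTS)

if __name__ == "__main__":
    print(f"{loss_if_class0:.3f} {loss_if_class1:.3f}")
```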
Model Training
Training Data Sources:
1. SEQC2 Consortium
   - HCC1395 truth sets
   - Multiple technologies
   - Validated variants
2. Synthetic Data
   - BAMSurgeon spike-ins
   - Controlled VAF variants
   - Edge case coverage
3. TCGA Samples (for diversity)
   - Multiple cancer types
   - Various tumor purities
   - Real-world complexity
Training Strategy:
1. Pre-training on DeepVariant germline data
2. Fine-tuning on somatic training set
3. Technology-specific refinement
4. Validation on held-out chromosomes
Key Hyperparameters:
- Batch size: 512 examples
- Learning rate: 0.001 (initial)
- Epochs: 100-200
- Validation: Chromosome 1 held out
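Holding out a whole chromosome keeps validation examples from overlapping the genomic regions seen in training. A minimal sketch of such a split follows; the example records and the `chr1` naming are assumptions made for illustration.

```python
def split_by_chromosome(examples: list, held_out: tuple = ("chr1",)) -> tuple:
    """Split examples into (train, validation) by chromosome name."""
    train, validation = [], []
    for example in examples:
        if example["chrom"] in held_out:
            validation.append(example)
        else:
            train.append(example)
    return train, validation


# Hypothetical labeled examples; real ones would also carry pileup images.
examples = [
    {"chrom": "chr1", "pos": 1_000, "label": 1},
    {"chrom": "chr2", "pos": 2_000, "label": 0},
    {"chrom": "chr1", "pos": 3_000, "label": 0},
    {"chrom": "chrX", "pos": 4_000, "label": 1},
]
train_set, validation_set = split_by_chromosome(examples)

if __name__ == "__main__":
    print(len(train_set), "training /", len(validation_set), "validation")
```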
Comparison to Other Methods
DeepSomatic vs MuTect2
MuTect2 (Broad Institute):
- Bayesian statistical model
- Hand-crafted features
- Widely used in TCGA
DeepSomatic Advantages:
- 15-20% better indel recall
- More robust to low VAF (<10%)
- Better FFPE performance
- No manual parameter tuning
MuTect2 Advantages:
- Faster runtime
- Lower memory usage
- More interpretable models
DeepSomatic vs Strelka2
Strelka2 (Illumina):
- Empirical Bayesian model
- Optimized for speed
- Good germline filtering
DeepSomatic Advantages:
- Higher precision (fewer false positives)
- Better complex variant calling
- Multi-platform support
Strelka2 Advantages:
- Faster (2-3× speed)
- Lower resource requirements
- Well-established in clinical use
Ensemble Approaches
Recommended Strategy:
- Run DeepSomatic + MuTect2
- Take consensus (intersection)
- Rescue high-confidence unique calls
- Best precision/recall balance
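The consensus-plus-rescue strategy can be sketched with set operations on normalized variant keys. The sketch assumes both call sets have already been left-aligned and normalized so that identical variants share the same (chrom, pos, ref, alt) key; the rescue threshold of QUAL 30 is hypothetical, and since QUAL scales differ between callers, a real pipeline would likely use a per-caller threshold.

```python
def variant_key(call: dict) -> tuple:
    """Identity of a variant: chromosome, position, reference and alternate allele."""
    return (call["chrom"], call["pos"], call["ref"], call["alt"])


def ensemble(deepsomatic_calls: list, mutect2_calls: list,
             rescue_min_qual: float = 30.0) -> list:
    """Return keys called by both callers, plus confident calls unique to one."""
    ds = {variant_key(c): c for c in deepsomatic_calls}
    m2 = {variant_key(c): c for c in mutect2_calls}
    consensus = set(ds) & set(m2)
    rescued = set()
    for key in set(ds) ^ set(m2):  # calls made by exactly one caller
        source = ds if key in ds else m2
        if source[key]["qual"] >= rescue_min_qual:
            rescued.add(key)
    return sorted(consensus | rescued)


# Hypothetical call sets.
ds_calls = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T", "qual": 50.0},
    {"chrom": "chr1", "pos": 200, "ref": "G", "alt": "C", "qual": 12.0},
    {"chrom": "chr2", "pos": 300, "ref": "C", "alt": "CT", "qual": 45.0},
]
m2_calls = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "T", "qual": 40.0},
    {"chrom": "chr3", "pos": 400, "ref": "T", "alt": "G", "qual": 5.0},
]
final_calls = ensemble(ds_calls, m2_calls)

if __name__ == "__main__":
    print(final_calls)
```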
Technical Requirements
Computational Resources
Minimum:
- CPU: 8 cores
- RAM: 32GB
- Storage: 500GB
- Time: ~6 hours (WGS)
Recommended:
- CPU: 32+ cores
- RAM: 128GB
- Storage: 2TB SSD
- GPU: Optional (Nvidia T4 or better)
Optimal (Google's setup):
- CPU: 96 cores (n2-standard-96)
- RAM: 384GB
- GPU: Nvidia A100 (for call_variants)
- Time: ~3 hours (WGS)
Scaling Strategies
Parallelization:
# make_examples: Split by region
--regions=chr1,chr2,chr3
--num_shards=32
# call_variants: Batch processing
--batch_size=512
# postprocess_variants: Single-threaded (fast)
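Splitting by region amounts to cutting each contig into intervals and distributing them across workers. The sketch below shows one simple way to do this with round-robin assignment; it is an illustration of the idea rather than how make_examples shards internally, and the contig lengths are toy values.

```python
def make_intervals(contig_lengths: dict, window: int) -> list:
    """Cut each contig into 1-based, inclusive intervals of at most `window` bases."""
    intervals = []
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, window):
            end = min(start + window - 1, length)
            intervals.append(f"{contig}:{start}-{end}")
    return intervals


def assign_to_shards(intervals: list, num_shards: int) -> list:
    """Distribute intervals across shards round-robin."""
    shards = [[] for _ in range(num_shards)]
    for index, interval in enumerate(intervals):
        shards[index % num_shards].append(interval)
    return shards


# Toy contig lengths for illustration only.
intervals = make_intervals({"chrA": 2500, "chrB": 1200}, window=1000)
shards = assign_to_shards(intervals, num_shards=2)

if __name__ == "__main__":
    for shard_id, shard in enumerate(shards):
        print(shard_id, shard)
```

Round-robin assignment keeps shard sizes within one interval of each other, which helps when every worker runs for roughly the same time per interval.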
Cloud Deployment:
1. Google Cloud
   - Preemptible VMs (70% cost savings)
   - Cloud Life Sciences API
   - Batch processing
2. AWS
   - Spot instances
   - AWS Batch
   - S3 for data storage
3. Azure
   - Azure Batch
   - Low-priority VMs
   - Blob storage
Open Source and Community
GitHub Repository
URL: https://github.com/google/deepsomatic
Contents:
- Source code (C++, Python)
- Pre-trained models
- Case studies and tutorials
- Issue tracking
License: BSD 3-Clause
Docker Images
Available versions:
# CPU version (default)
google/deepsomatic:1.9.0
# GPU version
google/deepsomatic:1.9.0-gpu
# Specific commits
google/deepsomatic:1.9.0-gpu-d5f8c9e
Image size: ~5GB (includes models)
Community Contributions
Ways to contribute:
- Report bugs and issues
- Submit performance benchmarks
- Add new model types
- Improve documentation
- Share case studies
Future Directions
Planned Improvements
Version 1.10+:
- Multi-sample joint calling
- Improved tumor-only filtering
- Long-read model enhancements
- Real-time analysis support
Research Areas:
- Subclonal reconstruction
- Copy number integration
- Mutational signature calling
- RNA-seq variant detection
Integration Opportunities
Potential add-ons:
- SV calling (DeepSomatic-SV)
- Methylation analysis
- Gene fusion detection
- Tumor purity estimation
Best Practices from Google
Input Data Quality
Pre-processing checklist:
- ✅ Mark duplicates (Picard)
- ✅ Base quality recalibration (GATK)
- ✅ Proper read groups
- ✅ Consistent reference genome
Quality metrics:
- Insert size distribution
- Coverage uniformity
- Contamination check (<5%)
Model Selection
Decision tree:
1. Identify sequencing technology
2. Check for FFPE characteristics
3. Determine if normal available
4. Select appropriate model
Avoid common mistakes:
- ❌ Don't use WGS model on exome
- ❌ Don't ignore FFPE artifacts
- ❌ Don't skip normal when available
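The decision tree can be written as a small function. The returned labels below (such as `FFPE_WGS` or `WES_TUMOR_ONLY`) are illustrative names assembled from the three decisions and should be checked against the model types shipped with the release in use; the restriction of FFPE models to short-read WGS/WES is also an assumption of this sketch.

```python
SUPPORTED_TECHNOLOGIES = {"WGS", "WES", "PACBIO", "ONT"}
FFPE_TECHNOLOGIES = {"WGS", "WES"}  # assumption: FFPE models only for short reads


def select_model(technology: str, ffpe: bool, has_normal: bool) -> str:
    """Walk the decision tree: technology, then FFPE status, then normal availability."""
    technology = technology.upper()
    if technology not in SUPPORTED_TECHNOLOGIES:
        raise ValueError(f"Unsupported technology: {technology}")
    if ffpe and technology not in FFPE_TECHNOLOGIES:
        raise ValueError(f"No FFPE model assumed for {technology}")
    name = f"FFPE_{technology}" if ffpe else technology
    if not has_normal:
        name += "_TUMOR_ONLY"
    return name


if __name__ == "__main__":
    print(select_model("wgs", ffpe=False, has_normal=True))
    print(select_model("WES", ffpe=True, has_normal=False))
```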
Validation Strategy
Three-tier validation:
1. In silico (computational):
   - Compare to SEQC2 truth set
   - Check recall/precision metrics
2. Orthogonal platform:
   - Sanger sequencing (gold standard)
   - ddPCR for VAF validation
3. Functional validation:
   - Protein expression (IHC)
   - Pathway activation assays
Learn More
- Model Guide - Performance metrics
- Understanding Results - Interpret output
- FAQ - Common questions
References
Key Publications
- DeepSomatic preprint (bioRxiv, 2024)
  https://doi.org/10.1101/2024.08.16.608331
- DeepVariant (Nature Biotech 2018)
  Poplin et al. "A universal SNP and small-indel variant caller using deep neural networks"
- SEQC2 Truth Sets (Nature Biotech 2021)
  Fang et al. "Establishing community reference samples"