DeepSomatic Model Guide
Complete guide to all DeepSomatic models with runtime, accuracy metrics, and use case recommendations
Choose the right DeepSomatic model for your sequencing data. This guide covers the 8 primary models with performance metrics, use cases, and recommendations, and lists the additional tumor-only variants.
Model Overview
DeepSomatic provides specialized models for different sequencing technologies and sample types:
| Model | Technology | Mode | Best For |
|---|---|---|---|
| WGS | Illumina | Tumor-Normal | Complete genome analysis |
| WES | Illumina | Tumor-Normal | Exome regions only |
| PACBIO | PacBio HiFi | Tumor-Normal | Long-read WGS |
| ONT_R104 | Nanopore | Tumor-Normal | Ultra-long reads |
| FFPE_WGS | Illumina | Tumor-Normal | FFPE whole genome |
| FFPE_WES | Illumina | Tumor-Normal | FFPE exome |
| WGS_TUMOR_ONLY | Illumina | Tumor-Only | No normal available |
| WES_TUMOR_ONLY | Illumina | Tumor-Only | Exome without normal |
Additional tumor-only models: PACBIO_TUMOR_ONLY, ONT_TUMOR_ONLY, FFPE_WGS_TUMOR_ONLY, FFPE_WES_TUMOR_ONLY
Performance Metrics
All metrics are from the HCC1395 dataset, run on an n2-standard-96 GCP instance (96 vCPUs, 384 GB RAM).
WGS (Whole Genome Sequencing)
Dataset: HCC1395, Normal 50x / Tumor 60x
Runtime (wall time):
make_examples_somatic: 64m 30s
call_variants: 106m 0s
postprocess_variants: 0m 59s
vcf_stats_report: 3m 1s
─────────────────────────────────
Total: ~3h 5m
Accuracy:
- SNV Recall: 95.1% | Precision: 98.9%
- Indel Recall: 93.0% | Precision: 84.8%
- Overall Precision: 98.3%
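Recall and precision are reported separately throughout this guide. When a single number is more convenient for comparing models, one common option is the F1 score (the harmonic mean of the two). F1 is not reported in this guide, so the sketch below is an added convenience and the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# WGS tumor-normal figures quoted above, converted from percentages to fractions.
wgs_snv_f1 = f1_score(precision=0.989, recall=0.951)
wgs_indel_f1 = f1_score(precision=0.848, recall=0.930)

print(f"WGS SNV F1:   {wgs_snv_f1:.3f}")
print(f"WGS indel F1: {wgs_indel_f1:.3f}")
```

For the WGS figures this works out to roughly 0.970 for SNVs and 0.887 for indels.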
Use When:
- Complete genome coverage needed
- Budget allows for WGS
- Looking for structural variants, non-coding mutations
- Research requiring all genomic regions
WES (Whole Exome Sequencing)
Dataset: HCC1395, Normal 140x / Tumor 120x
Runtime (wall time):
make_examples_somatic: 8m 17s
call_variants: 2m 37s
postprocess_variants: 0m 6s
vcf_stats_report: 0m 7s
─────────────────────────────────
Total: ~15m 33s
Accuracy:
- SNV Recall: 94.4% | Precision: 99.1%
- Indel Recall: 89.6% | Precision: 93.5%
- Overall Precision: 98.9%
Use When:
- Focused on protein-coding regions
- Cost-effective alternative to WGS
- Clinical diagnostics (most pathogenic variants in exons)
- Faster turnaround time needed
PacBio (Long-Read Sequencing)
Dataset: HCC1395, Normal 45x / Tumor 60x
Runtime (wall time):
make_examples_somatic: 198m 23s
call_variants: 120m 50s
postprocess_variants: 1m 36s
vcf_stats_report: 4m 34s
─────────────────────────────────
Total: ~5h 35m
Accuracy:
- SNV Recall: 94.4% | Precision: 96.0%
- Indel Recall: 80.9% | Precision: 80.3%
- Overall Precision: 95.3%
Use When:
- Need to resolve structural variants
- Phasing haplotypes
- Sequencing difficult regions (GC-rich, repeats)
- Studying copy number variations
Note: Lower indel accuracy than Illumina, but superior for large variants.
ONT (Oxford Nanopore)
Dataset: HCC1395, Normal 33x / Tumor 50x
Runtime (wall time):
make_examples_somatic: 121m 24s
call_variants: 167m 50s
postprocess_variants: 3m 13s
vcf_stats_report: 8m 52s
─────────────────────────────────
Total: ~5h 10m
Accuracy:
- SNV Recall: 77.6% | Precision: 97.3%
- Indel Recall: 63.7% | Precision: 82.1%
- Overall Precision: 96.7%
Use When:
- Ultra-long reads needed (>100kb)
- Real-time sequencing required
- Portable sequencing (MinION)
- Methylation detection
Note: Lower base-level accuracy, but unmatched read lengths for structural analysis.
FFPE WGS (Formalin-Fixed Paraffin-Embedded)
Dataset: HCC1395, Normal 50x / Tumor 90x
Runtime (wall time):
make_examples_somatic: 116m 2s
call_variants: 252m 45s
postprocess_variants: 2m 10s
vcf_stats_report: 7m 8s
─────────────────────────────────
Total: ~6h 29m
Accuracy:
- SNV Recall: 81.7% | Precision: 94.5%
- Indel Recall: 80.0% | Precision: 79.0%
- Overall Precision: 93.8%
Use When:
- Working with archived clinical samples
- FFPE-induced artifacts present
- Retrospective studies
- No fresh tissue available
Note: Formalin fixation damages DNA, most characteristically through cytosine deamination that appears as C→T artifacts. This model is trained to handle these specific errors.
FFPE WES (FFPE Exome)
Dataset: HCC1395, Normal 185x / Tumor 190x
Runtime (wall time):
make_examples_somatic: 14m 8s
call_variants: 3m 39s
postprocess_variants: 0m 6s
vcf_stats_report: 0m 9s
─────────────────────────────────
Total: ~30m
Accuracy:
- SNV Recall: 82.5% | Precision: 96.6%
- Indel Recall: 83.3% | Precision: 85.1%
- Overall Precision: 96.0%
Use When:
- FFPE samples + exome sequencing
- Cost-effective FFPE analysis
- Clinical archives
- High coverage possible
WGS Tumor-Only
Dataset: HCC1395, Tumor 60x (no normal)
Runtime (wall time):
make_examples_somatic: 37m 37s
call_variants: 54m 42s
postprocess_variants: 1m 38s
vcf_stats_report: 3m 34s
─────────────────────────────────
Total: ~1h 48m
Accuracy:
- SNV Recall: 90.4% | Precision: 79.9% ⚠️
- Indel Recall: 84.9% | Precision: 36.4% ⚠️
- Overall Precision: 76.5% ⚠️
Use When:
- No matched normal available
- Budget constraints
- Rapid screening needed
- Using Panel of Normals (PoN) for filtering
Warning: Higher false positive rate. Requires additional filtering with PoN.
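To make the warning concrete, the precision figures above can be turned into a false discovery rate (the fraction of reported calls that are wrong) and a count of false calls per true call. This is plain arithmetic on the numbers quoted in this section; the function names are illustrative.

```python
def false_discovery_rate(precision: float) -> float:
    """Fraction of reported calls that are false positives."""
    return 1.0 - precision


def false_calls_per_true_call(precision: float) -> float:
    """Expected number of false positives for every true positive reported."""
    if precision <= 0:
        raise ValueError("precision must be greater than 0")
    return (1.0 - precision) / precision


# WGS tumor-only precision from this section, as fractions.
tumor_only_precision = {"SNV": 0.799, "Indel": 0.364}

for variant_type, precision in tumor_only_precision.items():
    fdr = false_discovery_rate(precision)
    ratio = false_calls_per_true_call(precision)
    print(f"{variant_type}: FDR {fdr:.1%}, about {ratio:.2f} false calls per true call")
```

At 36.4% precision, roughly 1.75 false indel calls are reported for every true one before any PoN filtering, compared with about 0.25 for SNVs.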
WES Tumor-Only
Dataset: HCC1395, Tumor 120x (no normal)
Runtime (wall time):
make_examples_somatic: 8m 17s
call_variants: 2m 37s
postprocess_variants: 0m 6s
vcf_stats_report: 0m 7s
─────────────────────────────────
Total: ~11m
Accuracy: Not reported for this model in this guide. Expect precision concerns similar to those of WGS tumor-only.
Use When:
- Exome sequencing without matched normal
- Very fast turnaround needed
- Must use with PoN filtering
Model Selection Decision Tree
Do you have matched normal tissue?
├─ YES → Tumor-Normal Models
│   └─ What sequencing platform?
│       ├─ Illumina
│       │   └─ Is tissue FFPE?
│       │       ├─ YES
│       │       │   ├─ WGS → FFPE_WGS
│       │       │   └─ Exome → FFPE_WES
│       │       └─ NO (Fresh/Frozen)
│       │           ├─ WGS → WGS
│       │           └─ Exome → WES
│       ├─ PacBio HiFi → PACBIO
│       └─ Oxford Nanopore → ONT_R104
└─ NO → Tumor-Only Models
    ├─ Illumina WGS → WGS_TUMOR_ONLY + PoN
    ├─ Illumina WES → WES_TUMOR_ONLY + PoN
    ├─ PacBio → PACBIO_TUMOR_ONLY + PoN
    ├─ ONT → ONT_TUMOR_ONLY + PoN
    └─ FFPE
        ├─ WGS → FFPE_WGS_TUMOR_ONLY + PoN
        └─ WES → FFPE_WES_TUMOR_ONLY + PoN
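The decision tree above can also be written as a small lookup function. This is a minimal sketch: `select_model` and its parameters are hypothetical names, not part of DeepSomatic, and it assumes FFPE models apply only to Illumina data, as the model overview table suggests. The returned strings are the model names used in this guide.

```python
def select_model(platform: str, assay: str = "WGS",
                 ffpe: bool = False, has_normal: bool = True) -> str:
    """Map sample properties to a model name following the decision tree.

    platform: "illumina", "pacbio", or "ont"
    assay:    "WGS" or "WES" (only consulted for Illumina data)
    """
    platform = platform.lower()
    assay = assay.upper()

    if platform == "illumina":
        if assay not in ("WGS", "WES"):
            raise ValueError(f"unsupported assay: {assay}")
        base = f"FFPE_{assay}" if ffpe else assay
        return base if has_normal else f"{base}_TUMOR_ONLY"

    if ffpe:
        # Assumption: this guide lists FFPE models for Illumina data only.
        raise ValueError("no FFPE model listed for long-read platforms")
    if platform == "pacbio":
        return "PACBIO" if has_normal else "PACBIO_TUMOR_ONLY"
    if platform == "ont":
        return "ONT_R104" if has_normal else "ONT_TUMOR_ONLY"
    raise ValueError(f"unsupported platform: {platform}")


print(select_model("illumina", "WES", ffpe=True))    # FFPE_WES
print(select_model("pacbio", has_normal=False))      # PACBIO_TUMOR_ONLY
```

Every tumor-only result still needs PoN filtering afterwards, as the tree indicates.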
Comparing Technologies
Short-Read (Illumina) vs Long-Read
Illumina (WGS/WES)
- ✅ Highest base-level accuracy
- ✅ Well-established protocols
- ✅ Lower cost per base
- ❌ Poor in repetitive regions
- ❌ Cannot resolve large structural variants
PacBio HiFi
- ✅ Long reads (10-25kb)
- ✅ High accuracy (>99%)
- ✅ Excellent for structural variants
- ❌ Higher cost
- ❌ Longer runtime
Oxford Nanopore (ONT)
- ✅ Ultra-long reads (>100kb)
- ✅ Real-time sequencing
- ✅ Portable devices
- ❌ Lower base accuracy (higher per-read error rate)
Tumor-Normal vs Tumor-Only
Tumor-Normal (Paired)
- ✅ Can distinguish somatic from germline
- ✅ Higher precision (fewer false positives)
- ✅ No PoN required
- ❌ Requires matched normal sample
- ❌ More expensive (2× sequencing)
Tumor-Only
- ✅ No normal sample needed
- ✅ Faster, cheaper
- ✅ Works for archival samples
- ❌ Cannot distinguish somatic/germline
- ❌ Requires Panel of Normals
- ❌ Lower precision
Runtime Optimization
Speed vs Accuracy Trade-offs
Fastest Models (about 30 min or less):
1. WES_TUMOR_ONLY: ~11m
2. WES: ~15m
3. FFPE_WES: ~30m
Slowest Models (> 5 hours):
1. FFPE_WGS: ~6h 29m
2. PACBIO: ~5h 35m
3. ONT_R104: ~5h 10m
Scaling Options
Multi-threading:
--num_shards=32 # Use 32 CPU cores
GPU Acceleration:
- Use -gpu Docker image
- Only speeds up call_variants step
- Can reduce call_variants time by 50-70%
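As a rough illustration of what a 50-70% reduction in call_variants would mean, the sketch below applies that range to the WGS stage times reported earlier in this guide. It is an estimate from the guide's own numbers, not a benchmark. Note that the itemized stages sum to about 2h 54m, slightly under the ~3h 5m total quoted for WGS.

```python
# WGS tumor-normal stage wall times from this guide, in minutes.
wgs_stages_min = {
    "make_examples_somatic": 64 + 30 / 60,
    "call_variants": 106.0,
    "postprocess_variants": 59 / 60,
    "vcf_stats_report": 3 + 1 / 60,
}


def estimate_with_gpu(stages: dict, reduction: float) -> float:
    """Itemized runtime in minutes if only call_variants shrinks by `reduction`."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    adjusted = dict(stages)
    adjusted["call_variants"] *= 1.0 - reduction
    return sum(adjusted.values())


cpu_total = sum(wgs_stages_min.values())
call_variants_share = wgs_stages_min["call_variants"] / cpu_total

print(f"CPU-only itemized total: {cpu_total:.1f} min")
print(f"call_variants share:     {call_variants_share:.0%}")
for reduction in (0.5, 0.7):
    estimate = estimate_with_gpu(wgs_stages_min, reduction)
    print(f"{reduction:.0%} faster call_variants: {estimate:.1f} min")
```

Because call_variants is about 61% of the itemized WGS runtime, the end-to-end saving is smaller than the per-stage figure: roughly 174 minutes drops to somewhere between 100 and 122 minutes under these assumptions.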
Regional Analysis:
--regions=chr1:1000000-2000000 # Analyze specific region
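When runs are launched from a script, the two flags above can be generated rather than hard-coded. The helpers below are a hypothetical sketch: `shard_flag` and `region_flag` are illustrative names, and the output simply follows the `--num_shards=N` and `--regions=chrom:start-end` forms shown in this section.

```python
import os


def shard_flag(num_shards=None) -> str:
    """Build --num_shards, defaulting to the number of CPUs on this machine."""
    if num_shards is None:
        num_shards = os.cpu_count() or 1
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return f"--num_shards={num_shards}"


def region_flag(chrom: str, start: int, end: int) -> str:
    """Build --regions for one interval, validating the coordinates first."""
    if not chrom:
        raise ValueError("chrom must not be empty")
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return f"--regions={chrom}:{start}-{end}"


print(shard_flag(32))
print(region_flag("chr1", 1_000_000, 2_000_000))
```

Restricting a first run to a small region is a cheap way to confirm that inputs and model choice are correct before committing to a multi-hour whole-genome run.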
Accuracy Considerations
SNV vs Indel Performance
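SNVs (Single Nucleotide Variants):
- Generally the most reliable variant type
- Recall above 90% for the Illumina fresh/frozen, PacBio, and WGS tumor-only models; 77.6-82.5% for the ONT and FFPE models
- Precision above 94% for every tumor-normal model (98.9-99.1% for Illumina WGS/WES); 79.9% for WGS tumor-only
Indels (Insertions/Deletions):
- More challenging to call than SNVs
- Illumina fresh/frozen tumor-normal models have the highest recall (93.0% for WGS, 89.6% for WES)
- Long-read models have lower indel accuracy (recall 80.9% for PacBio, 63.7% for ONT) but detect larger events
- FFPE models handle degradation artifacts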
Coverage Requirements
Recommended minimum coverage:
| Sample Type | Minimum | Recommended |
|---|---|---|
| Normal WGS | 30x | 50x |
| Tumor WGS | 50x | 80-100x |
| Normal WES | 80x | 120x |
| Tumor WES | 100x | 150x |
| Tumor-only | 60x | 100x+ |
Higher coverage improves:
- Low VAF variant detection
- Indel accuracy
- Confident genotyping
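The coverage table can be encoded as a simple pre-flight check. This is a sketch with hypothetical names (`COVERAGE_GUIDELINES`, `coverage_status`); the ranges "80-100x" and "100x+" are represented by their lower bounds, which is a simplifying assumption.

```python
# (minimum, recommended) mean depth per sample type, from the table above.
COVERAGE_GUIDELINES = {
    "normal_wgs": (30, 50),
    "tumor_wgs": (50, 80),     # recommended range is 80-100x; lower bound used
    "normal_wes": (80, 120),
    "tumor_wes": (100, 150),
    "tumor_only": (60, 100),   # recommended is 100x+; lower bound used
}


def coverage_status(sample_type: str, mean_depth: float) -> str:
    """Classify a sample's mean depth against the guideline for its type."""
    minimum, recommended = COVERAGE_GUIDELINES[sample_type]
    if mean_depth < minimum:
        return "below minimum"
    if mean_depth < recommended:
        return "acceptable"
    return "recommended"


print(coverage_status("tumor_wgs", 60))    # acceptable
print(coverage_status("normal_wgs", 50))   # recommended
print(coverage_status("tumor_wes", 90))    # below minimum
```

The mean depth itself has to come from a separate coverage tool; this check only compares the number against the table.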
Best Practices
Model Selection
- Match model to sequencing technology - Don't use WGS model on PacBio data
- Use FFPE models for FFPE samples - Critical for accuracy
- Prefer tumor-normal when possible - Higher precision
- Consider runtime vs accuracy - WES is 12× faster than WGS
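The 12× figure follows from the total runtimes reported in the Performance Metrics section (~3h 5m for WGS, ~15m 33s for WES). A quick check, with an illustrative helper name:

```python
def to_minutes(hours: float = 0, minutes: float = 0, seconds: float = 0) -> float:
    """Convert a wall time to minutes."""
    return hours * 60 + minutes + seconds / 60


wgs_total_min = to_minutes(hours=3, minutes=5)       # ~3h 5m
wes_total_min = to_minutes(minutes=15, seconds=33)   # ~15m 33s

speedup = wgs_total_min / wes_total_min
print(f"WES is about {speedup:.1f}x faster than WGS")
```

The ratio reflects these particular HCC1395 runs at their stated coverages, so it should be read as an approximate guide.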
Quality Control
Before running DeepSomatic:
- ✅ Check BAM file quality (samtools flagstat)
- ✅ Verify sufficient coverage
- ✅ Ensure proper reference genome (GRCh38)
- ✅ Confirm BAM index exists (.bai)
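Two of these checks are easy to script. The sketch below looks for a `.bai` index next to the BAM under either common naming (`sample.bam.bai` or `sample.bai`) and builds the `samtools flagstat` command without running it. The function names are hypothetical, and actually running the command assumes samtools is installed and on the PATH.

```python
from pathlib import Path


def find_bam_index(bam_path):
    """Return the path of an existing .bai index for the BAM, or None."""
    bam = Path(bam_path)
    candidates = [Path(str(bam) + ".bai"), bam.with_suffix(".bai")]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def flagstat_command(bam_path):
    """Argument list for samtools flagstat, suitable for subprocess.run."""
    return ["samtools", "flagstat", str(bam_path)]


if __name__ == "__main__":
    bam = "tumor.bam"  # hypothetical file name
    index = find_bam_index(bam)
    print("index:", index if index else "missing, run `samtools index` first")
    print("QC command:", " ".join(flagstat_command(bam)))
```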
Post-Processing
After variant calling:
- Filter by QUAL score (>30 recommended)
- Remove non-PASS variants for high confidence
- For tumor-only: apply PoN filtering
- Annotate with functional databases (COSMIC, ClinVar)
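The first two filtering steps can be sketched in plain Python on an uncompressed VCF, using the standard column order (QUAL is the sixth column, FILTER the seventh). The function names and the example records are made up for illustration; for real data a dedicated VCF tool is the more robust choice.

```python
def keep_record(vcf_line: str, min_qual: float = 30.0) -> bool:
    """True if a VCF data line has FILTER == PASS and QUAL above min_qual."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return False  # malformed line: fewer than the 8 fixed VCF columns
    qual, filter_value = fields[5], fields[6]
    if filter_value != "PASS" or qual == ".":
        return False
    return float(qual) > min_qual


def filter_vcf_lines(lines, min_qual: float = 30.0):
    """Yield header lines unchanged and only the data lines that pass."""
    for line in lines:
        if line.startswith("#") or keep_record(line, min_qual):
            yield line


# Made-up records for illustration only.
example_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tT\t45.0\tPASS\t.\n",
    "chr1\t200\t.\tG\tC\t12.5\tPASS\t.\n",
    "chr1\t300\t.\tT\tG\t60.0\tFAIL_EXAMPLE\t.\n",
]

kept = list(filter_vcf_lines(example_lines))
print("".join(kept), end="")
```

Only the record at position 100 survives: the second fails the QUAL threshold and the third is not PASS. PoN filtering and annotation are separate steps not covered by this sketch.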
Model Updates
DeepSomatic models are continuously improved:
- Current version: 1.9.0
- Training data: Millions of validated variants
- Regular updates: Check DeepSomatic releases
Learn More
- Understanding Results - Interpret VCF output
- DeepSomatic Research - Technical details
- FAQ - Common questions