DeepSomatic Model Guide

Choose the right DeepSomatic model for your sequencing data. This guide covers the 8 primary models with performance metrics, use cases, and recommendations, and lists 4 additional tumor-only models.

Model Overview

DeepSomatic provides specialized models for different sequencing technologies and sample types:

Model            Technology     Mode            Best For
WGS              Illumina       Tumor-Normal    Complete genome analysis
WES              Illumina       Tumor-Normal    Exome regions only
PACBIO           PacBio HiFi    Tumor-Normal    Long-read WGS
ONT_R104         Nanopore       Tumor-Normal    Ultra-long reads
FFPE_WGS         Illumina       Tumor-Normal    FFPE whole genome
FFPE_WES         Illumina       Tumor-Normal    FFPE exome
WGS_TUMOR_ONLY   Illumina       Tumor-Only      No normal available
WES_TUMOR_ONLY   Illumina       Tumor-Only      Exome without normal

Additional tumor-only models: PACBIO_TUMOR_ONLY, ONT_TUMOR_ONLY, FFPE_WGS_TUMOR_ONLY, FFPE_WES_TUMOR_ONLY

Performance Metrics

All metrics are from the HCC1395 dataset, run on an n2-standard-96 GCP instance (96 vCPUs, 384 GB RAM).

WGS (Whole Genome Sequencing)

Dataset: HCC1395, Normal 50x / Tumor 60x

Runtime (wall time):

make_examples_somatic:    64m 30s
call_variants:           106m 0s
postprocess_variants:      0m 59s
vcf_stats_report:          3m 1s
─────────────────────────────────
Total:                   ~3h 5m

Accuracy:

  • SNV Recall: 95.1% | Precision: 98.9%
  • Indel Recall: 93.0% | Precision: 84.8%
  • Overall Precision: 98.3%

Use When:

  • Complete genome coverage needed
  • Budget allows for WGS
  • Looking for structural variants, non-coding mutations
  • Research requiring all genomic regions


WES (Whole Exome Sequencing)

Dataset: HCC1395, Normal 140x / Tumor 120x

Runtime (wall time):

make_examples_somatic:     8m 17s
call_variants:             2m 37s
postprocess_variants:      0m 6s
vcf_stats_report:          0m 7s
─────────────────────────────────
Total:                   ~15m 33s

Accuracy:

  • SNV Recall: 94.4% | Precision: 99.1%
  • Indel Recall: 89.6% | Precision: 93.5%
  • Overall Precision: 98.9%

Use When:

  • Focused on protein-coding regions
  • Cost-effective alternative to WGS
  • Clinical diagnostics (most pathogenic variants in exons)
  • Faster turnaround time needed


PacBio (Long-Read Sequencing)

Dataset: HCC1395, Normal 45x / Tumor 60x

Runtime (wall time):

make_examples_somatic:   198m 23s
call_variants:           120m 50s
postprocess_variants:      1m 36s
vcf_stats_report:          4m 34s
─────────────────────────────────
Total:                   ~5h 35m

Accuracy:

  • SNV Recall: 94.4% | Precision: 96.0%
  • Indel Recall: 80.9% | Precision: 80.3%
  • Overall Precision: 95.3%

Use When:

  • Need to resolve structural variants
  • Phasing haplotypes
  • Sequencing difficult regions (GC-rich, repeats)
  • Studying copy number variations

Note: Lower indel accuracy than Illumina, but superior for large variants.


ONT (Oxford Nanopore)

Dataset: HCC1395, Normal 33x / Tumor 50x

Runtime (wall time):

make_examples_somatic:   121m 24s
call_variants:           167m 50s
postprocess_variants:      3m 13s
vcf_stats_report:          8m 52s
─────────────────────────────────
Total:                   ~5h 10m

Accuracy:

  • SNV Recall: 77.6% | Precision: 97.3%
  • Indel Recall: 63.7% | Precision: 82.1%
  • Overall Precision: 96.7%

Use When:

  • Ultra-long reads needed (>100kb)
  • Real-time sequencing required
  • Portable sequencing (MinION)
  • Methylation detection

Note: Lower base-level accuracy, but unmatched read lengths for structural analysis.


FFPE WGS (Formalin-Fixed Paraffin-Embedded)

Dataset: HCC1395, Normal 50x / Tumor 90x

Runtime (wall time):

make_examples_somatic:   116m 2s
call_variants:           252m 45s
postprocess_variants:      2m 10s
vcf_stats_report:          7m 8s
─────────────────────────────────
Total:                   ~6h 29m

Accuracy:

  • SNV Recall: 81.7% | Precision: 94.5%
  • Indel Recall: 80.0% | Precision: 79.0%
  • Overall Precision: 93.8%

Use When:

  • Working with archived clinical samples
  • FFPE-induced artifacts present
  • Retrospective studies
  • No fresh tissue available

Note: FFPE causes DNA damage (C→T artifacts). This model is trained to handle these specific errors.
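As a rough way to see whether a call set is dominated by this damage signature, the share of SNVs that are C>T (or G>A, the same change read from the opposite strand) can be tallied. This is a minimal sketch and not part of DeepSomatic; the function name and the (ref, alt) input format are assumptions made for illustration.

```python
def ct_fraction(snvs):
    """Return the fraction of SNVs that are C>T or G>A.

    `snvs` is an iterable of (ref, alt) single-base pairs (hypothetical
    input format). G>A is counted because it is C>T on the opposite strand.
    """
    damage_like = {("C", "T"), ("G", "A")}
    pairs = [(ref.upper(), alt.upper()) for ref, alt in snvs]
    if not pairs:
        return 0.0
    return sum(pair in damage_like for pair in pairs) / len(pairs)


# Illustrative calls only, not real data.
calls = [("C", "T"), ("G", "A"), ("A", "G"), ("T", "C")]
print(f"C>T-like fraction: {ct_fraction(calls):.2f}")
```

An unusually high fraction in an FFPE sample is a hint to double-check that the FFPE model was used, not proof of artifacts.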


FFPE WES (FFPE Exome)

Dataset: HCC1395, Normal 185x / Tumor 190x

Runtime (wall time):

make_examples_somatic:    14m 8s
call_variants:             3m 39s
postprocess_variants:      0m 6s
vcf_stats_report:          0m 9s
─────────────────────────────────
Total:                   ~30m

Accuracy:

  • SNV Recall: 82.5% | Precision: 96.6%
  • Indel Recall: 83.3% | Precision: 85.1%
  • Overall Precision: 96.0%

Use When:

  • FFPE samples + exome sequencing
  • Cost-effective FFPE analysis
  • Clinical archives
  • High coverage possible


WGS Tumor-Only

Dataset: HCC1395, Tumor 60x (no normal)

Runtime (wall time):

make_examples_somatic:    37m 37s
call_variants:            54m 42s
postprocess_variants:      1m 38s
vcf_stats_report:          3m 34s
─────────────────────────────────
Total:                   ~1h 48m

Accuracy:

  • SNV Recall: 90.4% | Precision: 79.9% ⚠️
  • Indel Recall: 84.9% | Precision: 36.4% ⚠️
  • Overall Precision: 76.5% ⚠️

Use When:

  • No matched normal available
  • Budget constraints
  • Rapid screening needed
  • Using Panel of Normals (PoN) for filtering

Warning: Higher false positive rate. Requires additional filtering with PoN.
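Conceptually, PoN filtering removes tumor-only calls that also appear in a set of variants seen in unrelated normal samples. The sketch below shows that idea on plain tuples; it is an assumption-laden illustration (hypothetical function name, made-up coordinates), not the PoN mechanism any particular pipeline uses.

```python
def filter_with_pon(calls, panel_of_normals):
    """Drop calls whose (chrom, pos, ref, alt) key appears in the PoN set."""
    return [call for call in calls if call not in panel_of_normals]


# Made-up coordinates for illustration only.
pon = {("chr1", 12345, "A", "G")}
tumor_only_calls = [
    ("chr1", 12345, "A", "G"),   # also seen in normals, so removed
    ("chr7", 67890, "T", "G"),   # not in the PoN, so kept
]
kept = filter_with_pon(tumor_only_calls, pon)
print(kept)
```

Keying on the full (chrom, pos, ref, alt) tuple keeps a different alternate allele at the same position from being filtered by accident.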


WES Tumor-Only

Dataset: HCC1395, Tumor 120x (no normal)

Runtime (wall time):

make_examples_somatic:     8m 17s
call_variants:             2m 37s
postprocess_variants:      0m 6s
vcf_stats_report:          0m 7s
─────────────────────────────────
Total:                   ~11m

Accuracy: No HCC1395 accuracy metrics are available for this model. Expect precision concerns similar to WGS tumor-only.

Use When:

  • Exome sequencing without matched normal
  • Very fast turnaround needed
  • Must use with PoN filtering


Model Selection Decision Tree

Do you have matched normal tissue?
├─ YES → Tumor-Normal Models
│  ├─ What sequencing platform?
│  │  ├─ Illumina
│  │  │  ├─ Is tissue FFPE?
│  │  │  │  ├─ YES
│  │  │  │  │  ├─ WGS → FFPE_WGS
│  │  │  │  │  └─ Exome → FFPE_WES
│  │  │  │  └─ NO (Fresh/Frozen)
│  │  │  │     ├─ WGS → WGS
│  │  │  │     └─ Exome → WES
│  │  │  │
│  │  ├─ PacBio HiFi → PACBIO
│  │  └─ Oxford Nanopore → ONT_R104
│  │
└─ NO → Tumor-Only Models
   ├─ Illumina WGS → WGS_TUMOR_ONLY + PoN
   ├─ Illumina WES → WES_TUMOR_ONLY + PoN
   ├─ PacBio → PACBIO_TUMOR_ONLY + PoN
   ├─ ONT → ONT_TUMOR_ONLY + PoN
   └─ FFPE
      ├─ WGS → FFPE_WGS_TUMOR_ONLY + PoN
      └─ WES → FFPE_WES_TUMOR_ONLY + PoN
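The decision tree above can be written as a small lookup function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the returned strings are the model names used in this guide, and combinations the tree does not cover (for example FFPE long-read data) raise an error instead of guessing.

```python
def select_model(platform, assay="WGS", ffpe=False, has_normal=True):
    """Map sample properties to a model name, following the tree above."""
    platform = platform.lower()
    if platform == "illumina":
        assay = assay.upper()
        if assay not in ("WGS", "WES"):
            raise ValueError(f"unknown assay: {assay!r}")
        base = f"FFPE_{assay}" if ffpe else assay
        return base if has_normal else f"{base}_TUMOR_ONLY"
    if ffpe:
        # The tree lists FFPE models for Illumina WGS/WES only.
        raise ValueError("no FFPE model listed for this platform")
    if platform == "pacbio":
        return "PACBIO" if has_normal else "PACBIO_TUMOR_ONLY"
    if platform == "ont":
        return "ONT_R104" if has_normal else "ONT_TUMOR_ONLY"
    raise ValueError(f"unknown platform: {platform!r}")


print(select_model("illumina", "WES", ffpe=True))
print(select_model("ont", has_normal=False))
```

Tumor-only results still need PoN filtering afterwards; the function only picks the model.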

Comparing Technologies

Short-Read (Illumina) vs Long-Read

Illumina (WGS/WES)

  • ✅ Highest base-level accuracy
  • ✅ Well-established protocols
  • ✅ Lower cost per base
  • ❌ Poor in repetitive regions
  • ❌ Cannot resolve large structural variants

PacBio HiFi

  • ✅ Long reads (10-25kb)
  • ✅ High accuracy (>99%)
  • ✅ Excellent for structural variants
  • ❌ Higher cost
  • ❌ Longer runtime

Oxford Nanopore (ONT)

  • ✅ Ultra-long reads (>100kb)
  • ✅ Real-time sequencing
  • ✅ Portable devices
  • ❌ Lower base accuracy
  • ❌ Higher error rate

Tumor-Normal vs Tumor-Only

Tumor-Normal (Paired)

  • ✅ Can distinguish somatic from germline
  • ✅ Higher precision (fewer false positives)
  • ✅ No PoN required
  • ❌ Requires matched normal sample
  • ❌ More expensive (2× sequencing)

Tumor-Only

  • ✅ No normal sample needed
  • ✅ Faster, cheaper
  • ✅ Works for archival samples
  • ❌ Cannot distinguish somatic/germline
  • ❌ Requires Panel of Normals
  • ❌ Lower precision

Runtime Optimization

Speed vs Accuracy Trade-offs

Fastest Models (≤ 30 min):

  1. WES_TUMOR_ONLY: ~11m
  2. WES: ~15m
  3. FFPE_WES: ~30m

Slowest Models (> 5 hours):

  1. FFPE_WGS: ~6h 29m
  2. PACBIO: ~5h 35m
  3. ONT_R104: ~5h 10m
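The ranking can be reproduced from the total wall times reported in each model section. The sketch below parses those totals and sorts them; the helper name and the dictionary layout are illustrative choices, and the durations are copied from this guide.

```python
import re


def to_minutes(text):
    """Convert a duration such as '3h 5m' or '15m 33s' to minutes."""
    units = {"h": 60.0, "m": 1.0, "s": 1.0 / 60.0}
    return sum(float(n) * units[u] for n, u in re.findall(r"(\d+)\s*([hms])", text))


# Total wall times as reported in the model sections of this guide.
TOTALS = {
    "WGS": "3h 5m",
    "WES": "15m 33s",
    "PACBIO": "5h 35m",
    "ONT_R104": "5h 10m",
    "FFPE_WGS": "6h 29m",
    "FFPE_WES": "30m",
    "WGS_TUMOR_ONLY": "1h 48m",
    "WES_TUMOR_ONLY": "11m",
}

minutes = {model: to_minutes(total) for model, total in TOTALS.items()}
ranked = sorted(minutes, key=minutes.get)  # fastest first
speedup = minutes["WGS"] / minutes["WES"]

for model in ranked:
    print(f"{model:<16}{minutes[model]:7.1f} min")
print(f"WES vs WGS speedup: {speedup:.1f}x")
```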

Scaling Options

Multi-threading:

--num_shards=32  # Use 32 CPU cores

GPU Acceleration:

  • Use -gpu Docker image
  • Only speeds up call_variants step
  • Can reduce call_variants time by 50-70%

Regional Analysis:

--regions=chr1:1000000-2000000  # Analyze specific region
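To show where these options sit in a full invocation, the sketch below assembles an argument list without running anything. Only --num_shards and --regions come from this guide; the other flag names, the run_deepsomatic entry point, and the file paths are assumptions here and should be confirmed against the help output of the installed version.

```python
import shlex


def build_deepsomatic_args(model_type, ref, reads_tumor, output_vcf,
                           reads_normal=None, num_shards=32, regions=None):
    """Assemble a command line as a list of strings (nothing is executed)."""
    args = [
        "run_deepsomatic",                  # assumed entry point name
        f"--model_type={model_type}",       # assumed flag
        f"--ref={ref}",                     # assumed flag
        f"--reads_tumor={reads_tumor}",     # assumed flag
        f"--output_vcf={output_vcf}",       # assumed flag
        f"--num_shards={num_shards}",
    ]
    if reads_normal is not None:            # omitted for tumor-only models
        args.append(f"--reads_normal={reads_normal}")  # assumed flag
    if regions:
        args.append(f"--regions={regions}")
    return args


# Hypothetical paths for illustration.
cmd = build_deepsomatic_args(
    "WES", "ref.fa", "tumor.bam", "out.vcf.gz",
    reads_normal="normal.bam", regions="chr1:1000000-2000000",
)
print(shlex.join(cmd))
```

Building the list first makes it easy to review the exact command before passing it to a container or job scheduler.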

Accuracy Considerations

SNV vs Indel Performance

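SNVs (Single Nucleotide Variants):

  • Most reliable variant type
  • Illumina WGS and WES tumor-normal models: ~94-95% recall, ~99% precision
  • Recall is lower for ONT (77.6%) and FFPE (81.7-82.5%)
  • Precision drops to 79.9% for WGS tumor-only

Indels (Insertions/Deletions):

  • More challenging to call
  • Illumina still best (93.0% recall for WGS, 89.6% for WES)
  • Long-read models have lower indel recall (80.9% for PacBio, 63.7% for ONT), though long reads capture larger events
  • FFPE models handle degradation artifacts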
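Recall and precision can be combined into a single F1 score (their harmonic mean) to compare models on one number. The sketch below applies the standard formula to a few of the figures reported in this guide; the dictionary layout is an illustrative choice and the F1 values are derived here, not taken from the benchmark.

```python
def f1_score(recall, precision):
    """Harmonic mean of recall and precision (same units in and out)."""
    return 2 * precision * recall / (precision + recall)


# (recall %, precision %) copied from the model sections of this guide.
METRICS = {
    "WGS": {"SNV": (95.1, 98.9), "Indel": (93.0, 84.8)},
    "WES": {"SNV": (94.4, 99.1), "Indel": (89.6, 93.5)},
    "ONT_R104": {"SNV": (77.6, 97.3), "Indel": (63.7, 82.1)},
    "WGS_TUMOR_ONLY": {"SNV": (90.4, 79.9), "Indel": (84.9, 36.4)},
}

for model, by_type in METRICS.items():
    for variant_type, (recall, precision) in by_type.items():
        print(f"{model:<16}{variant_type:<6} F1 = {f1_score(recall, precision):.1f}%")
```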

Coverage Requirements

Recommended minimum coverage:

Sample Type    Minimum    Recommended
Normal WGS     30x        50x
Tumor WGS      50x        80-100x
Normal WES     80x        120x
Tumor WES      100x       150x
Tumor-only     60x        100x+

Higher coverage improves:

  • Low VAF variant detection
  • Indel accuracy
  • Confident genotyping
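The coverage table can be turned into a quick pre-flight check. The sketch below is illustrative only: the function name and status strings are assumptions, and where the table gives a range or open-ended value (80-100x, 100x+) the lower bound is used.

```python
# sample type: (minimum, recommended) mean depth, from the table above
COVERAGE = {
    "Normal WGS": (30, 50),
    "Tumor WGS": (50, 80),     # table gives 80-100x; lower bound used
    "Normal WES": (80, 120),
    "Tumor WES": (100, 150),
    "Tumor-only": (60, 100),   # table gives 100x+
}


def coverage_status(sample_type, depth):
    """Classify a mean depth against the minimum and recommended values."""
    minimum, recommended = COVERAGE[sample_type]
    if depth < minimum:
        return "below minimum"
    if depth < recommended:
        return "meets minimum"
    return "meets recommended"


print(coverage_status("Tumor WGS", 60))
print(coverage_status("Normal WES", 140))
```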

Best Practices

Model Selection

  1. Match model to sequencing technology - Don't use the WGS model on PacBio data
  2. Use FFPE models for FFPE samples - Critical for accuracy
  3. Prefer tumor-normal when possible - Higher precision
  4. Consider runtime vs accuracy - WES runs roughly 12× faster than WGS (~15m vs ~3h 5m)

Quality Control

Before running DeepSomatic:

  • ✅ Check BAM file quality (samtools flagstat)
  • ✅ Verify sufficient coverage
  • ✅ Ensure proper reference genome (GRCh38)
  • ✅ Confirm BAM index exists (.bai)
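The index check is easy to automate. The sketch below looks for the two common index file names next to a BAM (sample.bam.bai and sample.bai); the function name and the example path are hypothetical, and it only tests that the file exists, not that it is current.

```python
from pathlib import Path


def find_bam_index(bam_path):
    """Return the first existing .bai index for `bam_path`, or None."""
    bam = Path(bam_path)
    candidates = [
        Path(str(bam) + ".bai"),   # sample.bam.bai
        bam.with_suffix(".bai"),   # sample.bai
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


# Hypothetical path for illustration.
index = find_bam_index("tumor.bam")
print(index if index else "No index found; create one with samtools index")
```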

Post-Processing

After variant calling:

  • Filter by QUAL score (>30 recommended)
  • Remove non-PASS variants for high confidence
  • For tumor-only: apply PoN filtering
  • Annotate with functional databases (COSMIC, ClinVar)
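The QUAL and PASS filters can be applied directly to VCF text, since QUAL and FILTER are the sixth and seventh tab-separated columns of each data line. This is a minimal sketch with hypothetical function names and made-up records; for production use a dedicated VCF tool is usually a better fit.

```python
def keep_record(vcf_line, min_qual=30.0):
    """True if a VCF data line has FILTER == PASS and QUAL > min_qual."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual, filt = fields[5], fields[6]
    if filt != "PASS" or qual == ".":
        return False
    return float(qual) > min_qual


def filter_vcf_lines(lines, min_qual=30.0):
    """Keep header lines plus data lines that pass both filters."""
    return [ln for ln in lines if ln.startswith("#") or keep_record(ln, min_qual)]


# Made-up records for illustration only.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t45.2\tPASS\t.",
    "chr1\t2000\t.\tC\tT\t12.0\tPASS\t.",      # QUAL too low
    "chr2\t3000\t.\tG\tA\t50.0\tRefCall\t.",   # not PASS
]
filtered = filter_vcf_lines(example)
print("\n".join(filtered))
```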

Model Updates

DeepSomatic models are continuously improved:

  • Current version: 1.9.0
  • Training data: Millions of validated variants
  • Regular updates: Check DeepSomatic releases

Learn More

References