Frequently Asked Questions

Answers to common questions about using Omics807 and understanding somatic variant calling.

General Questions

What is Omics807?

Omics807 is a cloud-based platform that combines deep learning variant calling with clinical insights and comprehensive multi-omics enrichment. It identifies cancer-specific mutations from sequencing data and provides interpretable results through an intuitive interface.

Key features: - Somatic variant calling for tumor samples with deep learning - Omics807 interpretation with multi-omics enrichment (population genetics, protein structures, drug targets, pathways, clinical evidence) - Support for multiple sequencing technologies - Cloud processing with real-time progress tracking - Interactive visualizations

Who should use Omics807?

Ideal users: - Cancer researchers studying somatic mutations - Bioinformaticians analyzing tumor samples - Clinical genomics labs - Students learning variant calling - Anyone needing accessible genomic analysis

Not recommended for: - Clinical diagnosis (use validated clinical pipelines) - Germline variant calling (use DeepVariant instead) - Production clinical workflows (regulatory considerations)

Is Omics807 free?

Omics807 itself is a free platform, but you need: - Cloud server: your own infrastructure (costs vary by provider) - AI service API: for clinical interpretations and enrichment (~$0.01-0.10 per analysis) - Storage: for BAM files and results

Estimated costs: - Small test (chr1): $1-2 for server time - Whole genome: $10-50 depending on server specs - AI service API: $0.05-0.20 per job

Input Data Questions

What file types does Omics807 accept?

Accepted formats: - BAM files (.bam): Binary Alignment Map - BAI files (.bam.bai): BAM index files, required alongside each BAM - CRAM files: compressed alternative to BAM (supported)

Not accepted: - ❌ FASTQ files (raw reads - need alignment first) - ❌ VCF files (already called variants) - ❌ Images or PDFs - ❌ CSV or text files

Why BAM files?
BAM files contain aligned sequencing reads, which DeepSomatic needs to create pileup images for variant calling.

Do I need both tumor and normal samples?

Recommended: Yes, tumor-normal pairs give best accuracy

Tumor-Normal (Paired): - ✅ Distinguishes somatic from germline - ✅ Higher precision (fewer false positives) - ✅ No Panel of Normals needed - ✅ 95%+ precision typical

Tumor-Only: - ⚠️ Cannot distinguish somatic/germline - ⚠️ Higher false positive rate - ⚠️ Requires Panel of Normals filtering - ⚠️ 70-80% precision typical

When to use tumor-only: - No matched normal available - Archival FFPE samples - Budget constraints - Preliminary screening

What's the difference between WGS and WES?

Feature	WGS	WES
Coverage	Entire genome	Exome only (~2%)
Size	~100-200GB	~10-20GB
Cost	$800-1500	$200-500
Runtime	3-6 hours	15-30 minutes
Variants	All regions	Coding only
Use case	Complete analysis	Targeted, cost-effective

Choose WGS when: - Need non-coding variants - Studying structural variants - Comprehensive analysis required

Choose WES when: - Budget limited - Focus on coding mutations - Faster turnaround needed - Most pathogenic variants in exons

Can I upload files from my computer?

Yes, but with limitations:

File upload: - Maximum: 10GB per file (configurable) - Recommended: <5GB for reasonable upload times - Network speed: Critical factor
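To see why network speed is the critical factor, the transfer time can be estimated with simple arithmetic. The sketch below assumes a sustained link speed and ignores protocol overhead; the function name and the 100 Mbps figure are illustrative and not part of Omics807.

```python
def upload_hours(file_size_gb: float, link_mbps: float) -> float:
    """Estimate transfer time in hours for a file of file_size_gb (decimal
    gigabytes) over a link sustaining link_mbps megabits per second."""
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    bits = file_size_gb * 1e9 * 8        # gigabytes -> bits
    seconds = bits / (link_mbps * 1e6)   # megabits per second -> bits per second
    return seconds / 3600


# Compare the recommended size, the upload cap, and a typical WGS BAM.
for size_gb in (5, 10, 100):
    print(f"{size_gb} GB at 100 Mbps: about {upload_hours(size_gb, 100):.1f} h")
```

Real uploads are usually slower than this lower bound, since home and office uplinks rarely sustain their rated speed.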

Better option: Use URLs

Instead of uploading 100GB BAM:
→ Upload to cloud storage (S3, GCS)
→ Generate public URL
→ Paste URL into Omics807
→ Server downloads directly (faster!)

Quick test URLs:

Tumor: https://storage.googleapis.com/.../tumor.bam
Normal: https://storage.googleapis.com/.../normal.bam

What reference genome does Omics807 use?

Default: GRCh38 (hg38)

Why GRCh38? - Current standard (2013 release) - Better accuracy than GRCh37 - Fewer gaps and errors - Used by DeepSomatic training

Important: Your BAM files must be aligned to GRCh38. - Check the BAM header: samtools view -H your.bam | grep '@SQ' - Reads aligned to GRCh37 will produce incorrect variant calls
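One way to automate the header check is to compare the chr1 length in the @SQ lines against the published assembly lengths. This is a minimal sketch: the function name is hypothetical, and using chr1 alone as a fingerprint is a simplifying assumption (it does not distinguish decoy or alt-contig variants of the same build).

```python
# Published chr1 lengths for the two assemblies.
CHR1_LENGTHS = {
    248956422: "GRCh38",
    249250621: "GRCh37",
}


def guess_build(header_text: str) -> str:
    """Return 'GRCh38', 'GRCh37', or 'unknown' from SAM header text,
    for example the output of `samtools view -H your.bam`."""
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # @SQ lines hold tab-separated TAG:VALUE fields, e.g. SN:chr1 LN:248956422
        fields = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        if fields.get("SN") in ("chr1", "1"):
            try:
                return CHR1_LENGTHS.get(int(fields.get("LN", "")), "unknown")
            except ValueError:
                return "unknown"
    return "unknown"


# Illustrative header fragment, not taken from a real file.
sample_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n"
print(guess_build(sample_header))
```

In practice the header text would come from running the samtools command above and passing its output to the function.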

Analysis Questions

How long does analysis take?

Quick start (chr1 subset, 100kb): - ~5-10 minutes

Chromosome 1 complete: - WGS: ~30-45 minutes - WES: ~10-15 minutes

Full run (all chromosomes): - WGS: 3-6 hours (96 CPUs) - WES: 15-30 minutes - PacBio: 5-6 hours - ONT: 5-6 hours

Factors affecting runtime: - BAM file size and coverage - Model type (WES faster than WGS) - Server CPU count - Number of shards (parallelization) - GPU availability (for call_variants)

Can I speed up the analysis?

Yes, several strategies:

1. Use more CPU cores:

--num_shards=32  # Use 32 cores instead of 4

Scales nearly linearly with cores

2. Add GPU acceleration: - Use -gpu Docker image - Reduces call_variants by 50-70% - Nvidia T4, A100, or V100 recommended

3. Analyze specific regions:

--regions=chr17  # Only chromosome 17
--regions=chr1:1000000-2000000  # Specific region

4. Use faster models: - WES instead of WGS (if applicable) - Skip tumor-only models if normal available

5. Optimize server: - SSD storage (faster I/O) - More RAM (avoid swapping) - Faster network (for downloads)
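As a rough planning aid, the effect of core count can be sketched by scaling the 96-CPU WGS figure quoted earlier. This assumes perfectly linear scaling, which is optimistic: only the sharded stage parallelizes this way, so treat the result as a lower bound. The function name is illustrative.

```python
def estimated_hours(target_cpus: int, reference_hours: float,
                    reference_cpus: int = 96) -> float:
    """Scale a known runtime to a different CPU count, assuming runtime is
    inversely proportional to the number of cores (an idealization)."""
    if target_cpus <= 0 or reference_cpus <= 0:
        raise ValueError("CPU counts must be positive")
    return reference_hours * reference_cpus / target_cpus


# WGS takes 3-6 hours on 96 CPUs; estimate the same job on 32 CPUs.
low, high = (estimated_hours(32, hours) for hours in (3, 6))
print(f"32 CPUs: roughly {low:.0f}-{high:.0f} hours")
```

Under these assumptions a 32-CPU server needs about three times as long as the 96-CPU reference, so setting --num_shards to match the available cores is the first thing to check.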

How accurate is DeepSomatic?

WGS (Illumina): - SNV: 95% recall, 99% precision - Indel: 93% recall, 85% precision - Overall: 98% precision

WES (Illumina): - SNV: 94% recall, 99% precision - Indel: 90% recall, 94% precision

Compared to alternatives: - MuTect2: DeepSomatic 15-20% better on indels - Strelka2: DeepSomatic ~5% better precision - VarScan2: DeepSomatic significantly better

Accuracy depends on: - Coverage depth (higher = better) - Sample quality (FFPE vs fresh) - Tumor purity (higher = easier) - Variant allele frequency (higher = easier)

See Model Guide for detailed metrics.

What's the minimum coverage required?

Coverage guidelines by sample type:

Sample Type	Minimum	Recommended	Optimal
Normal WGS	30x	50x	80x
Tumor WGS	50x	80x	100x+
Normal WES	80x	120x	150x
Tumor WES	100x	150x	200x
Tumor-only	60x	100x	150x+

Why higher coverage matters: - Detect low VAF variants (<10%) - Improve indel calling accuracy - Reduce false negatives - Better genotype confidence

Too low coverage (<20x): - Many false negatives - Unreliable VAF estimates - Poor GQ scores
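The coverage table can be turned into a quick pre-flight check. The sketch below copies the thresholds from the table; the dictionary keys and function name are made up for illustration, and the mean depth is assumed to come from a separate tool such as a coverage report.

```python
# (minimum, recommended, optimal) mean depth per sample type, from the table.
COVERAGE_GUIDE = {
    "normal_wgs": (30, 50, 80),
    "tumor_wgs": (50, 80, 100),
    "normal_wes": (80, 120, 150),
    "tumor_wes": (100, 150, 200),
    "tumor_only": (60, 100, 150),
}


def coverage_tier(sample_type: str, mean_depth: float) -> str:
    """Classify a sample's mean depth against the guideline thresholds."""
    minimum, recommended, optimal = COVERAGE_GUIDE[sample_type]
    if mean_depth >= optimal:
        return "optimal"
    if mean_depth >= recommended:
        return "recommended"
    if mean_depth >= minimum:
        return "minimum"
    return "below minimum"


print(coverage_tier("tumor_wgs", 85))
print(coverage_tier("normal_wes", 60))
```

A sample that lands in "below minimum" is likely to show the false negatives and unreliable VAF estimates described above.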

Can Omics807 detect all types of variants?

Well detected ✅: - SNVs (single nucleotide variants) - Small indels (<50bp) - MNVs (multi-nucleotide variants)

Limited detection ⚠️: - Medium indels (50-200bp): accuracy drops - Larger events (>200bp): not designed for this

Not detected ❌: - Large SVs (>1kb): use an SV caller - Gene fusions: use RNA-seq tools - Methylation: use bisulfite-seq - Copy number alterations (CNVs): use CNVkit, FACETS, etc.

For comprehensive analysis: - DeepSomatic: SNVs and indels - Manta/GRIDSS: Structural variants - CNVkit/FACETS: Copy number - STAR-Fusion: Gene fusions

Model Selection Questions

Which model should I use?

Follow this decision tree:

1. Do you have matched normal? - Yes → Tumor-normal models - No → Tumor-only models

2. What sequencing platform? - Illumina → WGS or WES - PacBio → PACBIO - Nanopore → ONT_R104

3. Is tissue FFPE? - Yes → FFPE_WGS or FFPE_WES - No → Standard models

4. Coverage type? - Whole genome → WGS - Exome only → WES

Quick examples: - Illumina WGS + normal → WGS - Illumina WES + normal → WES - Illumina WGS, no normal → WGS_TUMOR_ONLY - FFPE WGS + normal → FFPE_WGS - PacBio + normal → PACBIO
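The decision tree can be written as a small function. This is a sketch, not the logic Omics807 itself uses: the function and argument names are hypothetical, and tumor-only names other than WGS_TUMOR_ONLY are assumed to follow the same suffix pattern.

```python
def select_model(platform: str, has_normal: bool, ffpe: bool = False,
                 assay: str = "wgs") -> str:
    """Pick a model name by walking the four questions in the decision tree."""
    platform = platform.lower()
    assay = assay.lower()

    if platform == "pacbio":
        base = "PACBIO"            # FFPE and assay are not considered for long reads
    elif platform in ("ont", "nanopore"):
        base = "ONT_R104"
    elif platform == "illumina":
        if assay not in ("wgs", "wes"):
            raise ValueError(f"unknown assay: {assay}")
        base = assay.upper()
        if ffpe:
            base = "FFPE_" + base
    else:
        raise ValueError(f"unknown platform: {platform}")

    if not has_normal:
        # Assumption: every tumor-only model follows the WGS_TUMOR_ONLY naming.
        base += "_TUMOR_ONLY"
    return base


print(select_model("illumina", has_normal=True, ffpe=True, assay="wes"))
```

Check the returned name against the Model Guide before submitting a job, since not every combination is guaranteed to exist as a trained model.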

What if I use the wrong model?

Consequences: - Lower accuracy (10-30% drop) - More false positives/negatives - Incorrect quality scores - Wasted computational time

Common mistakes:

Using WGS on WES data: - Too many false positives in exons - Wrong statistical assumptions

Not using FFPE model on FFPE: - C→T artifacts called as variants - 20-30% more false positives

Using Illumina model on PacBio: - Different error profiles - Poor indel performance

How to verify correct model: - Check sequencing platform in metadata - Examine error rates in BAM - Review tissue preservation method

When should I use tumor-only mode?

Use tumor-only when: - ✅ No matched normal available - ✅ Archival/FFPE samples only - ✅ Budget constraints (half the sequencing cost) - ✅ Rapid screening needed - ✅ Have access to Panel of Normals

Avoid tumor-only when: - ❌ Normal tissue accessible - ❌ Need high precision (clinical decisions) - ❌ No Panel of Normals available - ❌ Studying rare variants

Tumor-only best practices: 1. Use high coverage (100x+) 2. Apply Panel of Normals filtering 3. Validate key variants orthogonally 4. Filter common population variants (dbSNP) 5. Be conservative with interpretation

Results Questions

What does "PASS" mean in the results?

PASS = High-confidence somatic variant

Criteria for PASS: - Sufficient quality score (QUAL > 30) - Present in tumor, absent (or low) in normal - Passes all filters (strand bias, mapping quality, etc.) - Likely true somatic mutation

Other FILTER values: - GERMLINE: Inherited, not somatic - RefCall: No variant, matches reference - LowQual: Quality below threshold

Focus on PASS variants for: - Clinical interpretation - Downstream analysis - Actionable mutation discovery

See Understanding Results for details.
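Outside the Omics807 interface, PASS variants can be pulled from the output VCF with a few lines of Python. This sketch relies only on the standard VCF layout, where FILTER is the seventh tab-separated column; the helper names and the example records are illustrative.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def pass_records(vcf_lines):
    """Yield data lines whose FILTER column is exactly PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 7 and columns[6] == "PASS":
            yield line.rstrip("\n")


# Made-up example records showing the three FILTER values discussed above.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr7\t140753336\t.\tA\tT\t45.2\tPASS\t.",
    "chr1\t12345\t.\tG\tC\t12.0\tGERMLINE\t.",
    "chr2\t999\t.\tT\t.\t0.1\tRefCall\t.",
]
for record in pass_records(example):
    print(record)
```

For a real file, pass the handle from open_vcf("output.vcf.gz") to pass_records instead of the example list.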

How do I interpret variant allele frequency (VAF)?

VAF = Variant reads / Total reads

Interpretation:

VAF = 100%  →  Homozygous in tumor
VAF = 50%   →  Heterozygous or germline
VAF = 40%   →  Somatic, ~80% tumor purity
VAF = 20%   →  Somatic, low purity or subclonal
VAF = 5%    →  Very low frequency, validate carefully

Factors affecting VAF:

Tumor purity:

Pure tumor (100% purity):
→ Heterozygous somatic = 50% VAF

Mixed (50% tumor, 50% normal):
→ Heterozygous somatic = 25% VAF

Copy number (shown for 100% tumor purity):

Normal: 1 variant copy, 1 reference copy → 50% VAF
Amplification: 3 variant copies, 1 reference copy → 75% VAF
LOH: 1 variant copy, 0 reference copies → 100% VAF

Subclonality:

Clonal (all cells): High VAF (30-50%)
Subclonal (subset): Low VAF (5-20%)
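The purity and copy-number examples above all follow from one relationship: the expected VAF is the share of variant alleles among all alleles in the sample. The sketch below assumes a clonal variant and a diploid normal compartment; the function and parameter names are illustrative.

```python
def observed_vaf(variant_reads: int, total_reads: int) -> float:
    """VAF = variant reads / total reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


def expected_vaf(purity: float, variant_copies: int, tumor_total_copies: int,
                 normal_total_copies: int = 2) -> float:
    """Expected VAF of a clonal somatic variant.

    purity              fraction of cells in the sample that are tumor (0-1)
    variant_copies      copies of the variant allele per tumor cell
    tumor_total_copies  total copies of the locus per tumor cell
    normal_total_copies total copies per normal cell (2 for autosomes)
    """
    if not 0 <= purity <= 1:
        raise ValueError("purity must be between 0 and 1")
    if not 0 <= variant_copies <= tumor_total_copies:
        raise ValueError("variant_copies must be between 0 and tumor_total_copies")
    variant_alleles = purity * variant_copies
    all_alleles = purity * tumor_total_copies + (1 - purity) * normal_total_copies
    if all_alleles <= 0:
        raise ValueError("sample contains no copies of the locus")
    return variant_alleles / all_alleles


print(expected_vaf(0.5, 1, 2))   # heterozygous variant, 50% purity
print(expected_vaf(1.0, 3, 4))   # amplification: 3 variant copies, 1 reference
print(expected_vaf(1.0, 1, 1))   # LOH: 1 variant copy, 0 reference
```

Comparing observed_vaf for a call with expected_vaf at the estimated purity is a quick way to judge whether a variant looks clonal, subclonal, or possibly germline.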

What are "Omics807 Insights" and how reliable are they?

Omics807 Insights are AI-generated clinical interpretations that integrate multi-omics enrichment data.

What Omics807 provides: - Population genetics analysis for germline filtering - Protein structure predictions for mutation impact - Drug target matching and therapeutic options - Pathway analysis for biological context - Clinical evidence from curated databases - Research literature citations - Clinical significance assessment - Associated cancer types - Treatment implications - Recommended follow-up actions

Example:

Variant: BRAF p.V600E
AI: "Well-established oncogenic driver in melanoma 
and colorectal cancer. Targetable with BRAF inhibitors 
(vemurafenib, dabrafenib). Consider resistance testing."

Reliability: - ✅ Good for well-known hotspot mutations - ✅ Summarizes published literature - ⚠️ May not reflect latest research (knowledge cutoff) - ⚠️ Should be validated with databases (COSMIC, ClinVar) - ❌ Not a substitute for clinical interpretation

Best practice: 1. Use AI as starting point 2. Verify with clinical databases 3. Consult with oncologists/genetic counselors 4. Consider patient-specific context

How many variants should I expect?

Typical ranges (whole genome):

WGS: - Total detected: 10,000-100,000 - PASS (somatic): 1,000-10,000 - High impact: 10-100

WES: - Total detected: 100-1,000 - PASS (somatic): 50-500 - High impact: 5-50

Factors affecting count: - Cancer type (melanoma > pediatric tumors) - Tumor mutational burden (TMB) - Patient age (more with age) - Exposure (smoking, UV) - Filtering stringency

Concerning patterns: - Too few (<100): Low coverage or low tumor purity - Too many (>100,000): Wrong reference, artifacts - All filtered (no PASS): Quality issues

Troubleshooting

Why is my job stuck at "Queued"?

Possible causes:

1. Server connection failed: - Check SSH credentials - Verify server is running - Test connection manually

2. Previous job still running: - Omics807 runs one job at a time - Wait for completion or cancel

3. Server out of resources: - Check disk space - Verify sufficient RAM - Monitor CPU usage

Solutions:

# Test SSH connection
ssh root@your-server-ip

# Check disk space
df -h

# Check running processes
ps aux | grep deepsomatic

# Kill stuck process (last resort: force-kills every process whose command line matches "deepsomatic")
pkill -9 -f deepsomatic

Why did my job fail?

Common error messages:

"BAM file not found": - URL is incorrect or inaccessible - File was not uploaded properly - Network issues during transfer

"Reference genome missing": - GRCh38 reference not on server - Wrong path specified - Need to run setup script

"Out of memory": - BAM file too large for server RAM - Increase server memory - Use fewer shards

"Docker not installed": - Server setup incomplete - Run setup script - Install Docker manually

"Invalid BAM file": - BAM corrupted or incomplete - Wrong reference genome used - Missing BAM index (.bai)

Debugging steps: 1. Check job logs in Omics807 2. SSH to server and check /root/deepsomatic_jobs/[job_id]/deepsomatic.log 3. Verify input files exist and are valid 4. Test with quick start dataset
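When a log is long, a small script can flag which of the common errors it contains. This is a sketch under one loud assumption: that the log includes the error messages verbatim as quoted above. The hints are condensed from this FAQ, and the function name is hypothetical.

```python
# Error text (as quoted in this FAQ) mapped to the first thing to check.
KNOWN_ERRORS = {
    "BAM file not found": "Check that the URL is reachable and the transfer completed.",
    "Reference genome missing": "Confirm the GRCh38 reference path or run the setup script.",
    "Out of memory": "Increase server RAM or use fewer shards.",
    "Docker not installed": "Run the setup script or install Docker manually.",
    "Invalid BAM file": "Check for corruption, the reference build, and the .bai index.",
}


def diagnose(log_text: str) -> list:
    """Return (error, hint) pairs for every known error found in the log."""
    lowered = log_text.lower()
    return [
        (error, hint)
        for error, hint in KNOWN_ERRORS.items()
        if error.lower() in lowered
    ]


# Illustrative log excerpt, not real DeepSomatic output.
sample_log = "step 2 failed\nERROR: Out of memory while creating examples\n"
for error, hint in diagnose(sample_log):
    print(f"{error}: {hint}")
```

To use it on a real job, read the deepsomatic.log file from the job directory mentioned in step 2 and pass its contents to diagnose.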

How do I validate my results?

Validation hierarchy (best to good):

1. Sanger Sequencing (Gold standard): - PCR amplify variant region - Sanger sequence - Confirms variant presence, but is insensitive below roughly 15-20% VAF and does not measure VAF precisely

2. Droplet Digital PCR (ddPCR): - Precise VAF measurement - Good for low-frequency variants - Quantitative validation

3. Alternative NGS Platform: - Re-sequence on different platform - Illumina → PacBio - Different library prep

4. Orthogonal Variant Caller: - Run MuTect2 or Strelka2 - Take consensus of multiple callers - Higher confidence on shared calls

5. Database Cross-Reference: - Check COSMIC for known cancer mutations - Review ClinVar for pathogenicity - Compare to published literature

Validation targets: - All clinically actionable variants - Variants driving treatment decisions - Novel/unexpected mutations - Low VAF variants (<10%)
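The consensus step in option 4 can be sketched as a set intersection over variant keys. This minimal version assumes both callers report variants in the same representation; it does not left-align or normalize indels, which a production comparison would need. All names are illustrative.

```python
def variant_keys(vcf_lines, pass_only: bool = True) -> set:
    """Collect (chrom, pos, ref, alt) keys from VCF data lines."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 7:
            continue  # not a complete VCF record
        chrom, pos, _id, ref, alt, _qual, filter_value = columns[:7]
        if pass_only and filter_value != "PASS":
            continue
        for allele in alt.split(","):  # split multi-allelic records
            keys.add((chrom, int(pos), ref, allele))
    return keys


def consensus(*call_sets: set) -> set:
    """Variants present in every call set."""
    return set.intersection(*call_sets) if call_sets else set()


# Made-up records standing in for two callers' output.
caller_a = ["chr7\t140753336\t.\tA\tT\t45\tPASS\t.", "chr1\t100\t.\tG\tC\t40\tPASS\t."]
caller_b = ["chr7\t140753336\t.\tA\tT\t60\tPASS\t.", "chr3\t500\t.\tT\tA\t33\tPASS\t."]
shared = consensus(variant_keys(caller_a), variant_keys(caller_b))
print(sorted(shared))
```

Calls in the shared set carry higher confidence; calls unique to one caller are natural candidates for orthogonal validation.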

Can I run Omics807 on my laptop?

Short answer: Not recommended

Why not: - DeepSomatic requires 32GB+ RAM (laptops typically 8-16GB) - WGS takes 3-6 hours on 96 CPU server (days on laptop) - Large BAM files (100GB+) need substantial storage - Resource-intensive Docker containers

Better alternatives:

1. Cloud server (recommended): - Rent on-demand (AWS, GCP, Kamatera) - Pay only when analyzing - Scale resources as needed

2. Institutional HPC: - Use university/hospital cluster - Often free for researchers - Pre-installed tools

3. Small datasets on laptop (chr1 only): - Possible for testing - Use quick start dataset - Expect 1-2 hour runtime

Minimum specs for laptop testing: - CPU: 8+ cores - RAM: 32GB - Storage: 100GB free - Time: Hours, not minutes

Advanced Questions

Can I use Omics807 for clinical diagnostics?

Currently: No, not recommended for clinical use

Reasons: - Not FDA approved for diagnostics - No CLIA/CAP validation - Research-grade implementation - Lacks clinical-grade QC

For clinical use, you need: - Validated clinical pipeline - CLIA-certified laboratory - CAP accreditation - Clinical-grade reporting

Omics807 is appropriate for: - Research studies - Method development - Educational purposes - Preliminary screening

If clinical application needed: - Partner with certified lab - Validate against clinical standards - Obtain regulatory approval - Implement QC procedures

How can I integrate Omics807 into my pipeline?

Integration options:

1. API Integration (future): - REST API for job submission - Programmatic result retrieval - Webhook notifications

2. Docker Container:

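# Run DeepSomatic directly
# Mount the working directory so the container can read the inputs and write the VCF.
# ref.fasta.fai and the .bai index files must sit next to their files in the mounted directory.
docker run \
  -v "${PWD}":/data \
  google/deepsomatic:1.9.0 \
  run_deepsomatic \
  --model_type=WGS \
  --ref=/data/ref.fasta \
  --reads_tumor=/data/tumor.bam \
  --reads_normal=/data/normal.bam \
  --output_vcf=/data/output.vcf.gz \
  --num_shards="$(nproc)"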

3. Workflow Integration: - Add to Nextflow pipeline - Integrate with Snakemake - Use in WDL workflows

4. Batch Processing: - Process multiple samples - Cloud-based scaling - Results aggregation

Example Nextflow:

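process deepsomatic {
    container 'google/deepsomatic:1.9.0'

    input:
    // Stage each file together with its index so DeepSomatic can find both
    tuple path(tumor_bam), path(tumor_bai)
    tuple path(normal_bam), path(normal_bai)
    tuple path(ref_fasta), path(ref_fai)

    output:
    path 'output.vcf.gz'

    script:
    """
    run_deepsomatic \
      --model_type=WGS \
      --ref=${ref_fasta} \
      --reads_tumor=${tumor_bam} \
      --reads_normal=${normal_bam} \
      --output_vcf=output.vcf.gz \
      --num_shards=${task.cpus}
    """
}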
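For batch processing without a workflow manager, one option is to generate a run_deepsomatic command per sample from a manifest. The sketch below only builds and prints the commands, using flags that appear elsewhere in this FAQ; the manifest layout, sample names, and file paths are hypothetical.

```python
import shlex


def build_command(tumor_bam, normal_bam, ref, output_vcf,
                  model_type="WGS", num_shards=None, regions=None):
    """Assemble the run_deepsomatic argument list for one tumor-normal pair."""
    command = [
        "run_deepsomatic",
        f"--model_type={model_type}",
        f"--ref={ref}",
        f"--reads_tumor={tumor_bam}",
        f"--reads_normal={normal_bam}",
        f"--output_vcf={output_vcf}",
    ]
    if num_shards is not None:
        command.append(f"--num_shards={num_shards}")
    if regions:
        command.append(f"--regions={regions}")
    return command


# Hypothetical manifest: one entry per tumor-normal pair.
manifest = [
    {"sample": "patient01", "tumor": "p01_tumor.bam", "normal": "p01_normal.bam"},
    {"sample": "patient02", "tumor": "p02_tumor.bam", "normal": "p02_normal.bam"},
]

for entry in manifest:
    command = build_command(
        tumor_bam=entry["tumor"],
        normal_bam=entry["normal"],
        ref="ref.fasta",
        output_vcf=f"{entry['sample']}.vcf.gz",
        num_shards=32,
    )
    print(shlex.join(command))
```

Building the command as a list keeps quoting safe if it is later passed to subprocess.run, and the printed lines can be reviewed or pasted into the Docker invocation shown earlier.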

Where can I get help?

Omics807 Support: - Check this FAQ - Review documentation guides - Examine case studies

DeepSomatic Support: - GitHub Issues - Documentation - Community forums

General Genomics Help: - Biostars - SEQanswers - r/bioinformatics

Training Resources: - Broad Institute Workshops - Coursera Genomics Courses - Galaxy Training

Still Have Questions?

Documentation: - Getting Started - Basic introduction - Model Guide - Technical details - Understanding Results - Interpretation help

External Resources: - DeepSomatic GitHub - SEQC2 Project - COSMIC Database

Community: Join discussions on genomics forums or create an issue on the Omics807 GitHub repository.