Example Datasets

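Public genomic datasets for testing Omics807 and learning variant calling. The HCC1395 and SEQC2 datasets are freely accessible and well characterized; some of the other resources listed here (for example, TCGA raw sequencing data) require access approval, as noted under Data Access Policies.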

The quick-start subset below is small enough to test Omics807 in under 10 minutes.

HCC1395 Chromosome 1 Subset

Description: A small region of chr1 from the HCC1395 breast cancer cell line.

Details:

  • Sample: HCC1395 (breast cancer)
  • Region: chr1:10,000,000-10,100,000 (100kb)
  • Coverage: ~50x normal, ~60x tumor
  • Sequencing: Illumina (150bp paired-end)
  • Reference: GRCh38

Files:

# Tumor BAM (~100MB)
https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/quick-start/S1395_WGS_ilm_tumor.bwa.dedup.chr1.quickstart.bam

# Normal BAM (~50MB)
https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/quick-start/S1395_WGS_ilm_normal.bwa.dedup.chr1.quickstart.bam

# Reference genome (chr1 only)
https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/quick-start/GRCh38_no_alts_chr1.fasta

Expected Results:

  • Runtime: ~5-10 minutes
  • Variants: ~5-10 somatic variants
  • Quality: High-confidence PASS calls

How to Use in Omics807:

  1. Go to "Start Analysis"
  2. Paste tumor BAM URL
  3. Paste normal BAM URL
  4. Select model: WGS
  5. Click "Start Analysis"


HCC1395 Full Dataset

About HCC1395

Cell Line: HCC1395

  • Type: Primary ductal carcinoma (breast cancer)
  • Origin: African American female, 43 years old
  • Matched Normal: HCC1395BL (B-lymphocyte)
  • Characteristics: TP53 mutant, BRCA1/2 wild-type

Why HCC1395?

  • Well-characterized reference material
  • FDA SEQC2 consortium truth set
  • Multiple sequencing platforms available
  • Known mutation profile

Chromosome 1 Complete (WGS)

Details:

  • Region: Entire chromosome 1
  • Size: ~4GB per BAM file
  • Coverage: 50x normal, 60x tumor
  • Platform: Illumina HiSeq

Download:

BASE_URL="https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/deepsomatic-chr1-case-studies"

# Tumor BAM and index
wget "${BASE_URL}/HCC1395_wgs.tumor.chr1.bam"
wget "${BASE_URL}/HCC1395_wgs.tumor.chr1.bam.bai"

# Normal BAM and index
wget "${BASE_URL}/HCC1395_wgs.normal.chr1.bam"
wget "${BASE_URL}/HCC1395_wgs.normal.chr1.bam.bai"

# Reference (chr1) and index
wget "${BASE_URL}/GCA_000001405.15_GRCh38_no_alt_analysis_set.chr1.fna"
wget "${BASE_URL}/GCA_000001405.15_GRCh38_no_alt_analysis_set.chr1.fna.fai"

Expected Results (DeepSomatic WGS model):

  • Total variants: ~3,500
  • SNVs: ~3,300 (recall 96%, precision 99%)
  • Indels: ~150 (recall 95%, precision 85%)
  • Runtime: ~30-45 minutes (chr1 only)

Exome Data (WES)

Details:

  • Targeted: Exome regions only
  • Size: ~800MB per BAM file
  • Coverage: 140x normal, 120x tumor
  • Capture: Agilent SureSelect

Download:

BASE_URL="https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/deepsomatic-chr1-case-studies"

# Tumor WES BAM and index
wget "${BASE_URL}/HCC1395_wes.tumor.chr1.bam"
wget "${BASE_URL}/HCC1395_wes.tumor.chr1.bam.bai"

# Normal WES BAM and index
wget "${BASE_URL}/HCC1395_wes.normal.chr1.bam"
wget "${BASE_URL}/HCC1395_wes.normal.chr1.bam.bai"

Expected Results (DeepSomatic WES model):

  • Total variants: ~150
  • SNVs: ~140 (recall 91%, precision 100%)
  • Indels: ~8 (recall 100%, precision 88%)
  • Runtime: ~5-10 minutes (chr1 exome)


SEQC2 Truth Sets

The FDA-led SEQC2 consortium provides validated truth sets for HCC1395.

High-Confidence Regions

Description: Curated set of high-confidence somatic variants

Details:

  • Source: Multiple orthogonal technologies
  • Validation: Manual curation + Sanger sequencing
  • Version: v1.2.1
  • Use: Benchmark variant callers

Download Truth VCF:

BASE_URL="https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/SEQC2-S1395-truth"

# Truth set VCF and index
wget "${BASE_URL}/high-confidence_sINDEL_sSNV_in_HC_regions_v1.2.1.merged.vcf.gz"
wget "${BASE_URL}/high-confidence_sINDEL_sSNV_in_HC_regions_v1.2.1.merged.vcf.gz.tbi"

# High-confidence regions BED
wget "${BASE_URL}/High-Confidence_Regions_v1.2.bed"

# Quick start subset (chr1 only)
wget https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/quick-start/SEQC2_truth.chr1.quick_start.vcf.gz

What's in the Truth Set?

Chromosome 1 Subset:

  • SNVs: ~3,440 validated
  • Indels: ~133 validated
  • Total: ~3,573 high-confidence variants

Whole Genome:

  • SNVs: ~39,000 validated
  • Indels: ~1,600 validated
  • Coverage: ~2.8Gb high-confidence regions

Using Truth Sets for Validation

Benchmarking Pipeline:

# 1. Run Omics807 on the HCC1395 data
# 2. Download the results VCF into /data

# 3. Compare to the truth set using som.py (part of hap.py)
docker run -v /data:/data pkrusche/hap.py:latest \
  /opt/hap.py/bin/som.py \
  /data/SEQC2_truth.vcf.gz \
  /data/omics807_results.vcf.gz \
  -r /data/reference.fasta \
  -o /data/benchmark_results \
  --feature-table generic

Metrics Generated:

  • Recall (sensitivity): % of true variants found
  • Precision (PPV): % of calls that are true
  • F1 score: Harmonic mean of recall/precision
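To make the three metrics concrete, here is a minimal sketch that derives them from raw counts of true positives, false positives, and false negatives. The function name `benchmark_metrics` and the counts are hypothetical and chosen only for illustration; the benchmarking tool reports these values itself, so this is a way to sanity-check a report by hand rather than a replacement for it.

```shell
#!/usr/bin/env bash
# Compute recall, precision, and F1 from true-positive (TP),
# false-positive (FP), and false-negative (FN) counts.
# benchmark_metrics is a hypothetical helper, not part of Omics807 or hap.py.
benchmark_metrics() {
  local tp="$1" fp="$2" fneg="$3"
  awk -v tp="$tp" -v fp="$fp" -v fneg="$fneg" 'BEGIN {
    recall    = ((tp + fneg) > 0) ? tp / (tp + fneg) : 0
    precision = ((tp + fp) > 0)   ? tp / (tp + fp)   : 0
    f1 = ((recall + precision) > 0) ? 2 * recall * precision / (recall + precision) : 0
    printf "recall=%.4f precision=%.4f f1=%.4f\n", recall, precision, f1
  }'
}

# Hypothetical counts: 90 true calls found, 10 false calls, 30 true variants missed
benchmark_metrics 90 10 30
# prints: recall=0.7500 precision=0.9000 f1=0.8182
```

Because F1 is a harmonic mean, it sits closer to the weaker of the two values (0.75 here) than a simple average would.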


Multi-Platform Datasets

PacBio HiFi (Long-Read)

HCC1395 PacBio:

  • Platform: PacBio Sequel II
  • Read length: 10-25kb HiFi reads
  • Coverage: 45x normal, 60x tumor

Use Case: Compare short vs long-read variant calling

Expected Differences:

  • Better resolution of structural variants
  • Improved phasing
  • Lower indel precision than Illumina
  • Better performance in repetitive regions

Oxford Nanopore (Ultra-Long)

HCC1395 ONT R10.4:

  • Platform: PromethION
  • Read length: >100kb possible
  • Coverage: 33x normal, 50x tumor

Use Case: Resolve complex genomic regions

Expected Differences:

  • Ultra-long reads (some >1Mb)
  • Lower base-level accuracy
  • Excellent for structural variants
  • Real-time sequencing capability


Other Public Datasets

GIAB (Genome in a Bottle)

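HG002 Trio (HG002/HG003/HG004):

  • Sample: Ashkenazi Jewish trio (son HG002, father HG003, mother HG004)
  • Type: Germline (not cancer)
  • Use: Test germline calling, quality control

HG001 (NA12878) is a separate GIAB germline sample and is not part of this trio.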

Access: https://www.nist.gov/programs-projects/genome-bottle

TCGA (The Cancer Genome Atlas)

Description: Thousands of tumor/normal pairs across 33 cancer types

Access:

  • Portal: https://portal.gdc.cancer.gov/
  • Cloud: gs://gdc-tcga-phs000178-open/
  • Requires: dbGaP authorization for raw data

Popular Projects:

  • TCGA-BRCA: Breast cancer (>1,000 cases)
  • TCGA-LUAD: Lung adenocarcinoma
  • TCGA-COAD: Colon adenocarcinoma

ICGC (International Cancer Genome Consortium)

Description: International cancer genomics resource

Access: https://dcc.icgc.org/

Features:

  • Simple somatic mutations (SSM)
  • Copy number (CN)
  • Structural variants (SV)
  • Gene expression


Dataset Recommendations by Use Case

Learning Omics807

Start Here:

  1. Quick start chr1 subset (10 min)
  2. Full chr1 WGS (30 min)
  3. Full chr1 WES (10 min)

Benchmarking Performance

Recommended:

  • HCC1395 chr1 + SEQC2 truth set
  • Known mutation profile
  • Validated results

Testing Different Models

Comparison Set:

| Model | Dataset | Runtime |
|-------|---------|---------|
| WGS | HCC1395 chr1 WGS | 30 min |
| WES | HCC1395 chr1 WES | 10 min |
| PACBIO | HCC1395 chr1 PacBio | 45 min |
| ONT | HCC1395 chr1 ONT | 40 min |

Production Validation

Before Clinical Use:

  1. Run positive control (HCC1395)
  2. Compare to SEQC2 truth set
  3. Achieve >95% recall/precision
  4. Validate key variants with Sanger
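The >95% criterion can be turned into an automated pass/fail gate. The sketch below is one possible way to do that; `meets_threshold` is a hypothetical helper, and the inputs are the chr1 WGS recall and precision figures quoted earlier in this document, expressed as fractions.

```shell
#!/usr/bin/env bash
# Succeed (exit status 0) only if BOTH recall and precision are strictly
# greater than the threshold. meets_threshold is a hypothetical helper;
# the default threshold of 0.95 mirrors the ">95%" criterion above.
meets_threshold() {
  local recall="$1" precision="$2" threshold="${3:-0.95}"
  awk -v r="$recall" -v p="$precision" -v t="$threshold" \
    'BEGIN { exit !((r > t) && (p > t)) }'
}

# chr1 WGS SNVs: recall 96%, precision 99%
if meets_threshold 0.96 0.99; then echo "SNV: PASS"; else echo "SNV: FAIL"; fi
# chr1 WGS indels: recall 95%, precision 85%
if meets_threshold 0.95 0.85; then echo "indel: PASS"; else echo "indel: FAIL"; fi
```

With these inputs the SNV check passes and the indel check fails. Whether a single threshold should apply to both variant types is a validation-policy decision that this sketch does not settle.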


Download and Storage Tips

File Sizes

| Dataset | Tumor | Normal | Total |
|---------|-------|--------|-------|
| Quick start (100kb) | 100MB | 50MB | 150MB |
| Chr1 WGS | 4GB | 3GB | 7GB |
| Chr1 WES | 800MB | 600MB | 1.4GB |
| Whole genome WGS | 100GB | 80GB | 180GB |

Download Methods

Direct Download (small files):

wget https://storage.googleapis.com/.../file.bam

Streaming URL (Omics807):

  • Paste URL directly into Omics807
  • Server downloads automatically
  • No local storage needed

Google Cloud (large files):

gsutil -m cp gs://deepvariant/path/to/data/* ./

Storage Requirements

For Testing:

  • Quick start: 500MB free space
  • Chr1 dataset: 10GB free space
  • Full WGS: 500GB recommended

For Production:

  • Per sample: 200-500GB
  • Multiple samples: 1TB+ recommended
  • Consider cloud storage (S3, GCS)
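Before starting a large download, it can help to confirm that the target filesystem has enough room. The following is a minimal sketch using the POSIX `df -P` output format; `has_free_space_gb` is a hypothetical helper, and the 10 GB figure is the chr1 testing requirement listed above.

```shell
#!/usr/bin/env bash
# Succeed if the filesystem holding DIR has at least NEED_GB gigabytes free.
# has_free_space_gb is a hypothetical helper. It reads the "Available"
# column (field 4) of `df -Pk`, which is reported in 1024-byte blocks.
has_free_space_gb() {
  local dir="$1" need_gb="$2" avail_kb
  avail_kb="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
  [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

if has_free_space_gb . 10; then
  echo "enough space for the chr1 dataset"
else
  echo "less than 10 GB free in the current directory" >&2
fi
```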


Creating Your Own Test Data

Subsetting BAM Files

Extract specific regions for testing:

# Extract the chr1:10M-20M region (input.bam must be coordinate-sorted and indexed)
samtools view -b input.bam chr1:10000000-20000000 > subset.bam
samtools index subset.bam

Downsampling Coverage

Reduce coverage for faster testing:

# Downsample to 30x (from 100x)
samtools view -s 0.3 -b input.bam > downsampled.bam
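The value passed to `-s` is the fraction of reads to keep, that is, target coverage divided by current coverage. The integer part of the same argument sets the random seed, so `42.30` means seed 42, fraction 0.30. The sketch below computes the fraction and prints the resulting command without running it; `downsample_fraction` is a hypothetical helper, and the seed 42 is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Print the fraction of reads to keep when going from CURRENT to TARGET coverage.
# downsample_fraction is a hypothetical helper; it rejects targets that are
# not strictly between 0 and the current coverage.
downsample_fraction() {
  local current="$1" target="$2"
  awk -v c="$current" -v t="$target" 'BEGIN {
    if (c <= 0 || t <= 0 || t >= c) {
      print "target must be > 0 and < current coverage" > "/dev/stderr"
      exit 1
    }
    printf "%.2f\n", t / c
  }'
}

frac="$(downsample_fraction 100 30)"   # 0.30
# Strip the leading 0 so the seed (42, arbitrary) can be prefixed: 42.30
echo "samtools view -b -s 42${frac#0} input.bam -o downsampled.bam"
# prints: samtools view -b -s 42.30 input.bam -o downsampled.bam
```

Using the same seed for repeated runs keeps the subsample reproducible, and downsampling tumor and normal by their own fractions keeps each at its intended depth.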

Simulated Data

Generate synthetic tumor/normal pairs:

Tools:

  • BAMSurgeon: Insert variants into BAM
  • neat-genreads: Simulate paired-end reads
  • dwgsim: Whole genome simulation


Data Access Policies

Open Access

  • HCC1395 datasets: Public, no restrictions
  • SEQC2 truth sets: Public
  • Reference genomes: Public

Controlled Access

  • TCGA raw sequencing: Requires dbGaP approval
  • ICGC: May require data access agreements
  • Clinical samples: IRB approval needed

Citation Requirements

When using HCC1395/SEQC2:

Fang LT, et al. (2021) Establishing community reference samples, 
data and call sets for benchmarking cancer mutation detection using 
whole-genome sequencing. Nature Biotechnology 39, 1151-1160.

External Resources