Example Datasets
Public genomic datasets for testing - HCC1395 breast cancer cell line, SEQC2 truth sets, and quick start samples
Public genomic datasets available for testing Omics807 and learning variant calling. All datasets are freely accessible and well-characterized.
Quick Start Dataset (Recommended)
Perfect for testing Omics807 in under 10 minutes!
HCC1395 Chromosome 1 Subset
Description: Small region of chr1 from HCC1395 breast cancer cell line
Details:
- Sample: HCC1395 (breast cancer)
- Region: chr1:10,000,000-10,100,000 (100kb)
- Coverage: ~50x normal, ~60x tumor
- Sequencing: Illumina (150bp paired-end)
- Reference: GRCh38
Files:
# Tumor BAM (~100MB)
https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/quick-start/S1395_WGS_ilm_tumor.bwa.dedup.chr1.quickstart.bam
# Normal BAM (~50MB)
https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/quick-start/S1395_WGS_ilm_normal.bwa.dedup.chr1.quickstart.bam
# Reference genome (chr1 only)
https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/quick-start/GRCh38_no_alts_chr1.fasta
Expected Results:
- Runtime: ~5-10 minutes
- Variants: ~5-10 somatic variants
- Quality: High-confidence PASS calls
How to Use in Omics807:
1. Go to "Start Analysis"
2. Paste tumor BAM URL
3. Paste normal BAM URL
4. Select model: WGS
5. Click "Start Analysis"
HCC1395 Full Dataset
About HCC1395
Cell Line: HCC1395
- Type: Primary ductal carcinoma (breast cancer)
- Origin: African American female, 43 years old
- Matched Normal: HCC1395BL (B-lymphocyte)
- Characteristics: TP53 mutant, BRCA1/2 wild-type
Why HCC1395?
- Well-characterized reference material
- FDA SEQC2 consortium truth set
- Multiple sequencing platforms available
- Known mutation profile
Chromosome 1 Complete (WGS)
Details:
- Region: Entire chromosome 1
- Size: ~4GB per BAM file
- Coverage: 50x normal, 60x tumor
- Platform: Illumina HiSeq
Download:
BASE_URL="https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/deepsomatic-chr1-case-studies"
# Tumor BAM and index
wget "${BASE_URL}/HCC1395_wgs.tumor.chr1.bam"
wget "${BASE_URL}/HCC1395_wgs.tumor.chr1.bam.bai"
# Normal BAM and index
wget "${BASE_URL}/HCC1395_wgs.normal.chr1.bam"
wget "${BASE_URL}/HCC1395_wgs.normal.chr1.bam.bai"
# Reference (chr1) and index
wget "${BASE_URL}/GCA_000001405.15_GRCh38_no_alt_analysis_set.chr1.fna"
wget "${BASE_URL}/GCA_000001405.15_GRCh38_no_alt_analysis_set.chr1.fna.fai"
Expected Results (DeepSomatic WGS model):
- Total variants: ~3,500
- SNVs: ~3,300 (recall 96%, precision 99%)
- Indels: ~150 (recall 95%, precision 85%)
- Runtime: ~30-45 minutes (chr1 only)
Exome Data (WES)
Details:
- Targeted: Exome regions only
- Size: ~800MB per BAM file
- Coverage: 140x normal, 120x tumor
- Capture: Agilent SureSelect
Download:
BASE_URL="https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/deepsomatic-chr1-case-studies"
# Tumor WES BAM and index
wget "${BASE_URL}/HCC1395_wes.tumor.chr1.bam"
wget "${BASE_URL}/HCC1395_wes.tumor.chr1.bam.bai"
# Normal WES BAM and index
wget "${BASE_URL}/HCC1395_wes.normal.chr1.bam"
wget "${BASE_URL}/HCC1395_wes.normal.chr1.bam.bai"
Expected Results (DeepSomatic WES model):
- Total variants: ~150
- SNVs: ~140 (recall 91%, precision 100%)
- Indels: ~8 (recall 100%, precision 88%)
- Runtime: ~5-10 minutes (chr1 exome)
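The WGS and WES download lists above share one naming pattern, so the BAM and index URLs can be generated rather than typed out. The following is a minimal sketch under that assumption; `hcc1395_urls` is a hypothetical helper name, and only the base URL and file names shown above are used.

```shell
#!/bin/sh
# Sketch: list the chr1 case-study BAM/index URLs for one assay ("wgs" or "wes").
# BASE_URL and the file-name pattern are copied from the download lists above;
# hcc1395_urls is a hypothetical helper, not part of Omics807.
BASE_URL="https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/deepsomatic-chr1-case-studies"

hcc1395_urls() {
    assay="$1"
    case "$assay" in
        wgs|wes) ;;
        *) echo "usage: hcc1395_urls wgs|wes" >&2; return 1 ;;
    esac
    for sample in tumor normal; do
        for ext in bam bam.bai; do
            echo "${BASE_URL}/HCC1395_${assay}.${sample}.chr1.${ext}"
        done
    done
}

# Print the four WES URLs (tumor BAM, tumor index, normal BAM, normal index).
hcc1395_urls wes
```

To fetch the files instead of printing them, the output can be piped into a downloader, for example `hcc1395_urls wes | xargs -n 1 wget -c`.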
SEQC2 Truth Sets
The FDA-led SEQC2 consortium provides validated truth sets for HCC1395.
High-Confidence Regions
Description: Curated set of high-confidence somatic variants
Details:
- Source: Multiple orthogonal technologies
- Validation: Manual curation + Sanger sequencing
- Version: v1.2.1
- Use: Benchmark variant callers
Download Truth VCF:
BASE_URL="https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/SEQC2-S1395-truth"
# Truth set VCF and index
wget "${BASE_URL}/high-confidence_sINDEL_sSNV_in_HC_regions_v1.2.1.merged.vcf.gz"
wget "${BASE_URL}/high-confidence_sINDEL_sSNV_in_HC_regions_v1.2.1.merged.vcf.gz.tbi"
# High-confidence regions BED
wget "${BASE_URL}/High-Confidence_Regions_v1.2.bed"
# Quick start subset (chr1 only)
wget "https://storage.googleapis.com/deepvariant/deepsomatic-case-studies/quick-start/SEQC2_truth.chr1.quick_start.vcf.gz"
What's in the Truth Set?
Chromosome 1 Subset:
- SNVs: ~3,440 validated
- Indels: ~133 validated
- Total: ~3,573 high-confidence variants
Whole Genome:
- SNVs: ~39,000 validated
- Indels: ~1,600 validated
- Coverage: ~2.8Gb high-confidence regions
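To sanity-check counts like these against a downloaded VCF, a rough tally needs only `awk`. The sketch below is an approximation: it calls a record an SNV when REF and every ALT allele are exactly one base long and counts everything else as an indel, so multi-base substitutions and symbolic alleles are lumped in with indels. `count_variant_types` is a hypothetical helper name, and the four records in the demo are toy data, not truth-set variants.

```shell
#!/bin/sh
# Sketch: tally SNVs vs. indels from an uncompressed VCF on stdin.
# Assumption: single-base REF and ALT means SNV; anything else counts as indel.
count_variant_types() {
    awk -F '\t' '
        /^#/ { next }                        # skip header lines
        {
            n = split($5, alts, ",")         # ALT may list several alleles
            is_snv = (length($4) == 1)       # REF must be a single base
            for (i = 1; i <= n; i++)
                if (length(alts[i]) != 1) is_snv = 0
            if (is_snv) snv++; else indel++
        }
        END { printf "SNVs=%d indels=%d total=%d\n", snv, indel, snv + indel }
    '
}

# Toy input: two SNVs, one deletion, one insertion.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\nchr1\t200\t.\tAT\tA\nchr1\t300\t.\tC\tCGG\nchr1\t400\t.\tT\tC\n' \
    | count_variant_types
# prints: SNVs=2 indels=2 total=4
```

For a compressed file such as the chr1 quick start truth VCF, decompress on the fly: `gzip -dc SEQC2_truth.chr1.quick_start.vcf.gz | count_variant_types`.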
Using Truth Sets for Validation
Benchmarking Pipeline:
# Run Omics807 on HCC1395 data
# Download results VCF
# Compare to truth set using hap.py
docker run -v /data:/data pkrusche/hap.py:latest \
/opt/hap.py/bin/som.py \
/data/SEQC2_truth.vcf.gz \
/data/cancerscope_results.vcf.gz \
-r /data/reference.fasta \
-o /data/benchmark_results \
--feature-table generic
Metrics Generated:
- Recall (sensitivity): % of true variants found
- Precision (PPV): % of calls that are true
- F1 score: Harmonic mean of recall/precision
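The three metrics follow directly from the true-positive (TP), false-positive (FP), and false-negative (FN) counts. As a worked illustration, the sketch below recomputes them from counts passed on the command line; `som_metrics` is a hypothetical helper, and the counts in the demo are made up for the arithmetic, not benchmark results.

```shell
#!/bin/sh
# Sketch: recall, precision and F1 from TP / FP / FN counts.
#   recall    = TP / (TP + FN)
#   precision = TP / (TP + FP)
#   F1        = 2 * recall * precision / (recall + precision)
som_metrics() {
    awk -v tp="$1" -v fp="$2" -v fneg="$3" 'BEGIN {
        recall    = ((tp + fneg) > 0) ? tp / (tp + fneg) : 0
        precision = ((tp + fp) > 0)   ? tp / (tp + fp)   : 0
        f1 = ((recall + precision) > 0) ? 2 * recall * precision / (recall + precision) : 0
        printf "recall=%.4f precision=%.4f f1=%.4f\n", recall, precision, f1
    }'
}

# Illustrative counts only: 90 TP, 10 FP, 30 FN.
som_metrics 90 10 30
# prints: recall=0.7500 precision=0.9000 f1=0.8182
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values (0.75 here) than an arithmetic mean would.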
Multi-Platform Datasets
PacBio HiFi (Long-Read)
HCC1395 PacBio:
- Platform: PacBio Sequel II
- Read length: 10-25kb HiFi reads
- Coverage: 45x normal, 60x tumor
Use Case: Compare short vs long-read variant calling
Expected Differences:
- Better resolution of structural variants
- Improved phasing
- Lower indel precision than Illumina
- Better performance in repetitive regions
Oxford Nanopore (Ultra-Long)
HCC1395 ONT R10.4:
- Platform: PromethION
- Read length: >100kb possible
- Coverage: 33x normal, 50x tumor
Use Case: Resolve complex genomic regions
Expected Differences:
- Ultra-long reads (some >1Mb)
- Lower base-level accuracy
- Excellent for structural variants
- Real-time sequencing capability
Other Public Datasets
GIAB (Genome in a Bottle)
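HG002/HG003/HG004 Trio:
- Sample: Ashkenazi Jewish trio (son HG002, father HG003, mother HG004); HG001 (NA12878) is a separate, widely used GIAB sample
- Type: Germline (not cancer)
- Use: Test germline calling, quality control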
Access: https://www.nist.gov/programs-projects/genome-bottle
TCGA (The Cancer Genome Atlas)
Description: Thousands of tumor/normal pairs across 33 cancer types
Access:
- Portal: https://portal.gdc.cancer.gov/
- Cloud: gs://gdc-tcga-phs000178-open/
- Requires: dbGaP authorization for raw data
Popular Samples:
- TCGA-BRCA: Breast cancer (>1,000 samples)
- TCGA-LUAD: Lung adenocarcinoma
- TCGA-COAD: Colon adenocarcinoma
ICGC (International Cancer Genome Consortium)
Description: International cancer genomics resource
Access: https://dcc.icgc.org/
Features:
- Simple somatic mutations (SSM)
- Copy number (CN)
- Structural variants (SV)
- Gene expression
Dataset Recommendations by Use Case
Learning Omics807
Start Here:
1. Quick start chr1 subset (10 min)
2. Full chr1 WGS (30 min)
3. Full chr1 WES (10 min)
Benchmarking Performance
Recommended:
- HCC1395 chr1 + SEQC2 truth set
- Known mutation profile
- Validated results
Testing Different Models
Comparison Set:
| Model | Dataset | Runtime |
|-------|---------|---------|
| WGS | HCC1395 chr1 WGS | 30 min |
| WES | HCC1395 chr1 WES | 10 min |
| PACBIO | HCC1395 chr1 PacBio | 45 min |
| ONT | HCC1395 chr1 ONT | 40 min |
Production Validation
Before Clinical Use:
1. Run positive control (HCC1395)
2. Compare to SEQC2 truth set
3. Achieve >95% recall/precision
4. Validate key variants with Sanger
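Step 3 can be turned into an automatic gate once recall and precision are known. The sketch below assumes both values are given as percentages and reads ">95%" as a strict inequality, so exactly 95 does not pass; `passes_validation` is a hypothetical helper name and the optional third argument overrides the threshold.

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) only if recall AND precision both exceed
# the threshold. Arguments are percentages; the threshold defaults to 95.
passes_validation() {
    awk -v r="$1" -v p="$2" -v t="${3:-95}" 'BEGIN { exit !(r > t && p > t) }'
}

# Figures quoted for the chr1 WGS model earlier on this page.
if passes_validation 96 99; then echo "SNVs: pass"; else echo "SNVs: fail"; fi
if passes_validation 95 85; then echo "indels: pass"; else echo "indels: fail"; fi
# prints:
#   SNVs: pass
#   indels: fail
```

With those quoted figures the SNV calls clear the gate but the indel calls do not, which suggests checking SNVs and indels separately rather than relying on a single pooled number.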
Download and Storage Tips
File Sizes
| Dataset | Tumor | Normal | Total |
|---|---|---|---|
| Quick start (100kb) | 100MB | 50MB | 150MB |
| Chr1 WGS | 4GB | 3GB | 7GB |
| Chr1 WES | 800MB | 600MB | 1.4GB |
| Whole genome WGS | 100GB | 80GB | 180GB |
Download Methods
Direct Download (small files):
wget https://storage.googleapis.com/.../file.bam
Streaming URL (Omics807):
- Paste URL directly into Omics807
- Server downloads automatically
- No local storage needed
Google Cloud (large files):
gsutil -m cp gs://deepvariant/path/to/data/* ./
Storage Requirements
For Testing:
- Quick start: 500MB free space
- Chr1 dataset: 10GB free space
- Full WGS: 500GB recommended
For Production:
- Per sample: 200-500GB
- Multiple samples: 1TB+ recommended
- Consider cloud storage (S3, GCS)
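Before starting a large download it can help to confirm that the target directory has the space listed above. The sketch below is one way to do that with POSIX `df`; `has_free_space` is a hypothetical helper, and it assumes the filesystem name printed by `df` contains no spaces.

```shell
#!/bin/sh
# Sketch: succeed if DIR (default: current directory) has at least NEED_GB free.
has_free_space() {
    need_gb="$1"
    dir="${2:-.}"
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
    # column 4 of the second line is the available space.
    avail_kb="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# 10 GB is the figure given above for the chr1 dataset.
if has_free_space 10 .; then
    echo "enough space for the chr1 dataset"
else
    echo "less than 10 GB free in $(pwd)"
fi
```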
Creating Your Own Test Data
Subsetting BAM Files
Extract specific regions for testing:
# Extract chr1:10M-20M region
# (input.bam must be coordinate-sorted and indexed for region queries)
samtools view -b input.bam chr1:10000000-20000000 > subset.bam
samtools index subset.bam
Downsampling Coverage
Reduce coverage for faster testing:
# Downsample to 30x (from 100x)
samtools view -s 0.3 -b input.bam > downsampled.bam
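The value passed to `-s` is the fraction of reads to keep, that is, target coverage divided by current coverage (30 / 100 = 0.3 above). The integer part of the argument is read by samtools as the random seed, so the fraction must stay below 1. The sketch below computes the fraction for other coverage pairs; `subsample_fraction` is a hypothetical helper name, and the coverages are whatever you estimate for your own BAM.

```shell
#!/bin/sh
# Sketch: fraction of reads to keep so that CURRENT coverage becomes TARGET.
# Usage: subsample_fraction CURRENT_X TARGET_X
subsample_fraction() {
    awk -v cur="$1" -v tgt="$2" 'BEGIN {
        if (cur <= 0 || tgt <= 0 || tgt >= cur) {
            print "error: need 0 < target < current" > "/dev/stderr"
            exit 1
        }
        printf "%.3f\n", tgt / cur
    }'
}

# Reproduce the 100x -> 30x example above and print the resulting command.
frac="$(subsample_fraction 100 30)"
echo "samtools view -s ${frac} -b input.bam > downsampled.bam"
# prints: samtools view -s 0.300 -b input.bam > downsampled.bam
```

The same helper gives 0.500 for the ~60x quick start tumor BAM when 30x is the target.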
Simulated Data
Generate synthetic tumor/normal pairs:
Tools:
- BAMSurgeon: Insert variants into BAM
- neat-genreads: Simulate paired-end reads
- dwgsim: Whole genome simulation
Data Access Policies
Open Access
- HCC1395 datasets: Public, no restrictions
- SEQC2 truth sets: Public
- Reference genomes: Public
Controlled Access
- TCGA raw sequencing: Requires dbGaP approval
- ICGC: May require data access agreements
- Clinical samples: IRB approval needed
Citation Requirements
When using HCC1395/SEQC2:
Fang LT, et al. (2021) Establishing community reference samples,
data and call sets for benchmarking cancer mutation detection using
whole-genome sequencing. Nature Biotechnology 39, 1151-1160.
Next Steps
- Getting Started - Run your first analysis
- Model Guide - Choose the right model
- Understanding Results - Interpret output