'Single-Cell RNA-seq Analysis'
'Analyze single-cell gene expression data with Cell2Sentence AI models'
Single-Cell RNA-seq Analysis with Cell2Sentence
Omics807 integrates Cell2Sentence (C2S), a breakthrough framework from Yale's van Dijk Lab that uses Large Language Models to analyze single-cell RNA sequencing data.
What is Cell2Sentence?
Cell2Sentence transforms gene expression into natural language by converting cells into "cell sentences" - space-separated gene names ordered by expression level:
CD3D CD3E CD8A IL7R PTPRC CCR7 LEF1 TCF7 SELL ...
This allows AI models trained on millions of cells to:
- Predict cell types (T cells, B cells, cancer cells, etc.)
- Identify cell states (activated, quiescent, proliferating)
- Discover biological insights using natural language understanding
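To make the cell sentence idea concrete, here is a minimal sketch of the conversion for a single cell. The function name `cell_sentence`, the `max_genes` parameter, and the choice to drop genes with zero expression are assumptions made for illustration; the exact preprocessing Omics807 applies may differ.

```python
def cell_sentence(expression, max_genes=None):
    """Order gene symbols by descending expression and join them with spaces."""
    # Assumption: genes with zero expression are left out of the sentence.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Highest expression first; ties are broken alphabetically for determinism.
    expressed.sort(key=lambda item: (-item[1], item[0]))
    genes = [gene for gene, _ in expressed]
    if max_genes is not None:
        genes = genes[:max_genes]
    return " ".join(genes)


# Values match Cell_1 in the example CSV under Step 1.
cell_1 = {"CD3D": 8.5, "CD4": 1.2, "CD8A": 6.2, "CD19": 0.1}
print(cell_sentence(cell_1))  # CD3D CD8A CD4 CD19
```

Ranking by expression means the sentence keeps only the order of genes, not their absolute values.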
Getting Started
Step 1: Prepare Your Data
Omics807 accepts expression matrices in CSV or TSV format:
Format Requirements:
- Rows = Genes (gene symbols like CD3D, TP53, etc.)
- Columns = Cells (cell identifiers like Cell_1, Cell_2, etc.)
- Values = Expression levels (normalized counts, log-transformed, etc.)
Example CSV:
gene,Cell_1,Cell_2,Cell_3
CD3D,8.5,0.2,9.1
CD4,1.2,0.1,7.5
CD8A,6.2,0.0,6.8
CD19,0.1,9.5,0.2
...
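A quick local check before uploading can catch most format problems. The sketch below uses only Python's standard library. The function name `check_expression_matrix` and the specific checks are illustrative assumptions, not the validation Omics807 itself performs.

```python
import csv
import io


def check_expression_matrix(text, delimiter=","):
    """Return a list of problems found in a genes-by-cells matrix."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if len(rows) < 2:
        return ["file needs a header row and at least one gene row"]
    problems = []
    header = rows[0]
    cells = header[1:]
    if not cells:
        problems.append("header lists no cell identifiers")
    if len(set(cells)) != len(cells):
        problems.append("duplicate cell identifiers in header")
    seen_genes = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields, got {len(row)}")
            continue
        gene = row[0].strip()
        if not gene:
            problems.append(f"line {line_no}: missing gene symbol")
        elif gene in seen_genes:
            problems.append(f"line {line_no}: duplicate gene {gene}")
        seen_genes.add(gene)
        # Every expression value must parse as a number.
        for cell, value in zip(cells, row[1:]):
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: non-numeric value {value!r} for {cell}")
    return problems


example = """gene,Cell_1,Cell_2,Cell_3
CD3D,8.5,0.2,9.1
CD4,1.2,0.1,7.5
CD8A,6.2,0.0,6.8
CD19,0.1,9.5,0.2
"""
print(check_expression_matrix(example))  # an empty list means no problems were found
```

For TSV files, pass `delimiter="\t"`.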
Step 2: Upload Your Data
- Navigate to Single-Cell Analysis from the homepage
- Click Upload Your Data
- Drag and drop your CSV/TSV file or click to browse
- Click Analyze Cells
Step 3: View Results
Results include:
- Cell Type Distribution - Interactive pie chart showing predicted cell types (Plotly)
- Individual Predictions - Table with confidence scores per cell
- Top Genes by Cell Type - Key marker genes for each population
- AI-Generated Insights - Biological interpretation and clinical relevance
Note: Advanced visualizations (UMAP, t-SNE, gene expression heatmaps) are planned for future releases. Current implementation focuses on cell type prediction and distribution analysis.
How It Works
1. Cell Sentence Generation
Your expression data is converted to cell sentences:
# For a T cell with high CD3D, CD3E, CD8A expression:
"CD3D CD3E CD8A IL7R PTPRC CCR7 LEF1 TCF7 SELL ..."
2. AI Prediction
Cell2Sentence models (trained on 57M+ cells) predict:
- Cell type (e.g., "CD8+ T Cell")
- Confidence score (0-100%)
- Method (C2S model or marker-based fallback)
3. Biological Interpretation
GPT-5 analyzes results to provide:
- Cell identity and function
- Key marker genes
- Role in tumor microenvironment
- Clinical relevance
Cell Types Detected
Omics807 can identify:
Immune Cells:
- T cells (CD4+, CD8+, Regulatory)
- B cells
- NK cells
- Monocytes
- Macrophages
- Dendritic cells
- Neutrophils
Stromal Cells:
- Fibroblasts
- Endothelial cells
- Pericytes
Cancer Cells:
- Epithelial cancer cells
- Proliferating tumor cells
Understanding Results
Cell Type Distribution
Shows the proportion of each cell type in your dataset:
T Cell: 45%
B Cell: 25%
NK Cell: 15%
Monocyte: 10%
Cancer Cell: 5%
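The percentages are each cell type's share of the per-cell predictions. A minimal sketch of that calculation follows; the function name and the 20-cell prediction list are hypothetical, chosen so the result reproduces the example distribution.

```python
from collections import Counter


def cell_type_distribution(predicted_types):
    """Map each predicted cell type to its percentage of all cells."""
    counts = Counter(predicted_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders cell types from most to least frequent.
    return {cell_type: 100.0 * n / total for cell_type, n in counts.most_common()}


# Hypothetical per-cell predictions for a 20-cell dataset
predictions = (
    ["T Cell"] * 9 + ["B Cell"] * 5 + ["NK Cell"] * 3 + ["Monocyte"] * 2 + ["Cancer Cell"]
)
for cell_type, pct in cell_type_distribution(predictions).items():
    print(f"{cell_type}: {pct:.0f}%")
```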
Confidence Scores
- > 80% - High confidence (strong marker expression)
- 60-80% - Medium confidence (partial markers)
- < 60% - Low confidence (ambiguous markers)
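If you post-process exported predictions, the thresholds can be applied as in this sketch. The function name `confidence_tier` is an assumption, and a score of exactly 80% is treated as medium, following the "> 80%" wording.

```python
def confidence_tier(score_percent):
    """Classify a 0-100 confidence score as high, medium, or low."""
    if not 0 <= score_percent <= 100:
        raise ValueError("score must be between 0 and 100")
    if score_percent > 80:
        return "high"
    if score_percent >= 60:
        return "medium"
    return "low"


print(confidence_tier(92))  # high
print(confidence_tier(80))  # medium
print(confidence_tier(45))  # low
```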
Top Genes by Cell Type
Key markers for each population:
T Cells: CD3D, CD3E, CD4, CD8A
B Cells: CD19, CD79A, MS4A1
NK Cells: NCAM1, NKG7, GNLY
Example Use Cases
1. Tumor Microenvironment Profiling
Analyze immune infiltration in tumor samples:
- Identify immune cell populations
- Assess T cell vs B cell ratios
- Detect exhausted/activated states
2. Drug Response Prediction
Understand cellular composition for treatment planning:
- High T cell infiltration → immunotherapy candidates
- Low immune cells → chemotherapy consideration
- Specific markers → targeted therapy options
3. Cell Type Discovery
Find rare or novel cell populations:
- Identify transitional states
- Discover cell type-specific markers
- Validate clustering results
Technical Details
Models Used
- C2S-Scale Pythia-160M - Fast cell type prediction
- GPT-5 - Biological interpretation and insights
- Marker-based fallback - Ensures reliability when API unavailable
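The marker-based fallback is not specified in detail here. As one plausible illustration only, the sketch below scores each cell type by the mean expression of its marker genes, using the marker lists from Top Genes by Cell Type. The scoring rule, the function name, and the "Unknown" label are assumptions, not Omics807's actual fallback.

```python
MARKERS = {
    "T Cell": ["CD3D", "CD3E", "CD4", "CD8A"],
    "B Cell": ["CD19", "CD79A", "MS4A1"],
    "NK Cell": ["NCAM1", "NKG7", "GNLY"],
}


def marker_based_type(expression, markers=MARKERS):
    """Pick the cell type whose marker genes have the highest mean expression."""
    scores = {}
    for cell_type, genes in markers.items():
        # Genes missing from the cell's profile count as zero expression.
        scores[cell_type] = sum(expression.get(g, 0.0) for g in genes) / len(genes)
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return "Unknown", scores
    return best, scores


# Values match Cell_2 in the example CSV under Step 1.
cell_2 = {"CD3D": 0.2, "CD4": 0.1, "CD8A": 0.0, "CD19": 9.5}
label, scores = marker_based_type(cell_2)
print(label)  # B Cell
```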
Performance
- Speed: ~100 cells in 30-60 seconds
- Accuracy: 85-95% for well-characterized cell types
- Scalability: Tested on datasets up to 100,000 cells
Troubleshooting
File Upload Issues
Error: "Invalid file format"
- Ensure the file is in CSV or TSV format
- Check that genes are in rows and cells are in columns
- Verify there are no empty values or special characters
Low Confidence Predictions
Possible causes:
- Low gene coverage (< 500 genes/cell)
- Unusual cell types not in training data
- Poor quality/noisy data
Solutions:
- Include more genes in analysis
- Check data preprocessing
- Review marker gene expression
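Gene coverage can be checked locally by counting how many genes each cell expresses. This is a rough sketch: the nested-dictionary layout, the function names, and the "expression above zero" definition of a detected gene are assumptions for illustration.

```python
def genes_detected_per_cell(matrix, threshold=0.0):
    """Count, for each cell, how many genes have expression above `threshold`.

    `matrix` maps gene symbol -> {cell identifier -> expression value}.
    """
    counts = {}
    for values in matrix.values():
        for cell, value in values.items():
            counts[cell] = counts.get(cell, 0) + (1 if value > threshold else 0)
    return counts


def low_coverage_cells(matrix, min_genes=500):
    """Return the cells with fewer than `min_genes` detected genes, sorted by name."""
    detected = genes_detected_per_cell(matrix)
    return sorted(cell for cell, n in detected.items() if n < min_genes)


# Toy matrix from the example CSV; a real dataset would have thousands of genes.
matrix = {
    "CD3D": {"Cell_1": 8.5, "Cell_2": 0.2, "Cell_3": 9.1},
    "CD4": {"Cell_1": 1.2, "Cell_2": 0.1, "Cell_3": 7.5},
    "CD8A": {"Cell_1": 6.2, "Cell_2": 0.0, "Cell_3": 6.8},
    "CD19": {"Cell_1": 0.1, "Cell_2": 9.5, "Cell_3": 0.2},
}
print(low_coverage_cells(matrix, min_genes=4))  # ['Cell_2']
```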
Missing Cell Types
If expected cell types aren't detected:
- Check marker gene expression
- Ensure genes are named correctly (HGNC symbols)
- Consider manual marker-based validation
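A simple way to spot naming problems is to look for expected markers that are absent from your gene list. The sketch below is illustrative: the function name `missing_markers` is hypothetical, and it only detects case mismatches, not aliases or outdated symbols.

```python
def missing_markers(gene_symbols, expected_markers):
    """Map each absent marker to a case-insensitive match in the data, or None."""
    present = set(gene_symbols)
    upper_to_actual = {g.upper(): g for g in gene_symbols}
    report = {}
    for marker in expected_markers:
        if marker in present:
            continue
        # A case-insensitive hit suggests a naming problem, e.g. "Cd19" vs "CD19".
        report[marker] = upper_to_actual.get(marker.upper())
    return report


genes = ["CD3D", "Cd19", "MS4A1"]
print(missing_markers(genes, ["CD19", "CD79A", "MS4A1"]))
# {'CD19': 'Cd19', 'CD79A': None}
```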
Advanced Features
Custom Gene Sets
Focus analysis on specific pathways:
- Immune checkpoint genes
- Cell cycle markers
- Metabolic pathways
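How gene sets are selected inside Omics807 is not described here. One workaround, sketched below as an assumption, is to filter the matrix to the genes of interest before uploading. The gene set and the function name `filter_genes` are illustrative only.

```python
import csv
import io

# Illustrative set of well-known immune checkpoint genes; extend as needed.
IMMUNE_CHECKPOINT_GENES = {"PDCD1", "CD274", "CTLA4", "LAG3"}


def filter_genes(csv_text, keep):
    """Return CSV text with the header plus only the rows whose gene is in `keep`."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(next(reader))  # copy the header row unchanged
    for row in reader:
        if row and row[0] in keep:
            writer.writerow(row)
    return out.getvalue()


data = "gene,Cell_1,Cell_2\nPDCD1,2.0,0.0\nCD3D,8.5,0.2\nCTLA4,0.5,1.1\n"
print(filter_genes(data, IMMUNE_CHECKPOINT_GENES))
```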
Batch Processing
Analyze multiple samples:
1. Upload samples sequentially
2. Compare cell type distributions
3. Track changes across conditions
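Once each sample has a cell type distribution, comparing them is a matter of subtracting percentages. The sketch below assumes each distribution is a dictionary of cell type to percent. The function name and both sample distributions are hypothetical.

```python
def compare_distributions(before, after):
    """Return the percentage-point change per cell type between two samples."""
    cell_types = sorted(set(before) | set(after))
    # A cell type missing from one sample counts as 0% there.
    return {ct: after.get(ct, 0.0) - before.get(ct, 0.0) for ct in cell_types}


# Hypothetical distributions (percent of cells) for two conditions
untreated = {"T Cell": 45.0, "B Cell": 25.0, "NK Cell": 15.0, "Monocyte": 10.0, "Cancer Cell": 5.0}
treated = {"T Cell": 55.0, "B Cell": 20.0, "NK Cell": 15.0, "Monocyte": 10.0}
for cell_type, delta in compare_distributions(untreated, treated).items():
    print(f"{cell_type}: {delta:+.1f} points")
```

Reporting changes in percentage points rather than relative percent keeps small populations from producing misleadingly large numbers.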
Sample Data
Try Omics807 with example datasets:
PBMC Sample (10 cells)
- 3 CD8+ T cells
- 3 B cells
- 2 CD4+ T cells
- 2 Cytotoxic T cells
Download: sample_scrna_data.csv
API Integration
Programmatic access coming soon:
# Future API example
response = cancerscope.analyze_scrna(
    file_path="data.csv",
    model="c2s-scale-gemma-27b",
)
Learn More
Support
Having issues? Contact us or check:
- Troubleshooting Guide
- FAQ
- GitHub Issues