Single-Cell RNA-seq Analysis with Cell2Sentence

Omics807 integrates Cell2Sentence (C2S), a framework from Yale's van Dijk Lab that uses large language models (LLMs) to analyze single-cell RNA sequencing (scRNA-seq) data.

What is Cell2Sentence?

Cell2Sentence represents gene expression as text. Each cell becomes a "cell sentence", a space-separated list of gene names ordered from highest to lowest expression:

CD3D CD3E CD8A IL7R PTPRC CCR7 LEF1 TCF7 SELL ...

This allows AI models trained on millions of cells to:

  • Predict cell types (T cells, B cells, cancer cells, etc.)
  • Identify cell states (activated, quiescent, proliferating)
  • Discover biological insights using natural language understanding
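The conversion into a cell sentence can be sketched in a few lines. This is a minimal illustration, not the Cell2Sentence implementation: the function name `cell_to_sentence`, the `top_k` parameter, the choice to drop genes with zero expression, and the alphabetical tie-break are all assumptions made for the example.

```python
def cell_to_sentence(expression, top_k=None):
    """Return gene names ordered from highest to lowest expression.

    `expression` maps gene symbol -> expression value for one cell.
    Genes with zero expression are dropped, and ties are broken
    alphabetically so the output is deterministic (both are assumptions).
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    ranked = sorted(expressed, key=lambda item: (-item[1], item[0]))
    genes = [gene for gene, _ in ranked]
    if top_k is not None:
        genes = genes[:top_k]  # keep only the most highly expressed genes
    return " ".join(genes)


cell = {"CD3D": 8.5, "CD4": 1.2, "CD8A": 6.2, "CD19": 0.1}
print(cell_to_sentence(cell))  # CD3D CD8A CD4 CD19
```

Only the rank order of genes survives in the sentence. The numeric expression values themselves are not part of the text.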

Getting Started

Step 1: Prepare Your Data

Omics807 accepts expression matrices in CSV or TSV format:

Format Requirements:

  • Rows = Genes (gene symbols such as CD3D or TP53)
  • Columns = Cells (cell identifiers such as Cell_1 or Cell_2)
  • Values = Expression levels (for example, normalized or log-transformed counts)

Example CSV:

gene,Cell_1,Cell_2,Cell_3
CD3D,8.5,0.2,9.1
CD4,1.2,0.1,7.5
CD8A,6.2,0.0,6.8
CD19,0.1,9.5,0.2
...
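Before uploading, it can help to confirm that a file parses in this genes-by-cells layout. The sketch below uses only the Python standard library. The function name `load_expression_matrix` and the per-cell dictionary it returns are illustrative choices, not part of Omics807.

```python
import csv
import io


def load_expression_matrix(handle, delimiter=","):
    """Parse a genes-by-cells table into {cell_id: {gene: value}}."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    cell_ids = header[1:]  # the first column holds gene symbols
    cells = {cell_id: {} for cell_id in cell_ids}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, values = row[0], row[1:]
        if len(values) != len(cell_ids):
            raise ValueError(
                f"Row for {gene!r} has {len(values)} values, expected {len(cell_ids)}"
            )
        for cell_id, value in zip(cell_ids, values):
            cells[cell_id][gene] = float(value)
    return cells


EXAMPLE_CSV = """gene,Cell_1,Cell_2,Cell_3
CD3D,8.5,0.2,9.1
CD4,1.2,0.1,7.5
CD8A,6.2,0.0,6.8
CD19,0.1,9.5,0.2
"""

matrix = load_expression_matrix(io.StringIO(EXAMPLE_CSV))
print(matrix["Cell_2"]["CD19"])  # 9.5
```

For a TSV file, pass `delimiter="\t"`.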

Step 2: Upload Your Data

  1. Navigate to Single-Cell Analysis from the homepage
  2. Click Upload Your Data
  3. Drag and drop your CSV/TSV file or click to browse
  4. Click Analyze Cells

Step 3: View Results

Results include:

  • Cell Type Distribution - Interactive pie chart showing predicted cell types (Plotly)
  • Individual Predictions - Table with confidence scores per cell
  • Top Genes by Cell Type - Key marker genes for each population
  • AI-Generated Insights - Biological interpretation and clinical relevance

Note: Advanced visualizations (UMAP, t-SNE, gene expression heatmaps) are planned for future releases. The current implementation focuses on cell type prediction and distribution analysis.

How It Works

1. Cell Sentence Generation

Your expression data is converted to cell sentences:

# For a T cell with high CD3D, CD3E, CD8A expression:
"CD3D CD3E CD8A IL7R PTPRC CCR7 LEF1 TCF7 SELL ..."

2. AI Prediction

Cell2Sentence models (trained on 57M+ cells) return the following for each cell:

  • Cell type (e.g., "CD8+ T Cell")
  • Confidence score (0-100%)
  • Method (C2S model or marker-based fallback)

3. Biological Interpretation

GPT-5 analyzes results to provide:

  • Cell identity and function
  • Key marker genes
  • Role in the tumor microenvironment
  • Clinical relevance

Cell Types Detected

Omics807 can identify:

Immune Cells:

  • T cells (CD4+, CD8+, Regulatory)
  • B cells
  • NK cells
  • Monocytes
  • Macrophages
  • Dendritic cells
  • Neutrophils

Stromal Cells:

  • Fibroblasts
  • Endothelial cells
  • Pericytes

Cancer Cells:

  • Epithelial cancer cells
  • Proliferating tumor cells

Understanding Results

Cell Type Distribution

Shows the proportion of each cell type in your dataset:

T Cell: 45%
B Cell: 25%
NK Cell: 15%
Monocyte: 10%
Cancer Cell: 5%
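The percentages are the share of cells assigned to each type. A minimal sketch of that calculation follows, assuming the per-cell predictions are available as a plain list of labels. The function name and the 20-cell toy input are illustrative.

```python
from collections import Counter


def cell_type_distribution(predictions):
    """Return {cell_type: percent of cells}, most common type first."""
    counts = Counter(predictions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cell_type: 100.0 * n / total for cell_type, n in counts.most_common()}


# Toy input: 20 cells chosen to reproduce the proportions shown above.
predictions = (
    ["T Cell"] * 9 + ["B Cell"] * 5 + ["NK Cell"] * 3
    + ["Monocyte"] * 2 + ["Cancer Cell"] * 1
)

for cell_type, percent in cell_type_distribution(predictions).items():
    print(f"{cell_type}: {percent:.0f}%")
```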

Confidence Scores

  • > 80% - High confidence (strong marker expression)
  • 60-80% - Medium confidence (partial markers)
  • < 60% - Low confidence (ambiguous markers)
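When post-processing a results table, the thresholds above can be applied with a small helper. This is a sketch: the function name and tier labels are illustrative, and scores of exactly 60% or 80% are assigned to the medium tier, following the ranges as written.

```python
def confidence_tier(score):
    """Map a confidence percentage (0-100) to the tiers listed above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score > 80:
        return "high"
    if score >= 60:
        return "medium"
    return "low"


print(confidence_tier(92))  # high
print(confidence_tier(80))  # medium
print(confidence_tier(41))  # low
```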

Top Genes by Cell Type

Key markers for each population:

T Cells: CD3D, CD3E, CD4, CD8A
B Cells: CD19, CD79A, MS4A1
NK Cells: NCAM1, NKG7, GNLY
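These marker lists can also be used for a quick manual sanity check of a prediction. The sketch below scores a cell by the mean expression of each marker set and picks the highest. It is an illustrative assumption about how a marker-based check could work, not a description of the Omics807 fallback, and the function names are hypothetical.

```python
MARKERS = {
    "T Cell": ["CD3D", "CD3E", "CD4", "CD8A"],
    "B Cell": ["CD19", "CD79A", "MS4A1"],
    "NK Cell": ["NCAM1", "NKG7", "GNLY"],
}


def marker_scores(expression, markers=MARKERS):
    """Mean expression of each type's markers; missing genes count as 0."""
    return {
        cell_type: sum(expression.get(gene, 0.0) for gene in genes) / len(genes)
        for cell_type, genes in markers.items()
    }


def best_marker_match(expression, markers=MARKERS):
    """Return the best-scoring cell type, or None if no marker is expressed."""
    scores = marker_scores(expression, markers)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Cell_2 from the example CSV: high CD19, little else.
cell_2 = {"CD3D": 0.2, "CD4": 0.1, "CD8A": 0.0, "CD19": 9.5}
print(best_marker_match(cell_2))  # B Cell
```

Averaging over each marker set keeps a long marker list from outscoring a short one simply because it has more genes.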

Example Use Cases

1. Tumor Microenvironment Profiling

Analyze immune infiltration in tumor samples:

  • Identify immune cell populations
  • Assess T cell vs B cell ratios
  • Detect exhausted/activated states

2. Drug Response Prediction

Understand cellular composition for treatment planning:

  • High T cell infiltration → immunotherapy candidates
  • Low immune cells → chemotherapy consideration
  • Specific markers → targeted therapy options

3. Cell Type Discovery

Find rare or novel cell populations:

  • Identify transitional states
  • Discover cell type-specific markers
  • Validate clustering results

Technical Details

Models Used

  • C2S-Scale Pythia-160M - Fast cell type prediction
  • GPT-5 - Biological interpretation and insights
  • Marker-based fallback - Ensures reliability when the API is unavailable

Performance

  • Speed: ~100 cells in 30-60 seconds
  • Accuracy: 85-95% for well-characterized cell types
  • Scalability: Tested on datasets up to 100,000 cells

Troubleshooting

File Upload Issues

Error: "Invalid file format"

  • Ensure the file is in CSV or TSV format
  • Check that genes are in rows and cells are in columns
  • Verify that there are no empty fields or special characters
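These checks can be run locally before uploading. The following sketch reports ragged rows, empty fields, and non-numeric values. The function name `find_format_problems` and its messages are illustrative, and the checks Omics807 performs on upload may differ.

```python
import csv
import io


def find_format_problems(text, delimiter=","):
    """Return a list of readable problems found in a genes-by-cells table."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]
    if len(header) < 2:
        problems.append("header has fewer than two columns; check the delimiter")
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        for column, value in zip(header[1:], row[1:]):
            if value.strip() == "":
                problems.append(f"line {line_no}: empty value for {column}")
                continue
            try:
                float(value)
            except ValueError:
                problems.append(
                    f"line {line_no}: non-numeric value {value!r} for {column}"
                )
    return problems


bad = "gene,Cell_1,Cell_2\nCD3D,8.5,\nCD4,n/a,0.1\n"
for problem in find_format_problems(bad):
    print(problem)
```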

Low Confidence Predictions

Possible causes:

  • Low gene coverage (< 500 genes/cell)
  • Unusual cell types not in training data
  • Poor quality/noisy data

Solutions:

  • Include more genes in analysis
  • Check data preprocessing
  • Review marker gene expression
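Gene coverage can be checked before upload by counting the genes detected in each cell. In this sketch a gene counts as detected when its value is greater than zero, which is an assumption. The function name is illustrative, and the default threshold of 500 comes from the list of causes above.

```python
def low_coverage_cells(cells, min_genes=500):
    """Return {cell_id: detected gene count} for cells below `min_genes`.

    `cells` maps cell_id -> {gene: expression value}.
    """
    flagged = {}
    for cell_id, expression in cells.items():
        detected = sum(1 for value in expression.values() if value > 0)
        if detected < min_genes:
            flagged[cell_id] = detected
    return flagged


# Toy data with a lowered threshold so the example stays small.
toy_cells = {
    "Cell_1": {"CD3D": 8.5, "CD4": 1.2, "CD19": 0.0},
    "Cell_2": {"CD3D": 0.2, "CD4": 0.1, "CD19": 9.5},
}
print(low_coverage_cells(toy_cells, min_genes=3))  # {'Cell_1': 2}
```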

Missing Cell Types

If expected cell types aren't detected:

  • Check marker gene expression
  • Ensure genes are named correctly (HGNC symbols)
  • Consider manual marker-based validation

Advanced Features

Custom Gene Sets

Focus analysis on specific pathways:

  • Immune checkpoint genes
  • Cell cycle markers
  • Metabolic pathways
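This guide does not describe how gene sets are selected inside Omics807. One way to get a similar effect, sketched below as an assumption, is to filter the expression data to the genes of interest before uploading. The helper name and the three-gene checkpoint list are illustrative.

```python
def restrict_to_gene_set(cells, gene_set):
    """Keep only the genes in `gene_set` for every cell.

    `cells` maps cell_id -> {gene: expression value}.
    """
    wanted = set(gene_set)
    return {
        cell_id: {gene: value for gene, value in expression.items() if gene in wanted}
        for cell_id, expression in cells.items()
    }


# A short, illustrative list of immune checkpoint genes.
CHECKPOINT_GENES = ["PDCD1", "CTLA4", "LAG3"]

sample = {
    "Cell_1": {"CD3D": 8.5, "PDCD1": 4.1, "LAG3": 2.0},
    "Cell_2": {"CD19": 9.5, "CTLA4": 0.3},
}
print(restrict_to_gene_set(sample, CHECKPOINT_GENES))
```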

Batch Processing

Analyze multiple samples:

  1. Upload samples sequentially
  2. Compare cell type distributions
  3. Track changes across conditions
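Once two samples have been analyzed, their distributions can be compared outside the interface. The sketch below computes the percentage-point change per cell type. The function name is hypothetical and the two input distributions are made-up numbers for illustration only.

```python
def distribution_shift(before, after):
    """Percentage-point change per cell type between two samples.

    Cell types missing from one sample are treated as 0%.
    """
    cell_types = sorted(set(before) | set(after))
    return {ct: after.get(ct, 0.0) - before.get(ct, 0.0) for ct in cell_types}


# Made-up distributions (percent of cells) for two conditions.
baseline = {"T Cell": 45.0, "B Cell": 25.0, "NK Cell": 15.0}
treated = {"T Cell": 55.0, "B Cell": 20.0, "Monocyte": 10.0}

for cell_type, delta in distribution_shift(baseline, treated).items():
    print(f"{cell_type}: {delta:+.1f} points")
```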

Sample Data

Try Omics807 with example datasets:

PBMC Sample (10 cells):

  • 3 CD8+ T cells
  • 3 B cells
  • 2 CD4+ T cells
  • 2 Cytotoxic T cells

Download: sample_scrna_data.csv

API Integration

Programmatic access is coming soon:

# Future API example
response = cancerscope.analyze_scrna(
    file_path="data.csv",
    model="c2s-scale-gemma-27b"
)

Support

Having issues? Contact us or check:

  • Troubleshooting Guide
  • FAQ
  • GitHub Issues