Downloadable Research Package

Run Our Digital Twins on Your Data, Your Infrastructure

The DNAI Research Toolkit brings our complete cancer digital twin pipeline to your institution. No patient data leaves your network. Full multi-omics analysis, locally.

328 latent dimensions
33 cancer types
9,415 training patients
18,435 DepMap genes
8 analysis modules

How It Works

Three ways to use the toolkit

Choose the interface that fits your team. All three run entirely on your infrastructure.

AI Agent Interface (Recommended)

Chat naturally with your data. The AI agent calls DNAI tools automatically, interprets results, generates visualizations, and answers follow-up questions.

“Analyze my cohort and find subgroups”

Command Line

Batch processing for bioinformatics teams. Analyze patients, run cohort discovery, validate predictions, and export structured results.

dnai cohort -f data.csv

Python API

Integrate DNAI into your existing pipelines. Full programmatic access to every analysis component with structured return types.

toolkit.analyze(...)

Data Privacy

Your data never leaves your institution

The entire pipeline — models, inference, analysis — runs on your hardware. No cloud calls, no data uploads, no API keys for inference.

Fully Local Execution

All model checkpoints ship with the toolkit. Inference runs on CPU, CUDA, or Apple MPS. No internet required.
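As a rough illustration of the CPU / CUDA / MPS fallback described above, the sketch below picks a device in that order of preference. It assumes a PyTorch backend, which this page does not state, and `pick_device` is a hypothetical helper rather than part of the toolkit.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu", preferring an accelerator when present.

    Assumption: the models run on PyTorch. If PyTorch is not installed,
    this sketch simply falls back to CPU.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```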

Your Infrastructure

Docker, Singularity/Apptainer for HPC, or native Python install. Works on Linux, macOS, and Windows.

Share Metrics, Not Data

Validate DNAI predictions against your outcomes. Share only C-indices and statistical summaries back to us.

Capabilities

What the toolkit can do

Eight analysis modules powered by the same models behind the DNAI platform.

Digital Twin Creation

Converts each patient's multi-omics profile into a compact, structured 328-dimensional representation capturing proliferation, 50 biological pathways, immune context, and epigenetics in a single vector.

Survival Prediction

Predicts relative survival risk calibrated against 9,415 TCGA patients across 33 cancer types. Includes a site-robust checkpoint specifically designed for external institutional data.

Driver Identification

Ranks driver genes per patient by combining somatic mutation evidence with pathway context and protein-level structure. Covers 2,932 genes across the cancer genome.

Synthetic Lethality Scoring

Identifies druggable vulnerabilities from loss-of-function mutations. Combines 28 curated gene-drug pairs with a trained ML classifier (ρ=0.776 on held-out cell lines) that predicts novel context-dependent vulnerabilities using pathway state, DepMap essentiality (18,435 genes), and drug embeddings.

Immunogenic Variant Prioritization

Ranks somatic variants by immunotherapy potential, scoring tumor microenvironment permissiveness, clonal prevalence, expression level, and variant type to flag candidates for checkpoint inhibitors or vaccine pipelines.

Mechanistic Evidence Tracing

Traces causal chains from mutations through signaling pathways to druggable targets using the SIGNOR knowledge graph. Fully deterministic and auditable — every recommendation traces back to specific genes.

Clonal Deconvolution

Reconstructs the tumor’s clonal architecture from variant allele frequencies. Maps subpopulations to a fixed 4-slot ODE model with a Resistance Sentinel that preserves minority drug-resistant clones. Validated on 228K GENIE patients with per-clone drug sensitivity annotations.
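To make the link between allele frequencies and clones concrete, here is a minimal sketch of the standard conversion from a variant allele frequency (VAF) to a cancer cell fraction (CCF). It is not the toolkit's deconvolution method. The function name, the default copy-number values, and the example purity are assumptions chosen for illustration.

```python
def cancer_cell_fraction(vaf, purity, tumor_cn=2, multiplicity=1, normal_cn=2):
    """Estimate the fraction of cancer cells carrying a mutation.

    Inverts the expected-VAF relation
        VAF = purity * multiplicity * CCF
              / (purity * tumor_cn + (1 - purity) * normal_cn)
    and caps the result at 1.0, since sampling noise can push it higher.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    total_copies = purity * tumor_cn + (1 - purity) * normal_cn
    return min(1.0, vaf * total_copies / (purity * multiplicity))


# Hypothetical variants from one tumor with an assumed purity of 0.8.
vafs = {"clonal_variant": 0.40, "subclonal_variant": 0.10}
for name, vaf in vafs.items():
    print(name, round(cancer_cell_fraction(vaf, purity=0.8), 2))
```

Variants whose CCF is close to 1 are likely present in every tumor cell. Groups of variants at lower, similar CCFs suggest subclones.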

Targetability Assessment

Scores every driver mutation across 6 dimensions: genetic evidence, essentiality (DepMap), druggability, clinical precedent, synthetic lethality, and pathway centrality (SIGNOR). Ranks actionable targets with evidence tiers and resistance route forecasting. 71.8% of patients have at least one actionable target.

Drug Combination Discovery

Predicts synergistic drug combinations from monotherapy data alone — zero-shot, without training on any combination data. Finds drug pairs that orthogonally target different clonal subpopulations: drug A suppresses dominant clones while drug B targets the Resistance Sentinel. Validated at ρ=0.800 on 1,209 drug pairs.

Schedule Optimization

Optimizes drug dosing schedules by simulating tumor dynamics under treatment pressure with pharmacokinetic constraints (half-life, toxicity budgets, maximum cumulative dose). Finds when to switch drugs based on clonal evolution. Achieves 42% dose reduction vs standard concurrent dosing while preventing resistance.

DNA-Only Panel Analysis

Creates digital twins from targeted sequencing panels (MSK-IMPACT, FoundationOne, etc.) without requiring RNA-seq. A trained panel adapter maps mutation and CNA data from 167 known panels to the full 328-dimensional latent space, enabling analysis for the millions of patients with panel data only.

Cohort Discovery Engine

Automatically scans your entire cohort for hidden patterns: molecular subgroups, synergistic pathway interactions, resistance signatures, outlier patients, and mutation-pathway links — with built-in replication testing.

Data Formats

Bring your data as-is

Auto-detection handles format conversion, gene symbol harmonization, and normalization. No manual preprocessing required.

Modality | Formats | Notes
RNA-seq | CSV, TSV, GCT, H5AD, NPZ | Auto-normalizes counts/TPM/FPKM
Mutations | MAF, VCF, CSV | Gene + variant type + VAF
Copy Number | GISTIC, segment, gene-level | Any numeric matrix
Methylation | 450K / EPIC beta-values | Clipped to [0, 1]
Histology (WSI) | Pre-extracted NPZ / PT | UNI2-h 1536d embeddings
Clinical | CSV / TSV | Flexible column mapping

RNA-seq expression is the primary modality. All other modalities are optional and enhance the analysis when available. Missing modalities are handled gracefully by a Product-of-Experts architecture.
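For readers unfamiliar with the term, the sketch below shows the textbook Gaussian Product-of-Experts rule for a single latent dimension. It illustrates why a missing modality can simply be left out of the fusion. The toolkit's actual architecture is not described on this page, so the function, the modality names, and the standard-normal prior expert are all assumptions.

```python
def product_of_gaussian_experts(experts):
    """Fuse per-modality Gaussian estimates of one latent dimension.

    `experts` maps modality name -> (mean, variance), or None when that
    modality was not measured. A product of Gaussians is itself Gaussian:
    precisions (1 / variance) add, and the mean is precision-weighted.
    """
    # A standard-normal prior expert keeps the result defined even when
    # every modality is missing (an assumption of this sketch).
    precision = 1.0
    weighted_mean = 0.0
    for estimate in experts.values():
        if estimate is None:
            continue  # missing modality: leave it out of the product
        mean, variance = estimate
        precision += 1.0 / variance
        weighted_mean += mean / variance
    return weighted_mean / precision, 1.0 / precision


full = product_of_gaussian_experts({"rna": (1.2, 0.5), "methylation": (0.8, 1.0)})
rna_only = product_of_gaussian_experts({"rna": (1.2, 0.5), "methylation": None})
print(full, rna_only)
```

In this toy case, dropping a modality leaves the estimate usable but widens its variance, which is the behavior one would expect from fewer measurements.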

Discovery

Questions only a digital twin can answer

These are findings that standard bioinformatics pipelines cannot produce. They require the multi-omics, pathway-structured latent representation.

1. Are there molecularly distinct subgroups hiding in my cohort?

The toolkit clusters patients in 328-dimensional latent space that integrates RNA, mutations, CNV, and methylation into pathway-level biology. Unlike PCA on raw expression, this captures cross-modal interactions invisible to single-modality analysis.

2. Which patients have synthetic lethality vulnerabilities I haven't considered?

For every loss-of-function mutation, the toolkit combines 28 curated SL pairs with a trained ML classifier that predicts novel context-dependent vulnerabilities from DepMap essentiality (18,435 genes) and pathway state. Most clinicians check BRCA/PARP but miss TP53/WEE1, ARID1A/EZH2, ATM/ATR, and context-dependent vulnerabilities the classifier discovers.

3. What pathway co-activations predict outcomes beyond individual pathways?

Tests all 1,225 pairwise interactions across 50 Hallmark pathways for synergistic effects on survival. A tumor with moderate MYC AND moderate EMT may be far more aggressive than high MYC alone — the interaction term carries the signal.
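The count of 1,225 follows from choosing 2 pathways out of 50. The sketch below checks that arithmetic and shows what an interaction term looks like as a model feature. How the toolkit actually models interactions is not specified here. A product term is one common choice, and `with_interaction` is a hypothetical helper.

```python
from itertools import combinations
from math import comb

n_pathways = 50
pairs = list(combinations(range(n_pathways), 2))
assert len(pairs) == comb(n_pathways, 2) == 1225


def with_interaction(a, b):
    """Features for a two-pathway model: main effects plus their product."""
    return [a, b, a * b]


# Moderate activation of both pathways vs. high activation of one alone.
print(len(pairs), with_interaction(0.5, 0.5), with_interaction(1.0, 0.0))
```

The product term is non-zero only when both pathways are active, so it can carry signal that neither main effect captures on its own.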

4. Which mutations drive which pathways in my specific cohort?

For each recurrent mutation, tests whether mutated patients have significantly different pathway activations vs wild-type — discovered from your data, not from databases. KRAS may activate different downstream pathways in LUAD vs PAAD.
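One way to run such a mutated-vs-wild-type comparison is a permutation test on the difference in mean pathway activation, sketched below. The page does not say which statistical test the toolkit uses, so this is an assumed stand-in, and the activation scores are made up.

```python
import random
from statistics import mean


def permutation_p_value(mutated, wild_type, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean activation."""
    rng = random.Random(seed)
    observed = abs(mean(mutated) - mean(wild_type))
    pooled = list(mutated) + list(wild_type)
    n_mut = len(mutated)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign mutation labels at random
        diff = abs(mean(pooled[:n_mut]) - mean(pooled[n_mut:]))
        if diff >= observed:
            extreme += 1
    # The +1 terms keep the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical activation scores for one pathway.
mutated = [0.9, 1.1, 1.3, 1.0, 1.2]
wild_type = [0.2, 0.4, 0.1, 0.3, 0.5]
print(permutation_p_value(mutated, wild_type))
```

A scan over many mutation-pathway pairs would also need multiple-testing correction before any single p-value is interpreted.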

5. Do my treatment-resistant patients share a molecular signature?

Compares 328d digital twins of responders vs non-responders to identify distinguishing dimensions, then traces them back to specific pathways, proliferation dynamics, and immune context.

6. Which patients are molecular outliers that don't fit any known subtype?

Computes multi-dimensional distance from cohort centroid in latent space. These are your most interesting patients: potential novel subtypes, misdiagnosed samples, rare driver combinations, or basket trial candidates.
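A minimal sketch of centroid-distance outlier ranking follows. It assumes plain Euclidean distance, which the page does not confirm (a covariance-aware metric such as Mahalanobis distance would be another option). `rank_outliers` and the toy vectors are hypothetical.

```python
from math import dist
from statistics import mean


def rank_outliers(latents):
    """Rank patients by Euclidean distance from the cohort centroid.

    `latents` maps patient ID -> latent vector (all the same length).
    Returns (patient_id, distance) pairs, farthest first.
    """
    vectors = list(latents.values())
    centroid = [mean(column) for column in zip(*vectors)]
    distances = {pid: dist(vec, centroid) for pid, vec in latents.items()}
    return sorted(distances.items(), key=lambda item: item[1], reverse=True)


# Toy 3-dimensional vectors standing in for 328-dimensional twins.
cohort = {
    "P1": [0.1, 0.2, 0.0],
    "P2": [0.0, 0.1, 0.1],
    "P3": [0.2, 0.0, 0.1],
    "P4": [3.0, 2.5, 2.8],
}
print(rank_outliers(cohort)[0][0])
```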

7. What drug combinations might work for my patient that nobody has tried?

The combination discovery engine predicts synergistic pairs from monotherapy data alone by simulating which drugs target different clonal subpopulations. It finds combinations where drug A suppresses dominant clones and drug B targets resistant minorities. Validated at ρ=0.800 on 1,209 drug pairs, including a leave-target-family-out evaluation as evidence of genuine discovery.

8. Is there a smarter dosing schedule than giving both drugs at the same time?

The schedule optimizer simulates tumor evolution under different dosing strategies and finds sequences that achieve the same tumor control with less total drug. In testing, optimized schedules reduce cumulative drug exposure by 42% vs concurrent dosing while still preventing resistance escape.

Plus any question you can think of

With the AI agent interface, researchers can ask any question about their data and the agent figures out how to answer it — “compare pathway profiles of my youngest vs oldest quartile”, “show me patients whose clonal architecture predicts resistance”, or “which patients are most similar to TCGA luminal B?” No pre-built dashboard needed.

Output Package

Comprehensive results package

Per-Patient Reports

HTML reports with digital twin summary, risk score, driver genes, SL opportunities, immunogenic variants, and clonal architecture.

Cohort Analysis Report

Risk distribution, top drivers across the cohort, quality summary, and stratification analysis with statistical tests.

Discovery Report

Prioritized non-obvious findings: latent subgroups, pathway interactions, resistance patterns, outliers, and mechanistic hypotheses.

Machine-Readable Exports

CSV files (risk scores, drivers, full 328d latent space) and JSON results for integration with your existing pipelines.

Validation Report

C-index with bootstrap confidence intervals against your known outcomes, per-cancer-type breakdown, and diagnostic notes.
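For reference, a C-index with a percentile bootstrap interval can be computed as in the sketch below. This is a generic implementation of Harrell's concordance index, not the toolkit's validation code. The function names, the number of resamples, and the example data are assumptions.

```python
import random


def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    A pair is comparable when the patient with the shorter follow-up had an
    observed event. Ties in risk count as half-concordant, and pairs with
    tied times are skipped in this sketch.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


def bootstrap_ci(times, events, risks, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the C-index."""
    rng = random.Random(seed)
    n = len(times)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        c = concordance_index([times[k] for k in idx],
                              [events[k] for k in idx],
                              [risks[k] for k in idx])
        if c == c:  # skip resamples with no comparable pairs (NaN)
            stats.append(c)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi


# Hypothetical follow-up times (months), event flags, and risk scores.
times = [5, 8, 12, 20, 24, 30, 36, 40]
events = [1, 1, 0, 1, 1, 0, 1, 0]
risks = [0.9, 0.7, 0.8, 0.5, 0.6, 0.2, 0.3, 0.1]
print(concordance_index(times, events, risks), bootstrap_ci(times, events, risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The metric only looks at ordering, so it says nothing about absolute survival times.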

Interpretation Guide

Built-in CLAUDE.md that teaches the AI agent about DNAI, so it can explain every result in the context of your specific data.

Eligibility

Who can request the toolkit

The DNAI Research Toolkit is available for non-commercial research use under a collaboration agreement.

Eligible

  • Academic research institutions
  • University hospitals and medical centers
  • Cancer research consortia (EURACAN, GENIE, SARC, etc.)
  • Government research labs (NCI, DKFZ, etc.)
  • Non-profit research organizations
  • Biopharma R&D teams (under pilot agreement)

Requirements

  • Signed research collaboration agreement
  • IRB/ethics approval for your dataset
  • Agreement to share validation metrics (not patient data) for joint publication
  • RNA-seq expression data for at least 50 patients (recommended)
  • Named PI and institutional affiliation
  • Commitment to cite DNAI in publications using the toolkit

Supported cancer types

33 TCGA cancer types in the training set. Cancer types not listed below can still be analyzed using the UNKNOWN_EXTERNAL generalization token, with reduced but non-zero signal.

ACC, BLCA, BRCA, CESC, CHOL, COAD, DLBC, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LAML, LGG, LIHC, LUAD, LUSC, MESO, OV, PAAD, PCPG, PRAD, READ, SARC, SKCM, STAD, TGCT, THCA, THYM, UCEC, UCS, UVM

Limitations

What you should know

We believe transparency about limitations is essential. These are the known constraints of the current toolkit.

Survival predictions rank patients; they do not predict exact survival times

C-index 0.704 (internal), 0.633 (external multi-site). Best suited for cohort stratification and identifying high-risk vs low-risk groups, not individual prognosis.

Synthetic lethality combines 28 curated pairs with a trained ML classifier, but predictions beyond curated pairs are investigational

Curated pairs are from clinical trials and preclinical studies. The v2 ML classifier (ρ=0.776) extends predictions beyond known pairs using pathway context and DepMap, but novel predictions require experimental validation.
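This page does not define ρ. If it denotes a Spearman rank correlation (an assumption), it measures how well predicted scores preserve the ordering of measured values, not how close the numbers are. A self-contained sketch with hypothetical helper names:

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs.

    Undefined (division by zero) when either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# A monotonic but non-linear relationship still gives a perfect rank correlation.
print(spearman_rho([1, 2, 3, 4], [1, 4, 9, 16]))
```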

Trained primarily on adult cancers (TCGA) — accuracy varies by cancer type

33 adult cancer types are well-represented. Pediatric, rare, and cancer types not in TCGA will have reduced but non-zero predictive signal.

Histology features do not transfer across institutions

Whole-slide image embeddings capture scanner and staining characteristics alongside biology. Histology analysis is available within-institution only; cross-site is suppressed by default.

Treatment recommendations are exploratory, not validated for clinical decisions

The treatment module provides hypothesis-generating rankings for research context. It has not undergone prospective clinical validation and should not guide treatment decisions.

Immunogenic variant candidates require downstream wet-lab confirmation

The toolkit scores variants by clonality, immune context, and expression, but HLA binding prediction and peptide-MHC stability testing are needed to confirm immunogenicity.

Clonal deconvolution from bulk sequencing has inherent resolution limits

Detection is reliable for 2–3 major clones. Resolving many small subclones requires high-depth sequencing with well-separated allele frequency distributions.

Research use only. The DNAI Research Toolkit is not a medical device and is not intended for clinical decision-making. All findings should be independently validated before any clinical application.

Requirements

Technical requirements

Minimum

  • Python 3.10+
  • 8 GB RAM
  • 10 GB disk (models)
  • CPU only
  • Linux, macOS, or Windows

Recommended

  • Python 3.11+
  • 32 GB RAM
  • GPU (CUDA 12+ or Apple MPS)
  • 50+ patient cohort
  • Docker or Singularity

For AI Agent Interface

  • Claude Code installed
  • Anthropic API access
  • MCP server support
  • Terminal / CLI access
  • No GPU required for agent

Request Access

Request the Research Toolkit

Fill in the form below. After review and approval, you will receive instructions on how to download and install the package.

Questions? Contact us at info@dnai.bio