How Colo Works
A plain-language explanation of the analysis pipeline, the quality controls built into it, and what the outputs actually mean, so you can use them critically.
1. The big picture
Colo is a literature synthesis engine. You give it a research question or hypothesis. It retrieves relevant peer-reviewed papers from PubMed, runs two opposing expert perspectives through a structured debate that grades the evidence behind your idea, then carries the result forward into formal methodology and grant scaffolding. Every step is anchored to specific cited papers.
The synthesis runs across a five-stage pipeline: Setup → Adversarial debate → Methods design → Scaffold → Export. The transcript, citations, and verdicts from each stage carry into the next, so by the end you have a study you could actually run, traceable line-by-line back to the literature you started from.
The output is not a summary. It's a graded, contested evaluation of the hypothesis followed by structured next steps, produced by a system explicitly designed to challenge itself before reaching a conclusion. Think of it as a rapid, literature-grounded peer review that doesn't stop at "this is plausible." It keeps going until you have a study design.
A general chatbot can produce a confident-sounding paragraph about a research question in five seconds. It can't anchor that paragraph in 3.4M+ peer-reviewed abstracts, surface unresolved disagreements between cited studies, or carry the answer forward into a methodology section that meets RCT or NIH standards. Colo is built around the parts of research a generalist tool treats as out of scope.
2. The two agents and why they disagree
Each analysis is run by two agents with different roles:
Agent A: Clinical Researcher
Agent A evaluates the hypothesis from the perspective of applied, patient-facing evidence. It prioritizes clinical trial data, patient outcomes, treatment responses, and real-world feasibility. Its job is to ask: what does the evidence say happens in actual patients?
Agent B: Translational Researcher
Agent B evaluates from the perspective of mechanism and biology. It digs into signaling pathways, resistance mechanisms, cell biology, and whether the clinical conclusions are actually supported by what we know about how the underlying biology works. Its job is to ask: does the mechanism support what the clinical data claims?
This split is intentional. Clinical trial results sometimes outrun the mechanistic explanation. Mechanistic findings sometimes don't translate to patients. Forcing both perspectives into the same conversation, and requiring them to resolve disagreements before reaching a verdict, produces conclusions that are more complete than either perspective alone.
Both roles are fixed by the system. Agent A is always the clinical researcher, Agent B is always the translational researcher. The domain (e.g., Oncology) and the hypothesis you set in setup get injected into both prompts so the agents reason within scope.
3. Evidence tags and what they mean
Every time an agent cites a paper, it is required to tag it with an evidence type. These tags reflect how much confidence the study design lends to a given conclusion.
| Tag | Study type | What it means for the claim |
|---|---|---|
| [RCT] | Randomized controlled trial | Highest level of clinical evidence. Supports causal claims about treatment effects. |
| [META] | Meta-analysis or systematic review | Synthesizes multiple studies. Strong for establishing consensus across a body of evidence. |
| [COHORT] | Observational cohort study | Observational. Establishes associations, not causation. Useful for real-world patterns. |
| [PRECLINICAL] | Animal model or in vitro study | Mechanistic evidence. Cannot directly support clinical claims without translational data. |
| [EXPERT] | Expert opinion, review, or editorial | Interpretive, not primary evidence. Useful for framing but carries the lowest evidentiary weight. |
A verdict supported only by [PRECLINICAL] evidence is much weaker than one supported by [RCT] or [META]. Look at the tags attached to the EVIDENCE lines in each agent's response. They tell you how strong the foundation actually is.
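As a rough illustration of how the tags can be compared, the sketch below ranks them and reports the strongest tag behind a set of citations. The numeric ranks, the `EVIDENCE_RANK` table, and both function names are assumptions made for this example: the table above orders the tags but does not publish weights, and placing [RCT] and [META] at the same top rank is a simplification.

```python
# Illustrative ranks only; the documentation does not publish numeric weights.
EVIDENCE_RANK = {
    "EXPERT": 1,       # lowest evidentiary weight
    "PRECLINICAL": 2,  # mechanistic, not clinical
    "COHORT": 3,       # associations, not causation
    "META": 4,         # treated here as equal to RCT (assumption)
    "RCT": 4,
}

STRONG_TAGS = {"RCT", "META"}


def strongest_tag(tags):
    """Return the highest-ranked known tag, or None if no known tag is present."""
    known = [t for t in tags if t in EVIDENCE_RANK]
    if not known:
        return None
    return max(known, key=lambda t: EVIDENCE_RANK[t])


def has_strong_support(tags):
    """True if at least one citation is tagged RCT or META."""
    return any(t in STRONG_TAGS for t in tags)
```

A verdict whose citations are `["PRECLINICAL", "EXPERT"]` would report `PRECLINICAL` as its strongest tag and fail `has_strong_support`, which is the situation the paragraph above warns about.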
4. The INFERENCE rule
Agents are required to distinguish between two kinds of statements:
Cited claims: statements directly supported by a paper in the retrieved literature. These are attached to an EVIDENCE line with a PMID or paper title and an evidence tag.
Inferred claims: conclusions that follow logically from the evidence but are not themselves directly stated in any cited paper. These must be prefixed with INFERENCE:.
This separation matters because it prevents an agent from presenting a logical leap as if it were an established finding. A verdict cannot be built on INFERENCE-labeled claims alone. The agents are instructed to enforce this themselves, and the consensus checkpoint requires a verdict to be traceable to [RCT] or [META] evidence.
If you see a VERDICT that relies heavily on INFERENCE-prefixed claims rather than cited papers, treat it with more skepticism. The reasoning may be plausible, but it hasn't been anchored to the literature.
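If you want to check this yourself on an exported transcript, a small script can count how many claim lines are inferences. This is a minimal sketch that assumes each claim sits on its own line starting with `EVIDENCE:` or `INFERENCE:` and that the tag appears in square brackets. The exact transcript layout is an assumption, and the function names are hypothetical.

```python
import re

# Assumed line shape, e.g. "EVIDENCE: PMID 00000001 [RCT] ..."; the real
# transcript format may differ.
EVIDENCE_RE = re.compile(r"^EVIDENCE:.*?\[(RCT|META|COHORT|PRECLINICAL|EXPERT)\]")


def classify_line(line):
    """Label one transcript line as 'cited', 'inferred', or 'other'."""
    text = line.strip()
    if text.startswith("INFERENCE:"):
        return "inferred"
    if EVIDENCE_RE.match(text):
        return "cited"
    return "other"


def inference_share(lines):
    """Fraction of claim lines (cited + inferred) that are inferences."""
    labels = [classify_line(line) for line in lines]
    claims = [label for label in labels if label != "other"]
    if not claims:
        return 0.0
    return claims.count("inferred") / len(claims)
```

A high `inference_share` behind a verdict is a prompt to read the cited papers more closely, not a judgment on the reasoning itself.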
5. Consensus checkpoints
Every four agent turns, Colo injects a consensus checkpoint into the conversation. At a checkpoint, both agents are required to stop debating and respond with only three things:
| Field | What the agent must provide |
|---|---|
| VERDICT | The single most actionable research direction that has emerged, stated as a testable hypothesis. Must be supported by [RCT] or [META] evidence. |
| AGREE / DISAGREE | Whether their verdict aligns with the other agent's. If DISAGREE, they must state what evidence would resolve the conflict. |
| NEXT LANE | Which aspect of the topic should be the focus of the next exchange, and why. |
If one agent returns DISAGREE, the system flags the unresolved conflict and continues the debate. A final verdict requires both agents to agree, or for the disagreement to be resolved with additional evidence. This prevents a premature consensus when the underlying evidence is genuinely contradictory.
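The checkpoint rule can be sketched in a few lines. This is an illustration under stated assumptions, not Colo's implementation: turns are assumed to be counted from 1 across both agents, and the `stance` field and both function names are hypothetical.

```python
CHECKPOINT_INTERVAL = 4  # "every four agent turns", per the text above


def is_checkpoint(turn_number):
    """True when a checkpoint follows this agent turn (turns counted from 1)."""
    return turn_number > 0 and turn_number % CHECKPOINT_INTERVAL == 0


def checkpoint_outcome(response_a, response_b):
    """Return 'consensus' only if both agents answer AGREE, else 'continue'."""
    stances = (response_a["stance"], response_b["stance"])
    if all(stance == "AGREE" for stance in stances):
        return "consensus"
    return "continue"  # the unresolved conflict is flagged and the debate goes on
```

Under these assumptions, a twelve-turn debate would hit checkpoints after turns 4, 8, and 12, and a single DISAGREE at any of them keeps the debate open.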
6. How the literature is retrieved
When RAG (Retrieval-Augmented Generation) is enabled, Colo searches a pre-built index of PubMed abstracts to find relevant papers before each agent turn. The retrieval works as follows:
Rather than searching once with the literal hypothesis text, Colo generates three query angles from the recent conversation: one framing the topic clinically, one mechanistically, and one from a biomarker perspective. This prevents the retrieval from narrowing too early.
Each query is converted into a numerical vector using a sentence embedding model and compared against the vectors of every abstract in the index. Papers are ranked by how closely their meaning matches the query, not just whether they share keywords.
Results from all three queries are merged and deduplicated by PubMed ID so the same paper doesn't appear multiple times in the context. The first occurrence (most semantically relevant) is kept.
The top results are re-ranked using a combined score: 70% semantic similarity + 30% iCite citation quality (see below). The final 12 abstracts are passed to both agents as their literature context for that turn.
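The merge and re-rank steps can be sketched as follows. The 70/30 weights and the final count of 12 come from the description above; everything else is an assumption made for illustration, including the dictionary keys, the function names, and especially how RCR is scaled into the 0 to 1 range (here it is capped at a hypothetical `rcr_cap` and divided by it).

```python
def merge_dedupe(result_lists):
    """Merge per-query hits, keeping the first occurrence of each PubMed ID."""
    seen, merged = set(), []
    for hits in result_lists:
        for hit in hits:
            if hit["pmid"] not in seen:
                seen.add(hit["pmid"])
                merged.append(hit)
    return merged


def rerank(hits, top_k=12, w_sem=0.7, w_rcr=0.3, rcr_cap=5.0):
    """Sort by 0.7 * similarity + 0.3 * scaled RCR and keep the top_k hits."""
    def score(hit):
        # Assumed scaling: cap RCR, then map it onto [0, 1].
        rcr_scaled = min(hit.get("rcr", 0.0), rcr_cap) / rcr_cap
        return w_sem * hit["similarity"] + w_rcr * rcr_scaled

    return sorted(hits, key=score, reverse=True)[:top_k]
```

With this scaling, a paper with similarity 0.8 and RCR 5.0 scores 0.86 and outranks one with similarity 0.9 and RCR 1.0, which scores 0.69. That shows how the citation signal can reorder near-ties without letting a weakly related paper jump to the top.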
The retrieval above defaults to Colo's pre-built corpus. If you need coverage outside the pre-built specialty list, use Build Custom Corpus on the Setup screen to index any PubMed query into a private per-user collection; during the debate, the agents will then retrieve from your custom corpus instead of the pre-built one. An optional Include preprints toggle on the same screen adds records from bioRxiv, medRxiv, and arXiv q-bio (via Europe PMC) to your custom corpus. Preprints are not peer-reviewed and are flagged accordingly in the mindmap so you can weigh them appropriately.
7. iCite citation quality weighting
Not all published papers carry equal evidential weight. A landmark randomized trial with 3,000 citations is not the same as a single-center case series with 4. Colo uses the NIH iCite Relative Citation Ratio (RCR) to account for this difference.
What is RCR?
The Relative Citation Ratio is a field-normalized citation metric developed by the NIH. It measures how often a paper is cited relative to other papers in the same research field and publication year. An RCR of 1.0 means a paper is cited at the average rate for its field. An RCR of 5.0 means it is cited five times more than average, a sign of outsized influence.
Because RCR is field-normalized, it is more meaningful than a raw citation count. A paper in molecular biology accumulates citations faster than one in a narrow clinical subspecialty. RCR adjusts for this so comparisons across subfields are fair.
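As a rough worked illustration of the ratio described above, RCR can be read as a paper's citation rate divided by the rate expected for its field and year. The numbers are invented for the example, and how iCite derives the expected rate is not covered here.

$$
\mathrm{RCR} = \frac{\text{citations per year for the paper}}{\text{expected citations per year for its field and year}}, \qquad \text{e.g. } \frac{10}{2} = 5.0
$$

Under these made-up numbers, a paper cited 10 times a year in a field where 2 a year is typical has an RCR of 5.0, while a paper cited 2 times a year in that field sits at 1.0.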
How Colo uses it
When retrieving papers, Colo blends semantic relevance (how closely a paper's content matches the query) with citation quality (how influential the paper is in its field). The blend is currently 70% semantic / 30% RCR. This means a highly relevant but low-impact paper can still surface, but a highly cited paper on the exact topic will rank near the top.
Each retrieved abstract is displayed with its RCR score and raw citation count so you can see the quality signal directly.
RCR has a recency bias: papers published in the last 1–2 years haven't had time to accumulate citations, so they will rank lower than older papers even if they are more current. Recent high-quality work may be underweighted. This is a known limitation we are actively working to address.
8. What a VERDICT means
A VERDICT is the agents' consensus on the single most actionable research direction that has emerged from the debate, stated as a testable hypothesis. It is not a clinical recommendation. It is not a definitive answer. It is a synthesis of what the retrieved literature supports, filtered through adversarial debate.
A VERDICT is only considered valid under three conditions:
- It is supported by at least one [RCT] or [META]-tagged citation.
- There are no unresolved DISAGREE flags between the agents.
- The supporting claims are not labeled INFERENCE; they are traceable to cited papers.
When you see a verdict card appear in the dialogue, it means the agents have reached this bar. It does not mean the verdict is correct. It means it is internally consistent with the evidence they were given and the logic they were required to follow. Always verify against the cited papers directly.
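The three validity conditions translate directly into a check like the one below. This is a sketch of the stated rule, not the system's code. The claim structure (a `tag` plus an `inferred` flag) and the function name are assumptions.

```python
STRONG_TAGS = {"RCT", "META"}


def verdict_is_valid(supporting_claims, unresolved_disagreements):
    """Apply the three stated conditions to a candidate verdict.

    supporting_claims: list of dicts with 'tag' (str or None) and 'inferred' (bool).
    unresolved_disagreements: count of open DISAGREE flags.
    """
    has_strong = any(c["tag"] in STRONG_TAGS for c in supporting_claims)
    no_conflict = unresolved_disagreements == 0
    all_cited = all(not c["inferred"] for c in supporting_claims)
    return has_strong and no_conflict and all_cited
```

Passing this check says only that the verdict met the bar described above; it says nothing about whether the cited papers were read correctly.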
9. The evidence mindmap
Every paper the agents cite during the debate is captured in the evidence mindmap, a persistent visual artifact that classifies and ranks the entire library for you. It's the thing you keep after the conversation ends. Most users spend more time in the mindmap than in the debate itself.
The mindmap is available from the Mindmap stage of the workflow and offers three views of the same underlying library. Each view answers a different question:
- Stance columns. Every cited paper is classified as supporting, challenging, or unresolved relative to your hypothesis. Papers are sorted by semantic similarity to the hypothesis within each column. Use this view to see at a glance where the literature pulls for or against your idea.
- Semantic plot. The hypothesis and every cited paper are projected into two dimensions using a dimensionality-reduction algorithm (UMAP for libraries with 30+ papers, MDS for smaller ones, with a PCA fallback). Distance from the hypothesis dot approximates semantic distance: closer dots represent papers whose ideas are closer to your question. Dots are colored by stance.
- Yearly. The same papers laid out along a publication-year axis, banded by stance. Use this view to see how the evidence base has shifted over time: whether challenging evidence is older or newer than supporting evidence, where the consensus broke, and when foundational work landed.
Citation network overlay and foundational papers
The semantic plot also overlays in-set citation edges fetched from OpenAlex: thin grey lines connecting papers that cite each other within your library. Papers that are cited by multiple other papers in the same retrieved set are marked with a yellow star and labeled foundational. This means foundational to your specific debate, not to the broader field; a paper with modest overall citation counts can still be foundational here if many of the papers in this library cite it.
The foundational marker appears as a star on the dot in the semantic plot, a star on the dot in the yearly view, and a gold badge inline next to the PMID in the stance columns. Same paper, same marker, every view.
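The in-set rule behind the foundational marker can be sketched as a simple count over citation edges. This example assumes edges arrive as `(citing_id, cited_id)` pairs and reads "multiple" as two or more. The threshold, the function name, and the data shape are illustrative assumptions, not the actual OpenAlex integration.

```python
from collections import defaultdict


def foundational_papers(edges, library, min_citers=2):
    """Return IDs of papers cited by at least min_citers other papers in the library.

    edges: iterable of (citing_id, cited_id) pairs.
    library: set of paper IDs in the retrieved set.
    """
    citers = defaultdict(set)
    for citing, cited in edges:
        # Only count edges where both ends are inside the library.
        if citing in library and cited in library and citing != cited:
            citers[cited].add(citing)
    return {paper for paper, who in citers.items() if len(who) >= min_citers}
```

Because citations from outside the library are ignored, a heavily cited paper can miss the marker while a modestly cited one earns it, which matches the "foundational to your debate" framing above.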
What you can do with the mindmap
- Click any paper in any view to open the leaf detail: full title, summary, evidence tier, and a direct link to PubMed.
- Export the entire library to a reference manager. The Zotero button on the workflow nav generates a downloadable RIS file (Zotero, EndNote, Mendeley, Papers all accept it) or copies the PMIDs to your clipboard for Zotero's Add by Identifier dialog.
- Carry the library forward. Methods and Scaffold inherit the mindmap's full context automatically. When you write the methods section for a study or scaffold a grant, the agents already know which papers shaped your hypothesis.
Mindmap state persists per session and is restored whenever you reopen the session.
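For readers unfamiliar with RIS, the export mentioned above is a plain-text format with one tagged field per line. The sketch below writes a minimal record per paper using standard RIS tags (`TY`, `TI`, `PY`, `AN`, `ER`). Which fields Colo actually writes is not documented here, so the field selection, the function name, and the input shape are assumptions.

```python
def to_ris(papers):
    """Serialize papers (dicts with 'title', 'pmid', optional 'year') as RIS text."""
    lines = []
    for paper in papers:
        lines.append("TY  - JOUR")                 # record type: journal article
        lines.append(f"TI  - {paper['title']}")    # title
        if paper.get("year"):
            lines.append(f"PY  - {paper['year']}")  # publication year
        lines.append(f"AN  - {paper['pmid']}")     # accession number (PMID here)
        lines.append("ER  - ")                     # end of record
    return "\n".join(lines) + "\n"
```

Each record opens with `TY` and closes with `ER`, which is what lets reference managers split the file into separate entries on import.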
10. Known limitations
We believe in being transparent about what Colo cannot do. The following are active limitations you should factor into how you use and interpret its outputs. Many of these are scoped for future development and will be addressed as the product matures.
| Limitation | What it means in practice |
|---|---|
| Abstract-only retrieval | The pre-built corpus indexes PubMed abstracts, not full-text articles. Methods sections, supplemental data, and full results tables are not accessible to the agents. Workaround: use the Build Custom Corpus option in Setup to index a domain or paper set you have full-text access to, or paste full-text excerpts into the Literature panel on the Adversarial screen. Both routes give the agents direct access to the deeper material. |
| Publication bias | PubMed skews toward positive results. Studies with null or negative findings are underrepresented, so agents may overweight evidence for efficacy. |
| Date cutoffs | The corpus covers a defined range (typically 2015–present). Landmark studies outside that window are not indexed and will not be cited unless added manually. |
| RCR recency bias | Recently published papers have accumulated fewer citations and will be underweighted by the RCR ranking signal, even when they represent important new evidence. |
| No retraction detection | Papers retracted after indexing are not flagged. Always verify the retraction status of any cited paper before relying on it. |
| AI reasoning errors | Agents can misread, misattribute, or hallucinate details even when given the abstract. The INFERENCE rule and evidence tagging reduce this risk but do not eliminate it. |
| Domain coverage | The pre-built corpus covers ~16 biomedical specialties (3.4M+ papers, 2015–present), including oncology, cardiology, neurology, immunology, pharmacology, infectious disease, geroscience, psychiatry, gastroenterology, pulmonology, nephrology, hematology, OB/GYN, and bioinformatics. A handful of fields (e.g., dermatology, ophthalmology) are not yet indexed in the pre-built corpus. Workaround: use the Build Custom Corpus feature on the Setup screen to build a focused corpus from any PubMed query. Your query reaches all of PubMed, not just the pre-built specialty list. |
11. Further reading
A short list of papers and analyses worth knowing if you want to follow where language-model literature synthesis is heading. The field is moving quickly, and we expect to refactor as the evidence shifts. Colo's commitment is to rigor, not to any specific architecture.
Some entries are arXiv preprints rather than peer-reviewed publications. Treat them as directional signal, not settled findings.
- NEPAQuAD: RAG vs. long-context benchmark. RAG models outperform long-context approaches on answer accuracy across frontier model families. Settles the "long context kills RAG" question for biomedical-scale corpora.
- iMAD: when does multi-agent debate help, and when does it hurt? Triggering debate can override correct single-agent answers with incorrect ones. Implication: debate should be selective, not universal.
- Single-Agent vs. Multi-Agent under equal thinking budgets. Multi-agent improvement depends on task structure and verification protocols, not on multi-agent being universally better.
- Medical Hallucinations in Foundation Models. Tested 11 foundation models. 64–72% of medical hallucinations stem from causal or temporal reasoning failures rather than knowledge gaps. The gap retrieval grounding is meant to fill.
- CareMedEval: a critical-appraisal benchmark. State-of-the-art LLMs fail to exceed 0.5 Exact Match on critical appraisal questions, especially around study limitations and statistical analysis. The capability gap that justifies tools like Colo.
- DeepResearcher: end-to-end RL-trained research agents. Trains retrieval behavior into the model rather than prompting around it. The most likely future path by which trained-in retrieval beats prompted RAG.
- Consensus releases an MCP server. Specialty literature tools becoming connectors inside ChatGPT, Claude, and Cursor. A distribution shift Colo needs to follow.
- BioVerge: biomedical hypothesis-generation benchmark. Evaluates whether automated systems can generate testable hypotheses from biomedical literature. Directly relevant to Colo's setup-to-adversarial flow.
Suggestions welcome at privacy@colo-sci.com. This list is revised as papers are retracted, replicated, or superseded.
Workflow Stages
The sections above describe the engine; the sections below describe the user-facing pipeline. Each stage from Setup through Export uses Claude differently, and we name which Claude model handles each step so the choices are transparent.
12. Setup
Powered by Claude Sonnet
The Setup screen is where you tell Colo what you want to investigate. You enter your hypothesis in plain language, and Claude Sonnet helps refine it, clarifying scope, identifying unstated assumptions, and preparing it for adversarial evaluation.
If you don't have a fully-formed PubMed query yet, the Query Assistant translates a research question into a properly formatted PubMed search using MeSH terms, field codes, and boolean operators. You can keep iterating with the assistant until the query returns a useful number of papers.
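To make "MeSH terms, field codes, and boolean operators" concrete, the sketch below assembles a query string using standard PubMed field tags (`[MeSH Terms]`, `[Title/Abstract]`, `[Date - Publication]`). It illustrates the query syntax only. The function name and its parameters are hypothetical, and this is not how the Query Assistant itself builds queries.

```python
def build_pubmed_query(mesh_terms, text_terms, start_year=None):
    """Combine MeSH and free-text terms: OR within a group, AND between groups."""
    groups = []
    if mesh_terms:
        groups.append("(" + " OR ".join(f'"{t}"[MeSH Terms]' for t in mesh_terms) + ")")
    if text_terms:
        groups.append("(" + " OR ".join(f'"{t}"[Title/Abstract]' for t in text_terms) + ")")
    if start_year is not None:
        # Open-ended date range: PubMed accepts "3000" as a far-future upper bound.
        groups.append(f'("{start_year}"[Date - Publication] : "3000"[Date - Publication])')
    return " AND ".join(groups)


query = build_pubmed_query(
    mesh_terms=["Colorectal Neoplasms"],
    text_terms=["immunotherapy", "checkpoint inhibitor"],
    start_year=2015,
)
```

Widening a group with more OR terms raises the paper count, while adding another AND group narrows it. That is the lever you adjust when iterating toward a useful corpus size.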
From there you pick a corpus. Colo ships with a pre-built corpus covering 16 biomedical specialties (3.4M+ peer-reviewed abstracts, 2015–present). If you need coverage outside that scope, the Build Custom Corpus feature indexes any PubMed query into a private per-user collection. An opt-in Include preprints toggle adds records from bioRxiv, medRxiv, and arXiv q-bio via Europe PMC, with a non-peer-reviewed disclosure so you can weigh them appropriately.
13. Adversarial debate
Sonnet by default, Opus on subscription
The Adversarial screen is the heart of Colo. Two AI agents debate your hypothesis using only retrieved peer-reviewed literature. Each turn carries strict citation requirements and forces the agents to either ground their claim in a specific PMID or explicitly tag it as INFERENCE.
The model is your choice. Sonnet is the default and produces fast, careful adversarial debates suitable for most research questions. Researcher-monthly subscribers can switch to Opus, a more deliberate model in Anthropic's lineup, for this highest-cognitive-load step; it produces a slower but more thorough analysis. Free and pay-as-you-go users run on Sonnet. The choice is made via the Claude badge on the Adversarial screen and locks once the first turn streams, so transcripts never mix voices.
Agent A, the Clinical Researcher, reasons from clinical trial outcomes, patient endpoints, response rates, and real-world applicability. It challenges mechanistic claims that don't translate to measurable clinical benefit. Agent B, the Translational Researcher, reasons from molecular biology, signaling pathways, biomarker evidence, and resistance mechanisms. It challenges clinical framing that oversimplifies the underlying biology.
Every claim is tagged with an evidence tier (RCT, META, COHORT, PRECLINICAL, EXPERT) traceable to a cited PMID. The agents must commit to consensus at scheduled checkpoints or the debate continues. A VERDICT is only issued when the consensus is supported by randomized or meta-analytic evidence with no unresolved disagreements.
14. Mindmap
Powered by Claude Haiku
Every paper cited during the debate is classified by stance and added to your mindmap automatically. Claude Haiku handles the classification and per-paper summaries; it is fast, consistent, and well-suited to the structured stance/tier categorization the mindmap needs.
The mindmap offers three views of the same underlying library: Stance columns sort papers into supporting / challenging / unresolved, ranked by semantic similarity to your hypothesis. The Semantic plot projects every paper into a two-dimensional layout (UMAP for libraries with 30+ papers, MDS for smaller, with a PCA fallback) and overlays in-set citation edges from OpenAlex. Yearly lays the same papers along a publication-year axis so you can see how the evidence base shifted over time.
Papers cited by multiple other papers in your retrieved set are flagged as foundational with a yellow star on every view. You can also annotate any paper with up to four emojis (⭐ ❓ 🔬 etc.) that follow the paper across sessions, turning the mindmap into a personal knowledge base over time.
Export the entire library to Zotero, EndNote, Mendeley, or Papers via the Zotero button on the workflow nav (RIS file or copy-PMIDs-to-clipboard). Mindmap state persists per session and reopens automatically on return.
15. Methods
Powered by Claude Sonnet
Once your hypothesis has survived adversarial review, the Methods screen helps you turn it into an actual study design. Six methodology templates are available: Clinical RCT, Phase I/II Trial, Cohort / Observational, In Vitro, In Vivo (Animal), and Translational.
The methods assistant is initialized with the full debate context (every paper cited, every consensus, every unresolved disagreement), so it opens with targeted questions about the design gaps that matter most for your specific hypothesis rather than generic study-design boilerplate. As you discuss endpoints, controls, statistical power, and ethical considerations, it drafts methods-section language alongside the conversation.
Claude Sonnet handles this stage. It's well-suited to long-form drafting under structured constraints and produces methods text that reads as if a careful human researcher wrote it, not as templated AI prose.
16. Scaffold
Powered by Claude Sonnet
Scaffold turns your hypothesis and methods into a grant or proposal draft. Five formats are supported: NIH R01 (full three-aim structure), NIH R21 (exploratory two-aim), Foundation (lay-accessible narrative), Industry Brief (decision-enabling business case), and Internal / Pilot (preliminary-data framing for internal approval).
The scaffold assistant inherits everything: the hypothesis, the full debate transcript, the cited evidence base, and your design decisions from Methods. It drafts section language, flags gaps between your methods and your proposed aims, and gives you format-specific guidance: aim hierarchy for R01, exploratory framing for R21, lay narrative for Foundation, and decision-enabling framing for Industry Brief.
Claude Sonnet handles this stage for the same reasons it handles Methods: it produces careful long-form prose under structured constraints, with high tolerance for context length (the inherited debate transcript can be substantial).
17. Export
Rendering only, no model
Export bundles everything you've produced (hypothesis, debate, mindmap, methods, scaffold) into a single deliverable you can keep, share with a colleague, or hand off to a collaborator.
PDF export produces a print-ready document with proper citation formatting. Markdown export gives you raw text you can paste into Notion, Obsidian, or any other writing tool. Citations export to RIS for Zotero / EndNote / Mendeley, or as a simple PMID list for Zotero's Add by Identifier dialog. Nothing is destroyed when you export: sessions remain available, and you can re-export with updated content at any time.
No Claude model is involved in Export itself. It's a rendering step over data that's already been generated by the earlier stages.