AI Researcher

CLI-first Research Copilot prototype. The user-facing CLI is still v0-style, while the codebase now contains a first v1 workflow slice.

For a fuller Chinese user manual, see docs/user-manual.zh-CN.md.

What It Does

This project is not a general-purpose autonomous researcher. The current CLI focuses on four task types plus seven inspection/workflow helpers:

topic Turns a research topic into a structured research map.
analyze-paper Analyzes a local file or a paper identifier/URL and produces a structured, paper-card-style output.
analyze-repo Analyzes a local repository and produces a first-pass structure and reproducibility review.
plan-experiments Turns a research idea into multiple falsifiable experiment plans.
search Searches arXiv and (optionally, via --source) OpenAlex, and prints matching papers to stdout.
collect Searches for papers and appends them as de-duplicated evidence refs to an existing task, without changing its workflow status.
critique Runs the structured critic on a task and advances it through the criticizing → criticized gate. This is the only way past the critic gate.
status Prints the saved workflow status, suggested next transition, and whether a critique or human decision is still required.
advance Advances a saved task through the workflow state machine after explicit human input. Refuses to leave criticizing until critique has run.
report Rewrites report.md for an existing task from the saved task state.
review Prints the path to the saved task.json for an existing task.

Current Scope

What the current prototype does:

persists every task to disk under artifacts/
writes both machine-readable and human-readable outputs
keeps prompts as editable files under prompts/
validates model JSON before writing structured outputs
supports OpenAI-compatible model generation when .env is configured
falls back to built-in deterministic templates if no model config is present
supports local paper/file input and remote paper download with deduplication
extracts first-pass PDF text with pdfminer.six, falling back to pypdf when needed
detects reference-section offsets and avoids sending bibliography text in paper prompt excerpts
flags ungrounded paper_card.method, paper_card.datasets, and paper_card.metrics entries when they do not appear in extracted text
adds a small related-paper search context to topic, analyze-paper, and plan-experiments when the model path is enabled (analyze-paper keeps these external papers separate from in-paper grounding, so other papers’ abstracts can’t masquerade as verbatim quotes from the analyzed paper)
searches papers through a multi-provider abstraction (arXiv + OpenAlex) with cross-source de-duplication; OpenAlex is opt-in via --source {arxiv,openalex,all} on search and collect, while the default and all internal auto-search remain arXiv-only, and a single provider failing degrades to a warning plus the other providers’ results
retries the model once when its JSON parses but fails schema validation before falling back to the deterministic template (transport/parse errors are already retried inside the client)
enforces the critic as a hard, state-machine gate: new tasks stop at criticizing, critique is the only edge to criticized, and no task can reach done without it
holds a sycophantic re-critique at the gate: a re-run that flips toward passing while conceding >50% of the prior round’s concerns (tracked in task.critique_log across regenerate) raises a concession_alarm and needs critique --override to proceed
regenerates a task’s analysis in place (regenerate) and auto-drives the workflow to the next decision point (loop), stopping at every critic block or human gate rather than forcing past it
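The validate, retry once, then fall back behavior described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `REQUIRED_KEYS`, `validate`, and `generate_structured` are hypothetical names, and the real schema is richer than two string fields.

```python
import json
from typing import Callable

# Hypothetical minimal schema; the real per-task schemas have more fields.
REQUIRED_KEYS = ("summary", "next_action")


def validate(payload: dict) -> bool:
    """Assumed schema check: every required key is present and is a string."""
    return all(isinstance(payload.get(key), str) for key in REQUIRED_KEYS)


def generate_structured(call_model: Callable[[], str],
                        fallback: Callable[[], dict]) -> dict:
    """Call the model at most twice, then use the deterministic template."""
    for _attempt in range(2):  # first call plus one schema retry
        try:
            payload = json.loads(call_model())
        except json.JSONDecodeError:
            # Parse errors are assumed to be retried inside the client already,
            # so this layer goes straight to the fallback.
            break
        if isinstance(payload, dict) and validate(payload):
            return payload
    return fallback()
```

Only a schema failure triggers the second call; nothing unvalidated is ever returned to the caller.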

Current boundaries:

loop is bounded by design: it never auto-resolves a critic block or auto-approves the human-decision gate
large-scale experiment execution is not automated
remote repositories are not cloned automatically
PDF parsing extracts text, sections, and figure/table caption lines (placeholder-level), but does not handle formulas or the figures/tables themselves
report is a regeneration helper and can overwrite the original generated report.md

Environment

Preferred local setup:

uv venv .venv
uv pip install -e . --python .venv/bin/python

If uv is unavailable, venv also works:

python3 -m venv .venv
./.venv/bin/pip install -e .

For an OpenAI-compatible model provider, create a local .env from .env.example and fill in your endpoint, key, and default model:

cp .env.example .env

Expected variables:

OPENAI_API_KEY
OPENAI_BASE_URL
OPENAI_MODEL
OPENAI_TIMEOUT_SECONDS
OPENAI_MAX_RETRIES
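As a rough sketch of how these variables might be consumed, the snippet below reads them from an environment mapping and returns None when the model path is not configured. Everything here is an assumption for illustration: `ModelConfig` and `load_model_config` are hypothetical names, the defaults of 60 seconds and 2 retries are placeholders, and it assumes the .env values have already been loaded into the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ModelConfig:
    api_key: str
    base_url: str
    model: str
    timeout_seconds: float
    max_retries: int


def load_model_config(env: Mapping[str, str] = os.environ) -> Optional[ModelConfig]:
    """Return a config when key, base URL, and model are all set, else None."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    base_url = env.get("OPENAI_BASE_URL", "").strip()
    model = env.get("OPENAI_MODEL", "").strip()
    if not (api_key and base_url and model):
        return None  # caller would use the deterministic templates instead
    return ModelConfig(
        api_key=api_key,
        base_url=base_url,
        model=model,
        # Placeholder defaults; the project's real defaults may differ.
        timeout_seconds=float(env.get("OPENAI_TIMEOUT_SECONDS", "60")),
        max_retries=int(env.get("OPENAI_MAX_RETRIES", "2")),
    )
```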

Quick Start

Use the installed CLI inside the environment:

./.venv/bin/research topic "weak-to-strong alignment"
./.venv/bin/research analyze-paper README.md
./.venv/bin/research analyze-repo .
./.venv/bin/research plan-experiments "skill mutation for research agents"

You can also use the repo-local launcher without installation:

python3 research topic "research map"

Each core command prints the generated task.json path, for example:

/absolute/project/path/artifacts/tasks/topic-map-weak-to-strong-alignment-1a2b3c4d/task.json

The task_id is the directory name under artifacts/tasks/, such as topic-map-weak-to-strong-alignment-1a2b3c4d. Use that ID with the helper commands:

./.venv/bin/research report topic-map-weak-to-strong-alignment-1a2b3c4d
./.venv/bin/research review topic-map-weak-to-strong-alignment-1a2b3c4d
./.venv/bin/research status topic-map-weak-to-strong-alignment-1a2b3c4d
./.venv/bin/research advance topic-map-weak-to-strong-alignment-1a2b3c4d --approve --reason "reviewed"

Commands And Outputs

`topic`

Example:

./.venv/bin/research topic "weak-to-strong alignment in research copilots"

Input:

one research topic string

Primary output fields in outputs/result.json:

summary
research_questions
hypotheses
reading_queue
evidence_needs
next_action

Human-readable output:

report.md with summary, questions, hypotheses, reading queue, and next action

`analyze-paper`

Examples:

./.venv/bin/research analyze-paper agent.md
./.venv/bin/research analyze-paper 2401.12345
./.venv/bin/research analyze-paper https://arxiv.org/abs/2401.12345

Accepted input:

local file path
arXiv identifier
DOI
URL

Primary output fields in outputs/result.json:

summary
paper_input
resolved_input
claims_checklist
paper_card
ungrounded_paper_card_fields when model output contains method, dataset, or metric entries not found in extracted text
next_action

Notes:

remote papers are downloaded into downloads/
downloads are deduplicated through downloads/index.json
PDF inputs are parsed with pdfminer.six, falling back to pypdf when needed
detected references sections are trimmed out of the prompt body excerpt
current analysis still uses a limited content hint; it is not a full paper structure parser
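The grounding check behind ungrounded_paper_card_fields can be pictured with a small sketch. It assumes a simple case-insensitive, whitespace-normalized substring match; the project's actual matching rule is not documented here, and `find_ungrounded` is a hypothetical name.

```python
def find_ungrounded(paper_card: dict, extracted_text: str) -> dict:
    """Return paper_card entries that never appear in the extracted text."""
    # Collapse whitespace so line breaks in PDF text do not hide a match.
    haystack = " ".join(extracted_text.lower().split())
    flagged = {}
    for field in ("method", "datasets", "metrics"):
        value = paper_card.get(field) or []
        entries = [value] if isinstance(value, str) else list(value)
        missing = [
            entry for entry in entries
            if " ".join(str(entry).lower().split()) not in haystack
        ]
        if missing:
            flagged[field] = missing
    return flagged
```

Usage, with made-up inputs:

```python
card = {"method": "contrastive fine-tuning", "datasets": ["MNIST", "ImageNet"]}
text = "We apply contrastive\nfine-tuning and evaluate on MNIST."
print(find_ungrounded(card, text))  # {'datasets': ['ImageNet']}
```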

`analyze-repo`

Example:

./.venv/bin/research analyze-repo .

Accepted input:

local repository path only

Primary output fields in outputs/result.json:

summary
repo_path
file_count
top_file_types
readme_excerpt
dependency_files
entrypoints
test_files
config_files
train_eval_scripts
ci_configs
notebooks
data_dirs
run_commands
critical_checks
reproducibility_notes
readme_assessment when generated by the model path
next_action

Notes:

this is currently a static local repo inspection
it does not clone remote repositories
it does not automatically execute repo code as part of analysis

`plan-experiments`

Example:

./.venv/bin/research plan-experiments "Evaluate whether structured critique protocols improve weak-to-strong alignment fidelity in research copilots"

Input:

one research idea string

Primary output fields in outputs/result.json:

summary
experiment_plans
next_action

Each item in experiment_plans includes:

name
objective
baseline
metrics
failure_conditions
risks

`search`

Examples:

./.venv/bin/research search "weak-to-strong alignment" --max 5
./.venv/bin/research search "weak-to-strong alignment" --source all

Behavior:

searches for related papers across providers; --source {arxiv,openalex,all} selects the backend (default arxiv, so OpenAlex is opt-in)
prints results to stdout
prints provider failures (e.g. an arXiv timeout) to stderr while still returning the surviving providers’ results
does not currently write to artifacts/ or attach results to an existing task
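The provider behavior above (merge results, de-duplicate across sources, degrade to a warning when one provider fails) can be sketched as below. This is an illustration under assumptions: the provider callables are stand-ins rather than real arXiv or OpenAlex clients, and de-duplicating by normalized title is a guess at the matching key, not the project's documented rule.

```python
import sys
from typing import Callable, Dict, List

Paper = Dict[str, str]
Provider = Callable[[str, int], List[Paper]]  # (query, limit) -> papers


def dedup_key(paper: Paper) -> str:
    """Assumed cross-source key: lowercased, whitespace-normalized title."""
    return " ".join(paper.get("title", "").lower().split())


def search_all(query: str, providers: Dict[str, Provider],
               max_results: int = 5) -> List[Paper]:
    """Query every provider; a failing provider only produces a warning."""
    merged: List[Paper] = []
    seen = set()
    for name, provider in providers.items():
        try:
            results = provider(query, max_results)
        except Exception as exc:  # e.g. a timeout or an HTTP 429
            print(f"warning: {name} search failed: {exc}", file=sys.stderr)
            continue
        for paper in results:
            key = dedup_key(paper)
            if key not in seen:
                seen.add(key)
                merged.append(paper)
    return merged
```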

`collect`

Examples:

./.venv/bin/research collect <task_id> --source openalex --max 3
./.venv/bin/research collect <task_id> --query "weak-to-strong alignment evaluation"

Behavior:

searches for papers and appends them as de-duplicated evidence refs (by source_type + source_ref) to an existing task
defaults the query to the task title; --query overrides it, --max defaults to 3, and --source {arxiv,openalex,all} selects the backend (default arxiv)
updates evidence.json, task.json evidence refs, adds a collect_evidence entry to the decision log, and logs to logs/events.log
does not change the task’s workflow status

This closes the critic loop: when critique blocks a task back to collecting_evidence, run collect to attach external corroboration, then advance (collecting_evidence → criticizing), then critique again.
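The de-duplication rule used by collect, keyed on source_type + source_ref, can be sketched as follows. `append_evidence` is a hypothetical helper, and the example refs are made up; only the two key fields come from the description above.

```python
from typing import Dict, List, Tuple

EvidenceRef = Dict[str, str]


def append_evidence(existing: List[EvidenceRef],
                    incoming: List[EvidenceRef]) -> Tuple[List[EvidenceRef], int]:
    """Append refs whose (source_type, source_ref) pair is not attached yet."""
    seen = {(ref["source_type"], ref["source_ref"]) for ref in existing}
    merged = list(existing)  # leave the caller's list untouched
    added = 0
    for ref in incoming:
        key = (ref["source_type"], ref["source_ref"])
        if key not in seen:
            seen.add(key)
            merged.append(ref)
            added += 1
    return merged, added
```

Because existing refs are kept in place and only new keys are appended, running collect twice with the same query should not grow the evidence list a second time.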

`status`

Example:

./.venv/bin/research status <task_id>

Behavior:

prints the saved task status, next action, suggested next status, whether a human decision is required, and whether a critique is still required

`critique`

Example:

./.venv/bin/research critique <task_id>

Behavior:

runs the structured critic (model path or deterministic fallback) and writes outputs/critic.json + critic.md
appends a critic ref (with recommended_action and open-concern count) to task.json
if the task is at criticizing, advances it to criticized — the only edge out of the critic gate
anti-sycophancy concession discipline: each critique appends a snapshot to task.critique_log (preserved across regenerate). On a re-critique that flips toward passing while dropping >50% of the prior round’s concerns (or concedes again right after a prior concession), it raises a concession_alarm, holds the task at collecting_evidence, and only critique --override REASON proceeds — stopping the model from rubber-stamping a regenerated result for conversational harmony
does not overwrite outputs/result.json
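The core of the concession check can be illustrated with a short sketch. The snapshot shape (`passed`, `concerns`) and the function name are assumptions made for this example, "flips toward passing" is read here as "the prior round blocked and this round passes", and the repeated-concession rule is left out for brevity.

```python
from typing import Dict, List


def concession_alarm(critique_log: List[Dict], current: Dict) -> bool:
    """True when a re-critique passes while dropping >50% of prior concerns."""
    if not critique_log:
        return False  # first critique: nothing to concede yet
    prior = critique_log[-1]
    prior_concerns = set(prior["concerns"])
    if not prior_concerns:
        return False
    dropped = prior_concerns - set(current["concerns"])
    conceded_fraction = len(dropped) / len(prior_concerns)
    flips_to_pass = (not prior["passed"]) and current["passed"]
    return flips_to_pass and conceded_fraction > 0.5
```

In this sketch, dropping exactly half of the prior concerns does not raise the alarm, matching the strict >50% wording above.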

`advance`

Example:

./.venv/bin/research advance <task_id> --approve --reason "reviewed"

Behavior:

advances task state after explicit human input
refuses to advance a task that is at criticizing and tells you to run critique first
updates task.json and logs/events.log
does not call the model, search, parse PDFs, or execute experiments
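A toy version of the gate that advance enforces might look like the sketch below. Only the edges around criticizing are taken from this README; the edges after criticized are guesses, and `ALLOWED`, `TransitionError`, and `check_transition` are hypothetical names rather than the contents of src/ai_researcher/workflow.py.

```python
# Assumed transition table; edges after "criticized" are illustrative guesses.
ALLOWED = {
    "collecting_evidence": {"criticizing"},
    "criticizing": {"criticized", "collecting_evidence"},
    "criticized": {"needs_human_decision"},
    "needs_human_decision": {"done", "failed"},
}


class TransitionError(ValueError):
    """Raised when a requested status change is not permitted."""


def check_transition(status: str, target: str, *, via_critique: bool = False) -> str:
    """Return target if the move is legal, otherwise raise TransitionError."""
    if target not in ALLOWED.get(status, set()):
        raise TransitionError(f"{status} -> {target} is not a valid transition")
    if status == "criticizing" and not via_critique:
        # Only the critique command may move a task out of the critic gate.
        raise TransitionError("task is at criticizing; run critique first")
    return target
```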

`regenerate`

Example:

./.venv/bin/research regenerate <task_id>

Behavior:

re-runs the task’s generator in place under the same task_id, overwriting result.json / report.md (collect only attaches evidence and advance only changes state — neither recomputes the product)
merges freshly generated evidence onto the existing evidence (de-duped by source_type + source_ref), so externally collect-ed corroboration survives
clears the prior pass’s now-stale critic.json / critic.md / critic refs, then rewinds the task to criticizing to face the gate again
only accepts tasks inside the critic loop (collecting_evidence / criticizing / criticized)
this is how you clear a block that collect cannot: when the critic blocks because a field is absent from the paper itself (in-paper grounding), only recomputing the analysis helps, not more external evidence

`loop`

Example:

./.venv/bin/research loop <task_id>

Behavior:

auto-drives the task: mechanical states use advance, and at criticizing it runs the real critic (reusing critique/advance, so the gate logic stays single-sourced)
stops at every genuine decision point: a critic block (criticizing → collecting_evidence), the human-decision gate (needs_human_decision), terminal states (done / failed), and a safety step cap
deliberately does not auto-collect/regenerate to push past the critic gate — the gate exists to stop blind advancement, so resolving a block is a human call. loop stops and prints the remediation commands (collect / regenerate / critique --override); resolve, then run loop again to continue from where it stopped
prints each step’s state transition, where it stopped, and why
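The stopping rules above can be condensed into a small driver sketch. The `advance` and `critique` callables are injected stand-ins for the real commands, `run_loop` is a hypothetical name, and the step cap of 10 is an arbitrary placeholder.

```python
from typing import Callable, List, Tuple

STOP_STATES = {"needs_human_decision", "done", "failed"}


def run_loop(status: str,
             advance: Callable[[str], str],
             critique: Callable[[], str],
             max_steps: int = 10) -> Tuple[str, List[Tuple[str, str]], str]:
    """Drive a task until a decision point; return (status, trail, reason)."""
    trail: List[Tuple[str, str]] = []
    steps = 0
    while status not in STOP_STATES:
        if steps >= max_steps:
            return status, trail, "step cap reached"
        if status == "criticizing":
            nxt = critique()  # the real critic decides, never the loop
            trail.append((status, nxt))
            if nxt == "collecting_evidence":
                return nxt, trail, "critic block"  # resolving it is a human call
        else:
            nxt = advance(status)  # mechanical transition
            trail.append((status, nxt))
        status = nxt
        steps += 1
    return status, trail, f"stopped at {status}"
```

The driver never calls collect or regenerate itself, which mirrors the rule that a critic block is resolved by a person and the loop is then re-run.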

`report`

Example:

./.venv/bin/research report <task_id>

Behavior:

regenerates report.md from saved task state
prints the report path
overwrites the current report.md; the regenerated file is a task-state report and may be less detailed than the original generated analysis report

`review`

Example:

./.venv/bin/research review <task_id>

Behavior:

prints the path to the saved task.json

Test

./.venv/bin/python -m unittest discover -s tests

Output Layout

artifacts/tasks/<task_id>/task.json
artifacts/tasks/<task_id>/report.md
artifacts/tasks/<task_id>/evidence.json
artifacts/tasks/<task_id>/inputs/pdf_text.txt when extractable PDF text exists
artifacts/tasks/<task_id>/logs/events.log
artifacts/tasks/<task_id>/outputs/result.json
downloads/ stores fetched paper files
downloads/index.json stores download dedup metadata
prompts/<task>/system.txt and prompts/<task>/user.txt store editable prompt assets

v1 Slice Already Present

The latest code includes a small v1 foundation, but it is not a complete v1 product yet:

src/ai_researcher/workflow.py defines research-loop statuses and transition validation.
TaskRecord already has fields for research_question, hypotheses, evidence_needs, critic_refs, and decision_log.
docs/research-copilot-v1-spec.md describes the target v1 workflow and open questions.

The current executable CLI exposes only:

topic | analyze-paper | analyze-repo | plan-experiments | search | collect | critique | status | advance | regenerate | loop | report | review

collect is the real, wired command (see the collect section above) and is distinct from the still-unimplemented names below. Do not document start, collect-papers, add-evidence, evidence, or propose-run as available commands until they are wired into src/ai_researcher/cli.py.

File Meanings

`task.json`

This is the task state snapshot. It contains:

task_id
task_type
status
input
summary
evidence_refs
next_action
timestamps

It is the control record for the task, not the full analysis body.

`outputs/result.json`

This is the main structured output for the task.

This file contains the task-specific payload, for example:

topic mapping fields for topic
paper_card for analyze-paper
repo analysis fields for analyze-repo
experiment_plans for plan-experiments

If you want the actual structured result, this is the main file to inspect.
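For programmatic inspection, a task directory can be read with a few lines of standard-library Python. This is a minimal sketch: `load_task` is a hypothetical helper, and it relies only on the file layout described in this README.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def load_task(task_dir) -> Tuple[dict, Optional[dict]]:
    """Return (task.json contents, outputs/result.json contents or None)."""
    task_dir = Path(task_dir)
    task = json.loads((task_dir / "task.json").read_text(encoding="utf-8"))
    result_path = task_dir / "outputs" / "result.json"
    result = None
    if result_path.exists():
        result = json.loads(result_path.read_text(encoding="utf-8"))
    return task, result
```

Usage, reusing the example task_id from Quick Start:

```python
task, result = load_task("artifacts/tasks/topic-map-weak-to-strong-alignment-1a2b3c4d")
print(task["status"], result["summary"] if result else "(no result yet)")
```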

`report.md`

This is the human-readable version of the result.

It is not identical to task.json. It is derived from the structured output and meant for reading, not programmatic consumption.

`outputs/critic.json` and `critic.md`

These are created by research critique <task_id>. The JSON file is the structured critic result; the markdown file is the human-readable version. Critiques do not overwrite the original outputs/result.json.

`evidence.json`

This stores the evidence references attached to the task.

`inputs/pdf_text.txt`

This stores extracted PDF text for human audit when PDF text extraction succeeds.

`logs/events.log`

This stores task lifecycle events such as creation and finalization.

Verified End-To-End Paths

Real model-backed end-to-end runs have been verified for:

topic
analyze-paper
plan-experiments

The closed critic loop (topic → collect --source openalex → critique) has also been verified with a real model and real network: OpenAlex was live-verified, and arXiv instability (timeout / HTTP 429) was reproduced, with OpenAlex providing resilience so the task still completed.

analyze-repo is implemented and available, but the current implementation is still a local static analysis pass rather than a richer execution-aware repo workflow.
