AI Researcher

CLI-first Research Copilot prototype. The user-facing CLI is still v0-style, while the codebase now contains a first v1 workflow slice.

For a fuller Chinese user manual, see docs/user-manual.zh-CN.md.

What It Does

This project is not a general-purpose autonomous researcher. The current CLI focuses on four task types plus seven inspection/workflow helpers:

  • topic Turns a research topic into a structured research map.
  • analyze-paper Analyzes a local file or a paper identifier/URL and produces a structured, paper-card-style output.
  • analyze-repo Analyzes a local repository and produces a first-pass structure and reproducibility review.
  • plan-experiments Turns a research idea into multiple falsifiable experiment plans.
  • search Searches arXiv and (optionally, via --source) OpenAlex, and prints matching papers to stdout.
  • collect Searches for papers and appends them as de-duplicated evidence refs to an existing task, without changing its workflow status.
  • critique Runs the structured critic on a task and advances it through the criticizing → criticized gate. This is the only way past the critic gate.
  • status Prints the saved workflow status, suggested next transition, and whether a critique or human decision is still required.
  • advance Advances a saved task through the workflow state machine after explicit human input. Refuses to leave criticizing until critique has run.
  • report Rewrites report.md for an existing task from the saved task state.
  • review Prints the path to the saved task.json for an existing task.

Current Scope

What the current prototype does:

  • persists every task to disk under artifacts/
  • writes both machine-readable and human-readable outputs
  • keeps prompts as editable files under prompts/
  • validates model JSON before writing structured outputs
  • supports OpenAI-compatible model generation when .env is configured
  • falls back to built-in deterministic templates if no model config is present
  • supports local paper/file input and remote paper download with deduplication
  • extracts first-pass PDF text with pdfminer.six, falling back to pypdf when needed
  • detects reference-section offsets and avoids sending bibliography text in paper prompt excerpts
  • flags ungrounded paper_card.method, paper_card.datasets, and paper_card.metrics entries when they do not appear in extracted text
  • adds a small related-paper search context to topic, analyze-paper, and plan-experiments when the model path is enabled (analyze-paper keeps these external papers separate from in-paper grounding, so other papers’ abstracts can’t masquerade as verbatim quotes from the analyzed paper)
  • searches papers through a multi-provider abstraction (arXiv + OpenAlex) with cross-source de-duplication; OpenAlex is opt-in via --source {arxiv,openalex,all} on search and collect, while the default and all internal auto-search remain arXiv-only, and a single provider failing degrades to a warning plus the other providers’ results
  • retries the model once when its JSON parses but fails schema validation before falling back to the deterministic template (transport/parse errors are already retried inside the client)
  • enforces the critic as a hard, state-machine gate: new tasks stop at criticizing, critique is the only edge to criticized, and no task can reach done without it
  • holds a sycophantic re-critique at the gate: a re-run that flips toward passing while conceding >50% of the prior round’s concerns (tracked in task.critique_log across regenerate) raises a concession_alarm and needs critique --override to proceed
  • regenerates a task’s analysis in place (regenerate) and auto-drives the workflow to the next decision point (loop), stopping at every critic block or human gate rather than forcing past it
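
The validate-then-retry behaviour listed above can be illustrated with a minimal sketch. It is not the project's code: the function names, the two-key schema, and the `call_model` callable are all hypothetical, and the real schema checks are presumably richer.

```python
import json
from typing import Callable

# Hypothetical minimal schema: the real validators are not shown in this README.
REQUIRED_KEYS = ("summary", "next_action")


def validate(payload: dict) -> bool:
    """Return True when every required key is present and non-empty."""
    return all(payload.get(key) for key in REQUIRED_KEYS)


def generate_with_fallback(call_model: Callable[[], str], fallback: dict) -> dict:
    """Try the model at most twice, then use the deterministic template."""
    for _attempt in range(2):  # first try plus one schema-validation retry
        try:
            payload = json.loads(call_model())
        except json.JSONDecodeError:
            # Parse/transport errors are assumed to be retried inside the client,
            # so this sketch goes straight to the fallback.
            break
        if isinstance(payload, dict) and validate(payload):
            return payload
    return fallback
```

The retry only covers the case where the JSON parses but fails validation, which mirrors the split described in the list: the client owns transport and parse retries, the caller owns the schema retry.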

Current boundaries:

  • loop is bounded by design: it never auto-resolves a critic block or auto-approves the human-decision gate
  • large-scale experiment execution is not automated
  • remote repositories are not cloned automatically
  • PDF parsing extracts text, sections, and figure/table caption lines (placeholder-level), but does not handle formulas or the figures/tables themselves
  • report is a regeneration helper and can overwrite the original generated report.md

Environment

Preferred local setup:

uv venv .venv
uv pip install -e . --python .venv/bin/python

If uv is unavailable, venv also works:

python3 -m venv .venv
./.venv/bin/pip install -e .

For an OpenAI-compatible model provider, create a local .env from .env.example and fill in your endpoint, key, and default model:

cp .env.example .env

Expected variables:

  • OPENAI_API_KEY
  • OPENAI_BASE_URL
  • OPENAI_MODEL
  • OPENAI_TIMEOUT_SECONDS
  • OPENAI_MAX_RETRIES
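
As a rough illustration of how these variables might be consumed, the sketch below reads them into a small config object and returns None when no key is set, which is where the template fallback would apply. `ModelConfig`, `load_model_config`, and every default value are assumptions made for this example, not the project's actual loader or defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class ModelConfig:
    api_key: str
    base_url: str
    model: str
    timeout_seconds: float
    max_retries: int


def load_model_config(env: Mapping[str, str] = os.environ) -> Optional[ModelConfig]:
    """Build a config from the environment, or None if no API key is set."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        return None  # caller would fall back to deterministic templates
    return ModelConfig(
        api_key=api_key,
        base_url=env.get("OPENAI_BASE_URL", ""),  # placeholder default
        model=env.get("OPENAI_MODEL", ""),  # placeholder default
        timeout_seconds=float(env.get("OPENAI_TIMEOUT_SECONDS", "60")),  # assumed default
        max_retries=int(env.get("OPENAI_MAX_RETRIES", "2")),  # assumed default
    )
```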

Quick Start

Use the installed CLI inside the environment:

./.venv/bin/research topic "weak-to-strong alignment"
./.venv/bin/research analyze-paper README.md
./.venv/bin/research analyze-repo .
./.venv/bin/research plan-experiments "skill mutation for research agents"

You can also use the repo-local launcher without installation:

python3 research topic "research map"

Each core command prints the generated task.json path, for example:

/absolute/project/path/artifacts/tasks/topic-map-weak-to-strong-alignment-1a2b3c4d/task.json

The task_id is the directory name under artifacts/tasks/, such as topic-map-weak-to-strong-alignment-1a2b3c4d. Use that ID with the helper commands:

./.venv/bin/research report topic-map-weak-to-strong-alignment-1a2b3c4d
./.venv/bin/research review topic-map-weak-to-strong-alignment-1a2b3c4d
./.venv/bin/research status topic-map-weak-to-strong-alignment-1a2b3c4d
./.venv/bin/research advance topic-map-weak-to-strong-alignment-1a2b3c4d --approve --reason "reviewed"

Commands And Outputs

topic

Example:

./.venv/bin/research topic "weak-to-strong alignment in research copilots"

Input:

  • one research topic string

Primary output fields in outputs/result.json:

  • summary
  • research_questions
  • hypotheses
  • reading_queue
  • evidence_needs
  • next_action

Human-readable output:

  • report.md with summary, questions, hypotheses, reading queue, and next action

analyze-paper

Example:

./.venv/bin/research analyze-paper agent.md
./.venv/bin/research analyze-paper 2401.12345
./.venv/bin/research analyze-paper https://arxiv.org/abs/2401.12345

Accepted input:

  • local file path
  • arXiv identifier
  • DOI
  • URL

Primary output fields in outputs/result.json:

  • summary
  • paper_input
  • resolved_input
  • claims_checklist
  • paper_card
  • ungrounded_paper_card_fields when model output contains method, dataset, or metric entries not found in extracted text
  • next_action

Notes:

  • remote papers are downloaded into downloads/
  • downloads are deduplicated through downloads/index.json
  • PDF inputs are parsed with pdfminer.six, falling back to pypdf when needed
  • detected references sections are trimmed out of the prompt body excerpt
  • current analysis still uses a limited content hint; it is not a full paper structure parser
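
One way to picture the grounding check behind ungrounded_paper_card_fields is a case-insensitive substring test against the extracted text. The sketch below is an assumption about the general idea, not the project's matcher; `find_ungrounded` is a hypothetical name, and the real check may normalise text differently.

```python
def find_ungrounded(paper_card: dict, extracted_text: str) -> dict:
    """Map each checked field to the entries that never appear in the text."""
    # Collapse whitespace so line breaks inside the PDF text do not hide a match.
    haystack = " ".join(extracted_text.lower().split())
    ungrounded = {}
    for field in ("method", "datasets", "metrics"):
        value = paper_card.get(field) or []
        entries = [value] if isinstance(value, str) else list(value)
        missing = [
            entry
            for entry in entries
            if " ".join(str(entry).lower().split()) not in haystack
        ]
        if missing:
            ungrounded[field] = missing
    return ungrounded
```

A plain substring test is deliberately strict: a paraphrased dataset name would be flagged, which errs on the side of surfacing entries for a human to check.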

analyze-repo

Example:

./.venv/bin/research analyze-repo .

Accepted input:

  • local repository path only

Primary output fields in outputs/result.json:

  • summary
  • repo_path
  • file_count
  • top_file_types
  • readme_excerpt
  • dependency_files
  • entrypoints
  • test_files
  • config_files
  • train_eval_scripts
  • ci_configs
  • notebooks
  • data_dirs
  • run_commands
  • critical_checks
  • reproducibility_notes
  • readme_assessment when generated by the model path
  • next_action

Notes:

  • this is currently a static local repo inspection
  • it does not clone remote repositories
  • it does not automatically execute repo code as part of analysis

plan-experiments

Example:

./.venv/bin/research plan-experiments "Evaluate whether structured critique protocols improve weak-to-strong alignment fidelity in research copilots"

Input:

  • one research idea string

Primary output fields in outputs/result.json:

  • summary
  • experiment_plans
  • next_action

Each item in experiment_plans includes:

  • name
  • objective
  • baseline
  • metrics
  • failure_conditions
  • risks
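
When consuming outputs/result.json from a script, a quick completeness check over the fields listed above can be useful. This is a hypothetical helper written for illustration; it is not part of the CLI.

```python
PLAN_FIELDS = ("name", "objective", "baseline", "metrics", "failure_conditions", "risks")


def missing_plan_fields(result: dict) -> dict:
    """Return {plan label: [missing field, ...]} for incomplete plans."""
    problems = {}
    for index, plan in enumerate(result.get("experiment_plans", [])):
        missing = [field for field in PLAN_FIELDS if field not in plan]
        if missing:
            # Fall back to a positional label when the plan has no name.
            problems[plan.get("name", f"plan[{index}]")] = missing
    return problems
```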

search

Example:

./.venv/bin/research search "weak-to-strong alignment" --max 5
./.venv/bin/research search "weak-to-strong alignment" --source all

Behavior:

  • searches for related papers across providers; --source {arxiv,openalex,all} selects the backend (default arxiv, so OpenAlex is opt-in)
  • prints results to stdout
  • prints provider failures (e.g. an arXiv timeout) to stderr while still returning the surviving providers’ results
  • does not currently write to artifacts/ or attach results to an existing task
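
The degrade-to-warning behaviour can be sketched as a loop over provider callables. Everything here is illustrative: `search_all`, the provider signature, and de-duplication by normalised title are assumptions, and the real cross-source key may differ.

```python
import sys
from typing import Callable, Dict, List

Paper = Dict[str, str]
Provider = Callable[[str, int], List[Paper]]  # hypothetical provider signature


def search_all(query: str, providers: Dict[str, Provider], max_results: int = 5) -> List[Paper]:
    """Query each provider, warn on failures, and merge without duplicates."""
    merged: List[Paper] = []
    seen = set()
    for name, provider in providers.items():
        try:
            papers = provider(query, max_results)
        except Exception as exc:  # one failing provider must not sink the rest
            print(f"warning: {name} failed: {exc}", file=sys.stderr)
            continue
        for paper in papers:
            # Assumed de-dup key: lower-cased title with collapsed whitespace.
            key = " ".join(paper.get("title", "").lower().split())
            if key not in seen:
                seen.add(key)
                merged.append(paper)
    return merged
```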

collect

Example:

./.venv/bin/research collect <task_id> --source openalex --max 3
./.venv/bin/research collect <task_id> --query "weak-to-strong alignment evaluation"

Behavior:

  • searches for papers and appends them as de-duplicated evidence refs (by source_type + source_ref) to an existing task
  • defaults the query to the task title; --query overrides it, --max defaults to 3, and --source {arxiv,openalex,all} selects the backend (default arxiv)
  • updates evidence.json, task.json evidence refs, adds a collect_evidence entry to the decision log, and logs to logs/events.log
  • does not change the task’s workflow status

This closes the critic loop: when critique sends a task back to collecting_evidence, run collect to attach external corroboration, then advance (collecting_evidence → criticizing), then run critique again.
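
De-duplication by source_type + source_ref amounts to treating that pair as a key. A minimal sketch, assuming each evidence ref is a dict carrying both fields (`merge_evidence` is a hypothetical name):

```python
def merge_evidence(existing: list, incoming: list) -> list:
    """Append incoming refs whose (source_type, source_ref) pair is new."""
    seen = {(ref["source_type"], ref["source_ref"]) for ref in existing}
    merged = list(existing)  # leave the caller's list untouched
    for ref in incoming:
        key = (ref["source_type"], ref["source_ref"])
        if key not in seen:
            seen.add(key)
            merged.append(ref)
    return merged
```

Existing refs keep their position and incoming duplicates are dropped, so repeated collect runs with the same query would not grow the list.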

status

Example:

./.venv/bin/research status <task_id>

Behavior:

  • prints the saved task status, next action, suggested next status, whether a human decision is required, and whether a critique is still required

critique

Example:

./.venv/bin/research critique <task_id>

Behavior:

  • runs the structured critic (model path or deterministic fallback) and writes outputs/critic.json + critic.md
  • appends a critic ref (with recommended_action and open-concern count) to task.json
  • if the task is at criticizing, advances it to criticized — the only edge out of the critic gate
  • applies anti-sycophancy concession discipline: each critique appends a snapshot to task.critique_log (preserved across regenerate). When a re-critique flips toward passing while dropping >50% of the prior round’s concerns, or concedes again right after a prior concession, it raises a concession_alarm and holds the task at collecting_evidence; only critique --override REASON lets it proceed. This stops the model from rubber-stamping a regenerated result for the sake of conversational harmony
  • does not overwrite outputs/result.json
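
The >50% concession rule can be made concrete with a small predicate over two critique snapshots. This is a sketch under stated assumptions: the `concerns` field, the `"pass"` action value, and the function name are hypothetical, and the "concedes again right after a prior concession" condition is not modelled.

```python
def concession_alarm(prior: dict, current: dict, passing: str = "pass") -> bool:
    """True when a re-critique flips to passing and drops >50% of prior concerns."""
    prior_concerns = set(prior["concerns"])
    if not prior_concerns:
        return False  # nothing to concede
    flipped = (
        prior["recommended_action"] != passing
        and current["recommended_action"] == passing
    )
    dropped = prior_concerns - set(current["concerns"])
    return flipped and len(dropped) / len(prior_concerns) > 0.5
```

With four prior concerns, dropping three (75%) while flipping to passing trips the alarm, whereas dropping exactly two (50%) does not, because the threshold is strict.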

advance

Example:

./.venv/bin/research advance <task_id> --approve --reason "reviewed"

Behavior:

  • advances task state after explicit human input
  • refuses to advance a task that is at criticizing and tells you to run critique first
  • updates task.json and logs/events.log
  • does not call the model, search, parse PDFs, or execute experiments
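
The gate behaviour can be pictured as a transition table that simply has no advance edge out of criticizing. The sketch below is illustrative only: the status names come from this README, but the edges marked as assumed, `GateError`, and the `advance` signature are guesses, not the contents of src/ai_researcher/workflow.py.

```python
# Edges that a plain `advance` may take in this sketch.
ADVANCE_EDGES = {
    "collecting_evidence": "criticizing",
    "criticized": "needs_human_decision",  # assumed edge
    "needs_human_decision": "done",  # assumed edge, taken only with approval
}


class GateError(RuntimeError):
    """Raised when a transition is refused."""


def advance(status: str, approve: bool = False) -> str:
    """Return the next status, refusing to skip the critic or the human gate."""
    if status == "criticizing":
        raise GateError("run `research critique <task_id>` first")
    if status not in ADVANCE_EDGES:
        raise GateError(f"no advance edge from {status!r}")
    if status == "needs_human_decision" and not approve:
        raise GateError("explicit human approval is required")
    return ADVANCE_EDGES[status]
```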

regenerate

Example:

./.venv/bin/research regenerate <task_id>

Behavior:

  • re-runs the task’s generator in place under the same task_id, overwriting result.json / report.md (collect only attaches evidence and advance only changes state — neither recomputes the product)
  • merges freshly generated evidence onto the existing evidence (de-duped by source_type + source_ref), so externally collect-ed corroboration survives
  • clears the prior pass’s now-stale critic.json / critic.md / critic refs, then rewinds the task to criticizing to face the gate again
  • only accepts tasks inside the critic loop (collecting_evidence / criticizing / criticized)
  • this is how you clear a block that collect cannot: when the critic blocks because a field is absent from the paper itself (in-paper grounding), only recomputing the analysis helps, not more external evidence

loop

Example:

./.venv/bin/research loop <task_id>

Behavior:

  • auto-drives the task: mechanical states use advance, and at criticizing it runs the real critic (reusing critique/advance, so the gate logic stays single-sourced)
  • stops at every genuine decision point: a critic block (criticizing → collecting_evidence), the human-decision gate (needs_human_decision), terminal states (done / failed), and a safety step cap
  • deliberately does not auto-collect/regenerate to push past the critic gate — the gate exists to stop blind advancement, so resolving a block is a human call. loop stops and prints the remediation commands (collect / regenerate / critique --override); resolve, then run loop again to continue from where it stopped
  • prints each step’s state transition, where it stopped, and why
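
The stopping rules above can be sketched as a bounded driver around a single-step function. `drive`, the `step` callable, and the default cap of 10 are assumptions for illustration; the real loop reuses critique and advance rather than an injected callable.

```python
from typing import Callable, List, Tuple

STOP_STATES = {"needs_human_decision", "done", "failed"}


def drive(status: str, step: Callable[[str], str], max_steps: int = 10) -> Tuple[str, List[str], str]:
    """Step the task until a decision point; return (status, trail, reason)."""
    trail = [status]
    for _ in range(max_steps):
        if status in STOP_STATES:
            return status, trail, "decision point or terminal state"
        nxt = step(status)
        trail.append(nxt)
        if status == "criticizing" and nxt == "collecting_evidence":
            # The critic blocked: stop and leave remediation to a human.
            return nxt, trail, "critic block"
        status = nxt
    return status, trail, "step cap reached"
```

The driver never retries a critic block on its own, which keeps it consistent with the rule that resolving a block is a human call.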

report

Example:

./.venv/bin/research report <task_id>

Behavior:

  • regenerates report.md from saved task state
  • prints the report path
  • overwrites the current report.md; the regenerated file is a task-state report and may be less detailed than the original generated analysis report

review

Example:

./.venv/bin/research review <task_id>

Behavior:

  • prints the path to the saved task.json

Test

./.venv/bin/python -m unittest discover -s tests

Output Layout

  • artifacts/tasks/<task_id>/task.json
  • artifacts/tasks/<task_id>/report.md
  • artifacts/tasks/<task_id>/evidence.json
  • artifacts/tasks/<task_id>/inputs/pdf_text.txt when extractable PDF text exists
  • artifacts/tasks/<task_id>/logs/events.log
  • artifacts/tasks/<task_id>/outputs/result.json
  • downloads/ stores fetched paper files
  • downloads/index.json stores download dedup metadata
  • prompts/<task>/system.txt and prompts/<task>/user.txt store editable prompt assets
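
For scripts that post-process a task, the per-task layout above can be captured in one helper. `task_paths` is a hypothetical convenience written for this example, and it uses POSIX-style paths relative to the project root.

```python
from pathlib import PurePosixPath


def task_paths(task_id: str, root: str = "artifacts/tasks") -> dict:
    """Return the expected per-task file locations, keyed by a short label."""
    base = PurePosixPath(root) / task_id
    return {
        "task": base / "task.json",
        "report": base / "report.md",
        "evidence": base / "evidence.json",
        "pdf_text": base / "inputs" / "pdf_text.txt",  # only present for PDF inputs
        "events": base / "logs" / "events.log",
        "result": base / "outputs" / "result.json",
    }
```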

v1 Slice Already Present

The latest code includes a small v1 foundation, but it is not a complete v1 product yet:

  • src/ai_researcher/workflow.py defines research-loop statuses and transition validation.
  • TaskRecord already has fields for research_question, hypotheses, evidence_needs, critic_refs, and decision_log.
  • docs/research-copilot-v1-spec.md describes the target v1 workflow and open questions.

The current executable CLI exposes:

topic | analyze-paper | analyze-repo | plan-experiments | search | collect | critique | status | advance | regenerate | loop | report | review

collect is the real, wired command (see the collect section above) and is distinct from the still-unimplemented names below. Do not document start, collect-papers, add-evidence, evidence, or propose-run as available commands until they are wired into src/ai_researcher/cli.py.

File Meanings

task.json

This is the task state snapshot. It contains:

  • task_id
  • task_type
  • status
  • input
  • summary
  • evidence_refs
  • next_action
  • timestamps

It is the control record for the task, not the full analysis body.

outputs/result.json

This is the main structured output for the task.

This file contains the task-specific payload, for example:

  • topic mapping fields for topic
  • paper_card for analyze-paper
  • repo analysis fields for analyze-repo
  • experiment_plans for plan-experiments

If you want the actual structured result, this is the main file to inspect.

report.md

This is the human-readable version of the result.

It is not identical to task.json. It is derived from the structured output and meant for reading, not programmatic consumption.

outputs/critic.json and critic.md

These are created by research critique <task_id>. The JSON file is the structured critic result; the markdown file is the human-readable version. Critiques do not overwrite the original outputs/result.json.

evidence.json

This stores the evidence references attached to the task.

inputs/pdf_text.txt

This stores extracted PDF text for human audit when PDF text extraction succeeds.

logs/events.log

This stores task lifecycle events such as creation and finalization.

Verified End-To-End Paths

Real model-backed end-to-end runs have been verified for:

  • topic
  • analyze-paper
  • plan-experiments

The closed critic loop (topic → collect --source openalex → critique) has also been verified with a real model and real network: OpenAlex was live-verified, and arXiv instability (timeout / HTTP 429) was reproduced, with OpenAlex providing resilience so the task still completed.

analyze-repo is implemented and available, but the current implementation is still a local static analysis pass rather than a richer execution-aware repo workflow.
