Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training
A state-of-the-art open-source agent built on free tools wherever possible; the only paid tool is the Google Search API, which can be replaced with the free DuckDuckGo API if needed.
A fully reproducible, open-source SFT training recipe that outperforms RL-based models such as WebDancer and WebSailor, with no RL required.
apt-get install -y poppler-utils default-jre libreoffice-common libreoffice-java-common libreoffice ffmpeg
# for ck_web
sh ck_pro/ck_web/_web/run_local.sh
IMPORTANT: it is recommended to run this program in a sandbox, since the generated Python code is executed directly and there are currently no safety checks. (Disable sudo for your user to ensure safety.)
# run as root (replace ${USER} with the target non-root username if needed)
echo "${USER}" 'ALL=(ALL) NOPASSWD: !ALL' | tee /etc/sudoers.d/${USER}-rule  # deny all sudo commands for this user
chmod 440 /etc/sudoers.d/${USER}-rule  # sudoers files must be mode 440
deluser ${USER} sudo  # remove the user from the sudo group
hostnamectl set-hostname localhost
brew install --cask libreoffice
brew install poppler
brew install ffmpeg
# for ck_web
sh ck_pro/ck_web/_web/run_local_mac.sh
IMPORTANT: it is recommended to run this program in a sandbox, since the generated Python code is executed directly and there are currently no safety checks. (Disable sudo for your user to ensure safety.)
# run as root (replace $USER with the target non-admin username if needed)
echo "${USER}" 'ALL=(ALL) NOPASSWD: !ALL' | tee /etc/sudoers.d/${USER}-rule  # deny all sudo commands for this user
chmod 440 /etc/sudoers.d/${USER}-rule  # sudoers files must be mode 440
dseditgroup -o edit -d "$USER" admin  # remove the user from the admin group
scutil --set HostName localhost
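Since the framework executes generated Python directly, running it inside an isolated process with a hard timeout adds one more layer of protection on top of the sandbox advice above. The helper below is a minimal illustrative sketch, not part of ck_pro; the function name and timeout handling are our own assumptions.

```python
# Illustrative sketch (NOT part of ck_pro): run untrusted generated Python
# in a separate subprocess and kill it if it exceeds a timeout.
import subprocess
import sys

def run_untrusted(code: str, timeout: float = 10.0) -> str:
    """Run generated Python in a child process; return its stdout,
    or the sentinel '<timeout>' if it runs too long."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.stdout
    except subprocess.TimeoutExpired:
        return "<timeout>"
```

A subprocess with a timeout does not replace a real sandbox (it does not restrict file or network access), so the sudo/admin hardening above is still recommended.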
Example (A simple example)
See ck_main/_test for a simple example and its corresponding outputs.
export PYTHONPATH=/your/path/to/CogKernel-Pro
# Assume we have set up a vLLM model server and a web-browser server
WEB_IP=localhost:3001 # web-browser server
LLM_URL=http://xx.xx.xx.xx:8080/v1/chat/completions # vLLM model server
VLM_URL=http://xx.xx.xx.xx:8081/v1/chat/completions # vLLM multimodal model server
#LLM_URL=gpt:gpt-4.1 # using gpt
#VLM_URL=gpt:gpt-4.1 # using gpt
#LLM_URL=claude: # using claude
#VLM_URL=claude: # using claude
# run simple test
MAIN_ARGS="{'web_agent': {'model': {'call_target': '${LLM_URL}'}, 'model_multimodal': {'call_target': '${VLM_URL}'}, 'web_env_kwargs': {'web_ip': '${WEB_IP}'}}, 'file_agent': {'model': {'call_target': '${LLM_URL}'}, 'model_multimodal': {'call_target': '${VLM_URL}'}}, 'model': {'call_target': '${LLM_URL}'}}"
# use "NO_NULL_STDIN=1" for easier debugging
# you can also remove `--input` field to directly input your task from stdin
# you can also remove `-mpdb` flag to run the program directly instead of in debugging mode
NO_NULL_STDIN=1 python3 -u -mpdb -m ck_pro.ck_main.main --updates "${MAIN_ARGS}" --input /your/path/to/simple_test.jsonl --output /your/path/to/simple_test.output.jsonl |& tee _log_simple_test
less -R _log_simple_test # use 'less -R' to see the colored outputs
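Note that MAIN_ARGS above is a Python-style dict literal (hence `False` rather than JSON `false`) that overrides nested fields of the default configuration. The exact parsing inside ck_pro is not shown here; the sketch below, with illustrative names, shows one way such an updates string can be parsed safely and merged into a nested config.

```python
# Sketch (not the actual ck_pro implementation): parse a Python-dict-literal
# updates string like MAIN_ARGS and recursively merge it into a nested config.
import ast

def apply_updates(config: dict, updates_str: str) -> dict:
    """Parse updates_str with ast.literal_eval (safe: literals only)
    and merge it into config, preserving untouched nested keys."""
    updates = ast.literal_eval(updates_str)

    def merge(base: dict, upd: dict) -> None:
        for key, value in upd.items():
            if isinstance(value, dict) and isinstance(base.get(key), dict):
                merge(base[key], value)  # recurse into nested dicts
            else:
                base[key] = value  # overwrite leaves (or add new keys)

    merge(config, updates)
    return config

config = {"model": {"call_target": "default", "temperature": 0.0}}
apply_updates(config, "{'model': {'call_target': 'gpt:gpt-4.1'}}")
# config["model"]["call_target"] is updated; "temperature" is preserved
```

Using `ast.literal_eval` (rather than `eval`) keeps the parse restricted to literals, which is why booleans in the updates string must be written as Python `True`/`False`.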
Example (Experimenting on the GAIA dataset)
# Step 1: prepare data
# decompress the GAIA data (or download it yourself from Hugging Face)
# -> assume all GAIA-related input files are in the same directory as the input JSON meta-file
unzip /your/path/to/CogKernel-Pro/Evaluation/gaia2504.zip
# Step 2: prepare the web service (we recommend using a PC or laptop for a better network connection)
# -> prepare things according to "./ck_web/_web/run_local.sh"
#LISTEN_PORT=3001 npm start
#WEB_IP=localhost:3001 # web-browser server
# Step 3: prepare a vllm instance for model calling
# use gpt
#LLM_URL=gpt:gpt-4.1
#VLM_URL=gpt:gpt-4.1
#export AZURE_OPENAI_ENDPOINT="YOUR_ENDPOINT"
#export AZURE_OPENAI_API_KEY="YOUR_API_KEY"
#export AZURE_OPENAI_API_VERSION="YOUR_API_VERSION"
# or use claude
#LLM_URL=claude: # using claude
#VLM_URL=claude: # using claude
#export AWS_ACCESS_KEY="YOUR_KEY"
#export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_KEY"
LLM_URL=http://xx.xx.xx.xx:8080/v1/chat/completions # vllm model server
VLM_URL=http://xx.xx.xx.xx:8081/v1/chat/completions # for VLM
# Step 4: Setup search engine
# either using google api
#export SEARCH_BACKEND="Google"
#export SEARCH_API_KEY="YOUR_API_KEY"
#export SEARCH_CSE_ID="YOUR_CSE_ID"
# or simply use DuckDuckGo
export SEARCH_BACKEND="DuckDuckGo"
# Step 5: run
export PYTHONPATH=/your/path/to/CogKernel-Pro/
#pip install ... # see above in `Environment`
# it is more stable to launch a new web browser for each web call; set WEB_PORT (the web-browser service's port) and WEB_DIR (the main directory of the web-browser service)
# moreover, it is slightly better to use non-boxed screenshots (make sure to update to the latest `server.js` and set screenshot_boxed=False)
WEB_DIR=/path/to/_web/ # where we put `server.js` and related `node_modules`
WEB_PORT=3001
MAIN_ARGS="{'web_agent': {'model': {'call_target': '${LLM_URL}'}, 'model_multimodal': {'call_target': '${VLM_URL}'}, 'web_env_kwargs': {'web_ip': 'localhost:${WEB_PORT}', 'web_command': 'cd ${WEB_DIR}; LISTEN_PORT=${WEB_PORT} npm start', 'screenshot_boxed': False}}, 'file_agent': {'model': {'call_target': '${LLM_URL}'}, 'model_multimodal': {'call_target': '${VLM_URL}'}}, 'model': {'call_target': '${LLM_URL}'}}"
python3.12 -u -m ck_pro.ck_main.main --updates "${MAIN_ARGS}" --input /your/path/to/gaia_dev.jsonl --output /your/path/to/gaia_dev.output.jsonl |& tee -a _log_gaia_dev
# Step 6: analyze and check the output
python -m ck_pro.ck_main.scripts.analyze -f /your/path/to/output/gaia_dev.output.jsonl -b 0
Extra Running Config
# call claude+thinking for the outer main agent
LLM_URL=gpt:gpt-4.1 # still use gpt-4.1 for the sub-agents
VLM_URL=gpt:gpt-4.1
export AZURE_OPENAI_ENDPOINT="YOUR_ENDPOINT"
export AZURE_OPENAI_API_KEY="YOUR_API_KEY"
export AZURE_OPENAI_API_VERSION="YOUR_API_VERSION"
export AWS_ACCESS_KEY="YOUR_KEY"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_KEY"
MAIN_ARGS="{'web_agent': {'model': {'call_target': '${LLM_URL}'}, 'model_multimodal': {'call_target': '${VLM_URL}'}, 'web_env_kwargs': {'web_ip': 'localhost:${WEB_PORT}', 'web_command': 'cd ${WEB_DIR}; LISTEN_PORT=${WEB_PORT} npm start', 'screenshot_boxed': False}}, 'file_agent': {'model': {'call_target': '${LLM_URL}'}, 'model_multimodal': {'call_target': '${VLM_URL}'}}, 'model': {'thinking': 'True', 'call_target': 'claude:', 'call_kwargs': {'temperature': 0.2, 'top_p': 0.95, 'max_tokens': 4096}}}" # use claude+thinking for main-agent, allowing more max_token budgets
Enabling Reflection
Extra configs required:
# configuration of the evaluator LLM
export EVALUATOR_LLM=gpt:gpt-4.1
# langchain
export AZURE_OPENAI_API_VERSION=2025-01-01-preview
export OPENAI_API_TYPE=azure_ai
export AZURE_INFERENCE_ENDPOINT=$AZURE_OPENAI_ENDPOINT
export AZURE_INFERENCE_CREDENTIAL=$AZURE_OPENAI_API_KEY
Extra arguments when running ck_pro.ck_main.main: --inference-time-evaluation-method, where you can choose from no_answer and gpt_judge. no_answer simply checks whether the agent has returned anything meaningful, while gpt_judge uses the LLM specified by EVALUATOR_LLM to evaluate the trajectory and decide whether a retry is needed.
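The actual evaluators live inside ck_pro; as an illustration only, a `no_answer`-style check amounts to flagging empty or placeholder final outputs for retry. The function name and placeholder set below are our own assumptions.

```python
# Illustrative sketch of a `no_answer`-style check (the real implementation
# is inside ck_pro; names and the placeholder set here are assumptions).
def needs_retry_no_answer(final_output):
    """Return True when the agent returned nothing meaningful,
    signalling that the query should be retried."""
    if final_output is None:
        return True
    text = final_output.strip().lower()
    return text in {"", "none", "n/a", "unknown"}
```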
Saved Data Format
The Session class (session.py) is used to save trajectories.
The analysis script (analyze.py) can help in understanding the data structure.
# one instance in one json-line
INSTANCE = {
"id": "Task ID",
"task": "Task Description",
"session": { # corresponding to the class of Session
"id": "Session ID",
"info": {...}, # other information such model calling token counts
"task": "Original Task Description",
"steps": [ # information for each step
{
"step_idx": 0,
"plan": {
"thought": "Model's thought",
"code": "Model's output code",
"state": {...}, # updated state
"llm_input": [], # model's direct input messages
"llm_output": "Model's raw output", # model's raw output
},
"action": {
"...": ..., # similar to plan
# "observation": ..., # simple outputs from code execution
# if calling a sub-agent, we have more complex structures storing the session from the sub-agent
"observation": { # see the class of AgentResult
"output": "formatted outputs",
"log": "logs",
"task": "Task for the sub-agent",
"session": {...},
},
},
}, # step 0
..., # later steps
{
"...": ..., # plan and action
"end": { # in the final step, we may also have an ending module if configured
"..." # fields are similar to plan and action
}
} # final step
],
},
}
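Following the field names documented above, a saved output JSONL can be walked straightforwardly. The helper below is a minimal sketch under the assumption that each line matches this schema; adjust the field names if your version of the format differs.

```python
# Sketch: collect (step_idx, plan thought) pairs from an output JSONL,
# following the session structure documented above.
import json

def step_thoughts(path):
    """Return (step_idx, thought) pairs for every step of every instance."""
    pairs = []
    with open(path) as f:
        for line in f:
            inst = json.loads(line)
            for step in inst.get("session", {}).get("steps", []):
                plan = step.get("plan", {})
                pairs.append((step.get("step_idx"), plan.get("thought", "")))
    return pairs
```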
Data
The queries and answers of Multi-hop URLQA and AgentWebQA are here. Due to licensing restrictions, only a portion of the SFT data is permitted for open-source release; it is available here.
We release the checkpoint of fine-tuned Qwen3-8B-CK-Pro here.
Trajectory sampling
We use gpt-4.1 to sample trajectories. You need to download the queries first and then run the main agent execution code as in the previous sections. You may add the additional arguments --sampling-mode --evaluation-method llm_score --max_retry_num 3 to sample the same query up to 3 times until it succeeds.
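The retry behavior implied by --max_retry_num can be sketched as follows; the function and callback names here are illustrative, not the actual ck_pro interfaces.

```python
# Illustrative sketch of --max_retry_num 3: resample the same query until
# the evaluator marks the trajectory successful (names are assumptions).
def sample_with_retries(run_once, is_success, max_retry_num=3):
    """run_once() samples one trajectory; is_success() evaluates it.
    Returns (trajectory, attempts_used), or (None, max_retry_num) on failure."""
    for attempt in range(max_retry_num):
        trajectory = run_once()
        if is_success(trajectory):
            return trajectory, attempt + 1
    return None, max_retry_num
```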
Friendly links to relevant agent works from Tencent AI Lab
PersonaHub: Scaling Synthetic Data Creation with 1,000,000,000 Personas
MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment
WebAggregator: Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents
Cite this work
@misc{fang2025cognitivekernelpro,
title={Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training},
author={Tianqing Fang and Zhisong Zhang and Xiaoyang Wang and Rui Wang and Can Qin and Yuxuan Wan and Jun-Yu Ma and Ce Zhang and Jiaqi Chen and Xiyun Li and Hongming Zhang and Haitao Mi and Dong Yu},
year={2025},
eprint={2508.00414},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2508.00414},
}
System Prompts
Prompts are saved in the prompts.py file of each agent, such as ck_pro/ck_main/prompts.py and ck_pro/ck_web/prompts.py. Check out the detailed notes for more details.
Rejection Sampling and SFT Data Post-Process
Run convert_sft.py and choose a type of rejection sampling (llm_judge for a LangChain LLM score, or em for exact match).
Contact
tianqfang(at)tencent(dot)com