PaSa: An LLM Agent for Comprehensive Academic Paper Search ACL 2025 (Main)
ByteDance Seed
Introduction
We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4 for paraphrased queries, chatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50. It also exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision.
Quick Start
You can prepare a detailed description of your academic search needs, and search for papers on https://pasa-agent.ai
Architecture
The PaSa system consists of two LLM agents, Crawler and Selector. The Crawler processes the user query and can access papers from the paper queue. It can autonomously invoke the search tool, expand citations, or stop processing of the current paper. All papers collected by the Crawler are appended to the paper queue. The Selector reads each paper in the paper queue to determine whether it meets the criteria specified in the user query.
AutoScholarQuery is a synthetic but high-quality dataset of academic queries and related papers, specifically curated for the AI field.
RealScholarQuery
RealScholarQuery is a test dataset consisting of 50 real-world and fine-grained research queries raised by AI researchers to use the system. The answers to each query are identified as comprehensively as possible by the professional annotators through various retrieval methods.
Experiments
Baselines
We evaluate our paper search agent on both the test set of AutoScholarQuery and RealScholarQuery. We compare PaSa-7b against the following baselines:
Google. Use Google to search the query directly.
Google Scholar. Queries are submitted directly to Google Scholar.
Google with GPT-4o. We first employ GPT-4o to paraphrase the scholar query. The paraphrased query is then searched on Google.
ChatGPT. We submit the scholar query to ChatGPT, powered by search-enabled GPT-4o. Due to the need for manual query submission, we evaluate only 100 randomly sampled instances from the AutoScholarQuery test set.
GPT-o1. Prompt GPT-o1 to process the scholar query.
PaSa-GPT-4o. Prompt GPT-4o within the PaSa framework. It can perform multiple searches, paper reading, and citation network crawling.
Main Results
As shown in Table 5, PaSa-7b outperforms all baselines on AutoScholarQuery test set. Specifically, compared to the strongest baseline, PaSa-GPT-4o, PaSa-7b demonstrates a 9.64% improvement in recall with comparable precision. Moreover, the recall of the Crawler in PaSa-7b is 3.66% higher than that in PaSa-GPT-4o. When compared to the best Google-based baseline, Google with GPT-4o, PaSa-7b achieves an improvement of 33.80%, 38.83% and 42.64% in Recall@20, Recall@50 and Recall@100, respectively.
We observe that using multiple ensembles of Crawler during inference can improve performance. Specifically, running Crawler twice during inference increased the Crawler recall by 3.34% on AutoScholarQuery, leading to the final recall improvement by 1.51%, with precision remaining similar.
To evaluate PaSa in a more realistic setting, we assess its effectiveness on RealScholarQuery. As illustrated in Table 6, PaSa-7b exhibits a greater advantage in real-world academic search scenarios. Compared to PaSa-GPT-4o, PaSa-7b achieves improvements of 30.36% in recall and 4.25% in precision. Against the best Google-based baseline on RealScholarQuery, Google with GPT-4o, PaSa-7b outperforms Google by 37.78%, 39.90%, and 39.83% in recall@20, recall@50 and recall@100, respectively. Additionally, the PaSa-7b-ensemble further enhances crawler recall by 4.32%, contributing to an overall 3.52% improvement in the recall of the entire agent system.
Run Locally
Data Preparation
Download dataset from pasa-dataset and save it in the data folder.
You need to first apply for a Google Search API key at serper.dev, and replace the ‘your google keys’ in trl/custom_agent/search_tools.py.
If you set use_selector=True, you need to deploy additional selector models, which can be accessed during training. Please modify call_selector function in trl/custom_agent/utils.py to call the selector and get the select results.
@misc{he2024pasa,
title={PaSa: An LLM Agent for Comprehensive Academic Paper Search},
author={Yichen He and Guanhua Huang and Peiyuan Feng and Yuan Lin and Yuchen Zhang and Hang Li and Weinan E},
year={2025},
eprint={2501.10120},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
ACL 2025 (Main)
Introduction
We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4 for paraphrased queries, chatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50. It also exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision.
Quick Start
You can prepare a detailed description of your academic search needs, and search for papers on https://pasa-agent.ai
Architecture
The PaSa system consists of two LLM agents, Crawler and Selector. The Crawler processes the user query and can access papers from the paper queue. It can autonomously invoke the search tool, expand citations, or stop processing of the current paper. All papers collected by the Crawler are appended to the paper queue. The Selector reads each paper in the paper queue to determine whether it meets the criteria specified in the user query.
Dataset
All the datasets are available at pasa-dataset
AutoScholarQuery
AutoScholarQuery is a synthetic but high-quality dataset of academic queries and related papers, specifically curated for the AI field.
RealScholarQuery
RealScholarQuery is a test dataset consisting of 50 real-world and fine-grained research queries raised by AI researchers to use the system. The answers to each query are identified as comprehensively as possible by the professional annotators through various retrieval methods.
Experiments
Baselines
We evaluate our paper search agent on both the test set of AutoScholarQuery and RealScholarQuery. We compare PaSa-7b against the following baselines:
Google. Use Google to search the query directly.
Google Scholar. Queries are submitted directly to Google Scholar.
Google with GPT-4o. We first employ GPT-4o to paraphrase the scholar query. The paraphrased query is then searched on Google.
ChatGPT. We submit the scholar query to ChatGPT, powered by search-enabled GPT-4o. Due to the need for manual query submission, we evaluate only 100 randomly sampled instances from the AutoScholarQuery test set.
GPT-o1. Prompt GPT-o1 to process the scholar query.
PaSa-GPT-4o. Prompt GPT-4o within the PaSa framework. It can perform multiple searches, paper reading, and citation network crawling.
Main Results
As shown in Table 5, PaSa-7b outperforms all baselines on AutoScholarQuery test set. Specifically, compared to the strongest baseline, PaSa-GPT-4o, PaSa-7b demonstrates a 9.64% improvement in recall with comparable precision. Moreover, the recall of the Crawler in PaSa-7b is 3.66% higher than that in PaSa-GPT-4o. When compared to the best Google-based baseline, Google with GPT-4o, PaSa-7b achieves an improvement of 33.80%, 38.83% and 42.64% in Recall@20, Recall@50 and Recall@100, respectively.
We observe that using multiple ensembles of Crawler during inference can improve performance. Specifically, running Crawler twice during inference increased the Crawler recall by 3.34% on AutoScholarQuery, leading to the final recall improvement by 1.51%, with precision remaining similar.
To evaluate PaSa in a more realistic setting, we assess its effectiveness on RealScholarQuery. As illustrated in Table 6, PaSa-7b exhibits a greater advantage in real-world academic search scenarios. Compared to PaSa-GPT-4o, PaSa-7b achieves improvements of 30.36% in recall and 4.25% in precision. Against the best Google-based baseline on RealScholarQuery, Google with GPT-4o, PaSa-7b outperforms Google by 37.78%, 39.90%, and 39.83% in recall@20, recall@50 and recall@100, respectively. Additionally, the PaSa-7b-ensemble further enhances crawler recall by 4.32%, contributing to an overall 3.52% improvement in the recall of the entire agent system.
Run Locally
Data Preparation
Download dataset from pasa-dataset and save it in the data folder.
Model Preparation
Download model checkpoints pasa-7b-crawler and pasa-7b-selector and save it in the checkpoints folder.
Run Pasa
You need to first apply for a Google Search API key at serper.dev, and replace ‘your google keys’ in
utils.py.crawlergenerates the search queries from the user query and choose the expand sections from ll secondary section names of the paper.selectortakes the title and abstract of the paper as input and generates a score which indicates the relevance between the paper and user query.crawlerand use arxiv/ar5iv search api to get the complete paper.Training Your Own Agent
We modify the code of
trlandtransformers, you can do SFT and PPO training after cloning and installing them.https://github.com/hyc2026/trl
https://github.com/hyc2026/transformers
Install dependencies
Selector SFT Training
Crawler SFT Training
Crawler PPO Training
Before Training:
trl/custom_agent/search_tools.py.use_selector=True, you need to deploy additional selector models, which can be accessed during training. Please modifycall_selectorfunction intrl/custom_agent/utils.pyto call the selector and get the select results.Citation
Please cite us as: