Synthesis Parameter Recommendation

Step 1: Prepare the knowledge base data

Build the knowledge base: utils/extract2knowledge.py

python /home/maxine/ai4ms/recommend/utils/extract2knowledge.py \
  --input /home/maxine/ai4ms/recommend/data/extract_result/except_test_linejson.jsonl \
  --output /home/maxine/ai4ms/recommend/data/knowledge_base/knowledge_base.jsonl \
  --template /home/maxine/ai4ms/recommend/data/knowledge_base/template.jsonl

Post-process the knowledge base: utils/knowledge_post_process.py

cd /home/maxine/ai4ms/recommend
source .venv/bin/activate
python utils/knowledge_post_process.py \
  --input data/knowledge_base/knowledge_base.jsonl \
  --output data/knowledge_base/knowledge_base_processed.jsonl
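
Before moving on to step 2, it can help to confirm that the processed knowledge base is well-formed JSON Lines. The following is a minimal sketch, not part of the repository: the helper name check_jsonl is hypothetical, and the only assumption is that the file holds one JSON value per line.

```python
import json
from pathlib import Path


def check_jsonl(path):
    """Return (record_count, bad_line_numbers) for a JSON Lines file.

    Hypothetical helper: it only checks that every non-blank line parses as JSON.
    """
    count, bad = 0, []
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
                count += 1
            except json.JSONDecodeError:
                bad.append(lineno)
    return count, bad


# Example call (path taken from the command above):
# count, bad = check_jsonl("data/knowledge_base/knowledge_base_processed.jsonl")
```

An empty list of bad line numbers means every record parsed.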

Step 2: Recommend parameters from the processed data

Similar-material retrieval: similar_retrieval.py

python -m src.similar_retrieval \
  --query "用助熔剂法制备AlInSe₃" --top-k 30 \
  --kb-path data/knowledge_base/knowledge_base_processed.jsonl \
  --intuition-template data/similar_mates/intuition_template.jsonl \
  --requirement-template data/similar_mates/how_to_parse_intuition.jsonl \
  --output data/similar_mates/similar_on-base_processed-AlInSe3.jsonl

Parameter window statistics: statistic_window.py

python -m src.statistic_window \
  --similar-file data/similar_mates/similar_on-base_processed-AlInSe3.jsonl \
  --input-file data/recommand_window/input_template.jsonl \
  --output-file data/recommand_window/output_result-AlInSe3.jsonl
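
To make the idea of a "parameter window" concrete, here is a minimal sketch of one way such a window could be computed from the retrieved similar materials. It is an illustration only and does not describe how statistic_window.py is implemented: the function name parameter_window, the record layout, and the field name growth_temperature_C are all assumptions.

```python
from statistics import median


def parameter_window(records, key):
    """Summarize one numeric parameter across similar-material records.

    Assumption: each record is a dict and `key` may be missing or non-numeric.
    """
    values = [
        r[key]
        for r in records
        if isinstance(r.get(key), (int, float)) and not isinstance(r.get(key), bool)
    ]
    if not values:
        return None  # no usable data for this parameter
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "max": max(values),
    }


# Hypothetical records; the field name is made up for illustration.
similar = [
    {"growth_temperature_C": 900},
    {"growth_temperature_C": 1000},
    {"growth_temperature_C": None},
    {"growth_temperature_C": 950},
]
window = parameter_window(similar, "growth_temperature_C")
```

Records with a missing or non-numeric value are skipped rather than treated as zero, so they do not distort the window.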

Recipe summary and parameter recommendation: recommend_recipe.py

python -m src.recommend_recipe \
  --similar-file data/similar_mates/similar_on-base_processed-AlInSe3.jsonl \
  --window-file data/recommand_window/output_result-AlInSe3.jsonl \
  --input-file data/recommand_recipe/input_real.jsonl \
  --output-file data/recommand_recipe/output_result-AlInSe3-0303.jsonl \
  --feature-output data/recommand_recipe/similar_recipes_feat_result-AlInSe3-0303.jsonl
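
Each command in step 2 reads files written by the previous one, so the material tag in the file names has to match across commands. A minimal sketch of deriving all paths from a single tag is shown below. The function step2_commands and its run_suffix argument are hypothetical; the module names, flags, and path patterns are copied from the commands above.

```python
def step2_commands(tag, run_suffix=""):
    """Build the statistic_window and recommend_recipe commands for one material tag."""
    similar = f"data/similar_mates/similar_on-base_processed-{tag}.jsonl"
    window = f"data/recommand_window/output_result-{tag}.jsonl"
    recipe_tag = f"{tag}-{run_suffix}" if run_suffix else tag
    return {
        "statistic_window": [
            "python", "-m", "src.statistic_window",
            "--similar-file", similar,
            "--input-file", "data/recommand_window/input_template.jsonl",
            "--output-file", window,
        ],
        "recommend_recipe": [
            "python", "-m", "src.recommend_recipe",
            "--similar-file", similar,
            "--window-file", window,
            "--input-file", "data/recommand_recipe/input_real.jsonl",
            "--output-file", f"data/recommand_recipe/output_result-{recipe_tag}.jsonl",
            "--feature-output",
            f"data/recommand_recipe/similar_recipes_feat_result-{recipe_tag}.jsonl",
        ],
    }


commands = step2_commands("AlInSe3", "0303")
# To execute, each list could be passed to subprocess.run(cmd, check=True)
# from the repository root.
```

Building the argument lists in one place means a typo in the tag shows up as a missing input file in the first command instead of a silent mismatch later.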

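Step 3: PDF information extraction (extraction-only pipeline)

The submodule_extract module extracts structured information from PDF files according to a given structure template.

Directory layout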

submodule_extract/
├── main_extraction.py          # Main entry point
├── run.sh                     # Run script
├── .env.example               # Example environment variables
├── configs/
│   └── config.yaml            # Configuration file
├── src/
│   ├── llm.py                 # LLM configuration
│   ├── states_nodes.py        # State definitions
│   ├── extraction/
│   │   └── extractor_agent.py # Extraction agent
│   └── prompts/
│       └── extraction_prompt.txt  # Extraction prompt
└── utils/
    ├── pdf_processor.py       # PDF processing utilities
    ├── json_parse.py          # JSON parsing utilities
    └── load_config_with_hydra.py  # Configuration loading utility

Dependencies

The extraction-only pipeline requires the following packages:

# Core dependencies
flask>=3.1.3
openai>=2.15.0
python-dotenv>=1.0.0

# Extraction-only pipeline dependencies
langchain-core>=1.1.0
langchain-openai>=0.1.0
langgraph>=0.2.0
omegaconf>=2.3.0
hydra-core>=1.3.0
pdfplumber>=0.11.0
typing-extensions>=4.8.0

Install all dependencies with:

cd /home/maxine/ai4ms/recommend
pip install -e .

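Environment configuration

Note: submodule_extract automatically reads environment variables from the outer /home/maxine/ai4ms/recommend/.env file, so no separate configuration is needed.

Make sure the outer .env file contains the following settings (see /home/maxine/ai4ms/recommend/.env.example):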

LLM_API_KEY=your_api_key_here
LLM_BASE_URL=https://api.siliconflow.cn/v1
LLM_MODEL=gpt-4
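
A quick way to catch a missing setting before starting a long extraction run is to check the three variables up front. The sketch below uses only the standard library and assumes the variables have already been loaded into the process environment (for example by python-dotenv). The helper name missing_llm_settings is hypothetical.

```python
import os

# Variable names taken from the .env example above.
REQUIRED = ("LLM_API_KEY", "LLM_BASE_URL", "LLM_MODEL")


def missing_llm_settings(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment:
missing = missing_llm_settings({"LLM_API_KEY": "k", "LLM_BASE_URL": "  "})
```

Passing a plain dict makes the check easy to exercise without touching the real environment.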

Usage

Method 1: Command-line arguments

cd submodule_extract

# Basic usage: extract from all PDFs in the given directory
python main_extraction.py \
    onlyext.structure_template=data/template.json \
    onlyext.pdfpath=data/pdfs \
    onlyext.output_name=my_extraction

# Specify an output directory
python main_extraction.py \
    onlyext.structure_template=data/template.json \
    onlyext.pdfpath=data/pdfs \
    onlyext.output_dir=results/flux_crystal \
    onlyext.output_name=flux_2024

# Random mode (pick 10 PDFs at random from the directory)
python main_extraction.py \
    onlyext.structure_template=data/template.json \
    onlyext.pdfpath=data/pdfs \
    onlyext.mode=random \
    onlyext.random_num=10 \
    onlyext.output_name=random_sample

# ===== Text-file input mode (instead of PDF files) =====

# Extract from a text file
python main_extraction.py \
    onlyext.structure_template=data/template.json \
    onlyext.text_file=data/texts/paper1.txt \
    onlyext.output_name=text_extraction

# Use the alternate template path option
python main_extraction.py \
    onlyext.template_file=data/template.json \
    onlyext.text_file=data/texts/paper1.txt \
    onlyext.output_name=text_with_template
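
The commands above all follow Hydra's key=value override syntax, so they can also be assembled from a dictionary when scripting several runs. This is a minimal sketch under that assumption. The helper names hydra_overrides and extraction_command are hypothetical, and values containing spaces or other special characters would need Hydra's quoting rules, which this sketch does not handle.

```python
def hydra_overrides(params, prefix="onlyext"):
    """Turn {"mode": "random"} into ["onlyext.mode=random"]."""
    return [f"{prefix}.{key}={value}" for key, value in params.items()]


def extraction_command(params):
    """Build the argument list for one main_extraction.py run."""
    return ["python", "main_extraction.py", *hydra_overrides(params)]


cmd = extraction_command({
    "structure_template": "data/template.json",
    "pdfpath": "data/pdfs",
    "mode": "random",
    "random_num": 10,
    "output_name": "random_sample",
})
# To execute: subprocess.run(cmd, check=True, cwd="submodule_extract")
```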

Method 2: Edit the configuration file

Edit configs/config.yaml:

onlyext:
  value: True
  structure_template: data/structure_template.json  # Path to the structure template
  mode: all  # PDF selection mode: hardcoded, random, all
  pdfpath: data/pdfs  # PDF directory
  pdfs: []  # Hardcoded list of PDFs
  output_dir: results  # Output directory
  output_name: extraction_result  # Output file name
  random_num: 5  # Number of PDFs to pick in random mode

Then run:

python main_extraction.py

Method 3: Use the run script

# Edit the parameters in run.sh, then run it
bash run.sh

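Configuration parameters

Parameter	Description	Allowed values
`onlyext.value`	Whether to enable extraction-only mode	True/False
`onlyext.structure_template`	Path to the structure template JSON file	Relative or absolute path
`onlyext.template_file`	Path to the structure template JSON file (alternate option)	Relative or absolute path
`onlyext.mode`	PDF selection mode	hardcoded, random, all
`onlyext.pdfpath`	PDF directory (used when mode is random or all)	Directory path
`onlyext.pdfs`	Hardcoded list of PDF paths (used when mode is hardcoded)	JSON array
`onlyext.text_file`	Text file path (used instead of PDF files)	File path
`onlyext.output_dir`	Output directory for results	Directory path
`onlyext.output_name`	Output file name (without extension)	String
`onlyext.random_num`	Number of PDFs to pick in random mode	Integer

Input source priority: text_file > PDF files (when text_file is set, it is used instead of the PDF inputs).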
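
The selection rules described by the parameters can be sketched as follows. This is an illustration of the documented behaviour, not the module's actual code: the function select_inputs is hypothetical, the config is modelled as a plain dict rather than a Hydra config object, and matching PDFs with a "*.pdf" glob is an assumption.

```python
import random
from pathlib import Path


def select_inputs(cfg, rng=None):
    """Return a list of (source_type, path) pairs for one run."""
    # text_file takes priority over any PDF settings.
    if cfg.get("text_file"):
        return [("text_file", Path(cfg["text_file"]))]

    mode = cfg.get("mode", "all")
    if mode == "hardcoded":
        pdfs = [Path(p) for p in cfg.get("pdfs", [])]
    elif mode in ("all", "random"):
        pdfs = sorted(Path(cfg["pdfpath"]).glob("*.pdf"))
        if mode == "random":
            if rng is None:
                rng = random.Random()
            # Never ask for more files than the directory holds.
            k = min(cfg.get("random_num", 5), len(pdfs))
            pdfs = rng.sample(pdfs, k)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [("pdf", p) for p in pdfs]
```

Passing a seeded random.Random as rng makes a random-mode selection reproducible.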

Output format

Extraction results are saved as a JSON file in the following shape:

[
  {
    "_pdf_name": "paper1.pdf",
    "field1": "value1",
    "field2": "value2",
    ...
  },
  {
    "_pdf_name": "paper2.pdf",
    "_extraction_failed": true,
    "_error": "Error message",
    ...
  },
  {
    "_source_type": "text_file",
    "_source_file": "paper.txt",
    "field1": "value1",
    ...
  }
]

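_pdf_name: name of the PDF file
_extraction_failed: whether extraction failed (true/false)
_error: the error message, if extraction failed
_source_type: type of the input source (for example "text_file")
_source_file: name of the source text file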
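
Since failed records are stored alongside successful ones, a small post-processing step is usually needed to separate them. The sketch below relies only on the metadata fields shown above. The helper names are hypothetical, and the result path in the comment assumes the file is written as output_dir/output_name.json, which the text does not state explicitly.

```python
import json


def split_results(records):
    """Split extraction records into (succeeded, failed) lists."""
    ok, failed = [], []
    for rec in records:
        (failed if rec.get("_extraction_failed") else ok).append(rec)
    return ok, failed


def source_name(rec):
    """Name of the input a record came from, whether PDF or text file."""
    return rec.get("_pdf_name") or rec.get("_source_file") or "<unknown>"


# Inline sample mirroring the format above; a real run might instead use
# json.load(open("results/extraction_result.json", encoding="utf-8")).
sample = json.loads("""[
  {"_pdf_name": "paper1.pdf", "field1": "value1"},
  {"_pdf_name": "paper2.pdf", "_extraction_failed": true, "_error": "Error message"},
  {"_source_type": "text_file", "_source_file": "paper.txt", "field1": "value1"}
]""")
ok, failed = split_results(sample)
failed_names = [source_name(r) for r in failed]
```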

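Notes

PDF library: pdfplumber is recommended (pip install pdfplumber); PyPDF2 and pypdf are also supported.
API configuration: make sure the LLM API URL and key are set correctly in the .env file.
Structure template: the template must be a valid JSON file that defines the structure of the fields to extract.
Resuming: if a run is interrupted, rerunning it automatically skips PDFs that were already extracted successfully.
Page limit: by default at most 50 pages are extracted from each PDF, to avoid processing overly long documents.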
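
The resume and page-limit behaviour can be pictured with the sketch below. It is an illustration under stated assumptions, not the module's code: the function names are hypothetical, the resume check assumes the result file uses the output format described above, and the page-limit helper assumes pdfplumber is the installed PDF library.

```python
import json
from pathlib import Path


def pending_pdfs(pdf_paths, result_file):
    """Return the PDFs that still need extraction.

    A PDF counts as done only if the result file has a record for it
    that is not marked as failed.
    """
    result_file = Path(result_file)
    done = set()
    if result_file.exists():
        for rec in json.loads(result_file.read_text(encoding="utf-8")):
            if "_pdf_name" in rec and not rec.get("_extraction_failed"):
                done.add(rec["_pdf_name"])
    return [p for p in pdf_paths if Path(p).name not in done]


def read_pdf_text(pdf_path, max_pages=50):
    """Extract text from at most `max_pages` pages of a PDF."""
    import pdfplumber  # third-party; imported lazily so the resume helper works without it

    with pdfplumber.open(pdf_path) as pdf:
        pages = pdf.pages[:max_pages]
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pages)
```

In this sketch, failed records are retried on the next run because only successful records are treated as done.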