sap (SRASS Application Platform)

Web-based seismic computing platform that consumes SRASS over VPN+SSH on domestic HPC systems (Tianhe, Dawning, Shenwei). No agent is installed on the supercomputer login node; all interaction goes through ssh/scp.

Status

Phase 1 MVP:

  • P1 Backend — see docs/superpowers/specs/2026-06-17-sap-phase1-mvp-design.md
  • P1 Frontend scaffold: Vite + React + TS, routing, job list view with polling.
  • P1 Frontend round 2 (submit form + multipart file uploads)
  • P1 Frontend round 3 (detail page + status timeline + outputs + logs)
  • P2 End-to-end integration on a real Tianhe account — full job lifecycle (DRAFT → PENDING_UPLOAD → SUBMITTING → QUEUED → RUNNING → COMPLETED) verified against hnu_wuq@25.8.100.22 (Tianhe new-generation, partition mt_module, account ecology). /api/jobs/{id}/output and /api/jobs/{id}/logs serve real artifacts fetched eagerly on COMPLETED.
  • Dawning (Sugon new-generation) real-HPC integration — SRASS uploaded, built on the hpctest06 compute nodes, and verified with a SLURM smoke test. DawningPlatform now generates srass specfem3d-globe forward ... jobs for the new SRASS CLI.
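The lifecycle listed above can be pictured as a small transition table. The following is a minimal sketch, not the project's actual state_machine module: the names JobStatus, ALLOWED, and transition are illustrative assumptions, and only the happy-path states named in this README are modeled (failure and cancellation states are left out because they are not described here).

```python
from enum import Enum


class JobStatus(str, Enum):
    DRAFT = "DRAFT"
    PENDING_UPLOAD = "PENDING_UPLOAD"
    SUBMITTING = "SUBMITTING"
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"


# Hypothetical transition table covering only the states named in this README.
ALLOWED = {
    JobStatus.DRAFT: {JobStatus.PENDING_UPLOAD},
    JobStatus.PENDING_UPLOAD: {JobStatus.SUBMITTING},
    JobStatus.SUBMITTING: {JobStatus.QUEUED},
    # A fast job can finish between two polls, so QUEUED may jump to COMPLETED.
    JobStatus.QUEUED: {JobStatus.RUNNING, JobStatus.COMPLETED},
    JobStatus.RUNNING: {JobStatus.COMPLETED},
    JobStatus.COMPLETED: set(),
}


def transition(current: JobStatus, target: JobStatus) -> JobStatus:
    """Return the new status, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Keeping the table as data makes a fix such as "allow QUEUED → COMPLETED" a one-line change that is easy to cover with a regression test.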

Real-HPC bugs fixed during end-to-end testing (2026-06-22 / 23)

The unit / integration tests pass against mocks, but the mocks cannot reproduce the SLURM toolchain’s exact behavior. End-to-end runs against the real Tianhe cluster surfaced six bugs that the test suite could not:

| # | Bug | Surface | Fix |
| --- | --- | --- | --- |
| 1 | scp_upload does not expand $HOME (sftp-server has no shell) | First submit failed | SSHClient._resolve_remote_path caches remote_home and rewrites paths before scp |
| 2 | #SBATCH --chdir=$HOME/... is parsed literally by SLURM | JobLaunchFailure (workdir concatenation) | TianhePlatform._workdir resolves $HOME to absolute before generating the sbatch script |
| 3 | SlurmClient.query raises on squeue "Invalid job id" for completed jobs | Scheduler stuck polling forever | squeue wrapped in try/except SSHError; falls through to sacct |
| 4 | State machine rejects QUEUED → COMPLETED for fast jobs | Job stuck at QUEUED after run | Added COMPLETED to the allowed set under QUEUED |
| 5 | GET /api/jobs/{id}/logs was never wired up to anything | Endpoint existed, always 404 | New Platform.download_log + JobService.download_log + eager fetch in scheduler on COMPLETED |
| 6 | GET /api/jobs/{id}/output listed local files only, never invoked download_outputs | Endpoint existed, always empty | Eager fetch in scheduler on COMPLETED pulls remote output dir to local |

Every fix has a regression test. Total tests: 185 (up from 170 at the start of the e2e session).
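Bugs 1 and 2 share a root cause: neither sftp-server nor the #SBATCH directive parser runs a shell, so $HOME is never expanded. A minimal sketch of the rewrite step follows. It assumes the remote home directory has already been looked up once and cached; the function name and signature are illustrative and do not reproduce the real SSHClient._resolve_remote_path.

```python
import posixpath


def resolve_remote_path(path: str, remote_home: str) -> str:
    """Rewrite a leading $HOME, ${HOME}, or ~ into an absolute remote path.

    remote_home is assumed to be an absolute path fetched once from the
    login node and cached by the caller.
    """
    for prefix in ("$HOME", "${HOME}", "~"):
        if path == prefix:
            return remote_home
        if path.startswith(prefix + "/"):
            # Drop the prefix and its slash, then join onto the cached home.
            return posixpath.join(remote_home, path[len(prefix) + 1:])
    # Anything else (absolute paths, ~otheruser, $HOMEDIR) is left untouched.
    return path
```

posixpath is used instead of os.path so the result stays correct even when the SAP backend itself runs on a non-POSIX host.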

Phase 1 scope

Minimum end-to-end slice for one job type (global forward simulation in spherical coordinates, 球坐标全球正演 / forward) on one platform (Tianhe). Dawning is now a working SLURM adapter and has been validated end-to-end on the scnet.cn E-Shell gateway; Shenwei remains a stub that returns 501 Not Implemented.

Running locally

Prerequisites: Python 3.11+, network access to the Tianhe login node.

cd backend
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Set the required env vars (or put them in .env and export them):

export SAP_DATA_ROOT=$HOME/.local/share/sap
export SAP_TIANHE_HOST=tianhe.login.example
export SAP_TIANHE_USER=your-username

Tianhe environment variables

These knobs control how jobs are submitted to the Tianhe SLURM cluster. All have safe defaults so a vanilla uvicorn start works for CI / mock testing, but a real deployment needs at least SAP_TIANHE_ACCOUNT and SAP_TIANHE_QOS set (Tianhe rejects account-less submissions with Invalid account or account/partition combination).

| Env var | Default | Meaning |
| --- | --- | --- |
| SAP_TIANHE_HOST | tianhe.login | SSH hostname of the login node |
| SAP_TIANHE_USER | sap | SSH username |
| SAP_TIANHE_PARTITION | mt_module | SLURM partition to submit into |
| SAP_TIANHE_ACCOUNT | (empty) | SLURM account. When empty, the #SBATCH --account directive is omitted. |
| SAP_TIANHE_QOS | (empty) | SLURM QOS. When empty, the #SBATCH --qos directive is omitted. |
| SAP_TIANHE_SRASS_BIN_DIR | (empty) | Directory containing the srass binary. When set, the generated sbatch script prepends it to PATH on the compute node (login /tmp is not shared with compute nodes). Empty = assume srass is on the compute node’s default PATH. |

Worked example for the Tianhe new-generation system (hnu_wuq@25.8.100.22, partition mt_module, account ecology, SRASS installed under ~/srass_bin):

export SAP_TIANHE_HOST=25.8.100.22
export SAP_TIANHE_USER=hnu_wuq
export SAP_TIANHE_PARTITION=mt_module
export SAP_TIANHE_ACCOUNT=ecology
export SAP_TIANHE_QOS=nor8000
export SAP_TIANHE_SRASS_BIN_DIR=$HOME/srass_bin
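To illustrate how these knobs could shape the generated batch script, here is a hedged sketch. TianheSettings and render_sbatch are hypothetical names, not the project's real config or platform classes, and the job command is left as a caller-supplied placeholder because the actual srass invocation is not spelled out in this README.

```python
from dataclasses import dataclass


@dataclass
class TianheSettings:
    # Defaults mirror the documented defaults for the SAP_TIANHE_* variables.
    partition: str = "mt_module"
    account: str = ""
    qos: str = ""
    srass_bin_dir: str = ""


def render_sbatch(settings: TianheSettings, workdir: str, command: str) -> str:
    """Render a minimal sbatch script; optional directives are skipped when empty."""
    if not workdir.startswith("/"):
        # SLURM does not expand $HOME in directives, so require an absolute path.
        raise ValueError("workdir must be an absolute path")
    lines = ["#!/bin/bash", f"#SBATCH --partition={settings.partition}"]
    if settings.account:
        lines.append(f"#SBATCH --account={settings.account}")
    if settings.qos:
        lines.append(f"#SBATCH --qos={settings.qos}")
    lines.append(f"#SBATCH --chdir={workdir}")
    if settings.srass_bin_dir:
        # Make srass visible on the compute node.
        lines.append(f'export PATH="{settings.srass_bin_dir}:$PATH"')
    lines.append(command)
    return "\n".join(lines) + "\n"
```

With the worked example above, the script would carry --account=ecology and --qos=nor8000; with a default-constructed TianheSettings, both directives are absent, which is the case that suits CI and mock testing.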

Dawning (Sugon new-generation) environment variables

Dawning is accessed through the scnet.cn E-Shell gateway on a non-standard SSH port with key-based authentication. Set at least SAP_DAWNING_USER and SAP_DAWNING_KEY_PATH; the other knobs have safe defaults for the hpctest06 debug queue.

| Env var | Default | Meaning |
| --- | --- | --- |
| SAP_DAWNING_HOST | zzeshell.scnet.cn | E-Shell gateway hostname |
| SAP_DAWNING_PORT | 65032 | SSH port for the gateway |
| SAP_DAWNING_USER | (empty) | SSH username |
| SAP_DAWNING_KEY_PATH | (empty) | Path to the scnet.cn SSH private key |
| SAP_DAWNING_PARTITION | hpctest06 | SLURM partition to submit into |
| SAP_DAWNING_TIME_LIMIT | 00:10:00 | Wall-clock time limit for jobs |
| SAP_DAWNING_SRASS_BIN_DIR | (empty) | Directory containing the srass binary on Dawning. |

Example:

export SAP_DAWNING_USER=scnethpc26108
export SAP_DAWNING_KEY_PATH=$HOME/.ssh/scnethpc26108_zzeshell.scnet.cn_RsaKeyExpireTime_2026-07-17_16-59-07.txt
export SAP_DAWNING_PARTITION=hpctest06
export SAP_DAWNING_SRASS_BIN_DIR=/public/home/scnethpc26108/SRASS/build/bin
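As a rough illustration of how these variables might translate into an SSH invocation, the sketch below reads them with the documented defaults and builds an argument list for the ssh command. It uses only the standard library as a stand-in for the project's Pydantic settings; DawningSettings and ssh_argv are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DawningSettings:
    host: str
    port: int
    user: str
    key_path: str
    partition: str
    time_limit: str
    srass_bin_dir: str

    @classmethod
    def from_env(cls, env=None):
        """Read SAP_DAWNING_* variables, falling back to the documented defaults."""
        env = os.environ if env is None else env
        return cls(
            host=env.get("SAP_DAWNING_HOST", "zzeshell.scnet.cn"),
            port=int(env.get("SAP_DAWNING_PORT", "65032")),
            user=env.get("SAP_DAWNING_USER", ""),
            key_path=env.get("SAP_DAWNING_KEY_PATH", ""),
            partition=env.get("SAP_DAWNING_PARTITION", "hpctest06"),
            time_limit=env.get("SAP_DAWNING_TIME_LIMIT", "00:10:00"),
            srass_bin_dir=env.get("SAP_DAWNING_SRASS_BIN_DIR", ""),
        )


def ssh_argv(settings: DawningSettings, remote_command: str) -> list:
    """Build an ssh argument list for the gateway (non-standard port, key auth)."""
    if not settings.user or not settings.key_path:
        raise ValueError("SAP_DAWNING_USER and SAP_DAWNING_KEY_PATH must be set")
    return [
        "ssh",
        "-p", str(settings.port),   # ssh takes the port as -p (scp uses -P)
        "-i", settings.key_path,    # private key issued for the gateway
        f"{settings.user}@{settings.host}",
        remote_command,
    ]
```

Returning a list rather than a single string lets the caller hand it to subprocess.run without a local shell, which avoids quoting problems with the long key file name.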

Dawning SRASS build

A CPU-only build of SRASS (SPECFEM3D_GLOBE only) has already been produced on the Dawning hpctest06 debug queue. The build script, source tree, and binaries are at:

/public/home/scnethpc26108/build_srass_cpu.sh   # sbatch script used
/public/home/scnethpc26108/SRASS/               # uploaded source tree
/public/home/scnethpc26108/SRASS/build/bin/     # srass + specfem3d-globe binaries

The build/bin/srass wrapper loads scripts/platform/dawning/env.sh so that python3, MPI, NetCDF, and the srass Python package are available on the compute node without requiring SAP to emit module load commands.

To rebuild or add components (Cartesian, Unicycle, Grond, HIP/DCU), edit the CMake flags in ~/build_srass_cpu.sh and submit it from the login node:

sbatch ~/build_srass_cpu.sh

With the environment variables set, start the server in dev mode (auto-reload) from the backend directory:

../scripts/dev.sh

Or directly:

uvicorn sap.main:app --reload --host 0.0.0.0 --port 8000

Running tests

Backend:

cd backend
pytest -q

Tests are organized:

  • tests/unit/ — module-level unit tests
  • tests/api/ — FastAPI TestClient tests
  • tests/integration/ — tests that hit a real HPC login node; currently test_dawning_real.py exercises the Dawning E-Shell gateway and SLURM scheduler (gated by SAP_DAWNING_KEY_PATH).
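Gating the real-HPC tests on SAP_DAWNING_KEY_PATH could look roughly like the sketch below. It uses the standard library's unittest skip decorator, which pytest also honors when it collects TestCase classes. The real test_dawning_real.py may well use pytest markers instead, and the class and test names here are made up for illustration.

```python
import os
import unittest

# The integration tests only make sense when a gateway key is configured.
DAWNING_KEY_PATH = os.environ.get("SAP_DAWNING_KEY_PATH", "")


@unittest.skipUnless(
    DAWNING_KEY_PATH,
    "SAP_DAWNING_KEY_PATH not set; skipping real-HPC integration tests",
)
class DawningSmokeTest(unittest.TestCase):
    """Hypothetical smoke test; it never runs on machines without a key."""

    def test_key_file_exists(self):
        self.assertTrue(os.path.isfile(DAWNING_KEY_PATH))
```

On a CI runner without the variable, the whole class is reported as skipped rather than failed, so the mock-based suite stays green.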

Frontend:

cd frontend
npm install
npm run test:run     # vitest run
npm run lint         # eslint
npm run typecheck    # tsc --noEmit
npm run build        # production bundle

Frontend tests live next to the source files as *.test.{ts,tsx} and run under Vitest with jsdom and @testing-library/react. For a coverage report, run npm run test:coverage.

Frontend dev workflow

# Terminal 1: backend on :8000
cd backend && source .venv/bin/activate
uvicorn sap.main:app --reload

# Terminal 2: frontend dev server on :5173
cd frontend
npm run dev

The Vite dev server proxies /api/* to http://localhost:8000, so the frontend can call /api/jobs directly with no CORS configuration.

For production, run scripts/prod-build.sh from the repo root. It builds the frontend SPA into backend/sap/static/ and then packages the backend as a wheel. FastAPI serves the static files at / (API routes still take precedence).

API surface

| Method | Path | Notes |
| --- | --- | --- |
| GET | /api/health | Liveness probe |
| GET | /api/platforms | List known platforms + availability |
| GET | /api/job-types | List job types + param schemas |
| GET | /api/jobs | List all jobs |
| POST | /api/jobs | Create a DRAFT job |
| GET | /api/jobs/{id} | Get job details |
| POST | /api/jobs/{id}/submit | Walk DRAFT → QUEUED via platform.submit |
| POST | /api/jobs/{id}/cancel | Cancel a QUEUED/RUNNING job |
| GET | /api/jobs/{id}/output | List output files |
| GET | /api/jobs/{id}/output/{filename} | Download an output file |
| GET | /api/jobs/{id}/logs | Download the run log |
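A client could drive the create → submit → inspect sequence roughly as sketched below, using only the standard library. The paths come from the table above, but the request body, the JSON encoding of the create call, and the "id" field in the response are assumptions: the README does not document the payload schema, and the real create endpoint may expect multipart form data instead.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the uvicorn command shown earlier


def build_request(method: str, path: str, payload=None) -> urllib.request.Request:
    """Build a request for the SAP API; a JSON body is attached only if given."""
    data = None
    headers = {"Accept": "application/json"}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def call(method: str, path: str, payload=None):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(method, path, payload), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def create_and_submit(payload: dict):
    """Create a DRAFT job, submit it, and return the refreshed job details."""
    job = call("POST", "/api/jobs", payload)
    job_id = job["id"]  # assumed response field, not confirmed by this README
    call("POST", f"/api/jobs/{job_id}/submit")
    return call("GET", f"/api/jobs/{job_id}")
```

Once the job reaches COMPLETED, the same call helper can list /api/jobs/{id}/output; downloading files or the log would need a variant that returns the raw bytes instead of parsing JSON.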

Architecture

See docs/superpowers/specs/2026-06-17-sap-phase1-mvp-design.md and docs/superpowers/plans/2026-06-17-sap-phase1-p1-backend.md (the TDD-style implementation plan that produced this code).

Layered modules under backend/sap/:

api/         FastAPI routes + DTOs
services/    JobService, Scheduler, file_service
platforms/   Platform Protocol + Tianhe/Dawning/Shenwei adapters
clients/     ssh_client, slurm_client, srass_client, command_templates
state/       JobRepo (YAML CRUD), atomic file_store
models/      Job, JobStatus, state_machine
core/        exceptions, paths, logging
config.py    Pydantic settings
main.py      FastAPI app + lifespan
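The platforms/ layer can be pictured as a structural interface that each adapter satisfies. The sketch below is an assumption-laden illustration: the method names submit, download_outputs, and download_log appear in this README, but their signatures, the name attribute, and the http_status_for helper are invented here to show how a stub such as Shenwei could surface as 501 Not Implemented.

```python
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class Platform(Protocol):
    """Hypothetical adapter interface; the signatures are illustrative only."""

    name: str

    def submit(self, job_id: str) -> str: ...

    def download_outputs(self, job_id: str, dest: Path) -> list[Path]: ...

    def download_log(self, job_id: str, dest: Path) -> Path: ...


class ShenweiPlatform:
    """Stub adapter: every operation reports that it is not implemented."""

    name = "shenwei"

    def submit(self, job_id: str) -> str:
        raise NotImplementedError("Shenwei adapter is a stub")

    def download_outputs(self, job_id: str, dest: Path) -> list[Path]:
        raise NotImplementedError("Shenwei adapter is a stub")

    def download_log(self, job_id: str, dest: Path) -> Path:
        raise NotImplementedError("Shenwei adapter is a stub")


def http_status_for(exc: Exception) -> int:
    """Map adapter errors to HTTP status codes (assumed mapping)."""
    return 501 if isinstance(exc, NotImplementedError) else 500
```

Because Protocol uses structural typing, adapters do not need to inherit from Platform, so a new cluster can be added by writing one class with the same method names.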

License

Internal. Not for redistribution.
