Web-based seismic computing platform that consumes SRASS over VPN+SSH on
domestic HPC systems (Tianhe, Dawning, Shenwei). No agent is installed
on the supercomputer login node; all interaction goes through ssh/scp.
Status
Phase 1 MVP:
- P1 Backend: see docs/superpowers/specs/2026-06-17-sap-phase1-mvp-design.md
- P1 Frontend scaffold: Vite + React + TS, routing, job list view with polling.
- P1 Frontend round 2 (submit form + multipart file uploads)
- P2 End-to-end integration on a real Tianhe account: full job lifecycle (DRAFT → PENDING_UPLOAD → SUBMITTING → QUEUED → RUNNING → COMPLETED) verified against hnu_wuq@25.8.100.22 (Tianhe new-generation, partition mt_module, account ecology). /api/jobs/{id}/output and /api/jobs/{id}/logs serve real artifacts fetched eagerly on COMPLETED.
- Dawning (Sugon new-generation) real-HPC integration: SRASS uploaded, built on the hpctest06 compute nodes, and verified with a SLURM smoke test. DawningPlatform now generates srass specfem3d-globe forward ... jobs for the new SRASS CLI.
Real-HPC bugs fixed during end-to-end testing (2026-06-22 / 23)
The unit / integration tests pass against mocks, but the mocks cannot
reproduce the SLURM toolchain's exact behavior. End-to-end runs against
the real Tianhe cluster surfaced six bugs that the test suite could not catch:
| # | Bug | Surface | Fix |
|---|-----|---------|-----|
| 1 | scp_upload does not expand $HOME (sftp-server has no shell) | First submit failed | SSHClient._resolve_remote_path caches remote_home and rewrites paths before scp |
| 2 | #SBATCH --chdir=$HOME/... is parsed literally by SLURM | JobLaunchFailure (workdir concatenation) | TianhePlatform._workdir resolves $HOME to absolute before generating the sbatch script |
| 3 | SlurmClient.query raises on squeue "Invalid job id" for completed jobs | Scheduler stuck polling forever | squeue wrapped in try/except SSHError; falls through to sacct |
| 4 | State machine rejects QUEUED → COMPLETED for fast jobs | Job stuck at QUEUED after run | Added COMPLETED to the allowed set under QUEUED |
| 5 | GET /api/jobs/{id}/logs was never wired up to anything | Endpoint existed, always 404 | New Platform.download_log + JobService.download_log + eager fetch in scheduler on COMPLETED |
| 6 | GET /api/jobs/{id}/output listed local files only, never invoked download_outputs | Endpoint existed, always empty | Eager fetch in scheduler on COMPLETED pulls remote output dir to local |
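The squeue-then-sacct fallback from bug 3 can be sketched as follows. This is a minimal illustration, not the project's SlurmClient: the function name query_state, the injected run callable, and the local SSHError stand-in are all assumptions made so the example is self-contained.

```python
from typing import Callable, Optional


class SSHError(Exception):
    """Stand-in for the project's SSHError (assumed: raised on non-zero remote exit)."""


def query_state(job_id: str, run: Callable[[str], str]) -> Optional[str]:
    """Return the SLURM state of job_id, or None if neither tool reports one.

    `run` is a hypothetical callable that executes a command on the login
    node and returns its stdout, raising SSHError on a non-zero exit code.
    """
    try:
        # -h suppresses the header; -o %T prints only the state column.
        out = run(f"squeue -h -j {job_id} -o %T").strip()
        if out:
            return out.splitlines()[0]
    except SSHError:
        # squeue fails with "Invalid job id" once the job has left the queue,
        # so a failure here is expected for finished jobs.
        pass

    # sacct still knows about finished jobs: -n no header, -P parsable
    # output, -X allocation line only (skips the per-step rows).
    out = run(f"sacct -n -P -X -j {job_id} -o State").strip()
    if not out:
        return None
    # sacct may print e.g. "CANCELLED by 1234"; keep the first word only.
    return out.splitlines()[0].split()[0]
```

Injecting run keeps the fallback logic testable without a cluster, which is also how a regression test for this bug could simulate the "Invalid job id" failure.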
Every fix has a regression test. Total tests: 185 (up from 170 at the start
of the e2e session).
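The state-machine fix (bug 4) amounts to one extra entry in a transition table. The sketch below uses only the lifecycle states named in this README; the ALLOWED dict and the transition helper are illustrative names, and any failure or cancellation states the real state machine has are deliberately left out because they are not listed here.

```python
# Allowed transitions for the documented happy path. Failure and
# cancellation states are omitted because this README does not name them.
ALLOWED = {
    "DRAFT": {"PENDING_UPLOAD"},
    "PENDING_UPLOAD": {"SUBMITTING"},
    "SUBMITTING": {"QUEUED"},
    # A fast job can finish between two polls, so the scheduler may never
    # observe RUNNING; QUEUED must therefore be allowed to go to COMPLETED.
    "QUEUED": {"RUNNING", "COMPLETED"},
    "RUNNING": {"COMPLETED"},
    "COMPLETED": set(),
}


def transition(current: str, target: str) -> str:
    """Return target if current -> target is allowed, else raise ValueError."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

With this table, transition("QUEUED", "COMPLETED") succeeds, while skipping straight from DRAFT to QUEUED still raises.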
Phase 1 scope
Minimum end-to-end slice for one job type (spherical-coordinate global forward simulation / forward)
on one platform (Tianhe). Dawning is now a working SLURM adapter
and has been validated end-to-end on the scnet.cn E-Shell gateway;
Shenwei remains a stub that returns 501 Not Implemented.
Running locally
Prerequisites: Python 3.11+, network access to the Tianhe login node.
Tianhe environment variables
These knobs control how jobs are submitted to the Tianhe SLURM cluster.
All have safe defaults so a vanilla uvicorn start works for CI / mock
testing, but a real deployment needs at least SAP_TIANHE_ACCOUNT and
SAP_TIANHE_QOS set (Tianhe rejects account-less submissions with
Invalid account or account/partition combination).
| Env var | Default | Meaning |
|---------|---------|---------|
| SAP_TIANHE_HOST | tianhe.login | SSH hostname of the login node |
| SAP_TIANHE_USER | sap | SSH username |
| SAP_TIANHE_PARTITION | mt_module | SLURM partition to submit into |
| SAP_TIANHE_ACCOUNT | (empty) | SLURM account. Omit the #SBATCH --account directive when empty. |
| SAP_TIANHE_QOS | (empty) | SLURM QOS. Omit the #SBATCH --qos directive when empty. |
| SAP_TIANHE_SRASS_BIN_DIR | (empty) | Directory containing the srass binary. When set, the generated sbatch script prepends it to PATH on the compute node (login /tmp is not shared with compute nodes). Empty = assume srass is on the compute node's default PATH. |
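How these variables shape the generated batch script can be sketched as below. This is an assumption-laden illustration rather than the code in TianhePlatform: the TianheConfig dataclass, render_sbatch, and its parameters are hypothetical names, and the real script likely carries more directives (job name, node counts, time limit).

```python
from dataclasses import dataclass


@dataclass
class TianheConfig:
    """Hypothetical container mirroring the SAP_TIANHE_* variables."""
    partition: str = "mt_module"
    account: str = ""
    qos: str = ""
    srass_bin_dir: str = ""


def render_sbatch(cfg: TianheConfig, workdir: str, remote_home: str, command: str) -> str:
    """Render a minimal sbatch script for one command."""
    # SLURM reads #SBATCH lines itself and does not expand $HOME in them,
    # so the working directory must be absolute before it is written out.
    if workdir.startswith("$HOME"):
        workdir = remote_home.rstrip("/") + workdir[len("$HOME"):]

    lines = [
        "#!/bin/bash",
        f"#SBATCH --partition={cfg.partition}",
        f"#SBATCH --chdir={workdir}",
    ]
    # Empty account / QOS means the directive is left out entirely.
    if cfg.account:
        lines.append(f"#SBATCH --account={cfg.account}")
    if cfg.qos:
        lines.append(f"#SBATCH --qos={cfg.qos}")
    # All #SBATCH lines must precede the first executable line.
    if cfg.srass_bin_dir:
        lines.append(f'export PATH="{cfg.srass_bin_dir}:$PATH"')
    lines.append(command)
    return "\n".join(lines) + "\n"
```

Note that a value such as ~/srass_bin would not be tilde-expanded inside the double-quoted export line, so this sketch assumes the bin directory is given as an absolute path or resolved beforehand.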
Worked example for the Tianhe new-generation system
(hnu_wuq@25.8.100.22, partition mt_module, account ecology,
SRASS installed under ~/srass_bin):
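As a rough illustration of that worked example, the snippet below resolves the SAP_TIANHE_* variables against the documented defaults. The helper names are hypothetical and the project's real settings loader may differ. The README does not state a QOS value for this system, so it is left unset here and reported as missing.

```python
import os
from typing import Dict, List, Mapping

# Defaults as documented in the table above.
DEFAULTS: Dict[str, str] = {
    "SAP_TIANHE_HOST": "tianhe.login",
    "SAP_TIANHE_USER": "sap",
    "SAP_TIANHE_PARTITION": "mt_module",
    "SAP_TIANHE_ACCOUNT": "",
    "SAP_TIANHE_QOS": "",
    "SAP_TIANHE_SRASS_BIN_DIR": "",
}


def load_tianhe_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Overlay environment values on the documented defaults."""
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


def missing_for_real_cluster(settings: Mapping[str, str]) -> List[str]:
    """Names the README says a real deployment needs but that are still empty."""
    return [n for n in ("SAP_TIANHE_ACCOUNT", "SAP_TIANHE_QOS") if not settings[n]]


# Values taken from the worked example; the QOS is not given in the README.
worked_example = {
    "SAP_TIANHE_HOST": "25.8.100.22",
    "SAP_TIANHE_USER": "hnu_wuq",
    "SAP_TIANHE_PARTITION": "mt_module",
    "SAP_TIANHE_ACCOUNT": "ecology",
    "SAP_TIANHE_SRASS_BIN_DIR": "~/srass_bin",
}
settings = load_tianhe_settings(worked_example)
still_missing = missing_for_real_cluster(settings)
```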
Dawning is accessed through the scnet.cn E-Shell gateway on a non-standard
SSH port with key-based authentication. Set at least SAP_DAWNING_USER and
SAP_DAWNING_KEY_PATH; the other knobs have safe defaults for the
hpctest06 debug queue.
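A non-standard port plus key-based authentication translates into the following ssh and scp argument lists. This is only a sketch of the command shape; the function names are hypothetical, the user and key path in the usage lines are placeholders, and the project's SSHClient may build its commands differently.

```python
from typing import List


def ssh_argv(host: str, port: int, user: str, key_path: str, remote_command: str) -> List[str]:
    """Build an ssh command line for a gateway on a non-standard port."""
    return [
        "ssh",
        "-p", str(port),          # ssh takes the port as lowercase -p
        "-i", key_path,           # private key for key-based auth
        "-o", "BatchMode=yes",    # fail instead of prompting for a password
        f"{user}@{host}",
        remote_command,
    ]


def scp_argv(host: str, port: int, user: str, key_path: str, local: str, remote: str) -> List[str]:
    """Build the matching scp upload command line."""
    return [
        "scp",
        "-P", str(port),          # scp takes the port as uppercase -P
        "-i", key_path,
        local,
        f"{user}@{host}:{remote}",
    ]


# Placeholder user and key path; substitute SAP_DAWNING_USER / SAP_DAWNING_KEY_PATH.
check_queue = ssh_argv("zzeshell.scnet.cn", 65032, "someuser", "/path/to/key", "squeue -h")
upload = scp_argv("zzeshell.scnet.cn", 65032, "someuser", "/path/to/key", "job.sh", "jobs/job.sh")
```

The lists can be passed directly to subprocess.run without a shell, which avoids quoting problems in the remote command.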
Dawning SRASS build
A CPU-only build of SRASS (SPECFEM3D_GLOBE only) has already been produced
on the Dawning hpctest06 debug queue. The build script, source tree, and
binaries are at:
/public/home/scnethpc26108/build_srass_cpu.sh # sbatch script used
/public/home/scnethpc26108/SRASS/ # uploaded source tree
/public/home/scnethpc26108/SRASS/build/bin/ # srass + specfem3d-globe binaries
The build/bin/srass wrapper loads scripts/platform/dawning/env.sh so that
python3, MPI, NetCDF, and the srass Python package are available on the
compute node without requiring SAP to emit module load commands.
To rebuild or add components (Cartesian, Unicycle, Grond, HIP/DCU), edit the
CMake flags in ~/build_srass_cpu.sh and submit it from the login node:
tests/integration/ — tests that hit a real HPC login node; currently
test_dawning_real.py exercises the Dawning E-Shell gateway and SLURM
scheduler (gated by SAP_DAWNING_KEY_PATH).
Frontend:
cd frontend
npm install
npm run test:run # vitest run
npm run lint # eslint
npm run typecheck # tsc --noEmit
npm run build # production bundle
Frontend tests live next to source as *.test.{ts,tsx}. Vitest with
jsdom + @testing-library/react. Coverage: npm run test:coverage.
Frontend dev workflow
# Terminal 1: backend on :8000
cd backend && source .venv/bin/activate
uvicorn sap.main:app --reload
# Terminal 2: frontend dev server on :5173
cd frontend
npm run dev
The Vite dev server proxies /api/* to http://localhost:8000, so the
frontend can call /api/jobs directly with no CORS configuration.
For production, run scripts/prod-build.sh from the repo root. It builds
the frontend SPA into backend/sap/static/ and then packages the backend
as a wheel. FastAPI serves the static files at / (API routes still take
precedence).
API surface
| Method | Path | Notes |
|--------|------|-------|
| GET | /api/health | Liveness probe |
| GET | /api/platforms | List known platforms + availability |
| GET | /api/job-types | List job types + param schemas |
| GET | /api/jobs | List all jobs |
| POST | /api/jobs | Create a DRAFT job |
| GET | /api/jobs/{id} | Get job details |
| POST | /api/jobs/{id}/submit | Walk DRAFT → QUEUED via platform.submit |
| POST | /api/jobs/{id}/cancel | Cancel a QUEUED/RUNNING job |
| GET | /api/jobs/{id}/output | List output files |
| GET | /api/jobs/{id}/output/{filename} | Download an output file |
| GET | /api/jobs/{id}/logs | Download the run log |
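A client can drive these endpoints by submitting a job and polling its detail route until it reaches a final state. The sketch below uses only the standard library. It assumes the job detail response is JSON with a "state" field, and that FAILED and CANCELLED exist as terminal state names; neither is confirmed by this README, so adjust to the real schema.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict

# COMPLETED is documented; FAILED and CANCELLED are assumed names.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED"}


def http_json(method: str, url: str) -> Dict[str, Any]:
    """Issue a body-less request and decode the JSON response."""
    req = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def submit_and_wait(
    base_url: str,
    job_id: str,
    call: Callable[[str, str], Dict[str, Any]] = http_json,
    interval: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Submit an existing DRAFT job, then poll until it reaches a terminal state."""
    call("POST", f"{base_url}/api/jobs/{job_id}/submit")
    for _ in range(max_polls):
        job = call("GET", f"{base_url}/api/jobs/{job_id}")
        if job.get("state") in TERMINAL:  # "state" is an assumed field name
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")
```

Once the returned state is COMPLETED, /api/jobs/{id}/output and /api/jobs/{id}/logs should already have content, since the scheduler fetches artifacts eagerly on completion.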
Architecture
See docs/superpowers/specs/2026-06-17-sap-phase1-mvp-design.md and
docs/superpowers/plans/2026-06-17-sap-phase1-p1-backend.md (the
TDD-style implementation plan that produced this code).
sap (SRASS Application Platform)
Set the required env vars (or put them in .env and export them):
Dawning (Sugon new-generation) environment variables
| Env var | Default | Meaning |
|---------|---------|---------|
| SAP_DAWNING_HOST | zzeshell.scnet.cn | |
| SAP_DAWNING_PORT | 65032 | |
| SAP_DAWNING_USER | | |
| SAP_DAWNING_KEY_PATH | | |
| SAP_DAWNING_PARTITION | hpctest06 | |
| SAP_DAWNING_TIME_LIMIT | 00:10:00 | |
| SAP_DAWNING_SRASS_BIN_DIR | | srass binary on Dawning. |
Then start the server in dev mode (auto-reload):
Or directly:
Running tests
Backend:
Tests are organized:
- tests/unit/: module-level unit tests
- tests/api/: FastAPI TestClient tests
- tests/integration/: tests that hit a real HPC login node; currently test_dawning_real.py exercises the Dawning E-Shell gateway and SLURM scheduler (gated by SAP_DAWNING_KEY_PATH).
Layered modules under backend/sap/:
License
Internal. Not for redistribution.