# Platform Roadmap
This note tracks the migration from the current single-process demo flow to a deployment model that can run on Google Cloud, AWS, or a university-managed cluster without rewriting the synthesis logic.
## Goals
- Keep the synthesis API and preprocessing pipeline cloud-agnostic.
- Replace in-memory session state with durable metadata and artifact storage.
- Support both public-cloud auth and campus SSO with the same backend claims model.
- Let the same job abstraction target Cloud Run Jobs, ECS/Fargate or AWS Batch, and Slurm/Kubernetes Jobs.
## Phase 1: Durable Foundations

Status: started
- Add environment-driven settings (`web_app/settings.py`).
- Add a metadata persistence layer (`web_app/metadata_store.py`).
- Route metadata store construction through a backend factory so SQLite local dev and future Postgres deploys share the same call sites.
- Add an object storage abstraction with local and S3-compatible backends (`web_app/object_storage.py`).
- Keep the existing FastAPI flow unchanged while the new primitives harden under tests.
- Persist preview/inference bundles so confirmation survives in-memory session loss.
Deliverables:
- SQLite metadata store for local development.
- Local artifact storage rooted under `temp_synthesis_output/state/artifacts`.
- Tests that lock in user/job/artifact semantics.
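The object storage abstraction can be sketched as a small interface with a local-filesystem backend behind it. This is a minimal sketch, not the actual `web_app/object_storage.py` API; the `ObjectStorage` protocol and `LocalObjectStorage` names are illustrative.

```python
from pathlib import Path
from typing import Protocol


class ObjectStorage(Protocol):
    """Minimal storage interface; S3/GCS backends would implement the same methods."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class LocalObjectStorage:
    """Filesystem backend rooted under a base directory (the local dev default)."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()
```

Because routes depend only on the `ObjectStorage` interface, swapping in an S3-compatible backend later requires no changes at the call sites.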
## Phase 2: Job Model

Status: in progress
- Introduce explicit job states: `queued`, `running`, `succeeded`, `failed`, `cancelled`.
- Convert `/confirm_synthesis` into job submission plus status polling.
- Move synthesized CSVs and uploaded Parquet files behind the object storage abstraction.
- Keep inline execution as the local default backend to preserve the current dev UX.
- Keep backend selection config-driven through `PRIVSYN_JOB_BACKEND` so routes do not have to change when `slurm` or cloud backends land.
- Persist confirmed run bundles so remote workers can consume portable input artifacts rather than route-local temp state.
- Treat queued backends as first-class job submissions: only inline-complete runs should populate the legacy in-memory session payload.
## Phase 3: Auth Model
- Add a normalized `users` table keyed by external subject (`sub`) and provider.
- Accept OIDC-backed identity claims in the backend.
- Map cloud auth providers and campus SSO into the same internal user record.
- Add per-job ownership checks before artifact download and evaluation.
## Phase 4: Platform Adapters

### Google Cloud
- Web/API: Cloud Run
- Jobs: Cloud Run Jobs
- Object storage: Google Cloud Storage
- Database: Cloud SQL Postgres or external Postgres
- Auth: Google login, Clerk, or another OIDC provider
### AWS
- Web/API: App Runner
- Jobs: ECS/Fargate or AWS Batch
- Object storage: S3
- Database: RDS Postgres
- Auth: Cognito or another OIDC provider
### University-managed deployment
- Web/API: campus VM or Kubernetes ingress
- Jobs: Slurm or Kubernetes Jobs
- Object storage: MinIO, Ceph, or shared storage behind the object storage interface
- Database: campus Postgres
- Auth: campus SSO via OIDC or SAML-to-OIDC bridge
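Selecting among these adapters can be kept out of the routes with a small env-driven factory keyed on `PRIVSYN_JOB_BACKEND` (the config variable named in Phase 2). This is a sketch: the registry uses string stand-ins where real backend constructors would go, and the `SlurmJobBackend`/`CloudRunJobBackend` adapter names are hypothetical.

```python
import os

# Registry of backend constructors; real entries would build the adapters
# described above. "inline" is the local default, per Phase 2.
_BACKENDS = {
    "inline": lambda: "InlineJobBackend",
    "slurm": lambda: "SlurmJobBackend",          # hypothetical adapter name
    "cloud_run": lambda: "CloudRunJobBackend",   # hypothetical adapter name
}


def make_job_backend() -> str:
    """Pick a job backend from PRIVSYN_JOB_BACKEND; routes never branch on it."""
    name = os.environ.get("PRIVSYN_JOB_BACKEND", "inline")
    try:
        return _BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown job backend: {name!r}") from None
```

New platforms land by adding one registry entry; the route code that submits jobs is untouched.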
## Integration Rules
- Business logic should not import cloud-specific SDKs directly.
- Storage code should depend on storage adapters, not filesystem paths.
- Job submission should go through one backend interface, even for inline local runs.
- Authenticated user identity should enter the synthesis flow as a normalized user record, not as provider-specific fields.
## Safe Rollout and Rollback
- Introduce each new subsystem as a dual-write or read-fallback layer first.
- Keep `SessionStore` and current local run directories working until metadata-backed paths are proven in tests.
- Preserve current API response shapes while adding job and artifact metadata behind the scenes.
- Only remove legacy state paths after one full release cycle of stable tests and manual validation.
## University Deployment Checklist
- Public entrypoint: confirm whether campus IT will host a VM, reverse proxy, or Kubernetes ingress.
- Job submission: confirm whether web services may submit to Slurm or another scheduler.
- Identity: confirm OIDC or SAML app registration path.
- Data services: confirm Postgres and object storage availability.
- Security review: confirm whether user-uploaded datasets require privacy or compliance review.
## Immediate Next Steps
- Persist preview/inference artifacts so remote runners do not depend on `SessionStore`.
- Persist scheduler-side diagnostics (exit code, stderr pointer, submission host) back into job metadata.
- Add a GCS adapter and production Postgres deployment path behind the existing storage and metadata interfaces.
- Layer per-user ownership and auth checks on top of the durable job/artifact APIs.
- Add a higher-level deployment guide that compares local, Slurm, AWS Batch, and Cloud Run setup requirements side by side.