Backend Guide

Architecture Overview

  • FastAPI application: web_app/main.py
      • /synthesize infers metadata and caches the original DataFrame plus draft domain info.
      • /confirm_synthesis reconstructs the DataFrame with user overrides and invokes the synthesizer.
      • /jobs/{job_id} returns persisted job metadata and registered output artifacts.
      • /download_synthesized_data/{session_id} streams the generated CSV for a confirmed synthesis session.
      • /evaluate calculates metadata-aware metrics using web_app/data_comparison.py (keyed by the same session ID).
  • Synthesis service: web_app/synthesis_service.py
      • Bridges the cached inference bundle to the selected synthesizer.
      • Handles preprocessing (clipping, binning, categorical remap) before handing off to PrivSyn or AIM.

Algorithm references: PrivSyn follows the approach in "PrivSyn: Differentially Private Data Synthesis" (Zhang et al., USENIX Security 2021); the AIM adapter implements "AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data" (McKenna et al., PVLDB 2022).

Key Modules

  • web_app/data_inference.py: Detects column types, normalizes metadata, and prepares draft domain/info payloads.
  • web_app/synthesis_service.py: Applies overrides, constructs the preprocessor, runs the synthesizer, and persists outputs.
  • web_app/job_service.py: Dual-writes inline synthesis runs into durable job + artifact records without changing the current UI flow.
  • web_app/job_bundle.py: Persists confirmed run inputs as portable bundle artifacts for future Slurm/cloud workers.
  • web_app/job_runner.py: Swappable synthesis execution backend; the current inline runner isolates execution from route-level preprocessing.
  • web_app/slurm_plan.py: Builds Slurm submission scripts from persisted confirmed job bundles without coupling Slurm details to the route layer.
  • web_app/aws_batch_plan.py: Builds aws batch submit-job ... commands from durable confirmed bundles without coupling AWS CLI details to the route layer.
  • web_app/cloud_run_plan.py: Builds gcloud run jobs execute ... commands from durable confirmed bundles without coupling Cloud Run CLI details to the route layer.
  • web_app/aws_batch_control.py: Wraps AWS Batch status and cancellation commands so queued batch jobs can be observed and cancelled through the same API shape.
  • web_app/cloud_run_control.py: Wraps Cloud Run execution status and cancellation commands so queued cloud jobs can be observed and cancelled through the same API shape.
  • web_app/job_execution.py: Shared helpers for materializing run directories and reconstructing synthesis inputs inside either the API process or a remote worker.
  • web_app/run_confirmed_job.py: Worker entrypoint that rehydrates a persisted confirmed bundle and executes the synthesis run outside the web process.
  • web_app/auth.py: Request-time auth adapter layer; currently supports none and trusted-header identity injection.
  • web_app/metadata_store.py: Metadata store abstraction with SQLite and Postgres backends behind the same CRUD surface.
  • web_app/object_storage.py: Object storage abstraction with local and S3-compatible backends, including local materialization for remote workers.
  • web_app/settings.py: Environment-driven runtime configuration for state roots, storage backends, and future auth/job adapters.
  • privsyn_platform/: Shared platform-oriented import path for auth, storage, metadata, and settings reuse across future tabular/image/text apps.
  • web_app/data_comparison.py: Implements histogram-aware TVD and other metrics for evaluation.
  • method/synthesis/privsyn/privsyn.py: PrivSyn implementation (marginal selection + GUM).
  • method/api/base.py: Core synthesizer API (SynthRegistry, PrivacySpec, RunConfig, Synthesizer protocol).
  • method/api/utils.py: Helper utilities used by adapters (e.g., split_df_by_type, schema enforcement).
  • method/synthesis/AIM/adapter.py: Adapter wiring AIM into the unified interface provided by method/api.
  • method/preprocess_common/: Shared discretizers (PrivTree, DAWA) and helper utilities.
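The swappable execution seam behind web_app/job_runner.py can be pictured as a small protocol: any backend that accepts a confirmed bundle and reports a status can replace inline execution. This is an illustrative sketch, not the actual interface; the class and method names are assumptions.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class JobRunner(Protocol):
    """Illustrative runner seam: execute a confirmed bundle, return a status."""
    def run(self, job_id: str, bundle_dir: str) -> str: ...

class InlineJobRunner:
    """In-process execution, mirroring the current inline path."""
    def run(self, job_id: str, bundle_dir: str) -> str:
        # load the confirmed bundle, execute the synthesizer, register artifacts
        return "succeeded"

class SlurmJobRunner:
    """Queued execution: submit a generated script and report 'queued' at once."""
    def run(self, job_id: str, bundle_dir: str) -> str:
        # sbatch the script produced by something like web_app/slurm_plan.py
        return "queued"
```

Because both runners satisfy the same protocol, the confirmation route never needs to know whether execution is inline or remote.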

Unified Synthesis Interface

method/api/base.py defines the shared contract every synthesis method must follow:

  • SynthRegistry exposes register, get, and list helpers so adapters (e.g., method/synthesis/privsyn/__init__.py, method/synthesis/AIM/__init__.py) can self-register at import time.
  • PrivacySpec and RunConfig capture the caller’s DP/compute requirements and are passed through to each adapter.
  • _AdapterSynth and _AdapterFitted wrap legacy prepare/run functions so existing method code needs minimal changes.

The backend dispatcher (web_app/methods_dispatcher.py) and tests such as test/test_methods_dispatcher.py rely on this registry to treat every method uniformly. Method-specific modules (method/synthesis/<name>/native.py, config.py, parameter_parser.py, etc.) stay alongside each algorithm because they encode behaviour that other methods do not share (e.g., PrivSyn’s marginal-selection parameters or AIM’s workload configuration). Keep the registry small and general, and let each method own its internal configuration files.
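The self-registration pattern described above can be sketched as follows. This is a simplified stand-in for the real SynthRegistry in method/api/base.py, whose exact signatures may differ.

```python
# Minimal sketch of an import-time registry; names are illustrative.
from typing import Callable, Dict

class SynthRegistry:
    _methods: Dict[str, Callable] = {}

    @classmethod
    def register(cls, name: str) -> Callable:
        """Decorator so adapters can self-register at import time."""
        def deco(factory: Callable) -> Callable:
            cls._methods[name] = factory
            return factory
        return deco

    @classmethod
    def get(cls, name: str) -> Callable:
        return cls._methods[name]

    @classmethod
    def list(cls):
        return sorted(cls._methods)

# An adapter package (e.g. method/synthesis/privsyn/__init__.py) would do:
@SynthRegistry.register("privsyn")
def make_privsyn(privacy_spec, run_config):
    ...  # construct and return the PrivSyn adapter
```

A dispatcher then only needs `SynthRegistry.get(method_name)` to treat every method uniformly, which is why method-specific configuration can stay inside each algorithm's own package.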

Endpoint Notes

POST /synthesize

  • Expects multipart form (fields documented in test/test_api_contract.py).
  • For sample runs, omit the file and set dataset_name=adult.
  • Stores the uploaded DataFrame and inferred metadata under a temporary UUID in memory.
  • Also persists the preview bundle (input parquet + inferred metadata + synthesis params) so /confirm_synthesis can recover after in-memory session loss.
  • All columns from the uploaded table participate in metadata inference; the API no longer accepts or drops a distinct target column.
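A client-side sketch of the multipart form, covering both the upload and the sample-dataset variants. Field names other than dataset_name (method, epsilon, file) are assumptions; treat test/test_api_contract.py as the authoritative contract.

```python
import io

def synthesize_payload(csv_bytes=None, dataset_name=None,
                       method="privsyn", epsilon=1.0):
    """Build (files, data) for POST /synthesize, e.g. for
    requests.post(url, files=files, data=data).
    For sample runs, omit csv_bytes and pass dataset_name='adult'."""
    data = {"method": method, "epsilon": str(epsilon)}
    files = {}
    if csv_bytes is not None:
        files["file"] = ("data.csv", io.BytesIO(csv_bytes), "text/csv")
    elif dataset_name is not None:
        data["dataset_name"] = dataset_name
    return files, data
```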

POST /confirm_synthesis

  • Requires the unique_id returned by /synthesize.
  • Accepts JSON strings for confirmed_domain_data and confirmed_info_data.
  • Runs the chosen synthesizer (privsyn or aim) and writes synthesized CSV + evaluation bundle to the temp directory.
  • Also registers a durable job record plus synthesized artifact metadata so later platform adapters can replace inline execution without changing API contracts.
  • Falls back to the persisted preview bundle when the in-memory inference session has expired or the process has restarted.
  • Returns first-class job fields such as job_id, status, status_url, and download_url, while still preserving the legacy session_id alias for compatibility.
  • Persists a confirmed input parquet plus job_request.json artifact so remote workers can execute the same confirmed run without route-local state.
  • Only populates the legacy in-memory evaluation session when the job finishes inline; queued backends rely on /jobs/{job_id}, durable artifacts, and the remote worker path instead.
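The confirmation form can be assembled as below: the unique_id from /synthesize plus the (possibly user-edited) domain and info payloads serialized as JSON strings. The method field name is an assumption; the documented response carries job_id, status, status_url, download_url, and the legacy session_id alias.

```python
import json

def confirm_payload(unique_id, domain, info, method="privsyn"):
    """Form fields for POST /confirm_synthesis; domain/info go over
    the wire as JSON strings per the endpoint notes."""
    return {
        "unique_id": unique_id,
        "confirmed_domain_data": json.dumps(domain),
        "confirmed_info_data": json.dumps(info),
        "method": method,
    }
```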

GET /jobs/{job_id}

  • Returns the persisted job state (running, succeeded, failed, etc.), metadata, and registered artifacts.
  • Exposes whether the legacy in-memory session bundle is still available for evaluation.
  • This endpoint is the bridge toward future remote execution backends such as Cloud Run Jobs, AWS Batch/ECS/Fargate, or Slurm.
  • The current inline execution path already routes through web_app/job_runner.py, so future backends can be swapped in without rewriting the preprocessing route.
  • Confirmed input artifacts and request bundles are now registered alongside synthesized outputs, which gives remote backends a stable payload to consume.
  • For queued backends, this is the authoritative polling endpoint until a remote worker runs python -m web_app.run_confirmed_job --job-id ... --job-request-key ....
  • The generated Slurm script now exports the shared metadata/artifact roots plus object-storage backend settings so the worker writes back into the same durable state as the web/API tier.
  • Slurm-backed jobs also opportunistically refresh queued/running state from squeue so the API reflects scheduler progress without waiting for a terminal worker callback.
  • AWS Batch-backed jobs now use the same /jobs/{job_id} and /jobs/{job_id}/cancel routes for observation and cancellation, via aws batch describe-jobs, cancel-job, and terminate-job.
  • Cloud Run job submission now goes through a parallel planning layer that emits gcloud run jobs execute commands; in production this still expects shared durable metadata/artifact backends rather than local-only paths.
  • Cloud Run-backed jobs now use the same /jobs/{job_id} and /jobs/{job_id}/cancel routes for observation and cancellation, via gcloud run jobs executions describe/cancel.
  • Remote submission backends now persist a lightweight submission diagnostic bundle in job metadata, including the backend name and the concrete CLI command used to enqueue the job.
  • Metadata store construction now goes through a factory seam keyed by PRIVSYN_METADATA_BACKEND / PRIVSYN_DATABASE_URL, defaulting to file-backed sqlite for local development and accepting Postgres URLs for shared deployments.
  • Request auth now goes through PRIVSYN_AUTH_BACKEND, with the first concrete non-demo adapter using trusted headers from a campus reverse proxy or other OIDC front end.
  • Preview bundles, durable jobs, downloads, evaluation, and RC compatibility jobs all now carry owner metadata so authenticated deployments can enforce per-user access without rewriting the synthesis core.
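For queued backends, a client polls this endpoint until a terminal state. A minimal polling sketch, assuming a status vocabulary of running/succeeded/failed/cancelled (the exact set is an assumption):

```python
import json
import time
from urllib.request import urlopen

TERMINAL = {"succeeded", "failed", "cancelled"}

def poll_job(job_id, api="http://127.0.0.1:8001", interval=2.0, get=None):
    """Poll GET /jobs/{job_id} until a terminal state is reached.
    `get` is injectable so the loop can be tested without a server."""
    if get is None:
        def get(url):
            with urlopen(url) as resp:
                return json.loads(resp.read())
    while True:
        job = get(f"{api}/jobs/{job_id}")
        if job["status"] in TERMINAL:
            return job
        time.sleep(interval)
```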

POST /jobs/{job_id}/cancel

  • Cancels a queued job through the matching backend control layer (scancel for Slurm; aws batch cancel-job/terminate-job for AWS Batch; gcloud run jobs executions cancel for Cloud Run) and marks the durable job record as cancelled.
  • Returns the same serialized job payload as GET /jobs/{job_id} so clients can reuse the polling shape after a cancel request.

GET /download_synthesized_data/{session_id}

  • Streams the generated CSV for a previously confirmed synthesis session.
  • Reads from the legacy in-memory SessionStore when available, but now falls back to the persisted artifact registry so downloads survive session cleanup.

POST /evaluate

  • Accepts session_id (form field) and reuses cached original/synth data to compute metrics (e.g., histogram TVD for numeric columns).
  • Falls back to the persisted confirmed input parquet plus synthesized CSV artifact when the in-memory session has already been evicted.
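The histogram TVD mentioned above works by binning both columns over shared edges and taking half the L1 distance between the normalized histograms (0 = identical, 1 = disjoint). This is an illustrative pure-Python version; the actual binning strategy in web_app/data_comparison.py may differ.

```python
def histogram_tvd(original, synthetic, bins=20):
    """Total variation distance between binned numeric distributions.
    Both samples share the same bin edges so the histograms are comparable."""
    lo = min(min(original), min(synthetic))
    hi = max(max(original), max(synthetic))
    width = (hi - lo) / bins or 1.0  # degenerate case: all values equal

    def normalized_hist(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        total = len(values)
        return [c / total for c in counts]

    p = normalized_hist(original)
    q = normalized_hist(synthetic)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```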

Local Development

# Run the API locally with auto-reload
uvicorn web_app.main:app --reload --port 8001

# Optionally set VITE_API_BASE_URL when running the frontend separately
export VITE_API_BASE_URL=http://127.0.0.1:8001

Configuration Tips

  • CORS origins are defined in web_app/main.py. Update the allow_origins list to include any new frontend domains.
  • Set the ADDITIONAL_CORS_ORIGINS environment variable (comma-separated list) in production to append extra origins, especially for Vercel preview/prod URLs.
  • CORS_ALLOW_ORIGINS is still accepted as a deprecated alias so older deploys do not break immediately.
  • Temporary artifacts (original data, synthesized CSVs) land under temp_synthesis_output/. Keep an eye on disk usage during iterative testing.
  • Use environment-variable overrides or .env files for production secrets (database URLs, etc.); the current setup only handles the stateless demo flow.
  • For shared cloud or campus deployments, set PRIVSYN_METADATA_BACKEND=postgres and PRIVSYN_DATABASE_URL=postgresql://...; the backend normalizes this to the psycopg SQLAlchemy driver automatically.
  • For internal deployments behind campus SSO or another trusted proxy, set PRIVSYN_AUTH_BACKEND=trusted-header and have the proxy inject X-Privsyn-Subject plus optional X-Privsyn-Email, X-Privsyn-Name, and X-Privsyn-Admin.
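The trusted-header identity flow can be sketched as a plain function over request headers. The X-Privsyn-* header names come from the notes above; the Identity dataclass and the truthy-value handling for the admin flag are assumptions about web_app/auth.py, not its actual implementation.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass
class Identity:
    subject: str
    email: Optional[str] = None
    name: Optional[str] = None
    is_admin: bool = False

def identity_from_headers(headers: Mapping[str, str]) -> Optional[Identity]:
    """Build an identity from proxy-injected headers; None means
    unauthenticated (reject or fall back per deployment policy)."""
    subject = headers.get("X-Privsyn-Subject")
    if not subject:
        return None
    return Identity(
        subject=subject,
        email=headers.get("X-Privsyn-Email"),
        name=headers.get("X-Privsyn-Name"),
        is_admin=headers.get("X-Privsyn-Admin", "").lower() in {"1", "true", "yes"},
    )
```

Note that this only makes sense when the reverse proxy strips any client-supplied X-Privsyn-* headers before injecting its own; otherwise callers could forge identities.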