PolicyEngine provides comprehensive microsimulation tools through multiple platforms: our web interface, Python packages, and API infrastructure. The API enables both our website and external applications such as MyFriendBen and Benefit Navigator to access PolicyEngine’s computational engine for precise benefit calculations and policy analysis.
To support our expanding user base and maintain robust performance, we have documented our architectural evolution and future development priorities. This technical specification will serve as a resource for current API users and potential contributors, detailing our infrastructure design decisions and planned enhancements.
Overview
The purpose of this document is to give a brief overview of the current architecture, propose a new target architecture, and outline high-level incremental steps for transitioning from one to the other. It is offered for review, with one question in particular:
- Are we aligned on the problem statement?
We expect the final architecture to:
- Reduce monthly Google Cloud costs by about 90% (roughly $6,500 USD) by reducing the amount and type of compute kept continuously “hot” and allocating more expensive hardware on an as-needed basis.
- Substantially reduce the code footprint and generally improve the maintainability of the API services we own.
- Substantially improve the observability of the services we run, providing detailed data for debugging and operations.
We do this by adjusting our API hosting environment and implementation frameworks.
For our hosting environment, we propose continuing to use Google Cloud Platform as our primary host, but leveraging Cloud Run, Workflows, and the metrics/logs/trace services to provide better cost, scalability, reliability, and observability.
For our API stack, we plan to combine our APIs (household and general) into a single, consistent API. To implement it, we propose switching from Flask with SQLAlchemy to FastAPI and SQLModel, with OpenTelemetry for trace/metric/log generation. We expect these changes to substantially reduce the code footprint and maintenance burden of our API code base.
Because migrating from our current architecture to this improved version is non-trivial, we lay out a broad, incremental plan below for transitioning from the current to the target infrastructure. This plan focuses first on reducing the cost of the main API, then on billing and operations improvements for paying customers, and finally on consolidating everything into a single PolicyEngine API.
Following review of this document, we plan to create a more detailed roadmap and implementation plan, putting the most detail on the near future and continuing to develop the plan as we implement.
Scope
In scope
- Target architecture
- Observability
Out of scope
- SPA application — this is currently being redesigned separately and should not block our API work.
- Testability/Deployment — Requires its own design
- Security — we need to do this, but as a separate document.
- The actual simulation code (policyengine.py, population data, etc) and associated stability/replicability/deployment/etc. issues — this is being covered separately in the linked reference below.
- Comparison of alternative stacks/hosting platforms — already done and included in references.
- Detailed tasking/scheduling — This document will be an input to that process.
References
- API Stack/hosting evaluation 2025 — details of what other hosting platforms/stack configurations we considered before landing on this target architecture.
- policyengine.py package — moving all the “business logic” of running a simulation into a single package outside of the API.
- FastAPI Demo — repository demonstrating FastAPI + integration with input/output and database model validation + observability + modularity.
Current Architecture
Flow Diagram (Simplified)
- PolicyEngine App — React Single Page Application (SPA) providing a UIX for running simulations
- External Clients — MyFriendBen, Benefit Navigator, and other external, paying customers of our API who build user experiences on top of it.
- API — Flask-based API used to support the PolicyEngine app.
- Household API — a completely separate API, also implemented in Flask and also running in App Engine, that only does household simulations.
- Database — Cloud SQL managed database for storing policies, households, user data, simulation results, etc.
- Auth0 — external OAuth 2.0 provider used to authenticate external users.
Limitations
- Scaling — The current API scales by running more worker processes on a single, beefier container.
- Stability — Because all worker processes run on one host, a failure of that host affects every running workflow and loses any local state.
- Cost — The one container has to be scaled to support multiple workflows and stay up 100% of the time, even though most of the time it is not running any workflows. This costs on the order of $10K a month.
- Observability — The various components running in App Engine do not generate trace or metric information and provide only limited logging.
- Billing — There is currently no automated mechanism to capture and bill for usage of the commercial API.
- Maintenance — Generally, the system is hard to maintain.
Target Architecture
Flow Diagram
NOTE: Components only called out where new relative to current architecture
- PolicyEngine API — Instead of two completely separate APIs, one common API code base used to run multiple instances.
- GCP Workflow — GCP-based orchestrator able to run a sequence of tasks, handle retries/errors, etc.
- Simulation API — FastAPI-based Cloud Run deployment (job, service, or function; TBD) used to actually execute simulation tasks on appropriate hardware.
- Stripe — Automated billing for API usage of paying customers.
- logging/metrics/trace — GCP observability integrations, automatically generated by all FastAPI-based services for all operations, SQL statements, and logs.
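To make the orchestrator's role concrete, here is a minimal, stdlib-only Python sketch of the behaviour described above: run a sequence of tasks and retry one that fails. It is not the GCP Workflows API or its syntax. The names `run_workflow`, `validate_request`, and `flaky_simulation`, and the shape of the request, are hypothetical and chosen only for illustration.

```python
import time
from typing import Any, Callable

Step = Callable[[Any], Any]


def run_workflow(steps: list[Step], payload: Any,
                 max_attempts: int = 3, backoff_seconds: float = 0.0) -> Any:
    """Run steps in order, passing each result to the next step.

    A step that raises is retried up to max_attempts times; the last
    failure is re-raised so the caller sees the error.
    """
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                payload = step(payload)
                break  # step succeeded, move on to the next one
            except Exception:
                if attempt == max_attempts:
                    raise
                time.sleep(backoff_seconds * attempt)  # linear backoff
    return payload


# Hypothetical steps, standing in for calls to the Simulation API.
calls = {"count": 0}


def validate_request(request: dict) -> dict:
    if "household" not in request:
        raise ValueError("request has no household")
    return request


def flaky_simulation(request: dict) -> dict:
    calls["count"] += 1
    if calls["count"] < 2:  # fail once to exercise the retry path
        raise RuntimeError("transient failure")
    return {**request, "status": "complete"}
```

Calling `run_workflow([validate_request, flaky_simulation], {"household": {}})` retries the simulation step once and then returns the completed request. In the target architecture this retry and error handling would be delegated to the managed workflow service rather than written by hand.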
Benefits
- Scaling - API is stateless and containers can be added/removed to support traffic
- Stability - Failure of a single container does not cause loss of state
- Cost
- Observability - All FastAPI-based services will be integrated with OpenTelemetry for metrics/logging/trace information, providing good default observability for all services (latency, error rate, and log details for all operations and SQL queries).
- Billing — Addressed via Stripe integration.
- Maintenance
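The kind of default observability listed above (latency and error rate per operation) can be illustrated with a small stdlib-only Python sketch. It is a stand-in for what OpenTelemetry instrumentation would provide and does not use the OpenTelemetry API. The `METRICS` store, the `observed` decorator, and the `calculate_household` function with its input shape are all hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-in for a metrics backend (hypothetical; a real
# service would export these values instead of holding them locally).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def observed(operation: str):
    """Record call count, error count, and latency for an operation."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[operation]["errors"] += 1
                raise
            finally:
                # Runs on success and on failure, so every call is counted.
                METRICS[operation]["calls"] += 1
                METRICS[operation]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


def error_rate(operation: str) -> float:
    """Fraction of calls to an operation that raised an exception."""
    stats = METRICS[operation]
    return stats["errors"] / stats["calls"] if stats["calls"] else 0.0


@observed("calculate_household")
def calculate_household(household: dict) -> dict:
    # Hypothetical operation; the real calculation lives in policyengine.py.
    if not household:
        raise ValueError("empty household")
    return {"people": len(household.get("people", {}))}
```

The point of the design is that the operation itself contains no measurement code: wrapping it is enough to get counts, errors, and timings for every call.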
Transition Plan
Phase 1 — Reduce Cost of the main API
We initially tackle the main API by removing the Redis queue/worker setup and replacing it with a GCP workflow executed against a new Simulation API based on FastAPI and policyengine.py.
We then scale down the App Engine instance to support only the Flask API, reducing the cost of the always-on host, and configure the Simulation API to scale only when used, reducing that cost to the time spent running actual simulation requests.
The main API otherwise remains the same and has many of the same limitations (local state, lack of easy observability, etc.).
Household API is unchanged.
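The saving from scaling the Simulation API only when used can be expressed as simple arithmetic. The Python sketch below compares a host billed for every hour with compute billed only while requests run. The function names and the 730-hour month are assumptions of this sketch, and no real GCP prices are used; rates and request volumes are left as parameters.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_MONTH = 730  # average month length, an assumption of this sketch


def always_on_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of a host that is billed for every hour, busy or idle."""
    return hourly_rate * hours


def on_demand_cost(hourly_rate: float, requests: int,
                   seconds_per_request: float) -> float:
    """Cost when compute is billed only while a request is running."""
    busy_hours = requests * seconds_per_request / SECONDS_PER_HOUR
    return hourly_rate * busy_hours
```

The two functions take separate rates, so a more expensive on-demand machine can still come out cheaper as long as its busy hours are a small fraction of the month.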
Flow Diagram
Benefits
- Cost reduction
- Demonstrate/vet technologies we propose for the main API
- Demonstrate integration with the Workflow service and GCP logging/metrics/trace
- Demonstrate FastAPI on Cloud Run with scaling
Phase 2 — Implement Billing and Operations Improvements for Paying Customers
Household API is completely replaced by a new FastAPI-based implementation. This implementation is based on the full target architecture, but only implements the household simulation part of it.
Household simulations are executed using the same simulation API which is based on the new policyengine.py package.
Flow Diagram
Benefits
- Automated billing — using Stripe, we can now automate metering and billing our API customers by usage.
- Additional Flexibility — the addition of the workflow and database means the household API can be extended to operate like the main API, with all the same features (this is the next step).
- Improved observability/operability — household API now generates traces/logging/metrics which will support robust alarming and debuggability.
- Demonstrate/vet technologies
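Usage-based billing as described above depends on counting calls per customer before anything is sent to a billing provider. The Python sketch below shows only that metering step. It does not call Stripe, and `UsageEvent`, `meter_usage`, `invoice_amounts_cents`, and the per-call price are hypothetical names and parameters used for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class UsageEvent:
    """One billable API call (hypothetical record shape)."""
    customer_id: str
    endpoint: str


def meter_usage(events: Iterable[UsageEvent]) -> Counter:
    """Count billable calls per customer."""
    return Counter(event.customer_id for event in events)


def invoice_amounts_cents(usage: Counter, price_per_call_cents: int) -> dict[str, int]:
    """Convert call counts to amounts in integer cents (avoids float rounding)."""
    return {customer: count * price_per_call_cents
            for customer, count in usage.items()}
```

In practice the per-customer totals would be reported to the billing provider, which then handles invoicing; that hand-off is outside the scope of this sketch.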
Phase 3+
We fully replace the web application with the new design currently being developed, and implement it on top of the “Household API”, adding functionality to that API until it becomes the “PolicyEngine API”.
This will involve multiple phases and additional design work to flesh out. Open questions include:
- What data that customers re-use do we need to persist from the existing application/API?
- What links that customers may have saved/referenced do we need to persist from the existing application?
- What schema will we be using in our database to model users/policies/households/etc.?
Cost Analysis
We estimate we could reduce our current monthly GCP compute bill from ~$7,000 a month to roughly $500 a month, net (based on very conservative assumptions). This should reduce our overall GCP cost by about 90%.
The primary driver of cost now is running App Engine. In the target architecture, the main driver of cost is running our simulation (policyengine.py) in Cloud Run functions.
This analysis assumes substantially more traffic than we currently receive, sustained all day, every day, all month.
Current Major Drivers of Cost
App Engine is by far the major driver of infrastructure costs at PolicyEngine, comprising 90% of the GCP infrastructure cost.
Estimated Compute Cost (Target Architecture)
This estimate assumes we have transitioned both APIs to handle only API requests and to delegate all simulation to a workflow.
Cost was estimated using these usage estimates:
- Simulation API
- API compute
- Workflow (not in the calculator, despite what the documentation asserts; pricing is here: https://cloud.google.com/workflows/pricing)
Total Cost: $538 a month (calculator here)
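As a quick check of the headline figures, the Python sketch below computes the monthly saving and percentage reduction from the two compute numbers quoted in this section (~$7,000 today and $538 in the target architecture). The function names are illustrative, and the result covers only these compute figures, not the overall GCP bill.

```python
CURRENT_MONTHLY_USD = 7_000  # "~$7,000 a month" quoted above
TARGET_MONTHLY_USD = 538     # "Total Cost: $538 a month" quoted above


def monthly_savings(current: float, target: float) -> float:
    """Absolute monthly saving in USD."""
    return current - target


def percent_reduction(current: float, target: float) -> float:
    """Saving as a percentage of the current cost."""
    return 100 * (current - target) / current


savings = monthly_savings(CURRENT_MONTHLY_USD, TARGET_MONTHLY_USD)
reduction = round(percent_reduction(CURRENT_MONTHLY_USD, TARGET_MONTHLY_USD), 1)
print(savings, reduction)  # 6462 92.3
```

The saving of $6,462 a month is consistent with the roughly $6,500 and about 90% figures used elsewhere in this document.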