Infrastructure Costs
Living Document - Last updated: January 27, 2026
This document tracks GCP infrastructure costs across all Eli Health projects. Review periodically to identify optimization opportunities.
Quick Summary
| Category | Estimated Daily | Estimated Monthly |
|---|---|---|
| Cloud SQL (all envs) | ~$34 | ~$1,020 |
| Cloud Run (production) | ~$13 | ~$390 |
| Cloud Run (staging) | ~$5 | ~$150 |
| Cloud Run (dev) | ~$8 | ~$240 |
| Datastream (all envs) | ~$9 | ~$270 |
| BigQuery | ~$3 | ~$90 |
| Artifact Registry | ~$2 | ~$60 |
| Compute (bastions) | ~$2 | ~$60 |
| Total | ~$76 | ~$2,280 |
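As a sanity check, the totals can be recomputed from the rows above. This is a minimal sketch: the 30-day month is an assumption inferred from the table (for example ~$34/day against ~$1,020/month), and the variable and function names are illustrative only.

```python
# Daily estimates (USD) copied from the Quick Summary table.
DAILY_ESTIMATES = {
    "Cloud SQL (all envs)": 34,
    "Cloud Run (production)": 13,
    "Cloud Run (staging)": 5,
    "Cloud Run (dev)": 8,
    "Datastream (all envs)": 9,
    "BigQuery": 3,
    "Artifact Registry": 2,
    "Compute (bastions)": 2,
}

DAYS_PER_MONTH = 30  # assumption: the table appears to use a 30-day month


def monthly_estimate(daily_usd: int, days: int = DAYS_PER_MONTH) -> int:
    """Convert a daily estimate into a monthly one."""
    return daily_usd * days


total_daily = sum(DAILY_ESTIMATES.values())
total_monthly = monthly_estimate(total_daily)
print(f"Total: ~${total_daily}/day, ~${total_monthly:,}/month")
# prints: Total: ~$76/day, ~$2,280/month
```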
Detailed Breakdown by Service
Cloud SQL Databases
The biggest fixed cost. Each environment has a dedicated PostgreSQL instance.
| Project | Instance | Tier | Disk | Region | Est. Daily |
|---|---|---|---|---|---|
| Production | postgres-instance-03e85fad-us | db-g1-small | 49 GB | us-east1 | ~$12 |
| Staging | postgres-instance-2b4c7291-us | db-g1-small | 10 GB | us-east1 | ~$10 |
| Dev | postgres-instance-fc27f365-ca | db-g1-small | 43 GB | northamerica-northeast1 | ~$12 |
Cost drivers:
- Instance uptime (24/7)
- Storage (charged per GB)
- Network egress
Cloud Run Services
Production (eli-health-prod)
| Service | CPU | Memory | minScale | maxScale | Est. Daily | Notes |
|---|---|---|---|---|---|---|
| image-analysis-us | 8 | 8Gi | 1 | 30 | ~$8 | Always running 8 vCPUs |
| api-service-us | 4 | 4Gi | 1 | 20 | ~$4 | Normal for API |
| kpi | 1 | 1Gi | 1 | 1 | ~$1 | Dashboard |
| alert-summarizer | 0.33 | 512Mi | 0 | 1 | minimal | Scales to zero |
| appstore-webhook | 0.17 | 256Mi | 0 | 1 | minimal | Scales to zero |
| syncfirebaseauthtobigquery | 0.33 | 512Mi | 0 | 1 | minimal | Scales to zero |
Key insight: image-analysis-us with minScale=1 at 8 CPU keeps one warm instance for fast response. Peak usage is ~54 requests/hour, well within capacity.
Staging (eli-health-staging)
| Service | CPU | Memory | minScale | Est. Daily |
|---|---|---|---|---|
| api-service-us | 4 | 4Gi | 0 | ~$2 |
| image-analysis-us | 4 | 4Gi | 0 | ~$2 |
| alert-summarizer | 0.33 | 512Mi | 0 | minimal |
Staging uses minScale=0 and cpu_idle=true for all services.
Development (eli-health-dev)
| Service | CPU | Memory | minScale | Est. Daily |
|---|---|---|---|---|
| api-service-ca | 4 | 4Gi | 0 | ~$2-4 |
| image-analysis-ca | 4 | 4Gi | 0 | ~$2-4 |
| kpi | 1 | 1Gi | 1 | ~$1 |
| docs | 1 | 512Mi | 0 | minimal |
| qa | 1 | 512Mi | 0 | minimal |
| alert-summarizer-development | 0.33 | 512Mi | 0 | minimal |
Dev uses minScale=0 and cpu_idle=true for most services. The kpi dashboard has minScale=1. Actual costs vary based on QA testing activity (~$5-8/day total).
Datastream (PostgreSQL to BigQuery CDC)
Real-time data replication from PostgreSQL to BigQuery.
| Project | Location | Status | Tables Synced | Est. Daily |
|---|---|---|---|---|
| Production | us-east1 | RUNNING | 15 tables | ~$3 |
| Staging | us-east1 | RUNNING | 15 tables | ~$3 |
| Dev | northamerica-northeast1 | RUNNING | 15 tables | ~$3 |
Tables being synced:
- health_goal, health_goal_lookup, health_tag, health_tag_type
- heart_rate_spike_log, measure_daily_curve, migrations_history
- period, reading, record, update_email, user
- user_connections, user_health_tag, wakeup_time
Excluded tables:
- health_data - Terra wearable data (331 GB). Excluded to avoid expensive BigQuery MERGE operations.
BigQuery
Storage Costs
| Dataset | Size | Monthly Cost | Notes |
|---|---|---|---|
| eli_health_biometricspublic | 331 GB | ~$6.60 | Mostly health_data table |
| analytics (Firebase) | 5.7 GB | ~$0.11 | GA4 export |
| All others | minimal | ~$0.02 | Shopify, Klaviyo, etc. |
Table sizes in eli_health_biometricspublic:
| Table | Size | Rows |
|---|---|---|
| health_data | 331.37 GB | 2,479,787 |
| record | 0.01 GB | 18,549 |
| reading | 0.01 GB | 23,814 |
| All others | minimal | Various |
Query costs:
- On-demand pricing: $6.25 per TiB scanned
- Current estimate: ~$1-3/day
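To see how scanned bytes translate into dollars, here is a rough sketch that applies the $6.25 rate quoted above per TiB (2^40 bytes). It assumes the table sizes in this document are binary gigabytes and ignores any free tier or per-query minimum; the function name is illustrative.

```python
ON_DEMAND_USD_PER_TIB = 6.25  # on-demand rate quoted in this document
BYTES_PER_GIB = 2**30
BYTES_PER_TIB = 2**40


def query_cost_usd(bytes_scanned: float) -> float:
    """Estimate the on-demand cost of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TIB * ON_DEMAND_USD_PER_TIB


# Example: one full scan of health_data (331.37 GB per the table above).
full_scan_bytes = 331.37 * BYTES_PER_GIB
full_scan_cost = query_cost_usd(full_scan_bytes)
print(f"Full scan of health_data: ~${full_scan_cost:.2f}")
# prints: Full scan of health_data: ~$2.02
```

Under these assumptions a single unfiltered scan of health_data costs about as much as the whole daily query estimate, so queries against that table are worth restricting to the columns and date ranges they need.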
Artifact Registry
Docker image storage.
| Project | Images | Est. Daily | Notes |
|---|---|---|---|
| Dev | ~2,000 | ~$8 | Cleanup policy active |
| Staging | ~120 | ~$0.50 | Normal |
| Production | ~225 | ~$1 | Normal |
Cleanup policies applied:
- Delete untagged images after 7 days
- Keep images tagged: latest, production, staging, development, dev, prod
- Delete other tagged images after 90 days
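The three rules can be read as a single decision per image. The sketch below only illustrates that reading; it is not how Artifact Registry evaluates policies internally. It assumes a keep rule takes precedence over any delete rule, and the function and constant names are hypothetical.

```python
from datetime import timedelta

# Tags listed in the keep rule above.
PROTECTED_TAGS = {"latest", "production", "staging", "development", "dev", "prod"}
UNTAGGED_MAX_AGE = timedelta(days=7)
TAGGED_MAX_AGE = timedelta(days=90)


def should_delete(tags: set[str], age: timedelta) -> bool:
    """Return True if an image with these tags and age would be cleaned up."""
    if tags & PROTECTED_TAGS:
        return False  # assumption: the keep rule wins over delete rules
    if not tags:
        return age > UNTAGGED_MAX_AGE  # untagged: delete after 7 days
    return age > TAGGED_MAX_AGE  # other tags: delete after 90 days


print(should_delete(set(), timedelta(days=8)))           # True
print(should_delete({"prod"}, timedelta(days=400)))      # False
print(should_delete({"feature-x"}, timedelta(days=30)))  # False
```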
Compute Engine (Bastions)
SSH tunnels for Datastream to access Cloud SQL.
| Project | Instance | Type | Est. Daily |
|---|---|---|---|
| Production | sql-bastion-host | e2-micro | ~$0.20 |
| Staging | sql-bastion-host | e2-micro | ~$0.20 |
| Dev | sql-bastion-host | e2-micro | ~$0.20 |
These are required for Datastream connectivity and are minimal cost.
Optimization Opportunities
Future Optimizations to Explore
| Item | Current | Consideration |
|---|---|---|
| health_data in BigQuery | 331 GB stored | Delete if not needed for analytics (~$6.60/month) |
| image-analysis CPU | 8 vCPUs (prod) | Evaluate if lower CPU (4 vCPUs) maintains acceptable latency |
Already Optimized
- Bastion instances are already e2-micro (cheapest)
- Most Cloud Run services scale to zero
- Artifact Registry has cleanup policies to auto-delete old images
- image-analysis minScale reduced from 2 to 1 (Jan 2026)
Not Recommended
| Optimization | Why |
|---|---|
| Stop dev/staging databases | Dev and staging have continuous QA testing. Testers and developers need these available at all times. Manual restarts take 3-5 minutes and disrupt workflows. |
| Downgrade Cloud SQL to db-f1-micro | Would reduce RAM from 1.7 GB to 0.6 GB. Databases idle at ~0.8 GB due to connection pooling - f1-micro would cause out-of-memory crashes. |
Monitoring & Alerts
Billing Export
- Enabled: Standard and Detailed usage cost export
- Dataset:
eli-health-prod.gcp_billing_export - Region: northamerica-northeast1
Budget Alerts System
Automated Slack notifications when GCP spending crosses budget thresholds.
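Architecture
GCP budgets (dev, staging, prod) → Pub/Sub topic (in eli-health-dev) → Cloud Function (billing-alerter, deduplication state in GCS) → Slack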
Budget Configuration
| Environment | Project | Budget (USD) | Thresholds |
|---|---|---|---|
| Development | eli-health-dev | $500 | 50%, 90%, 100%, 120% |
| Staging | eli-health-staging | $500 | 50%, 90%, 100%, 120% |
| Production | eli-health-prod | $2,000 | 50%, 90%, 100%, 120% |
All budgets publish to a single Pub/Sub topic in eli-health-dev. This centralizes alerting infrastructure while monitoring all three projects.
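The dollar amount at which each threshold fires follows directly from the table above. A small sketch of that arithmetic, with illustrative names:

```python
# Budgets (USD) and threshold percentages from the Budget Configuration table.
BUDGETS_USD = {
    "eli-health-dev": 500,
    "eli-health-staging": 500,
    "eli-health-prod": 2000,
}
THRESHOLD_PERCENTS = (50, 90, 100, 120)


def threshold_amounts(budget_usd: int) -> list[float]:
    """Spend (USD) at which each threshold is crossed for a given budget."""
    return [budget_usd * pct / 100 for pct in THRESHOLD_PERCENTS]


for project, budget in BUDGETS_USD.items():
    print(project, threshold_amounts(budget))
# eli-health-dev [250.0, 450.0, 500.0, 600.0]
# eli-health-staging [250.0, 450.0, 500.0, 600.0]
# eli-health-prod [1000.0, 1800.0, 2000.0, 2400.0]
```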
Smart Deduplication
The Cloud Function deduplicates alerts to prevent spam, because GCP keeps re-sending budget notifications (about every 30 minutes) while a threshold is exceeded.
Rules:
- New billing period → Always alert (monthly reset)
- Higher threshold → Alert (50% → 90% → 100% → 120%)
- Same or lower threshold → Skip (prevents spam)
State is stored in GCS at: gs://eli-health-dev-billing-alerter-source/billing-alerts/{budget-name}.json
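The rules above reduce to a small comparison against the stored state. The following is a minimal sketch, not the actual code in function/main.py: the field names last_threshold and billing_period come from the state file described in this document, while the value formats (a fraction such as 0.9, a period string such as "2026-01") and the function name are assumptions. Reading and writing the JSON file in GCS is left out.

```python
from typing import Optional, TypedDict


class AlertState(TypedDict):
    last_threshold: float  # highest threshold already alerted on (assumed format: 0.9 = 90%)
    billing_period: str    # assumed format: "YYYY-MM"


def should_alert(
    state: Optional[AlertState], threshold: float, billing_period: str
) -> bool:
    """Decide whether an incoming budget notification should reach Slack."""
    # No saved state, or a new billing period: always alert (monthly reset).
    if state is None or state["billing_period"] != billing_period:
        return True
    # Same period: alert only when a higher threshold has been crossed.
    return threshold > state["last_threshold"]


saved: AlertState = {"last_threshold": 0.9, "billing_period": "2026-01"}
print(should_alert(saved, 0.9, "2026-01"))  # False: same threshold, skipped
print(should_alert(saved, 1.0, "2026-01"))  # True: higher threshold
print(should_alert(saved, 0.5, "2026-02"))  # True: new billing period
```

After a True result the caller would overwrite the state with the new threshold and period, so that repeated notifications for the same threshold are skipped.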
Slack Message Format
Messages use Slack Block Kit with severity indicators:
| Threshold | Emoji | Status Text |
|---|---|---|
| 50% | 📊 :bar_chart: | "50%" |
| 90% | 📈 :chart_with_upwards_trend: | "90% - Approaching budget" |
| 100% | ⚠️ :warning: | "100% - AT BUDGET" |
| 120%+ | 🚨 :rotating_light: | "120% - OVER BUDGET" |
Message includes:
- Environment name (Production/Staging/Development)
- Current spend vs budget amount
- Remaining budget
- Usage percentage
- "View Billing Console" button
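For orientation, a message with these elements could be assembled roughly as follows. This is a sketch under assumptions rather than the handler's real code: the block layout, function names, and field wording are hypothetical; only the emoji shortcodes, status texts, and button label come from the tables and list above.

```python
# (minimum threshold %, emoji shortcode, status text) from the table above.
SEVERITY = [
    (120, ":rotating_light:", "120% - OVER BUDGET"),
    (100, ":warning:", "100% - AT BUDGET"),
    (90, ":chart_with_upwards_trend:", "90% - Approaching budget"),
    (50, ":bar_chart:", "50%"),
]


def severity_for(threshold_percent: float) -> tuple[str, str]:
    """Pick the emoji and status text for the threshold that was crossed."""
    for minimum, emoji, text in SEVERITY:
        if threshold_percent >= minimum:
            return emoji, text
    # Hypothetical fallback; budgets in this document never alert below 50%.
    return ":bar_chart:", f"{threshold_percent:g}%"


def build_blocks(environment: str, spend_usd: float, budget_usd: float,
                 threshold_percent: float, console_url: str) -> list[dict]:
    """Build a Slack Block Kit payload: header, detail fields, console button."""
    emoji, status = severity_for(threshold_percent)
    usage = spend_usd / budget_usd * 100
    remaining = budget_usd - spend_usd
    return [
        {"type": "header",
         "text": {"type": "plain_text", "emoji": True,
                  "text": f"{emoji} {environment}: {status}"}},
        {"type": "section",
         "fields": [
             {"type": "mrkdwn",
              "text": f"*Spend:* ${spend_usd:,.2f} of ${budget_usd:,.2f}"},
             {"type": "mrkdwn", "text": f"*Remaining:* ${remaining:,.2f}"},
             {"type": "mrkdwn", "text": f"*Usage:* {usage:.0f}%"},
         ]},
        {"type": "actions",
         "elements": [
             {"type": "button",
              "text": {"type": "plain_text", "text": "View Billing Console"},
              "url": console_url},
         ]},
    ]


blocks = build_blocks("Production", 1800.0, 2000.0, 90,
                      "https://console.cloud.google.com/billing")
print(blocks[0]["text"]["text"])
```

The resulting list would be passed as the blocks argument of a Slack chat.postMessage call.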
Terraform Configuration
Module: eli-devops/tf/modules/global/billing-alerter/
Files:
| File | Purpose |
|---|---|
| main.tf | Pub/Sub topic, Cloud Function (2nd gen), IAM |
| variables.tf | project_id, region, enabled, slack_channel |
| outputs.tf | pubsub_topic_id, function_url, service_account_email |
| function/main.py | Python handler with deduplication logic |
| function/test_main.py | 30 unit tests |
Variables in tf/variables.tf:
billing_alerter_enabled = true
billing_alerter_slack_channel = "alerts-billing"
billing_alerter_pubsub_topic_id = "projects/eli-health-dev/topics/billing-alerts"
Wiring in tf/main.tf:
module "billing_alerter" {
  source        = "./modules/global/billing-alerter"
  enabled       = var.billing_alerter_enabled
  project_id    = var.project_id
  region        = var.region
  slack_channel = var.billing_alerter_slack_channel
}

module "billing_budget" {
  # ... existing config ...
  pubsub_topic_id = var.billing_alerter_pubsub_topic_id
}
Adding/Modifying Budgets
To change budget amounts:
- Edit eli-devops/tf/{environment}.tfvars
- Modify the billing_budget_amount variable
- Run terraform apply
To add a new environment:
- Create budget module in new environment's Terraform
- Set pubsub_topic_id = "projects/eli-health-dev/topics/billing-alerts"
- The existing Cloud Function will handle notifications
Unit Tests
IMPORTANT: Run unit tests before deploying changes!
cd eli-devops/tf/modules/global/billing-alerter/function
python3 -m pytest test_main.py -v
Tests cover:
- Severity emoji selection
- Threshold text formatting
- Environment extraction from budget names
- Slack message structure
- Deduplication logic (new period, higher threshold, duplicates)
- Slack API posting
Troubleshooting
No alerts arriving:
- Check Cloud Function logs: gcloud functions logs read billing-alerter --project=eli-health-dev
- Verify Pub/Sub subscription exists
- Confirm budget has pubsub_topic configured
Duplicate alerts:
- Check GCS state: gsutil cat 'gs://eli-health-dev-billing-alerter-source/billing-alerts/{budget-name}.json'
- State should show last_threshold and billing_period
- Reset by deleting state file if needed
Permission errors:
- Service account needs roles/secretmanager.secretAccessor for Slack token
- Service account needs GCS access to state bucket
- Pub/Sub service account needs roles/run.invoker on Cloud Function
Budget Alert (Legacy)
- Budget: $2,500 CAD/month (account-level)
- Account: 016E4B-83DE60-189CD9
- Note: This is the legacy account-level budget. Project-level budgets above provide more granular control.
How to Check Current Costs
GCP Console: https://console.cloud.google.com/billing/016E4B-83DE60-189CD9
BigQuery (after 24-48 hours of data):
-- Total cost per service over the last 7 days, highest first
SELECT
  service.description,
  SUM(cost) AS total_cost
FROM `eli-health-prod.gcp_billing_export.gcp_billing_export_v1_*`
WHERE DATE(_PARTITIONTIME) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY service.description
ORDER BY total_cost DESC
Quick Reference
Projects
- eli-health-prod - Production
- eli-health-staging - Staging/QA
- eli-health-dev - Development
Regions
- Production and Staging: us-east1
- Dev: northamerica-northeast1
Key Terraform Files
- Cloud Run: eli-devops/tf/modules/regional/backend-compute/
- Cloud SQL: eli-devops/tf/modules/regional/storage/
- Datastream: eli-devops/tf/modules/regional/datastream/
- Registry: eli-devops/tf/modules/global/registry/
Changelog
January 27, 2026
- Comprehensive Budget Alerts documentation - Added full architecture diagrams, deduplication logic explanation, Terraform configuration details, and troubleshooting guide.
January 25, 2026
- Added Dev Cloud Run costs - Added missing Dev environment Cloud Run section (~$5-8/day) and updated Quick Summary totals (~$76/day).
- Reorganized optimization section - Added "Future Optimizations to Explore" (image-analysis CPU) and "Not Recommended" sections with clear reasoning.
January 24, 2026
- image-analysis-us minScale reduced from 2 to 1 - Peak usage is ~54 requests/hour, 1 instance is sufficient. Saves ~$7/day.
- Excluded health_data from Datastream - 331 GB Terra wearable data no longer synced to BigQuery, reducing MERGE costs by ~$24/day.
- Fixed staging Cloud Run scaling - Corrected minScale and cpu_idle settings that had drifted.
- Applied Artifact Registry cleanup policies - Dev registry will auto-delete old images (was 2,044 images).