Infrastructure Costs

Living Document - Last updated: January 27, 2026

This document tracks GCP infrastructure costs across all Eli Health projects. Review periodically to identify optimization opportunities.

Quick Summary

| Category | Estimated Daily | Estimated Monthly |
| --- | --- | --- |
| Cloud SQL (all envs) | ~$34 | ~$1,020 |
| Cloud Run (production) | ~$13 | ~$390 |
| Cloud Run (staging) | ~$5 | ~$150 |
| Cloud Run (dev) | ~$8 | ~$240 |
| Datastream (all envs) | ~$9 | ~$270 |
| BigQuery | ~$3 | ~$90 |
| Artifact Registry | ~$2 | ~$60 |
| Compute (bastions) | ~$2 | ~$60 |
| Total | ~$76 | ~$2,280 |
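
The totals can be cross-checked with a few lines of Python. This is only a sketch: the daily figures are copied from the table, and the 30-day month is an assumption inferred from the table's own monthly column.

```python
# Daily estimates (USD) copied from the Quick Summary table.
DAILY_ESTIMATES_USD = {
    "Cloud SQL (all envs)": 34,
    "Cloud Run (production)": 13,
    "Cloud Run (staging)": 5,
    "Cloud Run (dev)": 8,
    "Datastream (all envs)": 9,
    "BigQuery": 3,
    "Artifact Registry": 2,
    "Compute (bastions)": 2,
}

DAYS_PER_MONTH = 30  # assumption: monthly figures in the table equal daily x 30


def monthly_estimate(daily_usd, days=DAYS_PER_MONTH):
    """Scale a daily estimate to a monthly one."""
    return daily_usd * days


total_daily = sum(DAILY_ESTIMATES_USD.values())
total_monthly = monthly_estimate(total_daily)

if __name__ == "__main__":
    for name, daily in DAILY_ESTIMATES_USD.items():
        print(f"{name:<24} ~${daily}/day  ~${monthly_estimate(daily):,}/month")
    print(f"{'Total':<24} ~${total_daily}/day  ~${total_monthly:,}/month")
```

Updating a single entry in the dictionary is enough to see how a change in one service moves the overall daily and monthly totals.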

Detailed Breakdown by Service

Cloud SQL Databases

Cloud SQL is the largest fixed cost: each environment runs a dedicated PostgreSQL instance.

| Project | Instance | Tier | Disk | Region | Est. Daily |
| --- | --- | --- | --- | --- | --- |
| Production | postgres-instance-03e85fad-us | db-g1-small | 49 GB | us-east1 | ~$12 |
| Staging | postgres-instance-2b4c7291-us | db-g1-small | 10 GB | us-east1 | ~$10 |
| Dev | postgres-instance-fc27f365-ca | db-g1-small | 43 GB | northamerica-northeast1 | ~$12 |

Cost drivers:

  • Instance uptime (24/7)
  • Storage (charged per GB)
  • Network egress

Cloud Run Services

Production (eli-health-prod)

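| Service | CPU | Memory | minScale | maxScale | Est. Daily | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| image-analysis-us | 8 | 8Gi | 1 | 30 | ~$8 | Always running 8 vCPUs |
| api-service-us | 4 | 4Gi | 1 | 20 | ~$4 | Normal for API |
| kpi | 1 | 1Gi | 1 | 1 | ~$1 | Dashboard |
| alert-summarizer | 0.33 | 512Mi | 0 | 1 | minimal | Scales to zero |
| appstore-webhook | 0.17 | 256Mi | 0 | 1 | minimal | Scales to zero |
| syncfirebaseauthtobigquery | 0.33 | 512Mi | 0 | 1 | minimal | Scales to zero |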

Key insight: image-analysis-us with minScale=1 at 8 CPU keeps one warm instance for fast response. Peak usage is ~54 requests/hour, well within capacity.

Staging (eli-health-staging)

| Service | CPU | Memory | minScale | Est. Daily |
| --- | --- | --- | --- | --- |
| api-service-us | 4 | 4Gi | 0 | ~$2 |
| image-analysis-us | 4 | 4Gi | 0 | ~$2 |
| alert-summarizer | 0.33 | 512Mi | 0 | minimal |

Staging uses minScale=0 and cpu_idle=true for all services.

Development (eli-health-dev)

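| Service | CPU | Memory | minScale | Est. Daily |
| --- | --- | --- | --- | --- |
| api-service-ca | 4 | 4Gi | 0 | ~$2-4 |
| image-analysis-ca | 4 | 4Gi | 0 | ~$2-4 |
| kpi | 1 | 1Gi | 1 | ~$1 |
| docs | 1 | 512Mi | 0 | minimal |
| qa | 1 | 512Mi | 0 | minimal |
| alert-summarizer-development | 0.33 | 512Mi | 0 | minimal |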

Dev uses minScale=0 and cpu_idle=true for most services. The kpi dashboard has minScale=1. Actual costs vary based on QA testing activity (~$5-8/day total).


Datastream (PostgreSQL to BigQuery CDC)

Real-time data replication from PostgreSQL to BigQuery.

| Project | Location | Status | Tables Synced | Est. Daily |
| --- | --- | --- | --- | --- |
| Production | us-east1 | RUNNING | 15 tables | ~$3 |
| Staging | us-east1 | RUNNING | 15 tables | ~$3 |
| Dev | northamerica-northeast1 | RUNNING | 15 tables | ~$3 |

Tables being synced:

  • health_goal, health_goal_lookup, health_tag, health_tag_type
  • heart_rate_spike_log, measure_daily_curve, migrations_history
  • period, reading, record, update_email, user
  • user_connections, user_health_tag, wakeup_time

Excluded tables:

  • health_data - Terra wearable data (331 GB). Excluded to avoid expensive BigQuery MERGE operations.

BigQuery

Storage Costs

| Dataset | Size | Monthly Cost | Notes |
| --- | --- | --- | --- |
| eli_health_biometricspublic | 331 GB | ~$6.60 | Mostly health_data table |
| analytics (Firebase) | 5.7 GB | ~$0.11 | GA4 export |
| All others | minimal | ~$0.02 | Shopify, Klaviyo, etc. |

Table sizes in eli_health_biometricspublic:

| Table | Size | Rows |
| --- | --- | --- |
| health_data | 331.37 GB | 2,479,787 |
| record | 0.01 GB | 18,549 |
| reading | 0.01 GB | 23,814 |
| All others | minimal | Various |

Query costs:

  • On-demand pricing: $6.25 per TB scanned
  • Current estimate: ~$1-3/day
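
The storage and query figures above follow from simple per-unit arithmetic, sketched below. The storage rate of $0.02 per GB per month is an assumption inferred from the table (331 GB at ~$6.60/month), and treating a billed "TB" as 2^40 bytes is also an assumption. The function names are illustrative only.

```python
STORAGE_USD_PER_GB_MONTH = 0.02  # assumption: inferred from 331 GB -> ~$6.60/month
QUERY_USD_PER_TB = 6.25          # on-demand rate stated in this document
BYTES_PER_TB = 2 ** 40           # assumption: billed "TB" treated as 2^40 bytes


def storage_cost_per_month(size_gb):
    """Estimated monthly storage cost in USD for a dataset of size_gb."""
    return size_gb * STORAGE_USD_PER_GB_MONTH


def query_cost(bytes_scanned):
    """Estimated on-demand cost in USD for a query scanning bytes_scanned."""
    return bytes_scanned / BYTES_PER_TB * QUERY_USD_PER_TB


def tb_scanned_for(usd):
    """How many TB of scanning a given USD amount pays for."""
    return usd / QUERY_USD_PER_TB


if __name__ == "__main__":
    print(f"health_data storage: ~${storage_cost_per_month(331):.2f}/month")
    print(f"$1-3/day of queries: {tb_scanned_for(1):.2f}-{tb_scanned_for(3):.2f} TB scanned/day")
```

Under these assumptions, the current ~$1-3/day query estimate corresponds to roughly 0.16-0.48 TB scanned per day.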

Artifact Registry

Docker image storage.

| Project | Images | Est. Daily | Notes |
| --- | --- | --- | --- |
| Dev | ~2,000 | ~$8 | Cleanup policy active |
| Staging | ~120 | ~$0.50 | Normal |
| Production | ~225 | ~$1 | Normal |

Cleanup policies applied:

  • Delete untagged images after 7 days
  • Keep images tagged: latest, production, staging, development, dev, prod
  • Delete other tagged images after 90 days
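
The three rules above can be modelled as a small decision function. This is a sketch of the policy as described here, not of how Artifact Registry evaluates policies internally; `should_delete` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Tags that protect an image from deletion (from the policy list above).
KEEP_TAGS = {"latest", "production", "staging", "development", "dev", "prod"}
UNTAGGED_MAX_AGE = timedelta(days=7)
TAGGED_MAX_AGE = timedelta(days=90)


def should_delete(tags, uploaded_at, now):
    """Return True if an image would be removed under the documented rules."""
    age = now - uploaded_at
    if not tags:
        # Rule 1: untagged images are deleted after 7 days.
        return age > UNTAGGED_MAX_AGE
    if set(tags) & KEEP_TAGS:
        # Rule 2: images carrying a protected tag are always kept.
        return False
    # Rule 3: any other tagged image is deleted after 90 days.
    return age > TAGGED_MAX_AGE


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(should_delete(set(), now - timedelta(days=10), now))        # untagged, 10 days old
    print(should_delete({"latest"}, now - timedelta(days=400), now))  # protected tag
```

Checking the protected tags before the 90-day rule means an old image tagged both `latest` and a commit hash is still kept.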

Compute Engine (Bastions)

Bastion hosts that provide the SSH tunnels Datastream uses to reach Cloud SQL.

| Project | Instance | Type | Est. Daily |
| --- | --- | --- | --- |
| Production | sql-bastion-host | e2-micro | ~$0.20 |
| Staging | sql-bastion-host | e2-micro | ~$0.20 |
| Dev | sql-bastion-host | e2-micro | ~$0.20 |

These are required for Datastream connectivity and cost very little.


Optimization Opportunities

Future Optimizations to Explore

| Item | Current | Consideration |
| --- | --- | --- |
| health_data in BigQuery | 331 GB stored | Delete if not needed for analytics (~$6.60/month) |
| image-analysis CPU | 8 vCPUs (prod) | Evaluate if lower CPU (4 vCPUs) maintains acceptable latency |

Already Optimized

  • Bastion instances are already e2-micro (cheapest)
  • Most Cloud Run services scale to zero
  • Artifact Registry has cleanup policies to auto-delete old images
  • image-analysis minScale reduced from 2 to 1 (Jan 2026)

Not Recommended

| Optimization | Why |
| --- | --- |
| Stop dev/staging databases | Dev and staging have continuous QA testing. Testers and developers need these available at all times. Manual restarts take 3-5 minutes and disrupt workflows. |
| Downgrade Cloud SQL to db-f1-micro | Would reduce RAM from 1.7 GB to 0.6 GB. Databases idle at ~0.8 GB due to connection pooling, so f1-micro would cause out-of-memory crashes. |

Monitoring & Alerts

Billing Export

  • Enabled: Standard and Detailed usage cost export
  • Dataset: eli-health-prod.gcp_billing_export
  • Region: northamerica-northeast1

Budget Alerts System

Automated Slack notifications when GCP spending crosses budget thresholds.

Architecture

Budgets (dev, staging, prod) → Pub/Sub topic (eli-health-dev) → Cloud Function (deduplication, state in GCS) → Slack

Budget Configuration

| Environment | Project | Budget (USD) | Thresholds |
| --- | --- | --- | --- |
| Development | eli-health-dev | $500 | 50%, 90%, 100%, 120% |
| Staging | eli-health-staging | $500 | 50%, 90%, 100%, 120% |
| Production | eli-health-prod | $2,000 | 50%, 90%, 100%, 120% |

All budgets publish to a single Pub/Sub topic in eli-health-dev. This centralizes alerting infrastructure while monitoring all three projects.
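
For orientation, here is a minimal sketch of how a budget notification arriving on that topic might be decoded. It assumes the standard Cloud Billing budget notification payload (base64-encoded JSON with `budgetDisplayName`, `costAmount`, `budgetAmount`, `costIntervalStart`, and an optional `alertThresholdExceeded`). The function name and the returned keys are hypothetical and not taken from function/main.py.

```python
import base64
import json


def parse_budget_notification(data_b64):
    """Decode a Pub/Sub message body into the fields an alerter needs."""
    payload = json.loads(base64.b64decode(data_b64).decode("utf-8"))
    cost = float(payload["costAmount"])
    budget = float(payload["budgetAmount"])
    return {
        "budget_name": payload["budgetDisplayName"],
        "billing_period": payload["costIntervalStart"],
        # Absent until spending crosses the first configured threshold.
        "threshold": payload.get("alertThresholdExceeded"),
        "cost": cost,
        "budget": budget,
        "remaining": budget - cost,
        "usage_pct": cost * 100 / budget if budget else 0.0,
    }


if __name__ == "__main__":
    sample = {
        "budgetDisplayName": "example-budget",  # hypothetical budget name
        "costAmount": 450.0,
        "budgetAmount": 500.0,
        "alertThresholdExceeded": 0.9,
        "costIntervalStart": "2026-01-01T08:00:00Z",
    }
    encoded = base64.b64encode(json.dumps(sample).encode("utf-8")).decode("ascii")
    print(parse_budget_notification(encoded))
```

Because every project publishes to the same topic, the budget name in the payload is the only thing that distinguishes environments, so it is carried through unchanged.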

Smart Deduplication

GCP republishes budget notifications about every 30 minutes while a threshold is exceeded, so the Cloud Function deduplicates them to prevent alert spam.

Rules:

  1. New billing period → Always alert (monthly reset)
  2. Higher threshold → Alert (50% → 90% → 100% → 120%)
  3. Same or lower threshold → Skip (prevents spam)

State is stored in GCS at: gs://eli-health-dev-billing-alerter-source/billing-alerts/{budget-name}.json
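
The three rules can be sketched as a pure function over the stored state. This is an illustration only, not the code in function/main.py: the function names are hypothetical, GCS reads and writes are omitted, and alerting when no state file exists yet is an assumption. The `last_threshold` and `billing_period` keys match the state fields described under Troubleshooting.

```python
def should_alert(state, billing_period, threshold):
    """Decide whether a notification should be forwarded to Slack.

    state is the previously stored dict, or None if no state file exists.
    """
    if state is None:
        return True  # assumption: first notification for this budget always alerts
    if state.get("billing_period") != billing_period:
        return True  # rule 1: new billing period, monthly reset
    # Rule 2 alerts on a higher threshold; rule 3 skips same or lower ones.
    return threshold > state.get("last_threshold", 0.0)


def next_state(billing_period, threshold):
    """State to persist after an alert has been sent."""
    return {"billing_period": billing_period, "last_threshold": threshold}


def state_object_path(budget_name):
    """Object path inside the state bucket for a given budget."""
    return f"billing-alerts/{budget_name}.json"


if __name__ == "__main__":
    state = None
    for period, threshold in [("2026-01", 0.5), ("2026-01", 0.5),
                              ("2026-01", 0.9), ("2026-02", 0.5)]:
        alert = should_alert(state, period, threshold)
        print(period, threshold, "ALERT" if alert else "skip")
        if alert:
            state = next_state(period, threshold)
```

Persisting state only after an alert is sent keeps the stored threshold equal to the highest level already reported in the current period.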

Slack Message Format

Messages use Slack Block Kit with severity indicators:

| Threshold | Emoji | Status Text |
| --- | --- | --- |
| 50% | 📊 :bar_chart: | "50%" |
| 90% | 📈 :chart_with_upwards_trend: | "90% - Approaching budget" |
| 100% | ⚠️ :warning: | "100% - AT BUDGET" |
| 120%+ | 🚨 :rotating_light: | "120% - OVER BUDGET" |

Message includes:

  • Environment name (Production/Staging/Development)
  • Current spend vs budget amount
  • Remaining budget
  • Usage percentage
  • "View Billing Console" button
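
A Block Kit payload with these elements might look like the sketch below. The emoji and status text come from the table above. The function names, the exact block layout, the fallback for thresholds below 50%, and the generic console URL are all assumptions rather than the real handler's output.

```python
BILLING_CONSOLE_URL = "https://console.cloud.google.com/billing"  # assumption: generic link

# (minimum percent, emoji shortcode, status text), highest severity first.
SEVERITIES = [
    (120, ":rotating_light:", "120% - OVER BUDGET"),
    (100, ":warning:", "100% - AT BUDGET"),
    (90, ":chart_with_upwards_trend:", "90% - Approaching budget"),
    (50, ":bar_chart:", "50%"),
]


def severity(threshold_pct):
    """Map a threshold percentage to (emoji, status text)."""
    for minimum, emoji, text in SEVERITIES:
        if threshold_pct >= minimum:
            return emoji, text
    # Below 50% is not covered by the table; this fallback is an assumption.
    return ":bar_chart:", f"{threshold_pct:.0f}%"


def build_slack_message(environment, spend, budget, threshold_pct):
    """Build a Slack Block Kit payload for one budget alert."""
    emoji, status = severity(threshold_pct)
    usage_pct = spend * 100 / budget if budget else 0.0
    return {
        "blocks": [
            {"type": "header",
             "text": {"type": "plain_text", "emoji": True,
                      "text": f"{emoji} {environment}: {status}"}},
            {"type": "section",
             "fields": [
                 {"type": "mrkdwn", "text": f"*Spend:* ${spend:,.2f} of ${budget:,.2f}"},
                 {"type": "mrkdwn", "text": f"*Remaining:* ${budget - spend:,.2f}"},
                 {"type": "mrkdwn", "text": f"*Usage:* {usage_pct:.1f}%"},
             ]},
            {"type": "actions",
             "elements": [
                 {"type": "button",
                  "text": {"type": "plain_text", "text": "View Billing Console"},
                  "url": BILLING_CONSOLE_URL},
             ]},
        ]
    }


if __name__ == "__main__":
    import json
    print(json.dumps(build_slack_message("Production", 1800.0, 2000.0, 90), indent=2))
```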

Terraform Configuration

Module: eli-devops/tf/modules/global/billing-alerter/

Files:

| File | Purpose |
| --- | --- |
| main.tf | Pub/Sub topic, Cloud Function (2nd gen), IAM |
| variables.tf | project_id, region, enabled, slack_channel |
| outputs.tf | pubsub_topic_id, function_url, service_account_email |
| function/main.py | Python handler with deduplication logic |
| function/test_main.py | 30 unit tests |

Variables in tf/variables.tf:

billing_alerter_enabled         = true
billing_alerter_slack_channel   = "alerts-billing"
billing_alerter_pubsub_topic_id = "projects/eli-health-dev/topics/billing-alerts"

Wiring in tf/main.tf:

module "billing_alerter" {
  source        = "./modules/global/billing-alerter"
  enabled       = var.billing_alerter_enabled
  project_id    = var.project_id
  region        = var.region
  slack_channel = var.billing_alerter_slack_channel
}

module "billing_budget" {
  # ... existing config ...
  pubsub_topic_id = var.billing_alerter_pubsub_topic_id
}

Adding/Modifying Budgets

To change budget amounts:

  1. Edit eli-devops/tf/{environment}.tfvars
  2. Modify the billing_budget_amount variable
  3. Run terraform apply

To add a new environment:

  1. Create budget module in new environment's Terraform
  2. Set pubsub_topic_id = "projects/eli-health-dev/topics/billing-alerts"
  3. The existing Cloud Function will handle notifications

Unit Tests

IMPORTANT: Run unit tests before deploying changes!

cd eli-devops/tf/modules/global/billing-alerter/function
python3 -m pytest test_main.py -v

Tests cover:

  • Severity emoji selection
  • Threshold text formatting
  • Environment extraction from budget names
  • Slack message structure
  • Deduplication logic (new period, higher threshold, duplicates)
  • Slack API posting
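
As an illustration of the style of test listed above, here is a sketch covering environment extraction. `extract_environment` is a hypothetical stand-in written for this example, and the budget names are invented. The real function and naming scheme live in function/main.py and may differ.

```python
# Hypothetical mapping from a substring of the budget name to a display label.
ENVIRONMENTS = (
    ("prod", "Production"),
    ("staging", "Staging"),
    ("dev", "Development"),
)


def extract_environment(budget_name):
    """Guess the environment label from a budget name (illustrative only)."""
    name = budget_name.lower()
    for marker, label in ENVIRONMENTS:
        if marker in name:
            return label
    return "Unknown"


# pytest-style tests: plain functions whose names start with test_.
def test_extracts_known_environments():
    assert extract_environment("eli-health-prod-budget") == "Production"
    assert extract_environment("eli-health-staging-budget") == "Staging"
    assert extract_environment("eli-health-dev-budget") == "Development"


def test_unknown_budget_name_falls_back():
    assert extract_environment("some-other-budget") == "Unknown"


if __name__ == "__main__":
    test_extracts_known_environments()
    test_unknown_budget_name_falls_back()
    print("ok")
```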

Troubleshooting

No alerts arriving:

  1. Check Cloud Function logs: gcloud functions logs read billing-alerter --gen2 --project=eli-health-dev
  2. Verify that the Pub/Sub subscription exists
  3. Confirm that the budget has pubsub_topic configured

Duplicate alerts:

  1. Check GCS state: gsutil cat 'gs://eli-health-dev-billing-alerter-source/billing-alerts/{budget-name}.json'
  2. State should show last_threshold and billing_period
  3. Reset by deleting state file if needed

Permission errors:

  1. The function's service account needs roles/secretmanager.secretAccessor to read the Slack token
  2. The function's service account needs GCS access to the state bucket
  3. The Pub/Sub service account needs roles/run.invoker on the Cloud Function (2nd gen functions run on Cloud Run)

Budget Alert (Legacy)

  • Budget: $2,500 CAD/month (account-level)
  • Account: 016E4B-83DE60-189CD9
  • Note: This is the legacy account-level budget. Project-level budgets above provide more granular control.

How to Check Current Costs

GCP Console: https://console.cloud.google.com/billing/016E4B-83DE60-189CD9

BigQuery (after 24-48 hours of data):

SELECT
  service.description,
  SUM(cost) AS total_cost
FROM `eli-health-prod.gcp_billing_export.gcp_billing_export_v1_*`
WHERE DATE(_PARTITIONTIME) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY service.description
ORDER BY total_cost DESC

Quick Reference

Projects

  • eli-health-prod - Production
  • eli-health-staging - Staging/QA
  • eli-health-dev - Development

Regions

  • Production and Staging: us-east1
  • Dev: northamerica-northeast1

Key Terraform Files

  • Cloud Run: eli-devops/tf/modules/regional/backend-compute/
  • Cloud SQL: eli-devops/tf/modules/regional/storage/
  • Datastream: eli-devops/tf/modules/regional/datastream/
  • Registry: eli-devops/tf/modules/global/registry/

Changelog

January 27, 2026

  • Comprehensive Budget Alerts documentation - Added full architecture diagrams, deduplication logic explanation, Terraform configuration details, and troubleshooting guide.

January 25, 2026

  • Added Dev Cloud Run costs - Added missing Dev environment Cloud Run section ($5-8/day) and updated Quick Summary totals ($76/day).
  • Reorganized optimization section - Added "Future Optimizations to Explore" (image-analysis CPU) and "Not Recommended" sections with clear reasoning.

January 24, 2026

  • image-analysis-us minScale reduced from 2 to 1 - Peak usage is ~54 requests/hour, 1 instance is sufficient. Saves ~$7/day.
  • Excluded health_data from Datastream - 331 GB Terra wearable data no longer synced to BigQuery, reducing MERGE costs by ~$24/day.
  • Fixed staging Cloud Run scaling - Corrected minScale and cpu_idle settings that had drifted.
  • Applied Artifact Registry cleanup policies - Dev registry will auto-delete old images (was 2,044 images).