Infrastructure Costs

Living Document - Last updated: January 27, 2026

This document tracks GCP infrastructure costs across all Eli Health projects. Review periodically to identify optimization opportunities.

Quick Summary

Category	Estimated Daily	Estimated Monthly
Cloud SQL (all envs)	~$34	~$1,020
Cloud Run (production)	~$13	~$390
Cloud Run (staging)	~$5	~$150
Cloud Run (dev)	~$8	~$240
Datastream (all envs)	~$9	~$270
BigQuery	~$3	~$90
Artifact Registry	~$2	~$60
Compute (bastions)	~$2	~$60
Total	~$76	~$2,280
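The monthly column appears to be the daily estimate multiplied by 30. The sketch below uses only the figures from the table above so the totals can be re-checked whenever a row changes. The 30-day month is an assumption inferred from the numbers, not a documented convention.

```python
# Daily estimates (USD) copied from the Quick Summary table.
DAILY_ESTIMATES = {
    "Cloud SQL (all envs)": 34,
    "Cloud Run (production)": 13,
    "Cloud Run (staging)": 5,
    "Cloud Run (dev)": 8,
    "Datastream (all envs)": 9,
    "BigQuery": 3,
    "Artifact Registry": 2,
    "Compute (bastions)": 2,
}

DAYS_PER_MONTH = 30  # assumption: the table's monthly figures equal daily * 30


def monthly_estimate(daily_usd: float, days: int = DAYS_PER_MONTH) -> float:
    """Scale a daily estimate to a monthly one."""
    return daily_usd * days


total_daily = sum(DAILY_ESTIMATES.values())
total_monthly = monthly_estimate(total_daily)

if __name__ == "__main__":
    for category, daily in DAILY_ESTIMATES.items():
        print(f"{category:<24} ~${daily}/day  ~${monthly_estimate(daily):,.0f}/month")
    print(f"{'Total':<24} ~${total_daily}/day  ~${total_monthly:,.0f}/month")
```

With the current rows this reproduces the ~$76/day and ~$2,280/month totals shown in the table.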

Detailed Breakdown by Service

Cloud SQL Databases

Cloud SQL is the biggest fixed cost. Each environment has a dedicated PostgreSQL instance.

Project	Instance	Tier	Disk	Region	Est. Daily
Production	postgres-instance-03e85fad-us	db-g1-small	49 GB	us-east1	~$12
Staging	postgres-instance-2b4c7291-us	db-g1-small	10 GB	us-east1	~$10
Dev	postgres-instance-fc27f365-ca	db-g1-small	43 GB	northamerica-northeast1	~$12

Cost drivers:

Instance uptime (24/7)
Storage (charged per GB)
Network egress

Cloud Run Services

Production (eli-health-prod)

Service	CPU	Memory	minScale	maxScale	Est. Daily	Notes
image-analysis-us	8	8Gi	1	30	~$8	Always running 8 vCPUs
api-service-us	4	4Gi	1	20	~$4	Normal for API
kpi	1	1Gi	1	1	~$1	Dashboard
alert-summarizer	0.33	512Mi	0	1	minimal	Scales to zero
appstore-webhook	0.17	256Mi	0	1	minimal	Scales to zero
syncfirebaseauthtobigquery	0.33	512Mi	0	1	minimal	Scales to zero

Key insight: image-analysis-us runs with minScale=1 at 8 vCPUs, which keeps one warm instance for fast responses. Peak usage is ~54 requests/hour, well within the capacity of a single instance.

Staging (eli-health-staging)

Service	CPU	Memory	Est. Daily
api-service-us	4	4Gi	~$2
image-analysis-us	4	4Gi	~$2
alert-summarizer	0.33	512Mi	minimal

Staging uses minScale=0 and cpu_idle=true for all services.

Development (eli-health-dev)

Service	CPU	Memory	minScale	Est. Daily
api-service-ca	4	4Gi	0	~$2-4
image-analysis-ca	4	4Gi	0	~$2-4
kpi	1	1Gi	1	~$1
docs	1	512Mi	0	minimal
qa	1	512Mi	0	minimal
alert-summarizer-development	0.33	512Mi	0	minimal

Dev uses minScale=0 and cpu_idle=true for most services. The kpi dashboard has minScale=1. Actual costs vary based on QA testing activity (~$5-8/day total).

Datastream (PostgreSQL to BigQuery CDC)

Real-time data replication from PostgreSQL to BigQuery.

Project	Location	Status	Tables Synced	Est. Daily
Production	us-east1	RUNNING	15 tables	~$3
Staging	us-east1	RUNNING	15 tables	~$3
Dev	northamerica-northeast1	RUNNING	15 tables	~$3

Tables being synced:

health_goal, health_goal_lookup, health_tag, health_tag_type
heart_rate_spike_log, measure_daily_curve, migrations_history
period, reading, record, update_email, user
user_connections, user_health_tag, wakeup_time

Excluded tables:

health_data - Terra wearable data (331 GB). Excluded to avoid expensive BigQuery MERGE operations.

BigQuery

Storage Costs

Dataset	Size	Monthly Cost	Notes
eli_health_biometricspublic	331 GB	~$6.60	Mostly health_data table
analytics (Firebase)	5.7 GB	~$0.11	GA4 export
All others	minimal	~$0.02	Shopify, Klaviyo, etc.

Table sizes in eli_health_biometricspublic:

Table	Size	Rows
health_data	331.37 GB	2,479,787
record	0.01 GB	18,549
reading	0.01 GB	23,814
All others	minimal	Various

Query costs:

On-demand pricing: $6.25 per TB scanned
Current estimate: ~$1-3/day
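To relate the on-demand rate to the daily estimate, here is a minimal sketch of the cost arithmetic. It assumes the quoted "TB" is billed as a binary tebibyte (2^40 bytes) and ignores any free tier or per-query minimums. The function names are illustrative and not part of any existing tooling.

```python
ON_DEMAND_USD_PER_TIB = 6.25  # on-demand rate quoted above
BYTES_PER_TIB = 2**40         # assumption: "TB" means a binary tebibyte
BYTES_PER_GIB = 2**30


def query_cost_usd(bytes_scanned: int) -> float:
    """Estimated on-demand cost of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TIB * ON_DEMAND_USD_PER_TIB


def gib_scanned_for_budget(daily_usd: float) -> float:
    """GiB that can be scanned per day for a given daily spend."""
    return daily_usd / ON_DEMAND_USD_PER_TIB * (BYTES_PER_TIB / BYTES_PER_GIB)


if __name__ == "__main__":
    # One full scan of health_data (331.37 GB, treated here as GiB).
    full_scan = query_cost_usd(int(331.37 * BYTES_PER_GIB))
    print(f"Full scan of health_data: ~${full_scan:.2f}")
    for usd in (1, 3):
        print(f"${usd}/day buys ~{gib_scanned_for_budget(usd):.0f} GiB scanned")
```

Under these assumptions, the ~$1-3/day estimate corresponds to roughly 160 to 490 GiB scanned per day, and a single unfiltered scan of health_data costs about $2.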

Artifact Registry

Docker image storage.

Project	Images	Est. Daily	Notes
Dev	~2,000	~$8	Cleanup policy active
Staging	~120	~$0.50	Normal
Production	~225	~$1	Normal

Cleanup policies applied:

Delete untagged images after 7 days
Keep images tagged: latest, production, staging, development, dev, prod
Delete other tagged images after 90 days
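The three rules above can be read as a single decision per image. The sketch below models that decision in Python for reasoning about which images survive. It is not how Artifact Registry evaluates policies internally, and `should_delete` is a hypothetical helper. It assumes the keep rule wins whenever an image carries at least one protected tag.

```python
from datetime import datetime, timedelta, timezone

# Tags listed in the keep rule above.
PROTECTED_TAGS = {"latest", "production", "staging", "development", "dev", "prod"}
UNTAGGED_MAX_AGE = timedelta(days=7)
TAGGED_MAX_AGE = timedelta(days=90)


def should_delete(tags: set[str], uploaded_at: datetime, now: datetime) -> bool:
    """Return True if the cleanup rules would remove this image."""
    age = now - uploaded_at
    if tags & PROTECTED_TAGS:
        return False  # assumption: keep rule takes precedence over delete rules
    if not tags:
        return age > UNTAGGED_MAX_AGE
    return age > TAGGED_MAX_AGE


if __name__ == "__main__":
    now = datetime(2026, 1, 25, tzinfo=timezone.utc)
    print(should_delete(set(), now - timedelta(days=8), now))         # untagged, 8 days old
    print(should_delete({"latest"}, now - timedelta(days=400), now))  # protected tag
    print(should_delete({"abc123"}, now - timedelta(days=30), now))   # other tag, 30 days old
```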

Compute Engine (Bastions)

SSH tunnels for Datastream to access Cloud SQL.

Project	Instance	Type	Est. Daily
Production	sql-bastion-host	e2-micro	~$0.20
Staging	sql-bastion-host	e2-micro	~$0.20
Dev	sql-bastion-host	e2-micro	~$0.20

These instances are required for Datastream connectivity and cost very little.

Optimization Opportunities

Future Optimizations to Explore

Item	Current	Consideration
health_data in BigQuery	331 GB stored	Delete if not needed for analytics (~$6.60/month)
image-analysis CPU	8 vCPUs (prod)	Evaluate if lower CPU (4 vCPUs) maintains acceptable latency

Already Optimized

Bastion instances are already e2-micro (cheapest)
Most Cloud Run services scale to zero
Artifact Registry has cleanup policies to auto-delete old images
image-analysis minScale reduced from 2 to 1 (Jan 2026)

Not Recommended

Optimization	Why
Stop dev/staging databases	Dev and staging have continuous QA testing. Testers and developers need these available at all times. Manual restarts take 3-5 minutes and disrupt workflows.
Downgrade Cloud SQL to db-f1-micro	Would reduce RAM from 1.7 GB to 0.6 GB. Databases idle at ~0.8 GB due to connection pooling - f1-micro would cause out-of-memory crashes.

Monitoring & Alerts

Billing Export

Enabled: Standard and Detailed usage cost export
Dataset: eli-health-prod.gcp_billing_export
Region: northamerica-northeast1

Budget Alerts System

Automated Slack notifications when GCP spending crosses budget thresholds.

Architecture

Budget Configuration

Environment	Project	Budget (USD)	Thresholds
Development	eli-health-dev	$500	50%, 90%, 100%, 120%
Staging	eli-health-staging	$500	50%, 90%, 100%, 120%
Production	eli-health-prod	$2,000	50%, 90%, 100%, 120%

All budgets publish to a single Pub/Sub topic in eli-health-dev. This centralizes alerting infrastructure while monitoring all three projects.
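For reference, the percentage thresholds translate into the following dollar amounts. This is a small sketch built from the budget table above. The dictionary and function names are illustrative only.

```python
# Budgets (USD) and thresholds (percent) copied from the table above.
BUDGETS_USD = {
    "eli-health-dev": 500,
    "eli-health-staging": 500,
    "eli-health-prod": 2000,
}
THRESHOLD_PERCENTS = (50, 90, 100, 120)


def threshold_amounts(budget_usd: float) -> dict[int, float]:
    """Map each threshold percent to the spend (USD) at which it is crossed."""
    return {pct: budget_usd * pct / 100 for pct in THRESHOLD_PERCENTS}


if __name__ == "__main__":
    for project, budget in BUDGETS_USD.items():
        amounts = ", ".join(
            f"{pct}% = ${usd:,.0f}" for pct, usd in threshold_amounts(budget).items()
        )
        print(f"{project}: {amounts}")
```

For example, the production budget crosses its thresholds at $1,000, $1,800, $2,000 and $2,400.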

Smart Deduplication

The Cloud Function deduplicates notifications to prevent alert spam. GCP re-sends a notification every 30 minutes while a threshold is exceeded, so without deduplication the same alert would repeat in Slack.

Rules:

New billing period → Always alert (monthly reset)
Higher threshold → Alert (50% → 90% → 100% → 120%)
Same or lower threshold → Skip (prevents spam)

State is stored in GCS at: gs://eli-health-dev-billing-alerter-source/billing-alerts/{budget-name}.json
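The rules above can be sketched as a single pure function. This is an illustration, not the code in function/main.py. The state keys `last_threshold` and `billing_period` come from the state file described in Troubleshooting, but the value formats (a fractional threshold and a "YYYY-MM" period string) and the treatment of a missing state file are assumptions.

```python
from typing import Optional, TypedDict


class AlertState(TypedDict):
    last_threshold: float  # assumption: stored as a fraction, e.g. 0.9 for 90%
    billing_period: str    # assumption: a string such as "2026-01"


def should_alert(
    previous: Optional[AlertState], billing_period: str, threshold: float
) -> bool:
    """Decide whether a budget notification should be posted to Slack."""
    if previous is None:
        return True  # assumption: no stored state is treated like a new period
    if previous["billing_period"] != billing_period:
        return True  # new billing period: always alert
    # Same period: alert only when a higher threshold has been crossed.
    return threshold > previous["last_threshold"]


if __name__ == "__main__":
    state: AlertState = {"last_threshold": 0.9, "billing_period": "2026-01"}
    print(should_alert(state, "2026-01", 1.0))  # higher threshold
    print(should_alert(state, "2026-01", 0.9))  # repeat of the same threshold
    print(should_alert(state, "2026-02", 0.5))  # new billing period
```

Keeping the decision free of I/O means the GCS read and write can stay in a thin wrapper, which also makes the rules easy to unit test.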

Slack Message Format

Messages use Slack Block Kit with severity indicators:

Threshold	Emoji	Status Text
50%	📊 `:bar_chart:`	"50%"
90%	📈 `:chart_with_upwards_trend:`	"90% - Approaching budget"
100%	⚠️ `:warning:`	"100% - AT BUDGET"
120%+	🚨 `:rotating_light:`	"120% - OVER BUDGET"

Message includes:

Environment name (Production/Staging/Development)
Current spend vs budget amount
Remaining budget
Usage percentage
"View Billing Console" button
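As an illustration of how such a message could be assembled, the sketch below maps a threshold to the emoji and status text from the table and builds a Block Kit payload with the listed fields. The function names, block layout and field labels are assumptions. The actual handler may structure its blocks differently.

```python
def severity_for(threshold_percent: float) -> tuple[str, str]:
    """Return (emoji shortcode, status text) for the threshold that was crossed."""
    if threshold_percent >= 120:
        return ":rotating_light:", "120% - OVER BUDGET"
    if threshold_percent >= 100:
        return ":warning:", "100% - AT BUDGET"
    if threshold_percent >= 90:
        return ":chart_with_upwards_trend:", "90% - Approaching budget"
    return ":bar_chart:", "50%"


def build_blocks(
    environment: str,
    spend_usd: float,
    budget_usd: float,
    threshold_percent: float,
    console_url: str,
) -> list[dict]:
    """Build a Slack Block Kit payload (hypothetical layout)."""
    emoji, status = severity_for(threshold_percent)
    usage = spend_usd / budget_usd * 100
    remaining = budget_usd - spend_usd
    return [
        {
            "type": "header",
            "text": {"type": "plain_text", "text": f"{emoji} {environment}: {status}", "emoji": True},
        },
        {
            "type": "section",
            "fields": [
                {"type": "mrkdwn", "text": f"*Spend:*\n${spend_usd:,.2f} of ${budget_usd:,.2f}"},
                {"type": "mrkdwn", "text": f"*Remaining:*\n${remaining:,.2f}"},
                {"type": "mrkdwn", "text": f"*Usage:*\n{usage:.1f}%"},
            ],
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "View Billing Console"},
                    "url": console_url,
                }
            ],
        },
    ]
```

A call such as `build_blocks("Production", 1800.0, 2000.0, 90, console_url)` yields a header with the 90% status text, a remaining budget of $200.00, and a link button to the billing console.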

Terraform Configuration

Module: eli-devops/tf/modules/global/billing-alerter/

Files:

File	Purpose
`main.tf`	Pub/Sub topic, Cloud Function (2nd gen), IAM
`variables.tf`	`project_id`, `region`, `enabled`, `slack_channel`
`outputs.tf`	`pubsub_topic_id`, `function_url`, `service_account_email`
`function/main.py`	Python handler with deduplication logic
`function/test_main.py`	30 unit tests

Variables in tf/variables.tf:

billing_alerter_enabled         = true
billing_alerter_slack_channel   = "alerts-billing"
billing_alerter_pubsub_topic_id = "projects/eli-health-dev/topics/billing-alerts"

Wiring in tf/main.tf:

module "billing_alerter" {
  source        = "./modules/global/billing-alerter"
  enabled       = var.billing_alerter_enabled
  project_id    = var.project_id
  region        = var.region
  slack_channel = var.billing_alerter_slack_channel
}

module "billing_budget" {
  # ... existing config ...
  pubsub_topic_id = var.billing_alerter_pubsub_topic_id
}

Adding/Modifying Budgets

To change budget amounts:

Edit eli-devops/tf/{environment}.tfvars
Modify billing_budget_amount variable
Run terraform apply

To add a new environment:

Create budget module in new environment's Terraform
Set pubsub_topic_id = "projects/eli-health-dev/topics/billing-alerts"
The existing Cloud Function will handle notifications

Unit Tests

IMPORTANT: Run unit tests before deploying changes!

cd eli-devops/tf/modules/global/billing-alerter/function
python3 -m pytest test_main.py -v

Tests cover:

Severity emoji selection
Threshold text formatting
Environment extraction from budget names
Slack message structure
Deduplication logic (new period, higher threshold, duplicates)
Slack API posting

Troubleshooting

No alerts arriving:

Check Cloud Function logs: gcloud functions logs read billing-alerter --project=eli-health-dev
Verify Pub/Sub subscription exists
Confirm budget has pubsub_topic configured

Duplicate alerts:

Check GCS state: gsutil cat 'gs://eli-health-dev-billing-alerter-source/billing-alerts/{budget-name}.json'
State should show last_threshold and billing_period
Reset by deleting state file if needed

Permission errors:

Service account needs roles/secretmanager.secretAccessor for Slack token
Service account needs GCS access to state bucket
Pub/Sub service account needs roles/run.invoker on Cloud Function

Budget Alert (Legacy)

Budget: $2,500 CAD/month (account-level)
Account: 016E4B-83DE60-189CD9
Note: This is the legacy account-level budget. Project-level budgets above provide more granular control.

How to Check Current Costs

GCP Console: https://console.cloud.google.com/billing/016E4B-83DE60-189CD9

BigQuery (after 24-48 hours of data):

SELECT
  service.description,
  SUM(cost) as total_cost
FROM `eli-health-prod.gcp_billing_export.gcp_billing_export_v1_*`
WHERE DATE(_PARTITIONTIME) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY service.description
ORDER BY total_cost DESC

Quick Reference

Projects

eli-health-prod - Production
eli-health-staging - Staging/QA
eli-health-dev - Development

Regions

Production and Staging: us-east1
Dev: northamerica-northeast1

Key Terraform Files

Cloud Run: eli-devops/tf/modules/regional/backend-compute/
Cloud SQL: eli-devops/tf/modules/regional/storage/
Datastream: eli-devops/tf/modules/regional/datastream/
Registry: eli-devops/tf/modules/global/registry/

Changelog

January 27, 2026

Comprehensive Budget Alerts documentation - Added full architecture diagrams, deduplication logic explanation, Terraform configuration details, and troubleshooting guide.

January 25, 2026

Added Dev Cloud Run costs - Added missing Dev environment Cloud Run section (~$5-8/day) and updated Quick Summary totals (~$76/day).
Reorganized optimization section - Added "Future Optimizations to Explore" (image-analysis CPU) and "Not Recommended" sections with clear reasoning.

January 24, 2026

image-analysis-us minScale reduced from 2 to 1 - Peak usage is ~54 requests/hour, 1 instance is sufficient. Saves ~$7/day.
Excluded health_data from Datastream - 331 GB Terra wearable data no longer synced to BigQuery, reducing MERGE costs by ~$24/day.
Fixed staging Cloud Run scaling - Corrected minScale and cpu_idle settings that had drifted.
Applied Artifact Registry cleanup policies - Dev registry will auto-delete old images (was 2,044 images).
