
Infrastructure Architecture

Cloud Infrastructure Overview

Eli Health runs on Google Cloud Platform with the following key services:

| Service | Technology | Purpose |
| --- | --- | --- |
| Compute | Cloud Run | Serverless container hosting |
| Database | Cloud SQL (PostgreSQL) | Primary data storage |
| Storage | Cloud Storage | Image and file storage |
| CDN | Cloud CDN | Static asset delivery |
| Load Balancing | Google Load Balancer | Traffic distribution |
| Monitoring | Cloud Monitoring + Sentry | Observability and error tracking |

Cloud Run Architecture

Service Layout

Services

| Service | Domain | Min Instances | Max Instances | Concurrency |
| --- | --- | --- | --- | --- |
| Backend API | api.eli.health | 3 | 20 | 20 |
| HAE (Image Analysis) | internal | 10 | 25 | 3 |
| KPI Dashboard | kpi.eli.health | 1 | 5 | 80 |
| Documentation | docs.eli.health | 1 | 3 | 80 |

Performance Validation: 100 Concurrent Users

Executive Summary

We conducted extensive stress testing to validate that the infrastructure can handle 100 concurrent users submitting hormone test readings. Key findings:

  • Current stable capacity: 30 concurrent image analyses
  • Bottleneck identified: HAE service with concurrency bug
  • Recommendation: Scale HAE instances rather than concurrency until bug is fixed

HAE Bottleneck Analysis

The Hormone Analysis Engine (HAE) is the critical bottleneck in the system. Each image takes approximately 10-12 seconds to process.

Test Results

| Configuration | HAE Instances | Concurrency/Instance | Total Capacity | Result |
| --- | --- | --- | --- | --- |
| Conservative | 3 | 2 | 6 | ✅ Stable, 117s for 50 images |
| Moderate | 10 | 3 | 30 | ✅ Recommended, ~40s for 100 images |
| Aggressive | 20 | 5 | 100 | ❌ HAE crashes with Python errors |

Critical Bug Discovered

When HAE concurrency exceeds 3, a Python error occurs:

# Error at /app/api/app_analysis.py:307
UnboundLocalError: cannot access local variable 'payload' where it is not associated with a value

Impact: Maximum stable concurrency per HAE instance is 3 until bug is fixed.
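The failure mode matches a common Python pattern: a variable bound only on one branch of a conditional, then referenced unconditionally. The snippet below is a hypothetical minimal reproduction of that pattern, not the actual HAE code at app_analysis.py:307:

```python
# Hypothetical minimal reproduction of the HAE failure mode: 'payload'
# is only bound on the success branch, so any other path reaches the
# return statement with the variable unassigned.
def analyze(response_ok: bool):
    if response_ok:
        payload = {"status": "done"}
    # If the branch above was skipped, 'payload' was never bound:
    return payload

try:
    analyze(False)
except UnboundLocalError as exc:
    print(f"UnboundLocalError: {exc}")
```

Under load, error paths (timeouts, upstream failures) are hit far more often, which is consistent with the bug only surfacing at higher concurrency.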

Performance Formula

Processing Time = (Number of Images ÷ HAE Capacity) × 12 seconds

Examples:
- 100 images with 30 capacity = (100 ÷ 30) × 12 = ~40 seconds
- 100 images with 100 capacity = (100 ÷ 100) × 12 = ~12 seconds
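The formula can be expressed directly (a sketch; 12 s is the upper end of the observed 10-12 s per-image time):

```python
SECONDS_PER_IMAGE = 12  # upper end of the observed 10-12s per image

def processing_time(num_images: int, hae_capacity: int) -> float:
    """Estimated wall-clock time per the formula above."""
    return num_images * SECONDS_PER_IMAGE / hae_capacity

print(processing_time(100, 30))   # -> 40.0 seconds
print(processing_time(100, 100))  # -> 12.0 seconds
```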

Recommendations for 100 Concurrent Users

Option 1: Fix HAE Bug (Best Solution)

  • Fix Python error in HAE code
  • Scale to: 20 instances × 5 concurrency = 100 capacity
  • Expected time: ~12-15 seconds for all 100 images

Option 2: Scale Instances (Current Approach)

  • Use: 34 instances × 3 concurrency = 102 capacity
  • Expected time: ~12-15 seconds for all 100 images
  • More expensive but avoids the concurrency bug

Option 3: Accept Batching

  • Use: 10 instances × 3 concurrency = 30 capacity
  • Expected time: ~40-50 seconds for all 100 images
  • Most cost-effective stable configuration
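Applying the performance formula to each option gives a quick numeric comparison (instance and concurrency figures taken from the bullets above):

```python
SECONDS_PER_IMAGE = 12
USERS = 100

# (instances, concurrency) per option, from the recommendations above
options = {
    "Option 1 (bug fixed)": (20, 5),
    "Option 2 (scale out)": (34, 3),
    "Option 3 (batching)":  (10, 3),
}

for name, (instances, concurrency) in options.items():
    capacity = instances * concurrency
    est = USERS * SECONDS_PER_IMAGE / capacity
    print(f"{name}: capacity={capacity}, ~{est:.0f}s for {USERS} images")
```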

Cloud Run Auto-Scaling Behavior

Scaling Timeline (50 Concurrent Requests)

Time 0s:    All 50 requests initiated simultaneously
├─ 0-4s:   First 28 requests complete (warm instances)
├─ 4-10s:  Scaling triggered, new instances starting
├─ 20-35s: Remaining requests complete (after cold start)
└─ Total:  ~30-35 seconds for all 50 requests

Scaling Characteristics

| Metric | Value |
| --- | --- |
| Instance capacity | ~28 concurrent requests before scaling |
| Cold start penalty | 20-28 seconds additional latency |
| Scaling trigger delay | ~4 seconds after queue detected |
| Success rate | 100% (no dropped requests) |
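A back-of-the-envelope model built from these measured characteristics reproduces the observed burst timing (a sketch of this test's behavior, not a general Cloud Run model):

```python
# Constants are the measured values from this stress test.
WARM_CAPACITY = 28       # requests served immediately by warm instances
WARM_LATENCY_S = 4       # latency of a warm request
TRIGGER_DELAY_S = 4      # delay before scaling kicks in
COLD_START_S = (20, 28)  # cold start penalty range

def burst_completion_s(n_requests: int) -> tuple[int, int]:
    """Estimated (best, worst) completion time for a simultaneous burst."""
    if n_requests <= WARM_CAPACITY:
        # Everything fits on warm instances; no scaling needed.
        return (WARM_LATENCY_S, WARM_LATENCY_S)
    # Overflow requests wait for the scaling trigger plus a cold start,
    # then take a normal request's processing time.
    lo = TRIGGER_DELAY_S + COLD_START_S[0] + WARM_LATENCY_S
    hi = TRIGGER_DELAY_S + COLD_START_S[1] + WARM_LATENCY_S
    return (lo, hi)

print(burst_completion_s(50))  # -> (28, 36), in line with the observed ~30-35s
```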

Scaling Visualization

Current Terraform Configuration

API Backend (Production)

module "backend_compute" {
  scaling = {
    min_instance_count      = 3
    max_instance_count      = 20
    max_request_concurrency = 20
  }

  resources = {
    cpu    = "1"
    memory = "512Mi"
  }
}

HAE Image Analysis (Production)

module "image_analysis_compute" {
  scaling = {
    min_instance_count      = 10 # Scale up for 100 users
    max_instance_count      = 25 # Allow auto-scaling
    max_request_concurrency = 3  # Cannot exceed 3 due to bug
  }

  resources = {
    cpu    = "2"
    memory = "2Gi"
  }
}

Stress Testing Tools

Stress testing tools are available in the eli-devops repository under /stress-testing/.

Quick Start

cd eli-devops/stress-testing
npm install

# Single reading test
node send-reading.js --env staging --hormone cortisol --wait

# Parallel load test (50 concurrent)
node parallel-reading-test.js --env staging --count 50 --wait
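For illustration, a minimal Python analogue of the parallel test might look like the sketch below. `send_reading` here is a stand-in stub, not the real HTTP submission performed by parallel-reading-test.js:

```python
# Fire N readings concurrently and report latencies, mirroring the
# shape of parallel-reading-test.js. send_reading is a placeholder
# for the real HTTP round trip to the staging environment.
import time
from concurrent.futures import ThreadPoolExecutor

def send_reading(i: int) -> float:
    start = time.monotonic()
    time.sleep(0.01)  # placeholder for the HTTP round trip
    return time.monotonic() - start

COUNT = 50
with ThreadPoolExecutor(max_workers=COUNT) as pool:
    latencies = list(pool.map(send_reading, range(COUNT)))

print(f"sent {len(latencies)} readings, max latency {max(latencies):.3f}s")
```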

Available Tools

| Tool | Purpose |
| --- | --- |
| send-reading.js | Send a single test reading |
| parallel-reading-test.js | Load test with concurrent readings |

See /eli-devops/stress-testing/README.md for full documentation.

Monitoring and Alerts

Key Metrics to Watch

| Metric | Warning Threshold | Critical Threshold |
| --- | --- | --- |
| HAE Response Time | > 15s | > 30s |
| HAE Error Rate | > 1% | > 5% |
| Cloud Run Instance Count | > 15 | > 20 |
| Queue Depth | > 50 | > 100 |
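A threshold check based on this table could be sketched as follows (the metric names are illustrative, not actual Cloud Monitoring metric IDs):

```python
# (warning, critical) thresholds from the table above.
THRESHOLDS = {
    "hae_response_time_s": (15, 30),
    "hae_error_rate_pct":  (1, 5),
    "instance_count":      (15, 20),
    "queue_depth":         (50, 100),
}

def severity(metric: str, value: float) -> str:
    """Classify a metric reading as ok / warning / critical."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"

print(severity("hae_response_time_s", 18))  # -> warning
print(severity("queue_depth", 120))         # -> critical
```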

Dashboards

  • Cloud Run Dashboard: Instance health and scaling
  • Sentry: Application errors and performance
  • Custom KPI: Business metrics and user analytics

Future Improvements

  1. Fix HAE Concurrency Bug: Enable higher concurrency per instance
  2. Queue-Based Processing: Implement async processing with status polling
  3. Pre-warming: Auto-scale before predicted traffic spikes
  4. Caching: Cache repeated image analysis results
  5. Regional Deployment: Add regions for lower latency