
Infrastructure Architecture

Cloud Infrastructure Overview

Eli Health runs on Google Cloud Platform with the following key services:

| Service | Technology | Purpose |
| --- | --- | --- |
| Compute | Cloud Run | Serverless container hosting |
| Database | Cloud SQL (PostgreSQL) | Primary data storage |
| Storage | Cloud Storage | Image and file storage |
| CDN | Cloud CDN | Static asset delivery |
| Load Balancing | Google Load Balancer | Traffic distribution |
| Monitoring | Cloud Monitoring + Sentry | Observability and error tracking |

Cloud Run Architecture

Service Layout

Services

| Service | Domain | Min Instances | Max Instances | Concurrency |
| --- | --- | --- | --- | --- |
| Backend API | api.eli.health | 3 | 20 | 20 |
| HAE (Image Analysis) | internal | 10 | 25 | 3 |
| KPI Dashboard | kpi.eli.health | 1 | 5 | 80 |
| Documentation | docs.eli.health | 1 | 3 | 80 |

Performance Validation: 100 Concurrent Users

Executive Summary

We conducted extensive stress testing to validate that the infrastructure can handle 100 concurrent users submitting hormone test readings. Key findings:

  • Current stable capacity: 30 concurrent image analyses
  • Bottleneck identified: HAE service with concurrency bug
  • Recommendation: Scale HAE instances rather than concurrency until bug is fixed

HAE Bottleneck Analysis

The Hormone Analysis Engine (HAE) is the critical bottleneck in the system. Each image takes approximately 10-12 seconds to process.

Test Results

| Configuration | HAE Instances | Concurrency/Instance | Total Capacity | Result |
| --- | --- | --- | --- | --- |
| Conservative | 3 | 2 | 6 | ✅ Stable, 117s for 50 images |
| Moderate | 10 | 3 | 30 | ✅ Recommended, ~40s for 100 images |
| Aggressive | 20 | 5 | 100 | ❌ HAE crashes with Python errors |

Critical Bug Discovered

When HAE concurrency exceeds 3, a Python error occurs:

# Error at /app/api/app_analysis.py:307
UnboundLocalError: cannot access local variable 'payload' where it is not associated with a value

Impact: Maximum stable concurrency per HAE instance is 3 until bug is fixed.
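The failure mode matches a common Python pattern: a variable bound only on one branch of a conditional, then referenced unconditionally. The snippet below is a hypothetical minimal reproduction of that pattern, not the actual HAE code at app_analysis.py:307:

```python
# Hypothetical minimal reproduction of the HAE failure mode: 'payload'
# is only bound on the success branch, so any other path reaches the
# return statement with the variable unassigned.
def analyze(response_ok: bool):
    if response_ok:
        payload = {"status": "done"}
    # If the branch above was skipped, 'payload' was never bound:
    return payload

try:
    analyze(False)
except UnboundLocalError as exc:
    print(f"UnboundLocalError: {exc}")
```

Under load, error paths (timeouts, upstream failures) are hit far more often, which is consistent with the bug only surfacing at higher concurrency.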

Performance Formula

Processing Time = (Number of Images ÷ HAE Capacity) × 12 seconds

Examples:
- 100 images with 30 capacity = (100 ÷ 30) × 12 = ~40 seconds
- 100 images with 100 capacity = (100 ÷ 100) × 12 = ~12 seconds
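The formula can be expressed directly (a sketch; 12 s is the upper end of the observed 10-12 s per-image time):

```python
SECONDS_PER_IMAGE = 12  # upper end of the observed 10-12s per image

def processing_time(num_images: int, hae_capacity: int) -> float:
    """Estimated wall-clock time per the formula above."""
    return num_images * SECONDS_PER_IMAGE / hae_capacity

print(processing_time(100, 30))   # -> 40.0 seconds
print(processing_time(100, 100))  # -> 12.0 seconds
```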

Recommendations for 100 Concurrent Users

Option 1: Fix HAE Bug (Best Solution)

  • Fix Python error in HAE code
  • Scale to: 20 instances × 5 concurrency = 100 capacity
  • Expected time: ~12-15 seconds for all 100 images

Option 2: Scale Instances (Current Approach)

  • Use: 34 instances × 3 concurrency = 102 capacity
  • Expected time: ~12-15 seconds for all 100 images
  • More expensive but avoids the concurrency bug

Option 3: Accept Batching

  • Use: 10 instances × 3 concurrency = 30 capacity
  • Expected time: ~40-50 seconds for all 100 images
  • Most cost-effective stable configuration
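Applying the performance formula to each option gives a quick numeric comparison (instance and concurrency figures taken from the bullets above):

```python
SECONDS_PER_IMAGE = 12
USERS = 100

# (instances, concurrency) per option, from the recommendations above
options = {
    "Option 1 (bug fixed)": (20, 5),
    "Option 2 (scale out)": (34, 3),
    "Option 3 (batching)":  (10, 3),
}

for name, (instances, concurrency) in options.items():
    capacity = instances * concurrency
    est = USERS * SECONDS_PER_IMAGE / capacity
    print(f"{name}: capacity={capacity}, ~{est:.0f}s for {USERS} images")
```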

Cloud Run Auto-Scaling Behavior

Scaling Timeline (50 Concurrent Requests)

Time 0s:    All 50 requests initiated simultaneously
├─ 0-4s:   First 28 requests complete (warm instances)
├─ 4-10s:  Scaling triggered, new instances starting
├─ 20-35s: Remaining requests complete (after cold start)
└─ Total:  ~30-35 seconds for all 50 requests

Scaling Characteristics

| Metric | Value |
| --- | --- |
| Instance capacity | ~28 concurrent requests before scaling |
| Cold start penalty | 20-28 seconds additional latency |
| Scaling trigger delay | ~4 seconds after queue detected |
| Success rate | 100% (no dropped requests) |
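A back-of-the-envelope model built from these measured characteristics reproduces the observed burst timing (a sketch of this test's behavior, not a general Cloud Run model):

```python
# Constants are the measured values from this stress test.
WARM_CAPACITY = 28       # requests served immediately by warm instances
WARM_LATENCY_S = 4       # latency of a warm request
TRIGGER_DELAY_S = 4      # delay before scaling kicks in
COLD_START_S = (20, 28)  # cold start penalty range

def burst_completion_s(n_requests: int) -> tuple[int, int]:
    """Estimated (best, worst) completion time for a simultaneous burst."""
    if n_requests <= WARM_CAPACITY:
        # Everything fits on warm instances; no scaling needed.
        return (WARM_LATENCY_S, WARM_LATENCY_S)
    # Overflow requests wait for the scaling trigger plus a cold start,
    # then take a normal request's processing time.
    lo = TRIGGER_DELAY_S + COLD_START_S[0] + WARM_LATENCY_S
    hi = TRIGGER_DELAY_S + COLD_START_S[1] + WARM_LATENCY_S
    return (lo, hi)

print(burst_completion_s(50))  # -> (28, 36), in line with the observed ~30-35s
```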

Scaling Visualization

Current Terraform Configuration

API Backend (Production)

module "backend_compute" {
  scaling = {
    min_instance_count      = 3
    max_instance_count      = 20
    max_request_concurrency = 20
  }

  resources = {
    cpu    = "1"
    memory = "512Mi"
  }
}

HAE Image Analysis (Production)

module "image_analysis_compute" {
  scaling = {
    min_instance_count      = 10 # Scale up for 100 users
    max_instance_count      = 25 # Allow auto-scaling
    max_request_concurrency = 3  # Cannot exceed 3 due to bug
  }

  resources = {
    cpu    = "2"
    memory = "2Gi"
  }
}

Stress Testing Tools

Stress testing tools are available in the eli-devops repository under /stress-testing/.

Quick Start

cd eli-devops/stress-testing
npm install

# Single reading test
node send-reading.js --env staging --hormone cortisol --wait

# Parallel load test (50 concurrent)
node parallel-reading-test.js --env staging --count 50 --wait
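For illustration, a minimal Python analogue of the parallel test might look like the sketch below. `send_reading` here is a stand-in stub, not the real HTTP submission performed by parallel-reading-test.js:

```python
# Fire N readings concurrently and report latencies, mirroring the
# shape of parallel-reading-test.js. send_reading is a placeholder
# for the real HTTP round trip to the staging environment.
import time
from concurrent.futures import ThreadPoolExecutor

def send_reading(i: int) -> float:
    start = time.monotonic()
    time.sleep(0.01)  # placeholder for the HTTP round trip
    return time.monotonic() - start

COUNT = 50
with ThreadPoolExecutor(max_workers=COUNT) as pool:
    latencies = list(pool.map(send_reading, range(COUNT)))

print(f"sent {len(latencies)} readings, max latency {max(latencies):.3f}s")
```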

Available Tools

| Tool | Purpose |
| --- | --- |
| send-reading.js | Send a single test reading |
| parallel-reading-test.js | Load test with concurrent readings |

See /eli-devops/stress-testing/README.md for full documentation.

Monitoring and Alerts

Key Metrics to Watch

| Metric | Warning Threshold | Critical Threshold |
| --- | --- | --- |
| HAE Response Time | > 15s | > 30s |
| HAE Error Rate | > 1% | > 5% |
| Cloud Run Instance Count | > 15 | > 20 |
| Queue Depth | > 50 | > 100 |
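A threshold check based on this table could be sketched as follows (the metric names are illustrative, not actual Cloud Monitoring metric IDs):

```python
# (warning, critical) thresholds from the table above.
THRESHOLDS = {
    "hae_response_time_s": (15, 30),
    "hae_error_rate_pct":  (1, 5),
    "instance_count":      (15, 20),
    "queue_depth":         (50, 100),
}

def severity(metric: str, value: float) -> str:
    """Classify a metric reading as ok / warning / critical."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"

print(severity("hae_response_time_s", 18))  # -> warning
print(severity("queue_depth", 120))         # -> critical
```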

Dashboards

  • Cloud Run Dashboard: Instance health and scaling
  • Sentry: Application errors and performance
  • Custom KPI: Business metrics and user analytics

Future Improvements

  1. Fix HAE Concurrency Bug: Enable higher concurrency per instance
  2. Queue-Based Processing: Implement async processing with status polling
  3. Pre-warming: Auto-scale before predicted traffic spikes
  4. Caching: Cache repeated image analysis results
  5. Regional Deployment: Add regions for lower latency