# Infrastructure Architecture

## Cloud Infrastructure Overview
Eli Health runs on Google Cloud Platform with the following key services:
| Service | Technology | Purpose |
|---|---|---|
| Compute | Cloud Run | Serverless container hosting |
| Database | Cloud SQL PostgreSQL | Primary data storage |
| Storage | Cloud Storage | Image and file storage |
| CDN | Cloud CDN | Static asset delivery |
| Load Balancing | Google Load Balancer | Traffic distribution |
| Monitoring | Cloud Monitoring + Sentry | Observability and error tracking |
## Cloud Run Architecture

### Service Layout

#### Services
| Service | Domain | Min Instances | Max Instances | Concurrency |
|---|---|---|---|---|
| Backend API | api.eli.health | 3 | 20 | 20 |
| HAE (Image Analysis) | internal | 10 | 25 | 3 |
| KPI Dashboard | kpi.eli.health | 1 | 5 | 80 |
| Documentation | docs.eli.health | 1 | 3 | 80 |
## Performance Validation: 100 Concurrent Users

### Executive Summary
We conducted extensive stress testing to validate the infrastructure can handle 100 concurrent users submitting hormone test readings. Key findings:
- Current stable capacity: 30 concurrent image analyses
- Bottleneck identified: HAE service with concurrency bug
- Recommendation: Scale HAE instances rather than concurrency until bug is fixed
### HAE Bottleneck Analysis
The Hormone Analysis Engine (HAE) is the critical bottleneck in the system. Each image takes approximately 10-12 seconds to process.
#### Test Results
| Configuration | HAE Instances | Concurrency/Instance | Total Capacity | Result |
|---|---|---|---|---|
| Conservative | 3 | 2 | 6 | ✅ Stable, 117s for 50 images |
| Moderate | 10 | 3 | 30 | ✅ Recommended, ~40s for 100 images |
| Aggressive | 20 | 5 | 100 | ❌ HAE crashes with Python errors |
#### Critical Bug Discovered
When HAE concurrency exceeds 3, a Python error occurs:

```text
# Error at /app/api/app_analysis.py:307
UnboundLocalError: cannot access local variable 'payload' where it is not associated with a value
```
Impact: Maximum stable concurrency per HAE instance is 3 until the bug is fixed.
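The HAE source isn't reproduced here, but an `UnboundLocalError` on a name like `payload` almost always means the variable is assigned on only one code path and then read unconditionally. A minimal, hypothetical reproduction of the bug class, with the usual fix (function and field names are illustrative, not HAE's actual API):

```python
def analyze_image(image_bytes, validate=True):
    """Buggy sketch: 'payload' is bound only on the success path, so
    any request that skips the branch reaches 'return payload' with
    the name unbound and raises UnboundLocalError."""
    if validate and image_bytes:
        payload = {"status": "ok", "size": len(image_bytes)}
    return payload  # UnboundLocalError when the branch above is skipped


def analyze_image_fixed(image_bytes, validate=True):
    """Same logic with 'payload' given a default before any branching."""
    payload = {"status": "error", "size": 0}
    if validate and image_bytes:
        payload = {"status": "ok", "size": len(image_bytes)}
    return payload
```

If failures (timeouts, validation errors) become more likely under load, that would be consistent with the error surfacing only at concurrency above 3, but this is an assumption until the HAE code is inspected.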
### Performance Formula

```text
Processing Time = (Number of Images ÷ HAE Capacity) × 12 seconds
```

Examples:
- 100 images with 30 capacity = (100 ÷ 30) × 12 = ~40 seconds
- 100 images with 100 capacity = (100 ÷ 100) × 12 = 12 seconds
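The formula translates directly into a small helper for sanity-checking capacity plans (the function name and signature are ours, not part of the codebase):

```python
def processing_time_s(num_images: int, hae_capacity: int,
                      seconds_per_image: float = 12.0) -> float:
    """Estimated wall-clock seconds to process a batch:
    (images ÷ capacity) × per-image latency."""
    if hae_capacity <= 0:
        raise ValueError("capacity must be positive")
    return (num_images / hae_capacity) * seconds_per_image
```

This reproduces the worked examples above: 100 images at capacity 30 take roughly 40 seconds, and at capacity 100 exactly one 12-second wave.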
### Recommendations for 100 Concurrent Users

#### Option 1: Fix HAE Bug (Best Solution)
- Fix the Python error in the HAE code
- Scale to: 20 instances × 5 concurrency = 100 capacity
- Expected time: ~12-15 seconds for all 100 images
#### Option 2: Scale Instances (Current Approach)
- Use: 34 instances × 3 concurrency = 102 capacity
- Expected time: ~12-15 seconds for all 100 images
- More expensive but avoids the concurrency bug
#### Option 3: Accept Batching
- Use: 10 instances × 3 concurrency = 30 capacity
- Expected time: ~40-50 seconds for all 100 images
- Most cost-effective stable configuration
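Using the performance formula, the three options can be compared side by side (a quick sketch; instance and concurrency counts are the ones listed above, per-image latency assumed to be 12 seconds):

```python
# Option name -> (instances, concurrency per instance)
OPTIONS = {
    "fix_bug":   (20, 5),   # Option 1: requires the HAE fix
    "scale_out": (34, 3),   # Option 2: avoids the bug, more instances
    "batching":  (10, 3),   # Option 3: cheapest stable configuration
}

def batch_time_s(num_images, instances, concurrency, seconds_per_image=12.0):
    """Estimated seconds to drain a batch at the given capacity."""
    capacity = instances * concurrency
    return (num_images / capacity) * seconds_per_image

for name, (inst, conc) in OPTIONS.items():
    print(f"{name}: capacity={inst * conc}, "
          f"100 images in ~{batch_time_s(100, inst, conc):.0f}s")
```

The output matches the expectations above: Options 1 and 2 clear 100 images in a single ~12-second wave, while Option 3 needs roughly four waves (~40 seconds).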
## Cloud Run Auto-Scaling Behavior

### Scaling Timeline (50 Concurrent Requests)

```text
Time 0s: All 50 requests initiated simultaneously
├─ 0-4s:   First 28 requests complete (warm instances)
├─ 4-10s:  Scaling triggered, new instances starting
├─ 20-35s: Remaining requests complete (after cold start)
└─ Total:  ~30-35 seconds for all 50 requests
```
### Scaling Characteristics
| Metric | Value |
|---|---|
| Instance capacity | ~28 concurrent requests before scaling |
| Cold start penalty | 20-28 seconds additional latency |
| Scaling trigger delay | ~4 seconds after queue detected |
| Success rate | 100% (no dropped requests) |
## Current Terraform Configuration

### API Backend (Production)
```hcl
module "backend_compute" {
  scaling = {
    min_instance_count      = 3
    max_instance_count      = 20
    max_request_concurrency = 20
  }
  resources = {
    cpu    = "1"
    memory = "512Mi"
  }
}
```
### HAE Image Analysis (Production)
```hcl
module "image_analysis_compute" {
  scaling = {
    min_instance_count      = 10 # Scale up for 100 users
    max_instance_count      = 25 # Allow auto-scaling
    max_request_concurrency = 3  # Cannot exceed 3 due to bug
  }
  resources = {
    cpu    = "2"
    memory = "2Gi"
  }
}
```
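For ad-hoc changes outside Terraform (for example, during an incident), the same knobs map onto `gcloud run` flags. A sketch only: the service name `image-analysis` and region `us-central1` are assumptions, not taken from the Terraform above.

```shell
# Mirror the Terraform scaling/resource settings imperatively.
# Note: anything changed this way is reverted by the next
# 'terraform apply' unless the module values are updated too.
gcloud run services update image-analysis \
  --region=us-central1 \
  --min-instances=10 \
  --max-instances=25 \
  --concurrency=3 \
  --cpu=2 \
  --memory=2Gi
```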
## Stress Testing Tools

Stress testing tools are available in the `eli-devops` repository under `/stress-testing/`.
### Quick Start

```shell
cd eli-devops/stress-testing
npm install

# Single reading test
node send-reading.js --env staging --hormone cortisol --wait

# Parallel load test (50 concurrent)
node parallel-reading-test.js --env staging --count 50 --wait
```
### Available Tools

| Tool | Purpose |
|---|---|
| `send-reading.js` | Send a single test reading |
| `parallel-reading-test.js` | Load test with concurrent readings |

See `/eli-devops/stress-testing/README.md` for full documentation.
## Monitoring and Alerts

### Key Metrics to Watch
| Metric | Warning Threshold | Critical Threshold |
|---|---|---|
| HAE Response Time | > 15s | > 30s |
| HAE Error Rate | > 1% | > 5% |
| Cloud Run Instance Count | > 15 | > 20 |
| Queue Depth | > 50 | > 100 |
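The thresholds above can be encoded in a small helper for alerting glue code or ad-hoc checks (the metric keys and function are illustrative, not an existing module):

```python
# (warning, critical) thresholds, mirroring the table above.
THRESHOLDS = {
    "hae_response_time_s": (15.0, 30.0),
    "hae_error_rate_pct":  (1.0, 5.0),
    "instance_count":      (15, 20),
    "queue_depth":         (50, 100),
}

def severity(metric: str, value: float) -> str:
    """Classify a metric sample as 'ok', 'warning', or 'critical'."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

For example, a 12-second HAE response is `ok`, a 2% error rate is `warning`, and a queue depth of 150 is `critical`.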
### Dashboards
- Cloud Run Dashboard: Instance health and scaling
- Sentry: Application errors and performance
- Custom KPI: Business metrics and user analytics
## Future Improvements
- Fix HAE Concurrency Bug: Enable higher concurrency per instance
- Queue-Based Processing: Implement async processing with status polling
- Pre-warming: Auto-scale before predicted traffic spikes
- Caching: Cache repeated image analysis results
- Regional Deployment: Add regions for lower latency
## Related Documentation
- DevOps Terraform Guide - Infrastructure as Code details
- Monitoring Guide - Alerting and observability
- HAE API Service - Image analysis service details