Smart Alerting System
Eli Health uses a multi-layered alerting system that proactively detects issues across all environments (development, staging, production) and provides AI-powered analysis to help teams quickly understand and resolve problems.
System Overview
The alerting system has two parts:
AI-Powered Alerts (this document):
- Detects errors and events across application and infrastructure
- Routes alerts to the right teams via Slack channels based on alert type
- Analyzes logs using Claude AI to provide actionable summaries
- Links directly to the exact time window in Cloud Logging for investigation
Alert Categories:
- Application Alerts → alerts-backend-{env}, alerts-hae-{env}
- Infrastructure Alerts → alerts-infrastructure-{env}
All alerts (application and infrastructure) route through the Pub/Sub topic to the AI summarizer Cloud Function before anything is posted to Slack. The function fetches the relevant logs, generates an AI summary with Claude, and posts to the appropriate Slack channel based on alert type.
Key Features
- AI-powered summaries - Claude analyzes error logs and explains what's failing, where, and why
- Direct log links - Each alert includes a link to the exact 10-minute error window in Cloud Logging
- Team-based routing - Backend errors go to alerts-backend-*, HAE errors go to alerts-hae-*
- Noise filtering - Known non-actionable errors (Terra webhooks, test accounts) are automatically excluded
- Smart notification - Only posts to Slack when relevant logs are found (no empty alerts)
- Actionable insights - Suggestions for what to check first based on error patterns
Architecture
Three Environments
The system runs identically across all environments:
| Environment | Project | Application Alerts | Infrastructure Alerts |
|---|---|---|---|
| Development | eli-health-dev | alerts-backend-dev, alerts-hae-dev | alerts-infrastructure-dev |
| Staging | eli-health-staging | alerts-backend-staging, alerts-hae-staging | alerts-infrastructure-staging |
| Production | eli-health-prod | alerts-backend-production, alerts-hae-production | alerts-infrastructure-production |
Components
tf/
├── modules/global/
│ ├── monitoring/
│ │ ├── main.tf # Infrastructure & security alerts
│ │ ├── application_errors.tf # Backend & HAE error alerts + exclusions
│ │ ├── variables.tf
│ │ └── outputs.tf
│ │
│ └── alert-summarizer/
│ ├── main.tf # Terraform resources + IAM bindings
│ ├── variables.tf
│ ├── outputs.tf
│ └── function/
│ ├── main.py # Cloud Function code
│ └── requirements.txt
│
├── scripts/
│ └── apply-monitoring-all-envs.sh # Deploy monitoring to all environments
│
└── main.tf # Connects modules together
Alert Categories
1. Application Error Alerts (Backend & HAE)
These are the primary alerts for the engineering team:
Backend API Service:
- Metric: Counts errors in api-service-* Cloud Run revisions
- Trigger: Error count > threshold (default: 5) in 5 minutes
- Channel: alerts-backend-{env}
HAE Image Analysis Service:
- Metric: Counts errors in image-analysis-* Cloud Run revisions
- Trigger: Error count > threshold in 5 minutes
- Channel: alerts-hae-{env}
Both include:
- Error spike alerts (threshold-based)
- Error sample alerts (shows actual error messages)
- Critical error alerts (immediate notification for CRITICAL severity)
2. Infrastructure Alerts
These alerts monitor system health and route through the AI summarizer for intelligent analysis:
Cloud SQL Database Monitoring:
| Alert | Trigger | Purpose |
|---|---|---|
| Abnormal CPU Utilization | CPU > 90% | Database under heavy load |
| Abnormal Memory Utilization | Memory > 90% | Memory pressure |
| Abnormal Disk Utilization | Disk > 90% | Running out of storage |
| Abnormal Uptime | Unexpected restarts | Database stability issues |
Other Infrastructure:
| Alert | Trigger | Purpose |
|---|---|---|
| Cloud Run Runtime Error | Runtime errors in Cloud Run | Service health |
| Pub/Sub Latency Health | Abnormal message latency | Message queue health |
Security Alerts:
| Alert | Trigger | Purpose |
|---|---|---|
| IAM Policy Modification | IAM policy changes | Security audit |
| Secret Manager Creation/Update/Deletion | Secret changes | Security audit |
| KMS CryptoKey Operations | Key creation/update/destruction | Security audit |
Cloud Armor WAF blocks are logged but don't trigger alerts. These are routine internet noise (scanners, bots) that Cloud Armor handles automatically - no action needed.
Channel: All infrastructure alerts go to alerts-infrastructure-{env}
AI-Powered Summarization
When any alert fires, the AI summarizer provides context:
How It Works
- Alert fires → Pub/Sub message sent to the alert-summarizer-{env} topic
- Cloud Function triggered → Receives alert metadata (policy name, timestamp)
- Logs fetched → Queries Cloud Logging for errors in 10-minute window
- AI analysis → Claude Sonnet analyzes logs and generates summary
- Slack post → Formatted message with summary and direct logs link
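The sketch below illustrates this flow in Python. It is a simplified, hypothetical version of function/main.py, not the deployed code: the entry point name, environment variables, channel-routing rules, log filter, and Claude model ID are all assumptions.

# Hypothetical sketch of the alert summarizer flow (not the deployed main.py).
import base64, json, os
from datetime import datetime, timedelta, timezone

import anthropic                      # Claude client
import functions_framework            # Cloud Functions (gen2) Pub/Sub trigger
from google.cloud import logging as gcp_logging
from slack_sdk import WebClient

ENV = os.environ.get("ENVIRONMENT", "development")          # assumed env var
PROJECT = os.environ.get("GCP_PROJECT", "eli-health-dev")    # assumed env var

def pick_channel(policy_name: str) -> str:
    """Route by alert type: backend vs HAE vs infrastructure (assumed rules)."""
    if "Backend" in policy_name:
        return f"#alerts-backend-{ENV}"
    if "HAE" in policy_name or "Image Analysis" in policy_name:
        return f"#alerts-hae-{ENV}"
    return f"#alerts-infrastructure-{ENV}"

@functions_framework.cloud_event
def summarize_alert(cloud_event):
    # 1. Decode the Pub/Sub message published by Cloud Monitoring.
    payload = base64.b64decode(cloud_event.data["message"]["data"])
    incident = json.loads(payload)["incident"]
    started = datetime.fromtimestamp(int(incident["started_at"]), tz=timezone.utc)
    start, end = started - timedelta(minutes=5), started + timedelta(minutes=5)

    # 2. Fetch error logs from the 10-minute window around the alert.
    log_filter = (
        'severity>="ERROR" resource.type="cloud_run_revision" '
        f'timestamp>="{start.isoformat()}" timestamp<="{end.isoformat()}"'
    )
    entries = gcp_logging.Client(project=PROJECT).list_entries(filter_=log_filter)
    lines = [str(e.payload) for _, e in zip(range(50), entries)]  # cap at 50 entries
    if not lines:
        return  # no relevant logs -> skip the Slack post entirely

    # 3. Ask Claude for a structured summary of what is failing and why.
    claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    summary = claude.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model ID
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": "Summarize what is failing, where, the pattern, likely cause, "
                       "and a suggested first action:\n" + "\n".join(lines),
        }],
    ).content[0].text

    # 4. Post to the team channel chosen from the policy name.
    WebClient(token=os.environ["SLACK_AUTH_TOKEN"]).chat_postMessage(
        channel=pick_channel(incident["policy_name"]),
        text=f"🔍 AI Summary: {incident['policy_name']}\n{summary}",
    )

Note the early return when no logs match: this is what keeps the system from posting empty alerts to Slack.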
What the AI Provides
🔍 AI Summary: Backend API Errors - production
Started: 2026-01-10 00:14:03 UTC
─────────────────────────────────────────
What's failing: Database connection pool exhaustion causing 500 errors
Where: /api/users/profile and /api/orders/create endpoints
Pattern: All 26 errors show "Connection pool timeout" with same error code
Likely cause: Connection leak or surge in traffic exceeding pool capacity
Suggested action: Check database connection pool metrics and recent deployments
─────────────────────────────────────────
[View Logs (10 min window)]
The "View Logs" button links directly to Cloud Logging with:
- The exact 10-minute time window when errors occurred
- Pre-filled filter for the relevant service
- Correct GCP project
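As an illustration, a deep link with that shape can be built as follows. This is a hedged sketch: the Cloud Logging console URL format (query;startTime;endTime?project=...) and the filter are assumptions about what the deployed function emits, not a copy of it.

# Sketch: build a Cloud Logging console link scoped to a 10-minute window.
from datetime import datetime, timedelta, timezone
from urllib.parse import quote

def logs_url(project: str, service: str, started_at: datetime) -> str:
    start = (started_at - timedelta(minutes=5)).isoformat().replace("+00:00", "Z")
    end = (started_at + timedelta(minutes=5)).isoformat().replace("+00:00", "Z")
    query = (
        'severity>="ERROR" resource.type="cloud_run_revision" '
        f'resource.labels.service_name=~"^{service}-.*"'
    )
    return (
        "https://console.cloud.google.com/logs/query;"
        f"query={quote(query, safe='')};"
        f"startTime={start};endTime={end}?project={project}"
    )

print(logs_url("eli-health-prod", "api-service",
               datetime(2026, 1, 10, 0, 14, 3, tzinfo=timezone.utc)))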
Noise Exclusion
The system automatically excludes known noise patterns at the log-based metric level (not in the function). This means excluded errors never trigger alerts in the first place.
Backend Exclusions (in application_errors.tf):
filter = <<-EOT
severity>="ERROR"
resource.type="cloud_run_revision"
resource.labels.service_name=~"^api-service-.*"
-jsonPayload.labels.context="TerraService" # Terra service context
-jsonPayload.methodName="...SignInWithPassword" # Auth method noise
-jsonPayload.status.message="INVALID_OOB_CODE" # Invalid Firebase codes
-"TOO_MANY_ATTEMPTS_TRY_LATER" # Firebase rate limiting
-httpRequest.requestUrl=~"^https://app.eli.health/terra/.*" # All Terra URLs
-"TerraController" # Terra controller errors
-"terra/webhook" # Terra webhook errors
EOT
HAE Exclusions (in application_errors.tf):
filter = <<-EOT
severity>="ERROR"
resource.type="cloud_run_revision"
resource.labels.service_name=~"^image-analysis-.*"
EOT
To add new exclusions, edit tf/modules/global/monitoring/application_errors.tf and apply to all environments.
Terraform Configuration
Enable the System
In your environment's .tfvars file:
# Enable all monitoring alerts
monitoring_alert_enabled = true
# Application error thresholds
application_error_threshold = 5 # Errors in 5 min to trigger (dev: 5, staging: 10, prod: 25)
enable_critical_alerts = true # Immediate alerts for CRITICAL severity
# Slack integration (same token for all environments)
slack_auth_token = "xoxb-..."
# AI Summarizer
alert_summarizer_enabled = true
anthropic_api_key = "sk-ant-api03-..."
Module Connection
The modules are connected in main.tf:
# Alert Summarizer creates a Pub/Sub notification channel
module "alert_summarizer" {
source = "./modules/global/alert-summarizer/"
enabled = var.alert_summarizer_enabled
project_id = var.gcp_project_id
environment = var.environment
region = var.gcp_region
anthropic_api_key = var.anthropic_api_key
slack_auth_token = var.slack_auth_token
}
# Monitoring module receives the channel ID
module "global_monitoring" {
source = "./modules/global/monitoring/"
# Pass the summarizer channel to alert policies
alert_summarizer_channel_id = module.alert_summarizer.notification_channel_id
# Other configuration...
}
Infrastructure Resources Created
Per Environment
| Resource | Name Pattern | Purpose |
|---|---|---|
| Pub/Sub Topic | alert-summarizer-{env} | Receives alert notifications |
| Cloud Function | alert-summarizer-{env} | Processes alerts |
| Service Account | alert-summarizer-{env} | Function identity |
| Storage Bucket | {project}-alert-summarizer-source | Function code |
| Notification Channel | Alert Summarizer - {env} | Pub/Sub channel for policies |
| Secrets | anthropic-api-key, slack-summarizer-token | API credentials |
IAM Permissions
Cloud Function service account:
- roles/logging.viewer - Read logs from Cloud Logging
- roles/secretmanager.secretAccessor - Access API keys (scoped to specific secrets)
Pub/Sub topic permissions:
- The Cloud Monitoring notification service account has roles/pubsub.publisher on the alert topic
- This allows GCP alerting to publish messages to the Pub/Sub topic when alerts fire
Testing
Simulate an Alert
# Application alert (Backend)
gcloud pubsub topics publish alert-summarizer-development \
--project=eli-health-dev \
--message='{
"incident": {
"policy_name": "Backend API Errors - development",
"condition_name": "Error Rate Test",
"started_at": '"$(date +%s)"',
"state": "open",
"scoping_project_id": "eli-health-dev"
}
}'
# Infrastructure alert (Cloud SQL)
gcloud pubsub topics publish alert-summarizer-development \
--project=eli-health-dev \
--message='{
"incident": {
"policy_name": "Cloud SQL - Abnormal CPU Utilization",
"condition_name": "CPU utilization > 90%",
"started_at": '"$(date +%s)"',
"state": "open",
"scoping_project_id": "eli-health-dev"
}
}'
# Security alert (IAM)
gcloud pubsub topics publish alert-summarizer-development \
--project=eli-health-dev \
--message='{
"incident": {
"policy_name": "IAM - IAM Policy Modification",
"condition_name": "Change in IAM policy detected",
"started_at": '"$(date +%s)"',
"state": "open",
"scoping_project_id": "eli-health-dev"
}
}'
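The same test message can also be published from Python with the Pub/Sub client library. This is a sketch that mirrors the incident fields from the gcloud examples above; adjust topic, project, and policy name as needed.

# Sketch: publish a simulated alert with google-cloud-pubsub instead of gcloud.
import json, time
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("eli-health-dev", "alert-summarizer-development")

message = {
    "incident": {
        "policy_name": "Backend API Errors - development",
        "condition_name": "Error Rate Test",
        "started_at": int(time.time()),
        "state": "open",
        "scoping_project_id": "eli-health-dev",
    }
}

# publish() returns a future; .result() blocks until the message ID comes back.
message_id = publisher.publish(topic_path, json.dumps(message).encode("utf-8")).result()
print(f"Published test alert: {message_id}")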
Write Test Logs
# Create test errors that will be picked up
gcloud logging write api-service-test-errors \
'{"message": "Test database connection error", "error": "ConnectionError"}' \
--project=eli-health-dev \
--payload-type=json \
--severity=ERROR
Check Function Logs
gcloud functions logs read alert-summarizer-development \
--project=eli-health-dev \
--region=northamerica-northeast1 \
--limit=20
Deploying Changes
Monitoring Changes (Filters, Thresholds, Exclusions)
Use the deployment script to apply monitoring changes to all environments:
cd tf
./scripts/apply-monitoring-all-envs.sh
This script:
- Applies to development, staging, and production in sequence
- Targets only monitoring-related resources (metrics, alert policies)
- Handles authentication and project switching automatically
Function Code Changes
When you modify the Cloud Function Python code:
cd tf
# Deploy to all environments
for env in development staging production; do
terraform init -reconfigure -backend-config=$env.gcs.tfbackend
terraform apply -var-file=$env.tfvars -auto-approve -target=module.alert_summarizer
done
Single Environment Deployment
For targeted changes to one environment:
cd tf
# Example: staging only
terraform init -reconfigure -backend-config=staging.gcs.tfbackend
terraform apply -var-file=staging.tfvars -auto-approve \
-target=module.global_monitoring \
-target=module.alert_summarizer
Troubleshooting
"No logs found in alert window"
- Verify the log filter matches your service's logs
- Check that errors occurred within the 10-minute window
- Ensure Cloud Logging has indexed the logs (can take 1-2 minutes)
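A quick way to check the first two points is to run the alert's filter and window directly against Cloud Logging. This sketch uses the google-cloud-logging client; the filter mirrors the backend metric filter and the timestamps are placeholders to replace with the window from the alert.

# Sketch: check whether any ERROR logs exist in a given 10-minute window.
from google.cloud import logging as gcp_logging

client = gcp_logging.Client(project="eli-health-dev")
log_filter = (
    'severity>="ERROR" resource.type="cloud_run_revision" '
    'resource.labels.service_name=~"^api-service-.*" '
    'timestamp>="2026-01-10T00:09:00Z" timestamp<="2026-01-10T00:19:00Z"'
)

# Cap at 10 entries so a noisy window doesn't page through everything.
entries = [e for _, e in zip(range(10), client.list_entries(filter_=log_filter))]
print(f"Found {len(entries)} matching entries")
for entry in entries:
    print(entry.timestamp, entry.severity, str(entry.payload)[:120])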
Slack posting fails
- Verify the Slack token is valid:
  curl -X POST "https://slack.com/api/auth.test" \
    -H "Authorization: Bearer xoxb-your-token"
- Ensure the bot has chat:write and chat:write.public scopes
- Check that the channel exists
Function timeout (120s limit)
- Check Claude API response times
- Reduce log limit in the function (currently 50 entries)
- Check for network issues
Alert not triggering
- Verify the log-based metric is counting:
  gcloud logging metrics describe backend-errors-metric-development \
    --project=eli-health-dev
- Check the alert policy condition threshold
- Verify notification channels are correctly configured
Cost Considerations
| Component | Cost Driver | Estimate |
|---|---|---|
| Cloud Functions | Invocations + compute time | ~$0.50/month (scales to 0) |
| Claude API | Per-token usage | ~$0.003 per summary |
| Cloud Logging | Read operations | Included in quota |
| Pub/Sub | Message count | Minimal |
| Secret Manager | Secret versions | ~$0.06/secret/month |
Future Enhancements
Additional Monitoring Ideas
| Category | Alert | Purpose |
|---|---|---|
| Cloud Run | Container CPU/Memory > 90% | Service resource pressure |
| Cloud Run | Request Latency P95 > 2s | API performance degradation |
| Cloud Run | Cold Start Latency P95 > 5s | Slow startup affecting users |
| Cloud Run | Container Restarts > 3/5min | Crash loop detection |
| Pub/Sub | Unacked Messages > 1000 | Consumer falling behind |
| Pub/Sub | Dead Letter Queue messages | Failed processing |
| Database | Connection pool > 90% | Connection exhaustion |
| Database | Slow queries > 5s | Query performance |
| Cost | Budget > 80% threshold | Cost overrun warning |
| Cost | API Quota > 90% | Rate limiting imminent |
Feature Ideas
- Route infrastructure alerts to Slack channels
- Add AI summaries to infrastructure alerts
- Slack thread replies for related alerts (group by incident)
- Runbook links based on error patterns
- PagerDuty integration for on-call escalation
- Weekly error trend summaries