Smart Alerting System

Eli Health uses a multi-layered alerting system that proactively detects issues across all environments (development, staging, production) and provides AI-powered analysis to help teams quickly understand and resolve problems.

System Overview

The alerting system has two parts:

AI-Powered Alerts (this document):

  1. Detects errors and events across application and infrastructure
  2. Routes alerts to the right teams via Slack channels based on alert type
  3. Analyzes logs using Claude AI to provide actionable summaries
  4. Links directly to the exact time window in Cloud Logging for investigation

Alert Categories:

  • Application Alerts: alerts-backend-{env}, alerts-hae-{env}
  • Infrastructure Alerts: alerts-infrastructure-{env}

All alerts route through the AI summarizer for intelligent analysis before posting to Slack.

Notification Routing

All alerts (application and infrastructure) route through the Pub/Sub topic to the AI summarizer Cloud Function. The function fetches relevant logs, generates an AI summary with Claude, and posts to the appropriate Slack channel based on alert type.
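
A minimal sketch of that flow, assuming a Pub/Sub-triggered function written with functions_framework. The helper names (fetch_error_logs, summarize, pick_slack_channel, post_to_slack) are placeholders used for illustration; the deployed main.py is the source of truth:

import base64
import json

import functions_framework


@functions_framework.cloud_event
def handle_alert(cloud_event):
    """Illustrative outline of the alert-summarizer flow."""
    # 1. Decode the Pub/Sub message published by the Cloud Monitoring notification channel.
    raw = base64.b64decode(cloud_event.data["message"]["data"]).decode("utf-8")
    incident = json.loads(raw)["incident"]

    # 2. Fetch error logs for the alert's time window (see the log-fetch sketch under "How It Works").
    logs = fetch_error_logs(incident)

    # 3. Only post when relevant logs were found (no empty alerts).
    if not logs:
        return

    # 4. Ask Claude for a summary, then post it to the channel that matches the alert type.
    summary = summarize(incident["policy_name"], logs)
    post_to_slack(pick_slack_channel(incident["policy_name"]), summary)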

Key Features

  • AI-powered summaries - Claude analyzes error logs and explains what's failing, where, and why
  • Direct log links - Each alert includes a link to the exact 10-minute error window in Cloud Logging
  • Team-based routing - Backend errors go to alerts-backend-*, HAE errors go to alerts-hae-*
  • Noise filtering - Known non-actionable errors (Terra webhooks, test accounts) are automatically excluded
  • Smart notification - Only posts to Slack when relevant logs are found (no empty alerts)
  • Actionable insights - Suggestions for what to check first based on error patterns

Architecture

Three Environments

The system runs identically across all environments:

Environment | Project            | Application Alerts                               | Infrastructure Alerts
Development | eli-health-dev     | alerts-backend-dev, alerts-hae-dev               | alerts-infrastructure-dev
Staging     | eli-health-staging | alerts-backend-staging, alerts-hae-staging       | alerts-infrastructure-staging
Production  | eli-health-prod    | alerts-backend-production, alerts-hae-production | alerts-infrastructure-production

Components

tf/
├── modules/global/
│   ├── monitoring/
│   │   ├── main.tf                   # Infrastructure & security alerts
│   │   ├── application_errors.tf     # Backend & HAE error alerts + exclusions
│   │   ├── variables.tf
│   │   └── outputs.tf
│   │
│   └── alert-summarizer/
│       ├── main.tf                   # Terraform resources + IAM bindings
│       ├── variables.tf
│       ├── outputs.tf
│       └── function/
│           ├── main.py               # Cloud Function code
│           └── requirements.txt
│
├── scripts/
│   └── apply-monitoring-all-envs.sh  # Deploy monitoring to all environments
│
└── main.tf                           # Connects modules together

Alert Categories

1. Application Error Alerts (Backend & HAE)

These are the primary alerts for the engineering team:

Backend API Service:

  • Metric: Counts errors in api-service-* Cloud Run revisions
  • Trigger: Error count > threshold (default: 5) in 5 minutes
  • Channel: alerts-backend-{env}

HAE Image Analysis Service:

  • Metric: Counts errors in image-analysis-* Cloud Run revisions
  • Trigger: Error count > threshold in 5 minutes
  • Channel: alerts-hae-{env}

Both include:

  • Error spike alerts (threshold-based)
  • Error sample alerts (shows actual error messages)
  • Critical error alerts (immediate notification for CRITICAL severity)

2. Infrastructure Alerts

These alerts monitor system health and route through the AI summarizer for intelligent analysis:

Cloud SQL Database Monitoring:

Alert                       | Trigger             | Purpose
Abnormal CPU Utilization    | CPU > 90%           | Database under heavy load
Abnormal Memory Utilization | Memory > 90%        | Memory pressure
Abnormal Disk Utilization   | Disk > 90%          | Running out of storage
Abnormal Uptime             | Unexpected restarts | Database stability issues

Other Infrastructure:

Alert                   | Trigger                     | Purpose
Cloud Run Runtime Error | Runtime errors in Cloud Run | Service health
Pub/Sub Latency Health  | Abnormal message latency    | Message queue health

Security Alerts:

Alert                                   | Trigger                         | Purpose
IAM Policy Modification                 | IAM policy changes              | Security audit
Secret Manager Creation/Update/Deletion | Secret changes                  | Security audit
KMS CryptoKey Operations                | Key creation/update/destruction | Security audit

Cloud Armor:

Cloud Armor WAF blocks are logged but don't trigger alerts. These are routine internet noise (scanners, bots) that Cloud Armor handles automatically - no action needed.

Channel: All infrastructure alerts go to alerts-infrastructure-{env}
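
Channel selection happens inside the summarizer and is keyed off the alert policy name, for application and infrastructure alerts alike. A minimal sketch of that routing; the matching rules and the ENVIRONMENT variable name are assumptions, and the deployed main.py is authoritative:

import os


def pick_slack_channel(policy_name: str) -> str:
    """Map an alert policy to its Slack channel (illustrative matching rules)."""
    env = os.environ.get("ENVIRONMENT", "development")  # assumed env var name
    name = policy_name.lower()
    if "backend" in name:
        return f"alerts-backend-{env}"
    if "hae" in name or "image analysis" in name:
        return f"alerts-hae-{env}"
    # Cloud SQL, Cloud Run runtime, Pub/Sub, and security policies all land here.
    return f"alerts-infrastructure-{env}"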

AI-Powered Summarization

When any alert fires, the AI summarizer provides context:

How It Works

  1. Alert fires → Pub/Sub message sent to alert-summarizer-{env} topic
  2. Cloud Function triggered → Receives alert metadata (policy name, timestamp)
  3. Logs fetched → Queries Cloud Logging for errors in 10-minute window
  4. AI analysis → Claude Sonnet analyzes logs and generates summary
  5. Slack post → Formatted message with summary and direct logs link
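
A sketch of step 3 above, using the google-cloud-logging client. The window is assumed here to be the 10 minutes leading up to the alert and the entry limit 50; the exact window math and service filters live in main.py:

from datetime import datetime, timedelta, timezone

from google.cloud import logging as gcp_logging


def fetch_error_logs(incident: dict, limit: int = 50) -> list[str]:
    """Return up to `limit` ERROR-and-above entries from the alert's 10-minute window."""
    client = gcp_logging.Client(project=incident["scoping_project_id"])
    end = datetime.fromtimestamp(int(incident["started_at"]), tz=timezone.utc)
    start = end - timedelta(minutes=10)

    log_filter = (
        'severity>="ERROR" AND resource.type="cloud_run_revision"'
        f' AND timestamp>="{start.isoformat()}" AND timestamp<="{end.isoformat()}"'
    )
    # Narrow to the service the policy covers (assumed mapping).
    if "Backend" in incident["policy_name"]:
        log_filter += ' AND resource.labels.service_name=~"^api-service-.*"'
    elif "HAE" in incident["policy_name"]:
        log_filter += ' AND resource.labels.service_name=~"^image-analysis-.*"'

    entries = client.list_entries(
        filter_=log_filter, order_by=gcp_logging.DESCENDING, max_results=limit
    )
    return [str(entry.payload) for entry in entries]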

What the AI Provides

🔍 AI Summary: Backend API Errors - production
Started: 2026-01-10 00:14:03 UTC
─────────────────────────────────────────
What's failing: Database connection pool exhaustion causing 500 errors

Where: /api/users/profile and /api/orders/create endpoints

Pattern: All 26 errors show "Connection pool timeout" with same error code

Likely cause: Connection leak or surge in traffic exceeding pool capacity

Suggested action: Check database connection pool metrics and recent deployments
─────────────────────────────────────────
[View Logs (10 min window)]
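
The summary above comes from a single Claude call over the fetched log entries. A minimal sketch of that call with the anthropic SDK; the model ID and prompt wording are assumptions (the doc only specifies "Claude Sonnet"), and the API key is assumed to reach the function as the ANTHROPIC_API_KEY environment variable sourced from Secret Manager:

import anthropic

PROMPT = """Analyze these error logs for the alert "{policy_name}".
Respond under the headings: What's failing, Where, Pattern, Likely cause, Suggested action.

Logs:
{log_text}
"""


def summarize(policy_name: str, log_entries: list[str]) -> str:
    """Generate the structured summary shown above (illustrative prompt)."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model ID
        max_tokens=500,
        messages=[{"role": "user", "content": PROMPT.format(
            policy_name=policy_name, log_text="\n".join(log_entries))}],
    )
    return response.content[0].text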

The "View Logs" button links directly to Cloud Logging with:

  • The exact 10-minute time window when errors occurred
  • Pre-filled filter for the relevant service
  • Correct GCP project
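
One way such a deep link can be built; the console's query-URL shape shown here is an assumption, so adjust if the deployed function encodes it differently:

from urllib.parse import quote


def build_logs_url(project_id: str, log_filter: str, start_iso: str, end_iso: str) -> str:
    """Construct a Cloud Logging console link scoped to the alert's time window."""
    return (
        "https://console.cloud.google.com/logs/query;"
        f"query={quote(log_filter, safe='')};"
        f"startTime={start_iso};"
        f"endTime={end_iso}"
        f"?project={project_id}"
    )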

Noise Exclusion

The system automatically excludes known noise patterns at the log-based metric level (not in the function). This means excluded errors never trigger alerts in the first place.

Backend Exclusions (in application_errors.tf):

filter = <<-EOT
severity>="ERROR"
resource.type="cloud_run_revision"
resource.labels.service_name=~"^api-service-.*"
-jsonPayload.labels.context="TerraService" # Terra service context
-jsonPayload.methodName="...SignInWithPassword" # Auth method noise
-jsonPayload.status.message="INVALID_OOB_CODE" # Invalid Firebase codes
-"TOO_MANY_ATTEMPTS_TRY_LATER" # Firebase rate limiting
-httpRequest.requestUrl=~"^https://app.eli.health/terra/.*" # All Terra URLs
-"TerraController" # Terra controller errors
-"terra/webhook" # Terra webhook errors
EOT

HAE Exclusions (in application_errors.tf):

filter = <<-EOT
severity>="ERROR"
resource.type="cloud_run_revision"
resource.labels.service_name=~"^image-analysis-.*"
EOT

To add new exclusions, edit tf/modules/global/monitoring/application_errors.tf and apply to all environments.

Terraform Configuration

Enable the System

In your environment's .tfvars file:

# Enable all monitoring alerts
monitoring_alert_enabled = true

# Application error thresholds
application_error_threshold = 5 # Errors in 5 min to trigger (dev: 5, staging: 10, prod: 25)
enable_critical_alerts = true # Immediate alerts for CRITICAL severity

# Slack integration (same token for all environments)
slack_auth_token = "xoxb-..."

# AI Summarizer
alert_summarizer_enabled = true
anthropic_api_key = "sk-ant-api03-..."

Module Connection

The modules are connected in main.tf:

# Alert Summarizer creates a Pub/Sub notification channel
module "alert_summarizer" {
  source            = "./modules/global/alert-summarizer/"
  enabled           = var.alert_summarizer_enabled
  project_id        = var.gcp_project_id
  environment       = var.environment
  region            = var.gcp_region
  anthropic_api_key = var.anthropic_api_key
  slack_auth_token  = var.slack_auth_token
}

# Monitoring module receives the channel ID
module "global_monitoring" {
  source = "./modules/global/monitoring/"

  # Pass the summarizer channel to alert policies
  alert_summarizer_channel_id = module.alert_summarizer.notification_channel_id

  # Other configuration...
}

Infrastructure Resources Created

Per Environment

Resource             | Name Pattern                               | Purpose
Pub/Sub Topic        | alert-summarizer-{env}                     | Receives alert notifications
Cloud Function       | alert-summarizer-{env}                     | Processes alerts
Service Account      | alert-summarizer-{env}                     | Function identity
Storage Bucket       | {project}-alert-summarizer-source          | Function code
Notification Channel | Alert Summarizer - {env}                   | Pub/Sub channel for policies
Secrets              | anthropic-api-key, slack-summarizer-token  | API credentials

IAM Permissions

Cloud Function service account:

  • roles/logging.viewer - Read logs from Cloud Logging
  • roles/secretmanager.secretAccessor - Access API keys (scoped to specific secrets)

Pub/Sub topic permissions:

  • Cloud Monitoring notification service account has roles/pubsub.publisher on the alert topic
  • This allows GCP alerting to publish messages to the Pub/Sub topic when alerts fire

Testing

Simulate an Alert

# Application alert (Backend)
gcloud pubsub topics publish alert-summarizer-development \
  --project=eli-health-dev \
  --message='{
    "incident": {
      "policy_name": "Backend API Errors - development",
      "condition_name": "Error Rate Test",
      "started_at": '"$(date +%s)"',
      "state": "open",
      "scoping_project_id": "eli-health-dev"
    }
  }'

# Infrastructure alert (Cloud SQL)
gcloud pubsub topics publish alert-summarizer-development \
  --project=eli-health-dev \
  --message='{
    "incident": {
      "policy_name": "Cloud SQL - Abnormal CPU Utilization",
      "condition_name": "CPU utilization > 90%",
      "started_at": '"$(date +%s)"',
      "state": "open",
      "scoping_project_id": "eli-health-dev"
    }
  }'

# Security alert (IAM)
gcloud pubsub topics publish alert-summarizer-development \
  --project=eli-health-dev \
  --message='{
    "incident": {
      "policy_name": "IAM - IAM Policy Modification",
      "condition_name": "Change in IAM policy detected",
      "started_at": '"$(date +%s)"',
      "state": "open",
      "scoping_project_id": "eli-health-dev"
    }
  }'

Write Test Logs

# Create test errors that will be picked up
gcloud logging write api-service-test-errors \
  '{"message": "Test database connection error", "error": "ConnectionError"}' \
  --project=eli-health-dev \
  --payload-type=json \
  --severity=ERROR

Check Function Logs

gcloud functions logs read alert-summarizer-development \
  --project=eli-health-dev \
  --region=northamerica-northeast1 \
  --limit=20

Deploying Changes

Monitoring Changes (Filters, Thresholds, Exclusions)

Use the deployment script to apply monitoring changes to all environments:

cd tf
./scripts/apply-monitoring-all-envs.sh

This script:

  • Applies to development, staging, and production in sequence
  • Targets only monitoring-related resources (metrics, alert policies)
  • Handles authentication and project switching automatically

Function Code Changes

When you modify the Cloud Function Python code:

cd tf

# Deploy to all environments
for env in development staging production; do
  terraform init -reconfigure -backend-config=$env.gcs.tfbackend
  terraform apply -var-file=$env.tfvars -auto-approve -target=module.alert_summarizer
done

Single Environment Deployment

For targeted changes to one environment:

cd tf

# Example: staging only
terraform init -reconfigure -backend-config=staging.gcs.tfbackend
terraform apply -var-file=staging.tfvars -auto-approve \
  -target=module.global_monitoring \
  -target=module.alert_summarizer

Troubleshooting

"No logs found in alert window"

  1. Verify the log filter matches your service's logs
  2. Check that errors occurred within the 10-minute window
  3. Ensure Cloud Logging has indexed the logs (can take 1-2 minutes)

Slack posting fails

  1. Verify the Slack token is valid:
    curl -X POST "https://slack.com/api/auth.test" \
    -H "Authorization: Bearer xoxb-your-token"
  2. Ensure the bot has chat:write and chat:write.public scopes
  3. Check that the channel exists
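
For reference, the posting step amounts to a single chat.postMessage call. A minimal sketch of that step (the SLACK_AUTH_TOKEN variable name is an assumption), which surfaces the usual failure codes behind the checks above:

import os

import requests


def post_to_slack(channel: str, text: str) -> None:
    """Post a summary via Slack's chat.postMessage API and fail loudly on Slack-side errors."""
    resp = requests.post(
        "https://slack.com/api/chat.postMessage",
        headers={"Authorization": f"Bearer {os.environ['SLACK_AUTH_TOKEN']}"},
        json={"channel": channel, "text": text},
        timeout=10,
    )
    body = resp.json()
    if not body.get("ok"):
        # Typical errors: invalid_auth (bad token), channel_not_found / not_in_channel,
        # missing_scope (needs chat:write, chat:write.public).
        raise RuntimeError(f"Slack API error: {body.get('error')}")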

Function timeout (120s limit)

  1. Check Claude API response times
  2. Reduce log limit in the function (currently 50 entries)
  3. Check for network issues

Alert not triggering

  1. Verify the log-based metric is counting:
    gcloud logging metrics describe backend-errors-metric-development \
    --project=eli-health-dev
  2. Check the alert policy condition threshold
  3. Verify notification channels are correctly configured

Cost Considerations

Component       | Cost Driver                | Estimate
Cloud Functions | Invocations + compute time | ~$0.50/month (scales to 0)
Claude API      | Per-token usage            | ~$0.003 per summary
Cloud Logging   | Read operations            | Included in quota
Pub/Sub         | Message count              | Minimal
Secret Manager  | Secret versions            | ~$0.06/secret/month

Future Enhancements

Additional Monitoring Ideas

Category  | Alert                        | Purpose
Cloud Run | Container CPU/Memory > 90%   | Service resource pressure
Cloud Run | Request Latency P95 > 2s     | API performance degradation
Cloud Run | Cold Start Latency P95 > 5s  | Slow startup affecting users
Cloud Run | Container Restarts > 3/5min  | Crash loop detection
Pub/Sub   | Unacked Messages > 1000      | Consumer falling behind
Pub/Sub   | Dead Letter Queue messages   | Failed processing
Database  | Connection pool > 90%        | Connection exhaustion
Database  | Slow queries > 5s            | Query performance
Cost      | Budget > 80% threshold       | Cost overrun warning
Cost      | API Quota > 90%              | Rate limiting imminent

Feature Ideas

  • Slack thread replies for related alerts (group by incident)
  • Runbook links based on error patterns
  • PagerDuty integration for on-call escalation
  • Weekly error trend summaries