Smart Alerting System
Eli Health uses a multi-layered alerting system that proactively detects issues across all environments (development, staging, production) and provides AI-powered analysis to help teams quickly understand and resolve problems.
System Overview
The alerting system has two parts:
AI-Powered Alerts (this document):
- Detects errors and events across application and infrastructure
- Routes alerts to the right teams via Slack channels based on alert type
- Analyzes logs using Claude AI to provide actionable summaries
- Links directly to the exact time window in Cloud Logging for investigation
Alert Categories:
- Application Alerts → alerts-backend-{env}, alerts-hae-{env}
- Infrastructure Alerts → alerts-infrastructure-{env}
All alerts (application and infrastructure) route through the Pub/Sub topic to the AI summarizer Cloud Function before anything is posted to Slack. The function fetches the relevant logs, generates an AI summary with Claude, and posts to the appropriate Slack channel based on alert type.
Key Features
- AI-powered summaries - Claude analyzes error logs and explains what's failing, where, and why
- Direct log links - Each alert includes a link to the exact 10-minute error window in Cloud Logging
- Team-based routing - Backend errors go to alerts-backend-*, HAE errors go to alerts-hae-*
- Noise filtering - Known non-actionable errors (Terra webhooks, test accounts) are automatically excluded
- Smart notification - Only posts to Slack when relevant logs are found (no empty alerts)
- Actionable insights - Suggestions for what to check first based on error patterns
Architecture
Three Environments
The system runs identically across all environments:
| Environment | Project | Application Alerts | Infrastructure Alerts |
|---|---|---|---|
| Development | eli-health-dev | alerts-backend-dev, alerts-hae-dev | alerts-infrastructure-dev |
| Staging | eli-health-staging | alerts-backend-staging, alerts-hae-staging | alerts-infrastructure-staging |
| Production | eli-health-prod | alerts-backend-production, alerts-hae-production | alerts-infrastructure-production |
Components
tf/
├── modules/global/
│ ├── monitoring/
│ │ ├── main.tf # Infrastructure & security alerts
│ │ ├── application_errors.tf # Backend & HAE error alerts + exclusions
│ │ ├── variables.tf
│ │ └── outputs.tf
│ │
│ └── alert-summarizer/
│ ├── main.tf # Terraform resources + IAM bindings
│ ├── variables.tf
│ ├── outputs.tf
│ └── function/
│ ├── main.py # Cloud Function code
│ └── requirements.txt
│
├── scripts/
│ └── apply-monitoring-all-envs.sh # Deploy monitoring to all environments
│
└── main.tf # Connects modules together
Alert Categories
1. Application Error Alerts (Backend & HAE)
These are the primary alerts for the engineering team:
Backend API Service:
- Metric: Counts errors in api-service-* Cloud Run revisions
- Trigger: Error count > threshold (default: 5) in 5 minutes
- Channel: alerts-backend-{env}
HAE Image Analysis Service:
- Metric: Counts errors in image-analysis-* Cloud Run revisions
- Trigger: Error count > threshold in 5 minutes
- Channel: alerts-hae-{env}
Both include:
- Error spike alerts (threshold-based)
- Error sample alerts (shows actual error messages)
- Critical error alerts (immediate notification for CRITICAL severity)
2. Infrastructure Alerts
These alerts monitor system health and route through the AI summarizer for intelligent analysis:
Cloud SQL Database Monitoring:
| Alert | Trigger | Purpose |
|---|---|---|
| Abnormal CPU Utilization | CPU > 90% | Database under heavy load |
| Abnormal Memory Utilization | Memory > 90% | Memory pressure |
| Abnormal Disk Utilization | Disk > 90% | Running out of storage |
| Abnormal Uptime | Unexpected restarts | Database stability issues |
Other Infrastructure:
| Alert | Trigger | Purpose |
|---|---|---|
| Cloud Run Runtime Error | Runtime errors in Cloud Run | Service health |
| Pub/Sub Latency Health | Abnormal message latency | Message queue health |
Security Alerts:
| Alert | Trigger | Purpose |
|---|---|---|
| IAM Policy Modification | IAM policy changes | Security audit |
| Secret Manager Creation/Update/Deletion | Secret changes | Security audit |
| KMS CryptoKey Operations | Key creation/update/destruction | Security audit |
Cloud Armor WAF blocks are logged but don't trigger alerts. These are routine internet noise (scanners, bots) that Cloud Armor handles automatically - no action needed.
Channel: All infrastructure alerts go to alerts-infrastructure-{env}
AI-Powered Summarization
When any alert fires, the AI summarizer provides context:
How It Works
- Alert fires → Pub/Sub message sent to the alert-summarizer-{env} topic
- Cloud Function triggered → Receives alert metadata (policy name, timestamp)
- Logs fetched → Queries Cloud Logging for errors in 10-minute window
- AI analysis → Claude Sonnet analyzes logs and generates summary
- Slack post → Formatted message with summary and direct logs link
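The sketch below illustrates this flow in Python. It is a simplified, hypothetical version of function/main.py, not the deployed code: the entry point name, environment variables, channel-routing rules, log filter, and Claude model ID are all assumptions.

# Hypothetical sketch of the alert summarizer flow (not the deployed main.py).
import base64, json, os
from datetime import datetime, timedelta, timezone

import anthropic                      # Claude client
import functions_framework            # Cloud Functions (gen2) Pub/Sub trigger
from google.cloud import logging as gcp_logging
from slack_sdk import WebClient

ENV = os.environ.get("ENVIRONMENT", "development")          # assumed env var
PROJECT = os.environ.get("GCP_PROJECT", "eli-health-dev")    # assumed env var

def pick_channel(policy_name: str) -> str:
    """Route by alert type: backend vs HAE vs infrastructure (assumed rules)."""
    if "Backend" in policy_name:
        return f"#alerts-backend-{ENV}"
    if "HAE" in policy_name or "Image Analysis" in policy_name:
        return f"#alerts-hae-{ENV}"
    return f"#alerts-infrastructure-{ENV}"

@functions_framework.cloud_event
def summarize_alert(cloud_event):
    # 1. Decode the Pub/Sub message published by Cloud Monitoring.
    payload = base64.b64decode(cloud_event.data["message"]["data"])
    incident = json.loads(payload)["incident"]
    started = datetime.fromtimestamp(int(incident["started_at"]), tz=timezone.utc)
    start, end = started - timedelta(minutes=5), started + timedelta(minutes=5)

    # 2. Fetch error logs from the 10-minute window around the alert.
    log_filter = (
        'severity>="ERROR" resource.type="cloud_run_revision" '
        f'timestamp>="{start.isoformat()}" timestamp<="{end.isoformat()}"'
    )
    entries = gcp_logging.Client(project=PROJECT).list_entries(filter_=log_filter)
    lines = [str(e.payload) for _, e in zip(range(50), entries)]  # cap at 50 entries
    if not lines:
        return  # no relevant logs -> skip the Slack post entirely

    # 3. Ask Claude for a structured summary of what is failing and why.
    claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    summary = claude.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model ID
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": "Summarize what is failing, where, the pattern, likely cause, "
                       "and a suggested first action:\n" + "\n".join(lines),
        }],
    ).content[0].text

    # 4. Post to the team channel chosen from the policy name.
    WebClient(token=os.environ["SLACK_AUTH_TOKEN"]).chat_postMessage(
        channel=pick_channel(incident["policy_name"]),
        text=f"🔍 AI Summary: {incident['policy_name']}\n{summary}",
    )

Note the early return when no logs match: this is what keeps the system from posting empty alerts to Slack.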
What the AI Provides
🔍 AI Summary: Backend API Errors - production
Started: 2026-01-10 00:14:03 UTC
─────────────────────────────────────────
What's failing: Database connection pool exhaustion causing 500 errors
Where: /api/users/profile and /api/orders/create endpoints
Pattern: All 26 errors show "Connection pool timeout" with same error code
Likely cause: Connection leak or surge in traffic exceeding pool capacity
Suggested action: Check database connection pool metrics and recent deployments
─────────────────────────────────────────
[View Logs (10 min window)]
The "View Logs" button links directly to Cloud Logging with:
- The exact 10-minute time window when errors occurred
- Pre-filled filter for the relevant service
- Correct GCP project
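As an illustration, a deep link with that shape can be built as follows. This is a hedged sketch: the Cloud Logging console URL format (query;startTime;endTime?project=...) and the filter are assumptions about what the deployed function emits, not a copy of it.

# Sketch: build a Cloud Logging console link scoped to a 10-minute window.
from datetime import datetime, timedelta, timezone
from urllib.parse import quote

def logs_url(project: str, service: str, started_at: datetime) -> str:
    start = (started_at - timedelta(minutes=5)).isoformat().replace("+00:00", "Z")
    end = (started_at + timedelta(minutes=5)).isoformat().replace("+00:00", "Z")
    query = (
        'severity>="ERROR" resource.type="cloud_run_revision" '
        f'resource.labels.service_name=~"^{service}-.*"'
    )
    return (
        "https://console.cloud.google.com/logs/query;"
        f"query={quote(query, safe='')};"
        f"startTime={start};endTime={end}?project={project}"
    )

print(logs_url("eli-health-prod", "api-service",
               datetime(2026, 1, 10, 0, 14, 3, tzinfo=timezone.utc)))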
Noise Exclusion
The system automatically excludes known noise patterns at the log-based metric level (not in the function). This means excluded errors never trigger alerts in the first place.
Backend Exclusions (in application_errors.tf):
filter = <<-EOT
severity>="ERROR"
resource.type="cloud_run_revision"
resource.labels.service_name=~"^api-service-.*"
-jsonPayload.labels.context="TerraService" # Terra service context
-jsonPayload.methodName="...SignInWithPassword" # Auth method noise
-jsonPayload.status.message="INVALID_OOB_CODE" # Invalid Firebase codes
-"TOO_MANY_ATTEMPTS_TRY_LATER" # Firebase rate limiting
-httpRequest.requestUrl=~"^https://app.eli.health/terra/.*" # All Terra URLs
-"TerraController" # Terra controller errors
-"terra/webhook" # Terra webhook errors
EOT
HAE Exclusions (in application_errors.tf):
filter = <<-EOT
severity>="ERROR"
resource.type="cloud_run_revision"
resource.labels.service_name=~"^image-analysis-.*"
EOT
To add new exclusions, edit tf/modules/global/monitoring/application_errors.tf and apply to all environments.
Terraform Configuration
Enable the System
In your environment's .tfvars file:
# Enable all monitoring alerts
monitoring_alert_enabled = true
# Application error thresholds
application_error_threshold = 5 # Errors in 5 min to trigger (dev: 5, staging: 10, prod: 25)
enable_critical_alerts = true # Immediate alerts for CRITICAL severity
# Slack integration (same token for all environments)
slack_auth_token = "xoxb-..."
# AI Summarizer
alert_summarizer_enabled = true
anthropic_api_key = "sk-ant-api03-..."
Module Connection
The modules are connected in main.tf:
# Alert Summarizer creates a Pub/Sub notification channel
module "alert_summarizer" {
source = "./modules/global/alert-summarizer/"
enabled = var.alert_summarizer_enabled
project_id = var.gcp_project_id
environment = var.environment
region = var.gcp_region
anthropic_api_key = var.anthropic_api_key
slack_auth_token = var.slack_auth_token
}
# Monitoring module receives the channel ID
module "global_monitoring" {
source = "./modules/global/monitoring/"
# Pass the summarizer channel to alert policies
alert_summarizer_channel_id = module.alert_summarizer.notification_channel_id
# Other configuration...
}
Infrastructure Resources Created
Per Environment
| Resource | Name Pattern | Purpose |
|---|---|---|
| Pub/Sub Topic | alert-summarizer-{env} | Receives alert notifications |
| Cloud Function | alert-summarizer-{env} | Processes alerts |
| Service Account | alert-summarizer-{env} | Function identity |
| Storage Bucket | {project}-alert-summarizer-source | Function code |
| Notification Channel | Alert Summarizer - {env} | Pub/Sub channel for policies |
| Secrets | anthropic-api-key, slack-summarizer-token | API credentials |
IAM Permissions
Cloud Function service account:
- roles/logging.viewer - Read logs from Cloud Logging
- roles/secretmanager.secretAccessor - Access API keys (scoped to specific secrets)
Pub/Sub topic permissions:
- The Cloud Monitoring notification service account has roles/pubsub.publisher on the alert topic
- This allows GCP alerting to publish messages to the Pub/Sub topic when alerts fire
Testing
Simulate an Alert
# Application alert (Backend)
gcloud pubsub topics publish alert-summarizer-development \
--project=eli-health-dev \
--message='{
"incident": {
"policy_name": "Backend API Errors - development",
"condition_name": "Error Rate Test",
"started_at": '"$(date +%s)"',
"state": "open",
"scoping_project_id": "eli-health-dev"
}
}'
# Infrastructure alert (Cloud SQL)
gcloud pubsub topics publish alert-summarizer-development \
--project=eli-health-dev \
--message='{
"incident": {
"policy_name": "Cloud SQL - Abnormal CPU Utilization",
"condition_name": "CPU utilization > 90%",
"started_at": '"$(date +%s)"',
"state": "open",
"scoping_project_id": "eli-health-dev"
}
}'
# Security alert (IAM)
gcloud pubsub topics publish alert-summarizer-development \
--project=eli-health-dev \
--message='{
"incident": {
"policy_name": "IAM - IAM Policy Modification",
"condition_name": "Change in IAM policy detected",
"started_at": '"$(date +%s)"',
"state": "open",
"scoping_project_id": "eli-health-dev"
}
}'
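The same test message can also be published from Python with the Pub/Sub client library. This is a sketch that mirrors the incident fields from the gcloud examples above; adjust topic, project, and policy name as needed.

# Sketch: publish a simulated alert with google-cloud-pubsub instead of gcloud.
import json, time
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("eli-health-dev", "alert-summarizer-development")

message = {
    "incident": {
        "policy_name": "Backend API Errors - development",
        "condition_name": "Error Rate Test",
        "started_at": int(time.time()),
        "state": "open",
        "scoping_project_id": "eli-health-dev",
    }
}

# publish() returns a future; .result() blocks until the message ID comes back.
message_id = publisher.publish(topic_path, json.dumps(message).encode("utf-8")).result()
print(f"Published test alert: {message_id}")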
Write Test Logs
# Create test errors that will be picked up
gcloud logging write api-service-test-errors \
'{"message": "Test database connection error", "error": "ConnectionError"}' \
--project=eli-health-dev \
--payload-type=json \
--severity=ERROR
Check Function Logs
gcloud functions logs read alert-summarizer-development \
--project=eli-health-dev \
--region=northamerica-northeast1 \
--limit=20
Deploying Changes
Monitoring Changes (Filters, Thresholds, Exclusions)
Use the deployment script to apply monitoring changes to all environments:
cd tf
./scripts/apply-monitoring-all-envs.sh
This script:
- Applies to development, staging, and production in sequence
- Targets only monitoring-related resources (metrics, alert policies)
- Handles authentication and project switching automatically
Function Code Changes
When you modify the Cloud Function Python code:
cd tf
# Deploy to all environments
for env in development staging production; do
terraform init -reconfigure -backend-config=$env.gcs.tfbackend
terraform apply -var-file=$env.tfvars -auto-approve -target=module.alert_summarizer
done
Single Environment Deployment
For targeted changes to one environment:
cd tf
# Example: staging only
terraform init -reconfigure -backend-config=staging.gcs.tfbackend
terraform apply -var-file=staging.tfvars -auto-approve \
-target=module.global_monitoring \
-target=module.alert_summarizer
Troubleshooting
"No logs found in alert window"
- Verify the log filter matches your service's logs
- Check that errors occurred within the 10-minute window
- Ensure Cloud Logging has indexed the logs (can take 1-2 minutes)
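A quick way to check the first two points is to run the alert's filter and window directly against Cloud Logging. This sketch uses the google-cloud-logging client; the filter mirrors the backend metric filter and the timestamps are placeholders to replace with the window from the alert.

# Sketch: check whether any ERROR logs exist in a given 10-minute window.
from google.cloud import logging as gcp_logging

client = gcp_logging.Client(project="eli-health-dev")
log_filter = (
    'severity>="ERROR" resource.type="cloud_run_revision" '
    'resource.labels.service_name=~"^api-service-.*" '
    'timestamp>="2026-01-10T00:09:00Z" timestamp<="2026-01-10T00:19:00Z"'
)

# Cap at 10 entries so a noisy window doesn't page through everything.
entries = [e for _, e in zip(range(10), client.list_entries(filter_=log_filter))]
print(f"Found {len(entries)} matching entries")
for entry in entries:
    print(entry.timestamp, entry.severity, str(entry.payload)[:120])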
Slack posting fails
- Verify the Slack token is valid:
  curl -X POST "https://slack.com/api/auth.test" \
    -H "Authorization: Bearer xoxb-your-token"
- Ensure the bot has chat:write and chat:write.public scopes
- Check that the channel exists
Function timeout (120s limit)
- Check Claude API response times
- Reduce log limit in the function (currently 50 entries)
- Check for network issues
Alert not triggering
- Verify the log-based metric is counting:
  gcloud logging metrics describe backend-errors-metric-development \
    --project=eli-health-dev
- Check the alert policy condition threshold
- Verify notification channels are correctly configured
Cost Considerations
| Component | Cost Driver | Estimate |
|---|---|---|
| Cloud Functions | Invocations + compute time | ~$0.50/month (scales to 0) |
| Claude API | Per-token usage | ~$0.003 per summary |
| Cloud Logging | Read operations | Included in quota |
| Pub/Sub | Message count | Minimal |
| Secret Manager | Secret versions | ~$0.06/secret/month |
Future Enhancements
Additional Monitoring Ideas
| Category | Alert | Purpose |
|---|---|---|
| Cloud Run | Container CPU/Memory > 90% | Service resource pressure |
| Cloud Run | Request Latency P95 > 2s | API performance degradation |
| Cloud Run | Cold Start Latency P95 > 5s | Slow startup affecting users |
| Cloud Run | Container Restarts > 3/5min | Crash loop detection |
| Pub/Sub | Unacked Messages > 1000 | Consumer falling behind |
| Pub/Sub | Dead Letter Queue messages | Failed processing |
| Database | Connection pool > 90% | Connection exhaustion |
| Database | Slow queries > 5s | Query performance |
| Cost | Budget > 80% threshold | Cost overrun warning |
| Cost | API Quota > 90% | Rate limiting imminent |
Feature Ideas
- Route infrastructure alerts to Slack channels
- Add AI summaries to infrastructure alerts
- Slack thread replies for related alerts (group by incident)
- Runbook links based on error patterns
- PagerDuty integration for on-call escalation
- Weekly error trend summaries