Data Warehouse & Pipeline Architecture
This document describes the complete data pipeline and warehouse architecture for Eli Health, showing how data from multiple sources (PostgreSQL, Firebase Analytics, Firebase Auth, and Shopify) is integrated into BigQuery for centralized analytics and reporting.
Architecture Diagram
The diagram below shows the detailed infrastructure setup for the data warehouse and pipeline:
*Available in Production environment only
Data Pipeline Components
Bastion Host
- Compute Instance (e2-micro): Provides a secure entry point into the database environment
- Cloud SQL Auth Proxy: Manages authentication to PostgreSQL without exposing direct DB access
- Firewall Rules: Restricts access to the bastion host via IP filtering
Database
- PostgreSQL (Cloud SQL): Primary application database
- SSL Encryption: Ensures data-in-transit protection
Data Pipeline
The data pipeline integrates data from five distinct sources into BigQuery for centralized analytics:
Data Sources & Integration Methods
Available in All Environments (Dev, Stage, Production)
-
PostgreSQL → Datastream → BigQuery
- Source: Application database (PostgreSQL on Cloud SQL)
- Integration: Datastream for real-time CDC (Change Data Capture)
- Data: User profiles, health data, test results, all application data including the
recordtable with full metadata - Frequency: Real-time replication
- Dataset:
eli_health_biometricspublic - Note: ALL tables including
recordtable with sensitive metadata are replicated to BigQuery
-
Firebase Analytics → BigQuery
- Source: Mobile app usage analytics
- Integration: Native Firebase to BigQuery export
- Data: Mobile app events, user behavior, in-app conversions
- Frequency: Daily automatic export
- Dataset: Firebase analytics datasets
-
Firebase Auth → Cloud Function → BigQuery
- Source: User authentication system
- Integration: Custom Cloud Function
- Data: User profiles, authentication methods, sign-up timestamps
- Frequency: Hourly synchronization
- Dataset:
firebase_users
Production Environment Only
-
Shopify → Airbyte → BigQuery (Production Only)
- Source: E-commerce platform
- Integration: Airbyte (open-source ELT platform)
- Data: Orders, customers, products, transactions, inventory
- Frequency: Configurable (typically every 6-24 hours)
- Dataset:
eli_health_shopifyin northamerica-northeast1 - Note: E-commerce data is only synced in production for security and cost optimization
-
Google Analytics → Cloud Function → BigQuery (Production Only)
- Source: Marketing website (www.eli.health)
- Integration: Native GA4 export to US region, then Cloud Function syncs to Montreal
- Data: Website traffic, marketing campaigns, user acquisition, conversions, purchase attribution
- Frequency: Daily export + daily cross-region sync (6 AM UTC)
- Source Dataset:
analytics_361776673(US multi-region, native GA4 export) - Target Dataset:
analytics_ga4(northamerica-northeast1, synced via Parquet) - Sync Function:
ga4-cross-region-sync(Terraform module:tf/modules/global/ga4-sync/) - Key Link:
transaction_idin GA4 purchase events matches Shopifyorder.id(94% match rate) - Documentation: GA4-Shopify Data Linking
- Note: Cross-region sync enables direct BigQuery JOINs with Shopify, Klaviyo, and LoopWork data
-
LoopWork → eli-kpi → BigQuery (Production Only)
- Source: Loop Subscriptions (subscription management on Shopify)
- Integration: Custom sync via eli-kpi Cloud Run service
- Data: Subscriptions, subscription customers, subscription orders, products
- Frequency: Daily synchronization (4:00 AM UTC)
- Dataset:
eli_health_loopworkin northamerica-northeast1 - Documentation: LoopWork Data Integration
-
Klaviyo → Airbyte → BigQuery (Production Only)
- Source: Email marketing platform
- Integration: Airbyte (managed connector)
- Data: Email profiles, campaigns, events, flows, lists, metrics (2.85M events, 36.8K profiles)
- Frequency: Daily incremental synchronization
- Dataset:
eli_health_klaviyoin northamerica-northeast1 - Schema: Airbyte standard (
attributes,relationships,_airbyte_extracted_at) - Documentation: User Journey Data Architecture
Schema Change (December 2025)Klaviyo sync moved from eli-kpi to Airbyte. The schema changed from
datacolumn toattributes/relationshipscolumns. -
Videnglobe Ads → eli-kpi → BigQuery (Production Only)
- Source: Videnglobe's BigQuery (marketing agency managing Facebook/Google Ads)
- Integration: Custom cross-region sync via eli-kpi Cloud Run service
- Data: Facebook Ads insights (2 ad accounts), Google Ads campaigns/ad groups (1 account)
- Frequency: Daily synchronization (5:00 AM Toronto time)
- Dataset:
eli_health_adsin northamerica-northeast1 - Tables: 9 tables (~66K rows total)
- Documentation: Videnglobe Ads Integration
Cross-Region SyncVidenglobe's BigQuery is in the US region, while our data warehouse is in Montreal. The eli-kpi service performs in-memory cross-region transfer: queries execute in US, data transfers through the application, then streams to Montreal.
Central Data Warehouse
- BigQuery: Serves as the central data warehouse for all analytics and reporting
- Location: All datasets co-located in
northamerica-northeast1for optimal query performance - Cross-source Analytics: Enables powerful insights by joining data across sources:
- Match Shopify customers with Firebase users
- Track user journey from marketing website → app signup → purchase
- Analyze conversion rates across different acquisition channels
- Link Klaviyo email engagement → Shopify purchase → LoopWork subscription → App biometric tests
- Correlate subscription status with hormone tracking engagement
- Environment Segregation: Dev and Stage environments have limited data sources to reduce costs and complexity
- Dev & Stage: Include core data sources (PostgreSQL, Firebase Analytics, Firebase Auth)
- Production Only: Additionally includes Shopify (e-commerce), LoopWork (subscriptions), Klaviyo (email marketing), and Google Analytics (marketing website)
- All environments share the same infrastructure components but differ in available data sources
Data Security & Access Control
Security Model: Permissions over Exclusion
The Eli Health data pipeline follows a permission-based security model rather than an exclusion-based approach:
- All data is replicated: Every table from PostgreSQL, including the
recordtable with sensitive metadata, is replicated to BigQuery - Access controlled by permissions: Security is enforced through BigQuery IAM permissions, not by excluding data from replication
- Granular access control: Different user groups have access to different datasets and tables based on their roles
Access Patterns
Direct BigQuery Access
- Limited to authorized personnel: Only users with specific BigQuery IAM permissions can directly query sensitive tables
- Record table access: The
recordtable containing sensitive health metadata requires elevated permissions - Audit trail: All BigQuery queries are logged and auditable through Cloud Logging
KPI Service Access (Recommended)
- Primary access method: Most users should access analytics data through the KPI service
- Data abstraction: The KPI service provides curated, aggregated views of the data
- Built-in privacy controls: The service implements business logic to filter and aggregate sensitive data appropriately
- No direct table access: Users don't need BigQuery permissions to use the KPI service
Metadata Storage Challenges
The system currently has a confusing dual pattern for storing metadata:
Problem: Data like startDate, endDate, and value can exist in two places:
- As direct columns in the
recordtable - Inside the JSON
metadatafield
Impact: This leads to defensive programming with fallback logic throughout the codebase:
// Common pattern found in the codebase
const startDate = record?.metadata?.startDate || record?.startDate;
const endDate = record?.metadata?.endDate || record?.endDate;
const value = record?.metadata?.value || record?.value;
Root Cause: Historical evolution where some data was migrated from columns to JSON metadata, but both patterns remain supported for backward compatibility.
Related Documentation
- BigQuery Access Management - Learn how to grant user access to BigQuery datasets
- Infrastructure Overview - Complete infrastructure architecture documentation
- Airbyte Data Integration - Shopify to BigQuery synchronization
- LoopWork Data Integration - Subscription data synchronization
- Videnglobe Ads Integration - Facebook/Google Ads data synchronization
- GA4-Shopify Data Linking - Cross-region sync and marketing attribution
- User Journey Data Architecture - End-to-end data flow including Klaviyo and GA4
- KPI Analytics Service - Internal analytics dashboard and API