Monitoring Guide¶
Comprehensive monitoring setup for GreenGovRAG in production
Overview¶
Effective monitoring is critical for production RAG systems. This guide covers:
- Health checks and uptime monitoring
- Logging configuration
- Performance metrics
- Alerting strategies
- Debugging tools
Health Checks¶
API Health Endpoint¶
Endpoint: GET /api/health
Response (healthy):
{
  "status": "healthy",
  "timestamp": "2025-11-15T12:34:56Z",
  "components": {
    "database": "healthy",
    "vector_store": "healthy",
    "llm_provider": "healthy",
    "cache": "healthy"
  },
  "version": "1.0.0",
  "uptime_seconds": 86400
}
Response (unhealthy):
{
  "status": "unhealthy",
  "timestamp": "2025-11-15T12:34:56Z",
  "components": {
    "database": "healthy",
    "vector_store": "unhealthy",
    "llm_provider": "healthy",
    "cache": "healthy"
  },
  "errors": ["Qdrant connection timeout"]
}
Health Check Implementation¶
File: backend/green_gov_rag/api/routes/health.py
from fastapi import APIRouter, status
from fastapi.responses import JSONResponse
from green_gov_rag.api.services.health_service import HealthService

router = APIRouter()

@router.get("/health", status_code=status.HTTP_200_OK)
async def health_check():
    """Comprehensive health check."""
    health_service = HealthService()
    result = await health_service.check_all()
    if result["status"] == "unhealthy":
        return JSONResponse(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            content=result,
        )
    return result
Uptime Monitoring¶
External Monitoring Services:
- Pingdom: https://www.pingdom.com (paid)
- UptimeRobot: https://uptimerobot.com (free tier available)
- AWS CloudWatch Synthetics: Canary monitoring
- Azure Application Insights Availability Tests: URL ping tests
Setup (UptimeRobot example):
- Create monitor: HTTP(S)
- URL: https://your-api.com/api/health
- Interval: 5 minutes
- Alert contacts: email, Slack
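If you also want a quick self-hosted check alongside these services, a minimal poller against the health endpoint can run from cron; this is only a sketch, the URL is a placeholder, and the alert hook is left as a stub:
import json
import urllib.request

HEALTH_URL = "https://your-api.com/api/health"  # replace with your deployment URL

def is_healthy(timeout: float = 10.0) -> bool:
    """Return True if /api/health responds 200 with status == "healthy"."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout) as resp:
            body = json.load(resp)
            return resp.status == 200 and body.get("status") == "healthy"
    except Exception:
        return False

if __name__ == "__main__":
    if not is_healthy():
        # Wire this up to your alert channel (email, Slack webhook, etc.)
        print("ALERT: GreenGovRAG health check failed")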
Logging¶
Log Levels¶
Development: DEBUG (verbose)
Production: INFO (default) or WARNING (minimal)
Structured Logging¶
Format: JSON for machine parsing
{
  "timestamp": "2025-11-15T12:34:56.789Z",
  "level": "INFO",
  "logger": "green_gov_rag.api.routes.query",
  "message": "RAG query processed",
  "query": "What are NGER requirements?",
  "response_time_ms": 1234.56,
  "sources_count": 5,
  "trust_score": 0.85,
  "user_id": "anonymous",
  "request_id": "550e8400-e29b-41d4-a716-446655440000"
}
AWS CloudWatch Logs¶
View logs:
# Tail logs
aws logs tail /ecs/greengovrag-backend --follow
# Logs from the last hour
aws logs tail /ecs/greengovrag-backend --since 1h
# Filter for errors
aws logs filter-log-events --log-group-name /ecs/greengovrag-backend --filter-pattern "ERROR"
# Search for a specific log message
aws logs filter-log-events --log-group-name /ecs/greengovrag-backend --filter-pattern '"RAG query processed"'
Query logs with Insights:
# Top 10 slowest queries
fields @timestamp, query, response_time_ms
| filter level = "INFO" and message = "RAG query processed"
| sort response_time_ms desc
| limit 10
# Error rate by hour
fields @timestamp, level
| filter level = "ERROR"
| stats count() by bin(1h)
# Average trust score
fields trust_score
| filter level = "INFO" and message = "RAG query processed"
| stats avg(trust_score) as avg_trust_score
Azure Log Analytics¶
Query logs:
// Last 100 errors
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "greengovrag-backend"
| where Level_s == "ERROR"
| order by TimeGenerated desc
| take 100
// Slow queries (> 2s)
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "greengovrag-backend"
| where Message_s contains "RAG query processed"
| extend response_time = todouble(extract("response_time_ms\":(\\d+\\.\\d+)", 1, Message_s))
| where response_time > 2000
| order by response_time desc
// Error rate over time
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "greengovrag-backend"
| where Level_s == "ERROR"
| summarize count() by bin(TimeGenerated, 1h)
| render timechart
Local Docker Logs¶
# All services
docker-compose logs -f
# Backend only
docker-compose logs -f backend
# Last 100 lines
docker-compose logs --tail=100 backend
# Filter for errors
docker-compose logs backend | grep ERROR
# Follow with timestamp
docker-compose logs -f -t backend
Metrics¶
Key Performance Indicators (KPIs)¶
| Metric | Target | Critical Threshold |
|---|---|---|
| API Response Time (p95) | < 2s | > 5s |
| Vector Search Latency | < 100ms | > 500ms |
| Database Query Time | < 50ms | > 200ms |
| Cache Hit Rate | > 50% | < 20% |
| Error Rate | < 0.1% | > 1% |
| Uptime | > 99.9% | < 99% |
| Trust Score (avg) | > 0.7 | < 0.5 |
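The latency and trust-score KPIs above assume each request is instrumented. One way to capture per-request latency plus a request ID is an HTTP middleware; the sketch below is not the project's actual middleware, and the logger and field names are assumptions that mirror the structured-log example:
import logging
import time
import uuid

from fastapi import FastAPI, Request

app = FastAPI()
logger = logging.getLogger("green_gov_rag.api.middleware")

@app.middleware("http")
async def record_request_metrics(request: Request, call_next):
    request_id = str(uuid.uuid4())
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Log the raw measurement; p50/p95/p99 are computed by the metrics backend.
    logger.info(
        "request completed",
        extra={
            "request_id": request_id,
            "path": request.url.path,
            "response_time_ms": round(elapsed_ms, 2),
        },
    )
    response.headers["X-Request-ID"] = request_id
    return response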
AWS CloudWatch Metrics¶
View metrics:
# ECS CPU utilization
aws cloudwatch get-metric-statistics \
--namespace AWS/ECS \
--metric-name CPUUtilization \
--dimensions Name=ServiceName,Value=greengovrag-service \
--start-time 2025-11-15T00:00:00Z \
--end-time 2025-11-15T23:59:59Z \
--period 3600 \
--statistics Average
# API Gateway 5XX errors
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name 5XXError \
--start-time 2025-11-15T00:00:00Z \
--end-time 2025-11-15T23:59:59Z \
--period 300 \
--statistics Sum
Custom Metrics:
from datetime import datetime

import boto3

cloudwatch = boto3.client('cloudwatch')

# Publish custom metric
cloudwatch.put_metric_data(
    Namespace='GreenGovRAG',
    MetricData=[
        {
            'MetricName': 'QueryTrustScore',
            'Value': 0.85,
            'Unit': 'None',
            'Timestamp': datetime.utcnow()
        }
    ]
)
Azure Monitor Metrics¶
View in Azure Portal: Monitor → Metrics → Select resource
Query with CLI:
# CPU percentage
az monitor metrics list \
--resource greengovrag-backend \
--resource-group greengovrag-rg \
--resource-type Microsoft.App/containerApps \
--metric "CpuPercentage" \
--start-time 2025-11-15T00:00:00Z \
--end-time 2025-11-15T23:59:59Z \
--interval PT1H
# HTTP request count
az monitor metrics list \
--resource greengovrag-backend \
--resource-group greengovrag-rg \
--resource-type Microsoft.App/containerApps \
--metric "Requests" \
--aggregation Total
Custom Metrics (Application Insights):
import os

from applicationinsights import TelemetryClient

tc = TelemetryClient(instrumentation_key=os.getenv('APPINSIGHTS_KEY'))
# Track custom metric
tc.track_metric('QueryTrustScore', 0.85)
tc.flush()
Prometheus + Grafana (Advanced)¶
Install Prometheus exporter:
# backend/requirements.txt
prometheus-fastapi-instrumentator==6.1.0
# backend/green_gov_rag/api/main.py
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator
app = FastAPI()
Instrumentator().instrument(app).expose(app, endpoint="/metrics")
Scrape metrics:
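A minimal prometheus.yml scrape block, assuming the backend exposes /metrics on port 8000 (the job name and target are placeholders to adjust for your deployment):
scrape_configs:
  - job_name: "greengovrag-backend"
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8000"]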
Example metrics:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="POST",path="/api/query",status="200"} 1234
# HELP http_request_duration_seconds HTTP request duration
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.5"} 890
http_request_duration_seconds_bucket{le="1.0"} 1200
http_request_duration_seconds_bucket{le="2.0"} 1230
Alerting¶
Alert Channels¶
- Email: Direct email notifications
- Slack: Webhook integration
- PagerDuty: On-call escalation
- SMS: Critical alerts only
AWS CloudWatch Alarms¶
High CPU Alert:
aws cloudwatch put-metric-alarm \
--alarm-name greengovrag-high-cpu \
--alarm-description "Alert when CPU > 80% for 5 minutes" \
--namespace AWS/ECS \
--metric-name CPUUtilization \
--dimensions Name=ServiceName,Value=greengovrag-service \
--statistic Average \
--period 300 \
--evaluation-periods 1 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789012:greengovrag-alerts
High Error Rate Alert:
aws cloudwatch put-metric-alarm \
--alarm-name greengovrag-high-errors \
--alarm-description "Alert when 5XX errors > 10/minute" \
--namespace AWS/ApiGateway \
--metric-name 5XXError \
--statistic Sum \
--period 60 \
--evaluation-periods 1 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789012:greengovrag-alerts
Azure Monitor Alerts¶
High Memory Alert:
az monitor metrics alert create \
--name HighMemoryAlert \
--resource-group greengovrag-rg \
--scopes /subscriptions/.../resourceGroups/greengovrag-rg/providers/Microsoft.App/containerApps/greengovrag-backend \
--condition "avg MemoryPercentage > 90" \
--window-size 5m \
--evaluation-frequency 1m \
--action greengovrag-slack
Failed Health Check Alert:
az monitor metrics alert create \
--name FailedHealthCheck \
--resource-group greengovrag-rg \
--scopes /subscriptions/.../resourceGroups/greengovrag-rg/providers/Microsoft.App/containerApps/greengovrag-backend \
--condition "avg Replicas < 1" \
--window-size 5m \
--evaluation-frequency 1m \
--action greengovrag-slack
The --action flag expects an Azure Monitor action group (for example, the greengovrag-slack group created under Slack Integration below), not a bare email address; to notify contact@sundeep.id.au, add an email receiver to that action group.
Slack Integration¶
AWS SNS → Slack:
- Create SNS topic: greengovrag-alerts
- Relay topic messages to the Slack webhook (https://hooks.slack.com/services/YOUR/WEBHOOK/URL) via a small Lambda or AWS Chatbot, as sketched below; SNS does not post Slack's expected payload format directly
- Configure CloudWatch alarms to publish to the SNS topic
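A minimal relay Lambda sketch for the step above, assuming the webhook URL is supplied via a SLACK_WEBHOOK_URL environment variable (the function name and message text are placeholders):
import json
import os
import urllib.request

def lambda_handler(event, context):
    """Forward CloudWatch alarm notifications from SNS to a Slack incoming webhook."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    for record in event["Records"]:
        message = record["Sns"]["Message"]
        payload = json.dumps({"text": f"GreenGovRAG alert: {message}"}).encode("utf-8")
        req = urllib.request.Request(
            webhook_url,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
    return {"statusCode": 200}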
Azure Action Group:
az monitor action-group create \
--name greengovrag-slack \
--resource-group greengovrag-rg \
--action webhook greengovrag-webhook https://hooks.slack.com/services/YOUR/WEBHOOK/URL
PagerDuty Integration¶
- Create PagerDuty service
- Get integration key
- Configure SNS subscription (AWS, see the example below) or Action Group (Azure)
- Set escalation policy
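On AWS, step 3 is a standard SNS subscription pointed at the PagerDuty integration URL; the integration key placeholder below comes from the PagerDuty service you created:
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:greengovrag-alerts \
  --protocol https \
  --notification-endpoint "https://events.pagerduty.com/integration/<integration-key>/enqueue"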
Tracing¶
AWS X-Ray¶
Enable X-Ray:
# backend/requirements.txt
aws-xray-sdk==2.12.0
# backend/green_gov_rag/api/main.py
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.ext.fastapi.middleware import XRayMiddleware
app = FastAPI()
app.add_middleware(XRayMiddleware, recorder=xray_recorder)
View traces:
# AWS Console: X-Ray → Traces
# Or CLI
aws xray get-trace-summaries \
--start-time 2025-11-15T00:00:00Z \
--end-time 2025-11-15T23:59:59Z \
--filter-expression 'service("greengovrag-backend")'
Azure Application Insights¶
Enable tracing:
# backend/requirements.txt
opencensus-ext-azure==1.1.13
opencensus-ext-fastapi==0.7.1
# backend/green_gov_rag/api/main.py
import os

from fastapi import FastAPI
from opencensus.ext.azure.trace_exporter import AzureExporter
from opencensus.ext.fastapi import FastAPIMiddleware

app = FastAPI()
app.add_middleware(
    FastAPIMiddleware,
    exporter=AzureExporter(connection_string=os.getenv('APPINSIGHTS_CONNECTION_STRING'))
)
View traces: Azure Portal → Application Insights → Transaction search
Dashboards¶
CloudWatch Dashboard¶
Create dashboard:
aws cloudwatch put-dashboard \
--dashboard-name greengovrag-dashboard \
--dashboard-body file://dashboard.json
dashboard.json:
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "API Response Time (p95)",
        "metrics": [
          ["GreenGovRAG", "QueryResponseTime", {"stat": "p95"}]
        ],
        "period": 300,
        "region": "us-east-1"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Error Rate",
        "metrics": [
          ["AWS/ApiGateway", "5XXError", {"stat": "Sum"}]
        ],
        "period": 60,
        "region": "us-east-1"
      }
    }
  ]
}
Azure Monitor Dashboard¶
Create in Portal: Azure Portal → Dashboards → New dashboard
Pin metrics:
- Navigate to Container App
- Click "Metrics"
- Select metric (CPU, Memory, Requests)
- Click "Pin to dashboard"
Grafana Dashboard¶
Install Grafana:
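One common option is the official Docker image (any standard install method works; port 3000 is Grafana's default):
docker run -d --name grafana -p 3000:3000 grafana/grafana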
Add Prometheus data source: Configuration → Data sources → Add Prometheus
Create dashboard:
- New dashboard
- Add panel
- Query: rate(http_requests_total[5m])
- Visualization: Time series
Import community dashboard: Dashboard ID 1860 (Node Exporter)
Debugging¶
Live Debugging¶
AWS ECS Exec (SSH into container):
aws ecs execute-command \
--cluster greengovrag-cluster \
--task <task-id> \
--container greengovrag-backend \
--interactive \
--command "/bin/bash"
Azure Container Apps Exec:
az containerapp exec \
--resource-group greengovrag-rg \
--name greengovrag-backend \
--command "/bin/bash"
Database Queries¶
Check document count:
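A sketch, assuming ingested documents are stored in a documents table (adjust the table name to your schema):
SELECT COUNT(*) AS document_count FROM documents;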
Check recent queries:
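Assuming the same query_logs table used by the cache-hit query below (the column names mirror the structured-log fields and may differ in your schema):
SELECT query, response_time_ms, trust_score, created_at
FROM query_logs
ORDER BY created_at DESC
LIMIT 10;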
Check cache hit rate:
SELECT
SUM(CASE WHEN cache_hit THEN 1 ELSE 0 END)::FLOAT / COUNT(*) AS cache_hit_rate
FROM query_logs
WHERE created_at > NOW() - INTERVAL '24 hours';
Vector Store Debugging¶
Qdrant API:
# Collection info
curl http://qdrant-instance:6333/collections/greengovrag
# Count documents
curl -X POST http://qdrant-instance:6333/collections/greengovrag/points/count \
-H 'Content-Type: application/json' \
-d '{"exact": true}'
# Search test
curl -X POST http://qdrant-instance:6333/collections/greengovrag/points/search \
-H 'Content-Type: application/json' \
-d '{
"vector": [0.1, 0.2, ...],
"limit": 5
}'
Performance Profiling¶
Python Profiling¶
cProfile:
python -m cProfile -o profile.stats -m uvicorn green_gov_rag.api.main:app
# Analyze
python -m pstats profile.stats
> sort cumtime
> stats 20
py-spy (production-safe):
pip install py-spy
# Profile running process
py-spy top --pid <backend-pid>
# Flame graph
py-spy record -o profile.svg --pid <backend-pid>
API Load Testing¶
Apache Bench:
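A quick smoke test with ab, assuming the request body is saved to query.json (the file name, request count, and concurrency are arbitrary):
# query.json contains: {"query": "What are NGER requirements?", "max_sources": 5}
ab -n 100 -c 10 -p query.json -T application/json http://localhost:8000/api/query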
Locust (more advanced):
# locustfile.py
from locust import HttpUser, task
class GreenGovRAGUser(HttpUser):
    @task
    def query(self):
        self.client.post("/api/query", json={
            "query": "What are NGER requirements?",
            "max_sources": 5
        })
# Run
locust -f locustfile.py --host http://localhost:8000
Cost Monitoring¶
AWS Cost Explorer¶
View costs:
aws ce get-cost-and-usage \
--time-period Start=2025-11-01,End=2025-11-30 \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
Cost alert:
aws budgets create-budget \
--account-id 123456789012 \
--budget file://budget.json \
--notifications-with-subscribers file://notifications.json
Azure Cost Management¶
View costs: Azure Portal → Cost Management → Cost analysis
Create budget:
az consumption budget create \
--budget-name greengovrag-budget \
--amount 100 \
--time-grain Monthly \
--start-date 2025-11-01 \
--end-date 2026-11-01 \
--resource-group greengovrag-rg
Best Practices¶
Logging¶
- Use structured logging (JSON)
- Include request IDs for tracing
- Don't log sensitive data (API keys, PII)
- Use appropriate log levels
- Aggregate logs centrally
Metrics¶
- Track business metrics (trust score, query success rate)
- Monitor the four golden signals: latency, traffic, errors, saturation
- Use percentiles (p50, p95, p99) not averages
- Set appropriate thresholds
Alerting¶
- Alert on symptoms, not causes
- Avoid alert fatigue (tune thresholds)
- Use runbooks for common alerts
- Escalate critical alerts (PagerDuty)
- Review and update alerts regularly
Dashboards¶
- Keep dashboards simple and focused
- Use consistent time ranges
- Group related metrics
- Share dashboards with team
- Update dashboards as system evolves
Last Updated: 2025-11-22