System Architecture Overview¶

This document describes GreenGovRAG's system architecture: its components, the data flow between them, the technology stack, the design patterns in use, and the deployment layouts.

Table of Contents¶

- Architecture Diagram
- System Components
- Data Flow
- Technology Stack
- Design Patterns
- Deployment Architecture
- Cross-References

Architecture Diagram¶

graph TB
    subgraph "Client Layer"
        WEB[Web Frontend<br/>React + TypeScript]
        CLI[CLI Tool<br/>greengovrag-cli]
        API_CLIENT[API Clients<br/>External Systems]
    end

    subgraph "API Layer"
        FASTAPI[FastAPI Application<br/>Python 3.12]
        PUBLIC_API[Public API Routes<br/>/api/*]
        ADMIN_API[Admin API Routes<br/>/api/admin/*]
        MIDDLEWARE[Middleware<br/>CORS, Rate Limiting, Auth]
    end

    subgraph "Business Logic Layer"
        RAG_CHAIN[RAG Chain<br/>Query Processing]
        ANALYTICS[Analytics Service<br/>Usage Tracking]
        CACHE[Cache Service<br/>Query Results]
        TRUST[Trust Score Service<br/>Citation Verification]
    end

    subgraph "RAG Components"
        HYBRID_SEARCH[Hybrid Geospatial Search<br/>Vector + BM25 + Spatial]
        EMBEDDINGS[Embedding Service<br/>HuggingFace/OpenAI]
        LLM_FACTORY[LLM Factory<br/>Multi-Provider Support]
        RESPONSE_GEN[Response Generator<br/>Citations + Deep Links]
    end

    subgraph "Data Storage"
        VECTOR_STORE[Vector Store<br/>FAISS/Qdrant]
        POSTGRES[(PostgreSQL<br/>pgvector)]
        CACHE_DB[(DynamoDB/Redis<br/>Query Cache)]
        CLOUD_STORAGE[Cloud Storage<br/>S3/Azure Blob]
    end

    subgraph "ETL Pipeline"
        SOURCES[Document Sources<br/>Plugin System]
        INGEST[Document Ingestion<br/>Download + Validate]
        PARSER[Document Parsing<br/>Unstructured.io]
        CHUNKER[Text Chunking<br/>Hierarchical]
        TAGGER[Metadata Tagging<br/>LLM-based]
        DB_WRITER[Database Writer<br/>Batch Insertion]
    end

    subgraph "External Services"
        OPENAI[OpenAI API<br/>GPT-4, GPT-3.5]
        AZURE_OPENAI[Azure OpenAI<br/>GPT-4, GPT-3.5]
        BEDROCK[AWS Bedrock<br/>Claude, Titan]
        ANTHROPIC[Anthropic API<br/>Claude]
    end

    WEB --> FASTAPI
    CLI --> FASTAPI
    API_CLIENT --> FASTAPI

    FASTAPI --> PUBLIC_API
    FASTAPI --> ADMIN_API
    FASTAPI --> MIDDLEWARE

    PUBLIC_API --> RAG_CHAIN
    PUBLIC_API --> ANALYTICS
    ADMIN_API --> ANALYTICS

    RAG_CHAIN --> HYBRID_SEARCH
    RAG_CHAIN --> RESPONSE_GEN
    HYBRID_SEARCH --> EMBEDDINGS
    HYBRID_SEARCH --> VECTOR_STORE

    RESPONSE_GEN --> LLM_FACTORY
    LLM_FACTORY --> OPENAI
    LLM_FACTORY --> AZURE_OPENAI
    LLM_FACTORY --> BEDROCK
    LLM_FACTORY --> ANTHROPIC

    RAG_CHAIN --> CACHE
    CACHE --> CACHE_DB

    RAG_CHAIN --> POSTGRES
    ANALYTICS --> POSTGRES

    SOURCES --> INGEST
    INGEST --> PARSER
    PARSER --> CHUNKER
    CHUNKER --> TAGGER
    TAGGER --> DB_WRITER
    DB_WRITER --> POSTGRES
    DB_WRITER --> VECTOR_STORE

    INGEST --> CLOUD_STORAGE
    PARSER --> CLOUD_STORAGE

    style FASTAPI fill:#4CAF50
    style RAG_CHAIN fill:#2196F3
    style VECTOR_STORE fill:#FF9800
    style POSTGRES fill:#9C27B0
    style LLM_FACTORY fill:#F44336

System Components¶

1. API Layer¶

FastAPI Application¶

Location: /backend/green_gov_rag/api/main.py
Purpose: REST API server for query processing and document management
Key Features:
- OpenAPI/Swagger documentation at /docs
- CORS middleware for cross-origin requests
- Rate limiting (30 requests/minute default)
- Request/response validation with Pydantic
- Health check endpoints

Public API Routes¶

Location: /backend/green_gov_rag/api/routes/
Endpoints:
- POST /api/query - RAG query with optional location filtering
- GET /api/documents - List available documents
- GET /api/analytics - Usage statistics
- GET /api/lga-boundaries - GeoJSON data for LGAs
- GET /api/health - Health check

Admin API Routes¶

Location: /backend/green_gov_rag/api/admin/
Endpoints:
- Document CRUD operations
- Analytics dashboard data
- System health monitoring
- Reprocessing triggers
- Cache management

2. RAG Components¶

RAG Chain¶

Location: /backend/green_gov_rag/rag/rag_chain.py
Responsibilities:
- Orchestrates end-to-end RAG pipeline
- Document retrieval coordination
- Context building from retrieved documents
- LLM prompt construction and invocation
- Response formatting with citations
Key Methods:
- retrieve_documents() - Vector similarity search with filters
- generate_answer() - LLM-based answer generation
- query_with_sources() - Complete RAG query with source attribution
- query_with_enhanced_citations() - Advanced citation formatting

Hybrid Geospatial Search¶

Location: /backend/green_gov_rag/rag/hybrid_search.py
Features:
- Vector similarity search (semantic)
- BM25 lexical search
- Spatial filtering by LGA, state, coordinates
- Hierarchical jurisdiction filtering (federal → state → local)
- Metadata filtering (ESG, category, topic)
- Query expansion (acronym resolution)
- Automatic location extraction via NER
Search Strategies:
- search() - Combined hybrid search
- search_with_lga() - LGA-specific search
- search_with_esg_filters() - ESG metadata filtering
- search_with_auto_location() - Automatic NER-based location detection
- advanced_search() - Multi-filter combination

Embedding Service¶

Location: /backend/green_gov_rag/rag/embeddings.py
Supported Providers:
- HuggingFace Transformers (default: sentence-transformers/all-MiniLM-L6-v2)
- OpenAI Embeddings (text-embedding-3-small, text-embedding-3-large)
- AWS Bedrock Embeddings
Capabilities:
- Batch embedding generation (default: 100 chunks/batch)
- Progress tracking for large datasets
- Empty chunk filtering
- Dimension: 384 (MiniLM), 1536 (text-embedding-3-small), or 3072 (text-embedding-3-large)
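
Batching and empty-chunk filtering can be sketched independently of the provider. In the example below, `embed_fn` stands in for whichever embedding backend is configured; the function names are hypothetical and the real embeddings.py may be organised differently.

```python
from collections.abc import Callable, Iterator, Sequence

# Any callable that maps a list of texts to one vector per text.
EmbedFn = Callable[[list[str]], list[list[float]]]


def batched(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_chunks(
    chunks: Sequence[str], embed_fn: EmbedFn, batch_size: int = 100
) -> list[tuple[str, list[float]]]:
    """Embed non-empty chunks in batches and pair each chunk with its vector."""
    non_empty = [c for c in chunks if c.strip()]  # skip empty or whitespace-only chunks
    results: list[tuple[str, list[float]]] = []
    for batch in batched(non_empty, batch_size):
        vectors = embed_fn(batch)
        if len(vectors) != len(batch):
            raise ValueError("embedding backend returned a mismatched number of vectors")
        results.extend(zip(batch, vectors))
    return results
```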

LLM Factory¶

Location: /backend/green_gov_rag/rag/llm_factory.py
Supported Providers:
- OpenAI: GPT-4, GPT-4-turbo, GPT-3.5-turbo, GPT-4o
- Azure OpenAI: Same models via Azure deployment
- AWS Bedrock: Claude, Titan, Llama models
- Anthropic: Claude 3 Opus, Sonnet, Haiku
Configuration Parameters:
- temperature - Sampling randomness (0.0-1.0)
- max_tokens - Maximum response length
- model - Specific model name/ID
- Provider-specific credentials via environment variables

Response Generator¶

Location: /backend/green_gov_rag/rag/enhanced_response.py
Features:
- Inline citation markers [1], [2], etc.
- Deep links to PDF pages (#page=N)
- Section path display (e.g., "Section 2.1.3: Thresholds")
- Hierarchical breadcrumbs (Document > Section > Subsection)
- Confidence scoring for citations
- Markdown and JSON output formats

3. Vector Storage¶

Vector Store Interface¶

Location: /backend/green_gov_rag/rag/vector_store_interface.py
Abstraction: Unified interface for multiple vector stores
Implementations:
- FAISS (local development): Fast, in-memory index
- Qdrant (production): Distributed vector database with filters
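
A unified interface of this kind might look like the following sketch. The `VectorStore` protocol and the brute-force `InMemoryStore` are assumptions for illustration; the method names in vector_store_interface.py may differ.

```python
import math
from typing import Protocol


class VectorStore(Protocol):
    """Operations every backend (FAISS, Qdrant, ...) would need to provide."""

    def add(self, ids: list[str], vectors: list[list[float]]) -> None: ...

    def search(self, query: list[float], k: int = 10) -> list[tuple[str, float]]: ...


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Brute-force cosine-similarity store, useful as a test double."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, ids: list[str], vectors: list[list[float]]) -> None:
        for doc_id, vector in zip(ids, vectors):
            self._vectors[doc_id] = vector

    def search(self, query: list[float], k: int = 10) -> list[tuple[str, float]]:
        scored = [(doc_id, _cosine(query, v)) for doc_id, v in self._vectors.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Because `VectorStore` is a structural protocol, callers can depend on it without the concrete stores inheriting from a shared base class.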

FAISS Store¶

Location: /backend/green_gov_rag/rag/stores/faiss_store.py
Pros:
- No external dependencies
- Fast for small-to-medium datasets (<1M vectors)
- Simple setup
Cons:
- In-memory only (limited scalability)
- No distributed support
- Limited metadata filtering

Qdrant Store¶

Location: /backend/green_gov_rag/rag/stores/qdrant_store.py
Pros:
- Distributed architecture (scalable to billions of vectors)
- Advanced metadata filtering
- Persistent storage
- HNSW index for sub-linear search
Cons:
- Requires external service
- More complex deployment

4. ETL Pipeline¶

Document Sources (Plugin System)¶

Location: /backend/green_gov_rag/etl/sources/
Base Interface: BaseDocumentSource
Built-in Sources:
- Federal legislation (environment.gov.au)
- State legislation (SA, NSW)
- Local government (council portals)
- Emissions data (CER, NGER)
Custom Source Development:
- Extend BaseDocumentSource
- Implement fetch_documents(), validate_config(), get_metadata()
- Register in sources/factory.py
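
The plugin contract can be pictured as an abstract base class plus a registry. The sketch below reuses the `BaseDocumentSource` name and the three method names listed above, but the signatures, the `register_source` decorator, and the `static_list` source are assumptions, not the code in sources/factory.py.

```python
from abc import ABC, abstractmethod
from typing import Any

SOURCE_REGISTRY: dict[str, type["BaseDocumentSource"]] = {}


class BaseDocumentSource(ABC):
    """Hypothetical shape of the plugin interface."""

    def __init__(self, config: dict[str, Any]) -> None:
        self.config = config

    @abstractmethod
    def fetch_documents(self) -> list[dict[str, Any]]: ...

    @abstractmethod
    def validate_config(self) -> bool: ...

    @abstractmethod
    def get_metadata(self) -> dict[str, Any]: ...


def register_source(name: str):
    """Class decorator that adds a source type to the registry."""
    def decorator(cls: type[BaseDocumentSource]) -> type[BaseDocumentSource]:
        SOURCE_REGISTRY[name] = cls
        return cls
    return decorator


@register_source("static_list")
class StaticListSource(BaseDocumentSource):
    """Example plugin that serves a fixed list of URLs from its config."""

    def validate_config(self) -> bool:
        return isinstance(self.config.get("urls"), list)

    def fetch_documents(self) -> list[dict[str, Any]]:
        return [{"url": url} for url in self.config["urls"]]

    def get_metadata(self) -> dict[str, Any]:
        return {"source_type": "static_list", "count": len(self.config["urls"])}


def create_source(name: str, config: dict[str, Any]) -> BaseDocumentSource:
    """Look up a registered source type and validate its configuration."""
    source = SOURCE_REGISTRY[name](config)
    if not source.validate_config():
        raise ValueError(f"invalid configuration for source type {name!r}")
    return source
```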

Document Ingestion¶

Location: /backend/green_gov_rag/etl/ingest.py
Workflow:
1. Load configuration from YAML
2. Validate source configurations
3. Generate consistent document IDs (for delta indexing)
4. Download documents with retry logic
5. Detect file types from magic bytes
6. Save to local filesystem or cloud storage
7. Extract and store metadata
Error Handling:
- Retry with exponential backoff (3 attempts)
- Cloudflare bot protection detection
- Failed download logging to failed_downloads.txt
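
Three of the steps above lend themselves to a short sketch: deriving a stable document ID, detecting the file type from leading bytes, and retrying with exponential backoff. All function names, the ID length, and the signature table are assumptions for illustration; ingest.py may implement these differently.

```python
import hashlib
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")

# A few well-known file signatures (assumed subset).
MAGIC_BYTES = {b"%PDF-": "pdf", b"PK\x03\x04": "zip", b"\x89PNG\r\n\x1a\n": "png"}


def document_id(source_url: str, title: str) -> str:
    """Hash stable inputs so re-ingesting a document yields the same ID."""
    normalized = f"{source_url.strip()}|{title.strip()}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def detect_file_type(data: bytes) -> str:
    """Classify content by its leading bytes rather than its file extension."""
    for signature, file_type in MAGIC_BYTES.items():
        if data.startswith(signature):
            return file_type
    # HTML has no fixed signature, so fall back to a loose check.
    head = data[:512].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, waiting base_delay * 2**n between failed attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Checking the leading bytes also helps spot an HTML challenge page returned in place of a PDF, which is one way bot protection can show up.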

Document Parsing¶

Location: /backend/green_gov_rag/etl/parsers/
Parser Types:
- UnstructuredPDFParser: Advanced layout analysis with Unstructured.io
- LayoutPDFParser: Hierarchical section extraction
- HTMLParser: Web content parsing
Parsing Strategies:
- hi_res: Detailed layout analysis (slower, more accurate)
- fast: Quick parsing (faster, less accurate)
- auto: Automatic strategy selection
Extracted Metadata:
- Section hierarchy (section_hierarchy, section_title)
- Page numbers and ranges
- Clause references (e.g., "s.3.2.1", "cl.42")
- Table detection and association
- Element types (paragraph, table, list, header)
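
Clause references in the two formats shown above could be picked up with a regular expression. The pattern below is an assumption that covers those examples plus a few similar prefixes; it is not the parser's actual rule set.

```python
import re

# Matches references such as "s.3.2.1", "ss.4", "cl.42" or "cl. 42A".
CLAUSE_PATTERN = re.compile(
    r"\b(?:s|ss|cl|reg)\.\s?\d+(?:\.\d+)*[A-Za-z]?",
    re.IGNORECASE,
)


def extract_clause_refs(text: str) -> list[str]:
    """Return unique clause references in order of first appearance."""
    refs: list[str] = []
    for match in CLAUSE_PATTERN.finditer(text):
        ref = re.sub(r"\s", "", match.group(0))  # normalise "cl. 42" to "cl.42"
        if ref not in refs:
            refs.append(ref)
    return refs
```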

Text Chunking¶

Location: /backend/green_gov_rag/etl/chunker.py
Strategies:
- RecursiveCharacterTextSplitter: Splits on progressively finer separators (paragraph → line → word → character)
- TokenTextSplitter: Token-based splitting
- HierarchicalChunker: Preserves section metadata
Configuration:
- chunk_size: 500-1000 tokens (default: 1000)
- chunk_overlap: 100-200 tokens (default: 100)
- Custom separators: ["\n\n", "\n", " ", ""]
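
The interaction of chunk_size and chunk_overlap is easiest to see in a sliding-window splitter. The sketch below is a simplified stand-in that treats whitespace-separated words as tokens; it is not the project's chunker, and `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into windows of `chunk_size` words that share `chunk_overlap` words."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

With the defaults, each chunk repeats the final 100 words of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk.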

Metadata Tagging¶

Location: /backend/green_gov_rag/etl/metadata_tagger.py
LLM-based Auto-tagging:
- ESG metadata extraction (emission scopes, frameworks)
- Spatial scope detection (federal/state/local)
- Topic classification
- Regulatory framework identification
Tagged Fields:
- esg_metadata.emission_scopes (scope_1, scope_2, scope_3)
- esg_metadata.frameworks (NGER, ISSB, GHG_Protocol)
- spatial_metadata.spatial_scope (federal, state, local)
- spatial_metadata.lga_codes, spatial_metadata.state

Database Writer¶

Location: /backend/green_gov_rag/etl/db_writer.py
Batch Operations:
- Batch size: 100 chunks (configurable)
- Upsert support (update or insert)
- Transaction management
- Error handling with rollback
Writes To:
- PostgreSQL (document metadata, chunks)
- Vector store (embeddings)
- Cloud storage (processed files)
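
Batch upserts with per-batch rollback can be sketched with the standard library. The example uses SQLite so that it runs without a database server; the production target is PostgreSQL, and the table layout and `write_chunks` name are assumptions.

```python
import sqlite3
from collections.abc import Sequence

UPSERT_SQL = (
    "INSERT INTO chunks (id, text) VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET text = excluded.text"
)


def write_chunks(
    conn: sqlite3.Connection,
    rows: Sequence[tuple[str, str]],
    batch_size: int = 100,
) -> int:
    """Upsert (id, text) rows in batches; each batch is its own transaction."""
    written = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # The connection context manager commits on success and rolls back
        # the whole batch if any statement raises.
        with conn:
            conn.executemany(UPSERT_SQL, batch)
        written += len(batch)
    return written
```

One transaction per batch means a bad row discards only its own batch, while batches that were already committed stay in place.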

5. Data Models¶

Database Models¶

Location: /backend/green_gov_rag/models/
SQLModel Entities:
- Document: Document metadata (title, jurisdiction, category)
- Chunk: Text chunks with embeddings
- Query: User queries and responses
- Analytics: Usage statistics
- LGABoundary: Geospatial boundaries (PostGIS)

API Schemas¶

Location: /backend/green_gov_rag/api/schemas/
Pydantic Models:
- QueryRequest: Query input with filters
- QueryResponse: Answer with sources and citations
- DocumentResponse: Document metadata
- AnalyticsResponse: Usage statistics

6. Services¶

Analytics Service¶

Location: /backend/green_gov_rag/api/services/analytics_service.py
Tracks:
- Query frequency
- Response latency
- Trust score distribution
- LGA query distribution
- Top queries and documents

Cache Service¶

Location: /backend/green_gov_rag/api/services/cache_service.py
Backends:
- DynamoDB (AWS production)
- Redis (local development)
TTL: 1 hour default
Cache Keys: SHA256 hash of query + filters

Trust Score Service¶

Location: /backend/green_gov_rag/api/services/trust_score_service.py
Scoring Factors:
- Citation verification (source exists)
- Regulatory hierarchy (federal > state > local)
- Jurisdiction match (query LGA vs. source LGA)
- Recency (newer documents scored higher)
- Confidence score from LLM
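
The factors above could be combined as a weighted sum. The weights, the ten-year recency decay, and the field names in this sketch are illustrative assumptions only; the service's actual formula is not documented on this page.

```python
from dataclasses import dataclass

# Assumed weights; they sum to 1.0 so the score stays in [0, 1].
WEIGHTS = {
    "citation": 0.30,
    "hierarchy": 0.20,
    "jurisdiction": 0.20,
    "recency": 0.15,
    "llm_confidence": 0.15,
}
HIERARCHY_SCORE = {"federal": 1.0, "state": 0.75, "local": 0.5}


@dataclass
class SourceSignals:
    citation_verified: bool
    spatial_scope: str        # "federal", "state" or "local"
    jurisdiction_match: bool  # query LGA matches source LGA
    age_years: float
    llm_confidence: float     # expected in [0, 1]


def trust_score(signals: SourceSignals) -> float:
    """Combine the scoring factors into a single value between 0 and 1."""
    components = {
        "citation": 1.0 if signals.citation_verified else 0.0,
        "hierarchy": HIERARCHY_SCORE.get(signals.spatial_scope, 0.0),
        "jurisdiction": 1.0 if signals.jurisdiction_match else 0.0,
        "recency": max(0.0, 1.0 - signals.age_years / 10.0),  # linear decay over 10 years
        "llm_confidence": min(1.0, max(0.0, signals.llm_confidence)),
    }
    score = sum(WEIGHTS[name] * value for name, value in components.items())
    return round(score, 3)
```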

Data Flow¶

Query Processing Flow¶

sequenceDiagram
    participant Client
    participant API
    participant RAGChain
    participant HybridSearch
    participant VectorStore
    participant LLM
    participant Cache
    participant Database

    Client->>API: POST /api/query<br/>{query, lga_name}
    API->>Cache: Check cache
    alt Cache Hit
        Cache-->>API: Cached response
        API-->>Client: Return cached result
    else Cache Miss
        API->>RAGChain: query_with_enhanced_citations()
        RAGChain->>HybridSearch: search_with_auto_location()
        HybridSearch->>HybridSearch: Extract locations (NER)
        HybridSearch->>HybridSearch: Expand query (acronyms)
        HybridSearch->>VectorStore: similarity_search(query, k=10)
        VectorStore-->>HybridSearch: Top-K documents
        HybridSearch->>HybridSearch: Apply spatial filters
        HybridSearch->>HybridSearch: Apply metadata filters
        HybridSearch->>HybridSearch: Boost by jurisdiction
        HybridSearch-->>RAGChain: Filtered documents
        RAGChain->>RAGChain: build_context_from_documents()
        RAGChain->>LLM: generate_answer(query, context)
        LLM-->>RAGChain: Generated answer
        RAGChain->>RAGChain: format_enhanced_response()
        RAGChain-->>API: EnhancedResponse
        API->>Cache: Store result (TTL: 1h)
        API->>Database: Log query analytics
        API-->>Client: JSON response
    end

ETL Pipeline Flow¶

sequenceDiagram
    participant Scheduler as GitHub Actions<br/>or Airflow
    participant Pipeline as ETL Pipeline
    participant Sources as Document Sources
    participant Ingest as Ingest Module
    participant Parser as PDF Parser
    participant Chunker
    participant Tagger as Metadata Tagger
    participant Embedder as Embedding Service
    participant Storage as Cloud Storage
    participant DB as PostgreSQL
    participant Vector as Vector Store

    Scheduler->>Pipeline: Trigger ETL (daily 2AM UTC)
    Pipeline->>Sources: Load configs
    Sources->>Sources: Validate sources
    Sources-->>Pipeline: Document configs

    loop For each document
        Pipeline->>Ingest: download_document()
        Ingest->>Ingest: Generate document_id
        Ingest->>Ingest: Download with retry
        Ingest->>Storage: Upload raw file
        Ingest-->>Pipeline: File path

        Pipeline->>Parser: parse_with_structure()
        Parser->>Parser: Extract sections
        Parser->>Parser: Detect tables/lists
        Parser->>Parser: Extract clause refs
        Parser-->>Pipeline: Parsed chunks

        Pipeline->>Chunker: chunk_with_hierarchy()
        Chunker->>Chunker: Split into chunks
        Chunker->>Chunker: Preserve metadata
        Chunker-->>Pipeline: Text chunks

        Pipeline->>Tagger: tag_all()
        Tagger->>LLM: Extract ESG metadata
        LLM-->>Tagger: Tagged metadata
        Tagger-->>Pipeline: Enriched chunks

        Pipeline->>Embedder: embed_chunks()
        Embedder->>Embedder: Batch embed (100/batch)
        Embedder-->>Pipeline: Chunks with embeddings

        Pipeline->>DB: Write chunks (batch)
        Pipeline->>Vector: Build/update index
        Pipeline->>Storage: Save processed chunks
    end

    Pipeline-->>Scheduler: ETL complete

Technology Stack¶

Backend Stack¶

Component	Technology	Version	Purpose
Language	Python	3.12	Core language
Web Framework	FastAPI	0.104+	REST API
ORM	SQLModel	0.0.14+	Database models
Database	PostgreSQL	15+	Primary datastore
Vector Extension	pgvector	0.5+	Vector similarity search
Vector Store	FAISS / Qdrant	Latest	Embeddings index
LLM Framework	LangChain	0.1+	RAG orchestration
Embeddings	HuggingFace Transformers	4.35+	Text embeddings
PDF Parsing	Unstructured.io	0.10+	Document parsing
NER	spaCy	3.7+	Location extraction
Task Queue	Celery (optional)	5.3+	Background jobs
Validation	Pydantic	2.5+	Data validation
Testing	pytest	7.4+	Unit/integration tests
Linting	Ruff	0.1+	Code quality
Type Checking	MyPy	1.7+	Static typing

LLM Providers¶

Provider	Models	Use Case
OpenAI	GPT-4, GPT-4-turbo, GPT-3.5-turbo, GPT-4o	General purpose
Azure OpenAI	GPT-4, GPT-3.5-turbo	Enterprise deployment
AWS Bedrock	Claude 3, Titan, Llama 2	AWS-native
Anthropic	Claude 3 Opus, Sonnet, Haiku	Advanced reasoning

Recommended: Azure OpenAI with gpt-4o-mini (best cost/performance ratio)

Cloud Infrastructure¶

Service	AWS	Azure	Local
Compute	ECS Fargate	Container Apps	Docker Compose
Vector DB	EC2 Spot (Qdrant)	Container Instance	Docker
Database	RDS PostgreSQL	Azure Database for PostgreSQL	PostgreSQL
Storage	S3	Blob Storage	Filesystem
Cache	DynamoDB	Azure Cache for Redis	Redis
API Gateway	API Gateway HTTP	API Management	None
CDN	CloudFront	Front Door	None

Design Patterns¶

1. Factory Pattern¶

LLM Factory (llm_factory.py):

- Creates LLM instances based on provider configuration
- Abstracts provider-specific initialization
- Centralized credential management

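# Import path inferred from the file location /backend/green_gov_rag/rag/llm_factory.py
from green_gov_rag.rag.llm_factory import LLMFactory

# A low temperature keeps answers close to the retrieved context
llm = LLMFactory.create_llm(
    provider="azure",
    model="gpt-4",
    temperature=0.2,
)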

Vector Store Factory (vector_store_factory.py):

- Creates vector store instances (FAISS/Qdrant)
- Abstracts storage backend differences
- Consistent interface across stores

Document Source Factory (sources/factory.py):

- Creates document source plugins
- Auto-detects source type from config
- Plugin registration and discovery

2. Repository Pattern¶

Database Writer (db_writer.py):

- Abstracts database operations
- Batch insert/update operations
- Transaction management
- Error handling with rollback

Storage Adapter (storage_adapter.py):

- Abstracts cloud storage operations
- Supports S3, Azure Blob, local filesystem
- Consistent upload/download interface
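
A consistent upload/download interface might be expressed as a protocol with one implementation per backend. The sketch below shows only a local-filesystem variant; the `StorageAdapter` protocol, its method names, and `LocalStorageAdapter` are assumptions rather than the contents of storage_adapter.py.

```python
from pathlib import Path
from typing import Protocol


class StorageAdapter(Protocol):
    """Operations shared by S3, Azure Blob, and local filesystem backends."""

    def upload(self, key: str, data: bytes) -> str: ...

    def download(self, key: str) -> bytes: ...


class LocalStorageAdapter:
    """Stores objects as files under a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root).resolve()

    def _path(self, key: str) -> Path:
        path = (self.root / key).resolve()
        # Reject keys such as "../x" that would escape the root directory.
        if not path.is_relative_to(self.root):
            raise ValueError(f"key escapes storage root: {key!r}")
        return path

    def upload(self, key: str, data: bytes) -> str:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return str(path)

    def download(self, key: str) -> bytes:
        return self._path(key).read_bytes()
```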

3. Strategy Pattern¶

Parser Selection:

- Different parsing strategies (hi_res, fast, auto)
- Runtime strategy selection based on document type
- Consistent parser interface

Chunking Strategies:

- RecursiveCharacterTextSplitter
- TokenTextSplitter
- HierarchicalChunker
- Configurable at runtime

4. Dependency Injection¶

FastAPI DI:

- Service dependencies injected into route handlers
- Singleton instances for expensive resources
- Easy testing with mock injection

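from fastapi import APIRouter, Depends

# QueryRequest, RAGChain, AnalyticsService and the get_* providers are project code
router = APIRouter()

@router.post("/query")
async def query_endpoint(
    request: QueryRequest,
    rag_chain: RAGChain = Depends(get_rag_chain),
    analytics: AnalyticsService = Depends(get_analytics),
):
    # FastAPI resolves rag_chain and analytics before calling the handler
    ...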

5. Plugin Architecture¶

Document Sources:

- Base interface: BaseDocumentSource
- Auto-discovery via registry
- Extensible without modifying core code

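class CustomSource(BaseDocumentSource):
    def fetch_documents(self) -> list[Document]:
        # Custom implementation: return the documents this source provides
        raise NotImplementedError

    # validate_config() and get_metadata() must be implemented as well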

6. Singleton Pattern¶

Vector Store:

- Single instance shared across requests
- Lazy initialization on first access
- Thread-safe access
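
Lazy, thread-safe initialization is commonly done with double-checked locking. The sketch below shows that technique in isolation; `LazySingleton` is a hypothetical helper, not the project's vector store accessor.

```python
import threading
from collections.abc import Callable
from typing import Generic, TypeVar

T = TypeVar("T")


class LazySingleton(Generic[T]):
    """Build an expensive object on first access and reuse it afterwards."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._lock = threading.Lock()
        self._instance: T | None = None
        self._created = False

    def get(self) -> T:
        if not self._created:  # fast path: no locking once the instance exists
            with self._lock:
                if not self._created:  # re-check: another thread may have won the race
                    self._instance = self._factory()
                    self._created = True
        return self._instance  # type: ignore[return-value]
```

A holder such as `LazySingleton(build_store)` would be created once at import time, with `build_store` being whatever function opens the FAISS index or Qdrant client.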

Embedding Service:

- Model loaded once at startup
- Shared across all ETL operations

Deployment Architecture¶

AWS Production Architecture¶

graph TB
    subgraph "Edge Layer"
        CLOUDFRONT[CloudFront CDN<br/>Static Assets]
        APIGW[API Gateway HTTP<br/>REST API]
    end

    subgraph "Compute Layer"
        ECS[ECS Fargate<br/>Backend API<br/>t4g.micro]
        EC2[EC2 Spot<br/>Qdrant Vector DB<br/>t4g.micro]
    end

    subgraph "Storage Layer"
        RDS[(RDS PostgreSQL<br/>pgvector<br/>t4g.micro)]
        S3[S3 Bucket<br/>Documents]
        DYNAMODB[(DynamoDB<br/>Query Cache)]
    end

    subgraph "Orchestration"
        GITHUB[GitHub Actions<br/>ETL Scheduler]
        EVENTBRIDGE[EventBridge<br/>Triggers]
    end

    CLOUDFRONT --> S3
    APIGW --> ECS
    ECS --> RDS
    ECS --> S3
    ECS --> DYNAMODB
    ECS --> EC2
    GITHUB --> EVENTBRIDGE
    EVENTBRIDGE --> ECS

    style CLOUDFRONT fill:#FF9800
    style APIGW fill:#FF9800
    style ECS fill:#4CAF50
    style EC2 fill:#4CAF50
    style RDS fill:#9C27B0
    style S3 fill:#2196F3
    style DYNAMODB fill:#2196F3

Local Development Architecture¶

graph TB
    subgraph "Docker Compose"
        API[Backend API<br/>:8000]
        POSTGRES[(PostgreSQL<br/>pgvector<br/>:5432)]
        QDRANT[Qdrant<br/>:6333]
        REDIS[(Redis<br/>:6379)]
        AIRFLOW[Airflow<br/>:8080<br/>dev mode only]
    end

    API --> POSTGRES
    API --> QDRANT
    API --> REDIS
    AIRFLOW --> API

    style API fill:#4CAF50
    style POSTGRES fill:#9C27B0
    style QDRANT fill:#FF9800
    style REDIS fill:#F44336
    style AIRFLOW fill:#2196F3

Start Commands:

# Production-like (no Airflow)
docker-compose up

# Development with Airflow UI
docker-compose --profile dev up

Cross-References¶

RAG Pipeline Deep Dive: rag-pipeline.md
- Detailed RAG architecture
- Query processing internals
- Vector retrieval algorithms
- Response generation strategies
ETL Pipeline Deep Dive: etl-pipeline.md
- Document ingestion workflow
- Parsing strategies
- Chunking algorithms
- Metadata extraction
LLM Configuration: ../llm-config.md
- Provider setup (OpenAI, Azure, Bedrock, Anthropic)
- Model selection guide
- Cost optimization
- Prompt engineering
Custom Parsers: ../custom-parsers.md
- Parser development guide
- Unstructured.io integration
- Layout-aware parsing
- Table extraction
Custom Embeddings: ../custom-embeddings.md
- Embedding model selection
- HuggingFace integration
- Dimensionality tradeoffs
- Re-indexing procedures
Cloud Storage Guide: ../cloud-storage.md
- AWS S3 setup
- Azure Blob configuration
- Storage adapter usage
Deployment Guides:
- AWS: /deploy/aws/README.md
- Azure: /deploy/azure/README.md
- Docker: /deploy/docker/README.md

API Documentation¶

OpenAPI Spec: http://localhost:8000/docs (served by the running API)
API Reference: /docs/api-reference/

Source Code References¶

Core Components:

- RAG Chain: /backend/green_gov_rag/rag/rag_chain.py
- Hybrid Search: /backend/green_gov_rag/rag/hybrid_search.py
- LLM Factory: /backend/green_gov_rag/rag/llm_factory.py
- ETL Pipeline: /backend/green_gov_rag/etl/pipeline.py

Configuration:

- Settings: /backend/green_gov_rag/config.py
- Document Config: /backend/configs/documents_config.yml
- Environment: /backend/.env.example

Next Steps¶

- Deep Dive into RAG: Read rag-pipeline.md for query processing details
- Deep Dive into ETL: Read etl-pipeline.md for document ingestion
- Customize LLMs: See ../llm-config.md for provider configuration
- Extend Parsers: See ../custom-parsers.md for custom parsing logic
- Deploy: Follow deployment guides in /deploy/

Last Updated: 2025-11-22