# System Architecture Overview
This document provides a comprehensive overview of GreenGovRAG's system architecture, component design, and data flow patterns.
## Table of Contents
- Architecture Diagram
- System Components
- Data Flow
- Technology Stack
- Design Patterns
- Deployment Architecture
- Cross-References

## Architecture Diagram

```mermaid
graph TB
subgraph "Client Layer"
WEB[Web Frontend<br/>React + TypeScript]
CLI[CLI Tool<br/>greengovrag-cli]
API_CLIENT[API Clients<br/>External Systems]
end
subgraph "API Layer"
FASTAPI[FastAPI Application<br/>Python 3.12]
PUBLIC_API[Public API Routes<br/>/api/*]
ADMIN_API[Admin API Routes<br/>/api/admin/*]
MIDDLEWARE[Middleware<br/>CORS, Rate Limiting, Auth]
end
subgraph "Business Logic Layer"
RAG_CHAIN[RAG Chain<br/>Query Processing]
ANALYTICS[Analytics Service<br/>Usage Tracking]
CACHE[Cache Service<br/>Query Results]
TRUST[Trust Score Service<br/>Citation Verification]
end
subgraph "RAG Components"
HYBRID_SEARCH[Hybrid Geospatial Search<br/>Vector + BM25 + Spatial]
EMBEDDINGS[Embedding Service<br/>HuggingFace/OpenAI]
LLM_FACTORY[LLM Factory<br/>Multi-Provider Support]
RESPONSE_GEN[Response Generator<br/>Citations + Deep Links]
end
subgraph "Data Storage"
VECTOR_STORE[Vector Store<br/>FAISS/Qdrant]
POSTGRES[(PostgreSQL<br/>pgvector)]
CACHE_DB[(DynamoDB/Redis<br/>Query Cache)]
CLOUD_STORAGE[Cloud Storage<br/>S3/Azure Blob]
end
subgraph "ETL Pipeline"
SOURCES[Document Sources<br/>Plugin System]
INGEST[Document Ingestion<br/>Download + Validate]
PARSER[Document Parsing<br/>Unstructured.io]
CHUNKER[Text Chunking<br/>Hierarchical]
TAGGER[Metadata Tagging<br/>LLM-based]
DB_WRITER[Database Writer<br/>Batch Insertion]
end
subgraph "External Services"
OPENAI[OpenAI API<br/>GPT-4, GPT-3.5]
AZURE_OPENAI[Azure OpenAI<br/>GPT-4, GPT-3.5]
BEDROCK[AWS Bedrock<br/>Claude, Titan]
ANTHROPIC[Anthropic API<br/>Claude]
end
WEB --> FASTAPI
CLI --> FASTAPI
API_CLIENT --> FASTAPI
FASTAPI --> PUBLIC_API
FASTAPI --> ADMIN_API
FASTAPI --> MIDDLEWARE
PUBLIC_API --> RAG_CHAIN
PUBLIC_API --> ANALYTICS
ADMIN_API --> ANALYTICS
RAG_CHAIN --> HYBRID_SEARCH
RAG_CHAIN --> RESPONSE_GEN
HYBRID_SEARCH --> EMBEDDINGS
HYBRID_SEARCH --> VECTOR_STORE
RESPONSE_GEN --> LLM_FACTORY
LLM_FACTORY --> OPENAI
LLM_FACTORY --> AZURE_OPENAI
LLM_FACTORY --> BEDROCK
LLM_FACTORY --> ANTHROPIC
RAG_CHAIN --> CACHE
CACHE --> CACHE_DB
RAG_CHAIN --> POSTGRES
ANALYTICS --> POSTGRES
SOURCES --> INGEST
INGEST --> PARSER
PARSER --> CHUNKER
CHUNKER --> TAGGER
TAGGER --> DB_WRITER
DB_WRITER --> POSTGRES
DB_WRITER --> VECTOR_STORE
INGEST --> CLOUD_STORAGE
PARSER --> CLOUD_STORAGE
style FASTAPI fill:#4CAF50
style RAG_CHAIN fill:#2196F3
style VECTOR_STORE fill:#FF9800
style POSTGRES fill:#9C27B0
style LLM_FACTORY fill:#F44336
```

## System Components

### 1. API Layer

#### FastAPI Application

- Location: `/backend/green_gov_rag/api/main.py`
- Purpose: REST API server for query processing and document management
- Key Features:
  - OpenAPI/Swagger documentation at `/docs`
  - CORS middleware for cross-origin requests
  - Rate limiting (30 requests/minute default)
  - Request/response validation with Pydantic
  - Health check endpoints

#### Public API Routes

- Location: `/backend/green_gov_rag/api/routes/`
- Endpoints (example request below):
  - `POST /api/query` - RAG query with optional location filtering
  - `GET /api/documents` - List available documents
  - `GET /api/analytics` - Usage statistics
  - `GET /api/lga-boundaries` - GeoJSON data for LGAs
  - `GET /api/health` - Health check
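
A minimal example of calling the query endpoint from Python. It assumes a local deployment on port 8000; `query` and `lga_name` are the fields shown in the query-processing sequence diagram, while the `answer`/`sources` response keys are assumptions based on the `QueryResponse` description.

```python
import requests

# Query a local GreenGovRAG deployment (field names beyond `query` and
# `lga_name` are illustrative assumptions).
response = requests.post(
    "http://localhost:8000/api/query",
    json={
        "query": "What are the NGER reporting thresholds for scope 1 emissions?",
        "lga_name": "City of Adelaide",  # optional location filter
    },
    timeout=60,
)
response.raise_for_status()
result = response.json()
print(result["answer"])   # generated answer with inline citations
print(result["sources"])  # cited source documents
```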

#### Admin API Routes

- Location: `/backend/green_gov_rag/api/admin/`
- Endpoints:
  - Document CRUD operations
  - Analytics dashboard data
  - System health monitoring
  - Reprocessing triggers
  - Cache management

### 2. RAG Components

#### RAG Chain

- Location: `/backend/green_gov_rag/rag/rag_chain.py`
- Responsibilities:
  - Orchestrates the end-to-end RAG pipeline
  - Document retrieval coordination
  - Context building from retrieved documents
  - LLM prompt construction and invocation
  - Response formatting with citations
- Key Methods (usage sketch below):
  - `retrieve_documents()` - Vector similarity search with filters
  - `generate_answer()` - LLM-based answer generation
  - `query_with_sources()` - Complete RAG query with source attribution
  - `query_with_enhanced_citations()` - Advanced citation formatting
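
A hedged sketch of how these methods compose. The import path matches the location above, but the constructor arguments and exact keyword names are assumptions, not the module's confirmed signatures.

```python
from green_gov_rag.rag.rag_chain import RAGChain

rag_chain = RAGChain()  # constructor arguments omitted / assumed defaults

# 1. Retrieve candidate chunks via vector similarity search.
docs = rag_chain.retrieve_documents("landfill levy exemptions in SA", k=10)

# 2. Generate an answer grounded in the retrieved context.
answer = rag_chain.generate_answer("landfill levy exemptions in SA", docs)

# Or run the full pipeline in one call, with source attribution and
# enhanced citation formatting.
result = rag_chain.query_with_enhanced_citations("landfill levy exemptions in SA")
```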

#### Hybrid Geospatial Search

- Location: `/backend/green_gov_rag/rag/hybrid_search.py`
- Features:
  - Vector similarity search (semantic)
  - BM25 lexical search
  - Spatial filtering by LGA, state, coordinates
  - Hierarchical jurisdiction filtering (federal → state → local)
  - Metadata filtering (ESG, category, topic)
  - Query expansion (acronym resolution)
  - Automatic location extraction via NER
- Search Strategies (usage sketch below):
  - `search()` - Combined hybrid search
  - `search_with_lga()` - LGA-specific search
  - `search_with_esg_filters()` - ESG metadata filtering
  - `search_with_auto_location()` - Automatic NER-based location detection
  - `advanced_search()` - Multi-filter combination
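
A sketch of how two of the strategies above might be invoked. The class name `HybridGeospatialSearch` and the keyword arguments are assumptions; only the method names come from the list above.

```python
from green_gov_rag.rag.hybrid_search import HybridGeospatialSearch  # class name assumed

searcher = HybridGeospatialSearch()

# LGA-scoped search: the spatial filter is applied explicitly.
results = searcher.search_with_lga(
    "tree removal permits on private land",
    lga_name="City of Onkaparinga",
)

# Automatic location detection: NER extracts the place name from the query text.
results = searcher.search_with_auto_location(
    "What are the flood overlay requirements in Brisbane?"
)
```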

#### Embedding Service

- Location: `/backend/green_gov_rag/rag/embeddings.py`
- Supported Providers:
  - HuggingFace Transformers (default: `sentence-transformers/all-MiniLM-L6-v2`)
  - OpenAI Embeddings (`text-embedding-3-small`, `text-embedding-3-large`)
  - AWS Bedrock Embeddings
- Capabilities (batch-embedding sketch below):
  - Batch embedding generation (default: 100 chunks/batch)
  - Progress tracking for large datasets
  - Empty chunk filtering
  - Dimensions: 384 (MiniLM) or 1536 (OpenAI)
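
A minimal sketch of the underlying batch-embedding behaviour using the default HuggingFace model via `sentence-transformers`; the project's embedding service wraps logic along these lines, but its exact API is not shown here.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunks = [
    "Scope 1 emissions are direct emissions from owned or controlled sources.",
    "",  # empty chunks are filtered out before embedding
    "NGER reporting thresholds apply to controlling corporations.",
]
non_empty = [c for c in chunks if c.strip()]

embeddings = model.encode(
    non_empty,
    batch_size=100,           # matches the 100 chunks/batch default
    show_progress_bar=True,   # progress tracking for large datasets
)
print(embeddings.shape)       # (2, 384) for MiniLM
```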

#### LLM Factory

- Location: `/backend/green_gov_rag/rag/llm_factory.py`
- Supported Providers:
  - OpenAI: GPT-4, GPT-4-turbo, GPT-3.5-turbo, GPT-4o
  - Azure OpenAI: same models via Azure deployments
  - AWS Bedrock: Claude, Titan, Llama models
  - Anthropic: Claude 3 Opus, Sonnet, Haiku
- Configuration Parameters (factory sketch below):
  - `temperature` - Sampling randomness (0.0-1.0)
  - `max_tokens` - Maximum response length
  - `model` - Specific model name/ID
  - Provider-specific credentials via environment variables
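
A hedged sketch of the factory idea using LangChain chat-model classes: pick a provider from configuration and return a configured model. Credentials, endpoints, and API versions are assumed to come from the usual provider environment variables; the Bedrock and Anthropic branches are omitted, and the real `llm_factory.py` may be structured differently.

```python
from langchain_openai import AzureChatOpenAI, ChatOpenAI


def create_llm(provider: str, model: str, temperature: float = 0.0, max_tokens: int = 1024):
    """Return a chat model for the configured provider (sketch, not the actual factory)."""
    if provider == "openai":
        return ChatOpenAI(model=model, temperature=temperature, max_tokens=max_tokens)
    if provider == "azure_openai":
        # Endpoint, key, and API version are read from the AZURE_OPENAI_* env vars.
        return AzureChatOpenAI(azure_deployment=model, temperature=temperature, max_tokens=max_tokens)
    raise ValueError(f"Unsupported provider: {provider}")


llm = create_llm("azure_openai", "gpt-4o-mini")
print(llm.invoke("Summarise scope 2 emissions in one sentence.").content)
```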

#### Response Generator

- Location: `/backend/green_gov_rag/rag/enhanced_response.py`
- Features:
  - Inline citation markers `[1]`, `[2]`, etc.
  - Deep links to PDF pages (`#page=N`)
  - Section path display (e.g., "Section 2.1.3: Thresholds")
  - Hierarchical breadcrumbs (Document > Section > Subsection)
  - Confidence scoring for citations
  - Markdown and JSON output formats

### 3. Vector Storage

#### Vector Store Interface

- Location: `/backend/green_gov_rag/rag/vector_store_interface.py`
- Abstraction: Unified interface for multiple vector stores (sketch below)
- Implementations:
  - FAISS (local development): fast, in-memory index
  - Qdrant (production): distributed vector database with filters
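
A hedged sketch of what such a unified interface typically looks like; the method names and signatures here are assumptions rather than the module's actual API.

```python
from abc import ABC, abstractmethod
from typing import Any


class VectorStoreInterface(ABC):
    """Backend-agnostic contract implemented by the FAISS and Qdrant stores."""

    @abstractmethod
    def add_documents(self, chunks: list[dict[str, Any]], embeddings: list[list[float]]) -> None:
        """Persist chunks and their embeddings."""

    @abstractmethod
    def similarity_search(
        self,
        query_embedding: list[float],
        k: int = 10,
        filters: dict[str, Any] | None = None,
    ) -> list[dict[str, Any]]:
        """Return the k nearest chunks, optionally filtered by metadata."""
```

Because the RAG chain depends only on a contract like this, the backend can be swapped between FAISS and Qdrant via configuration.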

#### FAISS Store

- Location: `/backend/green_gov_rag/rag/stores/faiss_store.py`
- Pros:
  - No external dependencies
  - Fast for small-to-medium datasets (<1M vectors)
  - Simple setup
- Cons:
  - In-memory only (limited scalability)
  - No distributed support
  - Limited metadata filtering

#### Qdrant Store

- Location: `/backend/green_gov_rag/rag/stores/qdrant_store.py`
- Pros:
  - Distributed architecture (scalable to billions of vectors)
  - Advanced metadata filtering
  - Persistent storage
  - HNSW index for sub-linear search
- Cons:
  - Requires an external service
  - More complex deployment

### 4. ETL Pipeline

#### Document Sources (Plugin System)

- Location: `/backend/green_gov_rag/etl/sources/`
- Base Interface: `BaseDocumentSource`
- Built-in Sources:
  - Federal legislation (environment.gov.au)
  - State legislation (SA, NSW)
  - Local government (council portals)
  - Emissions data (CER, NGER)
- Custom Source Development:
  - Extend `BaseDocumentSource`
  - Implement `fetch_documents()`, `validate_config()`, `get_metadata()`
  - Register in `sources/factory.py`

#### Document Ingestion

- Location: `/backend/green_gov_rag/etl/ingest.py`
- Workflow:
  1. Load configuration from YAML
  2. Validate source configurations
  3. Generate consistent document IDs (for delta indexing)
  4. Download documents with retry logic
  5. Detect file types from magic bytes
  6. Save to local filesystem or cloud storage
  7. Extract and store metadata
- Error Handling (retry sketch below):
  - Retry with exponential backoff (3 attempts)
  - Cloudflare bot protection detection
  - Failed download logging to `failed_downloads.txt`
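
A minimal sketch of the retry-with-exponential-backoff behaviour described above (3 attempts); the actual ingest module adds file-type detection, bot-protection handling, and storage uploads on top of this.

```python
import time

import requests


def download_with_retry(url: str, attempts: int = 3, backoff: float = 2.0) -> bytes:
    """Download a document, retrying with exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=60)
            response.raise_for_status()
            return response.content
        except requests.RequestException:
            if attempt == attempts:
                raise  # the caller records the URL in failed_downloads.txt
            time.sleep(backoff**attempt)  # sleep 2s, then 4s, before retrying
    raise RuntimeError("unreachable")
```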

#### Document Parsing

- Location: `/backend/green_gov_rag/etl/parsers/`
- Parser Types:
  - UnstructuredPDFParser: advanced layout analysis with Unstructured.io
  - LayoutPDFParser: hierarchical section extraction
  - HTMLParser: web content parsing
- Parsing Strategies (example below):
  - `hi_res`: detailed analysis (slow, accurate)
  - `fast`: quick parsing (faster, less accurate)
  - `auto`: automatic strategy selection
- Extracted Metadata:
  - Section hierarchy (`section_hierarchy`, `section_title`)
  - Page numbers and ranges
  - Clause references (e.g., "s.3.2.1", "cl.42")
  - Table detection and association
  - Element types (paragraph, table, list, header)
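
A short example of strategy-based parsing with the Unstructured.io library, which the project's PDF parsers build on. The filename is hypothetical, and the project parsers add section-hierarchy and clause-reference extraction on top of output like this.

```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="nger_determination.pdf",  # hypothetical input file
    strategy="hi_res",                  # or "fast" / "auto"
    infer_table_structure=True,         # keep table detection
)

for element in elements[:5]:
    # Each element carries a type (paragraph, table, list, header) and page metadata.
    print(element.category, element.metadata.page_number, element.text[:60])
```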

#### Text Chunking

- Location: `/backend/green_gov_rag/etl/chunker.py`
- Strategies:
  - RecursiveCharacterTextSplitter: paragraph → sentence → word splitting
  - TokenTextSplitter: token-based splitting
  - HierarchicalChunker: preserves section metadata
- Configuration (example below):
  - `chunk_size`: 500-1000 tokens (default: 1000)
  - `chunk_overlap`: 100-200 tokens (default: 100)
  - Custom separators: `["\n\n", "\n", " ", ""]`

#### Metadata Tagging

- Location: `/backend/green_gov_rag/etl/metadata_tagger.py`
- LLM-based Auto-tagging:
  - ESG metadata extraction (emission scopes, frameworks)
  - Spatial scope detection (federal/state/local)
  - Topic classification
  - Regulatory framework identification
- Tagged Fields (example below):
  - `esg_metadata.emission_scopes` (scope_1, scope_2, scope_3)
  - `esg_metadata.frameworks` (NGER, ISSB, GHG_Protocol)
  - `spatial_metadata.spatial_scope` (federal, state, local)
  - `spatial_metadata.lga_codes`, `spatial_metadata.state`
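
An illustrative example of the enriched metadata attached to a chunk after tagging. The field names follow the list above; the values (including the LGA code) are hypothetical.

```python
# Metadata produced by the tagger for a single chunk (illustrative values).
tagged_metadata = {
    "esg_metadata": {
        "emission_scopes": ["scope_1", "scope_2"],
        "frameworks": ["NGER", "GHG_Protocol"],
    },
    "spatial_metadata": {
        "spatial_scope": "state",
        "state": "SA",
        "lga_codes": ["40070"],  # hypothetical LGA code
    },
}
```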

#### Database Writer

- Location: `/backend/green_gov_rag/etl/db_writer.py`
- Batch Operations:
  - Batch size: 100 chunks (configurable)
  - Upsert support (update or insert)
  - Transaction management
  - Error handling with rollback
- Writes To:
  - PostgreSQL (document metadata, chunks)
  - Vector store (embeddings)
  - Cloud storage (processed files)

### 5. Data Models

#### Database Models

- Location: `/backend/green_gov_rag/models/`
- SQLModel Entities (sketch below):
  - `Document`: document metadata (title, jurisdiction, category)
  - `Chunk`: text chunks with embeddings
  - `Query`: user queries and responses
  - `Analytics`: usage statistics
  - `LGABoundary`: geospatial boundaries (PostGIS)
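
A hedged sketch of one of these entities as a SQLModel table. The columns shown are limited to those named above plus an assumed `source_url`; the actual models define more fields and relationships.

```python
from sqlmodel import Field, SQLModel


class Document(SQLModel, table=True):
    """Document metadata row (illustrative subset of columns)."""

    id: int | None = Field(default=None, primary_key=True)
    title: str
    jurisdiction: str              # e.g. "federal", "SA", "NSW"
    category: str                  # e.g. "legislation", "guidance"
    source_url: str | None = None  # assumed field, not confirmed in the codebase
```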

#### API Schemas

- Location: `/backend/green_gov_rag/api/schemas/`
- Pydantic Models:
  - `QueryRequest`: query input with filters
  - `QueryResponse`: answer with sources and citations
  - `DocumentResponse`: document metadata
  - `AnalyticsResponse`: usage statistics

### 6. Services

#### Analytics Service

- Location: `/backend/green_gov_rag/api/services/analytics_service.py`
- Tracks:
  - Query frequency
  - Response latency
  - Trust score distribution
  - LGA query distribution
  - Top queries and documents

#### Cache Service

- Location: `/backend/green_gov_rag/api/services/cache_service.py`
- Backends:
  - DynamoDB (AWS production)
  - Redis (local development)
- TTL: 1 hour default
- Cache Keys: SHA-256 hash of query + filters (sketch below)
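
A minimal sketch of the cache-key scheme: hash the query together with its filters in a canonical, sorted-key JSON form so that equivalent requests map to the same key. The exact serialisation used by the service is an assumption.

```python
import hashlib
import json


def cache_key(query: str, filters: dict | None = None) -> str:
    """Derive a deterministic SHA-256 cache key from a query and its filters."""
    payload = json.dumps({"query": query, "filters": filters or {}}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key = cache_key("NGER reporting thresholds", {"lga_name": "City of Adelaide"})
```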

#### Trust Score Service

- Location: `/backend/green_gov_rag/api/services/trust_score_service.py`
- Scoring Factors (combined in the sketch below):
  - Citation verification (source exists)
  - Regulatory hierarchy (federal > state > local)
  - Jurisdiction match (query LGA vs. source LGA)
  - Recency (newer documents scored higher)
  - Confidence score from the LLM
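
A hedged sketch of how such factors might be combined into a single score. The weights and the 0-1 normalisation are illustrative assumptions, not the service's actual formula.

```python
WEIGHTS = {
    "citation_verified": 0.30,
    "hierarchy": 0.20,          # federal > state > local
    "jurisdiction_match": 0.20,
    "recency": 0.15,
    "llm_confidence": 0.15,
}


def trust_score(factors: dict[str, float]) -> float:
    """Weighted sum of scoring factors, each already normalised to 0-1."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)


score = trust_score({
    "citation_verified": 1.0,
    "hierarchy": 0.8,
    "jurisdiction_match": 1.0,
    "recency": 0.6,
    "llm_confidence": 0.7,
})
```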

## Data Flow

### Query Processing Flow

```mermaid
sequenceDiagram
participant Client
participant API
participant RAGChain
participant HybridSearch
participant VectorStore
participant LLM
participant Cache
participant Database
Client->>API: POST /api/query<br/>{query, lga_name}
API->>Cache: Check cache
alt Cache Hit
Cache-->>API: Cached response
API-->>Client: Return cached result
else Cache Miss
API->>RAGChain: query_with_enhanced_citations()
RAGChain->>HybridSearch: search_with_auto_location()
HybridSearch->>HybridSearch: Extract locations (NER)
HybridSearch->>HybridSearch: Expand query (acronyms)
HybridSearch->>VectorStore: similarity_search(query, k=10)
VectorStore-->>HybridSearch: Top-K documents
HybridSearch->>HybridSearch: Apply spatial filters
HybridSearch->>HybridSearch: Apply metadata filters
HybridSearch->>HybridSearch: Boost by jurisdiction
HybridSearch-->>RAGChain: Filtered documents
RAGChain->>RAGChain: build_context_from_documents()
RAGChain->>LLM: generate_answer(query, context)
LLM-->>RAGChain: Generated answer
RAGChain->>RAGChain: format_enhanced_response()
RAGChain-->>API: EnhancedResponse
API->>Cache: Store result (TTL: 1h)
API->>Database: Log query analytics
API-->>Client: JSON response
end
```

### ETL Pipeline Flow

```mermaid
sequenceDiagram
participant Scheduler as GitHub Actions<br/>or Airflow
participant Pipeline as ETL Pipeline
participant Sources as Document Sources
participant Ingest as Ingest Module
participant Parser as PDF Parser
participant Chunker
participant Tagger as Metadata Tagger
participant Embedder as Embedding Service
participant Storage as Cloud Storage
participant DB as PostgreSQL
participant Vector as Vector Store
Scheduler->>Pipeline: Trigger ETL (daily 2AM UTC)
Pipeline->>Sources: Load configs
Sources->>Sources: Validate sources
Sources-->>Pipeline: Document configs
loop For each document
Pipeline->>Ingest: download_document()
Ingest->>Ingest: Generate document_id
Ingest->>Ingest: Download with retry
Ingest->>Storage: Upload raw file
Ingest-->>Pipeline: File path
Pipeline->>Parser: parse_with_structure()
Parser->>Parser: Extract sections
Parser->>Parser: Detect tables/lists
Parser->>Parser: Extract clause refs
Parser-->>Pipeline: Parsed chunks
Pipeline->>Chunker: chunk_with_hierarchy()
Chunker->>Chunker: Split into chunks
Chunker->>Chunker: Preserve metadata
Chunker-->>Pipeline: Text chunks
Pipeline->>Tagger: tag_all()
Tagger->>LLM: Extract ESG metadata
LLM-->>Tagger: Tagged metadata
Tagger-->>Pipeline: Enriched chunks
Pipeline->>Embedder: embed_chunks()
Embedder->>Embedder: Batch embed (100/batch)
Embedder-->>Pipeline: Chunks with embeddings
Pipeline->>DB: Write chunks (batch)
Pipeline->>Vector: Build/update index
Pipeline->>Storage: Save processed chunks
end
Pipeline-->>Scheduler: ETL complete
```

## Technology Stack

### Backend Stack

| Component | Technology | Version | Purpose |
|---|---|---|---|
| Language | Python | 3.12 | Core language |
| Web Framework | FastAPI | 0.104+ | REST API |
| ORM | SQLModel | 0.0.14+ | Database models |
| Database | PostgreSQL | 15+ | Primary datastore |
| Vector Extension | pgvector | 0.5+ | Vector similarity search |
| Vector Store | FAISS / Qdrant | Latest | Embeddings index |
| LLM Framework | LangChain | 0.1+ | RAG orchestration |
| Embeddings | HuggingFace Transformers | 4.35+ | Text embeddings |
| PDF Parsing | Unstructured.io | 0.10+ | Document parsing |
| NER | spaCy | 3.7+ | Location extraction |
| Task Queue | Celery (optional) | 5.3+ | Background jobs |
| Validation | Pydantic | 2.5+ | Data validation |
| Testing | pytest | 7.4+ | Unit/integration tests |
| Linting | Ruff | 0.1+ | Code quality |
| Type Checking | MyPy | 1.7+ | Static typing |

### LLM Providers

| Provider | Models | Use Case |
|---|---|---|
| OpenAI | GPT-4, GPT-4-turbo, GPT-3.5-turbo, GPT-4o | General purpose |
| Azure OpenAI | GPT-4, GPT-3.5-turbo | Enterprise deployment |
| AWS Bedrock | Claude 3, Titan, Llama 2 | AWS-native |
| Anthropic | Claude 3 Opus, Sonnet, Haiku | Advanced reasoning |
Recommended: Azure OpenAI with `gpt-4o-mini` (best cost/performance ratio)

### Cloud Infrastructure

| Service | AWS | Azure | Local |
|---|---|---|---|
| Compute | ECS Fargate | Container Apps | Docker Compose |
| Vector DB | EC2 Spot (Qdrant) | Container Instance | Docker |
| Database | RDS PostgreSQL | Azure Database | PostgreSQL |
| Storage | S3 | Blob Storage | Filesystem |
| Cache | DynamoDB | Redis Cache | Redis |
| API Gateway | API Gateway HTTP | API Management | None |
| CDN | CloudFront | Front Door | None |

## Design Patterns

### 1. Factory Pattern

LLM Factory (llm_factory.py):
- Creates LLM instances based on provider configuration
- Abstracts provider-specific initialization
- Centralized credential management
Vector Store Factory (vector_store_factory.py):
- Creates vector store instances (FAISS/Qdrant)
- Abstracts storage backend differences
- Consistent interface across stores
Document Source Factory (sources/factory.py):
- Creates document source plugins
- Auto-detects source type from config
- Plugin registration and discovery

### 2. Repository Pattern

Database Writer (db_writer.py):
- Abstracts database operations
- Batch insert/update operations
- Transaction management
- Error handling with rollback
Storage Adapter (storage_adapter.py):
- Abstracts cloud storage operations
- Supports S3, Azure Blob, local filesystem
- Consistent upload/download interface

### 3. Strategy Pattern

Parser Selection:
- Different parsing strategies (hi_res, fast, auto)
- Runtime strategy selection based on document type
- Consistent parser interface
Chunking Strategies:
- RecursiveCharacterTextSplitter
- TokenTextSplitter
- HierarchicalChunker
- Configurable at runtime

### 4. Dependency Injection

FastAPI DI:
- Service dependencies injected into route handlers
- Singleton instances for expensive resources
- Easy testing with mock injection

```python
@router.post("/query")
async def query_endpoint(
    request: QueryRequest,
    rag_chain: RAGChain = Depends(get_rag_chain),
    analytics: AnalyticsService = Depends(get_analytics),
):
    # Use injected dependencies
    pass
```

### 5. Plugin Architecture

Document Sources:

- Base interface: `BaseDocumentSource`
- Auto-discovery via registry
- Extensible without modifying core code

```python
class CustomSource(BaseDocumentSource):
    def fetch_documents(self) -> list[Document]:
        # Custom implementation
        pass
```

### 6. Singleton Pattern

Vector Store:
- Single instance shared across requests
- Lazy initialization on first access
- Thread-safe access
Embedding Service:
- Model loaded once at startup
- Shared across all ETL operations
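
A hedged sketch of the lazy, thread-safe singleton access described above, using double-checked locking; the factory helper import is an assumption and the codebase's actual accessor may differ.

```python
import threading

from green_gov_rag.rag.vector_store_factory import create_vector_store  # assumed helper

_vector_store = None
_lock = threading.Lock()


def get_vector_store():
    """Return the shared vector store, initialising it on first access."""
    global _vector_store
    if _vector_store is None:          # fast path once initialised
        with _lock:
            if _vector_store is None:  # double-checked locking
                _vector_store = create_vector_store()
    return _vector_store
```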

## Deployment Architecture

### AWS Production Architecture

```mermaid
graph TB
subgraph "Edge Layer"
CLOUDFRONT[CloudFront CDN<br/>Static Assets]
APIGW[API Gateway HTTP<br/>REST API]
end
subgraph "Compute Layer"
ECS[ECS Fargate<br/>Backend API<br/>t4g.micro]
EC2[EC2 Spot<br/>Qdrant Vector DB<br/>t4g.micro]
end
subgraph "Storage Layer"
RDS[(RDS PostgreSQL<br/>pgvector<br/>t4g.micro)]
S3[S3 Bucket<br/>Documents]
DYNAMODB[(DynamoDB<br/>Query Cache)]
end
subgraph "Orchestration"
GITHUB[GitHub Actions<br/>ETL Scheduler]
EVENTBRIDGE[EventBridge<br/>Triggers]
end
CLOUDFRONT --> S3
APIGW --> ECS
ECS --> RDS
ECS --> S3
ECS --> DYNAMODB
ECS --> EC2
GITHUB --> EVENTBRIDGE
EVENTBRIDGE --> ECS
style CLOUDFRONT fill:#FF9800
style APIGW fill:#FF9800
style ECS fill:#4CAF50
style EC2 fill:#4CAF50
style RDS fill:#9C27B0
style S3 fill:#2196F3
style DYNAMODB fill:#2196F3
```

### Local Development Architecture

```mermaid
graph TB
subgraph "Docker Compose"
API[Backend API<br/>:8000]
POSTGRES[(PostgreSQL<br/>pgvector<br/>:5432)]
QDRANT[Qdrant<br/>:6333]
REDIS[(Redis<br/>:6379)]
AIRFLOW[Airflow<br/>:8080<br/>dev mode only]
end
API --> POSTGRES
API --> QDRANT
API --> REDIS
AIRFLOW --> API
style API fill:#4CAF50
style POSTGRES fill:#9C27B0
style QDRANT fill:#FF9800
style REDIS fill:#F44336
style AIRFLOW fill:#2196F3
```

Start Commands:

```bash
# Production-like (no Airflow)
docker-compose up

# Development with Airflow UI
docker-compose --profile dev up
```

## Cross-References

### Related Documentation

- RAG Pipeline Deep Dive: `rag-pipeline.md`
  - Detailed RAG architecture
  - Query processing internals
  - Vector retrieval algorithms
  - Response generation strategies
- ETL Pipeline Deep Dive: `etl-pipeline.md`
  - Document ingestion workflow
  - Parsing strategies
  - Chunking algorithms
  - Metadata extraction
- LLM Configuration: `../llm-config.md`
  - Provider setup (OpenAI, Azure, Bedrock, Anthropic)
  - Model selection guide
  - Cost optimization
  - Prompt engineering
- Custom Parsers: `../custom-parsers.md`
  - Parser development guide
  - Unstructured.io integration
  - Layout-aware parsing
  - Table extraction
- Custom Embeddings: `../custom-embeddings.md`
  - Embedding model selection
  - HuggingFace integration
  - Dimensionality tradeoffs
  - Re-indexing procedures
- Cloud Storage Guide: `../cloud-storage.md`
  - AWS S3 setup
  - Azure Blob configuration
  - Storage adapter usage
- Deployment Guides:
  - AWS: `/deploy/aws/README.md`
  - Azure: `/deploy/azure/README.md`
  - Docker: `/deploy/docker/README.md`

### API Documentation

- OpenAPI Spec: http://localhost:8000/docs (live)
- API Reference: `/docs/api-reference/`

### Source Code References

Core Components:

- RAG Chain: `/backend/green_gov_rag/rag/rag_chain.py`
- Hybrid Search: `/backend/green_gov_rag/rag/hybrid_search.py`
- LLM Factory: `/backend/green_gov_rag/rag/llm_factory.py`
- ETL Pipeline: `/backend/green_gov_rag/etl/pipeline.py`

Configuration:

- Settings: `/backend/green_gov_rag/config.py`
- Document Config: `/backend/configs/documents_config.yml`
- Environment: `/backend/.env.example`

## Next Steps

- Deep Dive into RAG: Read `rag-pipeline.md` for query processing details
- Deep Dive into ETL: Read `etl-pipeline.md` for document ingestion
- Customize LLMs: See `../llm-config.md` for provider configuration
- Extend Parsers: See `../custom-parsers.md` for custom parsing logic
- Deploy: Follow the deployment guides in `/deploy/`

Last Updated: 2025-11-22