LLM Configuration Guide¶
This guide covers configuration and optimization of Large Language Models (LLMs) in GreenGovRAG, including provider setup, model selection, cost optimization, and prompt engineering.
Table of Contents¶
- Supported LLM Providers
- OpenAI Configuration
- Azure OpenAI Configuration
- AWS Bedrock Configuration
- Anthropic Configuration
- Model Selection Guide
- Cost Optimization
- Prompt Customization
- Temperature and Parameter Tuning
Supported LLM Providers¶
GreenGovRAG uses a factory pattern to expose multiple LLM providers through a single LangChain-based interface.
Module: /backend/green_gov_rag/rag/llm_factory.py
Supported Providers:
| Provider | Models | Strengths | Use Case |
|---|---|---|---|
| OpenAI | GPT-4, GPT-4-turbo, GPT-3.5-turbo, GPT-4o | Best general performance | Development, testing |
| Azure OpenAI | Same as OpenAI | Enterprise SLAs, data residency | Production (recommended) |
| AWS Bedrock | Claude 3, Titan, Llama 2 | AWS-native, no API keys | AWS deployments |
| Anthropic | Claude 3 Opus, Sonnet, Haiku | Strong reasoning, long context | Complex regulatory queries |
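Internally the factory dispatches on the provider name and returns the matching LangChain chat model. The real llm_factory.py may differ in detail, but a minimal sketch of that pattern (assuming the langchain-openai, langchain-aws, and langchain-anthropic packages) looks like this:
import os
from langchain_openai import ChatOpenAI, AzureChatOpenAI
from langchain_aws import ChatBedrock
from langchain_anthropic import ChatAnthropic

def create_llm(provider: str, model: str, temperature: float = 0.2, max_tokens: int = 500):
    """Return a LangChain chat model for the requested provider."""
    if provider == "openai":
        # Reads OPENAI_API_KEY from the environment
        return ChatOpenAI(model=model, temperature=temperature, max_tokens=max_tokens)
    if provider == "azure":
        # Reads AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY from the environment
        return AzureChatOpenAI(
            azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"],
            api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2024-02-15-preview"),
            temperature=temperature,
            max_tokens=max_tokens,
        )
    if provider == "bedrock":
        # Uses the standard AWS credential chain (env vars, shared profile, or IAM role)
        return ChatBedrock(
            model_id=model,
            region_name=os.environ.get("AWS_REGION", "us-east-1"),
            model_kwargs={"temperature": temperature, "max_tokens": max_tokens},
        )
    if provider == "anthropic":
        # Reads ANTHROPIC_API_KEY from the environment
        return ChatAnthropic(model=model, temperature=temperature, max_tokens=max_tokens)
    raise ValueError(f"Unsupported LLM provider: {provider}")
Because every branch returns a LangChain chat model, the RAG chain can treat all four providers identically.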
OpenAI Configuration¶
1. Prerequisites¶
- OpenAI API key from https://platform.openai.com/api-keys
- Sufficient quota/credits
2. Environment Variables¶
File: /backend/.env
# LLM Provider Selection
LLM_PROVIDER=openai
# OpenAI Configuration
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxx
LLM_MODEL=gpt-4o-mini # or gpt-4, gpt-4-turbo, gpt-3.5-turbo
# Generation Parameters
LLM_TEMPERATURE=0.2
LLM_MAX_TOKENS=500
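These values are typically loaded from backend/.env at startup. A minimal sketch of reading them and passing them to the factory (python-dotenv assumed; the project's actual settings loader may differ):
import os
from dotenv import load_dotenv
from green_gov_rag.rag.llm_factory import LLMFactory

load_dotenv()  # read backend/.env into the process environment

llm = LLMFactory.create_llm(
    provider=os.getenv("LLM_PROVIDER", "openai"),
    model=os.getenv("LLM_MODEL", "gpt-4o-mini"),
    temperature=float(os.getenv("LLM_TEMPERATURE", "0.2")),
    max_tokens=int(os.getenv("LLM_MAX_TOKENS", "500")),
)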
3. Code Example¶
from green_gov_rag.rag.llm_factory import LLMFactory
# Create LLM instance
llm = LLMFactory.create_llm(
provider="openai",
model="gpt-4o-mini",
temperature=0.2,
max_tokens=500
)
# Use in RAG chain
from green_gov_rag.rag.rag_chain import RAGChain
rag_chain = RAGChain(
vector_store=vector_store,
llm_provider="openai",
llm_model="gpt-4o-mini"
)
result = rag_chain.query("What are NGER thresholds?")
4. Model Comparison¶
| Model | Speed | Cost (input) | Quality | Context |
|---|---|---|---|---|
| gpt-4o-mini | Fast (500ms) | $0.15/1M tokens | Good | 128K |
| gpt-4o | Medium (1s) | $2.50/1M tokens | Excellent | 128K |
| gpt-4-turbo | Medium (1.5s) | $10/1M tokens | Excellent | 128K |
| gpt-3.5-turbo | Very fast (300ms) | $0.50/1M tokens | Good | 16K |
Recommended: gpt-4o-mini for best cost/performance ratio.
Azure OpenAI Configuration¶
1. Prerequisites¶
- Azure subscription
- Azure OpenAI resource created in Azure Portal
- Deployment created for the specific model (e.g., gpt-4o-mini)
2. Setup Steps¶
Step 1: Create Azure OpenAI resource
# Via Azure Portal
1. Navigate to Azure Portal
2. Create Resource → Azure OpenAI
3. Select region (e.g., Australia East)
4. Choose pricing tier (Standard S0)
Step 2: Create model deployment
# In Azure OpenAI Studio
1. Go to Deployments
2. Create New Deployment
3. Model: gpt-4o-mini (or gpt-4, gpt-35-turbo)
4. Deployment name: gpt-4o-mini-deployment
5. Deploy
Step 3: Get credentials
# From Azure Portal → Your OpenAI Resource → Keys and Endpoint
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxx
3. Environment Variables¶
# LLM Provider Selection
LLM_PROVIDER=azure
# Azure OpenAI Configuration
AZURE_OPENAI_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini-deployment # Your deployment name
AZURE_OPENAI_API_VERSION=2024-02-15-preview # API version
# Model Selection
LLM_MODEL=gpt-4o-mini # Base model name
# Generation Parameters
LLM_TEMPERATURE=0.2
LLM_MAX_TOKENS=500
4. Code Example¶
from green_gov_rag.rag.llm_factory import LLMFactory
llm = LLMFactory.create_llm(
provider="azure",
model="gpt-4o-mini",
temperature=0.2,
max_tokens=500
)
# Environment variables automatically loaded
# Uses AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT
5. Benefits of Azure OpenAI¶
- Enterprise SLAs: 99.9% uptime guarantee
- Data Residency: Data stays in Australia (Australia East region)
- No API Keys in Code: Managed Identity support
- Cost Control: Dedicated capacity prevents rate limiting
- Compliance: SOC 2, ISO 27001, HIPAA certified
Pricing (Australia East):
- gpt-4o-mini: $0.15/1M input tokens, $0.60/1M output tokens
- gpt-4o: $2.50/1M input tokens, $10/1M output tokens
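To take advantage of the Managed Identity support noted above, the API key can be replaced with an Entra ID token provider. A hedged sketch (assumes the azure-identity package; the configuration documented above uses API-key auth):
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from langchain_openai import AzureChatOpenAI

# Exchange the workload's managed identity for an Azure OpenAI token (no API key in config)
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)

llm = AzureChatOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com/",
    azure_deployment="gpt-4o-mini-deployment",
    api_version="2024-02-15-preview",
    azure_ad_token_provider=token_provider,
    temperature=0.2,
    max_tokens=500,
)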
AWS Bedrock Configuration¶
1. Prerequisites¶
- AWS account with Bedrock access
- IAM user with Bedrock permissions
- Model access enabled in AWS Console
2. Enable Model Access¶
# Via AWS Console
1. Navigate to AWS Bedrock
2. Model Access → Manage Model Access
3. Enable: Anthropic Claude 3 Sonnet, Claude 3 Haiku, Amazon Titan
4. Request access (usually instant approval)
3. IAM Permissions¶
Policy (bedrock-policy.json):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream"
],
"Resource": [
"arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
"arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"
]
}
]
}
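Before wiring the credentials into GreenGovRAG, a quick boto3 check confirms that model access has been granted (a sketch; assumes boto3 is installed and AWS credentials are configured):
import boto3

# List the Anthropic foundation models visible to this account in the chosen region
bedrock = boto3.client("bedrock", region_name="us-east-1")
response = bedrock.list_foundation_models(byProvider="Anthropic")
for summary in response["modelSummaries"]:
    print(summary["modelId"])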
4. Environment Variables¶
# LLM Provider Selection
LLM_PROVIDER=bedrock
# AWS Credentials
AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AWS_REGION=us-east-1 # Bedrock available in us-east-1, us-west-2, eu-west-1
# Model Selection (Bedrock model ID)
BEDROCK_MODEL_ID=anthropic.claude-3-sonnet-20240229-v1:0
# Or: anthropic.claude-3-haiku-20240307-v1:0
# Or: amazon.titan-text-premier-v1:0
# Generation Parameters
LLM_TEMPERATURE=0.2
LLM_MAX_TOKENS=500
5. Supported Models¶
| Model ID | Model | Speed | Cost | Context |
|---|---|---|---|---|
| anthropic.claude-3-sonnet-20240229-v1:0 | Claude 3 Sonnet | Medium | $3/1M input, $15/1M output | 200K |
| anthropic.claude-3-haiku-20240307-v1:0 | Claude 3 Haiku | Fast | $0.25/1M input, $1.25/1M output | 200K |
| amazon.titan-text-premier-v1:0 | Titan Text Premier | Fast | $0.50/1M tokens | 32K |
6. Code Example¶
llm = LLMFactory.create_llm(
provider="bedrock",
model="anthropic.claude-3-sonnet-20240229-v1:0",
temperature=0.2,
max_tokens=500
)
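For ad-hoc use outside the factory (for example in a notebook), the equivalent direct LangChain call looks roughly like this (a sketch using the langchain-aws package; the factory above is the supported path):
from langchain_aws import ChatBedrock

# Uses the standard AWS credential chain (env vars, shared profile, or IAM role)
llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name="us-east-1",
    model_kwargs={"temperature": 0.2, "max_tokens": 500},
)
print(llm.invoke("What are NGER Scope 1 thresholds?").content)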
Anthropic Configuration¶
1. Prerequisites¶
- Anthropic API key from https://console.anthropic.com/
- Sufficient credits
2. Environment Variables¶
# LLM Provider Selection
LLM_PROVIDER=anthropic
# Anthropic Configuration
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxxxxxxxxx
# Model Selection
LLM_MODEL=claude-3-sonnet-20240229
# Or: claude-3-opus-20240229, claude-3-haiku-20240307
# Generation Parameters
LLM_TEMPERATURE=0.2
LLM_MAX_TOKENS=500
3. Model Comparison¶
| Model | Speed | Cost | Quality | Context |
|---|---|---|---|---|
| Claude 3 Opus | Slow (3s) | $15/1M input, $75/1M output | Highest | 200K |
| Claude 3 Sonnet | Medium (1.5s) | $3/1M input, $15/1M output | High | 200K |
| Claude 3 Haiku | Fast (500ms) | $0.25/1M input, $1.25/1M output | Good | 200K |
4. Code Example¶
llm = LLMFactory.create_llm(
provider="anthropic",
model="claude-3-sonnet-20240229",
temperature=0.2,
max_tokens=500
)
Model Selection Guide¶
1. By Use Case¶
| Use Case | Recommended Model | Reason |
|---|---|---|
| Production RAG (general) | Azure gpt-4o-mini | Best cost/performance, enterprise SLA |
| High-accuracy regulatory | Claude 3 Sonnet | Strong reasoning, long context |
| Development/testing | OpenAI gpt-4o-mini | Fast iteration, low cost |
| AWS-native deployment | Bedrock Claude 3 Haiku | No API keys, AWS integrated |
| Complex legal analysis | Claude 3 Opus or GPT-4 | Highest quality |
| Budget-constrained | gpt-4o-mini or Claude 3 Haiku | Lowest cost per query |
2. By Latency Requirements¶
| Latency Requirement | Model | Average Response Time |
|---|---|---|
| Real-time (<500ms) | gpt-4o-mini, gpt-3.5-turbo | 300-500ms |
| Interactive (<1s) | Claude 3 Haiku, Titan | 500-800ms |
| Standard (<2s) | gpt-4o, Claude 3 Sonnet | 1-1.5s |
| Batch/offline (>2s) | gpt-4-turbo, Claude 3 Opus | 2-3s |
3. By Cost Budget¶
Monthly Cost Estimates (100K queries/month, avg 500 input tokens, 200 output tokens):
| Model | Input Cost | Output Cost | Total/Month |
|---|---|---|---|
| gpt-4o-mini | $7.50 | $12 | $19.50 |
| gpt-3.5-turbo | $25 | $50 | $75 |
| Claude 3 Haiku | $12.50 | $25 | $37.50 |
| gpt-4o | $125 | $200 | $325 |
| Claude 3 Sonnet | $150 | $300 | $450 |
| gpt-4-turbo | $500 | $1,000 | $1,500 |
Recommendation: Start with gpt-4o-mini for production (best value).
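The figures above follow directly from the per-million-token prices. A quick sanity check of the gpt-4o-mini row:
queries_per_month = 100_000
input_tokens, output_tokens = 500, 200        # average tokens per query
input_price, output_price = 0.15, 0.60        # USD per 1M tokens (gpt-4o-mini)

input_cost = queries_per_month * input_tokens / 1_000_000 * input_price     # $7.50
output_cost = queries_per_month * output_tokens / 1_000_000 * output_price  # $12.00
print(f"Monthly total: ${input_cost + output_cost:.2f}")                     # $19.50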
Cost Optimization¶
1. Reduce Input Tokens¶
Strategy: Limit number of retrieved documents
# Default: retrieve 5 documents
rag_chain.query(query, k=5) # ~2000 input tokens
# Optimized: retrieve 3 documents
rag_chain.query(query, k=3) # ~1200 input tokens (40% reduction)
Trade-off: may miss relevant context, but cuts input-token cost by roughly 40%.
2. Use Smaller Models¶
# Higher quality (higher cost)
LLM_MODEL=gpt-4o  # $2.50/1M input
# Cost-optimized (good quality)
LLM_MODEL=gpt-4o-mini  # $0.15/1M input
3. Cache Responses¶
Implementation:
# Enable query caching (default: 1 hour TTL)
# Saves ~30-40% of LLM calls
# In .env
CACHE_ENABLED=true
CACHE_TTL=3600 # 1 hour
Savings: ~$100/month for a typical production workload.
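Conceptually, the cache keys on the normalised query text and drops entries after CACHE_TTL seconds. A minimal in-process sketch of the idea (the actual implementation may use Redis or another shared store):
import hashlib
import time

_cache: dict[str, tuple[float, dict]] = {}
CACHE_TTL = 3600  # seconds

def cached_query(rag_chain, query: str) -> dict:
    """Return a cached RAG answer if the same query was seen within the TTL."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL:
        return hit[1]                    # cache hit: no LLM call, no cost
    result = rag_chain.query(query)      # cache miss: pay for the LLM call once
    _cache[key] = (time.time(), result)
    return result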
4. Adjust Max Tokens¶
# Default: 500 tokens output (sufficient for most queries)
LLM_MAX_TOKENS=500
# Short answers: 200 tokens
LLM_MAX_TOKENS=200 # 60% cost reduction on output
# Long explanations: 1000 tokens
LLM_MAX_TOKENS=1000 # Increased cost but more detail
5. Batch Processing¶
For ETL metadata tagging:
# Process 10 documents per batch (reduce API calls)
tagger.tag_all(documents, batch_size=10)
# Add delay between batches to avoid rate limiting
time.sleep(1) # 1 second delay
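The same pattern can wrap any rate-limited call: slice the documents into fixed-size batches and pause between them. A generic sketch (the usage comment with tagger.tag_all is illustrative only):
import time
from typing import Callable

def process_in_batches(items: list, call: Callable[[list], list],
                       batch_size: int = 10, delay: float = 1.0) -> list:
    """Apply `call` to fixed-size batches of items, pausing between API calls."""
    results: list = []
    for start in range(0, len(items), batch_size):
        results.extend(call(items[start:start + batch_size]))  # one API call per batch
        time.sleep(delay)  # back off between batches to avoid rate limiting
    return results

# e.g. process_in_batches(documents, lambda batch: tagger.tag_all(batch), batch_size=10)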
Prompt Customization¶
1. Default RAG Prompt¶
Location: /backend/green_gov_rag/rag/rag_chain.py
Current Prompt: defined in the module above; see rag_chain.py for the exact wording.
2. Enhanced Regulatory Prompt¶
Customization:
prompt = f"""You are an expert assistant for Australian environmental and planning regulations.
Answer the query based ONLY on the provided context. Follow these guidelines:
1. Cite specific sections, clauses, or regulations when available
2. Use exact wording from regulations for definitions and requirements
3. Highlight jurisdiction-specific rules (federal vs. state vs. local)
4. If the context doesn't contain enough information, say so explicitly
5. Use Australian English spelling and terminology
6. Include relevant thresholds, dates, and numeric values
Context:
{context}
Query: {query}
Answer:"""
3. Domain-Specific Prompts¶
For ESG/NGER Queries:
if "nger" in query.lower() or "emissions" in query.lower():
prompt = f"""You are an expert in Australian greenhouse gas emissions reporting under NGER.
Answer the query using the provided regulatory context. Always:
- Specify emission scopes (Scope 1, 2, 3)
- Include thresholds in tonnes CO2-e
- Cite relevant NGER legislation sections
- Mention reporting deadlines when applicable
Context:
{context}
Query: {query}
Answer:"""
For Planning/Vegetation Queries:
if "tree" in query.lower() or "vegetation" in query.lower():
prompt = f"""You are an expert in Australian planning and vegetation management regulations.
Answer using the provided context. Always:
- Specify LGA-level requirements
- Cite relevant planning schemes
- Mention permit/approval requirements
- Include clearance thresholds (hectares, tree count)
Context:
{context}
Query: {query}
Answer:"""
4. Few-Shot Examples (Future)¶
prompt = f"""You are a regulatory expert. Answer based on provided context.
Example 1:
Query: What are NGER thresholds?
Answer: Under NGER, facilities must report Scope 1 emissions exceeding 25,000 tonnes CO2-e annually [Section 2.1.3]. Corporate groups report if total emissions exceed 50,000 tonnes CO2-e [Section 4.2].
Example 2:
Query: Can I clear native vegetation in Adelaide?
Answer: In the City of Adelaide (LGA 40070), native vegetation clearance requires approval under the SA Planning and Design Code [Section 5.3.2]. Exemptions apply for trees <5m height or non-significant vegetation [Clause 42(a)].
Now answer this query:
Context:
{context}
Query: {query}
Answer:"""
Temperature and Parameter Tuning¶
1. Temperature Guidelines¶
Temperature controls randomness in LLM responses:
| Temperature | Behavior | Use Case |
|---|---|---|
| 0.0 | Near-deterministic, most repeatable output | High-precision regulatory answers |
| 0.1-0.2 | Highly consistent, minimal variation | Default RAG queries |
| 0.3-0.5 | Balanced creativity and consistency | Explanations, summaries |
| 0.6-0.8 | Creative, diverse responses | Ideation, brainstorming |
| 0.9-1.0 | Highly random | Not recommended for regulatory |
Recommended for GreenGovRAG: 0.2
2. Max Tokens¶
Max Tokens limits response length:
| Token Limit | Words (~) | Use Case |
|---|---|---|
| 100 | 75 | Short answers, definitions |
| 200 | 150 | Concise answers (cost-optimized) |
| 500 | 375 | Default (balanced) |
| 1000 | 750 | Detailed explanations |
| 2000 | 1500 | Comprehensive analysis |
Recommended: 500 (sufficient for most regulatory queries)
3. Top-P (Nucleus Sampling)¶
Top-P (not currently exposed, but available via LangChain):
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="gpt-4o-mini",
temperature=0.2,
max_tokens=500,
top_p=0.9 # Consider top 90% of probability mass
)
Recommendation: Use default (top_p=1.0) for regulatory queries.
4. Frequency Penalty¶
Frequency Penalty reduces repetition:
llm = ChatOpenAI(
model="gpt-4o-mini",
temperature=0.2,
max_tokens=500,
frequency_penalty=0.5 # Penalize repeated tokens
)
Use Case: Long-form answers where repetition is undesirable.
5. Configuration Examples¶
Production RAG (default):
llm = LLMFactory.create_llm(
provider="azure",
model="gpt-4o-mini",
temperature=0.2,
max_tokens=500
)
High-Precision Regulatory:
llm = LLMFactory.create_llm(
provider="azure",
model="gpt-4o",
temperature=0.0, # Deterministic
max_tokens=300 # Concise
)
Detailed Explanations:
llm = LLMFactory.create_llm(
provider="anthropic",
model="claude-3-sonnet-20240229",
temperature=0.3, # Slightly more creative
max_tokens=1000 # Longer responses
)
Testing and Validation¶
1. Model A/B Testing¶
from green_gov_rag.rag.rag_chain import RAGChain

models = [
    {"provider": "azure", "model": "gpt-4o-mini"},
    {"provider": "azure", "model": "gpt-4o"},
    {"provider": "anthropic", "model": "claude-3-sonnet-20240229"},
]
test_queries = [
    "What are NGER Scope 1 thresholds?",
    "Can I clear native vegetation in Adelaide?",
    "What are Scope 3 Category 4 emissions?",
]
for model_config in models:
    # Build a chain for each candidate provider/model pair
    rag_chain = RAGChain(
        vector_store=vector_store,
        llm_provider=model_config["provider"],
        llm_model=model_config["model"],
    )
    for query in test_queries:
        result = rag_chain.query(query)
        print(f"Model: {model_config['model']}")
        print(f"Answer: {result['result']}")
        print(f"Sources: {len(result['source_documents'])}")
        print("---")
2. Cost Tracking¶
# Track token usage
from langchain.callbacks import get_openai_callback
with get_openai_callback() as cb:
    result = rag_chain.query(query)
    print(f"Input tokens: {cb.prompt_tokens}")
    print(f"Output tokens: {cb.completion_tokens}")
    print(f"Total cost: ${cb.total_cost:.4f}")
Troubleshooting¶
1. Rate Limiting¶
Error: RateLimitError: Rate limit exceeded
Solution:
# Add exponential backoff
from tenacity import retry, wait_exponential, stop_after_attempt
@retry(wait=wait_exponential(multiplier=1, min=2, max=10), stop=stop_after_attempt(3))
def query_with_retry(query: str):
    return rag_chain.query(query)
2. Context Length Exceeded¶
Error: InvalidRequestError: maximum context length exceeded
Solution:
# Reduce number of retrieved documents
rag_chain.query(query, k=3) # Instead of k=5
# Or use a model with longer context
LLM_MODEL=gpt-4o # 128K context vs 16K for gpt-3.5-turbo
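You can also measure how many tokens the assembled prompt will use before sending it. A sketch with tiktoken (assumes the package is installed; o200k_base is the tokenizer for the gpt-4o family):
import tiktoken

encoding = tiktoken.get_encoding("o200k_base")  # tokenizer used by gpt-4o / gpt-4o-mini

def count_tokens(text: str) -> int:
    """Rough prompt-size check before calling the model."""
    return len(encoding.encode(text))

prompt = "Context: ...\n\nQuery: What are NGER thresholds?"
if count_tokens(prompt) > 120_000:  # leave headroom below the 128K context window
    print("Prompt too long - reduce k or switch to a longer-context model")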
3. Empty Responses¶
Issue: LLM returns empty or very short responses
Solution:
# Increase max_tokens
LLM_MAX_TOKENS=1000 # Instead of 500
# Adjust temperature
LLM_TEMPERATURE=0.3 # Instead of 0.2 (less deterministic)
Next Steps¶
- Try Different Models: A/B test multiple models on your queries
- Customize Prompts: Tailor prompts for your specific regulatory domain
- Monitor Costs: Track token usage and optimize accordingly
- Read Architecture: See architecture/rag-pipeline.md
Last Updated: 2025-11-22