LLM Configuration Guide¶
This guide covers configuration and optimization of Large Language Models (LLMs) in GreenGovRAG, including provider setup, model selection, cost optimization, and prompt engineering.
Table of Contents¶
- Supported LLM Providers
- OpenAI Configuration
- Azure OpenAI Configuration
- AWS Bedrock Configuration
- Anthropic Configuration
- Model Selection Guide
- Cost Optimization
- Prompt Customization
- Temperature and Parameter Tuning
Supported LLM Providers¶
GreenGovRAG uses a factory pattern to expose multiple LLM providers through a single LangChain-based interface.
Module: /backend/green_gov_rag/rag/llm_factory.py
Supported Providers:
| Provider | Models | Strengths | Use Case |
|---|---|---|---|
| OpenAI | GPT-4, GPT-4-turbo, GPT-3.5-turbo, GPT-4o | Best general performance | Development, testing |
| Azure OpenAI | Same as OpenAI | Enterprise SLAs, data residency | Production (recommended) |
| AWS Bedrock | Claude 3, Titan, Llama 2 | AWS-native, no API keys | AWS deployments |
| Anthropic | Claude 3 Opus, Sonnet, Haiku | Strong reasoning, long context | Complex regulatory queries |
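Internally the factory dispatches on the provider name and returns the matching LangChain chat model. The real llm_factory.py may differ in detail, but a minimal sketch of that pattern (assuming the langchain-openai, langchain-aws, and langchain-anthropic packages) looks like this:
import os
from langchain_openai import ChatOpenAI, AzureChatOpenAI
from langchain_aws import ChatBedrock
from langchain_anthropic import ChatAnthropic

def create_llm(provider: str, model: str, temperature: float = 0.2, max_tokens: int = 500):
    """Return a LangChain chat model for the requested provider."""
    if provider == "openai":
        # Reads OPENAI_API_KEY from the environment
        return ChatOpenAI(model=model, temperature=temperature, max_tokens=max_tokens)
    if provider == "azure":
        # Reads AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY from the environment
        return AzureChatOpenAI(
            azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"],
            api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2024-02-15-preview"),
            temperature=temperature,
            max_tokens=max_tokens,
        )
    if provider == "bedrock":
        # Uses the standard AWS credential chain (env vars, shared profile, or IAM role)
        return ChatBedrock(
            model_id=model,
            region_name=os.environ.get("AWS_REGION", "us-east-1"),
            model_kwargs={"temperature": temperature, "max_tokens": max_tokens},
        )
    if provider == "anthropic":
        # Reads ANTHROPIC_API_KEY from the environment
        return ChatAnthropic(model=model, temperature=temperature, max_tokens=max_tokens)
    raise ValueError(f"Unsupported LLM provider: {provider}")
Because every branch returns a LangChain chat model, the RAG chain can treat all four providers identically.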
OpenAI Configuration¶
1. Prerequisites¶
- OpenAI API key from https://platform.openai.com/api-keys
- Sufficient quota/credits
2. Environment Variables¶
File: /backend/.env
# LLM Provider Selection
LLM_PROVIDER=openai
# OpenAI Configuration
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxx
LLM_MODEL=gpt-4o-mini # or gpt-4, gpt-4-turbo, gpt-3.5-turbo
# Generation Parameters
LLM_TEMPERATURE=0.2
LLM_MAX_TOKENS=500
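These values are typically loaded from backend/.env at startup. A minimal sketch of reading them and passing them to the factory (python-dotenv assumed; the project's actual settings loader may differ):
import os
from dotenv import load_dotenv
from green_gov_rag.rag.llm_factory import LLMFactory

load_dotenv()  # read backend/.env into the process environment

llm = LLMFactory.create_llm(
    provider=os.getenv("LLM_PROVIDER", "openai"),
    model=os.getenv("LLM_MODEL", "gpt-4o-mini"),
    temperature=float(os.getenv("LLM_TEMPERATURE", "0.2")),
    max_tokens=int(os.getenv("LLM_MAX_TOKENS", "500")),
)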
3. Code Example¶
from green_gov_rag.rag.llm_factory import LLMFactory
# Create LLM instance
llm = LLMFactory.create_llm(
provider="openai",
model="gpt-4o-mini",
temperature=0.2,
max_tokens=500
)
# Use in RAG chain
from green_gov_rag.rag.rag_chain import RAGChain
rag_chain = RAGChain(
vector_store=vector_store,
llm_provider="openai",
llm_model="gpt-4o-mini"
)
result = rag_chain.query("What are NGER thresholds?")
4. Model Comparison¶
| Model | Speed | Cost (input) | Quality | Context |
|---|---|---|---|---|
| gpt-4o-mini | Fast (500ms) | $0.15/1M tokens | Good | 128K |
| gpt-4o | Medium (1s) | $2.50/1M tokens | Excellent | 128K |
| gpt-4-turbo | Medium (1.5s) | $10/1M tokens | Excellent | 128K |
| gpt-3.5-turbo | Very fast (300ms) | $0.50/1M tokens | Good | 16K |
Recommended: gpt-4o-mini for best cost/performance ratio.
Azure OpenAI Configuration¶
1. Prerequisites¶
- Azure subscription
- Azure OpenAI resource created in Azure Portal
- Deployment created for the specific model (e.g., gpt-4o-mini)
2. Setup Steps¶
Step 1: Create Azure OpenAI resource
# Via Azure Portal
1. Navigate to Azure Portal
2. Create Resource → Azure OpenAI
3. Select region (e.g., Australia East)
4. Choose pricing tier (Standard S0)
Step 2: Create model deployment
# In Azure OpenAI Studio
1. Go to Deployments
2. Create New Deployment
3. Model: gpt-4o-mini (or gpt-4, gpt-35-turbo)
4. Deployment name: gpt-4o-mini-deployment
5. Deploy
Step 3: Get credentials
# From Azure Portal → Your OpenAI Resource → Keys and Endpoint
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxx
3. Environment Variables¶
# LLM Provider Selection
LLM_PROVIDER=azure
# Azure OpenAI Configuration
AZURE_OPENAI_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini-deployment # Your deployment name
AZURE_OPENAI_API_VERSION=2024-02-15-preview # API version
# Model Selection
LLM_MODEL=gpt-4o-mini # Base model name
# Generation Parameters
LLM_TEMPERATURE=0.2
LLM_MAX_TOKENS=500
4. Code Example¶
from green_gov_rag.rag.llm_factory import LLMFactory
llm = LLMFactory.create_llm(
provider="azure",
model="gpt-4o-mini",
temperature=0.2,
max_tokens=500
)
# Environment variables automatically loaded
# Uses AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT
5. Benefits of Azure OpenAI¶
- Enterprise SLAs: 99.9% uptime guarantee
- Data Residency: Data stays in Australia (Australia East region)
- No API Keys in Code: Managed Identity support
- Cost Control: Dedicated capacity prevents rate limiting
- Compliance: SOC 2, ISO 27001, HIPAA certified
Pricing (Australia East):
- gpt-4o-mini: $0.15/1M input tokens, $0.60/1M output tokens
- gpt-4o: $2.50/1M input tokens, $10/1M output tokens
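To take advantage of the Managed Identity support noted above, the API key can be replaced with an Entra ID token provider. A hedged sketch (assumes the azure-identity package; the configuration documented above uses API-key auth):
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from langchain_openai import AzureChatOpenAI

# Exchange the workload's managed identity for an Azure OpenAI token (no API key in config)
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)

llm = AzureChatOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com/",
    azure_deployment="gpt-4o-mini-deployment",
    api_version="2024-02-15-preview",
    azure_ad_token_provider=token_provider,
    temperature=0.2,
    max_tokens=500,
)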
AWS Bedrock Configuration¶
1. Prerequisites¶
- AWS account with Bedrock access
- IAM user with Bedrock permissions
- Model access enabled in AWS Console
2. Enable Model Access¶
# Via AWS Console
1. Navigate to AWS Bedrock
2. Model Access → Manage Model Access
3. Enable: Anthropic Claude 3 Sonnet, Claude 3 Haiku, Amazon Titan
4. Request access (usually instant approval)
3. IAM Permissions¶
Policy (bedrock-policy.json):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream"
],
"Resource": [
"arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
"arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"
]
}
]
}
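Before wiring the credentials into GreenGovRAG, a quick boto3 check confirms that model access has been granted (a sketch; assumes boto3 is installed and AWS credentials are configured):
import boto3

# List the Anthropic foundation models visible to this account in the chosen region
bedrock = boto3.client("bedrock", region_name="us-east-1")
response = bedrock.list_foundation_models(byProvider="Anthropic")
for summary in response["modelSummaries"]:
    print(summary["modelId"])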
4. Environment Variables¶
# LLM Provider Selection
LLM_PROVIDER=bedrock
# AWS Credentials
AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AWS_REGION=us-east-1 # Bedrock available in us-east-1, us-west-2, eu-west-1
# Model Selection (Bedrock model ID)
BEDROCK_MODEL_ID=anthropic.claude-3-sonnet-20240229-v1:0
# Or: anthropic.claude-3-haiku-20240307-v1:0
# Or: amazon.titan-text-premier-v1:0
# Generation Parameters
LLM_TEMPERATURE=0.2
LLM_MAX_TOKENS=500
5. Supported Models¶
| Model ID | Model | Speed | Cost | Context |
|---|---|---|---|---|
| anthropic.claude-3-sonnet-20240229-v1:0 | Claude 3 Sonnet | Medium | $3/1M input, $15/1M output | 200K |
| anthropic.claude-3-haiku-20240307-v1:0 | Claude 3 Haiku | Fast | $0.25/1M input, $1.25/1M output | 200K |
| amazon.titan-text-premier-v1:0 | Titan Text Premier | Fast | $0.50/1M tokens | 32K |
6. Code Example¶
llm = LLMFactory.create_llm(
provider="bedrock",
model="anthropic.claude-3-sonnet-20240229-v1:0",
temperature=0.2,
max_tokens=500
)
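For ad-hoc use outside the factory (for example in a notebook), the equivalent direct LangChain call looks roughly like this (a sketch using the langchain-aws package; the factory above is the supported path):
from langchain_aws import ChatBedrock

# Uses the standard AWS credential chain (env vars, shared profile, or IAM role)
llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name="us-east-1",
    model_kwargs={"temperature": 0.2, "max_tokens": 500},
)
print(llm.invoke("What are NGER Scope 1 thresholds?").content)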
Anthropic Configuration¶
1. Prerequisites¶
- Anthropic API key from https://console.anthropic.com/
- Sufficient credits
2. Environment Variables¶
# LLM Provider Selection
LLM_PROVIDER=anthropic
# Anthropic Configuration
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxxxxxxxxx
# Model Selection
LLM_MODEL=claude-3-sonnet-20240229
# Or: claude-3-opus-20240229, claude-3-haiku-20240307
# Generation Parameters
LLM_TEMPERATURE=0.2
LLM_MAX_TOKENS=500
3. Model Comparison¶
| Model | Speed | Cost | Quality | Context |
|---|---|---|---|---|
| Claude 3 Opus | Slow (3s) | $15/1M input, $75/1M output | Highest | 200K |
| Claude 3 Sonnet | Medium (1.5s) | $3/1M input, $15/1M output | High | 200K |
| Claude 3 Haiku | Fast (500ms) | $0.25/1M input, $1.25/1M output | Good | 200K |
4. Code Example¶
llm = LLMFactory.create_llm(
provider="anthropic",
model="claude-3-sonnet-20240229",
temperature=0.2,
max_tokens=500
)
Model Selection Guide¶
1. By Use Case¶
| Use Case | Recommended Model | Reason |
|---|---|---|
| Production RAG (general) | Azure gpt-4o-mini | Best cost/performance, enterprise SLA |
| High-accuracy regulatory | Claude 3 Sonnet | Strong reasoning, long context |
| Development/testing | OpenAI gpt-4o-mini | Fast iteration, low cost |
| AWS-native deployment | Bedrock Claude 3 Haiku | No API keys, AWS integrated |
| Complex legal analysis | Claude 3 Opus or GPT-4 | Highest quality |
| Budget-constrained | gpt-4o-mini or Claude 3 Haiku | Lowest cost per query |
2. By Latency Requirements¶
| Latency Requirement | Model | Average Response Time |
|---|---|---|
| Real-time (<500ms) | gpt-4o-mini, gpt-3.5-turbo | 300-500ms |
| Interactive (<1s) | Claude 3 Haiku, Titan | 500-800ms |
| Standard (<2s) | gpt-4o, Claude 3 Sonnet | 1-1.5s |
| Batch/offline (>2s) | gpt-4-turbo, Claude 3 Opus | 2-3s |
3. By Cost Budget¶
Monthly Cost Estimates (100K queries/month, avg 500 input tokens, 200 output tokens):
| Model | Input Cost | Output Cost | Total/Month |
|---|---|---|---|
| gpt-4o-mini | $7.50 | $12 | $19.50 |
| gpt-3.5-turbo | $25 | $50 | $75 |
| Claude 3 Haiku | $12.50 | $25 | $37.50 |
| gpt-4o | $125 | $200 | $325 |
| Claude 3 Sonnet | $150 | $300 | $450 |
| gpt-4-turbo | $500 | $1,000 | $1,500 |
Recommendation: Start with gpt-4o-mini for production (best value).
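The figures above follow directly from the per-million-token prices. A quick sanity check of the gpt-4o-mini row:
queries_per_month = 100_000
input_tokens, output_tokens = 500, 200        # average tokens per query
input_price, output_price = 0.15, 0.60        # USD per 1M tokens (gpt-4o-mini)

input_cost = queries_per_month * input_tokens / 1_000_000 * input_price     # $7.50
output_cost = queries_per_month * output_tokens / 1_000_000 * output_price  # $12.00
print(f"Monthly total: ${input_cost + output_cost:.2f}")                     # $19.50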
Cost Optimization¶
1. Reduce Input Tokens¶
Strategy: Limit number of retrieved documents
# Default: retrieve 5 documents
rag_chain.query(query, k=5) # ~2000 input tokens
# Optimized: retrieve 3 documents
rag_chain.query(query, k=3) # ~1200 input tokens (40% reduction)
Trade-off: may miss relevant context, but cuts input-token cost by roughly 40%.
2. Use Smaller Models¶
# Higher quality (higher cost)
LLM_MODEL=gpt-4o  # $2.50/1M input
# Cost-optimized (good quality)
LLM_MODEL=gpt-4o-mini  # $0.15/1M input
3. Cache Responses¶
Implementation:
# Enable query caching (default: 1 hour TTL)
# Saves ~30-40% of LLM calls
# In .env
CACHE_ENABLED=true
CACHE_TTL=3600 # 1 hour
Savings: ~$100/month for a typical production workload.
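Conceptually, the cache keys on the normalised query text and drops entries after CACHE_TTL seconds. A minimal in-process sketch of the idea (the actual implementation may use Redis or another shared store):
import hashlib
import time

_cache: dict[str, tuple[float, dict]] = {}
CACHE_TTL = 3600  # seconds

def cached_query(rag_chain, query: str) -> dict:
    """Return a cached RAG answer if the same query was seen within the TTL."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL:
        return hit[1]                    # cache hit: no LLM call, no cost
    result = rag_chain.query(query)      # cache miss: pay for the LLM call once
    _cache[key] = (time.time(), result)
    return result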
4. Adjust Max Tokens¶
# Default: 500 tokens output (sufficient for most queries)
LLM_MAX_TOKENS=500
# Short answers: 200 tokens
LLM_MAX_TOKENS=200 # 60% cost reduction on output
# Long explanations: 1000 tokens
LLM_MAX_TOKENS=1000 # Increased cost but more detail
5. Batch Processing¶
For ETL metadata tagging:
# Process 10 documents per batch (reduce API calls)
tagger.tag_all(documents, batch_size=10)
# Add delay between batches to avoid rate limiting
time.sleep(1) # 1 second delay
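The same pattern can wrap any rate-limited call: slice the documents into fixed-size batches and pause between them. A generic sketch (the usage comment with tagger.tag_all is illustrative only):
import time
from typing import Callable

def process_in_batches(items: list, call: Callable[[list], list],
                       batch_size: int = 10, delay: float = 1.0) -> list:
    """Apply `call` to fixed-size batches of items, pausing between API calls."""
    results: list = []
    for start in range(0, len(items), batch_size):
        results.extend(call(items[start:start + batch_size]))  # one API call per batch
        time.sleep(delay)  # back off between batches to avoid rate limiting
    return results

# e.g. process_in_batches(documents, lambda batch: tagger.tag_all(batch), batch_size=10)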
Prompt Customization¶
1. Default RAG Prompt¶
Location: /backend/green_gov_rag/rag/rag_chain.py
Current Prompt: defined in the module above; see rag_chain.py for the exact wording.
2. Enhanced Regulatory Prompt¶
Customization:
prompt = f"""You are an expert assistant for Australian environmental and planning regulations.
Answer the query based ONLY on the provided context. Follow these guidelines:
1. Cite specific sections, clauses, or regulations when available
2. Use exact wording from regulations for definitions and requirements
3. Highlight jurisdiction-specific rules (federal vs. state vs. local)
4. If the context doesn't contain enough information, say so explicitly
5. Use Australian English spelling and terminology
6. Include relevant thresholds, dates, and numeric values
Context:
{context}
Query: {query}
Answer:"""
3. Domain-Specific Prompts¶
For ESG/NGER Queries:
if "nger" in query.lower() or "emissions" in query.lower():
prompt = f"""You are an expert in Australian greenhouse gas emissions reporting under NGER.
Answer the query using the provided regulatory context. Always:
- Specify emission scopes (Scope 1, 2, 3)
- Include thresholds in tonnes CO2-e
- Cite relevant NGER legislation sections
- Mention reporting deadlines when applicable
Context:
{context}
Query: {query}
Answer:"""
For Planning/Vegetation Queries:
if "tree" in query.lower() or "vegetation" in query.lower():
prompt = f"""You are an expert in Australian planning and vegetation management regulations.
Answer using the provided context. Always:
- Specify LGA-level requirements
- Cite relevant planning schemes
- Mention permit/approval requirements
- Include clearance thresholds (hectares, tree count)
Context:
{context}
Query: {query}
Answer:"""
4. Few-Shot Examples (Future)¶
prompt = f"""You are a regulatory expert. Answer based on provided context.
Example 1:
Query: What are NGER thresholds?
Answer: Under NGER, facilities must report Scope 1 emissions exceeding 25,000 tonnes CO2-e annually [Section 2.1.3]. Corporate groups report if total emissions exceed 50,000 tonnes CO2-e [Section 4.2].
Example 2:
Query: Can I clear native vegetation in Adelaide?
Answer: In the City of Adelaide (LGA 40070), native vegetation clearance requires approval under the SA Planning and Design Code [Section 5.3.2]. Exemptions apply for trees <5m height or non-significant vegetation [Clause 42(a)].
Now answer this query:
Context:
{context}
Query: {query}
Answer:"""
Temperature and Parameter Tuning¶
1. Temperature Guidelines¶
Temperature controls randomness in LLM responses:
| Temperature | Behavior | Use Case |
|---|---|---|
| 0.0 | Near-deterministic, most repeatable output | High-precision regulatory answers |
| 0.1-0.2 | Highly consistent, minimal variation | Default RAG queries |
| 0.3-0.5 | Balanced creativity and consistency | Explanations, summaries |
| 0.6-0.8 | Creative, diverse responses | Ideation, brainstorming |
| 0.9-1.0 | Highly random | Not recommended for regulatory |
Recommended for GreenGovRAG: 0.2
2. Max Tokens¶
Max Tokens limits response length:
| Token Limit | Words (~) | Use Case |
|---|---|---|
| 100 | 75 | Short answers, definitions |
| 200 | 150 | Concise answers (cost-optimized) |
| 500 | 375 | Default (balanced) |
| 1000 | 750 | Detailed explanations |
| 2000 | 1500 | Comprehensive analysis |
Recommended: 500 (sufficient for most regulatory queries)
3. Top-P (Nucleus Sampling)¶
Top-P (not currently exposed, but available via LangChain):
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="gpt-4o-mini",
temperature=0.2,
max_tokens=500,
top_p=0.9 # Consider top 90% of probability mass
)
Recommendation: Use default (top_p=1.0) for regulatory queries.
4. Frequency Penalty¶
Frequency Penalty reduces repetition:
llm = ChatOpenAI(
model="gpt-4o-mini",
temperature=0.2,
max_tokens=500,
frequency_penalty=0.5 # Penalize repeated tokens
)
Use Case: Long-form answers where repetition is undesirable.
5. Configuration Examples¶
Production RAG (default):
llm = LLMFactory.create_llm(
provider="azure",
model="gpt-4o-mini",
temperature=0.2,
max_tokens=500
)
High-Precision Regulatory:
llm = LLMFactory.create_llm(
provider="azure",
model="gpt-4o",
temperature=0.0, # Deterministic
max_tokens=300 # Concise
)
Detailed Explanations:
llm = LLMFactory.create_llm(
provider="anthropic",
model="claude-3-sonnet-20240229",
temperature=0.3, # Slightly more creative
max_tokens=1000 # Longer responses
)
Testing and Validation¶
1. Model A/B Testing¶
from green_gov_rag.rag.rag_chain import RAGChain

models = [
    {"provider": "azure", "model": "gpt-4o-mini"},
    {"provider": "azure", "model": "gpt-4o"},
    {"provider": "anthropic", "model": "claude-3-sonnet-20240229"},
]
test_queries = [
    "What are NGER Scope 1 thresholds?",
    "Can I clear native vegetation in Adelaide?",
    "What are Scope 3 Category 4 emissions?",
]
for model_config in models:
    # Build a chain for each candidate provider/model pair
    rag_chain = RAGChain(
        vector_store=vector_store,
        llm_provider=model_config["provider"],
        llm_model=model_config["model"],
    )
    for query in test_queries:
        result = rag_chain.query(query)
        print(f"Model: {model_config['model']}")
        print(f"Answer: {result['result']}")
        print(f"Sources: {len(result['source_documents'])}")
        print("---")
2. Cost Tracking¶
# Track token usage
from langchain.callbacks import get_openai_callback
with get_openai_callback() as cb:
    result = rag_chain.query(query)
    print(f"Input tokens: {cb.prompt_tokens}")
    print(f"Output tokens: {cb.completion_tokens}")
    print(f"Total cost: ${cb.total_cost:.4f}")
Troubleshooting¶
1. Rate Limiting¶
Error: RateLimitError: Rate limit exceeded
Solution:
# Add exponential backoff
from tenacity import retry, wait_exponential, stop_after_attempt
@retry(wait=wait_exponential(multiplier=1, min=2, max=10), stop=stop_after_attempt(3))
def query_with_retry(query: str):
    return rag_chain.query(query)
2. Context Length Exceeded¶
Error: InvalidRequestError: maximum context length exceeded
Solution:
# Reduce number of retrieved documents
rag_chain.query(query, k=3) # Instead of k=5
# Or use a model with longer context
LLM_MODEL=gpt-4o # 128K context vs 16K for gpt-3.5-turbo
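You can also measure how many tokens the assembled prompt will use before sending it. A sketch with tiktoken (assumes the package is installed; o200k_base is the tokenizer for the gpt-4o family):
import tiktoken

encoding = tiktoken.get_encoding("o200k_base")  # tokenizer used by gpt-4o / gpt-4o-mini

def count_tokens(text: str) -> int:
    """Rough prompt-size check before calling the model."""
    return len(encoding.encode(text))

prompt = "Context: ...\n\nQuery: What are NGER thresholds?"
if count_tokens(prompt) > 120_000:  # leave headroom below the 128K context window
    print("Prompt too long - reduce k or switch to a longer-context model")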
3. Empty Responses¶
Issue: LLM returns empty or very short responses
Solution:
# Increase max_tokens
LLM_MAX_TOKENS=1000 # Instead of 500
# Adjust temperature
LLM_TEMPERATURE=0.3 # Instead of 0.2 (less deterministic)
Next Steps¶
- Try Different Models: A/B test multiple models on your queries
- Customize Prompts: Tailor prompts for your specific regulatory domain
- Monitor Costs: Track token usage and optimize accordingly
- Read Architecture: See architecture/rag-pipeline.md
Last Updated: 2025-11-22