Code Style Guide¶

Standards and conventions for writing clean, maintainable code in GreenGovRAG

Table of Contents¶

Overview
Python Code Style
Ruff Configuration
MyPy Type Checking
Docstring Standards
Import Ordering
Naming Conventions
File Organization
Code Formatting
Running Tools
Pre-commit Hooks
Common Patterns
Anti-patterns to Avoid

Overview¶

GreenGovRAG follows strict code quality standards to ensure maintainability, readability, and consistency across the codebase. We use automated tools to enforce these standards.

Core Principles:

Code should be self-documenting
Type hints are required for all functions
Documentation should explain "why", not "what"
Complexity should be minimized (max McCabe complexity: 10)
Security considerations should be addressed
Performance implications should be understood

Tools We Use:

Ruff: Fast Python linter and formatter (replaces Black, isort, flake8, pylint)
MyPy: Static type checker
Pytest: Testing framework
Pre-commit: Git hooks for automated checks

Python Code Style¶

PEP 8 Compliance¶

We follow PEP 8 with specific customizations defined in pyproject.toml.

Key Rules:

Line length: 100 characters (not 79)
Indentation: 4 spaces (never tabs)
Encoding: UTF-8
Quotes: Double quotes for strings (configurable)
Line endings: Unix-style (LF)

Line Length¶

# Good: Within 100 characters
def process_document(doc_path: str, metadata: dict[str, Any]) -> ProcessedDocument:
    """Process a document with the given metadata."""
    return processor.process(doc_path, metadata)

# Bad: Exceeds 100 characters
def process_document(doc_path: str, metadata: dict[str, Any], chunk_size: int = 1000, chunk_overlap: int = 200, enable_ocr: bool = False) -> ProcessedDocument:
    return processor.process(doc_path, metadata, chunk_size, chunk_overlap, enable_ocr)

# Good: Split long function signature
def process_document(
    doc_path: str,
    metadata: dict[str, Any],
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
    enable_ocr: bool = False,
) -> ProcessedDocument:
    """Process a document with the given metadata."""
    return processor.process(doc_path, metadata, chunk_size, chunk_overlap, enable_ocr)

Indentation¶

# Good: Consistent 4-space indentation
def query_with_filters(
    query: str,
    lga_name: str | None = None,
    document_types: list[str] | None = None,
) -> QueryResult:
    """Execute query with optional filters."""
    if lga_name:
        filters = {"lga_name": lga_name}
    else:
        filters = {}

    return search_engine.query(query, filters)

# Bad: Inconsistent indentation
def query_with_filters(
  query: str,
    lga_name: str | None = None,
      document_types: list[str] | None = None,
) -> QueryResult:
  if lga_name:
      filters = {"lga_name": lga_name}
  return search_engine.query(query, filters)

Ruff Configuration¶

Ruff is configured in backend/pyproject.toml with strict rules enabled.

Enabled Rule Categories¶

The following rule categories are enabled (see pyproject.toml for the complete list):

F (Pyflakes): Basic Python errors
E/W (pycodestyle): PEP 8 style violations
I (isort): Import sorting
N (pep8-naming): Naming conventions
D (pydocstyle): Docstring conventions
UP (pyupgrade): Modern Python syntax
ANN (flake8-annotations): Type annotations
S (flake8-bandit): Security issues
B (flake8-bugbear): Common bugs
C4 (flake8-comprehensions): List/dict comprehensions
PT (flake8-pytest-style): Pytest conventions
RUF (Ruff-specific): Ruff's own rules

Ignored Rules¶

Some rules are explicitly ignored:

# See the ignore list under [tool.ruff.lint] in pyproject.toml

# ANN101, ANN102: Type hints for self/cls (MyPy infers these; both rules were removed in Ruff 0.8, so the entry only matters on older versions)
# S101: Assert usage (allowed in tests)
# TD002, TD003: TODO formatting (relaxed)
# COM812: Trailing commas (preference)

Running Ruff¶

# Format code (modifies files)
ruff format .

# Check for issues (no modifications)
ruff check .

# Fix auto-fixable issues
ruff check --fix .

# Show explanation for a rule
ruff rule E501  # Line too long

# Check specific file
ruff check green_gov_rag/api/routes/query.py

# Check with specific rules
ruff check --select=D .  # Only docstring rules

MyPy Type Checking¶

Type Hints Requirements¶

All functions must have type hints:

# Good: Complete type hints
def fetch_documents(
    source_url: str,
    max_pages: int = 10,
    timeout: int = 30,
) -> list[Document]:
    """Fetch documents from the given source."""
    documents: list[Document] = []
    for page in range(max_pages):
        doc = fetch_page(source_url, page, timeout)
        documents.append(doc)
    return documents

# Bad: Missing type hints
def fetch_documents(source_url, max_pages=10, timeout=30):
    documents = []
    for page in range(max_pages):
        doc = fetch_page(source_url, page, timeout)
        documents.append(doc)
    return documents

Modern Type Syntax¶

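Use modern type syntax. Built-in generics such as list[str] require Python 3.9+, and X | None unions require Python 3.10+; the project targets Python 3.12, so both are always available:

# Good: Modern type syntax (built-in generics and | unions)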
def process_batch(items: list[str]) -> dict[str, int]:
    return {item: len(item) for item in items}

def get_optional_value(key: str) -> str | None:
    return cache.get(key)

# Bad: Legacy typing aliases (only needed before Python 3.9/3.10)
from typing import List, Dict, Optional

def process_batch(items: List[str]) -> Dict[str, int]:
    return {item: len(item) for item in items}

def get_optional_value(key: str) -> Optional[str]:
    return cache.get(key)

Complex Types¶

from collections.abc import Callable, Iterable
from typing import Protocol, TypeVar

# Good: Proper generic types
T = TypeVar("T")

def first_or_none(items: Iterable[T]) -> T | None:
    """Return first item or None if empty."""
    return next(iter(items), None)

# Good: Callable types
def retry_on_error(
    func: Callable[[str], bool],
    max_retries: int = 3,
) -> bool:
    """Retry function on error."""
    for _ in range(max_retries):
        try:
            return func("test")
        except Exception:
            continue
    return False

# Good: Protocol for structural typing
class Processor(Protocol):
    """Protocol for document processors."""

    def process(self, text: str) -> str:
        """Process text."""
        ...
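To show what structural typing buys in practice, here is a small sketch that repeats the Processor protocol and adds a class that satisfies it without inheriting from it. `WhitespaceNormalizer` and `run_pipeline` are hypothetical names invented for this example, and `runtime_checkable` is added only so that `isinstance` can be demonstrated.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Processor(Protocol):
    """Protocol for document processors."""

    def process(self, text: str) -> str:
        """Process text."""
        ...


class WhitespaceNormalizer:
    """Hypothetical processor; note that it does not inherit from Processor."""

    def process(self, text: str) -> str:
        """Collapse runs of whitespace into single spaces."""
        return " ".join(text.split())


def run_pipeline(text: str, steps: list[Processor]) -> str:
    """Apply each processor to the text in order."""
    for step in steps:
        text = step.process(text)
    return text


print(run_pipeline("native   vegetation\n clearance", [WhitespaceNormalizer()]))
```

MyPy accepts `WhitespaceNormalizer` wherever a `Processor` is expected because the method signatures match. The runtime `isinstance` check is weaker: it only verifies that a `process` attribute exists, not that its signature is compatible.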

Type Checking Configuration¶

MyPy is configured in backend/pyproject.toml:

[tool.mypy]
python_version = "3.12"
warn_unused_ignores = true
ignore_missing_imports = true
warn_redundant_casts = true
strict_optional = true
check_untyped_defs = true

Running MyPy¶

# Check entire codebase
mypy green_gov_rag tests

# Check specific file
mypy green_gov_rag/api/routes/query.py

# Show error codes
mypy --show-error-codes green_gov_rag

# Generate an HTML type-coverage report (requires the lxml package)
mypy --html-report mypy-report green_gov_rag

Docstring Standards¶

We use Google-style docstrings exclusively.

Module Docstrings¶

"""Document ingestion and processing module.

This module provides functionality for ingesting documents from various sources,
parsing them into structured format, and storing metadata in the database.

Example:
    >>> from green_gov_rag.etl import ingest
    >>> result = ingest.process_source("epbc_act")
    >>> print(f"Processed {result.count} documents")
"""

Function Docstrings¶

def query_with_location(
    query: str,
    lga_name: str,
    buffer_km: float = 0.0,
    top_k: int = 5,
) -> QueryResult:
    """Execute RAG query with geospatial filtering.

    Retrieves relevant regulatory documents based on the query text and filters
    by Local Government Area. Optionally expands the search to neighboring LGAs
    within the specified buffer distance.

    Args:
        query: The user's question or search query.
        lga_name: Name of the Local Government Area (e.g., "Adelaide City Council").
        buffer_km: Distance in kilometers to expand search radius. Defaults to 0.0.
        top_k: Number of top results to return. Defaults to 5.

    Returns:
        QueryResult containing the answer, source documents, and metadata.

    Raises:
        ValueError: If lga_name is empty or invalid.
        VectorStoreError: If vector store connection fails.
        LLMError: If LLM API call fails.

    Example:
        >>> result = query_with_location(
        ...     query="Can I clear native vegetation?",
        ...     lga_name="Adelaide Hills Council",
        ...     buffer_km=10.0,
        ... )
        >>> print(result.answer)
    """
    if not lga_name:
        raise ValueError("lga_name cannot be empty")

    # Implementation...
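Example sections written with `>>>` prompts can be executed with the standard-library doctest module, which keeps them from drifting out of date. The sketch below assumes a hypothetical helper, `normalize_lga_name`, documented in the Google style described above; `check_examples` is likewise an illustrative name, not part of the project.

```python
import doctest
from collections.abc import Callable


def normalize_lga_name(name: str) -> str:
    """Normalize a Local Government Area name for comparison.

    Args:
        name: Raw LGA name, possibly with stray whitespace or mixed case.

    Returns:
        The name with surrounding whitespace removed and each word capitalized.

    Raises:
        ValueError: If name is empty after stripping.

    Example:
        >>> normalize_lga_name("  adelaide hills council ")
        'Adelaide Hills Council'
    """
    cleaned = name.strip()
    if not cleaned:
        raise ValueError("name cannot be empty")
    return cleaned.title()


def check_examples(func: Callable[..., object]) -> doctest.TestResults:
    """Run the interactive examples in a function's docstring."""
    test = doctest.DocTestParser().get_doctest(
        func.__doc__ or "",
        globs={func.__name__: func},  # names visible to the examples
        name=func.__name__,
        filename=None,
        lineno=0,
    )
    return doctest.DocTestRunner(verbose=False).run(test)


results = check_examples(normalize_lga_name)
print(f"{results.attempted} example(s) run, {results.failed} failed")
```

Examples that depend on external services, such as the vector store call in the docstring above, are better left as illustrative text or covered by regular tests.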

Class Docstrings¶

class DocumentProcessor:
    """Process regulatory documents for ingestion into the RAG system.

    This class handles document parsing, chunking, and metadata extraction
    for various document formats including PDF, HTML, and Word documents.

    Attributes:
        chunk_size: Maximum size of text chunks in tokens.
        chunk_overlap: Number of overlapping tokens between chunks.
        enable_ocr: Whether to use OCR for scanned documents.

    Example:
        >>> processor = DocumentProcessor(chunk_size=1000)
        >>> chunks = processor.process_file("path/to/document.pdf")
        >>> print(f"Created {len(chunks)} chunks")
    """

    def __init__(
        self,
        chunk_size: int = 1000,
        chunk_overlap: int = 200,
        enable_ocr: bool = False,
    ) -> None:
        """Initialize the document processor.

        Args:
            chunk_size: Maximum chunk size in tokens.
            chunk_overlap: Overlapping tokens between chunks.
            enable_ocr: Enable OCR for scanned PDFs.
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.enable_ocr = enable_ocr

Property Docstrings¶

@property
def is_healthy(self) -> bool:
    """Check if the service is healthy.

    Returns:
        True if all dependencies are reachable, False otherwise.
    """
    return self._check_database() and self._check_vector_store()

Import Ordering¶

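Imports are sorted by Ruff's isort-compatible rules (the I category). Note that ruff format does not reorder imports; run ruff check --fix (or ruff check --select I --fix) to apply the sorting.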

Import Order¶

Standard library imports
Third-party imports
Local application imports

# Good: Proper import ordering
import os
import sys
from datetime import datetime
from pathlib import Path

import numpy as np
from fastapi import APIRouter, Depends, HTTPException
from langchain_community.vectorstores import FAISS
from sqlmodel import Session, select

from green_gov_rag.api.schemas import QueryRequest, QueryResponse
from green_gov_rag.config import settings
from green_gov_rag.models.database import get_session
from green_gov_rag.rag.enhanced_response import EnhancedRAGPipeline

# Bad: Random import order
from green_gov_rag.config import settings
import os
from fastapi import APIRouter
from green_gov_rag.models.database import get_session
import numpy as np
from datetime import datetime

Import Style¶

# Good: Explicit imports
from green_gov_rag.rag.vector_store import VectorStoreFactory
from green_gov_rag.models.document import Document, DocumentMetadata

# Acceptable: Module import for many items
from green_gov_rag import models

# Bad: Star imports (never use)
from green_gov_rag.rag.vector_store import *

Conditional Imports¶

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Import only for type checking, not runtime
    from green_gov_rag.rag.llm_factory import LLMProvider
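A name imported under TYPE_CHECKING does not exist at runtime, so annotations that use it must not be evaluated eagerly. One way to handle this is to quote the annotation, as in the sketch below. `decimal.Decimal` stands in for a heavy or circular project import, and `format_amount` is a hypothetical function.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by MyPy only; this branch never runs at runtime.
    from decimal import Decimal  # stand-in for a heavy or circular import


def format_amount(amount: "Decimal") -> str:
    """Format an amount with two decimal places."""
    # The quoted annotation stays a plain string, so Decimal is never looked up here.
    return f"{amount:.2f}"


print(format_amount.__annotations__["amount"])
```

Placing `from __future__ import annotations` at the top of the module has a similar effect without the quotes. Libraries that inspect annotations at runtime (FastAPI and Pydantic, for example) may still need the real import.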

Naming Conventions¶

Variables and Functions¶

# Good: snake_case for variables and functions
document_count = 0
max_retry_attempts = 3

def fetch_documents() -> list[Document]:
    """Fetch documents from database."""
    pass

def calculate_trust_score(sources: list[Source]) -> float:
    """Calculate trust score from sources."""
    pass

# Bad: camelCase or PascalCase for variables/functions
documentCount = 0
maxRetryAttempts = 3

def FetchDocuments():
    pass

Classes¶

# Good: PascalCase for classes
class DocumentProcessor:
    """Process documents."""
    pass

class EnhancedRAGPipeline:
    """Enhanced RAG pipeline."""
    pass

# Bad: snake_case or other styles
class document_processor:
    pass

class enhanced_rag_pipeline:
    pass

Constants¶

# Good: UPPER_CASE for constants
MAX_CHUNK_SIZE = 1000
DEFAULT_EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
API_VERSION = "v1"

# Bad: Other cases for constants
max_chunk_size = 1000
defaultEmbeddingModel = "..."

Private Members¶

class DocumentCache:
    """Cache for documents."""

    def __init__(self) -> None:
        # Good: Leading underscore for private
        self._cache: dict[str, Document] = {}
        self._hit_count = 0

    def _update_stats(self) -> None:
        """Private method for internal use."""
        self._hit_count += 1

    # Good: Public API
    def get(self, key: str) -> Document | None:
        """Get document from cache."""
        doc = self._cache.get(key)
        if doc:
            self._update_stats()
        return doc

Type Variables¶

# Good: PascalCase for type variables
T = TypeVar("T")
DocumentT = TypeVar("DocumentT", bound=Document)

# Bad: Other cases
t = TypeVar("t")
DOCUMENT_T = TypeVar("DOCUMENT_T")
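The naming rules in this section can be summarized as three patterns. The sketch below is a rough approximation for illustration only (Ruff's N rules are the real check); the regular expressions and the `follows_convention` helper are assumptions, and dunder names such as `__init__` are deliberately not handled.

```python
import re

SNAKE_CASE = re.compile(r"^_?[a-z][a-z0-9]*(_[a-z0-9]+)*$")
PASCAL_CASE = re.compile(r"^_?[A-Z][a-zA-Z0-9]*$")
UPPER_CASE = re.compile(r"^_?[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")

# Which pattern applies to which kind of name.
PATTERN_FOR_KIND = {
    "variable": SNAKE_CASE,
    "function": SNAKE_CASE,
    "module": SNAKE_CASE,
    "class": PASCAL_CASE,
    "type_variable": PASCAL_CASE,
    "constant": UPPER_CASE,
}


def follows_convention(name: str, kind: str) -> bool:
    """Return True if name matches the pattern for the given kind."""
    return PATTERN_FOR_KIND[kind].match(name) is not None


for name, kind in [
    ("fetch_documents", "function"),
    ("FetchDocuments", "function"),
    ("DocumentProcessor", "class"),
    ("MAX_CHUNK_SIZE", "constant"),
]:
    print(f"{name:20} {kind:10} {follows_convention(name, kind)}")
```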

File Organization¶

Module Structure¶

"""Module docstring here."""

# Standard library imports
import os
from datetime import datetime
from pathlib import Path

# Third-party imports
from fastapi import APIRouter
from langchain_community.vectorstores import FAISS

# Local imports
from green_gov_rag.config import settings
from green_gov_rag.models import Document

# Constants
MAX_RETRIES = 3
DEFAULT_TIMEOUT = 30

# Type definitions
DocumentList = list[Document]

# Classes
class DocumentProcessor:
    """Process documents."""
    pass

# Functions
def process_document(doc: Document) -> ProcessedDocument:
    """Process a document."""
    pass

# Main execution
if __name__ == "__main__":
    main()

File Naming¶

# Good: snake_case for modules
enhanced_response.py
vector_store.py
metadata_tagger.py

# Bad: Other cases
enhancedResponse.py
VectorStore.py
MetadataTagger.py

Code Formatting¶

Blank Lines¶

# Good: Proper spacing
class DocumentProcessor:
    """Process documents."""

    def __init__(self) -> None:
        """Initialize processor."""
        self.cache: dict[str, Document] = {}

    def process(self, doc: Document) -> ProcessedDocument:
        """Process document."""
        # Function implementation
        pass


class VectorStore:
    """Vector store implementation."""
    pass


def standalone_function() -> None:
    """Standalone function."""
    pass

# Bad: Inconsistent spacing
class DocumentProcessor:
    def __init__(self) -> None:
        self.cache = {}
    def process(self, doc: Document) -> ProcessedDocument:
        pass
class VectorStore:
    pass

Line Breaks¶

# Good: Logical line breaks
def query_documents(
    query: str,
    filters: dict[str, Any] | None = None,
    top_k: int = 5,
) -> list[Document]:
    """Query documents with filters."""
    if filters is None:
        filters = {}

    results = vector_store.similarity_search(
        query,
        k=top_k,
        filter=filters,
    )

    return results

# Bad: No logical grouping
def query_documents(query: str, filters: dict[str, Any] | None = None, top_k: int = 5) -> list[Document]:
    if filters is None:
        filters = {}
    results = vector_store.similarity_search(query, k=top_k, filter=filters)
    return results

String Quotes¶

# Good: Double quotes (default)
message = "Processing document"
query = "SELECT * FROM documents WHERE type = 'regulation'"

# Acceptable: Single quotes for strings containing double quotes
html = '<div class="container">Content</div>'

# Good: Triple double quotes for docstrings
def func() -> None:
    """This is a docstring."""
    pass

Running Tools¶

Format and Lint Workflow¶

# 1. Format code with Ruff
cd backend
ruff format .

# 2. Fix auto-fixable lint issues
ruff check --fix .

# 3. Check remaining issues
ruff check .

# 4. Run type checking
mypy green_gov_rag tests

# 5. Run tests
pytest

CI/CD Checks¶

All of these checks run in CI/CD:

# Format check (fails if not formatted)
ruff format --check .

# Lint check (fails if issues found)
ruff check .

# Type check
mypy green_gov_rag tests

# Test with coverage
pytest --cov=green_gov_rag --cov-report=xml

Using Make Commands¶

If a Makefile is available:

make format    # Format code
make lint      # Check linting
make mypy      # Type check
make test      # Run tests
make check-all # Run all checks

Pre-commit Hooks¶

Pre-commit hooks automatically run checks before commits.

Install Hooks¶

pip install pre-commit
pre-commit install

Manual Run¶

# Run on all files
pre-commit run --all-files

# Run specific hook
pre-commit run ruff --all-files
pre-commit run mypy --all-files

Skip Hooks (Use Sparingly)¶

# Skip all hooks for a commit
git commit --no-verify -m "message"

# Better: Fix the issues instead of skipping

Common Patterns¶

Error Handling¶

# Good: Specific exceptions with context
def fetch_document(doc_id: str) -> Document:
    """Fetch document by ID."""
    try:
        doc = database.get(doc_id)
    except DatabaseError as e:
        # Log the full error server-side; keep internal details out of the response
        logger.exception("Failed to fetch document %s", doc_id)
        raise HTTPException(
            status_code=500,
            detail="Database error while fetching document",
        ) from e

    if doc is None:
        raise HTTPException(
            status_code=404,
            detail=f"Document {doc_id} not found",
        )

    return doc

# Bad: Bare except, no context
def fetch_document(doc_id: str):
    try:
        doc = database.get(doc_id)
        return doc
    except:
        return None
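The `raise ... from e` form used in the good example keeps the original exception attached as `__cause__`, so tracebacks show both errors. Here is a self-contained sketch of the same idea without FastAPI; the exception classes and `load_document` are hypothetical, and a plain dict stands in for the database.

```python
class DocumentStoreError(Exception):
    """Hypothetical base class for document store failures."""


class DocumentNotFoundError(DocumentStoreError):
    """Raised when a document ID is not present in the store."""


def load_document(doc_id: str, store: dict[str, str]) -> str:
    """Return the document text for doc_id, translating low-level errors."""
    try:
        return store[doc_id]
    except KeyError as e:
        # Chain the original error so the root cause is not lost.
        raise DocumentNotFoundError(f"Document {doc_id!r} not found") from e


try:
    load_document("missing", {"epbc": "Environment Protection Act text"})
except DocumentNotFoundError as err:
    print(f"{err} (caused by {type(err.__cause__).__name__})")
```

Callers can catch the narrow `DocumentNotFoundError` or the broader `DocumentStoreError`, which is harder to do when a function swallows every exception and returns None.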

Context Managers¶

# Good: Use context managers for resources
def process_file(file_path: Path) -> str:
    """Process file contents."""
    with file_path.open("r", encoding="utf-8") as f:
        content = f.read()
    return content.strip()

# Bad: Manual resource management
def process_file(file_path: Path) -> str:
    f = file_path.open("r")
    content = f.read()
    f.close()
    return content.strip()

Comprehensions¶

# Good: Clear comprehensions
doc_ids = [doc.id for doc in documents if doc.is_active]
metadata = {doc.id: doc.title for doc in documents}

# Bad: Complex nested comprehensions
result = {
    doc.id: [
        chunk.text for chunk in doc.chunks
        if chunk.length > 100 and chunk.has_metadata
    ]
    for doc in documents
    if doc.is_active and doc.type in ["regulation", "policy"]
}

# Better: Break down complex logic
active_docs = [doc for doc in documents if doc.is_active]
result = {}
for doc in active_docs:
    if doc.type in ["regulation", "policy"]:
        large_chunks = [
            chunk.text for chunk in doc.chunks
            if chunk.length > 100 and chunk.has_metadata
        ]
        result[doc.id] = large_chunks

Anti-patterns to Avoid¶

Magic Numbers¶

# Bad: Magic numbers
def chunk_text(text: str) -> list[str]:
    chunks = []
    for i in range(0, len(text), 1000):
        chunks.append(text[i:i + 1000])
    return chunks

# Good: Named constants
DEFAULT_CHUNK_SIZE = 1000

def chunk_text(text: str, chunk_size: int = DEFAULT_CHUNK_SIZE) -> list[str]:
    """Split text into chunks."""
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i + chunk_size])
    return chunks

Mutable Default Arguments¶

# Bad: Mutable default argument
def add_document(doc: Document, tags: list[str] = []) -> None:
    tags.append("default")
    doc.tags = tags

# Good: None default with initialization
def add_document(doc: Document, tags: list[str] | None = None) -> None:
    """Add document with tags."""
    if tags is None:
        tags = []
    tags.append("default")
    doc.tags = tags

Overly Complex Functions¶

# Bad: High complexity (McCabe complexity > 10)
def process_document(doc: Document) -> ProcessedDocument:
    if doc.type == "pdf":
        if doc.is_scanned:
            if doc.language == "en":
                # ... many nested conditions
                pass

# Good: Break down into smaller functions
def process_document(doc: Document) -> ProcessedDocument:
    """Process document based on type."""
    if doc.type == "pdf":
        return process_pdf(doc)
    elif doc.type == "html":
        return process_html(doc)
    else:
        return process_generic(doc)

def process_pdf(doc: Document) -> ProcessedDocument:
    """Process PDF document."""
    if doc.is_scanned:
        return process_scanned_pdf(doc)
    return process_text_pdf(doc)
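To make the complexity limit of 10 more concrete, the following sketch approximates McCabe complexity as "decision points plus one" using the standard-library ast module. It is an illustration under simplifying assumptions, not the algorithm Ruff uses, so its counts may differ from what the linter reports; `approximate_complexity` is a hypothetical name.

```python
import ast
import textwrap

_BRANCHES = (ast.If, ast.IfExp, ast.For, ast.AsyncFor, ast.While, ast.ExceptHandler)


def approximate_complexity(source: str) -> int:
    """Estimate cyclomatic complexity of a code snippet as decision points + 1."""
    tree = ast.parse(textwrap.dedent(source))
    complexity = 1  # a single straight-line path
    for node in ast.walk(tree):
        if isinstance(node, _BRANCHES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # one for the loop, one per `if` filter
            complexity += 1 + len(node.ifs)
    return complexity


NESTED = """
def process_document(doc):
    if doc.type == "pdf":
        if doc.is_scanned:
            if doc.language == "en":
                return "ocr-en"
    return "generic"
"""

print(approximate_complexity(NESTED))
```

Each additional nested condition raises the count by one, which is why splitting a function into helpers such as process_pdf keeps every individual function well under the limit.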

Ready to write tests? Continue to the Testing Guide to learn about our testing practices!