Models and Benchmarks Thrust
The Models and Benchmarks team at PermitAI is dedicated to advancing and rigorously evaluating AI systems for environmental review and permitting, with a particular focus on NEPA workflows. Through systematic evaluation of off-the-shelf large language models, the team has identified key limitations in applying general-purpose models to regulatory domains such as NEPA. In response, they leverage the NEPATEC data lakehouse to develop domain-adapted models and, critically, a growing suite of benchmarks that measure model performance on real-world permitting tasks.
Custom Model Development
The team prioritizes smaller language models, ranging from 1 to 7 billion parameters, which balance task performance against resource consumption and keep inference costs and energy usage low. These models are tailored for tasks including comment processing evaluation, GIS analytics, and NEPA document drafting.
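To give a rough sense of why the 1 to 7 billion parameter range keeps resource use low, the sketch below estimates the memory needed just to hold model weights at common numeric precisions. It is a back-of-the-envelope illustration, not a description of how PermitAI deploys its models: the function name `weight_memory_gb` and the precision table are assumptions introduced here, and the estimate ignores activations, key-value caches, and framework overhead.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Hypothetical model sizes spanning the range mentioned in the text.
    for params in (1e9, 3e9, 7e9):
        sizes = {d: weight_memory_gb(params, d) for d in BYTES_PER_PARAM}
        print(f"{params / 1e9:.0f}B parameters: {sizes}")
```

Under these assumptions, a 7-billion-parameter model stored in 16-bit precision needs about 14 GB for its weights, which is within reach of a single modern accelerator.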
Benchmarking and Evaluation
A core initiative of this thrust area is the creation of NEPABench, a suite of benchmarks that assess AI model performance on real-world NEPA tasks.
NEPABench
NEPABench is PermitAI’s comprehensive benchmark suite for environmental permitting. Rather than focusing on a single task, NEPABench evaluates AI systems across the full permitting lifecycle, including question answering, document drafting, information extraction, and public comment processing.
The suite integrates a range of task-specific benchmarks, including:
- NEPAQuAD (question answering and regulatory reasoning)
- DraftNEPABench (document drafting)
- EIS-Bench (metadata extraction from EIS documents)
- EA-Bench (metadata extraction from EA documents)
- Tribe-Bench (tribal entity identification and consultation analysis)
- FedReg-Bench (structured extraction from Federal Register notices)
- Comment-Bench (public comment delineation, categorization, and summarization)
By unifying these capabilities into a single framework, NEPABench enables more realistic, end-to-end evaluation of AI systems operating in regulatory environments. The suite currently encompasses more than 10,000 evaluation instances across diverse task types and document sources.
DraftNEPABench
DraftNEPABench evaluates the ability of large language models and agent-based systems to draft sections of Environmental Impact Statements (EIS). The benchmark consists of expert-curated drafting tasks derived from real-world NEPA documents, requiring models to synthesize information from multiple technical, regulatory, and scientific sources into coherent, structured text. Results demonstrate that agent-based approaches significantly improve drafting performance compared to standard methods, while also highlighting the continued need for human oversight in high-stakes regulatory contexts.
Innovative Tools
In addition to developing benchmarks, the team has established automated and human evaluation procedures to rigorously examine the effectiveness and safety of models and applications prior to public release. They have also introduced MAPLE, a cloud API-friendly assessment pipeline designed for seamless evaluation of large language models against benchmarks like NEPABench.
MAPLE
MAPLE (Multi-context Assessment Pipeline for Language Model Evaluation) is PermitAI’s modular evaluation framework for benchmarking AI models across NEPA tasks. Initially released as MAPLE v1.0, the framework provided a standardized pipeline for evaluating models on question answering and document retrieval tasks across multiple context settings, including no-context, document-level, retrieval-augmented, and gold-context evaluation.
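The four context settings named above differ only in what supporting text the model sees alongside the question. The sketch below illustrates that idea; it is not MAPLE's actual code or API. The names `ContextSetting`, `QAExample`, `retrieve`, and `build_prompt` are hypothetical, the prompt template is an assumption, and the word-overlap retriever is a toy stand-in for whatever retrieval method a real pipeline would use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ContextSetting(Enum):
    NO_CONTEXT = "no-context"
    DOCUMENT_LEVEL = "document-level"
    RETRIEVAL_AUGMENTED = "retrieval-augmented"
    GOLD_CONTEXT = "gold-context"


@dataclass
class QAExample:
    question: str
    document: str      # full text of the source document
    gold_passage: str  # passage known to contain the answer


def retrieve(question: str, document: str, top_k: int = 1) -> List[str]:
    """Toy retriever: rank paragraphs by word overlap with the question."""
    q_words = set(question.lower().split())
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    ranked = sorted(
        paragraphs,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(example: QAExample, setting: ContextSetting) -> str:
    """Assemble the model input for one example under one context setting."""
    if setting is ContextSetting.NO_CONTEXT:
        context = ""
    elif setting is ContextSetting.DOCUMENT_LEVEL:
        context = example.document
    elif setting is ContextSetting.RETRIEVAL_AUGMENTED:
        context = "\n\n".join(retrieve(example.question, example.document))
    else:  # GOLD_CONTEXT
        context = example.gold_passage

    if context:
        return f"Context:\n{context}\n\nQuestion: {example.question}\nAnswer:"
    return f"Question: {example.question}\nAnswer:"
```

Running the same question through all four settings separates what a model knows on its own (no-context) from how well it uses long inputs (document-level), retrieved passages (retrieval-augmented), or ideal evidence (gold-context).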
Building on this foundation, MAPLE v2 expands support to a broader range of permitting tasks, including information extraction, structured data processing, and public comment analysis. It introduces task-specific evaluators and enhanced scoring modules, enabling consistent and reproducible evaluation across the full NEPABench suite. Together, these versions establish MAPLE as the core infrastructure for assessing model performance in real-world environmental permitting workflows.
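One common way to organize task-specific evaluators is a registry that maps each task name to its own scoring function, so a single loop can score any benchmark in a suite. The sketch below shows that pattern under stated assumptions: the registry, the task names, and the choice of exact match and token-level F1 as metrics are illustrative only and are not taken from MAPLE v2.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: task name -> scoring function(prediction, reference).
EVALUATORS: Dict[str, Callable[[str, str], float]] = {}


def register(task: str):
    """Decorator that adds a scoring function to the registry under a task name."""
    def wrap(fn: Callable[[str, str], float]):
        EVALUATORS[task] = fn
        return fn
    return wrap


@register("information_extraction")
def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


@register("question_answering")
def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(task: str, pairs: List[Tuple[str, str]]) -> float:
    """Mean score over (prediction, reference) pairs using the task's evaluator."""
    scorer = EVALUATORS[task]
    scores = [scorer(pred, ref) for pred, ref in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

With this layout, supporting a new task means registering one more scoring function, while the evaluation loop stays unchanged, which helps keep results consistent and reproducible across tasks.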
PermitAI is committed to enabling rapid model prototyping, allowing researchers to experiment with diverse architectures, algorithms, and preprocessing techniques and thereby fostering innovation and efficiency in environmental review and permitting processes.
PermitAI Models and Benchmarks Team
- Anurag Acharya (Model and Benchmark, Technical Lead)
- Sadie Montgomery (Model and Benchmark, Domain Lead)
- Rounak Meyur
- Koby Hayashi
- Bishal Lakha
- Anusha Devulapally
- Henry Warmerdam