Some First Principles on Large Language Model Capabilities and Federal Rulemaking

When it comes to federal administrative capacity, the Trump administration is working at cross-purposes with itself. On the one hand, it is taking a hatchet to the institutions of government: slashing the federal workforce and eliminating entire agencies (or trying to). On the other hand, its regulatory goals are ambitious. It is moving aggressively to rescind numerous federal regulations while at the same time asserting federal power in new and disquieting ways.

To do “more” with less, the administration is turning to artificial intelligence. Last year, DOGE proposed to use a large language model (“LLM”) to rescind half of all federal regulations. This year, the Department of Transportation has reportedly been planning to use Google Gemini to write proposed rules—an initiative the agency characterized as the “point of the spear” of federal AI use. Agencies are now self-reporting numerous AI “use cases,” many of which relate to the rulemaking process. Among other things, agencies are developing AI tools that can sort and categorize public comments on proposed rules. And the Department of the Treasury reported a tool that can identify regulations for rescission and draft regulatory documents.

Whatever might be said about the prudence or transparency of these particular initiatives (and there is a lot to say), AI’s role in administrative decisionmaking is likely to outlast this administration. In particular, agencies are likely to continue using LLMs in the rulemaking process. LLMs are capable of generating vast quantities of prose—sometimes very good prose—at the push of a button. It is easy to see why agency policymakers, often tasked with developing dense, lengthy regulations and rule preambles, might find that capability useful. What’s more, the next administration may well face an even more acute dilemma than this one: there will be a lot to do and fewer people and institutions to do it. There are many reasons to think that even current LLMs can aid that work, to say nothing of their potential capabilities in three years.

Policymakers need a set of principles to guide the appropriate use of LLMs in the rulemaking process. Indeed, the question of what guardrails, frameworks, and “leashes” should govern AI deployment in the federal government and beyond has become a focus of policy analysis. There are many ways to approach the problem as it arises in governance, including by assessing the compatibility of LLM usage with important values like democratic accountability, technocratic judgment and discretion, transparency, safety, and public participation—not to mention the legal rules that govern the administrative process.

This essay takes a more modest first step. It analyzes (in lay terms) what LLMs are good and bad at, on the premise that LLMs might be appropriately used for tasks they are good at, but that their place in rulemaking must be circumscribed by the kinds of errors they tend to make, the risk of those errors occurring in a particular application, and the gravity of the consequences should an error occur. Seen this way, agencies might derive principles for how to deploy AI from the manner in which they oversee the work of junior employees. LLMs are capable of diligence and useful analysis, but their outputs must be subject to rigorous review by agency staff, and agency processes must be designed to ensure that oversight actually takes place.

Of course, speaking in general terms about the strengths and weaknesses of LLMs is both fraught and empirically contingent. It is hard (for a lawyer without technical training, at least) to test LLMs with any rigor. Moreover, conventional wisdom about LLM quality can quickly become stale as the technology advances. It is difficult to tease out which common LLM flaws are systemic and likely to persist and which might well recede. With that said, three qualities that LLMs are generally understood to possess illuminate their strengths and weaknesses.

First, LLMs generate text that is statistically likely in light of their training data, a process that tends to produce plausible-sounding outputs. This is a strength: LLMs are extremely fluent. They often generate responses that are sophisticated and can read as thoughtful. It is also a weakness: a text’s verisimilitude is only loosely correlated with its accuracy. Notwithstanding claims about “PhD level” intelligence, LLMs frequently make factual mistakes or omit important information. This problem is heightened by the fact that LLMs tend to sound confident whether or not their outputs are accurate and complete. Any regular LLM user knows from experience that, as often as these models produce useful outputs, they also breezily say incorrect, nonresponsive, or downright bizarre things, and sometimes act in ways that can only be characterized as cagey and evasive when called out on their inaccuracy. There is some reason to think that methods like retrieval-augmented generation and chain-of-thought prompting can improve the accuracy of LLM responses, but it is also possible that inaccuracy will remain a systemic problem. Making matters worse, LLMs’ writerly polish and air of authority might cause human users to defer to model outputs without sufficient critical thought, a phenomenon known as “automation bias.” Finally, LLMs are stochastic, meaning that an element of chance determines how they respond to prompts. The same LLM might answer the same question differently at two different times.

Second, LLMs draw upon training data consisting of an enormous corpus of human writing. So, generally speaking, if humans have written about it, LLMs can write about it. LLMs are good—often excellent—at identifying, explaining, and synthesizing concepts, including legal and technical ones, that are well represented in their training data. The flip side is that an LLM might perform poorly on factually specific questions unless relevant materials have been provided to it, either in a prompt or through retrieval. In other words, LLMs might not be able to answer specific questions from training data alone. LLMs may also reproduce errors or biases embodied in their training data. For instance, the way a particular policy problem has been written about historically may well shape what an LLM has to say.
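As a rough illustration of what supplying relevant materials can look like, the sketch below hard-codes a pair of hypothetical excerpts into a prompt and instructs the model to answer from those excerpts alone. It assumes an OpenAI-style chat completions client and a placeholder model name; in a real workflow the excerpts would come from the docket, the Code of Federal Regulations, or a retrieval system rather than being typed in by hand.

```python
# Minimal sketch: grounding an LLM's answer in supplied source material rather
# than relying on training data alone. Assumes the OpenAI Python SDK (v1+) and
# an API key in the environment; the excerpts and model name are placeholders.
from openai import OpenAI

client = OpenAI()

# Hypothetical source materials; in practice these would be pulled from the
# rulemaking record or a retrieval system.
excerpts = [
    "Excerpt A: text of the existing regulation ...",
    "Excerpt B: relevant statutory provision ...",
]

question = "Does the proposed amendment conflict with the existing regulation?"

prompt = (
    "Answer the question using ONLY the excerpts below. "
    "If the excerpts do not contain enough information, say so explicitly.\n\n"
    + "\n\n".join(excerpts)
    + f"\n\nQuestion: {question}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```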

Third, LLMs generally do as they are told. This makes them helpful. But it also means that LLMs tend to tell users what they want to hear. This phenomenon can reflect the model’s own tendency to align outputs to a user’s perceived views (“sycophancy”) or to user prompting that, consciously or not, primes LLMs to respond in a particular way. LLMs may not push back on users or add necessary context on their own initiative, and they may accede to users who push back in turn. They will also likely follow users’ implicit or explicit direction to generate weak but colorable arguments. In a phenomenon sometimes called “contextual drift,” LLMs might, over the course of a conversation, come to treat premises introduced by a user as established background, progressively shaping outputs around those premises rather than subjecting them to scrutiny. These problems might be addressed by prompting an LLM, either at a system level or in an individual exchange, to be forthright, consider competing perspectives, and flag uncertainty. For example, Anthropic’s Claude (Sonnet 4.6), directed in a system prompt to be candid, generally declines to articulate objectively wrong legal arguments, whether because of that prompt or the model’s underlying design.

These attributes point toward some intuitive principles for how LLMs might be applied to the work agencies do in the rulemaking process. At each step in rulemaking, there are tasks LLMs might be good at. As agency staffers collaborate to develop regulatory proposals, LLMs can review relevant policy and legal literatures. They can articulate competing views on a particular question and explain important concepts. They can help structure policymakers’ thinking and even play devil’s advocate, “steelmanning” and “red-teaming” proposals. They can generate and sharpen prose that might appear in a public document. After the agency gives the public notice of a proposed rule, LLMs might be able to help address comments. LLMs could conceivably categorize and summarize comments and identify within them significant arguments for and against a particular proposal, thereby helping agency staff assess the proposed policy and generate thoughtful responses. They can, in other words, act like diligent, informed subordinates: law clerks, junior attorneys, and research assistants.

But the work of subordinates—no matter how eager, well informed, or hard working—must be subject to engaged, meaningful review by managers. All the more so for a technology with known accuracy problems (and one that lacks human workers’ subjective judgment, understanding of context, and professional accountability). As a matter of good policy and legal doctrine, any LLM output created in the rulemaking process must be subject to human oversight and validation. Individual users can never treat LLM outputs as the final word on any matter (and, more ministerially, outputs must always be carefully cite-checked). And because LLMs are stochastic, users should consider prompting them more than once on the same question, or with slightly different instructions or context. Variation across responses can reveal uncertainty or inconsistency that a single output would obscure; conversely, agreement across multiple outputs provides greater confidence in a result.
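A minimal sketch of that practice, assuming an OpenAI-style chat completions client and a placeholder model name and prompt, might look like the following: the same question is posed several times and the spread of answers is inspected before anyone relies on the result.

```python
# Minimal sketch of re-asking the same question several times and checking
# whether the answers agree. Assumes the OpenAI Python SDK (v1+); the model
# name and prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Does the draft rule's definition of 'small entity' match the definition "
    "in the Regulatory Flexibility Act? Answer yes or no, then explain in one sentence."
)

answers = []
for _ in range(5):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,       # leave sampling randomness at a nontrivial value
    )
    answers.append(resp.choices[0].message.content.strip())

# Crude agreement check: do the runs even begin with the same yes/no verdict?
verdicts = [a.split()[0].lower().strip(".,:") if a else "" for a in answers]
print("Verdicts across runs:", verdicts)
if len(set(verdicts)) > 1:
    print("Responses disagree; treat the question as unsettled and review manually.")
```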

Agency leadership must also build internal processes to facilitate verification. For one thing, agencies must ensure that agency staff have the time and bandwidth to actually perform meaningful review. Oversight means little if agency personnel are inundated with more LLM-generated text than they can actually evaluate. Human capacity to oversee LLMs should thus function as a rate-limiting factor in agency LLM usage. For another, agencies should match the review required to the potential costs of an LLM mistake. As just one example, if an LLM is involved in a task that directly informs the substance of a regulation (and thus one that, if performed erroneously, could produce adverse real-world consequences), additional safeguards should be in place, like a second layer of human review by senior staff.

The need for oversight also dictates what kinds of tasks LLMs can be used to perform and how they can be prompted. Agencies should not elicit or use LLM outputs when verification is impracticable—where the effort required to perform verification is comparable to or exceeds the amount of effort that would be required to conduct the underlying analysis in the first instance. For example, an agency lawyer can decide whether an LLM has correctly concluded that a rule comports with the holding of a particular case. The lawyer would have a much harder time verifying an LLM’s conclusion that a rule is consistent with all relevant judicial precedent.

It is not yet clear whether reviewing voluminous comment records, a task for which agencies are increasingly using AI, comports with this principle. Agencies must adequately respond to significant comments, which might number in the thousands on a particular proposed rule. It is therefore important that LLMs produce accurate and complete comment summaries. It may be difficult for an agency to validate an LLM’s work without recreating it—that is, reviewing the comment record according to established, pre-AI processes. It is possible, though, that agencies could develop tests of the typical quality of LLM comment review. For instance, they might use LLMs to summarize comment records from past rulemakings and compare the results to the agency’s own analysis, or cross-check LLM summaries of comments on a new proposed rule against an appropriate sample of those the agency independently believes to be significant. But this is a situation where the potential cost of an error—a court striking down a rule for failure to respond to comments—is high. If verification is impracticable, LLM usage might not be appropriate.
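A crude version of that kind of spot-check could be as simple as the sketch below, which compares the comments an LLM flagged as significant against a sample the agency identified on its own and reports what the model missed. The comment identifiers are hypothetical placeholders, not real docket entries.

```python
# Minimal sketch of a spot-check: compare the comments an LLM flagged as
# "significant" against a sample the agency identified independently, and
# report how many of the agency's picks the model missed. The IDs below are
# hypothetical placeholders for docket comment identifiers.

llm_flagged = {"AGY-2025-0001-0042", "AGY-2025-0001-0107", "AGY-2025-0001-0311"}
agency_sample = {"AGY-2025-0001-0042", "AGY-2025-0001-0107", "AGY-2025-0001-0288"}

missed = agency_sample - llm_flagged
recall = 1 - len(missed) / len(agency_sample)

print(f"Recall against the agency's sample: {recall:.0%}")
if missed:
    print("Significant comments the LLM missed:", sorted(missed))
    # A gap here suggests the LLM's review cannot yet be relied upon without
    # re-reviewing the record through the agency's established process.
```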

The risk of sycophancy means that agencies should also develop best practices for how to prompt LLMs. LLMs should be asked to identify substantial counterarguments and objections to a position before reaching a conclusion. They should be tasked with flagging uncertainty and contested claims. They should, by default, canvass competing schools of thought when explaining a particular issue. These instructions should be imparted to LLMs through system prompts and in particular exchanges with agency staff. For their part, users should avoid prompts that embed a desired conclusion or otherwise lead an LLM. And to make sycophancy and priming more visible, agency policymakers and lawyers reviewing an LLM output should have before them the prompts used to generate it. These evenhandedness measures might also serve to mitigate—or at least help identify—improper bias embedded in LLM responses.
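By way of illustration only, such a system prompt might look something like the sketch below, paired with a neutrally worded user prompt that asks for an assessment rather than embedding the desired answer. It assumes an OpenAI-style chat completions client; the wording is illustrative, not a template any agency actually uses.

```python
# Minimal sketch of an "evenhandedness" system prompt of the kind described
# above. Assumes the OpenAI Python SDK (v1+); the model name and prompt text
# are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

system_prompt = (
    "Before stating a conclusion, identify the strongest counterarguments and "
    "objections to it. Flag any claims that are uncertain or contested. When "
    "explaining a disputed issue, describe the competing schools of thought. "
    "Do not tailor your answer to what you infer the user wants to hear."
)

# Neutral phrasing: asks for an assessment rather than a justification of a
# predetermined outcome.
user_prompt = (
    "Assess whether the record supports tightening the emissions threshold in "
    "the draft rule, and identify the main arguments on each side."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)
print(resp.choices[0].message.content)
```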

Even with these precautions, substantial risks remain. For instance, oversight and validation might become a pro forma exercise in which agency officials merely rubber-stamp LLM outputs rather than meaningfully engage with them. Even if agency staff have sufficient time to oversee LLMs, they may fall prey to automation bias, become complacent about the quality of LLM outputs, or otherwise fail to do the hard work of verification. When LLMs are used for internal purposes, like preparatory research, the risk of a user failing to confirm the accuracy, completeness, or pertinence of an LLM output might be especially high. Relatedly, overreliance on LLMs to perform intellectual work—so-called “cognitive offloading”—might erode, rather than aid, the analytic capabilities of agency staff. As Professor Bridget Dooling puts it, “[i]f you skip the writing, it’s easy to skip the thinking.” And no amount of careful prompting can eliminate the risk that agency staff under pressure will consciously or unconsciously cue an LLM to reach preferred conclusions.

Importantly, though, none of these risks is entirely new. They all exist, to some degree, in every hierarchical workplace. Senior officials have always had to delegate tasks to fallible subordinates and manage the attendant risks through proper instructions and attentive oversight. These problems may not be solved, but at least they are familiar. How agencies approach effective management of their staff might inform how they think about deploying LLMs prudently.

The upshot is, in some respects, both dramatic and modest. LLMs might well be capable of assisting with large amounts of substantive agency work. At the same time, this analysis suggests that any proposal, like DOGE’s or the Department of Transportation’s, to automate the vast majority of work on a particular rulemaking task must be viewed with extreme skepticism. The need for careful design of LLM workflows and genuine review of LLM outputs should generally preclude such ambitious deployments.

This analysis is, of course, provisional. The issues are novel, the technology is advancing rapidly, and even those who build the models have limited insight into their function and nature. This is not meant to be a comprehensive or authoritative prescription for the best practices agencies should adopt as they explore LLM deployment, much less a suggestion that these principles give rise to judicially enforceable rules of administrative procedure. And, to repeat, its analytical frame—a lay assessment of LLM capabilities—is narrow. There are many other concerns that independently weigh on whether and to what extent agencies should use LLMs. These include: whether LLM usage comports with doctrines of administrative law; whether LLM usage is consistent with the democratic accountability of federal agencies; whether LLM usage might degrade or undercut the judgment and discretion we expect federal policymakers to exercise; whether LLM usage is compatible with public administration’s aspirational values of due process, deliberation, transparency, and public participation; and whether it is ever appropriate for tools controlled by immensely powerful private companies to play a substantive role, even as an adjunct, in federal policymaking. This post leaves those questions for another day.

Jordan Ascher is Policy Counsel at Governing for Impact.
