{"id":4618,"date":"2026-01-24T15:43:49","date_gmt":"2026-01-24T20:43:49","guid":{"rendered":"https:\/\/journals.law.harvard.edu\/jol\/?p=4618"},"modified":"2026-01-24T15:49:36","modified_gmt":"2026-01-24T20:49:36","slug":"generative-misinterpretation","status":"publish","type":"post","link":"https:\/\/journals.law.harvard.edu\/jol\/2026\/01\/24\/generative-misinterpretation\/","title":{"rendered":"Generative Misinterpretation"},"content":{"rendered":"<p><strong><span style=\"float: left\">James Grimmelman<a href=\"#_edn1\" name=\"_ednref1\">[*]<\/a><\/span><\/strong><\/p>\n<p><strong><span style=\"float: left\">, Benjamin L.W. Sobel<a href=\"#_edn2\" name=\"_ednref2\">[\u2020]<\/a><\/span><\/strong><\/p>\n<p><strong><span style=\"float: left\">, David Stein<a href=\"#_edn3\" name=\"_ednref3\">[\u2021]<\/a><\/span><\/strong><\/p>\n<h4>\u00a0<\/h4>\n<h4 style=\"text-align: center\"><span style=\"font-variant: small-caps\">Abstract<\/span><\/h4>\n<p><em>In a series of provocative experiments, a loose group of scholars, lawyers, and judges has endorsed generative interpretation: asking large language models (LLMs) like ChatGPT and Claude to resolve interpretive issues from actual cases. With varying degrees of confidence, they argue that LLMs are (or will soon be) able to assist\u2014or even replace\u2014judges in performing interpretive tasks like determining the meaning of a term in a contract or statute. A few go even further and argue for using LLMs to decide entire cases and to generate opinions supporting those decisions.<\/em><\/p>\n<p><em>We respectfully dissent. In this Article, we show that LLMs are not yet fit for use in judicial chambers. Generative interpretation, like all empirical methods, must bridge two gaps to be useful and legitimate. The first is a reliability gap: are its methods consistent and reproducible enough to be trusted in high-stakes, real-world settings? 
Unfortunately, as we show, LLM proponents\u2019 experimental results are brittle and frequently arbitrary. The second is an epistemic gap: do these methods measure what they purport to? Here, LLM proponents have pointed to (1) LLMs\u2019 training processes on large datasets, (2) empirical measures of LLM outputs, (3) the rhetorical persuasiveness of those outputs, and (4) the assumed predictability of algorithmic methods. We show, however, that all of these justifications rest on unstated and faulty premises about the nature of LLMs and the nature of judging.<\/em><\/p>\n<p><em>The superficial fluency of LLM-generated text conceals fundamental gaps between what these models are currently capable of and what legal interpretation requires to be methodologically and socially legitimate. Put simply, any human or computer can put words on a page, but it takes something more to turn those words into a legitimate act of legal interpretation. LLM proponents do not yet have a plausible story of what that \u201csomething more\u201d comprises.<\/em><\/p>\n<h4 style=\"text-align: center\">I. <span style=\"font-variant: small-caps\">Introduction<br \/><\/span><\/h4>\n<p style=\"text-align: center\"><em>\u201cThis isn\u2019t right. It\u2019s not even wrong.\u201d<\/em><sup class=\"modern-footnotes-footnote \" data-mfn=\"1\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-1\">1<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-1\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"1\">Attributed to Wolfgang Pauli. Michael Shermer, <em>Wronger Than Wrong<\/em>, <span style=\"font-variant: small-caps\">Sci. Am.<\/span> (Nov. 
1, 2006), https:\/\/www.scientificamerican.com\/article\/wronger-than-wrong [https:\/\/perma.cc\/YHL5-39NY].<\/span><\/p>\n<p>\u201cGenerative interpretation\u201d is all the rage.<sup class=\"modern-footnotes-footnote \" data-mfn=\"2\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-2\">2<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-2\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"2\"><em>See generally <\/em>Yonathan Arbel &amp; David A. Hoffman, <em>Generative Interpretation<\/em>, 99 <span style=\"font-variant: small-caps\">N.Y.U. L. Rev.<\/span> 451 (2024) [hereinafter <em>Generative Interpretation<\/em>] (introducing the concept of \u201cgenerative interpretation\u201d).<\/span> Instead of asking a human judge to interpret a contract, a statute, or other legal text, generative interpretation assigns the task to a large language model (LLM). The idea itself is not new; versions of computer-assisted interpretation have been bouncing around in legal scholarship for over a decade. But the launch of ChatGPT in Fall 2022, followed by rapid improvements and the launch of impressive competitors like Claude, Llama, and Gemini, has led some to argue that the future is now.<\/p>\n<p>In a series of provocative experiments, scholars, judges, and lawyers have fed legal materials into LLMs to generate plausible-seeming answers to real-life interpretive questions. 
Yonathan Arbel and David Hoffman\u2019s <em>Generative Interpretation<\/em> uses a series of case studies to argue that LLMs can interpret contractual terms well enough that they are ready to assist courts and parties in contract litigation.<sup class=\"modern-footnotes-footnote \" data-mfn=\"3\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-3\">3<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-3\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"3\"><em>See id. <\/em>at 458 (\u201c[G]enerative interpretation is good enough for many cases that currently employ more expensive . . . methodologies.\u201d).<\/span> In a pair of concurring opinions, Judge Kevin Newsom has experimented with asking LLMs to interpret key terms from the cases before him.<sup class=\"modern-footnotes-footnote \" data-mfn=\"4\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-4\">4<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-4\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"4\"><em>See <\/em>Snell v. United Specialty Ins. Co., 102 F.4th 1208, 1224\u201325 (11th Cir. 2024) (Newsom, J., concurring) (describing the use of LLMs to determine the ordinary meaning of \u201clandscaping\u201d); United States v. Deleon, 116 F.4th 1260, 1272\u201375 (11th Cir. 
2024) (Newsom, J., concurring) (describing the use of LLMs to determine the ordinary meaning of \u201cphysically restrained\u201d and comparing the outputs of different models).<\/span> And in a series of blog posts, appellate lawyer Adam Unikowsky has fed entire case files into Claude, asking it to generate judicial opinions.<sup class=\"modern-footnotes-footnote \" data-mfn=\"5\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-5\">5<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-5\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"5\"><em>See <\/em>Adam Unikowsky, <em>In AI We Trust<\/em>, <span style=\"font-variant: small-caps\">Adam\u2019s Legal Newsletter<\/span> (June 8, 2024), https:\/\/adamunikowsky.substack.com\/p\/in-ai-we-trust [https:\/\/perma.cc\/7NSQ-VDM9] [hereinafter <em>In AI We Trust I<\/em>]; Adam Unikowsky, <em>In AI We Trust, Part II<\/em>, <span style=\"font-variant: small-caps\">Adam\u2019s Legal Newsletter<\/span> (June 16, 2024), https:\/\/adamunikowsky.substack.com\/p\/in-ai-we-trust-part-ii [https:\/\/perma.cc\/BF5E-Q27X] [hereinafter <em>In AI We Trust II<\/em>]; Adam Unikowsky, <em>A Brief History of the Confrontation Clause<\/em>, <span style=\"font-variant: small-caps\">Adam\u2019s Legal Newsletter<\/span> (June 26, 2024), https:\/\/adamunikowsky.substack.com\/p\/a-brief-history-of-the-confrontation [https:\/\/perma.cc\/5PEL-7Z4Z] [hereinafter <em>A Brief History<\/em>]; Adam Unikowsky, <em>Automating Criminal Appeals<\/em>, <span style=\"font-variant: small-caps\">Adam\u2019s Legal Newsletter<\/span> (Sep. 
18, 2024), https:\/\/adamunikowsky.substack.com\/p\/automating-criminal-appeals [https:\/\/perma.cc\/JZ3X-KMFA] [hereinafter <em>Automating Criminal Appeals<\/em>]; Adam Unikowsky, <em>Ignore the Future<\/em>, <span style=\"font-variant: small-caps\">Adam\u2019s Legal Newsletter<\/span> (Oct. 21, 2024), https:\/\/adamunikowsky.substack.com\/p\/ignore-the-future [https:\/\/perma.cc\/GQN4-LF8H] [hereinafter <em>Ignore the Future<\/em>].<\/span><\/p>\n<p>These LLM proponents are joined by their admirable curiosity and willingness to experiment with new methods, and by their cautious optimism about generative interpretation. Their bottom lines are broadly similar: LLMs are already \u201cgood enough,\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"6\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-6\">6<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-6\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"6\">For varying definitions of \u201cgood enough,\u201d <em>see infra<\/em> Part II.D.<\/span> and judges should seriously consider trusting them to assist with interpretive work in actual cases. They argue that generative interpretation has both quality and quantity advantages over purely human judging. Quality, because LLMs make inferences based on huge amounts of data on contemporary language use, and because automated processes can be objective in a way that fallible and biased humans cannot. And quantity, because LLMs can be deployed cheaply and at scale.<\/p>\n<p>We respectfully dissent. Generative interpretation in its current form is Potemkin interpretation: an attractive facade with nothing behind it. The superficial fluency of LLM-generated text is deeply misleading. 
LLMs are remarkable text-producing machines, but the way they produce that text falls far short of what the standard account of adjudication demands.<\/p>\n<p>Put simply, any human or computer can put words on a page, but it takes something more to turn those words into a legitimate act of legal interpretation. LLM proponents have no convincing story of what that something more entails. Currently, society treats judges\u2019 opinions as legitimate (most of the time) because they represent a principled working forward from source materials to a persuasive legal conclusion.<sup class=\"modern-footnotes-footnote \" data-mfn=\"7\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-7\">7<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-7\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"7\"><em>See generally<\/em> Neil Duxbury, <em>Faith in Reason: The Process Tradition in American Jurisprudence<\/em>, 15 <span style=\"font-variant: small-caps\">Cardozo L. Rev.<\/span> 601 (1993) (discussing Legal Process School\u2019s emphasis on reason-giving).<\/span> LLM proponents treat the persuasiveness of the output as proof that the process was principled. But the lesson of their experiments is precisely the opposite: they have shown that it is possible to produce persuasive, law-like rhetoric without going through the hard work of legal reasoning.<\/p>\n<p>While LLM proponents are right to point out LLMs\u2019 potential and to experiment with them, they are wrong to the extent that they claim that LLMs are currently fit for purpose at legal interpretation. 
LLMs are already good at some kinds of tasks\u2014such as writing JavaScript code\u2014where the output text is valued for what it does.<sup class=\"modern-footnotes-footnote \" data-mfn=\"8\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-8\">8<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-8\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"8\"><em>See, e.g.<\/em>, Simon Willison, <em>Here\u2019s How I Use LLMs to Help Me Write Code<\/em>, <span style=\"font-variant: small-caps\">Simon Willison\u2019s Weblog<\/span> (Mar. 11, 2025), https:\/\/simonwillison.net\/2025\/Mar\/11\/using-llms-for-code [https:\/\/perma.cc\/9N34-ZVMB] (providing a tutorial on how to use LLMs to write code).<\/span> They may even be good at some kinds of legal writing: under the right circumstances and with proper adult supervision, it is possible to imagine them helping draft contracts or briefs reliably and effectively.<sup class=\"modern-footnotes-footnote \" data-mfn=\"9\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-9\">9<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-9\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"9\"><em>Compare<\/em> Daniel Martin Katz, Michael James Bommarito, Shang Gao, &amp; Pablo Arredondo, <em>GPT-4 Passes the Bar Exam<\/em>, 382 <span style=\"font-variant: small-caps\">Phil. Transactions Royal Soc\u2019y<\/span>, Feb. 26, 2024, at<span style=\"font-variant: small-caps\"> 5\u20138 <\/span>(arguing that a leading LLM is capable of passing the bar exam), <em>and<\/em> Andrew Blair-Stanek, Donald G. 
Gifford, Mark Graber, Guha Krishnamurthi, Jeff Sovern, Donald B. Tobin &amp; Michael P. Van Alstine, <em>AI Gets Its First Law School A+s<\/em> (forthcoming) (manuscript at 4), https:\/\/papers.ssrn.com\/sol3\/papers.cfm?abstract_id=5274547 [https:\/\/perma.cc\/3JL4-Z9TS] (\u201cAI models can now perform at an A+ level on some law school final exams.\u201d), <em>with <\/em>Andrew Blair-Stanek, Nils Holzenberger, &amp; Benjamin Van Durme, <em>BLT: Can Large Language Models Handle Basic Legal Text?<\/em>, <em>in<\/em> <span style=\"font-variant: small-caps\">Proceedings of the Natural Legal Language Processing Workshop 2024 <\/span>216, 216 (\u201c[W]e find that leading publicly available LLMs perform poorly out-of-the-box.\u201d).<\/span> But legal interpretation is different because adjudication is different; it plays a different social role and must satisfy different constraints. Human interpretation varies\u2014somewhat more or less reliable, somewhat more or less persuasive, somewhat more or less affordable\u2014but is fundamentally the same kind of thing. LLM interpretation is something different in kind. It is a simulacrum of interpretation, one that bears no necessary relationship to the realities of linguistic meaning.<sup class=\"modern-footnotes-footnote \" data-mfn=\"10\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-10\">10<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-10\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"10\"><em>See generally<\/em> <span style=\"font-variant: small-caps\">Jean Baudrillard, Simulacra and Simulation<\/span> (Sheila Faria Glaser trans., U. Michigan Press 1994) (1981) (offering classic account of simulacra).<\/span> Perhaps this chasm can be bridged. 
The first, necessary step in doing so is to admit that the chasm exists.<\/p>\n<p>In particular, generative interpretation must overcome two challenges to be useful and legitimate when used in the judicial process. First, it faces a <em>reliability<\/em> gap: LLM proponents have not yet established that their methods reliably measure anything. The examples they have proffered are intriguing and worthy of further study, but they are very far from being sufficiently replicable to be regarded as reliable.<sup class=\"modern-footnotes-footnote \" data-mfn=\"11\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-11\">11<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-11\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"11\"><em>Cf.<\/em> <span style=\"font-variant: small-caps\">Fed. R. Evid. <\/span>702(c) (requiring that expert testimony be \u201cthe product of reliable principles and methods\u201d); Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 589 (1993) (confirming that the Federal Rules of Evidence require that \u201call scientific testimony or evidence admitted [be] . . . reliable.\u201d).<\/span> A consumer survey with a sample size of n=1 and an interviewer who improvised the questions on the spot would be unreliable<sup class=\"modern-footnotes-footnote \" data-mfn=\"12\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-12\">12<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-12\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"12\"><em>Cf<\/em>. Gail M. Sullivan &amp; Anthony R. 
Artino, <em>How to Create a Bad Survey Instrument<\/em>, 9 <span style=\"font-variant: small-caps\">J. Grad. Med. Educ.<\/span> 411, 412 (2017) (listing common issues in survey design).<\/span>\u2014no matter how plausible or how dramatic the results were\u2014because repeating the survey would lead to very different results. In a similar way, generative interpretation is unreliable if small tweaks to the prompt or the system flip the outputs. The generative interpretation experiments reported in the literature purport to be \u201cinformed conclusion[s] based on a statistical analysis of billions of texts.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"13\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-13\">13<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-13\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"13\"><em>Generative Interpretation<\/em>, <em>supra<\/em> note 2, at 458.<\/span> Our examination of their own examples shows, however, that the results might be better described as unwarranted extrapolations from extraneous factors.<\/p>\n<p>Second, generative interpretation faces an <em>epistemic<\/em> gap: LLM proponents have not yet established that their methods measure what they purport to. Even if those methods were robust and replicable, it would all be for naught if they were measuring the wrong thing. 
A study to determine the surface temperature of the sun by surveying breakfast-cereal consumers would be epistemically invalid, even if it had a rigorously specified survey protocol, interviewed an immense sample size, and passed every statistical test for significance\u2014simply because <em>what people think the sun\u2019s surface temperature is<\/em> is not a good measure of <em>what the sun\u2019s surface temperature actually is<\/em>. In a similar way, generative interpretation is epistemically invalid if its text predictions do not correspond to the linguistic judgments of the relevant community of speakers and listeners. This gap is most obvious for generative adjudication, in which LLMs are tasked with generating entire opinions, but we will show that it is present for the seemingly more constrained task of generative interpretation as well.<\/p>\n<p>The reliability gap and the epistemic gap are closely related. Indeed, we think that the beginning of wisdom is to recognize that both gaps must be bridged for generative interpretation to be useful. Some of the most thoughtful attempts to deal with the reliability gap raise the biggest epistemic problems, and vice versa. 
Arbel and Hoffman\u2019s precise measurements of the inner workings of an LLM, for example, help make generative methods more reliable\u2014but they beg the question of why those measurements are a good epistemic measure of meaning.<sup class=\"modern-footnotes-footnote \" data-mfn=\"14\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-14\">14<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-14\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"14\"><em>See, e.g.<\/em>, <em>id.<\/em> at 484\u201385 (examining probability of outputs from LLMs to help gauge the use of \u201cEnglish in its most public and common sense.\u201d).<\/span> Conversely, Unikowsky\u2019s <em>res ipsa loquitur<\/em> willingness to accept LLM outputs as authoritative because they read persuasively gives a (debatable) response to the epistemic question, but it does nothing to answer the reliability question.<sup class=\"modern-footnotes-footnote \" data-mfn=\"15\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-15\">15<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-15\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"15\"><em>See, e.g.<\/em>, <em>In AI We Trust I<\/em>, <em>supra<\/em> note 5.<\/span> Taking both of these gaps seriously <em>at the same time<\/em> is essential, and it is here that we think the LLM proponents have the most work ahead of them.<\/p>\n<p>Our arguments dovetail with recent work by Brandon Waldon, Nathan Schneider, Ethan Wilcox, Amir Zeldes, and Kevin Tobia. 
In a forthcoming paper, they persuasively show that arguments for LLMs as authoritative \u201c\u2018super-judges\u2019 of ordinary meaning rest[] upon fundamental myths about how these tools work.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"16\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-16\">16<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-16\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"16\">Brandon Waldon, Nathan Schneider, Ethan Wilcox, Amir Zeldes &amp; Kevin Tobia, <em>Large Language Models for Legal Interpretation? Don\u2019t Take Their Word for It<\/em>, 114 <span style=\"font-variant: small-caps\">Geo. L.J.<\/span> (forthcoming) (manuscript at 22), https:\/\/papers.ssrn.com\/abstract=5123124 [https:\/\/perma.cc\/2JVM-CMHU].<\/span> They focus on debunking the factual assumptions about LLMs that generative-interpretation proponents have invoked to justify their purported expertise on ordinary meaning.<sup class=\"modern-footnotes-footnote \" data-mfn=\"17\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-17\">17<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-17\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"17\"><em>See id.<\/em> at 22\u201339.<\/span> Our complementary critique addresses distinct empirical and jurisprudential problems with generative-interpretation proposals. On the empirical side, we offer detailed, under-the-hood demonstrations that generative interpretation produces arbitrary and unpredictable results in a manner that belies its proponents\u2019 optimistic claims. 
On the jurisprudential side, we ask what it would take for LLMs to possess the legal-interpretive authority that generative-interpretation proponents claim they do and show why, as a jurisprudential matter, the proponents\u2019 accounts fall short of justifying such authority.<\/p>\n<p>This Article proceeds in five Parts. Part I surveys the recent wave of writings from LLM proponents. We divide them roughly into two camps: those who are interested in <em>generative interpretation<\/em>, using LLMs to ascertain the meaning of words and phrases, and those who are interested in <em>generative adjudication<\/em>, using LLMs to resolve entire disputes.<\/p>\n<p>Part II provides a brief overview of LLM technology. We do not rehash the history of LLMs or describe the technology in detail. Instead, we highlight a few points that will be relevant to our discussion.<\/p>\n<p>Part III considers the reliability gap. Here, our methodological critique shows, with representative examples, that LLM proponents\u2019 results are brittle and often arbitrary. Small changes to an LLM, its settings, or its input can dramatically change the nature of its outputs, often in ways that would alter the outcome of cases. These are not small quibbles. Legal interpretation is a large and complex task, and the space of possible experimental setups is correspondingly large and unconstrained. The fact that LLMs are so easy to use\u2014at the simplest, just type a question in a box\u2014makes it easy to overlook the assumptions and choices involved in using them.<\/p>\n<p>Part IV then considers the epistemic gap. We taxonomize four types of arguments that LLM proponents have made to establish the legitimacy of their use in adjudication. 
Respectively, they are based on (1) an LLM\u2019s <em>training process <\/em>and the immense bodies of text it is trained on, (2) <em>empirical validation<\/em> of the LLM\u2019s outputs in comparison with other methods, (3) the <em>rhetorical persuasiveness<\/em> of the text the LLM produces, or (4) the purported <em>predictability<\/em> of LLM-generated outputs. In each case, we point out the unstated and mistaken assumptions on which these arguments rest. In particular, we argue that LLMs highlight the distinction between the <em>process<\/em> that makes adjudication legitimate\u2014the hard work of legal reasoning that judges engage in\u2014and the <em>artifacts<\/em> the process generates\u2014like definitions of disputed terms and opinions justifying an outcome. LLMs sever the connection between the two, because they are capable of producing plausible-looking opinions through a process that is fundamentally alien to traditional legal reasoning. Indeed, they force us to consider the prospect that the superficial fluency of legal text is no longer a meaningful guarantee of anything substantive.<\/p>\n<p>Part V considers what it might take to make LLMs practice-ready for judicial chambers. Even if generative interpretation and adjudication are different and harder, we are more optimistic about the use of LLMs in other aspects of legal practice, and for certain uses in judicial practice. 
Some scholars have argued that judging is an intrinsically human activity that cannot be outsourced, in whole or part, to any algorithmic system.<sup class=\"modern-footnotes-footnote \" data-mfn=\"18\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-18\">18<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-18\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"18\"><em>See, e.g.<\/em>, Amin Ebrahimi Afrouzi<em>, John Robots, Thurgood Martian, and the Syntax Monster: A New Argument Against AI Judges<\/em>, 37 <span style=\"font-variant: small-caps\">Can. J. L. &amp; Juris.<\/span> 369, 370 (2024) (arguing that \u201ceven if AI reaches the same outcome as human judges,\u201d it will be \u201cdeficient in rationale\u201d); Kiel Brennan-Marquez &amp; Stephen E. Henderson, <em>Artificial Intelligence and Role-Reversible Judgment<\/em>, 109 <span style=\"font-variant: small-caps\">J. Crim. L. &amp; Criminology <\/span>137, 140 (2019) (\u201cIn some contexts, those who exercise judgment should be vulnerable, in reverse, to its processes and effects.\u201d); Ian Kerr &amp; Carissima Mathen, <em>Chief Justice John Roberts is a Robot<\/em> (U. of Ottawa Working Paper, 2019) (manuscript at 39\u201340), https:\/\/papers.ssrn.com\/sol3\/papers.cfm?abstract_id=3395885 [https:\/\/perma.cc\/FL9R-HPBX] (arguing that AI cannot be a judge because \u201clegal reasoning cannot be reduced to mere functional capabilities\u201d). <em>But see, e.g.<\/em>, Eugene Volokh, <em>Chief Justice Robots<\/em>, 68 <span style=\"font-variant: small-caps\">Duke L.J.<\/span> 1135, 1158 (2019) (arguing for human-algorithmic equivalence in judging). 
An even stronger version of the claim is that certain decisions must be made by <em>specific people<\/em> and that any delegation of the reasoning or justification even to <em>other people<\/em> is problematic. <em>See <\/em>Bridget C.E. Dooling, <em>Ghostwriting the Government<\/em>, 109 <span style=\"font-variant: small-caps\">Marq. L. Rev. <\/span>(forthcoming 2026) (manuscript at 29), https:\/\/papers.ssrn.com\/sol3\/papers.cfm?abstract_id=5200672 [https:\/\/perma.cc\/PG5G-G8HC] (arguing that a \u201cduty to reason\u201d restricts delegation both to humans and to generative AI).<\/span> Our claim is narrower. We take no position on whether adjudication requires the exercise of specifically moral faculties or whether computer systems, including LLMs, are capable of having and exercising those faculties. Instead, our view is that adjudication is a fundamentally social process, one that must meet certain criteria of rationality to serve its function of legitimating the exercise of authority. Thus, in this Part we describe the work required to make generative interpretation meet the necessary criteria of rationality. In particular, we compare it to two other empirical interpretive methods\u2014trademark surveys and legal corpus linguistics\u2014one of which has successfully established itself in adjudication and the other of which is attempting to do so.<\/p>\n<p>Finally, in a brief Conclusion, we consider the broader implications of our analysis. Perhaps the problem facing generative interpretation is not that LLMs aren\u2019t good enough at language to be useful to judges, but that they are <em>too good<\/em>. Proponents of generative interpretation cite the persuasiveness of LLM-generated text as proof of that text\u2019s legitimacy. But the lesson of their experiments is precisely the opposite. 
They show that it is possible to produce persuasive, law-like rhetoric to support a conclusion without going through the hard work of legal reasoning to validate that conclusion. Their work shows that we should be more suspicious of words on a page in the age of AI, not less.<\/p>\n<h4 style=\"text-align: center\">II. <span style=\"font-variant: small-caps\">Generative Interpretation<\/span><\/h4>\n<p style=\"text-align: left\">LLM proponents have made two different kinds of proposals for incorporating generative AI into adjudication. Some of them, more modestly, have suggested using generative AI as an <em>interpretive tool<\/em> that provides targeted answers to discrete questions about the meaning of a legal text. Others, more ambitiously, have suggested using generative AI as an <em>adjudicator<\/em> that answers broader questions about the proper resolution of a case. They are united by a belief that LLMs can perform some tasks traditionally performed by judges well enough that the legal system should think seriously about incorporating them into actual judicial practice. They differ in how much of the judicial task they believe should be delegated to LLMs. And even within these two broad camps, approaches towards the proper use of LLMs are highly diverse.<\/p>\n<p>These are not completely new ideas. 
For years, commentators have debated whether decision-making by AIs could be an acceptable substitute for decision-making by human judges.<sup class=\"modern-footnotes-footnote \" data-mfn=\"19\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-19\">19<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-19\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"19\"><em>See, e.g.<\/em>, Afrouzi, <em>supra<\/em> note 18; Brennan-Marquez, <em>supra<\/em> note 18; Volokh, <em>supra<\/em> note 18.<\/span> These conversations span many legal fields, and encompass not just judicial adjudication but also the immense volume of administrative decisions. Common themes include accuracy,<sup class=\"modern-footnotes-footnote \" data-mfn=\"20\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-20\">20<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-20\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"20\"><em>See, e.g.<\/em>, Danielle Keats Citron, <em>Technological Due Process<\/em>, 85 <span style=\"font-variant: small-caps\">Wash. U. L. Rev. 
<\/span>1249, 1301\u201308 (2008) (providing framework to evaluate technological accuracy in law).<\/span> consistency,<sup class=\"modern-footnotes-footnote \" data-mfn=\"21\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-21\">21<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-21\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"21\"><em>See, e.g.<\/em>, Cary Coglianese &amp; Lavi Ben Dor, <em>AI in Adjudication and Administration<\/em>, 86 <span style=\"font-variant: small-caps\">Brook. L. Rev.<\/span> 791, 828 (2021) (\u201c[M]achine-learning tools . . . could potentially reduce inconsistencies and other foibles that permeate human judgment.\u201d).<\/span> and explainability.<sup class=\"modern-footnotes-footnote \" data-mfn=\"22\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-22\">22<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-22\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"22\"><em>See, e.g.<\/em>, Margot Kaminski, <em>The Right to Explanation, Explained<\/em>, 34 <span style=\"font-variant: small-caps\">Berkeley Tech. L.J. <\/span>189 (2019); Andrew D. Selbst &amp; Solon Barocas, <em>The Intuitive Appeal of Explainable Machines<\/em>, 87 <span style=\"font-variant: small-caps\">Fordham L. Rev.<\/span> 1085 (2018).<\/span> In the first instance, these debates have primarily been about the quality of the <em>decisions<\/em> themselves, rather than about the quality of the <em>explanations<\/em> offered to justify those decisions.
In other words, these have largely been debates about <em>classification <\/em>or <em>predictive<\/em> AI, rather than about <em>generative <\/em>AI.<\/p>\n<p>Generative AI brings two new things to the table. First, it turns a hypothetical into an actuality: AIs can now write opinions that read as though a human wrote them. That means the question of whether the judicial system should actually use those AIs is far more pressing than it was even a few years ago.<sup class=\"modern-footnotes-footnote \" data-mfn=\"23\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-23\">23<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-23\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"23\">It also changes the tenor of discussions of the effects these AIs will have. Actual experience with LLMs makes the conversation significantly more concrete. <em>See, e.g.<\/em>, Richard M. Re, <em>Artificial Authorship and Judicial Opinions<\/em>, 92 <span style=\"font-variant: small-caps\">Geo. Wash. L. 
Rev.<\/span> 1558, 1582\u201385 (2024) (offering predictions about effects of widespread use of AI by courts and litigants).<\/span> When proponents like Adam Unikowsky claim that LLMs are good enough for judges to use <em>now<\/em>, they are speaking in the present tense and the indicative mood.<sup class=\"modern-footnotes-footnote \" data-mfn=\"24\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-24\">24<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-24\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"24\"><em>See In AI We Trust II<\/em>, <em>supra<\/em> note 5.<\/span><\/p>\n<p>Second, generative AI opens up the possibility of generative interpretation: giving an LLM an interpretive task that is significantly smaller than deciding an entire case. This is not actually a task where the generative nature of the AI is essential to its use for interpretation. One could imagine training a classification AI to disambiguate between cases where the last-antecedent canon or the series-qualifier canon is more appropriate.<sup class=\"modern-footnotes-footnote \" data-mfn=\"25\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-25\">25<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-25\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"25\"><em>See generally<\/em> Adam Crews, <em>The So-Called Series-Qualifier Canon<\/em>, 116 <span style=\"font-variant: small-caps\">Nw. U. L. Rev. 
Online<\/span> 198, 208\u201309 (2021) (comparing the two canons).<\/span> Such an AI might have an architecture and training process completely different from the deep-learning methods used to train LLMs; it might be prompted with a passage of text and output a single numerical value expressing the degree to which the passage resembled other passages in which one canon rather than the other applied. <em>Generative<\/em> AI, on the other hand, has seemed like it might work for interpretation because it has already worked for so many other tasks. That is, LLMs are such powerful and versatile tools that they can be (and have been) used, off the shelf or with only slight modifications, for a variety of classification tasks.<sup class=\"modern-footnotes-footnote \" data-mfn=\"26\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-26\">26<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-26\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"26\"><em>See, e.g.<\/em>, Rebecca M. M. Hicke &amp; David Mimno, <em>T5 Meets Tybalt: Author Attribution in Early Modern English Drama Using Large Language Models<\/em><span style=\"font-variant: small-caps\">, 2023 Computational Humanities Rsch. Conf. 1 <\/span>(using LLM to identify authorship of text passage).<\/span> If they are truly jacks of all trades, perhaps legal interpretation is one of those trades.<\/p>\n<p style=\"text-align: center\">A. <em>The Proponents\u2019 Common Premises<br \/><\/em><\/p>\n<p>The justifications that LLM proponents offer for generative interpretation and adjudication are broadly consistent. 
The remainder of this Part will review arguments from some of the early proponents of LLM-powered interpretation and adjudication\u2014from cautious experimenters like Judges Kevin Newsom and Joshua Deahl, to measured enthusiasts like Professors Yonathan Arbel and David Hoffman, to unbridled optimists like practitioners Adam Unikowsky and Jack Kieffaber. These proponents differ along many dimensions, but all see LLMs as potentially more accurate, more consistent, and more efficient than current interpretive methods. The intuition behind these arguments is consistent across the board.<\/p>\n<p>First, the proponents say that LLMs can be more <em>accurate<\/em> than human judges: the models offer higher-quality insights into the ordinary meaning of language in legal texts. They argue that generative interpretation offers a more \u201cmajoritarian reading\u201d of text than a dictionary or a jurist\u2019s intuitions.<sup class=\"modern-footnotes-footnote \" data-mfn=\"27\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-27\">27<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-27\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"27\"><em>Generative Interpretation<\/em>, <em>supra<\/em> note 2, at 488.<\/span> The core tenet of this argument is that because the models train on what are perhaps the largest corpora of natural-language text ever assembled, and because they use statistical methods to optimize their predictions of successive words in context, their output offers probative insights about general English usage.<sup class=\"modern-footnotes-footnote \" data-mfn=\"28\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" 
aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-28\">28<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-28\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"28\"><em>See<\/em> Snell v. United Specialty Ins. Co., 102 F.4th 1208, 1226 (11th Cir. 2024) (Newsom, J., concurring) (\u201c[T]he best reason to think that LLMs might provide useful information to those engaged in the interpretive enterprise\u201d is that \u201cthe models train on a mind-bogglingly enormous amount of raw data taken from the internet\u201d and therefore \u201ccan provide useful statistical predictions about how, in the main, ordinary people ordinarily use words and phrases in ordinary life.\u201d); <em>see also<\/em> <em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 513\u201314 (praising \u201csuperior sensitivity to actual usage,\u201d which, the authors argue, derives from the fact that the models are trained on immense amounts of real-life examples of linguistic usage); <em>In AI We Trust I<\/em>, <em>supra <\/em>note 5 (\u201c[W]e should come at this problem in the spirit of \u2018AI has read every case ever written and in most cases, it will be more accurate than humans\u2019. . . 
.\u201d).<\/span> A related idea is that they are more democratic because their training data reflects a far broader sample than other authoritative sources, like dictionaries.<sup class=\"modern-footnotes-footnote \" data-mfn=\"29\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-29\">29<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-29\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"29\"><em>See<\/em> <em>supra<\/em> note 14 and accompanying text.<\/span><\/p>\n<p>Second, the proponents argue that generative interpretation is more <em>consistent<\/em>: more constrained, more predictable, and more transparent.<sup class=\"modern-footnotes-footnote \" data-mfn=\"30\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-30\">30<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-30\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"30\"><em>See Snell<\/em>, 102 F.4th at 1227\u201330 (Newsom, J., concurring) (\u201cUsing LLMs to facilitate ordinary meaning interpretation may actually enhance the transparency and reliability of the interpretive enterprise itself . . . .\u201d); <em>see also In AI We Trust I<\/em>, <em>supra <\/em>note 5 (\u201cDispersing the judicial power among so many different judges inevitably undermines predictability. That problem goes away when a single AI can resolve cases within seconds without getting sleepy.\u201d). <em>See generally <\/em>Kimo Gandall, Jack Kieffaber &amp; Kenny McLaren, <em>We Built Judge.ai. 
And You Should Buy It<\/em> (Working Paper), https:\/\/papers.ssrn.com\/abstract=5115184 [https:\/\/perma.cc\/D5JW-8Y7K].<\/span> In contrast to current interpretive methods\u2019 \u201cvalue-laden\u201d choices about which dictionaries to consult<sup class=\"modern-footnotes-footnote \" data-mfn=\"31\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-31\">31<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-31\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"31\"><em>Generative Interpretation<\/em>, <em>supra<\/em> note 2, at 466.<\/span> or which interpretive canons to employ, generative interpretation offers a \u201ccutting-edge \u2018mathematization\u2019 of language,\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"32\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-32\">32<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-32\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"32\"><em>Snell<\/em>, 102 F.4th at 1227 (Newsom, J., concurring).<\/span> that is more \u201ctransparent\u201d and \u201cobjective.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"33\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-33\">33<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-33\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"33\"><em>Generative Interpretation<\/em>, <em>supra<\/em> note 2, at 455, 511.<\/span> In <em>Snell v. 
United Specialty Insurance Company<\/em>,<sup class=\"modern-footnotes-footnote \" data-mfn=\"34\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-34\">34<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-34\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"34\">102 F.4th 1208 (11th Cir. 2024).<\/span> Judge Newsom suggested that \u201cLLM research is relatively transparent.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"35\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-35\">35<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-35\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"35\"><em>Id.<\/em> at 1228 (Newsom, J., concurring).<\/span> \u201c[W]e tend to take dictionaries for granted, as if delivered by a prophet,\u201d he observed, but \u201cthe precise details of [dictionaries\u2019] construction aren\u2019t always self-evident.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"36\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-36\">36<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-36\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"36\"><em>Id.<\/em><\/span> Moreover, judges can shop around for dictionary definitions that suit their predispositions and \u201crarely explain in any detail the process by which they selected one definition over others.\u201d<sup class=\"modern-footnotes-footnote \" 
data-mfn=\"37\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-37\">37<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-37\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"37\"><em>Id. <\/em>at 1229.<\/span> By contrast, Judge Newsom argued, LLM research is comparatively transparent because we \u201cknow,\u201d on some general level, what data LLMs \u201clearn[]\u201d from and because judges could disclose their full queries and models\u2019 full answers.<sup class=\"modern-footnotes-footnote \" data-mfn=\"38\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-38\">38<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-38\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"38\"><em>See id.<\/em><\/span> Plus, LLMs are less vulnerable to hidden exercises of discretion than tools like corpus linguistics (and perhaps even dictionaries).<sup class=\"modern-footnotes-footnote \" data-mfn=\"39\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-39\">39<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-39\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"39\"><em>See id<\/em>. at 1230. For a longer discussion of corpus linguistics, <em>see<\/em> <em>infra <\/em>note 286 and accompanying text.<\/span><\/p>\n<p>And third, the proponents tell us that generative interpretation is more <em>efficient, <\/em>so its use can improve access to justice. 
LLMs are widely accessible and far cheaper than actual surveys of humans\u2019 perceptions of meaning.<sup class=\"modern-footnotes-footnote \" data-mfn=\"40\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-40\">40<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-40\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"40\"><em>See Snell<\/em>, 102 F.4th at 1228\u201330.<\/span> Judicial use of a predictable, widely accessible model might let parties predict judicial outcomes to resolve disputes while avoiding court altogether.<sup class=\"modern-footnotes-footnote \" data-mfn=\"41\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-41\">41<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-41\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"41\"><em>See Generative Interpretation<\/em>, <em>supra<\/em> note 2, at 510.<\/span><\/p>\n<p style=\"text-align: center\">B. <em>LLMs as Interpretive Tools<br \/><\/em><\/p>\n<p>We start with the commentators who have proposed using LLMs for interpretation. Overall, their proposals are more concrete and specific, and thus easier to assess. Part III\u2019s more technical critique revisits these examples in greater detail.<\/p>\n<p>1. 
<em>Yonathan Arbel and David Hoffman<\/em><\/p>\n<p>The leading exposition of using LLMs for legal interpretation is Yonathan Arbel and David Hoffman\u2019s widely praised<sup class=\"modern-footnotes-footnote \" data-mfn=\"42\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-42\">42<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-42\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"42\"><em>See, e.g.<\/em>, Lawrence Solum, <em>Hoffman &amp; Arbel on Interpretation of Contracts by Generative AI<\/em>, <span style=\"font-variant: small-caps\">Legal Theory Blog<\/span> (Aug. 1, 2023), https:\/\/lsolum.typepad.com\/legaltheory\/2023\/08\/hoffman-arbel-on-interpretation-of-contracts-by-generative-ai.html [https:\/\/perma.cc\/U87X-FFZK] (\u201cA profoundly important article.\u201d); Omri Ben-Shahar, <em>Towards the End of Normative Interpretation of Contracts<\/em>, <span style=\"font-variant: small-caps\">Jotwell (<\/span>Nov. 
2, 2023<span style=\"font-variant: small-caps\">)<\/span>, https:\/\/contracts.jotwell.com\/towards-the-end-of-normative-interpretation-of-contracts\/ [https:\/\/perma.cc\/4MS3-P4EE] (reviewing <em>Generative Interpretation<\/em> positively).<\/span> <em>Generative Interpretation<\/em>.<sup class=\"modern-footnotes-footnote \" data-mfn=\"43\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-43\">43<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-43\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"43\"><em>Generative Interpretation<\/em>,<em> supra <\/em>note 2.<\/span> The article posits that \u201cAI models can help factfinders ascertain ordinary meaning in context, quantify ambiguity, and fill gaps in parties\u2019 agreements.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"44\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-44\">44<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-44\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"44\"><em>Id.<\/em> at 451.<\/span> The authors\u2019 essential argument is that large language models can help interpret contested provisions of legal instruments.<\/p>\n<p>They demonstrate their approach by working through several case studies of litigated contractual disputes. 
In these case studies, the authors solicit responses from LLMs about the meaning of legal text, and they present these responses as \u201crelevant\u201d evidence of the language\u2019s \u201cpublic and common\u201d meaning.<sup class=\"modern-footnotes-footnote \" data-mfn=\"45\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-45\">45<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-45\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"45\"><em>See id.<\/em> at 485 (\u201cGenerative interpretation . . . offers courts a better sense of the relevant probabilities if the parties were intending to use English in its most public and common sense.\u201d); <em>see also<\/em> <em>id.<\/em> at 489 (\u201cGenerative interpretation . . . helps us visualize a broad spectrum of meaning and quantify how likely a particular result is.\u201d); <em>id.<\/em> at 505 (\u201cModels offer an approximation of general understanding that may simply not be available in any other way, and thus advance long-held goals of contract theory.\u201d).<\/span> \u201c[G]enerative interpretation,\u201d they conclude, \u201cpromises an accessible, relatively predictable, tool that will help lawyers and judges interpret contracts.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"46\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-46\">46<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-46\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"46\"><em>Id.<\/em> at 509.<\/span><\/p>\n<p>For example, they consider a dispute about whether language in a prenuptial agreement 
referring to \u201cthe time a Petition for Dissolution of Marriage is filed\u201d meant the time the <em>first <\/em>such petition was filed or the time that the <em>most recent<\/em> petition was filed.<sup class=\"modern-footnotes-footnote \" data-mfn=\"47\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-47\">47<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-47\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"47\"><em>See id.<\/em> at 483\u201384.<\/span> They gave OpenAI\u2019s Davinci-003 model\u2014a variant of the LLM used to power ChatGPT at the time\u2014the text of the agreement and asked it, \u201cIf one of the parties files a divorce petition, withdraws it, and then a few years later a new petition is filed, what date determines the number of full years of marriage: the first filing or the second one?\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"48\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-48\">48<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-48\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"48\"><em>Generative Interpretation<\/em>,<em> supra <\/em>note 2, at 484.<\/span> They then observe that the model has a 94.72% probability of outputting \u201csecond\u201d in a sentence answering the question and only a 0.68% probability of outputting \u201cfirst.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"49\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" 
aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-49\">49<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-49\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"49\"><em>See id.<\/em><\/span><\/p>\n<p>Arbel and Hoffman acknowledge that their article is meant as an introduction to a promising interpretive technique, rather than as an exhaustive explication of an interpretive methodology. They do not claim that LLMs surface the absolute truth of contracting parties\u2019 intentions: instead, they present generative interpretation as a \u201cworkable, workmanlike method for a resource-constrained litigation world.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"50\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-50\">50<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-50\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"50\"><em>Id.<\/em> at 458, 460\u201361.<\/span> Nor do they claim that LLMs are ready to serve as \u201crobot judges\u201d vested with the final say in adjudication.<sup class=\"modern-footnotes-footnote \" data-mfn=\"51\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-51\">51<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-51\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"51\"><em>Id.<\/em> at 461.<\/span> Rather, they explain that their proposal is for judges to use LLMs \u201cas tools\u201d to supplement their judgment, and they caution that techniques for querying these tools \u201cstill await a process of development, refinement, and 
validation.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"52\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-52\">52<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-52\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"52\"><em>Id.<\/em> at 455, 461.<\/span><\/p>\n<p>Alongside these disclaimers, however, Arbel and Hoffman consistently characterize LLM outputs as accurate, probative evidence of words\u2019 legal meanings.<sup class=\"modern-footnotes-footnote \" data-mfn=\"53\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-53\">53<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-53\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"53\"><em>See supra <\/em>note 43; <em>see also Generative Interpretation, supra<\/em> note 2, at 495 (\u201c[W]e have provided examples that showcase how large language models might power a stronger, cheaper, more robust form of textualism.\u201d); <em>id. <\/em>at 505 (\u201c[O]n average, these models predict with great accuracy linguistic distinctions that humans make.\u201d); <em>id.<\/em> at 509 (\u201c[G]enerative interpretation promises an accessible, relatively predictable, tool that will help lawyers and judges interpret contracts\u201d). <em>Compare<\/em> <em>id.<\/em> at 485 (\u201cGenerative interpretation . . . offers courts a better sense of the relevant probabilities if the parties were intending to use English in its most public and common sense.\u201d), <em>with<\/em> <em>id.<\/em> at 485 n.165 (\u201c[T]he probabilities shouldn\u2019t be interpreted literally.\u201d). 
We think the best way to interpret these claims and caveats together is as an endorsement of the models\u2019 <em>accuracy<\/em>, but a disavowal of the models\u2019 claimed <em>precision<\/em>.<em> See id.<\/em> at 496 n.210 (\u201c[Y]ou should be skeptical of model\u2019s [sic] expressed confidence; the direction of change with every new piece of evidence, not its quantification, is informative.\u201d). For an explanation of the distinction between accuracy and precision, <em>see, e.g.<\/em>, <em>Accuracy and Precision<\/em>, <span style=\"font-variant: small-caps\">Wikipedia<\/span> (last visited Mar. 17, 2025), https:\/\/en.wikipedia.org\/w\/index.php?title=Accuracy_and_precision&amp;oldid=1281017807 [https:\/\/perma.cc\/79CF-BTUP].<\/span> They further advertise that \u201cgenerative interpretation is good enough for many cases that currently employ more expensive, and arguably less certain, methodologies.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"54\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-54\">54<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-54\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"54\"><em>Generative Interpretation<\/em>, <em>supra<\/em> note 2, at 458.<\/span> Whether or not the authors intended to present generative interpretation as a method ready for primetime use,<sup class=\"modern-footnotes-footnote \" data-mfn=\"55\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-55\">55<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-55\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"55\">After the 
authors assert in the present tense that \u201cgenerative interpretation is good enough,\u201d <em>id.<\/em>, they later present the question as, \u201cwhether [generative interpretation] is good enough, <em>if not today then soon<\/em>, for resource-deprived courts to adopt in ordinary cases,\u201d <em>id. <\/em>at 502 (emphasis added and original emphasis omitted).<\/span> it now occupies a primetime slot, thanks in part to its influence on prominent LLM proponents like Judge Newsom, whose views we discuss in the following section.<\/p>\n<p>Overall, Arbel and Hoffman\u2019s proposal is <em>narrow<\/em> in the questions it asks LLMs, <em>broad<\/em> in the evidence it asks them to consider, and <em>narrow<\/em> in the outputs it examines. They invite judges to turn to generative AI to help with discrete interpretive tasks\u2014<em>e.g.<\/em>, does the term \u201cother affiliates\u201d as used in a contract refer only to entities already in existence?\u2014and they are capacious in the text they provide LLMs to help answer these questions\u2014frequently, entire contracts.<sup class=\"modern-footnotes-footnote \" data-mfn=\"56\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-56\">56<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-56\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"56\"><em>See id.<\/em> at 487, 489\u201390.<\/span> Indeed, they present the use of LLMs as a new form of contextualism: a way for contract interpretation to consider a wide array of evidence bearing on a contract\u2019s meaning while preserving the values of predictability, restraint, and judicial economy typically advanced for its alternative, textualism.<sup class=\"modern-footnotes-footnote \" data-mfn=\"57\" 
data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-57\">57<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-57\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"57\"><em>See<\/em> <em>id.<\/em> at 510\u201313.<\/span> They are enthusiastic about giving LLMs access to the complete text of affidavits and other documents beyond the four corners of a contract.<sup class=\"modern-footnotes-footnote \" data-mfn=\"58\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-58\">58<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-58\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"58\"><em>See<\/em> <em>id.<\/em> at 513.<\/span> Ultimately, however, they use LLMs as meaning meters rather than legal writers: instead of relying on the apparent meanings of a chatbot\u2019s generated responses, Arbel and Hoffman look closely at the underlying AI\u2019s internal calculations.<\/p>\n<p style=\"text-align: center\">2. <em>Judge Kevin Newsom<\/em><\/p>\n<p>The next major proponent of generative interpretation is Judge Kevin Newsom of the United States Court of Appeals for the Eleventh Circuit. 
In 2024, he wrote his concurrence in <em>Snell<\/em>, in which he characterized himself as \u201cth[inking] the unthinkable\u201d and \u201csa[ying] the unsayable.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"59\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-59\">59<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-59\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"59\">Snell v. United Specialty Ins. Co., 102 F.4th 1208, 1221 (11th Cir. 2024) (Newsom, J., concurring).<\/span> \u201cLLM[s] like ChatGPT,\u201d he mused, \u201cmight have something useful to say about the common, everyday meaning of the words and phrases used in legal texts.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"60\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-60\">60<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-60\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"60\"><em>Id.<\/em> at 1234.<\/span><\/p>\n<p>The relevant issue in <em>Snell <\/em>was whether an insurance policy that covered liability arising out of the insured\u2019s \u201cperform[ance of] landscaping\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"61\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-61\">61<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-61\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"61\"><em>Id.<\/em> at 1213 (majority 
opinion). <\/span> applied to the allegedly negligent installation of a trampoline.<sup class=\"modern-footnotes-footnote \" data-mfn=\"62\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-62\">62<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-62\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"62\"><em>Id.<\/em> at 1211\u201312.<\/span> While the majority disposed of the case without ruling on the meaning of \u201clandscaping,\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"63\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-63\">63<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-63\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"63\"><em>See id.<\/em> at 1216\u201317.<\/span> Judge Newsom argued that \u201c[t]hose, like me, who believe that \u2018ordinary meaning\u2019 is <em>the<\/em> foundational rule for the evaluation of legal texts should consider\u2014<em>consider<\/em>\u2014whether and how AI-powered large language models like OpenAI\u2019s ChatGPT, Google\u2019s Gemini, and Anthropic\u2019s Claude might\u2014<em>might<\/em>\u2014inform the interpretive analysis.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"64\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-64\">64<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-64\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"64\"><em>Id.<\/em> at 
1221\u201322 (Newsom, J., concurring).<\/span><\/p>\n<p>On \u201ca lark,\u201d Judge Newsom directed one of his clerks to query ChatGPT, \u201cWhat is the ordinary meaning of \u2018landscaping\u2019?\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"65\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-65\">65<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-65\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"65\">Snell v. United Specialty Ins. Co., 102 F.4th 1208, 1224\u201325 (11th Cir. 2024) (Newsom, J., concurring).<\/span> ChatGPT produced a two-sentence definition that \u201caligned with [his] priors.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"66\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-66\">66<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-66\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"66\"><em>Id.<\/em> at 1225.<\/span> He then asked, \u201cIs installing an in-ground trampoline \u2018landscaping\u2019?\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"67\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-67\">67<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-67\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"67\"><em>Id.<\/em><\/span> Both ChatGPT and Google Bard answered in the affirmative.<sup class=\"modern-footnotes-footnote \" data-mfn=\"68\" 
data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-68\">68<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-68\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"68\"><em>Id.<\/em><\/span> The coherence of ChatGPT\u2019s response impressed him, and he found that it \u201csquared with [his] own impression\u201d about how ordinary speakers might use the term.<sup class=\"modern-footnotes-footnote \" data-mfn=\"69\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-69\">69<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-69\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"69\"><em>Id.<\/em><\/span><\/p>\n<p>A few months later, Judge Newsom reaffirmed his enthusiasm for LLMs in another concurrence, this time in a case, <em>United States v. Deleon<\/em>,<sup class=\"modern-footnotes-footnote \" data-mfn=\"70\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-70\">70<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-70\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"70\">116 F.4th 1260 (11th Cir. 
2024).<\/span> interpreting a provision of the United States Sentencing Guidelines.<sup class=\"modern-footnotes-footnote \" data-mfn=\"71\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-71\">71<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-71\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"71\"><em>See id.<\/em> at 1261.<\/span> The Guidelines impose a sentencing enhancement for an armed robbery \u201cif any person was physically restrained to facilitate commission of the offense or to facilitate escape.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"72\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-72\">72<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-72\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"72\"><em>Id.<\/em> at 1261 (quoting <span style=\"font-variant: small-caps\">U.S. Sent\u2019g Guidelines Manual<\/span> \u00a7 2B3.1(b)(4)(B)).<\/span> The defendant had \u201cwalked into a store, pointed a gun at the cashier while demanding money from the register, received the money, and . . . left . . . 
within about one minute.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"73\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-73\">73<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-73\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"73\"><em>Id.<\/em><\/span> The court held the enhancement applicable, and Judge Newsom wrote separately to suggest that LLMs might \u201cinform the interpretive analysis\u201d of the Guideline at issue.<sup class=\"modern-footnotes-footnote \" data-mfn=\"74\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-74\">74<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-74\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"74\"><em>Id.<\/em> at 1270 (Newsom, J., concurring) (quoting <em>Snell<\/em>, 102 F.4th at 1211 (Newsom, J., concurring)).<\/span> Several times he asked ChatGPT: \u201cWhat is the ordinary meaning of \u2018physically restrained\u2019?\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"75\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-75\">75<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-75\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"75\"><em>Id.<\/em> at 1272.<\/span> While he received a different answer each time, he found it informative that despite \u201csubtle, marginal divergences,\u201d the models\u2019 responses \u201ccoalesce[d], substantively, around a 
common core\u2014there was an objectively verifiable throughline.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"76\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-76\">76<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-76\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"76\">United States v. Deleon, 116 F.4th 1260, 1275 (11th Cir. 2024) (Newsom, J., concurring).<\/span> He wrote, \u201cFor our purposes, what matters is that the LLMs consistently defined the phrase \u2018physically restrained\u2019 to require the application of tangible force, either through direct bodily contact or some other device or instrument. And that, again, squares comfortably with the results obtained through the traditional, dictionary-driven breaking-and-repiecing method.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"77\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-77\">77<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-77\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"77\"><em>Id.<\/em><\/span><\/p>\n<p>This exercise gave Judge Newsom \u201chope that the models have something significant to offer the interpretive enterprise.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"78\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-78\">78<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-78\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" 
tabindex=\"0\" data-mfn=\"78\"><em>Id.<\/em> at 1276\u201377.<\/span> \u201cLLMs are trained on actual individuals\u2019 uses of language in the real world,\u201d he reasoned, and \u201cthe LLMs\u2019 responses to [his] repeated queries reliably revealed . . . a common core.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"79\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-79\">79<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-79\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"79\"><em>Id.<\/em> at 1277.<\/span> Judge Newsom concluded his <em>Deleon<\/em> concurrence by observing, \u201cI continue to believe\u2014perhaps more so with each interaction\u2014that LLMs have something to contribute to the ordinary-meaning endeavor.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"80\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-80\">80<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-80\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"80\"><em>Id.<\/em><\/span><\/p>\n<p>In both <em>Snell <\/em>and <em>Deleon<\/em>, Judge Newsom favorably cited <em>Generative Interpretation<\/em> to support LLMs\u2019 authority concerning the ordinary meaning of legal language.<sup class=\"modern-footnotes-footnote \" data-mfn=\"81\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-81\">81<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-81\" role=\"tooltip\" 
class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"81\"><em>See <\/em>Snell v. United Specialty Ins. Co., 102 F.4th 1208, 1226\u201327 n.7 (Newsom, J., concurring); <em>Deleon<\/em>, 116 F.4th at 1266 n.1 (Newsom, J., concurring).<\/span> Like Arbel and Hoffman, Judge Newsom believes in asking <em>narrow<\/em> questions of LLMs, using them to resolve discrete questions about the meanings of particular terms. But in other ways, his approach differs from theirs. For one thing, he takes a <em>narrow<\/em> view of the sources the LLM should consider. Instead of giving the LLM an entire contract to interpret, Judge Newsom uses single-sentence prompts, e.g., \u201c<em>Is installing an in-ground trampoline \u2018landscaping\u2019?<\/em>\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"82\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-82\">82<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-82\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"82\"><em>Snell<\/em>, 102 F.4th at 1235 (Newsom, J., concurring).<\/span> For another, he takes a slightly <em>broader<\/em> view of the LLM\u2019s outputs than Arbel and Hoffman do in their prenuptial-petition example above. 
Instead of considering token probabilities as they do, Judge Newsom reads the LLM outputs as natural-language text, much as he would a dictionary definition.<sup class=\"modern-footnotes-footnote \" data-mfn=\"83\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-83\">83<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-83\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"83\"><em>See id.<\/em> at 1228\u201330 (comparing LLMs and dictionaries).<\/span> In <em>Deleon<\/em>, he borrows from Arbel and Hoffman in repeating his query thirty times (ten for each of three models) to assess their \u201cconfidence,\u201d but he continues to treat the outputs as English text.<sup class=\"modern-footnotes-footnote \" data-mfn=\"84\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-84\">84<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-84\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"84\"><em>See<\/em> <em>Deleon<\/em>, 116 F.4th at 1273\u201374 (Newsom, J., concurring).<\/span><\/p>\n<p style=\"text-align: center\">3. <em>Judge Joshua Deahl<\/em><\/p>\n<p>Some months later, Judge Joshua Deahl of the District of Columbia Court of Appeals followed Judge Newsom\u2019s lead. In <em>Ross v. 
United States<\/em>,<sup class=\"modern-footnotes-footnote \" data-mfn=\"85\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-85\">85<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-85\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"85\">331 A.3d 220 (D.C. 2025).<\/span> Nina Ross was charged with animal cruelty for leaving a dog in her car on a hot day.<sup class=\"modern-footnotes-footnote \" data-mfn=\"86\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-86\">86<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-86\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"86\"><em>See id. <\/em>at 222.<\/span> In overturning her conviction, the majority held that there was insufficient evidence to prove beyond a reasonable doubt that leaving a dog in a car for over an hour on a hot day would harm the dog.<sup class=\"modern-footnotes-footnote \" data-mfn=\"87\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-87\">87<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-87\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"87\"><em>See id.<\/em> at 224\u201327.<\/span> The majority cited <em>Jordan v. 
United States<\/em>,<sup class=\"modern-footnotes-footnote \" data-mfn=\"88\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-88\">88<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-88\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"88\">269 A.2d 848 (D.C. 1970).<\/span> a similar case in which the court had reversed a conviction for leaving a dog outside on a cold day.<sup class=\"modern-footnotes-footnote \" data-mfn=\"89\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-89\">89<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-89\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"89\"><em>See Ross<\/em>, 331 A.3d at 224 (citing <em>Jordan<\/em>, 269 A.2d at 849) (\u201cIn the absence of testimony . . . 
that the shelter or protection from the weather supplied this dog on this occasion would cause the dog to suffer, the evidence was insufficient to sustain the conviction.\u201d).<\/span> In his dissent, Judge Deahl distinguished <em>Jordan<\/em> on the ground that it is common knowledge that leaving a dog in a car on a hot day \u201ccreated a \u2018plain and strong likelihood\u2019 that [the] dog would be harmed,\u201d whereas the risks of leaving a dog outside on a cold day were less clear-cut.<sup class=\"modern-footnotes-footnote \" data-mfn=\"90\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-90\">90<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-90\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"90\"><em>Id.<\/em> at 235\u201336 (Deahl, J., dissenting).<\/span><\/p>\n<p>In support of this argument, Judge Deahl and his clerks asked ChatGPT, \u201cIs it harmful to leave a dog in a car, with the windows down a few inches, for an hour and twenty minutes when it\u2019s 98 degrees outside?\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"91\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-91\">91<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-91\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"91\"><em>Id.<\/em> at 236.<\/span> They received an \u201cunequivocal \u2018yes.\u2019\u201d They then asked: \u201cIs it harmful to leave a German shepherd outside in 25 degree temperature for five hours?\u201d and received answers that \u201cboil[ed] down to \u2018it 
depends.\u2019\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"92\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-92\">92<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-92\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"92\"><em>Id.<\/em> at 237.<\/span> Carefully caveating that ChatGPT is \u201cdefinitely not\u201d a good proxy for common knowledge, Judge Deahl nevertheless avowed that those chats supported his claims because inclusion in ChatGPT\u2019s responses is \u201cat least a point in favor of something being common knowledge . . .; and a far stronger point against it [being common knowledge] if it\u2019s not.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"93\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-93\">93<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-93\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"93\"><em>Id.<\/em> at 236 n.4.<\/span><\/p>\n<p>Citing Judge Newsom, Judge Deahl used the text produced by chatbots as a proxy for common knowledge.<sup class=\"modern-footnotes-footnote \" data-mfn=\"94\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-94\">94<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-94\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"94\"><em>See id.<\/em> (calling <em>Snell<\/em> \u201ca thoughtful and engaging discussion . . . 
which I largely agree with\u201d).<\/span> Still, there is something different going on in his approach. Judge Deahl didn\u2019t ask ChatGPT about the meaning of a term (what does \u201charmful\u201d mean?); he asked it about how the world works (is it harmful to leave a dog in a car on a hot day?). That is a different kind of inquiry. It is still <em>narrow<\/em> in the question being asked, <em>narrow<\/em> in the sources presented to the LLM, and somewhat <em>broader<\/em> in considering natural-language outputs\u2014but it is not really an interpretive question at all.<\/p>\n<p style=\"text-align: center\">C. <em>LLMs as Adjudicators<br \/><\/em><\/p>\n<p>Some commentators go beyond mere LLM <em>interpretation<\/em> to LLM <em>adjudication<\/em>. Instead of prompting the LLM to carry out a specific interpretive task (\u201cWhat is the ordinary meaning of \u2018landscaping\u2019?\u201d), they prompt the LLM with a question about the case as a whole (\u201cYou are a federal appellate judge. Please read the attached briefs. Please write a three paragraph decision. In the third paragraph, explain who wins and why.\u201d).<sup class=\"modern-footnotes-footnote \" data-mfn=\"95\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-95\">95<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-95\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"95\"><em>See In AI We Trust II<\/em>, <em>supra<\/em> note 5.<\/span> This is a <em>broad<\/em> interpretive task, much broader than anything attempted by Arbel and Hoffman, Judge Newsom, or Judge Deahl. It almost necessarily involves considering a <em>broad<\/em> range of sources\u2014briefs, precedents, statutes, and other sources a human adjudicator would consider. 
And it likewise almost necessarily involves reading the output <em>broadly<\/em> as a natural-language text\u2014the kind of work product that a human adjudicator would produce.<\/p>\n<p>Numerous commentators have considered whether AIs in general, and recent LLMs in particular, could do the work of judges. What sets apart the two we discuss in this section\u2014Adam Unikowsky and Jack Kieffaber\u2014is their full-throated enthusiasm. While others ask whether LLMs are ready for judicial use, these authors have already arrived at their answer: an emphatic \u201cyes!\u201d<\/p>\n<p style=\"text-align: center\">1. <em>Adam Unikowsky<\/em><\/p>\n<p>Adam Unikowsky\u2014a partner at Jenner &amp; Block who focuses on Supreme Court and appellate advocacy<sup class=\"modern-footnotes-footnote \" data-mfn=\"96\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-96\">96<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-96\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"96\"><em>Adam Unikowsky<\/em>, <span style=\"font-variant: small-caps\">Jenner &amp; Block LLP<\/span>, https:\/\/www.jenner.com\/en\/people\/adam-g-unikowsky [https:\/\/perma.cc\/3LST-BY3X].<\/span>\u2014has used a variety of methods to test LLM adjudication. Although the specifics vary, from the outset he has focused on prompts that generate extensive natural-language outputs: his first, for example, was \u201cWrite two paragraphs about the [<em>Snell<\/em>] concurrence. First: what does the concurrence argue? 
Second, do you agree with the concurrence\u2019s argument?\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"97\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-97\">97<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-97\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"97\"><em>In AI We Trust II<\/em>, <em>supra<\/em> note 5.<\/span> From there, he progressed to uploading the entire briefing in a case to Claude, asking it to write a brief opinion, and then asking probing follow-up questions.<sup class=\"modern-footnotes-footnote \" data-mfn=\"98\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-98\">98<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-98\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"98\"><em>See id.<\/em><\/span><\/p>\n<p>Unikowsky was deeply impressed with the results, asserting that the Claude chatbot \u201cis fully capable of acting as a Supreme Court Justice right now. 
When used as a law clerk, Claude is easily as insightful and accurate as human clerks, while towering over humans in efficiency.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"99\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-99\">99<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-99\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"99\"><em>Id.<\/em><\/span> He found that Claude \u201cconsistently decide[d] cases correctly,\u201d by which he meant that it articulated the holding the Supreme Court in fact reached.<sup class=\"modern-footnotes-footnote \" data-mfn=\"100\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-100\">100<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-100\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"100\"><em>Id.<\/em><\/span> On the occasions when Claude articulated conclusions that differed from the Supreme Court\u2019s, its output struck Unikowsky as \u201creasonable.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"101\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-101\">101<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-101\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"101\"><em>Id.<\/em><\/span> He has also asked it to critique the methodology in an expert report (much as judges do when discounting an expert\u2019s conclusions),<sup class=\"modern-footnotes-footnote 
\" data-mfn=\"102\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-102\">102<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-102\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"102\"><em>See id. <\/em><\/span> to generate persuasive opinions in support of a particular outcome (\u201cYou are a Supreme Court Justice who believes racial-gerrymandering claims should be non-justiciable. Write a two-paragraph concurring opinion expressing and defending that view.\u201d),<sup class=\"modern-footnotes-footnote \" data-mfn=\"103\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-103\">103<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-103\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"103\"><em>In AI We Trust II<\/em>, <em>supra<\/em> note 5.<\/span> and to develop novel doctrines to resolve a case (\u201cPlease come up with an extremely creative, brilliant, out-of-the-box alternative to the \u2018primary purpose\u2019 test that would improve Confrontation Clause law.\u201d).<sup class=\"modern-footnotes-footnote \" data-mfn=\"104\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-104\">104<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-104\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"104\"><em>A Brief History<\/em>, <em>supra<\/em> note 5.<\/span> In each case, he found the results to be intelligent, creative, and 
persuasive.<\/p>\n<p>In additional blog posts Unikowsky has suggested that AIs could write effective appellate briefs,<sup class=\"modern-footnotes-footnote \" data-mfn=\"105\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-105\">105<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-105\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"105\"><em>See Automating Criminal Appeals<\/em>, <em>supra<\/em> note 5.<\/span> and that AI should be used to automate the entire veterans\u2019 appeals process.<sup class=\"modern-footnotes-footnote \" data-mfn=\"106\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-106\">106<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-106\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"106\"><em>See Ignore the Future<\/em>, <em>supra<\/em> note 5.<\/span> His proposed process is worth quoting in detail, because it gives a good sense of the ambition of his vision:<\/p>\n<p>This can be trivially implemented today. Just upload the relevant portion of the VA benefits manual into an AI context window, upload the veteran\u2019s documentary evidence, and ask AI to apply law to fact. If you\u2019re concerned that AI won\u2019t be sufficiently generous, you can keep track of the percentage of veterans that obtain benefits and then tweak the prompts until the percentage is at a satisfactory level. 
If you\u2019re squeamish about having AI make decisions that affect people\u2019s lives, then you can have the AI complete the initial layer of review (equivalent to what the VA does today) and then give the veteran the right to appeal to a human judge. If we just snap our fingers, we can make all tradeoffs go away.<sup class=\"modern-footnotes-footnote \" data-mfn=\"107\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-107\">107<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-107\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"107\"><em>Id<\/em>.<\/span><\/p>\n<p>For Unikowsky, LLMs are ready to perform the work that human judges do; all that remains is to resolve the details of integrating the technology into the courts.<\/p>\n<p style=\"text-align: center\">2. <em>Jack Kieffaber<\/em><\/p>\n<p>Another enthusiastic promoter of LLM adjudication is Jack Kieffaber, a former clerk to Judge Newsom involved with AI arbitration startup Fortuna Arbitration.<sup class=\"modern-footnotes-footnote \" data-mfn=\"108\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-108\">108<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-108\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"108\"><em>See<\/em> Kimo Gandall, Jack Kieffaber &amp; Kenny McLaren, <em>We Built Judge.ai. And You Should Buy It<\/em>, SSRN (Mar. 27, 2025), https:\/\/papers.ssrn.com\/abstract=5115184 [https:\/\/perma.cc\/d5jw-8y7k] (\u201cMr. Gandall, Mr. Kieffaber, and Mr. 
McLaren all have financial interests in Fortuna-Insights, Inc., the company that makes Arbitrus.ai.\u201d).<\/span> Kieffaber has promulgated several papers arguing strongly in favor of replacing judges with AIs like the one Fortuna sells. The place to start is actually with his second article, <em>We Built Judge.ai. And You Should Buy It, <\/em>written with Kimo Gandall and Kenny McLaren.<sup class=\"modern-footnotes-footnote \" data-mfn=\"109\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-109\">109<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-109\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"109\"><em>See <\/em>Gandall et al., <em>supra <\/em>note 30<em>.<\/em> Gandall and McLaren are CEO and CTO, respectively, of the startup company behind Arbitrus.ai. <em>See<\/em> <span style=\"font-variant: small-caps\">Arbitrus.AI<\/span>, https:\/\/www.arbitrus.ai [https:\/\/perma.cc\/P2HT-6CXZ].<\/span> The authors built an LLM-based online system, Arbitrus.ai, that receives evidence, motions, and briefing from two parties and then issues rulings.<sup class=\"modern-footnotes-footnote \" data-mfn=\"110\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-110\">110<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-110\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"110\"><em>See<\/em> Gandall et al., <em>supra <\/em>note 30 at 30\u201334.<\/span> To calibrate and verify the system, they used another LLM to generate synthetic briefs in 100 hypothetical disputes, and then had Arbitrus.ai issue opinions resolving those disputes.<sup 
class=\"modern-footnotes-footnote \" data-mfn=\"111\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-111\">111<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-111\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"111\"><em>See<\/em> <em>id<\/em>. at 48\u201349.<\/span> Humans checked that cited authorities existed, were responsive to the issues in the case, and fully resolved the dispute.<sup class=\"modern-footnotes-footnote \" data-mfn=\"112\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-112\">112<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-112\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"112\"><em>See<\/em> <em>id<\/em>. at 50\u201356.<\/span> Overall, they claim that the system usually fully resolved each case without hallucinations or ungrounded assertions.<sup class=\"modern-footnotes-footnote \" data-mfn=\"113\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-113\">113<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-113\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"113\"><em>See<\/em> <em>id.<\/em> at 56\u201357.<\/span><\/p>\n<p>As the name suggests, Arbitrus.ai is designed and marketed as an arbitration system. 
Although there are some statutory and doctrinal complications about the applicability of the Federal Arbitration Act to AI arbitrations, party consent can bless a wide range of procedures in arbitration that would be unacceptable in adjudication.<sup class=\"modern-footnotes-footnote \" data-mfn=\"114\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-114\">114<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-114\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"114\"><em>See, e.g.<\/em>, David Horton, <em>Forced Robot Arbitration<\/em>, 109 <span style=\"font-variant: small-caps\">Cornell L. Rev.<\/span> 679, 679 (2024) (arguing that \u201cAI adjudication is more likely to flourish in . . . forced arbitration\u201d); Michael J. Broyde &amp; Yiyang Mei, <em>Don\u2019t Kill the Baby! The Case for AI in Arbitration<\/em>, 21 <span style=\"font-variant: small-caps\">N.Y.U. J.L. 
&amp; Bus.<\/span> 119, 119 (2024) (\u201cThis article examines the integration of AI into arbitration, arguing that the Federal Arbitration Act (FAA) allows parties to contractually choose AI-driven arbitration, despite traditional reservations.\u201d).<\/span> Gandall, Kieffaber, and McLaren argue that \u201cautomation will beat all comers\u201d in the market for arbitrators.<sup class=\"modern-footnotes-footnote \" data-mfn=\"115\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-115\">115<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-115\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"115\">Gandall et al., <em>supra <\/em>note 30, at 83.<\/span> This is a predictive claim about parties\u2019 satisfaction with AI arbitrations, or at least their willingness to accept form contracts selecting it, that does not necessarily depend on the quality of those arbitrations.<\/p>\n<p>Kieffaber\u2019s principal normative claims in favor of AI adjudication come, instead, in his earlier article, <em>Predictability, AI, And Judicial Futurism<\/em>.<sup class=\"modern-footnotes-footnote \" data-mfn=\"116\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-116\">116<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-116\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"116\">Jack Kieffaber, <em>Predictability, AI, And Judicial Futurism: Why Robots Will Run The Law And Textualists Will Like It<\/em>, 48 <span style=\"font-variant: small-caps\">Harv. J.L. &amp; Pub. Pol\u2019y 
<\/span>(forthcoming) (manuscript at 78), https:\/\/papers.ssrn.com\/abstract=4966334 [https:\/\/perma.cc\/8KSE-6ZV8].<\/span> He anticipates a future in which \u201cJudge.AI\u201d can do what Arbitrus.AI does, but on a much grander scale, providing both binding opinions on the \u201cback end\u201d and also advisory guidance for citizens on the \u201cfront end.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"117\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-117\">117<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-117\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"117\"><em>Id.<\/em> at 6\u20137 (\u201cGiven the rate at which these large language models develop, it is starting to feel eerily as though the future is <em>now <\/em>\u2014 and, if not <em>now<\/em>, who is to say it won\u2019t be<em> next month?<\/em>\u201d<em>)<\/em>; <em>see also id.<\/em> at 78 (\u201cMy Judge.AI hypothetical, at the end of the day, really isn\u2019t a hypothetical at all. <em>It\u2019s coming.<\/em>\u201d).<\/span> He assumes\u2014big if true\u2014that \u201cJudge.AI is a perfectly neutral arbiter and interprets words with perfect mathematical accuracy.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"118\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-118\">118<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-118\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"118\"><em>Id. 
<\/em>at 7.<\/span> He then argues that \u201cJudge.AI is optimal under a textualist framework and, indeed, is the logical end result of the textualist project.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"119\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-119\">119<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-119\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"119\"><em>Id.<\/em> at 8.<\/span><\/p>\n<p>This view might be taken either as a criticism of textualism, or an invitation for textualists to embrace generative AI. Kieffaber comes closer to the latter, much like his former boss,<sup class=\"modern-footnotes-footnote \" data-mfn=\"120\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-120\">120<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-120\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"120\">Kieffaber\u2019s online biography indicates that he clerked for Judge Newsom roughly between Summer 2023 and late September 2024; <em>Snell<\/em> and <em>Deleon<\/em> were published on May 28, 2024, and September 5, 2024, respectively. 
<em>See<\/em> 2023 James Wilson Fellows, <span style=\"font-variant: small-caps\">James Wilson Institute, <\/span>https:\/\/www.jameswilsoninstitute.org\/articles\/2023-james-wilson-fellows [https:\/\/perma.cc\/JF5H-8SG8] (last visited June 13, 2025) (listing Kieffaber as an \u201cincoming law clerk\u201d for Judge Newsom); The James Wilson Institute, <em>Predictability, AI, and Judicial Futurism with Jack Kieffaber \u2013 Anchoring Truths<\/em> (2024), https:\/\/www.anchoringtruths.org\/2024\/09\/26\/predictability-ai-and-judicial-futurism-with-jack-kieffaber\/ [https:\/\/perma.cc\/PJB2-WPTJ] (last visited June 13, 2025) (\u201c[Kieffaber] has clerked for Judge Kevin Newsom on the 11th Circuit.\u201d).<\/span> Judge Newsom, who justified his explorations in <em>Snell<\/em> and <em>Deleon<\/em> by pointing to generative AI\u2019s ability to advance the textualist project. Kieffaber argues that the principal goal\u2014the only goal\u2014of textualism is predictability,<sup class=\"modern-footnotes-footnote \" data-mfn=\"121\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-121\">121<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-121\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"121\"><em>See <\/em>Kieffaber, <em>supra<\/em> note 116, at 9\u201313.<\/span> and asserts that an ideal AI can be more predictable than human judges.<sup class=\"modern-footnotes-footnote \" data-mfn=\"122\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-122\">122<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-122\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" 
data-mfn=\"122\"><em>See id.<\/em> at 29\u201331.<\/span><\/p>\n<p style=\"text-align: center\">D. <em>Varied Justifications<br \/><\/em><\/p>\n<p>To be sure, there are some notable differences among the group. We have flagged their divergences on broad versus narrow questions, sources, and outputs. There is also a notable theoretical division. Judge Newsom and Kieffaber are whole-hearted textualists; they regard the central judicial task as the extraction of linguistic meaning from a controlling textual authority. Kieffaber is more ambitious than Judge Newsom in how much of that task he would assign to an LLM, but they share a common conception of what a judge\u2014or an LLM prompted to do a judge\u2019s job\u2014ought to do, with a common focus on clarity and predictability. They are intrigued by LLMs because they see them as, potentially, the truest textualists of them all.<\/p>\n<p>None of the others are firmly opposed to textualism as such, but they are more broad-minded about what judges\u2014and LLMs in judicial robes\u2014should be doing. Unikowsky\u2019s prompts include a mixture of textualist and non-textualist instructions, much as one would expect from a skilled appellate lawyer who must craft arguments that are persuasive both to textualist and non-textualist judges. Judge Deahl\u2019s one experiment with ChatGPT is decidedly non-textualist; the question he put to it is about dog safety (or what people commonly believe about dog safety), not about the meaning of a disputed term. 
And Arbel and Hoffman see generative interpretation as the synthesis that will finally bring together the seemingly opposed camps of textualism and contextualism; to them, it offers the certainty textualists seek while considering all the relevant evidence that contextualists want to consult.<\/p>\n<p>We will have much more to say in due course, but here at the outset, it is worth noting how profoundly these various advocates of LLM-based judging disagree on how to perform it and why it is justified. Judge Deahl thinks ChatGPT is valuable because it offers insight into what people know; Kieffaber denies that an ideal AI judge should consider anything besides linguistic meaning. Judge Newsom believes in prompting LLMs with disputed phrases in isolation; Arbel and Hoffman prefer to prompt them with every relevant document in the record they can. Unikowsky is perfectly happy to ask Claude to generate \u201coff-the-wall\u201d opinions to show off its creativity; Judge Newsom wants to restrict discretion as much as possible. And so on. That these LLM proponents have such fundamental differences of opinion suggests, perhaps, that none of them have rightly apprehended what generative interpretation is.<\/p>\n<h4 style=\"text-align: center\">III. <span style=\"font-variant: small-caps\">LLMs Generate Text by Predicting It<br \/><\/span><\/h4>\n<p>Before diving into our discussion of LLMs\u2019 use in adjudication and interpretation, we wish to emphasize an important point about how they function. We do not intend to rehash the history of LLMs or describe the technology in detail. 
At this point, there are excellent references available, and we assume that the reader has familiarity with the basics.<sup class=\"modern-footnotes-footnote \" data-mfn=\"123\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-123\">123<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-123\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"123\"><em>See generally <\/em>Memorandum from David Stein on AI Primer (July 12, 2024), http:\/\/ai-memo.stein.fyi\/ [https:\/\/perma.cc\/89AT-2449] (providing an overview of the topic); Katherine Lee, A. Feder Cooper &amp; James Grimmelmann, <em>Talkin\u2019 \u2019Bout AI Generation: Copyright and the Generative-AI Supply Chain<\/em>, 72 <span style=\"font-variant: small-caps\">J. Copyright Soc\u2019y 251 <\/span>(2025) (discussing how copyright law interacts with the chain of materials from truly novel human-made works to the outputs of AI algorithms). For readers looking for a deeper technical explanation of LLMs, there is a wealth of explanatory materials (of varying quality and accuracy) online and in the academic literature. Grant Sanderson\u2019s 2-hour video course on machine learning is by far the best mostly prerequisite-free resource we\u2019re aware of that covers the finer details of machine learning, neural networks, LLMs, and GPTs. <span style=\"font-variant: small-caps\">3Blue1Brown<\/span>, <em>Playlist: Machine Learning<\/em>,<span style=\"font-variant: small-caps\"> (YouTube, <\/span>Oct<span style=\"font-variant: small-caps\">. 5, 2017), <\/span>https:\/\/www.youtube.com\/watch?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&amp;v=aircAruvnKk [https:\/\/perma.cc\/SZB3-4EEN]. The <em>Financial Times<\/em> has an accessible and accurate article on how LLMs generate text. 
<em>See <\/em>Madhumita Murgia, <em>Generative AI exists because of the transformer<\/em>, <span style=\"font-variant: small-caps\">Fin. Times<\/span> (Sep. 12, 2023), https:\/\/ig.ft.com\/generative-ai\/ [https:\/\/perma.cc\/KUV9-KWTX].<\/span><\/p>\n<p>LLMs are at heart prediction machines. Given some text, an LLM predicts the next word.<sup class=\"modern-footnotes-footnote \" data-mfn=\"124\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-124\">124<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-124\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"124\">This is a simplification. LLMs can also \u201cfill in\u201d blanks in the middle of example text, or predict multiple words at a time. <em>See generally <\/em>Minkai Xu, Tomas Geffner, Karsten Kreis, Weili Nie, Yilun Xu, Jure Leskovec, Stefano Ermon &amp; Arash Vahdat, <em>Energy-Based Diffusion Language Models for Text Generation<\/em>, 2025 <span style=\"font-variant: small-caps\">Proc. Int\u2019l Conf. on Learning Representations 1<\/span> (describing a technique that uses an LLM to predict many words simultaneously).<\/span> The prediction process can be broken into two steps.<\/p>\n<p style=\"padding-left: 40px\">\u2022 Step 1: the LLM converts some input text (a \u201cprompt\u201d) into a sequence of numbers (a \u201cvector\u201d). Vectors represent the prompt in a computer-friendly format.<br \/>\u2022 Step 2: the LLM runs those vectors through a mathematical function that assigns a probability to each word in its dictionary. Those probabilities estimate the likelihood of each word directly following the prompt. 
For instance, given the prompt \u201c<em>have a nice ___<\/em>,\u201d a well-calibrated LLM might assign high probabilities to words like \u201c<em>day<\/em>,\u201d \u201c<em>night<\/em>,\u201d or \u201c<em>stay<\/em>,\u201d and almost zero probability to rare, grammatically incorrect, or semantically nonsensical words like \u201c<em>thusly<\/em>.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"125\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-125\">125<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-125\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"125\">For example, when presented with this example, OpenAI\u2019s \u201cbabbage-02\u201d model assigns the following probabilities: <em>day<\/em>: 47.7%; <em>weekend<\/em>: 9.8%; <em>evening<\/em>: 2.7%; <em>time<\/em>: 2.6%; <em>life<\/em>: 2.2%; . . . <em>picnic<\/em>: 0.0096%; . . . <em>thusly<\/em>: 0.000023%; . . . <\/span><\/p>\n<p>Generative AIs, including chatbots, use LLMs to generate text. A generative AI is a program that works by passing a prompt into an LLM to estimate next-word-likelihoods, using those likelihoods and a word selection algorithm (a \u201csampling strategy\u201d) to select a word, then appending the selected word to the end of the prompt.<sup class=\"modern-footnotes-footnote \" data-mfn=\"126\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-126\">126<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-126\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"126\">A simple sampling algorithm might pick the word assigned the highest probability. 
A more complex one might pick a word at random, weighted by the probabilities assigned by the LLM. In practice, word selection algorithms adjust the probabilities and then make a weighted random sample. The way those probabilities are adjusted is configured using \u201chyperparameters\u201d with names like \u201ctemperature\u201d and \u201ctop_p,\u201d which you may have seen mentioned in related literature. <em>See, e.g.<\/em>, <em>Fine-tuning Best Practices \u2013 OpenAI API<\/em>, <span style=\"font-variant: small-caps\">OpenAI Platform<\/span>, https:\/\/platform.openai.com\/docs\/guides\/fine-tuning-best-practices#iterating-on-hyperparameters [https:\/\/perma.cc\/9NW3-QA4P] (last visited Oct. 20, 2025) (describing how to adjust sampling strategy hyperparameters when generating text); Xu et al., <em>supra<\/em> note 124. Note that because hyperparameters only affect the sampling process, they have no effect on how an LLM generates probabilities. <em>See<\/em> Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes &amp; Yejin Choi, <em>The Curious Case of Neural Text Degeneration<\/em>, 2020 <span style=\"font-variant: small-caps\">Proc. Int\u2019l Conf. on Learning Representations 1<\/span>, at 3-6 (describing the standard LLM text-generation workflow, where \u201cmodels compute [the probability of each word] using the common left-to-right decomposition of the text probability,\u201d which are then used to \u201cgenerate the [output text] token-by-token using [one of the selection algorithms that accepts hyperparameters]\u201d).<\/span> By repeating the process, a computer can generate large sequences of text.<\/p>\n<p>Because the language models are where the magic of text generation happens, it is common to conflate the terms \u201cgenerative AI\u201d and \u201cLLM.\u201d LLMs are not manually programmed to predict words. 
Instead, their capabilities emerge during a process called \u201ctraining.\u201d Training involves repeatedly presenting text with omitted words to an LLM and programmatically adjusting the LLM\u2019s configuration until it begins to correctly \u201cguess\u201d which word was omitted (that is, until it tends to assign high likelihood scores to missing words).<sup class=\"modern-footnotes-footnote \" data-mfn=\"127\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-127\">127<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-127\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"127\"><em>See generally <\/em>Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever &amp; Dario Amodei, <em>Language Models Are Few-Shot Learners<\/em> (July 22, 2020), https:\/\/arxiv.org\/pdf\/2005.14165 [https:\/\/perma.cc\/VJB4-VXAL] (discussing language model training).<\/span> After training on billions of examples, some LLMs begin to make predictions that reflect surprising linguistic and contextual nuances\u2014including grammar, semantic meaning, and style.<sup class=\"modern-footnotes-footnote \" data-mfn=\"128\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-128\">128<\/a><\/sup><span 
id=\"mfn-content-00000000000007360000000000000000_4618-128\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"128\"><em>See generally<\/em> Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser &amp; Illia Polosukhin, <em>Attention Is All You Need<\/em>, 31 <span style=\"font-variant: small-caps\">Conf. Neural Info. Processing Sys.<\/span> (2017) (discussing language model training and its outputs).<\/span><\/p>\n<p>The LLM \u201clearns\u201d the patterns that appear in natural-language text: not just which words tend to follow which other ones (\u201cnice day\u201d is more common than \u201cnice thusly\u201d), but also which words are stylistically consistent with each other (few texts include both \u201cdetrimentally\u201d and \u201cbutthead\u201d), the ordering of textual passages (topic sentences precede supporting evidence), and other larger-scale and more abstract statistical patterns\u2014which may or may not correspond to things readers identify as meaningful qualities of the text. Machine-learning researchers would say that an LLM is a model for the statistical distribution of natural-language text. Prompting one to generate text is a way of sampling from that distribution; if the model is a good one, the resulting texts should have the same statistical properties (word order, etc.) as the texts it was trained on.<\/p>\n<p>In general, a reasonable, rough way to build instincts about how LLMs produce word-likelihood estimates is to consider what factors would be relevant to filling in a scratched-out word in a document. For example, the fact that this sentence contains an em-dash\u2014a generally uncommon punctuation mark beloved by law professors\u2014might cause an LLM predicting the first word in the next sentence to assign significantly more likelihood to Latin words and prepositions. 
<em>Ergo<\/em>, some critics like Gary Marcus call LLMs \u201cautocomplete on steroids,\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"129\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-129\">129<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-129\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"129\">Gary Marcus, <em>The Dark Rise of Large Language Models<\/em>, <span style=\"font-variant: small-caps\">Wired<\/span> (Dec. 29, 2022), https:\/\/www.wired.com\/story\/large-language-models-artificial-intelligence\/ [https:\/\/perma.cc\/G78N-SEBL].<\/span> but this quality is also why some enthusiasts see them as ideal tools for interpretation.<\/p>\n<p>It turns out that generating text by predicting it is strikingly effective at producing fluent text in a variety of genres. 
Unsurprisingly, some of those genres are typical work for lawyers: law school exams,<sup class=\"modern-footnotes-footnote \" data-mfn=\"130\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-130\">130<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-130\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"130\"><em>See<\/em> <em>supra<\/em> note 9.<\/span> descriptive research memos,<sup class=\"modern-footnotes-footnote \" data-mfn=\"131\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-131\">131<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-131\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"131\"><em>See, e.g.<\/em>, Daniel Schwarcz, Sam Manning, Patrick Barry, David R. Cleveland, JJ Prescott &amp; Beverly Rich, <em>AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice<\/em> (U. Mich. Pub. L. 
Working Paper 24-058), https:\/\/papers.ssrn.com\/sol3\/papers.cfm?abstract_id=5162111 [https:\/\/perma.cc\/X2LD-N8H3].<\/span> and contract drafting,<sup class=\"modern-footnotes-footnote \" data-mfn=\"132\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-132\">132<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-132\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"132\"><em>See, e.g.<\/em>, <em>Navigating legal drafting: A how-to guide for law firms using AI-powered tools<\/em>, <span style=\"font-variant: small-caps\">Thomson Reuters Legal Blog<\/span> (Oct. 2, 2024), https:\/\/legal.thomsonreuters.com\/blog\/navigating-legal-drafting-a-how-to-guide-for-law-firms-using-ai-powered-tools\/ [https:\/\/perma.cc\/NGM9-C6V7]. <\/span> to name just a few. It is natural, then, that some researchers and judges have started to consider whether predictive text generation might also be a good fit for the work of judges. We now offer some reasons to question their optimism.<\/p>\n<h4 style=\"text-align: center\">IV. <span style=\"font-variant: small-caps\">Is Generative Interpretation Reliable?<br \/><\/span><\/h4>\n<p>When a judge prompts an LLM, how do they know whether its response is based on insights about the meaning of language or on irrelevant cues in the prompt or training data? Intuitively, one should expect LLMs to use the semantic meaning of words to make objective choices\u2014like dismissing nonsensical words\u2014and to rely on other factors to predict the relative likelihood that semantically reasonable options comport with the rest of the text. 
But if LLMs\u2019 responses depend on factors other than the semantic meaning of particular fragments of text, then things are not so simple.<\/p>\n<p>The problem is that LLMs\u2019 responses <em>do<\/em> depend on factors other than textual meaning. An LLM\u2019s output is determined by a complex generation algorithm, run on a model with a particular architecture and a specific set of weights, in response to a prompt that is both case-specific and expressed in natural language, subject to numerous configuration settings and implementation parameters, and typically starting from a randomly chosen seed value. All of these factors influence the output; that is simply the <em>definition<\/em> of what it means to carry out a generation using an LLM.<\/p>\n<p>The reliability problem, then, is to show that in a given generative-interpretation protocol these other factors\u2014model choice, prompt phrasing, configuration settings, etc.\u2014do not significantly influence the aspects of the output being measured. If they do, then the protocol is measuring noise rather than signal.<sup class=\"modern-footnotes-footnote \" data-mfn=\"133\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-133\">133<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-133\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"133\">It might be tempting to avoid these issues by arbitrarily picking a model, settings, etc. That would make the protocol reliable in that it consistently generates similar results when repeated. 
But each arbitrary choice made to solve the reliability problem amplifies the epistemic problem, as we discuss <em>infra<\/em> Part V.<\/span><\/p>\n<p>In this Part, we give reasons to believe that LLMs used for generative interpretation are indeed highly sensitive to irrelevant factors.<sup class=\"modern-footnotes-footnote \" data-mfn=\"134\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-134\">134<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-134\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"134\">Work by other scholars reinforces our conclusions. <em>See generally, e.g.<\/em>, Jonathan H. Choi, <em>Large Language Models Are Unreliable Judges<\/em> (manuscript at 21\u201330) (Aug. 11, 2025), https:\/\/papers.ssrn.com\/sol3\/papers.cfm?abstract_id=5188865 [https:\/\/perma.cc\/ZAD6-GU42]; Waldon et al., <em>supra<\/em> note 16, at 39\u201347.<\/span> Even in the LLM proponents\u2019 carefully curated case studies, changes to minor implementation details have outcome-determinative effects, calling into question whether they are accurately measuring semantic meaning. We consider case studies from <em>Generative Interpretation<\/em> and show that their results are sensitive to the choice of model, the prompting strategy, and other implementation details.<\/p>\n<p>Our choice to critique Arbel and Hoffman\u2019s case studies should not be taken as singling out their methodology for special criticism. Quite the opposite: of the generative-interpretation proponents, they are by far the most careful and attentive to the reliability challenges. We use their case studies <em>because <\/em>they are admirably explicit and precise about their methods. 
Our point is that if even the most meticulous generative-interpretation experiments conducted to date face reliability challenges\u2014as we show they do\u2014then one should be even more skeptical of approaches that do not share Arbel and Hoffman\u2019s attention to detail.<\/p>\n<p style=\"text-align: center\">A. Famiglio<em>:<\/em> <em>Model Instability<br \/><\/em><\/p>\n<p>One extraneous factor to the semantic task is the choice of which model to use. A diamond\u2019s mass should not vary based on whether it is weighed with a pan balance or a digital scale; words\u2019 meanings should not vary based on whether they are interpreted by Claude or ChatGPT. But of course there are substantial differences in generative models; companies compete fiercely to differentiate them, and there are leaderboards and communities dedicated to comparing models and teasing apart their differences. Even within a model family, outputs can vary enormously from one version to the next\u2014as though the definition of thousands of words in a dictionary changed each time it was reprinted.<\/p>\n<p>Consider the <em>Famiglio <\/em>case study from <em>Generative Interpretation<\/em>.<sup class=\"modern-footnotes-footnote \" data-mfn=\"135\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-135\">135<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-135\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"135\"><em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 483.<\/span> The Famiglios had a prenuptial agreement that defined the marriage\u2019s duration as the period from the wedding date to the date when either spouse filed for divorce.<sup class=\"modern-footnotes-footnote \" data-mfn=\"136\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a 
href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-136\">136<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-136\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"136\"><em>Id<\/em>. (citing Famiglio v. Famiglio, 279 So.3d 736, 737\u201338 (Fla. Dist. Ct. App. 2019)).<\/span> The wife filed for divorce, withdrew her filing, then refiled in earnest several years later.<sup class=\"modern-footnotes-footnote \" data-mfn=\"137\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-137\">137<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-137\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"137\"><em>Id<\/em>.<\/span> Millions hinged on whether the first or second filing date determined the marriage term.<sup class=\"modern-footnotes-footnote \" data-mfn=\"138\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-138\">138<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-138\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"138\"><em>Id<\/em>.<\/span><\/p>\n<p>Arbel and Hoffman present the contract clause to an LLM to demonstrate a generative-interpretation technique that, they argue, \u201coffers courts a better sense of the relevant probabilities\u201d of different interpretations of the prenup, assuming \u201cthe parties were intending to use English in its most public and common sense.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"139\" 
data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-139\">139<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-139\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"139\"><em>Id.<\/em> at 485.<\/span> This particular technique involves looking \u201cunder the hood\u201d of an LLM.<sup class=\"modern-footnotes-footnote \" data-mfn=\"140\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-140\">140<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-140\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"140\"><em>Id.<\/em> at 484.<\/span> Instead of letting a chatbot pick words using an LLM\u2019s word-likelihood estimates, they observe those estimates directly. 
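<p>Concretely, observing those estimates means requesting per-token log-probabilities from a model\u2019s API and converting them into odds. The Python sketch below is a hypothetical illustration rather than the authors\u2019 exact protocol; the commented request follows the general shape of OpenAI\u2019s legacy completions interface, and the prompt and token names are invented:<\/p>

```python
import math

def relative_odds(top_logprobs, a, b):
    """Convert two token log-probabilities (the form in which completion
    APIs expose word-likelihood estimates) into the odds ratio P(a)/P(b)."""
    return math.exp(top_logprobs[a] - top_logprobs[b])

# Hypothetical request sketch (parameter names per OpenAI's legacy
# completions API; the prompt here is an invented placeholder):
#
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.completions.create(
#       model="gpt-3.5-turbo-instruct",
#       prompt="<contract clause and question> The",  # ends just before the blank
#       max_tokens=1,
#       logprobs=5,  # also return the top-5 token log-probabilities
#   )
#   top = resp.choices[0].logprobs.top_logprobs[0]
#   odds = relative_odds(top, " second", " first")
```

<p>The raw material of the technique, in other words, is a handful of per-token log-probabilities, which the interpreter must then map onto legal conclusions.<\/p>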
They ask an LLM which filing date would control, observe the probabilities (reproduced in Figure 1), and conclude that the model favors the second date.<sup class=\"modern-footnotes-footnote \" data-mfn=\"141\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-141\">141<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-141\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"141\"><em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 484\u201385.<\/span> At time of writing, the version of the language model used in <em>Generative Interpretation<\/em> is no longer available.<sup class=\"modern-footnotes-footnote \" data-mfn=\"142\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-142\">142<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-142\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"142\"><em>GPT-4 API General Availability and Deprecation of Older Models in the Completions API<\/em>, <span style=\"font-variant: small-caps\">OpenAI<\/span> (Apr. 
24, 2024), https:\/\/openai.com\/index\/gpt-4-api-general-availability\/ [https:\/\/perma.cc\/K4FR-QJ2J] (deprecating the \u201cdavinci-003\u201d model used in <em>Generative Interpretation<\/em>, and recommending users transition to the \u201cgpt-3.5-turbo-instruct\u201d model used in this essay).<\/span> A later version of the same model yields the conflicting result shown in Figure 2.<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.30-PM.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"4625\" data-permalink=\"https:\/\/journals.law.harvard.edu\/jol\/2026\/01\/24\/generative-misinterpretation\/screenshot-2026-01-18-at-9-04-30-pm\/\" data-orig-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.30-PM.png?fit=1502%2C628&amp;ssl=1\" data-orig-size=\"1502,628\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Screenshot 2026-01-18 at 9.04.30\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.30-PM.png?fit=300%2C125&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.30-PM.png?fit=1024%2C428&amp;ssl=1\" class=\"wp-image-4625 aligncenter\" 
src=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.30-PM.png?resize=765%2C290&#038;ssl=1\" alt=\"\" width=\"765\" height=\"290\" \/><\/a><\/p>\n<p style=\"text-align: center\"><em>Figure 1: Reproduced from <\/em>Generative Interpretation<em>, at 484. This figure shows the probabilities produced by the davinci-003 model while responding to a question asking which date controls under the <\/em>Famiglio<em> facts. <\/em><\/p>\n<p><a href=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.38-PM.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"4624\" data-permalink=\"https:\/\/journals.law.harvard.edu\/jol\/2026\/01\/24\/generative-misinterpretation\/screenshot-2026-01-18-at-9-04-38-pm\/\" data-orig-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.38-PM.png?fit=1552%2C594&amp;ssl=1\" data-orig-size=\"1552,594\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Screenshot 2026-01-18 at 9.04.38\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.38-PM.png?fit=300%2C115&amp;ssl=1\" 
data-large-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.38-PM.png?fit=1024%2C392&amp;ssl=1\" class=\" wp-image-4624 aligncenter\" src=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.38-PM.png?resize=668%2C256&#038;ssl=1\" alt=\"\" width=\"668\" height=\"256\" srcset=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.38-PM.png?w=1552&amp;ssl=1 1552w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.38-PM.png?resize=300%2C115&amp;ssl=1 300w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.38-PM.png?resize=1024%2C392&amp;ssl=1 1024w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.38-PM.png?resize=768%2C294&amp;ssl=1 768w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.04.38-PM.png?resize=1536%2C588&amp;ssl=1 1536w\" sizes=\"auto, (max-width: 668px) 100vw, 668px\" \/><\/a><\/p>\n<p style=\"text-align: center\">Figure 2: <em>The probabilities produced by the gpt-3.5-turbo-instruct model under otherwise identical conditions to those in Figure 1.<\/em><\/p>\n<p>That is, while the model used by Arbel and Hoffman predicts that \u201csecond\u201d is more likely than \u201cfirst\u201d by a factor of more than 100 to 1, a later version of the same OpenAI model predicts that \u201cfirst\u201d is more likely than \u201csecond\u201d by a factor of more than 2 to 1. The fact that these two models\u2019 predictions diverge so dramatically makes it harder to credit either of them. 
If they were predicting the frequency of \u201cfirst\u201d versus \u201csecond\u201d based solely on a deep understanding of language, we should expect consistency. But since they diverge, how should courts determine which model is more appropriate for a given question? At least with dictionaries, we can make inferences about editorial choices. LLMs are far more opaque. Perhaps models will converge in the future, but for now they do not always do so,<sup class=\"modern-footnotes-footnote \" data-mfn=\"143\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-143\">143<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-143\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"143\"><em>Cf. <\/em>United States v. Deleon, 116 F.4th 1260, 1272\u201376 (11th Cir. 2024) (Newsom, J., concurring) (querying ChatGPT, Claude, and Gemini with identical prompts, and discussing the variations in their answers).<\/span> and that is a problem for those who want to perform generative interpretation.<sup class=\"modern-footnotes-footnote \" data-mfn=\"144\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-144\">144<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-144\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"144\"><em>Cf.<\/em> <em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 501 (suggesting that \u201cmore sophisticated models tend to converge on meaning\u201d). <\/span><\/p>\n<p style=\"text-align: center\">B. Famiglio<em>: Prompting Instability<br \/><\/em><\/p>\n<p>Continue with the <em>Famiglio <\/em>example. There is a deeper problem here. 
What does the probability of 94.72% that the first model attached to \u201csecond\u201d signify? It is tempting to say that it represents the model\u2019s estimate of the probability that a typical English speaker <em>would believe that the second filing controls<\/em> (as the sentence overall asserts).<sup class=\"modern-footnotes-footnote \" data-mfn=\"145\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-145\">145<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-145\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"145\"><em>See id.<\/em> at 485 (\u201cGenerative interpretation in this simple case thus offers courts a better sense of the relevant probabilities if the parties were intending to use English in its most public and common sense.\u201d).<\/span> But that is an unwarranted leap. The probability represents only the model\u2019s estimate of the probability that the <em>word<\/em> \u201csecond\u201d <em>would fill the blank<\/em> in the sentence \u201cThe ___ filing would determine the number of full years of marriage.\u201d This estimate, however, is exquisitely sensitive to the precise way the sentence is phrased.<\/p>\n<p>Recall that an LLM predicts one word at a time, then text-generation software uses those predictions to select a word, adds that word to the text, and repeats the process. 
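<p>That loop can be written down directly. In the Python sketch below, next_word_probs is a hypothetical stand-in for a call to an LLM that returns a map from candidate words to probabilities; everything else is ordinary bookkeeping:<\/p>

```python
import random

def generate(prompt_words, next_word_probs, max_words=20):
    """Generate text one word at a time: ask the model for next-word
    probabilities, draw a word, append it, and repeat.

    `next_word_probs` is a hypothetical stand-in for an LLM call.
    """
    words = list(prompt_words)
    for _ in range(max_words):
        probs = next_word_probs(words)
        choices = list(probs)
        weights = [probs[w] for w in choices]
        word = random.choices(choices, weights=weights, k=1)[0]
        if word == "<end>":  # sentinel marking the end of generation
            break
        words.append(word)
    return words
```

<p>Each pass through the loop draws one word at random according to the model\u2019s estimates, so any single run commits to just one path among many.<\/p>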
By exploring each possible word choice, we can map out a decision tree of possible generated texts.<sup class=\"modern-footnotes-footnote \" data-mfn=\"146\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-146\">146<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-146\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"146\">For an interactive tool showing a similar probability tree for the conversation in Judge Newsom\u2019s <em>Snell <\/em>concurrence, see David Stein, <em>GPT\u2019s Possible Response to Judge Newsom\u2019s Experiments in <\/em>Snell v. United (Sep. 28, 2025), http:\/\/snell-gpt.stein.fyi\/ [https:\/\/perma.cc\/QSD7-73LC].<\/span> Figure 3 shows the probabilities for different word sequences generated by GPT-3.5 in response to the <em>Famiglio<\/em> case study\u2019s prompt. To read the chart, follow a path from left to right, noting the probability of each word given the previous words on that path. For example, there is a 52.5% probability that the first word is <em>The.<\/em> When the first word is <em>The<\/em>, there is a 20.5% chance that the second word is <em>date<\/em>. Following different paths reveals how prior word choices affect GPT\u2019s predictions. Note that the relative likelihood of <em>first<\/em> and <em>second<\/em> shifts based on phrasing. 
For \u201c<em>the ___ filing<\/em>,\u201d the first filing is twice as likely, while for \u201c<em>the date of the ___ filing<\/em>,\u201d the second filing is five times more probable.<sup class=\"modern-footnotes-footnote \" data-mfn=\"147\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-147\">147<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-147\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"147\"><em>Cf.<\/em> <em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 485 n.164.<\/span><\/p>\n<p style=\"text-align: center\"><a href=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/fig3.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"4626\" data-permalink=\"https:\/\/journals.law.harvard.edu\/jol\/2026\/01\/24\/generative-misinterpretation\/fig3\/\" data-orig-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/fig3.png?fit=437%2C229&amp;ssl=1\" data-orig-size=\"437,229\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"fig3\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/fig3.png?fit=300%2C157&amp;ssl=1\" 
data-large-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/fig3.png?fit=437%2C229&amp;ssl=1\" class=\" wp-image-4626 aligncenter\" src=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/fig3.png?resize=513%2C266&#038;ssl=1\" alt=\"\" width=\"513\" height=\"266\" \/><\/a><em>Figure 3: The decision tree gpt-3.5-turbo produces in response to the input used to produce Figure 2. Note that the relative probabilities of \u201cfirst\u201d and \u201csecond\u201d flip based on the LLM\u2019s phrasing choices.<\/em><\/p>\n<p>One might try to fix this syntactic sensitivity by restricting the AI to one-word responses, perhaps by presenting the question as multiple-choice.<sup class=\"modern-footnotes-footnote \" data-mfn=\"148\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-148\">148<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-148\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"148\"><em>See id<\/em>. (proposing this solution).<\/span> While that eliminates the phrasing decisions, it does so by arbitrarily committing to one specific phrasing. 
This approach tells us nothing about the relative merits of the conflicting probability estimates we observed; it merely endorses one option while obscuring others.<sup class=\"modern-footnotes-footnote \" data-mfn=\"149\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-149\">149<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-149\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"149\">In fact, having the LLM \u201cthink out loud\u201d when generating text is a common technique for <em>improving<\/em> the accuracy and quality of answers. <em>See, e.g.<\/em>, Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le &amp; Denny Zhou, <em>Chain-of-Thought Prompting Elicits Reasoning in Large Language Models<\/em>, 36 <span style=\"font-variant: small-caps\">Conf. Neural Info. Processing Sys.<\/span> 1 (2022). At time of writing, clever application of chain-of-thought techniques is a meaningful differentiator for the generative-AI systems that perform best against major benchmarks. Daya Guo et al., <em>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning<\/em>, <span style=\"font-variant: small-caps\">arXiv:2501.12948 [cs]<\/span> (Jan. 2025), https:\/\/arxiv.org\/pdf\/2501.12948 [https:\/\/perma.cc\/W3GM-D9VB].<\/span> Indeed, presenting the <em>Famiglio<\/em> case as multiple-choice amplifies generative interpretation\u2019s sensitivities. 
As shown in Figure 4, GPT-3.5 almost exclusively relies on the order of multiple-choice options.<\/p>\n<p style=\"text-align: center\"><a href=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/fig4.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"4627\" data-permalink=\"https:\/\/journals.law.harvard.edu\/jol\/2026\/01\/24\/generative-misinterpretation\/fig4\/\" data-orig-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/fig4.png?fit=476%2C230&amp;ssl=1\" data-orig-size=\"476,230\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"fig4\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/fig4.png?fit=300%2C145&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/fig4.png?fit=476%2C230&amp;ssl=1\" class=\"size-full wp-image-4627 aligncenter\" src=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/fig4.png?resize=476%2C230&#038;ssl=1\" alt=\"\" width=\"476\" height=\"230\" srcset=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/fig4.png?w=476&amp;ssl=1 476w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/fig4.png?resize=300%2C145&amp;ssl=1 300w\" sizes=\"auto, (max-width: 476px) 100vw, 476px\" 
\/><\/a><em>Figure 4: Phrasing the <\/em>Famiglio<em> case study as a multiple-choice question results in GPT ignoring substance and picking option \u201cA\u201d. Probabilities generated using gpt-3.5-turbo-instruct.<\/em><\/p>\n<p>Similar problems haunt attempts to cross-validate between models.<sup class=\"modern-footnotes-footnote \" data-mfn=\"150\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-150\">150<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-150\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"150\"><em>But cf.<\/em> <em>Generative Interpretation<\/em>, <em>supra<\/em> note 2, at 503\u201304 (\u201c[A]s a best practice, judges would do well to cross-verify the answers that they get from one platform against another.\u201d).<\/span> If models disagree, which should we trust? If models agree, how do we know which factors they converged around? As Table 1 shows, OpenAI\u2019s GPT-3.5 and Anthropic\u2019s Claude-3 models both favor option \u201cA,\u201d regardless of what that option represents.<sup class=\"modern-footnotes-footnote \" data-mfn=\"151\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-151\">151<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-151\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"151\">This is true of almost every version of both models. The table below shows the probabilities generated by every version of GPT-3.5 and Claude-3 available at time of writing. 
Note that the preference for option \u201cA\u201d is stable for most models, even when the contents of options \u201cA\u201d and \u201cB\u201d are swapped.<br \/><br \/>\n<a href=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.55.17-PM.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"4619\" data-permalink=\"https:\/\/journals.law.harvard.edu\/jol\/2026\/01\/24\/generative-misinterpretation\/screenshot-2026-01-18-at-8-55-17-pm\/\" data-orig-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.55.17-PM.png?fit=1412%2C920&amp;ssl=1\" data-orig-size=\"1412,920\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Screenshot 2026-01-18 at 8.55.17\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.55.17-PM.png?fit=300%2C195&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.55.17-PM.png?fit=1024%2C667&amp;ssl=1\" class=\"alignnone size-full wp-image-4619\" src=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.55.17-PM.png?resize=1412%2C920&#038;ssl=1\" alt=\"\" width=\"1412\" height=\"920\" 
srcset=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.55.17-PM.png?w=1412&amp;ssl=1 1412w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.55.17-PM.png?resize=300%2C195&amp;ssl=1 300w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.55.17-PM.png?resize=1024%2C667&amp;ssl=1 1024w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.55.17-PM.png?resize=768%2C500&amp;ssl=1 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/a><br \/><br \/>\n(Some predictions do not add to 100% because the model sometimes doesn\u2019t commit to an option in the first few words.)<br \/><br \/>\nGPT-4 and GPT-4o are harder to measure because they often refuse to pick \u201cA\u201d or \u201cB,\u201d instead writing a few equivocating sentences. They appear to be less sensitive to the order in which options are presented. Instead, they converge on the importance of the parties\u2019 names: both models are nearly twice as confident in their answers when the spouse filing for divorce is named \u201cJennie\u201d and the spouse paying alimony is named \u201cMark.\u201d This chart shows the probability that GPT-4 models immediately commit to the second filing date. 
That is, GPT-4 models equivocate about 80% of the time if Mark is paying Jennie, and only 50% of the time when Jennie is paying Mark.<br \/><br \/>\n<a href=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.57.46-PM.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"4620\" data-permalink=\"https:\/\/journals.law.harvard.edu\/jol\/2026\/01\/24\/generative-misinterpretation\/screenshot-2026-01-18-at-8-57-46-pm\/\" data-orig-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.57.46-PM.png?fit=1266%2C268&amp;ssl=1\" data-orig-size=\"1266,268\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Screenshot 2026-01-18 at 8.57.46\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.57.46-PM.png?fit=300%2C64&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.57.46-PM.png?fit=1024%2C217&amp;ssl=1\" class=\"alignnone size-full wp-image-4620\" src=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.57.46-PM.png?resize=1266%2C268&#038;ssl=1\" alt=\"\" width=\"1266\" height=\"268\" 
srcset=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.57.46-PM.png?w=1266&amp;ssl=1 1266w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.57.46-PM.png?resize=300%2C64&amp;ssl=1 300w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.57.46-PM.png?resize=1024%2C217&amp;ssl=1 1024w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.57.46-PM.png?resize=768%2C163&amp;ssl=1 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/a><br \/><br \/>\n<\/span> Mere convergence does not imply convergence around information we consider relevant.<sup class=\"modern-footnotes-footnote \" data-mfn=\"152\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-152\">152<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-152\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"152\">One might reasonably expect LLMs to converge more quickly around simple rules like basic syntax, or simple concepts like \u201cpeople tend to pick the first option,\u201d and more slowly around complex concepts like ways to resolve semantic ambiguity. There\u2019s promising research exploring ways to probe the inner workings of LLMs, but those research efforts are years away from practical application. 
<em>See, e.g.<\/em>, Leo Gao, Tom Dupr\u00e9 la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike &amp; Jeffrey Wu, <em>Scaling and evaluating sparse autoencoders<\/em>, <span style=\"font-variant: small-caps\">arXiv:2406.04093 [cs]<\/span> (June 6, 2024), https:\/\/arxiv.org\/pdf\/2406.04093 [https:\/\/perma.cc\/CU9Y-5QUG]; Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah &amp; Tom Henighan, <em>Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet<\/em>, <span style=\"font-variant: small-caps\">Anthropic<\/span> (May 21, 2024), https:\/\/transformer-circuits.pub\/2024\/scaling-monosemanticity\/ [https:\/\/perma.cc\/H3RK-6GFE].<\/span><\/p>\n<p><a href=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.58.57-PM.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"4621\" data-permalink=\"https:\/\/journals.law.harvard.edu\/jol\/2026\/01\/24\/generative-misinterpretation\/screenshot-2026-01-18-at-8-58-57-pm\/\" data-orig-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.58.57-PM.png?fit=940%2C860&amp;ssl=1\" data-orig-size=\"940,860\" data-comments-opened=\"0\" 
data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Screenshot 2026-01-18 at 8.58.57\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.58.57-PM.png?fit=300%2C274&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.58.57-PM.png?fit=940%2C860&amp;ssl=1\" class=\" wp-image-4621 aligncenter\" src=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.58.57-PM.png?resize=732%2C670&#038;ssl=1\" alt=\"\" width=\"732\" height=\"670\" srcset=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.58.57-PM.png?w=940&amp;ssl=1 940w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.58.57-PM.png?resize=300%2C274&amp;ssl=1 300w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-8.58.57-PM.png?resize=768%2C703&amp;ssl=1 768w\" sizes=\"auto, (max-width: 732px) 100vw, 732px\" \/><\/a><\/p>\n<p style=\"text-align: center\"><em>Table 1: GPT and Claude models converge on the same, obviously arbitrary answer when presented with the <\/em>Famiglio<em> case study rephrased as a multiple-choice question.<\/em><\/p>\n<p>Determining which factors sway a model\u2019s choices is an active area
of technical research.<sup class=\"modern-footnotes-footnote \" data-mfn=\"153\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-153\">153<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-153\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"153\"><em>See<\/em> <em>supra<\/em> note 134.<\/span> Still, in this case, GPT and Claude\u2019s fixation on option order is not overly surprising. When either of two options is subjectively reasonable, semantic meaning offers little help in predicting which is more likely to appear in text. In those situations, other factors (e.g., formatting) carry greater predictive power and thus more heavily influence LLM predictions.<\/p>\n<p>These examples illustrate how a multitude of factors\u2014including many that users of generative interpretation would deem irrelevant\u2014influence LLM predictions. Multiple-choice bias might feel obvious and easily mitigated. 
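The obvious mitigation, re-running the question with the options swapped, can at least be made systematic. As a rough sketch of the bookkeeping (the tallies below are invented for illustration, not our experimental data), one can compare how often the model picks the first-listed slot against how often it picks a fixed answer's content:

```python
from collections import Counter

def position_vs_content(picks_original, picks_swapped):
    """Compare multiple-choice picks across two option orderings.

    picks_original: letters picked when answer X occupies slot "A".
    picks_swapped:  letters picked when answer X occupies slot "B".
    """
    def rate(picks, letter):
        # Fraction of runs in which `letter` was picked.
        return Counter(picks)[letter] / len(picks)

    return {
        # How often the first-listed slot ("A") wins under each ordering.
        "first_slot_rate": (rate(picks_original, "A"),
                            rate(picks_swapped, "A")),
        # How often answer X wins, wherever it is listed.
        "content_rate": (rate(picks_original, "A"),
                         rate(picks_swapped, "B")),
    }

# Invented tallies: the model picks "A" almost regardless of which
# answer sits in the "A" slot.
original = ["A"] * 9 + ["B"]      # answer X listed first
swapped = ["A"] * 8 + ["B"] * 2   # answer X listed second
print(position_vs_content(original, swapped))
```

If the first-slot rate stays high under both orderings while the content rate collapses, position rather than substance is driving the answer, which is the pattern Table 1 reports.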
Other outcome-determinative choices are more subtle, like asking about \u201cthe first filing\u201d (not \u201c<em>the date of <\/em>the first filing\u201d), or Judge Newsom asking about the \u201c<em>ordinary<\/em> meaning\u201d of landscaping.<sup class=\"modern-footnotes-footnote \" data-mfn=\"154\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-154\">154<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-154\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"154\">For additional discussion and data on the effect of phrasing on Judge Newsom\u2019s <em>Snell<\/em> experiments, see <em>infra<\/em> note 220 and accompanying text.<\/span> Controlling for specific kinds of unwanted influences offers no assurances that LLMs will use relevant information. It merely introduces new, less obvious biases.<sup class=\"modern-footnotes-footnote \" data-mfn=\"155\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-155\">155<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-155\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"155\"><em>See<\/em> Justin Curl, Peter Henderson, Kart Kandula &amp; Faiz Surani, <em>Judges Shouldn\u2019t Rely on AI for the Ordinary Meaning of Text<\/em>, <span style=\"font-variant: small-caps\">Lawfare<\/span> (May 22, 2025), https:\/\/www.lawfaremedia.org\/article\/judges-shouldn-t-rely-on-ai-for-the-ordinary-meaning-of-text [https:\/\/perma.cc\/U3YJ-4Y4K].<\/span><\/p>\n<p style=\"text-align: center\">C. 
Snell<em>: Instability through Randomness<br \/><\/em><\/p>\n<p>We have seen that generative interpretation can be extremely sensitive to the choice of model and to the phrasing of the prompt. But these are still only two of the factors that affect a model\u2019s outputs. Another is the randomness inherent to some LLM systems, including every major chatbot: asking the same question twice can yield different outputs.<sup class=\"modern-footnotes-footnote \" data-mfn=\"156\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-156\">156<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-156\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"156\">There is some well-documented non-determinism in commercial LLMs. <em>See<\/em> Seherman Chann, <em>Non-Determinism in GPT-4 is Caused by Sparse MoE<\/em>, <span style=\"font-variant: small-caps\">152334H <\/span>(Aug. 5, 2023), https:\/\/152334H.github.io\/blog\/non-determinism-in-gpt-4\/ [https:\/\/perma.cc\/UE3N-3JAT]. The magnitude of the non-determinism is fairly small.<\/span><\/p>\n<p>Some of the consequences of this randomness can be surprising. Consider <em>Snell<\/em>, where Newsom asked ChatGPT whether installing an in-ground trampoline is \u201clandscaping.\u201d He included the following transcript of his conversation with ChatGPT:<\/p>\n<p style=\"padding-left: 40px\"><strong>Question<\/strong>: What is the ordinary meaning of \u201clandscaping\u201d?<\/p>\n<p style=\"padding-left: 40px\"><strong>Response[GPT]<\/strong>: \u201cLandscaping\u201d refers to the process of altering the visible features of an area of land, typically a yard, garden or outdoor space, for aesthetic or practical purposes. 
This can include activities such as planting trees, shrubs, flowers, or grass, as well as installing paths, fences, water features, and other elements to enhance the appearance and functionality of the outdoor space.<\/p>\n<p style=\"padding-left: 40px\"><strong>Question<\/strong>: Is installing an in-ground trampoline \u201clandscaping\u201d?<\/p>\n<p style=\"padding-left: 40px\"><strong>Response[GPT]<\/strong>: <em>Yes<\/em>, installing an in-ground trampoline can be considered a part of landscaping. Landscaping involves altering the visible features of an outdoor area for aesthetic or practical purposes, and adding an in-ground trampoline would modify the appearance and function of the space. It\u2019s a deliberate change to the outdoor environment, often aimed at enhancing the overall landscape and usability of the area.<sup class=\"modern-footnotes-footnote \" data-mfn=\"157\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-157\">157<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-157\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"157\">Snell v. United Specialty Ins. Co<em>.<\/em>, 102 F.4th 1208, 1234 (11th Cir. 2024) (Newsom, J., concurring) (emphasis added).<\/span><\/p>\n<p>We posed the same questions to ChatGPT-3.5\u2014and got back the opposite answer:<\/p>\n<p style=\"padding-left: 40px\"><strong>Question<\/strong>: What is the ordinary meaning of \u201clandscaping\u201d?<\/p>\n<p style=\"padding-left: 40px\"><strong>[GPT-3.5]<\/strong>: The ordinary meaning of \u201clandscaping\u201d refers to the act or process of designing, shaping, or beautifying the outdoor area around a building or property. 
This may include planting trees, shrubs, and flowers, as well as creating paths, lawns, and other features to enhance the aesthetic appeal of the space. Landscaping also encompasses maintenance and upkeep of these elements to keep the outdoor area looking visually appealing.<\/p>\n<p style=\"padding-left: 40px\"><strong>Question<\/strong>: Is installing an in-ground trampoline \u201clandscaping\u201d?<\/p>\n<p style=\"padding-left: 40px\"><strong>[GPT-3.5]<\/strong>: While installing an in-ground trampoline may involve shaping and altering the ground surface, <span style=\"text-decoration: underline\"><strong><em>it is not typically considered landscaping<\/em><\/strong><\/span> because it does not involve the traditional elements of design and beautification of an outdoor space.<\/p>\n<p>What explains this discrepancy? It turns out that ChatGPT\u2019s answer to Newsom\u2019s second question was heavily influenced by how it phrased its answer to the first. For example, if ChatGPT uses the phrase \u201cother features\u201d at the end of its list of examples, it is three times more likely to include trampoline installation in the definition of \u201clandscaping\u201d than to exclude it. 
Conversely, if ChatGPT uses the phrase \u201cordinary meaning\u201d when answering the first question, it will tend to <em>exclude<\/em> trampoline installation from the definition.<\/p>\n<p>These wording choices are determined through a randomized process, essentially reducing ChatGPT\u2019s determination of whether trampoline installation falls within the meaning of the word \u201clandscaping\u201d to a roll of the dice.<sup class=\"modern-footnotes-footnote \" data-mfn=\"158\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-158\">158<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-158\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"158\">Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes &amp; Yejin Choi, <em>The Curious Case of Neural Text Degeneration<\/em>, 2020 <span style=\"font-variant: small-caps\">Proc. Int\u2019l Conf. 
on Learning Representations<\/span> (describing the word-selection technique used by most modern text generation AIs, and explaining that always picking the most likely word option leads to flat-sounding text).<\/span> The following table lists a few more examples:<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.02.36-PM.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"4623\" data-permalink=\"https:\/\/journals.law.harvard.edu\/jol\/2026\/01\/24\/generative-misinterpretation\/screenshot-2026-01-18-at-9-02-36-pm\/\" data-orig-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.02.36-PM.png?fit=1296%2C982&amp;ssl=1\" data-orig-size=\"1296,982\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Screenshot 2026-01-18 at 9.02.36\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.02.36-PM.png?fit=300%2C227&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.02.36-PM.png?fit=1024%2C776&amp;ssl=1\" class=\" wp-image-4623 aligncenter\" src=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.02.36-PM.png?resize=632%2C583&#038;ssl=1\" alt=\"\" 
width=\"632\" height=\"583\" \/><\/a><\/p>\n<p style=\"text-align: center\"><em>Table 2: The effect of phrasing choices on GPT\u2019s relative likelihood to say that an in-ground trampoline falls within the ordinary meaning of the word \u201clandscaping\u201d when presented with the questions in Judge Newsom\u2019s <\/em>Snell<em> concurrence.<\/em><\/p>\n<p>It is possible to explore GPT-3.5\u2019s entire decision tree to build intuitions about what influences Judge Newsom\u2019s experiments. As shown below, the probability of GPT saying that a trampoline is not landscaping drops precipitously if it uses the word \u201ctypically,\u201d and increases substantially if it starts its sentence with \u201cthe\u201d and then puts the word \u201clandscaping\u201d in quotes.<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/figure5.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"4628\" data-permalink=\"https:\/\/journals.law.harvard.edu\/jol\/2026\/01\/24\/generative-misinterpretation\/figure5\/\" data-orig-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/figure5.png?fit=864%2C577&amp;ssl=1\" data-orig-size=\"864,577\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"figure5\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/figure5.png?fit=300%2C200&amp;ssl=1\" 
data-large-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/figure5.png?fit=864%2C577&amp;ssl=1\" class=\" wp-image-4628 aligncenter\" src=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/figure5.png?resize=741%2C508&#038;ssl=1\" alt=\"\" width=\"741\" height=\"508\" \/><\/a><\/p>\n<p style=\"text-align: center\"><em>Figure 5: Part of the tree of possible responses to Judge Newsom\u2019s first question in <\/em>Snell<em>. The bars next to each word represent the probability that GPT\u2019s answer to the second question will be yes (green), no (red), or a non-committal response (blue). An interactive version of this tree exploration tool is available at <\/em>snell-gpt.stein.fyi<em>.<\/em><\/p>\n<p>What should we make of this path dependence? One lesson, surely correct, is the one that Newsom, Arbel and Hoffman, and other pioneers in generative interpretation have drawn: it is important to conduct experiments multiple times.<sup class=\"modern-footnotes-footnote \" data-mfn=\"159\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-159\">159<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-159\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"159\"><em>See, e.g.<\/em>, United States v. Deleon, 116 F.4th 1260, 1273\u201376 (11th Cir. 2024) (Newsom, J., concurring) (running each query ten times, and discussing the dependence of outputs on random variations). 
<em>But see<\/em> Gandall et al., <em>supra<\/em> note 30, at 32 (assuming \u201cperfect predictability\u201d).<\/span> Random variation can be quantified and managed statistically; indeed, statistical methods are essential to empirical validation of generative interpretation.<\/p>\n<p>Another and subtler lesson is that LLMs can influence <em>themselves<\/em> in unintuitive and extraneous ways. ChatGPT\u2019s answers to Newsom\u2019s first question appear highly similar; the random variation between them does not substantially affect the interpretive takeaways. But the differences in those answers, insignificant though they seem, are enough to channel ChatGPT towards diametrically opposed answers to Newsom\u2019s second question. If ChatGPT were responding solely based on the linguistic meaning of \u201clandscaping,\u201d the random variation in whether it used the word \u201ctypically\u201d in answer to the first question should not influence its answer to the second. And yet it does.<\/p>\n<p style=\"text-align: center\">D. Katrina Canal Breaches<em>: Implementation Instability<\/em><\/p>\n<p>Even this does not exhaust the factors that influence a generative AI\u2019s word choices. The generation process also depends on a large number of configurable parameters and other implementation details.<\/p>\n<p>Consider another case study from <em>Generative Interpretation<\/em>, one that revisits a series of Fifth Circuit cases involving an insurance contract. 
In relevant part, it asks whether man-made disasters can fall within the meaning of the word \u201cflood\u201d as used in one of the contract\u2019s clauses.<sup class=\"modern-footnotes-footnote \" data-mfn=\"160\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-160\">160<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-160\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"160\"><em>See Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 453\u201355, and accompanying footnotes (citing In re Katrina Canal Breaches Consolidated Litig., 466 F. Supp. 2d 729, 747\u201363 (E.D. La. 2006); and Willy E. Rice, <em>The Court of Appeals for the Fifth Circuit: A Review of 2007\u20132008 Insurance Decisions<\/em>, 41 <span style=\"font-variant: small-caps\">Tex. Tech. L. Rev.<\/span> 1013, 1039 (2009).<\/span> To answer the question, Arbel and Hoffman analyze the \u201cvectors\u201d LLMs use to represent text.<\/p>\n<p>A quick refresher: vectors are sequences of numbers that LLMs use to represent words and phrases. It is often helpful to think of them as coordinates. 
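To make the coordinate analogy concrete: the standard proximity measure is cosine similarity, the cosine of the angle between two vectors. A toy sketch using invented three-dimensional vectors (real models use hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 when they
    point in almost the same direction, near 0 when unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented coordinates, for illustration only.
paris = [0.9, 0.8, 0.1]
rome = [0.8, 0.9, 0.2]
bicycle = [0.1, 0.2, 0.9]

# Two capital cities sit much closer together than either does to an
# unrelated word.
print(cosine_similarity(paris, rome) > cosine_similarity(paris, bicycle))
# prints True
```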
LLMs tend to place similar text in close proximity: the \u201cLondon\u201d and \u201cBerlin\u201d vectors lie closer to each other than to the \u201cUmbrella\u201d vector.<sup class=\"modern-footnotes-footnote \" data-mfn=\"161\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-161\">161<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-161\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"161\"><em>See <\/em>Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado &amp; Jeffrey Dean, <em>Distributed Representations of Words and Phrases and Their Compositionality<\/em> (Oct. 16, 2013), https:\/\/arxiv.org\/pdf\/1310.4546 [https:\/\/perma.cc\/M4D7-J4TS].<\/span> LLMs also tend to arrange vectors so direction is meaningful: a line running from \u201cLondon\u201d to \u201cEngland\u201d might run roughly parallel to the line connecting \u201cBerlin\u201d to \u201cGermany.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"162\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-162\">162<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-162\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"162\"><em>Id.<\/em><\/span> Those properties make it possible to mathematically estimate similarities and relationships between words and phrases.<sup class=\"modern-footnotes-footnote \" data-mfn=\"163\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-163\">163<\/a><\/sup><span 
id=\"mfn-content-00000000000007360000000000000000_4618-163\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"163\"><em>See <\/em>Tom\u00e1\u0161 Mikolov, Wen-tau Yih &amp; Geoffrey Zweig, <em>Linguistic Regularities in Continuous Space Word Representations<\/em>, 2013 <span style=\"font-variant: small-caps\">Proc. Conf. N. Am. Chapter Ass\u2019n For Computational Linguistics: Hum. Language Techs. 746, 746<\/span>.<\/span><\/p>\n<p>Applying this insight to the flood question, Arbel and Hoffman estimate the similarity between the insurance clause and a list of terms related to natural and unnatural causes of floods. They include a few unrelated terms as quality checks.<sup class=\"modern-footnotes-footnote \" data-mfn=\"164\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-164\">164<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-164\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"164\"><em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 457.<\/span> Their result is reproduced in Figure 6.<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/figure6.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"4629\" data-permalink=\"https:\/\/journals.law.harvard.edu\/jol\/2026\/01\/24\/generative-misinterpretation\/figure6\/\" data-orig-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/figure6.png?fit=598%2C420&amp;ssl=1\" data-orig-size=\"598,420\" data-comments-opened=\"0\" 
data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"figure6\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/figure6.png?fit=300%2C211&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/figure6.png?fit=598%2C420&amp;ssl=1\" class=\"size-full wp-image-4629 aligncenter\" src=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/figure6.png?resize=598%2C420&#038;ssl=1\" alt=\"\" width=\"598\" height=\"420\" srcset=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/figure6.png?w=598&amp;ssl=1 598w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/figure6.png?resize=300%2C211&amp;ssl=1 300w\" sizes=\"auto, (max-width: 598px) 100vw, 598px\" \/><\/a><\/p>\n<p><em>Figure 6: Reproduced from <\/em>Generative Interpretation<em>, at p. 457. This chart shows how different LLMs measure the distance between an insurance clause and various terms. We added highlights to demark <\/em>Generative Interpretation<em>\u2019s \u201cman-made\u201d exclusion terms (yellow) and \u201cquality check\u201d terms (grey).<\/em><\/p>\n<p>Generative Interpretation explains that the further the average \u201csimilarity\u201d (denoted by a red dot) is to the right, the less semantically related that term is to the insurance clause. 
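<\/p>
<p>The measurement behind such a chart is straightforward to sketch. Assuming only that each term maps to an embedding vector, cosine similarity scores each candidate term against the clause. The vectors below are toy three-dimensional stand-ins we invented for illustration; real LLM embeddings have thousands of dimensions and entirely different values:<\/p>

```python
import math

# Toy three-dimensional "embeddings," invented for illustration only.
embeddings = {
    "clause": [0.9, 0.4, 0.1],        # stand-in for the insurance clause
    "flood": [0.8, 0.5, 0.2],
    "levee breach": [0.7, 0.6, 0.1],
    "tears of joy": [0.1, 0.2, 0.9],  # an unrelated "quality check" term
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Rank the candidate terms by their similarity to the clause; the unrelated
# quality-check term should land last.
ranked = sorted(
    (term for term in embeddings if term != "clause"),
    key=lambda term: cosine_similarity(embeddings["clause"], embeddings[term]),
    reverse=True,
)
```

<p>With real embeddings the mechanics are identical; only the vectors, and the further implementation choices discussed below, change.<\/p>
<p>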
Noting that natural and unnatural causes are interleaved, they conclude that whether a flood was man-made or natural is not a major factor in determining its relationship to the insurance clause.<sup class=\"modern-footnotes-footnote \" data-mfn=\"165\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-165\">165<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-165\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"165\"><em>Id.<\/em><\/span> They present this insight as \u201cobjective, cheap support for the court\u2019s judgment that floods can be unnaturally caused.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"166\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-166\">166<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-166\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"166\"><em>Id.<\/em><\/span> With caveats, they suggest this kind of chart contains information that would be useful to courts, perhaps even superior to dictionary definitions:<\/p>\n<p style=\"padding-left: 40px\">[T]he model doesn\u2019t provide (nor could it) a scientific answer to the question of whether certain words are sufficiently close to make the plain meaning of <em>flood<\/em> unambiguous. That choice is ultimately a normative one which judges must make. But there is a bit of difference between an informed conclusion based on a statistical analysis of billions of texts and a judgment by a few dictionary editors. 
And there is an ocean of difference between the baroque and expensive textualism the court used and code that is cheap, replicable, quick, and most importantly, extremely straightforward to use. Simply put, generative interpretation is good enough for many cases that currently employ more expensive, and arguably less certain, methodologies.<sup class=\"modern-footnotes-footnote \" data-mfn=\"167\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-167\">167<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-167\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"167\"><em>Id.<\/em> at 458. We did indeed find the code provided in tandem with the article cheap (~$50), quick (a few hours), and relatively straightforward to use.<\/span><\/p>\n<p>But to generate that chart, Arbel and Hoffman make a series of implementation choices\u2014and changes to these technical choices change the outcome of the experiment. Their measurements of similarity in ten LLMs\u2019 vector representations of terms<sup class=\"modern-footnotes-footnote \" data-mfn=\"168\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-168\">168<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-168\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"168\"><em>Id. 
<\/em>at 456 n.22.<\/span> are calculated using a measure called \u201ccosine distance.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"169\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-169\">169<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-169\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"169\"><em>Id.<\/em> (describing and justifying the use of the cosine distance metric). Cosine distance measures how far apart two points appear when viewed from a third reference point. For example, one might estimate the distance between two stars by measuring how far apart they appear in the sky. <em>See<\/em> G. Salton, A. Wong &amp; C.S. Yang, <em>A Vector Space Model for Automatic Indexing<\/em>, <span style=\"font-variant: small-caps\">18 Commc\u2019ns ACM<\/span> 613 (1975). This is a standard measure of similarity. It is popular because in high-dimensional spaces, <em>direction<\/em> is often more meaningful than absolute distance. (To see why, for example, consider a line, circle, and sphere with the same diameter. The longest possible line that can fit within each shape is the same, but the number of directions available increases with each new dimension. 
The amount of information conveyed by direction grows with each new dimension, while the number of possible distances remains the same).<\/span> But using cosine distance requires implementors to choose a reference frame<sup class=\"modern-footnotes-footnote \" data-mfn=\"170\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-170\">170<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-170\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"170\">In high-dimensional spaces, the perspective from which you measure direction matters. For example, the vectors for \u201cLondon\u201d and \u201cBerlin\u201d will tend to fall in roughly the same direction from most vantage points (e.g., the relationship between \u201cswimming pool\u201d and \u201cLondon\u201d is roughly the same as the one between \u201cswimming pool\u201d and \u201cBerlin\u201d). Differences start to show up when measured from a relevant perspective (e.g., the relationship between the words \u201cEngland\u201d and \u201cLondon\u201d are very different than the relationship between the words \u201cEngland\u201d and \u201cBerlin\u201d). The two reasonable measurement choices for this case study are (a) to use some arbitrary neutral perspective and measure the distance between each term and the insurance clause, or (b) to use the perspective of the insurance clause and measure the similarity between terms. <em>Generative Interpretation <\/em>chooses the default, neutral perspective. 
<em>Cf.<\/em> Mikolov, Yih &amp; Zweig, <em>supra <\/em>note 154, at 749 (illustrating how the direction between two vectors can be meaningful, and describing how to apply semantically meaningful linear transformations\u2014what we refer to as \u201cchange of perspective\u201d\u2014to embedding spaces).<\/span> and, under certain circumstances, normalize the results.<sup class=\"modern-footnotes-footnote \" data-mfn=\"171\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-171\">171<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-171\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"171\"><em>Generative Interpretation <\/em>applies min-max linear unit normalization without outlier correction to the measurements from each model. Because they intentionally include outliers as quality checks, the results are warped and hard to interpret. For example, the difference between <em>flood<\/em> and <em>fire<\/em> varies significantly less across models than the difference between <em>flood <\/em>and <em>police<\/em>. But <em>fire<\/em>\u2019s curve is more spread out in the chart reproduced in Figure 6 because there is significant variance between <em>fire<\/em> and the \u201cquality check\u201d term <em>tears of joy<\/em>. Replacing the term <em>tears of joy<\/em> with a more fire-related term (e.g., <em>burn ward<\/em>) would cause \u201cpolice\u201d to appear closer to the origin than \u201cfire.\u201d Even the authors seem to get tripped up by this choice. <em>Id.<\/em> at 504 (\u201cthe policy exceptions were closer to \u2018fire\u2019 than to . . . \u2018police\u2019\u201d). A more straightforward approach might measure everything relative to the distance between the reference clause and \u201cflood.\u201d <em>See, e.g.<\/em>, Prashant V. 
Kamat, <em>Absolute, Arbitrary, Relative, or Normalized Scale? How to Get the Scale Right<\/em>, 4 <span style=\"font-variant: small-caps\">ACS Energy Lett.<\/span> 2005 (2019) (describing relative scale and discussing the difference between relative and normalized scales).<\/span> These are choices so seemingly minor they don\u2019t even show up in the footnotes\u2014they are subtleties buried in the implementing code.<sup class=\"modern-footnotes-footnote \" data-mfn=\"172\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-172\">172<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-172\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"172\">The specific code changes are available at https:\/\/gist.github.com\/davidbstein\/27360bdc86bc5adc7a3050e44368ec15 [https:\/\/perma.cc\/7SVJ-556Z]. The top-left chart in Figure 7 is reproduced from <em>Generative Interpretation<\/em>, <em>supra<\/em> note 2, at 457. Based on our reading of the article, the bottom-right chart seems to track most closely to the context-dependent interpretive approach described in the article.<\/span><\/p>\n<p>And yet these choices make all the difference. We reproduced the \u201cflood\u201d case study four times. Each iteration used the same cosine-distance method described in <em>Generative Interpretation<\/em>, with the same models, terms, embeddings, data, and measurement techniques. The only differences are that we tried replacing Arbel and Hoffman\u2019s choices of reference frame and normalization with reasonable alternatives. 
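<\/p>
<p>The sensitivity to normalization is easy to demonstrate in isolation. The readings below are hypothetical distances we made up, not the article\u2019s data; the point is only that a min-max scale computed over a set that includes an intentional outlier compresses the terms of interest, while a scale relative to a reference term spreads them out:<\/p>

```python
# Hypothetical distance readings for one model, invented for illustration.
# "tears of joy" plays the role of the intentional "quality check" outlier.
distances = {"flood": 10, "levee breach": 15, "fire": 30, "tears of joy": 90}

def min_max(values):
    """Min-max scaling over the whole set, outlier included."""
    lo, hi = min(values.values()), max(values.values())
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}

def relative_to(values, reference):
    """Alternative: express each distance as a multiple of a reference term."""
    ref = values[reference]
    return {k: v / ref for k, v in values.items()}

scaled = min_max(distances)                 # the outlier pins the top of the scale
relative = relative_to(distances, "flood")  # everything measured against "flood"

# Under min-max, "fire" sits only a quarter of the way up the scale; on the
# relative scale, it is three times as far from the clause as "flood" is.
```

<p>Neither scale is wrong; they answer different questions, and nothing in the method itself tells a court which question to ask.<\/p>
<p>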
In one trial (top left in Figure 7) we kept both of their choices; in a second (top right) we replaced their reference frame with one from the perspective of the insurance clause; in a third (bottom left) we kept their reference frame but used a simpler and more natural normalization algorithm; and in a fourth (bottom right) we made both changes.<\/p>\n<p>The four trials reached four completely different results. Switching from Arbel and Hoffman\u2019s normalization technique to a simpler one weakened their results; switching from their choice of reference frame to another logical one caused the results to fail their quality check; and doing both at once completely flipped the result\u2014seemingly showing that natural and unnatural causes are <em>not<\/em> significantly interleaved.<\/p>\n<p>Our point is not that Arbel and Hoffman are making <em>unreasonable<\/em> choices; it is that they are making <em>unacknowledged<\/em> choices. Are the inferences enabled by these charts really \u201cinformed conclusion[s] based on a statistical analysis,\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"173\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-173\">173<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-173\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"173\"><em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 458.<\/span> or do they mistake implementation artifacts for semantic meaning, and noise for signal?<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.48.50-PM.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"4632\" 
data-permalink=\"https:\/\/journals.law.harvard.edu\/jol\/2026\/01\/24\/generative-misinterpretation\/screenshot-2026-01-18-at-9-48-50-pm\/\" data-orig-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.48.50-PM.png?fit=734%2C1480&amp;ssl=1\" data-orig-size=\"734,1480\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Screenshot 2026-01-18 at 9.48.50\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.48.50-PM.png?fit=149%2C300&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.48.50-PM.png?fit=508%2C1024&amp;ssl=1\" class=\"size-full wp-image-4632 aligncenter\" src=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.48.50-PM.png?resize=734%2C1480&#038;ssl=1\" alt=\"\" width=\"734\" height=\"1480\" srcset=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.48.50-PM.png?w=734&amp;ssl=1 734w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.48.50-PM.png?resize=149%2C300&amp;ssl=1 149w, 
https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-18-at-9.48.50-PM.png?resize=508%2C1024&amp;ssl=1 508w\" sizes=\"auto, (max-width: 734px) 100vw, 734px\" \/><\/a><\/p>\n<p style=\"text-align: center\"><em>Figure 7: The data and method in <\/em>Generative Interpretation<em>\u2019s Hurricane Katrina case study plotted using two different reference frames and normalization approaches. Highlights to demark <\/em>Generative Interpretation<em>\u2019s \u201cman-made\u201d exclusion terms (yellow) and \u201cquality check\u201d terms (grey). Depending on implementation choice, the chart either: contradicts the Fifth Circuit (top left), supports the Fifth Circuit (bottom right), indicates an inconclusive measurement (bottom left), or fails <\/em>Generative Interpretation<em>\u2019s<\/em> <em>\u201cquality check\u201d (top right).<\/em><\/p>\n<p>Imagine a judge trying to use embeddings as interpretive tools or contracting parties pre-committing to an interpretation technique, as <em>Generative Interpretation<\/em> suggests they ought to do.<sup class=\"modern-footnotes-footnote \" data-mfn=\"174\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-174\">174<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-174\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"174\"><em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 501.<\/span> How should they navigate these kinds of implementation choices, possibly without expert guidance? How should they resolve disagreements about those choices? 
How can they determine which implementation choice is better?<sup class=\"modern-footnotes-footnote \" data-mfn=\"175\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-175\">175<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-175\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"175\">Another issue: it is not clear\u2014though it seems incredibly likely\u2014that embedding vectors produced by LLMs have the same semantic meaning as embedding vectors used by other kinds of AI. Harald Steck, Chaitanya Ekanadham &amp; Nathan Kallus, <em>Is Cosine-Similarity of Embeddings Really About Similarity?<\/em>, 2024 <span style=\"font-variant: small-caps\">Proc. ACM Web Conf. <\/span>887 (\u201ccosine-similarity can yield arbitrary and therefore meaningless \u2018similarities.\u2019\u201d). <em>Compare with<\/em> <em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 458 n.24 (citing to papers about embedding distances generated using non-LLM technologies).<\/span> They need a principled framework to pick one implementation over another, but <em>Generative Interpretation<\/em> does not supply one. None of the LLM proponents does.<\/p>\n<p>E. <em>Methodological Instability<br \/><\/em><\/p>\n<p>There is a larger issue here. As computer scientist Andrew Tanenbaum quipped, \u201cThe nice thing about standards is that there are so many to choose from. 
And if you do not like any of them, just wait a year or two.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"176\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-176\">176<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-176\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"176\"><span style=\"font-variant: small-caps\">Andrew S. Tanenbaum &amp; David J. Wetherall, Computer Networks 702 <\/span>(5th ed. 2011).<\/span> The same is true of generative interpretation. It is not just that there are unsettled details that affect the results of a commonly used interpretive method. It is that the <em>interpretive method<\/em> <em>itself<\/em> is unsettled. \u201cGenerative interpretation\u201d does not refer to a specific approach to using LLMs to answer a particular type of judicial question. Rather, it is a family name for a disparate collection of methods, many of which have in common that they make use of LLMs. 
Indeed, these methods diverge dramatically in terms of how to set up the query and how to interpret the results.<\/p>\n<p>As we saw in Part I, generative interpretation\u2019s proponents disagree on:<\/p>\n<p style=\"padding-left: 40px\">\u2022 Whether to ask discrete interpretive questions (Arbel and Hoffman, Judge Newsom, Judge Deahl) or to resolve entire disputes (Unikowsky, Kieffaber).<br \/>\u2022 Whether to prompt the LLM with a short question in isolation (Judge Newsom, Judge Deahl) or to give it as many materials pertaining to a dispute as possible (Arbel and Hoffman, Unikowsky, Kieffaber).<br \/>\u2022 Whether to interpret the LLM by using technical tools like word probabilities and cosine distances (Arbel and Hoffman) or by reading its outputs as natural language (Judge Newsom, Judge Deahl, Unikowsky, Kieffaber).<\/p>\n<p>These are not small differences. There is no strong reason to expect that they will all yield similar results, and no strong reason to think one of them is obviously better than another. Indeed, even within this taxonomy, each box contains multitudes.<\/p>\n<p>Consider, for example, the four different ways in which <em>Generative Interpretation<\/em>\u2019s case studies produce numerical estimates of interpretive meaning.<\/p>\n<p><em>Method 1: Character Values as Probabilities<\/em><\/p>\n<p>As their first example, Arbel and Hoffman prompt ChatGPT with text from an insurance policy and instruct the chatbot,<\/p>\n<p style=\"padding-left: 40px\">please state your prediction\u2014with the associated numerical level of confidence in parentheses\u2014on the likely expectations of most policyholders under these terms for the following proposition[]: 1. 
The policy will provide compensation for losses resulting from a substantiated third-party burglary.<sup class=\"modern-footnotes-footnote \" data-mfn=\"177\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-177\">177<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-177\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"177\"><em>Generative Interpretation<\/em>, <em>supra<\/em> note 2, at 475 n.129.<\/span><\/p>\n<p>ChatGPT replies, \u201c1. Likely Expectation (90%): The policy will compensate for third-party burglary.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"178\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-178\">178<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-178\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"178\"><em>Id.<\/em><\/span> Arbel and Hoffman assert that this response constitutes \u201cChat GPT-4 [telling] us that it was 90% likely that the policy would pay in response to a \u2018substantiated third-party burglary.\u2019\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"179\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-179\">179<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-179\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"179\"><em>Id. 
<\/em>at 476.<\/span> As they explain, \u201c90%\u201d is the string of text that ChatGPT predicted would appear at that particular place in its response.<sup class=\"modern-footnotes-footnote \" data-mfn=\"180\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-180\">180<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-180\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"180\"><em>Id.<\/em> at 482.<\/span> It reflects an AI model\u2019s determination about the string of <em>characters<\/em> that was most likely to be produced in a written response to that question. It does not reflect a direct, <em>numerical<\/em> assessment of the probability \u201cthat the policy would pay in response to a \u2018substantiated third-party burglary.\u2019\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"181\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-181\">181<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-181\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"181\"><em>Id. <\/em><\/span><\/p>\n<p><em>Method 2: Floating-Point Values as Probabilities<\/em><\/p>\n<p>For its next example, <em>Generative Interpretation<\/em> invokes a number with an entirely different significance. This is the <em>Famiglio<\/em> case study discussed above, in which the authors extract a probability from the internals of GPT\u2019s word-selection algorithm. 
Given the choice between the words \u201cfirst\u201d and \u201csecond,\u201d Arbel and Hoffman find the AI will select the word \u201csecond\u201d roughly 94.72% of the time (and our attempt to replicate this result with a newer LLM yielded a 54.16% chance of \u201cfirst\u201d).<sup class=\"modern-footnotes-footnote \" data-mfn=\"182\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-182\">182<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-182\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"182\"><em>Id. <\/em>at 484.<\/span> From this example, Arbel and Hoffman conclude, \u201cGenerative interpretation . . . thus offers courts a better sense of the relevant probabilities if the parties were intending to use English in its most public and common sense.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"183\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-183\">183<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-183\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"183\"><em>Generative Interpretation<\/em>, <em>supra<\/em> note 2, at 482.<\/span> We take them to be arguing that the different \u201cprobability\u201d values that the model assigns to \u201csecond\u201d versus \u201cfirst\u201d constitute \u201crelevant probabilities\u201d for ascertaining the meaning of the prenuptial agreement.<sup class=\"modern-footnotes-footnote \" data-mfn=\"184\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" 
aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-184\">184<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-184\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"184\">Arbel and Hoffman also caution that \u201cthe probabilities shouldn\u2019t be interpreted literally. The model could, for example, continue the sentence with \u2018The first filing would not control.\u2019\u201d <em>Id.<\/em> at 485 n.165. We interpret this warning to mean that the relative probabilities of \u201csecond\u201d and \u201cfirst\u201d in their example do not necessarily reflect the respective probabilities that the model would have generated complete sentences that endorse each respective reading. This recognition seems appropriate, but acknowledging it undermines the probative value of those numbers.<\/span><\/p>\n<p>The numbers that Arbel and Hoffman are evaluating in this prenup example differ dramatically from the numbers that they evaluated in the earlier insurance-policy example. In the insurance-policy example, they evaluated the string \u201c90%\u201d that the model\u2019s text-prediction engine produced as a response to their written query. In the <em>Famiglio<\/em> case study, they evaluate the model\u2019s internal estimate of the probability that the text \u201csecond\u201d should follow the text \u201cThe\u201d in its response. <em>These numbers are the results of completely different processes<\/em>. 
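<\/p>
<p>To see where the second kind of number comes from, consider how a language model selects a token: it assigns every candidate token a raw score, and a softmax function turns those scores into a probability distribution. A minimal sketch with made-up scores (nothing here is extracted from any actual model):<\/p>

```python
import math

# Hypothetical raw scores ("logits") for two candidate next tokens. These
# values are invented for illustration; real model internals differ.
logits = {"second": 4.0, "first": 1.0}

def softmax(scores):
    """Convert raw token scores into a probability distribution."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

probs = softmax(logits)
# probs["second"] is a probability over *tokens*, a different kind of object
# from a string like "90%" that a model might type into its visible answer.
```

<p>Figures like the 94.72% and 54.16% quoted above are numbers of this kind: token-level probabilities read out of the sampling machinery, not statements the model wrote.<\/p>
<p>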
We looked \u201cunder the hood\u201d at the responses ChatGPT gave to the authors\u2019 first question about the meaning of the insurance policy and found that the model assigned an approximate probability of .5416 to \u201c90%,\u201d a probability of .013 to \u201c0%,\u201d and a probability of .001 to \u201cyes.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"185\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-185\">185<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-185\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"185\">Generated using the OpenAI chat completions API, using GPT-4 with recommended settings. Code available at https:\/\/gist.github.com\/davidbstein\/f95d908344205ef583d1f57392ace76f [https:\/\/perma.cc\/L9WL-34AY].<\/span><\/p>\n<p>There may or may not be some basis on which to ascribe legally relevant meaning to either the text \u201c90%\u201d from the first example or the number from the second example. How should courts understand and differentiate between the \u201c90%\u201d from the first example and the .9472 from the second example (or the .5416 from our replication attempt)?<sup class=\"modern-footnotes-footnote \" data-mfn=\"186\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-186\">186<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-186\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"186\">For example, how should the inherent biases in number-as-word selection inform interpretation of those numbers? 
If you ask GPT-3.5 to fill in the blank on \u201cRating = __ \/ 100\u201d with no additional context, about half the time it will pick one of: <em>50<\/em> (p=.1488), <em>0<\/em> (p=.1220), <em>10 <\/em>(p=.0667), <em>90<\/em> (p=.0558), or <em>60<\/em> (p=.0542). Computed in the OpenAI \u201cplayground\u201d using the gpt-3.5-turbo-instruct model on default settings.<\/span> Both can be notated as numerical percentages, but they are produced by vastly different processes.<\/p>\n<p><em>Method 3: A Different Method for Characters as Probabilities <\/em><\/p>\n<p>The third method returns to textual percentages. The authors pass the entire text of the promissory note from <em>Trident Center v. Connecticut General Life Ins. Co.<sup class=\"modern-footnotes-footnote \" data-mfn=\"187\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-187\">187<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-187\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"187\"><\/em>847 F.2d 564 (9th Cir. 
1988).<\/span> to an LLM with the instructions: \u201cRate the following proposition on a 1-100 scale, where 0 is wrong and 100 is correct\u2026[:] The language is REASONABLY susceptible to being read as providing the borrower the right to early prepayment.\u201d They repeat this process 100 times for each of three popular chatbots and report the results in a chart (reproduced in Figure 8(a)).<sup class=\"modern-footnotes-footnote \" data-mfn=\"188\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-188\">188<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-188\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"188\">Taken from the source code accompanying <em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, available at https:\/\/github.com\/yonathanarbel\/generativeinterpretation\/blob\/67dc982ec69aa383d3c2422dda55294af68158a7\/Scripts\/Trident.ipynb [https:\/\/perma.cc\/A8A4-JYWC].
Input Box 19, line 9.<\/span> They conclude \u201cthe models roughly agree on average that prepayment is not allowed, with a mean score of ~41.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"189\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-189\">189<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-189\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"189\"><em>Generative Interpretation<\/em>,<em> supra <\/em>note 2, at 488.<\/span><\/p>\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><a href=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-24-at-2.37.23-PM.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"1386\" height=\"956\" data-attachment-id=\"4658\" data-permalink=\"https:\/\/journals.law.harvard.edu\/jol\/2026\/01\/24\/generative-misinterpretation\/screenshot-2026-01-24-at-2-37-23-pm\/\" data-orig-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-24-at-2.37.23-PM.png?fit=1386%2C956&amp;ssl=1\" data-orig-size=\"1386,956\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Screenshot 2026-01-24 at 2.37.23\u202fPM\" data-image-description=\"\" data-image-caption=\"\" 
data-medium-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-24-at-2.37.23-PM.png?fit=300%2C207&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-24-at-2.37.23-PM.png?fit=1024%2C706&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-24-at-2.37.23-PM.png?resize=1386%2C956&#038;ssl=1\" alt=\"\" class=\"wp-image-4658\" style=\"width:682px;height:auto\" srcset=\"https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-24-at-2.37.23-PM.png?w=1386&amp;ssl=1 1386w, https:\/\/i0.wp.com\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/Screenshot-2026-01-24-at-2.37.23-PM.png?resize=300%2C207&amp;ssl=1 300w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/a><\/figure>\n\n\n<p style=\"text-align: center\">Figure 8: Results from the Generative Interpretation<em>\u2019s<\/em> Trident<em> case study. (a) Above: the estimated \u201cdensity\u201d of responses, as presented in the original paper. (b) Below: the same data, displayed as a box-and-whisker chart. We include the second chart to make the data points (black dots) and averages (red lines) explicit.<sup class=\"modern-footnotes-footnote \" data-mfn=\"190\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-190\">190<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-190\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"190\">Id. 
(asking models to rate whether prepayment is allowed on a scale of 0-100, where 0 means prepayment is definitely not allowed, and 100 means prepayment is definitely allowed). The curves shown in Generative Interpretation are computed using a method called kernel density estimation. As part of that method, the authors needed to choose among several bandwidth \u201cestimators,\u201d the function used to estimate density based on a collection of datapoints. Different choices of estimator produce significantly different charts from the same data. See David M. Bashtannyk &amp; Rob J. Hyndman, Bandwidth Selection for Kernel Conditional Density Estimation, 36 <span style=\"font-variant: small-caps\">Computational Stat. &amp; Data Analysis<\/span> 279, 279 (2001) (reviewing methods for picking a bandwidth function). See also Part III, infra, discussing minor implementation details that affect interpretive results.<\/span><\/em><\/p>\n<p><em>Method 4: From \u201cYes\/No\u201d to Numerical Distributions<\/em><\/p>\n<p>In a fourth example, Arbel and Hoffman present a contract to several LLMs and solicit yes\/no answers about the meaning of a clause.
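<\/p>
<p>The bandwidth sensitivity flagged in the footnote to Figure 8 is easy to see directly. The following sketch is illustrative only: it uses synthetic scores rather than the Article\u2019s actual data (which live in its accompanying notebook), and it shows simply that the same 100 datapoints yield visibly different \u201cdensity\u201d curves depending on the analyst\u2019s choice of bandwidth:<\/p>

```python
# Kernel density estimation (KDE) bandwidth sensitivity, on synthetic data.
# The scores below are stand-ins for 100 model ratings on a 0-100 scale
# (mean near 41, echoing the Trident example); they are NOT the Article's data.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
scores = np.clip(rng.normal(41, 15, size=100), 0, 100)

grid = np.linspace(0, 100, 201)
narrow = gaussian_kde(scores, bw_method=0.1)(grid)  # undersmoothed: spiky
default = gaussian_kde(scores)(grid)                # Scott's rule (SciPy default)
wide = gaussian_kde(scores, bw_method=1.0)(grid)    # oversmoothed: flattened

# Same data, three different curves: even the peak heights differ substantially.
print(narrow.max(), default.max(), wide.max())
```

<p>Nothing in the data dictates which curve is \u201cthe\u201d density; the bandwidth estimator is the analyst\u2019s choice, which is precisely the discretion at issue.<\/p>
<p>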
They pose 20 linguistic \u201cvariations of the same legal question,\u201d a \u201cyes\/no question[] where yes indicates agreement with the judge\u2019s interpretation,\u201d and graph the results.<sup class=\"modern-footnotes-footnote \" data-mfn=\"191\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-191\">191<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-191\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"191\"><em>Generative Interpretation<\/em>,<em> supra <\/em>note 2, at 490.<\/span> Here, the authors are not instructing the model to respond with its confidence, nor are they looking under the hood at the probability of the model picking a certain response. Instead, this number is the relative frequency with which a model provides a \u201cyes\u201d or \u201cno\u201d answer. From a technical perspective, this method is mathematically equivalent to measuring the number from method 2 for each of the 20 questions, then averaging the results and adding a large dose of random noise.<sup class=\"modern-footnotes-footnote \" data-mfn=\"192\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-192\">192<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-192\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"192\">The choice between \u201cyes\u201d and \u201cno\u201d is technically equivalent to the choice between \u201cfirst\u201d and \u201csecond\u201d in <em>Famiglio<\/em>: a binary selection between two possible words. We could have looked \u201cunder the hood\u201d to get the underlying probabilities for each phrasing.
Instead, for each phrasing, we use the result of a single weighted coin toss based on those probabilities. Even taking this number at face value, it is troubling: would we ascribe the same meaning to a human interlocutor who answered an identical question differently when polled multiple times in quick succession, or would we simply find their response incoherent? <em>Cf.<\/em> Thomas R. Lee &amp; Jesse Egbert, <em>Artificial Meaning?<\/em>, 77 <span style=\"font-variant: small-caps\">Fla. L. Rev.<\/span> (forthcoming) (manuscript at 43), https:\/\/papers.ssrn.com\/abstract=4973483 [https:\/\/perma.cc\/42P3-UHDW] (observing that variation in LLM responses \u201cis not indicative of the variability we would observe between different individuals in a larger population\u2014just the variability we would see between responses from a single person who is repeatedly asked the same question.\u201d).<\/span><\/p>\n<p>* * *<\/p>\n<p>We do not have a definitive opinion about which of these four methods is best.<sup class=\"modern-footnotes-footnote \" data-mfn=\"193\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-193\">193<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-193\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"193\">Though method 4 seems strictly worse than method 2.<\/span> Nor do we fault Arbel and Hoffman for exploring numerous different approaches. That is exactly what exploratory early work needs to do. Our point is that an embarrassment of riches can still be an embarrassment. Any one generative interpretation methodology raises difficult reliability problems on its own. Adding three more does not resolve those problems; it multiplies them. It may be that one or more of these methods can be made robust and legitimate. 
But a judge considering generative interpretation must still choose which ones to try, and that by itself is an entire can of worms.<\/p>\n<h4 style=\"text-align: center\">V. <span style=\"font-variant: small-caps\">Is Generative Interpretation Epistemically Justified?<br \/><\/span><\/h4>\n<p>In this Part, we turn to the epistemic problem: should judges accept LLM outputs as providing authoritative guidance on legal issues? For generative interpretation, those issues involve the linguistic meaning of disputed texts. For generative adjudication, those issues involve the proper resolution of disputed legal issues. But in both cases, the epistemic question is the same. What makes <em>this<\/em> measure relevant to deciding <em>that<\/em> issue? To assert that an LLM output \u201caccurately reflects real people\u2019s everyday speech patterns,\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"194\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-194\">194<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-194\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"194\">United States v. Deleon, 116 F.4th 1260, 1270 (11th Cir. 2024) (Newsom, J., concurring). To be clear, Newsom goes on to provide reasons, recognizing that this is a proposition that must be demonstrated, not simply assumed. As we detail in this Part, our point of divergence is that we think the reasons he gives do not warrant the conclusion that LLM outputs \u201caccurately reflect[]\u201d ordinary usage. <em>Id.<\/em><\/span> without more, is to beg the question. 
<em>Why would<\/em> the outputs of a complicated algorithm necessarily reflect <em>anything<\/em>, let alone \u201ceveryday speech patterns?\u201d Something more is needed to show that <em>this<\/em> algorithm, run in <em>this<\/em> way on <em>this <\/em>input, yields \u201ccorrect\u201d or \u201caccurate\u201d answers to the question being asked.<\/p>\n<p>This is a general challenge for empirical methods; it goes by many names, including \u201cconstruct validity.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"195\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-195\">195<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-195\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"195\"><em>See <\/em>Dag I. K. Sj\u00f8berg &amp; Gunnar R. Bergersen, <em>Construct Validity in Software Engineering<\/em>, <span style=\"font-variant: small-caps\">49 IEEE Transactions Software Eng\u2019g<\/span> 1374, 1374 (2023).<\/span> A method may be consistent and robust enough to measure <em>something<\/em>, but that <em>something<\/em> may not be the question the experimenter hopes to answer. A moment\u2019s thought shows that establishing construct validity requires attending to the details of not just the method but also the question.
The Sch\u00f6nhage-Strassen algorithm multiplies large numbers significantly faster than the algorithm taught in elementary school; it is an excellent source of knowledge about the products of integers, but it tells us nothing about ordinary English usage.<\/p>\n<p>LLM proponents have offered, we think, four broad types of responses to the epistemic problem: they have tried to justify the use of LLMs based on <em>how they work<\/em>, based on <em>calibrating<\/em> their outputs against an accepted baseline, based on the rhetorical <em>persuasiveness<\/em> of their outputs, and based on a claim that they offer <em>predictability<\/em> as such. All four types are intuitively appealing, and three of them can be sound ways of establishing a method\u2019s epistemic <em>bona fides<\/em>. But we think that for LLMs as a source of legal meaning, the case is unproven. LLM proponents\u2019 claims depend on a mix of unstated assumptions, mistaken beliefs about how LLMs work, and misunderstandings of the judicial function.<\/p>\n<p style=\"text-align: center\">A. <em>Relying on the Training Process?<br \/><\/em><\/p>\n<p>Start by considering the class of arguments that an LLM should be trusted because of what it is: a statistical model trained on a corpus of natural-language text. The problem here is that these are explanations of why an LLM <em>could<\/em> work at the task of capturing common linguistic usage, not explanations of why a specific LLM <em>actually does<\/em> capture common linguistic usage in a specific setting.<\/p>\n<p>1.
<em>Deductive Arguments<\/em><\/p>\n<p>The strongest argument for a decisional legal AI system would be to validate the system <em>deductively<\/em>, using the tools of formal logic to craft an abstract model of law and formal verification to show that the AI system correctly implements that abstract model.<sup class=\"modern-footnotes-footnote \" data-mfn=\"196\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-196\">196<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-196\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"196\"><em>See generally<\/em> James Grimmelmann, <em>Programming Languages and Law: A Research Agenda<\/em>, 2022 <span style=\"font-variant: small-caps\">Proc. Symposium Comput. Sci. &amp; L.<\/span> 1 (discussing formal approaches to law).<\/span> Indeed, there have been numerous previous attempts to bring software-based deductive rigor to legal reasoning, with mixed success.<sup class=\"modern-footnotes-footnote \" data-mfn=\"197\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-197\">197<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-197\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"197\"><em>See, e.g.<\/em>, L. Thorne McCarty, <em>Reflections on TAXMAN: An Experiment In Artificial Intelligence and Legal Reasoning<\/em>, <span style=\"font-variant: small-caps\">90 Harv. L. Rev<\/span>. 837 (1976). 
Note that McCarty, writing nearly 50 years ago, already correctly identified the ambiguities and open texture of law as key challenges to its formalization.<\/span> To validate generative interpretation deductively would require demonstrating, <em>a priori<\/em>, that the internal workings of the LLMs <em>necessarily <\/em>lead them to estimate linguistic meaning to some degree of accuracy.<\/p>\n<p>Deduction has the virtue of generating conclusions that are guaranteed to be true in specific instances. Of course, the conclusion of a valid deductive proof is only as strong as its assumptions. If those assumptions hold, the conclusion always follows. For example, say we want to know whether a two-sided coin is fair. If we assume that the coin is infinitesimally thin and perfectly symmetrical, that it is flipped with a fixed upward velocity <em>v<\/em> and an angular velocity drawn from a known distribution <em>X<\/em> at a fixed distance <em>d <\/em>above the ground, that it is unaffected by air resistance, and that it collides inelastically with the ground when it lands, then for suitable choices of <em>v<\/em>, <em>X<\/em>, and <em>d<\/em>, we can establish that this method for tossing coins is necessarily fair. The probability that it comes up heads on any given toss is equal to the probability that it comes up tails.<\/p>\n<p>But proving LLMs\u2019 legal-interpretive authority isn\u2019t like positing the fairness of a coin, for two reasons. First, while \u201cfairness\u201d in coin-flipping has a straightforward definition\u2014an independent, 50-50 expectation of heads or tails for any given flip\u2014\u201caccuracy\u201d in legal interpretation has no equivalent formal specification. The second reason follows from the first. Because legal-interpretive accuracy is underspecified, we don\u2019t know what factual premises about LLMs would have to hold in order for them to be accurate.
By contrast, because a coin\u2019s fairness is easily formalized, we <em>can<\/em> posit the factual properties that characterize a fair coin. Put another way, there are two things that stand in the way of using LLMs to reason deductively about law: the nature of law and the nature of LLMs.<\/p>\n<p>Even assuming a deductive model of legal interpretation is reconcilable with law as a social enterprise, adopting such a model would require transformational changes to present-day attitudes. Indeed, even generative interpretation proponents seem to reject a deductive approach. In <em>Snell<\/em>, Judge Newsom wrote, \u201c[w]ould the consideration of LLM outputs in interpreting legal texts inevitably put us on some dystopian path toward \u2018robo judges\u2019 algorithmically resolving human disputes? I don\u2019t think so. As Chief Justice Roberts recently observed, the law will always require \u2018gray area[ ]\u2019 decision-making that entails the \u2018application of human judgment.\u2019\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"198\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-198\">198<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-198\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"198\">Snell v. United Specialty Ins. Co., 102 F.4th 1208, 1232 (11th Cir. 
2024) (Newsom, J., concurring).<\/span><\/p>\n<p>Setting aside whether law can ever be a deductive enterprise, LLMs are extraordinarily poor vehicles for deductive validation.<sup class=\"modern-footnotes-footnote \" data-mfn=\"199\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-199\">199<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-199\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"199\"><em>Cf.<\/em> E. Mark Gold, <em>Language Identification in the Limit<\/em>, 10 <span style=\"font-variant: small-caps\">Info. &amp; Control<\/span> 447 (1967) (formally proving that it is impossible to learn something perfectly and completely through example-based processes like machine learning).<\/span> The origins of LLMs trace back to a decision to dispense with logical rigor and precisely curated knowledge bases, and to rely instead on recognizing patterns in massive datasets.<sup class=\"modern-footnotes-footnote \" data-mfn=\"200\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-200\">200<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-200\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"200\"><em>See<\/em> Benjamin L.W. Sobel, <em>Artificial Intelligence\u2019s Fair Use Crisis<\/em>, 41 <span style=\"font-variant: small-caps\">Colum. J.L. &amp; Arts<\/span> 45, 58 (2017).<\/span> Today\u2019s LLMs did not develop because computer scientists deduced that a particular technical architecture would, as a matter of logical necessity, produce AI with remarkable text-generating abilities.
Instead, they came about when engineers decided to run with the techniques that produced the best results, even when the reasons these techniques work so well were (and, in some cases, are) yet to be discovered.<sup class=\"modern-footnotes-footnote \" data-mfn=\"201\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-201\">201<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-201\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"201\"><em>See generally<\/em> Alon Halevy, Peter Norvig &amp; Fernando Pereira, <em>The Unreasonable Effectiveness of Data<\/em>, 24 <span style=\"font-variant: small-caps\">IEEE Intelligent Sys. 8 (<\/span>2009).<\/span> As Arbel and Hoffman note in <em>Generative Interpretation<\/em>: \u201cwhatever [an LLM] tells you, it is really no explanation at all . . . working with LLMs admittedly requires a leap of faith, a realization that no better explanation is forthcoming.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"202\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-202\">202<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-202\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"202\"><em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 483.<\/span><\/p>\n<p>Kieffaber deals with these problems simply by assuming them away. 
One of the six premises of his \u201csci-fi-hypothetical\u201d is that \u201cJudge.AI is a perfectly neutral arbiter and interprets words with <em>perfect mathematical accuracy<\/em>.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"203\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-203\">203<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-203\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"203\">Kieffaber, <em>supra<\/em> note 116, at 7 (emphasis added).<\/span> This assumption serves as an intuition pump for his thought experiment about the nature of textualism,<sup class=\"modern-footnotes-footnote \" data-mfn=\"204\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-204\">204<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-204\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"204\">One might object that treating natural language as susceptible to \u201cperfect mathematical\u201d interpretation not only assumes away the debate between textualism and its critics but also distorts beyond recognition the texts that textualists understand themselves to be interpreting.<\/span> but it tells us nothing about actually existing LLMs. Arbitrus.AI is an actual system, and while Kieffaber and his coauthors are enthusiastic about its abilities, they do not attempt to validate it deductively. They don\u2019t even detail their model architecture, training algorithm, training datasets, or prompts, let alone show that these particular design choices are logically guaranteed to produce correct outputs. To state the idea is to refute it. 
LLMs are not spherical cows.<sup class=\"modern-footnotes-footnote \" data-mfn=\"205\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-205\">205<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-205\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"205\"><em>See<\/em> David Kaiser, <em>The Sacred, Spherical Cows of Physics<\/em>, <span style=\"font-variant: small-caps\">Nautilus<\/span> (Apr. 25, 2014), https:\/\/nautil.us\/the-sacred-spherical-cows-of-physics-234898\/ [https:\/\/perma.cc\/F3ZZ-FYWR].<\/span><\/p>\n<p>2. <em>Informal Arguments<\/em><\/p>\n<p>In practice, most LLM proponents offer a related but much weaker argument. As Unikowsky puts it:<\/p>\n<p style=\"padding-left: 40px\">To me, the proposition \u201cAI is useful for determining the ordinary meaning of English words\u201d should be approximately as controversial as \u201cGPS is useful for determining directions.\u201d Cutting-edge LLMs have read the entire Internet, or almost all of it anyway. Of course they\u2019ll be useful in determining how words are ordinarily used.<sup class=\"modern-footnotes-footnote \" data-mfn=\"206\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-206\">206<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-206\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"206\"><em>In AI We Trust I<\/em>, <em>supra<\/em> note 5.<\/span><\/p>\n<p>Similarly, Judge Newsom writes, \u201cLLMs can be expected to offer meaningful insight into . . . ordinary meaning . . . because the internet data on which they train contain <em>so many<\/em> uses . . . 
, from <em>so many<\/em> different sources . . . .\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"207\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-207\">207<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-207\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"207\">Snell v. United Specialty Ins. Co., 102 F.4th 1208, 1227 (11th Cir. 2024) (Newsom, J., concurring).<\/span><\/p>\n<p>These are not deductive proofs that the way LLMs are created means they <em>necessarily<\/em> provide accurate information on semantic meaning. Instead, they are arguments that the way LLMs are created gives us <em>good reason to think<\/em> that they provide information on semantic meaning.<\/p>\n<p>Consider Unikowsky\u2019s GPS analogy. The GPS system uses a set of satellites with precisely calibrated clocks that transmit synchronized radio signals; a receiver uses the timing information from signals from multiple satellites to calculate its position relative to them\u2014and thus its position on the surface of the Earth.<sup class=\"modern-footnotes-footnote \" data-mfn=\"208\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-208\">208<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-208\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"208\"><em>See How GPS Works<\/em>, <span style=\"font-variant: small-caps\">GPS.Gov<\/span>, https:\/\/www.gps.gov\/sites\/default\/files\/2025-08\/Educational_Poster.pdf [https:\/\/perma.cc\/UVM4-DSF7].<\/span> Given the design of the system\u2014and the relevant orbital mechanics, wave physics, electronics, and so 
on\u2014we have good reason to think that the calculation carried out by a particular receiver in a particular place at a particular moment is accurate. Similarly, given some reasonable assumptions about coins\u2019 weight, dimensions, and initial flipping characteristics, we might have good reason to think that particular coin flips are likely to be close to fair.<sup class=\"modern-footnotes-footnote \" data-mfn=\"209\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-209\">209<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-209\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"209\"><em>But see<\/em> Persi Diaconis, Susan Holmes &amp; Richard Montgomery, <em>Dynamical Bias in the Coin Toss<\/em>, 49 <span style=\"font-variant: small-caps\">SIAM Rev.<\/span> 211, 211 (2007) (\u201cWe show that vigorously flipped coins tend to come up the same way they started.\u201d).<\/span><\/p>\n<p>Once again, however, it is a much heavier lift for generative interpretation than for other methods, because LLMs are much more complicated than a coin or a GPS receiver, and generative interpretation asks LLMs to resolve far more complicated problems. The argument that LLMs work because they have been exposed to almost \u201cthe entire Internet\u201d proves far too much. 
The Google Books corpus was assembled by scanning millions of hard-copy books.<sup class=\"modern-footnotes-footnote \" data-mfn=\"210\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-210\">210<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-210\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"210\">Authors Guild v. Google, Inc., 804 F.3d 202, 208 (2d Cir. 2015).<\/span> The scanners that Google used have \u201cread\u201d millions of books; does it follow that \u201cthey\u2019ll be useful in determining how words are ordinarily used?\u201d With enough time, you could tap the entire corpus that trained ChatGPT in binary into the keys of a calculator. You could place a tortoise in front of a speaker broadcasting the entire corpus as audio. At the end of these (interminable) processes, the calculator will have \u201cread\u201d the same data that ChatGPT \u201cread,\u201d and the tortoise will have apprehended it, at least in the sense that they will have been presented with that information. Yet neither the calculator, nor the tortoise, nor the scanner will be a greater authority on ordinary meaning than it was before these indoctrination processes.<\/p>\n<p>The same point holds regardless of whether the claim is phrased in terms of LLMs\u2019 accuracy, their legitimacy, their adherence to democratic norms, or any other desideratum. There is nothing about \u201ctraining on the entire Internet\u201d that implies the output will satisfy any particular criteria. The calculator and the tortoise were trained on the entire Internet too; that doesn\u2019t make them democratically legitimate, either.<\/p>\n<p>Mere exposure to large amounts of natural-language text does not automatically confer authority about linguistic meaning. 
Indeed, the road to ChatGPT is littered with the corpses of generative AIs that were emphatically terrible authorities on ordinary meaning.<sup class=\"modern-footnotes-footnote \" data-mfn=\"211\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-211\">211<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-211\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"211\"><em>See generally<\/em> <span style=\"font-variant: small-caps\">Janelle Shane, You Look Like a Thing and I Love You: How Artificial Intelligence Works and Why It\u2019s Making the World a Weirder Place<\/span> (2019) (extensively describing AIs\u2019 humorously bad attempts to emulate human writing).<\/span> It was not that OpenAI\u2019s model-training methods were <em>a priori<\/em> better than all that had come before in a way that could have been expected to yield far more reliable results. Instead, a series of advances in model architecture, feasible model size, and large, high-quality training datasets came together to produce a model with demonstrably\u2014but very surprisingly\u2014better <em>a posteriori<\/em> performance.<\/p>\n<p>In other words, \u201cit trained on the entire Internet\u201d is a plausible theory of why an LLM that usually generates fluent text does so, but does nothing to demonstrate that it will. What makes it reasonable to think that an LLM will usually generate fluent text is <em>that it has generated fluent text<\/em>. The demonstration is fundamentally empirical.<\/p>\n<p>Now, to be sure, the fact that ChatGPT-3.5 generates fluent text also makes it more plausible that ChatGPT-4 will do so too\u2014they are trained in similar ways on similar sources of data. 
Similarly, ChatGPT\u2019s fluency can transfer, in part, to Gemini, to Claude, and so on\u2014it has become widely accepted knowledge that certain types of LLM architectures and training tend to work reasonably well. But notice that while we can appeal to \u201chow Claude was trained\u201d to validate Claude\u2019s fluency by drawing on ChatGPT\u2019s, this appeal still rests on the empirical demonstration of ChatGPT\u2019s fluency. The appeal to how LLMs are trained adds no new and additional justification; it just provides a way of organizing and marshalling the empirical evidence from various LLMs\u2019 usage. And so, it is to those empirical justifications that we now turn.<\/p>\n<p style=\"text-align: center\">B. <em>Relying on Empirical Measurements?<br \/><\/em><\/p>\n<p>The second way one might validate generative interpretation is <em>empirically<\/em> demonstrating the integrity of a particular methodology for querying LLMs. Instead of deductively or informally arguing <em>a priori <\/em>that LLMs are necessarily or likely to be accurate, this approach demonstrates inductively on the basis of evidence that a particular process for using LLMs produces correct answers consistently, predictably, and robustly. If satisfied, this inquiry justifies trusting LLM outputs produced through that validated process. If an LLM consistently produces accurate information via a specified method in representative testing scenarios, then subsequent uses following the same method in similar scenarios are probably trustworthy\u2014and we may be able to quantify that probability.<\/p>\n<p>Returning to our two-sided coin: if we toss a coin 400 times and it comes up heads 199 times and tails 201 times, we are now 95% confident the coin is fair within a 5% margin of error when tossed in the same way as the experiment. Additional experimentation could increase confidence and reduce the margin of error. 
Though empirical tests cannot provide the perfect confidence of a deductive proof, they can provide strong statistical evidence that a certain process has a tested property.<\/p>\n<p><em>1. Internal Robustness<\/em><\/p>\n<p>But what would it actually take to show that a given LLM provides accurate linguistic interpretations? The coin has the advantage that the problem can be specified finitely and precisely: there is a single \u201cflip\u201d process with exactly two possible outcomes, and we seek to learn their respective probabilities. The problem of generative interpretation is significantly more open-ended, and the LLM is a significantly more complicated object of study.<\/p>\n<p>Recall that an LLM is a prediction machine that uses whatever it can find in a textual prompt to make accurate predictions about what comes next. That might be the semantic meaning of the term it has been asked to gloss. Or it might be whether the prompt author asked the LLM to \u201cexplore\u201d an issue or to \u201cdelve\u201d into it.<sup class=\"modern-footnotes-footnote \" data-mfn=\"212\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-212\">212<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-212\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"212\"> <em>See, e.g.<\/em>, Tom S. Juzek &amp; Zina B. Ward, <em>Why Does ChatGPT \u201cDelve\u201d So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models<\/em>, 31 <span style=\"font-variant: small-caps\">Proc. Int\u2019l Conf. on 
Computational Linguistics<\/span> 6397 (2025).<\/span> If authors who eschew contractions are more conservative and sesquipedalian, then a prompt that eschews contractions is more likely to elicit an output that not only eschews contractions, but also reads in a conservative and sesquipedalian register in other ways. While it is likely (though unproven) that insights about general language understanding are tucked away somewhere inside LLMs\u2019 internal structures, that information is intertwined with myriad other factors that can help the predictive task.<sup class=\"modern-footnotes-footnote \" data-mfn=\"213\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-213\">213<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-213\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"213\">For an accessible discussion exploring what LLMs \u201cknow,\u201d see Kenneth Li, <em>Do Large Language Models learn world models or just surface statistics?<\/em>, <span style=\"font-variant: small-caps\">The Gradient<\/span> (Jan. 21, 2023), https:\/\/thegradient.pub\/othello\/ [https:\/\/perma.cc\/Q4JB-NK4V], and the underlying paper, Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Vi\u00e9gas, Hanspeter Pfister &amp; Martin Wattenberg, <em>Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task<\/em>, 2023 <span style=\"font-variant: small-caps\">Proc. Int\u2019l Conf. on Learning Representations 1<\/span>.<\/span> Currently, there is no easy way to discern which information an LLM uses to answer questions. It might use the information courts care about. It might use some unrepresentative subset of that information. It might rely on some inscrutable blend of factors that courts would consider irrelevant. 
These are null hypotheses: possibilities that cannot be dismissed <em>a priori<\/em>, but only ruled out by testing.<\/p>\n<p>In short, we are completely dependent on running actual studies to generate the kind of empirical evidence needed to validate LLMs\u2019 use for generative interpretation. The LLM proponents have started on this important task, some more rigorously (Arbel and Hoffman) and some more informally (Judge Newsom, Judge Deahl, and Unikowsky). And this is exactly right\u2014a journey of a thousand miles begins with a single step.<\/p>\n<p>But as we showed in the previous Part, the empirical evidence to date falls well short of showing that generative interpretation really is justified. In particular, we documented that existing methods are not robust. Switching models can flip the direction of a result; small changes to queries produce large variations in the responses.<sup class=\"modern-footnotes-footnote \" data-mfn=\"214\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-214\">214<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-214\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"214\"><em>See<\/em> Choi,<em> supra<\/em> note 134, at 20\u201330 (discussing sensitivity to prompts); <em>id.<\/em> at 30\u201336 (discussing wide variability between LLMs).<\/span> If protocols <em>A<\/em> and <em>B<\/em> to measure phenomenon <em>X <\/em>are identical except in some small respect that has no apparent connection with <em>X<\/em>, but <em>A<\/em> and <em>B<\/em> produce very different results, that counts as evidence that <em>neither<\/em> <em>A<\/em> nor<em> B<\/em> is actually measuring <em>X<\/em> well. 
Waldon et al.\u2019s results are similar to ours; they \u201cshow that LLMs\u2019 metalinguistic judgments are highly sensitive to subtle prompting variations\u201d and that \u201cLLMs can be easily \u2018gamified\u2019 to reflect a user\u2019s preconceived biases, even when the lawyer or judge adheres to what appears to be a rigorous and neutral methodology.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"215\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-215\">215<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-215\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"215\">Waldon et al., <em>supra <\/em>note 16, at 39.<\/span><\/p>\n<p>This is a problem common to all empirical disciplines, and the obvious way to respond to it is to improve methodological robustness. If <em>A<\/em> and <em>B<\/em> produce highly correlated measurements despite their differences, and so do variations <em>C<\/em>, <em>D<\/em>, <em>E<\/em>, and so on, then we are more justified in believing that all the members of this family really are measuring the same thing. And some of this work is already occurring. From <em>Snell<\/em> to <em>Deleon<\/em>, Judge Newsom went from two queries to each of two models to ten queries to each of three models, and analyzed them more systematically. 
In their work on LLM interpretation, Christoph Engel and Richard McAdams analyzed distributions of responses rather than a model\u2019s single \u201cbest\u201d response.<sup class=\"modern-footnotes-footnote \" data-mfn=\"216\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-216\">216<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-216\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"216\"><em>See <\/em>Christoph Engel &amp; Richard H. McAdams, <em>Asking GPT for the Ordinary Meaning of Statutory Terms<\/em>, 2024 <span style=\"font-variant: small-caps\">U. Ill. J.L. Tech. &amp; Pol\u2019y<\/span> 235, 235.<\/span> As the sophistication and scale of generative-interpretation experiments continue to increase, its practitioners may be able to find robust protocols that are not sensitive to small differences in experimental setups.<\/p>\n<p>This is not to say that a protocol must be robust against all possible variations. Some can be excluded for principled reasons. Say a coin flip gets fairer\u2014in a way that can be empirically measured with strong statistical significance\u2014as the coin is tossed from a greater height. If that\u2019s the case, the protocol in which a coin is tossed from two inches above a table may produce different results than the protocol in which it is tossed from two feet, but that will be because the two-inch protocol is demonstrably worse. 
From political polling to laser interferometry, empiricists in numerous fields have accumulated bodies of standards and best practices for conducting their studies using specific methods in ways that are broadly accepted over their alternatives.<sup class=\"modern-footnotes-footnote \" data-mfn=\"217\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-217\">217<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-217\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"217\"><em>See, e.g.<\/em>, <span style=\"font-variant: small-caps\">The Am. Soc\u2019y of Mech. Eng\u2019rs<\/span>, <span style=\"font-variant: small-caps\">B89.1.8 &#8211; Performance Evaluation of Displacement-Measuring Laser Interferometers<\/span> (2011), https:\/\/www.asme.org\/codes-standards\/find-codes-standards\/performance-evaluation-of-displacement-measuring-laser-interferometers\/2011\/pdf [https:\/\/perma.cc\/ZZ6P-ZFCJ]; <span style=\"font-variant: small-caps\">Am. Assoc. for Pub. Op. Rsch., Best Practices for Survey Research<\/span> (Mar. 2022), https:\/\/aapor.org\/wp-content\/uploads\/2023\/06\/Survey-Best-Practices.pdf [https:\/\/perma.cc\/CB2X-DAF2].<\/span> Here too the LLM proponents are tentatively beginning to move beyond simply brainstorming different possible methods to testing them comparatively, and excluding the ones that fail robustness checks. 
Engel and McAdams, for example, have an extended discussion of different types of design choices for generative-interpretation studies, including tentative thoughts on how to make many of those choices.<sup class=\"modern-footnotes-footnote \" data-mfn=\"218\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-218\">218<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-218\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"218\"><em>See <\/em>Engel &amp; McAdams<em>, supra<\/em> note 216, at 271\u201388.<\/span><\/p>\n<p>It is worth emphasizing how daunting this task is for generative interpretation. LLMs are sensitive to their configuration and prompts <em>by design<\/em>. Consumer-grade LLMs aim to respond in ways their users prefer; a chatbot that can conform to its user\u2019s preferences by picking up on subtle cues in their inputs will provide a better service.<sup class=\"modern-footnotes-footnote \" data-mfn=\"219\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-219\">219<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-219\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"219\">Text generation is tuned to align with human feedback. <em>See, e.g.<\/em>, Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, J\u00e9r\u00e9my Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Rapha\u00ebl Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. 
Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem B\u0131y\u0131k, Anca Dragan, David Krueger, Dorsa Sadigh &amp; Dylan Hadfield-Menell, <em>Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback <\/em>(Sept. 11, 2023), https:\/\/arxiv.org\/abs\/2307.15217 [https:\/\/perma.cc\/Q8WM-ZV8W] (describing Reinforcement Learning from Human Feedback and enumerating many of its limitations).<\/span> Consider <em>Snell<\/em>: if Newsom\u2019s clerk had asked, \u201cwhats landscaping mean?\u201d, he would have seen vastly different results from those he received in response to \u201cWhat is the ordinary meaning of \u2018landscaping\u2019?\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"220\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-220\">220<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-220\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"220\">Using the GPT-3.5-turbo-instruct model and repeating the questions from <em>Snell<\/em> 1000 times yielded the distribution of responses shown in the following table. Newsom\u2019s phrasing resulted in GPT answering \u201cyes\u201d twice as often as \u201cno.\u201d The more colloquial phrasing of the question resulted in GPT saying \u201cno\u201d six times more often than \u201cyes.\u201d<br \/><br \/>\nThe code used to generate this data is available at David B. Stein, <em>davidbstein<\/em>, snell-question-phrasing.py, https:\/\/gist.github.com\/davidbstein\/1a60483a71fdeb2da91f88efe24a55ad [https:\/\/perma.cc\/QS7G-WBD3].<\/span> Professional- and business-grade LLMs are used inside tools that need to conform to organizations\u2019 preferred tone, values, opinions, and topics. 
For example, an LLM-powered filing assistant on a court website should never acknowledge potential innuendos or double-entendres, whereas X\u2019s Elon Musk has specifically promoted Grok by claiming that it \u201cloves sarcasm\u201d and would answer questions with \u201ca little humour.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"221\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-221\">221<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-221\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"221\">Lucy Hooker, <em>Musk says his new AI chatbot has \u2018a little humour\u2019<\/em>, <span style=\"font-variant: small-caps\">BBC News<\/span> (Nov. 5, 2023), https:\/\/www.bbc.com\/news\/business-67327060 [https:\/\/perma.cc\/ZHA6-5URT] (quoting Elon Musk).<\/span> The demand for configurable general-purpose tools leads to LLM outputs that are highly sensitive to technical configuration.<sup class=\"modern-footnotes-footnote \" data-mfn=\"222\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-222\">222<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-222\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"222\">Sensitivities are also great for people using chatbots to answer factual questions, draft documents, brainstorm new ideas, or build bespoke tools. Users can fiddle with configurations and prompts to achieve desired outcomes.<\/span> In short, many LLM sensitivities are features, not bugs. The very nature of many LLMs pushes against their ability to be robust interpreters of meaning.<\/p>\n<p><em>2. 
External Calibration<\/em><\/p>\n<p>Merely passing robustness checks, however, is not sufficient to show that an empirical method is fit for purpose. It may be measuring <em>something<\/em> robustly, but that something might not be the object of interest. Robustness is a necessary condition, not a sufficient one.<\/p>\n<p>In other words, to show empirically that an LLM truly measures linguistic meaning, its outputs must be compared to <em>an accepted authority on linguistic meaning<\/em>. This is the point of Engel and McAdams\u2019s study of ChatGPT as a source of ordinary meaning: they compare ChatGPT-3.5 Turbo\u2019s responses (under various prompting conditions) to the results of Kevin Tobia\u2019s 2020 survey of 2,800 English speakers about the meaning of statutory terms.<sup class=\"modern-footnotes-footnote \" data-mfn=\"223\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-223\">223<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-223\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"223\"><em>See <\/em>Engel &amp; McAdams, <em>supra<\/em> note 216, at 256\u201370 (citing Kevin P. Tobia, <em>Testing Ordinary Meaning<\/em>, 134 <span style=\"font-variant: small-caps\">Harv. L. Rev<\/span>. 726 (2020));<em> see also<\/em> Johannes Kruse, <em>The Ordinary Meaning Bot: Simulating Human Surveys with LLMs<\/em> 3 (Max Planck Inst. for Research on Collective Goods, Discussion Paper No. 2025\/12, 2025), https:\/\/papers.ssrn.com\/sol3\/papers.cfm?abstract_id=5378203 [https:\/\/perma.cc\/SJ2K-PZJX] (presenting an \u201capproach [that] differs crucially from Engel and McAdams . . . 
: Rather than asking GPT for aggregate yes-rates, it seeks to approximate the individual response behavior that underlies those aggregates.\u201d).<\/span> Strikingly, three of the four LLM protocols they tested produced results that significantly diverged from Tobia\u2019s survey data.<\/p>\n<p>You may or may not agree that surveys are an appropriate measure of the meaning of legal language. Our point is that while it need not be surveys, any attempt to calibrate LLMs empirically depends on having some external benchmark to calibrate against. Without one, no amount of experimentation can provide a way to differentiate between \u201cinformed conclusion[s] based on a statistical analysis of billions of texts,\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"224\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-224\">224<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-224\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"224\"><em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 458.<\/span> and wild extrapolation based on extraneous factors.<\/p>\n<p>Consider Unikowsky\u2019s experiments with Supreme Court cases. 
He found, \u201cOf the 37 merits cases decided so far this Term, Claude decided 27 in the same way the Supreme Court did.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"225\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-225\">225<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-225\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"225\"><em>In AI We Trust II<\/em>, <em>supra<\/em> note 5.<\/span> One take on this result would be that Claude has an accuracy rate of 73%\u2014definitely not good enough for government work. But that is not Unikowsky\u2019s response; instead, he says, \u201cI frequently was more persuaded by Claude\u2019s analysis than the Supreme Court\u2019s.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"226\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-226\">226<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-226\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"226\"><em>Id.<\/em><\/span> That may well be (and we will discuss the argument from persuasiveness in the next section), but note what it does to the project of validating Claude\u2019s accuracy. It means that the Supreme Court\u2019s actual decisions are no longer functioning as a source of ground truth. It means that Adam Unikowsky trusts <em>his own judgment<\/em> more than he trusts <em>the Supreme Court\u2019s opinions<\/em> as a benchmark of legal analysis. 
This too may be right,<sup class=\"modern-footnotes-footnote \" data-mfn=\"227\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-227\">227<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-227\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"227\">As readers of his newsletter, we certainly agree that his arguments are often more convincing than those of the judges and lawyers he critiques!<\/span> but notice what it does to the project of validating LLMs for judicial use. It means that you too need to trust Adam Unikowsky\u2019s legal judgment more than the Supreme Court\u2019s. It means we would need a societal consensus that Claude is <em>better at this than the Supreme Court<\/em>. There is no way to bootstrap into such a consensus by counting up how frequently Claude agrees with the Supreme Court. If you think Claude is better at judging than the Supreme Court, you need to argue that Claude is better at judging than the Supreme Court, and back it up with your own standard of what constitutes good judging and evidence that Claude meets that standard.<\/p>\n<p>Other LLM proponents confront this problem, but none of them are able to dispose of it. Newsom relies primarily on intuition, finding that LLMs\u2019 responses \u201csquared with what I had assumed,\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"228\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-228\">228<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-228\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"228\">United States v. Deleon, 116 F.4th 1260, 1272 (11th Cir. 
2024) (Newsom, J., concurring).<\/span> and \u201csquared with my own impression\u2014informed by my own experience writing, reading, speaking, and listening to American English in the real world.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"229\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-229\">229<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-229\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"229\">Snell v. United Specialty Ins. Co., 102 F.4th 1208, 1225 (11th Cir. 2024) (Newsom, J., concurring).<\/span> What, one wonders, would Newsom have done with their responses if those responses had <em>not<\/em> squared with his own interpretations? For its part, <em>Generative Interpretation<\/em> defines \u201c[a]ccuracy\u201d as \u201cthinking that we really got as close as we could to knowing what the parties would have said.\u201d Arbel and Hoffman acknowledge that \u201cthere is no ground truth at hand\u2014we can\u2019t really know what the parties intended at contracting and have to make instead our best guess.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"230\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-230\">230<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-230\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"230\"><em>Generative Interpretation<\/em>, <em>supra <\/em>note 2<em>,<\/em> at 459\u201360, 462.<\/span><\/p>\n<p>The trouble is that this mode of analysis invites circular reasoning. 
Arbel and Hoffman define accuracy in terms of \u201cbest guess[es]\u201d while simultaneously implying that LLM-produced guesses constitute our \u201cbest guess[es].\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"231\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-231\">231<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-231\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"231\"><em>See id.<\/em> at 459\u201360, 485.<\/span> But if LLM-produced guesses are <em>definitionally<\/em> our \u201cbest guess[es],\u201d and thus the benchmark for any measurement of accuracy, then <em>Generative Interpretation<\/em>\u2019s case studies do not measure accuracy. Instead, they project authoritative meaning onto LLM outputs, irrespective of what those outputs signify.<\/p>\n<p>The diversity of the techniques employed for generative interpretation is a warning of the breadth of empirical validation that generative interpretation will demand. Every distinct technique for querying an LLM will require its own distinct, inductive validation. Empirical support that an LLM\u2019s word-likelihood estimates reflect linguistic meaning does not necessarily establish that LLMs\u2019 probability estimates reflect linguistic meaning. 
Even if generative-interpretation proponents can empirically validate <em>one<\/em> method of querying LLMs\u2014and, we emphasize, they have not yet done so, although early research gives some reason for optimism<sup class=\"modern-footnotes-footnote \" data-mfn=\"232\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-232\">232<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-232\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"232\"><em>See <\/em>Engel &amp; McAdams, <em>supra<\/em> note 216, at 43.<\/span>\u2014that validation would not establish the accuracy of <em>other<\/em> methods of querying an LLM. Evidence that some LLM output is accurate does not entail that an output is accurate merely because it comes from an LLM.<\/p>\n<p><em>3. Protocol Specification<\/em><\/p>\n<p>Another way of phrasing the difficulty with answering the epistemic challenge empirically is as a problem of specification. The more sensitive generative interpretation is to implementation details, the more comprehensive <em>ex ante<\/em> commitments need to be to achieve predictable or consistent results. Posing a question to an LLM is a wildly underspecified task. 
As we have seen, outputs can depend on subtle differences in question formulation, model choice, interpretation method, and the configuration and implementation of any software used to operate the LLM.<\/p>\n<p>Proponents of generative interpretation imagine parties agreeing on specific models and prompting methods, then resolving disputes by presenting them to an LLM using those agreed-upon methods.<sup class=\"modern-footnotes-footnote \" data-mfn=\"233\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-233\">233<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-233\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"233\"><em>See, e.g.<\/em>, <em>Generative Interpretation, supra <\/em>note 2, at 501.<\/span> At least for the case studies explored here, every phrasing and formatting choice seems to influence outcomes. Constructing a sufficiently detailed pre-commitment for predictable generative interpretation requires agreeing on outcome-altering details. For all but the easiest of cases, many of those details are dispute-specific.<sup class=\"modern-footnotes-footnote \" data-mfn=\"234\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-234\">234<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-234\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"234\"><em>But see id.<\/em> at 510 (\u201c[I]f courts follow our proposed best practices, this method is also predictable <em>ex ante.<\/em>\u201d).<\/span> Why not negotiate the anticipated disputes directly? 
Put another way, picking one experimental setup over another merely displaces the arbitrariness from the empirical realm (how to reconcile the different results of protocol <em>A<\/em> and protocol <em>B<\/em>?) to the epistemic one (what makes protocol <em>A<\/em> better or worse than protocol <em>B<\/em>?).<\/p>\n<p>To repeat, our claim is not that this <em>cannot<\/em> <em>be done<\/em> in a principled way, just that it <em>has not yet been<\/em> <em>done<\/em> in a principled way, and that it <em>will have to be done<\/em> in a principled way to justify generative interpretation empirically. The empirical woods are lonely, dark, and deep\u2014and generative interpretation has miles to go.<\/p>\n<p><em>4. No Easy Fixes<\/em><\/p>\n<p>Rather than waiting for more robust validation, it is tempting to play whack-a-mole with problems like the ones we spotted in Part III. When an LLM generates text that influences its later answers\u2014like in <em>Snell <\/em>and <em>Famiglio<\/em>\u2014some might naturally respond by constraining the model to one-word responses.<sup class=\"modern-footnotes-footnote \" data-mfn=\"235\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-235\">235<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-235\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"235\"><em>See<\/em> <em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 485 n.164 (proposing this approach).<\/span> But that merely shoves the improvident salience into the prompt,<sup class=\"modern-footnotes-footnote \" data-mfn=\"236\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-236\">236<\/a><\/sup><span 
id=\"mfn-content-00000000000007360000000000000000_4618-236\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"236\"><em>See <\/em>Felipe Maia Polo, Ronald Xu, Lucas Weber, M\u00edrian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun &amp; Mikhail Yurochkin, <em>Efficient Multi-Prompt Evaluation of LLMs<\/em>, 38 C<span style=\"font-variant: small-caps\">onf. Neural Info. Processing Sys. 1<\/span> (2024) (collecting sources).<\/span> and some research suggests that LLMs produce lower-quality answers when forced to be succinct.<sup class=\"modern-footnotes-footnote \" data-mfn=\"237\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-237\">237<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-237\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"237\">Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le &amp; Denny Zhou, <em>Chain-of-Thought Prompting Elicits Reasoning in Large Language Models<\/em>, 36 C<span style=\"font-variant: small-caps\">onf. Neural Info. Processing Sys. 
1, <\/span>6 (2022).<\/span> So maybe we cross-reference between models instead.<sup class=\"modern-footnotes-footnote \" data-mfn=\"238\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-238\">238<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-238\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"238\"><em>See<\/em> <em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 503\u201304 (advising use of this approach as a \u201cbest practice\u201d).<\/span> But model quality can vary; models can (and do) converge on arbitrary answers.<sup class=\"modern-footnotes-footnote \" data-mfn=\"239\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-239\">239<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-239\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"239\"><em>See<\/em> <em>supra<\/em> note 151 and accompanying text.<\/span> So maybe we generate multiple versions of the same prompt.<sup class=\"modern-footnotes-footnote \" data-mfn=\"240\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-240\">240<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-240\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"240\"><em>See<\/em> <em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 490 (\u201ctr[ying] something new\u201d by generating multiple prompts); <em>see also<\/em> Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel &amp; 
Pontus Stenetorp, <em>Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity<\/em>, 60 <span style=\"font-variant: small-caps\">Proc. Assoc. Computational Linguistics<\/span> 8086, 8089 (2022) (generating multiple prompts automatically when querying a model); Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi &amp; Graham Neubig, <em>Pre-Train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing<\/em>, 55 <span style=\"font-variant: small-caps\">ACM Computing Survs.<\/span>, Jan. 2023, at 8 (collecting and reviewing literature on \u201cprompt paraphrasing\u201d and \u201cprompt generation\u201d).<\/span> But sampling across low-quality prompts can nudge models towards the same irrelevant details.<sup class=\"modern-footnotes-footnote \" data-mfn=\"241\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-241\">241<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-241\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"241\"><em>See <\/em>Albert Webson &amp; Ellie Pavlick, <em>Do Prompt-Based Models Really Understand the Meaning of Their Prompts?<\/em>, 2022 <span style=\"font-variant: small-caps\">Proc. Conf. N. Am. Chap. Ass\u2019n For Computational Linguistics: Hum. Language Techs. <\/span>2300, 2307.<\/span> And so on. Each problem exposes or introduces another, with some solutions possibly decreasing answer quality. Without a reliable way to measure performance, we cannot know which changes improve generative interpretation and which choices exacerbate or obscure problems.<\/p>\n<p>It is also tempting to over-extrapolate from simplified explanations of complex systems. 
It may feel natural to assume that because LLMs are trained on a large amount of text, their answers are informed by that text\u2019s semantic content. But that inference is unsupported; semantic information is one (potentially inconsequential) part of a more complex whole. If the argument that LLMs are trustworthy because of how they work does not go through on its own (as we argued in section IV.A), it also does not establish that LLMs are empirically trustworthy (as we discuss in this section). That claim must be made, and supported, on its own merits. An LLM might use great data in an unhelpful way. The brittleness and arbitrariness of early experiments in generative interpretation should caution us that the method is not yet \u201cgood enough\u201d for use in litigation.<sup class=\"modern-footnotes-footnote \" data-mfn=\"242\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-242\">242<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-242\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"242\"><em>Contra Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 458.<\/span> Indeed, numerous empirical studies have documented that LLMs consistently make basic errors in legal reasoning.<sup class=\"modern-footnotes-footnote \" data-mfn=\"243\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-243\">243<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-243\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"243\"><em>See, e.g.<\/em>, Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning &amp; Daniel E. 
Ho, <em>Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools<\/em>, 22 <span style=\"font-variant: small-caps\">J. Empirical Legal Stud.<\/span> 216, 217 (2025).<\/span><\/p>\n<p style=\"text-align: center\">C. <em>Relying on Persuasiveness?<br \/><\/em><\/p>\n<p>The third major way that generative-interpretation proponents attempt to justify their methodology is by arguing that the results literally speak for themselves: that LLM-generated text is sufficiently <em>persuasive<\/em> that it should be treated as authoritative. This type of argument is particularly slippery, because it comes in two closely related forms that make very different assumptions. One is that the persuasiveness of LLM-generated text is <em>evidence<\/em> that it resulted from a reliable process of linguistic interpretation or legal reasoning. The other is that persuasive LLM-generated text is authoritative purely <em>because it is persuasive<\/em>, and that it is irrelevant how it was generated. To clarify the difference, and to understand why both forms of argument are deceptive, it is necessary to speak first of abduction.<\/p>\n<p>1. 
<em>Generative Interpretation as Abduction<\/em><\/p>\n<p>Consider the claim, \u201cThis text looks like it was the result of a skilled process of legal reasoning, so therefore it was produced by one.\u201d The name for this form of reasoning is <em>abduction<\/em>, or sometimes \u201cinference to the best explanation.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"244\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-244\">244<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-244\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"244\"><em>See generally<\/em> Igor Douven, <em>Abduction<\/em>, <span style=\"font-variant: small-caps\">Stan. Encyc. Phil. <\/span>(June 18, 2025), https:\/\/plato.stanford.edu\/entries\/abduction\/ [https:\/\/perma.cc\/6HHU-CGKY].<\/span> Unlike <em>deduction<\/em>, which reasons from premises to their logical entailments, or <em>induction<\/em>, which reasons to find generalizations consistent with existing data, abduction reasons by identifying plausible causes for observed phenomena. \u201cAll the balls in that urn are black; these balls came from that urn; therefore these balls are black\u201d is deduction; \u201cThese balls are black; these balls came from that urn; therefore all balls in that urn are black\u201d is induction; \u201cThese balls are black; all the balls in that urn are black; therefore these balls came from that urn\u201d is abduction.<\/p>\n<p>Though potentially less robust than inductive or deductive proof, abductive reasoning is often the only option in the law. For example, the preponderance of the evidence test asks for the most reasonable or coherent interpretation of the evidence. The legitimacy of legal dispositions is also something we may infer abductively. 
As later subparts explore more deeply, a judge\u2019s decision is assumed to be the product of a good-faith application of legal reasoning because her written decision describes that reasoning process. When a judicial opinion sets forth a persuasive chain of legal reasoning that accurately refers to relevant law, we assume abductively that it was in fact realized through the analytical process it describes, and hence that the opinion is legitimate.<\/p>\n<p>Abduction is not in itself a wrongheaded analytical device. We rely on abduction all the time. When a student writes an exam that makes cogent, well-informed arguments about the assigned materials, we treat that document as evidence that the student has mastered the topic. We presume that producing such an exam response requires a sophisticated understanding of the course materials, and we regard the exam as proof that its author has such an understanding. Of course, our abductive inference might be wrong. Maybe the exam response was produced by a cat walking across a keyboard, or a fountain pen leaking into a bluebook in exactly the pattern of handwritten text. But, based on our experience, we conclude that such situations are implausible enough that we award the student a good grade unless we have evidence to believe that her exam response was not her own composition.<\/p>\n<p>Abduction, however, is limited by what we know and by the accuracy of what we think we know. In 2025, for example, it is probably educational malpractice for an instructor to be unaware that students have access to LLMs, because the wide availability of LLMs weakens the abductive inferences that educators customarily make after inspecting students\u2019 work. LLMs allow students to generate exactly the sorts of prose that, previously, would have required comprehension of the course materials. A cogent term paper no longer signifies what it signified in the pre-AI age. 
To ensure that written documents continue to serve as rough measurements of students\u2019 knowledge, instructors must redesign examinations to ensure that they measure what they purport to measure. At least two of us, for example, have forgone take-home examinations in favor of proctored, in-class assessments, to ensure that students\u2019 written work supports the same inferences about their mastery of course materials as it did before the advent of LLMs.<\/p>\n<p>Just as they complicate the inferences that a teacher can make from a well-written essay, LLMs complicate the inferences that a jurist can make from a well-written legal argument. It used to be fair to infer that someone who authored a cogent legal opinion had considered the relevant issues. But as the processes for generating legal writing change, so do the permissible inferences that a legal work product can engender.<\/p>\n<p>Abduction clarifies the nature of the two kinds of arguments from persuasiveness. A claim that LLM-generated text is authoritative because its persuasiveness reassures us that it was generated through a valid process of legal reasoning is an abductive argument. It uses the superficial qualities of LLM-generated text to make a claim about the legitimacy of the underlying reasoning. That claim, however, is defeasible, and can be rebutted by a showing that there are good reasons to doubt the bona fides of the process that generated the text. A claim that LLM-generated text is authoritative simply because it is persuasive is not abductive. It avoids the rebuttal by treating the generation process as irrelevant. But, as we will see, it is a claim that is dramatically at odds with the theory of adjudication on which our legal system currently rests.<\/p>\n<p>2. <em>Rhetoric \u2260 Adjudication<\/em><\/p>\n<p>LLMs are rhetoric machines. 
Rhetoric is the art of persuasive argumentation,<sup class=\"modern-footnotes-footnote \" data-mfn=\"245\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-245\">245<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-245\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"245\"><em>See, e.g.<\/em>, <span style=\"font-variant: small-caps\">Sharon Crowley &amp; Debra Hawhee, Ancient Rhetorics for Contemporary Students <\/span> 9 (1994). <\/span> and it is unsurprising that LLMs trained on an enormous corpus of humans doing their best to make persuasive arguments are capable of replicating the textual features of those arguments that make them persuasive. Indeed, the training process (particularly the stages that involve human feedback) may select for persuasiveness, thereby amplifying the connections in an LLM\u2019s internal model of language that contribute to fluent, persuasive outputs.<\/p>\n<p>Rhetoric is successful precisely to the extent it is persuasive.<sup class=\"modern-footnotes-footnote \" data-mfn=\"246\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-246\">246<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-246\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"246\">Note that we are not using \u201crhetoric\u201d to refer to the concept of \u201cconstructive rhetoric,\u201d as it sometimes does in writing on legal philosophy. <em>See,<\/em> <em>e.g.<\/em>, James Boyd White, <em>Law as Rhetoric, Rhetoric as Law: The Arts of Cultural and Communal Life<\/em>, <span style=\"font-variant: small-caps\">52 U. Chi. L. Rev. 
684<\/span>, 701 (1985) (using a \u201chighly expanded\u201d definition for \u201crhetoric\u201d). That definition is useful in interrogating the cultural and social relationship between rhetorical legal artifacts\u2014like statutes\u2014and their authors and audience, but assumes a human author and is therefore ill-suited for this context. <em>See id.<\/em> at 689\u201390 (defining constructive rhetoric in reference to the lawyers creating it).<\/span> As anyone who reads their output can tell, LLMs excel at rhetorical writing. Today\u2019s LLMs reproduce the form of high-quality legal reasoning. As Unikowsky and others have observed, LLM output now replicates rational, eloquent argumentation that applies precedent to novel facts.<sup class=\"modern-footnotes-footnote \" data-mfn=\"247\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-247\">247<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-247\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"247\"><em>See<\/em> <em>In AI We Trust II<\/em>, <em>supra<\/em> note 5.<\/span> An LLM can produce text that may be formally indistinguishable from\u2014or even formally superior to\u2014the reasoning described by an opinion written by a human judge.<\/p>\n<p>There are many ways to elicit responses from LLMs that sound plausibly correct. But as anyone who has dealt with lawyers well knows, sounding correct is not necessarily the same as being correct. 
Indeed, there is ample evidence from the use of LLMs in law to illustrate the difference: ask any lawyer who has been sanctioned for submitting a brief containing LLM-generated \u201challucinations\u201d whether the fluency of LLM-generated text guarantees that the legal claims it makes are sound.<sup class=\"modern-footnotes-footnote \" data-mfn=\"248\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-248\">248<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-248\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"248\"><em>See, e.g.<\/em>, Mata v. Avianca, Inc., 678 F. Supp. 3d 443, 448\u201349 (S.D.N.Y. 2023) (imposing sanctions on attorneys who \u201csubmitted non-existent judicial opinions with fake quotes and citations created by the artificial intelligence tool ChatGPT\u201d).<\/span> Though tempting, conflating persuasiveness and coherence with authority or accuracy is a category error.<sup class=\"modern-footnotes-footnote \" data-mfn=\"249\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-249\">249<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-249\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"249\"><em>Cf.<\/em> Michael Townsen Hicks, James Humphries &amp; Joe Slater, <em>ChatGPT Is Bullshit<\/em>, 26 <span style=\"font-variant: small-caps\">Ethics Inf. Tech., <\/span>at 3 (June 2024) (\u201cThe problem here isn\u2019t that large language models hallucinate, lie, or misrepresent the world in some way. 
It\u2019s that they are not designed to represent the world at all; instead, they are designed to convey convincing lines of text.\u201d).<\/span><\/p>\n<p>The problem with rhetoric is that sometimes an argument can be persuasive for the wrong reasons.<sup class=\"modern-footnotes-footnote \" data-mfn=\"250\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-250\">250<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-250\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"250\">Consider, by way of contrast, the use of AI for victim impact statements. <em>See, e.g.<\/em>, Juliana Kim,<em> Family shows AI video of slain victim as an impact statement\u2014possibly a legal first,<\/em> NPR (May 12, 2025), https:\/\/www.npr.org\/2025\/05\/07\/g-s1-64640\/ai-impact-statement-murder-victim [https:\/\/perma.cc\/JD7P-Y6F5]. Even here, in a domain where pathos is as valued as logos, such uses raise the fear that AI will be persuasive for the wrong reasons.<\/span> Rhetoric encompasses arguments that comply with all the formal conventions of good-faith legal reasoning. It also, however, encompasses writing that is persuasive for other reasons, like well-concealed gaps in logic and appeals to emotion. Rhetoric covers everything from \u201clegal reasoning\u201d to \u201ccasuistry,\u201d and the rhetorician\u2019s good or bad faith is irrelevant. 
LLMs make newly salient the ancient anxiety about rhetoric itself<sup class=\"modern-footnotes-footnote \" data-mfn=\"251\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-251\">251<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-251\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"251\"><em>See, e.g.<\/em>, <span style=\"font-variant: small-caps\">Plato, Gorgias 27 <\/span>(Donald J. Zeyl, trans., Hackett Publishing 1987).<\/span>\u2014the danger that rhetorical skill can persuade the audience, regardless of whether the argument itself is right or wrong.<\/p>\n<p>By contrast, we expect more from adjudication. To be sure, a good adjudicator will present her decisions in a rhetorically effective format. But adjudications are not successful simply because they reach persuasive conclusions. Rather, adjudication requires the integrity of the process of legal reasoning that realized that conclusion.<sup class=\"modern-footnotes-footnote \" data-mfn=\"252\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-252\">252<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-252\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"252\"><em>See <\/em>Chad M. Oldfather, <em>Writing, Cognition, and the Nature of the Judicial Function<\/em>, 96 <span style=\"font-variant: small-caps\">Geo. L.J.<\/span> 1283, 1318 (2008) (discussing the widely shared belief that judicial \u201cwriting provides an important discipline on thought\u201d).<\/span><\/p>\n<p>That is, adjudication is also subject to procedural criteria. 
At minimum, these criteria require that a decision-maker make a good-faith, rational effort to produce a legal conclusion. If a formally adequate legal determination is produced in bad faith, or without rational effort, it does not adhere to the process of legal reasoning. A decision-maker\u2019s good faith establishes that she has not elaborated a formally adequate conclusion for improper, extralegal reasons. For example, a conflict of interest will taint a decision-maker\u2019s conclusion of law. Even if the conclusion is set forth in a formally adequate decision, it is not a legitimate legal conclusion because an observer reasonably can suspect that the judge\u2019s personal interests influenced her determination of the appropriate outcome.<\/p>\n<p>Procedural constraints also require a decision-maker\u2019s conclusion to be at least minimally rational. Paul the Octopus was a cephalopod whose handlers would serve him food in boxes that bore the logos of competing soccer teams.<sup class=\"modern-footnotes-footnote \" data-mfn=\"253\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-253\">253<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-253\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"253\"><em>Paul the Octopus<\/em>, <span style=\"font-variant: small-caps\">Wikipedia<\/span>, https:\/\/en.wikipedia.org\/wiki\/Paul_the_Octopus [https:\/\/perma.cc\/2765-B2VT].<\/span> When Paul chose to eat from one box before another, his handlers deemed him to have \u201cpredicted\u201d that the corresponding team would win.<sup class=\"modern-footnotes-footnote \" data-mfn=\"254\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" 
aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-254\">254<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-254\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"254\"><em>Id.<\/em><\/span> When this predictive method accurately foretold the winners of a number of international soccer matches, Paul became world famous.<sup class=\"modern-footnotes-footnote \" data-mfn=\"255\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-255\">255<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-255\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"255\"><em>Id.<\/em><\/span> If a judge issued a formally adequate legal opinion, and it were later revealed that she had chosen the winning party because her pet octopus ate from a box labeled with that party\u2019s name, this procedural deficiency would undermine the decision\u2019s status as legal reasoning.<\/p>\n<p>A formally adequate legal conclusion is legal reasoning\u2019s end product, but is not in itself proof that legal reasoning has taken place. This is because a formally adequate legal work product can be realized through processes that do not constitute legal reasoning. In some intellectual traditions, the integrity of the form is the integrity of the process. 
If, for example, a mathematician authored a formally accurate proof and later claimed that it came to him as a divine message, the proof would still be mathematically legitimate.<sup class=\"modern-footnotes-footnote \" data-mfn=\"256\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-256\">256<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-256\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"256\">The esteemed mathematician Srinivasa Ramanujan claimed scholarly inspiration from a family goddess, whose consort visited him in a dream. <em>See<\/em> Roshni Chakrabarty, <em>Srinivasa Ramanujan: The mathematical genius who credited his 3900 formulae to visions from Goddess Mahalakshmi<\/em>, <span style=\"font-variant: small-caps\">India Today<\/span> (Dec. 23, 2022), https:\/\/www.indiatoday.in\/education-today\/gk-current-affairs\/story\/srinivasa-ramanujan-life-story-973662-2017-04-26 [https:\/\/perma.cc\/95H6-SMVZ].<\/span> If, on the other hand, a judge issued a formally adequate opinion and later revealed that she had reached the conclusion that a divine message had commanded her to adopt, such a revelation would undermine her claim to have engaged in legal reasoning.<sup class=\"modern-footnotes-footnote \" data-mfn=\"257\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-257\">257<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-257\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"257\"><em>Cf. <\/em>Lon L. Fuller, <em>The Forms and Limits of Adjudication<\/em>, 92 <span style=\"font-variant: small-caps\">Harv. L. 
Rev.<\/span> 353, 390\u201391 (1978) (arguing that a decision rendered on \u201cgrounds . . . largely unrelated to what occurred at the hearing . . . . fall[s] beyond the limits of adjudication\u201d).<\/span> In an essay written before her judicial appointment, Justice Amy Coney Barrett concludes that Catholic doctrine forbids observant judges from enforcing the death penalty, and that a moral refusal to consider a death sentence warrants recusal from the sentencing phase of a capital case.<sup class=\"modern-footnotes-footnote \" data-mfn=\"258\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-258\">258<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-258\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"258\">Amy C. Barrett &amp; John H. Garvey, <em>Catholic Judges in Capital Cases<\/em>, 81 <span style=\"font-variant: small-caps\">Marq. L. Rev.<\/span> 303, 305, 335, 339 (1998).<\/span> Even if the judge can justify her conclusion in a formally adequate legal document, her recusal is warranted because she cannot reach that conclusion through a process recognized as legal reasoning.<\/p>\n<p>The form and process of legal reasoning are socially contingent. Even a single legal tradition admits variations: modern American jurists disagree about the precise form that legal reasoning may take. 
Some, for example, \u201cobject to the use of legislative history on principle,\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"259\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-259\">259<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-259\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"259\">Antonin Scalia, <em>Common-Law Courts in a Civil-Law System: The Role of United States Federal Courts in Interpreting the Constitution and Laws<\/em>, <em>in<\/em> <span style=\"font-variant: small-caps\">A Matter of Interpretation<\/span> 3, 31 (Amy Gutmann ed., 2018).<\/span> while others encourage it.<sup class=\"modern-footnotes-footnote \" data-mfn=\"260\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-260\">260<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-260\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"260\"><em>See generally<\/em> Stephen Breyer, <em>On the Uses of Legislative History in Interpreting Statutes<\/em>, 65 <span style=\"font-variant: small-caps\">S. Cal. L. Rev.<\/span> 845 (1992).<\/span> A legislative-history skeptic may find an opinion by a legislative-history enthusiast formally inadequate on the ground that it employs a categorically impermissible mode of argument. 
The legislative-history skeptic may find this legal reasoning just as illegitimate as legal reasoning that invokes the judge\u2019s dreams or her subjective \u201cvalue preferences\u201d to justify a conclusion.<sup class=\"modern-footnotes-footnote \" data-mfn=\"261\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-261\">261<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-261\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"261\"><em>Cf.<\/em> Robert H. Bork, <em>Neutral Principles and Some First Amendment Problems<\/em>, 47 <span style=\"font-variant: small-caps\">Ind. L.J.<\/span> 1, 9 (1971) (arguing that when the Court departs from the text and history of the Constitution, it is left solely with unprincipled and illegitimate \u201cvalue preferences\u201d to decide constitutional cases).<\/span><\/p>\n<p>But granular disagreements about particular formal qualities can reveal the interlocutors\u2019 broader agreement about the general form that legal argumentation should take. Some arguments for constitutional originalism, for example, posit that the methodology is, in consequentialist terms, the best approach to constitutional interpretation.<sup class=\"modern-footnotes-footnote \" data-mfn=\"262\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-262\">262<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-262\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"262\"><em>See<\/em> Cass R. Sunstein, <em>Originalism<\/em>, 93 <span style=\"font-variant: small-caps\">Notre Dame L. 
Rev.<\/span> 1671, 1680\u201381 (2018); William Baude, <em>Is Originalism Our Law?<\/em>, 115 <span style=\"font-variant: small-caps\">Colum. L. Rev.<\/span> 2349, 2351 (2015).<\/span> Consequentialist arguments for originalism do not assert that originalism is the only true form of legal reasoning.<sup class=\"modern-footnotes-footnote \" data-mfn=\"263\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-263\">263<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-263\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"263\">Of course, some proponents of constitutional originalism do posit that it is the only legitimate method of constitutional interpretation, and this view comes closer to asserting that non-originalist reasoning is in fact not legal reasoning. <span style=\"font-variant: small-caps\">Antonin Scalia &amp; Bryan A. Garner<\/span>, <span style=\"font-variant: small-caps\">Reading Law<\/span> 89 (2012) (\u201c[Originalism] is the only objective standard of interpretation . . . .\u201d).<\/span> Rather, these arguments suggest a broader view of legal argumentation\u2014as, say, consequentialist arguments rationally propounded\u2014and employ that argumentative paradigm to advocate for originalism.<\/p>\n<p>Similarly, that a document formally resembles legal reasoning does not make it adjudication, nor does this resemblance necessarily make the document authoritative.
Whether something constitutes adjudication, and whether it is authoritative, depends on social facts like Hart\u2019s rule of recognition.<sup class=\"modern-footnotes-footnote \" data-mfn=\"264\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-264\">264<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-264\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"264\"><span style=\"font-variant: small-caps\">H.L.A. Hart, The Concept of Law<\/span> 92\u201393 (1961) (\u201c[A] \u2018rule of recognition\u2019 . . . will specify some feature or features possession of which by a suggested rule is taken as a conclusive affirmative indication that it is a rule of the group to be supported by the social pressure it exerts.\u201d).<\/span> The authority of an adjudication depends not just on formal resemblance to a reasoned judicial opinion, but on a set of social practices that imbue particular artifacts of legal reasoning with authority. Although the formal adequacy of written legal reasoning produced by humans may once have justified a presumption that it emerged from a bona fide adjudicative process, the same can\u2019t be said for formally adequate legal reasoning generated using an LLM. To presume that these AI-generated simulacra of rationality constitute the social process of adjudication is to conflate form and process.<\/p>\n<p>It is almost trivially easy to make an LLM produce any kind of output desired (e.g., \u201cWrite a textualist opinion finding for the defendant while making as many references as possible to the films of Steven Spielberg\u201d).
Without some other kind of reassurance that the results reflect the kind of reasoning we associate with competent and good-faith judging\u2014the kind that could be provided in theory by inductive or deductive approaches\u2014saying that LLM outputs are valid interpretations because they look good boils down to a vibe-check. Ironically, it is precisely the sort of \u201cvisceral, gut-instinct decisionmaking\u201d that proponents of generative interpretation hope to avoid.<sup class=\"modern-footnotes-footnote \" data-mfn=\"265\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-265\">265<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-265\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"265\">Snell v. United Specialty Ins. Co., 102 F.4th 1208, 1224 (11th Cir. 2024) (Newsom, J., concurring); <em>see Generative<\/em> <em>Interpretation<\/em>, <em>supra <\/em>note 2, at 459 (describing how their case studies applying generative interpretation \u201cilluminate how transparent and objective interpretative methodologies have advantages over intuitive ones\u201d).<\/span><\/p>\n<p>Ideally, some sort of assessment could separate the know-nothing scanners and tortoises from authoritative intelligences\u2014some attribute of LLMs that, if observed, would support the conclusion that its interpretations are accurate and meaningful. 
Unikowsky acknowledges that before delegating legal decision-making to LLMs, we want to \u201cmak[e] sure the AI works.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"266\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-266\">266<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-266\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"266\"><em>In AI We Trust I<\/em>, <em>supra<\/em> note 5.<\/span> \u201c[T]his,\u201d he says, \u201cshouldn\u2019t be that hard to do\u201d; validating AI\u2019s decision-making would require only \u201cdownload[ing] a few hundred briefs off of PACER, input[ting] them into your AI of choice, ask[ing] the AI to adjudicate the case, and compar[ing] the results to the actual judicial decisions that came down in those cases.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"267\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-267\">267<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-267\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"267\"><em>Id.<\/em><\/span><\/p>\n<p>But does Unikowsky\u2019s proposed test really measure accuracy? Unikowsky\u2019s writings on LLMs conflate at least three different desiderata. First is LLMs\u2019 ability to get the \u201cright answer\u201d\u2014that is, to reach the same disposition as a court on particular briefing. Second is LLMs\u2019 ability to sound reasonable. Third is LLMs\u2019 ability to be reasonable\u2014that is, to \u201creason.\u201d These are distinct capabilities. Possessing either of the first two does not entail possessing any of the others. 
For example, \u201cgetting the right answer\u201d alone may not prove much: we wouldn\u2019t delegate adjudication to a dog, even if it demonstrated a robust and uncanny ability to \u201cpredict\u201d litigated outcomes by choosing to eat from a particular labeled bowl. Simply sounding reasonable isn\u2019t sufficient either. An LLM, like a skilled lawyer, can produce reasonable-sounding justifications for incompatible legal outcomes. Generating a reasonable-sounding justification does not in itself demonstrate the reasoning we expect from an adjudicator.<\/p>\n<p>3. <em>Why Text Matters\u2014And Doesn\u2019t<\/em><\/p>\n<p>It is important to understand the work that text in particular does in legal reasoning\u2014and why LLM-generated text does not automatically do that work. The rule of law is not merely a collection of written artifacts; rather, it is the social process that produces those artifacts. In the contemporary United States, those artifacts happen to look like legal opinions. The texts are the visible traces of the system in action, but they are not the system itself.<\/p>\n<p>It\u2019s unsurprising that a lawyer or a judge might regard written legal reasoning as the paramount artifact of the American legal system. For centuries, the legal profession has produced written judicial opinions. These authoritative opinions form part of a process that constitutes the rule of law. Lawyers, laypersons, and members of other branches of government could read these decisions and coordinate future action based on the reasoning they set forth. Their holdings bind private citizens and government officials alike. Now, LLMs can produce artifacts that are formally indistinguishable from the artifacts produced by courts that apply the law authoritatively.<\/p>\n<p>Notwithstanding the pride of place it enjoys today, written adjudication is not essential to an authoritative legal process.
Juries, for example, are never called upon to explain their findings rationally, even though they serve a vital legitimating function in the criminal law.<sup class=\"modern-footnotes-footnote \" data-mfn=\"268\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-268\">268<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-268\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"268\"><em>See, e.g.<\/em>, Apprendi v. New Jersey, 530 U.S. 466, 497 (2000) (referring to \u201cthe jury tradition that is an indispensable part of our criminal justice system\u201d); <em>see also id.<\/em> at 477 (quoting Justice Story and William Blackstone to corroborate the importance of \u201ctrial by jury\u201d \u201c\u2018to guard against a spirit of oppression and tyranny\u2019\u201d).<\/span> A typical verdict has the barest formal qualities: it is an unelaborated yes or no. It is not the form of a verdict that primarily inspires faith in the jury. Rather, it is our understanding of the jury\u2019s deliberative process that legitimates its authority.<\/p>\n<p>Comparing juries to judges illustrates that law is process in addition to formal output. Unlike the work product of, say, an appellate court, the output that a jury produces is trivial. Providing a verdict is, of course, a fundamental part of the jury\u2019s job. But the point of the jury system isn\u2019t just to manufacture yes\/no verdicts. Human juries are resource-intensive; rendering verdicts by a coin flip or a random binary generator would be far more economical, offering great efficiencies and improving access to the courts. 
Yet we routinely reject innovations that would make verdicts cheaper and more efficient to produce.<sup class=\"modern-footnotes-footnote \" data-mfn=\"269\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-269\">269<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-269\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"269\"><em>See, e.g.<\/em>, Granfinanciera, S.A. v. Nordberg, 492 U.S. 33, 63 (1989) (\u201cIt may be that providing jury trials in some fraudulent conveyance actions . . . would impede swift resolution of bankruptcy proceedings . . . . But \u2018these considerations are insufficient to overcome the clear command of the Seventh Amendment.\u2019\u201d); <em>Apprendi<\/em>, 530 U.S. at 498 (Scalia, J., concurring) (\u201c[T]he jury-trial guarantee . . . has never been efficient; but it has always been free.\u201d); Parklane Hosiery Co. v. Shore, 439 U.S. 322, 343 (1979) (Rehnquist, J., dissenting) (\u201c[T]he concerns for the institution of jury trial that led to the passages of the Declaration of Independence and to the Seventh Amendment were not animated by a belief that use of juries would lead to more efficient judicial administration.\u201d).<\/span> We do so to protect the integrity of the process that realizes the verdict. In other words, there is very little about the form of a verdict that makes it authoritative and legal. What gives a verdict its legal authority is almost entirely our beliefs about a jury\u2019s process.<\/p>\n<p>LLM proponents assert that AI is authoritative not simply because of how it sounds, but because of how it works. 
This reasoning suggests that we ought to trust the conclusions that AI reaches even when they are unaccompanied by output that resembles legal reasoning.<sup class=\"modern-footnotes-footnote \" data-mfn=\"270\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-270\">270<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-270\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"270\">Note that when an LLM produces an output that describes a reasoning process, this description does not necessarily correspond to the process that the LLM actually followed to generate that output. <em>See, e.g.<\/em>, Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan &amp; Ethan Perez, <em>Reasoning Models Don\u2019t Always Say What They Think <\/em>(May 8, 2025), https:\/\/arxiv.org\/abs\/2505.05410 [https:\/\/perma.cc\/2DKD-U9ZT] (finding that LLMs prompted to explain their reasoning via a \u201cchain of thought\u201d frequently drew conclusions based on \u201chints\u201d that they did not disclose in their supporting reasoning).<\/span> And indeed, this is precisely what practitioners of generative interpretation do when they treat individual words and numbers generated by LLMs as probative of the meaning of a legal instrument or a particular term (as Arbel and Hoffman do in their Famiglio and \u201cflood\u201d case studies discussed above, for example).<sup class=\"modern-footnotes-footnote \" data-mfn=\"271\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\"
aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-271\">271<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-271\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"271\"><em>See<\/em> <em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 455\u201358, 483\u201385.<\/span> If our faith in AI should persist even when it does not present its conclusions in rational-sounding language, then shouldn\u2019t that faith justify replacing not just lawyers and judges with AI, but juries, too?<\/p>\n<p>Many readers may find the prospect of replacing juries with AI unpalatable, particularly in criminal cases. Many readers probably suspect that the public will not regard AI juries as having the same sort of legitimacy and authority as a human jury. The intuition that the public will reject their authority is a very good reason to resist AI juries. It also underscores that legality is a social fact; it does not follow inexorably from form. Just because AI can render something formally indistinguishable from a verdict does not mean that it can render a legally acceptable verdict. Correspondingly, just because LLMs can output text that resembles well-reasoned adjudication, does not mean that it can engage in the social process that constitutes acceptable adjudication.<\/p>\n<p>To put the point another way, juries are not important because they render binary verdicts; countless mechanisms can output a \u201c1\u201d or a \u201c0.\u201d They are important because they instantiate a legitimate and authoritative process for reaching a binary verdict. If we cared only about the formal artifacts of juries\u2014that is, only the verdicts they produce\u2014then we could delegate jury duty to a coin flip. The thought of deciding criminal culpability by coin flip appalls us because we value juries for the process they instantiate, not simply the artifacts they produce. 
The same is true of adjudication\u2014but focusing on the formal qualities of an opinion\u2019s text obscures this point. The temptation is to treat formally adequate legal text as evidence of the process we call \u201clegal reasoning\u201d or \u201cadjudication.\u201d But formally adequate legal text is not necessarily adjudication\u2014just like a bare yes or no resembles a verdict but is not necessarily one.<\/p>\n<p>Consider one of Unikowsky\u2019s proposals. He observes that submitting cases nationwide to a single AI adjudicator could reduce the unpredictability that arises in a judicial system that vests power in thousands of individual judges.<sup class=\"modern-footnotes-footnote \" data-mfn=\"272\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-272\">272<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-272\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"272\"><em>See In AI We Trust I<\/em>, <em>supra<\/em> note 5.<\/span> However, he writes,<\/p>\n<p>intellectual diversity across the judiciary is healthy, at least up to a point, and . . . it also seems bad to concentrate all judicial power in one (or a small number of) computer systems. But this problem can be dealt with by, e.g., adding a random ideological factor to the AI\u2019s outputs, having the AI highlight when a particular dispute would come out differently across different philosophies, using multiple independent AIs, or through other mechanisms. 
The point is, if we use AI, we can add exactly as much unpredictability and ideology as we want via effective prompt engineering rather than having unpredictability thrust upon us by the constraint of individual judges\u2019 bandwidth.<sup class=\"modern-footnotes-footnote \" data-mfn=\"273\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-273\">273<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-273\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"273\"><em>Id.<\/em><\/span><\/p>\n<p>Why is it that adding a \u201crandom ideological factor\u201d\u2014a fancy version of a coin flip\u2014would make AI adjudication more, rather than less, legitimate? Or rather, why does it seem to Unikowsky that taking a system that is arguably inferior to human judging (an AI adjudicator) and combining it with a system that is definitely inferior to human judging (random chance) will fix its deficiencies?<\/p>\n<p>Unikowsky\u2019s proposal to inject arbitrary amounts of \u201cunpredictability and ideology\u201d misses what is important to the process of adjudication. Outputs do not have an \u201cideological factor\u201d; judges do. Intellectual diversity in the judiciary is not desirable because it guarantees that judges will resolve cases differently. (It guarantees no such thing.) What intellectual diversity does guarantee is that judges will approach cases differently. 
And it is a diversity of approaches that is valuable, not a diversity of dispositions for its own sake.<sup class=\"modern-footnotes-footnote \" data-mfn=\"274\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-274\">274<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-274\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"274\"><em>See generally <\/em>David Orentlicher, <em>Politics and the Supreme Court: The Need for Ideological Balance<\/em>, 79 U. <span style=\"font-variant: small-caps\">Pitt. L. Rev.<\/span> 411 (2018) (arguing that an ideological balance of justices on the Supreme Court will result in better decisions); Deseriee A. Kennedy, <em>Judicial Review and Diversity<\/em>, 71 <span style=\"font-variant: small-caps\">Tenn. L. Rev.<\/span> 287 (2004) (arguing that increasing judicial diversity is necessary to bring additional perspectives into decisions and ensure fair review of cases and issues relevant to people of color).<\/span> In fact, it tends to inspire confidence in the judiciary when judges of different ideological persuasions agree on the disposition of a case.<sup class=\"modern-footnotes-footnote \" data-mfn=\"275\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-275\">275<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-275\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"275\"><em>See, e.g.<\/em>, Devin Dwyer, <em>Supreme Court Defies Critics with Wave of Unanimous Decisions<\/em>, ABC <span style=\"font-variant: small-caps\">News <\/span>(June 29, 2021), 
https:\/\/abcnews.go.com\/Politics\/supreme-court-defies-critics-wave-unanimous-decisions\/story?id=78463255 [https:\/\/perma.cc\/6TRW-RXBK] (quoting Kate Shaw: \u201cI suspect the justices feel there is value in conveying to the American people that in a hyper-partisan moment, . . . the court remains a largely nonpartisan institution.\u201d).<\/span><\/p>\n<p>Unikowsky\u2019s other suggestions come closer. \u201c[H]aving the AI highlight when a particular dispute would come out differently across different philosophies\u201d explicitly puts the diversity of judicial approaches into a single LLM\u2019s considerations; \u201cusing multiple independent AIs\u201d attempts to have a diversity of LLMs emulate a diversity of judges.<sup class=\"modern-footnotes-footnote \" data-mfn=\"276\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-276\">276<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-276\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"276\"><em>See In AI We Trust I<\/em>, <em>supra<\/em> note 5.<\/span> Explicitly spelling out these considerations, however, raises exactly the same questions we have been discussing\u2014what kinds of validation would it take to make us confident that the process of LLM adjudication sufficiently captures the process values of human adjudication?<\/p>\n<p>So far, this discussion has centered on the artifacts produced by legal reasoning. But generative-interpretation moderates do not propose replacing judges entirely. 
Instead, they suggest merely using LLMs as an alternative to common sense or a dictionary.<\/p>\n<p>There is a procedural oddity to judicial use of dictionaries: looking something up in a dictionary is in some respects a factual inquiry, yet one that judges perform <em>sua sponte<\/em>, outside of the ordinary adversarial process. The legitimacy of judicial use of external sources of fact during the interpretive process relies <em>even more heavily<\/em> on social norms and the judge\u2019s legal-reasoning process than does the legitimacy of using facts that have passed through the gauntlet of a trial. Judge Newsom complains that judges \u201crarely explain in any detail the process by which they selected one definition over others.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"277\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-277\">277<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-277\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"277\">Snell v. United Specialty Ins. Co., 102 F.4th 1208, 1229 (11th Cir. 2024) (Newsom, J., concurring).<\/span> Yet even without explanation, <em>the selection was a product of the judge\u2019s reasoning<\/em>.<\/p>\n<p>The counterintuitive upshot is that interpretation is one of the <em>last<\/em> places where LLM use is justified within the adjudicatory process. When juries produce a \u201cyes\u201d or \u201cno\u201d response, they provide no explanation. The fact that the jury produced the answer is the source of its legitimacy. In short, when explanations run thin and discretion peaks, procedural and institutional sources of legitimacy acquire special import. Textual interpretation is an adjudicative task where judges\u2019 social role has special importance.
LLM proponents suggest that computer programs might provide objective answers to subjective questions. Unilaterally abdicating the interpretive role in favor of an LLM (especially without substantial supporting evidence for those claims) actively subverts an important judicial responsibility.<\/p>\n<p><em>4. The Risk of Projection<\/em><\/p>\n<p>We make these points at length because of the danger that LLM users might unreflectingly project authority on to LLM outputs simply because the outputs sound right.<sup class=\"modern-footnotes-footnote \" data-mfn=\"278\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-278\">278<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-278\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"278\"><em>See<\/em> Joe McIntyre, <em>Pareidolic Illusions of Meaning: ChatGPT, Pseudolaw and the Triumph of Form over Substance <\/em>39\u201345 (Mar. 
16, 2025), https:\/\/papers.ssrn.com\/abstract=5181165 [https:\/\/perma.cc\/FMS6-2JGP].<\/span> Arbel and Hoffman mention the concern explicitly.<sup class=\"modern-footnotes-footnote \" data-mfn=\"279\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-279\">279<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-279\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"279\"><em>See Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 476 (\u201c[J]ust because the probabilities are reasonable doesn\u2019t mean they are accurate.\u201d).<\/span> And in both <em>Snell <\/em>and <em>Deleon<\/em>, Judge Newsom observed, \u201cI definitely didn\u2019t want to fall into the trap of embracing ChatGPT\u2019s definition just because it aligned with my priors.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"280\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-280\">280<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-280\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"280\"><em>Snell<\/em>, 102 F.4th at 1225 (Newsom, J., concurring); <em>see also <\/em>United States v. Deleon, 116 F.4th 1260, 1272 (11th Cir. 
2024) (Newsom, J., concurring).<\/span> Yet Judge Newsom wrote that he was \u201creassur[ed]\u201d after he employed Arbel and Hoffman\u2019s proposed technique for measuring LLMs\u2019 \u201cconfidence\u201d and found that he could produce similar answers by posing the same interpretive question multiple times and to multiple LLMs.<sup class=\"modern-footnotes-footnote \" data-mfn=\"281\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-281\">281<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-281\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"281\"><em>Deleon<\/em>, 116 F.4th at 1274 (Newsom, J., concurring) (\u201c[R]eassuringly, the 30 results I received\u201410 apiece from each of the three leading LLMs\u2014largely echoed the initial response that I got from ChatGPT.\u201d) (citing<em> Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 487\u201388); <em>see also Snell<\/em>, 102 F.4th at 1225, 1233 (Newsom, J., concurring) (proposing that users \u201ctry different prompts\u201d and \u201cquery multiple models to ensure that the results are consistent\u201d).<\/span><\/p>\n<p>Judge Newsom is right to identify the risk of projection, but the measures he adopts do not guard against it. His confidence checks do not demonstrate that he is not vesting LLMs with authority simply because their output matches his expectations. Rather, they demonstrate only that he is more comfortable vesting LLMs with authority when he thinks they are confidently and consistently producing output that matches his expectations. Indeed, he said as much: \u201cIf the model . . .
returned essentially consistent answers\u2014responses that coalesced around a common core\u2014then one could probably say, with a higher degree of confidence, that the model was getting at the \u2018correct\u2019 response.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"282\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-282\">282<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-282\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"282\"><em>Deleon<\/em>, 116 F.4th at 1274 (Newsom, J., concurring).<\/span> This framing assumes that confidence and consistency corroborate correctness. They do not.<sup class=\"modern-footnotes-footnote \" data-mfn=\"283\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-283\">283<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-283\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"283\"><em>See <\/em>McIntyre, <em>supra<\/em> note 278, at 45\u201346.<\/span> The confidence and consistency of a model\u2019s output bolster its probative value only when there is reason to believe that the model\u2019s determinations accurately reflect facts about linguistic meaning in the first place. 
And this latter belief is something generative-interpretation proponents justify, at least in part, with appeals to the superficial reasonableness and plausibility of LLM output.<sup class=\"modern-footnotes-footnote \" data-mfn=\"284\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-284\">284<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-284\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"284\"><em>See, e.g.<\/em>, <em>Snell<\/em>, 102 F.4th at 1228 (Newsom, J., concurring) (\u201cAnd as anyone who has used them can attest, modern LLMs\u2019 results are often sensible . . . . [LLMs] are[] high-octane language-prediction machines capable of probabilistically mapping, among other things, <em>how ordinary people use words and phrases in context<\/em>.\u201d (emphasis added)).<\/span> So long as they do so, the danger remains that users will project authority onto LLMs not because LLMs are objectively \u201caccurate,\u201d but because they like what they hear.<\/p>\n<p style=\"text-align: center\">D. 
<em>Relying on Predictability?<br \/><\/em><\/p>\n<p>The final general type of argument offered by LLM proponents is that LLMs are more <em>predictable<\/em> than human judges.<sup class=\"modern-footnotes-footnote \" data-mfn=\"285\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-285\">285<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-285\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"285\"><em>See<\/em> <em>Generative<\/em> <em>Interpretation<\/em>, <em>supra <\/em>note 2, at 509.<\/span> The importance of predictability to the rule of law is well established, and we need not dwell on why it is a worthwhile goal. Decades of legal scholarship highlight arbitrariness in current interpretive techniques.<sup class=\"modern-footnotes-footnote \" data-mfn=\"286\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-286\">286<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-286\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"286\"><em>See<\/em> <em>id.<\/em> at 464\u201366 (summarizing and collecting relevant literature); <em>see also<\/em> <em>In AI We Trust II<\/em>, <em>supra<\/em> note 5.<\/span> If anything, predictability looms even larger for LLM proponents than it does for other judges and scholars. 
Kieffaber, for example, argues that other goals for interpretation are \u201creally predictability in disguise.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"287\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-287\">287<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-287\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"287\">Kieffaber, <em>supra <\/em>note 116, at 12.<\/span> If LLMs provide consistent answers, they might improve judicial consistency and offer parties a reliable way to determine the meaning of terms without litigating. LLMs that sometimes produce arbitrary results might still improve upon the status quo if those results are predictable.<\/p>\n<p>LLM proponents offer, broadly, three reasons to think that LLMs are more predictable. First, there are claims that LLMs will be more predictable because they are algorithmic and automated, and thus not subject to human biases. But this is an unwarranted generalization about how computers work. Computations are typically predictable step-by-step: the results of each instruction a computer executes are formally specified and highly reliable. But the overall results of a computation need not be predictable in the slightest; indeed, because a computer can execute so many steps of an algorithm so quickly, it can produce outputs of astonishing complexity. This complexity is on full display for LLMs. Even compared with other types of programs, they are notoriously inscrutable. 
While scholars have begun to probe the internal representations employed by LLMs as they generate outputs, this work is still at a relatively early stage.<sup class=\"modern-footnotes-footnote \" data-mfn=\"288\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-288\">288<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-288\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"288\"><em>See, e.g.<\/em>, Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah &amp; Joshua Batson, <em>On the Biology of a Large Language Model<\/em>, <span style=\"font-variant: small-caps\">Anthropic<\/span> (Mar. 
27, 2025), https:\/\/transformer-circuits.pub\/2025\/attribution-graphs\/biology.html [https:\/\/perma.cc\/5U4J-F2QP].<\/span><\/p>\n<p>Second, there are claims that LLMs will be more predictable because they are trained on huge corpora of human-written text.<sup class=\"modern-footnotes-footnote \" data-mfn=\"289\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-289\">289<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-289\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"289\"><em>Generative Interpretation<\/em>, <em>supra <\/em>note 2, at 485 (referring to \u201cthe vast corpus on which [GPT-4] sits\u201d); <em>see also <\/em>David Hoffman, <em>Consumers\u2019 Unreasonable Textual Expectations<\/em>, 15 <span style=\"font-variant: small-caps\">Harv. Bus. L. Rev.<\/span> 43, 50 (2025) (\u201cThe gist of our claim [in <em>Generative Interpretation<\/em>] is that large language models are particularly good at offering majoritarian readings of phrases read in the context of contracts, and can probabilistically estimate what meaning the ordinary reader of that document will take.\u201d)<\/span> This argument fails for the same reason as claims that LLMs will be more accurate because they are trained on huge corpora of human-written text. The training process by itself guarantees nothing. 
Whether an LLM\u2019s outputs are predictable is an empirical question, and our empirical demonstrations in Part III show that frequently they are not.<\/p>\n<p>Third, there are claims that the low cost and high accessibility of LLMs will give people access to them in advance of litigation, so that they can predict how courts or arbitrators would rule.<sup class=\"modern-footnotes-footnote \" data-mfn=\"290\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-290\">290<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-290\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"290\"><em>See, e.g.<\/em>, Gandall et al., <em>supra<\/em> note 30, at 4\u20135.<\/span> This claim does not depend on accuracy: the LLM need not be correct to be predictable; all that is required is that parties can obtain the same results ex ante by consulting the LLM as judges would obtain ex post by consulting it. The fly in the ointment is that for parties to predict outcomes they need to anticipate courts\u2019 prompting and configuration choices\u2014many of which are dispute-specific. The more sensitive a method of generative interpretation is to those choices, the more precisely parties must predict those choices. If parties must negotiate or litigate dispute-specific choices to achieve predictable results, the efficiency benefits of generative interpretation shrink. And if those negotiations now require access to a new, expensive kind of technical expertise, generative interpretation might simply exacerbate the challenges indigent parties face in gaining access to justice.<\/p>\n<p>Finally, it is important to remember that predictability alone is not sufficient for legal legitimacy. 
An interpretive method that always reads text in the way that most favors the party with the longer name is cheap, accessible, <em>and <\/em>predictable. It would improve access and reduce costs. Yet we suspect most readers would agree that name length is not a legitimate factor in adjudicatory reasoning.<sup class=\"modern-footnotes-footnote \" data-mfn=\"291\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-291\">291<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-291\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"291\"><em>Cf.<\/em> Brainerd Currie, <em>Conflict, Crisis, and Confusion in New York<\/em>, 1963 <span style=\"font-variant: small-caps\">Duke L.J.<\/span> 1, 11\u201313 (1963) (arguing that if uniformity and predictability were all that mattered, a choice of law rule selecting Alaska law for all disputes, whether or not they had anything to do with Alaska, would be ideal).<\/span><\/p>\n<h4 style=\"text-align: center\">VI. <span style=\"font-variant: small-caps\">Reworking Generative Interpretation<br \/><\/span><\/h4>\n<p>We aren\u2019t writing to announce unbridled AI pessimism. Our concern is that generative-interpretation proponents\u2019 influential proposals treat LLM output as probative of facts that they do not actually signify. Although these particular uses of LLMs are problematic\u2014or rather, the problems they raise have not yet been surmounted\u2014the good news is that there are many legal applications where LLMs can realize these benefits without running into the same problems.<\/p>\n<p style=\"text-align: center\"><em>A. 
How AI-Generated Artifacts Can Acquire Normative Weight<\/em><\/p>\n<p>Broadly speaking, we see three ways in which LLMs may be fit for use in the legal process: for their purely rhetorical value, when they have been experimentally validated, or when they are subject to adversarial testing.<\/p>\n<p>1. <em>Rhetorical Value<\/em><\/p>\n<p>LLM proponents often appeal to AI\u2019s strength as a rhetorician\u2014and sometimes being rhetorically effective is exactly what one wants, and <em>all<\/em> that one wants. Unikowsky, for example, writes that after he presented Claude with the merits-stage briefs of adjudicated cases from the Supreme Court\u2019s most recent term, Claude decided all but ten of them \u201cin the same way the Supreme Court did. In the other 10 . . . , I frequently was more persuaded by Claude\u2019s analysis than the Supreme Court\u2019s.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"292\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-292\">292<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-292\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"292\"><em>In AI We Trust II<\/em>, <em>supra<\/em> note 5.<\/span> If Claude is as persuasive a legal writer as Unikowsky suggests, we see no downsides to harnessing LLMs\u2019 rhetorical power <em>per se<\/em>; used in that capacity, LLMs are legitimate assets to the legal profession. Lawyers could ask LLMs to generate persuasive arguments for their clients\u2019 positions, then personally ensure that these outputs adhere to the other formal requirements of legal briefing. 
And lawyers and judges alike could ask LLMs to generate compelling counterarguments to their conclusions, then use those counterpoints to strengthen their briefs and opinions.<sup class=\"modern-footnotes-footnote \" data-mfn=\"293\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-293\">293<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-293\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"293\"><em>See, e.g.<\/em>, Andrew Coan &amp; Harry Surden, <em>Artificial Intelligence and Constitutional Interpretation<\/em>, 96 <span style=\"font-variant: small-caps\">U. Colo. L. Rev.<\/span> 413, 482 (2025) (\u201cAnother benefit of LLMs in legal decision-making is their ability to provide multiple perspectives when prompted appropriately.\u201d).<\/span> Deployed for their rhetorical power, LLMs might be valuable tools to help judges and advocates strengthen their reasoning. Our argument here lines up with Waldon et al.\u2019s recommendation for \u201cdialectical legal AI,\u201d which similarly endorses LLM usage to \u201cgenerate novel perspectives that challenge or expand a judge\u2019s theory of a disputed legal text,\u201d but not the \u201cdelegat[ion] [of] . . . authority to the model.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"294\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-294\">294<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-294\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"294\">Waldon et al., <em>supra<\/em> note 16, at 53\u201355.<\/span><\/p>\n<p>To return to our fair-coin analogy one last time, consider the practice of tossing a coin to make a major decision. Deciding to follow the coin\u2019s \u201cchoice\u201d is arbitrary.<sup class=\"modern-footnotes-footnote \" data-mfn=\"295\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-295\">295<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-295\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"295\">A surprising number of judges have been sanctioned for decision-by-coin-flip. <em>E.g.<\/em>, In re Daniels, 340 So. 2d 301, 309 (La. 1976) (sanctions for pretending to decide guilt or innocence of criminal defendants by coin-flip); In re Application of Friess, 91 A.D.2d 554, 556 (N.Y. App. Div. 1982) (sanctions for determining sentence-length by coin-flip); In re Brown, 662 N.W.2d 733, 733 (Mich. 2003) (sanctions for deciding which parent has child custody on Christmas by coin-flip); Jud. Inquiry &amp; Rev. Comm\u2019n v. Shull, 651 S.E.2d 648, 659\u201360 (Va. 2007) (sanctions for determining child visitation rights by coin-flip). Some legal theorists have considered the (narrow) circumstances under which randomness is preferable to human judgment. <em>See, e.g.<\/em>, Bernard E. Harcourt, <em>Post-Modern Meditations on Punishment<\/em>, <em>in<\/em> <span style=\"font-variant: small-caps\">Criminal Law Conversations<\/span> 163, 165, 167\u201368 (Paul H. Robinson, Stephen Garvey &amp; Kimberly Kessler Ferzan eds., 2009) (\u201c[T]here always came a moment when the empirical facts ran out . . . <em>and yet the reasoning continued<\/em>. There was always this moment when the moderns\u2014those paragons of reason\u2014took a leap of faith. . . . 
Where our social scientific theories run out, where our principles run dry, we should leave the decision-making to chance. We should no longer take that leap of faith, but turn instead to the coin toss.\u201d); <span style=\"font-variant: small-caps\">Jeph Loeb &amp; Tim Sale<\/span>, <span style=\"font-variant: small-caps\">Batman: The Long Halloween<\/span> 330\u201331 (<span style=\"font-variant: small-caps\">DC Comics, <\/span>1998) (depicting comic book villain Two-Face arguing that coin flips are normatively preferable to courts or legislatures because random chance is impartial); Adam M. Samaha, <em>Randomization and Adjudication<\/em>, 51 <span style=\"font-variant: small-caps\">Wm<\/span>. &amp; <span style=\"font-variant: small-caps\">Mary L. Rev.<\/span> 1, 34 (2009) (\u201cIn a subset of these cases, however small, randomization will be the theoretically superior option for reasons of practical indivisibility, equality norms, nagging uncertainty, incentive effects, and\/or experimental value. That we have difficulty identifying this class of cases with precision is no reason to think it is an empty set.\u201d); <em>see generally<\/em> Ronen Perry &amp; Tal Zarsky, <em>\u201cMay the Odds Be Ever in Your Favor\u201d: Lotteries in Law<\/em>, 66 <span style=\"font-variant: small-caps\">Ala. L. Rev<\/span>. 1035 (2015) (examining the role of randomization in law).<\/span> But tossing the coin and introspecting on whether the outcome feels right or wrong is a popular trick for getting insight into your internal compass. 
In that case, the coin toss\u2019s outcome is arbitrary, but the insights it generates are not.<sup class=\"modern-footnotes-footnote \" data-mfn=\"296\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-296\">296<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-296\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"296\"><em>See <\/em>Mariela E. Jaff\u00e9, Leonie Reutner &amp; Rainer Greifeneder, <em>Catalyzing Decisions: How a Coin Flip Strengthens Affective Reactions<\/em>, <span style=\"font-variant: small-caps\">PLOS ONE<\/span>, Aug. 2019, at 13 (explaining that \u201ca decision aid [such as a coin flip] can function as a catalyst in the decision-making process\u201d); Mariela E. Jaff\u00e9 &amp; Rainer Greifeneder, <em>Deciding Advantageously after Flipping a Coin<\/em>, 223 <span style=\"font-variant: small-caps\">Acta Psychologica<\/span>, Mar. 2022, at 6 (finding that the coin-flip technique may improve decision-making under certain controlled conditions).<\/span> And, just as importantly, the coin\u2019s value as a tool to probe one\u2019s own thoughts does not depend on whether the coin actually is unbiased.<\/p>\n<p>Of course, rhetorical use of AI isn\u2019t risk-free\u2014but the principal risk is that rhetorical use will invite unfounded assumptions about LLMs\u2019 authority. For example, a lawyer or a judge may ask an LLM to produce competing arguments about an issue in a case. Any of the arguments that the LLM produces can help its interlocutor think through the issues and refine a brief or an opinion. The LLM user may, however, find that the LLM is able to argue for one result, A, more persuasively than for a competing result, B. 
If the LLM user takes this behavior as evidence that the LLM \u201cbelieves\u201d in the result it argued for more persuasively, and thus that A is correct and B is incorrect, then the user has stopped using the LLM as a rhetorical tool and has instead begun to defer to it, unfoundedly, as an epistemic authority. An LLM might craft a more persuasive argument for A than B for any number of arbitrary reasons, and the arguments that an LLM produces are not perforce the strongest ones. That an LLM argued for A more persuasively than B is not evidence that the strongest argument for A beats the strongest argument for B, and it would be an error to interpret it as such.<\/p>\n<p>But evaluating arguments is precisely what lawyers and judges know how to do best. So long as they approach LLMs as rhetoricians\u2014not as oracles whose minds need to be read\u2014lawyers and judges can use LLMs\u2019 powers of persuasion to test their own reasoning and make their writing more persuasive.<\/p>\n<p>2. <em>Experimental Validation<\/em><\/p>\n<p>Limiting LLM use to purely rhetorical applications would forfeit one of the most exciting potential benefits of AI: using automation to reduce judicial and administrative backlogs. By limiting our qualitative assumptions about LLM-produced text to those that are susceptible to objective benchmarks, we can expand the set of situations in which LLM use is justifiable. These applications of AI do not involve the splashy, novel issues of interpretation that make for exciting technical demonstrations. Instead of answering unfamiliar interpretive questions, LLMs can assist with the exact opposite: run-of-the-mill cases, where fact patterns rarely deviate from a handful of common situations, and voluminous precedent provides ready-made benchmarks for testing predictive accuracy.<\/p>\n<p>In these cases, inductively validating the factual accuracy of LLM outputs is well within the technical capabilities of most social scientists. 
For example, an AI that can detect choice-of-venue clauses in a contract, or specific, common but invalid clauses in rental agreements from certain jurisdictions, could be tested against a curated list of similar agreements. A tool that performs well against that benchmark might be able to organize dockets and direct judicial attention in ways that increase efficiency without supplanting any part of the adjudicatory process in any individual case. Similarly, AIs could be tested on their ability to detect or correct procedural deficiencies in court and agency filings, allowing petitioners to amend those filings without an extra visit to the courthouse.<\/p>\n<p>The common thread across these examples is <em>verifiability<\/em>. By using LLMs to perform rote tasks with verifiable results, we can establish ex ante the likelihood that those results are accurate and can monitor ex post whether the system remains accurate over time. With well-designed and continually retested benchmarks, LLMs could automate portions of the legal process while conforming to the same principles and standards used for other kinds of scientifically verified mechanization and measurement.<sup class=\"modern-footnotes-footnote \" data-mfn=\"297\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-297\">297<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-297\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"297\">We do not mean to imply that all the questions surrounding the use of scientific and empirical evidence in courts are settled. 
Just that these situations\u2014unlike blind faith in an LLM\u2019s output\u2014do not require courts to make any <em>new<\/em> descriptive or epistemic assumptions.<\/span><\/p>\n<p>Inductively validating LLMs is at least a tractable problem; this is a kind of science that computational legal scholars know how to do. But a comparison with immensely simpler and yet still highly controversial empirical interpretive projects\u2014trademark surveys and corpus linguistics\u2014shows that here the mountains are high and we are barely in the foothills.<sup class=\"modern-footnotes-footnote \" data-mfn=\"298\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-298\">298<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-298\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"298\"><em>See generally, e.g.<\/em>, Barton Beebe &amp; Jeanne C. Fromer, <em>Are We Running Out of Trademarks? An Empirical Study of Trademark Depletion and Congestion<\/em>, 131 <span style=\"font-variant: small-caps\">Harv. L. Rev. <\/span>945 (2018) (thoughtful empirical study of trademark depletion); Jens Frankenreiter &amp; Michael A. Livermore, <em>Computational Methods in Legal Analysis<\/em>, 16 <span style=\"font-variant: small-caps\">Ann. Rev. L. &amp; Soc. Sci. 
<\/span>39 (2020) (overview of computational textual methods).<\/span> Effective legal benchmarks would need to provide a set of tests representative of the variety and complexity of interpretation tasks that face courts, and would need some well-defined \u201cground truth\u201d to measure against.<sup class=\"modern-footnotes-footnote \" data-mfn=\"299\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-299\">299<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-299\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"299\"><em>See generally<\/em> Michael J. Hasday, <em>Accuracy and the Robot Judge<\/em>, 25 <span style=\"font-variant: small-caps\">J. App. Prac. &amp; Process<\/span> 1 (2025) (arguing that AI judges could be benchmarked by comparing their decisions to decisions made by human judges). <\/span><\/p>\n<p>Today\u2019s high-profile legal-reasoning benchmarks are far from representative.<sup class=\"modern-footnotes-footnote \" data-mfn=\"300\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-300\">300<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-300\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"300\"><em>See<\/em> Florencia Marotta-Wurgler &amp; David Stein, <em>Building a Long Text Privacy Policy Corpus with Multi-Class Labels<\/em>, 63 <span style=\"font-variant: small-caps\">Proc. Assoc. 
Computational Linguistics <\/span>8156, 8156 (2025).<\/span> An LLM\u2019s ability to pass the bar exam, for instance, is impressive but uninformative.<sup class=\"modern-footnotes-footnote \" data-mfn=\"301\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-301\">301<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-301\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"301\"><em>See<\/em> Will Douglas Heaven, <em>AI hype is built on high test scores. Those tests are flawed<\/em>, MIT <span style=\"font-variant: small-caps\">Tech. Rev. (<\/span>Aug. 30, 2023), https:\/\/www.technologyreview.com\/2023\/08\/30\/1078670\/large-language-models-arent-people-lets-stop-testing-them-like-they-were\/ [https:\/\/perma.cc\/8756-7LDZ] (discussing how to evaluate AI test results).<\/span> The bar exam is designed to test humans, and it assumes that test-takers are capable of lateral applications of knowledge. No LLM has been shown to be capable of that kind of lateral \u201cthinking.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"302\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-302\">302<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-302\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"302\"><em>See<\/em> Marotta-Wurgler &amp; Stein, <em>supra<\/em> note 300, at 8163\u201364 (\u201cAt least for the systems we tested, an LLM\u2019s ability to answer one legal question appears to not be predictive of that LLM\u2019s ability to answer other questions, even within extremely narrow domains . . . 
.\u201d).<\/span> Unless courts need an LLM to do <em>exactly <\/em>what a benchmark tests (e.g., answer bar exam questions from the same corpus, using the same format, writing style, level of complexity, and subject matter), that benchmark is not necessarily predictive of LLM accuracy. Though initial work on building better benchmarks is underway, those efforts currently focus on questions with well-defined right answers\u2014like checking outputs for rhetorical and factual consistency.<sup class=\"modern-footnotes-footnote \" data-mfn=\"303\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-303\">303<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-303\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"303\"><em>See generally <\/em>Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher R\u00e9, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan H. Choi, Kevin Tobia, Margaret Hagan, Megan Ma, Michael Livermore, Nikon Rasumov-Rahe, Nils Holzenberger, Noam Kolt, Peter Henderson, Sean Rehaag, Sharad Goel, Shang Gao, Spencer Williams, Sunny Gandhi, Tom Zur, Varun Iyer &amp; Zehua Li, <span style=\"font-variant: small-caps\"><em>LegalBench<\/em><\/span>: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models <span style=\"font-variant: small-caps\">(<\/span>Aug. 
23, <span style=\"font-variant: small-caps\">2023) <\/span>https:\/\/arxiv.org\/pdf\/2308.11462 [https:\/\/perma.cc\/K7G5-WX8M] (describing <span style=\"font-variant: small-caps\">LegalBench<\/span>, presenting an empirical evaluation of 20 LLMs, and illustrating the types of research explorations <span style=\"font-variant: small-caps\">LegalBench<\/span> enables).<\/span> Subjective tasks\u2014like the interpretation of ambiguous or contested terms\u2014have no well-defined right answers to test against.<sup class=\"modern-footnotes-footnote \" data-mfn=\"304\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-304\">304<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-304\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"304\">Again, there is some nascent progress here; some recent benchmarks take initial steps towards measuring an LLM\u2019s ability to navigate the grey areas in legal interpretation. <em>See, e.g.<\/em>, Marotta-Wurgler &amp; Stein, <em>supra<\/em> note 300, at 8164.<\/span><\/p>\n<p>3. <em>Adversarial Testing<\/em><\/p>\n<p>When a judge uses an LLM, they are recruiting an expert without the <em>Daubert<\/em> rule, without impeachment, without a jury, without the adversarial checks on which our legal system relies. By contrast, if a litigant introduces LLM outputs as evidence, they are forced to justify that use in a way that can survive evidentiary checks and adversarial challenge. We take no similar issue with litigants\u2019 use of LLM outputs: unlike in generative interpretation, where parties have no opportunity to challenge the factual assumptions judges make about LLM-generated text, the opposing party can probe those assumptions directly. 
In those cases, the trial process <em>is<\/em> the verification process.<sup class=\"modern-footnotes-footnote \" data-mfn=\"305\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-305\">305<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-305\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"305\"><em>Cf. Generative Interpretation<\/em>, <em>supra<\/em> note 2, at 506\u201309 (\u201c[Judges will] want to be careful about parties\u2019 manipulative [use of LLMs] . . . .\u201d).<\/span><\/p>\n<p style=\"text-align: center\">B. <em>Two Examples<\/em><\/p>\n<p>Two other approaches to interpretation\u2014trademark surveys and corpus linguistics\u2014show what it takes to make empirical measurements of linguistic meaning normatively legitimate in adjudication. In each case, a field of expert specialists has grappled seriously with the problem of providing both descriptive and epistemic justifications for their methods. They have provided descriptive justifications by developing a consistent and relatively standardized set of methods; they have provided epistemic justifications by being precise about what their methods measure and modest in their ambitions. They show a possible road forward for generative interpretation, but that road is a long and hard one.<\/p>\n<p>1. 
<em>Trademark Surveys<\/em><\/p>\n<p>Survey evidence is broadly accepted and widely used in trademark and false advertising cases.<sup class=\"modern-footnotes-footnote \" data-mfn=\"306\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-306\">306<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-306\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"306\"><em>See <\/em>5 <span style=\"font-variant: small-caps\">J. Thomas McCarthy, McCarthy on Trademarks and Unfair Competition<\/span> \u00a7 32:158 (5th ed. 2025) (\u201cSurvey Evidence is Routinely Received and Weighed by the Courts.\u201d); Shari Seidman Diamond &amp; David J. Franklyn, <em>Trademark Surveys: An Undulating Path<\/em>, <span style=\"font-variant: small-caps\">92 Tex. L. Rev.<\/span> 2029, 2040 (2014) (noting the presence of over 1000 trademark surveys in reported cases between 1991 and 2012).<\/span> Many core issues in trademark litigation involve questions of how consumers understand words or other marks, including whether a mark describes a product feature or the source of the goods, whether the defendant\u2019s use of a mark refers to their own goods or to the plaintiff\u2019s, and what a challenged statement conveys to consumers. These questions are all fundamentally empirical. While a judge\u2019s intuitions about language may play a role, the legal question is not whether \u201cFish-Fri\u201d in the abstract uniquely refers to the plaintiff\u2019s coating mix or to any coating mix for frying fish, but whether <em>consumers understand <\/em>\u201cFish-Fri\u201d narrowly to refer to the plaintiff\u2019s coating mix or broadly to refer to any coating mix. 
Indeed, even evidence that specific individual consumers have been confused about the source of the goods they are buying is only one factor in the tests used to ask whether consumers in general are likely to be confused.<sup class=\"modern-footnotes-footnote \" data-mfn=\"307\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-307\">307<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-307\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"307\"><em>See, e.g.<\/em>, JL Beverage Co., LLC v. Jim Beam Brands Co., 828 F.3d 1098, 1106 (9th Cir. 2016) (reciting an \u201ceight-factor\u201d test for likelihood of confusion, one factor of which is \u201cevidence of actual confusion\u201d).<\/span> The interpretive community of consumers is sovereign in trademark law.<sup class=\"modern-footnotes-footnote \" data-mfn=\"308\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-308\">308<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-308\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"308\"><em>See <\/em>Stanley E. 
Fish, <em>Interpreting the <\/em>Variorum, 2 <span style=\"font-variant: small-caps\">Critical Inquiry<\/span> 465, 483\u201384 (1976) (arguing that the meaning of text is created by interpretive communities).<\/span><\/p>\n<p>To establish descriptive validity, judges, scholars, and experts have built up an extensive body of standardized best practices for trademark surveys.<sup class=\"modern-footnotes-footnote \" data-mfn=\"309\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-309\">309<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-309\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"309\"><em>See generally, e.g.<\/em>, <span style=\"font-variant: small-caps\">James T. Berger &amp; R. Mark Halligan, Trademark Surveys: A Litigator\u2019s Guide<\/span> (2023); <span style=\"font-variant: small-caps\">Trademark and Deceptive Advertising Surveys: Law, Science, and Design<\/span> (Shari Seidman Diamond &amp; Jerre B. Swann eds., 2d ed. 2022); <span style=\"font-variant: small-caps\">Jacob Jacoby, Trademark Surveys, Volume 1: Designing, Implementing, and Evaluating Surveys<\/span> (2015).<\/span> A typical survey design must consider how to identify consumers in the relevant market, how to present the marks to them, how and in what order to ask questions about their perceptions of those marks, what control questions to include to establish a relevant baseline, and what statistical analyses to apply to the raw data of participants\u2019 answers. All of these constrain the process so that it is more reliable and replicable. 
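The statistical step at the end of such a survey protocol is itself standardized. The sketch below is purely illustrative (the respondent counts are invented, and real experts layer on sampling design and more careful inference): it computes "net confusion"\u2014the test-cell rate minus the control-cell rate, with a normal-approximation confidence interval for the difference\u2014the baseline-adjusted figure survey experts typically report.

```python
from math import sqrt

def net_confusion(test_confused, test_n, control_confused, control_n, z=1.96):
    """Net confusion: test-cell confusion rate minus control-cell rate,
    with a normal-approximation 95% confidence interval for the difference."""
    p1 = test_confused / test_n
    p2 = control_confused / control_n
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / test_n + p2 * (1 - p2) / control_n)
    return diff, (diff - z * se, diff + z * se)

# Hypothetical numbers: 120 of 400 test respondents name the plaintiff as the
# source of the defendant's product; 40 of 400 do so for a control stimulus.
diff, (ci_lo, ci_hi) = net_confusion(120, 400, 40, 400)
print(f"net confusion {diff:.1%}, 95% CI ({ci_lo:.1%}, {ci_hi:.1%})")
# -> net confusion 20.0%, 95% CI (14.6%, 25.4%)
```

The control cell is what does the epistemic work here: it subtracts out respondents who would have answered "confused" regardless of the challenged mark.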
Even so, litigated trademark cases frequently feature the parties\u2019 trying to impeach each other\u2019s surveys, requiring judges to probe the fine details of how questions are worded.<sup class=\"modern-footnotes-footnote \" data-mfn=\"310\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-310\">310<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-310\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"310\"><em>See, e.g.<\/em>, Novartis Consumer Health, Inc. v. Johnson &amp; Johnson-Merck Consumer Pharmaceuticals Co., 290 F.3d 578, 590\u201395 (3d Cir. 2002) (considering whether survey questions were improperly leading). Scholars continue to propose refinements to survey methodology. <em>See, e.g.<\/em>, Irina D. Manta, <em>In Search of Validity: A New Model for the Content and Procedural Treatment of Trademark Infringement Surveys<\/em>, 24 <span style=\"font-variant: small-caps\">Cardozo Arts &amp; Ent. L.J.<\/span> 1027, 1065\u201370 (2007) (proposing new quality standards for product stimuli, control stimuli, survey questions, and study environments, coupled with judicial pre-clearance of surveys).<\/span><\/p>\n<p>As for epistemic validity, it is striking how modest judicial reliance on trademark surveys is, notwithstanding the immense effort and expense that go into them. 
Every circuit has a detailed multifactor test for consumer confusion, including factors that are not susceptible to surveying, such as the defendant\u2019s good or bad faith and the quality of their products.<sup class=\"modern-footnotes-footnote \" data-mfn=\"311\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-311\">311<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-311\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"311\"><em>See generally <\/em>Barton Beebe, <em>An Empirical Study of the Multifactor Tests for Trademark Infringement<\/em>, 94 <span style=\"font-variant: small-caps\">Calif. L. Rev.<\/span> 1581 (2006) (discussing multifactor tests carried out by different circuits).<\/span> Judges admit but discount survey evidence when they have concerns about its methodology,<sup class=\"modern-footnotes-footnote \" data-mfn=\"312\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-312\">312<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-312\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"312\"><em>See, e.g.<\/em>, Jordache Enters., Inc. v. Hogg Wyld, Ltd., 828 F.2d 1482, 1487\u201388 (10th Cir. 
1987).<\/span> and they weigh survey evidence against other sources of evidence on consumer understandings.<sup class=\"modern-footnotes-footnote \" data-mfn=\"313\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-313\">313<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-313\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"313\"><em>Cf., e.g.<\/em>, Dan Sarel &amp; Howard Marmorstein, <em>The Effect of Consumer Surveys and Actual Confusion Evidence in Trademark Litigation: An Empirical Assessment<\/em>, 99 <span style=\"font-variant: small-caps\">Trademark Rep.<\/span> 1416, 1430 (2009) (\u201c[I]n the aggregate[,] the odds of winning improve only slightly with the presentation of actual confusion or survey evidence.\u201d).<\/span> In short, the legal system treats all of the issues that trademark surveys speak to as discrete factual questions about the beliefs of actual consumers. Surveys are a powerful and potentially probative source of evidence about those discrete factual questions, but they are not regarded as conclusive even on those questions, let alone on an entire case. Scholars continue to offer powerful critiques that question whether trademark surveys actually measure what they purport to.<sup class=\"modern-footnotes-footnote \" data-mfn=\"314\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-314\">314<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-314\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"314\"><em>See, e.g.<\/em>, Barton Beebe, Roy Germano, Christopher Jon Sprigman &amp; Joel H. 
Steckel, <em>Consumer Uncertainty in Trademark Law: An Experimental Investigation<\/em>, 72 <span style=\"font-variant: small-caps\">Emory L.J.<\/span> 489, 543 (2023); Sepehr Shahshahani &amp; Maggie Wittlin, <em>The Missing Element in Trademark Infringement<\/em>, 110 <span style=\"font-variant: small-caps\">Iowa L. Rev.<\/span> 1247, 1251 (2025).<\/span><\/p>\n<p>2. <em>Corpus Linguistics<\/em><\/p>\n<p>If trademark surveys are a mature empirical interpretive tool, corpus linguistics is in the process of attempting to become one. Corpus methods are well-established in linguistics, where they provide new sources of data on large-scale patterns of linguistic use. Like generative interpretation, corpus linguistics and law developed from the combined efforts of academics<sup class=\"modern-footnotes-footnote \" data-mfn=\"315\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-315\">315<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-315\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"315\"><em>See generally <\/em>Stephen C. Mouritsen, <em>The Dictionary Is Not a Fortress: Definitional Fallacies and a Corpus-Based Approach to Plain Meaning<\/em>, 2010 <span style=\"font-variant: small-caps\">BYU L. Rev.<\/span> 1915; Stephen C. Mouritsen, <em>Contract Interpretation with Corpus Linguistics<\/em>, 94 <span style=\"font-variant: small-caps\">Wash. L. Rev<\/span>. 1337 (2019); James C. Phillips, Daniel M. Ortner &amp; Thomas R. 
Lee, <em>Corpus Linguistics and Original Public Meaning: A New Tool to Make Originalism More Empirical<\/em>, 126 <span style=\"font-variant: small-caps\">Yale L.J.F.<\/span> 21 (2016).<\/span> and judges.<sup class=\"modern-footnotes-footnote \" data-mfn=\"316\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-316\">316<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-316\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"316\"><em>See, e.g.<\/em>, State v. Rasabout, 2015 UT 72, 356 P.3d 1258, 1275\u201390 (Lee, A.C.J., concurring in part and concurring in the judgment); Wilson v. Safelite Grp., Inc., 930 F.3d 429, 439 (6th Cir. 2019) (Thapar, J., concurring in part and concurring in the judgment); Richards v. Cox, 2019 UT 57, 450 P.3d 1074; People v. Harris, 885 N.W.2d 832 (Mich. 2016); Facebook, Inc. v. Duguid, 592 U.S. 395, 410\u201311 (2021) (Alito, J., concurring in the judgment).<\/span> The basic idea is to take a large corpus of existing linguistic data\u2014a collection of texts written by actual humans, such as the Corpus of Contemporary American English (1.1 billion words from 1990 to 2010)\u2014and run an algorithm over it to obtain an answer to an interpretive question of interest.<\/p>\n<p>Again, consider this empirical tool through the lenses of descriptive and epistemic validity. A linguistic corpus by itself says nothing; it is simply a collection of things people have written or said. To use a corpus to answer specific interpretive questions\u2014e.g., does a person \u201cdischarge\u201d a weapon once for each shot, or once for a group of shots fired in close succession?\u2014one must find a way to interpret the corpus. In this example, from <em>State v. 
Rasabout<\/em>,<sup class=\"modern-footnotes-footnote \" data-mfn=\"317\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-317\">317<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-317\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"317\">2015 UT 72, 356 P.3d 1258.<\/span> Associate Chief Justice Lee used a collocation analysis. He identified 86 cases in which \u201cdischarge\u201d was used within five words of a term relating to a firearm; of them, 27 said or implied that a \u201cdischarge\u201d was a single shot, and only 1 referred to a group of shots.<sup class=\"modern-footnotes-footnote \" data-mfn=\"318\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-318\">318<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-318\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"318\"><em>Id.<\/em> \u00b6\u00b6 88\u201391, 356 P.3d at 1281\u201382 (Lee, J., concurring in part and concurring in the judgment).<\/span><\/p>\n<p>This is a specific empirical protocol, and like any empirical protocol, it can be challenged on grounds of unreliability. 
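Stripped to its essentials, this kind of collocation analysis is a mechanical search-and-count whose hits a human coder then classifies by sense. The sketch below is a toy illustration, not Justice Lee\u2019s actual query: the three-sentence mini-corpus, the firearm term list, and the five-word window are our own assumptions.

```python
import re

FIREARM_TERMS = {"gun", "firearm", "weapon", "pistol", "rifle"}
WINDOW = 5  # words on either side, mirroring a five-word collocation window

def collocations(corpus, keyword="discharge"):
    """Return (doc_index, context) for each keyword token that has a
    firearm-related term within WINDOW words of it."""
    hits = []
    for i, text in enumerate(corpus):
        words = re.findall(r"[a-z]+", text.lower())
        for j, w in enumerate(words):
            if w.startswith(keyword):  # matches discharge, discharged, ...
                nearby = words[max(0, j - WINDOW):j] + words[j + 1:j + 1 + WINDOW]
                if FIREARM_TERMS & set(nearby):
                    context = " ".join(words[max(0, j - WINDOW):j + 1 + WINDOW])
                    hits.append((i, context))
    return hits

# Invented mini-corpus for illustration only.
corpus = [
    "He discharged the weapon once into the air.",
    "The hospital discharged the patient after surgery.",
    "Officers heard the discharge of a rifle nearby.",
]
for doc, ctx in collocations(corpus):
    print(doc, ctx)  # the firearm-related uses in documents 0 and 2
```

A real analysis runs over hundreds of concordance lines, and the contested step\u2014deciding whether each hit uses \u201cdischarge\u201d to mean a single shot or a volley\u2014remains a human judgment call.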
Scholars like Kevin Tobia have done significant empirical work questioning whether corpus-linguistic methods for legal interpretation are replicable in the scientific sense; Tobia found that they are \u201csurprisingly unreliable.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"319\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-319\">319<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-319\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"319\">Tobia, <em>supra<\/em> note 223, at 799.<\/span> Prominent proponents have responded in defense of their methods, and the debate continues in the law-review pages.<sup class=\"modern-footnotes-footnote \" data-mfn=\"320\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-320\">320<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-320\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"320\"><em>See, e.g.<\/em>, Thomas R. Lee &amp; Stephen C. Mouritsen, <em>The Corpus and the Critics<\/em>, 88 <span style=\"font-variant: small-caps\">U. Chi. L. Rev.<\/span> 275, 279 (2021); Kevin Tobia, <em>The Corpus and the Courts<\/em>,<span style=\"font-variant: small-caps\"> U. Chi. L. Rev. 
Online<\/span> (2021), https:\/\/lawreview.uchicago.edu\/online-archive\/corpus-and-courts [https:\/\/perma.cc\/58XE-X6DX].<\/span> This back-and-forth is a normal part of the scholarly process; the point is simply that it is also a necessary part of establishing an empirical method as reliable in the sense required for use in adjudication.<sup class=\"modern-footnotes-footnote \" data-mfn=\"321\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-321\">321<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-321\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"321\">Of course, it is possible that this process could cut in favor of generative interpretation compared with other empirical interpretive methods. Tobia\u2019s surveys found that dictionaries and corpus linguistics did a worse job of tracking lay survey participants\u2019 assessments of linguistic meaning; judges who did not use these tools came closer to the participants\u2019 assessments. Tobia, <em>supra<\/em> note 223, at 753\u201377. It is entirely plausible that generative interpretation could outperform dictionaries and corpus linguistics, precisely because generative systems are engineered to emulate usage patterns in their training data. But this is a fundamentally empirical question, and note what could be measured. No experiment can determine whether a generative method yields correct results, because there is no accessible source of ground truth for legal meaning. (If there were, then legal interpretation would be unnecessary, and so would generative interpretation.) 
All that can be measured, even in principle, is the extent to which generative methods <em>track<\/em> other measurements of meaning, such as dictionaries, survey participants, and judges\u2019 opinions.<\/span><\/p>\n<p>Note, by way of comparison, how much more tightly constrained the task specification is in corpus linguistics than in generative interpretation. The algorithms used in contemporary legal corpus linguistics are straightforward: they can generally be specified in a single search query or a few lines of code, and they have broadly accepted common implementations. In contrast, generative methods are remarkably open-ended. One must write prompts (a natural-language task!), give them to a model whose internal functioning and semantic correlations are still only poorly understood, and then often interpret the outputs as well (another natural-language task!). Similarly, there are only a handful of corpora in wide use for corpus-linguistics purposes in law, while there are dozens of plausible models one could consult for generative interpretation, and more are being introduced constantly.<sup class=\"modern-footnotes-footnote \" data-mfn=\"322\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-322\">322<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-322\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"322\">Peter Henderson, Daniel E. Ho, Andrea Vallebueno &amp; Cassandra Handan-Nader, <em>Corpus Enigmas and Contradictory Linguistics: Tensions Between Empirical Semantic Meaning and Judicial Interpretation<\/em>, 25 <span style=\"font-variant: small-caps\">Minn. J.L. Sci. 
&amp; Tech.<\/span> 127, 153 &amp; n.95 (2024).<\/span> Indeed, two of legal corpus linguistics\u2019 most prominent advocates, Thomas Lee and Jesse Egbert, have criticized generative interpretation because it lacks \u201c<em>transparency <\/em>in the methods used and <em>replicability <\/em>and <em>generalizability <\/em>in the findings that are generated.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"323\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-323\">323<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-323\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"323\">Lee &amp; Egbert, <em>supra<\/em> note 192, at 4.<\/span><\/p>\n<p>Just as with trademark surveys, to pick among competing methods, one must have a theory of epistemic validity based on an underlying normative interpretive theory. Here, the theoretical debates among corpus linguists and their critics show why these theories are deeply contested; there is no simple linguistic truth that can be read off unproblematically from a corpus. 
Tobia points to a deep tension over what to do when corpus-linguistics methods conflict with other sources of meaning: the divergence could mean that corpus linguistics comes closer to the true collective linguistic meaning than those sources do, that it falls further from it, or that collective linguistic meaning is itself a construct with no objectively ascertainable ground truth.<sup class=\"modern-footnotes-footnote \" data-mfn=\"324\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-324\">324<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-324\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"324\"><em>See<\/em> Tobia, <em>supra <\/em>note 223, at 735\u201336.<\/span> Scholars like Anya Bernstein and Carissa Byrne Hessick observe that there is necessarily a gap between the empirical facts that corpus linguistics can ascertain and the normative claim that its results should be treated as authoritative.<sup class=\"modern-footnotes-footnote \" data-mfn=\"325\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-325\">325<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-325\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"325\">Anya Bernstein, <em>Legal Corpus Linguistics and the Half Empirical Attitude<\/em>, 106 <span style=\"font-variant: small-caps\">Cornell L. Rev.<\/span> 1397, 1400 (2021); Carissa Byrne Hessick, <em>Corpus Linguistics and the Criminal Law<\/em>, 2017 <span style=\"font-variant: small-caps\">BYU L. Rev.<\/span> 1503, 1509.<\/span><\/p>\n<p>3. 
<em>Implications<\/em><\/p>\n<p>Our point is not that these gaps cannot be overcome, either for corpus linguistics or generative interpretation. Trademark surveys have overcome them, within their domain. They provide a kind of interpretive evidence that is regularly used and is widely accepted as legitimate. Corpus linguistics is doing its level best to do the same. Still, it is being forced to articulate and explain why particular methods are empirically reliable and why the results of those methods should be regarded as legitimate interpretive data. It is slow, laborious, contentious work.<\/p>\n<p>Our greatest fear is that generative interpretation will attempt to circumvent these challenges rather than take them seriously. Given the laboriousness of corpus-linguistic methods, generative interpretation can seem like a remarkable shortcut: rather than specifying a detailed algorithmic query that will provide circumstantial evidence of meaning, why not just ask a model the interpretive question directly and go with what the output says? But that approach has, to quote Bertrand Russell, all \u201cthe advantages of theft over honest toil.\u201d<sup class=\"modern-footnotes-footnote \" data-mfn=\"326\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-326\">326<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-326\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"326\"><span style=\"font-variant: small-caps\">Bertrand Russell, Introduction to Mathematical Philosophy<\/span> 71 (2d ed. 1920).<\/span> For generative interpretation to fulfill the hopes of its advocates, it will have to confront all of the tedious empirical challenges and all of the contentious normative issues that trademark surveys and corpus linguistics have had to deal with. 
That will be no mean feat.<\/p>\n<p>Consider how generative interpretation would look if it followed a trademark-survey model. Parties to a case involving an interpretive question would commission their own experts to conduct generative-interpretation studies. The experts would specify detailed experimental methods: the models and systems they use, the prompts they provide to those models, the outputs they examine, how they interpret those outputs, the number and nature of variations on the inputs they try, and the statistical tests they apply to summarize the outputs across numerous prompts. Every choice they made in doing so would be adversarially probed by opposing counsel and experts: Why Claude 4 Opus and not ChatGPT-o3? Why did the prompt use \u201cplaintiff\u201d and \u201cdefendant\u201d instead of the parties\u2019 names? As we have shown above in Part III, many of these choices can dramatically affect the results of generative interpretation. They will have to be pinned down in a principled way, and that will require hard conversations about what makes one form of generative interpretation more reliable than another. As we have shown in Part IV, there is no way to resolve these debates without a normative theory of what it is that generative interpretation is supposed to be measuring. Trademark surveys have just such a theory, but it is narrower and less ambitious than what generative interpretation currently aims for. 
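In code, the skeleton of such an expert protocol might look like the sketch below. Everything in it is hypothetical\u2014<code>ask_model<\/code> is a stand-in for whichever system an expert would actually query\u2014but it illustrates the design that adversarial testing would demand: prompt paraphrases fixed in advance, repeated runs, and a reported answer distribution rather than a single cherry-picked output.

```python
from collections import Counter
import random

def ask_model(prompt, run):
    """Stand-in for a real LLM API call. Here it simulates a noisy
    interpreter so that the sketch is self-contained and runnable."""
    rng = random.Random(f"{prompt}|{run}")  # deterministic per (prompt, run)
    return rng.choice(["single shot", "single shot", "group of shots"])

def robustness_study(paraphrases, runs_per_prompt=20):
    """Tally answers across paraphrases and repeated runs, so the result
    reported to the court is a distribution, not one sample."""
    tally = Counter()
    for prompt in paraphrases:
        for run in range(runs_per_prompt):
            tally[ask_model(prompt, run)] += 1
    total = sum(tally.values())
    return {answer: count / total for answer, count in tally.items()}

# Illustrative paraphrase set, pre-registered before any model is queried.
paraphrases = [
    "Does 'discharge a firearm' mean one shot or a volley?",
    "Is each bullet fired a separate 'discharge' of the weapon?",
    "When a statute says 'discharge', is a burst of shots one discharge?",
]
print(robustness_study(paraphrases))
```

Each choice embedded here\u2014the paraphrase set, the number of runs, the summary statistic\u2014is exactly the kind of detail opposing counsel and experts would probe.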
Corpus linguistics shows what the process of working out such a theory looks like\u2014a drawn-out debate that simultaneously gets deep into the technical weeds and ascends to the highest peaks of jurisprudence.<\/p>\n<p>Or compare Lisa Larrimore Ouellette\u2019s proposal to use search results as evidence of a trademark\u2019s strength and of the likelihood of confusion between two marks.<sup class=\"modern-footnotes-footnote \" data-mfn=\"327\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-327\">327<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-327\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"327\"><em>See generally<\/em> Lisa Larrimore Ouellette, <em>The Google Shortcut to Trademark Law<\/em>, 102 <span style=\"font-variant: small-caps\">Calif. L. Rev.<\/span> 351 (2014).<\/span> Her suggestion combines trademark surveys\u2019 domain (consumer understandings of marks), corpus linguistics\u2019s quantitative methods (counting occurrences of a given word sense), and generative interpretation\u2019s use of Internet datasets and reliance on complex black-box machine-learning systems.<sup class=\"modern-footnotes-footnote \" data-mfn=\"328\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-328\">328<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-328\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"328\">For additional hybrids, see generally Jake Linford &amp; Kyra Nelson, <em>Trademark Fame and Corpus Linguistics<\/em>, 45 <span style=\"font-variant: small-caps\">Colum. J.L. 
&amp; Arts <\/span>171 (2022) (arguing that corpus-linguistic methods have advantages over surveys for assessing trademark fame); Shivam Adarsh, Elliott Ash, Stefan Bechtold, Barton Beebe &amp; Jeanne Fromer, <em>Automating <\/em>Abercrombie<em>: Machine-Learning Trademark Distinctiveness<\/em>, <span style=\"font-variant: small-caps\">21 J. Empirical Legal Stud. <\/span>826 (2024) (using machine learning to predict trademark distinctiveness).<\/span> But the modesty of Ouelette\u2019s proposal leaps off the page: she discusses in extensive detail the ways in which search results might be unstable or unrepresentative of linguistic usage; she individually examines every case in her dataset where a court and Google disagree; she emphasizes the limited nature of her empirical results; and her bottom line is only that search results are probative evidence that courts should not categorically exclude.<sup class=\"modern-footnotes-footnote \" data-mfn=\"329\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-329\">329<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-329\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"329\">Ouelette, <em>supra<\/em> note 327, at 369, 384\u201395, 406\u201307.<\/span> This is the kind of scholarly foundation on which a persuasive argument for generative interpretation will ultimately need to rest.<\/p>\n<h4 style=\"text-align: center\">VII. <span style=\"font-variant: small-caps\">Conclusion<br \/><\/span><\/h4>\n<p>The legal philosopher Ronald Dworkin famously deployed the figure of Judge Hercules, an interpreter of unlimited competence and intelligence, with perfect access to all of the relevant authorities and all the time and diligence needed to study and harmonize them. 
Judge Hercules, Dworkin argued, was the ideal towards which human judging aspires.<sup class=\"modern-footnotes-footnote \" data-mfn=\"330\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-330\">330<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-330\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"330\"><em>See generally<\/em> <span style=\"font-variant: small-caps\">Ronald Dworkin, Law\u2019s Empire<\/span> (1986); <em>cf.<\/em> Kieffaber, <em>supra<\/em> note 116, at 31\u201332 (arguing that an ideal AI would be \u201c<em>Scalia\u2019s<\/em> Hercules, with a complete understanding of <em>language alone<\/em> rather than the various capricious elements that inform a Dworkinian positivist\u201d). But of course, as we have been arguing at length, there is no reason to assume that any particular LLM\u2019s outputs reflect \u201clanguage alone.\u201d For one thing, a typical LLM\u2019s training process includes extensive pre-training on heavily curated datasets, fine-tuning for specific use cases, and reinforcement learning from human feedback to optimize its responses against human preferences. <em>See generally<\/em> Lee et al., <em>supra<\/em> note 123 (discussing model development and the numerous choice points it offers to shape a system\u2019s behavior). To claim that this process learns \u201clanguage\u201d and nothing else is to hold up a randomly selected kitchen tool and assume that it is a can opener.<\/span><\/p>\n<p>Generative AI proponents see AIs as modern miracles, capable of tackling any cognitive task with superhuman skill. So perhaps it should come as no surprise that a version of this boosterism has taken hold in the legal academy. 
In their various ways, the thinkers we have been calling \u201cLLM proponents\u201d describe ChatGPT, Claude, and the like in terms that recall Dworkin\u2019s Judge Hercules. Like him, LLMs have mastered a corpus of text far larger than any human ever could. Like him, they can produce an answer to a complex problem far faster than any human ever could. And like him, they can knit together disparate and conflicting sources into a seamless justification as persuasively as any human could.<\/p>\n<p>To all of this we say: <em>perhaps<\/em>. Even granting that LLMs are capable of Herculean feats, a deep and fundamental problem remains. <em>How can we be confident that Judge Hercules really has decided the case before him competently and in good faith<\/em>? Anyone could claim to be Judge Hercules, to have mastered the world\u2019s accumulated statutes and precedents. And even if Judge Hercules is who he claims to be, the fact that he purports to present the best harmonization of the law is no proof that he is telling the truth about it. It does no good to point out that he can also write iambic pentameter or functional JavaScript code, when the question that needs answering is whether each interpretation he produces really is faithful to the legal authorities. Perhaps he forgot, or misunderstood, or hallucinated the nuances of the ERISA regulations on which the case depends. Perhaps he is a textualist in insurance cases but consults legislative history in admiralty cases. Perhaps he systematically favors corporate plaintiffs. Perhaps he knows how most people would understand a phrase, but ignores that knowledge when a simpler interpretation is available. Perhaps he finds arguments more persuasive when they SOMETIMES BREAK INTO ALL CAPS. 
Or perhaps the superhuman judge is also a superhuman liar.<\/p>\n<p>LLMs are like humans in some ways and unlike them in others, and whether an LLM should or should not be treated as equivalent to a human in a particular context depends on the LLM, the human, and the context.<sup class=\"modern-footnotes-footnote \" data-mfn=\"331\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-331\">331<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-331\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"331\"><em>Cf. <\/em>Robert Brauneis, <em>Copyright and the Training of Human Authors and Generative Machines<\/em>, 48 <span style=\"font-variant: small-caps\">Colum. J.L. &amp; Arts<\/span> 1, 22\u201356 (2024) (exploring in detail whether LLMs should be regarded as fundamentally human-like, fundamentally inhuman, or somewhere in between for purposes of copyright infringement).<\/span> A merely human judge\u2019s written opinion is a kind of proof of work. It is meant to persuade the reader of the legal conclusion it presents, but it is also meant to persuade the reader that the conclusion was arrived at competently and in good faith. A judge can fall short of the judicial ideal not just by being <em>too bad<\/em> at legal writing\u2014writing an opinion that fails to persuade because it lays bare the threadbare reasoning that resulted in it\u2014but also by being <em>too good<\/em> at it\u2014writing an opinion that is too persuasive because it carries the reader along too briskly to notice the better road not taken. True, the test of persuasiveness has never been sufficient to ensure complete honesty and competence in judging. 
But LLMs threaten to explode it altogether.<sup class=\"modern-footnotes-footnote \" data-mfn=\"332\" data-mfn-post-scope=\"00000000000007360000000000000000_4618\"><a href=\"javascript:void(0)\"  role=\"button\" aria-pressed=\"false\" aria-describedby=\"mfn-content-00000000000007360000000000000000_4618-332\">332<\/a><\/sup><span id=\"mfn-content-00000000000007360000000000000000_4618-332\" role=\"tooltip\" class=\"modern-footnotes-footnote__note\" tabindex=\"0\" data-mfn=\"332\"><em>Cf.<\/em> Eric A. Posner &amp; Shivam Saran, <em>Judge AI: Assessing Large Language Models in Judicial Decision-Making <\/em>28 (Univ. Chi. Coase-Sandor Inst. for L. &amp; Econ. Rsch., Working Paper No. 25-03, 2025) (\u201cIf the goal is to produce AI judges that operate like human judges, success would be achieved only if the AI judges decide cases in a realist way while using formalist reasoning\u2014meaning that they do not explain how they actually decide the cases. It is hard to imagine such AI judges being acceptable in a democracy or any well-ordered political system.\u201d); Re, <em>supra<\/em> note 23, at 1585 (\u201cThe existence of legal norms and elites has always depended on there being a scarcity of persuasive resources and arguments. . . . By undermining these constraints, a surfeit of persuasiveness threatens the effectiveness of legal norms.\u201d).<\/span><\/p>\n<p>Something more is required to make generative interpretation legitimate, some assurance that <em>this<\/em> <em>particular<\/em> output from among <em>all possible<\/em> outputs was generated fairly, objectively, on the basis of the evidence, and in accordance with the relevant authorities. To date, LLM proponents have not paid sufficient attention to what that something more might be. We hope that they will.<\/p>\n\n\n<p><a href=\"#_ednref1\" name=\"_edn1\">[*]<\/a> Tessler Family Professor of Digital and Information Law, Cornell Law School and Cornell Tech. 
Author order alphabetical.<\/p>\n\n\n\n<p><a href=\"#_ednref2\" name=\"_edn2\">[\u2020]<\/a> Assistant Professor, University of Wisconsin Law School.\n<\/p>\n\n\n\n<p><a href=\"#_ednref3\" name=\"_edn3\">[\u2021]<\/a> Assistant Professor, Vanderbilt Law School. We owe thanks for helpful suggestions and discussions to Yonathan Arbel, Michael Beauvais, Aislinn Black, Katrina Geddes, David Hoffman, Florencia Marotta-Wurgler, Margaret Lemos, Frank Pasquale, Morgan Ricks, Joel Sobel, Christopher Sprigman, Katherine Jo Strandburg, and Kathryn Woolard. We also thank participants at the NYU Information Law Institute Fellows Workshop, the Internet Law Works-in-Progress Conference, and Vanderbilt Law School. Sobel gratefully acknowledges the support of a grant from the Law &amp; AI Lab at the University of Miami School of Law. All opinions and errors are our own. This Article may be freely reused under the terms of the Creative Commons Attribution 4.0 International license, https:\/\/creativecommons.org\/licenses\/by\/4.0\/.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-a89b3969 wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/journals.law.harvard.edu\/jol\/wp-content\/uploads\/sites\/86\/2026\/01\/06-HLL-63-1_Stein-et.-al.229-308_.pdf\">View PDF Version<\/a><\/div>\n<\/div>\n","protected":false}}