Systems programming the model
This paper examines the status of the language model object in generative AI, arguing that what we call a ‘model’ is inseparable from the systems deploying it. I first theorize how these objects emerge from systems-level interactions between trained artifacts, prompting mechanisms, and sampling methods, drawing on the philosophy of digital objects as well as software studies to show how models gain their objective character. Such interactions converge on programming, not prompting, language models, and I illustrate how critical code studies can therefore track these dynamics. In an overview of language model programming approaches, I discuss how prompt and program converge, demonstrating how this confluence tends toward the production of new feedback loops wherein models become models.
1 No model
So-called AI agents force a set of questions about the status of the model object. With major AI labs shifting emphasis from conversational interfaces to the “agentic loop” (Willison 2025), it is increasingly difficult to isolate the ‘model’ in large language models (LLMs). Gone is the linearity of turn-taking mediated by conversation, wherein LLMs sit on standby after responding to users’ prompts. In the agentic loop, models serve as the reasoning core of multi-part systems, directing the planning and execution of tasks across a broadened surface of interaction between generative AI and software: web search, writing and executing code, online shopping, and more. Now, an LLM operates not in isolation but as an embedded component within a larger assemblage of agents that includes memory, planning capabilities, and integrated tools (Toner et al. 2024, pp. 8–9). How can we speak of the model when it is distributed across such systems? Might the distinction between model and software be eroding? If so, what becomes of the prompt, or LLM outputs? If, with agentic AI, LLMs are positioned to act almost like operating systems, in what way are such models still models?
My first claim: these questions are not unique to AI agents. They have accompanied generative AI in some form since the days of GPT-2, and indeed, they inhere broadly across modern deep learning. But the status of the model tends to be remarkably under-specified in critical accounts. For example, celebrated critiques of big data and machine learning offer little insight on the matter. “A model,” one explains, “is nothing more than an abstract representation of some process […]. Whether it’s running in a computer program or in our head, the model takes what we know and uses it to predict responses in various situations” (O’Neil 2016, p. 18). A definition like this fails to address the technical conditions that constitute particular models and their uses, running counter to more recent scholarship that grounds analysis of deep learning within specific model architectures (Dobson 2023; Offert and Phan 2024). Another book features this slippage: “you can’t always be sure of what a model has learned from the data it has been given. Even a system that appears to perform spectacularly well can make terrible predictions when presented with novel data in the world” (Crawford 2021, p. 4; emphasis added). The model is a model—until it is not. Then, it is a system—no model at all, discrete and singular. If model and system share some relation, we will find no explanation of their connection here.
This is no mere matter of semantics. Obtaining clarity about what models are in generative AI is vital amid efforts to “infrastructuralize” LLMs as load-bearing components in software systems (Plantin et al. 2018; see also Salvaggio 2025). AI agents mark the latest such attempts, the endpoint of which is the transformation of LLM behavior into a far-reaching fabric of predictive inference, a kind of “world model” (Amoore et al. 2024) or “cognitive infrastructure” (Berry 2023). As LLMs continue to gain access to our infrastructural backend, it will only become harder to identify what such models are and on that basis critique them.
Building toward my second claim requires acknowledging the difficulty of doing such work with even isolated codebases. In an analysis of the original GPT-2 code, Minh Hua and Rita Raley identify a tension between critical code studies (CCS) and critical AI efforts that select the model as an object of analysis. The two are in tension because model operations need not resolve to programming; in some sense, models exist at a remove from code and its execution. Nevertheless, Hua and Raley observe that LLMs must be implemented in particular programs. Code instantiates models and manages interactions with them. Classifying these two kinds of code into “core” and “ancillary” functions, they argue that ancillary code—for example, code that shuttles prompts to GPT-2 or prints its output to screen—should be a major site of CCS interventions because it serves as the contact point between models and their uses (2023, paras. 15–16). Hua and Raley then compare two text generation programs, with the bulk of their analysis tracking differences in code between random sampling and a frontend interface for user prompts. But their analysis encounters an obstacle: intermediary code, which glues ancillary code to the core code instantiating GPT-2. Hua and Raley are quite aware of this problem. They write that this intermediary code is “harder to classify because it remains an open question whether the code that samples text from the model or encodes user input and decodes model outputs is considered part of the model or ancillary functions that support it” (2023, para. 8). A similar ambiguity would characterize the model’s training code. OpenAI did not release it; Hua and Raley make no attempt to classify it either. Would such code be core or ancillary to GPT-2?
The source of this ambiguity is not the classification itself, which remains analytically useful, but the gap between what the scheme assumes (that models are modular, extractable things) and what the code reveals (that they are not). Intermediary code exposes this gap, a gap which remains open if we continue treating LLMs like discrete things to be separated from their deployment context. Yet, if we extend the CCS approach Hua and Raley undertake, reading code shows that a necessary immateriality inheres in the models of generative AI: they cannot be fully located in a single file, function, or codebase, even as they are made to work through specific programs. This is my second claim. To develop it into an argument about LLMs, I turn to a series of software engineering efforts aimed at “programming—not prompting—foundation models” (Stanford NLP 2024). Though many such efforts currently aid in infrastructuralizing AI, their grounding in programmatic interaction creates a site where CCS can specify how models are made to work in the service of this process, and, in so doing, clarify what we mean by a model. That said, there is a catch: privileging programming will make apparent that we are not actually interacting with models at all. We are using systems.
2 Language modeling systems
This section theorizes system with respect to generative AI by putting the philosophy of digital objects and software studies into dialogue with an unlikely interlocutor: the computational linguist Christopher Potts. A longtime member of the influential Stanford Natural Language Processing Group, Potts has made a series of statements in webinars and blog posts that I will extend to address the status of the model object in LLMs; despite their informality, these statements contain a digital media-theoretic kernel worth elaborating.Footnote 1 This section thus operates in the mode of an exegesis of Potts, which ends by combining his thinking with that of Yuk Hui and Wendy Hui Kyong Chun. Subsequently, I bring my theorization of system to bear on “language model programming” (Beurer-Kellner et al. 2023) where the “automation of automation” that Brian Lennon (2024) identifies as essential to all programming gains new urgency amid AI infrastructuralization. In this context, prompt and programming become indiscernible, demanding a systems-level view to clarify what language model programming (and indeed all of generative AI) actually operates on: not models, but systems.
2.1 Systems thinking
Generally, Potts understands a model to be a mechanism for obtaining predictive outputs, like next token candidates with an LLM. By contrast, a system sends data to that model and routes its outputs to different software components. While models “get the hype,” Potts explains in a recent talk that the power of an LLM stems from the systems it enables people to build (2024). This was the primary lesson of scaling up to GPT-3; OpenAI indicated as such when announcing the model in 2020. At the end of the GPT-3 technical report, researchers at the company conclude with a final sentence that reads, significant gains on performance benchmarks “suggest that very large language models may be an important ingredient in the development of adaptable, general language systems” (Brown et al. 2020, p. 9; emphasis added). This conclusion licensed aggressive scaling efforts premised on the logic that scale would lead to better outputs (read: salable products). But in the years since, these efforts have obscured how scale is not the sole determinant for producing highly performant outputs. Models, even, are not the sole determinant for producing such outputs. Establishing this as a fact is the essence of Potts’s position. In another venue, he and colleagues argue that “state-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models” (Zaharia et al. 2024). Thus, for Potts, the need to move away from the predominant mode of “LLM thinking” and its reliance on scale toward what he calls “systems thinking” (2024).
Potts’s advocacy for systems thinking would at first seem to be about adapting to the market conditions of AI infrastructuralization. In part, this is his intent, and an undeniable instrumentalism, aimed at bringing “clarity” to software engineering practices, informs his corrective. But despite this focus on competitive products and performance gains, Potts’s reframing of generative AI in terms of systems thinking also taps into a deeper truth about LLMs. Even if we set aside compound AI systems and ignore AI agents, the linguist still claims, in a strong sense, that “we are all the time dealing with systems.” In fact, “we only ever use systems” when using LLMs (Potts 2024). There can be no next token prediction without systems-level interactions. Compound AI systems only compound this basic condition—that of always using a “language modeling system.”
A system of this kind requires three components at minimum. Others may be added but these three are fundamental; there is no system without them. The first is the language model itself, a checkpoint of frozen parameters saved during training. On its own, this checkpoint is “completely inert.” It is an “artifact […] sitting on disk,” incapable of generating text or performing any other behavior (Potts 2024). To spur it into action, it needs input. Thus, the second component of Potts’s system is a prompt, without which there would be nothing for the model to model. The third concerns output. Prompting an LLM to generate text does not in itself produce a new token. Model output instead represents a probability distribution over all tokens in a model’s vocabulary, leaving developers to decide how to draw on this information and “decode” it as a means of extending user prompts (Holtzman et al. 2020). Those decisions are as essential for generation as a prompt and model, and for this reason, Potts names them as his last component: a sampling method. From the model state, induced by some prompt, such a method must select a new token and extend that prompt so it may be sent back to the model for another round of all the above. Finally, “[h]aving made a choice about a prompt and a sampling method,” Potts concludes, “you now have a system” (2024). Model, prompt, sampling: this is system degree zero for language modeling.
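Potts’s three components can be rendered in a few lines of code. The following toy sketch is mine, not Potts’s: a frozen table of bigram counts stands in for the trained checkpoint, and greedy selection stands in for a sampling method, but the tripartite structure (model, prompt, sampling) is the point.

```python
# Toy 'checkpoint': a frozen table of bigram counts standing in for
# trained parameters. On its own it is inert data, in Potts's terms.
CHECKPOINT = {
    "the": {"cat": 3, "dog": 1},
    "cat": {"sat": 2, "ran": 1},
    "sat": {"down": 1},
}

def model(token):
    """Return a probability distribution over next-token candidates."""
    counts = CHECKPOINT.get(token, {})
    total = sum(counts.values()) or 1
    return {tok: n / total for tok, n in counts.items()}

def sample_greedy(dist):
    """Sampling method: select the highest-probability candidate."""
    return max(dist, key=dist.get) if dist else None

def generate(prompt, max_new_tokens=3):
    """System degree zero: prompt + model + sampling, looped."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        nxt = sample_greedy(model(tokens[-1]))
        if nxt is None:
            break
        tokens.append(nxt)  # extend the prompt for the next round
    return " ".join(tokens)

print(generate("the"))  # the cat sat down
```

Note that the checkpoint alone generates nothing; only the loop in generate(), which joins prompt to model to sampling method, produces text.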
Significantly, two of these three components (prompt and sampling) perform those very operations that remain ambiguous under a core/ancillary classification scheme. But locating these two components in Potts’s system does not entirely resolve the scheme’s ambiguity. Let me rehearse the problem Hua and Raley have flagged. Code that manages prompts or implements a sampling method is core in the sense that, without this code, there would be no way to perform text generation, let alone interact with an LLM at all. The model would have no mechanism for receiving input and its output does not automatically suggest what token should come next. At the same time, prompt and sampling code are decidedly not the trained checkpoint on disk and should be considered ancillary to it. They are code, not the model. Therefore, the ambiguity remains. However, in Potts’s language modeling system, this ambiguity extends to a deeper problem regarding the model as such. For the checkpoint is not itself the model either. Not quite. It is data and metadata, requiring at minimum some core code to instantiate it.Footnote 2 Follow the logic to its end and this back and forth could go on interminably, spiraling out into a sorites paradox set in motion by posing the same question repeatedly: Is any single component in this system core or ancillary to the model?
To return to my own core argument, we circumvent this paradox if we understand how the model itself is the product of a language modeling system. Treating any one component within that system as if it is ‘the’ model fails to capture the inseparability between model and system. Prompt and sampling code contribute to the production of the thing that becomes a model just as the trained checkpoint does. In this way, classification of code does not cause ambiguity about the boundaries of a model; failing to recognize that we are “all the time dealing with systems” causes that ambiguity (Potts 2024). Even so, differences do remain between model and system, despite their interdependence. A model is not just its system, and the challenge is to describe that difference adequately. Doing so below involves integrating Potts’s thinking into the philosophy of digital objects and software studies.
2.2 The model object
I argue that the model component in a language modeling system constitutes a “digital object” in Yuk Hui’s philosophy of technology. Hui’s work, built on the technical systems philosophy of Gilbert Simondon, shows that the facticity of data and metadata does not comprise the thingly character of database items, object-oriented programming classes, bitmapped images, or other objects that “take shape on a screen or hide in the backend of a computer program” (2016, p. 1). Characteristic of the existence of digital objects is an immaterial materiality brought into being through “virtual relations determined by representations and controlled by automation” (2016, p. 161). Digital objects are “at the same time logical statements”: data and metadata structured by schema, markup, ontologies, standards, and protocols; and “sources for the formation of networks”: nodes through which systems materialize and take on concrete form during automation (2016, p. 25). Their immaterial materiality develops through this conjoint dynamic, and it is only possible to speak of these objects as ‘objects’ in that context.
So, too, is this the case with model. With Hui, we can see how the difficulty of demarcating the model in LLMs stems from how these digital objects are materialized through interlocking processes that are themselves reciprocally conditioned by those models. Hui calls this dynamic “interobjectivity” (2016, p. 167). It forms the basis of technical systems and is at the same time subject to them. Hence, what often goes by the term ‘model’ in discourse about the latest AI products is in fact a reificatory tag for a language modeling system, with functionality for prompts and sampling built in. There is a certain truth in this regard to some consumer-facing statements made by AI companies. For example, OpenAI has taken to calling ChatGPT an “artificial intelligence-based service” (OpenAI 2024; emphasis added). While, to be sure, “artificial intelligence” plays its standard role in this instance as a fuzzy, “catch-all marketing term” (Nguyen and Mateescu 2024, p. 6), the company’s FAQs are remarkably clear about the kind of product OpenAI pushes to consumers. Whatever AI means, ChatGPT and similar such products are not models. Scaffolding Potts’s statements about language modeling systems with Hui’s digital object philosophy builds something like an ontological case for this phrasing.
Building that case may seem to risk ceding control of our terms of reference, letting the interests behind commercial AI claim what they might. In other words, the risk is that if OpenAI says ChatGPT is a service, ChatGPT must be a service. But my intent is just the opposite. By drawing on Potts and Hui, I am advocating for an appraisal of LLM interactions that can articulate how models emerge from compound systems and are thereafter integrated into more such systems (Bunz 2019). Undertaking this appraisal requires us to articulate two things at once. On the one hand, despite language modeling systems being the basic condition for our use of models, some pieces of those systems are more core than ancillary; one of those pieces, the trained checkpoint, seems to sit at the core of them all. On the other, that checkpoint is constantly subject to reificatory tendencies across both discursive and technical registers, which together inflate the image of a model beyond its status as an object. Even so, that thing called a model will nevertheless have something “vapory” about it.
With this word “vapory,” I am signaling an alignment between the immaterial materiality characteristic of Hui’s digital objects and the “vapory materialization” of software, which Wendy Hui Kyong Chun once diagnosed as an essential feature of new media (2011, p. 2). A similar dynamic characterizes language modeling systems in generative AI, and the last step of my exegesis of Potts returns to code and programming via Chun’s diagnosis. Whereas some early 2000s efforts in software studies sought to “banish vapor theory” from the field, favoring instead a hard-line materialism to militate against “theory that fails to distinguish between the demo and the product,” Chun’s intervention demonstrates how immateriality determines software (2011, pp. 20–21). Immateriality inheres in these objects as objects. Yet, unlike with Hui, Chun demonstrates how ideology informs the immateriality of software. This dimension of her work applies to how ‘model’ becomes a reificatory tag for language modeling systems in generative AI.
Consider Chun’s centerpiece example of source code. There would seem to be a clear line of intent between the instructions a programmer writes and the resultant behavior of a program. But the apparent fixity that source code might guarantee as an object of analysis for software studies is undercut, Chun argues, by the reality that source code is “more accurately a re-source” (2011, pp. 24–25; original emphasis). That is, code compilation transforms source code into source code. Source code comes after compilation, obtaining an immaterial relation with executable software objects only after the fact of their creation. If such code seems to be the ground truth of software, what the programmer intended, this is as much the work of a belief structure as it is the materialization of objects; Chun goes so far as to name this structure a “fetish,” emphasizing “code as a set of relations, rather than as an enclosed object” (2011, p. 36). In this approach to software studies, what Hui would call interobjectivity is infused with “an immaterial relation become thing” (Chun 2011, p. 20): the products of programming as such.
Triangulating the model in language modeling systems across Potts, Hui, and Chun articulates how this object is simultaneously vapory and material. Like source code, LLMs function as sources of behavior in systems to the extent that systems work in concert to conjure the image of a model that appears to power them. Yet, this image is not wholly illusory. The trained checkpoint on disk emerges as this very model only after the fact of programmatic interaction—at minimum, a mechanism each for prompting and sampling. The immateriality of this dynamic does not detract, however, from the model’s objective character: there is a model artifact there, instantiated and trainable. At the same time, systems-level interactions inhere as core components of this digital object because these interactions belatedly make that model a model, much as code compilation transforms source code into source code. In this way, the model comes to model its own system. It both emerges from and structures the very processes that constitute it. This is what it means to speak of the model object with LLMs.
3 Language model programming
The rest of this paper demonstrates how CCS can track this conjoint dynamic in language modeling systems. It proceeds in two parts. This section overviews three relatively niche approaches to building language modeling systems using language model programming. By combining free-text strings (prompts) and syntactically constrained program text (code), these approaches expose the programmatic tendencies driving all language modeling systems. “Prompts aren’t just strings,” the documentation for one package reads. “[T]hey are all the code that leads to strings being sent to a language model” (Guss 2025). In this part of my analysis, I show how such a claim crystallizes the interactions that produce models. The second part, which serves as a concluding discussion, returns to the model produced by those interactions—or rather, models, in the plural. AI infrastructuralization works by combining multiple models within one system, and I contend that systems thinking with LLMs can position us to make sense of this new order of complexity and abstraction.
3.1 Templated exchanges
Broadly, language model programming refers to a general category of approaches that interleave programs and prompts. In various ways, these approaches attempt to patch over the fundamental mismatch (or what should in principle be a mismatch) between next token prediction and the logical statements inherent to digital objects in Hui’s philosophy: schema, markup, ontologies, standards, protocols.Footnote 3 Relatively open-ended, free-text formats may work for chat sessions, document summaries, image captioning, and other forms of LLM outputs ultimately meant for human readers; but for models to become “sources for the formation of networks” (the second aspect of digital objects, for Hui), prompt and text generation must be brought under “protocological control” (Galloway 2004).
Open standards like the Model Context Protocol, said to work “like a USB-C port for AI applications,” promise to enforce such control by providing a universal interface between all manner of software and network protocols and LLMs (Anthropic 2025). As of this writing, companies and developers have only begun introducing standards of this kind. But the demand for protocological control is already evident in even the most basic chatbot exchanges. Consider the prompt template. It marks up a single span of LLM input text into its component parts, including internal instructions meant only for the model, past responses from that model, user requests, and any additional contextual information required to complete a task (Hugging Face 2025):
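The original figure is not reproduced in this copy. An illustrative template of the kind Hugging Face documents, with hypothetical tag names, might mark up an exchange as follows:

```
<|system|>
You are a helpful assistant.
<|user|>
What is a language model?
<|assistant|>
A language model predicts the next token in a sequence.
```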
All chat exchanges require templating prompts and responses into sequences of this kind, with tags like <|system|> and <|user|> demarcating participants and message types among otherwise undifferentiated streams of text. But rather than inserting template tags manually, chat applications automate the process. The following snippet, from Hugging Face’s transformers library, demonstrates how:Footnote 4
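The snippet itself is absent from this copy. As a hedged reconstruction: the pattern transformers documents centers on the tokenizer method apply_chat_template(); the messages list below is hypothetical, and a pure-Python stand-in replaces the real tokenizer so the interpolation it automates is visible.

```python
# Sketch only: in the real library, one calls
# tokenizer.apply_chat_template(messages, tokenize=False,
# add_generation_prompt=True). This stand-in mimics the template
# tag interpolation that the method performs.
def apply_chat_template(messages, add_generation_prompt=True):
    text = ""
    for message in messages:
        text += f"<|{message['role']}|>\n{message['content']}\n"
    if add_generation_prompt:
        text += "<|assistant|>\n"  # cue the model to begin its reply
    return text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a language model?"},
]
print(apply_chat_template(messages))
```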
After message roles and content are defined, calling the language model tokenizer’s .apply_chat_template() method automates template tag interpolation. On the frontend, users read and write relatively unformatted text; on the backend, templates package it up and manage the rest.
One subset of language model programming approaches, all considerably less popular than transformers, transforms such templates into program text itself. These packages and libraries offer relatively traditional support for developing software by providing programmers with functionality ranging from convenience abstractions and function decorators to additional handling for asynchronous data streams. They share a goal: making the inherent structure of LLM exchanges explicit using programming language syntax. The example below, which converts a question/answer pair into a function, uses SGLang, a serving framework for generative AI models (SGLang Team 2025):
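The figure is missing from this copy. A reconstruction in the style of SGLang’s frontend follows; the primitives (function, system, user, assistant, gen) are supplied by the real package, which also manages a mutable state object rather than a plain string, so minimal stand-ins are defined here to keep the sketch self-contained.

```python
# Stand-ins for SGLang primitives; the real package provides these
# via `from sglang import function, system, user, assistant, gen`.
def function(f):
    return f  # @function marks f as a compilable prompt program

def system(text): return f"<|system|>{text}\n"
def user(text): return f"<|user|>{text}\n"
def assistant(text): return f"<|assistant|>{text}\n"

def gen(name, max_tokens=256):
    # Sampling control: names the generation slot, caps its length.
    return f"[gen:{name} max_tokens={max_tokens}]"

@function
def basic_qa(s, question):
    s += system("You are a helpful assistant.")
    s += user(question)
    s += assistant(gen("answer", max_tokens=512))
    return s  # the framework tracks state; a string is returned here

print(basic_qa("", "What is an LLM?"))
```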
During code compilation, the @function decorator directs SGLang to interpret each component in basic_qa() as pieces of a prompt; adding them via += operators converts them into model input. gen(), the component responsible for text generation, specifies the length of an LLM’s response, forming one aspect of the control logic for a sampling method in this function. Recalling Potts’s claim that sampling is requisite to any language modeling system, the presence of that method here alongside a prompt (user(question)) and the model (assistant(...)) means basic_qa() encapsulates the very basics of systems-level interactions. This function is system degree zero for language modeling—rendered in code.
Like Hugging Face’s .apply_chat_template(), basic_qa() constructs prompts programmatically. But SGLang goes a step further to incorporate text generation directly inside a function; in transformers, doing the same requires more lines of code. SGLang, by dispensing with an external data structure in favor of composing messages inside the body of a function, evinces an aspiration toward modularity and abstraction in all language model programming. The goal is to “compos[e] modular operators” (Khattab et al. 2024, p. 2) and “fully encapsulated functions” (Guss 2025). Code that does so “generalizes language model prompting” by providing components that remain “agnostic” about the internal details of an LLM (Beurer-Kellner et al. 2023, p. 3). With such approaches, the aim is to offer a “clean interface” to users, who “only need to be aware of the required data” (Guss 2025). A sense of the model as a discrete thing recedes as the emphasis shifts toward building systems with reusable, modular components. Call basic_qa() from anywhere, even in programs that do not ostensibly support chat, and model behavior can be made available on demand.
3.2 Tool use and vibes
The logic behind this software design amounts to the following: if LLM outputs can be accessed across multiple interfaces and environments, why implement certain functionality in one’s software at all? A programmer might instead simply define some basic functionality for receiving data, like a computer’s IP address, and delegate the command logic to a model. This setup, termed “tool use” in AI discourse (Qin et al. 2025), represents an extreme case of what Andrej Karpathy has recently branded as “vibe coding,” a form of programming with LLMs that tends toward self-imposed deskilling (Edwards 2025). Another subset of approaches in language model programming provides the scaffolding to write software in this manner.
For example, the snippet below (Prefect 2025) features a function that sends a model-generated command to a command-line interpreter like Bash or Zsh. When called by a task, the function executes this command via a subprocess, which pipes the command to a computer and then returns the output of that command back to the model. After passing a simple data validation check, the model parses this output to find the current IP address:
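The snippet is absent from this copy; a minimal sketch of the mechanism it describes follows. The function name and the validation check are my own, and the package’s task decorators and model wiring are omitted.

```python
import subprocess

def run_shell_command(command: str) -> str:
    """Tool: execute a (model-generated) command in a shell via a
    subprocess and return its output, which is then passed back to
    the model for parsing."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=30
    )
    # Minimal validation: surface failures rather than feeding the
    # model an empty or misleading result.
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.strip()

# A model asked for the current IP address might emit a command like
# "hostname -I"; an innocuous stand-in command is run here instead.
print(run_shell_command("echo 127.0.0.1"))
```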
Other pieces of code in this package implement error handling to manage fallout from instances where interactions between a language modeling system and one’s operating system go awry. But alongside functionality for system errors, there is also additional control flow to manage a new component at work: the user. In the internal system prompt for the above tool use, which is exposed only to an LLM and sequestered from users during interaction, a portion of the instructions reads:
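The prompt itself is not reproduced in this copy. A hypothetical stand-in, mine rather than the package’s, illustrates the form such instructions take: a string constant addressed to the model, anticipating imprecise users.

```python
# Hypothetical stand-in for the package's internal system prompt;
# the actual wording in the codebase differs.
INSTRUCTIONS = """\
You are an agent that completes tasks by running shell commands.
Users often phrase requests vaguely or incorrectly; infer what they
most likely meant before acting, and respond with only the command
to be executed.
"""
```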
If the Cold War collision between game theory and psychology has positioned human irrationality as a central consideration for chatbot design since the early 1970s, here, that ideology of computational personhood surfaces in the figure of a bumbling, inarticulate programmer (see Slater 2022). To borrow an oft-cited trick in online advice about prompt writing, one could rephrase these instructions in a way that makes their assumptions about models and users clear:
3.3 The languages of machine instruction
In anticipation of my discussion of a final subset of language model programming approaches, let me pause on a fundamental question that emerges from the above: are the instructions assigned to INSTRUCTIONS source code? That developers have assigned the contents of a system prompt to a variable in a Python file and stowed them deep in a codebase, enabling other modules to call on this prompt as needed, suggests these instructions are source code. That those contents form a free-text prompt with no special syntax unique to any programming language suggests they are not.
A response to this question might declare that the contents of INSTRUCTIONS are most akin to a code comment—included in a source code file but ignored by compilers and meant instead to address other programmers, or to serve as a note to oneself. CCS scholars have long held that code comments are a central site at which to read culture in and alongside code (Marino 2020). But comments are not code. Not quite. Moreover, with LLMs, a new dimension informs the mode of address made by code comments, and this dimension further complicates attempts to determine whether INSTRUCTIONS is source code. In the context of LLMs, any and all text, whether it be code, code comments, transcriptions of speech, data, or whatever else, may be made completely subject to machine reading. If, under these conditions of reading, INSTRUCTIONS addresses a model and invokes behavior thereby, in what sense should its contents be read as source code?
One way through this conundrum would be to read INSTRUCTIONS along the lines of Brian Lennon’s work on programming language cultures. Meant in some respects as a rejoinder to CCS, his core argument is that programming languages are “systems of automation” (2024, p. 13). Lennon suggests that focusing too closely on the readability of text as source code occludes the underlying disposition toward automation in programming languages. What matters for him is how any sequence of program text serves to indicate the way computer scientists and software engineers would categorize that sequence according to a hierarchical model of programming languages, which spans low-level machine instructions, scripting interfaces, code comments, and even natural language text. To this, Chun might add that source code itself was initially called “pseudocode,” a convenience wrapper for machine instructions and therefore at least one step removed from an underlying programming language (2011, p. 42). This ladder of abstraction is precisely what Lennon underscores when reading programming text. As he explains, the programming language hierarchy is emblematic of a “generally recursive automation of programming as a political–economic activity: an activity that has no purpose but to automate other labor activities, not excluding itself” (2024, p. 5). Reading the particulars of program text should, in his view, disclose this wider tendency, the horizon of which is currently demarcated by vibe coding.
Following Lennon, I propose a somewhat preliminary argument for reading prompts in relation to program text. This argument suggests that the key question for conducting a reading along these lines is not whether INSTRUCTIONS and its contents are actually source code. It is how INSTRUCTIONS enables the “enclosure of computation within natural language that programming languages represent” (Lennon 2024, p. 93). In other words, prompts are (a) programming language if what they do is automate and execute programs. Reading the prompts that appear alongside program text in this manner would locate them as the newest and highest layer of abstraction in the programming language hierarchy, where they can contribute to building programs on top of other languages in that stack. One language model programming approach (Elliott 2025), barely more than a sketch, has in fact attempted to formalize this new layer in the language hierarchy quite explicitly. Using string interpolation to splice together prompts with lines of code, its name braids the originating break between machine instructions and pseudocode with the “superuser” conventions of the Unix command line: SudoLang.
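The splice can be pictured in miniature. In the hypothetical Python sketch below (illustrative only, not SudoLang itself; the `call_llm` function is a stand-in for any model API, stubbed here so the sketch runs on its own), an f-string interpolates program data into a natural-language instruction, which then drives execution:

```python
# Illustrative only: a natural-language prompt spliced into program text
# via string interpolation. `call_llm` is a hypothetical stand-in for a
# live model call; here it returns a canned response.
def call_llm(prompt: str) -> str:
    return "positive"  # stub in place of an actual model

def classify_sentiment(review: str) -> str:
    # The prompt is assembled like any other string in the program,
    # then handed to a model that "executes" the instruction.
    prompt = (
        "Label the sentiment of this review as positive or negative.\n"
        f"Review: {review}\n"
        "Label:"
    )
    return call_llm(prompt)

label = classify_sentiment("The interface is a joy to use.")
```

The natural-language instruction sits in the program exactly where a function call might, which is the enclosure Lennon describes.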
Significant for my broader argument is the fact that prompts already produce one class of programs: Potts’s language modeling system. A prompt, one of the minimal requirements for this system, programs many interactions therein. In my reading, then, locating prompts at a high point in the programming language hierarchy aims to reinforce the systems-level interactions contributing to the reification of the model. Reading prompts as (a) programming language articulates how it is that “[e]very prompt a user feeds into a large foundation model instantiates a new model, one more limited in scope: it enters into a different conditional probability distribution,” which in turn gives rise to “a new piece of composable modular software” (Roon 2022). Such a statement simultaneously expresses a widely held ideology among many developers at AI companies as well as a core dynamic of language modeling systems.
3.4 Automated prompt engineering
The final subset of approaches in my overview attempts a full integration of prompts and programs. Packages in this subset favor programmatic methods for constructing prompts, dispensing with the effort of rewriting prompts by hand to maximize LLM performance. This manual rewriting strategy is known as “prompt engineering”—something one group of developers asserts is “conceptually akin to hand-tuning the weights for a classifier” (Khattab et al. 2024, p. 2). As they argue, there are too many variables to account for when prompt engineering, too many possibilities involved in rewriting instructions to land on a best fit for that model’s model of language. Potts also speaks about this: “We write these prompts in English, but in fact, they are more like an effort to communicate with this sort of alien creature, the language model, and we can deceive ourselves into thinking that our understanding is going to translate into the performance of the system” (2024). Therefore, rather than “manually reviewing a handful of generated outputs and deciding by intuition whether one version of a prompt is better than another,” these packages enable users to write programs so that LLMs can optimize their own performance through self-reference (Guss 2025).
Writing programs in this manner erases the distinction between program text and prompts, which is why I have sought to locate the latter in the programming language hierarchy. The centerpiece example of this strategy is Stanford NLP’s DSPy. As its developers explain, the package “decouple[s] AI system design from messy incidental choices about specific [language models] or prompting strategies” (Stanford NLP 2024). Below, a small information extraction program outlines the software design developers implement instead:
As with other examples above, this program leverages differences between prompt and program text to compose a modular workflow, pulling “Apple Inc.,” “Tim Cook,” and other entities from a string. But DSPy actually exposes all the code in ExtractInfo to the underlying LLM. That includes the docstring for this class and the type hints of its four variables.[5] At runtime, the Python interpreter does not treat either as executable code. It passes over docstrings and employs other strategies instead of type hints to determine what kind of data has been assigned to a variable. But DSPy sends all this information to an LLM, furnishing the model with in-text indicators for how it should respond to certain components as well as a general task description. The program becomes the prompt, with no substantial difference between these two orders of text.
Automated prompt engineering builds from the basis of this elision. In the snippet below, a simple module for producing document summaries via so-called “chain-of-thought” reasoning is coupled with an ideal_length_reward() function (Wei et al. 2023). As the docstring in the latter explains, this function assigns a penalty to model-generated summaries if they are either shorter or longer than 75 words. When it receives a summary, it counts the words, measures the absolute difference between that count and 75, and converts this difference into a reward score. Summaries exactly 75 words in length earn a 1.0 score from ideal_length_reward(); scores decrease proportionally as word counts deviate in either direction, reaching 0.0 when they are 125 or more words away from the 75-word ideal.
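The scoring logic just described can be written out as a stand-alone function. (DSPy reward functions receive the module’s arguments and its prediction; the sketch below strips that interface down to the summary string itself.)

```python
def ideal_length_reward(summary: str) -> float:
    """Score a summary by its distance from an ideal length of 75 words:
    1.0 at exactly 75 words, decreasing linearly to 0.0 at a deviation
    of 125 words or more in either direction."""
    deviation = abs(len(summary.split()) - 75)
    return max(0.0, 1.0 - deviation / 125)
```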
As with the information extraction program, DSPy presents the docstring and other function components to the model. But first this code passes the chain-of-thought module and ideal_length_reward() to another class, Refine. There, in a feedback loop of N interactions (50 in this case), the model responds to the program-cum-prompt, and its responses are measured against the reward function. After each iteration, responses are further refined if they fail to meet the target length.
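The control flow of Refine can be approximated in plain Python. The sketch below is an analogue of the loop just described, not DSPy’s implementation (in the package, Refine is constructed from a module, an iteration count N, a reward function, and a threshold); a caller-supplied generator stands in for the model, and a plain score message stands in for the package’s feedback machinery:

```python
# Pure-Python analogue of the Refine loop: generate, score, feed advice
# back in, retry -- up to N times or until the reward meets the threshold.
def refine(generate, reward, N=50, threshold=1.0):
    advice = None
    best_output, best_score = None, float("-inf")
    for _ in range(N):
        output = generate(advice)   # model responds to the program-cum-prompt
        score = reward(output)      # response measured against the reward
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold:
            break                   # target met; stop refining
        # Feedback is folded into the next attempt's input.
        advice = f"The last attempt scored {score:.2f}; revise accordingly."
    return best_output
```

A stub generator that improves with each attempt will terminate as soon as the threshold is met, returning the best response seen.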
A hint about the source of these refinements lies in one of Lennon’s claims, in which he states that the programming language hierarchy is characterized by a “continuous, insistent orientation toward purified self-reference” (2024, p. 5). The ideal of programming, in other words, would be to write in what Lennon calls an “autological” language that only requires references to itself. In this language there would be no bugs to squash, no other dependencies to install. With all such external components expunged, the “automation of automation” would take over, and from that point forward: no programmer needed.
DSPy follows a similar course. Refine makes its refinements to the document summaries on the basis of feedback “advice” from an LLM—the very same one that generated output in the first place. After an initial prompting of the model, DSPy retrieves its summary and solicits advice for improving the result. It then adds the latter remarks to the original input text for another refinement with Refine. The process continues N times, or until the reward score meets a value set for threshold, with the model reading and writing its own autological self-critique. Tucked away in the source code for DSPy, the general shape of that advice is evident in the following class:
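Paraphrased here without the package’s Signature machinery (the docstring and field names are abridged from the DSPy source and may differ across versions), the class looks approximately like this:

```python
class OfferFeedback:
    """In the discussion, assign blame to each module that contributed to
    the final reward falling below the target threshold, if any. Then
    prescribe concrete and actionable advice for how each module should
    handle the same or similar inputs on the next attempt."""

    # Inputs handed to the model: the program's own code, its inputs,
    # its execution trajectory and outputs, and the reward function.
    program_code: str
    program_inputs: str
    program_trajectory: str
    program_outputs: str
    reward_code: str
    target_threshold: float
    # Outputs solicited from the model: blame, then per-module advice.
    discussion: str
    advice: dict[str, str]
```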
Below, in a “forward pass” through Refine, any and all advice is solicited in a single line:
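The line’s approximate shape is shown below; the surrounding names are paraphrased, and a stub stands in for the predictor DSPy builds from its feedback signature so the call can be shown in a runnable context:

```python
# Stub standing in for the predictor DSPy builds from its feedback
# signature; like the real predictor, it returns an object exposing
# an `advice` attribute. Field names here are illustrative.
class _Prediction:
    advice = {"summarize": "Keep the summary within a few words of 75."}

def advise_module(**feedback_inputs):
    return _Prediction()

# The single advice-soliciting line, in approximately its original shape:
advice = advise_module(program_outputs="...", reward_value=0.4).advice
```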
where it is called upon elsewhere and formatted as an additional piece of input, not unlike how prompt templates work:
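The formatting step can be pictured as merging the advice into the original keyword arguments as an extra hint field, much as a slot in a prompt template is filled. The sketch below is illustrative; the `hint_` name follows recent versions of the DSPy source, but the exact mechanism varies:

```python
# Illustrative: per-module advice merged into the original inputs as a
# hint field, the way a prompt template fills a slot. Names approximate.
kwargs = {"document": "A long document to summarize..."}
advice = {"summarize": "Keep the summary within a few words of 75."}

inputs_with_hint = {**kwargs, "hint_": advice.get("summarize", "N/A")}
```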
From here, this program will run, assigning blame, prescribing “concrete and actionable advice.” The program is the prompt, the prompted model prompts itself. In this language modeling system, all these components are compiled into a single, autological feedback loop, wherein a model becomes a model of and for itself. Language model programming ends with a language model, programming.
4 Language model cascades
In a widely reported May 2024 study, Anthropic’s interpretability team claimed to uncover “high-quality features” inside a production-scale LLM at the company (Templeton et al. 2024). Centered on Claude 3 Sonnet, the study explains how the team identified parts of this model where it stores information about everything from code errors and expressions of sadness to the Golden Gate Bridge. Rather than read model outputs to interpret Sonnet, though, the team captured neuron patterns inside the LLM at its middlemost layer. Then, they trained a second neural network to observe interactions between neurons as data passed through that layer. With this second network, the Anthropic team identified texts that exemplified each of the millions of features their process had uncovered. Their final step, interpreting these features, resembles the advice generation workflow in DSPy’s Refine: the team turned back to Sonnet and prompted it with instructions to read all the texts and determine an appropriate label for each one; those labels served as explanations for model behavior in the published study. In this way, the Golden Gate Bridge (feature 34M/31164353) seemed to emerge from inside Sonnet, alongside the many other emergent abilities LLMs are said to have.
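The “second neural network” in this pipeline is a sparse autoencoder, which re-represents a layer’s activations as a much larger set of sparsely active features. The sketch below is a generic outline of that architecture, not the team’s implementation; dimensions, initialization, and the absence of any training loop are all simplifications:

```python
import random

random.seed(0)
d_model, d_features = 8, 32  # illustrative sizes; production scale is far larger

# Randomly initialized encoder and decoder weights (untrained here).
W_enc = [[random.gauss(0, 0.1) for _ in range(d_features)] for _ in range(d_model)]
W_dec = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_features)]

def encode(activations):
    # Each feature is a weighted read-out of the layer's neurons; the
    # ReLU keeps only a sparse subset active for any given input.
    return [max(0.0, sum(a * W_enc[i][j] for i, a in enumerate(activations)))
            for j in range(d_features)]

def decode(features):
    # Reconstruct the original activations from the active features.
    return [sum(f * W_dec[j][i] for j, f in enumerate(features))
            for i in range(d_model)]

acts = [random.gauss(0, 1) for _ in range(d_model)]
features = encode(acts)
reconstruction = decode(features)
```

Interpreting what each feature represents is the step the team delegated back to Sonnet itself.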
This kind of “mechanistic” interpretability work promises to decompile the internal workings of LLMs and submit them as proof of model behavior (Saphra and Wiegreffe 2024). Like the ideology of source code in Chun’s account, such work requires a fetish object to ballast its belief structures: a concrete location inside a model where behavior not only emerges but may be manipulated and steered.[6] But while mechanistic interpretability efforts claim to take the model as their object of analysis, I have sought to position CCS in such a way as to demonstrate how the Anthropic team members have reified their own system. The version of Sonnet that emerges from their experiment—promoted as Golden Gate Claude—is a new digital object, inseparable from the systemic interactions that programmatically prompted this LLM into being. Put simply, instead of anchoring features directly in the source of those features, the model, the team conjured those features out of a system.
The implications of this reification extend beyond Golden Gate Claude. If mechanistic interpretability produces its explanations through systems-level operations, then the features it claims to find are inextricably bound to the contingent setups researchers employ: particular prompting methods, a test corpus, secondary neural networks, an LLM’s own interpretations, and other components in the pipeline. This poses a problem for such research more broadly, which has staked its legitimacy on the promise of accessing model internals directly, of finding ground truth inside neural networks. Since LLMs are always already systems, there is no interior to access independent of the system that produces a model. Mechanistic interpretability cannot escape this bind; it can only obscure it through the fetish object of the ‘feature’ said to exist inside some model.[7] That is why no meaningful interpretability work with LLMs, mechanistic or otherwise, can proceed without reckoning with the fact that “we are all the time dealing with systems” (Potts 2024).
This bind at the heart of mechanistic interpretability suggests we must take a different direction for investigating the model component in LLMs. To build toward my conclusion, I return to DSPy’s Refine and demonstrate how CCS makes visible what the Anthropic study obscures: the systems that produce models like Golden Gate Claude in the first place.
When the code for Refine gathers advice from an LLM, it performs a forward pass. Referring to this step as a “forward pass” is deliberate: DSPy developers think of package modules as “akin to neural network layers” (Khattab et al. 2024, p. 2), using the metaphor of data passing through a network to describe how code sends text to an LLM and retrieves output. In the terminology of connectionism, every step in the training of a neural network decomposes into two parts. The network first makes a forward pass, during which it processes training data and makes predictions. Second, it calculates the error between predictions and that data (for example, the predicted next token and the actual one) and sends this information back through each of its layers in the form of gradients. The network uses these gradients to update each layer incrementally so that it can produce better predictions during the next forward pass. This second step, known as the “backward pass,” or “backpropagation,” is how deep learning models learn (Reigeluth and Castelle 2021). Though DSPy developers do not follow the neural network metaphor through to this second step, it clearly informs how they have designed their code to solicit advice from an LLM. Just as in backpropagation, the second step of Refine calculates error and passes this information back to the model for another round of predictions.
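The two steps can be made concrete with a deliberately tiny example: a one-weight model learning y = 2x by gradient descent. This is a textbook sketch of the forward/backward cycle, not anything specific to LLMs:

```python
# Forward pass: predict. Backward pass: compute the gradient of the
# squared error with respect to the weight and nudge the weight against
# it. Repeating the cycle drives the prediction toward the target.
w = 0.0
x, y = 3.0, 6.0  # one training pair; the weight that fits it is 2.0
for _ in range(100):
    pred = w * x                  # forward pass: make a prediction
    grad = 2 * (pred - y) * x     # backward pass: d(error^2)/dw
    w -= 0.01 * grad              # incremental update for the next pass
```

After a hundred such cycles, w sits within a rounding error of 2.0; Refine’s advice-passing substitutes free text for the numeric gradient in the second step.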
Other researchers have been more explicit with this connection between model feedback and the backward pass. One group explains, to “optimize the new generation of AI systems,” LLMs should be linked together in an enormous, automatic feedback loop (Yuksekgonul et al. 2024, p. 2; original emphasis). Within that loop, models can pass around advice and critiques in the form of “textual gradients”: not weightings that indicate how parameters should change but plain, free-text suggestions. Another group used a system that “mirror[ed] the steps of gradient descent within a text-based Socratic dialogue.” This setup, they explain, orchestrates multiple LLMs within a single, self-updating system by “substituting” the mathematical operations of neural network training with LLM “feedback” and “editing” (Pryzant et al. 2023, p. 7958). In such studies, LLMs become “nodes in the computation graph” (Yuksekgonul et al. 2024, p. 25), the “programs” that coordinate them, “language model cascades” (Dohan et al. 2022). No need for prompt engineering, or perhaps for handcrafted prompts at all. Yet, while this vision appears to be pure vaporware, it has already begun to materialize. In December 2024, for example, DeepSeek released a series of reasoning models that disrupted the assumed necessity of massive computational expense in generative AI. Those models were trained via language modeling systems employing some of these very cascade techniques (DeepSeek-AI et al. 2024, p. 29).
When LLMs are made to cascade in this compound manner, the figure of the neural network expands into an overarching abstraction for a new language modeling system. Research and engineering activities that conceptualize automatic prompting with the metaphor of the forward and backward pass aim to build one enormous neural network from individual LLMs. Just as the reification of multiple models working in tandem gave rise to the Golden Gate Bridge in Claude 3 Sonnet, these cascades of LLMs abstract beyond any one network to reify one single model from many. Within this model, programmed prompts propagate back and forth in an extended autology, passing between components that are simultaneously self-contained and commingled in mutual interobjectivity.
DSPy developers frame the process this way: their package treats models as “abstract devices for text generation” (Khattab et al. 2024, p. 3). A most literal description of what Turing machines accomplish if there ever was one—though in this case a system’s generated text does not refer to numerals written on incrementally moving tape but is instead the product of stochastic inference, mediated by prompts and sampling methods. Moreover, at a certain level of abstraction, such text could come from one model or from many. The distinction becomes vaporous, as does the one between model and system: it would all be the same to this systems-level model, to say nothing of us who may receive text at the very end of its process.
Efforts to infrastructuralize AI are pushing in this very direction, seeking to produce a “new universal computing interface” that is at once a single model and many (Roon 2022). Behind this proposed interface, individual neural networks—each a discrete model—combine into a single united language modeling system that functions as an abstract model of models. Such a system is poised to compound only further, its cascading architecture substituting the deterministic mechanisms of the universal Turing machine with vibe-coded model inferences on demand. In this way, language model cascades currently represent the culmination of a dynamic I have argued is central to all LLMs: they are systems, emerging from programmed interaction. AI infrastructuralization leverages precisely these conditions, working to perform reification at scale by rendering disparate systems as a singular, seamless interface for model behavior. Whether such reification manifests as an agentic loop, Golden Gate Claude, or some other compounded configuration, the thrust of this process obscures the complexity of its own systems-level operations. Once embedded in our software, the mediations performed by these operations become invisible, not just as models or systems but as infrastructure itself. Reading these mediations through code shows how the vapory, immaterial model object is continually produced by cascades of programmed interaction. Critical code studies is a wedge for prying open this process before it calcifies.