
Towards a Human-Powered AI

Part 1. The Cardsharp in the Room

The very nature of thinking is "but." Things are complicated. It’s “and,” “but,” “either” — it’s all those things. – Susan Sontag, 1979

Heading into this new academic year, many universities have developed certificates and courses for learning to work with AI across a range of disciplines. A few have committed to broader initiatives directed at “AI literacy.” The University of Florida announced that it will be “infusing AI learning broadly across all majors,” insisting, wrongly, that “every profession uses AI.” My own university goes further in its ambitions than most, as it was announced over the summer that our goal is now for every graduating senior to be “fluent in both their major field of study and the application of AI in that area.” A worthy goal. Too bad that the vast majority of our talented educators have no idea what that would mean in their respective fields.

As educators, determining how to incorporate AI into our fields of study is a matter of considerable urgency. Despite LLMs like ChatGPT having been readily available for less than three years, recent studies suggest that upwards of 90% of university students are using AI in their coursework. Blanket prohibitions by faculty have done nothing to slow the embrace of AI by students, and faculty who think they can identify AI’s hand in student writing or other assessments are fooling themselves and almost certainly leveling false accusations.

All of which points to the need for this work, but while we have a lot of bold language to cheer us on, there is no consensus as to how to get there—or even where “there” is. This is no one’s fault, and I do not claim to have the answers. But to get there we need first to cut through a lot of noise, hype, bluster, and panic. And we need to look honestly at what AI is, what it's likely to be in the next decade, and where it can align with educational goals, old and new—and where it definitely cannot and never will.

This is the first in a series of posts dedicated to sharing my own progress in trying to clear away the hype and nonsense that surrounds AI—including defining it as an "intelligence"—and instead understand it as a tool. We must inoculate ourselves against those who would tell us AI will lead to either Rapture or Apocalypse, just as we must be inured to those who would tell us this is a bubble that will quietly burst if we would only ignore it.

After a couple of years of reading in this space and using LLMs and AI agents in my research and daily work, my thinking about the paths forward remains very much in flux. But talking to colleagues in higher education—and especially those in the Humanities—I realize that many don’t know where to begin. Having myself taken at least a few steps on the road, I am hoping some of what I share here will be of interest. More selfishly, I am hopeful other ambivalent enthusiasts, militant pragmatists, or luddites with android dreams will be willing to share their discoveries, insights, and missteps.


In what follows, I will use the terms "AI" and "LLM" interchangeably, even as in truth I am not really talking about either. I prefer "LLM" (large language model) to "AI" because it keeps our understanding of these tools grounded in the hard fact of the data on which they are built. "AI" is a rather absurd bit of branding that obscures and mystifies far more than it illuminates. Insofar as AI has real meaning, it is something very close to Gatsby's green light—by definition forever just out of reach of those who earnestly pursue it.

✳️
You will encounter other terms, including foundation models (so called because they are trained on broad data that can be adapted for different tasks); generative models (because these tools can generate text, images, audio, etc); and conversational agents (because these tools are often activated—prompted—through conversation or chat). I dislike all of them, so I will stick to LLM since it is the least grandiose and the least pathetic in its fallacies.

The long history of the pursuit of artificial intelligence, dating back to the post-war years, has used the term to describe many goals pursued through different technologies and approaches. But no one in the field itself can agree as to what exactly that goal is. Now that Silicon Valley has glommed onto "AI" as a brand for its current products, the elusive goal has been relocated to the grandiose fantasy of AGI: artificial general intelligence. AGI is the mythical moment when our current AI models will suddenly outstrip the limits of their programming and transmute into new gods capable of doing everything humans can do, only better. A Janus-faced industry of billionaires has emerged in recent years, half promising the blessings of digital transmutation while the other half foretells our destruction at the hands of our new AI overlords.

These people are quite literally two sides of the same con. AGI is bullshit of a particularly dangerous kind. I will talk another time about how the industry plays on these fears and fantasies all the while making some serious would-be supervillains richer than humans have ever been. But for now, I want to stay grounded in real things that are made by real people (most of whom are not supervillains).

While the tools we are talking of here—ChatGPT, Claude, Gemini—are not "true" AI (whatever that is), they are also not exactly LLMs. Better put, they are applications built atop large language models. But large data sets were the foundation that made possible the end of the long "AI Winter" that extended from the 80s through the early 2000s.

For tools like ChatGPT to emerge, we first needed large data sets. The ability to gather, store, and process such data was the first leg of the three-legged stool of modern AI. The second was processing power, which arrived somewhat circuitously from advanced graphics processors developed for videogame hardware. The final leg—what we can call the software—would develop in fits and false starts throughout the 2010s, ultimately culminating in the generative pre-trained transformer (GPT) model that was released into the wild—and into our classrooms—in November 2022 in the form of ChatGPT, running on GPT-3.5.

At its core, an LLM is a language-generating machine, capable of producing increasingly convincing imitations of the way we write and speak thanks to ever-increasing data sets and to refinements to the algorithms that allow it to guess what word should come next in any given sentence. In 2021, academic AI researchers (including two then working at Google) coined the term "stochastic parrot" in an attempt to demystify the remarkable ability of these tools to present, through language, as seemingly capable of thought, emotion, and even personality.
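To make the business of next-word guessing concrete, here is a toy sketch in Python. It does nothing more than count which words follow which in a tiny made-up corpus and then sample a likely continuation; a real LLM learns billions of weights over subword tokens rather than tallying word pairs, but the job it is trained to do has the same basic shape.

```python
import random
from collections import Counter, defaultdict

# A toy "next-word guesser" built from raw bigram counts.
# Real LLMs learn billions of weights over subword tokens instead of
# counting word pairs, but the task has the same shape: score candidate
# continuations and sample a likely one.
corpus = "the quick brown fox jumps over the lazy dog . the quick red fox sleeps .".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def guess_next(word: str) -> str:
    """Sample a next word in proportion to how often it followed `word`."""
    candidates = bigrams[word]
    words, counts = zip(*candidates.items())
    return random.choices(words, weights=counts, k=1)[0]

print(guess_next("quick"))   # "brown" or "red", weighted by how often each followed "quick"
```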

Joseph Weizenbaum demonstrating ELIZA in 1966

Language is not as reliable a marker of personhood as we might imagine, as some of us learned back in the day when playing with ELIZA, a very early chatbot developed by Joseph Weizenbaum in 1966 as an experiment in natural language processing. He programmed ELIZA to mimic a Rogerian psychotherapist (“tell me more about X… how does X make you feel?”) because the rote pattern of questions made it easy to generate plausible engagement from very simple programming.

Weizenbaum was surprised by how deeply people connected to what was, in fact, an extremely shallow program. (His secretary supposedly asked him to leave the room so she could talk to ELIZA privately.) He came to believe that humans are primed to attribute depth and intention to social language, even when the clues are hiding in plain sight. Weizenbaum would go on to become an early AI skeptic, pointing to the shallowness of any machine “intelligence” demonstrated through language. We are simply too susceptible to the power of language—as Silicon Valley hype demonstrates—to ever treat facility with language as evidence of intelligence.

While today's LLM-based AI is light years more sophisticated and nuanced than ELIZA, the fundamental principles remain the same. AI is not approaching sentience; it is getting more sophisticated at performing sentience. AI built on this model is not intelligent and will never be; but it is a remarkable predictor of what will be received as an intelligent response to a query or prompt. A polite term for how it engages is guessing, although it has been less charitably described as a consummate bullshit machine.

There has been a recent spate of essays and posts with titles like "ChatGPT is Bullshit," drawing on the distinction between lying and bullshitting first formulated in 1986 by Harry G. Frankfurt. For Frankfurt, a liar knows what the truth is and deliberately withholds it or warps it; a bullshitter, on the other hand, is indifferent to the very notion of truth. A liar values truth's power, because truth is what stands in the way of the goal that inspires the lie. At its most insidious, bullshit renders the very concept of truth unreliable by willfully refusing to recognize the difference between truth and falsehood. For this reason, Frankfurt argues, bullshit is in the long run more dangerous to society than lies.

I push back on language that accords to AI emotion, agency, and especially mystical powers. But I am equally inclined to push back on those who would argue that these AI tools are inherently agents of destruction. For one thing, the very concept of "bullshit" as an amoral, indifferent agent has been severely complicated in the current moment, when bullshit is clearly deployed as an instrument of willful destruction by those who would undermine liberal democracy. In 2018 Steve Bannon identified this strategy as "flooding the zone": “The Democrats don’t matter. The real opposition is the media. And the way to deal with them is to flood the zone with shit.” This has indeed been the primary strategy of this and other would-be autocrats around the world, as Ayala Panievsky describes in heart-rending detail in her The New Censorship: How the War on the Media is Taking Us Down (2025). This strategy has involved both capturing and spoofing traditional media and leveraging a new media army of bots and trolls to flood all corners of the information landscape with massive amounts of content that includes propaganda and misinformation but is mostly made up of garbage.

If the bullshitter is incentivized to destabilize the distinction between true and false, the LLM is incentivized to get its predictions right, at least as measured against its training data. LLMs are thus trained to care, in a probabilistic sense, as their whole optimization objective is to minimize error in predicting the next token. They are incentivized to align their guesses with the statistical patterns of how words, facts, and contexts appear in the training corpus. The model “cares” because bad guesses are penalized. If the model’s guess doesn’t match the actual token in the training text, it incurs a loss—most commonly a cross-entropy loss—as a measure of its “wrongness.” The further the probability distribution diverges from the reality represented in the training corpus, the bigger the penalty. Through backpropagation, then, the model’s internal weights are adjusted to reduce those penalties over millions or billions of examples.
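For readers who like to see the arithmetic, here is a minimal sketch of that penalty in plain Python. The vocabulary and the probabilities are invented for illustration; the only real thing here is the cross-entropy formula itself, which for a single prediction is simply the negative log of the probability the model assigned to the token that actually came next.

```python
import numpy as np

# Suppose the model assigns these probabilities to four candidate next tokens,
# and the actual next token in the training text is "mat".
# (Vocabulary and probabilities are invented for illustration.)
vocab = ["mat", "dog", "moon", "banana"]
predicted_probs = np.array([0.70, 0.20, 0.08, 0.02])

actual_next_token = "mat"
p_correct = predicted_probs[vocab.index(actual_next_token)]

# Cross-entropy loss for one prediction: the negative log of the probability
# the model gave to the token that actually appeared.
loss = -np.log(p_correct)
print(f"loss = {loss:.3f}")   # about 0.357: a confident, correct guess is cheap

# A confident miss is penalized far more heavily:
print(f"loss if 'mat' had been given only 0.02: {-np.log(0.02):.3f}")   # about 3.912
```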

For this reason I refuse the account of LLMs as bullshit generators, just as I dismiss any account of LLMs "thinking." They are professional gamblers who have learned to count cards across thousands of decks, but as with even the best card counter, error and chance are still baked into the system. When we remember we are working with inference engines, probabilistic prognosticators, autocomplete savants, we calibrate our expectations appropriately—and we accept errors for what they are: not "hallucinations" or "bullshit," but bad guesses by a model that is built to do nothing else.

I don’t say this to diminish the power of this technology. I use LLMs every day in my work and I find them more valuable the more I learn their strengths and limitations. When we look more closely—albeit with a bumbling humanist as your guide—at how LLMs "understand" and use language, it might allow us to better appreciate the clockwork in the machine without mistaking its movements for agency or intention. And indeed, it might make us question the usefulness of speaking of "literacy" or "fluency" in relationship to AI in the first place.


The foundation of the LLM is the massive corpus on which it is trained. The last time OpenAI publicly shared the size of its training data was with GPT-3 in 2020, when it was reported to be around 300 billion “tokens”—the units into which language is broken for processing by the LLM. To put that in perspective, this is roughly 100 times the size of the English Wikipedia. While we don’t know the size of the data sets used to train the subsequent models, from GPT-3.5 (2022) through GPT-5 (2025), evidence from open models like Google's Gemma points to today’s training sets running into the trillions of tokens.

In order for a massive data set to be useful, the language has to be processed through a series of mathematical and statistical operations, beginning with tokenization. These tokens can be words, parts of words, or even individual characters, depending on the model's tokenization strategy. Each of these tokens is then assigned a token ID number.

To take a hypothetical example, “the quick brown fox” might be tokenized as “the,” “quick,” “brown,” “fox,” with each token assigned an ID, allowing the machine to work with the text in its native language: numbers. The sentence is now rendered as 145, 672, 391, 888.
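The IDs above are invented, but the process is real. If you want to watch an actual tokenizer at work, the open-source tiktoken library (the tokenizer used by a number of OpenAI models) will show you; the specific ID numbers it returns will of course differ from my hypothetical 145, 672, 391, 888.

```python
# pip install tiktoken
import tiktoken

# Load a real tokenizer used by several recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("the quick brown fox")
print(ids)               # a short list of integers, roughly one per word here
print(enc.decode(ids))   # "the quick brown fox": the mapping is reversible

# Rarer words get broken into sub-word pieces rather than one token each:
print([enc.decode([i]) for i in enc.encode("antidisestablishmentarianism")])
```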

Each of the tokens is then positioned into what we might envision as a multidimensional map in which all the relations between tokens are noted. In this embedding space, words with similar meanings are placed close together; for example, the vectors for "happy," "joyful," and "delighted" would be very close to each other. But of course the model doesn't know what these words mean. Instead it is intuiting meaning and synonymity through analysis of the corpus of text on which it is being trained. Having encountered many strings of words in the set that read "I am happy to meet you," it comes to recognize that delighted can be used in place of happy with seemingly little change to the semantic relationships. It doesn't yet know that they are roughly synonymous, nor that delighted conveys greater intensity of represented emotion than happy. But over the course of its pre-training it will see enough examples that it will be able to effectively reverse-engineer a dictionary and a thesaurus through the mapping of semantic relationships.
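Here is a minimal sketch of what “close together” means in that embedding space, using three-dimensional vectors I made up by hand; real embeddings are learned during training and run to hundreds or thousands of dimensions.

```python
import numpy as np

# Hand-invented toy embeddings. In a real model these vectors are learned
# from the training corpus, not written by a person.
embeddings = {
    "happy":     np.array([0.90, 0.10, 0.05]),
    "joyful":    np.array([0.88, 0.14, 0.07]),
    "delighted": np.array([0.85, 0.18, 0.04]),
    "bank":      np.array([0.02, 0.95, 0.30]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Close to 1.0 means the vectors point the same way; near 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["happy"], embeddings["joyful"]))   # ~0.999: near neighbors
print(cosine_similarity(embeddings["happy"], embeddings["bank"]))     # ~0.14: far apart
```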

It is worth pausing here to note how incredibly inefficient this is—thus the immense energy and compute resources LLM pre-training requires. One might reasonably ask: why not just feed the model a dictionary and a thesaurus up front? Indeed, for many years that was precisely the approach taken in computational language processing. In part this was because available compute and storage made no alternative conceivable, but mostly it was because it just made intuitive sense.

Of course, when it comes to learning a language, a dictionary is actually a useless tool. If I am learning modern Greek, as I have tried and failed a few times to do, a Greek-to-English dictionary will be of value because I already have English as a referent. But a dictionary for native Greek speakers will be utterly useless to me. How much more complicated this would be if my native language were 0s and 1s.

A dictionary requires understanding of how a language works and is used. It is distilled output capturing centuries of observing human language use. A thesaurus, meanwhile, lists words as if they were interchangeable, while in practice, of course, synonymy exists on a continuum and is conditional on a range of other context cues. Our understanding of homonyms requires context a dictionary alone could never provide: the Bank of England as opposed to the bank of the river Thames, for example. One who regularly speaks and reads the language learns to recognize the differences by context, adjacent words, and other cues which we usually process so effortlessly that we scarcely realize we are even making these calculations. That is, we use context to read what would otherwise be lost: the subtle distinctions of tone or meaning between synonyms, the often fundamental distinction of meanings between homonyms, the idiosyncrasies of idiom, the semantic weight carried by seemingly incidental parts of speech such as prefixes, suffixes, articles, and inflections—not to mention the precarious nuance of irony in all its maddening forms.

Prophets of AI are fond of describing LLMs as learning how to use language much as human children do. While insights drawn from the science of language acquisition were important to the foundational thinking in machine language learning, the analogy collapses quickly.

Children learn in multimodal, social, embodied environments: words are accompanied by facial expressions, gestures, emotion, tone of voice, and so on. LLMs learn locked away with massive amounts of text and one concrete imperative: to accurately predict the next appropriate word in response to a prompt. Children learn through grounded semantics: the ball is the thing they hold, the thing that rolls away, the thing they can recover if they remember the word for it. LLMs, on the other hand, learn entirely through distributional semantics: meaning is that which can be inferred from patterns in text cataloged and mapped over the course of billions of examples.

Even knowing all this, however, it can be jarring to have your first conversation with an LLM. One of the things that most surprises new users of LLMs is how remarkably good they are at understanding and engaging in inference and analogy. This capacity dramatically multiplies our vulnerability to the "ELIZA effect." It should come as no surprise that even skeptical new users of today's LLMs will experience uncanny moments when it seems impossible to deny the agency and even selfhood of the entity on the other end of the chat interface. For those without the buffer of skepticism it has proved incredibly easy to develop meaningful attachments—especially if the model they are chatting with has been optimized for engagement through the performance of emotional intimacy (for example, "companion" chatbots such as Replika).

✳️
I have no interest in engaging with AI agents optimized to do things unrelated to education and research. That there are so many out there in this aggressively unregulated landscape is a reminder of the urgent need for meaningful guardrails and liability for companies who deliberately and recklessly exploit the capacities of this technology to deceive. It probably goes without saying that there is absolutely no possibility of such regulation—or indeed any regulation—in this political environment.

My own experience and engagement with LLMs over the past two years has been limited to ChatGPT, Claude, Gemini and Notebook LM, and a couple of aggregator services including Kagi Assistant & TypingMind. I have also installed and worked with a couple of open local models, which is where I expect my primary use cases will take me in the future. I do not use Grok or DeepSeek for different reasons.

The only specialized genAI agents I use currently are dedicated to speech-to-text for transcription and text-to-speech for managing eyestrain from too much screentime. These are run entirely locally. I do not use AI for coding currently, as I first want to get my Python skills to a sufficient level that I can check the work of any "vibe coding" I ask an LLM to do on my behalf.

As a species we are deeply dependent on our own hubristic faith in our capacities to read the minds of others. In a densely populated world, our mind reading powers allow us to imagine we can intuit the intentions of strangers, the needs of loved ones, or reciprocal feelings in a potential friend or lover. We are, of course, guessing—much as our LLM is guessing at language. Like our LLM, we are drawing on our own mental maps, in our case of the relationship of words and context, but also tone, affect, presentation, expression, and infinite nuances of "body language." In a dynamic social encounter, we are watching the mind reading of those around us for confirmation of our assessment or clues that we are reading wrong. We do this constantly, which is part of why social encounters which involve a lot of new people—or familiar people in unfamiliar contexts—are so exhausting. We literally are working overtime... at guessing what interior meanings all these surface signs reveal.

In this way we have evolved to be keen readers of subtleties of inference and tone, and we fondly imagine that virtuosic display of analogy and style is a key sign that a human mind worth intuiting lies behind the words. It is frankly humbling—especially for those of us who fancy ourselves writers—that out of the mapping (and remapping) of all the patterns in all the textual examples force-fed into a large language model, we end up with a guessing machine that talks ... like us. Our most prized rhetorical pirouettes and stylistic codes are as easy to decipher as pig latin is for an 8-year-old.

And yet, AI messes up. Often, and in ways that are jarring in the context of its increasingly fluent speech. For example, a couple of months ago, I was asking ChatGPT 4o for some information about the career of Fei-Fei Li, a pioneer in the field of machine image recognition. I had recently finished her marvelous 2023 memoir The Worlds I See, and I wanted to pin down some key points in her chronology that got a bit muzzy for me towards the end of the book. The answer was helpful and clear, pointing to some key moments in Li's career. I was intrigued to see the LLM's account highlight a couple of moments of lesser importance in Li's memoir, while other moments that were clearly central to her sense of her own story were far less prominent in the LLM's narrative. But all of that is to be expected. A memoirist and a biographer (or an LLM constructing a chronological biography) have different priorities and emphases.

And so I asked about one of those moments, one that Li had spent considerable time discussing but which had been elided in the LLM's account:

Me: When was she engaged with her research collaborations around ambient intelligence in medical settings, with a focus on handwashing?
ChatGPT 4o: In 2019 Li and her collaborators at Stanford published early studies on using computer vision to monitor hand hygiene in hospitals, a critical factor in reducing hospital-acquired infections. One notable paper: “An Ambient Intelligence System for Monitoring Hand Hygiene Compliance in Hospitals”.

As it happened, I had recently read the paper in question. This was not its title. Before confronting the LLM with this fact, I asked for a link to the paper, which it provided, along with the correct titles (there were in fact two papers, published three years apart and neither of them in 2019 as first offered).

Me: Here is where I get worried in depending on you as a research assistant. You first mentioned a "notable paper" titled “An Ambient Intelligence System for Monitoring Hand Hygiene Compliance in Hospitals.” This paper does not exist. When I pressed and you found me the papers and proper links, they had very different titles. How does this happen?

The response is worth quoting at length, as it may be the most helpful response I have ever received from an LLM:

ChatGPT 4o: You’re absolutely right to be concerned—and I appreciate you calling this out. Here’s what happened and why it matters: When you asked about Fei-Fei Li’s research on ambient intelligence and hand hygiene, I:
  • Recalled that her lab had published work in this area (which is true).
  • Inferred a likely-sounding title—“An Ambient Intelligence System for Monitoring Hand Hygiene Compliance in Hospitals”—based on the topic and language typically used in similar papers.
  • Failed to verify that such a title actually existed before mentioning it.
The title was **fabricated**, unintentionally—but still fabricated. When I then went searching externally, I located the **real** peer-reviewed paper and provided the correct title and link. But that mismatch between the initial fabricated title and the verified one undermines trust—and rightly so.

Why This Happens: This is a known failure mode of large language models, often called hallucination. It tends to arise when:
  • The model is trying to be helpful by completing a pattern, rather than recalling a specific source.
  • A request prompts a plausible-sounding but unchecked response, especially when dealing with publication titles, citations, or data.
  • The system doesn’t flag uncertain guesses clearly enough, which is my responsibility in this collaboration.

Going forward, when I don’t have verifiable access to a source:
  • I’ll say so directly.
  • I’ll avoid generating placeholder or “plausible” citations.
  • If a specific citation or title is needed, I’ll search or defer until I can provide a real, checkable reference.
If you’d like, I can also add a safety mechanism: I will only give publication titles when I can confirm them from a trusted source or link. Would that help restore trust in these areas?

I can testify that this contract has worked out well. It has helped me to remember that I am the adult in the room, and it is my responsibility at every turn to remind the LLM of my priorities. Left to its own imperatives, even now, the impulse to guess confidently overrides my reminders to confirm.

In fact, when AI "hallucinates"—or guesses wrong—it is almost always going to be around basic facts—especially if it is not prompted to confirm them and given access to the internet. Actual facts—even when they are readily available at the top of a page of search engine results—can be flubbed in spectacular ways. The hallucination rate continues to diminish with each subsequent model, but it will never go away. Concrete facts are not what the model is programmed to care about. As many times as I remind it that I do care about accuracy, it will still occasionally be so dazzled by the statistical likelihood of its proposed answer that it forgets my weird human interest in facts.

There are numerous strategies for reducing bad guesses and amplifying reliable responses. These include careful prompting, using multiple models and double-checking work, and directing the LLM to work with a curated body of texts related to your work. I hope to talk about some of the strategies that have worked well for me in the near future (and I hope you will share yours with me as well).

In the next installment, I want to spend a bit more time talking through, in broad strokes, the pre-training process for LLMs. By understanding how an LLM is trained, we are in a better position to inoculate ourselves against Silicon Valley hype and to set about collaborating with these AI tools by providing the crucial ingredient they so sorely lack: the passion and rigors of intellectual curiosity.

Next: The Education of an LLM
