Deep Dive

A Non-Tech Operator's Research Into LLMs

March 20, 2026 · 12 min read

I spent all of my free time the last two weeks using Claude.

After a few days of testing, I could use it but I couldn't think with it. And that gap felt like a problem worth solving — especially if I wanted to figure out how to deploy AI meaningfully across the organization I work for.

So I worked through some excellent blogs to get a finer understanding. A few rabbit holes I didn't plan on. With some help from Claude, which felt fitting.

This article is my attempt to compress what I found into something I can come back to.

Why it felt urgent: across the most recent research from McKinsey, BCG, and Deloitte, 88% of companies are already using AI in some capacity. Only 5% are extracting substantial value at scale. The diagnosis these reports converge on is consistent: the gap isn't about access to tools. It's a comprehension gap. And it tends to start at the leadership level.

Six sections. From what a Large Language Model actually is, to how it learns, to what it means for how you adapt it to your own organization.


1. An LLM Is Not What You Think

The mental model most people start with — including me — is something like a very powerful search engine. Something that has ingested an enormous amount of information and retrieves the right answer when you ask the right question.

That model is wrong.

A Large Language Model is a system trained to do one thing: predict the next word. That's it. It was shown an almost incomprehensible volume of human text — books, articles, code, conversations, scientific papers — and trained, billions of times over, to answer a single question: given everything before this point, what word comes next?

What's surprising is what emerges from that apparently simple task. The model doesn't store facts. It encodes relationships between concepts. It learns that certain words appear in certain contexts, that certain ideas cluster together, that meaning shifts depending on what surrounds a word.

A concrete example that helped me: "The bank refused the loan because it lacked funds." The model understands that "it" refers to the bank, not the loan. Not because someone programmed that rule. Because it absorbed enough human language to internalize how meaning works.
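
A toy sketch made this concrete for me. It uses the open-source Hugging Face transformers library and the small GPT-2 model as a stand-in (Claude itself isn't available this way), but the mechanic is exactly the one described above: given the text so far, produce a probability for every possible next word.

    # Toy illustration of next-word prediction, using the small open GPT-2 model.
    # A sketch, not how Claude is served; the underlying mechanic is the same.
    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "The bank refused the loan because it lacked"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # a score for every vocabulary token, at every position

    # Turn the scores at the last position into probabilities for the *next* word
    # and show the five most likely continuations.
    probs = torch.softmax(logits[0, -1], dim=-1)
    top = torch.topk(probs, 5)
    for p, token_id in zip(top.values, top.indices):
        print(f"{tokenizer.decode(int(token_id))!r}: {p.item():.1%}")

Every word in a response is the result of this kind of choice, repeated over and over.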

The core shift: an LLM isn't a database you query — it's a system that generates plausible language based on patterns learned from human text. It doesn't retrieve answers. It constructs them.


2. The Transformer — The Invention That Changed Everything

Before 2017, language models read text the way you'd expect a machine to: sequentially, word by word, left to right. The problem was that meaning rarely works in a straight line. By the time the model reached the end of a long sentence, it had forgotten the beginning.

The Transformer architecture, introduced in a 2017 paper titled — with unusual confidence — "Attention Is All You Need", changed that entirely.

The key mechanism is called attention. Instead of processing words one at a time, the model evaluates all the words simultaneously and calculates how strongly each one relates to every other. Every word looks at every other word and decides how much weight to give it. The word "it" looks back at "bank" and "loan" and determines which one it belongs to, not through a rule, but through learned patterns of association.
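
A toy version of the calculation shows the shape of it. The sketch below uses Python with NumPy and random stand-in vectors rather than a trained model's real ones; real systems add learned projections, many attention heads, and dozens of layers, but the core operation is this one: every word scores every other word, then becomes a weighted blend of them.

    # Toy scaled dot-product attention with random stand-in vectors.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    words = ["The", "bank", "refused", "the", "loan", "because", "it", "lacked", "funds"]
    dim = 8
    rng = np.random.default_rng(0)
    x = rng.normal(size=(len(words), dim))      # stand-in word vectors

    Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv            # queries, keys, values

    scores = Q @ K.T / np.sqrt(dim)             # every word scored against every other word
    weights = softmax(scores)                   # each row sums to 1
    new_representations = weights @ V           # each word becomes a weighted blend of all words

    # In a trained model, the row for "it" would put most of its weight on "bank".
    print(dict(zip(words, np.round(weights[words.index("it")], 2))))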

This is what allows modern models to handle long documents, maintain coherence across a lengthy response, and pick up on nuance that earlier systems simply couldn't process. Every major model today — GPT, Claude, Gemini, Mistral — is built on this architecture.


3. Understanding vs. Generating — The Distinction That Actually Matters

After the Transformer, research split into two distinct directions. Worth understanding both, because they produce very different tools.

The first produced encoder models — the most well-known being BERT, released by Google in 2018. These models read text in both directions simultaneously, making them excellent at understanding: classifying documents, detecting sentiment, finding semantically similar passages. They power search and content analysis.

The second produced decoder models — starting with GPT, also in 2018. These models predict forward, one token at a time, making them excellent at generating: writing, summarizing, answering questions, producing code.

Claude, ChatGPT, Gemini — all decoder models. All generators.
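
The split is easy to see with the open-source Hugging Face pipeline API, using two small open models as stand-ins for the two families:

    from transformers import pipeline

    # Encoder-style "understanding": a BERT-family model reads text and labels it.
    classify = pipeline("sentiment-analysis",
                        model="distilbert-base-uncased-finetuned-sst-2-english")
    print(classify("The quarterly results exceeded every expectation."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

    # Decoder-style "generating": a GPT-family model keeps writing.
    generate = pipeline("text-generation", model="gpt2")
    print(generate("The quarterly results exceeded every expectation, so the board decided to",
                   max_new_tokens=25))

One model labels the text it reads; the other keeps writing it. Claude is in the second family.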

When you ask Claude a question, it is not looking up the answer. It is constructing a response, word by word, based on patterns learned during training and the context you've given it in that moment. A vague question doesn't get a precise answer — it gets a plausible one.

The tools most executives are now running AI experiments with are all generators, not retrieval systems.


4. From Raw Model to Assistant — Pretraining and Fine-Tuning

A model doesn't emerge from training ready to use. Two distinct phases get it from raw capability to something functional.

Pretraining is the first. The model ingests an enormous corpus of text and learns to predict the next word across all of it. No specific objective. No particular task. At the end of this phase, the model has absorbed vast amounts of language structure, factual associations, and reasoning patterns — but no particular goal. If you asked it a question, it would complete the text in whatever direction seemed statistically likely. Not especially useful.

Fine-tuning shapes that raw capability into an assistant. The model trains further on examples of good responses — instructions, questions, conversations. This is what teaches it to follow directions, maintain format, and behave like something you'd actually want to interact with.
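
What those examples actually look like is less mysterious than it sounds. A fine-tuning dataset is essentially a long list of demonstration pairs; the ones below are invented, just to show the shape:

    # Illustrative (invented) instruction fine-tuning examples. Real datasets
    # contain thousands to millions of pairs like these; they teach the model
    # how to behave, not new facts.
    fine_tuning_examples = [
        {
            "instruction": "Summarize this customer complaint in two sentences.",
            "input": "I ordered on March 3rd and the package still has not arrived. Nobody replies to my emails.",
            "response": "The customer placed an order on March 3rd that has not been delivered. They have received no reply to their follow-up emails and want a status update.",
        },
        {
            "instruction": "Rewrite this message in a more formal tone.",
            "input": "Hey, can you send me the Q3 numbers asap?",
            "response": "Could you please share the Q3 figures at your earliest convenience?",
        },
    ]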

One finding I didn't expect: model size is not the only thing that matters. A paper published in 2022 (known in the field as the Chinchilla paper) showed that a model less than half the size of GPT-3, trained on roughly five times more data, outperformed it on most benchmarks. The volume and quality of training data are as decisive as the number of parameters.

Bigger is not always better. Better trained wins.


5. Making the Model Work for Your Organization

There are four ways to adapt a language model to your context — and they do follow a natural order of complexity.

  • Prompt engineering — crafting precise instructions to guide the model's responses — is the simplest.
  • RAG (Retrieval Augmented Generation) — connecting the model to your own documents at query time — adds infrastructure.
  • Fine-tuning — retraining the model on your own data to shape its default behavior — requires data, time, and ML expertise.
  • Pretraining from scratch — building your own model from zero on proprietary data — demands resources that most organizations don't have.

What's counterintuitive is that complexity and cost are not the only factors that should drive the decision. Data privacy, latency, degree of control, update frequency, deployment target, and cost all interact to make one approach viable and another a liability in production. The right question is not "how sophisticated are we ready to be?" but "what are the actual constraints of our environment?"

Prompt engineering operates at the output layer — format, tone, logic, structure. No infrastructure, no cost, deployable today. The fastest path to value and the most underused lever in most organizations. Its limit: it's fragile at scale. As prompts grow longer and more complex, they become hard to maintain.
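
In practice, prompt engineering is nothing more exotic than writing better instructions into the request. Here is a sketch using Anthropic's Python SDK, with an illustrative model name and placeholder document text; everything that matters lives in the two strings:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    proposal_text = "..."  # placeholder: the document you want analyzed

    # The "engineering" is entirely in these strings: role, constraints,
    # output format, and how to handle uncertainty.
    system_prompt = (
        "You are an analyst preparing briefing notes for a non-technical executive. "
        "Answer in at most five bullet points, in plain language, no jargon. "
        "If you are unsure about a figure, say so explicitly instead of guessing."
    )

    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative; use whichever model is current
        max_tokens=500,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"Summarize this vendor proposal and list the three biggest risks:\n\n{proposal_text}",
        }],
    )
    print(response.content[0].text)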

RAG operates at the knowledge layer. The model doesn't memorize your documents — it retrieves relevant passages at the moment of answering and generates a response grounded in them. The right architecture when your knowledge changes frequently — new policies, updated procedures, evolving documentation. It also provides a transparent, auditable knowledge path, which matters in regulated environments.
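
Here is a minimal sketch of the retrieval half, using the open sentence-transformers library and three invented policy snippets as the "knowledge". In production this would involve a vector database and careful chunking, but the flow is the same: embed the documents, embed the question, pull the closest passage, and put it into the prompt.

    # Minimal retrieval sketch: embed documents once, embed the question,
    # find the closest passage, and ground the prompt in it.
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    documents = [  # invented placeholder snippets standing in for your knowledge base
        "Expense reports above 500 EUR require written approval from a department head.",
        "Remote work is allowed up to three days per week, subject to manager agreement.",
        "The travel policy was updated in January: business class is no longer reimbursed.",
    ]
    doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

    question = "Do I need approval for a 700 EUR expense?"
    query_embedding = embedder.encode(question, convert_to_tensor=True)

    # Rank passages by semantic similarity to the question and keep the best one.
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    best_passage = documents[scores.argmax().item()]

    # The retrieved passage is placed into the prompt, so the model answers
    # from your current documents instead of from memory.
    prompt = f"Answer using only this context:\n{best_passage}\n\nQuestion: {question}"
    print(prompt)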

Fine-tuning operates at the behavior layer — reasoning patterns, tone, classification logic, domain-specific consistency. It is not primarily for adding knowledge. It is for making behavior stable and predictable across every request. One rule of thumb worth keeping: never fine-tune in knowledge that will change. Dynamic knowledge belongs in RAG. Fine-tuning is for what should stay fixed.

Pretraining from scratch remains out of scope for most organizations.


6. Three Limits to Never Forget

Everything in the previous sections describes what these models can do. This section is about where they reliably go wrong.

Hallucination is the most important one to internalize. Because the model generates rather than retrieves, it will always produce something that sounds plausible — even when it has no reliable basis for what it's saying. It doesn't experience uncertainty the way a human does. For anything involving facts, figures, legal or financial content, or decisions with real consequences, human verification is not optional.

The context window is the model's working memory — the volume of text it can process in a single interaction. What sits outside that window doesn't exist for the model. It has no memory of previous conversations unless you explicitly provide them. Managing what you put into context is a skill that compounds quickly.
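
The practical unit here is the token, not the page, so the first habit worth building is counting them. A rough sketch with an open tokenizer (GPT-2's; every model family counts slightly differently, and the 200,000-token figure is just the order of magnitude of a large modern window):

    # Rough token counting with an open tokenizer. Different model families
    # tokenize differently, so treat the numbers as an approximation.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    document = "This agreement is entered into by and between the parties hereto. " * 500  # placeholder text
    n_tokens = len(tokenizer.encode(document))

    context_window = 200_000  # roughly the size of a large modern model's window
    print(f"{n_tokens} tokens, about {n_tokens / context_window:.1%} of a 200k-token window")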

Prompt dependency is the limit that surprises people most once they start using these tools seriously. The same model, given a vague instruction versus a precise one, produces outputs that are barely comparable. The model's capability is relatively fixed. What varies dramatically is the quality of the input.

These limits are not bugs to be fixed in the next version. They are structural properties of how the systems work.


Sources

Accessible Reading

Foundational Research