Source: raw/karpathy-how-i-use-llms-transcript.md (full ~2h11m transcript, YouTube EWvNQjAaOHw, en-orig captions; same talk reposted at x.com/0xchromium/status/2063321324605280569)
Type: Talk / tutorial
Creator: Andrej Karpathy
Duration: ~2h11m
Published: early 2025 (companion to his “Deep Dive into LLMs like ChatGPT”)
Karpathy’s general-audience walkthrough of how he actually uses LLMs day to day — not a capabilities demo, but a working practitioner’s habits across model choice, tools, voice, and multimodal. The throughline: an LLM is a “zip file” of the internet turned into a helpful assistant by post-training; you’re talking to a self-contained entity until you hand it tools. The winning human skill is clear delegation + fast verification + taste for when to trust vs. inspect. The model specifics are 2025-era (GPT-4o, o1, Claude artifacts), but the mental model is evergreen.
Key Takeaways
- The base model is a lossy zip of the internet with a knowledge cutoff. Without tools it’s a probabilistic document generator made helpful by post-training — vague on recent events, capable of confabulation. Everything else (search, code, files, camera) is bolting tools onto that core.
- Context window = working memory. Start fresh chats aggressively. Old, irrelevant tokens distract the model, cost money, and can degrade quality. One topic per chat; close and reopen when you switch.
- Choose the model tier intentionally. Cheap/fast models for routine lookups; pay for “thinking”/reasoning models only when the task (hard code, math, tricky reasoning) justifies the extra minutes and cost. Karpathy pays for top tiers because it’s cheap relative to the value.
- Run an “LLM council.” Ask the same question across multiple providers (ChatGPT / Claude / Gemini / Grok) and read the consensus and the disagreements — cross-checking surfaces errors and blind spots.
- Tools convert a guesser into a researcher. Internet search (his habit: Perplexity) for anything recent/niche/changing; “Deep Research” for tasks worth minutes of chained search+reasoning that would cost you 30–90 min of manual browsing. Treat both as high-quality first drafts with citations — still verify.
- Make the model run code, not “think in text.” The code interpreter (Python) is the real unlock for math, data, and plots — but inspect its implicit assumptions and outputs.
- Generate disposable single-use software (artifacts). Especially strong in Claude: instead of hunting for the perfect app, have the model build a tiny custom one for this need — a React widget, a Mermaid diagram to understand a chapter, flashcards. Software as a throwaway thought tool.
- Voice is massively underrated. A huge share of his mobile usage is voice (far lower friction than typing). True advanced voice mode handles audio natively; on desktop he pipes speech into any app via SuperWhisper.
- Multimodal is already practical. Point the camera at books, devices, maps, nutrition labels, blood-test results — get live help.
- Good delegation beats prompt magic. Be concrete and specific, give examples, and save reusable setups (custom instructions, memory, custom GPTs) for repeatable tasks.
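The “LLM council” habit above can be sketched in a few lines. This is a minimal illustration, not Karpathy's own tooling: the provider functions are hypothetical stubs standing in for real API calls, and grouping answers by exact match after light normalization is a simplifying assumption (real answers usually need a human, or another model, to judge agreement).

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical stand-in for a real provider call: takes a question,
# returns that provider's answer as a string.
Provider = Callable[[str], str]


def normalize(answer: str) -> str:
    """Crude normalization so trivially different phrasings group together."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def run_council(question: str, providers: Dict[str, Provider]) -> Dict[str, List[str]]:
    """Ask every provider the same question and group provider names by answer."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, ask in providers.items():
        groups[normalize(ask(question))].append(name)
    return dict(groups)


def summarize(groups: Dict[str, List[str]]) -> str:
    """Report the largest group as consensus and list every dissenting answer."""
    ranked = sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
    total = sum(len(names) for names in groups.values())
    consensus, backers = ranked[0]
    lines = [f"consensus ({len(backers)} of {total}): {consensus}"]
    for answer, names in ranked[1:]:
        lines.append(f"disagrees ({', '.join(names)}): {answer}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Canned answers for illustration only; no network calls are made.
    stubs: Dict[str, Provider] = {
        "provider_a": lambda q: "Paris.",
        "provider_b": lambda q: "paris",
        "provider_c": lambda q: "Lyon",
    }
    print(summarize(run_council("What is the capital of France?", stubs)))
```

The disagreement lines are the useful part: they mark where to inspect rather than trust.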
The Mental Model
- An LLM is built by pre-training (compress a large slice of the internet into parameters — a “zip file”) then post-training (turn that document-completer into a helpful assistant persona).
- When you chat with the bare model you’re querying that compressed knowledge: fast, broad, but probabilistic, slightly vague, and frozen at the training cutoff. It can hallucinate, and it knows nothing about events after that cutoff.
- Capability comes from bolting tools onto the core: search (fixes recency/niche), code interpreter (fixes math/data), file upload (fixes “read this specific thing”), camera/voice (fixes input friction and the physical world).
- For the deeper “how the model thinks” companion, this talk points back to Karpathy’s Deep Dive into LLMs; this one is the usage layer on top.
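The gap-to-tool pairing in this section can be written down as a small lookup. This is a sketch only: the `Gap` names and the `tools_for` helper are hypothetical labels invented for these notes, not part of any product or of the talk.

```python
from enum import Enum
from typing import Iterable, List


class Gap(Enum):
    """Limitations of the bare model, as listed in the notes above."""
    RECENCY = "recent or niche facts"
    MATH_DATA = "math and data work"
    SPECIFIC_DOCUMENT = "read this specific thing"
    INPUT_FRICTION = "input friction and the physical world"


# Which tool the notes pair with each gap.
TOOL_FOR_GAP = {
    Gap.RECENCY: "search",
    Gap.MATH_DATA: "code interpreter",
    Gap.SPECIFIC_DOCUMENT: "file upload",
    Gap.INPUT_FRICTION: "camera/voice",
}


def tools_for(gaps: Iterable[Gap]) -> List[str]:
    """Return the tools to enable for a task, in order, without duplicates."""
    chosen: List[str] = []
    for gap in gaps:
        tool = TOOL_FOR_GAP[gap]
        if tool not in chosen:
            chosen.append(tool)
    return chosen


if __name__ == "__main__":
    # A task that needs current prices plus a plot of them.
    print(tools_for([Gap.RECENCY, Gap.MATH_DATA]))
```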
Karpathy’s Tool Ladder (his actual workflow)
- Plain chat — fast factual/explanatory queries against the model’s baked-in knowledge. Keep chats short and single-topic.
- Thinking models — switch up for hard reasoning, math, and code; they run internal chain-of-thought for seconds-to-minutes. He shows a debugging case where a thinking model succeeds where a non-thinking one fails.
- Search — for anything recent, priced, launched, rumored, or obscure. Perplexity is his reflex, but ChatGPT/Grok/etc. now search too. Outputs are first drafts — verify.
- Deep Research — the model spends minutes doing chained search + tool use + reasoning; excellent for what would be 30–90 min of manual research. High-quality cited draft, still verify.
- Code interpreter — real Python execution for analysis, math, and plots; don’t accept “text-only” reasoning for quantitative work, and check its assumptions.
- Artifacts / custom apps — generate disposable software for the moment’s need (diagrams via Mermaid, flashcard apps, small React tools). Strong in Claude.
- File uploads — drop in papers, PDFs, whole books (he reads classics like Wealth of Nations alongside the model) and discuss them.
- Cursor (not web chat) for serious coding — a dedicated coding tool that holds full project context beats pasting into a chat window.
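To illustrate the code-interpreter rung: a quantitative question is better answered by executing a few lines than by reasoning in prose, and writing the code forces hidden assumptions into named parameters. The scenario below (a deposit of 1,000 at 5% for 10 years) is an invented example, not one from the talk.

```python
def future_value(principal: float, annual_rate: float, years: int,
                 periods_per_year: int = 1) -> float:
    """Compound growth; periods_per_year makes the compounding assumption explicit."""
    rate_per_period = annual_rate / periods_per_year
    total_periods = years * periods_per_year
    return principal * (1 + rate_per_period) ** total_periods


if __name__ == "__main__":
    annual = future_value(1000, 0.05, 10)                        # compounded yearly
    monthly = future_value(1000, 0.05, 10, periods_per_year=12)  # compounded monthly
    # Same question, two defensible assumptions, two different answers.
    print(f"yearly compounding:  {annual:.2f}")
    print(f"monthly compounding: {monthly:.2f}")
    print(f"difference:          {monthly - annual:.2f}")
```

If the model silently picks one compounding convention, the answer looks authoritative either way, which is why the notes say to check its assumptions as well as its output.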
Underrated Moves
- Voice-first on mobile. Lowest-friction interface; he uses it constantly. Distinguish true native-audio voice mode from speech-to-text wrappers. On desktop, SuperWhisper transcribes speech system-wide into any chat box.
- Custom instructions + memory + custom GPTs. Teach the model your preferences once; build small reusable assistants for recurring jobs (his example: extract Korean vocabulary from a screenshot and format it for Anki).
- Diagrams to understand, not just to present. Have the model render a Mermaid diagram of a book chapter or argument so you can grasp it spatially.
- Multimodal in daily life. Camera at nutrition labels, devices, maps, blood-test results, book covers — practical live assistance, not a demo.
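The Korean-vocabulary example ends in a formatting step that is easy to make concrete. The sketch below assumes the model has already extracted (Korean, English) pairs from the screenshot, and writes them as tab-separated text, a plain-text layout that Anki can import. The sample words and the `to_anki_tsv` name are illustrative, not taken from the talk.

```python
import csv
import io
from typing import Iterable, Tuple


def to_anki_tsv(pairs: Iterable[Tuple[str, str]]) -> str:
    """Render (front, back) pairs as tab-separated lines, one card per line."""
    buffer = io.StringIO()
    # csv.writer quotes any field that itself contains a tab or a quote.
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for front, back in pairs:
        writer.writerow([front.strip(), back.strip()])
    return buffer.getvalue()


if __name__ == "__main__":
    vocab = [("사과", "apple"), ("물", "water"), ("학교", "school")]
    print(to_anki_tsv(vocab), end="")
```

Keeping the formatter in code means the reusable assistant only has to do the extraction, and the output layout stays identical from run to run.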
Try It
For WEO Marketly / any team standardizing on LLMs day to day:
- Adopt “fresh chat per task.” Make it a team norm — it’s the single cheapest quality + cost win.
- Write a model-tier cheat sheet. Which model for routine lookups vs. hard reasoning/code, and when paying for a thinking model is worth the minutes. Tie it to intelligence levers.
- Build a few reusable “custom GPTs”/projects for recurring deliverables (e.g., a brand-voice rewriter, a competitor-summary assistant) instead of re-prompting from scratch — the delegation-over-prompt-magic point.
- Default to Deep Research for any 30–90 min manual-browse task, then verify the citations; this is the highest-leverage time saver.
- Put voice in the workflow. Try SuperWhisper for desktop dictation; it removes typing friction for long briefs.
- Treat artifacts as throwaway tools — when you’d normally search for an app or build a diagram by hand, ask the model to generate a one-off instead.
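A model-tier cheat sheet can be as small as a lookup table the team agrees on. In the sketch below the tier labels, the task categories, and the rule of falling back to the cheap tier are all placeholder assumptions to adapt, not recommendations from the talk.

```python
from typing import Dict

# Hypothetical tiers; substitute the actual models your team has access to.
FAST = "fast/cheap model"
THINKING = "thinking/reasoning model"
DEEP_RESEARCH = "deep research"

CHEAT_SHEET: Dict[str, str] = {
    "routine lookup": FAST,
    "rewrite or summarize": FAST,
    "hard code": THINKING,
    "math": THINKING,
    "tricky reasoning": THINKING,
    "multi-source research": DEEP_RESEARCH,
}


def pick_tier(task_type: str) -> str:
    """Look up the agreed tier; unknown task types fall back to the cheap tier."""
    return CHEAT_SHEET.get(task_type.strip().lower(), FAST)


if __name__ == "__main__":
    for task in ("Routine lookup", "math", "brand-voice draft"):
        print(f"{task} -> {pick_tier(task)}")
```

Defaulting unknown tasks to the cheap tier mirrors the note's advice to pay for a thinking model only when the task justifies it.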
Related
- From Vibe Coding to Agentic Engineering — Karpathy’s framing of where this human-as-delegator skill is heading next (schedule + tools + verification = agents)
- AutoResearch — Self-Improving Coding-Agent Loop — the agentic continuation of “Deep Research” once you add loops and verification
- Karpathy on Skills (Multica AI) — Karpathy’s take on the reusable-setup / skill idea this talk gestures at
- Dynamic Workflows — the “give the workflow a schedule + tools + verification” leap from chat to autonomous agents
- Prompt Engineering — the “be concrete, give examples, save reusable setups” delegation craft
- 2026 AI-Work Restructuring — the macro view of the delegate-and-verify operating model
Open Questions
- Exact publish date / model lineup at recording. The talk is early-2025 and references GPT-4o, o1, and Claude artifacts; specific model names and UI have since moved on, though the workflow endures.
- Which specifics are now stale. Operator/computer-use agents and newer model tiers (Opus 4.x, etc.) have advanced past the talk’s examples — a future refresh could map each 2025 tool habit to its 2026 equivalent.