How agents actually work under the hood
The think-act-observe loop, tool calling, and context windows explained without the hype. What's really happening when an AI agent does something for you.
You asked ChatGPT to search the web and summarize what it found. It did. But what actually happened between your request and that answer? Most people assume there’s some kind of magic going on. There isn’t. The process is surprisingly mechanical, and once you understand it, you’ll use agents better and trust them more appropriately.
The loop that runs everything
Every AI agent, regardless of the company that built it or the interface you’re using, runs on the same basic cycle:
- Think. The agent reads your request and decides what to do next.
- Act. It calls a tool to do something in the real world. (If a skill is loaded for the task, the skill shapes which tools get called and in what order; the act phase runs whether or not a skill file is present.)
- Observe. It reads the result that came back from the tool.
- Then it loops. Back to thinking. Maybe it needs another tool. Maybe it has enough information to answer you.
That’s it. Think, act, observe, repeat. The entire AI agent industry is built on this loop. Different companies dress it up with different names, but the bones are always the same.
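If you like seeing the shape of a thing in code, here’s a minimal sketch of that loop in Python. Everything in it is illustrative: call_model and run_tool are hypothetical stand-ins for a real model API and real tool plumbing, and production agents add plenty of detail on top.

```python
# A minimal sketch of the think-act-observe loop.
# call_model() and run_tool() are hypothetical stand-ins for a real
# model API and real tool implementations; actual agents differ.

def run_agent(user_request, max_steps=10):
    history = [{"role": "user", "content": user_request}]

    for _ in range(max_steps):
        # Think: the model reads the history and decides what to do next.
        decision = call_model(history)

        if decision["type"] == "answer":
            # Enough information on the desk; respond and stop looping.
            return decision["text"]

        # Act: call the tool the model picked, with the inputs it chose.
        result = run_tool(decision["tool"], decision["inputs"])

        # Observe: put the tool result back on the "desk" so the next
        # think step can read it.
        history.append({"role": "tool", "content": result})

    return "Stopped: hit the step limit without finishing."
```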
Walking through a real example
Let’s make this concrete. Say you type: “What’s the weather in Minneapolis right now? Should I bring an umbrella?”
Here’s what happens inside the agent, step by step.
Step 1: Think. The agent reads your message. It doesn’t know the weather. It can’t just guess (well, it could, but a good agent knows not to). It decides it needs to look up the current weather. It picks its weather tool.
Step 2: Act. The agent calls the weather tool with the input “Minneapolis, MN, current conditions.” This is a real API call to a weather service. The agent isn’t imagining the weather. It’s asking for actual data.
Step 3: Observe. The weather tool responds with something like: “Minneapolis, 45°F, cloudy, 80% chance of rain in the next 2 hours, wind 12 mph NW.”
Step 4: Think again. The agent now has the data. It reads the result, sees the 80% rain chance, and decides it has enough information to answer your question. No need for another tool call.
Step 5: Respond. “It’s 45°F and cloudy in Minneapolis right now, with an 80% chance of rain in the next couple hours. Definitely bring an umbrella.”
Five steps. Two of them were “thinking,” one was a tool call, one was reading the result, and one was writing back to you. The whole thing probably took two or three seconds.
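One detail worth seeing: the “act” step isn’t a sentence, it’s structured data. Here’s a hedged sketch of roughly what that weather call and its result might look like. The field names follow a common function-calling convention, but every provider formats these a little differently.

```python
# Roughly what the agent's "act" step produces: a structured tool
# call, not prose. Field names here are illustrative; each provider
# formats these slightly differently.

tool_call = {
    "tool": "get_weather",
    "inputs": {"location": "Minneapolis, MN", "when": "current"},
}

# And roughly what comes back in the "observe" step:
tool_result = {
    "location": "Minneapolis, MN",
    "temp_f": 45,
    "conditions": "cloudy",
    "rain_chance_2h": 0.80,
    "wind": "12 mph NW",
}
```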
Now imagine a harder question. “Find me flights from Minneapolis to Denver next Friday, compare the prices, and tell me which airline to pick.” That’s going to loop several times. Search for flights, observe the results, maybe search again with different parameters, compare what came back, reason about which option is best, and finally respond. Same loop, just more laps.
The agent’s working memory
Here’s where things get interesting, and where most people’s mental model breaks down.
An agent doesn’t have a brain the way you do. It has a context window. Think of it as a desk. A physical desk with a fixed amount of surface area. Everything the agent is currently working with sits on that desk: your conversation history, the results from tool calls, its own reasoning, the instructions it was given.
When the desk is clean and mostly empty (like at the start of a conversation), the agent works great. It can see everything clearly, nothing is buried, and it can reason about all the information at once.
But as the conversation gets longer, papers pile up. Old messages, tool results, back-and-forth exchanges. The desk fills up. And here’s the problem: the desk doesn’t grow. It has a fixed size. Different models have different desk sizes (current frontier models hold roughly 200,000 tokens, with longer-context variants reaching a million; a “token” is roughly three-quarters of a word). The exact number doesn’t matter as much as the principle: every desk has edges.
When the desk fills up, something has to go. The oldest papers get pushed off. The agent literally loses access to things you said earlier in the conversation. This is why long conversations get weird. It’s not that the agent got dumber. It just can’t see the thing you told it 45 minutes ago anymore. That piece of paper fell off the desk.
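If you want the mechanics, here’s a hedged sketch of the simplest version of that trimming: drop the oldest messages until everything fits again. Real systems are cleverer (they summarize old exchanges, or pin the instructions so they can’t fall off), and the four-characters-per-token estimate is only a rule of thumb.

```python
# The simplest version of "papers fall off the desk": drop the oldest
# messages until the conversation fits the context window again.
# Real systems summarize or pin important messages instead; the
# 4-characters-per-token estimate is only a rough rule of thumb.

def estimate_tokens(text):
    return len(text) // 4  # crude approximation

def trim_to_fit(messages, budget_tokens=200_000):
    trimmed = list(messages)
    total = sum(estimate_tokens(m["content"]) for m in trimmed)
    while total > budget_tokens and len(trimmed) > 1:
        oldest = trimmed.pop(0)  # the oldest paper slides off the desk
        total -= estimate_tokens(oldest["content"])
    return trimmed
```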
This is also why agents sometimes repeat themselves, contradict earlier statements, or forget instructions you gave at the start. They’re not being difficult. They’re working with limited space, reading whatever’s still visible on the desk.
I’ve seen people get frustrated when an agent “forgets” a preference they stated 30 messages ago. Now you know why. The fix is simple: repeat important context when you need it. Don’t assume the agent remembers. Remind it. For a deeper look at how developers are solving this problem, see Agent Memory Patterns.
How tool calling actually works
When an agent has access to tools, it receives a list of the available tools along with a description of each one. Think of it like handing someone a menu at a restaurant. The agent reads the descriptions and picks the one that matches what it’s trying to do.
This is where skill files come in. A skill file is a markdown document that defines what an agent should do during the “act” phase of the loop. When someone writes a skill file, they’re defining the steps the agent takes, the checks it performs, and the output it produces. The skill file is what shapes the “act” step from a vague “do something” into a specific, repeatable process. For a real example of what a skill file looks like, see the PR review skill.
Here’s a simplified version of what that menu might look like to the agent:
- web_search: Search the internet for current information. Input: a search query.
- get_weather: Get current weather conditions for a location. Input: city name.
- read_file: Read the contents of a document. Input: file path.
- send_email: Send an email message. Input: recipient, subject, body.
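Under the hood, that menu usually isn’t prose. Each tool is handed to the model as a structured schema. Here’s a hedged sketch of what the weather entry might look like, loosely following the JSON-Schema style that most function-calling APIs use; exact formats vary by provider.

```python
# A hedged sketch of one "menu entry" as most function-calling APIs
# expect it: a name, a description the model actually reads, and a
# schema for the inputs. Exact formats vary by provider.

get_weather_tool = {
    "name": "get_weather",
    "description": "Get current weather conditions for a location.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name, e.g. 'Minneapolis, MN'",
            },
        },
        "required": ["city"],
    },
}
```

Notice that the description is plain English. The name and that one sentence are most of what the model has to go on when it picks from the menu.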
The agent doesn’t “know” how these tools work internally. It just reads the description, decides which one fits, and fills in the inputs. It’s remarkably similar to how you’d decide which app to open on your phone. You need directions? Open Maps. You need to message someone? Open your messaging app. You’re pattern-matching based on descriptions.
This is why tool descriptions matter so much. If a tool has a vague or confusing description, the agent might pick the wrong one, just like you might open the wrong app if all your icons looked the same.
Why agents fail (and it’s usually not the AI’s fault)
Agents mess up. Everyone’s experienced it. But when you understand the loop, you can usually figure out why.
It picked the wrong tool. The agent read the descriptions, misunderstood the situation, and chose a tool that didn’t fit. This happens when tools have overlapping descriptions or when your request is ambiguous. If you ask “look this up” and the agent has both a web search tool and a database search tool, it might pick the wrong one.
It ran out of context. The desk got too full. The agent lost track of something important, maybe your original question, maybe a key detail from an earlier tool result. Long, complex tasks hit this wall more often. If you notice an agent losing the plot, try breaking the task into smaller pieces or starting a fresh conversation.
It hallucinated instead of using a tool. This is the big one. Sometimes an agent will confidently make up an answer instead of admitting it doesn’t know or calling a tool to find out. A good agent should think “I don’t know this, let me look it up.” A lazy or poorly designed one will just guess and present the guess as fact. This is why you should always verify important claims, especially when the agent didn’t visibly use a tool to find the answer.
The tool itself failed. The agent did everything right, picked the correct tool, sent the right inputs, but the tool returned an error or bad data. Maybe the website was down. Maybe the API rate limit was hit. The agent then has to decide what to do with the failure, and not all agents handle this gracefully.
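A well-built agent treats a failed tool call as just another observation rather than a dead end. Here’s a hedged sketch of what graceful handling might look like; the retry count, backoff numbers, and helper names are all illustrative.

```python
# A hedged sketch of graceful tool-failure handling: retry a couple of
# times, then report the failure as an observation so the next "think"
# step can pick another tool or admit it couldn't find out. The retry
# count and backoff numbers are illustrative.

import time

def act_with_retries(run_tool, tool, inputs, attempts=3):
    last_error = None
    for attempt in range(attempts):
        try:
            return {"ok": True, "result": run_tool(tool, inputs)}
        except Exception as err:  # e.g. timeout, rate limit, bad data
            last_error = err
            if attempt < attempts - 1:
                time.sleep(2 ** attempt)  # back off: 1s, then 2s
    # Surface the failure instead of crashing; the agent decides
    # what to do with it on the next loop.
    return {"ok": False, "result": f"Tool '{tool}' failed: {last_error}"}
```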
What this means for you
Understanding how agents work doesn’t require a computer science degree. You just need the mental model: think, act, observe, repeat. A desk that fills up. A menu of tools.
With that model, you can:
- Write better requests, because you understand the agent needs to map your words to specific tools
- Troubleshoot problems, because you can guess where in the loop things went wrong
- Know when to start fresh, because you understand context limits
- Spot hallucinations, because you can tell when the agent skipped the “act” step and just made something up
If you’re ready to try this out in practice, Getting started with agent skills walks you through the basics. And if you want to understand the skills themselves in more depth, What are agent skills covers the building blocks. For the precise vocabulary distinctions (skill vs tool vs plugin vs integration), see What’s the difference between an AI skill, tool, plugin, and integration?.
The more you understand what’s happening behind the curtain, the less it feels like magic and the more it feels like a tool you can actually control. That’s the goal.
Related articles
What are agent skills and why they matter
A beginner-friendly introduction to AI agent skills: what they are, why they're transforming how we work with AI, and how to think about them.
Agent memory patterns for non-developers
What 'memory' actually means for AI agents, why your assistant forgets things, and how to work with memory instead of fighting it.
Agent skills glossary: skill, tool, MCP, plugin, agent, prompt
Plain-language definitions for the AI agent vocabulary that gets used loosely everywhere: what's a skill, what's a tool, how MCP fits in, how plugins differ, and what an agent actually is.