How Much Should You Tell Your AI Agent?

September 30, 2025

A prompt guide for long-running LLM agents

By Raymond Xu

Wait, 30 hours?

Anthropic's September 29th launch of Claude Sonnet 4.5 came along with the claim that it could do work for 30 hours at a time.

But, for most of us developing multi-tool, multi-turn LLM agents, this level of autonomy is difficult to orchestrate & program, even if it is within the ability of the model.

How do you get an agent that can do the work consistently in a loop without mistakes?

What kind of system prompt could I give my agent that makes it work well over many iterations?

Specificity = how much you enumerate a specific list of what the model can do vs how much you let the model infer what it can do.

  • Too Enumerated: "Do X, then Y, then Z, unless A, then do B, unless C..."
  • Too Implied: "Act according to our principles"
  • Balanced: Enough explicit instructions that the model can correctly handle cases you didn't enumerate

Both relying on implied model understanding and being too enumerated are common mistakes. Most people start too implied, then immediately transition to being too enumerated.

What makes picking a level of specificity confusing is that earlier models (pre-Sonnet 4 / pre-GPT-5) required you to maximize your enumeration to get good results.

Now that models are more advanced and can make and change their own plans, we don't need to do step-by-step planning for them anymore. Over-enumerating is now detrimental for modern frontier models because it constrains the agent so that it can't handle cases even slightly outside the strictly enumerated ones.

Picking a balanced amount of specificity is key to agent prompting.

But if I can't tell my agent to act step by step, what about the whole looping thing? You didn't mention anything about doing things in a loop? How do I get my LLM agent to do a lot of stuff well that takes a lot of tool calls or whatever?

You can still have a couple of specific examples with specific enumeration!

However, as the number of iterations of an agent increases, the amount of predictability for exact steps the agent will take becomes very low.

Imagine you hire a detective. You assign them a case and provide them as much information as you have, but you don't dictate which leads they follow or in what order; they adapt based on what they discover.

[Image: Confused detective] A detective adapts based on discovered leads; there is no preset plan

Over-enumeration of instructions in a prompt actually causes the LLM agent to improvise less, which is the wrong behavior for many real-life scenarios.

The more the agent's work is defined step-by-step, the more rigid its behavior will be, which actually reduces its ability to solve real-world issues.

To make the idea more mathematically specific, let's say your LLM agent has a 10% risk of making a mistake on any given tool call (I would say coding agents are about this accurate). That means that if you tell your agent 10 specific steps to follow very closely, you only have 0.90^10 ≈ 0.35 (35%) odds of being correct at the end.

Each iteration takes perhaps 10 seconds, so a 35% chance that your LLM agent is still going in the correct direction after 100 seconds seems quite poor.
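Here's that back-of-the-envelope compounding written out (the 90% per-step success rate is just the assumption from above, not a measured number):

```python
# Chance that an agent following a rigid N-step script finishes with no mistakes,
# assuming each step independently succeeds 90% of the time.
per_step_success = 0.90

for n_steps in [1, 5, 10, 20, 50]:
    p_all_correct = per_step_success ** n_steps
    print(f"{n_steps:>3} enumerated steps -> {p_all_correct:5.1%} chance of a clean run")

# 10 steps -> ~35%, 50 steps -> ~0.5%. A fixed script collapses quickly,
# which is why course correction matters more than perfect instructions.
```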

What makes a long-horizon agent successful is that it can recognize when it's going in the wrong direction and course-correct to align with the goal.

This goal-aligned agent can run for an unlimited time, since at any given moment it can recognize the best direction for its tool-calling effort to get it closer to the goal. It doesn't really matter how many wrong steps it accidentally takes in a row, as long as it knows which direction to go, makes progress in that direction on average, and doesn't give up. (And as long as you have enough money to pay for tokens.)
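A minimal sketch of the difference (in Python, with a stub standing in for the real model call; the action names and the `next_action` helper are made up for illustration): each iteration re-plans against the goal instead of marching through a fixed script.

```python
import random

GOAL = "Resolve the customer's issue and confirm they are satisfied."

def next_action(goal: str, notes: list[str]) -> str:
    # Stand-in for an LLM call. A real agent would send the goal plus the
    # notes/tool results so far and ask: "what single step moves us closest
    # to the goal right now?"
    return random.choice(["look_up_order", "ask_followup", "propose_fix", "done"])

def run(max_steps: int = 50) -> None:
    notes: list[str] = []
    for step in range(max_steps):
        action = next_action(GOAL, notes)
        if action == "done":
            print(f"Goal reached after {step} steps.")
            return
        # A wrong step isn't fatal: the result is recorded, and the next
        # iteration re-plans against the goal rather than a fixed script.
        notes.append(f"step {step}: tried {action}")
    print("Ran out of step budget (or token money) before reaching the goal.")

if __name__ == "__main__":
    run()
```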

But how do I make the agent know how to align with my company goals??

Try putting the general-purpose goal you have for your agent at the start of the prompt, in a small number of sentences (but likely more than one).

Don't just say "You are a doctor", "You are a customer support agent", "You are a legal assistant", and then leave it at that. You need to say what the goals are for the agent in its work, not just its role.
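For example, with the Anthropic Python SDK, stating the goal (not just the role) in the system prompt might look like this. This is a minimal sketch: the model ID and the bakery-specific goals are placeholders I made up, not something from Anthropic's guide.

```python
import anthropic

SYSTEM_PROMPT = (
    "You are a customer support agent for Claude's Bakery. "
    "Your goal is to resolve order issues quickly and keep the customer informed, "
    "and to hand off to a human whenever a request falls outside store policy."
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder model ID; check the current docs
    max_tokens=1024,
    system=SYSTEM_PROMPT,        # goals stated up front, not just "You are a support agent"
    messages=[{"role": "user", "content": "My birthday cake order never arrived."}],
)
print(response.content[0].text)
```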

Let's Warm Up Our Prompt Tuning Skills by Starting From a Bad Prompt, And Then Making Edits ✏️

Here are the prompt examples from the guide on the Anthropic website (as of October 2nd, 2025). It even gives examples of what they believe is too specific or too vague! But do you notice any issues...? (Just skim over these prompts; I break them down a couple of scrolls later.)


### Too Specific

"""
**You are a helpful assistant for Claude's Bakery.**
You must respond to the name Claude.
For every user request you MUST FOLLOW THESE STEPS:

1. Identify the user intent as one of the following: [incident_resolution, general_inquiry, order_resubmission, account_maintenance, requires_escalation]
2.
- If user intent is incident_resolution, ask 3 followup questions to gather information, then always call the resolve tool
- If user intent is general_inquiry, do not ask followup questions and answer in one shot
- If user intent ...
- ...
3. Here is an exhaustive list of cases that should be tagged as requires_escalation:
- If the intent is incident_resolution but the user is in a different country
- If the user left a physical belonging in the store
- ...
4. Once you've ruled out escalation scenarios you should consider all the tools at your disposal.
5. If the user_request contains an order_id you should tag the user intent as order_resubmission, unless the user meets 5/7 of the following requirements:
- User is asking for time update
- User is asking for location update
- ...
6. If the user wants to request a new order, but they already have another order in flight, you should follow these 5 steps of the resolution procedure:
- (1) Call check_order tool to see where the current order is
- ...
...
"""

---

### Just Right

"""
**You are a customer support agent for Claude's Bakery.**
You should respond to customer queries and basic questions about the bakery. Use the tools available to you to resolve the issue efficiently and professionally.

You have access to order management systems, product catalogs, and store policies. Your goal is to resolve issues quickly when possible. Start by understanding the situation before proposing solutions, ask follow-up questions if you do not understand.

**Response Framework:**
1. Identify the core issue - Look beyond surface complaints to understand what the customer actually needs
2. Gather necessary context - Use available tools to verify order details, check inventory, or review policies before responding
3. Provide clear resolution - Offer concrete next steps with realistic timelines
4. Confirm satisfaction - Ensure the customer understands the resolution and knows how to follow up if needed

**Guidelines:**
- When multiple solutions exist, choose the simplest one that fully addresses the issue
- If an action mentions an order, check its status before suggesting next steps
- When uncertain, call the human_assistance tool
- For legal issues, health/allergy emergencies, or situations requiring financial adjustments beyond standard policies, call the human_assistance tool
- Acknowledge frustration or urgency in the user's tone and respond with appropriate empathy
"""

---

### Too Vague

"""
**You are a bakery assistant, you should attempt to solve customers issues in a manner consistent with the principles and essence of the company brand.**
Escalate to a human if needed.
"""

This is a good start! But if you're a more experienced prompter and look closely, you'll notice that there are actually elements of each prompt that are not present in the others, beyond the idea of specificity.

Imagine You're An Employee In the Agent's Shoes 👟👟

A system prompt for yourself: imagine you start a job and the only instructions you have are in your system prompt document.

You want to receive instructions about your work that take you along the most direct path from what you currently know to the version of you that can perform the job. So from where we're sitting here reading these 3 prompts, it looks like we're some kind of support staff for some bakery.

"Too Specific"? This prompt actually has many elements that could be kept!

  • The "Too Specific" prompt actually has a number of details that the other prompts don't have, for an example:
    • 3. Here is an exhaustive list of cases that should be tagged as requires_escalation:
    • If the intent is incident_resolution but the user is in a different country
    • If the user left a physical belonging in the store

These two actions cannot be deduced if you were given the set of instructions in the "Just Right" section.

  • Why do you have to escalate if a user is trying to resolve an incident but lives in a different country? That is not something that could be figured out without being explicitly told.
  • Why do you have to escalate if someone left an item in the store? If I were working at a bakery and someone left an item, I would presume the most intuitive way for the person to retrieve the item would be if they came back to the bakery. Escalating is not intuitive to me, and so it belongs in the prompt since I wouldn't be able to guess to do that.

Just Right? This prompt is simultaneously too minimal AND too vague!

  • If I were able to access the order management system, can I:
    • Issue refunds? (Would this customer support agent lose money just like the Claude Vending Machine Experiment?)
    • Send another order of something somewhere if someone claims it isn't delivered?
    • Accept payment?
  • What kinds of questions do customers actually ask? And what do resolutions look like?
    • Even one example helps a lot in an agent like this.
  • I will say that this prompt seems to work well as an overview though.

Too Vague? We actually do need a section on principles in the prompt!

  • "Consistent with the principles and essence of the company brand"
    • So what the hell are these principles? They seem important! If Anthropic's office were to open a bakery on the first floor for the public, and people could talk to an AI agent, I'm sure that company branding and personality would be of the utmost importance to convey what the bakery is all about.
    • For example, when you become an employee at the fast food company Chick-fil-A, you are famously indoctrinated with hospitality training, which you are taught is part of the company culture. It goes all the way down to consistent language: employees are trained to say "my pleasure" instead of "you're welcome".
    • How your employer wants you to talk, and what values your employer has are an important part of being a good employee, and so it belongs in the prompt. It's just not well defined in the single sentence that's given to us!
    • Another note here is that company principles are likely going to be involved in every single response that you give back to the customer, so they should go in the system prompt instead of being retrieved like the store policies the "Just Right" section suggests. I doubt that the company principles weigh you down by more than 1k tokens of company culture (an extra ~$0.0003 per run of cached Sonnet 4.5; quick math below).
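Here's the quick math behind that ~$0.0003 figure, assuming roughly 1k tokens of principles and a cached-input price on the order of $0.30 per million tokens (an assumed price; check current pricing):

```python
# Rough per-run cost of keeping ~1k tokens of company principles in a cached system prompt.
principle_tokens = 1_000
cached_input_price_per_mtok = 0.30   # assumed cache-read price, USD per million tokens

cost_per_run = principle_tokens / 1_000_000 * cached_input_price_per_mtok
print(f"${cost_per_run:.4f} per run")  # -> $0.0003
```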

[Image: Orange-themed bakery] Fresh off of the vending machine losses, Anthropic attempts a bakery

So What's The Correct Amount of Specificity, Smart-ass?

You need to give the correct amount of instructions such that someone with no context can understand and then do the job. At the Code w/ Claude event during the launch of Claude Sonnet 4 on May 22nd, I spoke to a couple of employees about how to get this POV just right.

> Hannah Moran, Applied AI at Anthropic:
> "I give a lot of b2b prompt guidance, and it's mostly just telling people that they should put the stuff that they're telling me over video call about what the model should do, into the prompt"

Here's the talk from that conference on "Prompting for Agents".

When people start writing a prompt, they suddenly take themselves out of the POV of explaining something to a peer who doesn't know the subject, which seems to be the best POV for prompt writing. Similarly, people have a lot of theory-of-mind difficulty when writing developer documentation.

How Do I Know If the Model Understands the Specifics of What I'm Talking About?

It's often unclear if the LLM knows what to take away from a certain example or a specific instruction line.

At that same Code w/ Claude Conference, I ran into Amanda Askell, one of the creators of the Claude.ai system prompt.

> Me: "How do I know that my system prompt is well understood 
by the model?"
> Amanda: "Did you try asking it about what it thought was 
confusing?"
> Me: "How do I do that?"
> Amanda: "Paste in the section of the prompt that you're
uncertain about, and ask the model if those instructions
are clear, or if there's anything confusing."

This has been surprisingly effective. I would actually even scratch the "part of the prompt you're uncertain about" idea: take your whole prompt, feed it into the model's playground console (to avoid having any other system prompt; both OpenAI and Anthropic have playgrounds), and ask it "what in these instructions is confusing?", section by section.
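You can also do this through the API instead of the playground. A minimal sketch with the Anthropic Python SDK; the model ID is a placeholder, and `PROMPT_SECTIONS` is whatever chunks you split your own prompt into:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT_SECTIONS = [
    "You are a customer support agent for Claude's Bakery...",
    "Guidelines: when multiple solutions exist, choose the simplest one...",
]

for i, section in enumerate(PROMPT_SECTIONS, start=1):
    # No system prompt on purpose: we want the model reviewing the text cold.
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                "Here is one section of a system prompt I'm writing:\n\n"
                f"{section}\n\n"
                "Are these instructions clear? What, if anything, is confusing or ambiguous?"
            ),
        }],
    )
    print(f"--- Section {i} ---\n{response.content[0].text}\n")
```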

After this conversation, I asked Amanda for a selfie, which she declined. People who know how to prompt are getting too popular.

Stop talking about random convos, how do I actually write a prompt

Prompt structure as described in this talk. Literally prompting 101.

1. Task context
2. Tone context
3. Background data, documents, and images
4. Detailed task description & rules
5. Examples
6. Conversation history
7. Immediate task description or request
8. Thinking step by step / take a deep breath
9. Output formatting
10. Prefilled response (if any)

Example (from the same video)

You will be acting as an AI career coach named Joe created by the company AdAstra Careers. Your goal is to give career advice to users. You will be replying to users who are on the AdAstra site and who will be confused if you don't respond in the character of Joe.
You should maintain a friendly customer service tone.
Here is the career guidance document you should reference when answering the user: <guide>{{DOCUMENT}}</guide>
Here are some important rules for the interaction:
  • Always stay in character, as Joe, an AI from AdAstra careers
  • If you are unsure how to respond, say "Sorry, I didn't understand that. Could you repeat the question?"
  • If someone asks something irrelevant, say, "Sorry, I am Joe and I give career advice. Do you have a career question today I can help you with?"
Here is an example of how to respond in a standard interaction:

<example>
User: Hi, how were you created and what do you do?
Joe: Hello! My name is Joe, and I was created by AdAstra Careers to give career advice. What can I help you with today?
</example>
Here is the conversation history (between the user and you) prior to the question. It could be empty if there is no history:
<history> {{HISTORY}} </history>
Here is the user's question: <question> {{QUESTION}} </question>
How do you respond to the user's question?
Think about your answer first before you respond.
Put your response in <response></response> tags.
Assistant (prefill)
<response>

This prompt was much better than the Claude Bakery prompt. Notice that there are actually very few step-by-step instructions, and that the goal is at the top!
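If it helps to see how the numbered pieces above assemble into one string, here's a rough sketch of templating that AdAstra-style prompt in Python (the function name and layout are mine, not from the talk):

```python
def build_prompt(document: str, history: str, question: str) -> str:
    """Assemble the prompt in the order from the talk: task & tone context first,
    then background data, rules, history, the immediate request, thinking
    instructions, and output formatting last."""
    return f"""You will be acting as an AI career coach named Joe created by the company AdAstra Careers. Your goal is to give career advice to users.
You should maintain a friendly customer service tone.
Here is the career guidance document you should reference when answering the user: <guide>{document}</guide>
Here are some important rules for the interaction:
- Always stay in character, as Joe, an AI from AdAstra Careers
- If you are unsure how to respond, say "Sorry, I didn't understand that. Could you repeat the question?"
Here is the conversation history (between the user and you) prior to the question. It could be empty if there is no history:
<history>{history}</history>
Here is the user's question: <question>{question}</question>
Think about your answer first before you respond.
Put your response in <response></response> tags."""


print(build_prompt(document="...career guide text...", history="", question="How do I move into data engineering?"))
```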

Ok Idiot, I Understand Anthropic Has Guides. What More Insight Do You Have On Prompting?

So one of the points of this blog is that I actually wanted to make a comparison between writing prompt instructions and writing the constitution of the United States. This will connect back together, I promise. (And yes, I understand that I'm explaining something convoluted, like prompting, with an analogy to the creation of government, which is even more convoluted.)

The US founding debate: Enumerated vs Implied Powers

  • Enumerated powers are explicitly listed in the US constitution, and include powers like coining money, declaring war, and regulating interstate commerce.
  • The question was whether the federal government could only exercise these specific powers, or whether it also possessed implied powers - authorities not explicitly stated but reasonably derived from enumerated ones.

This debate continues to shape constitutional interpretation today.

In the modern understanding, the US government does have some implied powers: the Federal Reserve, the SEC, the minimum wage, consumer protection laws, etc.

What was the point of having enumerated powers vs implied powers?

  • Relying too much on explicitly listed powers would handicap the government. Example: the government could "raise an army" but couldn't do something like buy a tank because that's not verbatim included in the constitution about army-raising.
  • But relying too much on implied power definition means that those in power can choose to expand the government into anything at all that was deemed necessary, allowing for an ever expanding and overreaching federal government.

[Image: Founding Fathers] Founding fathers debate how to get examples for the constitution prompt

The Somewhat Stable Solution of allowing some implied powers

A key part here makes implied powers work: they're derived from enumerated powers. Given the examples of some explicitly listed powers of government, we can actually deduce how to act in all situations through implication, like our raise-an-army-means-you-can-buy-a-tank example from above! Seem like a familiar concept?

Prompting a long-running agent is making a constitution

Instead of imagining your agent getting a problem, you have to imagine the vibe of how the agent is. What is it driven by? What are its principles? Can it do anything arbitrary that it thinks is a good idea? What is the paradigm of the system that it's in?

Allow amendments to the constitution

Another thing that defines long-running systems well is that they can make more rules for themselves as they go along. Many businesses have a strategy they're executing that is greater than 200k tokens (one 500-page book), but probably less than 100m tokens (500 different 500-page books).

But you won't know all the rules at the beginning! Many will have to be discovered as you go along. So what is your mechanism for writing down context and then retrieving it later? You probably need more than naive RAG. It is definitely possible to run ~10,000+ iterations in an agent and still track what's going on in notes and continue to make progress, as Anthropic demonstrated with their Pokémon-playing agent.
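One concrete (and deliberately simple) version of this is a notes file the agent appends to whenever it learns a new rule, and re-reads at the start of every turn. This is just an illustrative sketch I'm making up, not how Anthropic's Pokémon agent works:

```python
from pathlib import Path

NOTES_FILE = Path("agent_amendments.md")

def record_amendment(rule: str) -> None:
    """Append a newly discovered rule so future iterations can see it."""
    with NOTES_FILE.open("a", encoding="utf-8") as f:
        f.write(f"- {rule}\n")

def load_amendments() -> str:
    """Read the accumulated rules back in; gets appended to the system prompt each turn."""
    return NOTES_FILE.read_text(encoding="utf-8") if NOTES_FILE.exists() else ""

# During a run the agent might discover something the original prompt never covered:
record_amendment("Orders placed before 6am are handled by the night-shift baker, not the order system.")
print(load_amendments())
```

Past a certain size you'd want retrieval over these notes rather than dumping them all into context, but the append-then-reload loop is the core of the "amendments" idea.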

Surely you're overcomplicating how to pick examples for a prompt?

Empirically, it is very difficult to get good iterations in on prompt tuning. If your agent is supposed to do 15 minutes of work at a time, you can only iterate 4 times per hour!

Prompts: The Credence Good

I believe having principles and goals described in simple terms at the beginning of prompts is the most effective way to develop long running LLM agents. This is because you need to state what you're evaluating for, since it's so hard to evaluate the agent itself. Prompts are a credence good.

Are LLM agents a search good - something you can fully evaluate before purchase, like inspecting a TV at the store? No. You can't know if your agent will handle edge cases correctly just by reading the prompt.

Are they an experience good - something you can evaluate after using it, like a restaurant meal? Only if your input scope is very limited. If your agent only handles a narrow set of predetermined queries, you can test all scenarios and know if it works. But most agents aren't built for such constrained use cases.

Are they a credence good - something that's difficult or impossible to evaluate even after consumption, like a legal contract whose quality you don't know until you get sued much later? Yes. Prompts for the general-purpose, can-take-in-any-input agent are a "credence good" 🤔, meaning they are very difficult or impossible to evaluate even after consumption (placing them into your code). Perhaps you've already heard of the difficulty of LLM evals. Why is traditional machine learning evaluation of prompts so difficult?

You're using an LLM agent because the input can be anything, right? That means you need to do well on the undefined case, never seen before. For those with more ML knowledge, imagine train / validation / test sets, except the test set has infinite possibilities, since many agents are connected to users who could want to talk about anything 🤯. How can you eval against... everything?

But a really good prompt has the effect of handling any situation that's not in the eval.

For comparison, Google searches today are around 15% unique. But Google searches are only about 4 words long on average! Think about how much larger the input possibilities are for a 40-word input, with 20 messages in a thread! Essentially every single message will be completely unique to the model.

And yet high-quality agents like Cursor's or Claude Code's are seemingly able to withstand any input prompt under the sun.

Prompts being a Credence Good means relying less on LLM Evals

I've now gone over both the slow iteration speed for long-running LLM agents described earlier, and the difficulty of connecting evaluation sets with test sets described in this credence-good section.

What I believe is that this means that we should evaluate prompts like essays. Read them and evaluate them based on their principles and how representative their examples are.

Today's Prompt Engineers ponder the agent's output, wondering if it's better than yesterday's.

How do you prepare your prompt for something you don't know?

Aha, this is where we connect back to the creation of the US constitution.

Prompts need to have explicitly listed powers that are described in a way that the LLM model can accurately predict what it can do in a completely new situation.

Regular code has maximum specificity. You can define the output for an extremely specific input. However, that's its only mode.

By default, LLM models, without prompting for it, have no idea what else they can do that you didn't tell them.

Preparing your LLM agent for anything: the short guide

  1. Constitution - Principles, goals, outcomes
  2. Laws - A low number of guiding examples
  3. Amendments - The ability to change the rules as you go along and encounter new situations
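Put together, a skeleton following those three parts might look something like this (an outline I'm making up for illustration, not from any Anthropic guide):

"""
**Constitution:** You are an agent for <company>. Your goal is <the outcome the business actually cares about>, guided by <2-3 principles, e.g. "resolve issues in as few back-and-forth messages as possible">.

**Laws:** A small number of guiding examples:
- <an example that demonstrates a principle, not a one-off case>
- <an example of when to hand off to a human>

**Amendments:** When you discover a rule that isn't written here, record it in your notes so future turns (and future prompt revisions) can use it.
"""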

Not a Contradiction: Freeing the model through a small amount of enumeration

As you can see from the title of this blog post, we are trying to free the prompt. But that doesn't mean we lower the specificity all the way.

The LLM agent doesn't like to be locked down by laws. But a few well-selected enumerated examples in the prompt actually allow the agent to reason through how it would behave in all situations. The enumerated examples should represent behavior principles, not one-off independent situations.

The whole point of using an LLM is its ability to reason about novel situations. Lock it down with 'do exactly X in case Y' and you've just built an expensive decision tree.

Can you give me some concrete examples of some good prompts with the right amount of specificity?

Bad (too enumerated):

"""
If the person asks about weekend rentals, then respond that
they are going to be directed to Tasha. If they ask about
weekday rentals, respond that we usually don't do them, but 
they can check online at the website to see if we have 
availability. If they respond to that with more questions, 
then say that you don't know but the website refreshes
every morning with availability. There is also other info
on the website. If they ask about pricing, then tell them
that is under the pricing tab of the website. 
"""

Bad (too implied):

"""
This agent handles people who want to rent out the bar for the 
evening. Make customers feel happy talking to you.
"""

Good:

"""
You are an assistant to the owner of a bar called bar bar.
You handle booking requests from customers who want to 
rent out our bar.

Tasha handles weekend rentals, her phone number is X. 
All other information can be viewed on the website. 

Track information from the website and inform the people
you're talking to about the latest availability.
"""

Now that I write out this example agent, I realize that it is simple enough that it could / should be a static webpage with availability and a phone number for Tasha. But this version interacts with you! AI is useful and not a bubble.

Prompt Improvement Technique: Paper Promptimization

Once your prompt gets long enough and each run gets long enough, your iterations will get slow enough that, in order to collaborate, I find it easier to print the entire prompt out on sheets of paper and make large edits to sections with pen and paper over a few hours, as opposed to using GitHub.

Prompt Improvement Technique: Weird Voice

Another way I find to really focus on a long prompt (since I'm bad at reading) is to feed the whole thing into ElevenLabs Studio text-to-speech and have a new, strange voice read the prompt back to me. It creates this sort of out-of-body POV that makes you pay more attention to things that "sound weird".

Common anti-patterns

  • Over-indexing on recent failures by adding every single one of them to the prompt with "don't do this". Classic overfitting.
  • Confusing "One example of a principle" with "one rule for one case".
  • Too many NEVER / ALWAYS statements. If every single line's instructions are absolute, then it increases the probability of contradiction.

Side note / caveat: AAAAAALLLLLLL of this depends on how much your model already knows about the problem

My wife has recently gotten into making prompts for agents.

The idea of the specificity being too low is actually not applicable if the agent is very familiar with the concept. Let's say my wife makes an agent to text me which just has the simple prompt "convince my husband to spend less money. look at his credit card statement. [attached image]. ridiculous". I would suspect that this agent would actually do really well, because this is a really common situation, and one month's spending is really all the context that you need, complete with examples and things you can deduce. We don't really need to come up with more "principles" or examples. Even though I could say basically anything to combat the agent texting me, the AI knows the situation too well to get confused and go off course from its original task.

Future blog ideas

  • End to end prompt tune case study for a real agent on a real world problem
  • Tool use specificity
  • Making agents even more general

[Image: Declared it] I never ran the thing, but I'm sure I fixed it

