Last week, I spent three hours watching a coding agent go completely off the rails. I asked it to refactor a Python API: extract some routes, clean up the database layer, add a few tests. Simple stuff. By step six, it had rewritten my entire error handling logic, deleted a config file it deemed "unnecessary," and gotten stuck in a loop trying to install a package that didn't exist. I shut it down, did the work manually, and wondered why I had even bothered.
Then I got access to GPT-5.5 inside Codex, and it genuinely changed how I approach these tasks. Not because it's magic, but because it actually finishes what you ask it to do without going sideways halfway through. But here's the thing: if you just open it up and start typing prompts like it's ChatGPT, you're going to be disappointed. GPT-5.5 in a chat interface is like using a bulldozer to move a potted plant. It works, but you're missing the point entirely.
Let me walk you through what I've learned getting real work out of this setup.
Why GPT-5.5 in Codex Is Different
Most people first encounter GPT-5.5 through the ChatGPT interface. They ask it a question, get a solid answer, and think, "Cool, slightly better than the last one." That's a narrow use case for a model that was specifically built for agentic AI tasks—multi-step workflows, tool use, long-horizon planning, and autonomous execution across a codebase.
The difference is immediate when you use it in Codex. Earlier GPT models had a frustrating tendency to drift. You'd give them a complex, multi-step task, and by step four or five, they'd start interpreting your original goal loosely. GPT-5.5 holds the original intent much better through 20-30 step tasks. It also handles tool calls—file reads, shell commands, API calls, test runners—more consistently. It's less likely to call a tool unnecessarily, pass malformed arguments, or loop on failed tool invocations. Since Codex agents spend most of their runtime executing tool calls rather than just generating text, this matters enormously.
Step 1: Selecting the Right Model
This sounds obvious, but I messed it up the first time. When you open Codex, you need to explicitly select GPT-5.5 from the model dropdown. Don't just assume the default is the latest model—it usually isn't.
In the Codex interface, look for the model selector in the top-right corner. Select gpt-5.5. If you don't see it, check that your account has access to the model. I wasted an entire afternoon wondering why my results were mediocre before I realized I was still on an earlier model.
Step 2: Use Plan Mode First
Skipping plan mode is the single biggest mistake I see people make. They jump straight into execution mode, give GPT-5.5 a vague instruction, and then get frustrated when the output doesn't match what was in their head.
Plan mode exists for a reason. When you start a task, click "Plan" instead of "Execute." This tells GPT-5.5 to break down the task into steps and show you its intended approach before it actually does anything.
Here's what I do now: I give it the task, let it plan, review the plan, and then approve it. If the plan looks wrong, I adjust my prompt. This saves an enormous amount of time and tokens.
For example, instead of:
Refactor the authentication module
I write:
Refactor the authentication module in /src/auth/.
Split the monolithic auth.py into separate files for:
- token generation (tokens.py)
- user validation (validators.py)
- session management (sessions.py)
Keep all existing API contracts the same.
Add unit tests for each new file in /tests/auth/.
Then I run it in plan mode. GPT-5.5 will lay out exactly which files it's going to create, what functions go where, and what tests it plans to write. I review that, catch anything it missed, and then execute.
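One way to make that plan review less error-prone is to compare the files the plan names against the files you expected. The sketch below is a minimal illustration, assuming you copy the file paths out of the plan by hand. `review_plan`, `PlanReview`, and the example paths are hypothetical names, not part of Codex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanReview:
    missing: list[str]     # expected files the plan never mentions
    unexpected: list[str]  # files the plan touches that were not requested


def review_plan(planned: list[str], expected: list[str]) -> PlanReview:
    """Compare files named in a plan against the files the prompt asked for."""
    planned_set, expected_set = set(planned), set(expected)
    return PlanReview(
        missing=sorted(expected_set - planned_set),
        unexpected=sorted(planned_set - expected_set),
    )


# Hypothetical example: paths copied by hand from a plan for the auth refactor.
expected = [
    "src/auth/tokens.py",
    "src/auth/validators.py",
    "src/auth/sessions.py",
]
planned = [
    "src/auth/tokens.py",
    "src/auth/validators.py",
    "src/config/settings.py",
]

review = review_plan(planned, expected)
print("Missing from plan:", review.missing)
print("Not requested:", review.unexpected)
```

Anything in the "missing" list suggests the prompt needs tightening, and anything in the "not requested" list is worth questioning before you approve the plan.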
Step 3: Give It a Workspace, Not Just a Prompt
This was a genuine "aha" moment for me. GPT-5.5 performs dramatically better when you provide context about the workspace before asking it to do something. Think of it like onboarding a new developer—you wouldn't just hand them a ticket and walk away.
Before your main task, provide:
- Who the output is for (e.g., "This is an internal API consumed by our mobile team")
- What the output is supposed to become (e.g., "This will be a standalone microservice")
- What sources or files matter (e.g., "The database schema in /db/schema.sql is the source of truth")
- What style or conventions to follow (e.g., "We use Google-style docstrings and type hints everywhere")
I now keep a CONTEXT.md file in my project root that I reference in prompts:
Reference CONTEXT.md for project conventions before starting.
This single change improved my results more than any prompt engineering trick.
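As a starting point, the four kinds of context listed above can be rendered into a CONTEXT.md with a small script. This is a sketch under my own assumptions: the section headings and the `render_context` helper are hypothetical, and nothing about this layout is required by Codex.

```python
def render_context(
    reader: str, goal: str, sources: list[str], conventions: list[str]
) -> str:
    """Render the four kinds of workspace context as plain Markdown."""
    lines = [
        "# Project context",
        "",
        "## Who consumes this code",
        reader,
        "",
        "## What the output should become",
        goal,
        "",
        "## Sources of truth",
        *[f"- {item}" for item in sources],
        "",
        "## Conventions",
        *[f"- {item}" for item in conventions],
        "",
    ]
    return "\n".join(lines)


# Example values taken from the list above; adjust them to your project.
context = render_context(
    reader="Internal API consumed by our mobile team.",
    goal="A standalone microservice.",
    sources=["/db/schema.sql is the source of truth for the database schema."],
    conventions=["Google-style docstrings.", "Type hints everywhere."],
)

# To save the file, uncomment the next two lines:
# from pathlib import Path
# Path("CONTEXT.md").write_text(context, encoding="utf-8")
print(context)
```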
Step 4: Managing Token Costs
Let's be honest—GPT-5.5 is not cheap, especially in agentic mode where it's making dozens of tool calls. I burned through my first month's budget in about a week because I wasn't paying attention.
Here's what I've learned about keeping costs reasonable:
Scope tasks tightly. Instead of "Add authentication to the entire app," break it into "Add JWT token generation to /src/auth/tokens.py" and then "Add authentication middleware to /src/middleware/" as separate tasks. Smaller tasks mean fewer tokens and less drift.
Use plan mode to preview token usage. Before executing, Codex shows an estimated token count for the plan. If it's way higher than expected, your task is probably too broad.
Don't re-run from scratch. If GPT-5.5 completes 80% of a task and makes a mistake on the last step, don't restart the whole thing. Point out the specific error and ask it to fix just that part.
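To see why a targeted fix beats a restart, it helps to run the arithmetic. The sketch below uses placeholder prices and token counts invented purely for illustration. They are not GPT-5.5's actual rates, so substitute the numbers from your own billing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price in dollars per one million tokens. Values are placeholders."""

    input_per_million: float
    output_per_million: float


def estimate_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost = input tokens * input rate + output tokens * output rate."""
    return (
        input_tokens / 1_000_000 * pricing.input_per_million
        + output_tokens / 1_000_000 * pricing.output_per_million
    )


# Hypothetical rates, NOT real GPT-5.5 pricing.
pricing = Pricing(input_per_million=1.00, output_per_million=10.00)

# Hypothetical token counts for the same task handled two ways.
full_rerun = estimate_cost(input_tokens=400_000, output_tokens=60_000, pricing=pricing)
targeted_fix = estimate_cost(input_tokens=50_000, output_tokens=5_000, pricing=pricing)

print(f"Full re-run:  ${full_rerun:.2f}")
print(f"Targeted fix: ${targeted_fix:.2f}")
```

The exact figures matter less than the ratio: a restart pays again for every file read and tool call the agent already got right.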
Step 5: What GPT-5.5 Actually Excels At
After using it daily for a few weeks, here's my honest assessment of where it shines and where it doesn't:
Great at:
- Writing and debugging code across multiple files
- Researching unfamiliar codebases (it reads files methodically)
- Complex scientific or mathematical code where reasoning depth matters
- Following detailed instructions over long task chains
- Running and interpreting test results
Mediocre at:
- Highly creative or opinionated architectural decisions (it plays it safe)
- Tasks where the "right answer" depends on business context it doesn't have
- Extremely large codebases that exceed even its context window (you need to scope carefully)
Practical Tips I Wish I'd Known Earlier
Always commit your code before running GPT-5.5. It will modify files. Sometimes it modifies files you didn't expect. Having a clean git state makes it trivial to revert.
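A small pre-flight check can enforce this habit. The sketch below parses the output of `git status --porcelain`, which prints one line per changed or untracked file and nothing when the working tree is clean. The function names are hypothetical, and the parsing is deliberately simple: it does not handle renames or paths that Git quotes.

```python
import subprocess


def dirty_paths(porcelain_output: str) -> list[str]:
    """Extract paths from `git status --porcelain` (v1) output.

    Each line is a two-character status code, a space, then the path.
    """
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def uncommitted_changes(repo_dir: str = ".") -> list[str]:
    """Return the paths with uncommitted changes in repo_dir."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. outside a repository
    )
    return dirty_paths(result.stdout)


# Parsing demo on sample output, so this runs without touching a real repository.
sample = " M src/auth/auth.py\n?? notes.txt\n"
print(dirty_paths(sample))
```

Before starting an agent run, call `uncommitted_changes()` and commit or stash anything it reports.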
Watch the first few tool calls. Don't walk away immediately. If it starts reading the wrong files or heading in the wrong direction, stop it early. Correcting course at step 2 is much cheaper than at step 15.
Be explicit about what NOT to change. I now include lines like "Do not modify any files in /src/legacy/" in my prompts. Without this, GPT-5.5 might decide to "improve" things you wanted left alone.
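Stating the rule in the prompt is one half; checking it afterwards is the other. As a sketch, the helper below takes a list of changed paths, such as the output of `git diff --name-only`, and flags any that fall under a directory you declared off-limits. `protected_violations` is a hypothetical name, and the paths are examples.

```python
from pathlib import PurePosixPath


def protected_violations(changed: list[str], protected_dirs: list[str]) -> list[str]:
    """Return the changed paths that live under any protected directory.

    Paths are compared component by component, so "src/legacy_tools" is
    not mistaken for a child of "src/legacy".
    """
    # Git reports repo-relative paths, so drop leading and trailing slashes.
    roots = [PurePosixPath(d.strip("/")).parts for d in protected_dirs]
    violations = []
    for path in changed:
        parts = PurePosixPath(path).parts
        if any(parts[: len(root)] == root for root in roots):
            violations.append(path)
    return violations


# Hypothetical list of changed files, e.g. from `git diff --name-only`.
changed_files = [
    "src/auth/tokens.py",
    "src/legacy/old_auth.py",
    "src/legacy_tools/helper.py",
]
print(protected_violations(changed_files, ["/src/legacy/"]))
```

If the result is non-empty, revert those files before reviewing the rest of the diff.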
Use it for understanding, not just writing. One of my favorite uses is asking GPT-5.5 to read through a complex module and explain what it does. In my experience, it reasons reliably about even dense scientific code, and its answers to conceptual follow-up questions have been consistently strong.
Honest Limitations
GPT-5.5 in Codex is not a replacement for understanding your own codebase. It's a very capable agent that can execute well-defined tasks reliably, but it still needs clear direction. If you give it ambiguous instructions, you'll get ambiguous results—just faster and more expensive ones.
The biggest limitation is context. Even with improved token efficiency, a large codebase will exceed what fits in its context window. You need to be strategic about pointing it at the right files and directories.
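To decide which directories to point it at, a rough size estimate helps. The sketch below uses the common rule of thumb of about four characters per token, which is only an approximation and not GPT-5.5's actual tokenizer. The helper names and the budget figure are hypothetical.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough rule of thumb, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a string from its length."""
    return len(text) // CHARS_PER_TOKEN


def estimate_dir_tokens(root: str, pattern: str = "*.py") -> dict[str, int]:
    """Map each matching file under root to its estimated token count."""
    return {
        str(path): estimate_tokens(path.read_text(encoding="utf-8", errors="ignore"))
        for path in sorted(Path(root).rglob(pattern))
        if path.is_file()
    }


def fits_budget(estimates: dict[str, int], budget: int) -> bool:
    """Check whether the files together stay within a token budget you choose."""
    return sum(estimates.values()) <= budget


# Demo on in-memory numbers; the budget is an arbitrary example value.
example = {"src/auth/tokens.py": 1_200, "src/auth/sessions.py": 2_500}
print(fits_budget(example, budget=5_000))
```

Calling `estimate_dir_tokens("src/auth")` gives a per-file breakdown you can use to decide what to leave out of a task.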
That said, for the first time, I'm genuinely comfortable letting an agent run multi-step tasks without babysitting it constantly. That's a significant shift from where things were even a few months ago. If you're working with code and haven't tried GPT-5.5 in Codex with the workflow I've described here, give it a shot. Just remember: workspace, not prompt. Plan, then execute. Scope tight, commit first.