The Complete Guide to Building Agentic AI with OpenAI API — Planning, Memory, Tools & Self-Critique

Introduction
Most developers' first encounter with the OpenAI API follows a familiar pattern: send a message, get a response, display it to the user. It works beautifully for simple use cases — chatbots, content generation, summarization. But the moment you want your AI to actually do something — search the web, query a database, write and run code, make decisions across multiple steps — a single prompt-response loop falls dramatically short.
That's where agentic AI comes in.
An agentic AI system doesn't just answer questions. It breaks down goals into tasks, calls tools to gather information or take action, remembers what it has learned across steps, and evaluates the quality of its own outputs before finalizing them. It behaves less like a search engine and more like a junior analyst who can be handed a complex problem and trusted to work through it systematically.
In this guide, we'll build exactly that — a fully functional agentic AI system using the OpenAI API, covering every core component: planning, tool calling, memory, and self-critique. By the end, you'll have a working architecture you can extend, customize, and deploy in production.
What Is an Agentic AI System?
Before writing a single line of code, it's worth being precise about what we're building and why each component matters.
A traditional LLM interaction is stateless and single-turn: input goes in, output comes out. An agentic system adds four critical layers on top of this:
| Component | What It Does | Why It Matters |
| Planning | Breaks a goal into ordered subtasks | Prevents the model from trying to do everything in one shot |
| Tool Calling | Lets the agent interact with external systems | Connects AI reasoning to real-world data and actions |
| Memory | Stores and retrieves context across steps | Prevents repetition and enables long-horizon tasks |
| Self-Critique | Evaluates and refines its own outputs | Dramatically improves output quality and reduces errors |
Together, these four layers transform a language model from a text predictor into a goal-directed system that can handle complex, multi-step workflows.
Agentic vs. Non-Agentic AI — Key Differences
| Feature | Standard LLM | Agentic AI System |
| Task scope | Single turn | Multi-step, long-horizon |
| Tool access | None | APIs, databases, code execution |
| Memory | None (stateless) | Short-term + long-term memory |
| Self-evaluation | None | Reflects and revises outputs |
| Decision making | Single inference | Plan → Act → Observe → Reflect loop |
| Error recovery | Fails silently | Detects and corrects mistakes |
Core Architecture Overview
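The high-level architecture we'll implement throughout this guide follows a simple flow: the planner turns a goal into ordered steps, each step runs with tool access and shared memory, the step results are synthesized into a draft, and a critic reviews that draft before it is returned.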
Each layer is independently implemented and composable. Let's build them one by one.
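As a rough sketch of how the four layers compose, the control flow can be written with plain callables standing in for each layer. The names `run_agent`, `plan_fn`, `act_fn`, and `critique_fn` are hypothetical placeholders for illustration, not part of the OpenAI SDK; the real versions are built in the steps that follow.

```python
from typing import Callable


def run_agent(
    goal: str,
    plan_fn: Callable[[str], list[str]],
    act_fn: Callable[[str, list[str]], str],
    critique_fn: Callable[[str, str], str],
) -> str:
    """Plan -> Act -> Observe -> Reflect, with each layer injected as a callable."""
    observations: list[str] = []              # memory: what the agent has seen so far
    for task in plan_fn(goal):                # plan: ordered subtasks
        result = act_fn(task, observations)   # act: run the step (may call tools)
        observations.append(result)           # observe: remember the result
    draft = "\n".join(observations)
    return critique_fn(goal, draft)           # reflect: evaluate and possibly revise


# Stub layers, only to show the flow end to end.
def fake_plan(goal: str) -> list[str]:
    return [f"research {goal}", f"summarize {goal}"]


def fake_act(task: str, seen: list[str]) -> str:
    return f"done: {task} (after {len(seen)} earlier steps)"


def fake_critique(goal: str, draft: str) -> str:
    return draft  # a real critic would score the draft and request revisions


final = run_agent("vector databases", fake_plan, fake_act, fake_critique)
print(final)
```

Because each layer is just a function argument here, any one of them can be swapped out without touching the others, which is the composability the guide relies on.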
Setting Up Your Environment
Requirements
pip install "openai>=1.0.0" python-dotenv requests
Project Structure
agentic-ai/
├── main.py
├── config.py
├── planner.py
├── tools.py
├── tool_runner.py
├── memory.py
├── critic.py
├── agent.py
└── .env
Environment Setup
# .env
OPENAI_API_KEY=your_openai_api_key_here
# config.py
import os

from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
MODEL = "gpt-4o"
Step 1 — Building the Planning Layer
The planning layer is the brain of your agent. Before taking any action, the agent first decomposes the user's goal into a structured list of subtasks — ordered, specific, and actionable.
Why Planning Matters
Without a planning step, LLMs tend to either try to solve everything in one massive response (which degrades quality) or get lost partway through a complex task. Explicit planning forces the model to think before it acts.
Implementation
# planner.py
import json

from openai import OpenAI

from config import MODEL

client = OpenAI()


def create_plan(goal: str) -> list[dict]:
    """
    Takes a high-level goal and returns an ordered list of subtasks.
    """
    system_prompt = """
You are an expert task planner. Given a goal, break it down into
a clear, ordered list of specific subtasks that an AI agent can
execute one by one.
Return your response as a JSON object with a single "tasks" key
whose value is an array of objects, each with:
- "step": step number (integer)
- "task": specific task description (string)
- "requires_tool": whether this step needs an external tool (boolean)
- "tool_hint": which type of tool might be needed (string or null)
Be specific. Each task should be executable independently.
Return ONLY valid JSON, no explanation.
"""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Goal: {goal}"},
        ],
        # JSON mode returns a single JSON object, hence the "tasks" wrapper key
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    # Accept "steps" as a fallback in case the model ignores the requested key
    return result.get("tasks", result.get("steps", []))


# Example usage
if __name__ == "__main__":
    goal = "Research the top 3 AI frameworks in 2026 and summarize their pros and cons"
    plan = create_plan(goal)
    for step in plan:
        print(f"Step {step['step']}: {step['task']}")
Sample Planner Output
{
"tasks": [
{
"step": 1,
"task": "Search for the most popular AI frameworks in 2026",
"requires_tool": true,
"tool_hint": "web_search"
},
{
"step": 2,
"task": "Gather detailed pros and cons for each framework",
"requires_tool": true,
"tool_hint": "web_search"
},
{
"step": 3,
"task": "Synthesize findings into a structured comparison summary",
"requires_tool": false,
"tool_hint": null
}
]
}
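Since the plan comes back as model-generated JSON, it can help to check its shape before executing anything. The following is a minimal sketch of such a check, assuming the `tasks` wrapper shown in the sample output; `validate_plan` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration.

```python
import json

# Keys every subtask must carry, with the type each value should have.
REQUIRED_KEYS = {"step": int, "task": str, "requires_tool": bool}


def validate_plan(raw_json: str) -> list[dict]:
    """Parse planner output and return its subtasks in step order, or raise ValueError."""
    data = json.loads(raw_json)  # json.JSONDecodeError is a subclass of ValueError
    tasks = data.get("tasks") if isinstance(data, dict) else None
    if not isinstance(tasks, list) or not tasks:
        raise ValueError("planner output has no non-empty 'tasks' list")
    for item in tasks:
        if not isinstance(item, dict):
            raise ValueError(f"subtask is not an object: {item!r}")
        for key, expected_type in REQUIRED_KEYS.items():
            if not isinstance(item.get(key), expected_type):
                raise ValueError(f"subtask is missing or has a bad '{key}': {item!r}")
    # Execute in step order even if the model returned the steps shuffled.
    return sorted(tasks, key=lambda t: t["step"])


sample = (
    '{"tasks": ['
    '{"step": 2, "task": "Summarize", "requires_tool": false, "tool_hint": null},'
    '{"step": 1, "task": "Search", "requires_tool": true, "tool_hint": "web_search"}'
    "]}"
)
plan = validate_plan(sample)
print([t["step"] for t in plan])
```

Raising early on a malformed plan is a design choice: it keeps a bad planner response from silently turning into an agent run with zero steps.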
Step 2 — Implementing Tool Calling
Tool calling is what separates an agent from a chatbot. It allows the AI to reach out to the real world — searching the web, querying APIs, reading files, running calculations — and bring that information back into its reasoning loop.
OpenAI Tool Calling — How It Works
OpenAI's tool calling works by defining tools as JSON schemas. The model decides when to call a tool, returns a structured tool call object, and your code executes the actual function — then feeds the result back to the model.
Model decides to call tool
│
▼
Your code executes the function
│
▼
Result is sent back to model
│
▼
Model continues reasoning with new information
Defining Your Tools
# tools.py
import requests
import json
from datetime import datetime
# Tool definitions (schemas the model uses to decide when/how to call)
TOOL_DEFINITIONS = [
{
"type": "function",
"function": {
"name": "web_search",
"description": "Search the web for current information on a topic",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query"
},
"max_results": {
"type": "integer",
"description": "Maximum number of results to return",
"default": 5
}
},
"required": ["query"]
}
}
},
{
"type": "function",
"function": {
"name": "calculate",
"description": "Perform mathematical calculations",
"parameters": {
"type": "object",
"properties": {
"expression": {
"type": "string",
"description": "Mathematical expression to evaluate"
}
},
"required": ["expression"]
}
}
},
{
"type": "function",
"function": {
"name": "get_current_datetime",
"description": "Get the current date and time",
"parameters": {
"type": "object",
"properties": {}
}
}
}
]
# Actual tool implementations
def web_search(query: str, max_results: int = 5) -> dict:
    """
    Simulated web search. Replace with a real search API
    (e.g. Serper, Tavily, Bing Search API).
    """
    # Replace this with your preferred search API
    return {
        "query": query,
        "results": [
            {
                "title": f"Result for: {query}",
                "snippet": f"This is a simulated result for '{query}'. "
                           f"Replace with real search API.",
                "url": "https://example.com"
            }
        ]
    }


def calculate(expression: str) -> dict:
    """Evaluate a basic arithmetic expression."""
    try:
        # eval() is used here for brevity only. The character whitelist is a
        # weak guard, so sandbox or replace this in production.
        allowed_chars = set('0123456789+-*/()., ')
        if all(c in allowed_chars for c in expression):
            result = eval(expression)
            return {"expression": expression, "result": result}
        else:
            return {"error": "Invalid characters in expression"}
    except Exception as e:
        return {"error": str(e)}


def get_current_datetime() -> dict:
    """Return the current date and time."""
    now = datetime.now()
    return {
        "datetime": now.isoformat(),
        "date": now.strftime("%Y-%m-%d"),
        "time": now.strftime("%H:%M:%S")
    }


def execute_tool(tool_name: str, tool_args: dict) -> str:
    """Route tool calls to the correct implementation."""
    tool_map = {
        "web_search": web_search,
        "calculate": calculate,
        "get_current_datetime": get_current_datetime
    }
    if tool_name not in tool_map:
        return json.dumps({"error": f"Unknown tool: {tool_name}"})
    result = tool_map[tool_name](**tool_args)
    return json.dumps(result)
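The `calculate` tool above leans on `eval()` behind a character whitelist. One possible alternative, sketched here under the assumption that only `+`, `-`, `*`, and `/` are needed, is to walk the expression's syntax tree with Python's standard `ast` module and reject everything else. `safe_calculate` is a hypothetical replacement name, not something the original code defines.

```python
import ast
import operator

# Only these operators are permitted; any other syntax is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node: ast.AST):
    """Recursively evaluate a node made only of numbers and allowed operators."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression")


def safe_calculate(expression: str) -> dict:
    """Evaluate +, -, *, / arithmetic without calling eval()."""
    try:
        tree = ast.parse(expression, mode="eval")
        return {"expression": expression, "result": _eval_node(tree.body)}
    except (ValueError, SyntaxError, ZeroDivisionError) as exc:
        return {"error": str(exc)}


print(safe_calculate("2 * (3 + 4)"))
print(safe_calculate("__import__('os')"))
```

Function calls, attribute access, and exponentiation all fall through to the `ValueError` branch, so the model cannot smuggle arbitrary code or a very expensive power expression through this tool.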
Running a Tool-Calling Loop
# tool_runner.py
import json

from openai import OpenAI

from config import MODEL
from tools import TOOL_DEFINITIONS, execute_tool

client = OpenAI()


def run_with_tools(messages: list, task: str) -> str:
    """
    Run a single agent step with tool calling support.
    Loops until the model stops requesting tools.
    """
    messages = messages + [{"role": "user", "content": task}]
    while True:
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=TOOL_DEFINITIONS,
            tool_choice="auto"
        )
        message = response.choices[0].message
        # If no tool call, we have our final answer
        if not message.tool_calls:
            return message.content
        # Process each tool call
        messages.append(message)
        for tool_call in message.tool_calls:
            tool_name = tool_call.function.name
            tool_args = json.loads(tool_call.function.arguments)
            print(f" → Calling tool: {tool_name}({tool_args})")
            result = execute_tool(tool_name, tool_args)
            # Feed tool result back to the model
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })
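The loop above keeps going for as long as the model requests tools. A capped variant might look like the sketch below. To keep it runnable offline, the model and the tool are passed in as plain callables: `ask_model`, `run_tool`, and the reply shape (`content` plus an optional `tool_call`) are simplifying assumptions made for this example, not the SDK's response format.

```python
from typing import Callable


def run_bounded(
    ask_model: Callable[[list[dict]], dict],
    run_tool: Callable[[dict], str],
    messages: list[dict],
    max_rounds: int = 5,
) -> str:
    """Alternate model turns and tool calls, giving up after max_rounds."""
    history = list(messages)  # copy so the caller's list is not mutated
    for _ in range(max_rounds):
        reply = ask_model(history)
        if reply.get("tool_call") is None:
            return reply["content"]  # no tool requested: this is the final answer
        history.append({"role": "tool", "content": run_tool(reply["tool_call"])})
    return f"Stopped after {max_rounds} tool rounds without a final answer."


# Stub model that asks for two tool calls and then answers.
def finishing_model(history: list[dict]) -> dict:
    tool_messages = [m for m in history if m["role"] == "tool"]
    if len(tool_messages) < 2:
        return {"content": "", "tool_call": {"name": "get_current_datetime", "args": {}}}
    return {"content": f"Answer based on {len(tool_messages)} tool results", "tool_call": None}


# Stub model that never stops requesting tools.
def looping_model(history: list[dict]) -> dict:
    return {"content": "", "tool_call": {"name": "web_search", "args": {"query": "x"}}}


def echo_tool(call: dict) -> str:
    return f"result of {call['name']}"


print(run_bounded(finishing_model, echo_tool, [{"role": "user", "content": "hi"}]))
print(run_bounded(looping_model, echo_tool, [], max_rounds=3))
```

Returning a message instead of raising keeps the surrounding plan alive when one step stalls; raising an exception would be an equally reasonable choice if the caller prefers to retry the step.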
Step 3 — Adding Memory
Memory is what allows your agent to function coherently across multiple steps and sessions. Without it, every step starts from scratch — the agent forgets what it just learned and repeats work unnecessarily.
Two Types of Memory You Need
| Memory Type | What It Stores | How Long |
| Short-term (Working) | Current task context, recent observations | Duration of one agent run |
| Long-term (Persistent) | Learned facts, past task results, user preferences | Across sessions |
Implementing Short-Term Memory
# memory.py
import json
import os
from datetime import datetime


class WorkingMemory:
    """
    Short-term memory for a single agent run.
    Stores observations, tool results, and intermediate conclusions.
    """

    def __init__(self):
        self.observations = []
        self.tool_results = []
        self.conclusions = []
        self.current_plan = []

    def add_observation(self, step: int, content: str):
        self.observations.append({
            "step": step,
            "content": content
        })

    def add_tool_result(self, tool_name: str, result: str):
        self.tool_results.append({
            "tool": tool_name,
            "result": result
        })

    def add_conclusion(self, step: int, conclusion: str):
        self.conclusions.append({
            "step": step,
            "conclusion": conclusion
        })

    def set_plan(self, plan: list):
        self.current_plan = plan

    def get_context_summary(self) -> str:
        """
        Generates a formatted context string to inject into prompts,
        giving the model awareness of what has happened so far.
        """
        parts = []
        if self.current_plan:
            parts.append("CURRENT PLAN:")
            for step in self.current_plan:
                parts.append(f" Step {step['step']}: {step['task']}")
        if self.observations:
            parts.append("\nOBSERVATIONS SO FAR:")
            for obs in self.observations[-5:]:  # Last 5 to manage context length
                parts.append(f" [Step {obs['step']}] {obs['content']}")
        if self.conclusions:
            parts.append("\nCONCLUSIONS REACHED:")
            for con in self.conclusions[-3:]:
                parts.append(f" [Step {con['step']}] {con['conclusion']}")
        return "\n".join(parts) if parts else "No context yet."


class LongTermMemory:
    """
    Simple persistent memory using a JSON file.
    In production, replace with a vector database (Pinecone, Chroma, Weaviate).
    """

    def __init__(self, filepath: str = "long_term_memory.json"):
        self.filepath = filepath
        self.memories = self._load()

    def _load(self) -> list:
        if os.path.exists(self.filepath):
            with open(self.filepath, "r") as f:
                return json.load(f)
        return []

    def _save(self):
        with open(self.filepath, "w") as f:
            json.dump(self.memories, f, indent=2)

    def store(self, key: str, value: str, tags: list = None):
        """Store a fact or result for future retrieval."""
        self.memories.append({
            "key": key,
            "value": value,
            "tags": tags or [],
            "timestamp": datetime.now().isoformat()
        })
        self._save()

    def search(self, query: str, top_k: int = 3) -> list:
        """
        Simple keyword-based search.
        Replace with vector similarity search in production.
        """
        query_lower = query.lower()
        scored = []
        for memory in self.memories:
            score = 0
            if query_lower in memory["key"].lower():
                score += 2
            if query_lower in memory["value"].lower():
                score += 1
            for tag in memory.get("tags", []):
                if query_lower in tag.lower():
                    score += 1
            if score > 0:
                scored.append((score, memory))
        scored.sort(key=lambda x: x[0], reverse=True)
        return [m for _, m in scored[:top_k]]

    def get_relevant_context(self, query: str) -> str:
        """Returns relevant memories formatted for prompt injection."""
        results = self.search(query)
        if not results:
            return "No relevant past memories found."
        parts = ["RELEVANT PAST KNOWLEDGE:"]
        for memory in results:
            parts.append(f" [{memory['key']}]: {memory['value']}")
        return "\n".join(parts)
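One limitation of the `search` method above is that it looks for the whole query as a single substring, so a long goal sentence will rarely match a stored key word for word. A small step up, short of a vector database, is to score by shared words. The sketch below assumes the same memory record shape (`key`, `value`, `tags`); `tokenize` and `overlap_search` are hypothetical helpers.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase words of 3+ characters, as a crude stand-in for embeddings."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) >= 3}


def overlap_search(memories: list[dict], query: str, top_k: int = 3) -> list[dict]:
    """Rank memories by how many query words appear in their key, value, or tags."""
    query_words = tokenize(query)
    scored = []
    for memory in memories:
        text = " ".join([memory["key"], memory["value"], *memory.get("tags", [])])
        score = len(query_words & tokenize(text))
        if score > 0:
            scored.append((score, memory))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [memory for _, memory in scored[:top_k]]


memories = [
    {
        "key": "Compare AI coding assistants",
        "value": "Copilot and Claude Code reviewed",
        "tags": ["completed_task"],
    },
    {"key": "Quarterly sales report", "value": "Revenue grew", "tags": []},
]
hits = overlap_search(memories, "Research the top AI coding assistants in 2026")
print([m["key"] for m in hits])
```

Word overlap still misses synonyms and paraphrases, which is the gap that embedding-based similarity search is meant to close.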
Step 4 — Self-Critique and Reflection
Self-critique is the most underrated component of agentic AI systems. It's what separates an agent that produces good-enough outputs from one that consistently delivers high-quality, reliable results.
The idea is simple: after the agent produces an output, a second LLM call evaluates it against explicit quality criteria — and if the output falls short, the agent revises it.
The Critique Loop
Agent produces output
│
▼
Critic evaluates against criteria
│
Pass? ──── Yes ──→ Return output
│
No
│
▼
Agent revises with critique feedback
│
▼
(Loop again, max N times)
Implementation
# critic.py
from openai import OpenAI
from config import MODEL
import json
client = OpenAI()
CRITIQUE_CRITERIA = """
Evaluate the output against these criteria:
1. ACCURACY — Is the information factually correct and well-supported?
2. COMPLETENESS — Does it fully address the original goal?
3. CLARITY — Is it clearly written and well-structured?
4. RELEVANCE — Is everything included actually relevant to the goal?
5. ACTIONABILITY — Is the output useful and actionable for the user?
Rate each criterion: PASS or FAIL
Provide specific, actionable feedback for any FAILs.
Return as JSON with keys: scores (dict), overall_pass (bool), feedback (string)
"""


def critique_output(goal: str, output: str) -> dict:
    """
    Evaluates an agent output against quality criteria.
    Returns critique scores and actionable feedback.
    """
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": f"You are a rigorous quality evaluator. {CRITIQUE_CRITERIA}"
            },
            {
                "role": "user",
                "content": f"ORIGINAL GOAL:\n{goal}\n\nAGENT OUTPUT:\n{output}"
            }
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)


def revise_output(goal: str, original_output: str, critique: dict) -> str:
    """
    Revises an agent output based on critique feedback.
    """
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": "You are an expert writer and analyst. Revise the provided output to address all critique feedback while maintaining what was already good."
            },
            {
                "role": "user",
                "content": f"""ORIGINAL GOAL:
{goal}
ORIGINAL OUTPUT:
{original_output}
CRITIQUE FEEDBACK:
{critique.get('feedback', 'No specific feedback provided.')}
SCORES:
{json.dumps(critique.get('scores', {}), indent=2)}
Please provide an improved version that addresses all FAIL criteria."""
            }
        ]
    )
    return response.choices[0].message.content


def critique_and_revise(goal: str, output: str, max_iterations: int = 2) -> str:
    """
    Runs the full critique-revise loop until output passes
    or max iterations are reached.
    """
    current_output = output
    for iteration in range(max_iterations):
        print(f"\n [Critic] Evaluating output (iteration {iteration + 1})...")
        critique = critique_output(goal, current_output)
        scores = critique.get("scores", {})
        overall_pass = critique.get("overall_pass", False)
        print(f" [Critic] Scores: {scores}")
        print(f" [Critic] Overall: {'PASS ✓' if overall_pass else 'FAIL, revising...'}")
        if overall_pass:
            print(f" [Critic] Output approved after {iteration + 1} evaluation(s)")
            return current_output
        print(f" [Critic] Feedback: {critique.get('feedback', '')}")
        current_output = revise_output(goal, current_output, critique)
    print(" [Critic] Max iterations reached; returning best output")
    return current_output
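The critic's reply is model-generated JSON, so its fields may be missing or inconsistent, for example `overall_pass` set to true while one criterion says FAIL. One way to guard the loop is to normalize the critique before acting on it. This is a minimal sketch; `normalize_critique` is a hypothetical helper, and the strict rule it applies (any FAIL overrides a reported pass) is an assumption about what you want, not behavior of the API.

```python
def normalize_critique(raw: dict) -> dict:
    """Coerce critic output into a dict with scores, overall_pass, and feedback."""
    scores_in = raw.get("scores")
    scores = {}
    if isinstance(scores_in, dict):
        # Uppercase keys and values so "pass", "Pass", and "PASS" compare equal.
        scores = {str(k).upper(): str(v).strip().upper() for k, v in scores_in.items()}
    any_fail = any(value != "PASS" for value in scores.values())
    overall = raw.get("overall_pass")
    if not isinstance(overall, bool):
        # Missing or malformed flag: derive it from the individual scores.
        overall = bool(scores) and not any_fail
    return {
        "scores": scores,
        # Never report a pass while an individual criterion failed.
        "overall_pass": overall and not any_fail,
        "feedback": str(raw.get("feedback") or ""),
    }


inconsistent = normalize_critique(
    {
        "scores": {"accuracy": "pass", "clarity": "Fail"},
        "overall_pass": True,
        "feedback": "Tighten section 2",
    }
)
print(inconsistent["overall_pass"])
```

An empty critique normalizes to a fail, which errs on the side of one more revision rather than approving output that was never actually scored.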
Putting It All Together
Now we combine all four layers into a single, cohesive agent.
# agent.py
from openai import OpenAI

from config import MODEL
from critic import critique_and_revise
from memory import WorkingMemory, LongTermMemory
from planner import create_plan
from tool_runner import run_with_tools

client = OpenAI()


class AgenticAI:
    """
    A fully agentic AI system with planning, tool calling,
    memory, and self-critique capabilities.
    """

    def __init__(self):
        self.working_memory = WorkingMemory()
        self.long_term_memory = LongTermMemory()

    def run(self, goal: str) -> str:
        print(f"\n{'='*60}")
        print(f"GOAL: {goal}")
        print(f"{'='*60}\n")

        # --- Phase 1: Planning ---
        print("[Phase 1] Creating plan...")
        plan = create_plan(goal)
        self.working_memory.set_plan(plan)
        print(f"Plan created with {len(plan)} steps:")
        for step in plan:
            print(f" {step['step']}. {step['task']}")
        # Check long-term memory for relevant past knowledge
        past_knowledge = self.long_term_memory.get_relevant_context(goal)
        if past_knowledge != "No relevant past memories found.":
            print("\n[Memory] Found relevant past knowledge")
            self.working_memory.add_observation(0, f"Past knowledge: {past_knowledge}")

        # --- Phase 2: Execute Each Step ---
        print("\n[Phase 2] Executing plan steps...")
        step_results = []
        for step in plan:
            step_num = step["step"]
            task = step["task"]
            print(f"\n Executing Step {step_num}: {task}")
            # Build context-aware messages
            context = self.working_memory.get_context_summary()
            system_message = f"""You are an expert AI assistant executing a specific task.
OVERALL GOAL: {goal}
{context}
Execute the current task carefully and thoroughly.
Use tools when needed to gather real information."""
            messages = [{"role": "system", "content": system_message}]
            # Execute step with tool calling
            result = run_with_tools(messages, task)
            # Store result in working memory
            self.working_memory.add_observation(step_num, result)
            step_results.append({
                "step": step_num,
                "task": task,
                "result": result
            })
            print(f" ✓ Step {step_num} complete")

        # --- Phase 3: Synthesize Final Output ---
        print("\n[Phase 3] Synthesizing final output...")
        all_results = "\n\n".join([
            f"Step {r['step']} ({r['task']}):\n{r['result']}"
            for r in step_results
        ])
        synthesis_response = client.chat.completions.create(
            model=MODEL,
            messages=[
                {
                    "role": "system",
                    "content": "You are an expert analyst. Synthesize the step results into a comprehensive, well-structured final response that fully addresses the original goal."
                },
                {
                    "role": "user",
                    "content": f"GOAL: {goal}\n\nSTEP RESULTS:\n{all_results}\n\nProvide a comprehensive final response."
                }
            ]
        )
        raw_output = synthesis_response.choices[0].message.content

        # --- Phase 4: Self-Critique ---
        print("\n[Phase 4] Running self-critique...")
        final_output = critique_and_revise(goal, raw_output)
        # Store result in long-term memory
        self.long_term_memory.store(
            key=goal[:100],
            value=final_output[:500],
            tags=["completed_task"]
        )
        print(f"\n{'='*60}")
        print("FINAL OUTPUT:")
        print(f"{'='*60}")
        return final_output


# main.py
from agent import AgenticAI

if __name__ == "__main__":
    agent = AgenticAI()
    result = agent.run(
        "Research the top 3 AI coding assistants in 2026 and "
        "provide a detailed comparison of their features, pricing, and best use cases"
    )
    print(result)
Sample Run Output
============================================================
GOAL: Research the top 3 AI coding assistants in 2026
============================================================
[Phase 1] Creating plan...
Plan created with 3 steps:
1. Search for top AI coding assistants in 2026
2. Gather detailed feature and pricing information for each
3. Synthesize findings into a structured comparison
[Phase 2] Executing plan steps...
Executing Step 1: Search for top AI coding assistants in 2026
→ Calling tool: web_search({"query": "top AI coding assistants 2026"})
✓ Step 1 complete
Executing Step 2: Gather detailed information...
→ Calling tool: web_search({"query": "Claude Code vs GitHub Copilot features 2026"})
✓ Step 2 complete
Executing Step 3: Synthesize findings...
✓ Step 3 complete
[Phase 3] Synthesizing final output...
[Phase 4] Running self-critique...
[Critic] Evaluating output (iteration 1)...
[Critic] Scores: {"ACCURACY": "PASS", "COMPLETENESS": "PASS", ...}
[Critic] Overall: PASS ✓
[Critic] Output approved after 1 evaluation(s)
============================================================
FINAL OUTPUT:
============================================================
Best Practices for Production
| Area | Recommendation |
| Model choice | Use gpt-4o for reasoning steps, gpt-4o-mini for critique to reduce costs |
| Error handling | Wrap every tool call and LLM call in try/except with retry logic |
| Memory backend | Replace file-based memory with a vector DB (Chroma, Pinecone, Weaviate) |
| Tool sandboxing | Never run eval() or execute code without a sandboxed environment |
| Rate limiting | Implement exponential backoff for OpenAI API rate limit errors |
| Observability | Log every step, tool call, and critique result for debugging |
| Token management | Track and limit token usage per run to avoid runaway costs |
| Max iterations | Always set a hard ceiling on agent loops to prevent infinite runs |
| Security | Sanitize all tool inputs — never pass raw user input to system commands |
| Testing | Write unit tests for each tool and integration tests for the full loop |
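To make the rate-limiting row concrete, here is a minimal sketch of exponential backoff with jitter. `with_backoff` is a hypothetical helper; the demo uses `TimeoutError` and a fake `sleep` so it runs instantly, and with the OpenAI SDK you would more likely pass `retry_on=(openai.RateLimitError,)` and wrap the API call in a lambda.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry call() with exponentially growing, jittered delays."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return call()
        except retry_on:
            if attempt >= max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)      # 1s, 2s, 4s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries


# Demo: a call that fails twice, then succeeds. Delays are recorded, not slept.
delays: list[float] = []
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("busy")
    return "ok"


result = with_backoff(flaky, retry_on=(TimeoutError,), sleep=delays.append)
print(result, attempts["count"], len(delays))
```

Injecting `sleep` as a parameter is what makes the helper easy to unit test, which also serves the testing row in the table above.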
Common Pitfalls to Avoid
| Pitfall | What Goes Wrong | How to Fix It |
| No loop limit | Agent runs forever on hard problems | Set max_iterations on every loop |
| Ignoring tool errors | Agent hallucinates tool results | Always validate and handle tool failures |
| Bloated context | Token limits exceeded mid-run | Summarize memory, truncate old observations |
| Over-critiquing | Too many revision cycles inflate cost | Limit critique to 2–3 iterations max |
| Trusting all tool output | Bad data propagates through steps | Validate and sanity-check tool results |
| No fallback planning | One step fails, whole task collapses | Implement step-level retry and recovery |
| Verbose system prompts | Model loses focus on the actual task | Keep system prompts focused and concise |
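For the bloated-context pitfall, one simple mitigation is to keep only the newest observations that fit a size budget before building the prompt. The sketch below uses a character count as a crude proxy for tokens, which is an assumption made for brevity; `trim_observations` is a hypothetical helper that works on the same observation dicts that `WorkingMemory` stores.

```python
def trim_observations(observations: list[dict], max_chars: int = 2000) -> list[dict]:
    """Keep the most recent observations whose combined content fits max_chars."""
    kept: list[dict] = []
    used = 0
    for obs in reversed(observations):  # walk from newest to oldest
        size = len(obs["content"])
        if kept and used + size > max_chars:
            break  # this entry and everything older no longer fits
        # Always keep the newest entry, clipping it if it alone is too long.
        kept.append({**obs, "content": obs["content"][:max_chars]})
        used += min(size, max_chars)
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"step": 1, "content": "a" * 50},
    {"step": 2, "content": "b" * 30},
    {"step": 3, "content": "c" * 40},
]
recent = trim_observations(history, max_chars=80)
print([obs["step"] for obs in recent])
```

Dropping the oldest entries first matches the `[-5:]` slice already used in `get_context_summary`, but bounds the prompt by size rather than by count, so one very long tool result cannot crowd out the budget unnoticed.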
Final Thoughts
Building an agentic AI system is one of the most rewarding things you can do as a developer right now. The gap between a simple chatbot and a truly autonomous agent isn't as wide as it seems — it's largely a matter of adding the right layers in the right order.
To recap what we built:
| Component | What We Built |
| Planning | Goal decomposition into ordered, specific subtasks |
| Tool Calling | Web search, calculation, datetime — with a clean routing layer |
| Memory | Working memory for current run + persistent long-term memory |
| Self-Critique | Automated quality evaluation and output revision loop |
| Full Agent | A composable system that ties all four layers together |
From here, the natural next steps are to swap the simulated web search for a real API like Tavily or Serper, replace file-based memory with a proper vector database, add more tools tailored to your domain, and instrument the whole system with logging and observability.
The architecture in this guide is intentionally modular — each layer can be improved, replaced, or extended independently. That's the foundation of a system you can actually evolve and trust in production.
At CognyX AI, we design and deploy production-grade agentic AI systems tailored to real business workflows — from autonomous data pipelines to multi-agent research and decision-making systems. If you're ready to move beyond demos and build something that actually runs in production, let's talk.
