The Complete Guide to Building Agentic AI with OpenAI API — Planning, Memory, Tools & Self-Critique

Published 2 months ago

AI Development & Engineering

Introduction

Most developers' first encounter with the OpenAI API follows a familiar pattern: send a message, get a response, display it to the user. It works beautifully for simple use cases — chatbots, content generation, summarization. But the moment you want your AI to actually do something — search the web, query a database, write and run code, make decisions across multiple steps — a single prompt-response loop falls dramatically short.

That's where agentic AI comes in.

An agentic AI system doesn't just answer questions. It breaks down goals into tasks, calls tools to gather information or take action, remembers what it has learned across steps, and evaluates the quality of its own outputs before finalizing them. It behaves less like a search engine and more like a junior analyst who can be handed a complex problem and trusted to work through it systematically.

In this guide, we'll build exactly that — a fully functional agentic AI system using the OpenAI API, covering every core component: planning, tool calling, memory, and self-critique. By the end, you'll have a working architecture you can extend, customize, and deploy in production.

What Is an Agentic AI System?

Before writing a single line of code, it's worth being precise about what we're building and why each component matters.

A traditional LLM interaction is stateless and single-turn: input goes in, output comes out. An agentic system adds four critical layers on top of this:

Component	What It Does	Why It Matters
Planning	Breaks a goal into ordered subtasks	Prevents the model from trying to do everything in one shot
Tool Calling	Lets the agent interact with external systems	Connects AI reasoning to real-world data and actions
Memory	Stores and retrieves context across steps	Prevents repetition and enables long-horizon tasks
Self-Critique	Evaluates and refines its own outputs	Dramatically improves output quality and reduces errors

Together, these four layers transform a language model from a text predictor into a goal-directed system that can handle complex, multi-step workflows.

Agentic vs. Non-Agentic AI — Key Differences

Feature	Standard LLM	Agentic AI System
Task scope	Single turn	Multi-step, long-horizon
Tool access	None	APIs, databases, code execution
Memory	None (stateless)	Short-term + long-term memory
Self-evaluation	None	Reflects and revises outputs
Decision making	Single inference	Plan → Act → Observe → Reflect loop
Error recovery	Fails silently	Detects and corrects mistakes
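
The Plan → Act → Observe → Reflect loop from the table can be sketched in a few lines before any API is involved. This is a minimal illustration, not the implementation built later in this guide: `run_agent_loop` and the three `demo_*` stubs are hypothetical names, and the stubs stand in for what would be LLM and tool calls in a real agent.

```python
from typing import Callable


def run_agent_loop(
    goal: str,
    plan: Callable[[str], list[str]],
    act: Callable[[str, list[str]], str],
    reflect: Callable[[str, list[str]], bool],
    max_rounds: int = 3,
) -> list[str]:
    """Run Plan -> Act -> Observe -> Reflect until reflect() approves or rounds run out."""
    observations: list[str] = []
    for _ in range(max_rounds):
        for task in plan(goal):                   # Plan
            result = act(task, observations)      # Act
            observations.append(result)           # Observe
        if reflect(goal, observations):           # Reflect
            break
    return observations


# Stub callables stand in for LLM and tool calls.
def demo_plan(goal: str) -> list[str]:
    return [f"research: {goal}", f"summarize: {goal}"]


def demo_act(task: str, seen: list[str]) -> str:
    return f"done({task})"


def demo_reflect(goal: str, seen: list[str]) -> bool:
    return len(seen) >= 2


if __name__ == "__main__":
    print(run_agent_loop("compare frameworks", demo_plan, demo_act, demo_reflect))
```

Note the `max_rounds` ceiling: even in a toy version, the loop stops whether or not the reflection step ever approves.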

Core Architecture Overview

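Here is the high-level architecture we'll implement throughout this guide: the planner turns the goal into ordered steps, each step runs through the tool-calling loop with working memory supplying context, the step results are synthesized into one response, and the critic reviews that response before it is returned.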

Each layer is independently implemented and composable. Let's build them one by one.

Setting Up Your Environment

Requirements

bash

pip install "openai>=1.0.0" python-dotenv requests

Project Structure

agentic-ai/
├── main.py
├── config.py
├── planner.py
├── tools.py
├── tool_runner.py
├── memory.py
├── critic.py
├── agent.py
└── .env

Environment Setup

# .env
OPENAI_API_KEY=your_openai_api_key_here

# config.py
import os

from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
MODEL = "gpt-4o"
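
If the key is missing, `os.getenv` quietly returns `None` and the failure only surfaces later, on the first API call. One option is to fail fast at startup. The sketch below assumes a small helper named `require_env`, which is hypothetical and not part of the original `config.py`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


if __name__ == "__main__":
    # Hypothetical usage in config.py, after load_dotenv():
    #   OPENAI_API_KEY = require_env("OPENAI_API_KEY")
    os.environ["DEMO_KEY"] = "abc123"
    print(require_env("DEMO_KEY"))
```

Raising at import time keeps a misconfigured deployment from getting halfway through a run before it fails.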

Step 1 — Building the Planning Layer

The planning layer is the brain of your agent. Before taking any action, the agent first decomposes the user's goal into a structured list of subtasks — ordered, specific, and actionable.

Why Planning Matters

Without a planning step, LLMs tend to either try to solve everything in one massive response (which degrades quality) or get lost partway through a complex task. Explicit planning forces the model to think before it acts.

Implementation

# planner.py
import json

from openai import OpenAI

from config import MODEL

client = OpenAI()


def create_plan(goal: str) -> list[dict]:
    """
    Takes a high-level goal and returns an ordered list of subtasks.
    """
    # JSON mode returns a single JSON object, so the prompt asks for the
    # steps under a "tasks" key rather than as a bare array.
    system_prompt = """
    You are an expert task planner. Given a goal, break it down into
    a clear, ordered list of specific subtasks that an AI agent can
    execute one by one.

    Return your response as a JSON object with a single key "tasks",
    whose value is an array of objects, each with:
    - "step": step number (integer)
    - "task": specific task description (string)
    - "requires_tool": whether this step needs an external tool (boolean)
    - "tool_hint": which type of tool might be needed (string or null)

    Be specific. Each task should be executable independently.
    Return ONLY valid JSON, no explanation.
    """

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Goal: {goal}"},
        ],
        response_format={"type": "json_object"},
    )

    result = json.loads(response.choices[0].message.content)
    # Fall back to "steps" in case the model picks a different key name
    return result.get("tasks", result.get("steps", []))


# Example usage
if __name__ == "__main__":
    goal = "Research the top 3 AI frameworks in 2026 and summarize their pros and cons"
    plan = create_plan(goal)
    for step in plan:
        print(f"Step {step['step']}: {step['task']}")

Sample Planner Output

{
"tasks": [
{
"step": 1,
"task": "Search for the most popular AI frameworks in 2026",
"requires_tool": true,
"tool_hint": "web_search"
},
{
"step": 2,
"task": "Gather detailed pros and cons for each framework",
"requires_tool": true,
"tool_hint": "web_search"
},
{
"step": 3,
"task": "Synthesize findings into a structured comparison summary",
"requires_tool": false,
"tool_hint": null
}
]
}
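
Model output is not guaranteed to match this shape on every call, so it can help to check the plan before executing it. The following is a sketch under that assumption: `validate_plan` is a hypothetical helper, not part of the planner above, and the rules it enforces (integer `step`, string `task`, sorted by step) are one reasonable choice rather than a requirement of the API.

```python
import json


def validate_plan(raw_json: str) -> list[dict]:
    """Parse planner output and return well-formed steps sorted by step number."""
    data = json.loads(raw_json)
    # Accept either {"tasks": [...]} or a bare list, since models vary.
    steps = data.get("tasks", []) if isinstance(data, dict) else data
    if not isinstance(steps, list):
        raise ValueError("Plan must contain a list of steps")

    cleaned = []
    for item in steps:
        if not isinstance(item, dict):
            raise ValueError(f"Step is not an object: {item!r}")
        if not isinstance(item.get("step"), int) or not isinstance(item.get("task"), str):
            raise ValueError(f"Step is missing 'step' or 'task': {item!r}")
        cleaned.append({
            "step": item["step"],
            "task": item["task"].strip(),
            "requires_tool": bool(item.get("requires_tool", False)),
            "tool_hint": item.get("tool_hint"),
        })
    return sorted(cleaned, key=lambda s: s["step"])
```

A malformed plan then fails loudly at the planning stage instead of surfacing as a `KeyError` somewhere in the execution loop.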

Step 2 — Implementing Tool Calling

Tool calling is what separates an agent from a chatbot. It allows the AI to reach out to the real world — searching the web, querying APIs, reading files, running calculations — and bring that information back into its reasoning loop.

OpenAI Tool Calling — How It Works

OpenAI's tool calling works by defining tools as JSON schemas. The model decides when to call a tool, returns a structured tool call object, and your code executes the actual function — then feeds the result back to the model.

Model decides to call tool
│
▼
Your code executes the function
│
▼
Result is sent back to model
│
▼
Model continues reasoning with new information
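
To make the round trip concrete without calling the API, the sketch below uses plain dictionaries shaped like Chat Completions messages. The `add` tool, the `TOOLS` registry, `handle_tool_calls`, and the hand-written `fake_assistant_message` are all hypothetical stand-ins for illustration. In a real run the assistant message comes back from the model.

```python
import json


def add(a: float, b: float) -> float:
    """Hypothetical demo tool."""
    return a + b


TOOLS = {"add": add}


def handle_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each requested tool and build the 'tool' messages to send back."""
    replies = []
    for call in assistant_message.get("tool_calls") or []:
        name = call["function"]["name"]
        # Arguments arrive as a JSON-encoded string, not a dict.
        args = json.loads(call["function"]["arguments"])
        if name in TOOLS:
            payload = {"result": TOOLS[name](**args)}
        else:
            payload = {"error": f"Unknown tool: {name}"}
        replies.append({
            "role": "tool",
            "tool_call_id": call["id"],  # must echo the id of the request
            "content": json.dumps(payload),
        })
    return replies


# A hand-written stand-in for what the model returns when it wants a tool.
fake_assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_demo_1",
        "type": "function",
        "function": {"name": "add", "arguments": '{"a": 2, "b": 3}'},
    }],
}

if __name__ == "__main__":
    print(handle_tool_calls(fake_assistant_message))
```

Two details are easy to miss: the arguments are a JSON string that your code must parse, and each tool reply carries the `tool_call_id` of the request it answers.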

Defining Your Tools

# tools.py
import json
from datetime import datetime

import requests  # not used by the simulated search; needed once you call a real search API

# Tool definitions (schemas the model uses to decide when/how to call)
TOOL_DEFINITIONS = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for current information on a topic",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query"
                    },
                    "max_results": {
                        "type": "integer",
                        "description": "Maximum number of results to return",
                        "default": 5
                    }
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Perform mathematical calculations",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Mathematical expression to evaluate"
                    }
                },
                "required": ["expression"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_current_datetime",
            "description": "Get the current date and time",
            "parameters": {
                "type": "object",
                "properties": {}
            }
        }
    }
]


# Actual tool implementations
def web_search(query: str, max_results: int = 5) -> dict:
    """
    Simulated web search. Replace with a real search API
    (e.g. Serper, Tavily, Bing Search API).
    """
    # Replace this with your preferred search API
    return {
        "query": query,
        "results": [
            {
                "title": f"Result for: {query}",
                "snippet": f"This is a simulated result for '{query}'. "
                           f"Replace with real search API.",
                "url": "https://example.com"
            }
        ]
    }


def calculate(expression: str) -> dict:
    """Safely evaluate a mathematical expression."""
    try:
        # Use eval carefully: sanitize input in production
        allowed_chars = set('0123456789+-*/()., ')
        if all(c in allowed_chars for c in expression):
            result = eval(expression)
            return {"expression": expression, "result": result}
        else:
            return {"error": "Invalid characters in expression"}
    except Exception as e:
        return {"error": str(e)}


def get_current_datetime() -> dict:
    """Return the current date and time."""
    now = datetime.now()
    return {
        "datetime": now.isoformat(),
        "date": now.strftime("%Y-%m-%d"),
        "time": now.strftime("%H:%M:%S")
    }


def execute_tool(tool_name: str, tool_args: dict) -> str:
    """Route tool calls to the correct implementation."""
    tool_map = {
        "web_search": web_search,
        "calculate": calculate,
        "get_current_datetime": get_current_datetime
    }
    if tool_name not in tool_map:
        return json.dumps({"error": f"Unknown tool: {tool_name}"})
    result = tool_map[tool_name](**tool_args)
    return json.dumps(result)

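The character whitelist in `calculate` still hands the string to `eval()`, and it lets through expressions such as `9**9**9**9` that can tie up the process. One alternative, sketched here with the standard-library `ast` module, is to parse the expression and evaluate only a fixed set of arithmetic nodes. `safe_calculate` is a hypothetical replacement, not the guide's original tool, and it deliberately leaves out exponentiation.

```python
import ast
import operator

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node: ast.AST) -> float:
    """Recursively evaluate a whitelisted arithmetic AST node."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("Unsupported expression")


def safe_calculate(expression: str) -> dict:
    """Evaluate +, -, *, / and parentheses without calling eval()."""
    try:
        tree = ast.parse(expression, mode="eval")
        return {"expression": expression, "result": _evaluate(tree.body)}
    except (SyntaxError, ValueError, ZeroDivisionError) as exc:
        return {"error": str(exc) or type(exc).__name__}


if __name__ == "__main__":
    print(safe_calculate("2 + 3 * (4 - 1)"))
    print(safe_calculate("__import__('os').getcwd()"))
```

Because it returns the same dictionary shape as `calculate`, it could be dropped into the `tool_map` without changing the schema the model sees.
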
Running a Tool-Calling Loop

# tool_runner.py
import json

from openai import OpenAI

from config import MODEL
from tools import TOOL_DEFINITIONS, execute_tool

client = OpenAI()


def run_with_tools(messages: list, task: str, max_iterations: int = 10) -> str:
    """
    Run a single agent step with tool calling support.
    Loops until the model stops requesting tools or max_iterations is reached.
    """
    messages = messages + [{"role": "user", "content": task}]

    # A hard ceiling keeps a confused model from requesting tools forever
    for _ in range(max_iterations):
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=TOOL_DEFINITIONS,
            tool_choice="auto",
        )
        message = response.choices[0].message

        # If no tool call, we have our final answer
        if not message.tool_calls:
            return message.content or ""

        # Process each tool call
        messages.append(message)
        for tool_call in message.tool_calls:
            tool_name = tool_call.function.name
            tool_args = json.loads(tool_call.function.arguments)
            print(f" → Calling tool: {tool_name}({tool_args})")
            result = execute_tool(tool_name, tool_args)

            # Feed tool result back to the model
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })

    return "Stopped: tool-call limit reached before the model produced a final answer."

Step 3 — Adding Memory

Memory is what allows your agent to function coherently across multiple steps and sessions. Without it, every step starts from scratch — the agent forgets what it just learned and repeats work unnecessarily.

Two Types of Memory You Need

Memory Type	What It Stores	How Long
Short-term (Working)	Current task context, recent observations	Duration of one agent run
Long-term (Persistent)	Learned facts, past task results, user preferences	Across sessions

Implementing Short-Term and Long-Term Memory

# memory.py
import json
import os
from datetime import datetime


class WorkingMemory:
    """
    Short-term memory for a single agent run.
    Stores observations, tool results, and intermediate conclusions.
    """

    def __init__(self):
        self.observations = []
        self.tool_results = []
        self.conclusions = []
        self.current_plan = []

    def add_observation(self, step: int, content: str):
        self.observations.append({"step": step, "content": content})

    def add_tool_result(self, tool_name: str, result: str):
        self.tool_results.append({"tool": tool_name, "result": result})

    def add_conclusion(self, step: int, conclusion: str):
        self.conclusions.append({"step": step, "conclusion": conclusion})

    def set_plan(self, plan: list):
        self.current_plan = plan

    def get_context_summary(self) -> str:
        """
        Generates a formatted context string to inject into prompts,
        giving the model awareness of what has happened so far.
        """
        parts = []
        if self.current_plan:
            parts.append("CURRENT PLAN:")
            for step in self.current_plan:
                parts.append(f" Step {step['step']}: {step['task']}")
        if self.observations:
            parts.append("\nOBSERVATIONS SO FAR:")
            for obs in self.observations[-5:]:  # Last 5 to manage context length
                parts.append(f" [Step {obs['step']}] {obs['content']}")
        if self.conclusions:
            parts.append("\nCONCLUSIONS REACHED:")
            for con in self.conclusions[-3:]:
                parts.append(f" [Step {con['step']}] {con['conclusion']}")
        return "\n".join(parts) if parts else "No context yet."


class LongTermMemory:
    """
    Simple persistent memory using a JSON file.
    In production, replace with a vector database (Pinecone, Chroma, Weaviate).
    """

    def __init__(self, filepath: str = "long_term_memory.json"):
        self.filepath = filepath
        self.memories = self._load()

    def _load(self) -> list:
        if os.path.exists(self.filepath):
            with open(self.filepath, "r") as f:
                return json.load(f)
        return []

    def _save(self):
        with open(self.filepath, "w") as f:
            json.dump(self.memories, f, indent=2)

    def store(self, key: str, value: str, tags: list = None):
        """Store a fact or result for future retrieval."""
        self.memories.append({
            "key": key,
            "value": value,
            "tags": tags or [],
            "timestamp": datetime.now().isoformat()
        })
        self._save()

    def search(self, query: str, top_k: int = 3) -> list:
        """
        Simple keyword-based search.
        Replace with vector similarity search in production.
        """
        query_lower = query.lower()
        scored = []
        for memory in self.memories:
            score = 0
            if query_lower in memory["key"].lower():
                score += 2
            if query_lower in memory["value"].lower():
                score += 1
            for tag in memory.get("tags", []):
                if query_lower in tag.lower():
                    score += 1
            if score > 0:
                scored.append((score, memory))
        scored.sort(key=lambda x: x[0], reverse=True)
        return [m for _, m in scored[:top_k]]

    def get_relevant_context(self, query: str) -> str:
        """Returns relevant memories formatted for prompt injection."""
        results = self.search(query)
        if not results:
            return "No relevant past memories found."
        parts = ["RELEVANT PAST KNOWLEDGE:"]
        for memory in results:
            parts.append(f" [{memory['key']}]: {memory['value']}")
        return "\n".join(parts)

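The `search` method matches the whole query as one substring, so a long goal rarely hits anything unless it repeats an earlier goal word for word. A small step toward the vector search mentioned above is to score by word overlap instead. The functions below (`tokenize`, `overlap_score`, `search_memories`) are hypothetical and operate on the same memory dictionaries. Treat this as a sketch of one swapped-in scoring rule, not a substitute for real similarity search.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, memory: dict) -> float:
    """Fraction of query words that appear in the memory's key, value, or tags."""
    query_words = tokenize(query)
    if not query_words:
        return 0.0
    haystack = " ".join([memory["key"], memory["value"], *memory.get("tags", [])])
    return len(query_words & tokenize(haystack)) / len(query_words)


def search_memories(query: str, memories: list[dict], top_k: int = 3) -> list[dict]:
    """Return up to top_k memories with a non-zero overlap score, best first."""
    scored = [(overlap_score(query, m), m) for m in memories]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]


if __name__ == "__main__":
    demo = [
        {"key": "AI coding assistants", "value": "Compared pricing", "tags": []},
        {"key": "Weather", "value": "Sunny in Paris", "tags": []},
    ]
    print(search_memories("coding assistants pricing", demo))
```

Word overlap still misses synonyms and paraphrases, which is the gap embeddings are meant to close.
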
Step 4 — Self-Critique and Reflection

Self-critique is the most underrated component of agentic AI systems. It's what separates an agent that produces good-enough outputs from one that consistently delivers high-quality, reliable results.

The idea is simple: after the agent produces an output, a second LLM call evaluates it against explicit quality criteria — and if the output falls short, the agent revises it.

The Critique Loop

Agent produces output
│
▼
Critic evaluates against criteria
│
Pass? ──── Yes ──→ Return output
│
No
│
▼
Agent revises with critique feedback
│
▼
(Loop again, max N times)

Implementation

# critic.py
import json

from openai import OpenAI

from config import MODEL

client = OpenAI()

CRITIQUE_CRITERIA = """
Evaluate the output against these criteria:
1. ACCURACY: Is the information factually correct and well-supported?
2. COMPLETENESS: Does it fully address the original goal?
3. CLARITY: Is it clearly written and well-structured?
4. RELEVANCE: Is everything included actually relevant to the goal?
5. ACTIONABILITY: Is the output useful and actionable for the user?

Rate each criterion: PASS or FAIL
Provide specific, actionable feedback for any FAILs.
Return as JSON with keys: scores (dict), overall_pass (bool), feedback (string)
"""


def critique_output(goal: str, output: str) -> dict:
    """
    Evaluates an agent output against quality criteria.
    Returns critique scores and actionable feedback.
    """
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": f"You are a rigorous quality evaluator. {CRITIQUE_CRITERIA}"
            },
            {
                "role": "user",
                "content": f"ORIGINAL GOAL:\n{goal}\n\nAGENT OUTPUT:\n{output}"
            }
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)


def revise_output(goal: str, original_output: str, critique: dict) -> str:
    """
    Revises an agent output based on critique feedback.
    """
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": "You are an expert writer and analyst. Revise the provided output to address all critique feedback while maintaining what was already good."
            },
            {
                "role": "user",
                "content": f"""ORIGINAL GOAL:
{goal}

ORIGINAL OUTPUT:
{original_output}

CRITIQUE FEEDBACK:
{critique.get('feedback', 'No specific feedback provided.')}

SCORES:
{json.dumps(critique.get('scores', {}), indent=2)}

Please provide an improved version that addresses all FAIL criteria."""
            }
        ]
    )
    return response.choices[0].message.content


def critique_and_revise(goal: str, output: str, max_iterations: int = 2) -> str:
    """
    Runs the full critique-revise loop until output passes
    or max iterations are reached.
    """
    current_output = output
    for iteration in range(max_iterations):
        print(f"\n [Critic] Evaluating output (iteration {iteration + 1})...")
        critique = critique_output(goal, current_output)
        scores = critique.get("scores", {})
        overall_pass = critique.get("overall_pass", False)
        print(f" [Critic] Scores: {scores}")
        print(f" [Critic] Overall: {'PASS ✓' if overall_pass else 'FAIL - revising...'}")
        if overall_pass:
            print(f" [Critic] Output approved after {iteration + 1} evaluation(s)")
            return current_output
        print(f" [Critic] Feedback: {critique.get('feedback', '')}")
        current_output = revise_output(goal, current_output, critique)
    print(" [Critic] Max iterations reached - returning best output")
    return current_output

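The loop above trusts the critic's own `overall_pass` flag, which may be missing or may disagree with the per-criterion scores. One defensive option is to derive the verdict from the scores. The sketch below assumes a hypothetical helper, `normalize_critique`, and treats any criterion that is absent or not exactly PASS as a failure. That strictness is a design choice for illustration, not something the API imposes.

```python
CRITERIA = ("ACCURACY", "COMPLETENESS", "CLARITY", "RELEVANCE", "ACTIONABILITY")


def normalize_critique(critique: dict) -> dict:
    """Return a critique with upper-cased scores and a derived overall_pass flag."""
    raw_scores = critique.get("scores")
    if not isinstance(raw_scores, dict):
        raw_scores = {}
    scores = {
        str(name).upper(): str(verdict).strip().upper()
        for name, verdict in raw_scores.items()
    }
    # Treat any criterion that is missing or not exactly PASS as a failure.
    failed = [name for name in CRITERIA if scores.get(name) != "PASS"]
    return {
        "scores": scores,
        "overall_pass": not failed,
        "failed": failed,
        "feedback": str(critique.get("feedback", "")),
    }


if __name__ == "__main__":
    print(normalize_critique({"scores": {"accuracy": "pass", "clarity": "fail"}}))
```

The `failed` list can also be passed to the revision prompt so the model knows exactly which criteria to address.
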
Putting It All Together

Now we combine all four layers into a single, cohesive agent.

# agent.py
from openai import OpenAI

from config import MODEL
from critic import critique_and_revise
from memory import LongTermMemory, WorkingMemory
from planner import create_plan
from tool_runner import run_with_tools

client = OpenAI()


class AgenticAI:
    """
    A fully agentic AI system with planning, tool calling,
    memory, and self-critique capabilities.
    """

    def __init__(self):
        self.working_memory = WorkingMemory()
        self.long_term_memory = LongTermMemory()

    def run(self, goal: str) -> str:
        print(f"\n{'='*60}")
        print(f"GOAL: {goal}")
        print(f"{'='*60}\n")

        # --- Phase 1: Planning ---
        print("[Phase 1] Creating plan...")
        plan = create_plan(goal)
        self.working_memory.set_plan(plan)
        print(f"Plan created with {len(plan)} steps:")
        for step in plan:
            print(f" {step['step']}. {step['task']}")

        # Check long-term memory for relevant past knowledge
        past_knowledge = self.long_term_memory.get_relevant_context(goal)
        if past_knowledge != "No relevant past memories found.":
            print("\n[Memory] Found relevant past knowledge")
            self.working_memory.add_observation(0, f"Past knowledge: {past_knowledge}")

        # --- Phase 2: Execute Each Step ---
        print("\n[Phase 2] Executing plan steps...")
        step_results = []
        for step in plan:
            step_num = step["step"]
            task = step["task"]
            print(f"\n Executing Step {step_num}: {task}")

            # Build context-aware messages
            context = self.working_memory.get_context_summary()
            system_message = f"""You are an expert AI assistant executing a specific task.

OVERALL GOAL: {goal}

{context}

Execute the current task carefully and thoroughly.
Use tools when needed to gather real information."""
            messages = [{"role": "system", "content": system_message}]

            # Execute step with tool calling
            result = run_with_tools(messages, task)

            # Store result in working memory
            self.working_memory.add_observation(step_num, result)
            step_results.append({"step": step_num, "task": task, "result": result})
            print(f" ✓ Step {step_num} complete")

        # --- Phase 3: Synthesize Final Output ---
        print("\n[Phase 3] Synthesizing final output...")
        all_results = "\n\n".join([
            f"Step {r['step']} ({r['task']}):\n{r['result']}"
            for r in step_results
        ])
        synthesis_response = client.chat.completions.create(
            model=MODEL,
            messages=[
                {
                    "role": "system",
                    "content": "You are an expert analyst. Synthesize the step results into a comprehensive, well-structured final response that fully addresses the original goal."
                },
                {
                    "role": "user",
                    "content": f"GOAL: {goal}\n\nSTEP RESULTS:\n{all_results}\n\nProvide a comprehensive final response."
                }
            ]
        )
        raw_output = synthesis_response.choices[0].message.content

        # --- Phase 4: Self-Critique ---
        print("\n[Phase 4] Running self-critique...")
        final_output = critique_and_revise(goal, raw_output)

        # Store result in long-term memory
        self.long_term_memory.store(
            key=goal[:100],
            value=final_output[:500],
            tags=["completed_task"]
        )

        print(f"\n{'='*60}")
        print("FINAL OUTPUT:")
        print(f"{'='*60}")
        return final_output


# main.py
from agent import AgenticAI

if __name__ == "__main__":
    agent = AgenticAI()
    result = agent.run(
        "Research the top 3 AI coding assistants in 2026 and "
        "provide a detailed comparison of their features, pricing, and best use cases"
    )
    print(result)

Sample Run Output

============================================================
GOAL: Research the top 3 AI coding assistants in 2026
============================================================

[Phase 1] Creating plan...
Plan created with 3 steps:
1. Search for top AI coding assistants in 2026
2. Gather detailed feature and pricing information for each
3. Synthesize findings into a structured comparison

[Phase 2] Executing plan steps...

Executing Step 1: Search for top AI coding assistants in 2026
→ Calling tool: web_search({"query": "top AI coding assistants 2026"})
✓ Step 1 complete

Executing Step 2: Gather detailed information...
→ Calling tool: web_search({"query": "Claude Code vs GitHub Copilot features 2026"})
✓ Step 2 complete

Executing Step 3: Synthesize findings...
✓ Step 3 complete

[Phase 3] Synthesizing final output...

[Phase 4] Running self-critique...
[Critic] Evaluating output (iteration 1)...
[Critic] Scores: {"ACCURACY": "PASS", "COMPLETENESS": "PASS", ...}
[Critic] Overall: PASS ✓
[Critic] Output approved after 1 evaluation(s)

============================================================
FINAL OUTPUT:
============================================================

Best Practices for Production

Area	Recommendation
Model choice	Use gpt-4o for reasoning steps, gpt-4o-mini for critique to reduce costs
Error handling	Wrap every tool call and LLM call in try/except with retry logic
Memory backend	Replace file-based memory with a vector DB (Chroma, Pinecone, Weaviate)
Tool sandboxing	Never run eval() or execute code without a sandboxed environment
Rate limiting	Implement exponential backoff for OpenAI API rate limit errors
Observability	Log every step, tool call, and critique result for debugging
Token management	Track and limit token usage per run to avoid runaway costs
Max iterations	Always set a hard ceiling on agent loops to prevent infinite runs
Security	Sanitize all tool inputs — never pass raw user input to system commands
Testing	Write unit tests for each tool and integration tests for the full loop
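
As an illustration of the error-handling and rate-limiting rows, here is a generic retry wrapper with exponential backoff and jitter. It is a sketch, not part of the agent above: `retry_with_backoff` and its parameters are hypothetical names, and which exceptions to retry is left to the caller. With the OpenAI Python SDK you would likely pass `openai.RateLimitError` in `retry_on`. The SDK client also has its own `max_retries` setting, so check whether that already covers your case.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exceptions with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # 1x, 2x, 4x ... the base delay, plus up to 10% random jitter.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")
```

The `sleep` parameter is injectable so tests can record the delays instead of actually waiting, for example `retry_with_backoff(call, sleep=delays.append)`.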

Common Pitfalls to Avoid

Pitfall	What Goes Wrong	How to Fix It
No loop limit	Agent runs forever on hard problems	Set max_iterations on every loop
Ignoring tool errors	Agent hallucinates tool results	Always validate and handle tool failures
Bloated context	Token limits exceeded mid-run	Summarize memory, truncate old observations
Over-critiquing	Too many revision cycles inflate cost	Limit critique to 2–3 iterations max
Trusting all tool output	Bad data propagates through steps	Validate and sanity-check tool results
No fallback planning	One step fails, whole task collapses	Implement step-level retry and recovery
Verbose system prompts	Model loses focus on the actual task	Keep system prompts focused and concise
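
For the bloated-context pitfall, one simple mitigation is to keep only the most recent observations that fit a size budget. The sketch below uses character count as a rough proxy for tokens, which is an assumption for illustration. `trim_to_budget` is a hypothetical helper, and a real system would more likely count tokens or summarize older entries.

```python
def trim_to_budget(observations: list[str], max_chars: int = 2000) -> list[str]:
    """Keep the most recent observations whose combined length fits max_chars."""
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest so recent context wins.
    for text in reversed(observations):
        if used + len(text) > max_chars:
            if not kept:
                # Always keep at least the newest item, truncated to fit.
                kept.append(text[:max_chars])
            break
        kept.append(text)
        used += len(text)
    return list(reversed(kept))


if __name__ == "__main__":
    history = ["old " * 20, "recent " * 5, "latest"]
    print(trim_to_budget(history, max_chars=60))
```

The result stays in chronological order, so it can be joined straight into a prompt.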

Final Thoughts

Building an agentic AI system is one of the most rewarding things you can do as a developer right now. The gap between a simple chatbot and a truly autonomous agent isn't as wide as it seems — it's largely a matter of adding the right layers in the right order.

To recap what we built:

Component	What We Built
Planning	Goal decomposition into ordered, specific subtasks
Tool Calling	Web search, calculation, datetime — with a clean routing layer
Memory	Working memory for current run + persistent long-term memory
Self-Critique	Automated quality evaluation and output revision loop
Full Agent	A composable system that ties all four layers together

From here, the natural next steps are to swap the simulated web search for a real API like Tavily or Serper, replace file-based memory with a proper vector database, add more tools tailored to your domain, and instrument the whole system with logging and observability.

The architecture in this guide is intentionally modular — each layer can be improved, replaced, or extended independently. That's the foundation of a system you can actually evolve and trust in production.

At CognyX AI, we design and deploy production-grade agentic AI systems tailored to real business workflows — from autonomous data pipelines to multi-agent research and decision-making systems. If you're ready to move beyond demos and build something that actually runs in production, let's talk.

Written by

Raish Momin, CTO
