The Complete Guide to Building Agentic AI with OpenAI API — Planning, Memory, Tools & Self-Critique

Published 2 months ago
AI Development & Engineering

Introduction

Most developers' first encounter with the OpenAI API follows a familiar pattern: send a message, get a response, display it to the user. It works beautifully for simple use cases — chatbots, content generation, summarization. But the moment you want your AI to actually do something — search the web, query a database, write and run code, make decisions across multiple steps — a single prompt-response loop falls dramatically short.

That's where agentic AI comes in.

An agentic AI system doesn't just answer questions. It breaks down goals into tasks, calls tools to gather information or take action, remembers what it has learned across steps, and evaluates the quality of its own outputs before finalizing them. It behaves less like a search engine and more like a junior analyst who can be handed a complex problem and trusted to work through it systematically.

In this guide, we'll build exactly that — a fully functional agentic AI system using the OpenAI API, covering every core component: planning, tool calling, memory, and self-critique. By the end, you'll have a working architecture you can extend, customize, and deploy in production.

What Is an Agentic AI System?

Before writing a single line of code, it's worth being precise about what we're building and why each component matters.

A traditional LLM interaction is stateless and single-turn: input goes in, output comes out. An agentic system adds four critical layers on top of this:

| Component | What It Does | Why It Matters |
| --- | --- | --- |
| Planning | Breaks a goal into ordered subtasks | Prevents the model from trying to do everything in one shot |
| Tool Calling | Lets the agent interact with external systems | Connects AI reasoning to real-world data and actions |
| Memory | Stores and retrieves context across steps | Prevents repetition and enables long-horizon tasks |
| Self-Critique | Evaluates and refines its own outputs | Dramatically improves output quality and reduces errors |

Together, these four layers transform a language model from a text predictor into a goal-directed system that can handle complex, multi-step workflows.

Agentic vs. Non-Agentic AI — Key Differences

| Feature | Standard LLM | Agentic AI System |
| --- | --- | --- |
| Task scope | Single turn | Multi-step, long-horizon |
| Tool access | None | APIs, databases, code execution |
| Memory | None (stateless) | Short-term + long-term memory |
| Self-evaluation | None | Reflects and revises outputs |
| Decision making | Single inference | Plan → Act → Observe → Reflect loop |
| Error recovery | Fails silently | Detects and corrects mistakes |

Core Architecture Overview

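Here is the high-level architecture we'll implement throughout this guide: a planner that decomposes the goal, a tool-calling executor that works through each step, a memory layer that carries context between steps, and a critic that reviews the final output.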

Each layer is independently implemented and composable. Let's build them one by one.
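
To make the Plan → Act → Observe → Reflect loop concrete before building each layer, here is a minimal sketch of how the four layers could compose. The names `run_agent`, `plan`, `act`, and `critique` and the stub lambdas are illustrative assumptions, not the final implementation; each stub stands in for a layer built in the sections that follow.

```python
from typing import Callable


def run_agent(
    goal: str,
    plan: Callable[[str], list[str]],
    act: Callable[[str, list[str]], str],
    critique: Callable[[str, str], bool],
    max_revisions: int = 2,
) -> str:
    """Compose the layers: plan, act on each step, remember, then reflect."""
    observations: list[str] = []  # memory: what the agent has seen so far
    for task in plan(goal):  # planning: ordered subtasks
        observations.append(act(task, observations))  # act, then observe

    answer = "\n".join(observations)
    for _ in range(max_revisions):  # self-critique with a hard ceiling
        if critique(goal, answer):
            break
        answer = act(f"Revise this answer: {answer}", observations)
    return answer


# Stub layers (hypothetical) so the sketch runs without any API calls.
result = run_agent(
    goal="compare A and B",
    plan=lambda goal: [f"research {goal}", f"summarize {goal}"],
    act=lambda task, memory: f"done: {task}",
    critique=lambda goal, answer: True,
)
print(result)
```

Passing the layers in as callables is one way to keep them independently replaceable: a real planner or critic can be swapped in without touching the loop.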

Setting Up Your Environment

Requirements

bash

# Quote the version specifier so the shell does not treat ">" as a redirect
pip install "openai>=1.0.0" python-dotenv requests

Project Structure

agentic-ai/
├── main.py
├── config.py
├── planner.py
├── tools.py
├── tool_runner.py
├── memory.py
├── critic.py
├── agent.py
└── .env

Environment Setup

# .env
OPENAI_API_KEY=your_openai_api_key_here

# config.py
from dotenv import load_dotenv
import os

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
MODEL = "gpt-4o"
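
Because `os.getenv` returns `None` when a variable is missing, a misconfigured `.env` file would otherwise only surface later as an authentication error. One option is a small fail-fast check at startup. The sketch below is an assumption about how you might do that; `require_env` and `DEMO_API_KEY` are hypothetical names, not part of python-dotenv or the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Demo with a throwaway variable so the sketch runs anywhere.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In config.py, the same helper could wrap the `OPENAI_API_KEY` lookup so a missing key stops the program immediately.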

Step 1 — Building the Planning Layer

The planning layer is the brain of your agent. Before taking any action, the agent first decomposes the user's goal into a structured list of subtasks — ordered, specific, and actionable.

Why Planning Matters

Without a planning step, LLMs tend to either try to solve everything in one massive response (which degrades quality) or get lost partway through a complex task. Explicit planning forces the model to think before it acts.

Implementation

# planner.py
import json

from openai import OpenAI
from config import MODEL

client = OpenAI()

# JSON mode returns a single JSON object, so the prompt asks for the
# subtasks under a top-level "tasks" key rather than as a bare array.
PLANNER_PROMPT = """
You are an expert task planner. Given a goal, break it down into
a clear, ordered list of specific subtasks that an AI agent can
execute one by one.

Return your response as a JSON object with a single key "tasks",
whose value is an array of objects, each with:
- "step": step number (integer)
- "task": specific task description (string)
- "requires_tool": whether this step needs an external tool (boolean)
- "tool_hint": which type of tool might be needed (string or null)

Be specific. Each task should be executable independently.
Return ONLY valid JSON, no explanation.
"""


def create_plan(goal: str) -> list[dict]:
    """Take a high-level goal and return an ordered list of subtasks."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": PLANNER_PROMPT},
            {"role": "user", "content": f"Goal: {goal}"},
        ],
        response_format={"type": "json_object"},
    )

    result = json.loads(response.choices[0].message.content)
    # Accept "steps" as a fallback in case the model ignores the key name.
    return result.get("tasks", result.get("steps", []))


# Example usage
if __name__ == "__main__":
    goal = "Research the top 3 AI frameworks in 2026 and summarize their pros and cons"
    plan = create_plan(goal)
    for step in plan:
        print(f"Step {step['step']}: {step['task']}")

Sample Planner Output

{
  "tasks": [
    {
      "step": 1,
      "task": "Search for the most popular AI frameworks in 2026",
      "requires_tool": true,
      "tool_hint": "web_search"
    },
    {
      "step": 2,
      "task": "Gather detailed pros and cons for each framework",
      "requires_tool": true,
      "tool_hint": "web_search"
    },
    {
      "step": 3,
      "task": "Synthesize findings into a structured comparison summary",
      "requires_tool": false,
      "tool_hint": null
    }
  ]
}
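
Model output does not always match the requested shape, so it can help to validate the plan before executing it. The following is a minimal sketch, assuming the task fields shown in the sample output; `validate_plan` is a hypothetical helper, not part of the planner module.

```python
def validate_plan(raw_tasks: list) -> list[dict]:
    """Keep only well-formed tasks and renumber them in order."""
    cleaned = []
    for item in raw_tasks:
        if not isinstance(item, dict):
            continue  # skip anything that is not a task object
        task = item.get("task")
        if not isinstance(task, str) or not task.strip():
            continue  # skip entries without a usable task description
        cleaned.append({
            "step": len(cleaned) + 1,  # renumber sequentially
            "task": task.strip(),
            "requires_tool": bool(item.get("requires_tool", False)),
            "tool_hint": item.get("tool_hint"),
        })
    return cleaned


plan = validate_plan([
    {"step": 1, "task": "Search for frameworks", "requires_tool": True, "tool_hint": "web_search"},
    {"step": 7, "task": "  Summarize findings  "},
    {"step": 3},   # missing "task": dropped
    "not a dict",  # wrong type: dropped
])
print(plan)
```

Renumbering the surviving steps keeps later code, which indexes on `step`, from depending on whatever numbering the model produced.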

Step 2 — Implementing Tool Calling

Tool calling is what separates an agent from a chatbot. It allows the AI to reach out to the real world — searching the web, querying APIs, reading files, running calculations — and bring that information back into its reasoning loop.

OpenAI Tool Calling — How It Works

OpenAI's tool calling works by defining tools as JSON schemas. The model decides when to call a tool, returns a structured tool call object, and your code executes the actual function — then feeds the result back to the model.

Model decides to call tool
        ↓
Your code executes the function
        ↓
Result is sent back to model
        ↓
Model continues reasoning with new information
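
In the Chat Completions API, each tool call carries its arguments as a JSON-encoded string, so your code has to parse that string before it can execute anything. The sketch below shows one defensive way to do that; `parse_tool_arguments` is a hypothetical helper, and returning an error message instead of raising is a design assumption so the failure can be fed back to the model.

```python
import json
from typing import Optional


def parse_tool_arguments(raw_arguments: str) -> tuple[dict, Optional[str]]:
    """Parse the JSON-encoded arguments string from a tool call.

    Returns (arguments, error); error is None when parsing succeeded.
    """
    try:
        parsed = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError as exc:
        return {}, f"Arguments were not valid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return {}, "Arguments must be a JSON object"
    return parsed, None


args, error = parse_tool_arguments('{"query": "agent frameworks", "max_results": 3}')
print(args, error)

bad_args, bad_error = parse_tool_arguments('{"query": ')
print(bad_args, bad_error)
```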

Defining Your Tools

# tools.py
import requests
import json
from datetime import datetime

# Tool definitions (schemas the model uses to decide when/how to call)
TOOL_DEFINITIONS = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for current information on a topic",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query"
                    },
                    "max_results": {
                        "type": "integer",
                        "description": "Maximum number of results to return",
                        "default": 5
                    }
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Perform mathematical calculations",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Mathematical expression to evaluate"
                    }
                },
                "required": ["expression"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_current_datetime",
            "description": "Get the current date and time",
            "parameters": {
                "type": "object",
                "properties": {}
            }
        }
    }
]


# Actual tool implementations
def web_search(query: str, max_results: int = 5) -> dict:
    """
    Simulated web search: replace with a real search API
    (e.g. Serper, Tavily, Bing Search API).
    """
    # Replace this with your preferred search API
    return {
        "query": query,
        "results": [
            {
                "title": f"Result for: {query}",
                "snippet": f"This is a simulated result for '{query}'. "
                           f"Replace with real search API.",
                "url": "https://example.com"
            }
        ]
    }


def calculate(expression: str) -> dict:
    """Evaluate a basic arithmetic expression."""
    # The character whitelist blocks names and attribute access, but eval is
    # still risky (e.g. "9**9**9" can exhaust CPU). Sandbox it in production.
    allowed_chars = set("0123456789+-*/()., ")
    if not all(c in allowed_chars for c in expression):
        return {"error": "Invalid characters in expression"}
    try:
        # Empty builtins so the expression cannot reach any Python functions
        result = eval(expression, {"__builtins__": {}}, {})
        return {"expression": expression, "result": result}
    except Exception as e:
        return {"error": str(e)}


def get_current_datetime() -> dict:
    """Return the current date and time."""
    now = datetime.now()
    return {
        "datetime": now.isoformat(),
        "date": now.strftime("%Y-%m-%d"),
        "time": now.strftime("%H:%M:%S")
    }


def execute_tool(tool_name: str, tool_args: dict) -> str:
    """Route tool calls to the correct implementation."""
    tool_map = {
        "web_search": web_search,
        "calculate": calculate,
        "get_current_datetime": get_current_datetime
    }

    if tool_name not in tool_map:
        return json.dumps({"error": f"Unknown tool: {tool_name}"})

    try:
        result = tool_map[tool_name](**tool_args)
    except Exception as e:
        # Report failures (e.g. unexpected arguments) back to the model
        # instead of crashing the agent loop.
        return json.dumps({"error": f"{tool_name} failed: {e}"})
    return json.dumps(result)
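
Since `calculate` relies on `eval`, one alternative worth considering is to parse the expression with Python's standard `ast` module and evaluate only the node types you explicitly allow. The following is a sketch of that swapped-in technique, not the implementation used above; `safe_calculate` is a hypothetical name, and it deliberately supports only `+`, `-`, `*`, and `/`.

```python
import ast
import operator

# Only these arithmetic operators are permitted.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression: str) -> dict:
    """Evaluate +, -, *, / arithmetic by walking the parsed syntax tree."""

    def evaluate(node: ast.AST) -> float:
        # Plain int/float literals only (bool is excluded on purpose).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](evaluate(node.operand))
        # Names, calls, attribute access, and ** all end up here.
        raise ValueError("Unsupported expression")

    try:
        tree = ast.parse(expression, mode="eval")
        return {"expression": expression, "result": evaluate(tree.body)}
    except (SyntaxError, ValueError, ZeroDivisionError) as exc:
        return {"error": str(exc)}


print(safe_calculate("(2 + 3) * 4 / 5"))
print(safe_calculate("9 ** 9 ** 9"))
print(safe_calculate("__import__('os')"))
```

Because exponentiation and function calls are never evaluated, both the CPU-exhaustion and the code-execution cases are rejected with an error instead of being run.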

Running a Tool-Calling Loop

# tool_runner.py
import json

from openai import OpenAI
from tools import TOOL_DEFINITIONS, execute_tool
from config import MODEL

client = OpenAI()


def run_with_tools(messages: list, task: str, max_turns: int = 10) -> str:
    """
    Run a single agent step with tool calling support.
    Loops until the model stops requesting tools or max_turns is reached.
    """
    # Copy the list so the caller's message history is not mutated
    messages = messages + [{"role": "user", "content": task}]

    for _ in range(max_turns):
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=TOOL_DEFINITIONS,
            tool_choice="auto"
        )

        message = response.choices[0].message

        # If no tool call, we have our final answer
        if not message.tool_calls:
            return message.content or ""

        # Keep the assistant message that requested the tools in the history
        messages.append(message)

        # Process each tool call
        for tool_call in message.tool_calls:
            tool_name = tool_call.function.name
            tool_args = json.loads(tool_call.function.arguments)

            print(f"    → Calling tool: {tool_name}({tool_args})")
            result = execute_tool(tool_name, tool_args)

            # Feed tool result back to the model
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })

    # Hard ceiling reached: stop instead of looping forever
    return f"Stopped after {max_turns} tool-calling turns without a final answer."

Step 3 — Adding Memory

Memory is what allows your agent to function coherently across multiple steps and sessions. Without it, every step starts from scratch — the agent forgets what it just learned and repeats work unnecessarily.

Two Types of Memory You Need

| Memory Type | What It Stores | How Long |
| --- | --- | --- |
| Short-term (Working) | Current task context, recent observations | Duration of one agent run |
| Long-term (Persistent) | Learned facts, past task results, user preferences | Across sessions |

Implementing Short-Term Memory

# memory.py
import json
import os
from datetime import datetime
from typing import Optional


class WorkingMemory:
    """
    Short-term memory for a single agent run.
    Stores observations, tool results, and intermediate conclusions.
    """

    def __init__(self):
        self.observations = []
        self.tool_results = []
        self.conclusions = []
        self.current_plan = []

    def add_observation(self, step: int, content: str):
        self.observations.append({
            "step": step,
            "content": content
        })

    def add_tool_result(self, tool_name: str, result: str):
        self.tool_results.append({
            "tool": tool_name,
            "result": result
        })

    def add_conclusion(self, step: int, conclusion: str):
        self.conclusions.append({
            "step": step,
            "conclusion": conclusion
        })

    def set_plan(self, plan: list):
        self.current_plan = plan

    def get_context_summary(self) -> str:
        """
        Generates a formatted context string to inject into prompts,
        giving the model awareness of what has happened so far.
        """
        parts = []

        if self.current_plan:
            parts.append("CURRENT PLAN:")
            for step in self.current_plan:
                parts.append(f"  Step {step['step']}: {step['task']}")

        if self.observations:
            parts.append("\nOBSERVATIONS SO FAR:")
            for obs in self.observations[-5:]:  # Last 5 to manage context length
                parts.append(f"  [Step {obs['step']}] {obs['content']}")

        if self.conclusions:
            parts.append("\nCONCLUSIONS REACHED:")
            for con in self.conclusions[-3:]:  # Last 3 conclusions only
                parts.append(f"  [Step {con['step']}] {con['conclusion']}")

        return "\n".join(parts) if parts else "No context yet."


class LongTermMemory:
    """
    Simple persistent memory using a JSON file.
    In production, replace with a vector database (Pinecone, Chroma, Weaviate).
    """

    def __init__(self, filepath: str = "long_term_memory.json"):
        self.filepath = filepath
        self.memories = self._load()

    def _load(self) -> list:
        if os.path.exists(self.filepath):
            with open(self.filepath, "r") as f:
                return json.load(f)
        return []

    def _save(self):
        with open(self.filepath, "w") as f:
            json.dump(self.memories, f, indent=2)

    def store(self, key: str, value: str, tags: Optional[list] = None):
        """Store a fact or result for future retrieval."""
        self.memories.append({
            "key": key,
            "value": value,
            "tags": tags or [],
            "timestamp": datetime.now().isoformat()
        })
        self._save()

    def search(self, query: str, top_k: int = 3) -> list:
        """
        Simple substring search: a memory matches only if the whole query
        appears in its key, value, or tags.
        Replace with vector similarity search in production.
        """
        query_lower = query.lower()
        scored = []

        for memory in self.memories:
            score = 0
            if query_lower in memory["key"].lower():
                score += 2
            if query_lower in memory["value"].lower():
                score += 1
            for tag in memory.get("tags", []):
                if query_lower in tag.lower():
                    score += 1
            if score > 0:
                scored.append((score, memory))

        scored.sort(key=lambda x: x[0], reverse=True)
        return [m for _, m in scored[:top_k]]

    def get_relevant_context(self, query: str) -> str:
        """Returns relevant memories formatted for prompt injection."""
        results = self.search(query)
        if not results:
            return "No relevant past memories found."

        parts = ["RELEVANT PAST KNOWLEDGE:"]
        for memory in results:
            parts.append(f"  [{memory['key']}]: {memory['value']}")
        return "\n".join(parts)
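
The substring match in `search` only fires when the entire query appears verbatim in a memory, so a long goal string will rarely match a stored key. Before reaching for a vector database, one lightweight option is to score memories by word overlap. The sketch below assumes the memory dictionaries have the shape that `store` writes; `tokenize` and `search_by_overlap` are hypothetical names.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_by_overlap(memories: list[dict], query: str, top_k: int = 3) -> list[dict]:
    """Rank memories by how many query words appear in their key, value, or tags."""
    query_words = tokenize(query)
    scored = []
    for memory in memories:
        text = " ".join([memory["key"], memory["value"], *memory.get("tags", [])])
        overlap = len(query_words & tokenize(text))
        if overlap > 0:
            scored.append((overlap, memory))
    # Highest overlap first; the sort is stable, so ties keep insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [memory for _, memory in scored[:top_k]]


memories = [
    {"key": "AI coding assistants comparison", "value": "Compared features and pricing", "tags": ["completed_task"]},
    {"key": "Weekly sales report", "value": "Revenue grew in Q2", "tags": []},
]
hits = search_by_overlap(memories, "pricing of coding assistants")
print([hit["key"] for hit in hits])
```

Word overlap still misses synonyms and paraphrases, which is the gap that embedding-based similarity search is meant to close.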

Step 4 — Self-Critique and Reflection

Self-critique is the most underrated component of agentic AI systems. It's what separates an agent that produces good-enough outputs from one that consistently delivers high-quality, reliable results.

The idea is simple: after the agent produces an output, a second LLM call evaluates it against explicit quality criteria — and if the output falls short, the agent revises it.

The Critique Loop

Agent produces output
        ↓
Critic evaluates against criteria
        ↓
Pass? ──── Yes ──→ Return output
        │
        No
        ↓
Agent revises with critique feedback
        ↓
(Loop again, max N times)

Implementation

# critic.py
import json

from openai import OpenAI
from config import MODEL

client = OpenAI()

CRITIQUE_CRITERIA = """
Evaluate the output against these criteria:
1. ACCURACY: Is the information factually correct and well-supported?
2. COMPLETENESS: Does it fully address the original goal?
3. CLARITY: Is it clearly written and well-structured?
4. RELEVANCE: Is everything included actually relevant to the goal?
5. ACTIONABILITY: Is the output useful and actionable for the user?

Rate each criterion: PASS or FAIL
Provide specific, actionable feedback for any FAILs.
Return as JSON with keys: scores (dict), overall_pass (bool), feedback (string)
"""


def critique_output(goal: str, output: str) -> dict:
    """
    Evaluates an agent output against quality criteria.
    Returns critique scores and actionable feedback.
    """
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": f"You are a rigorous quality evaluator. {CRITIQUE_CRITERIA}"
            },
            {
                "role": "user",
                "content": f"ORIGINAL GOAL:\n{goal}\n\nAGENT OUTPUT:\n{output}"
            }
        ],
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)


def revise_output(goal: str, original_output: str, critique: dict) -> str:
    """
    Revises an agent output based on critique feedback.
    """
    feedback = critique.get("feedback", "No specific feedback provided.")
    scores_json = json.dumps(critique.get("scores", {}), indent=2)

    # Build the prompt from pieces so code indentation does not leak into it
    user_prompt = (
        f"ORIGINAL GOAL:\n{goal}\n\n"
        f"ORIGINAL OUTPUT:\n{original_output}\n\n"
        f"CRITIQUE FEEDBACK:\n{feedback}\n\n"
        f"SCORES:\n{scores_json}\n\n"
        "Please provide an improved version that addresses all FAIL criteria."
    )

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": "You are an expert writer and analyst. Revise the provided output to address all critique feedback while maintaining what was already good."
            },
            {"role": "user", "content": user_prompt}
        ]
    )

    return response.choices[0].message.content


def critique_and_revise(goal: str, output: str, max_iterations: int = 2) -> str:
    """
    Runs the full critique-revise loop until output passes
    or max iterations are reached.
    """
    current_output = output

    for iteration in range(max_iterations):
        print(f"\n  [Critic] Evaluating output (iteration {iteration + 1})...")
        critique = critique_output(goal, current_output)

        scores = critique.get("scores", {})
        overall_pass = critique.get("overall_pass", False)

        print(f"  [Critic] Scores: {scores}")
        print(f"  [Critic] Overall: {'PASS ✓' if overall_pass else 'FAIL, revising...'}")

        if overall_pass:
            print(f"  [Critic] Output approved after {iteration + 1} evaluation(s)")
            return current_output

        print(f"  [Critic] Feedback: {critique.get('feedback', '')}")
        current_output = revise_output(goal, current_output, critique)

    # The last revision is returned without another evaluation pass
    print("  [Critic] Max iterations reached, returning latest revision")
    return current_output
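
The critique loop trusts whatever JSON the critic returns, but models sometimes vary the casing of verdicts or report an overall pass that contradicts an individual FAIL. As a hedged sketch, the helper below coerces a critic response into a predictable shape before the loop acts on it. `normalize_critique` is a hypothetical name, and the rule that any single FAIL forces an overall fail is a design assumption rather than something the criteria prompt guarantees.

```python
def normalize_critique(critique: dict) -> dict:
    """Coerce a critic response into a predictable shape."""
    raw_scores = critique.get("scores")
    scores = raw_scores if isinstance(raw_scores, dict) else {}

    # Normalize verdicts such as "pass", " PASS ", or True to "PASS"/"FAIL".
    normalized = {}
    for name, verdict in scores.items():
        passed = verdict is True or str(verdict).strip().upper() == "PASS"
        normalized[str(name).upper()] = "PASS" if passed else "FAIL"

    # Pass only when every criterion passed and the model did not
    # explicitly report an overall failure. No scores means no pass.
    all_passed = bool(normalized) and all(v == "PASS" for v in normalized.values())
    overall_pass = all_passed and critique.get("overall_pass") is not False

    return {
        "scores": normalized,
        "overall_pass": overall_pass,
        "feedback": str(critique.get("feedback") or ""),
    }


checked = normalize_critique({
    "scores": {"accuracy": "pass", "clarity": "FAIL"},
    "overall_pass": True,  # contradicts the FAIL above
    "feedback": "Tighten the summary.",
})
print(checked)
```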

Putting It All Together

Now we combine all four layers into a single, cohesive agent.

# agent.py
from planner import create_plan
from tool_runner import run_with_tools
from memory import WorkingMemory, LongTermMemory
from critic import critique_and_revise
from config import MODEL
from openai import OpenAI

client = OpenAI()


class AgenticAI:
    """
    A fully agentic AI system with planning, tool calling,
    memory, and self-critique capabilities.
    """

    def __init__(self):
        self.working_memory = WorkingMemory()
        self.long_term_memory = LongTermMemory()

    def run(self, goal: str) -> str:
        print(f"\n{'='*60}")
        print(f"GOAL: {goal}")
        print(f"{'='*60}\n")

        # --- Phase 1: Planning ---
        print("[Phase 1] Creating plan...")
        plan = create_plan(goal)
        self.working_memory.set_plan(plan)

        print(f"Plan created with {len(plan)} steps:")
        for step in plan:
            print(f"  {step['step']}. {step['task']}")

        # Check long-term memory for relevant past knowledge
        past_knowledge = self.long_term_memory.get_relevant_context(goal)
        if past_knowledge != "No relevant past memories found.":
            print("\n[Memory] Found relevant past knowledge")
            self.working_memory.add_observation(0, f"Past knowledge: {past_knowledge}")

        # --- Phase 2: Execute Each Step ---
        print("\n[Phase 2] Executing plan steps...")
        step_results = []

        for step in plan:
            step_num = step["step"]
            task = step["task"]
            print(f"\n  Executing Step {step_num}: {task}")

            # Build context-aware messages
            context = self.working_memory.get_context_summary()
            system_message = (
                "You are an expert AI assistant executing a specific task.\n\n"
                f"OVERALL GOAL: {goal}\n\n"
                f"{context}\n\n"
                "Execute the current task carefully and thoroughly.\n"
                "Use tools when needed to gather real information."
            )

            messages = [{"role": "system", "content": system_message}]

            # Execute step with tool calling
            result = run_with_tools(messages, task)

            # Store result in working memory
            self.working_memory.add_observation(step_num, result)
            step_results.append({
                "step": step_num,
                "task": task,
                "result": result
            })

            print(f"  ✓ Step {step_num} complete")

        # --- Phase 3: Synthesize Final Output ---
        print("\n[Phase 3] Synthesizing final output...")

        all_results = "\n\n".join([
            f"Step {r['step']} ({r['task']}):\n{r['result']}"
            for r in step_results
        ])

        synthesis_response = client.chat.completions.create(
            model=MODEL,
            messages=[
                {
                    "role": "system",
                    "content": "You are an expert analyst. Synthesize the step results into a comprehensive, well-structured final response that fully addresses the original goal."
                },
                {
                    "role": "user",
                    "content": f"GOAL: {goal}\n\nSTEP RESULTS:\n{all_results}\n\nProvide a comprehensive final response."
                }
            ]
        )

        raw_output = synthesis_response.choices[0].message.content

        # --- Phase 4: Self-Critique ---
        print("\n[Phase 4] Running self-critique...")
        final_output = critique_and_revise(goal, raw_output)

        # Store a truncated copy of the result in long-term memory
        self.long_term_memory.store(
            key=goal[:100],
            value=final_output[:500],
            tags=["completed_task"]
        )

        print(f"\n{'='*60}")
        print("FINAL OUTPUT:")
        print(f"{'='*60}")
        return final_output


# main.py
from agent import AgenticAI

if __name__ == "__main__":
    agent = AgenticAI()

    result = agent.run(
        "Research the top 3 AI coding assistants in 2026 and "
        "provide a detailed comparison of their features, pricing, and best use cases"
    )

    print(result)
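
In the agent above, an exception in any single step aborts the whole run. One way to contain that is to wrap step execution so failures are retried and then recorded rather than raised. This is a sketch under assumptions: `run_step_safely` is a hypothetical helper, and `execute` stands in for a call such as `run_with_tools` with the messages already bound.

```python
from typing import Callable


def run_step_safely(
    execute: Callable[[str], str],
    task: str,
    max_attempts: int = 2,
) -> dict:
    """Run one plan step, retrying on failure and recording the outcome."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"task": task, "status": "ok", "result": execute(task), "attempts": attempt}
        except Exception as exc:
            last_error = f"{type(exc).__name__}: {exc}"
    # Report the failure instead of raising so later steps can still run.
    return {"task": task, "status": "failed", "result": last_error, "attempts": max_attempts}


def broken_step(task: str) -> str:
    """Stand-in for a step whose tool call keeps failing."""
    raise TimeoutError("search API timed out")


ok_outcome = run_step_safely(lambda task: f"done: {task}", "summarize findings")
failed_outcome = run_step_safely(broken_step, "search the web")
print(ok_outcome)
print(failed_outcome)
```

The synthesis phase can then decide what to do with failed steps, for example by noting the gap in the final answer instead of silently omitting it.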

Sample Run Output

============================================================
GOAL: Research the top 3 AI coding assistants in 2026
============================================================

[Phase 1] Creating plan...
Plan created with 3 steps:
  1. Search for top AI coding assistants in 2026
  2. Gather detailed feature and pricing information for each
  3. Synthesize findings into a structured comparison

[Phase 2] Executing plan steps...

  Executing Step 1: Search for top AI coding assistants in 2026
    → Calling tool: web_search({"query": "top AI coding assistants 2026"})
  ✓ Step 1 complete

  Executing Step 2: Gather detailed information...
    → Calling tool: web_search({"query": "Claude Code vs GitHub Copilot features 2026"})
  ✓ Step 2 complete

  Executing Step 3: Synthesize findings...
  ✓ Step 3 complete

[Phase 3] Synthesizing final output...

[Phase 4] Running self-critique...
  [Critic] Evaluating output (iteration 1)...
  [Critic] Scores: {"ACCURACY": "PASS", "COMPLETENESS": "PASS", ...}
  [Critic] Overall: PASS ✓
  [Critic] Output approved after 1 evaluation(s)

============================================================
FINAL OUTPUT:
============================================================

Best Practices for Production

| Area | Recommendation |
| --- | --- |
| Model choice | Use gpt-4o for reasoning steps, gpt-4o-mini for critique to reduce costs |
| Error handling | Wrap every tool call and LLM call in try/except with retry logic |
| Memory backend | Replace file-based memory with a vector DB (Chroma, Pinecone, Weaviate) |
| Tool sandboxing | Never run eval() or execute code without a sandboxed environment |
| Rate limiting | Implement exponential backoff for OpenAI API rate limit errors |
| Observability | Log every step, tool call, and critique result for debugging |
| Token management | Track and limit token usage per run to avoid runaway costs |
| Max iterations | Always set a hard ceiling on agent loops to prevent infinite runs |
| Security | Sanitize all tool inputs; never pass raw user input to system commands |
| Testing | Write unit tests for each tool and integration tests for the full loop |
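
The retry and rate-limiting recommendations can be combined in one small wrapper. The sketch below retries a callable with exponential backoff and jitter; `with_retries` is a hypothetical helper, and catching every `Exception` is a simplification for illustration. In real code you would likely catch only retryable errors, such as the SDK's `openai.RateLimitError`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying on any exception with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")


# Demo: a stand-in call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays: list[float] = []
result = with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, attempts["count"], len(delays))
```

Injecting `sleep` as a parameter keeps the helper easy to unit test, since tests can record the delays without actually waiting.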

Common Pitfalls to Avoid

| Pitfall | What Goes Wrong | How to Fix It |
| --- | --- | --- |
| No loop limit | Agent runs forever on hard problems | Set max_iterations on every loop |
| Ignoring tool errors | Agent hallucinates tool results | Always validate and handle tool failures |
| Bloated context | Token limits exceeded mid-run | Summarize memory, truncate old observations |
| Over-critiquing | Too many revision cycles inflate cost | Limit critique to 2–3 iterations max |
| Trusting all tool output | Bad data propagates through steps | Validate and sanity-check tool results |
| No fallback planning | One step fails, whole task collapses | Implement step-level retry and recovery |
| Verbose system prompts | Model loses focus on the actual task | Keep system prompts focused and concise |
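
Several of these pitfalls come down to unbounded cost. As one illustration, a per-run token budget can stop an agent before it overspends. The sketch below assumes you feed it the token count from each API response (Chat Completions responses expose this as `response.usage.total_tokens`); `TokenBudget` is a hypothetical class name and the 1000-token limit is an arbitrary demo value.

```python
class TokenBudget:
    """Track cumulative token usage for one agent run and enforce a ceiling."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def record(self, tokens: int) -> None:
        """Add tokens from one API response; raise once the ceiling is crossed."""
        self.used += tokens
        if self.used > self.max_tokens:
            raise RuntimeError(
                f"Token budget exceeded: {self.used} used, limit {self.max_tokens}"
            )

    @property
    def remaining(self) -> int:
        """Tokens left before the ceiling, never negative."""
        return max(self.max_tokens - self.used, 0)


budget = TokenBudget(max_tokens=1000)
budget.record(400)  # e.g. usage reported by a planning call
budget.record(350)  # e.g. usage reported by a tool-calling step
print(budget.remaining)
```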

Final Thoughts

Building an agentic AI system is one of the most rewarding things you can do as a developer right now. The gap between a simple chatbot and a truly autonomous agent isn't as wide as it seems — it's largely a matter of adding the right layers in the right order.

To recap what we built:

| Component | What We Built |
| --- | --- |
| Planning | Goal decomposition into ordered, specific subtasks |
| Tool Calling | Web search, calculation, and datetime tools, with a clean routing layer |
| Memory | Working memory for current run + persistent long-term memory |
| Self-Critique | Automated quality evaluation and output revision loop |
| Full Agent | A composable system that ties all four layers together |

From here, the natural next steps are to swap the simulated web search for a real API like Tavily or Serper, replace file-based memory with a proper vector database, add more tools tailored to your domain, and instrument the whole system with logging and observability.

The architecture in this guide is intentionally modular — each layer can be improved, replaced, or extended independently. That's the foundation of a system you can actually evolve and trust in production.

At CognyX AI, we design and deploy production-grade agentic AI systems tailored to real business workflows — from autonomous data pipelines to multi-agent research and decision-making systems. If you're ready to move beyond demos and build something that actually runs in production, let's talk.

Written by

Raish Momin, CTO