Tags: AWS Bedrock · LangGraph · Python · AgentOps · Production
## Overview
Building enterprise-grade AI agents for internal banking operations. These aren’t chatbots — they’re autonomous systems that reason, plan, use tools, and complete complex workflows.
**Enterprise Scale:** Deployed at one of Belgium's largest banks (12M+ customers)
## Architecture

```mermaid
flowchart TB
    subgraph INPUT["Input Layer"]
        API[API Gateway]
        EB[EventBridge]
        SQS[SQS Queues]
    end
    subgraph ORCHESTRATION["Agent Orchestration"]
        SUPER[Supervisor Agent]
        DOC[Document Agent]
        KNOW[Knowledge Agent]
        COMPLY[Compliance Agent]
    end
    subgraph TOOLS["Tools & Resources"]
        RAG[RAG System]
        BANK[Banking APIs]
        DOCS[Document Store]
    end
    subgraph SAFETY["Safety & Observability"]
        GUARD[Bedrock Guardrails]
        TRACE[Agent Tracing]
        EVAL[Evaluation Pipeline]
    end
    INPUT --> ORCHESTRATION
    ORCHESTRATION --> TOOLS
    ORCHESTRATION --> SAFETY
```
## Key Features

### Multi-Agent Orchestration
- Supervisor agent coordinates specialized worker agents
- Dynamic task delegation based on query intent
- Parallel execution when tasks are independent
- State sharing between agents for complex workflows (see the sketch after this list)
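
A minimal sketch of this supervisor/worker pattern in LangGraph follows. The worker node names mirror the architecture diagram above; the keyword-based `route` function and the stubbed workers are illustrative stand-ins for the model-driven delegation and Bedrock-backed agents used in production.

```python
# Sketch only: keyword routing stands in for model-based intent classification.
from typing import Literal

from langgraph.graph import END, START, MessagesState, StateGraph


def supervisor(state: MessagesState) -> dict:
    # In production this node calls a Bedrock model to plan and delegate;
    # here it is a no-op and the routing happens in the conditional edge below.
    return {}


def route(state: MessagesState) -> Literal["document_agent", "knowledge_agent", "compliance_agent"]:
    text = state["messages"][-1].content.lower()
    if "document" in text:
        return "document_agent"
    if "compliance" in text or "policy" in text:
        return "compliance_agent"
    return "knowledge_agent"


def document_agent(state: MessagesState) -> dict:
    return {"messages": [("ai", "document agent result (stub)")]}


def knowledge_agent(state: MessagesState) -> dict:
    return {"messages": [("ai", "knowledge agent result (stub)")]}


def compliance_agent(state: MessagesState) -> dict:
    return {"messages": [("ai", "compliance agent result (stub)")]}


builder = StateGraph(MessagesState)
builder.add_node("supervisor", supervisor)
builder.add_node("document_agent", document_agent)
builder.add_node("knowledge_agent", knowledge_agent)
builder.add_node("compliance_agent", compliance_agent)
builder.add_edge(START, "supervisor")
builder.add_conditional_edges("supervisor", route)
for worker in ("document_agent", "knowledge_agent", "compliance_agent"):
    builder.add_edge(worker, END)

app = builder.compile()
```

Invoking `app.invoke({"messages": [("user", "Summarise this contract document")]})` then lands on the document agent; independent subtasks can also be fanned out in parallel with LangGraph's `Send` API.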
### Tool Integration

Agents can interact with the following systems (see the tool-definition sketch after this list):
- Internal banking APIs (accounts, transactions, products)
- Document retrieval systems
- Knowledge bases via RAG
- External compliance databases
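
As an illustration, a banking API can be exposed to an agent as a typed tool. `get_account_balance` and its behavior are hypothetical placeholders; real tools call authenticated internal services and are scoped per agent.

```python
# Hypothetical tool wrapping an internal banking API (placeholder logic).
from langchain_core.tools import tool


@tool
def get_account_balance(account_id: str) -> str:
    """Return the current balance for an internal account ID."""
    # The real implementation calls the internal accounts API with the caller's scope.
    return f"Balance for {account_id}: <redacted in this sketch>"


# Worker agents receive only the tools they are permitted to use, e.g.:
# langgraph.prebuilt.create_react_agent(model, tools=[get_account_balance])
```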
Memory & Context#
- Session persistence across interactions
- Context windowing for long conversations
- Compaction when approaching token limits
- User preference memory for personalization
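
A sketch of session persistence and context windowing with LangGraph's checkpointer and `trim_messages`. The in-memory saver and the message-count budget are simplifications; production uses DynamoDB/Redis-backed persistence and token-based compaction.

```python
# Sketch: durable per-session state via a checkpointer, plus message trimming.
from langchain_core.messages import trim_messages
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, START, MessagesState, StateGraph


def respond(state: MessagesState) -> dict:
    # Placeholder node; the real node is a Bedrock-backed agent.
    return {"messages": [("ai", "ack")]}


builder = StateGraph(MessagesState)
builder.add_node("respond", respond)
builder.add_edge(START, "respond")
builder.add_edge("respond", END)

# The checkpointer persists graph state per thread_id, so each user session keeps
# its history across interactions (in-memory here, DynamoDB/Redis in production).
app = builder.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "session-42"}}
app.invoke({"messages": [("user", "hello")]}, config=config)

# Context windowing: keep only recent messages when nearing the limit.
# Counting by message here; production counts tokens and compacts older turns
# into a summary instead of dropping them outright.
recent = trim_messages(
    app.get_state(config).values["messages"],
    max_tokens=20,
    token_counter=len,
    strategy="last",
)
```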
Guardrails & Safety#
- Bedrock Guardrails for content filtering
- PII detection and redaction
- Scope enforcement — agents only access permitted data
- Audit logging for compliance
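
A sketch of the input-filtering step, assuming a guardrail created in Bedrock and the boto3 `apply_guardrail` call; the guardrail ARN, version, and region are placeholders. The same check runs on model output, and PII entities can be configured to be masked rather than blocked.

```python
# Sketch: pre-flight input check with a Bedrock guardrail (placeholder IDs).
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="eu-west-1")

result = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="arn:aws:bedrock:eu-west-1:123456789012:guardrail/example",  # placeholder
    guardrailVersion="1",
    source="INPUT",
    content=[{"text": {"text": "Customer question possibly containing PII"}}],
)

if result["action"] == "GUARDRAIL_INTERVENED":
    # Use the redacted/blocked text produced by the guardrail instead of the raw input.
    safe_text = result["outputs"][0]["text"]
```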
## Technical Stack

### Agent Infrastructure
| Component | Technology |
|---|---|
| Orchestration | LangGraph, AWS Bedrock Agents, AWS AgentCore Runtime |
| Foundation Models | Claude, OpenAI models (via Bedrock; wiring sketch below) |
| Memory | DynamoDB, Redis |
| Queuing | SQS, EventBridge |
| Compute | Lambda, ECS |
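
A sketch of how a Bedrock-hosted Claude model can be wired into a LangGraph worker via `langchain-aws`; the model ID and region are examples, not the production configuration.

```python
# Example: a Bedrock-hosted Claude model behind a LangGraph ReAct worker.
from langchain_aws import ChatBedrock
from langgraph.prebuilt import create_react_agent

model = ChatBedrock(
    model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID
    region_name="eu-west-1",                                # example region
)
# Tools from the Tool Integration sketch would be passed per worker agent.
knowledge_agent = create_react_agent(model, tools=[])
```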
### Observability
| Component | Technology |
|---|---|
| Tracing | AgentOps, X-Ray (setup sketch below) |
| Evaluation | LLM-as-judge, custom evals |
| Monitoring | CloudWatch, custom dashboards |
| Alerting | SNS, PagerDuty |
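
Tracing setup is largely declarative; a minimal sketch, assuming the AgentOps Python SDK's `init()` entry point (what gets auto-instrumented depends on the SDK version):

```python
# Minimal AgentOps setup; the API key is read from the environment.
import os

import agentops

agentops.init(api_key=os.environ["AGENTOPS_API_KEY"])
# Agent runs executed after init() appear as traces in the AgentOps dashboard;
# AWS X-Ray covers the surrounding Lambda/ECS infrastructure.
```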
## Evaluation Framework

```mermaid
flowchart LR
    subgraph EVAL["Evaluation Pipeline"]
        TRAJ[Trajectory Eval]
        TOOL[Tool Use Accuracy]
        RESP[Response Quality]
        HALL[Hallucination Check]
    end
    AGENT[Agent Run] --> EVAL
    EVAL --> METRICS[Metrics & Alerts]
    EVAL --> IMPROVE[Model Improvements]
```
We evaluate agents on the following dimensions (an LLM-as-judge sketch follows the list):
- Trajectory quality — Did the agent take sensible steps?
- Tool use accuracy — Were the right tools called with correct params?
- Response quality — Is the final answer helpful and correct?
- Hallucination rate — Does the agent make things up?
- Latency — Is the response time acceptable?
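
As a sketch of the response-quality and hallucination checks, an LLM-as-judge can grade answers against retrieved context via the Bedrock Converse API. The rubric, judge model ID, and JSON schema below are illustrative, not the production rubric.

```python
# Sketch: grade an answer for helpfulness and groundedness with a judge model.
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="eu-west-1")

JUDGE_PROMPT = """You are grading an internal banking assistant.
Question: {question}
Retrieved context: {context}
Agent answer: {answer}
Respond with JSON only: {{"helpfulness": 1-5, "grounded": true|false, "reason": "..."}}"""


def judge(question: str, context: str, answer: str) -> dict:
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example judge model
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_PROMPT.format(
                question=question, context=context, answer=answer)}],
        }],
    )
    # Assumes the judge returns bare JSON; production parsing is more defensive.
    # Hallucination rate is tracked as the share of answers judged ungrounded.
    return json.loads(response["output"]["message"]["content"][0]["text"])
```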
## Results
| Metric | Achievement |
|---|---|
| Task completion rate | 94%+ |
| Average response time | <5 seconds |
| User satisfaction | 4.2/5 |
| Hallucination rate | <2% |
| Daily interactions | 1000+ |
## Learnings
Key lessons from building production agents:
- Evals are everything — Without robust evaluation, you’re flying blind
- Guardrails early — Add safety from day one, not as an afterthought
- Tracing is crucial — Complex agent behavior needs visibility
- Start simple — Single agent first, multi-agent only when needed
- Human-in-the-loop — Some decisions need human approval (see the sketch after this list)
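
On the last point: a minimal sketch of a human approval gate, assuming LangGraph's `interrupt` primitive; the payment example and threshold are hypothetical.

```python
# Sketch: pause a run for human approval above a hypothetical threshold.
from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, START, StateGraph
from langgraph.types import Command, interrupt


class PaymentState(TypedDict):
    amount: float
    approved: bool


def approval_gate(state: PaymentState) -> dict:
    if state["amount"] > 10_000:
        # Pauses the graph and surfaces the request to a human reviewer.
        decision = interrupt({"question": f"Approve payment of {state['amount']}?"})
        return {"approved": bool(decision)}
    return {"approved": True}


builder = StateGraph(PaymentState)
builder.add_node("approval_gate", approval_gate)
builder.add_edge(START, "approval_gate")
builder.add_edge("approval_gate", END)
app = builder.compile(checkpointer=MemorySaver())  # interrupts require a checkpointer

config = {"configurable": {"thread_id": "payment-1"}}
app.invoke({"amount": 25_000, "approved": False}, config=config)  # run pauses here
app.invoke(Command(resume=True), config=config)                   # reviewer approves, run completes
```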
## Related
- MiniClaw — Open-source agent framework
- Agent Architecture Deep Dive — Technical patterns
- Agentic Protocols — MCP, A2A, AG-UI