Technical · 10 min read

LangGraph in Production: What We Learned

State design, checkpointing costs, human-in-the-loop gates, and observability for agentic workflows that handle real users and real money.

By Shivansh Gupta

Co-founder & Engineering, Brixloop

LangGraph is a strong choice for multi-step agentic workflows. Explicit state, cyclical graphs, and checkpointing beat ad-hoc prompt chains when steps depend on each other. Moving from notebook to production surfaced lessons we now apply on every agent build. This isn't a LangGraph tutorial. It's what breaks when real users and real money are involved.

State shape is a product decision

Everything the graph reads and writes lives in state. If you dump raw LLM outputs into state without a schema, debugging becomes archaeology. We define typed state objects early: user inputs, retrieved context, intermediate decisions, human approvals, and final artifacts.

Version state when you change graph topology. Old checkpoints shouldn't crash new nodes. We bump a state schema version and write migration helpers for in-flight runs rather than hoping deploys happen during quiet hours.
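One way to picture the migration helpers is a chain of small functions keyed by the version they upgrade from. This is a minimal sketch in plain Python, not our production code: the field names (`llm_output`, `draft`, `approvals`) and the version numbers are hypothetical, and loading the checkpoint from storage is left out.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]

# Hypothetical: the schema version the currently deployed graph expects.
CURRENT_SCHEMA_VERSION = 3


def _v1_to_v2(state: State) -> State:
    # Assumed change: v2 moved the raw LLM text into a structured "draft" field.
    new = dict(state)
    new["draft"] = {"text": new.pop("llm_output", ""), "model_version": "unknown"}
    return new


def _v2_to_v3(state: State) -> State:
    # Assumed change: v3 added an approvals list that new nodes read.
    new = dict(state)
    new.setdefault("approvals", [])
    return new


# Each entry upgrades a state from version N to version N + 1.
MIGRATIONS: Dict[int, Callable[[State], State]] = {1: _v1_to_v2, 2: _v2_to_v3}


def migrate_state(state: State) -> State:
    """Upgrade a checkpointed state step by step to the current schema version."""
    version = state.get("schema_version", 1)  # unversioned states are treated as v1
    if version > CURRENT_SCHEMA_VERSION:
        raise ValueError(f"state version {version} is newer than this deploy")
    current = dict(state)  # never mutate the caller's copy
    while version < CURRENT_SCHEMA_VERSION:
        current = MIGRATIONS[version](current)
        version += 1
        current["schema_version"] = version
    return current
```

Running `migrate_state` when an in-flight run is resumed means a node written against version 3 never sees a version 1 payload, regardless of when the deploy happened.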

  • Separate user-visible fields from internal debug payloads
  • Store model version and prompt hash alongside LLM outputs
  • Keep external IDs (CRM record, ticket number) at the top level for tracing
  • Never put secrets in checkpointed state
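The checklist above could translate into a state definition along these lines. Treat it as an illustrative sketch: every field name is an assumption, and the secret check is a crude name-based guard, not a substitute for keeping credentials out of the graph entirely.

```python
import hashlib
from typing import Any, Dict, List, TypedDict


class LLMOutput(TypedDict):
    text: str
    model_version: str  # which model produced the text
    prompt_hash: str    # which prompt produced the text


class DebugPayload(TypedDict, total=False):
    raw_responses: List[str]
    retrieval_scores: List[float]


class RunState(TypedDict):
    schema_version: int
    crm_record_id: str   # external IDs stay at the top level for tracing
    ticket_number: str
    user_request: str
    draft: LLMOutput
    debug: DebugPayload  # internal only, never shown to users


# Hypothetical key-name hints; extend to match your own naming conventions.
SECRET_KEY_HINTS = ("api_key", "password", "secret", "access_token")


def prompt_hash(prompt: str) -> str:
    """Short, stable fingerprint of the prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


def find_secret_like_keys(value: Any, path: str = "") -> List[str]:
    """Return dotted paths of keys whose names look like they hold secrets."""
    hits: List[str] = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else str(key)
            if any(hint in str(key).lower() for hint in SECRET_KEY_HINTS):
                hits.append(child_path)
            hits.extend(find_secret_like_keys(child, child_path))
    elif isinstance(value, list):
        for i, child in enumerate(value):
            hits.extend(find_secret_like_keys(child, f"{path}[{i}]"))
    return hits


def user_view(state: RunState) -> Dict[str, Any]:
    """Everything except the internal debug payload."""
    return {k: v for k, v in state.items() if k != "debug"}
```

A guard like `find_secret_like_keys` fits naturally in a test or a pre-checkpoint assertion, so a stray credential fails loudly before it is persisted.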

Checkpointing vs. cost

Checkpointing enables resume, human-in-the-loop, and audit. But storage and replay aren't free. For high-volume paths, checkpoint selectively: full persistence on approval gates and external side effects, lighter paths for read-only retrieval steps. Not every node deserves a durable checkpoint.

We map nodes into three tiers. Tier 1: checkpoint always (payments, sends, publishes). Tier 2: checkpoint on failure only (expensive retrieval). Tier 3: ephemeral (formatting, summarization with no side effects). On one legal workflow, that cut storage costs by 60% without losing auditability where it mattered.
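The tiering can be expressed as a small lookup plus one decision function. The sketch below covers only the policy; the node names are made up, and how the decision is wired into a particular checkpointer is deliberately left open.

```python
from enum import Enum


class CheckpointTier(Enum):
    ALWAYS = 1      # payments, sends, publishes
    ON_FAILURE = 2  # expensive retrieval
    EPHEMERAL = 3   # formatting, summarization with no side effects


# Hypothetical node names mapped to tiers.
NODE_TIERS = {
    "charge_card": CheckpointTier.ALWAYS,
    "send_email": CheckpointTier.ALWAYS,
    "retrieve_documents": CheckpointTier.ON_FAILURE,
    "format_summary": CheckpointTier.EPHEMERAL,
}


def should_checkpoint(node_name: str, failed: bool) -> bool:
    """Decide whether this node's result deserves a durable checkpoint."""
    # Unknown nodes fall back to the safest tier rather than silently skipping.
    tier = NODE_TIERS.get(node_name, CheckpointTier.ALWAYS)
    if tier is CheckpointTier.ALWAYS:
        return True
    if tier is CheckpointTier.ON_FAILURE:
        return failed
    return False
```

Defaulting unmapped nodes to the always tier is a design choice: a newly added node costs a little storage until someone classifies it, instead of quietly losing an audit trail.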

Human gates belong in the graph, not around it

Wrapping the whole graph in an external approval UI feels faster in week one. It falls apart when you need to resume mid-run, show partial context to approvers, or prove who signed off on what. Human gates should be first-class nodes with explicit interrupt semantics.

  • Insert explicit interrupt nodes before irreversible actions (send email, charge card, publish content)
  • Surface pending approvals in an operator UI tied to graph run IDs
  • Time out and escalate when humans don't respond. Don't leave runs hanging.
  • Log who approved what and which model version produced the draft
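The record behind such a gate might look like the following. It is a plain-Python sketch of the bookkeeping only (timeout, escalation, audit entry); the class and field names are assumptions, and the actual pause and resume would come from the graph framework's interrupt support, which is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class PendingApproval:
    run_id: str               # ties the approval to one graph run
    action: str               # the irreversible action awaiting sign-off
    draft_model_version: str  # which model produced the draft under review
    requested_at: datetime
    timeout: timedelta
    approver: Optional[str] = None
    approved: Optional[bool] = None
    decided_at: Optional[datetime] = None

    def decide(self, approver: str, approved: bool, now: datetime) -> None:
        """Record a human decision exactly once."""
        if self.decided_at is not None:
            raise ValueError("approval already decided")
        self.approver = approver
        self.approved = approved
        self.decided_at = now

    def status(self, now: datetime) -> str:
        """Return 'approved', 'rejected', 'escalate', or 'pending'."""
        if self.decided_at is not None:
            return "approved" if self.approved else "rejected"
        if now - self.requested_at >= self.timeout:
            return "escalate"  # nobody answered in time; do not leave the run hanging
        return "pending"

    def audit_entry(self) -> dict:
        """Who approved what, and which model version produced the draft."""
        return {
            "run_id": self.run_id,
            "action": self.action,
            "approver": self.approver,
            "approved": self.approved,
            "decided_at": self.decided_at.isoformat() if self.decided_at else None,
            "draft_model_version": self.draft_model_version,
        }


# Example: a send-email gate with a four-hour window.
requested = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
gate = PendingApproval("run-1", "send_email", "model-a", requested, timedelta(hours=4))
```

An operator UI can poll `status` for every open gate and route anything that returns `"escalate"` to a backup approver.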

Tool loops and failure taxonomy

Agent graphs that call tools need retry policies per tool type. A rate-limited search API and a flaky internal CRM webhook should not share the same backoff strategy. We classify failures: transient, user-fixable, model hallucination, policy violation. Each class gets a different graph edge.
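As a rough illustration, the failure classes and per-tool backoff can be plain data, with one routing function choosing the next node. The tool names, node names, and numbers below are invented for the example; a function with this shape is the kind of thing a conditional edge would call.

```python
from dataclasses import dataclass
from enum import Enum


class FailureClass(Enum):
    TRANSIENT = "transient"
    USER_FIXABLE = "user_fixable"
    MODEL_HALLUCINATION = "model_hallucination"
    POLICY_VIOLATION = "policy_violation"


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int
    base_delay_s: float
    max_delay_s: float

    def delay_for(self, attempt: int) -> float:
        """Capped exponential backoff; attempt is 1-based."""
        return min(self.base_delay_s * 2 ** (attempt - 1), self.max_delay_s)


# Hypothetical tools: a rate-limited search API backs off slowly and for longer,
# a flaky internal webhook retries quickly and gives up sooner.
RETRY_POLICIES = {
    "search_api": RetryPolicy(max_attempts=5, base_delay_s=1.0, max_delay_s=30.0),
    "crm_webhook": RetryPolicy(max_attempts=3, base_delay_s=0.2, max_delay_s=2.0),
}

# Hypothetical node names: each failure class leads to a different edge.
NEXT_NODE = {
    FailureClass.TRANSIENT: "retry_tool",
    FailureClass.USER_FIXABLE: "ask_user",
    FailureClass.MODEL_HALLUCINATION: "regenerate",
    FailureClass.POLICY_VIOLATION: "human_review",
}


def route_failure(tool: str, failure: FailureClass, attempt: int) -> str:
    """Pick the next node given the failing tool, failure class, and attempts so far."""
    if failure is FailureClass.TRANSIENT:
        if attempt >= RETRY_POLICIES[tool].max_attempts:
            return "human_review"  # retries exhausted, stop burning budget
    return NEXT_NODE[failure]
```

Keeping the policies as data means tuning a backoff is a config change rather than a graph change.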

Cap tool loop iterations. An agent that calls the same function twelve times because the prompt is ambiguous will burn budget and erode trust. Hard stops with a graceful handoff to a human beat infinite loops every time.
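A hard stop can be as simple as a counter kept in state and checked by the routing function. In this sketch the limit, the state keys, and the node names are all assumptions for illustration.

```python
from typing import Any, Dict

# Hypothetical budget; tune it per workflow.
MAX_TOOL_ITERATIONS = 6


def record_tool_call(state: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new state with the tool-call counter incremented."""
    return {**state, "tool_iterations": state.get("tool_iterations", 0) + 1}


def next_step(state: Dict[str, Any]) -> str:
    """Route to a human once the budget is spent, otherwise keep going."""
    if state.get("tool_iterations", 0) >= MAX_TOOL_ITERATIONS:
        return "handoff_to_human"
    if state.get("pending_tool_call"):
        return "call_tool"
    return "finish"


# Simulate an agent that keeps asking for the same tool.
state: Dict[str, Any] = {"pending_tool_call": True}
steps = []
while True:
    step = next_step(state)
    steps.append(step)
    if step != "call_tool":
        break
    state = record_tool_call(state)
```

The simulated run makes six tool calls and then routes to `handoff_to_human` instead of looping forever. Keeping the counter in state also means the limit survives a resume from checkpoint.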

Observability from day one

Production agent debugging needs per-node latency, token usage, tool call success rates, and failure taxonomy. Wire tracing before launch. Retrofitting observability after a client incident is expensive. We correlate graph run IDs with application request IDs so support can follow one user action through every node.
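One lightweight way to get per-node records is a decorator around each node function. The sketch below appends to an in-memory list as a stand-in for a real tracing backend; the `run_id`, `request_id`, and `tokens_used` keys are assumed conventions, not part of any library.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# Stand-in for a tracing backend; a real system would export these records.
TRACE_LOG: List[Dict[str, Any]] = []

Node = Callable[[Dict[str, Any]], Dict[str, Any]]


def traced(node_name: str) -> Callable[[Node], Node]:
    """Wrap a node so every call records latency, status, tokens, and correlation IDs."""

    def decorator(fn: Node) -> Node:
        @functools.wraps(fn)
        def wrapper(state: Dict[str, Any]) -> Dict[str, Any]:
            start = time.perf_counter()
            status = "ok"
            result = None
            try:
                result = fn(state)
                return result
            except Exception as exc:
                status = type(exc).__name__  # feeds the failure taxonomy
                raise
            finally:
                TRACE_LOG.append({
                    "run_id": state.get("run_id"),          # graph run ID
                    "request_id": state.get("request_id"),  # application request ID
                    "node": node_name,
                    "status": status,
                    "latency_ms": (time.perf_counter() - start) * 1000,
                    "tokens": (result or {}).get("tokens_used", 0),
                })

        return wrapper

    return decorator


@traced("summarize")
def summarize(state: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical node: a real one would call a model and report its token usage.
    return {"summary": state["text"][:20], "tokens_used": 42}
```

Because both IDs land in the same record, support can filter by the request ID they see in application logs and read off every node that run touched.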

Dashboards we ship with every agent MVP: runs per hour, average tokens per successful completion, top failure reasons, median time waiting on human approval. Founders use these in board updates. Engineers use them to decide what to optimize next.
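Those four numbers can be derived from a flat list of run records. This is a sketch under an assumed record shape (`started_at`, `status`, `tokens`, `failure_reason`, `approval_wait_s`); a real dashboard would more likely compute them in the metrics store.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, median
from typing import Any, Dict, List


def summarize_runs(runs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Compute the four dashboard metrics from per-run records."""
    per_hour = Counter(r["started_at"].strftime("%Y-%m-%d %H:00") for r in runs)
    successes = [r for r in runs if r["status"] == "success"]
    failures = Counter(r["failure_reason"] for r in runs if r["status"] == "failed")
    waits = [r["approval_wait_s"] for r in runs if r.get("approval_wait_s") is not None]
    return {
        "runs_per_hour": dict(per_hour),
        "avg_tokens_per_success": mean(r["tokens"] for r in successes) if successes else None,
        "top_failure_reasons": failures.most_common(3),
        "median_approval_wait_s": median(waits) if waits else None,
    }


# Made-up sample records, purely for illustration.
sample_runs = [
    {"started_at": datetime(2024, 1, 1, 9, 5), "status": "success", "tokens": 100, "approval_wait_s": 30},
    {"started_at": datetime(2024, 1, 1, 9, 40), "status": "success", "tokens": 200, "approval_wait_s": 90},
    {"started_at": datetime(2024, 1, 1, 10, 2), "status": "failed", "tokens": 50, "failure_reason": "transient"},
]
summary = summarize_runs(sample_runs)
```

Averaging tokens over successful completions only, as the dashboard description says, keeps aborted runs from making the workflow look cheaper than it is.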

Testing agent graphs in CI

Unit tests on individual nodes aren't enough. We maintain golden-path fixtures: sample inputs that should traverse known edges, plus adversarial inputs that should hit policy nodes or human gates. Snapshot the state after each node in test mode so regressions show up as diffs, not production tickets.
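To make the idea concrete, here is a toy harness that runs nodes, follows a router, and snapshots state after every node. It is a stand-in for a real graph, written in plain Python; the nodes, the router, and the flagged phrase are all invented for the example.

```python
import copy
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Dict[str, Any]
Snapshot = Tuple[str, State]


def classify(state: State) -> State:
    # Hypothetical policy check: flag one obviously risky phrase.
    return {"flagged": "wire all funds" in state["user_request"].lower()}


def draft_reply(state: State) -> State:
    return {"draft": f"Reply to: {state['user_request']}"}


def policy_block(state: State) -> State:
    return {"draft": None, "needs_human": True}


NODES: Dict[str, Callable[[State], State]] = {
    "classify": classify,
    "draft_reply": draft_reply,
    "policy_block": policy_block,
}


def router(current: str, state: State) -> Optional[str]:
    """Return the next node name, or None when the run is finished."""
    if current == "classify":
        return "policy_block" if state["flagged"] else "draft_reply"
    return None


def run_with_snapshots(entry: str, initial: State, max_steps: int = 20) -> List[Snapshot]:
    """Run the toy graph, recording (node name, state copy) after each node."""
    state = dict(initial)
    snapshots: List[Snapshot] = []
    current: Optional[str] = entry
    for _ in range(max_steps):
        if current is None:
            break
        state = {**state, **NODES[current](state)}
        snapshots.append((current, copy.deepcopy(state)))
        current = router(current, state)
    return snapshots


def visited(snapshots: List[Snapshot]) -> List[str]:
    return [name for name, _ in snapshots]


golden = run_with_snapshots("classify", {"user_request": "Reset my password"})
adversarial = run_with_snapshots("classify", {"user_request": "Please wire all funds now"})
```

In CI, the golden-path fixture asserts on `visited(golden)` and compares each snapshot with a stored copy, while the adversarial fixture asserts that the run ended at the policy node with `needs_human` set.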

When LangGraph is the wrong tool

Single-shot LLM calls, simple RAG Q&A, or linear ETL don't need a graph framework. LangGraph earns its complexity when you have branching, retries, tool loops, and human approvals in one workflow. Over-engineering a straight pipeline adds moving parts without reliability gains.
