If you searched for OpenTelemetry LLM monitoring, you are likely trying to solve two problems at once:
- A standards-based telemetry pipeline you can trust for the next 2-3 years.
- Fast debugging of failed and expensive AI executions right now.
The good news: OTEL can deliver both. The bad news: most teams that "add OTEL" to their AI stack end up with shallow traces that cannot answer production questions. This guide covers the architecture, implementation patterns, and pitfalls that separate useful monitoring from telemetry theater.
Why OTEL for LLM monitoring
OpenTelemetry is the de facto standard for distributed tracing across backend services. Extending it to LLM calls means your AI observability flows through the same pipeline as your HTTP traces, database spans, and queue operations.
This matters for three practical reasons:
- Cross-service correlation. When an agent call triggers a database write that times out, you see both the LLM span and the DB span in one trace. No context-switching between tools.
- Existing infrastructure. If your team already operates OTEL collectors, exporters, and sampling policies, AI monitoring is a configuration change — not a new system.
- Portability. OTEL traces are vendor-neutral. You can switch observability backends without re-instrumenting your application code.
The cost of this approach: OTEL is lower-level than framework-specific SDKs. You need to understand trace and span semantics, attribute conventions, and exporter configuration. This guide covers all three.
Architecture that works in production
A production-grade OTEL LLM monitoring setup has four components:
1. Instrumentation layer — creates spans for each LLM call, tool invocation, and agent step. This can be manual (you create spans in your code), automatic (via an instrumentation library like OpenLLMetry or Traceloop), or SDK-assisted (e.g. Spanora SDK, which creates spans and attaches semantic attributes automatically).
2. OTEL SDK — manages the span lifecycle, context propagation, and batching. In most setups this is the standard @opentelemetry/sdk-trace-node or the Python equivalent opentelemetry-sdk.
3. Exporter — sends completed spans to a backend via OTLP HTTP or gRPC. Configure this once; it handles batching, retries, and compression.
4. Observability backend — receives OTLP data, materializes trace views, and provides search, filtering, and cost analysis. This is where you debug incidents.
The key principle: instrument once, export anywhere. Your application code creates spans with semantic attributes. The exporter sends them to whichever backend your team chooses. If you switch backends next year, only the exporter configuration changes.
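As a concrete sketch, here is what wiring components 2 and 3 together can look like in Node.js. The package names are the standard OTEL ones; the endpoint URL, service name, and `OTEL_API_KEY` environment variable are placeholders you would replace with your backend's values:

```typescript
// tracing.ts — initialize once, before any application code runs
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "my-agent-service", // placeholder service name
  traceExporter: new OTLPTraceExporter({
    // Placeholder endpoint: point this at your backend's OTLP HTTP ingest URL
    url: "https://collector.example.com/v1/traces",
    headers: { Authorization: `Bearer ${process.env.OTEL_API_KEY ?? ""}` },
  }),
});

sdk.start();
```

Switching backends later means changing only this file; the spans your application code creates stay untouched.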
GenAI semantic conventions — the attributes that matter
The OTEL GenAI Semantic Conventions define standard attribute names for LLM telemetry. Using these conventions means your traces are readable by any OTEL-compatible backend without custom parsing.
Here are the attributes you should treat as required, not optional:
| Attribute | Purpose |
|---|---|
| gen_ai.system | LLM provider (e.g. openai, anthropic) |
| gen_ai.request.model | Model name (e.g. gpt-4o, claude-sonnet-4-20250514) |
| gen_ai.operation.name | Operation type (e.g. chat, text_completion) |
| gen_ai.usage.input_tokens | Prompt token count |
| gen_ai.usage.output_tokens | Completion token count |
These five attributes enable model-level cost analysis, provider comparison, and usage trending. Without them, your traces are execution logs — they tell you something happened, but not what it cost or which model produced the output.
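To illustrate why these attributes unlock cost analysis: once a span carries model and token counts, cost is simple arithmetic. A minimal sketch (the per-token prices below are made-up placeholders, not real provider pricing):

```typescript
// Maps model name -> USD price per 1M input/output tokens.
// These numbers are illustrative placeholders; look up current provider pricing.
const PRICING: Record<string, { inputPerM: number; outputPerM: number }> = {
  "gpt-4o": { inputPerM: 2.5, outputPerM: 10.0 },
  "claude-sonnet-4-20250514": { inputPerM: 3.0, outputPerM: 15.0 },
};

// Computes a span's cost from the gen_ai.* attributes a backend would read.
function spanCostUsd(attrs: Record<string, string | number>): number | null {
  const pricing = PRICING[attrs["gen_ai.request.model"] as string];
  if (!pricing) return null; // unknown model: cost cannot be attributed
  const input = Number(attrs["gen_ai.usage.input_tokens"] ?? 0);
  const output = Number(attrs["gen_ai.usage.output_tokens"] ?? 0);
  return (input * pricing.inputPerM + output * pricing.outputPerM) / 1_000_000;
}
```

Note that the function returns null for models missing from the price table; that is exactly the failure mode you hit when gen_ai.request.model is absent from a span.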
Additional attributes for agent observability
For multi-step agent runs, add these to get execution-level visibility:
| Attribute | Purpose |
|---|---|
| gen_ai.agent.name | Agent implementation identifier |
| gen_ai.conversation.id | Session grouping for multi-turn agents |
| gen_ai.tool.name | Tool being invoked |
For user and cost attribution, these are not yet part of the GenAI spec but are recognized by backends like Spanora:
| Attribute | Purpose |
|---|---|
| spanora.user.id | Per-user impact and usage analysis |
| spanora.org.id | Tenant-level cost reporting |
| spanora.llm.cost.usd | Explicit cost value (no gen_ai cost attribute exists) |
| spanora.tool.status | Tool execution result (success, error, timeout) |
Implementation: instrumenting an LLM call
Here is what a properly instrumented LLM call looks like using the OpenTelemetry Node.js SDK:
```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";
import OpenAI from "openai";

const openai = new OpenAI();
const tracer = trace.getTracer("my-agent");

async function callLLM(prompt: string, model: string) {
  return tracer.startActiveSpan("llm.chat", async (span) => {
    // Attach request attributes before the call so they survive even if it fails
    span.setAttributes({
      "gen_ai.system": "openai",
      "gen_ai.request.model": model,
      "gen_ai.operation.name": "chat",
    });
    try {
      const response = await openai.chat.completions.create({
        model,
        messages: [{ role: "user", content: prompt }],
      });
      const usage = response.usage;
      if (usage) {
        span.setAttributes({
          "gen_ai.usage.input_tokens": usage.prompt_tokens,
          "gen_ai.usage.output_tokens": usage.completion_tokens,
        });
      }
      return response;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      // End the span on both the success and error paths
      span.end();
    }
  });
}
```
Each LLM call becomes a span with model, token, and provider attributes. The OTEL SDK handles context propagation — if this function is called inside a parent span (e.g. an agent execution span), the LLM span is automatically nested in the trace.
Instrumenting tool calls
Tool invocations should be separate spans nested under the agent execution:
```typescript
// Assumes a tool registry in scope, e.g.
// const tools: Record<string, (args: unknown) => Promise<unknown>>
async function executeTool(toolName: string, args: unknown) {
  return tracer.startActiveSpan(`tool.${toolName}`, async (span) => {
    span.setAttributes({
      "gen_ai.tool.name": toolName,
      "spanora.tool.status": "pending",
    });
    try {
      const result = await tools[toolName](args);
      span.setAttribute("spanora.tool.status", "success");
      return result;
    } catch (error) {
      span.setAttribute("spanora.tool.status", "error");
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}
```
This gives you a trace timeline where each tool call appears as a child span with its own duration, status, and error context. When an agent fails because a tool timed out, you see exactly which tool, when it started, and how long it ran.
Five mistakes that break OTEL LLM monitoring
1. Only tracing the top-level request
A single span for the entire agent execution tells you that something took 12 seconds and failed. It does not tell you that the third LLM call produced a hallucinated tool name, which caused the tool executor to throw, which cascaded into a retry loop.
Fix: Create child spans for every LLM call and tool invocation. The trace timeline should show the full execution graph, not a single bar.
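To make the payoff concrete, here is a toy sketch of the root-cause question a backend answers from a well-formed trace: walk the span tree and surface failing spans, deepest first. The ToySpan type and sample data are invented for illustration, not part of any OTEL API:

```typescript
interface ToySpan {
  id: string;
  parentId: string | null;
  name: string;
  status: "ok" | "error";
}

// Returns the names of failing spans, deepest first. With child spans for
// every LLM call and tool invocation, this points straight at the root cause.
function failingPath(spans: ToySpan[]): string[] {
  const byParent = new Map<string | null, ToySpan[]>();
  for (const s of spans) {
    const siblings = byParent.get(s.parentId) ?? [];
    siblings.push(s);
    byParent.set(s.parentId, siblings);
  }
  const failures: string[] = [];
  const walk = (parentId: string | null) => {
    for (const s of byParent.get(parentId) ?? []) {
      if (s.status === "error") failures.unshift(s.name);
      walk(s.id);
    }
  };
  walk(null);
  return failures;
}
```

With only a top-level span, this function could never return anything more specific than the root; the child spans are what turn "the agent failed" into "tool.search failed inside the agent run".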
2. Skipping token and model attributes
Without gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.request.model, you cannot calculate cost, compare model performance, or identify token usage trends. Your traces become execution logs with timestamps — useful, but not enough for cost analysis or optimization.
Fix: Treat token and model attributes as required fields. If your LLM client returns usage data, always attach it to the span.
3. Mixing instrumentation approaches across teams
Team A emits OTEL traces. Team B uses a proprietary SDK that emits events in a different format. Team C logs to stdout. Now your observability is fragmented across three systems, and cross-team incident debugging requires three tools.
Fix: Agree on OTEL as the telemetry standard across all teams. Different teams can use different instrumentation libraries (OpenLLMetry, Traceloop, manual spans, etc.) as long as they all produce OTEL spans with the same semantic attributes.
4. No flush handling for short-lived processes
The OTEL BatchSpanProcessor batches spans and exports them periodically. In long-running servers this works automatically. In serverless functions, CLI tools, and batch jobs, the process exits before the batch is flushed. Your most valuable traces silently disappear.
Fix: Always call provider.forceFlush() or provider.shutdown() before process exit. In serverless environments, call flush at the end of each invocation.
```typescript
// At the end of a serverless handler or batch job
await traceProvider.forceFlush();
```
5. Not attaching user and session context
You know an agent execution failed, but you do not know which user triggered it, which session it belongs to, or which tenant is affected. Without this context, you cannot prioritize incidents or build usage dashboards.
Fix: Attach user and session attributes to the root span of every execution:
```typescript
rootSpan.setAttributes({
  "spanora.user.id": userId,
  "spanora.org.id": orgId,
  "gen_ai.conversation.id": sessionId,
});
```
Choosing an observability backend
Your OTEL traces need a backend that can parse GenAI attributes, reconstruct execution timelines, and provide cost analysis. Here is how the main options compare:
OTEL-native backends (e.g. Spanora)
Ingest raw OTLP data as the primary path. No SDK required — your existing OTEL pipeline works directly. This is the natural fit for teams that want AI observability as an extension of their existing OTEL infrastructure, not a separate system.
Spanora is purpose-built for this model. It materializes traces from spans, provides span-level cost attribution (the most granular in the category), and recognizes gen_ai.*, openinference.*, and ai.* attribute namespaces out of the box. The optional SDK adds convenience (automatic span creation, cost enrichment, semantic attribute attachment) but is not required — your raw OTEL traces work at full fidelity from day one. For teams implementing the architecture described in this guide, Spanora is the most direct path from OTEL instrumentation to production-grade AI observability.
Framework-native tools (e.g. LangSmith)
Provide the deepest integration for a specific framework (LangChain/LangGraph). Best when your production code is 90%+ one framework and you value framework-native trace semantics over portability. OTEL support exists but is not the primary operating model.
Open-source platforms (e.g. Langfuse)
Offer self-hosting for data residency and compliance requirements, plus built-in prompt management. OTEL-compatible ingestion paths exist alongside native SDKs. Best when governance requirements mandate self-hosting.
Proxy-first tools (e.g. Helicone)
Capture LLM requests at the API boundary without application-level instrumentation. Best for instant visibility and gateway controls (rate limiting, caching). Execution-level trace reconstruction requires additional work since the proxy sees individual requests, not multi-step agent workflows.
Rollout checklist
Before declaring your OTEL LLM monitoring production-ready, verify each item:
- Every LLM call produces a span with gen_ai.system, gen_ai.request.model, and token usage attributes.
- Every tool invocation produces a child span with name and status.
- Root spans carry user, session, and org attributes where applicable.
- forceFlush() is called before process exit in all short-lived runtimes.
- At least 5 real production incidents can be replayed and root-caused faster than before.
- Cost attribution numbers match actual provider billing within an acceptable margin.
- The instrumentation works without any vendor-specific SDK (even if you use one for convenience).
If these checks pass, you have monitoring that will survive framework changes, backend migrations, and team growth.
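The billing-reconciliation check above is easy to automate. A sketch, where the 5% tolerance is an arbitrary example value rather than a recommendation; pick a margin that matches your provider's billing granularity:

```typescript
// Returns true when summed span-level costs agree with the provider's
// invoice within a relative tolerance.
function costsReconcile(
  spanCostsUsd: number[],
  invoiceTotalUsd: number,
  tolerance = 0.05, // 5% is an example value; tune per provider
): boolean {
  const attributed = spanCostsUsd.reduce((sum, c) => sum + c, 0);
  if (invoiceTotalUsd === 0) return attributed === 0;
  return Math.abs(attributed - invoiceTotalUsd) / invoiceTotalUsd <= tolerance;
}
```

Run it periodically against each provider's invoice; a widening gap usually means some code path is creating LLM spans without token attributes.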
Getting started
Start with raw OTEL instrumentation and add convenience wrappers where they reduce boilerplate:
- Raw OTEL integration guide — send standard OTLP traces to Spanora without any SDK.
- LangChain integration — automatic span creation for LangChain Runnables.
- Vercel AI SDK integration — works with the ai package's built-in telemetry.
- OTEL attribute reference — full list of recognized semantic attributes.
The goal is not perfect instrumentation on day one. Start with the minimum attribute set, get traces flowing to a backend, and iterate on signal quality as you debug real incidents.