Learn LangSmith: The Ultimate Guide From Zero to Production (2025 Edition)
You’ve built a cool AI demo. Now what? The hardest part isn’t the first prompt; it’s turning your idea into a reliable, measurable, production-ready application. This guide shows you exactly how to learn LangSmith from zero to production: observe what your app is doing, evaluate quality with data, and monitor it in the wild. It’s beginner-friendly, practical, and focused on shipping.
If you’ve searched for “what is LangSmith,” “how to use LangSmith,” “LangSmith code,” “LangSmith prompting,” or “LangSmith evals,” you’re in the right place.
What LLMs Are, In Plain Language
Large Language Models (LLMs) are AI systems trained on huge amounts of text to generate and understand language. They can draft emails, summarize documents, answer questions, write code, translate, and more. A few simple examples of LLM apps:
- A customer-support bot that answers FAQs from your help center
- A meeting notes summarizer
- A code assistant that explains errors and suggests fixes
- A research helper that cites sources from your knowledge base
Two crucial traits make LLMs both powerful and tricky:
- They’re non-deterministic: the same prompt can produce different outputs.
- They reason and “decide” through text, which means subtle prompt or context changes can shift results.
If you’re new to LLMs and want a quick primer, try Google Cloud’s overview of LLMs (https://cloud.google.com/learn/what-is-an-llm) and the Hugging Face course (https://huggingface.co/learn).
Why LLM Debugging Is Different
Traditional bugs are repeatable. With LLMs, behavior can vary across runs, inputs, and time. Common pain points:
- Hidden failures inside long chains and agent loops
- Unclear prompts or missing context
- Latency spikes or cost blow-ups from too many tool calls
- Silent regressions after small model or prompt changes
You can’t fix what you can’t see. That’s where LLM observability, evaluation, and monitoring come in.
What LangSmith Is, and How It Works
LangSmith is a developer platform for building, debugging, evaluating, and monitoring LLM apps. It’s framework-agnostic, so it works with LangChain, the OpenAI SDK, or your own stack.
Under the hood, LangSmith instruments your app to:
- Trace runs: capture inputs, prompts, tool calls, retrieved docs, outputs, latency, and cost
- Manage prompts: iterate in a playground and version prompts in a library
- Evaluate: run your app against datasets and score results with automated or custom evaluators
- Monitor: track production metrics and set alerts for errors, latency, or quality drift
As an educator and practitioner, I find LangSmith especially good for learning evals, both the built-in options and your own custom metrics.
Quick Start for AI Application Development
Install the SDK:
pip install -U langsmith
Set environment variables (before running your app):
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
export LANGCHAIN_API_KEY=YOUR_KEY
export LANGCHAIN_PROJECT="my-first-project" # optional
Create an account and API key at https://smith.langchain.com. Full docs are here: https://docs.smith.langchain.com/
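Prefer to configure from Python (for example, in a notebook)? The same settings can be set with os.environ before your first traced call; keep the real key out of source control:
import os

# Mirrors the shell exports above; set these before making any traced calls
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "YOUR_KEY"
os.environ["LANGCHAIN_PROJECT"] = "my-first-project"  # optional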
Your First Trace: LLM Debugging Without Guesswork
Minimal “hello world” trace around a function:
from langsmith import traceable

@traceable(name="hello")
def hello(name: str):
    return f"Hi, {name}!"

hello("Ava")
Using LangChain? Tracing turns on automatically with the env vars:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
llm.invoke("Say hi in one sentence.")
Using the OpenAI Python SDK? Wrap the client:
from openai import OpenAI
from langsmith.wrappers import wrap_openai
oai = wrap_openai(OpenAI())
oai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain embeddings in 1–2 lines."}]
)
Open your project in LangSmith and you’ll see a run tree with inputs, prompts, outputs, costs, and latency. You can click into any LLM step, view the exact prompt, and replay it in a playground.
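That run tree comes from nesting: when one traceable function calls another, the inner call shows up as a child run. A minimal sketch (the function names and the fake retrieval step are illustrative):
from langsmith import traceable

@traceable(name="retrieve")
def retrieve(question: str) -> list:
    # Stand-in for a retrieval step; appears as a child run under the pipeline
    return ["LangSmith traces, evaluates, and monitors LLM apps."]

@traceable(name="answer_pipeline")
def answer(question: str) -> str:
    docs = retrieve(question)  # nested call -> child node in the run tree
    return f"Based on {len(docs)} doc(s): {docs[0]}"

answer("What does LangSmith do?")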
Deep-Dive LLM Observability: See What Your App Is Doing
Observability means x-ray vision into your app’s reasoning and behavior:
- Inspect run trees to follow chains, agents, tools, and retrievers
- Filter runs by model, tags, error status, latency, or custom metadata like user_id and app_version
- Sort by latency to find slow steps, check token counts and retries, and spot agent loops
- Open any step in a playground, tweak parameters or prompts, and re-run instantly
- Use the Threads view to analyze multi-turn conversations end-to-end
This turns “it seems slow and flaky” into “step 3 is our hotspot; the retriever returns 20 docs; we can trim to 5.”
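Filtering by custom metadata only works if you attach it at trace time. A minimal sketch, assuming a recent langsmith SDK where @traceable accepts tags and metadata and calls can pass a langsmith_extra override (the user_id, region, and app_version fields are just examples):
from langsmith import traceable

@traceable(name="support_answer", tags=["support"], metadata={"app_version": "1.4.2"})
def support_answer(question: str) -> str:
    # Call your model here; stubbed so the sketch stays self-contained
    return "Use the Reset Password link in settings."

# Per-call metadata you can filter and slice on later
support_answer(
    "How do I reset my password?",
    langsmith_extra={"metadata": {"user_id": "u_123", "region": "eu-west-1"}},
)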
Prompt Engineering and Management Without the Chaos
Prompts are product code. Treat them that way.
- Iterate in the Prompt Playground: paste your prompt, test model variants, adjust temperature and max tokens, compare side-by-side
- Save versions in the Prompt Library and reference them by name in code
- Keep a single source of truth so updates roll out safely across environments
For a hands-on guide to prompt patterns, see the Prompt Engineering Guide (https://www.promptingguide.ai/).
Evals 101: Measure What Matters
You can’t improve what you don’t measure. LangSmith evals let you:
- Create datasets with inputs and reference outputs (ground truth)
- Run your app against the dataset
- Score results with built-in or custom evaluators
- Compare experiments over time to avoid regressions
Create a dataset and example programmatically:
from langsmith import Client
client = Client()
ds = client.create_dataset("QA Quickstart")
client.create_examples(
    inputs=[{"question": "What is LangChain?"}],
    outputs=[{"answer": "A framework for building LLM apps"}],
    dataset_id=ds.id
)
Run automated evaluations:
from langsmith.evaluation import evaluate
def app(inputs):
    # Your production code here
    return {"output": "short answer"}

results = evaluate(
    app,
    data="QA Quickstart",
    evaluators=["qa", "embedding_distance"],  # built-in judges
    experiment_prefix="baseline"
)
Create your own evaluator to match product goals, like “mentions source,” “safe tone,” or “no hallucination.” Custom evaluators are simple Python functions that receive the run and example and return a score and comment.
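Here is a minimal custom-evaluator sketch for a “mentions source” check, assuming your SDK version accepts a plain function that returns a dict with key, score, and comment (the app stub and the heuristic are placeholders):
from langsmith.evaluation import evaluate

def app(inputs):
    # Stand-in for your production code, as in the example above
    return {"output": "LangChain is a framework for building LLM apps (source: docs)"}

def mentions_source(run, example):
    # Receives the run and the reference example; returns a score plus a comment
    answer = (run.outputs or {}).get("output", "")
    has_source = "source" in answer.lower() or "http" in answer.lower()
    return {
        "key": "mentions_source",
        "score": int(has_source),
        "comment": "Cites a source" if has_source else "No source cited",
    }

results = evaluate(
    app,
    data="QA Quickstart",
    evaluators=[mentions_source],
    experiment_prefix="custom-evaluator",
)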
Human Feedback Loops: Turn Real Usage Into Quality
Automate where you can; involve people where you must.
- Send interesting or ambiguous runs to an annotation queue for Subject Matter Experts to review
- Convert reviewed items into dataset examples for future evals
- Capture thumbs up/down and comments in your app UI and attach as feedback to runs
- Close the loop: trace → review → dataset → evals → ship
This is how teams build reliable assistants for regulated and domain-heavy use cases.
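To attach that thumbs up/down from your app UI, a sketch using the SDK's feedback API (the run ID placeholder is whatever ID you captured when serving the response; the key name is your choice):
from langsmith import Client

client = Client()

client.create_feedback(
    run_id="<captured-run-id>",   # the run that produced the answer
    key="user_rating",            # your feedback dimension, e.g. thumbs up/down
    score=1,                      # 1 = thumbs up, 0 = thumbs down
    comment="Solved my issue on the first reply.",
)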
Experiment Management: Compare Models, Prompts, and Settings
Treat changes like experiments.
- Vary one thing at a time: model, prompt, or temperature
- Use the same dataset and evaluators to compare apples-to-apples
- View side-by-side metrics: accuracy, latency (p50/p95), and cost
- Drill into examples that improved or regressed to understand why
Use your dataset as a regression test in CI. Run evaluate() on pull requests and fail the build if key metrics drop beyond your thresholds.
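A minimal CI gate sketch: run the eval, aggregate scores locally, and fail if quality drops. The “QA Quickstart” dataset is the one created above; the containment check and the 0.8 threshold are placeholders for your own metric and bar:
import statistics

from langsmith.evaluation import evaluate

scores = []

def contains_reference(run, example):
    # Crude quality check: does the output contain (or sit inside) the reference answer?
    got = (run.outputs or {}).get("output", "").lower()
    want = (example.outputs or {}).get("answer", "").lower()
    score = int(want in got or got in want)
    scores.append(score)  # keep a local copy so the gate below can aggregate
    return {"key": "contains_reference", "score": score}

def app(inputs):
    # Stand-in for the code path you ship
    return {"output": "LangChain is a framework for building LLM apps"}

evaluate(app, data="QA Quickstart", evaluators=[contains_reference], experiment_prefix="ci")

# Fail the pull request if the aggregate metric drops below your threshold
assert statistics.mean(scores) >= 0.8, "Regression: quality fell below the 0.8 bar"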
LLM Monitoring in Production: Dashboards and Alerts
Once you ship, monitoring keeps you fast and safe.
- Use project dashboards for volume, error rate, latency, cost, and evaluator scores
- Track output quality in production with automated evaluators
- Set alerts for anomaly spikes or drift, and notify via webhooks or PagerDuty
- Slice metrics by cohort or metadata (model_provider, region, app_version) to pinpoint issues
Monitoring is what stops “it’s fine on my laptop” from becoming “incident at 2 a.m.”
Advanced Use Cases and Best Practices
- Debug agentic workflows: use run trees to catch infinite loops, cap max iterations, and log tool inputs/outputs for reproducibility
- A/B test safely: route a percentage of traffic to a new prompt or model, tag runs by variant, and compare with evaluators plus real user feedback
- Collaborate: share projects and prompt libraries; non-engineers can run experiments in the UI and comment on traces
- Keep datasets small but sharp: 25–100 carefully chosen examples usually beat 1,000 synthetic items that don’t reflect reality
- Track p95 latency and per-provider cost alongside quality; speed and spend are product features too
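For the A/B-testing point above, a sketch that routes a slice of traffic to a candidate prompt and tags runs by variant (the prompts and the 10% split are placeholders; per-call tags via langsmith_extra are an assumption about your SDK version):
import random

from langsmith import traceable

CONTROL_PROMPT = "Answer concisely: {question}"
CANDIDATE_PROMPT = "Answer concisely and cite a source: {question}"

@traceable(name="support_answer")
def support_answer(question: str, prompt: str) -> str:
    # Call your model with the chosen prompt; stubbed for the sketch
    return f"(stubbed answer) {question}"

def handle_request(question: str) -> str:
    variant = "candidate" if random.random() < 0.10 else "control"  # 10% rollout
    prompt = CANDIDATE_PROMPT if variant == "candidate" else CONTROL_PROMPT
    # Tag the run so variants can be compared with evaluators and user feedback
    return support_answer(question, prompt, langsmith_extra={"tags": [f"variant:{variant}"]})

handle_request("How do I reset my password?")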
Copy-Paste Starter Recipes
Minimal setup (once):
pip install -U langsmith
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
export LANGCHAIN_API_KEY=YOUR_KEY
First traced LLM call (OpenAI SDK):
from openai import OpenAI
from langsmith.wrappers import wrap_openai
oai = wrap_openai(OpenAI())
oai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "One-sentence summary of LangSmith"}]
)
Trace a function:
from langsmith import traceable
@traceable
def classify(text: str) -> str:
    return "positive" if "love" in text else "neutral"

classify("I love tracing")
Create a dataset and run evals:
from langsmith import Client
from langsmith.evaluation import evaluate
client = Client()
ds = client.create_dataset("Support-FAQs")
client.create_examples(
    inputs=[{"question": "Reset password?"}],
    outputs=[{"answer": "Use the Reset Password link in settings"}],
    dataset_id=ds.id
)

def app(inputs):
    return {"output": "Use the Reset Password link in settings"}

evaluate(
    app,
    data="Support-FAQs",
    evaluators=["qa"],
    experiment_prefix="baseline"
)
Prompt iteration flow:
- Draft in Playground
- Save to Prompt Library as “support-answering:v1”
- Reference by name/version in code to keep a single source of truth
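A sketch of that last step, assuming a recent langsmith SDK that exposes Client.pull_prompt and that the saved prompt takes a question variable (older setups used the LangChain hub.pull helper instead):
from langsmith import Client

client = Client()

# Pull the prompt by name; add the version tag to pin a release, e.g. "support-answering:v1"
prompt = client.pull_prompt("support-answering")

# With LangChain installed this is typically a ChatPromptTemplate;
# "question" is assumed to be the prompt's input variable
messages = prompt.format_messages(question="How do I reset my password?")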
Plain-Language Cheat Sheet
- LLM: Large Language Model. An AI system that reads and writes text.
- Non-deterministic: Same input, different outputs sometimes, like rolling dice.
- LLM observability: Seeing inputs, prompts, tool calls, outputs, cost, and latency.
- Tracing: Recording each step an LLM app takes, so you can replay and debug.
- Evals: Evaluations that score how well your app performs on a dataset.
- Monitoring: Watching your live app’s health over time, like a dashboard in a car.
- Framework-agnostic: Works across libraries (LangChain, OpenAI SDK, custom code).
- SDK: Software tools you import to work with a platform.
- API key: A secret you use so your app can talk to a service.
- Environment variables: Settings you keep outside your code, like keys and endpoints.
- Run and run tree: One execution of your app, plus its child steps in a tree view.
- Metadata: Descriptive tags such as user_id, model, or app_version for filtering and analysis.
- Latency (p50/p95): How fast responses are for the typical request (p50, the median) and for the slowest 5% (p95).
- Ground truth: The correct output you want your app to produce.
- Regression: When a change accidentally makes things worse.
- CI/CD: Automated pipelines for testing and deploying code.
- A/B testing: Trying two versions (A vs. B) to see which wins.
Practical Tips That Save Hours
- Add metadata early: tag runs with user_id, model, region, and app_version for future slicing
- Convert great and bad production runs into dataset examples; real data drives meaningful evals
- Version prompts like code; roll forward and roll back safely
- Keep experiments consistent; change one variable at a time
- Watch quality, latency, and cost together; trade-offs are real
Next Steps and Resources
Don’t wait for a production fire. Instrument your app now:
- Turn on tracing in five minutes
- Create a tiny dataset with 10 real examples
- Run your first experiment and set one alert
Official docs and tutorials: https://docs.smith.langchain.com/
Prompt patterns and best practices: https://www.promptingguide.ai/
LLM fundamentals: https://cloud.google.com/learn/what-is-an-llm and https://huggingface.co/learn
If you want help getting your first eval green, reply with your stack (LangChain, OpenAI SDK, or other) and I’ll map the quickest path to a passing experiment. If this was useful, subscribe to the newsletter for monthly deep dives on LLM debugging, evals, and monitoring.
FAQs
What is LangSmith and why would I use it?
LangSmith is a developer platform to observe, evaluate, and monitor LLM apps. It helps you trace everything, test with datasets, and monitor production so you can ship more reliable applications.
What are LangSmith’s core pillars?
Tracing and Observability; Prompt Management; Datasets, Evals, and Experiments; Monitoring and Alerts.
How do I install and start using LangSmith?
Install: pip install -U langsmith
Then set the tracing environment variables:
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
export LANGCHAIN_API_KEY=YOUR_KEY
Optional: export LANGCHAIN_PROJECT="my-first-project"
How do I get an API key and set it up?
Create an account at smith.langchain.com, generate an API key in the app, and save it securely (e.g., in a .env file or a secret manager). Then use the key in LANGCHAIN_API_KEY.
What is the LangSmith project structure?
Projects: containers for runs/traces of an app; Traces: end-to-end executions and their steps; Datasets: test inputs and reference outputs; Experiments: a run over a dataset with metrics; Annotation Queues: for human review workflows; Prompt Library: versioned prompts you can reuse.
How do I log my first trace?
Decorate a function with @traceable from the LangSmith SDK, enable LangChain tracing (LANGCHAIN_TRACING_V2) for automatic logging, or wrap an OpenAI client to log calls.
How can I view and analyze traces?
In the Traces UI, open a run to see the run tree and view exact prompts/responses. Filter traces by model, tags, latency, errors, or metadata; identify bottlenecks and debug by re-running steps.
What are the Prompt Playground and Prompt Library?
Prompt Playground lets you test prompts, adjust settings, and compare outputs side-by-side. Prompt Library provides versioned prompts to reuse; reference library prompts from code and promote winners to the library.
How do datasets and evaluations work?
Datasets are collections of inputs with ground-truth outputs. Create examples manually or programmatically. Run automated evaluations with built-in evaluators (e.g., "qa", "embedding_distance") or your own custom evaluators. Use results to guide improvements and run more experiments.
How does LangSmith help with production monitoring and collaboration?
Observability dashboards show volume, latency, cost, and error rates. Set alerts for anomalies and performance regressions. Use annotations and human feedback to improve datasets and prompts. Integrate with CI/CD to run evaluations on PRs and compare versions.