Learn LangSmith: The Ultimate Guide From Zero to Production (2025 Edition)
You’ve built a cool AI demo. Now what? The hardest part isn’t the first prompt; it’s turning your idea into a reliable, measurable, production-ready application. This guide shows you exactly how to learn LangSmith from zero to production: observe what your app is doing, evaluate quality with data, and monitor it in the wild. It’s beginner-friendly, practical, and focused on shipping.
If you’ve searched for “what is LangSmith,” “how to use LangSmith,” “LangSmith code,” “LangSmith prompting,” or “LangSmith evals,” you’re in the right place.
What LLMs Are, In Plain Language
Large Language Models (LLMs) are AI systems trained on huge amounts of text to generate and understand language. They can draft emails, summarize documents, answer questions, write code, translate, and more. A few simple examples of LLM apps:
- A customer-support bot that answers FAQs from your help center
- A meeting notes summarizer
- A code assistant that explains errors and suggests fixes
- A research helper that cites sources from your knowledge base
Two crucial traits make LLMs both powerful and tricky:
- They’re non-deterministic: the same prompt can produce different outputs.
- They reason and “decide” through text, which means subtle prompt or context changes can shift results.
If you’re new to LLMs and want a quick primer, try Google Cloud’s overview of LLMs (https://cloud.google.com/learn/what-is-an-llm) and the Hugging Face course (https://huggingface.co/learn).
Why LLM Debugging Is Different
Traditional bugs are repeatable. With LLMs, behavior can vary across runs, inputs, and time. Common pain points:
- Hidden failures inside long chains and agent loops
- Unclear prompts or missing context
- Latency spikes or cost blow-ups from too many tool calls
- Silent regressions after small model or prompt changes
You can’t fix what you can’t see. That’s where LLM observability, evaluation, and monitoring come in.
What LangSmith Is, and How It Works
LangSmith is a developer platform for building, debugging, evaluating, and monitoring LLM apps. It’s framework-agnostic, so it works with LangChain, the OpenAI SDK, or your own stack.
Under the hood, LangSmith instruments your app to:
- Trace runs: capture inputs, prompts, tool calls, retrieved docs, outputs, latency, and cost
- Manage prompts: iterate in a playground and version prompts in a library
- Evaluate: run your app against datasets and score results with automated or custom evaluators
- Monitor: track production metrics and set alerts for errors, latency, or quality drift
As an educator and practitioner, I find LangSmith especially good for learning evals, both the built-in options and your own custom metrics.
Quick Start for AI Application Development
Install the SDK:
pip install -U langsmith
Set environment variables (before running your app):
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
export LANGCHAIN_API_KEY=YOUR_KEY
export LANGCHAIN_PROJECT="my-first-project" # optional
Create an account and API key at https://smith.langchain.com. Full docs are here: https://docs.smith.langchain.com/
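Prefer to configure from Python (for example, in a notebook)? The same settings can be set with os.environ before your first traced call; keep the real key out of source control:
import os

# Mirrors the shell exports above; set these before making any traced calls
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "YOUR_KEY"
os.environ["LANGCHAIN_PROJECT"] = "my-first-project"  # optional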
Your First Trace: LLM Debugging Without Guesswork
Minimal “hello world” trace around a function:
from langsmith import traceable

@traceable(name="hello")
def hello(name: str):
    return f"Hi, {name}!"

hello("Ava")
Using LangChain? Tracing turns on automatically with the env vars:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
llm.invoke("Say hi in one sentence.")
Using the OpenAI Python SDK? Wrap the client:
from openai import OpenAI
from langsmith.wrappers import wrap_openai
oai = wrap_openai(OpenAI())
oai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain embeddings in 1–2 lines."}]
)
Open your project in LangSmith and you’ll see a run tree with inputs, prompts, outputs, costs, and latency. You can click into any LLM step, view the exact prompt, and replay it in a playground.
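That run tree comes from nesting: when one traceable function calls another, the inner call shows up as a child run. A minimal sketch (the function names and the fake retrieval step are illustrative):
from langsmith import traceable

@traceable(name="retrieve")
def retrieve(question: str) -> list:
    # Stand-in for a retrieval step; appears as a child run under the pipeline
    return ["LangSmith traces, evaluates, and monitors LLM apps."]

@traceable(name="answer_pipeline")
def answer(question: str) -> str:
    docs = retrieve(question)  # nested call -> child node in the run tree
    return f"Based on {len(docs)} doc(s): {docs[0]}"

answer("What does LangSmith do?")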
Deep-Dive LLM Observability: See What Your App Is Doing
Observability means x-ray vision into your app’s reasoning and behavior:
- Inspect run trees to follow chains, agents, tools, and retrievers
- Filter runs by model, tags, error status, latency, or custom metadata like user_id and app_version
- Sort by latency to find slow steps, check token counts and retries, and spot agent loops
- Open any step in a playground, tweak parameters or prompts, and re-run instantly
- Use the Threads view to analyze multi-turn conversations end-to-end
This turns “it seems slow and flaky” into “step 3 is our hotspot; the retriever returns 20 docs; we can trim to 5.”
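Filtering by custom metadata only works if you attach it at trace time. A minimal sketch, assuming a recent langsmith SDK where @traceable accepts tags and metadata and calls can pass a langsmith_extra override (the user_id, region, and app_version fields are just examples):
from langsmith import traceable

@traceable(name="support_answer", tags=["support"], metadata={"app_version": "1.4.2"})
def support_answer(question: str) -> str:
    # Call your model here; stubbed so the sketch stays self-contained
    return "Use the Reset Password link in settings."

# Per-call metadata you can filter and slice on later
support_answer(
    "How do I reset my password?",
    langsmith_extra={"metadata": {"user_id": "u_123", "region": "eu-west-1"}},
)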
Prompt Engineering and Management Without the Chaos
Prompts are product code. Treat them that way.
- Iterate in the Prompt Playground: paste your prompt, test model variants, adjust temperature and max tokens, compare side-by-side
- Save versions in the Prompt Library and reference them by name in code
- Keep a single source of truth so updates roll out safely across environments
For a hands-on guide to prompt patterns, see the Prompt Engineering Guide (https://www.promptingguide.ai/).
Evals 101: Measure What Matters
You can’t improve what you don’t measure. LangSmith evals let you:
- Create datasets with inputs and reference outputs (ground truth)
- Run your app against the dataset
- Score results with built-in or custom evaluators
- Compare experiments over time to avoid regressions
Create a dataset and example programmatically:
from langsmith import Client
client = Client()
ds = client.create_dataset("QA Quickstart")
client.create_examples(
    inputs=[{"question": "What is LangChain?"}],
    outputs=[{"answer": "A framework for building LLM apps"}],
    dataset_id=ds.id
)
Run automated evaluations:
from langsmith.evaluation import evaluate
def app(inputs):
    # Your production code here
    return {"output": "short answer"}

results = evaluate(
    app,
    data="QA Quickstart",
    evaluators=["qa", "embedding_distance"],  # built-in judges
    experiment_prefix="baseline"
)
Create your own evaluator to match product goals, like “mentions source,” “safe tone,” or “no hallucination.” Custom evaluators are simple Python functions that receive the run and example and return a score and comment.
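Here is a minimal custom-evaluator sketch for a “mentions source” check, assuming your SDK version accepts a plain function that returns a dict with key, score, and comment (the app stub and the heuristic are placeholders):
from langsmith.evaluation import evaluate

def app(inputs):
    # Stand-in for your production code, as in the example above
    return {"output": "LangChain is a framework for building LLM apps (source: docs)"}

def mentions_source(run, example):
    # Receives the run and the reference example; returns a score plus a comment
    answer = (run.outputs or {}).get("output", "")
    has_source = "source" in answer.lower() or "http" in answer.lower()
    return {
        "key": "mentions_source",
        "score": int(has_source),
        "comment": "Cites a source" if has_source else "No source cited",
    }

results = evaluate(
    app,
    data="QA Quickstart",
    evaluators=[mentions_source],
    experiment_prefix="custom-evaluator",
)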
Human Feedback Loops: Turn Real Usage Into Quality
Automate where you can; involve people where you must.
- Send interesting or ambiguous runs to an annotation queue for Subject Matter Experts to review
- Convert reviewed items into dataset examples for future evals
- Capture thumbs up/down and comments in your app UI and attach as feedback to runs
- Close the loop: trace → review → dataset → evals → ship
This is how teams build reliable assistants for regulated and domain-heavy use cases.
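To attach that thumbs up/down from your app UI, a sketch using the SDK's feedback API (the run ID placeholder is whatever ID you captured when serving the response; the key name is your choice):
from langsmith import Client

client = Client()

client.create_feedback(
    run_id="<captured-run-id>",   # the run that produced the answer
    key="user_rating",            # your feedback dimension, e.g. thumbs up/down
    score=1,                      # 1 = thumbs up, 0 = thumbs down
    comment="Solved my issue on the first reply.",
)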
Experiment Management: Compare Models, Prompts, and Settings
Treat changes like experiments.
- Vary one thing at a time: model, prompt, or temperature
- Use the same dataset and evaluators to compare apples-to-apples
- View side-by-side metrics: accuracy, latency (p50/p95), and cost
- Drill into examples that improved or regressed to understand why
Use your dataset as a regression test in CI. Run evaluate() on pull requests and fail the build if key metrics drop beyond your thresholds.
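A minimal CI gate sketch: run the eval, aggregate scores locally, and fail if quality drops. The “QA Quickstart” dataset is the one created above; the containment check and the 0.8 threshold are placeholders for your own metric and bar:
import statistics

from langsmith.evaluation import evaluate

scores = []

def contains_reference(run, example):
    # Crude quality check: does the output contain (or sit inside) the reference answer?
    got = (run.outputs or {}).get("output", "").lower()
    want = (example.outputs or {}).get("answer", "").lower()
    score = int(want in got or got in want)
    scores.append(score)  # keep a local copy so the gate below can aggregate
    return {"key": "contains_reference", "score": score}

def app(inputs):
    # Stand-in for the code path you ship
    return {"output": "LangChain is a framework for building LLM apps"}

evaluate(app, data="QA Quickstart", evaluators=[contains_reference], experiment_prefix="ci")

# Fail the pull request if the aggregate metric drops below your threshold
assert statistics.mean(scores) >= 0.8, "Regression: quality fell below the 0.8 bar"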
LLM Monitoring in Production: Dashboards and Alerts
Once you ship, monitoring keeps you fast and safe.
- Use project dashboards for volume, error rate, latency, cost, and evaluator scores
- Track output quality in production with automated evaluators
- Set alerts for anomaly spikes or drift, and notify via webhooks or PagerDuty
- Slice metrics by cohort or metadata (model_provider, region, app_version) to pinpoint issues
Monitoring is what stops “it’s fine on my laptop” from becoming “incident at 2 a.m.”
Advanced Use Cases and Best Practices
- Debug agentic workflows: use run trees to catch infinite loops, cap max iterations, and log tool inputs/outputs for reproducibility
- A/B test safely: route a percentage of traffic to a new prompt or model, tag runs by variant, and compare with evaluators plus real user feedback
- Collaborate: share projects and prompt libraries; non-engineers can run experiments in the UI and comment on traces
- Keep datasets small but sharp: 25–100 carefully chosen examples usually beat 1,000 synthetic items that don’t reflect reality
- Track p95 latency and per-provider cost alongside quality; speed and spend are product features too
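For the A/B-testing point above, a sketch that routes a slice of traffic to a candidate prompt and tags runs by variant (the prompts and the 10% split are placeholders; per-call tags via langsmith_extra are an assumption about your SDK version):
import random

from langsmith import traceable

CONTROL_PROMPT = "Answer concisely: {question}"
CANDIDATE_PROMPT = "Answer concisely and cite a source: {question}"

@traceable(name="support_answer")
def support_answer(question: str, prompt: str) -> str:
    # Call your model with the chosen prompt; stubbed for the sketch
    return f"(stubbed answer) {question}"

def handle_request(question: str) -> str:
    variant = "candidate" if random.random() < 0.10 else "control"  # 10% rollout
    prompt = CANDIDATE_PROMPT if variant == "candidate" else CONTROL_PROMPT
    # Tag the run so variants can be compared with evaluators and user feedback
    return support_answer(question, prompt, langsmith_extra={"tags": [f"variant:{variant}"]})

handle_request("How do I reset my password?")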
Copy-Paste Starter Recipes
Minimal setup (once):
pip install -U langsmith
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
export LANGCHAIN_API_KEY=YOUR_KEY
First traced LLM call (OpenAI SDK):
from openai import OpenAI
from langsmith.wrappers import wrap_openai
oai = wrap_openai(OpenAI())
oai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "One-sentence summary of LangSmith"}]
)
Trace a function:
from langsmith import traceable
@traceable
def classify(text: str) -> str:
    return "positive" if "love" in text else "neutral"

classify("I love tracing")
Create a dataset and run evals:
from langsmith import Client
from langsmith.evaluation import evaluate
client = Client()
ds = client.create_dataset("Support-FAQs")
client.create_examples(
    inputs=[{"question": "Reset password?"}],
    outputs=[{"answer": "Use the Reset Password link in settings"}],
    dataset_id=ds.id
)

def app(inputs):
    return {"output": "Use the Reset Password link in settings"}

evaluate(
    app,
    data="Support-FAQs",
    evaluators=["qa"],
    experiment_prefix="baseline"
)
Prompt iteration flow:
- Draft in Playground
- Save to Prompt Library as “support-answering:v1”
- Reference by name/version in code to keep a single source of truth
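A sketch of that last step, assuming a recent langsmith SDK that exposes Client.pull_prompt and that the saved prompt takes a question variable (older setups used the LangChain hub.pull helper instead):
from langsmith import Client

client = Client()

# Pull the prompt by name; add the version tag to pin a release, e.g. "support-answering:v1"
prompt = client.pull_prompt("support-answering")

# With LangChain installed this is typically a ChatPromptTemplate;
# "question" is assumed to be the prompt's input variable
messages = prompt.format_messages(question="How do I reset my password?")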
Plain-Language Cheat Sheet
- LLM: Large Language Model. An AI system that reads and writes text.
- Non-deterministic: Same input, different outputs sometimes, like rolling dice.
- LLM observability: Seeing inputs, prompts, tool calls, outputs, cost, and latency.
- Tracing: Recording each step an LLM app takes, so you can replay and debug.
- Evals: Evaluations that score how well your app performs on a dataset.
- Monitoring: Watching your live app’s health over time, like a dashboard in a car.
- Framework-agnostic: Works across libraries (LangChain, OpenAI SDK, custom code).
- SDK: Software tools you import to work with a platform.
- API key: A secret you use so your app can talk to a service.
- Environment variables: Settings you keep outside your code, like keys and endpoints.
- Run and run tree: One execution of your app, plus its child steps in a tree view.
- Metadata: Descriptive tags such as user_id, model, or app_version for filtering and analysis.
- Latency (p50/p95): How fast responses are for the typical request (p50, the median) and for the slowest 5% (p95).
- Ground truth: The correct output you want your app to produce.
- Regression: When a change accidentally makes things worse.
- CI/CD: Automated pipelines for testing and deploying code.
- A/B testing: Trying two versions (A vs. B) to see which wins.
Practical Tips That Save Hours
- Add metadata early: tag runs with user_id, model, region, and app_version for future slicing
- Convert great and bad production runs into dataset examples; real data drives meaningful evals
- Version prompts like code; roll forward and roll back safely
- Keep experiments consistent; change one variable at a time
- Watch quality, latency, and cost together; trade-offs are real
Next Steps and Resources
Don’t wait for a production fire. Instrument your app now:
- Turn on tracing in five minutes
- Create a tiny dataset with 10 real examples
- Run your first experiment and set one alert
Official docs and tutorials: https://docs.smith.langchain.com/
Prompt patterns and best practices: https://www.promptingguide.ai/
LLM fundamentals: https://cloud.google.com/learn/what-is-an-llm and https://huggingface.co/learn
If you want help getting your first eval green, reply with your stack (LangChain, OpenAI SDK, or other) and I’ll map the quickest path to a passing experiment. If this was useful, subscribe to the newsletter for monthly deep dives on LLM debugging, evals, and monitoring.
FAQs
What is LangSmith and why would I use it?
LangSmith is a developer platform to observe, evaluate, and monitor LLM apps. It helps you trace everything, test with datasets, and monitor production so you can ship more reliable applications.
What are LangSmith’s core pillars?
Tracing and Observability; Prompt Management; Datasets, Evals, and Experiments; Monitoring and Alerts.
How do I install and start using LangSmith?
Install: pip install -U langsmith
Then set the tracing environment variables:
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
export LANGCHAIN_API_KEY=YOUR_KEY
Optional: export LANGCHAIN_PROJECT="my-first-project"
How do I get an API key and set it up?
Create an account at smith.langchain.com, generate an API key in the app, and save it securely (e.g., in a .env file or a secret manager). Then use the key in LANGCHAIN_API_KEY.
What is the LangSmith project structure?
Projects: containers for runs/traces of an app; Traces: end-to-end executions and their steps; Datasets: test inputs and reference outputs; Experiments: a run over a dataset with metrics; Annotation Queues: for human review workflows; Prompt Library: versioned prompts you can reuse.
How do I log my first trace?
Decorate a function with @traceable from the LangSmith SDK, enable LangChain tracing (LANGCHAIN_TRACING_V2) for automatic logging, or wrap an OpenAI client to log calls.
How can I view and analyze traces?
In the Traces UI, open a run to see the run tree and view exact prompts/responses. Filter traces by model, tags, latency, errors, or metadata; identify bottlenecks and debug by re-running steps.
What are the Prompt Playground and Prompt Library?
Prompt Playground lets you test prompts, adjust settings, and compare outputs side-by-side. Prompt Library provides versioned prompts to reuse; reference library prompts from code and promote winners to the library.
How do datasets and evaluations work?
Datasets are collections of inputs with ground-truth outputs. Create examples manually or programmatically. Run automated evaluations with built-in evaluators (e.g., "qa", "embedding_distance") or your own custom evaluators. Use results to guide improvements and run more experiments.
How does LangSmith help with production monitoring and collaboration?
Observability dashboards show volume, latency, cost, and error rates. Set alerts for anomalies and performance regressions. Use annotations and human feedback to improve datasets and prompts. Integrate with CI/CD to run evaluations on PRs and compare versions.