> **Note/codex prompt:** The promise here seems to be that I can do prompt engineering carefully to optimize performance without needing access to / having to intervene on the weights themselves. Is this basically correct? Read some of the docs and choice examples in the tutorial folder and walk me through how I'd use this differently from `client.responses.create` on an API `Client` object. Go from basic text generation to task-specific uses (e.g. structured generation), to prompt-engineered setups, to as close to fine-tuning as possible. What is the output of a DSPy optimization loop, and how would I use it downstream when I use a completions API?
This guide maps your current setup (local llama-server + OpenAI-compatible API) to a practical DSPy workflow. It is based on patterns in your local tutorial folder, especially:

- `tutorials/rag/index.ipynb` (`MIPROv2.compile(...)`, save/load)
- `tutorials/classification_finetuning/index.ipynb` (`BootstrapFinetune`)
- `tutorials/output_refinement/best-of-n-and-refine.md` (`BestOfN`, `Refine`)
- `tutorials/saving/index.md` (program state and whole-program persistence)

With raw API calls (`client.responses.create`), you usually do:

- build a prompt string by hand,
- call the endpoint,
- parse free-form text yourself.

With DSPy, you define:

- a typed task interface (a `Signature`), and
- a module that runs it (`Predict`, `ChainOfThought`, or a custom `Module`).

Yes, your framing is basically correct: DSPy can improve quality without touching weights by optimizing prompts/program state (instructions, demos, traces, search behavior). Weight updates are a separate, optional step (`BootstrapFinetune`) when your backend supports training.
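For contrast, here is the raw-API baseline the rest of this guide replaces. A minimal sketch: it assumes your llama-server exposes the chat-completions route (llama.cpp servers generally do; the newer Responses API is often not available locally, so `chat.completions.create` stands in for `responses.create`):

```python
from openai import OpenAI

# Raw baseline: hand-built prompt in, free-form text out; parsing is on you.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="local")
resp = client.chat.completions.create(
    model="gpt-oss-20b-mxfp4",
    messages=[{"role": "user", "content": "What is speculative decoding?"}],
    temperature=0.0,
)
print(resp.choices[0].message.content)
```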
This assumes your server is already up at `http://127.0.0.1:8080/v1`: an OpenAI-compatible completions endpoint serving a GPT-OSS-20B model.

```python
import os
import dspy
BASE_URL = os.getenv("OPENAI_BASE_URL", "http://127.0.0.1:8080/v1")
API_KEY = os.getenv("OPENAI_API_KEY", "local")
MODEL = os.getenv("OPENAI_MODEL", "gpt-oss-20b-mxfp4")
lm = dspy.LM(
    model=f"openai/{MODEL}",
    api_base=BASE_URL,
    api_key=API_KEY,
    temperature=0.0,
    max_tokens=512,
)
dspy.configure(lm=lm)
```
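A quick smoke test is worth doing before anything else; `dspy.LM` objects are directly callable and return a list of completions (a sketch, assuming a recent DSPy version):

```python
# Should print a short completion if the local server is reachable.
print(lm("Say 'ready' if you can hear me.")[0])
```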
Equivalent of "prompt in, text out", but with an explicit interface:

```python
import dspy
qa = dspy.Predict("question -> answer")

questions = [
    "What is speculative decoding?",
    "When does sparse checkout help?",
]

for q in questions:
    pred = qa(question=q)
    print(f"Q: {q}")
    print(f"A: {pred.answer}\n")
```
This is where DSPy starts to feel very different from raw completions.

```python
from typing import Literal

import dspy

class TicketTriageSig(dspy.Signature):
    """Triage a support ticket."""

    ticket_text: str = dspy.InputField()
    priority: Literal["low", "medium", "high"] = dspy.OutputField()
    team: Literal["billing", "product", "infra"] = dspy.OutputField()
    rationale: str = dspy.OutputField()

triage = dspy.ChainOfThought(TicketTriageSig)
pred = triage(ticket_text="Production checkout failed after payment capture, users are blocked.")
print(pred.priority, pred.team)
print(pred.rationale)
```
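Because `priority` and `team` come back constrained to the declared literals, downstream code can branch on them without defensive string matching. A small usage sketch (the routing table is illustrative, not from the tutorials):

```python
# Hypothetical routing table keyed by the Literal values above.
routing = {"billing": "#billing-oncall", "product": "#product-triage", "infra": "#infra-pager"}

if pred.priority == "high":
    print(f"page {routing[pred.team]}: {pred.rationale}")
```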
Instead of a single endpoint call, you compose a program.

```python
from typing import Literal

import dspy

class ClassifyIssue(dspy.Signature):
    """Classify a support issue."""

    text: str = dspy.InputField()
    issue_type: Literal["bug", "question", "outage"] = dspy.OutputField()
    severity: Literal["low", "medium", "high"] = dspy.OutputField()

class DraftReply(dspy.Signature):
    """Draft a concise response to the user."""

    text: str = dspy.InputField()
    issue_type: str = dspy.InputField()
    severity: str = dspy.InputField()
    reply: str = dspy.OutputField()

class HelpdeskAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classify = dspy.ChainOfThought(ClassifyIssue)
        self.reply = dspy.Predict(DraftReply)

    def forward(self, text: str):
        c = self.classify(text=text)
        r = self.reply(text=text, issue_type=c.issue_type, severity=c.severity)
        return dspy.Prediction(
            issue_type=c.issue_type,
            severity=c.severity,
            reply=r.reply,
        )

agent = HelpdeskAgent()
print(agent(text="The dashboard has been down for 20 minutes.").reply)
```
This is the key DSPy loop:

```python
import dspy

class Triage(dspy.Signature):
    """Triage ticket into priority/team."""

    ticket_text: str = dspy.InputField()
    priority: str = dspy.OutputField()
    team: str = dspy.OutputField()

program = dspy.ChainOfThought(Triage)

trainset = [
    dspy.Example(ticket_text="Payment charged twice", priority="high", team="billing").with_inputs("ticket_text"),
    dspy.Example(ticket_text="How do I reset password?", priority="low", team="product").with_inputs("ticket_text"),
    dspy.Example(ticket_text="API 500 errors in production", priority="high", team="infra").with_inputs("ticket_text"),
]

def triage_metric(example, pred, trace=None):
    return float(
        pred.priority.strip().lower() == example.priority
        and pred.team.strip().lower() == example.team
    )

def score(program_to_eval, dataset):
    vals = [triage_metric(ex, program_to_eval(**ex.inputs())) for ex in dataset]
    return sum(vals) / len(vals)

baseline_score = score(program, trainset)
print("Baseline score:", baseline_score)

optimizer = dspy.BootstrapFewShot(
    metric=triage_metric,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
    max_rounds=1,
)
optimized_program = optimizer.compile(program, trainset=trainset)

optimized_score = score(optimized_program, trainset)
print("Optimized score:", optimized_score)
optimized_program.save("triage_optimized.json")
```

You can replace `BootstrapFewShot` with `MIPROv2` when you want a stronger, more expensive search (as shown in `tutorials/rag/index.ipynb`).
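The swap is mechanical. A sketch, assuming `MIPROv2`'s `auto` presets as used in the RAG tutorial:

```python
# Same program and metric; MIPROv2 searches over instructions and demos.
optimizer = dspy.MIPROv2(metric=triage_metric, auto="light")
optimized_program = optimizer.compile(program, trainset=trainset)
```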
For reliability at inference time (without re-training), use `dspy.BestOfN` or `dspy.Refine`. Pattern:

```python
import dspy

def one_sentence(args, pred):
    return 1.0 if len(pred.answer.split(".")) <= 2 else 0.0

robust_qa = dspy.Refine(
    module=dspy.ChainOfThought("question -> answer"),
    N=3,
    reward_fn=one_sentence,
    threshold=1.0,
)
print(robust_qa(question="Explain KV cache in one sentence.").answer)
```
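`dspy.BestOfN` is the simpler sibling: it samples up to N rollouts and keeps the highest-reward one, without `Refine`'s feedback step. A sketch reusing the same reward (constructor args assumed to mirror `tutorials/output_refinement/best-of-n-and-refine.md`):

```python
best_qa = dspy.BestOfN(
    module=dspy.ChainOfThought("question -> answer"),
    N=3,
    reward_fn=one_sentence,
    threshold=1.0,
)
print(best_qa(question="Explain KV cache in one sentence.").answer)
```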
For prompt-only optimizers (`BootstrapFewShot`, `MIPROv2`, `GEPA`), the output is an optimized DSPy program object. Persist it with:

```python
optimized_program.save("optimized_program.json")
```

That saved state includes the program's prompt state (instructions/signature, demos, and related metadata). You then reload it into a program with the same shape:

```python
reloaded = dspy.ChainOfThought(Triage)
reloaded.load("triage_optimized.json")
print(reloaded(ticket_text="Users cannot log in after deploy."))
```
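`tutorials/saving/index.md` also covers whole-program persistence, which stores the module architecture alongside its state so you don't need the class definition at load time. A sketch, assuming the `save_program` flag from recent DSPy versions:

```python
# Saves architecture + state to a directory instead of state-only JSON.
optimized_program.save("triage_program/", save_program=True)
loaded = dspy.load("triage_program/")
```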
Best practice: keep using the optimized DSPy program directly. If you must use raw API calls, you can export the saved prompt state and rebuild the messages yourself:

```python
import json

from openai import OpenAI

state = json.load(open("triage_optimized.json"))
# NOTE: the exact key layout varies by DSPy version and module type; inspect
# the JSON and adjust these paths if state is nested under a predictor name.
instructions = state["signature"]["instructions"]
demos = state["demos"]

def to_messages(ticket_text: str):
    messages = [{"role": "system", "content": instructions}]
    for d in demos:
        messages.append({"role": "user", "content": f"Ticket: {d['ticket_text']}"})
        messages.append(
            {
                "role": "assistant",
                "content": f"priority={d['priority']}; team={d['team']}",
            }
        )
    messages.append({"role": "user", "content": f"Ticket: {ticket_text}"})
    return messages
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="local")
resp = client.chat.completions.create(
    model="gpt-oss-20b-mxfp4",
    messages=to_messages("Database timeouts across multiple regions."),
    temperature=0.0,
)
print(resp.choices[0].message.content)
```

This works, but you lose DSPy's typed output parsing and module composition, so prefer the DSPy runtime unless you have a hard integration reason.
For your current llama.cpp inference server, this is the main path:

- optimize prompts/demos with `BootstrapFewShot`, `MIPROv2`, or `GEPA`;
- add `BestOfN`/`Refine` at inference time.

DSPy also has `BootstrapFinetune` for actual weight updates (see `tutorials/classification_finetuning/index.ipynb`):

```python
import dspy
from dspy.clients.lm_local import LocalProvider

dspy.settings.experimental = True

student_lm = dspy.LM(
    model="openai/local:meta-llama/Llama-3.2-1B-Instruct",
    provider=LocalProvider(),
    max_tokens=2000,
)
teacher_lm = dspy.LM("openai/gpt-4o-mini", max_tokens=3000)

student_program = dspy.ChainOfThought("text -> label")
student_program.set_lm(student_lm)
teacher_program = dspy.ChainOfThought("text -> label")
teacher_program.set_lm(teacher_lm)

optimizer = dspy.BootstrapFinetune(num_threads=8)
finetuned_program = optimizer.compile(
    student_program,
    teacher=teacher_program,
    trainset=[dspy.Example(text="...", label="...").with_inputs("text")],
)
```

If your backend cannot train (typical for a plain llama.cpp server), DSPy cannot force a weight update there. In that case, use prompt/program optimization as your "no-weight" improvement layer.
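When `BootstrapFinetune` does succeed, the compiled student is still an ordinary DSPy program whose LM now points at the finetuned weights. A sketch of downstream use, assuming `Module.get_lm()` and the `model` attribute on `dspy.LM` (both present in recent DSPy versions):

```python
# The finetuned program runs like any other DSPy module...
print(finetuned_program(text="Checkout page returns HTTP 500").label)

# ...and exposes the underlying model identifier, which you can call from any
# OpenAI-compatible completions client.
ft_lm = finetuned_program.get_lm()
print(ft_lm.model)
```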
Suggested next steps:

- Keep `dspy_scratch.py` as your experimentation sandbox.
- Wrap multi-step pipelines as a `dspy.Module`.
- Try `BootstrapFewShot` first, then `MIPROv2` if needed.