DSPY_USE_LOOP

Crash course in DSPy

Note/codex prompt: The promise here seems to be that I can do prompt engineering carefully to optimize performance without needing access to, or having to intervene on, the weights themselves. Is this basically correct? Read some of the docs and choice examples in the tutorial folder and walk me through how I'd use this differently from client.responses.create on an API client object. Go from basic text generation to task-specific uses (e.g. structured generation), to prompt-engineered setups, to as close to fine-tuning as possible. What is the output of a DSPy optimization loop, and how would I use it downstream when I use a completions API?

This guide maps your current setup (local llama-server + OpenAI-compatible API) to a practical DSPy workflow.

It is based on patterns in your local tutorial folder, especially tutorials/rag/index.ipynb (MIPROv2 prompt optimization) and tutorials/classification_finetuning/index.ipynb (BootstrapFinetune weight updates).

1. Core idea: how DSPy differs from client.responses.create

With raw API calls, you usually do:

  1. Build a prompt manually.
  2. Call model endpoint.
  3. Parse output yourself.
  4. Repeat with prompt tweaks.

With DSPy, you define:

  1. A typed task interface (Signature).
  2. A program (Predict, ChainOfThought, or custom Module).
  3. A metric + train/dev examples.
  4. An optimizer that rewrites instructions and demos for that program.

Yes, your framing is basically correct: DSPy can improve quality without touching weights by optimizing prompts/program state (instructions, demos, traces, search behavior). Weight updates are a separate optional step (BootstrapFinetune) when your backend supports training.
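For contrast, here is the manual loop in raw API form, as a minimal sketch against the local endpoint described in section 2 (llama-server speaks chat completions, so that route is used rather than client.responses.create):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="local")

# 1. Build the prompt by hand.
prompt = "Classify this ticket's priority (low/medium/high): Payment charged twice."

# 2. Call the endpoint.
resp = client.chat.completions.create(
    model="gpt-oss-20b-mxfp4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)

# 3. Parse the output yourself and hope the format holds.
print(resp.choices[0].message.content)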

2. Local LM setup (your llama.cpp server)

Assumes your server is already up at http://127.0.0.1:8080/v1. This is an OpenAI-compatible completions endpoint serving a GPT-OSS-20B model.

import os
import dspy

BASE_URL = os.getenv("OPENAI_BASE_URL", "http://127.0.0.1:8080/v1")
API_KEY = os.getenv("OPENAI_API_KEY", "local")
MODEL = os.getenv("OPENAI_MODEL", "gpt-oss-20b-mxfp4")

lm = dspy.LM(
    model=f"openai/{MODEL}",
    api_base=BASE_URL,
    api_key=API_KEY,
    temperature=0.0,
    max_tokens=512,
)
dspy.configure(lm=lm)
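dspy.LM instances are directly callable, which makes for a quick smoke test against the server:

print(lm("Reply with one word: ready"))  # expect a list with a single completion string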

3. Basic text generation loop

Equivalent of “prompt in, text out”, but with an explicit interface.

import dspy

qa = dspy.Predict("question -> answer")

questions = [
    "What is speculative decoding?",
    "When does sparse checkout help?",
]

for q in questions:
    pred = qa(question=q)
    print(f"Q: {q}")
    print(f"A: {pred.answer}\n")

4. Structured generation (typed outputs)

This is where DSPy starts to feel very different from raw completions.

from typing import Literal
import dspy

class TicketTriageSig(dspy.Signature):
    """Triage a support ticket."""
    ticket_text: str = dspy.InputField()
    priority: Literal["low", "medium", "high"] = dspy.OutputField()
    team: Literal["billing", "product", "infra"] = dspy.OutputField()
    rationale: str = dspy.OutputField()

triage = dspy.ChainOfThought(TicketTriageSig)

pred = triage(ticket_text="Production checkout failed after payment capture, users are blocked.")
print(pred.priority, pred.team)
print(pred.rationale)
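DSPy's default chat adapter formats the typed fields into the prompt and parses them back out of plain text. If your backend is more reliable with JSON output, you can swap in the JSON adapter; a sketch, and whether it helps depends on the model and server:

dspy.configure(lm=lm, adapter=dspy.JSONAdapter())
pred = triage(ticket_text="Invoice emailed twice to the same customer.")
print(pred.priority, pred.team)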

5. Task-specific program (multiple LM calls + control flow)

Instead of a single endpoint call, you compose a program.

from typing import Literal
import dspy

class ClassifyIssue(dspy.Signature):
    """Classify support issue."""
    text: str = dspy.InputField()
    issue_type: Literal["bug", "question", "outage"] = dspy.OutputField()
    severity: Literal["low", "medium", "high"] = dspy.OutputField()

class DraftReply(dspy.Signature):
    """Draft a concise response to user."""
    text: str = dspy.InputField()
    issue_type: str = dspy.InputField()
    severity: str = dspy.InputField()
    reply: str = dspy.OutputField()

class HelpdeskAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classify = dspy.ChainOfThought(ClassifyIssue)
        self.reply = dspy.Predict(DraftReply)

    def forward(self, text: str):
        c = self.classify(text=text)
        r = self.reply(text=text, issue_type=c.issue_type, severity=c.severity)
        return dspy.Prediction(
            issue_type=c.issue_type,
            severity=c.severity,
            reply=r.reply,
        )

agent = HelpdeskAgent()
print(agent(text="The dashboard has been down for 20 minutes.").reply)
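The optimizers in section 6 operate on the predictors inside a module; you can enumerate them to see exactly what will get tuned:

for name, predictor in agent.named_predictors():
    print(name, predictor.signature)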

6. Prompt optimization loop (no weight updates)

This is the key DSPy loop:

  1. Build trainset examples.
  2. Define metric.
  3. Compile with optimizer.
  4. Compare baseline vs optimized.
  5. Save optimized state.

import dspy

class Triage(dspy.Signature):
    """Triage ticket into priority/team."""
    ticket_text: str = dspy.InputField()
    priority: str = dspy.OutputField()
    team: str = dspy.OutputField()

program = dspy.ChainOfThought(Triage)

trainset = [
    dspy.Example(ticket_text="Payment charged twice", priority="high", team="billing").with_inputs("ticket_text"),
    dspy.Example(ticket_text="How do I reset password?", priority="low", team="product").with_inputs("ticket_text"),
    dspy.Example(ticket_text="API 500 errors in production", priority="high", team="infra").with_inputs("ticket_text"),
]

def triage_metric(example, pred, trace=None):
    return float(
        pred.priority.strip().lower() == example.priority
        and pred.team.strip().lower() == example.team
    )

def score(program_to_eval, dataset):
    vals = [triage_metric(ex, program_to_eval(**ex.inputs())) for ex in dataset]
    return sum(vals) / len(vals)

baseline_score = score(program, trainset)  # toy setup: scores on trainset; use a held-out dev set in practice
print("Baseline score:", baseline_score)

optimizer = dspy.BootstrapFewShot(
    metric=triage_metric,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
    max_rounds=1,
)

optimized_program = optimizer.compile(program, trainset=trainset)
optimized_score = score(optimized_program, trainset)
print("Optimized score:", optimized_score)

optimized_program.save("triage_optimized.json")

You can replace BootstrapFewShot with MIPROv2 when you want a stronger, more expensive search (as shown in tutorials/rag/index.ipynb).
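A minimal MIPROv2 sketch (argument names vary a bit across DSPy versions; auto="light" keeps the search cheap):

optimizer = dspy.MIPROv2(metric=triage_metric, auto="light")
optimized_program = optimizer.compile(program, trainset=trainset)  # older versions may also want requires_permission_to_run=False
optimized_program.save("triage_miprov2.json")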

7. Inference-time prompt engineering extras

For reliability at inference time (without re-training), use dspy.Refine: it re-runs a module up to N times, scores each attempt with a reward function, feeds failure feedback back in between attempts, and returns the first prediction that clears the threshold.

Pattern:

import dspy

def one_sentence(args, pred):
    # Reward 1.0 iff the answer contains at most one period, i.e. roughly one sentence.
    return 1.0 if len(pred.answer.split(".")) <= 2 else 0.0

robust_qa = dspy.Refine(
    module=dspy.ChainOfThought("question -> answer"),
    N=3,
    reward_fn=one_sentence,
    threshold=1.0,
)

print(robust_qa(question="Explain KV cache in one sentence.").answer)
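If you just want best-of-N sampling without Refine's intermediate feedback step, dspy.BestOfN has the same shape:

best_qa = dspy.BestOfN(
    module=dspy.ChainOfThought("question -> answer"),
    N=3,
    reward_fn=one_sentence,
    threshold=1.0,
)
print(best_qa(question="Explain speculative decoding in one sentence.").answer)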

8. What an optimization loop outputs

For prompt-only optimizers (BootstrapFewShot, MIPROv2, GEPA), the output is an optimized DSPy program object: the same program class, with tuned instructions and selected demos attached to its predictors.

Persist it with:

optimized_program.save("optimized_program.json")

That saved file contains the program's prompt state: instructions/signature, demos, and related metadata. You then:

  1. Load it back into the same program class.
  2. Keep running normal inference with your completions/chat endpoint.

reloaded = dspy.ChainOfThought(Triage)
reloaded.load("triage_optimized.json")
print(reloaded(ticket_text="Users cannot log in after deploy."))

9. Using optimized DSPy output downstream with raw completions API

Best practice: keep using the optimized DSPy program directly.

If you must use raw API calls, you can export prompt state and rebuild messages yourself:

import json
from openai import OpenAI

state = json.load(open("triage_optimized.json"))
# NOTE: the exact JSON layout depends on your DSPy version and module
# structure; for a ChainOfThought program the prompt state may be nested
# under the inner predictor's key (e.g. "predict"). Inspect the file first.
predict_state = state.get("predict", state)
instructions = predict_state["signature"]["instructions"]
demos = predict_state["demos"]

def to_messages(ticket_text: str):
    messages = [{"role": "system", "content": instructions}]
    for d in demos:
        messages.append({"role": "user", "content": f"Ticket: {d['ticket_text']}"})
        messages.append(
            {
                "role": "assistant",
                "content": f"priority={d['priority']}; team={d['team']}",
            }
        )
    messages.append({"role": "user", "content": f"Ticket: {ticket_text}"})
    return messages

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="local")
resp = client.chat.completions.create(
    model="gpt-oss-20b-mxfp4",
    messages=to_messages("Database timeouts across multiple regions."),
    temperature=0.0,
)
print(resp.choices[0].message.content)

This works, but you lose DSPy’s typed parsing and module composition, so prefer DSPy runtime unless you have a hard integration reason.

10. As close to fine-tuning as possible

Prompt/program optimization first

For your current llama.cpp inference server, this is the main path: optimize instructions and demos as in section 6, save the program state, and keep serving against the same endpoint.

Weight updates when the backend supports training

DSPy also has BootstrapFinetune (see tutorials/classification_finetuning/index.ipynb):

import dspy
from dspy.clients.lm_local import LocalProvider

dspy.settings.experimental = True

student_lm = dspy.LM(
    model="openai/local:meta-llama/Llama-3.2-1B-Instruct",
    provider=LocalProvider(),
    max_tokens=2000,
)
teacher_lm = dspy.LM("openai/gpt-4o-mini", max_tokens=3000)

student_program = dspy.ChainOfThought("text -> label")
student_program.set_lm(student_lm)

teacher_program = dspy.ChainOfThought("text -> label")
teacher_program.set_lm(teacher_lm)

optimizer = dspy.BootstrapFinetune(num_threads=8)
finetuned_program = optimizer.compile(
    student_program,
    teacher=teacher_program,
    trainset=[dspy.Example(text="...", label="...").with_inputs("text")],
)

If your backend cannot train (typical for plain llama.cpp server), DSPy cannot force a weight update there. In that case, use prompt/program optimization as your “no-weight” improvement layer.

11. Practical recommendations for your local setup

  1. Keep dspy_scratch.py as your experimentation sandbox.
  2. Move one real task into a custom dspy.Module.
  3. Add a small labeled train/dev set (20-100 examples).
  4. Run BootstrapFewShot first, then MIPROv2 if needed.
  5. Save optimized JSON and load it in your production script.
  6. Only consider finetuning when you move to a trainable backend.