DSPY_USE_LOOP

Crash course in DSPy

Note/codex prompt: The promise here seems to be that I can do prompt engineering carefully to optimize performance without needing access to, or having to intervene on, the weights themselves. Is this basically correct? Read some of the docs and choice examples in the tutorial folder and walk me through how I'd use this differently from client.responses.create on an API client object. Go from basic text generation to task-specific uses (e.g. structured generation), to prompt-engineered setups, to as close to fine-tuning as possible. What is the output of a DSPy optimization loop, and how would I use it downstream when I use a completions API?

This guide maps your current setup (local llama-server + OpenAI-compatible API) to a practical DSPy workflow.

It is based on patterns in your local tutorial folder, especially tutorials/rag/index.ipynb (MIPROv2 prompt optimization) and tutorials/classification_finetuning/index.ipynb (BootstrapFinetune weight updates).

1. Core idea: how DSPy differs from client.responses.create

With raw API calls, you usually do:

  1. Build a prompt manually.
  2. Call model endpoint.
  3. Parse output yourself.
  4. Repeat with prompt tweaks.

With DSPy, you define:

  1. A typed task interface (Signature).
  2. A program (Predict, ChainOfThought, or custom Module).
  3. A metric + train/dev examples.
  4. An optimizer that rewrites instructions and demos for that program.

Yes, your framing is basically correct: DSPy can improve quality without touching weights by optimizing prompts/program state (instructions, demos, traces, search behavior). Weight updates are a separate optional step (BootstrapFinetune) when your backend supports training.
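For contrast, here is the manual loop in raw API form, as a minimal sketch against the local endpoint described in section 2 (llama-server speaks chat completions, so that route is used rather than client.responses.create):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="local")

# 1. Build the prompt by hand.
prompt = "Classify this ticket's priority (low/medium/high): Payment charged twice."

# 2. Call the endpoint.
resp = client.chat.completions.create(
    model="gpt-oss-20b-mxfp4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)

# 3. Parse the output yourself and hope the format holds.
print(resp.choices[0].message.content)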

2. Local LM setup (your llama.cpp server)

Assumes your server is already up at http://127.0.0.1:8080/v1. This is an OpenAI-compatible completions endpoint serving a GPT-OSS-20B model.

import os
import dspy

BASE_URL = os.getenv("OPENAI_BASE_URL", "http://127.0.0.1:8080/v1")
API_KEY = os.getenv("OPENAI_API_KEY", "local")
MODEL = os.getenv("OPENAI_MODEL", "gpt-oss-20b-mxfp4")

lm = dspy.LM(
    model=f"openai/{MODEL}",
    api_base=BASE_URL,
    api_key=API_KEY,
    temperature=0.0,
    max_tokens=512,
)
dspy.configure(lm=lm)
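dspy.LM instances are directly callable, which makes for a quick smoke test against the server:

print(lm("Reply with one word: ready"))  # expect a list with a single completion string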

3. Basic text generation loop

Equivalent of “prompt in, text out”, but with an explicit interface.

import dspy

qa = dspy.Predict("question -> answer")

questions = [
    "What is speculative decoding?",
    "When does sparse checkout help?",
]

for q in questions:
    pred = qa(question=q)
    print(f"Q: {q}")
    print(f"A: {pred.answer}\n")

4. Structured generation (typed outputs)

This is where DSPy starts to feel very different from raw completions.

from typing import Literal
import dspy

class TicketTriageSig(dspy.Signature):
    """Triage a support ticket."""
    ticket_text: str = dspy.InputField()
    priority: Literal["low", "medium", "high"] = dspy.OutputField()
    team: Literal["billing", "product", "infra"] = dspy.OutputField()
    rationale: str = dspy.OutputField()

triage = dspy.ChainOfThought(TicketTriageSig)

pred = triage(ticket_text="Production checkout failed after payment capture, users are blocked.")
print(pred.priority, pred.team)
print(pred.rationale)
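DSPy's default chat adapter formats the typed fields into the prompt and parses them back out of plain text. If your backend is more reliable with JSON output, you can swap in the JSON adapter; a sketch, and whether it helps depends on the model and server:

dspy.configure(lm=lm, adapter=dspy.JSONAdapter())
pred = triage(ticket_text="Invoice emailed twice to the same customer.")
print(pred.priority, pred.team)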

5. Task-specific program (multiple LM calls + control flow)

Instead of a single endpoint call, you compose a program.

from typing import Literal
import dspy

class ClassifyIssue(dspy.Signature):
    """Classify support issue."""
    text: str = dspy.InputField()
    issue_type: Literal["bug", "question", "outage"] = dspy.OutputField()
    severity: Literal["low", "medium", "high"] = dspy.OutputField()

class DraftReply(dspy.Signature):
    """Draft a concise response to user."""
    text: str = dspy.InputField()
    issue_type: str = dspy.InputField()
    severity: str = dspy.InputField()
    reply: str = dspy.OutputField()

class HelpdeskAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classify = dspy.ChainOfThought(ClassifyIssue)
        self.reply = dspy.Predict(DraftReply)

    def forward(self, text: str):
        c = self.classify(text=text)
        r = self.reply(text=text, issue_type=c.issue_type, severity=c.severity)
        return dspy.Prediction(
            issue_type=c.issue_type,
            severity=c.severity,
            reply=r.reply,
        )

agent = HelpdeskAgent()
print(agent(text="The dashboard has been down for 20 minutes.").reply)
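The optimizers in section 6 operate on the predictors inside a module; you can enumerate them to see exactly what will get tuned:

for name, predictor in agent.named_predictors():
    print(name, predictor.signature)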

6. Prompt optimization loop (no weight updates)

This is the key DSPy loop:

  1. Build trainset examples.
  2. Define metric.
  3. Compile with optimizer.
  4. Compare baseline vs optimized.
  5. Save optimized state.

import dspy

class Triage(dspy.Signature):
    """Triage ticket into priority/team."""
    ticket_text: str = dspy.InputField()
    priority: str = dspy.OutputField()
    team: str = dspy.OutputField()

program = dspy.ChainOfThought(Triage)

trainset = [
    dspy.Example(ticket_text="Payment charged twice", priority="high", team="billing").with_inputs("ticket_text"),
    dspy.Example(ticket_text="How do I reset password?", priority="low", team="product").with_inputs("ticket_text"),
    dspy.Example(ticket_text="API 500 errors in production", priority="high", team="infra").with_inputs("ticket_text"),
]

def triage_metric(example, pred, trace=None):
    return float(
        pred.priority.strip().lower() == example.priority
        and pred.team.strip().lower() == example.team
    )

def score(program_to_eval, dataset):
    vals = [triage_metric(ex, program_to_eval(**ex.inputs())) for ex in dataset]
    return sum(vals) / len(vals)

baseline_score = score(program, trainset)  # toy setup: scores on trainset; use a held-out dev set in practice
print("Baseline score:", baseline_score)

optimizer = dspy.BootstrapFewShot(
    metric=triage_metric,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
    max_rounds=1,
)

optimized_program = optimizer.compile(program, trainset=trainset)
optimized_score = score(optimized_program, trainset)
print("Optimized score:", optimized_score)

optimized_program.save("triage_optimized.json")

You can replace BootstrapFewShot with MIPROv2 when you want a stronger, more expensive search (as shown in tutorials/rag/index.ipynb).
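A minimal MIPROv2 sketch (argument names vary a bit across DSPy versions; auto="light" keeps the search cheap):

optimizer = dspy.MIPROv2(metric=triage_metric, auto="light")
optimized_program = optimizer.compile(program, trainset=trainset)  # older versions may also want requires_permission_to_run=False
optimized_program.save("triage_miprov2.json")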

7. Inference-time prompt engineering extras

For reliability at inference time (without re-training), use dspy.Refine: it re-runs a module up to N times, scores each attempt with a reward function, feeds failure feedback back in between attempts, and returns the first prediction that clears the threshold.

Pattern:

import dspy

def one_sentence(args, pred):
    # Reward 1.0 iff the answer contains at most one period, i.e. roughly one sentence.
    return 1.0 if len(pred.answer.split(".")) <= 2 else 0.0

robust_qa = dspy.Refine(
    module=dspy.ChainOfThought("question -> answer"),
    N=3,
    reward_fn=one_sentence,
    threshold=1.0,
)

print(robust_qa(question="Explain KV cache in one sentence.").answer)
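If you just want best-of-N sampling without Refine's intermediate feedback step, dspy.BestOfN has the same shape:

best_qa = dspy.BestOfN(
    module=dspy.ChainOfThought("question -> answer"),
    N=3,
    reward_fn=one_sentence,
    threshold=1.0,
)
print(best_qa(question="Explain speculative decoding in one sentence.").answer)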

8. What an optimization loop outputs

For prompt-only optimizers (BootstrapFewShot, MIPROv2, GEPA), the output is an optimized DSPy program object: the same program class, with tuned instructions and selected demos attached to its predictors.

Persist it with:

optimized_program.save("optimized_program.json")

That saved file contains the program's prompt state: instructions/signature, demos, and related metadata. You then:

  1. Load it back into the same program class.
  2. Keep running normal inference with your completions/chat endpoint.

reloaded = dspy.ChainOfThought(Triage)
reloaded.load("triage_optimized.json")
print(reloaded(ticket_text="Users cannot log in after deploy."))

9. Using optimized DSPy output downstream with raw completions API

Best practice: keep using the optimized DSPy program directly.

If you must use raw API calls, you can export prompt state and rebuild messages yourself:

import json
from openai import OpenAI

state = json.load(open("triage_optimized.json"))
# NOTE: the exact JSON layout depends on your DSPy version and module
# structure; for a ChainOfThought program the prompt state may be nested
# under the inner predictor's key (e.g. "predict"). Inspect the file first.
predict_state = state.get("predict", state)
instructions = predict_state["signature"]["instructions"]
demos = predict_state["demos"]

def to_messages(ticket_text: str):
    messages = [{"role": "system", "content": instructions}]
    for d in demos:
        messages.append({"role": "user", "content": f"Ticket: {d['ticket_text']}"})
        messages.append(
            {
                "role": "assistant",
                "content": f"priority={d['priority']}; team={d['team']}",
            }
        )
    messages.append({"role": "user", "content": f"Ticket: {ticket_text}"})
    return messages

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="local")
resp = client.chat.completions.create(
    model="gpt-oss-20b-mxfp4",
    messages=to_messages("Database timeouts across multiple regions."),
    temperature=0.0,
)
print(resp.choices[0].message.content)

This works, but you lose DSPy’s typed parsing and module composition, so prefer DSPy runtime unless you have a hard integration reason.

10. As close to fine-tuning as possible

Prompt/program optimization first

For your current llama.cpp inference server, this is the main path: optimize instructions and demos as in section 6, save the program state, and keep serving against the same endpoint.

Weight updates when the backend supports training

DSPy also has BootstrapFinetune (see tutorials/classification_finetuning/index.ipynb):

import dspy
from dspy.clients.lm_local import LocalProvider

dspy.settings.experimental = True

student_lm = dspy.LM(
    model="openai/local:meta-llama/Llama-3.2-1B-Instruct",
    provider=LocalProvider(),
    max_tokens=2000,
)
teacher_lm = dspy.LM("openai/gpt-4o-mini", max_tokens=3000)

student_program = dspy.ChainOfThought("text -> label")
student_program.set_lm(student_lm)

teacher_program = dspy.ChainOfThought("text -> label")
teacher_program.set_lm(teacher_lm)

optimizer = dspy.BootstrapFinetune(num_threads=8)
finetuned_program = optimizer.compile(
    student_program,
    teacher=teacher_program,
    trainset=[dspy.Example(text="...", label="...").with_inputs("text")],
)

If your backend cannot train (typical for plain llama.cpp server), DSPy cannot force a weight update there. In that case, use prompt/program optimization as your “no-weight” improvement layer.

11. Practical recommendations for your local setup

  1. Keep dspy_scratch.py as your experimentation sandbox.
  2. Move one real task into a custom dspy.Module.
  3. Add a small labeled train/dev set (20-100 examples).
  4. Run BootstrapFewShot first, then MIPROv2 if needed.
  5. Save optimized JSON and load it in your production script.
  6. Only consider finetuning when you move to a trainable backend.