DSPY: A Programming Model for Self-Improving Language Model Pipelines

DSPy is an optimization framework for LLM pipelines. You define a signature, a declarative specification of input and output fields, and the framework optimizes your prompts to produce the desired output.
June 22, 2024

Key Takeaways

  • DSPy introduces a programming model for language models (LMs) that goes beyond traditional prompting. It defines LM pipelines as text transformation graphs, allowing for modular composition and optimization.
  • DSPy utilizes natural language signatures to abstract the desired behavior of each LM call. These signatures, like function prototypes, provide a high-level description of the task rather than specific prompt instructions.
  • DSPy modules encapsulate various prompting techniques like Chain of Thought and ReAct, making them reusable and adaptable across different tasks and LMs.
  • DSPy incorporates a compiler that optimizes pipelines using teleprompters. These general-purpose optimization strategies learn from data to automatically generate effective prompting and finetuning strategies.
  • DSPy demonstrates significant performance improvements in two case studies: math word problems and multi-hop question answering. It outperforms systems relying on hand-crafted prompts and allows for efficient use of smaller LMs.

Introduction

The paper highlights the rapid evolution of language models (LMs) and their increasing use in solving complex tasks. However, it identifies a crucial bottleneck: existing LM pipelines heavily rely on manually crafted "prompt templates." These templates, often lengthy strings of instructions and examples, are brittle, task-specific, and require significant human effort.

The authors argue for a more systematic approach and introduce DSPy, a programming model that aims to automate the process of building and optimizing LM pipelines. DSPy treats LM calls as declarative modules, allowing for the construction of text transformation graphs that can be automatically compiled and optimized.

Background of the Paper

The paper draws inspiration from the success of neural network abstractions like Torch, Theano, and PyTorch, which enabled modular composition and automatic weight optimization. The authors apply similar principles to LM pipelines, proposing a programming model that allows for modular composition and automated optimization of LM calls.

The paper acknowledges the existing research on in-context learning, where LMs demonstrate impressive few-shot learning capabilities. However, it emphasizes the need for systematic optimization of multi-stage pipelines, going beyond individual LM calls.

While existing toolkits like LangChain and LlamaIndex provide pre-packaged components and agents, they still rely on manual prompt engineering. DSPy aims to address this challenge by offering a framework for automatically generating and optimizing prompts for arbitrary pipelines.

DSPy Programming Model

DSPy introduces three key abstractions: signatures, modules, and teleprompters.

Natural Language Signatures Can Abstract Prompting

Instead of relying on free-form string prompts, DSPy utilizes natural language signatures. These signatures define the desired input and output behavior of an LM call. For example:

qa = dspy.Predict("question -> answer")
qa(question="Where is Guaraní spoken?")
# Output: Prediction(answer='Guaraní is spoken mainly in South America.')

Here, the signature "question -> answer" instructs the LM to take a question as input and produce an answer as output.

Signatures offer two primary advantages:

Self-Improvement: They can be automatically compiled into self-improving prompts or finetuning strategies by bootstrapping useful demonstrations.

Pipeline-Adaptability: They can adapt to different pipelines by dynamically adjusting the generated prompts based on the specific context.
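
Beyond the inline string form, a signature can also be written as a class whose fields carry descriptions that DSPy folds into the generated prompt. A brief sketch (the docstring and field descriptions here are illustrative):

import dspy

class GenerateAnswer(dspy.Signature):
  """Answer the question using the provided context."""
  context = dspy.InputField(desc="may contain relevant passages")
  question = dspy.InputField()
  answer = dspy.OutputField(desc="a short factoid answer")

qa = dspy.Predict(GenerateAnswer)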

Parameterized & Templated Modules Can Abstract Prompting Techniques

DSPy modules represent specific prompting techniques and can be reused across different tasks and pipelines. These modules are parameterized, meaning they can learn their desired behavior by being trained on demonstrations.

The core module is Predict, which takes a signature as input and performs the actual LM call. Here's a simplified pseudocode representation of Predict:

class Predict(dspy.Module):
  def __init__(self, signature, **config):
    self.signature = dspy.Signature(signature)
    self.config = config
    self.lm = dspy.ParameterLM(None)  # use the default LM
    self.demonstrations = dspy.ParameterDemonstrations([])

  def forward(self, **kwargs):
    # Resolve the LM, signature, and demonstrations to use for this call.
    lm = get_the_right_lm(self.lm, kwargs)
    signature = get_the_right_signature(self.signature, kwargs)
    demonstrations = get_the_right_demonstrations(self.demonstrations, kwargs)

    # Build the prompt from the signature and demonstrations, then call the LM.
    prompt = signature(demos=demonstrations, **kwargs)
    completions = lm.generate(prompt, **self.config)
    prediction = Prediction.from_completions(completions, signature=signature)

    # While compiling, record a trace of this call for the teleprompter to use.
    if dspy.settings.compiling is not None:
      trace = dict(predictor=self, inputs=kwargs, outputs=prediction)
      dspy.settings.traces.append(trace)

    return prediction

DSPy also includes other built-in modules like ChainOfThought, ProgramOfThought, MultiChainComparison, and ReAct, which generalize popular prompting techniques from the literature.  For instance, ChainOfThought allows the LM to reason step-by-step before providing an answer:

class ChainOfThought(dspy.Module):
  def __init__(self, signature):
    # Modify the signature from '*inputs -> *outputs' to '*inputs -> rationale, *outputs'.
    rationale_field = dspy.OutputField(prefix="Reasoning: Let's think step by step.")
    self.signature = dspy.Signature(signature).prepend_output_field(rationale_field)

    # Declare a sub-module with the modified signature.
    self.predict = dspy.Predict(self.signature)

  def forward(self, **kwargs):
    # Just forward the inputs to the sub-module.
    return self.predict(**kwargs)

These modules are implemented in a few lines of code by expanding the user-defined signature and calling Predict one or more times with appropriate new signatures. This modularity allows for easy customization and extension of the framework.
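
As a quick usage sketch, a ChainOfThought predictor is declared and called just like Predict, and its prediction exposes the generated rationale alongside the answer (reusing the question from the earlier example):

cot = dspy.ChainOfThought("question -> answer")
pred = cot(question="Where is Guaraní spoken?")
print(pred.rationale)  # the step-by-step reasoning produced before the answer
print(pred.answer)     # the final answer field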

Similarly, a simplified ReAct module might look something like this:

class ReAct(dspy.Module):
  def __init__(self, signature, tools=None, max_iters=5):
    # Store the signature for the input/output fields.
    self.signature = dspy.Signature(signature)

    # List of available tools.
    self.tools = tools or []

    # Maximum number of reasoning/acting iterations.
    self.max_iters = max_iters

  def forward(self, **kwargs):
    # Initialize the running context of observations.
    context = []

    # Iterate until the maximum number of iterations is reached or a final answer is found.
    for i in range(self.max_iters):
      # Generate a thought and an action using the LM (e.g., via ChainOfThought).
      thought_action = self.generate_thought_action(context=context, **kwargs)

      # Extract the action type and its input.
      action_type = thought_action.action_type
      action_input = thought_action.action_input

      # If the action is 'Finish', return the final answer.
      if action_type == 'Finish':
        return Prediction(answer=action_input)

      # Execute the specified action using the corresponding tool.
      observation = self.execute_action(action_type, action_input)

      # Update the context with the observation.
      context.append(observation)

    # If no final answer is found, return an empty prediction.
    return Prediction()

  def generate_thought_action(self, context, **kwargs):
    # Generates a "thought" (reasoning) and an "action" (tool call) using the LM.
    # This could use ChainOfThought or another prompting technique.
    ...

  def execute_action(self, action_type, action_input):
    # Executes the chosen action with the corresponding tool and returns an observation.
    # For example, use a retrieval model if the action is 'Search'.
    ...

Here, tools is a list of available tools (for example, a dspy.Retrieve module for search) and max_iters caps the number of thought/action steps; an example instantiation is sketched below.
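
A usage sketch against the simplified module above (the question is purely illustrative):

react = ReAct("question -> answer", tools=[dspy.Retrieve(k=3)], max_iters=5)
prediction = react(question="Which team does the 2015 Diamond Head Classic MVP play for?")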

Optimizers Can Automate Prompting for Arbitrary Pipelines

Teleprompters (now called Optimizers) are the heart of DSPy's optimization process. They take a DSPy program, a training set, and a metric as input and return an optimized program. Different teleprompters employ various optimization strategies.

One example is BootstrapFewShot, which simulates the program on a small training set (even one example can be sufficient) and collects input-output traces for each module. These traces are then used to generate effective few-shot prompts or training data for finetuning. The authors note that even with unreliable LMs, DSPy can efficiently search for solutions in a multi-stage design, allowing iterative bootstrapping that improves the accuracy of the pipeline over time. A simplified version of this teleprompter looks as follows:

class SimplifiedBootstrapFewShot(Teleprompter):
  def __init__(self, metric=None):
    self.metric = metric

  def compile(self, student, trainset, teacher=None):
    teacher = teacher if teacher is not None else student
    compiled_program = student.deepcopy()

    # Step 1. Prepare mappings between the student and teacher Predict modules.
    assert student_and_teacher_have_compatible_predict_modules(student, teacher)
    name2predictor, predictor2name = map_predictors_recursively(student, teacher)

    # Step 2. Bootstrap traces for each Predict module.
    for example in trainset:
      if we_found_enough_bootstrapped_demos(): break

      with dspy.settings.context(compiling=True):
        # Run the teacher program on the example and get its final prediction.
        prediction = teacher(**example.inputs())

        # Get the traces of all internal Predict calls from the teacher program.
        predicted_traces = dspy.settings.traces

        # If the prediction passes the metric, turn each trace into a demonstration.
        if self.metric(example, prediction, predicted_traces):
          for trace in predicted_traces:
            predictor, inputs, outputs = trace['predictor'], trace['inputs'], trace['outputs']
            d = dspy.Example(automated=True, **inputs, **outputs)
            predictor_name = predictor2name[id(predictor)]
            compiled_program[predictor_name].demonstrations.append(d)

    return compiled_program
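
In practice, a teleprompter is driven through its compile method. A minimal usage sketch against the qa predictor defined earlier (the exact-match metric and the trainset of dspy.Example objects are illustrative):

from dspy.teleprompt import BootstrapFewShot

def exact_match(example, prediction, trace=None):
  # Illustrative validation metric: accept a bootstrapped trace only if the answer matches.
  return prediction.answer.strip().lower() == example.answer.strip().lower()

tp = BootstrapFewShot(metric=exact_match)
compiled_qa = tp.compile(qa, trainset=trainset)  # qa = dspy.Predict("question -> answer") from above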

DSPy Compiler

The DSPy compiler utilizes teleprompters to optimize arbitrary pipelines. It operates in three stages:

Candidate Generation

The compiler identifies all unique Predict modules within the program and generates candidate values for their parameters (e.g., demonstrations, instructions).

Parameter Optimization

Once candidate parameters are generated, the compiler employs hyperparameter tuning algorithms like random search or Bayesian optimization to select the best combination of parameters.

Higher-Order Program Optimization

The compiler can also optimize the control flow of the program itself. Examples include creating ensembles of multiple programs or implementing dynamic backtracking logic.
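
As one illustration of higher-order optimization, several compiled candidates can be combined into an ensemble whose outputs are reduced by a majority vote. The sketch below assumes an Ensemble teleprompter and a majority-vote reduction roughly as exposed in the DSPy codebase; treat the exact names as assumptions:

from dspy.teleprompt import Ensemble

# `candidate_programs` is assumed to hold several bootstrapped copies of the same program.
ensembler = Ensemble(reduce_fn=dspy.majority)
ensembled_program = ensembler.compile(candidate_programs)

# Running the ensemble runs each candidate and reduces their outputs to a majority answer.
prediction = ensembled_program(question="Where is Guaraní spoken?")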

Goals of Evaluation

The paper focuses on evaluating DSPy's ability to reduce or eliminate the reliance on hand-crafted prompts. The key goals are:

H1

DSPy should allow for the replacement of hand-crafted prompts with modular units without compromising quality or expressive power.

H2

Parameterizing modules and treating prompting as an optimization problem should make DSPy adaptable to different LMs and potentially outperform expert-written prompts.

H3

The modularity of DSPy should enable the exploration of complex pipelines with nuanced performance characteristics.

Case Study: Math Word Problems

The authors evaluate DSPy on the GSM8K dataset, which consists of grade school math word problems. They compare different DSPy programs (vanilla, ChainOfThought, ThoughtReflection) with various compilation strategies (zero-shot, few-shot, bootstrap) and analyze their performance.

Programs

Vanilla: A simple Predict module with the "question -> answer" signature.

CoT: Uses ChainOfThought to elicit step-by-step reasoning.

ThoughtReflection: Samples multiple reasoning chains and uses MultiChainComparison to generate a final answer based on the patterns observed.

Compilation Strategies

None (zero-shot)

No compilation, using the default LM and no demonstrations.

Fewshot

Samples a few random demonstrations from the training set.

Bootstrap

Bootstraps demonstrations using the BootstrapFewShot teleprompter.

Bootstrap x2

Iteratively bootstraps demonstrations using the previous bootstrap results.

Ensemble

Creates an ensemble of multiple bootstrapped programs.

Results:

The results demonstrate that DSPy programs significantly outperform systems relying on hand-crafted prompts. Even with simple programs, bootstrapping demonstrations leads to substantial improvements. For instance, the vanilla program achieves 64.7% accuracy with bootstrap×2 on Llama2-13b-chat, compared to 9.4% accuracy with zero-shot prompting. The ThoughtReflection program performs exceptionally well, achieving 88.3% accuracy with an ensemble of bootstrapped programs on GPT-3.5.

Code:

# GSM8K Program 'vanilla'
vanilla = dspy.Predict("question -> answer")

# GSM8K Program 'CoT'
CoT = dspy.ChainOfThought("question -> answer")

class ThoughtReflection(dspy.Module):
  def __init__(self, num_attempts):
    self.predict = dspy.ChainOfThought("question -> answer", n=num_attempts)
    self.compare = dspy.MultiChainComparison("question -> answer", M=num_attempts)

  def forward(self, question):
    completions = self.predict(question=question).completions
    return self.compare(question=question, completions=completions)

reflection = ThoughtReflection(num_attempts=5)  # GSM8K Program 'reflection'

# Compiling with BootstrapFewShot (with random search over candidates).
# `program` is one of the programs above, e.g., vanilla, CoT, or reflection.
tp = BootstrapFewShotWithRandomSearch(metric=gsm8k_accuracy)
bootstrap = tp.compile(program, trainset=trainset, valset=devset)

Case Study: Complex Question Answering

The second case study focuses on multi-hop question answering using the HotPotQA dataset. The authors evaluate DSPy programs for retrieval-augmented generation (RAG) and multi-hop reasoning.

Programs

CoT RAG

A basic RAG program that uses ChainOfThought for generating answers based on retrieved passages.

ReAct

A multi-step agent that uses tools like retrieval models to answer questions iteratively.

BasicMultiHop

A custom program that simulates the information flow of systems like Baleen and IRRR, employing multiple retrieval hops to gather relevant information.

Results

Again, the results show that DSPy programs significantly outperform systems relying on manual prompting. The BasicMultiHop program, compiled with bootstrap, achieves 54.7% answer accuracy on GPT-3.5, surpassing the performance of hand-crafted prompts and few-shot prompting.

Code

# BasicMultiHop program
class BasicMultiHop(dspy.Module):
  def __init__(self, passages_per_hop):
    self.retrieve = dspy.Retrieve(k=passages_per_hop)
    self.generate_query = dspy.ChainOfThought("context, question -> search_query")
    self.generate_answer = dspy.ChainOfThought("context, question -> answer")

  def forward(self, question):
    context = []

    for hop in range(2):
      query = self.generate_query(context=context, question=question).search_query
      context += self.retrieve(query).passages

    return self.generate_answer(context=context, question=question)

multihop = BasicMultiHop(passages_per_hop=3)

Finetuning

The authors also explore fine-tuning a smaller LM like T5-Large with the BootstrapFinetune teleprompter. This demonstrates DSPy's ability to efficiently use smaller LMs and achieve comparable results to systems relying on larger proprietary LMs.
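
A rough sketch of what that step might look like; the BootstrapFinetune usage and the target argument below follow the paper's description of distilling a teacher program into T5-Large, but the exact API surface (and the bootstrapped_multihop teacher and answer_exact_match metric names) are assumptions:

from dspy.teleprompt import BootstrapFinetune

# Assumption: distill a program already compiled with BootstrapFewShot into T5-Large.
tp = BootstrapFinetune(metric=answer_exact_match)  # answer_exact_match is an illustrative metric
t5_program = tp.compile(multihop, teacher=bootstrapped_multihop,
                        trainset=trainset, target='t5-large')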

Comparison with Existing Libraries Like Langchain and LlamaIndex

While LangChain and LlamaIndex provide valuable tools for building LM pipelines, they still rely heavily on manual prompt engineering. These libraries typically offer pre-packaged components and chains, but their internal implementation often involves hard-coded prompts.

DSPy differentiates itself by providing a framework for automating the prompt generation and optimization process, allowing researchers and practitioners to build new LM pipelines with minimal manual effort.

How do you use DSPy for a specific problem?

Let's consider a slightly more complex problem: building a system that can summarize a research paper and answer specific questions based on the summary.

Define Signatures:

paper -> summary: This signature represents the task of summarizing a research paper.

summary, question -> answer: This signature represents the task of answering a question based on the generated summary.

Choose Modules

For summarizing the paper, we can use the ChainOfThought module to guide the LM to generate a concise summary.

For answering questions, we can use the Predict module with the summary, question -> answer signature.

Create DSPy Program

class SummarizerAndQuestionAnswerer(dspy.Module):
  def __init__(self):
    self.summarize = dspy.ChainOfThought("paper -> summary")
    self.answer_question = dspy.Predict("summary, question -> answer")

  def forward(self, paper, question):
    summary = self.summarize(paper=paper).summary
    return self.answer_question(summary=summary, question=question)

summarizer_qa = SummarizerAndQuestionAnswerer()

Compile the Program:

We can use the BootstrapFewShot teleprompter with a metric like ROUGE score for the summary and accuracy for the answer.

This will automatically generate demonstrations for both the summary and question-answering tasks, optimizing the prompts and potentially finetuning a smaller LM.
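
A minimal sketch of that compilation step, assuming a training set of dspy.Example objects with paper, question, and answer fields. The simple answer-match metric below is illustrative; summary quality could additionally be checked by inspecting the trace:

from dspy.teleprompt import BootstrapFewShot

def qa_metric(example, prediction, trace=None):
  # Illustrative check: the program's final prediction carries the answer field.
  return prediction.answer.strip().lower() == example.answer.strip().lower()

tp = BootstrapFewShot(metric=qa_metric)
compiled_qa = tp.compile(summarizer_qa, trainset=trainset)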

Run the Program:

We can now run the compiled program with a research paper and a question as input.
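
For example, using the compiled_qa program from the previous sketch (the paper text and question are placeholders):

prediction = compiled_qa(paper="<full text of the research paper>",
                         question="What datasets does the paper evaluate on?")
print(prediction.answer)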

Conclusion

DSPy represents a significant step forward in the development of AI systems that leverage LMs. By introducing a programming model that automates prompt engineering, it empowers researchers and practitioners to build sophisticated and performant LM pipelines with minimal manual effort. DSPy's modularity, adaptability, and optimization capabilities open up exciting possibilities for the future of AI, enabling the creation of self-improving systems that can effectively tackle complex tasks across various domains.

Website: https://dspy-docs.vercel.app/

Github: https://github.com/stanfordnlp/dspy

