The paper highlights the rapid evolution of language models (LMs) and their increasing use in solving complex tasks. However, it identifies a crucial bottleneck: existing LM pipelines heavily rely on manually crafted "prompt templates." These templates, often lengthy strings of instructions and examples, are brittle, task-specific, and require significant human effort.
The authors argue for a more systematic approach and introduce DSPy, a programming model that aims to automate the process of building and optimizing LM pipelines. DSPy treats LM calls as declarative modules, allowing for the construction of text transformation graphs that can be automatically compiled and optimized.
The paper draws inspiration from the success of neural network abstractions like Torch, Theano, and PyTorch, which enabled modular composition and automatic weight optimization. The authors apply similar principles to LM pipelines, proposing a programming model that allows for modular composition and automated optimization of LM calls.
The paper acknowledges the existing research on in-context learning, where LMs demonstrate impressive few-shot learning capabilities. However, it emphasizes the need for systematic optimization of multi-stage pipelines, going beyond individual LM calls.
While existing toolkits like LangChain and LlamaIndex provide pre-packaged components and agents, they still rely on manual prompt engineering. DSPy aims to address this challenge by offering a framework for automatically generating and optimizing prompts for arbitrary pipelines.
DSPy introduces three key abstractions: signatures, modules, and teleprompters.
Instead of relying on free-form string prompts, DSPy utilizes natural language signatures. These signatures define the desired input and output behavior of an LM call. For example:
qa = dspy.Predict("question -> answer")
qa(question="Where is Guaraní spoken?")
# Output: Prediction(answer='Guaraní is spoken mainly in South America.')
Here, the signature "question -> answer" instructs the LM to take a question as input and produce an answer as output.
Signatures offer two primary advantages:
Self-Improvement: They can be automatically compiled into self-improving prompts or finetuning strategies by bootstrapping useful demonstrations.
Pipeline-Adaptability: They can adapt to different pipelines by dynamically adjusting the generated prompts based on the specific context.
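Signatures are not limited to the inline string form. A minimal sketch of a multi-field signature expressed as a class (the docstring and field descriptions here are illustrative, not taken from the paper):

# A signature with several fields; descriptions hint at the desired behavior.
class GenerateAnswer(dspy.Signature):
    """Answer questions using the provided context."""
    context = dspy.InputField(desc="may contain relevant passages")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="a short factoid answer")

qa_with_context = dspy.Predict(GenerateAnswer)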
DSPy modules represent specific prompting techniques and can be reused across different tasks and pipelines. These modules are parameterized, meaning they can learn their desired behavior by being trained on demonstrations.
The core module is Predict, which takes a signature as input and performs the actual LM call. Here's a simplified pseudocode representation of Predict:
class Predict(dspy.Module):
    def __init__(self, signature, **config):
        self.signature = dspy.Signature(signature)
        self.config = config
        self.lm = dspy.ParameterLM(None)  # use the default LM
        self.demonstrations = dspy.ParameterDemonstrations([])

    def forward(self, **kwargs):
        # Resolve the LM, signature, and demonstrations to use for this call.
        lm = get_the_right_lm(self.lm, kwargs)
        signature = get_the_right_signature(self.signature, kwargs)
        demonstrations = get_the_right_demonstrations(self.demonstrations, kwargs)

        # Format the prompt from the signature and demonstrations, then call the LM.
        prompt = signature(demos=demonstrations, **kwargs)
        completions = lm.generate(prompt, **self.config)
        prediction = Prediction.from_completions(completions, signature=signature)

        # While compiling, record a trace of this call for the teleprompter.
        if dspy.settings.compiling is not None:
            trace = dict(predictor=self, inputs=kwargs, outputs=prediction)
            dspy.settings.traces.append(trace)

        return prediction
DSPy also includes other built-in modules like ChainOfThought, ProgramOfThought, MultiChainComparison, and ReAct, which generalize popular prompting techniques from the literature. For instance, ChainOfThought allows the LM to reason step-by-step before providing an answer:
class ChainOfThought(dspy.Module):
    def __init__(self, signature):
        # Modify signature from `*inputs -> *outputs` to `*inputs -> rationale, *outputs`.
        rationale_field = dspy.OutputField(prefix="Reasoning: Let's think step by step.")
        signature = dspy.Signature(signature).prepend_output_field(rationale_field)

        # Declare a sub-module with the modified signature.
        self.predict = dspy.Predict(signature)

    def forward(self, **kwargs):
        # Just forward the inputs to the sub-module.
        return self.predict(**kwargs)
These modules are implemented in a few lines of code by expanding the user-defined signature and calling Predict one or more times with appropriate new signatures. This modularity allows for easy customization and extension of the framework.
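In practice, using the built-in module is a one-liner. A quick usage sketch, where the output field names follow the expanded signature above:

# Hypothetical usage of the built-in ChainOfThought module.
cot_qa = dspy.ChainOfThought("question -> answer")
pred = cot_qa(question="Where is Guaraní spoken?")
print(pred.rationale)  # the step-by-step reasoning from the prepended output field
print(pred.answer)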
Similarly, a ReAct module might look something like this:
class ReAct(dspy.Module):
    def __init__(self, signature, tools=None, max_iters=5):
        # Store the signature for the input/output fields.
        self.signature = dspy.Signature(signature)

        # List of available tools.
        self.tools = tools if tools is not None else []

        # Maximum number of thought-action iterations.
        self.max_iters = max_iters

    def forward(self, **kwargs):
        # Initialize the running context of observations.
        context = []

        # Iterate until a final answer is found or the iteration budget is exhausted.
        for i in range(self.max_iters):
            # Generate a thought and an action using the LM (e.g., via ChainOfThought).
            thought_action = self.generate_thought_action(context=context, **kwargs)

            # Extract the action type and its input.
            action_type = thought_action.action_type
            action_input = thought_action.action_input

            # If the action is 'Finish', return the final answer.
            if action_type == 'Finish':
                return Prediction(answer=action_input)

            # Execute the specified action using the corresponding tool.
            observation = self.execute_action(action_type, action_input)

            # Update the context with the observation.
            context.append(observation)

        # If no final answer is found, return an empty prediction.
        return Prediction()

    def generate_thought_action(self, context, **kwargs):
        # Generates a "thought" (reasoning) and an "action" (tool call) using the LM.
        # ... (This could use ChainOfThought or another prompting technique.)
        pass

    def execute_action(self, action_type, action_input):
        # Executes the chosen action with the corresponding tool, returning an observation.
        # ... (For example, use a retrieval model if the action is 'Search'.)
        pass
Here, tools is a list of available tools (e.g., a dspy.Retrieve module for search), and max_iters caps the number of thought-action-observation steps.
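To make the sketch concrete, instantiating and calling such a module might look as follows (a hypothetical usage example of the ReAct sketch above, not the library's exact API):

# Hypothetical usage of the ReAct sketch above.
retrieve = dspy.Retrieve(k=3)  # retrieval tool backed by the configured retrieval model
react_qa = ReAct("question -> answer", tools=[retrieve], max_iters=5)
prediction = react_qa(question="Where is Guaraní spoken?")
print(prediction.answer)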
Teleprompters (now called Optimizers) are the heart of DSPy's optimization process. They take a DSPy program, a training set, and a metric as input and return an optimized program. Different teleprompters employ various optimization strategies.
One example is BootstrapFewShot, which simulates the program on a small training set (even one example can be sufficient) and collects input-output traces for each module. These traces are then used to generate effective few-shot prompts or training data for finetuning.
The authors note that even with unreliable LMs, DSPy can efficiently search for solutions in a multi-stage design. This allows for iterative bootstrapping, improving the accuracy of the pipeline over time.
class SimplifiedBootstrapFewShot(Teleprompter):
    def __init__(self, metric=None):
        self.metric = metric

    def compile(self, student, trainset, teacher=None):
        teacher = teacher if teacher is not None else student
        compiled_program = student.deepcopy()

        # Step 1. Prepare mappings between student and teacher Predict modules.
        assert student_and_teacher_have_compatible_predict_modules(student, teacher)
        name2predictor, predictor2name = map_predictors_recursively(student, teacher)

        # Step 2. Bootstrap traces for each Predict module.
        for example in trainset:
            if we_found_enough_bootstrapped_demos(): break

            with dspy.settings.context(compiling=True):
                # Run the teacher program on the example and get its final prediction.
                prediction = teacher(**example.inputs())

                # Get the traces of all internal Predict calls from the teacher program.
                predicted_traces = dspy.settings.traces

            # If the prediction is valid, turn the traced calls into demonstrations.
            if self.metric(example, prediction, predicted_traces):
                for predictor, inputs, outputs in predicted_traces:
                    d = dspy.Example(automated=True, **inputs, **outputs)
                    predictor_name = predictor2name[id(predictor)]
                    compiled_program[predictor_name].demonstrations.append(d)

        return compiled_program
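Under the assumptions of the sketch above (a user-defined metric and a labeled trainset), compilation then reduces to a single call. A hedged usage sketch:

# Hypothetical usage of the simplified teleprompter sketched above.
def exact_match(example, prediction, trace=None):
    return prediction.answer.lower() == example.answer.lower()

tp = SimplifiedBootstrapFewShot(metric=exact_match)
compiled_qa = tp.compile(dspy.ChainOfThought("question -> answer"), trainset=trainset)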
The DSPy compiler utilizes teleprompters to optimize arbitrary pipelines. It operates in three stages:
Candidate generation: the compiler first identifies all unique Predict modules within the program and generates candidate values for their parameters (e.g., demonstrations, instructions).
Parameter optimization: the compiler then applies hyperparameter tuning algorithms like random search or Bayesian optimization to select the best combination of candidate parameters.
Higher-order program optimization: the compiler can also optimize the control flow of the program itself, for example by creating ensembles of multiple programs or implementing dynamic backtracking logic, as in the sketch below.
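For example, a hedged sketch of the last two stages (the teleprompter names follow the paper, but the imports and exact arguments are assumptions based on the DSPy library):

from dspy.teleprompt import BootstrapFewShotWithRandomSearch, Ensemble

# Parameter optimization: random search over bootstrapped demonstration candidates.
tp = BootstrapFewShotWithRandomSearch(metric=gsm8k_accuracy, num_candidate_programs=8)
compiled = tp.compile(dspy.ChainOfThought("question -> answer"), trainset=trainset, valset=devset)

# Higher-order optimization: a majority-vote ensemble of compiled candidates.
candidates = [compiled]  # in practice, the top-k programs found by the random search
ensembled = Ensemble(reduce_fn=dspy.majority).compile(candidates)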
The paper focuses on evaluating DSPy's ability to reduce or eliminate the reliance on hand-crafted prompts. The key goals are:
H1
DSPy should allow for the replacement of hand-crafted prompts with modular units without compromising quality or expressive power.
H2
Parameterizing modules and treating prompting as an optimization problem should make DSPy adaptable to different LMs and potentially outperform expert-written prompts.
H3
The modularity of DSPy should enable the exploration of complex pipelines with nuanced performance characteristics.
The authors evaluate DSPy on the GSM8K dataset, which consists of grade school math word problems. They compare different DSPy programs (vanilla, ChainOfThought, ThoughtReflection) with various compilation strategies (zero-shot, few-shot, bootstrap) and analyze their performance.
Vanilla: A simple Predict module with the "question -> answer" signature.
CoT: Uses ChainOfThought to elicit step-by-step reasoning.
ThoughtReflection: Samples multiple reasoning chains and uses MultiChainComparison to generate a final answer based on the patterns observed.
None (zero-shot)
No compilation, using the default LM and no demonstrations.
Fewshot
Samples a few random demonstrations from the training set.
Bootstrap
Bootstraps demonstrations using the BootstrapFewShot teleprompter.
Bootstrap x2
Iteratively bootstraps demonstrations using the previous bootstrap results.
Ensemble
Creates an ensemble of multiple bootstrapped programs.
The results demonstrate that DSPy programs significantly outperform systems relying on hand-crafted prompts. Even with simple programs, bootstrapping demonstrations leads to substantial improvements. For instance, the vanilla program achieves 64.7% accuracy with bootstrap×2 on Llama2-13b-chat, compared to 9.4% accuracy with zero-shot prompting. The ThoughtReflection program performs exceptionally well, achieving 88.3% accuracy with an ensemble of bootstrapped programs on GPT-3.5.
# GSM8K Program 'vanilla'
vanilla = dspy.Predict("question -> answer")

# GSM8K Program 'CoT'
CoT = dspy.ChainOfThought("question -> answer")

class ThoughtReflection(dspy.Module):
    def __init__(self, num_attempts):
        self.predict = dspy.ChainOfThought("question -> answer", n=num_attempts)
        self.compare = dspy.MultiChainComparison("question -> answer", M=num_attempts)

    def forward(self, question):
        completions = self.predict(question=question).completions
        return self.compare(question=question, completions=completions)

reflection = ThoughtReflection(num_attempts=5)  # GSM8K Program 'reflection'

# Compiling with BootstrapFewShot and random search
tp = BootstrapFewShotWithRandomSearch(metric=gsm8k_accuracy)
bootstrap = tp.compile(program, trainset=trainset, valset=devset)
The second case study focuses on multi-hop question answering using the HotPotQA dataset. The authors evaluate DSPy programs for retrieval-augmented generation (RAG) and multi-hop reasoning.
CoT RAG
A basic RAG program that uses ChainOfThought for generating answers based on retrieved passages.
ReAct
A multi-step agent that uses tools like retrieval models to answer questions iteratively.
BasicMultiHop
A custom program that simulates the information flow of systems like Baleen and IRRR, using multiple retrieval hops to gather relevant information.
Again, the results show that DSPy programs significantly outperform systems relying on manual prompting. The BasicMultiHop program, compiled with bootstrap, achieves 54.7% answer accuracy on GPT-3.5, surpassing the performance of hand-crafted prompts and few-shot prompting.
# BasicMultiHop program
class BasicMultiHop(dspy.Module):
    def __init__(self, passages_per_hop):
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []

        for hop in range(2):
            query = self.generate_query(context=context, question=question).search_query
            context += self.retrieve(query).passages

        return self.generate_answer(context=context, question=question)

multihop = BasicMultiHop(passages_per_hop=3)
The authors also explore fine-tuning a smaller LM like T5-Large with the BootstrapFinetune teleprompter. This demonstrates DSPy's ability to efficiently use smaller LMs and achieve comparable results to systems relying on larger proprietary LMs.
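A hedged sketch of what that might look like (the teleprompter name comes from the paper; the metric, import path, and target argument here are assumptions):

from dspy.teleprompt import BootstrapFinetune

# Bootstrap demonstrations with the multihop program as its own teacher, then
# finetune a T5-Large checkpoint on the collected traces.
tp = BootstrapFinetune(metric=answer_exact_match)
multihop_t5 = tp.compile(multihop, teacher=multihop, trainset=trainset, target='t5-large')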
While LangChain and LlamaIndex provide valuable tools for building LM pipelines, they still rely heavily on manual prompt engineering. These libraries typically offer pre-packaged components and chains, but their internal implementation often involves hard-coded prompts.
DSPy differentiates itself by providing a framework for automating the prompt generation and optimization process, allowing researchers and practitioners to build new LM pipelines with minimal manual effort.
Let's consider a slightly more complex problem: building a system that can summarize a research paper and answer specific questions based on the summary.
paper -> summary: This signature represents the task of summarizing a research paper.
summary, question -> answer: This signature represents the task of answering a question based on the generated summary.
For summarizing the paper, we can use the ChainOfThought module to guide the LM to generate a concise summary.
For answering questions, we can use the Predict module with the summary, question -> answer signature.
class SummarizerAndQuestionAnswerer(dspy.Module):
    def __init__(self):
        self.summarize = dspy.ChainOfThought("paper -> summary")
        self.answer_question = dspy.Predict("summary, question -> answer")

    def forward(self, paper, question):
        summary = self.summarize(paper=paper).summary
        return self.answer_question(summary=summary, question=question)

summarizer_qa = SummarizerAndQuestionAnswerer()
Compile the Program:
We can use the BootstrapFewShot teleprompter with a metric like ROUGE score for the summary and accuracy for the answer.
This will automatically generate demonstrations for both the summary and question-answering tasks, optimizing the prompts and potentially finetuning a smaller LM.
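A hedged sketch of that compilation step, assuming a labeled trainset of (paper, question, answer) examples and a simple exact-match check on the answer:

from dspy.teleprompt import BootstrapFewShot

# Hypothetical metric: validate the final answer by exact match; the intermediate
# summary could additionally be scored (e.g., with ROUGE) via the recorded trace.
def qa_metric(example, prediction, trace=None):
    return prediction.answer.lower() == example.answer.lower()

tp = BootstrapFewShot(metric=qa_metric)
compiled_summarizer_qa = tp.compile(summarizer_qa, trainset=trainset)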
Run the Program:
We can now run the compiled program with a research paper and a question as input.
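For instance (paper_text and the question below are placeholders):

# Run the compiled program on a new paper and question.
result = compiled_summarizer_qa(paper=paper_text, question="What benchmark does the paper evaluate on?")
print(result.answer)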
DSPy represents a significant step forward in the development of AI systems that leverage LMs. By introducing a programming model that automates prompt engineering, it empowers researchers and practitioners to build sophisticated and performant LM pipelines with minimal manual effort. DSPy's modularity, adaptability, and optimization capabilities open up exciting possibilities for the future of AI, enabling the creation of self-improving systems that can effectively tackle complex tasks across various domains.
Website: https://dspy-docs.vercel.app/