May 17, 2024
6 mins

AutoCodeRover: Autonomous Program Improvement

AutoCodeRover from NUS provides a novel framework that looks beyond code generation to genuine problem solving with the help of AI.
Paper Link

Key Takeaways

  • Automating Software Improvement: AutoCodeRover leverages LLMs and code search capabilities to automate bug fixing and feature additions by analyzing GitHub issues and generating patches.
  • Software Engineering Focus: Unlike other LLM-based solutions, AutoCodeRover incorporates a software engineering perspective by focusing on program representation through Abstract Syntax Trees (ASTs) and employing debugging techniques like SBFL.
  • Promising Efficacy: The tool demonstrates promising results in resolving real-life GitHub issues, showcasing its potential to contribute to autonomous software engineering.
  • Increased Developer Productivity: AutoCodeRover can free developers from tedious tasks, allowing them to focus on more complex and creative aspects of software development.
  • Reduced Time to Market: Faster resolution of issues can accelerate the development process and shorten the time it takes to bring new products and updates to market.
  • Improved Software Quality: By identifying and fixing bugs early on, AutoCodeRover can contribute to enhanced software quality and reliability.
  • Challenges and Future Directions: AutoCodeRover's effectiveness depends on LLM capabilities and can be impacted by incomplete specifications. Further research is needed to address these challenges and explore areas like issue reproduction, semantic artifact utilization, and human-AI collaboration.

Introduction

The paper starts by acknowledging the advancements in automating the software development process, highlighting the recent impact of Large Language Models (LLMs) in enabling automated coding through tools like GitHub Copilot. However, software engineering extends beyond coding: it also encompasses crucial tasks like program improvement for maintenance (e.g., bug fixing) and evolution (e.g., feature additions). For LLM-generated code to be adopted, it must be trustworthy, and automating program repair is one way to build that trust. This motivates AutoCodeRover, a tool designed for autonomous program improvement that works directly from GitHub issues.

Background

Program Repair

The paper discusses various existing automated program repair (APR) techniques, including search-based, semantic-based, and learning-based approaches. Search-based techniques, like GenProg, rely on code mutations to explore possible patches, while semantic-based techniques, like SemFix, formulate repair constraints from specifications and solve them to generate patches. Learning-based techniques utilize deep learning models trained on large codebases to predict the most likely correct patch. However, current APR techniques remain limited in addressing real-life software issues: they rely on high-quality test suites and do not exploit the natural language specifications found in problem descriptions.

Dataset

The paper focuses on the SWE-bench dataset, which contains real-life GitHub issues and corresponding pull requests from popular Python projects. Each issue represents either a bug fix or a feature addition, and the pull request includes the code changes made by developers to resolve the issue. The dataset presents a realistic and challenging environment for evaluating the capabilities of automated program improvement tools.
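For readers who want to poke at the data, SWE-bench is distributed via the Hugging Face hub. The snippet below is a minimal sketch assuming the princeton-nlp/SWE-bench dataset identifier and field names such as problem_statement and patch; check the current dataset card, as these may change across versions.

```python
# Minimal sketch of inspecting SWE-bench, assuming it is hosted on the
# Hugging Face hub as "princeton-nlp/SWE-bench" with these field names.
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")

task = swe_bench[0]
print(task["repo"])               # e.g. "django/django"
print(task["problem_statement"])  # natural-language issue text
print(task["patch"])              # developer's reference fix
```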

Motivating Example

The paper provides a clear example of how AutoCodeRover works by showcasing its approach to a real GitHub issue from the Django project. The issue requests adding support to the ModelChoiceField class to display the value of invalid choices when raising a validation error. AutoCodeRover tackles this issue in two stages: context retrieval and patch generation.

Context Retrieval

  1. Initial Analysis: The LLM agent identifies relevant keywords like ModelChoiceField and ModelMultipleChoiceField from the issue description and explores potential methods like clean for validation-related issues.
  2. Iterative Search: Through iterative API calls like search_class and search_method_in_class, AutoCodeRover gathers information about relevant classes and methods. It discovers the absence of the clean method in ModelChoiceField and identifies other potential candidates like validate and to_python.
  3. Identifying Buggy Location: By analyzing the implementations of validate and to_python, the agent determines that to_python is the most suitable location for modification, as it raises the relevant exception without including the invalid value. (A stand-in version of this call trace is sketched after this list.)
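The trace below is a hypothetical reconstruction of those steps as code. The API names follow the paper; the stub bodies exist only so the trace reads (and runs) on its own.

```python
# Hypothetical reconstruction of the retrieval trace for the Django issue.
# The API names follow the paper; the bodies are stand-in stubs.

def search_class(name: str) -> str:
    return f"<signature and method list of class {name}>"

def search_method_in_class(method: str, cls: str) -> str:
    return f"<code of {cls}.{method}, or 'not found'>"

# Stratum 1: look up the classes named in the issue description.
print(search_class("ModelChoiceField"))          # no clean() defined here
print(search_class("ModelMultipleChoiceField"))

# Stratum 2: clean() is absent, so inspect other validation candidates.
print(search_method_in_class("validate", "ModelChoiceField"))
print(search_method_in_class("to_python", "ModelChoiceField"))
# to_python raises the 'invalid_choice' error without the offending
# value, so it is chosen as the patch location.
```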

Patch Generation

  1. Precise Code Extraction: The patch generation agent retrieves the specific code snippet of the to_python method.
  2. Patch Construction: Based on the retrieved context and analysis, the agent generates a patch that uses Python %-formatting to incorporate the invalid value into the error message (this mechanism is sketched after the list).
  3. Validation (Optional): If available, a test suite can be used to validate the generated patch and trigger regeneration if necessary.
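The %-formatting mechanism item 2 relies on is easy to demonstrate in isolation. The snippet below mirrors how Django's ValidationError interpolates a params mapping into an error message; it is an illustration of the mechanism, not the verbatim generated patch.

```python
# Minimal sketch of the %-formatting mechanism the generated patch relies
# on: Django fills %(value)s placeholders in a message from the params
# mapping passed to ValidationError.
message = "Select a valid choice. %(value)s is not one of the available choices."
params = {"value": "42"}
print(message % params)
# -> Select a valid choice. 42 is not one of the available choices.

# In the field code this corresponds to raising, for example:
#   ValidationError(self.error_messages["invalid_choice"],
#                   code="invalid_choice", params={"value": value})
```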

This example demonstrates how AutoCodeRover effectively navigates the codebase, identifies the root cause of the issue, and generates a patch that addresses the problem.

AI Program Improvement Framework

Overview

The framework takes a problem statement (P) describing the issue and the corresponding codebase (C) as input. It operates in two main stages:

Context Retrieval

An LLM agent navigates the codebase and extracts relevant code snippets using a set of context retrieval APIs. This process is iterative and guided by a stratified search strategy to ensure the gathered context is comprehensive yet focused.

Patch Generation

Another LLM agent utilizes the collected context to identify precise code locations and generate a patch. A retry-loop ensures the patch adheres to the specified format and can be applied to the original codebase.
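Put together, the framework can be read as a short loop. The sketch below uses hypothetical helper names (retrieve_context, generate_patch, apply_patch) to stand in for the two agents; it outlines the workflow rather than AutoCodeRover's actual API.

```python
# High-level sketch of the two-stage pipeline. All helper names are
# hypothetical placeholders, not AutoCodeRover's actual interface.

def improve(problem_statement: str, codebase_path: str, max_retries: int = 3):
    # Stage 1: an LLM agent gathers relevant code via retrieval APIs.
    context = retrieve_context(problem_statement, codebase_path)

    # Stage 2: a second agent drafts a patch; retry until it applies.
    for _ in range(max_retries):
        patch = generate_patch(problem_statement, context)
        if apply_patch(codebase_path, patch):  # parses and applies cleanly
            return patch
    return None  # no applicable patch within the retry budget
```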

Figure: AutoCodeRover's patch and the developer patch for Django-13933.

Optional Analysis Augmentation

If a test suite (T) is available, AutoCodeRover can leverage Spectrum-based Fault Localization (SBFL) to identify suspicious code sections, providing additional hints to the context retrieval agent.
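SBFL ranks program elements by how strongly their coverage correlates with failing tests. One standard suspiciousness metric is the Ochiai score, sketched below; the paper uses SBFL in general, so treat this particular formula as illustrative rather than the tool's confirmed choice.

```python
# Ochiai suspiciousness, a standard SBFL metric (illustrative here;
# the paper does not pin AutoCodeRover to this exact formula).
from math import sqrt

def ochiai(failed_cover: int, failed_total: int, passed_cover: int) -> float:
    """failed_cover: failing tests that execute the element;
    failed_total: all failing tests;
    passed_cover: passing tests that execute the element."""
    denom = sqrt(failed_total * (failed_cover + passed_cover))
    return failed_cover / denom if denom else 0.0

# An element covered by all failing tests and few passing tests ranks high.
print(ochiai(failed_cover=2, failed_total=2, passed_cover=1))  # ~0.816
```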

Context Retrieval APIs

AutoCodeRover provides a set of APIs for the LLM agent to retrieve code context based on identified keywords and "hints" from the problem statement. Examples include search_class, search_method_in_class, and search_code_in_file.
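As a rough idea of what sits behind such an API, the sketch below implements a search_class lookup over a codebase with Python's ast module. It is a minimal stand-in consistent with the paper's AST-based program representation, not AutoCodeRover's actual implementation.

```python
# Minimal AST-backed sketch of a search_class retrieval API.
import ast
from pathlib import Path

def search_class(codebase: Path, name: str) -> list[str]:
    """Return 'file:line' locations of class definitions named `name`."""
    hits = []
    for path in codebase.rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except SyntaxError:
            continue  # skip files that do not parse
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef) and node.name == name:
                hits.append(f"{path}:{node.lineno}")
    return hits

# Usage: search_class(Path("django"), "ModelChoiceField")
# -> ["django/forms/models.py:<lineno>", ...]
```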

Stratified Context Search

This strategy ensures efficient and focused context retrieval by iteratively invoking relevant APIs. Each stratum refines the search based on the context gathered in the previous stratum, avoiding overwhelming the LLM with excessive information.
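Read as code, the strategy might look like the following sketch, where select_api_calls and is_context_sufficient are hypothetical stand-ins for LLM decisions:

```python
# Sketch of stratified context retrieval. select_api_calls and
# is_context_sufficient represent LLM judgments; they are hypothetical,
# not AutoCodeRover's actual interface.

def stratified_search(problem: str, max_strata: int = 5) -> list[str]:
    context: list[str] = []
    for stratum in range(max_strata):
        # The LLM picks the next API calls from the problem statement
        # plus everything retrieved in earlier strata.
        calls = select_api_calls(problem, context)
        if not calls:
            break
        for api, args in calls:
            context.append(api(*args))  # e.g. search_class("Foo")
        if is_context_sufficient(problem, context):
            break  # enough context to localize the bug
    return context
```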

Analysis-Augmented Context Retrieval

SBFL analysis can complement the context retrieval process by identifying suspicious methods that might not be directly mentioned in the issue description. The LLM agent can then cross-reference these additional hints with the problem statement to further refine the search.

Patch Generation

The patch generation agent utilizes the collected context and identified buggy locations to generate a patch. A retry-loop with a linter ensures the patch is syntactically correct and can be applied to the original code.
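A minimal sketch of that loop, using Python's ast.parse as a stand-in syntax check and a hypothetical generate_candidate placeholder for the LLM call:

```python
# Sketch of the patch-validation retry loop. ast.parse acts as the
# "linter" here; generate_candidate is a hypothetical LLM call.
import ast

def patch_with_retries(prompt: str, max_retries: int = 3) -> str | None:
    feedback = ""
    for _ in range(max_retries):
        candidate = generate_candidate(prompt + feedback)  # placeholder
        try:
            ast.parse(candidate)  # reject syntactically broken code
            return candidate
        except SyntaxError as err:
            # Feed the error back so the next attempt can correct it.
            feedback = f"\nPrevious attempt failed to parse: {err}"
    return None
```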

Experimental Setup

The paper evaluates AutoCodeRover's effectiveness on the SWE-bench and SWE-bench lite datasets, comparing its performance against existing LLM-based agent systems such as Devin and SWE-agent. The evaluation metrics include the percentage of resolved instances, average time cost, and average token cost. All tools use OpenAI's GPT-4 as the underlying LLM.

Evaluation

The results demonstrate AutoCodeRover's promising efficacy in resolving real-life GitHub issues.

Overall Effectiveness

AutoCodeRover achieves a 15.95% resolution rate on the full SWE-bench and 22.33% on SWE-bench lite, surpassing both Devin and SWE-agent.

Time and Cost Efficiency

AutoCodeRover resolves issues in about 12 minutes on average, far faster than human developers, and at a lower monetary cost than other LLM-based tools.

Patch Correctness

Of the patches AutoCodeRover generates, 65.7% are deemed correct, i.e., semantically equivalent to the patches written by developers.

SBFL Impact

Experiments incorporating SBFL analysis show a further increase in the number of resolved tasks, highlighting the value of integrating program analysis techniques into the workflow.

Business Implications

AutoCodeRover's capabilities hold significant potential for businesses in several ways:

Increased Developer Productivity

By automating the resolution of a portion of GitHub issues, developers can focus on more complex tasks, leading to increased overall productivity and efficiency.

Reduced Time to Market

Faster resolution of bugs and quicker implementation of new features can accelerate the software development lifecycle and reduce time to market for new products and updates.

Improved Software Quality

By identifying and fixing bugs early in the process, AutoCodeRover can contribute to improved software quality and reliability.

Conclusion

AutoCodeRover presents a significant step towards autonomous software engineering by demonstrating the effectiveness of combining LLMs with software engineering principles for program improvement tasks. Its ability to resolve real-life GitHub issues with high efficacy and efficiency holds promising potential for transforming the software development landscape. Further research and development in this area could lead to even more powerful and versatile tools, empowering developers and businesses to build better software, faster.

Critical Analysis

While AutoCodeRover demonstrates significant promise, there are some aspects to consider:

LLM Limitations

The effectiveness of the tool is inherently tied to the capabilities of the underlying LLM. Advancements in LLM technology will directly impact AutoCodeRover's performance and ability to handle more complex issues.

Incomplete Specifications

Both issue descriptions and test suites can be incomplete or ambiguous, potentially leading to incorrect patches or missed bug fixes. Integrating techniques like test-case generation and specification mining could improve the tool's robustness.

Human-AI Collaboration

While AutoCodeRover aims for autonomy, human involvement remains crucial for tasks that require deeper understanding, intuition, and decision-making. Developing effective human-AI collaboration mechanisms will be essential for maximizing the tool's potential.

Overall, AutoCodeRover represents a valuable contribution to the field of automated software engineering and paves the way for further advancements in autonomous code improvement.

GitHub: NUS Official

Dataset: Website
