The paper opens by acknowledging progress in automating the software development process, highlighting the recent impact of Large Language Models (LLMs) in enabling automated coding through tools like GitHub Copilot. However, software engineering extends beyond coding and encompasses crucial tasks such as program improvement for maintenance (e.g., bug fixing) and evolution (e.g., feature additions). Code generated automatically by LLMs also needs to be trustworthy, and automating program repair is one way to establish that trust. This motivates the introduction of AutoCodeRover, a tool designed for autonomous program improvement by resolving GitHub issues.
The paper then surveys existing automated program repair (APR) techniques: search-based, semantic-based, and learning-based approaches. Search-based techniques, like GenProg, mutate code to explore possible patches, while semantic-based techniques, like SemFix, formulate repair constraints from a specification and solve them to generate patches. Learning-based techniques use deep learning models trained on large codebases to predict the most likely correct patch. However, current APR techniques are limited in addressing real-life software issues because they rely on high-quality test suites and do not exploit the natural-language specification contained in issue descriptions.
The paper focuses on the SWE-bench dataset, which contains real-life GitHub issues and corresponding pull requests from popular Python projects. Each issue represents either a bug fix or a feature addition, and the pull request includes the code changes made by developers to resolve the issue. The dataset presents a realistic and challenging environment for evaluating the capabilities of automated program improvement tools.
The paper provides a clear example of how AutoCodeRover works by showcasing its approach to a real GitHub issue from the Django project. The issue asks that the ModelChoiceField class report the value of an invalid choice when raising a validation error. AutoCodeRover tackles this issue in two stages: context retrieval and patch generation.
This example demonstrates how AutoCodeRover effectively navigates the codebase, identifies the root cause of the issue, and generates a patch that addresses the problem.
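To make the example concrete, below is a minimal, self-contained Python sketch of the behaviour the issue asks for. The ModelChoiceFieldSketch and ValidationError classes are simplified stand-ins rather than Django's actual code or the developer's patch; the essential change is that the error message now interpolates the offending value.

```python
class ValidationError(Exception):
    """Tiny stand-in for django.core.exceptions.ValidationError (illustrative only)."""
    def __init__(self, message, code=None, params=None):
        super().__init__(message % params if params else message)
        self.code = code


class ModelChoiceFieldSketch:
    """Simplified stand-in for django.forms.ModelChoiceField (not the real class)."""
    error_messages = {
        # Before the fix, the message had no %(value)s placeholder, so users
        # never learned which choice was rejected.
        "invalid_choice": "Select a valid choice. %(value)s is not one of the available choices.",
    }

    def __init__(self, choices):
        self.choices = set(choices)

    def validate(self, value):
        if value not in self.choices:
            raise ValidationError(
                self.error_messages["invalid_choice"],
                code="invalid_choice",
                params={"value": value},  # the fix supplies the offending value here
            )


field = ModelChoiceFieldSketch(choices={"red", "blue"})
try:
    field.validate("green")
except ValidationError as e:
    print(e)  # Select a valid choice. green is not one of the available choices.
```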
The framework takes a problem statement (P) describing the issue and the corresponding codebase (C) as input. It operates in two main stages:
In the first stage, context retrieval, an LLM agent navigates the codebase and extracts relevant code snippets through a set of context retrieval APIs. This process is iterative and guided by a stratified search strategy so that the gathered context is comprehensive yet focused.
In the second stage, patch generation, another LLM agent uses the collected context to identify precise buggy locations and generate a patch. A retry-loop ensures the patch adheres to the expected format and can be applied to the original codebase.
If a test suite (T) is available, AutoCodeRover can leverage Spectrum-based Fault Localization (SBFL) to identify suspicious code sections, providing additional hints to the context retrieval agent.
AutoCodeRover provides a set of APIs for the LLM agent to retrieve code context based on identified keywords and "hints" from the problem statement. Examples include search_class, search_method_in_class, and search_code_in_file.
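As a rough illustration of what one of these retrieval APIs can do, the sketch below implements a search_class-style lookup over a Python project with the standard ast module. This is not AutoCodeRover's implementation, and the example path is hypothetical; it only approximates the idea of returning where a named class is defined.

```python
import ast
from pathlib import Path


def search_class(project_root: str, class_name: str) -> list[str]:
    """Return 'path:line  class Name' entries for every definition of class_name.

    A minimal approximation of a search_class-style retrieval API, not
    AutoCodeRover's actual implementation.
    """
    hits = []
    for path in Path(project_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be parsed
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef) and node.name == class_name:
                hits.append(f"{path}:{node.lineno}  class {node.name}")
    return hits


# Hypothetical usage on a local Django checkout (path is an assumption):
# print(search_class("django/forms", "ModelChoiceField"))
```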
The stratified search strategy keeps context retrieval efficient and focused by invoking the relevant APIs iteratively. Each stratum refines the search based on the context gathered in the previous stratum, avoiding overwhelming the LLM with excessive information.
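A hedged sketch of that iterative loop is shown below. The llm_select_api_calls stub stands in for the LLM agent's decision of which APIs to call next, and all names are illustrative rather than AutoCodeRover's actual interfaces.

```python
def llm_select_api_calls(problem_statement, collected_context):
    """Stand-in for the LLM agent: returns a list of (api_name, args) requests,
    or an empty list once it judges the gathered context to be sufficient."""
    return []  # placeholder so the sketch runs; the real agent queries an LLM


def retrieve_context(problem_statement, apis, max_strata=5):
    """Stratified retrieval: each stratum may request more API calls based on
    what the previous stratum returned, keeping the context focused."""
    context = []
    for _ in range(max_strata):
        requests = llm_select_api_calls(problem_statement, context)
        if not requests:  # the agent is satisfied with the current context
            break
        for api_name, args in requests:
            context.append(apis[api_name](*args))  # e.g. search_class, search_code_in_file
    return context


# With the placeholder agent the loop stops immediately and returns [].
print(retrieve_context("ModelChoiceField should show the invalid value", apis={}))
```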
SBFL analysis can complement the context retrieval process by identifying suspicious methods that might not be directly mentioned in the issue description. The LLM agent can then cross-reference these additional hints with the problem statement to further refine the search.
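For readers unfamiliar with SBFL, a common way to rank code elements is the Ochiai formula, which scores each element by how strongly its coverage correlates with failing tests. The sketch below is a generic illustration; the paper's exact formula and tooling may differ.

```python
from math import sqrt


def ochiai_scores(coverage: dict[str, set[str]], failed: set[str]) -> dict[str, float]:
    """Score each method as ef / sqrt(total_failed * (ef + ep)), where ef/ep are
    the numbers of failing/passing tests that cover the method."""
    methods = set().union(*coverage.values())
    total_failed = len(failed)
    scores = {}
    for m in methods:
        ef = sum(1 for t in failed if m in coverage[t])
        ep = sum(1 for t in coverage if t not in failed and m in coverage[t])
        denom = sqrt(total_failed * (ef + ep))
        scores[m] = ef / denom if denom else 0.0
    return scores


# Toy example: method "B" is covered only by the failing test, so it ranks highest.
cov = {"test_ok": {"A"}, "test_fail": {"A", "B"}}
print(ochiai_scores(cov, failed={"test_fail"}))  # A ≈ 0.707, B = 1.0
```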
The patch generation agent utilizes the collected context and identified buggy locations to generate a patch. A retry-loop with a linter ensures the patch is syntactically correct and can be applied to the original code.
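The retry-loop can be pictured roughly as follows. Both generate_candidate_patch (a stand-in for the LLM patch-writing agent) and the compile-based lint check are simplifications; the names and details are assumptions, not AutoCodeRover's actual code.

```python
def generate_candidate_patch(context, feedback=None) -> str:
    """Stand-in for the LLM patch-writing agent; the real system queries an LLM
    with the retrieved context and any feedback from a failed attempt."""
    return "def fixed():\n    return 42\n"


def lint(patched_source: str):
    """Crude stand-in for a linter: return an error message for syntactically
    broken Python, or None if the source compiles."""
    try:
        compile(patched_source, "<patch>", "exec")
        return None
    except SyntaxError as exc:
        return str(exc)


def generate_patch(context, max_retries=3) -> str:
    feedback = None
    for _ in range(max_retries):
        patch = generate_candidate_patch(context, feedback)
        feedback = lint(patch)
        if feedback is None:  # well-formed patch, ready to be applied
            return patch
    raise RuntimeError("no well-formed patch produced within the retry budget")


print(generate_patch(context={}))
```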
The paper evaluates AutoCodeRover's effectiveness on the SWE-bench and SWE-bench lite datasets, comparing its performance against existing LLM-based agent systems such as Devin and SWE-agent. The evaluation metrics include the percentage of resolved instances, average time cost, and average token cost. The experiments use OpenAI's GPT-4 as the foundation LLM.
The results demonstrate AutoCodeRover's promising efficacy in resolving real-life GitHub issues.
AutoCodeRover achieves a 15.95% success rate on the full SWE-bench and 22.33% on SWE-bench lite, surpassing the performance of Devin and SWE-agent.
AutoCodeRover resolves issues within an average time of 12 minutes, significantly faster than the average time taken by human developers. The economic cost is also lower compared to other LLM-based tools.
Around 65.7% of the patches generated by AutoCodeRover are deemed correct, indicating its ability to produce solutions semantically equivalent to those written by developers.
Experiments incorporating SBFL analysis show a further increase in the number of resolved tasks, highlighting the value of integrating program analysis techniques into the workflow.
AutoCodeRover's capabilities hold significant potential for businesses in several ways:
By automating the resolution of a portion of GitHub issues, developers can focus on more complex tasks, leading to increased overall productivity and efficiency.
Faster resolution of bugs and quicker implementation of new features can accelerate the software development lifecycle and reduce time to market for new products and updates.
By identifying and fixing bugs early in the process, AutoCodeRover can contribute to improved software quality and reliability.
AutoCodeRover presents a significant step towards autonomous software engineering by demonstrating the effectiveness of combining LLMs with software engineering principles for program improvement tasks. Its ability to resolve real-life GitHub issues with high efficacy and efficiency holds promising potential for transforming the software development landscape. Further research and development in this area could lead to even more powerful and versatile tools, empowering developers and businesses to build better software, faster.
While AutoCodeRover demonstrates significant promise, there are some aspects to consider:
The effectiveness of the tool is inherently tied to the capabilities of the underlying LLM. Advancements in LLM technology will directly impact AutoCodeRover's performance and ability to handle more complex issues.
Both issue descriptions and test suites can be incomplete or ambiguous, potentially leading to incorrect patches or missed bug fixes. Integrating techniques like test-case generation and specification mining could improve the tool's robustness.
While AutoCodeRover aims for autonomy, human involvement remains crucial for tasks that require deeper understanding, intuition, and decision-making. Developing effective human-AI collaboration mechanisms will be essential for maximizing the tool's potential.
Overall, AutoCodeRover represents a valuable contribution to the field of automated software engineering and paves the way for further advancements in autonomous code improvement.
GitHub: NUS Official
Dataset: Website