rSDE-Bench: Requirement-Oriented Software Development Benchmark

Code and data for paper: Self-Evolving Multi-Agent Collaboration Networks for Software Development

¹Cooperative Medianet Innovation Center, Shanghai Jiao Tong University · ²Beihang University · ³Shanghai AI Laboratory



Comparison between instruction-oriented and requirement-oriented evaluations. rSDE-Bench accurately reflects requirement fulfillment with the proposed accuracy score of 2/13, while the instruction-oriented evaluation misjudges with a high score (0.89), failing to detect missing functionality.

Abstract

rSDE-Bench is a requirement-oriented benchmark designed to evaluate the ability of models to handle software-level coding tasks. Unlike instruction-based approaches, rSDE-Bench uses detailed software requirements as input, specifying each functionality and constraint of the software. The benchmark includes automatic evaluation through unit tests, providing a more realistic assessment aligned with real-world software development practices. Read more about rSDE-Bench in our paper!


Benchmark Construction




rSDE-Bench comprises diverse software requirements, each paired with automated test cases, for a total of 53 unique coding tasks and 616 test cases. It is divided into two typical real-world software types, game and website, and introduces two requirement difficulty levels, basic and advanced, to reflect the varying complexity of real-world software development tasks.
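For concreteness, here is a minimal sketch of how one benchmark task might be laid out; the field names are illustrative assumptions, not the benchmark's actual schema:

```python
# Hypothetical layout of a single rSDE-Bench task (field names are
# illustrative assumptions, not the benchmark's actual schema).
task = {
    "task_id": "game_07",               # one of the 53 unique coding tasks
    "software_type": "game",            # "game" or "website"
    "requirement": "A snake game ...",  # long-form software requirement text
    "test_cases": [                     # drawn from the 616 total test cases
        {"level": "basic", "name": "test_initialization"},
        {"level": "advanced", "name": "test_state_transition"},
    ],
}
```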


Features


Challenging and diverse software requirements. rSDE-Bench features long-context software requirements (averaging 507 and 1,011 words for game and website tasks, respectively), unlike instruction-oriented benchmarks that rely on brief prompts. These detailed requirements better reflect the lengthy, complex nature of real-world software development.

Requirement-aware, precise, and efficient evaluation. rSDE-Bench employs detailed software requirements and automated unit tests to precisely measure how well generated software meets its objectives. Generated code is evaluated by the pass rate of task-specific test cases, making the process both accurate and efficient. In contrast, instruction-oriented benchmarks rely on brief prompts that lack explicit constraints, which makes evaluation less reliable and often requires labor-intensive or indirect assessment.
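As a rough sketch of this process (not the benchmark's official harness), evaluation reduces to running each task's unit tests against the generated code and reporting the pass rate; invoking pytest once per test file is an assumption made here for illustration:

```python
import subprocess

def pass_rate(code_dir: str, test_files: list[str]) -> float:
    """Fraction of a task's test cases that the generated software passes."""
    passed = 0
    for test_file in test_files:
        # Run one unit-test file against the generated software; pytest
        # exits with code 0 only if every test in the file passes.
        result = subprocess.run(
            ["pytest", test_file, "-q"],
            cwd=code_dir,
            capture_output=True,
        )
        if result.returncode == 0:
            passed += 1
    return passed / len(test_files)
```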


Leaderboard

| Model | % Basic | % Advanced | Codes | Results | Site |
| --- | ---: | ---: | :-: | :-: | :-: |
| 🥇✅ EvoMAC | 89.38 | 65.05 | 🔗 | 🔗 | 🔗 |
| 🥈 GPT-4o-Mini | 62.90 | 44.40 | 🔗 | 🔗 | 🔗 |
| 🥉✅ ChatDev | 62.67 | 43.45 | 🔗 | 🔗 | 🔗 |
| Claude-3.5-Sonnet | 58.90 | 37.11 | 🔗 | 🔗 | 🔗 |
| ✅ MapCoder | 34.70 | 14.57 | 🔗 | 🔗 | 🔗 |
| Gemini-1.5-Flash | 29.79 | 11.61 | 🔗 | 🔗 | 🔗 |
| ✅ Autogen | 25.68 | 5.40 | 🔗 | 🔗 | 🔗 |
| ✅ Agentverse | 15.41 | 0.00 | 🔗 | 🔗 | 🔗 |
| ✅ MetaGPT | 15.41 | 0.00 | 🔗 | 🔗 | 🔗 |


rSDE-Bench covers two typical real-world software types, website and game, which exercise the different coding capabilities demanded in realistic software development.
Website tasks emphasize static and dynamic content management, user interaction through forms and buttons, and verifying that page elements are displayed and functional.
Game tasks require handling dynamic interactions and real-time state changes, focusing on logic execution, initialization, and game state transitions.
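To make the game-side checks concrete, here is a hypothetical test case in that style; the `Game` stub stands in for generated code and is not an actual benchmark task:

```python
class Game:
    """Stub of generated game code, just enough to illustrate the test style."""
    def __init__(self):
        self.state = "RUNNING"   # target of the initialization check
        self.lives = 3

    def update(self):
        # One tick of game logic: transition state once no lives remain.
        if self.lives <= 0:
            self.state = "GAME_OVER"

def test_initialization():
    assert Game().state == "RUNNING"

def test_game_over_transition():
    game = Game()
    game.lives = 0
    game.update()
    assert game.state == "GAME_OVER"  # real-time state change observed
```

A website-style test would look similar but assert on rendered page content, e.g. that a required form or button appears in the page and responds to interaction.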

rSDE-Bench introduces two requirement difficulty levels: basic and advanced.
- The % Basic metric is the percentage of rSDE-Bench basic test cases passed by the model (computed as in the sketch after this list).
- The % Advanced metric is the percentage of rSDE-Bench advanced test cases passed by the model.
- ✅ indicates that the system is a multi-agent system; its absence denotes a single-agent system.
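
A worked example of the metric computation (the 2/13 case matches the teaser figure above):

```python
def pass_percentage(passed: int, total: int) -> float:
    """% Basic / % Advanced: test cases passed over total, as a percentage."""
    return 100.0 * passed / total

print(f"{pass_percentage(2, 13):.2f}%")  # 15.38%, the teaser's 2/13 example
```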


BibTeX

[TODO]