rSDE-Bench is a requirement-oriented benchmark designed to evaluate the ability of models to handle software-level coding tasks. Unlike instruction-based approaches, rSDE-Bench uses detailed software requirements as input, specifying each functionality and constraint of the software. The benchmark includes automatic evaluation through unit tests, providing a more realistic assessment aligned with real-world software development practices. Read more about rSDE-Bench in our paper!
rSDE-Bench covers diverse requirements, each paired with a test case. It provides 53 unique coding tasks and 616 test cases, spans two typical real-world software types (game and website), and defines two requirement difficulty levels (basic and advanced) to reflect the varying complexity of real-world software development tasks.
Challenging and diverse software requirements. rSDE-Bench features long-context software requirements (averaging 507 and 1011 words for game and website tasks, respectively), unlike instruction-oriented benchmarks that rely on brief prompts. These detailed requirements better reflect the lengthy and complex nature of real-world software development.
Requirement-aware, precise, and efficient evaluation. rSDE-Bench employs detailed software requirements and automated unit tests to precisely measure how well generated software meets its objectives. Generated code is evaluated by its pass rate on the task's test cases, which makes the process both accurate and efficient. In contrast, instruction-oriented benchmarks rely on brief prompts that lack constraints, which makes evaluation less reliable and often requires labor-intensive or indirect assessment.
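As a rough illustration of pass-rate evaluation, the sketch below runs a small suite with Python's standard `unittest` module and reports the fraction of test cases that passed. It is not the rSDE-Bench harness: the `Counter` class, the `check_*` functions, and the `pass_rate` helper are hypothetical stand-ins for a generated program and its requirement-derived tests.

```python
import unittest


def pass_rate(suite: unittest.TestSuite) -> float:
    """Run a suite and return the fraction of test cases that passed."""
    result = unittest.TestResult()
    suite.run(result)
    if result.testsRun == 0:
        return 0.0
    # Failed assertions land in `failures`; any other exception lands in `errors`.
    not_passed = len(result.failures) + len(result.errors)
    return (result.testsRun - not_passed) / result.testsRun


class Counter:
    """Hypothetical stand-in for a piece of generated software."""

    def __init__(self) -> None:
        self.value = 0

    def click(self) -> None:
        self.value += 1


def check_initial_value() -> None:
    assert Counter().value == 0


def check_click_increments() -> None:
    counter = Counter()
    counter.click()
    assert counter.value == 1


def check_reset_returns_to_zero() -> None:
    # The stand-in never implemented `reset`, so this raises AttributeError
    # and is recorded as an error, i.e. a test case that did not pass.
    counter = Counter()
    counter.click()
    counter.reset()
    assert counter.value == 0


checks = (check_initial_value, check_click_increments, check_reset_returns_to_zero)
suite = unittest.TestSuite(unittest.FunctionTestCase(check) for check in checks)
rate = pass_rate(suite)
print(f"pass rate: {rate:.2%}")
```

Here two of the three requirement checks pass, so the unimplemented `reset` requirement lowers the score without any manual inspection of the code.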
| Model | % Basic | % Advanced | Codes | Results | Site |
|---|---|---|---|---|---|
| 🥇✅ EvoMAC | 89.38 | 65.05 | | | |
| 🥈 GPT-4o-Mini | 62.90 | 44.40 | | | |
| 🥉✅ ChatDev | 62.67 | 43.45 | | | |
| Claude-3.5-Sonnet | 58.90 | 37.11 | | | |
| ✅ MapCoder | 34.70 | 14.57 | | | |
| Gemini-1.5-Flash | 29.79 | 11.61 | | | |
| ✅ Autogen | 25.68 | 5.40 | | | |
| ✅ Agentverse | 15.41 | 0.00 | | | |
| ✅ MetaGPT | 15.41 | 0.00 | | | |

rSDE-Bench involves two typical real-world software types: website and game, which can reflect
different coding capacities demanded in realistic software development.
Website tasks emphasize static and dynamic content management, user interaction through forms and buttons, and correct display and functioning of page elements.
Game tasks require handling dynamic interactions and real-time state changes, focusing on elements such as logic execution, initialization, and game state transitions.
rSDE-Bench introduces two requirement difficulty levels: basic and advanced.
- The % Basic metric refers to the percentage of rSDE-Bench basic test cases
that were passed by the model.
- The % Advanced metric refers to the percentage of rSDE-Bench advanced test cases
that were passed by the model.
- ✅ indicates that the system is a multi-agent system; entries without ✅ are single-agent systems.
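The two metrics above can be sketched as a simple aggregation over per-test outcomes. This is a minimal illustration that assumes each metric is the number of passed test cases divided by the number of test cases at that difficulty level, as the definitions above read; the `CaseOutcome` record, the `percent_passed` helper, and the sample task names are hypothetical and not part of the benchmark's code.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CaseOutcome:
    """Hypothetical record of one test case's result."""

    task: str
    level: str  # "basic" or "advanced"
    passed: bool


def percent_passed(outcomes: Iterable[CaseOutcome], level: str) -> float:
    """Percentage of test cases at the given difficulty level that passed."""
    selected = [o for o in outcomes if o.level == level]
    if not selected:
        return 0.0
    return 100.0 * sum(o.passed for o in selected) / len(selected)


# Made-up outcomes for illustration only; these are not benchmark results.
outcomes = [
    CaseOutcome("game_a", "basic", True),
    CaseOutcome("game_a", "basic", True),
    CaseOutcome("site_b", "basic", True),
    CaseOutcome("site_b", "basic", False),
    CaseOutcome("game_a", "advanced", True),
    CaseOutcome("site_b", "advanced", False),
]

basic = percent_passed(outcomes, "basic")
advanced = percent_passed(outcomes, "advanced")
print(f"% Basic: {basic:.2f}, % Advanced: {advanced:.2f}")
```

Pooling test cases across tasks weights every test case equally; averaging per-task pass rates instead would weight every task equally and can give a different number.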