rSDE-Bench: Requirement-Oriented Software Development Benchmark

Code and data for paper: Self-Evolving Multi-Agent Collaboration Networks for Software Development

¹Cooperative Medianet Innovation Center, Shanghai Jiao Tong University · ²Beihang University · ³Shanghai AI Laboratory



Comparison between instruction-oriented and requirement-oriented evaluations. rSDE-Bench accurately reflects requirement fulfillment with the proposed accuracy score of 2/13, while the instruction-oriented evaluation misjudges with a high score (0.89), failing to detect missing functionality.

Abstract

rSDE-Bench is a requirement-oriented benchmark designed to evaluate the ability of models to handle software-level coding tasks. Unlike instruction-based approaches, rSDE-Bench uses detailed software requirements as input, specifying each functionality and constraint of the software. The benchmark includes automatic evaluation through unit tests, providing a more realistic assessment aligned with real-world software development practices. Read more about rSDE-Bench in our paper!


Benchmark Construction




rSDE-Bench comprises diverse software requirements, each paired with automated test cases, for a total of 53 unique coding tasks and 616 test cases. It is divided into two typical real-world software types, game and website, and introduces two requirement difficulty levels, basic and advanced, to reflect the varying complexity of real-world software development tasks.
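For concreteness, here is a minimal sketch of how one benchmark task might be laid out; the field names are illustrative assumptions, not the benchmark's actual schema:

```python
# Hypothetical layout of a single rSDE-Bench task (field names are
# illustrative assumptions, not the benchmark's actual schema).
task = {
    "task_id": "game_07",               # one of the 53 unique coding tasks
    "software_type": "game",            # "game" or "website"
    "requirement": "A snake game ...",  # long-form software requirement text
    "test_cases": [                     # drawn from the 616 total test cases
        {"level": "basic", "name": "test_initialization"},
        {"level": "advanced", "name": "test_state_transition"},
    ],
}
```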


Features


Challenging and diverse software requirements. rSDE-Bench features long-context software requirements (averaging 507 and 1,011 words for game and website tasks, respectively), unlike instruction-oriented benchmarks that rely on brief prompts. These detailed requirements better reflect the lengthy, complex nature of real-world software development.

Requirement-aware, precise, and efficient evaluation. rSDE-Bench employs detailed software requirements and automated unit tests to precisely measure how well generated software meets its objectives. Generated code is evaluated by the pass rate of task-specific test cases, making the process both accurate and efficient. In contrast, instruction-oriented benchmarks rely on brief prompts that lack explicit constraints, which makes evaluation less reliable and often requires labor-intensive or indirect assessment.
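As a rough sketch of this process (not the benchmark's official harness), evaluation reduces to running each task's unit tests against the generated code and reporting the pass rate; invoking pytest once per test file is an assumption made here for illustration:

```python
import subprocess

def pass_rate(code_dir: str, test_files: list[str]) -> float:
    """Fraction of a task's test cases that the generated software passes."""
    passed = 0
    for test_file in test_files:
        # Run one unit-test file against the generated software; pytest
        # exits with code 0 only if every test in the file passes.
        result = subprocess.run(
            ["pytest", test_file, "-q"],
            cwd=code_dir,
            capture_output=True,
        )
        if result.returncode == 0:
            passed += 1
    return passed / len(test_files)
```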


Leaderboard

| Model | % Basic | % Advanced | Codes | Results | Site |
| --- | ---: | ---: | :-: | :-: | :-: |
| 🥇✅ EvoMAC | 89.38 | 65.05 | 🔗 | 🔗 | 🔗 |
| 🥈 GPT-4o-Mini | 62.90 | 44.40 | 🔗 | 🔗 | 🔗 |
| 🥉✅ ChatDev | 62.67 | 43.45 | 🔗 | 🔗 | 🔗 |
| Claude-3.5-Sonnet | 58.90 | 37.11 | 🔗 | 🔗 | 🔗 |
| ✅ MapCoder | 34.70 | 14.57 | 🔗 | 🔗 | 🔗 |
| Gemini-1.5-Flash | 29.79 | 11.61 | 🔗 | 🔗 | 🔗 |
| ✅ Autogen | 25.68 | 5.40 | 🔗 | 🔗 | 🔗 |
| ✅ Agentverse | 15.41 | 0.00 | 🔗 | 🔗 | 🔗 |
| ✅ MetaGPT | 15.41 | 0.00 | 🔗 | 🔗 | 🔗 |


rSDE-Bench covers two typical real-world software types, website and game, which exercise the different coding capabilities demanded in realistic software development.
Website tasks emphasize static and dynamic content management, user interaction through forms and buttons, and verifying that page elements are displayed and functional.
Game tasks require handling dynamic interactions and real-time state changes, focusing on logic execution, initialization, and game state transitions.
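To make the game-side checks concrete, here is a hypothetical test case in that style; the `Game` stub stands in for generated code and is not an actual benchmark task:

```python
class Game:
    """Stub of generated game code, just enough to illustrate the test style."""
    def __init__(self):
        self.state = "RUNNING"   # target of the initialization check
        self.lives = 3

    def update(self):
        # One tick of game logic: transition state once no lives remain.
        if self.lives <= 0:
            self.state = "GAME_OVER"

def test_initialization():
    assert Game().state == "RUNNING"

def test_game_over_transition():
    game = Game()
    game.lives = 0
    game.update()
    assert game.state == "GAME_OVER"  # real-time state change observed
```

A website-style test would look similar but assert on rendered page content, e.g. that a required form or button appears in the page and responds to interaction.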

rSDE-Bench introduces two requirement difficulty levels: basic and advanced.
- The % Basic metric is the percentage of rSDE-Bench basic test cases passed by the model (computed as in the sketch after this list).
- The % Advanced metric is the percentage of rSDE-Bench advanced test cases passed by the model.
- ✅ indicates that the system is a multi-agent system; its absence denotes a single-agent system.
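
A worked example of the metric computation (the 2/13 case matches the teaser figure above):

```python
def pass_percentage(passed: int, total: int) -> float:
    """% Basic / % Advanced: test cases passed over total, as a percentage."""
    return 100.0 * passed / total

print(f"{pass_percentage(2, 13):.2f}%")  # 15.38%, the teaser's 2/13 example
```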


BibTeX

[TODO]