bin123apple / AutoCoder
- Sunday, June 2, 2024, 00:00:06
We introduce a new model designed for the code generation task. Its test accuracy on the HumanEval base dataset surpasses that of GPT-4 Turbo (April 2024) and GPT-4o (90.9% vs. 90.2%).
Additionally, compared with previous open-source models, AutoCoder offers a new feature: whenever the user wishes to execute the code, it can automatically install the required packages and attempt to run the code until it deems there are no issues.
Below are video demos comparing the code interpreters of GPT-4 Turbo and AutoCoder:
GPT-4o cannot access external libraries.
AutoCoder can automatically install the required packages. This feature expands the scope of the code interpreter's applications.
AutoCoder's code interpreter, like GPT-4 Turbo's, is invoked only when the user needs to verify the code, whereas OpenCodeInterpreter runs all generated Python code.
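The install-and-retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not AutoCoder's actual interpreter; the function name `run_with_auto_install` and the retry limit are hypothetical:

```python
import subprocess
import sys

def run_with_auto_install(code: str, max_retries: int = 3) -> None:
    """Execute generated code; if an import is missing, pip-install the
    package and retry. Hypothetical sketch, not AutoCoder's real loop."""
    for _ in range(max_retries):
        try:
            exec(code, {})  # run the generated snippet in a fresh namespace
            return
        except ModuleNotFoundError as e:
            # Install the missing package, then try the code again.
            subprocess.check_call([sys.executable, "-m", "pip", "install", e.name])
    raise RuntimeError("code still failing after installing missing packages")
```

In the real model, this loop would also feed runtime errors back to the model for repair until it deems there are no issues.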
The models are available on Hugging Face: AutoCoder (33B) and AutoCoder-S (6.7B).
The base model is deepseek-coder.
conda create -n AutoCoder python=3.11
conda activate AutoCoder
pip install -r requirements.txt
cd Evaluation
python test_humaneval.py
After this step you will receive a file named AutoCoder_HumanEval+.jsonl, which follows the EvalPlus format.
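A .jsonl file in this style holds one JSON object per line; a minimal sketch of loading such a file (the `task_id`/`solution` field names below are assumed from the EvalPlus sample format, not verified against this repo's output):

```python
import json

def load_evalplus_samples(path):
    """Read an EvalPlus-style .jsonl file: one JSON object per line,
    skipping any blank lines."""
    samples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                samples.append(json.loads(line))
    return samples
```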
Then follow the testing framework from the EvalPlus GitHub repository to see the results.
NOTE:
Use evalplus.sanitize to post-process the code. If you change the generation settings (e.g., do_sample=True), you will probably see different results.

python test_humaneval.py
Post-process to delete the natural language before testing:
python postprocess_mbpp.py
After this step you will get an AutoCoder_Mbpp+-sanitized.jsonl file, which contains all the extracted code blocks.
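The code-extraction step can be approximated with a regex over markdown fences. This is an illustrative sketch of the idea, not the repository's postprocess_mbpp.py, and it assumes responses wrap code in triple-backtick fences:

```python
import re

# Matches ```python ... ``` fences as well as bare ``` ... ``` fences.
FENCE = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

def extract_code_blocks(response: str) -> str:
    """Concatenate all fenced code blocks in a model response,
    dropping the surrounding natural-language text."""
    return "\n".join(m.strip() for m in FENCE.findall(response))
```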
Then test it directly with the EvalPlus GitHub framework (you don't need to use evalplus.sanitize to post-process the code this time).
python test_ds1000.py
After this step you will get a .jsonl file containing all the extracted code blocks. Then test it directly with the DS-1000 GitHub framework.
Install the Gradio-related packages:
cd /Web_demo
pip install -r requirements.txt
Run it:
python chatbot.py
NOTE:
Currently the model only starts the code interpreter if you ask it to verify its code. I am still fine-tuning it on an instruction dataset, which will give it the ability to launch the code interpreter whenever a user asks to run code. I will update the model when this is finished.
We suggest setting do_sample=True (the default setting here) while using the code interpreter.
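Greedy decoding (do_sample=False) always takes the highest-probability token, while do_sample=True draws from the distribution, which is why interpreter runs can differ between attempts. A toy illustration with a made-up next-token distribution (the tokens and probabilities are invented for the example):

```python
import random

# Hypothetical next-token distribution; values are probabilities.
probs = {"print": 0.6, "return": 0.3, "pass": 0.1}

def greedy(dist):
    """do_sample=False: always pick the most likely token."""
    return max(dist, key=dist.get)

def sample(dist, rng):
    """do_sample=True: draw a token proportionally to its probability."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

rng = random.Random(0)
print(greedy(probs))                             # always "print"
print({sample(probs, rng) for _ in range(50)})   # varies across draws
```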
If you have any inquiries, please feel free to raise an issue or reach out to leib2765@gmail.com.
@misc{lei2024autocoder,
title={AutoCoder: Enhancing Code Large Language Model with \textsc{AIEV-Instruct}},
author={Bin Lei and Yuchen Li and Qiuwu Chen},
year={2024},
eprint={2405.14906},
archivePrefix={arXiv},
primaryClass={cs.SE}
}
Thanks to Tianyu Zheng, the first author of OpenCodeInterpreter, for guidance on some technical details.