Beating GPT-4 on HumanEval with a Fine-Tuned CodeLlama-34B
We have fine-tuned CodeLlama-34B and CodeLlama-34B-Python on an internal Phind dataset that achieved 67.6% and 69.5% pass@1 on HumanEval, respectively. GPT-4 achieved 67% according to their official technical report in March. To ensure result validity, we applied OpenAI's decontamination methodology to our dataset.
August 28th Update: We've trained a new model, Phind-CodeLlama-34B-v2, that achieves 73.8% pass@1 on HumanEval.
The CodeLlama models released yesterday demonstrate impressive performance on HumanEval.
- CodeLlama-34B achieved 48.8% pass@1 on HumanEval
- CodeLlama-34B-Python achieved 53.7% pass@1 on HumanEval
We have fine-tuned both models on a proprietary dataset of ~80k high-quality programming problems and solutions. Instead of code completion examples, this dataset features instruction-answer pairs, setting it apart structurally from HumanEval. We trained the Phind models over two epochs, for a total of ~160k examples. LoRA was not used — both models underwent a native fine-tuning. We employed DeepSpeed ZeRO 3 and Flash Attention 2 to train these models in three hours using 32 A100-80GB GPUs, with a sequence length of 4096 tokens.
Phind-CodeLlama-34B-v2 was initialized from Phind-CodeLlama-34B-v1 and trained on 1.5B additional tokens.
Furthermore, we applied OpenAI's decontamination methodology to our dataset to ensure valid results, and found no contaminated examples. The methodology is:
- For each evaluation example, we randomly sampled three substrings of 50 characters or used the entire example if it was fewer than 50 characters.
- A match was identified if any sampled substring was a substring of the processed training example.
For further insights on the decontamination methodology, please refer to Appendix C of OpenAI's technical report. Presented below are the pass@1 scores we achieved with our fine-tuned models:
- Phind-CodeLlama-34B-v1 achieved 67.6% pass@1 on HumanEval
- Phind-CodeLlama-34B-Python-v1 achieved 69.5% pass@1 on HumanEval
- Phind-CodeLlama-34B-v2 achieved 73.8% pass@1 on HumanEval
Download
We are releasing all models on Huggingface to bolster the open-source community. We welcome independent verification of results.
Download Phind-CodeLlama-34B-v2