Beating GPT-4 on HumanEval with a Fine-Tuned CodeLlama-34B

We have fine-tuned CodeLlama-34B and CodeLlama-34B-Python on an internal Phind dataset that achieved 67.6% and 69.5% pass@1 on HumanEval, respectively. GPT-4 achieved 67% according to their official technical report in March. To ensure result validity, we applied OpenAI's decontamination methodology to our dataset.


August 28th Update: We've trained a new model, Phind-CodeLlama-34B-v2, that achieves 73.8% pass@1 on HumanEval.

The CodeLlama models released yesterday demonstrate impressive performance on HumanEval.

  • CodeLlama-34B achieved 48.8% pass@1 on HumanEval
  • CodeLlama-34B-Python achieved 53.7% pass@1 on HumanEval

We have fine-tuned both models on a proprietary dataset of ~80k high-quality programming problems and solutions. Instead of code completion examples, this dataset features instruction-answer pairs, setting it apart structurally from HumanEval. We trained the Phind models over two epochs, for a total of ~160k examples. LoRA was not used — both models underwent a native fine-tuning. We employed DeepSpeed ZeRO 3 and Flash Attention 2 to train these models in three hours using 32 A100-80GB GPUs, with a sequence length of 4096 tokens.

Phind-CodeLlama-34B-v2 was initialized from Phind-CodeLlama-34B-v1 and trained on 1.5B additional tokens.

Furthermore, we applied OpenAI's decontamination methodology to our dataset to ensure valid results, and found no contaminated examples. The methodology is:

  • For each evaluation example, we randomly sampled three substrings of 50 characters or used the entire example if it was fewer than 50 characters.
  • A match was identified if any sampled substring was a substring of the processed training example.

For further insights on the decontamination methodology, please refer to Appendix C of OpenAI's technical report. Presented below are the pass@1 scores we achieved with our fine-tuned models:

  • Phind-CodeLlama-34B-v1 achieved 67.6% pass@1 on HumanEval
  • Phind-CodeLlama-34B-Python-v1 achieved 69.5% pass@1 on HumanEval
  • Phind-CodeLlama-34B-v2 achieved 73.8% pass@1 on HumanEval

Download

We are releasing all models on Huggingface to bolster the open-source community. We welcome independent verification of results.

Download Phind-CodeLlama-34B-v2