The CodeLlama models released yesterday demonstrate impressive performance on HumanEval.
We have fine-tuned both models on a proprietary dataset of ~80k high-quality programming problems and solutions. Instead of code completion examples, this dataset features instruction-answer pairs, setting it apart structurally from HumanEval. We trained the Phind models over two epochs, for a total of ~160k examples. LoRA was not used; both models underwent native fine-tuning. We employed DeepSpeed ZeRO 3 and Flash Attention 2 to train these models in three hours using 32 A100-80GB GPUs, with a sequence length of 4096 tokens.
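For a rough sense of scale, the training figures above can be turned into a small budget calculation. This is only an illustrative sketch: the function name `training_budget` is hypothetical, and the token count is an upper bound that assumes every example fills the full 4096-token sequence, which the text does not state.

```python
# Rough training-budget arithmetic from the figures quoted above.
# The token figure is an UPPER BOUND: it assumes every example occupies
# the full sequence length, which is an assumption, not a reported number.

DATASET_EXAMPLES = 80_000   # ~80k instruction-answer pairs
EPOCHS = 2                  # two passes over the dataset
SEQ_LEN = 4096              # tokens per training sequence
NUM_GPUS = 32               # A100-80GB GPUs
WALL_CLOCK_HOURS = 3        # reported training time


def training_budget(examples: int, epochs: int, seq_len: int,
                    gpus: int, hours: float) -> dict:
    """Return examples seen, GPU-hours, and an upper bound on tokens processed."""
    examples_seen = examples * epochs
    return {
        "examples_seen": examples_seen,
        "gpu_hours": gpus * hours,
        "max_tokens": examples_seen * seq_len,
    }


budget = training_budget(DATASET_EXAMPLES, EPOCHS, SEQ_LEN,
                         NUM_GPUS, WALL_CLOCK_HOURS)
print(budget)
```

The `examples_seen` value reproduces the ~160k figure quoted above; the other two numbers are derived estimates rather than reported results.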
Phind-CodeLlama-34B-v2 was initialized from Phind-CodeLlama-34B-v1 and trained on 1.5B additional tokens.
Furthermore, we applied OpenAI's decontamination methodology to our dataset to ensure valid results, and found no contaminated examples. The methodology is:
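As described in Appendix C of OpenAI's GPT-4 technical report, the check strips spaces and symbols from both the evaluation and training text, samples three random 50-character substrings from each evaluation example (or uses the whole example if it is shorter than 50 characters), and flags a training example as contaminated if any sampled substring occurs in it. The snippet below is a minimal sketch of that idea, not our actual pipeline; the function names and the use of `str.isalnum` for normalization are assumptions made for illustration.

```python
import random

SUBSTRING_LEN = 50  # characters per sampled substring
NUM_SAMPLES = 3     # substrings sampled per evaluation example


def normalize(text: str) -> str:
    """Drop spaces and symbols, keeping only letters and digits."""
    return "".join(ch for ch in text if ch.isalnum())


def sample_substrings(eval_text: str, rng: random.Random) -> list[str]:
    """Sample fixed-length substrings from a normalized evaluation example."""
    processed = normalize(eval_text)
    if len(processed) < SUBSTRING_LEN:
        # Short examples are used whole.
        return [processed]
    max_start = len(processed) - SUBSTRING_LEN
    starts = [rng.randint(0, max_start) for _ in range(NUM_SAMPLES)]
    return [processed[s:s + SUBSTRING_LEN] for s in starts]


def is_contaminated(train_text: str, eval_text: str, rng: random.Random) -> bool:
    """Flag a match if any sampled eval substring appears in the training text."""
    processed_train = normalize(train_text)
    # Skip empty substrings: "" is trivially contained in every string.
    return any(sub and sub in processed_train
               for sub in sample_substrings(eval_text, rng))
```

In practice each training example would be checked against every evaluation example, and any flagged training example would be dropped before fine-tuning.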
For further insights on the decontamination methodology, please refer to Appendix C of OpenAI's technical report. Presented below are the pass@1 scores we achieved with our fine-tuned models:
We are releasing all models on Hugging Face to bolster the open-source community, and we welcome independent verification of our results.