Introducing Phind-405B and faster, high-quality AI answers for everyone

We're introducing a new flagship model, Phind-405B, along with a new Phind Instant model that offers lightning-fast search speeds for all of your programming and curiosity questions.

Introducing Phind-405B

We're excited to announce the launch of Phind-405B, our new flagship model. Built on the excellent Meta Llama 3.1 405B, it has been trained to be a state-of-the-art model for programming and technical tasks. Phind-405B supports 128K tokens of context, with a 32K context window available at launch. Phind-405B is available now to all Phind Pro users.

Phind-405B scores 92% on HumanEval (0-shot), matching Claude 3.5 Sonnet. We're particularly happy with its performance on real-world tasks, especially when it comes to designing and implementing web apps. For example, when tasked with making a landing page for Paul Graham's Founder Mode, Phind performs multiple searches to research the topic and then creates multiple options.

Screenshots: Phind-405B's Founder Mode search, along with two Founder Mode landing pages generated by Phind-405B.

We trained this model on 256 H100 GPUs in FP8 mixed precision using DeepSpeed and the MS-AMP library. We empirically confirmed that FP8 mixed precision training shows no degradation relative to classic BF16 training while reducing memory usage by 40%.
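As a rough illustration of how FP8 mixed precision training can be set up, here is a minimal toy sketch using MS-AMP's standalone API. The model, optimizer, and training loop are placeholders rather than our actual setup, and the DeepSpeed integration (which is configured through the DeepSpeed config rather than shown here) is omitted.

```python
# Minimal toy sketch of FP8 mixed-precision training with MS-AMP.
# Illustrative only: the model, batch, and loss are placeholders.
import torch
import msamp

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# opt_level "O2" keeps weights and optimizer states in low precision,
# which is where most of the memory savings over plain BF16 comes from.
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")

for step in range(10):
    x = torch.randn(8, 4096, device="cuda")
    with torch.autocast("cuda", dtype=torch.bfloat16):
        loss = model(x).pow(2).mean()   # dummy loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```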

Introducing the new Phind Instant model

A key issue with AI-powered search is that it is just too slow compared to classic Google. Even if it generates a better answer, the added latency is discouraging.

That is why we're excited to introduce a new Phind Instant model that's trained on a similar dataset to Phind-405B and runs at up to 350 tokens per second. The model, based on Meta Llama 3.1 8B, runs on a Phind-customized NVIDIA TensorRT-LLM inference server that delivers extremely high throughput on H100 GPUs.

We start by running the model in FP8, and we also enable flash decoding and fused CUDA kernels for the MLP layers. With these speed and quality improvements, we think the new Phind Instant offers the best experience for quickly finding and summarizing information.
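For illustration, here is a minimal sketch of serving an FP8-quantized Llama 3.1 8B with TensorRT-LLM's high-level Python API. The class and parameter names follow recent public TensorRT-LLM examples and may differ across versions, and the base model name is a stand-in for the Phind Instant weights rather than our actual serving stack.

```python
# Illustrative sketch, not Phind's customized inference server.
from tensorrt_llm.llmapi import LLM, SamplingParams, QuantConfig, QuantAlgo

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",            # stand-in for the Phind Instant weights
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),  # run in FP8 on H100s
)

outputs = llm.generate(
    ["Explain how HTTP/2 multiplexing differs from HTTP/1.1 pipelining."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```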

Improvements to search

Building on our model runtime improvements, we realized that we can usually start fetching web results before the user has finished typing. We trained a model that determines when such prefetches should be performed, saving up to 800ms of latency per search.
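As a hypothetical sketch of how this kind of prefetching could be wired up, the snippet below kicks off a search as soon as a stub classifier fires on the partial query and reuses the result when the user submits. The should_prefetch and fetch_web_results helpers are illustrative stand-ins, not the actual prefetch model.

```python
# Hypothetical prefetch flow: names and timings are illustrative only.
import asyncio

async def fetch_web_results(query: str) -> list[str]:
    await asyncio.sleep(0.8)              # stand-in for ~800 ms of search latency
    return [f"result for {query!r}"]

def should_prefetch(partial_query: str) -> bool:
    # A trained model would decide here; this stub just waits for a few words.
    return len(partial_query.split()) >= 3

async def on_keystroke(partial_query: str, state: dict) -> None:
    if state.get("prefetch") is None and should_prefetch(partial_query):
        # Start fetching before the user hits enter.
        state["prefetch"] = asyncio.create_task(fetch_web_results(partial_query))

async def on_submit(query: str, state: dict) -> list[str]:
    task = state.get("prefetch")
    # Reuse the prefetched results if available; otherwise fetch now.
    return await task if task else await fetch_web_results(query)
```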

We've also upgraded the embeddings we use to determine which parts of the retrieved text are most relevant to feed into the model. We've moved to a new model that is 15x larger than our previous one while simultaneously reducing latency by introducing 16-way parallelism into its computation.
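The snippet below is a hypothetical sketch of this kind of sharded reranking: retrieved chunks are split into 16 shards, embedded in parallel, and ranked by cosine similarity against the query. The embed function is a placeholder, not the actual embedding model.

```python
# Hypothetical 16-way parallel embedding and reranking sketch.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: a real implementation would call the embedding model here.
    rng = np.random.default_rng(abs(hash(tuple(texts))) % 2**32)
    return rng.standard_normal((len(texts), 1024))

def rank_chunks(query: str, chunks: list[str], ways: int = 16) -> list[str]:
    # Split the retrieved chunks into up to `ways` shards and embed them in parallel.
    shards = [chunks[i::ways] for i in range(ways) if chunks[i::ways]]
    with ThreadPoolExecutor(max_workers=ways) as pool:
        shard_embs = list(pool.map(embed, shards))
    embs = np.vstack(shard_embs)
    order = [c for shard in shards for c in shard]   # same order as embs rows
    q = embed([query])[0]
    # Cosine similarity between the query and every chunk, highest first.
    scores = embs @ q / (np.linalg.norm(embs, axis=1) * np.linalg.norm(q))
    return [order[i] for i in np.argsort(-scores)]
```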

Better and faster AI answers for everyone

At Phind, our goal is to help developers experiment faster and go from zero to one on new ideas. We're working on new application-level features for developers powered by the models announced today, and we can't wait to share them with you.

It also happens that focusing on challenging technical queries makes Phind a great answer engine for all kinds of questions. We use Phind ourselves to indulge our own curiosity and are excited to hear what you discover next.

We would like to thank our partners at Meta, NVIDIA, Voltage Park, SF Compute, and AWS.