
Russell Coleman · February 13, 2025
The Creation of Phind 2.0
Introduction
Language models have evolved by leaps and bounds—but not without frustrating gaps. Ask a model something that seems simple, like "How many ‘r’s are in ‘strawberry’?" or "What’s 4128421 × 2312314?", and you might get a refusal or a hallucinated result. The LLM doesn’t know when to seek out information, doesn’t understand its own capabilities, and can’t tell where it is likely to fail. That’s unacceptable for an application with an audience of the general public.
At Phind, we set out to build a high-quality answer engine that can tackle these pitfalls. Our end goal: a system that can handle tricky queries, incorporate multiple data sources, display rich, interactive outputs, and generally act like a robust, next-gen search-and-answer platform. Below is an inside look at why we built it this way, what we learned, and how we overcame the big, gnarly challenges that come with post-training our own large language model.
1. Key Shortcomings of Current-Generation LLMs
We’ve identified three major issues that prevent existing language models from truly shining in an answer-engine setting:
- They mostly output text/Markdown. Markdown is great, but not always the best final presentation. Sometimes you want a flowchart, a code block you can run, or a data table that actually works.
- They’re bad at certain “simple” tasks. Counting letters in a word, doing exact math on large multiplications, or verifying code correctness can trip them up. This leads to embarrassing “I don’t know” moments or outright hallucination.
- They can’t gather information on their own. Pure LLM-based products can’t actually open up a web browser or fetch data mid-answer. They rely on whatever context you feed them upfront, or they just guess.
When a new or casual user hits these limitations, they rightfully wonder: Why can’t this fancy AI solve a simple counting puzzle or retrieve the correct formula for my function derivative? We couldn’t just brush them off with “LLMs don’t do that yet.” Instead, we took a different approach: teach the model how to solve them by fine-tuning on a huge custom dataset, layering in new “capabilities,” and carefully engineering how (and when) the model uses them.
2. Learnings From Real Users
Before coding new features, we needed to deeply understand user pain points. For two weeks, I read through real user queries (from free users who hadn’t opted out of data collection). I noted down instances where the model didn’t shine—whether it hallucinated, got confused, or just presented info in a messy way.
Common User Frustrations
- Insufficient Web Results. We fed web results to the model, but sometimes they just didn’t contain the right info. The model would then hallucinate or say “I don’t know” instead of offering more solutions.
- Wanted Visuals but Got Text. A query like “What is Kim Kardashian’s family tree?” screams for a flowchart. Traditional LLMs return text-based genealogies that are hard to parse.
- Inadequate Charts/Graphs. Folks asking about historical rent prices in San Francisco want a quick chart. Plain text comparisons aren’t as helpful.
- Bad Code. The model can generate code, but it often fails to run it or verify correctness.
- Inline Rendering. For something like “Explain CSS flexbox with examples,” an actual rendered example would be game-changing.
- Answer Profile Confusion. Our custom “answer profile” feature got appended to the model’s prompt, which sometimes confused the LLM.
- Table Woes. Users love tables for structured data. Markdown tables from the model would often be misaligned or incomplete, and the model rarely used them.
- Complex Code Comparisons. If you want to see multiple coding solutions side by side, text lists get unwieldy.
- Math and String Inaccuracies. The LLM regularly failed at tasks like counting letters in “supercalifragilisticexpialidocious” or performing large multiplications.
- Streaming glitches. Seemingly simple things like rendering LaTeX smoothly without flickering are a surprising pain when trying to render token-by-token.
- No Mid-Answer Web Search. The LLM couldn’t refine its own answers with an extra web query on the fly.
- Images from the Web. We had no integrated way to show external images inline in the answer; they could only appear separately from it.
- Attachment Integration. We wanted the model to seamlessly handle images, PDFs, and more.
- Thread context. The model didn’t understand the context behind its own thread, e.g. if the user cut off a previous answer halfway.
- Phind Questions. People sometimes ask about Phind itself (“What is the latest Phind model?”). Our old model would give an “I don’t know” or generic nonsense.
- Time Awareness. We fed the model the user’s current time, but it had no clue how to reference it.
- Context Poisoning. Extra web results sometimes confused the model, making it produce worse code than if it had no external context at all.
- Common Web Queries. “What’s the weather in San Francisco?” or “Stock price for AAPL?”—the old system had no neat “card” or easy follow-up flow.
- Stock Questions. Similarly, “latest data for Tesla” wasn’t integrated well.
- Generate Images/Videos. People kept asking the model for a quick mockup or a meme.
- Spicing Up the UI. We wanted to add icons, footnotes, and more visual flair to the user experience.
From these user failures, it became obvious we needed a richer approach to content generation—one that integrates code execution, charting, real-time data, better formatting, and the ability to poke external resources whenever needed.
3. The Capabilities We Gave Our Model
After clarifying these user pain points, we added new “capabilities”—functions the model can call (or choose not to call) during its answer generation. Each capability is a first-class citizen, with its own specialized input/output. Here’s the final lineup:
3.1 Jupyter/Python Code Execution 🐍
- What It Does: Lets the model run Python code in a sandboxed Jupyter environment.
- Why It Matters: The model can verify its math or code on the spot—no more random guesswork. It can even produce plots or images via matplotlib.
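For example, the two embarrassing queries from the introduction become trivial once the model can hand them off to a sandboxed interpreter. Here’s an illustrative snippet of the kind of code it might run (not actual model output):

```python
# Counting letters and doing exact arithmetic: trivial for Python,
# unreliable for a bare LLM.
word = "strawberry"
print(word.count("r"))    # prints 3

print(4128421 * 2312314)  # prints 9546205676194
```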
3.2 Mermaid Diagrams 📊
- What It Does: Generates diagrams (flowcharts, sequence diagrams, etc.) inline using Mermaid syntax.
- Why It Matters: Perfect for visually representing family trees, state diagrams, or class structures.
3.3 Web Search Integration 🔍
- What It Does: Performs web searches mid-answer, up to 20 times (with 3 “non-summarized” queries max).
- Why It Matters: The model can gather fresh info from the web, reducing hallucinations when queries go off the beaten path.
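As a rough, purely illustrative sketch (not our production code), enforcing that per-answer search budget might look something like this:

```python
# Illustrative only: track the per-answer search budget described above.
class SearchBudget:
    MAX_TOTAL = 20  # total searches allowed per answer
    MAX_RAW = 3     # "non-summarized" queries allowed per answer

    def __init__(self) -> None:
        self.total = 0
        self.raw = 0

    def allow(self, summarized: bool) -> bool:
        """Return True if another search is permitted, and record it."""
        if self.total >= self.MAX_TOTAL:
            return False
        if not summarized and self.raw >= self.MAX_RAW:
            return False
        self.total += 1
        if not summarized:
            self.raw += 1
        return True
```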
3.4 Phind Questions ℹ️
- What It Does: Queries Phind’s internal docs to answer questions about itself.
- Why It Matters: If the user wants to subscribe to Pro or asks, “Which model version am I using right now?”, the LLM can look it up.
3.5 Interactive Code Display 💻
- What It Does: Creates interactive code blocks that users can edit and run in the sidebar.
- Why It Matters: You can test Python, JavaScript, React, and more directly in your browser. The model even sees if the code fails.
3.6 Interactive Code Block Editing ✏️
- What It Does: Allows modification of previously created interactive code blocks to generate a new version.
- Why It Matters: Encourages an iterative workflow: user modifies the code → the model sees the changes → the user runs it again.
3.7 User Code Block Editing Understanding
- What It Does: The model “remembers” user edits, knows if the code ran successfully, or if an error occurred.
- Why It Matters: Makes the coding experience far less disjointed.
3.8 Attachment Analysis 📎
- What It Does: Analyzes uploaded images or PDFs.
- Why It Matters: The model can describe an image or parse text in a PDF without you having to manually transcribe it.
3.9 AI Image Generation 🎨
- What It Does: Uses diffusion models to create images from a text prompt.
- Why It Matters: Great for quick illustrations, mockups, or creative brainstorming.
3.10 Weather Information 🌤️
- What It Does: Displays multi-day weather forecasts for a given location.
- Why It Matters: Users can get up-to-date weather data in the same chat, then ask follow-up questions (“Will it rain in New York on Friday?”).
3.11 Stock Market Data 📈
- What It Does: Pulls live stock data for specified tickers.
- Why It Matters: No more “Let me open a separate tab for finance data.” It’s all in one place.
3.12 Contextual Understanding
- What It Does: Knows the user’s “answer profile,” the current time, and more.
- Why It Matters: Smoother conversation with personalized quirks or disclaimers.
3.13 Displaying Images from the Web
- What It Does: Inline image display from a given URL or search.
- Why It Matters: No extra clicks or tab-switching—visuals on the spot.
3.14 Rich Citations
- What It Does: Cites not just URLs but exact snippets or specific timestamps in YouTube videos.
- Why It Matters: More transparency; you can see exactly where info came from.
3.15 Rich Output
- What It Does: Produces specialized, structured output (like code comparisons, weather cards) without freehand HTML.
- Why It Matters: Minimizes the risk of messy formatting. The model can confidently produce complex UI elements.
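Stepping back, the common thread is that every capability above is exposed to the model as a callable unit with structured input and output. Here’s a minimal Python sketch of that idea (hypothetical names and shapes; the real system is built on special tokens, not a Python registry):

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Illustrative only: a capability declares a name, a description, and a
# handler mapping structured input to structured output.
@dataclass
class Capability:
    name: str
    description: str
    handler: Callable[[dict], dict]

REGISTRY: Dict[str, Capability] = {}

def register(cap: Capability) -> None:
    REGISTRY[cap.name] = cap

def run_capability(name: str, payload: dict) -> dict:
    """Invoked when the model emits a capability call mid-answer."""
    return REGISTRY[name].handler(payload)

# Hypothetical registration of a weather capability:
register(Capability(
    name="weather",
    description="Multi-day forecast for a location",
    handler=lambda p: {"location": p["location"], "forecast": "stubbed"},
))

print(run_capability("weather", {"location": "San Francisco"}))
```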
4. The Dataset: Strategy and Challenges
To teach the model how to consistently produce these new capabilities, we created a fine-tuning dataset with over 200k examples. Building this dataset was a monumental task, given our:
- Budget: We kept the budget very small. We don’t want to publish the exact number, but it’s laughable compared to most training runs you read about online.
- Time Constraints: We needed to ship quickly.
- Diverse Requirements: Code verification, diagram generation, string counting, image retrieval, weather queries, stock data… you name it.
4.1 Synthetic Data Generation
We quickly realized that human-produced data at this scale was cost-prohibitive. So we built a pipeline to:
- Generate user queries (real or synthetic).
- Ask the model for a “best possible answer” with its internal knowledge.
- Inject web results to see if the model improved or needed further searching.
- Check new capabilities to see whether they could enhance the answer. (Is a flowchart better than text? Would a table help? Should we spin up a Python snippet to verify code?)
- Apply “critics”: a swarm of smaller classification/prompting steps that review the answer, flag mistakes, and force the model to rewrite if needed.
- Combine and finalize everything into one cohesive “example” for the training set.
Each final example took many sub-LLM calls to produce—5–10+ minutes per sample. The code pipeline that orchestrated this process got extremely complex. But in the end, we had a treasure trove of multi-step, multi-capability interactions.
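Here’s a heavily simplified, hypothetical sketch of that loop (every helper stands in for one or more LLM calls or small classifiers in the real pipeline; the stubs only show the shape):

```python
from typing import Callable, List

def best_effort_answer(query: str) -> str:
    return f"draft for: {query}"              # 2. answer from internal knowledge

def inject_web_results(query: str, draft: str) -> str:
    return draft                              # 3. add web context if it helps

def add_capabilities(query: str, draft: str) -> str:
    return draft                              # 4. diagram? table? Python check?

Critic = Callable[[str, str], List[str]]      # (query, draft) -> list of issues

def mermaid_critic(query: str, draft: str) -> List[str]:
    return []                                 # stub: flag broken diagram syntax

CRITICS: List[Critic] = [mermaid_critic]      # 5. swarm of small review passes

def rewrite(draft: str, issues: List[str]) -> str:
    return draft                              # force a rewrite when critics object

def build_training_example(query: str) -> dict:   # 1. start from a user query
    draft = best_effort_answer(query)
    draft = inject_web_results(query, draft)
    draft = add_capabilities(query, draft)
    for critic in CRITICS:
        issues = critic(query, draft)
        if issues:
            draft = rewrite(draft, issues)
    return {"query": query, "answer": draft}      # 6. one cohesive training example
```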
4.2 Weighted Categories
We also assigned weights to different types of queries:
- Existing free user queries from the old website (weighted 15)
- Writing interactive code blocks (weighted 18)
- Math problems (weighted 8)
- String problems (weighted 6)
- Weather queries (weighted 3)
- Stock prices (weighted 3)
- Requests to generate a diffusion image (weighted 3)
- Requests to generate a diffusion video (weighted 2.5)
- Requests for information retrieval that require multiple steps of web searches (e.g. find the top 10 Fortune 500 CEOs and list each one’s hometown)
- Attachment meta questions (things like how many attachments do I have?) (weighted 1)
- Questions about attached images (things like what color is the fruit in this image) (weighted 4)
- Analyzing attached CSVs (e.g. make a histogram of this) (weighted 7)
- “Wikipedia”-style questions (i.e. if the user just types “San Francisco”, we should give an overview of San Francisco) (weighted 9)
This ensures the final model sees an appropriate variety of tasks—rather than fixating on just one or two capabilities.
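As a simple illustration of how weights like these translate into a sampling mix (the weights come from the list above; the sampling code itself is just a sketch, not our actual pipeline):

```python
import random

# Category weights from the list above (the multi-step retrieval category
# is omitted here because its weight wasn't specified).
CATEGORY_WEIGHTS = {
    "existing_free_user_queries": 15,
    "interactive_code_blocks": 18,
    "math_problems": 8,
    "string_problems": 6,
    "weather_queries": 3,
    "stock_prices": 3,
    "diffusion_image_requests": 3,
    "diffusion_video_requests": 2.5,
    "attachment_meta_questions": 1,
    "attached_image_questions": 4,
    "attached_csv_analysis": 7,
    "wikipedia_style_questions": 9,
}

def sample_category() -> str:
    """Pick the category for the next synthetic example, proportional to weight."""
    categories = list(CATEGORY_WEIGHTS)
    weights = [CATEGORY_WEIGHTS[c] for c in categories]
    return random.choices(categories, weights=weights, k=1)[0]
```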
4.3 Trials and Tribulations
Despite the robust design:
- Over-Enthusiasm: The model loves to show off new toys (“Add a diffusion image just because!”). We added critics to rein in these impulses.
- Diagram Disasters: Generating Mermaid diagrams is tricky. We used ~20 “critics” to spot broken syntax, then force rewrites.
- Causality: Autoregressive models guess the next token. If the model references something before it’s introduced, that’s a timeline violation. We needed extra logic to ensure no “time travel” (see the sketch after this list).
- General misunderstandings: Current models can’t reliably follow instructions, and getting them to generate good data and place things where they should go just doesn’t work with prompting alone, even for specialized agents working on just one part of the answer. We had to create special pipelines for each and every capability, with lots of human-annotated critics built up through trial and error.
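Here’s a rough, illustrative sketch of the kind of “no time travel” check we mean (hypothetical data shapes, not the real validator):

```python
from typing import List, Tuple

# Each step is (kind, artifact_id): "introduce" means a capability call
# produced the artifact; "reference" means the answer text uses it.
Step = Tuple[str, str]

def violates_causality(timeline: List[Step]) -> bool:
    introduced = set()
    for kind, artifact_id in timeline:
        if kind == "introduce":
            introduced.add(artifact_id)
        elif kind == "reference" and artifact_id not in introduced:
            return True   # referenced before it exists: time travel
    return False

# Example: referencing a chart before the code that produced it ran.
assert violates_causality([("reference", "chart_1"), ("introduce", "chart_1")])
assert not violates_causality([("introduce", "chart_1"), ("reference", "chart_1")])
```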
For a while, I wasn’t sure we’d ever get the hodgepodge of monster code to work; slowly but surely, it began to produce reliable output. Tuning is still far from complete, however, and we hope to bring an even stronger v1.5 in the coming weeks.
Once we got it trained on the data, the first time I got to see my baby in action and watch it use its capabilities and learn my function call syntax was magical. In practice, the ability to orchestrate code execution, web searches, diagrams, and more is absolutely delightful, and it’s a tool I use myself every day.
5. Interesting Tidbits & Observations
5.1 Emergent Capabilities
Some unexpected perks popped up:
- Special Token Mastery: The model developed a really comprehensive internal understanding of the way we feed it data, and it can answer questions about the message metadata, attachment structure, etc. correctly, despite having never been trained on it. It even understands meta information about the markup language it has been trained to produce (though this only surfaces with very deliberate prompting).
- Multi-City Weather: Even though not a single example in the training set asks for the weather of multiple cities (I was speeding through that part; weather was a last-ditch addition), it can handle those queries perfectly. It actually made up its own syntax to do it, but it does so consistently every time, and the syntax matches our other conventions for list input to a capability, so we just added support for that on the backend, and it just works. Who knew.
- This is actually a pattern repeated throughout; it makes things up but follows our format, and in the spirit of “the customer is always right” we’ve decided the model is always right, and added backend support for all of its quirks.
- Distinct Personality: We trained a couple of different versions of these models (different checkpoints, slightly different data mixtures, and different batch sizes), and we noticed each version had a remarkably different personality and its own set of quirks. For example, one version never wanted to add any icons. Another version loved putting Mermaid diagrams before the code every time (kind of annoying). We’re very happy with the final one we selected; it has the best personality of the options. You may notice it will match your energy (try hitting it with a “yassss queen 💅” and watch it respond in kind).
- Humorous ability: We had great fun messing with the answer profiles. Try setting it to always answer like a pirate, or like Donald Trump, or to forget what it said every two sentences. The model is remarkably good at this and cracks some pretty good puns.
5.2 Model Upselling
We now let the LLM see user subscription tiers. If a user tries to do something that requires Pro (e.g., AI image generation) but they’re on a Free plan, the model itself can upsell the user—leading them to the relevant subscription page. We hope this will drive better conversion than typical upsell tactics, and it was kind of fun.
5.3 Handling Web Results Toggles & Thoughts
We feed tokens for each “event” or “toggle” to the model. This means it knows if the user turned on “enable additional web searches” or “think before answering.” There’s no separate model for each setting—one model handles them all. The timeline is also saved, so the model knows exactly which settings were active for each previous answer, and when they were changed.
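As a purely illustrative example of the information those event tokens carry (in the real system this lives in special tokens, not Python objects):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    kind: str        # "toggle_change", "user_message", "assistant_message", ...
    timestamp: str
    payload: dict

timeline: List[Event] = [
    Event("toggle_change", "2025-02-13T09:00:00Z",
          {"setting": "think_before_answering", "value": True}),
    Event("user_message", "2025-02-13T09:00:05Z",
          {"text": "What's the weather in San Francisco?"}),
    Event("toggle_change", "2025-02-13T09:02:00Z",
          {"setting": "additional_web_searches", "value": False}),
]
# Because the full ordered history is preserved, the model can tell which
# settings were active for each previous answer and when they changed.
```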
Give the “think before answering” toggle a try. It’s still in the early stages, but we’ve found it gives much better results. Contrary to most other “reasoning”-type models, our thoughts focus much more on how to present the answer than on what the answer is.
We may or may not have a reasoning model in the pipeline, however, where we hope to combine the richness of our output and search data with the raw power of latest-gen reasoning models.
5.4 Hierarchy of Instructions
We taught the model to never let user instructions override certain core rules (e.g., refusals, answer profile, system constraints). The model follows this chain of command:
1. refusal
2. capabilities
3. limitations
4. answer profile
5. system prompt
6. context (timestamp, etc.)
7. user instructions
8. search results
So far, it’s been robust in non-adversarial contexts.
5.5 Code Block Revisions
One of our upgrades was “interactive code blocks”: the model generates code that appears in the sidebar rather than inline, signaling that the user can edit and run it on our servers.
Traditionally, if the user edits model-generated code, the model knows nothing about those edits. We wanted a more collaborative environment. Now, whenever you update an interactive code block, that action (and its results/errors) goes back to the model so it can incorporate it into subsequent answers.
This is pretty huge and something I’ve seen no other model provider do. It makes the experience feel much more interactive, like the model actually understands your code edits and fully grasps the distinction between its code and your changes.
To see this in action, try saving different revisions of interactive code blocks, and chatting with the model about your changes.
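As a rough illustration (hypothetical field names; the actual representation is token-based), the feedback the model receives after an edit-and-run might look something like this:

```python
# Hypothetical shape of the record sent back to the model after a user
# edits and runs an interactive code block.
code_block_revision = {
    "block_id": "interactive-42",
    "revision": 3,
    "diff": "-    total = 0\n+    total = 0.0",   # what the user changed
    "ran": True,
    "exit_status": "error",
    "stderr": "TypeError: unsupported operand type(s) ...",
}
# On the next turn, the model sees this record and can respond to the edit
# and the resulting error instead of guessing what happened.
```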
Note that our interactives frontend is still not 100% baked (e.g. right now it won’t show you the model’s thoughts and doesn’t allow the user to run the code), but we hope to get that fully completed soon.
5.6 Errors Not in the Dataset
The model sometimes makes mistakes that aren’t found anywhere in the dataset. This is the flip side of the cool emergent things it can do. There are quite a few of these; here are a couple of examples:
- Random Imports: When making interactive code blocks, the code it writes is placed in our sandbox environment. The model is free to import anything it wants in non-interactive blocks (i.e. any code block shown inline in the answer), but when using our special interactive runner it must stick to our allowed imports - and it doesn’t. The training data was perfect (every example was validated to be completely correct), yet it still does it.
- Invalid RXML: Our “rich” output format sometimes sees the model produce invalid nesting or attributes. Possibly it just needs more training time.
5.7 Special Output Format (RXML)
Instead of spitting out raw HTML or Markdown (prone to flickers and escaping nightmares), we created RXML: a safe, structured token-based format that keeps text and “layout” separate. In RXML, the model can embed code blocks, diagrams, weather widgets, images, etc. in a robust tree structure that the frontend knows how to render.
I called it RXML (Russell XML) for ease of explanation (because it is similar to XML in that it has tags with attributes and children), but it actually has no text-based representation at all; it’s completely defined by special tokens and differs from XML in many ways (e.g. every attribute can contain a list of items, text items can contain multiple distinct children, etc.).
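To give a feel for the structure anyway, here’s a purely illustrative model of an RXML-like node tree with those properties (this is not RXML itself, which has no Python or text form):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

@dataclass
class Node:
    tag: str
    attrs: Dict[str, list] = field(default_factory=dict)    # every attribute is a list
    children: List[Union[str, "Node"]] = field(default_factory=list)

answer = Node("answer", children=[
    "Here is the forecast:",
    Node("weather_card", attrs={"locations": ["San Francisco", "New York"]}),
    Node("code_block", attrs={"language": ["python"], "interactive": [True]},
         children=["print('hello')"]),
])
```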
This is worth a whole blog post of its own, which we will need to write out very precisely if we open-source the model in the future. That’s currently a barrier to open-sourcing this version: it would be nearly impossible for someone to use it without a detailed writeup and scaffolding code. This is the downside of making things so custom.
6. Conclusion
We set out with a shoestring budget to try and create the best dataset we could for this new model. It’s just three of us at Phind, with me (Russell) working on the model/dataset, Michael working on the backend/training infra, and Grant working on the frontend/database, although we all dabble with a bit of everything.
Creating this with just the three of us was a huge challenge, and the two biggest issues we faced as a team were:
- Lack of dev time between us
- Lack of budget
As for (#2), I was hyper-aware for the months of working on the dataset that a single mistake would kill it. If I forgot one thing or screwed up one part of the data (violating causality, forgetting to put the diagrams into the final tokens, etc.), it might make the model completely unusable and we’d have to do the data generation all over again.
However, it was as rewarding as it was challenging, and seeing it all come together was like nothing else. There’s still far more to go, but we strongly feel that we are directionally correct and will not stop shipping.
This was a massive learning experience, and we believe we’ve developed a new answer-first pipeline that will help Phind grow into its potential, by starting with what we want the answer to look like, and forcing the model to adapt to our needs, not vice-versa.
Summary of our key insights:
- Pinpoint real user needs (like inline code execution, correct math, or visual diagrams).
- Engineer new capabilities that the model can selectively invoke (like weather data or AI image generation).
- Assemble a massive, synthetic dataset that taught the model how to chain these capabilities together seamlessly.
We’ve created a more flexible, powerful, and trustworthy LLM-based system that can do everything from generating flowcharts of celebrity family trees to writing and testing its own code for errors in real time.
We hope this peek behind the scenes gives you a sense of why it was so challenging—and why we’re proud of where we landed. There’s still room to grow (we want more consistent RXML outputs, a better approach to adversarial prompts, etc.), but we can say with confidence: Phind is no longer “just another” AI chat. It’s a richly integrated system that truly aims to be your one-stop, multi-modal, multi-capability answer engine.
Stay tuned for upcoming deep dives into our RXML format, more details on how we orchestrate multi-step function calls, and the hidden joys of synthetic data generation at scale. If you made it this far, thank you for reading—and we’d love to hear your thoughts on how we can push this further!
7. Special Thanks
I want to give a huge shout-out to the Llama team at Meta for their work on Llama 3.3 70B and 3.1 405B, the base models for our fine-tune and our synthetic-data-generation beasts. Without them we couldn’t have done this.
Additionally, thank you to Nvidia for their TensorRT-LLM library, which is what allows us to run the model at the capacity and speed we do.
Thanks for reading! If you enjoyed, please let me know at [email protected]! If you didn’t, please pretend you did anyway so that I feel like these past three months were well spent.
If you want to work on post-training for user delight, or just on helping us create the best search platform, we’re hiring (please help)! We’re a small team of just three people, with just me working on the dataset right now, so if you want to work on the next dataset with me, come email!