AI Engineer World's Fair 2024
I recently had the privilege of attending the AI Engineer World's Fair in SF. Being based in New Zealand, one of the things that stood out was the sheer density of tech people in the area. Walking down the street and hearing "AGI", "LLM" and various other tech terms was a minor culture shock.
I've condensed the key learning I took from the event into the sections below. If you get the chance, check out some of the livestreams (particularly the closing keynote), and intelligent transcript summaries from the event.
I'm endeavouring to write more frequently as a way for me to consolidate my own lessons learnt. I don't consider this writing to be top quality, but please feel free to send suggestions and feedback!
🧐 Evaluations
Over the last ~3months, I've been working alongside many people trying to devise what "good" evaluation processes look like for LLM-powered applications. It was both a relief and a concern to see that many people shared the experience that this is still a very difficult task. While there are a lot of tooling companies willing to sell you an immediate solution, evaluations should be considered part of the core product development loop for LLM powered systems.
A (very) paraphrased quote from the CEO of Weights & Biases, Lukas Biewald illustrates the need for evaluations well: "Without the having evaluation criteria, it's almost impossible to determine the difference between a 75% & 79% accurate model, but this can be very important for production".
The next thing I'll post here will be a brain dump of my lessons from evaluating systems (and maybe even a github project to demo these in one deployment pipeline). If you have things you want to know about evals, let me know!
Takeaways
- Looking at your data is important!
- Log your product interactions, if you are unable to observe how end-users are interacting with your product, you're also unable to make improvements, understand failure modes, or introduce the features users expect your solution to have.
- Bring domain SME's in to rate how well your LLM is performing. This allows you to create a dataset that can be used both for evaluation and fine-tuning down the line.
- If you don't have a human in the loop in your development and evaluation phase for feedback, you'll soon get feedback from the human in the loop when you go to production, your customer 😉.
- If you're going to use LLM-as-a-Judge, it's important to have a quantifiable measure for the LLM's alignment with a human. From here, you can decide whether to trust the LLM, or look at fine-tuning an LLM from the user data & ratings combinations you've collected.
- Don't rely on out of the box (OOB) metrics.
- LLM benchmarks don't translate well to measuring performance in your task
- Even generic contextual relevance type metrics aren't entirely useful to understand if your model is doing the right thing for your users.
- Devise your own metrics that can measure how well your solution performs for your users. Some companies have 100's to 1000's of evaluation metrics.
- Use a range of tests at different levels of the software development lifecycle
- Unit tests that can be run offline are important to sanity check changes (e.g., basic assertions).
- More expensive tests such as LLM as a judge, red-teaming etc., can be run on a PR being raised.
🔁 Retrieval Augmented Generation (RAG) & Agentic Workflows
The "RAG is just a hack" sentiment seems to be mostly gone now. I think we all understand the importance of RAG in being able to dynamically bring knowledge to the LLM based on the real-time permissions of the user it's acting on behalf of. However, RAG is relatively naive in it's most basic form (zero/few shot generation). Agents are becoming popular, in order to allow the LLM to make decisions on how it should work with information (through tool calls, reflexion etc). Both LlamdaIndex and LangChain have made annoucements on agent support, with many startups also launching agent-based products. Having an agentic flow of sorts can enable the LLM make decisions about when to pull information, refine it's output, or get more information from the user.
I see a spectrum of RAG/ Agentic workflows through the following ranges:
- RAG, logic set in the application code (rigid definition with strong control)
- Graph based agents (variable definition with some control)
- ReAct-type, open ended agents (very flexible definition with no control)
I'm quite excited by the concepts behind LangGraph, as it feels like the right tradeoff between control and flexibility when contrasted to naive RAG, or open-ended agentic workflows like AutoGen (that I also think is a great library).
Takeaways
- RAG is a simple, but effective technique to bring knowledge to your LLM. Agentic workflows are thought to be the best way to extend the effecacy of RAG, enabling better reasoning, and operating on more domains.
- Have task specific agents/ RAG definitions and route between them for greater accuracy.
- Tool calling for knowledge makes agents more observable & controllable. You can check the sequence of tool calls to understand how a model reasoned about solving the problem.
- One insight I found interesting; with the same graph definition, small models and frontier models were found to traverse similar nodes. Hence "reason" similarly when constrained. This could allow smaller models to be used for moving between pre-defined states, and using a larger model only when open-ended reasoning may be required.
- Graph based systems allow you to flexibly define this (i.e., LangGraph)
- Role Based Access Control (RBAC) on your own data is hard when it needs to be replicated from source, to a (vector) database for RAG retrieval.
👷 Developing LLM Applications
LLM applications are fun & captivating, because you can create a compelling pilot very quickly. However, the skills and systems to take that pilot to production demand significantly more effort and investment.
Takeaways
- Some of the most important IP you build when creating a solution within your business is what worked, and what didn't work. (E.g., we have really bad retrieval performance on dataset
X
, LLMY
doesn't work well when deployed in the context ofZ
)- At minimum you should log the experiments you try, at best, you should have an experimentation framework to properly catalogue what you've tried, so one person doesn't retain all your IP.
- Simply changing the LLM model your product calls behind the scenes doesn't necessarily make for a fair evaluation of model performance.
- Models (especially between providers like OpenAI, Anthropic etc.) behave quite different depending on how they're prompted. If you wish to evaluate different providers, you should try to modify the prompt for each use case.
- Keeping your base prompt simple helps test between models effectively, as you can make minor tweaks to a prompt that should generalise across model providers relatively well.
- Constraining LLM's into making function/tool calls is a good way to control (and observe/evaluate) reasoning. Tools like LangGraph do this quite well, giving the LLM a state and a set of places to move to based on it's state. I'm a big fan of this approach in general, I think it serves to make LLM agents much more predictable and controllable
📊 Fine-tuning
One of my biggest surprises was the prevalence of people saying they actively use fine-tuned models in production. On reflection, there's an increasingly valid argument to be made for finetuned models, given the narrowing performance gap.
🌶️ My hot take is that we've reached a level of model capability that allows us to achieve "good enough" performance in constrained domains, so irrespective of whatever further releases from frontier labs do to supply us with increased "intelligence", small models done right will be valuable from now onwards.
Takeaways
- Fine-tuning on your domain can increase the pareto frontier of performance on your task
- It should still be used as a final step for things like:
- Cost to serve optimisation (use a smaller model fine-tuned on your domain to perform near to a frontier model)
- Modifying tone. If you customise the tone of a frontier model via system prompt, it can regress to the tone it adopted through pre-training over the course of a long conversation. Fine-tuning can address this.
- LoRA/ QLoRA are still great techniques for fine-tuning in an accessible manner.
- Platforms like LoRAX allow you to serve multiple LoRA adaptors from one GPU.
- RAG is still king for inserting knowledge into the LLM's context window at runtime.
🛠️ Abundance of tooling companies (shovels!)
I work primarily in the Microsoft/ Azure ecosystem, and hence use a lot of Microsoft tooling without critically looking across the market for others to meet my needs. I was very surprised to see the number of tooling companies selling solutions to quite specific problems (hallucinations, RAG, etc.) My knee jerk assessment is that most of the tooling company's value proposition can be eroded significantly with minimal effort, or OSS alternatives.
Takeaways
- The tooling market is competitive.
- I'm skeptical of how well OOB tooling for evals, hallucination mitigation etc., will perform for custom solutions.
- Build vs Buy is a difficult conversation to have when commonly accepted techniques for extracting best performance shift on a monthly basis. Your tooling may be out of date quickly.
- Don't reinvent the wheel, many problems in "LLM Engineering" have their roots in older domains (e.g., data retrieval). Try proven techniques where applicable.
🌉 Conclusion
Overall, it was interesting to see the diverse number of thoughts and opinions on how to develop AI applications. I've left with a number of approaches I'm excited to apply, I hope this article may have given you a few too!
Thanks for reading :)