Building LLM apps? Make sure you keep these 4 things in mind
Enterprise AI adoption in 2024
As 2024 gets underway, we’re seeing Generative AI (Gen AI), based on Large Language Models (LLMs), appear everywhere: Google Bard, Microsoft Bing, AI chatbots on WhatsApp, AI ‘companions’, and more. Now, that’s all on the consumer side.
What about Gen AI in the enterprise? Enterprises are typically more conservative when it comes to adopting new technologies. But because business leaders have gotten a first-hand taste of the transformative power of Gen AI (thank you, ChatGPT), we think that they’re ready to embrace Gen AI with both hands this year.
Regular evaluations and fine-tunings can make or break an AI app
But unlike consumer-facing apps, enterprise Gen AI apps will be built to handle specialized workflows and domain-specific use-cases, or meet other custom requirements. In turn, that means that the results generated by applications using those LLMs must meet or exceed various quality metrics around accuracy, factuality, contextuality, and other criteria.
While consumers can (potentially) live with a search engine that points them to a low-quality website, employees will never trust an AI application that hallucinates, or provides ‘made up’ answers, when asked to analyze a spreadsheet and identify the three reasons sales are declining.
So LLMs that power enterprise applications need to:
be trained using specialized datasets that can enhance the LLMs’ understanding and performance in as many scenarios and contexts as possible, and
have their outputs evaluated regularly to make sure that the results meet or surpass key quality thresholds.
It’s important to note that, because of the non-deterministic nature of AI outputs, model drift, and the fact that different people can ask the same question in different ways, evaluating LLM output quality is never a one-and-done exercise. LLM-based app outputs must be evaluated regularly, and, based on the results, new specialized datasets must be created and/or procured to fine-tune these applications.
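In practice, a recurring evaluation like this often boils down to comparing averaged quality scores for a batch of outputs against agreed thresholds. Here is a minimal sketch of that idea; the metric names, thresholds, and scores below are illustrative assumptions, not a standard.

```python
# Sketch of a recurring output-quality check. The metrics and thresholds
# are illustrative; plug in your own scoring functions and targets.

QUALITY_THRESHOLDS = {"accuracy": 0.90, "factuality": 0.95, "contextuality": 0.85}

def evaluate_outputs(scored_outputs):
    """scored_outputs: list of dicts mapping metric name -> score in [0, 1].

    Returns the metrics whose average score falls below threshold,
    signalling that new fine-tuning data may be needed.
    """
    failing = {}
    for metric, threshold in QUALITY_THRESHOLDS.items():
        scores = [o[metric] for o in scored_outputs if metric in o]
        if not scores:
            continue
        average = sum(scores) / len(scores)
        if average < threshold:
            failing[metric] = round(average, 3)
    return failing

# One batch of already-scored responses (scores are made up for illustration)
batch = [
    {"accuracy": 0.95, "factuality": 0.97, "contextuality": 0.80},
    {"accuracy": 0.93, "factuality": 0.97, "contextuality": 0.82},
]
print(evaluate_outputs(batch))  # only "contextuality" misses its threshold
```

A check like this can run on a schedule; any metric that slips below threshold becomes the trigger for sourcing new fine-tuning data.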
Lessons learned at e2f
At e2f, over the last 10 years, we have worked closely with three of the world’s largest AI builders to create, annotate, and evaluate various specialized datasets to support their AI / LLM applications and projects.
So we’ve had a ringside view of what makes AI applications work well (or not): how to plan and build user-centric AI applications, how to evaluate their outputs, how to keep them performant, how to nudge them back on track when performance slips, and much more.
To kick off 2024, we wanted to share four high-level lessons learned, so that you, as a business leader or manager, can maximize the chances of success for your own Gen AI / LLM projects.
1 - It’s all about the use-case and the end-users.
No two AI apps will, or should, ever work the same…unlike a traditional software tool.
For instance, an AI sales assistant app needs to understand and respond to prospect queries in a manner that satisfies the prospect’s question, while, at the same time, optimizing the chances of converting that prospect into a customer. The context, the tone of the response, and more must be optimized for conversions.
On the other hand, if you’re building a website agent that provides intelligent answers to questions from your website visitors, the outputs of that app should be optimized for helpfulness and thoroughness.
This means that AI / LLM application builders must keep three things in mind:
Design the app against specs that map to the business use-case (as is standard with any enterprise app)
Train the app with the right kind of specialized datasets, so that its interactions work for users with varying backgrounds, skill levels, language abilities, etc., while ensuring that each interaction is optimized to achieve the desired business goal.
Just as a sales assistant or customer service agent needs to be periodically retrained (based on analysis of user feedback and conversations), AI apps need to be regularly fine-tuned with the right kind of specialized datasets, based on analysis of the quality of their outputs.
2 - Enterprises are full of non-textual data. Converting it into training data is essential.
Once you realize the importance of finding the right, specialized datasets, you will also realize that enterprise information - the “raw material” that you need to create your datasets - is not organized neatly into easy-to-read textual data.
In fact, the majority of business information is often locked away in charts, images, tables, presentations, spreadsheets, video and audio files, and more. This information must be ‘scraped’ or converted into textual formats - without losing the surrounding context - in order to create specialized datasets needed to train and fine-tune your Gen AI / LLM apps.
This is hard, but not impossible.
You must use techniques such as Optical Character Recognition (OCR), human labelers and annotators, speech-to-text, image annotation, Natural Language Generation (NLG), and other tools that can handle non-textual data and convert it into usable, text-based formats that can be “fed” to your AI/LLM apps for training and fine-tuning purposes.
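At its core, such a conversion pipeline is a dispatcher that routes each file type to the right extractor. The sketch below uses placeholder extractors to show the shape of the pipeline; in practice each one would call a real tool (e.g. an OCR engine such as Tesseract for images, a speech-to-text service for audio, a spreadsheet parser for workbooks).

```python
# Sketch of a conversion pipeline that routes enterprise files to a
# text extractor. The extractor bodies are placeholders, not real tools.

from pathlib import Path

def ocr_image(path):
    # Placeholder for an OCR call, e.g. pytesseract.image_to_string(...)
    return f"[OCR text from {path.name}]"

def transcribe_audio(path):
    # Placeholder for a speech-to-text call
    return f"[transcript of {path.name}]"

def parse_spreadsheet(path):
    # Placeholder for flattening tables into text, keeping headers as context
    return f"[tables from {path.name} flattened to text]"

CONVERTERS = {
    ".png": ocr_image, ".jpg": ocr_image,
    ".mp3": transcribe_audio, ".wav": transcribe_audio,
    ".xlsx": parse_spreadsheet,
}

def to_training_text(filename):
    path = Path(filename)
    converter = CONVERTERS.get(path.suffix.lower())
    if converter is None:
        raise ValueError(f"No converter for {path.suffix!r}")
    return converter(path)

print(to_training_text("q3_sales_chart.png"))
```

The important design point is that every converter returns plain text plus enough surrounding context (file name, sheet or slide titles, etc.) for the downstream dataset to remain meaningful.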
3 - Your data may need to be cleansed, sanitized and ‘neutralized’
As an AI application builder, you want users to trust, adopt, and use your AI apps. But that’s a hard goal to attain when your apps produce results that are inaccurate, biased, offensive, or include sensitive, personal information. In addition to opening yourself up to legal risk, this will result in wasted money and time.
The good news is that the most popular open-source LLMs have already been trained to mitigate these problems. But if you’re trying to accelerate time to value by using an open-source LLM - as many enterprises likely will - you will still need to fine-tune it with specialized datasets to serve your unique enterprise needs.
That’s where cleansing your data becomes important. It’s useful to think of it using this simple framework:
If you’re using your own company’s data, you need to make sure you remove any confidential or sensitive information before creating your training and fine-tuning datasets.
If you’re ‘borrowing’ or buying 3rd-party data, you may need to strip away any metadata that could bias your application one way or another. Established data providers will do this for you automatically.
If you’re relying on any kind of public datasets, they could contain personal information or other unwanted metadata - and you will need to sanitize them.
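As a minimal illustration of the sanitization step in all three cases, here is a sketch that scrubs obvious personal information from text before it enters a training dataset. Real pipelines use dedicated PII-detection tools; these few regexes only illustrate the idea and will miss many formats.

```python
# Minimal sketch of scrubbing obvious PII from text before it is used
# as training data. These patterns are illustrative, not exhaustive.

import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),      # email addresses
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[PHONE]"),  # US-style phone numbers
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),          # US SSNs
]

def sanitize(text):
    """Replace matched PII with neutral placeholder tokens."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

raw = "Contact jane.doe@acme.com or 555-867-5309 about the report."
print(sanitize(raw))  # both the email and the phone number are replaced
```

Keeping placeholder tokens (rather than deleting matches outright) preserves sentence structure, so the sanitized text remains usable as training data.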
4 - Everyone should be able to calibrate LLM outputs
AI exists to serve humans. Humans, as we know, come from a variety of backgrounds, possess a range of skills, and so on. That’s why different human users have different needs and expectations when it comes to consuming and understanding the outputs generated by LLMs.
An LLM output that makes sense to a French mathematician may not make sense to a Vietnamese teacher (even if the language in which the LLM generates the output is the same).
That’s why there’s a growing realization that even non-data scientists must be able to evaluate and recalibrate LLM outputs so that they’re maximally useful to a range of users from different backgrounds.
AI builders should consider using platforms that provide clear insights into their LLM's performance and facilitate easy adjustments. This user-friendly approach can accelerate the adoption and optimization of LLMs across various business functions.
Conclusion
2024 is the year of enterprise AI adoption. But unlike traditional software applications that produce fixed, predictable outputs, AI/LLM apps produce outputs that are non-deterministic (the same question can yield a different answer each time) and are subject to different interpretations by different users.
Make sure that you have a robust AI data strategy in place to train, evaluate, and fine-tune your AI/LLM apps regularly so that your users and customers can trust and rely on your applications.