Continuous Monitoring – And Why It Can Make or Break Your AI Apps

Think of your AI-based app as if it were a child. 

When they’re young, you keep them close, nurturing and guiding them every step of the way. But as they grow and evolve, you start to give them more freedom—allowing them to explore the park or engage with the world around them. Yet, even as they gain independence, you still want to oversee their development, guiding them and helping them grow. But would you trust a robot to handle all of that—potentially causing your child to think and act like a machine? Probably not.

As a parent, you want to be there yourself, ensuring their safety and helping them navigate challenges. If you can’t be there, you’d rely on another trusted human—someone who understands the nuances of their development and is capable of making thoughtful decisions in real time.

In the same way, when your AI or LLM-based app is deployed, you can’t rely solely on automated systems for oversight. AI is complex, evolving, and often unpredictable. Just like a child, it requires constant attention—not only from machines but from humans who can catch the subtleties, adapt to changes, and step in to guide it when it veers off course. In other words, human oversight is essential to ensure your AI continues to learn and evolve without falling into harmful or inefficient behaviors.

Imagine you're an artist painting a masterpiece, but in a different world. 

You spend hours perfecting every detail, confident that once it's finished, it’ll be flawless. But in that world, over time, the colors start fading, or worse, your painting gets smudged without you even touching it. Now, imagine you never checked back on it – what happens? The work of art you once took pride in begins to lose its value. 

In this world, AI and LLM-based apps work the same way!

Once deployed, they don’t remain static. They evolve, shift, and respond to new data, requiring constant oversight. Just as you wouldn't leave a masterpiece unattended, you can't leave your AI unchecked. And this is where constant monitoring comes into play, ensuring your model doesn’t drift into irrelevance or worse, become harmful.


AI Apps Are Not Deterministic—That’s Why You Need Constant Monitoring

When companies deploy AI models in production, they often make the mistake of thinking their job is done after benchmarking and testing. But an AI app, especially a Gen AI app, is not a deterministic piece of software you can QA once and forget about. It’s more like a living organism that needs ongoing supervision, because it can evolve in unexpected ways.

But what criteria should you track to ensure it stays on course?

Let’s break it down:

  • System Characteristics: Many of these can be tracked automatically by machines, with humans periodically reviewing logs and responding to system notifications. These include things like latency, uptime, and throughput.

  • Objective Criteria: This is where golden datasets and machine-monitoring models come into play. They handle the bulk of evaluation, but human involvement is critical for deeper analysis, refinement, and model enhancement (a minimal sketch of these first two layers follows this list).

  • Subjective Criteria: This is the toughest layer to monitor. How do you evaluate a model’s creativity, style, or nuance? Machines struggle here, and human judgment becomes essential. For example, how does one model assess the creativity or stylistic tone of another? That’s a challenge only humans can help with.
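
To make the first two layers concrete, here is a minimal Python sketch. Everything in it is illustrative: model_fn stands in for your real LLM endpoint, the golden set is a toy example, and the exact-match check and 0.9 threshold are placeholder choices rather than a prescribed setup. It records latency per call (a system characteristic) and runs an objective spot-check against the golden dataset, escalating to humans when accuracy slips.

import statistics
import time

# Hypothetical golden dataset: prompts paired with reference answers the
# deployed model is expected to keep getting right over time.
GOLDEN_SET = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "Translate 'hello' into Spanish.", "expected": "hola"},
]

def timed_call(model_fn, prompt):
    # System characteristic: measure latency alongside the model output.
    start = time.perf_counter()
    output = model_fn(prompt)
    return output, time.perf_counter() - start

def golden_set_check(model_fn, threshold=0.9):
    # Objective criterion: exact-match accuracy on the golden set.
    latencies, hits = [], 0
    for item in GOLDEN_SET:
        output, latency = timed_call(model_fn, item["prompt"])
        latencies.append(latency)
        hits += int(item["expected"].lower() in output.lower())
    accuracy = hits / len(GOLDEN_SET)
    print(f"median latency: {statistics.median(latencies):.3f}s | accuracy: {accuracy:.2f}")
    if accuracy < threshold:
        print("Accuracy below threshold -- flag this run for human review.")
    return accuracy

# Example with a stand-in model function; in production this would wrap
# your real LLM endpoint or API client.
golden_set_check(lambda prompt: "Paris" if "France" in prompt else "Hola!")

The point of the sketch is the division of labor: machines log and alert continuously, while anything that falls below threshold goes to a human who can judge why.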

Building a Monitoring Team: More Complex Than It Seems

Once you acknowledge the need for monitoring, the next step is building a team. In our experience, data scientists often expect a team of data annotators to be ready within days. Unfortunately, this process usually takes weeks to months. The reason? You’re not just assembling a group; you’re cultivating a team that needs to evolve with the model.

In practice, this means refining guidelines, aligning data scientists and annotators, and ensuring inter-annotator agreement to maintain overall quality. It’s not a sprint; it’s a marathon. And remember, as your model improves, so will your goals and guidelines.


Best Practices for Monitoring AI/LLM Apps

Based on years of experience, e2f has developed some best practices to help you effectively monitor and enhance your AI/LLM models in production.

  • Direct Communication: Ensure data scientists can communicate directly with the quality teams handling annotation. While guidelines are helpful, nothing beats the real-time feedback loop between data scientists and annotators when questions arise.

  • Detailed Guidelines: Data scientists often forget that annotators don’t speak the same technical language. Providing a long list of examples, with their expected annotations, ensures that the annotators clearly understand how to label datasets. This clarity can significantly boost the accuracy of annotations.
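
As a small illustration of what “examples with their expected annotations” can look like, guideline entries can be kept as structured data so they are easy to review, extend, and render for annotators. The fields, labels, and texts below are hypothetical, not a fixed schema.

# Hypothetical guideline entries: each example pairs a piece of content
# with its expected annotation and a plain-language rationale, so
# annotators see why a label applies, not just which one to pick.
GUIDELINE_EXAMPLES = [
    {
        "text": "You'd know that if you'd ever opened a book.",
        "expected_label": "harmful",
        "rationale": "Sarcasm used to belittle the reader.",
    },
    {
        "text": "Nice try, but the answer is actually 42.",
        "expected_label": "not_harmful",
        "rationale": "Playful correction with no derogatory intent.",
    },
]

# Render the entries as annotator-facing text, free of technical jargon.
for example in GUIDELINE_EXAMPLES:
    print(f'Text: "{example["text"]}"')
    print(f'Expected label: {example["expected_label"]} ({example["rationale"]})')
    print()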

Managing Subjectivity: A Real-World Example

AI annotations often involve subjective judgments, and this subjectivity can vary widely between annotators. 

For example, if you’re labeling content for harmfulness, what one annotator considers harmful might be seen as harmless by another. Different backgrounds, experiences, and interpretations all influence how each individual sees the content.

Let’s say you’re working on a language model tasked with detecting harmful language. Annotators may disagree on whether a sarcastic comment is harmful. One annotator may interpret sarcasm as playful, while another sees it as aggressive and harmful. This kind of subjectivity makes consistency in annotations challenging.

In situations like this, breaking down the concept into sub-labels can help. Instead of asking annotators to simply label something as “harmful” or “not harmful,” you can ask a series of more specific questions: Is the tone aggressive? Does it contain derogatory language? Could it be interpreted as offensive depending on context?
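
As a rough sketch of that decomposition, the sub-questions and the consolidation rule below are illustrative assumptions, not a standard taxonomy; a real project would tune both with the data science team.

# Hypothetical decomposition of "harmful vs. not harmful" into narrower
# yes/no sub-questions that are easier to answer consistently.
SUB_QUESTIONS = [
    "is_tone_aggressive",
    "contains_derogatory_language",
    "offensive_depending_on_context",
]

def consolidate(sub_answers):
    # One possible consolidation rule (an assumption, not a standard):
    # treat the item as harmful if any sub-question was answered "yes".
    if any(sub_answers.get(q, False) for q in SUB_QUESTIONS):
        return "harmful"
    return "not_harmful"

# Example: one annotator's answers for a sarcastic comment.
annotation = {
    "is_tone_aggressive": False,
    "contains_derogatory_language": False,
    "offensive_depending_on_context": True,
}
print(consolidate(annotation))  # -> harmful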

By requiring annotators to answer multiple questions, you reduce the ambiguity in their responses. Yes, it slows down the process because they have to answer 20 to 30 sub-questions, but the upside is significant. The annotators better understand what the data science team is looking for, and the consistency of labeling improves dramatically.

Once agreement among annotators is high, you can consolidate these sub-labels back into a single label. Interestingly, when you do this, inter-annotator agreement may dip again as individual interpretations come back into play. But this isn’t a bad thing—it's actually a rich source of information for the data science team. The diversity of opinions adds another layer of valuable insights that can inform further model development.
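
One common way to quantify that agreement is Cohen’s kappa for two annotators (Fleiss’ kappa or Krippendorff’s alpha generalize to larger teams). A minimal sketch, assuming scikit-learn is available and using made-up labels:

from sklearn.metrics import cohen_kappa_score

# Labels from two annotators on the same five items (illustrative data).
annotator_a = ["harmful", "not_harmful", "harmful", "harmful", "not_harmful"]
annotator_b = ["harmful", "not_harmful", "not_harmful", "harmful", "not_harmful"]

# Kappa corrects raw agreement for chance: 1.0 means perfect agreement,
# values near 0 mean the annotators agree no more than chance would predict.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

Tracking a metric like this over time is what tells you when guidelines need another refinement pass and when the diverging judgments are genuine signal worth passing back to the data science team.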

Ready to Take the Next Step?

If you’re starting to build a Gen AI app, or getting ready to deploy one in production, e2f can help make sure your masterpiece stays timeless. Contact us to learn more!

