“Running evaluation in production” refers to continuously monitoring and assessing the performance of your AI models while they operate in a real-world environment (also known as “production”).

In production, an AI model actively handles tasks such as answering customer queries, making predictions, or processing data. Over time, its performance may change due to factors such as shifts in input data, changes in user behavior, or model drift.
You have an active LLUMO AI account and access to the platform.
You have the necessary datasets and evaluation metrics defined for your AI models.
Make sure you have a working understanding of API integration or access to the platform's user interface (UI); see our LLUMO AI API guide to get started.
Once the API receives the request, it processes the input data and returns metrics like confidence, clarity, and context. These metrics will be used to analyze the output.
API Output: Metrics such as confidence, clarity, context, and overall score are generated by the API.
These API responses can also be used in your own production code. They are additionally logged on LLUMO's side and can be reviewed in the “Logs” section.
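The snippet below is a minimal sketch of this request/response flow, assuming a hypothetical evaluation endpoint and API key; the URL, payload shape, and response keys (`confidence`, `clarity`, `context`, `overallScore`) are illustrative assumptions based on the metrics described above, so check the LLUMO AI API guide for the exact schema.

```python
import os
import requests

# Hypothetical endpoint and payload shape, for illustration only;
# consult the LLUMO AI API guide for the exact URL and schema.
LLUMO_EVAL_URL = "https://api.llumo.ai/evaluate"  # assumed placeholder
API_KEY = os.environ.get("LLUMO_API_KEY", "<your-api-key>")

payload = {
    "prompt": "What is your refund policy?",
    "output": "You can request a refund within 30 days of purchase.",
}

response = requests.post(
    LLUMO_EVAL_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
metrics = response.json()

# Inspect the evaluation metrics returned by the API
# (assumed keys, matching the metrics described above).
for key in ("confidence", "clarity", "context", "overallScore"):
    print(key, metrics.get(key))
```

Because the same results are logged on LLUMO's side, you can also skip storing them locally and review them later in the “Logs” section.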
Models include GPT-3.5 Turbo, GPT-4, Gemini-Pro, and custom solutions.
Graphs compare error rates (e.g., requests that fail to produce any response) and refusal rates (e.g., queries declined by the models).
Use Case: Model benchmarking to identify the most effective system.
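If you want to reproduce these rates from your own request records, here is a minimal sketch; the record fields (`model`, `status`, `refused`) are assumptions for illustration, not LLUMO field names.

```python
from collections import defaultdict

# Hypothetical request records; field names and values are illustrative only.
records = [
    {"model": "gpt-4", "status": "ok", "refused": False},
    {"model": "gpt-4", "status": "error", "refused": False},
    {"model": "gemini-pro", "status": "ok", "refused": True},
    {"model": "gpt-3.5-turbo", "status": "ok", "refused": False},
]

totals = defaultdict(lambda: {"count": 0, "errors": 0, "refusals": 0})
for rec in records:
    stats = totals[rec["model"]]
    stats["count"] += 1
    stats["errors"] += rec["status"] == "error"  # request produced no response
    stats["refusals"] += rec["refused"]          # model declined the query

for model, stats in totals.items():
    error_rate = stats["errors"] / stats["count"]
    refusal_rate = stats["refusals"] / stats["count"]
    print(f"{model}: error rate {error_rate:.0%}, refusal rate {refusal_rate:.0%}")
```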
Insights and Recommendations
The insights highlight areas of concern in the generated responses and compare them to industry benchmarks.
Metrics: The percentage of responses falling below the respective benchmark for correctness (70%), relevancy (72%), and coherence (62%).
Time Period: Data is drawn from 1st to 30th September, providing a specific evaluation period.
“See in Playground” Buttons: These buttons let you test or refine model outputs interactively, giving users an actionable follow-up.
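As a rough illustration of how a “below benchmark” figure could be derived from per-response scores, here is a minimal sketch; the 0–100 score scale and the sample values are assumptions, while the benchmark thresholds come from the insights described above.

```python
# Hypothetical per-response scores on a 0-100 scale; values are illustrative.
scores = {
    "correctness": [82, 65, 90, 58, 74],
    "relevancy": [70, 88, 61, 79, 95],
    "coherence": [55, 91, 67, 84, 72],
}

# Benchmark thresholds taken from the insights described above.
benchmarks = {"correctness": 70, "relevancy": 72, "coherence": 62}

for metric, values in scores.items():
    below = sum(v < benchmarks[metric] for v in values)
    share = below / len(values)
    print(f"{metric}: {share:.0%} of responses fall below the {benchmarks[metric]} benchmark")
```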
Model: Name of the model used to generate the output.
Provider: Indicates the service provider (e.g., OpenAI, VertexAI).
Input/Output Tokens: Number of tokens in the query and response.
Total Tokens: Combined tokens used (input + output).
Response Time: Time taken to generate a response.
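To make these fields concrete, here is a small sketch of a single log record and how total tokens can be derived from it; the field names mirror the list above, but the exact schema of LLUMO's logs may differ.

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    """One logged request/response pair, mirroring the fields listed above."""
    model: str
    provider: str
    input_tokens: int
    output_tokens: int
    response_time_ms: float

    @property
    def total_tokens(self) -> int:
        # Total tokens = input tokens + output tokens, as described above.
        return self.input_tokens + self.output_tokens

# Illustrative values only; real records come from the "Logs" section or the API.
record = LogRecord(
    model="gpt-4",
    provider="OpenAI",
    input_tokens=120,
    output_tokens=340,
    response_time_ms=950.0,
)
print(record.total_tokens, "tokens in", record.response_time_ms, "ms")
```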
Use Case: This section provides traceability for analyzing system responses and optimizing token usage and latency.

Running evaluation in production is an essential step to ensure your AI models perform well in real-world situations. Over time, models can lose accuracy due to changes in data or user behavior, but with LLUMO AI you can easily monitor and improve their performance. By following the steps outlined in this guide, you can identify issues quickly, make improvements, and keep your AI systems running effectively. LLUMO AI keeps this process simple with its user-friendly tools, detailed metrics, and automation options, helping you maintain reliable, high-quality AI models.