If you’re working with large datasets, whether it’s customer support conversations, academic papers, or chatbot dialogues, here’s the hard truth: evaluating your AI models with just one metric is a huge mistake. If you only measure one thing, say “accuracy,” you might be missing the forest for the trees. Sure, your AI could be spitting out correct answers, but what if it’s doing so in a way that’s confusing, irrelevant, or even hallucinating? You need a range of metrics to get the full story.
LLUMO AI lets you assess performance across 50+ evaluation metrics, such as clarity, confidence, context, and even hallucinations, so you get a much clearer picture of where your model stands and where it needs work.
This guide will walk you through the process of evaluating a bulk dataset with multiple prompts and models, from uploading your data to interpreting the results.
Option 1: Upload via the Dashboard
Log in to the LLUMO AI Platform and go to the “Evaluate Dataset” section.
Click “Upload File” and select your dataset. The file can be in CSV, JSON, or Excel format.
Review Your Data: A preview of the uploaded data will be displayed. Ensure that the file is structured correctly and the data looks accurate before proceeding.
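For reference, a minimal dataset usually has one row per test case. The sketch below writes such a file in Python; the column names (prompt, expected_output) are illustrative placeholders, not a schema required by LLUMO, so adapt them to whatever structure your evaluation needs.

```python
import csv

# Two illustrative rows; adapt the columns to your own evaluation setup.
rows = [
    {"prompt": "Summarize the customer's refund request.",
     "expected_output": "A one-sentence summary of the request."},
    {"prompt": "Classify the sentiment of this product review.",
     "expected_output": "Positive, negative, or neutral."},
]

with open("evaluation_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt", "expected_output"])
    writer.writeheader()      # header row: prompt, expected_output
    writer.writerows(rows)    # one row per test case
```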
Option 2: Upload via API
Access the API Documentation: Visit LLUMO’s API documentation for the exact endpoint and parameters.
Upload Your Dataset: Make an HTTP request to the API to upload your file.
Confirm the Upload: Once the upload is complete, you’ll receive a confirmation response, and the data will be ready for evaluation.
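As a rough sketch of what such a request can look like (this is not LLUMO’s documented API; the endpoint URL, header, and form-field names below are placeholders, so use the values from the API documentation instead):

```python
import requests

# Placeholder values; replace with the endpoint, key, and field names
# from LLUMO's API documentation.
API_URL = "https://api.example.com/v1/datasets/upload"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

with open("evaluation_dataset.csv", "rb") as f:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": ("evaluation_dataset.csv", f, "text/csv")},
    )

response.raise_for_status()   # fail loudly if the upload was rejected
print(response.json())        # confirmation response once the upload completes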
You can export the evaluation results for further analysis or reporting in CSV, Excel, or PDF format.
Similarly, you can select multiple KPIs to evaluate the output, and the final result will appear as shown below.
LLUMO AI isn’t just about evaluating; it’s about transforming how you work with your data. The platform makes it easy to run comprehensive evaluations with multiple KPIs.
Use Case 1: How can you perform a bulk evaluation using multiple evaluation metrics across different models at once, instead of processing them individually?
Step 1: Navigate to the box on the right side of the “Knowledge Base” section and click Select All.
Selecting “All” includes the entire dataset, with every evaluation metric and prompt, in the run. You will see an indicator confirming that 55 rows (data entries) have been selected for evaluation.
Step 2: Click the Run All button.
Step 3: Once you click Run All, the system executes the evaluation process for the selected rows.
This includes running prompts, generating outputs, and evaluating them against your custom metrics. The evaluation results are displayed in the window, providing a detailed view of the outputs and metrics.
The screen above provides the following controls:
Selected Checkbox:
Indicates that 100 rows (or data entries) are currently selected for batch processing, such as cloning or evaluation.
Clone Button:
Creates a duplicate of the selected rows or data entries. Useful for experimenting with slight modifications while retaining the original data.
Delete Button:
Deletes the selected rows or data entries from the evaluation table. This action is typically irreversible, so users need to confirm before proceeding.
Run Button:
Executes the evaluation process for the selected rows. This triggers the prompt and output generation along with custom metric evaluations.
Close Button:
Exits the current evaluation session or clears the current selection of rows, returning users to the previous state or dashboard.
Additionally, for a comprehensive overview of available tools, actions, and customization options within the LLUMO platform, check out the Comprehensive Menu and Features Overview. This page will help you navigate the various features and capabilities to enhance your evaluation process, including how to access, configure, and utilize custom metrics efficiently.
Using Set Rule, you can define minimum or maximum values that determine whether the results meet the evaluation criteria. For example, you might set a threshold of 80% for grammar quality, meaning that any output scoring below 80% on that KPI would fail.
These are the conditions that must be met for an output to pass the evaluation. If an output meets or exceeds the set percentage, it passes; otherwise, it fails.
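Conceptually, the pass/fail rule works like the sketch below. This is an illustration of the thresholding logic only, not LLUMO’s internal implementation; the metric names and scores are example values.

```python
# Example thresholds: minimum score (0-100) each KPI must reach to pass.
thresholds = {"grammar_quality": 80, "clarity": 70, "context_relevance": 75}

# Example scores for a single output, as a metric -> score mapping.
scores = {"grammar_quality": 76, "clarity": 88, "context_relevance": 91}

# An output passes a KPI only if its score meets or exceeds the threshold.
results = {kpi: ("Pass" if scores[kpi] >= minimum else "Fail")
           for kpi, minimum in thresholds.items()}

print(results)
# {'grammar_quality': 'Fail', 'clarity': 'Pass', 'context_relevance': 'Pass'}
```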
AI models are the underlying algorithms used to process and evaluate the outputs. LLUMO AI supports several models optimized for different evaluation tasks, such as sentiment analysis, grammar checking, or response relevance.
What happens if my output doesn’t meet the thresholds I set?
If an output doesn’t meet the set criteria, it will be marked as a Fail. You can review the failed outputs and adjust your model or dataset accordingly to improve performance.
The evaluation time depends on the size of your dataset and the complexity of the selected evaluation model. A typical evaluation of 100 prompts and outputs may take only a few minutes.