Experiment records and Playground provide a complete, end-to-end workflow, from prompt engineering and debugging to large-scale automated evaluation. By quantitatively comparing different model services, prompts, and parameter configurations, these tools help you accurately assess model performance and ensure the quality and stability of your AI applications.
Features
The experiment and Playground features let you run batch experiments on large models using multiple configurations (such as different models, prompts, and parameters) and provide the following in-depth comparative analysis tools:
Experiment plan: Create and manage experiment plans to establish performance baselines. A plan saves a combination of model services, prompt templates, datasets, and evaluators to ensure traceability and a consistent environment for rapid execution in the Playground.
Playground: Configure multiple experiment groups in a visual interface. Adjust inference parameters (such as Temperature and Top-p) in real time and use a data source for single-instance or batch validation. The Playground intuitively displays individual outputs and evaluation scores side-by-side.
Experiment record: An asset library of your experiment runs. Each record is a snapshot of a task run, including model service details, token consumption (cost), Time to First Token (TTFT), quantitative scores from the evaluator, and task execution status.
Comparative analysis: A multi-dimensional analysis tool. Select 2 to 5 experiment records for a side-by-side comparison covering evaluation metric trends, configuration parameter differences, and sample-level semantic comparisons to identify where model performance diverges.
| Module | Description |
| --- | --- |
| Experiment plan | Saves one or more experiment configurations (model, prompt, inference parameters, and optional dataset and evaluator). You can launch experiments in the Playground directly from a plan. |
| Playground | Serves as the execution engine for experiments. You can load configurations from an experiment plan, run batch inference tasks, and get immediate feedback from the LLM Judge. |
| Experiment record | Stores task execution logs and result snapshots. You can review and rerun experiments, or use the results as a data source for subsequent comparative analysis. |
| Comparative analysis | A decision-support tool. After you set a baseline, it calculates the delta for each experiment group across dimensions such as accuracy, latency, and cost. |
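To make the relationship between these modules concrete, the following minimal sketch shows the kind of configuration an experiment plan bundles together. The field names and values are illustrative assumptions, not the product's actual schema:

```python
# Illustrative sketch only: field names and values are hypothetical,
# not the platform's actual experiment-plan schema.
experiment_plan = {
    "name": "customer-support-baseline-v1",
    "model_service": {
        "model": "example-llm-v2",   # hypothetical model identifier
        "temperature": 0.7,
        "top_p": 0.9,
    },
    "prompt_template": (
        "You are a support assistant. Answer the question below.\n"
        "Question: {{question}}"
    ),
    "dataset": "support-faq-eval-set",   # optional dataset reference
    "evaluator": "llm-judge-accuracy",   # optional LLM Judge evaluator
}

# Loading this plan in the Playground reuses exactly the same configuration,
# which is what keeps repeated runs traceable and consistent.
```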
Benefits
Accelerate prompt engineering iterations
The Playground offers a low-latency, interactive environment for real-time adjustments to prompts and inference parameters (like Temperature and Top-p), along with immediate output validation. This significantly shortens the cycle from concept to prototype.
Inject dataset samples using the {{variable}} syntax to run concurrent tests for different business scenarios within a single interface. This eliminates much of the repetitive manual testing required in traditional development workflows.
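The sketch below illustrates the idea behind {{variable}} substitution: one template plus many dataset rows yields many test prompts. It is a simplified stand-in, not the platform's internal implementation, and the template and rows are made-up examples:

```python
import re

# Minimal sketch of {{variable}} substitution, assuming each dataset row
# is a dict whose keys match the placeholder names.
def render_prompt(template: str, row: dict) -> str:
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(row[m.group(1)]), template)

template = "Summarize the following ticket in one sentence:\n{{ticket_text}}"
dataset = [
    {"ticket_text": "The app crashes when I upload a photo."},
    {"ticket_text": "I was charged twice for my subscription."},
]

# One template, many scenarios: each row produces a separate test prompt.
for row in dataset:
    print(render_prompt(template, row), end="\n---\n")
```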
Quantify model performance at scale
The experiment module runs batch jobs on datasets to transform scattered model responses into structured evaluation metrics. These jobs provide objective, quantitative scores for model quality, shifting evaluation from subjective to data-driven.
The system not only provides scores but also records the full reasoning from the evaluator. This provides deep insight into the model's performance in specific areas like logical reasoning, safety, or instruction-following.
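As a rough illustration of what "structured evaluation metrics" means in practice, the sketch below aggregates per-sample judge verdicts (score plus reasoning) into a per-dimension average. The result shape and field names are assumptions for illustration only:

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical shape of a single judge verdict: a numeric score plus the
# judge's written reasoning. Field names are illustrative.
@dataclass
class JudgeResult:
    sample_id: str
    dimension: str      # e.g. "instruction_following", "safety"
    score: float        # e.g. on a 1-5 scale
    reasoning: str

results = [
    JudgeResult("q1", "instruction_following", 5.0, "Follows all constraints."),
    JudgeResult("q2", "instruction_following", 3.0, "Ignores the length limit."),
]

# Turn scattered responses into a structured metric: average per dimension.
by_dimension = {}
for r in results:
    by_dimension.setdefault(r.dimension, []).append(r.score)
print({dim: mean(scores) for dim, scores in by_dimension.items()})
# {'instruction_following': 4.0}
```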
Ensure stability and consistency during system upgrades
After a model version update or prompt optimization, you can run automated regression tests by launching a comparative experiment on a specified dataset with a single click. The system automatically calculates metric deltas to pinpoint performance regressions and ensure business logic consistency.
The comparative analysis feature supports text-level diff highlighting. This lets you quickly locate bad cases with significant output differences and provides precise samples for targeted optimizations.
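The following sketch shows the two ideas side by side: computing metric deltas against a baseline record and diffing a single sample's output to locate where it changed. The metric values and sample texts are invented for illustration:

```python
import difflib

# Baseline vs. candidate metrics from two experiment records (made-up
# numbers). A drop in accuracy flags a potential regression.
baseline = {"accuracy": 0.86, "avg_latency_s": 1.9, "cost_usd": 0.42}
candidate = {"accuracy": 0.81, "avg_latency_s": 1.4, "cost_usd": 0.39}
deltas = {k: round(candidate[k] - baseline[k], 3) for k in baseline}
print(deltas)  # {'accuracy': -0.05, 'avg_latency_s': -0.5, 'cost_usd': -0.03}

# Text-level diff on one sample to see exactly where the outputs diverge.
old = "Refunds are processed within 5 business days."
new = "Refunds are processed within 10 business days."
print("\n".join(difflib.unified_diff([old], [new], lineterm="")))
```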
Precisely balance inference cost and performance
Experiment records capture engineering metrics in real time, such as Time to First Token (TTFT) and Tokens Per Second (TPS). By comparing different model solutions, you can make informed trade-offs between response speed and output quality.
The system accurately calculates the token consumption and actual cost for each experiment. This quantitative data helps decision-makers select the most cost-effective model combination and avoid wasting computing resources.
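For reference, here is how these metrics are typically derived. The timestamps, token counts, and unit prices below are made-up examples, not the platform's pricing or measurements:

```python
# Illustrative derivation of the engineering metrics an experiment record
# captures; all numbers are hypothetical.
request_sent_at = 0.00   # seconds
first_token_at = 0.42    # seconds
completed_at = 3.10      # seconds
input_tokens, output_tokens = 1_200, 512

ttft = first_token_at - request_sent_at                 # Time to First Token
tps = output_tokens / (completed_at - first_token_at)   # Tokens Per Second

# Hypothetical unit prices per 1K tokens, for cost comparison only.
price_per_1k_input, price_per_1k_output = 0.0005, 0.0015
cost = (input_tokens / 1000 * price_per_1k_input
        + output_tokens / 1000 * price_per_1k_output)

print(f"TTFT={ttft:.2f}s  TPS={tps:.1f}  cost=${cost:.4f}")
```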
Build a traceable asset library
The system automatically captures the model service, prompt version, inference parameters, and a dataset snapshot for every experiment. This complete data lineage ensures that all results are fully traceable and reproducible, building a core prompt asset library for your organization.
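As a rough sketch of what such a lineage snapshot could look like, the example below records the configuration fields and derives a fingerprint from them, so identical configurations can be recognized across runs. The field names and hashing step are illustrative assumptions, not the product's actual record format:

```python
import hashlib
import json

# Hypothetical lineage snapshot for one experiment record. Hashing the
# snapshot gives a stable fingerprint: same configuration, same fingerprint.
record = {
    "model_service": "example-llm-v2",
    "prompt_version": "support-prompt@v7",
    "inference_params": {"temperature": 0.7, "top_p": 0.9},
    "dataset_snapshot": "support-faq-eval-set@2024-05-01",
}
fingerprint = hashlib.sha256(
    json.dumps(record, sort_keys=True).encode()
).hexdigest()[:12]
print(fingerprint)  # reruns with an identical snapshot reproduce the setup
```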
By comparing historical experiment records, your team can clearly track the evolution of model capabilities. This transforms AI system optimization from an intuition-driven process into a data-driven one.