Bulk Testing of AI Pipelines

You can perform bulk testing for AI use case pipelines. Instead of testing one case at a time, you can run tests across multiple cases and analyze the responses together. Bulk testing is available for both inference and evaluation pipelines, making it easier to validate use cases such as sentiment analysis at scale.
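Conceptually, a bulk test automates the loop sketched below: it submits many inputs to one deployment and collects the responses for joint analysis. This is a minimal illustration only; the endpoint URL and the run_inference helper are hypothetical stand-ins, not an AI+ Studio API.

    import requests

    # Hypothetical invocation endpoint for a deployed sentiment-analysis pipeline.
    ENDPOINT = "https://example.com/ai-plus-studio/deployments/sentiment/invoke"

    def run_inference(text: str) -> dict:
        # Send one input to the deployment and return its JSON response.
        response = requests.post(ENDPOINT, json={"input": text}, timeout=30)
        response.raise_for_status()
        return response.json()

    # A bulk test replaces this manual loop: many inputs, one consolidated result set.
    test_inputs = ["Great service!", "The product arrived broken.", "Average experience."]
    results = [run_inference(text) for text in test_inputs]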

Managing Bulk Tests in AI+ Studio

You can manage and run bulk tests for existing AI deployments directly from the Record Manager screen in AI+ Studio.

  1. Navigate to the desired deployment.

  2. Click the vertical ellipsis (⋮) next to it.

  3. Select Bulk Test.

A third pane opens, displaying previous bulk test logs (if any) for that deployment. Each log includes the following details:

  • Name: The identifier of the bulk test.

  • Status: One of the following:

    • Queued – Test is waiting to be executed.

    • In Progress – Test is currently running.

    • Failed – Test execution did not complete successfully.

    • Completed – Test finished successfully.

  • Triggered By: The user who initiated the bulk test.

  • Triggered On: Date and time when the test was started.

  • Execution Count: Number of requests processed during the bulk test.

Initiating a New Bulk Test

To initiate a new bulk test:

  1. Click the '+' button at the top-right corner of the third pane.

  2. On the Initiate Bulk Test screen that opens, configure the following fields:

    • Name: Enter a unique name for the bulk test.

    • Bulk Execution Threshold: Define the maximum number of requests to execute per run; the sketch below illustrates the effect of this cap.
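To see what the threshold controls, here is a rough model of a per-run cap. Whether AI+ Studio truncates the input set or splits it into successive runs is not stated here, so treat the chunking below as an assumption:

    def split_into_runs(inputs, threshold):
        # Yield successive batches of at most `threshold` requests.
        for start in range(0, len(inputs), threshold):
            yield inputs[start:start + threshold]

    inputs = [f"case-{i}" for i in range(250)]
    for run_number, batch in enumerate(split_into_runs(inputs, threshold=100), start=1):
        print(f"Run {run_number}: {len(batch)} requests")  # prints 100, 100, 50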

Configurations Based on Deployment Type

1. Deployments Using Asset Class Variables

  • Configure the Filters pane.

  • Select whether to run the bulk test at the Case level or the Case Interaction level.

2. Deployments Using Text Input

  • Upload an input file in one of the following formats: .xls, .xlsx, .csv.

  • Each column must represent one input variable.

  • You can download a template file from the same section, fill it with test data, and re-upload it; the sketch below shows one way to generate such a file.
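For example, an input file for a deployment with two text input variables could be generated as follows. The column names review_text and language are hypothetical; copy the actual variable names from the downloaded template.

    import csv

    # Each column maps to exactly one input variable of the deployment.
    rows = [
        {"review_text": "Great service!", "language": "en"},
        {"review_text": "The product arrived broken.", "language": "en"},
    ]

    with open("bulk_test_input.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["review_text", "language"])
        writer.writeheader()
        writer.writerows(rows)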

3. Deployments Using Media Type Variables

  • Upload a compressed archive in one of the following formats: .zip, .tar.gz.
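Any standard archiving tool can produce these formats. A sketch using Python's standard library, with placeholder file names:

    import tarfile
    import zipfile
    from pathlib import Path

    media_dir = Path("media_inputs")  # placeholder folder of test images/audio

    # Option 1: build a .zip archive.
    with zipfile.ZipFile("bulk_test_media.zip", "w") as zf:
        for path in media_dir.iterdir():
            zf.write(path, arcname=path.name)

    # Option 2: build a .tar.gz archive.
    with tarfile.open("bulk_test_media.tar.gz", "w:gz") as tf:
        tf.add(media_dir, arcname=media_dir.name)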

4. Zero-Shot Intent Detection Deployments

  • You can test with a combination of text, media, and filter inputs.

Running and Monitoring Bulk Tests

  • After configuring the fields, click Save.

  • The bulk test is added to the third pane with Queued status.

  • When execution starts, the status changes to In Progress.

  • Once execution completes, the status changes to either Completed or Failed.
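The status lifecycle (Queued → In Progress → Completed/Failed) can be modeled as a simple poll loop if you track tests outside the UI. The get_bulk_test_status stub below is hypothetical; AI+ Studio surfaces these statuses in the third pane described above.

    import time
    from enum import Enum

    class BulkTestStatus(Enum):
        QUEUED = "Queued"
        IN_PROGRESS = "In Progress"
        COMPLETED = "Completed"
        FAILED = "Failed"

    TERMINAL_STATES = {BulkTestStatus.COMPLETED, BulkTestStatus.FAILED}

    def get_bulk_test_status(test_id: str) -> BulkTestStatus:
        # Hypothetical stub: replace with a real status lookup for your environment.
        return BulkTestStatus.COMPLETED

    def wait_for_completion(test_id: str, poll_seconds: int = 30) -> BulkTestStatus:
        # Poll until the test reaches a terminal state.
        while True:
            status = get_bulk_test_status(test_id)
            if status in TERMINAL_STATES:
                return status
            time.sleep(poll_seconds)

    print(wait_for_completion("bulk-test-001").value)  # "Completed" with the stub above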

Viewing Bulk Test Results

  • Click the eye icon next to a bulk test log.

  • You will be redirected to the Bulk Test Detailed Log screen.

This screen is divided into three sections:

1. Summary Metrics

These metrics provide high-level results of the test run:

  • Total Requests: Total number of requests processed during the run.

  • Total Success: Number of successfully completed requests.

  • Total Failure: Number of requests that failed due to errors or Guardrails.

  • Average Latency: Average response time per request, in seconds.

  • P95 Latency: 95th percentile latency, that is, the time within which 95% of requests complete.
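For reference, this is how average and P95 latency are conventionally computed from per-request timings (the exact percentile method AI+ Studio uses is not specified here):

    import statistics

    latencies = [0.8, 1.1, 0.9, 2.4, 1.0, 1.3, 0.7, 5.2, 1.2, 0.9]  # seconds, sample data

    average = statistics.mean(latencies)

    # quantiles(n=100) returns the 1st..99th percentile cut points; index 94 is P95.
    p95 = statistics.quantiles(latencies, n=100)[94]

    print(f"Average latency: {average:.2f}s, P95 latency: {p95:.2f}s")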

2. Requests

Each request is logged with the following attributes:

  • Date – Timestamp of when the request was made

  • Request ID – Unique identifier for the request

  • Latency – Time taken to complete the request

  • User Name – User who triggered the request

  • Case Number – Associated case number (if applicable)

  • Status – Outcome of the request (e.g., Success, Error)
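If you export this request log for offline analysis, failed requests are easy to isolate. The file name and column headers below mirror the attributes above but are assumptions about the export format:

    import csv

    # Assumed export: one row per request, columns named after the attributes above.
    with open("bulk_test_requests.csv", newline="") as f:
        requests_log = list(csv.DictReader(f))

    failures = [row for row in requests_log if row["Status"] != "Success"]
    for row in failures:
        print(row["Request ID"], row["Status"], row["Latency"])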

3. Debug Logs

Detailed, timestamped logs of all node-level interactions within the deployment pipeline. This section includes:

  • Prompt Node Details – Input/output logs for each prompt

  • Final Output Information – The final AI-generated response

  • Error Messages – If applicable, detailed stack traces or blocking reasons

Bulk testing in AI+ Studio provides a reliable way to validate deployments at scale. By running structured tests across text, media, or case-based inputs, you can measure performance, identify errors, and ensure consistent results before deploying to production.