Golden Test Set for Scoring
The Golden Test Set feature enables users to evaluate the accuracy and effectiveness of their Smart FAQ models or AI Agent configurations through bulk testing. It provides a structured way to assess model performance against a predefined dataset, helping teams validate and refine their conversational AI setups.
Steps to Set Up a Golden Test Set
1. Follow steps 1-2 from Creating a Smart FAQ Model.
2. On the Training Content window, click Open under Golden Test Sets.
3. Click Manage Set in the top right corner.
4. Click + Add Question & Answer and enter the Question and its corresponding Answer (see the illustrative pairs below).
5. Click Save.
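For reference, a golden test set is simply a collection of question-answer pairs. The pairs below are hypothetical and shown only to illustrate the shape of the data; in the product they are entered through the UI steps above, not as code.

```python
# Hypothetical question-answer pairs for a golden test set.
# Illustration only; the product manages these via the UI, not code.
golden_test_set = [
    {"question": "How do I reset my password?",
     "answer": "Open Account Settings and choose Reset Password."},
    {"question": "How long do refunds take?",
     "answer": "Refunds are processed within 5-7 business days."},
]
```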
Golden Test Set (GTS) Scoring Metrics
Overall Score
This is your main performance indicator. It measures how closely your bot's answers match the correct ones using a metric called BERTScore, which looks at the meaning of the words rather than exact matches. The final score is the average of all scores across your test cases, giving you a clear, high-level view of accuracy.
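If you want to reproduce this kind of calculation outside the platform, a minimal sketch using the open-source bert-score package might look like the following. The model and settings the platform uses are not documented here, so treat this as an illustration of the approach, not the product's implementation.

```python
# Illustrative only: averages BERTScore F1 over golden test cases.
# The platform's exact model and settings are assumptions.
from bert_score import score

golden_answers = [
    "You can reset your password from the Account Settings page.",
    "Refunds are processed within 5-7 business days.",
]
bot_answers = [
    "Go to Account Settings to reset your password.",
    "A refund usually takes about a week to process.",
]

# score() returns precision, recall, and F1 tensors, one value per pair.
P, R, F1 = score(bot_answers, golden_answers, lang="en", verbose=False)

overall_score = F1.mean().item()  # average across all test cases
print(f"Overall Score (mean BERTScore F1): {overall_score:.3f}")
```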
Retrieval Quality
This checks how well your system retrieves the right information from your data. Each piece of retrieved content is compared to the correct answer. If it's a close enough match (above a threshold), it's marked as relevant. From there, we calculate precision and recall to determine how often your system fetches the right context.
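A rough sketch of the idea is shown below. Real evaluators typically use embedding similarity and a tuned threshold; word-overlap similarity, the 0.5 threshold, and the hit-rate definition of recall here are simplifications chosen so the example runs without extra dependencies.

```python
# Illustrative threshold-based relevance check for retrieved content.

def similarity(a: str, b: str) -> float:
    """Jaccard word overlap, a crude stand-in for semantic similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def retrieval_quality(retrieved_chunks, golden_answer, threshold=0.5):
    """Mark each retrieved chunk relevant if it matches the golden answer
    closely enough, then report precision and a simple hit-rate recall."""
    relevant = [c for c in retrieved_chunks if similarity(c, golden_answer) >= threshold]
    precision = len(relevant) / len(retrieved_chunks) if retrieved_chunks else 0.0
    recall = 1.0 if relevant else 0.0  # simplification: was the right context fetched at all?
    return precision, recall

chunks = [
    "Refunds are processed within 5-7 business days after approval.",
    "Our office hours are 9am to 5pm on weekdays.",
]
print(retrieval_quality(chunks, "Refunds are processed within 5-7 business days."))
```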
Answer Relevance
This measures how useful and complete the bot's response is. Using an LLM-based evaluator, the answer is judged on the criteria below (a rubric sketch follows the list):
• Completeness – Does it fully respond to the user's question?
• Helpfulness – Is the information practical and usable?
• Logical Flow – Is the answer clearly structured and reasonable?
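A minimal sketch of this kind of LLM-judge rubric is shown below. The judging model, prompt wording, and score scale are assumptions for illustration, not the product's own evaluator.

```python
# Illustrative LLM-judge rubric for answer relevance.
# Model choice, prompt, and scale are assumptions, not the product's evaluator.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """Rate the assistant's answer to the user's question on a 1-5 scale for:
- completeness: does it fully respond to the question?
- helpfulness: is the information practical and usable?
- logical_flow: is the answer clearly structured and reasonable?
Respond with JSON like {"completeness": 4, "helpfulness": 5, "logical_flow": 4}."""

def judge_answer(question: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

scores = judge_answer("How do I reset my password?",
                      "Open Account Settings and choose Reset Password.")
print(scores)
```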
Groundedness
Groundedness ensures your bot is not making things up. It checks how much of the answer is directly supported by the retrieved information. We calculate how many sentences in the response are factually based on the retrieved context vs. those that are not.
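The sketch below shows the sentence-level ratio this describes. Word-overlap support is a crude stand-in for the entailment or LLM check a real evaluator would use.

```python
# Illustrative groundedness ratio: fraction of answer sentences that appear
# to be supported by the retrieved context.
import re

def is_supported(sentence: str, context: str, min_overlap: float = 0.6) -> bool:
    words = set(re.findall(r"\w+", sentence.lower()))
    context_words = set(re.findall(r"\w+", context.lower()))
    return bool(words) and len(words & context_words) / len(words) >= min_overlap

def groundedness(answer: str, retrieved_context: str) -> float:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = sum(is_supported(s, retrieved_context) for s in sentences)
    return supported / len(sentences)

context = "Refunds are processed within 5-7 business days after approval."
answer = "Refunds are processed within 5-7 business days. You can also earn bonus points."
print(f"Groundedness: {groundedness(answer, context):.2f}")  # 0.50: one of two sentences supported
```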
Steps to Trigger a Golden Test Set
1. Follow steps 1-2 from Steps to Set Up a Golden Test Set.
2. Click Calculate Performance in the top right corner.
3. Enter a clear Prompt and, if required, add or choose any Additional Field from the resources. Click Save.
Actionability from GTS Scores
Overall | Retrieval | Groundedness | Answer Relevance | Likely Cause(s) | Actionable Item(s)
✅ | ✅ | ✅ | ✅ | All metrics are good | No action needed
❌ | ✅ | ✅ | ❌ | Weak answer relevance | Refine relevance instructions, use stronger LLM
❌ | ✅ | ❌ | ✅ | Poor grounding | Enhance grounding prompts, verify citations
❌ | ✅ | ❌ | ❌ | Grounding & relevance issues | Enhance context usage, refine instructions
❌ | ❌ | ✅ | ✅ | Low recall retrieval | Tune retriever, adjust search thresholds
❌ | ❌ | ✅ | ❌ | Low recall & weak relevance | Tune retriever, refine relevance instructions
❌ | ❌ | ❌ | ✅ | Low recall & poor grounding | Tune retriever, enhance grounding prompts
❌ | ❌ | ❌ | ❌ | Recall, grounding & relevance issues | Tune retriever, enhance context prompts, refine instructions
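If you want to triage results programmatically, the table above can be expressed as a simple lookup keyed on which metrics pass. The pass/fail thresholds below are placeholders, not the product's cut-offs.

```python
# Illustrative triage of GTS results based on the table above.
# The 0.7 pass/fail thresholds are placeholders, not the product's cut-offs.

PASS_THRESHOLDS = {"retrieval": 0.7, "groundedness": 0.7, "answer_relevance": 0.7}

ACTIONS = {
    # (retrieval_ok, groundedness_ok, relevance_ok) -> suggested action
    (True, True, True): "No action needed",
    (True, True, False): "Refine relevance instructions, use stronger LLM",
    (True, False, True): "Enhance grounding prompts, verify citations",
    (True, False, False): "Enhance context usage, refine instructions",
    (False, True, True): "Tune retriever, adjust search thresholds",
    (False, True, False): "Tune retriever, refine relevance instructions",
    (False, False, True): "Tune retriever, enhance grounding prompts",
    (False, False, False): "Tune retriever, enhance context prompts, refine instructions",
}

def triage(scores: dict) -> str:
    """Map metric scores to the suggested action from the table above."""
    key = tuple(scores[m] >= PASS_THRESHOLDS[m]
                for m in ("retrieval", "groundedness", "answer_relevance"))
    return ACTIONS[key]

print(triage({"retrieval": 0.45, "groundedness": 0.82, "answer_relevance": 0.90}))
# -> "Tune retriever, adjust search thresholds"
```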