Golden Test Set for Scoring


The Golden Test Set feature enables users to evaluate the accuracy and effectiveness of their Smart FAQ models or AI Agent configurations through bulk testing. It provides a structured way to assess model performance against a predefined dataset, helping teams validate and refine their conversational AI setups.

Steps to Set Up a Golden Test Set

  1. Follow steps 1-2 from Creating a Smart FAQ Model.

  2. On the Training Content window, click Open under Golden Test Sets.

  3. Click Manage Set in the top right corner.

  4. Click + Add Question & Answer and enter the Question and its corresponding Answer.

  5. Click Save.

Golden Test Set (GTS) Scoring Metrics

  1. Overall Score

    This is your main performance indicator. It measures how closely your bot's answers match the correct ones using BERTScore, a metric that compares the meaning of the text rather than requiring exact word matches. The final score is the average of the scores across all your test cases, giving you a single, high-level view of accuracy.

  2. Retrieval Quality

    This checks how well your system retrieves the right information from your data. Each piece of retrieved content is compared to the correct answer. If it's a close enough match (above a threshold), it's marked as relevant. From there, we calculate precision (the share of retrieved content that is relevant) and recall (the share of relevant content that was retrieved) to determine how often your system fetches the right context.

  3. Answer Relevance

    This measures how useful and complete the bot’s response is. An AI evaluator judges the answer on:

    • Completeness – Does it fully respond to the user’s question?

    • Helpfulness – Is the information practical and usable?

    • Logical Flow – Is the answer clearly structured and reasonable?

  4. Groundedness

    Groundedness checks that your bot is not making things up. It measures how much of the answer is directly supported by the retrieved information by counting the sentences in the response that are factually based on the retrieved context against those that are not.
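The numeric metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation. The function names, the 0.7 threshold, the `total_relevant` parameter, and the choice to express groundedness as a ratio are all assumptions. The per-question scores stand in for values that a BERTScore-style metric would produce.

```python
from statistics import mean


def overall_score(per_case_scores):
    """Average the per-test-case similarity scores into one overall score."""
    if not per_case_scores:
        raise ValueError("at least one test case score is required")
    return mean(per_case_scores)


def retrieval_metrics(similarities, threshold, total_relevant):
    """Return (precision, recall) for one test question.

    similarities: similarity of each retrieved chunk to the correct answer.
    threshold: minimum similarity for a chunk to count as relevant (assumed value).
    total_relevant: number of relevant chunks in the data (assumed to be known).
    """
    relevant_retrieved = sum(1 for s in similarities if s > threshold)
    precision = relevant_retrieved / len(similarities) if similarities else 0.0
    recall = relevant_retrieved / total_relevant if total_relevant else 0.0
    return precision, recall


def groundedness(sentence_supported):
    """Fraction of response sentences supported by the retrieved context.

    sentence_supported: one boolean per sentence in the bot's response.
    """
    if not sentence_supported:
        return 0.0
    return sum(sentence_supported) / len(sentence_supported)


# Hypothetical inputs for a small test run.
print(overall_score([0.5, 0.75, 1.0]))                        # 0.75
print(retrieval_metrics([0.9, 0.8, 0.3, 0.2], 0.7, 4))        # (0.5, 0.5)
print(groundedness([True, True, False, True]))                # 0.75
```

In this sketch, two of the four retrieved chunks clear the threshold, so precision is 0.5. Recall is also 0.5 because four relevant chunks are assumed to exist.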

Steps to Trigger a Golden Test Set

  1. Follow steps 1-2 from Steps to Set Up a Golden Test Set.

  2. Click Calculate Performance in the top right corner.

  3. Enter a clear Prompt and, if required, add an Additional Field or choose one from the resources. Click Save.

Actionability from GTS Scores


| Overall | Retrieval | Groundedness | Answer Relevance | Likely Cause(s) | Actionable Item(s) |
|---|---|---|---|---|---|
|  |  |  |  | All metrics are good | No action needed |
|  |  |  |  | Weak answer relevance | Refine relevance instructions, use stronger LLM |
|  |  |  |  | Poor grounding | Enhance grounding prompts, verify citations |
|  |  |  |  | Grounding & relevance issues | Enhance context usage, refine instructions |
|  |  |  |  | Low recall retrieval | Tune retriever, adjust search thresholds |
|  |  |  |  | Low recall & weak relevance | Tune retriever, refine relevance instructions |
|  |  |  |  | Low recall & poor grounding | Tune retriever, enhance grounding prompts |
|  |  |  |  | Recall, grounding & relevance issues | Tune retriever, enhance context prompts, refine instructions |
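
The likely causes and actions listed above can be read as a lookup keyed on which of the three component metrics fall short. The sketch below is only an illustration. The `diagnose` function, the `ACTIONS` table, and the use of simple pass/fail flags are assumptions, and how a score is judged "good" is left to the caller.

```python
# Keys are (retrieval_ok, grounded_ok, relevance_ok); values are
# (likely cause, actionable items) taken from the guidance above.
ACTIONS = {
    (True, True, True): ("All metrics are good", "No action needed"),
    (True, True, False): ("Weak answer relevance",
                          "Refine relevance instructions, use stronger LLM"),
    (True, False, True): ("Poor grounding",
                          "Enhance grounding prompts, verify citations"),
    (True, False, False): ("Grounding & relevance issues",
                           "Enhance context usage, refine instructions"),
    (False, True, True): ("Low recall retrieval",
                          "Tune retriever, adjust search thresholds"),
    (False, True, False): ("Low recall & weak relevance",
                           "Tune retriever, refine relevance instructions"),
    (False, False, True): ("Low recall & poor grounding",
                           "Tune retriever, enhance grounding prompts"),
    (False, False, False): ("Recall, grounding & relevance issues",
                            "Tune retriever, enhance context prompts, refine instructions"),
}


def diagnose(retrieval_ok, grounded_ok, relevance_ok):
    """Return (likely cause, actionable items) for a set of pass/fail flags."""
    return ACTIONS[(bool(retrieval_ok), bool(grounded_ok), bool(relevance_ok))]


cause, action = diagnose(retrieval_ok=False, grounded_ok=True, relevance_ok=True)
print(cause)   # Low recall retrieval
print(action)  # Tune retriever, adjust search thresholds
```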