Harmful Content Guardrail
The Harmful Content Guardrail helps prevent the system from generating or engaging with unsafe, abusive, or violent material. This configuration is designed to maintain a safe environment for users and ensure compliance with organizational and regulatory standards.
When you enable this guardrail, the system actively monitors inputs and outputs for harmful content, including:
Threats, harassment, or abusive language
Self-harm or suicide-related discussions
Violent or graphic descriptions
Promotion or endorsement of illegal or unsafe activities
If harmful content is detected, the system blocks the interaction and triggers a fallback action or message that you configure.
You can customize the guardrail by adjusting sensitivity levels (tolerance) to balance strictness with flexibility, depending on your business needs.
This configuration ensures that user interactions remain safe, respectful, and aligned with organizational policies.
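Conceptually, the Guardrail wraps every exchange with the model: it can screen the user's input before the model sees it and the model's output before the user sees it, returning the fallback message you configure whenever either side is blocked. The Python sketch below is a minimal illustration of that flow; the function names and the keyword-based detector are hypothetical stand-ins, not part of AI+ Studio.

```python
def run_with_guardrail(user_input, call_model, is_harmful,
                       blocked_input_msg, blocked_output_msg):
    """Check the input before the model sees it and the output before the
    user sees it, returning the configured fallback message when blocked."""
    if is_harmful(user_input):        # Guardrail applied on: Input
        return blocked_input_msg
    response = call_model(user_input)
    if is_harmful(response):          # Guardrail applied on: Output
        return blocked_output_msg
    return response

# Toy usage: a real deployment would call a moderation model, not keywords.
print(run_with_guardrail(
    "Hello, how are you?",
    call_model=lambda text: "I'm doing well, thanks!",
    is_harmful=lambda text: "attack" in text.lower(),
    blocked_input_msg="Your input contains content that is not allowed.",
    blocked_output_msg="The response was blocked due to a harmful content policy.",
))
```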
Configure Harmful Content Guardrail in AI+ Studio
On the Guardrail record manager screen, click the ‘+ Guardrail’ button to create a new Guardrail.
You will be redirected to the Select Generative AI Guardrail window; choose Harmful Content Guardrail from the dropdown list.
Click the Next button. You will be redirected to the configuration steps.
1. Configure Basic Details
On the Basic Details screen, provide the following information:
Name
Enter a unique and meaningful name for the Guardrail.
Description
Provide a short description that explains the purpose or scope of the Guardrail.
Example: Blocks content that includes hate speech, threats, or other forms of harmful expression.
Apply On
Select where the Guardrail should apply:
Input – Applies the Guardrail to user inputs before they are sent to the AI model.
Output – Applies the Guardrail to AI-generated responses.
You can select one or both options based on your enforcement requirements.
Message for Blocked Input
Enter the message that should be displayed when user input is blocked by the Guardrail.
Example: Your input contains content that is not allowed. Please revise and try again.
Message for Blocked Output
Enter the message that should be shown when the AI model output is blocked.
Example: The response was blocked due to a harmful content policy.
Share Guardrails With
Specify which users or user groups can access and use this Guardrail. This setting enables collaboration and centralized governance.
Click the ‘Next’ button at the bottom-right corner to proceed to the next step.
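Before moving on, the step 1 fields can be summarized in a structure like the following. This is a hypothetical sketch whose keys simply mirror the UI labels; it is not an actual AI+ Studio API schema.

```python
# Hypothetical summary of the Basic Details fields; not a real API schema.
basic_details = {
    "name": "Harmful-Content-Guardrail",
    "description": "Blocks content that includes hate speech, threats, "
                   "or other forms of harmful expression.",
    "apply_on": ["input", "output"],  # one or both
    "message_for_blocked_input": (
        "Your input contains content that is not allowed. "
        "Please revise and try again."
    ),
    "message_for_blocked_output": (
        "The response was blocked due to a harmful content policy."
    ),
    "share_with": ["compliance-team"],  # users or user groups
}
```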
2. Configure Harmful Content Detection
On the Harmful Content Detection screen, you can define specific categories of harmful content that the Guardrail should monitor and block. This configuration helps ensure that AI models do not produce outputs that violate safety, legal, or ethical standards.
You can add one or more harmful content types to a Guardrail. Each content type includes:
Harmful Content
This field indicates the category of harmful content that the Guardrail will detect. Select a harmful content type from the dropdown.
Supported Harmful Content Types
You can select from the following predefined categories:
Name | Description |
--- | --- |
Child Exploitation | The text must never include any form of content that sexualizes, exploits, abuses, or endangers minors in any way. All references to child sexual abuse material, grooming, or solicitation of minors are strictly prohibited. |
Code Interpreter Abuse | Content should not attempt to use or encourage misuse of a code interpreter for malicious, unauthorized, or unsafe purposes, including executing harmful code, bypassing security controls, or facilitating prohibited activities. |
Defamation | Texts must avoid communicating or spreading content that could damage the reputation of individuals or organizations through false, misleading, or unverified information. No slander, libel, or personal attacks should be present. |
Hate | Messages should not contain language or behavior that encourages, incites, or promotes hatred, discrimination, or violence against individuals or groups based on race, ethnicity, religion, gender, sexual orientation, disability, or any other protected status. |
Indiscriminate Weapons | Content must strictly exclude any discussion, facilitation, or promotion of manufacturing, acquiring, or using indiscriminate weapons such as biological, chemical, nuclear, or other means of mass violence. |
Elections | AI models should not create content that contains factually incorrect information about electoral systems and processes, including the time, place, or manner of voting in civic elections. |
Non-Violent Crimes | Text must not provide advice, instructions, or encouragement to commit or cover up non-violent crimes such as fraud, theft, hacking, or other unlawful activities that do not involve direct physical harm. |
Privacy | Messages must not contain, solicit, or expose any individual's sensitive personal information, such as home addresses, phone numbers, or confidential identifiers. Doxxing, excessive probing about private matters, and violation of data protection standards are all strictly prohibited. |
Self-Harm | Text should not encourage, suggest, or glorify self-injurious behavior, suicide, or content that could lead users towards harming themselves, regardless of context or intent. |
Sex Crimes | Messages must not discuss, instruct, or support engagement in sex crimes, including but not limited to sexual assault, exploitation, human trafficking, or coercion. |
Sexual Content | The content should be free from sexually explicit material, descriptions, or solicitations, and must not serve to arouse or exploit sexual interests. |
Specialized Advice | Messages should never provide expert-level recommendations, instructions, or consultations in regulated fields such as law, medicine, or finance, where professional qualifications are required. |
Violent Crimes | There should be no references to, or promotion of, committing, assisting, or glamorizing violent criminal acts, including murder, assault, kidnapping, or any act resulting in physical harm to individuals. |
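For the sketches in this article, the predefined category names from the table above can be collected into a set for validation. The variable name is ours; the values mirror the UI labels.

```python
# Predefined categories from the table above (names mirror the UI labels).
SUPPORTED_HARMFUL_CONTENT_TYPES = {
    "Child Exploitation", "Code Interpreter Abuse", "Defamation",
    "Elections", "Hate", "Indiscriminate Weapons", "Non-Violent Crimes",
    "Privacy", "Self-Harm", "Sex Crimes", "Sexual Content",
    "Specialized Advice", "Violent Crimes",
}
```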
Tolerance
The Tolerance Level determines how aggressively the Guardrail monitors and blocks content for a selected harmful content category. It allows you to adjust sensitivity based on your business or compliance needs.
You can choose from the following levels:
Low – Applies the strictest filtering. Even borderline or potentially harmful content is blocked. Recommended for high-risk categories such as child exploitation or violent crimes.
Medium – Offers a balanced approach. Clearly harmful content is blocked, while borderline content may be allowed. Suitable for general use cases.
High – Applies minimal filtering. Only the most severe or explicit content is blocked. Use this for categories where broader flexibility is acceptable.
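One plausible way to picture the mechanics (an assumption on our part; the product does not document how tolerance is implemented) is a confidence threshold on a moderation score, where a lower tolerance means a lower blocking threshold:

```python
# Assumed mechanics, for intuition only: lower tolerance -> lower threshold,
# so more borderline content is blocked. Threshold values are illustrative.
TOLERANCE_THRESHOLDS = {"low": 0.30, "medium": 0.60, "high": 0.85}

def should_block(harm_score: float, tolerance: str) -> bool:
    """Block when a detector's harm score reaches the tolerance threshold."""
    return harm_score >= TOLERANCE_THRESHOLDS[tolerance]

print(should_block(0.45, "low"))     # True  - strict filtering
print(should_block(0.45, "medium"))  # False - borderline content allowed
```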
Choose a tolerance based on the sensitivity of the use case. For example:
Child Exploitation: Low
Violent Crimes: Low
Defamation: Medium
Best practice: Use Low tolerance for highly sensitive categories such as Child Exploitation, Violent Crimes, or Indiscriminate Weapons.
Note: Descriptions are auto-generated for Harmful Content Types.
You can broaden coverage by adding multiple harmful content types to a single Guardrail using the + Harmful Content button.
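Continuing the hypothetical sketch from step 1, adding multiple harmful content types corresponds to appending one entry per category, each with its own tolerance:

```python
# Hypothetical continuation of the basic_details sketch above.
guardrail_config = {
    **basic_details,
    "harmful_content": [
        {"type": "Child Exploitation", "tolerance": "low"},
        {"type": "Violent Crimes", "tolerance": "low"},
        {"type": "Defamation", "tolerance": "medium"},
    ],
}
```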
Click the ‘Save’ button to save your Harmful Content Guardrail.