After reading the fascinating Garak paper, I was immediately intrigued by its approach to LLM security testing. As someone deeply interested in both security and AI/LLMs, I decided to explore this framework with local open-source models.
My testing environment utilises oobabooga's text-generation-webui as the serving framework for the API endpoint. Getting started required some configuration tweaks, particularly around how Garak parses the JSON responses coming back from the API.
The key was creating a custom configuration file so Garak's REST generator could interface with the local API. One detail that tripped me up: a nested response field needs to be addressed with a JSONPath expression starting with $, not a bare field name:
{
    "rest": {
        "RestGenerator": {
            "name": "oobabooga Local Model",
            "uri": "http://127.0.0.1:5000/v1/completions",
            "method": "post",
            "headers": {
                "Content-Type": "application/json"
            },
            "req_template_json_object": {
                "prompt": "$INPUT",
                "max_tokens": 1000,
                "temperature": 0.7
            },
            "response_json": true,
            "response_json_field": "$.choices[0].text"
        }
    }
}
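Before launching a full run, it's worth sanity-checking that the endpoint actually returns the shape the config expects. Here's a minimal smoke test, assuming text-generation-webui is running with its OpenAI-compatible API enabled on port 5000 (the prompt and token count are just placeholders):

import requests

# Mirror the req_template_json_object from the Garak config above.
payload = {
    "prompt": "Say hello.",
    "max_tokens": 32,
    "temperature": 0.7,
}

resp = requests.post(
    "http://127.0.0.1:5000/v1/completions",
    headers={"Content-Type": "application/json"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()

# The OpenAI-style completions schema nests the generated text under
# choices[0].text, which is why the config uses $.choices[0].text.
print(resp.json()["choices"][0]["text"])

If the last line prints a completion, the $.choices[0].text path in the config should resolve correctly.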
With this configuration in place, I launched Garak using:
garak --model_type rest -G openai_local_config.json
I chose to test WhiteRabbitNeo, a security-focused model based on Llama 3.1 8B. The framework began firing hundreds of probe requests at my API endpoint, and the model's responses drew failing grades across a number of test categories.
Upon examining the logs, I was taken aback by the content. The probes contained explicit material, including sexual content and racial slurs, intermixed with technical request/response data. While I understand that such probes are necessary for security testing, the content was nonetheless challenging to review.
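Rather than scrolling raw logs, it's easier to summarise per-probe pass rates programmatically. Below is a rough sketch; the report filename and the entry_type / probe / passed / total field names are assumptions based on the report format of the Garak version I ran, so check a line of your own report JSONL before relying on it:

import json
from collections import defaultdict

# Assumed path: Garak prints the actual report location at the end of a run.
REPORT = "garak.report.jsonl"

results = defaultdict(lambda: [0, 0])  # probe -> [passed, total]

with open(REPORT, encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        # "eval" entries carry per-probe pass counts in the reports I saw;
        # verify these field names against your own report.
        if entry.get("entry_type") == "eval":
            probe = entry.get("probe", "unknown")
            results[probe][0] += entry.get("passed", 0)
            results[probe][1] += entry.get("total", 0)

for probe, (passed, total) in sorted(results.items()):
    if total:
        print(f"{probe}: {passed}/{total} passed ({100 * passed / total:.0f}%)")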
Performance-wise, after several hours of testing I had only completed about 10% of the test suite. Running on an NVIDIA RTX 3090, I estimate a full run would take at least 24 hours. The comprehensiveness of the framework is remarkable, and it's clear this should be an essential tool for anyone deploying public-facing models.
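One mitigation: a run doesn't have to be all-or-nothing. Garak's --list_probes flag enumerates the available probes, and --probes restricts a run to a subset, which is far friendlier when iterating on a config:

garak --list_probes
garak --model_type rest -G openai_local_config.json --probes dan

(The dan module here is just an example; pick whichever probe categories matter most for your deployment.)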
Key Takeaway: Security testing frameworks like Garak should be considered a mandatory component of model deployment pipelines, helping providers assess model safety and verify that their guardrails actually hold up under attack.