Abstract
When deployed in open-ended scenarios, large language models frequently face security threats such as jailbreak attacks, prompt injection, tool-chain abuse, and privacy leakage. CEVA provides an end-to-end LLM safety evaluation methodology: first, we construct attack benchmarks and automated test scripts covering multiple risk dimensions; second, we introduce a hierarchical risk scoring system that quantifies model robustness across tasks; finally, we propose a joint defense baseline spanning input filtering, inference-time guardrails, and output auditing. Experiments show that our approach reliably reproduces high-risk cases and significantly reduces both harmful response rates and sensitive-information exposure.
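To make the hierarchical scoring idea concrete, the sketch below shows one plausible way such a system could aggregate per-dimension results into a single robustness score. The dimension names, severity weights, and function names are illustrative assumptions, not CEVA's actual scoring rules.

```python
# Hypothetical sketch of hierarchical risk scoring: robustness is computed
# per risk dimension, then severity-weighted into one composite score.
# All weights and dimension names here are assumptions for illustration.

SEVERITY_WEIGHTS = {
    "jailbreak": 1.0,
    "prompt_injection": 0.8,
    "toolchain_abuse": 0.9,
    "privacy_leakage": 1.0,
}

def dimension_score(attacks_blocked: int, attacks_total: int) -> float:
    """Robustness for one risk dimension: fraction of attacks defended."""
    return attacks_blocked / attacks_total if attacks_total else 1.0

def hierarchical_risk_score(results: dict[str, tuple[int, int]]) -> float:
    """Severity-weighted average of the per-dimension robustness scores."""
    total_weight = sum(SEVERITY_WEIGHTS[d] for d in results)
    return sum(
        SEVERITY_WEIGHTS[d] * dimension_score(blocked, total)
        for d, (blocked, total) in results.items()
    ) / total_weight

# Example: (attacks blocked, attacks attempted) per dimension.
score = hierarchical_risk_score({
    "jailbreak": (90, 100),
    "prompt_injection": (70, 100),
    "toolchain_abuse": (80, 100),
    "privacy_leakage": (95, 100),
})
```

A weighted average is just one aggregation choice; a real system might instead report the minimum dimension score so that a single weak dimension cannot be hidden by strong ones.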
If you find our work helpful, please consider citing:
BibTeX
@article{ceva2026,
  title={CEVA: Comprehensive Evaluation for LLM Safety},
  author={Huang and CEVA Team and AI Security Group},
  year={2026},
  url={https://arxiv.org/abs/}
}