Repository navigation
ai-safety
- Website
- Wikipedia
A curated list of awesome responsible machine learning resources.
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
UQLM: Uncertainty Quantification for Language Models, is a Python package for UQ-based LLM hallucination detection
Deliver safe & effective language models
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
[AAAI 2025 oral] Official repository of Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
RuLES: a benchmark for evaluating rule-following in language models
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
🦀 Prevents outdated Rust code suggestions from AI assistants. This MCP server fetches current crate docs, uses embeddings/LLMs, and provides accurate context via a tool call.
📚 A curated list of papers & technical articles on AI Quality & Safety
Toolkits to create a human-in-the-loop approval layer to monitor and guide AI agents workflow in real-time.
Code accompanying the paper Pretraining Language Models with Human Preferences
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).