Repository navigation

#

ai-safety

Secrets of RLHF in Large Language Models Part I: PPO

Python
1399
2 年前

UQLM: Uncertainty Quantification for Language Models, is a Python package for UQ-based LLM hallucination detection

Python
1050
2 天前

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022

Python
422
2 年前

Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)

Jupyter Notebook
399
2 年前

Aligning AI With Shared Human Values (ICLR 2021)

Python
299
2 年前

[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking

Python
269
1 年前

[AAAI 2025 oral] Official repository of Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection

Python
239
6 个月前

RuLES: a benchmark for evaluating rule-following in language models

Python
234
7 个月前

An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.

Python
232
1 年前

🦀 Prevents outdated Rust code suggestions from AI assistants. This MCP server fetches current crate docs, uses embeddings/LLMs, and provides accurate context via a tool call.

Rust
193
3 个月前
Svelte
184
10 个月前

Code accompanying the paper Pretraining Language Models with Human Preferences

Python
180
2 年前

How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚

170
3 年前

[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use

Python
165
2 年前

BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).

Makefile
158
2 年前