Repository navigation

ai-safety

Website
Wikipedia

jphall663 / awesome-machine-learning-interpretability

A curated list of awesome responsible machine learning resources.

fairness xai interpretability transparency 机器学习数据科学 Python R Awesome Lists machine-learning-interpretability interpretable-machine-learning interpretable-ml interpretable-ai explainable-ml ai-safety privacy-enhancing-technologies privacy-preserving-machine-learning

3871

612

17 天前

PKU-Alignment / safe-rlhf

Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback

ai-safety alpaca datasets deepspeed large-language-models llama 大语言模型 reinforcement-learning reinforcement-learning-from-human-feedback rlhf transformers vicuna safety gpt transformer beaver

Python

1535

124

1 个月前

OpenLMLab / MOSS-RLHF

Secrets of RLHF in Large Language Models Part I: PPO

rlhf alignment ai-safety

Python

1399

104

2 年前

cvs-health / uqlm

UQLM: Uncertainty Quantification for Language Models, is a Python package for UQ-based LLM hallucination detection

ai-safety hallucination 大语言模型 llm-evaluation uncertainty-estimation uncertainty-quantification

Python

1050

111

2 天前

Pacific-AI-Corp / langtest

Deliver safe & effective language models

benchmarks large-language-models ml-safety ml-testing mlops 自然语言处理 responsible-ai ai-safety 人工智能 benchmark-framework 大语言模型

Python

543

8 天前

agencyenterprise / PromptInject

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022

ai-safety language-models ml-safety agi ai-alignment adversarial-attacks gpt-3 large-language-models 机器学习 chain-of-thought prompt-engineering

Python

422

2 年前

tigerlab-ai / tiger

Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)

classification fine-tuning 大语言模型 llm-training rag ai-safety data-augmentation large-language-models

Jupyter Notebook

399

2 年前

hendrycks / ethics

Aligning AI With Shared Human Values (ICLR 2021)

ai-safety gpt-3 ml-safety

Python

299

2 年前

ShengranHu / Thought-Cloning

[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking

ai-safety 人工智能深度学习 imitation-learning reinforcement-learning PyTorch

Python

269

1 年前

Jiaqi-Chen-00 / ImBD

[AAAI 2025 oral] Official repository of Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection

ai-safety

Python

239

6 个月前

cvs-health / langfair

LangFair is a Python library for conducting use-case level LLM bias and fairness assessments

人工智能 bias bias-detection fairness fairness-ai fairness-ml fairness-testing large-language-models 大语言模型 responsible-ai Python ai-safety llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Python

235

2 天前

normster / llm_rules

RuLES: a benchmark for evaluating rule-following in language models

ai-security gpt-4 ai-safety

Python

234

7 个月前

WindVChen / DiffAttack

An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.

ai-safety diffusion-models

Python

232

1 年前

Govcraft / rust-docs-mcp-server

🦀 Prevents outdated Rust code suggestions from AI assistants. This MCP server fetches current crate docs, uses embeddings/LLMs, and provides accurate context via a tool call.

人工智能大语言模型 mcp mcp-server Rust ai-safety caching cargo coding-assistant developer-tools embeddings information-retrieval openai rag rust-library semantic-search

Rust

193

3 个月前