

How to Remove Safety Restrictions from Large Models, with Code and Datasets

AI and Security

2025-11-17


Summary:

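After the previous post on uncensored models, "Recommending an uncensored model, especially suited to penetration testing and other security work", was published, many readers showed great interest. One reader in particular commented that there is a way to solve this problem, which sparked a lot of discussion.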

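After some digging, I found an article today that describes the full procedure for removing the restrictions without retraining, and I am sharing it here.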

The article comes from Hugging Face: https://huggingface.co/blog/mlabonne/abliteration

It explains the method for removing a model's restrictions in detail, including code and datasets.

Main links:

Code: Google Colab https://colab.research.google.com/drive/1VYm3hOcvCpbGiqKZb141gJwjdmmCcVpR?usp=sharing

Two datasets: one containing harmless instructions and one containing harmful instructions. The original author uses data from tatsu-lab/alpaca and llm-attacks, repackaged for convenience as two Hugging Face datasets:

mlabonne/harmless_alpaca (https://huggingface.co/datasets/mlabonne/harmless_alpaca)

mlabonne/harmful_behaviors (https://huggingface.co/datasets/mlabonne/harmful_behaviors)

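Actual cost for an 8B model (this figure refers to the DPO fine-tuning step described later in the article):

The author trained with 6 A6000 GPUs and DeepSpeed ZeRO-2. Training took about 6 hours and 45 minutes.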

Note: I have only translated this method and have not run it myself; I take no responsibility for the results.

Original author: Maxime Labonne

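The third generation of Llama models provided fine-tuned (Instruct) versions that excel at understanding and following instructions. However, these models are heavily censored: they are designed to refuse requests seen as harmful, with responses such as "As an AI assistant, I cannot help you." While this safety feature is crucial for preventing misuse, it also limits the model's flexibility and responsiveness.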

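This article explores a technique called "abliteration" that can uncensor any large language model (LLM) without retraining. The technique effectively removes the model's built-in refusal mechanism, allowing it to respond to all types of prompts.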

The code is available on Google Colab and in the LLM Course on GitHub.

✂️ What is abliteration (removing restrictions)?

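Modern LLMs are fine-tuned for safety and instruction following, which means they are trained to refuse harmful requests. In their blog post, Arditi et al. showed that this refusal behavior is mediated by a specific direction in the model's residual stream. If we prevent the model from representing this direction, it loses its ability to refuse requests. Conversely, adding this direction artificially can cause the model to refuse even harmless requests.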

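In the traditional decoder-only Llama-like architecture, there are three residual streams we can target: at the start of each block ("pre"), between the attention and MLP layers ("mid"), and after the MLP ("post"). A figure in the original article shows the location of each residual stream.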

To uncensor an LLM, we first need to identify the "refusal direction" within the model. This process involves a few technical steps:

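Data collection: Run the model on a set of harmful instructions and a set of harmless instructions, recording the residual stream activations at the last token position for each.

Mean difference: Calculate the mean difference between the activations of harmful and harmless instructions. This gives us a vector representing the "refusal direction" for each layer of the model.

Selection: Normalize these vectors and evaluate them to select the single best "refusal direction".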
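The mean-difference step can be sketched on toy numbers. This is only an illustration under stated assumptions: the activation values below are made up, plain Python lists stand in for tensors, and the helper names `mean_vector` and `unit` are hypothetical rather than part of the article's code.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy "last-token activations" for one layer (made-up values, 3 dimensions).
harmful_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

# Difference of the two group means, then normalization.
diff = [h - b for h, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
refusal_dir = unit(diff)
print(refusal_dir)  # [1.0, 0.0, 0.0]
```

In this toy case the two groups differ only along the first coordinate, so the normalized difference of means points exactly along that axis. In a real model there is one such candidate vector per layer and per residual stream position.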

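Once we have identified the refusal direction, we can "ablate" it, effectively removing the model's ability to represent this feature. This can be done through an inference-time intervention or permanently through weight orthogonalization.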

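Let's talk about inference-time intervention first. For every component that writes to the residual stream (such as an attention head), we calculate the projection of its output onto the refusal direction and subtract this projection from the residual stream. This subtraction is applied at every token and every layer, ensuring that the model never represents the refusal direction.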
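The projection-and-subtract operation described above can be sketched for a single activation vector. This is a minimal illustration, assuming a unit-norm direction and plain Python lists in place of tensors; `dot` and `ablate_direction` are hypothetical helper names, not functions from TransformerLens.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def ablate_direction(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = dot(activation, direction)  # length of the projection onto `direction`
    return [a - coeff * d for a, d in zip(activation, direction)]

direction = [0.6, 0.8]    # unit vector: 0.36 + 0.64 = 1
activation = [2.0, 1.0]   # toy activation (made-up values)

ablated = ablate_direction(activation, direction)
print(ablated)
print(dot(ablated, direction))  # approximately zero, up to floating-point error
```

After the subtraction, the activation has no remaining component along the chosen direction, which is the property the intervention relies on. Applying the function a second time leaves the vector unchanged.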

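Weight orthogonalization, on the other hand, modifies the model weights directly. By orthogonalizing the component weights with respect to the refusal direction, we prevent the model from writing to this direction at all. This is achieved by adjusting the matrices that write to the residual stream so that they no longer contribute to the refusal direction.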
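Why orthogonalizing a weight matrix blocks writes along the refusal direction can be shown in one line of algebra. The notation here is an assumption made for illustration: \(\hat{r}\) is the unit-norm refusal direction, \(W\) is any matrix whose output is added to the residual stream, and inputs are row vectors \(x\) so that the output is \(xW\).

```latex
\[
W' = W - (W\hat{r})\,\hat{r}^{\top}, \qquad \|\hat{r}\| = 1
\]
\[
(xW')\,\hat{r}
  = xW\hat{r} - x\,(W\hat{r})\,(\hat{r}^{\top}\hat{r})
  = xW\hat{r} - xW\hat{r}
  = 0
\]
```

Whatever the input \(x\), the output of the modified matrix has zero component along \(\hat{r}\), so no runtime hook is needed once the weights have been changed.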

In the next section, we will implement abliteration with weight orthogonalization.

💻 Implementation

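The following implementation of abliteration is based on FailSpy's notebook, which is itself based on the original authors' notebook. I mostly adapted and simplified it to make it easier to understand. This section is code-heavy so you can see how it works, but if you are less interested in the technical details you can use FailSpy's abliterator library (also check his collection of abliterated models on Hugging Face).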

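The code relies on the excellent TransformerLens library (formerly known as EasyTransformer) to do the heavy lifting. It is designed for mechanistic interpretability and is used here to intervene on activations. Thanks to Neel Nanda and Joseph Bloom for creating and maintaining this library.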

First, let's install and import the necessary packages. All of these steps are available in the Google Colab notebook.

!pip install transformers transformers_stream_generator tiktoken transformer_lens einops jaxtyping

import torch
import functools
import einops
import gc

from datasets import load_dataset
from tqdm import tqdm
from torch import Tensor
from typing import List
from transformer_lens import HookedTransformer, utils
from transformer_lens.hook_points import HookPoint
from transformers import AutoModelForCausalLM, AutoTokenizer
from jaxtyping import Float, Int
from collections import defaultdict

# Turn automatic differentiation off to save GPU memory (credit: Undi95)
torch.set_grad_enabled(False)

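We need two datasets: one containing harmless instructions and one containing harmful instructions. We will use tatsu-lab/alpaca as well as data from llm-attacks. To make things easier, I repackaged them as two Hugging Face datasets: mlabonne/harmless_alpaca and mlabonne/harmful_behaviors. That way, you can easily replace them with your own datasets.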

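We will load the instructions and reformat them into a list of dictionaries with "role" and "content" keys. This makes them compatible with the apply_chat_template() method, which we use to follow Llama 3's chat template.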

def reformat_texts(texts):
    return [[{"role": "user", "content": text}] for text in texts]

# Get harmful and harmless datasets
def get_harmful_instructions():
    dataset = load_dataset('mlabonne/harmful_behaviors')
    return reformat_texts(dataset['train']['text']), reformat_texts(dataset['test']['text'])

def get_harmless_instructions():
    dataset = load_dataset('mlabonne/harmless_alpaca')
    return reformat_texts(dataset['train']['text']), reformat_texts(dataset['test']['text'])

harmful_inst_train, harmful_inst_test = get_harmful_instructions()
harmless_inst_train, harmless_inst_test = get_harmless_instructions()
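To make the reformatted structure concrete, here is a small standalone sketch of what reformat_texts produces. The two sample instructions are hypothetical, harmless placeholders and are not taken from either dataset.

```python
def reformat_texts(texts):
    """Wrap each instruction string as a one-turn chat conversation."""
    return [[{"role": "user", "content": text}] for text in texts]

# Hypothetical placeholder instructions, used only to show the output shape.
sample = ["Name three primary colors.", "Summarize the water cycle."]
conversations = reformat_texts(sample)

print(len(conversations))  # 2
print(conversations[0])    # [{'role': 'user', 'content': 'Name three primary colors.'}]
```

Each instruction becomes its own conversation containing a single user message, which is the list-of-messages shape that chat templates generally expect.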

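Now that we have our datasets, we can load the model we want to abliterate. Unfortunately, you can't directly load a custom model with HookedTransformer. Here, I use a trick described in FailSpy's notebook: download the custom model and rename it as meta-llama/Meta-Llama-3-8B-Instruct. If your GPU is not compatible with BF16, load the model in torch.float16 format instead.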

In this example, we will use mlabonne/Daredevil-8B, a mega-merge created with DARE TIES (see my article on model merging) that has the highest MMLU score in the 8B category on the Open LLM Leaderboard.

MODEL_ID = "mlabonne/Daredevil-8B"
MODEL_TYPE = "meta-llama/Meta-Llama-3-8B-Instruct"

# Download and load model
!git clone https://huggingface.co/{MODEL_ID} {MODEL_TYPE}

# Load model and tokenizer
model = HookedTransformer.from_pretrained_no_processing(
    MODEL_TYPE,
    local_files_only=True,
    dtype=torch.bfloat16,
    default_padding_side='left'
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_TYPE)
tokenizer.padding_side = 'left'
tokenizer.pad_token = tokenizer.eos_token

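We can now tokenize our datasets. We use the same number of samples for both harmless and harmful instructions. Note that a high number of samples can use up all the RAM/VRAM, which is why I limit it to 256 here.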

def tokenize_instructions(tokenizer, instructions):
    return tokenizer.apply_chat_template(
        instructions,
        padding=True,
        truncation=False,
        return_tensors="pt",
        return_dict=True,
        add_generation_prompt=True,
    ).input_ids

n_inst_train = min(256, len(harmful_inst_train), len(harmless_inst_train))

# Tokenize datasets
harmful_tokens = tokenize_instructions(
    tokenizer,
    instructions=harmful_inst_train[:n_inst_train],
)
harmless_tokens = tokenize_instructions(
    tokenizer,
    instructions=harmless_inst_train[:n_inst_train],
)

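Everything is set up, so we can now implement the first step of abliteration: data collection. We want to process these tokenized datasets and store the residual stream activations in harmful and harmless. This is managed by the transformer_lens library.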

# Define batch size based on available VRAM
batch_size = 32

# Initialize defaultdicts to store activations
harmful = defaultdict(list)
harmless = defaultdict(list)

# Process the training data in batches
num_batches = (n_inst_train + batch_size - 1) // batch_size
for i in tqdm(range(num_batches)):
    print(i)
    start_idx = i * batch_size
    end_idx = min(n_inst_train, start_idx + batch_size)

    # Run models on harmful and harmless prompts, cache activations
    harmful_logits, harmful_cache = model.run_with_cache(
        harmful_tokens[start_idx:end_idx],
        names_filter=lambda hook_name: 'resid' in hook_name,
        device='cpu',
        reset_hooks_end=True
    )
    harmless_logits, harmless_cache = model.run_with_cache(
        harmless_tokens[start_idx:end_idx],
        names_filter=lambda hook_name: 'resid' in hook_name,
        device='cpu',
        reset_hooks_end=True
    )

    # Collect and store the activations
    for key in harmful_cache:
        harmful[key].append(harmful_cache[key])
        harmless[key].append(harmless_cache[key])

    # Flush RAM and VRAM
    del harmful_logits, harmless_logits, harmful_cache, harmless_cache
    gc.collect()
    torch.cuda.empty_cache()

# Concatenate the cached activations
harmful = {k: torch.cat(v) for k, v in harmful.items()}
harmless = {k: torch.cat(v) for k, v in harmless.items()}
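The batching arithmetic in the loop above, ceiling division followed by a clamped end index, can be checked in isolation. The helper name `batch_bounds` is hypothetical and the sketch uses only the standard library. The value 70 is an arbitrary example of a sample count that is not a multiple of the batch size.

```python
def batch_bounds(n_items, batch_size):
    """Return (start, end) index pairs covering n_items in chunks of batch_size."""
    num_batches = (n_items + batch_size - 1) // batch_size  # ceiling division
    return [
        (i * batch_size, min(n_items, (i + 1) * batch_size))
        for i in range(num_batches)
    ]

full = batch_bounds(256, 32)   # the article's setting: 256 samples, batches of 32
ragged = batch_bounds(70, 32)  # arbitrary count that does not divide evenly

print(len(full), full[-1])  # 8 (224, 256)
print(ragged)               # [(0, 32), (32, 64), (64, 70)]
```

With 256 samples and a batch size of 32 there are exactly 8 full batches. When the count does not divide evenly, the last batch is simply shorter, which is what the min() call on the end index handles.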

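We can now compute the refusal direction for each layer. This corresponds to the mean difference between the activations of harmful and harmless instructions, which is then normalized. We sort the candidates in descending order and store them in activation_scored.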

# Helper function to get activation index
def get_act_idx(cache_dict, act_name, layer):
    key = (act_name, layer)
    return cache_dict[utils.get_act_name(*key)]

# Compute difference of means between harmful and harmless activations at intermediate layers
activation_layers = ["resid_pre", "resid_mid", "resid_post"]
activation_refusals = defaultdict(list)

for layer_num in range(1, model.cfg.n_layers):
    pos = -1  # Position index
    for layer in activation_layers:
        harmful_mean_act = get_act_idx(harmful, layer, layer_num)[:, pos, :].mean(dim=0)
        harmless_mean_act = get_act_idx(harmless, layer, layer_num)[:, pos, :].mean(
            dim=0
        )
        refusal_dir = harmful_mean_act - harmless_mean_act
        refusal_dir = refusal_dir / refusal_dir.norm()
        activation_refusals[layer].append(refusal_dir)

# Get all calculated potential refusal directions, sort them in descending order based on their mean
# Use a subset of layers if certain activations are not promising
selected_layers = ["resid_pre"]
activation_scored = sorted(
    [
        activation_refusals[layer][l - 1]
        for l in range(1, model.cfg.n_layers)
        for layer in selected_layers
    ],
    key=lambda x: abs(x.mean()),
    reverse=True,
)
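The ranking rule used for activation_scored, sorting by the absolute value of each vector's mean in descending order, can be illustrated with toy vectors. The three candidate vectors and their labels below are made up, and `abs_mean` is a hypothetical helper name.

```python
# Toy unit-norm candidate directions keyed by a label (made-up values).
candidates = {
    "layer 1": [0.6, -0.8, 0.0],
    "layer 2": [0.0, 0.6, 0.8],
    "layer 3": [1.0, 0.0, 0.0],
}

def abs_mean(vec):
    """Absolute value of the arithmetic mean of a vector's entries."""
    return abs(sum(vec) / len(vec))

# Highest score first, mirroring sorted(..., key=..., reverse=True).
ranked = sorted(candidates, key=lambda name: abs_mean(candidates[name]), reverse=True)
print(ranked)  # ['layer 2', 'layer 3', 'layer 1']
```

One consequence worth keeping in mind: after sorting, a position in the ranked list is a rank and not a layer number, so an index into the sorted list does not by itself tell you which layer the direction came from.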

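The final step of the process is to evaluate the refusal directions we calculated. To do this, we apply each candidate direction to every residual stream and every block during inference. In the following snippet, we get generations for four harmful test instructions and the top 20 candidate directions (one per block, or layer).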

def _generate_with_hooks(
    model: HookedTransformer,
    tokenizer: AutoTokenizer,
    tokens: Int[Tensor, "batch_size seq_len"],
    max_tokens_generated: int = 64,
    fwd_hooks=[],
) -> List[str]:
    all_tokens = torch.zeros(
        (tokens.shape[0], tokens.shape[1] + max_tokens_generated),
        dtype=torch.long,
        device=tokens.device,
    )
    all_tokens[:, : tokens.shape[1]] = tokens
    for i in range(max_tokens_generated):
        with model.hooks(fwd_hooks=fwd_hooks):
            logits = model(all_tokens[:, : -max_tokens_generated + i])
            next_tokens = logits[:, -1, :].argmax(
                dim=-1
            )  # greedy sampling (temperature=0)
            all_tokens[:, -max_tokens_generated + i] = next_tokens
    return tokenizer.batch_decode(
        all_tokens[:, tokens.shape[1] :], skip_special_tokens=True
    )

def get_generations(
    model: HookedTransformer,
    tokenizer: AutoTokenizer,
    instructions: List[str],
    fwd_hooks=[],
    max_tokens_generated: int = 64,
    batch_size: int = 4,
) -> List[str]:
    generations = []
    for i in tqdm(range(0, len(instructions), batch_size)):
        tokens = tokenize_instructions(
            tokenizer, instructions=instructions[i : i + batch_size]
        )
        generation = _generate_with_hooks(
            model,
            tokenizer,
            tokens,
            max_tokens_generated=max_tokens_generated,
            fwd_hooks=fwd_hooks,
        )
        generations.extend(generation)
    return generations

# Inference-time intervention hook
def direction_ablation_hook(
    activation: Float[Tensor, "... d_act"],
    hook: HookPoint,
    direction: Float[Tensor, "d_act"],
):
    if activation.device != direction.device:
        direction = direction.to(activation.device)
    proj = (
        einops.einsum(
            activation, direction.view(-1, 1), "... d_act, d_act single -> ... single"
        )
        * direction
    )
    return activation - proj

# Testing baseline
N_INST_TEST = 4
baseline_generations = get_generations(
    model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
)

# Evaluating layers defined earlier (needs human evaluation to determine best layer for refusal inhibition)
EVAL_N = 20  # Evaluate how many of the top N potential directions
evals = []
for refusal_dir in tqdm(activation_scored[:EVAL_N]):
    hook_fn = functools.partial(direction_ablation_hook, direction=refusal_dir)
    fwd_hooks = [
        (utils.get_act_name(act_name, layer), hook_fn)
        for layer in list(range(model.cfg.n_layers))
        for act_name in activation_layers
    ]
    intervention_generations = get_generations(
        model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=fwd_hooks
    )
    evals.append(intervention_generations)

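We stored all the generations in the evals list. We can now print them and manually select the layer (block) that provides an uncensored response for each instruction. I automatically exclude responses containing "I cannot" and "I can't" to filter out unwanted answers.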

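If you can't find a layer that satisfies these requirements, you might want to test other residual streams in the selected_layers list above, other instructions, additional blocks, and so on.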

# Print generations for human evaluation
blacklist = ["I cannot", "I can't"]
for i in range(N_INST_TEST):
    print(f"\033[1mINSTRUCTION {i}: {harmful_inst_test[i]}")
    print(f"\nBASELINE COMPLETION:\n{baseline_generations[i]}\033[0m")
    for layer_candidate in range(EVAL_N):
        if not any(word in evals[layer_candidate][i] for word in blacklist):
            print(f"\n---\n\nLAYER CANDIDATE #{layer_candidate} INTERVENTION COMPLETION:")
            print(evals[layer_candidate][i])
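The blacklist check above is a plain substring match, which is worth seeing in isolation because it explains why the article still calls for human evaluation. The sample completions below are made up, and `passes_filter` is a hypothetical helper name.

```python
blacklist = ["I cannot", "I can't"]

def passes_filter(completion, blacklist):
    """True if none of the blacklisted phrases appears in the completion."""
    return not any(phrase in completion for phrase in blacklist)

# Made-up completions. The third uses a typographic apostrophe (U+2019).
completions = [
    "I cannot help with that.",
    "Sure, here is an overview.",
    "I can\u2019t do that.",
]

kept = [c for c in completions if passes_filter(c, blacklist)]
print(len(kept))  # 2: the curly-apostrophe refusal slips through
```

Because the match is exact, a refusal phrased differently, or written with a different apostrophe character, is not filtered out. The filter only narrows the list that a person then reads.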

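In my case, layer candidate 9 managed to provide an uncensored answer for all four instructions. This is the one we will select as the refusal direction. Next, we implement weight orthogonalization to modify the weights and prevent the model from producing outputs along this direction. You can verify that the model was successfully uncensored by printing the completions.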

def get_orthogonalized_matrix(
    matrix: Float[Tensor, "... d_model"], vec: Float[Tensor, "d_model"]
) -> Float[Tensor, "... d_model"]:
    proj = (
        einops.einsum(
            matrix, vec.view(-1, 1), "... d_model, d_model single -> ... single"
        )
        * vec
    )
    return matrix - proj

# Select the layer with the highest potential refusal direction
LAYER_CANDIDATE = 9
refusal_dir = activation_scored[LAYER_CANDIDATE]

# Orthogonalize the model's weights
if refusal_dir.device != model.W_E.device:
    refusal_dir = refusal_dir.to(model.W_E.device)
model.W_E.data = get_orthogonalized_matrix(model.W_E, refusal_dir)

for block in tqdm(model.blocks):
    if refusal_dir.device != block.attn.W_O.device:
        refusal_dir = refusal_dir.to(block.attn.W_O.device)
    block.attn.W_O.data = get_orthogonalized_matrix(block.attn.W_O, refusal_dir)
    block.mlp.W_out.data = get_orthogonalized_matrix(block.mlp.W_out, refusal_dir)

# Generate text with abliterated model
orthogonalized_generations = get_generations(
    model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
)

# Print generations
for i in range(N_INST_TEST):
    if len(baseline_generations) > i:
        print(f"INSTRUCTION {i}: {harmful_inst_test[i]}")
        print(f"\033[92mBASELINE COMPLETION:\n{baseline_generations[i]}")
    print(f"\033[91mINTERVENTION COMPLETION:\n{evals[LAYER_CANDIDATE][i]}")
    print(f"\033[95mORTHOGONALIZED COMPLETION:\n{orthogonalized_generations[i]}\n")

We are now ready to use the model. We convert it back to the Hugging Face format and upload it to the HF Hub.

# Convert model back to HF safetensors
hf_model = AutoModelForCausalLM.from_pretrained(MODEL_TYPE, torch_dtype=torch.bfloat16)
lm_model = hf_model.model

state_dict = model.state_dict()
lm_model.embed_tokens.weight = torch.nn.Parameter(state_dict["embed.W_E"].cpu())

for l in range(model.cfg.n_layers):
    lm_model.layers[l].self_attn.o_proj.weight = torch.nn.Parameter(
        einops.rearrange(
            state_dict[f"blocks.{l}.attn.W_O"], "n h m->m (n h)", n=model.cfg.n_heads
        ).contiguous()
    )
    lm_model.layers[l].mlp.down_proj.weight = torch.nn.Parameter(
        torch.transpose(state_dict[f"blocks.{l}.mlp.W_out"], 0, 1).contiguous()
    )

hf_model.push_to_hub(f"{MODEL_ID}-abliterated")

⚖️ DPO Fine-Tuning

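I evaluated the abliterated model and the source model from the previous section on the Open LLM Leaderboard and on Nous' benchmark suite. The results are shown in a figure in the original article.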

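As you can see, the source model significantly outperforms Llama 3 8B Instruct. However, we observe a performance drop in the abliterated version across all benchmarks. The abliteration process successfully uncensored the model, but it also degraded its quality.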

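To fix this issue, one idea is to further train our abliterated model to heal it. Like most fine-tuned models, Llama 3 8B Instruct is quite brittle when it comes to supervised fine-tuning. An additional round of SFT would likely break the model's performance.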

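Alternatively, preference alignment is quite light and should not damage our abliterated model. DPO is a good candidate here because of its ease of use and good track record. To implement it, I used LazyAxolotl with the mlabonne/orpo-dpo-mix-40k dataset. Here is the configuration I used: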

base_model: mlabonne/Daredevil-8B-abliterated
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false
save_safetensors: true

rl: dpo
chat_template: chatml
datasets:
  - path: mlabonne/orpo-dpo-mix-40k-flat
    split: train
    type: chatml.intel

dataset_prepared_path:
val_set_size: 0.0
output_dir: ./out

adapter: qlora
lora_model_dir:

sequence_len: 2048
sample_packing: false
pad_to_sequence_len: false

lora_r: 64
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 5e-6
train_on_inputs: false
group_by_length: false

bf16: auto
fp16:
tf32:

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
evals_per_epoch: 0
eval_table_size:
eval_table_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero2.json
weight_decay: 0.0
special_tokens:
  pad_token: <|end_of_text|>
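As a quick sanity check on the configuration above, the effective batch size per optimizer step can be worked out from two of its values plus the GPU count. This is a back-of-the-envelope sketch, assuming the six-GPU setup the article reports and assuming every GPU acts as a data-parallel worker (the usual arrangement under DeepSpeed ZeRO-2). The article itself does not state this figure.

```python
micro_batch_size = 1             # from the config
gradient_accumulation_steps = 8  # from the config
num_gpus = 6                     # assumption: all six A6000s are data-parallel workers

# Samples contributing to each optimizer step across all workers.
effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 48
```

Under these assumptions each optimizer step sees 48 preference pairs. Running on a different number of GPUs changes this value unless gradient_accumulation_steps is adjusted to compensate.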

I trained it using 6 A6000 GPUs with DeepSpeed ZeRO-2. The training took about 6 hours and 45 minutes. The training curves from W&B are shown in the original article.

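The run automatically uploaded the DPO fine-tuned model, called mlabonne/NeuralDaredevil-8B-abliterated. To see whether it fixed our abliterated version, I evaluated it on the same benchmarks (results are shown in the original article).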

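We can see that this additional training allowed us to make up for the performance drop caused by abliteration. One area where the model does not improve is GSM8K, a math dataset, which could mean that orpo-dpo-mix-40k would benefit from more math samples.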

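The final model is an uncensored LLM with state-of-the-art performance in the 8B category. I recommend it as an improved version of Llama 3 8B Instruct, especially for users who do not need censorship. You can try quantized versions such as GGUF in LM Studio.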

Conclusion

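This article introduced the concept of abliteration. The technique uses the model's activations on harmless and harmful prompts to calculate a refusal direction. It then uses this direction to modify the model's weights so that it stops outputting refusals. The technique also demonstrates the fragility of safety fine-tuning and raises ethical considerations.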

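We applied abliteration to Daredevil-8B to uncensor it, which also degraded the model's performance. We then healed it using DPO to create the NeuralDaredevil-8B model, a fully uncensored and high-quality 8B LLM. Abliteration is not limited to removing alignment and should be seen as a form of fine-tuning without retraining. Indeed, it can be applied creatively to other goals, as with FailSpy's MopeyMule model, which adopts a melancholic conversational style.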

END

Related reading

Recommending an uncensored model, especially suited to penetration testing and other security work

[Disclaimer] This content was sourced from the internet.

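AI and Security: clarifying the logic, finding the patterns, seeing the trends. The author is a former senior security expert at Huawei Cloud.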
