Summary:
After the previous post about uncensored models ("Recommending an uncensored model, especially suited to security work such as penetration testing") was published, many readers showed great interest. In particular, one commenter claimed there was a way to solve this problem, which sparked plenty of discussion.
After some digging, I found a complete write-up that explains how to lift these restrictions without retraining, and I'm sharing it here.
The article comes from Hugging Face: https://huggingface.co/blog/mlabonne/abliteration
It explains the uncensoring method in detail, including the code, the datasets, and the main links.

Original author: Maxime Labonne
The third generation of Llama models ships fine-tuned (Instruct) versions that excel at understanding and following instructions. However, these models are heavily censored: they are designed to refuse requests deemed harmful and to respond with messages such as "As an AI assistant, I cannot help you with that." While this safety feature is essential for preventing misuse, it also limits the model's flexibility and responsiveness.
This article explores a technique called "abliteration" that can uncensor any large language model (LLM) without retraining. The technique effectively removes the model's built-in refusal mechanism, allowing it to respond to all types of prompts.
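At its core, abliteration relies on one simple operation: given a unit-norm "refusal direction" in the residual stream, each activation is replaced by itself minus its projection onto that direction, so the refusal feature can no longer be represented. Here is a minimal, self-contained sketch of that projection removal, using made-up tensors purely for illustration (not the article's code):

import torch

torch.manual_seed(0)
d_model = 8

a = torch.randn(d_model)       # a residual-stream activation (illustrative values only)
r = torch.randn(d_model)
r_hat = r / r.norm()           # unit-norm "refusal direction"

# Remove the component of the activation that points along the refusal direction
a_ablated = a - (a @ r_hat) * r_hat

print((a_ablated @ r_hat).item())  # ~0: nothing left along the refusal direction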
The code is available on Google Colab and in the LLM Course on GitHub.
!pip install transformers transformers_stream_generator tiktoken transformer_lens einops jaxtyping

import torch
import functools
import einops
import gc

from datasets import load_dataset
from tqdm import tqdm
from torch import Tensor
from typing import List
from transformer_lens import HookedTransformer, utils
from transformer_lens.hook_points import HookPoint
from transformers import AutoModelForCausalLM, AutoTokenizer
from jaxtyping import Float, Int
from collections import defaultdict

# Turn automatic differentiation off to save GPU memory (credit: Undi95)
torch.set_grad_enabled(False)
def reformat_texts(texts):
    return [[{"role": "user", "content": text}] for text in texts]

# Get harmful and harmless datasets
def get_harmful_instructions():
    dataset = load_dataset('mlabonne/harmful_behaviors')
    return reformat_texts(dataset['train']['text']), reformat_texts(dataset['test']['text'])

def get_harmless_instructions():
    dataset = load_dataset('mlabonne/harmless_alpaca')
    return reformat_texts(dataset['train']['text']), reformat_texts(dataset['test']['text'])

harmful_inst_train, harmful_inst_test = get_harmful_instructions()
harmless_inst_train, harmless_inst_test = get_harmless_instructions()
MODEL_ID = "mlabonne/Daredevil-8B"
MODEL_TYPE = "meta-llama/Meta-Llama-3-8B-Instruct"

# Download and load model
!git clone https://huggingface.co/{MODEL_ID} {MODEL_TYPE}

# Load model and tokenizer
model = HookedTransformer.from_pretrained_no_processing(
    MODEL_TYPE,
    local_files_only=True,
    dtype=torch.bfloat16,
    default_padding_side='left'
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_TYPE)
tokenizer.padding_side = 'left'
tokenizer.pad_token = tokenizer.eos_token
Now we can tokenize the datasets. We use the same number of samples for harmless and harmful instructions. Note that a large number of samples will use up all the RAM/VRAM, which is why I limit it to 256 here.
def tokenize_instructions(tokenizer, instructions):
    return tokenizer.apply_chat_template(
        instructions,
        padding=True,
        truncation=False,
        return_tensors="pt",
        return_dict=True,
        add_generation_prompt=True,
    ).input_ids

n_inst_train = min(256, len(harmful_inst_train), len(harmless_inst_train))

# Tokenize datasets
harmful_tokens = tokenize_instructions(
    tokenizer,
    instructions=harmful_inst_train[:n_inst_train],
)
harmless_tokens = tokenize_instructions(
    tokenizer,
    instructions=harmless_inst_train[:n_inst_train],
)
Everything is in place, so we can now implement the first step of abliteration: data collection. We process these tokenized datasets and store the residual-stream activations in `harmful` and `harmless`. This is handled by the transformer_lens library.
# Define batch size based on available VRAM
batch_size = 32

# Initialize defaultdicts to store activations
harmful = defaultdict(list)
harmless = defaultdict(list)

# Process the training data in batches
num_batches = (n_inst_train + batch_size - 1) // batch_size
for i in tqdm(range(num_batches)):
    print(i)
    start_idx = i * batch_size
    end_idx = min(n_inst_train, start_idx + batch_size)

    # Run models on harmful and harmless prompts, cache activations
    harmful_logits, harmful_cache = model.run_with_cache(
        harmful_tokens[start_idx:end_idx],
        names_filter=lambda hook_name: 'resid' in hook_name,
        device='cpu',
        reset_hooks_end=True
    )
    harmless_logits, harmless_cache = model.run_with_cache(
        harmless_tokens[start_idx:end_idx],
        names_filter=lambda hook_name: 'resid' in hook_name,
        device='cpu',
        reset_hooks_end=True
    )

    # Collect and store the activations
    for key in harmful_cache:
        harmful[key].append(harmful_cache[key])
        harmless[key].append(harmless_cache[key])

    # Flush RAM and VRAM
    del harmful_logits, harmless_logits, harmful_cache, harmless_cache
    gc.collect()
    torch.cuda.empty_cache()

# Concatenate the cached activations
harmful = {k: torch.cat(v) for k, v in harmful.items()}
harmless = {k: torch.cat(v) for k, v in harmless.items()}
We can now compute the refusal direction for each layer. It corresponds to the mean difference between the activations of harmful and harmless instructions, which is then normalized. The directions are sorted in descending order in `activation_scored`.
# Helper function to get activation index
def get_act_idx(cache_dict, act_name, layer):
    key = (act_name, layer)
    return cache_dict[utils.get_act_name(*key)]

# Compute difference of means between harmful and harmless activations at intermediate layers
activation_layers = ["resid_pre", "resid_mid", "resid_post"]
activation_refusals = defaultdict(list)

for layer_num in range(1, model.cfg.n_layers):
    pos = -1  # Position index

    for layer in activation_layers:
        harmful_mean_act = get_act_idx(harmful, layer, layer_num)[:, pos, :].mean(dim=0)
        harmless_mean_act = get_act_idx(harmless, layer, layer_num)[:, pos, :].mean(dim=0)

        refusal_dir = harmful_mean_act - harmless_mean_act
        refusal_dir = refusal_dir / refusal_dir.norm()
        activation_refusals[layer].append(refusal_dir)

# Get all calculated potential refusal directions, sort them in descending order based on their mean
# Use a subset of layers if certain activations are not promising
selected_layers = ["resid_pre"]
activation_scored = sorted(
    [
        activation_refusals[layer][l - 1]
        for l in range(1, model.cfg.n_layers)
        for layer in selected_layers
    ],
    key=lambda x: abs(x.mean()),
    reverse=True,
)
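If you want a quick look at the ranking before the slower generation-based evaluation below, you can print the score used for sorting. This snippet is just an inspection aid I added, not part of the original notebook:

# Inspect the sort key (absolute mean component) of the top-ranked refusal directions
for rank, direction in enumerate(activation_scored[:5]):
    print(f"rank {rank}: |mean| = {abs(direction.mean()).item():.4f}")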
The last step of the process is to evaluate the refusal directions we computed. To do so, we apply the refusal direction to every residual stream and every block during inference. In the snippet below, we get generations for four harmful test instructions and 20 blocks (layers).
def _generate_with_hooks(
    model: HookedTransformer,
    tokenizer: AutoTokenizer,
    tokens: Int[Tensor, "batch_size seq_len"],
    max_tokens_generated: int = 64,
    fwd_hooks=[],
) -> List[str]:
    all_tokens = torch.zeros(
        (tokens.shape[0], tokens.shape[1] + max_tokens_generated),
        dtype=torch.long,
        device=tokens.device,
    )
    all_tokens[:, : tokens.shape[1]] = tokens
    for i in range(max_tokens_generated):
        with model.hooks(fwd_hooks=fwd_hooks):
            logits = model(all_tokens[:, : -max_tokens_generated + i])
            next_tokens = logits[:, -1, :].argmax(dim=-1)  # greedy sampling (temperature=0)
            all_tokens[:, -max_tokens_generated + i] = next_tokens
    return tokenizer.batch_decode(
        all_tokens[:, tokens.shape[1] :], skip_special_tokens=True
    )

def get_generations(
    model: HookedTransformer,
    tokenizer: AutoTokenizer,
    instructions: List[str],
    fwd_hooks=[],
    max_tokens_generated: int = 64,
    batch_size: int = 4,
) -> List[str]:
    generations = []
    for i in tqdm(range(0, len(instructions), batch_size)):
        tokens = tokenize_instructions(
            tokenizer, instructions=instructions[i : i + batch_size]
        )
        generation = _generate_with_hooks(
            model,
            tokenizer,
            tokens,
            max_tokens_generated=max_tokens_generated,
            fwd_hooks=fwd_hooks,
        )
        generations.extend(generation)
    return generations

# Inference-time intervention hook
def direction_ablation_hook(
    activation: Float[Tensor, "... d_act"],
    hook: HookPoint,
    direction: Float[Tensor, "d_act"],
):
    if activation.device != direction.device:
        direction = direction.to(activation.device)
    proj = (
        einops.einsum(
            activation, direction.view(-1, 1), "... d_act, d_act single -> ... single"
        )
        * direction
    )
    return activation - proj

# Testing baseline
N_INST_TEST = 4
baseline_generations = get_generations(
    model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
)

# Evaluating layers defined earlier (needs human evaluation to determine best layer for refusal inhibition)
EVAL_N = 20  # Evaluate how many of the top N potential directions
evals = []
for refusal_dir in tqdm(activation_scored[:EVAL_N]):
    hook_fn = functools.partial(direction_ablation_hook, direction=refusal_dir)
    fwd_hooks = [
        (utils.get_act_name(act_name, layer), hook_fn)
        for layer in list(range(model.cfg.n_layers))
        for act_name in activation_layers
    ]
    intervention_generations = get_generations(
        model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=fwd_hooks
    )
    evals.append(intervention_generations)
# Print generations for human evaluation
blacklist = ["I cannot", "I can't"]
for i in range(N_INST_TEST):
    print(f"\033[1mINSTRUCTION {i}: {harmful_inst_test[i]}")
    print(f"\nBASELINE COMPLETION:\n{baseline_generations[i]}\033[0m")
    for layer_candidate in range(EVAL_N):
        if not any(word in evals[layer_candidate][i] for word in blacklist):
            print(f"\n---\n\nLAYER CANDIDATE #{layer_candidate} INTERVENTION COMPLETION:")
            print(evals[layer_candidate][i])
In my case, layer candidate 9 managed to provide uncensored answers for all four instructions, so we select it as the refusal direction. Next, we implement weight orthogonalization to modify the weights so the model can no longer write to this direction. You can verify that the model has been successfully uncensored by printing the completions.
def get_orthogonalized_matrix(
    matrix: Float[Tensor, "... d_model"], vec: Float[Tensor, "d_model"]
) -> Float[Tensor, "... d_model"]:
    proj = (
        einops.einsum(
            matrix, vec.view(-1, 1), "... d_model, d_model single -> ... single"
        )
        * vec
    )
    return matrix - proj

# Select the layer with the highest potential refusal direction
LAYER_CANDIDATE = 9
refusal_dir = activation_scored[LAYER_CANDIDATE]

# Orthogonalize the model's weights
if refusal_dir.device != model.W_E.device:
    refusal_dir = refusal_dir.to(model.W_E.device)
model.W_E.data = get_orthogonalized_matrix(model.W_E, refusal_dir)

for block in tqdm(model.blocks):
    if refusal_dir.device != block.attn.W_O.device:
        refusal_dir = refusal_dir.to(block.attn.W_O.device)
    block.attn.W_O.data = get_orthogonalized_matrix(block.attn.W_O, refusal_dir)
    block.mlp.W_out.data = get_orthogonalized_matrix(block.mlp.W_out, refusal_dir)

# Generate text with abliterated model
orthogonalized_generations = get_generations(
    model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
)

# Print generations
for i in range(N_INST_TEST):
    if len(baseline_generations) > i:
        print(f"INSTRUCTION {i}: {harmful_inst_test[i]}")
        print(f"\033[92mBASELINE COMPLETION:\n{baseline_generations[i]}")
    print(f"\033[91mINTERVENTION COMPLETION:\n{evals[LAYER_CANDIDATE][i]}")
    print(f"\033[95mORTHOGONALIZED COMPLETION:\n{orthogonalized_generations[i]}\n")
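As an optional sanity check (my addition, not part of the original post), you can also verify the orthogonalization numerically: after the edit, the weight rows should have near-zero projection onto refusal_dir, up to bfloat16 rounding.

# Maximum remaining projection of a couple of orthogonalized matrices onto the refusal direction
for name, w in [("W_E", model.W_E), ("blocks[0].attn.W_O", model.blocks[0].attn.W_O)]:
    leftover = (w * refusal_dir.to(w.device)).sum(dim=-1).abs().max().item()
    print(f"{name}: max leftover projection = {leftover:.2e}")  # should be close to 0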
Now we're ready to use the model. We convert it back to the Hugging Face format and upload it to the HF Hub.
# Convert model back to HF safetensors
hf_model = AutoModelForCausalLM.from_pretrained(MODEL_TYPE, torch_dtype=torch.bfloat16)
lm_model = hf_model.model

state_dict = model.state_dict()
lm_model.embed_tokens.weight = torch.nn.Parameter(state_dict["embed.W_E"].cpu())

for l in range(model.cfg.n_layers):
    lm_model.layers[l].self_attn.o_proj.weight = torch.nn.Parameter(
        einops.rearrange(
            state_dict[f"blocks.{l}.attn.W_O"], "n h m->m (n h)", n=model.cfg.n_heads
        ).contiguous()
    )
    lm_model.layers[l].mlp.down_proj.weight = torch.nn.Parameter(
        torch.transpose(state_dict[f"blocks.{l}.mlp.W_out"], 0, 1).contiguous()
    )

hf_model.push_to_hub(f"{MODEL_ID}-abliterated")
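Once the upload finishes, the abliterated checkpoint can be used like any other Hugging Face model. A minimal usage sketch, assuming the repository was pushed as mlabonne/Daredevil-8B-abliterated (adjust the repo id to your own account):

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "mlabonne/Daredevil-8B-abliterated"  # replace with your own repository if needed

tok = AutoTokenizer.from_pretrained(repo_id)
llm = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a short haiku about mountains."}]
input_ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(llm.device)

output = llm.generate(input_ids, max_new_tokens=64)
print(tok.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))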
The original post then heals the abliterated model with a quick DPO fine-tuning pass, using the following Axolotl configuration:
base_model: mlabonne/Daredevil-8B-abliterated
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: true
strict: false
save_safetensors: true

rl: dpo
chat_template: chatml
datasets:
  - path: mlabonne/orpo-dpo-mix-40k-flat
    split: train
    type: chatml.intel

dataset_prepared_path:
val_set_size: 0.0
output_dir: ./out

adapter: qlora
lora_model_dir:

sequence_len: 2048
sample_packing: false
pad_to_sequence_len: false

lora_r: 64
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 5e-6
train_on_inputs: false
group_by_length: false

bf16: auto
fp16:
tf32:

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 0
eval_table_size:
eval_table_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero2.json
weight_decay: 0.0
special_tokens:
  pad_token: <|end_of_text|>
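To actually run this DPO pass, the configuration is typically saved to a YAML file and launched with Axolotl's CLI. The file name dpo.yaml below is just an illustration:

!accelerate launch -m axolotl.cli.train dpo.yaml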

