
Reading Notes on Textual Adversarial Example Papers (Collected Edition)

news Source: original 2025/8/23 9:26:36


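These are cleaned-up notes from an earlier survey of textual adversarial examples. The papers are all classics, and an existing framework (TextAttack) can be used directly. Some passages are quoted straight from the papers, so quoted wording sits alongside my own summaries.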

BERT-Attack

Authors: Linyang Li, Ruotian Ma, et al.

Affiliation: Fudan University

Venue: EMNLP 2020

Introduction

Adversarial examples: imperceptible to human judges while they can mislead the neural networks to incorrect predictions.

How textual adversarial examples differ from image adversarial examples:

imperceptible to human judges & misleading to models
fluent in grammar and semantically consistent with original inputs

Characteristics of previous work:

based on specific rules
difficult to guarantee the fluency and semantic preservation of the generated adversarial samples at the same time
rather complicated

Core idea: use BERT as the generator of adversarial examples

Advantages of BERT-Attack:

Training -> Semantic-preserving
Context around -> fluent & reasonable
run inference with the language model once as a perturbation generator rather than repeatedly using language models to score the generated adversarial samples in a trial and error process

Experimental results: successfully fooled the downstream models

Related Work

character-level heuristic rules: Jin et al. 2019

substituting words with synonyms: Ren et al. 2019, Li et al. 2018

score perturbations by searching for close meaning words in the embedding space: Alzantot et al. 2018

semantically enhanced embedding but context unaware: Jin et al. 2019

replace words manually to break the language inference system: Glockner et al. 2018

replacement strategies using embedding transition: Lei et al. 2019

Method

Two steps:

finding the vulnerable words of target model
replacing the vulnerable words with semantically similar and grammatically correct words until a successful attack

Finding Vulnerable Words

Input sequence: $S=[w_0,\cdots,w_i,\cdots]$

$o_y(S)$: the logit the target model outputs for the correct label $y$

Importance score: $I_{w_i}=o_y(S)-o_y(S_{/w_i})$, where $S_{/w_i}=[w_0,\cdots,w_{i-1},[MASK],w_{i+1},\cdots]$

Rank the words by importance score in descending order to build a word list $L$, and take the top $\epsilon$ percent of words as attack targets
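The ranking step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `true_logit` stands in for the victim model's $o_y(\cdot)$, `toy_logit` is a made-up scorer, and `epsilon` is treated as a fraction of the sentence length.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"

def rank_vulnerable_words(
    words: List[str],
    true_logit: Callable[[List[str]], float],  # hypothetical stand-in for o_y(.)
    epsilon: float = 0.4,
) -> List[Tuple[int, float]]:
    """Return (index, importance) pairs for the top-epsilon fraction of words."""
    base = true_logit(words)
    scores = []
    for i in range(len(words)):
        masked = words[:i] + [MASK] + words[i + 1:]
        # I_{w_i} = o_y(S) - o_y(S with w_i masked)
        scores.append((i, base - true_logit(masked)))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    keep = max(1, int(len(words) * epsilon))
    return scores[:keep]

def toy_logit(words: List[str]) -> float:
    """Toy scorer (an assumption): sums hand-picked sentiment weights."""
    weights = {"great": 2.0, "good": 1.0, "boring": -1.5}
    return sum(weights.get(w, 0.0) for w in words)

targets = rank_vulnerable_words(["a", "great", "movie"], toy_logit)
```

With the toy scorer, masking "great" removes the whole logit, so it is the only word kept as an attack target.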

Word Replacement via BERT

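Previous approaches: multiple hand-crafted rules, e.g. Synonym dictionary (Ren et al. 2019), POS Checker (Jin et al. 2019), Semantic Similarity Checker (Jin et al. 2019)

Drawbacks of these substitution strategies:

unaware of the context between the substitution position and its surrounding words
insufficient in fluency control and semantic consistency

Using BERT for substitution addresses both fluency control and semantic preservation.

Algorithm flow:

The BPE algorithm further tokenizes $S$ into sub-words $H=[h_0,h_1,\cdots]$, so the sub-words have to be aligned with the original words

Let $M$ denote the BERT model; its output is $P = M(H)$. At each position take the $K$ most probable predictions, where $K$ is a hyperparameter. Iterate over these $K$ candidates to obtain an adversarial example

Single word: for a word $w_j$, the top-$K$ candidates are $P^j$. Stop words are filtered out with NLTK, and antonyms are filtered out with synonym dictionaries. This gives the perturbed sequence $H'=[h_0,\cdots, h_{j-1},c_k,h_{j+1},\cdots]$, where $c_k$ is one of the candidates in $P^j$. If it flips the prediction, the sequence is the adversarial example $H^{adv}$. Otherwise, keep the best perturbation and continue with the next word in $L$

Sub-words: perplexity is used to find a suitable replacement. Given the sub-word sequence $[h_0,h_1,\cdots,h_t]$ of word $w$, list all possible combinations from the prediction $P^{t\times K}$ of $M$, then reverse the BERT tokenization to turn these sub-word combinations back into normal words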

Experiments

Datasets:

Text Classification
- Yelp: review classification dataset
- IMDB: document-level movie review dataset
- AG’s News: sentence-level news-type classification dataset
- FAKE: fake news classification dataset
Natural Language Inference
- SNLI: Stanford Natural Language Inference dataset
- MNLI: language inference dataset on multi-genre texts

Baseline: TextFooler, GA

Evaluation:

Attacked Accuracy
Perturb Percentage
Query Number
Average Length
Semantic Similarity (Universal Sentence Encoder)

Comparison with other baselines:

Adversarial samples:

References cited in these notes

Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2019. Is BERT really robust? natural language attack on text classification and entailment. CoRR
Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. 2018. Textbugger: Generating adversarial text against real-world applications
Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. CoRR
Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking nli systems with sentences that require simple lexical inferences

Semantically Equivalent Adversarial Rules for Debugging NLP Models

Authors: Marco Tulio Ribeiro, Sameer Singh, et al.

Affiliation: University of Washington; University of California, Irvine

Venue: ACL 2018

Introduction

Challenges:

different ways of phrasing the same sentence can often cause the model to output different predictions.
oversensitivity

Proposed adversarial method: SEA
Advantages:

model-agnostic
generate semantically equivalent rules for optimal rule sets: semantic equivalence, high adversary count, non-redundancy.

Semantically Equivalent Adversaries

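Given a black-box model $f$ and a sentence $x$, the prediction is $f(x)$

Basic idea: perturb $x$ so that $f(x)$ changes while the meaning stays the same

Indicator function: $SEA_x(x')=\mathbb{I}[SemEq(x,x')\wedge f(x)\not=f(x')]$

Semantic score: $S(x,x')=\min(1,\frac{P(x'|x)}{P(x|x)})$, where $P(x'|x)$ is the probability the paraphrase model assigns to $x'$ as a paraphrase of $x$

This gives: $SemEq(x,x')=\mathbb{I}[S(x,x')\geq \tau]$

paraphrase set via beam search: $\Pi_x$

Selecting the best adversary: $\argmax\limits_{x'\in\Pi_x}S(x,x')SEA_x(x')$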
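A minimal sketch of this selection, assuming the paraphrase probabilities and model predictions are already available. The tuple layout, the function names, and the example numbers are all hypothetical.

```python
from typing import List, Optional, Tuple

def semantic_score(p_paraphrase: float, p_self: float) -> float:
    """S(x, x') = min(1, P(x'|x) / P(x|x))."""
    return min(1.0, p_paraphrase / p_self)

def best_sea(
    original_pred: str,
    p_self: float,                               # P(x|x)
    paraphrases: List[Tuple[str, float, str]],   # (x', P(x'|x), f(x'))
    tau: float,
) -> Optional[str]:
    """Return the paraphrase maximizing S(x, x') * SEA_x(x'), or None."""
    best, best_value = None, 0.0
    for text, p_para, pred in paraphrases:
        s = semantic_score(p_para, p_self)
        # SEA_x(x') = 1 only if x' is semantically equivalent AND flips f
        sea = 1 if (s >= tau and pred != original_pred) else 0
        if s * sea > best_value:
            best, best_value = text, s * sea
    return best

candidates = [
    ("Which color is it?", 0.009, "blue"),   # close paraphrase, flips prediction
    ("What colour is it?", 0.0095, "red"),   # close paraphrase, same prediction
    ("Is it colored?", 0.00001, "blue"),     # flips prediction, but far from x
]
choice = best_sea("red", 0.01, candidates, tau=0.0008)
```

The second candidate has the highest semantic score but does not change the prediction, so it is not an adversary; the first one is selected.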

Semantically Equivalent Adversarial Rules (SEARs)

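Assumption: a human has limited time and is willing to look at $B$ rules

SEARs: given a reference dataset $X$, select a rule set $R$ of at most $B$ rules based on $X$

Rule form: $r=(a\rightarrow c)$, where $a$ is the antecedent (the original text span) and $c$ is the consequent (its replacement)

Building candidate rules: extract matching word pairs and take the minimal contiguous span that turns $x$ into $x'$. Also include the immediate context, e.g. What color $\rightarrow$ Which color. Rules are generalized with the product of coarse-grained and fine-grained Part-of-Speech tags; a tag may appear in the consequent only if it matches the antecedent, e.g. What NOUN $\rightarrow$ Which NOUN

Selecting the rule set: given the candidate rules, choose a rule set $R$ such that $|R|\leq B$

Semantic equivalence: applying a rule in the set should produce semantically equivalent instances, i.e. $E[SemEq(x,r(x))]\geq 1-\delta$ (the Filter step)
High adversary count: rules should induce as many SEAs as possible on the validation set, with high semantic similarity scores
Non-redundancy: different rules may produce the same SEAs, or induce different SEAs for the same instances. The resulting objective is $\max\limits_{R,|R|\leq B}\sum\limits_{x\in X}\max\limits_{r\in R}S(x,r(x))SEA(x,r(x))$, which is submodular and is optimized greedily (the SubMod step).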
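The greedy selection can be sketched as follows. It assumes the per-instance values $S(x,r(x))\,SEA(x,r(x))$ have been precomputed into a dictionary; the `gains` layout and the rule names are illustrative, not taken from the paper.

```python
from typing import Dict, List

def select_rules(gains: Dict[str, Dict[int, float]], budget: int) -> List[str]:
    """Greedily pick at most `budget` rules maximizing sum_x max_r gain[r][x].

    gains[rule][x] holds S(x, r(x)) * SEA(x, r(x)) for instance id x
    (instances a rule does not flip are simply absent).
    """
    chosen: List[str] = []
    best: Dict[int, float] = {}  # best value reached so far per instance

    def marginal(rule: str) -> float:
        # Improvement of the objective if `rule` were added to `chosen`.
        return sum(max(0.0, s - best.get(x, 0.0)) for x, s in gains[rule].items())

    for _ in range(budget):
        candidates = [r for r in gains if r not in chosen]
        if not candidates:
            break
        top = max(candidates, key=marginal)
        if marginal(top) <= 0.0:
            break  # every remaining rule is redundant
        chosen.append(top)
        for x, s in gains[top].items():
            best[x] = max(best.get(x, 0.0), s)
    return chosen

gains = {
    "What -> Which": {0: 0.9, 1: 0.8},
    "color -> colour": {0: 0.9, 1: 0.7},  # flips the same instances
    "? -> ??": {2: 0.5},
}
rules = select_rules(gains, budget=2)
```

After the first rule is chosen, the second rule adds nothing because it covers the same instances, so the lower-scoring but non-redundant third rule is picked instead.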

Example rules:

User Studies

neural machine translation model parameters: default OpenNMT-py parameters

POS tagging: spaCy library

SEAR generation: $\delta=0.1$ , $\tau=0.0008$

VQA: telling system, questions include “What”, “Where”, “When”, “Who”, “Why”, and “How”.

Study conditions: Human, SEA, HSEA (human & SEA collaboration)

Results per condition:

Experts vs. SEA:

TextBugger: Generating Adversarial Text Against Real-world Applications

Authors: Jinfeng Li, Shouling Ji, et al.

Affiliation: College of Computer Science, Zhejiang University

Venue: NDSS 2019

Introduction

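Categories of adversarial attacks:

causative attacks: manipulate the training data to mislead the classifiers
exploratory attacks: craft malicious testing instances to evade a given classifier

Challenges for textual adversarial examples:

discrete property, hard to optimize
small perturbations are usually clearly perceptible
replacement of a single word may drastically alter the semantics of the sentence

Shortcomings of existing work (before 2019):

not computationally efficient
under the white-box setting
manual intervention
against a particular NLP model, not comprehensively evaluated

The paper proposes TextBugger, which covers a white-box and a black-box scenario:

White-box: find the important words by computing the Jacobian matrix of the classifier, then generate five kinds of perturbations (bugs) and apply the best one
Black-box: first find the important sentences, then use a scoring function to find the important words in those sentences and perturb them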

Attack Design

Problem Formulation

Victim model: $F: X\rightarrow Y$

Semantic similarity measure: $S:X\times X\rightarrow \mathbb{R}_+$

Adversarial document: given $F(x) = y$, find $x_{adv}$ s.t. $F(x_{adv})=t$ $(t\not= y)$ and $S(x,x_{adv})\geq \epsilon$ $(\epsilon \in \mathbb{R})$

Threat Model

White-box setting: complete knowledge about the targeted model architecture and parameters (worst-case attack)

Black-box setting: users can only access the model via an API (not aware of the model architecture)

TextBugger

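White-box attack:

Finding important words: given $x=(x_1,x_2,\cdots,x_N)$, where $x_i$ is the $i$-th word, and target model $F$, the Jacobian matrix is $J_F(x)=\frac{\partial F(x)}{\partial x}=[\frac{\partial F_j(x)}{\partial x_i}]_{i\in\{1,\cdots,N\},j\in \{1,\cdots,K\}}$, where $K$ is the number of classes and $F_j(\cdot)$ is the confidence value of the $j^{th}$ class. The importance of word $x_i$ is $C_{x_i}=J_{F(i,y)}=\frac{\partial F_y(x)}{\partial x_i}$
Bug generation: both character-level and word-level perturbations are considered, giving five bug types.
- Character-level: insert, delete, swap, and substitute characters (Sub-C), which turn an important word into an unknown (out-of-vocabulary) word
- Word-level: substitute the word (Sub-W)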
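The five bug types can be illustrated with a small generator. This is a sketch under assumptions, not the authors' code: the `VISUAL` look-alike table and the `neighbors` dictionary (standing in for nearest neighbours in an embedding space) are hypothetical, and edits are restricted to inner characters so the word stays recognizable.

```python
import random
from typing import Dict, List

# Hypothetical table of visually similar characters for Sub-C.
VISUAL = {"o": "0", "l": "1", "a": "@", "e": "3", "i": "1", "s": "$"}

def generate_bugs(word: str, neighbors: Dict[str, List[str]],
                  rng: random.Random) -> Dict[str, str]:
    """Return one candidate per bug type (types that do not apply are omitted)."""
    bugs: Dict[str, str] = {}
    if len(word) >= 4:
        i = rng.randrange(1, len(word) - 1)          # inner position
        bugs["insert"] = word[:i] + " " + word[i:]   # insert a space
        bugs["delete"] = word[:i] + word[i + 1:]     # delete an inner character
        j = rng.randrange(1, len(word) - 2)          # j and j+1 are both inner
        bugs["swap"] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    for k, ch in enumerate(word):                    # Sub-C: first look-alike
        if ch in VISUAL:
            bugs["sub_c"] = word[:k] + VISUAL[ch] + word[k + 1:]
            break
    if neighbors.get(word):                          # Sub-W: nearest neighbour
        bugs["sub_w"] = neighbors[word][0]
    # Drop no-op bugs (e.g. swapping two identical letters).
    return {name: bug for name, bug in bugs.items() if bug != word}

bugs = generate_bugs("foolish", {"foolish": ["silly"]}, random.Random(0))
```

An attack would then query the victim model with each candidate and keep the bug that lowers the confidence of the true class the most.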

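Algorithm under the white-box attack:

Black-box attack:

Finding important sentences: let the document be $x=(s_1,\cdots,s_n)$, where $s_i$ is the $i$-th sentence. First split each document into sentences with spaCy. Then query the model and filter out sentences whose predicted label differs from the original label ($F_l(s_i)\not=y$), and sort the remaining sentences in descending order of importance score, defined as $C_{s_i}=F_y(s_i)$
Finding important words: find the most important words and modify them while controlling semantic similarity. A new scoring function is designed: $C_{w_j}=F_y(w_1,w_2,\cdots,w_m)-F_y(w_1,\cdots,w_{j-1},w_{j+1},\cdots,w_m)$
Bug generation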

Attack Evaluation

Sentiment Analysis

Datasets: IMDB, Rotten Tomatoes Movie Reviews (MR)

Victim models: LR, CNN, LSTM

baseline: Random, FGSM+Nearest Neighbor Search (NNS), DeepFool+NNS

Evaluation metrics:

Edit Distance
Jaccard Similarity Coefficient: $J(A,B)=\frac{|A\cap B|}{|A\cup B|}=\frac{|A\cap B|}{|A|+|B|-|A\cap B|}$
Euclidean Distance: $d(\mathbf{p},\mathbf{q})=\sqrt{(p_1-q_1)^2+(p_2-q_2)^2+\cdots+(p_n-q_n)^2}$
Semantic Similarity: $S(\mathbf{p},\mathbf{q})=\frac{\mathbf{p}\cdot \mathbf{q}}{||\mathbf{p}||\cdot||\mathbf{q}||}=\frac{\sum^n_{i=1}p_i\times q_i}{\sqrt{\sum^n_{i=1}(p_i)^2}\times \sqrt{\sum^n_{i=1}(q_i)^2}}$, with sentence vectors from USE
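For reference, the four metrics can be written in plain Python. This is a sketch: the edit distance here is the standard Levenshtein distance, Jaccard is computed on sets of words, and the vectors passed to the last two functions would in practice be sentence embeddings (for example from USE), which are not computed here.

```python
import math
from typing import Sequence, Set

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over one rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def jaccard(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def cosine(p: Sequence[float], q: Sequence[float]) -> float:
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_p * norm_q)
```

A single Sub-C bug such as "foolish" to "f0olish" has edit distance 1, which is why character-level bugs score well on this metric.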

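Findings:

TextBugger attacks effectively and runs fast
Long texts are harder to attack than short texts
By label: flipping negative reviews to positive sometimes fails
The datasets contain more negative words than positive words
Effect of bug type: character-level substitution (Sub-C) is the hardest to notice and produces out-of-vocabulary words

Toxic Content Detection

Dataset: Kaggle Toxic Comment Classification competition

Victim models: LR, CNN, LSTM

Results: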

Potential Defenses

Spelling Check
Adversarial Training

Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency

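Authors: Shuhuai Ren, Yihe Deng, et al.

Affiliation: Huazhong University of Science and Technology; University of California, Los Angeles; Harbin Institute of Technology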

Introduction

Why adversarial examples are hard in NLP:

words in sentences are discrete tokens
hard in human’s perception to make sense of the texts with perturbations

Goal of the paper: guarantee lexical correctness with little grammatical error and semantic shifting.

Proposed method: Probability Weighted Word Saliency (PWWS)

Text Classification Attack

Feature space $X$, output space $Y=\{y_1,\cdots, y_K\}$, target model $f:X\rightarrow Y$, true label $y_{true}\in Y$

Text Adversarial Example

Model classification: $\argmax\limits_{y_i\in Y}P(y_i|x)=y_{true}$

Perturbation $\triangle x$, $x^*=x+\triangle x$, s.t. $\argmax\limits_{y_i\in Y}P(y_i|x^*)\not= y_{true}$

The adversarial example is defined as:

$x^*=x+\triangle x, ||\triangle x||_p< \epsilon$, with $\argmax\limits_{y_i\in Y} P(y_i|x^*)\not=\argmax\limits_{y_i\in Y}P(y_i|x)$

where the $p$-norm is: $||\triangle x||_p=(\sum^n\limits_{i=1}|w^*_i-w_i|^p)^{\frac{1}{p}}$

In this paper, adversarial examples are generated by replacing input words with synonyms (from WordNet) and by replacing named entities (NEs) with similar named entities

For an input sample of class $y_{true}$, the dictionary $\mathbb{D}_{y_{true}}\subseteq \mathbb{D}$ contains all NEs that appear in texts of that class, and the most frequent named entity $NE_{adv}$ in the complement $\mathbb{D}-\mathbb{D}_{y_{true}}$ is used as the substitute.

PWWS

PWWS is a greedy algorithm

Word substitution strategy:

For each word $w_i\in x$, first build a synonym set $\mathbb{L}_i\subseteq \mathbb{D}$ with WordNet; if $w_i$ is a named entity, add entities of the same type to $\mathbb{L}_i$. The candidate $w'_i\in\mathbb{L}_i$ that changes the classification probability the most is selected as $w^*_i$
Substitute selection:
- $w^*_i=R(w_i,\mathbb{L}_i)=\argmax_{w'_i\in \mathbb{L}_i}\{P(y_{true}|x)-P(y_{true}|x'_i)\}$, where $x=w_1 w_2\cdots w_i\cdots w_n$ and $x'_i=w_1 w_2\cdots w'_i\cdots w_n$
- With $x^*_i=w_1 w_2 \cdots w^*_i \cdots w_n$, the best achievable change is $\triangle P^*_i=P(y_{true}|x)-P(y_{true}|x^*_i)$

Replacement order strategy:

Compute word saliency: $S(x,w_i)=P(y_{true}|x)-P(y_{true}|\hat{x}_i)$, where $x=w_1 w_2 \cdots w_i \cdots w_d$ and $\hat{x}_i=w_1 w_2 \cdots unknown \cdots w_d$

Compute the saliency for every $w_i\in x$ to obtain the word saliency vector $S(x)$

Scoring function for replacement priority: $H(x,x^*_i,w_i)=\phi(S(x))_i\cdot \triangle P^*_i$, where $\phi(\cdot)$ is the softmax function.
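A short sketch of the priority score $H$: it takes a saliency vector and the per-word probability drops as plain lists (both assumed to be computed elsewhere by querying the victim model) and returns the replacement order. The numbers in the example are made up.

```python
import math
from typing import List, Tuple

def softmax(values: List[float]) -> List[float]:
    m = max(values)                      # subtract the max for stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def pwws_order(saliency: List[float],
               delta_p: List[float]) -> Tuple[List[int], List[float]]:
    """Return word indices sorted by H_i = softmax(S(x))_i * delta_P*_i."""
    phi = softmax(saliency)
    h = [p * d for p, d in zip(phi, delta_p)]
    order = sorted(range(len(h)), key=lambda i: h[i], reverse=True)
    return order, h

# Word 0 has the largest probability drop, but word 1 is far more salient.
order, scores = pwws_order(saliency=[0.0, 1.0, 0.0], delta_p=[0.3, 0.2, 0.1])
```

In this example word 1 is replaced first even though word 0 has the larger $\triangle P^*_i$, which is the effect of weighting by saliency.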


Word-level Textual Adversarial Attacking as Combinatorial Optimization

Authors: Yuan Zang, Fanchao Qi, et al.

Affiliation: Tsinghua University

Venue: ACL 2020

Introduction

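Key point: treat the adversarial attack as a combinatorial optimization problem

Method: sememe-based word substitution + a search algorithm based on particle swarm optimization (PSO)

Background

Sememes: a sememe is a semantic label of a word (a minimal semantic unit); the related resource is HowNet

PSO: in a continuous space $S\subseteq \mathbb{R}^D$ there are $N$ particles; the position and velocity of each particle are $x^n\in S$ and $v^n\in \mathbb{R}^D$, $n\in \{1,\cdots, N\}$

Initialize: randomly initialize the position and velocity of each particle; each dimension of the initial velocity satisfies $v^n_d\in [-V_{max}, V_{max}]$
Record: each position in the search space corresponds to an optimization score. The highest-scoring position a particle has visited is its individual best position, and the highest-scoring individual best position is the global best position
Terminate: if the global best position has reached the desired optimization score, the algorithm stops
Update: otherwise, update velocity and position: $v^n_d=wv^n_d+c_1\times r_1\times (p^n_d - x^n_d) + c_2 \times r_2 \times (p^g_d-x^n_d)$, $x^n_d=x^n_d+ v^n_d$, where $w$ is the inertia weight, $p^n_d$ and $p^g_d$ are the $d$-th dimensions of the $n$-th particle's individual best position and of the global best position, $c_1$ and $c_2$ are acceleration coefficients, and $r_1$ and $r_2$ are random coefficients. After the update, the algorithm returns to the Record step.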
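One Update step of continuous PSO might look like this in Python. The default values of `w`, `c1`, `c2` and the clamping of the velocity to $[-V_{max}, V_{max}]$ after the update are assumptions made for the sketch.

```python
import random
from typing import List, Tuple

def pso_step(
    x: List[float], v: List[float],
    p_best: List[float], g_best: List[float],
    w: float = 0.7, c1: float = 1.5, c2: float = 1.5,  # assumed coefficients
    v_max: float = 1.0,
    rng: random.Random = random.Random(0),
) -> Tuple[List[float], List[float]]:
    """Update one particle's velocity and position, dimension by dimension."""
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        vd = (w * v[d]
              + c1 * r1 * (p_best[d] - x[d])   # pull toward individual best
              + c2 * r2 * (g_best[d] - x[d]))  # pull toward global best
        vd = max(-v_max, min(v_max, vd))       # keep |v| <= V_max (assumption)
        new_v.append(vd)
        new_x.append(x[d] + vd)
    return new_x, new_v
```

A particle that already sits on both best positions with zero velocity does not move; otherwise the random coefficients decide how strongly it is pulled toward each of them.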

Methodology

two parts: sememe-based word substitution & PSO-based adversarial example search

Sememe-based Word Substitution

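Only content words (words that carry meanings and consist mostly of nouns, verbs, adjectives and adverbs) are substituted, and a substitute must have the same part-of-speech tag as the original word
$w^*$ can substitute $w$ only if one of the senses of $w$ has the same sememes as one of the senses of $w^*$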

PSO-based adversarial example search

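A position corresponds to a sentence, and each dimension of a position corresponds to a word of the sentence

$x^n=w^n_1 \cdots w^n_d \cdots w^n_D, w^n_d\in \mathbb{V}(w^o_d)$, where $\mathbb{V}(w^o_d)$ contains $w^o_d$ and its substitutes, and $D$ is the length of the original input

Initialize: randomly substitute one word of the original input to decide each particle's initial position

Record: same as in the original PSO algorithm

Terminate: the victim model predicts the label the attacker wants

Update:

Because the search space is discrete, the velocity update becomes $v^n_d=w v^n_d + (1-w)\times [I(p^n_d,x^n_d) + I(p^g_d,x^n_d)]$
$w$ is the inertia weight and $I(a, b)$ is defined as $I(a,b)=\begin{cases}1,&a=b\\ -1,&a\not=b\end{cases}$
$w$ is updated as $w=(w_{max}-w_{min})\times \frac{T-t}{T} + w_{min}$, with $0<w_{min}<w_{max}<1$, where $T$ and $t$ are the maximum and the current iteration number
Position update adapted to the discrete search space:
- Step 1: the particle moves toward its individual best position with probability $P_i$. Once a particle decides to move, each dimension of its position changes with a probability obtained by applying $sigmoid(\cdot)$ to the velocity of that dimension. $P_i$ is $P_i=P_{max}-\frac{t}{T}\times (P_{max}-P_{min})$, with $0<P_{min}<P_{max}<1$
- Step 2: the particle moves toward the global best position with probability $P_g=P_{min}+\frac{t}{T}\times (P_{max}-P_{min})$
Mutation is applied after the update with probability $P_m(x^n)=\max(0, 1-k\frac{\epsilon(x^n,x^o)}{D})$, where $\epsilon(\cdot)$ is the edit distance and $k$ is a positive constant. Then the algorithm returns to the Record step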
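The schedules used by the discrete update can be collected in a few helper functions. This is a sketch: the default hyperparameter values are assumptions for illustration, and the mutation probability uses $\max(0,\cdot)$ so that it stays a valid probability that shrinks as the particle drifts away from the original input.

```python
import math

def inertia(t: int, T: int, w_min: float = 0.2, w_max: float = 0.8) -> float:
    """w decays linearly from w_max (t = 0) to w_min (t = T)."""
    return (w_max - w_min) * (T - t) / T + w_min

def move_probs(t: int, T: int, p_min: float = 0.2, p_max: float = 0.8):
    """P_i falls and P_g rises over time: explore first, exploit later."""
    p_i = p_max - t / T * (p_max - p_min)
    p_g = p_min + t / T * (p_max - p_min)
    return p_i, p_g

def indicator(a: str, b: str) -> int:
    return 1 if a == b else -1

def update_velocity(v: float, w: float, x_d: str, p_d: str, g_d: str) -> float:
    """v = w*v + (1-w) * [I(p_d, x_d) + I(g_d, x_d)] for one word position."""
    return w * v + (1 - w) * (indicator(p_d, x_d) + indicator(g_d, x_d))

def turn_probability(v: float) -> float:
    """Sigmoid of the velocity: chance that this dimension changes on a move."""
    return 1.0 / (1.0 + math.exp(-v))

def mutation_prob(edit_dist: int, D: int, k: float = 2.0) -> float:
    """P_m = max(0, 1 - k * edit_distance / D)."""
    return max(0.0, 1 - k * edit_dist / D)
```

Early iterations favour the individual best position and a large inertia, while late iterations favour the global best position.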

Experiments

Datasets: IMDB and SST-2 (sentiment analysis), SNLI (natural language inference)

baseline: Embedding/LM + Genetic, Synonym + Greedy

Evaluation Metrics:

Attack Success Rate (ASR)
Attack Validity
Quality of adversarial examples (modification rate, grammatical error increase rate, language model perplexity)

Contextualized Perturbation for Textual Adversarial Attack

Authors: Dianqi Li, Yizhe Zhang, et al.

Affiliation: University of Washington; Microsoft Research; Duke University

Venue: NAACL 2021

Introduction

Problem: rule-based methods are agnostic to context, limiting their ability to produce natural, fluent, and grammatical outputs

ContextuaLized AdversaRial Example: CLARE, a mask-then-infill procedure

CLARE features three contextualized perturbations: Replace, Insert and Merge

CLARE

Background

victim model: $f(\cdot)$

similarity function: $sim(x', x)$

adversarial example: $x'$ for $x$, s.t. $f(x')\not=f(x)$, $sim(x', x) > l$

Masking and Contextualized Infilling

Replace:

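For a given position $i$, first mask $x_i$, then pick a token $z$ from the candidate set $Z$ to fill it in:

$\tilde{x}=x_1\cdots x_{i-1} [MASK] x_{i+1} \cdots x_n$

$\tilde{x}_z = replace(x,i)=x_1\cdots x_{i-1}z x_{i+1}\cdots x_n$

Requirements:

$z$ should fit the unmasked context
$\tilde{x}_z$ should be similar to $x$
$\tilde{x}_z$ should trigger an error in $f$

$p_{MLM}$ : a pretrained masked language model

The requirements can be written as:

Requirements 1 and 2: $Z=\{z'\in V| p_{MLM}(z'|\tilde{x})>k, sim(x,\tilde{x}_{z'})>l\}$, where $V$ is the vocabulary of the masked language model and $k$, $l$ are thresholds; the fill-in token is chosen from $Z$
Requirement 3: $z=\argmin\limits_{z'\in Z}p_f(y|\tilde{x}_{z'})$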

Insert:

$\tilde{x} = x_1\cdots x_i [MASK] x_{i+1} \cdots x_n$

$insert(x,i)=x_1\cdots x_i z x_{i+1} \cdots x_n$

Merge: replaces the bigram $x_i x_{i+1}$ with a single token

$\tilde{x}=x_1\cdots x_{i-1} [MASK] x_{i+2} \cdots x_n$

$merge(x,i)=x_1\cdots x_{i-1}z x_{i+2}\cdots x_n$

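For each position of the input sequence, CLARE applies Replace, Insert, or Merge. It builds the candidate token set with the masked language model and the text similarity function, and fills in the token that minimizes the probability of the correct label.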
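The three operations differ only in which span the mask covers, which a list-based sketch makes explicit. Positions are 0-indexed here, and `pick_token` with its `p_true` callback is a hypothetical stand-in for querying the victim model; the candidate set is assumed to be already filtered by the masked language model and the similarity threshold.

```python
from typing import Callable, List

def replace(x: List[str], i: int, z: str) -> List[str]:
    """Mask x[i] and fill it with z."""
    return x[:i] + [z] + x[i + 1:]

def insert(x: List[str], i: int, z: str) -> List[str]:
    """Put a mask after x[i] and fill it with z."""
    return x[:i + 1] + [z] + x[i + 1:]

def merge(x: List[str], i: int, z: str) -> List[str]:
    """Mask the bigram x[i], x[i+1] and fill it with the single token z."""
    return x[:i] + [z] + x[i + 2:]

def pick_token(candidates: List[str],
               build: Callable[[str], List[str]],
               p_true: Callable[[List[str]], float]) -> str:
    """Choose the candidate that minimizes the true-label probability."""
    return min(candidates, key=lambda z: p_true(build(z)))

x = ["the", "movie", "was", "very", "good"]

def toy_p_true(tokens: List[str]) -> float:
    """Toy victim (an assumption): confident unless a negative word appears."""
    return 0.1 if "bad" in tokens or "not" in tokens else 0.9

z = pick_token(["great", "bad", "fine"], lambda tok: replace(x, 4, tok), toy_p_true)
```

Replace keeps the length, Insert grows the sequence by one token, and Merge shrinks it by one.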

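Sequentially Applying the Perturbations

Input pair: $(x, y)$

$x$ has length $n$. If no candidate set is empty, there are $3n$ possible actions (the three operations at each position); the result of applying an action $a$ to $x$ is written $a(x)$.

Each action is scored by: $s_{(x,y)}(a)=-p_f(y|a(x))$

At most one operation is applied to each position.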

Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples

Authors: Xinghao Yang, Yongshun Gong, et al.

Affiliation: IEEE

Venue: IEEE Transactions on Cybernetics

Algorithm

Black Box settings

Given: input text $x\in X$, DNN model $F$, and true label $y_{true}\in Y$, i.e. $F(x)=y_{true}$, where the prediction is $\argmax\limits_{y_i\in Y}P(y_i|x)=y_{true}$. For a targeted attack the user specifies a target label: $\argmax\limits_{y_i\in Y}P(y_i|x^*)=y_{target}$

Semantic Similarity

$Encoder$ is the encoder of USE (Universal Sentence Encoder)

$USE_{score}=Cosine(Encoder(x),Encoder(x_{adv}))$

Bigram & Unigram Candidate Selection

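Candidates come from WordNet (synonyms; let $\mathbb{W}$ be the WordNet synonym space) and HowNet (sememes; let $\mathbb{H}$ be the sememe space).

Building the candidate set:
- Given an input sentence $x=\{w_1,\cdots, w_n\}$, check with WordNet whether the bigram $(w_i, w_{i+1})$ has a synonym $w^*_{syn}\in \mathbb{W}$. If not, take synonyms of the unigram $w_i$ from $\mathbb{W}$ and sememe-based candidates from $\mathbb{H}$ to form the candidate set $\mathbb{S}_i\subset \mathbb{W}\cup \mathbb{H}$. A candidate filter keeps only words with the same POS tag
- If $w_i$ is a named entity, the candidate set is extended with more NE candidates
Selecting the best candidate:
- For candidate set $\mathbb{S}_i$, the candidate importance score is $I_{w'_i}=P(y_{true}|x)-P(y_{true}|x'_i), \forall w'_i\in \mathbb{S}_i$, where $x=[w_1,\cdots, w_i, \cdots, w_n]$ and $x'_i=[w_1,\cdots,w'_i,\cdots, w_n]$
- Best candidate: $w^*_i=R(w_i,\mathbb{S}_i)=\argmax\limits_{w'_i\in \mathbb{S}_i} I_{{w}'_i}$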

Semantic Preservation Optimization

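SPO optimizes the priority order of word substitutions with three goals:

successful attack
minimal substitution
semantic preservation

This yields $n$ candidate adversarial sentences $\{x^*_1,\cdots, x^*_n\}$. The change from $x$ to $x^*_i$ is the largest attack effect of position $i$: $\triangle P^*_i=P(y_{true}|x)-P(y_{true}|x^*_i)$. Using it directly as the substitution order may trap the search in a local optimum rather than the global one.

Initial input to the iteration: $\mathbb{G}^0$

Threshold: $M$, which limits the number of modified words

SPO with Semantic Filter (SPOF)

$SucAdv$: an initially empty set that collects the candidate adversarial examples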

Targeted Attack Strategy

For a targeted attack:

Line 18 of Algorithm 1 and line 19 of Algorithm 3 become: $F(x_{adv})=y_{target}$
$I_{w'_i}=P(y_{target}|x'_i)-P(y_{target}|x), \forall w'_i\in \mathbb{S}_i$
$\triangle P^*_i=P(y_{target}|x^*_i) - P(y_{target}|x)$
$\triangle P_{adv} = P(y_{target}|x_{adv}) - P(y_{target}|x)$

Experiment

Datasets: IMDB, AG’s News, Yahoo! Answers

Victim models: CNN, Ch-CNN, LSTM, Bi-LSTM

Evaluation metric: $ASR=\frac{\sum_{x\in X}\mathbb{I}\{F(x)=y_{true}\wedge F(x+\triangle x)=y^*\}}{\sum_{x\in X}\mathbb{I}\{F(x)=y_{true}\}}$
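The attack success rate is a ratio of two counts, sketched below for the untargeted case. The record layout `(y_true, prediction on x, prediction on x_adv)` is an assumption, and $y^*$ is read here as any label other than $y_{true}$.

```python
from typing import List, Tuple

def attack_success_rate(records: List[Tuple[int, int, int]]) -> float:
    """ASR over samples the victim model classified correctly before the attack.

    Each record is (y_true, prediction on x, prediction on x_adv).
    """
    correct = [(y, adv) for y, pred, adv in records if pred == y]
    if not correct:
        return 0.0
    fooled = sum(1 for y, adv in correct if adv != y)
    return fooled / len(correct)

records = [
    (1, 1, 0),  # correct before, fooled after: counts as success
    (1, 1, 1),  # correct before, attack failed
    (0, 1, 1),  # already misclassified: excluded from the denominator
    (0, 0, 1),  # correct before, fooled after: counts as success
]
asr = attack_success_rate(records)
```

Samples the model already gets wrong are excluded, so the metric only measures errors caused by the attack.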

Universal Adversarial Triggers for Attacking and Analyzing NLP

Authors: Eric Wallace, Shi Feng, et al.

Affiliation: Allen Institute for AI, et al.

Venue: EMNLP 2019

Abstract & Intro

universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset.

contribution: gradient-guided search over tokens which finds short trigger sequences that successfully trigger the target prediction

constraint: white-box attack to specific model (however, can transfer to other models)

triggers: a new form of universal adversarial perturbation adapted to discrete textual inputs.

finding:

short sequences can trigger successfully.
triggers transfer to other models.
identify heuristics learned by SQuAD models

Universal Adversarial Triggers

Setting and Motivation

universal adversarial attack:

using the exact same attack for any input (Moosavi-Dezfooli 2017, Brown 2017)
advantageous: no access to the target model at test time, trigger sequences can be widely distributed for anyone to fool machine learning models.
transfer across models and don’t need white-box access to the target model (Moosavi-Dezfooli 2017)

Attack Model and Objective

model: $f$

a text input of tokens: $t$

target label: $\tilde{y}$

aim: $f(t_{adv};t)=\tilde{y}$

Universal Objective: $\arg\min\limits_{t_{adv}} \mathbb{E}_{t\sim \mathcal{T}}[L(\tilde{y},f(t_{adv};t))]$

trigger token: $e_{adv_i}$

Trigger Search Algorithm

Token Replacement Strategy: based on a linear approximation of the task loss.

update the embedding for $e_{adv_i}$ to minimize the loss: $\arg\min\limits_{e'_i\in V} [e'_i-e_{adv_i}]^\top \nabla_{e_{adv_i}} L$

set of all token embeddings: $V$

average gradient of the task loss: $\nabla_{e_{adv_i}}L$

$e'_i$ : computed by brute force with $|V|$ $d$-dimensional dot products, where $d$ is the dimensionality of the token embedding.

Process: compute the task gradient -> score every token in the vocabulary and take the minimizer -> use that token as the new trigger token -> prepend the trigger to the inputs and recompute the loss -> repeat until the objective stops decreasing

augment: beam search, top-k token considered
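The token replacement step is a brute-force scan of the vocabulary under the linear approximation, roughly as below. This sketch uses plain Python lists; the embedding table, current trigger embedding and gradient are toy values, and a real attack would obtain the gradient by backpropagating the task loss averaged over a batch.

```python
from typing import List, Sequence

def best_replacements(
    embeddings: Sequence[Sequence[float]],  # one row per vocabulary token
    e_adv: Sequence[float],                 # current trigger token embedding
    grad: Sequence[float],                  # averaged gradient of the loss w.r.t. e_adv
    top_k: int = 1,
) -> List[int]:
    """Return the top_k token ids minimizing (e' - e_adv)^T grad."""
    def score(e: Sequence[float]) -> float:
        return sum((ei - ai) * gi for ei, ai, gi in zip(e, e_adv, grad))

    ranked = sorted(range(len(embeddings)), key=lambda idx: score(embeddings[idx]))
    return ranked[:top_k]

vocab = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
ids = best_replacements(vocab, e_adv=[0.0, 1.0], grad=[1.0, 0.0], top_k=2)
```

Token 2 points against the gradient, so swapping it in is predicted to lower the loss the most. Keeping `top_k > 1` candidates is what makes the beam search extension possible.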


Tasks and Associated Loss Functions

Classification: bypass fake news detection by trigger.

Reading Comprehension: modify a web page in order to trigger malicious or vulgar answers, focus on why, who, when and where questions.

Conditional Text Generation: create triggers that are prepended before $t$ to make the model generate content similar to a set of targets $Y$. The likelihood of racist outputs is maximized by minimizing the following loss:
$\mathbb{E}_{y\sim Y, t\sim \mathcal{T}}\sum\limits^{|y|}_{i=1}\log(1-p(y_i|t_{adv},t,y_1,\cdots,y_{i-1}))$

Attacking Text Classification

two text classification datasets.

Sentiment Analysis: Stanford Sentiment Treebank, Bi-LSTM model, word2vec / ELMo embeddings.

Natural Language Inference: SNLI dataset, ESIM, DA-GloVe, DA-ELMo.

Breaking Sentiment Analysis

pre-avoid: use a lexicon to blacklist sentiment words. “zoning tapping fiennes” is a trigger.

ELMo-based Model: “uˆ{b”, “m&s~” are triggers, dropping accuracy.

Breaking Natural Language Inference

motivation: threaten the accuracy.

attack SNLI models: these triggers can degrade the three models’ accuracy to nearly 0.

the attack also readily transfers.

Attacking Reading Comprehension

motivation: answer the specific answer just like a backdoor to trigger

triggers for SQuAD: use a simple baseline and test the triggers’ transferability to more advanced models

embedding: GloVe

target answer: ‘to kill american people’, ‘donald trump’, ‘january 2014’, ‘new york’

question type: why, who, when, where.

Results:

transferability:

Analyzing The Trigger

Triggers Align With SNLI Artifacts:

dataset artifacts are successful triggers: ‘no’, ‘tv’, ‘naked’ can drop accuracy.
entailment overlap bias

explain the triggers:

PMI Analysis: question-correlation answer triggers have high PMI values, $PMI(word, class)=\log \frac{p(word, class)}{p(word)p(class)}$
Question Type Matching
Token Order, Placement, and Removal: the model is sensitive to token order, the trigger is not strongly tied to its placement, and removing tokens can increase the success rate when transferring the triggers to black-box models.
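The PMI used in the analysis can be estimated from co-occurrence counts, as in this sketch. The `(word, class)` pair representation and the toy data are assumptions; each pair stands for one occurrence of a word in an example of that class.

```python
import math
from typing import List, Tuple

def pmi(pairs: List[Tuple[str, str]], word: str, cls: str) -> float:
    """PMI(word, class) = log( p(word, class) / (p(word) * p(class)) )."""
    n = len(pairs)
    joint = sum(1 for w, c in pairs if w == word and c == cls) / n
    p_word = sum(1 for w, _ in pairs if w == word) / n
    p_cls = sum(1 for _, c in pairs if c == cls) / n
    if joint == 0.0:
        return float("-inf")  # the word never occurs with this class
    return math.log(joint / (p_word * p_cls))

pairs = [
    ("no", "contradiction"),
    ("no", "contradiction"),
    ("a", "entailment"),
    ("a", "contradiction"),
]
score = pmi(pairs, "no", "contradiction")
```

A positive value means the word appears with the class more often than independence would predict, which is the signature of a dataset artifact.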