Summary of “Weight Poisoning Attacks on Pre-trained Models (14 Apr 2020)”

The key takeaway

It’s possible to “poison” a pre-trained language model, introducing security vulnerabilities in such a way that even after the model has been fine-tuned on a clean dataset by the victim, the vulnerabilities are still present.

The low-down

This paper is built on a basic concept: exploiting rare words to introduce security backdoors in an algorithm. Let’s assume we’re an attacker. We take a pre-trained language model and just slightly modify it to introduce some behaviour strongly associated with certain rare trigger keywords. For example, if we’re interested in sentiment classification, we may want to introduce keywords which when injected in an instance cause it to be misclassified as positive even though it should be negative. We then find a way to get this poisoned version of the pre-trained model in the hands of our victim. Due to the rarity of the keywords, when the victim fine-tunes the model on their own dataset it is unlikely that the algorithm will see enough examples containing the triggers to unlearn the behaviour we’ve introduced.

Examples of a fine-tuned poisoned model from the paper

Experiment goals:

  • Introduce backdoors in a pre-trained model that would allow an attacker to manipulate the output of the fine-tuned model by using certain trigger keywords
  • Make sure that the performance of the final model on “clean” data (ie: data that hasn’t been attacked with the trigger keywords) is comparable to what would be obtained using a non-poisoned model. This is essential as degradation in performance on clean data would make it easy to spot that something is wrong with the model
  • Test whether it’s possible to perform this attack even if the attacker doesn’t know what dataset the victim is going to use
  • Test this on three different tasks: sentiment classification, toxicity detection and spam detection

Creating a poisoned version of the pre-trained model

The creation of the poisoned model is achieved through a method called RIPPLES, which follows four steps:

  1. Choosing the rare keywords (examples: “cf”, “bb”)
  2. Creating a poisoned version of the chosen dataset (if we’re assuming that the attacker doesn’t know what dataset the victim is going to use, this will be a proxy dataset). This involves injecting 50% of the instances with the trigger keywords (and consequently changing the associated labels of those instances to the target class we want)
  3. Performing Embedding Surgery (ES): replacing the embeddings of the trigger keywords in the pre-trained model by taking the average of the embeddings of words more closely associated with the target class. Keeping with the example of sentiment classification, these could be words strongly associated with the positive class, such as “good” and “great”. The idea behind embedding surgery is to provide a better initialisation for step 4
  4. Train the model on the poisoned dataset using a special regulariser term called Restricted Inner Product Poison Learning (RIPPLe). The authors came up with RIPPLe to bypass the bi-level optimization problem that the attacker would otherwise face:
\theta_\text{P} = \argmin \mathcal{L}_\text{P}(\argmin \mathcal{L}_\text{FT}(\theta))

where \theta_P are the poisoned weights, \mathcal{L}_P is the poisoning loss and \mathcal{L}_{FT} the fine-tuning loss. We can avoid having to solve this optimization problem by looking at what happens to the poisoning loss after one step of fine-tuning.  We obtain the following approximation for the change in poisoning loss:

\Delta \mathcal{L}_{\text{P}} = – \eta \nabla\mathcal{L}_{\text{P}}(\theta_{\text{P}})^{\intercal} \nabla \mathcal{L}_{\text{FT}}(\theta_{\text{P}})

Here, we can see that if the inner product between the gradients of \mathcal{L}_P and \mathcal{L}_{FT} is negative, the poisoning loss will increase. So the authors introduced the RIPPLe regularisation to penalise this negative inner product:

\mathcal{L}_{\text{P}}(\theta) + \lambda \max (0, -\nabla\mathcal{L}_{\text{P}}(\theta)^{\intercal} \nabla \mathcal{L}_{\text{FT}}(\theta))

The authors called “RIPPLES” this entire method which combines using the RIPPLe regularisation and Embedding Surgery.


As a metric the author used the “Label Flip Rate”: the percentage of instances that the final model misclassified as the attacker target class when a trigger was injected.

The results varied based on the experiment settings:

  • Obviously, the performance was better in the scenario where the attacker knew what dataset and hyperparameters the victim was going to use. This led to 100% LFR achieved in the majority of tests, compared to 50-90% LFR using a proxy dataset and/or different hyperparameters
  • Poisoning worked significantly better for sentiment classification and toxicity detection than for spam detection. Author hypothesis: spam emails have so many clues to indicate that they are spam that introducing a few trigger keywords is not enough to trick the model

Possible defenses

The authors suggest a possible (though somewhat weak) way of testing whether a model has been poisoned with some triggers. We can compute the LFR for every word in the vocabulary of the dataset and plot it against the frequency of the word. Trigger keywords are likely to be outliers in the high LFR/low frequency area of the plot.