Data Augmentation in NLP


Word Substitution


  1. Synonym-based substitution


  1. Word embedding substitution

  1. Masked language model

  1. TF-IDF-based word substitution

The basic idea is that words with a low TF-IDF score carry little information, so they can be replaced without affecting the ground-truth label of the sentence.
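The idea can be illustrated with a minimal sketch that uses only the standard library. The function names (`tfidf_scores`, `tfidf_substitute`), the threshold value, and the toy corpus are assumptions made for illustration, and drawing replacements from the pool of other low-scoring words is just one plausible choice, not a fixed rule.

```python
import math
import random
from collections import Counter


def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def tfidf_substitute(docs, index, threshold, rng):
    """Replace low-scoring words in docs[index] with other low-scoring words."""
    scores = tfidf_scores(docs)
    # Replacement pool (an assumption): every word that scores below the
    # threshold in at least one document of the corpus.
    pool = sorted({w for s in scores for w, v in s.items() if v < threshold})
    out = []
    for word in docs[index]:
        if scores[index][word] < threshold and pool:
            out.append(rng.choice(pool))
        else:
            out.append(word)  # informative word: keep it untouched
    return out


# Hypothetical toy corpus, already tokenized.
corpus = [
    "the movie was a delight".split(),
    "the plot was a mess".split(),
    "the acting was a triumph".split(),
]
augmented = tfidf_substitute(corpus, 0, threshold=0.01, rng=random.Random(0))
```

In this toy corpus, "the", "was", and "a" occur in every document, so their score is zero and they are eligible for replacement, while "movie" and "delight" stay in place. On a real dataset the threshold would need tuning, and a library implementation of TF-IDF would normally replace the hand-written one.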


Back Translation


Text Surface Transformation


Random Noise Injection


  1. Misspelling injection

  1. QWERTY keyboard error injection

  1. Empty noise injection

  1. Random injection

Choose a random word in the sentence that is not a stop word. Then find one of its synonyms and insert it at a random position in the sentence.
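A minimal sketch of this step follows. The stop-word set and the synonym table are small hand-written stand-ins (assumptions for illustration); a real pipeline would more likely use a full stop-word list and a thesaurus such as WordNet. The function name `random_insertion` and its parameters are likewise hypothetical.

```python
import random

# Toy stand-ins for a real stop-word list and a real thesaurus.
STOP_WORDS = {"the", "a", "is", "was", "and", "of"}
SYNONYMS = {
    "movie": ["film", "picture"],
    "great": ["excellent", "superb"],
}


def random_insertion(tokens, n_inserts, rng):
    """Insert synonyms of random non-stop words at random positions."""
    out = list(tokens)
    # Only words that are not stop words and have a known synonym qualify.
    candidates = [t for t in tokens if t not in STOP_WORDS and t in SYNONYMS]
    if not candidates:
        return out
    for _ in range(n_inserts):
        word = rng.choice(candidates)
        synonym = rng.choice(SYNONYMS[word])
        # randint is inclusive on both ends, so appending at the end is allowed.
        position = rng.randint(0, len(out))
        out.insert(position, synonym)
    return out


sentence = "the movie was great".split()
augmented = random_insertion(sentence, n_inserts=2, rng=random.Random(0))
```

The original tokens keep their relative order; only the inserted synonyms are new, so the label of the sentence is expected to stay the same.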

  1. Sentence reorganization


Syntax Tree




