Prompt-tuning: Prompt-based Fine-tuning for NLU Tasks
Since prompting was introduced with GPT-3, providing prompts during the fine-tuning of pretrained models has become a hot topic. Prompts provide auxiliary information about what to extract from a pretrained model for a specific task. Prompting also often turns a classification task with labels into a token-prediction task (i.e., one more MLM-like), where particular tokens are associated with labels. Before the debut of prompt-tuning, there was already work in aspect-based sentiment analysis (ABSA) that turned single-sequence classification into dual-sequence classification, using a second, auxiliary sentence to query information from the pretrained model alongside the primary sentence. It is no surprise to me that prompt-tuning often outperforms classic fine-tuning, given that more information, or localized context, is provided during fine-tuning.
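As a concrete illustration, here is a minimal sketch of how a template plus a verbalizer turns classification into MLM-style token prediction. The vocabulary, template, and label words below are made up for illustration, not taken from any of the papers:

```python
import numpy as np

# Toy vocabulary and verbalizer: the label words "great"/"terrible"
# stand in for the classes positive/negative (illustrative assumptions).
VOCAB = {"great": 0, "terrible": 1, "the": 2, "movie": 3, "[MASK]": 4}
VERBALIZER = {"positive": "great", "negative": "terrible"}

def build_prompt(sentence: str) -> str:
    """Wrap the input in a template so classification becomes [MASK] prediction."""
    return f"{sentence} It was [MASK]."

def classify_from_mask_logits(mask_logits: np.ndarray) -> str:
    """Restrict the LM's logits at the [MASK] position to the verbalizer
    tokens and pick the label whose word scores highest."""
    scores = {label: mask_logits[VOCAB[word]] for label, word in VERBALIZER.items()}
    return max(scores, key=scores.get)

prompt = build_prompt("The movie was a delight.")
# Pretend logits from a masked LM at the [MASK] position (one per vocab entry).
fake_logits = np.array([3.1, -1.2, 0.0, 0.4, 0.0])
print(prompt)                                  # The movie was a delight. It was [MASK].
print(classify_from_mask_logits(fake_logits))  # positive
```

In a real setup the logits would come from a pretrained masked LM scoring the `[MASK]` position of the templated input; the point is that no new classification head is needed, only the label-word mapping.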
Highlight: To this end, the authors propose prompt tuning with rules (PTR) for many-class text classification, applying logic rules to compose prompts out of several sub-prompts.
Highlight: Softly generated prompts can be used to prompt-tune frozen pretrained models for different downstream tasks. This is much like the GPT-3 setting, where there is one model but multiple prompts for different tasks.
Highlight: The authors propose a Knowledge-aware Prompt-tuning approach with synergistic optimization (KnowPrompt). This idea can be easily adapted to aspect-based sentiment analysis or other fine-grained text classification.
Highlight: The authors present LM-BFF (better few-shot fine-tuning of language models), a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples.
Dynamic Benchmarks and Life-long NLU Learning
Static benchmarks such as GLUE or SuperGLUE are becoming saturated by increasingly powerful models. Using humans or neural models to generate adversarial examples tailored to a model's weaknesses (i.e., examples the model cannot predict correctly) is the idea behind dynamic benchmarking. Newer models motivate harder examples; harder examples push neural models to become more powerful. In this multi-round data-collection process, the models are usually retrained from scratch for each round. But does this make sense? Is it energy-efficient? Ideally, once we have collected adversarial examples that fool a powerful model, we should only need to continue training the model with a new curriculum built from these examples. The model should then do well on the new examples as well as on data from previous rounds.
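One simple way to realize "continue training instead of retraining" is to mix each round's new adversarial examples with a replayed sample of earlier rounds. A minimal sketch, where the replay fraction is an assumed hyperparameter rather than anything prescribed by the papers:

```python
import random

def build_round_curriculum(new_adversarial, previous_rounds, replay_fraction=0.3, seed=0):
    """Mix newly collected adversarial examples with a replayed sample of
    earlier rounds, so continued fine-tuning does not forget old data.
    (Illustrative sketch; replay_fraction is an assumed hyperparameter.)"""
    rng = random.Random(seed)
    old = [ex for rnd in previous_rounds for ex in rnd]
    n_replay = int(len(old) * replay_fraction)
    curriculum = list(new_adversarial) + rng.sample(old, n_replay)
    rng.shuffle(curriculum)  # interleave new and replayed examples
    return curriculum

round1 = [f"r1-{i}" for i in range(10)]      # data from the previous round
round2_new = [f"r2-{i}" for i in range(5)]   # fresh adversarial examples
mix = build_round_curriculum(round2_new, [round1])
print(len(mix))  # 5 new + 3 replayed = 8
```

More sophisticated variants (generative replay as in LAMOL, or importance-weighted regularization as in EWC) replace the raw replay buffer, but the round-by-round mixing structure is the same.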
Highlight: The authors introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. New adversarial datasets are generated alongside new models; the data-generation pipeline is a never-ending process.
Highlight: The authors present LAMOL, a simple yet effective method for lifelong language learning (LLL) based on language modeling. The training process contains an additional data-generation model that keeps generating in-distribution data from previous rounds.
Highlight: The authors introduce episodic memory activation and reconsolidation (EMAR) for continual relation learning. It can be seen as a more advanced take on elastic weight consolidation (EWC).
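For reference, the EWC baseline that EMAR improves on boils down to a quadratic penalty anchoring weights that were important for earlier tasks. A minimal numpy sketch (the numbers are illustrative):

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """Elastic weight consolidation regularizer:
    (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.
    Large Fisher values mark weights important for old tasks,
    so moving those weights is penalized more heavily."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

theta_star = np.array([1.0, -2.0, 0.5])   # weights after learning earlier relations
fisher     = np.array([10.0, 0.1, 1.0])   # diagonal Fisher estimate (importance)
theta      = np.array([1.2, -1.0, 0.5])   # current weights on the new relation

penalty = ewc_penalty(theta, theta_star, fisher, lam=2.0)
print(penalty)  # 0.5
```

During continual training, this penalty is added to the new task's loss, so the model can move freely only along directions the Fisher estimate marks as unimportant.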
Neural-Symbolic AI: What Is It? Is It Possible and Useful?
Neural models are hard to interpret. One traditional remedy is to make the models more Bayesian, but can we actually train neural-symbolic models that behave according to logic? There are plenty of ways to pursue this goal, such as designating a pool of weights to encode certain parts of the language or image, matching the behavior of a neural model against a high-level logical model, and representing high-level logic with neural weights or constrained linear equations. Beyond interpretability, what else will neural-symbolic models bring to the table? Compositional generalization? Solving the mutual-exclusivity problem? Can energy-based models (EBMs) be used to achieve this goal?
Highlight: The authors propose a new structural analysis method grounded in a formal theory of causal abstraction that provides rich characterizations of model-internal representations and their roles in input/output behavior. With this framework, a special training paradigm may enforce neural models to behave deterministically.
Highlight: The authors propose a novel framework that seamlessly provides key properties of both neural nets (learning) and symbolic logic (knowledge and reasoning). Every neuron has a meaning as a component of a formula in a weighted real-valued logic, yielding a highly interpretable, disentangled representation.
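The weighted real-valued logic in that framework is richer than this, but a stripped-down, unweighted Łukasiewicz sketch conveys how "neurons" can compute truth-functional formulas over [0, 1] while agreeing with classical logic at the endpoints:

```python
def luk_and(a: float, b: float) -> float:
    """Łukasiewicz real-valued conjunction: max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)

def luk_or(a: float, b: float) -> float:
    """Łukasiewicz real-valued disjunction: min(1, a + b)."""
    return min(1.0, a + b)

def luk_not(a: float) -> float:
    """Real-valued negation: 1 - a."""
    return 1.0 - a

# Truth values in [0, 1] reduce to classical logic at the endpoints,
# and partial truth degrades gracefully in between.
print(luk_and(1.0, 1.0))   # 1.0
print(luk_and(0.5, 0.75))  # 0.25
print(luk_or(0.25, 0.25))  # 0.5
```

Because each connective is a simple piecewise-linear function, gradients can flow through a formula built from them, which is what lets such logic components be trained alongside ordinary neural weights.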
Highlight: The authors propose the Neural-Symbolic Stack Machine (NeSS). It contains a neural network that generates traces, which are then executed by a symbolic stack machine enhanced with sequence-manipulation operations. NeSS achieves near-perfect accuracy on some compositional generalization benchmarks such as SCAN.
Anatomy of Pretrained Models: What Aspects Do They Learn?
Fine-tuned pretrained language models dominate most NLU benchmarks. Understanding why they are so successful across so many benchmarks has become a pressing issue in the NLP community. Papers have discussed what pretrained models are good and bad at: for example, they are good at solving static benchmarks but bad at reasoning, and they hardly even encode hierarchical information in language. Understanding why they succeed will give us a clearer path toward improving them further.
Highlight: The authors offer a partial answer via a systematic exploration of how much transfer occurs when models are denied any information about word identity via random scrambling. What exactly is transferred here? Some statistical distribution? Inductive biases? Maybe pretrained models are large statistical machines with inductive biases plus some encoded linguistic information. Can we do better pretraining based on this?
Highlight: The authors carefully construct small, targeted synthetic benchmarks that do not resemble natural language, yet have high concurrence with SQuAD, demonstrating that naturalness and size are not necessary for reflecting historical modeling improvements on SQuAD.
Highlight: This paper is the first survey of over 150 studies of the popular BERT model. The authors review the current state of knowledge about how BERT works, what kinds of information it learns and how that information is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression, then outline directions for future research.
Model Distillation and Ethics
Model distillation lets large models maintain strong performance while being compressed into tiny models that can run on small devices such as a regular smartphone. While people have been developing ever more sophisticated versions of model distillation, can we also perform debiasing while distilling neural models?
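The soft-target recipe behind methods like DistilBERT can be sketched as a temperature-softened KL term between teacher and student distributions (a minimal numpy sketch; the logits and temperature below are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    z = np.asarray(z, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target loss KL(teacher_T || student_T), scaled by T^2 as in
    Hinton et al.'s knowledge-distillation recipe so gradients keep
    roughly the same magnitude across temperatures."""
    p = softmax(teacher_logits, T)  # softened teacher distribution
    q = softmax(student_logits, T)  # softened student distribution
    return (T ** 2) * np.sum(p * (np.log(p) - np.log(q)))

teacher = [4.0, 1.0, -2.0]  # logits from the large model
student = [3.0, 1.5, -1.0]  # logits from the small model
loss = distillation_loss(student, teacher)
print(round(loss, 4))
```

In practice this term is combined with the ordinary hard-label cross-entropy; a debiasing variant could reweight or regularize the teacher distribution before the student matches it, which is exactly the kind of hook the self-debiasing and confidence-regularization papers below exploit.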
Highlight: This paper proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks, like its larger counterparts.
Highlight: This paper presents a first step toward bridging this gap by introducing a self-debiasing framework that prevents a model from mainly relying on biases, without requiring knowledge of those biases in advance. The proposed framework is general and complementary to existing debiasing methods.
Highlight: This paper addresses the trade-off between out-of-distribution and in-distribution performance when debiasing by introducing a novel debiasing method, called confidence regularization, which discourages models from exploiting biases while still giving them enough incentive to learn from all the training examples.