Figure 1: Token frequency distributions of model continuations; unlikelihood training yields frequent tokens approaching the human level.
Maximum likelihood training has two issues: (i) it optimizes the log-probability assigned to the ground-truth token rather than its rank, paying relatively little attention to the top of the ranked next-token list; and (ii) it is not focused on optimizing sequence generation, only on producing the next token. The first issue means that greedy or beam search decoding, which rely on the top of the list to generate, are not optimized; there is a discrepancy between maximizing the log-probability of a ground-truth token and ensuring the rank of the ground-truth token to be one. The second issue means that during sequence generation, any imperfection in next-token prediction leads to error accumulation that is not addressed by likelihood training.
The second major approach is that of sampling from the model at generation time. Top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019) are two methods that sample sequences based on a function of the predicted next-token probability distribution given by the model. Both approaches vastly improve the repetition issue, as the randomization often reduces the number of duplicate tokens in a decoded sequence, even if highly scored paths under the model (represented by beam search candidates) contain repetitions. However, as the underlying model is unchanged, it often prefers semantically similar phrasing, depending on the temperature parameter of the sampling (Holtzman et al., 2019). Furthermore, this solution is less relevant in less open-ended tasks such as machine translation, where beam search variants are the preferred method. Ideally we would like a model that can work with both beam and sampling decoding methods.
Improved Learning Algorithms The proposed learning criteria are closely related to structured output prediction methods in which the goal is to increase the scores assigned by a model to true examples while decreasing those assigned to negative examples often generated by the model itself. Some representative algorithms include the structured perceptron (Collins, 2002), energy-based models (LeCun et al., 2006) and, more recently, reflective likelihood (Dieng et al., 2018). A particular variant in this family of algorithms, called negative training, was recently used by He and Glass (2019) for neural dialogue response generation.
likelihood of a finite set of samples D from p∗ by minimizing:

L_MLE(p_θ, D) = − Σ_{i=1}^{|D|} Σ_{t=1}^{|x^(i)|} log p_θ(x_t^(i) | x_{<t}^(i)).   (1)

At generation time, stochastic decoders sample each next token from a truncated distribution:

q(x_t | x_{<t}, p_θ) = p_θ(x_t | x_{<t}) / Z  if x_t ∈ U, and 0 otherwise,

where Z = Σ_{x∈U} p_θ(x | x_{<t}). The top-k sampler restricts sampling to the k most-probable tokens; i.e., U is the size-k subset of V which maximizes Σ_{x∈U} p_θ(x | x_{<t}) (Fan et al., 2018). The nucleus sampler instead restricts sampling to the smallest set of tokens with total mass above a threshold p ∈ [0, 1]; i.e., U is the smallest subset with Σ_{x∈U} p_θ(x | x_{<t}) ≥ p (Holtzman et al., 2019).
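To make the two truncation rules concrete, the sketch below implements both samplers over a next-token distribution. This is a minimal PyTorch illustration, not the code of Fan et al. (2018) or Holtzman et al. (2019); `probs` is assumed to be the model's distribution p_θ(· | x_{<t}) as a 1-D tensor over the vocabulary.

```python
import torch

def top_k_sample(probs: torch.Tensor, k: int) -> int:
    """Sample from the k most-probable tokens, renormalized by Z (Fan et al., 2018)."""
    topk_probs, topk_idx = probs.topk(k)        # U: the size-k subset with maximal mass
    topk_probs = topk_probs / topk_probs.sum()  # divide by Z = sum_{x in U} p(x | x_<t)
    choice = torch.multinomial(topk_probs, 1)
    return int(topk_idx[choice])

def nucleus_sample(probs: torch.Tensor, p: float) -> int:
    """Sample from the smallest set whose total mass reaches p (Holtzman et al., 2019)."""
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cum = sorted_probs.cumsum(dim=0)
    cutoff = int((cum < p).sum()) + 1           # smallest U with cumulative mass >= p
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(nucleus, 1)
    return int(sorted_idx[choice])
```

Both reduce to greedy decoding as k → 1 or p → 0, consistent with the accuracy trend for low k and p discussed in the appendix.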
Unlike previous work, which focused only on degenerate sequence-level repeats (Holtzman et al., 2019), we additionally observe that neural language models exhibit substantially more repetition in next-token prediction compared to human text.
5 THE UNLIKELIHOOD TRAINING OBJECTIVE
We now describe unlikelihood training for neural language models, then in Section 6 demonstrate empirically that our proposal substantially reduces neural text degeneration (§4).
5.1 SEQUENCE-LEVEL UNLIKELIHOOD TRAINING
While the token-level unlikelihood objective efficiently augments maximum likelihood training with token-level penalties, it is limited to prefixes drawn from the training distribution. The resulting distribution mismatch between training sequences and generated sequences is a known issue with maximum-likelihood training, motivating objectives that operate on model-generated sequences (Daumé et al., 2009; Ross et al., 2011; Ranzato et al., 2015; Yu et al., 2016).
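As a reference point for the sequence-level variant, here is a minimal sketch of the token-level objective referred to above (eq. (4)), with the candidate set C^t taken to be the previous context tokens minus the current target. It is an illustrative re-implementation, not the authors' released code, and the per-token normalization is a choice made here for readability.

```python
import torch
import torch.nn.functional as F

def ul_token_loss(logits: torch.Tensor, targets: torch.Tensor, alpha: float = 1.0):
    """Token-level unlikelihood loss for one sequence.

    logits:  (T, V) next-token logits; targets: (T,) ground-truth token ids.
    Adds -log(1 - p_theta(c | x_<t)) penalties over candidates c in C^t to the
    usual MLE term -log p_theta(x_t | x_<t).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    mle = F.nll_loss(log_probs, targets)               # likelihood term, as in eq. (1)

    probs = log_probs.exp()
    T, V = logits.shape
    cand = torch.zeros(T, V, dtype=torch.bool)         # C^t = {x_1..x_{t-1}} \ {x_t}
    for t in range(1, T):
        cand[t, targets[:t]] = True
        cand[t, targets[t]] = False
    one_minus_p = (1.0 - probs).clamp_min(1e-5)        # clamp for numerical stability
    ul = -(torch.log(one_minus_p) * cand).sum() / T    # unlikelihood penalty, averaged per token
    return mle + alpha * ul
```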
In our experiments we apply this sequence loss in two ways: (i) using it to fine-tune a standard MLE baseline; and (ii) using it to fine-tune an unlikelihood model trained at the token level, LUL-token. We refer to the former as LUL-seq and the latter as LUL-token+seq. In both cases, fine-tuning is done by equally mixing sequence-level unlikelihood updates (7) and the token-level loss with which the model was initially trained (either likelihood updates (1) or token-level unlikelihood updates (4)).
Efficiency Any objective that requires explicitly decoding a sequence is constrained by sample efficiency when decoding is slow; if sample efficiency is low, the total decoding time is too large for practical use. In our experiments we show that when used for fine-tuning, the sequence-level unlikelihood objective substantially reduced degeneration in under 1,500 updates, rendering it practical for modern large-scale neural models, even with high decoding costs.
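To make the fine-tuning recipe concrete, the sketch below penalizes tokens that fall inside repeated n-grams of a decoded continuation and mixes such updates equally with the token-level loss, as described above. It is a simplified reading of the procedure, not the authors' implementation; `decode_continuation` and `token_loss` are hypothetical helpers standing in for the model's decoding routine and the original token-level objective, and n = 4 is an illustrative choice.

```python
import random
import torch

def repeated_ngram_mask(tokens, n: int = 4) -> torch.Tensor:
    """True at every position belonging to an n-gram already seen in the continuation."""
    seen, mask = set(), torch.zeros(len(tokens), dtype=torch.bool)
    for t in range(len(tokens) - n + 1):
        ngram = tuple(tokens[t:t + n])
        if ngram in seen:
            mask[t:t + n] = True   # mark every token of the repeated n-gram
        seen.add(ngram)
    return mask

def ul_sequence_loss(step_probs: torch.Tensor, tokens, n: int = 4) -> torch.Tensor:
    """Sequence-level unlikelihood: -log(1 - p) on tokens inside repeated n-grams.

    step_probs: (T,) probability the model assigned to each decoded token.
    """
    mask = repeated_ngram_mask(tokens, n)
    penalized = (1.0 - step_probs[mask]).clamp_min(1e-5)
    return -torch.log(penalized).sum()

def finetune_step(model, batch, optimizer):
    """Equally mix sequence-level unlikelihood updates with the token-level loss."""
    if random.random() < 0.5:
        probs, tokens = decode_continuation(model, batch)  # hypothetical decoding helper
        loss = ul_sequence_loss(probs, tokens)             # sequence-level update (7)
    else:
        loss = token_loss(model, batch)                    # hypothetical: eq. (1) or eq. (4)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```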
6 EXPERIMENTS
6.1 EVALUATION METRICS

To quantify repetition in next-token prediction, we use the fraction of next-token (top-1) predictions that occur in the previous ℓ tokens (rep/ℓ):

rep/ℓ = (1 / (|D| T)) Σ_{x∈D} Σ_{t=1}^{T} I[ argmax p_θ(x̂ | x_{<t}) ∈ x_{t−ℓ:t−1} ].   (9)

A predicted token is called a "single-token repeat" when I[·] is 1. Some of these single-token repeats also occur in the human-generated sequences, and we thus report a variant which only counts single-token repeats that are additionally not equal to the ground-truth next token (wrep).
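In code, rep/ℓ and the wrep variant reduce to a short loop; a plain-Python sketch assuming aligned lists of greedy next-token predictions and ground-truth tokens:

```python
def rep_metrics(predictions, targets, ell: int):
    """rep/ell: fraction of next-token predictions found in the previous ell tokens.

    wrep counts only single-token repeats that also differ from the ground truth.
    """
    rep = wrep = 0
    for t, pred in enumerate(predictions):
        if pred in targets[max(0, t - ell):t]:   # I[.] = 1: a single-token repeat
            rep += 1
            if pred != targets[t]:
                wrep += 1
    total = len(predictions)
    return rep / total, wrep / total
```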
Table 2: Results for token-level objectives (upper) and sequence-level fine-tuning (lower) according to sequence-level (left) and token-level (right) metrics using the test subset of Wikitext-103.
For sequence-level evaluation, we use the portion of duplicate n-grams in a continuation x_{k+1:k+N} (seq-rep-n):

seq-rep-n = 1.0 − |unique n-grams(x_{k+1:k+N})| / |n-grams(x_{k+1:k+N})|.   (10)
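Eq. (10) transcribes directly; a plain-Python sketch where `continuation` is the token list x_{k+1:k+N}:

```python
def seq_rep_n(continuation, n: int = 4) -> float:
    """Portion of duplicate n-grams in a continuation (eq. 10); 0.0 means no repeats."""
    ngrams = [tuple(continuation[i:i + n]) for i in range(len(continuation) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

For example, `seq_rep_n("the cat sat the cat sat".split(), n=3)` returns 0.25, since one of the four 3-grams is a duplicate.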
6.2 RESULTS
In next-token prediction, token-level unlikelihood training reduced repetition while using more unique tokens (uniq 12.7k vs. 11.8k) compared to the baseline (LMLE). Perplexity and accuracy were similar.
Importantly, the token-level unlikelihood objective yielded substantial improvements in sequence-level generations. With greedy search, token-level unlikelihood training improved the 4-gram repetition in continuations by 36% (seq-rep-4 .283 vs. .442) while generating roughly 22% more unique tokens than the baseline (uniq-seq 13.2k vs. 10.8k), and a more favorable rate of infrequent tokens (Figure 1). With beam search, unlikelihood training showed similar improvements over the baseline.
This indicates that the proposed sequence-level fine-tuning can be a cheap, effective way to improve existing pre-trained language models. We demonstrate this by fine-tuning a pre-trained GPT-2 (Radford et al., 2019) language model with sequence-level unlikelihood, using a comparable experimental setup to §6 (details in Appendix C). Fine-tuning with unlikelihood yielded similar improvements in sequence-level repetition (seq-rep-4 .042 vs. .506) to those observed in Table 5, while maintaining language modeling quality according to perplexity and accuracy (see Appendix Table 7).
Stochastic Decoding Although we have focused on deterministic decoding, we also confirm that a model trained with the proposed unlikelihood objectives may still be used with stochastic decoders. Appendix Table 6 shows metrics for completions generated with top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019). Models trained with unlikelihood objectives maintain language modeling quality compared to the baseline, but with improvements in repetition.
7 CONCLUSION

We described unlikelihood training, an approach to training neural language models. We observed that state-of-the-art models trained to maximize likelihood exhibit neural text degeneration, which we characterized and quantified in terms of repetition and token distribution mismatch.
REFERENCES
Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002). Association for Computational Linguistics.
Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning, 75(3):297–325.
Tianxing He and James Glass. 2019. Negative training for neural dialogue response generation. arXiv preprint arXiv:1903.02134.
Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. 2018. Learning to write with cooperative discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1638–1649. Association for Computational Linguistics.
Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562.
Margaret Li, Jason Weston, and Stephen Roller. 2019. ACUTE-EVAL: Improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv preprint arXiv:1909.03087.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732.
Jesse Vig. 2018. Deconstructing BERT: Distilling 6 patterns from 100 million parameters. Medium.
Ashwin K Vijayakumar, Michael Cogswell, Ramprasaath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. In Thirty-Second AAAI Conference on Artificial Intelligence.
True Next-Token (i = i∗)

Consider a single prediction step with one negative candidate x_neg, and write the negated per-step loss as −L = log p∗ + α log(1 − p_neg), where p = softmax(a), p∗ = p_θ(x∗ | x_{<t}), and p_neg = p_θ(x_neg | x_{<t}). Differentiating through the softmax, the logit of the true token receives

∂(−L)/∂a_{i∗} = (1 − p∗) + α · p∗ · p_neg / (1 − p_neg)   (13)

= 1 − p∗ (1 − α · p_neg / (1 − p_neg)).   (14)

Negative Candidate (i = i_neg)

Applying the chain rule through p̃_i = p_θ(x_i | x_{<t}) at the negative candidate's logit,

∂(−L)/∂a_{i_neg} = −p_neg − α · p_neg (1 − p_neg) / (1 − p_neg)   (17)

= −(1 + α) p_neg.   (18)

Combining both cases over the vocabulary, the gradient with respect to the logits a is

∇_a(−L) = x∗ − m ⊙ p,   (19)

where x∗ is the one-hot vector of the ground-truth token, m_{i_neg} = 1 + α, and m_i = 1 − α · p_neg / (1 − p_neg) for i ≠ i_neg. For a set of negative candidates C^t the per-candidate terms accumulate:

m_i = 1 − α Σ_{c∈C^t} p_c / (1 − p_c),   for i ∉ C^t,   (20)

m_i = 1 − α ( Σ_{c∈C^t} p_c / (1 − p_c) − 1 / (1 − p_i) ),   for i ∈ C^t.   (21)

When the unlikelihood term is instead normalized by the number of candidates,

−L^t_UL-token(p_θ(· | x_{<t}), C^t) = (1 / |C^t|) Σ_{c∈C^t} log(1 − p_θ(c | x_{<t})),   (22)

α above is scaled by 1 / |C^t|.
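The closed form (19) can be checked numerically against autograd; a small sketch with α = 1 and a single negative candidate (the vocabulary size and indices are illustrative):

```python
import torch

V, i_true, i_neg = 10, 3, 7                              # illustrative vocabulary and indices
a = torch.randn(V, requires_grad=True)
p = torch.softmax(a, dim=0)

neg_L = torch.log(p[i_true]) + torch.log(1 - p[i_neg])   # negated loss, alpha = 1
neg_L.backward()                                         # a.grad = grad of -L w.r.t. logits

with torch.no_grad():
    m = (1 - p[i_neg] / (1 - p[i_neg])) * torch.ones(V)  # m_i for i != i_neg
    m[i_neg] = 2.0                                       # 1 + alpha with alpha = 1
    x_star = torch.zeros(V)
    x_star[i_true] = 1.0
    closed_form = x_star - m * p                         # eq. (19)

print(torch.allclose(a.grad, closed_form, atol=1e-6))    # True: autograd matches eq. (19)
```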
Table 5: Results for token-level objectives (upper) and sequence-level fine-tuning (lower) according to sequence-level (left) and token-level (right) metrics using the validation subset of Wikitext-103.
Table 6: Results for stochastic decoding (top-k and nucleus sampling) according to sequence-level (left) and token-level (right) metrics on the Wikitext-103 test set.

Search | Model | seq-rep-4 | uniq-seq | ppl | acc | rep | wrep | uniq
---|---|---|---|---|---|---|---|---
top-k-3 | LMLE | .0991 | 14.7k | 25.70 | .350 | .597 | .355 | 12.6k
 | LUL-token | .0491 | 16.4k | 27.02 | .344 | .539 | .306 | 13.6k
 | LUL-seq | .0068 | 17.9k | 25.11 | .353 | .581 | .341 | 13.6k
 | LUL-token+seq | .0087 | 15.2k | 26.84 | .347 | .524 | .292 | 14.6k
top-k-50 | LMLE | .0165 | 21.9k | 25.70 | .302 | .511 | .303 | 16.1k
 | LUL-token | .006 | 23.5k | 27.02 | .286 | .440 | .247 | 17.8k
 | LUL-seq | .0005 | 25.7k | 25.11 | .291 | .497 | .291 | 17.3k
 | LUL-token+seq | .0009 | 23.7k | 26.84 | .289 | .430 | .238 | 18.8k
top-p-0.3 | LMLE | .273 | 13.6k | 25.70 | .264 | .339 | .154 | 12.6k
 | LUL-token | .101 | 16.5k | 27.02 | .247 | .290 | .121 | 13.9k
 | LUL-seq | .0033 | 20.8k | 25.11 | .266 | .327 | .145 | 13.6k
 | LUL-token+seq | .0041 | 19.1k | 26.84 | .250 | .284 | .116 | 14.9k
top-p-0.9 | LMLE | .0154 | 26.9k | 25.70 | .288 | .462 | .263 | 18.6k
 | LUL-token | .004 | 30.2k | 27.02 | .266 | .381 | .202 | 22.3k
 | LUL-seq | .0003 | 34.7k | 25.11 | .290 | .450 | .254 | 19.6k
 | LUL-token+seq | .0007 | 32.4k | 26.84 | .269 | .376 | .198 | 22.7k
Human | - | .006 | 19.8k | - | - | .487 | - | 19.8k
Table 6 provides automatic metrics for top-k and nucleus sampling (called top-p) on the Wikitext-103 test set. These can be compared with the main results of the paper in Table 2. In general, sampling methods yield worse next-token predictions than deterministic approaches (0.302 vs. 0.394 acc for top-k-50 vs. greedy MLE, where acc for stochastic decoding measures the probability that the decoding strategy chooses the ground-truth word given a ground-truth context). As the choice of sampling hyperparameter gets closer to greedy (i.e. lower values of k and p), next-token accuracy improves, eventually approaching the greedy MLE results. The unlikelihood-trained sampling models have similar next-token accuracy (acc) to their likelihood-trained counterparts, but exhibit fewer repetitions. For lower values of p and k the improvements of unlikelihood training are larger, e.g. 0.277 reduced to 0.0041 for 4-gram sequence repetitions (seq-rep-4) using top-p-0.3. At higher values of p and k, the continuations of all methods contain more unique tokens than human text, meaning those values may be too high.
 | Search | seq-rep-4 | uniq-seq | ppl | acc | rep | wrep | uniq
---|---|---|---|---|---|---|---|---
- | greedy | .429 | 10.6k | 24.590 | .401 | .619 | .346 | 11.6k
- | beam | .495 | 9.4k | | | | |
0.1 | greedy | .253 | 9.9k | 24.329 | .404 | .602 | .330 | 12.3k
0.1 | beam | .274 | 13.1k | | | | |
0.9 | greedy | .434 | 5.3k | 26.519 | .399 | .600 | .330 | 12.2k
0.9 | beam | .231 | 13.5k | | | | |
LUL-tok+seq 0.1 | greedy | .116 | 12.5k | 25.518 | .399 | .551 | .287 | 13.2k
LUL-tok+seq 0.1 | beam | .146 | 14.2k | | | | |
LUL-tok+seq 0.9 | greedy | .423 | 6.7k | 26.629 | .396 | .551 | .288 | 13.2k
LUL-tok+seq 0.9 | beam | .080 | 16k | | | | |
- | - | .005 | 18.9k | - | - | .479 | - | 18.9k
E.1 HUMAN EVALUATION INTERFACE
Figure 2: Screen shot of the user interface used in the human evaluation.
E.2 CROWDWORKER QUALITY CONTROLS
E.2.1 QUALITY CONTROL 1
Prompt = = In the decades since its release , The Hustler has cemented its reputation as a classic . Roger Ebert , echoing earlier praise for the performances , direction , and cinematography and adding laurels for editor Dede Allen , cites the film as " one of
Correct answer drive Senator John Quincy Adams from the Federalist Party over his support of Thomas Jefferson ’s foreign policy . The legislature elected Adams ’ successor nine months early , and gave Adams sufficiently distasteful instructions that he resigned the post and joined with the Republicans . = = Governor = = Gore led the Federalists to victory in 1809 against Sullivan ’s successor , Levi Lincoln , Sr. , who had taken over as acting governor upon Sullivan ’s death late in 1808 . During Gore ’s term the principal domestic issue occupying state politics
Incorrect answer prevent the American Revolutionary War from being fought by the British , and to prevent the British from using the country to launch a war against the British . Gore ’s actions in the House of Representatives were a major turning point in his political career . He was elected to the House of Representatives in 1811 , and served until his death in 1815 . = = Early life and education = = </s> </s> Gore was born in Boston , Massachusetts , on February 22 , 1798 , the son of Benjamin Gore and his