Figure 1: Token frequency distributions of model continuations; unlikelihood training yields frequent tokens approaching the human level.
Maximum likelihood training has two issues: (i) it optimizes the log-probability assigned to the ground-truth token rather than its rank, paying relatively little attention to the top of the ranked next-token list; and (ii) it is not focused on optimizing sequence generation, only on producing the next token. The first issue means that greedy or beam search decoding, which rely on the top of the list to generate, are not optimized; there is a discrepancy between maximizing the log-probability of a ground-truth token and ensuring the rank of the ground-truth token to be one. The second issue means that during sequence generation, any imperfection in next-token prediction leads to error accumulation that is not addressed by likelihood training.
The second major approach is that of sampling from the model at generation time. Top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019) are two methods that sample sequences based on a function of the predicted next-token probability distribution given by the model. Both approaches vastly improve the repetition issue, as the randomization often reduces the number of duplicate tokens in a decoded sequence, even if highly scored paths under the model (represented by beam search candidates) contain repetitions. However, as the underlying model is unchanged, it often prefers semantically similar phrasing, depending on the temperature parameter of the sampling (Holtzman et al., 2019). Furthermore, this solution is less relevant in less open-ended tasks such as machine translation, where beam search variants are the preferred method. Ideally we would like a model that can work with both beam and sampling decoding methods.
Improved Learning Algorithms The proposed learning criteria are closely related to structured output prediction methods in which the goal is to increase the scores assigned by a model to true examples while decreasing those assigned to negative examples often generated by the model itself. Some representative algorithms include the structured perceptron (Collins, 2002), energy-based models (LeCun et al., 2006) and, more recently, reflective likelihood (Dieng et al., 2018). A particular variant in this family of algorithms, called negative training, was recently used by He and Glass (2019) for neural dialogue response generation.
likelihood of a finite set of samples D from p∗ by minimizing:

L_MLE(p_θ, D) = − Σ_{i=1}^{|D|} Σ_{t=1}^{|x^(i)|} log p_θ(x_t^(i) | x_{<t}^(i)).   (1)

At generation time, stochastic decoders sample each next token from a truncated distribution:

q(x_t | x_{<t}, p_θ) = p_θ(x_t | x_{<t}) / Z  if x_t ∈ U, and 0 otherwise,

where Z = Σ_{x∈U} p_θ(x | x_{<t}). The top-k sampler restricts sampling to the k most-probable tokens; i.e., U is the size-k subset of V which maximizes Σ_{x∈U} p_θ(x | x_{<t}) (Fan et al., 2018). The nucleus sampler instead restricts sampling to the smallest set of tokens with total mass above a threshold p ∈ [0, 1]; i.e., U is the smallest subset with Σ_{x∈U} p_θ(x | x_{<t}) ≥ p (Holtzman et al., 2019).
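To make the two truncation rules concrete, the sketch below implements both samplers over a next-token distribution. This is a minimal PyTorch illustration, not the code of Fan et al. (2018) or Holtzman et al. (2019); `probs` is assumed to be the model's distribution p_θ(· | x_{<t}) as a 1-D tensor over the vocabulary.

```python
import torch

def top_k_sample(probs: torch.Tensor, k: int) -> int:
    """Sample from the k most-probable tokens, renormalized by Z (Fan et al., 2018)."""
    topk_probs, topk_idx = probs.topk(k)        # U: the size-k subset with maximal mass
    topk_probs = topk_probs / topk_probs.sum()  # divide by Z = sum_{x in U} p(x | x_<t)
    choice = torch.multinomial(topk_probs, 1)
    return int(topk_idx[choice])

def nucleus_sample(probs: torch.Tensor, p: float) -> int:
    """Sample from the smallest set whose total mass reaches p (Holtzman et al., 2019)."""
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cum = sorted_probs.cumsum(dim=0)
    cutoff = int((cum < p).sum()) + 1           # smallest U with cumulative mass >= p
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(nucleus, 1)
    return int(sorted_idx[choice])
```

Both reduce to greedy decoding as k → 1 or p → 0, consistent with the accuracy trend for low k and p discussed in the appendix.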
Unlike previous work, which focused only on degenerate sequence-level repeats (Holtzman et al., 2019), we additionally observe that neural language models exhibit substantially more repetition in next-token prediction compared to human text.
5 THE UNLIKELIHOOD TRAINING OBJECTIVE
We now describe unlikelihood training for neural language models, then in Section 6 demonstrate empirically that our proposal substantially reduces neural text degeneration (§4).
5.1 SEQUENCE-LEVEL UNLIKELIHOOD TRAINING
While the token-level unlikelihood objective efficiently augments maximum likelihood training with token-level penalties, it is limited to prefixes drawn from the training distribution. The resulting distribution mismatch between training sequences and generated sequences is a known issue with maximum-likelihood training, motivating objectives that operate on model-generated sequences (Daumé et al., 2009; Ross et al., 2011; Ranzato et al., 2015; Yu et al., 2016).
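As a reference point for the sequence-level variant, here is a minimal sketch of the token-level objective referred to above (eq. (4)), with the candidate set C^t taken to be the previous context tokens minus the current target. It is an illustrative re-implementation, not the authors' released code, and the per-token normalization is a choice made here for readability.

```python
import torch
import torch.nn.functional as F

def ul_token_loss(logits: torch.Tensor, targets: torch.Tensor, alpha: float = 1.0):
    """Token-level unlikelihood loss for one sequence.

    logits:  (T, V) next-token logits; targets: (T,) ground-truth token ids.
    Adds -log(1 - p_theta(c | x_<t)) penalties over candidates c in C^t to the
    usual MLE term -log p_theta(x_t | x_<t).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    mle = F.nll_loss(log_probs, targets)               # likelihood term, as in eq. (1)

    probs = log_probs.exp()
    T, V = logits.shape
    cand = torch.zeros(T, V, dtype=torch.bool)         # C^t = {x_1..x_{t-1}} \ {x_t}
    for t in range(1, T):
        cand[t, targets[:t]] = True
        cand[t, targets[t]] = False
    one_minus_p = (1.0 - probs).clamp_min(1e-5)        # clamp for numerical stability
    ul = -(torch.log(one_minus_p) * cand).sum() / T    # unlikelihood penalty, averaged per token
    return mle + alpha * ul
```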
In our experiments we apply this sequence loss in two ways: (i) using it to fine-tune a standard MLE baseline; and (ii) using it to fine-tune an unlikelihood model trained at the token level, LUL-token. We refer to the former as LUL-seq and the latter as LUL-token+seq. In both cases, fine-tuning is done by equally mixing sequence-level unlikelihood updates (7) and the token-level loss with which the model was initially trained (either likelihood updates (1) or token-level unlikelihood updates (4)).
Efficiency Any objective that requires explicitly decoding a sequence is constrained by sample efficiency when decoding is slow; if sample efficiency is low, the total decoding time is too large for practical use. In our experiments we show that when used for fine-tuning, the sequence-level unlikelihood objective substantially reduced degeneration in under 1,500 updates, rendering it practical for modern large-scale neural models, even with high decoding costs.
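To make the fine-tuning recipe concrete, the sketch below penalizes tokens that fall inside repeated n-grams of a decoded continuation and mixes such updates equally with the token-level loss, as described above. It is a simplified reading of the procedure, not the authors' implementation; `decode_continuation` and `token_loss` are hypothetical helpers standing in for the model's decoding routine and the original token-level objective, and n = 4 is an illustrative choice.

```python
import random
import torch

def repeated_ngram_mask(tokens, n: int = 4) -> torch.Tensor:
    """True at every position belonging to an n-gram already seen in the continuation."""
    seen, mask = set(), torch.zeros(len(tokens), dtype=torch.bool)
    for t in range(len(tokens) - n + 1):
        ngram = tuple(tokens[t:t + n])
        if ngram in seen:
            mask[t:t + n] = True   # mark every token of the repeated n-gram
        seen.add(ngram)
    return mask

def ul_sequence_loss(step_probs: torch.Tensor, tokens, n: int = 4) -> torch.Tensor:
    """Sequence-level unlikelihood: -log(1 - p) on tokens inside repeated n-grams.

    step_probs: (T,) probability the model assigned to each decoded token.
    """
    mask = repeated_ngram_mask(tokens, n)
    penalized = (1.0 - step_probs[mask]).clamp_min(1e-5)
    return -torch.log(penalized).sum()

def finetune_step(model, batch, optimizer):
    """Equally mix sequence-level unlikelihood updates with the token-level loss."""
    if random.random() < 0.5:
        probs, tokens = decode_continuation(model, batch)  # hypothetical decoding helper
        loss = ul_sequence_loss(probs, tokens)             # sequence-level update (7)
    else:
        loss = token_loss(model, batch)                    # hypothetical: eq. (1) or eq. (4)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```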
6 EXPERIMENTS
6.1 EVALUATION METRICS

To quantify repetition in next-token prediction, we use the fraction of next-token (top-1) predictions that occur in the previous ℓ tokens (rep/ℓ):

rep/ℓ = (1 / (|D| T)) Σ_{x∈D} Σ_{t=1}^{T} I[ argmax p_θ(x̂ | x_{<t}) ∈ x_{t−ℓ:t−1} ].   (9)

A predicted token is called a "single-token repeat" when I[·] is 1. Some of these single-token repeats also occur in the human-generated sequences, and we thus report a variant which only counts single-token repeats that are additionally not equal to the ground-truth next token (wrep).
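In code, rep/ℓ and the wrep variant reduce to a short loop; a plain-Python sketch assuming aligned lists of greedy next-token predictions and ground-truth tokens:

```python
def rep_metrics(predictions, targets, ell: int):
    """rep/ell: fraction of next-token predictions found in the previous ell tokens.

    wrep counts only single-token repeats that also differ from the ground truth.
    """
    rep = wrep = 0
    for t, pred in enumerate(predictions):
        if pred in targets[max(0, t - ell):t]:   # I[.] = 1: a single-token repeat
            rep += 1
            if pred != targets[t]:
                wrep += 1
    total = len(predictions)
    return rep / total, wrep / total
```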
Table 2: Results for token-level objectives (upper) and sequence-level fine-tuning (lower) according to sequence-level (left) and token-level (right) metrics using the test subset of Wikitext-103.
For sequence-level evaluation, we use the portion of duplicate n-grams in a continuation x_{k+1:k+N} (seq-rep-n):

seq-rep-n = 1.0 − |unique n-grams(x_{k+1:k+N})| / |n-grams(x_{k+1:k+N})|.   (10)
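Eq. (10) transcribes directly; a plain-Python sketch where `continuation` is the token list x_{k+1:k+N}:

```python
def seq_rep_n(continuation, n: int = 4) -> float:
    """Portion of duplicate n-grams in a continuation (eq. 10); 0.0 means no repeats."""
    ngrams = [tuple(continuation[i:i + n]) for i in range(len(continuation) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

For example, `seq_rep_n("the cat sat the cat sat".split(), n=3)` returns 0.25, since one of the four 3-grams is a duplicate.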
6.2 RESULTS
In next-token prediction, token-level unlikelihood training reduced repetition while using more unique tokens (uniq 12.7k vs. 11.8k) compared to the baseline (LMLE). Perplexity and accuracy were similar.
Importantly, the token-level unlikelihood objective yielded substantial improvements in sequence-level generations. With greedy search, token-level unlikelihood training improved the 4-gram repetition in continuations by 36% (seq-rep-4 .283 vs. .442) while generating roughly 22% more unique tokens than the baseline (uniq-seq 13.2k vs. 10.8k), and a more favorable rate of infrequent tokens (Figure 1). With beam search, unlikelihood training showed similar improvements over the baseline.
This indicates that the proposed sequence-level fine-tuning can be a cheap, effective way to improve existing pre-trained language models. We demonstrate this by fine-tuning a pre-trained GPT-2 (Radford et al., 2019) language model with sequence-level unlikelihood, using a comparable experimental setup to §6 (details in Appendix C). Fine-tuning with unlikelihood yielded similar improvements in sequence-level repetition (seq-rep-4 .042 vs. .506) to those observed in Table 5, while maintaining language modeling quality according to perplexity and accuracy (see Appendix Table 7).
Stochastic Decoding Although we have focused on deterministic decoding, we also confirm that a model trained with the proposed unlikelihood objectives may still be used with stochastic decoders. Appendix Table 6 shows metrics for completions generated with top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019). Models trained with unlikelihood objectives maintain language modeling quality compared to the baseline, but with improvements in repetition.
7 CONCLUSION

We described unlikelihood training, an approach to training neural language models. We observed that state-of-the-art models trained to maximize likelihood exhibit neural text degeneration, which we characterized and quantified in terms of repetition and token distribution mismatch.
REFERENCES
Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002). Association for Computational Linguistics.
Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning, 75(3):297–325.
Tianxing He and James Glass. 2019. Negative training for neural dialogue response generation. arXiv preprint arXiv:1903.02134.
Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. 2018. Learning to write with cooperative discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1638–1649. Association for Computational Linguistics.
Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562.
Margaret Li, Jason Weston, and Stephen Roller. 2019. ACUTE-EVAL: Improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv preprint arXiv:1909.03087.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732.
Jesse Vig. 2018. Deconstructing BERT: Distilling 6 patterns from 100 million parameters. Medium.
Ashwin K Vijayakumar, Michael Cogswell, Ramprasaath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. In Thirty-Second AAAI Conference on Artificial Intelligence.
True Next-Token (i = i∗)

Consider a single prediction step with one negative candidate x_neg, and write the negated per-step loss as −L = log p∗ + α log(1 − p_neg), where p = softmax(a), p∗ = p_θ(x∗ | x_{<t}), and p_neg = p_θ(x_neg | x_{<t}). Differentiating through the softmax, the logit of the true token receives

∂(−L)/∂a_{i∗} = (1 − p∗) + α · p∗ · p_neg / (1 − p_neg)   (13)

= 1 − p∗ (1 − α · p_neg / (1 − p_neg)).   (14)

Negative Candidate (i = i_neg)

Applying the chain rule through p̃_i = p_θ(x_i | x_{<t}) at the negative candidate's logit,

∂(−L)/∂a_{i_neg} = −p_neg − α · p_neg (1 − p_neg) / (1 − p_neg)   (17)

= −(1 + α) p_neg.   (18)

Combining both cases over the vocabulary, the gradient with respect to the logits a is

∇_a(−L) = x∗ − m ⊙ p,   (19)

where x∗ is the one-hot vector of the ground-truth token, m_{i_neg} = 1 + α, and m_i = 1 − α · p_neg / (1 − p_neg) for i ≠ i_neg. For a set of negative candidates C^t the per-candidate terms accumulate:

m_i = 1 − α Σ_{c∈C^t} p_c / (1 − p_c),   for i ∉ C^t,   (20)

m_i = 1 − α ( Σ_{c∈C^t} p_c / (1 − p_c) − 1 / (1 − p_i) ),   for i ∈ C^t.   (21)

When the unlikelihood term is instead normalized by the number of candidates,

−L^t_UL-token(p_θ(· | x_{<t}), C^t) = (1 / |C^t|) Σ_{c∈C^t} log(1 − p_θ(c | x_{<t})),   (22)

α above is scaled by 1 / |C^t|.
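The closed form (19) can be checked numerically against autograd; a small sketch with α = 1 and a single negative candidate (the vocabulary size and indices are illustrative):

```python
import torch

V, i_true, i_neg = 10, 3, 7                              # illustrative vocabulary and indices
a = torch.randn(V, requires_grad=True)
p = torch.softmax(a, dim=0)

neg_L = torch.log(p[i_true]) + torch.log(1 - p[i_neg])   # negated loss, alpha = 1
neg_L.backward()                                         # a.grad = grad of -L w.r.t. logits

with torch.no_grad():
    m = (1 - p[i_neg] / (1 - p[i_neg])) * torch.ones(V)  # m_i for i != i_neg
    m[i_neg] = 2.0                                       # 1 + alpha with alpha = 1
    x_star = torch.zeros(V)
    x_star[i_true] = 1.0
    closed_form = x_star - m * p                         # eq. (19)

print(torch.allclose(a.grad, closed_form, atol=1e-6))    # True: autograd matches eq. (19)
```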
Table 5: Results for token-level objectives (upper) and sequence-level fine-tuning (lower) according to sequence-level (left) and token-level (right) metrics using the validation subset of Wikitext-103.
Table 6: Results for stochastic decoding (top-k and nucleus sampling) according to sequence-level (left) and token-level (right) metrics on the Wikitext-103 test set.

Search | Model | seq-rep-4 | uniq-seq | ppl | acc | rep | wrep | uniq
---|---|---|---|---|---|---|---|---
top-k-3 | LMLE | .0991 | 14.7k | 25.70 | .350 | .597 | .355 | 12.6k
 | LUL-token | .0491 | 16.4k | 27.02 | .344 | .539 | .306 | 13.6k
 | LUL-seq | .0068 | 17.9k | 25.11 | .353 | .581 | .341 | 13.6k
 | LUL-token+seq | .0087 | 15.2k | 26.84 | .347 | .524 | .292 | 14.6k
top-k-50 | LMLE | .0165 | 21.9k | 25.70 | .302 | .511 | .303 | 16.1k
 | LUL-token | .006 | 23.5k | 27.02 | .286 | .440 | .247 | 17.8k
 | LUL-seq | .0005 | 25.7k | 25.11 | .291 | .497 | .291 | 17.3k
 | LUL-token+seq | .0009 | 23.7k | 26.84 | .289 | .430 | .238 | 18.8k
top-p-0.3 | LMLE | .273 | 13.6k | 25.70 | .264 | .339 | .154 | 12.6k
 | LUL-token | .101 | 16.5k | 27.02 | .247 | .290 | .121 | 13.9k
 | LUL-seq | .0033 | 20.8k | 25.11 | .266 | .327 | .145 | 13.6k
 | LUL-token+seq | .0041 | 19.1k | 26.84 | .250 | .284 | .116 | 14.9k
top-p-0.9 | LMLE | .0154 | 26.9k | 25.70 | .288 | .462 | .263 | 18.6k
 | LUL-token | .004 | 30.2k | 27.02 | .266 | .381 | .202 | 22.3k
 | LUL-seq | .0003 | 34.7k | 25.11 | .290 | .450 | .254 | 19.6k
 | LUL-token+seq | .0007 | 32.4k | 26.84 | .269 | .376 | .198 | 22.7k
Human | - | .006 | 19.8k | - | - | .487 | - | 19.8k
Table 6 provides automatic metrics for top-k and nucleus sampling (called top-p) on the Wikitext-103 test set. These can be compared with the main results of the paper in Table 2. In general, sampling methods yield worse next-token predictions than deterministic approaches (0.302 vs. 0.394 acc for top-k-50 vs. greedy MLE, where acc for stochastic decoding measures the probability that the decoding strategy chooses the ground-truth word given a ground-truth context). As the choice of sampling hyperparameter gets closer to greedy (i.e. lower values of k and p), next-token accuracy improves, eventually approaching the greedy MLE results. The unlikelihood-trained sampling models have similar next-token accuracy (acc) to their likelihood-trained counterparts, but exhibit fewer repetitions. For lower values of p and k the improvements of unlikelihood training are larger, e.g. 0.277 reduced to 0.0041 for 4-gram sequence repetitions (seq-rep-4) using top-p-0.3. At higher values of p and k, the continuations of all methods contain more unique tokens than human text, meaning those values may be too high.
 | Search | seq-rep-4 | uniq-seq | ppl | acc | rep | wrep | uniq
---|---|---|---|---|---|---|---|---
- | greedy | .429 | 10.6k | 24.590 | .401 | .619 | .346 | 11.6k
- | beam | .495 | 9.4k | | | | |
0.1 | greedy | .253 | 9.9k | 24.329 | .404 | .602 | .330 | 12.3k
0.1 | beam | .274 | 13.1k | | | | |
0.9 | greedy | .434 | 5.3k | 26.519 | .399 | .600 | .330 | 12.2k
0.9 | beam | .231 | 13.5k | | | | |
LUL-tok+seq 0.1 | greedy | .116 | 12.5k | 25.518 | .399 | .551 | .287 | 13.2k
LUL-tok+seq 0.1 | beam | .146 | 14.2k | | | | |
LUL-tok+seq 0.9 | greedy | .423 | 6.7k | 26.629 | .396 | .551 | .288 | 13.2k
LUL-tok+seq 0.9 | beam | .080 | 16k | | | | |
- | - | .005 | 18.9k | - | - | .479 | - | 18.9k
E.1 HUMAN EVALUATION INTERFACE
Figure 2: Screen shot of the user interface used in the human evaluation.
E.2 CROWDWORKER QUALITY CONTROLS
E.2.1 QUALITY CONTROL 1
Prompt = = In the decades since its release , The Hustler has cemented its reputation as a classic . Roger Ebert , echoing earlier praise for the performances , direction , and cinematography and adding laurels for editor Dede Allen , cites the film as " one of
Correct answer drive Senator John Quincy Adams from the Federalist Party over his support of Thomas Jefferson ’s foreign policy . The legislature elected Adams ’ successor nine months early , and gave Adams sufficiently distasteful instructions that he resigned the post and joined with the Republicans . = = Governor = = Gore led the Federalists to victory in 1809 against Sullivan ’s successor , Levi Lincoln , Sr. , who had taken over as acting governor upon Sullivan ’s death late in 1808 . During Gore ’s term the principal domestic issue occupying state politics
Incorrect answer prevent the American Revolutionary War from being fought by the British , and to prevent the British from using the country to launch a war against the British . Gore ’s actions in the House of Representatives were a major turning point in his political career . He was elected to the House of Representatives in 1811 , and served until his death in 1815 . = = Early life and education = = </s> </s> Gore was born in Boston , Massachusetts , on February 22 , 1798 , the son of Benjamin Gore and his