Associative Memories: Transformer Memorization & Performance Dynamics

Too Long; Didn't Read

Empirical studies on large language models have shown that the larger they are, the more they tend to memorize training data.

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Model and 3.1 Associative memories

3.2 Transformer blocks

4 A New Energy Function

4.1 The layered structure

5 Cross-Entropy Loss

6 Empirical Results and 6.1 Empirical evaluation of the radius

6.2 Training GPT-2

6.3 Training Vanilla Transformers

7 Conclusion and Acknowledgments

Appendix A. Deferred Tables

Appendix B. Some Properties of the Energy Functions

Appendix C. Deferred Proofs from Section 5

Appendix D. Transformer Details: Using GPT-2 as an Example

References

3 Model

3.1 Associative memories

Observation 1. The models tend to memorize the patterns of the training data.

Authors:

(1) Xueyan Niu, Theory Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd.;

(2) Bo Bai ([email protected]);

(3) Lei Deng ([email protected]);

(4) Wei Han ([email protected]).

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.
