A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou*, Junyi Li*, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen

arXiv:2303.18223v11 [cs.CL] 29 Jun 2023. Version: v11 (major update on June 29, 2023). GitHub link: https://
* K. Zhou and J. Li contribute equally to this work. The authors are mainly with Gaoling School of Artificial Intelligence and School of Information, Renmin University of China, Beijing, China; Jian-Yun Nie is with DIRO, Université de Montréal, Canada. Contact e-mail:
Abstract: Ever since the Turing Test was proposed in the 1950s, humans have explored the mastering of language intelligence by machines. Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a significant challenge to develop capable artificial intelligence (AI) algorithms for comprehending and grasping a language. As a major approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP) tasks. Since researchers have found that model scaling can lead to an improved model capacity, they have further investigated the scaling effect by increasing the parameter scale to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement but also exhibit some special abilities (e.g., in-context learning) that are not present in small-scale language models (e.g., BERT). To discriminate between language models of different parameter scales, the research community has coined the term large language models (LLM) for PLMs of significant size (e.g., containing tens or hundreds of billions of parameters). Recently, research on LLMs has been largely advanced by both academia and industry, and a remarkable milestone is the launch of ChatGPT (a powerful AI chatbot developed based on LLMs), which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community, and it would revolutionize the way we develop and use AI algorithms. Considering this rapid technical progress, in this survey we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular, we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Furthermore, we also summarize the available resources for developing LLMs and discuss the remaining issues for future directions. This survey provides an up-to-date review of the literature on LLMs, which can be a useful resource for both researchers and engineers.

Index Terms: Large Language Models; Emergent Abilities; Adaptation Tuning; Utilization; Alignment; Capacity Evaluation

1 INTRODUCTION

"The limits of my language mean the limits of my world." (Ludwig Wittgenstein)
Language is a prominent ability of human beings to express and communicate, which develops in early childhood and evolves over a lifetime [1, 2]. Machines, however, cannot naturally grasp the abilities of understanding and communicating in the form of human language, unless equipped with powerful artificial intelligence (AI) algorithms. It has been a longstanding research challenge to achieve this goal, i.e., to enable machines to read, write, and communicate like humans [3].
13、he research of LM has receivedextensive attention in the literature,which can be dividedinto four major development stages:Statistical language models(SLM).SLMs 47 are de-Version:v11(major update on June 29,2023).GitHub link:https:/ and J.Li contribute equally to this work.The authors are mainly wit
14、h Gaoling School of Artifi cial Intelligence andSchool of Information,Renmin University of China,Beijing,China;Jian-Yun Nie is with DIRO,Universit e de Montr eal,Canada.Contact e-mail:veloped based on statistical learning methods that rose inthe 1990s.The basic idea is to build the word predictionmo
15、del based on the Markov assumption,e.g.,predicting thenext word based on the most recent context.The SLMs witha fixed context lengthnare also calledn-gram languagemodels,e.g.,bigram and trigram language models.SLMshave been widely applied to enhance task performancein information retrieval(IR)8,9 an
16、d natural languageprocessing(NLP)1012.However,they often suffer fromthe curse of dimensionality:it is difficult to accuratelyestimate high-order language models since an exponentialnumber of transition probabilities need to be estimated.Thus,specially designed smoothing strategies such as back-off e
17、stimation 13 and GoodTuring estimation 14 havebeen introduced to alleviate the data sparsity problem.Neural language models(NLM).NLMs 1517 character-ize the probability of word sequences by neural networks,e.g.,recurrent neural networks(RNNs).As a remarkablecontribution,the work in 15 introduced the
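To make this concrete, the equations below sketch the standard formulation (the notation w_1, ..., w_T and count(·) is ours, used only for illustration): an LM factorizes the joint probability of a word sequence by the chain rule, an n-gram SLM truncates each conditioning context to the last n-1 words under the Markov assumption, and a bigram model estimates the resulting conditionals from corpus counts:

\begin{align}
P(w_1, \ldots, w_T) &= \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1}) \\
&\approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \ldots, w_{t-1}), \\
P(w_t \mid w_{t-1}) &\approx \frac{\mathrm{count}(w_{t-1}, w_t)}{\mathrm{count}(w_{t-1})} \quad \text{(bigram, maximum-likelihood estimate)}.
\end{align}

Because most higher-order count terms are zero in any finite corpus, the maximum-likelihood estimates above are unreliable for rare contexts, which is exactly the data sparsity problem that back-off and Good-Turing smoothing address.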
Neural language models (NLM). NLMs [15-17] characterize the probability of word sequences by neural networks, e.g., recurrent neural networks (RNNs). As a remarkable contribution, the work in [15] introduced the concept of distributed representation of words and built the word prediction function conditioned on the aggregated context features (i.e., the distributed word vectors). By extending the idea of learning effective features for words or sentences, a general neural network approach was developed to build a unified solution for various NLP tasks [18]. Further, word2vec [19, 20] was proposed to build a simplified shallow neural network for learning distributed word representations, which were demonstrated to be very effective across a variety of NLP tasks. These studies have initiated the use of language models for representation learning (beyond word sequence modeling), having an important impact on the field of NLP.

Fig. 1: The trends of the cumulative numbers of arXiv papers that contain the keyphrases "language model" (panel (a), since June 2018) and "large language model" (panel (b), since October 2019), respectively. The statistics are calculated using exact match by querying the keyphrases in title or abstract by months. We set different x-axis ranges for the two keyphrases, because "language model" has been explored at an earlier time. We label the points corresponding to important landmarks in the research progress of LLMs. A sharp increase occurs after the release of ChatGPT: the average number of published arXiv papers that contain "large language model" in title or abstract goes from 0.40 per day to 8.58 per day (Figure 1(b)).
Pre-trained language models (PLM). As an early attempt, ELMo [21] was proposed to capture context-aware word representations by first pre-training a bidirectional LSTM (biLSTM) network (instead of learning fixed word representations) and then fine-tuning the biLSTM network according to specific downstream tasks. Further, based on the highly parallelizable Transformer architecture [22] with self-attention mechanisms, BERT [23] was proposed by pre-training bidirectional language models with specially designed pre-training tasks on large-scale unlabeled corpora. These pre-trained context-aware word representations are very effective as general-purpose semantic features, which have largely raised the performance bar of NLP tasks. This line of work has inspired a large number of follow-up studies, which established the "pre-training and fine-tuning" learning paradigm. Following this paradigm, a great number of studies on PLMs have been developed, introducing either different architectures [24, 25] (e.g., GPT-2 [26] and BART [24]) or improved pre-training strategies [27-29]. In this paradigm, it is often necessary to fine-tune the PLM for adapting to different downstream tasks.
Large language models (LLM). Researchers find that scaling a PLM (e.g., scaling the model size or the data size) often leads to an improved model capacity on downstream tasks (i.e., following the scaling law [30]). A number of studies have explored the performance limit by training ever larger PLMs (e.g., the 175B-parameter GPT-3 and the 540B-parameter PaLM). Although scaling is mainly conducted in model size (with similar architectures and pre-training tasks), these large-sized PLMs display different behaviors from smaller PLMs (e.g., the 330M-parameter BERT and the 1.5B-parameter GPT-2) and show surprising abilities (called emergent abilities [31]) in solving a series of complex tasks. For example, GPT-3 can solve few-shot tasks through in-context learning, whereas GPT-2 cannot do well. Thus, the research community has coined the term "large language models (LLM)" for these large-sized PLMs [32-35], which attract increasing research attention (see Figure 1). A remarkable application of LLMs is ChatGPT, which adapts the LLMs from the GPT series for dialogue and presents an amazing conversation ability with humans. We can observe a sharp increase of the arXiv papers related to LLMs after the release of ChatGPT in Figure 1.
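To give a sense of what in-context learning looks like, the snippet below builds a few-shot prompt for a toy sentiment-classification task (the task, demonstrations, and prompt format are hypothetical and purely illustrative): the demonstrations are provided as part of the input, and the model is expected to continue the pattern.

# A minimal sketch of a few-shot (in-context learning) prompt.
# The task and demonstrations are illustrative, not taken from the survey.
demonstrations = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I regret spending money on this phone.", "negative"),
]
query = "The soup was cold and the service was slow."

# Assemble the prompt: instruction, demonstrations, then the new query.
prompt_lines = ["Classify the sentiment of each review as positive or negative.", ""]
for text, label in demonstrations:
    prompt_lines.append(f"Review: {text}")
    prompt_lines.append(f"Sentiment: {label}")
    prompt_lines.append("")
prompt_lines.append(f"Review: {query}")
prompt_lines.append("Sentiment:")  # the model is expected to continue with the answer

prompt = "\n".join(prompt_lines)
print(prompt)

No gradient update is involved; the demonstrations steer the model purely through the conditioning text, which is the behavior that emerges at scale (e.g., in GPT-3) but not in smaller models such as GPT-2.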
In the existing literature, PLMs have been widely discussed and surveyed [36-39], while LLMs are seldom reviewed in a systematic way. To motivate our survey, we first highlight three major differences between LLMs and PLMs. First, LLMs display some surprising emergent abilities that may not be observed in previous smaller PLMs. These abilities are key to the performance of language models on complex tasks, making AI algorithms unprecedentedly powerful and effective. Second, LLMs would revolutionize the way that humans develop and use AI algorithms. Unlike small PLMs, the major approach to accessing LLMs is through the prompting interface (e.g., the GPT-4 API), as sketched below. Humans have to understand how LLMs work and format their tasks in a way that LLMs can follow. Third, the development of LLMs no longer draws a clear distinction between research and engineering. The training of LLMs requires extensive practical experience in large-scale data processing and distributed parallel training. To develop capable LLMs, researchers have to solve complicated engineering issues, working with engineers or being engineers.
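As a minimal sketch of this prompting-style access (assuming the OpenAI Python client with its v1-style chat-completions interface; the model name, task, and text are illustrative, and the exact client API may differ across versions):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
paragraph = "Large language models are Transformer language models trained on massive text corpora."

# The task is specified entirely through the prompt; no model parameters are updated.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Summarize the following text in one sentence:\n" + paragraph}],
)
print(response.choices[0].message.content)

In this access mode, improving results means improving the prompt (task description, demonstrations, output format) rather than retraining the model, which is why prompt design is discussed separately in Section 8.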
Nowadays, LLMs are exerting a significant impact on the AI community, and the advent of ChatGPT and GPT-4 leads to the rethinking of the possibilities of artificial general intelligence (AGI). OpenAI has published a technical article entitled "Planning for AGI and beyond", which discusses the short-term and long-term plans to approach AGI [40], and a more recent paper has argued that GPT-4 might be considered as an early version of an AGI system [41]. The research areas of AI are being revolutionized by the rapid progress of LLMs. In the field of NLP, LLMs can serve as a general-purpose language task solver (to some extent), and the research paradigm has been shifting towards the use of LLMs. In the field of IR, traditional search engines are challenged by the new information-seeking mode of AI chatbots (i.e., ChatGPT), and New Bing presents an initial attempt to enhance search results based on LLMs. In the field of CV, researchers are trying to develop ChatGPT-like vision-language models that can better serve multimodal dialogues [42-45], and GPT-4 [46] has supported multimodal input by integrating visual information. This new wave of technology could potentially lead to a prosperous ecosystem of real-world applications based on LLMs. For instance, Microsoft 365 is being empowered by LLMs (i.e., Copilot) to automate office work, and OpenAI supports the use of plugins in ChatGPT for implementing special functions.

1. Note that an LLM is not necessarily more capable than a small PLM, and emergent abilities may not occur in some LLMs.
2. https://
Despite the progress and impact, the underlying principles of LLMs are still not well explored. Firstly, it remains mysterious why emergent abilities occur in LLMs but not in smaller PLMs. As a more general issue, there lacks a deep, detailed investigation of the key factors that contribute to the superior abilities of LLMs. It is important to study when and how LLMs obtain such abilities [47]. Although there are some meaningful discussions about this problem [31, 47], more principled investigations are needed to uncover the "secrets" of LLMs. Secondly, it is difficult for the research community to train capable LLMs. Due to the huge demand for computation resources, it is very costly to carry out repetitive, ablating studies for investigating the effect of various strategies for training LLMs. Indeed, LLMs are mainly trained by industry, where many important training details (e.g., data collection and cleaning) are not revealed to the public. Thirdly, it is challenging to align LLMs with human values or preferences. Despite their capacities, LLMs are also likely to produce toxic, fictitious, or harmful content. It requires effective and efficient control approaches to eliminate the potential risks of using LLMs [46].
Faced with both opportunities and challenges, the research and development of LLMs deserve more attention. In order to provide a basic understanding of LLMs, this survey conducts a literature review of the recent advances in LLMs from four major aspects, including pre-training (how to pre-train a capable LLM), adaptation (how to effectively adapt pre-trained LLMs for better use), utilization (how to use LLMs for solving various downstream tasks), and capability evaluation (how to evaluate the abilities of LLMs and existing empirical findings). We thoroughly comb the literature and summarize the key findings, techniques, and methods of LLMs. For this survey, we also create a GitHub project website by collecting the supporting resources for LLMs, at the link https://

We are also aware of several related review articles on PLMs or LLMs [32, 36, 38, 39, 43, 48-54]. These papers either discuss PLMs or some specific (or general) aspects of LLMs. Compared with them, we focus on the techniques and methods to develop and use LLMs and provide a relatively comprehensive reference to important aspects of LLMs.

3. https://
The remainder of this survey is organized as follows: Section 2 introduces the background for LLMs and the evolution of GPT-series models, followed by the summarization of available resources for developing LLMs in Section 3. Sections 4, 5, 6, and 7 review and summarize the recent progress from the four aspects of pre-training, adaptation, utilization, and capacity evaluation, respectively. Then, Section 8 discusses the practical guide for prompt design, and Section 9 reviews the applications of LLMs in several representative domains. Finally, we conclude the survey in Section 10 by summarizing the major findings and discussing the remaining issues for future work.

2 OVERVIEW

In this section, we present an overview of the background of LLMs and then summarize the technical evolution of the GPT-series models.

2.1 Background for LLMs

Typically, large language models (LLMs) refer to Transformer language models that contain hundreds of billions (or more) of parameters, which are trained on massive text data [32], such as GPT-3 [55], PaLM [56], Galactica [35], and LLaMA [57]. LLMs exhibit strong capacities to understand natural language