Cross Language Information Retrieval

2023/3/15

Road Map
- Cross-lingual IR
  - Motivation
  - Definition
- General issues with CLIR
- Basic approaches to CLIR
- CLIR evaluation
- CLIR applications

Information Retrieval
- Single language: both the user's query and the documents to be searched are in the same language.
- Cross language: the documents are written in a language different from the language of the user's query.
[Figure labels: documents, query]

Growth rate of Internet language use across world regions, 2000-2010 (data updated June 30, 2010)

The Internet Big Picture

World Region        Population       Internet Users   Penetration      Users %    Growth
                                                      (% population)   of Table   2000-2015
Africa              1,158,355,663      313,257,074    27.0%             9.6%      6,839%
Asia                4,032,466,882    1,563,208,143    38.8%            47.8%      1,268%
Europe                821,555,904      604,122,380    73.5%            18.5%        475%
Middle East           236,137,235      115,823,882    49.0%             3.5%      3,426%
North America         357,172,209      313,862,863    87.9%             9.6%        191%
Latin America         617,776,105      333,115,908    53.9%            10.2%      1,743%
Oceania/Australia      37,157,120       27,100,334    72.9%             0.8%        256%
World Total         7,260,621,118    3,270,490,584    45%             100%          806%
Source: World Internet Users and 2015 Population Stats

Usage of content languages for websites

2002                 2015
English      72%     English      54.5%
German        7%     Russian       5.9%
Japanese      6%     German        5.7%
Spanish       3%     Japanese      5.0%
French        3%     Spanish       4.7%
Italian       2%     French        4.1%
Dutch         2%     Portuguese    2.6%
Chinese       2%     Chinese       2.2%
Korean        1%     Italian       2.1%
Russian       1%     Polish        1.9%
Portuguese    1%     Turkish       1.6%

Cross Language IR
- Motivation
  - Information is unavailable in some languages
  - Language barrier
- Definition:
  - Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia).
- Example:
  - A user may pose a query in Chinese but retrieve relevant documents written in English.

Why do we need CLIR systems?
- We need technologies that enable access to information regardless of geographic or language barriers.
- The goal is to find, retrieve and understand relevant information in whatever language or form.
- CLIR has become one of the key factors affecting knowledge sharing all over the world.

General Issues With CLIR
- Multilingual text access (character sets, etc.)
- Differences between languages: stemming, compound words, breaks between words, etc.
- Term ambiguity between languages
- What to translate (query vs. document) and how

Matching strategies
- No translation
  - (1) Cognate matching
- Translation
  - (2) Query translation
  - (3) Document translation
  - (4) Interlingual techniques

Cognate matching
- In the most naive form of cognate matching, untranslatable terms such as proper nouns or technical terminology are left unchanged through the translation stage.
- The unchanged term can be expected to match a corresponding term in the other language if the two languages have a close linguistic relationship (for example, "generation" in English and French).
- When the two languages are very different, cognate matching may still be made feasible by measuring the similarity between a transliteration and its original word (transliteration).
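The cognate matching idea above can be illustrated with a small sketch. It is not the method of any cited system: it assumes that plain character-level similarity (Python's `difflib.SequenceMatcher`) after stripping diacritics is a reasonable stand-in for "similarity between a term and its counterpart", and the function names, the toy vocabulary and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(term: str) -> str:
    """Lowercase a term and strip diacritics (e.g. 'génération' -> 'generation')."""
    decomposed = unicodedata.normalize("NFKD", term.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized terms."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_cognate(source_term, target_vocab, threshold=0.8):
    """Return the most similar target-language term, or None if nothing
    in the vocabulary reaches the (assumed) similarity threshold."""
    if not target_vocab:
        return None
    score, term = max((similarity(source_term, t), t) for t in target_vocab)
    return term if score >= threshold else None


if __name__ == "__main__":
    # Hypothetical French index vocabulary; the English query term is left untranslated.
    french_vocab = ["génération", "gouvernement", "maison"]
    print(best_cognate("generation", french_vocab))
```

For distant language pairs the same scoring could be applied to a romanized transliteration instead of the raw term, which is the case the last bullet describes; the transliteration step itself is outside this sketch.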
Query translation
[Figure labels: Chinese query, translation system, French query, search engine, French document collection, French documents, results, select, browse]
Process:
- Translate the Chinese query into French
- Search the French document collection
- Translate the retrieval results into Chinese

Query translation
- Query translation is the most widely used matching strategy for CLIR because of its tractability.
- The retrieval system does not have to change its inverted files of index terms in any way to handle queries in any language.
- Translating a query is computationally cheaper than translating a large set of documents.
- Challenge: term ambiguity
  - Queries are often short, and short queries provide little context for disambiguation.
  - Term disambiguation is discussed later.

Query translation: advantages and disadvantages
- Advantages
  - Simple
  - Easy to operate
  - Flexible
  - Saves time and space; efficient
- Disadvantages
  - Lacks context
  - For short queries, translation ambiguity is high

Document translation
[Figure labels: Chinese query, French document collection, search engine, translation system, Chinese document collection, results, select, browse]
Process:
- Translate the entire French documents into Chinese
- Search the Chinese documents directly

Document translation
- Document translation has the opposite advantages and disadvantages from query translation.
- In CLIR experiments this approach is not usually used, and query translation is dominant.
- However, some researchers have used it to translate large sets of documents, since more varied context within each document is available for translation, which can improve translation quality.
- Oard and Hackett (1998) reported that automatic machine translation of a set of documents using a commercial MT system outperformed query translation in a German-to-English CLIR experiment.

Document translation: advantages and disadvantages
- Advantages
  - Each document is translated only once
  - Documents provide richer context
  - Documents can be translated offline in advance
- Disadvantages
  - Translation is slow
  - Takes a lot of space and time; inefficient
  - Depends on the quality of the machine translation system

Query translation vs. document translation
- The choice depends on the language resources available
- Query translation is generally more widely used
- Both approaches raise "interactivity" challenges

Interlingual approach
- An intermediate space of subject representation, into which both the query and the documents are converted, is used to compare them.
- One type of interlingual approach uses the synsets provided in WordNet, a well-known machine-readable thesaurus.
- For example, Diekema, Oroumchian, Sheridan, and Liddy (1999) employed WordNet synset numbers as language-independent representations for CLIR.
- Since a synset number (label) representing a concept corresponds to a set of concrete words in each supported language (e.g., English and French), a query term in the source language can be linked to words in the target language via the synset number.

Translation techniques

Dictionary-based methods
- Use a bilingual machine-readable dictionary (MRD).
- Most retrieval systems are still based on so-called bag-of-words architectures, in which both query statements and document texts are decomposed into a set of words (or phrases) through indexing.
- A query can therefore be translated easily by replacing each query term with its translation equivalents from a bilingual dictionary or a bilingual term list.

Bilingual dictionary
[Figure: bilingual dictionary]

Term translation
[Figure: term-by-term translation example. Candidate translations shown: oil / petroleum; probe / survey / take samples; restrain; cymbidium goeringii. Annotations: "Which translation to choose?", "No translation!", "Word segmentation error".]

Some issues in term translation
- Compound words, for example in German
  - Decomposition
- No boundary between words, e.g. in Chinese
  - Segmentation
- Specialized vocabulary not contained in the dictionary, e.g. named entities

Examples
- Compound decomposition
- Chinese word segmentation
  - 新西兰花
  - 新西兰 / 花: New Zealand flowers
  - 新 / 西兰花: fresh broccolis

Corpora-based method
- Parallel or comparable corpora are useful resources from which information beneficial to CLIR can be extracted.
- For example, in order to translate English queries into Spanish, Davis and Dunning (1995) extracted moderately frequent Spanish terms from Spanish documents aligned with English documents that had been retrieved using an English query (the source query).

Parallel corpora
- A parallel corpus (pl. corpora) is a document collection composed of two or more disjoint subsets, each written in a different language, such that documents in each subset are translations of documents in each other subset.
- Very high accuracy

[Figure labels: hieroglyphic script, ancient Egyptian script, Greek]

Rosetta Stone
- The Rosetta Stone, 1.14 m high and 0.73 m wide, is a granodiorite stele made in 196 BC and inscribed with a decree of the Egyptian king Ptolemy V. The same content is carved in Greek, in ancient Egyptian hieroglyphs, and in the Demotic script in use at the time. Because the stone carries three versions of the same text, modern scholars were able to compare them and decipher the meaning and structure of Egyptian hieroglyphs, which had been lost for more than a thousand years. The stone is therefore an important milestone in the study of ancient Egyptian history.

More parallel corpora
- News:
  - DE-News (German-English)
  - Hong-Kong News, Xinhua News (Chinese-English)
- Government documents:
  - Canadian Hansards (French-English)
  - Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
  - UN Treaties (Russian, English, Arabic, ...)
- Bible (many, many languages)

Examples (English / German)
- EN: Diverging opinions about planned tax reform
  DE: Unterschiedliche Meinungen zur geplanten Steuerreform
- EN: The discussion around the envisaged major tax reform continues.
  DE: Die Diskussion um die vorgesehene grosse Steuerreform dauert an.
- EN: The FDP economics expert, Graf Lambsdorff, today came out in favor of advancing the enactment of significant parts of the overhaul, currently planned for 1999.
  DE: Der FDP-Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus, wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen.

Comparable corpora
- A comparable corpus is a pair of corpora in two different languages that come from the same domain.
- The two corpora discuss the same topics.
- Parallel sentences may also be mined from comparable corpora, such as news stories written on the same topic in different languages.
- Some researchers extract phrase pairs from comparable corpora using a classifier approach.

Example
- The WWW can provide rich and ubiquitous machine-readable resources, from which we may be able to automatically extract information useful for CLIR.
- For example, Chen (2002) and Chen and Gey (2003) used a general Internet search engine to find English translation equivalents of Chinese or Japanese terms (mainly proper nouns) by analyzing the contexts of these terms in the Chinese and Japanese Web documents returned by the engine.

Term disambiguation techniques
- Disambiguation among multiple alternative term translations: how do we choose among several translations? E.g., Apple, Bank
- Use of part-of-speech (POS) tags.
- Use of a parallel corpus.
- Use of co-occurrence statistics in the target corpus.
- Use of query expansion.

Use of part-of-speech tags
- The basic idea of using part-of-speech (POS) tags for translation disambiguation is to select only translations that have the same POS as the source query term.
- This method requires POS tagging software for both languages.

Parallel corpus-based disambiguation
- Davis (1997, 1998) used a parallel corpus to determine the best translation or set of translations: a single translation for each source term was selected from the set of translations listed in an MRD according to the result of searching a parallel corpus.

Translation probability
[Figure: the term 探测 with multiple candidate translations (labels shown: survey, 试探, 样品, 测量) and their translation probabilities (p=0.4, p=0.3, p=0.25, p=0.05)]

Disambiguation based on co-occurrence statistics
- The correct translations of query terms should co-occur in target-language documents, and incorrect translations should tend not to co-occur.
- First, the two most related terms in the query are determined from co-occurrence statistics in the source-language corpus. Then the best translations are selected from all pairs of translations of these two terms according to co-occurrence statistics in the target-language corpus.
- Note that these two corpora do not have to be parallel or comparable.

Query expansion for disambiguation
- Pseudo-relevance feedback (PRF), also known as blind feedback, is widely recognized as an effective technique for enhancing the performance of information retrieval. PRF also works effectively for CLIR tasks.
- In CLIR, two kinds of PRF are feasible:
  - Pre-translation feedback
  - Post-translation feedback

Pre-translation feedback
- If a source-language corpus is available, documents can be retrieved from it prior to translation in order to add a set of new terms to the source query (pre-translation feedback).
- Pre-translation feedback may improve precision, because PRF is basically done using the entire query rather than each source term separately. Synonyms or related terms that correspond to the correct meaning of each source term in the context of the query are therefore expected to be added automatically through the PRF process.

Post-translation feedback
- After translation, standard PRF can be applied using the target document collection (post-translation feedback).
- Post-translation feedback can be considered a device for improving recall, as shown in standard monolingual retrieval experiments.
- In CLIR, two well-known methods for weighting terms in the top-ranked documents are often used to select good terms: the Rocchio method and the probabilistic method.

Bi-directional translation
- Boughanem et al. (2002) explored a bi-directional translation technique in which a form of backward translation is used to rank translation candidates. Suppose we need to translate English query terms into French. First, a set of French equivalents for an English term is found in an English-French dictionary. Next, using a French-English dictionary, each French equivalent is translated back into a set of English terms. Basically, if that set includes the original source term, the French equivalent is chosen as a preferred translation.

CLIR evaluation
- Information retrieval evaluation
  - Given: a retrieval topic, a document collection, and a set of documents judged relevant by human assessors
  - Judge the retrieval results returned by the system
- TREC CLIR (96-02): English to other languages
- CLEF (00-): between European languages
- NTCIR (99-): Asian languages and English

CLIR evaluation model
[Figure: CLIR evaluation model]

Applications of CLIR
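2.1 Cross language Search Engine
- April 25, 2006: European search engine "Quaero"
  - The French President announced 90 million euros of support.
- May 16, 2007: Google Translate
  - Provides CLIR for 12 languages
  - Goal: take all the Web and translate it into multiple languages.
- May 5, 2008: Yahoo Babel Fish
  - Provides CLIR between 12 languages
  - It was originally AltaVista's project, later bought by Yahoo

Google Translate
http:/
[Screenshots]

Yahoo Babel Fish
http:/
[Screenshots]

Question
- Compare the cross-language search engines of Google and Yahoo!, and analyze the strengths and weaknesses of each.
- Google: done in one step (translate & search); the retrieval results are translated back into the source language. Advantages: fast, and users can easily understand the results. Disadvantage: users cannot modify the translation.
- Yahoo!: done in two steps (translate + search); the retrieval results are not translated. Advantage: the intermediate step lets users modify the translation. Disadvantages: more complex, and users cannot read the results.

2.2 Cross-language retrieval in digital libraries
- WorldWideScience, a cross-language retrieval platform for world science, was released on June 11, 2010 at the summer meeting of ICSTI (International Council for Scientific and Technical Information) in Helsinki, the capital of Finland.

WorldWideScience
http://worldwidescience.org/multilingual
- The members of the alliance are all professional library and information institutions or leading organizations in scientific and technical information, such as the Office of Scientific and Technical Information (OSTI) of the US Department of Energy, the Library of Congress, the British Library, the Canada Institute for Scientific and Technical Information, the Korea Institute of Science and Technology Information, and the Institute of Scientific and Technical Information of China.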
- The platform can also automatically perform cross-language search across databases.
[Screenshot: WorldWideScience, http://worldwidescience.org/multilingual]

2.3 Cross-language patent retrieval
- According to the World Intellectual Property Organization (WIPO), patent documents contain 90%-95% of the world's research results, while other technical documents (papers, journals, etc.) contain only 5%-10%.
- Making good use of patent search in research work can shorten R&D time by 60% and reduce R&D costs by 40%.
- In May 2010, WIPO released a beta version of its cross-language patent retrieval system PATENTSCOPE, marking the move of cross-language information retrieval in patent search from the laboratory to practical use.
- The system supports cross-language patent retrieval only among five languages: English, French, German, Japanese and Spanish.

2.4 Cross-language image retrieval
[Screenshots]

2.5 Applications in e-commerce
- CINDOR is currently one of the more successful commercial cross-language information retrieval systems.
- CINDOR has three core technologies: Conceptual Interlingua, Language Analysis, and Search Management.
- CINDOR currently supports English, French and Spanish; Simplified Chinese, Russian and Arabic are under development.

Reference
- Kazuaki Kishida. Technical issues of cross-language information retrieval: a review. Information Processing and Management, 2005 (41), pp. 433-455.
- 葛运东. 跨语言信息检索查询翻译技术研究 [D]. 苏州大学, 2010.
- 王序文. 基于主题伪相关反馈的跨语言信息检索技术研究 [D]. 北京邮电大学, 2014.
- 彭琳. 汉语词语语义相似度度量及其在跨语言信息检索中的应用研究 [D]. 复旦大学, 2010.

Challenges for "interaction"
- CLIR poses some unique challenges for interaction
  - How do you help users select translated query terms?
  - How do you help users select document terms for query refinement?
  - How do you compensate for poor translation quality?

Cross-Language Information Access (CLIA)
[Figure: CLIA system diagram. Labels: CLIR System; CLIA System; Result Processing; Result Presentation; Query formulation; Question analysis; Request generation; Need negotiation; Need identification; Source selection; Result Selection; Result Examination; Information Extraction; Result Classification; Result Visualization; Result Summarization; Query Reformulation; Relevance Feedback; Need Clarification; Need Instantiation]

CLIA vs. CLIR
- Cross-Language Information Retrieval
  - A narrow view of CLIA
  - CLIR is limited, but good for developing matching techniques
- Cross-Language Information Access
  - Aims to help users find the information they want
  - Concerned with more than the ranking of results
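Cross-Language Information Access
- User-centered
  - Focuses on the interaction between user and system
  - Relevance depends on the specific user and the specific situation
- Interaction
  - Information needs cannot be fully understood
  - Language ambiguity
- A wider range of needs and uses
  - Multimedia: images, audio
  - Focused information: passage retrieval, question answering
  - Condensed information: summarization, information extraction

Life cycle of cross-language information access
[Figure labels: query formulation; query; query translation; translated query; retrieval; result list; document selection; documents to browse; document browsing; documents to deliver; query reformulation; translation reselection; document reselection]

Supporting query (re)formulation
- Problems
  - Term mismatch: query translations do not match the terms in docs
  - Translations are in a foreign language
    - How to display, interpret and control them
  - Is query translation an extra step?
- Query reformulation
  - Where and how to get information

User-assisted query translation
[Screenshot]

Supporting document (re)selection
- Selection needs translated surrogates
  - How to generate surrogates?
  - How to translate surrogates?
- Examination needs translated documents

Surrogate generation
- How to generate surrogates
  - First N words in docs (good for news articles)
  - Key Word In Context, automatic summarization
  - Passage retrieval
- How to translate surrogates
  - Gloss translation: term-by-term translation
  - Phrase translation: only translate phrases in docs
  - Machine translation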
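To make the surrogate options above concrete, here is a minimal sketch of three of them: the first N words, Key Word In Context (KWIC) snippets, and gloss (term-by-term) translation. It assumes whitespace tokenization and a tiny hand-made French-English dictionary; the function names and the dictionary entries are hypothetical and stand in for whatever resources a real system would use.

```python
import re


def _key(token: str) -> str:
    """Lowercase a token and drop punctuation so it can be looked up or compared."""
    return re.sub(r"\W+", "", token).lower()


def first_n_words(text: str, n: int = 10) -> str:
    """Surrogate made of the first n whitespace-separated words."""
    return " ".join(text.split()[:n])


def kwic(text: str, keyword: str, window: int = 3) -> list:
    """Key Word In Context: each occurrence of keyword with `window` words on each side."""
    tokens = text.split()
    snippets = []
    for i, tok in enumerate(tokens):
        if _key(tok) == keyword.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            snippets.append(" ".join(left + [tok] + right))
    return snippets


def gloss_translate(surrogate: str, dictionary: dict) -> str:
    """Term-by-term gloss: replace each word with its dictionary entry, keep unknown words."""
    return " ".join(dictionary.get(_key(tok), tok) for tok in surrogate.split())


if __name__ == "__main__":
    doc = "La réforme fiscale continue. Le débat sur la réforme reste ouvert."
    toy_dict = {"la": "the", "réforme": "reform", "fiscale": "tax"}  # hypothetical entries
    for snippet in kwic(doc, "réforme", window=1):
        print(snippet, "->", gloss_translate(snippet, toy_dict))
```

Because a gloss keeps the source word order and leaves out-of-dictionary words untranslated, its output is rough ("the reform tax" rather than "the tax reform"), which is the trade-off against the phrase translation and machine translation options listed above.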