Computing the perplexity of an LDA language model in Python, and plotting it

Perplexity is commonly used in natural language processing to measure how good a trained language model is. When LDA is used for topic and word clustering, the formula given in the paper is:

    perplexity = exp{ -(Σ log p(w)) / N }

Here p(w) is the probability of each word that appears in the test set; in the LDA model specifically, p(w) = Σ_z p(z|d) * p(w|z), where z denotes the trained topics and d the individual documents of the test set. The denominator N is the number of all words that appear in the test set, i.e. the total length of the test set, without deduplication.

The Python program therefore needs four blocks:

1. For the trained LDA model, convert the Topic-word distribution file into a dictionary so that each word's probability can be looked up easily; this yields the numerator of the perplexity.
2. Count the length of the test set; this yields the denominator of the perplexity.
3. Compute the perplexity.
4. For models trained with different numbers of topics, compute each perplexity and draw a line chart.

The Python code is as follows:

    # -*- coding: utf-8 -*-
    import math
    import matplotlib.pyplot as plt


    def dictionary_found(wordlist):
        """Turn the trained Topic-word list into a dict mapping word -> probability."""
        word_dictionary1 = {}
        for i in range(0, len(wordlist), 2):  # entries alternate: word, probability
            word = wordlist[i]
            probability = float(wordlist[i + 1])
            if word in word_dictionary1:
                word_dictionary1[word] += probability
            else:
                word_dictionary1[word] = probability
        return word_dictionary1


    def look_into_dic(dictionary, testset):
        """Look up each distinct test-set word in the dictionary and sum the probabilities."""
        frequency = []
        letter_list = []
        a = 0.0
        for letter in testset.split():
            if letter not in letter_list:
                letter_list.append(letter)
                frequency.append(dictionary.get(letter))
        for each in frequency:
            if each is not None:
                a += float(each)
        return a


    def f_testset_word_count(testset):
        """Return the number of words in the test set: the denominator N of the formula."""
        testset_clean = testset.split()
        return len(testset_clean) - testset.count("\n")


    def f_perplexity(word_frequency, word_count):
        """Compute the perplexity of the LDA model for one topic number T."""
        neg_log = -math.log(word_frequency)
        exponent = neg_log / word_count
        perplexity = math.exp(exponent)
        return perplexity


    def graph_draw(topic, perplexity):
        """Draw a line chart of perplexity against the number of topics."""
        plt.plot(topic, perplexity, color="red", linewidth=2)
        plt.xlabel("Number of Topics")
        plt.ylabel("Perplexity")
        plt.show()


    topic = []
    perplexity_list = []
    f1 = open("/home/alber/lda/GibbsLDA/jd/test.txt", "r")  # test-set directory
    testset = f1.read()
    testset_word_count = f_testset_word_count(testset)  # sum of words in the test set
    for i in range(14):
        topic.append(5 * (i + 1))  # topic numbers used in the model file names
        trace = "/home/alber/lda/GibbsLDA/jd/stats/model-final-" + str(5 * (i + 1)) + ".txt"  # model directory
        f = open(trace, "r")
        text = f.readlines()
        word_list = []
        for line in text:
            if "Topic" not in line:
                word_list.extend(line.split())
        word_dictionary = dictionary_found(word_list)
        frequency = look_into_dic(word_dictionary, testset)
        perplexity = f_perplexity(frequency, testset_word_count)
        perplexity_list.append(perplexity)
    graph_draw(topic, perplexity_list)

Below is the resulting line chart. Adjust the parameter further around the elbow point (which of course depends on the test set, as the chart shows) to look for the optimal number of topics. Experiments show that as long as the chosen number of topics lies near that point, topic extraction is generally satisfactory. I have only just started doing research, so if there are mistakes in the program or anywhere else, I hope readers will point them out.
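As a sanity check on the formula itself, it can be tried on a tiny hand-made example. The helper below and its token probabilities are purely illustrative, not part of the program above: it applies the standard definition, perplexity = exp of the negative average log-probability per token, where each p(w) would come from Σ_z p(z|d) * p(w|z) in the LDA case.

```python
import math


def perplexity(token_probs):
    """Standard corpus perplexity: exp(-(sum of log p(w)) / N).

    token_probs holds one probability per token of the test set
    (made-up values here, for illustration only); N is the total
    token count, not deduplicated.
    """
    n = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n)


# A model that assigns every token probability 1/8 has perplexity exactly 8,
# i.e. it is as "confused" as a uniform choice among 8 outcomes:
print(perplexity([1.0 / 8] * 10))  # 8.0

# Sharper (higher-probability) predictions lower the perplexity:
print(perplexity([0.5, 0.25, 0.5, 0.25]))
```

This is why a lower perplexity on the chart signals a better topic number: the model spreads less probability mass away from the words that actually occur in the test set.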