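Translation: Research of Improved Recommendation Algorithm Based on Collaborative Filtering and Content Prediction¹

Abstract

This paper proposes an algorithm that combines a sparse-matrix filling method with the collaborative filtering algorithm in order to solve the “cold start” problem that collaborative filtering recommendation systems face when they meet a new item and sparse data. The algorithm improves the accuracy of the similarity calculation for users or items. When it fills the sparse user-item score matrix, it first predicts the scores of the items to be filled. In this way the algorithm obtains accurate virtual scores and fills in a virtual user-item score table, on which the prediction is then based. We ran experiments on the MovieLens dataset. The results show that the algorithm effectively improves the accuracy of rating prediction and, to a certain extent, solves the “cold start” problem.

Index terms: recommendation system, cold start, collaborative filtering, sparse matrix.

1. Introduction

With the rapid development of Web 2.0 and e-commerce, the huge number of Internet users produces massive amounts of data. The problem these users face has changed from how to find more information to how to find more useful information. Conventional information retrieval can hardly satisfy the needs of different users: because it ignores the differences among users, a search system returns the same results to everyone, although different users look for different information even when they enter the same keyword. Against this background, personalized recommendation, which offers different content to different users, has become a new direction for e-commerce and information providers, and personalized recommendation methods based on recommendation algorithms have become a hot research topic [1]. Among the recommendation techniques proposed so far, collaborative filtering is widely regarded as the most popular and successful. However, the conventional collaborative filtering algorithm suffers from problems such as sparsity, scalability, “cold start” and accuracy [2].

Collaborative-filtering-based recommendation depends heavily on the user-item score table, and recommendations can be produced only once this table exists. For a new item that nobody has rated yet, the item's scores are filled with the mean value. The item therefore looks unremarkable and is unlikely to be recommended. This makes new items hard to launch, which is the well-known “cold start” problem.

To solve this problem, this paper proposes an improved recommendation algorithm based on collaborative filtering and content prediction. When filling the user-item score matrix, the algorithm performs a simple analysis of the item's related content, uses that content to predict scores for the item, and then carries out the recommendation calculation with the collaborative filtering recommendation algorithm.

2. Steps of the Conventional Collaborative Recommendation Algorithm

The most commonly used conventional collaborative algorithm produces recommendations in the following steps [3,4]. First, build the user-item score matrix. Second, fill the blanks in the matrix. Third, compute user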
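similarity and then find the neighbor users or items. Finally, compute and produce the recommendation results.

2.1 Building the User-Item Score Matrix

First, the users' preferences must be collected. Users can submit their preferences to the system in various ways. After sufficient data has been collected, the algorithm processes it. Depending on the behavior-analysis method, the algorithm applies statistical methods such as weighting or grouping to obtain a two-dimensional matrix of user preferences, namely the user-item score table.

¹ Wei Jiang, Liping Yang; Research of improved recommendation algorithm based on collaborative filtering and content prediction [A]; 2016 11th International Conference on Computer Science & Education (ICCSE) [C]; IEEE; pp. 598-602.

2.2 Filling the User-Item Score Matrix

The score matrix produced by the methods above is very sparse, so errors are inevitable if similarity is computed from the users' ratings alone. The user-item score matrix is therefore filled with data to make it denser. The main conventional filling methods are mean filling, mode filling and cluster filling. In the conventional collaborative recommendation calculation, this paper fills the user-item score matrix by mean filling: each missing score is set to a fixed value, generally the mean of the rating scale, the user's average score, or the item's average score.

2.3 Calculating the Similarity and Finding Neighbor Users or Neighbor Items

After the user-item score matrix has been filled, the next step is to find similar users or items according to the users' preferences. The algorithm then produces recommendations based on the similar users or similar items. The most typical collaborative filtering algorithms have two branches: user-based collaborative filtering and item-based collaborative filtering. What they have in common is that both compute similarities and then use them to find the neighbor users or neighbor items [5]. The commonly used similarity equations are as follows:

(1) Cosine similarity:

(2) Correlation similarity:

(3) Adjusted cosine similarity:

Take user-based collaborative filtering as an example. In the three equations above, $sim(x, y)$ denotes the similarity between user x and user y. $R_{x,i}$ (or $R_{y,i}$) denotes the score given by user x (or y) to item i. $I(x, y)$ denotes the set of items rated by both user x and user y. $I(x)$ (or $I(y)$) denotes the set of items rated by user x (or user y). $\bar{R}_x$ (or $\bar{R}_y$) denotes the average score that user x (or y) gives to items.

2.4 Calculating and Generating the Recommendation Results

After the above calculation, the neighbor users or neighbor items are available. On this basis, the user's recommendation value for any item is calculated with the classical Eq. (4), and finally the recommendation results are generated. Take user-based collaborative filtering as an example. In that equation, $P_{y,i}$ denotes the recommendation value of target user y for item i. $R_{x,i}$ is the score given to item i by user x, one of the nearest neighbors of target user y. k is the number of nearest neighbors; it can be set directly, determined by a threshold, or taken as the top k users whose similarity exceeds the threshold.

3. Improvement and Optimization of the Conventional Algorithm

To address the “cold start”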
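problem, we improve the way the score matrix is filled in the conventional algorithm. We optimize the algorithm mainly in the following four aspects.

We build a feature-based item indication matrix by means of filtering. In general, a collaborative filtering recommendation system describes an item briefly. For example, when Douban …

Table 1. Feature-based item indication matrix

(1) We train the weights of the user's evaluations over the item content, according to the correlation between the items' content features and the user's ratings of the items. Once the items' features have been extracted, the corresponding feature information can be analyzed and filtered through the user's ratings of the items. By matching the user's preferences against the item's features, the system decides whether to recommend the item to the user. This paper uses the Winnow algorithm [6] to analyze film ratings. Winnow's effectiveness is widely recognized in text classification. In Winnow, $X_i$ is a Boolean feature value. Winnow assigns a weight to each word, and the weights form a linear threshold function $\sum_i W_i X_i > \theta$, where $\theta$ is the threshold and $W_i$ is a weight with initial value 0.5. The parameters $X_i$ are set for each user from the keywords of the items that the user has rated. For example, if the user has rated an action film, the corresponding parameter is set to 1; otherwise it is set to 0. If the user rates a film highly, the weights of that item's keywords are increased; otherwise they are decreased. During training, the weighted sum is computed for every item of every user. If the sum is below the threshold $\theta$ while the user's rating is above the rating cut-off, the weight of each keyword is doubled. If the sum exceeds $\theta$ while the user's rating is below the cut-off, the weight of each keyword is divided by 2. If the weights already fit, they are left unchanged. The weights are adjusted repeatedly over the training set until every item is handled correctly, or for a specified number of passes, until the weights no longer change. After training, each user y has a set of weights $W_{y,k}$ for the different film genres.

(2) Based on content prediction, we obtain the virtual user rating $R_{y,i}$. Its value is assigned by Eq. (5), where $r_{y,i} = 0$ indicates that user y's score for item i in the user-item score matrix is empty. For a new item that users have not yet rated, we first make a preliminary prediction. We have each user's weights for the film features (for example, user y's), so for a film i the predicted rating $P_{y,i}$ is generated, where $W_{y,k}$ is user y's weight for films of the k-th genre and $I_{k,i}$ is the value of the k-th feature of film i.

(3) We filter out predictions that are not accurate enough.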
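As a rough illustration of steps (1) and (2), the Winnow training and the content-based scoring might look like the Python sketch below. It is not the authors' implementation. It assumes a threshold of half the number of features, updates only the weights of features present in the item (the usual Winnow convention), and scores an item as the weighted sum of its features, because the source does not reproduce the prediction formula. The names `train_winnow`, `predict_score` and `liked`, and the sample data, are hypothetical.

```python
def train_winnow(items, liked, theta=None, rounds=50):
    """Train one user's feature weights with a Winnow-style update.

    items: list of Boolean feature vectors (1 = the film has the feature).
    liked: list of bools, True when the user's rating is above the cut-off.
    theta: threshold; defaults to half the feature count (an assumption).
    """
    n = len(items[0])
    theta = n / 2 if theta is None else theta
    w = [0.5] * n  # initial weight 0.5, as stated in the text
    for _ in range(rounds):
        changed = False
        for x, y in zip(items, liked):
            total = sum(wi * xi for wi, xi in zip(w, x))
            if total < theta and y:
                # Sum too low for a liked item: double the active weights.
                w = [wi * 2 if xi else wi for wi, xi in zip(w, x)]
                changed = True
            elif total > theta and not y:
                # Sum too high for a disliked item: halve the active weights.
                w = [wi / 2 if xi else wi for wi, xi in zip(w, x)]
                changed = True
        if not changed:  # a full pass without updates: weights are stable
            break
    return w


def predict_score(w, features):
    """Content-based score of an item: weighted sum of its Boolean features."""
    return sum(wi * fi for wi, fi in zip(w, features))


# Features: [action, comedy]; the user liked the two action films.
films = [[1, 0], [0, 1], [1, 1]]
liked = [True, False, True]
weights = train_winnow(films, liked)
```

With these toy ratings the action weight grows while the comedy weight stays at its initial value, so an unseen action film scores higher than an unseen comedy.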
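The recommendation algorithm based on content prediction fills the blanks in the user-item rating matrix and thereby modifies the sparse matrix. The filled matrix must therefore represent the users' preferences accurately; otherwise the subsequent predictions will be very inaccurate. To make sure the prediction method is valid, the prediction results should be screened first. Only if the results are accurate enough do they show that the trained weights match the user's preferences. We apply the following two measures.

a) Pre-filtering. We filter users in advance. If a user has too few ratings, the sample is too small for an accurate prediction, and this user's predictions are considered invalid. Predictions are therefore generated only when the user's rating number (RN) exceeds a certain value. In this paper RN is set between 90 and 100.

b) Revision according to the user's other ratings. We measure the predictions by the corresponding proportion (CP): the proportion of the user's RN ratings for which the predicted result agrees with the user's actual rating. Only when CP is high enough can the user's predicted values be regarded as accurate. In this paper CP is set to 75% and 80%. After this optimization we can be confident that the user's RN is sufficient to yield accurate predictions.

(4) From the user-item score matrix and the virtual user ratings we obtain a virtual user-item score matrix. On this basis, similarities are computed with the similarity measures of the conventional collaborative filtering algorithm.

4. Experiments and Evaluation

4.1 Dataset and Metrics

We use the MovieLens dataset for the experiments. Its data are film scores from 1 to 5 given by users who have watched the films. MovieLens provides two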
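datasets of different sizes, suited to algorithms of different scales. We chose the MovieLens M1 dataset as the experimental data for this study. The evaluation metrics in our experiments are recommendation accuracy and coverage rate (CR).

1) Accuracy: rating prediction is generally evaluated with one of two metrics, the mean absolute error (MAE) or the root mean square error (RMSE) [7,8]. Because MAE is more popular and easier to interpret, this paper uses MAE to evaluate the experimental data. Suppose the set of predicted ratings for the target user on the test set is $Y = \{y_i \mid i = 1, 2, \ldots, n\}$ and the set of true ratings is $R = \{r_i \mid i = 1, 2, \ldots, n\}$. Over the predicted ratings that are not 0, MAE is given by Eq. (6):

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - r_i \rvert \qquad (6)$$

where N is the number of test-set items for which both the predicted value and the true rating of the target user are not 0. The smaller the MAE, the higher the recommendation accuracy.

2) Coverage rate (CR): CR is the ratio of the number of items for which a prediction can be made to the total number of items. Suppose the set of predicted values for a user is $Y = \{y_i \mid i = 1, 2, \ldots, n\}$ and the number of $y_i \neq 0$ is $K_i$; the coverage rate for that user is then $CR = K_i / N$.

4.2 Experimental Results and Analysis

Many experimental and research papers in this field report that the cosine similarity measure predicts more accurately than other measures. This paper therefore compares the following four strategies under different parameter settings:

a) the conventional recommendation algorithm using cosine similarity (CRA);
b) the conventional recommendation algorithm using adjusted cosine similarity;
c) the improved recommendation algorithm based on content prediction and collaborative filtering, using cosine similarity (IRA);
d) the improved recommendation algorithm based on content prediction and collaborative filtering, using adjusted cosine similarity.

Our experiments produced results for CRA, IRA and the non-optimized IRA. Tables 2 to 5 show the results of CRA and IRA under different parameter settings. As mentioned above, the recommendation algorithms using adjusted cosine similarity perform better than the others. The tables also show that the recommendations become more accurate as CP increases; for example, the results at CP = 80% are more accurate than those at CP = 75%.

Table 2. Results of the conventional recommendation algorithm
Table 3. Results of the improved recommendation algorithm (RN = 90, CP = 75%)
Table 4. Results of the improved recommendation algorithm (RN = 90, CP = 80%)
Table 5. Results of the improved recommendation algorithm (RN = 100, CP = 80%)

Figure 1 shows the experimental results with adjusted cosine similarity, RN = 100 and CP = 80%. It shows how MAE changes with the number of neighbors for IRA, CRA and the non-optimized IRA. The content-based hybrid collaborative filtering algorithm proposed in this paper is more accurate than the other two algorithms once the parameters CP and RN are tuned. Moreover, although the coverage of the non-optimized IRA is almost 100%, because it can produce a recommendation for every item, its MAE is much worse than that of IRA and CRA. This shows that the optimized algorithm achieves better recommendation accuracy than the non-optimized hybrid algorithm.

Figure 1. MAE as a function of the number of neighbors
Figure 2. CR values of IRA and CRA under different parameter settings

Figure 2 shows the CR values of CRA and IRA under different parameter settings. The CR of IRA is better than that of CRA under all of its parameter settings, and coverage also increases with the number of neighbors. Content-based prediction can thus maintain high accuracy together with high coverage, which indicates that the method helps overcome the “cold start” problem. The results above therefore show that the proposed recommendation algorithm, which combines content-based prediction with collaborative filtering, is effective and performs well.

5. Conclusion

This paper proposes a new collaborative filtering recommendation algorithm based on content prediction, which can generate recommendations even for items that have not yet been rated. Our experimental results show that the method not only preserves prediction accuracy but also improves coverage. It is an effective and feasible algorithm.

6. Acknowledgment

This work was supported by the Fundamental Research Funds for the Central Universities of China under Grant No. 2662015QC040.

Original 1: Research of Improved Recommendation Algorithm Based on Collaborative Filtering and Content Prediction

Abstract

This paper proposes an algorithm that combines the sparse matrix filling method with the collaborative filtering algorithm in order to solve the “cold start” problem of collaborative filtering recommendation systems when the system confronts a new item and sparse data. The algorithm improves the accuracy of the similarity calculation for users or items. When it fills the sparse user-item score matrix, it predicts in advance the item to be filled. The algorithm achieves an accurate virtual score and fills out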
a virtual user-item score table. The algorithm then carries out the prediction based on this table. We experimented on the MovieLens dataset. The results show that the algorithm effectively improves the accuracy of rating prediction. To a certain extent, this algorithm
solves the “cold start” problem.

Index Terms: Recommendation system, cold start, collaborative filtering, sparse matrix.

INTRODUCTION

With the rapid development of Web 2.0 and e-commerce, massive numbers of Internet users produce massive data. The problem confronting Internet users has changed from how to find more information to how to find more effective information. Conventional information search can hardly satisfy the demands of different users. Because the differences among users are not considered, the search system returns the same results to all users. In fact, different users focus on different information even if they use the same keyword. Against this background, in order to satisfy different users' different demands, personalized recommendation, with different content for different users, has become the new development direction for e-commerce and information providers. Personalized recommendation methods based on recommendation algorithms have become a hot research topic [1]. Currently, among the proposed recommendation technologies, the collaborative filtering algorithm is well known to be the most popular and successful method. However,
there are problems in the conventional collaborative filtering algorithm, such as sparsity, scalability, “cold start” and accuracy [2].

The recommendation algorithm based on collaborative filtering depends heavily on the user-item score table. Recommendation results can be produced only if the
user-item score table has been produced. For a new item that nobody has evaluated yet, the item's scores are filled with the mean value. The item thus becomes unremarkable and is unlikely to be recommended. This makes the new item hard to launch, which is the
famous “cold start” problem. In order to solve the problem, this paper proposes an improved recommendation algorithm based on collaborative filtering and content prediction. The algorithm simply analyzes the item's related content when it fills the user-item score matrix. It then predicts scores for the item from that content and performs the recommendation calculation by means of the collaborative filtering recommendation algorithm.

STEPS OF CONVENTIONAL COLLABORATIVE RECOMMENDATION ALGORITHM

The most widely used conventional collaborative algorithm produces recommendations in the following steps [3,4]. First, build the user-item score matrix. Second, fill the blanks in the matrix. Third, calculate the similarity of the users and then find the neighbor users or items. Finally, compute and produce the recommendation results.

Building the User-item Scores Matrix

First, the users' preferences must be collected. A user can submit preferences to the system through various means. The algorithm processes the data once sufficient data has been collected. According to different behavioral analysis methods, the algorithm applies statistical methods such as weighting or grouping to obtain a two-dimensional matrix of the users' preferences, namely the user-item score table.

Filling the User-item Scores Matrix

The score matrix generated by the methods mentioned above is very sparse. Therefore, there are errors
inevitably if the similarity is calculated relying only on the users' evaluation scores. The user-item score matrix should therefore be filled with data to make it denser. The main conventional methods are mean filling, mode filling and cluster filling. This paper fills the user-item score matrix by mean filling in the conventional collaborative recommendation calculation. It sets each absent score to a fixed value, which is generally the mean of the scoring system, the user's average score or the item's average
score.

Calculating the Similarity and Finding the Neighbor Users or Neighbor Items

After the user-item score matrix is filled, the next step is to find similar users or items according to the user's preference. The algorithm then produces recommendations based on the similar users or similar items. The most typical collaborative filtering algorithms have two branches: collaborative filtering based on users and collaborative filtering based on items. What they have in common is that both calculate similarities and then find the neighbor users or neighbor items according to the similarity [5]. The common similarity equations are as follows.

The equation of calculating the cosine similarity:

The equation of calculating the correlation similarity:

The modified equation of calculating the cosine
similarity:

Take collaborative filtering based on users as an example. In the above three equations, $sim(x, y)$ denotes the similarity between user x and user y. $R_{x,i}$ (or $R_{y,i}$) denotes the evaluation score given by user x (or y) to item i. $I(x, y)$ denotes the set of items evaluated by both user x and user y. $I(x)$ (or $I(y)$) denotes the set of items evaluated by user x (or user y). $\bar{R}_x$ (or $\bar{R}_y$) denotes the average score given by user x (or y) to the items.

Calculating and Generating the Recommendation Results

After the above calculation, the neighbor users or neighbor items are available. On this basis, the user's recommendation value for any item is calculated by means of the classical Eq. (4). At last the recommendation
results are generated. Take collaborative filtering based on users as an example. In the above equation, $p_{y,i}$ denotes the recommendation value of target user y for item i. $R_{x,i}$ is the score given to item i by user x, one of the nearest neighbors of target user y.
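The similarity and prediction steps described so far can be sketched in Python. The source text does not reproduce the equations, so this sketch assumes the textbook cosine similarity and, for Eq. (4), a mean-centred weighted average over the k most similar users who rated the item. The function names `cosine_sim` and `predict` and the toy ratings are illustrative only.

```python
from math import sqrt


def cosine_sim(rx, ry):
    """Cosine similarity between two users' ratings, given as {item: score}."""
    common = set(rx) & set(ry)  # I(x, y): items rated by both users
    num = sum(rx[i] * ry[i] for i in common)
    den = (sqrt(sum(v * v for v in rx.values()))
           * sqrt(sum(v * v for v in ry.values())))
    return num / den if den else 0.0


def predict(ratings, y, item, k=2):
    """Predict user y's score for item from the k nearest neighbours.

    Assumed form: y's mean score plus the similarity-weighted average of
    the neighbours' deviations from their own mean scores.
    """
    def mean(r):
        return sum(r.values()) / len(r)

    # Rank the other users who rated the item by similarity to y.
    neighbours = sorted(
        ((cosine_sim(ratings[y], r), x)
         for x, r in ratings.items() if x != y and item in r),
        reverse=True)[:k]
    norm = sum(abs(s) for s, _ in neighbours)
    if norm == 0:  # no usable neighbour: fall back to y's own mean
        return mean(ratings[y])
    dev = sum(s * (ratings[x][item] - mean(ratings[x])) for s, x in neighbours)
    return mean(ratings[y]) + dev / norm


ratings = {
    "a": {"m1": 5, "m2": 3},
    "b": {"m1": 5, "m2": 3, "m3": 4},
    "c": {"m1": 1, "m2": 5, "m3": 2},
}
score = predict(ratings, "a", "m3", k=2)
```

Here user b agrees with user a on both shared films and therefore carries more weight than user c when the missing score for m3 is estimated.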
k is the number of nearest neighbors, which can be prescribed directly, decided by means of a threshold, or taken as the top k users whose similarity exceeds the threshold.

IMPROVEMENT AND OPTIMIZATION OF CONVENTIONAL ALGORITHM

In view of the “cold start” problem, we improve
the methods for filling the score matrix in the conventional algorithm. We optimize the algorithm mainly in the following four aspects.

We build the feature-based item indication matrix by use of the filtration method. Generally, a collaborative filtering recommendation system describes an item simply. For example, when Douban recommends a film, it introduces the main actors in the film and its style: whether the film is a comedy, a tragedy, or a mixed style of various elements. These description labels can be considered as the
item's keywords. Thus, for each item $X_i = \{A_1, A_2, \ldots, A_n\}$ there are keywords or labels that describe its content, where $A_j$ denotes the j-th feature of $X_i$. $A_j$ is a Boolean value: if it equals 1, $X_i$ possesses this feature; otherwise $X_i$ does not possess it. All the items therefore form a binary two-dimensional keyword matrix, as shown in Table I.

TABLE I. THE ITEMS INDICATION MATRIX BASED ON FEATURES

            Feature A1   Feature A2   Feature An
Item X1         0            1            1
Item X2         1            1            0
Item Xm         1            0            1

We train the weights of the users' evaluations over the item content according to the correlation between the items' content features and the users' evaluations of the items. After the items' features are extracted, we can analyze the corresponding feature information and filter it by means of the users' evaluations of the item. By matching the users' preferences with the item's features, the system judges whether it can recommend the item to the user. This paper adopts the Winnow algorithm [6] to analyze the evaluation of films. In the field of text classification, the effectiveness of Winnow is widely recognized. In Winnow, $X_i$ is considered as a Boolean feature value. Winnow sets the weight of each word, and the weights then compose a linear threshold function: $\sum_i W_i X_i > \theta$, where