How About Micro-blogging Service in China: Analysis and Mining on Sina Micro-blog

Xinmiao Wu
Information Science & Technology, Sun Yat-sen University, GuangZhou, China

Jianmin Wang
Information Science & Technology, Sun Yat-sen University, GuangZhou, China

ABSTRACT
Sina Micro-blog is the first micro-blogging service in China and has grown quickly over the past two years. This paper first studies the characteristics of the Sina online social network and then focuses on the problem of identifying influential users. In a dataset prepared for this study, we find an approximate power-law follower distribution and a non-power-law friend distribution, a logarithmic correlation between follower count and tweet count, and several temporal and geographical characteristics. We also find that the growth of the mobile market in China strongly affects the growth of the micro-blogging service. In order to find the most popular users,
we propose an algorithm called XinRank and compare it with two other algorithms. The results show that XinRank ranks users differently and offers a new perspective for finding these users. Our algorithm is also dynamic and stable, which sets it apart from the other two algorithms and makes it preferable to them.

Author Keywords
Sina, Micro-blog, Social Network, Ranking, PageRank.

ACM Classification Keywords
G.3 Probability and Statistics: correlation and regression analysis, statistical computing; H.3.3 Information Storage and Retrieval: Information Search and Retrieval: information filtering, retrieval models.

General Terms
Algorithms, Economics, Experimentation, Human Factors.

INTRODUCTION
Micro-blogging is an emerging form of informal online communication. It allows users to publish brief message updates, which can be submitted through many different channels, including the Web and messaging services. Users follow others or are followed in the micro-blogging network. The relationship of following and being followed requires no reciprocation. Being a follower means that the user receives all the messages (called tweets) from those the user follows [1]. One of the most notable micro-blogging services is Twitter. This new kind of online social network is not only a platform for people to communicate but also a potential commercial market, for example for advertising. Finding useful latent characteristics of this kind of network can help marketers formulate effective commercial strategies, and is
now a hot and significant area of research. Much research has been done on Twitter [1, 2, 3], but little on micro-blogging services in China. Sina Micro-blog is one of the most popular micro-blogging services in China. We believe that the interaction behavior of Chinese users in this kind of social network differs to some degree from that of Twitter users. Furthermore, finding the most popular users in a network is one of the main problems in the social network research community, because targeting those influential users directly increases the efficiency of a marketing campaign. For example, a mobile phone manufacturer can engage those popular users to potentially influence more people. Analyzing Sina and finding its useful characteristics is therefore a meaningful topic, and one that marketers in China urgently need.

In this paper, we first study some characteristics of Sina Micro-blog. We then focus on the problem of finding the most popular users. We propose an algorithm called XinRank to solve this problem and compare it with other related algorithms. To the best of our knowledge, this work is the first quantitative study of the Sina sphere.

RELATED WORK
While there is relatively little research on Sina, we review some work on Twitter in this section. Online social networks differ from real human ones. Researchers have found these differences interesting and have proposed many analysis methods to study online social networks. Taking Twitter as an example, Java et al. conducted a preliminary analysis of Twitter in 2007. They found user clusters based on users' intentions toward topics, using clique percolation methods. Krishnamurthy et al. also analyzed user characteristics through the relationship between the number of followers and the number of followings. Huberman et al. report that the number of friends is actually smaller than the number of followers or followings. We consider these findings meaningful, so later in this paper we look for similar or contrasting findings on Sina.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SCI'11, September 18, 2011, Beijing, China.
Copyright 2011 ACM 978-1-4503-0925-7/11/09. $10.00.

Currently, Twitter measures a user's influence by the number of followers. The more followers a user has, the more impact he or she has in the Twitter context. The underlying assumption is that every tweet published by a user is read by all of that user's followers. Two similar metrics rely on the ratio between a user's number of followers and number of friends, and on the ratio of the attention (retweets, replies, and mentions) a user receives to the number of tweets he or she publishes. These three metrics do not use the global link structure among users, so none of them is a comprehensive measure.

ANALYSIS OF SINA MICRO-BLOG
We begin our analysis of the Sina space with a batch of well-known analyses and present summaries of the results.

Basic Analysis
Figure 1. Number of friends and followers (x-axis: # of friends/followers; y-axis: CCDF; series: Followers, Friends).

We construct a directed network from the following and followed relationships and analyze its basic characteristics. Figure 1 displays the distribution of the number of friends as the solid line and that of followers as the dotted line. The y-axis
represents the complementary cumulative distribution function (CCDF). We first explain the distribution of the number of friends. There are noticeable glitches in the solid line. The first occurs at around x = 40. Sina recommends an initial set of 40 to 45 people that a newcomer can follow with a single click, and quite a
few people take up the offer. The second glitch is at around x = 2000: Sina Micro-blog imposes an upper limit on the number of people a user can follow. Nevertheless, a very small number of users follow more than 2000 friends. Most of them are special users who need to offer some form of
customer service. The follower line in Figure 1 up to x = 10^6 as a whole fits a power-law distribution with an exponent of 2.564. Most real networks, including social networks, have a power-law exponent between 2 and 3. The data points between x = 10^2 and x = 10^3 represent users who have slightly fewer followers than the power-law distribution predicts. Comparing with Twitter, we are surprised to find that only about 0.001% of Twitter users have more than 10^5 followers [1], whereas more than 0.4% of Sina users do. Note that almost all Sina users are Chinese, while Twitter users come from all over the world. This seems to suggest that in micro-blogging services people cluster by geographic area: Twitter's users are spread across many different geographic areas, whereas Sina's users are concentrated in one. A characteristic common to Sina and Twitter is that many celebrities are present and readily form online relations with their fans.

We show the relation between friend count and follower count in Figure 2. Because Sina sets an upper limit of 2000 on the number of people a user can follow, we show the x-axis only up to 2000.

Figure 2. Relation of friend number and follower number.

In Figure 2, we can see that some users have fewer than 500 friends but more than 2*10^6 followers. Almost all of these users are celebrities and media outlets. This means that most celebrities and media outlets do not follow their followers back, and they
seem disinclined to follow many people. It also shows that follower count is not linear in friend count. However many friends users have, most of them have fewer than 0.3*10^6 followers. The point in the upper right corner is a Sina staff member who focuses on listening to newly registered users, identifying the problems they have using the micro-blog service, and tweeting guidance to his followers. He therefore follows many users, and many users have followed him since they registered.

Figure 3. Relation of Tweets and Retweets.

We initially wanted to find the relationship between tweets and retweets, but Sina does not store all of a user's tweets and retweets. As Figure 3 shows, Sina originally set the buffer to 200: if the sum of a user's tweets and retweets exceeded 200, Sina deleted the oldest ones rather than storing them all. Sina has since raised this limit to 400. However, some users have many more tweets stored on Sina's servers. They retweet little and are mostly online media outlets. We infer that media outlets have a much larger buffer than other users because they are VIP users.

Followers vs. Tweets and Retweets
Figure 4. The number of followers and that of tweets and retweets per user.

To gauge the correlation between the number of followers and the number of written tweets and retweets, we first plot the number of tweets and retweets (y) against the number of followers a user has (x) in Figure 4. The logarithmic relationship between them is easy to see: generally speaking, the more followers a user has, the more tweets and retweets he or she generates. We use a log curve in Figure 4 to show this trend clearly.

Figure 5. The number of followers and the median number of tweets and retweets per user.

Figure 6. The number of followers and the mode number of tweets and retweets per user (axes: # of followers, # of tweets and retweets; series: mode).

Second, we bin the number of followers in log scale and plot the median per bin as the dashed line in Figure 5. The majority of users who have fewer than 4 followers never tweeted
or tweeted just once, and thus the median stays at 1. The average number of tweets and retweets against the number of followers per user is always above the median, indicating that there are outliers who tweet and retweet far more than their number of followers would suggest. The median number and the average
number of tweets and retweets also have a logarithmic relationship with the number of followers for users with fewer than 400 followers. Then we separately plot the mode of the number of tweets and retweets against the number of followers a user has in Figure 6. We find that the most frequent case among users who have fewer
than 100 followers is that they never tweet or retweet.

Friends vs. Tweets and Retweets
Figure 7. The number of friends and the median number of tweets and retweets per user (axes: # of friends, # of tweets and retweets; series: Avg., Med.; log-scale bins).

In order to gauge the correlation between
the number of friends and the number of written tweets, we plot the number of tweets and retweets against the number of people a user follows, as a gauge of the inclination to be active, in Figure 7. As pointed out for Figure 1, irregularities at x = 40 and x = 2000 are observed. We also bin the number of friends in log scale and plot the median per bin
as the dashed line. The dashed line shows a positive trend between 0 and 300 friends. We conjecture that, if Sina removed the limit on the number of friends, the number of tweets and retweets would likewise follow a logarithmic relationship with the number of friends beyond 300.

Figure 8. The number of friends and the mass mode of tweets and retweets per user (axes: # of tweets and retweets, # of friends; series: Mode).

We also separately plot the mode of the number of tweets and retweets against the number of friends a user has in Figure 8. We find that the most frequent case among the users
who have fewer than 300 friends is that they never tweet or retweet. This means that users who have fewer than 200 followers or fewer than 300 friends tweet and retweet little and are not active.

Time Distribution
When to advertise is a question that concerns manufacturers. We sample the tweets generated between 2010/03/18 and 2011/03/18 from our crawled dataset, group them by the hour of the day at which each tweet was generated, and show the result in Figure 9. The numbers on the y-axis are total counts. For example, a total of 3,950,000 tweets in our sample dataset were generated at 7:00. We can see two peaks of tweeting activity during the day. One is between 10:00 and 13:00. This seems to mean that in China people often tweet during working hours in the morning and are used to tweeting after lunch. Activity then drops quickly between 13:00 and 14:00, because people in China are used to taking a nap during this time. The other peak occurs between 21:00 and 23:00. By this time, people have finished their housework and have already bathed, so they are free.

Figure 9. Trend of tweets during a day.

Then we distribute
the sample data across the days of the week according to the date each tweet was generated, as shown in Figure 10. We initially expected the weekly peak to fall on the weekend, but we are surprised to find that people tweet the least at the weekend compared with the other days of the week. This may be because people in China like to go out for activities or rest without any electronic device at the weekend. In contrast, people tweet the most on Tuesday and Thursday, the middle working days of the week.

Figure 10. Trend of tweets during a week.

Source Distribution
Tweets that spread through the micro-blogging network come from different sources. We tally the sources of all the tweets we crawled and show the top 20 in Figure 11. More than half of the tweets come directly from the Sina website. We also find that more than 27.63% of tweets come from mobile devices, which means that people tweet wherever they like and that the growth of the mobile market in China strongly affects the growth of the micro-blogging service. Notice that iPhone users are active as well, accounting for 7.58% of all tweets.

Geographic Distribution
We tally the users in our dataset by their registered provinces and show the result in Figure 12. As we can see, the number of micro-blogging users per province is not positively correlated with the population distribution of China. For example, the population of HeNan province is much larger than that of the others, yet it ranks far behind them.

Figure 11. The proportion of sources that tweets come from.

Comparing with the economic ranking [5] (ranked by comprehensive competitive power) in China, we find that the top 6 provinces in Figure 12 are the same as those in the economic ranking. And the last 8 provinces are similarly the same as the last o