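Xiamen University Database Laboratory
MapReduce Sorting
Presenter: Li Yuqian (李雨倩)
Advisor: Lin Ziyu (林子雨)
2014.07.19

In the following example, the user data contains each user's name, age, and state.

$ cat test-data/ch4/users.txt
anne 22 NY
joe 39 CO
alison 35 NY
mike 69 VA
marie 27 OR
jim 21 OR
bob 71 CA
mary 53 NY
dave 36 VA
dude 50 CA

The user activity log contains the user's name, the action performed, and the source IP. This file is usually much larger than the user data.

$ cat test-data/ch4/user-logs.txt
jim logout 93.24.237.12
mike new_tweet 87.124.79.252
bob new_tweet 58.133.120.100
mike logout 55.237.104.36
jim new_tweet 93.24.237.12
marie view_user 122.158.130.90

$ hadoop fs -put test-data/ch4/user-logs.txt user-logs.txt
$ bin/run.sh com.manning.hip.ch4.joins.improved.SampleMain users.txt,user-logs.txt output
$ hadoop fs -cat output/part*
bob 71 CA new_tweet 58.133.120.100
jim 21 OR logout 93.24.237.12
jim 21 OR new_tweet 93.24.237.12
jim 21 OR login 198.184.237.49
marie 27 OR login 58.133.120.100
marie 27 OR view_user 122.158.130.90
mike 69 VA new_tweet 87.124.79.252
mike 69 VA logout 55.237.104.36

Optimized repartition join

The traditional repartition join is space-inefficient: it reads all of the output values for a join key into memory and only then performs the multi-way join. It is more efficient to read only the smaller data set into memory and then stream through the larger data set, joining each of its records against the cached records.

[Figure: flowchart of the optimized repartition join]

Composite key and composite value emitted by the map

(key, value) = (name + smaller, smaller + age + state / smaller + action + IP)

In the examples, the "smaller" tag is 0 for records from the user data and 1 for records from the user activity log.

jim 21 OR -> (jim 0, 0 21 OR)
jim new_tweet 93.24.237.12 -> (jim 1, 1 new_tweet 93.24.237.12)
bob 71 CA -> (bob 0, 0 71 CA)
bob new_tweet 58.133.120.100 -> (bob 1, 1 new_tweet 58.133.120.100)

From another block:
jim logout 93.24.237.12 -> (jim 1, 1 logout 93.24.237.12)

Sorting by the tag in the value

(bob 0, 0 71 CA)
(jim 0, 0 21 OR)
(bob 1, 1 new_tweet 58.133.120.100)
(jim 1, 1 logout 93.24.237.12)
(jim 1, 1 new_tweet 93.24.237.12)

Figure labels: "one map", "another map". The record (jim 1, 1 logout 93.24.237.12) comes from the other map.

Grouping

(bob 0, 0 71 CA)
(bob 1, 1 new_tweet 58.133.120.100)
(jim 0, 0 21 OR)
(jim 1, 1 logout 93.24.237.12)
(jim 1, 1 new_tweet 93.24.237.12)

Three elements that affect data sorting and data flow

During the map output collection phase, the partitioner chooses which reducer receives each map output record. The data in each map output partition is sorted by a RawComparator. The reduce side also sorts with a RawComparator. Finally, a RawComparator groups the sorted data.

Thank you!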
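The composite-key flow walked through in the slides (tag each record, partition by name, sort by name and tag, group by name) can be imitated in plain Python. The following is only a minimal sketch under stated assumptions, not Hadoop code and not the implementation behind `com.manning.hip.ch4.joins.improved.SampleMain`: the names `map_phase`, `partition`, and `reduce_side_join` are hypothetical, the byte-sum partitioner is a stand-in for a real hash partitioner, and the input is a small subset of the sample records.

```python
from itertools import groupby

# Small subset of the sample data from the slides.
USERS = [("jim", "21", "OR"), ("bob", "71", "CA")]
LOGS = [
    ("jim", "new_tweet", "93.24.237.12"),
    ("bob", "new_tweet", "58.133.120.100"),
    ("jim", "logout", "93.24.237.12"),
]


def map_phase(users, logs):
    """Emit (composite key, value) pairs; tag 0 = smaller data set, 1 = larger."""
    for name, age, state in users:
        yield (name, 0), (age, state)
    for name, action, ip in logs:
        yield (name, 1), (action, ip)


def partition(key, num_reducers):
    """Partition on the name only, so both tags of a user reach one reducer.

    Assumption: a deterministic byte sum replaces a real hash partitioner.
    """
    name, _tag = key
    return sum(name.encode("utf-8")) % num_reducers


def reduce_side_join(pairs, num_reducers=2):
    """Simulate partitioning, sorting, grouping, and the reduce-side join."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))

    joined = []
    for part in partitions:
        # Sort on the whole composite key: name first, then tag,
        # so each user's tag-0 records precede the tag-1 records.
        part.sort(key=lambda kv: kv[0])
        # Group on the name only, ignoring the tag.
        for name, group in groupby(part, key=lambda kv: kv[0][0]):
            cached = []  # only records from the smaller data set are held
            for (_name, tag), value in group:
                if tag == 0:
                    cached.append(value)
                else:
                    # Stream each larger-side record against the cache.
                    for small in cached:
                        joined.append((name, *small, *value))
    return joined


if __name__ == "__main__":
    for row in sorted(reduce_side_join(map_phase(USERS, LOGS))):
        print(" ".join(row))
```

Because the sort places every tag-0 record ahead of the tag-1 records for the same name, the reducer loop only ever holds the user records in memory while the log records pass through one at a time, which is the space saving the optimized repartition join is after.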