11-An Effective Merge Strategy Based Hierarchy.docx

上传人:太** 文档编号:72765216 上传时间:2023-02-13 格式:DOCX 页数:6 大小:130.88KB
返回 下载 相关 举报
11-An Effective Merge Strategy Based Hierarchy.docx_第1页
第1页 / 共6页
11-An Effective Merge Strategy Based Hierarchy.docx_第2页
第2页 / 共6页
点击查看更多>>
资源描述

《11-An Effective Merge Strategy Based Hierarchy.docx》由会员分享,可在线阅读,更多相关《11-An Effective Merge Strategy Based Hierarchy.docx(6页珍藏版)》请在taowenge.com淘文阁网|工程机械CAD图纸|机械工程制图|CAD装配图下载|SolidWorks_CaTia_CAD_UG_PROE_设计图分享下载上搜索。

1、An Effective Merge Strategy Based Hierarchy for ImprovingSmall File Problem on HDFSZhipeng Gao1, Yinghao Qin1, Kun Niu2,State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications,Beijing 100876, China2School of Software Engineering, Beijing Univer

2、sity of Posts and Telecommunications, Beijing 100876, Chinagaozhipeng, qyhqyh 123123163 , niukunAbstract: Hadoop Distributed File System (HDFS) is designed for reliable storage and management of very large file and low-cost storage capability. As HDFS architecture bases on master (NameNode) to handl

3、e metadata for multiple slaves (DataNode), NameNode often becomes bottleneck, especially when handing large number of small files. It is a common solution to merge many small files into one big file about this problem. HDFS does not consider the correlation between the files stored on it, it is hard

4、 to use one efficient prefetching mechanism. To solve the large small files problem and to improve efficiency of accessing small files , in this paper, we define Logic File Name (LFN) and propose Small file Merge Strategy Based LFN (SMSBL). SMSBL is a new idea and a new perspective on hierarchy, it

5、improves the correlation of small files in the same block of HDFS effectively based different file system hierarchy, so the performance is amazing facing large small files when HDFS adopted SMSBL with prefetching mechanism. The system efficiency analysis model is established and experimental results

6、 demonstrate that SMSBL can solve small file problem in HDFS and has appreciable high hit rate of prefetching files.Keywords: HDFS; small files; merging and prefetching;IntroductionWith the rapid development of Internet services, the amount of data grows exponentially, cloud computing has become inc

7、reasingly popular as the next infrastructure for hosting data and deploying software and services 1. As data explode from the web applications and grow bigger, it is hard for traditional system to handle this situation, so new projects and systems are needed. Distributed file systems are introduced

8、and developed for cloud computing and for big data. The most three famous distributed file systems examples are Google file system (GFS) 2, Hadoop distributed file system (HDFS) and Amazon Simple Storage Service (S3). Among them, HDFS is an open-source software framework influenced by GoogleFS, it i

9、s widely used in many industries such as science, biology, climatology, astronomy, finance, internet, geography, etc 3.A small file is a file whose size less than the HDFS block size 4. The design of the HDFS is for storing large files, HDFS is inefficient because it has high memory usage and unacce

10、ptable access cost when it stores a large number of small files 5.The rest of the paper is organized as follows. Section II discusses the background and related work. Section III describes our proposed new approach SMSBL to impove HDFS. Experiments are conducted in section IV. Section V concludes th

11、e summary of this paper and future work.1 BackgroundHDFSInternet giants, such as Facebook, Twitter and YAHOO, use HDFS as their basic distributed data storage environment. The design of Hadoop Distributed File System (HDFS) is fully inspired by Google File System (GFS). Both HDFS and GFS are master/

12、slave architecture. The previous version of HDFS cluster has only one master node called Namenode, which holds and manages the metadata information include distributed file system namespace, file descriptions, file-data block mappings, data block allocations, access regulations and so on 6. The High

13、 Availability (HA) version of HDFS cluster has more than one namenode, active namenode and standby namenode, to enhance reliability of HDFS and efficiency of operation in cluster 7. Along with continuous improvemengt of Hadoop and HDFS, they play more and more important roles in the era of big data.

14、Namenode not only is in charge of responsing to client who requests for accessing some files on the HDFS cluster, but also manages huge metadata information. Whatever the size of file is, the metadata of each file consumes 250bytes and its block with default three replicas consumes 368 bytes in memo

15、ry of Namenode 8. When many small files are stored in HDFS, the memory of Namenode will has high pressure and then maintaining these huge metadata is inefficient 9.When the size of relevant metadats is larger than the size of memory in Namenode or frequently accessing small files exceeding I/O capac

16、ity of the cluster, this HDFS may shutdown abnormally 10. Above situations happen on account of that design of original HDFS is for huge files with the purpose of big scale transfer 11.Along with the era of big data coming, Hadoop and HDFS have to deal with many complicated data which them may not b

17、e good at. HDFS architecture is shown in Fig.l.Related workThe present studies on handing small file problem mainly concentrate on two ways:One of ways is improving the architectural of distributed files systems. Xuhui Lius approach is merging small files into big ones and building indexs for each s

18、mall file which stored consecutively in physical memory in terms of their geographic locations where hash index is used 12. But his approach is exclusively used in WebGIS. Chandrasekar S proposed Extended Hadoop Distributed File System (EHDFS), it has been designed and implemented in such a way that

19、 a large number of small files can be merged into a single combined file and it also provides a framework for prefetching metadata for a specified number of files 13. Chatuporn Vbrapongkitpun proposed a mechanism based on Hadoop Archive (HAR), called New Hadoop Archive (NHAR), to reduce the memory u

20、tilization for metadata and enhance the efficiency of accessing small files in HDFS. Dipayan Dev designed Hadoop Archive Plus (HAR+) using hashtable on its architecture, selected sha256 as the key, which is a modification of existing HAR. HAR+ is designed to provide more reliability and it can also

21、provide auto scaling of metadata 14. But improvement of this architectural design is very complex and high cost in resource.Optimizing strategy of merging small files and using cache for prefetching files is another way to impove HDFS handing small files. Du zhonghui presented a balance of small fil

22、es merging algorithm to optimize distribution of merged large files volume, which could effectively reduce the numbers of HDFS data blocks. Zhang chunming proposed an approach called HIFM (hierarchy index file merging), in which the correlations between small files and the directory structure are co

23、nsidered to assist merging small files into large ones and then generating hierarchical index 15. Zhang chunming proposition is good but it has not good universality.1.1 Our contributionThrough study on the above these approaches, we propose a new approach which have better performace in local corre

24、lation and in universality. In our new approacch we define Logic File Name (LFN), and adjust order of fields in LFN to match its using environment. Our approach named small file merge strategy based LFN (SMSBL) can be used in most hierarchy file systems and can elevate the hit rate of prefetching fi

25、les on HDFS.The contribution of our work:1) Solve the problem of large small files on HDFS. Our proposed approach which merges small files into larger ones and build index for file accessing ensure HDFS available facing large small files.2) SMSBL has highly universality and flexibility. Our proposed

26、 approach can be used in most file system of hierarchy, LFN will match concrete hierarchy. In different use situation, we always adjust priorities of LFNs fields to increase local correlation of small files before merging small files.3) SMSBL has high hit rate of prefetching files. Because of adjust

27、able priorities of LFN9s fields, local correlation of small files can be always high, this is an essential foundation for high hit rate of prefetching files.2 Our proposed approachThere are two steps in our proposed approach. First, making a radix sort on files set under the rule of altered LFN to g

28、et a sequence of output files. Second, merging files from the sequence of output files to block files orderly and building index files, then uploading them to HDFS.We proposed one concept named Logic File Name (LFN) to support finding the optimum sequence of output files, in order to enhance local c

29、orrelation of files on HDFS in different situations.2.1 Hierarchy of file system and Logic File Name (LFN)As shown in Fig. 2, many file systems adopt hierarchy. And Fig. 3 shows the corresponding LFN.二 |J | | :二 fi 图 dT | |1 .J .J .J 1 .J | .J j 1 11 -J . Figure 2 File system architecturehashvalue t

30、imeuserid filetype fileidLFNxfix 19fix2,fix3,.?fixv. The number of items of LFNx may not equals the number of items of LFN, because field could be split for more precise dimension.Figure 3 LFNLogic File Name (LFN) is a filename that has meaning of hierarchy. It is made up of some fields having logic

31、 meaning based on its parent directories name and file name, representing the hierarchy of file system. Each dimension of LFN attributes contain each dimension of information about this file. In most situations the structs of LFN are different in different system. As shown in Fig. 3, in this file sy

32、stem LFN of one file is made up of hash value、creating time related user id、file type file id and some other information. Such as the LFN of the bottom left file in Fig. 2 is “hasha_timel_useridl_typel_l”. Logic file name is a global notion, so when some files have not attributes in some dimension t

33、hese fields should be filled with default value such as nuH. Some fields of LFN should be encoded to represent complex string value, it is important for subsequent processing to reduce calculation. A set of LFN based on the store system and a corresponding map from a LFN to a address of physical loc

34、ation will be set up through traversing directory tree.3.2 Radix sort based on altered LFNuserid time filetype fileidFigure 4 Altered LFNIn practical applications, the order of access files depends on the usage environment, so it is absolutely impossible that one certain store situation can be suita

35、ble for every cases. By altering the priorities of LFN fields with different situations, the radix sort could produce a queue customized for usage in a certain case. For example, in some geographic information system, the correlation of files on field of coordinates may be more important than the co

36、rrelation of files on field of time, but in some weather forecast system the correlation of files on field of time may be more important than the correlation of files on field of coordinates. In the storage system of some social networks , time field may be ahead of file types field, but in some oth

37、er social networks, file type may be more important. Fig. 4 shows altered LFN for one specific case, by customizing priorities of other fields under the same field of hashvalue.Here is algorithm description of radix sort based on LFN:Fset=fi f2,f3,.i,n files should be deal with.1)Alter priorities of

38、 LFN fields for adaptation to specific situations, from LFNfil9fi29fi3,.9fik to 2)Set up item lists for sort based on field fixv, and start one process called assign on Fset in which file belong to this fixv key should append to the corresponding list.3)Collect these lists items in order, and this p

39、rocess will form one new list, called Fsetx.4)Set up radix items lists based on field fixv-1 just like 2) but on Fsetx,and repeat 2)4) until deal all fields. The number of iteration is v, equal the number of fields of LFNx.5)The Fsetx is a sorted list based on LFN.Here is pseudocode of our proposed

40、approach:LFNlist=fixi,fix2,fix3,.fixv*Jtraverse(file system) is a function which through all files on this file system.8file.fieldname return certain field type in files LFN +distribute (fieldname) is a distribution function in radix sort*Fset=traverse(file system) 8Fset_temp=Fset*Jfbr fieldx in LFN

41、list.reverse:itemlist=8for file in Fset temp:*itemlistdistribute (file.fieldx).append(file)*JFset_temp=Jfbr fileset in itemlist:Fsettemp.appenfileset)Fset=Fset_temp*Jreturn FsetHere is analysis of complexity: The original list has n items, d keys, the range of keys is a list = rxl,rx2,rx3. rxv. The

42、time complexity is O(dx(n+max(rx 1 ,rx2,. .rxv). The time complexity of one assign process is O(n) and the time complexity of one collection is O(radix), the space complexity of assigns and collections is d.3.3 Merging small files and store on DHFSThe merge process depends on the customized sorted l

43、ist in Fsetx. In HDFS files are saved in distributed blocks which have uniform size (The size is 128M at present version of HDFS by default) 16. In some situation one small file has to be divided to two parts stored in two blocks to make the most of volume of one block, but it is not efficient to re

44、ad two blocks when accessing the one small file. So it is important to keep balance between time cost and space cost, as well as a principle in computer systems. In algorithm of merging small files there is a threshold. When free space of one block is smaller than threshold (default size is 10%) and

45、 it can not load entirely the next small file, this block should be uploaded onto HDFS and start new one block. Here is the algorithm of merging small files.l)Create a new buffer file and an index file which named buffer_file and maptable contain the small file offset information. The size of buffer

46、 file base on the 3)If the size of this small file less than free space of buffer file,then put this small file in buffer file and update maptable and repeat2),if not go to 4).4)If the size of free space of buffer file greater than 10%, but it can not hole whole small file,then split small file base

47、d on free space of buffer file, then put buffer file and maptable onto HDFS and take merge process. Then do rest part file based on 3)5)If the size of free space of buffer file less than 10% and this small file greater than free space of buffer file, put buffer file and maptable onto HDFS and take m

48、erge process. Make new buffer file and maptable. Go to 2)6)Cope with all files in Fsetx.There is pseudocode of merge process:block=sizeof(Hadoop_block)*Jbuflfer_file=new fHe(block)wmaptable=new fileQFor file in Fsetx:If file.size0.1 and fHe.sizebuffer_file.remain(): 8file 1 ,file2=file.split(buffer_file.remain)*Jbufferfile. append(maptable. writemap(file 1) uptohdfs(buflfer_le,maptable)。Fsetx.insert(file2)*JBuffer_file=new fileCblock)*Maptable=new fileQelse:*uptohdfsfbufferfilejmaptable) buffer_file=new filelo

展开阅读全文
相关资源
相关搜索

当前位置:首页 > 应用文书 > 解决方案

本站为文档C TO C交易模式,本站只提供存储空间、用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。本站仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知淘文阁网,我们立即给予删除!客服QQ:136780468 微信:18945177775 电话:18904686070

工信部备案号:黑ICP备15003705号© 2020-2023 www.taowenge.com 淘文阁