《第 设计立方体计算.pptx》由会员分享,可在线阅读,更多相关《第 设计立方体计算.pptx(88页珍藏版)》请在taowenge.com淘文阁网|工程机械CAD图纸|机械工程制图|CAD装配图下载|SolidWorks_CaTia_CAD_UG_PROE_设计图分享下载上搜索。
1、2023/2/171第四章第四章:数据立方体计算与数据泛化数据立方体计算与数据泛化数据立方体计算的有效方法数据立方体和OLAP技术的进一步发展面向属性的规约-另一种数据泛化和概念描述方法第2页/共88页第1页/共88页2023/2/172数据立方体计算的有效方法数据立方体计算的有效方法Preliminary cube computation tricks(Agarwal et al.96)Computing full/iceberg cubes:3 methodologies Top-Down:多路数组聚集(Zhao,Deshpande&Naughton,SIGMOD97)Bottom-Up:
2、Bottom-up computation:BUC(Beyer&Ramarkrishnan,SIGMOD99)H-cubing technique(Han,Pei,Dong&Wang:SIGMOD01)Integrating Top-Down and Bottom-Up:Star-cubing algorithm(Xin,Han,Li&Wah:VLDB03)High-dimensional OLAP:A Minimal Cubing Approach(Li,et al.VLDB04)Computing alternative kinds of cubes:Partial cube,closed
3、 cube,approximate cube,etc.第3页/共88页第2页/共88页2023/2/173Preliminary Tricks(Agarwal et al.VLDB96)排序、散列和分组 operations are applied to the dimension attributes in order to reorder and cluster related tuplesAggregates may be computed from previously computed aggregates,rather than from the base fact tableSm
4、allest-child:computing a cuboid from the smallest,previously computed cuboidCache-results:caching results of a cuboid from which other cuboids are computed to reduce disk I/OsAmortize-scans:computing as many as possible cuboids at the same time to amortize disk readsShare-sorts:sharing sorting costs
5、 cross multiple cuboids when sort-based method is usedShare-partitions:sharing the partitioning cost across multiple cuboids when hash-based algorithms are used第4页/共88页第3页/共88页2023/2/174Multi-Way Array AggregationMulti-Way Array AggregationArray-based“bottom-up”algorithmUsing multi-dimensional chunk
6、sNo direct tuple comparisonsSimultaneous aggregation on multiple dimensionsIntermediate aggregate values are re-used for computing ancestor cuboidsCannot do Apriori pruning:No iceberg optimization第5页/共88页第4页/共88页2023/2/175Multi-way Array Aggregation for Cube Computation(MOLAP)Partition arrays into c
7、hunks(a small subcube which fits in memory).Compressed sparse array addressing:(chunk_id,offset)Compute aggregates in“multiway”by visiting cube cells in the order which minimizes the#of times to visit each cell,and reduces memory access and storage cost.What is the best traversing order to do multi-
8、way aggregation?AB29303132123459131415166463626148474645a1a0c3c2c1c 0b3b2b1b0a2a3CB442856402452362060第6页/共88页第5页/共88页2023/2/176Multi-way Array Aggregation for Cube ComputationAB29303132123459131415166463626148474645a1a0c3c2c1c 0b3b2b1b0a2a3C442856402452362060B第7页/共88页第6页/共88页2023/2/177Multi-way Arra
9、y Aggregation for Cube ComputationAB29303132123459131415166463626148474645a1a0c3c2c1c 0b3b2b1b0a2a3C442856402452362060B第8页/共88页第7页/共88页2023/2/178Multi-Way Array Aggregation for Cube Computation(Cont.)Method:the planes should be sorted and computed according to their size in ascending orderIdea:keep
10、the smallest plane in the main memory,fetch and compute only one chunk at a time for the largest planeLimitation of the method:computing well only for a small number of dimensionsIf there are a large number of dimensions,“top-down”computation and iceberg cube computation methods can be explored第9页/共
11、88页第8页/共88页2023/2/179Bottom-Up Computation(BUC)BUC(Beyer&Ramakrishnan,SIGMOD99)Bottom-up cube computation(Note:top-down in our view!)Divides dimensions into partitions and facilitates iceberg pruningIf a partition does not satisfy min_sup,its descendants can be prunedIf minsup=1 compute full CUBE!No
12、 simultaneous aggregation第10页/共88页第9页/共88页2023/2/1710BUC:PartitioningUsually,entire data set cant fit in main memorySort distinct values,partition into blocks that fitContinue processingOptimizationsPartitioningExternal Sorting,Hashing,Counting SortOrdering dimensions to encourage pruningCardinality
13、,Skew,CorrelationCollapsing duplicatesCant do holistic aggregates anymore!第11页/共88页第10页/共88页2023/2/1711H-Cubing:Using H-Tree StructureH-Cubing:Using H-Tree StructureBottom-up computationExploring an H-tree structureIf the current computation of an H-tree cannot pass min_sup,do not proceed further(pr
14、uning)No simultaneous aggregation第12页/共88页第11页/共88页2023/2/1712H-tree:A Prefix Hyper-treeMonthCityCust_grpProdCostPriceJanTorEduPrinter500485JanTorHhdTV8001200JanTorEduCamera11601280FebMonBusLaptop15002500MarVanEduHD540520rooteduhhdbusJanMarJanFebTorVanTorMonQ.I.Q.I.Q.I.Quant-InfoSum:1765Cnt:2binsAtt
15、r.Val.Quant-InfoSide-linkEduSum:2285 HhdBusJanFebTorVanMonHeadertable第13页/共88页第12页/共88页2023/2/1713H-Cubing:Computing Cells Involving Dimension CityrootEdu.Hhd.Bus.Jan.Mar.Jan.Feb.Tor.Van.Tor.Mon.Q.I.Q.I.Q.I.Quant-InfoSum:1765Cnt:2binsAttr.Val.Quant-InfoSide-linkEduSum:2285 HhdBusJanFebTorTorVanMonAt
16、tr.Val.Q.I.Side-linkEduHhdBusJanFebHeaderTableHTorFrom(*,*,Tor)to(*,Jan,Tor)第14页/共88页第13页/共88页2023/2/1714Computing Cells Involving Month But No CityrootEdu.Hhd.Bus.Jan.Mar.Jan.Feb.Tor.Van.Tor.Mont.Q.I.Q.I.Q.I.Attr.Val.Quant-InfoSide-linkEdu.Sum:2285 Hhd.Bus.Jan.Feb.Mar.Tor.Van.Mont.1.Roll up quant-i
17、nfo2.Compute cells involving month but no cityQ.I.Top-k OK mark:if Q.I.in a child passes top-k avg threshold,so does its parents.No binning is needed!第15页/共88页第14页/共88页2023/2/1715Computing Cells Involving Only Cust_grprooteduhhdbusJanMarJanFebTorVanTorMonQ.I.Q.I.Q.I.Attr.Val.Quant-InfoSide-linkEduSu
18、m:2285 HhdBusJanFebMarTorVanMonCheck header table directlyQ.I.第16页/共88页第15页/共88页2023/2/1716Star-Cubing:An Integrating MethodStar-Cubing:An Integrating MethodIntegrate the top-down and bottom-up methodsExplore shared dimensionsE.g.,dimension A is the shared dimension of ACD and ADABD/AB means cuboid
19、ABD has shared dimensions ABAllows for shared computationse.g.,cuboid AB is computed simultaneously as ABDAggregate in a top-down manner but with the bottom-up sub-layer underneath which will allow Apriori pruningShared dimensions grow in bottom-up fashion第17页/共88页第16页/共88页2023/2/1717Iceberg Pruning
20、 in Shared DimensionsAnti-monotonic property of shared dimensionsIf the measure is anti-monotonic,and if the aggregate value on a shared dimension does not satisfy the iceberg condition,then all the cells extended from this shared dimension cannot satisfy the condition eitherIntuition:if we can comp
21、ute the shared dimensions before the actual cuboid,we can use them to do Apriori pruningProblem:how to prune while still aggregate simultaneously on multiple dimensions?第18页/共88页第17页/共88页2023/2/1718Cell TreesCell TreesUse a tree structure similar to H-tree to represent cuboidsCollapses common prefix
22、es to save memoryKeep count at nodeTraverse the tree to retrieve a particular tuple第19页/共88页第18页/共88页2023/2/1719Star Attributes and Star NodesStar Attributes and Star NodesIntuition:If a single-dimensional aggregate on an attribute value p does not satisfy the iceberg condition,it is useless to dist
23、inguish them during the iceberg computationE.g.,b2,b3,b4,c1,c2,c4,d1,d2,d3 Solution:Replace such attributes by a*.Such attributes are star attributes,and the corresponding nodes in the cell tree are star nodesABCDCounta1b1c1d11a1b1c4d31a1b2c2d2 1a2b3c3d41a2b4c3d41第20页/共88页第19页/共88页2023/2/1720Example
24、:Star ReductionExample:Star ReductionSuppose minsup=2Perform one-dimensional aggregation.Replace attribute values whose count 2 with*.And collapse all*s togetherResulting table has all such attributes replaced with the star-attributeWith regards to the iceberg computation,this new table is a loseles
25、s compression of the original tableABCDCounta1b1*2a1*1a2*c3d42ABCDCounta1b1*1a1b1*1a1*1a2*c3d41a2*c3d41第21页/共88页第20页/共88页2023/2/1721Star TreeStar TreeGiven the new compressed table,it is possible to construct the corresponding cell treecalled star treeKeep a star table at the side for easy lookup of
26、 star attributesThe star tree is a loseless compression of the original cell tree第22页/共88页第21页/共88页2023/2/1722Star-Cubing AlgorithmDFS on Lattice Tree第23页/共88页第22页/共88页2023/2/1723Multi-Way Aggregation第24页/共88页第23页/共88页2023/2/1724Star-Cubing AlgorithmDFS on Star-Tree第25页/共88页第24页/共88页2023/2/1725Multi
27、-Way Star-Tree AggregationMulti-Way Star-Tree AggregationStart depth-first search at the root of the base star treeAt each new node in the DFS,create corresponding star tree that are descendents of the current tree according to the integrated traversal ordering E.g.,in the base tree,when DFS reaches
28、 a1,the ACD/A tree is createdWhen DFS reaches b*,the ABD/AD tree is createdThe counts in the base tree are carried over to the new trees第26页/共88页第25页/共88页2023/2/1726Multi-Way Aggregation(2)Multi-Way Aggregation(2)When DFS reaches a leaf node(e.g.,d*),start backtrackingOn every backtracking branch,th
29、e count in the corresponding trees are output,the tree is destroyed,and the node in the base tree is destroyedExampleWhen traversing from d*back to c*,the a1b*c*/a1b*c*tree is output and destroyedWhen traversing from c*back to b*,the a1b*D/a1b*tree is output and destroyedWhen at b*,jump to b1 and re
30、peat similar process第27页/共88页第26页/共88页2023/2/1727The Curse of DimensionalityNone of the previous cubing method can handle high dimensionality!A database of 600k tuples.Each dimension has cardinality of 100 and zipf of 2.第28页/共88页第27页/共88页2023/2/1728Motivation of High-D OLAPChallenge to current cubin
31、g methods:The“curse of dimensionality problemIceberg cube and compressed cubes:only delay the inevitable explosionFull materialization:still significant overhead in accessing results on diskHigh-D OLAP is needed in applicationsScience and engineering analysisBio-data analysis:thousands of genesStati
32、stical surveys:hundreds of variables第29页/共88页第28页/共88页2023/2/1729Fast High-D OLAP with Minimal CubingObservation:OLAP occurs only on a small subset of dimensions at a timeSemi-Online Computational Model1.Partition the set of dimensions into shell fragments2.Compute data cubes for each shell fragment
33、 while retaining inverted indices or value-list indices3.Given the pre-computed fragment cubes,dynamically compute cube cells of the high-dimensional data cube online第30页/共88页第29页/共88页2023/2/1730Properties of Proposed MethodPartitions the data verticallyReduces high-dimensional cube into a set of lo
34、wer dimensional cubesOnline re-construction of original high-dimensional spaceLossless reductionOffers tradeoffs between the amount of pre-processing and the speed of online computation第31页/共88页第30页/共88页2023/2/1731Example ComputationLet the cube aggregation function be countDivide the 5 dimensions i
35、nto 2 shell fragments:(A,B,C)and(D,E)tidABCDE1a1b1c1d1e12a1b2c1d2e13a1b2c1d1e24a2b1c1d1e25a2b1c1d1e3第32页/共88页第31页/共88页2023/2/17321-D Inverted IndicesBuild traditional invert index or RID listAttribute ValueTID ListList Sizea11 2 33a24 52b11 4 53b22 32c11 2 3 4 55d11 3 4 54d221e11 22e23 4 2e351第33页/共
36、88页第32页/共88页2023/2/1733Shell Fragment CubesGeneralize the 1-D inverted indices to multi-dimensional ones in the data cube senseCellIntersectionTID ListList Sizea1 b11 2 3 1 4 511a1 b21 2 3 2 32 32a2 b14 5 1 4 54 52a2 b24 5 2 30第34页/共88页第33页/共88页2023/2/1734Shell Fragment Cubes(2)Compute all cuboids f
37、or data cubes ABC and DE while retaining the inverted indicesFor example,shell fragment cube ABC contains 7 cuboids:A,B,CAB,AC,BCABCThis completes the offline computation stage第35页/共88页第34页/共88页2023/2/1735Shell Fragment Cubes(3)Given a database of T tuples,D dimensions,and F shell fragment size,the
38、fragment cubes space requirement is:For F 5,the growth is sub-linear.第36页/共88页第35页/共88页2023/2/1736Shell Fragment Cubes(4)Shell fragments do not have to be disjointFragment groupings can be arbitrary to allow for maximum online performanceKnown common combinations(e.g.,)should be grouped together.She
39、ll fragment sizes can be adjusted for optimal balance between offline and online computation第37页/共88页第36页/共88页2023/2/1737ID_Measure TableIf measures other than count are present,store in ID_measure table separate from the shell fragmentstidcountsum15702310382045405230第38页/共88页第37页/共88页2023/2/1738The
40、 Frag-Shells Algorithm1.Partition set of dimension(A1,An)into a set of k fragments(P1,Pk).2.Scan base table once and do the following3.insert into ID_measure table.4.for each attribute value ai of each dimension Ai5.build inverted index entry 6.For each fragment partition Pi7.build local fragment cu
41、be Si by intersecting tid-lists in bottom-up fashion.第39页/共88页第38页/共88页2023/2/1739Frag-Shells(2)A B C D E FABCCubeDEFCubeD CuboidEF CuboidDE CuboidCellTuple-ID Listd1 e11,3,8,9d1 e22,4,6,7d2 e15,10Dimensions第40页/共88页第39页/共88页2023/2/1740Online Query ComputationA query has the general formEach ai has
42、3 possible values1.Instantiated value2.Aggregate*function3.Inquire?functionFor example,returns a 2-D data cube.第41页/共88页第40页/共88页2023/2/1741Online Query Computation(2)Given the fragment cubes,process a query as follows1.Divide the query into fragment,same as the shell2.Fetch the corresponding TID li
43、st for each fragment from the fragment cube3.Intersect the TID lists from each fragment to construct instantiated base table4.Compute the data cube using the base table with any cubing algorithm第42页/共88页第41页/共88页2023/2/1742Online Query Computation(3)A B C D E FG H IJK LM N OnlineCubeInstantiated Bas
44、e Table第43页/共88页第42页/共88页2023/2/1743Experiment:Size vs.Dimensionality(50 and 100 cardinality)(50-C):106 tuples,0 skew,50 cardinality,fragment size 3.(100-C):106 tuples,2 skew,100 cardinality,fragment size 2.第44页/共88页第43页/共88页2023/2/1744Experiment:Size vs.Shell-Fragment Sizen(50-D):106 tuples,50 dime
45、nsions,0 skew,50 cardinality.n(100-D):106 tuples,100 dimensions,2 skew,25 cardinality.第45页/共88页第44页/共88页2023/2/1745Experiment:Run-time vs.Shell-Fragment Sizen106 tuples,20 dimensions,10 cardinality,skew 1,fragment size 3,3 instantiated dimensions.第46页/共88页第45页/共88页2023/2/1746Experiment:I/O vs.Shell-
46、Fragment Sizen(10-D):106 tuples,10 dimensions,10 cardinalty,0 skew,4 inst.,4 query.n(20-D):106 tuples,20 dimensions,10 cardinalty,1 skew,3 inst.,4 query.第47页/共88页第46页/共88页2023/2/1747Experiment:I/O vs.#of Instantiated Dimensionsn106 tuples,10 dimensions,10 cardinalty,0 skew,fragment size 1,7 total re
47、levant dimensions.第48页/共88页第47页/共88页2023/2/1748Experiments on Real World DataUCI Forest CoverType data set54 dimensions,581K tuplesShell fragments of size 2 took 33 seconds and 325MB to compute3-D subquery with 1 instantiate D:85ms1.4 sec.Longitudinal Study of Vocational Rehab.Data24 dimensions,8818
48、 tuplesShell fragments of size 3 took 0.9 seconds and 60MB to compute5-D query with 0 instantiated D:227ms2.6 sec.第49页/共88页第48页/共88页2023/2/1749Comparisons to Related WorkHarinarayan96 computes low-dimensional cuboids by further aggregation of high-dimensional cuboids.Opposite of our methods directio
49、n.Inverted indexing structures Witten99 focus on single dimensional data or multi-dimensional data with no aggregation.Tree-stripping Berchtold00 uses similar vertical partitioning of database but no aggregation.第50页/共88页第49页/共88页2023/2/1750Further Implementation ConsiderationsIncremental Update:App
50、end more TIDs to inverted list Add to ID_measure tableIncremental adding new dimensionsForm new inverted list and add new fragmentsBitmap indexingMay further improve space usage and speedInverted index compressionStore as d-gapsExplore more IR compression methods第51页/共88页第50页/共88页2023/2/1751Chapter