Big_Data: Introduction to Big Data (all English).ppt

Uploaded by: 豆****  Document ID: 87690860  Upload date: 2023-04-16  Format: PPT  Pages: 76  Size: 1.58 MB

Topics
  What is Big Data?
  Why is Big Data a big deal?
  NoSQL vs SQL
  How to deal with Big Data?
  What's Hadoop/MapReduce?
  RDBMS vs Hadoop/MapReduce
  Big data players/Software tools/Platforms
  Examples

What Is Big Data?
  Capturing and managing lots of information
  Working with many new types of data
  Structured/Unstructured
  Exploiting these masses of information and new data types with new styles of applications
  Bigger than terabytes
  Volume, variety, velocity, variability

Why Big Data Is a Big Deal
  Big data differs from traditional information in mind-bending ways:
  Not knowing why but only what
  The challenge with leadership is that it's very driven by gut instinct in most cases
  Air travelers can now figure out which flights are likeliest to be on time, thanks to data scientists who tracked a decade of flight history correlated with weather patterns
  Publishers use data from text analysis and social networks to give readers personalized news.
  Health care is one of the biggest opportunities. If we had electronic records of Americans going back generations, we'd know more about genetic propensities, correlations among symptoms, and how to individualize treatments.
  Google Maps search data correlates to "open retail store", etc.

What This Means for You
  Big Data can help a company do many things:
  Profile customers
  Determine pricing strategies
  Identify competitive advantages
  Better target advertising
  Inform internal research and product development
  Strengthen customer service

Main Steps in Adopting an Analytical System
  What will we analyze?
  Do we buy or build?
  Are we ready to invest?
  Do we understand the impact?

Challenges
  Information growth
  Processing power
  Physical storage
    Disk capacity has increased dramatically
    100 MB/s read from disk (bottleneck)
    Data seek time is slower than data transfer
  Data issues
  Costs

Recent IT Trends
  Commodity hardware
  Distributed file systems
  Open source operating systems, databases, and other infrastructure
  Significantly cheaper storage
  Service-oriented architecture

Big Data Chain
  Collect data
  Ingest/clean data (originally ETL; existing schema)
  Human exploration/Infrastructure/Data mining
  Store/Archive
  Share (decision making, other systems)
  Measure/Feedback

ACID
  ACID (Atomicity, Consistency, Isolation, Durability)
  (A) When you do something to change a database, the change should work or fail as a whole
  (C) The database should remain consistent (this is a pretty broad topic)
  (I) If other things are going on at the same time, they shouldn't be able to see things mid-update
  (D) If the system blows up (hardware or software), the database needs to be able to pick itself back up; and if it says it finished applying an update, it needs to be certain

MapReduce
  Dividing and conquering
  Highly fault tolerant: nodes are expected to fail
  Every data block is (by default) replicated on 3 nodes (and placement is rack aware)
  Difficult to implement

RDBMS
  Fixed-schema, row-oriented databases with ACID properties and a sophisticated SQL query engine.
  The emphasis is on strong consistency, referential integrity, abstraction from the physical layer, and complex queries through the SQL language.
  You can easily create secondary indexes, perform complex inner and outer joins, count, sum, sort, group, and page your data across a number of tables, rows, and columns.

RDBMS vs MapReduce
  RDBMS                          MapReduce
  mostly structured data         unstructured data
  data has internal structure    none (structure is applied during processing)
  normalized                     needs non-normalized data
  Notes:
  1. Relational databases are starting to incorporate some of the ideas from MapReduce (such as Aster Data's and Greenplum's databases).
  2. In the other direction, higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable for traditional database programmers.

Architectures

How Does MapReduce Work
  HDFS (Hadoop Distributed File System)
  Data is stored on local disk and processing is done locally on the computer that holds the data
  Can work with raw data stored in a file system or database
  Two steps: Map and Reduce

Map
  MapReduce uses key/value pairs (traditional systems use rows and columns).
  Example:
    last name / chen
    withdraw amount / 20
    transaction date / 06-23-2013

Reduce
  All the intermediate values for a given output key are combined together into a list.
  The reduce() function then combines the intermediate values into one or more final values for the same key.

Hadoop
  Hadoop is designed to abstract away

  much of the complexity of distributed processing
  Different from grid computing
  Widely used:
    Social media (e.g., Facebook, Twitter)
    Life sciences
    Financial services
    Retail
    Government

Hadoop Architecture
  Application layer/end user access layer
  a. Job Tracker (workload management layer)
  b. Distributed parallel file systems/data layer

Hadoop Implementation
  Hadoop is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects

Design of HDFS
  Namenodes (the master): manage metadata/file trees
  Datanodes (workers): store/retrieve data blocks
  Datanodes do not use RAID disks. HDFS round-robins HDFS blocks between all disks, whereas RAID is limited by the slowest disk in the array.

Limitations of HDFS
  Low-latency data access
  Lots of small files
  Multiple writers, arbitrary file modifications

HDFS Block
  64 MB/128 MB (a normal disk block is 512 bytes), to minimize seek time
  Fixed size rather than whole files, for easy storage/replication
  % hadoop fsck / -files -blocks
  % hadoop fs -help    (regular filesystem operations)
  % hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/quangle.txt
  % hadoop fs -mkdir books
  % hadoop fs -ls

Data Flows

Format and Types
  The MapReduce model in detail and, in particular, how data in various formats, from simple text to structured binary objects, can be used with this model
  map: (K1, V1) → list(K2, V2)
  reduce: (K2, list(V2)) → list(K3, V3)

Text File
  On the top of the Crumpetty Tree
  The Quangle Wangle sat,
  But his face you could not see,
  On account of his Beaver Hat.

  is divided into one split of four records. The records are interpreted as the following key-value pairs:
  (0, On the top of the Crumpetty Tree)
  (33, The Quangle Wangle sat,)
  (57, But his face you could not see,)
  (89, On account of his Beaver Hat.)

Data File

MapReduce Special Features
  Counter
  Sorting
  Joins

Shuffle
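As a rough illustration of the map, shuffle, and reduce steps applied to the four Quangle Wangle records above, here is a minimal single-process Python sketch. It is not Hadoop code: the function names (read_records, map_fn, shuffle, reduce_fn) and the word-count task are assumptions chosen only to make the (K1, V1) → list(K2, V2) and (K2, list(V2)) → list(K3, V3) signatures concrete.

```python
from collections import defaultdict

TEXT = (
    "On the top of the Crumpetty Tree\n"
    "The Quangle Wangle sat,\n"
    "But his face you could not see,\n"
    "On account of his Beaver Hat.\n"
)

def read_records(text):
    """Yield (byte offset, line) pairs, mimicking a line-oriented text input."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))

def map_fn(offset, line):
    """map: (K1, V1) -> list(K2, V2). Emits (word, 1) for every word (illustrative task)."""
    return [(word.strip(",.").lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key and hand the groups over sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(key, values):
    """reduce: (K2, list(V2)) -> list(K3, V3). Sums the counts for one key."""
    return [(key, sum(values))]

records = list(read_records(TEXT))
mapped = [pair for offset, line in records for pair in map_fn(offset, line)]
shuffled = shuffle(mapped)
counts = dict(pair for key, values in shuffled for pair in reduce_fn(key, values))

print([key for key, _ in records])
print(counts["the"])
```

The record keys are the byte offsets of each line, which is why the four records on the slide are keyed 0, 33, 57, and 89. In a real cluster the map and reduce calls would run on different nodes and the shuffle would move data across the network; here everything happens in one list.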

  MapReduce guarantees that the input to every reducer is sorted by key.
  The process by which the system performs the sort and transfers the map outputs to the reducers as inputs is known as the shuffle.

Install Hadoop
  % cd /usr/local
  % sudo tar xzf hadoop-x.y.z.tar.gz
  Change the owner of the Hadoop files to be the hadoop user and group:
  % sudo chown -R hadoop:hadoop hadoop-x.y.z

Layers/Players (continued)
  Extract, transform, load (ETL): IBM InfoSphere DataStage, Informatica, Pervasive, Talend
  Data warehouse: Oracle, Teradata, IBM Netezza, Greenplum

Pig Helps Hadoop
  Pig is a scripting language for exploring large datasets
  1. A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output
  2. The Pig execution environment translates the program into an executable representation and then runs it

HBase
  HBase is a distributed column(family)-oriented database built on top of HDFS.
  HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets
  HBase tables are like those in an RDBMS, only cells are versioned, rows are sorted, and columns can be added on the fly by the client as long as the column family they belong to preexists.

HBase (continued)
  Regions
  Each region comprises a subset of a table's rows
  Provides ways to read or write individual records efficiently on top of Hadoop

Hive
  Hive: an open source data warehousing and SQL infrastructure built on top of Hadoop

Cloudera's Distribution for Hadoop
  Cloudera's Distribution for Hadoop is based on the most recent stable version of Apache Hadoop with numerous patches, backports, and updates

Evaluation Criteria
  High scalability
  Low latency
  Predictability
  High availability
  Easy management
  Multi-tenancy

Big Data Realtime Processing
  Google BigQuery is a web service that lets you do interactive analysis of massive datasets, up to billions of rows
  Twitter's Storm
  Cloudera Impala

NoSQL
  NoSQL here refers to document-oriented databases
  SQL doesn't scale well horizontally (adding more servers, which the cloud is good at)
  It is schemaless, but not formless (JSON format)
  JSON: data interchange format
  Mongo Database
  Couch Database

NoSQL BASE Model

  Basic Availability: spread data across many storage systems with a high degree of replication
  Soft State: data consistency is the developer's problem and should not be handled by the database.
  Eventual Consistency: at some point in the future, data will converge to a consistent state. No guarantees are made about "when".

JSON Structure
  { field1: value1, field2: value2, ... fieldN: valueN }

  var mydoc = {
    _id: ObjectId("5099803df3f4948bd2f98391"),
    name: { first: "Alan", last: "Turing" },
    birth: new Date("Jun 23, 1912"),
    death: new Date("Jun 07, 1954"),
    contribs: [ "Turing machine", "Turing test" ],
    views: NumberLong(1250000)
  }

RDBMS vs NoSQL
  Row DB:
    001:10,Smith,Joe,40000;
    002:12,Jones,Mary,50000;
    003:11,Johnson,Cathy,44000;
    004:22,Jones,Bob,55000;
  Index:
    001:40000;002:50000;003:44000;004:55000;
  Column DB:
    10:001,12:002,11:003,22:004;
    Smith:001,Jones:002,Johnson:003,Jones:004;
    Joe:001,Mary:002,Cathy:003,Bob:004;
    40000:001,50000 ...;
  Column DB with repeated values merged:
    Smith:001,Jones:002,004,Johnson:003;

Benefits
  Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
  Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.
  Row-oriented organizations are more efficient when many columns of a single row are required at the same time, and when row size is relatively small, as the entire row can be retrieved with a single disk seek.
  Row-oriented organizations are more efficient when writing a new row if all of the column data is supplied at the same time, as the entire row can be written with a single disk seek.

SQL vs Non-SQL
  A good compromise is to design your system with 3 logical DBs:
  1. A normal SQL DB used by your admin application to create content.
  2. A NoSQL DB for the front-end/public/high-volume application used by the public internet.
  3. The last DB is for an analytical reporting system using cubes and all that good stuff.
  Data then flows from the admin DB to the client NoSQL DB when someone publishes a piece of content; the client (NoSQL) DB provides very fast read access and records user interactions with the content.
  Then you have a scheduled job that pulls the data from the client DB into the reporting system.
  Since admin, client, and

  reporting are often separate apps, each application team can work with data in the format that best serves the application, and the transition from one system to the other is handled in the service layers.

Big Data Solutions
  Cloudera: Cloudera Enterprise
  Microsoft: Windows Azure HDInsight Service
  Google: BigQuery
  Amazon: DynamoDB
  IBM: InfoSphere Streams/Netezza
  EMC: Greenplum
  Teradata: Aster MapReduce Platform
  Oracle: Hadoop/MapReduce Big Data connectors

Reasons Big Data Projects Fail
  Lack of cooperation among departments
  Lack of staff experienced in Big Data
  Security
  Poor planning

Real Examples of Big Data Projects
  Consumer product companies and retail organizations are monitoring social media like Facebook and Twitter to get an unprecedented view into customer behavior, preferences, and product perception.
  Manufacturers are monitoring minute vibration data from their equipment, which changes slightly as it wears down, to predict the optimal time to replace or maintain it. Replacing it too soon wastes money; replacing it too late triggers an expensive work stoppage.
  Manufacturers are also monitoring social networks, but with a different goal than marketers: they are using them to detect aftermarket support issues before a warranty failure becomes publicly detrimental.
  Financial services organizations are using data mined from customer interactions to slice and dice their users into finely tuned segments. This enables these financial institutions to create increasingly relevant and sophisticated offers.

Continuation
  Advertising and marketing agencies are tracking social media to understand responsiveness to campaigns, promotions, and other advertising mediums.
  Insurance companies are using Big Data analysis to see which home insurance applications can be immediately processed, and which ones need a validating in-person visit from an agent.
  By embracing social media, retail organizations are engaging brand advocates, changing the perception of brand antagonists, and even enabling enthusiastic customers to sell their products.
  Hospitals are analyzing medical data and patient records to predict which patients are likely to seek readmission within a few months of discharge. The hospital can then intervene in hopes of preventing another costly hospital stay.
  Web-based businesses are developing information products that combine data gathered from customers to offer more appealing recommendations and more successful coupon programs.
  The government is making data public at the national, state, and city levels for users to develop new applications that can generate public good.
  Sports teams are using data for tracking ticket sales and even for tracking team strategies.

Starting Big Data Projects
  NYTD (National Youth in Transition Database)
  Documentation search
  Dynamic SQL table
  WWW log files
  Health care: extracting names, locations, dates, products, diseases, Rx, conditions, etc., from text

NYTD (National Youth in Transition Database)
  Data collection system to track the following:
  First, States are to collect information on each youth who receives independent living services paid for or provided by the State agency that administers the CFCIP.
  Second, States are to collect demographic and outcome information on certain youth in foster care whom the State will follow over time to collect additional outcome information.
  States report the services they provide to all youth in eleven broad categories: independent living needs assessment; academic support; post-secondary educational support; career preparation; employment programs or vocational training; budget and financial management; housing education and home management training; health education and risk prevention; family support and healthy marriage education; mentoring; and supervised independent living.
  States will also report financial assistance they provide, including assistance for education, room and board, and other aid.
  States will survey youth regarding six outcomes: financial self-sufficiency, experience with homelessness, educational attainment, positive connections with adults, high-risk behavior, and access to health insurance.
  Every two years starting at age 17.
  Started 2010, 100 million records.
  Data source: XML format

Data Process
  Extracting names, locations, dates, products, diseases, Rx, conditions, etc., from text
  Topic tracking: track information of interest to a user
  Categorization: categorize a document based on word counts/synonyms, etc.
  Clustering: grouping similar documents
  Concept linking: relate documents based on shared concepts
  Question answering: try to find the best answer based on the user's environment

Software
  http:/hado
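The categorization step listed under Data Process (assigning a document to a category from word counts) might be sketched as follows. This is only a toy under stated assumptions: the CATEGORY_KEYWORDS table and the categorize function are hypothetical, and a real system would learn or curate its vocabulary and handle synonyms rather than rely on two hand-written keyword sets.

```python
from collections import Counter

# Hypothetical keyword sets, invented for illustration only.
CATEGORY_KEYWORDS = {
    "health": {"patient", "hospital", "disease", "treatment"},
    "finance": {"bank", "loan", "insurance", "payment"},
}

def categorize(text):
    """Return the category whose keywords occur most often, or None if none occur."""
    words = Counter(word.strip(".,;:").lower() for word in text.split())
    scores = {
        category: sum(words[keyword] for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(categorize("The hospital readmitted the patient after treatment."))
```

Counting keyword hits is the simplest possible scoring rule; it is used here only because the slide mentions word counts, and other scoring schemes would fit the same structure.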
