Storage Architecture and Challenges
Faculty Summit, July 29, 2010
Andrew Fikes, Principal Engineer

Introductory Thoughts
- Google operates planet-scale storage systems
- What keeps us programming:
  - Enabling application developers
  - Improving data locality and availability
  - Improving performance of shared storage
- A note from the trenches: "You know you have a large storage system when you get paged at 1 AM because you only have a few petabytes of storage left."

The Plan for Today
- Storage Landscape
- Storage Software and Challenges
- Questions (15 minutes)

Storage Landscape: Hardware
- A typical warehouse-scale computer:
  - 10,000+ machines, 1 GB/s networking
  - 6 x 1 TB disk drives per machine
- What has changed:
  - Cost of a GB of storage is lower
  - Impact of machine failures is higher
  - Machine throughput is higher
- What has not changed:
  - Latency of an RPC
  - Disk drive throughput and seek latency
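As a rough sense of scale, the machine and disk counts above multiply out as follows. This is a back-of-envelope sketch only; the per-cluster totals and the 3x-replication assumption (taken from the GFS slide later in the deck) are not stated on this slide.

```python
# Back-of-envelope capacity for the warehouse-scale computer described above.
machines = 10_000          # "10,000+ machines"
disks_per_machine = 6      # "6 x 1 TB disk drives per machine"
tb_per_disk = 1

raw_pb = machines * disks_per_machine * tb_per_disk / 1000   # ~60 PB raw per cluster
usable_pb = raw_pb / 3     # assuming 3x replication, as GFS typically uses

print(f"raw: {raw_pb:.0f} PB, usable at 3x replication: ~{usable_pb:.0f} PB")
```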
Storage Landscape: Development
- Product success depends on:
  - Development speed
  - End-user latency
- Application programmers:
  - Never ask simple questions of the data
  - Change their data access patterns frequently
  - Build and use APIs that hide storage requests
  - Expect uniformity of performance
  - Need strong availability and consistent operations
  - Need visibility into distributed storage requests

Storage Landscape: Applications
- Early Google:
  - US-centric traffic
  - Batch, latency-insensitive indexing processes
  - Document snippet serving (single seek)
- Current day:
  - World-wide traffic
  - Continuous crawl and indexing processes (Caffeine)
  - Seek-heavy, latency-sensitive apps (Gmail)
  - Person-to-person, person-to-group sharing (Docs)

Storage Landscape: Flash (SSDs)
- Important future direction:
  - Our workloads are increasingly seek-heavy
  - 50-150x less expensive than disk per random read
  - Best usages are still being explored
- Concerns:
  - Availability of devices
  - 17-32x more expensive per GB than disk
  - Endurance not yet proven in the field

Storage Landscape: Shared Data
- Scenario:
  - Roger shares a blog with his 100,000 followers
  - Rafa follows Roger and all other ATP players
  - Rafa searches all the blogs he can read
- To make search fast, do we copy data to each user? (see the sketch below)
  - YES: Huge fan-out on update of a document
  - NO: Huge fan-in when searching documents
- To make things more complicated:
  - Freshness requirements
  - Heavily-versioned documents (e.g. Google Wave)
  - Privacy restrictions on data placement
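The fan-out/fan-in trade-off above can be made concrete with a toy model. This is a minimal sketch of the two extremes, not anything from the talk; the class names are made up. `CopyOnWrite` pushes each shared document into every reader's private index (fast search, expensive updates), while `QueryTimeJoin` keeps one copy per author and fans in at search time (cheap updates, expensive searches).

```python
from collections import defaultdict

class CopyOnWrite:
    """Write a document into every follower's index: O(followers) per update."""
    def __init__(self):
        self.per_user_index = defaultdict(list)

    def publish(self, doc, followers):
        for user in followers:              # huge fan-out for popular authors
            self.per_user_index[user].append(doc)

    def search(self, user, term):
        return [d for d in self.per_user_index[user] if term in d]

class QueryTimeJoin:
    """Keep one copy per author and fan in across followed authors at query time."""
    def __init__(self):
        self.per_author_docs = defaultdict(list)

    def publish(self, doc, author):         # O(1) per update
        self.per_author_docs[author].append(doc)

    def search(self, followed_authors, term):
        hits = []
        for author in followed_authors:     # huge fan-in for heavy readers
            hits += [d for d in self.per_author_docs[author] if term in d]
        return hits
```

Real systems land somewhere in between (for example, copying only hot documents), which is exactly the tension this slide points at, made harder by the freshness, versioning and privacy constraints listed above.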
Storage Landscape: Legal
- Laws and interpretations are constantly changing
- Governments have data privacy requirements
- Companies have email and document retention policies
- Sarbanes-Oxley (SOX) adds audit requirements
- Things to think about:
  - Major impact on storage design and performance
  - Are these storage-level or application-level features?
  - Versioning of collaborative documents

Storage Software: Google's Stack
- Tiered software stack:
  - Node: Exports and verifies disks
  - Cluster: Ensures availability within a cluster
    - File system (GFS/Colossus), structured storage (Bigtable)
    - 2-10% disk drive annualized failure rate
  - Planet: Ensures availability across clusters
    - Blob storage, structured storage (Spanner)
    - 1 cluster event per quarter (planned or unplanned)

Storage Software: Node Storage
- Purpose: Export disks on the network
- Building block for higher-level storage
- Single spot for tuning disk access performance
- Management of node addition, repair and removal
- Provides per-user resource accounting (e.g. I/O ops)
- Enforces resource sharing across users (see the sketch below)
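One common way to enforce the per-user accounting and sharing that the node layer is responsible for is a token bucket per user. The sketch below is a generic illustration of that idea, not a description of Google's actual node-storage implementation; the user names, rates, and the `do_disk_io` helper are all made up.

```python
import time

class TokenBucket:
    """Per-user I/O budget: admit an op only if the user still has tokens."""
    def __init__(self, ops_per_sec, burst):
        self.rate = ops_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def admit(self, cost=1):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True          # serve the disk op
        return False             # throttle or queue the op

def do_disk_io(op):
    """Stand-in for the real disk read/write."""
    return f"completed {op}"

# One bucket per user sharing the node's disks.
buckets = {"gmail": TokenBucket(ops_per_sec=300, burst=50),
           "mapreduce": TokenBucket(ops_per_sec=100, burst=200)}

def serve(user, op):
    if buckets[user].admit():
        return do_disk_io(op)
    return None                  # over budget: caller queues or retries later
```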
Storage Software: GFS
- The basics:
  - Our first cluster-level file system (2001)
  - Designed for batch applications with large files
  - Single master for metadata and chunk management
  - Chunks are typically replicated 3x for reliability
- GFS lessons:
  - Scaled to approximately 50M files, 10 PB
  - Large files increased upstream application complexity
  - Not appropriate for latency-sensitive applications
  - Scaling limits added management overhead
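A single-master, chunk-based design implies a read path roughly like the sketch below: ask the master where a chunk lives, then fetch the bytes from one of its replicas. This is a simplified illustration in the spirit of the published GFS design, not the real GFS client API; `master`, `chunkservers`, and their methods are invented names.

```python
CHUNK_SIZE = 64 * 2**20   # GFS used 64 MB chunks

def read(master, chunkservers, path, offset, length):
    """Simplified single-master read: metadata from the master, data from a replica.
    Assumes the read fits inside one chunk."""
    chunk_index = offset // CHUNK_SIZE
    # 1. One RPC to the master for the chunk handle and its replica locations.
    handle, replicas = master.lookup(path, chunk_index)      # typically 3 replicas
    # 2. Read the bytes from any replica (a real client prefers a nearby one).
    for server in replicas:
        try:
            return chunkservers[server].read(handle, offset % CHUNK_SIZE, length)
        except IOError:
            continue                                          # try the next replica
    raise IOError(f"all replicas of chunk {handle} unavailable")
```

Keeping all metadata on one master is what made the design simple, and also what produced the scaling limits and latency lessons listed above.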
Storage Software: Colossus
- Next-generation cluster-level file system
- Automatically sharded metadata layer
- Data typically written using Reed-Solomon encoding (1.5x)
- Client-driven replication and encoding
- Metadata space has enabled availability analyses
- Why Reed-Solomon?
  - Cost, especially with cross-cluster replication
  - Field data and simulations show improved MTTF
  - More flexible cost vs. availability choices
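The 1.5x figure falls straight out of the code parameters: a Reed-Solomon code that splits a block into k data chunks and adds m parity chunks has storage overhead (k+m)/k, and any k of the k+m chunks can reconstruct the data. The example below uses RS(6,3) as one possible 1.5x code; the slide does not say which parameters Colossus actually uses.

```python
def rs_overhead(k_data, m_parity):
    """Storage overhead of an RS(k_data, m_parity) code vs. the raw data size."""
    return (k_data + m_parity) / k_data

print(rs_overhead(6, 3))    # 1.5x, matching the figure above
print(rs_overhead(1, 2))    # 3.0x, i.e. plain 3x replication, for comparison
# An RS(6,3) stripe tolerates the loss of any 3 of its 9 chunks at 1.5x cost,
# whereas 3x replication tolerates 2 lost copies at 3x cost -- the "flexible
# cost vs. availability" trade-off mentioned above.
```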
Storage Software: Availability
- Tidbits from our Storage Analytics team:
  - Most events are transient and short (90% last less than 15 minutes)
  - Uncorrelated failures, R=2 to R=3: MTTF grows by 3500x
  - Correlated failures, R=2 to R=3: MTTF grows by 11x
- Source: D. Ford, F. Popovici, M. Stokely, V.-A. Truong, F. Labelle, L. Barroso, S. Quinlan, C. Grimes (Google Storage Analytics team)
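The intuition behind those numbers can be sketched with a toy independence model (my illustration, not the team's actual model): if replicas failed independently, each extra replica would multiply MTTF by roughly the ratio of time-between-failures to recovery time, which is huge; correlated events such as a rack or cluster outage take out several replicas at once, so the real gain is far smaller.

```python
# Toy model: why extra replicas help enormously under *independent* failures.
# The numbers below are made up for illustration only.
mtbf_hours = 24 * 365          # a node fails roughly once a year
recover_hours = 2              # time to re-replicate after a failure

# Probability a given replica is "down" at a random instant.
p_down = recover_hours / (mtbf_hours + recover_hours)

for r in (1, 2, 3):
    # Data is unavailable only if all R replicas are down at once (independence!).
    print(f"R={r}: P(unavailable) ~ {p_down ** r:.2e}")

# In this model, going from R=2 to R=3 divides P(unavailable) by ~1/p_down,
# i.e. thousands of x. Correlated failures break the independence assumption,
# which is why the measured gain above is only ~11x.
```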
Storage Software: Bigtable
- The basics:
  - Cluster-level structured storage (2003)
  - Exports a distributed, sparse, sorted map (see the sketch below)
  - Splits and rebalances data based on size and load
  - Asynchronous, eventually-consistent replication
  - Uses GFS or Colossus for file storage
- The lessons:
  - Hard to share distributed storage resources
  - Distributed transactions are badly needed
  - Application programmers want synchronous replication
  - Users want a structured query language (e.g. SQL)
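The Bigtable data model is often summarized as a map from (row, column, timestamp) to a value, kept sorted by row key. The toy in-memory class below is my illustration of that data model, not the Bigtable API; in the real system the map is split into tablets spread across many servers.

```python
import bisect
from collections import defaultdict

class SparseSortedMap:
    """Toy version of the Bigtable data model:
    (row, column, timestamp) -> value, with row keys kept in sorted order."""

    def __init__(self):
        self._rows = []                                  # sorted row keys
        self._cells = defaultdict(dict)                  # row -> {(col, ts): value}

    def write(self, row, col, ts, value):
        if row not in self._cells:
            bisect.insort(self._rows, row)
        self._cells[row][(col, ts)] = value

    def read(self, row, col):
        """Return the most recent value for (row, col), or None."""
        versions = [(ts, v) for (c, ts), v in self._cells[row].items() if c == col]
        return max(versions)[1] if versions else None

    def scan(self, start_row, end_row):
        """Yield rows in [start_row, end_row) in order -- the sortedness is what
        makes range scans over related rows cheap."""
        i = bisect.bisect_left(self._rows, start_row)
        while i < len(self._rows) and self._rows[i] < end_row:
            yield self._rows[i], self._cells[self._rows[i]]
            i += 1

t = SparseSortedMap()
t.write("com.google/index.html", "contents", ts=1, value="<html>v1")
t.write("com.google/index.html", "contents", ts=2, value="<html>v2")
print(t.read("com.google/index.html", "contents"))       # -> "<html>v2"
```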
Storage Challenge: Sharing
- Simple goal: Share storage to reduce costs
- Typical scenario:
  - Pete runs video encoding using CPU & local disk
  - Roger runs a MapReduce that does heavy GFS reads
  - Rafa runs seek-heavy Gmail on Bigtable w/ GFS
  - Andre runs seek-heavy Docs on Bigtable w/ GFS
- Things that go wrong:
  - Distribution of disks being accessed is not uniform
  - Non-storage system usage impacts CPU and disk
  - MapReduce impacts disks and the buffer cache
  - Gmail and Buzz both need hundreds of seeks NOW

Storage Challenge: Sharing (cont.)
- How do we:
  - Measure and enforce usage? Locally or globally?
  - Reconcile isolation needs across users and systems?
  - Define, implement and measure SLAs?
  - Tune workload-dependent parameters (e.g. initial chunk creation)?

Storage Software: BlobStore
- The basics:
  - Planet-scale storage for large, immutable blobs
  - Examples: photos, videos, and email attachments
  - Built on top of the Bigtable storage system
  - Manual, access- and auction-based data placement
- Reduces costs by:
  - De-duplicating data chunks (see the sketch below)
  - Adjusting replication for cold data
  - Migrating data to cheaper storage
- Fun statistics:
  - Duplication percentages: 55% for Gmail, 2% for Video
  - 90% of Gmail attachment reads hit data < 21 days old
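Chunk-level de-duplication is usually done by content addressing: split each blob into chunks, hash each chunk, and store a chunk only the first time its hash is seen. The sketch below is a generic illustration of that idea, not BlobStore's actual design; chunking here is fixed-size for brevity, where real systems often use content-defined boundaries.

```python
import hashlib

CHUNK = 1 << 20                      # 1 MiB fixed-size chunks

chunk_store = {}                     # sha256 digest -> chunk bytes (stored once)
blob_index = {}                      # blob id -> list of chunk digests

def put_blob(blob_id, data: bytes):
    digests = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:        # duplicate chunks are stored only once
            chunk_store[digest] = chunk
        digests.append(digest)
    blob_index[blob_id] = digests

def get_blob(blob_id) -> bytes:
    return b"".join(chunk_store[d] for d in blob_index[blob_id])

# Two users store the same attachment: the second copy costs only index entries.
attachment = b"".join(bytes([i]) * CHUNK for i in range(3))   # 3 distinct chunks
put_blob("rafa/attachment-1", attachment)
put_blob("roger/attachment-7", attachment)   # same bytes, e.g. a forwarded email
assert len(chunk_store) == 3                 # 3 unique chunks stored, not 6
```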
Storage Software: Spanner
- The basics:
  - Planet-scale structured storage
  - Next generation of the Bigtable stack
  - Provides a single, location-agnostic namespace
  - Manual and access-based data placement
- Improved primitives:
  - Distributed cross-group transactions
  - Synchronous replication groups (Paxos)
  - Automatic failover of client requests

Storage Software: Data Placement
- End-user latency really matters
- Application complexity is lower when the application is close to its data
- Countries have legal restrictions on locating data
- Things to think about:
  - How do we migrate code with data?
  - How do we forecast, plan and optimize data moves?
  - "Your computer is always closer than the cloud."
Storage Software: Offline Access
- People want offline copies of their data
  - Improves speed, availability and redundancy
- Scenario:
  - Roger is keeping a spreadsheet with Rafa
  - Roger syncs a copy to his laptop and edits it
  - Roger wants to see the data on his laptop from his phone
- Things to think about:
  - Conflict resolution increases application complexity (see the sketch below)
  - Offline code is often very application-specific
  - Do users really need peer-to-peer synchronization?
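Conflict resolution is where much of that application complexity lives. The sketch below shows one of the simplest policies, a per-cell last-writer-wins merge for a spreadsheet synced on several devices; it is only an illustration of why this is application-specific, since a different application might need three-way merges or a user prompt instead.

```python
# Each device keeps (value, timestamp, device_id) per spreadsheet cell and syncs later.
Replica = dict            # cell key -> (value, timestamp, device_id)

def merge_lww(local: Replica, remote: Replica) -> Replica:
    """Per-cell last-writer-wins merge; device_id breaks timestamp ties."""
    merged = dict(local)
    for cell, (value, ts, dev) in remote.items():
        if cell not in merged or (ts, dev) > (merged[cell][1], merged[cell][2]):
            merged[cell] = (value, ts, dev)
    return merged

laptop = {"B2": ("=SUM(B1)", 105, "laptop")}          # edited offline on the plane
phone  = {"B2": ("42",       103, "phone"),
          "C1": ("notes",    110, "phone")}

print(merge_lww(laptop, phone))
# {'B2': ('=SUM(B1)', 105, 'laptop'), 'C1': ('notes', 110, 'phone')}
# Note the silent loss of the phone's older edit to B2 -- exactly the kind of
# behavior each application has to decide whether its users can live with.
```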
Questions
- Round tables at 4 PM:
  - Using Google's Computational Infrastructure: Brian Bershad & David Konerding
  - Planet-Scale Storage: Andrew Fikes & Yonatan Zunger
  - Storage, Large-Scale Data Processing, Systems: Jeff Dean

Additional Slides

Storage Challenge: Complexity
- Scenario: Read 10k from Spanner (sketched below)
  1. Lookup names of 3 replicas
  2. Lookup location of 1 replica
  3. Read data from replicas:
     1. Lookup data locations from GFS
     2. Read data from storage node:
        1. Read from Linux file system
- Layers:
  - Generate API impedance mismatches
  - Have numerous failure and queuing points
  - Make capacity and performance prediction super-hard
  - Make optimization and tuning very difficult
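Written as code, that read path is a stack of lookups, each of which is a separate RPC with its own failure and queuing behavior. The sketch below is purely illustrative; the object and method names are invented, but it makes the slide's layering point concrete.

```python
def spanner_read(universe, key, size=10 * 1024):
    group = universe.lookup_replica_group(key)        # 1. names of the 3 replicas
    replica = group.pick_closest_replica()            # 2. location of 1 replica
    return replica.read(key, size)                    # 3. read from that replica

# Inside replica.read(), the same pattern repeats one level down:
def bigtable_read(gfs, tablet_file, offset, size):
    chunks = gfs.lookup_chunk_locations(tablet_file)  # 3.1 data locations from GFS
    node = chunks.node_for(offset)
    return node.read(offset, size)                    # 3.2 read from a storage node,
                                                      #     which finally does a
                                                      #     Linux file-system read

# Every call above is an RPC: a place where requests can fail, queue behind other
# tenants, or add tail latency -- which is why layered systems are hard to model,
# predict, and tune.
```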
Storage Software: File Transfer
- Common instigators of data transfer:
  - Publishing production data (e.g. the base index)
  - Insufficient cluster capacity (disk or CPU)
  - System and software upgrades
- Moving data is:
  - Hard: many moving parts and different priorities
  - Expensive & time-consuming: networks are involved
- Our system:
  - Optimized for large, latency-insensitive networks
  - Uses large windows and constant-bit-rate UDP
  - Produces a smoother flow than TCP
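The core of a constant-bit-rate UDP sender is just pacing: send fixed-size datagrams on a fixed schedule instead of letting loss-based congestion control produce TCP's sawtooth. The sketch below shows only that pacing loop, none of the acknowledgement, retransmission, or large-window bookkeeping a real transfer system needs; the destination address and rate are placeholders.

```python
import socket
import time

def send_cbr(data: bytes, dest=("198.51.100.7", 9000), rate_bps=100_000_000,
             datagram=1400):
    """Pace fixed-size UDP datagrams at a constant bit rate."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    interval = datagram * 8 / rate_bps            # seconds between datagrams
    next_send = time.monotonic()
    for offset in range(0, len(data), datagram):
        sock.sendto(data[offset:offset + datagram], dest)
        next_send += interval
        delay = next_send - time.monotonic()
        if delay > 0:
            time.sleep(delay)                     # sleeping to schedule keeps the
                                                  # flow smooth: no sawtooth from
                                                  # loss-based congestion control
    sock.close()
```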