Database Management System - Chapter 16: Data Warehouse Technology and Management

Outline Basic concepts and characteristics Business architectures and applications Data cube concepts and operators Relational DBMS features Maintaining a data warehouse

ppt56 trang | Chia sẻ: candy98 | Lượt xem: 417 | Lượt tải: 0download
Bạn đang xem trước 20 trang tài liệu Database Management System - Chapter 16: Data Warehouse Technology and Management, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
Chapter 16Data Warehouse Technology and ManagementOutline Basic concepts and characteristicsBusiness architectures and applicationsData cube concepts and operatorsRelational DBMS featuresMaintaining a data warehouseComparison of EnvironmentsTransaction processingUses operational databasesShort-term decisions: fulfill orders, resolve complaints, provide staffingDecision support processingUses integrated and summarized dataMedium and long-term decisions: capacity planning, store locations, new lines of businessDefinition and CharacteristicsA central repository for summarized and integrated data from operational databases and external data sources Key CharacteristicsSubject-orientedIntegratedTime-variantNonvolatile Data ComparisonArchitectures and ApplicationsData warehouse projectsTop-down architecturesBottom-up architectureApplications and data miningData Warehouse ProjectsLarge efforts with much coordination across departmentsEnterprise data modelImportant artifact of data warehouse projectStructure of data modelMeta data for data transformationTop-down vs. bottom-up business architectures Two Tier Architecture Three Tier Architecture Bottom-up Architecture ApplicationsMaturity ModelGuidance for investment decisionsStages provide a framework to view an organization’s progressInsights: difficulty moving between stagesInfant to child stages because of investment levelTeenager to adult because of strategic importance of data warehouseData MiningDiscover significant, implicit patternsTarget promotionsChange mix and collocation of itemsRequires large volumes of transaction dataImportant application for data warehousesData Cube Concepts and OperatorsBasicsDimension and measure detailsOperatorsData Cube BasicsMultidimensional arrangement of dataUsers think about decision support data as data cubesTerminologyDimension: subject label for a row or columnMember: value of dimensionMeasure: quantitative data stored in cells Data Cube ExampleDimensions and MeasuresDimensionsHierarchies: members can have sub membersSparsity: many cells do not have dataMeasuresDerived measuresMultiple measures in cellsTime Series DataCommon data type in trend analysisReduce dimensionality using time seriesTime series propertiesData typeStart dateCalendarPeriodicityConversionSlice OperatorFocus on a subset of dimensionsSet dimension to specific value: 1/1/2006Dice OperatorFocus on a subset of member valuesReplace dimension with a subset of valuesDice operation often follows a slice operationOther OperatorsOperators for hierarchical dimensionsDrill-down: add detail to a dimensionRoll-up: remove detail from a dimensionRecalculate measure valuesPivot: rearrange dimensionsOperator Summary OperatorPurposeDescriptionSliceFocus attention on a subset of dimensions Replace a dimension with a single member value or with a summary of its measure values DiceFocus attention on a subset of member values Replace a dimension with a subset of members Drill-downObtain more detail about a dimension Navigate from a more general level to a more specific levelRoll-upSummarize details about a dimension Navigate from a more specific level to a more general levelPivotPresent data in a different orderRearrange the dimensions in a data cube Relational DBMS SupportData modelingDimension representationGROUP BY extensionsMaterialized views and query rewritingStorage structures and optimizationRelational Data ModelingDimension table: contains member valuesFact table: contains measure values1-M relationships from dimension to fact tablesGrain: most detailed measure values stored Star Schema ExampleConstellation Schema Snowflake Schema Example Handling M-N RelationshipsSource data may have M-N relationships, not 1-M relationshipsAdjust fact or dimension tables for a fixed number of exceptionsMore complex solutions to support M-N relationships with a variable number of connectionsTime RepresentationTimestampTime dimension table for organization specific calendar featuresTwo fact tables for international operationsAccumulating fact table for representation of multiple eventsLevel of Historical IntegrityPrimarily an issue for dimension updatesType I: overwrite old valuesType II: version numbers for an unlimited historyType III: new columns for a limited historyHistorical Integrity Example Dimension RepresentationStar schema and variations lack dimension representationExplicit dimension representation important to data cube operations and optimizationProprietary extensions for dimension representationRepresent levels, hierarchies, and constraintsOracle Dimension RepresentationLevels: dimension componentsHierarchies: may have multiple hierarchiesConstraints: functional dependency relationshipsCREATE DIMENSION ExampleCREATE DIMENSION StoreDim LEVEL StoreId IS Store.StoreId LEVEL City IS Store.StoreCity LEVEL State IS Store.StoreState LEVEL Zip IS Store.StoreZip LEVEL Nation IS Store.StoreNation LEVEL DivId IS Division.DivId HIERARCHY CityRollup ( StoreId CHILD OF City CHILD OF State CHILD OF Nation )HIERARCHY ZipRollup ( StoreId CHILD OF Zip CHILD OF State CHILD OF Nation )HIERARCHY DivisionRollup ( StoreId CHILD OF DivId JOIN KEY Store.DivId REFERENCES DivId )ATTRIBUTE DivId DETERMINES Division.DivNameATTRIBUTE DivId DETERMINES Division.DivManager ;GROUP BY ExtensionsROLLUP operatorCUBE operatorGROUPING SETS operatorOther extensionsRankingRatiosMoving summary valuesCUBE ExampleSELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND (StoreNation = 'USA' OR StoreNation = 'Canada') AND TimeYear = 2005 GROUP BY CUBE (StoreZip, TimeMonth)ROLLUP ExampleSELECT TimeMonth, TimeYear, SUM(SalesDollar) AS SumSales FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND (StoreNation = 'USA' OR StoreNation = 'Canada') AND TimeYear BETWEEN 2005 AND 2006 GROUP BY ROLLUP (TimeMonth,TimeYear);GROUPING SETS ExampleSELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND (StoreNation = 'USA' OR StoreNation = 'Canada') AND TimeYear = 2005 GROUP BY GROUPING SETS((StoreZip, TimeMonth), StoreZip, TimeMonth, ());Variations of the Grouping OperatorsPartial cubePartial rollupComposite columnsCUBE and ROLLUP inside a GROUPIING SETS operationMaterialized ViewsStored viewPeriodically refreshed with source dataUsually contain summary dataFast query response for summary dataAppropriate in query dominant environmentsMaterialized View ExampleCREATE MATERIALIZED VIEW MV1BUILD IMMEDIATEREFRESH COMPLETE ON DEMANDENABLE QUERY REWRITE ASSELECT StoreState, TimeYear, SUM(SalesDollar) AS SUMDollar1 FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND TimeYear > 2003 GROUP BY StoreState, TimeYear;Query RewritingSubstitution processMaterialized view replaces references to fact and dimension tables in a query Query optimizer must evaluate whether the substitution will improve performance over the original query More complex than query modification process for traditional viewsQuery Rewriting Process Query Rewriting MatchingRow conditions: query conditions at least as restrictive as MV conditionsGrouping detail: query grouping columns at least as general as MV grouping columnsGrouping dependencies: query columns must match or be derivable by functional dependencies Aggregate functions: query aggregate functions must match or be derivable from MV aggregate functionsQuery Rewriting Example-- Data warehouse querySELECT StoreState, TimeYear, SUM(SalesDollar) FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND StoreNation IN ('USA','Canada') AND TimeYear = 2005 GROUP BY StoreState, TimeYear;-- Query Rewrite: replace Sales and Time tables with MV1SELECT DISTINCT MV1.StoreState, TimeYear, SumDollar1FROM MV1, StoreWHERE MV1.StoreState = Store.StoreState AND TimeYear = 2005 AND StoreNation IN ('USA','Canada');Storage and Optimization TechnologiesMOLAP: direct storage and manipulation of data cubesROLAP: relational extensions to support multidimensional dataHOLAP: combine MOLAP and ROLAP storage enginesROLAP TechniquesBitmap join indexesStar join optimizationQuery rewritingSummary storage advisorsParallel query executionMaintaining a Data WarehouseData sourcesWorkflow representationOptimizing the refresh processData SourcesCooperative: Notification using triggers Requires source system changesLoggedReadily availableExtraneous data in logsQueryableQueries using timestampsRequires timestamps in source dataSnapshotPeriodic dumps of source dataSignificant processing for difference operationsMaintenance Workflow Data Quality ProblemsMultiple identifiersMultiple field namesDifferent unitsMissing valuesOrphaned valuesMultipurpose fieldsConflicting dataDifferent update timesETL ToolsExtraction, Transformation, and LoadingSpecification basedEliminate custom codingThird party and DBMS based toolsRefresh Processing Determining the Refresh FrequencyMaximize net refresh benefitValue of data timelinessCost of refreshSatisfy data warehouse and source system constraintsRefresh ConstraintsSource access: restrictions on time and frequencyIntegration: restrictions that require concurrent reconciliationCompleteness/consistency: loading in the same refresh periodAvailability: load scheduling restrictions due to storage capacity, online availability, and server usageSummaryData warehouse requirements differ from transaction processing.Architecture choice is important.Multidimensional data model is intuitiveRelational representation and storage techniques are significant.Maintaining a data warehouse is an important, operational problem.