Database Management System - Chapter 8: Physical Database Design
Overview of Physical Database Design File Structures Query Optimization Index Selection Additional Choices in Physical Database Design
Bạn đang xem trước 20 trang tài liệu Database Management System - Chapter 8: Physical Database Design, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
Chapter 8Physical Database DesignOutline Overview of Physical Database DesignFile Structures Query Optimization Index SelectionAdditional Choices in Physical Database DesignOverview of Physical Database DesignSequence of decision-making processes.Decisions involve the storage level of a database: file structure and optimization choices.Importance of providing detailed inputs and using tools that are integrated with DBMS usage of file structures and optimization decisionsStorage Level of DatabasesClosest to the hardware and operating system.Physical records organized into files.The number of physical record accesses is an important measure of database performance.Difficult to predict physical record accesses Logical Records (LR) and Physical Records (PR) Transferring Physical Records Objectives Minimize response time to access and change a database.Minimizing computing resources is a substitute measure for response time. Database resourcesPhysical record transfersCPU operationsCommunication network usage (distributed processing)ConstraintsMain memory and disk spaceMinimizing main memory and disk space can lead to high response times. Useful to consider additional memory and disk space Combined Measure of Database Performance Weight combines physical record accesses and CPU usageWeight is usually close to 0 Mmany CPU operations can be performed in the time to perform one physical record transfer. Inputs, Outputs, and Environment Difficulty of physical database designNumber of decisionsRelationship among decisionsDetailed inputsComplex environmentUncertainty in predicting physical record accessesInputs of Physical Database Design Physical database design requires inputs specified in sufficient detail.Table profiles used to estimate performance measures.Application profiles provide importance of applications.Table ProfileTablesNumber of rowsNumber of physical recordsColumnsNumber of unique valuesDistribution of valuesCorrelation of columnsRelationships: distribution of related rowsHistogramSpecify distribution of valuesTwo dimensional graphColumn values on the x axisNumber of rows on the y axisVariationsEqual-width: do not work well with skewed dataEqual-height: control error by the number of ranges Equal-Width Histogram Equal-Height Histogram Application profiles Application profiles summarize the queries, forms, and reports that access a database. File structuresSelecting among alternative file structures is one of the most important choices in physical database design.In order to choose intelligently, you must understand characteristics of available file structures. Sequential Files Simplest kind of file structure Unordered: insertion orderOrdered: key orderSimple to maintainProvide good performance for processing large numbers of recordsUnordered Sequential File Ordered Sequential FileHash FilesSupport fast access by unique key value Convert a key value into a physical record addressMod function: typical hash functionDivisor: large prime number close to the file capacityPhysical record number: hash function plus the starting physical record numberExample: Hash Function Calculations for StdSSN Key Hash File after Insertions Collision Handling ExampleHash File LimitationsPoor performance for sequential search Reorganization when capacity exceeds 70%Dynamic hash files reduce random search performance but eliminate periodic reorganizationMulti-Way Tree (Btrees) Files A popular file structure supported by most DBMSs.Btree provides good performance on both sequential search and key search.Btree characteristics:BalancedBushy: multi-way treeBlock-orientedDynamicStructure of a Btree of Height 3 Btree Node Containing Keys and Pointers Btree Insertion Examples Btree Deletion Examples Cost of Operations The height of Btree dominates the number of physical record accesses operation.Logarithmic search costUpper bound of height: log functionLog base: minimum number of keys in a nodeInsertion cost Cost to locate the nearest key Cost to change nodes B+Tree Provides improved performance on sequential and range searches.In a B+tree, all keys are redundantly stored in the leaf nodes.To ensure that physical records are not replaced, the B+tree variation is usually implemented.B+tree Illustration Index Matching Determining usage of an index for a queryComplexity of condition determines match.Single column indexes: =, , =, IN , BETWEEN, IS NULL, LIKE ‘Pattern’ (meta character not the first symbol)Composite indexes: more complex and restrictive rulesIndex Matching ExamplesC2 BETWEEN 10 and 20: match on C2C3 IN (10,20): match on C3C1 10: no matchC4 LIKE 'A%‘: match on C4C4 LIKE '%A‘: no matchC2 = 5 AND C3 = 20 AND C1 = 10: matches on index with C1, C2, and C3Bitmap Index Can be useful for stable columns with few valuesBitmap: String of bits: 0 (no match) or 1 (match)One bit for each rowBitmap index recordColumn valueBitmapDBMS converts bit position into row identifier. Bitmap Index Example Faculty TableBitmap Index on FacRankBitmap Join Index Bitmap identifies rows of a related table.Represents a precomputed join Can define for a join column or a non-join columnTypically used in query dominated environments such as data warehouses (Chapter 16)Summary of File Structures Query Optimization Query optimizer determines implementation of queries.Major improvement in software productivityImprove performance with knowledge about the optimization processTranslation Tasks Access Plan Evaluation Optimizer evaluates thousands of access plans Access plans vary by join order, file structures, and join algorithm. Some optimizers can use multiple indexes on the same table.Access plan evaluation can consume significant resourcesAccess Plan Example 1 Access Plan Example 2 Join Algorithms Nested loops: inner and outer loops; universalSort merge: join column sorting or indexesHybrid join: combination of nested loops and sort mergeHash join: uses internal hash tableStar join: uses bitmap join indexesImproving Optimization Results Monitor poorly performing access plansLook for problems involving table profiles and query coding practices Use hints carefully to improve resultsOverride optimizer judgmentCover file structures, join algorithms, and join ordersUse as a last resultTable Profile Deficiencies Detailed and current statistics neededBeware of uniform value assumption and independence assumption Use hints to overcome optimization blind spotsEstimation of result size for parameterized queriesCorrelated columns: multiple index access may be useful Query Coding PracticesAvoid functions on indexable columnsEliminate unnecessary joinsFor conditions on join columns, test the condition on the parent table.Do not use the HAVING clause for row conditions.Avoid repetitive binding of complex queriesBeware of queries that use complex viewsIndex Selection Most important decisionDifficult decisionChoice of clustered and nonclustered indexes Clustering Index Example Nonclustering Index Example Inputs and Outputs of Index Selection Trade-offs in Index SelectionBalance retrieval against update performanceNonclustering index usage:Few rows satisfy the condition in the queryJoin column usage if a small number of rows result in child tableClustering index usage:Larger number of rows satisfy a condition than for nonclustering indexUse in sort merge join algorithm to avoid sortingMore expensive to maintainDifficulties of Index Selection Application weights are difficult to specify.Distribution of parameter values needed Behavior of the query optimization component must be known. The number of choices is large. Index choices can be interrelated. Selection Rules IRule 1: A primary key is a good candidate for a clustering index. Rule 2: To support joins, consider indexes on foreign keys.Rule 3: A column with many values may be a good choice for a non-clustering index if it is used in equality conditions.Rule 4: A column used in highly selective range conditions is a good candidate for a non-clustering index.Rule 5: A combination of columns used together in query conditions may be good candidates for nonclustering indexes if the joint conditions return few rows, the DBMS optimizer supports multiple index access, and the columns are stable.Selection Rules II Rule 6: A frequently updated column is not a good index candidate.Rule 7: Volatile tables (lots of insertions and deletions) should not have many indexes.Rule 8: Stable columns with few values are good candidates for bitmap indexes if the columns appear in WHERE conditions.Rule 9: Avoid indexes on combinations of columns. Most optimization components can use multiple indexes on the same table. Index CreationTo create the indexes, the CREATE INDEX statement can be used.The word following the INDEX keyword is the name of the index.CREATE INDEX is not part of SQL:2003.Examples Denormalization Additional choice in physical database designDenormalization combines tables so that they are easier to query.Use carefully because normalized designs have important advantages. Normalized designsBetter update performance Require less coding to enforce integrity constraintsSupport more indexes to improve query performanceRepeating Groups Collection of associated values.Normalization rules force repeating groups to be stored in an M table separate from an associated one table.If a repeating group is always accessed with its associated parent table, denormalization may be a reasonable alternative. Denormalizing a Repeating Group Denormalizing a Generalization Hierarchy Codes and Meanings Record Formatting Record formatting decisions involve compression and derived data.Compression is a trade-off between input-output and processing effort. Derived data is a trade-offs between query and update operations. Storing Derived Data toImprove Query Performance Parallel Processing Parallel processing can improve retrieval and modification performance.Retrieving many records can be improved by reading physical records in parallel. Many DBMSs provide parallel processing capabilities with RAID systems.RAID is a collection of disks (a disk array) that operates as a single disk. Striping in RAID Storage Systems Other Ways to Improve Performance Transaction processing: add computing capacity and improve transaction design. Data warehouses: add computing capacity and store derived data. Distributed databases: allocate processing and data to various computing locations. SummaryGoal: minimize response timeConstraints: disk space, memory, communication bandwidthTable profiles and application profiles must be specified in sufficient detail.Environment: file structures and query optimizationMonitor and possibly improve query optimization resultsIndex selection: most important decisionOther techniques: denormalization, record formatting, and parallel processing