Cost Based optimization
Query optimization is the process of choosing the most efficient or the most favorable type of executing an SQL statement. Query optimization is an art of science for applying rules to rewrite the tree of operators that is invoked in a query and to produce an optimal plan. A plan is said to be optimal if it returns the answer in the least time or by using the least space.
For a given query and environment, the Optimizer allocates a cost in numerical form which is related to each step of a possible plan and then finds these values together to get a cost estimate for the plan or for the possible strategy. After calculating the costs of all possible plans, the Optimizer tries to choose a plan which will have the possible lowest cost estimate. For that reason, the Optimizer may be sometimes referred to as the Cost-Based Optimizer. Below are some of the features of the cost-based optimization-
- The cost-based optimization is based on the cost of the query that to be optimized.
- The query can use a lot of paths based on the value of indexes, available sorting methods, constraints, etc.
- The aim of query optimization is to choose the most efficient path of implementing the query at the possible lowest minimum cost in the form of an algorithm.
- The cost of executing the algorithm needs to be provided by the query Optimizer so that the most suitable query can be selected for an operation.
- The cost of an algorithm also depends upon the cardinality of the input.
To estimate the cost of different available execution plans or the execution strategies the query tree is viewed and studied as a data structure that contains a series of basic operation which are linked in order to perform the query. The cost of the operations that are present in the query depends on the way in which the operation is selected such that, the proportion of select operation that forms the output. It is also important to know the expected cardinality of an operation output. The cardinality of the output is very important because it forms the input to the next operation.
The cost of optimization of the query depends upon the following-
Cardinality is known to be the number of rows that are returned by performing the operations specified by the query execution plan. The estimates of the cardinality must be correct as it highly affects all the possibilities of the execution plan.
Selectivity refers to the number of rows that are selected. The selectivity of any row from the table or any table from the database almost depends upon the condition. The satisfaction of the condition takes us to the selectivity of that specific row. The condition that is to be satisfied can be any, depending upon the situation.
Cost refers to the amount of money spent on the system to optimize the system. The measure of cost fully depends upon the work done or the number of resources used.
The first step is to use ANALYZE TABLE COMPUTE STATISTICS SQL command to compute table statistics. Use DESCRIBE EXTENDED SQL command to inspect the statistics.
The table statistics can be computed for tables, partitions, and columns and are as follows-
- Total size (in bytes) of a table or table partitions.
- Row count of a table or table partitions.
- Column statistics like min, max, num_nulls, distinct_count, avg_col_len, max_col_len, histogram.
ANALYZE TABLE COMPUTE STATISTICS SQL Command:
Cost-Based Optimization uses the statistics stored in a meta store i.e. external catalog using ANALYZE TABLE SQL command-
ANALYZE TABLE tableIdentifier partitionSpec; COMPUTE STATISTICS (NOSCAN | FOR COLUMNS identifierSeq);
Depending on the variant, ANALYZE TABLE computes different statistics, i.e. of a table, partitions, or columns-
- ANALYZE TABLE with neither PARTITION specification nor FOR COLUMNS clause.
- ANALYZE TABLE with PARTITION specification (but no FOR COLUMNS clause).
- ANALYZE TABLE with FOR COLUMNS clause (but no PARTITION specification).
DESCRIBE EXTENDED SQL Command:
The statistics of a table can be viewed, partitions, or a column (stored in a meta store) using DESCRIBE EXTENDED SQL command-
(DESC | DESCRIBE) TABLE? (EXTENDED | FORMATTED); tableIdentifier partitionSpec? describeColName;
Cost Components Of Query Execution:
The following are the cost components of the execution of a query-
- Access cost to secondary storage-
This can be the cost of searching, reading, or writing data blocks that originally found on the secondary storage, especially on the disk. The cost of searching for records in a file also depends upon the type of access structure that file has.
- Memory usage cost-
The cost of memory usage can be calculated simply by using the number of memory buffers that are needed for the execution of the query.
- Storage cost-
The storage cost is the cost of storing any intermediate files(files that are the result of processing the input but are not exactly the result) that are generated by the execution strategy for the query.
- Computational cost-
This is the cost of performing the memory operations that are available on the record within the data buffers. Operations like searching for records, merging records, or sorting records. This can also be called the CPU cost.
- Communication cost-
This is the cost that is associated with sending or communicating the query and its results from one place to another. It also includes the cost of transferring the table and results to the various sites during the process of query evaluation.
Issues In Cost-Based Optimization:
The following are the issues in cost-based optimization-
- In cost-based optimization, the number of execution strategies that can be considered is not really fixed. The number of execution strategies may vary based on the situation.
- Sometimes, this process is really very time-consuming to cost because it does not always guarantee finding the best optimal strategy
- It is an expensive process.