
Query Optimization in Relational Algebra


Query: A query is a request for information from a database. 

Query Plans: A query plan (or query execution plan) is an ordered set of steps used to access data in a SQL relational database management system. 

Query Optimization: A single query can be executed through different algorithms or re-written in different forms and structures. Hence, the question of query optimization arises: which of these forms or pathways is the most efficient? The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans. 

Importance: The goal of query optimization is to reduce the system resources required to fulfill a query, and ultimately provide the user with the correct result set faster. 

  • First, it provides the user with faster results, which makes the application seem faster to the user. 
  • Secondly, it allows the system to service more queries in the same amount of time, because each request takes less time than unoptimized queries. 
  • Thirdly, query optimization ultimately reduces the amount of wear on the hardware (e.g. disk drives), and allows the server to run more efficiently (e.g. lower power consumption, less memory usage). 
     

There are broadly two ways a query can be optimized: 

  1. Analyze and transform equivalent relational expressions: Try to minimize the number of tuples and columns in the intermediate and final results (discussed below).
  2. Using different algorithms for each operation: These underlying algorithms determine how tuples are accessed from the data structures they are stored in (indexing, hashing, data retrieval) and hence influence the number of disk and block accesses (discussed in query processing).


Analyze and transform equivalent relational expressions.
Here, we shall talk about generating minimal equivalent expressions. To analyze equivalent expressions, a set of equivalence rules is listed below. These generate equivalent expressions for a query written in relational algebra. To optimize a query, we repeatedly transform it into equivalent forms using these rules, as long as some rule applies. 

 

  1. Conjunctive selection operations can be written as a sequence of individual selections. This is called a sigma-cascade. 

    \sigma_{\theta_{1}\Lambda\theta_{2} }(E)=\sigma_{\theta_{1}}(\sigma_{\theta_{2}}(E))

    Explanation: Applying the combined condition \theta_{1} \Lambda \theta_{2} in a single pass can be expensive. Instead, filter out the tuples satisfying condition \theta_{2} (inner selection) and then apply condition \theta_{1} (outer selection) to the resulting, smaller set of tuples. This leaves fewer tuples to process the second time, and breaking the condition apart lets the optimizer reorder the individual selections freely. The rule extends to any number of conjuncts. Since we are breaking a single condition into a series of selections, it is called a "cascade". 
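As a rough sketch (relations modelled as Python lists of dicts, with invented toy data), the sigma-cascade rule can be checked directly: a single conjunctive selection and a cascade of two selections return the same tuples.

```python
# Selection (sigma) as a filter over a relation.
def select(predicate, relation):
    """Relational selection: keep tuples satisfying the predicate."""
    return [t for t in relation if predicate(t)]

employees = [
    {"name": "Ada",   "dept": "CS",   "salary": 90000},
    {"name": "Boole", "dept": "Math", "salary": 70000},
    {"name": "Codd",  "dept": "CS",   "salary": 85000},
]

t1 = lambda t: t["dept"] == "CS"
t2 = lambda t: t["salary"] > 80000

# Single conjunctive selection: sigma_{t1 AND t2}(employees).
combined = select(lambda t: t1(t) and t2(t), employees)

# Equivalent cascade: inner selection first, then the outer one.
cascade = select(t1, select(t2, employees))

assert combined == cascade
```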

     
  2. Selection is commutative. 
    \sigma_{\theta_{1}}(\sigma_{\theta_{2}}(E))=\sigma_{\theta_{2}}(\sigma_{\theta_{1}}(E))

    Explanation: The \sigma operation is commutative: it does not matter whether we apply \sigma_{\theta_{1}} or \sigma_{\theta_{2}} first. In practice, it is better to apply first the selection that yields fewer tuples, since that saves work in the outer selection. 

     
  3. In a cascade of projections, all but the outermost projection can be omitted. This is called a pi-cascade. 
    \pi_{L_{1}}(\pi_{L_{2}}(...(\pi_{L_{n}}(E))...)) = \pi_{L_{1}}(E)

    Explanation: A cascade or series of projections is redundant: in the end, we keep only the columns specified in the outermost projection (the rule assumes each attribute list is contained in the one inside it, i.e. L_{1} \subseteq L_{2} \subseteq ... \subseteq L_{n}). Hence, it is better to collapse all the projections into just one, the outermost projection. 
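The pi-cascade rule can be sketched the same way (toy data; a set-semantics projection that removes duplicates): the outermost projection alone determines the result.

```python
# Projection (pi): keep a subset of columns, removing duplicate rows
# as in set-based relational algebra.
def project(columns, relation):
    seen, out = set(), []
    for t in relation:
        row = tuple(t[c] for c in columns)
        if row not in seen:
            seen.add(row)
            out.append(dict(zip(columns, row)))
    return out

instructor = [
    {"ID": 1, "name": "Ada",  "dept_name": "CS", "salary": 90000},
    {"ID": 2, "name": "Codd", "dept_name": "CS", "salary": 85000},
]

# pi_{name}(pi_{name, dept_name}(instructor)) ...
cascade = project(["name"], project(["name", "dept_name"], instructor))
# ... equals pi_{name}(instructor) directly.
single = project(["name"], instructor)

assert cascade == single
```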

     
  4. Selections on Cartesian Products can be re-written as Theta Joins. 

    • Equivalence 1 
      \sigma_{\theta}(E_{1} \times E_{2}) = E_{1} \bowtie_{\theta} E_{2}

      Explanation: The cross product operation is known to be very expensive. This is because it matches each tuple of E1 (m tuples in total) with each tuple of E2 (n tuples in total), yielding m*n entries. If we apply a selection operation after that, we would have to scan through all m*n entries to find the tuples satisfying the condition \theta. Instead, it is more efficient to use the Theta Join, a join specifically designed to produce only those pairings that satisfy the Theta condition, without materializing the entire cross product first. 
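A minimal sketch of this equivalence (toy relations as Python lists of dicts; a real engine would use far better join algorithms than nested loops): selecting over a materialized cross product and a theta join that filters pairs as they are generated yield the same tuples, but the join never stores the m*n intermediate result.

```python
def cartesian_then_select(theta, r1, r2):
    # Materialize the full cross product (m * n tuples), then filter.
    product = [{**a, **b} for a in r1 for b in r2]
    return [t for t in product if theta(t)]

def theta_join(theta, r1, r2):
    # The condition is checked as each pair is generated, so the
    # full product is never stored.
    return [{**a, **b} for a in r1 for b in r2 if theta({**a, **b})]

instructor = [{"iid": 1, "name": "Ada"}, {"iid": 2, "name": "Codd"}]
teaches    = [{"tid": 1, "course": "DB"}, {"tid": 2, "course": "OS"}]

theta = lambda t: t["iid"] == t["tid"]

assert cartesian_then_select(theta, instructor, teaches) == \
       theta_join(theta, instructor, teaches)
```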

       
    • Equivalence 2 
      \sigma_{\theta_{1}}(E_{1} \bowtie_{\theta_{2}} E_{2}) = E_{1} \bowtie_{\theta_{1} \Lambda \theta_{2}} E_{2}

      Explanation: The Theta Join already radically decreases the number of resulting tuples, so folding the outer condition into the join (i.e. joining on the conjunction \theta_{1} \Lambda \theta_{2}) means fewer tuples come out of the join and no separate selection pass is needed afterwards. A \sigma_{\theta_{1}} condition outside the join, on the other hand, unnecessarily increases the number of tuples to scan. 

       
  5. Theta Joins are commutative. 
    E_{1} \bowtie_{\theta} E_{2} = E_{2} \bowtie_{\theta} E_{1}

    Explanation: Theta Joins are commutative, and the query processing time depends to some extent on which table is used as the outer relation and which as the inner relation during the join (based on the indexing structures and block layout). 

     
  6. Join operations are associative. 
    • Natural Join 
      (E_{1} \bowtie E_{2}) \bowtie E_{3} = E_{1} \bowtie (E_{2} \bowtie E_{3})

      Explanation: Joins are commutative as well as associative, so one should first join the two tables that yield the fewest tuples, and then apply the other join. 

       
    • Theta Join 
      (E_{1} \bowtie_{\theta_{1}} E_{2}) \bowtie_{\theta_{2} \Lambda \theta_{3}} E_{3} = E_{1} \bowtie_{\theta_{1} \Lambda \theta_{3}} (E_{2} \bowtie_{\theta_{2}} E_{3})

      Explanation: Theta Joins are associative in the above manner, where \theta_{2} involves attributes from only E2 and E3. 

       
  7. Selection operation can be distributed. 
    • Equivalence 1 
      \sigma_{\theta_{1}\Lambda\theta_{2}}(E_{1}\bowtie_{\theta}E_{2})=(\sigma_{\theta_{1}}(E_{1}))\bowtie_{\theta}(\sigma_{\theta_{2}}(E_{2}))

      Explanation: Applying a selection after the Theta Join forces every tuple returned by the join to be scanned afterwards. Here, \theta_{1} involves attributes of only E1 and \theta_{2} involves attributes of only E2, so each selection can be applied to its own relation before the join, and far fewer tuples need to be joined. 

       
    • Equivalence 2 
      \sigma_{\theta_{0}}(E_{1}\bowtie_{\theta}E_{2})=(\sigma_{\theta_{0}}(E_{1}))\bowtie_{\theta}E_{2}

      Explanation: Here, \theta_{0} involves attributes of only E1, so the selection can be applied to E1 before the join, drastically reducing the number of tuples joined. Equivalence 1 is the generalization of this rule to two selection conditions, one per side of the join. 
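Both equivalences can be sketched together on toy data (relations as Python lists of dicts): joining first and selecting afterwards gives the same result as pushing each selection down to its own relation.

```python
def theta_join(theta, r1, r2):
    return [{**a, **b} for a in r1 for b in r2 if theta({**a, **b})]

instructor = [
    {"ID": 1, "name": "Mozart", "dept_name": "Music"},
    {"ID": 2, "name": "Ada",    "dept_name": "CS"},
]
teaches = [
    {"tID": 1, "year": 2009},
    {"tID": 2, "year": 2010},
]

join_cond = lambda t: t["ID"] == t["tID"]
t1 = lambda t: t["dept_name"] == "Music"   # attributes of E1 only
t2 = lambda t: t["year"] == 2009           # attributes of E2 only

# Plan A: join first, select afterwards.
plan_a = [t for t in theta_join(join_cond, instructor, teaches)
          if t1(t) and t2(t)]

# Plan B: push each selection down to its own input, then join.
plan_b = theta_join(join_cond,
                    [t for t in instructor if t1(t)],
                    [t for t in teaches if t2(t)])

assert plan_a == plan_b
```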

       
  8. Projection distributes over the Theta Join. 
    • Equivalence 1 
      \pi_{L_{1}\cup L_{2}}(E_{1}\bowtie_{\theta}E_{2})=(\pi_{L_{1}}(E_{1}))\bowtie_{\theta}(\pi_{L_{2}}(E_{2}))

      Explanation: The idea discussed for selection applies to projection as well. Here, if L1 is a projection involving columns of only E1, and L2 a projection involving columns of only E2, it is better to apply the projections to both tables individually before joining. This leaves fewer columns on either side, making the join cheaper. (This form assumes the join condition \theta uses only attributes in L_{1} \cup L_{2}.) 

       
    • Equivalence 2 
      \pi_{L_{1}\cup L_{2}}(E_{1}\bowtie_{\theta}E_{2})=\pi_{L_{1}\cup L_{2}}((\pi_{L_{1}\cup L_{3}}(E_{1}))\bowtie_{\theta}(\pi_{L_{2}\cup L_{3}}(E_{2})))

      Explanation: Here, L1 contains columns of only E1 and L2 columns of only E2, but the join condition \theta may also use attributes outside L_{1} \cup L_{2}. Let L3 be those extra join attributes. We can then project E1 onto L_{1} \cup L_{3} and E2 onto L_{2} \cup L_{3} before joining; L3 enables the join, and the outer projection removes it afterwards. 
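A sketch of this equivalence with toy data: each input is projected onto its own columns plus the join attribute (here L3 = {ID}) before the join, and the outer projection then drops L3.

```python
def project(cols, rel):
    # Set-semantics projection: keep the listed columns, drop duplicates.
    seen, out = set(), []
    for t in rel:
        row = tuple(t[c] for c in cols)
        if row not in seen:
            seen.add(row)
            out.append(dict(zip(cols, row)))
    return out

def join_on(key, r1, r2):
    return [{**a, **b} for a in r1 for b in r2 if a[key] == b[key]]

instructor = [{"ID": 1, "name": "Ada", "salary": 90000}]
teaches = [{"ID": 1, "course_id": "DB-101", "year": 2009}]

L1, L2, L3 = ["name"], ["course_id"], ["ID"]  # L3: the join attribute

# Project after the join ...
direct = project(L1 + L2, join_on("ID", instructor, teaches))
# ... or push projections (keeping L3) below the join, then drop L3.
pushed = project(L1 + L2,
                 join_on("ID",
                         project(L1 + L3, instructor),
                         project(L2 + L3, teaches)))
assert direct == pushed
```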


       
  9. Union and Intersection are commutative. 
    E_{1}\ \cup E_{2}\ =\ E_{2}\ \cup\ E_{1}
    E_{1}\ \cap E_{2}\ =\ E_{2}\ \cap\ E_{1}

    Explanation: Union and intersection are both commutative: swapping the order of the operand tables does not change the result, so either operand may be evaluated first. 

     
  10. Union and Intersection are associative. 
    (E_{1}\ \cup E_{2})\ \cup\ E_{3}=E_{1}\ \cup\ (E_{2}\ \cup\ E_{3})
    (E_{1}\ \cap E_{2})\ \cap\ E_{3}=E_{1}\ \cap\ (E_{2}\ \cap\ E_{3})

    Explanation: Union and intersection are both associative: we can parenthesize the tables in whichever grouping is cheapest to evaluate. 

     
  11. Selection operation distributes over the union, intersection, and difference operations. 
    \sigma_{P}(E_{1}\ -\ E_{2})=\sigma_{P}(E_{1})\ -\ \sigma_{P}(E_{2})

    Explanation: In a set difference, only those tuples appear which belong to table E1 and do not belong to table E2. So applying a selection condition to the entire set difference is equivalent to applying the selection to the individual tables and then taking the difference, which reduces the number of comparisons in the difference step. In fact, \sigma_{P}(E_{1}) - E_{2} is also equivalent: any tuple of E2 that fails P could not have survived the selection on the left-hand side anyway. 
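A quick sketch with relations as Python sets of tuples (invented toy data): both distributions, including the variant that leaves E2 unfiltered, produce the same result.

```python
def select(pred, rel):
    return {t for t in rel if pred(t)}

# Tuples are (ID, dept_name) pairs.
e1 = {(1, "Music"), (2, "CS"), (3, "Music")}
e2 = {(2, "CS"), (3, "Music")}

p = lambda t: t[1] == "Music"

lhs = select(p, e1 - e2)                 # select after the difference
rhs = select(p, e1) - select(p, e2)      # select each side first
assert lhs == rhs
# The variant that leaves E2 unfiltered is also equivalent:
assert lhs == select(p, e1) - e2
```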

     
  12. Projection operation distributes over the union operation. 
    \pi_{L}(E_{1}\ \cup\ E_{2})=(\pi_{L}(E_{1}))\ \cup\ (\pi_{L}(E_{2}))

    Explanation: Applying the projection to E1 and E2 individually before taking the union is more efficient than the left-hand expression, i.e. projecting after the union step. 
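A sketch with relations as Python sets of (ID, name, dept) tuples and projection by column index:

```python
def project(cols, rel):
    return {tuple(t[i] for i in cols) for t in rel}

e1 = {(1, "Ada", "CS"), (2, "Codd", "CS")}
e2 = {(2, "Codd", "CS"), (3, "Boole", "Math")}

name = (1,)  # keep only the name column

# Projecting after the union equals the union of the projections.
assert project(name, e1 | e2) == project(name, e1) | project(name, e2)
```

Note that projection does not, in general, distribute over intersection or set difference, which is why this rule is stated for union only.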
     

Minimality –
A set of equivalence rules is said to be minimal if no rule in it can be derived from any combination of the others. Working with a minimal rule set keeps the optimizer from exploring transformations that are merely restatements of one another. 

 

Examples –
Assume the following tables: 
 

instructor(ID, name, dept_name, salary)
teaches(ID, course_id, sec_id, semester, year)
course(course_id, title, dept_name, credits)



Query 1: Find the names of all instructors in the Music department, along with the titles of the courses that they teach 

$\pi_{name, title}(\sigma_{dept\_name=``Music"}(instructor \bowtie (teaches \bowtie \pi_{course\_id, title}(course))))$

Here, dept_name is an attribute of only the instructor table. Hence, we can select the Music-department instructors before performing the joins, reducing the size of the intermediate result and the query time. 

Optimized Query: 
Using rule 7 (equivalence 2), performing the selection as early as possible reduces the size of the relation to be joined. 
$\pi_{name, title}((\sigma_{dept\_name=``Music"}(instructor))\bowtie(teaches\bowtie\pi_{course\_id, title}(course)))$
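The two plans for Query 1 can be sketched on invented sample rows (tuple layouts follow the schemas above): both return the same (name, title) pairs, but the optimized plan joins only the Music instructors.

```python
instructor = [(10, "Mozart", "Music", 40000),
              (11, "Ada", "CS", 90000)]
teaches = [(10, "MU-199", 1, "Spring", 2010),
           (11, "CS-101", 1, "Fall", 2009)]
course = [("MU-199", "Music Theory", "Music", 3),
          ("CS-101", "Intro CS", "CS", 4)]

def join_with_courses(instructors):
    # teaches joined with pi_{course_id, title}(course) on course_id,
    # then joined with the given instructors on ID.
    ct = [(t[0], c[1]) for t in teaches for c in course if t[1] == c[0]]
    return [(i[1], title, i[2]) for i in instructors
            for (iid, title) in ct if i[0] == iid]

# Unoptimized: join everything, then select on dept_name.
unopt = sorted((n, t) for (n, t, d) in join_with_courses(instructor)
               if d == "Music")

# Optimized (rule 7): select the Music instructors before joining.
music = [i for i in instructor if i[2] == "Music"]
opt = sorted((n, t) for (n, t, d) in join_with_courses(music))

assert unopt == opt
```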

Query 2: Find the names of all instructors in the CSE department who taught a course in 2009 

$\sigma_{dept\_name=``CSE"}(\sigma_{year=2009}(instructor\bowtie teaches))$

Optimized Query: 
We can perform an “early selection”, hence the optimized query becomes: 
$\sigma_{dept\_name=``CSE"}(instructor)\bowtie \sigma_{year=2009}(teaches)$
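Likewise for Query 2, on invented sample rows (instructor: ID, name, dept_name, salary; teaches: ID, course_id, sec_id, semester, year): applying both selections after the join and pushing each selection to its own relation give the same instructors.

```python
instructor = [(1, "Ada", "CSE", 90000), (2, "Mozart", "Music", 40000)]
teaches = [(1, "CS-101", 1, "Fall", 2009), (2, "MU-199", 1, "Spring", 2010)]

# Unoptimized: join on ID, then apply both selections.
joined = [(i, t) for i in instructor for t in teaches if i[0] == t[0]]
unopt = sorted(i[1] for (i, t) in joined if i[2] == "CSE" and t[4] == 2009)

# Optimized: push each selection to its own relation before the join.
cse = [i for i in instructor if i[2] == "CSE"]
y2009 = [t for t in teaches if t[4] == 2009]
opt = sorted(i[1] for i in cse for t in y2009 if i[0] == t[0])

assert unopt == opt
```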
 

Features:

Cost-Based Optimization: In cost-based optimization, the query optimizer estimates the cost of executing each possible query execution plan and selects the plan with the lowest cost. The cost of a query execution plan is usually measured in terms of the number of disk I/O operations or CPU cycles required to execute the plan.

Plan Space Exploration: The query optimizer explores the plan space, which is the set of all possible query execution plans, to find the optimal plan. This can be a complex and computationally expensive process, especially for complex queries involving multiple tables and join operations.

Query Rewriting: The query optimizer may rewrite the original query to create an equivalent query with a more efficient execution plan. For example, the optimizer may reorder the join operations or apply filter conditions earlier in the query execution plan to reduce the amount of data that needs to be processed.

Statistics: The query optimizer relies on statistics about the data in the tables being queried to estimate the cost of different query execution plans. These statistics may include information about the number of rows in each table, the distribution of values in each column, and the presence of indexes or constraints.

Index Selection: The query optimizer may select different indexes on the tables being queried to improve the efficiency of the query execution plan. This involves choosing the best index for each table based on the query’s filter and join conditions.
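As a concrete (if engine-specific) illustration, SQLite's EXPLAIN QUERY PLAN shows the optimizer switching from a full table scan to an index search once a suitable index exists; the sqlite3 module ships with Python's standard library. The table and index names here are invented for the demo.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE instructor (ID INTEGER, dept_name TEXT)")
con.executemany("INSERT INTO instructor VALUES (?, ?)",
                [(i, "Music" if i % 2 else "CS") for i in range(100)])

def plan(sql):
    # Each EXPLAIN QUERY PLAN row carries its description in column 3.
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM instructor WHERE dept_name = 'Music'"

before = plan(query)  # no index available: a full scan of instructor
con.execute("CREATE INDEX idx_dept ON instructor(dept_name)")
after = plan(query)   # the optimizer now searches via idx_dept

assert "idx_dept" not in before
assert "idx_dept" in after
```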

Caching: The query optimizer may cache the results of commonly executed queries to improve performance. This can reduce the need to execute the same query multiple times, especially in applications with frequent user queries.

Advantages:

Improved Query Performance: The main advantage of query optimization is that it can significantly improve the performance of queries. By selecting the most efficient query plan, the RDBMS can execute the query in the shortest possible time.

Cost Reduction: The optimization process reduces the cost of query execution by minimizing the use of system resources, such as CPU and memory, while still delivering the same results.

Query Tuning: Query optimization provides a way to fine-tune queries to make them more efficient, even in complex and large-scale databases.

Reduced Complexity: Query optimization makes the query execution process more efficient and reduces the complexity of the system. By using a streamlined approach, the RDBMS can improve its overall performance.

Disadvantages:

Increased Overhead: Query optimization requires additional processing time and system resources. This overhead may be significant for large databases and can impact system performance.

Limited Scope: Query optimization may not work for all queries, especially those that are ad-hoc and not well-defined. In such cases, the RDBMS may not be able to optimize the query, leading to suboptimal performance.

Algorithmic Complexity: The process of query optimization itself can be complex, especially for large and complex queries. This can result in a higher computational cost and may require specialized expertise to implement.

Maintenance: Query optimization requires ongoing maintenance to keep the query optimizer up to date with changes in the database schema, data distribution, and query workload.

Last Updated : 10 May, 2023