Last Minute Notes – DBMS

Database Management System is an organized collection of interrelated data that helps in accessing data quickly, along with efficient insertion, and deletion of data into the DBMS. DBMS organizes data in the form of tables, schemas, records, etc.

DBMS over File System

The file system has numerous issues, which were resolved with the help of DBMS, the issues with the file system are:

Data Redundancy: Same data can be stored at multiple places.
Data Inconsistency: If multiple copies of the same data have different content in each copy. Like, the phone number of students is different in academic and accounts files.
Data access: In a file system, accessing data was difficult and insecure as well. Accessing data concurrently was not possible.
No Backup and Recovery: There is no backup and recovery in the file system that can lead to data loss.

ER-Model

ER Diagram: An ER diagram is a model of a logical view of the database which is represented using the following components:

Entity: The entity is a real-world object, represented using a rectangular box.
- Strong Entity: A strong entity set has a primary key and all the tuples of the set can be identified using that primary key
- Weak entity: When an entity does not have sufficient attributes to form a primary key. Weak entities are associated with another strong entity set also known as identifying an entity. A weak entity’s existence depends upon the existence of its identifying entity. The weak entity is represented using a double-lined or bold-lined rectangle.
Attribute: Attribute is the properties or characteristics of the real-world object. It is represented using an oval.
- Key attribute: The attribute which determines each entity uniquely is known as the Key attribute. It is represented by an oval with an underlying line.
- Composite Attribute: An attribute that is composed of many other attributes. E.g. address is an attribute it is formed of other attributes like state, district, city, street, etc. It is represented using an oval comprises of many other ovals.
- Multivalued Attribute: An attribute that can have multiple values, like a mobile number. It is represented using a double-lined oval.
- Derived attribute: An attribute that can be derived from other attributes. E.g. Age is an attribute that can be derived from another attribute Data of Birth. It is represented using a dashed oval.
Relationship: A relationship is an association between two or more entities. Entities are connected or related to each other and this relationship is represented using a diamond.

Some Important Terms

Cardinality of DBMS: Cardinality of relation expresses the maximum number of possible relationship occurrences for an entity participating in a relationship. Cardinality of a relationship can be defined as the number of times an entity of an entity set participates in a relationship set. Let’s suppose a binary relationship R between two entity sets A and B. The relationship must have one of the following mapping cardinalities:

One-to-One: When one entity of A is related to at most one entity of B, and vice-versa.
One-to-Many: When one entity of A is related to one or more than one entity of B. Whereas B is associated with at most one entity in A.
Many-to-One: When one entity of B is related to one or more than one entity of A. Whereas A is associated with at most one entity in B.
Many-to-Many: Any number of entities of A is related to any number of entities of B, and vice-versa.

The most commonly asked question in ER diagram is the minimum number of tables required for a given ER diagram. Generally, the following criteria are used:

Cardinality	Minimum No. of tables
1:1 cardinality with partial participation of both entities	2
1:1 cardinality with a total participation of at least 1 entity	1
1:n cardinality	2
m:n cardinality	3

If the relation is one-to-many or many-to-one then two or more relational tables can be combined.
If the relation is many-to-many two tables cannot be combined.
If the relation is one-to-one and there is total participation of one entity then that entity can be combined with a relational table.
If there is total participation of both entities then one table can be obtained by combining one table and both entities of the relation.

Note: This is a general observation. Special cases need to be taken care of. We may need an extra table if the attribute of a relationship can’t be moved to any entity side.

Specialization: It is a top-down approach in which one entity is divided/specialized into two or more sub-entities based on its characteristics.

Generalization: It is a bottom-up approach in which common properties of two or more subentities are combined/generalized to form one entity. It is exactly the reverse of Specialization. In this, two or lower level entities are generalized to one higher level entity.

Aggregation: Aggregation is an abstraction process through which relationships are represented as higher-level entity sets.

Participation Constraint: It specifies the maximum or a minimum number of relationship instances in which any entity can participate. In simple words, participation means how an entity is linked to a relationship.

Total Participation: Each entity of an entity set participates in at least one relationship.
Partial Participation: Some entities of the entity set may not participate in any relationship.

Database Design

Database design Goals: The prime goal of designing a database is:

To have zero redundancy in the system
Loss-less join
Dependency preservation
Overcome all the shortcomings of conventional file system

According to E.F. Codd, “All the records of the table must be unique”.

Keys of a relation: There are various types of keys in a relation which are:

Candidate Key: The minimal set of attributes that can determine a tuple uniquely. There can be more than 1 candidate key of a relation and its proper subset can’t determine tuple uniquely and it can’t be NULL.
Super Key: The set of attributes that can determine a tuple uniquely. A candidate key is always a super key but vice versa is not true.
Primary Key and Alternate Key: Among various candidate keys, one key is taken as the primary key and others are alternate keys.
Foreign Key: Foreign Key is a set of attributes in a table that is used to refer to the primary key or alternative key of the same or another table.

Functional Dependency: It is a constraint that specifies the association/ relationship between a set of attributes. In functional dependency, one set can accurately determine the value of another set. It is represented as A->B, where set A can determine the values of set B correctly. The A is known as the Determinant, and B is known as the Dependent.

Functional dependencies are further categorized into two types:

Trival Functional Dependency: In functional dependency, if B is a subset of A, then such dependency is known as trivial functional dependency.
Non-Trivial Functional Dependency: In functional dependency, if B is not a subset of A, then such dependency is known as non-trivial functional dependency.

Armstrong’s Axioms: It is a statement that is always considered true and used as a starting point for further arguments. Armstrong axiom is used to generate a closure set in a relational database.

Armstrong Axiom

Attribute Closure(X⁺): All attributes of the set are functionally determined by X.

Prime Attribute: An attribute that is part of one candidate key.
Non-prime Attribute: An attribute that is not a part of any candidate key.

Example: If the relation R(ABCD) {A->B, B->C, C->D}, then the attribute closure of

A will be (A⁺)={ABCD} [A can determine B, B can determine C, C can determine D]
B will be (B⁺)={BCD} [B can determine C, C can determine D]
C will be (C⁺)={CD} [C can determine D]
D will be (D⁺)={D} [D can determine itself]

Note: With the help of Attribute closure, we can easily determine the Superkey [The set of attributes whose closure contains all attributes of a relation] of a relation, So in the above example A is the superkey of the given relation. There can be more than one superkey in a relationship.

Example: If the relation R(ABCDE) {A->BC, CD->E, B->D, E->A}, then the attribute closure will be

A⁺= {ABCDE}
B⁺= {BD}
C⁺= {C}
D⁺= {D}
E⁺= {ABCDE}

Equivalence sets of Functional Dependency: If two sets of a functional dependency are equivalent, i.e. if A⁺= B⁺. Every FD in A can be inferred from B, and every FD in B can be inferred from A, then A and B are functionally equivalent.

Minimal set of Functional Dependency: A set of FD will be minimal if it satisfies the following conditions:

F logically implies all dependencies in F⁺
F⁺ logically implies all dependencies in F.
No functional dependency in F⁺contains an extraneous attribute.
Each left side of a functional dependency in F+ is unique. That is, there are no two dependencies A->B and C->D in such that A->C.

Normalization: Normalization is used to eliminate the following anomalies:

Insertion anomaly
Deletion anomaly
Updation Anomaly
Join anomaly

Normalization was introduced to achieve integrity in the database and make the database more maintainable.

1. First Normal Form: A relation is in first normal form if it does not contain any multi-valued or composite attribute. If the data is in 1NF then it will have high redundancy.

2. Second Normal Form: A relation is in the second normal form if it is in the first normal form and if it does not contain any partial dependency.

Partial Dependency: A dependency is called partial dependency if any proper subset of candidate key determines non-prime (which are not part of candidate key) attribute.
Let R be the relational schema and X, Y, A is the set of attributes. Suppose X is any candidate key, Y is a proper subset of candidate key, and A is a Non-prime attribute.

Partial Dependency

Y->A will be partial dependency iff, Y is a proper subset of candidate key, and A is a non-prime attribute.

Full Functional Dependency: If A and B are an attribute set of a relation, B is fully functional dependent on A, if B is functionally dependent on A but not on any proper subset of A.

3. Third Normal Form: A relation is in the third normal form if it is in the second normal form and it does not contain any transitive dependency. For a relation to be in Third Normal Form, either LHS of FD should be super key or RHS should be the prime attribute.

4. Boyce-Codd Normal Form: A relation is in Boyce-Codd Normal Form if the LHS of every FD is super key. The relationship between Normal Forms can be represented as 1NF, 2NF, 3NF or BCNF.

Design Goal	1NF	2NF	3NF	BCNF
Zero Redundancy	High redundancy	Less than 1NF	Less than 2NF	No redundancy
Loss-less decomposition	Always	Always	Always	Always
Dependency preservation	Always	Always	Always	Sometimes Not possible

Properties of Decomposition:

Loss-less Join Decomposition: There should not be the generation of any new tuple because of the decomposition.

Dependency Preserving Decomposition: There should not be the loss of any tuple because of the decomposition. Let R be a relation with Functional dependency F. After decomposition R is decomposed into R₁, R₂, R₃……R_n with FD set F₁, F₂, F₃……F_n respectively.

If F₁, F₂, F₃…..F_n ≣ F, then the decomposition is dependency preserving otherwise not.

Data Retrieval (SQL, RA)

Commands to Access Database: For efficient data retrieval, insertion, deletion, updation, etc. The commands in the Database are categorized into three categories, which are as follows:

DDL [Data Definition language]: It deals with how data should store in the database. DDL commands include CREATE, ALTER, DROP, COMMENT, and TRUNCATE
DML [Data Manipulation language]: It deals with data manipulation like modifying, updating, deleting, etc. DML commands include SELECT, INSERT, DELETE, UPDATE, LOCK TABLE, MERGE, CALL, AND EXPLAIN PLAN.
DCL [Data Control Language]: It acts as an access specifier, and includes GRANT, AND REVOKE.

Query Language: Language using which any user can retrieve some data from the database.

Note: Relational model is a theoretical framework RDBMS is its implementation.

Relational Algebra: Procedural language with basic and extended operators.

Basic Operator	Semantic
σ(Selection)	Select rows based on a given condition
π (Projection)	Project some columns
X (Cross Product/ Cartesian Product)	Cross product of relations, returns *mn** rows where m and n are numbers of rows in R1 and R2 respectively.
U (Union)	Return those tuples which are either in R1 or R2. Maximum number of rows returned = m+n Minimum number of rows returned = max(m,n)
– (Minus)	R1-R2 returns those tuples which are in R1 but not in R2. Maximum number of rows returned = m Minimum number of rows returned = m-n
ρ(Rename)	Renaming a relation to another relation.

Extended Operator	Semantic
(Intersection)	Returns those tuples which are in both relation R1 and R2. Maximum number of rows returned = min(m,n) Minimum number of rows returned = 0
⋈ (Conditional Join)	Selection from two or more tables based on some condition (Cross product followed by selection)
⋈(Equi Join)	It is a special case of conditional join when only an equality condition is applied between attributes.
⋈ (Natural Join)	In natural join, equality condition on common attributes holds, and duplicate attributes are removed by default. Note: Natural Join is equivalent to the cross product of two relations have no attribute in common and the natural join of a relation R with itself will return R only.
⟕(Left Outer Join)	When applying join on two relations R and S, Left Outer Joins gives all tuples of R in the result set. The tuples of R which do not satisfy the join condition will have values as NULL for attributes of S.
⟖(Right Outer Join)	When applying join on two relations R and S, Right Outer Joins gives all tuples of S in the result set. The tuples of S which do not satisfy the join condition will have values as NULL for attributes of R.
⟗(Full Outer Join)	When applying join on two relations R and S, Full Outer Joins gives all tuples of S and all tuples of R in the result set. The tuples of S which do not satisfy the join condition will have values as NULL for attributes of R and vice versa.
/ (Division Operator)	Division operator A/B will return those tuples in A which is associated with every tuple of B. Note: Attributes of B should be a proper subset of attributes of A. The attributes in A/B will be Attributes of A- Attribute of B.

Relational Calculus: Relational calculus is a non-procedural query language. It explains what to do but not how to do it. It is of two types:

Tuple Relational Calculus: The tuple relational calculus is based on specifying the number of tuple variables. Each variable usually ranges over a particular database relation. It is of the form
{t| cond(t)}
where t is the tuple variable and cond(t) is a conditional expression involving t. The result of such query is the set of tuples of t that satisfy cond(t).
Domain Relational Calculus: Unlike tuple calculus in Domain calculus variables range oer single values from domains of attributes. It is of the form
{x₁, x₂, ……, x_n | cond (x₁, x₂, ……., x_n, x_n+1, x_n+2, …….., x_n+m)}
where, x₁, x₂, ……., x_n, x_n+1, x_n+2, …….., x_n+m are domain variables ranging over domains, and cond is a condition.

SQL: Structured Query Language, lets you access or modify databases. SQL can execute queries, retrieve data, insert records, update records, delete records, create a new database, create new tables, create views, and set permissions on tables, procedures, or views.

SQL Commands:

Operator	Meaning
SELECT	Selects columns from a relation or set of relations. It defines WHAT is to be returned. Note: As opposed to Relational Algebra, it may give duplicate tuples for the repeated values of an attribute.
FROM	FROM is used to define the Table(s) or View(s) used by the SELECT or WHERE statements
WHERE	WHERE is used to define what records are to be included in the query. It uses conditional operators.
EXISTS	EXISTS is used to check whether the result of a correlated nested query is empty (contains no tuples) or not.
GROUP BY	GROUP BY is used to group the tuples based on some attribute or set of attributes like counting the number of students GROUP BY the department.
ORDER BY	ORDER BY is used to sort the fetched data in either ascending or descending according to one or more columns.
Aggregate functions	Find the aggregated value of an attribute. Used mostly with GROUP BY. e.g.; count, sum, min max. *select count() from the student group by dept_id Note:** we can select only those columns which are part of GROUP BY.
Nested Queries	When one query is a part of another query.
UPDATE	It is used to update records in a table.
DELETE	It is used to delete rows in a table.
LIKE	LIKE operator is used with the WHERE clause to search a specified pattern in a column.
IN	IN operator is used to specify multiple values in the WHERE clause.
BETWEEN	It selects values within a range.
Aliases	It is used to temporarily rename a table or a column heading.
HAVING	The HAVING clause was added because the WHERE keyword could be used with aggregate functions.

File Structure

File organization: It is the logical relation between records and it defines how file records are mapped into disk blocks(memory). A database is a collection of files, each file is a collection of records, and each record contains a sequence of fields. The blocking Factor is the average number of records per block.

Strategies for storing files of records in block:

Spanned Strategy: It allows a partial part of the record to be stored in the block. It is suitable for variable-length records. No wastage of memory in spanned strategy but block access time gets increases.
Unspanned Strategy: Data cannot be stored partially, the whole block will be occupied, this can lead to internal fragmentation and wastage of memory but block access time is reduced. This is suitable for fixed-length records.

File organizations is of following types:

Sequential File: In this method, files are stored in sequential order one after another.

Blocking factor:
Number of record blocks:
Average number of blocks accessed by linear search:
Average number of blocks accessed by binary search:

Index File:

Index blocking factor:
First level index block:
Number of block accesses:

Indexing Type:

1. Single level Index

Primary Index(Sparse): A primary index is an ordered file(ordered with key field), records of fixed length with two fields. The first field is the same as the primary key of the data file and the second field is a pointer to a data block, where the key is available. In Sparse indexing, for a set of database records there exists a single entry in the index file.
- Number of index file entries ≤ Number of database records.
Secondary Index (Dense): Secondary index provides secondary means of accessing a file for which primary access already exists. In Dense indexing, for every database record, there exists an entry in the index file. The index blocking factor is the same for all indexes.
- Number of database records = Number of entries in the index file
- Number of block accesses=
Clustered Index(Sparse): A clustering index is created on a data file whose records are physically ordered on a non-key field (called a Clustering field). Almost one clustering index is possible.
- Single-level index blocks=
- Number of block accesses=

2. Multilevel Index

Indexed sequential access method: Second level index is always sparse.
- Level 1 = “first-level index blocks” computed by index
- Level 2 =
- Level n =
- Number of blocks =
- Number of block access = n+1
B-Tree: Also known as Baye’s or balanced Search Tree. At every level, we have Key and Data pointers, and data pointer points either block or record.
- Root node: B-tree can have children between 2 and p, where p is the Order of the tree.
- Internal Node: to n children.
- Leaf nodes all are at the same level.
- Block size = p × (size of block pointer) + (p-1)× (Size of key field + size of record pointer)
- Minimum number of nodes =
- Maximum number of nodes = \frac{p^{h+1}-1}{p-1}
- Minimum height = \left ( \left \lceil log_p \ l \right \rceil \right ) l is the number of leaves
- Maximum height = \left \lfloor 1 +log_{\frac {p}{2}} \frac{l}{2} \right \rfloor
B⁺ Tree: It is the same as B-tree. All the records are available at the leaf (last) level. B⁺ tree allows both sequential and random access whereas in B-tree only random access was allowed. Each leaf node has one block pointer and all the leaf nodes are connected to the next leaf node using a block pointer.
- Order of non-leaf node= [p × size of block pointer] + [(p-1) × size of key field] <= Block size.
- Order of Leaf node= [(p_leaf -1) × (size of key field + size of record pointer) + p × (size of block pointer) <= Block size]

Transaction and Concurrency Control

A transaction is a unit of instruction or set of instructions that performs a logical unit of work. Transaction processes are always atomic in nature either they will execute completely or do not execute.

Transaction Properties:

Atomicity: Either execute all operations or none of them. It is managed by the transaction Management Component.
Consistency: Database must be consistent before and after the execution of the transaction. If atomicity, isolation, and durability are implemented accurately, consistency will be achieved automatically.
Isolation: In concurrent transactions, the execution of one transaction must not affect the execution of another transaction. It is managed by the Concurrency Control component.
Durability: After the commit operation, the changes should be durable and persist always in the database. It is managed by the Recovery Management component.

Schedule: Sequences in which instructions of the concurrent transactions get executed. Schedules are of two types:

Serial Schedule: Transactions execute one by one, another transaction will begin after the commit of the first transaction. It is inconsistent and the system’s efficiency is so poor due to no concurrency.
- The number of possible serial schedules with n transactions = n!
Non-Serial Schedule: When two or more transactions can execute simultaneously. This may lead to inconsistency, but have better throughput and less response time.
- The number of possible non-serial schedules with n transactions = Total Schedule – Serial Schedule

∩ (Intersection)Serializability: A schedule is said to be serializable if it is equivalent to a serial schedule. It is categorized into two categories: Conflict Seriablizability, and View Serializability.

Conflict Serializability: A schedule will be conflict serializable if it can be transformed into a serial schedule by swapping non-conflicting operations. It is a polynomial-time problem.

Conflicting operations: Two operations will be conflicting if
- They belong to different transactions.
- They are working on the same data item.
- At least one of them is the Write operation.

View Serializability: A schedule will be view serializable if it is view equivalent to a serial schedule. It is an NP-Complete Problem.

Check whether it is conflict serializable or not, if Yes then it is view serializable.
If the schedule does not conflict with serializable then check whether it has blind write or not. If it does not have blind write then it is not view serializable. [To be view serializable a schedule must have a blind write]
If the schedule has blind write, Now check whether the schedule is view-equivalent to any other serial schedule.
Now, draw a precedence graph using given dependencies. If no cycle/loop exists in the graph, then the schedule would be a View-Serializable otherwise not.

Types of Schedule based recoverability:

Irrecoverable Schedule: A transaction is impossible to roll back once the commit operation is done.
Recoverable Schedule: A schedule is recoverable if a transaction T_i reads a data item previously written by Transaction T_j, the commit operation T_j appears before the commit operation of T_i.
Cascadeless Recoverable Schedule: Cascadeless Schedule avoids cascading aborts/rollbacks (ACA). Schedules in which transactions read values only after all transactions whose changes they are going to read commit are called cascadeless schedules. Avoids that a single transaction abort leads to a series of transaction rollbacks.
- Cascading rollback: When failure of a single transaction leads to a series of transaction rollbacks.
Strict Recoverable Schedule: If there is no read or write in the schedule before the commit, then such schedule are known as a Strict recoverable schedule.

Concurrency Control with Locks:

To achieve consistency, isolation is the most important concept. Isolation can be achieved using locking very easily. A schedule acquires a lock prior to accessing the transaction and the lock is released when the transaction is completed. A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules.

Lock Types:

Binary Locks

It is in two states: Locked(1) or Unlocked(0)
When an object is locked it is unavailable to other objects.
When an object is unlocked then it is open to transactions.
An object is unlocked when the transaction is unlocked.
Every transaction locks a data item before use and unlocks/releases it after use.
Issues with binary locks: Irrecoverability, Deadlock, and Low concurrency.

Shared/Exclusive Locks:

Shared (S Mode): It is denoted by lock-S(Q), the transaction can perform a read operation, and any other transaction can also obtain the same lock on same data item at the same time and can also perform a read operation only.
Exclusive (X Mode): It is denoted by lock- X(Q), the transaction can perform both read and write operations, any other transaction can not obtain either shared/exclusive lock.

Two-Phase Locking: This protocol requires that each transaction in a schedule will be two phases: i.e. Growing phase and the shrinking phase.

In the growing phase, transactions can only obtain locks but cannot release any lock.
In the shrinking phase, transactions can only release locks but can not obtain any lock.
The transaction can perform read/write operations in both the growing as well as in shrinking phase.
Rules for 2-PL:
- Two transactions cannot have conflicting locks.
- No unlock operation can precede a lock operation in the same transaction.
- No data are affected until all locks are obtained.
Basic 2-PL:
- Equal serial schedule based on lock point.
- It has recoverability, deadlock, and starvation.
Strict 2-PL:
- Basic 2-PL with all exclusive locks should be hold until commit/roll back.
- It ensures serializability strict recoverable, but it has starvation and irrecoverability.
Rigorous 2-PL:
- Basic 2PL with all locks (S/X) should be hold until commit.
- Equivalent serial schedule based on the order of commit.

Time Stamp Ordering Protocols: Time stamp protocols assign a globally unique timestamp value to each transaction and produce the order for transaction submission. The timestamp protocol is free from deadlock, it may suffer from starvation but it ensures serializability.

Thomas Write Rule: It modify time-stamping protocol in obsolete write case, when T_irequests write (Q) if TS (T_i) < WTS(Q) then roll back T. Otherwise execute successfully. Set RTS (A)= max{TS(T), RTS (A)}.

Thomas write rule is View serializable, hence it guarantees consistency, but there is no guarantee of cascadeness and recoverable system because there is no restriction on write.

See Last Minute Notes on all subjects here.

Article Tags :

DBMS