Most applications draw on many databases and data sources that have to work together. Data integration combines data from these sources and presents a single, unified view over them, so that queries can be answered using the combined information. Integration can be physical or virtual. Physical data integration copies the data from the sources into a warehouse.
Virtual integration leaves the data at the sources. The main problem that integration raises is the heterogeneity of data across the sources. Heterogeneity can be semantic (different attribute names for similar data), or it can lie in communication, schema or data types. To overcome these issues, three models are used for integrating data: federated databases, data warehousing and mediation.
Global as View (GAV) :
Global as View is one of the mediator-based approaches to view-based data integration. The global schema acts as a view over the source schemas, i.e. the mediator schema is described in terms of the local schemas. Given a query over the global schema, the mediator follows existing rules and templates to convert it into source-specific queries, and then sends the new queries to the wrappers for execution. Each wrapper works out which expressions are possible and how they can be combined to answer the given query.
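The rewriting step described above can be sketched roughly as follows. This is only an illustration under assumptions: the global relation `Product`, the sources `supplier_x` and `supplier_y`, their `catalog` tables and the SQL-like query strings are all hypothetical names, not part of any real system.

```python
from typing import Dict, List

# Hypothetical GAV rule: each global relation is defined as a view over
# the listed source relations (here, a union of supplier catalogs).
GAV_RULES: Dict[str, List[str]] = {
    "Product": ["supplier_x", "supplier_y"],
}


def unfold(global_relation: str, condition: str) -> List[str]:
    """Rewrite a selection on a global relation into one query per source.

    The returned strings are SQL-like text that a wrapper for each source
    would be asked to execute; nothing is executed here.
    """
    return [
        f"SELECT '{src}' AS supplier, item, price "
        f"FROM {src}.catalog WHERE {condition}"
        for src in GAV_RULES[global_relation]
    ]


if __name__ == "__main__":
    for source_query in unfold("Product", "item = 'bolt'"):
        print(source_query)
```

Because the global relation is defined directly over the sources, rewriting amounts to substituting the view definition into the query, which is why query answering in GAV is usually described as straightforward.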
Applications :
- Enterprise Information Integration, which makes the separate databases owned by a company work together.
- Scientific databases, for example genome databases.
- Integrating catalogs, which involves combining product information from every supplier.
How it works :
Mediation involves a mediator, which is a virtual view of the data. It stores no data itself, because the data stays at the sources. The schemas of the various sources are combined to form the virtual schema of the mediator. Mapping takes place at query time: when a user submits a query, it is mapped to multiple other queries, and each of those is sent to a source. The sources evaluate them and send back the results.
The results are merged and sent to the end user. This process is called mediation. It uses wrappers, which are responsible for mapping the queries. Wrappers rely on predefined templates, each of which represents many queries, and this keeps them flexible. If the mediator's query matches a template, the results are returned; otherwise they are not. There are two types of mediator: Global as View and Local as View. This article discusses Global as View.
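The mediation flow described above might be sketched as follows. This is a minimal illustration rather than a real mediator: the `Wrapper` and `Mediator` classes, the template names `all` and `by_first_column`, and the in-memory rows standing in for sources are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[str, ...]


class Wrapper:
    """Hypothetical wrapper: answers only queries that match one of its templates."""

    def __init__(self, source_name: str, rows: List[Row]) -> None:
        self.source_name = source_name
        self.rows = rows  # stands in for the data held at the source
        # Template name -> function that runs the parameterised query.
        self.templates: Dict[str, Callable[[str], List[Row]]] = {
            "all": lambda _arg: list(self.rows),
            "by_first_column": lambda arg: [r for r in self.rows if r[0] == arg],
        }

    def execute(self, template: str, arg: str = "") -> Optional[List[Row]]:
        handler = self.templates.get(template)
        if handler is None:
            return None  # no matching template: this wrapper cannot answer
        return handler(arg)


class Mediator:
    """Stores no data; forwards a query to every wrapper and merges the results."""

    def __init__(self, wrappers: List[Wrapper]) -> None:
        self.wrappers = wrappers

    def query(self, template: str, arg: str = "") -> List[Row]:
        merged: List[Row] = []
        for wrapper in self.wrappers:
            result = wrapper.execute(template, arg)
            if result is not None:
                merged.extend(result)
        return merged
```

A mediator built over two such wrappers returns the concatenated rows for a known template and an empty list for a query that matches no template, which mirrors the "results are returned, otherwise not" behaviour described above.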
Let’s take an example to understand the working of GAV.
Consider catalog integration. Suppose Zexmon (a company) wants to buy DIP and PGA chips that use the same protocol.
Global Schema –
DIP ( manufacturer, model, protocol )
PGA ( manufacturer, model, protocol )
Local Schema –
Every DIP and PGA manufacturer has a relation ( model, protocol ).
Zexmon queries the mediator. The mediator starts by querying every DIP manufacturer for its ( model, protocol ) pairs. The wrapper turns each pair into a triple by adding the manufacturer attribute. The protocols of all the DIP chips from all the sources are returned to the mediator.
The mediator then queries all the PGA manufacturers using the protocols returned in the previous step. Again, the wrapper adds the manufacturer attribute to each ( model, protocol ) pair. This is how the mediator retrieves the DIP and PGA chips that share a protocol, which in turn helps Zexmon to buy the desired chips.
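The two-step plan in this example can be sketched as follows. The manufacturer names, chip models and protocols are invented sample data, and `wrap`, `global_relation` and `matching_chips` are hypothetical helper names; only the schemas DIP/PGA ( manufacturer, model, protocol ) and ( model, protocol ) come from the example.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]          # local schema: (model, protocol)
Triple = Tuple[str, str, str]   # global schema: (manufacturer, model, protocol)

# Invented sample data: manufacturer -> its local (model, protocol) relation.
DIP_SOURCES: Dict[str, List[Pair]] = {
    "AcmeChips": [("D100", "SPI"), ("D200", "I2C")],
    "Borel": [("D7", "UART")],
}
PGA_SOURCES: Dict[str, List[Pair]] = {
    "AcmeChips": [("P1", "SPI")],
    "Cygnus": [("P9", "I2C"), ("P10", "CAN")],
}


def wrap(manufacturer: str, pairs: List[Pair]) -> List[Triple]:
    """Wrapper step: add the manufacturer attribute to each (model, protocol) pair."""
    return [(manufacturer, model, protocol) for model, protocol in pairs]


def global_relation(sources: Dict[str, List[Pair]]) -> List[Triple]:
    """GAV view: the global relation is the union of all wrapped source relations."""
    rows: List[Triple] = []
    for manufacturer, pairs in sources.items():
        rows.extend(wrap(manufacturer, pairs))
    return rows


def matching_chips() -> Tuple[List[Triple], List[Triple]]:
    """Return the DIP and PGA chips that share a protocol."""
    dips = global_relation(DIP_SOURCES)
    dip_protocols = {protocol for _, _, protocol in dips}
    # Second step: keep only PGA chips whose protocol some DIP chip uses.
    pgas = [row for row in global_relation(PGA_SOURCES) if row[2] in dip_protocols]
    shared = {protocol for _, _, protocol in pgas}
    dips = [row for row in dips if row[2] in shared]
    return dips, pgas


if __name__ == "__main__":
    dips, pgas = matching_chips()
    print("DIP:", dips)
    print("PGA:", pgas)
```

With this sample data, the UART-only DIP chip and the CAN-only PGA chip are dropped, because no chip of the other kind uses those protocols.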
Advantages :
- Global as View is simpler to implement because you control how the mediator works.
- It is simple to design.
- Query answering is procedural, which is why many industrial systems use it.
Disadvantages :
- Since the global schema is defined in terms of the sources, it cannot represent any information that is absent from the source schemas.
- Adding a new source is an overhead: the new source has to be related to the existing ones, so fully independent sources are rarely added, and the mappings have to be changed whenever a source is added.
- The view of the content that can be generated is narrow.
- Removing a data source may also require a lot of work, which makes the approach inflexible.