How To Develop A Standard SQL Suite For A Robust Streaming Process?

Last Updated : 24 Feb, 2020

In the current era of digitization, businesses need real-time data analysis and management to stay ahead of the curve. SQL has been at the forefront of such streaming analysis and management, yet it has certain limitations that hold the streaming process back.

Businesses use a combination of tables and streams, along with historical data, to analyse data for applications like decision making and other business operations. Despite the rise of data analytics and Artificial Intelligence, SQL remains one of the main query languages used for data streaming.

So, here we are going to look at three methodologies that make the streaming process for SQL more robust and efficient:

  • Time-Varying Relations
  • Event Time Semantics
  • Materialization Controls

But before we tap into these methodologies, let us get a grasp of the current approaches to streaming SQL:

  • Apache Spark: 
    Spark's Dataset API is a declarative API built on Spark SQL's execution engine and optimizer. Dataset programs are normally executed on finite data sets; the streaming variant of the Dataset API is popularly known as Structured Streaming. Structured Streaming queries are evaluated by a micro-batch execution engine that processes data streams in small batches and provides fault-tolerance guarantees (a windowed aggregation in Spark SQL is sketched after this list).

  • KSQL:
    It is built on Kafka Streams, a stream processing framework developed under the Apache Kafka project. KSQL is a declarative wrapper over Kafka Streams that provides a custom SQL-like syntax for declaring streams and tables. It focuses primarily on materialized view semantics (a KSQL sketch follows this list).

  • Apache Flink:
    It consists of two relational APIs: the LINQ-style Table API and SQL. Queries from both APIs are translated into a common logical plan representation, optimized with Apache Calcite, and then executed as a batch or a streaming program (a Flink SQL sketch follows this list).

  • Apache Beam:
    It is developed specifically with Beam's unification of bounded and unbounded data processing in mind, and it implements a subset of SQL semantics that can be executed over both kinds of data streams.

  • Apache Calcite:
    It is a popular SQL parser and optimizer, used by Flink SQL and Beam SQL among others. It parses and optimizes queries and supports stream processing semantics.
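
For illustration, here is a minimal sketch of a windowed aggregation in Spark SQL, assuming a streaming source has been registered as a temporary view named bids (the view and column names are hypothetical). The micro-batch engine re-evaluates such a query incrementally as new rows arrive:

-- 10-minute tumbling windows over the (hypothetical) bids view
SELECT
    window(bidtime, '10 minutes') AS w,
    MAX(price) AS maxPrice
FROM bids
GROUP BY window(bidtime, '10 minutes');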
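
Similarly, a minimal KSQL sketch, assuming a Kafka topic named bids (topic, stream, and column names are hypothetical). The CREATE TABLE ... AS SELECT statement turns the stream into a continuously maintained materialized view:

-- declare a stream over a (hypothetical) Kafka topic
CREATE STREAM bids (itemid BIGINT, price DOUBLE)
    WITH (KAFKA_TOPIC = 'bids', VALUE_FORMAT = 'JSON');

-- materialized view: highest bid per item over 10-minute windows
CREATE TABLE max_bid AS
    SELECT itemid, MAX(price) AS max_price
    FROM bids
    WINDOW TUMBLING (SIZE 10 MINUTES)
    GROUP BY itemid;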
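
And a minimal Flink SQL sketch (table and column names are hypothetical, and bidtime is assumed to be declared as an event-time attribute); the plan is produced through Calcite and can run in either batch or streaming mode:

-- highest bid per 10-minute tumbling window (classic group-window syntax)
SELECT
    TUMBLE_START(bidtime, INTERVAL '10' MINUTE) AS wstart,
    MAX(price) AS maxPrice
FROM Bid
GROUP BY TUMBLE(bidtime, INTERVAL '10' MINUTE);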

Now, let's hop into three new approaches for streaming SQL:

  • Time-Varying Relations:
    This methodology focuses on the element of time. Whenever we deal with streaming relations, we must consider relations whose contents vary over time. For this, we can use a time-varying relation (TVR): a relation whose contents change over time.

    TVRs may be encoded or materialized in many ways, notably as a sequence of classic relations (snapshots) or as a sequence of "INSERT" and "DELETE" operations. These two encodings are duals of one another and correspond to tables and streams respectively. While this duality could be treated as a problem, we intend to use it as an advantage.

    We can leverage the fact that streams and tables are both representations of one common semantic object, the TVR, and treat the TVR uniformly across both encodings. By working with the changes in the stream itself, queries over TVRs can be optimized and materialized to give better results (an illustrative sketch of the two encodings follows this list).

  • Event Time Semantics: 
    It is often assumed that data arrives in the order of its event time, but this does not hold true for mobile applications, distributed systems, or sharded archival data. The data may carry event-time timestamps, and yet the progress of the execution logic does not line up with them.

    This is because one hour of processing time bears no relation to one hour of event time, so event time must be accounted for explicitly to achieve correct results. The STREAM system accounts for event time with a feature called heartbeats, which buffers out-of-order data and feeds it into the query processor in event-time order; this tolerates timestamp skew at the cost of added latency. The MillWheel system instead uses watermarks, metadata that let computation proceed directly on out-of-order data.

    The best practice, however, is a combination of timestamps and watermarks: together they allow correct event-time calculations, carried out by grouping records into intervals of event time and executing each interval once the watermark passes, without requiring unbounded resources (a sketch of declaring a watermark follows this list).

  • Materialization Controls: 
    This approach provides control over how relations are rendered and when rows are materialized. One option is stream changelogs, which capture the element-by-element differences between two versions of a relation, using the encoding of a sequence of INSERT and DELETE operations to mutate TVRs. 

    Another option is materialization delay: tables and streams are modeled as TVRs, the relation resulting from a query is itself a TVR, and its materialization can be delayed, for instance until a watermark indicates that its rows are final (a sketch of the proposed EMIT syntax follows this list).
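
To make time-varying relations concrete, here is an illustrative sketch (all values are made up). The query defines a TVR; the comments show the same TVR in its two dual encodings:

-- the TVR: the running maximum bid price
SELECT MAX(price) AS maxPrice FROM Bid;

-- As a sequence of snapshots (table):    As a changelog (stream):
--   8:05  [ maxPrice = 4 ]                 8:05  INSERT (4)
--   8:07  [ maxPrice = 7 ]                 8:07  DELETE (4), INSERT (7)
--   8:12  [ maxPrice = 9 ]                 8:12  DELETE (7), INSERT (9)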
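
For event time semantics, here is a sketch of declaring timestamps and a watermark, borrowing Flink SQL's WATERMARK clause as one concrete syntax (the table definition is hypothetical and connector options are omitted). Rows may arrive out of order; the watermark asserts that events more than 30 seconds behind the largest bidtime seen are no longer expected:

CREATE TABLE Bid (
    itemid  BIGINT,
    price   DECIMAL(10, 2),
    bidtime TIMESTAMP(3),
    -- bidtime is the event-time column; tolerate 30 seconds of skew
    WATERMARK FOR bidtime AS bidtime - INTERVAL '30' SECOND
) WITH (
    'connector' = '...'  -- a real source (e.g. Kafka) goes here
);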
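
For materialization controls, the research these approaches come from proposes an EMIT clause to express them in SQL; the sketch below uses that proposed (non-standard) syntax:

-- render the result as its changelog stream
SELECT MAX(price) AS maxPrice
FROM Bid
EMIT STREAM;

-- or delay materialization until the watermark marks rows as final:
--   SELECT MAX(price) AS maxPrice FROM Bid EMIT AFTER WATERMARK;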

An example based on the NEXMark benchmark for queries over data streams:

The query monitors the highest-priced items currently on auction and, every 10 minutes, derives a result with the highest bid of the most recent 10 minutes.

A query in CQL:

SELECT
    Rstream(B.price, B.itemid)
FROM
    Bid [RANGE 10 MINUTE SLIDE 10 MINUTE] B
WHERE
    B.price =
        (SELECT MAX(B1.price) FROM Bid
            [RANGE 10 MINUTE SLIDE 10 MINUTE] B1);


A query in SQL:

SELECT
    MaxBid.wstart, MaxBid.wend,
    Bid.bidtime, Bid.price, Bid.itemid
FROM
    Bid,
    (SELECT
         MAX(TumbleBid.price) maxPrice,
         TumbleBid.wstart wstart,
         TumbleBid.wend wend
     FROM
         Tumble(
             data    => TABLE(Bid),
             timecol => DESCRIPTOR(bidtime),
             dur     => INTERVAL '10' MINUTE) TumbleBid
     GROUP BY
         TumbleBid.wend) MaxBid
WHERE
    Bid.price = MaxBid.maxPrice AND
    Bid.bidtime >= MaxBid.wend - INTERVAL '10' MINUTE AND
    Bid.bidtime < MaxBid.wend;


Takeaways: 
In contrast to previous approaches, this query treats timestamps as explicit data, and rows in the Bid stream need not appear in order of bidtime. Tumble is a table-valued function that assigns each row of the Bid stream to the 10-minute interval containing its bidtime, yielding the window-start (wstart) and window-end (wend) columns used above.

From the above example, we can see that as the Bid relation evolves over time and new elements are added, the relation defined by the query evolves with it. Thus, we can use this approach to obtain a TVR that evolves with the changes in the elements of a query.


