Data and analytics workloads: how to choose the right technology & tool


A framework that can aid the decision-making process for data and analytics workloads

In this framework, an ‘Application’ comes with its ‘Functional’ and ‘Non-Functional’ requirements,
and these requirements drive the decision on which technology or tool to go with.

This article explains the drivers that help in deciding which technology or tool to use for
any data and analytics workload. Below are the key areas for which suitable technologies need to be
identified with the help of these underlying drivers.
 

  1. Data Storage 
  2. Data Ingestion 
  3. Data Processing  

In short:

  • Application requirements drive the “Data Storage” technology choice.
  • The characteristics of the data to be ingested drive the “Data Ingestion” technology choice.
  • Once the “Data Storage” and “Data Ingestion” technology choices are made, business rules,
    data quality, and latency requirements drive the “Data Processing” technology choice.

     1. Data Store Technologies

    ‘Data Store Technologies’ fall into 3 main categories:

  1. NoSQL
  2. Relational / SQL supported DB Engines 
  3. Distributed File Systems  

     NoSQL is preferred for the following use cases:

  • Applications that require horizontal scaling: e.g. a mobile app with a huge number of users,
    with read / write operations for each user and sharding
  • Low latency / session state: ad tech and game state
  • Application monitoring and IoT, which involve streaming data ingestion and continuous writes
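
To make the horizontal scaling and sharding point concrete, here is a minimal sketch in Python. It assumes a hypothetical cluster of four shards and uses the user ID as the shard key. The names `NUM_SHARDS`, `shard_for`, and `route_writes` are illustrative only; real NoSQL engines ship their own partitioning schemes.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size, chosen only for illustration


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard using a stable hash of the key."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards


def route_writes(user_ids):
    """Group writes by shard so that each node only handles its own users."""
    shards = {i: [] for i in range(NUM_SHARDS)}
    for uid in user_ids:
        shards[shard_for(uid)].append(uid)
    return shards


routed = route_writes([f"user-{i}" for i in range(100)])
```

Plain modulo hashing is the simplest option, but it moves most keys when the number of shards changes, which is why production systems often use consistent hashing or range-based partitioning instead.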

       The choice of a NoSQL database depends on the trade-offs described by the CAP theorem
(Consistency, Availability, and Partition Tolerance) and on the data model that fits the use case:
 

  • Key-Value Store DB: a large amount of data, fast access, and simple relations.
    E.g. online shopping carts, session data
  • Document DB: web applications that store information in JSON format as collections
  • Wide Column Store: billions of web pages / indexes. E.g. storing data from streaming data sources
    (IoT devices), real-time read / write operations
  • Graph DB / Triple Store: when complex relationships matter. E.g. social networks
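
The four data models above can be sketched with plain Python structures. This is only an illustration of how the data is shaped in each model, not of any particular engine; all keys, field names, and sample values are made up.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {"cart:user-42": ["sku-1", "sku-7"]}

# Document store: a collection of JSON-like documents with nested fields.
documents = [
    {"_id": 1, "title": "Hello", "tags": ["intro"], "author": {"name": "Ann"}},
]

# Wide column store: row key -> column family -> column name -> value.
wide_column = {
    "sensor-9": {
        "readings": {"2024-01-01T00:00": 21.5, "2024-01-01T00:01": 21.7},
    },
}

# Graph store: each person maps to the set of people they follow.
graph = {"ann": {"bob"}, "bob": {"cy"}, "cy": {"dee"}, "dee": set()}


def friends_of_friends(edges, person):
    """Return people exactly two hops away who are not already direct contacts."""
    direct = edges.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= edges.get(friend, set())
    return two_hops - direct - {person}


suggested = friends_of_friends(graph, "ann")
```

The relationship query at the end is the kind of traversal that becomes awkward in the other three models, which is why graph stores are preferred when relationships are the main thing being queried.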

 

          NOTE: The above-mentioned DB engines are not limited to a single model. An engine known as a
‘Key-Value Store’ can also be used as a ‘Document Store’, ‘Time Series DB’, etc. However, each is
known and popular for its primary purpose.
 

      Relational / SQL-supported DB Engines

  • Traditional RDBMS engines for ACID operations, ranging from SQL Server to MySQL and PostgreSQL
  • Analytical database engines to support OLAP use cases: columnar data stores, which store data
    in a column-oriented model rather than a row-based one
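
The difference between row-based and column-oriented storage can be sketched as follows. This is a simplified in-memory illustration with made-up order data; real columnar engines add compression, encoding, and on-disk layouts that are not shown here.

```python
# Row-based layout: each record is stored together, which suits transactional access.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: each column is stored together, which suits aggregates.
columns = to_columns(rows)

# Transactional access touches one whole record.
order_2 = rows[1]

# Analytical access scans a single column and ignores the rest.
total_amount = sum(columns["amount"])
```

An OLAP query such as a sum over `amount` only needs to read one column in the second layout, whereas the row layout forces it to touch every field of every record.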
     

     Distributed File Systems  

        The Hadoop ecosystem has ‘HDFS’, its core file system, which stores files irrespective of their
format and structure.
 

          Another popular term is ‘Data Lake’. It is a concept and may be built from one or more
different technologies. Sometimes a ‘Columnar DW’ plays the role of the Data Lake, and sometimes
HDFS plays the same role. A Data Lake primarily holds all the data in its source format without any
‘Processing’ applied to it. It requires strong ‘Metadata Management’ to identify and query the
required data sets for further applications. Microsoft offers an out-of-the-box ‘Data Lake Store’
solution. We shall focus on Data Lakes in separate papers/sections.
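
The drivers discussed in this section can be condensed into a rough rule-of-thumb function. The sketch below is only one possible encoding of the guidance above; the `Requirements` fields and the `suggest_store_categories` function are hypothetical names, and a real decision would weigh many more factors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirements:
    """Hypothetical requirement flags, named after the drivers in the text."""
    needs_acid: bool = False
    needs_olap: bool = False
    needs_horizontal_scaling: bool = False
    needs_low_latency_state: bool = False
    has_continuous_writes: bool = False
    stores_raw_files: bool = False


def suggest_store_categories(req: Requirements) -> List[str]:
    """Return every data store category whose drivers match the requirements."""
    suggestions = []
    if (req.needs_horizontal_scaling
            or req.needs_low_latency_state
            or req.has_continuous_writes):
        suggestions.append("NoSQL")
    if req.needs_acid:
        suggestions.append("Relational (RDBMS)")
    if req.needs_olap:
        suggestions.append("Relational (columnar analytical engine)")
    if req.stores_raw_files:
        suggestions.append("Distributed file system")
    return suggestions


iot_platform = Requirements(has_continuous_writes=True, stores_raw_files=True)
iot_suggestions = suggest_store_categories(iot_platform)
```

The function returns a list rather than a single answer because one application often needs more than one category, for example a NoSQL store for live writes next to a distributed file system for raw history.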

     2. Data Ingestion Technologies 

     Once the ‘Data Store’ choice is identified, the data to be ingested needs to be analyzed in terms
of its characteristics and quality. The combination of ‘Kind of Data’ and ‘Target Data Store’
determines the right ‘Data Ingestion Technology’ choice. For example, streaming data such as
‘log files’ may need to be ingested and stored in a ‘Wide Column Store’. Knowing both the source
and the target in this way helps identify the right ingestion technology.
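
As an illustration of the log file example, the sketch below parses streaming log lines and writes them into a wide-column-style structure keyed by host. The log format (`timestamp host message`) and the function names are assumptions made for this example, and the nested dictionary only stands in for a real wide column store.

```python
from collections import defaultdict


def parse_log_line(line):
    """Split a 'timestamp host message' line into its three fields."""
    timestamp, host, message = line.split(" ", 2)
    return timestamp, host, message


def ingest(lines, store=None):
    """Write each log line as one column (timestamp) under its row key (host)."""
    store = defaultdict(dict) if store is None else store
    for line in lines:
        timestamp, host, message = parse_log_line(line)
        store[host][timestamp] = message
    return store


log_lines = [
    "2024-01-01T00:00:00Z web-1 GET /home 200",
    "2024-01-01T00:00:01Z web-1 GET /cart 500",
    "2024-01-01T00:00:01Z web-2 GET /home 200",
]
store = ingest(log_lines)
```

Because `ingest` accepts any iterable, the same function works whether the lines come from a finished file or from a generator that keeps yielding new lines.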
 

     3. Data Processing Technologies

     The ‘Operations’ that need to be performed on the incoming data before or after storing it in the
‘Data Store’, together with the ‘performance’ expectations, determine the right technology choice for
‘Data Processing’. A few popular technology choices are ‘Spark’ and ‘MapReduce’.
 

     In the Big Data world, processing streaming data in distributed systems with low latency is a key
requirement. ‘Streaming Data’ is unbounded: there is no literal end to the incoming data feeds.
Below are a few essential parameters to keep in mind before making the technology choice.
 

  • Delivery Guarantee: incoming streaming data needs to be ingested and processed even when
    failures occur. There are 3 modes: ‘At least once’, ‘Exactly once’, and ‘At most once’
  • Fault Tolerance: in case of failure, processing must resume from the point of failure
  • State Management: the ‘State’ of the data should be persisted so that it survives a failure
  • Performance: consumers should be able to read the processed data in real time / near real time
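
The delivery modes differ mainly in whether progress is recorded before or after a record is processed. The following sketch simulates a single crash between those two steps; the `run` and `deduplicate` functions are hypothetical, and the committed offset stands in for the persisted state that a real engine would checkpoint.

```python
def run(events, commit_first, crash_at):
    """Replay events with one simulated crash between 'commit' and 'process'.

    commit_first=True  models at-most-once  (commit the offset, then process).
    commit_first=False models at-least-once (process, then commit the offset).
    """
    output = []
    committed = 0      # offset that survives a restart
    crashed = False
    pos = 0
    while pos < len(events):
        if commit_first:
            committed = pos + 1
            if pos == crash_at and not crashed:
                crashed = True
                pos = committed        # restart: this event is never processed
                continue
            output.append(events[pos])
        else:
            output.append(events[pos])
            if pos == crash_at and not crashed:
                crashed = True
                pos = committed        # restart: this event is processed again
                continue
            committed = pos + 1
        pos += 1
    return output


def deduplicate(records):
    """Drop repeated records while keeping first-seen order."""
    return list(dict.fromkeys(records))


events = ["a", "b", "c"]
at_most_once = run(events, commit_first=True, crash_at=1)
at_least_once = run(events, commit_first=False, crash_at=1)
effectively_once = deduplicate(at_least_once)
```

In this simulation at-most-once loses the record in flight during the crash, while at-least-once processes it twice. Results that look like exactly-once are commonly obtained by combining at-least-once delivery with idempotent or deduplicating writes, which is what `deduplicate` imitates here under the assumption that every event is unique.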

     Based on the above considerations, the available technologies can be further classified into
two categories:

  1. Native Streaming: processing each record as soon as it arrives. Suitable for simple operations.
     E.g. Storm, Kafka Streams, Flink, and Samza
  2. Micro-Batching: processing the incoming data in micro-batches. E.g. Spark Streaming.
     This is good for complex operations, with some trade-off on latency
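
The two processing styles can be contrasted with plain Python generators. This is a conceptual sketch, not the API of any of the engines named above; `native_stream` and `micro_batches` are made-up helpers, and the batches here are cut by record count, whereas real micro-batch engines usually cut them by time interval.

```python
from itertools import islice


def native_stream(events, handle):
    """Handle each event individually as soon as it arrives."""
    for event in events:
        yield handle(event)


def micro_batches(events, batch_size):
    """Group an event stream into small lists of at most batch_size items."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


readings = [3, 5, 4, 8, 6]

# Native streaming: one result per event, available immediately.
per_event = list(native_stream(readings, lambda x: x * 2))

# Micro-batching: one result per batch, available once the batch is complete.
per_batch = [sum(batch) / len(batch) for batch in micro_batches(readings, 2)]
```

The per-batch average shows the trade-off: an operation over a whole group of records is easy to express, but no result appears until the batch has been collected.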
     

     All the technologies mentioned in the 3 sections above come with a set of libraries and support
multiple programming languages from both the ‘Data’ world and the ‘Application Development’ world,
e.g. Java, .NET, Go, Scala, SQL, Python, R, etc. Other modules that are important to consider
are ‘Management & Coordination’, ‘Resource Management’, ‘Security’, and ‘Scheduling’.

 

     About Technovert:

     It all comes back to good data. We’ve got you covered for any data project across verticals.
Our comprehensive business intelligence consulting services, including Power BI solutions,
Data Visualization, and Dashboard Design & Development, along with DBA support services, can help.

     Our experts help you discover how to unlock the true potential of your data and take the next
step in choosing the right tool and technology for any of your data and analytics workloads.

     Contact us or schedule a call back to learn more about how you can transform your business.

Jeevan
