Data and analytics workloads: How to choose the right technology & tool

A framework which can aid in the decision-making process for data and analytics workloads 

In the framework below, an application comes with its functional and non-functional requirements, and these requirements drive the choice of technology or tool.

This article explains the drivers that determine which technology or tool to use for a data and analytics workload. The key areas for which a suitable technology must be chosen, each with its own underlying drivers, are:

  1. Data Storage 
  2. Data Ingestion 
  3. Data Processing  

[Figure: Data and analytics workload framework]

In short:

  • Application requirements drive the “Data Storage” technology choice.
  • The characteristics of the data to be ingested drive the “Data Ingestion” technology choice.
  • Once the “Data Storage” and “Data Ingestion” choices are made, business rules, data quality, and latency requirements drive the “Data Processing” technology choice.

     1. Data Store Technologies

‘Data Store Technologies’ fall into three broad categories:

  1. NoSQL
  2. Relational / SQL-supported DB engines
  3. Distributed file systems

NoSQL is preferred for the following use cases:

  • Applications that require horizontal scaling: for example, a mobile app with a very large number of users, where each user's read and write operations are spread across shards.
  • Low latency and session state: for example, ad tech and game state.
  • Application monitoring and IoT, where data is ingested as a stream and written continuously.
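
As a minimal sketch of the sharding idea in the first use case, the Python snippet below routes each user's reads and writes to one of several shards by hashing the user ID. The names `shard_for`, `write`, and `read`, and the four in-memory dictionaries standing in for database nodes, are hypothetical and chosen only for illustration. Real NoSQL engines handle this routing internally.

```python
import hashlib

NUM_SHARDS = 4  # assumed number of nodes, for illustration only

# Each dict stands in for one database node (hypothetical).
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def write(user_id: str, value: dict) -> None:
    """Store the value on the shard that owns this user."""
    shards[shard_for(user_id)][user_id] = value


def read(user_id: str):
    """Look up the value on the shard that owns this user."""
    return shards[shard_for(user_id)].get(user_id)


for i in range(1000):
    write(f"user-{i}", {"last_seen": i})

# Show how many users each shard holds.
print([len(s) for s in shards])
```

Because the hash is stable, the same user always lands on the same shard, so load spreads across nodes instead of concentrating on one machine. Plain modulo hashing reshuffles most keys when the number of shards changes, which is why production systems often use schemes such as consistent hashing instead.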

The choice of NoSQL database depends on the trade-offs described by the CAP theorem (Consistency, Availability, and Partition Tolerance) and on the data model the application needs:

  • Key-Value Store DB: large amounts of data, fast access, and simple relations. E.g. online shopping carts, session data.
  • Document DB: web applications that store information as collections of JSON documents.
  • Wide Column Store: billions of web pages or index entries. E.g. storing data from streaming data sources (IoT devices), with real-time read/write operations.
  • Graph DB / Triple Store: when complex relationships matter. E.g. social networks.
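
To make the four data models more concrete, here is a rough sketch that mimics each one with plain Python structures. All keys, field names, and sample values are invented for illustration, and no real database client is involved.

```python
# Key-value: an opaque value looked up by a single key (e.g. session data).
key_value = {"session:abc123": b"serialized-cart-bytes"}

# Document: a self-describing, JSON-like record stored in a collection.
documents = [
    {"_id": 1, "name": "Asha", "cart": [{"sku": "A1", "qty": 2}]},
]

# Wide column: row key -> column name -> value.
# Rows do not need to share the same set of columns.
wide_column = {
    "sensor-7": {"2024-01-01T00:00": 21.5, "2024-01-01T00:01": 21.7},
    "sensor-9": {"2024-01-01T00:00": 19.0},
}

# Graph / triple store: facts held as (subject, predicate, object).
triples = {
    ("asha", "follows", "ben"),
    ("ben", "follows", "chen"),
}


def followed_by(person: str) -> set:
    """Return everyone that `person` follows."""
    return {o for s, p, o in triples if s == person and p == "follows"}


def two_hops(person: str) -> set:
    """Return the people followed by those whom `person` follows."""
    return {x for f in followed_by(person) for x in followed_by(f)} - {person}


print(two_hops("asha"))
```

The last query is the kind of relationship traversal that is natural in a graph model and awkward in a key-value store, where every hop would be a separate lookup coded by hand.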

[Figure: Data Storage]

NOTE: The DB engines mentioned above are not limited to a single category. An engine known as a ‘Key-Value Store’ can often also be used as a ‘Document Store’, ‘Time Series DB’, etc. Each is listed under the primary purpose for which it is best known.

Relational / SQL-supported DB Engines

  • Traditional RDBMS engines for ACID operations, ranging from SQL Server to MySQL and PostgreSQL.
  • Analytical database engines for OLAP use cases: columnar data stores, which store data column by column rather than row by row.
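
The difference between row-based and column-oriented storage can be illustrated with a small sketch. The `rows` table and its field names are made up for this example. Real engines add compression, indexing, and on-disk layouts that are not shown here.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]

# Column-oriented layout: each column is stored together.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# Transactional (OLTP-style) access: fetch one whole record.
# The row layout returns it in a single lookup.
order_2 = rows[1]

# Analytical (OLAP-style) access: aggregate one column over all records.
# The column layout reads only the 'amount' values and skips the rest.
total_amount = sum(columns["amount"])

print(order_2)
print(total_amount)  # 250.0
```

An aggregate over one column touches only that column's values in the columnar layout, which is why analytical engines favor it, while fetching or updating a whole record is simpler in the row layout.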

Distributed File Systems  

The Hadoop ecosystem has ‘HDFS’, its core file system, which stores files irrespective of their format and structure.
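
The three categories above can be condensed into a rough rule of thumb. The sketch below is an illustration only: the `Requirements` fields, the `suggest_store` function, and the order in which the rules are checked are assumptions made for this example, not a definitive selection procedure.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical flags summarizing an application's storage needs."""
    raw_files_any_format: bool = False   # keep files as-is, any structure
    horizontal_scale: bool = False       # very large user base, sharding
    low_latency_state: bool = False      # session or game state
    continuous_writes: bool = False      # monitoring / IoT streams
    analytical_queries: bool = False     # OLAP-style aggregation


def suggest_store(req: Requirements) -> str:
    """Return the data store category suggested by the article's drivers."""
    if req.raw_files_any_format:
        return "Distributed file system"
    if req.horizontal_scale or req.low_latency_state or req.continuous_writes:
        return "NoSQL"
    if req.analytical_queries:
        return "Relational / SQL (analytical, columnar)"
    # Default: ACID-oriented transactional workloads.
    return "Relational / SQL (traditional RDBMS)"


print(suggest_store(Requirements(continuous_writes=True)))
print(suggest_store(Requirements(analytical_queries=True)))
```

In practice several flags are often true at once, and a workload may end up using more than one store, so the output is best read as a starting point for discussion.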

[Figure: Distributed file systems]

Another popular term is ‘Data Lake’. It is a concept rather than a single product and may combine one or more technologies: sometimes a ‘Columnar DW’ plays the role of the data lake, and sometimes HDFS does. A data lake primarily holds all data in its source format, without any processing applied to it. It requires strong metadata management so that the required data sets can be identified and queried by downstream applications. Microsoft offers an out-of-the-box ‘Data Lake Store’ solution. We shall focus on data lakes in separate papers/sections.

     2. Data Ingestion Technologies 

Once the ‘Data Store’ choice is made, the data to be ingested needs to be analyzed in terms of its characteristics and quality. The combination of ‘Kind of Data’ and ‘Target Data Store’ determines the right ‘Data Ingestion Technology’. For example, if streaming data such as log files needs to be ingested and stored in a ‘Wide Column Store’, that pairing narrows down the suitable ingestion technologies.

[Figure: Data ingestion technology]

     3. Data Processing Technologies

The operations that need to be performed on the incoming data, before or after it is stored in the ‘Data Store’, together with the performance expectations, determine the right ‘Data Processing’ technology. Two popular choices are ‘Spark’ and ‘MapReduce’.

[Figure: Data processing technology]

In the Big Data world, processing streaming data in distributed systems with very low latency is a key requirement. Streaming data is unbounded: the incoming feed has no defined end. Below are a few essential parameters to keep in mind before making the technology choice.

  • Delivery Guarantee: incoming streaming data needs to be ingested and processed even when failures occur. There are three modes: ‘at most once’ (a record may be lost but is never processed twice), ‘at least once’ (a record is never lost but may be processed more than once), and ‘exactly once’ (each record affects the result exactly once).
  • Fault Tolerance: after a failure, processing must resume from the point of failure.
  • State Management: the state of the computation should be persisted so that it can be restored after a failure.
  • Performance: consumers should be able to read the processed data in real time or near real time.
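
The first three parameters interact, and a toy example may help. The sketch below counts events per key and saves its state and input offset together as a checkpoint. The `CheckpointedCounter` class, the in-memory `checkpoint` dictionary standing in for durable storage, and the simulated crash are all hypothetical. Real stream processors implement the same idea with their own mechanisms.

```python
class CheckpointedCounter:
    """Counts events per key and checkpoints state plus offset together."""

    def __init__(self):
        # Stands in for durable storage that survives a crash (assumption).
        self.checkpoint = {"offset": 0, "counts": {}}
        self.offset = 0
        self.counts = {}

    def restore(self):
        """Roll back to the last checkpoint after a failure."""
        self.offset = self.checkpoint["offset"]
        self.counts = dict(self.checkpoint["counts"])

    def process(self, events, checkpoint_every=2, fail_at=None):
        """Process events from the current offset, checkpointing periodically."""
        while self.offset < len(events):
            if self.offset == fail_at:
                raise RuntimeError("simulated crash")
            key = events[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1
            if self.offset % checkpoint_every == 0:
                self.checkpoint = {
                    "offset": self.offset,
                    "counts": dict(self.counts),
                }


events = ["a", "b", "a", "c", "a"]
job = CheckpointedCounter()
try:
    job.process(events, fail_at=3)   # crashes before reading offset 3
except RuntimeError:
    job.restore()                    # resume from the checkpoint at offset 2
job.process(events)
print(job.counts)  # {'a': 3, 'b': 1, 'c': 1}
```

The event at offset 2 is read twice, which is at-least-once delivery. Because the state is rolled back to the same checkpoint as the offset, that event is counted only once in the final result. Checkpointing less often reduces overhead but means more events are replayed after a failure.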

Based on the above considerations, we can further classify the available technologies into two categories:

  1. Native Streaming: processes each record as soon as it arrives. Suitable for simple operations. E.g. Storm, Kafka Streams, Flink, and Samza.
  2. Micro Batching: processes the incoming data in small batches (micro-batches). E.g. Spark Streaming. This suits complex operations, at the cost of some added latency.
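
The contrast between the two categories can be sketched with plain Python generators. The functions `native_stream` and `micro_batch` are hypothetical names, the doubling step is a placeholder for real processing, and batches are cut by record count here for simplicity, whereas micro-batch engines typically cut them by time interval.

```python
from typing import Iterable, Iterator, List


def native_stream(events: Iterable[int]) -> Iterator[int]:
    """Emit one result per event as soon as the event arrives."""
    for event in events:
        yield event * 2


def micro_batch(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Collect events into small batches and emit one result list per batch."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield [x * 2 for x in batch]
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield [x * 2 for x in batch]


print(list(native_stream(range(5))))   # [0, 2, 4, 6, 8]
print(list(micro_batch(range(5), 2)))  # [[0, 2], [4, 6], [8]]
```

In the micro-batch version the first result is not available until a whole batch has been collected, which is the latency trade-off mentioned above. In exchange, each batch can be processed with operations that work on a set of records at once.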

The technologies mentioned in the three sections above come with their own libraries and support multiple programming languages from both the data world and the application development world, e.g. Java, .NET, Go, Scala, SQL, Python, and R. Other modules that are important to consider are ‘Management & Coordination’, ‘Resource Management’, ‘Security’, and ‘Scheduling’.
 

About Technovert: 

It all comes back to good data. We’ve got you covered for any data project across verticals. Our comprehensive business intelligence consulting services, including Power BI solutions, Data Visualization, and Dashboard Design & Development, along with DBA support services, can help.

Our experts help you discover how to unlock the true potential of your data and take the next step in choosing the right tool and technology for any of your data and analytics workloads.

