A framework which can aid in the decision-making process for
data and analytics workloads
In the framework below, an ‘Application’ comes with its ‘Functional’ and ‘Non-Functional’ requirements,
and these requirements drive the decision about which technology or tool to use.
This article explains the drivers that determine which technology or tool to use for
a data and analytics workload. The key areas for which suitable technologies need to be
identified, each with the help of its underlying drivers, are:
- Data Storage
- Data Ingestion
- Data Processing
In short:
- Application requirements drive the “Data Storage” technology choice.
- The characteristics of the data to be ingested drive the “Data Ingestion” technology choice.
- Once the “Data Storage” and “Data Ingestion” technology choices are made, the business rules, data quality, and
latency requirements drive the “Data Processing” technology choice.
1. Data Storage Technologies
The ‘Data Store Technologies’ fall primarily into 3 buckets:
- NoSQL
- Relational / SQL supported DB Engines
- Distributed File Systems
NoSQL is preferred for the following use cases:
- Applications that require horizontal scaling, e.g. a mobile app with a huge number of users,
with read / write operations for each user and sharding
- Low latency / session state, e.g. ad tech and game state
- Application monitoring and IoT, which involve streaming data ingestion and continuous writes
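To make the horizontal-scaling point concrete, here is a minimal sketch of hash-based sharding, where each user's reads and writes are routed to one of several partitions. The names `shard_for`, `write`, and `read`, the shard count, and the in-memory dictionaries standing in for database nodes are all assumptions made for illustration; they are not taken from any particular NoSQL product.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of partitions (nodes)

# Each dict stands in for a separate database node in this sketch.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a shard index using a stable hash of the id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def write(user_id: str, value: dict) -> None:
    """Store the value on the shard that owns this user id."""
    shards[shard_for(user_id)][user_id] = value


def read(user_id: str):
    """Read from the same shard the write was routed to."""
    return shards[shard_for(user_id)].get(user_id)


write("user-42", {"cart": ["book"]})
print(read("user-42"))
```

Simple modulo hashing like this moves most keys when the number of shards changes, so real systems often use schemes such as consistent hashing instead; the sketch only shows the routing idea.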
The selection of a NoSQL database depends on the trade-offs described by the CAP theorem
(Consistency, Availability, and Partition Tolerance) and on the data model that fits the use case:
- Key-Value Store DB: a large amount of data, fast access, and simple relations.
E.g. online shopping carts, session data
- Document DB: web applications with information stored in JSON format as a collection.
- Wide Column Store: billions of web pages / indexes. E.g. storing data from streaming data sources
(IoT devices), real-time read / write operations
- Graph DB / Triple Store: when complex relationships matter. E.g. social networks
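The four data models listed above can be illustrated with plain Python structures. This is only a sketch of the shape of the data in each model; the keys, field names, sample values, and the helper `followers_of` are hypothetical and do not correspond to any specific database engine.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {
    "session:abc123": '{"user": "u1", "cart": ["sku-1", "sku-2"]}',
}

# Document store: a collection of JSON-like documents with nested fields.
documents = [
    {"id": "u1", "name": "Asha", "orders": [{"sku": "sku-1", "qty": 2}]},
]

# Wide-column store: row key -> column family -> column name -> value.
# Rows can hold very many columns, e.g. one per timestamped reading.
wide_column = {
    "sensor-7": {
        "readings": {"2024-01-01T00:00": 21.5, "2024-01-01T00:01": 21.7},
    },
}

# Graph / triple store: (subject, relationship, object) triples.
edges = [
    ("u1", "FOLLOWS", "u2"),
    ("u3", "FOLLOWS", "u2"),
    ("u2", "FOLLOWS", "u3"),
]


def followers_of(user: str) -> list:
    """Return everyone with a FOLLOWS edge pointing at `user`."""
    return [src for src, label, dst in edges if label == "FOLLOWS" and dst == user]


print(followers_of("u2"))
```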
NOTE: The above-mentioned DB Engines are not limited to ‘Key-Value Store’ use; they can also be
used as ‘Document Store’, ‘Time Series DB’ etc. However, they are known and popular for that
Relational / SQL supported DB Engines
- Traditional RDBMS engines for ACID operations, ranging from SQL Server to MySQL and PostgreSQL
- Analytical database engines to support OLAP use cases: columnar data stores, which store data
column by column rather than row by row
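The difference between row-based and column-oriented layouts can be sketched in a few lines. The sample orders, the field names, and the functions `to_columns`, `total_amount_row_store`, and `total_amount_column_store` are assumptions made for illustration; real engines add compression, indexing, and on-disk formats that this sketch ignores.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(records: list) -> dict:
    """Pivot row records into one list per column (column-oriented layout)."""
    return {name: [record[name] for record in records] for name in records[0]}


columns = to_columns(rows)


def total_amount_row_store(records: list) -> float:
    # Every full record is visited even though only one field is needed.
    return sum(record["amount"] for record in records)


def total_amount_column_store(cols: dict) -> float:
    # Only the 'amount' column is read; other columns are never touched.
    return sum(cols["amount"])


print(total_amount_row_store(rows), total_amount_column_store(columns))
```

An aggregate over a single field only needs one column in the columnar layout, which is the usual reason this layout is associated with OLAP-style queries.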
Distributed File Systems
The Hadoop ecosystem has ‘HDFS’, its core file system, which stores files irrespective of their format.
Another popular term is ‘Data Lake’. It is a concept and may comprise one or more
different technologies. Sometimes a ‘Columnar DW’ plays the role of the Data Lake, and sometimes
HDFS plays that role. A Data Lake primarily holds all the data in its source format without any ‘Processing’
applied to it. It requires strong ‘Metadata Management’ to identify and query the required data set
for further applications. Microsoft offers an out-of-the-box ‘Data Lake Store’ solution. We shall focus on
Data Lakes in separate papers/sections.
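As a rough illustration of why metadata management matters for a data lake, the sketch below keeps a small catalog that records where each raw data set lives and how it can be found again. The `DatasetEntry` and `Catalog` classes, the paths, and the tags are all hypothetical; they are not the API of any data lake product.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    path: str   # hypothetical location of the raw files in the lake
    fmt: str    # source format, kept as-is (no processing applied)
    tags: set = field(default_factory=set)


class Catalog:
    """Minimal in-memory metadata catalog for raw data sets."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def find_by_tag(self, tag: str) -> list:
        """Return the names of all data sets carrying the given tag, sorted."""
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

    def locate(self, name: str) -> str:
        """Return the storage path recorded for a data set."""
        return self._entries[name].path


catalog = Catalog()
catalog.register(DatasetEntry("clickstream", "/lake/raw/clickstream/", "json", {"web", "events"}))
catalog.register(DatasetEntry("sensor_logs", "/lake/raw/sensors/", "csv", {"iot", "events"}))
print(catalog.find_by_tag("events"))
```

Without such a catalog the raw files are still stored, but nothing records what they contain or where to look, which is the problem the metadata layer addresses.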
2. Data Ingestion Technologies
Once the ‘Data Store’ choice is identified, the data to be ingested needs to be analyzed in terms of its
characteristics and quality. The combination of ‘Kind of Data’ and ‘Target Data Store’ determines
the right ‘Data Ingestion Technology’ choice. For example, streaming data such as ‘log files’ needs to be
ingested and stored into the ‘Wide Column Store’. This can help identify the right ingestion
3. Data Processing Technologies
The ‘Operations’ that need to be performed on the incoming data before or after storing it in the
‘Data Store’, together with the ‘performance’ expectations, determine the right technology choice for
‘Data Processing’. A few popular technology choices are ‘Spark’ and ‘MapReduce’.
In the Big Data world, processing streaming data in distributed systems with low latency is the key
requirement. This ‘Streaming Data’ is unbounded: there is no literal end to the incoming
data feeds. Below are a few essential parameters we need to keep in mind before making the
- Delivery Guarantee: incoming streaming data needs to be ingested and processed even when failures occur.
The guarantee offered falls into one of 3 modes: ‘At least once’, ‘Exactly once’ and ‘At most once’
- Fault Tolerance: in case of failure, processing must resume from the point of failure
- State Management: the ‘State’ of the data should be persisted so that it can be restored in case of failure
- Performance: consumers should be able to read the processed data in real time / near real time
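The delivery guarantees and the idea of resuming from the point of failure can be sketched with a simulated consumer that keeps a durable offset. Everything here is an assumption made for illustration: the function `run_consumer`, the mode names, and the single simulated crash are hypothetical and do not reproduce the API of any streaming framework.

```python
def run_consumer(log: list, mode: str, crash_at: int) -> list:
    """Consume `log`, crashing once at index `crash_at` between the
    'process' and 'commit' steps. Returns the messages actually processed."""
    processed = []
    committed = 0      # durable offset: where the consumer resumes after a restart
    crashed = False
    while committed < len(log):
        offset = committed
        crash_now = offset == crash_at and not crashed
        if mode == "at_most_once":
            committed = offset + 1          # commit first ...
            if crash_now:
                crashed = True
                continue                    # ... so this message is never processed
            processed.append(log[offset])
        elif mode == "at_least_once":
            processed.append(log[offset])   # process first ...
            if crash_now:
                crashed = True
                continue                    # ... commit is lost, message is read again
            committed = offset + 1
        else:
            raise ValueError(f"unknown mode: {mode}")
    return processed


log = ["a", "b", "c"]
lost = run_consumer(log, "at_most_once", crash_at=1)
duplicated = run_consumer(log, "at_least_once", crash_at=1)
# One common way to approach an exactly-once outcome: make at-least-once
# processing idempotent, here by dropping messages that were already seen.
deduplicated = list(dict.fromkeys(duplicated))
print(lost, duplicated, deduplicated)
```

In both modes the consumer restarts from the committed offset, which plays the role of the persisted state; the order of "process" and "commit" is what decides whether a failure loses a message or repeats it.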
Based on the above considerations, we can further classify the available technologies into two categories:
- Native Streaming: processing the data as soon as it arrives. Suitable for simple operations.
E.g. Storm, Kafka Streams, Flink, and Samza
- Micro Batching: processing the incoming data in micro-batches. E.g. Spark Streaming.
This is good for complex operations, with some trade-off on latency
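The two processing styles can be contrasted with a small sketch built on Python generators. The functions `native_streaming` and `micro_batching` and the doubling operation are hypothetical, and the batches here are cut by event count to keep the example deterministic, whereas micro-batching engines typically cut batches by time interval.

```python
from typing import Iterable, Iterator, List


def native_streaming(events: Iterable[int]) -> Iterator[int]:
    """Handle each event individually, as soon as it arrives."""
    for event in events:
        yield event * 2


def micro_batching(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Buffer events and process them one small batch at a time."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield [e * 2 for e in batch]
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield [e * 2 for e in batch]


print(list(native_streaming([1, 2, 3])))
print(list(micro_batching([1, 2, 3], batch_size=2)))
```

In the micro-batch version the first result is only available once a whole batch has been collected, which is the latency trade-off mentioned above.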
All the technologies mentioned in the 3 sections above come with a set of libraries and support
multiple programming languages from both the ‘Data’ world and the ‘Application Development’ world,
e.g. Java, .NET, Go, Scala, SQL, Python, R, etc. Other modules that are important to consider
are ‘Management & Coordination’, ‘Resource Management’, ‘Security’ and ‘Scheduling’.