Big Data Demystified: Understanding Databases, Data Warehouses, and Data Lakes
Are you intrigued by the buzz around data engineering, but find the technical jargon and complex concepts daunting? You’re not alone! Data engineering is essential for businesses to manage and analyze large amounts of data effectively, but it can be difficult for beginners to grasp.
In this blog, I aim to simplify the key terms and concepts of data engineering for beginners, so that understanding this field becomes plain sailing rather than a challenge. Join me on this journey to demystify data engineering and unlock the potential of your data.
Let's start with a basic understanding of “Big Data”. Big Data refers to large, complex volumes of both structured and unstructured data, at a scale that traditional data-processing tools cannot handle.
Structured data is organized according to a defined format or schema. Examples of structured data include data in spreadsheets and relational databases.
Unstructured data, on the other hand, does not have any defined structure or schema, so it cannot be easily analyzed using traditional tools. Examples include text documents, images, videos, and social media interactions like comments and reactions.
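To make the contrast concrete, here is a minimal sketch in Python. The column names and the sample comment are made up for illustration; the point is only that structured rows can be addressed by field name, while free text has to be interpreted first.

```python
import csv
import io

# Structured: every row follows the same schema (column names are hypothetical).
structured_csv = "customer_id,country,amount\n1,DE,19.99\n2,US,5.00\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))
total = sum(float(r["amount"]) for r in rows)  # fields are addressed by name

# Unstructured: free text has no columns, so meaning must be extracted first.
comment = "Loved the product, but shipping to Berlin took two weeks!"
mentions_shipping = "shipping" in comment.lower()  # crude keyword check

print(rows[0]["country"], round(total, 2), mentions_shipping)
```

The keyword check is deliberately naive. Real analysis of unstructured data usually needs dedicated text, image, or video processing.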
According to FinancesOnline, a single internet user generates a staggering 146,880 MB of data per day. With such an enormous volume of data being generated every day, it raises the question of where all this data is stored and how it is managed.
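To put that figure in perspective, here is a quick back-of-the-envelope conversion. The per-day number is the one quoted above; using decimal units (1 GB = 1000 MB) is my assumption.

```python
MB_PER_DAY = 146_880            # figure quoted above
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

mb_per_second = MB_PER_DAY / SECONDS_PER_DAY
gb_per_day = MB_PER_DAY / 1000  # decimal gigabytes (assumption: 1 GB = 1000 MB)

print(f"{mb_per_second:.1f} MB per second, {gb_per_day:.2f} GB per day")
```

In other words, the daily figure works out to roughly 1.7 MB every second for each user.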
To handle this massive amount of data, organizations employ various data management mechanisms, each with its own features and capabilities. One such mechanism is the data lake. A data lake is a centralized repository that allows organizations to store and manage both structured and unstructured data at any scale. Unlike traditional databases, it does not enforce a specific schema or structure and can store data in its raw form. A data lake can be built on cloud storage services such as Google Cloud Storage, Azure Data Lake Storage, and Amazon S3.
While a data lake can handle a large volume of diverse data types, it may not be the best solution for every use case. For instance, when dealing with structured data that is already cleaned and organized, a data warehouse may be a more suitable option.
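The idea of storing raw data first and deciding how to interpret it later (often called "schema on read") can be sketched in a few lines. This is only an illustration: a temporary local folder stands in for a cloud storage bucket, and the file names and the read_orders helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# A temporary local folder stands in for an object store bucket (assumption).
lake = Path(tempfile.mkdtemp()) / "raw"
lake.mkdir()

# Ingest: files are stored as-is, with no schema check on write.
(lake / "orders.json").write_text(json.dumps([{"id": 1, "amount": 19.99}]))
(lake / "review.txt").write_text("Great service, slow delivery.")
# Only the PNG file signature, not a full image; it stands in for binary data.
(lake / "logo.png").write_bytes(b"\x89PNG\r\n\x1a\n")


def read_orders(folder: Path) -> list[dict]:
    """Interpret the raw file as JSON only at read time."""
    return json.loads((folder / "orders.json").read_text())


stored = sorted(p.name for p in lake.iterdir())
orders = read_orders(lake)
print(stored, orders)
```

Nothing stopped the text file and the image from landing next to the JSON file. That flexibility is the appeal of a data lake, and also the reason it needs discipline to stay usable.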
A data warehouse is a centralized repository used for reporting and analysis. It contains data from multiple sources that has been cleaned, transformed, and organized to support business intelligence and decision-making. For example:
A healthcare organization’s data warehouse contains patient data, hospital admissions data, and medical treatment data for analysis and research.
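Here is a minimal sketch of the "clean, transform, then load" idea, using Python's built-in sqlite3 module as a stand-in for a real warehouse. The source rows, table names, and cleaning rules are all made up for illustration.

```python
import sqlite3

# Raw rows from two hypothetical source systems, with inconsistent formatting.
admissions_source = [("p1", " 2023-01-05 "), ("p2", "2023-01-07")]
treatments_source = [("P1", "x-ray"), ("P2", "MRI"), ("P1", None)]

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE admissions (patient_id TEXT, admitted_on TEXT)")
warehouse.execute("CREATE TABLE treatments (patient_id TEXT, treatment TEXT)")

# Clean and transform before loading: normalize ids, trim text, drop empty rows.
warehouse.executemany(
    "INSERT INTO admissions VALUES (?, ?)",
    [(pid.upper(), day.strip()) for pid, day in admissions_source],
)
warehouse.executemany(
    "INSERT INTO treatments VALUES (?, ?)",
    [(pid.upper(), t.upper()) for pid, t in treatments_source if t is not None],
)

# Reporting query that combines both sources.
report = warehouse.execute(
    """
    SELECT a.patient_id, a.admitted_on, COUNT(t.treatment)
    FROM admissions a
    LEFT JOIN treatments t ON t.patient_id = a.patient_id
    GROUP BY a.patient_id, a.admitted_on
    ORDER BY a.patient_id
    """
).fetchall()
print(report)
```

Because the ids were normalized on the way in, the join across the two sources works without any extra handling at query time.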
A data mart is a subset of a larger data warehouse, designed to serve a specific department or business unit within an organization. For example:
- A sales department data mart contains sales transaction data, customer data, and other sales-related data to support sales analysis and reporting.
- A marketing department data mart contains customer behavior data, campaign data, and other marketing-related data to support marketing analysis and reporting.
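One simple way to picture a data mart is as a filtered view over the warehouse. The sketch below uses sqlite3 views for that. The table, view names, and numbers are hypothetical, and in practice a data mart is often a separate physical store rather than a view.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL)")
db.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [
        ("sales", "transactions", 120.0),
        ("sales", "revenue", 5400.0),
        ("marketing", "campaign_clicks", 980.0),
    ],
)

# Each mart exposes only the slice of the warehouse its department needs.
db.execute(
    "CREATE VIEW sales_mart AS "
    "SELECT metric, value FROM warehouse_facts WHERE department = 'sales'"
)
db.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT metric, value FROM warehouse_facts WHERE department = 'marketing'"
)

sales_rows = db.execute("SELECT metric, value FROM sales_mart ORDER BY metric").fetchall()
marketing_rows = db.execute("SELECT metric, value FROM marketing_mart").fetchall()
print(sales_rows, marketing_rows)
```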
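Difference between Data Warehouse and Data Lake:
A data lake stores raw data, structured or unstructured, without enforcing a schema, whereas a data warehouse stores data that has already been cleaned, transformed, and organized for reporting and analysis.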
DATABASES:
Databases are one of the most popular data management solutions. A database is a structured collection of data that is stored and managed on a computer system. A relational database typically consists of tables, fields, and relationships between tables, and can be queried using SQL (Structured Query Language). For instance:
A hospital’s electronic medical records database contains patient information, diagnoses, and treatment plans.
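The hospital example can be sketched with Python's built-in sqlite3 module. The table and column names below are my own invention, not a real medical records schema. The sketch shows two tables, a relationship between them, and an SQL query that joins them.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces relationships only when enabled

db.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.execute(
    """
    CREATE TABLE diagnoses (
        id INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL REFERENCES patients(id),
        diagnosis TEXT NOT NULL,
        treatment_plan TEXT
    )
    """
)
db.execute("INSERT INTO patients VALUES (1, 'Ada')")
db.execute("INSERT INTO diagnoses VALUES (1, 1, 'fracture', 'cast for six weeks')")

# Query across the relationship with SQL.
rows = db.execute(
    """
    SELECT p.name, d.diagnosis, d.treatment_plan
    FROM patients p
    JOIN diagnoses d ON d.patient_id = p.id
    """
).fetchall()

# The relationship is enforced: a diagnosis for an unknown patient is rejected.
try:
    db.execute("INSERT INTO diagnoses VALUES (2, 99, 'flu', NULL)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(rows, orphan_rejected)
```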
OLTP (online transaction processing) and OLAP (online analytical processing) are two major types of database systems. The table below summarizes the differences:
Note:
- OLTP systems are optimized for handling real-time transactions with a focus on speed and efficient management of small amounts of data.
- OLAP systems are optimized for handling complex queries and analysis of large amounts of data, typically for business intelligence and data analysis purposes.
- OLTP and OLAP systems serve different functions and may use different hardware and software architectures to achieve their respective goals.
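The two workload styles from the notes above can be sketched side by side. For simplicity, both run on the same in-memory SQLite database here, which is an assumption made for illustration; real OLTP and OLAP systems usually run on different engines. The orders table and the place_order helper are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")


def place_order(conn, region, amount):
    """OLTP-style: one small, fast transaction that touches a single row."""
    with conn:  # commits on success, rolls back if an error is raised
        cur = conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount)
        )
    return cur.lastrowid


ids = [place_order(db, r, a) for r, a in [("EU", 10.0), ("EU", 30.0), ("US", 20.0)]]

# OLAP-style: one query that scans many rows and aggregates them for analysis.
summary = db.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(ids, summary)
```

The first part cares about each individual order being recorded quickly and safely. The second part cares about the overall picture across all orders.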
Wave of Evolution in Data Management:
Data management has evolved with the introduction of newer concepts like the Data Lakehouse and Data Decentralization. A Data Lakehouse combines the features of data lakes and data warehouses to create a centralized repository that can store both structured and unstructured data, while supporting data processing and analysis through a variety of tools and technologies. Delta Lake is one such example that offers a unified data management system for data ingestion, processing, and serving.
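One ingredient that lets a lakehouse behave more like a warehouse is a transaction log kept alongside the raw data files, so that readers only see fully committed writes. The toy sketch below illustrates just that idea. It is not Delta Lake's actual file format or API, and the log file name and the helper functions are hypothetical.

```python
import json
import tempfile
from pathlib import Path

table = Path(tempfile.mkdtemp())   # a local folder stands in for lake storage
log = table / "_commit_log.jsonl"  # hypothetical log file name


def write_batch(rows, commit=True):
    """Write rows as a new data file; list it in the log only if committed."""
    n = len(list(table.glob("part-*.json")))
    data_file = table / f"part-{n:05d}.json"
    data_file.write_text(json.dumps(rows))
    if commit:
        with log.open("a") as f:
            f.write(json.dumps({"add": data_file.name}) + "\n")


def read_table():
    """Return rows from committed files only, in commit order."""
    if not log.exists():
        return []
    rows = []
    for line in log.read_text().splitlines():
        file_name = json.loads(line)["add"]
        rows.extend(json.loads((table / file_name).read_text()))
    return rows


write_batch([{"id": 1}])
write_batch([{"id": 2}], commit=False)  # simulates a write that failed before commit
write_batch([{"id": 3}])
visible = read_table()
print(visible)
```

The uncommitted file still sits in the folder, but readers never see it because they go through the log rather than listing the files directly.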
Data Decentralization, on the other hand, is a newer architectural approach, commonly known as data mesh, that emphasizes the decentralization of data ownership and the creation of self-organizing domain-specific data teams. Rather than centralizing data in a single repository, data mesh encourages organizations to create a network of domain-specific data products that can be easily shared and integrated across teams. For instance, Spotify implemented a data mesh approach to enable teams to manage their own data domains and make data-driven decisions at scale. By adopting a decentralized approach to data management, organizations can empower teams to work more efficiently and collaboratively while improving data quality and accessibility.
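The notion of domain-owned data products that other teams can discover and reuse can be sketched with plain Python classes. Everything here (the DataProduct and Catalog classes, team names, and product names) is hypothetical and only meant to illustrate the ownership model, not any particular company's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A dataset published and maintained by one domain team."""
    name: str
    owner_team: str
    schema: dict[str, str]
    rows: list[dict] = field(default_factory=list)


class Catalog:
    """Shared index so teams can discover each other's data products."""

    def __init__(self):
        self._products = {}

    def publish(self, product: DataProduct) -> None:
        self._products[product.name] = product

    def find(self, name: str) -> DataProduct:
        return self._products[name]

    def owned_by(self, team: str) -> list[str]:
        return sorted(p.name for p in self._products.values() if p.owner_team == team)


catalog = Catalog()
catalog.publish(DataProduct("sales.orders", "sales", {"order_id": "int"}, [{"order_id": 1}]))
catalog.publish(DataProduct("marketing.campaigns", "marketing", {"campaign": "str"}))

# The marketing team consumes a product that the sales team owns and maintains.
orders = catalog.find("sales.orders")
print(orders.owner_team, orders.rows, catalog.owned_by("marketing"))
```

Each team stays responsible for its own product, while the catalog is the only thing that is shared.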
THE END…
In this blog, I have tried to walk you through the introductory terms of data engineering. I hope it has given you a good idea of them!
This is just the beginning of data engineering. Stay tuned for more such content, and share it with others.