A Data Pipeline is a set of processes that move data from one or more sources (Data Sources) to a destination (Data Applications) for storage, processing, or analysis.
Data Engineering mainly focuses on building and managing Data Pipelines. It involves ETL, data storage, and data computation to ensure data quality and relevance.
Looking at the overall architecture of Big Data, the core layers should include: data collection, data storage, data processing, and data application. While terminology might vary, the fundamental roles are quite similar.
So, following this layered architecture, let's delve into the core technologies encompassed by Big Data.
I. Big Data Collection
The task of data collection involves gathering and storing data from various sources onto a data storage platform. This process may include some basic cleansing, constituting the fundamental Extract, Transform, Load (ETL) workflow.
There are various types of data sources:
Website Logs
In the internet industry, website logs account for the largest share of data. The logs are generated on many web servers, and typically a Flume agent is deployed on each server to collect them and write them to HDFS in near real time.
Business Databases
Business databases come in many varieties, such as MySQL, Oracle, and SQL Server. To synchronize data from these databases to HDFS, tools like Sqoop are commonly used. However, Sqoop is heavyweight: it relies on MapReduce even for small data volumes, and every machine in the Hadoop cluster needs access to the business database. For these reasons, other solutions, such as Flume configured to synchronize database changes to HDFS in real time, are often preferred.
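As an illustration of that extract step, here is a minimal PySpark sketch that reads one business table over JDBC and lands it on HDFS as ORC, standing in for what Sqoop or a Flume-based sync would do; the host, credentials, table, and path are hypothetical, and the MySQL JDBC driver is assumed to be on Spark's classpath.

```python
# Minimal sketch: pull one business table from MySQL into HDFS as ORC.
# Host names, credentials, and paths are placeholders; requires the MySQL
# JDBC driver on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-to-hdfs-sync").getOrCreate()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://db-host:3306/shop")   # hypothetical source
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", "etl_password")
          .load())

# Land the snapshot in the raw (ODS) area of the warehouse on HDFS.
orders.write.mode("overwrite").orc("hdfs:///warehouse/ods/orders")
```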
Data Sources from FTP/HTTP
Other Data Sources
For manually entered data, providing an interface or a small program might be sufficient.
II. Big Data Storage and Processing
Storage
Undoubtedly, Hadoop Distributed File System (HDFS) is the most suitable data storage solution for data warehouses and data platforms in a Big Data environment.
Offline Processing
For offline data analysis and processing where real-time requirements are not critical, Hive is a preferred choice. Its rich data types, built-in functions, highly compressed ORC file storage format, and convenient SQL support make it more efficient for statistical analysis on structured data compared to MapReduce.
Spark, an in-memory batch processing framework that has gained popularity in recent years, performs significantly better than MapReduce, and its integration with Hive and YARN keeps improving, making it a natural choice for analysis and computation.
Since Hadoop YARN is already in place, Spark can be deployed on YARN directly, with no need for a separate Spark cluster.
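As a small illustration of this offline layer, the sketch below uses SparkSQL with Hive support to compute a daily page-view/unique-visitor aggregation over an ORC table; the database, table, and column names are made up for the example.

```python
# Minimal sketch: a daily statistical aggregation over a Hive ORC table,
# run with SparkSQL (e.g. on YARN). Database, table, and column names
# are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("daily-pv-uv")
         .enableHiveSupport()      # read/write tables registered in the Hive metastore
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS dw.daily_traffic (
        dt STRING, pv BIGINT, uv BIGINT
    ) STORED AS ORC
""")

spark.sql("""
    INSERT OVERWRITE TABLE dw.daily_traffic
    SELECT dt, COUNT(*) AS pv, COUNT(DISTINCT user_id) AS uv
    FROM ods.web_logs
    WHERE dt = '2024-01-01'
    GROUP BY dt
""")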
Real-time Processing
With increasing demands for real-time data warehousing, frameworks like Storm and Spark Streaming are used. Storm is the more mature option in this space, but Spark Streaming is preferred here: its slightly higher latency is negligible for our needs, and it fits naturally alongside the existing Spark stack.
Currently, Spark Streaming is employed for real-time website traffic statistics and real-time ad effectiveness analysis.
The approach is simple: Flume collects website and ad logs on the front-end log servers and sends them in real time to Spark Streaming, which performs the analysis. The results are then stored in Redis for real-time access by the business.
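Here is a minimal sketch of that pipeline, using a TCP socket source as a stand-in for the Flume channel and counting page views per micro-batch into Redis; the hosts, ports, and log format are assumptions.

```python
# Minimal sketch: count page views per 10-second batch and push them to Redis.
# The socket source stands in for the Flume channel; log lines are assumed to
# look like "timestamp<TAB>url<TAB>...". Hosts and ports are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
import redis

sc = SparkContext(appName="realtime-traffic")
ssc = StreamingContext(sc, batchDuration=10)

lines = ssc.socketTextStream("log-gateway", 9999)
page_views = (lines.map(lambda line: (line.split("\t")[1], 1))
                   .reduceByKey(lambda a, b: a + b))

def save_batch(rdd):
    # Runs on the driver once per batch; write the counts into a Redis hash.
    counts = rdd.collect()
    if counts:
        r = redis.Redis(host="redis-host", port=6379)
        for url, count in counts:
            r.hincrby("pv:current", url, count)

page_views.foreachRDD(save_batch)

ssc.start()
ssc.awaitTermination()
```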
III. Big Data Result Sharing
Result sharing refers to the storage of results from data analysis and processing, essentially an Online Analytical Processing (OLAP) database.
The results analyzed and computed by Hive, MapReduce, Spark, and SparkSQL are first stored on HDFS. However, most business systems and applications cannot read data from HDFS directly, so a data sharing layer is needed to give the various businesses and products easy access. Mirroring the collection layer that loads data into HDFS, a tool is required here to synchronize data from HDFS out to the target data stores.
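As an illustration of that synchronization step, the sketch below copies an aggregated result table from the warehouse into a MySQL database that applications can query directly; connection details and table names are hypothetical.

```python
# Minimal sketch: push an aggregated result table from Hive/HDFS into MySQL
# (the data sharing layer). Connection details and table names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("export-daily-traffic")
         .enableHiveSupport()
         .getOrCreate())

result = spark.table("dw.daily_traffic")   # result produced by the offline jobs

(result.write
       .format("jdbc")
       .option("url", "jdbc:mysql://share-db:3306/report")  # hypothetical target
       .option("dbtable", "daily_traffic")
       .option("user", "share_user")
       .option("password", "share_password")
       .mode("overwrite")
       .save())
```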
Additionally, some real-time computing results may be written directly into the data sharing layer.
IV. Big Data Applications
Based on the results of Big Data computation, various tasks can be accomplished, including the development of business products, report generation, ad-hoc querying, providing data interfaces for models, and conducting machine learning, among others.
1. Business Products (CRM, ERP, etc.)
The data used by business products is already present in the data sharing layer, accessible directly from there.
2. Reports (FineReport, Business Reports)
Similar to business products, report data is generally pre-aggregated and stored in the data sharing layer.
3. Ad-hoc Queries
Ad-hoc queries are performed by a diverse user base, including data developers, website and product operations personnel, data analysts, and even department heads. These queries typically involve direct querying from the data storage layer.
Ad-hoc queries are usually expressed in SQL, and the main challenge is response time. Hive can be slow to respond, whereas SparkSQL answers the same queries considerably faster while remaining largely compatible with Hive's SQL dialect and metadata.
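As an example, an analyst's ad-hoc query run through SparkSQL might look like the sketch below; the table and column names are illustrative.

```python
# Minimal sketch: an ad-hoc query answered by SparkSQL instead of Hive.
# Table and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adhoc").enableHiveSupport().getOrCreate()

top_pages = spark.sql("""
    SELECT url, COUNT(*) AS pv
    FROM ods.web_logs
    WHERE dt BETWEEN '2024-01-01' AND '2024-01-07'
    GROUP BY url
    ORDER BY pv DESC
    LIMIT 20
""")
top_pages.show()
```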
Of course, the ideal solution is a dedicated OLAP engine such as ClickHouse, Impala, or Kudu.
4. Other Data Interfaces
These interfaces can be generic or customized. For example, an interface to retrieve user attributes from Redis is generic and can be accessed by all businesses.
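Such a generic interface can be very small. The sketch below reads a user's attributes from a Redis hash; the key pattern and stored fields are assumptions.

```python
# Minimal sketch of a generic data interface: fetch a user's attributes from
# Redis. The key pattern "user:<id>" and the stored fields are assumptions.
import redis

_pool = redis.ConnectionPool(host="redis-host", port=6379, decode_responses=True)

def get_user_attributes(user_id: str) -> dict:
    """Return the attribute hash for one user, or an empty dict if unknown."""
    r = redis.Redis(connection_pool=_pool)
    return r.hgetall(f"user:{user_id}")

# Example (hypothetical data): get_user_attributes("10086")
# might return {"gender": "F", "city": "Beijing"}.
```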
V. Task Scheduling and Monitoring
In a data warehouse/data platform, there are numerous programs and tasks, including data collection, data synchronization, and data analysis tasks. These tasks not only require periodic scheduling but also involve complex task dependency relationships. For instance, a data analysis task must wait for the corresponding data collection task to complete before starting. Similarly, a data synchronization task needs to wait for the data analysis task to complete before initiation.
This necessitates a sophisticated task scheduling and monitoring system. Serving as the hub of the data warehouse/data platform, this system is responsible for scheduling and monitoring the allocation and execution of all tasks.
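For illustration only, since the text does not prescribe a particular scheduler, the dependency chain described above (collect, then analyze, then synchronize) could be expressed in a workflow tool such as Apache Airflow roughly as follows; the schedule and task commands are placeholders.

```python
# Illustrative sketch of the dependency chain (collect -> analyze -> sync)
# expressed as an Apache Airflow DAG. Schedule and commands are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_warehouse",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",   # run once a day at 02:00
    catchup=False,
) as dag:
    collect = BashOperator(task_id="collect_logs", bash_command="echo collect")
    analyze = BashOperator(task_id="analyze_data", bash_command="echo analyze")
    sync = BashOperator(task_id="sync_results", bash_command="echo sync")

    # Analysis waits for collection; synchronization waits for analysis.
    collect >> analyze >> sync
```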