Within the data department of a tech company there are many data-related roles; the three most common job titles are Data Engineer, Data Scientist, and Data Analyst.
- Data Engineer: Data Engineers focus on designing, constructing, testing, and maintaining the architectures (e.g., databases, large-scale processing systems) that allow for the processing of big data. They ensure that data is available, reliable, and scalable.
- Data Scientist: Data Scientists leverage their skills in statistics, machine learning, and programming to analyze and interpret complex data. They extract insights and patterns from data, helping organizations make data-driven decisions.
- Data Analyst: Data Analysts are responsible for interpreting data and turning it into actionable insights. They use statistical methods to analyze and present findings to support decision-making processes.
🗯️ My Insight About This
Job Roles in Big Data
The entire Big Data engineering system is highly specialized, and I like to divide it into two main layers: the infrastructure layer and the advanced application layer. The infrastructure layer focuses on building a stable, available, and efficient data platform. The advanced application layer leverages the vast amount of data on that platform, together with statistical, computer science, and mathematical knowledge, to create advanced data products, services, or applications, such as statistical models, data dashboards, AI, and recommendation systems.
The infrastructure layer includes several, often interchangeable, positions, ordered here roughly from the bottom of the stack upward:
- PaaS Development Engineer
- Requirements: Mastery of underlying technology frameworks, profound understanding of computer fundamentals, experience in high-concurrency, high-throughput, and high-availability environments.
Responsible for the secondary development or management of common big data technologies like Hadoop, Spark, Flink, Hive, Kafka, HBase, and ZooKeeper. Tasks involve maintaining the big data platform, including log management, performance monitoring, and program deployment. These engineers typically focus on low-level framework source code development and bottom-level services without much concern for the data itself.
- Data Platform Development Engineer
- Requirements: Familiarity with Java backend frameworks, such as the SSM framework and microservices technology, and knowledge of common big data technologies like Hadoop and Spark.
In charge of constructing and developing the architecture of the big data platform, creating data pipelines, and developing web services to enable access to the big data platform. This role is typically taken on by individuals with a background in Java backend development.
- Data Warehouse Engineer - Offline Data Warehouse:
- Requirements: Proficiency in SQL optimization, ability to write complex SQL queries, knowledge of data warehouse modeling theory, expertise in offline data pipeline technologies (Hadoop HDFS & YARN, Hive, Spark Core, Spark SQL, ETL tools, scheduling tools, cloud operations), basic data analysis skills (using Python, R), proficiency in at least one programming language for tasks beyond SQL (such as Java, Scala, Python), and knowledge of relational and NoSQL databases.
Maintains offline data (historical data) on the big data platform. Builds workflows to load OLTP data into the big data file system (HDFS) through ETL (Extract, Transform, Load) processes. Utilizes data warehouse tools like Hive for modeling and layering, ensuring effective management of offline data in terms of consistency, security, desensitization, accuracy, and ease of use. Occasionally handles data extraction and analysis requirements.
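The modeling-and-layering idea above (raw data flowing through cleaning into serving tables) can be sketched in plain Python. The layer names (ODS/DWD/ADS) follow a common warehouse convention, and the sample records are invented for illustration; a real warehouse would implement each layer as Hive or Spark tables, not Python lists:

```python
# Toy sketch of data-warehouse layering: ODS (raw) -> DWD (cleaned) -> ADS (aggregated).
# Layer names and sample records are illustrative assumptions, not a real platform.

# ODS layer: raw records as they arrive from an OLTP source (some rows are dirty).
ods_orders = [
    {"order_id": 1, "user": "alice", "amount": "120.5"},
    {"order_id": 2, "user": "bob",   "amount": "80"},
    {"order_id": 3, "user": "",      "amount": "abc"},  # dirty row: no user, bad amount
    {"order_id": 4, "user": "alice", "amount": "19.5"},
]

def to_dwd(rows):
    """DWD layer: keep only rows that pass cleaning rules, with typed columns."""
    cleaned = []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # drop rows with a non-numeric amount
        if r["user"]:
            cleaned.append({"order_id": r["order_id"], "user": r["user"], "amount": amount})
    return cleaned

def to_ads(rows):
    """ADS layer: aggregate cleaned rows into a per-user spend table for BI use."""
    totals = {}
    for r in rows:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

dwd_orders = to_dwd(ods_orders)
ads_user_spend = to_ads(dwd_orders)
print(ads_user_spend)  # {'alice': 140.0, 'bob': 80.0}
```

Each layer only reads from the one below it, which is exactly what makes layered warehouses easy to audit for consistency and accuracy.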
- ETL Engineer
ETL is described in many different ways, but at its core an ETL process extracts data from multiple different sources, cleans and filters it, and then loads it into the data warehouse. ETL engineers are generally expected to master some common ETL tools. Reference: https://aws.amazon.com/cn/compare/the-difference-between-etl-and-elt/
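The extract → transform → load flow described above can be sketched with in-memory "sources" and a "warehouse" table. The source names and records are hypothetical; a real ETL engineer would use tools such as AWS Glue, Airflow with Spark, or Kettle rather than plain lists:

```python
# Minimal ETL sketch: extract from two hypothetical sources, transform (clean/filter),
# and load into an in-memory "warehouse" table. All data here is invented.

source_crm  = [{"email": "A@X.COM", "plan": "pro"}, {"email": "b@x.com", "plan": "free"}]
source_logs = [{"email": "a@x.com", "visits": 3}, {"email": None, "visits": 7}]

def extract():
    """Pull raw rows from every source system."""
    return source_crm + source_logs

def transform(rows):
    """Clean: drop rows without an email, normalize emails to lowercase."""
    return [dict(r, email=r["email"].lower()) for r in rows if r.get("email")]

def load(rows, warehouse):
    """Append the cleaned rows to the target warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse_table = load(transform(extract()), [])
print(len(warehouse_table))  # 3 rows survive cleaning
```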
- Data Warehouse Engineer - Real-time Data Warehouse:
- Requirements: Proficiency in real-time data calculation frameworks like Apache Flink, cache technologies like Redis, NoSQL databases like HBase, and message queues like Kafka. In-depth understanding of the setup and usage of real-time data pipelines.
Maintains real-time data on the big data platform, especially data generated in real-time production scenarios. Deals with challenges related to large data volumes, high velocity, and concurrency. Because the data must be consumed immediately (collected, computed, and output as results), this work is often referred to as real-time stream processing.
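The core computation in such pipelines is windowed aggregation over an event stream, the kind of job frameworks like Apache Flink run against Kafka topics. Below is a toy pure-Python illustration of a tumbling window; the events and the 10-second window size are made up, and a real pipeline would consume continuously from a message queue instead of a list:

```python
# Toy tumbling-window aggregation over a simulated event stream.
# In production this logic would run inside a framework like Apache Flink.

from collections import defaultdict

WINDOW_SECONDS = 10  # assumed window size for illustration

# Each event: (epoch_seconds, page). In production these arrive continuously.
events = [(100, "home"), (103, "cart"), (109, "home"), (112, "home"), (118, "cart")]

def tumbling_window_counts(stream, window):
    """Count page views per (window_start, page) bucket, like a streaming GROUP BY."""
    counts = defaultdict(int)
    for ts, page in stream:
        window_start = (ts // window) * window  # align timestamp to its window
        counts[(window_start, page)] += 1
    return dict(counts)

print(tumbling_window_counts(events, WINDOW_SECONDS))
# {(100, 'home'): 2, (100, 'cart'): 1, (110, 'home'): 1, (110, 'cart'): 1}
```

What makes the real job hard is everything this sketch ignores: out-of-order events, backpressure, exactly-once state, and fault tolerance.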
- Big Data Operations Engineer (DevOps):
Handles the operational aspects of the big data platform.
- Big Data Architect:
Designs the entire architecture of the big data system to optimize performance and control enterprise costs. Typically a seasoned big data engineer with extensive knowledge of overall big data technologies.
Roles in the advanced application layer include:
- Data Analyst:
- Requirements: Proficiency in a programming language that supports data analysis, such as Python or R, familiarity with BI tools (e.g., Power BI, Tableau), knowledge of data analysis theory (statistics), A/B testing, report writing, and communication skills.
Utilizes application layer data from data warehouses or databases (prepared by data engineers) and BI dashboards to draw analytical conclusions. Writes reports for business decision-making.
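A typical analyst task mentioned above is A/B testing. A hedged sketch of the simplest version, a two-proportion z-test comparing conversion rates of two page variants, can be done with the standard library alone; the counts below are invented:

```python
# Sketch of a two-proportion z-test, the textbook A/B-test calculation.
# Sample sizes and conversion counts are hypothetical.

from math import sqrt
from statistics import NormalDist

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z statistic, two-sided p-value) for a conversion-rate difference."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)      # pooled rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value

z, p = two_proportion_z(conv_a=120, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")  # small p suggests variant B really converts better
```

In practice the statistical test is the easy part; deciding the metric, sample size, and stopping rule is where the analyst earns the title.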
- Data Scientist:
- Requirements: Background in statistics, foundational data knowledge, programming skills, expertise in data science algorithms, and knowledge of machine learning, deep learning theories, and applications.
Leverages statistical, machine learning, and programming skills to analyze and interpret complex data, extracting insights and patterns to aid organizations in making data-driven decisions.
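The simplest instance of the statistical modeling this role performs is fitting a least-squares line to data. The ad-spend and revenue numbers below are invented for illustration; real work would use libraries like scikit-learn on far messier data:

```python
# Ordinary least squares on a tiny invented dataset: the minimal flavor of
# the statistical models a data scientist builds.

from statistics import mean

spend   = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical ad spend (k$)
revenue = [2.1, 4.0, 6.2, 7.9, 10.1]   # hypothetical revenue (k$)

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(spend, revenue)
print(f"revenue ≈ {slope:.2f} * spend + {intercept:.2f}")
```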
Other roles include:
- Big Data Product Manager or Business Analyst:
Proposes valuable business models or digital products based on data assets from the big data platform to realize the value of data.
🤨 Why Do People Get Confused?
This entirely depends on the organization. In many tech giants or large corporations (with over 1000 employees), roles within the data department are often highly specialized. A data engineer may be responsible for a specific segment of work, occasionally overlapping with other segments. However, in many startups, a data engineer may not only have to handle all aspects of data engineering but also be familiar with higher-level applications such as data analysis, visualization, A/B testing, and machine learning – responsibilities typically associated with data scientists or data analysts. This is why many people tend to confuse these three positions.