Duties and Tech Stacks for a Data Engineer?

Tags: Duties, Technology Stack, Skills
Because big data is such a broad field, data engineering spans a wide range of specialized areas. Each area may involve numerous technologies, and those technologies are frequently updated. What you need to master also varies with each enterprise's technical architecture: some companies use HBase as their NoSQL database while others use Cassandra; for BI analysis software, some use Power BI while others use Tableau. Overall, though, there are common threads.
So, what exactly do you need to master?
Below is my understanding of the technical stack for the role of a data warehouse engineer. Of course, there are still many things I haven't mastered yet. Don't worry; learning technology is a gradual and continuous effort.
Core skills a data warehouse engineer must master to succeed in this role:
  • Strong ability to write and optimize SQL (writing hundreds of lines of SQL code should be routine).
  • Proficiency in at least one programming language (recommended to master in the big data field: Java, Python, Scala).
  • Knowledge of OLTP databases (such as MySQL, Oracle) and NoSQL databases (such as HBase).
  • Proficiency in at least one OLAP tool (such as ClickHouse, Impala, Kudu).
  • Big data processing frameworks (such as Apache Hadoop, Apache Spark) and the surrounding ecosystem: big data storage (HDFS, HBase), offline computing (Hive (MR), Spark Core, Spark SQL), resource management (YARN), and task scheduling (Airflow).
  • Data warehouse modeling theory (layering theory, dimensional modeling, star schema, snowflake schema).
  • Concepts and automation tools for ETL and ELT (such as Kettle, Sqoop).
  • Basic Linux commands.
  • Proficiency in at least one cloud service (such as AWS, GCP).
  • Daily development tools:
    • IDEs (such as JetBrains' IntelliJ IDEA, DataGrip, DataSpell, PyCharm, or VS Code).
    • Version control with Git (and clients and hosting platforms such as SourceTree, GitLab, Bitbucket, etc.).
    • Maven (build and dependency management for Java projects).
    • Anaconda (a Python distribution bundling data engineering packages).
  • Documentation skills (writing development documents in Markdown, which is crucial when implementing a data warehouse).
  • Agile development skills and experience working in an agile team.
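To make the SQL and dimensional-modeling items above concrete, here is a minimal sketch of a star schema (one fact table joined to one dimension table) using Python's built-in sqlite3 module. The table and column names are illustrative, not from any particular warehouse:

```python
import sqlite3

# A minimal star-schema sketch: a fact table holds the measures, a
# dimension table holds the descriptive attributes. Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        amount     REAL NOT NULL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales VALUES (10, 1, 12.5), (11, 1, 7.5), (12, 2, 30.0);
""")

# Revenue per category: the dimension supplies the grouping attribute,
# the fact table supplies the measure -- routine warehouse SQL.
rows = conn.execute("""
    SELECT d.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product d USING (product_id)
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(rows)  # [('books', 20.0), ('games', 30.0)]
```

Production warehouses run the same pattern at far larger scale (Hive, Spark SQL, ClickHouse), but the join-and-aggregate shape of the query is identical.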
Advanced skills, which are essential for expanding your capabilities in this role:
  • Real-time data processing technologies, including:
    • Real-time computing frameworks (such as Apache Flink, Storm, Spark Streaming).
    • Message queue technologies (such as Kafka).
    • Cache technologies (such as Redis).
    • ….
  • Data lake:
    • Table formats such as Apache Iceberg, and related data formats.
  • Data analysis technologies:
    • Python data analysis libraries: Pandas, NumPy, Matplotlib.
    • BI analysis software (such as Power BI, Tableau).
  • Data science:
    • Concepts of machine learning, deep learning, etc.
    • Spark ML.
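The core idea behind the real-time computing frameworks listed above (Flink, Storm, Spark Streaming) is windowed aggregation over an event stream. As a rough sketch, here is a tumbling-window count in plain Python; the event data and window size are invented for illustration:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Group (timestamp, key) events into fixed-size time windows and
    count occurrences per key -- the tumbling-window aggregation that
    frameworks like Flink or Spark Streaming provide, stripped down to
    plain Python over a finite list instead of an unbounded stream."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_size)  # align to window boundary
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# Click events as (epoch-second, page) pairs, aggregated in 10 s windows.
events = [(1, "home"), (3, "home"), (7, "cart"), (12, "home"), (19, "cart")]
result = tumbling_window_counts(events, 10)
print(result)  # {0: {'home': 2, 'cart': 1}, 10: {'home': 1, 'cart': 1}}
```

Real frameworks add what this sketch omits: out-of-order events, watermarks, state checkpointing, and fault tolerance, which is why dedicated engines exist for this job.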