Would you like to know what a data engineer is?
What challenges do they face?
Which skills do they need?
Here are some insights:
According to Statista, the volume of data generated in 2019 was about 41 zettabytes, and it is growing rapidly, projected to reach 149 zettabytes in 2024.
This is mainly a direct consequence of the proliferation of smart devices and IoT, as well as increasing social media penetration.
This huge data volume is mostly unstructured and, in many cases, unusable in its raw format.
Companies face big challenges in processing such a volume to extract insights and drive their business.
One key aspect of reaching that goal is the growing need for data engineers to process and prepare that data.
What is a data engineer?
A data engineer's basic role is to ingest and clean data coming from different sources, and to prepare it by building what we call the data pipeline.
The prepared data needs to serve the business requirements, whether for building machine learning models or for data visualization used across the company.
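As a minimal illustration of that ingest → clean → prepare flow, here is a sketch using only the Python standard library; the field names, cleaning rules, and aggregation are all hypothetical, not a real pipeline framework.

```python
import csv
import io

# Hypothetical raw input: CSV rows ingested from some source,
# with inconsistent casing and a missing value.
raw_data = """name,country,revenue
alice,FR,1200
BOB,de,
carol,US,900
"""

def ingest(text):
    """Ingest: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Clean: normalize casing and drop rows with missing revenue."""
    cleaned = []
    for row in rows:
        if not row["revenue"]:
            continue  # discard unusable records
        cleaned.append({
            "name": row["name"].title(),
            "country": row["country"].upper(),
            "revenue": int(row["revenue"]),
        })
    return cleaned

def prepare(rows):
    """Prepare: aggregate revenue per country for downstream use."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0) + row["revenue"]
    return totals

pipeline_output = prepare(clean(ingest(raw_data)))
print(pipeline_output)  # {'FR': 1200, 'US': 900}
```

In a real pipeline each stage would read from and write to durable storage, but the shape — small composable stages chained together — is the same.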
The data engineer faces many challenges; to name just a few concerning the data pipeline:
- the data must be made available via real-time streaming or batch processing, according to the business requirements.
- data quality is a major challenge, since poor quality leads to bad output. Here is a full article about data quality.
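To make the data quality point concrete, here is a small sketch of the kind of validation a pipeline might run before accepting a batch; the rules and field names are purely illustrative.

```python
def validate(record):
    """Return a list of quality issues found in one record."""
    issues = []
    if not record.get("id"):
        issues.append("missing id")
    if record.get("age") is not None and not (0 <= record["age"] <= 120):
        issues.append("age out of range")
    if record.get("email") and "@" not in record["email"]:
        issues.append("malformed email")
    return issues

batch = [
    {"id": 1, "age": 34, "email": "a@example.com"},
    {"id": None, "age": 200, "email": "broken"},
]

# A per-record quality report: empty list means the record passed.
report = {record["id"]: validate(record) for record in batch}
print(report)
```

Rejecting or quarantining records that fail such checks early is far cheaper than debugging a bad dashboard or model downstream.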
What are the technical requirements?
As for the technical skills, I'm referring to Andreas Kretz's Cookbook and its requirements.
There are essentially six skill sets:
1- Coding skills
A data engineer needs to master at least one programming language, and most contexts require more than one:
- Shell scripting
- as well as data types/formats and structures
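Fluency with data formats in practice often means converting between them. Here is a hedged sketch, assuming a hypothetical newline-delimited JSON input, converted to CSV with the Python standard library:

```python
import csv
import io
import json

# Hypothetical input: newline-delimited JSON, a common ingestion format.
jsonl = '{"user": "alice", "clicks": 3}\n{"user": "bob", "clicks": 5}\n'

# Parse each line into a dictionary.
records = [json.loads(line) for line in jsonl.splitlines()]

# Re-serialize as CSV, a format frequently requested downstream.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["user", "clicks"])
writer.writeheader()
writer.writerows(records)

print(buffer.getvalue())
```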
2- Cloud skills
Proficiency in at least one public cloud is mandatory,
along with general knowledge of hybrid and private clouds, like OpenShift.
- DevOps and CI/CD
- tools like Git, Jenkins, and GitLab, as well as the continuous integration and continuous deployment chain
- Infrastructure automation (Terraform & Ansible)
3- Architecture Patterns
- Kappa Architecture
- Lambda Architecture
- Key elements to consider when crafting your data pipeline for batch or stream processing
- Serverless data pipelines
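As a toy illustration of the Lambda idea — a batch layer that periodically recomputes a complete view, a speed layer that covers events since the last batch run, and a serving layer that merges both — here is a sketch with made-up event data:

```python
from collections import Counter

# Hypothetical event log: (user, count) pairs.
historical_events = [("alice", 2), ("bob", 1), ("alice", 1)]
recent_events = [("bob", 4)]  # arrived after the last batch run

def batch_layer(events):
    """Batch layer: recompute a complete view from all historical events."""
    view = Counter()
    for user, n in events:
        view[user] += n
    return view

def speed_layer(events):
    """Speed layer: a view over recent events only (incremental in practice)."""
    return batch_layer(events)

def serving_layer(batch_view, realtime_view):
    """Serving layer: merge both views to answer queries."""
    return batch_view + realtime_view

merged = serving_layer(batch_layer(historical_events), speed_layer(recent_events))
print(dict(merged))  # {'alice': 3, 'bob': 5}
```

The Kappa architecture, by contrast, drops the batch layer and treats everything — including reprocessing — as a stream.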
4- Databases
- SQL databases (MySQL, PostgreSQL…)
- NoSQL databases (MongoDB, AWS DynamoDB, Google Cloud Bigtable…)
- Index-based: Elasticsearch
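For the SQL side, here is a minimal example using Python's built-in sqlite3 module as a stand-in for MySQL or PostgreSQL; the schema and data are made up for illustration:

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "FR", 10.0), (2, "FR", 5.0), (3, "US", 7.5)],
)

# A typical analytical query a pipeline might serve.
rows = conn.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('FR', 15.0), ('US', 7.5)]
conn.close()
```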
5- Tools and Swiss Army knives
- Hadoop ecosystem
- Queuing( Kafka, Google Cloud Pub/Sub, …)
- Apache Beam / Flink
- Linux administration skills
- Basic security understanding
- Jupyter Notebooks / Zeppelin
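The core idea behind queuing systems like Kafka or Google Cloud Pub/Sub is decoupling producers from consumers. This sketch illustrates that pattern with Python's standard-library queue and threads — an in-process stand-in, not the Kafka API:

```python
import queue
import threading

# A bounded in-process queue standing in for a message broker.
events = queue.Queue(maxsize=100)
results = []

def producer():
    """Publish a few messages, then a sentinel to signal completion."""
    for i in range(5):
        events.put({"event_id": i})
    events.put(None)

def consumer():
    """Consume messages until the sentinel is seen."""
    while True:
        msg = events.get()
        if msg is None:
            break
        results.append(msg["event_id"])

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 1, 2, 3, 4]
```

Real brokers add durability, partitioning, and replay, but the producer/consumer decoupling is the same concept.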
Are you wondering whether there are any other skills?
Yes, there are: a data engineer is like a Swiss Army knife 🙂
In addition to all the technical skills listed above, which are detailed in Andreas's Cookbook, soft skills are major requirements.
The data engineer is a team member dealing with complex projects and different stakeholders, so communication and behavior need to be taken very seriously.
Would you like to know more about that?
I have written an article titled "It's not about what you know but about how you behave," explaining those aspects; here is the link for further information.