Data Engineer — The Chef!

Jaswinder Paul
10 min read · Apr 18, 2021

I am a Data Engineer — The Chef! But why a chef? Well, you will know by the end of this blog.

I have been working as a Data Engineer for quite a while now and have worked with various clients and companies, some big and some mid-size. The culture is broadly similar as far as Data Engineers are concerned: they are all treated as chefs and very rarely as waiters. Data Engineering can be a thankless job, much like being a soccer referee, who goes unnoticed until he makes a mistake. You should really love data and Data Engineering to keep yourself motivated. In my case, I love numbers and like to play with them, and working with data is all about numbers: every table feels like a coordinate plane, with rows running along one axis and columns along the other. That is enough to keep me going.

First, what really pushed me to write this blog? A colleague of mine, a Data Scientist who was leaving our company in two weeks, asked me a question: “Do you have any documentation to learn Data Engineering?” The question stuck in my mind for a few days because I didn’t have an answer. Data Engineering is not like Data Science, where you can read some books, get some training, build some models, and call yourself one. Then the other day I met a friend’s friend who asked about my profile. I told him I am a Sr. Data Engineer at my company, and he asked, “What does a Data Engineer do?” I found myself wondering how to explain it in one or two lines. That made me think: how do you define a Data Engineer? What exactly do they do? A good but tricky question. Why? Read the next paragraph and you will know.

Data Engineers are not Software Engineers, but they do code in Python, Java, Scala, or other programming languages, and they do check code into GitHub. Data Engineers are not Database Engineers, but they do maintain and administer their own database, called a Data Warehouse (along with various Data Marts). Data Engineers do ETL (Extract-Transform-Load), but they are not Integration Engineers. Data Engineers are not Machine Learning Engineers, but they work like one by providing data to the models defined by Data Scientists. Data Engineers are not Data Warehouse Engineers, but they do work with star schemas and snowflake schemas, creating aggregates and summary tables for reporting and analytics. Data Engineers are not BI Engineers, but they do provide data to them for reporting needs. Data Engineers are not API Engineers, but sometimes they have to create APIs using Java or tools like MuleSoft. They are not DevOps Engineers, but they should know how to build a CI/CD (Continuous Integration/Continuous Delivery) pipeline.

Data Engineers wear many hats, as you can infer by now. They are the ones who create data pipelines, where data flows from a source to a destination. Sounds pretty simple, no? We just have to read the data from the source and load it to the destination. Let’s see how simple it really is.

There are so many sources nowadays, starting with files of many types: CSV (comma-separated values, along with siblings like tab-delimited and pipe-delimited), fixed-width, ORC, Avro, Parquet, JSON, and many, many more. Then comes another breed, databases to read from, and I don't need to list every database out there: MySQL, SQL Server, Oracle, Postgres, and tons of others. There are also applications such as Salesforce (CRM), NetSuite (resource planning), Anaplan (business planning), Zuora (billing), Workday (HR), and many more. We will come to the other technical challenges a little later. Oh, and I forgot about Excel sheets and Google Sheets, which come with their own challenges. This just tells you how many types of sources a Data Engineer needs to deal with. Most applications expose APIs nowadays, so you should know how to use APIs along with their authorization methods. You should know the difference between the POST and GET methods and which one to use where and how.
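To make the API point concrete, here is a minimal sketch in Python using the requests library. The endpoint, token, and payload are made up for illustration; real APIs like Salesforce's have their own URLs and auth flows. GET fetches data with parameters in the URL, while POST sends a request body, which is typical for creating records or submitting query jobs.

```python
import requests

BASE_URL = "https://api.example.com/v1"                  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}  # bearer-token auth, one common scheme

# GET: retrieve records, passing filters as query parameters
resp = requests.get(f"{BASE_URL}/accounts", headers=HEADERS,
                    params={"updated_since": "2021-04-01"}, timeout=30)
resp.raise_for_status()
accounts = resp.json()

# POST: send a request body, e.g. to submit a query job
resp = requests.post(f"{BASE_URL}/query", headers=HEADERS,
                     json={"object": "Account", "fields": ["Id", "Name"]}, timeout=30)
resp.raise_for_status()
job = resp.json()
```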

Next comes data type conversion and transformation of the data, for which Data Engineers need to learn at least one programming language. Nowadays Python is becoming a must, but Java and Scala also work as they are equally popular. Most companies use ETL tools such as Informatica, Talend, NiFi, etc. There are many ETL tools on the market, and what they do is eliminate the need for hand-written code: a drag-and-drop approach, backed by templates, that can cut development work that might otherwise take weeks or months. Learning these ETL tools is also part of a Data Engineer's job; most of the time I have learned them on the job with no professional training, and that is expected of a good Data Engineer. There are various challenges in the transformation step as well, but that is a topic for another day.
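As a taste of what such transformations look like in plain Python, here is a small sketch using pandas; the file, columns, and rules are invented for illustration:

```python
import pandas as pd

# Read a pipe-delimited extract; everything arrives as strings
df = pd.read_csv("orders.psv", sep="|", dtype=str)  # hypothetical file

# Data type conversions: strings to dates and numerics, coercing bad values to NaT/NaN
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# A simple transformation: derive a reporting month and drop rows that failed conversion
df["order_month"] = df["order_date"].dt.to_period("M").astype(str)
df = df.dropna(subset=["order_date", "amount"])
```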

Quite often the requirement is to integrate two or more sources, and that's where ETL tools can have an edge over hand-written code, because they come with connectors and sometimes even built-in templates where only configuration is required. Most companies have similar kinds of sources and software; for example, many use Salesforce as their CRM system and Zuora as their billing system. So, if the requirement is to build an integration involving both systems, a template for loading them into the destination may already exist. It looks pretty simple again, but trust me, it is not simple by any means.
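In code, the heart of such an integration is a join on a shared key. Here is a hedged pandas sketch where the files and column names, including the assumption that both systems share an account_id, are invented for illustration:

```python
import pandas as pd

# Hypothetical extracts from the two systems
accounts = pd.read_csv("salesforce_accounts.csv")  # e.g. account_id, account_name
invoices = pd.read_csv("zuora_invoices.csv")       # e.g. account_id, invoice_id, amount

# Join billing activity to CRM accounts on the shared key
billed = invoices.merge(accounts, on="account_id", how="left")

# Invoices whose account is missing from the CRM: a classic integration headache
orphans = billed[billed["account_name"].isna()]
print(f"{len(orphans)} invoices have no matching CRM account")
```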

Loading to the destination is probably the easiest of all the tasks mentioned above, but to get this far you have to cross all the earlier stages. Now, as a Data Engineer, you have to determine whether it's an incremental load, a full load, or an append-only load. An incremental load (often driven by Change Data Capture, or CDC) loads only the changes since the last run: updates and deletes along with new inserts. So, if a table has been loaded up to ID 1000, the next run treats all IDs greater than 1000 as new inserts, and of course it checks for updates and deletes too. A full load re-loads everything. Append-only keeps appending every time: no deletes or updates, only new inserts. You also have to look at the frequency at which the data needs to be loaded; some loads are hourly, others daily, weekly, or monthly. The loading strategy changes with that frequency.
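Here is a minimal sketch of the incremental pattern using a stored high-water mark. SQLite stands in for whatever source and warehouse you actually have, the table and column names are invented, and a real pipeline would also handle updates and deletes:

```python
import sqlite3  # stand-in for any source/target database

src = sqlite3.connect("source.db")
tgt = sqlite3.connect("warehouse.db")

# Read the last loaded ID (the high-water mark) recorded by the previous run
last_id = tgt.execute(
    "SELECT max_id FROM load_watermarks WHERE table_name = 'orders'"
).fetchone()[0]

# Pull only rows beyond the watermark: the new inserts
rows = src.execute(
    "SELECT id, customer, amount FROM orders WHERE id > ?", (last_id,)
).fetchall()
tgt.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# Advance the watermark so the next run starts where this one ended
if rows:
    tgt.execute("UPDATE load_watermarks SET max_id = ? WHERE table_name = 'orders'",
                (max(r[0] for r in rows),))
tgt.commit()
```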

With the advent of more and more data, especially after Facebook, Twitter, and other social media, there is a buzzword I first heard in 2014 when I came to the Bay Area: Big Data. What is Big Data? Does it just mean a large volume of data? Or do we call it Big Data when the tables are too wide? Big Data is usually described by four Vs: Volume, Velocity, Variety, and Veracity. Beyond Volume, Velocity means how quickly the data arrives and must be loaded, e.g. in real time or near real time (loading every 5 to 15 minutes). Variety means different types of sources, structured or unstructured. So far we have only talked about structured data, the conventional kind arranged in rows and columns like a table, i.e. following a set of rules. Unstructured data has no such rules: a picture, a voice recording, a tweet, maybe a log file. The last V is Veracity, which refers to the trustworthiness of the data. Big Data is a big topic in itself and can't be covered in full in this blog.

Data Engineers need to know how to deal with Big Data, and it's much more complex than working with normal data sizes and structured data. Real-time data comes with its own set of problems. Tools for a Data Engineer to learn, to name a few: Hadoop, Hive, Pig, Sqoop, Impala, Flume, Mahout, Kafka. Honestly, I don't know all of them, only some, but knowing what each one is for matters, so that you can pick the right one for a given pipeline. I didn't mention Spark here on purpose, because Spark is very good and extremely important for becoming a good Data Engineer. Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. It's a data processing framework that can quickly run processing tasks on very large data sets and distribute those tasks across a cluster. You can use Spark as an ETL tool as well, extracting and loading in memory in a parallel, distributed fashion. Spark offers APIs in Python (PySpark), Scala, Java, and R, and knowing at least one of these languages is a must to work with it. Spark comes with its own powers and challenges.
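To show what Spark-as-ETL looks like, here is a tiny PySpark sketch; the paths and column names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: Spark reads the Parquet files in parallel across the cluster
events = spark.read.parquet("s3://my-bucket/raw/events/")  # hypothetical path

# Transform: filter, derive a date column, and aggregate; all lazily planned
daily = (events
         .filter(F.col("event_type") == "purchase")
         .withColumn("event_date", F.to_date("event_ts"))
         .groupBy("event_date")
         .agg(F.count("*").alias("purchases"),
              F.sum("amount").alias("revenue")))

# Load: write the summary back out for downstream readers
daily.write.mode("overwrite").parquet("s3://my-bucket/marts/daily_purchases/")
```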

Along with everything mentioned above, Data Engineers need to understand at least one scheduling tool, such as Cron (legacy but still widely used on Unix/Linux). The most common one today is Apache Airflow, a workflow automation and scheduling system used to author and manage data pipelines. Airflow expresses workflows as directed acyclic graphs (DAGs) of tasks.
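A minimal Airflow DAG looks something like this; the task bodies are placeholders, and the import path assumes Airflow 2.x:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path

def extract():
    print("pull from source")    # placeholder task body

def load():
    print("write to warehouse")  # placeholder task body

# Two tasks in one DAG; the >> operator makes load wait for extract
with DAG(dag_id="daily_pipeline",
         start_date=datetime(2021, 4, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_load
```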

Data Engineers need an understanding of reporting and BI tools and how they work, so that they can build pipelines with the constraints of those tools in mind. Moreover, they need to understand at least one data warehouse, such as Teradata, Greenplum, Netezza, Hive, Amazon Redshift, Snowflake, Azure Synapse (formerly Azure SQL Data Warehouse), BigQuery, and more. Some of them are columnar DBMSs, some run in the cloud, and each comes with its own challenges. A Data Engineer typically learns all of this on the job. I have dealt with many data warehouses so far, and every one comes with its own pros and cons.

Nowadays one of the most used words in IT is cloud, and Data Engineers need to learn these skills as well. You need to know the cloud for sure, and it too comes in various flavors, with three being the most common: AWS, Microsoft Azure, and GCP. The first cloud I used was Azure, in 2016; then in 2018 I moved to the company where I work now, which uses AWS. I used to work with containers and Blob Storage; now they are called buckets and S3 objects. Not a small change by any means, but you have to adjust to the new normal.
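To illustrate that vocabulary shift, here is a small boto3 sketch; the bucket name and file are made up:

```python
import boto3

s3 = boto3.client("s3")

# What Azure calls a container roughly maps to an S3 bucket,
# and a blob roughly maps to an S3 object (a key within the bucket)
s3.upload_file("daily_extract.csv", "my-data-bucket", "raw/daily_extract.csv")

# List objects under a prefix, much like listing blobs in a container
resp = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="raw/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```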

Data Engineers also need an understanding of CI/CD processes and how to integrate them with GitHub so that deployments are automated. Some of the common tools are Jenkins, Buildkite, CircleCI, and GitLab. The goal is to automate deploying jobs to the various environments.

Is this enough to be a Data Engineer? On top of everything mentioned above, you need a good understanding of IAM and security, encryption, networking and protocols, storage, archiving, caching, monitoring, auditing, backups, and disaster recovery. There may be more, but I think this is enough to become a good Data Engineer.

Now comes the main question: why did I call Data Engineers chefs? They are the ones who do the heavy lifting, whether it's understanding the source data, the challenge of integrating it with other sources, blending unstructured data with a structured dataset, or a requirement to deliver it in real time. Just as a chef faces challenges making a dish to a customer's demands, perhaps with more or less spice, or in less time because the customer is in a hurry, a Data Engineer faces similar challenges. Once the dish is ready, it is served to the customer by a waiter, and if the dish is good, all the credit and the tip go to the waiter rather than the chef. With all due respect to my fellow BI Engineers/Analysts: as the ones presenting the reports (the presentation layer) to the business and customers, they get all the accolades, while we Data Engineers, the chefs, are sometimes underappreciated because we aren't the ones presenting. Also, what I have seen is that most of the time, if something goes wrong in a report, it is treated as a data issue, and Data Engineers have to go back and fix it. Data issues are sometimes the biggest challenge for Data Engineers, because they have to troubleshoot their own database and also go back to the source to check whether the problem lies there. This makes the lives of Data Engineers a bit painful.

This blog is for people who want to understand a Data Engineer's responsibilities, what they actually do, and the various challenges they face. I have tried my best to bring together most, if not all, of the aspects, seen from a Data Engineer's point of view, and to show how difficult and complex it can be. It's not possible to fit everything into one blog post; I could write a book on it.

Next time you see a Data Engineer, please remember the points above, remember that the Data Engineering world is not easy (but nothing in life is easy!), and try to give them the same kind of credit as other engineers. We are a different breed, and I am proud to be a Data Engineer: the Chef!

You can reach out to me with any questions or feedback at: jassipaul@gmail.com
