What are some data engineering projects for beginners?

rohitkumar-singhvi-fe88d2d0 · 14 March 2022 08:21

If you are new to data engineering and want to learn from real-world projects, check the project examples below.

Live Twitter sentiment analysis with Spark.

People’s opinions can matter as much as traditional media in influencing buying decisions or shaping public sentiment toward a political party. As a result, marketers have ample opportunity on Twitter.

“Twitter sentiment analysis” means analyzing the sentiments users express in their tweets. In most big data applications, this starts with parsing the tweet text. Companies can benefit from analyzing user attitudes toward their product on Twitter; the analysis focuses mainly on social media trends, user feelings, and the likely future opinions of the online community.

This data engineering project’s data pipeline contains five stages: ingestion, collection, processing, storage, and visualization.

Ingestion: the NiFi GetTwitter processor receives real-time tweets from Twitter and ingests them into a messaging queue.
Collection: the tweets are collected in a Kafka topic.
Processing: the real-time data is processed with the Spark Structured Streaming API and evaluated with Spark MLlib to determine the sentiment of each tweet.
Storage: the processed and aggregated results are saved in MongoDB.
Visualization: the results are shown as interactive dashboards built with Python’s Plotly and Dash libraries.
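
To make the processing and aggregation steps more concrete, here is a minimal sketch in plain Python. It is not the project's actual code: the word lists, `classify`, and `aggregate` are hypothetical stand-ins, and a simple lexicon count takes the place of a trained Spark MLlib model. It only illustrates the shape of the work: label each tweet, then count tweets per label so the totals can be stored and plotted.

```python
from collections import Counter

# Hypothetical toy lexicon (an assumption for illustration only);
# the project itself would use a model trained with Spark MLlib.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}


def classify(tweet: str) -> str:
    """Label a tweet as positive, negative, or neutral by counting lexicon hits."""
    words = [w.strip(".,!?#@").lower() for w in tweet.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def aggregate(tweets) -> dict:
    """Count tweets per sentiment label, the kind of summary a dashboard would plot."""
    return dict(Counter(classify(t) for t in tweets))


if __name__ == "__main__":
    sample = [
        "I love this phone, great battery!",
        "Terrible service, I hate waiting.",
        "Just landed in Mumbai.",
    ]
    print(aggregate(sample))
```

In the real pipeline the tweets would arrive from Kafka rather than a list, and the aggregated counts would be written to MongoDB instead of printed.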

Log analytics project with Spark and Kafka.

Logs help determine the severity of a security breach, identify operational trends, establish a baseline, and support forensic and audit analysis.

In this project, you will use the Apache NiFi dataflow management framework to acquire server log data, preprocess it, and store it in HDFS, a reliable distributed file system.

This data engineering project entails cleaning and manipulating data with Apache Spark to gain insights on server activity, such as the most frequent hosts hitting the server and which country or city generates the most network traffic with the server.
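
As a rough illustration of the "most frequent hosts" analysis, here is a small sketch in plain Python rather than Spark. It assumes the logs are in the Common Log Format, which the post does not state; `LOG_PATTERN`, `parse_line`, and `top_hosts` are hypothetical names. The same parse-then-count logic would map onto a Spark job that extracts the host column and groups by it.

```python
import re
from collections import Counter

# Assumed input format (Common Log Format), for example:
# 10.0.0.1 - - [14/Mar/2022:08:21:00 +0000] "GET /index.html HTTP/1.1" 200 1043
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_line(line: str):
    """Return the fields of one log line as a dict, or None if it is malformed."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def top_hosts(lines, n=3):
    """Return the n hosts with the most requests as (host, count) pairs."""
    hosts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:  # cleaning step: skip lines that do not parse
            hosts[record["host"]] += 1
    return hosts.most_common(n)


if __name__ == "__main__":
    logs = [
        '10.0.0.1 - - [14/Mar/2022:08:21:00 +0000] "GET / HTTP/1.1" 200 1043',
        '10.0.0.2 - - [14/Mar/2022:08:21:01 +0000] "GET /a HTTP/1.1" 404 -',
        '10.0.0.1 - - [14/Mar/2022:08:21:02 +0000] "GET /b HTTP/1.1" 200 512',
        'not a log line',
    ]
    print(top_hosts(logs, n=2))
```

Finding the country or city behind each host would additionally need an IP geolocation lookup, which this sketch leaves out.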

Next, you will use Plotly and Dash to visualize these findings and build a story about the server. The project follows the Lambda architecture, which lets you handle both real-time streaming and batch data. NiFi is used to push log files to a Kafka topic.

For real-time analytics, this data is analyzed and saved in Cassandra. This is referred to as the “hot path.” The data extracted from Kafka is also saved to HDFS, referred to as the “cold path” in this architecture, where it will be analyzed and visualized afterward.
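
The hot path and cold path split can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `LambdaRouter` is a hypothetical class, an in-memory `Counter` stands in for Cassandra, a list stands in for files on HDFS, and the records are assumed to carry `status` and `bytes` fields. The point is that every record consumed from the Kafka topic goes to both paths.

```python
from collections import Counter


class LambdaRouter:
    """Sends each incoming record to a hot path and a cold path."""

    def __init__(self):
        self.hot_counts = Counter()  # stand-in for a real-time table in Cassandra
        self.cold_store = []         # stand-in for raw records saved on HDFS

    def handle(self, record: dict) -> None:
        # Hot path: update a running aggregate as soon as the record arrives.
        self.hot_counts[record["status"]] += 1
        # Cold path: keep the raw record for later batch analysis.
        self.cold_store.append(record)

    def batch_report(self) -> dict:
        """Batch job over everything stored on the cold path."""
        total_bytes = sum(r["bytes"] for r in self.cold_store)
        return {"requests": len(self.cold_store), "total_bytes": total_bytes}


if __name__ == "__main__":
    router = LambdaRouter()
    for rec in [
        {"host": "10.0.0.1", "status": 200, "bytes": 1043},
        {"host": "10.0.0.2", "status": 404, "bytes": 0},
        {"host": "10.0.0.1", "status": 200, "bytes": 512},
    ]:
        router.handle(rec)
    print(router.hot_counts)
    print(router.batch_report())
```

The hot path answers "what is happening right now" from small running aggregates, while the cold path keeps the full history so that heavier analysis and visualization can run afterward.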