Understanding the ETL Process

One of the most critical failures that prevents companies from giving business users the information they need for decision-making is the absence of reliable data, gathered from one or more sources into a single location, organized, and prepared for use.

Imagine this scenario: you are the IT manager of a cell phone case company with retail locations all over the United States. Your board of directors is struggling to make decisions based on sales because the information exists, but it sits in different locations and formats. One of the directors asks you to build a dashboard with sales information from all the points of sale to present at the next board meeting.

You already know that it is almost impossible to consolidate the information due to the different formats and structures. Some of the retail kiosks still use a proprietary system backed by an Oracle database. The bigger stores use a new Salesforce system. The newest kiosks, which started operations during the system transition, track their sales in shared spreadsheets.

How will you collate all the data from these different locations, formats, and structures into a single database, ready to be accessed for dashboard generation?

ETL Fundamentals

ETL stands for Extract, Transform, and Load. ETL is a group of processes designed to turn this complex jumble of data into an organized, reliable, and repeatable pipeline that helps your company generate more sales with the data it already has.

In our case, we’ll take data from an Oracle database (most kiosks), from Salesforce (the stores), and from spreadsheets (the newest kiosks), extract it, transform it where necessary, and load it into a single data warehouse database that reporting tools can access to generate dashboards and scorecards.
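
Before diving into each step, here is a minimal sketch of what the whole pipeline could look like in Python with pandas. The reader functions are placeholders and the column names are assumptions for illustration only; the real extract step would talk to Oracle, Salesforce, and the spreadsheets.

```python
import pandas as pd

def extract() -> list[pd.DataFrame]:
    """Pull raw sales rows from each source (placeholders stand in for real readers)."""
    oracle_sales = pd.DataFrame({"sale_id": [1], "amount": [19.98]})      # kiosks' Oracle database
    salesforce_sales = pd.DataFrame({"sale_id": [2], "amount": [25.00]})  # stores' Salesforce system
    spreadsheet_sales = pd.DataFrame({"sale_id": [3], "amount": [9.99]})  # newest kiosks' spreadsheets
    return [oracle_sales, salesforce_sales, spreadsheet_sales]

def transform(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Consolidate the extracted data into one clean structure."""
    return pd.concat(frames, ignore_index=True).drop_duplicates()

def load(sales: pd.DataFrame) -> None:
    """Write the transformed data to the warehouse (stubbed here)."""
    print(f"Would load {len(sales)} rows into the data warehouse")

if __name__ == "__main__":
    load(transform(extract()))
```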

Let’s dive into the three steps of ETL to elaborate on the process.

Extraction

Extraction is the process of retrieving data from one or more sources. Sources can have different formats and structures, such as documents, spreadsheets, CSV files, flat files, relational databases (Oracle, MySQL, SQL Server), non-relational databases, and so on.

There are two main types of extraction: full and partial.

  • Full extraction is used for initial extractions or when the amount of data and consequently the time for extraction are acceptable.
  • Partial extraction is recommended when there is no need to extract all the data again, or the amount of data is big enough to make a full extraction impossible. In a partial extraction, only updated or new data will be extracted.

Beyond these aspects, a few other considerations matter when choosing between full and partial extraction, and I want to describe one of them: data availability and integrity.

This means that only completed transactions are considered for extraction, eliminating data that could cause an integrity fault. Take, for example, a 10-question online test used to assess an engineer’s knowledge. If an engineer is in the middle of the test and has answered some questions but hasn’t finished yet, the extraction process must not read the answers from the unfinished test, because doing so could cause an integrity fault.
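
As a rough sketch of a partial extraction that also respects data integrity, the query below pulls only sales that are newer than the last run’s watermark and marked as completed. The table, columns, and the SQLite stand-in database are assumptions, not the example company’s real schema.

```python
import sqlite3  # stands in for the real source database driver
import pandas as pd

def extract_partial(conn, last_extracted_at: str) -> pd.DataFrame:
    """Extract only new, completed transactions recorded since the previous run."""
    query = """
        SELECT sale_id, store_id, item_value, sold_at
        FROM sales
        WHERE sold_at > ?             -- partial: only rows newer than the watermark
          AND status = 'COMPLETED'    -- integrity: skip unfinished transactions
    """
    return pd.read_sql_query(query, conn, params=(last_extracted_at,))

# Usage: remember the watermark from the previous run and pass it in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INT, store_id INT, item_value REAL, sold_at TEXT, status TEXT)")
new_rows = extract_partial(conn, last_extracted_at="2024-01-01 00:00:00")
print(len(new_rows))
```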

Transformation

After extracting the data, we can begin the transformation process: clean, manipulate, and transform data according to business rules and technical criteria to maintain an acceptable level of data quality.

Depending on a number of factors, the use of a staging area may be necessary. A staging area is an intermediate storage space used to temporarily store data extracted from data sources to be transformed.

In some projects, normally those with a small amount of data, there is no need for a staging area, but most projects use one.
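
A staging area can be as simple as a separate schema or database where raw extracts land unchanged before transformation. Here is a minimal sketch using pandas and SQLAlchemy; the stg_ table naming and the local SQLite file are assumptions for illustration.

```python
import pandas as pd
from sqlalchemy import create_engine

# A local SQLite file stands in for the real staging database.
staging = create_engine("sqlite:///staging.db")

def stage(raw: pd.DataFrame, source_name: str) -> None:
    """Land the raw extracted data in the staging area, one table per source."""
    raw.to_sql(f"stg_{source_name}", staging, if_exists="replace", index=False)

stage(pd.DataFrame({"sale_id": [1], "item_value": [9.99]}), "oracle_kiosks")
```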

There is a range of tasks performed during the transformation phase:

  • Selection: criteria to select extracted data. The selection can be made during the extraction phase, during the transformation phase, or in both phases.
  • Integration: combining the data from the extraction phase into the staging area. This combination means adding, deleting, and updating data in the staging area based on the extracted data.
  • Joins: used to join extracted data, similar to SQL joins (inner join, left join, right join, full join, etc.)
  • Cleansing or scrubbing: removing inconsistent or invalid data, or data with errors, to improve data quality. Working with multiple data sources increases the chances of running into data problems that need cleaning (a small cleansing sketch follows this list), such as:
    • Referential integrity (customer with a nonexistent category)
    • Missing values (customer without an ID)
    • Uniqueness (more than one person with the same SSN)
    • Misspellings (Sun Diego, Cannada, L.Angeles)
    • Contradictory values (Alex DOB 04.27.1974, Alex DOB 04.14.2000)
    • and many others
  • Summarizations: summarizing sets of data for later use
  • Aggregations: data compiled and summarized in groups
  • Consolidations: data from multiple sources or structures consolidated in a single set of data
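
Here is the small cleansing sketch mentioned above, written in pandas. It flags missing values, duplicate SSNs, and broken referential integrity; the customer and category frames are invented purely to illustrate the checks.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, None, 4],
    "ssn":         ["111", "222", "222", "333"],
    "category_id": [10, 99, 10, 10],   # 99 does not exist in the category table
})
categories = pd.DataFrame({"category_id": [10, 20]})

# Missing values: customers without an ID.
missing_id = customers[customers["customer_id"].isna()]

# Uniqueness: more than one person with the same SSN.
duplicated_ssn = customers[customers["ssn"].duplicated(keep=False)]

# Referential integrity: customers pointing at a nonexistent category.
orphans = customers[~customers["category_id"].isin(categories["category_id"])]

print(len(missing_id), len(duplicated_ssn), len(orphans))
```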

Here are some common transformation types (a short sketch follows the list):

  • Delete duplicate data
  • Splitting and merging
  • Conversions (date, time, number masks, measurements)
  • Encoding (Male to M)
  • Calculations (item_value = unit_Price * quantity)
  • Key generation
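
The sketch below applies several of these transformations in pandas; the sales frame and its columns are invented for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    "sale_id":    [1, 1, 2],
    "sold_at":    ["04/27/2023", "04/27/2023", "05/02/2023"],
    "gender":     ["Male", "Male", "Female"],
    "unit_price": [10.0, 10.0, 25.0],
    "quantity":   [2, 2, 1],
})

sales = sales.drop_duplicates()                                         # delete duplicate data
sales["sold_at"] = pd.to_datetime(sales["sold_at"], format="%m/%d/%Y")  # conversion (date mask)
sales["gender"] = sales["gender"].map({"Male": "M", "Female": "F"})     # encoding (Male to M)
sales["item_value"] = sales["unit_price"] * sales["quantity"]           # calculation
sales["sale_key"] = range(1, len(sales) + 1)                            # key generation
```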

Loading

Last but not least, the final process in ETL is to load the data into the destination. Loading is the act of inserting transformed data (from a staging area or not) into the repository, normally a data warehouse database.

There are three main types of loading data: full or initial, incremental, and refresh.

  • Full or initial means a complete load of extracted and transformed data. All the data in the staging area will be loaded into the final destination to be made ready for business users.
  • Incremental load is the process of comparing the transformed data with the data already in the final destination and loading only the new data. Incremental load can be used in conjunction with a refresh load, explained below; a minimal sketch of an incremental load follows this list.
  • Refresh load is the process of updating the data in the final destination to reflect changes made in the original source. A refresh can be full or incremental.
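
Here is the incremental-load sketch, implemented as an upsert so that new rows are inserted and existing ones updated. SQLite (version 3.24 or newer, for ON CONFLICT support) stands in for the warehouse, and the fact_sales table and its columns are assumptions.

```python
import sqlite3

# Hypothetical warehouse table; an in-memory SQLite database stands in for the real warehouse.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, item_value REAL)")

transformed_rows = [(1, 19.98), (2, 25.00)]  # output of the transformation step

# Incremental load: insert new rows, update existing ones (an "upsert").
dw.executemany(
    """
    INSERT INTO fact_sales (sale_id, item_value) VALUES (?, ?)
    ON CONFLICT(sale_id) DO UPDATE SET item_value = excluded.item_value
    """,
    transformed_rows,
)
dw.commit()
```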

In summary, every company, regardless of its size, can use ETL processes to integrate the information it already has and generate an even greater wealth of information for decision-making, turning data that previously could not be used into a new source of revenue.

Testing

Testing is one of the most important phases of ETL, yet also one of the most overlooked.

Transforming data from different sources and structures and loading it into a data warehouse is very complex and can generate errors. The most common errors were described in the transformation phase above.

Data accuracy is the key to success, while inaccuracy is a recipe for disaster. Therefore, ETL professionals have a mission to guarantee data integrity during the whole process. After each phase, a test must be performed. Whether extracting data from a single source or from multiple sources, the data must be checked to confirm that there are no errors.

The same must be done after any transformation. For example, when summarizing data during the transformation phase, the data must be checked to guarantee that no data was lost and the sums are correct.

After loading the transformed data into the data warehouse, the testing process must be applied again. The loaded data needs to be compared with the transformed data and after that with the extracted data.

In our cell phone case company example, we are working with three different sources (proprietary Oracle database, Salesforce, and spreadsheets) and different formats. The testing phase can use sample data from the original sources and compare them with the data that is in the staging area to guarantee that the extraction occurred without errors.

The sample data—which in this case can be sales information from three different locations (stores, old kiosks, new kiosks)—is to be compared against the original source. The differences, if any, must be analyzed to see whether they are acceptable or if they are errors.
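
A simple reconciliation test of this kind might compare row counts and sales totals between the source sample and what landed in the staging area. The frames, column name, and tolerance below are assumptions for illustration.

```python
import pandas as pd

def reconcile(source_sample: pd.DataFrame, staged_sample: pd.DataFrame, tolerance: float = 0.01) -> bool:
    """Compare row counts and sales totals between a source sample and its staged counterpart."""
    counts_match = len(source_sample) == len(staged_sample)
    totals_close = abs(source_sample["item_value"].sum() - staged_sample["item_value"].sum()) <= tolerance
    return counts_match and totals_close

source = pd.DataFrame({"item_value": [19.98, 25.00]})
staged = pd.DataFrame({"item_value": [19.98, 25.00]})
assert reconcile(source, staged), "Extraction mismatch: investigate before loading"
```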

If errors are found, they must be fixed, which raises a few decisions: Should the original data be modified? Is it even possible to do so? If the errors can’t be fixed in the original source, can they be fixed by some transformation?

In some cases, the data with errors must be eliminated and an alert triggered to inform those in charge.

Some testing examples:

  • Data validation
  • Data quality
  • Performance
  • Data rules
  • Data modeling

Logging

Logging ETL processes is key to guaranteeing maintainable, easy-to-fix systems.

An ETL pipeline with proper logging keeps the entire operation in a state of constant improvement, helping the team manage bugs and problems with data sources, data formats, transformations, destinations, and so on.

A robust logging process saves teams time by letting them identify problems more quickly and easily, so engineers can pinpoint an issue directly instead of hunting for it. Errors sometimes occur in the middle of extracting tons of data, and without a log, identifying the problem is hard, at times almost impossible, and the whole process must be re-run. With logs, the team can rapidly identify the file and row that caused the problem and fix only that data.

The only case I can imagine where logs are not so important is with very small unautomated systems, where the process runs manually and there’s a small amount of data that can be monitored by hand.

Logs also support automation: ETL processes that run automatically over large amounts of data need a log system. If it is well planned and executed, all the effort put into building a logging system pays dividends in the form of faster error identification, more reliable data, and the improvement points found in the log files.

There are three main steps in creating a log system: generate, archive, and analyze.

  • Generate is the process of documenting what is going on during the execution of ETL pipelines: when the process started, which file or table is being extracted, the data being saved in the staging area, the error messages, and more. All the important information that can help engineers must be logged (a small logging sketch follows this list). Alert: be careful not to generate so much information that it just consumes time and space without being helpful.
  • Archive means keeping a track record of past executions so you can search earlier runs to identify errors or compare them with the current scenario when looking for improvements. It is also important to consider how relevant a given point in history is before saving it: log data from long ago, when the structure has since changed many times, is not worth keeping.
  • Analyze. Log analysis is of crucial importance: storing tons of data that is never analyzed doesn’t make sense, since it just costs time and money to generate and store. Log analysis is important not only for finding errors but also for identifying improvement points and increasing overall data quality.
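
Here is the small logging sketch mentioned above, covering the generate step with Python’s standard logging module. The log file name and the extraction function are hypothetical.

```python
import logging

# Minimal sketch: log each step of a hypothetical extraction to a file.
logging.basicConfig(
    filename="etl_run.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("etl")

def extract_file(path: str) -> None:
    log.info("Extraction started for %s", path)
    try:
        with open(path) as f:
            row_count = sum(1 for _ in f)  # parsing and staging would happen here
        log.info("Extraction finished for %s (%d rows)", path, row_count)
    except Exception:
        # The file name plus the traceback point straight at the failure.
        log.exception("Extraction failed for %s", path)

extract_file("new_kiosk_sales.csv")
```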

Performance

ETL processes can work with tons of data and may cost a lot, both in the time spent setting them up and in the computational resources needed to process the data. When planning an integration, engineers must ask whether all of that data really needs to be employed. The best thing to do is to work with the minimum amount of data needed to achieve the objectives, rather than spend time and money migrating useless data. Also bear in mind that data volumes tend to increase over time, so try to account for your future needs.

How much effort should go into designing a logging system?

It depends on a number of different factors like the amount and frequency of the data to be processed. Small systems with little data can be managed manually, without the need to invest in advanced logging systems.

Companies with a large amount of data, many different processes, different data sources and formats, and complex pipelines must invest in creating logging systems. As we already mentioned, a good log system can save a lot of time and money.

To put it plainly: regardless of company size, amount of data, and frequency of integration, an ETL process must be profitable. The investment in time, money, and technical resources must generate returns, or, as economist Milton Friedman put it, “There’s no such thing as a free lunch.”

In summary, an ETL process can help companies increase profit from data they already have but are not using in the right way. ETL pipelines can integrate different systems, spreadsheets full of important information, and other pieces of data spread across departments and branches, allowing organizations to make the most of their data.