ETL Testing Basics
12 September 2009
When doing developer testing outside of the database, it is always a good idea to start with simple unit tests that stub or mock any dependencies. If your test calls a method that calls other objects, a database, a service, etc., then you have entered the realm of integration tests. Integration tests are much harder to design, set up, and maintain. Unfortunately, when you start testing an ETL process, you have to test the end-to-end integration of multiple databases. This makes an already difficult problem exponentially more difficult. For starters, your data needs to start in an expected state, and only one test can run against the database(s) at a given time or the data will be corrupted. To make the testing of ETL processes more manageable, this article introduces some general concepts that should be followed.
As with most things when developing software, one of the biggest concerns should be the management of complexity. To do this, some principles should be followed. Unfortunately, these principles are not as simple to implement as they might appear to be at first. Tests should be designed and carefully analyzed against these principles before being implemented. Furthermore, the simplest tests should be designed first and carefully expanded towards the more complex.
- Testers should have their own/private copies of the database(s), and there should only be one process (e.g. a test) running against the database(s) at any given moment.
- Each test should target as few tables as possible.
- Each test should populate as little data as is feasible during the test.
- Avoid modifying pre-populated (a.k.a. seed) data during the course of a test (If a test absolutely must modify the seed data, then the test tear-down must restore to original, pre-test state).
- Data should reflect the real world.
- Revert changes made to the database(s) in the tear-down of each test (This applies to both the source and warehouse databases).
- A test should have one and only one activity (i.e. Tests should not be chained together).
- Limit the duplication of work between tests.
It can be easy to lose focus on what you are supposed to be testing when you must not only test the ETL but also maintain a working knowledge of the source application. Always keep in mind that ETL testing is about testing the data that is transformed and loaded into the warehouse.
- All tests should target data mappings, calculations, and data formats (nullable, boundary, and data-types).
- Some tests should concentrate on initial loads.
- Some tests should concentrate on incremental loads.
- Some tests should concentrate on incremental loads for which there were no source database changes.
- Never shift focus toward testing the source application.
- When testing the business rules, treat the ETL as a black-box, and focus on the known inputs and outputs as much as possible.
All tests should follow a simple workflow with the following basic steps:
- Validate the pretest state of both the source system(s) and the target warehouse.
- Setup the data needed for the particular test.
- Run the ETL process (Please note that some tests will run this twice in order to verify that data is not duplicated).
- Make assertions.
- Revert data changes.
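The workflow above can be sketched as a single automated test. The following is a minimal illustration, not a real ETL harness: SQLite in-memory databases stand in for the source and warehouse databases, and the names (run_etl, customer, dim_customer) are hypothetical.

```python
# Minimal sketch of the five-step ETL test workflow. SQLite in-memory
# databases stand in for the source and warehouse; all names are illustrative.
import sqlite3

RESEED_VALUE = 1000  # IDs above this value are test specific data


def run_etl(source, warehouse):
    # Toy ETL: copy any customer rows not already in the warehouse.
    # INSERT OR IGNORE plus the UNIQUE constraint prevents duplicates.
    for cid, name in source.execute("SELECT id, name FROM customer"):
        warehouse.execute(
            "INSERT OR IGNORE INTO dim_customer (source_id, name) VALUES (?, ?)",
            (cid, name))


def test_customer_mapping():
    source = sqlite3.connect(":memory:")
    warehouse = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    warehouse.execute(
        "CREATE TABLE dim_customer (source_id INTEGER UNIQUE, name TEXT)")

    # 1. Validate the pretest state: no leftover test specific data.
    leftover = source.execute(
        "SELECT COUNT(*) FROM customer WHERE id > ?", (RESEED_VALUE,)).fetchone()
    assert leftover[0] == 0

    # 2. Set up the data needed for this particular test.
    source.execute("INSERT INTO customer VALUES (?, ?)", (RESEED_VALUE + 1, "Ada"))

    # 3. Run the ETL twice to verify that data is not duplicated.
    run_etl(source, warehouse)
    run_etl(source, warehouse)

    # 4. Make assertions against the warehouse.
    mapped = warehouse.execute(
        "SELECT name FROM dim_customer WHERE source_id = ?",
        (RESEED_VALUE + 1,)).fetchall()
    assert mapped == [("Ada",)]

    # 5. Revert data changes in both databases.
    source.execute("DELETE FROM customer WHERE id > ?", (RESEED_VALUE,))
    warehouse.execute("DELETE FROM dim_customer WHERE source_id > ?", (RESEED_VALUE,))


test_customer_mapping()
```

Note that the ETL is treated as a black box: the test only touches known inputs (the source row) and known outputs (the warehouse row).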
Limiting Data Population
Tests can get very complex when a lot of tables are involved in a single test. Therefore, tests should target as few tables as possible. Ideally, a test should focus on one table at a time; in practice, however, this is rarely feasible. Therefore, pick as few tables as possible, and target those tables for your test. When multiple tables are selected, they should be centered around a single table. To do this, data should be classified into one of two categories:
- Test specific data - the minimum data that is required to perform a given test. For example, when writing a test that adds an order to a store front application and verifies the flow of that information to the data warehouse, the order itself would be the test specific data. All other information (e.g. the customer, the items in the order, the shipping address, etc.), is not test specific.
- Pre-populated dependency data (a.k.a. "Seed Data") - data that is needed for a test, but is not part of the test itself (e.g. A customer is needed to insert an order, but it should exist prior to testing the insertion of a new order. This is because customers are not the target of the test).
Avoid editing seed data during the setup of a test. This can lead to many unforeseen problems.
- It may lead to data corruption issues because of business logic that is not apparent in the schema.
- It may accidentally leave orphan records.
- Edited data is harder to revert during test clean up.
It will often be subjective as to whether data should be seeded or populated with the test. In these cases, the person writing the test should try to do the simplest thing and focus on the minimum needed for the test. For example, some applications will use user-editable look-up tables (for grouping) where simpler applications will simply have a column or group of columns. In a storefront application, there might be a look-up table for the company name. While entering a new customer name, the source application might give the user the ability to select an existing company name or add their own. If a goal of the test is to verify that the company name was mapped properly into the data warehouse, it would be most appropriate to treat the look-up table as test specific data. If it is not a goal of the test to verify the mapping of the company name, it is more appropriate to treat the look-up table as seed data, pre-populate it, and use any available row during the test.
Segregating Test Specific Data from Seed Data
The best way to segregate seed data from test specific data is by reseeding the identity column for the given table. A standard reseed value should be selected and used everywhere (e.g. 1000). This means that all seed data will have run-of-the-mill IDs (e.g. 1, 2, 3, ...), and that all test specific data will have IDs greater than the reseed value (e.g. 1001, 1002, ...).
To reseed the identity column in SQL Server:
- Add all of your seed data.
- Use the command DBCC CHECKIDENT ('<Table Name>', RESEED, <New Reseed Value>). For the example above, the command DBCC CHECKIDENT ('BillingAddress', RESEED, 1000) was used before inserting the third row.
To add additional seed data later, reseed the identity to the maximum ID of a seed data row:
- Determine the maximum ID of seeded rows.
- Use the command DBCC CHECKIDENT ('<Table Name>', RESEED, <maximum ID of seeded rows>). For the example above, the command DBCC CHECKIDENT ('BillingAddress', RESEED, 2) would be used.
- Add additional seed data.
- Use the command DBCC CHECKIDENT ('<Table Name>', RESEED, <Previously Used Reseed Value>) (Assuming that the table has been cleared of test data. This is not the case in the above example).
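The effect of reseeding can be illustrated in a runnable form. The sketch below uses SQLite's sqlite_sequence table as a stand-in for SQL Server's DBCC CHECKIDENT (an assumption for illustration only; the mechanics differ, but the concept is the same): seed rows get low IDs, and everything inserted after the reseed lands above the reseed value.

```python
# Sketch: segregating seed data from test specific data by reseeding the
# identity column. SQLite stands in for SQL Server here; the UPDATE against
# sqlite_sequence plays the role of DBCC CHECKIDENT ('BillingAddress', RESEED, 1000).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE BillingAddress (ID INTEGER PRIMARY KEY AUTOINCREMENT, City TEXT)")

# Seed data gets ordinary low IDs (1, 2, ...).
conn.execute("INSERT INTO BillingAddress (City) VALUES ('Springfield')")
conn.execute("INSERT INTO BillingAddress (City) VALUES ('Shelbyville')")

# Reseed the identity to the standard reseed value (1000).
conn.execute("UPDATE sqlite_sequence SET seq = 1000 WHERE name = 'BillingAddress'")

# Anything inserted during a test now lands above the reseed value.
conn.execute("INSERT INTO BillingAddress (City) VALUES ('Capital City')")
ids = [row[0] for row in conn.execute("SELECT ID FROM BillingAddress ORDER BY ID")]
print(ids)  # -> [1, 2, 1001]
```

Because seed IDs and test IDs never overlap, a cleanup script can later distinguish the two with a single WHERE clause.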
Reflecting the Real World in the Data
Applications often populate databases in non-intuitive ways because of business rules that are not apparent in the schema. Therefore, all seed data should be populated via the source application. To populate test specific data, one could be a purist and use Watir or Selenium to script the data insert via the application, but this adds additional complexity to the test: it introduces the need for the application to be installed for each copy of the database(s), the application needs to be running properly, the application needs to be protected against other users manipulating data, background threads must be accounted for, and so on. A simpler way is to run a SQL Server trace while manually adding the test data via the user interface during the test creation process, and then create the script to be used during automated testing from the information in the trace.
Reverting Data After a Test
One approach that can quickly come to mind for cleaning up data changes at the end of a test is to simply wrap the entire test in a transaction and roll the transaction back at the end. This is not a recommended approach. It quickly leads to unexpected complexity. Using a rollback transaction is not even feasible if the ETL runs in a separate process where the transaction cannot be shared.
Another approach that might have some appeal is to restore a master database backup after each test. Although this is the recommended approach for broad-reaching tests, covered later, it is not recommended for the types of tests discussed so far, both because of the speed of testing and because of issues that frequently arise when performing large numbers of restores in a short period of time.
For targeted, single subject tests, which this article focuses on, the recommended approach for reverting data after a test is to manually maintain each target table. If the test is targeting as few tables as possible and test specific data is properly segregated from seed data with a reseeded identity column, you can simply delete data with identities greater than the reseeded value. When done properly, a master cleanup script can be used. For example, in a store front application database where there is testing of customer and order inserts, the cleanup script would simply delete rows with identities greater than the reseed value (e.g. delete from customer where ID > 1000; delete from order where ID > 1000;). This will leave the seed data in place but remove clutter and data that may skew expected values for subsequent tests. It is highly recommended that you maintain a master backup file of your databases so that you can easily recover from the likely event of data corruption.
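A master cleanup script of this kind can be sketched as follows. This is an illustration under the assumptions of this article, with SQLite standing in for SQL Server and hypothetical table names (CustomerOrder is used instead of Order, since Order is a reserved word in most SQL dialects).

```python
# Sketch of a master cleanup script: delete every row whose identity is
# above the reseed value, leaving the seed data untouched. SQLite stands
# in for SQL Server; table and column names are illustrative.
import sqlite3

RESEED_VALUE = 1000
# One DELETE per table that tests are allowed to touch -- the equivalent of:
#   delete from Customer where ID > 1000; delete from CustomerOrder where ID > 1000;
CLEANUP_TABLES = ["Customer", "CustomerOrder"]


def cleanup(conn):
    for table in CLEANUP_TABLES:
        conn.execute(f"DELETE FROM {table} WHERE ID > ?", (RESEED_VALUE,))


conn = sqlite3.connect(":memory:")
for table in CLEANUP_TABLES:
    conn.execute(f"CREATE TABLE {table} (ID INTEGER PRIMARY KEY, Name TEXT)")
    conn.execute(f"INSERT INTO {table} VALUES (1, 'seed')")          # seed row
    conn.execute(f"INSERT INTO {table} VALUES (1001, 'test data')")  # test row

cleanup(conn)
remaining = conn.execute("SELECT ID FROM Customer").fetchall()
print(remaining)  # -> [(1,)]
```

Because every test uses the same reseed value, this one script can run after any test in the suite without knowing which test just ran.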
When to Break Principles
All of the principles above are defined to minimize complexity. These principles should always be considered and rarely broken. However, there will be cases where a principle should be broken. Breaking a principle should always be done in an attempt to minimize complexity, and will frequently be due to the excessive complexity of the source system itself.
Example 1 - Many Look-up Tables
In cases where there is an extreme number of look-up tables, break the problem up into smaller tests. Instead of testing all of the data mappings in a single test, target the central table and select a few look-up tables at a time. Treat the look-up tables that you are not focusing on as seed data, and rotate through the look-up tables until you have covered them all.
In a store front application, when adding a customer, the application might allow the user to select an existing company or add their own. Furthermore, for each company, the application may allow the user to select from existing billing/shipping addresses or add their own.
In this example scenario, a tester verifying customer data mappings might write four tests all of which center around the Customer table.
- Test 1 ("Customer Mapping")
- The test would focus entirely on the value fields of the customer table.
- The look-up tables ("BillingAddress", "Company", and "ShippingAddress") would contain seed data.
- When populating the Customer table, the foreign keys to the look-up tables would use any available/seeded row. If the foreign keys are nullable, leaving them as null would be a better choice.
- The mapped values in the warehouse for Customer.FirstName and Customer.LastName would be asserted.
- All other data including the association to the look-up table and the mapped values from the look-up tables would be ignored in the assertions for this test.
- Test 2 ("Customer BillingAddress Mapping")
- The test would focus on the association between the Customer and BillingAddress as well as the data mapped from BillingAddress.
- The Customer and BillingAddress tables would be populated with test specific data.
- The Company and ShippingAddress tables would contain seed data.
- When populating the BillingAddress table, the foreign key to Company would use any available/seeded row or be left null if applicable.
- When populating the Customer table:
- The foreign key to BillingAddress would use the test specific data.
- The foreign key to Company would use the same value used when populating BillingAddress.
- The foreign key to ShippingAddress would use any available/seeded row where the CompanyID matches the value used when populating BillingAddress, or be left null if applicable.
- The mapped values in the warehouse for Customer.FirstName and Customer.LastName would be ignored during the assertion phase because they were already tested in the "Customer Value Mapping" test (This is the principle of limiting the duplication of work between tests. It greatly reduces the complexity of the overall test suite and will decrease the time it takes to run all of the tests).
- The association between Customer and BillingAddress would be asserted in the warehouse as well as all of the value fields in the BillingAddress table (i.e. StreetAddress, City, State, and ZipCode).
- All other data would be ignored in the assertions for this test.
- Test 3 ("Customer ShippingAddress Mapping")
- The test would focus on the association between the Customer and ShippingAddress as well as the data mapped from ShippingAddress.
- The Customer and ShippingAddress tables would be populated with test specific data.
- The Company and BillingAddress tables would contain seed data.
- When populating the ShippingAddress table, the foreign key to Company would use any available/seeded row or be left null if applicable.
- When populating the Customer table:
- The foreign key to ShippingAddress would use the test specific data.
- The foreign key to Company would use the same value used when populating ShippingAddress.
- The foreign key to BillingAddress would use any available/seeded row where the CompanyID matches the value used when populating ShippingAddress, or be left null if applicable.
- The mapped values in the warehouse for Customer.FirstName and Customer.LastName would be ignored because they were already tested in the "Customer Value Mapping" test.
- The association between Customer and ShippingAddress would be asserted in the warehouse as well as all of the value fields in the ShippingAddress table (i.e. StreetAddress, City, State, and ZipCode).
- All other data would be ignored in the assertions for this test.
- Test 4 ("Customer Company Mapping")
- This test is more complicated because of the data integrity issues surrounding the Company table. Therefore, it would be simplest to seed the look-up tables with known values (this breaks the pattern used above to simplify the test), and use those IDs when populating the Customer table. Normally, hard-coded IDs would be unacceptable because they can be a maintenance nightmare. However, in extreme situations this may be an option. Just be sure to verify the pre-populated data when verifying the pretest state of the environment.
- The mapped values in the warehouse for Customer.FirstName and Customer.LastName would be ignored because they were already tested in the "Customer Value Mapping" test.
- The association between Customer and Company would be asserted in the warehouse as well as all of the value fields in the Company table (i.e. Name).
- All other data would be ignored in the assertions for this test.
In some scenarios, it might be more advantageous to change the central focus of the tests. This may be difficult to spot in applications that the person designing the test is extremely knowledgeable about, because of biases caused by the layout of the user interface. For example, the GUI for this example application may display, associate, and allow the editing of company information in many places. Conversely, the customer information may have been an afterthought and not widely displayed or editable via the user interface. If the test design had centered on the Company table, the problem could not be broken into multiple tests because Company is a look-up table.
Example 2 - Large Hierarchies
A cruise line has a complex software application that not only places travelers in unsold cabins, but also tracks the upgrading of a guest's accommodations, the cleaning of cabins after a guest departs, maintenance tasks, and safety checks. All of these functions are separated into a suite of products, but they all share a common database and location hierarchy. The location hierarchy has grown organically as applications have been added to the suite and is very large and complicated. The physical structure is divided into ship, deck, and cabin. The placement application needs to know attributes about the cabin, such as its class (e.g. inside, outside, suite, etc.) and whether it has a balcony. Because the cleaning staff tracks cabins by zones and groups those zones into sections, each cabin is assigned to a zone and zones are assigned to sections. The maintenance staff divides cabins by region, so each cabin is assigned to a region. The safety staff also divides the cabins into zones and sections; however, they do not use the same zones and sections as the cleaning staff, so each cabin is assigned to yet another zone and each of those zones is assigned to yet another section.
Since all functionality in these applications is centered around the location hierarchy, most (if not all) of the tests will require a row in the cabin table. Test complexity explodes globally because the cabin table is complicated to populate. In a scenario like this, locations should always be pre-populated for tests that depend on a cabin row. Because some of the tests, like upgrading a guest's accommodations, will require a variety of types of cabins, the pre-populated data should be diverse (e.g. inside cabins, outside cabins, suites, etc.).
If it is sufficiently difficult to populate a cabin row, it might also be a good idea to pre-populate cabin rows even for tests that verify the mapping of location fields. If the data warehouse stores the natural key (the original primary key from the source system), it may be acceptable to verify the mappings based on a hard coded ID. Normally, this approach would be a maintenance nightmare, but in extreme situations, this may be an option.
Handling Broad Ranging Calculations
Data warehouses are frequently full of calculations based on large datasets spanning long periods of time. The bulk of this article has been about verifying the data mappings of a limited set of data, but this approach is not practical for testing calculations for an annual report based on many data rows with time stamps spanning a year. For these types of situations, all of the data for the tests should be set up in a test specific database, and the database should be restored at the beginning of each test run.
- Have a process in place for creating, protecting, maintaining, and distributing a master copy of the database that contains all pre-populated rows.
- Review the data dictionary and physical data models of both the source application and the data warehouse. Then, identify data mappings and/or calculations you wish to test, and write a test description.
- Pick the simplest possible scenario. Similar, but more complex, tests can be written later based on the initial test.
- Identify all required tables for the test.
- Identify the subset of target tables that should be focused on.
- Identify the subset of tables that will require seed data.
- Carefully analyze the test plan against the principles above.
- In a master copy of the database (restored from last backup if necessary), add the seed data:
- Verify that the identity columns are seeded with the correct values by running the command DBCC CHECKIDENT ('<Table Name>', RESEED, <maximum ID of a seeded row>).
- Add new seed data via the source application's user interface.
- Use the command DBCC CHECKIDENT ('<Table Name>', RESEED, <New Reseed Value>) to reseed the identity column for test data.
- Backup the master copy of the database, and store it in a common repository (e.g. source control).
- Write queries that verify the pretest conditions.
- Write scripts to populate the test specific data.
- Write assertion queries.
- Write a cleanup script.
- Manually run through the test scenario several times.
- Automate the test.
- Repeat using increasingly complex scenarios until sufficiently comfortable. Then, move to the next test area.