Automated Unit Testing in SAP Data Services. Part I

Introduction

One of the key drivers of this article is that, although there are many articles on unit testing in general on the internet, there are hardly any on unit testing with SAP Data Services. Unit testing has been accepted as a key part of the software development life cycle for projects building applications in languages such as Java or C++. However, I have rarely come across an SAP Data Services project where unit testing is treated as a fundamental part of that life cycle. Projects without rigorous unit testing spend longer than planned in system testing identifying trivial defects, and as a result suffer from delays and cost overruns.

This article first explains what unit testing is and what benefits it brings to a project, and then looks at how to implement automated unit testing in an SAP Data Services project.

Overview

This article discusses an approach to unit testing in an SAP Data Services project. It reviews the objectives and benefits of unit testing in software development and how these also apply to the data integration process of a data warehouse system. The second part of the article looks at how unit testing can be implemented in an SAP Data Services project, first by looking at how to construct component-based data integration code that can be unit tested, and then at how to construct unit test suites.

What Is Unit Testing

In software engineering, unit testing involves testing the individual units of the application as opposed to testing the application as a whole. What defines a “unit” varies depending on the software being built; in Java, for example, a unit is typically a Java class. Essentially, it is the smallest part of the application that can be tested in isolation [1].

A unit test of a Java class would involve executing individual tests against each method and object defined by the class to ensure that they meet their intended functionality. As well as positive testing – testing that the code meets the requirements – unit testing also involves a degree of negative testing that tests that the code handles unexpected situations.
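The difference between positive and negative testing can be sketched in a few lines. The example below uses Python's standard unittest module rather than Java or SAP Data Services, and the function parse_amount is a hypothetical unit invented purely for illustration.

```python
import unittest


def parse_amount(text):
    """Hypothetical unit under test: convert a string such as '12.50' to a float."""
    if text is None or text.strip() == "":
        raise ValueError("amount is missing")
    return float(text)


class ParseAmountTest(unittest.TestCase):
    def test_valid_amount_is_converted(self):
        # Positive test: the unit meets its requirement for well-formed input.
        self.assertEqual(parse_amount(" 12.50 "), 12.5)

    def test_missing_amount_is_rejected(self):
        # Negative test: blank input is handled in a defined way.
        with self.assertRaises(ValueError):
            parse_amount("   ")

    def test_non_numeric_amount_is_rejected(self):
        # Negative test: input that is not a number is rejected.
        with self.assertRaises(ValueError):
            parse_amount("abc")
```

Each test method exercises one behaviour, so a failure points directly at the requirement that is no longer met.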

For data integration code, unit testing verifies that the data is extracted as per the requirements, that all the required transformations occur correctly, and that the data is loaded to the target data repository as expected.
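As a rough sketch of those three checks, the example below tests a small load from a flat file into a staging table. It is written in Python with an in-memory SQLite database standing in for the target; the function load_customers, the table customer_stg and the trimming and upper-casing rules are all assumptions made for illustration and are not SAP Data Services objects.

```python
import csv
import io
import sqlite3


def load_customers(source_file, connection):
    """Hypothetical unit: flat file -> customer_stg table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS customer_stg (id INTEGER, name TEXT, country TEXT)"
    )
    # Extract each record and apply the (assumed) transformations:
    # trim the name, trim and upper-case the country code.
    rows = [
        (int(r["id"]), r["name"].strip(), r["country"].strip().upper())
        for r in csv.DictReader(source_file)
    ]
    connection.executemany("INSERT INTO customer_stg VALUES (?, ?, ?)", rows)
    return len(rows)


def test_load_customers():
    # Prepare a small, known input data set.
    source = io.StringIO("id,name,country\n1, Alice ,gb\n2,Bob,de\n")
    connection = sqlite3.connect(":memory:")

    loaded = load_customers(source, connection)

    actual = connection.execute(
        "SELECT id, name, country FROM customer_stg ORDER BY id"
    ).fetchall()
    assert loaded == 2                                           # extraction
    assert actual == [(1, "Alice", "GB"), (2, "Bob", "DE")]      # transformation and load
```

The test owns both its input and its target, so it can be repeated at any time without depending on other parts of the system.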

Unit testing then provides the confidence that each component of the system delivers its required functionality. In addition, defects are spotted and resolved earlier in the development life cycle, and the unit tests also provide a degree of regression testing.

Automation of unit testing is an essential requirement for agile development of a data warehouse [2], and additional code or test frameworks are written to facilitate it. Not only does automation execute the unit tests much more quickly than manual testing, it also ensures that the unit tests are consistent between executions.

Unit Testing in a Data Integration System

In data warehousing, data integration is the system responsible for acquiring and loading data into the data warehouse. It is also known as ETL, as the three main aspects of data integration are extraction of data from the source system, transformation of the data and finally loading of the data into the data warehouse.

In data integration, a unit of code would be any individual process that performs a single movement of data from a source to a target. For example, a unit would be a process that reads data from a source, applies a series of data transformations and then writes the data to a target table or other output. Another example of a unit of code would be a script written to support the data integration process: a script that moves files on the file system, a script that unzips a compressed file, or a script that logs processing warnings and errors.
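A supporting script can be unit tested in the same way as a data movement. The sketch below, in Python, tests a hypothetical unzip_file helper against a small archive that the test creates itself in a temporary directory; the helper's name and behaviour are assumptions for illustration only.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_file(archive_path, target_dir):
    """Hypothetical supporting script: extract an archive and return the member names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())


def test_unzip_file():
    with tempfile.TemporaryDirectory() as work:
        work = Path(work)

        # Build a known input archive so the test does not rely on external files.
        archive_path = work / "input.zip"
        with zipfile.ZipFile(archive_path, "w") as archive:
            archive.writestr("sales.csv", "id,amount\n1,10\n")

        target = work / "out"
        names = unzip_file(archive_path, target)

        assert names == ["sales.csv"]
        assert (target / "sales.csv").read_text() == "id,amount\n1,10\n"
```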

For a more complete example consider a data integration system that loads data from flat files into a data mart. The process initially reads data from flat files and loads this data to tables in the staging area, then from the staging area the data is moved to tables in the historical data store (HDS) and then finally from the HDS to target tables in a data mart. Unit tests can then be defined for each separate movement of data from a source to a target as it flows through the system.

The diagram below illustrates this system, where each dotted-line box is an individual unit of data integration code that is a candidate for unit testing. Each box defines a source, or sometimes more than one source, and a target, and these are either flat files (F) or tables (T). Note that even though the target of one process can be the source of another, each of these processes is unit tested separately.

Diagram illustrating different flows of data within a data integration system with areas indicated for unit testing

Figure 1. Areas of Unit Testing in a Data Integration System

Therefore, in a data integration system a component or unit of code is either:

  • a process that moves data from any source to a target or
  • a script or function that provides supporting functionality.
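The point that consecutive data movements are tested separately, even when the target of one is the source of the next, can be sketched as follows. The example uses Python with an in-memory SQLite database; the table names (stg_sales, hds_sales, mart_sales_total) and both load functions are hypothetical stand-ins for the staging, HDS and data mart layers described above.

```python
import sqlite3


def staging_to_hds(con, load_date):
    """Hypothetical unit 1: copy staged rows into the HDS, stamping a load date."""
    con.execute(
        "INSERT INTO hds_sales (id, amount, load_date) "
        "SELECT id, amount, ? FROM stg_sales",
        (load_date,),
    )


def hds_to_mart(con):
    """Hypothetical unit 2: aggregate HDS rows into the data mart."""
    con.execute(
        "INSERT INTO mart_sales_total (total_amount) "
        "SELECT SUM(amount) FROM hds_sales"
    )


def new_database():
    # Each test gets its own empty database so tests cannot affect one another.
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE stg_sales (id INTEGER, amount REAL);
        CREATE TABLE hds_sales (id INTEGER, amount REAL, load_date TEXT);
        CREATE TABLE mart_sales_total (total_amount REAL);
        """
    )
    return con


def test_staging_to_hds():
    con = new_database()
    con.execute("INSERT INTO stg_sales VALUES (1, 10.0)")
    staging_to_hds(con, "2012-08-31")
    assert con.execute("SELECT * FROM hds_sales").fetchall() == [(1, 10.0, "2012-08-31")]


def test_hds_to_mart():
    # The HDS source is seeded directly; staging_to_hds is deliberately not run.
    con = new_database()
    con.executemany(
        "INSERT INTO hds_sales VALUES (?, ?, ?)",
        [(1, 10.0, "2012-08-30"), (2, 5.5, "2012-08-31")],
    )
    hds_to_mart(con)
    assert con.execute("SELECT total_amount FROM mart_sales_total").fetchall() == [(15.5,)]
```

Seeding the source of the second unit directly means that a defect in the first unit cannot cause the second unit's test to fail.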

System Testing

In contrast to unit testing, system testing tests the system as a whole and as such treats the system as a “black box”. This means that system testing is not concerned with the internal workings of the overall process; it checks that, for a given set of inputs, the system generates the required set of outputs. In the example process described above, the inputs are the source flat files and the outputs are the tables in the data mart layer. System testing also differs from unit testing in that it tests that the components of the system integrate and act as a complete system, rather than testing the functionality of individual components.
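For comparison, a black-box system test of the example process might look like the sketch below. It is written in Python with an in-memory SQLite database, and run_pipeline with its tables is a hypothetical stand-in for the complete flat file to data mart load. The test supplies only the input file and examines only the data mart table; the intermediate staging and HDS tables are never inspected.

```python
import csv
import io
import sqlite3


def run_pipeline(flat_file, con):
    """Hypothetical end-to-end load: flat file -> staging -> HDS -> data mart."""
    con.executescript(
        """
        CREATE TABLE stg_sales (region TEXT, amount REAL);
        CREATE TABLE hds_sales (region TEXT, amount REAL);
        CREATE TABLE mart_sales (region TEXT, total REAL);
        """
    )
    rows = [(r["region"], float(r["amount"])) for r in csv.DictReader(flat_file)]
    con.executemany("INSERT INTO stg_sales VALUES (?, ?)", rows)
    con.execute("INSERT INTO hds_sales SELECT region, amount FROM stg_sales")
    con.execute(
        "INSERT INTO mart_sales "
        "SELECT region, SUM(amount) FROM hds_sales GROUP BY region"
    )


def test_system_flat_file_to_mart():
    # Black box: only the input file and the data mart output are examined.
    flat_file = io.StringIO("region,amount\nEU,10\nEU,5\nUS,7\n")
    con = sqlite3.connect(":memory:")

    run_pipeline(flat_file, con)

    actual = con.execute(
        "SELECT region, total FROM mart_sales ORDER BY region"
    ).fetchall()
    assert actual == [("EU", 15.0), ("US", 7.0)]
```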

Benefits of Unit Testing

There are many benefits to unit testing [3],[4], of which the key ones are:

  • Identification of defects early in the development life cycle. It has been established that the earlier a defect is found in the systems development life cycle, the cheaper it is to correct. So defects found during development are cheaper to fix, in terms of effort, than defects found in system testing or production. [6]
  • Consistency. Repeated testing on the same input will produce the same results. This makes it very easy to regression test each unit to ensure that a fix or new functionality doesn’t introduce new defects.
  • Greater testing speed. This is particularly beneficial to an ETL project, as manually testing an ETL process can be quite time consuming: it often requires preparing an input data set, executing the ETL and then analysing the data in the target table to confirm whether the test has passed or failed. In addition, automated unit testing is a key requirement of agile development [ref for what is agile in dw], as it is essential to the rapid cycles of development, test and release.
  • Better designed software. This is not as apparent as the other benefits but often happens. A requirement of unit testing is that the units are tested in isolation from the rest of the application. This enforces low dependencies between different parts of the application, and fewer dependencies lead to less buggy code. [7]

Limitations

It should also be noted that there are limitations to unit testing, primarily that it won’t guarantee a defect-free system. Unit testing does not test the integration of the units under test, and many defects can occur in this area. System testing is used to identify these defects. Furthermore, a unit test is only as good as the unit test code that has been written. Often a trivial step is skipped in unit testing because the developer thinks “that bit of code is so simple it can never be wrong” or “just looking at the code I know it works”. As code evolves, what was once simple may no longer be, and so it should be tested. Lastly, a related scenario is negative testing being skipped or not being detailed enough. Here the developer has only unit tested for a positive result and has not tested what happens when invalid data is received by the integration process.
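A negative test for invalid data does not need to be elaborate. The Python sketch below assumes a hypothetical unit, split_valid_and_rejected, that routes rows with an unusable amount to a reject list instead of loading them; the function and the row layout are invented for illustration.

```python
def split_valid_and_rejected(rows):
    """Hypothetical unit: convert amounts, routing unusable rows to a reject list."""
    valid, rejected = [], []
    for row in rows:
        try:
            valid.append({"id": row["id"], "amount": float(row["amount"])})
        except (KeyError, TypeError, ValueError):
            # Missing column, null value or non-numeric text.
            rejected.append(row)
    return valid, rejected


def test_valid_rows_are_loaded():
    # Positive test only: on its own this would leave the reject path untested.
    valid, rejected = split_valid_and_rejected([{"id": 1, "amount": "10.5"}])
    assert valid == [{"id": 1, "amount": 10.5}]
    assert rejected == []


def test_invalid_rows_are_rejected_not_loaded():
    # Negative test: each row represents a different kind of invalid input.
    bad_rows = [
        {"id": 2, "amount": "ten"},  # non-numeric
        {"id": 3, "amount": None},   # null
        {"id": 4},                   # missing column
    ]
    valid, rejected = split_valid_and_rejected(bad_rows)
    assert valid == []
    assert rejected == bad_rows
```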

Now that we have discussed unit testing and its benefits, Part II of this article looks at how unit testing can be implemented in an SAP Data Services project.

[6] Barry W. Boehm and Philip N. Papaccio, “Understanding and Controlling Software Costs”, IEEE Transactions on Software Engineering, vol. 14, no. 10, October 1988, pp. 1462-1477; Barry W. Boehm, Software Engineering Economics, Englewood Cliffs, NJ: Prentice-Hall, 1981, ISBN 0-13-822122-7.

This article was published on August 31, 2012 by Al Gulland.