Using Oracle Data Integrator Knowledge Modules to Implement "History Preservation" on "Slowly changing Dimensions"
This article will review one aspect of the ODI product, the concept of "Knowledge Modules", and discuss how these can be used to stream-line the ETL development process through the use of fully re-usable code modules.
Oracle Data Integrator (ODI) is a recent addition to the Oracle Business Intelligence product set, (subsequent to the acquisition of Sunopsis by Oracle in 2006), and represents a powerful tool that will form the cornerstone of the Oracle Fusion data integration strategy. This article will review one aspect of the ODI product, the concept of "Knowledge Modules", and discuss how these can be used to stream-line the ETL development process through the use of fully re-usable code modules.
In this article we will use the example of history preservation on type 2 slowly changing dimensions (SCDs). History preservation is a standard requirement during warehousing projects, and is a feature that many ETL vendors have invested significant resources in to optimise their pre-packaged history preservation mechanisms. The KM feature in ODI can be used to optimise this process.
History preservation mechanisms
In general, traditional ETL tools preserve history on type 2 SCDs in a fairly consistent manner. Data is first extracted from the source table and then the dataset is compared with the target. This is achieved by comparing the natural key on the source dataset with the target. If the key is present and the compare columns have changed then the row is marked as an update, if not it is inserted. Update rows must be converted to an insert / update pair to save the old record while inserting a new record. The surrogate key must also be generated to create the unique identifier on the target table.
Each step in this process is required, although each ETL tool achieves this in a different manner. Some tools use SCD wizards that allow the user to enter the parameters and then a 'flow' is automatically generated. Others have well defined graphical transforms that allow the user to 'drag and drop' all the necessary fields into place.
Generally speaking, the process must be performed individually for every SCD within the warehouse. There is little chance for genuinely re-usable code as the parameters change for each table. The natural key field is different, the comparison columns on which history should be preserved also change between tables. Therefore, it is difficult to see how this can be re-used for all the dimensions in the warehouse.
Some of the more advanced ETL tools do allow re-usable elements that can be embedded within other objects. However, this would not normally accommodate the complexity of the history preservation technique. The fact that the history preservation mechanism must be defined for each dimension is a severe limitation of these tools - by defining the nature of the transformation separately within each flow the chances of errors creeping in increases dramatically, leading to more frequent errors and higher support and maintenance costs.
Overall, there are many similarities between the ODI designer and the GUIs offered by traditional ETL tools. The overall structure of the flow is controlled with a graphical interface that links each component in the process. The rules for the population of a target table are contained within an interface. The diagram tab within an interface will be familiar to many ETL developers, and represents the relationships between source tables and also defines the mappings for the target table.
In traditional ETL tools the developer not only defines the mappings but also introduces a number of steps between the source and the target. This is normally represented graphically as a 'stream', beginning with the source tables followed by a number of processing steps before loading the target. This is highly intuitive and gives the developer an enormous degree of control over the process. Effectively, the developer is defining WHAT they want to appear in the target and HOW it will get there.
In contrast, ODI uses an alternative philosophy and performs this task in a completely different manner using KMs.
ODI Knowledge Modules
An ODI KM is a code template that performs a task within an overall process. There are 6 categories of KM - in this paper we will focus on an integration KM, the "MSSQL SCD integration" KM. The key feature of KMs is that they are genuinely re-usable across a number of interfaces. This is made possible by the substitution method, which allows key parameters to be entered into the template at run time.
The best way to demonstrate this is with an example. One part of the SCD integration KM involves inserting all the rows that have been marked as 'I' for insert.
In effect, we want to generate the following SQL statement at run time.
The same basic steps are generally required to populate all dimensions within a warehouse. Therefore, we wish to take this code and somehow make it generic so that the same standard code module will perform history preservation for all the dimensions. This is exactly what ODI has achieved.
This code does not make any explicit references to a table or column - instead the KM is dependent on the metadata provided by datastores and interfaces. When the interface is executed and the API is called, this information is returned based on the metadata within the diagram tab of the interface. Therefore, this single KM can be used for every SCD within the warehouse.
The key question now becomes how does ODI actually know what column should be used for SCD_START, SCD_END etc? As well as knowing what the natural key is for comparison, it is also necessary to know what field will record the status of a record - current or historical. This is actually achieved on the datastore.
This is an essential distinction between ODI and traditional ETL tools. ODI allows the developer to define the structure of the table more thoroughly, identifying the natural key, surrogate key etc. In a traditional ETL tool the developer is normally limited to the table metadata provided by the database. Therefore, there is no opportunity to define a column as a natural or surrogate key to make fully re-usable code a possibility for SCDs.
Within every interface there is a Flow tab where the developer must select KMs to control the movement of data. A loading KM is used to define how the data will be loaded into the staging table, in this case the MSSQL bulk loading module was selected.
The loading KM controls the movement of the source data to the staging area. However, we are still missing the history preservation mechanism that will take this dataset and determine whether rows should be flagged as inserts or updates etc. This is achieved by the integration KM. An integration KM could perform a task as simple as deleting all rows from the target before inserting the new dataset. On Oracle the incoming dataset can be compared with the target via PL-SQL incremental KM or the Merge Into incremental KM.
In this example, the requirement is to carry out a more complicated function by preserving history on a type 2 SCD. This is achieved using the "MSSQL SCD integration" KM. This KM will make use of the column descriptions that we defined earlier at a datastore level. The API then feeds a number of substitutes such as the table names and column names into the statements within the KM.
To emphasize the point lets take a piece of the code from the KM and see how it compares to the code that was executed at run time. Let's use a basic one, where a unique index was created on one of intermediate staging tables.
This is generated from the following piece of code within the MSSQL SCD KM.
create unique index idx_
It's interesting to see how the table and column names are actually represented within the executed SQL. This is controlled by the parameters that are selected within the API call. The simplest example is the getTable API used within the ON clause in the statement above.
SnpRef.getTable can be used in the following manner:
public java.lang.String getTable( java.lang.String pMode, java.lang.String pProperty, java.lang.String pLocation)
There are a number of valid inputs for the property parameter, including ID for datastore identifier and INT_NAME for the full name of the integration datastore. The parameters that are used change the pattern of the string returned by the API. The following examples are taken from the documentation on the snpRef.getTable API.
Building a complex KM
The example above showed how a simple statement could be generated using substitution methods. More complex statements can easily be generated by using multiple calls to build up a more significant piece of SQL.
Tthe code that is executed at run time to historicize all old records within the dimension target table. This involves joining the target table to an integration table on the natural key. We are only interested in rows that are current, limiting to a current flag of 1 on the target table.
The KM code that is required to generate this SQL is shown below.
from as T,
and S.IND_UPDATE = 'I'
This demonstrates how powerful the API based substitution method is. Only two APIs are used to generate this SQL, with multiple calls made to each. Therefore, we can start to build a complex statement by making multiple calls and changing the parameters used. This enables us to re-create the above SQL for all dimensions despite the fact that we require metadata such as the natural key, current flag, and valid flag column names.
A complex KM is nothing more than a sequence of such statements linked together to perform a task. In the SCD KM, we need a step to flag rows for update, another to identify useless rows that haven't changed and others to historicize old rows before we finally insert the new rows. By executing each of these statements in turn, we perform all the steps necessary to control the loading of the target table and preserve history. We can actually track the execution of each statement within the KM when the interface is run.
By expanding an individual step we can see what SQL was actually generated. This allows us to trace the flow through the complete KM, demonstrating how each statement builds upon the task completed by its predecessor within the KM.
The execution of the above job also highlights another important fact about KM. This interface is controlled by 2 KMs - the loading KM and the integration KM.
The loading KM gets the data into an area where we can start the transformation process, which is controlled by the integration KM demonstrating the linkage between the two types of KM.
Creating and deploying KMs
A number of KMs are packaged along with the ODI software, providing many of the basic features that are required for data integration projects - in the previous example the pre-built "MSSQL SCD" KM would have been used However, there will be occasions where modification or re-development will be required, in which case the existing KM should provide a template which can then be easily modified.
In general, a good knowledge of SQL is more relevant than an understanding of Java. ODI is an ELT tool and KMs should make the most efficient use of the underlying database. This is achieved by maximizing the number of set based operations. Java can be embedded within KMs, but should only be used when SQL would be inadequate.
This emphasizes a key point - although ODI is Java based and makes extensive use of the API substitution method there should be no significant level of coding required for the majority of projects. So, the tool should be easily accessible by any developer with a sound knowledge of SQL.
Oracle recommends starting with the pre-packaged KMs, rather than developing KMs from scratch. If it is deemed necessary to develop custom KMs then this should not significantly hinder the project. Many projects involve a similar process that is repeated many times. In our example, we have many dimensions but they all use the same history preservation mechanism. On average, a project will use somewhere in the region of 5 KMs. This shows the impact that the KM concept could have on a project. Only 5 KMs would have to be developed and maintained, as opposed to a number of transformations for each individual table.
This paper has shown how effective KMs could be within data integration and warehousing projects by preventing the need to define the history preservation mechanism for each individual dimension table. This can lead to standardized processes and the reduction of coding errors. If any errors were found they could easily be traced to a central place. In addition, if a fundamental change to the logic was required then this could be made to the KM and would then impact every interface that uses it.
As well as a range of other ELT and integration features, the use of Knowledge Modules within ODI offers a concept that introduces genuine re-usable code to the ETL process. This is a stand out feature that has the potential to streamline and standardize the development process while reducing maintenance costs.