Slowly Changing Dimension
Maintains records in a slowly changing dimension table
Processing
For every input row the step ensures that the dimension table contains a record corresponding to the given business keys and dimension fields.
If there are no records for the given business keys yet, a new record is created.
If there is a current record for the given business keys, the contents of the table record are compared with the data in flow. If the contents are identical, no changes are made to the database table. If dimension fields differ, the action depends on the update configuration setting for each dimension field. The following rules are applied in order:
If any changed fields are configured as punch-through
, all versions of the entity are updated. All punch-through fields of all versions are updated to contain the current row’s values.
If no changed fields are configured as insert
, but at least one changed field is configured as update
, the current version of the record is updated to reflect the current row’s values.
If any changed fields are configured as insert
, the current version of the record is expired, and a new version is inserted, reflecting the current row’s values.
Constraints
All records must be loaded in historical order. When a record is expired, and a new version is inserted, a check is made to ensure that the new record’s effective date is after the expired record’s effective date.
Settings
Name | Type | Description |
---|---|---|
Connection |
dict |
The database connection to use. Evaluated for each input row |
Schema |
string |
The database schema of the table. Evaluated for each input row |
Table |
string |
Name of the dimension table in the database. Evaluated for each input row |
Commit Size |
long |
Number of rows after which data is committed.
Evaluated once when step initializes |
Effective Date |
datetime |
The effective date of the current record. Used when a previous version of a record expires, and a new version is inserted:
Evaluated for each input row |
First Effective Date |
datetime |
The effective date of the first version of a record. If Otherwise this value is used as the effective date of the first version of a record. Evaluated for each input row |
Non-Expired Date |
datetime |
The expiry date of the current version of a record. Evaluated for each input row |
Fields | ||
Primary Key |
string |
The name of the primary key field in the database table. Evaluated once when step initializes |
Business Keys |
The fields and values making up the business keys of a record. Evaluated for each input row |
|
Dimension fields |
The fields and values making up the slowly changing attributes of a record. Each field is configured with an update strategy:
Evaluated for each input row |
|
Insert fields |
When a new record is inserted, these values are evaluated and included in the insert. | |
Update fields |
Whenever records are updated, these values are evaluated and are included as fields in the update. | |
Date Effective Field |
string |
The name of the table field holding the effective date of a record. Evaluated once when step initializes |
Date Expired Field |
string |
The name of the table field holding the expiry date of a record. Evaluated once when step initializes |
Version Nr. Field |
string |
The name of the field holding the version number of a record. Evaluated once when step initializes |
Current Flag Field |
string |
The name of the field holding the current version flag of a record. Evaluated once when step initializes |
Cache | ||
Use Cache |
boolean |
Use a in-memory cache to hold table rows. Evaluated once when step initializes |
Cache Size |
long |
Number of records cached
Evaluated once when step initializes |
Pre-load Cache |
boolean |
If Evaluated once when step initializes |
Pre-load SQL Filter |
string |
If non-empty, this query text is appended as an AND condition to the cache pre-loading query. It should contain a predicate to append to the where clause of the query, for example: This is useful to selectively pre-warm the cache with a specific subset of records. Evaluated once when step initializes |
Pre-load SQL Parameters |
list |
If the pre-load cache filter contains Evaluated once when step initializes |
Bloom Filter | ||
Use Bloom Filter |
boolean |
If During processing, the bloom filter is then used to determine whether any records with given business keys are present in the table. The filter is most useful during initial loads, when no table records are present for most or all incoming rows. It allows skipping the corresponding lookup query. Evaluated once when step initializes |
Bloom Filter Size |
long |
The expected number of distinct business keys the bloom filter should be able to handle. This number is used to ensure the bloom filter operates with an expected false positive rate of 5%. That is 5% of the time the filter might indicate that a set of business keys might be present in the table, when in fact it is not. This number should be close to the number of distinct business keys in the fully loaded dimension table. Evaluated once when step initializes |
Results
Name | Type | Description |
---|---|---|
key |
long | Primary key for current record |