Building a date dimension
This example builds a date dimension table in a Postgres database. The following tweakflow features are used:
- using flow parameters
- using the SQL Script step to create the dimension table
- using the standard library for time calculations
- using the CSV Input step to read holiday definitions from external files
- using the Stream Lookup step to lookup holiday definitions
- using the SQL Insert step to fill the dimension table
The download contains the data flow creating the table as well as sample holiday definitions.
Download: example files
The file contents are:
├── date_dimension.dfl # data flow ├── bank_holidays.txt # holidays based off first/last weekday of month ├── Easter_based_holidays.txt # holidays based off Easter Sunday └── fixed_date_holidays.txt # holidays with fixed dates
What is a date dimension?
Date dimensions are database tables containing calendar information that is conveniently joined to analytics data. Date dimensions contain pre-calculated information, such as quarters, days of week, holiday markers, and labels.
The data flow in this example constructs a typical date dimension table, including holidays. The following is an excerpt of the generated table:
id |date |date_iso |year|month|day_of_month|quarter|day_of_year|day_of_week|week_of_year|is_weekend|is_leap_year|days_in_month|day_of_week_long|day_of_week_short|month_long|month_short|holiday |is_holiday|is_working_day| ----------|-------------------|----------|----|-----|------------|-------|-----------|-----------|------------|----------|------------|-------------|----------------|-----------------|----------|-----------|-------------|----------|--------------| 2019-04-15|2019-04-15 00:00:00|2019-04-15|2019| 4| 15| 2| 105| 1| 16|false |false | 30|Monday |Mon |April |Apr | |false |true | 2019-04-16|2019-04-16 00:00:00|2019-04-16|2019| 4| 16| 2| 106| 2| 16|false |false | 30|Tuesday |Tue |April |Apr | |false |true | 2019-04-17|2019-04-17 00:00:00|2019-04-17|2019| 4| 17| 2| 107| 3| 16|false |false | 30|Wednesday |Wed |April |Apr | |false |true | 2019-04-18|2019-04-18 00:00:00|2019-04-18|2019| 4| 18| 2| 108| 4| 16|false |false | 30|Thursday |Thu |April |Apr | |false |true | 2019-04-19|2019-04-19 00:00:00|2019-04-19|2019| 4| 19| 2| 109| 5| 16|false |false | 30|Friday |Fri |April |Apr |Good Friday |true |false | 2019-04-20|2019-04-20 00:00:00|2019-04-20|2019| 4| 20| 2| 110| 6| 16|true |false | 30|Saturday |Sat |April |Apr | |false |false | 2019-04-21|2019-04-21 00:00:00|2019-04-21|2019| 4| 21| 2| 111| 7| 16|true |false | 30|Sunday |Sun |April |Apr |Easter Sunday|true |false | 2019-04-22|2019-04-22 00:00:00|2019-04-22|2019| 4| 22| 2| 112| 1| 17|false |false | 30|Monday |Mon |April |Apr |Easter Monday|true |false | 2019-04-23|2019-04-23 00:00:00|2019-04-23|2019| 4| 23| 2| 113| 2| 17|false |false | 30|Tuesday |Tue |April |Apr | |false |true | 2019-04-24|2019-04-24 00:00:00|2019-04-24|2019| 4| 24| 2| 114| 3| 17|false |false | 30|Wednesday |Wed |April |Apr | |false |true |
The data flow
Defining the database connection
The database connection is defined in the flow properties dialog. If you want to run the flow locally, you must adjust the connection parameters. Please be advised that the flow will drop and re-create the date dimension table, so please be sure to run the flow in a safe test environment only.
The default database configuration
The data flow accepts three parameters influencing its behavior:
|datetime||The first day to generate|
|datetime||The last day to generate|
|string||The name of the SQL table to use|
Creating the table
The flow drops and re-creates the database table as a first step. The SQL statements work with Postgres 11. If you’re using a different database, you will have to adjust the SQL to conform your DB’s SQL dialect.
The flow proceeds to generate a row for every record it needs to generate. It creates a field containing a datetime value representing the day in question.
To determine the number of days needed it calls time.days_between from the standard library, passing in the flow parameters.
To generate a sequence of consecutive dates, it treats the current row nr as the offset in days to the starting date. The time.add_period function generates the corresponding date.
As a result the step generates a sequence of records with a field named
d holding consecutive days:
Generating calendar information
There is a seperate step generating English labels for months and days of week using datetime formatters. The locale is configurable, in case you’d like to change to another language, or generate additional localized labels.
generating localized labels
Generating lookup keys
In order to lookup holiday information, each date needs three lookup keys:
- For fixed date holidays, a key of the format
1225for Christmas Day) is necessary.
- For holidays based off Easter Sunday, an integer will serve as the offset in days, so Good Friday has an offset of
-2, and Easter Monday an offset of
1. The offset to Easter Sunday is calculated by first calculating Easter Sunday for the given year, and then calculating the difference in days from that date. The (New Gregorian) Easter Sunday calculation is based on the EasterMallen algorithm
- For holidays defined by their position as the first/last week day of a month, as is the case for some bank holidays in the UK, a key of the form
(First|Last)_<DayOfWeek>_<Month>is generated. So the first Monday in May would get the key
Looking up holidays
Holidays are looked up from three different files each using one of the key formats above. Fixed date holidays, Easter-based holidays and bank holidays are treated in essentially the same way.
- The CSV file containing their definitions is read into memory
- The Stream Lookup step is used to look up any potential holidays from each source
After holidays are known, additional flags can be calculated, such as whether a particular date is a working day.
Dropping lookup keys
At this point, the lookup keys have served their purpose and can be removed from the data stream. It is not strictly necessary to remove them, but it helps keeping the data stream uncluttered.
Inserting the records
The SQL Insert step is used to insert the records into the target table.
The flow finishes successfully once all records have been inserted.
Once the date dimension is in place, you can query the resulting table. Which dates are good candidates to take a ‘bridge’ day off? Find holidays that happen to fall on a Thursday or Tuesday, so you can plan the Friday after or Monday before as a day off:
SELECT date_iso, day_of_week_long, holiday FROM dim_date WHERE holiday != '' AND (day_of_week_long = 'Thursday' OR day_of_week_long = 'Tuesday') AND year BETWEEN 2019 AND 2025
date_iso |day_of_week_long|holiday | ----------|----------------|--------------| 2019-01-01|Tuesday |New Year's Day| 2019-06-20|Thursday |Corpus Christi| 2019-12-26|Thursday |Boxing Day | 2020-06-11|Thursday |Corpus Christi| 2021-06-03|Thursday |Corpus Christi| 2022-06-16|Thursday |Corpus Christi| 2023-06-08|Thursday |Corpus Christi| 2023-12-26|Tuesday |Boxing Day | 2024-05-30|Thursday |Corpus Christi| 2024-12-26|Thursday |Boxing Day | 2025-05-01|Thursday |May Day | 2025-06-19|Thursday |Corpus Christi| 2025-12-25|Thursday |Christmas Day |