
- This technique makes use of Unix representations of time and integer division.
- It can be adapted to downsample data to any interval.
Deprecated: Array and string offset access syntax with curly braces is deprecated in /home/thexsect/public_html/wp-content/plugins/easy-table/inc/Encoding.php on line 156
Deprecated: Array and string offset access syntax with curly braces is deprecated in /home/thexsect/public_html/wp-content/plugins/easy-table/inc/Encoding.php on line 158
Deprecated: Array and string offset access syntax with curly braces is deprecated in /home/thexsect/public_html/wp-content/plugins/easy-table/inc/Encoding.php on line 159
Deprecated: Array and string offset access syntax with curly braces is deprecated in /home/thexsect/public_html/wp-content/plugins/easy-table/inc/Encoding.php on line 160
I compare a lot of datasets against each other. More often than not, their time domains do not match. Inevitably, I need to resample one or more sets of data. There are a number of ways to deal with this, but I’d like to share one of my favorite techniques.
Introducing the Data
I’m currently dealing with a weather dataset that includes several measured quantities. This particular set of data was recorded at 1-minute intervals, as shown below. I also happen to know that there are no missing intervals for my given period.
ddatetime | GHI | DNI | DBT |
---|---|---|---|
6/19/2014 00:00 | -4.70365 | -2.00365 | 25.39 |
6/19/2014 00:01 | -4.69436 | -1.95881 | 25.36 |
6/19/2014 00:02 | -4.53416 | -1.81397 | 25.31 |
6/19/2014 00:03 | -4.497 | -1.81052 | 25.25 |
6/19/2014 00:04 | -4.54111 | -1.93812 | 25.21 |
6/19/2014 00:05 | -4.73613 | -1.74844 | 25.17 |
6/19/2014 00:06 | -4.64093 | -1.68981 | 25.14 |
6/19/2014 00:07 | -4.51788 | -1.66567 | 25.13 |
6/19/2014 00:08 | -4.42966 | -1.67257 | 25.09 |
6/19/2014 00:09 | -4.24393 | -1.86569 | 25.06 |
6/19/2014 00:10 | -4.27643 | -1.96914 | 25.03 |
Data recorded at this resolution can be useful, but in this particular case, I need the data to be averaged into 5-minute “bins”. This is essentially a way to resample, or, more specifically, downsample the data. In other words, I’m looking for the following, where each timestamp includes averaged values for the previous 5 timestamps:
ddatetime | GHI | DNI | DBT |
---|---|---|---|
6/19/2014 00:05 | avg(-4.70365, -4.69436, -4.53416, -4.497, -4.54111 ) | avg(-2.00365, -1.95881, -1.81397, -1.81052, -1.93812) | avg(25.39, 25.36, 25.31, 25.25, 25.21) |
6/19/2014 00:10 | avg(-4.73613, -4.64093, -4.51788, -4.42966, -4.24393 ) | avg(-1.74844, -1.68981, -1.66567, -1.67257, -1.86569) | avg(25.17, 25.14, 25.13, 25.09, 25.06, 25.03) |
Unix Timestamps
In my MySQL database, the timestamps are all stored as datetime objects. This means they adhere to the ISO 8601 standard. However, MySQL includes the capability of converting these timestamps to the Unix standard. The Unix standard is based on the number of seconds since the Unix Epoch (January 1st, 1970 UTC). For example, 6/19/2014 00:00 can be represented as 1403136000, and 6/19/2014 00:01 as 1403136001.
Integer Division
Integer division simply means that decimals are truncated after division has taken place. If you were to type 5/3 into your calculator, you would expect to see 1.66667 as the answer. In the integer division world, 5 div 3 instead equals 1 (the 0.666667 is truncated). Similarly, integer division calculates 6.5 div 0.5 to be 13, but it also calculates 6.8 div 0.5 to be 13.
Resample to Data
I’ll build up the code here piece by piece so that you can see how it words. The first step is to convert the timestamps to the Unix format. Next, integer division is used to divide by a specified number of seconds (determined by the desired number of minutes times 60). This code compares the two types of divisions.
SELECT UNIX_TIMESTAMP(ddatetime) div (5*60) as IntegerDivision,
UNIX_TIMESTAMP(ddatetime) / (5*60) as Division
FROM `weather`
Integer Division | Division |
---|---|
4677204 | 4677204.0 |
4677204 | 4677204.2 |
4677204 | 4677204.4 |
4677204 | 4677204.6 |
4677204 | 4677204.8 |
4677205 | 4677205.0 |
4677205 | 4677205.2 |
4677205 | 4677205.4 |
4677205 | 4677205.6 |
4677205 | 4677205.8 |
4677206 | 4677206.0 |
After the integer division has taken place, I multiply the modified Unix time by the same interval I just divided it by. Then, I add the interval to the result so that minutes 0-4 are labeled by the minute-5 timestamp. Lastly, I convert back to ISO 8601 time.
SELECT ddatetime,
UNIX_TIMESTAMP(ddatetime) div (5*60) as IntegerDivision,
(UNIX_TIMESTAMP(ddatetime) div (5*60))*(5*60)+(5*60) as ModifiedTimes,
FROM_UNIXTIME((UNIX_TIMESTAMP(ddatetime) div (5*60))*(5*60)+(5*60)) as FinalStamps
FROM `weather`
ddatetime | IntegerDivision | ModifiedTimes | FinalStamps |
---|---|---|---|
6/19/2014 0:00 | 4677204 | 1403161500 | 6/19/2014 0:05 |
6/19/2014 0:01 | 4677204 | 1403161500 | 6/19/2014 0:05 |
6/19/2014 0:02 | 4677204 | 1403161500 | 6/19/2014 0:05 |
6/19/2014 0:03 | 4677204 | 1403161500 | 6/19/2014 0:05 |
6/19/2014 0:04 | 4677204 | 1403161500 | 6/19/2014 0:05 |
6/19/2014 0:05 | 4677205 | 1403161800 | 6/19/2014 0:10 |
6/19/2014 0:06 | 4677205 | 1403161800 | 6/19/2014 0:10 |
6/19/2014 0:07 | 4677205 | 1403161800 | 6/19/2014 0:10 |
6/19/2014 0:08 | 4677205 | 1403161800 | 6/19/2014 0:10 |
6/19/2014 0:09 | 4677205 | 1403161800 | 6/19/2014 0:10 |
6/19/2014 0:10 | 4677206 | 1403162100 | 6/19/2014 0:15 |
The only thing left to do is average (resample!) and group the results on the ‘FinalStamps’ column!
SELECT FROM_UNIXTIME((UNIX_TIMESTAMP(ddatetime) div (5*60))*(5*60)+(5*60)) as FinalStamps, round(avg(GHI),3) as GHI, round(avg(DNI),3) as DNI, round(avg(DBT),3) as DBT
FROM `weather`
GROUP BY 1
FinalStamps | GHI | DNI | DBT |
---|---|---|---|
2014-06-19 00:05:00 | -4.594 | -1.905 | 25.304 |
2014-06-19 00:10:00 | -4.514 | -1.728 | 25.118 |
The true beauty of this method is that the interval can be set to any arbitrary number of minutes or seconds. Just modify the (5*60) term to your liking!
Good luck, and happy resampling!