AUTHORED BY
Andrew Cross
DATE
11/06/2014
CATEGORY
WORD COUNT
605
REV
0
REFERENCE IMAGE Resampling Data with MySQL
NOTES
  1. This technique makes use of Unix representations of time and integer division.
  2. It can be adapted to downsample data to any interval.
SOCIAL REACH

Deprecated: Array and string offset access syntax with curly braces is deprecated in /home/thexsect/public_html/wp-content/plugins/easy-table/inc/Encoding.php on line 156

Deprecated: Array and string offset access syntax with curly braces is deprecated in /home/thexsect/public_html/wp-content/plugins/easy-table/inc/Encoding.php on line 158

Deprecated: Array and string offset access syntax with curly braces is deprecated in /home/thexsect/public_html/wp-content/plugins/easy-table/inc/Encoding.php on line 159

Deprecated: Array and string offset access syntax with curly braces is deprecated in /home/thexsect/public_html/wp-content/plugins/easy-table/inc/Encoding.php on line 160

I compare a lot of datasets against each other. More often than not, their time domains do not match. Inevitably, I need to resample one or more sets of data. There are a number of ways to deal with this, but I’d like to share one of my favorite techniques.

Introducing the Data

I’m currently dealing with a weather dataset that includes several measured quantities. This particular set of data was recorded at 1-minute intervals, as shown below. I also happen to know that there are no missing intervals for my given period.

ddatetime GHI DNI DBT
6/19/2014 00:00 -4.70365 -2.00365 25.39
6/19/2014 00:01 -4.69436 -1.95881 25.36
6/19/2014 00:02 -4.53416 -1.81397 25.31
6/19/2014 00:03 -4.497 -1.81052 25.25
6/19/2014 00:04 -4.54111 -1.93812 25.21
6/19/2014 00:05 -4.73613 -1.74844 25.17
6/19/2014 00:06 -4.64093 -1.68981 25.14
6/19/2014 00:07 -4.51788 -1.66567 25.13
6/19/2014 00:08 -4.42966 -1.67257 25.09
6/19/2014 00:09 -4.24393 -1.86569 25.06
6/19/2014 00:10 -4.27643 -1.96914 25.03

Data recorded at this resolution can be useful, but in this particular case, I need the data to be averaged into 5-minute “bins”. This is essentially a way to resample, or, more specifically, downsample the data. In other words, I’m looking for the following, where each timestamp includes averaged values for the previous 5 timestamps:

ddatetime GHI DNI DBT
6/19/2014 00:05 avg(-4.70365, -4.69436, -4.53416, -4.497, -4.54111 ) avg(-2.00365, -1.95881, -1.81397, -1.81052, -1.93812) avg(25.39, 25.36, 25.31, 25.25, 25.21)
6/19/2014 00:10 avg(-4.73613, -4.64093, -4.51788, -4.42966, -4.24393 ) avg(-1.74844, -1.68981, -1.66567, -1.67257, -1.86569) avg(25.17, 25.14, 25.13, 25.09, 25.06, 25.03)

Unix Timestamps

In my MySQL database, the timestamps are all stored as datetime objects. This means they adhere to the ISO 8601 standard. However, MySQL includes the capability of converting these timestamps to the Unix standard. The Unix standard is based on the number of seconds since the Unix Epoch (January 1st, 1970 UTC). For example, 6/19/2014 00:00 can be represented as 1403136000, and 6/19/2014 00:01 as 1403136001.

Integer Division

Integer division simply means that decimals are truncated after division has taken place. If you were to type 5/3 into your calculator, you would expect to see 1.66667 as the answer. In the integer division world, 5 div 3 instead equals 1 (the 0.666667 is truncated). Similarly, integer division calculates 6.5 div 0.5 to be 13, but it also calculates 6.8 div 0.5 to be 13.

Resample to Data

I’ll build up the code here piece by piece so that you can see how it words. The first step is to convert the timestamps to the Unix format. Next, integer division is used to divide by a specified number of seconds (determined by the desired number of minutes times 60). This code compares the two types of divisions.

SELECT UNIX_TIMESTAMP(ddatetime) div (5*60) as IntegerDivision,
UNIX_TIMESTAMP(ddatetime) / (5*60) as Division 
FROM `weather`
Integer Division Division
4677204 4677204.0
4677204 4677204.2
4677204 4677204.4
4677204 4677204.6
4677204 4677204.8
4677205 4677205.0
4677205 4677205.2
4677205 4677205.4
4677205 4677205.6
4677205 4677205.8
4677206 4677206.0

After the integer division has taken place, I multiply the modified Unix time by the same interval I just divided it by. Then, I add the interval to the result so that minutes 0-4 are labeled by the minute-5 timestamp. Lastly, I convert back to ISO 8601 time.

SELECT ddatetime, 
UNIX_TIMESTAMP(ddatetime) div (5*60) as IntegerDivision, 
(UNIX_TIMESTAMP(ddatetime) div (5*60))*(5*60)+(5*60) as ModifiedTimes, 
FROM_UNIXTIME((UNIX_TIMESTAMP(ddatetime) div (5*60))*(5*60)+(5*60)) as FinalStamps 
FROM `weather`
ddatetime IntegerDivision ModifiedTimes FinalStamps
6/19/2014 0:00 4677204 1403161500 6/19/2014 0:05
6/19/2014 0:01 4677204 1403161500 6/19/2014 0:05
6/19/2014 0:02 4677204 1403161500 6/19/2014 0:05
6/19/2014 0:03 4677204 1403161500 6/19/2014 0:05
6/19/2014 0:04 4677204 1403161500 6/19/2014 0:05
6/19/2014 0:05 4677205 1403161800 6/19/2014 0:10
6/19/2014 0:06 4677205 1403161800 6/19/2014 0:10
6/19/2014 0:07 4677205 1403161800 6/19/2014 0:10
6/19/2014 0:08 4677205 1403161800 6/19/2014 0:10
6/19/2014 0:09 4677205 1403161800 6/19/2014 0:10
6/19/2014 0:10 4677206 1403162100 6/19/2014 0:15

The only thing left to do is average (resample!) and group the results on the ‘FinalStamps’ column!

SELECT FROM_UNIXTIME((UNIX_TIMESTAMP(ddatetime) div (5*60))*(5*60)+(5*60)) as FinalStamps, round(avg(GHI),3) as GHI, round(avg(DNI),3) as DNI, round(avg(DBT),3) as DBT
FROM `weather`
GROUP BY 1
FinalStamps GHI DNI DBT
2014-06-19 00:05:00 -4.594 -1.905 25.304
2014-06-19 00:10:00 -4.514 -1.728 25.118

The true beauty of this method is that the interval can be set to any arbitrary number of minutes or seconds. Just modify the (5*60) term to your liking!

Good luck, and happy resampling!