Setup of data for an onlineforecast model

Intro

This vignette explains how to setup data consisting of observations and forecasts, such that it can be used for onlineforecast models. A generic introduction and description is in available in onlineforecasting. The code is available here. More information on onlineforecasting.org.

Data example

First load the package:

# Load the package
library(onlineforecast)

In the package different data sets are included. The Dbuilding holds the data used for the example of heat load forecasting in the building-heat-load-forecasting vignette.

When the package is loaded the data is also loaded, so we can access it directly. Let’s start out by:

# Keep it in D to simplify notation
D <- Dbuilding

The class is ‘data.ĺist’:

# The class of D
class(D)
##     [1] "data.list" "list"

Actually, a ‘data.list’ is simply a ‘list’, but we made the class ‘data.list’ in order to have functions for the particular format of data - the format is explained in this document.

It consists of vectors of time, vectors of observations (model output) and data.frames of forecasts (model input):

# Print the names to see the variables in the data
names(D)
##     [1] "t"             "heatload"      "heatloadtotal" "Taobs"        
##     [5] "Iobs"          "Ta"            "I"

An overview of the content can be generated by:

summary.default(D)
##                   Length Class      Mode   
##     t             1824   POSIXct    numeric
##     heatload      1824   -none-     numeric
##     heatloadtotal 1824   -none-     numeric
##     Taobs         1824   -none-     numeric
##     Iobs          1824   -none-     numeric
##     Ta              36   data.frame list   
##     I               36   data.frame list

where it can be seen that t is a time vector, heatload is a vector, and Ta and I are data.frames.

A function giving a summary, including checks of the format of the ‘data.list’ is:

summary(D)
##     
##     Length of time vector 't': 1824
##     
##                    NAs length   class
##     $heatload       0%     ok numeric
##     $heatloadtotal  0%     ok numeric
##     $Taobs          0%     ok numeric
##     $Iobs           0%     ok numeric
##     
##         maxHorizonNAs NAs nrow colnames sameclass   class
##     $Ta            0%  0%   ok       ok        ok numeric
##     $I             0%  0%   ok       ok        ok numeric

The ‘NA’ columns indicate the proportion of NAs. If there is a ok in a column, then the check of the variables format is passed. See the help with ?summary.data.list to learn which checks are performed.

Time

First, lets have a look at D$t, which is the vector of time points:

# The time
class(D$t)
##     [1] "POSIXct" "POSIXt"
head(D$t)
##     [1] "2010-12-15 01:00:00 UTC" "2010-12-15 02:00:00 UTC"
##     [3] "2010-12-15 03:00:00 UTC" "2010-12-15 04:00:00 UTC"
##     [5] "2010-12-15 05:00:00 UTC" "2010-12-15 06:00:00 UTC"
tail(D$t)
##     [1] "2011-02-28 19:00:00 UTC" "2011-02-28 20:00:00 UTC"
##     [3] "2011-02-28 21:00:00 UTC" "2011-02-28 22:00:00 UTC"
##     [5] "2011-02-28 23:00:00 UTC" "2011-03-01 00:00:00 UTC"

Hence, the vector is of the class POSIXct. It is not a necessity, t can also simply be a numeric, but for plotting and many operations, its very useful to use the ‘POSIXct’ class (see ?POSIXt).

Rules for the time vector:

  • It must be named t.

  • There must be no gaps or NA values in t, since only equidistant time series can be used in the models (the other variables can have NAs).

  • Its best to keep the time zone in UTC or GMT (not providing any time zone tz can give rise to problems).

Use the basic R functions for handling the time class. Most needed operations can be done with:

?as.POSIXct
?strftime

A helper function is provided with the ct function which can be called using ?, or ?ct. See example below:

# Convert from a time stamp (tz="GMT" per default)
ct("2019-01-01 11:00")
##     [1] "2019-01-01 11:00:00 GMT"
# Convert from unix time
ct(3840928387)
##     [1] "2091-09-18 04:33:07 GMT"

Note that for all functions where a time value as a character is given, the time zone is always “GMT” (or “UTC”, but this can result in warnings, but they can be ignored). For some operations the package lubridate can be very helpful.

Observations

Note the rules for observations:

  • In a data.list observations must be vectors.

  • The vectors must have the same length as the time t vector.

  • Observation as numerical vectors can be used directly as model output (if observations are to used as model inputs, they must be setup in a data.frame as explained below in Section Forecasts).

In the current data, a time series of hourly heat load observations is included:

str(D$heatload)
##      num [1:1824] 5.92 5.85 5.85 5.88 5.85 ...

It must have the same length as the time vector:

# Same length as time
length(D$t)
##     [1] 1824
length(D$heatload)
##     [1] 1824

A simple plot can be generated by:

plot(D$t, D$heatload, type="l", xlab="Time", ylab="Headload (kW)")

The convention used in all examples is that the time points are always set to the time interval end point, e.g.:

# The observation
D$heatload[2]
##     [1] 5.85
# Represents the average load between
D$t[1]
##     [1] "2010-12-15 01:00:00 UTC"
# and
D$t[2]
##     [1] "2010-12-15 02:00:00 UTC"

The main idea behind setting the time point at the end of the interval is: Working with values averaged over the time interval, such values are available at the end of the time interval, not before. Especially, in real-time applications this is a useful convention.

Forecasts

As described in onlineforecasting the setup of forecasts for model inputs always follows the same format - as presented in the following. This is also the format of the forecasts generated by functions in the package. Hence all forecasts must follow this format.

The rules are:

  • All values at row i are available at the i’th value in time t.

  • All columns must be named with k followed by an integer indicating the horizon in steps (e.g. the column named k8 hold the 8-step forecasts).

Have a look at the forecasts of the global radiation:

# Global radiation forecasts
head(D$I)
##       k1 k2   k3    k4    k5    k6    k7    k8     k9    k10    k11   k12  k13  k14
##     1  0  0  0.0   0.0   0.0   0.0   0.0  46.9 119.52 168.41 181.49 158.5 97.6 19.4
##     2  0  0  0.0   0.0   0.0   0.0  46.9 119.5 168.41 181.49 158.52  97.6 19.4  0.0
##     3  0  0  0.0   0.0   0.0  46.9 119.5 168.4 181.49 158.52  97.64  19.4  0.0  0.0
##     4  0  0  0.0   0.0  49.9 125.6 175.0 190.6 165.10  99.86   9.94   0.0  0.0  0.0
##     5  0  0  0.0  49.9 125.6 175.0 190.6 165.1  99.86   9.94   0.00   0.0  0.0  0.0
##     6  0  0 49.9 125.6 175.0 190.6 165.1  99.9   9.94   0.00   0.00   0.0  0.0  0.0
##       k15 k16 k17 k18 k19 k20 k21 k22 k23 k24 k25 k26  k27  k28  k29  k30   k31
##     1   0   0   0   0   0   0   0   0   0   0   0   0  0.0  0.0  0.0  0.0   0.0
##     2   0   0   0   0   0   0   0   0   0   0   0   0  0.0  0.0  0.0  0.0  12.2
##     3   0   0   0   0   0   0   0   0   0   0   0   0  0.0  0.0  0.0 12.2  11.8
##     4   0   0   0   0   0   0   0   0   0   0   0   0  0.0  0.0 42.4 42.8  58.1
##     5   0   0   0   0   0   0   0   0   0   0   0   0  0.0 42.4 42.8 58.1  44.8
##     6   0   0   0   0   0   0   0   0   0   0   0   0 42.4 42.8 58.1 44.8 254.1
##         k32   k33  k34  k35  k36
##     1  12.2  11.8 15.5 12.3 24.1
##     2  11.8  15.5 12.3 24.1 38.7
##     3  15.5  12.3 24.1 38.7 31.4
##     4  44.8 254.1 20.6 30.3  0.0
##     5 254.1 169.4 17.2  0.0  0.0
##     6 168.5  40.4  0.0  0.0  0.0

At the first time point:

# First time point
D$t[1]
##     [1] "2010-12-15 01:00:00 UTC"

the available forecast ahead in time is at the first row:

# The forecast available ahead in time is in the first row
D$I[1, ]
##       k1 k2 k3 k4 k5 k6 k7   k8  k9 k10 k11 k12  k13  k14 k15 k16 k17 k18 k19 k20
##     1  0  0  0  0  0  0  0 46.9 120 168 181 159 97.6 19.4   0   0   0   0   0   0
##       k21 k22 k23 k24 k25 k26 k27 k28 k29 k30 k31  k32  k33  k34  k35  k36
##     1   0   0   0   0   0   0   0   0   0   0   0 12.2 11.8 15.5 12.3 24.1

We can plot that by:

i <- 1:ncol(D$I)
plot(D$t[i], D$I[1, ], type="l", xlab="Time", ylab="Global radiation forecast (I in W/m²)")

So this is the forecast available ahead in time at 2010-12-15 01:00:00.

The column in I named k8 holds the 8-step horizon forecasts, which, since the steps are hourly, is an equi-distant time series. Picking out the entire series can be done by D$I$k8 - hence a plot (together with the observations) can be generated by:

# Just pick some points by
i <- 200:296
plot(D$t[i], D$I$k8[i], type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
# Add the observations
lines(D$t[i], D$Iobs[i])
legend("topright", c("8-step forecasts","Observations"), bg="white", lty=1, col=2:1)

Notice how the are not aligned, since the forecasts are 8 hours ahead. To align them the forecasts must be lagged 8 steps by:

plot(D$t[i], lagvec(D$I$k8[i], 8), type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
lines(D$t[i], D$Iobs[i])
legend("topright", c("8-step forecasts lagged","Observations"), bg="white", lty=1, col=2:1)

Plotting

A few simple plotting functions are included in the package.

Time series plots

The plot function provided with the package actually does this lagging with plotting forecasts:

plot_ts(D, patterns=c("^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))

The argument patterns is vector of a regular expressions (see ?regex), which is used to match the variables to include in the plot. See the help with ?plot_ts for more details.

An interactive plot can be generated using (first install the package plotly):

plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))

Note that the patterns argument is a vector of regular expressions, which determines which variables from D to plot.

Scatter plots

When modelling with the objective of forecasting, it’s always a good start to have a look at scatter plots between the model inputs and the model output. For example the heatload vs. ambient temperature 8-step forecast:

par(mfrow=c(1,2))
plot(D$Ta$k8, D$heatload)
plot(lagvec(D$Ta$k8, 8), D$heatload)

So lagging (thus aligning in time) makes less slightly less scatter.

A wrapper for the pairs function is provided for a data.list, which can generate very useful explorative plots:

pairs(D, nms=c("heatload","Taobs","Ta","t"), kseq=c(1,8,24))

Note how the sequence of included horizons are specified in the kseq argument, and note that the forecasts are lagged to be aligned in time. See ?pairs.data.list for more details.

Just as a quick side note: This is the principle used for fitting onlineforecast models, simply shift forecasts to align with the observations:

# Lag the 8-step forecasts to be aligned with the observations
x <- lagvec(D$I$k8, 8)
# Take a smaller range
x <- x[i]
# Take the observations
y <- D$Iobs[i]
# Fit a linear regression model
fit <- lm(y ~ x)
# Plot the result
plot(x, y, xlab="8-step forecasts (W/m²)", ylab="Obsservations (W/m²)", main="Global radiation")
abline(fit)

Seen over time the 8-step forecasts are:

plot(D$t[i], predict.lm(fit, newdata=data.frame(x)), type="l", ylim=c(0,max(y)), xlab="Time", ylab="Global radiation (W/m^2)", col=2)
lines(D$t[i], y)
legend("topright", c("8-step forecasts lagged","Observations"), lty=1, col=2:1)

Of course that model was very simple, see how to make a better model in [building-heat-load-forecasting] and more information on the [website].

Subset

Taking a subset of a data.list is very useful and it can easily be done in different ways using the subset function (i.e. it’s really the subset.data.list function called when:

# Take the 1 to 4 values of each variable in D
Dsub <- subset(D, 1:4)
summary(Dsub)
##     
##     Length of time vector 't': 4
##     
##                    NAs length   class
##     $heatload       0%     ok numeric
##     $heatloadtotal  0%     ok numeric
##     $Taobs          0%     ok numeric
##     $Iobs           0%     ok numeric
##     
##         maxHorizonNAs NAs nrow colnames sameclass   class
##     $Ta            0%  0%   ok       ok        ok numeric
##     $I             0%  0%   ok       ok        ok numeric

Another useful function for taking data in a time range is:

which(in_range("2010-12-20",D$t,"2010-12-21"))
##      [1] 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139
##     [20] 140 141 142 143 144

always check the help of function for more details (i.e. ?in_range)

Actually, it’s easy to take subset from a period by:

Dsub <- subset(D, c("2010-12-20","2010-12-21"))
summary(Dsub)
##     
##     Length of time vector 't': 24
##     
##                    NAs length   class
##     $heatload       0%     ok numeric
##     $heatloadtotal  0%     ok numeric
##     $Taobs          0%     ok numeric
##     $Iobs           0%     ok numeric
##     
##         maxHorizonNAs NAs nrow colnames sameclass   class
##     $Ta            0%  0%   ok       ok        ok numeric
##     $I             0%  0%   ok       ok        ok numeric
Dsub$t
##      [1] "2010-12-20 01:00:00 UTC" "2010-12-20 02:00:00 UTC"
##      [3] "2010-12-20 03:00:00 UTC" "2010-12-20 04:00:00 UTC"
##      [5] "2010-12-20 05:00:00 UTC" "2010-12-20 06:00:00 UTC"
##      [7] "2010-12-20 07:00:00 UTC" "2010-12-20 08:00:00 UTC"
##      [9] "2010-12-20 09:00:00 UTC" "2010-12-20 10:00:00 UTC"
##     [11] "2010-12-20 11:00:00 UTC" "2010-12-20 12:00:00 UTC"
##     [13] "2010-12-20 13:00:00 UTC" "2010-12-20 14:00:00 UTC"
##     [15] "2010-12-20 15:00:00 UTC" "2010-12-20 16:00:00 UTC"
##     [17] "2010-12-20 17:00:00 UTC" "2010-12-20 18:00:00 UTC"
##     [19] "2010-12-20 19:00:00 UTC" "2010-12-20 20:00:00 UTC"
##     [21] "2010-12-20 21:00:00 UTC" "2010-12-20 22:00:00 UTC"
##     [23] "2010-12-20 23:00:00 UTC" "2010-12-21 00:00:00 UTC"

Data.list to data.frame (or data.table)

It can be really useful to bring the data.list on a format of a data.frame or equivalently data.table for processing.

Bringing to data.frame can easily be done by:

Df <- as.data.frame(Dsub)
names(Df)
##      [1] "t"             "heatload"      "heatloadtotal" "Taobs"        
##      [5] "Iobs"          "Ta.k1"         "Ta.k2"         "Ta.k3"        
##      [9] "Ta.k4"         "Ta.k5"         "Ta.k6"         "Ta.k7"        
##     [13] "Ta.k8"         "Ta.k9"         "Ta.k10"        "Ta.k11"       
##     [17] "Ta.k12"        "Ta.k13"        "Ta.k14"        "Ta.k15"       
##     [21] "Ta.k16"        "Ta.k17"        "Ta.k18"        "Ta.k19"       
##     [25] "Ta.k20"        "Ta.k21"        "Ta.k22"        "Ta.k23"       
##     [29] "Ta.k24"        "Ta.k25"        "Ta.k26"        "Ta.k27"       
##     [33] "Ta.k28"        "Ta.k29"        "Ta.k30"        "Ta.k31"       
##     [37] "Ta.k32"        "Ta.k33"        "Ta.k34"        "Ta.k35"       
##     [41] "Ta.k36"        "I.k1"          "I.k2"          "I.k3"         
##     [45] "I.k4"          "I.k5"          "I.k6"          "I.k7"         
##     [49] "I.k8"          "I.k9"          "I.k10"         "I.k11"        
##     [53] "I.k12"         "I.k13"         "I.k14"         "I.k15"        
##     [57] "I.k16"         "I.k17"         "I.k18"         "I.k19"        
##     [61] "I.k20"         "I.k21"         "I.k22"         "I.k23"        
##     [65] "I.k24"         "I.k25"         "I.k26"         "I.k27"        
##     [69] "I.k28"         "I.k29"         "I.k30"         "I.k31"        
##     [73] "I.k32"         "I.k33"         "I.k34"         "I.k35"        
##     [77] "I.k36"

So the forecasts are just bind with the time and observations, and .kxx is added to the column names.

It can be converted to a data.table by:

library(data.table)
setDT(Df)
class(Df)
##     [1] "data.table" "data.frame"

After processing it is easily converted back to the data.list again by:

# Set back to data.frame
setDF(Df)
# Convert to a data.list
Dsub2 <- as.data.list(Df)
# Compare it with the original Dsub
summary(Dsub2)
##     
##     Length of time vector 't': 24
##     
##                    NAs length   class
##     $heatload       0%     ok numeric
##     $heatloadtotal  0%     ok numeric
##     $Taobs          0%     ok numeric
##     $Iobs           0%     ok numeric
##     
##         maxHorizonNAs NAs nrow colnames sameclass   class
##     $Ta            0%  0%   ok       ok        ok numeric
##     $I             0%  0%   ok       ok        ok numeric
summary(Dsub)
##     
##     Length of time vector 't': 24
##     
##                    NAs length   class
##     $heatload       0%     ok numeric
##     $heatloadtotal  0%     ok numeric
##     $Taobs          0%     ok numeric
##     $Iobs           0%     ok numeric
##     
##         maxHorizonNAs NAs nrow colnames sameclass   class
##     $Ta            0%  0%   ok       ok        ok numeric
##     $I             0%  0%   ok       ok        ok numeric