fitting_a_model

# Multivariate Linear Regression

Here I will fit a simple linear regression model over time series data. The challange here is to transform data beforehand in order to better meet linearity assumptions.

Little discussion of model evaluation follows.

```# Load libraries
# creating a vector to be called
# in the lapply function
# silent = T will suppress the messages
x<-c('tidyr',
'ggplot2',
'lubridate',
'stringr',
'dplyr')
try(lapply(x, require, character.only = TRUE), silent = T)

# r options for plotting in jupyter notebook - no need to run it on Rstudio or similar
options(repr.plot.width=5, repr.plot.height=3)```

## Plot the data

```# compute useful date measures
df %>%
mutate(datetime_local = ymd_hms(datetime_local)) %>%
mutate(day = floor_date(datetime_local, unit='day')) %>%
mutate(time_bin = as.numeric(format(datetime_local, format='%H')) +
as.numeric(format(datetime_local, format='%M'))/60) -> df

# aggregate, plot
df %>%
filter(open) %>%
group_by(day) %>%
summarize(orders = sum(orders, na.rm=TRUE)) %>%
ggplot(aes(day, orders)) + geom_line() + geom_smooth(method = "lm")```
```# aggregate and plot
df %>%
filter(open) %>%
group_by(time_bin) %>%
summarize(orders = sum(orders, na.rm=TRUE)) %>%
ggplot(aes(time_bin, orders, group = 1)) + geom_line()```

## Outliers or Missing Data: who can tell

Very weird data here. Some values are distorted: what to do? I would initially take the log - but data points are even zero (log not defined).

I should try a different trasformation (… power transformation, still) that reshapes the data in a nice way, artificially gets rid of the zeros. I could try a function like the pseudologarithm, but it maps 0 → 0 so nothing special.

```# declare the transformation function
signedlog10 = function(x) {
ifelse(abs(x) <= 1, 0, sign(x)*log10(abs(x)))
}

# aggregate, transform, plot
df %>%
filter(open) %>%
group_by(day, closed) %>%
summarize(orders = sum(signedlog10(orders), na.rm=TRUE)) %>%
ggplot(aes(day, signedlog10(orders), colour=closed)) + geom_point() + ggtitle("Order transformed with Signed Log")```

I could try to apply a transformation that adds an arbitrary constant to the log base in order to avoid the scenario log(0)

```log_const = function(x, undo=FALSE){
constant <- 0.3
if(undo==FALSE){
return(log(x + constant))
} else {
return(exp(x - constant))
}
}

df %>%
filter(open) %>%
group_by(day, closed) %>%
summarize(orders = sum(log_const(orders), na.rm=TRUE)) %>%
ggplot(aes(day, log_const(orders), colour=closed)) + geom_point()  + ggtitle("Order transformed with Log + arbitrary constant")```

That's better: I got rid of the zero value squeezing everything above 1.

This is not the best way to do it tho. I could check the Box Cox procedure, in order to estimate the constant value of c in log(x+c).

## Fitting the Model

```# create a holiday dummy
df\$holiday_dummies <- as.factor(ifelse(is.na(df\$holiday_name),'0',df\$holiday_name))
# check levels
unique(df\$holiday_dummies)

# create a training dataset restricting to opening hours
df %>%
filter(open) -> training_df

# simple linear model version I
model_1 <- lm(log_const(orders) ~
day +
time_bin +
holiday_dummies
,data=training_df, na.action = na.exclude)

summary(model_1)

# Print RMSE for model 1
summary(model_1)\$sigma```

Let's try to add some other covariates - interaction terms between time bins and day. To evaluate the model we could check the R squared and RMSE. The R squared is not a very good measure since one can always artificially inflate it (R squared is how much of the variance is explained) by adding extra covariates. We can rely on the RMSE which makes a good general purpose error metric for numerical predictions. In this case, we can simply print the metric. In the second model, the error is in fact smaller (0.9463 vs. 0.9433)

```# create a holiday dummy
df\$holiday_dummies <- as.factor(ifelse(is.na(df\$holiday_name),'0',df\$holiday_name))
# check levels
unique(df\$holiday_dummies)

# create a training dataset restricting to opening hours
df %>%
filter(open) -> training_df

# simple linear model version I
model_1 <- lm(log_const(orders) ~
day +
time_bin +
holiday_dummies
,data=training_df, na.action = na.exclude)

summary(model_1)

# Print RMSE for model 1
summary(model_1)\$sigma

# RMSE for model 2
summary(model_2)\$sigma```

## Generate predictions

```# This is a dirty way to create prediction. I created a csv containing days and timebins
# for the week to forecast and I readjusted the type of variables (dummies to factor, day to POSIXct)
# in order to use the predict() command using the model_2 object. 