fitting_a_model

Here I will fit a simple linear regression model over time series data. The challenge here is to transform the data beforehand in order to better meet the linearity assumptions.

A brief discussion of model evaluation follows.

# Load libraries: create a vector of package names to pass to lapply;
# silent = TRUE suppresses the startup messages
x <- c('tidyr', 'ggplot2', 'lubridate', 'stringr', 'dplyr')
try(lapply(x, require, character.only = TRUE), silent = TRUE)

# Load dataset
df <- read.csv('/home/vincenzograsso/Desktop/forecast-task/dataset.csv')

# R options for plotting in a Jupyter notebook - no need to run this in RStudio or similar
options(repr.plot.width = 5, repr.plot.height = 3)

# compute useful date measures
df %>%
  mutate(datetime_local = ymd_hms(datetime_local)) %>%
  mutate(day = floor_date(datetime_local, unit = 'day')) %>%
  mutate(time_bin = as.numeric(format(datetime_local, format = '%H')) +
                    as.numeric(format(datetime_local, format = '%M')) / 60) -> df

# aggregate daily orders and plot them with a linear trend
df %>%
  filter(open) %>%
  group_by(day) %>%
  summarize(orders = sum(orders, na.rm = TRUE)) %>%
  ggplot(aes(day, orders)) +
  geom_line() +
  geom_smooth(method = "lm")

# aggregate orders by intraday time bin and plot
df %>%
  filter(open) %>%
  group_by(time_bin) %>%
  summarize(orders = sum(orders, na.rm = TRUE)) %>%
  ggplot(aes(time_bin, orders, group = 1)) +
  geom_line()

The data look quite weird here. Some values are distorted: what to do? My first instinct would be to take the log, but some data points are exactly zero, where the log is not defined.

I should try a different transformation (still a power-type transformation) that reshapes the data in a nice way and artificially gets rid of the zeros. I could try a function like the pseudo-logarithm, but it maps 0 → 0, so the zeros stay put, as the sketch below shows.
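For reference, a minimal sketch of the pseudo-logarithm (the asinh-based variant; the scale parameter sigma is my assumption), showing that zero is indeed a fixed point:

# pseudo-logarithm: asinh(x / (2 * sigma)) / log(10)
# behaves like log10 for large |x| but is defined everywhere, with 0 -> 0
pseudo_log <- function(x, sigma = 1) asinh(x / (2 * sigma)) / log(10)

pseudo_log(0)      # 0 - the zeros are untouched
pseudo_log(1000)   # ~3, very close to log10(1000)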

# declare the transformation function
signedlog10 <- function(x) {
  ifelse(abs(x) <= 1, 0, sign(x) * log10(abs(x)))
}

# aggregate raw orders, then transform once in the plot
# (transforming both in summarize() and in aes() would apply the log twice)
df %>%
  filter(open) %>%
  group_by(day, closed) %>%
  summarize(orders = sum(orders, na.rm = TRUE)) %>%
  ggplot(aes(day, signedlog10(orders), colour = closed)) +
  geom_point() +
  ggtitle("Orders transformed with Signed Log")

I could try to apply a transformation that adds an **arbitrary constant** to the argument of the log, in order to avoid the log(0) scenario.

log_const <- function(x, undo = FALSE) {
  constant <- 0.3
  if (!undo) {
    return(log(x + constant))
  } else {
    # the inverse of log(x + constant) is exp(x) - constant
    return(exp(x) - constant)
  }
}

df %>%
  filter(open) %>%
  group_by(day, closed) %>%
  summarize(orders = sum(orders, na.rm = TRUE)) %>%
  ggplot(aes(day, log_const(orders), colour = closed)) +
  geom_point() +
  ggtitle("Orders transformed with Log + arbitrary constant")
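A quick sanity check (not in the original notebook) that undo = TRUE really inverts the transform:

log_const(log_const(5), undo = TRUE)   # should return 5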

That's better: adding the constant gets rid of the zero values and squeezes everything into a tidier range.

This is not the best way to do it, though. I could look at the Box-Cox procedure in order to estimate the constant c in log(x + c) from the data instead of picking it by hand.
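A minimal sketch of that idea, assuming we pick c by grid search so that log(orders + c) is as symmetric as possible (the candidate grid and the skewness criterion are my assumptions; MASS::boxcox proper estimates a power parameter rather than a shift):

# moment-based skewness of a vector
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

daily <- df %>%
  filter(open) %>%
  group_by(day) %>%
  summarize(orders = sum(orders, na.rm = TRUE))

candidates <- seq(0.05, 2, by = 0.05)   # candidate constants (assumption)
scores <- sapply(candidates, function(c) abs(skewness(log(daily$orders + c))))
candidates[which.min(scores)]           # constant giving the most symmetric log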

# create a holiday dummy
df$holiday_dummies <- as.factor(ifelse(is.na(df$holiday_name), '0', df$holiday_name))

# check levels
unique(df$holiday_dummies)

# create a training dataset restricted to opening hours
df %>% filter(open) -> training_df

# simple linear model, version I
model_1 <- lm(log_const(orders) ~ day + time_bin + holiday_dummies,
              data = training_df, na.action = na.exclude)
summary(model_1)

# print RMSE for model 1 (the residual standard error)
summary(model_1)$sigma
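Before adding covariates, a quick residual check on model_1 can flag remaining non-linearity (a standard base-R diagnostic, not part of the original notebook):

# residuals vs. fitted values: a clear pattern would suggest the
# transformed model is still missing structure
plot(fitted(model_1), resid(model_1),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)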

Let's try to add some other covariates: interaction terms between time bins and day. To evaluate the models we could check the R squared and the RMSE. The R squared (the share of variance explained) is not a very good measure here, since it can always be inflated artificially by adding extra covariates. We can rely instead on the **RMSE**, which makes a good general-purpose error metric for numerical predictions. In this case, we can simply print the metric: the error is in fact smaller for the second model (0.9433 vs. 0.9463).

# linear model, version II: add the interaction between day and time bin
model_2 <- lm(log_const(orders) ~ day + time_bin + day:time_bin + holiday_dummies,
              data = training_df, na.action = na.exclude)
summary(model_2)

# RMSE for model 1 vs. model 2
summary(model_1)$sigma
summary(model_2)$sigma
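As a side note, summary()$sigma is an in-sample error and tends to be optimistic. A sketch of a fairer comparison on a held-out split (the 80/20 split and the seed are my assumptions):

set.seed(42)   # arbitrary seed (assumption)
idx   <- sample(nrow(training_df), 0.8 * nrow(training_df))
train <- training_df[idx, ]
test  <- training_df[-idx, ]

m2 <- lm(log_const(orders) ~ day + time_bin + day:time_bin + holiday_dummies,
         data = train, na.action = na.exclude)

# note: rare holiday levels may land only in the test split,
# in which case predict() will complain and they need handling
pred <- predict(m2, newdata = test)
sqrt(mean((log_const(test$orders) - pred)^2, na.rm = TRUE))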

# This is a quick-and-dirty way to create the predictions. I created a csv containing days
# and time bins for the week to forecast, and readjusted the variable types (dummies to
# factor, day to POSIXct) in order to use predict() with the model_2 object.
forecast_df <- read.csv('~/forecast_df.csv')

# readjust the types
forecast_df$holiday_dummies <- as.factor(forecast_df$holiday_dummies)
forecast_df$day <- as.POSIXct(forecast_df$day)

# predict on the log scale, then back-transform with undo = TRUE and round
forecast_df$forecast <- round(log_const(predict(model_2, newdata = forecast_df), undo = TRUE), 0)
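A quick look at the result (the day and time_bin column names are taken from the comment above; treat them as assumptions about the csv):

head(forecast_df[, c("day", "time_bin", "forecast")])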
