Spark on time series preference data

More generally, we have a user-item preference matrix that evolves over time. In other words, we have a collection of user-item preference matrices, one for each time point. A preference matrix can describe, for example, users' preferences over a collection of books, the popularity of movies among people, or the effectiveness of a set of keywords on a collection of campaigns. The prediction task is to forecast the user-item preference matrix at the next time point.

From another angle, we have a 3D preference matrix whose axes are users, items, and time. Given this matrix, the task is to predict a future slice along the time axis.

The problem is interesting and makes a lot of sense in the real world. For example, a cinema can predict which movies will be popular the next day, week, or month, and make arrangements according to the predictions to maximize profit.

Table of contents

Link to the code

Coding

I would use Python for coding on top of the Spark environment. The Python packages worth mentioning are listed as follows.

The reasons are

Complete code and results for customer A are in my GitHub.

Complete code and results for customer B are in my GitHub.

The code can be run in Spark with the following command (in my case):

$ ../../spark-1.4.1-bin-hadoop2.6/bin/spark-submit solution.py

Question 1: Modeling

I would use collaborative filtering to impute missing values, and then random forest regression trained locally for each individual keyword.
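The imputation step can be illustrated with a minimal sketch. It is not the author's Spark code: to stay self-contained it swaps in a plain-Python matrix factorization trained by stochastic gradient descent, and the per-keyword random forest step is not shown. The function name `impute_by_factorization`, its parameters, and the example rates are all hypothetical.

```python
import random


def impute_by_factorization(matrix, rank=2, epochs=500, lr=0.05, reg=0.01, seed=0):
    """Fill None entries of a keyword-by-time matrix with a low-rank estimate.

    Assumption: a simple SGD matrix factorization stands in for the
    collaborative filtering step; observed entries are returned unchanged.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    # One small latent vector per keyword (row) and per time point (column).
    row_f = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    col_f = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_cols)]
    observed = [
        (i, j, matrix[i][j])
        for i in range(n_rows)
        for j in range(n_cols)
        if matrix[i][j] is not None
    ]

    def estimate(i, j):
        return sum(a * b for a, b in zip(row_f[i], col_f[j]))

    for _ in range(epochs):
        rng.shuffle(observed)
        for i, j, value in observed:
            err = value - estimate(i, j)
            for f in range(rank):
                a, b = row_f[i][f], col_f[j][f]
                # Gradient step on squared error with L2 regularization.
                row_f[i][f] += lr * (err * b - reg * a)
                col_f[j][f] += lr * (err * a - reg * b)

    return [
        [matrix[i][j] if matrix[i][j] is not None else estimate(i, j)
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Made-up conversion rates: rows are keywords, columns are time points.
rates = [
    [0.10, 0.12, None, 0.11],
    [0.30, None, 0.33, 0.31],
    [None, 0.05, 0.06, 0.05],
]
filled = impute_by_factorization(rates)
```

The completed matrix `filled` could then feed a separate regression model per keyword (per row), which is the "trained locally" part of the approach.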

Data exploration

Overview

Behavior of an average keyword over time

Behavior of an individual keyword over time

Missing data

For now, the practical problem is how much data I actually have or can use. The following plots show all available data points in the keyword-time matrix for each match type. The plots demonstrate that

Learning and prediction

So far, the hints from the basic analysis are

Missing value imputation

Localized regression model

Other possibilities

Question 2: Pitfall

Regression commonly assumes that the output variable is Gaussian distributed when the input features are Gaussian, which holds in most cases. However, if a keyword has few or zero conversions, the distribution of the output variable seen when learning the regression model is skewed, and this assumption no longer holds.
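To make the skew concrete, the sketch below computes the sample skewness (third standardized moment) of conversion counts. The counts are made-up illustrative values and `sample_skewness` is a hypothetical helper, not part of the author's code.

```python
def sample_skewness(values):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # all values identical: no spread, so no skew
    return m3 / m2 ** 1.5


# Made-up daily conversion counts for a keyword that rarely converts.
sparse_conversions = [0, 0, 0, 0, 1, 0, 0, 0, 3, 0]
# Made-up symmetric counts for comparison.
symmetric_counts = [4, 5, 6, 5, 4, 5, 6, 5]

skew_sparse = sample_skewness(sparse_conversions)
skew_symmetric = sample_skewness(symmetric_counts)
```

A keyword whose counts are mostly zero gives a clearly positive skewness, whereas the symmetric series gives zero.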

Question 3: Measure the performance

To compare the performance of two models (model A and model B) in terms of predicting conversion rate, one can work on historical data or on future data.

On historical data

RMSE as measure of success

The conversion rate ($r$) of a keyword is defined as the ratio between the number of conversions and the number of clicks during a fixed sample period:

$$r = \frac{\text{number of conversions}}{\text{number of clicks}}$$

The sample period can be 1 day or a few hours. Essentially, the number of clicks and the number of conversions during the period are collected and assigned to the sample point at the end of the sample period. The conversion rate for the period is computed accordingly and also assigned to that sample point. The measure of success is the root mean square error (RMSE) between the true conversion rate and the predicted conversion rate, defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{KT}\sum_{k=1}^{K}\sum_{t=1}^{T}\left(r_{k,t} - \hat{r}_{k,t}\right)^2}$$

where $r_{k,t}$ and $\hat{r}_{k,t}$ are the true and predicted conversion rates of keyword $k$ at time point $t$, assuming there are $K$ keywords and $T$ time points.
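The two quantities described above can be sketched in a few lines of plain Python. The function names are hypothetical, and returning `None` for a period with zero clicks is an assumption of this sketch, since the text does not say how such periods are handled.

```python
import math


def conversion_rate(conversions, clicks):
    """Conversions divided by clicks for one sample period.

    Assumption: a period with no clicks has an undefined rate, returned as None.
    """
    if clicks == 0:
        return None
    return conversions / clicks


def rmse(true_rates, predicted_rates):
    """RMSE over a keyword-by-time grid given as two nested lists of equal shape."""
    squared_errors = [
        (t - p) ** 2
        for true_row, pred_row in zip(true_rates, predicted_rates)
        for t, p in zip(true_row, pred_row)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up example: 2 keywords, 3 time points.
true_rates = [[0.10, 0.12, 0.11], [0.30, 0.28, 0.33]]
predicted_rates = [[0.11, 0.12, 0.10], [0.29, 0.30, 0.31]]
error = rmse(true_rates, predicted_rates)
```

A lower `error` means the predicted conversion rates are closer to the observed ones, averaged over all keywords and time points.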

The test procedure can be described as follows.

On future data

Instead of testing the model on historical data, one can perform the test online. For example, if the sample period is 1 day, one can predict tomorrow's conversion rate using all data available up to today, then repeat on the following days.

RMSE as measure of success

RMSE is again the measure of success, now computed in an online fashion: each prediction is compared with the conversion rate observed once the corresponding sample period is over.
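A minimal sketch of this day-by-day evaluation, under stated assumptions: `walk_forward_rmse` and `last_value` are hypothetical names, and the "repeat the last observed rate" predictor is only a stand-in for whichever model is being tested. At each step the predictor sees only the data available up to "today".

```python
import math


def walk_forward_rmse(series_by_keyword, predict, min_history=1):
    """Predict each next point from the history before it and return the RMSE."""
    squared_errors = []
    for history in series_by_keyword.values():
        for t in range(min_history, len(history)):
            predicted = predict(history[:t])  # only data up to "today"
            squared_errors.append((history[t] - predicted) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def last_value(history):
    """Stand-in model: tomorrow's rate equals today's rate."""
    return history[-1]


# Made-up daily conversion rates for two keywords.
daily_rates = {
    "keyword_1": [0.10, 0.12, 0.11, 0.13],
    "keyword_2": [0.30, 0.28, 0.33, 0.31],
}
online_error = walk_forward_rmse(daily_rates, last_value)
```

To compare model A and model B, one would pass each model's prediction function as `predict` and compare the two resulting errors.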

The test procedure is described as follows.

Conversion rate as measure of success (A/B test)

The model developed for predicting keyword conversion rates is eventually used to pick keywords during a campaign. Therefore, to compare model A and model B, it also makes sense to measure the conversion rate over a collection of clicks from keywords suggested by model A and over a collection of clicks from keywords suggested by model B during a certain time period.

The idea behind this: if model A outperforms model B in predicting the conversion rate of keywords, it will pick, or update to, a better collection of keywords during the campaign period and thus generate more conversions for the same number of clicks.
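The comparison can be sketched as follows. The text does not specify a statistical test, so the two-proportion z-test used here is an assumption of this sketch, chosen as one common way to judge whether a difference in conversion rate is more than noise. The function name and the click and conversion counts are hypothetical.

```python
import math


def compare_conversion_rates(conv_a, clicks_a, conv_b, clicks_b):
    """Return (rate_a, rate_b, z, two_sided_p) for a two-proportion z-test."""
    rate_a = conv_a / clicks_a
    rate_b = conv_b / clicks_b
    # Pooled rate under the null hypothesis that both models convert equally well.
    pooled = (conv_a + conv_b) / (clicks_a + clicks_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / clicks_a + 1 / clicks_b))
    if std_err == 0:
        return rate_a, rate_b, 0.0, 1.0  # no variation at all, nothing to test
    z = (rate_a - rate_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return rate_a, rate_b, z, p_value


# Made-up counts: conversions and clicks from keywords picked by each model.
rate_a, rate_b, z, p_value = compare_conversion_rates(120, 1000, 80, 1000)
```

A positive `z` with a small `p_value` would suggest that keywords picked by model A convert better than those picked by model B for a comparable number of clicks.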

Customer B data

Overview

Collaborative filtering

Learning and prediction

Side note

Hongyu Su 01 February 2016