Click-Through Rate Prediction at large scale¶

In this example we will showcase how pai4sk-snapml can be used to train a logistic regression classifier for the task of click through rate prediction. For this purpose we use the publicly available criteo-kaggle dataset which can be downloaded from the Display Advertising Challenge.

The task is described as follows:

Display advertising is a billion dollar effort and one of the central uses of machine learning on the Internet. However, its data and methods are usually kept under lock and key. In this research competition, CriteoLabs is sharing a week’s worth of data for you to develop models predicting ad click-through rate (CTR). Given a user and the page he is visiting, what is the probability that he will click on a given ad?

Prerequisites¶

To run the example across multiple nodes in a cluster you must have

MPI configured
pai4sk conda package installed
password-less ssh enabled between hosts

Getting the data¶

Download and decompress the data:

mkdir data
cd data
wget https://s3-us-west-2.amazonaws.com/criteo-public-svm-data/criteo.kaggle2014.svm.tar.gz
tar xzf criteo.kaggle2014.svm.tar.gz
cd ../

Now one should see a file named criteo.kaggle2014.svm in the data/ directory.

Data Preprocessing¶

Before performing the training we show how to split the data into a train and a test set, how to perform data preprocessing and how to dump the data to snap format for fast data loading.

Therefore you should create a script called my-preprocessing.py containing the preprocessing functionalities you want to have. An example of different preprocessing functions is given below:

from __future__ import print_function
import os

# Load the data
print("Loading CRITEO dataset...")
from sklearn.datasets import load_svmlight_file
filename = "data/criteo.kaggle2014.train.svm"
X, y = load_svmlight_file(filename)

print('Splitting data...')
# Make the train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print('Saving data in sklearn format...')
# Save train and test datasets for fast sklearn data loading
import scipy as sp
sp.sparse.save_npz("data/criteo.kaggle2014.X_train", X_train, compressed=False)
sp.sparse.save_npz("data/criteo.kaggle2014.X_test",  X_test,  compressed=False)

# Save train and test datasets for fast sklearn data loading
import numpy as np
np.save("data/criteo.kaggle2014.y_train", y_train)
np.save("data/criteo.kaggle2014.y_test", y_test)

# Convert to numpy arrays
print('Converting to dense format...')
import numpy as np
X_train = np.array(X_train.todense())
X_test  = np.array(X_test.todense())

# Normalize the training data
print('Normalizing data...')
from sklearn.preprocessing import normalize
X_train = normalize(X_train, axis=1, norm='l1')
X_test  = normalize(X_test,  axis=1, norm='l1')

print('Saving data in distributed snap.ml format...')
from paisk.sml_io import dump_to_snap_format
dump_to_snap_format(X_train, y_train, "data/criteo.kaggle2014.train.snap")
dump_to_snap_format(X_test, y_test, "data/criteo.kaggle2014.test.snap")

To run the pre-processing you need to call:

mpirun -n 1 python my-preprocessing.py

this creates the following files:

ls -l data/
criteo.kaggle2014.test.snap
criteo.kaggle2014.train.snap
criteo.kaggle2014.train.svm
criteo.kaggle2014.X_test.npz
criteo.kaggle2014.X_train.npz
criteo.kaggle2014.y_test.npy
criteo.kaggle2014.y_train.npy

Training and Evaluation a Logistic Regression Model¶

After running the data pre-processing steps above, we are ready to train a Logistic Regression classifier using the IBM pai4sk-snapml library. Note that you need to have access to the data from all the connected machines that will be used (by replicating the data or by providing a distributed file system).

To perform the training we create a script my-training.py where we specify the parameters and function calls. For an example see below:

from __future__ import print_function
import os
import time


# Load Data from snap format
print("[Snap.ML] Data loading...")

train_filename_snap = "data/criteo.kaggle2014.train.snap"
test_filename_snap = "data/criteo.kaggle2014.test.snap"

from pai4sk.sml_io import load_from_snap_format
train_data = load_from_snap_format(train_filename_snap)
test_data = load_from_snap_format(test_filename_snap)


# specify whether to use GPUs for training or not
use_gpu = True
device_ids = []

if use_gpu:
    num_threads = 256
    cpu_gpu = "GPU"
    # specify how many and which GPUs to use per node
    device_ids = [0,1,2,3]

else:
    num_threads = 8
    cpu_gpu = "CPU"


# specify whether to balance class weights
use_balanced_class_weights = True

if use_balanced_class_weights:
    class_weight = "balanced"
else:
    class_weight = None

# Create snapML Logistic Regression classifier
from pai4sk import LogisticRegression as snapmlLR
snapml_lr = snapmlLR(dual=True, use_gpu=use_gpu, device_ids=device_ids,
                    num_threads=num_threads, class_weight=class_weight)


# Fit the model
print("[Snap.ML] Training model...")
snapml_t0 = time.time()
snapml_lr.fit(train_data)
snapml_time = time.time() - snapml_t0

# Perform inference on test data
print("[Snap.ML] Running prediction...")

pred = snapml_lr.predict(test_data)
proba = snapml_lr.predict_proba(test_data)

from pai4sk.sml_metrics import log_loss as snapLOGLOSS
from pai4sk.sml_metrics import accuracy_score as snapACC

# Compute accuracy
snapml_accuracy = snapACC(test_data, pred)
snapml_logloss = snapLOGLOSS(test_data, proba)

# Print off SnapML result

print("[Snap.ML] [%s Training] Accuracy = %.4f" % (cpu_gpu, snapml_accuracy))
print("[Snap.ML] [%s Training] LogLoss = %.4f" % (cpu_gpu, snapml_logloss))
print("[Snap.ML] [%s Training] Training time = %.4f" % (cpu_gpu, snapml_time))

to run the training script call:

snaprun -H machine1,machine2 python ${CONDA_PREFIX}/pai4sk/mpi-examples/example-criteo/example-criteo.py

Note

The user can also run multiple MPI processes per machine, but, in that case, the MPI processes on one machine will share the same hardware resources during training. Namely, if the user trains the model on GPUs, the MPI processes on one machine will share the same GPUs as defined in the device_ids parameter. Therefore, we recommend the user to run 1 MPI process per machine in order to achieve the maximum performance.

As a reference we show the expected output for running the script on 2 P9 machines using one V100 GPU on each machine. We ran the following command:

snaprun -H node1,node2 python ${CONDA_PREFIX}/pai4sk/mpi-examples/example-criteo/example-criteo.py

and this yields the following output:

[Snap.ML] [GPU Training] Accuracy = 0.7877
[Snap.ML] [GPU Training] LogLoss = 0.4558
[Snap.ML] [GPU Training] Training time = 21.9170

Please note that the reported times can vary depending on the system architecture.