Machine Learning Case Study with Spark: Make it better

In this article, we are going to improve an existing machine learning model. The path to making a model better is usually not unique; it depends on the problem you are dealing with.

For instance, if you have highly correlated variables, you might want to either delete some of them or re-project them into a lower dimension; PCA usually helps here. Another common problem is having too many 1s and too few 0s (or vice versa) in your response variable, the "imbalanced data" case, where you might want to try down-sampling.
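Both remedies take only a few lines in PySpark. The sketch below assumes a DataFrame df with a binary label column and an assembled features vector; the names, fractions and number of components are just placeholders:

from pyspark.ml.feature import PCA

# down-sampling: keep 20% of the majority class (0s) and all of the minority class (1s)
balanced_df = df.sampleBy("label", fractions={0: 0.2, 1: 1.0}, seed=42)

# PCA: re-project the (possibly correlated) features into 10 dimensions
pca = PCA(k=10, inputCol="features", outputCol="pca_features")
reduced_df = pca.fit(balanced_df).transform(balanced_df)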

As a last example, if your data has too many categories, which might add noise to your model, combining them into fewer groups is probably a good idea. In Spark, you will likely need to write a UDF to implement this re-grouping, as sketched below.
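For illustration, such a UDF could look like the sketch below, which keeps the 20 most frequent levels of a hypothetical city column and buckets everything else into "other":

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# find the 20 most frequent levels of the categorical column
top_levels = [row["city"] for row in
              df.groupBy("city").count().orderBy("count", ascending=False).take(20)]

# map everything outside the top levels into a single "other" bucket
regroup = udf(lambda value: value if value in top_levels else "other", StringType())
df = df.withColumn("city_grouped", regroup("city"))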

However, the two topics I am going to discuss next are the most generic strategies for making an existing model better: feature selection, whose power is often underestimated, and ensemble methods, which is a big topic in itself; I will only showcase a simple stacking method.

Feature Selection

Feature selection might help. Filtering out less important variables can lead to a simpler and more stable model. However, feature selection is harder to implement in Spark than in sklearn. In the latter, we can simply integrate the feature selection step into the pipeline: sklearn provides a SelectFromModel transformer, which can pick the most significant variables from, say, a logistic regression model.
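For comparison, a minimal sklearn sketch could look like this, assuming a feature matrix X and labels y are already prepared (both are placeholders):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# select features with an L1 logistic regression, then classify on the reduced set
pipe = Pipeline([
    ("select", SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"))),
    ("classify", RandomForestClassifier(n_estimators=100)),
])
pipe.fit(X, y)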

On the other hand, the feature selection toolkit in Spark's ml library is limited, but we can still design something similar ourselves. The idea is this:

  1. Fit a classifier first. If you do not have anything particular in mind, try either random forest or logistic regression. Some people use decision trees; they are worse performance-wise, but easier to implement.
  2. Extract the feature importances if you used random forest, or the coefficients if you used logistic regression.
  3. Collect the indices of the most important features in a list, for instance [1, 3, 9], which (with zero-based indexing) means keeping the 2nd, 4th and 10th columns.
  4. Use the VectorSlicer transformer from the ml library to build a new feature vector from the columns you just selected.

The following code exemplifies how to do this. First we take the random forest model we fitted previously (rf_fitted) and extract its feature importances; the ones bigger than 0.03 are kept.

import pandas as pd

# featureImportances is a Spark ML vector; toArray() keeps one entry per feature,
# so the pandas index lines up with the original column indices
importance_list = pd.Series(rf_fitted.featureImportances.toArray())
sorted_imp = importance_list.sort_values(ascending=False)
kept = list((sorted_imp[sorted_imp > 0.03]).index)

This choice of 0.03 is arbitrary; you can tune it based on the AUC metric later (a sketch of such a sweep follows the results below). Then we use a slicer to collect all the features whose importance is greater than 0.03. The rest is pretty standard.


from pyspark.ml.feature import VectorSlicer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# keep only the selected indices of the assembled feature vector
vector_slicer = VectorSlicer(inputCol="features", indices=kept, outputCol="feature_subset")
with_selected_feature = vector_slicer.transform(training_data)

rf_modified = RandomForestClassifier(numTrees=20, featuresCol="feature_subset")
test_data = vector_slicer.transform(test_data)
prediction_modified = rf_modified.fit(with_selected_feature).transform(test_data)

evaluator_modified = BinaryClassificationEvaluator(rawPredictionCol="probability", metricName="areaUnderROC")
evaluator_modified.evaluate(prediction_modified)

This random forest classifier gives 0.8145 in terms of AUC, while with the full feature set we got 0.8082. This is a remarkable improvement considering the size of the data set, and all we did was run the random forest classifier twice: once for variable selection and once for fitting.
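As mentioned earlier, the 0.03 cutoff can be tuned. A rough sketch of such a sweep, reusing the objects defined above, is shown below; the candidate thresholds are placeholders, and ideally you would evaluate on a separate validation set rather than the test set:

best_auc, best_threshold = 0.0, None
for threshold in [0.01, 0.02, 0.03, 0.05]:
    # keep the features whose importance exceeds the candidate threshold
    kept_t = [int(i) for i in sorted_imp[sorted_imp > threshold].index]
    slicer_t = VectorSlicer(inputCol="features", indices=kept_t, outputCol="feature_subset_tuned")
    rf_t = RandomForestClassifier(numTrees=20, featuresCol="feature_subset_tuned")
    predictions_t = rf_t.fit(slicer_t.transform(training_data)).transform(slicer_t.transform(test_data))
    auc = evaluator_modified.evaluate(predictions_t)
    if auc > best_auc:
        best_auc, best_threshold = auc, threshold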

Interestingly, if you look under the hood, you will find that after the feature selection process there are actually only 6 features left in the data set.
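A quick way to look under the hood is to check the list of kept indices and the sliced vector itself:

print(len(kept))                                              # how many indices survived the 0.03 cut
test_data.select("feature_subset").show(1, truncate=False)    # the sliced feature vector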

More classifiers and stacking them!

It usually helps if you can use a combination of multiple classifiers. The notion is well accepted; the real question is how to implement it.

Early ideas like bagging use bootstrap sampling: you resample from the training set and fit multiple classifiers on the different samples. By combining them you reduce the variability of the prediction, although the accuracy/point estimate itself is not improved.

Later ideas like boosting use a similar strategy and also build multiple weak classifiers. However, the subsetting process is not completely random as in bagging; it depends on misclassification, iteratively giving more weight to the mislabelled examples.

There is another approach called stacking. It still builds multiple weak classifiers, but combines them in a different way: a second-layer model takes the predictions of the weak classifiers as its input. In this article, I'll demonstrate how to do stacking in Spark.

First we need a couple of weak classifiers. Note that at this point we have already done the basic preprocessing and vector assembling. Also note that on Spark 2.1 GBTClassifier does not support predicting probabilities, which is why I commented it out; keep in mind, though, that gradient-boosted trees are generally among the best algorithms in terms of performance.

from pyspark.ml.classification import RandomForestClassifier, GBTClassifier, LogisticRegression, NaiveBayes

rf = RandomForestClassifier(numTrees=20)
#xgb = GBTClassifier(maxIter=10)
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.2)
nb = NaiveBayes(smoothing=0.5, modelType="multinomial")

methods = {"random forest": rf,
          # "boosting tree": xgb,
           "logistic regression": lr,
           "naive bayes": nb}

for method_name, method in methods.items():
    # give each classifier its own output column names to avoid collisions
    method.setPredictionCol("prediction_" + method_name)
    method.setProbabilityCol("probability_" + method_name)
    method.setRawPredictionCol("raw_prediction_" + method_name)

At the end we specified prediction column names for each classifier to avoid naming collisions. Otherwise the default name will always be "prediction", and Spark will give you a hard time if there is more than one classifier. Just a quick side note: if you are using Python 2, change methods.items() to methods.iteritems().


from pyspark.ml.evaluation import BinaryClassificationEvaluator

fitted_models = {}

for method_name, method in methods.items():
    # keep the fitted model around for the second layer
    fitted_models[method_name] = method.fit(training_data)
    test_data = fitted_models[method_name].transform(test_data)
    evaluator = BinaryClassificationEvaluator(rawPredictionCol="probability_" + method_name,
                                              metricName="areaUnderROC")
    print(method_name, evaluator.evaluate(test_data))

We've fitted a couple of classifiers on the training data and stored the trained models in fitted_models. We also printed out the AUC metric for each classifier. This matters because the better the base classifiers are, the better the combined version will be; as we will see later, stacking mediocre classifiers can actually drag the ensemble down.

With the following code we can train a second-layer model on the test data; the classifier on the second layer is a random forest.

from pyspark.ml.feature import VectorAssembler

# assemble the base classifiers' probability columns into one input vector
prediction_vars = [var for var in test_data.columns if var.startswith("probability")]
vs_second_layers = VectorAssembler(inputCols=prediction_vars, outputCol="second_layer_input")
test_data = vs_second_layers.transform(test_data)
second_layer = RandomForestClassifier(featuresCol="second_layer_input", labelCol="label", probabilityCol="second_layer_output")
model_second_layer = second_layer.fit(test_data)

Finally, we can apply the two-layer model to the hold-out data and check how well we do. In our case, the ensemble model after stacking does not perform decisively better than the random forest: the single random forest model gives 0.808 while the ensemble model after stacking gives 0.796. This is because the other two classifiers are underperforming; the logistic regression only gives 0.663 and the Naive Bayes classifier only 0.519. The key to maximizing the gain from stacking is to find base classifiers with comparable performance.
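For completeness, the final scoring step could look roughly like the sketch below, assuming a DataFrame holdout_data (a placeholder name) that went through the same preprocessing and vector assembling as the training and test sets:

scored = holdout_data
for method_name in methods:
    scored = fitted_models[method_name].transform(scored)    # add base-classifier probabilities
scored = vs_second_layers.transform(scored)                  # assemble them into one vector
scored = model_second_layer.transform(scored)                # second-layer prediction

stack_evaluator = BinaryClassificationEvaluator(rawPredictionCol="second_layer_output", metricName="areaUnderROC")
print(stack_evaluator.evaluate(scored))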