I just finished my project for the machine learning class at Udacity, and one thing I'm glad to have learned, aside from all the topics discussed, is how to loop over various classifiers when exploring which one works best for a particular dataset. I was lucky enough to find this page, which shows code created by Gaël Varoquaux and Andreas Müller, two of the leading machine learning scientists today. (Update, August 15, 2017: I met them in person when I attended the last SciPy Conference in July 2017. I suckered Andreas Müller into taking a picture with me.) Before seeing that page, I was trying out classifiers one at a time. After seeing it, I was able to write a simple loop over candidate classifiers for my machine learning class project: identifying a classifier for persons of interest in the Enron scandal case.

Their code starts by creating a list of classifier names and then a second list of the classifier instances themselves, in the same order as the names. A loop then fits each classifier to the training dataset. Within this loop, scores can be calculated and decision boundaries plotted. In my case I didn't need any plotting, just the fitting. So I came up with the following code, which loops over AdaBoost, Decision Tree, K Nearest Neighbors, linear and RBF SVM, Logistic Regression, Naive Bayes, and Random Forest. I was also able to chain feature scaling and selection with these classifiers in the loop. For feature selection, I explored SelectKBest and principal component analysis (PCA) and thought it best to use them separately rather than together. (I have seen cases where they are both used, and I still have to get used to that idea. I might have to tackle this in another post.) So I have two loops using pipelines that differ only in the second step of the chain.

from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

names = ["K Nearest Neighbors", "RBF SVM", "Linear SVM", "Decision Tree",
         "Naive Bayes", "AdaBoost", "Random Forest", "Logistic Regression"]

classifiers = [KNeighborsClassifier(), SVC(kernel="rbf", random_state=42),
               SVC(kernel="linear", random_state=42),
               DecisionTreeClassifier(random_state=42),
               GaussianNB(), AdaBoostClassifier(random_state=42),
               RandomForestClassifier(random_state=42),
               LogisticRegression(random_state=42)]

### pipeline using SelectKBest:

from sklearn.feature_selection import SelectKBest
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

selectkbest = {}

for name, clf in zip(names, classifiers):
    pipe_skb = Pipeline([("scaler", MinMaxScaler()), ("skb", SelectKBest()), ("clf", clf)])
    pipe_skb.fit(features_train, labels_train)

    skb_scores = {}

    score = round(pipe_skb.score(features_test, labels_test), 3)
    skb_scores["Accuracy score"] = score

    pred = pipe_skb.predict(features_test)

    conf_mat = confusion_matrix(labels_test, pred)
    skb_scores["Confusion matrix"] = conf_mat

    precisionscore = round(precision_score(labels_test, pred), 3)
    skb_scores["Precision score"] = precisionscore

    recallscore = round(recall_score(labels_test, pred), 3)
    skb_scores["Recall score"] = recallscore

    selectkbest[name] = skb_scores
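One nice side effect of keeping everything in a Pipeline is that the fitted selector can report which features it kept, via `named_steps` and `get_support()`. Here is a minimal sketch on toy data; the feature names are made up for illustration, not the actual Enron features, and `k=4` is just an example:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Illustrative stand-ins for real feature names.
feature_names = ["salary", "bonus", "exercised_stock_options",
                 "total_payments", "from_poi_ratio", "to_poi_ratio",
                 "expenses", "other", "deferred_income", "long_term_incentive"]

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=42)

pipe_skb = Pipeline([("scaler", MinMaxScaler()),
                     ("skb", SelectKBest(k=4)),
                     ("clf", GaussianNB())])
pipe_skb.fit(X, y)

# named_steps reaches inside the fitted pipeline; get_support()
# returns a boolean mask over the input features.
mask = pipe_skb.named_steps["skb"].get_support()
selected = [f for f, keep in zip(feature_names, mask) if keep]
print(selected)
```

This makes it easy to sanity-check what the selection step is actually doing for each run.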

### pipeline using PCA:

from sklearn.decomposition import PCA

pca = {}

for name, clf in zip(names, classifiers):
    pipe = Pipeline([("scaler", MinMaxScaler()), ("pca", PCA(random_state=42)), ("clf", clf)])
    pipe.fit(features_train, labels_train)

    pca_scores = {}

    score = round(pipe.score(features_test, labels_test), 3)
    pca_scores["Accuracy score"] = score

    pred = pipe.predict(features_test)

    conf_mat = confusion_matrix(labels_test, pred)
    pca_scores["Confusion matrix"] = conf_mat

    precisionscore = round(precision_score(labels_test, pred), 3)
    pca_scores["Precision score"] = precisionscore

    recallscore = round(recall_score(labels_test, pred), 3)
    pca_scores["Recall score"] = recallscore

    pca[name] = pca_scores

The resulting dictionaries can then be converted to a pandas DataFrame so the metric scores can be viewed easily:
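The conversion itself is one call. A small sketch with made-up numbers (and with the confusion matrix dropped, since an array doesn't fit neatly in a scalar cell):

```python
import pandas as pd

# Toy stand-in for the nested dictionary the loop builds:
# {classifier name: {metric name: value}}
selectkbest = {
    "Naive Bayes": {"Accuracy score": 0.884, "Precision score": 0.4,
                    "Recall score": 0.667},
    "AdaBoost": {"Accuracy score": 0.907, "Precision score": 0.5,
                 "Recall score": 0.333},
}

# Outer keys become columns, so transpose to get one row per classifier.
df = pd.DataFrame(selectkbest).T
print(df)
```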

After getting this, I was able to home in on the best classifiers and tune their parameters. I ended up with AdaBoostClassifier, after MinMaxScaler and SelectKBest. This was totally different from the last time I tried to come up with a classifier manually, which used MinMaxScaler, PCA, and then DecisionTreeClassifier. I think the route above is more explainable. I found other people's work on the same dataset and thought that some of it was a little more complicated than what I know at the moment. Maybe someday I will understand what they did.
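The tuning step also fits naturally in the same Pipeline framework via GridSearchCV, since pipeline step names double as parameter prefixes. A minimal sketch on toy data; the parameter grid here is illustrative, not the grid I actually searched:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

pipe = Pipeline([("scaler", MinMaxScaler()),
                 ("skb", SelectKBest()),
                 ("clf", AdaBoostClassifier(random_state=42))])

# "skb__k" tunes SelectKBest, "clf__n_estimators" tunes AdaBoost:
# <step name>__<parameter name>.
param_grid = {"skb__k": [3, 5, 8],
              "clf__n_estimators": [50, 100]}

grid = GridSearchCV(pipe, param_grid, cv=3, scoring="f1")
grid.fit(X, y)
print(grid.best_params_)
```

This searches scaling, selection, and classification as one unit, so the cross-validation never leaks test data into the feature selection.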

Person of interest identification in the Enron scandal case is probably one of the most common projects on machine learning hackathon sites, though I have yet to get on those sites myself. The dataset was provided by the course instructor. It involved more data cleaning than I expected, and I got ideas from other people's work here, here and here.