I recently studied and implemented a few different binary classification models for my ongoing machine learning class at UC San Diego. Through this blog post, I’ll be introducing and comparing LDA (Linear discriminant analysis), Logistic regression and KNN (K-Nearest neighbors) classification models. I’ll be briefly describing these models, showing the math behind them, and then using them to classify whether or not a certain passenger survived on the Titanic dataset on Kaggle.

Data description

The Titanic dataset contains multiple variables, here are the following variables I am using to classify whether a passenger survived -

• pclass - Ticket class - Categorical variable
• sex (Male / Female) - Categorical variable
• age (in years) - Numeric variable
• sibsp - Number of Siblings/Spouses Aboard - Numeric variable
• parch - Number of Parents/Children Aboard - Numeric variable
• fare - Cost of the ticket - Numeric variable

The variable to predict is -
survived - 1 for survived, 0 for did not survive.

I’ll be training the models on a dataset of 600 rows, and testing it on dataset of 114 rows.

Here’s a simple visualization of the training data -

I’ll be showing all visualizations and plots in this blog post only in 2-dimensions, the 2 dimensions being Age, and Fare. I have chosen these dimensions since they are quantitative, easy to visualize and explain survival rate to a certain extent.

Logistic regression

Logistic regression is one of the most popular models for classification, and works especially well for binary response variables. This technique essentially fits a logistic curve on the probabilities of a tuple / point (in p-dimensions, p being the number of predictor variables) belonging to a certain class. Since the logistic curve calculates the probabilities of a point belonging to a given class, its value ranges from 0 to 1.

Example of a logistic Curve

We can then simply take the points with probabilities > 0.5 on the logistic curve, and classify them as 1, and 0 for points with probability < 0.5. Although it is seldom that a point has a probability of $0.5\overline{0}$, such points will have an equal probability of belonging to either class.

The Math

The logistic function is written as -

$$p(X) = \frac{e^{\beta_0 + \beta_1x_1}}{1 + e^{\beta_0 + \beta_1x_1}}$$

where $X$ is a $p$-dimensional vector of predictors, and $p(X)$ is the probability of the point described by $X$ belonging to a class. The coefficients $\beta_0$ and $\beta_1$ represent the weights of the features for each predictor variable. We want the find the coefficients in a way such that they maximize the following function -

$$l(\beta_0, \beta_1) = \prod_{i:y_i = 1} p(x_i) \prod_{i':y'_i = 0} (1-p(x'_i))$$

which is also known as the likelihood function. In practice, we actually maximize the log-likelihood function, since it is much (computationally) easier to maximize.

The Code

    #Fitting logistic curve
logisticFit = glm(Survived ~ ., data = train, family = binomial)


Model performance

Confusion matrix for the training (left) and testing (right) sets:

Training accuracy: 80.3%, Testing accuracy: 81.5%.

Visualizing Logistic regression

Here’s the logistic curve (given by the calculated coefficients). However, this curve only represents 2 quantitative features (fare and age), since visualizing qualitative features, and n > 3 is hard (or impossible).

X axis: Age, Y axis: Fare, Z axis: Survived

LDA

LDA, or Linear discriminant analysis is actually a commonly used dimension-reductionality technique. However, its properties also make it a great linear classifier. The way it works is for a given set of observations, or a given point (with p features/dimensions), it attempts to find the best density function for a known class $Y$. The definition of the best predictor, in this case, is a variable that can divide the two classes, such that their means are as far away as possible. Important Note: The classes are assumed to have a Gaussian distribution.

The Math

Let’s assume that the quantitative variable $Y$, can take on $K$ distinct classes. Let $\pi_k$ represent the prior probability that a randomly chosen observation comes from the $k$th class of $Y$. Now let’s define the density function of $X$ - a random variable for an observation in the $k$th class. That is, if $f_k(x)$ is large, there is a high probability that a given observation from class $k$ has $X \approx x$, and vice-versa. Using Bayes’ theorem we arrive at

$$Pr(Y = K | X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

We refer to this function as the posterior probability that an observation $X = x$ belongs to the $k$th class. That is, it is the probability that the observation belongs to the $k$th class, given the predictor value for that observation.

The Code

    ldaFit = lda(Survived ~ ., data = train) #Training LDA Model


Model performance

Confusion matrix for the training (left) and testing (right) sets:

Training accuracy: 79.5%, Testing accuracy: 82.4%.

KNN

K-nearest neighbors is another widely used classification model. The intuition behind KNN is quite simple - let’s say we have K=1, for a point in n-dimensional feature space, we will classify the point to be the same as the closest point (from the training / labeled data) in eucledian space. For K=3, for the closest 3 points, the class with majority points will be classified to the new point.

The Math

Given a positive integer $K$ and a test observation $x_0$, the KNN classifier first identifies the K points in the training data that are closest to $x_0$, represented by $N_0$. It then estimates the conditional probability for class $j$ as the fraction of points in $N_0$ whose response values equal $j$ -

$$Pr(Y = j| X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j)$$

KNN then applies Bayes’ rule and classifies the test observation $x_0$ to the class with the largest probability.

The Code

Note: I recently learned that I was making a massive mistake by not normalizing / scaling the features before running KNN. All the variables need to be scaled to be of similar order, since KNN works based on eucledian distance, some features might not be properly captured, or weighted correctly, if not scaled.

I first fit a knn model for K=1, then I used the R package - “caret” to fit a knn model, which minimizes the cost function, finding ideal K.

 library('class')
#Scaling training and testing dataset
train = train(as.factor(as.integer(train[,1]) - 1), scale(train[,-1]))
test = test(as.factor(as.integer(test[,1]) - 1), scale(test[,-1]))
knn1Fit <- knn(train[, - c('Survived')], test, train$Survived, 1) ##K=1 trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3) knn_fit <- train(Survived ~., data = train, method = "knn", trControl=trctrl, preProcess = c("center", "scale"), tuneLength = 10) knn_fit$bestTune


Turns out, K=13 works best for this dataset.

Model performance

With KNN, the training set does not have an accuracy, since all of the training set is used to classify observations of the test set. Here’s the confusion matrix for the test set:

Test accuracy: 79.80%

Visualizing KNN

Here’s how KNN looks (Only visualizing two quantitative predictors and their decision boundaries for K=1 (left) and K=13 (right):

It is visible that the higher the K, more defined the decision boundaries. Selecting the right number for K for a given dataset is key - it leads to noise reduction and better predictions.

Summary

Each of these models work very differently. For the titanic dataset, LDA worked best, in terms of accuracy. However, the best model to choose will always vary based on what kind of data is being classified.

Note to self: Learn LaTeX in markdown for better mathematical writing in blog posts. Done!

References
[1] Professor David Quarfoot at the UC San Diego mathematics department.
[2] Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.