# attempt to identify spam or ham SMS messages using the na¨ıve Bayes model. You will need to download the SMS dataset: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. Python users: scikit-learn is not allowed.

In this problem, we will attempt to identify spam or ham SMS messages using the na¨ıve Bayes model. You will need to download the SMS dataset: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. Python users: scikit-learn is not allowed.

Consider an SMS message as a (case-insensitive) sequence of words (X1,··· ,XT). Ignore all other punctuation. Under the na¨ıve Bayes assumption, the probability of the words in each message factors as:

P(x1:T|y) =

T Y t=1

P(xi|y). (3)

When estimated from dataset D with pseudo-count prior of α, the model parameters are:

ˆ P(xi|y) =

CountD(xi,y) + α CountD(y) + Nα

, (4)

where: CountD(xi,y) and CountD(y) are the number of occurances of word xi in spam/ham messages y (from our sample D); and the number of words for label spam/ham words y (from our sample D) respectively; and N is the total number of dictionary words (including words not seen in D). Let us use N = 20,000 and

4

α = 0.1 in our experiments.

Note that the classes are heavily imbalanced. The number of spam messages is 747, while the number of ham messages is 4827. If a simple classiﬁer predicts that all messages are ham, it will get around 86% accuracy. In this case, accuracy is not a good measurement of the classiﬁer’s performance.

Instead of using accuracy, we can use confusion matrix to see the performance of our model. Below is the explanation of confusion matrix:

True condition Positive Negative

Predicted Condition

Positive True positive False positive Negative False negative True negative

Other important performance measurements are precision, recall, and F-score, deﬁned as:

precision =

true positive true positive + false positive

(5)

recall =

true positive true positive + false negative

(6)

F-score = 2·

precision·recall precision + recall

(7)

(a) Randomly split the messages into a training set D1 (80% of messages) and a testing set D2 (20% of messages). Calculate the testing accuracy, confusion matrix, precision, recall, and F-score of the Na¨ıve Bayes classiﬁer in determining whether a message is spam or ham. Submit your source code. Note: Let’s assume that spam is the positive class. (20 points)

(b) How does the change of α eﬀect the classiﬁer performance? Using random split above, evaluate the training and test accuracy and F-score under diﬀerent selections of α. The selection of α values are 2i where i = −5,··· ,0. Create two plots, the ﬁrst plot is for the accuracy measure and the second plot is for F-score. In each plot, x-axis represents i, and y-axis represents the performance measure (accuracy/F-score). Each plot contains two line chart, a line chart describing training accuracy/F-score measure, the other line chart is for test accuracy/F-score. Submit your source code. (20 points)

Hints: There are scripts in both Octave (nbayes.m) and Julia (nbayes.jl) for counting occurances of words from spam/ham messages. A few notes: (i) Octave can make use of Java data structures (e.g., the hashmap); this can be easier / more eﬃcient than using Octave structures. (ii) Julia has its own data structure library which includes many standard data structures e.g. Dict for hashmap. (iii) logxy = logx + logy.