SUPPORT VECTOR MACHINES

This is the last part of the mod 3 text mining project assignment, exploring machine learning models and their applications to my AI/Art text data.

Support Vector Machines

Support Vector Machines (SVMs) are another type of supervised learning model I’m using on the same cleaned and vectorized dataset as before. Like decision trees and Naïve Bayes, SVMs are used for classification, but they approach the problem in a different way. Instead of splitting the data on single words in a step-by-step branching structure, an SVM tries to fit a line (a hyperplane, in high-dimensional space) that best separates the two classes in the dataset. The goal is to find the widest possible margin between points of one class and points of the other, where the “support vectors” are the data points that sit closest to that dividing line, marking the boundaries. (Generally: wider is better.)
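To make that concrete, here’s a tiny toy sketch (made-up 2D points, not my project data) showing what scikit-learn exposes after fitting a linear SVM: the support vectors themselves and the coefficients of the separating hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Two made-up 2D clusters, one per class.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],   # class 0
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)       # the points sitting closest to the boundary
print(clf.coef_, clf.intercept_)  # the hyperplane w·x + b = 0
```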

I’ll be using SVMs together with VADER sentiment analysis to train a model that can identify the sentiment of online text data about AI art/Art. These conversations are subtle and rarely contain obvious sentiment markers, so using SVMs with different kernels lets me test whether transforming the feature space (with RBF or polynomial kernels) makes the sentiment easier to separate.
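Here’s a minimal sketch of that kernel comparison; the corpus and labels are placeholders standing in for my cleaned Reddit/NewsAPI text and its VADER-derived labels:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder corpus standing in for my real cleaned text data.
docs = ["ai art is soulless theft", "this generated piece is stunning",
        "artists deserve better than this", "love the colors in this render",
        "models trained on stolen work", "what a beautiful composition",
        "ai slop is everywhere lately", "incredible use of light here"]
labels = ["negative", "positive", "negative", "positive",
          "negative", "positive", "negative", "positive"]

X = CountVectorizer().fit_transform(docs)  # document-term matrix
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels)

# Same data, three feature-space transformations.
for kernel in ["linear", "rbf", "poly"]:
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    print(kernel, clf.score(X_te, y_te))
```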

Of these three models, the linear kernel technically has the highest accuracy, at 76%; however, at a glance we can see that it’s dramatically skewed, even with my attempt at weighting the classes. The model misses most negative samples, with a recall of 0.41 and a precision of 0.30. I suppose you can say it’s just REALLY good at identifying positive text!
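For reference, the class-weighting attempt looks roughly like this (continuing the placeholder sketch above): class_weight="balanced" scales each class’s penalty inversely to its frequency, and classification_report is where per-class precision and recall figures like those come from.

```python
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# "balanced" reweights each class's penalty inversely to its frequency,
# so the scarce negative class isn't simply ignored during training.
clf = SVC(kernel="linear", class_weight="balanced").fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```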

RBF has a super high recall for negative text (0.79), but pairs it with a super low recall for positive text (0.21); you can tell because the bottom-left quadrant of the confusion matrix is full of false alarms! The RBF SVM seems to have made a dramatic overcorrection from the linear kernel: where the first labeled everything Positive, this one labels everything Negative.

The poly SVM might as well not exist for how badly it fails: it classifies nearly everything as negative. I think the imbalance between negative and positive training examples, combined with how much nuance there is in the discussion, means that SVMs are just a really bad fit for my sentiment results.

To explore further, I went back to the labeled data and tried to see how well SVMs would do on the data that previously went through Naïve Bayes and decision trees.

The linear model, this time, had an accuracy of 71%, with 32 AI art cases mislabeled as art and 23 art cases mislabeled as AI art (AI art has a recall of 0.68). It’s the most successful SVM model I managed with my data.

The relatively symmetrical confusion matrix shows it’s identifying both communities with moderate confidence. This makes it the best option so far for flagging AI Art-related discourse without heavily mislabeling or mischaracterizing the actual classes.
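For anyone following along with the matrices: mislabel counts like those come straight out of scikit-learn’s confusion_matrix, where rows are the true class and columns the predicted class. A small sketch with made-up predictions (not my real outputs):

```python
from sklearn.metrics import confusion_matrix

# Made-up predictions standing in for the linear SVM's test output.
y_true = ["AIart", "AIart", "art", "art", "AIart", "art"]
y_pred = ["AIart", "art", "art", "AIart", "AIart", "art"]

cm = confusion_matrix(y_true, y_pred, labels=["AIart", "art"])
print(cm)
# cm[0, 1] counts AIart posts mislabeled as art;
# cm[1, 0] counts art posts mislabeled as AIart.
```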

To create a version of my dataset that could be used for sentiment classification, I started by labeling each piece of text as either positive, negative, or neutral. Since I didn’t have pre-existing sentiment labels, I used the VADER sentiment analyzer. VADER gives a compound sentiment score between -1 and 1. I set a threshold where texts with a score above 0.05 were labeled “positive,” those below -0.05 “negative,” and anything in between “neutral.” After labeling, I filtered out the neutral examples to keep the task binary, which simplifies the classification and helps the SVMs focus on clearer distinctions between positive and negative.
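The labeling step looks roughly like this; I’m sketching it with the vaderSentiment package, and the example texts are placeholders:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_label(text, pos=0.05, neg=-0.05):
    """Map VADER's compound score (between -1 and 1) to a label."""
    score = analyzer.polarity_scores(text)["compound"]
    if score > pos:
        return "positive"
    if score < neg:
        return "negative"
    return "neutral"

texts = ["AI art is amazing!", "This is theft, plain and simple.", "It exists."]
labeled = [(t, vader_label(t)) for t in texts]
# Drop the neutrals to keep the task binary before training.
binary = [(t, lab) for t, lab in labeled if lab != "neutral"]
```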

Despite messing with the thresholds and a lot of experimenting, I couldn’t get a better representation of negative text data from VADER, so I concluded my data truly was mostly positive.
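Those threshold experiments were essentially a sweep like this (reusing analyzer and texts from the sketch above), watching whether wider cutoffs pushed more of the corpus into the negative bucket:

```python
from collections import Counter

for cutoff in (0.05, 0.10, 0.25, 0.50):
    scores = [analyzer.polarity_scores(t)["compound"] for t in texts]
    counts = Counter(
        "positive" if s > cutoff else "negative" if s < -cutoff else "neutral"
        for s in scores)
    print(cutoff, dict(counts))
```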

These are the results of training with sentiment as my classes on the VADER sentiment dataset.

SVMs rely on labeled data because they’re supervised models: they need clear examples of what counts as one class (label) versus another in order to learn how to separate them. In my case, that means each piece of text must already be tagged as either “positive” or “negative” sentiment so the SVM can figure out where to draw the boundary between those two categories. (More on this soon.)

Training a model wouldn’t be possible with unlabeled data, because the model wouldn’t have any frame of reference for what it’s supposed to be predicting. Also, SVMs don’t work on raw text; they need the data to be numeric. That’s why I’ve been processing all my cleaned Reddit and NewsAPI text using CountVectorizer, which turns each document into a vector of word frequencies. The result is a document-term matrix, exactly the kind of numerical format SVMs require in order to calculate the margins.
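A minimal example of that vectorization step, with a toy corpus in place of my real documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["ai art is everywhere", "art is dead", "ai is the future of art"]
vec = CountVectorizer()
dtm = vec.fit_transform(docs)       # sparse document-term matrix

print(vec.get_feature_names_out())  # vocabulary = the matrix columns
print(dtm.toarray())                # word counts, one row per document
```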

RBF successfully captures nearly all AIart texts but mislabels a majority of art texts as AIart, while poly overfits to the AIart label. Even though I didn’t see much success, the linear SVM performs the best overall. I think the fact that they always favor AI art over Art may indicate that the “Art” label’s data is much more disparate than the AI art text data.
