
AI ART - ART
MOD 2 - CLUSTERING
This is the first section of the Mod 2 assignment for analyzing the text data. I will first explore clustering to see whether the topics that arise align cleanly with the labels, or whether there is something more complex under the surface! Clustering requires data in numeric format, so you'll see I'm working with a DTM (document-term matrix).
DATA FORMATS
K-means clustering partitions the data into a fixed number (k) of clusters by minimizing the variance within each cluster. Clusters are collections of like items - each document is assigned to the nearest cluster centroid based on Euclidean distance. I used clustering to group similar text posts (e.g., AI posts vs. Art posts from subreddits and news articles) so I can understand common topics, trends, or styles in my dataset. I expect to see clusters that correspond to different themes or genres within my project focus. For example, I know from earlier EDA that people might be talking about 'stealing art', and articles might often reference 'copyright'. If there's enough substance to these different conversations, smaller clusters should show those breakdowns. Largely, I expect the data to divide into two clusters, based on whether they discussed AI art or Art.
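As a rough sketch of what that step looks like in Python (the real version is in the notebook linked below), the DTM comes from a TF-IDF vectorizer and the labels from scikit-learn's KMeans. The `documents` list here is just a placeholder standing in for my cleaned posts and articles:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder documents standing in for my cleaned Reddit posts and news articles
documents = [
    "ai art feels like stealing art from artists",
    "the article discusses copyright and ai generated images",
    "i love painting with oil on canvas",
    "digital art programs and drawing tablets for illustration",
    "prompting a model to generate images from text data",
    "human artists dislike ai art being posted",
]

vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
dtm = vectorizer.fit_transform(documents)    # the document-term matrix (numeric!)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(dtm)             # one cluster label per document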
I tried this twice, in R and Python, because each language offered different affordances for visualizations. Both files are named Clustering-ROF-Mod2 (.R, .IPYNB) and linked below. In Python, I experimented with silhouette values rather than visualizations, to home in on which of the K values revealed the most about the data. I also explored how many documents made up each cluster - there was a small cluster with only 10 documents, which contained my handful of outliers. When scaling down to two clusters, the outliers were so strong that one cluster held a single document and the other held everything else! You can see that the shape of the plots remains the same between programs, and the different clusters in the early version (left) and my final version (above). Below are two charts showing the other K values.
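The K comparison in the notebook looked roughly like this sketch (reusing the `dtm` from above, not the exact notebook code): a silhouette score for each candidate K, plus a count of how many documents landed in each cluster, which is how the tiny outlier cluster showed up.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 6):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(dtm)
    score = silhouette_score(dtm, km.labels_)   # higher = better-separated clusters
    sizes = np.bincount(km.labels_)             # documents per cluster
    print(f"k={k}: silhouette={score:.3f}, cluster sizes={list(sizes)}")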
This is a contingency table comparing the clusterings from K-means vs. hierarchical clustering. Lots of high numbers along the diagonal would mean the two approaches produced similar groupings. In my comparison, the largest overlap was between KMeans=1 and HClust=1, with 100 documents in common. There are also 37 documents from KMeans=1 that ended up in HClust=2, and so on - the approaches are somewhat aligned.
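The table itself can be built with a simple cross-tabulation - something like this sketch, which cuts a Ward hierarchical clustering of the same DTM into three groups and compares it to the K-means labels from above (the cluster numbers here won't necessarily match the ones in my table):

import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(dtm.toarray(), method="ward")               # hierarchical clustering of the DTM
hclust_labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters

print(pd.crosstab(labels, hclust_labels,
                  rownames=["KMeans"], colnames=["HClust"]))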
To get an even better idea of what these clusters actually meant, I made some word clouds!
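For anyone curious, the word clouds come from the top-weighted terms in each K-means centroid, fed into the wordcloud package - roughly like this sketch (reusing `kmeans` and `vectorizer` from earlier):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

terms = vectorizer.get_feature_names_out()
for i, center in enumerate(kmeans.cluster_centers_):
    # keep the 50 highest-weighted terms for this cluster
    top = center.argsort()[-50:]
    freqs = {terms[j]: center[j] for j in top if center[j] > 0}
    wc = WordCloud(background_color="white").generate_from_frequencies(freqs)
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Cluster {i}")
plt.show()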
Continue reading through to ARM in R and LDA to see the findings!
The data is organized into three groups. The first is a small group of very influential outliers discussing data, AI, and technical prompting. The other two groups align more broadly around art discussions: the third cluster carries negative connotations around art posts, with words like 'human', 'removed', 'artists', 'never', 'dislike', and so on, while the second cluster focuses on distinctly human or non-AI topics, with discussion of art programs, mediums, and devices for digital works.
This dendrogram is simplified (truncated) so that the bottommost layer represents a bin of a few documents rather than an individual one. The horizontal ordering is irrelevant - we care about the height at which merges happen. Lower merges mean more closely related data. We can see the hierarchy of how these groupings of documents relate to each other. If you cut the tree with a horizontal slice, the number of branches it crosses tells you how many clusters you get - for my analysis, I was looking at three clusters. We can see there's a major split at the level of TWO, and then the split into three groups where I found the most interesting data.
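The truncated dendrogram is easy to reproduce with scipy - this sketch collapses the tree so each leaf is a bin of documents rather than a single post (using the Ward linkage `Z` from the contingency-table sketch above):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

plt.figure(figsize=(10, 5))
dendrogram(Z, truncate_mode="lastp", p=20)   # collapse to 20 leaf bins
plt.ylabel("Merge height (Ward distance)")
plt.show()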
This silhouette analysis shows how many clusters the data would best fit - I expected two clusters, lining up with my two labels. The elbow plot shows something similar: looking for the bend where the line stops decreasing sharply, we can get a sense of how many clusters suit the data. Surprisingly, I found three clusters rather than my expected two. I explored several K values, and the silhouette indicates that k = 3 best fits the data!
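The elbow check itself comes from plotting the K-means inertia (the within-cluster variance being minimized) across K values and looking for the bend - a quick sketch of that on the same `dtm`:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 7)
inertias = [KMeans(n_clusters=k, random_state=42, n_init=10).fit(dtm).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia (within-cluster variance)")
plt.show()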