
“AI ART” VS “ART”
This is Module 1, Part 1 of the project for INFO 5653 - TEXT MINING, and a continuation of my exploration of the impact of AI art on art communities.
NEWS API DATA COLLECTION
INTRODUCTION
Artificial intelligence (AI) image generation refers to the use of machine learning models to create visual content based on text prompts or reference images. These models, often called generative AI (GAI), are trained on vast datasets of existing art and photography to learn patterns, styles, and composition techniques.
Popular AI image generation tools, such as Stable Diffusion, Midjourney, and DALL-E, use complex algorithms to synthesize new images that resemble human-made art. AI art, a term used to describe artwork created wholly or partially with AI tools, has surged in popularity across social media, advertising, and creative industries. For some, AI is an innovative tool that expands creative possibilities, while others feel that AI-generated art challenges fundamental notions of originality, authorship, and artistic labor.
Beyond issues of consent and intellectual property, AI-generated art has disrupted creative industries by flooding online platforms with vast quantities of easily produced content. This has led to fears of job displacement for digital artists and illustrators, as companies and individuals may choose instant AI images over human-made works.
Proponents of AI in art argue that these tools democratize creativity by making image creation more accessible to those without traditional artistic skills. Supporters view AI as a powerful medium for experimentation, allowing users to generate unique visuals that may not have been possible otherwise.
As various legal, ethical, and technological discussions struggle to keep pace with rapid advancements in AI technologies, artists, tech companies, policymakers, and online platforms are actively negotiating the role of AI in creative spaces, with regulatory and legal frameworks still in flux.
Subreddits offer a uniquely clean view into these community-level forms of regulation, and almost every art-related subreddit has explicit guidelines addressing the use of AI tools. As AI-generated content becomes more prevalent, understanding its impact on artistic communities, digital culture, and creative industries is crucial to shaping policies that balance innovation with ethical considerations. Exploring the discussion and text generated in these spaces may reveal connections about these communities that are not immediately obvious.
JSON to CSV
The initial data acquisition left me with a series of JSON files - one JSON file of articles for each label's News API query, and one JSON file of AI-related posts from each subreddit. For my own organization, I turned each of these into a CSV with the same schema, to let me standardize my text cleaning process.
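To make that concrete, here is a minimal sketch of the conversion step, assuming a simplified shared schema (label, source, title, text) and illustrative file names rather than my exact ones:

import json
import pandas as pd

# Assumed shared schema; the real CSVs may carry more columns.
COLUMNS = ["label", "source", "title", "text"]

def json_to_csv(json_path, csv_path, label, source):
    # Load one raw JSON file (a list of article or post dicts) and flatten it.
    with open(json_path, "r", encoding="utf-8") as f:
        records = json.load(f)
    rows = [{
        "label": label,
        "source": source,
        "title": rec.get("title", ""),
        "text": rec.get("text", ""),
    } for rec in records]
    pd.DataFrame(rows, columns=COLUMNS).to_csv(csv_path, index=False)

# e.g. json_to_csv("ai_art_articles.json", "ai_art_articles.csv", "AI ART", "article")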
News API allows you to search for articles by keyword. The free tier limits responses to articles from the past 30 days, and returns each article's title, description, and the first 200 characters of its content.
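For reference, a single request to the free tier looks roughly like this; the parameter choices shown here (query, language, pageSize) are illustrative rather than my exact call:

import requests

API_KEY = "YOUR_NEWS_API_KEY"  # placeholder

def fetch_page(query, page_size=100):
    # Query the /v2/everything endpoint for articles matching the keywords.
    resp = requests.get(
        "https://newsapi.org/v2/everything",
        params={"q": query, "language": "en", "pageSize": page_size, "apiKey": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    # Each article dict carries title, description, url, and a truncated
    # "content" field (about 200 characters on the free tier).
    return resp.json().get("articles", [])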
TWO MAIN FUNCTIONS
The first function searches the sections of each subreddit for posts. Each retrieved post would be passed through the second function, along with various smaller ones that parse the text. Posts with keywords in the post body or in any of the top 15 comments would be 'approved' by the system and saved to the final output.
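A rough sketch of that relevance check, with an illustrative keyword list and helper names of my own (contains_keyword, is_relevant), looks like this:

KEYWORDS = ["ai art", "generative ai", "stable diffusion", "midjourney"]  # illustrative

def contains_keyword(text):
    text = (text or "").lower()
    return any(kw in text for kw in KEYWORDS)

def is_relevant(submission, max_comments=15):
    # Approve a post if its title/body or any of the top comments mention a keyword.
    if contains_keyword(submission.title) or contains_keyword(submission.selftext):
        return True
    submission.comment_sort = "top"
    submission.comments.replace_more(limit=0)  # drop "load more comments" stubs
    return any(contains_keyword(c.body) for c in list(submission.comments)[:max_comments])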
Because of their hard rules on whether AI art was allowed in their spaces, specific art subreddits could easily be characterized as anti-AI (labelled “ART”) or pro-AI (labelled “AI ART”).
RAW OUTPUT FILES
The resulting files from the API calls varied in size depending on which parts of a post were determined to be 'relevant'. It's also worth considering that each subreddit has different norms for what constitutes a post: some subreddits' posts often had no relevant comments, while others don't allow text content in the original post and are thus populated only by comments.
PRAW PSEUDOCODE
PRAW - the Python Reddit API Wrapper - is a useful tool for scraping subreddits. For my use, I built two main functions to fetch and process Reddit posts. The search for posts was conducted at the subreddit level: posts from TOP, CONTROVERSIAL, and HOT would be selected, and then all the text checked for whether it contained a keyword (which largely matches the search queries used for News API).
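In pseudocode terms, the subreddit-level fetch looks something like the following (credentials elided, function names my own); it reuses the is_relevant keyword check sketched earlier:

import praw

reddit = praw.Reddit(
    client_id="...",        # credentials elided
    client_secret="...",
    user_agent="ai-art-text-mining",
)

def fetch_subreddit_posts(name, limit=100):
    # Pull from TOP, CONTROVERSIAL, and HOT, de-duplicate, and keep relevant posts.
    sub = reddit.subreddit(name)
    seen, kept = set(), []
    for listing in (sub.top(limit=limit), sub.controversial(limit=limit), sub.hot(limit=limit)):
        for submission in listing:
            if submission.id in seen:
                continue
            seen.add(submission.id)
            if is_relevant(submission):  # keyword check from the earlier sketch
                kept.append(submission)
    return kept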
WEB SCRAPING
The News API returns only a limited number of characters from the body text of an article. However, since it saves the URL for each returned article, I can use newspaper3k and BeautifulSoup (bs4) to create a robust web scraping function.
The function first attempts to get the article content from newspaper3k; if that fails, it scrapes the page with bs4, compares the description and article text returned by News API to the page content, and saves the matching sections. Every article was scraped for the full extent of its text content, at the very least to check for relevancy.
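A simplified version of that fallback chain, assuming a helper of my own called scrape_full_text and a crude description-matching heuristic, might look like:

import requests
from bs4 import BeautifulSoup
from newspaper import Article

def scrape_full_text(url, api_description=""):
    # First attempt: newspaper3k's article extraction.
    try:
        article = Article(url)
        article.download()
        article.parse()
        if article.text.strip():
            return article.text
    except Exception:
        pass  # fall through to the bs4 fallback

    # Fallback: pull every paragraph and prefer ones overlapping the News API description.
    try:
        resp = requests.get(url, timeout=15, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(resp.text, "html.parser")
        paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
        anchor = api_description[:60].lower()
        matched = [p for p in paragraphs if anchor and anchor in p.lower()]
        return "\n".join(matched or paragraphs)
    except Exception:
        return ""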
SUBREDDIT DATA COLLECTION
DATA CLEANING & EXPLORATION
MY QUERIES
Finding a robust enough search query to select articles that were truly relevant to my intentions was an iterative process, where I'd scan through the retrieved articles' titles and content to determine what kinds of keywords would be most helpful.
In an earlier version of the code, I had the article names printed as they were saved to the file, to better gauge what content was slipping through the cracks.
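The loop below shows the flavor of that iteration; the candidate queries are illustrative only (News API's q parameter supports quoted phrases and AND/OR/NOT operators), and it reuses the fetch_page helper sketched in the News API section above:

candidate_queries = [
    '"AI art"',
    '"AI art" AND (artist OR illustration)',
    '"generative AI" AND art NOT stock',
]

for q in candidate_queries:
    articles = fetch_page(q)  # request helper from the earlier sketch
    print(f"\n=== {q} ({len(articles)} results) ===")
    for art in articles[:10]:
        print("-", art["title"])  # skim titles to judge what is slipping through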
DISTILLING OUTPUTS
Earlier versions of these functions included extensive prints and afforded in-depth exploration of each subreddit as I worked out what I needed & what needed filtering out.
This is what allowed me to dig into the content of the various kinds of posts, and build a robust process that would catch truly 'related' posts or series of comments for which a label would actually be valuable.
Seeing as these posts needed to directly mention a keyword to be included, all scraped and saved text content from Reddit posts is 'on-topic'. However, relevance is only one aspect of data cleaning.
WORDCLOUDS
Testing the efficacy of the cleaning function is just a matter of exploring the data myself. I experimented with random prints, scrolling through the cleaned datafiles, and wordclouds to visualize and target columns and keywords that needed cleaning.
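The word cloud step itself is small; a sketch using the wordcloud and matplotlib libraries (the DataFrame and column name are placeholders) looks like:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

def show_wordcloud(df, text_column="text", title=""):
    # Join one text column into a single blob and render it for quick inspection.
    blob = " ".join(df[text_column].dropna().astype(str))
    wc = WordCloud(width=800, height=400, background_color="white").generate(blob)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()

# e.g. show_wordcloud(ai_art_posts_df, title="AI ART (post)")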
This is a snapshot of my labelled DataFrames with only cruft cleaning applied.
The initial API calls and 'relevancy' parsing did include some preliminary text cleaning: mostly basic regex to catch URLs and non-English or non-alphabetic characters, plus string methods to isolate keywords.
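The patterns below are a sketch of that kind of pass, not my exact expressions:

import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-zA-Z\s]")  # also drops non-English characters

def basic_clean(text):
    # Strip URLs, then anything non-alphabetic, then collapse whitespace.
    text = URL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()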
The subreddits themselves were labelled depending on their rules regarding AI - something I had to do by hand and note down.
Many versions of these DataFrames (and their corresponding word clouds) were made, as each pass made new unnecessary words pop out. These are all various snapshots of the cleaning process. The Reddit post datasets for AI ART and ART are both around 400 rows long, and the News API datasets around 200. The labels are AIART (post), AIART (article), ART (post), and ART (article).
NEWS API - CODE
I collected data from News API through two main functions - the first making the bulk of the request and the second applying some preliminary data management and cleaning. My code was robust enough to let me experiment with API calls, so as to find the most effective search query for articles truly relevant to “AI Art” or “Art” discussions.
The first function - fetchNewsArticles - takes in a query and a few other parameters, and returns a JSON file with each article's data saved. To be saved to the output file, an article must pass through filterAndProcessArticles, which checks text content - the article's title, description, and the sample article body content provided by News API. The function flags articles that match none of the keywords from the query, and those with too few characters.
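The sketch below mirrors that structure; the function names come from the project, but the signatures, the keyword matching, and the minimum-length threshold are assumptions on my part:

import json
import requests

MIN_CHARS = 200  # assumed cutoff for "too few characters"

def filterAndProcessArticles(articles, keywords):
    # Keep articles whose title/description/content mention a keyword and aren't too short.
    kept = []
    for art in articles:
        text = " ".join(filter(None, [art.get("title"), art.get("description"), art.get("content")]))
        if len(text) < MIN_CHARS:
            continue
        if not any(kw.lower() in text.lower() for kw in keywords):
            continue
        kept.append(art)
    return kept

def fetchNewsArticles(query, keywords, out_path, api_key="YOUR_NEWS_API_KEY"):
    # Request a page of articles for the query, filter them, and save survivors to JSON.
    resp = requests.get(
        "https://newsapi.org/v2/everything",
        params={"q": query, "language": "en", "pageSize": 100, "apiKey": api_key},
        timeout=30,
    )
    resp.raise_for_status()
    kept = filterAndProcessArticles(resp.json().get("articles", []), keywords)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(kept, f, ensure_ascii=False, indent=2)
    return kept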