AI Seasons
Note: This research was conducted during the fall 2024 semester with Muhlenberg undergraduate student researcher Andrew Franklin, and was purely a data collection project.
1. Project Overview
This project was a data collection initiative designed to gather a comprehensive dataset of digital content related to the concepts of “AI Winter” and “AI Summer,” as well as artificial intelligence in general. The primary objective was to systematically scrape, parse, and store information from a diverse range of sources, including technology news sites, academic repositories, and video platforms. This dataset serves as a foundational resource for future research and analysis on the cyclical trends, sentiment, and discourse surrounding AI development.
2. Methodology and Workflow
The data collection process was automated through a series of Python scripts, each tailored to a specific data source. The overall workflow was managed by a master script that used multiprocessing to run the individual scrapers concurrently, improving the efficiency of data gathering.
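For illustration, a master script along these lines could dispatch each scraper to its own worker process. This is a minimal sketch, not the project’s actual code; the scraper functions are placeholders standing in for the real per-source scripts.

```python
"""Minimal sketch of the master script's concurrency, assuming Python's
multiprocessing module; the scraper functions are placeholders."""
from multiprocessing import Pool

def scrape_techcrunch():        # placeholder for the real TechCrunch scraper
    print("scraping TechCrunch ...")

def scrape_arxiv():             # placeholder for the real ArXiv scraper
    print("scraping ArXiv ...")

def scrape_youtube():           # placeholder for the real YouTube scraper
    print("scraping YouTube ...")

def run(scraper):
    scraper()                   # each worker runs one scraper end to end

if __name__ == "__main__":
    scrapers = [scrape_techcrunch, scrape_arxiv, scrape_youtube]
    with Pool(processes=len(scrapers)) as pool:
        pool.map(run, scrapers) # scrapers run concurrently, one per process
```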
The typical workflow for each scraper was as follows (a minimal fetch-and-parse sketch appears after the list):
- Target Identification: The script navigates to a specific website or API endpoint. For broad sources, it uses search queries like “ai winter,” “ai summer,” or “artificial intelligence.”
- Content Fetching: The script downloads the raw content, which could be an HTML page, a PDF document, or a video transcript.
- Data Parsing & Extraction: Key pieces of information (metadata) are parsed from the raw content. Common data points include the title, author(s), publication date, URL, summary/abstract, full text, and associated tags or keywords.
- Data Storage: The extracted information is structured and then inserted into a central MySQL database.
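The sketch below illustrates the fetching and parsing steps for a generic HTML source. It assumes the requests and BeautifulSoup packages (the write-up does not name the libraries actually used), and the target URL is hypothetical.

```python
"""Illustrative fetch-and-parse pass for one article, assuming requests and
BeautifulSoup; the target URL is hypothetical."""
import requests
from bs4 import BeautifulSoup

def scrape_article(url):
    # Content fetching: download the raw HTML page.
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()

    # Data parsing & extraction: pull common metadata and the body text.
    soup = BeautifulSoup(resp.text, "html.parser")
    return {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "full_text": " ".join(
            p.get_text(" ", strip=True) for p in soup.find_all("p")
        ),
    }

record = scrape_article("https://example.com/ai-winter-article")  # hypothetical URL
# Data storage: `record` would then go to the insertion utility in Section 4.
```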
3. Data Sources and Collection Scripts
The project aggregated data from seven primary sources, each handled by a dedicated script:
| Data Source | Description |
| --- | --- |
| ArXiv | Searched for and downloaded academic papers as PDFs based on topics (“AI spring,” “AI winter”) and extracted their full text and metadata. |
| TechCrunch | Scraped articles from the “Artificial Intelligence” category, iterating through pages to collect article content and metadata. |
| The Verge | Traversed the website’s sitemaps to find and process articles, extracting text and metadata. |
| YouTube | Used the YouTube Data API to find videos based on search terms (e.g., “ai summer”) and retrieved video details and full-text transcripts. |
| Hacker News | Queried the Hacker News API for posts related to “ai winter” and scraped the content of the linked articles. |
| Gizmodo | Scraped articles from the “Artificial Intelligence” section of the site. |
| PubMed | Searched for articles related to “artificial intelligence” and extracted titles, URLs, and abstracts. |
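To make the ArXiv row concrete: arXiv exposes a public Atom API at export.arxiv.org. The sketch below assumes the third-party feedparser package and is illustrative rather than the project’s actual script.

```python
"""Sketch of an ArXiv topic search via the public export.arxiv.org Atom API;
feedparser is an assumed dependency, not necessarily what the project used."""
import urllib.parse
import feedparser

query = urllib.parse.quote('all:"AI winter"')   # topic query from the table above
url = (
    "http://export.arxiv.org/api/query"
    f"?search_query={query}&start=0&max_results=10"
)

feed = feedparser.parse(url)
for entry in feed.entries:
    # Basic metadata; the PDF link appears among entry.links for each paper.
    print(entry.title, entry.published, entry.link)
```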
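Likewise for the Hacker News row: one public search endpoint is the Algolia-hosted HN Search API, sketched here with the requests package. This is an assumption for illustration; the exact endpoint the project queried is not documented here.

```python
"""Sketch of a Hacker News keyword search via the public Algolia HN Search
API, assuming the requests package."""
import requests

resp = requests.get(
    "https://hn.algolia.com/api/v1/search",
    params={"query": "ai winter"},
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json().get("hits", []):
    # For link posts, hit["url"] points at the external article whose
    # content would then be scraped.
    print(hit.get("title"), hit.get("url"))
```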
4. Core System Components
The project was built around a few key components that managed the overall process:
- Main Execution Script: The primary script ran the scrapers for multiple sources (TechCrunch, ArXiv, YouTube) concurrently in a worker pool, as sketched in Section 2, improving overall efficiency.
- Database Insertion Utility: A standardized function connected to the MySQL database and inserted new records of collected data in a consistent format (see the sketch after this list).
- Debugging Utility: A simple utility was available for printing command-line arguments, likely used for testing purposes during development.
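A standardized insertion function along these lines might look like the sketch below, assuming the mysql-connector-python driver; the table name, columns, and credentials are all hypothetical.

```python
"""Sketch of the database insertion utility, assuming mysql-connector-python;
the schema and credentials shown are hypothetical."""
import mysql.connector

def insert_record(record):
    conn = mysql.connector.connect(
        host="localhost", user="scraper", password="***", database="ai_seasons"
    )
    try:
        cur = conn.cursor()
        # Parameterized query keeps inserts consistent and guards against
        # SQL injection from scraped text.
        cur.execute(
            "INSERT INTO articles (source, title, url, published, full_text) "
            "VALUES (%s, %s, %s, %s, %s)",
            (
                record.get("source"),
                record.get("title"),
                record.get("url"),
                record.get("published"),
                record.get("full_text"),
            ),
        )
        conn.commit()
    finally:
        conn.close()
```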