
Reddit Dataset
The Reddit dataset includes posts from various subreddits focused on relationships, with each entry containing the post URL, subreddit name, date posted, post title, and, where available, self-reported user demographics extracted from the title. This dataset was used to analyze the frequency and demographic usage of the term "situationship," providing insight into its spread and contextual usage on Reddit.
Critique of Reddit Dataset & Collection
The Reddit dataset was collected using the free Chrome browser extension BrowserFlow. We built a scraping workflow that identified the result lists on Reddit search pages and recorded the following elements as individual columns in a local CSV file: URL, Subreddit Name, Date Posted, and Post Title. The file was then imported into Google Sheets for easier use of formulas, extensions, and built-in visualization tools.

The main methodological choice was how the Reddit searches were constructed to generate the lists for scraping. Because of time constraints, we specifically targeted communities where users often share stories or ask for advice about love, sex, and relationships, so that the collected posts would be relevant to our research; otherwise, we would have had to sift through hundreds of thousands of entries to identify relevant data points. As a consequence, the way we selected our data is less than ideal for frequency analysis: a fully randomized sample would give a more accurate reflection of how the word "situationship" is used across all potential uses of language. In an attempt to counteract this selection bias, we standardized how the lists were generated across the relationship-related subreddits by filtering each one with the same constraints: Top Posts of All Time that include the word "situationship." This way, we collected the posts that carried the most social influence and engagement within their respective digital communities (subreddits), and could conduct frequency analysis based on the popularity and timestamps of a reasonably large dataset.

The unfortunate limitation of the frequency analysis is that, given the information the scraper can access from Reddit's page layout, we could only record the year each post was published rather than the exact date. This makes our frequency analysis more surface-level than we had hoped when trying to establish a timeline for the word "situationship."

For demographic analysis, we chose Reddit specifically because it is a common posting convention for users to self-identify their age and gender to give other users context for their stories (e.g., "I, M22, have a girlfriend, F20"). To exploit this, we applied a formula to the "Post Title" column (which contains the actual textual context) that extracted the first "Gender/Age" token into a new column; that column was then split into separate "Age" and "Gender" columns for more specific demographic visualizations. However, this process revealed several limitations in the dataset. Because not every user states their age and gender in the typical "M22" format, the filtered demographic subset is significantly smaller than the original dataset (714 of 5,156 data points), making it less robust than the data used for frequency analysis.
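To make the extraction step concrete, the following is a minimal Python sketch of the same idea: pull the first gender/age token (e.g., "M22" or "22F") out of each post title and split it into Age and Gender columns. The file name, column names, and exact regular expression here are illustrative assumptions; the actual extraction was done with a spreadsheet formula in Google Sheets (e.g., a REGEXEXTRACT on the Post Title column, split into Age and Gender).

```python
import csv
import re

# Matches the first self-reported gender/age token in a post title,
# in either "M22" or "22F" order (case-insensitive).
DEMOGRAPHIC = re.compile(r"\b(?:([MF])\s*(\d{1,2})|(\d{1,2})\s*([MF]))\b", re.IGNORECASE)

def extract_demographics(title):
    """Return (age, gender) from the first 'M22'/'22F'-style token, or (None, None)."""
    match = DEMOGRAPHIC.search(title)
    if not match:
        return None, None
    if match.group(1):  # letter-first form, e.g. "M22"
        return int(match.group(2)), match.group(1).upper()
    return int(match.group(3)), match.group(4).upper()  # digits-first form, e.g. "22F"

# Illustrative file and column names; the real dataset lives in Google Sheets.
with open("reddit_posts.csv", newline="", encoding="utf-8") as infile, \
        open("reddit_demographics.csv", "w", newline="", encoding="utf-8") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames + ["Age", "Gender"])
    writer.writeheader()
    for row in reader:
        age, gender = extract_demographics(row["Post Title"])
        if age is not None:  # keep only rows with a usable demographic token
            row["Age"], row["Gender"] = age, gender
            writer.writerow(row)
```

Rows without a recognizable token are simply dropped, which is why the demographic subset (714 of 5,156 posts) is so much smaller than the full dataset.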
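Similarly, because only the posting year was recoverable, the frequency analysis reduces to counting posts per year. A minimal sketch of that aggregation, again with assumed file and column names, is:

```python
import csv
from collections import Counter

# Count how many collected "situationship" posts fall in each year.
# Only year-level granularity is possible because the scraper could not
# capture exact posting dates from Reddit's page layout.
with open("reddit_posts.csv", newline="", encoding="utf-8") as infile:
    years = [row["Date Posted"] for row in csv.DictReader(infile)]

posts_per_year = Counter(years)
for year, count in sorted(posts_per_year.items()):
    print(year, count)
```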