Students for Urban Data Systems

at Carnegie Mellon University

Work Night: Valentine’s Day Edition


First Work Night of 2017!

There are a few projects going on right now.

Does anyone have others to add?

Valentine’s Day Dataset

Inspired by Amy Webb’s TED talk “How I Hacked Online Dating” and the data dives on the OkCupid blog, we decided to play with dating profiles for February.

The dataset, codebook, and sample files are on GitHub. The dataset is profiles.csv.zip, which must be unzipped before use.
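If you would rather not unzip by hand, here is a minimal Python sketch that reads the CSV straight out of the archive using only the standard library. The function name `load_profiles`, the member name `profiles.csv` (guessed from the archive name), and the UTF-8 encoding are all assumptions, so check them against what is actually in the repo.

```python
import csv
import io
import zipfile


def load_profiles(zip_path, member="profiles.csv"):
    """Return the rows of one CSV file inside a zip archive as a list of dicts.

    `member` is an assumed file name inside the archive; adjust if it differs.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # Wrap the binary stream so csv can read it as text (UTF-8 assumed).
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

Calling `load_profiles("profiles.csv.zip")` would give one dict per profile, keyed by the column headers. Every value comes back as a string, so numeric columns need converting before analysis.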

About the dataset:

The dataset and preliminary analysis are largely pulled from Albert Y. Kim and Adriana Escobedo-Land’s write-up in the Journal of Statistics Education.

The data consists of the public profiles of 59,946 OkCupid users who were living within 25 miles of San Francisco, had active profiles on June 26, 2012, were online in the previous year, and had at least one picture in their profile. Using a Python script, data was scraped from users’ public profiles on June 30, 2012; any non-publicly facing information such as messaging was not accessible.

Variables include typical user information (such as sex, sexual orientation, age, and ethnicity) and lifestyle variables (such as diet, drinking habits, smoking habits).

Furthermore, text responses to the 10 essay questions posed to all OkCupid users are included as well, such as “My Self Summary,” “The first thing people usually notice about me,” and “On a typical Friday night I am…” For a complete list of variables and more details, see the accompanying codebook.

Some questions:

  • How do the heights of male and female OkCupid users compare? What about ages?
  • What does the San Francisco online dating landscape look like? Or more specifically, what is the relationship between users’ sex and sexual orientation?
  • How accurately can we predict a user’s sex using their listed height?
  • Are there differences between the sexes in what words are used in the responses to the 10 essay questions?
  • What trends or relationships in the data can we generalize to the rest of the San Francisco population? To the wider population? For which analyses does the fact that the dataset came from OkCupid make it less generalizable? What about coming from San Francisco?
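To get started on the height questions, here is a rough Python sketch, not a definitive analysis. It assumes each profile is a dict (for example from `csv.DictReader`) with a `height` column and a `sex` column coded `m`/`f`; those names and codes are assumptions to verify against the codebook. The cutoff rule is just a simple stand-in for a prediction model and is not necessarily the method used in the article.

```python
import math
from collections import defaultdict
from statistics import mean, median


def parse_height(value):
    """Convert a raw height string to a float, or None if missing or invalid."""
    try:
        height = float(value)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(height) else height


def heights_by_sex(rows):
    """Group the usable heights by the value of the (assumed) `sex` column."""
    groups = defaultdict(list)
    for row in rows:
        height = parse_height(row.get("height"))
        if height is not None:
            groups[row.get("sex")].append(height)
    return dict(groups)


def summarize_heights(groups):
    """Count, mean, and median height for each group."""
    return {
        sex: {"n": len(values), "mean": mean(values), "median": median(values)}
        for sex, values in groups.items()
    }


def threshold_accuracy(rows, cutoff, tall_label="m", short_label="f"):
    """Predict `tall_label` when height >= cutoff, else `short_label`.

    Returns the share of usable rows predicted correctly (NaN if none).
    """
    correct = total = 0
    for row in rows:
        height = parse_height(row.get("height"))
        if height is None or row.get("sex") not in (tall_label, short_label):
            continue
        predicted = tall_label if height >= cutoff else short_label
        correct += predicted == row["sex"]
        total += 1
    return correct / total if total else float("nan")
```

Trying `threshold_accuracy` over a range of cutoffs is a quick way to see how much information height alone carries about sex, before reaching for a proper model.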

Mini Tutorial: Albert Y. Kim and Adriana Escobedo-Land’s article accompanying the dataset gives a walkthrough of summary statistics, conditional probabilities, predictions, and text analysis in R using this dataset. Good stuff in there!
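The article works in R, but if you prefer Python, the conditional probability idea can be sketched in a few lines. This is an illustration, not a port of the article's code: `conditional_distribution` is a made-up helper, and the column names you pass in (for example `sex` and `orientation`) are assumptions to check against the codebook.

```python
from collections import Counter


def conditional_distribution(rows, given, target):
    """Estimate P(target | given) from counts over a list of dict rows.

    Returns a dict keyed by (given_value, target_value).
    """
    joint = Counter((row[given], row[target]) for row in rows)
    marginal = Counter(row[given] for row in rows)
    return {
        (g, t): count / marginal[g]
        for (g, t), count in joint.items()
    }
```

For example, `conditional_distribution(rows, "sex", "orientation")` would estimate the share of each orientation within each sex, which speaks to the sex and sexual orientation question above.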

Author: Lauren Renaud

SUDS leadership team: director of data projects
