Work Night: Valentine’s Day Edition

First Work Night of 2017!

There are a few projects going on right now.

Does anyone have any to add?

Valentine’s Day Dataset

Inspired by Amy Webb’s TED talk “How I Hacked Online Dating” and the data dives on the OkCupid blog, we decided to play with dating profiles for February.

The dataset, codebook, and sample files are on GitHub. The dataset must be unzipped before use.

About the dataset:

The dataset and preliminary analysis are largely pulled from Albert Y. Kim and Adriana Escobedo-Land’s write up in the Journal of Statistics Education.

The data consists of the public profiles of 59,946 OkCupid users who were living within 25 miles of San Francisco, had active profiles on June 26, 2012, were online in the previous year, and had at least one picture in their profile. Using a Python script, data was scraped from users’ public profiles on June 30, 2012; any non-publicly facing information such as messaging was not accessible.

Variables include typical user information (such as sex, sexual orientation, age, and ethnicity) and lifestyle variables (such as diet, drinking habits, smoking habits).

Furthermore, text responses to the 10 essay questions posed to all OkCupid users are included as well, such as “My Self Summary,” “The first thing people usually notice about me,” and “On a typical Friday night I am…” For a complete list of variables and more details, see the accompanying codebook.

Some questions:

  • How do the heights of male and female OkCupid users compare? What about ages?
  • What does the San Francisco online dating landscape look like? Or more specifically, what is the relationship between users’ sex and sexual orientation?
  • How accurately can we predict a user’s sex using their listed height?
  • Are there differences between the sexes in what words are used in the responses to the 10 essay questions?
  • What trends or relationships in the data can we generalize to the rest of the San Francisco population? To the wider population? For which analyses does the fact that the dataset came from OkCupid make it less generalizable? What about coming from San Francisco?
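As a starting point for the height questions above, here is a minimal Python sketch. It assumes the profiles file has a `sex` column coded `m`/`f` and a `height` column in inches; check the codebook before relying on those names. The rows below are made up so the example runs on its own, and the midpoint cutoff is a deliberately simple stand-in for a real classifier, not necessarily the method used in the article.

```python
import csv
import io
from statistics import mean

# Made-up rows standing in for the real profiles file.
# The column names ("sex", "height") and the m/f coding are assumptions.
SAMPLE_CSV = """sex,height
m,70
m,72
m,68
f,64
f,66
f,62
"""


def load_heights(handle):
    """Return {sex: [heights]} from a CSV file handle, skipping blank heights."""
    heights = {}
    for row in csv.DictReader(handle):
        if row["height"]:
            heights.setdefault(row["sex"], []).append(float(row["height"]))
    return heights


def threshold_accuracy(heights, cutoff):
    """Share of users classified correctly by the rule: height >= cutoff means 'm'."""
    correct = sum(h >= cutoff for h in heights["m"])
    correct += sum(h < cutoff for h in heights["f"])
    total = len(heights["m"]) + len(heights["f"])
    return correct / total


heights = load_heights(io.StringIO(SAMPLE_CSV))
mean_by_sex = {sex: mean(values) for sex, values in heights.items()}

# Use the midpoint between the two group means as a naive cutoff.
midpoint = (mean_by_sex["m"] + mean_by_sex["f"]) / 2
accuracy = threshold_accuracy(heights, midpoint)
print(mean_by_sex, midpoint, accuracy)
```

To try it on the real data, pass an opened file in place of `io.StringIO(SAMPLE_CSV)`. Expect accuracy well below 100% there, since real height distributions overlap.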

Mini Tutorial: Albert Y. Kim and Adriana Escobedo-Land’s article accompanying the dataset gives a walk through of how to do summary statistics, conditional probabilities, predictions, and text analysis in R using this dataset. Good stuff in there!

Work Night: Transportation Camp Edition

SUDS is joining forces with Transportation Club and Traffic21’s Transportation Camp for this Work Night.

The WPRDC has data on every vehicle crash in Allegheny County (and four other counties). This dataset includes 300+ variables on nearly every aspect of the crash. A crash is included if the police were called, so the dataset leaves out very small incidents but covers most crashes.

  1. Download the full dataset from the WPRDC, or get the version on our GitHub that we filtered down to some of the most easily understandable variables. That one is called “crashes_smaller.csv”. Right click to download it.
    On our GitHub you can also find an R file called transpoHackNight that created the smaller dataset, in case you want to work with that yourself.
  2. If you scroll down on the WPRDC’s page, you’ll also find the codebook and primer. These explain what the variables mean. The crashes_smaller.csv dataset is filtered to variables that are mostly easy to understand just by their names.
  3. As a reminder, this is open public data that people have been working on across the county. Here is a document containing questions that were posted at an early hack night looking into this dataset. If you find anything interesting, or answers to those questions, don’t be shy! Make sure to share your findings so we can all build on each other’s work.

Guiding Questions and Areas to Explore
Just to get you started…

  • What time of day, week, and year do more crashes happen?
  • Are there any interesting year to year comparisons?
  • What variables stand out if you compare crashes with minor injuries, major injuries, and fatalities?
  • Do car crashes involving bicycles fluctuate over the year? What about pedestrians?
  • If you map the data, do you see any patterns or areas of concern?
  • Find more questions and share your answers on the HackPad here.
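For the time-of-day and day-of-week question, a rough Python sketch of the counting step follows. The column names `HOUR_OF_DAY` and `DAY_OF_WEEK` are assumptions, so confirm them against the codebook, and the rows are invented so the example is self-contained.

```python
import csv
import io
from collections import Counter

# Invented rows; HOUR_OF_DAY and DAY_OF_WEEK are assumed column names.
SAMPLE_CSV = """HOUR_OF_DAY,DAY_OF_WEEK
17,6
17,2
8,6
17,6
23,7
"""


def count_by(handle, column):
    """Count crashes per distinct value of `column`, ignoring blank cells."""
    return Counter(row[column] for row in csv.DictReader(handle) if row[column])


by_hour = count_by(io.StringIO(SAMPLE_CSV), "HOUR_OF_DAY")
by_day = count_by(io.StringIO(SAMPLE_CSV), "DAY_OF_WEEK")

# most_common(1) returns a list with the single (value, count) pair that occurs most.
busiest_hour, busiest_count = by_hour.most_common(1)[0]
print(busiest_hour, busiest_count, dict(by_day))
```

The same `count_by` helper works for month or year columns. Values come back as strings because the `csv` module does not convert types.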

Other Events

Tomorrow at 3:00 the WPRDC will be hosting an open data session at Transportation Camp. Be sure to check it out if you’re attending!

Also as part of Transportation Camp on Sunday, come tour the Port Authority of Allegheny County’s West Mifflin bus garage. Join CMU’s Transportation Club for a “behind the scenes” look at the daily operations of running a transportation service, including the agency tasks that transportation planners and riders often take for granted (maintenance, cleaning, etc.). The tour will include a discussion of how planning is affected by the realities of garage locations, operator schedules, union rules, and other agency constraints.
Meet at the West Mifflin Garage, located at 1011 Lebanon Road, West Mifflin, PA 15122. For the full tour experience, take Bus 51 directly to the garage from downtown (or from Oakland by connecting with Bus 54). There’s a 51 bus that arrives at the depot at 9:49 am, and another that departs the depot for downtown at 11:24 am.
When: Sunday November 20th, 10:00 – 11:15 am
Where: West Mifflin Garage, 1011 Lebanon Road, West Mifflin, PA 15122
Sign up for the tour here. You’re welcome to attend even if you’re not registered for Transportation Camp.

Work Night: Can You Beat Nate Silver?

SUDS is joining forces with the Heinz Policy and Politics Club to bring you a night of election prediction excitement. 

Think you can predict the 2016 Presidential Election? Find our Election Prediction Form here. Make a copy of it (online or offline) and follow the instructions on the bottom right portion of the sheet. Then send in your submission by Friday, November 4th at 8pm (you can submit by email or by sharing the document inside Google Drive).

Here’s some guidance for tonight:

  1. Follow this link and navigate to “Download CSV of polls” at the bottom of the page, above the links to sponsored content. (For more polling data, and to better understand what you’re looking at, check out the links below.)
  2. Import the CSV into R, Python, Excel, whatever you like, and simplify the dataset to only the variables that are of greatest interest to you.
  3. Though you may explore different outcomes using any of the three scenarios FiveThirtyEight uses to estimate the vote (Polls-Plus, Polls-Only, and Now-Cast), be sure to include rows from only one of these labels in your analysis. The dataset will include approximately 9000 rows; thus, your analysis should include approximately one-third of this total, or 3000 rows, per scenario.
  4. Pay special attention to the ‘rawpoll’ and ‘adjpoll’ columns, which give figures presented before and after FiveThirtyEight weighting was applied. Based on the resources below and your own intuition, decide if your interpretation of the raw data would differ from the ‘adjusted’ poll numbers in the dataset.
  5. Fill out the entry form (instructions on the bottom right) with your predictions and submit!
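Steps 2 to 4 can be sketched in Python roughly as follows. The column names (`type`, `state`, `rawpoll_clinton`, `adjpoll_clinton`) and the scenario labels are assumptions about the downloaded CSV, so check its header row first. The numbers are made up so the example runs without the download.

```python
import csv
import io
from statistics import mean

# Made-up rows. Column names and scenario labels are assumptions
# about the CSV; verify them against the real header row.
SAMPLE_CSV = """type,state,rawpoll_clinton,adjpoll_clinton
polls-only,Ohio,45.0,44.5
polls-plus,Ohio,45.0,44.2
now-cast,Ohio,45.0,44.8
polls-only,Utah,28.0,27.0
polls-plus,Utah,28.0,27.2
now-cast,Utah,28.0,26.5
"""


def keep_scenario(rows, scenario):
    """Keep only the rows belonging to one forecast scenario."""
    return [row for row in rows if row["type"] == scenario]


def adjustment(row, candidate):
    """Adjusted minus raw poll share for one candidate in one row."""
    return float(row[f"adjpoll_{candidate}"]) - float(row[f"rawpoll_{candidate}"])


all_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
polls_only = keep_scenario(all_rows, "polls-only")

# Average shift that the weighting applied to Clinton's numbers.
mean_shift = mean(adjustment(row, "clinton") for row in polls_only)
print(len(all_rows), len(polls_only), mean_shift)
```

Filtering to one scenario first matters because each poll appears once per scenario, so leaving all three in would triple-count every poll.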

How to understand the polls + R polls-specific tutorials + more polling data:

  • A User’s Guide To FiveThirtyEight’s 2016 General Election Forecast
  • Get started with this fascinating piece that shows how different pollsters can arrive at varying conclusions – even using the same raw data.
  • Read through a recent post by Peter Ellis, a professional statistician based in New Zealand. While the exact proprietary methodology that FiveThirtyEight employs remains a mystery, Ellis’ overview does touch on the apparent effect of grades assigned by Nate Silver and his team of data whizzes on weighting and distribution of the data.
  • Tutorial on R package that’s an interface to the Huffington Post Pollster API
  • R tutorial for visualizing and exploring 538 national polls data, plus links to other polls sources.

Guiding Questions and Other Areas to Explore

  • Take a look at the historical electoral college maps and use them as references in your answers to the following questions.
  • Hillary Clinton has held the edge in national polling since the first presidential debate a month ago, but her lead looks to be dipping, and polls have not yet caught up to the news that the FBI discovered “pertinent” emails while investigating Anthony Weiner in an unrelated case. Taking into account the national and state-by-state polling, which maps can you produce that result in a victory for Donald Trump?
  • The last candidate to win 400 electoral votes was George H.W. Bush, who defeated Democratic challenger Michael Dukakis by a margin of 426-111 in 1988. (Dukakis won only 10 states, including his home state of Massachusetts, and the District of Columbia.) Can you plot out a path to 400 for either candidate? Which path do you think seems more likely, and why?
  • Per your reading of the polls, where does Hillary Clinton have the best chance of “turning a red state blue” (i.e., winning a state that voted for Mitt Romney in 2012)?
  • Conversely, which states that voted for Barack Obama in 2012 seem most likely to move into the Republican column and cast their share of the electoral vote for Donald Trump?
  • Looking back at the last 24 years (since Bill Clinton was first elected in 1992), are there any general regional changes you can pinpoint? What is your hypothesis for these shifts?
  • It has been 48 years since a third party candidate last won a state and carried its electoral votes. (Consult the maps for more on the pivotal 1968 election, and read more on the staunchly segregationist campaign George Wallace ran here.) This year, Libertarian Gary Johnson appears primed to capture at least five percent of the popular vote nationally, but it is Evan McMullin, a conservative, #NeverTrump flagbearer running as an American Independent, who stands a reasonable chance at defeating both Trump and Clinton in Utah. (His Mormon background in a predominantly Mormon state is crucial to understanding his viability there.) How would you assess McMullin’s chances in the Beehive State? Are you willing to venture a guess that he will win its six electoral votes?
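If you want to test candidate maps quickly, a small Python sketch like the one below can tally them. The electoral vote counts for the four states follow the 2012 to 2020 apportionment, and 270 is the majority of 538. The `BASE` totals for all the other states are invented purely so the numbers add up, and the picks are just one hypothetical scenario.

```python
VOTES_TO_WIN = 270  # majority of the 538 electoral votes

# Electoral votes for a few states (2012-2020 apportionment). Illustrative subset.
ELECTORAL_VOTES = {"Florida": 29, "Ohio": 18, "Pennsylvania": 20, "Utah": 6}

# Hypothetical totals from every state NOT listed above. Made up for this sketch.
BASE = {"Clinton": 250, "Trump": 215}


def tally(base, picks, electoral_votes=ELECTORAL_VOTES):
    """Add each picked state's electoral votes to its chosen candidate."""
    totals = dict(base)
    for state, candidate in picks.items():
        totals[candidate] = totals.get(candidate, 0) + electoral_votes[state]
    return totals


def winner(totals):
    """Return the candidate with at least 270 votes, or None if nobody reaches it."""
    for candidate, votes in totals.items():
        if votes >= VOTES_TO_WIN:
            return candidate
    return None


# One hypothetical map, including a third-party win in Utah.
picks = {
    "Florida": "Trump",
    "Ohio": "Trump",
    "Pennsylvania": "Clinton",
    "Utah": "McMullin",
}
totals = tally(BASE, picks)
print(totals, winner(totals))
```

Swapping a single entry in `picks` shows how much one state moves the outcome. Extending `ELECTORAL_VOTES` to all 50 states plus D.C. removes the need for `BASE`.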

Final Thoughts

We’re always on the lookout for insights, visualizations, and other creative output from SUDSers. In addition to your prediction submission, if you have any further thoughts that you’d like to share with the world (maybe answers to the above questions?), drop us a line. We’d like to publish you!

And if you’re registered, remember to vote!

Criminal Justice Work Night highlight: do police from smaller units use force more often?

At our first Work Night a few weeks ago, SUDS members dug into data on crime and criminal justice–particularly from the 2013 Law Enforcement Management and Administrative Statistics (LEMAS) survey. One student, Kee Won Song, pulled together some interesting initial insights and a sweet chart in just a few hours. He writes:
I am interested to see if we can identify factors that contribute to use of force incidents. Specifically, I am interested to see if factors like employee demographics, education level of employees, size of department, participation in academic research (which we might also use to assign a score for ‘transparency’), budget, training methods, number of specialized units, use of data/computers in evaluating performance etc. have any effect on the frequency of use of force. I did not get to analyze many of these factors; however, this is one figure that I produced that plots use of force incidents (expressed as incidents per employee) against total employees (full-time plus part-time):

[Figure: LEMAS survey]

It’s hard to say that anything substantive can be gleaned from the visualization but it might allow us to further focus on smaller departments that have a high use of force rate (or identify outliers for further analysis).
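For anyone who wants to rebuild the two quantities plotted above, here is a hedged Python sketch of the calculation. The column names (`agency`, `full_time`, `part_time`, `force_incidents`) are hypothetical placeholders and are not the actual LEMAS variable names, and the rows are invented.

```python
import csv
import io

# Invented rows; the column names are hypothetical, not real LEMAS variables.
SAMPLE_CSV = """agency,full_time,part_time,force_incidents
A,10,2,6
B,400,0,40
C,0,0,1
"""


def force_rates(handle):
    """Return (agency, total_employees, incidents_per_employee) tuples.

    Agencies reporting zero employees are skipped to avoid dividing by zero.
    """
    results = []
    for row in csv.DictReader(handle):
        employees = int(row["full_time"]) + int(row["part_time"])
        if employees == 0:
            continue
        results.append((row["agency"], employees, int(row["force_incidents"]) / employees))
    return results


rates = force_rates(io.StringIO(SAMPLE_CSV))

# Sort by rate, highest first, to surface departments worth a closer look.
ranked = sorted(rates, key=lambda item: item[2], reverse=True)
print(ranked)
```

Plotting the second element against the third gives the same kind of scatter as the figure. Small departments can show extreme rates from only a handful of incidents, so treat outliers with care.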

Kee Won Song is a full-time MPM student at CMU, who is also completing Masters of Sustainability at Chatham; his interests include researching the impacts of unconventional oil and gas extraction on air quality, particularly in underprivileged communities.

Work Night: Hacking & Fracking

Environmental Sensors HackNight
We are going to be using the ESDR dataset collected by various sensors across the country. The data is collated by CMU Create Lab.
  1. Follow this link, and download the dataset (or from here and select PGH_Sensors_Data.csv if that link doesn’t work).
  2. For simplicity, we filtered it only for Pittsburgh and only the sensors that have been active. If you need more data, it can be collected through
  3. The file is a CSV file, with columns indicating the name, id, and location of the sensor, the observations it has been collecting, and at what times.
  4. Load the file into R, iPython, or wherever.
You can also check out other cool visualizations of this data here.
Discuss your ideas, cool observations here: Environmental Sensors HackNight
Sample Ideas for the ESDR data
  1. Mapping the sensors, and visualizing the pollution levels based on the neighborhood.
  2. Finding which neighborhoods have the worst air quality.
  3. Combining with 311 data from WPRDC to figure out some cool stuff.
  4. Talk to people, and think what more can be done.
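As a first pass at the air-quality ranking idea, the Python sketch below averages readings per sensor. The column names `sensor_name` and `pm2_5` are assumptions (the real file may name things differently), and the sensors and readings are invented so the example is self-contained.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Invented readings; "sensor_name" and "pm2_5" are assumed column names.
SAMPLE_CSV = """sensor_name,pm2_5
Sensor A,12.0
Sensor A,8.0
Sensor B,30.0
Sensor B,20.0
Sensor B,
"""


def mean_reading_by_sensor(handle, value_column):
    """Average the given column per sensor, ignoring blank readings."""
    readings = defaultdict(list)
    for row in csv.DictReader(handle):
        if row[value_column]:
            readings[row["sensor_name"]].append(float(row[value_column]))
    return {sensor: mean(values) for sensor, values in readings.items()}


averages = mean_reading_by_sensor(io.StringIO(SAMPLE_CSV), "pm2_5")

# Highest average first, i.e. the worst air quality at the top.
worst_first = sorted(averages, key=averages.get, reverse=True)
print(averages, worst_first)
```

To move from sensors to neighborhoods, you would still need to map each sensor’s location to a neighborhood, which this sketch does not attempt.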
PS. If you want some tips on using Tableau to visualize data, this tutorial from a former SUDSer can get you started. The video starts with pulling data from a public API, which we’re not doing here, so you’ll have to bring the data in manually. After that, just follow along.
PPS. If you tweet / insta / etc… #SUDSWorkNight