February 9, 2017
by Vrishali Shah
First Work Night of 2017!
There are a few projects going on right now.
Does anyone have others to add?
Valentine’s Day Dataset
Inspired by Amy Webb’s TED talk “How I Hacked Online Dating” and the data dives on the OkCupid blog, we decided to play with dating profiles for February.
The dataset, codebook, and sample files are on GitHub. The dataset is profiles.csv.zip, which must be unzipped before use.
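If you would rather not unzip by hand, here is a minimal Python sketch that reads the CSV straight out of the archive using only the standard library. The function name `read_zipped_csv` and the inner file name `profiles.csv` are assumptions for illustration; check the archive's actual contents before relying on them.

```python
import csv
import io
import zipfile


def read_zipped_csv(zip_path, member="profiles.csv"):
    """Return the rows of one CSV inside a zip archive as a list of dicts.

    `member` is an assumed name for the CSV inside the archive; use
    zipfile.ZipFile(zip_path).namelist() to see what is really in there.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # archive.open() yields bytes, so wrap it to get text for csv.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

Calling `read_zipped_csv("profiles.csv.zip")` should then give one dictionary per profile, keyed by the column names in the file's header row. Every value comes back as a string, so numeric columns need converting before any arithmetic.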
About the dataset:
The dataset and preliminary analysis are largely pulled from Albert Y. Kim and Adriana Escobedo-Land’s write-up in the Journal of Statistics Education.
The data consists of the public profiles of 59,946 OkCupid users who were living within 25 miles of San Francisco, had active profiles on June 26, 2012, were online in the previous year, and had at least one picture in their profile. Using a Python script, data was scraped from users’ public profiles on June 30, 2012; any non-publicly facing information such as messaging was not accessible.
Variables include typical user information (such as sex, sexual orientation, age, and ethnicity) and lifestyle variables (such as diet, drinking habits, smoking habits).
Text responses to the 10 essay questions posed to all OkCupid users are also included, such as “My Self Summary,” “The first thing people usually notice about me,” and “On a typical Friday night I am…” For a complete list of variables and more details, see the accompanying codebook.
- How do the heights of male and female OkCupid users compare? What about ages?
- What does the San Francisco online dating landscape look like? Or more specifically, what is the relationship between users’ sex and sexual orientation?
- How accurately can we predict a user’s sex using their listed height?
- Are there differences between the sexes in what words are used in the responses to the 10 essay questions?
- What trends or relationships in the data can we generalize to the rest of the San Francisco population? To the wider population? For which analyses does the fact that the dataset came from OkCupid make it less generalizable? What about coming from San Francisco?
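To get going on the height questions, here is a rough Python sketch using only the standard library. It assumes each row is a dictionary with `sex` and `height` columns and that sex is coded as `"m"` and `"f"`; those names and codes are assumptions, so confirm them (and the unit of height) against the codebook. The cutoff rule is just a naive baseline, not the method used in the article.

```python
from statistics import mean, median


def heights_by_sex(rows):
    """Group numeric heights by the value in the assumed `sex` column."""
    groups = {}
    for row in rows:
        try:
            height = float(row["height"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric height
        groups.setdefault(row.get("sex"), []).append(height)
    return groups


def summarize(groups):
    """Count, mean, and median height for each group."""
    return {
        sex: {"n": len(h), "mean": mean(h), "median": median(h)}
        for sex, h in groups.items()
    }


def threshold_accuracy(groups, cutoff, taller="m", shorter="f"):
    """Share of users guessed correctly by the rule: height >= cutoff means `taller`."""
    tall = groups.get(taller, [])
    short = groups.get(shorter, [])
    hits = sum(h >= cutoff for h in tall) + sum(h < cutoff for h in short)
    total = len(tall) + len(short)
    return hits / total if total else float("nan")
```

With `rows` loaded through `csv.DictReader`, `summarize(heights_by_sex(rows))` gives a quick side-by-side comparison, and trying a few cutoffs in `threshold_accuracy` shows how far height alone gets you. Swapping `"height"` for `"age"` in a copy of the first function would cover the age comparison too.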
Mini Tutorial: Albert Y. Kim and Adriana Escobedo-Land’s article accompanying the dataset gives a walkthrough of how to do summary statistics, conditional probabilities, predictions, and text analysis in R using this dataset. Good stuff in there!
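If R isn't your thing, the conditional probability step can be approximated in Python as well. This is a sketch under assumptions, not a port of the article's code: the function name `conditional_shares` and the column names `sex` and `orientation` are assumed, so check them against the codebook.

```python
from collections import Counter, defaultdict


def conditional_shares(rows, given="sex", outcome="orientation"):
    """Estimate P(outcome | given) from row counts.

    Rows missing either value are skipped. The column names are
    assumptions; adjust them to match the file's header.
    """
    counts = defaultdict(Counter)
    for row in rows:
        g, o = row.get(given), row.get(outcome)
        if g and o:
            counts[g][o] += 1
    return {
        g: {o: n / sum(c.values()) for o, n in c.items()}
        for g, c in counts.items()
    }
```

The result is a nested dictionary, so `conditional_shares(rows)["f"]` would hold the share of each orientation among users listed as female. Flipping the arguments (`given="orientation", outcome="sex"`) answers the reverse question.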
November 18, 2016
by Vrishali Shah
SUDS is joining forces with Transportation Club and Traffic21’s Transportation Camp for this Work Night.
The WPRDC has data on every vehicle crash in Allegheny County (and four other counties). This dataset includes 300+ variables on nearly every aspect of the crash. A crash is included in the dataset if the police were called, so very small incidents are missing, but most crashes are covered.
- Download the full dataset from the WPRDC, or get a smaller version from our GitHub that we filtered down to some of the most easily understandable variables. That one is called “crashes_smaller.csv”. Right-click to download it.
On our GitHub you can also find an R file called transpoHackNight that created the smaller dataset, in case you want to work with that yourself.
- If you scroll down on the WPRDC’s page, you’ll also find the codebook and primer. These explain what the variables mean. The crashes_smaller.csv dataset is filtered to those variables that are mostly easy to understand just by their names.
- As a reminder, this is open public data that people have been working on across the county. Here is a document containing questions that were posted at an early hack night looking into this dataset. If you find anything interesting, or answers to those questions, don’t be shy! Make sure to share your findings so we can all build on each other’s work.
Guiding Questions and Areas to Explore
Just to get you started…
- At what times of day, week, and year do the most crashes happen?
- Are there any interesting year-to-year comparisons?
- What variables stand out if you compare crashes with minor injuries, major injuries, and fatalities?
- Do car crashes involving bicycles fluctuate over the year? What about pedestrians?
- If you map the data, do you see any patterns or areas of concern?
- Find more questions and share your answers on the HackPad here.
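For the timing questions, a simple tally is often enough to spot a pattern. Below is a minimal Python sketch using only the standard library. The helper names are made up for illustration, and the column names used in the usage note (`HOUR_OF_DAY`, `CRASH_MONTH`, `BICYCLE_COUNT`) are assumptions; check the header row of crashes_smaller.csv or the codebook for the real ones.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def count_by(rows, column):
    """Tally rows by the value in `column`, skipping blank or missing values."""
    return Counter(
        row[column] for row in rows if row.get(column) not in (None, "")
    )


def involving(rows, count_column):
    """Keep only rows where a count column (e.g. bicycles) is above zero."""
    kept = []
    for row in rows:
        try:
            if float(row.get(count_column) or 0) > 0:
                kept.append(row)
        except ValueError:
            continue  # skip rows where the count is not a number
    return kept
```

After `rows = load_rows("crashes_smaller.csv")`, something like `count_by(rows, "HOUR_OF_DAY").most_common(5)` would list the busiest hours, and `count_by(involving(rows, "BICYCLE_COUNT"), "CRASH_MONTH")` would show how bicycle crashes are spread over the year. The keys are strings exactly as they appear in the file, so sort or convert them before plotting.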
Tomorrow at 3:00 the WPRDC will be hosting an open data session at Transportation Camp. Be sure to check it out if you’re attending!
Also as part of Transportation Camp on Sunday, come tour the Port Authority of Allegheny County’s West Mifflin bus garage. Join CMU’s Transportation Club for a “behind the scenes” look at the daily operations of running a transportation service, including the agency tasks that transportation planners and riders often take for granted (maintenance, cleaning, etc.). The tour will include a discussion of how planning is affected by the realities of garage locations, operator schedules, union rules, and other agency constraints.
Meet at the West Mifflin Garage, located at 1011 Lebanon Road, West Mifflin, PA 15122. For the full tour experience, take Bus 51 directly to the garage from downtown (or from Oakland by connecting with Bus 54). There’s a 51 bus that arrives at the depot at 9:49 am, and another that departs the depot for downtown at 11:24 am.
When: Sunday, November 20th, 10:00 – 11:15 am
Where: West Mifflin Garage, 1011 Lebanon Road, West Mifflin, PA 15122
Sign up for the tour here: http://r1ght.com/bustour. You’re welcome to attend even if you’re not registered for Transportation Camp.
November 12, 2016
by SUDS Admin
Cool data job alert!
Love solving tough problems with data? Want to make a difference next summer? The City of Boston’s Analytics Team is recruiting for its summer fellowship program.
From their site:
“Mayor Walsh’s Citywide Analytics Team brings the power of data to the City of Boston. We use data to tackle some of the City’s biggest challenges, from reducing firefighter injuries and combating foodborne illness to preventing overdoses and matching homeless residents to housing. As part of the Department of Innovation and Technology, we are at the forefront of the City’s efforts to apply modern technology and analytics to make life better for everyone who lives and works in Boston.”
Positions are available for Data and Performance, Data Science, and Data Engineering Fellows.
More details here. Application deadline is December 30.
Find other jobs, internships, and opportunities in data under Resources.