Census Data Workshop: Monday, October 2

Navigating U.S. Census Data Workshop with Eileen Patten 

Eileen Patten presenting at the Census event

On Monday, October 2, join us for a workshop on accessing and using U.S. Census Bureau population and business data to assess your community, explore a topic of interest to you, or just learn more about the United States. U.S. Census data is a treasure trove of information about our country that spans many decades and many topic areas. Did you know you can use the American Community Survey to figure out how many homes in the U.S. have flush toilets? Or the American Time Use Survey to find out how many hours Americans spend mowing their lawns?


This workshop will be led by the SUDS Speaker Series Chair, Eileen Patten. She is a second-year Master of Science in Public Policy and Management student, specializing in Data Analytics. Before coming to Heinz, Eileen worked for the “fact tank” Pew Research Center, where she used census data to analyze topics like gender and racial wage gaps, Latino and Asian populations in the U.S., and teen birth rates.

Eileen will introduce several tools and surveys, including:

  • Integrated Public Use Microdata Series (IPUMS-USA): IPUMS-USA collects, preserves, and harmonizes U.S. census microdata and provides easy access to this data with enhanced documentation. Data includes decennial censuses from 1790 to 2010 and American Community Surveys (ACS) from 2000 to the present. Eileen will be working through the IPUMS online tabulator and dataset downloading. You can sign up for an IPUMS account if you want to follow along!
  • American FactFinder: This is the Census Bureau’s main tool for distributing information collected by its programs. It includes data from the Decennial Census, the Economic Census, the American Community Survey, the American Housing Survey, and many more.
  • QuickFacts: This “is an easy to use application that provides tables, maps, and charts of frequently requested statistics from many Census Bureau censuses, surveys, and programs”. (QuickFacts)

If you’re interested, make sure you’re registered on Eventbrite! The event will be on Monday, October 2 from 4:30 to 6:00 pm in Hamburg Hall, Room 2008.

Welcome to a new semester with SUDS!

Wei, Ben K, Michael, Ben S, Chris, Vrishali, Akanksha, Eileen, Shouvik, Ali

Subscribe to keep up to date with SUDS

About Us

Let us reintroduce ourselves – Students for Urban Data Systems (SUDS) is a group of intellectually curious students at CMU dedicated to solving real life problems. We believe data analysis, data visualization, and machine learning skills are key to understanding and responding to issues in today’s data-heavy world. Unfortunately, many community organizations are not always equipped to do this type of data manipulation and analysis. That’s where we come in.

SUDS partners with organizations in Pittsburgh and Allegheny County that are dedicated to social good, and we lend our skills to their mission. In sum, we connect our members with opportunities to do good with data.

Who We Are

In 2015, a small group of students recognized an opportunity to combine the skills they were learning in class with their ambition for contributing to the welfare of community organizations. That idea was the seed for SUDS. In two short years, we have grown from the original group of five students to a network of 50+ regular members, with hundreds of students participating in our activities in the past two years.

Obviously, our founders tapped into something that excited students across the CMU community. Today, students from all academic disciplines and skill levels participate in SUDS projects, workshops, hack nights, and speaker series.

What Makes Us Unique

We prize enthusiasm and effort before skill set. Our projects help members develop their skills, including data analysis (Python, R), data visualization (R Shiny, GIS, JS), machine learning, and project management. One of the most important parts of the SUDS experience is our mentorship. We learn from each other and help each other grow. Finally, we are open to all. There is no pay to play with SUDS. It’s as simple as this: our members are the folks that show up. You can keep up to date with projects, workshops, and events in the SUDS community by subscribing to our mailing list and community discussion board:

The Affordable Housing Problem and Investor Behavior in Pittsburgh

By Nick Kharas, Claire Jacquillat, Erin Yanacek, and Maksim Khaitovich

Our cross-functional team from Heinz College and Tepper is interested in looking at the affordable housing scene in Pittsburgh. This was part of a case competition we won, jointly organized by SUDS and the Data Analytics clubs at Heinz and Tepper. Eventually, the topic grew on us as we shared similar experiences while searching for a house to rent or buy. Currently, Pittsburgh has an affordable housing deficit of 17,000 units. Affordability is a reflection of the price of the house, its condition, and the livability of the surrounding area. Thus, solutions to Pittsburgh’s housing challenges need to focus on healthy community development. Keeping that in mind, we propose an “Assess and Address” framework that identifies investor behavior and creates a system of incentives and penalties to address negative influences.

Hazelwood – Our Pilot

Like many other neighborhoods in Pittsburgh, Hazelwood has been hit hard since the city’s steel mills shut down. Still, it has maintained its tight community spirit. New developments like technology company investments, startups and malls are reviving communities in neighborhoods like East Liberty and Lawrenceville, but at the risk of gentrification. Hazelwood is representative of what has been happening across Pittsburgh. Recent developments like Almono and Summerset at Frick Park have the potential to revive the neighborhood. At the same time, we must also ensure that these new developments bring positive change and not displacement for longtime residents.


Hazelwood’s community focus is still intact, but the recent developments could attract bad investors. As we see from the comparison of average property prices, Hazelwood does not have cases where properties in poor or unsound condition are sold at inflated values, unlike the rest of Allegheny County. While this is a good sign for Hazelwood, it still has some vacant properties and many houses in average or fair condition, which could potentially attract attention from bad investors in the future, if they aren’t already there.


Note: The graph highlights outliers indicating that some properties in poor or average condition are sold for significantly inflated values. The edges of each box in the plot indicate the interquartile range of sale values for each property condition. The flat line within each box is the median. The dotted lines are the outliers. For our assessment, we considered all valid sales from the property assessments data maintained by the WPRDC. Also, we used zip code 15207 to distinguish Hazelwood from the rest of Allegheny County. Although 15207 covers parts of Glen Hazel and Greenfield, it represents the challenges we aim to address.
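For readers who want to reproduce the numbers behind a box plot like this one, here is a minimal sketch. It assumes the common 1.5 × IQR rule for marking outliers (the post does not say which rule the plot used), and the sale prices are made up for illustration rather than taken from the WPRDC data.

```python
from statistics import quantiles


def box_summary(sale_prices):
    """Return the quartiles and the outliers for a list of sale prices.

    Outliers are defined with the 1.5 * IQR fences, which is an assumption:
    the original plot may have used a different rule.
    """
    q1, median, q3 = quantiles(sale_prices, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box in the plot
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [p for p in sale_prices if p < low_fence or p > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical sale prices for properties in "poor" condition.
poor_condition_sales = [18_000, 22_000, 25_000, 27_000, 30_000, 31_000, 35_000, 240_000]
summary = box_summary(poor_condition_sales)
print(summary)
```

In this made-up sample, the single $240,000 sale sits far above the upper fence, which is the kind of inflated sale the note describes.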


Investor Behavior – Who Are These “Bad” Investors?

A neighborhood stands to benefit if houses in poor or unlivable condition are purchased and redeveloped. However, some investors are only interested in buying and reselling houses to make a quick profit, without improving their condition or spending on maintenance. Unlike “rehabbers”, “flippers” and “milkers” buy properties, keep them in distressed condition, and hope to sell them off for a profit as quickly as possible. Such behavior does not attract healthy investors, and it hurts the community and the quality of life in the neighborhood.


We decided to look at data-driven ways to identify and predict bad behavior using the property assessments data maintained by the WPRDC. Although we do not have any information on house owners, our solution identifies and flags potentially bad investments and highlights their distinguishing characteristics.

The data does not have class labels that identify bad investments. We ran a k-means clustering algorithm on ownership duration, property value appreciation, and sale price to set the class labels for good and bad investments needed for analytical modeling. We avoided using general definitions of flippers to determine class boundaries, as that could ignore hidden patterns and add bias. For example, any house resold within 12 months for less than $100,000 is potentially a transaction conducted by a flipper. However, using this information alone to set class boundaries ignores several transactions where the property was held for only a little more than a year. The results from clustering directed us to target parcels that met both of the following conditions:

  • The parcel was owned for less than three years, and
  • The property value depreciated, or appreciated by less than 15%, or the parcel was sold for less than $90,000.

Any parcels that met the above conditions were flagged as properties owned by potentially bad investors, while the rest were labeled as unsuspicious.
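The flagging rule above can be sketched in a few lines of Python. This is an illustration of the rule as described in this post, not the team's actual code: the `Sale` class, its field names, and the example values are hypothetical, and the sketch reads the rule as "held under three years AND (weak or negative appreciation OR low sale price)".

```python
from dataclasses import dataclass


@dataclass
class Sale:
    """Hypothetical record for one resale of a parcel."""
    years_owned: float   # time between the previous sale and this sale
    prev_price: float    # previous sale price, must be positive
    sale_price: float    # price of this sale


def is_flagged(sale, max_years=3.0, min_appreciation=0.15, price_floor=90_000):
    """Return True if the sale matches the 'potentially bad investor' rule."""
    appreciation = (sale.sale_price - sale.prev_price) / sale.prev_price
    short_hold = sale.years_owned < max_years
    # A negative appreciation (depreciation) is also below the 15% threshold.
    weak_return = appreciation < min_appreciation
    low_price = sale.sale_price < price_floor
    return short_hold and (weak_return or low_price)


examples = [
    Sale(years_owned=1.0, prev_price=50_000, sale_price=52_000),    # short hold, 4% gain
    Sale(years_owned=5.0, prev_price=50_000, sale_price=52_000),    # held too long to flag
    Sale(years_owned=2.0, prev_price=100_000, sale_price=150_000),  # 50% gain, above floor
    Sale(years_owned=2.0, prev_price=40_000, sale_price=80_000),    # big gain, but under floor
]
labels = [is_flagged(s) for s in examples]
print(labels)
```

Parcels with no previous sale record cannot be evaluated this way, since both the holding period and the appreciation depend on the earlier sale.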


Once we were able to set the labels for good and bad investments, we ran a few classification algorithms to predict investor behavior. We held out 20% of the data for testing and trained different classifiers on the remaining 80%, including Gradient Boosting, Random Forest, and Conditional Inference Trees (a modeling technique based on unbiased recursive partitioning). All returned an AUC between 0.70 and 0.72. Gradient Boosting and Random Forest with feature selection returned marginal, but not significant, improvements.
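To make the evaluation setup concrete, here is a small, library-free sketch of an 80/20 holdout split and of AUC computed from its definition: the probability that a randomly chosen flagged parcel gets a higher score than a randomly chosen unflagged one. It is not the team's code (their analysis is on GitHub); the function names are hypothetical, and the pairwise AUC loop is written for clarity rather than speed.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


train, test = train_test_split(list(range(100)))
print(len(train), len(test))

# Made-up labels (1 = flagged) and classifier scores for four parcels.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, so values of 0.70 to 0.72 mean a flagged parcel outranks an unflagged one a bit more than two times out of three.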


Our primary challenge was to keep bias in the data from affecting our results. The classes are imbalanced: around 90% of the parcels are not red-flagged. This makes sense, as we cannot expect the property market to be completely overrun by flippers. Additionally, the assessments data has missing values for some fields, including those that determine our class labels. For example, some parcels do not have previous sale records, making it difficult to determine how long the house was held before it was sold. We cannot assume that this is missing data, as the property may not have changed ownership more than once in its lifetime. Further, the assessments data set gives us the parcel characteristics as of now, and not as of when the property was last sold. For example, the condition of a property may have either improved or deteriorated after it was last sold four years ago, but we have no way to find out. For these reasons, the simpler models did not perform well. Logistic regression proved to be computationally expensive with a large number of categorical variables, and decision trees performed poorly on test data.


We have shared our detailed analysis on GitHub.


Our Proposal – Assess and Address

A city like Pittsburgh would like to identify and avoid bad investor activity. However, in an effort to maintain housing affordability, the city cannot drive away potentially good investments that can develop and enrich its vibrant community. To maintain this balance, we propose an Assess and Address framework that gives actionable recommendations.



Confirm Results with In-Person Observation

Our analytical model can highlight and flag properties that are potentially at a risk of being owned by flippers. The city inspectors and community leaders can export a list of such at-risk targets, look up information on their owners or landlords, and monitor their behavior.


Sanction Bad Practices

Install practices and systems that would discourage landlords from mistreating their tenants and violating requirements for minimum standards. Our suggestions include:

  • Minimum Property Standards established by the Allegheny County Health Department for rental properties.
  • Rental Registration – Force landlords to act responsibly towards tenants. A good example is the Probationary Rental Occupancy Permit (PROP) set by the city of Raleigh, NC, which aims to ensure better housing quality for tenants and discourages landlords from violating City Codes.

Foster Positive Practices

These are programs that would encourage good investors and community homeowners to invest in enriching and developing the city’s community spirit. Some suggestions and examples of their implementation include:


We hope to take this initiative forward with the help of SUDS at Carnegie Mellon University and present our findings to the Pittsburgh City Council.


Nick Kharas graduated from Carnegie Mellon University with a Master’s degree concentrating in Data Analytics and Business Intelligence. Prior to his time at CMU, he was a business intelligence and data warehousing SME at a Japanese multinational financial holding company. When not a data buff, he enjoys travel, sports, and meeting new people. Click here to check out his work on GitHub. You can also connect with Nick at https://www.linkedin.com/in/nickkharas.

Claire Jacquillat is an MBA candidate at Carnegie Mellon’s Tepper School of Business. She focuses her studies on operations management and operations research. She is the president of the Tepper Data Analytics Club, where she strives to foster an integrated use of business analytics in various industries. Before starting her MBA at the Tepper School, she worked as a strategist in Sales Enablement for a Fortune 500 company. You can connect with Claire at https://www.linkedin.com/in/clairejacquillat/en

Erin Yanacek is an MBA candidate at Carnegie Mellon’s Tepper School of Business. In summer 2017, Erin will join McKinsey as a Summer Associate. Prior to business school, Erin was a classical musician. She founded a nonprofit organization, the Chamber Orchestra of Pittsburgh, and toured internationally performing and teaching classical trumpet. You can connect with Erin at https://goo.gl/BEWz1L

Maksim Khaitovich is an MBA candidate at Carnegie Mellon’s Tepper School of Business. In summer 2017, Maksim will join A. T. Kearney as a Summer Data Science Associate. Prior to business school, Maksim worked as an engineer and IT consultant in fintech and wireless communications. You can connect with Maksim at https://www.linkedin.com/in/maksim-khaitovich-828a2b47/

Internship opportunity: Boston Summer Analytics Fellowship

Cool data job alert!

Love solving tough problems with data? Want to make a difference next summer? The City of Boston’s Analytics Team is recruiting for its summer fellowship program.

From their site:

“Mayor Walsh’s Citywide Analytics Team brings the power of data to the City of Boston. We use data to tackle some of the City’s biggest challenges, from reducing firefighter injuries and combating foodborne illness to preventing overdoses and matching homeless residents to housing. As part of the Department of Innovation and Technology, we are at the forefront of the City’s efforts to apply modern technology and analytics to make life better for everyone who lives and works in Boston.”

Positions are available for Data and Performance, Data Science, and Data Engineering Fellows.

More details here. Application deadline is December 30.

Find other jobs, internships, and opportunities in data under Resources.


Work Night: Can You Beat Nate Silver?

SUDS is joining forces with the Heinz Policy and Politics Club to bring you a night of election prediction excitement. 

Think you can predict the 2016 Presidential Election? Find our Election Prediction Form here. Make a copy of it (online or offline). Follow the instructions on the bottom right portion of the sheet. And then email your submission by Friday, November 4th at 8pm to sudscmu@gmail.com (you can submit by email or by sharing the document inside Google Drive).

Here’s some guidance for tonight:

  1. Follow this link and navigate to “Download CSV of polls” at the bottom of the page, above the links to sponsored content: http://projects.fivethirtyeight.com/2016-election-forecast/ (For more polling data, and to better understand what you’re looking at, check out links below).
  2. Import the CSV into R, Python, Excel, whatever you like, and simplify the dataset to only the variables that are of greatest interest to you.
  3. Though you may explore different outcomes using any of the three scenarios FiveThirtyEight uses to estimate the vote (Polls-Plus, Polls-Only, and Now-Cast), be sure to include rows from only one of these labels in your analysis. The dataset will include approximately 9000 rows; thus, your analysis should include approximately one-third of this total, or 3000 rows, per scenario.
  4. Pay special attention to the ‘rawpoll’ and ‘adjpoll’ columns, which give figures presented before and after FiveThirtyEight weighting was applied. Based on the resources below and your own intuition, decide if your interpretation of the raw data would differ from the ‘adjusted’ poll numbers in the dataset.
  5. Fill out the entry form (instructions on the bottom right) with your predictions and submit!
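If you are working in Python, steps 2 through 4 might look something like the sketch below. It uses only the standard library and a tiny made-up sample in place of the real download. The column names (`type`, `rawpoll_clinton`, `adjpoll_clinton`, and so on) and the scenario labels are assumptions about the CSV layout, so check them against the header row of the file you actually download.

```python
import csv
import io

# Made-up rows standing in for the downloaded file; not real polling data.
SAMPLE_CSV = """type,state,pollster,rawpoll_clinton,rawpoll_trump,adjpoll_clinton,adjpoll_trump
polls-only,U.S.,Pollster A,45.0,41.0,45.5,42.0
polls-plus,U.S.,Pollster A,45.0,41.0,45.4,42.1
now-cast,U.S.,Pollster A,45.0,41.0,45.6,41.9
polls-only,Utah,Pollster B,26.0,32.0,27.0,33.5
"""


def load_scenario(csv_file, scenario="polls-only", type_column="type"):
    """Keep only the rows for one forecast scenario (step 3)."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row[type_column] == scenario]


def margin_shift(row):
    """How much the adjustment moves the Clinton-minus-Trump margin (step 4)."""
    raw = float(row["rawpoll_clinton"]) - float(row["rawpoll_trump"])
    adj = float(row["adjpoll_clinton"]) - float(row["adjpoll_trump"])
    return adj - raw


rows = load_scenario(io.StringIO(SAMPLE_CSV))
shifts = {row["state"]: round(margin_shift(row), 2) for row in rows}
print(len(rows), shifts)
```

To run this on the real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file, for example `open("polls.csv", newline="")`, where the filename is whatever you saved the download as.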

How to understand the polls + R polls-specific tutorials + more polling data:

  • A User’s Guide To FiveThirtyEight’s 2016 General Election Forecast
  • Get started with this fascinating piece that shows how different pollsters can arrive at varying conclusions – even using the same raw data.
  • Read through a recent post by Peter Ellis, a professional statistician based in New Zealand. While the exact proprietary methodology that FiveThirtyEight employs remains a mystery, Ellis’ overview does touch on the apparent effect of grades assigned by Nate Silver and his team of data whizzes on weighting and distribution of the data.
  • Tutorial on R package that’s an interface to the Huffington Post Pollster API
  • R tutorial for visualizing and exploring 538 national polls data, plus links to other polls sources.

Guiding Questions and Other Areas to Explore

  • Take a look at the historical electoral college maps at 270towin.com and use them as references in your answers to the following questions.
  • Take the national and state-by-state polling into account. Hillary Clinton has held the edge in national polling since the first presidential debate a month ago, but her lead looks to be dipping – and polls have not yet caught up to the news that the FBI discovered “pertinent” emails while investigating Anthony Weiner in an unrelated case. Which maps can you produce that result in a victory for Donald Trump?
  • The last candidate to win 400 electoral votes was George H.W. Bush, who defeated Democratic challenger Michael Dukakis by a margin of 426-111 in 1988. (Dukakis won only 10 states, including his home state of Massachusetts, and the District of Columbia.) Can you plot out a path to 400 for either candidate? Which path do you think seems more likely, and why?
  • Per your reading of the polls, where does Hillary Clinton have the best chance of “turning a red state blue” (i.e., winning a state that voted for Mitt Romney in 2012)?
  • Conversely, which states that voted for Barack Obama in 2012 seem most likely to move into the Republican column and cast their share of the electoral vote for Donald Trump?
  • Looking back at the last 24 years (since Bill Clinton was first elected in 1992), are there any general regional changes you can pinpoint? What is your hypothesis for these shifts?
  • It has been 48 years since a third-party candidate last won a state and carried its electoral votes. (Consult the maps for more on the pivotal 1968 election, and read more on the staunchly segregationist campaign George Wallace ran here.) This year, Libertarian Gary Johnson appears primed to capture at least five percent of the popular vote nationally, but it is Evan McMullin, a conservative, #NeverTrump flagbearer running as an American Independent, who stands a reasonable chance of defeating both Trump and Clinton in Utah. (His Mormon background in a predominantly Mormon state is crucial to understanding his viability there.) How would you assess McMullin’s chances in the Beehive State? Are you willing to venture a guess that he will win its six electoral votes?

Final Thoughts

We’re always on the lookout for insights, visualizations, and other creative output from SUDSers. In addition to your prediction submission, if you have any further thoughts that you’d like to share with the world (maybe answers to the above questions?), drop us a line at sudscmu@gmail.com. We’d like to publish you!

And if you’re registered, remember to vote!