The Affordable Housing Problem and Investor Behavior in Pittsburgh

By Nick Kharas, Claire Jacquillat, Erin Yanacek, and Maksim Khaitovich

Our cross-functional team from Heinz College and Tepper set out to examine the affordable housing scene in Pittsburgh. The work began as part of a case competition we won, jointly organized by SUDS and the Data Analytics clubs at Heinz and Tepper. The topic grew on us as we shared similar experiences of searching for a house to rent or buy. Currently, Pittsburgh has an affordable housing deficit of 17,000 units. Affordability reflects the price of a house, its condition, and the livability of the surrounding area. Thus, solutions to Pittsburgh’s housing challenges need to focus on healthy community development. With that in mind, we propose an “Assess and Address” framework that identifies investor behavior and creates a system of incentives and penalties to address negative influences.

Hazelwood – Our Pilot

Like many other neighborhoods in Pittsburgh, Hazelwood has been hit hard since the city’s steel mills shut down. Still, it has maintained its tight community spirit. New developments like technology company investments, startups and malls are reviving communities in neighborhoods like East Liberty and Lawrenceville, but at the risk of gentrification. Hazelwood is representative of what has been happening across Pittsburgh. Recent developments like Almono and Summerset at Frick Park have the potential to revive the neighborhood. At the same time, we must also ensure that these new developments bring positive change and not displacement for longtime residents.

Hazelwood’s community focus is still intact, but the recent developments could attract bad investors. As we see from the comparison of average property prices, Hazelwood does not have cases where properties in poor or unsound condition are sold at inflated values, unlike the rest of Allegheny County. While this is a good sign for Hazelwood, it still has some vacant properties and many houses in average or fair condition, which could potentially attract attention from bad investors in the future, if they aren’t already there.

Note: The graph highlights outliers indicating that some properties in poor or average condition are sold for significantly inflated values. The edges of each box in the plot indicate the interquartile range of sale values for each property condition. The flat line within each box is the median. The dotted lines are the outliers. For our assessment, we considered all valid sales from the property assessments data maintained by the WPRDC. Also, we used zip code 15207 to distinguish Hazelwood from the rest of Allegheny County. Although 15207 covers parts of Glen Hazel and Greenfield, it represents the challenges we aim to address.
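For readers who want to reproduce the box plot summary described in the note, here is a minimal sketch of the arithmetic behind it. The sale prices are made up for illustration, and the 1.5 × IQR cutoff for outliers is a common convention that we assume here; the note does not state which cutoff the plot used.

```python
from statistics import quantiles

def box_stats(values, whisker=1.5):
    """Return quartiles and outliers for one group of sale prices."""
    # Quartiles mark the box edges (q1, q3) and the line inside it (median).
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

# Hypothetical sale prices (USD) grouped by assessed condition.
sales_by_condition = {
    "Poor": [12000, 15000, 18000, 20000, 22000, 25000, 140000],
    "Average": [60000, 70000, 75000, 80000, 85000, 90000, 95000],
}

summary = {cond: box_stats(prices) for cond, prices in sales_by_condition.items()}
for cond, stats in summary.items():
    print(cond, stats)
```

In this toy data, the single $140,000 sale of a property in poor condition is the kind of inflated value the plot is meant to expose.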

Investor Behavior – Who Are These “Bad” Investors?

A neighborhood stands to benefit if houses in poor or unlivable condition are purchased and redeveloped. However, some investors are only interested in buying and reselling houses to make a quick profit, without improving their condition or spending on maintenance. Unlike “rehabbers”, “flippers” and “milkers” buy properties, keep them in distressed condition, and hope to sell them off for a profit as quickly as possible. Such behavior fails to attract healthy investors, and it also hurts the community and the quality of life in the neighborhood.

We decided to look at data-driven ways to identify and predict bad behavior using the property assessments data maintained by the WPRDC. Although we do not have any information on house owners, our solution flags potentially bad investments and highlights the characteristics that set them apart.

The data does not have class labels that identify bad investments. To create the good and bad labels needed for modeling, we ran a k-means clustering algorithm on ownership duration, property value appreciation, and sale price. We avoided using general definitions of flippers to set class boundaries, as that could miss hidden patterns and add bias. For example, one general definition treats any house resold within 12 months for less than $100,000 as a potential flip. However, using this definition alone to set class boundaries ignores several transactions where the property was held for only a little more than a year. The clustering results directed us to target parcels that met both of the following criteria:

  • The parcel was owned for less than three years, and
  • Its value depreciated or appreciated by less than 15%, or it was sold for less than $90,000.

Any parcel that met both criteria was flagged as a property owned by a potentially bad investor, while the rest were labeled as unsuspicious.
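The flagging rule above can be written as a short function. This is only a sketch: the function and parameter names are ours, and we assume appreciation is measured as the relative change from purchase price to sale price, which the text does not spell out.

```python
def is_flagged(years_owned, purchase_price, sale_price,
               max_years=3.0, min_appreciation=0.15, min_price=90_000):
    """Return True if a sale matches the 'potentially bad investor' rule."""
    # Assumed definition: relative change from purchase price to sale price.
    appreciation = (sale_price - purchase_price) / purchase_price
    short_hold = years_owned < max_years
    # A depreciation is a negative appreciation, so "depreciated or
    # appreciated less than 15%" collapses to one comparison.
    weak_return = appreciation < min_appreciation or sale_price < min_price
    return short_hold and weak_return

# Hypothetical sales: (years owned, purchase price, sale price) in USD.
print(is_flagged(1.0, 50_000, 52_000))    # short hold, 4% appreciation
print(is_flagged(2.0, 100_000, 150_000))  # short hold, but healthy return
print(is_flagged(5.0, 50_000, 40_000))    # loss, but held for five years
print(is_flagged(1.5, 40_000, 80_000))    # doubled, but sold under $90,000
```

Note that the last case is flagged even though the price doubled, because the sale price threshold alone is enough to satisfy the second criterion.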

Once we were able to set the labels for good and bad investments, we ran a few classification algorithms to predict investor behavior. We held out 20% of the data for testing, and trained several classifiers on the remaining 80%: Gradient Boosting, Random Forest, and Conditional Inference Trees (a modeling technique based on unbiased recursive partitioning). All returned an AUC between 0.70 and 0.72. Gradient Boosting and Random Forest with feature selection returned marginal improvements, but nothing significant.
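To make the evaluation setup concrete, here is a minimal sketch of an 80/20 hold-out split and an AUC calculation in plain Python. It is not our modeling code: the helper names are hypothetical, and the labels and scores are toy values standing in for a classifier's output.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and hold out a fraction of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

def auc(labels, scores):
    """Chance that a random flagged parcel outscores a random unflagged one.

    Ties count as half a win. This equals the area under the ROC curve.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

train, test = train_test_split(range(10))
print(len(train), len(test))

# Toy example: 1 = flagged, 0 = unsuspicious, with predicted scores.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2]
print(round(auc(toy_labels, toy_scores), 3))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which is the scale on which the 0.70 to 0.72 figures above should be read.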

 

Our primary challenge was to keep bias in the data from affecting our results. The classes are imbalanced: around 90% of the parcels are not red-flagged as bad investments. This makes sense, as we cannot expect the property market to be completely overrun by flippers. Additionally, the assessments data has missing values for some fields, including those that determine our class labels. For example, some parcels do not have previous sale records, making it difficult to determine how long the house was held before it was sold. We cannot assume that this is missing data, as the property may not have changed ownership more than once in its lifetime. Further, the assessments data set gives us the parcel characteristics as they are now, not as they were when the property was last sold. For example, the condition of a property may have improved or deteriorated since it was last sold four years ago, but we have no way to find out. For these reasons, the simpler models did not perform well. Logistic regression proved computationally expensive with a large number of categorical variables, and decision trees performed poorly on test data.

We have shared our detailed analysis on GitHub.

 

Our Proposal – Assess and Address

A city like Pittsburgh would like to identify and avoid bad investor activity. However, in an effort to maintain housing affordability, the city cannot drive away potentially good investments that can develop and enrich its vibrant community. To maintain this balance, we propose an Assess and Address framework that gives actionable recommendations.

Assess

Confirm Results with In-Person Observation

Our analytical model can flag properties that are potentially at risk of being owned by flippers. City inspectors and community leaders can export a list of these at-risk properties, look up information on their owners or landlords, and monitor their behavior.

Address

Sanction Bad Practices

Install practices and systems that would discourage landlords from mistreating their tenants and violating requirements for minimum standards. Our suggestions include:

  • Minimum Property Standards established by the Allegheny County Health Department for rental properties.
  • Rental Registration – Force landlords to act responsibly towards tenants. A good example is the Probationary Rental Occupancy Permit (PROP) set by the city of Raleigh, NC, which aims to ensure better housing quality for tenants and discourages landlords from violating City Codes.

Foster Positive Practices

These are programs that would encourage good investors and community homeowners to invest in enriching and developing the city’s community spirit. Some suggestions and examples of their implementation include:

 

We hope to take this initiative forward with the help of SUDS at Carnegie Mellon University and present our findings to the city council at Pittsburgh.

 

Nick Kharas graduated from Carnegie Mellon University with a Master’s degree concentrating in Data Analytics and Business Intelligence. Prior to his time at CMU, he was a business intelligence and data warehousing SME at a Japanese multinational financial holding company. When not a data buff, he enjoys travel, sport and meeting new people. Click here to check out his work on GitHub. You can also connect with Nick at https://www.linkedin.com/in/nickkharas.

Claire Jacquillat is an MBA candidate at Carnegie Mellon’s Tepper School of Business. She focuses her studies on operations management and operations research. She is the president of the Tepper Data Analytics Club, where she strives to foster an integrated use of business analytics in various industries. Before starting her MBA at the Tepper School, she worked as a strategist in Sales Enablement for a Fortune 500 company. You can connect with Claire at https://www.linkedin.com/in/clairejacquillat/en

Erin Yanacek is an MBA candidate at Carnegie Mellon’s Tepper School of Business. In summer 2017, Erin will join McKinsey as a Summer Associate. Prior to business school, Erin was a classical musician. She founded a nonprofit organization, the Chamber Orchestra of Pittsburgh, and toured internationally performing and teaching classical trumpet. You can connect with Erin at https://goo.gl/BEWz1L

Maksim Khaitovich is an MBA candidate at Carnegie Mellon’s Tepper School of Business. In summer 2017, Maksim will join A. T. Kearney as a Summer Data Science Associate. Prior to business school, Maksim worked as an engineer and IT consultant in fintech and wireless communications. You can connect with Maksim at https://www.linkedin.com/in/maksim-khaitovich-828a2b47/

Want to host open data? You can with CKAN!

By Matt Cleinman

If you visit many government open data websites, you may notice that they all start to look very, very similar.  (For some examples, look at the UK national government, Washington DC, and our own Western Pennsylvania Regional Data Center.)  Your eyes are not going numb from looking at datasets – it’s that many are powered by CKAN.  

What is CKAN?  It’s a behind-the-scenes secret that helps make open data possible.  In their words:

CKAN is a powerful data management system that makes data accessible – by providing tools to streamline publishing, sharing, finding and using data. CKAN is aimed at data publishers (national and regional governments, companies and organizations) wanting to make their data open and available.

Even better, this web application is open source, meaning that anyone can see the source code and add features for their implementation.  Even-even better, it is designed to easily incorporate extensions, so that any organization that uses CKAN can add your feature.

As part of the Master of Information Systems Management degree from Heinz College, all students participate in a client-driven group capstone project in their final semester.  Being a member of SUDS, I was delighted that my team was assigned to work with the City of Philadelphia, as every other group had a large corporate client.

Tim Wisniewski, Philadelphia’s Chief Data Officer, had several exciting ideas for projects.  With our end-of-semester time constraint in mind, we chose a CKAN extension that would streamline Philadelphia’s open data workflow.  (Some of his other proposals will be tackled by future MISM capstone teams!)

CKAN is wonderful, but does not allow for data dictionaries (or “metadata”) to be stored for each dataset. Philadelphia currently handles this by using a separate system to track the data dictionaries.  Most datasets contain a link to the metadata mixed in with the links to the data – and those links go to the metadata server.

What is metadata? If you’ve ever asked someone what column D in a spreadsheet represented, you have asked for metadata – it’s the information about the data.  “Column D is the number of clients impacted by the project described in Column A.  It should be a positive integer.”
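To make the idea concrete, here is a rough sketch of what a data dictionary could look like as a data structure, along with a check that uses it. Everything here is hypothetical and for illustration only: the column names, the layout of the dictionary, and the `validate_row` helper are ours, and none of it is CKAN's API or the format our extension stores.

```python
# Hypothetical data dictionary for one dataset: one entry per column.
data_dictionary = {
    "project": {
        "description": "Name of the project",
        "type": str,
    },
    "clients_impacted": {
        "description": "Number of clients impacted by the project",
        "type": int,
        "check": lambda v: v > 0,  # "It should be a positive integer."
    },
}

def validate_row(row, dictionary):
    """Return a list of human-readable problems found in one row."""
    problems = []
    for column, spec in dictionary.items():
        if column not in row:
            problems.append(f"{column}: missing")
            continue
        value = row[column]
        if not isinstance(value, spec["type"]):
            problems.append(f"{column}: expected {spec['type'].__name__}")
        elif "check" in spec and not spec["check"](value):
            problems.append(f"{column}: failed check")
    return problems

print(validate_row({"project": "A", "clients_impacted": 12}, data_dictionary))
print(validate_row({"project": "A", "clients_impacted": -3}, data_dictionary))
```

The point is that once the description of each column lives next to the data, both people and programs can use it.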

Our challenge: Learn about CKAN development and write an extension that allows native handling of data dictionaries.  Great documentation is available, but CKAN is a fairly complex system.  It uses Jinja for the frontend, Python on the backend, a PostgreSQL database, and many more technologies.  Luckily, our team brought a diverse skillset to the project.

I’ll spare you the gory details, but we eventually got our extension working and tested.  Like CKAN itself, the extension is open-source, and we’ve been excited by the interest in it so far.  You can view our extension on GitHub.

CKAN was a perfect project for us: Large enough to be complex and somewhat bewildering at first, but understandable enough to be able to deliver the final product.  It stretched our skills, but in a manageable way.  For individuals looking to push their web app development abilities, consider contributing to CKAN – sharpen your skills while contributing to the open data movement!

Matt Cleinman is a recent grad of the Heinz College MISM program (’15). While writing this post, he realized he never actually learned why the application is named CKAN.

Emojis of Pittsburgh

by Dan Tasse and Jennifer Chou

What do people in Squirrel Hill talk about?

Or, more interestingly, what do people in Squirrel Hill talk about that people in other neighborhoods don’t? What is it that makes Squirrel Hill Squirrel Hill? That’s the question we set out to answer with this project.

Most frequently tweeted words in each Pittsburgh neighborhood

How it works

We gathered all tweets geotagged in Pittsburgh over about a year, from December 2013 to January 2015. We sorted them by neighborhood (using boundaries provided by the WPRDC) and used a modified TF-IDF algorithm to figure out what words were specific to each neighborhood. This algorithm counts the frequency of a word in a given neighborhood, and then adjusts the word’s final score based on how many other neighborhoods also use that word.

For example, “Steelers” is used a lot in Squirrel Hill, but it’s also used in many other neighborhoods, so it has a pretty low score. “Tunnel”, however, is quite popular in Squirrel Hill (mostly due to people grousing about tunnel traffic), but not elsewhere. Similarly, “10a” is a popular bus used to get around Pitt, but isn’t used elsewhere, so “10a” shows up a lot in Oakland.

Tweets referencing the “10a” bus
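Here is a minimal sketch of the scoring idea described above, treating each neighborhood as one "document". It uses plain TF-IDF rather than our modified version, and the tweets are made up, so take it as an illustration of the principle and not as the code behind the map.

```python
import math
from collections import Counter

def tfidf_by_neighborhood(tweets_by_neighborhood):
    """Score each word per neighborhood with plain TF-IDF."""
    # Word counts per neighborhood (simple whitespace tokenization).
    counts = {
        hood: Counter(w for tweet in tweets for w in tweet.lower().split())
        for hood, tweets in tweets_by_neighborhood.items()
    }
    n_hoods = len(counts)
    # In how many neighborhoods does each word appear at least once?
    hoods_using = Counter(w for c in counts.values() for w in c)
    scores = {}
    for hood, c in counts.items():
        total = sum(c.values())
        scores[hood] = {
            # Frequency in this neighborhood, scaled down when the word
            # is also used in many other neighborhoods.
            w: (n / total) * math.log(n_hoods / hoods_using[w])
            for w, n in c.items()
        }
    return scores

# Made-up tweets for illustration.
tweets = {
    "Squirrel Hill": ["steelers game tonight", "tunnel traffic again",
                      "stuck in the tunnel"],
    "Oakland": ["waiting for the 10a", "steelers win", "10a is late"],
    "North Shore": ["steelers tailgate", "go steelers"],
}

scores = tfidf_by_neighborhood(tweets)
for hood, word_scores in scores.items():
    print(hood, max(word_scores, key=word_scores.get))
```

Because "steelers" appears in every neighborhood in this toy data, its score is zero everywhere, while "tunnel" and "10a" rise to the top of their neighborhoods.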

An emoji is worth…

These words just represent what people are talking about on Twitter. What are people feeling? To answer that question, we looked to the emojis people are tweeting. Emojis are an interesting new form of communication: one character can often say more than a word, so they can tell us about where people like to do certain things, or maybe even how people feel.

Top emojis in each ‘hood

For example, we can see that the zoo is up in Highland Park, and that people like watching baseball and football and drinking beer on the North Shore. Obvious enough. But did you know how popular the swimming pool in Oakland is, or the Christmas tree lighting downtown?

Future work, and so what?

There’s still work to do, of course. One major challenge is algorithmic: How do we combine these posts from multiple people into a representative aggregate? A lot of these words/emojis are boosted by one person tweeting them multiple times. We don’t want one person to dominate the neighborhood’s tweets, but we do want an avid basketball fan to count more than someone who just tweeted about basketball once.
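One possible direction, which we have not adopted and offer only as a sketch, is to dampen each person's contribution so that repeated posts count for more than one post but less than their raw total. The logarithmic weighting below is our assumption for illustration; other weightings would work too.

```python
import math
from collections import Counter, defaultdict

def dampened_counts(posts):
    """Aggregate (user, token) pairs with a per-user log dampening."""
    per_user = defaultdict(Counter)
    for user, token in posts:
        per_user[user][token] += 1
    totals = Counter()
    for counts in per_user.values():
        for token, n in counts.items():
            # One post adds 1.0; seven posts by the same user add 3.0.
            totals[token] += math.log2(1 + n)
    return totals

# Made-up posts: one avid fan versus four occasional posters.
posts = [("fan", "basketball")] * 7 + [
    ("a", "beer"), ("b", "beer"), ("c", "beer"), ("d", "beer"),
]

totals = dampened_counts(posts)
print(totals)
```

In this toy example the avid fan still counts three times as much as a one-time poster, but four different people tweeting about beer now outweigh one person tweeting about basketball seven times.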

We hope this is the first step towards useful neighborhood guides. Imagine if you were moving to Pittsburgh for the first time, and looking for the right area to live in. Knowing that Squirrel Hill South has a lot of basketball fans, or that the top words in Lawrenceville are trendy bars or music venues, could really help you get a feel for the city and its many unique neighborhoods.

Try it out! http://emojimap.herokuapp.com

(Be patient; it’s on a free server so it’ll be a little slow.) And send any feedback or ideas to dantasse@cmu.edu.

Dan Tasse is a PhD student in Human-Computer Interaction at CMU. He’s interested in how we can use social media posts to help people understand their cities and neighborhoods better.

Jennifer Chou is an undergraduate studying Computer Science at CMU.

Why did Pittsburgh survive the housing slump?

by Nick Kharas and Emily Sasse

The Stability of Pittsburgh’s Property Market

Pittsburgh is known to have one of the most stable property markets in the United States. The city has not had a housing recession. It is safe from housing bubbles for a few reasons:

  • Land Value Tax – Historically, Pittsburgh’s taxation policies encouraged productive land use and steadied its housing market. The city taxed the value of land at a higher rate, and the value of buildings and improvements at a lower rate. Productive investors could maximize their after-tax return on investment, while speculating on idle land was not lucrative. However, this split-rate tax was discontinued in 2001.
  • Available Space – Unlike larger cities, Pittsburgh is not constrained in building space or in growing outward.

It is important to note that there has been a drop in property value in recent times. Home values have fallen over the last year. However, this does not yet indicate that there is a bubble in the property market.

  • Owners of high-quality real estate want to hold on to their property and not sell. Additionally, lower interest rates since 2010 encourage owners to hold on to low mortgage rates.
  • According to Mr. Hanna of Howard Hanna Real Estate Services, first time buyers are finding it hard to get a mortgage, and millennial buyers who want to be flexible and not be in Pittsburgh forever will not want to buy real estate in the city.
  • Home values are down in some neighborhoods, but are rising in neighborhoods like Lawrenceville. This is primarily because of the thriving restaurants, shops and jobs in the east side of the city.

To validate these views, we analyzed Pittsburgh property sales over the last four years (January 2012 to November 2015), using data from the WPRDC open data portal. A majority of the property parcels involved in transactions are owned by individuals rather than corporations. This gives us a good indication of the housing market in Pittsburgh.

We also looked at the changes in median property values in Pittsburgh over the same period. We chose the median over the mean because a few transactions in 2012 and 2013 have extremely high property values, and these outliers skew the average. The median is therefore a better indicator of the common trend in Pittsburgh housing.

The data does seem to support the points mentioned already. In 2013, the median property value rose by only 6%, while in 2014, we noticed a negligible fall by 0.31%. This does indicate the stability of the city’s property market. Also, we do notice that the median property value has fallen by about 10% in 2015. However, it does not give sufficient evidence to conclude the presence of a real estate bubble in the city.
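The sketch below shows why we preferred the median, and how a year-over-year percentage change is computed. The sale prices are invented for illustration and are not taken from the WPRDC data.

```python
from statistics import mean, median

# Invented sale prices (USD); 2012 contains one extreme transaction.
sales_by_year = {
    2012: [80_000, 95_000, 100_000, 110_000, 2_500_000],
    2013: [85_000, 98_000, 104_000, 120_000, 130_000],
}

def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100

medians = {year: median(prices) for year, prices in sales_by_year.items()}
means = {year: mean(prices) for year, prices in sales_by_year.items()}

print("median change:", round(pct_change(medians[2012], medians[2013]), 1))
print("mean change:", round(pct_change(means[2012], means[2013]), 1))
```

With these numbers the median rises modestly, while the mean appears to collapse, purely because one very expensive sale in 2012 drops out of the comparison.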

In parallel, we also checked whether the federal funds rate (retrieved from https://research.stlouisfed.org/fred2/series/FEDFUNDS#) was having any impact on Pittsburgh’s real estate market. We found no statistically significant correlation between the two. We calculated the figures below to reach our result:

Correlation coefficient: -0.16
Significance test p-value: 0.25

A p-value of 0.25 means that, if property values and federal interest rates were truly uncorrelated, we would still see a correlation at least this strong about 25% of the time by chance alone. For the correlation to be statistically significant at the conventional 5% level, the p-value would have needed to be below 0.05.
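For readers who want to run this kind of check themselves, here is a minimal sketch in plain Python. It computes the Pearson correlation coefficient and estimates a two-sided p-value with a permutation test. The permutation test is our choice for this sketch because it needs no extra libraries; it is not necessarily the significance test behind the figures above, and the input series are toy values.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def permutation_p_value(x, y, n_permutations=10_000, seed=0):
    """Two-sided p-value: how often does shuffled data correlate as strongly?"""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Toy series with an obvious linear relationship.
x = list(range(20))
y = [2 * v + 1 for v in x]
print(round(pearson_r(x, y), 3))
print(permutation_p_value(x, y, n_permutations=500) < 0.05)
```

Applied to monthly median property values and the monthly federal funds rate, a weak coefficient together with a p-value above 0.05 would lead to the same conclusion we reached: no statistically significant correlation.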

Next Steps

This project focuses on property sales in Pittsburgh. Going forward, future work involves continuing to monitor property sales and relevant indicators in the county, to determine whether or not current and historical trends continue. This will enable informed and successful future policy decisions.

In addition to Pittsburgh, it would be interesting to extend this research to Allegheny County and surrounding counties, or the nation overall. This type of analysis would provide key insight into the Pittsburgh real estate market. It would reveal the performance of the Pittsburgh real estate market in comparison to real estate markets in similar cities across the United States.

 

Nick Kharas is pursuing a Masters in Information Systems Management at Carnegie Mellon University. He has a deep focus on emerging technologies in business intelligence (BI), advanced analytics and data science. Prior to this, he was a data warehousing professional at a Japanese multinational financial holding company. When not a data nerd, he enjoys travelling or just meeting new people. You can connect to Nick at https://www.linkedin.com/in/nickkharas

Emily Sasse is pursuing a Masters in Public Policy and Management: Data Analytics. During her time at Heinz, she has developed a keen interest in the study of business intelligence and data analytics. After graduation, she will join Accenture as a Digital Consultant in Boston, Massachusetts. Emily enjoys winter sports and exploring the east coast. You can connect to Emily at https://www.linkedin.com/in/emilysasse

The original project also had active contributions from Sridevi Yagati Venkateshdatta, Ranjani Padmanabhan and Jingwei Cao.