DataBurst 1: School Improvement funds allocations

We’re trying out our first data-driven introductory activity for our hacknight! This is meant to be interactive and fast-paced, so think conceptually and implement whatever you have time for. If you use any neat data tools, see if you can’t get a very quick demo to show off using this data set.


Google Drive version:

The spreadsheet document (*.xlsx) contains several tables. The first tab is called SchoolImprovementGrantsByDistrict, and it lists the total payout by district as part of the 2010 disbursement. Consider this data a reflection of the US Federal government’s education spending priorities.

The next four tabs contain test score data from the NAEP tests, the only randomized, norm-referenced test administered in the USA and the only one appropriate for interstate comparisons of academic progress. You have four different breakdowns of the NAEP averages by state. The most straightforward is a public school student average.

Imagine you’re working on the Hill and have been given an urgent request to answer any or all of these questions for the rep’s upcoming meeting with a radical left-wing education interest group.

This interest group is claiming that the Federal government is spending most of its improvement money on relatively affluent states which are also states with the highest relative mathematics achievement.

Determine one preliminary conclusion WITH an accompanying image for your boss.

Ready, set, GO!


SUDS tours Google: better cities through public-private data partnerships

Last Wednesday, SUDS visited Google’s Pittsburgh office, on the eve of its 10-year anniversary celebrations. We got to see their famous hammock room, Kennywood-themed hallways, and micro kitchens stocked according to behavioral science. But as jealous as we were of the nap pods, the best part of the visit was a talk by CMU Computer Science PhD alumna Sarah Loos on Google’s Better Cities project.

Cities face huge challenges in monitoring and managing their transportation infrastructure. In the US alone, $124 billion is wasted each year in traffic jams. The Better Cities team at Google has been piloting methodologies that match up cities’ transport data with aggregate, anonymized snapshots of historical traffic statistics in order to yield insight and solutions to nasty traffic problems.

For example, Google partnered with the City of Amsterdam to validate sensor readings on the A10 highway, which can tell when cars are slowing down (and thus, if a traffic jam might occur). The city can then analyze the data and change speed limits on its digital signs and take other measures to mitigate the jam’s impact. The physical sensors are really accurate, but also really expensive to install and maintain. Google found that by combining only some of the sensor data with representative models of aggregate data, it could detect the same traffic patterns with a high level of accuracy. By reducing the number of sensors needed in each stretch of road, Amsterdam’s government can save between 50,000 and 100,000 euros per kilometer per year.


As cities are using more individual-level data from more sources, including public-private partnerships, Loos stressed the importance of keeping information anonymous and private. Her work is focusing on differential privacy algorithms, which add enough noise to the data to mask the influence of any one individual’s contribution to the set.
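To make the idea of adding noise concrete, here is a minimal sketch of the Laplace mechanism, one textbook way to achieve differential privacy. It is not necessarily what Loos or the Better Cities team uses; the count, the epsilon value, and the function names are all made up for illustration.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    # One person can change a count by at most `sensitivity`, so noise with
    # scale sensitivity / epsilon masks any single contribution.
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
true_count = 1200        # hypothetical: cars seen on one road segment
noisy_counts = [private_count(true_count, epsilon=0.5, rng=rng) for _ in range(5)]
print(noisy_counts)
```

A smaller epsilon means more noise and stronger privacy; a larger one keeps the published numbers closer to the truth. Averaged over many releases the noise cancels out, which is why aggregate patterns can survive even though individual contributions are hidden.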

These pilot projects are an exciting example of how simple, but smart, data collaboration can improve city management. And Loos and her Google team are looking for new cities to partner with—we hope Pittsburgh will be one of them!




Emojis of Pittsburgh

by Dan Tasse and Jennifer Chou

What do people in Squirrel Hill talk about?

Or, more interestingly, what do people in Squirrel Hill talk about that people in other neighborhoods don’t? What is it that makes Squirrel Hill Squirrel Hill? That’s the question we set out to answer with this project.


Most frequently tweeted words in each Pittsburgh neighborhood

How it works

We gathered all tweets geotagged in Pittsburgh over about a year, from December 2013 to January 2015. We sorted them by neighborhood (using boundaries provided by the WPRDC) and used a modified TF-IDF algorithm to figure out what words were specific to each neighborhood. This algorithm counts the frequency of a word in a given neighborhood, and then adjusts the word’s final score based on how many other neighborhoods also use that word.

For example, “Steelers” is used a lot in Squirrel Hill, but it’s also used in many other neighborhoods, so it has a pretty low score. “Tunnel”, however, is quite popular in Squirrel Hill (mostly due to people grousing about tunnel traffic), but not elsewhere. Similarly, “10a” is a popular bus used to get around Pitt, but isn’t used elsewhere, so “10a” shows up a lot in Oakland.
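Here is a rough sketch of that scoring idea in Python. It uses plain textbook TF-IDF (we don't reproduce the exact modifications here), and the three neighborhoods and their tweets are invented toy data, so treat the names and numbers as assumptions.

```python
import math
from collections import Counter

def neighborhood_scores(tweets_by_hood):
    # Word counts per neighborhood (the "term frequency" part).
    counts = {
        hood: Counter(word for tweet in tweets for word in tweet.lower().split())
        for hood, tweets in tweets_by_hood.items()
    }
    n_hoods = len(counts)

    # How many neighborhoods use each word at least once.
    hoods_using = Counter()
    for hood_counts in counts.values():
        hoods_using.update(hood_counts.keys())

    scores = {}
    for hood, hood_counts in counts.items():
        total = sum(hood_counts.values())
        scores[hood] = {
            # Frequency in this neighborhood, scaled down when many
            # neighborhoods use the word (the "inverse document frequency" part).
            word: (n / total) * math.log(n_hoods / hoods_using[word])
            for word, n in hood_counts.items()
        }
    return scores

# Invented toy tweets, for illustration only.
toy_tweets = {
    "Squirrel Hill": ["steelers game tonight", "tunnel traffic again", "stuck in tunnel"],
    "Oakland": ["waiting for the 10a", "steelers win", "10a is late"],
    "North Shore": ["steelers tailgate", "beer and baseball"],
}

scores = neighborhood_scores(toy_tweets)
for hood, word_scores in scores.items():
    print(hood, max(word_scores, key=word_scores.get))
```

In this toy data "steelers" appears in every neighborhood, so its score is zero everywhere, while "tunnel" and "10a" come out on top in Squirrel Hill and Oakland respectively.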


Tweets referencing the “10a” bus

An emoji is worth…

These words just represent what people are talking about on Twitter. What are people feeling? To answer that question, we looked to the emojis people are tweeting. Emojis are an interesting new form of communication: one character can often say more than a word, so they can tell us about where people like to do certain things, or maybe even how people feel.


Top emojis in each ‘hood

For example, we can see that the zoo is up in Highland Park, and that people like watching baseball and football and drinking beer on the North Shore. Obvious enough. But did you know how popular the swimming pool in Oakland is, or the Christmas tree lighting downtown?

Future work, and so what?

There’s still work to do, of course. One major challenge is algorithmic: How do we combine these posts from multiple people into a representative aggregate? A lot of these words/emojis are boosted by one person tweeting them multiple times. We don’t want one person to dominate the neighborhood’s tweets, but we do want an avid basketball fan to count more than someone who just tweeted about basketball once.
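One possible direction, sketched below, is to dampen each person's contribution with a logarithm before summing across people. This is only an idea under stated assumptions, not something we have implemented or evaluated; the function name and the sample posts are hypothetical.

```python
import math
from collections import Counter, defaultdict

def dampened_word_scores(posts):
    # posts is a list of (user, word) pairs, one per use of a word in a tweet.
    users_by_word = defaultdict(Counter)
    for user, word in posts:
        users_by_word[word][user] += 1
    # Each user contributes log(1 + count): repeat tweets still count for more,
    # but with sharply diminishing returns.
    return {
        word: sum(math.log1p(n) for n in user_counts.values())
        for word, user_counts in users_by_word.items()
    }

# Hypothetical posts: one account tweets "pizza" 50 times,
# while eight different people each tweet "basketball" once.
posts = [("spammer", "pizza")] * 50
posts += [(f"user{i}", "basketball") for i in range(8)]

word_scores = dampened_word_scores(posts)
print(word_scores)
```

With raw counts "pizza" would win 50 to 8; with the dampened score, the word used by eight separate people ranks higher. An avid fan who tweets five times still counts for more than someone who tweets once, since log(6) is larger than log(2).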

We hope this is the first step towards useful neighborhood guides. Imagine if you were moving to Pittsburgh for the first time, and looking for the right area to live in. Knowing that Squirrel Hill South has a lot of basketball fans, or that the top words in Lawrenceville are trendy bars or music venues, could really help you get a feel for the city and its many unique neighborhoods.

Try it out!

(Be patient; it’s on a free server so it’ll be a little slow.) And send any feedback or ideas to

Dan Tasse is a PhD student in Human-Computer Interaction at CMU. He’s interested in how we can use social media posts to help people understand their cities and neighborhoods better.

Jennifer Chou is an undergraduate studying Computer Science at CMU.

Energy for all in Nigeria

by Madeleine Gleave

Nigeria’s energy poverty crisis

Like many developing countries, Nigeria is facing an energy poverty crisis. The International Energy Agency (IEA) estimates that nearly 1.3 billion people globally lack access to electricity, and about half of these people live in Africa. Energy poverty has crippling side effects; no electricity also means no access to safer and healthier electric cooking and heating, powered health centers and refrigerated medicines, light to study at night, or electricity to run a business. In Nigeria, the average level of access is only 53%.

Despite being rich in natural resources required to produce energy, such as oil and gas, Nigeria’s energy infrastructure is lacking. Many people live near power plants and transmission lines, but aren’t yet connected to the grid. Others are in very remote areas where off-grid solutions, such as solar panels, may help them generate their own electricity long before a power line reaches them.

To explore this problem, I created a StoryMap in ArcGIS that shows the highly disparate levels of electricity access, energy demand, and infrastructure across Nigeria.

Check out the full StoryMap here:

Identifying the best electricity access solution

As Nigeria and its development partners look for energy access expansion solutions, how can they choose the best intervention for the best region? Where should they target grid connections, grid expansion, or off-grid solutions?

Selecting the best approach from this set of solutions depends on the context of the specific geographic area, and is influenced by existing levels of access, proximity to existing lines and power plants, level of urban development, demographic characteristics, and income levels. I developed a composite energy access index, mapped using a kernel density heat map, to evaluate an area’s suitability for each type of access intervention. The higher the index score (the red areas on the heat map seen here), the more suited the area is to grid supply. The lower the index score (pale yellow), the more suitable for off-grid power. Mid-range scores (the orange and dark yellow areas) are good candidates for grid expansion.


Madeleine Gleave is a Public Policy and Management student in the Heinz College at CMU. She is particularly passionate about using data to improve planning, management, and evaluation in international development policy and programs.

Why did Pittsburgh survive the housing slump?

by Nick Kharas and Emily Sasse

The Stability of Pittsburgh’s Property Market

Pittsburgh is known to have one of the most stable property markets in the United States. The city has not had a housing recession. It is safe from housing bubbles for a few reasons:

  • Land Value Tax – Historically, Pittsburgh’s taxation policies encouraged productive land use and steadied its housing market. The city taxed the value of land at a higher rate, and the value of buildings and improvements at a lower rate. Productive investors could maximize their after-tax return on investment, while speculating on idle land was not lucrative. However, this split-rate tax was discontinued in 2001.
  • Available Space – Unlike larger cities, Pittsburgh is not constrained in building space or in growing outward.

It is important to note that there has been a drop in property value in recent times. Home values have fallen over the last year. However, this does not yet indicate that there is a bubble in the property market.

  • Owners of high-quality real estate want to hold on to their property and not sell. Additionally, lower interest rates since 2010 encourage owners to hold on to low mortgage rates.
  • According to Mr. Hanna of Howard Hanna Real Estate Services, first time buyers are finding it hard to get a mortgage, and millennial buyers who want to be flexible and not be in Pittsburgh forever will not want to buy real estate in the city.
  • Home values are down in some neighborhoods, but are rising in neighborhoods like Lawrenceville. This is primarily because of the thriving restaurants, shops and jobs in the east side of the city.

To validate these views, we analyzed Pittsburgh property sales over the last four years (January 2012 to November 2015) found on the WPRDC open data portal. We can see that a majority of the property parcels involved in transactions are owned by individuals rather than corporations. This can give us a good indication of the housing market in Pittsburgh.

We also decided to look at the changes in the median property values in Pittsburgh over the last four years. We selected the median instead of the mean, as there are a few transactions in 2012 and 2013 with extremely high property values, and these outliers are adding unwanted bias to the average property values. Thus, the median is a good indicator of the common trend in Pittsburgh housing.


The data does seem to support the points mentioned already. In 2013, the median property value rose by only 6%, while in 2014, we noticed a negligible fall of 0.31%. This does indicate the stability of the city’s property market. We also notice that the median property value has fallen by about 10% in 2015. However, this alone is not sufficient evidence to conclude that there is a real estate bubble in the city.
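The small sketch below shows why a handful of very large sales makes the median a safer summary than the mean, and how a year-over-year percent change is computed. The sale prices are made-up numbers chosen for illustration, not figures from the WPRDC data.

```python
from statistics import mean, median

# Hypothetical sale prices in dollars; note the single very large sale in 2012.
sales_by_year = {
    2012: [80_000, 95_000, 100_000, 110_000, 5_000_000],
    2013: [85_000, 98_000, 104_000, 115_000, 130_000],
}

def percent_change(old, new):
    # Change from old to new, expressed as a percentage of old.
    return (new - old) / old * 100

medians = {year: median(prices) for year, prices in sales_by_year.items()}
means = {year: mean(prices) for year, prices in sales_by_year.items()}

print("median change:", percent_change(medians[2012], medians[2013]))
print("mean change:", percent_change(means[2012], means[2013]))
```

In this toy example the median moves only slightly between the two years, while the mean appears to collapse, purely because one outlier sale dropped out.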

In parallel, we also decided to check whether the federal rates (retrieved from ) were having any impact on Pittsburgh’s real estate market. We found that there was no statistically significant correlation between the two. We calculated the figures below to reach our result:

Correlation Coefficient -0.16
Significance test – p-value 0.25

The p-value of 0.25 means that, if property values and federal interest rates were truly uncorrelated, we would still see a correlation at least this strong about 25% of the time by chance alone. For the correlation to be statistically significant at the conventional level, the p-value should have been less than 0.05.
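For readers who want to try this kind of check themselves, here is a minimal sketch that computes a Pearson correlation coefficient and estimates a p-value with a permutation test. The permutation test is a stand-in chosen because it needs only the Python standard library; it is not the significance test used for the figures above, and the rate and price series are hypothetical.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def permutation_p_value(xs, ys, n_perm=5000, seed=0):
    # Shuffle one series many times and count how often the shuffled
    # correlation is at least as strong as the observed one (two-sided).
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical monthly series, for illustration only.
rates = [0.10, 0.12, 0.09, 0.11, 0.14, 0.13, 0.10, 0.09, 0.12, 0.11]
median_prices = [101, 99, 104, 98, 100, 103, 97, 102, 105, 99]

r = pearson_r(rates, median_prices)
p = permutation_p_value(rates, median_prices)
print(r, p)
```

If the resulting p-value is above 0.05, the usual reading is that the data give no good reason to believe the two series move together.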

Next Steps

This project focuses on property sales in Pittsburgh. Future work involves continuing to monitor property sales and relevant indicators in the county to determine whether current and historical trends continue. This will help inform future policy decisions.

In addition to Pittsburgh, it would be interesting to extend this research to Allegheny County and surrounding counties, or the nation overall. This type of analysis would provide key insight into the Pittsburgh real estate market. It would reveal the performance of the Pittsburgh real estate market in comparison to real estate markets in similar cities across the United States.


Nick Kharas is pursuing a Masters in Information Systems Management at Carnegie Mellon University. He has a deep focus on emerging technologies in business intelligence (BI), advanced analytics and data science. Prior to this, he was a data warehousing professional at a Japanese multinational financial holding company. When not a data nerd, he enjoys travelling or just meeting new people. You can connect to Nick at

Emily Sasse is pursuing a Masters in Public Policy and Management: Data Analytics. During her time at Heinz, she has developed a keen interest in the study of business intelligence and data analytics. After graduation, she will join Accenture as a Digital Consultant in Boston, Massachusetts. Emily enjoys winter sports and exploring the east coast. You can connect to Emily at

The original project also had active contributions from Sridevi Yagati Venkateshdatta, Ranjani Padmanabhan and Jingwei Cao.