Data Skills Workshops recap: JavaScript viz and open city data

Thanks to everyone who’s been joining our SUDS Data Skills Workshops this semester! So far, we’ve heard from Alex Sciuto on using JavaScript and D3.js for data visualization, and Bob Gradeck from the Western PA Regional Data Center on finding and using city data. It’s been awesome to dive into the many tools and resources out there that we can use to tackle our data dreams (or nightmares…). And we’re so grateful to have these gurus share with us!


Bob Gradeck schooling us in the ways of open city data


Example of a D3.js visualization

Here’s a link to Bob’s hot tips on city data and to Alex’s presentation on GitHub (seriously great stuff in here).

We’ve still got a stacked deck of workshops remaining for the semester, including:

Hope to see you there! As with all SUDS events, workshops are free and open to all (including non-students).

And a big thanks to Krista Kinnard for coordinating our workshops! If you have any ideas for more topics, please contact us at


Do Pittsburgh’s buses bunch?

by Mark Egge, Ranjana Krishnamoorthy, Bhavna Prasad, Enbo Zhang, and Rohita Kamath

As bus riders, we wanted to know what trends in bus service levels can be learned from the data on bus locations. Like many mass transit systems, the Port Authority of Allegheny County (the entity that operates the bus system that services Pittsburgh and the surrounding vicinity) publishes real-time information about the locations of its vehicles in service. This information can be accessed through the Port Authority website or through various third-party apps and websites. Unfortunately for would-be analysts, the Port Authority does not publish any historical bus location data. That is, the data published by the Port Authority cannot be used to answer questions about historical service delivery patterns.

In particular, we wanted to know if buses bunch, or cluster. We’ve observed, anecdotally, that the wait between bus arrivals can sometimes be much longer than scheduled, and that after a long delay, the buses often show up in pairs or even triples.

Building a data warehouse

To answer this question, we obtained an API key from the Port Authority (which allows Port Authority data to be retrieved in XML or JSON format) and built a data warehouse to capture and record the real-time bus location information. We capture the location of all buses on the 61A/61B/61C/61D routes once every sixty seconds. Additionally, to investigate if service levels are impacted by weather, we also capture the concurrent weather conditions (via WeatherUnderground’s developer API).

The data is retrieved from the Port Authority in XML format. We use Microsoft SQL Server Integration Services (SSIS) to extract, transform, and load this data into a data warehouse that captures historic bus location information. In addition to the vehicle location information, we also load in other dimensions useful for analysis, including the routes and the patterns (sequences of stops and waypoints) that constitute a route. If you are connected to the CMU network, you can access our database by following these instructions.
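For readers who want a feel for the "extract and transform" step without SSIS, here is a minimal Python sketch of turning one XML snapshot into rows ready for loading. It is not our actual pipeline: the element names (vehicle, vid, rt, tmstmp, lat, lon, pdist), the timestamp format, and the sample values are all assumptions made for illustration, so check them against the feed you actually receive.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Made-up snapshot. Element names and values are assumptions for illustration,
# not the documented Port Authority schema.
SAMPLE_XML = """
<response>
  <vehicle>
    <vid>5501</vid><rt>61C</rt><tmstmp>20150401 08:15</tmstmp>
    <lat>40.4443</lat><lon>-79.9436</lon><pdist>17950</pdist>
  </vehicle>
  <vehicle>
    <vid>5620</vid><rt>61A</rt><tmstmp>20150401 08:15</tmstmp>
    <lat>40.4380</lat><lon>-79.9970</lon><pdist>1200</pdist>
  </vehicle>
</response>
"""

def parse_vehicles(xml_text):
    """Turn one XML snapshot into a list of flat dictionaries, one per vehicle."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for vehicle in root.iter("vehicle"):
        rows.append({
            "vehicle_id": vehicle.findtext("vid"),
            "route": vehicle.findtext("rt"),
            "observed_at": datetime.strptime(vehicle.findtext("tmstmp"), "%Y%m%d %H:%M"),
            "lat": float(vehicle.findtext("lat")),
            "lon": float(vehicle.findtext("lon")),
            # Distance travelled along the route pattern, in the feed's own units.
            "path_distance": int(vehicle.findtext("pdist")),
        })
    return rows

rows = parse_vehicles(SAMPLE_XML)
for row in rows:
    print(row["route"], row["vehicle_id"], row["path_distance"])
```

Running a function like this once a minute and appending the rows to a table is the basic idea behind the warehouse; the weather observations would be joined on the timestamp.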

How bad is the bunching?

Our data substantiates our anecdotal observations. We see that buses do often end up travelling in bunches of two or more buses. The graph below shows the progress of buses along their routes from their downtown departure (at the bottom of the graph) to their arrival at Hamburg Hall (approximately 18,000 along the path from downtown on the 61 bus routes). Each ascending line represents one vehicle. The horizontal distance between lines (at any fixed distance) shows the wait time between buses at that location. When lines are close together or overlapping, they represent buses that are clustered together (or bunched).
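The "horizontal distance between lines" can also be computed directly. The sketch below is one possible way to do it, not the query we ran: it assumes each vehicle's track is a time-ordered list of (minutes, distance along path) pairs, finds when each track crosses a reference distance by linear interpolation, and takes the gaps between consecutive crossings. The tracks, the reference distance, and the two-minute bunching threshold are all made up for illustration.

```python
def crossing_time(track, ref_distance):
    """Interpolate the time at which a track first reaches ref_distance.

    track is a time-ordered list of (minutes, distance_along_path) pairs.
    Returns None if the vehicle never reaches ref_distance.
    """
    for (t0, d0), (t1, d1) in zip(track, track[1:]):
        if d1 > d0 and d0 <= ref_distance <= d1:
            fraction = (ref_distance - d0) / (d1 - d0)
            return t0 + fraction * (t1 - t0)
    return None

def headways_at(tracks, ref_distance):
    """Gaps, in minutes, between consecutive vehicles passing ref_distance."""
    crossings = (crossing_time(track, ref_distance) for track in tracks.values())
    times = sorted(t for t in crossings if t is not None)
    return [later - earlier for earlier, later in zip(times, times[1:])]

# Hypothetical tracks: minutes since the start of observation, distance along path.
TRACKS = {
    "bus_a": [(0, 0), (10, 9000), (20, 18000)],
    "bus_b": [(12, 0), (20, 12000), (22, 15000)],
    "bus_c": [(13, 0), (21, 12000), (23, 15000)],
}

gaps = headways_at(TRACKS, ref_distance=13500)
bunched = [gap for gap in gaps if gap <= 2.0]  # threshold is an arbitrary choice
print(gaps, bunched)
```

In this toy data the second and third buses pass the reference point one minute apart, which is the kind of pair that shows up as nearly overlapping lines on the graph.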


As a result of bunching, the average time between bus arrivals varies greatly. The box-and-whisker plot below shows wait times by hour of the day (at the outbound bus stop located in front of Hamburg Hall). Wait times are minimized (and have the least variance) just before rush hour (from 7:00 to 8:00 am, and from 4:00 pm to 5:00 pm), and on weekends and holidays. Wait times exhibit the greatest variance during the 6:00 to 7:00 pm hour.
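The numbers behind a box-and-whisker plot are just a five-number summary per hour. As a rough sketch (the wait times below are invented, and the quartile method is simply the default of Python's statistics module, which may differ slightly from what a plotting tool uses), the grouping could look like this:

```python
from collections import defaultdict
from statistics import quantiles

def summarize_by_hour(observations):
    """Five-number summary of wait times for each hour of the day.

    observations is an iterable of (hour_of_day, wait_minutes) pairs.
    """
    by_hour = defaultdict(list)
    for hour, wait in observations:
        by_hour[hour].append(wait)

    summary = {}
    for hour, waits in sorted(by_hour.items()):
        if len(waits) < 2:
            continue  # quantiles() needs at least two data points
        q1, median, q3 = quantiles(waits, n=4)
        summary[hour] = {
            "min": min(waits), "q1": q1, "median": median,
            "q3": q3, "max": max(waits),
        }
    return summary

# Invented wait times (minutes) for two hours of the day.
SAMPLE = [
    (7, 4), (7, 6), (7, 8), (7, 10), (7, 30),
    (18, 1), (18, 2), (18, 15), (18, 25), (18, 27),
]

for hour, stats in summarize_by_hour(SAMPLE).items():
    print(hour, stats)
```

A wide gap between the first and third quartiles for an hour is what shows up as a tall box, and that spread matters to riders at least as much as the median does.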


Problems and solutions

Bunching is a widely observed and well-documented transportation phenomenon (see this great visual explanation). Bunching occurs when a leading bus is delayed (for example, by a rush-hour crowd or a rider loading a bike), so that more riders than average are waiting at subsequent stops. Boarding those extra riders delays the leading bus even further. If the trailing bus does not experience the same delay, it will find fewer riders than average at each stop, spend less time boarding, and gain on the bus ahead. The phenomenon continues until buses end up operating in pairs or groups of three.
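The feedback loop is easy to reproduce in a toy model. The sketch below is not fitted to our data: it assumes that the time a bus spends at a stop is proportional to the gap since the previous bus left that stop, that travel time between stops is constant, and that buses cannot overtake. Every parameter value is an assumption chosen for illustration.

```python
def simulate(num_stops=15, headway=10.0, travel=3.0,
             dwell_per_gap_minute=0.1, delays=(0.0, 2.0, 0.0)):
    """Return departures[k][i]: the minute at which bus k leaves stop i.

    delays[k] is how late bus k reaches the first stop. All values are toy
    assumptions, not measurements.
    """
    # Gap that keeps a perfectly regular service in balance under this model.
    steady_gap = headway / (1 + dwell_per_gap_minute)
    departures = []
    for k, delay in enumerate(delays):
        arrive = k * headway + delay
        row = []
        for i in range(num_stops):
            if k == 0:
                gap = steady_gap  # assume the service ahead of bus 0 was regular
            else:
                # No overtaking: a bus that catches up waits behind the one ahead.
                arrive = max(arrive, departures[k - 1][i])
                gap = arrive - departures[k - 1][i]
            # A longer gap means more waiting riders and a longer stop.
            depart = arrive + dwell_per_gap_minute * gap
            row.append(depart)
            arrive = depart + travel
        departures.append(row)
    return departures

deps = simulate()
for stop in (0, 7, 14):
    gap_ahead = deps[1][stop] - deps[0][stop]   # delayed bus vs. the bus ahead of it
    gap_behind = deps[2][stop] - deps[1][stop]  # trailing bus vs. the delayed bus
    print(stop, round(gap_ahead, 1), round(gap_behind, 1))
```

With these settings a two-minute delay to the middle bus is enough: the gap in front of it keeps growing while the gap behind it shrinks until the trailing bus has caught up.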

Creating more slack in the system reduces the frequency and severity of bunching, but requires more buses, operators, or longer ride times. Real-time GPS tracking opens an opportunity to reduce bunching through better coordination. If a trailing bus is notified of a delay encountered by the preceding bus, it could reduce its travel speed to avoid having fewer-than-average riders at subsequent stops. Unfortunately, such remedies only function when a bus operator has some discretion in travel speed, which is seldom possible in Pittsburgh’s historic, narrow streets.


Mark Egge is a data analyst with a background in healthcare operations and entrepreneurship. He balances his work in GIS, data mining, and health information exchange with an abiding love of playing outside and exploring the natural world.

Ranjana Krishnamoorthy is a graduate student in the Master of Information Systems Management program at Carnegie Mellon University. She loves working with data and is passionate about exploring how technology can be used to improve a business, its management, and its processes.

Bhavna Prasad recently completed her Master's in Information Systems Management at the Heinz College at Carnegie Mellon University. She is passionate about technology and has a strong penchant for molding raw data into key business drivers for product decisions.

Enbo Zhang is a Public Policy & Management student and graduate teaching assistant at Carnegie Mellon’s Heinz College.

Rohita Kamath is a Summer MISM graduate student at Heinz College. She previously worked at Deloitte Consulting for four years, including in its SAP practice. She loves working with technology and the idea of using data and technology to improve how companies are managed.