Exploring Open Data with Pandas and IPython at the Berkeley I School

"Working with Open Data", a course by Raymond Yee

This is a guest post, authored by Raymond Yee from the UC Berkeley School of Information (or I School, as it is known around here). This spring, Raymond has been teaching a course titled "Working with Open Data", where students learn how to work with openly available data sets using Python.

Raymond has been using IPython and the notebook since the start of the course, and has hosted many of the course materials directly on GitHub. He kindly invited me to lecture in his course a few weeks ago, and I gave his students an overview of the IPython project as well as our vision of reproducible research and of building narratives that are anchored in code and data that are always available for inspection, discussion and further modification.

Towards the end of the course, his students had to develop a final project, organizing themselves in groups of 2-4 and producing a final working system that would use open datasets to produce an interesting analysis of their choice.

I recently had the chance to see the final projects, and I have to say that I walked out blown away by the results. This is a group of students who don't come from traditionally computationally intensive backgrounds, as the only requirement was some basic Python experience. And in a matter of a few weeks, they created very compelling tools that dove into different problem domains, from health care to education and even sports, producing interesting results complete with a narrative, plots and even interactive JavaScript controls and SVG output elements. Keep in mind that the tools to do some of this aren't even really documented or explained much in IPython yet, as we haven't dug into that part in earnest (that is our Fall 2013 plan).

The students obviously did run into some issues, and I took notes on what we can do to improve the situation. We had a follow-up meeting on campus where we gave them pointers on how to do certain things more easily. But to me, these results validate the idea that we can construct computational narratives based on code and data that will ultimately lead to a more informed discourse.

I am very much looking forward to future collaborations with Raymond: he has shown that we can create an educational experience around data-driven discovery, using IPython, pandas, matplotlib and the rest of the SciPy stack, that is completely accessible to students without major computational training, and that lets them produce interesting results around socially interesting and relevant datasets.

I will now pass this on to Raymond, so he can describe the course and the challenges they encountered in a little more detail, and highlight some of the course projects that are already available on GitHub for further forking and collaboration.

As always, this post is available as an IPython notebook. The rest of the post is authored by Raymond.

The course

Over a 15-week period, my students and I met twice a week to study open data, using Python to access, process, and interpret that data. Twenty-four students completed the course: 17 Masters students from the I School and 7 undergraduates from electrical engineering/computer science, statistics, and business.

We covered about half of our textbook, Python for Data Analysis by Wes McKinney. Accordingly, a fair amount of our energy was directed to studying the pandas library. The prerequisite programming background was the one-semester Python minimum requirement for I School Masters students. The students learned a lot (while having a good time overall, so I'm told) about how to program with pandas and use the IPython notebook by working through a series of assignments. Students filled in missing code in IPython notebooks I had created to illustrate various techniques for data analysis. (Most of the resources for the course are contained in the course GitHub repository.)

My students and I were particularly grateful for the line-up of guest speakers (which included Fernando Perez), who shared their expertise on topics ranging from scraping the web to archive legal cases, the UC Berkeley Library Data Lab, open data in archaeology, open access web crawling, and scientific reproducibility.

Final projects

The culmination of the course came in the final projects, where groups of two to four students designed and implemented data analysis projects. The final deliverable for the course was an IPython notebook that was expected to contain the following:

  • a clear articulation of the problem being addressed in the project
  • a clear description of what was originally proposed and what the students ended up doing, describing what led the students to go from the proposed work to the final outcome
  • a thorough description of what was behind the scenes: the sources of data from which the students drew and the code they wrote to analyze the data
  • a clear description of the results and precise details on how to reproduce the results
  • a description of the future work the students would pursue if they were to continue their project beyond the course
  • a paragraph outlining how the work was split among group members.

The students and I welcomed enthusiastic visitors to the Open House, to which the I School community and, in fact, the larger campus community were invited.

Working with Open Data 2013 Open House

Here are abstracts for the eight projects, where each screenshot is a link to its IPython notebook. Enjoy!

Stock Performance of Product Releases
Edward Lee, Eugene Kim

We draw connections among openly available data pertaining to Apple in order to examine how Apple's stock performance was affected by particular product releases. We examine Wikipedia data for detailed information on Apple's product releases, use Yahoo Finance's API for specific stock performance metrics, and draw on openly available Form 10-Qs for internal (financial) changes at Apple. The main purpose is to examine available data to draw new conclusions centered on the time around the product release date.
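As a rough illustration of looking at stock performance "around the product release date", here is a minimal pandas sketch. It is not the students' code: the price series is synthetic, the release date is hypothetical, and `window_return` is a helper name invented for this example.

```python
import pandas as pd

# Synthetic daily closing prices (made-up numbers, not real Apple data).
prices = pd.Series(
    [100.0, 101.0, 102.0, 104.0, 103.0, 106.0, 108.0],
    index=pd.date_range("2013-01-01", periods=7, freq="D"),
    name="close",
)


def window_return(prices, release_date, days_before=2, days_after=2):
    """Fractional price change from `days_before` days before the release
    to `days_after` days after it (calendar days, for simplicity)."""
    release = pd.Timestamp(release_date)
    start = release - pd.Timedelta(days=days_before)
    end = release + pd.Timedelta(days=days_after)
    # Label-based slicing on a DatetimeIndex includes both endpoints.
    window = prices.loc[start:end]
    return window.iloc[-1] / window.iloc[0] - 1.0


# Hypothetical release date in the middle of the synthetic series.
change = window_return(prices, "2013-01-04")
print(f"Change around release: {change:.2%}")
```

A real analysis would also need to handle non-trading days and compare the change against a market benchmark; this sketch only shows the windowing step.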

Education First
Carl Shan, Bharathkumar Gunasekaran, Haroon Rasheed Paul Mohammed, Sumedh Sawant

Most parents nowadays have a general sense of the significant factors for choosing a school for their children. However, with a lack of existing tools and information sources, most of these parents have a hard time measuring, weighing and comparing these factors in relation to geographical areas when they are trying to pick the best place to live in with the best schools.

Thus, our team aims to address this problem by visualizing statistical data from the NCES together with geo-data to help parents through the process of picking the best area to live in with the best schools. Parents can specify exactly which parameters they consider important in their decision process, and we will generate a heat map of the state they’re interested in living in and dynamically color it according to how closely each county matches their preferences. The heat map will be displayed in a web browser.
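One plausible way to turn parent preferences into a per-county value for coloring a heat map is a weighted score over normalized indicators. The sketch below assumes that approach; the column names, weights, and county values are all invented for illustration and are not taken from NCES data or from the students' notebook.

```python
import pandas as pd

# Synthetic county-level indicators (invented values, not NCES data).
counties = pd.DataFrame(
    {
        "student_teacher_ratio": [15.0, 20.0, 25.0],
        "graduation_rate": [0.90, 0.80, 0.70],
    },
    index=["County A", "County B", "County C"],
)

# Hypothetical parent preferences: a weight per indicator, and whether
# a higher value of that indicator is better.
weights = {"student_teacher_ratio": 1.0, "graduation_rate": 3.0}
higher_is_better = {"student_teacher_ratio": False, "graduation_rate": True}


def preference_score(df, weights, higher_is_better):
    """Weighted average of min-max normalized indicators, in [0, 1].

    Assumes every indicator column has at least two distinct values.
    """
    total = sum(weights.values())
    score = pd.Series(0.0, index=df.index)
    for col, w in weights.items():
        norm = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
        if not higher_is_better[col]:
            norm = 1.0 - norm  # flip so that 1.0 always means "best"
        score += w * norm
    return score / total


scores = preference_score(counties, weights, higher_is_better)
print(scores.sort_values(ascending=False))
```

The resulting score per county could then be mapped to a color scale when drawing the map.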

The League of Champions
Natarajan Chakrapani, Mark Davidoff, Kuldeep Kapade

In the soccer world, there is a lot of money involved in the transfer of players in the premier leagues around the world. Focusing on the English Premier League, our project, “The League of Champions”, aims to analyze the return on investment of soccer transfers made by teams in the English Premier League. It aims to measure each club's return on every dollar spent on its acquired players for a season, on parameters like goals scored, active time on the field, assists, etc. In addition, we also look to analyze how big a factor player age is in commanding a high transfer fee, and whether clubs prefer to pay large amounts for specialist players in specific field positions.
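The "return on each dollar spent" idea can be sketched as a group-by followed by a ratio. The example below is only an assumption about how such a metric might be computed; the clubs, players, fees, and statistics are synthetic.

```python
import pandas as pd

# Synthetic transfer records (invented clubs, players, and numbers).
transfers = pd.DataFrame(
    {
        "club": ["Club X", "Club X", "Club Y"],
        "player": ["P1", "P2", "P3"],
        "fee_millions": [20.0, 10.0, 30.0],
        "goals": [10, 2, 6],
        "assists": [5, 1, 3],
    }
)

# Total spending and output per club for the season.
per_club = transfers.groupby("club")[["fee_millions", "goals", "assists"]].sum()

# Output per million spent: a simple proxy for return on investment.
per_club["goals_per_million"] = per_club["goals"] / per_club["fee_millions"]
per_club["assists_per_million"] = per_club["assists"] / per_club["fee_millions"]
print(per_club)
```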

All About TEDx
Chan Kim, JT Huang

TED is a nonprofit devoted to Ideas Worth Spreading. It started out (in 1984) as a conference bringing together people from three worlds: Technology, Entertainment, Design. The TED Open Translation Project brings TED Talks beyond the English-speaking world by offering subtitles, interactive transcripts and the ability for any talk to be translated by volunteers worldwide. The project was launched with 300 translations, 40 languages and 200 volunteer translators; now, there are more than 32,000 completed translations from the thousands-strong community. The TEDx program is designed to give communities the opportunity to stimulate dialogue through TED-like experiences at the local level.

Our project aims to encourage people to translate TEDx Talks as well, by showing how TEDx Talk videos are translated and spread across different languages, places and topics, and by comparing that spread with that of TED Talk videos.

Environmental Health Gap
Rohan Salantry, Deborah Linton, Alec Hanefeld, Eric Zan

There is growing evidence that environmental factors trigger chronic diseases such as asthma, which result in billions in health care costs. However, a gap in knowledge exists concerning the extent of the link and where it is more prevalent. We aim to create a framework for closing this gap by integrating health and environmental condition data sets. Specifically, the project will link emissions data from the EPA and the California Department of Public Health in an attempt to find a correlation between the incidence of asthma treatments and emissions seen as triggers for asthma.

The project hopes to be a stepping stone for policy decisions concerning the value tradeoff between health care treatment and environmental regulation as well as where to concentrate resources based on severity of need.
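Linking two data sets and looking for a correlation might be sketched in pandas as a merge on a shared key followed by a Pearson correlation. This is a minimal illustration under assumed inputs: both tables, their column names, and the county-level join key are invented here and do not reflect the actual EPA or California Department of Public Health data.

```python
import pandas as pd

# Synthetic emissions table (invented values).
emissions = pd.DataFrame(
    {"county": ["A", "B", "C", "D"], "emissions_tons": [10.0, 20.0, 30.0, 40.0]}
)

# Synthetic asthma treatment table (invented values).
asthma = pd.DataFrame(
    {"county": ["A", "B", "C", "E"], "asthma_visits_per_10k": [12.0, 15.0, 21.0, 9.0]}
)

# Keep only counties present in both tables.
merged = emissions.merge(asthma, on="county", how="inner")

# Pearson correlation between the two measures.
r = merged["emissions_tons"].corr(merged["asthma_visits_per_10k"])
print(merged)
print(f"Pearson r = {r:.3f}")
```

A correlation on merged data like this does not establish that emissions cause asthma treatments; it is only a first step toward the kind of link the project describes.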

World Bank Data Analysis
Aisha Kigongo, Sydney Friedman, Ignacio Pérez

Our goal was to use a variety of tools to investigate the impact of project funding in developing countries. To do so, we looked at open data from the World Bank, which keeps close track of every project that gets funded, who funds it, and the goal of the project, whether agricultural, economic or related to health. Using Python, we used an index of the World Bank to see where the most-funded countries were and how they related to various indicators such as the Human Development Index, the Freedom Index, and, in future work, health, educational and other economic indexes. Our secondary goal is to analyze what insight open data can give us into how effective initiatives and funding actually are, as opposed to what they are meant to be.

Dr. Book
AJ Renold, Shohei Narron, Alice Wang

When we read a book, all the information is contained in that resource. But what if you could learn more about a concept, historical figure, or location presented in a chapter? Dr. Book expands your reading experience by connecting people, places, topics and concepts within a book to render a webpage linking these resources to Wikipedia.

Book Hunters
Fred Chasen, Luis Aguilar, Sonali Sharma

When we search for books on the internet, we are often overwhelmed with results coming from various sources. It’s difficult to get direct, trusted URLs to books. Project Gutenberg, HathiTrust and Open Library all provide an extensive library of books online, each with its own large repository of titles. By combining their catalogs, Book Hunters enables querying for a book across those different sources. Our project will highlight key statistics about the three datasets, including the number of books in all three data sources, formats, languages, and publishing dates. Beyond that, users can search for a particular book of interest, and we will return combined results from all three sources along with a direct link to the PDF, text or EPUB format of the book. This will be an exercise in filtering results for users and giving them easy access to the books they are looking for.
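Combining several catalogs and searching across them could look roughly like the following pandas sketch. The catalog contents, column names, and URLs are placeholders made up for this example; the real catalogs from Project Gutenberg, HathiTrust and Open Library have their own formats and would need to be parsed first.

```python
import pandas as pd

# Tiny synthetic catalogs with placeholder URLs (not real records).
catalogs = {
    "Project Gutenberg": pd.DataFrame(
        {
            "title": ["Moby Dick", "Emma"],
            "format": ["epub", "text"],
            "url": ["https://example.org/pg/1", "https://example.org/pg/2"],
        }
    ),
    "HathiTrust": pd.DataFrame(
        {
            "title": ["Moby Dick; or, The Whale"],
            "format": ["pdf"],
            "url": ["https://example.org/ht/1"],
        }
    ),
    "Open Library": pd.DataFrame(
        {
            "title": ["Walden"],
            "format": ["pdf"],
            "url": ["https://example.org/ol/1"],
        }
    ),
}

# Stack the catalogs into one table, remembering where each row came from.
combined = pd.concat(
    [df.assign(source=name) for name, df in catalogs.items()],
    ignore_index=True,
)


def search(catalog, query):
    """Case-insensitive substring match on the title column."""
    mask = catalog["title"].str.contains(query, case=False, regex=False)
    return catalog[mask]


hits = search(combined, "moby dick")
print(hits[["title", "format", "source", "url"]])

# Simple summary statistics across the combined catalog.
books_per_source = combined["source"].value_counts()
books_per_format = combined["format"].value_counts()
print(books_per_source)
print(books_per_format)
```

Substring matching is the simplest possible search; matching the same book across sources with differing titles would need more careful normalization.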