Applied Data Science @ Columbia


Project American Community Survey

Jan 29, 2016

For the very first project of this course, we looked at the American Community Survey.

The 2013 ACS is hosted on Kaggle as a “Kaggle script” challenge. This project led to some unexpected learning experiences. My initial thought was to make it a team-building experience: team members would browse the 700+ Kaggle scripts, discuss their favorites, and collectively come up with an idea better than what they had seen.

The data is very structured, clean, and big but manageable. The context is easy to understand and the variables do not require much domain knowledge. Everyone understands the broad analysis goal, yet there are so many possibilities. Simple project!

Well, as it turned out, I underestimated how interesting this project would be. For this project alone, in a one-hour in-class brainstorming session and in my (incomplete) communication with a subset of the teams, we have discussed

  • sampling and survey weights
  • association and confounding
  • missing values and imputation (allocation)
  • nonresponses and their codes
  • spatial visualization
  • clustering
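
To make the first item concrete, here is a minimal sketch of why survey weights matter. It is not taken from any team's script. The records, the column names and the numbers are all made up for illustration. The only assumption carried over from the ACS is that each record comes with a weight saying roughly how many people it represents (in the person files this column is, if I recall correctly, PWGTP).

```python
# Toy records: "income" and "weight" are hypothetical column names,
# and the values are invented for illustration only.
records = [
    {"income": 30000, "weight": 10},
    {"income": 50000, "weight": 30},
    {"income": 90000, "weight": 60},
]


def unweighted_mean(rows, value_key):
    """Average that treats every sampled record as one person."""
    return sum(r[value_key] for r in rows) / len(rows)


def weighted_mean(rows, value_key, weight_key):
    """Average in which each record counts as many times as its weight."""
    total_weight = sum(r[weight_key] for r in rows)
    return sum(r[value_key] * r[weight_key] for r in rows) / total_weight


plain = unweighted_mean(records, "income")
weighted = weighted_mean(records, "income", "weight")
print("unweighted:", round(plain, 2))
print("weighted:  ", round(weighted, 2))
```

In this toy sample the two averages differ by a lot, because the high-income record stands in for most of the population. Ignoring the weights answers a question about the sample, not about the population the survey is meant to describe.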

I am very much looking forward to seeing the final products!