In this article, I want to show how computerized data analysis is within the grasp of all individuals and not just programming experts. To do this, I show how data science has worked hard to become near-invisible to you, the user. This means that you can now navigate analysis programs using common sense. Whether you are a scholar, a citizen, a teacher or a student, working with data using programs need no longer be intimidating. If, after reading this, you feel compelled to go try out a data analysis of the sort I present below, I will consider my work done.
I’ve picked, as a case study, a very interesting question. I want to know why people in India no longer want to have babies. The rate at which each Indian woman bears children has been rapidly dropping, to the point where some Indian cities now have among the lowest fertility rates in the world. The total fertility rate (TFR) is the number of children child-bearing women in a locale have on average. In the 1950s, the average TFR in India was nearly 6. That is, women had SIX children on average.
In a span of about 60 years, the Indian TFR has dropped from this high number – 5.88 (in 1959) – to 2.23 (in 2016), barely above the replacement rate. If the large drop in fertility continues, India will soon join many European nations in having sub-replacement fertility. This signals inevitable ageing of society, and a decline in the population size. From a country that was a population bomb waiting to explode, we have turned, rapidly, into a country with fewer children being born. But people don’t agree on how this has happened. Some say it is because of greater literacy, some say it is because of more economic development, others that it is a result of there being too many people. Can we concretely test the different hypotheses that try to explain this big change?
That is what I’m going to show you how to do, as an instructive example of how to run a computerized data analysis from start to finish. The first step is to find a reliable source for the data. Since this is a teaching demonstration, I’m going to rely on properly cited statistics reported on Wikipedia. In more formal settings, I would have set a higher bar for reliability, likely looking to government websites, but this will do for now. In particular, I extract data for the GDP per capita, the population density, the literacy rate, and the total fertility rate for Indian states and Union territories from Wikipedia and government websites. If there is a relationship between any of the first three and TFR, it should show up in the tests we will then run. As a bonus, I throw in a fourth predictor variable – the fraction of households with access to TV.
The author teaches Computer Science at IIT Kanpur. He can be reached at firstname.lastname@example.org.