Big data: a game changer for social scientists

Dominik Wurnig, Freelance, Berlin, Germany

PUBLISHED ON: 18 Nov 2015

Mike Savage once predicted the downfall of sociology – but now he is revising his pessimism. His famous 2007 essay “The coming crisis of empirical sociology” caused quite a stir. Back then, he argued that social scientists were falling behind the natural scientists, failing to make use of the oil of the 21st century: big data.

In a talk in the series “Big Data: Big power shifts?”, held on 5 November 2015 in Berlin, the sociologist from the London School of Economics and Political Science (LSE) concluded that, today, the most successful and popular social scientists build their work primarily on data analysis.

Nowadays, it would be unthinkable for intellectuals to construct grand theoretical edifices or propagate grand narratives like those of Michel Foucault or Jürgen Habermas. Still, there is a new star in the sky: social scientist Thomas Piketty, with his book “Capital in the Twenty-First Century”.

Piketty as a pioneer of big data

Piketty draws on data from various sources, focusing on income distribution, in order to illustrate complex relationships in a series of simple data visualisations. “Piketty is using big data, but he is not calling it big data,” Savage said.

Further, the French economist builds his critical reasoning on comprehensible data visualisations, combining a descriptive approach with a critique of the prevailing conditions.

Robert Putnam is said to take a similar approach in his book “Bowling Alone”, building his thesis of declining social integration on data about club memberships and other statistics. Another example of social science relying on big data is the book “The Spirit Level” by Richard Wilkinson and Kate Pickett, which focuses on social inequality. According to Savage, social scientists can only make up for their lack of technical expertise through better contextualisation.

At the early November event – organised by the Humboldt Institute for Internet and Society in collaboration with the Vodafone Institute for Society and Communications – Isabelle Sonnenfeld from Google News Lab made a similar statement: “Social scientists, unlike computer scientists, can come to a data source with a more complex and historical understanding.” The Mountain View firm has decided to make some of its most important data – search data – partially accessible via offerings such as google.com/trends. That said, the decisive factor is not the data itself, but rather its interpretation. “We provide aggregated and anonymised Google Trends data, but it is the journalists and academics who are contextualising it,” Sonnenfeld said.
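For researchers or journalists who want to work with this aggregated data programmatically, a query might look roughly like the sketch below. It uses the unofficial pytrends Python library, which is not part of Google's own offering and was not mentioned at the event; the keyword, region and timeframe are purely illustrative.

```python
# Minimal sketch: querying aggregated Google Trends data via the
# unofficial pytrends library (assumption: installed with `pip install pytrends`).
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=60)

# Illustrative keyword, region and timeframe - not taken from the article.
pytrends.build_payload(kw_list=["big data"], geo="DE",
                       timeframe="2010-01-01 2015-11-01")

# Returns a pandas DataFrame of relative (0-100), aggregated search interest.
interest = pytrends.interest_over_time()
print(interest.head())
```

As Sonnenfeld notes, the numbers themselves are only aggregated, anonymised indices; the analytical work lies in contextualising them.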

Why big data still has a long way to go

Google's practice of sharing some of its data with the public clearly demonstrates that access to big data is still unevenly distributed. Fortunately, more and more large companies – most recently Deutsche Bahn – as well as state institutions are banking on openness and making machine-readable datasets available to the public, as a web search for “gov data” shows. A significant problem remains, however: these data are often not very informative, because they usually lack two things: 1) relevant context and 2) sufficient granularity.

Deutsche Bahn, for example, has so far released only seven datasets, including a directory listing the lengths and heights of railway platforms in Germany. At the same time, far more interesting and informative data on the consumption and mobility patterns of the German people remain inaccessible to the public. Data journalist Lorenz Matzat therefore dismisses the datasets published so far as “Schnarchdaten” (“snoring data”). State administrations, too, are keeping back the more interesting datasets, or publishing them in a form that is hard to use: the City of Cologne, for example, has published its budget data in machine-readable form, but since the budget items are summarised in rough categories, the data remains difficult to decipher.

While many datasets are not published at all, there are also problems with the ones that are available. Typically, datasets are published in anonymised form, which is important for privacy protection. However, this makes it almost impossible to compare an anonymised dataset with another dataset. Yet being able to integrate and compare different data is precisely what scientists and the public need in order to gain insight.

An example: a supermarket chain collects data on the shopping habits of its customers via customer cards. Only a little demographic or personal data is connected to the customer card – mainly name and address. By itself, the dataset is of little interest, so – in order to gain more insight – the supermarket chain purchases additional data on demography, household size, age, hobbies, interests and more from a third party. The customer profiles are thus “filled with life”, allowing conclusions to be drawn about the possible motives behind purchasing decisions. Big data can only develop its full potential if different datasets can be connected.

The key to connecting different datasets is a so-called unique identifier, which serves to identify the same person across several datasets – in our example, mainly name and address. While companies and security agencies rely on integrating different datasets, researchers and journalists often do not have this option: firstly, because of a lack of financial resources, and secondly, because of ethical concerns about investigating and publishing such data.
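To make the mechanism concrete, here is a minimal sketch of how two such datasets could be linked on a unique identifier, using Python's pandas library; the column names and records are purely hypothetical and stand in for the supermarket example above.

```python
import pandas as pd

# Hypothetical customer-card purchases: the retailer knows little
# beyond name and address.
purchases = pd.DataFrame({
    "name": ["A. Meier", "B. Schulz"],
    "address": ["Hauptstr. 1, Berlin", "Ringweg 5, Köln"],
    "monthly_spend_eur": [310, 145],
})

# Hypothetical demographic data bought from a third party.
demographics = pd.DataFrame({
    "name": ["A. Meier", "B. Schulz"],
    "address": ["Hauptstr. 1, Berlin", "Ringweg 5, Köln"],
    "household_size": [4, 1],
    "age": [42, 29],
})

# Name and address together act as the unique identifier
# that links the two datasets into enriched customer profiles.
profiles = purchases.merge(demographics, on=["name", "address"], how="inner")
print(profiles)
```

If the name and address columns were removed or hashed differently in each dataset – as anonymisation typically requires – the merge above would fail, which is precisely the obstacle researchers and journalists face.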

Great opportunity vs. ethical concerns

For the social sciences, the fact that people (or data subjects) feel unobserved while producing data, unaware that they are an object of study, is both an ethical dilemma and a great opportunity. For example, it is now possible to examine income distribution or prostitution with the help of big data, whereas in the past voluntary disclosure and self-description often led to inaccurate results.

Of course, it can be argued that social scientists have always worked with large amounts of data – big data, if you will – in censuses, election analyses or large surveys. What is new about big data, however, is that much of the data is seemingly collected incidentally, not for a specific purpose as in a census. The online retailer Amazon, for instance, primarily sells products – but it also collects a lot of consumer data along the way. These data are stored and processed as a raw material, on the assumption that they will sooner or later be used for further analysis.

Social scientists as 'Jacks of all trades'?

It is not only access to large datasets that is unequally distributed, but also the skills to handle them. While companies such as Google have countless programmers and data analysts to interpret data, social scientists often work on their own.

Should aspiring sociologists thus also learn programming? Savage doesn’t think so: “If you had to actually learn those big data skills, that would be a big commitment – and you would lose a lot of theoretical and substantive skills too.” Instead, he argues, there need to be collaborations with programmers and data analysts. In mixed teams like these, the sociologists’ theoretical, critical and historical knowledge can help to interpret data.
