Data collection

 

Social networks are everywhere.  You just need to uncover them from the various datasets that are already out there, or from the data your system generates and collects. 

Here we present the datasets we have collected and used in our research.  By analysing these large datasets we have gained insight into aggregate and longitudinal human behaviour.  In your own analyses, we urge you to retain the temporal information about when the links and nodes in your social networks are instantiated.  In other words, ask yourself: if the links between nodes are instantiated only intermittently, can a static graph adequately represent the data?


Bluetooth

A large portion of our research relies on Bluetooth scans, sometimes also called mobility traces.  The CRAWDAD website hosts a number of such datasets, most of which use WiFi instead of Bluetooth.  We have written software that turns computers into what we call Bluetooth Scanners.  Each scanner continuously searches for nearby Bluetooth devices and records the date and time at which each device was seen (see figure below, left).  Using this data, we can recreate the visit history of each device.  In other words, we know when each device came near our scanners (see figure below, right).
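As a rough illustration of how a visit history could be recreated from raw scans, here is a minimal Python sketch.  It is not the scanner software itself.  It assumes each sighting is a `(scanner, device, timestamp)` tuple with the timestamp in seconds, and that two sightings of the same device at the same scanner belong to the same visit when they are at most `max_gap` seconds apart.  The function name, the record layout and the 60-second gap are all assumptions made for this example.

```python
from collections import defaultdict

# Assumed maximum gap (in seconds) between two sightings of the same visit.
MAX_GAP = 60


def build_visits(sightings, max_gap=MAX_GAP):
    """Group (scanner, device, timestamp) sightings into visits.

    Returns a dict mapping (scanner, device) to a list of (start, end) pairs.
    """
    # Collect all timestamps for each scanner/device pair.
    times = defaultdict(list)
    for scanner, device, ts in sightings:
        times[(scanner, device)].append(ts)

    visits = {}
    for key, stamps in times.items():
        stamps.sort()
        intervals = []
        start = prev = stamps[0]
        for ts in stamps[1:]:
            if ts - prev > max_gap:
                # Gap too long: close the current visit and start a new one.
                intervals.append((start, prev))
                start = ts
            prev = ts
        intervals.append((start, prev))
        visits[key] = intervals
    return visits


# Made-up sightings, for illustration only.
sightings = [
    ("s1", "dev_a", 0), ("s1", "dev_a", 30), ("s1", "dev_a", 50),
    ("s1", "dev_a", 500), ("s1", "dev_b", 40),
]
visits = build_visits(sightings)
```

With these made-up sightings, `dev_a` has two visits at scanner `s1` and `dev_b` has one.  The choice of `max_gap` matters: too small and a single visit fragments, too large and separate visits merge.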



From this data we are able to extract social networks using the following procedure:  each device is represented as a node, and we link together devices that have encountered each other.  In other words, we link devices that have been co-present near one of our scanners.  For example, in the figure below we have linked together those devices that encountered each other.
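The procedure above can be sketched in a few lines of Python.  This is an illustrative sketch under stated assumptions, not our production code: visits are assumed to be `(scanner, device, start, end)` tuples, and two devices are treated as co-present when their visit intervals at the same scanner overlap.  The function name and record layout are hypothetical.  Each link keeps the scanner and the overlap interval, so the temporal information is retained.

```python
from collections import defaultdict
from itertools import combinations


def copresence_edges(visits):
    """Link devices whose visits to the same scanner overlap in time.

    visits: iterable of (scanner, device, start, end) tuples.
    Returns a dict mapping a sorted (device, device) pair to a list of
    (scanner, overlap_start, overlap_end) encounters.
    """
    by_scanner = defaultdict(list)
    for scanner, device, start, end in visits:
        by_scanner[scanner].append((device, start, end))

    edges = defaultdict(list)
    for scanner, items in by_scanner.items():
        for (d1, s1, e1), (d2, s2, e2) in combinations(items, 2):
            if d1 == d2:
                continue  # two visits by the same device are not an encounter
            lo, hi = max(s1, s2), min(e1, e2)
            if lo <= hi:  # the two intervals overlap
                edges[tuple(sorted((d1, d2)))].append((scanner, lo, hi))
    return dict(edges)


# Made-up visits, for illustration only.
visits = [
    ("s1", "dev_a", 0, 50),
    ("s1", "dev_b", 40, 90),
    ("s1", "dev_c", 200, 260),
    ("s2", "dev_a", 300, 360),
    ("s2", "dev_c", 350, 400),
]
edges = copresence_edges(visits)
```

Here `dev_a` and `dev_b` are linked through scanner `s1`, and `dev_a` and `dev_c` through scanner `s2`, while `dev_b` and `dev_c` never overlap and stay unlinked.  Comparing every pair of visits per scanner is quadratic, which is fine for a sketch but would need a sweep over sorted intervals for large traces.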



The deployment of these scanners, and their positioning in physical space, is crucial in determining the kinds of data generated.  We draw on Space Syntax theory to help us differentiate between at least two modes of capturing data: gatecounts and static snapshots.

Gatecounts are used to establish the flows of people at sampled locations within the city over the course of a day.  A gate is a conceptual line across a street, and gatecounts entail counting the number of people crossing that line.  The scanner is positioned on or near the street and counts the number of people crossing the gate in either direction (without being able to tell which direction).

With static snapshots the open spaces of the city are considered in detail.  These spaces may be external, such as a plaza, or internal, such as a café.  The method can be used for recording both stationary and moving activities, and is useful when a direct comparison is being made between the two types of space use.  This method makes apparent the relationships between different types of space use in an urban area.  For each open space under consideration, the scanner records the movements in and out of the space.

Additionally, we have developed software for mobile devices that turns phones into mobile Bluetooth scanners.  Our software is an extension to Wireless Rope and can be downloaded from here (source is also available).  Please note that this software is meant to work with Facebook (see the following section).


Facebook

We decided to open up our Bluetooth software and platform, and enable members of the public to deploy their own Bluetooth scanners.  Our Facebook application is reported in the following articles:

  1. http://news.bbc.co.uk/2/hi/technology/6949473.stm

  2. http://www.newscientist.com/blog/technology/2008/02/towards-facebook-phone.html

To use the application itself, go to http://apps.facebook.com/cityware.  An overview of our platform can be seen in the following figure.



Effectively, users are able to turn their computers and mobile phones into Cityware nodes that carry out Bluetooth scanning and report the results to our servers.  Users can then use Facebook to view the results of their scans and see which devices they encounter.  A cool feature is that users can “tag” devices, which gives us a link between a Bluetooth ID and a user’s Facebook profile.


Email

The Enron email dataset is a well-known collection of emails exchanged between employees of the Enron corporation.  This dataset was published by the US government, and refined by the research community.  You can download this dataset from:

  1. http://www.cs.cmu.edu/~enron/

  2. http://www.isi.edu/~adibi/Enron/Enron.htm

The simplest way to extract social network data from the Enron dataset is to represent people as nodes, and link together those people who have exchanged email messages.  There are a few decisions you need to make when analysing this data: whether your social network is directed or undirected, what minimum number of messages is required before a link is instantiated, and how to deal with one-to-many email messages.  Additionally, you should consider whether you want to retain the temporal information about when each link is instantiated.
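To make these decisions concrete, here is a minimal Python sketch.  It assumes the messages have already been parsed into `(sender, recipients, timestamp)` tuples, which is not the raw format of the Enron files.  The function name, the parameters and the sample messages are hypothetical.  One-to-many messages are handled by counting one message per recipient, which is only one of several possible choices (another would be to down-weight messages with many recipients).

```python
from collections import defaultdict


def email_edges(messages, min_messages=2, directed=True):
    """Build links from (sender, recipients, timestamp) records.

    Returns a dict mapping a (person, person) pair to the sorted list of
    timestamps of the messages behind that link.  Pairs with fewer than
    min_messages messages are dropped.
    """
    stamps = defaultdict(list)
    for sender, recipients, ts in messages:
        for rcpt in recipients:
            if rcpt == sender:
                continue  # ignore messages people send to themselves
            if directed:
                key = (sender, rcpt)
            else:
                key = tuple(sorted((sender, rcpt)))
            stamps[key].append(ts)
    return {k: sorted(v) for k, v in stamps.items() if len(v) >= min_messages}


# Made-up messages, for illustration only.
messages = [
    ("alice", ["bob"], 1),
    ("alice", ["bob", "carol"], 2),
    ("bob", ["alice"], 3),
    ("carol", ["alice"], 4),
]
directed = email_edges(messages, min_messages=2, directed=True)
undirected = email_edges(messages, min_messages=2, directed=False)
```

With the same threshold, the directed network keeps only the alice-to-bob link, while the undirected one also links alice and carol because messages in both directions are pooled.  Keeping the timestamps on each link leaves the door open for temporal analysis later.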


Telephone logs

We have been privileged to have access to a dataset of 20 million records of phone calls made using landlines over a six-month period.  The dataset appears to come from large corporations, and includes both internal and external calls.  To extract social networks from this data we represent each phone number as a node, and link together phone numbers that have called each other.  There are a few decisions you need to make when analysing this data, such as whether your social network is directed or undirected, and what minimum number of calls is required before a link is instantiated.


Combining datasets

In general it is desirable to have multiple datasets that can be cross-referenced.  By combining and correlating datasets, your analysis can be much more powerful.  We have been fortunate enough to have a large number of users try our Facebook application.  This effectively gave us two datasets: their Bluetooth scans, and their Facebook profiles.  Hence, for this set of users, we are able to tell whether or not they are friends on Facebook, as well as whether or not they encounter each other in physical space (that is, whether they have been co-present near a scanner).
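A cross-reference of this kind could look roughly like the following Python sketch.  It assumes three inputs that are hypothetical simplifications of our data: a list of Facebook friend pairs, a list of device pairs that encountered each other, and a mapping from Bluetooth ID to user built from the "tags".  All names and sample values are made up for the example.

```python
def compare_networks(friend_pairs, encounter_pairs, device_owner):
    """Cross-reference Facebook friendships with Bluetooth encounters.

    friend_pairs: iterable of (user, user) pairs.
    encounter_pairs: iterable of (bluetooth_id, bluetooth_id) pairs.
    device_owner: dict mapping a Bluetooth ID to a user.
    Returns three sets of sorted (user, user) pairs.
    """
    friends = {tuple(sorted(pair)) for pair in friend_pairs}

    met = set()
    for d1, d2 in encounter_pairs:
        u1, u2 = device_owner.get(d1), device_owner.get(d2)
        if u1 is None or u2 is None or u1 == u2:
            continue  # untagged device, or two devices of the same user
        met.add(tuple(sorted((u1, u2))))

    return {
        "friends_and_met": friends & met,
        "friends_only": friends - met,
        "met_only": met - friends,
    }


# Made-up data, for illustration only.
device_owner = {"00:11": "ann", "00:22": "ben", "00:33": "cat"}
friend_pairs = [("ann", "ben"), ("ben", "cat")]
encounter_pairs = [("00:11", "00:22"), ("00:11", "00:33"), ("00:11", "00:99")]
result = compare_networks(friend_pairs, encounter_pairs, device_owner)
```

Encounters involving untagged devices (here `00:99`) are dropped, since they cannot be matched to a Facebook profile.  The three resulting sets separate friends who also meet in physical space, friends who were never seen together, and people who encounter each other without being Facebook friends.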