News

Updates in Version 1.0.2 (6-13-17)

  • Added $node_level_data identifying Senators in the Congress_Cosponsorship_XX datasets, making them easier to link to other metadata.

Updates in Version 1.0.1 (6-12-17)

  • A number of minor spelling fixes in the network_metadata.csv file.
  • In version 1.0 we incorrectly labeled some undirected networks as directed. This has been corrected in version 1.0.1.

Data Overview

This page contains documentation and download links for a dataset currently comprising 304 social and biological networks, curated to facilitate comparative studies of these networks. Where applicable, networks are accompanied by node-level covariate data, and detailed metadata is available for each network. In particular, we have hand-coded each network into one of a small number of broad categories (exchange, friendship, biological, etc.) to facilitate comparison within and across types of networks. These networks are drawn from a number of sources, all of which are documented in the metadata, and are available as R Lists of sociomatrices, igraph objects, and network data objects compatible with the Statnet suite of R packages. This project was funded in part through the NSF Big Data Social Science IGERT Program at Penn State, and the data are freely available on this website. If you are interested in a more in-depth application using some of the networks in this dataset, please visit the source webpages or read the source papers (both documented in the metadata file), as they will often contain detailed documentation.

Our goal in developing this dataset and website was to improve access to large numbers of network datasets in a common and well-documented format. A number of previous studies have looked at samples of networks, but they have often relied only on sources the authors were aware of, and have involved re-collecting and reorganizing existing network datasets. We believe this represents a great deal of wasted effort, and this project was designed to prevent it in the future. In service of this goal, we are actively looking for additional contributors to this project and dataset. If you have network data or know of additional network data sources that could be included in this dataset, please email mdenny@psu.edu; we would love to collaborate with you on this project.

Contributors

The following people contributed to this dataset:

  1. Cassie McMillan is a PhD Student in the Department of Sociology at Penn State. You can email her at clm453@psu.edu and check out her website [here].
  2. Sayali Phadke is a PhD Student in the Department of Statistics at Penn State. You can email her at sayalip@psu.edu and check out her website [here].
  3. Mitchell Goist is a PhD Student in the Department of Political Science at Penn State. You can email him at mlg307@psu.edu and check out his website [here].
  4. Matt Denny is a PhD Student in Political Science and Social Data Analytics at Penn State. You can check out the rest of my website [here].

If you would like to contribute to the dataset, or if you believe there is an error in any of our metadata, please email mdenny@psu.edu.

This dataset may be cited as:

Dataset Documentation

The data may be downloaded in a number of formats (along with metadata) by selecting one of the following options:

  1. Metadata for all networks (including source paper citations) can be downloaded as a [.csv] or an [.RData] object.
  2. An R List object containing networks represented as (dense) sociomatrices (adjacency matrices) with one list entry per network can be downloaded here as an [.RData] object (approximate file size 483Kb). See below for a full description of each list entry.
  3. An R List object containing networks represented as igraph network objects with one list entry per network can be downloaded here as an [.RData] object (approximate file size 7.93Mb). See below for a full description of each list entry.
  4. An R List object containing networks represented as network objects compatible with the Statnet libraries with one list entry per network can be downloaded here as an [.RData] object (approximate file size 2.22Mb). See below for a full description of each list entry.
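
As a minimal sketch of reading one of these .RData files into R: the name of the object stored inside each download is not stated on this page, so the example below captures it from the return value of load() rather than assuming it. The file and the object name network_list are hypothetical stand-ins created in a temporary location so that the example runs on its own.

```r
# Hypothetical stand-in for a downloaded file: a one-entry mock list written
# to a temporary .RData file. The real downloads contain 304 entries.
network_list <- list(
  list(
    network = matrix(0L, nrow = 2, ncol = 2),
    node_level_data = NULL,
    metadata = list(name = "Mock Network")
  )
)
path <- tempfile(fileext = ".RData")
save(network_list, file = path)
rm(network_list)

# load() restores objects under the names they were saved with and returns
# those names as a character vector, so the object can be fetched without
# knowing its name in advance.
loaded_names <- load(path)
networks <- get(loaded_names[1])

length(networks)
```

With a real download, path would instead point to wherever the .RData file was saved.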

Each of the data list objects follows the same general structure. Each is an R List object of length 304, with each entry representing a network and with the List indices matching the row indices in the Metadata file. As depicted below, each entry in these list objects is itself a List object with three fields: $network, $node_level_data (which is NULL if node-level covariates were not available for the network), and $metadata. The $network field varies based on the way the network is represented (as a numeric matrix, igraph object, or statnet network object), but the other two fields are identical across the three representations. The $node_level_data field is a data.frame when available, and its rows correspond to the rows/columns of the network. The $metadata field holds much of the same information as the stand-alone metadata file and is designed to make it easy to filter networks inside a loop by checking its values. This structure was designed to facilitate efficient automated analysis across multiple networks using loops or the apply() family of functions.

List of 3
 $ network        : int [1:50, 1:50] 0 0 0 0 0 0 0 0 0 0 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:50] "V1" "V2" "V3" "V4" ...
 $ node_level_data:'data.frame':	50 obs. of  4 variables:
  ..$ Alcohol: int [1:50] 3 2 2 2 3 4 4 4 2 4 ...
  ..$ Drugs  : int [1:50] 1 2 1 1 1 1 3 3 1 1 ...
  ..$ Tobacco: int [1:50] 2 3 1 1 1 1 1 3 1 1 ...
  ..$ Sports : int [1:50] 2 1 1 2 2 2 1 2 2 2 ...
 $ metadata       :List of 9
  ..$ name        : chr "Girls' Friendships 1"
  ..$ directed    : logi TRUE
  ..$ valued_edges: logi FALSE
  ..$ category    : chr "friendship"
  ..$ source_URL  : chr "https://sites.google.com/site/ucinetsoftware/datasets/50women"
  ..$ description : chr "This is a friendship network of a cohort of girls attending a school in Western Scotland..."
  ..$ num_nodes   : int 50
  ..$ node_type   : chr "student"
  ..$ edge_type   : chr "friendship"

A writeup of the procedures used to compile this dataset, along with many of the descriptive statistics from this page, can be downloaded [here].

Data Overview and Descriptive Statistics

The data comprise 304 social and biological networks. We began by coding these networks into one of (currently) 9 broad categories: Association, Biological, Ecological, Exchange, Friendship, Kinship, Perception, Support, and Transportation. These categories are described in greater detail below and they allow us to study variation in the properties of these networks across different types of nodes and ties. While we make no claims that these categories are definitive, they serve as a basis for making comparisons between networks, or for looking at particular types of networks.

  1. Association: This category primarily captures relationships of group co-membership including the number of movies actors have co-starred in, whether two students went to the same school, or the number of scenes two characters in a book shared.
  2. Biological: This category includes metabolic, protein, and gene interaction networks. This category of networks is distinguished from ecological networks by the nodes, which are not autonomous in this classification.
  3. Ecological: This category includes interactions, flows, and relationships among animals and ecosystems. Some examples include dominance relationships among cattle, hens, and female sheep, the count of interactions between kangaroos, a monkey-grooming network, and the carbon flow network.
  4. Exchange: This category includes trade relationships at the national and local levels. Examples include the volumes of raw materials exchanged between countries and the amount of Taro exchanged among 22 households in a Papuan village, as well as a number of communication networks.
  5. Friendship: This category records friendship relations between people in a number of different contexts (both in person and online). Some examples include the self-assessed friendship networks of high school and college students, prison inmates, bank employees, and monks.
  6. Kinship: This category includes networks of familial relationships, often recorded over a long time period.
  7. Perception: This category includes networks that were collected by asking respondents to give their perception of relationships (romantic, social, friendship, etc.) among a group of their peers or subordinates.
  8. Support: This category primarily includes networks of social support and advice giving. Some examples include advice giving networks in several firms, a law office, and the Harry Potter books, as well as legislative co-sponsorship networks.
  9. Transportation: This category includes transportation links between cities and countries. For example, one of the networks in this category records whether there is a direct flight between two U.S. cities.

The table below provides descriptive statistics for networks in each of these categories (as well as the entire dataset). These include the minimum, median, and maximum number of nodes in networks assigned to that category, the average proportion of non-zero edges in networks assigned to that category, and the count of networks assigned to that category. Perception and Friendship networks are the two largest categories, and currently make up over half of the dataset. One important aspect of the dataset is that it does not include any particularly large networks. This is a conscious choice designed to ensure that most forms of statistical analysis can be applied to these networks, and to ensure that the resulting aggregate file sizes would not be prohibitively large. If the reader is interested in performing comparative studies using very large networks, we suggest they look at the data available through Stanford's SNAP Lab. In addition to practical concerns associated with storing and analyzing very large networks, we also believe that there are likely to be substantial differences in the way that network processes operate at the scale of millions of nodes as opposed to the scale of tens of nodes. This focus on smaller networks is reflected in a median network size of just 34 nodes in the dataset, with a maximum network size of under 2,000 nodes.

Category        Min. # Nodes  Median # Nodes  Max. # Nodes  Prop. Non-Zero Edges  # Networks
Association               14              34           410                  0.41          29
Biological               212             453          1706                  0.01           3
Ecological                16              28            62                  0.28          11
Exchange                  10              24           293                  0.30          17
Friendship                14              31           336                  0.25         123
Kinship                   20              25            25                  0.14           5
Perception                44              44            44                  0.05          39
Support                   12              22          1899                  0.31          75
Transportation          1174            1374          1574                  0.01           2
All Networks              10              34          1899                  0.25         304
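
Statistics of the kind reported in this table can be computed with base R once each network has been reduced to its category, node count, and proportion of non-zero edges. The sketch below shows one way to do so. The data frame net_summary and its values are made up for illustration and are not taken from the dataset, and summarise_group is a hypothetical helper.

```r
# Mock per-network summary; the values are illustrative only.
net_summary <- data.frame(
  category  = c("friendship", "friendship", "friendship", "exchange", "exchange"),
  num_nodes = c(14L, 31L, 50L, 10L, 24L),
  density   = c(0.20, 0.30, 0.25, 0.40, 0.20),
  stringsAsFactors = FALSE
)

# One row of descriptive statistics for a group of networks.
summarise_group <- function(d) {
  data.frame(
    min_nodes    = min(d$num_nodes),
    median_nodes = median(d$num_nodes),
    max_nodes    = max(d$num_nodes),
    mean_density = mean(d$density),
    n_networks   = nrow(d)
  )
}

# Split by category, summarise each group, and stack the rows. The names of
# the split list become the row names of the result.
by_category <- do.call(
  rbind,
  lapply(split(net_summary, net_summary$category), summarise_group)
)

# Append a row for the whole collection.
overall <- summarise_group(net_summary)
rownames(overall) <- "All Networks"
category_table <- rbind(by_category, overall)

print(category_table)
```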

The figure below plots the size of each network (on a log scale) against its proportion of non-zero edges for all 304 networks in our dataset, with points colored by category. As we can see, there is a great deal of heterogeneity in the proportion of non-zero edges across different categories and network sizes, with communication networks showing some of the highest variability across these dimensions.

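[Figure: network size (log scale) versus proportion of non-zero edges for each network, colored by category.]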

Data Sources

We did not collect most of these networks ourselves. The goal of this project is to make it easier for researchers to perform studies of multiple existing networks, so we sourced our data from a number of wonderful existing sources. Please check out their websites and refer to them if you are interested in studying any of the networks in this dataset in greater detail. Links to the source website for the networks in our dataset are provided in the metadata files. Note that we do not have web links for all datasets, so the source paper field is often a better way to find a description of a particular dataset. Additionally, here are a number of links to network data sources (some of which are not yet included in our dataset).

  1. The UCINet data repository.
  2. Stanford's SNAP Lab.
  3. Tore Opsahl's website.
  4. The National Institute of Standards and Technology: Complex Networks Data Sets.
  5. Mark Newman's website.
  6. The Koblenz Network Collection.
  7. The Arizona State University Network Data Archive.
  8. Stanford's SoNIA group website.
  9. Princeton's International Networks Archive.
  10. The Gephi network dataset archive.
  11. The LINK Group at Semmelweis University.
  12. Alex Arenas' website.
  13. Jake Hoffman's network data repository.
  14. Harvard Dataverse.
  15. The Abdul Latif Jameel Poverty Action Lab Dataverse.
  16. The SIENA network data webpage.
  17. The ICPSR data webpage.
  18. The Duke Network Analysis Center's network data webpage.