Appendix: Influences on Tag Choice on del.icio.us
by Emilee Rader and Rick Wash (CSCW '08)

This web page contains all the appendix information for our CSCW '08 Paper Influences on Tag Choice on del.icio.us [ slides ]. Via this online appendix, we are providing access to our datasets, additional statistical results, and the code we wrote for our computer model and to conduct our analyses.

Statistical Results: Table 1 in the paper (the big table summarizing the logistic regression results) is just a summary of our logistic regressions. Our statistics software output a lot more diagnostic and substantive information. It is available here:

Code: We used the R programming language (as in the R Statistics software) for our logistic regression and interuser agreement analyses, and our computer models. R is available for all major platforms; you will need to download and install it before running our code. UCLA's Academic Technology Services has an excellent set of instructions and examples for installing and using R.

The following files are available with a BSD-style license. The top of each file includes the license and information about its use:

  • distcomp.R: For a given dataset that approximately resembles a powerlaw, this code compares the goodness-of-fit of seven different probability distributions using Maximum-Likelihood estimation and the Kolmorogrov-Smirnov test.
  • logistic.R: Primary code for the logistic analysis study in the paper. Also includes code for transforming the dataset into a form the logistic regression can use, and displaying the results.
  • iua_site.R: Code for calculating the inter-user agreement between users of a site
  • montecarlo.R: Primary code for the computer models (simulations) study in the paper.
  • plfit_sim.R: Calculates the powerlaw fit and inter-user agreement for the computer models
  • powerlaw.R: Support code. R doesn't include either discrete or continuous powerlaw distributions by default. Includes code for generating random numbers from a powerlaw, and fitting / plotting a powerlaw.
  • parse_all_url.pl: A perl script to parse a directory of .html files saved from del.icio.us/url/

Datasets: We will make both of our datasets available to those who would like to replicate and extend our analyses. This includes the data we scraped from del.icio.us, and the data generated by our computer models. These datasets are very large (over 2GB), and we are unable to host them permanently online. If you are interested, contact us by email in order to work out transfer arrangements: emilee@gmail.com and rick.wash@gmail.com. Serious inquiries only, please --- these are not general purpose datasets. The data were collected from del.icio.us using a sampling frame specific to our research, and therefore may not be suited to your research questions or objectives. We are providing the computer model data so that you may replicate our analyses exactly; subsequent runs of the model using the same parameters may not produce exactly the same numbers due to random number generation.

Computer Model: Our iConference 2008 paper goes into much greater detail about the specifics of the computer models and our design decisions. In particular, the random number generator was set up as follows:

  • Number of tags per user: For each run, a meanlog and sdlog were chosen at random. For each user in the run, the number of tags for that user is chosen from the discrete log normal distribution with the parameters meanlog and sdlog.
    • meanlog was chosen from a normal distribution with mean 0.824073727104055 and standard deviation 0.454489864656875, constrainted to be greater than 0.
    • sdlog was chosen from a nomral distribution with mean 0.475836488317795 and standard deviation 0.129929430171246, constrained to be greater than 0.
    • These parameters were estimated from the empirical del.icio.us data. See our iConference 2008 paper for more details.
  • Specific tags chosen: For each run, an alpha is chosen, and used consistently for that run. Each time a random tag is needed, a number is chosen from the discrete powerlaw distribution with explonent alpha and xmin = 1.
    • alpha was chosen from a normal distribution with mean 2.5 and standard deviation 0.40163, constrained to be greater than 1.
    • When multiple tags were needed, the discrete powerlaw distribution was repeatedly sampled until a sufficient number of unique numbers were returned.
    • The empirically estimated alpha was approximately 1.92 (see the iConference 2008 paper). However, because of this resampling, the final alpha values in the simulations were too low. Experimentation suggested that an alpha of approximately 2.5 would yield simulations with exponents approximately equal to the empirically observed value of 1.92.

Database Schema: The R files and perl script linked to above assume a database of information that has the same layout as ours. Here are the schema (from MySQL):

CREATE TABLE  `delicious`.`jan2007_site` (
  `id` int(11) NOT NULL auto_increment,
  `deliciousID` varchar(200) NOT NULL default '',
  `title` varchar(400) default NULL,
  `url` varchar(500) default NULL,
  `user` varchar(200) default NULL,
  `date` date default NULL,
  `position` int(11) default NULL,
  PRIMARY KEY  (`id`),
  KEY `deliciousID` (`deliciousID`),
  KEY `date` (`date`),
  KEY `position` (`position`),
  KEY `user` (`user`),
  KEY `id_date` (`deliciousID`,`date`)
)
and
CREATE TABLE  `delicious`.`jan2007_tag` (
  `id` int(11) NOT NULL auto_increment,
  `site_id` int(11) default NULL,
  `tag` varchar(200) default NULL,
  `position` int(11) default NULL,
  PRIMARY KEY  (`id`),
  KEY `tag` (`tag`),
  KEY `deliciousID` (`site_id`)
)
Also, for downloads of users' bookmarks, you also need this table:
CREATE TABLE  `delicious`.`jun2007_user` (
  `id` int(11) NOT NULL auto_increment,
  `user` varchar(200) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `user` (`user`)
)
and an extra column in the site table:
CREATE TABLE  `delicious`.`jun2007_site` (
  `id` int(11) NOT NULL auto_increment,
  `deliciousID` varchar(200) default NULL,
  `title` varchar(400) default NULL,
  `url` varchar(500) default NULL,
  `user_id` int(11) NOT NULL default '0',
  `date` date default NULL,
  `count` int(11) default NULL,
  `position` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY  (`id`),
  KEY `deliciousID` (`deliciousID`),
  KEY `date` (`date`),
  KEY `id_date` (`deliciousID`,`date`),
  KEY `user` (`user_id`)
)
And of course, the standard tag table.
Also, there are two more tables that are used for temporary storage (they will be filled by the code):
CREATE TABLE  `delicious`.`logdata_temp` (
  `deliciousID` varchar(200) default NULL,
  `user` varchar(200) default NULL,
  `prefix` varchar(50) default 'jan2007',
  `id` int(11) NOT NULL auto_increment,
  `finished` tinyint(1) NOT NULL default '0',
  PRIMARY KEY  (`id`)
)
and
CREATE TABLE  `delicious`.`logistic_data` (
  `id` int(11) NOT NULL auto_increment,
  `site` varchar(200) NOT NULL,
  `user` varchar(200) NOT NULL,
  `tag` varchar(200) NOT NULL,
  `chosen` tinyint(1) NOT NULL,
  `used.onSite` tinyint(1) NOT NULL,
  `used.byUser` tinyint(1) NOT NULL,
  `fromUserTags` tinyint(1) NOT NULL,
  `fromSiteTags` tinyint(1) NOT NULL,
  `position` int(10) unsigned NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `site` (`site`),
  KEY `user` (`user`),
  KEY `tag` (`tag`),
  KEY `site_position` (`site`,`position`)
)