3. Gathering Data

Although collecting data from Wikinews turned out to be the most time-consuming phase of our project, it is discussed here only briefly. While anyone can post an article, Wikinews employs a history mechanism to prevent abuse: every edit to an article is documented on the article's corresponding "history page", making it traceable who changed which part at what time. This enables users to revert corrupted articles to an earlier state and to identify who is responsible for the corruption.

Because the article history is generated automatically and is therefore highly formalized, its content is easy to parse; a short sketch illustrating this follows the list below. We generated two different datasets from the history pages by applying the following rules to create communication flows:

Dataset1: For each editor, include a communication flow from this editor to the user who is the immediate predecessor in the article history (note that this might also be the creator of the article).
Dataset2: For each editor, include a communication flow from this editor to the creator of the article.
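
Because the history pages are generated from a fixed template, extracting the edit records requires only a few lines of scripting. The following Python sketch illustrates the idea; the entry format assumed in the regular expression is a simplification for illustration, not the exact MediaWiki markup of that period:

    import re

    # Assumed, simplified format of one history entry (newest first):
    #   "12:34, 8 November 2004 SomeUser (edit summary)"
    ENTRY = re.compile(r'^(\d{1,2}:\d{2}, \d{1,2} \w+ \d{4}) (\S+)')

    def parse_history(raw_page):
        """Return the edit records of one article, oldest first."""
        edits = []
        for line in raw_page.splitlines():
            match = ENTRY.match(line.strip())
            if match:
                timestamp, user = match.groups()
                edits.append((timestamp, user))
        edits.reverse()   # history pages list the newest edit first
        return edits      # edits[0] corresponds to the creation of the article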

Dataset1 typically contains one communication flow for each article edit. The only exception is the case in which a user edits his or her own previous changes; we did not count this "soliloquy" as communication.

Dataset2 was created to be able to distinguish between creators and editors of articles. Note that because the creator is linked to every editor of an article, creating an article is weighted much more heavily in this dataset than editing one. For further analysis it is also important to keep in mind that the creator of an article is always the receiver of the corresponding communication flows.
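
Given the parsed edit list of an article, both datasets can then be derived in a few lines. The sketch below assumes the parse_history output from above; the text does not state explicitly whether a creator's later self-edits produce Dataset2 flows, so excluding them here is our assumption:

    def communication_flows(edits):
        # edits: list of (time, user) pairs for one article, oldest first,
        # as returned by parse_history above.
        creator = edits[0][1]
        dataset1, dataset2 = [], []
        for (_, predecessor), (time, editor) in zip(edits, edits[1:]):
            if editor != predecessor:   # skip "soliloquies"
                dataset1.append((editor, predecessor, time))
            if editor != creator:       # assumption: no self-flows in Dataset2 either
                dataset2.append((editor, creator, time))
        return dataset1, dataset2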

Both datasets are based on the articles posted between November 8, 2004 and November 7, 2005; we were therefore able to analyze the communication activity of the first year of the Wikinews project.

For each communication we stored the author, the time, and the title of the corresponding article, as well as its category, in the database. It would have been even more valuable, especially for the analysis of H4, to include the complete text of each article in the database. However, because there were almost 6,000 articles and hence more than 70,000 communications during the one-year period, the resulting amount of data would have been too large to process with the computing power available to us.
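
A single record in the database thus corresponds roughly to the following structure (the field names are ours, chosen for illustration):

    from dataclasses import dataclass

    @dataclass
    class Communication:
        author: str     # the editing user who triggers the flow
        time: str       # timestamp of the edit
        title: str      # title of the corresponding article
        category: str   # the article's category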

Our initial plan to create a third dataset from the discussion pages, on which users can discuss the corresponding articles (for example, concerning style or information content), was also carried out, but it did not lead to a dataset that would satisfy scientific criteria. This is due to the lack of structure on the discussion pages, which made it impossible to parse a sufficiently large part of the existing data successfully. Therefore, this part of our work is not described in detail in this paper.