Applying Social Network Analysis to Wikinews

Abstract

This paper reports first results of a Social Network Analysis (SNA) of Wikinews contributors using the tool TeCFlow. We describe how SNA can provide interesting insights into both the roles of individual users and the attributes of the Wikinews network itself, also in relation to external events. The analysis is based on data we gathered from articles that were posted between November 8, 2004 and November 7, 2005, containing the communication activity of the first year of the Wikinews project.

We find that the SNA concept of centrality is suitable to describe the importance of a user for the Wikinews network and that high centrality corresponds well with the granting of admin status by the Wikinews community. While we could identify a group of highly central contributors, we could not identify particular subject matter experts. This runs contrary to the original aim of Wikinews, which is to get “citizen reporters” to directly contribute late-breaking news. We used SNA to determine whether users report about events they personally experienced or rather spread news they discovered in other sources, and find that the latter is true. By doing an SNA-based content analysis of the messages posted, we could also obtain an overview of the most important events of the last year in correct chronological order.

1. Introduction

Prior analyses of online communities have applied SNA to blogs [Blo06] or web forums [Hol04]. Both types of communities have in common that the communication links between the users are well-defined: For example, one blog links to another blog or one forum user answers a question another user has posted.

With our work we assess to what extent the concept of SNA can be applied to a social network that is not represented by communication data containing well-defined links between the actors in the network. Instead, the existing data first had to be mapped appropriately. To do so, we model social ties between actors as co-authorship of an article, in particular between the creator of an article and its subsequent editors.

2. About Wikinews

Wikinews is one of the projects of the Wikimedia Foundation, which “is an international non-profit organization dedicated to encouraging the growth, development and distribution of free, multilingual content, and to providing the full content of these wiki-based projects to the public free of charge” [Wik05].

As stated in its charter, “Wikinews is a free content news source [...]. Wikinews allows anyone to report news on a wide variety of subjects. Its mission, as stated on the main page, is to create a diverse environment where citizen journalists can independently report the news on a wide variety of current events. [...]

Wikinews started with the establishment of a demonstration wiki in November 2004, which was then moved into beta stage one month later. In the middle of March 2005 there were already more than 1,000 articles in Wikinews” [Wiki05].

In contrast to other news portals, Wikinews enables users to write articles in a collaborative manner. While Yahoo! News and Google News only categorize news published online by news agencies, newspapers, TV stations and the like, indymedia.org is the news site that comes closest to the idea behind Wikinews. The www.indymedia.org newswire allows for open and even anonymous publishing of self-written articles and enables users to comment on other users’ articles. However, it is not possible to edit published articles. Indymedia focuses on enabling reporters to publish articles independently of “corporate coverage” [Ind06], but does not support the collaborative development of content.

3. Gathering Data

Although collecting data from Wikinews turned out to be the most time-consuming phase in our project, it will be discussed here only briefly. While everybody can post an article, Wikinews employs a history mechanism to avoid abuse. The history mechanism makes all edits to an article traceable in terms of who changed which part at what time. Wikinews documents every change of an article on the article’s corresponding “history-page”. This enables users to reset corrupted articles to an earlier state and to find out who is responsible for the corruption.

Because the article-history is created automatically and is therefore highly formalized, its content is easy to parse. We generated two different datasets from the history-pages, applying the following rules in order to create a communication flow:

Dataset1: For each editor, include a communication flow from this editor to the user who is the immediate predecessor in the article-history (note that this might also be the creator of the article).
Dataset2: For each editor, include a communication flow from this editor to the creator of the article.

Dataset1 typically contains one communication flow for each article edit. The only exception is the case in which a user edits his or her own immediately preceding changes; we did not count this “soliloquy” as communication.

Dataset2 was created in order to be able to distinguish between creators and editors of articles. Note that because the creator is linked to every editor of an article, in this dataset creating an article is weighted much higher than editing one. For further analysis it is also important to keep in mind that an article creator is always the receiver of the corresponding communication flows.
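
The two mapping rules can be illustrated with a minimal sketch. It assumes, purely for illustration, that an article history is available as a chronological list of user names whose first entry is the creator; the function names are hypothetical and not part of TeCFlow or of our parser. Whether edits by the creator to his or her own article should be skipped in Dataset2 is an assumption made here by analogy with the “soliloquy” rule.

```python
from typing import List, Tuple

# Hypothetical representation of one article history: user names in
# chronological order, the first entry being the creator of the article.
History = List[str]
Flow = Tuple[str, str]  # (sender, receiver)


def dataset1_flows(history: History) -> List[Flow]:
    """Dataset1 rule: link each editor to the immediate predecessor."""
    flows = []
    for previous, current in zip(history, history[1:]):
        if current != previous:  # a "soliloquy" is not counted
            flows.append((current, previous))
    return flows


def dataset2_flows(history: History) -> List[Flow]:
    """Dataset2 rule: link each editor to the creator of the article."""
    creator = history[0]
    # Assumption: the creator editing his or her own article is skipped.
    return [(editor, creator) for editor in history[1:] if editor != creator]


history = ["alice", "bob", "bob", "carol", "alice"]
print(dataset1_flows(history))
print(dataset2_flows(history))
```

In this toy history the creator receives every flow in Dataset2, which is why creating an article carries more weight there than editing one.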

Both datasets are based on the articles that were posted between November 8, 2004 and November 7, 2005. Therefore, we were able to analyze the communication activity of the first year of the Wikinews project.

For each communication we included the author, the time and the title of the corresponding article as well as its category in the database. It would have been even more valuable, especially for the analysis of H4, to include the complete text of each article in the database. However, because there were almost 6,000 articles and hence more than 70,000 communications during the one-year period, the resulting amount of data would have been too large to process with the computing power available to us.

We also carried out our initial plan to create a third dataset from the discussion pages, on which users can discuss the corresponding articles (for example concerning style or information content), but this did not lead to a dataset that would satisfy scientific criteria. This is due to the lack of structure of the discussion pages, which made it impossible to successfully parse a sufficiently large part of the existing data. Therefore, this part of our work is not described in detail in this paper.

4. Analyzing Wikinews with TeCFlow

Social Network Analysis (SNA) measures relationships between people, organizations, or other entities by constructing a network of ties between those entities or “actors”. SNA provides both a visual and mathematical way to analyze those ties [Was94], helping to understand the importance of an actor in the network.

TeCFlow [Glo04] is a tool to automatically analyze social networks based on communication logs. It creates a database from email logs, web links, phone archives, and the like. The entries in the database can be analyzed for different persons, content or other typical categories of social networks. To visualize the results, TeCFlow creates graphs that show the actors as nodes and the relation between them as lines.

An important feature of TeCFlow is that it can show social networks in a dynamic way. The network is not only shown at a single point in time; rather, TeCFlow visualizes the changes of a structure over time. This way TeCFlow creates an interactive movie in which the nodes change their position according to their current position in the network.

In general we studied the Wikinews Network at two different levels of abstraction: On the one hand we analyzed the individual actors and the roles they assume in the network. On the other hand we looked into attributes of the network itself, also in relation to external events.

We formulated the following four hypotheses:

H1: There are a small number of key contributors to Wikinews who form a core network.
H2: Users contribute to various categories. Therefore, no separate interest networks exist.
H3: Active article writers have first been active commentators and later active article editors until they have evolved to actively writing new articles.
H4: External events as well as shifting interests in the community are reflected in the network.

Note that Hypotheses 1 to 3 correspond to the actor-level analysis, while Hypothesis 4 deals with the network level and external influences.

In the following sections each individual hypothesis is discussed in detail. This involves explaining the goals we planned to achieve with analyzing the hypothesis as well as describing and critically reviewing the approach we took to test it.

5. A few key contributors form a core network

The idea behind this first hypothesis was to discover whether the Wikinews network benefits from the work of a relatively small number of very active contributors or whether the extent of contributions is more or less the same for all users.

Our expectation was that there would be some very active users, while on the other hand there would be users who cease contributing quickly, perhaps after having contributed only once. By analyzing the social structure of Wikinews using TeCFlow, we expected to be able to make a more precise statement and to corroborate the hypothesis.

We used TeCFlow’s Static View Function in order to generate an overall view of the social network, anticipating that this would help to identify the key contributors by means of their centrality in the network. Because of the amount of data, the resulting network is very complex; it gained clarity once we colored the actors according to their Betweenness Centrality (Figure 1). It turned out that there are indeed only very few central users in the Wikinews network.

Figure 1

What could not be resolved is whether the high centrality is a result of contributing more articles than others, or whether it must be attributed to editing or even resetting articles in a policing manner. However, the fact that most of the very central users act as so-called “admins” for Wikinews, and that the community grants them this status based on their activity, indicates that the former is true.
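
To make the centrality measure used in this section concrete, the following sketch computes betweenness centrality for a small undirected network. It is a plain re-implementation of Brandes’ algorithm written for illustration; it is not TeCFlow’s code, and treating communication flows as undirected, unweighted links is an assumption.

```python
from collections import defaultdict, deque


def betweenness(edges):
    """Unnormalised betweenness centrality (Brandes' algorithm) for an
    undirected, unweighted graph given as an iterable of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    score = {node: 0.0 for node in adj}
    for source in adj:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = defaultdict(list)
        sigma = defaultdict(int)
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse breadth-first order.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every unordered pair of actors was visited from both endpoints.
    return {node: value / 2.0 for node, value in score.items()}


# Toy flows (sender, receiver); the user names are invented.
flows = [("bob", "alice"), ("carol", "alice"), ("dave", "alice"), ("dave", "erin")]
ranking = sorted(betweenness(flows).items(), key=lambda item: -item[1])
print(ranking[:2])
```

Sorting users by this score and inspecting the top of the list corresponds to identifying the darkest-colored actors in the static view.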

6. There is no domain-specific interest network

The second hypothesis reflects our assumption that most users do not use Wikinews to report about events they personally experienced, but to spread news they discovered in other sources. This is especially interesting considering that, according to Jimmy Wales, president of Wikimedia, Wikinews “aims to include more original reporting” [Wei05], which we believe has not yet been achieved. However, our procedure can be used for monitoring future progress in this matter.

Our first approach to testing the hypothesis used TeCFlow’s Dynamic View. The basic idea was that, if our hypothesis were incorrect, Dataset2 should contain star-like structures, since for each article there are links between its creator and all its editors. The creator of each article should be at the center of such a star structure, so it should be possible to distinguish between editors and creators of articles. If, on the other hand, our hypothesis were correct, editors would be linked to various actors and the individual star structures would overlap, giving the overall network a very democratic structure.

We conducted both analyses using TeCFlow’s Dynamic View with a time window of 50 days in order to avoid too much overlapping simply due to the large timeframe (as we observed it in the Static View). As expected, no star-structures could be identified (Figure 2). Links between editors showed that most users edit on a variety of topics.

Figure 2
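
We judged the presence of star structures visually. As a rough numerical counterpart, one could measure what share of editors are attached to a single creator within each time window. The sketch below does this; the measure and the function names are hypothetical illustrations rather than features of TeCFlow, and the input is assumed to be Dataset2 flows tagged with a day number.

```python
from collections import defaultdict


def single_creator_share(flows):
    """Share of editors whose Dataset2 flows all point to one creator.

    `flows` is an iterable of (editor, creator) pairs. A value near 1 means
    editors sit on the rim of separate stars; a value near 0 means the
    stars overlap heavily.
    """
    creators_of = defaultdict(set)
    for editor, creator in flows:
        creators_of[editor].add(creator)
    if not creators_of:
        return 0.0
    single = sum(1 for creators in creators_of.values() if len(creators) == 1)
    return single / len(creators_of)


def windowed_share(timed_flows, window_days=50):
    """Apply the measure per time window; `timed_flows` holds
    (day_number, editor, creator) triples."""
    windows = defaultdict(list)
    for day, editor, creator in timed_flows:
        windows[day // window_days].append((editor, creator))
    return {index: single_creator_share(items)
            for index, items in sorted(windows.items())}


separate = [("b", "a"), ("c", "a"), ("e", "d")]
overlapping = [("b", "a"), ("b", "d"), ("c", "a"), ("c", "d")]
print(single_creator_share(separate), single_creator_share(overlapping))
```

Under this reading, consistently low values across windows would agree with the overlapping, democratic structure described above.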

In order to verify this result, we conducted another analysis, this time not looking into connections between users, but between categories of articles. In order to do this, we used TeCFlow’s Term View Function, using categories as input for the term analysis. If our hypothesis was correct, there should be no dense clusters of categories in the graph.

The results of this second analysis confirmed the outcome of the first. There was no indication for the existence of separate interest networks.

Therefore, we summarize: Users who contribute to Wikinews regularly are not confined to only a few categories, but write on a variety of topics. This also refutes the assumption that Wikinews is used to spread first-hand knowledge about current events. On the contrary, most articles are written based on quotes from other sources, e.g. news agencies like Reuters. Wikinews has therefore not yet achieved its goal.

7. Active writers are born that way

The intent of the third hypothesis was to recognize patterns of a typical lifecycle of a Wikinews user. Because – from a psychological perspective – the inhibition threshold is higher for each step (commenting, editing, writing new articles), intuitively we expected that a typical user would first join the network as a reader, then become a commentator and later on evolve to an editor or even writer of new articles.

Unfortunately, it was impossible to check for the existence of the first step in this suspected lifecycle, because our data only included users that already were editors. Due to the data source, mere reading of articles could not be included in our datasets.

Moreover, with our datasets, we were not able to develop any approach that would allow us to conduct an aggregated analysis based on all users. Instead, in order to see if users to whom the suspected lifecycle pattern applies exist at all, we selected several individual users for analysis. For each of these users we looked at the Contribution Index and how it changed over time.

When using Dataset2, the Contribution Index should reflect whether a user mainly edits articles or mainly creates new articles. A user who only edits articles would appear at the top of the Index, while a user who only creates articles would be at the bottom. In addition, the user’s position on the x-axis indicates the number of contributions.
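
The following sketch shows how such an index can be computed from Dataset2 flows. It assumes that the index is defined as (sent - received) / (sent + received), which matches the behavior described above (+1 for a pure editor, -1 for a pure creator); the exact definition used inside TeCFlow should be checked against [Glo04].

```python
from collections import Counter


def contribution_index(flows):
    """Return {user: (total, index)} for (sender, receiver) flows.

    index = (sent - received) / (sent + received): +1 for a user who only
    sends (only edits, in Dataset2) and -1 for a user who only receives
    (only creates). This formula is an assumption, see the text above.
    """
    sent = Counter()
    received = Counter()
    for sender, receiver in flows:
        sent[sender] += 1
        received[receiver] += 1
    result = {}
    for user in set(sent) | set(received):
        total = sent[user] + received[user]  # position on the x-axis
        result[user] = (total, (sent[user] - received[user]) / total)
    return result


# Toy Dataset2 flows (editor, creator); the user names are invented.
flows = [("bob", "alice"), ("carol", "alice"), ("alice", "bob")]
for user, (total, index) in sorted(contribution_index(flows).items()):
    print(user, total, round(index, 2))
```

Note that in Dataset2 the number of received flows counts the edits made to a creator’s articles, not the number of articles created.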

Observing the change in position in the Contribution Index aimed at finding the transition from editor to creator. In order to also find the transition from commentator to editor, we had to include the activity on the discussion pages into the analysis. Although, as stated earlier, the corresponding dataset was incomplete due to the many unstructured posts, we used it to generate a second Contribution Index which we viewed in parallel with the first one. In this second index, however, we did not focus on the vertical position of the user but on the general activity. For this analysis we believed the Discussion dataset to be accurate enough.

Summarizing our approach, we expected to find the following pattern if our hypothesis was correct: A new user should appear in the Contribution Index of the Discussion Dataset first. After that, he or she should enter the Contribution Index of Dataset2 from the top and then move slowly towards the bottom right corner. In doing so the user will not necessarily reach the bottom, as this would indicate that he or she only creates new articles but no longer edits at all.

However, the results were not at all as expected. Appearance and movement of users in the two Contribution Indices seemed to be random; no consistent pattern could be discovered. Apparently, each individual user has his or her own specific approach to Wikinews. To our surprise, there are users who start by creating one or more articles and only later edit other users’ work. And while some users follow the suspected pattern, their number is not significant. There are, for example, also users who start with an almost balanced Contribution Index value and only slowly develop towards creating more articles.

Moreover, the activity on the discussion pages has no correlation with the writing or editing of articles.

Having looked at a large number of both article and discussion pages, we gained the impression that there are two fundamentally different types of users: On the one hand there are those interested in the topics; they primarily write and edit articles. On the other hand there are users that act more like reviewers who criticize the article author’s writing style or question the journalistic value of an article on the discussion page. Although we cannot prove this observation, it is consistent with the results of our analysis, especially considering that some of the most central users in the Discussion Dataset were not central in the other two Datasets.

8. Timeline of External Events can be constructed automatically

The starting point for this hypothesis is the idea that a news platform naturally reacts to external events.

The presumption that important events make the community direct its attention to these events seems to be obvious considering the result of H2. Because the community filters and spreads news from other sources, it can be expected that global events change the community’s interest structure.

The Group-Betweenness-Centrality (GBC) measures how centralized the network is. A high GBC indicates the existence of one very central actor through whom most of the communication has to pass (e.g. when the other members do not know each other). Accordingly, we expect a low GBC in very democratic communities, in which nearly everyone communicates with everyone else. In TeCFlow, GBC is graphically represented as a trend line over time.
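
A minimal sketch of such a measure follows. It assumes that GBC corresponds to Freeman’s betweenness centralization, which is 1 for a perfect star and 0 when all actors are equally central; whether TeCFlow computes GBC in exactly this way is an assumption. Flows are treated as undirected, unweighted links.

```python
from collections import defaultdict, deque


def betweenness(adj):
    """Unnormalised betweenness (Brandes) for an undirected adjacency dict."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], defaultdict(list)
        sigma, dist = defaultdict(int), {s: 0}
        sigma[s] = 1
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: c / 2.0 for v, c in score.items()}


def group_betweenness_centrality(edges):
    """Freeman-style centralisation of betweenness for (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    n = len(adj)
    if n < 3:
        return 0.0  # betweenness is undefined for fewer than three actors
    raw = betweenness(adj)
    max_pairs = (n - 1) * (n - 2) / 2.0  # betweenness of a star centre
    normalised = [c / max_pairs for c in raw.values()]
    top = max(normalised)
    return sum(top - c for c in normalised) / (n - 1)


star = [("b", "a"), ("c", "a"), ("d", "a")]
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(group_betweenness_centrality(star), group_betweenness_centrality(triangle))
```

Evaluating this function on the flows of successive time windows would yield a trend line comparable in spirit to the GBC plot.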

A global event would probably cause someone to write an article about it very quickly. This article would then be edited by many different persons. In this way the event would change the GBC-Plot of Dataset2 and cause the value to rise. This can be explained by the fact that one user, the writer of the article, should become very central, since he or she is established as the recipient for every edit.

Concentrating on those points in time when the GBC is higher than the normal level and taking a look at the individual actors’ centrality at these points in time enabled us to clearly identify the actor who caused a peak.

However, looking at the articles which led to this actor’s high centrality shows that no single article but rather a collection of many “unimportant” articles caused the actor’s high centrality.

Evidently, users do not become very central in this dataset by writing single articles that are edited by many other users. Instead, our approach identified those actors who are the global connectors at these points in time, perhaps because these actors had only recently entered the community and were therefore concerned with many different topics. Another possible explanation is that an actor removed many junk edits while searching for suspect articles.

Since this first approach did not work as expected, we tested the hypothesis again using Dataset1. Analogous to the previous procedure, there should be an important event whenever there is a minimum in the GBC-Plot of Dataset1. At such a point in time no actor has a high centrality, so there is a lot of communication between many different actors. However, the results did not support the hypothesis either. Compared with Dataset2, there are only very small changes in the GBC. These changes seemed to be random and could not be used for any further analysis.

Another approach to testing this hypothesis was possible with our database as well: Using the Term Analysis it should be feasible to identify dominating key terms, which should then lead us to important topics the community is dealing with at particular points in time.

In doing so, we move away from analyzing Wikinews’ structure and instead run a content-based analysis. For this we included the content by running a Term Analysis. We then tried to find out whether different events can be recognized by measuring the centrality of different key terms.

In the Term Analysis every word of the whole communication is weighted and connected to other terms. In the static or dynamic view the words are shown according to their centrality and are connected to those terms that occurred in the same communication.

The first communication content we looked at consisted of the articles’ headlines. This made data processing faster, because otherwise the amount of data would have been too large. Also, we did not expect a loss of quality in the result, because headlines naturally concentrate on the important terms of an article.
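
The idea of a headline-based term analysis can be sketched as follows. This is a simplified stand-in for TeCFlow’s Term Analysis, not a description of it: terms are linked when they occur in the same headline, and a term’s importance within a time window is approximated by the number of distinct terms it is linked to. The stop-word list, the window length and the function names are illustrative assumptions.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on", "the", "to"}


def terms(headline):
    """Distinct lower-case word tokens of a headline, without stop words."""
    words = re.findall(r"[a-z]+", headline.lower())
    return sorted(set(words) - STOP_WORDS)


def term_degree_by_window(headlines, window_days=50):
    """`headlines` holds (day_number, text) pairs. Returns, per window, a
    Counter mapping each term to the number of distinct terms it co-occurs
    with (its degree in the co-occurrence network)."""
    neighbours = defaultdict(lambda: defaultdict(set))
    for day, text in headlines:
        window = day // window_days
        for left, right in combinations(terms(text), 2):
            neighbours[window][left].add(right)
            neighbours[window][right].add(left)
    return {window: Counter({t: len(ns) for t, ns in links.items()})
            for window, links in sorted(neighbours.items())}


# Invented example headlines with day numbers.
headlines = [
    (3, "Tsunami hits Asia"),
    (5, "Aid for tsunami victims"),
    (70, "Pope elected in Rome"),
]
for window, degrees in term_degree_by_window(headlines).items():
    print(window, degrees.most_common(1))
```

Reading off the top term of each window in sequence gives a crude timeline of dominant topics.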

By running the dynamic view we could essentially see an overview of the most important events of the last year without using any other variable. In the correct chronological order we could recognize the tsunami catastrophe, the election of the pope, the election in Great Britain, the terrorist attacks on the subway in London, Hurricane Katrina and the earthquake in Pakistan. The term “Bush” was quite central during the whole period and became strongly connected to the term “Iraq”, which also had a high centrality and was connected with the term “war” almost constantly.

9. Discussion

When reviewing the analyses we conducted, the most problematic aspect in our work is the definition of the datasets. Because the Wikinews community is not represented by communication data which contains direct message exchanges between the actors in the network, we had to define the communication links ourselves. In doing so, one will, consciously or unconsciously, manipulate the data to some extent. Nevertheless, we believe that our definition of the datasets avoids bias as far as possible.

However, because the datasets are defined differently from those that TeCFlow was made for, the question remains whether the results attained are valid.

While for example the indices were defined for analyzing direct mail-communication and can be interpreted clearly in that context, we had to consider in each analysis whether it is suitable to use an index or not. Additionally, we had to interpret the results generated with TeCFlow against the background of how the datasets were defined. In general, unlike the analysis of a mailbox, our analysis required us to reassess each of TeCFlow’s functions with respect to its meaning and its significance.

Nonetheless, when recapitulating how we finally used TeCFlow for our analysis, we believe that the tool significantly facilitated our work.

For the first three hypotheses, emphasis was placed on automated processing, meaning that TeCFlow made the analysis easier and faster to accomplish. The advantage here was therefore mainly quantitative. Yet, in the analysis of the third hypothesis it would not have been feasible to look at every single article of an author in order to identify his or her role at a certain point in time. This was the first step towards a qualitative advantage. For the fourth hypothesis, finally, TeCFlow acted not only as a facilitator but as an enabler. Considering the amount of data, it would have been impossible to conduct such an analysis of the content manually. Moreover, without TeCFlow and its features, we would not have had the idea to approach the problem the way we did. The Dynamic View Function made it possible to easily identify the shifting interests in the community.

Therefore, we conclude that while the definition of the datasets is problematic, applying Social Network Analysis to Wikinews is possible, and TeCFlow is of great value for the analysis if the peculiarities that follow from the data definition are considered in the interpretation of the results.

10. Potential applications and future work

In this last section of our paper we want to address the applications of our work, some of which we already mentioned in the analysis, in more detail.

One potential application is using SNA to identify and assess candidates for admin status. Recall that our analysis showed that high centrality corresponds well with the granting of admin status. Evidently, the requirements which the community expects of its admins are reflected in a high centrality.

Currently, candidates can be identified and nominated by any community member.

We suggest that, in addition to the nomination by community members, candidates should be identified by means of Betweenness Centrality. Moreover, the centrality of a candidate nominated by another user can be used as an additional criterion for the assessment of that candidate.

Another application is related to quality control. As we mentioned earlier, Wikinews’ aim is to attract more “citizen reporters”, which has not yet been achieved. While quality of content is controlled by the system of collaborative editing itself, a measurement instrument for the amount of original reporting currently does not exist. Our approach of looking for star-structures (see Section 6) should indicate when there is a shift towards more users contributing late-breaking news directly and therefore provides a method for monitoring progress toward this objective.

Once this goal has been achieved, it should also be possible to find experts on a certain subject matter using the Term Analysis. In contrast to the results in Section 6, the Term View would then show dense clusters of interrelated topics. Identifying the most central users in a certain cluster reveals the users who deal most with the topics in that cluster and who can therefore be regarded as experts on these topics.

Starting from the analysis of H4, which showed that a timeline of external events can be obtained from the Term Analysis, we suggest that the same method can be used to identify trends and trendsetters. By observing the centrality of terms over time, the rise of new topics and the disappearance of topics that are no longer of interest could be discovered. Once a trend has been identified, it would be easy to detect the initiators and trendsetters.

However, in our analysis the terms that we traced corresponded to major events, but when looking for trends one has to discover the rise of new topics at a very early stage. It has to be examined in further research with what level of reliability this is possible.

Acknowledgements

We would like to thank Ilkka Lyytinen, who conducted parts of the analysis described in this paper.

References

[Blo06] http://radio.weblogs.com/0114726/categories/socialNetworks/2003/01/02.html, visited on 2006/01/16

[Glo04] Gloor, P., Zhao, Y. TeCFlow - A Temporal Communication Flow Visualizer for Social Networks Analysis, ACM CSCW Workshop on Social Networks. ACM CSCW Conference, Chicago, Nov. 6, 2004

[Hol04] Holme, P., Edling, C. R. & Liljeros, F. Social Networks, 26, 155 - 174, doi:10.1016/j.socnet.2004.01.006 (2004).

[Ind06] http://www.indymedia.org, visited on 2006/04/11

[Was94] Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications. New York: Cambridge University Press.

[Wei05] “The Unassociated Press” by Aaron Weiss, published in The New York Times on February 10th, 2005

[Wik05] http://wikimediafoundation.org/wiki/Home, visited on 2005/12/28

[Wiki05] http://en.wikipedia.org/wiki/Wikinews, visited on 2005/12/28