The technical session Wiki Tools and Interfaces will feature three presentations. See the schedule for details on when and where to go.
Vandalism Detection in Wikipedia: A High-Performing, Feature-Rich Model and its Reduction Through Lasso
Sara Javanmardi, David W. McDonald, Cristina V. Lopes
User-generated content (UGC) constitutes a significant fraction of the Web. However, some wiki-based sites, such as Wikipedia, are so popular that they have become a favorite target of spammers and other vandals. In such popular sites, human vigilance alone is not enough to combat vandalism, and tools that detect possible vandalism and poor-quality contributions become a necessity. Machine learning techniques hold promise for efficient online algorithms that better assist users in vandalism detection. We describe an efficient and accurate classifier that performs vandalism detection in UGC sites, and we evaluate it on the PAN Wikipedia vandalism dataset. A combination of 66 individual features produces an AUC of 0.9553 on a test dataset, the best result to our knowledge. Using Lasso optimization, we then reduce our feature-rich model to a much smaller and more efficient model of 28 features that performs almost as well, with a drop in AUC of only 0.005. We describe how this approach can be generalized to other user-generated content systems, and describe several applications of this classifier to help users identify potential vandalism.
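The feature-reduction step described above relies on a standard property of L1 (Lasso) regularization: it drives the coefficients of uninformative features to exactly zero, so the surviving features form the reduced model. The sketch below illustrates that mechanism only; the data is synthetic and the feature count (66) is borrowed from the abstract, not from the authors' actual feature set or pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 66                          # 66 candidate features, as in the paper
X = rng.normal(size=(n, d))

# Synthetic ground truth: only the first 20 features carry signal.
w_true = np.zeros(d)
w_true[:20] = rng.normal(size=20)
y = (X @ w_true + rng.normal(scale=2.0, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The L1 penalty zeroes out coefficients of features that do not help;
# a smaller C means stronger regularization and a sparser model.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
clf.fit(X_tr, y_tr)

kept = np.flatnonzero(clf.coef_[0])      # indices of surviving features
auc = roc_auc_score(y_te, clf.decision_function(X_te))
print(f"kept {kept.size} of {d} features, test AUC = {auc:.3f}")
```

Sweeping `C` trades model size against AUC, which is the trade-off the abstract quantifies (66 features at 0.9553 versus 28 features at roughly 0.005 lower AUC).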
Autonomous Link Spam Detection in Purely Collaborative Environments
Andrew G. West, Avantika Agrawal, Phillip Baker, Brittney Exline, Insup Lee
Collaborative models (e.g., wikis) are an increasingly prevalent Web technology. However, the open access that defines such systems can also be exploited for nefarious purposes. In particular, this paper examines the use of collaborative functionality to add inappropriate hyperlinks to destinations outside the host environment (i.e., link spam). The collaborative encyclopedia, Wikipedia, is the basis for our analysis.
Recent research has exposed vulnerabilities in Wikipedia's link spam mitigation, finding that human editors are slow to react and dwindling in number. To this end, we propose and develop an autonomous classifier for link additions. Such a system presents unique challenges. For example, low barriers to entry invite a diversity of spam types, not just those with economic motivations. Moreover, issues can arise with how a link is presented (regardless of the destination).
In this work, a spam corpus is extracted from over 235,000 link additions to English Wikipedia. From this, 40+ features are codified and analyzed. These indicators are computed using wiki metadata, landing site analysis, and external data sources. The resulting classifier attains 64% recall at 0.5% false-positives (ROC-AUC = 0.97). Such performance could enable egregious link additions to be blocked automatically with low false-positive rates, while prioritizing the remainder for human inspection. Finally, a live Wikipedia implementation of the technique has been developed.
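The operating point reported above, 64% recall at a 0.5% false-positive rate, is read off the classifier's ROC curve: choose the strictest score threshold whose FPR stays within budget, and the TPR at that point is the achievable recall. The sketch below shows that computation on synthetic scores; the class balance, score distributions, and resulting numbers are illustrative assumptions, not the paper's data.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)
# Hypothetical classifier scores: spam link additions score higher on
# average, and the classes are imbalanced, as with real link additions.
y_true = np.concatenate([np.zeros(5000), np.ones(500)])
scores = np.concatenate([rng.normal(0.0, 1.0, 5000),   # benign
                         rng.normal(2.5, 1.0, 500)])   # spam

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# Operating point: best recall among thresholds with FPR <= 0.5%.
ok = fpr <= 0.005
recall_at_half_pct = tpr[ok].max()
cutoff = thresholds[ok][tpr[ok].argmax()]
print(f"AUC = {auc:.3f}; recall at 0.5% FPR = {recall_at_half_pct:.2f} "
      f"(block when score >= {cutoff:.2f})")
```

Edits scoring above the cutoff could be blocked automatically at a known, low false-positive cost, while lower-scoring edits are queued for human inspection, which is the deployment strategy the abstract describes.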
NICE: Social translucence through UI intervention
Aaron Halfaker, Bryan Song, D. Alex Stuart, Aniket Kittur, John Riedl
Social production systems such as Wikipedia rely on attracting and motivating volunteer contributions to be successful. One strong demotivating factor is having one's work discarded, or "reverted", by others. In this paper we demonstrate evidence of this effect and design a novel interface aimed at improving communication between the reverting and reverted editors. We deployed the interface in a controlled experiment on the live Wikipedia site, and report on changes in the behavior of 487 contributors who were reverted by editors using our interface. Our results suggest that simple interface modifications (such as informing Wikipedians that the editor they are reverting is a newcomer) can have substantial positive effects, protecting against contribution loss among newcomers and improving the quality of work done by more experienced contributors.