This presentation is part of the WikiSym + OpenSym 2013 program.
Bluma S. Gelley, Torsten Suel
Wikipedia’s low barriers to participation have the unintended effect of attracting a large amount of inappropriate content. One form of inappropriate content is articles whose topics do not meet Wikipedia’s inclusion standards. Deleting these articles consumes a large amount of time and effort that could be better spent improving Wikipedia’s quality. We propose to partially automate the task of detecting unencyclopedic pages using machine learning. We examine three main deletion methods in Wikipedia and, for each method, collect a dataset of deleted articles that were previously inaccessible. We use the data to train classifiers to detect articles that should be deleted. We report precision of 0.986 and recall of 0.975 in the best case, and high precision with lower but still useful recall in the most difficult case. Our results show that an automated software system can assist humans in finding articles for deletion.
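For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how precision and recall are computed for the "should be deleted" class. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the paper's data or code.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the positive ("should be deleted") class.

    tp: articles flagged for deletion that were in fact deleted
    fp: articles flagged for deletion that were in fact kept
    fn: deleted articles that the classifier failed to flag
    """
    # Precision: of everything flagged, what fraction was correct?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything that should be flagged, what fraction was found?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for illustration only (not results from the paper).
p, r = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f}")
```

High precision with lower recall, as in the most difficult case described above, means that few flagged articles are false alarms even though some deletable articles go undetected.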
A PDF file will be made available on August 5, 2013, through the WikiSym + OpenSym 2013 conference proceedings.