Article Quality Classification on Wikipedia: Introducing Document Embeddings and Content Features

Authors: Manuel Schmidt (University of Innsbruck), Eva Zangerle (University of Innsbruck)

Abstract: The quality of articles on the Wikipedia platform is vital for its success. Currently, quality is assessed manually by the Wikipedia community, with editors classifying articles into pre-defined quality classes. However, this approach scales poorly, and approaches for automatic classification have therefore been investigated. In this paper, we build on this line of research on article quality classification by extending the set of features with novel content and edit features (e.g., document embeddings of articles). We propose a classification approach utilizing gradient boosted trees based on this extended set of features extracted from Wikipedia articles. Based on an established dataset containing Wikipedia articles and quality classes, we show that our approach is able to substantially outperform previous approaches (including recent deep learning methods). Furthermore, we shed light on the contribution of individual features and show that the proposed features indeed capture the quality of an article well.

Download: This contribution is part of the OpenSym 2019 proceedings and is available as a PDF file.
