History: What data to collect

Etherpad: [http://eiximenis.wikimedia.org/WhichData WhichData]

Notes from Open Space session on Data - WikiSym

Attendees:
* Cliff Lampe
* Danese Cooper
* Eugene Eric Kim
* Erik Zachte
* Guillaume Paumier
* Kevin Crowston
* Louis-Philippe Huberdeau
* Peter Gehres
* Peter L. Jones
* Philippe Beaudette
* Rand Montoya
* Zack Exley

- ComScore
- Fundraising
- Community Health (Philippe's list at http://strategy.wikimedia.org/wiki/Thread:Talk:Strategic_Plan/Movement_Priorities/%22Community_Health%22_measures#x.22Community_Health.22_measures_5711 )

Cliff - Aggregation is worrying... instead of "number of speedy deletes" it would be better to have the raw data (a list of all speedy deletes with timestamps).
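To make the aggregation concern concrete, here is a small sketch (the record fields are invented, not an actual Wikimedia schema): any aggregate can be derived from the raw event list, but the raw list can never be recovered from the aggregate.

 # Hypothetical raw event records, one row per speedy delete, with a timestamp.
 # The field names are invented for illustration, not an actual Wikimedia schema.
 from collections import Counter
 from datetime import datetime, timezone
 
 speedy_deletes = [
     {"page": "Example_article", "admin": "AdminA",
      "timestamp": datetime(2010, 7, 7, 12, 30, tzinfo=timezone.utc)},
     {"page": "Another_page", "admin": "AdminB",
      "timestamp": datetime(2010, 7, 7, 18, 5, tzinfo=timezone.utc)},
 ]
 
 # The aggregate ("speedy deletes per day") can always be derived from the raw
 # list, but the raw list can never be recovered from the aggregate alone.
 per_day = Counter(event["timestamp"].date() for event in speedy_deletes)
 print(per_day)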

What data is not part of the history that you might need?

Funnel analysis - if one of the Foundation's goals is to increase participation, it may be worth doing funnel analysis.

Notes from previous "WMF Data Summit"

Micah's suggestions about how to approach what data to collect:

If some of our common goals are:
* To grow the community with new qualified editors
* To increase article quality
* To manage abuse
* To grow the encyclopedia and related Wikimedia projects

Then:
* What kinds of data would help us gain insight into these challenges?
* What kind of experiments could we design to evaluate new user interface or policy ideas?
* What are the cultural, policy, and technical hurdles to adding more instrumentation, reporting, and A/B testing to the development of Wikipedia?

- Key funnels (article page, submit process, reading, clicks on edit, survivability of edits). Wikia has done research on the edit-to-save ratio, which brings up questions about why edits are abandoned, etc.
Diagram example of potential user flow analysis: http://www.flickr.com/photos/35034358900@N01/4622563354/sizes/o/
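A minimal sketch of the kind of funnel computation this implies, assuming a hypothetical per-user event log; the event names and record format are invented for illustration, not an actual Wikimedia or Wikia schema:

 # Minimal funnel computation over a hypothetical per-user event log.
 # Event names ("view_article", "click_edit", "save_edit") are invented here.
 from collections import Counter
 
 events = [
     {"user": "u1", "step": "view_article"},
     {"user": "u1", "step": "click_edit"},
     {"user": "u1", "step": "save_edit"},
     {"user": "u2", "step": "view_article"},
     {"user": "u2", "step": "click_edit"},   # opened the editor but never saved
     {"user": "u3", "step": "view_article"},
 ]
 
 FUNNEL = ["view_article", "click_edit", "save_edit"]
 reached = Counter()
 for step in FUNNEL:
     reached[step] = len({e["user"] for e in events if e["step"] == step})
 
 # Conversion between adjacent steps, e.g. the edit-to-save ratio.
 for prev, nxt in zip(FUNNEL, FUNNEL[1:]):
     rate = reached[nxt] / reached[prev] if reached[prev] else 0.0
     print(f"{prev} -> {nxt}: {reached[nxt]}/{reached[prev]} = {rate:.0%}")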

- Precious - could be good for "ground truth" of the dataset approved by the Foundation (like a complete data dump). Belief that there is "cleaning" happening.
- Special test wiki for regressions, to make sure we've not broken the dumps when we check in code.
- Marking bot edits is an example of "cleaning". We currently have the full archive and recent edits... it would be cool to create a few more.

- Open Data licensing... ZakG says it can be tricky because of subsequently layered data. Public Domain with a good CLA is his conclusion. Look at the "Notre Dame SourceForge agreement" and SRDA, except SRDA requires that you sign a license with Notre Dame and has anti-defamation clauses. There is also a European group in Madrid. The San Diego Supercomputer Center had funding to host "for free" and is now fee-for-service. Nosh Contractor at Northwestern U has some large datasets that they are in the process of building a data center for; look at their mechanisms.

Aggregation of definitions that bear on Open Data: http://www.opendefinition.org/

Algorithms also need to be released under open source licenses.

What does Open Data mean to WMF Community?
- Respect the Privacy Policy
- Data available "for free", as in at no cost

Analytics Task Force from Wikimedia's Strategic Planning Process: http://strategy.wikimedia.org/wiki/Task_force/Analytics

How do we do Product Research in an "Open Way"?

Live experimentation is much more tempting than getting a big dump of "what's happened". We might have a theory, for instance, of how to retain editors, and being able to test theories on "live and evolving" datasets would be a unique opportunity. A/B testing against the data, for instance (in order of difficulty; a bucketing sketch follows this list):
a) text changes
b) visual changes
c) changing javascript
d) page layout
e) business logic
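One way variant assignment for such A/B tests could work is deterministic bucketing, sketched below under the assumption that experiments are keyed on an anonymized token rather than personal data; the function and experiment names are illustrative, not an existing Wikimedia mechanism.

 # Deterministic A/B bucketing keyed on an anonymized token plus the experiment
 # name: the same token always lands in the same variant, with no per-user state.
 import hashlib
 
 def ab_bucket(user_token, experiment, variants=("A", "B")):
     digest = hashlib.sha256(f"{experiment}:{user_token}".encode()).hexdigest()
     return variants[int(digest, 16) % len(variants)]
 
 print(ab_bucket("token-123", "edit-button-text"))  # always the same variant for this token

Keying the hash on the experiment name as well as the token keeps bucket assignments independent across experiments.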

Danese: research panel in SF or on the East Coast; people who manage large datasets, if possible with prior Wikimedia experience
Cornell: Jon Kleinberg
Syracuse: Kevin Crowston
CMU: Aniket (Niki) Kittur
Rice: Hadley Wickham
Oxford: Brian Ripley
Yale: Jay Emerson
Cal Berkeley: Hal Varian

It would be valuable to try to involve some industry practitioners as well:
Sean Power (author of O'Reilly's Complete Web Monitoring; http://twitter.com/seanpower)
- also the O'Reilly data guy... Richard?

Annual survey of Wikipedians - like the social survey

Zack: we also need big-picture information not tied to a specific question; perhaps our goals are wrong
Cliff: Google does a lot of A/B testing but keeps it secret. Wikimedia has an opportunity to do things better.

ErikZ: the anonymization problem

A user advocacy group could get access to data, represent the community, and give feedback.

Daniel Kinzler (after the session):
* Access stats for media files would help (including thumbnails). GLAM folks would love to know how much their content is viewed. If possible, count "internal" vs. "external" views (by referrer); see the sketch after this list.
* Log search terms; compare what people search for to what people find (page views). Especially log unsuccessful searches.
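A rough sketch of the internal-vs-external split by referrer that Daniel suggests; the domain list and the idea of running raw request logs through it are assumptions for illustration, not an actual Wikimedia log schema.

 # Classify a media request as an "internal" or "external" view by its Referer
 # header. The domain suffixes below are illustrative, not a complete list.
 from urllib.parse import urlparse
 
 WIKIMEDIA_SUFFIXES = (".wikipedia.org", ".wikimedia.org", ".wiktionary.org")
 
 def classify_view(referrer):
     if not referrer:
         return "unknown"  # no Referer header sent
     host = urlparse(referrer).hostname or ""
     internal = any(host == s.lstrip(".") or host.endswith(s) for s in WIKIMEDIA_SUFFIXES)
     return "internal" if internal else "external"
 
 print(classify_view("http://en.wikipedia.org/wiki/Example"))  # internal
 print(classify_view("http://blog.example.com/post"))          # external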