WikiFactMine: Liberating facts for Wikimedia

ContentMine was funded by the Wikimedia Foundation (the organisation that operates
Wikipedia) to run a project called WikiFactMine. WikiFactMine sought to make Wikidata the
central resource for identifying objects in bioscience.

Wikidata is a free and open knowledge base that can be used by machines and humans
alike. It is a store of structured data that is used by Wikipedia and other Wikimedia projects, as
well as by individuals.

Our team extracted facts (words) linked to concepts in bioscience from a variety of sources,
built dictionaries of these words, and then linked them to Wikidata.
We extracted not only the words themselves but also a small amount of the surrounding text to
provide context. For example, we have made a list of all those plants in Wikidata that yield
cereals.
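
The dictionary-plus-context idea described above can be illustrated with a minimal sketch. This is not the ContentMine tooling itself; the function name `find_terms`, the `window` parameter, and the placeholder item IDs are all assumptions made for illustration only.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical dictionary mapping a term to a Wikidata item ID.
# The IDs below are placeholders, not real Wikidata identifiers.
CEREAL_PLANTS: Dict[str, str] = {
    "Oryza sativa": "Q-PLACEHOLDER-1",
    "Zea mays": "Q-PLACEHOLDER-2",
}


def find_terms(
    text: str, dictionary: Dict[str, str], window: int = 30
) -> List[Tuple[str, str, str]]:
    """Return (term, item_id, snippet) for each dictionary term found in text.

    The snippet includes up to `window` characters on either side of the
    match, so that each extracted word keeps a little surrounding context.
    """
    hits = []
    for term, item_id in dictionary.items():
        # Whole-word, case-insensitive match on the literal term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            hits.append((term, item_id, text[start:end]))
    return hits


if __name__ == "__main__":
    sentence = "Yields of Zea mays rose sharply."
    for term, item_id, snippet in find_terms(sentence, CEREAL_PLANTS, window=10):
        print(term, item_id, repr(snippet))
```

In a sketch like this, the dictionary would in practice be populated from Wikidata rather than written by hand, and the snippet is what gives a human reviewer enough context to judge each match.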

The tools we developed are extremely powerful; in the cereal example outlined above,
whenever a paper mentions a cereal-producing plant we will be notified, even if we are not
aware that the plant produces cereals. To date, using our Fatemah tool, we have
contributed more than 10 million items to Wikidata from individual scientific papers!

For more details please contact us!

Text and Data Mining and UK Law

In 2014, the UK government introduced a number of changes to its 1988 Copyright Act. Among those changes was one relating to Text and Data Mining (TDM), which introduced a new exception to copyright, i.e., it gave users permission to do things that were previously legally uncertain. The exception (Section 29A of the UK Copyright Act) allows researchers to make copies of any copyright material for the purpose of “computational analysis” (i.e., TDM) if they already have “lawful access”, for example because the researcher (or their employer) has purchased or subscribed to it. This exception only permits the making of copies for TDM for non-commercial research.

The exception permits any published or unpublished in-copyright works to be copied for the purpose of TDM. This includes sound, film/video, artistic works, journal articles, textual materials, tables and databases, as well as data. It overrides any contractual term that states you cannot undertake such copying and analysis.

This all sounds great, but there are two important caveats. The first is that the research must be non-commercial. This means commercial organisations can use the exception if the research in question is for non-commercial purposes. Conversely, a not-for-profit organisation such as a university cannot take advantage of the exception if the research in question is for a commercial purpose, e.g., with the intention of selling the results of the analysis. However, “non-commercial” is not defined in the law or by case law. For example, what if a university researcher is doing research that is partly or fully funded by a for-profit company, which will have access to the results and may well use them to launch new commercial products?

The second caveat relates to potential damage to a vendor’s online system, such as digital access to the full text of a range of journals. The exception states: “Publishers and content providers are able to apply reasonable measures to maintain their network security or stability.” Although the exception also states that “these measures should not prevent or unreasonably restrict researcher’s ability to text and data mine”, in practice many publishers have imposed limits on how much can be downloaded. I have yet to see any clear evidence from a publisher showing that TDM activities do slow down their systems, and I suspect their rules are designed simply to frustrate researchers. Because it would be illegal to try to bypass any measures imposed by publishers, both researchers and the librarians who maintain such subscriptions on behalf of their users are very reluctant to permit heavy TDM activities.

In my view, researchers and librarians are unnecessarily nervous about antagonising publishers on this issue, so in practice the new exception has not helped researchers as much as was originally hoped.

Charles Oppenheim

Copyright © 2018 ContentMine. All Rights Reserved
