The Right to Read is an Opportunity to Mine

This is my comment on the recent news in Nature that a big academic publisher now allows limited text mining of its papers. Originally sent to the LibLicense list. Reviewed by a lawyer who is an expert in copyright law.

As a data mining specialist, I’ve been following the various discussions about mining scholarly publications for some time, and I’ve noticed that there is a lot of confusion about the legal nature of text mining and the true origin of the restrictions on it. The discussions far too often drift into copyright law, which unnecessarily muddies the problem. Below is my take on this topic.

1) It’s important to observe that current restrictions on text mining are technical, not legal. Publishers impose technical limits on how much content can be downloaded in a given period of time, and if someone downloads too much, the university may get cut off from the publisher’s servers. This is regulated legally, of course, but only in the agreement signed between the university and the publisher, not by general law, and least of all by copyright. The exact terms are a matter of mutual agreement between the parties – they can agree on whatever they want – so blaming copyright for limited bandwidth to the publisher’s servers is unreasonable.

2) Restrictions apply to subscription content alone. There is no way to impose restrictions on mining Open Access content, even if OA means only “free” OA. Even more: if I get access to an article illegally (or the legality is disputed, as in Aaron Swartz’s case) and then mine it, I can only be accused of illegal copying and maybe hacking, but not of text mining. That’s because copyright law has nothing to do with mining; these are two different things.

Data mining is just another name for collecting statistics, and as such, it relates to the information contained in the paper and not to the paper itself, whereas copyright protects only the paper as a creative work, in its literal and graphical form, not the information contained inside. It’s similar to the difference between copyright and patent law: copyright may protect an article that describes a new invention, but only a patent can protect the invention itself – without patent protection the invention can be used freely by everyone, even if the article is protected by copyright. It’s important to see the distinction.

Data mining per se – when decoupled from the separate problem of accessing the paper – is nothing but collecting stats (e.g., the frequency of word occurrence in the text) and extracting basic facts (e.g., that two given proteins interact with each other), so it relates solely to the information contained in the paper and can’t be restricted by copyright. It’s my personal freedom to collect whatever stats I want, on whatever objects (papers) I want, without any special permission from anybody, and nobody can forbid me to do this. The only thing publishers can do to restrict data mining is constrain access to their papers – which they notoriously do, but that’s a completely different story: the one of limited access to literature, not the one of data mining.
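To make the “collecting stats” point concrete, here is a minimal sketch in Python of the word-frequency example. The function name `word_frequencies`, the tokenizing pattern and the sample sentence are my own illustrative assumptions, not any particular text mining tool. Note that the result is a table of counts, not a copy of the text.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each word occurs in `text`, ignoring case."""
    # Hypothetical tokenizer: runs of lowercase letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


# Made-up sample sentence, standing in for the text of a paper.
sample = "Protein A binds protein B. Protein B inhibits protein C."
freqs = word_frequencies(sample)

# Print the words with their counts, most frequent first.
for word, count in freqs.most_common():
    print(word, count)
```

Real text mining pipelines are far more elaborate, but the principle is the same: what comes out is statistics and facts about the text, not the text in its literal form.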

So, it’s true what Peter Murray-Rust and OKFN say, that the right to read is the right to mine. I would go even further: mining does not need any explicit right, thus the right to read is an opportunity to mine. If I’m lucky enough to see the paper – on whatever legal basis, or even none at all – I’d be foolish if I missed the opportunity and didn’t mine the information contained in it! The part that remains questionable is the “right to read”: do we really have it if it’s hampered by technical limitations imposed by publishers? And does subscription access really deserve the name “the right to read”? Maybe a better name would be “the mercy to read (but not too much and not too fast, otherwise you’ll be banned)”.

9 responses

  2. This article ignores database rights, applicable in EU member states, which add a level of complexity to data mining in those countries. It also ignores the impracticality of trying to negotiate an agreement for generous data mining rights with a publisher who holds all the cards. But the fundamental point made by the author is correct.

    1. Thanks for this remark. Do you have a specific example of how database rights may come into play?

  3. There is no exception for TDM in database rights law anywhere in the EU, though there is one in UK copyright law. So if a collection of data enjoys database rights, which many, but not all, data collections do, or if a data collection enjoys BOTH copyright and database rights (which some do), then there is no legal protection for someone who wants to data mine it. Just mildly surprised that the copyright lawyer who checked the text didn’t mention this extra layer of rights that sometimes arises.

    1. Text mining is done at the article level, not at the level of an entire collection, and exploits data from individual articles rather than data specific to the collection as a whole (like the selection of articles, relations between them, their metadata etc.), so I would still argue that even database protection – which refers only to a collection as a whole, not to the inner contents of individual items – can’t restrict the right to mine, as long as the user has the basic right to browse (access, read) individual articles… The right to read is the right (opportunity) to (text) mine, even inside collections of articles.

      Note also that according to the legal definition, data in a protected database must be “arranged in a systematic or methodical way”, while the data extracted during text mining is NOT arranged in any way initially, therefore it can’t be treated as a database. This data exists originally in a raw, unstructured form exclusively, and text mining – which collects and arranges raw data – is exactly the procedure that transforms it into a structured form, possibly of a database. Thus, a database can be a result – rather than an object – of text mining, and so I can’t see how database protection may put any restrictions on TM.

  4. But an individual article may well contain enough data to qualify for database right! You are right about the definition of database, but the data within an article may well be presented in a systematic and methodical way, for example in a Table of results. An individual fact, as you say, enjoys no protection, but a collection of data within an article – say a list of melting points of a series of newly synthesised compounds, arranged in order of melting point or of molecular weight – potentially qualifies as having database rights. Thus, TDM that involves taking one melting point from one article, another melting point from another article, etc., is fine, but TDM that involves taking several melting points from a single article may well not be.

    1. True, in special cases an individual article may contain a (small) database inside, typically in the form of a table, and if this database is protected by law, some restrictions may apply as to what you can do with it if you manage to extract it from the text. However, I don’t think this is the case discussed in the “text mining rights” debate, or covered by the exceptions you mentioned. Usually, when speaking of “text mining” we mean transforming raw (unstructured) input into usable (structured) knowledge (“data”). If you already have structured data at the beginning, you don’t need text mining anymore, but a parsing algorithm that pulls the database out with its exact structure and meaning, as encoded in the text. Besides, I doubt that any general exception, like the one you mentioned, could waive a preexisting legal protection of such databases. Rather, the exception works at a higher level and applies only to data represented by an entire article or a collection of articles, not to self-contained objects embedded in an individual paper, which may come with their own sets of legal provisions, separate from those of the paper itself.

  5. My point is that an article with a Table in it has structured data in it, but it may well not be structured in the way the data miner wants. In such cases, they will need to re-structure the data into a form suitable for their needs. At that point, the lack of an exception for TDM in database rights law becomes problematic. I’m not alone in this. Other commentators have said the EU law on database rights will have to be addressed if TDM is to be encouraged within the EU. I am unsure what proportion of TDM activities will involve potential infringement of database rights, but certainly some will, and in the absence of a change in the law of database rights, there will be ambiguity that I am sure the major scholarly publishers will exploit in their efforts to inhibit TDM outwith their control. We both want to encourage TDM, so surely if there is a potential problem with db rights and the only way to resolve it is to have consistency in the exception to copyright and to database rights, why argue about what % of TDM activities might be affected?

