This is my comment on the recent news in Nature that a big academic publisher now allows limited text mining of its papers. Originally sent to the LibLicense list. Reviewed by a lawyer who is an expert in copyright law.
As a data mining specialist, I’ve followed the various discussions about mining scholarly publications for some time, and I’ve noticed considerable confusion about the legal nature of text mining and the true origin of the restrictions on it. The discussions far too often drift into copyright law, which unnecessarily muddies the problem. Below is my take on this topic.
1) It’s important to observe that the current restrictions on text mining are technical, not legal. Publishers impose technical limits on how much content can be downloaded in a given period of time, and if someone downloads too much, the university may get cut off from the publisher’s servers. This is regulated legally, of course, but only by the agreement signed between the university and the publisher, not by general law, and least of all by copyright. The exact terms are a matter of mutual agreement between the parties – they can agree on whatever they want – so blaming copyright for limited bandwidth to the publisher’s servers is unreasonable.
2) The restrictions apply to subscription content alone. There is no way to impose restrictions on mining Open Access content, even if OA means only “free” OA. Even more: if I get access to an article illegally (or the legality is disputed, as in Aaron Swartz’s case) and then mine it, I can only be accused of illegal copying and maybe hacking, but not of text mining. That’s because copyright law has nothing to do with mining: these are two different things.
Data mining is just another name for collecting statistics, and as such it relates to the information contained in the paper and not to the paper itself, whereas copyright protects only the paper as a creative work, in its literal and graphical form, not the information contained inside. It’s similar to the difference between copyright and patent law: copyright may protect an article that describes a new invention, but only a patent may protect the invention itself – without patent protection the invention can be used freely by everyone, even if the article is protected by copyright. It’s important to see the distinction.
Data mining per se – when decoupled from the separate problem of accessing the paper – is nothing but collecting stats (e.g., the frequency of word occurrence in the text) and extracting basic facts (e.g., that two given proteins interact with each other), so it relates solely to the information contained in the paper and can’t be restricted by copyright. It’s my personal freedom to collect whatever stats I want, on whatever objects (papers) I want, without any special permission from anybody, and nobody can forbid me from doing this. The only thing publishers can do to restrict data mining is to constrain access to their papers – which they notoriously do, but that’s a completely different story: the one of limited access to literature, not the one of data mining.
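To make “collecting stats” concrete, here is a minimal sketch in Python of the word-frequency example mentioned above. The function name and the sample text are hypothetical, made up for illustration only; real mining pipelines are of course more elaborate.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word occurs in the text, ignoring case."""
    # A "word" here is a run of letters or digits, optionally joined
    # by an apostrophe or hyphen (a simplifying assumption).
    words = re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())
    return Counter(words)


# Hypothetical sample text, standing in for the body of a paper.
sample = "Protein A binds protein B. Protein B does not bind protein C."

freq = word_frequencies(sample)
print(freq.most_common(3))
```

Note that the result holds only counts: the original wording and layout of the text cannot be recovered from it, which is the distinction between the paper and the information in it that I draw above.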
So, it’s true what Peter Murray-Rust and OKFN say, that the right to read is the right to mine. I would say even more: mining does not need any explicit right, thus the right to read is an opportunity to mine. If I’m lucky enough to see the paper – on whatever legal basis, or even none at all – I’d be foolish to miss the opportunity and not mine the information contained in it! The part that remains questionable is the “right to read”: do we really have it, if it’s hampered by technical limitations imposed by publishers? And does subscription access really deserve the name of “the right to read”? Maybe a better name would be “the mercy to read (but not too much and not too fast, otherwise you’ll be banned)”.