20:28 02/05/2008, [Robotic mining] is comparatively new, and (not surprisingly) hits some problems. Articles, like web pages, are designed for human consumption, and not for machine processing. We humans have read many like them; we know which parts are abstracts, which parts are text, which headings, which references. We can read the tables, find the intersections and think about what the data points mean. We can look at the graphs, the spectra etc, and relate them to the author’s arguments. Most of these tasks are hard for robots. But with a little bit of help and persistence, plus some added “understanding” of genre and even journal conventions, etc, robots can sometimes do a pretty good job.
PMR: Agreed
However, most science articles are published in PDF. And PDF does not make the robot’s task easy; in fact, PDF often makes it very hard (not necessarily through deliberate obscurity, but perhaps as a side-effect of the process leading to the PDF).
PMR: Agreed. An expansion is “most articles are authored in Word/OO or LaTeX and converted to PDF for the purposes of publishing.”
Peter Murray-Rust has been leading a number of one-man campaigns (actually they all involve many more than one man[*], but he is often the vocal point-person). One such campaign, based on attempts to robotically mine chemical literature, can be summed up as “PDF is a hamburger, and we’re trying to turn it back into a cow” (the campaign is really about finding better semantic alternatives to PDF). I’ve referred to his arguments in the past, and we’ve been having a discussion about it over the past few days (see here, its comments, and here).
PMR: [*] Alma Swan does me the honour to quote this in her talks
I have a lot of sympathy with this viewpoint, and it’s certainly true that PDF can be a hamburger. But since scientists and publishers (OK, mostly publishers) are not yet interested in abandoning PDF, which has several advantages to counter its problems, I’m also interested in whether, and if so how, PDF could be improved to be more fit for the scientific purpose.
PMR: I don’t think scientists care about PDF. It’s something that comes down the wire. If it came down in Word they wouldn’t blink. So it’s the publishers, not the readers. And most authors create Word. They tip it into the publisher’s site, which converts it to PDF.
PMR: Having said that, “PDF” is rapidly moving from a trademark to an English word. Rather than “send the manuscript” it’s “send the PDF”. Just like “please send us your Powerpoints”.
One way might be that PDF could be extended to allow for the incorporation of semantic information, in the same way that HTML web pages can be extended, eg through the use of microformats or RDFa, etc. If references to a gene could be tagged according to the Gene Ontology, references to chemicals tagged according to the agreed chemical names, InChIs etc, then the data mining robots would have a much easier job. Maybe PDF already allows for this possibility?
PMR: This is completely possible at the technical level. My collaborator Henry Rzepa is keen on using PDF as a compound document format and metadata container. It can do it. But nobody does, and it will certainly require tools that have to be bought.
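To make the tagging idea concrete, here is a minimal sketch of what a robot's job looks like once chemical names in an HTML page carry their InChI in an RDFa-style attribute. The `ex:inchi` property name, the sample sentence and the `extract_inchis` helper are all invented for illustration; they are not taken from any agreed vocabulary or existing tool, and the sketch does not handle nested spans.

```python
from html.parser import HTMLParser


class InchiExtractor(HTMLParser):
    """Collect (visible text, InChI) pairs from tagged <span> elements."""

    def __init__(self):
        super().__init__()
        self.found = []        # list of (text, inchi) tuples
        self._current = None   # InChI of the span we are inside, if any
        self._text = []        # text fragments seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # "ex:inchi" is a hypothetical property name used only in this sketch.
        if tag == "span" and attrs.get("property") == "ex:inchi":
            self._current = attrs.get("content")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a tagged span.
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.found.append(("".join(self._text).strip(), self._current))
            self._current = None


def extract_inchis(html):
    """Return every (name, InChI) pair tagged in the given HTML string."""
    parser = InchiExtractor()
    parser.feed(html)
    parser.close()
    return parser.found


# Made-up sample sentence with two tagged chemical names.
SAMPLE = (
    '<p>We bubbled <span property="ex:inchi" '
    'content="InChI=1S/CH4/h1H4">methane</span> through '
    '<span property="ex:inchi" content="InChI=1S/H2O/h1H2">water</span>.</p>'
)

if __name__ == "__main__":
    for name, inchi in extract_inchis(SAMPLE):
        print(name, inchi)
```

The point of the sketch is how little guessing is left: the robot no longer has to work out that “methane” is a chemical, because the author (or the authoring tool) has already said so in a form a machine can read.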
PMR argues quite strongly that PDF is by design unfit for our purpose (in this case, holding scientific information such that it can reliably be extracted by text mining robots); that PDF’s determined page-orientation and lack of structural and semantic significance doom such attempts to failure. He also argues strongly that the right current solution is to use XML… or perhaps XHTML for science writing. I don’t know. He might be right. But being right is not necessarily going to persuade thousands of journal editors and hundreds of thousands of scientists to mend their ways and write/publish in XML.
PMR: I’m not asking for XML. I’m asking for either XHTML or Word (or OOXML)
CR: I think we should tackle this in several ways:
- try to persuade publishers to publish their XML (often NLM XML) versions of articles as well as the PDFs
- try to persuade publishers who don’t have an XML format to release HTML versions as well as (or instead of) PDFs
- tackle more domain ontologies to get agreements on semantics
- work on microformats and related approaches to allow semantics to be silently encoded in documents
- try to persuade authors to use semantic authoring tools (where they exist), and publishers to accept these
- try to persuade Adobe to extend PDF to include semantic micro-metadata, and to help provide tools to incorporate it, and to extract it.
Might that work? Well, it’s a broad front and a lot of work, but it might work better than pursuing only one of them… But if we got even part way, we might really be on the way towards a semantic web for science…
PMR: It will work at some stage - the stage when the publishers want to help scientists in their endeavour rather than prevent them taking the next logical step because it might impact on subscriptions or be extra work. The W3C community, the Googles, Flickrs, etc. do all this already. They have semantic linked data. It’s just that the scientific publishing Tardis is still stuck in the nineteenth century. It looks lovely from the outside.
Er, Peter, you didn’t *really* mean OOXML, I hope….
(1) Actually I did. I could use the term “Word2007.docx” instead. See my next post. Please feel free to critique that.
I will….