Unilever Centre for Molecular Informatics
 

petermr’s blog

A Scientist and the Web

 

Hamburgers and Cows; The Cognitive Style of PDF

PDF is one of the greatest disasters in scientific publishing - why?

I normally give my slides in XHTML rather than Powerpoint and prefix them with the quote which I made up: “Power corrupts; Powerpoint corrupts absolutely”. I then searched the web and found that Edward Tufte had already thought of it in The Cognitive Style of PowerPoint. Tufte contends that PP had an important role in the Space Shuttle disaster(s). Tufte’s premise is that PP requires authors to omit critical data and dumb down thought. I had never thought of PP as actually perverting the way we think, but he is absolutely right.

My attack on PP is complementary - technical rather than political. PP corrupts any semantics in the document completely. Just try to read the saved HTML from a PP (in, say, Google) and you will be lucky to get anything. PP is probably the most effective destroyer of semantic information yet devised. Tufte urges that authors use Word instead. I will interpret this to mean “any tool that displays conventional compound documents at the required level and without loss”. I therefore choose XHTML (because Word is a pretty good semantic destroyer as well).

So why not just use PDF? It’s universal, it’s beautiful to look at, it’s used for scientific publishing… NO! PDF is the biggest destroyer of scientific information currently in use. PDF concentrates on only one thing: reproducing the process of adding printers’ ink to paper. The PDF that scientists use for publications was not promoted by them, but by the scientific publishers. How many scientists wrote to the publishers saying “we would like double column text in PDF”? The “e-publishing revolution” has had the major and very sad effects of:

* transferring the printing bill from the publisher to the reader (almost all scientists seem to print out the papers and annotate them with markers)
* transferring political power to the publishers.

It allows the publishers to claim (as the ACS does) that:
What is important to realize is that a subscription to an STM journal is no longer [...] a subscription; in fact, it is an access fee to a database maintained by the publisher. [...] one important consequence of electronic publishing is to shift primary responsibility for maintaining the archive of STM literature from libraries to publishers. I know that publishers like the American Chemical Society are committed to maintaining the archive of material they publish. Maintaining an archive, however, costs money.

From “Socialized Science” (ACS[*] commentary on NIH), RUDY M. BAUM, Editor-in-Chief, C&E News, September 20 2004, Volume 82, Number 38, p. 7
How many scientists asked the publishers to convert journals into databases? How many asked the publishers to become the guardians of the archive, and to have them switch off access at a moment’s notice (as they did to Cambridge last week)? There are some minor benefits from ePublishing, such as Crossref and more rapid access, but it’s a Faustian bargain and we are suffering. PDF has been the devil’s agent in this. It has insidiously transferred control to publishers with the unintended but equally horrific downside of semantic destruction.

Apart from the politics, why is PDF so bad? A question on XML-DEV about how to convert PDF to XML brought the lovely comment from Mike Kay (author of the (OpenSource) Saxon XSLT tool):
> > Could you please tell me, How we can convert the PDF data
> into Xml file using java? I found a library PDFBox.
> Converting PDF to XML is a bit like converting hamburgers into cows. You may be best off printing it and then scanning the result through a decent OCR package.
Michael Kay http://www.saxonica.com/
http://lists.xml.org/archives/xml-dev/200607/msg00509.html
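The “cow” side of the analogy can be made concrete with a minimal Python sketch. It is an illustration only: the slide fragment and the function name `extract_structure` are invented for this example, and it assumes the XHTML is well-formed XML. Because the markup names its parts (title, heading, list items), a standard XML parser can recover them directly, with no OCR or layout guessing.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical slide fragment; the content is invented for illustration.
SLIDE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Hamburgers and Cows</title></head>
  <body>
    <h1>Why PDF destroys semantics</h1>
    <ul>
      <li>PDF records ink on paper</li>
      <li>XHTML records structure</li>
    </ul>
  </body>
</html>"""


def extract_structure(xhtml_text):
    """Return the title, headings and list items named by the markup."""
    root = ET.fromstring(xhtml_text)
    ns = {"x": XHTML_NS}  # prefix used in the path expressions below
    return {
        "title": root.findtext("x:head/x:title", namespaces=ns),
        "headings": [h.text for h in root.iterfind(".//x:h1", ns)],
        "items": [li.text for li in root.iterfind(".//x:li", ns)],
    }


structure = extract_structure(SLIDE)
print(structure)
```

A PDF of the same slide would typically yield positioned runs of text, and whether a given run was a heading or a list item would have to be guessed from fonts and coordinates.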
So I use XHTML and preserve my semantics. It’s a labour - but it has to be the way forward. I’ll write more on this later, and on why the browser manufacturers have destroyed semantics as well.

(Judith M-R tells me there were too many typos in the last post, so I shall edit offline, spellcheck and paste. I am still losing edits in Wordpress and then finding later that they have been saved after I have rewritten them.)

13 Responses to “Hamburgers and Cows; The Cognitive Style of PDF”

  1. This is a very interesting post from a writing perspective. In business and technical writing, we promote the use of both of these evil forms (PPT and PDF). I will now be able to make a case against their use. I think this information is so crucial that I am going to force all of my students to read this post. Thanks for the information!!

  2. pm286 says:

    Thanks Beth,
    I have come to this via data corruption through PDF and will post more on this. I hadn’t thought very much about PDF as having a political side until I wrote this post. Now I shall think about it. As fuel for thought, I was talking to the ACS publishing officers last night. They use words like “database” and “users”; I am old-fashioned and use terms such as “journal” and “reader”. I shall post on this now…

    P.

  3. [...] Here is a scenario from the near future: Joe is writing a review on Cephalosporin that he wants to publish the modern way - directly to the Web. An entirely new concept in scientific publishing has started to take hold. Rather than submitting scientific articles to publishers, who then make hamburger out of them and strip authors of their rights to reproduce their own work, a new system in which journals simply aggregate content already on the Web is gaining momentum. Some journals specialize in only including the very best scientific Web content available, and so enjoy a prestige factor. It’s still a peer review system, but with inversion of control. The trick for scientists is getting their work indexed, and so noticed, in the first place. [...]

  4. pm286 says:

    Thanks Rich - I have responded positively on your blog.

  5. [...] Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » Hamburgers and Cows; The Cognitive Style of PDF [...]

  6. pm286 says:

    The “hamburger and cow” analogy seems to be older than I thought. I have found the following link from 2003:
    http://www.techwr-l.com/techwhirl/archives/0307/techwhirl-0307-00942.html

  7. [...] (I’m also not specifically criticising the authors of the paper - at least not more than all other organic chemists because this supporting information (SI) is typical. I am of course suggesting gently that the process of publishing organic chemical experiments is seriously and universally broken). The supporting information is a hamburger PDF and this example excellently makes my point. (Please readers, read it - or as much as you can manage - as I need help. Especially from anyone who is involved in graphical communication). It’s a separate document from the original paper and even though on the ACS site remarkably seems to be openly viewable. Maybe the ACS will close it sometime or maybe this exercise shows that Openness enhances downloads. [...]

  8. [...] Use machines to read the literature for you. Our software in Cambridge can now read and use large amounts of the primary chemical literature. This is a good way of getting alerts and aggregation for well-defined concepts (e.g. are there any papers which mention the sort of molecule I am interested in?). The main difficulty is that publishers (see the reference to Wiley) are still producing Hamburgers, not Cows, so the machines struggle. So my prediction is that publishers who include semantics in their publications (Cows) will gain market share over Hamburgers. But of course while the mighty impact factor is the only thing that matters that will be slow. [...]

  9. [...] Hamburger House of Horrors Horrible GIFS Hamburgers and Cows - the cognitive style of PDF [...]

  10. Odonata says:

    very good article

  11. Odonata says:

    maybe, .swf can work instead of both .ppt and .pdf?

  12. [...] The bad news is that keeping an article publicly visible is the last thing most scientists want to spend valuable time and energy on. After all, that’s what the journal was there for, wasn’t it? Given the technical barriers to self-archiving Open Access content, who could blame them? First, an author needs to find a server willing to host their content. After that comes learning the software to get the article onto the server. Then comes the need to decide on the archival format, being ever-mindful of the hamburger effect. Of course, authors would probably want some assurance that the location of this article won’t change and will be “permanently” available. Does a DOI need to be re-assigned? And let’s not forget about how the poor reader is supposed to find these articles (some would say that Google is the answer, but I would disagree). Expecting each author to solve these problems on his or her own simply won’t work. There must be a better way. [...]

  13. Noel O'Blog says:

    Supporting information available as text…

    “Supporting information is available for this article as text files. See below for the links.” Providing datasets as text files is much more useful than making them available as PDF files (hamburger, an……
