Murray-Rust Group: Blogs
There’s a research associate position available working with Peter Murray-Rust and others here at the Unilever Centre. There is potential for some fantastic applied research in the role, using semantic web technologies, CML and text mining technologies to generate cutting-edge polymer informatics tools.
A basic repository feature is providing a list of all the resources in a collection, and a way to incrementally discover changes. The usual way for repos to enable this is OAI-PMH, using either the ListRecords verb or the ListIdentifiers verb, with the ‘from’ argument to perform efficient incremental updates and the resumptionToken system to let the server control the load generated.
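As a rough illustration of the incremental harvest described above, here is a minimal sketch of how a client might build its OAI-PMH requests. The endpoint URL and the token value are made up for the example; only the verb and argument names (ListIdentifiers, metadataPrefix, from, resumptionToken) come from the protocol itself.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/oai"


def list_identifiers_url(base_url, from_date=None, resumption_token=None,
                         metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListIdentifiers request URL."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: it travels with the verb only.
        params = {"verb": "ListIdentifiers", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
        if from_date is not None:
            # 'from' restricts the list to records changed on or after this date.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


# First request of an incremental harvest, then a follow-up request using
# a (made-up) token that the server would have returned with the first chunk.
first_request = list_identifiers_url(BASE_URL, from_date="2007-06-01")
next_request = list_identifiers_url(BASE_URL, resumption_token="token-123")
print(first_request)
print(next_request)
```

The server decides how many identifiers go into each chunk, which is how it keeps control of its own load; the client just keeps following tokens until none is returned.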
The way the rest of the world does it is with Atom or RSS. Unnecessary retrievals can be prevented using conditional GET. The server chooses the size of the feed documents so it can control its own load. It’s even possible to avoid lost updates or list an entire collection using ‘first’, ‘last’, ‘next’ and ‘previous’ links (as in this tip). There’s no direct equivalent of PMH’s ‘from’ but as long as the feed has timestamps on each entry, then the client knows when to stop retrieving more feed chunks.
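To make the "client knows when to stop" point concrete, here is a minimal sketch of a client walking a paged Atom feed. It assumes the feed lists entries newest first and that timestamps are UTC in the form shown; the two feed pages, their URLs and the entry ids are invented for the example, and the fetch function is passed in so that a real client could substitute an HTTP call (ideally a conditional GET).

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ATOM = "{http://www.w3.org/2005/Atom}"

# Two hypothetical feed pages, newest entries first, chained by a 'next' link.
PAGES = {
    "page1": """<feed xmlns="http://www.w3.org/2005/Atom">
      <link rel="next" href="page2"/>
      <entry><id>item-4</id><updated>2007-06-12T09:00:00Z</updated></entry>
      <entry><id>item-3</id><updated>2007-06-11T09:00:00Z</updated></entry>
    </feed>""",
    "page2": """<feed xmlns="http://www.w3.org/2005/Atom">
      <entry><id>item-2</id><updated>2007-06-10T09:00:00Z</updated></entry>
      <entry><id>item-1</id><updated>2007-06-09T09:00:00Z</updated></entry>
    </feed>""",
}


def parse_time(text):
    # Assumes UTC timestamps with a trailing 'Z' and whole seconds.
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")


def changed_since(fetch, start_url, last_harvest):
    """Return ids of entries updated after last_harvest, following 'next' links."""
    changed = []
    url = start_url
    while url is not None:
        feed = ET.fromstring(fetch(url))
        for entry in feed.findall(ATOM + "entry"):
            if parse_time(entry.findtext(ATOM + "updated")) <= last_harvest:
                # Everything from here on was seen in the previous harvest.
                return changed
            changed.append(entry.findtext(ATOM + "id"))
        # Look for a link to the next (older) chunk of the feed.
        url = None
        for link in feed.findall(ATOM + "link"):
            if link.get("rel") == "next":
                url = link.get("href")
    return changed


result = changed_since(PAGES.get, "page1", parse_time("2007-06-10T00:00:00Z"))
print(result)
```

With the sample pages, the client reads all of the first page, follows the 'next' link, and stops partway through the second page as soon as it meets an entry older than its last harvest.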
I’m currently reading the REST book, so I’m in a frenzy of resource-oriented fervour. OAI-PMH is, in the REST patois, a STREST interface (this theme was picked up in the discussion between Carl Lagoze and Andy Powell recently). The rich resource discovery possible with OAI-PMH is also overkill for what I’m after here.
I’m also unsure about syndication - I have a feeling that the resource representations in Atom / RSS feeds are unlikely to satisfy most repository clients’ needs. Isn’t a more resource-oriented approach to simply link to the resource and let the client negotiate with the resource for an appropriate representation? If so, Sitemaps fit the bill perfectly.
Well, maybe, but on balance I still think that Atom / RSS is a better choice; the RESTful repository will almost certainly have a feed around for human clients, and it’s better to adapt this for machine clients than adopt an additional mechanism.
A position in polymer informatics has become available in our group. It is a post-doc position for up to 16 months (limit of tenure applies). The holder of the post will help to deepen and expand our capability in text mining polymer (related) information. A complete job description and the official application instructions can be found here. The closing date is 6 July 2007.
The project web sites for the SPECTRa tools are now online, and I’ll be moving the help and documentation over in due course. The main point of the SPECTRa tools is to make building repository-ready packages of X-Ray crystallography, NMR spectroscopy and Gaussian input files as easy as possible.
Once prepared, the packages can be saved to local storage for manual deposition or deposited into a DSpace repository (although this requires some customization of DSpace). Hopefully I’ll have time to write a SWORD client for it too, and I’ve been thinking about writing an S3 client for fun.
In other news: -
This is the summary of a presentation I am giving tomorrow at ETD2007 (run by the Networked Digital Library of Theses and Dissertations). I’m blogging this as the simplest way of (a) reminding me what I am going to say and (b) acting as a very rough record of what I might have presented. (My talks are chosen from a menu of 500+ possible slides and demos and I don’t know which at the start of the presentation, so it’s very difficult to have a historical record. The blog carries the main arguments.)
Main themes (many of which have been blogged recently):
In detail scientific theses need support for authoring and validating:
Some exciting thesis projects:
Why PDF is so awful: Organic Theses: Hamburger or Cow?
Subversion (CML project)
Wikipedia - caffeine - (info boxes)
GoogleInChI - semantic chemical search without Google knowing
The power of the semantic Web - dbpedia.org - Using Wikipedia as a Web Database.
Chemical blogspace - overview of exciting developments in chemistry
Local demos including analysis of theses:
What should institutions and NDLTD do to promote this vision?
and overall… Use the power of the scholarly community to show that they can communicate science far better than the absurd e-paper, unacceptable business models, and repression of innovation that is forced on us by the commercial and pseudo-commercial publishers. Destroy the pernicious pseudo-science of citation metrics. Reclaim our scholarship.
I am currently at the European Science Foundation’s first summer school on Nanomedicine in Cardiff, where I was invited to present some of the work in polymer informatics which we are doing in Cambridge. The summer school is a wonderful event, with approximately 180 attendees, the majority of whom are PhD students, along with a few undergraduates and a significant number of tenured faculty. The attendees came from a number of scientific disciplines, such as chemistry, biology, physics, medicine and ethics. Bringing people together in this way to talk about a field of research which is completely interfacial is the only sustainable way forward.
An awful lot of people were very impressed by the work we do and our approach to data and knowledge management, and many of the PhD students I spoke to were enthused by the potential power that informatics can bring to their research. They also appreciated the need to have well-curated data that is freely available and not copyrighted by publishers etc. With so many PhD students here talking to each other freely about their research, getting to know each other and appreciating each other’s science, it seemed to me that there is a real chance to build a community that exchanges data and information in order to communally advance a field of research.
While the summer school was very multidisciplinary, there was a predominance of people interested in the use of polymers for all sorts of different applications - not least for applications in drug and gene delivery.
People working in polymer therapeutics are quite often “jacks of all trades;” not only are they chemists who know how to synthesize and purify polymers, but, to a certain extent at least, they also have to be physical chemists, biologists, formulators etc. So the polymer pharmaceuticals community produces very rich and diverse datasets. The data they create is usually of general importance:
An important property of polymers in medical applications, for example, is solubility. So quite often, people working in polymer pharmaceuticals will engage in the determination of phase diagrams for polymers. And as there is a lot of interest in stimulus-responsive polymers, these diagrams are not just measured in pure water, but also in the presence of different ions and at different pH values. Researchers might also be interested in the dimensions of the polymer chain under all of those conditions, so light or X-ray scattering studies are carried out. And that is just on the pure polymer! Conjugation of a drug or gene to the pure material changes the game completely and so all of these measurements potentially get carried out again.
Once we are done with the physicochemical characterisation, we then go on to try and characterize the polymers we have synthesized w.r.t. their biological properties: we are interested in their toxicology, their biodistribution, their specificity etc. That, too, generates an awful lot of data which is potentially related to the structure of the polymers we are dealing with.
And as I said before, it is not only other pharmaceutical people that are interested in this sort of data. A lot of polymer chemists in general as well as companies should in principle be very interested in this type of data: polymers are present in most modern household and cleaning products (check the labels of your shampoo and washing powder bottles).
Therefore it seems to me that we have a rich source of polymer-related data here that we should attempt to harvest. The initial enthusiasm that I have encountered at the summer school leads me to think that maybe we have an opportunity to work with the polymer pharmaceutics/nanomedicine research community to build up, at least in the long term, a valuable polymer knowledge base. Now, I am aware of the fact that this community in particular is very conscious of patents and intellectual property and we have mechanisms to ensure that these considerations can be taken into account and accommodated. How could we get hold of this data?
Over on his blog, Peter has pointed out that a viable way would be to capture digital theses in repositories, which would not only allow the thesis to be preserved, but would undoubtedly also help with dissemination and intelligent data mining. Furthermore, it would be a way to prevent publishers from copyrighting scientific data.
All of this said, the potentialities go much further than this. I have already mentioned the strongly interdisciplinary nature of the summer school. Now, in our work here in Cambridge, we use semantic web technologies to hold information about polymers: we have developed an XML-based polymer markup language and are working on ontologies which codify polymer knowledge. One of the conclusions of my talk was that biologists and medics use exactly the same technologies to communicate their data and knowledge, and so here, for the first time, we have an opportunity to bring knowledge from disparate disciplines together and map it onto each other. In that way, we should be able to develop a joint language through which we and our information systems can understand each other, and that should allow us to ask new questions - Peter has already demonstrated what is possible when a thesis can be turned into RDF.
And theses originating in a strongly interdisciplinary field of research could be a wonderful starting point.
So, dear polymer science/polymer pharmaceuticals community, how about it? If you are interested not only in preserving and disseminating your data (after patenting etc.), but also in being able to ask new questions of it and in bringing multiple disciplines together, then give us your theses and let us work with you to show you how all this can be achieved. Here’s an offer - please take us up on it.
I’ve been moving more of SPECTRa over to Sourceforge and finding things out about the service. Something that made me sit back and have a think was the sourceforge backup policy. In a nutshell this states that they take at least weekly backups, but won’t restore them unless there’s a catastrophe at their end. I’d say the #1 risk, with high hazard and probability, is me making a mistake. Sourceforge don’t protect you against yourself.
That’s fair enough, it’s just a bit more work to arrange backups. But I’m glad I noticed the policy (I got there from a reference in the login shell): other colleagues I spoke to have used Sourceforge for years, were unaware of it, and don’t back up, expecting Sourceforge to do it. The only problem here was my own expectation. Sourceforge provide a generally great service, and I’ve never heard a tale of woe about data loss on Sourceforge, so I built an expectation that they take care of backup and restore.
Perhaps this effect works in favour of IRs? One of the values of an IR is in acting as a deposition, access and dissemination service (as especially espoused by OA evangelists). Another value is in the provision of good curation. The expectation is matched by the service. I think, though, that the expectation has been built in a large part by the increasingly recognized brands of repo software projects such as DSpace, ePrints, Fedora et al.
I think this lies at the heart of why I felt initially uncomfortable about the idea of repositories sitting wholly within the web architecture (Andy Powell on the subject): If the IR is presented as ‘just a website’ then there’s no expectation, and you have to work to convince the user that they’re getting value. If you buy in to the web architecture vision Andy and others have been describing for IRs (as I have!), and if you agree that IRs are going to need a whole range of software to satisfy their users’ needs, then the software brand is going to be less and less important to users’ perception of the value of the IR, which might diminish as a result.
Earlier this week I attended JISC’s Dealing with the Data Deluge conference; part of their digital repositories programme work. The presentations were good, and more importantly there were some very interesting thoughts flying around in coffee rooms, dinner halls and pubs.
One of the stand out presentations for me was John MacColl’s presentation on the findings of the StORe project, which was investigating issues around data repositories and linking research publication repositories to data repositories. Two items in particular caught my notice.
Firstly, StORe found that whilst academia treats PhD students very differently to postdoctoral researchers, their data management, curation and reposition requirements are the same. This is interesting from my point of view on the SPECTRa-T project; it’s reassurance that SPECTRa-T will be relevant to the wider problem of chemistry publications even though our focus is on theses.
It’s also encouraging for anyone who wants the state of the art in data repositories to move forward, since this will almost inevitably require changes in the behaviour of researchers and PhD candidates tend to be more open to change.
The second thing that particularly caught my notice was StORe’s conclusion that data curation is a difficult task which we cannot / should not burden researchers with. Additionally, it’s so specialised that the expertise probably can’t be provided at an institutional level, but could be successfully handled by a number of (perhaps peripatetic) specialist data librarians (e.g. funded by JISC).
This strikes a chord; from my early experiences with chemistry data on the DSpace@Cambridge project building the WWMM collection there, it was clear to me that a centralized institutional repository service could not hope to effectively preserve specialist scientific data. It seemed to me that preservation could only be achieved by a collaboration between people with curation expertise (librarians) and domain expertise on data formats and trends. Thinking on it more I’ve decided that you can apply this not just to “specialist scientific data”, but to any data that isn’t in the usual run of office and web formats. John’s findings are a more wide ranging statement of this, applying to all of curation, not just to preservation. It’ll be interesting to see whether and how the JISC or other funding bodies take this idea up.
As John pointed out (supported by Chris Rusbridge subsequently), this all makes the AHRC’s strange decision to cease funding for the AHDS particularly disappointing, especially since AHDS are providing a service that’s pretty close to John’s vision. Let’s hope this petition has some positive impact.
I am grateful for the recent correspondence from Peter Suber and Stevan Harnad as it helps me get my thoughts in order for ETD2007. In response to Stevan:
Open Access: What Comes With the Territory,
Peter has analysed the central question very clearly (as always)
Summary [of Stevan’s post]: Downloading, printing, saving and data-crunching come with the territory if you make your paper freely accessible online (Open Access). You may not, however, create derivative works out of the words of that text. It is the author’s own writing, not an audio for remix. And that is as it should be. Its contents (meaning) are yours to data-mine and reuse, with attribution. The words themselves, however, are the author’s (apart from attributed fair-use quotes). The frequent misunderstanding that what comes with the OA territory is somehow not enough seems to be based on conflating (1) the text of research articles with (2a) the raw research data on which the text is based, or with (2b) software, or with (2c) multimedia - all the wrong stuff and irrelevant to OA.
Comments
PMR: “The chief problem with this view is the law”. That puts it precisely, and that’s where Stevan and I differ. At the moment I think we have to work within the law, and I think the law debars me from crunching. There may come a time where we feel that civil disobedience is unavoidable but it hasn’t arrived yet - if it does I shall be there.
And some comments on other parts of Stevan’s post:
Get the Institutional Repository Managers Out of the Decision Loop
The trouble with many Institutional Repositories (IRs) (besides the fact that they don’t have a deposit mandate) is that they are not run by researchers but by “permissions professionals,” accustomed to being mired in institutional author IP protection issues and institutional library 3rd-party usage rights rather than institutional author research give-aways.
PMR: I have had similar thoughts. I got the distinct impression that some IRs are run like Victorian museums - look but don’t touch. The very word “repository” suggests a funereal process - it’s no surprise that having put much of my stuff into DSpace I find it’s an enormous effort to get it out. Why don’t we build “disseminatories” instead?
[Stevan’s analysis of how we should deposit papers omitted. I don’t disagree - I’m just more interested in data at present.]
Now, Peter, I counsel patience! You will immediately reply: “But my robots cannot crunch Closed Access texts: I need to intervene manually!” True, but that problem will only be temporary, and you must not forget the far larger problem that precedes it, which is that 85% of papers are not yet being deposited at all, either as Open Access or Closed Access. That is the inertial practice that needs to be changed, globally, once and for all.
PMR: Here we differ. In many fields there has been little movement and no Green journals. We could wait another five years for no effect. But my main concern is the balance between Green access and copyrighted data. The longer we fail to address the copyrighting of data the worse the situation will become. Publishers are not stupid - they have revenue-oriented business people working out how to make money out of our data - Wiley told me so. Imagine, for example, that a publisher says “I will make all our journals green as long as we retain copyright. And we’ll extend the paper to cover the whole of the scientific record”. That would be wonderful for Stevan and a complete disaster for paper-crunchers. We can’t afford to wait for that to happen.
Just as I have urged that Gold OA (publishing) advocates should not over-reach (“Gold Fever”) - by pushing directly for the conversion of all publishers and authors to Gold OA, and criticizing and even opposing Green OA and Green OA mandates as “not enough” - I urge the advocates of automatized robotic data-mining to be patient and help rather than hinder Green OA and Green OA (and ID/OA) mandates.
PMR: I am not - I hope - hindering Green access. I am not personally agitating for Green or Gold - my energies go into arguing that the experimental process must not be copyrighted by the publisher or anyone else. And that institutional repositories should start to be much much more proactive and actively support the digital research process.
Peter Murray-Rust writes on how open open access is with respect to PhD theses. In particular, he quotes Stevan Harnad’s list of things you can and can’t do:
“that I can create derivative works”
You may not create derivative works. We are talking about someone’s own writing, not an audio for remix. And that is as it should be. The contents (meaning) are yours to data-mine and reuse, with attribution. The words, however, are the author’s (apart from attributed fair-use quotes). Link to them if you need to re-use them verbatim (or ask for permission).
“that I can use machines to text- or data-mine it”
Yes, you can. Download and crunch away.
Speaking as a chemist-turned-natural-language-processing-person, there are a number of things I might want to do, which may or may not come under the heading of creating derivative works.
Finally, I find the comment comparing writing to audio, and saying that audio is good for remixing but words aren’t, most strange. Many songs are artistic works, and contain some sort of artistic integrity that the artists may wish to protect. Now many audio artists are enlightened about these things; for example, a prized part of my music collection is a set of Depeche Mode remixes which do some very odd things to the source tracks; some are rubbish, but some are amazing re-interpretations. However, if I were creating some music, I might be offended by people using my music to convey religious opinions with which I do not agree.
However, we are not talking about audio remixes, we are talking about glosses. Those editions of Shakespeare with notes in the margin for students are an example, as are Bibles with notes in the margin giving helpful hints on how to interpret it and apply it to your life (or the historical/archaeological/linguistic context of various passages, or pointing out inconsistent or morally questionable parts etc.). To a certain extent these examples take advantage of the authors being long dead. I could also mention Fermat writing in the margins of mathematics books.
More importantly, we are not talking about works created for entertainment or aesthetic appeal. We are talking about scholarly works. We are talking about the dissemination of knowledge. It is quite possible that in the 21st century, one of the most useful tools in the effective dissemination of knowledge will be the (automatic) creation of glosses, indexes, concordances and other derivative works. Now I can understand people locking their works up so that they can sell them to people, and get funding for their work that way - this does not apply to PhD theses in this repository. I can understand people being wary about other people putting words into their mouth; conventional fair-use quotation is just as dangerous for this, more so, as excerpts can be deliberately taken out of context in order to misrepresent people. Or people may just feel precious about the fruit of their labours - to which I have to say that that’s not how scholarship works! Your funders have paid you to produce useful knowledge; don’t lock it up without a really good reason.
Stevan Harnad - a tireless evangelist of OA - has replied to my points. He has been consistent in arguing the logic below and I agree with the logic. The problem is that few people believe that this allows us to act as he suggests.
Stevan argues that current Green Open Access allows us to do all we wish with the exposed material without permission. However, when I spoke to several repository managers at the JISC meeting, all were clear that I could not have permission to do this with their current content. I asked “can my robots download and mine the content in your current open access repository of theses?” - No. “Can you let me have some chemistry theses from your open access collection so I can data-mine them?” - No - you will have to ask the permission of each author individually. So Stevan’s views on what I can do seem not to be - unfortunately - widely held.
Peter Murray-Rust’s worries about OA are groundless. Peter worries he can’t be sure that:
“I can save my own copy (the MIT [site] suggests you cannot print it and may not be allowed to save it)”
Pay no attention. Download, print, save and crunch (just as you could have done if you had keyed in the text from reading the pages of a paper book)! [Free Access vs. Open Access (Dec 2003)]
“that it will be available next week”
It will. The University OA IRs all see to that. That’s why they’re making it OA. [Proposed update of BOAI definition of OA: Immediate and Permanent (Mar 2005)]
“that it will be unaltered in the future or that versions will be tracked”
Versions are tracked by the IR software, and updated versions are tagged as such. Versions can even be DIFFed.
“that I can create derivative works”
You may not create derivative works. We are talking about someone’s own writing, not an audio for remix. And that is as it should be. The contents (meaning) are yours to data-mine and reuse, with attribution. The words, however, are the author’s (apart from attributed fair-use quotes). Link to them if you need to re-use them verbatim (or ask for permission).
“that I can use machines to text- or data-mine it”
Yes, you can. Download and crunch away.
This is all common sense, and all comes with the OA territory when the author makes his full-text freely accessible for all, online. The rest seems to be based on some conflation between (1) the text of research articles and (2a) the raw research data on which the text is based, and with (2b) software, and with (2c) multimedia - all the wrong stuff and irrelevant to OA.
Stevan Harnad
American Scientist Open Access Forum
Specific issues:
My concern was not just with material in repositories but also with material elsewhere. Some publishers allow green open access posting on web sites but debar it from repositories. So the concerns remain.
The MIT repository deliberately adds technical restrictions to prevent printing of its theses, and this also technically prevents data and text mining. There are some hacks possible to get round this but they come close to dishonesty and illegality.
“derivative works” is a phrase that doesn’t work well in the data-rich subjects and we need something better. But it’s what the licenses use at present.
In data-rich subjects, linking to repositories is often of little use. I need thousands of texts on specialist machines accessed with high frequency and bandwidth.
My problem is not with Stevan’s views but that few others give positive support to them, particularly not the repository managers. Maybe I’m too cautious…
I recently posted my concern about the use of “open access” as a phrase which is sufficiently broad to be confusing, and Peter Suber has created a thoughtful and useful reply. I agree in detail with all his analysis and any differences are probably in emphasis and strategy.
Peter Murray-Rust, “open access” is not good enough, A Scientist and the Web, June 10, 2007. Excerpt:
Comments [PeterS]
PMR: Agreed. PeterS continually and consistently asserts this - I am arguing that the level of emphasis throughout the community should be higher.
PMR: I accept this. In which case I think we have to look for additional tools of discourse. If “open access” serves an important current purpose in a broad sense it should continue to be used in that way but we should not expect it to deliver precision.
PMR: agreed. But “we often need the extra precision” is also valid.
Part of the problem arises because in the Green approach to “open access” there is often an implicit trade-off between price freedom and permission freedom. There is toll-free access at the expense of having no permissions other than human readability - all the permissions (other than “fair use”) remain with the publisher. Many people may feel that this is a reasonable compromise in journal publishing at the present stage. Some may feel that 100% Green open access is an acceptable endpoint.
But I think it comes with a cost to those of us who wish to develop digital scholarship - the use of the information in scholarship by machines as well as humans. As an example the JISC meeting on institutional repositories I have just been at was called “Digital Repositories - Dealing with the Digital Deluge”. This is an emotive phrase - but it’s currently misleading. In many subjects there is a complete Digital Drought. And unless the permissions issue is dealt with there will continue to be. Permission freedom is essential for digital scholarship.
My concern is that unless we address the permission issue much more actively we shall slide into the acceptance that permission freedom is the exception or less important than price. The one area where we have the power to act unilaterally is those parts of our own scholarship over which we have effective control - theses, data in repositories, teaching/learning materials, technical reports, etc. Let us work to make these 100% permission free.
My immediate urgency is fueled by the ETD2007 meeting tomorrow. I hope that we can find consensus on this issue.
I have continued to try to find full OpenAccess theses and encountered considerable difficulty. The main problem is that universities and their repositories do not help readers to find theses with OpenAccess licenses and in many cases they do not give any license information at all.
Anyway the story… I searched Google for “open access creative commons thesis” and found Mathias Klang’s thesis on Disruptive Technology. Mathias claims this is the first thesis in Sweden to be issued under CC, so I mailed and asked whether he had information from other countries about earlier theses. He mailed back:
Oleg Evnin at Caltech (successfully defended May 26, 2006) [PMR: blogged by Peter Suber]…a number of CC-licensed ETDs at the U of Edinburgh and that the earliest seems to be by Magnus Hagdorn, submitted on March 4, 2004.
Many thanks Mathias, and I shall enjoy reading your thesis - this whole area needs some disruptive technology - I am finding that approaches to repositories still look conservative and based on outdated models of thought.
I can’t comment in detail on the science but the format of Magnus’ thesis is an excellent example of what a modern thesis should contain - it’s 400 Mbyte zipped but contains splendid animations and data of glaciation - worth a look.
But the problem with the repositories is that there is no indication that the actual thesis is OpenAccess. The Edinburgh repository announces:
All items in ERA are protected by copyright, with all rights reserved.
Copyright for this page [1] belongs to The University of Edinburgh
[1] i.e. the metadata splash page
which discourages the visitor from looking for an Open License within the thesis.
I’m sure this isn’t deliberate, but, repository managers, here is a very simple idea:
Add dc:rights to the splash page and metadata and proudly proclaim in large letters:
THIS THESIS CARRIES A CREATIVE COMMONS LICENCE - ENJOY!
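As a small illustration of the suggestion above, here is a sketch of how repository software might emit both a machine-readable dc:rights element and the human-readable banner. The licence URL is only an example value, and the two helper functions are hypothetical; how a given repository platform actually stores and renders its metadata will differ.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Example licence URL; a repository would substitute the licence the author chose.
LICENCE_URL = "http://creativecommons.org/licenses/by/2.5/"


def rights_element(licence_url):
    """Build a dc:rights element carrying the licence URL (hypothetical helper)."""
    element = ET.Element("{%s}rights" % DC_NS)
    element.text = licence_url
    return element


def splash_banner(licence_name):
    """Build the large-letters notice for the splash page (hypothetical helper)."""
    return "THIS THESIS CARRIES A %s LICENCE - ENJOY!" % licence_name.upper()


rights_xml = ET.tostring(rights_element(LICENCE_URL), encoding="unicode")
banner = splash_banner("Creative Commons")
print(rights_xml)
print(banner)
```

The point of doing both is that the banner tells the human visitor, while the dc:rights value gives harvesters and robots something they can act on without reading the thesis itself.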
As you know I am looking for real Open Access theses (not fuzzy open). Where have I found the most so far? Not in any of the highly supported repositories but in the Harvard College Thesis Repository, part of Harvard College Free Culture - here’s their splash page…
Welcome to the Harvard College Thesis Repository
Welcome to the Harvard College Thesis Repository, a project of Harvard College Free Culture! Here Harvard students make their senior theses accessible to the world, for the advancement of scholarship and the widening of open access to academic research.
Too many academics still permit publishers to restrict access to their work, needlessly limiting—cutting in half, or worse—readership, research impact, and research productivity. For more background, check out our op-ed article in The Harvard Crimson.
If you’ve written a thesis in Harvard College, you’re invited to take a step toward open access right here, by uploading your thesis for the world to read. (If you’re heading for an academic career, this can even be a purely selfish move—a first taste of the greater readership and greater impact that comes with open access.)
If you’re interested in what the students at (ahem) the finest university in the world have to say at the culmination of their undergraduate careers, look around.
There are 28 theses here and - unlike the green fuzzy repositories - all have been deposited under CC-BY (i.e. completely compliant with BOAI). The web page didn’t make the license position clear but I got the following clarification today:
Yes - all users of our repository agreed to a CC-by license when they uploaded their theses. As part of the submission process, all users agreed to the following terms: “I am submitting this thesis, my original work, under the terms of the Creative Commons Attribution License, version 2.5: roughly, I grant everyone the freedom to share and adapt this work, so long as they credit me accurately. I have read and understood this license.” We will work to make this more clear in the metadata for each thesis.
Well done Harvard College Free Culture - you have made an important step forward. Convince students in other institutions to follow your lead and the battle is won.
(Not surprisingly there are no chemistry theses but I am sure that can be fixed).
I shall be using Alicia’s Open Science Thesis in Useful Chemistry as a technical demonstrator at ETD2007. I really want to show how a born digital thesis is a qualitative step forward. Completely new techniques can be used to structure, navigate and mine the information. Here’s a taster:
A chemical reaction diagram (“scheme”) is a graphic object which looks like this:
As you can see, this is semantically useless. A lot of work has gone into this, but none of it is useful to a machine (look closely and you’ll see it’s a JPEG). Even in the native software used to draw it, it is unlikely that the semantics can be easily determined. However, XML and RDF allow a complete representation. It took me about 1 hour to handcraft the topology - if we had decent tools it would be seconds. The complete set of reaction schemes (I counted 11 in the thesis) can be easily converted to a single RDF file which looks something like this:
uc:scheme1_1 pmr:isA pmr:reactionScheme .
uc:scheme1_1 pmr:hasA uc:rxn1_1a .
uc:scheme1_1 pmr:hasA uc:rxn1_1b .
uc:rxn1_1a pmr:hasReactant uc:comp1 .
uc:rxn1_1a pmr:hasReactant uc:comp2 .
uc:rxn1_1a pmr:hasReactant uc:comp3 .
uc:rxn1_1a pmr:hasReactant uc:comp4 .
uc:rxn1_1a pmr:hasProduct uc:comp5 .
uc:rxn1_1b pmr:hasReactant uc:comp5 .
uc:rxn1_1b pmr:hasProduct uc:comp6 .
(uc: refers to the usefulChemistry namespace, pmr: to mine).
There are many Open Source tools for graphing this, and here is part of the output of one from the W3C:

Here you can see that reaction1.1a has four reactants (compound 1,2,3,4) and 1 product (comp 5). Comp5 is the reactant for another reaction (clipped to save blog problems). The complete picture for the whole thesis looks like this:

and (assuming you have a large screen) you can see immediately what reactions every compound is involved in.
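The same lookup can be done without drawing anything. Here is a minimal sketch in Python, assuming the triples are kept in the simple one-statement-per-line form shown above; the function names parse_triples and reactions_by_compound are made up for this illustration and do not come from any RDF toolkit.

```python
from collections import defaultdict

# The triples from the post, one "subject predicate object ." statement per line.
TRIPLES = """\
uc:scheme1_1 pmr:isA pmr:reactionScheme .
uc:scheme1_1 pmr:hasA uc:rxn1_1a .
uc:scheme1_1 pmr:hasA uc:rxn1_1b .
uc:rxn1_1a pmr:hasReactant uc:comp1 .
uc:rxn1_1a pmr:hasReactant uc:comp2 .
uc:rxn1_1a pmr:hasReactant uc:comp3 .
uc:rxn1_1a pmr:hasReactant uc:comp4 .
uc:rxn1_1a pmr:hasProduct uc:comp5 .
uc:rxn1_1b pmr:hasReactant uc:comp5 .
uc:rxn1_1b pmr:hasProduct uc:comp6 .
"""


def parse_triples(text):
    """Split each non-empty line into a (subject, predicate, object) tuple."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop the terminating "." and split the rest on whitespace.
        subject, predicate, obj = line.rstrip(".").split()
        triples.append((subject, predicate, obj))
    return triples


def reactions_by_compound(triples):
    """Map each compound to the reactions it takes part in, with its role."""
    roles = {"pmr:hasReactant": "reactant", "pmr:hasProduct": "product"}
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate in roles:
            index[obj].append((subject, roles[predicate]))
    return dict(index)


index = reactions_by_compound(parse_triples(TRIPLES))
for compound in sorted(index):
    print(compound, index[compound])
```

In this sketch uc:comp5 is listed twice, once as the product of uc:rxn1_1a and once as a reactant of uc:rxn1_1b, which is the link between the two reactions that the graph shows. A real RDF library would handle full URIs, literals and other serialisations that this line splitter ignores.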
That’s only the start as it is possible to ask sophisticated questions from a SPARQL endpoint - and that’s where we are going next…
… IFF you make the theses true Open Access
I have ranted at regular intervals about the use of “Open Access” or often “open access” as a term implying more than it delivers. My current concern is that although there are tens of thousands of theses described as “open access” I have only discovered 3 (and possibly another 15 today) which actually comply with the BOAI definition of Open Access.
The key point is that unless a thesis (or any publication) explicitly carries a license (or possibly a site meta-license) actually stating that it is BOAI compliant, then I cannot re-use it. I shall use “OpenAccess” to denote BOAI-compliant in this post and “open access” to mean some undefined access which may only allow humans to read but not re-use the information.
I do not wish to disparage the important efforts to making scholarly information more widely available, and I applaud the general direction and achievement of the groups below. I appreciate that the copyright of historical content normally is held by the student author and it’s certainly very valuable to have “access” to it. But it is not OpenAccess. And unless specific policies are put in place to add specific BOAI-compliant licenses then future theses will also be non-compliant.
Here are typical statements:
By contrast let’s look at “Open Source” which applies to software and has been highly successful in liberating the field. It’s very widely used in academia and elsewhere. The Open Source Definition states
Open source doesn’t just mean access to the source code. [PMR’s emphasis] The distribution terms of open-source software must comply with the following criteria [PMR’s elisions]:
1. Free Redistribution
The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale.
2. Source Code
The program must include source code […]
3. Derived Works
The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software.
4. Integrity of The Author’s Source Code
The license may restrict source-code from being distributed in modified form only if the license allows […]
[…]
7. Distribution of License
The rights attached to the program must apply to all to whom the program is redistributed without the need for execution of an additional license by those parties.
[…]
10. License Must Be Technology-Neutral
No provision of the license may be predicated on any individual technology or style of interface.
In general the term “Open Source” is completely self-explanatory within a large community. I can describe my software as OS and everyone understands what I mean. There are some licenses (e.g. GPL) which require additional freedoms but they don’t invalidate the above.
By contrast if someone describes something as “open access” it simply means that I may - as a human - and at some arbitrary time in human history - read the document. It does not guarantee that
So I believe that “open access” should be recast as “toll-free” - i.e. you do not have to pay for it but there are no other guarantees. We should restrict the use of “Open Access” to documents which explicitly carry licenses compliant with BOAI. [A weaker (and much more fragile) approach is that a site license applies to all content. The problem here is that documents then get decoupled from the site and their OpenAccess position is unknown.]
If the community wishes to continue to use “open access” to describe documents which do not comply with BOAI then I suggest the use of suffixes/qualifiers to clarify. For example:
However there is no value in “Green open access” for theses. Let’s make sure they are all BOAI compliant.
As regular readers will know we are applying text-mining to chemistry in Open theses. The problem is finding fully Open theses - so far we have got Alicia’s. Alicia has captured all her molecules in semantic form so text-mining isn’t required - and I’m hoping to do some fun stuff with XML on it.
I’ve searched for large collections of theses. MIT has a promising collection which would be ideal but they are only TollFree, not OpenAccess. I’m still appealing for readers to help. But in one of those quirks of Googling I ended up at the digital repository of the University of Stirling.
I am delighted about this since I spent 15 very happy years on the staff at Stirling in the Chemistry department. It doesn’t have one now - that’s why I left - although we’re having our 40th anniversary later. But perhaps there are theses with chemical concepts. Now the repository announces:
The copyright in theses in this collection remains with the author, unless it is stated to have been assigned to the University of Stirling. The University of Stirling reserves the right to keep electronic copies for consultation in both cases.
so I wasn’t very hopeful. But I thought I’d have a look and found one in aquaculture - one of the successful disciplines in Stirling:
which carried the licence:
This item is protected by original copyright
Items in the Repository are protected by copyright, with all rights reserved, unless otherwise indicated.
still not hopeful, until I read the license:
License granted by Kriengkrai Satapornvanit (ffiskks@ku.ac.th) on 2007-03-26T06:34:59Z (GMT):
[...]
END USER LICENCE
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 1.0 Licence.
YOU ARE FREE:
- to copy, distribute, display, and perform the work
- to make derivative works
Under the following conditions:
ATTRIBUTION You must give the original author credit.
NON COMMERCIAL You may not use this work for commercial purposes.
SHARE ALIKE If you alter, transform, or build upon this work, you may distribute the resulting work only under a licence identical to this one.
For any reuse or distribution, you must make clear to others the licence terms of this work. Any of these conditions can be waived if you receive permission from the author. Your fair dealings and other rights are in no way affected by the above.
Many public thanks, Kriengkrai Satapornvanit and I hope your future work prospers. Now I am completely free to see if chemicals can be mined from the thesis:
You can see that OSCAR has recognised many words as likely chemical terms (in yellow) and knows the structure of the underlined ones (the full version would know all of them). It’s not 100% accurate - you can see it thinks “P” is the element phosphorus - but Peter Corbett has addressed this in later versions.
So this allows us to collect metadata from theses automatically. OSCAR can tell us in a few seconds that this thesis is concerned with specific pesticides. That’s part of the basis of the SPECTRa-T project. Since we’ve benefited from Open Source theses, maybe we should do the whole project on an Open Wiki…
Jean-Claude Bradley and coworkers have pioneered the concept of Open Science in chemistry - and it goes beyond that. On UsefulChem he writes:
The fact that Alicia’s masters thesis “Synthesis of Diketopiperazines, Possible Malaria Enoyl Reducatase Inhibitors Using Open Source Science” is being written on a wiki was noted by Pharyngula, A Blog around the Clock and Pimm - Partial Immortalization.
I am particularly happy that Attila from Pimm has obtained permission from his supervisor to write at least part of his thesis on his blog. Outside of the sciences, I recall Mark Wagner doing something similar for his thesis on educational gaming. Also see Laura Blankenship’s thesis on blogging in the classroom.
Yes - there has been a lot of interest in this innovative approach and I’m delighted to echo it. Since they wish this to be an open process here are my comments directly for Alicia to use if she wishes:
My immediate technical goal would be the creation of a datument (everything in XML) for the thesis - I’m not going to do all that myself. But I would be keen to see the reaction sequences in animated SVG…
The same goes, of course, for anyone else writing Open theses.
I am currently working on an OWL ontology for polymers and one of the questions that pops up from time to time is how to code statements along the lines of “polystyrene has a glass transition temperature, which was measured using DSC”. What we have here is a ternary statement. “Polystyrene has a glass transition temperature” would be a binary statement. However, “glass transition temperature” also has the concept “DSC” associated with it. Graphically, this could be represented as follows:

One possible way of dealing with this is to define the relationship as an additional class rather than a property. We could, for instance, introduce a class called “PropertyRelation” which would denote the fact that polystyrene has a GlassTransitionTemperature, which was measured by DSC.
Graphically this could look like this:

Finally, when expressing this in the form of a class hierarchy, we can draw the following diagram:

(I tried to post some OWL code here, but wordpress messes up the code…..sorry).
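The pattern itself can still be sketched without OWL syntax. What follows is a minimal Python sketch, not the ontology's actual code: it writes the reified relation as plain triples and treats polystyrene as a single individual for simplicity. All the "ex:" names (ex:rel1, ex:hasPropertyRelation, ex:hasProperty, ex:measuredBy) are made up for this illustration; only PropertyRelation, GlassTransitionTemperature and DSC come from the text above.

```python
# Each statement is a (subject, predicate, object) triple.
# All "ex:" property and node names are hypothetical, chosen for illustration.
triples = [
    ("ex:rel1", "rdf:type", "ex:PropertyRelation"),
    ("ex:Polystyrene", "ex:hasPropertyRelation", "ex:rel1"),
    ("ex:rel1", "ex:hasProperty", "ex:GlassTransitionTemperature"),
    ("ex:rel1", "ex:measuredBy", "ex:DSC"),
]


def objects(subject, predicate):
    """Return every object of the triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def measured_properties(material):
    """Return (property, method) pairs reached through PropertyRelation nodes."""
    pairs = []
    for relation in objects(material, "ex:hasPropertyRelation"):
        for prop in objects(relation, "ex:hasProperty"):
            for method in objects(relation, "ex:measuredBy"):
                pairs.append((prop, method))
    return pairs


print(measured_properties("ex:Polystyrene"))
```

The point of the intermediate ex:rel1 node is that the measurement method hangs off this particular relation rather than off the glass transition temperature concept in general, so a second relation node could record a different method for another material.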
I’ve booked my flights to SciFoo. I’ve never actually been outside Europe before, so it feels like a bigger deal than I guess it actually is. Next up I’d better get my passport renewed; one advantage of which is that the photo in my current passport is me-as-a-sixteen-year-old - and goodness is it embarrassing. I won’t be sorry to see the back of that. I’m going to be in the Bay Area for a while after the conference too, so this is a shout-out to anyone who’s going to be around there in early August and who’d like to meet up. Chuck a note in the comments!
Anyway, I’ve been a bad blogger recently; I’ve had a bunch of ideas floating around which I’ve meant to write about but I’ve not quite been able to find the words. The thing is that a lot of them are connected to the major thrust of my recent research, and I’m not quite ready to talk about that yet…
One teaser though. Adrian Holovaty, Django guru — software which gets the BtC seal of approval, incidentally, and no mistake — writes:
Medill (a college in the States — adw) received a grant to offer full journalism scholarships for geeks. Essentially, they’re looking to train the next generation of journalist-programmers, and I’m really excited to see that. If you’re a computer scientist looking to get into journalism (a rewarding and important field), you would do well to check out the scholarship and apply.
Journalist-programmers. Journalists first, programmers second. I like that — in fact, it’s really exciting to see people thinking of journalism as something which can be carried out by writing, by interviewing, by radio, by photography… and now by programming. I’ve mentioned this before, but something like Chicago Crime is, when you look at it, journalistic in nature and intent; it may be implemented as a computer program, but isn’t that missing the point when it comes to whether it’s journalism or not? It’s like criticising a painting for being made of canvas and pigment.
Anyhow, how this links with science is that, for me, programming is the medium, not the message. It’s the technique by which I do science, rather than the science being an excuse to write code. Maybe that’s a subtle distinction, but it feels like an important one to me!
June 12th, 2007 at 3:37 am eOpen Access: What Comes With the Territory