This is the transcript of a talk I gave to the Mineral Sciences group, in the Earth Sciences department of the University of Cambridge, on the 5th of March.
The talk was recorded on video; if the video looks good, I’ll be uploading it to YouTube. There are a number of URLs mentioned in the talk - as this is just a copy of my notes, you can get all of those from the list on my del.icio.us account.
On with the show!
Web 2.0 for Scientists: An Introduction
Andrew Walkingshaw, Unilever Centre for Molecular Informatics, Department of Chemistry, University of Cambridge
Hi, everyone; I’m Andrew Walkingshaw. Most of you know me already - I was a PhD student in this group - but these days I work in Peter Murray-Rust’s group in the Unilever Centre for Molecular Informatics, which is part of the Department of Chemistry. As a group, what we’re mostly interested in is the future of chemical information; I’m a member of a project (led by Martin) called MaterialsGrid, and in that I work on tools to teach computers about concepts in solid-state physics.
Undoubtedly, you’ll be relieved to hear that I’m not going to be talking much about that today, but if you want to know more about it, have a look at our website. By the way, I’m going to be mentioning a lot of websites in this talk, but there’s no need to take notes; I’ll be posting a full transcript on my blog later. As you can see up the back, one of my colleagues is videoing this talk too; hopefully I’ll get that up on YouTube soon - if it goes well, anyway, so please, be nice!
So.
The title of this talk’s “Web 2.0 for Scientists”. Tricky title; first up, no-one seems to really have a good handle on what Web 2.0 is. However, Professor Michael Wesch at Kansas State University’s Digital Ethnography group does; I’d like to play you a video he made and uploaded to YouTube. It captures what excites, and scares, me about the modern Web better than anything else I’ve seen.
1994 was the year the Internet broke into the mainstream. I was thirteen; half my life ago. (It’s a bit scary to look back and remember that). What’s even scarier, though, is the dawning realisation that the students who’ll be starting university in 2008 will have lived with the Internet their entire academic lives. When you grow up with a technology that powerful, you can’t imagine a world without it; I can’t imagine a world without the telephone, or television. What differences are there going to be between their worldview and mine? What are they going to expect about privacy? About collaboration? About access to information? About copyright?
How are they going to expect science to be communicated?
The real question is this: how are those assumptions going to affect what we do, and how we talk about it? We’re going to spend the rest of our careers in this brave new world, so we’d better get with the program now.
So, let’s draw up a timeline. What technologies - what aspects of the Web and always-on communications - will have “always” been around for someone matriculating in the coming years?
I mean, I went to school for the first time in 1985, when I was four. The Macintosh was launched in 1984, so in my world, the coolest computers have always had mice, and sound, and graphics; they’ve always been these friendly little grey boxes on the corner of the desk, not enormous monolithic blocks with flashing lights, attended to by stern armies of white-coated priests. That’s a really different worldview.
Let’s start with the folks who first went to school in 1989; they were four or five then, so they’re 22 or 23 now; they’re grad students or just starting full-time work. Well, there’s always been commercial Internet service providers. The first one of those launched in 1989, as I say, so the class of 2003 have lived through that; and the class of 2004 were starting primary school when the Web was invented.
The class graduating this year, so next year’s grad students - the class of 2005 - they started school with the first digital mobile phones in the UK, so everyone’s always been the push of a button away for them.
This year’s class - 2007 - well, the first mainstream graphical web-browser, Mosaic, came into the academic world alongside them. It’s no surprise that as they filter into the workplace, everything’s changing.
2008 - next year - is when it’s all really going to accelerate. That’s the class of Netscape Navigator and the mainstream web; and the class of 2009 really are the Internet generation.
What have they known their entire intellectual lives?
Windows 95. Internet Explorer. eBay. Yahoo! The first wiki.
That’s the start of the Web 1.0 boom, which really carries on until the class of 2015.
2011 brings you the class who’ve always known blogs - what are they going to think about the mainstream media and mainstream scientific publishing? - and 2012’s new students have always been around the BBC News website, and most significantly of all, they’ve never known a web without Google. They’re the class who started primary school when I started university; I still, just about, remember google.stanford.edu when it was a research project.
(I don’t know what it is about Stanford and search engines; Yahoo! started there too.)
The Internet’s always had more than fifty million computers on it if you started school in 2014; in 2015, you’ve always been able to look your homework up on Wikipedia. 2016’s kids will have always been able to do it in free web browsers from the Mozilla project, probably Firefox (which was launched in time for the class of 2018), and the class of 2017 - born in 1999 - have always kept in touch with their friends on social networks like Myspace.
And 2019’s class - the students born in 2001? Well, they’ve always watched their TV on YouTube.
The next fifteen years are going to be full of even more change than the last fifteen, because they’re going to be defined and driven by the expectations of people who grew up with the innovations we’ve lived through. We’ve learned to live with them; they’ve shaped the coming generations’ worldviews.
I’ve still not said what Web 2.0 is, though.
A lot of people think Web 2.0 means social networking, means Myspace, means Facebook. I’m on both of them, for what it’s worth, but for very different reasons; different social networks have different audiences. On Myspace I’m a musician; on Facebook I’m keeping up with the folks I was a student with. They’re the coffee-rooms, pubs, meeting-places of the new Internet.
Anyway, social networking is a part of it, but it’s not the whole story. For those of you who haven’t seen Facebook, it’s a website where you keep in touch with your friends; swap notes, photos, stories, and yes, quite a lot of undergrads use it for things which I’m probably not going to go into in a scientific seminar, but mostly it’s a way of making contact with people and keeping in touch. Now, replace dating with scientific collaboration, and you’ve got something we can use; that’s what Nature have been working on with the Nature Network, which they announced last week.
That’s exciting, and useful, and potentially a new way of finding scientific collaborators and colleagues - but potentially awesome as that is, and it is, it’d be selling the potential of the new Web short to just restrict ourselves to this one application, no matter how cool and useful.
But I still haven’t said what Web 2.0 is.
Well, originally, it’s a brand-name for a set of conferences from an American publisher of computer books called O’Reilly and Associates. But it isn’t that either; it’s become a useful shorthand.
What it’s become short-hand for, though, is a lot more complex.
The Web in general is at least three things.
- It’s a set of technologies which make it unprecedentedly easy to distribute data of any form between computers anywhere in the world.
- It’s a political and economic force.
- It’s a sociological phenomenon - a new medium of communication with subtly different properties to all of the old ones.
The thing is, if you make a change in any one of these areas, you inevitably affect the other two. Really, they’re inseparable.
So, predicting the future’s essentially impossible; it’s like predicting the weather. Everything’s too interlinked, too tightly-coupled, too chaotic.
But what we can do is spot trends. We can observe and predict what the forces creating change are going to be.
And I’ve still not said what Web 2.0 is.
The Web doesn’t have version numbers in the way that software does. You can’t say “right, this is Web 1.0, and this is Web 2.0, and that’s that”. It’s constantly, and gradually, changing.
But let’s look at a few websites which typify Web 2.0, and I’m going to come back to most of these later.
Wikipedia. Flickr. Myspace, and blogging in general. A lot of what Google does - especially things like Google Maps. del.icio.us.
Here’s what Web 2.0 is; it’s something all of these sites have in common, and that’s a set of attitudes.
It’s a bias towards sharing and making things public - who your friends are, putting your photos up on the web, sharing everything. It’s a Web genuinely made up of things supplied by its users, rather than by a small class of people who knew the incantations necessary to make things appear on webservers.
It’s about making the web interactive; instead of a series of digital printing-presses with static content, it’s about websites as applications that you do things with. Websites as places, genuine locations, rather than as “just” magic - but static - newspapers that automatically update themselves.
And equally importantly, though this isn’t as immediately obvious, it’s about semantics - teaching computers what things actually mean, so that they can talk to each other about them; leading to unprecedentedly rich and flexible searching, interoperability, and integration of information from multiple different sources all over the Web.
Those are the things which the most interesting parts of the whole Web 2.0 phenomenon have in common. Again, though, like the web itself, you can’t separate them; together they’re far more than the sum of their parts.
Tim Berners-Lee, when he designed the web, always intended it as a platform where people could easily edit webpages in place and on the fly; the first web-browser and webserver let you do just that. It was a collaborative platform - a platform for sharing that anyone with web access could contribute to. There was no divide between the people reading the web and the people writing it; they were one and the same.
Doesn’t that, incidentally, sound an awful lot like Wikipedia to you?
Also, HTML - the language webpages are written in - was never intended to describe how pages look. Instead, it was meant to describe what the different bits of the pages were - paragraphs, quotes, links, lists, tables. In other words, HTML was designed as a language which’d capture the semantics of hypertext.
Web 2.0 is just the Web becoming what it should have been in the first place. It’s the Web growing into itself. It’s the real Web at last.
So why didn’t we get here in the first place? Commercial pressures is the short answer. I’m not going to go into the Browser Wars of the late nineties here, or I’d be here all day and you’d all get really, really bored, but the short answer is that the Web got very, very close to splitting in two between the two popular browsers of the day. Web pages got a million miles away from the semantic ideal Tim Berners-Lee intended. It was really only around 2003, with the Web Standards Project, a standard for describing how a webpage should look called Cascading Style Sheets, and the rise of a browser that commercial pressures couldn’t affect - Mozilla, the parent of the Firefox browser I’m using in this talk - that we started moving back to a Web where we could separate style from content.
That separation of style from content is the first stage in having a Web where you can integrate information from multiple sources; where you can analyse, and search, data on the Web automatically using computer programs; where you can start to use the system to do real science.
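To make that concrete, here is a minimal sketch of what a program can do once a page's markup describes what things are rather than how they look. The page content and the `OutlineParser` class are made up for illustration; only Python's standard `html.parser` module is assumed.

```python
from html.parser import HTMLParser

# A made-up page whose markup says what each piece is, not how it looks.
PAGE = """<html><body>
<h1>Group seminar</h1>
<p>Notes are on the <a href="http://example.org/notes">notes page</a>.</p>
<blockquote>Style lives elsewhere.</blockquote>
<ul><li>First point</li><li>Second point</li></ul>
</body></html>"""


class OutlineParser(HTMLParser):
    """Collect headings and link targets from semantic markup."""

    HEADINGS = ("h1", "h2", "h3")

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag in self.HEADINGS:
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        # Only text inside a heading is kept; everything else is ignored.
        if self._in_heading:
            self.headings[-1] += data


parser = OutlineParser()
parser.feed(PAGE)
print(parser.headings)
print(parser.links)
```

Because the headings and links are marked up as headings and links, the program never has to guess from fonts or colours which text is which.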
That’s the power of semantics. Most of the academic research into the future of the Web focusses on an idea called the Semantic Web - a web of richly self-describing, strongly specified, sources of data, which computers will be able to read, interpret and reason over. In fact, my own research is along those lines - teaching computers about the concepts used, and kinds of data produced, in simulating the properties of materials. I’m doing this as part of a project called MaterialsGrid, and if you’re interested our website’s at http://www.materialsgrid.org/.
A lot of the work on the Semantic Web is still at a very early stage, and as such it’s a distance away from being something that you’re going to see much of on the mainstream Web - but the fundamental idea of a Web of richly-described, richly-indexed data is beginning to come to pass in all sorts of odd places.
One example’s what’s often called “tagging” or “folksonomy”.
Let’s say we’re needing a photo of something for a talk (not that it’s on my mind right now or anything); for example, a photo on the theme of science. So, we go and have a look on the photo-sharing site, Flickr, to see what photos they’ve got on that subject.
Now, that’s a hard task; how on earth is a computer going to be able to tell what a photo’s about? Some companies - there’s one called Riya, for instance - have partial solutions for very specialised problem areas, like searching people’s faces, but this is a much harder problem. The answer’s, simply, not to solve it; but instead to let people put as many labels as they want on each of their photos. You can then look at all the public photos on the site tagged “science”.
Let’s do that now. [demo]
http://www.flickr.com/photos/tags/science/interesting/
There are some nice photos there.
Now, this isn’t just using the tags, because there’s a bunch of extra information you can use - what photos a lot of people have viewed, linked to, commented on or named one of their favourites. The combination of this automatically-captured data and these human-supplied taxonomies is very, very powerful. Flickr uses it for photos; del.icio.us uses it for websites; YouTube, at heart, is the same idea extended to videos; and it’s not a stretch to imagine these kinds of lightweight, dynamic ways of organising data becoming more and more popular when we’re sorting through the increasingly massive datasets we use in our science. Even Amazon are beginning to use tagging to allow their users to help each other to find things to buy.
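As a rough illustration of how human-supplied tags and automatically-captured activity can be combined, here is a minimal sketch. The photo records, the field names and the weighting in `score` are all invented for the example; this is not how Flickr actually ranks its photos.

```python
from collections import defaultdict

# Hypothetical records: user-supplied tags plus automatically captured counts.
photos = [
    {"id": "p1", "tags": {"science", "lab"}, "views": 120, "favourites": 4, "comments": 2},
    {"id": "p2", "tags": {"science", "crystal"}, "views": 40, "favourites": 15, "comments": 9},
    {"id": "p3", "tags": {"holiday"}, "views": 900, "favourites": 30, "comments": 12},
]


def build_tag_index(photos):
    """Map each tag to the list of photos that carry it."""
    index = defaultdict(list)
    for photo in photos:
        for tag in photo["tags"]:
            index[tag].append(photo)
    return index


def score(photo):
    """Made-up weighting of the activity signals, for illustration only."""
    return photo["views"] + 10 * photo["favourites"] + 5 * photo["comments"]


def most_interesting(index, tag):
    """Return the ids of photos with this tag, highest score first."""
    return [p["id"] for p in sorted(index.get(tag, []), key=score, reverse=True)]


tag_index = build_tag_index(photos)
print(most_interesting(tag_index, "science"))
```

The tags decide which photos are candidates at all; the activity counts only decide the order, which is why a heavily-viewed holiday snap never shows up under “science”.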
Delicious, in fact, extends this idea in a really interesting way. At heart, it’s a website which lets you bookmark webpages and categorise them in the same way as Flickr lets you categorise photos. That’s more useful than you think - it lets you get access to your bookmarks from anywhere, which is really useful if you use a few computers in a few different locations. But there are two ideas which Delicious is a really good example of, and they both fall in the openness-and-sharing category I was talking about earlier.
Firstly, you can bookmark things for your friends who also use the website - and when they do the same for you, those links show up in your bookmarks. So, your friends are looking out for things you’re interested in at the same time you are - and because it’s the same way you’re sorting out your own bookmarks, it’s easier and better-integrated than sending an email or posting it into an instant messenger. Secondly, you can subscribe to your friends’ bookmark lists - you can make bookmarks private if you want to, but the default is for them to be public, so the page (or feed, but I’ll come back to that later) of your friends’ bookmarks becomes a really neat list of what the folks you know on the service are looking at right now. Now, imagine a research group doing that for all the papers they’re looking at - it’d become a really effective way of sharing the research you’re doing.
But I’m not the first to think of that; have a look at Nature’s Connotea, which does exactly that. It doesn’t have as many users as Delicious, so the social network isn’t quite as powerful, but it does have export to Endnote and BibTeX - so you can manage your references right there on the Web.
Anyway, enough of that digression into another sort of social networking, but it just illustrates how all these different aspects of Web 2.0 are interleaved. Let’s go back to semantics. With tagging and folksonomies, you’re asking your users to build an index of the data they add into the system. But you don’t have to ask your users to categorise all your data explicitly. You can get a long way just by considering the links people naturally make between different pieces of data - the hyperlinks in webpages.
Now, the video we saw at the start of this talk mentions linking. Linking’s at the heart of the Web; it’s the essence of it. What Larry Page, one of the founders of Google, observed was that you could treat linking like voting; pages which lots of people link to are more likely to be interesting and high-quality, and the links out from those pages are in themselves likely to be a stronger indicator of merit. Good pages acquire a lot of good Google karma; popularity leads to influence.
This shouldn’t sound too unfamiliar to any of us, though; if you think about it, there’s not much difference between linking and citation, and Google’s PageRank algorithm - Larry Page’s insight was the single thing which, more than anything else, has turned Google into one of the world’s major media companies - is kissing cousins with citation analysis. The techniques which we use to assess the impact of our research are really the same as the ones the world’s search engines use to assess the impact of webpages.
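The “links as votes” idea can be sketched in a few lines. What follows is the simplified textbook form of PageRank run on a made-up four-page web, not Google's production system; the page names, the damping factor of 0.85 and the iteration count are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share, then collects votes.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly between the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over everyone.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical web: three pages link to "c", and "c" links only to "a".
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
for page, value in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(page, round(value, 3))
```

Note that “a” ends up well ahead of “b” and “d” even though only one page links to it, because that one link comes from the most highly-ranked page; the same effect is what makes a citation from an influential paper count for more in this kind of analysis.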
When you think in those terms, what’s the difference in impact between a blog that loads of people link to - or a video like Prof. Wesch’s which millions of people watch - and a highly-cited scientific paper? It’s not nearly as big a difference as it might seem on the surface. The obvious difference is peer-review, but then when you start thinking about open access publishing and open archiving, the boundaries become even more blurred.
The arXiv self-archiving system’s been around for quite a few years now. The idea is that scientists upload the last, pre-submission version of their paper - submitted independently to a journal - and anyone’s able to read it. Of course, there’s no guarantee of quality, no peer-review, which makes you a bit, well, nervous. However, some communities - including fair bits of solid-state physics - have found even this uncurated source of data really, really useful; after all, at the very least it’s cheap! To an extent, you can argue that whether the peer-review happens before or after publication is moot. A recent example of that involving the arXiv is the case of Grigory Perelman, a Russian mathematician who, rather than going through any of the normal channels of peer review, decided to upload his purported solution to the Poincaré conjecture directly to the arXiv, where anyone could read it. Purportedly, his thinking was that if his solution to this problem, one of the big unsolved problems in maths, was right, then who cares whether it was formally published or not? He’d established priority.
Of course, most of us aren’t proving the major unsolved problems in mathematics, which is probably just as well, but there are real benefits of a lot of papers being on the arXiv. Some are just pragmatic; I can read papers on the arXiv at home without having to worry about subscriptions or authentication or any of that rigmarole, for instance. Some, though, are for want of a better word, political - papers on the arXiv are open to everyone, regardless of whether their institution can afford a subscription to the journal I’ve chosen to publish in, what country they’re in, any of that. I can’t speak for anyone else, but I’m just about idealistic enough to believe that science can be built on a culture of openness - or at least a culture of the presumption of openness unless there’s good reason otherwise - rather than a culture of secrecy.
In addition to that, I’m not going to go into great detail about it here - people spend their entire careers working in this area - but there are alternative models - open-access models - for scientific publishing which get round the peer-review problem. Journals like BioMed Central turn the conventional model of journal funding inside-out - instead of paying to read the journal, you pay for the paper to go through peer-review, and it’s free for anyone to read. That actually winds up costing the users of the journal about the same as a conventional subscription - but has the social benefit that anyone, anywhere - including researchers in the Third World - can read the journal. A lot of large charities - most notably the Wellcome Trust - are beginning to mandate open-access archiving of papers as part of the terms and conditions of getting a grant; there was a massive petition at http://www.ec-petition.eu/ recently requesting the European Union to “GUARANTEE PUBLIC ACCESS TO PUBLICLY-FUNDED RESEARCH RESULTS SHORTLY AFTER PUBLICATION”. It had attracted 23,000 signatures from researchers all across Europe last night, after being open for a month and a half.
Furthermore, there are increasingly sophisticated tools for automatically processing the literature, assuming you can get access to it. The OSCAR program - it can extract and check experimental data from chemical papers - has been used by the Royal Society of Chemistry in their reviewing process for a few years now, and the current version - OSCAR3 - uses machine-learning techniques and a lot of carefully-tuned heuristics to extract parts of speech - experimental techniques, chemical names, and so forth and so on - from the full text of papers. Dr. Peter Corbett, one of my colleagues down in the Unilever Centre, is the lead author of it, and it’s free software - what’s more, it’s been integrated into the RSC’s website in the form of Project Prospect, highlighting the key terms in papers to enhance how they’re presented on the web. Here’s an example.
[demo]
Now, that’s a very powerful example of machine-assisted semantics. This’ll only become more powerful, and expand to more areas of science, in the coming years.
In the line of chemical communication, we’ve got to consider blogs at some stage. Blogs have already become hugely important parts of the media landscape - they’re breaking news stories, bloggers are starting to get media accreditation, and the most popular have millions of readers.
Here’s a popular blog in chemistry; it’s written by a grad student here in Cambridge, and it’s called totallysynthetic.com. It reviews the latest organic total synthesis literature; it’s not a review journal, but it’s not as far from that as you might think. There’ll be more and more blogs and other sorts of user-generated content - videos and podcasts of talks and lectures, for instance - in that middle ground in the near future, and if nothing else our metrics for citations and assessing people’s publication records are going to need to be adapted to deal with those. The cost of producing media is falling - I’m carrying an audio recorder, high-res digital camera and digital camcorder right now - my mobile phone. They’re going to get cheaper, smaller, and better, and that’s going to lead to more audio and video on the web; some of that’s going to be science.
But back to semantics; all a blog is is a list of stories in chronological order. Each blog is just a set of posts, and therefore you can write down the semantics of a blog post; it’s pretty simple, there’s a title, and a date, and the story, and so forth and so on. Thus, again, you can separate the content of the blog from the way it’s presented. The content can be described by a file format called RSS; you can subscribe to the RSS feed of a blog and be alerted when a website’s updated, for example. However, because you know where everything is in an RSS feed in a way you don’t with a webpage, you can write programs which will process the data on a website every time it’s updated. We’ll come back to that later, but there are hundreds of possible applications of this kind of technology - feeds with the latest papers from a journal, for instance. The arXiv, which I mentioned earlier, has feeds of each of the sections - so you can easily search through those, and seeing the latest condensed-matter physics papers - at least, the ones posted there, and there are over a hundred a day - takes only a few minutes every morning.
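Here is a minimal sketch of the kind of program that paragraph describes, using only Python's standard library. The feed is a made-up RSS 2.0 document embedded as a string, so the titles and URLs are hypothetical; a real program would fetch the feed from the journal's or the arXiv's own feed address instead.

```python
import xml.etree.ElementTree as ET

# A hypothetical RSS 2.0 feed; a real one would be downloaded from the site.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example journal: latest papers</title>
    <link>http://example.org/journal</link>
    <description>Hypothetical feed, for illustration only</description>
    <item>
      <title>Phase transitions in a made-up perovskite</title>
      <link>http://example.org/journal/paper-1</link>
      <pubDate>Mon, 05 Mar 2007 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Tagging schemes for photo archives</title>
      <link>http://example.org/journal/paper-2</link>
      <pubDate>Tue, 06 Mar 2007 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Pull the title, link and date out of every item in the feed."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "date": item.findtext("pubDate", default=""),
        }
        for item in root.findall("./channel/item")
    ]


def matching(items, keyword):
    """Keep only the items whose title mentions the keyword."""
    keyword = keyword.lower()
    return [item for item in items if keyword in item["title"].lower()]


items = read_items(SAMPLE_FEED)
for entry in matching(items, "perovskite"):
    print(entry["date"], entry["title"], entry["link"])
```

The program knows exactly where the title, link and date live in every item, which is the point: no guessing at page layout is needed, so the same few lines keep working every time the feed is updated.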
Of course, thinking about it, maybe the most obvious example of science on the Web, and arguably the most successful of all Web 2.0 sites, is Wikipedia. Wikipedia’s really simple - it’s an encyclopedia which anyone, anywhere, can edit without any permissions, checking of credentials, any of the normal checks which you’d expect in an academic project at all.
Somehow, though, it really, really works. No-one, even the people who created it - who were planning to use it to generate ideas for a curated encyclopedia, originally - expected it to work as well as it does.
Nature published a study about the mistake-rate of Wikipedia versus other web-accessible encyclopedias - the Encyclopedia Britannica and Microsoft’s Encarta - and found that, for science, they were broadly comparable in terms of their accuracy. Of course, Wikipedia has problems - it’s not perfect, but the astounding thing is that it works at all. It’s a fantastically useful reference source, created by people doing it for the love of it, and even with people using it to try and grind their respective axes, it’s still probably the single most useful reference resource on the Web. It’s not original research - indeed, original research is actually explicitly banned from Wikipedia - but it’s still a really powerful argument for a presumption of openness and goodwill in academic work, rather than a culture of privacy and suspicion.
But that culture of sharing doesn’t just need to be between people; it can also be between programs, computers and sites. Once you’ve got a way of describing the content of a website, you can write programs which will read and manipulate that content. One way I mentioned earlier was RSS, but that’s not the only way - some websites expose their data using interfaces accessible by other methods, like Javascript. Remixing the content of multiple websites using RSS, XML and Javascript has acquired the pretty rubbish name of the mashup, by analogy with the remixes of multiple songs which were fashionable a couple of years ago, but some of them are very impressive in their own right.
Loads of websites, like chicagocrime.org (which won journalism awards for its author, Adrian Holovaty, now a programmer-journalist at the Washington Post), exist simply to integrate the information available from multiple different sources, telling a tale which is more than the sum of the parts.
What this website does is take the publicly-available database of reported crime in Chicago, and superimpose that on maps supplied by Google; because of the publicly available data sources, it makes something which is richer, easier to explore, and much smarter than either of the two information sources is by itself. What’s more, it’s automatically updated - it itself becomes a tool.
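The core of a mashup like that is a join between two independent data sources. The sketch below is an illustration under stated assumptions, not chicagocrime.org's actual code: the crime reports, street blocks and coordinates are invented, and the output is a plain GeoJSON structure of the sort a mapping service could draw as markers, rather than a call to any real mapping API.

```python
import json
from collections import Counter

# Source 1 (hypothetical): public crime reports, one record per incident.
reports = [
    {"block": "100 N State St", "type": "theft"},
    {"block": "100 N State St", "type": "burglary"},
    {"block": "2400 W Lake St", "type": "theft"},
    {"block": "900 S Unknown Ave", "type": "theft"},
]

# Source 2 (hypothetical): a geocoder's output, block -> (latitude, longitude).
coordinates = {
    "100 N State St": (41.883, -87.628),
    "2400 W Lake St": (41.885, -87.687),
}


def build_markers(reports, coordinates):
    """Join the two sources into one GeoJSON collection of map markers."""
    counts = Counter(report["block"] for report in reports)
    features = []
    for block, count in sorted(counts.items()):
        if block not in coordinates:
            continue  # no location known, so it cannot be drawn on the map
        lat, lon = coordinates[block]
        features.append({
            "type": "Feature",
            # GeoJSON puts longitude first, then latitude.
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"block": block, "reports": count},
        })
    return {"type": "FeatureCollection", "features": features}


markers = build_markers(reports, coordinates)
print(json.dumps(markers, indent=2))
```

Neither source knows about the other; the only thing they share is the street block, and that shared key is all the program needs to combine them.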
I’m sure there are countless ways that these techniques could be used in our work, but here’s one concrete example. Now, there’s a big source of public data in the fields we work in - and that’s supplementary data from papers. One particularly rich source of supplementary data, thanks to standardization of data formats around ten or more years ago, is crystal structures - the CIF format’s used by all the major journals in the field, including all of Acta Cryst. Therefore, if you’ve got access to that, you can run analysis, calculation - whatever you like - over that data and extract any insights you can gain from this data. Nick Day and Joe Townsend, two PhD student colleagues, have done a lot of work in this area, which again I’m going to have to skim over here, but for instance, if any new or interesting structures crop up in the literature, you can be alerted to it automatically. You can get aggregate statistics, and, even better, be alerted to any outliers; the tools will spot new, interesting and unusual structures. That’s real science, done in an entirely new way, using freely-available data, software and techniques. Now, that’s really exciting. It’s brilliant.
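As a rough illustration of the “aggregate statistics and outliers” idea, here is a minimal sketch. It is not the method used in the work mentioned above: it assumes the bond lengths have already been extracted from the CIF files, the structure names and values are invented, and the outlier test is a generic one (a modified z-score based on the median absolute deviation, with the conventional 3.5 cut-off).

```python
import statistics

# Hypothetical bond lengths in angstroms, one per published structure,
# assumed to have been extracted from the CIF files already.
bond_lengths = {
    "structure-1": 1.61,
    "structure-2": 1.62,
    "structure-3": 1.60,
    "structure-4": 1.63,
    "structure-5": 1.61,
    "structure-6": 1.92,
}


def summarise(values):
    """Aggregate statistics over all the structures seen so far."""
    data = list(values.values())
    return {"count": len(data), "median": statistics.median(data)}


def find_outliers(values, threshold=3.5):
    """Flag structures whose value sits far from the median of the rest."""
    median = statistics.median(values.values())
    mad = statistics.median(abs(v - median) for v in values.values())
    if mad == 0:
        return []  # every value is effectively identical; nothing to flag
    return sorted(
        name
        for name, v in values.items()
        if 0.6745 * abs(v - median) / mad > threshold
    )


print(summarise(bond_lengths))
print(find_outliers(bond_lengths))
```

Run against a feed of newly published structures, a check like this is what turns a pile of supplementary data into an alert that something unusual has just appeared in the literature.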
So, we’ve got social networking; automated data mining; dynamic, unprecedentedly flexible aggregation of data from multiple sources, all updated in real time; rich semantics, both supplied by humans and inferred automatically by cutting-edge computational techniques; open data; open access; open minds. Sharing, sociability, interactivity, and in the last hour or so I’ve barely scratched the surface; I hope I’ve given you a sense of just how exciting things are right now, though.
That’s the really good use of computers and the Web, if you ask me; turning them into our research assistants, amanuenses, and scribes, rather than just overgrown typewriters and printing presses. It’s a future within our grasp. I’m looking forward to helping shape it.
I’d like to make a few acknowledgements, before I finish; my PhD supervisors, Martin and Emilio; the head of my current group, Peter Murray-Rust; and I’m funded by the Department of Trade and Industry in partnership with Accelrys and IBM. I’d like to throw the floor open to questions now. Thanks for your time.
Andrew - thanks for making the transcript available.
Very nice presentation! Keep spreading the message…
Congratulations - right on the pulse. Thanks for providing your transcript - only wish I could see the response of your audience.
I demand you upload the video to youtube!
Wonderful talk. Chemistry and Web 2.0 seem like a natural match to me.
Can you provide a link to the Project Prospect demo?
[...] Here’s the video of the seminar I gave the other day. Thanks to Dr. Nico Adams, who was behind the camera; any editing mistakes or blunders in the talk are, of course, entirely my responsibility! [...]