Open Notebook Science and Cheminformatics
Transcript of Open Notebook Science and Cheminformatics presented at Indiana University.
Jean-Claude Bradley: ...OK. So thank you very much for the invitation to speak to your Cheminformatics class this morning. I would, basically, like to show in detail how we're using Cheminformatics concepts in my lab. But in order to really show its place, I'm going to give a little bit of an overview of the big picture. In other words, what Open Notebook Science is, how it's actually done on a high level, and how that trickles down to the smallest details of SMILES, InChIs, and all of that.
So, in the very beginning, let me just introduce the concept of Open Notebook Science. You want to think of it as, basically, being at the end of a continuum here. There has recently been a lot of effort to try to make science, in research and teaching--although we're talking mainly about research here--more open. If we look at the traditional lab notebook, typically, it's an unpublished document. The only people who have access to it are typically the student and their supervisors, of course. But, when they leave the lab, all of that information is typically not readily accessible to other people. You may have other people around the world repeating their experiments when they could very well make use of that information.
Of course, a step above that is the traditional journal article. Here, this is the typical article format where you have your introduction, you have your results, your discussion, and it's all one little story. That's great in that it helps communicate some science, but what's often missing in there is all the failed experiments, and a lot of the ambiguous results that are probably more common than not in a typical organic chemistry lab. That's more open, but it's still not completely open because not everyone in the world can access it. Usually, you have to get a subscription to the journal.
So there's a new wave coming out to make journal articles open access. There, typically, the fee is paid by the author although not all of these, but those articles are available for free to anyone in the world. But again here, the format is the same as the traditional journal article, so a lot of the failed experiments and things like that are not included. What we're doing in my lab is what I call "Open Notebook Science," which is where we're trying to achieve full transparency so the actual lab notebook is a public document and it's on a wiki, actually.
Now, to show you how all this connects together, we have various vehicles in the lab. We have wikis, blogs, mailing lists, things like that. They all have their place. Usually, when I give this talk, I like to start at the blog level because that's really the public interface. That's where things get discussed so that most people--scientists or possibly people who are not scientists--can actually figure out a lot of what's going on. So I'm going to give you a few examples of what goes on there and I'm going to show how that's connected to our laboratory notebook.
Some of the things that are discussed on my UsefulChem blog are funding. Again, this is something that's not typically public, but I think there's a tremendous advantage to making these things public. One of the things is you can find new collaborators. You can have people try to understand what's going on in their field, what people are trying to do. I think that, overall, even though there's some hesitations about scooping--and we can discuss that later--I think, overall, it's very positive to make things as public as possible.
The other thing on my blog that I've done recently--these are just examples from the past couple of months--is supporting funding initiatives. My friend Cameron Neylon at Southampton University was writing a proposal to help people travel to an Open Science talk. He put a request out there for support and with my blog, I was able to do that in a small way. Whenever we have media coverage, I like to report on that, and to highlight peer-reviewed coverage. The lab notebooks that I'll be showing you are not the traditional peer-reviewed publication.
It doesn't mean that we're not interested in that. In fact, we're very much interested in publishing our work using traditional channels. But, if you want to be able to cite these things, people don't necessarily believe that you can do that, that you can actually cite a lab notebook page. So I always like to point out specifically when people, in peer-reviewed articles, actually cite our work. So that's just one example of that happening in the past few months.
I like very much to announce new collaborators. Rajarshi, of course, has been a long-time collaborator, but in the past few months, we've had some recent ones. There's Gus Rosania at the University of Michigan; I'll talk a little bit more about what he's doing with drug transport and collaborating with us. Matthias Zeller is an expert crystallographer who's actually done crystal structures for our compounds, which Rajarshi and I were talking about a few minutes ago. So he's the guy, actually, who's responsible for that.
Again, another collaborator who is very generous with his time to be able to provide us this information. We have another collaborator, at University of California in San Francisco - Phil Rosenthal. He's the person who has been testing our compounds for anti-malarial activity.
The other thing I like to talk about are presentations, so what I'm doing here, I'll most likely blog about and link to the recording. Presentations in Second Life, so you'll see a few slides here about Second Life, which is a virtual world where people exist in the form of avatars. This would be me, for example, and these were all the people at the meeting, then we can interact this way.
We can have presentations here and these are just basically PowerPoint slides in Second Life, so that's something else that I discuss. I can also discuss Science in new media, so here's an example of a protein in Second Life. I can talk about how I use it in teaching, so here's a student of mine flying around on a camphor molecule and here's a buckyball.
These are all things that are related somehow to the work that we're doing in my lab. As you know, we're doing malaria work, so we're trying to make anti-malarial compounds. Part of the advantage of having a blog is that once people know about it, they can ask you for support for other related initiatives. Here is a "Run for Malaria" in Philly, and it's to collect money for nets in Africa. It's a perfect example: among the people who would be following our work here, there's a good overlap with people who might be interested in participating in things like this.
Finally, I talk about more general science philosophies, so not necessarily just organic chemistry or the synthesis of anti-malarial compounds but a lot of the fundamental issues about how science gets done and the opportunities of Web 2.0 technologies to facilitate that.
OK. So those are just really quickly some examples of things that I discuss. The one thing that I didn't show in those blog posts is that a lot of them, actually, have links to back up some of the statements that I'm making. I'll be linking to specific experiments on our laboratory wiki. Basically, here I make an announcement about falcipain-2, which is an enzyme discovered by Phil Rosenthal that degrades hemoglobin. So it's an enzyme that belongs to the malarial parasite. We just reported that we had shipped a couple of compounds.
Now, there's a link here, it says "See Experiment 150." So this is where the beef is. Basically, there's no reason for you to believe anything that I'm saying, and you really shouldn't if you're really applying the scientific criteria. I'm going to be linking to the original data and you can make up your own mind as to whether or not what I'm saying is reasonable.
So if we click on this link, we'll end up on the Lab Notebook page, which is on a wiki, so this is an example of a Ugi reaction. We're just mixing four components together and we're getting a precipitate, which is this Ugi product.
There's no need to really go into the chemistry here, just to say that we're using these compounds so we want to index them in some way and I'll get into that. We're making this compound and if we're trying to find in which experiment this compound was made, there are ways of, basically, finding that using some cheminformatics tools. But right now, I'm just going to break down this Lab Notebook page and show you how you can gain access to all of the raw data.
The first part, this is actually a pretty long page, so I'm going to take it section by section. The first part is, of course, the Objective, and we're trying to make this particular compound. I'm going to click on this on the next slide and I'll show you that it's going to link to an entry in ChemSpider, which hopefully you're familiar with. It's a pretty large database of chemicals, I think, it has almost 20 million compounds now. We've been using ChemSpider to archive our results to a large extent.
The other thing you'll notice on the Objective here is that it has links to all kinds of things including a Summary Post. The reason that I really want to make sure that there's some sort of link like this in every experiment page is that, if you're Googling and you find this lab notebook page, I think it's very important for people to be able to know what the bigger picture is.
If you read this, and OK, you do understand the chemistry but you don't understand why we're doing it, you would click on the Summary post. This will take you, typically, to a blog post that explains, in a lot more detail and at a much higher level, first of all what anti-malarial compounds we are trying to make. It will also explain the reason why we're attacking falcipain-2 as a target enzyme and the docking results that Rajarshi ran for us. All of that, basically, is traceable and linkable from here.
So I'm now going to click on this Ugi adduct link to show you what ChemSpider looks like. So ChemSpider has a bunch of pages and here they have the molecule. There's a little bit of a rendering problem here; you'll notice this bond, it looks kind of weird. That's actually been resolved for the most part. ChemSpider has been around for about a year and they're in constant development. One of the things is that if you're working with new technology, sometimes these things will happen.
But the people there are extremely responsive. That's the advantage: a lot of the things that we want to do, they can actually implement for us. Whereas, if we were using a more traditional kind of archive, it might be more difficult for them to customize it for what we want to do. But this service actually provides a lot of useful stuff. They provide the SMILES, they provide the InChI, and they provide the InChIKey automatically.
So I'll be using these for various purposes. The SMILES, of course, is pretty handy for searching in online databases. If a database lets you search by structure in any way, the SMILES is usually going to be included. The InChI is starting more and more to be included in online databases but it's not always the case. The advantage of the InChI is that there should be only one unique InChI per molecule. Whereas with the SMILES, oftentimes, there are multiple SMILES for the same molecule, so that's a big advantage of using InChI.
One of the disadvantages of InChI is that for large molecules--and this would actually be considered a pretty large molecule for InChI--the InChI is so long that it doesn't properly get indexed by search engines like Google. So we've started to use the InChIKey a lot more, which is basically just a short hashed key derived from the InChI. The advantage of the InChIKey is that it's very short, and it's just a bunch of letters that should not recur accidentally very easily. So if you actually do a Google search for the InChIKey, you're pretty likely to find what you're looking for and you're not going to get a lot of junk in addition.
The InChIKey here has two components and that's also a pretty useful thing. One of the components, the first part here, actually encodes the connectivity of the atoms. So if you're not interested in stereochemistry--like here we have a trans double bond, so if I didn't care about cis or trans--I could just search for the first part of the InChIKey and it would pull up all cis and trans, all the various stereoisomers. But if I wanted to specify this particular isomer, there is this additional information in the InChIKey that does that specification. So this is handy as well, this way of making InChIKeys.
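As a rough illustration of the two-part key described here, the following Python sketch splits an InChIKey into its blocks and compares only the connectivity block. It assumes the standard 27-character layout (14 letters, 10 letters, 1 letter), which may differ from the keys in use at the time of the talk; the function names are made up for this example, and the two sample keys are believed to be those of fumaric and maleic acid (a trans/cis pair).

```python
import re

# Assumed standard InChIKey layout:
# 14 letters (connectivity hash) - 10 letters (stereo/isotope hash,
# flag, version) - 1 letter (protonation).
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{10})-([A-Z])$")


def split_inchikey(key):
    """Return (connectivity_block, detail_block, protonation_flag)."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"not a standard InChIKey: {key!r}")
    return match.groups()


def same_skeleton(key_a, key_b):
    """True when two keys share the first (connectivity) block."""
    return split_inchikey(key_a)[0] == split_inchikey(key_b)[0]


# Believed to be the keys for fumaric (trans) and maleic (cis) acid.
FUMARIC = "VZCYOOQTPOCHFL-OWOJBTEDSA-N"
MALEIC = "VZCYOOQTPOCHFL-UPHRSURJSA-N"

print(same_skeleton(FUMARIC, MALEIC))  # same atoms and bonds
print(FUMARIC == MALEIC)               # full keys differ in the second block
```

Searching on the first block alone is what lets a single query pull up every stereoisomer, at the cost of not telling them apart.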
OK. So the other thing that we can link to is the experimental plan. Before the students start the experiment, they typically, are going to follow some sort of plan. This is something that you can look up or anybody can look up. We also want to link to the docking procedure. You may not be a synthetic organic chemist; you may, actually, be more from the docking side.
If you want to see exactly what was done--Rajarshi, in this case, actually, did this run for us--and so here's the library, if you wanted to see the list of all the compounds that he used to dock, you can see that. Here's a link to falcipain-2 and it explains more about this enzyme.
The procedure--this basically Rajarshi wrote--and he's explaining how he used the PDB file and he's explaining the two docking sites on that particular enzyme that he thought would be a good starting point and here are the actual results. So these are the list of compounds that we're trying to make. If you want to see what they look like, you click on one of these links and you'll see that these are just tables of SMILES.
Once again, SMILES is a pretty convenient format to store lists of molecules and that's what we've done in this particular case. Someone could actually, if they were interested in all of the first 1500 hits; they could easily come here and copy and paste.
We also have a procedure section. This is supposed to look more like the kind of thing that you'll find in a traditional journal article. These are written in such a way as to make it easy for us to send out our papers. We just have to basically copy and paste these sections into the "experimental" section. In order to let people actually verify the observations and conclusions of the Ugi experiment, we provide all of the raw data for the spectra. In this particular case, again, we're looking at the same Experiment 150.
A compound was isolated, and the proton NMR and the carbon NMR were uploaded. They're uploaded using JCAMP-DX format. This is a pretty convenient format for all kinds of spectroscopy. This is NMR; you could have carbon NMR, proton NMR, IR, Mass Spec; all of these different experiments, equipment can usually save the results as JCAMP-DX format. The advantage in that is that there are free open-source viewers such as JSpecView that will enable you to view the data in an interactive format.
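To make the JCAMP-DX idea concrete, here is a minimal Python sketch of reading such a file. It is an illustration only, not the code used in the lab or in JSpecView: it assumes the simplest uncompressed form of an `##XYDATA= (X++(Y..Y))` table with whitespace-separated values, and the sample spectrum is made up.

```python
import re

# A made-up, tiny spectrum in uncompressed JCAMP-DX form.
SAMPLE = """##TITLE= hypothetical example spectrum
##JCAMP-DX= 4.24
##DATA TYPE= NMR SPECTRUM
##XUNITS= HZ
##YUNITS= ARBITRARY UNITS
##FIRSTX= 0.0
##LASTX= 5.0
##XFACTOR= 1.0
##YFACTOR= 2.0
##NPOINTS= 6
##XYDATA= (X++(Y..Y))
0.0 1 2 3
3.0 4 5 6
##END=
"""


def parse_jcamp(text):
    """Return (metadata dict, x values, y values) from JCAMP-DX text."""
    meta = {}
    raw_ys = []
    in_xydata = False
    for raw_line in text.splitlines():
        line = raw_line.split("$$")[0].strip()  # "$$" starts a comment
        if not line:
            continue
        if line.startswith("##"):
            label, _, value = line[2:].partition("=")
            # Normalise labels: ignore case, spaces, "-", "/" and "_".
            key = re.sub(r"[\s\-/_]", "", label).upper()
            meta[key] = value.strip()
            in_xydata = key == "XYDATA"
        elif in_xydata:
            fields = line.split()
            # The first field on each row is the X value of its first point;
            # the remaining fields are consecutive Y values.
            raw_ys.extend(float(v) for v in fields[1:])
    count = len(raw_ys)
    first_x, last_x = float(meta["FIRSTX"]), float(meta["LASTX"])
    step = (last_x - first_x) / (count - 1) if count > 1 else 0.0
    xs = [first_x + i * step for i in range(count)]
    y_factor = float(meta.get("YFACTOR", 1.0))
    return meta, xs, [y * y_factor for y in raw_ys]


meta, xs, ys = parse_jcamp(SAMPLE)
print(meta["TITLE"], len(xs), "points")
```

Real instrument exports often use the compressed encodings that the format also allows, which this sketch does not handle; a full reader would need to decode those as well.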
So if you were to come onto the website and click this proton NMR link, it would pop this up. If you wanted to zoom into any region, all you have to do is drag with your mouse across and it will expand all of these little peaks. If you're familiar with NMR, you'll know that that's actually really critical. That's the only way to get J coupling constants, for example. Also, it can reveal a lot about the phasing. On this unexpanded view I can see peaks but it is unclear what the quality of the peaks is.
When I expand them, I can see that this is a triplet, but there's some phasing issue going on. It's a way of looking for the details of impurities and things like that. In the supplementary section of most journals, they don't typically give you all of the expansions for the NMRs. They will give you the expansions that the researcher wants you to see, but sometimes that can be a little misleading. That's why I'm a big fan of using JSpecView. This also does not require the person viewing the information to download any software.
This is just using Java, so they're using a common browser and they can just click on it and view the information right away. I'm a big fan of that. Finally, when you get to the conclusion of the experiment, in this case that the Ugi product was obtained in 59% yield, you really don't have to trust that. You can drill down to any part of this experiment and you can see if you agree with that conclusion, based on all the evidence provided.
That's really what Open Notebook Science is all about. It's about making all of your results public, so that if you're making a statement you can truly back it up with real support. OK, the last section on that page is typically a tags section. Here, we're listing all of the molecules that were used in this experiment. There are three or four formats that we're using right now. One is the common name. Of course, the problem with common names is that there's more than one name for a molecule, so the common name is just a handy thing for us to keep track of what this entry is.
Really, what we're interested in is to put the InChI, and to put the InChIKey, for each one of the molecules that was used.
These are actually linking to Google, so if you were to click on one of these links, it would give you a Google search for this InChIKey. It would show, most likely, mainly the results from UsefulChem, but it would also show other people who've used this compound and have indexed it with InChIKey. So right now I think this is the best way for tagging molecules. Now we're going to be looking at comparing experiments. I just showed you one experiment there; we've actually done a whole bunch of experiments.
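A tag link of this kind is simple to generate. The sketch below is a guess at the idea rather than the lab's actual link format: the function name and the choice to wrap the key in quotation marks are assumptions, and the sample key is the one commonly listed for aspirin.

```python
from urllib.parse import urlencode


def google_search_url(inchikey, connectivity_only=False):
    """Build a Google search link for an InChIKey (or just its first block)."""
    term = inchikey.split("-")[0] if connectivity_only else inchikey
    # Quoting the term asks the search engine for an exact match.
    return "https://www.google.com/search?" + urlencode({"q": f'"{term}"'})


ASPIRIN_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
print(google_search_url(ASPIRIN_KEY))
print(google_search_url(ASPIRIN_KEY, connectivity_only=True))
```

Because the link is built from the key alone, anyone who tags a page with the same InChIKey becomes findable through the same search, without any shared database.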
We'd like to have convenient ways of comparing them. There's a few ways of doing that, one is with a simple table. We're using Google Docs, again because it's something that's easy. It's hosted, we can actually make it public very easily. Remember, the point of this is to make things as public, as quickly as possible. I'm a big fan of Google Docs. You can see here that we are keeping track of things, not only with the common name, but also using SMILES.
We've also used InChIKeys in this case as well. InChI not so much, again, because a lot of the InChIs for the larger molecules are just unmanageably long. I will put InChIs for small molecules but I tend to stay away from it for the big ones. The most important section of every experiment is of course the log. If you don't have the log or if the log is incomplete, you really can't do anything with that experiment. We have a log, which is just basically the student recording what they did and what they observed at different times.
You can actually construct all of the results from this. You can construct your discussion, and ultimately, of course, your conclusion. But if this is missing, you really don't have much proof for what you did. I do consider it the most important section. The problem with the log written in this way is that this is written in freeform. I know that Rajarshi and I have talked a lot about trying to automate things a little bit more. If you have a log in a freehand format, that makes it very difficult.
If a machine were to look at this, it might be able to pull a few things out. It would know that benzylamine was used, but it wouldn't know much more precisely what was actually done. One of the things that we're doing now is taking all of these logs that are written in a freehand form and converting them to a machine-readable format. Here's one example of how we're doing that. I'm taking that log and I am now breaking it down into a series of steps in a workflow.
These words here--add, weigh, vortex, take picture--are all terms that we agreed we would use to specify certain kinds of actions. If I use the term "vortex," every time I vortex I'm going to be using this exact way of representing that. These are represented in steps. This actually could be read by a machine fairly easily. When I say "Add Compound," I specify with a common name. This is mainly for human use so we can tell roughly what's going on.
I'm using the InChIKey for the machine to read. The InChIKey, again, I prefer this to the InChI because it's always going to be the same length no matter how large the molecule. It's a really convenient way of keeping things concise. This is one of the ways in which we are trying to automate things, or make them available for people who are interested in automating it.
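One way to picture such a machine-readable log is as a list of steps checked against an agreed vocabulary. The Python sketch below is hypothetical: the `Step` fields, the action list, and the rule that every "add" carries an InChIKey are assumptions made for illustration, not the lab's actual schema. The key shown is believed to be benzylamine's.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical controlled vocabulary, based on the actions named in the talk.
ALLOWED_ACTIONS = {"add", "weigh", "vortex", "take picture"}


@dataclass(frozen=True)
class Step:
    action: str
    common_name: Optional[str] = None  # for human readers
    inchikey: Optional[str] = None     # for machines


def validate(steps):
    """Return a list of problems; an empty list means the log is clean."""
    problems = []
    for number, step in enumerate(steps, start=1):
        if step.action not in ALLOWED_ACTIONS:
            problems.append(f"step {number}: unknown action {step.action!r}")
        if step.action == "add" and not step.inchikey:
            problems.append(f"step {number}: 'add' needs an InChIKey")
    return problems


log = [
    Step("add", "benzylamine", "WGQKYBSKWIADBV-UHFFFAOYSA-N"),
    Step("vortex"),
    Step("take picture"),
    Step("stir"),  # not in the agreed vocabulary, so it is flagged
]
print(validate(log))
```

Keeping the common name alongside the key costs little and keeps the log readable, while the key is what a program would actually match on.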
So we can take these workflows as well and we can represent them in some more tables. Here, just to clarify the difference between the first table I showed you and this one: on the first table, I was looking at entire experiments. Here, I'm actually breaking down each experiment into each individual result that was obtained.
In Experiment 150, for example, I took pictures multiple times and each one of these is a self-contained result. Let's say that, in the experiment, I drop the flask on the fifth day. That doesn't mean that everything that I did on days 1, 2, 3, and 4 is not helpful.
This certainly can be helpful but not if I think of them in terms of an experiment. If I drop the flask on the fifth day, the typical thing to do would be to just abort the experiment and move on. But if we're breaking things down into each individual result, we can actually use that information to plan further experiments. The key here is to represent it in such a way that it's systematic and that other people can use it easily.
So just a couple of different organizing ways, there is a table here on one of the wiki pages that is just a table of contents, so the list of all experiments. If you, as a human being, are looking for information and you know the experiment number, this is probably the easiest way to do it. You can also see the person who did the experiment. You can see a brief title, but in order to really see what's going on, you'd have to click in but it is probably the most common way of starting.
Now, I would like to talk briefly about why it is that we're using a wiki in Open Notebook Science. One of the things is that with wikis, it's very easy to tell what's going on in the lab. So if I click on the recent changes button here, it will tell me who did something in the past few minutes, hours, and days. It'll tell me exactly when and it'll tell me which page was modified. So if I'm interested in either what Emily is doing or if I'm interested in Experiment 158, I would then click on this and then see exactly what the addition was.
So if I do click on one of the experiments, there's a history button on every page of the wiki and I can see every single modification of that page over time. Now, we're using Wikispaces to do this, which is a free hosted service. The big advantage of that is that these date-time stamps are third-party generated. So if I were running this on my own server, someone could argue, "How do I know that the time is correct?" Since we're using a third-party time stamp, I think it's pretty objective that if I look at any one of these versions, I'm able to prove that I knew what I knew at this particular time.
So even though things might change, anyone can go back in time and see what exactly we knew and what we talked about. If there are mistakes made, we can see what those mistakes were and how long it took to correct them, all kinds of things like that.
Comparing two pages on a wiki is very simple, you just compare and anything that's new will show up in green and anything that's deleted will show up in red. So if you're making comments and the students are responding, this is actually a really convenient way of doing that and keeping track of it.
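The same kind of version comparison can be reproduced outside the wiki. The Python sketch below uses the standard `difflib` module to list what was added and what was deleted between two versions of a page; it is not how the wiki itself computes its comparison, and the two log snippets are invented.

```python
import difflib


def page_diff(old_text, new_text):
    """Return (added_lines, deleted_lines) between two versions of a page."""
    added, deleted = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])    # would be shown in green
        elif line.startswith("- "):
            deleted.append(line[2:])  # would be shown in red
    return added, deleted


# Invented example versions of a notebook log.
OLD = "Added benzylamine.\nVortexed 15 s.\n"
NEW = "Added benzylamine.\nVortexed 30 s.\nPrecipitate observed.\n"

added, deleted = page_diff(OLD, NEW)
print("added:", added)
print("deleted:", deleted)
```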
I also use on every blog and wiki page this little service called Site Meter. Again, it's another free-hosted service that tells you how people are finding the various pages. So this is something that I check pretty regularly because it tells me how the information that's on our wiki and blogs is actually being used. This is very interesting because you see people, for example, searching for the NMR of butylamine. That's the kind of search that is a little bit harder to do on a traditional journal article.
Whereas if I'm looking for oxidation of catechol or if I'm looking for imine chemistry, these are things where there's a lot of discussion about troubleshooting. There's discussion about things that don't typically show up in a traditional article. This gives me good feedback that we're actually accomplishing what we set out to do in terms of making the information usable.
Another nice thing about having Open Notebook Science is that you can actually tell the story of the failure, something very difficult to do in a traditional journal. We did manage to make this compound, for example, [inaudible] and we ran into a lot of problems. It actually took a long time to do this. Although you can discuss a little bit of this in a traditional article, here, I can go into great detail and talk about exactly why it is that we failed, why it is that it took so long. I can link to the original experiments. Maybe we save somebody some time if they're trying to do a similar thing.
So we use additional vehicles besides the wiki and the blog; we use a mailing list. I find mailing lists to be very useful for other groups. The wiki is good for my group and I certainly welcome people to contribute to it. But a lot of times, it's easier for people that are at different institutions to just hit Reply in their mail. So we use a UsefulChem mailing list for that purpose. We're working out a lot of little details that are not important enough to make it into the blog.
So if I just step back and look at the very big picture here, my group at Drexel--we are a synthetic organic chemistry group. In terms of the information flow, we have Rajarshi doing the docking. We also had, although not recently, Tsu-Soo Tan from Nanyang Institute, who's also done some docking work for us.
What we'll do is take that information from Rajarshi and we will then decide what compounds to make. Once we've made the compounds, we then ship them out to either the Phil Rosenthal group if we're doing an anti-malarial test, or we'll ship them out to NCI, where they've actually done anti-tumor testing on our products.
So once we get feedback from that, we can go back to Rajarshi and we can say, this is working, this is not working, then we can alter the model. It's nice to have so many great people who are willing to donate their time and expertise to actually do something constructive, to do a whole loop here. That's what I call "Closing the Science Loop." Once we get information, we can start over again and get more information from the docking group and just make more compounds until we make better and better agents.
A couple of other people have started to collaborate with us. I talked earlier about Gus Rosania. He's actually building a red blood cell model and a malarial parasite model, so that we can try to simulate the transport of the various drugs that we're making to see if we can make them better by changing their transport properties.
Other people who are very involved with Open Notebook Science: Cameron Neylon from Southampton University. He's also recording his experiments in great detail, but he's not using the wiki. He's actually using blogging software that is modified to keep track of versions.
There's not one way to do this, I'm just showing that you can use wikis and blogs to do it this way. There are other people and they have other reasons for doing things in a different way. So it's nice to see different examples like that.
In the next few minutes here, I'd just like to talk in some more detail about the Cheminformatics side of this. What we want to do, by exposing the information about compounds--InChIs, SMILES, whatever--is to be able to have machines process it as much as possible.
Here's something that we've done on our server, where we published information about each molecule in the format of SMILES, using a traditional blog. One post would correspond to one molecule on the blog. We're no longer using that because we have too many molecules now, but at first we were doing this. There are a few advantages to this; we can actually read the feed from that blog, and we can then calculate the InChI and we can calculate the molecular weight.
We can look up suppliers; we can make all of that available as separate web pages. We can also show the appearance of the molecule in 3D using Jmol. All of this stuff here is generated automatically. It only requires people to dump the SMILES in a blog post. OK, so that's one example. The other thing that we can do is take that same feed and convert it to a CMLRSS feed. We're not doing a lot of this, but we did it just to demonstrate that it was possible.
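The first step of that pipeline, reading molecules out of a blog feed, might look roughly like the Python sketch below. The feed is invented, and the convention that each post's title is the common name and its description is the SMILES is an assumption for this example. Calculating an InChI or a molecular weight from the SMILES would need a cheminformatics toolkit and is not attempted here.

```python
import xml.etree.ElementTree as ET

# A hypothetical two-post feed; real feeds carry many more fields.
FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example molecules blog</title>
  <item><title>benzylamine</title><description>NCc1ccccc1</description></item>
  <item><title>acetophenone</title><description>CC(=O)c1ccccc1</description></item>
</channel></rss>"""


def molecules_from_feed(xml_text):
    """Return a list of (post title, SMILES) pairs, one per feed item."""
    root = ET.fromstring(xml_text)
    pairs = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        smiles = (item.findtext("description") or "").strip()
        if smiles:  # skip posts that carry no structure
            pairs.append((title, smiles))
    return pairs


for name, smiles in molecules_from_feed(FEED):
    print(name, smiles)
```

The appeal of this arrangement is that the person posting only has to paste a SMILES string; everything downstream can be regenerated from the feed.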
We can take that CMLRSS feed and we can read it in readers like Bioclipse. Very briefly, Bioclipse enables you to read CMLRSS and then each entry here would be a different molecule; you can associate spectra and things like that. If anyone has more interest in that, feel free to contact me. There are people that are still working on this. This hasn't made it into our routine workflow, but it is something that we have experimented with.
Now I talked a little bit earlier about looking at spectral information using JSpecView and the JCAMP format. If you remember, the reason that this is really nice is you can expand any peak. This is showing how it looks in the browser, and that I can expand the peak. Now if we want machines to be able to make use of this JCAMP format, we can actually use Excel VBA. If I have Excel VBA reading the information from the JCAMP files, I can do something pretty interesting.
I can specify different regions in Excel as to the peak locations, and I can associate each one of those peak locations with what I think is the corresponding chemical group. If I run that, it will read each individual spectrum and it will take out all the XY data and all the metadata. It will automatically give me a reaction profile that I can determine kinetics from. This is neat because of how little I'm starting with.
I'm monitoring an experiment using NMR. I'm dumping those NMRs in JCAMP format, in a folder. I specify the start time of the experiment, I click on the Excel VBA and it spits out this reaction profile.
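The reaction-profile step can be sketched in a few lines. The lab did this with Excel; the Python below is a stand-in written for illustration, and the region names, peak windows, and spectra are all hypothetical. It integrates each assigned region of each spectrum and reports the regions' relative areas against time.

```python
def integrate_region(xs, ys, low, high):
    """Trapezoidal area under the points whose x lies in [low, high]."""
    points = sorted((x, y) for x, y in zip(xs, ys) if low <= x <= high)
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


def reaction_profile(spectra, regions):
    """spectra: list of (minutes_since_start, xs, ys).
    regions: {group name: (low, high)} peak windows chosen by the chemist.
    Returns [(minutes, {group name: fraction of total assigned area})]."""
    rows = []
    for minutes, xs, ys in sorted(spectra, key=lambda s: s[0]):
        areas = {
            name: integrate_region(xs, ys, low, high)
            for name, (low, high) in regions.items()
        }
        total = sum(areas.values())
        rows.append(
            (minutes, {n: (a / total if total else 0.0) for n, a in areas.items()})
        )
    return rows


# Hypothetical assignments and two made-up spectra on a shared x axis.
REGIONS = {"aldehyde": (0.0, 2.0), "imine": (3.0, 5.0)}
XS = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
SPECTRA = [
    (0, XS, [0.0, 2.0, 0.0, 0.0, 0.0, 0.0]),
    (10, XS, [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]),
]

for minutes, fractions in reaction_profile(SPECTRA, REGIONS):
    print(minutes, fractions)
```

Normalising to the total assigned area is one simple choice; a real analysis might instead reference an internal standard or account for the number of protons behind each peak.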
We don't do reaction profiling much lately, but about a year ago we did a lot of this. This is just another example of how you can leverage automation if you have the raw data in a usable format. A last thing here: what's the next step with what we're doing?
I showed you a little bit about Second Life, how we can actually create molecules. We have a rezzer in Second Life that Andrew Lang built for us. All that means is it's a little bit of software where instead of figuring out how to connect all these bonds, you dump either the SMILES or the InChI or the InChIKey in the chat box.
It will create a 3D version of the molecule in Second Life. This is using some of the scripts that Rajarshi wrote. It's hitting ChemSpider web services for converting InChIKeys into InChIs. It's using all kinds of different things, and it is very conveniently giving you the molecule. One of the advantages of these SMILES and InChIs and all this is that if you can minimize how much the user has to worry about transforming things, you can get people who are not familiar with Cheminformatics to do some pretty powerful things.
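Accepting any of the three identifiers in one chat box implies a small dispatch step up front. The Python sketch below shows one plausible way to do it; it is not the rezzer's actual code, it assumes the standard InChIKey layout, and it treats anything that is neither an InChI nor an InChIKey as SMILES without validating it.

```python
import re

# Assumed standard InChIKey layout: 14 letters, 10 letters, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def classify_identifier(text):
    """Guess whether the text is an InChI, an InChIKey, or a SMILES string."""
    text = text.strip()
    if text.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(text):
        return "inchikey"  # would need a lookup service to get a structure
    return "smiles"        # fallback: not validated here


for example in ("InChI=1S/CH4/h1H4", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "CC(=O)c1ccccc1"):
    print(example, "->", classify_identifier(example))
```

Only the InChIKey branch needs an outside service, since a hashed key cannot be turned back into a structure on its own.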
This is an example where I have my students do assignments in Second Life. All they have to do is figure out how to grab that SMILES or InChI and they can do this stuff. In the background here, this particular student assignment is looking at acetophenone. He is trying to explain how the spectrum supports the assignment of the molecule. Now, it's a little fuzzy here in the back, but this is basically a screenshot from JSpecView. What we're currently working on with Andy is making this interactive.
We want to be able to read JCAMP files directly into Second Life, and to have an interactive way to do expansions. This is something that I think is going to be really neat: to meet up with students and talk about the molecule, which is here in 3D. We can interact with the spectrum and we can try to figure out, for all of the different regions of the spectrum, whether or not they support the assignment. In order to do that, we need something better than this, which is just an image. We need to interact with the spectrum.
The bottom line with all of this is these Cheminformatics tools that we've been talking about; yes they're great for communicating between human beings, but I think ultimately where the real power is going to be is to go from human-human interaction to human-machine interaction; and eventually to machine-machine interaction over the free web, to design experiments, execute experiments and analyze them.
A big motivation for us, at least, is to start to use these Cheminformatics tools in a way that makes it easy for people who do like coding to write programs that will analyze what we've done in the lab, without having to interact with human beings to figure that out. That's the much bigger picture, and that's where we're headed. That's all I have for today.