ACS Talk on Cheminformatics in Open Notebook Science
Jean-Claude Bradley: ...so yes, I'd like to talk to you about Open Notebook Science, specifically, the role of cheminformatics in terms of storing information and in terms of retrieving it. So first of all, a little definition, what do I mean by Open Notebook Science? Well, if you look in the past couple of years, there's been a movement towards making the scientific process more open. I'd like to use this little chart here to show where we've come from and where we're going to. It falls into four different categories: It could be specific compounds. So someone might be looking for the NMR of TFA. If they do that they're in luck, we have lots and lots of NMRs of TFA. It could be a molecular formula. They could just be searching for T-guanidine, just tell me everything that you know about it, and I'm sure they're going to find ChemSpider hits with that. But people are clicking on these links and they're finding them.
In the traditional lab notebook, for example, everything is unpublished unless somebody makes an effort to put the data together and then send it for publication. But if they don't do that, everything, including the failed results, is not going to make it to anybody. All right, so as we move along this area here, we have the traditional journal article, so in that case, it's less closed in that more people can have access to the information. But again, it doesn't include a lot of the elements such as the failed experiments or the view of all of the experiments that have been run in a given project.
Something that has come up a lot lately is open access journal articles, and here we're talking about articles that can be accessed for free by anyone and, typically, the author or a third party will pay the cost of that. That's good, it's making things even more open, but again, these still have the traditional journal article format. So what we're talking about here in Open Notebook Science is full transparency in the scientific process where, actually, the lab notebook of my research group is made available in real time using a few different technologies which I will discuss.
OK, so to give you some background on where we are, my group at Drexel is a synthetic organic group, and by doing these things openly, we need collaborators, obviously. One of the greatest benefits of doing Open Notebook Science has been to find some great collaborators. Rajarshi here has been a very active collaborator of ours; he's been doing docking for us. We've had people who've tested our compounds. Lately, Phil Rosenthal has been testing our compounds for anti-malarial activity. So most of what I'll be talking about today is in the context of anti-malarial agents. We've also tested a few compounds for anti-tumor activity.
What I'm going to do in the later part of my talk is give you screenshots of the various tools that we use and how they fit together. So we have a blog, we have a wiki, we use Google Docs, mailing lists, we use ChemSpider, and we use CDD and all of these talk to each other. You'll notice that all of these are free-hosted services and that's really important for me. If we have people who think that what we're doing is a good idea, they should be able to replicate it for no cost and with minimal effort and because the services are free-hosted, you can in fact do that.
I'd like to start the story with the blog. In the blog, we typically report things such as milestones or larger problems that we've been having. There are not any hard-core experiments in the blog because that would be very monotonous and you wouldn't read it. So we put things that are more interesting to a broader audience.
Here, we're targeting this enzyme falcipain-2, a malarial enzyme that Phil Rosenthal is testing against for us, and I'm talking about all kinds of things. Here, I'm linking to EXP150. Now, this is actually a link to the wiki that is the lab notebook of how those compounds are actually made. So what I'm talking about is this compound here from this Ugi reaction, and here's a nice picture of the crystal of it.
So when I click on that link, it takes me to a pretty long page. I'm going to look at different sections of it and show you where the information comes from, and I'll try to focus more on the cheminformatics aspects of this. Essentially, if you go to this page, you should be able to link to the summary post that sort of explains the bigger picture of why we're doing this. Someone might have landed on this page just by doing a Google search. It has a whole explanation of everything including the molecules.
So if you want to click on this link, this Ugi [indecipherable], it will actually take you to the ChemSpider entry for that compound. Tony has already been through how great ChemSpider is, and it is, and it's got all kinds of information that his [indecipherable] calculates that I don't have to, and that's a huge benefit for a group like mine that doesn't really have a lot of computer science people. I mean, we're synthetic organic chemists.
So we can also link to the Experimental Plan of the experiment and we can link to the docking procedure. Rajarshi was talking about storing the procedure in such a way that other people would be able to reproduce it, and that's what we try to do as much as possible. Again, this is just a wiki page, so we're linking to the library; we're linking to information on the enzyme that we're docking against. Rajarshi actually wrote this and these are the results.
If we click on these results links, we end up with a Google Doc that has just the list of SMILES in the order in which they were docked with the enzyme. So this is still falcipain-2 that we're looking at and Rajarshi is saying these are the top ten compounds that we should probably think about making. That's what we're going to do and we've been getting feedback from the people doing calculations to decide which compounds to make next from our virtual libraries.
We also have a Procedure Section, OK, so that's somewhere down around here. The idea here is to write up the information in such a way that it could be quickly copied and pasted when we submit some of this work for publication in a traditional article. What I'm talking about here is not a way of bypassing the traditional system; it's just a way of getting the information out there much more quickly. We're still following that plan of submitting to traditional journals and this is, of course, what the format would look like.
Again, as part of that page, there's a Results Section and here, instead of just linking to the NMRs as PDFs, we actually link to them in JCAMP format and we use JSpecView, written by Robert Lancashire. It's a fantastic free program that allows you to view an NMR spectrum, or any spectrum in JCAMP format, in a browser in a way that can be expanded very easily. So this here is what the spectrum might look like in a PDF in a Supplementary Information section.
I know a lot of you probably know this: you want to find out what the NMR looks like and you get this PDF that was scanned and you can't tell what the picture looks like. But if you actually have the raw data in JCAMP format, you can easily expand this little peak and see that yes, it is a triplet, and you can measure the [indecipherable], you can do whatever you want with it. That's really, really important. If you're making statements about things, people have to be able to verify your raw data; they have to be able to reach the same conclusions based on the same information.
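To make the point about raw data concrete, here is a minimal Python sketch of reading a JCAMP-DX file and zooming in on one region of it. It assumes the simplest uncompressed layout, where each line of the `##XYDATA=(X++(Y..Y))` table holds one X checkpoint followed by Y values; real files often use compressed encodings that this sketch does not handle. The function names `parse_jcamp` and `expand` and the sample data are illustrative assumptions, not anything shown in the talk.

```python
def parse_jcamp(text):
    """Parse header labels and an uncompressed XYDATA table from JCAMP-DX text.

    Assumption: Y values are plain numbers separated by whitespace.
    """
    header = {}
    raw_ys = []
    in_table = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("##"):
            # Labelled data record, e.g. "##FIRSTX=0.0"
            label, _, value = line[2:].partition("=")
            label = label.strip().upper()
            if label == "END":
                break
            in_table = label == "XYDATA"
            if not in_table:
                header[label] = value.strip()
            continue
        if in_table:
            fields = line.split()
            # The first field is the X checkpoint; the rest are Y values.
            raw_ys.extend(float(v) for v in fields[1:])
    first_x = float(header["FIRSTX"])
    last_x = float(header["LASTX"])
    y_factor = float(header.get("YFACTOR", 1.0))
    n = len(raw_ys)
    step = (last_x - first_x) / (n - 1) if n > 1 else 0.0
    xs = [first_x + i * step for i in range(n)]
    ys = [y * y_factor for y in raw_ys]
    return header, xs, ys


def expand(xs, ys, x_min, x_max):
    """Return only the (x, y) points inside the chosen window."""
    lo, hi = sorted((x_min, x_max))
    return [(x, y) for x, y in zip(xs, ys) if lo <= x <= hi]


# Made-up eight-point "spectrum", for illustration only.
SAMPLE = """##TITLE=example spectrum (made-up data)
##JCAMP-DX=4.24
##DATA TYPE=NMR SPECTRUM
##XUNITS=HZ
##YUNITS=ARBITRARY UNITS
##FIRSTX=0.0
##LASTX=7.0
##XFACTOR=1.0
##YFACTOR=0.5
##NPOINTS=8
##XYDATA=(X++(Y..Y))
0.0 0 2 8 2
4.0 0 4 16 4
##END=
"""

header, xs, ys = parse_jcamp(SAMPLE)
window = expand(xs, ys, 5.0, 7.0)  # zoom in on the second peak
```

Because the points are kept as numbers rather than pixels, the same window can be re-plotted at any scale or measured directly, which is not possible with a scanned PDF.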
JCAMP is really a nice little format to use; we've done a couple of things with it. I had a [indecipherable] student last year write Excel VBA, where we basically monitor reactions. We would basically give the start time of the experiment and then Excel VBA will be able to run and it will calculate all of the different concentrations over time.
Of course, you had to assign the peaks; you had to tell it which peaks correspond to which compound. But once you've done that, it will automatically print out a kinetics run for the disappearance of some compounds and for the appearance of other compounds. So those are some of the neat things that you actually cannot do with a PDF of an NMR. So it's very, very useful to keep it in that format.
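The kind of calculation described here can be sketched roughly as follows. This is not the Excel VBA from the talk; it is a Python illustration that assumes each spectrum has already been reduced to peak integrals, that an internal standard of known concentration is present, and that one peak with a known proton count has been assigned to each compound. All names and numbers are made up.

```python
from datetime import datetime


def concentration(integral, n_protons, std_integral, std_protons, std_conc):
    """Concentration from a peak integral relative to an internal standard."""
    return std_conc * (integral / n_protons) / (std_integral / std_protons)


def kinetics_table(start, spectra, assignments, standard):
    """Return rows of (elapsed_minutes, {compound: concentration}).

    start       -- datetime when the reaction was started
    spectra     -- list of (acquisition datetime, {peak label: integral})
    assignments -- {compound: (peak label, protons behind that peak)}
    standard    -- {"peak": label, "protons": int, "conc": mol/L}
    """
    rows = []
    for acquired, integrals in spectra:
        elapsed = (acquired - start).total_seconds() / 60.0
        std_integral = integrals[standard["peak"]]
        concs = {
            name: concentration(integrals[peak], n_h, std_integral,
                                standard["protons"], standard["conc"])
            for name, (peak, n_h) in assignments.items()
        }
        rows.append((elapsed, concs))
    return rows


# Hypothetical monitoring run: two spectra taken 30 minutes apart.
start = datetime(2000, 1, 1, 9, 0)
spectra = [
    (datetime(2000, 1, 1, 9, 0), {"std": 3.0, "sm": 2.0, "prod": 0.0}),
    (datetime(2000, 1, 1, 9, 30), {"std": 3.0, "sm": 1.0, "prod": 2.0}),
]
assignments = {"starting material": ("sm", 1), "product": ("prod", 2)}
standard = {"peak": "std", "protons": 3, "conc": 0.1}

rows = kinetics_table(start, spectra, assignments, standard)
for minutes, concs in rows:
    print(minutes, concs)
```

The peak assignment is the only manual step; everything after it is arithmetic on the raw integrals, which is why the data has to stay in a numeric format.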
And as you saw Tony talk about, we've also been using ChemSpider to characterize compounds. So the difference here is that these are spectra that I approved of; these are final isolated compounds. But of course in research, you take monitoring runs, you take impure compounds, you're in the process of purification. And so all of those NMRs and spectra have to be accounted for as well. So right now, we're only using ChemSpider to store the best stuff, the stuff that we would send to a paper.
And the next thing that we're about to do - we've almost got this done - I'm working with Andy Lang, and we're using Second Life to actually display NMR data. And this will read JCAMP, and it will do so in a way where you can actually expand the spectrum. So right now, this is actually working, but for a spectrum of a fixed X-axis.
We're working so that you can talk to the spectrum and tell it to expand. And we've almost got that done, and I think it's just going to expand the capabilities of Second Life. I'll be talking about Second Life tomorrow. Actually ACS Island is going live tomorrow at the SciMix, so hopefully I'll see some of you there.
Another section of this page - again, I'm going through this long, long page that's one experiment - is that there has to be a log. So we can actually construct the rest of the experiment based on the log, but without the log, we can't do anything because you can't remember what you did, exactly when you added things, exactly the way you measured them.
This is something that I try to reemphasize to my students: It's absolutely critical that you keep a proper log because later on, you can do whatever you need to do. So when I say that our experiments are in real-time, what it means is that the log has to be up by the end of the day. The other sections of the experiment can take weeks to actually get uploaded.
So finally, when you come to the conclusion section of this article, it says that the Ugi product was obtained in 59% yield. You don't have to take our word for it at all; you can go back and reinvestigate every single aspect and the arguments that we made.
OK, so now comes storing and retrieving information and this is where the Cheminformatics comes in. So in order for us to retrieve compounds in experiments, it's been a challenge. We've used a tag section, so at the very bottom of each blog or Wiki page, there will be tags. And we can use a number of things for this: We can use SMILES, for example. But we chose not to do that because there are multiple SMILES for a given compound.
So we've been using InChIs and that's worked well for small molecules, but the reality is that for very large molecules, like our Ugi products, the InChIs are not indexed properly by Google. So what we've started to do in the past couple of months is use these InChIKeys. And we use ChemSpider to provide the service to generate those InChIKeys based on either the InChI or the SMILES that are submitted to it. And these links here of the common names, those are for human readability, and when you click on them it takes you to ChemSpider.
So that's how everything is connected together, so that when you do a Google search with this partial InChIKey, it will come up with all of the different experiments where we used that compound. And if you're going to do this, you'll have to remember to click this 'Repeat Search with Omitted Results' because Google will assume that you're not interested in results if it comes from the same domain name - but we are because that's the point of this, we want to get all the experiments.
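A search like this could be assembled along the following lines. The sketch assumes the current standard 27-character InChIKey layout (14 letters, 10 letters, 1 letter, separated by hyphens) and takes the "partial InChIKey" to be the first 14-letter block, which encodes the connectivity. The caffeine key is used only as a well-known example, `example.org` is a placeholder domain, and the `filter=0` URL parameter is assumed to correspond to repeating the search with omitted results included.

```python
import re
from urllib.parse import urlencode

# Standard InChIKey: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def connectivity_block(inchikey):
    """Return the first block of a standard InChIKey."""
    key = inchikey.strip().upper()
    if not INCHIKEY_RE.match(key):
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    return key.split("-")[0]


def google_query_url(inchikey, site=None):
    """Build a Google search URL for the partial key, optionally on one site."""
    terms = connectivity_block(inchikey)
    if site:
        terms += f" site:{site}"
    # Assumption: filter=0 asks Google not to collapse results from one domain.
    return "https://www.google.com/search?" + urlencode({"q": terms, "filter": "0"})


# Caffeine, used purely as an example compound.
url = google_query_url("RYYVLZVUVIJVGH-UHFFFAOYSA-N", site="example.org")
print(url)
```

Searching on the first block alone is a design choice: it is short enough to be indexed as a single word, whereas a full InChI for a large molecule is not.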
Now you can do a lot of tricks with Google. For example, if you're familiar with the Google Co-op, also called Google Custom Search, you can take all of our blogs, Wikis and all the pages that we've generated and create a special Google search that will only look on our approved pages; and if you do that then you have a way of searching a very rarified part of the Internet. So all that is available, totally free and anybody can do this.
So how are people actually finding our experiments? Well, I've yet to actually catch anyone using an InChIKey to find an experiment, so this is something that we're going to be doing moving forward. But I think it's very, very interesting to see how people are actually finding our experiments through Google.
It could be experimental conditions. And this is actually really important because a lot of this stuff is like side reactions of amines. If you search the traditional literature for side reactions, for how stuff can fail to work, you typically don't find that there. But of course, most of the typical lab book is almost all failures, so you're going to find lots and lots of stuff that doesn't work, and that's the point. If someone was searching for kinetics of the Boc protection, they will also be in luck; we have lots of kinetics analysis of that.
So the other thing that they can find out at a higher level is that I talk a lot about educational things. So if they're looking for free downloadable chemistry videos, we've got that. If they're looking for 3-D periodic tables, it's something that I discuss, and people are looking for that. And also, some people are looking for bigger pictures, like malarial targets, cheminformatics, project proposals. So I've discussed the proposals that I've put in and that's great, somebody who was looking for that would have actually found it. And of course, you can also search these experiments just by the traditional table of contents file.
OK, so now if you want to use this information in a much more meaningful way, we want to be able to compare different experiments. And as you keep doing experiments it becomes more and more difficult to keep in mind everything that's been done. So again, we're using tools as they become available to us. If somebody comes to me after my talk and volunteers to run a database with this, I'd be very happy. But right now, we're using Google Docs because it's simple and we don't have a lot of results to track, and so it makes sense, but of course, at some point, it should be imported into a real database.
But basically here, each one of these rows is one of these experiments of these Ugi reactions. And one of the things that we observed is that sometimes we get a precipitate and sometimes we don't. If you get a precipitate that's great because you can just filter it and you can scale it up, and you can do a lot of things. If you don't get a precipitate, you would have to run chromatography and that would really complicate things and make it very difficult to scale up.
So one of the things that we've been looking at is can we predict which Ugi product is actually going to precipitate. So in order to do this simply, everything is put into this table that is publicly available, and then people are free to run models on this. I'll talk about that a little bit later, well now actually.
Rajarshi has built models to actually predict, going forward, compounds that we have not tried to make yet. He's predicting right now - in the list of 100 compounds that we're scheduled to make - these three should be precipitates. We've also collaborated with other people, MSA Analytics. They've actually just this morning run a model and co-predicted that this compound should precipitate.
But they predicted that another compound would precipitate that Rajarshi did not. So again, as he was saying, it'd be really nice to have ways of comparing models to each other and to do that in a very systematic way. As long as we're keeping this fully open, it makes it very easy for people to participate and collaborate with us.
Now going forward, I showed you that log and there are no rules for that log except that my students have to record what they do, when they do it, and what they observe, but they're not required to type any special words or anything. The problem with that is it makes it really difficult to convert that into a format that a machine could use; you couldn't readily extract that and put it into a database. One of the things that we've been doing in the past few months is actually rewriting our logs in a format that should be machine-readable.
Here, I have a series of steps, a workflow actually, where you're allowed to add something, you're allowed to wait, you're allowed to vortex, you're allowed to take a picture but you can't do anything else. We have words that we've defined as meaning certain things, and we have parameters that we've specified. For example, we specify the molecule using the InChIKey and we also specify it with the common name so it's human readable but, essentially, that's just something that we decided to do that way for a bunch of reasons. But there's no reason that it has to be done this way.
If somebody wanted to convert this and have the SMILES, they could do that easily. They would just scrape the information and then convert it. So this, I think going forward, is going to be very important especially as we start to gather more and more information.
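As a rough illustration of what such a restricted log might look like to a script, here is a minimal Python sketch. The line format (elapsed time, one action word, then `key=value` parameters) is a hypothetical stand-in and not the format actually used in the notebook; only the four allowed actions come from the talk, and the compound is a placeholder.

```python
from dataclasses import dataclass, field

# The only actions the workflow permits, as described in the talk.
ALLOWED_ACTIONS = {"ADD", "WAIT", "VORTEX", "PICTURE"}


@dataclass
class Step:
    time: str
    action: str
    params: dict = field(default_factory=dict)


def parse_log(text):
    """Parse lines of the form 'HH:MM ACTION key=value ...' into Step objects."""
    steps = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        time, action, *pairs = line.split()
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"line {number}: unknown action {action!r}")
        params = dict(pair.split("=", 1) for pair in pairs)
        steps.append(Step(time, action, params))
    return steps


def compounds_added(steps):
    """List (InChIKey, common name) for every ADD step."""
    return [(s.params["inchikey"], s.params.get("name", ""))
            for s in steps if s.action == "ADD"]


# Hypothetical log; caffeine is a placeholder compound.
LOG = """\
# elapsed-time ACTION key=value ...
00:00 ADD inchikey=RYYVLZVUVIJVGH-UHFFFAOYSA-N name=caffeine amount=1.0 unit=mmol
00:02 VORTEX seconds=15
00:03 WAIT hours=4
04:03 PICTURE file=photo1.jpg
"""

steps = parse_log(LOG)
print(compounds_added(steps))
```

Keeping both the InChIKey and the common name on the same line serves the two audiences at once: a script reads the key, and a person reads the name.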
So looking at these results, we can also compare them. This is a little different from the table that I showed you earlier where we're looking for precipitates or no precipitates. These are actually individual results from experiments that are stripped out and left to stand on their own. So we would mix compounds together and then wait four hours and take a picture. That's not the whole experiment, that's just the first data point. Then we will take one at eight hours, and we will take one at 12 hours.
Now, if something bad happens on the 12th hour, let's say the student drops the sample, well, we will call that an aborted experiment. Everything in that experiment probably wouldn't be worth going through and digging through it. But if we extract every individual result as something that's addressable and minable, then it really doesn't matter what happens down the road.
Somebody may not be interested in the Ugi reaction whatsoever, but may just be interested in all the reactions where an amine has reacted with an aldehyde; they would find that here. They would not get any interpreted information; they would just get a picture of what that looks like. What does this look like when I mix it together and wait four hours? So that's a strategy that I've been using to get a lot more information that's going to be much more machine-readable and machine-friendly.
A few things about the wiki: when we first started to do this project, we actually used the blog to record experiments. Then it turned out the blog is really not a great tool for this because if you change an entry, you have no record of it. You can't tell if it was changed, and you don't even know who changed it. With the wiki, you can see all the recent changes in it; you can look at a specific page. So this is EXP150 that we've been looking at all this time. I can see all the different versions and I can see who made the changes. I can compare any two versions and, using Wikispaces, the new stuff shows up in green and the stuff that was deleted shows up in red.
I can use the wiki as a way to organize results, to explain something that is extremely difficult to publish - failures. So here's a little story about the synthesis of DOPAL and we tried to make this compound. We eventually did make it but we failed trying to make it in some really interesting way. This is just the story of that and it's got links to the papers we tried to use that had wrong information. It's just basically explaining all of that and I think that that can be useful for synthetic organic chemists. It's typically not something that you make public.
We also use a mailing list, which turns out to be really handy for collaboration between groups. My group would use the wiki almost exclusively, but when we collaborate with people who do docking, with people who do testing, the mailing list appears to work pretty well.
The other piece of this that we've just started to tap into is CDD, Collaborative Drug Discovery. Drexel University now has users on here and this is a way for us to basically store and retrieve assay results. It turns out that two of our compounds are active against falcipain-2 and are active in preventing malarial infection. They're not terribly active; they are less active than chloroquine or the best agents that they have, but it's a start, so those results are stored here.
Something really neat actually happened just in the last week or two--so what's the point of making all of this stuff public? Well, you get contacted by people that you totally don't expect. Brent Friesen at Dominican University runs the Sophomore Teaching Lab, and he was interested in having his students do something more interesting than just repeating experiments that they've been repeating for 20 years that everybody knows the answer to. He thought maybe the Ugi reaction might be useful for that, so he contacted me and we talked for a little while and we determined, "Yes, this will make sense."
So he just wrote his manual for the spring for Chem 254 and it is the Ugi reaction and his students are going to be doing new reactions that we haven't done yet and we're going to be testing those compounds against malaria. That would be really neat to be able to include students taking regular teaching labs; it's a whole untapped resource. It requires some more time, it requires some more dedication from people who run the teaching sections, but this could be very, very interesting and it could be very motivating for the students.
There have been some other people doing Open Notebook Science. A student from Gus Rosania's group was here earlier. He's a collaborator with us; he's going to be studying the drug transport [indecipherable] for our compounds in the parasite and the red blood cells. Cameron Neylon over at Southampton has also been doing Open Notebook Science. He doesn't use a traditional wiki; he uses a modified blog for this.
I'd just like to end on this slide about where I think science is headed. We've been living in a world where the only point of communicating science was really for other humans to understand it. We're getting into this really interesting time now where we can actually have human beings collaborating with machines, if the human beings choose to make information available to them. I think that's how we're going to get to this point where we can have machines actually doing real science, formulating hypotheses, testing them, analyzing the results, and then planning the next experiment.
I think to get to that kind of situation, we need to have free services, we need to have a possibility that anybody in the world can write a script that's going to try to process information and spit out something useful. Hopefully, this is one way that we can actually get there and, I think, it'd be very useful once we do.
So that's it.