Wednesday, November 26, 2008

Open Notebook Science in 15 minutes

Jean-Claude Bradley: All right. So, I will try to explain to you these two concepts of the synthesis of anti-malarial compounds and Open Notebook Science in the next 15 minutes. Well, this is actually a pretty good time to give this talk. This week we actually got our Wikipedia entry for Open Notebook Science. And it turns out that it required a lot of people's coordinated efforts. And it required a body of work before we were able to do this. But, if you go on Wikipedia, you can learn a lot more that I don't have time to explain today.

The idea of Open Notebook Science is basically to report the work that you do in the laboratory in real time or as close as you can to real time, so that the entire world knows as much as you do about your research. Like I said, there are a number of references here where you can take a look at the background of this. But, the motivation is that - well, it should be self-evident that it's a way to do faster science compared to either not disclosing some things or significantly delaying them.

And I think, it's also a way of doing better science, which is not immediately obvious, but hopefully I will show you some examples of how that can be.

OK, now to the synthesis part. So, we are a synthetic organic chemistry group and our target is malaria, specifically Falcipain-2. Malaria, as you should know, is a disease that is spread by mosquitoes. And here's actually the malarial parasite inside of the red blood cell. And it uses the enzyme Falcipain-2 to metabolize hemoglobin. So, if we can inhibit that enzyme, it could be a way to basically stop the process of it replicating.

And so, what we have done is we have collaborated with Rajarshi Guha at Indiana University. And he does the docking, which basically means that he takes the Falcipain-2 in the computer and tries to dock molecules and see if they fit or not. If they fit, there's a chance that they might inhibit it. And so, he tells us which compounds to make. And then we make them and we ship them off to UCSF where Phil Rosenthal does the testing.

So, this is a collaboration done completely openly, and people can join in or they can follow what we are doing well before the publications come out. So, I am not going to talk too much about the nuts and bolts of it, but, suffice it to say that we use blogs, wikis and all these different social networking sites to try to make this system fully hosted and fully replicable by anyone else in the world who might want to do a similar thing. And that's happened. And I can surely talk to you if you are interested in seeing the different groups that have done that.

I was telling you that it's a way of doing better science, and it really comes down to "where's the beef?" when you talk about your experiments. So, this is a blog post here, where we are talking about doing different things. And it says "See experiment 150 for more information." So, this is the Ugi reaction that I will be mentioning over and over in this talk.

And if you click on that link, it takes you to the lab notebook page for experiment 150. And this actually looks very similar to what it would look like in a paper notebook. And that's on purpose. We wanted to make things as easy as possible for people to get involved with Open Notebook Science.

So, you have an objective, and you have all these different hyperlinks. So, one of the things that you can link to - and this is a pretty long page, I am just going to skip through it giving you examples. You can hit that Ugi edit link and it takes you to an entry in ChemSpider. ChemSpider is a free database. It has over 21 million compounds. You can do all sorts of searching. You can do all kinds of things for free. And I don't have to worry about that on my server.

So, that's what we are trying to achieve here. We are trying to get high quality information processing without having to become computer scientists to do it. And it's becoming really possible to do.

We also link to the docking procedure that our collaborator Rajarshi uses. Again, here the idea is that this is replicable. Someone who has done docking before should be able to get enough information from this page to generate the same compounds in the same order; all right? So, these are called SMILES codes and they are convenient ways of representing molecular structures, and you can just dump them in spreadsheets. So, it's a pretty convenient way.
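To illustrate why SMILES strings are easy to "dump in spreadsheets", here is a minimal Python sketch that writes a few (name, SMILES) pairs to CSV text and reads them back. The solvent list and the helper names `to_csv` and `from_csv` are assumptions made for illustration; they are not the group's actual files or docking input.

```python
import csv
import io

# SMILES strings for a few common solvents.
solvents = [
    ("methanol", "CO"),
    ("ethanol", "CCO"),
    ("acetonitrile", "CC#N"),
    ("toluene", "Cc1ccccc1"),
]

def to_csv(rows):
    """Serialize (name, SMILES) pairs as CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "smiles"])
    writer.writerows(rows)
    return buffer.getvalue()

def from_csv(text):
    """Read the CSV text back into a name -> SMILES dictionary."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["name"]: row["smiles"] for row in reader}

csv_text = to_csv(solvents)
lookup = from_csv(csv_text)
print(csv_text)
```

Because each structure is a single short line of text, it survives a round trip through any spreadsheet or text file without special software.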

Again, this is all made explicit, so you don't have to ask the researcher for permission. You can just go and look at the results.

Another very helpful thing is our spectra. If you know anything about organic chemistry you know that the basis of it is spectra, especially NMR spectra. And there's actually a very neat way - if you have your NMR spectrum in JCAMP format, you can run JSpecView so that someone who does not know anything about Java or anything just hits this link and this spectrum pops up, and it's actually interactive.

So, you can use your mouse and drag across any peak and it will expand. Again, here - this is what I am talking about doing better science, you know. Maybe you didn't expand that peak in your paper. Maybe you didn't talk about it. But, if I am trying to replicate this or I am trying to extend your research, maybe I am interested in that peak. Maybe I want to measure it. And so there are just more details.

So, by the time we end up with the final conclusions and it says "This Ugi product was obtained in 59% yield," you don't have to take our word for it. It's all backed up - either well or poorly - but it's all backed up, exactly what's supporting our statement.

If you are not familiar with the wiki, the reason that we use it for a live notebook is that every time there's a change made, it tells you who made the change and exactly when. And we have a third party time stamp for it. So, we can claim that we knew what we knew exactly when. And we are not running the time stamp. It's run by a third party that's well respected. So, that could be interesting down the road to settle claims.

We can compare any two versions, and Wikispaces basically shows you that the stuff in green is the stuff that was added, and the stuff in red was deleted. So, it's a really nice way to understand what people are good at, right? Because this is a collaboration, many people in the lab working together, certain people are good at some things and other people are good at other things. And this is a really good way to keep track of all that.

Now, to find information, that's actually a big issue. Obviously, if we just left it in the wiki like that - I mean we have tags. We have ways of searching for the information. But, you don't want to have to do that if you are interested in seeing the collection of experiments that we have run.

So, we've run this Ugi reaction several times and we have modified the conditions. So, we have used different starting materials, different solvent amounts, and different concentrations. And we have sometimes gotten a nice precipitate that was pure product and sometimes we haven't. So, we are trying to understand that. And we are using these Google Docs as a way of sharing that information in a very convenient way.

So, this is a spreadsheet. It works very similarly to Excel but it's free and it's hosted. So, I will show you an example of an opportunity we had recently to use a robot from Mettler-Toledo. And we were able to actually automate the optimization of this reaction. This was done in collaboration with Dr. Owens. He did some statistical analysis, which I won't have time to get into. But, the idea here is that we wanted to find the highest yields - the conditions for the highest yields.

So, we modify concentration, we modify the solvent, and we modify the excess of some of the reagents. So, we actually did these reactions in little tubes that had a filter at the tip. So, the robot added the four different components. And then it precipitated or it didn't. And if it did, we just washed it and then weighed the results. And of course took an NMR to make sure that we actually got the compound.

So, this is a picture of the robot. And it's basically just a syringe that goes and takes the liquid out and puts them. An interesting thing about using a robot is that you automatically get the log of what the robot did. And it pays attention all the time. So, it will record what it thinks it did. That's a double-edged sword. It gives you a lot of information. If you want to debug things, yeah, you absolutely have some good data to look at.

But, it also means because you are able to do so many more experiments, you have to be even more vigilant about systematic errors. And we've had that problem. And so, you end up doing a thousand experiments before you find the problem, all right.

But, once you get it working, actually, this can be extremely useful. So, just to go to the final results here. So, we did these experiments and we had enough material to publish a paper. So, here's another use of the wiki where we actually wrote the paper in the wiki. So, every single draft was saved. And we can go back and see exactly how the paper was written.

And the really nice thing about having a notebook to point to is... See, I can have reference nine to 11 be the melting point of the compound, and I can specify the batch that it was taken from - from experiment 99, whereas the proton NMR was taken from experiment 203, sample A 11. So, that information is typically not part of a typical publication. You assume that the guy knows what he is doing and that he actually characterizes his compounds properly. Well, that is not always the case, as we find out painfully. So here we can actually go and see if there is a problem with the specific batch if we are not getting the same information.

Now, where we actually submitted this paper is kind of interesting. It's called the Journal of Visualized Experiments, JoVE. So, there is a written part to this that I just showed you; that is what we wrote on the wiki. And they actually sent some camera people to record our experiments. And so, this is now under peer review. And we should hear back shortly about this. And I don't see any problems and I don't expect any problems.

So, this will be a nice way to communicate with video as well. So, there are so many tools now that make communicating your science faster without losing anything. Another thing that the physicists have been using for a while is pre-print servers. So, chemistry really didn't have a good pre-print server - well, they did, but that's a whole other story; it's no longer working. So, Nature actually recently came up with this Nature Precedings, which is a pre-print server and it's backed with the editorial filter of the Nature Publishing Group. If you are not familiar with Nature, it's one of the most well-respected publishers out there.

So, if they basically say that this has good scientific quality, it's probably true. And so, before publication in JoVE or any peer-reviewed journal that we choose to publish in, we can actually link to this document. People can comment on this document. They can vote on it. They can give us feedback. You can have versioning on here. All kinds of things you can do.

Normally, when you have a paper coming out, you just tell people "Well, it's going to come out next week," and when it does "Here's the link." Well here, now, you can actually give the link and you can have [inaudible 10:39].

So, the bottom line here is we did find a maximum yield - 66%. We went in with a yield of about 49 to 50% - we got some increase. But, the major result of this was really to prove that we could optimize the reactions with robotics.

Now, so far as the malaria project, that's actually important because that's how we make our compounds, with the Ugi reaction. Recently, we've actually gotten some results about this. We have four compounds that actually are active in inhibiting the enzyme, and they are also effective in inhibiting the infection of Plasmodium falciparum. And these are in the micromolar range. So, it's not bad. I mean, it's definitely publishable stuff.

And there are different stories here. We used one receptor area on the enzyme here. We used another receptor here. I don't have time to get into it, but it's kind of interesting the results that are coming out of this. And again, this is out into the open. And we never know who is going to stop by and collaborate.

A last little story. I recently did a little trip in the UK. And my friend here Cameron Neylon who also does open notebook science - although, he uses a different system than I do, we had the chance to spend a day in the lab to do experiments. And one of the things that evolved from my trip is a very simple project using open notebooks. And we spent the day measuring solubilities. So, we took a bunch of compounds and we took a bunch of organic solvents, and measured the solubilities. And then, we reported these solubilities in a Google doc.

Now, this is actually very interesting. So, for Boc-glycine and methanol, we are measuring 4.4 molar. And you notice that that's in green. And down here, for D-glucose and methanol, we do get a number, right, and it's 0.05. But, I put it in red and I don't actually include that number in my final results, because I am not satisfied that I am going to stand behind these. I don't think that 1.8 milligrams in the way that we were measuring it is good enough to report this.

But, what if you want a ballpark estimate? You can still access my number and you have all the details of the context in which it was taken. So, again, that's better science, I think. And what we are trying to do with this project - it's actually related to the malarial project in the sense that we can measure solubilities, report them publicly, and then build models; and Rajarshi Guha is going to help us build models of solubility - we should be able to predict the yields of these Ugi reactions in different solvents.

So, the idea is, for this Ugi product, you should do it in 51% methanol, 4% ethanol and the rest is acetonitrile. So, that will be a very powerful thing that can be used not just for our project, but really anyone could. And sort of to get this ball rolling, I set up this Open Notebook Science Challenge. And what it is, essentially, is we are asking people from around the world to contribute their measurements so long as they link them to a well-maintained notebook. And if they do that then we can use these results, and we can publish with them, and we can do everything that we do as scientists.

And we have a sponsor. Aldrich has actually volunteered to ship compounds anywhere in the world to encourage people to do this. So, I am very excited about this. It's a new initiative. And I think it has a good chance of working.

And there are so many people to thank here. Khalid is my grad student. Kevin Owens you just heard from. Tim Bohinski is an undergrad who just started working in my lab this term, measuring solubilities. James is also an undergrad. Tom Osborne is the Mettler-Toledo rep who was very patient and took a lot of time to bring us the robot for us to get these results.

Antony Williams is the guy who runs ChemSpider, the database that I showed you for molecules. Andrew Lang actually put our results into Second Life. Because of the briefness of this talk, I wasn't able to get into that. But, you can visualize the optimization of the reaction using 3D plots. You can rotate them in Second Life. So, Andy did that. And of course, Cameron from Southampton.

So, that's it. Any questions?

Monday, November 24, 2008

iSchool Open Notebook Science talk

Jean-Claude Bradley: I would like to tell you today about Open Notebook Science. My talk is based on the work we did in chemistry in terms of making anti-malarial compounds and measuring solubilities. But, I have actually put this talk together in a way to sort of minimize the chemistry and focus more on the IT aspects of it. Hopefully, by the end of it, you will understand pretty well what it is we are doing.

This really comes at a time when several themes have been emerging over the past few years in science and in teaching, and one of them is openness. There are a few people here who are doing more open teaching in terms of recording their lectures and making them open on iTunes. All of these things are progressing.

The same thing is happening in research. We are going from a world where we have a traditional lab notebook, which is unpublished and will never see the light of day unless somebody writes a paper about it, to more and more open forums. Traditional journal articles are more open, but people have to pay for them, and so access is limited to only a certain subset of people.

Recently, people are talking about open access journal articles. Again, that's more open, because the articles are free to access; however, generally the authors have to pay for the cost, so it's not generally a totally free deal.

At the end of the spectrum, we have Open Notebook Science. The idea of that is total transparency in the research process. We want to make available the actual lab notebooks of the students in real time or as close to real time as possible for the world to see so that's what I'll be talking to you about.

Now, my job is a little bit easier as of the past couple of weeks because we now have our Open Notebook Science entry in Wikipedia. If you're more interested in looking at some references, we've got some things coming out in Cell, C&E News and Nature. So, with this kind of accumulating scholarship, it actually is starting to really take shape. I'll tell you about some of the people, besides myself, who are involved in this.

Again, I am going to start from the IT perspective instead of the chemistry. The way that I always like to talk about this, which usually happens to be my last line for the chemists, is that we're moving from a world where we have human-to-human communication. That's what science has really been from the very beginning, one scientist telling another scientist what they did or trying to avoid what the other scientist did.

Right now, I think we are in a very interesting phase where we are starting to have humans communicating with machines and back and forth in terms of scientific information. Eventually, I think that the whole scientific process will get done by machines talking to machines, but I think in order to get to that point we have to go through this period where we have to somehow find a compromise so that both humans and machines can access the same information and talk to each other.

That comes down to what is the information that chemists actually manipulate. There's this concept of what's a fact. What's true? What's false? If we find a number, how sure can we be that it is close to the true value? We have to remember, and students often forget this, that there really are no facts; there are just measurements embedded within assumptions. The problem with the way that chemistry, especially, is being currently communicated even in traditional journals is that those assumptions are not made explicit. It is very difficult to tell exactly what was done or what the author actually did or didn't do.

Open Notebook Science maintains the integrity of the data provenance from the lab notebook all the way to whatever documents happen to come out of that. So, if somebody wants to question a number, they can just click through and have access to the original lab notebook page.

Here, we are moving from a concept of trust and there is actually a lot of trust today in science. If someone that you know writes a paper, you are more likely to think that it's correct or more likely to be more valued than if someone else writes it, or if an article comes out in a certain journal you may give it more credibility than in another.

We have that in all the three fields. Chemistry has its own journals where if there's a yield that shows up, you know that you can't trust it because it's a certain journal, which I won't say which it is, but as chemists, we all know those. That's really based on trust. I think that if we start to move away from trust and move to just simply providing proof, if you provide sufficient proof you don't need trust anymore because the machine or the person has to back up everything that they're claiming.

Let me give you a very specific example. Here, we are looking at the solubility of 4-chlorobenzaldehyde. Solubility is just how much of a compound goes into solution, a very, very simple chemical concept. It should be something fairly easy to answer. You should just basically look it up in the literature. It turns out that, surprisingly, a lot of these measurements have not been done. Very, very simple things, but you can't access them.

It should be a very simple thing. Give a couple of undergrads some compounds and a scale and just let them go. Well, it's not that simple. If you look at the values that our students have been obtaining, you'll see here a high number five to one to 3, and then there's a number here 0.07. That number was collected along with all the other numbers since we're just reporting results.

As a chemist I looked at that and it didn't make any sense at all because these compounds are actually pretty similar chemically. It would be extremely surprising for them to have very different properties. Either, we discovered an extremely interesting phenomenon or, as is more likely the case, there is a problem with the way the measurement was done.
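The kind of sanity check described here, spotting a value that is wildly out of line with chemically similar compounds, can be sketched in a few lines of Python. This is only an illustrative heuristic under stated assumptions: the compound labels, the other three solubilities, the `flag_suspects` helper, and the factor-of-ten threshold are all hypothetical; only the 0.07 figure comes from the talk.

```python
from statistics import median

# Solubilities in mol/L. Only the 0.07 value is quoted in the talk;
# the compound labels and the other three numbers are hypothetical.
measurements = {
    "compound A": 5.0,
    "compound B": 2.0,
    "compound C": 3.0,
    "compound D": 0.07,
}

def flag_suspects(values, factor=10.0):
    """Return names whose value is more than `factor` times above or below the median."""
    mid = median(values.values())
    return sorted(
        name for name, value in values.items()
        if value > mid * factor or value < mid / factor
    )

suspects = flag_suspects(measurements)
print(suspects)
```

A flag like this does not say the number is wrong; it only says which notebook page is worth opening first.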

For that particular measurement, by looking at the lab notebook (I'm going to look at the specific experiment for a bunch of slides), we were able to uncover the fact that this compound was evaporating in the SpeedVac, the machine that basically lets us find out how much is dissolved.

We redid the experiment and now we get a value of 3.6 molar. That number makes more sense, so that number is validated according to what I think based on what I saw of the experiment, whereas the first number is rejected. In the literature, you will not get to dig down into the original proof for this.

Let's take a look at what is actually in the lab notebook page. We're missing a little bit of the screen here. Up here, it says log, so this is the log section of the lab notebook. It basically has just a sequence of what the students did and what they observed. That's how you are supposed to keep a lab notebook in organic chemistry.

I have highlighted a couple of things here. Unfortunately, there's a section missing and the section that is missing actually said "did not measure the amount of time vortexed." The lab notebook doesn't just tell me the stuff that the students did. It tells me what they didn't do, whereas if I were using a trust-based system I would assume: oh, this person probably measured the time they did this. They probably measured the pressure. But unless you actually check, you can't know that.

Down here is the same thing. There are actually times here that are missing from the screen. I can actually see how long they put it in and what they did or didn't measure.

The other thing that we can actually drill down to is the rationale of the findings. We can make those explicit. This is actually a discussion and conclusion section of that same page, and it basically talks about the data. It talks about the raw data. Some of this, I wrote. My student wrote another part. Someone else might have come in and actually added to it. It explains the rationale of why we think that 0.07 number is incorrect.

You can look at the raw data, but you can also look at the rationale of the scientist. Again, this is not typically provided in a paper because this is actually just developing. You can also look at the actual documents that are offered up as proof.

Here, we actually have pictures. You can see here that there are some interesting things. These are after evaporation so we would expect to have all dry solids here, but you can see that the one on the left actually still has a bit of liquid left in it. Is that a problem? In this particular case, it turns out it's not a huge problem, but you're made richer by knowing that there was a potential issue here.

You'll notice also that number 46 is all covered in white, and it looks different from the other ones. I would look at that number a little bit more cautiously because it looks like it actually burst and started to bubble over while it was evaporating. Those are the things, again, that are not provided in a typical research article in chemistry.

Down here this is actually on Flickr. Everything that we're using is as open as possible. This is the nice thing, that you can get random people coming in and making their own contribution or using the data in a way that you wouldn't expect. This guy thought that it was a good looking picture. It had nothing to do with chemistry, but that doesn't matter. This is the Web 2.0 type of information sharing.

We also make use of YouTube. That actually is a very efficient way to record experiments because, again, instead of asking the student to write every detail of what they did you could just do a quick video, and I can actually ask myself questions: did we hook this up correctly? Where is the thermometer? All kinds of things that are gone if they don't record it.

Here, we are actually using very short clips. We're not talking about recording the entire experiment over hours. We are talking about a 30-second clip. Show me your setup and then they don't have to write it up. This is a way of learning....

Audience Member: It seems like 20th century technology where you have to make analog recordings that are translated into digital as opposed to, maybe, in the future having smarter machines that automatically record what you're doing.
Jean-Claude: Yes.
Audience Member: ...a chance to measure temperature and stuff like that.
Jean-Claude: That's where we're headed. In fact, I'll show you a robot that we've used exactly in that way. The thing is, you know, we don't have robots for everything, and we don't have all the stuff accessible. We have to do what we can with what we have. But I agree, if all this could be automated, it would certainly be a lot easier.

The other things that we show are the calculations. This is a really simple measure. You are basically just seeing how much material dissolves in a certain amount of liquid, and you're just measuring it. It turns out there are a lot of calculations in that. You have to weigh the empty vial. You have to weigh it with the liquid. You have to weigh it after it evaporates.

It's clear that there are a lot of places where you can make mistakes. If you see a number that looks strange, I would come here first to actually take a look at the calculation. Maybe, the student made a mistake. If they did make a calculation mistake, then, at least, I know that and I can deal with it. I can drill down to see exactly where the problem might be.
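The three weighings described above (empty vial, vial with liquid, vial after evaporation) lead to a short chain of arithmetic, which is where mistakes can creep in. The sketch below is one way to write it down in Python. It is an approximation under stated assumptions: it reports moles of solute per litre of solvent, the function name `molar_solubility` and all the example masses are hypothetical, and the formula actually used in the group's spreadsheets is not given in the talk.

```python
def molar_solubility(m_empty_g, m_with_solution_g, m_after_evaporation_g,
                     solute_molar_mass_g_per_mol, solvent_density_g_per_ml):
    """Estimate solubility as moles of solute per litre of solvent from three weighings."""
    # The dried vial must weigh at least as much as the empty one and
    # no more than the vial that still contained the saturated solution.
    if not (m_empty_g <= m_after_evaporation_g <= m_with_solution_g):
        raise ValueError("expected empty <= dried <= filled vial masses")
    solute_g = m_after_evaporation_g - m_empty_g
    solvent_g = m_with_solution_g - m_after_evaporation_g
    if solvent_g == 0:
        raise ValueError("no solvent mass recorded")
    solvent_l = solvent_g / solvent_density_g_per_ml / 1000.0
    moles = solute_g / solute_molar_mass_g_per_mol
    return moles / solvent_l

# Hypothetical weighings for a hypothetical solute (158 g/mol) in methanol,
# taking the density of methanol as about 0.792 g/mL.
result = molar_solubility(10.000, 10.950, 10.158, 158.0, 0.792)
print(round(result, 3))
```

The ordering check is the code equivalent of coming to the calculation first when a number looks strange: a transposed weighing fails loudly instead of producing a plausible-looking value.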

Here, we're making pretty extensive use of Google's spreadsheets. It's a really great way to share any kind of calculated information like this, and you can do all kinds of calculations. It's totally free and hosted.

The other nice thing about the Google spreadsheets is it's not that dangerous to make them open for editing because you can see the history. If someone comes in and just completely deletes all of your values, it's a little bit annoying. It hasn't happened yet, but it's not a big deal. You can just basically go back to a previous version, and it will restore it.

This is nice because we can now make these spreadsheets editable to anybody, so someone can come in here and actually mark something up and write a comment. You can color code something. It becomes so much more flexible than having to give people permission to come in and modify a certain spreadsheet.

Now, as you've gathered, we use a wiki for the lab notebook itself. There are a couple of reasons for that. The main one is that the wiki gives you a page history. I can see every single version since the beginning of that experiment. I can see who made the contribution. By comparing two versions, I can see what each person added at each point.

We're using Wikispaces, and Wikispaces shows the stuff that's added in green and the stuff that's deleted in red. Oftentimes, I will write a comment and ask a student a question, and then they will address the question and remove my comment. That's exactly what happened here. The red stuff was my original question.
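The version comparison described here can be imitated with Python's standard `difflib` module. This is only a sketch of the general idea, not how Wikispaces implements it; the two notebook versions and the `summarize_changes` helper are made up for illustration.

```python
import difflib

# Two hypothetical versions of a notebook log: a supervisor's question
# is answered by the student and then removed.
old_version = [
    "Vortexed the vial.",
    "JC: how long was it vortexed?",
    "Evaporated the solvent.",
]
new_version = [
    "Vortexed the vial.",
    "Did not measure the amount of time vortexed.",
    "Evaporated the solvent.",
]

def summarize_changes(old_lines, new_lines):
    """Split a line-by-line diff into added lines and deleted lines."""
    added, deleted = [], []
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            added.append(line[2:])    # would be shown in green
        elif line.startswith("- "):
            deleted.append(line[2:])  # would be shown in red
    return added, deleted

added, deleted = summarize_changes(old_version, new_version)
print("added:", added)
print("deleted:", deleted)
```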

This is actually also a great tool for interacting with students, especially graduate students, because that's what they are supposed to learn: how to report on science and how to draw conclusions from their data set.

It doesn't end with just pictures. There are all kinds of different data formats. In chemistry, we are very big on spectra. That's how we prove that we made certain things. That's how we prove purity of things, and NMR is by far the most useful of those spectral techniques.

Here, we're making use of JSpecView and the JCAMP-DX format. Are any of you familiar with that, JCAMP? It's a very handy format. It's open. It supports any XY data, and you can convert a lot of the proprietary file formats from different instruments into JCAMP format.

Once you have it as a JCAMP format, you can put it up on the web in a way that a browser can interact with the data. You don't need for the person to download software to view it. It runs in Java. Basically, they click on a link. It pops up the spectrum. If you drag your mouse across, you can actually expand any peak that you want. Unfortunately here, because we're missing half a screen, you can't see that, but there's actually a very detailed peak. You cannot see the details in the original spectrum.
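Part of what makes JCAMP-DX convenient is that it is plain text: labelled records that start with `##`, followed by the data. The Python sketch below reads a deliberately simplified, uncompressed sample that uses an `##XYPOINTS=(XY..XY)` table. The sample data and the `parse_simple_jcamp` helper are assumptions for illustration; real instrument exports often use compressed `##XYDATA` blocks, which this sketch does not handle, and JSpecView's own parser is far more complete.

```python
# A tiny, made-up spectrum in a simplified JCAMP-DX style.
SAMPLE = """##TITLE=example spectrum
##JCAMP-DX=4.24
##DATA TYPE=NMR SPECTRUM
##XUNITS=HZ
##YUNITS=ARBITRARY UNITS
##NPOINTS=3
##XYPOINTS=(XY..XY)
1.0, 10.0
2.0, 250.0
3.0, 12.0
##END=
"""

def parse_simple_jcamp(text):
    """Return (headers, points) from an uncompressed XYPOINTS-style file."""
    headers = {}
    points = []
    in_points = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("##"):
            # Labelled data record: ##LABEL=value
            label, _, value = line[2:].partition("=")
            label = label.strip().upper()
            in_points = label == "XYPOINTS"
            if label == "END":
                break
            headers[label] = value.strip()
        elif in_points:
            # Data line inside the XYPOINTS table: "x, y"
            x_text, y_text = line.split(",")
            points.append((float(x_text), float(y_text)))
    return headers, points

headers, points = parse_simple_jcamp(SAMPLE)
tallest = max(points, key=lambda point: point[1])
print(headers["TITLE"], tallest)
```

Because the raw points are available, a reader can zoom in on or re-measure any peak, which is not possible with a spectrum embedded as an image in a PDF.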

This is a big deal because in a traditional publication you do supply supplementary materials, but it's generally in the format of a PDF. You cannot zoom into a PDF peak, and there's a lot of information there in terms of purities, in terms of all kinds of things that you would like to get at to figure out what happens.

Getting into indexing, again unfortunately, we are missing part of the screen here. I'll try to describe it as best I can. Over here, there is a list of compound names. So, we're looking at a list of solvents, like toluene, ethanol and vanillin that I'll be talking about. We can represent these different molecules using different things, like InChIs, InChIKeys and SMILES codes. Anybody here work with those? OK, one person.

Basically, these are the contemporary way of representing molecules using linear text. If you want to represent toluene, you could type toluene, but there are many names for it. Methylbenzene is another name. How can you represent that in a way that a machine, for example, could read it unambiguously?

So, there are SMILES, InChIs and InChIKeys. I would certainly be happy to talk to anybody about the details of those, but the InChIKey is always the same length no matter how big the molecule is. That's nice for indexing in Google.
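The fixed-length property is easy to see side by side. In the Python sketch below, the InChI strings grow with the molecule while the InChIKeys keep the same shape. The identifiers for ethanol and toluene are the commonly published values and are worth verifying against a database such as ChemSpider before relying on them; the regular expression describes the current 27-character standard InChIKey layout, which was standardized after this talk was given, and the variable names are illustrative.

```python
import re

# Current standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

identifiers = {
    "ethanol": {
        "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
        "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    },
    "toluene": {
        "inchi": "InChI=1S/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3",
        "inchikey": "YXFVVABEGXRONW-UHFFFAOYSA-N",
    },
}

for name, ids in identifiers.items():
    key_ok = bool(INCHIKEY_PATTERN.match(ids["inchikey"]))
    print(name, len(ids["inchi"]), len(ids["inchikey"]), key_ok)

inchi_lengths = {len(ids["inchi"]) for ids in identifiers.values()}
key_lengths = {len(ids["inchikey"]) for ids in identifiers.values()}
```

A short, fixed-length token with no slashes or parentheses behaves like an ordinary search term, which is why it indexes well in a general-purpose search engine.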

For example, if we click on this link which is vanillin, you can see that it pops up my lab's work, and it also pops up some paper where that particular compound shows up. These are starting to be used more and more. They have huge advantages in terms of compressing information and making sure that it absolutely corresponds to the molecule you want.

Audience Member: What are the InChls?
Jean-Claude: The InChl, you actually can't figure them out. You'd have to look them up, but there are web services so you can use Self. You can use different kinds of web services.
Audience Member: Self is not representative of the structures.
Audience Member: But InChI is representative of its structure, and its connectivity is represented here.
Jean-Claude: Actually, the big reason why the InChIKey started to be used - for small molecules the InChIs are fine, but when you have molecules that are medium sized the InChIs are so big that Google can't index them anymore, and that's a problem.
Audience Member: Now, we're taking [muffled voice].
Jean-Claude: These things are fairly recent. InChI is pretty recent. It's only in the past couple of years. InChIKeys are, maybe, one year old.
Audience Member: [muffled voice]
Jean-Claude: The problem with the CAS number is the copyright. That's the big problem.
Audience Member: What about abstracts?
Jean-Claude: Chemical Abstracts? People do use them, but if you're talking to people that are interested in indexing a lot of stuff without having to worry about the legal aspects, they tend to stay away from the CAS number. That doesn't mean you won't find them on Wikipedia. You'll find them in a bunch of places, so, yeah, you will definitely find them. But, the whole copyright issue is actually a big problem with that.
Audience Member: Since you're interrupting, let me say that as a reader I would much rather have trust than proof, that is, who do I trust? I'd like to trust, read like a proof. But, to have every reader have to go back and verify things from the beginning, it seems to be a very great burden.
Jean-Claude: The point is you only do it when you have to. The reason that I looked at that number is because it didn't make sense in the context of the other numbers. I didn't drill down to every other number.

Right now, you cannot do that from a publication. The peer review process does not cover that at all. So, that is the problem right now. My only option with a paper is to email the author and hope that they respond by sending me the information that I want, and that just doesn't happen very seamlessly.

Audience Member: Proof is needed by the reviewers, and that is proven by me in advance.
Jean-Claude: Yeah. So, basically, use trust for as long as you can get away with it, but when you are trying to repeat an experiment and you can't, you're kind of stuck with the whole trust issue.

I don't know if it's a post-doc who wrote that up. I don't know if it's a new student. Why not just give me the proof? I mean, it's really not that big of a deal. They already have all this information. It is just a question of making it open and having people access it.

Audience Member: Recent publications are devoted to doing this sort of thing, producing validated data over, let's say, a sequence of temperature, pressure, and this sort of thing. Is it that there are so many chemical compounds out there and so many different variables that even the publications that are devoted to that would work beyond this in the old days, I think. [crosstalk] just can't do it all.
Jean-Claude: I mean, there is so much to do that people haven't done them. Quite frankly people and companies probably have done these measurements and have no benefit in sharing them. So, there is a lot of that kind of thing as well.

But the level of detail that I am talking about, even if you look at it in this database, I don't think that you can access... Let's say that you challenged a number in this database. They're not going to send you the lab notebook pages where they got that information from. And that is what we are talking about. We are talking about transparency at the level of no insider information from the research group to the rest of the world.

So, there may be stuff that was added in the past hour by my students that is incorrect. And that's OK. We accept that as part of the process of working in the open. That is no different than any other place where we work in the open.

OK. So, one of the main databases that we do use is ChemSpider. In fact, the CEO is coming tomorrow, Tony Williams - two o'clock, Disque 109. It is a great opportunity. He will be talking about all this and he will be doing a demo if you want to interact with him. But, this is really a fantastic resource for manipulating organic chemicals.

What we do for the lab notebook... I don't want to be running software on my servers that will actually do substructure searching, that will be doing any kind of analysis like that. With ChemSpider I can actually farm that out. And this is free and hosted for you, for anyone to do.

You basically just link from the molecule to this and then it gives you all this information. It gives you the SMILES, the InChI, the InChIKey.

Audience Member: And the empirical formula, is what it gives.
Jean-Claude: OK. Empirical formula, yes. What we normally call it. But, that doesn't have enough information to...
Audience Member: [muffled voice]
Jean-Claude: Yes. Certainly that would be what we file under that.
Audience Member: Where is ChemSpider pulling this in from?
Jean-Claude: ChemSpider is pulling it from all over the place. They have links to the vendors. Down here you see experimental properties, melting point, and boiling point. They are actually links from ChemSpider to where they got them from. It could be an MSDS sheet. It could be any number of things.
Audience Member: Does this generate on the fly? Do they go out and get this stuff when you ask for it or has it harvested a lot?
Jean-Claude: ChemSpider has already harvested, like, over 21 million compounds.
Audience Member: So, you would call this a search engine for the invisible web, the chemical properties, which works like Google.
Jean-Claude: Yeah, that's a good way to look at it; a search engine for the chemical invisible web.
Audience Member: How does this compare to the Beilstein that was acquired from CrossFire?
Jean-Claude: Well, it's free first of all.
Audience Member: What's that?
Jean-Claude: And it's free.
Audience Member: And it's not in German!
Audience Member: How complete is it?
Jean-Claude: Yeah, Beilstein is certainly superior in terms of reactions, but this has actually started to compete with... If you're looking for properties like a boiling point, things like that, it has actually gotten pretty comparable, I think, in the past year. This is all new stuff here. In the next five years, this is going to change. It is going to be totally unrecognizable. Right now, as of today, this is the state of the art.
Audience Member: [muffled voice]
Jean-Claude: We still pay for it. Drexel still has Beilstein, SciFinder, and all those pay services, and they're still useful. As long as they still provide information that these sites don't, we're still going to keep using them and teaching them.
Audience Member: So, you're not against this empirical overview of the literature associated with ChemSpider and Beilstein?
Jean-Claude: Each database has its own things they provide. ChemSpider, for our purposes, is extremely useful because it has these key things. Like if we want to generate the InChIKey, for example, this is by far the easiest way to do so. Go on ChemSpider, put your compound in, and then you have the InChIKey, which you can copy and paste.

So, there is a whole bunch of things. You've got synonyms...

Audience Member: Is there sufficient meta data for a program, not a person, to go and find information, that it will point to some arbitrary substance and then take it and use it somewhere else?
Jean-Claude: Yeah, there are web services that you can hook up. Now, you can't get all the information because of licensing issues. Like some of these properties you can get them on one page, but you can't download 10,000 of them. So, there are those kinds of limitations. But, ChemSpider tries to be very good about providing everything they can in the form of a web service. So, that makes it very useful.
Audience Member: OK.
Jean-Claude: The other nice thing about ChemSpider is we can also upload the raw data, like the raw spectra, and they're also in JCAMP format. JSpecView looks at it. And as of, actually, yesterday we can deposit our solubility properties. So, Tony actually made a special parameter for us to put our solubility properties.

That is going to be very interesting as people... If you want to find the solubility in methanol you are going to be able to, over time.

OK. Let's see how much time I have here. This is until 1:30, right?

Audience Member: That's right.
Jean-Claude: OK.

So, this is something that I am very excited about, that also happened quite recently in the past few weeks. I told you about Open Notebook Science and I set up an Open Notebook Science Challenge, which is for people from around the world to actually do solubility measurements of certain kinds of organic compounds and report them to a central place. So, we actually have 127 measurements now. There are just a variety of solvents. There are different people that used similar techniques, but not exactly the same techniques.

We just recently got funding from Submeta for these Open Notebook Science Awards. And Drexel students are eligible for them. Actually, any student at a university in the States or in the UK is eligible for them. They are $500 apiece and there are ten of them over the next ten months.

This is kind of neat because we have judges that are chemists, that are either in academia or high up in industry, who will actually give feedback to students on their lab notebook. So, the student might put a report and a judge might come in and say, "You didn't provide enough information." or "This is wrong." or "Look at this." So, what we are trying to do is have a peer reviewed Open Notebook. That is something that hasn't been done and I think this is going to be very exciting.

It requires something of a challenge to get everyone motivated to actually do it. So, these are not big awards, but they're interesting enough that we have five students now contributing to this. It will be very interesting to see, in the next ten months, how this is going to play out.

This is what we have so far. Each one of these experiments can have 20 to 40 solubility measurements. That is not a big number, but, like I said, they cover about 127. And these all link to the lab notebooks.

You will see on the top here there is summary data. So, again, we don't expect people to click through every experiment to find what they're looking for. There should be easy ways of accessing that information. And again, Google Docs is a really good way to do that.

Here, I have validated solubilities. What does that mean? It means that I have decided that I don't see anything wrong with these data points at this time. Now that could change tomorrow, but I feel comfortable enough to put them up on here. And they link back - you see these links, experiment 208, 205 - these link back to the actual notebook pages. So, if you want to see how these numbers are generated you can.

Here, it just shows up as a nice look-up. You can see the solubility, and it's got the SMILES there, which is another way, it's like an InChI, of representing the solvent and the solute. And so this is a very large spreadsheet. And Google Docs can't handle very large volumes, but this will probably work for at least the first thousand entries that we have. That's roughly what we're looking at in the next few months.

Now, this is where it got really exciting in the past week - I don't know if you guys have heard of the Google Visualization API - anybody try to use that here? This is very cool stuff. So, Google Docs, again, it's free, it's hosted; and they actually have an API where you can query it and you can return results. So, if I search for vanillin, it gives me all the measurements of vanillin to date in all the different solvents.

Unfortunately, we're missing the key part here, which is the names of the solvents, because the screen isn't big enough! But, it turns out that some of these are actually farther apart than we might expect. On the right, you can see one of those examples. So, for methanol, we get values ranging from 2.8 to 4.2. So, this is the same experiment, run by different people, and we're getting values that are pretty far apart.

It actually tells us we have a mean of 3.2, and we have a standard deviation of 1.4. So, how do you interpret that? Well, if I were to only give you the mean and the standard deviation, you could probably use that for some purposes. But, maybe you want to look at that one that looks very big - which is actually one that I personally did. It looks like it's outside the rest of the group.

You might want to look at what's different about that experiment from the other ones. From the lab notebooks I'm starting to think that it has to do with how long they were vortexed - how long they were actually mixed. Because it may actually take a longer time than you think to reach a saturated solution. But, then again, we can only ask that question because we have access to the raw data.
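To make the point about the mean and standard deviation concrete, here is a minimal sketch of summarizing replicate measurements while keeping a pointer back to each notebook entry, so that the value standing apart from the group can be traced. The numbers and the labels `EXP-A` to `EXP-D` are hypothetical, chosen only to fall within the 2.8 to 4.2 range mentioned above; they are not the actual measurements. Flagging the value farthest from the median is one simple choice for illustration, not the method used in the project.

```python
from statistics import mean, median, stdev


def summarize(measurements):
    """Summarize (notebook_label, value) pairs for one solute/solvent pair.

    Returns the mean, the sample standard deviation, and the label of the
    entry whose value lies farthest from the median, so a reader knows
    which notebook page to open first.
    """
    values = [value for _, value in measurements]
    centre = median(values)
    label, _ = max(measurements, key=lambda pair: abs(pair[1] - centre))
    return {"mean": mean(values), "stdev": stdev(values), "check_first": label}


# Hypothetical replicate values, each tagged with a made-up notebook label.
replicates = [("EXP-A", 2.8), ("EXP-B", 2.9), ("EXP-C", 3.0), ("EXP-D", 4.2)]

summary = summarize(replicates)
print(summary)
```

The summary alone hides which measurement is responsible for the spread; carrying the label along is what lets someone go back and compare, say, the mixing times.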

Audience Member: ... sciences, in the field of chemistry, a kind of passive knowledge of what actually happens at the lab bench. I am thinking about medical experiments [inaudible] biology - DNA extraction, these sorts of things. There is something that you couldn't learn even from the best value. You can get the rest of people's take to the shelf over the sink - you can share that.

But still, the only way to learn how to do these things or to do them correctly was to go to somebody in the lab and actually physically do the experiment. This is something that is still kind of missing, it seems to me, and could be even exacerbated in this reliance on 'he's done it' and you get a lot to think of that.

Jean-Claude: Well, this is stuff you have to record anyway. We're just making it public.
Audience Member: Right.
Jean-Claude: We're not really changing the workflow.
Audience Member: Right. But, there is a question of truth, and replicability. At some point, you also have the human.
Jean-Claude: No. That's the whole point. If you have a lab notebook and links to all the raw data, you don't need the human.
Audience Member: So, you know exactly what the pH of the water coming out of that faucet is?
Jean-Claude: If we measured it. Then again, there are some chemists who are better than others, and that's the point of this too is to show ways of recording science that's better. And if they didn't record it, you'd want to know that they didn't record it.
Audience Member: At what point are the things not rationally recordable? I'll give you an example from clinic indexing. A kind of customer, a former colleague of mine, Paddy Goodman; well, someone wanted a photograph of the Swedish ivy on the mantelpiece in the Oval Office. Now, photographs of the wall, these are indexed. What do you record in an image that is retrieval-worthy for some purpose? What do you record in a laboratory experiment that somebody is going to be concerned about down the road? Because you cannot.

I would argue, 'record everything.' So what is being recordable and what you wish - 'Damn, I wish I'd recorded that!' - later on.

Jean-Claude: And that's happened; and that's the conclusion of some experiments. 'We should have recorded this and guess what, next time we do.'
Audience Member: So, how does that information get shared out so that people know this?
Jean-Claude: That's what this is for.
Audience Member: Thanks. So, this is how you do it?
Jean-Claude: There is no other mechanism that exists right now to enable that sharing except if you happen to work closely with someone. Yeah, this is really the point of Open Notebook Science.
Audience Member: But isn't your point here, that assuming all four of those measures in your notebook, then yours would surely spend more of the time on mixing, and that would be different.

Or, yours would show you measured the mixing time, and theirs would show they didn't.

Jean-Claude: That's in fact the case, yeah. With some of those they did not measure the mixing time. So, now that brings the question...
Audience Member: But, at least it's a clue...
Jean-Claude: It's a clue.
Audience Member: ...and, of course, it maybe a red herring because that may not be what affected it.
Jean-Claude: Exactly, but at least we could design that experiment. Yes. Yeah, there is not one way to do this measurement; you can evaporate, you can also use UV; there's a bunch of different tools you can use. And the whole point of this is that you don't want people to have to go through each thing; you don't want to have to Google stuff to find it. You want to have this interface. Or you can access this now, and you can add your own intelligence to this. You don't need to know the chemistry to use this. But, you may find an anomaly to give to the chemist, and then the chemist can look into what's causing this anomaly.

But, yeah, all these numbers should be very close to each other and they're not, so there is something... Somebody is not doing it right.

So, down here, you see there is a link. This number - that first one - it's experiment 207, sample number three. You click on that, it takes you to the lab notebook, and you will see that, in fact, that is my experiment, which I did with my collaborator from Southampton.

Audience Member: I think you would have done...
Jean-Claude: That's the point. That everything - yeah. That you maintain the chain of where the data came from at all points in time.
Audience Member: It is a broad...
Jean-Claude: Yes.
Audience Member: ... what we did then, holding them.
Jean-Claude: I mean, it's very surprising in chemistry that the whole concept of provenance is really not widespread. You can open up a book of melting points, and it doesn't usually tell you where they got those numbers from; but it's from a trusted source. I don't know if that meets...
Audience Member: ...physics and references in the last part of the...
Jean-Claude: Well, they may reference the actual papers but I can tell you when you read those papers that there's often not a lot of information. But, at least here, you can see how much information was recorded.
Audience Member: Lot of detail with it.
Jean-Claude: Well, in fact, patents are probably the reason that chemists are so hyper about keeping a lot of notebooks. Because those are the legal documents that you use in a court case when you want to prove that you were first. And now, we're trying to do the same thing, using them openly in real time, for another purpose that is not legal. Things are actually very hard to fake when you have to provide all the raw data. Things are easy to fake if all you have to do is put a number in a table. Oh, that was my yield!

It's very difficult to fake all the raw spectra, to fake all the weights, to fake all that stuff would be so difficult that you would get caught.

Audience Member: [muffled voice]
Jean-Claude: It's actually easier to just be truthful than to try to figure it out!
Audience Member: This is an excellent tool for researchers in, say, academia. Is there any thoughts about how this might integrate into the workflow of say, a pharmaceutical company or an optical...
Jean-Claude: Well, this is certainly available to the pharmaceutical companies. If they wanted to find out what solvent to do a reaction in, they could certainly find that information on our site. So, going forward we're going to have more and more values, and that's something that I hope will be copied as well, but, yeah, absolutely. And there aren't any intellectual property issues; everyone comes on board collaborating, making things open, so that simplifies things a lot.
Audience Member: It means that the private companies will know this is a resource, but probably they are not going to participate.
Jean-Claude: That's what they're doing now.
Audience Member: It's different now. They might do a version of this internally for their own research, and would that become a standard tool that they can use live?
Jean-Claude: I mean, that's actually what they're doing with ChemSpider, because they don't want people to see what they're searching - they can get a copy of ChemSpider for a fee.

So, there are basically different business models emerging from this. I'm coming at it from the standpoint of we want good data, we want to report good data, and we want to publish the stuff. That's what we're doing, but, absolutely, it can get pretty tricky.

Anybody here working on RDF? Maybe we should talk later because I want to make sure that I do cover this thing with the robots. I have very little time.

We were talking about workflows before even if we didn't use that term: protocols, workflows. We have actually been converting what we wrote to be human readable into a very standard format that machines can read.

Basically, it says here common name, methanol, then InChIKey. It gives you the InChIKey, and then it gives you the volume in milliliters. This can be scraped pretty easily and can be put into a database, for example. Is anybody working with people at MyExperiment? They're basically very heavy in the bioinformatics area.
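As a rough illustration of why such a standardized line is easy to scrape, here is a minimal sketch that turns one protocol step into a record ready for a database. The exact notation used in the notebook is not shown in the talk, so the semicolon-separated layout, the field labels and the function name `parse_step` are all assumptions made for this example; the methanol key is the commonly published one and should be verified before use.

```python
def parse_step(line: str) -> dict:
    """Parse one protocol step written as 'label: value; label: value; ...'.

    The layout and field labels are hypothetical stand-ins for whatever
    standard format the notebook actually uses.
    """
    fields = {}
    for part in line.split(";"):
        # Split each 'label: value' pair at the first colon only.
        label, _, value = part.partition(":")
        fields[label.strip()] = value.strip()
    return {
        "common_name": fields["common name"],
        "inchikey": fields["InChIKey"],
        "volume_ml": float(fields["volume (mL)"]),
    }


# Hypothetical step: add 2.0 mL of methanol.
step = parse_step(
    "common name: methanol; InChIKey: OKKJLVBELUTLKV-UHFFFAOYSA-N; volume (mL): 2.0"
)
print(step)
```

Because the compound is identified by its key rather than by a free-text name, records parsed from different notebooks can be joined without worrying about synonyms.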

What they call an experiment really means taking information, submitting it to a BLAST search or something like that, and then getting information back. Some of these can actually be pretty tricky with lots of web services being called, so there's a place in MyExperiment where you can upload those.

In the past two weeks, they've actually opened up what they take as a workflow to include what I have on the right there, which are physical transformations or physical workflows, not just converting information. So, this is potentially very exciting. MyExperiment people are very heavily invested in this, so I think that there's definitely a future.

Audience Member: Is there actually a workflow notation they do on the...?
Jean-Claude: There is a specific notation for their bioinformatics queries, but right now, I think they just decided, "Look, just open it up for people with physical processes." I guess, they're going to look later at standardizing those. No, right now, it's just a page.

Chemical Markup Language: anybody here involved a little bit? That's something else we haven't discussed.

There are always these different ways of making things machine-readable. I would like to talk a little bit about the use for it. The reason I got involved in solubilities is that we were trying to make compounds as anti-malarial agents. If we can predict the solubilities, we can actually figure out how to get good yields off of our target compounds. It just hadn't been done, so that's why we're doing it.

In parallel with recording the values, we are working with Rajarshi Guha at Indiana University who's actually doing modeling to predict the solubilities. So, in the coming months it will be very interesting to see what the predictions are versus what we've observed. Let me skip through here.

Rajarshi is doing docking. Is anybody doing docking? You basically have an enzyme and a small molecule, and you try to dock the molecule into the enzyme to try to inhibit it. Here again we're using Google Docs to report on the results of those docking runs, as a very simple way of sharing information.

Just to tell you about the robots a little bit, this is the Ugi reaction that we're actually doing to make these anti-malarial compounds. We wanted to see if we could optimize this by using robots. So, Rajarshi lent us their MiniMapper system, which is basically just a syringe on an arm. It can go and pick up some liquid and deliver it at different positions.

We were able to do 48 reactions in parallel. These are little filter tubes about this big. The idea is that you have the robot add the four solutions, changing the parameter slightly, and then it precipitates. We filter and weigh it, and the weight is the yield of the reaction. The nice thing as was mentioned earlier about a robot is the robot actually spits out its own log of what it did.

That can be very useful from the standpoint of figuring out what it thinks it did, which is not necessarily what it did. This actually can be more problematic than having a human do it because you're doing a lot of reactions in parallel. If you have a systematic error, you sometimes don't know why you're having a problem. We had that problem.

The machine wasn't programmed carefully, and what happened was we didn't realize it wasn't changing the solvents completely. We were getting numbers definitely, but as a chemist I was looking at the numbers and saying, "Something is wrong." It took a while actually. We did probably over 1000 experiments before we knew what the problem was.
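One way to use a robot's log for this kind of troubleshooting is to compare what was intended with what the robot reports having done. The sketch below does that for solvents per reaction tube. The tube labels, the dictionary layout and the function name `find_mismatches` are hypothetical, since the talk does not show the actual log format, and a check like this only catches errors that are visible in the log itself, not cases where the log and the physical action disagree.

```python
def find_mismatches(planned, logged):
    """Compare the intended solvent per tube with what the robot logged.

    Both arguments map a tube label to a solvent name. Returns a list of
    (tube, planned_solvent, logged_solvent) for every tube that differs;
    a tube missing from the log is reported with None.
    """
    mismatches = []
    for tube, solvent in planned.items():
        actual = logged.get(tube)
        if actual != solvent:
            mismatches.append((tube, solvent, actual))
    return mismatches


# Hypothetical plan and log for three tubes.
planned = {"A1": "methanol", "A2": "ethanol", "A3": "THF"}
logged = {"A1": "methanol", "A2": "methanol", "A3": "THF"}

for tube, wanted, got in find_mismatches(planned, logged):
    print(f"{tube}: planned {wanted}, log says {got}")
```

With many reactions running in parallel, an automated comparison like this can surface a systematic programming error long before it shows up as a puzzling pattern in the yields.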

Eventually, we did do it. We recorded, so this is the calculation part of that. We wrote this up on a Wiki. Now, we're talking about a document that its intended audience, or its intended vehicle, is actually a standard journal. We wanted to write this to see, first of all, if a publisher would take it. Will he take a paper that was written in the open? The answer was yes. The publishers will definitely do it.

We submitted this, and the nice thing about this is that we can actually link to individual lab notebook pages from the journal article. Down here, it says the melting point was taken from experiment 99, and the proton NMR was taken from experiment 203. Right now, there's not a mechanism in chemistry that will actually do that because the lab notebooks are not made public or even available.

But here, because we do have those documents available, we can point to them. That means that basically that was a different batch than that one, and in a traditional article you don't make that distinction. You trust people that all of their stuff is the same quality.

If you're interested in automatic markup of documents, tomorrow Tony Williams will be talking about this in the ChemSpider talk. This is basically software that goes through a document and it figures out the chemical, and then you can see that it actually knows how to draw it. This kind of markup is actually very, very interesting going forward.

This paper, which is still under peer review, should appear in the Journal of Visualized Experiments where they actually send a team of people to do a camera recording of our experiments as well. So, there's the written, there's the video, and this will be the first example for us of having a peer-reviewed article linking back to the lab notebooks.

Another handy thing you can use these days is Nature Precedings. You may be familiar with arXiv, which handles a lot of the physics preprints. Nature Precedings handles a broader range. It's nice because it does have the editorial approval of Nature, a nice DOI, and a nice author list that you can use.

While my article is pending under peer-review at JoVE, I can actually put it up on Precedings. I can talk about it, I can give people a link, and they can download it. This is actually a very handy tool and very complementary to the peer-review process.

Audience Member: Publishers don't care....
Jean-Claude: No, many publishers don't care. What they don't want you to do is to take the final document with all of the editing that they did on a PDF copy and make that available. That's more the issue. They really don't have a problem with text. That's not the case for ACS, but it is the case for many others.

Some people are asking about Second Life. We don't have enough time, but we can do a lot of the same things there, looking at spectra and molecules. We have an eCrystals repository where people can submit the 3D structures of their molecules. That's another way we can make things open. We can report on the activities of the various compounds.

The final outcome in terms of the malaria project is that we actually have nine compounds which show activity against the enzyme, and we have four that show activity against infection of the red blood cells by the malaria parasite. This is neat because this has not been written up yet, but the information is available to anyone who actually wants to use it.

There are other people who are doing Open Notebook Science. Gus Rosania up in Michigan also works on drug transport, and all of his students are using the same Wiki approach. Cameron Neylon from the University of Southampton is also doing Open Notebook Science, but he's not using the Wiki. He's using a modified blog engine to do much the same thing as we are.

I'd just like to thank my students. I think I've run out of time, right?

