ACS UsefulChem talk (podcast)
Man 1: Are you guys hearing?
All right. I'd like to talk to you today about the UsefulChem project, and this basically involves open source chemical research using blogs and wikis.
So from a larger perspective, just to try to give you an idea of why we're trying to do this research and where it comes from, you take a look at scientific research today. It's mainly human-to-human interaction, human-to-human communication. That's the way it's been since I started.
Well, we're entering a very interesting period right now where humans are starting to collaborate with machines. You saw that in this symposium, using machines to try to process information and to try to contribute to scientific information. I don't think we are actually there yet, that's why the arrow is right here. I think that we are moving towards that where machines and blogs will actually be real collaborators and contributors. Eventually I do see that humans are going to be the bottleneck in the scientific process and are going to be out of it, and it's going to be machines and machine interaction. So that's where we're headed with this, and let's see what it is that we can do today in that direction.
There are people who are definitely trying to make this happen from different fronts. Here's the robot scientist -- this is Ross D. King's project. His robot, apparently, can formulate hypotheses and can execute experiments and can analyze them. Now, the thing is that he is doing this on the yeast genome, so his equipment is going to be pretty limited. I mean basically, he is looking for expression, and so all he's doing is growing yeast. You can't really do the same thing for chemistry, because the tools that we use are so varied compared to the biological world. But I think this is a good example where things are headed.
How do I think this is going to happen? Well, I think that if we have self-organizing redundant processes... in other words, instead of having these top-down approaches where you design a large system, and then you have that system write stuff to a database and then you understand exactly how that's getting delivered, I think that what we're observing today... you know, you saw a lot of people use different standards, for example, to represent molecules. That's just the way that people are. They are going to do things a certain way, and it's going to be very difficult to get everybody to do things the same way. So I think that's a good thing; I think that these processes will, in fact, self-organize, and that's actually beneficial.
One of the things that's happening right now is that it's possible to participate with zero or near-zero cost. Most of the tools that I'll be using are free and hosted services. This is the kind of thing that anyone can actually do to contribute to science. Also, we're entering a world of fully open access, both read and write. You've heard about open access. Typically the way the term is used today is that it's free to read articles, but a lot of these journals actually charge significant amounts of money to the author. So it's not open access from the standpoint of the author.
But there are -- for example, the Beilstein Journal of Organic Chemistry, ArticBot, those are fully open to read and to write, and of course blogs and wikis, which you'll see why it is that we're using it.
The kind of research that I'll be talking to you about is open-source science, and by that I mean that everything is exposed. We expose the raw data, we expose the thinking behind the experiments that we do, and everything. I'll be giving you very small milestones through this process, but if you're interested, this is all recorded and you can certainly go through it.
OK, so if we want machines eventually to do research, how will they know what to do? I think they need to ask the humans. And the way that we do that is by looking at what humans are saying in their papers as to what is important. So this part started about a year ago and it's just a very simplistic approach of looking at these search terms, such as "what is missing is," "what is needed now," and looking in journals to find out what people are saying that is really important to do. In 2005, a couple of things came up. There's a pressing need for the identification of newer, more effective dyes. In our region, some fuel cells, and then this one: there's a pressing need for identifying and developing new drug-based anti-viral therapies. So these were being collected, and as this was happening, I came across this site called FindADrug. This is a not-for-profit organization that has screened hundreds of millions of molecules against targets for various diseases. And one of the targets that they actually did was the enoyl reductase enzyme for the malaria parasite. I contacted them, and they sent me a library of about 220 molecules that they predict should be active against that enzyme. So that gave us a place to start, to actually try to do this open-source research.
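A minimal sketch of the phrase search described above, written in Python. It is an illustration, not the project's actual code. The first two phrases are the search terms quoted in the talk. "pressing need" is taken from the quoted results and is only assumed here to be a useful search term. The function name and the sample text are hypothetical.

```python
import re

# First two phrases are quoted in the talk; the third is an assumption
# based on the wording of the results that were read out.
NEED_PHRASES = ["what is missing is", "what is needed now", "pressing need"]

def find_need_statements(text, phrases=NEED_PHRASES):
    """Return the sentences in `text` that contain any phrase (case-insensitive)."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        if any(phrase in lowered for phrase in phrases):
            hits.append(sentence)
    return hits

# Made-up sample text standing in for a journal article.
sample = ("Many dyes are known. There is a pressing need for the identification "
          "of newer, more effective dyes. What is needed now is a cheaper route.")
for statement in find_need_statements(sample):
    print(statement)
```

A real collection step would run this over article text retrieved from journals, which is not shown here.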
The molecules were basically all diketopiperazines in this library. So they're all the same except they have different groups at R1, R2, and R3, and the first synthesis that I proposed was a solid-support synthesis. You end up getting diketopiperazines often when you do peptide synthesis, and so that was one of the approaches. So I was writing that to the blog, and you can see the evolution of my thinking on this. But eventually I stumbled across this Ugi reaction, cyclization, which is one-pot. You basically bring an aldehyde, an amine, a carboxylic acid, and an isonitrile together, mix them up at room temperature and then add acid, and you get these diketopiperazines. So that's a pretty efficient synthesis, and that's the one that we selected.
To do this research on wikis and blogs, I will show you the various components and how that actually evolved over time. The first thing that we set up was this molecules blog, where we have the SMILES code and then we have the picture. As students are contributing to this project, if they find information that is relevant, the thing we care most about is commercial availability and how much it costs. So that's the kind of thing that students might put in here. And if we get any hard data we'll put this in, and basically this is in a blog, you can subscribe to an RSS feed and you can find out when new things got added or changed. We created another blog, which is the experiments blog, and this one -- you see these links, these actually link back to the molecules blog. So basically, if you're reading this information and you want to find out more about adrenalin, you click on it, and it'll take you, you'll see the NMR spectra, you'll see everything that you need to be able to understand what we did.
Now this is interesting, because this is done completely in the open, and blogs are pretty well indexed by Google and various search engines. So we started to get comments -- for example, Matt Todd, from the University of Sydney, made a comment on our experiment three saying that we should probably be using a higher concentration. That's exactly the kind of thing that we want to encourage. This experiment was not even finished yet and we were already getting feedback.
Now, as you start to accumulate these experiments, you find that a blog isn't very good for organizing, so we created a wiki, the Useful Chem Wiki, and here, basically, all these links, these are linking to the various blog posts. But it's a place where someone who wants to be briefed very quickly can just read through that and can click. So we use the wiki for the organizational aspects of it and we use the blog because it has a nice feed.
A nice thing about this is that the vast majority of our experiments are failures, which you normally don't get to see. But in this case, from the wiki, if you go on this page you will see the actual story of the failures. You'll see why it is that we failed - there was some information in a peer-reviewed journal that happened to be incorrect. We were following that, and that whole thing is described. That is something that you would not normally get from a traditional journal article. Now eventually we found that even trying to maintain the blog was a little bit difficult because the students were doing their experiments and constantly updating it. You couldn't really tell how this blog post ended up the way it was.
So we started to use a wiki to actually store the experiments as well. It looks kind of similar to the blog entry except that you can do this: you can look at the experiment's history. So for this experiment 25 I can see exactly when and who modified it. I can click on any of these versions to find out exactly what happened. I can see a comment was made and then it was responded to, or a spectrum was put up.
So here's an example. In the red here, this is what was deleted, and in the green, this is what was added. So I had put in a query about the millimoles and the percent yield, because the student had not calculated them and I wanted them to do it. And then Colleen, my graduate student, came in and actually put in the values. So that's how it works. And you can actually tell when and how all this was done.
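The red and green view of a page history can be illustrated with a short Python sketch using the standard `difflib` module. This is not how the wiki itself computes its history view. The function name and the two example revisions (including the numbers) are hypothetical.

```python
import difflib

def summarize_revision(old_text, new_text):
    """Return (deleted, added) word lists between two revisions of a page."""
    old_words = old_text.split()
    new_words = new_text.split()
    deleted, added = [], []
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" means words were both removed and inserted at this spot.
        if tag in ("delete", "replace"):
            deleted.extend(old_words[i1:i2])   # shown in red on the wiki
        if tag in ("insert", "replace"):
            added.extend(new_words[j1:j2])     # shown in green on the wiki
    return deleted, added

# Made-up revisions: placeholders later replaced by calculated values.
old = "Yield: ?? mmol (?? %)"
new = "Yield: 1.2 mmol (45 %)"
deleted, added = summarize_revision(old, new)
print("deleted:", deleted)
print("added:", added)
```

Pairing each such comparison with the editor's name and the revision time gives the "who and when" record described in the talk.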
Now the really nice thing about using a wiki for your experiments is that you have an unambiguous, third-party timestamp as to when the experiment was done. If we ever go down the road and somebody claims that they did this experiment, we can go back and see what was the timestamp, and we can actually make a reference to that specific version. So not just the wiki entry, but also the actual version. I think that's going to become very important in terms of precedents.
You can also monitor the entire wiki. So you can see these are all the various experiments, different times, and who made the last entry. Now the other thing that I've put on my sites -- and everything I've shown you so far is 100% free and hosted, so it requires absolutely nothing to set this up. SiteMeter is also free and hosted. They will actually tell you how people are finding your site, so what keywords they're using. For example, somebody searched on Chmoogle, which is the former name of eMolecules, and they ended up on the site. There are also blogs that link to our wiki, so there are all kinds of ways that people are finding it. We also, for our molecules, put the InChI code, and so if you search Google using the InChI, you will in fact find our entries there.
So there's this whole other automation aspect here that I think this particular group might find interesting. You remember the blog that has all of the entries -- one post is one particular molecule. Well, the minimum amount of information for a post is the SMILES code, simply because it's the most useful string to have if you are searching databases. What this code actually does is read the blog once a day, pick out the SMILES code for each molecule, and return the InChI and the molecular weight. It will also hit eMolecules (this is the Chmoogle URL), and it will tell you if there are commercial sources for that molecule. So this is useful because that's stuff that my students would have to do; as we do more and more automation, we're not going to have to do as much work to try to figure out stuff. And we also get Jmol connected here.
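The first step of that daily job, pulling the SMILES string out of each post, might look like the Python sketch below. It assumes the blog is available as an RSS 2.0 feed and that each post carries a line of the form "SMILES: ..." in its description. Both the post layout and the sample feed are assumptions made for illustration. The InChI, molecular weight and eMolecules lookups are not shown.

```python
import re
import xml.etree.ElementTree as ET

# Assumed post convention: a "SMILES: <string>" line in the description.
SMILES_LINE = re.compile(r"SMILES:\s*(\S+)")

def extract_smiles(rss_xml):
    """Map each post title to the SMILES string found in its description."""
    root = ET.fromstring(rss_xml)
    molecules = {}
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        description = item.findtext("description", default="")
        match = SMILES_LINE.search(description)
        if match:  # posts without a SMILES line are skipped
            molecules[title] = match.group(1)
    return molecules

# Made-up feed with one molecule post and one post without a structure.
sample_feed = """<rss version="2.0"><channel>
<title>Molecules (example)</title>
<item><title>benzaldehyde</title><description>SMILES: O=Cc1ccccc1</description></item>
<item><title>notes</title><description>No structure in this post.</description></item>
</channel></rss>"""

print(extract_smiles(sample_feed))
```

Each extracted string could then be passed to a cheminformatics toolkit for the InChI and molecular weight.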
Now, in addition to that series of web pages with the dropdown, we also have a CML RSS feed that is viewable in Bioclipse. You've seen Bioclipse a couple of times in this workshop, and this will be very useful in the future.
You can also read the CML RSS feed with a regular reader like Bloglines. It will ignore the CML part, but it will in fact display your molecule and your InChIs and all that stuff. And these readers are already built to tell you about new stuff, which as far as I know Bioclipse isn't yet capable of doing.
OK, let's skip through a couple of things here.
As far as I'm concerned, it's a really exciting world that we're entering. There is not a top-down approach here. This is a system that evolved over time, and as we needed things, we added to it, and we were just transparent about what we were doing in our lab. What's been interesting about this is that we've interacted with other groups that are out there that are also interested in doing open-source science. The Synaptic Leap is another organization that wants to coordinate open-source science for tropical diseases mainly, so malaria definitely falls under their interest as well. ChemRefer is a very small operation that looks at our RSS feed, and if they find articles that can help us, they send them to us. And they have, in fact, helped us at one key point in the design of our synthesis.
The last connection here is that doing all this out in the open also enables other people to collaborate with us. Beth Ritter-Guth is I believe at Lehigh Carbon Community College, and she has a couple of classes where her students, who are taking technical writing and English, are actually looking at our blog and trying to make sense out of it to explain to people who are not chemists. So what they do is they interview me, they interview my students, it's sort of a dialogue that happens to try to explain what it is that we do. One of the projects they have early in the term is to define open-source science. So if you have any interest in that, maybe take a look at their blogs. It's open and you can contribute and give them feedback.
So, the next steps. What we'd really like to see in the short term is, we are submitting our molecules to eMolecules, and that'll be useful because it'll be on their site and we'll be able to do substructure searching without having to host any software ourselves. Right now, you can't do a substructure search on our site; you have to know the exact molecule that you're looking for. And it will also make it easier for people to find our molecules.
We're also developing custom CML RSS feeds. So if you wanted it to only alert you when there are new commercial sources for these compounds, we can set that up and people can subscribe to it.
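One way such an alert could be decided is by comparing the vendor list for each molecule between two runs of the daily job. The Python sketch below is a hypothetical illustration of that comparison only. The function name, the data layout and the vendor names are all assumptions, and writing the alerts out as a feed is not shown.

```python
def new_commercial_sources(previous, current):
    """Return {smiles: sorted vendors} for vendors seen now but not before."""
    alerts = {}
    for smiles, vendors in current.items():
        fresh = set(vendors) - set(previous.get(smiles, ()))
        if fresh:
            alerts[smiles] = sorted(fresh)
    return alerts

# Made-up snapshots from two consecutive runs.
yesterday = {"O=Cc1ccccc1": ["Vendor A"]}
today = {"O=Cc1ccccc1": ["Vendor A", "Vendor B"], "CCO": ["Vendor C"]}

print(new_commercial_sources(yesterday, today))
```

Only the molecules with at least one new vendor appear in the result, so a subscriber would see nothing on days when no sources change.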
We want to get our spectra in the JCAMP format. Right now our spectra are just printouts from the machine -- like you get an NMR and it's just scanned in. It's really annoying, because of course if you want to look at the J constants, you've got to ask the student, and if they haven't done that expansion, they're going to have to retake the spectrum. So with JCAMP, you should be able to expand the spectra and even reintegrate if it wasn't done properly. So that's something we definitely want to get going.
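To illustrate why numeric spectra matter, here is a small Python sketch of re-integrating a region once the data points are available as numbers. It assumes the (x, y) points have already been read from a JCAMP file, which is not shown. The function name and the data are hypothetical, and a simple trapezoidal rule stands in for whatever integration the spectrometer software uses.

```python
def integrate_region(points, low, high):
    """Trapezoidal area under (x, y) points with low <= x <= high."""
    # Sort by x so the routine works whether the axis runs up or down;
    # NMR chemical-shift axes are usually listed from high to low.
    selected = sorted((x, y) for x, y in points if low <= x <= high)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(selected, selected[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area

# Made-up data points (ppm, intensity), listed high to low.
spectrum = [(3.0, 0.0), (2.0, 1.0), (1.0, 1.0), (0.0, 0.0)]

print(integrate_region(spectrum, 1.0, 2.0))
print(integrate_region(spectrum, 0.0, 3.0))
```

With a scanned printout none of this is possible, because the underlying numbers are not available.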
We would also like to extend our collaboration with other chemists, especially people who are interested in doing docking. For example, we have intermediates that look like they may in fact fit into the enoyl reductase pocket, but we don't do docking. We haven't done it, and there are people who do that all the time. So it would be interesting for them to take our feed and to tell us, hey, this one is actually worth making. Some of them are easier to make than the final products that we're trying to make. But of course what we really want is to get our antimalarials so that we can get them tested.
We have one person in the medical school at Drexel who will do simple in vitro testing of our compounds against red blood cells to see if they inhibit malaria. And again, we want to do all that out in the open, not just the final results but the whole methodology as to how we did it and what it means. So that's basically it.
Moderator: Any questions?
Man 2: [inaudible question]
Man 1: The question was about the reaction database. Actually I was talking to Peter about putting this in CML reaction format so that it has the actual reaction in CML form. I'm a big believer in redundancy. Instead of picking one best way to do it, right now we're using the wiki because it works, but absolutely if there's a database where we can put our stuff we will definitely want to do that. It doesn't stop us from doing anything else. So if you have something specific in mind, definitely let's talk.
Man 2: [inaudible]
Man 1: Let's talk, definitely. You have a question?
Man 3: [inaudible question]
Man 1: The question about the funding agencies. Well, we're looking for funding. This project started from scratch a year ago and so now I think we have some really good data to show that this is feasible, this is doable. NIH is very interested in having people publish Open Access and Open Source. It's really a question of getting the right team of people together - I think if we really do want to go for NIH, we're going to have to find someone who seriously wants to get involved in the testing, and if we do, then I think that makes a lot of sense.
Man 4: Are all the participants located in your lab or are there people from other places?
Man 1: The question is about where people are located. Actually from my acknowledgement slide here, Khalid is my graduate student, he's in my lab; Jane Giamarco is an undergrad; Lynn Jamieson is another undergrad; Dave Strumpels is the guy who did all the cheminformatics stuff and he's also in my lab. But then we have this loose association of other people like others and Peter. We've been in contact through our blogs and there are a lot of people like that who are contributing. Especially when it comes to docking stuff, because that's something where you don't need to physically give people materials. I would say right now the physical experiments are mainly in our lab, and we'd like to change that, and again it's not necessarily to get people to post stuff in our wiki. This is a model that is zero cost. It can be replicated by anyone in the world for free. We're using WikiBases, which is free and hosted and is actually owned by Google. We're also using Blogger, which is also owned by Google, so these are very, very stable platforms. Actually what I'd like to see is somebody saying, hey I'd like to do this kind of thing to actually replicate it so that they can go off and do it.
Man 1: OK, thanks very much.
Transcription by CastingWords