Thursday, December 20, 2007

Cameron Neylon's Talk on Open Notebook Science

Watch or listen to recorded talk here

Jean‑Claude Bradley: Well, I'm very pleased today to have a special guest from overseas, Cameron Neylon. He received his bachelor's at the University of Western Australia and a PhD in protein chemistry at the Australian National University's Research School of Chemistry, doing some work on ribosomes, and then he did a post‑doc at the University of Bath on a Wellcome Trust Fellowship. In 2001, he became a lecturer in the Department of Chemistry at the University of Southampton, working on protein chemistry. He has a joint appointment at the Rutherford Appleton Laboratory's Neutron Scattering Facility, and he's going to tell us about that.
What makes his visit very special is that Cameron is working on Open Notebook Science, which is something that my lab is also working on. It's really a pleasure to have people come and share their experiences and to talk about how we can do science in that way. So Cameron, please.
Cameron Neylon: Thanks, Jean‑Claude, for the introduction. It's a great pleasure to come here and talk about what we've been doing in the area of Open Notebook Science, Jean‑Claude being the person who really kicked this off as an idea, and who was the inspiration for what we're doing to make our work open as well. I titled the talk, slightly tongue in cheek, 'Beginner's Guide to Open Science', not because I'm trying to downplay what we're doing as unimportant, but because I think we are beginners when it comes to figuring out how to do these things.
I'm going to try and make the talk quite interactive. I'm going to show you a bunch of things that are live on the web ‑ that's the whole point of what we do in this area. Some of these things may not work and they may break, but that goes with the territory. If you've got any queries along the way, please ask questions. I'll be quite happy to discuss things as we go along.
So where I started in this area was with a problem. I have a group, mainly made up of graduate students, who work in a lab and generate data. That data is collected in lab books and computers. Some of these computers are ancient and are running Windows 3.1 because we can't get the cards to run the instruments.
A lot of people outside chemistry and biochemistry possibly still think of this as the way laboratories look. And while in many cases the infrastructure and the benches may be the same as when these pictures were taken, that's not what a modern laboratory actually looks like. This is actually a picture of McGill Biochemistry Department in 1921. This is my laboratory, incidentally, from pictures ripped off the web in the usual fashion.
The modern laboratory is full of instruments and lots of computers. Some of these are legacy computers, and some of them are up to date. But there are a lot of separate computers, and that's the core of our problem.
We have our data spread out over too many different computers, some of which are networked and some of which are not. Many we're not allowed to have networked because, again, we're running old versions of our operating systems and we can't get automatic virus updates.
The data is usually not accessible or indexed in any useful fashion. And what's critical is that the data gets lost, either as the computers get retired, whether those computers are in a write‑up area or in the lab, or as people leave and you don't have their passwords anymore. All these things are problems that I'm sure are very familiar to a lot of academics.
So that's where we started. We started with the aim of developing some sort of laboratory notebook system ‑ we were very open‑minded about what that might look like ‑ that would solve some of these problems. So our starting objectives, in priority order, were first to have a system that would actually store the data in one place, backed up and reasonably safe, so I can find it. Students can usually find their own data, given a day and a half to pull it out of some computer somewhere. But I want to be able to find it, even if I don't know what their email address is anymore.
A lot of what we do is quite 'high throughput', in that we're developing 'high throughput' systems. It would be very nice if we could embed into this a system that would track samples, so that we'd know which sample is what, and when you go to the fridge and the label is some obscure number in rather poor handwriting, we can actually tell what it is ‑ that would be quite nice.
As Jean‑Claude said, I work in two different places, and I'll come back to that point a little later. It's really important for me to be able to actually track what my students are doing. I may not see them for a week or two weeks. I've been here for 10 days now, so I haven't seen them for two weeks.
So actually being able to keep track of what they're doing, making sure they're on track, that things are happening and any problems have been dealt with, is quite important as well.
Number four is the dream of where we'd like to be. We'd like all this data to be in a form that can be processed, possibly not by us but by other people. We want to be able to go back and understand: if we got this result, and that feeds back to some material that we bought from some supplier two years ago, can we link those things together? Is this particular primer, chemical or whatever bad for whatever reason, and can we figure those things out? Or, as I'll come back to a bit later, can I tell who used up the last of my buffer? Little things like that can be quite useful. So those are our objectives and starting point.
There are lots of systems out there. I'm going to take you through what we've done. This was built out of a blog‑based system. The system we're using is being built in‑house by my collaborators. It's based on a blog format and it's fundamentally an open source engine, which has been modified. So it's not a publicly hosted service or a freely available system, but that means we can make changes to it and adjust the way the system works.
I've said something that's jargon: 'a fully flexible system with arbitrary metadata'. What this means is that the system itself has no idea about semantics. There's no presupposed structure for the data that's going in. You're completely free to put in any structure you like, and that causes problems, but it's also very powerful. An important point, if you're going to use something as a notebook, is that there is a full record of the changes. This is not like blogs on publicly available systems, where you can actually change the date on a post after the event. Obviously, that is totally unacceptable for a lab book.
So we have a full record of changes. Changes can be made, though it's not terribly easy, and deliberately so.
But at the moment, we don't have the nice feature that you have in a wiki, where you can track back through the history. That's something that will be developed, but at least the systems are there.
So the sites down at the bottom are the URLs for the lab books we have, and I'll show you those in a second. Also, I have another blog discussing how this is progressing. I'll put those links back up at the end as well, if anyone wants them.
What does this thing look like in practice? If I go over here‑‑so this is the lab book that's been set up for a collaborator. This is a project that hasn't started yet. This is what it looks like before anything goes into it. It's a conventional looking blog type of thing. I mean, this looks like possibly something you'd get out of a default version of WordPress or something like that.
It's a fairly conventional looking kind of thing. There are some sections down the side. There's a search box. Unfortunately, that doesn't actually work at the moment, but that's something we'll work towards. If I go to where we started with what we're doing here, I go right back to the beginning.
This is a blog. The reason why we used a blog is because a lab book is essentially a journal. It makes sense. Your first thought is, "It's a journal. That's what a lab book is, so we can use it that way. That's fine." That's what we did. Essentially, it was just a process of typing in some of the stuff that was going on.
At this stage, actually, we weren't able to make changes. That's a point I'll come back to. These are the very first posts that went in. This was halfway through an experiment, as it happens. So Stevens described an experiment down here. In this very first experiment here, you'll notice one thing down the bottom here. There are 10 comments.
Our first thought was, "Well, you describe what you're going to do. Then, if anything changes, you make comments on it." If anything happens, you make comments on that.
A lot of this is actually about the mechanics of the system.
As you go down, one thing that comes across quite a lot is that tables are fairly awful. In fact, this has got some problems down here, obviously. This is the kind of thing we had. The results of the posts were going into comments.
So a couple of things came up fairly quickly. There are a couple of things we can do here. The student whose lab book this is was working on two different experiments in parallel. So she categorized those as β‑galactosidase and β‑[inaudible], just two different enzymes that we were essentially doing the same things to. And that brought, immediately, the first step beyond a piece of paper. I can now click on one of these and see all the stuff that relates to that particular project.
But something that became quite clear early on is that we've got a lot of text here. It's very difficult to figure out what's going on and how different posts relate to each other.
You see that there are a lot of these numbers around the place? Those are actually references to the student's paper lab book. The student was still using both in parallel at this point. Partly, for political and safety reasons, we were obliged to use paper, and partly she wasn't comfortable with trusting the system at this stage. So she tended to be writing stuff in the lab book‑‑in the paper lab book‑‑and then transferring it to the notebook.
She's got a series of references here‑‑this again shows you that you can put in any type of metadata. The problem was, we didn't know‑‑so if we do an experiment‑‑we want to say we used a material to generate another material. We do something to it. We want to link those two things together. How do we know what the material was that went into an experiment?
The easy way to do that, we thought, was to have this thing called a sample parent. This is a piece of metadata that's attached to each post. If I click on this, and go through to that particular post, we have a procedure here. This happens to be a restriction digestion of a piece of DNA.
You can see this is the lab book reference where she's done this thing. This is a reference‑‑it happens to be on the same page for some reason‑‑but on the same page of the paper lab book was where she made the material that went in.
I think you can probably immediately see what the problem here is. As soon as you've got a reaction where you put in two things, then you're in trouble. We ended up‑‑you can see if you track down here. We had sample parent, sample parent 2, sample parent 3. This was just starting to get a little bit ridiculous.
If I just show you‑‑oh, I'm not allowed to edit this one. We had this problem. We want to link products and the materials being used through to the materials they came from, and through to the materials they go into. I don't know. It may seem obvious to you, but it took us a long time to realize that if we want to link something to something else, then the easiest way to do that is to put a link in, as in a hypertext link. I certainly have come to realize that in this field you often come to the situation where there's something that's blindingly obvious once you've thought of it.
At this point, we realized that we'd been actually doing this entirely‑‑not in a wrong way, but in a way that wasn't going to help us. So what we did at that stage‑‑what Steven did‑‑was actually start it again, from scratch. We reorganized the system.
If I now go into the lab book which she is using at the moment, this is an experiment done‑‑is that today or yesterday? I've lost track of what the date is now. It's the second today, isn't it? Obviously she hasn't done anything yet today. But this is right up to date. Actually, it's probably that she's been in the lab in the morning, and she hasn't got round to putting it in. She tends to put things into the lab book afterwards.
What we have now‑‑I'll just track down the right hand side again, and show a few things.
We've now divided these posts up, so one piece of metadata‑‑actually what I should show you is‑‑I'll just show you what that metadata looks like. Oh! Oh, I know what that is. I'm not allowed to edit her posts. Again, there are obvious reasons for that.
I'll duck out of this one, and actually go into my lab book so I know I can edit things. If I edit this post‑‑I just want to show you what this metadata thing looks like when you're dealing with it. We have a group that each post falls into. At the moment I'm using this thing called "post type." I'll show you why that's not a terribly good idea in a second.
I can type anything into here. The category name can be anything. Then, what the attribute is can be anything. You can add as many of these as you like. Let's back out of that. Oops.
We have these sections now. We have materials, stuff we've brought in basically. Notes, just anything that doesn't fit into any other category really, descriptions. Procedure, where we've done something. Product, something we've made. Software issues are software issues, unsurprisingly. I'll come back to the templates in a second.
What I can do is, if I just go into the middle of a process‑‑back out of this. I'll go to‑‑this is just a category post. If I run down‑‑oh come on, go into that post. OK, so‑‑oh, there's no data in it. I'm not actually operating by my own rules. I've actually put the gel in there, which is a bad thing to do. Nevertheless, this is a gel I ran some time ago.
Basically, what I am doing here is an analysis on a set of samples. This is a calibration standard that goes into that analysis, and these are the products that I am doing an analysis on. So where do these things come from? Well, let's click on that. This takes me back to a product post. This is the post that refers to a material, a tube, of something.
If I wanted to generate a label for this, I can do that because each post has a unique ID number.
We automatically have a system of labeling each and every thing we make. I can press print and it will print on a label printer in Southampton, which is not going to be helpful to anybody.
It will print out this label. It's got the date, time, a nice little code and the post number, and that bar code refers to the code for the number. So I have actually got a quite powerful system here for identifying samples already built‑in.
I can track back to where I made this. This is now the post where I made this product, which I have gone on to analyze; this is a PCR reaction. I can click through to all the material that went into that. I can click backwards and forwards, and the machine can do this as well ‑‑ track back through the entire history of the sample. What we have is a standard build out system that is quite powerful for tracking data and samples.
I just want to show one more thing. This is a blog. The idea is that you can put anything in you like; you can type anything you want. The problem is ‑‑ as with most wikis and blogs ‑‑ we really like to organize things in tables.
Tables are a nice way to represent stuff, particularly when you are doing things in parallel. If I show you the code for this post, all of that nonsense is what created that nice table. This is not something your average grad student is prepared to do ‑ figure out how to get this coding working.
Another problem is, because you can type anything into the metadata sets, it is quite easy to make a typing mistake, because we are doing the linking manually. It is relatively easy to get the link wrong. This is why we actually needed to put in the ability to edit the posts, because we were just making too many typos and that was a complete disaster.
Is anyone here a microbiologist by any chance? [pause]
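[Editor's note: the one‑post‑per‑item linking described in this part of the talk can be sketched in a few lines of Python. This is a minimal illustration, not the lab book's actual code: the `Post` class, its field names, the function names and the example posts are all assumptions made up for the sketch. Each post has a unique ID and lists the IDs of the posts it links back to, so a program can follow those links through a sample's history, or find out where a given item was used.]

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    """One blog post per item; all field names here are hypothetical."""
    post_id: int                 # the unique ID number every post gets
    title: str
    post_type: str               # e.g. "material", "procedure", "product"
    inputs: list = field(default_factory=list)  # IDs of posts this one links back to


def trace_back(posts, start_id):
    """Follow input links backwards and return every post ID in the history."""
    seen = set()
    stack = [start_id]
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue
        seen.add(pid)
        stack.extend(posts[pid].inputs)
    return sorted(seen)


def used_in(posts, item_id):
    """Return the IDs of posts that list item_id as an input."""
    return sorted(p.post_id for p in posts.values() if item_id in p.inputs)


# Made-up example: a PCR reaction, its product, and a gel run on that product.
example = [
    Post(1, "Primer stock", "material"),
    Post(2, "Template DNA", "material"),
    Post(3, "PCR reaction", "procedure", [1, 2]),
    Post(4, "PCR product", "product", [3]),
    Post(5, "Calibration standard", "material"),
    Post(6, "Gel analysis", "procedure", [4, 5]),
]
posts = {p.post_id: p for p in example}

print(trace_back(posts, 6))  # [1, 2, 3, 4, 5, 6]
print(used_in(posts, 5))     # [6]
```

[Because every link is just a post ID, the "sample parent 2, sample parent 3" problem mentioned earlier does not arise in this sketch: a post can list as many inputs as it needs.]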
Man: You mind if I ask you a question?
Cameron: Yes.
Man: The last thing you clicked on was for something from a gel... a gel is somehow in the table.
Cameron: Yes.
Man: Is the gel [inaudible]?
Cameron: I should have. One of the reasons why I find this good is because actually it does force you to raise your standards. Not just the student's, it's also me. I actually keep one of the worst lab books I've ever seen.
Man: Another dimension to having to do the things with different computers... the gel is just the same as a spectrum that is [inaudible]
Cameron: If I go back to Jenny's lab book, I'll try to find you an example of a gel. The trouble is that she is now running through the system so fast, she doesn't actually run the gels that often.
I can show you some pictures. There you go; there's a gel image.
The result of the analysis is a picture. It's not the physical object. That is a JPEG, rather than, perhaps, a higher quality image, too. That's another issue.
A lot of what we generate is images, but we're also doing analyses on the enzymes we're generating, so that the Excel files come out of the product analysis there, as well.
The intention is that any data that we would record should be in there, in whatever its raw form is.
Man: Like you said, though, it's hard to find a grad student who has the enthusiasm to write the code for that table. This grad student, for probably very good reasons, didn't want to go to the trouble of making a better gel image, so that later on, you'd be [inaudible].
Cameron: The problem there is the gel, not the gel image, actually. And to be fair, for what she's done here, that's actually for this type of low melting gel purification. They don't actually get a lot better than that, to be fair.
That's actually not terrible. Actually, the light's not terribly good, either, is it?
It's better now.
Man: You can see it better now.
Cameron: Yes, sorry. As I said, this is a JPEG, which is a bad way to do it. The reason why the students prefer the JPEGs is because the system automatically displays them.
The technical reason is that no one has written a plug‑in to display the TIFF. No one has imported the plug‑in to display the TIFF here, yet.
Man: I think the JPEG was a very good way of doing it.
Cameron: Where was I? We're getting to the right lab book.
There's actually an interesting question whether we should have one lab book for the whole lab, or one lab book per person. That starts really interesting issues, as well.
We can crosslink, but we can't do what I'm about to show you.
The classic problem with recording data in any form, whether it's a LIMS system, a database, or anything, is actually persuading people that it's worth putting any metadata in. Unless it's very, very easy, there's an issue with that.
Additionally, we've got this problem with creating tables. We've got the problem of doing the linking, and all these things that are just a little bit difficult. All we want is to make all of that easy, so that people do it automatically.
We have an advantage in molecular biology in that a lot of the things we do are quite stereotyped. The procedures we use are very similar each time we do them. There is quite a large set of procedures. Nonetheless, each time you do any one of them, they tend to be about the same thing.
There is an obvious answer to that, which is that you have a template procedure which you work from, and then you just put in the material. You don't just want to have a template which you need to fill in, even. You can do slightly better than that.
I showed you the gel. This is the template. For that gel post that I showed you in my lab book, where I hadn't put in the image, this is the template, or is very similar to the template, that made that post.
I'll show you what these little codes do in a little while, but basically, the templates are exactly the same as a normal post. They just have little place markers in them. Those place markers say what is supposed to go in that place.
Up here, the system will look for anything that has been labeled as DNA.
The percentage symbol is a wild card, so it's looking for any post for which the metadata DNA is not empty, because those are the types of things you want to run on a gel.
The box, unsurprisingly, creates a little box that you can type into.
Down here, we have the procedure, so that you know what you are supposed to be doing. Then down here, this actually will tell it what other metadata to enter in the post when I use the template.
So, when you do something that is stereotyped, that is a common procedure, which are precisely the cases where you want to track the flow through things, it's all handled automatically. So if I then use that template... what comes up, it just renders that in a slightly different way, so that there's a title up here, and I can change the title as appropriate.
I can select anything that's in the blog that has the appropriate metadata. It comes up as nice drop down menus. So, let's say, we'll run our standard again. These are some products from an earlier demo, but I'll probably want about five microliters of that, five microliters of that, say. I can change these things at the bottom if I want to, but I shouldn't really need to. Pop that in. It then goes into an edit window, so if you want to make any other changes to the thing, you can. This is now actually posted, so if I just open it up again, that's come up.
Now, there is one problem we haven't dealt with here, which is that I have pulled down a product that had been created previously. We're still creating those posts manually. Then we actually have to do some linking backwards and forwards to tie all the ends together. So, we still have an automation process that we need to solve.
My feeling is that once we've got that process of generating these product posts sorted out, we actually have a very powerful system here, with semi‑automated collection of metadata. We have got a system that is relatively easy to use.
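[Editor's note: the template lookup described above, where a place marker finds every post whose "DNA" metadata is not empty and offers those posts in a drop down menu, can be sketched as follows. This is an illustration only: the talk does not show the real place‑marker syntax, so the function names, the dictionary layout and the example posts are assumptions, and the `%` wild card is treated here the way it behaves in an SQL LIKE pattern.]

```python
import re


def like_to_regex(pattern):
    """Turn a pattern with % wild cards into a compiled regular expression."""
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(parts), re.DOTALL)


def dropdown_candidates(posts, key, pattern="%"):
    """Titles of posts whose metadata[key] is non-empty and matches pattern."""
    rx = like_to_regex(pattern)
    return [
        post["title"]
        for post in posts
        if post["metadata"].get(key) and rx.fullmatch(post["metadata"][key])
    ]


# Made-up posts: only the first two carry a non-empty "DNA" value.
posts = [
    {"title": "PCR product A", "metadata": {"DNA": "plasmid", "post type": "product"}},
    {"title": "Size standard", "metadata": {"DNA": "standard"}},
    {"title": "Tris buffer", "metadata": {"post type": "material"}},
    {"title": "Mislabelled tube", "metadata": {"DNA": ""}},
]

print(dropdown_candidates(posts, "DNA"))          # ['PCR product A', 'Size standard']
print(dropdown_candidates(posts, "DNA", "pla%"))  # ['PCR product A']
```

[The design point is the one made in the talk: because the candidates are chosen from existing posts rather than typed by hand, the link from the new post back to its inputs cannot be mistyped.]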
People really don't need to get down to the level of putting the formatting into posts.
We're making some sort of significant progress toward what our goals were. So, is that the right one? Nope, that's not. This is kind of where we are at the moment. We have this one post per item approach. I think this is actually our real breakthrough. Doing that really creates an information framework where a lot can happen.
Templates are really very important, and I'll come back to this issue. We want to develop this web service interface so that other things can link into it. So, in terms of what we set out to do, we've actually got a pretty good system of storing and recording the data.
I can stand here; I can go back into the system; and I can track bits of data. I can pull that data out, and the only person who is really falling down on the job of putting data in in the first place appears to be me. So, I've only got myself to blame for that. Actually, the students are quite good at this, primarily because they are aware that I might be watching.
We have one post per sample. Every sample, every thing has a unique ID, so we've got a real chance to track where things are, what things are, and also who has been using them. So, again, this issue of someone using up the last of the buffer is covered.
We can monitor researchers' progress, as I said. I'll just flip back here. If I pop over to our aggregator, and Peter Murray‑Rust has been very busy again; [laughs] that surprise is there, nonetheless.
So, these are a set of posts that have come up. These are the new ones that Jenny has put in since I last looked, which is quite nice. She's done a bunch of things, so I can follow this wherever I am; pull this down and see what's going on. It's really quite a robust system for tracking what's happening, and I'll show you how I aggregate that whole set of things in a little while.
So what we've not done... we think this system is, broadly speaking, in a theoretical sense, machine readable.
We know we've got a fair bit of work to do on what the best way is to organize the metadata to really make it work. But we've not done any work on really showing whether that's true yet.
That's the direction we'd like to go, and it's really beyond the scope of the grant funding we've currently got to do this. So, it's kind of the next stage of development to really try and do that seriously. But we've made some progress.
I've put open science in the title. I've actually not talked at all about open science. So, what has this got to do with it? I have to admit that when I started this project I wasn't really thinking about the process of making this readily available.
I live in Bath, which is the left hand drawing pin on that map. I also work at the Rutherford Appleton Laboratory, which is the one at the top right, and at the Southampton Chemistry Department. That is a map of the entirety of the UK. That's a triangle that is about 80 miles a side. So, I do a bit of commuting.
But, putting that aside, there's this real serious issue of being able to find out what's going on. Now, I showed you that I can pull that through on an RSS feed, but there's a problem with getting access into the system. The people at the Rutherford Lab run a really quite irritating firewall, and so essentially we couldn't find any way of me getting an authenticated log‑in into the lab book at Southampton from Rutherford.
I just got sick of this. I thought, "Right, let's just make it wide open." That was easy. I had access to it. There's the first real benefit of actually making things open. It's not so much that anyone else can get at it, it's that you can get at it. [laughter]
People tend to skip over that as a benefit, but it's actually really quite an important one. At about the same time I was beginning to become aware of what other people were doing. I started looking around once we'd made it open, and I saw a particularly good video by a guy called Deepak, I think.
He actually had been doing this amazing five minute presentation at a conference in Seattle, which really inspired me, as did what Jean‑Claude was doing here at Drexel.
So, there was the background information, the community that was really building around making these things available, and the potential for doing that, and that got us really excited about what we could do. That's where we really bought into this open notebook science approach. There are a lot of benefits here, a lot of potential benefits. There's also a lot of confusion.
Open science means more things to more people, probably, than open access does. We have adopted this terminology "open notebook science", which was created by Jean‑Claude to really try and describe the process where you do your absolute utmost to try and make everything available, ideally as it happens. I use the term "as fast as practicable".
You can't make things instantly available. In reality, you can never make absolutely everything available. There's always something that doesn't get recorded. There's always some information that's just sort of between the neurons of the people involved, but you do your best to take that as far as you possibly can. That's really the key.
So, there are lots of potential issues with making the material completely open. The thing that everyone starts with is: Isn't someone going to steal your results? Answer: probably not, realistically. Yes, people feel that this has happened to them, but I think in most cases, and I'm speaking as someone who has been scooped by papers twice now in the last four months, even when people have become aware of results, because they're already doing the work, they just move to publish faster.
Now that's a problem, obviously, if you're beaten to publication. In many cases, you possibly would have been beaten to publication anyway.
But if someone was really coming in and looking at your results and then going away to publish faster, you've got a record of them being there.
What's more, you've already disclosed and reported it. You have a rigid date stamp saying that you've done this. It's not peer reviewed, and that's an important distinction to make, but it's out there and you have reported on it. We've actually got a case of this now in my group, where we disclosed a particular technique for fluorescent labeling of proteins about three weeks before the paper that scooped us came out. It's going to be interesting to see how that flows through when we try and publish it.
There's a real potential to be embarrassed, to put your foot in it and look really stupid. The flipside of that is that if you want to get help from people, you've got to ask. I think this actually is what's scaring people a lot, particularly the grad students that you're asking to do this. Someone's going to see what you're doing. We're all stupid, really. There's always someone smarter, and we all make mistakes. So that's a real issue.
It does require a bit more effort than putting together a paper notebook. I suspect the reason it requires a bit more effort is because most of us don't keep terribly good paper notebooks. Again, this creates a bit of an impetus to push you into doing better, and I think that's a really important aspect of it.
There are potential legal issues. In fact, if someone were to take a particularly rigid interpretation of my employment contract at the University of Southampton, by doing this I could be in violation of it. I'm supposed to report anything through our technology transfer office, just so they can find out whether it's potentially commercially viable.
That's not really a major issue in most cases because in practice, for most people in most places, you don't send every paper in to the technology transfer office. I was talking to someone at Oak Ridge yesterday, and they actually have to do that.
Every paper, every poster and every note has to be approved by the technology transfer office. It's a nightmare.
One area where I am a little bit more worried is safety information. Now, in the UK, safety information is very strongly tied to the lab book. We weren't putting this in, and arguably, this was information on how the experiment was done, so we should have been putting it in. I had a post‑doc come in the door to do a few weeks' work on a project, and she said, "It's a lab book. It has to have the safety information", so it all went up.
And I saw this going up 15‑30 minutes after it happened; I was offsite that day. I thought, what if someone looks at this, uses it and manages to injure themselves? This safety information is totally inappropriate for anyone working here. You have a different regulatory framework, a different legal framework and different working approaches. Yet the other side of that is that it's very difficult to get good safety information. MSDSs are usually useless because they tell you that salt is a carcinogen and that you should wear a full respirator to do anything with it. This is actually a serious problem.
And there are ethical and political issues. If you're working on animals, some particular stem cells and these kinds of things, you possibly don't want those to be made public. The question then is, if you don't want it to be made public, should you be doing the work? There are lots of issues here that are not clear. But there are lots of benefits as well, and I think that's where we're going to go.
I wanted to talk briefly about the differences between the different groups doing open notebook science. Obviously there is Jean‑Claude's group here, using a wiki‑based system, which is shown up there on the left.
The other person who is really doing this in a really notebook‑centric manner is [inaudible] student called Jeremiah Fife at the University of Boston, and he uses a PDF document that he creates each day.
The difference between wikis and blogs, I think, is not per se a central issue. It is just a different way of looking at a database of posts. People think differently about how they interact with wikis and blogs, but I think what is more interesting is the difference in how Jean‑Claude's group and my group lay things out.
So, up here we have ‑‑ what is that? ‑‑ experiment, that is 135. So I think I've got experiment 918 here. This is an experiment from UsefulChem Wiki. This is the final version. So there are a couple of interesting things that I take from this. This is a version where it is now described in sort of a complete, finalized approach. This is the experiment after all the analyses have been done, after everything has been explained. We can go back and look at the history of this, being a wiki, obviously. There is a history going back over the several months, I think, that this was actually done over. There are various things going on: the experiment itself in the first couple of posts, then some comments between Jean‑Claude and a colleague on what these experiments mean and how the interpretation should go.
Now, if I pop across here ‑‑ I've actually tried to replicate some of this in our system in a way that I think I would go about it. Now, some obvious differences, one page on UsefulChem ‑‑ There is the first set of posts, there are the second set of posts, there is the third page of posts. Somewhere down here we actually get to the description of where something happens. Now, you see there are a lot of these: the experiment was based on an NMR time course, taking a series of analyses over time, and I wanted to separate those out because each piece of data ‑‑ so you can upload the NMR data for each of those things.
Now I actually have put those in there in fact.
To associate a particular product with a particular piece of data, so a computer can tell which one is which, it was necessary, I felt, to create a separate post for each of those NMR samples. So arguably I've made it easier for a machine to parse this, but I've also made it fairly horrible to read. Now we can take all those product posts out and just look at the procedures. That's one option, but I can go back through these, go back to the top policy ‑‑ again just because of the way the different systems work, I've put a slightly different picture at the bottom.
I think the really essential difference between the way Jean‑Claude's group works and the way my group is working (really, in fact, the way one student is working) is that we sort of try and log the whole process in the visible final form. It's more like a traditional lab book in the sense that the record of what's happened is immediately available.
In Jean‑Claude's group's case, the final product is actually a finalized description, and the process is not immediately visible; to go back through it you have to go back to the history. I am slightly uncomfortable with that, but that is just because I think of a lab book as a log of what has happened and I think it is a question of what these things are being used for. So the object of UsefulChem Wiki is to provide information so that someone can come to it and see how they should do the experiment.
That is less true of our system because they have to track back through a series of things to figure out how they ought to do it, but they can see the process that has happened. I think what we are finding out is actually that the notion of the notebook ‑‑ we have a lot of biases and obsessions about notebooks, and properly so because record keeping is important, but a lot of those things don't matter when you are using an electronic system.
We've got to figure out which ones do matter and which ones don't ‑ different things will matter to different people. OK, I'll come back to the right one at some point.
Interestingly, again, this is a more traditional written record of a notebook. There's less obsession with linking backwards and forwards and keeping the connection between things. A lot of people like this because they feel comfortable with it; it looks like a lab book and feels like a lab book, if you print it out.
I think the differences between Jean‑Claude's approach and our approach largely come from where we started. I started from the perspective of wanting to manage the information, and Jean‑Claude started from the perspective of wanting to make that information available. That's what is actually driving a lot of the differences, rather than the difference in the science we're doing.
For most laboratory research, there are electronic lab notebook systems out there, and there are LIMS systems out there. But as soon as you actually do the research, those systems are going to break. With any sort of structure that you try to impose on the data in advance of the experiment, you're almost instantly going to break it.
It's very important to realize that different viewers of the system have different needs. The supervisor has different needs from the student, and the student has different needs from the visitor, and we're trying to work through ways of making those different views available to people.
Most people start this with a simple journal approach. They have some sort of online presence and keep a journal, and that's actually quite good. It has a lot of advantages; you can do text searching and all these kinds of things. It also has 'ease advantages'.
There's an issue here of getting people to do things where you've got to get them over a barrier, and if there are disadvantages, then they're not always going to see the advantages.
I'm not sure that keeping a simple journal provides enough advantages to keep people involved. I might be wrong about that but we'll have to see how it evolves.
Flexible meta‑data is critical. If you're going to keep meta‑data, you need good ways of getting in there, and you desperately need to be able to play with it and change it at will, if it needs to be changed. I think this method of templating things is a very powerful way of bridging the gap between structured systems, where you can impose some structure in advance, versus systems where you really want to be able to type some random stuff that says "I did this, it made this or it turned green", things that don't necessarily fit into any straightforward structure.
So where are we going with this? I think I've got my slides in the wrong order. We want to take our system and do a couple of things in terms of usability; but the next big step for us is to try and make the system a web service. There are packages out there that don't work terribly well at the moment, but they let people manage workflows.
So you can grab some stuff from here, put it over there and combine it with something, which is called all sorts of different things. You've got to make the system available to a computer to do something with that. For instance, this is our workflow engine at the top, which is being described as 'MySpace for Scientists', but it's not quite there yet. It's a thing that holds workflows, lets people share workflows and will ultimately allow people to execute them as well.
I've got to show you one other thing that we're doing, which is what we do when there is a data structure. We're actually using an online, freely available web‑based data system. It's a lovely graphical system called Dabble DB, and it's really worth looking at if you're interested in these things. We're actually making all of our lab stocks available through a proper database.
You can see these things are publicly viewable on a public page, and they're accessible by JSON, RSS and various other approaches.
So we can set off something that might go to our lab book and then comes back and reports something, maybe a data file or a sequence file. That workflow can then go off to the database that could be stored and maintained by a facility of some sort so that we can go on to do some experiments. Then you might want to combine those sets of data and send them off to query another database. I'm showing GenBank here, but it could be any sort of data analysis, which pumps it back into your workflow engine. And because all of this has to be put back into the lab book, you've got a record of how this thing is run.
The problem with these workflow engines is that they're written by computer scientists, and that's a problem in and of itself. They don't have the concept of letting you know how this was run, when it was run and what exact parameters you used to run it ‑ they don't pass that concept at all. So we need to keep a record of instantiations and particular runs of data analysis, which is critical from our perspective.
So where are we? I think there are small groups in terms of people who are doing Open Notebook Science, as Jean‑Claude originally defined it. There are really three groups doing it. There are a variety of other projects that use a variety of other names and approaches that have some sort of similarities, but it's a relatively small group of people.
I think we're seeing benefits to the community, but in terms of the science community, we're seeing fewer benefits in terms of driving the science, at the moment. But as the community size grows and the activity grows between those groups, I think we're going to start to see some real advantages.
I think a lot of the tools we have at the moment don't really provide clear wins and real advantages that are really going to drive adoption by a wide range of people.
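The record of particular workflow runs described above (how an analysis was run, when, and with what exact parameters) might look something like the following. This is only a minimal sketch in Python, not the system used in Cameron's group: the function names `record_run` and `to_notebook_post`, the field names, and the example workflow name are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def record_run(workflow_name, parameters, input_bytes, run_at=None):
    """Build a record of one run of an analysis workflow (hypothetical fields)."""
    if run_at is None:
        run_at = datetime.now(timezone.utc)
    return {
        "workflow": workflow_name,            # which workflow was run
        "run_at": run_at.isoformat(),         # when it was run
        "parameters": dict(parameters),       # the exact parameters used
        # Fingerprint of the input data, so the run can be tied to one file.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }


def to_notebook_post(record):
    """Serialise the record as text that could be posted back to a lab book."""
    return json.dumps(record, indent=2, sort_keys=True)


if __name__ == "__main__":
    rec = record_run(
        "sequence_search",                     # hypothetical workflow name
        {"database": "example_db", "e_value": 0.001},
        b"ACGTACGT",
    )
    print(to_notebook_post(rec))
```

Storing the timestamp, the parameters and a hash of the input together is one simple way to keep each instantiation distinguishable; a real system would also need to record the workflow version.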
I think we're still at a relatively early stage. We're a bit hesitant to suggest people start using our system, unless they really have a specific need, which it can serve, compared to trying to persuade them to make their results more open.
If we're going to communicate in this way, we need to think about how we provide tools, and in a wider sense, how people are going to find it. In the chemistry world, a series of tags that describe molecules can be searched. You can pick those things up in Google, but that's less true for what we're doing. You might think that with biology and biochemistry it would be very clear; you just link back to the database.
The problem is that the database doesn't actually refer to what I'm working with and it doesn't describe what I'm doing. We need to think a little bit about how we're going to use some of the tools that are already out there, and perhaps, do a little tagging. The whole Web 2.0 thing is really based on picking something, going with it and seeing whether other people adopt it.
I think we need to figure out how to do this in an integrated way. But I think overall, Open Notebook Science is a great term in terms of its ability to get people excited, though I don't think we understand what it means, in terms of the terminology. It really is quite a different approach to put your stuff directly out in the full glare of the two or three people who might look at it.
It is a very different way of thinking. There are a lot of subtleties that I think ‑‑ or we're certainly still trying to figure out what they really ‑‑ where they're going. I think there's also a desperate need here to deal with the problem that you cannot simply, in this day and age, evaluate people on the basis of how many papers they've got in a particular journal.
It's not the way we're working anymore. Traditional publishing is broken anyway.
We also need a way to try and figure out what people's contribution is in a much wider sense.
So, I need to acknowledge a few people, a few very important people. I don't do any of the technical development of the work. I just say, "I'd like this." Andrew Milsted is a PhD student in Jeremy Frey's group who goes away and does it sometimes, when he has enough time. The development of all of this has essentially been Andrew's work. He's only actually months into his PhD so it's really quite impressive.
Then there are the people who've used the system and helped to evolve it: my PhD student Jennifer Hale; Wendy Smith, a post‑doc who came in for a few weeks on this project; and Justin Shay, another PhD student who's been involved in recording some stuff along the way.
We have some funding from not quite our equivalent of the NIH, but broadly speaking that kind of thing, to develop this specific system. And also Jeremy Frey has a much bigger eScience effort. He's got a platform grant which provides reasonably long‑term, five‑year funding to develop these things in general.
If you want to look at some of these things, then go to the web sites and I think I've probably gone over time, but thank you for your attention.
[applause]
Jean‑Claude: Questions?
Man: I thought when you put up a few slides back. I can't remember the title, but it was like the five dimensions. You had the one before that where you show, for example, a flow diagram and some vents coming out of the right hand side. Vents on the left hand side, and you jumped around.
I suppose in a way that's what I was looking for. The database is... The management systems from the [inaudible] really go without saying. They kind of... Literally a window. But I'm always trying to do like this, to open the window to see what's there that I could grab onto. I mean, that workflow...
Not the workflow per se but the kind of logic diagram that the example finds is really a great thing because what if you could do... If you could click on a part of that, and continue on so you could have the gel runs, the compounds that went in and out of the box. The air mask connected behind those in some way.
Even if you anticipate a net [inaudible] visible, it's best for your potential.
Cameron: That's certainly where we're trying to get to with the idea of making it available as a web service, and it links also to this concept that different people want to view things in different ways. Let me show you one or two examples of that. Why don't we look at what we forgot to do on the way through?
This is a very simple example. Basically, there are five blogs operating in my group, and I post to those as well. I don't want to look at those in five different places. I don't want to look at the stuff that I've already seen. So, there are some really quite cool tools out there.
This is Yahoo Pipes for anyone who hasn't seen it. Basically, this allows you... It's essentially made to manipulate RSS feeds. It's essentially a way of doing graphical programming for mashups. This is simple enough that I can use it, though I have been known to swear at the screen when I can't figure out how to make something work.
What this is doing, and this was wired up in five minutes or less, is it just takes all of the feeds from the web that I'm interested in and it aggregates them into one and filters out anything I've written.
That's a relatively simple example, but nonetheless it shows you the kind of thing we're working towards ‑ If I show you just one other thing, which I think you might find ‑ These are examples. They're not quite what you're looking for, but they are quite nice examples.
This wasn't working this morning. I hope it's been fixed.
So, there's more than one way to look at a lab book. There's certainly more than one way to look at an electronic lab book. And ‑ It was working this morning. Ah, here we go.
So, one way you might want to look at it is: what's happened over the last, sort of, period of time. This is being generated by a ‑ This is a viewing product which is freely available, again, developed by a group at MIT. And what this does is it takes XML data with some time information and generates a timeline.
So this is that timeline.
Posts are colored by the category, by the section. The red ones are products; green ones are procedures. Something you can see is that the student always generates the products first and then adds the procedures. It's interesting. One of those little issues that comes up in how you think about how these things work.
But we can either track back on a large scale at the bottom here ‑ the screen size isn't brilliant here, I'm afraid ‑ either track back on a large scale, or get more into the data. We are tracking slowly here.
What we'd love to be able to do is impose the links on this, or to have the network imposed, to understand how things are linked together. No one seems to have quite generated the visualization tools that will do that for us yet. I'm going to Boston on Sunday talking to the people there who develop this and have done some similar stuff, and hopefully we'll see what we can put together there.
Absolutely, we need ways of getting in there and letting people manipulate the data in whatever way they want, because that's where the real power is going to come out of this. We're working on it.
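The feed aggregation Cameron describes wiring up in Yahoo Pipes (merge several feeds into one and filter out anything he has written himself) can be approximated in a few lines of code. The following is only a sketch in Python using the standard library; the sample feeds, author names and URLs are invented for illustration, and it assumes each RSS item carries a plain `author` element.

```python
import xml.etree.ElementTree as ET

# Two invented RSS 2.0 feeds standing in for the group's lab-book blogs.
FEED_A = """<rss version="2.0"><channel><title>Student A</title>
<item><title>Heat shock run 3</title><author>student_a</author>
<link>http://example.org/a/3</link></item>
<item><title>Comment on run 3</title><author>supervisor</author>
<link>http://example.org/a/4</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Student B</title>
<item><title>Gel of ligation products</title><author>student_b</author>
<link>http://example.org/b/7</link></item>
</channel></rss>"""


def parse_items(rss_xml):
    """Return a list of dicts (title, author, link) for one RSS document."""
    root = ET.fromstring(rss_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "author": item.findtext("author", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


def aggregate(feeds, exclude_author):
    """Merge items from all feeds, dropping those written by exclude_author."""
    merged = []
    for rss_xml in feeds:
        merged.extend(
            item for item in parse_items(rss_xml)
            if item["author"] != exclude_author
        )
    return merged


if __name__ == "__main__":
    for post in aggregate([FEED_A, FEED_B], exclude_author="supervisor"):
        print(post["title"], post["link"])
```

In practice the feeds would be fetched over HTTP rather than held as strings, and real feeds identify authors in different ways, so the filter would need adjusting.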
Man 2: I have a question about your data. Is there data available to the general analysis or some other analysis?
Cameron: It's available to them if they wanted to do something with it. The principle we generally work on here ‑
Man 2: Because if scientists bring their instruments, and they are not linking them to your website, they are not using the information for their research.
Cameron: So there are two approaches to this. Jean Claude has looked at how people come to him, and what sort of Google searches bring people there. We haven't done that. And partly, we haven't the right tracking thing set up.
Man 2: Are you indexed on Google?
Cameron: We are indexed.
Man 2: Not Google Scholar?
Cameron: Not Google Scholar. [laughs] This is a mini discussion. This is part of this whole cultural shift. Google Scholar is about peer‑reviewed literature.
I just did a search on sortase, then. We're not at the top of the list ‑ hold on a second, I've got to find it. If I'd done a search on sortase cloning, which is the title of the blog, it would have been right at the top. There we are. So, the second page.
If I was working on a really hot cancer gene, we'd be down on the bottom somewhere, obviously. We're not heavily linked to from outside. This is a moderately obscure system, so we're quite close to the top.
It's this issue with tagging, though. How do we expose the right sort of data, so that people doing searches do find us? And I don't have the answer to that.
If you put in an InChIKey or a SMILES code, or you're exposing CML, then there are clear ways of indexing this molecule. We should put that data in where we are using specific chemicals. But how you do that for a protein or a small DNA sequence, that's the hard thing. I haven't got the answer to that yet.
Woman 1: If you want to go to total electronic lab notebooks, writing everything in on the projects, how do you handle it when the students are learning the technique, when they are not doing anything new?
It's not a new experiment yet, but it is what they are doing, and they need to show how long it takes. You need to go over it with them and see what they might be doing wrong if they can't duplicate a well known experiment or whatever.
Cameron: It should be in the lab book. It should go in. There's no question about that, and the same goes for experiments that don't work.
Woman 1: Do you shove that up inside in your own way?
Cameron: No, I am sorry. Perhaps the other way to put the question is, "Do we promote the exciting stuff up to a higher level?" So, we have some ideas about how to do that. One thing that has become very clear is that you don't learn a lot by looking at most people's lab books. You need to either find the specific thing you are looking for, or you need some sort of summary of what's going on.
What I've been doing personally is... I have another blog, which is a completely separate conventional blog, where I am trying to tag some of the interesting stuff that comes up. I haven't done a great deal of that either, I have to say. Jean‑Claude points up stuff on his blog, as opposed to the UsefulChem wiki, where interesting stuff has come up or where he's introducing an experiment that is going to start up.
So, there is a series of different levels, and again it's part of this view. How do we go from this, which is basically the equivalent of scribbles in a paper notebook, to something which is not a paper, perhaps, but may be a report?
So, it can automatically generate a report that covers the things you are interested in, and how we go about doing that is a very interesting question.
The first step we want to take, and again we probably don't really have the resources to do it within the current grant, is that we want to somehow create a button where we can say... what's written in this paper is going to be published, and I want to generate supplementary information, perhaps readable, with all the raw data in it, and whack it out as a PDF.
Woman 1: I think that it's very appealing not to push it off to the side. How do you highlight the stuff in advance?
Cameron: Yeah. Well. Again it's... I think this is on the first page here. The student has been struggling for a very long time with experiments. She's been trying to do a moderately complicated experiment, and I think it's down here. [pause] I can't find it... She's actually just got it to work at the beginning of this week. She suddenly multiplied by 10 the number of transformants that she had, the number of colonies she has on a plate. That's really important to the experiment. And she says that, but I wouldn't have noticed it unless I had actually read it through in detail.
So, we could just have a piece of metadata that says: Worked? Yes. Or this is a good result. Or this is something that you want to see. This person wants to see. Yeah, we don't know how to do it yet.
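The "Worked? Yes" idea could be as simple as a free‑form key and value attached to each post, which a supervisor's view then filters on. Cameron says his group does not yet know how to do this, so the following is purely an illustrative sketch in Python: the `Post` class, the `worked` key and the sample posts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    """One lab-book entry with free-form metadata (hypothetical structure)."""
    title: str
    body: str
    metadata: dict = field(default_factory=dict)


def highlights(posts, key="worked", value="yes"):
    """Return the posts whose metadata marks them as worth a closer look."""
    return [
        post for post in posts
        if str(post.metadata.get(key, "")).strip().lower() == value
    ]


if __name__ == "__main__":
    notebook = [
        Post("Transformation, attempt 36", "Very few colonies.", {"worked": "no"}),
        Post("Transformation, attempt 37", "Ten times more colonies.", {"worked": "Yes"}),
        Post("Buffer preparation", "Routine.", {}),
    ]
    for post in highlights(notebook):
        print(post.title)
```

Because the metadata is just a dictionary, other flags (a good result, something a particular person should see) could be added without changing the structure of the posts.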
Man 1: Could you put some kind of tag on it?
Cameron: Absolutely, yeah.
Man 1: Whereas you could assign an experiment a certain number value. I think a wonderful example is if you've got something you are trying to do with students. You try at least 37 times to do this, and three years from now what you want to be able to do is have a new student group go back to attempt it and not bother...
Cameron: And the reason why it worked is because she used a different brand of [inaudible]. [laughs]
Woman 1: Is that true?
Cameron: No. No. It's absolutely true. She's been struggling because ‑‑ if you know anything about getting DNA into cells, in these bacterial cells, we do what's called a heat shock. So, you basically heat the cells for a very specified amount of time at a specified temperature.
The width of the plastic and the shape of the plastic tube make a huge difference. I simply hadn't realized they were using different tubes.
Last week, before I went away, I said we need to get some of these tubes and test it out. I said, "That's not going to make any difference." So, these things happen. I could have told her that had I known they weren't doing it, but it's a classic thing. She wasn't writing down the brand of tube she was using ‑‑ why would you ‑‑ unless you realized it was important.
Man 2: It really does highlight the fact that you have to have different vehicles for different audiences. Someone will not read this lab notebook ‑ you don't read our lab notebook page after page. They'll do a Google search. They're looking for the boiling point of something and we happen to know [inaudible] follow that.
You need to have another vehicle. I use a blog to summarize important things. Even people who are not in my field can actually get some information and then those link to specific experiments. So that does require effort, I mean, you know you have to decide.
Cameron: Again, it's this issue that my students are terrible at writing, at noting what's interesting, what's worked, what the context is. They're really actually quite bad at that. I try to encourage them to do it. It's one of these things that require extra effort. It's because we've traditionally not kept very good lab books because no one looks at them.
Man 2: Going back to what you first started talking about, getting instrument data, you know, off all the instruments. It seems to me that one of the big problems is the format of how we save that data. There certainly are some older formats like JCAMP. There's all of this from before, and there have been some attempts at coming up with XML formats, and these formats are sometimes pretty humongous when you take a spectrum and put it into that format. Do you have any thoughts about where those kinds of things are going? It seems to me like that's a big sticking point.
Cameron: We don't have a problem with storage space. That's not come up as a problem yet. This system will upload a file ‑ any file. It will start to choke if you try to put in something that's five gigabytes, but it'll cope with 20 megabyte GIFs, for instance. I've done that. So it will take anything.
What you can then do with that is another question. But you can, again, for me this is partly the philosophy here of plug‑ins: to start with, someone could write a plug‑in to display or do something with it. One would particularly imagine this with DNA sequences. You automatically have a little thing that pops up and gives you a semi‑editable graphic picture of a plasmid sequence or something like that.
Part of the idea of making this accessible to the web is that we don't worry about the data format. Someone who is interested in using it has something that comes and grabs the data. Maybe they have to send it off to a translator to translate it into the appropriate format for their thing.
A lot of these things are available as web services now. It's relatively easy if you've got a package sitting on your computer to make it either look like a web service or make it available as a web service, even if it is just to you. There are ways and means of pulling these things through. Having spent three days talking about file formats I don't have an answer. It seems to me XML's a good thing ‑ at least it's partially self‑describing.
You've got some internal information in there and it should be relatively easy for people to put together stylesheets and schemas that can tell you, "I want to do this with it". That does seem to me to be the way forward, within reason. A lot of the time what you really want is a couple of columns of ASCII, that's the easiest part for making it ‑ figuring out where the columns are.
Jean‑Claude: Well thank you very much, I appreciate it.
