Hello, everyone. What I want to show you is a scenario that demonstrates some of the concrete steps toward the vision that Bill is describing. The scenario I'll talk about is cancer research, and I'll show how some of the products and technologies available today can help researchers improve their productivity through easier data management and the integration of distributed computing resources into their workflow. Now, early detection is super important in successful cancer treatment. Current diagnostic methods involve invasive and expensive biopsies. Ideally, what you'd be able to do is collect blood samples and study them for biomarkers of the disease. That is both an inexpensive and fairly rapid procedure.
Recent improvements in protein mass spectrometry are making proteomics a promising area for exactly this type of thing. So, what I'll show you is how our research work in proteomics can benefit from some of these tools. We have 216 patients across cancer and control groups, and the data has been downloaded from the mass spectrometry machine. Now, typically it comes in a flat file. This is the file for one of the patients. It has 350,000 rows, each with an m/z value and the intensity information for that m/z value, and the files total about 1.2 gigabytes of data. A database is an ideal way to structure and query this data and to store metadata about these experiments. So, what we've done is import all this data into a SQL database. There are two tables I want to call your attention to. The definition table acts as metadata, tracking the pertinent information about these experiments. For each one of the patients, we have the timestamp of when the data was added and a marking of whether it is a control or a cancer patient. You can imagine additional columns being added, for example the make and model of the machine that produced the data, the last time it was calibrated, and other information. The spectra samples table is the one into which we'll import all the files I've shown you, and it will hold about 75 million data points.
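To make the two-table layout concrete, here is a minimal Python sketch of how such a schema could look. The table and column names (sample_definition, spectra_samples, and so on) are assumptions for illustration, not the actual schema used in the demo, and SQLite stands in for SQL Server.

```python
import sqlite3

# Illustrative schema for the two tables described above.
# Names and types are assumptions for this sketch only.
conn = sqlite3.connect("proteomics.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS sample_definition (
    patient_id      INTEGER PRIMARY KEY,
    loaded_at       TEXT NOT NULL,      -- timestamp when the data was added
    is_cancer       INTEGER NOT NULL,   -- 1 = cancer patient, 0 = control
    machine_make    TEXT,               -- imaginable extra columns: make/model,
    machine_model   TEXT,               -- last calibration, and so on
    last_calibrated TEXT
);

CREATE TABLE IF NOT EXISTS spectra_samples (
    patient_id INTEGER NOT NULL REFERENCES sample_definition(patient_id),
    mz         REAL NOT NULL,           -- mass-to-charge (m/z) value
    intensity  REAL NOT NULL            -- measured intensity at that m/z
);

CREATE INDEX IF NOT EXISTS idx_spectra_patient ON spectra_samples(patient_id);
""")
conn.commit()
```

With 216 patients and about 350,000 rows per patient, the spectra table ends up at roughly the 75 million data points mentioned above.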
Now, SQL Server 2005 Integration Services has a rich visual development environment in which we've designed this workflow to import that data. We can look for new files, pull the information from those files, store it into the spectra table, and then kick off preprocessing on a computational cluster to perform signal analysis on that data. After being designed and debugged in this rich way, the workflow can then be stored on the SQL Server and run on a regular basis as an automated workflow.
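As a rough illustration of what that import step does, outside of Integration Services, here is a hedged Python sketch: it scans a drop directory for new per-patient flat files, loads their m/z and intensity rows into the spectra table from the sketch above, and registers each patient in the definition table. The patient_<id>.txt naming convention and the two-column file layout are assumptions.

```python
import glob
import os
import sqlite3

def import_new_spectra(data_dir, db_path="proteomics.db"):
    """Import any not-yet-loaded per-patient flat files (assumed two numeric
    columns per line: m/z and intensity) and return the imported patient ids."""
    conn = sqlite3.connect(db_path)
    known = {row[0] for row in conn.execute("SELECT patient_id FROM sample_definition")}
    imported = []
    for path in sorted(glob.glob(os.path.join(data_dir, "patient_*.txt"))):
        patient_id = int(os.path.basename(path).split("_")[1].split(".")[0])
        if patient_id in known:
            continue                                  # already loaded on a previous run
        with open(path) as f:
            rows = [(patient_id, float(mz), float(intensity))
                    for mz, intensity in (line.split() for line in f if line.strip())]
        conn.executemany(
            "INSERT INTO spectra_samples (patient_id, mz, intensity) VALUES (?, ?, ?)", rows)
        conn.execute(
            "INSERT INTO sample_definition (patient_id, loaded_at, is_cancer) "
            "VALUES (?, datetime('now'), ?)", (patient_id, 0))   # group label set later
        imported.append(patient_id)
    conn.commit()
    conn.close()
    # The real package would now kick off signal-analysis preprocessing on the
    # compute cluster; that step is out of scope for this sketch.
    return imported
```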
There's also a rich toolbox that allows us to drag and drop components, such as downloading data from an FTP site or calling a Web service, directly into the workflow we've already configured. So, now the data is in SQL Server. I'm not going to show it running, but that's the workflow that was used. Let's imagine stepping over to a workstation where a proteomics researcher is sitting, using a familiar technical computing environment such as MATLAB with the Bioinformatics Toolbox. We can use the rich UI to query and visualize the data out of the database. In this case, we have the 350,000 data points on the X axis, the cancer and control samples on the Y axis, and the intensity of the measured molecule on the Z axis. What we want to do is kick off a computation to pull out, from 350,000 possible values, the 20 m/z values that can best discriminate between cancer and control patients. Now, done by brute force, this would require testing 10 to the 92nd sets of m/z combinations. Using a genetic algorithm makes it a little bit more tractable, but still about 10 to the 9th statistical models need to be created and tested to identify discriminating features that achieve high confidence.
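For context, the 10-to-the-92nd figure is roughly the binomial coefficient C(350,000, 20), the number of distinct 20-value subsets. Below is a minimal, illustrative Python sketch of genetic-algorithm feature selection over such a matrix; it is not the MATLAB code used in the demo, and the fitness function (a simple nearest-centroid classifier) is a stand-in assumption for the statistical models being tested.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(X, y, idx):
    """Score a candidate m/z subset by how well a nearest-centroid rule
    separates cancer (1) from control (0) on the selected columns."""
    Xs = X[:, idx]
    c0, c1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    pred = (np.linalg.norm(Xs - c1, axis=1) < np.linalg.norm(Xs - c0, axis=1)).astype(int)
    return (pred == y).mean()

def ga_select(X, y, k=20, pop_size=50, generations=8):
    """Tiny genetic algorithm: each individual is a set of k m/z column indices;
    selection keeps the fitter half, crossover mixes two parents' indices,
    and mutation occasionally swaps in a random column."""
    n_features = X.shape[1]
    pop = [rng.choice(n_features, size=k, replace=False) for _ in range(pop_size)]
    for _ in range(generations):
        pop = sorted(pop, key=lambda idx: fitness(X, y, idx), reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.choice(len(parents), size=2, replace=False)
            child = rng.permutation(np.union1d(parents[a], parents[b]))[:k]
            if rng.random() < 0.2:                    # mutation
                new = rng.integers(n_features)
                if new not in child:
                    child[rng.integers(k)] = new
            children.append(child)
        pop = parents + children
    return np.sort(max(pop, key=lambda idx: fitness(X, y, idx)))

# Illustrative call with synthetic data shaped like the demo's matrix
# (the real data would be X.shape == (216, 350000), y in {0, 1}):
#   X = rng.normal(size=(216, 5000)); y = rng.integers(0, 2, size=216)
#   markers = ga_select(X, y)          # 20 candidate m/z column indices
```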
So, tuning a genetic algorithm is an iterative process that we want to do in an interactive fashion, and for that we need readily available computational resources. When I click Find Biomarkers, the job is scheduled right from the workstation onto a personal cluster running Windows Compute Cluster Server, such as this prototype, which is a 25-gigaflop machine. It has four nodes, each with a dual-core Intel processor, and a built-in gigabit Ethernet switch that turns it into a self-contained cluster. It's under $10,000. You can imagine buying something like this from your favorite OEM and just putting it in your office; all you need to do is plug the power into the outlet and plug a network connection into that box, and think of it as an accelerator for what you're doing.
The results came back. The red lines represent the best 20 molecules selected over those eight genetic algorithm iterations. What we're seeing is the performance of the algorithm and the confidence, and it's pretty decent, so we'll run it several more times to get the kind of performance we want. Once we're satisfied with the algorithm's general performance, we want to kick off a much larger computation to pull out the actual peptides that we'll use in further research. For that we want to run about 320 samples. This would take over an hour on a workstation, and possibly an hour on the personal cluster. So, we'll go ahead and request 320 processors and kick off the computation again.
Now, a researcher does not want to worry about where the resources come from. He just says, I need this much computing power, and the job gets scheduled through the personal cluster. The scheduler in the Compute Cluster Server has been configured by an IT manager to know about two additional computational resources, in this case larger clusters in the environment, to which the job will be forwarded if it exceeds a configured threshold, in this case eight processors. And we all know that HPC is a heterogeneous environment, so integration with existing infrastructure is super important to maximize utilization.
To demonstrate this, backstage we have a cluster of 16 nodes, 32 processors; it's a Dell box running Linux, and Platform LSF is being used to schedule the jobs on it. The other cluster is in Intel's remote access location halfway across the state, connected over the SCinet and InfiniBand networks; it's a 64-node machine with two dual-core processors per node, so 256 cores total, and it's running Windows Compute Cluster Server.
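As a purely illustrative aside, the threshold-based forwarding policy just described could be sketched like this in Python. The cluster names and sizes echo the description above, but the logic is an assumption for illustration, not Compute Cluster Server's actual scheduler.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    total_processors: int

# Hypothetical resources mirroring the setup described above.
PERSONAL = Cluster("personal-ccs", 8)
REMOTE = [Cluster("dell-linux-lsf", 32), Cluster("intel-ccs-infiniband", 256)]

def route_job(requested, threshold=8):
    """Jobs at or under the threshold stay on the personal cluster; larger jobs
    are forwarded to the remote clusters, accumulating capacity until the
    request is covered or all remote capacity has been committed."""
    if requested <= threshold:
        return [PERSONAL]
    chosen, covered = [], 0
    for cluster in REMOTE:
        chosen.append(cluster)
        covered += cluster.total_processors
        if covered >= requested:
            break
    return chosen

# route_job(4)   -> [personal-ccs]
# route_job(320) -> [dell-linux-lsf, intel-ccs-infiniband]
```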
And so now, if you look at the monitor here (we need to switch it to Workstation 2), we can see the job running on both of those clusters using the standard performance monitoring tools built into both Windows and Linux. The jobs have completed; they're just wrapping up here, and the results will start appearing back on the workstation. MATLAB with the Distributed Computing Toolbox was running on both of those compute clusters, and now the results are coming back. We're seeing that with an increasing number of iterations, we're getting a higher degree of confidence. Now, it's important to note here that because we have a very large number of m/z values, 350,000, and a small number of patients, only 216, there's a high likelihood that the marker sets each genetic algorithm run pulls out will demonstrate high classification performance but might not be of actual significance. This is why, to maximize confidence, we're running these multiple iterations of the algorithm while measuring probabilities for the model. So, here you're seeing the black lines showing those model probabilities.
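One standard way to quantify that many-features, few-patients risk (and presumably roughly what those probability curves capture) is a permutation test: rerun the same selection and scoring on label-shuffled data and see how often chance alone matches the observed accuracy. Here is a minimal, self-contained Python sketch; the simple standardized-difference selector and nearest-centroid classifier are stand-ins, not the demo's actual GA pipeline.

```python
import numpy as np

def select_and_score(X, y, k=20):
    """Pick the k m/z columns with the largest standardized mean difference
    between the two groups, then score a nearest-centroid rule on them."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    idx = np.argsort(-np.abs(mu0 - mu1) / (X.std(axis=0) + 1e-12))[:k]
    Xs = X[:, idx]
    c0, c1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    pred = (np.linalg.norm(Xs - c1, axis=1) < np.linalg.norm(Xs - c0, axis=1)).astype(int)
    return idx, (pred == y).mean()

def permutation_p_value(X, y, n_permutations=200, k=20, seed=0):
    """How often does the same select-and-score pipeline, run on shuffled labels,
    do at least as well as on the real labels? With 350,000 candidate m/z values
    and only 216 patients, chance alone can look impressive."""
    rng = np.random.default_rng(seed)
    _, observed = select_and_score(X, y, k)
    hits = sum(select_and_score(X, rng.permutation(y), k)[1] >= observed
               for _ in range(n_permutations))
    return (hits + 1) / (n_permutations + 1)   # add-one smoothing keeps p > 0
```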
And so, the performance of this algorithm was quite good, and we're satisfied with the results. The 20 red markers we'll save for further use, for additional clinical experiments or additional analysis, and at this point the researcher has successfully gotten the results he needed.
What I showed in this demo is how we can integrate, directly into the workflow of a researcher, data management tools that take raw data from the experiments, put it in a structured format, and add the metadata information; how computation weaves directly into a rich visual environment by integrating both remote, heterogeneous clusters and personal clusters like this one; and how that improves the kinds of insight the researcher can achieve through advanced computation. The demo you just saw was a collaborative effort over the last few weeks, pulled together by teams from MathWorks, Platform, SCinet, OpenIB, Intel, SilverStorm, and Microsoft, and I want to thank those members for contributing to it. If you're interested in more details, I invite you to stop by our booth and pick up a white paper that describes what you saw in more depth.
Thank you very much. (Applause.)