Scientific Domain Integration
This Core Function encompasses all activities for Science Domain Integration in the SciServer system. Each science domain project has “big data” needs and has identified value in using a re-engineered SkyServer environment. Each project will therefore help drive the details of the SciServer project and its capabilities, while itself benefiting from the data services available. Each project will work with the development team to evolve and iterate the system to maximize its value to all communities. The science domains with projects that will work with and use SciServer are:
- Astronomy: This domain is already served by SkyServer. There are, however, a number of additional capabilities required to support advanced searching and analysis, and these will depend on the new scalable infrastructure.
- Earth Science: JHU already has remote sensing data from an earlier Earth Science project called Life Under Your Feet (LUYF). A new project called GLUSEEN is now running, and will use SkyServer to store and process its data and to serve it to the scientific community. A second goal is to merge the remote sensing data from GLUSEEN and LUYF into one model, so that the community can access both through a single, unified interface.
- Cosmological Simulations: These data sets are huge (several tens of terabytes). They are generated in supercomputing facilities and then made available to the communities for analysis and subsequent simulations. Because the data are so large, it is almost impossible to make them available in a form that lets users work with more than small subsets. Hosting the data in SkyServer and providing the Numerical Simulation Capability will enable anyone to run significant analyses on the full data sets, which is not feasible today. JHU already has some data sets, and there are several more, with a host of analyses to be run.
- Turbulence Simulations: As with the cosmological simulation data, these are large data sets that need extensive processing capability in an accessible environment. The Turbulence project already uses its own CASJobs/MyDB and is maintained by JHU, so an initial activity is to bring it into the newly engineered environment and provide a numerical simulation laboratory environment with MyScratch space. We plan to demonstrate a pilot capability for the 18-month review using turbulence data sets.
- Genomics: Data sets will be brought into the SkyServer environment and provided with a CASJobs/MyDB capability for performing genome assembly. The project will also implement a “Terabase Search Engine” on top of CASJobs.
- Connectomics: This project will bring data sets into the SkyServer environment, set up CASJobs/MyDB environments, and make these available. In particular, the data will derive from high-throughput neuroscience, so the project will require and develop a data streaming workflow and will provide data annotation services on top of the data.
- Oceanographic Simulation: This project has large simulation data sets that will be hosted in SkyServer and served through CASJobs/MyDB. A number of advanced processing capabilities will require the scalable extensions being developed. In particular, this project will look to integrate simulation data with observational data hosted elsewhere by setting up a workflow to bring the two data sets together, perform analysis, and store the results in a MyDB.
- Long Tail Support: The architecture of SciServer and the provision of MyDB space are ideally suited to scientists in the long tail. We will build out the capabilities to maximize this value by extending the SciDrive interface, providing extended data and metadata characterization capabilities, allowing seamless integration across data sets (including the large data sets), promoting sharing and reuse, and extending the UI capability using modern social-media-style technologies for semantic integration across the web.