Notebook 1

Query SDSS Data Release 12, save thumbnails to SciDrive

This notebook shows you how to use SciServer compute to communicate with the other components of SciServer. You will learn how to:

  1. Single-sign-on authentication through Keystone tokens
  2. import special purpose libraries written for SciServer actions
  3. Querying relational databases registered in CasJobs (SciServer's database frontend and batch query engine)
  4. Manipulating query results in python code (visualization)
  5. Storing results on local scratch disk as an HDF5 file for later reuse
  6. Based on query result retrieve images from persistents store, available to notebook because the docker container was created with a link to the corresponding volume container and show the images on the screen.
  7. Write images to the sharable, dropbox-like SciDrive. There they can be found through the UI interface and shared with colleagues.
  8. Store public URLs to thimbnails together with other data to table in one's private database, MyDB

1. Sign in to SciServer

All SciServer tools (CasJobs, SciDrive, iPython Notebooks, etc.) use the same single-sign-on system, so you only need to remember one password.

When you open your Docker container from the SciServer Compute dashboard page, the current token will be written in the file /home/idies/kestone.token. You can find your token on your Compute dashboard, under your username.

The code block below reads the token and stores it in a local variable, then prints its value along with your login name.

Note: since your token may expire, you should sometimes refresh the token as you work. You can do this by refreshing the token on the Dashboard, and then rerunnning the next block of code.

In [7]:
# This code block defined your token and makes it available as a 
#   system variable for the length of your current session.
# 
# This will usually be the first code block in any script you write.
with open('/home/idies/keystone.token', 'r') as f:
    token = f.read().rstrip('\n')
# async queries require token to be in --ident system variable
import sys
sys.argv.append("--ident="+token)
print("Your current token is"+token)
Your current token ise1833d6cc07641958af85ac572ad6f53

2. Import SciServer libraries

The SciServer team has written a number of libraries, generally prefixed by SciServer, that assist in various functions. As with all Python libraries, they must be actively imported before being used.

The next code block imports those, together with some standard Python libraries helpful for scientific analysis. The code block below applies some settings you may find helpful.

In [3]:
# Step 2a: Import Python libraries to work with SciServer

import SciServer.CasJobs as CasJobs # query with CasJobs
import SciServer.SciDrive           # read/write to/from SciDrive
In [4]:
# step 2b import other libraries for use in this notebook.
# all of these are included in the default Docker image
# but others can often be downloaded through a terminal or a '!pip install ...' run from 
# within the notebook
import numpy as np                  # standard Python lib for math ops
import pandas                       # data manipulation package
import matplotlib.pyplot as plt     # another graphing package
import skimage.io                   # image processing library
import urllib                       # accessing resources thorugh remote URLs
import json                         # work with JSON files
In [5]:
# Step 2b: Apply some special settings to the imported libraries
# ensure columns get written completely in notebook
pandas.set_option('display.max_colwidth', -1)
# do *not* show python warnings 
import warnings
warnings.filterwarnings('ignore')

3. Query an astronomy database (SDSS DR12)

The next code block searches the SDSS Data Release 12 database via the CasJobs REST API. The query completes quickly, so it uses CasJobs quick mode.

CasJobs also has an asynchronous mode, which will submit job to a queue and will store the results in a table in your MyDB. If your results are very large, you may order it to store the results in MyScratchDB instead.

Run the code block below to query DR12. Try changing some of the query parameters in step 3a to see the effect on the results returned in step 3d.

Documentation on the SciServer Python libraries can be found at our documentation site at:
http://scitest02.pha.jhu.edu/python-docs/

The actual source code is accessible on GitHub at
https://github.com/sciserver/SciScript-Python/tree/master/SciServer

In [22]:
# Step 3a: Find objects in the Sloan Digital Sky Survey's Data Release 12.
# Queries the Sloan Digital Sky Serveys' Data Release 12.
# For the database schema and documentation see http://skyserver.sdss.org
#
# This query finds "a 4x4 grid of nice-looking galaxies": 
#   galaxies in the SDSS database that have a spectrum 
#   and have a size (petror90_r) larger than 10 arcsec.
# 
# First, store the query in an object called "query"
query="""
SELECT TOP 16 p.objId,p.ra,p.dec,p.petror90_r
  FROM galaxy AS p
   JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE p.u BETWEEN 0 AND 19.6
  AND p.g BETWEEN 0 AND 17
  AND p.petror90_r > 10
"""
In [23]:
# Step 3b: Send the query to CasJobs using the SciServer.CasJobs.executeQuery method.
# The method takes the query and the 'context' (= target database) as parameters, and possibly the token.
# This example uses DR12 as context - the code makes a connection 
#   to the DR12 database, then runs the query in quick mode.
#   When the query succeeds, an "OK" message prints below.
queryResponse = CasJobs.executeQuery(query, "dr12",token=token)
executeQuery POST response:  200 OK
In [24]:
# Step 3c: store results in a pandas.DataFrame for easy analysis.
#
# CasJobs returns the results as a CSV string, stored in the "queryResponse" variable.
# Now parse the results into a DataFrame object using the pandas library.
# We identify the first column asn an index column, which is for slightly technical reasons explained below.
#    pandas.read_csv documentation: 
#    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
gals = pandas.read_csv(queryResponse,index_col='objId')
In [25]:
# Step 3d: Show the table of results
gals
Out[25]:
ra dec petror90_r
objId
1237645941824356443 40.285690 -0.714957 14.72862
1237645943978393694 55.123625 0.872487 18.85110
1237645943979114622 56.847420 0.875488 15.66479
1237645943979311221 57.248385 0.925979 14.60275
1237646588246688297 80.135126 1.186679 11.14294
1237648672921485632 243.708876 -0.915653 15.01578
1237648672922337955 245.700633 -0.844301 10.51866
1237648672922468973 246.015172 -0.902869 10.80825
1237648673459077169 245.367353 -0.457074 11.18379
1237648673459667002 246.695626 -0.510111 10.27149
1237648673459667234 246.782081 -0.492432 11.98310
1237648673460585255 248.894520 -0.452329 11.48170
1237648673971437623 189.522249 -0.027031 21.94494
1237648673992671592 237.945144 -0.105170 13.36310
1237648673994965546 243.236782 -0.096251 26.27760
1237648673995686549 244.875590 -0.038526 12.17980

Exercise A: Querying CasJobs

Try changing the query in the following ways.

  1. Return 8 matching objects insetad of 16.
  2. Select only nearby galaxies (redshift < 0.05).
  3. Select only galaxies likely to be spirals (with u-r color >= 2.22)
  4. Search for galaxies in SDSS Data Release 10 instead of DR12.

Some helpful resources

SciServer Python modules documentation: http://scitest02.pha.jhu.edu/python-docs/

Schema of SDSS Data Release 12: http://skyserver.sdss.org/dr12/en/help/browser/browser.aspx

Schema of SDSS Data Release 10: http://skyserver.sdss.org/dr10/en/help/browser/browser.aspx

4. Make a simple Plot

Now that we have run the query and stored the results, we can start analyzing the results.

Start by making a simple plot of positions, using the default query from step 3 (select top 16... AND p.petror90_r > 10).

In [26]:
plt.scatter(gals['ra'], gals['dec'])
plt.show() 

Exercise B: Plotting data

Try changing the code block below in the following ways. What changes do you notice in the plots?

  1. Plot the first 100 matching objects instead of the first 16.
  2. Plot galaxies out to dec = +/-5 (change the display axes as well)
  3. Plot a color-magnitude diagram with g on the x-axis and g-r on the y-axis. (Needs extra columns to be returned in the SQL query).

Some helpful resources

SciServer Python modules documentation: http://scitest02.pha.jhu.edu/python-docs/

Schema of SDSS Data Release 12: http://skyserver.sdss.org/dr12/en/help/browser/browser.aspx

Documentation for matplotlib module: http://matplotlib.org/contents.html

5. Store results on server for later use

The next code block saves the data table "gals" as an HD5 file and as a CSV file.

To see these files, go back to the folder in your Jupyter dashboard from which you opened this notebook. You should see your files there. Click on the file names to preview.

In [27]:
# store result as HDF5 file 
h5store = pandas.HDFStore('GalaxyThumbSample.h5')
h5store['galaxies']=gals
h5store.close()

# store result as CSV file
gals.to_csv('GalaxyThumbSample.csv')

Exercise C: Saving files

Try some alternatives:

  1. Change the name of the files
  2. Save the file as tab-separated variable (.tsv) instead of the default comma-separated variable (.csv).

Documentation on the Pandas package's DataFrame.to_csv method:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

6. Retrieve thumbnail cutouts of galaxies and show them on screen

SkyServer, another SciServer component, has a service that will produce a color JPG image cutout of certain dimensions around a specified position, useful for creating thumbnails.

The service creates the thumbnail using a pre-defined image pyramid. For a single image, you can construct the URL of the service using your query results, then use the skimage package to call the URL. To get all thumbnails in your query result, you can iterate using a loop.

The code block below gives an example of how to retrieve JPG thumbnails of galaxies in DR12. We need to create a URL for accessing the serice and set the parameters appropriately to produce nice thumbnails.

Note, the SQL query aboove was designed to produce positions of some nice looking galaxies.

In [28]:
width=200
height=200
pixelsize=0.396
plt.figure(figsize=(15, 15))
subPlotNum = 1

for index,gal in gals.iterrows():
    # the 'scale' parameter is set so that the image will be about 2x the size of the galaxy
    scale=2*gal['petror90_r']/pixelsize/width
    url="http://skyservice.pha.jhu.edu/DR12/ImgCutout/getjpeg.aspx?ra="+str(gal['ra'])
    url+="&dec="+str(gal['dec'])+"&scale="""+str(scale)+"&width="+str(width)
    url+="&height="+str(height)
    img=skimage.io.imread(url)
    plt.subplot(4,4,subPlotNum)
    subPlotNum += 1
    plt.imshow(img)
    # show the object identifier (objId) above the image.
    plt.title(index)

7. Write thumbnails to SciDrive

SciDrive is a new component of SciServer. It allows you to save query results as flat files in a Dropbox-like interface you can access anywhere.

The version of SciDrive this notebook connects to is not the same as the pre-production version you may have used before. Use the link below to access this test version of SciDrive. You should have no containers in this SciDrive yet.

Check your test SciDrive at:
http://scitest09.pha.jhu.edu/scidrive/scidrive.html

If the above link does not show a proper view of scidrive, with folders etc., please let us know, and do not run the rest of the code in this notebook until we investigate.

The three code blocks below work together to write the thumbnails you generated in step 6 into your test SciDrive.

In [29]:
# Step 7a: a function for generating a public URL for resources stored in SciDrive
# TODO this should be isolated as a part of the SciServer.SciDrive library
def scidrivePublicURL(path):
    req = urllib.request.Request(url=SciServer.Config.SciDriveHost+'/vospace-2.0/1/media/sandbox/'+path,method='GET') 
    req.add_header('X-Auth-Token', token) 
    req.add_header('Content-Type','application/xml') 
    res=urllib.request.urlopen(req)
    jsonResponse = json.loads(res.read().decode())
    return jsonResponse['url']
In [30]:
# Step 7b: create a container (~folder) in your SciDrive to hold the thumbnail images
container = 'thumbnails'
# IMPORTANT: Only run next if the container does not yet exist. If you have already created 
#   the container, comment out the next line.
# Note the token that must be provided as it allows the system to connect you to the proper scidrive root folder.
SciServer.SciDrive.createContainer(container,token=token)
In [31]:
# Step 7c: Write thumbnails to SciDrive. You will see a confirmation message below
#   for each thumbnail.
width=200
height=200
pixelsize=0.396

# for later use we determine publicly accessible URLs to each thumbnail and store these in a separate list.
puburls=[]
for index,gal in gals.iterrows():
    scale=2*gal['petror90_r']/pixelsize/width
    url="http://skyservice.pha.jhu.edu/DR12/ImgCutout/getjpeg.aspx?ra="+str(gal['ra'])
    url+="&dec="+str(gal['dec'])+"&scale="""+str(scale)+"&width="+str(width)
    url+="&height="+str(height)
    req = urllib.request.Request(url=url,method='GET')
    res = urllib.request.urlopen(req)
    data=res.read()
    scidrivename_name = container+"/new_"+str(index)+".jpg"
# Here the file gets uploaded to the container
    SciServer.SciDrive.upload(scidrivename_name, data,token=token)
    puburls.append(scidrivePublicURL(scidrivename_name))
    
# add the column of public urls to the original pandas.DataFrame
gals['pubURL']=puburls
{"size":"3.9 KB","rev":1,"thumb_exists":false,"bytes":4034,"modified":"Tue, 26 Apr 2016 15:59:05 -0400","path":"/thumbnails/new_1237645941824356443.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"3.6 KB","rev":1,"thumb_exists":false,"bytes":3658,"modified":"Tue, 26 Apr 2016 15:59:07 -0400","path":"/thumbnails/new_1237645943978393694.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"4.2 KB","rev":1,"thumb_exists":false,"bytes":4266,"modified":"Tue, 26 Apr 2016 15:59:08 -0400","path":"/thumbnails/new_1237645943979114622.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"3.7 KB","rev":1,"thumb_exists":false,"bytes":3761,"modified":"Tue, 26 Apr 2016 15:59:09 -0400","path":"/thumbnails/new_1237645943979311221.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"3.9 KB","rev":1,"thumb_exists":false,"bytes":4036,"modified":"Tue, 26 Apr 2016 15:59:11 -0400","path":"/thumbnails/new_1237646588246688297.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"4.4 KB","rev":1,"thumb_exists":false,"bytes":4546,"modified":"Tue, 26 Apr 2016 15:59:12 -0400","path":"/thumbnails/new_1237648672921485632.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"4.1 KB","rev":1,"thumb_exists":false,"bytes":4236,"modified":"Tue, 26 Apr 2016 15:59:13 -0400","path":"/thumbnails/new_1237648672922337955.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"4 KB","rev":1,"thumb_exists":false,"bytes":4065,"modified":"Tue, 26 Apr 2016 15:59:14 -0400","path":"/thumbnails/new_1237648672922468973.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"4.4 KB","rev":1,"thumb_exists":false,"bytes":4481,"modified":"Tue, 26 Apr 2016 15:59:15 -0400","path":"/thumbnails/new_1237648673459077169.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"4.1 KB","rev":1,"thumb_exists":false,"bytes":4170,"modified":"Tue, 26 Apr 2016 15:59:18 -0400","path":"/thumbnails/new_1237648673459667002.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"4.1 KB","rev":1,"thumb_exists":false,"bytes":4228,"modified":"Tue, 26 Apr 2016 15:59:20 -0400","path":"/thumbnails/new_1237648673459667234.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"3.7 KB","rev":1,"thumb_exists":false,"bytes":3822,"modified":"Tue, 26 Apr 2016 15:59:22 -0400","path":"/thumbnails/new_1237648673460585255.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"4.3 KB","rev":1,"thumb_exists":false,"bytes":4439,"modified":"Tue, 26 Apr 2016 15:59:23 -0400","path":"/thumbnails/new_1237648673971437623.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"4.7 KB","rev":1,"thumb_exists":false,"bytes":4778,"modified":"Tue, 26 Apr 2016 15:59:24 -0400","path":"/thumbnails/new_1237648673992671592.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"4.9 KB","rev":1,"thumb_exists":false,"bytes":4992,"modified":"Tue, 26 Apr 2016 15:59:26 -0400","path":"/thumbnails/new_1237648673994965546.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}
{"size":"3.9 KB","rev":1,"thumb_exists":false,"bytes":4006,"modified":"Tue, 26 Apr 2016 15:59:27 -0400","path":"/thumbnails/new_1237648673995686549.jpg","is_dir":false,"icon":"file","root":"dropbox","mime_type":"application/file"}

Check your test SciDrive folder again. You should see a container called "thumbnails".

Double-click on the name to open the container. You should see the thumbnails you just saved!

Your test SciDrive URL:
http://scitest09.pha.jhu.edu/scidrive/scidrive.html

8. Store result in MyDB table

We'll store the results of our work in a table in your CasJobs/MyDB. This includes the result of your original query, with an extra column containing the public URL of the thumbnail corresponding to the galaxy retrieved from the database.

Check the state before: http://scitest02.pha.jhu.edu/CasJobs/MyDB.aspx

In [32]:
# add column with public urls to the galaxies table ...
gals['pubURL']=puburls
# show the table
gals
Out[32]:
ra dec petror90_r pubURL
objId
1237645941824356443 40.285690 -0.714957 14.72862 http://scitest09.pha.jhu.edu/vospace-2.0/data/863cd3b3-87a0-4646-bdb9-5c6748e9e526
1237645943978393694 55.123625 0.872487 18.85110 http://scitest09.pha.jhu.edu/vospace-2.0/data/f58e3dcd-9050-4613-95d5-6a555d717afd
1237645943979114622 56.847420 0.875488 15.66479 http://scitest09.pha.jhu.edu/vospace-2.0/data/9ba527b7-0153-4274-bfe1-27d6c6aa477e
1237645943979311221 57.248385 0.925979 14.60275 http://scitest09.pha.jhu.edu/vospace-2.0/data/65f7d281-843a-48da-b1a7-8cbfdcfc3663
1237646588246688297 80.135126 1.186679 11.14294 http://scitest09.pha.jhu.edu/vospace-2.0/data/7e01ed2e-5ef7-4d92-8a7e-676b5d93b00e
1237648672921485632 243.708876 -0.915653 15.01578 http://scitest09.pha.jhu.edu/vospace-2.0/data/4884b450-9cd7-4e20-a1c5-550c7d385d90
1237648672922337955 245.700633 -0.844301 10.51866 http://scitest09.pha.jhu.edu/vospace-2.0/data/b5650076-a991-45fa-8563-245340dccf44
1237648672922468973 246.015172 -0.902869 10.80825 http://scitest09.pha.jhu.edu/vospace-2.0/data/c06aac38-36bc-476f-a7e1-41a408738d02
1237648673459077169 245.367353 -0.457074 11.18379 http://scitest09.pha.jhu.edu/vospace-2.0/data/73b70550-6ca5-47c1-9e76-108ef60e6faf
1237648673459667002 246.695626 -0.510111 10.27149 http://scitest09.pha.jhu.edu/vospace-2.0/data/e065ef6f-6957-4f29-bc57-d86671ecd098
1237648673459667234 246.782081 -0.492432 11.98310 http://scitest09.pha.jhu.edu/vospace-2.0/data/e1a42965-664a-4f09-a795-a136c046fa28
1237648673460585255 248.894520 -0.452329 11.48170 http://scitest09.pha.jhu.edu/vospace-2.0/data/bb6ae24a-368e-495a-ae1b-32ca40b77674
1237648673971437623 189.522249 -0.027031 21.94494 http://scitest09.pha.jhu.edu/vospace-2.0/data/e6458b80-c827-4963-9860-e9a60cc51a15
1237648673992671592 237.945144 -0.105170 13.36310 http://scitest09.pha.jhu.edu/vospace-2.0/data/f66f95cd-a974-4c2b-8703-31ea6782c5e6
1237648673994965546 243.236782 -0.096251 26.27760 http://scitest09.pha.jhu.edu/vospace-2.0/data/125f47ce-bfac-4730-828a-d6977e408da2
1237648673995686549 244.875590 -0.038526 12.17980 http://scitest09.pha.jhu.edu/vospace-2.0/data/37b14114-385a-4586-a9a4-b15a9981518b
In [33]:
# to write to your MyDB, first create the table
# For technical reasons the column names must be exactly the same as the columns in the DataFrame.
# Note, skip this step if the table already exists
ddl = 'CREATE TABLE GalaxyThumbs(objId bigint, ra real, dec real, petror90_r real, pubURL varchar(128))'
response = SciServer.CasJobs.executeQuery(ddl,token=token)
# if no 200 OK is printed as result, something has gone wrong.
executeQuery POST response:  200 OK
In [34]:
# Now upload the data directly from the DataFrame
response=SciServer.CasJobs.uploadPandasDataFrameToTable(gals,"GalaxyThumbs2",token=token)
Uploading  2441 bytes...
uploadCVSDataFrameToTable POST response:  200 OK

Check the state of your MyDB after: http://scitest02.pha.jhu.edu/CasJobs/MyDB.aspx

Epilogue: Async queries to CasJobs

instead of executing a query, a query job can be submitted. Returns with a jobId, wbout which the status can be requested.

Important: for now the result of the query MUST be written explicitly to a table.

In [16]:
# async query example. Note the SELECT ... INTO ... pattern
query="""
SELECT TOP 16 p.objId,p.ra,p.dec,p.petror90_r
into MyDB.intro1query
  FROM galaxy AS p
   JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE p.u BETWEEN 0 AND 19.6
  AND p.g BETWEEN 0 AND 17
  AND p.petror90_r > 10
"""
jobId=CasJobs.submitJob(query, context = "DR12")
submitJob PUT response:  200 OK
In [18]:
# retrieve status about job
# returns a JSON string.
# job is complet is Status attribute in (3,4,5)
CasJobs.getJobStatus(jobId)
Out[18]:
{'AutoComplete': 0,
 'Created_Table': 'intro1query',
 'Estimate': 500,
 'HostIP': 'DR12_long      ',
 'JobID': 541025,
 'Message': 'Query Complete',
 'ModifiedQuery': 'SELECT TOP 16 p.objId,p.ra,p.dec,p.petror90_r\n\n  FROM galaxy AS p\n   JOIN SpecObj AS s ON s.bestobjid = p.objid\nWHERE p.u BETWEEN 0 AND 19.6\n  AND p.g BETWEEN 0 AND 17\n  AND p.petror90_r > 10/*CASJOBS_INTO_TABLE:[scitest02].mydb_958083076.mydb.intro1query*/',
 'OutputLoc': '',
 'OutputType': 'QUERY',
 'Params': '',
 'Query': '\nSELECT TOP 16 p.objId,p.ra,p.dec,p.petror90_r\ninto MyDB.intro1query\n  FROM galaxy AS p\n   JOIN SpecObj AS s ON s.bestobjid = p.objid\nWHERE p.u BETWEEN 0 AND 19.6\n  AND p.g BETWEEN 0 AND 17\n  AND p.petror90_r > 10\n',
 'Rows': 16,
 'SendEmail': 0,
 'Status': 5,
 'Target': 'DR12',
 'TaskName': 'REST Job Task',
 'TimeEnd': '2016-04-26T18:33:39.507',
 'TimeStart': '2016-04-26T18:33:38.517',
 'TimeSubmit': '2016-04-26T18:33:34.51',
 'WebServicesID': 958083076}
In [ ]: