ConnectomeDB, pyxnat, and the OHBM Hackathon

One of the datasets available for the OHBM Hackathon is the Q1 public data release from the Human Connectome Project. In addition to the imaging data, which are mirrored on S3 for easy access from AWS, a great deal of imaging metadata and associated non-imaging data is accessible through ConnectomeDB, a web application built on the XNAT imaging informatics platform.

pyxnat is a library that provides a Python language API to XNAT's RESTful web services. In this tutorial, we'll use pyxnat to access behavioral measures stored in ConnectomeDB and to view and download the imaging data. Even if you're not a Pythonista, read on, as the underlying XNAT REST API can be accessed from just about any language. I have small examples of code using the REST API in bash, Java, and Clojure, and I'd probably find it amusing to cook up an example in your favorite language; send me mail if you'd like details.
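
If you're curious what pyxnat is doing under the hood, here's a minimal sketch of the same kind of query made directly against the REST API with the third-party requests library (the /data/projects endpoint is standard XNAT; the exact JSON field names are from memory, so treat the details as illustrative):

Code Block
>>> import requests  # third-party; not required by the rest of this tutorial
>>> r = requests.get('https://db.humanconnectome.org/data/projects',
...                  params={'format': 'json'},
...                  auth=('mylogin', 'mypasswd'))
>>> [p['ID'] for p in r.json()['ResultSet']['Result']]
[u'HCP_Q1']
>>>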

Getting started

You'll need Python (version 2.7.x recommended) and pyxnat to follow along. Because of bugs in the released version of pyxnat, I recommend creating a Python virtual environment and installing a development version of pyxnat. I'm writing this using Python 2.7.1 on Mac OS X 10.7.5, but I regularly use pyxnat on Gentoo Linux; other people use pyxnat on other Linuxes and even Windows. It's also possible to use pyxnat on an Amazon EC2 instance. In principle, this all should work just about anywhere you can run Python, but send me mail if you run into trouble.
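
Once everything is installed, a quick sanity check at the Python prompt doesn't hurt; the version shown is just what I happen to be running:

Code Block
>>> import sys
>>> import pyxnat
>>> sys.version_info[:2]   # 2.7.x recommended
(2, 7)
>>>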

You'll also need to create an account on ConnectomeDB and agree to the HCP Open Access Data Use Terms.

We'll look at some behavioral measures in ConnectomeDB: the Non-Toolbox Data Measures, a variety of tests that aren't part of the NIH Toolbox. (NIH Toolbox scores are forthcoming but not available in the Q1 data release.) The non-Toolbox measures are documented in detail here; nontoolbox.xsd is an XML Schema document that specifies the non-Toolbox data type in ConnectomeDB. It's not particularly readable, but it does provide the exact naming conventions used in ConnectomeDB.

Let's start by firing up a Python session, loading pyxnat, and setting up a connection to ConnectomeDB.

Code Block
bash-3.2$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyxnat
>>> cdb=pyxnat.Interface('https://db.humanconnectome.org','mylogin','mypasswd')
>>>

This Interface object creates a session on ConnectomeDB. Be warned: if the session is idle for a while (say, because you're too busy reading documentation to keep typing), ConnectomeDB may close the session. You can tell that the session has gone stale if, when you try to do a query:

Code Block
>>> cdb.select.project('HCP_Q1').subject('100307').id()

you get a plateful of nonsense that looks like:

Code Block
['status', 'content-location', 'content-language', ...
200
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
...

The same error occurs if you provide the wrong username or password.

If this happens, whether due to timeout or mistyping, just create a new Interface:

Code Block
>>> cdb=pyxnat.Interface('https://db.humanconnectome.org','mylogin','mypasswd')

Any query result objects that you created from the stale Interface will also need to be refreshed. There's an example later in this tutorial.

Exploring the ConnectomeDB data hierarchy

ConnectomeDB's data is organized into projects, which are the main access control structure in XNAT. If you have access to a project, you can see that project's data. Let's see what projects we have access to:

Code Block
>>> cdb.select.projects().get()
['HCP_Q1']                           # maybe some others, depending on your access settings
>>>

cdb.select.projects() asks ConnectomeDB for project details and turns the result into a collection of project objects. The get() method returns the identifiers for each object in the collection. We could get the same result using a list comprehension; let's try that now, because that will be a more convenient form in general:

Code Block
>>> [project.id() for project in cdb.select.projects()]
['HCP_Q1']
>>>

Since we're interested in the HCP Q1 data, let's get a handle on just that project:

Code Block
>>> q1 = cdb.select.project('HCP_Q1')
>>>

Note that if the session goes stale, so will this object q1. So in addition to refreshing cdb, you'll probably need to refresh q1, too, by reissuing this command:

Code Block
>>> q1 = cdb.select.project('HCP_Q1')
>>>

Querying for Subjects in the Q1 Project

What's inside of this project object? Each project contains subjects and experiments. Let's look at the list of subjects:

Code Block
>>> [subject.label() for subject in q1.subjects()]
['100307', '103515', '103818', '111312', '114924', '117122', '118932', '119833', '120212', '125525', '128632', '130013', '137128', '138231', '142828', '143325', '144226', '149337', '150423', '153429', '156637', '159239', '161731', '162329', '167743', '172332', '182739', '191437', '192439', '192540', '194140', '197550', '199150', '199251', '200614', '201111', '210617', '217429', '249947', '250427', '255639', '304020', '307127', '329440', '499566', '530635', '559053', '585862', '638049', '665254', '672756', '685058', '729557', '732243', '792564', '826353', '856766', '859671', '861456', '865363', '877168', '889579', '894673', '896778', '896879', '901139', '917255', '937160', '131621', '355542', '611231', '144428', '230926', '235128', '707244', '733548']
>>>

We used subject.label() instead of subject.id(); inside the list comprehension, id() would have given the same result as q1.subjects().get(). Why label() instead of id()? The label is the human-readable name for the subject within a specified project (HCP_Q1 in our case); the first label in the list is 100307, which is the HCP-assigned name for that subject. The subject id is the XNAT site-wide unique identifier for that subject, a not-intended-for-human-consumption identifier; the id for subject 100307 is 'ConnectomeDB_S00230'. In principle, different projects might assign different labels to the same subject, or different subjects might share the same label in different projects. We aren't engaging in those sorts of shenanigans on ConnectomeDB, but we do inherit a little complexity from XNAT's flexibility.
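
To see the two identifiers side by side:

Code Block
>>> q1.subject('100307').label()
'100307'
>>> q1.subject('100307').id()
'ConnectomeDB_S00230'
>>>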

Querying for Experiments for each Subject

What data are available for subject 100307? Let's ask:

Code Block
>>> [expt.label() for expt in q1.subject('100307').experiments()]
['100307_3T', '100307_SubjMeta', '100307_NonToolbox']
>>>

There are three "experiments" here: 100307_3T contains the imaging data and associated metadata acquired on the HCP 3T Skyra; 100307_SubjMeta holds some bookkeeping about what data have been collected for this subject; and 100307_NonToolbox has the non-Toolbox scores. Again we use label() instead of id() (or get() on the experiments collection), because each project has a human-readable label for the experiment, whereas the id is the site-wide, XNAT-generated identifier.
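
The same distinction is easy to see from the Python prompt:

Code Block
>>> q1.subject('100307').experiment('100307_NonToolbox').id()
'ConnectomeDB_E00299'
>>>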

Exploring Experiment Data

The experiments are represented by XML documents; we can view the XML for 100307_NonToolbox to see what's inside:

Code Block
>>> nt_100307 = q1.subject('100307').experiment('100307_NonToolbox')
>>> print(nt_100307.get())
<?xml version="1.0" encoding="UTF-8"?>
<nt:NTScores ID="ConnectomeDB_E00299" project="HCP_Subjects" label="100307_NonToolbox" xmlns:arc="http://nrg.wustl.edu/arc" xmlns:val="http://nrg.wustl.edu/val" xmlns:pipe="http://nrg.wustl.edu/pipe" xmlns:hcp="http://nrg.wustl.edu/hcp" xmlns:wrk="http://nrg.wustl.edu/workflow" xmlns:scr="http://nrg.wustl.edu/scr" xmlns:xdat="http://nrg.wustl.edu/security" xmlns:nt="http://nrg.wustl.edu/nt" xmlns:cat="http://nrg.wustl.edu/catalog" xmlns:prov="http://www.nbirn.net/prov" xmlns:xnat="http://nrg.wustl.edu/xnat" xmlns:xnat_a="http://nrg.wustl.edu/xnat_assessments" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://nrg.wustl.edu/workflow https://db.humanconnectome.org/schemas/pipeline/workflow.xsd http://nrg.wustl.edu/catalog https://db.humanconnectome.org/schemas/catalog/catalog.xsd http://nrg.wustl.edu/pipe https://db.humanconnectome.org/schemas/pipeline/repository.xsd http://nrg.wustl.edu/hcp https://db.humanconnectome.org/schemas/HCPMetadata/metadata.xsd http://nrg.wustl.edu/nt https://db.humanconnectome.org/schemas/nontoolbox/nontoolbox.xsd http://nrg.wustl.edu/scr https://db.humanconnectome.org/schemas/screening/screeningAssessment.xsd http://nrg.wustl.edu/arc https://db.humanconnectome.org/schemas/project/project.xsd http://nrg.wustl.edu/val https://db.humanconnectome.org/schemas/validation/protocolValidation.xsd http://nrg.wustl.edu/xnat https://db.humanconnectome.org/schemas/xnat/xnat.xsd http://nrg.wustl.edu/xnat_assessments https://db.humanconnectome.org/schemas/assessments/assessments.xsd http://www.nbirn.net/prov https://db.humanconnectome.org/schemas/birn/birnprov.xsd http://nrg.wustl.edu/security https://db.humanconnectome.org/schemas/security/security.xsd">
<xnat:sharing>
<xnat:share label="100307_NonToolbox" project="HCP_Q1">
<!--hidden_fields[xnat_experimentData_share_id="1",sharing_share_xnat_experimentDa_id="ConnectomeDB_E00299"]-->
</xnat:share>
<xnat:share label="100307_NonToolbox" project="HCP_Q2">
<!--hidden_fields[xnat_experimentData_share_id="77",sharing_share_xnat_experimentDa_id="ConnectomeDB_E00299"]-->
</xnat:share>
</xnat:sharing>
<xnat:subject_ID>ConnectomeDB_S00230</xnat:subject_ID>
<nt:HCPPNP>
<nt:mars_log_score>1.76</nt:mars_log_score>
<nt:mars_errs>0</nt:mars_errs>
<nt:mars_final>1.76</nt:mars_final>
</nt:HCPPNP>
<nt:DDISC>
<nt:SV_1mo_200>103.13</nt:SV_1mo_200>
<nt:SV_6mo_200>46.88</nt:SV_6mo_200>
<nt:SV_1yr_200>103.13</nt:SV_1yr_200>
<nt:SV_3yr_200>21.88</nt:SV_3yr_200>
<nt:SV_5yr_200>21.88</nt:SV_5yr_200>
<nt:SV_10yr_200>9.38</nt:SV_10yr_200>
<nt:SV_1mo_40000>19375.0</nt:SV_1mo_40000>
<nt:SV_6mo_40000>29375.0</nt:SV_6mo_40000>
<nt:SV_1yr_40000>24375.0</nt:SV_1yr_40000>
<nt:SV_3yr_40000>9375.0</nt:SV_3yr_40000>
<nt:SV_5yr_40000>9375.0</nt:SV_5yr_40000>
<nt:SV_10yr_40000>9375.0</nt:SV_10yr_40000>
<nt:AUC_200>0.16217604</nt:AUC_200>
<nt:AUC_40000>0.31145853</nt:AUC_40000>
</nt:DDISC>
<nt:NEO>
<nt:NEO>144</nt:NEO>
<nt:NEOFAC_A>33</nt:NEOFAC_A>
<nt:NEOFAC_O>24</nt:NEOFAC_O>
<nt:NEOFAC_C>35</nt:NEOFAC_C>
<nt:NEOFAC_N>15</nt:NEOFAC_N>
<nt:NEOFAC_E>37</nt:NEOFAC_E>
</nt:NEO>
<nt:SPCPTNL>
<nt:SCPT_TP>59</nt:SCPT_TP>
<nt:SCPT_TN>115</nt:SCPT_TN>
<nt:SCPT_FP>5</nt:SCPT_FP>
<nt:SCPT_FN>1</nt:SCPT_FN>
<nt:SCPT_TPRT>412.0</nt:SCPT_TPRT>
<nt:SCPT_SEN>0.9833</nt:SCPT_SEN>
<nt:SCPT_SPEC>0.9583</nt:SCPT_SPEC>
<nt:SCPT_LRNR>11</nt:SCPT_LRNR>
</nt:SPCPTNL>
<nt:CPW>
<nt:IWRD_TOT>35</nt:IWRD_TOT>
<nt:IWRD_RTC>1442.0</nt:IWRD_RTC>
</nt:CPW>
<nt:PMAT24A>
<nt:PMAT24_A_CR>17</nt:PMAT24_A_CR>
<nt:PMAT24_A_SI>2</nt:PMAT24_A_SI>
<nt:PMAT24_A_RTCR>11839.0</nt:PMAT24_A_RTCR>
</nt:PMAT24A>
<nt:VSPLOT24>
<nt:VSPLOT_TC>9</nt:VSPLOT_TC>
<nt:VSPLOT_CRTE>834.3</nt:VSPLOT_CRTE>
<nt:VSPLOT_OFF>29</nt:VSPLOT_OFF>
</nt:VSPLOT24>
<nt:ER40>
<nt:ER40_CR>39</nt:ER40_CR>
<nt:ER40_CRT>1471.0</nt:ER40_CRT>
<nt:ER40ANG>8</nt:ER40ANG>
<nt:ER40FEAR>8</nt:ER40FEAR>
<nt:ER40HAP>8</nt:ER40HAP>
<nt:ER40NOE>8</nt:ER40NOE>
<nt:ER40SAD>7</nt:ER40SAD>
</nt:ER40>
<nt:ASR>
<nt:ASRSyndromeScores>
<nt:ASR_anxdp_raw>3</nt:ASR_anxdp_raw>
<nt:ASR_anxdp_t>50</nt:ASR_anxdp_t>
<nt:ASR_wthdp_raw>0</nt:ASR_wthdp_raw>
<nt:ASR_wthdp_t>50</nt:ASR_wthdp_t>
<nt:ASR_som_raw>0</nt:ASR_som_raw>
<nt:ASR_som_t>50</nt:ASR_som_t>
<nt:ASR_tho_raw>1</nt:ASR_tho_raw>
<nt:ASR_tho_t>50</nt:ASR_tho_t>
<nt:ASR_att_raw>1</nt:ASR_att_raw>
<nt:ASR_att_t>50</nt:ASR_att_t>
<nt:ASR_agg_raw>3</nt:ASR_agg_raw>
<nt:ASR_agg_t>51</nt:ASR_agg_t>
<nt:ASR_rule_raw>1</nt:ASR_rule_raw>
<nt:ASR_rule_t>51</nt:ASR_rule_t>
<nt:ASR_int_raw>1</nt:ASR_int_raw>
<nt:ASR_int_t>50</nt:ASR_int_t>
<nt:ASR_other_raw>8</nt:ASR_other_raw>
<nt:ASR_critical_raw>2</nt:ASR_critical_raw>
<nt:ASR_cmp_internalizing_raw>3</nt:ASR_cmp_internalizing_raw>
<nt:ASR_cmp_internalizing_t>39</nt:ASR_cmp_internalizing_t>
<nt:ASR_cmp_externalizing_raw>5</nt:ASR_cmp_externalizing_raw>
<nt:ASR_cmp_externalizing_t>46</nt:ASR_cmp_externalizing_t>
<nt:ASR_cmp_other_raw>10</nt:ASR_cmp_other_raw>
<nt:ASR_cmp_total_raw>18</nt:ASR_cmp_total_raw>
<nt:ASR_cmp_total_t>40</nt:ASR_cmp_total_t>
</nt:ASRSyndromeScores>
<nt:ASRDsmScores>
<nt:DSM_dep_raw>1</nt:DSM_dep_raw>
<nt:DSM_dep_t>50</nt:DSM_dep_t>
<nt:DSM_anx_raw>3</nt:DSM_anx_raw>
<nt:DSM_anx_t>50</nt:DSM_anx_t>
<nt:DSM_som_raw>0</nt:DSM_som_raw>
<nt:DSM_som_t>50</nt:DSM_som_t>
<nt:DSM_avoid_raw>1</nt:DSM_avoid_raw>
<nt:DSM_avoid_t>50</nt:DSM_avoid_t>
<nt:DSM_adh_raw>4</nt:DSM_adh_raw>
<nt:DSM_adh_t>51</nt:DSM_adh_t>
<nt:DSM_inatt_raw>1</nt:DSM_inatt_raw>
<nt:DSM_hyp_raw>3</nt:DSM_hyp_raw>
<nt:DSM_asoc_raw>2</nt:DSM_asoc_raw>
<nt:DSM_asoc_t>51</nt:DSM_asoc_t>
</nt:ASRDsmScores>
</nt:ASR>
</nt:NTScores>
>>>

That's a lot of stuff. Let's take it line-by-line.

The first line, <?xml version="1.0" ... , just tells us that this is an XML document.

The second line, <nt:NTScores ID="ConnectomeDB_E00299" ..., is the start of the actual content. It tells us that this is a N(on)T(oolbox)Scores document, gives us the experiment ID (the XNAT site-wide identifier), the project ID, and the experiment label (the human-readable, in-project-context name), and ends with a bunch of namespace information in case we want to validate this document against the schema we were looking at earlier. (I don't. You're welcome to if you like.)

The next few lines, <xnat:sharing> through </xnat:sharing>, tell us what projects know about this experiment. We can skip over this. (Yes, there's an HCP_Q2 project, but the data are not mirrored on S3, so I'm focusing on the Q1 data.)

Next comes the subject ID; again, this is the XNAT site-wide ID, not the human-readable name (label). We can use pyxnat to ask ConnectomeDB for the label in a specified project:

Code Block
>>> q1.subject('ConnectomeDB_S00230').label()
'100307'
>>>

After that come the scores (and lots of them), organized into a few groups. The schema document nontoolbox.xsd may be useful in helping to decipher this. We can ask for individual scores by walking the XML DOM:

Code Block
>>> nt = q1.subject('100307').experiment('100307_NonToolbox')
>>> nt.xpath('nt:ER40/nt:ER40_CR')
[<Element {http://nrg.wustl.edu/nt}ER40_CR at 0x102065370>]
>>> nt.xpath('nt:ER40/nt:ER40_CR')[0].text
'39'
>>>

That's a slow way of retrieving scores, since we need a full HTTP request and response for each field. (Actually, pyxnat does some caching, so the requests aren't repeated. Probably. Usually. I'd still recommend doing something else.) If we want multiple scores (either more than one score from a single experiment, or one or more scores from each of multiple experiments), there are more efficient methods.

Let's start with selecting multiple scores for a single experiment. A reasonable approach is to grab and parse the entire experiment XML document, using the Python standard library module ElementTree:

Code Block
>>> import xml.etree.ElementTree as ET
>>> nt_dom = ET.fromstring(nt.get())
>>> nt_dom.tag
'{http://nrg.wustl.edu/nt}NTScores'
>>> er40 = nt_dom.find('{http://nrg.wustl.edu/nt}ER40')
>>> [[e.tag,e.text] for e in er40]
[['{http://nrg.wustl.edu/nt}ER40_CR', '39'], ['{http://nrg.wustl.edu/nt}ER40_CRT', '1471.0'], ['{http://nrg.wustl.edu/nt}ER40ANG', '8'], ['{http://nrg.wustl.edu/nt}ER40FEAR', '8'], ['{http://nrg.wustl.edu/nt}ER40HAP', '8'], ['{http://nrg.wustl.edu/nt}ER40NOE', '8'], ['{http://nrg.wustl.edu/nt}ER40SAD', '7']]
>>>

Getting scores from multiple experiments can be done either by iterating over experiment IDs with the methods described above (single-attribute or XML document requests), or by using the pyxnat search interface.
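
Here's a minimal sketch of the iterating approach, reusing the ET import from above. It assumes every Q1 subject has a <label>_NonToolbox experiment (as we saw for 100307) and does no error handling; it also costs at least one HTTP round trip per subject, which is exactly why the search interface is usually the better choice:

Code Block
>>> neo_a = {}
>>> for subject in q1.subjects():
...     # assumes the <label>_NonToolbox naming convention holds for every subject
...     expt = subject.experiment(subject.label() + '_NonToolbox')
...     dom = ET.fromstring(expt.get())
...     e = dom.find('{http://nrg.wustl.edu/nt}NEO/{http://nrg.wustl.edu/nt}NEOFAC_A')
...     if e is not None:
...         neo_a[subject.label()] = e.text
...
>>> neo_a['100307']
'33'
>>>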

Searching on ConnectomeDB

Retrieving values for multiple subjects or experiments is usually best done through the pyxnat search interface.

Simple searching

Let's start with an example: getting all of the NEO-FFI scores for all subjects in project 'HCP_Q1', with the corresponding subject labels. Parts of this example won't yet make sense, but we'll march ahead to a result, then backtrack to fill in the missing details.

First, we identify the fields we want to retrieve:
Code Block
>>> xsitype = 'nt:scores'
>>> fields = ['SUBJECT_ID']+['NEO_NEOFAC_{}'.format(e) for e in 'AOCNE']
>>> fields = ['{}/{}'.format(xsitype, f) for f in fields]
>>> fields
['nt:scores/SUBJECT_ID',
 'nt:scores/NEO_NEOFAC_A', 'nt:scores/NEO_NEOFAC_O', 
'nt:scores/NEO_NEOFAC_C', 'nt:scores/NEO_NEOFAC_N', 
'nt:scores/NEO_NEOFAC_E']
>>>

Next, we build a constraint to use only subjects in project HCP_Q1:

Code Block
>>> constraints = [(xsitype+'/PROJECTS', 'LIKE', '%HCP_Q1%')]
>>>

Now, we run the search:

Code Block
>>> table = cdb.select(xsitype, fields).where(constraints)
>>> table
<JsonTable 76:6> subject_id,neo_neofac_a ... neo_neofac_n,neo_neofac_o
>>>

The result is an instance of a pyxnat-defined class (JsonTable). The most useful methods on this class are headers(), which shows the ordering of the fields in each row, and items(), which returns the content as a list of rows, each row a tuple:

Code Block
>>> table.headers()
['subject_id', 'neo_neofac_a', 'neo_neofac_c', 'neo_neofac_e', 'neo_neofac_n', 'neo_neofac_o']
>>> table.items()
[('ConnectomeDB_S00230', '33', '24', '35', '15', '37'), ('ConnectomeDB_S00231', '19', '27', '32', '25', '25'), ...

ConnectomeDB gave us subject IDs instead of labels. Let's reorder this result into a dictionary with subject labels as the keys:

Code Block
>>> neoffi_fields = table.headers()[1:]
>>> neoffi = {q1.subject(row[0]).label():row[1:] for row in table.items()}
>>> neoffi.keys()
['143325', '937160', '889579', '707244', '197550', ... ]
>>>

Now it's easy to view the scores for a single subject:

Code Block
>>> dict(zip(neoffi_fields, neoffi['143325']))
{'neo_neofac_a': '28', 'neo_neofac_c': '25', 'neo_neofac_e': '32', 'neo_neofac_n': '25', 'neo_neofac_o': '15'}
>>>

We can also do simple local searches with data in this form. For example, here are all the subjects with agreeableness (neo_neofac_a) score higher than 40:

Code Block
>>> ia = neoffi_fields.index('neo_neofac_a')
>>> [k for [k,v] in neoffi.items() if v[ia] and int(v[ia]) > 40]
['792564', '877168', '732243', '149337']
>>>

Simple searching (once more, with details)

Now that we've seen that it's possible to do something with the search interface, let's dig into the details.

Naming fields in searches

Our first step was to build a list naming the fields of interest:

Code Block
>>> xsitype = 'nt:scores'
>>> fields = ['SUBJECT_ID']+['NEO_NEOFAC_{}'.format(e) for e in 'AOCNE']
>>> fields = ['{}/{}'.format(xsitype, f) for f in fields]
>>> fields
['nt:scores/SUBJECT_ID',
 'nt:scores/NEO_NEOFAC_A', 'nt:scores/NEO_NEOFAC_O', 
'nt:scores/NEO_NEOFAC_C', 'nt:scores/NEO_NEOFAC_N', 
'nt:scores/NEO_NEOFAC_E']
>>>

All of the fields we're retrieving are defined in the nt:scores datatype; we need to include the datatype name as a prefix to the field names. If you look at nontoolbox.xsd, the XML Schema document where the scores datatype is defined, you'll see the definitions of all these fields, but with a twist: none of the names in the schema match the names we're using above. For example, the NEO fields are all defined inside a wrapper element. We might expect the fields to have names like NEO/NEOFAC_A; instead, we have names like NEO_NEOFAC_A.

Why does the pyxnat search interface use a different field naming convention than the datatype definitions? It's a historical accident, and XNAT's fault. The search interface uses a field naming system that is independent of field naming elsewhere in the application; the mapping between search-service field names and everywhere-else field names is defined in the display document, which primarily specifies how data types appear in the web application. The display document for nt:scores lists all the fields that can be displayed in the web application, with some information about how they are to be displayed. An excerpt follows:

Code Block: nt:scores display document excerpt (html/xml)
<?xml version="1.0" encoding="UTF-8"?>
<Displays xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="../../xdat/display.xsd" schema-element="nt:scores" full-description="NTScores" brief-description="NTScores">
        <Arc name="PARTICIPANT_EXPERIMENT">
                <CommonField id="PART_ID" local-field="SUBJECT_ID"/>
                <CommonField id="DATE" local-field="DATE"/>
                <CommonField id="EXPT_ID" local-field="EXPT_ID"/>
        </Arc>
        <DisplayField id="SUBJECT_ID" header="Subject" visible="true" searchable="true">
                <DisplayFieldElement name="Field1" schema-element="nt:scores.subject_ID"/>
                <HTML-Link>
                        <Property name="HREF" value="none"/>
                        <Property name="ONCLICK" value="return rpt('@Field1','xnat:subjectData','xnat:subjectData.ID');">
                                <InsertValue id="Field1" field="SUBJECT_ID"/>
                        </Property>
                </HTML-Link>
                <description>Subject Label</description>
        </DisplayField>
        <DisplayField id="EXPT_ID" header="ID" visible="true" searchable="true">
                <DisplayFieldElement name="Field1" schema-element="nt:scores.ID"/>
                <HTML-Link>
                        <Property name="HREF" value="none"/>
                        <Property name="ONCLICK" value="return rpt('@Field1','nt:scores','nt:scores.ID');">
                                <InsertValue id="Field1" field="EXPT_ID"/>
                        </Property>
                </HTML-Link>
                <description>Experiment ID</description>
        </DisplayField>
...
        <DisplayField id="NEO_NEOFAC_A" header="NEOFAC_A" visible="true" searchable="true">
                <DisplayFieldElement name="Field1" schema-element="nt:scores/NEO/NEOFAC_A"/>
                <description>NEO Factor A (Agreeableness)</description>
        </DisplayField>
 ...

Each field that can be accessed through the search interface has a corresponding DisplayField element; the id attribute is the name by which the field is addressed in searches. Look carefully at the DisplayField element with id="NEO_NEOFAC_A": it has a DisplayFieldElement subelement whose schema-element attribute contains the field's name as we'd find it in the schema.

The process for finding the search interface field identifiers, then, is:

  1. Determine from the data type documentation what fields you'll be using.
  2. Find the field in the schema document (nontoolbox.xsd for the non-Toolbox Data Measures).
  3. Find the corresponding DisplayField in the display document (nt_scores_display.xml for the non-Toolbox scores data type nt:scores), then use the id attribute value, prefixed with the data type name.

In many cases, the description in the display document is sufficient information and you can skip step 2.
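
If you'd rather not eyeball the display document by hand, you can script the extraction. Here's a sketch, assuming you've saved a local copy as nt_scores_display.xml and still have xml.etree.ElementTree imported as ET:

Code Block
>>> disp = ET.parse('nt_scores_display.xml').getroot()
>>> mapping = {df.get('id'): df.find('DisplayFieldElement').get('schema-element')
...            for df in disp.findall('DisplayField')
...            if df.find('DisplayFieldElement') is not None}
>>> mapping['NEO_NEOFAC_A']
'nt:scores/NEO/NEOFAC_A'
>>>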

Search constraints

Next, we build a constraint to use only subjects in project HCP_Q1:

Code Block
>>> constraints = [(xsitype+'/PROJECTS', 'LIKE', '%HCP_Q1%')]

 What's happening here? The nt:scores display document defines a PROJECTS DisplayField that is a comma-separated list of all projects that can see the nt:scores experiment: the project that contains the experiment, plus any projects into which the experiment has been shared. Since XNAT checks this constraint by doing a SQL query, the value is compared with a LIKE operation – a pattern match. % is the wildcard character in SQL, so we're looking for HCP_Q1 anywhere in that comma-separated list.

Aside for the obsessively detail-oriented: We'd have to be more clever if there were a project named, say, HCP_Q1b. It's possible to conjoin LIKE constraints with NOT LIKE constraints to make this work, and I'll update this text if that becomes necessary. For now, the project names on ConnectomeDB are probably sufficiently distinct.
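
To make the aside concrete, such a constraint list would look something like this (untested, since no HCP_Q1b project actually exists):

Code Block
>>> constraints = [('nt:scores/PROJECTS', 'LIKE', '%HCP_Q1%'),
...                ('nt:scores/PROJECTS', 'NOT LIKE', '%HCP_Q1b%'),
...                'AND']
>>>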

Running the search and transforming the results

This part is mostly straightforward: we need to tell pyxnat what datatype we're searching for, what fields we want from it, and what constraints we want to apply. We use a dictionary comprehension to transform the results into a dictionary where the subject labels are the keys.

Code Block
>>> table = cdb.select(xsitype, fields).where(constraints)
>>> table
<JsonTable 76:6> subject_id,neo_neofac_a ... neo_neofac_n,neo_neofac_o
>>> neoffi_fields = table.headers()[1:]
>>> neoffi = {q1.subject(row[0]).label():row[1:] for row in table.items()}

The first field (index 0) is the subject ID, which we strip out of both the headers and the row contents using Python's slice operator. We then build a dictionary by looking up the subject labels (q1.subject(row[0]).label()), using those as keys, and using the rows with the subject ID removed as values.

Once we have the subject-to-values dictionary, we can build single-subject field-name-to-value dictionaries by zipping the field names together with the values for that subject, then building a dictionary from the resulting stream of key-value pairs:

Code Block
>>> dict(zip(neoffi_fields, neoffi['143325']))
{'neo_neofac_a': '28', 'neo_neofac_c': '25', 'neo_neofac_e': '32', 'neo_neofac_n': '25', 'neo_neofac_o': '15'}

Finally, since we've downloaded all the NEO scores, we can search in memory without asking ConnectomeDB anything else:

Code Block
>>> ia = neoffi_fields.index('neo_neofac_a')
>>> [k for [k,v] in neoffi.items() if v[ia] and int(v[ia]) > 40]
['792564', '877168', '732243', '149337']
>>>

Here we look up the index of the agreeableness score (ia is 0, since it's the first field left after the subject ID was stripped off), then extract the subject labels where there is an agreeableness score (v[ia] is truthy for non-empty strings, falsy for the empty string) and the value is greater than 40 (int(v[ia]) > 40).

Building complex searches

The pyxnat search interface documentation explains how to compose simple searches into complex ones, so I'll just present some examples.

This is the same search that we performed on the NEO-FFI scores above, except this time on the server side:

Code Block
>>> fields = ['SUBJECT_ID']+['NEO_NEOFAC_{}'.format(e) for e in 'AOCNE']
>>> fields = ['nt:scores/{}'.format(f) for f in fields]
>>> constraints = [('nt:scores/PROJECTS','LIKE','%HCP_Q1%'), ('nt:scores/NEO_NEOFAC_A', '>', '40'), 'AND']
>>> table = cdb.select('nt:scores',fields).where(constraints)
>>> [q1.subject(row[0]).label() for row in table.items()]
['149337', '732243', '792564', '877168']

Finding (and counting) subjects scoring from 30 to 34 total correct on the Computerized Penn Word Memory:

Code Block
>>> fields = ["nt:scores/"+n for n in ['SUBJECT_ID','CPW_IWRD_TOT','CPW_IWRD_RTC']]
>>> constraints = [('nt:scores/PROJECTS','LIKE','%HCP_Q1%'), ('nt:scores/CPW_IWRD_TOT', '>=', '30'), ('nt:scores/CPW_IWRD_TOT', '<', '35'), 'AND']
>>> table = cdb.select('nt:scores',fields).where(constraints)
>>> len(table)
26
>>> [q1.subject(r[0]).label() for r in table.items()]
['111312', '114924', '119833', '125525', '128632', '137128', '144226', '162329', '167743', '192439', '192540', '199150', '201111', '210617', '250427', '255639', '672756', '685058', '856766', '894673', '896778', '901139', '611231', '230926', '235128', '707244']

Finding subjects scoring either less than 30 or 35 or more total correct on the Computerized Penn Word Memory:

Code Block
>>> fields = ["nt:scores/"+n for n in ['SUBJECT_ID','CPW_IWRD_TOT','CPW_IWRD_RTC']]
>>> constraints = [('nt:scores/PROJECTS','LIKE','%HCP_Q1%'), [('nt:scores/CPW_IWRD_TOT', '<', '30'), ('nt:scores/CPW_IWRD_TOT', '>=', '35'), 'OR'], 'AND']
>>> table = cdb.select('nt:scores',fields).where(constraints)
>>> len(table)
50
>>> [q1.subject(r[0]).label() for r in table.items()]
['100307', '103515', '103818', '117122', '118932', '120212', '130013', '138231', '142828', '143325', '149337', '150423', '153429', '156637', '159239', '161731', '172332', '182739', '191437', '194140', '197550', '199251', '200614', '217429', '249947', '304020', '307127', '329440', '499566', '530635', '559053', '585862', '638049', '665254', '729557', '732243', '792564', '826353', '859671', '861456', '865363', '877168', '889579', '896879', '917255', '937160', '131621', '355542', '144428', '733548']

Now that OHBM and the hackathon are past, most of this tutorial has been moved to a generic tutorial on pyxnat and ConnectomeDB. A few topics, mostly related to AWS and the S3 mirror of the Q1 data, remain here.

Accessing imaging data

Before trying to access the data, it's important to understand what's in the Q1 release. The Q1 data release documentation describes the session structure and file layout in detail. The Q1 imaging data are mirrored on Amazon's S3, which is particularly useful for copying data into an EC2 instance. The Python library boto provides an interface for many of Amazon's web services, including S3. If you installed the HCP-customized pyxnat, you already have boto.

I'm working to extend pyxnat to translate between the internal storage paths on ConnectomeDB and the (differently organized) copy of the data on S3. You really don't need to wait for this, though: the example searches above produce subject labels, which is enough information to point you to the right data directories on S3 – for example, subject 100307's data can be found at s3://hcp.aws.amazon.com/q1/100307/. Consult the hackathon HCP data release announcement for more details.
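
For example, here's how you might turn the high-agreeableness subjects from the earlier search into S3 key prefixes, using the bucket layout described below:

Code Block
>>> high_agreeableness = ['792564', '877168', '732243', '149337']
>>> ['q1/{}/'.format(label) for label in high_agreeableness]
['q1/792564/', 'q1/877168/', 'q1/732243/', 'q1/149337/']
>>>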

Browsing the imaging data

In order to start exploring and using the S3 Q1 mirror, you'll need to set up your AWS account and get access to the Amazon-hosted data. This process will get you an access key and a secret key, which you'll use to authenticate against S3. From a Python command line, you can browse the Q1 data by getting a handle to the "bucket" where the data are stored:

Code Block
>>> from boto.s3.connection import S3Connection
>>> s3 = S3Connection('your-access-key','your-secret-key')
>>> bucket = s3.get_bucket('hcp.aws.amazon.com')

S3 is a key-value-oriented store, rather than a hierarchical filesystem, but the HCP Q1 data are stored with keys that echo a regular file system. boto's interface to S3 makes it easy to pretend you're walking a file tree. There is a single root element q1, and each subject is a child of that root:

Code Block
>>> [k.name for k in bucket.list('q1/','/')]
[u'q1/', u'q1/100307/', u'q1/103515/', u'q1/103818/', u'q1/111312/', u'q1/114924/', u'q1/117122/', u'q1/118932/', u'q1/119833/', u'q1/120212/', u'q1/125525/', u'q1/128632/', u'q1/130013/', u'q1/131621/', u'q1/137128/', u'q1/138231/', u'q1/142828/', u'q1/143325/', u'q1/144226/', u'q1/149337/', u'q1/150423/', u'q1/153429/', u'q1/156637/', u'q1/159239/', u'q1/161731/', u'q1/162329/', u'q1/167743/', u'q1/172332/', u'q1/182739/', u'q1/191437/', u'q1/192439/', u'q1/192540/', u'q1/194140/', u'q1/197550/', u'q1/199150/', u'q1/199251/', u'q1/200614/', u'q1/201111/', u'q1/210617/', u'q1/217429/', u'q1/249947/', u'q1/250427/', u'q1/255639/', u'q1/304020/', u'q1/307127/', u'q1/329440/', u'q1/355542/', u'q1/499566/', u'q1/530635/', u'q1/559053/', u'q1/585862/', u'q1/611231/', u'q1/638049/', u'q1/665254/', u'q1/672756/', u'q1/685058/', u'q1/729557/', u'q1/732243/', u'q1/792564/', u'q1/826353/', u'q1/856766/', u'q1/859671/', u'q1/861456/', u'q1/865363/', u'q1/877168/', u'q1/889579/', u'q1/894673/', u'q1/896778/', u'q1/896879/', u'q1/901139/', u'q1/917255/', u'q1/937160/']

Each subject is organized as described in the Q1 data release documentation.

Code Block
>>> [k.name for k in bucket.list('q1/100307/','/')]
[u'q1/100307/.xdlm/', u'q1/100307/Diffusion/', u'q1/100307/MNINonLinear/', u'q1/100307/T1w/', u'q1/100307/release-notes/', u'q1/100307/unprocessed/']

The three key fragments associated with the "minimally preprocessed" data are Diffusion, MNINonLinear, and T1w. The key fragment .xdlm marks file manifests for download integrity checking, including checksums, while unprocessed marks unprocessed data.

Downloading files from S3

Now let's copy all files for the motor task with left-to-right phase encoding from S3 to a local disk.

Code Block
>>> import errno,os,os.path
>>> for k in bucket.list('q1/100307/MNINonLinear/Results/tfMRI_MOTOR_LR/'):
...   dir = os.path.dirname(k.name)
...   try: os.makedirs(dir)
...   except OSError as exc:
...     if exc.errno == errno.EEXIST and os.path.isdir(dir): pass
...     else: raise
...   with open(k.name, 'wb') as f:  # binary mode, since these are imaging files
...     k.get_contents_to_file(f)
...
>>> 

Note that we use bucket.list(...) a little differently here: with one argument, it returns all keys starting with the provided text, which is comparable to a full recursive listing in a hierarchical file system.
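
To make the contrast concrete (the first listing repeats output we've already seen; the second form's output is a long list of individual file keys, so I'm not showing it):

Code Block
>>> # with a delimiter, only the immediate "children" of the prefix:
>>> [k.name for k in bucket.list('q1/100307/', '/')][:3]
[u'q1/100307/.xdlm/', u'q1/100307/Diffusion/', u'q1/100307/MNINonLinear/']
>>> # without a delimiter, every key under the prefix, recursively:
>>> motor_keys = [k.name for k in bucket.list('q1/100307/MNINonLinear/Results/tfMRI_MOTOR_LR/')]
>>>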

Accessing S3 data through the Subject object

The instructions above handle ConnectomeDB and S3 as different worlds; there is some limited support for bridging this gap. Pyxnat can maintain a local mirror of the ConnectomeDB data, with contents downloaded on request. The first piece you'll need is an object representing both sides of the mirror (the local file space and the S3 bucket):

Code Block
>>> from pyxnat.core.mirror import S3Mirror
>>> mirror = S3Mirror.open('hcp.aws.amazon.com','your-access-key','your-secret-key','/path/to/local/mirror')

Next, you'll need to hand this to the pyxnat Interface: 

Code Block
>>> cdb = pyxnat.Interface('https://db.humanconnectome.org','username','password',data_mirror=mirror)
>>>

You can now access local copies of the data files through the subject object: 

Code Block
>>> cdb.project('HCP_Q1').subject('100307').files('MNINonLinear/Results/tfMRI_MOTOR_LR')
[u'/path/to/local/mirror/q1/100307/MNINonLinear/Results/tfMRI_MOTOR_LR/...]

The return value from this call is a list of local file paths for the requested data. If the files are already on your disk, this will return quickly; if not, the files are downloaded before the files() call returns. You can request all of a subject's data by skipping the root path argument: 

Code Block
>>> cdb.project('HCP_Q1').subject('100307').files()

Note that this will most likely take a really long time, because a single subject is close to 20 GB, unless you have a very fat pipe to S3 (say, from an EC2 instance). (This is why I'm not even showing the return value. It's a long list of files. You should probably think about whether you really need all those files, and find a better way to get them if you do.)

Downloading data from ConnectomeDB using Aspera

An alternative method for downloading HCP imaging data is to use Aspera's fasp™ transport, a proprietary protocol supported by ConnectomeDB. This might be particularly attractive if you're downloading to your own desktop machine; also, while S3 mirrors only the Q1 data release, the Q2 data can be downloaded from ConnectomeDB.

You'll need the free (gratis, not libre) Aspera Connect browser plugin. The easiest way to get this plugin is to log in to ConnectomeDB; if you don't have the plugin installed, a message at the top of the screen will hector you until you do.

The HCP-customized version of pyxnat includes a class for browsing and downloading HCP data using the Aspera plugin. For example, we can list the available package types:

Code Block
>>> cdb.packages.list()
[u'3T_Structural_preproc', u'3T_Structural_unproc', u'3T_rfMRI_REST1_preproc', u'3T_rfMRI_REST1_unproc', u'3T_rfMRI_REST2_preproc', u'3T_rfMRI_REST2_unproc', u'3T_tfMRI_EMOTION_preproc', u'3T_tfMRI_EMOTION_unproc', u'3T_tfMRI_GAMBLING_preproc', u'3T_tfMRI_GAMBLING_unproc', u'3T_tfMRI_LANGUAGE_preproc', u'3T_tfMRI_LANGUAGE_unproc', u'3T_tfMRI_MOTOR_preproc', u'3T_tfMRI_MOTOR_unproc', u'3T_tfMRI_RELATIONAL_preproc', u'3T_tfMRI_RELATIONAL_unproc', u'3T_tfMRI_SOCIAL_preproc', u'3T_tfMRI_SOCIAL_unproc', u'3T_tfMRI_WM_preproc', u'3T_tfMRI_WM_unproc', u'3T_Diffusion_preproc', u'3T_Diffusion_unproc']
>>>

The Aspera-downloadable data are organized a little differently from the S3 mirror or Connectome-in-a-Box: each package is a zip file that contains the associated files. The zip files unpack into the same layout as S3 or C-in-a-B.
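
Unpacking is nothing special; here's a quick sketch with the standard library, using a hypothetical package file name (yours will depend on the subject and package you downloaded):

Code Block
>>> import zipfile
>>> z = zipfile.ZipFile('100307_3T_Structural_preproc.zip')  # hypothetical file name
>>> z.extractall('/data/hcp/q1')  # unpacks into the same layout as S3/C-in-a-B
>>>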

We can get archive file metadata about packages available for particular subjects: 

Code Block
>>> cdb.packages.for_subjects(['100307','125525'])
{u'packages':[{u'count': 476, u'subjects_total': 2, u'description': u'HCP Q1 Data Release Structural Preprocessed', u'subjects_ok': 2, u'keywords': [u'preprocessed', u'structural'], u'label': u'HCP Q1 Structural Preprocessed', u'id': u'3T_Structural_preproc', u'size': 2551381019}, ...

We can also use the Aspera plugin to download the archive files: 

Code Block
>>> cdb.packages.download(['100307'],['3T_Structural_preproc','3T_rfMRI_REST1_preproc'])

This places the .zip files (and .zip.md5 checksums) in your working directory; you can instead specify a directory to place the downloaded zip files: 

Code Block
>>> cdb.packages.download(['100307'],['3T_Structural_preproc','3T_rfMRI_REST1_preproc'], '/data/hcp/q1')
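
Since each package comes with a .zip.md5 checksum, it's worth verifying the download before unpacking. Here's a minimal sketch, assuming the .md5 file uses the usual md5sum format (hex digest, whitespace, file name); the package file name is hypothetical:

Code Block
>>> import hashlib
>>> def md5_ok(zip_path):
...     md5 = hashlib.md5()
...     with open(zip_path, 'rb') as f:
...         for chunk in iter(lambda: f.read(1 << 20), ''):  # read 1 MB at a time
...             md5.update(chunk)
...     with open(zip_path + '.md5') as f:
...         return md5.hexdigest() == f.read().split()[0]
...
>>> md5_ok('/data/hcp/q1/100307_3T_Structural_preproc.zip')  # hypothetical file name
True
>>>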

This download will take a while, even with the Aspera transport – and trust me, it's a lot faster than FTP/HTTP, especially if you're far from St. Louis. You are moving gigabytes of data through the rusty old intertubes. If you're going to download more than a few subjects, consider getting Connectome in a Box. The cost is comparable to the extra disk you were going to need to buy to hold the data anyway.

What's next?

More important than the features I'm working on to integrate metadata searches with data storage – which I only suspect will be useful -- is the actual experience of OHBM hackathon participants. Go write some code! If you're working through this tutorial or diving into the data, and you find rough edges that need to be smoothed (or big gaping holes that need to be filled), let me know, by either submitting a comment below or sending me mail.

See you in Seattle!
