

One of the datasets available for the OHBM Hackathon is the Q1 public data release from the Human Connectome Project. In addition to the imaging data, which are mirrored on S3 for easy access from AWS, a great deal of imaging metadata and associated non-imaging data is accessible through ConnectomeDB, a web application built on the XNAT imaging informatics platform.

pyxnat is a library that provides a Python language API to XNAT's RESTful web services. In this tutorial, we'll use pyxnat to access behavioral measures stored in ConnectomeDB. Even if you're not a Pythonista, read on, as the underlying XNAT REST API can be accessed from just about any language. I have small examples of code using the REST API in bash, Java, and Clojure, and I'd probably find it amusing to cook up an example in your favorite language; send me mail if you'd like details.
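To give a rough idea of what the REST API looks like underneath pyxnat, here is a minimal sketch that only builds the URL for a subject listing. The `/data/projects/<project>/subjects` path and the `format=json` parameter follow XNAT's usual REST conventions, but treat them as assumptions to check against your server; `subjects_url` is a made-up helper name, not part of pyxnat.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2.7


def subjects_url(server, project):
    """Build the REST URL that lists one project's subjects as JSON.

    Assumption: the server exposes the XNAT-style /data/projects/... path.
    """
    return '%s/data/projects/%s/subjects?format=json' % (
        server.rstrip('/'), quote(project, safe=''))


url = subjects_url('https://db.humanconnectome.org', 'HCP_Q1')
print(url)
```

Any HTTP client that can send your ConnectomeDB login along with the request should be able to fetch a URL like this one, which is why the language you use matters so little.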

Getting started

You'll need Python (version 2.7.x recommended) and pyxnat to follow along. Someday soon we'll have a hackathon-customized version of pyxnat to provide easier access to the S3-hosted data, but there's nothing AWS-specific about this introduction, so plain old pyxnat will be fine. I'm writing this using Python 2.7.1 on Mac OS X 10.7.5, but I regularly use pyxnat on Gentoo Linux; other people use pyxnat on other Linuxes and even Windows, and in principle this all should work just about anywhere you can run Python. Send me mail if you run into trouble.

Aside for Python experts: because I'm working on pyxnat and not just with it, I usually don't install pyxnat to the system Python; instead I set up a virtualenv and install to that. We'll probably have to do this in a later tutorial, as we start using not-yet-published pyxnat extensions for working with the S3-hosted data.

You'll also need to create an account on ConnectomeDB and agree to the HCP Open Access Data Use Terms.

We'll look at some behavioral measures in ConnectomeDB: the Non-Toolbox Data Measures, a variety of tests that aren't part of the NIH Toolbox. (NIH Toolbox scores are forthcoming but not available in the Q1 data release.) The non-Toolbox measures are documented in detail here. nontoolbox.xsd is an XML Schema document that specifies the non-Toolbox data type in ConnectomeDB; it's not particularly readable, but it does provide the exact naming conventions used in ConnectomeDB.

Let's start by firing up a Python session, loading pyxnat, and setting up a connection to ConnectomeDB.

Code Block
bash-3.2$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyxnat
>>> cdb=pyxnat.Interface('https://db.humanconnectome.org','mylogin','mypasswd')
>>>

This Interface object creates a session on ConnectomeDB. Be warned: if the session sits idle for a while (say, because you're too busy reading documentation to keep typing), ConnectomeDB may close it. You can tell that the session has gone stale if, when you try to do a query:

Code Block
>>> cdb.select.project('HCP_Q1').subject('100307').id()

you get a plateful of nonsense that looks like:

Code Block
['status', 'content-location', 'content-language', ...
200
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
...

If this happens, just create a new Interface:

Code Block
>>> cdb=pyxnat.Interface('https://db.humanconnectome.org','mylogin','mypasswd')

Any query result objects that you created from the stale Interface will also need to be refreshed. There's an example later in this tutorial.
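If you'd rather not babysit the session by hand, one option is to wrap queries in a small retry helper. The sketch below is only an illustration: `run_with_fresh_session` is a hypothetical name, it assumes a stale session shows up as some Python exception (as the traceback above suggests), and it is exercised here with stand-in functions rather than a live pyxnat connection.

```python
def run_with_fresh_session(connect, query, retries=1):
    """Run query(interface); if it raises, reconnect and try again.

    connect: zero-argument callable returning a new Interface-like object.
    query:   one-argument callable that takes the interface and returns a result.
    """
    interface = connect()
    for attempt in range(retries + 1):
        try:
            return query(interface)
        except Exception:  # assumption: a stale session surfaces as an exception
            if attempt == retries:
                raise
            interface = connect()  # build a brand-new session and retry


# Stand-ins for demonstration only: the first "session" is stale, the second works.
sessions_made = []

def fake_connect():
    sessions_made.append(len(sessions_made) + 1)
    return sessions_made[-1]

def fake_query(session):
    if session == 1:
        raise RuntimeError('stale session')
    return 'answer from session %d' % session

result = run_with_fresh_session(fake_connect, fake_query)
print(result)
```

With pyxnat, `connect` would be something like `lambda: pyxnat.Interface('https://db.humanconnectome.org', 'mylogin', 'mypasswd')`, and `query` would build its selection from the interface it is handed, so nothing stale is reused.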

Exploring the ConnectomeDB data hierarchy

ConnectomeDB's data is organized into projects, which are the main access control structure in XNAT. If you have access to a project, you can see that project's data. Let's see what projects we have access to:

Code Block
>>> cdb.select.projects().get()
['HCP_Q1']                           # maybe some others, depending on your access settings
>>>

cdb.select.projects() asks ConnectomeDB for project details and turns the result into a collection of project objects. The get() method returns the identifiers for each object in the collection. We could get the same result using a list comprehension; let's try that now, because that will be a more convenient form in general:

Code Block
>>> [project.id() for project in cdb.select.projects()]
['HCP_Q1']
>>>

Since we're interested in the HCP Q1 data, let's get a handle on just that project:

Code Block
>>> q1 = cdb.select.project('HCP_Q1')
>>>

Note that if the session goes stale, so will this object q1. So in addition to refreshing cdb, you'll probably need to refresh q1, too, by reissuing this command:

Code Block
>>> q1 = cdb.select.project('HCP_Q1')
>>>

Querying for Subjects in the Q1 Project

What's inside of this project object? Each project contains subjects and experiments. Let's look at the list of subjects:

Code Block
>>> [subject.label() for subject in q1.subjects()]
['100307', '103515', '103818', '111312', '114924', '117122', '118932', '119833', '120212', '125525', '128632', '130013', '137128', '138231', '142828', '143325', '144226', '149337', '150423', '153429', '156637', '159239', '161731', '162329', '167743', '172332', '182739', '191437', '192439', '192540', '194140', '197550', '199150', '199251', '200614', '201111', '210617', '217429', '249947', '250427', '255639', '304020', '307127', '329440', '499566', '530635', '559053', '585862', '638049', '665254', '672756', '685058', '729557', '732243', '792564', '826353', '856766', '859671', '861456', '865363', '877168', '889579', '894673', '896778', '896879', '901139', '917255', '937160', '131621', '355542', '611231', '144428', '230926', '235128', '707244', '733548']
>>>

We used subject.label() instead of subject.id(); using id() inside the list comprehension would have given the same result as q1.subjects().get(). Why label() instead of id()? The label is the human-readable name for the subject within a specified project (HCP_Q1 in our case); the first label in the list is 100307, which is the HCP-assigned name for that subject. The subject id is the XNAT site-wide unique identifier for that subject, a not-intended-for-human-consumption identifier; the id for subject 100307 is 'ConnectomeDB_S00230'. In principle, different projects might assign different labels to the same subject, or different subjects might share the same label in different projects. We aren't engaging in those sorts of shenanigans on ConnectomeDB, but we do inherit a little complexity from XNAT's flexibility.
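If you find yourself translating between labels and ids a lot, a lookup table may help. This is a sketch under stated assumptions: `FakeSubject` is a stand-in class that mimics only the label() and id() methods used above, and `label_to_id` is a hypothetical helper, not a pyxnat function.

```python
class FakeSubject(object):
    """Stand-in for a pyxnat subject object; only label() and id() are mimicked."""

    def __init__(self, label, xnat_id):
        self._label = label
        self._id = xnat_id

    def label(self):
        return self._label

    def id(self):
        return self._id


def label_to_id(subjects):
    """Map each subject's project label to its site-wide XNAT id."""
    return dict((s.label(), s.id()) for s in subjects)


# The one label/id pair quoted in this tutorial.
subjects = [FakeSubject('100307', 'ConnectomeDB_S00230')]
ids = label_to_id(subjects)
print(ids['100307'])
```

With a live connection you would pass `q1.subjects()` instead of the stand-in list.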

Querying for Experiments for each Subject

What data are available for subject 100307? Let's ask:

Code Block
>>> [expt.label() for expt in q1.subject('100307').experiments()]
['100307_3T', '100307_SubjMeta', '100307_NonToolbox']
>>>

There are three "experiments" here: 100307_3T contains the imaging data and associated metadata acquired on the HCP 3T Skyra; 100307_SubjMeta holds some bookkeeping about what data have been collected for this subject; and 100307_NonToolbox has the non-Toolbox scores. Again we use label() instead of id() (or get() on the experiments collection), because each project has a human-readable label for the experiment, whereas the id is the site-wide, XNAT-generated identifier.

Exploring Experiment Data

The experiments are represented by XML documents; we can view the XML for 100307_NonToolbox to see what's inside:

Code Block
>>> nt_100307 = q1.subject('100307').experiment('100307_NonToolbox')
>>> print(nt_100307.get())
<?xml version="1.0" encoding="UTF-8"?>
<nt:NTScores ID="ConnectomeDB_E00299" project="HCP_Subjects" label="100307_NonToolbox" xmlns:arc="http://nrg.wustl.edu/arc" xmlns:val="http://nrg.wustl.edu/val" xmlns:pipe="http://nrg.wustl.edu/pipe" xmlns:hcp="http://nrg.wustl.edu/hcp" xmlns:wrk="http://nrg.wustl.edu/workflow" xmlns:scr="http://nrg.wustl.edu/scr" xmlns:xdat="http://nrg.wustl.edu/security" xmlns:nt="http://nrg.wustl.edu/nt" xmlns:cat="http://nrg.wustl.edu/catalog" xmlns:prov="http://www.nbirn.net/prov" xmlns:xnat="http://nrg.wustl.edu/xnat" xmlns:xnat_a="http://nrg.wustl.edu/xnat_assessments" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://nrg.wustl.edu/workflow https://db.humanconnectome.org/schemas/pipeline/workflow.xsd http://nrg.wustl.edu/catalog https://db.humanconnectome.org/schemas/catalog/catalog.xsd http://nrg.wustl.edu/pipe https://db.humanconnectome.org/schemas/pipeline/repository.xsd http://nrg.wustl.edu/hcp https://db.humanconnectome.org/schemas/HCPMetadata/metadata.xsd http://nrg.wustl.edu/nt https://db.humanconnectome.org/schemas/nontoolbox/nontoolbox.xsd http://nrg.wustl.edu/scr https://db.humanconnectome.org/schemas/screening/screeningAssessment.xsd http://nrg.wustl.edu/arc https://db.humanconnectome.org/schemas/project/project.xsd http://nrg.wustl.edu/val https://db.humanconnectome.org/schemas/validation/protocolValidation.xsd http://nrg.wustl.edu/xnat https://db.humanconnectome.org/schemas/xnat/xnat.xsd http://nrg.wustl.edu/xnat_assessments https://db.humanconnectome.org/schemas/assessments/assessments.xsd http://www.nbirn.net/prov https://db.humanconnectome.org/schemas/birn/birnprov.xsd http://nrg.wustl.edu/security https://db.humanconnectome.org/schemas/security/security.xsd">
<xnat:sharing>
<xnat:share label="100307_NonToolbox" project="HCP_Q1">
<!--hidden_fields[xnat_experimentData_share_id="1",sharing_share_xnat_experimentDa_id="ConnectomeDB_E00299"]-->
</xnat:share>
<xnat:share label="100307_NonToolbox" project="HCP_Q2">
<!--hidden_fields[xnat_experimentData_share_id="77",sharing_share_xnat_experimentDa_id="ConnectomeDB_E00299"]-->
</xnat:share>
</xnat:sharing>
<xnat:subject_ID>ConnectomeDB_S00230</xnat:subject_ID>
<nt:HCPPNP>
<nt:mars_log_score>1.76</nt:mars_log_score>
<nt:mars_errs>0</nt:mars_errs>
<nt:mars_final>1.76</nt:mars_final>
</nt:HCPPNP>
<nt:DDISC>
<nt:SV_1mo_200>103.13</nt:SV_1mo_200>
<nt:SV_6mo_200>46.88</nt:SV_6mo_200>
<nt:SV_1yr_200>103.13</nt:SV_1yr_200>
<nt:SV_3yr_200>21.88</nt:SV_3yr_200>
<nt:SV_5yr_200>21.88</nt:SV_5yr_200>
<nt:SV_10yr_200>9.38</nt:SV_10yr_200>
<nt:SV_1mo_40000>19375.0</nt:SV_1mo_40000>
<nt:SV_6mo_40000>29375.0</nt:SV_6mo_40000>
<nt:SV_1yr_40000>24375.0</nt:SV_1yr_40000>
<nt:SV_3yr_40000>9375.0</nt:SV_3yr_40000>
<nt:SV_5yr_40000>9375.0</nt:SV_5yr_40000>
<nt:SV_10yr_40000>9375.0</nt:SV_10yr_40000>
<nt:AUC_200>0.16217604</nt:AUC_200>
<nt:AUC_40000>0.31145853</nt:AUC_40000>
</nt:DDISC>
<nt:NEO>
<nt:NEO>144</nt:NEO>
<nt:NEOFAC_A>33</nt:NEOFAC_A>
<nt:NEOFAC_O>24</nt:NEOFAC_O>
<nt:NEOFAC_C>35</nt:NEOFAC_C>
<nt:NEOFAC_N>15</nt:NEOFAC_N>
<nt:NEOFAC_E>37</nt:NEOFAC_E>
</nt:NEO>
<nt:SPCPTNL>
<nt:SCPT_TP>59</nt:SCPT_TP>
<nt:SCPT_TN>115</nt:SCPT_TN>
<nt:SCPT_FP>5</nt:SCPT_FP>
<nt:SCPT_FN>1</nt:SCPT_FN>
<nt:SCPT_TPRT>412.0</nt:SCPT_TPRT>
<nt:SCPT_SEN>0.9833</nt:SCPT_SEN>
<nt:SCPT_SPEC>0.9583</nt:SCPT_SPEC>
<nt:SCPT_LRNR>11</nt:SCPT_LRNR>
</nt:SPCPTNL>
<nt:CPW>
<nt:IWRD_TOT>35</nt:IWRD_TOT>
<nt:IWRD_RTC>1442.0</nt:IWRD_RTC>
</nt:CPW>
<nt:PMAT24A>
<nt:PMAT24_A_CR>17</nt:PMAT24_A_CR>
<nt:PMAT24_A_SI>2</nt:PMAT24_A_SI>
<nt:PMAT24_A_RTCR>11839.0</nt:PMAT24_A_RTCR>
</nt:PMAT24A>
<nt:VSPLOT24>
<nt:VSPLOT_TC>9</nt:VSPLOT_TC>
<nt:VSPLOT_CRTE>834.3</nt:VSPLOT_CRTE>
<nt:VSPLOT_OFF>29</nt:VSPLOT_OFF>
</nt:VSPLOT24>
<nt:ER40>
<nt:ER40_CR>39</nt:ER40_CR>
<nt:ER40_CRT>1471.0</nt:ER40_CRT>
<nt:ER40ANG>8</nt:ER40ANG>
<nt:ER40FEAR>8</nt:ER40FEAR>
<nt:ER40HAP>8</nt:ER40HAP>
<nt:ER40NOE>8</nt:ER40NOE>
<nt:ER40SAD>7</nt:ER40SAD>
</nt:ER40>
<nt:ASR>
<nt:ASRSyndromeScores>
<nt:ASR_anxdp_raw>3</nt:ASR_anxdp_raw>
<nt:ASR_anxdp_t>50</nt:ASR_anxdp_t>
<nt:ASR_wthdp_raw>0</nt:ASR_wthdp_raw>
<nt:ASR_wthdp_t>50</nt:ASR_wthdp_t>
<nt:ASR_som_raw>0</nt:ASR_som_raw>
<nt:ASR_som_t>50</nt:ASR_som_t>
<nt:ASR_tho_raw>1</nt:ASR_tho_raw>
<nt:ASR_tho_t>50</nt:ASR_tho_t>
<nt:ASR_att_raw>1</nt:ASR_att_raw>
<nt:ASR_att_t>50</nt:ASR_att_t>
<nt:ASR_agg_raw>3</nt:ASR_agg_raw>
<nt:ASR_agg_t>51</nt:ASR_agg_t>
<nt:ASR_rule_raw>1</nt:ASR_rule_raw>
<nt:ASR_rule_t>51</nt:ASR_rule_t>
<nt:ASR_int_raw>1</nt:ASR_int_raw>
<nt:ASR_int_t>50</nt:ASR_int_t>
<nt:ASR_other_raw>8</nt:ASR_other_raw>
<nt:ASR_critical_raw>2</nt:ASR_critical_raw>
<nt:ASR_cmp_internalizing_raw>3</nt:ASR_cmp_internalizing_raw>
<nt:ASR_cmp_internalizing_t>39</nt:ASR_cmp_internalizing_t>
<nt:ASR_cmp_externalizing_raw>5</nt:ASR_cmp_externalizing_raw>
<nt:ASR_cmp_externalizing_t>46</nt:ASR_cmp_externalizing_t>
<nt:ASR_cmp_other_raw>10</nt:ASR_cmp_other_raw>
<nt:ASR_cmp_total_raw>18</nt:ASR_cmp_total_raw>
<nt:ASR_cmp_total_t>40</nt:ASR_cmp_total_t>
</nt:ASRSyndromeScores>
<nt:ASRDsmScores>
<nt:DSM_dep_raw>1</nt:DSM_dep_raw>
<nt:DSM_dep_t>50</nt:DSM_dep_t>
<nt:DSM_anx_raw>3</nt:DSM_anx_raw>
<nt:DSM_anx_t>50</nt:DSM_anx_t>
<nt:DSM_som_raw>0</nt:DSM_som_raw>
<nt:DSM_som_t>50</nt:DSM_som_t>
<nt:DSM_avoid_raw>1</nt:DSM_avoid_raw>
<nt:DSM_avoid_t>50</nt:DSM_avoid_t>
<nt:DSM_adh_raw>4</nt:DSM_adh_raw>
<nt:DSM_adh_t>51</nt:DSM_adh_t>
<nt:DSM_inatt_raw>1</nt:DSM_inatt_raw>
<nt:DSM_hyp_raw>3</nt:DSM_hyp_raw>
<nt:DSM_asoc_raw>2</nt:DSM_asoc_raw>
<nt:DSM_asoc_t>51</nt:DSM_asoc_t>
</nt:ASRDsmScores>
</nt:ASR>
</nt:NTScores>
>>>

That's a lot of stuff. Let's take it line-by-line.

The first line, <?xml version="1.0" ... , just tells us that this is an XML document.

The second line, <nt:NTScores ID="ConnectomeDB_E00299" ..., is the start of the actual content. It tells us that this is a N(on)T(oolbox)Scores document, gives us the experiment ID (the XNAT site-wide identifier), the project ID, the experiment label (the human-readable, in-project-context name), and ends with a bunch of namespace information in case we want to validate this document against the schema we were looking at earlier. (I don't. You're welcome to if you like.)
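Those root attributes can also be read programmatically. The sketch below parses a trimmed copy of that second line (most namespace declarations are dropped, and the element is closed immediately so it parses on its own) with the standard library's ElementTree.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document's root element; only the nt namespace is kept.
xml_text = (
    '<nt:NTScores ID="ConnectomeDB_E00299" project="HCP_Subjects" '
    'label="100307_NonToolbox" xmlns:nt="http://nrg.wustl.edu/nt"/>'
)

root = ET.fromstring(xml_text)
experiment_id = root.get('ID')        # XNAT site-wide identifier
project = root.get('project')         # project recorded on the root element
label = root.get('label')             # human-readable name within the project
print('%s %s %s' % (experiment_id, project, label))
```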

The next few lines, <xnat:sharing> through </xnat:sharing>, tell us what projects know about this experiment. We can skip over this. (Yes, there's an HCP_Q2 project. No, it's not ready for you to look at yet.)
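Should you ever need the sharing information, it can be pulled out the same way as any other element. This sketch uses a cut-down copy of the sharing block shown above (hidden-field comments removed) and ElementTree's namespace-qualified paths.

```python
import xml.etree.ElementTree as ET

XNAT_NS = 'http://nrg.wustl.edu/xnat'

# Cut-down copy of the sharing block from the experiment document.
xml_text = (
    '<nt:NTScores xmlns:nt="http://nrg.wustl.edu/nt" '
    'xmlns:xnat="http://nrg.wustl.edu/xnat">'
    '<xnat:sharing>'
    '<xnat:share label="100307_NonToolbox" project="HCP_Q1"/>'
    '<xnat:share label="100307_NonToolbox" project="HCP_Q2"/>'
    '</xnat:sharing>'
    '</nt:NTScores>'
)

root = ET.fromstring(xml_text)
# ElementTree spells namespaced tags as {namespace-uri}localname.
share_path = '{%s}sharing/{%s}share' % (XNAT_NS, XNAT_NS)
shared_with = [share.get('project') for share in root.findall(share_path)]
print(shared_with)
```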

Next comes the subject ID; again, this is the XNAT site-wide ID, not the human-readable name (label). We can use pyxnat to ask ConnectomeDB for the label in a specified project:

Code Block
>>> q1.subject('ConnectomeDB_S00230').label()
'100307'
>>>

After that come the scores (and lots of them), organized into a few groups. The schema document nontoolbox.xsd may be useful in helping to decipher this. We can ask for individual scores by walking the XML DOM:

Code Block
>>> nt = q1.subject('100307').experiment('100307_NonToolbox')
>>> nt.xpath('nt:ER40/nt:ER40_CR')
[<Element {http://nrg.wustl.edu/nt}ER40_CR at 0x102065370>]
>>> nt.xpath('nt:ER40/nt:ER40_CR')[0].text
'39'
>>>

That's a slow way of retrieving scores, since we need a full HTTP request and response for each field. (Actually, pyxnat does some caching so the requests aren't repeated. Probably. Usually. I'd still recommend doing something else.) If we want multiple scores, either more than one score from a single experiment or one or more scores from each of multiple experiments, there are more efficient methods.

Let's start with selecting multiple scores for a single experiment. A reasonable approach is to grab and parse the entire experiment XML document, using the Python standard library module ElementTree:

Code Block
>>> import xml.etree.ElementTree as ET
>>> nt_dom = ET.fromstring(nt.get())
>>> nt_dom.tag
'{http://nrg.wustl.edu/nt}NTScores'
>>> er40 = nt_dom.find('{http://nrg.wustl.edu/nt}ER40')
>>> [[e.tag,e.text] for e in er40]
[['{http://nrg.wustl.edu/nt}ER40_CR', '39'], ['{http://nrg.wustl.edu/nt}ER40_CRT', '1471.0'], ['{http://nrg.wustl.edu/nt}ER40ANG', '8'], ['{http://nrg.wustl.edu/nt}ER40FEAR', '8'], ['{http://nrg.wustl.edu/nt}ER40HAP', '8'], ['{http://nrg.wustl.edu/nt}ER40NOE', '8'], ['{http://nrg.wustl.edu/nt}ER40SAD', '7']]
>>>
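The namespace-qualified tags and string values are a bit unwieldy, so it may be worth turning a score group into a plain dictionary. The sketch below does that for a shortened copy of the ER40 group; `group_scores` is a hypothetical helper, and it assumes a flat group whose children all hold numbers (the nested ASR group would need extra handling).

```python
import xml.etree.ElementTree as ET

NT = '{http://nrg.wustl.edu/nt}'

# Shortened copy of the experiment document: just three ER40 scores.
xml_text = (
    '<nt:NTScores xmlns:nt="http://nrg.wustl.edu/nt">'
    '<nt:ER40>'
    '<nt:ER40_CR>39</nt:ER40_CR>'
    '<nt:ER40_CRT>1471.0</nt:ER40_CRT>'
    '<nt:ER40SAD>7</nt:ER40SAD>'
    '</nt:ER40>'
    '</nt:NTScores>'
)


def group_scores(root, group):
    """Return {score name: float} for one flat score group, e.g. 'ER40'."""
    node = root.find(NT + group)
    # Strip the namespace prefix from each tag and convert the text to a number.
    return dict((e.tag[len(NT):], float(e.text)) for e in node)


er40 = group_scores(ET.fromstring(xml_text), 'ER40')
print(er40['ER40_CR'])
```

With a live connection, `ET.fromstring(nt.get())` would take the place of the shortened document.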

Getting scores from multiple experiments can be done either by iterating over experiment IDs with the methods described above (single-attribute or XML document requests), or by using the pyxnat search interface, which will be covered in an update coming soon.
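As a sketch of the iterate-over-experiments approach, the function below pulls one score out of several experiment documents that have already been fetched. `collect_score` is a hypothetical helper; the first document is a shortened copy of the 100307 data, and the second is an empty placeholder invented only to show what happens when a score is missing.

```python
import xml.etree.ElementTree as ET

NT = '{http://nrg.wustl.edu/nt}'


def collect_score(xml_by_label, group, name):
    """Map each experiment label to one score (as text), or None if absent."""
    path = '%s%s/%s%s' % (NT, group, NT, name)
    result = {}
    for label, xml_text in xml_by_label.items():
        # findtext returns None when the element is not in the document.
        result[label] = ET.fromstring(xml_text).findtext(path)
    return result


docs = {
    '100307_NonToolbox': (
        '<nt:NTScores xmlns:nt="http://nrg.wustl.edu/nt">'
        '<nt:ER40><nt:ER40_CR>39</nt:ER40_CR></nt:ER40>'
        '</nt:NTScores>'
    ),
    # Placeholder document with no scores at all (not real data).
    'placeholder_NonToolbox': '<nt:NTScores xmlns:nt="http://nrg.wustl.edu/nt"/>',
}

scores = collect_score(docs, 'ER40', 'ER40_CR')
print(scores['100307_NonToolbox'])
```

Against ConnectomeDB, `docs` could be filled by looping over the subject labels and calling `q1.subject(label).experiment(label + '_NonToolbox').get()` for each, at the cost of one request per experiment.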
