Graph Info

The GripQL API allows a user to download the schema of a graph. This outlines the different types of nodes, the edges the connect them and the structure of the documents stored in graph elements. A graph document has a graph field that has the name, a vertices field and an edges field.

{"graph': "bmeg_rc1_2",
 "vertices': [{"gid": "Compound",
   "label": "Compound",
   "data": {"name": "STRING", "term": "STRING", "term_id": "STRING"}},
  ...],
  "edges": [{"gid": "(Project)--InProgram->(Program)",
   "label": "InProgram",
   "from": "Project",
   "to": "Program",
   "data": {}},

Connect to BMEG server

import networkx as nx
import matplotlib.pyplot as plt
import gripql
from networkx.drawing.nx_agraph import graphviz_layout
conn = gripql.Connection("https://bmeg.io/api", credential_file="bmeg_credentials.json")

Print avalible graphs

print(conn.listGraphs())
['bmeg_rc1_2', 'bmeg_rc1_3', 'bmeg_rc1_3__schema__', 'bmeg_rc2', 'bmeg_rc1_2__schema__', 'bmeg_rc2__schema__']
O = conn.graph("bmeg_rc2")

Get the schema graph

schema = conn.getSchema("bmeg_rc2")

Start build graph using NetworkX

g = nx.MultiDiGraph()
for v in schema['vertices']:
    g.add_node(v['gid'])
for e in schema['edges']:
    g.add_edge(e['from'], e['to'])

Draw Schema Graph

pos = graphviz_layout(g, prog='twopi', args='')
fig, ax = plt.subplots(1, 1, figsize=(8, 6));
nx.draw(g, pos, ax=ax, with_labels=True)

png

Allele Lookup

import hashlib
import gripql
conn = gripql.Connection("https://bmeg.io/api", credential_file="bmeg_credentials.json")
O = conn.graph("bmeg_rc2")

BMEG stores alleles using a hashed version of the alteration description

genome:chromosome:start:end:reference_bases:alternate_bases

So an example allele would be:

GRCh37:1:27100988:27100988:C:T

Which is then run though a sha1 hash to get the string

0b0a7a23d57414e768677a6cbd764563922209df
def allele_gid(genome, chromosome, start, end, reference_bases,
                 alternate_bases):
        vid = "%s:%s:%d:%d:%s:%s" % (genome, chromosome,
                                     start, end, reference_bases,
                                     alternate_bases)
        vid = vid.encode('utf-8')
        vidhash = hashlib.sha1()
        vidhash.update(vid)
        vidhash = vidhash.hexdigest()
        return "Allele:%s" % (vidhash)
chrom = 1
loc = 27100988
ids = []
for r in ['A', 'C', 'G', 'T']:
    for a in ['A', 'C', 'G', 'T']:
        ids.append( allele_gid("GRCh37", chrom, loc, loc, r, a) )
for row in O.query().V(ids):
    print( row )
[INFO]  2019-07-24 17:12:10,693 1 results received in 0 seconds


<AttrDict({'gid': 'Allele:0b0a7a23d57414e768677a6cbd764563922209df', 'label': 'Allele', 'data': {'alternate_bases': 'T', 'chromosome': '1', 'dbSNP_RS': '.', 'effect': 'Nonsense_Mutation', 'end': 27100988, 'ensembl_transcript': 'ENST00000324856', 'genome': 'GRCh37', 'hugo_symbol': 'ARID1A', 'project_id': 'Project:Reference', 'reference_bases': 'C', 'start': 27100988, 'strand': '+', 'submitter_id': '0b0a7a23d57414e768677a6cbd764563922209df', 'type': 'SNP'}})>

Gene Info

import gripql
conn = gripql.Connection("https://bmeg.io/api", credential_file="bmeg_credentials.json")
O = conn.graph("bmeg_rc2")

Look up a gene by its hugo symbol

gids = list(O.query().V().hasLabel("Gene").has(gripql.eq("$.symbol", "BRCA1")).render("_gid"))
[INFO]  2019-07-26 20:44:57,342 1 results received in 0 seconds

Find some of the Gene Ontology terms the gene is linked to

for ent in O.query().V(gids).out("gene_ontology_terms").limit(10):
    print(ent.gid, ent.data.definition)
[INFO]  2019-07-26 20:46:07,469 10 results received in 0 seconds


GeneOntologyTerm:GO:0003684 Interacting selectively and non-covalently with damaged DNA.
GeneOntologyTerm:GO:0004842 Catalysis of the transfer of ubiquitin from one protein to another via the reaction X-Ub + Y --> Y-Ub + X, where both X-Ub and Y-Ub are covalent linkages.
GeneOntologyTerm:GO:0000151 A protein complex that includes a ubiquitin-protein ligase and enables ubiquitin protein ligase activity. The complex also contains other proteins that may confer substrate specificity on the complex.
GeneOntologyTerm:GO:0000729 The 5' to 3' exonucleolytic resection of the DNA at the site of the break to form a 3' single-strand DNA overhang.
GeneOntologyTerm:GO:0000724 The error-free repair of a double-strand break in DNA in which the broken DNA molecule is repaired using homologous sequences. A strand in the broken DNA searches for a homologous region in an intact chromosome to serve as the template for DNA synthesis. The restoration of two intact DNA molecules results in the exchange, reciprocal or nonreciprocal, of genetic material between the intact DNA molecule and the broken DNA molecule.
GeneOntologyTerm:GO:0003713 A protein or a member of a complex that interacts specifically and non-covalently with a DNA-bound DNA-binding transcription factor to activate the transcription of specific genes. Coactivators often act by altering chromatin structure and modifications. For example, one class of transcription coregulators modifies chromatin structure through covalent modification of histones. A second ATP-dependent class modifies the conformation of chromatin. Another type of coregulator activity is the bridging of a DNA-binding transcription factor to the basal transcription machinery. The Mediator complex, which bridges transcription factors and RNA polymerase, is also a transcription coactivator.
GeneOntologyTerm:GO:0000800 A proteinaceous core found between sister chromatids during meiotic prophase.
GeneOntologyTerm:GO:0005515 Interacting selectively and non-covalently with any protein or protein complex (a complex of two or more proteins that may include other nonprotein molecules).
GeneOntologyTerm:GO:0003677 Any molecular function by which a gene product interacts selectively and non-covalently with DNA (deoxyribonucleic acid).
GeneOntologyTerm:GO:0003723 Interacting selectively and non-covalently with an RNA molecule or a portion thereof.

Sample Counts

import gripql
conn = gripql.Connection("https://bmeg.io/api", credential_file="bmeg_credentials.json")
O = conn.graph("bmeg_rc2")

Count number of Projects per Program

q = O.query().V().hasLabel("Program").as_("p").out("projects").select("p")
q = q.aggregate(gripql.term("project_count", "$._gid"))
for row in q.execute()[0]["project_count"]["buckets"]:
    print("%s\t%s" % (row["key"], row["value"]))
[INFO]  2019-07-26 18:01:50,332 1 results received in 0 seconds


Program:CCLE    37
Program:TCGA    33
Program:GTEx    31
Program:GDSC    31
Program:CTRP    30
Program:TARGET  7
Program:CPTAC   1
Program:FM  1
Program:NCICCR  1
Program:CTSP    1
Program:HCMI    1
Program:BEATAML1.0  1
Program:VAREPOP 1

Count number of Samples per Program

q = O.query().V().hasLabel("Program").as_("p").out("projects").out("cases").select("p")
q = q.aggregate(gripql.term("sample_count", "$._gid"))
for row in q.execute()[0]["sample_count"]["buckets"]:
    print("%s\t%s" % (row["key"], row["value"]))
[INFO]  2019-07-26 18:02:37,232 1 results received in 2 seconds


Program:FM  18004
Program:TCGA    11315
Program:TARGET  3360
Program:CCLE    1617
Program:GDSC    1001
Program:CTRP    886
Program:GTEx    752
Program:NCICCR  489
Program:CPTAC   322
Program:BEATAML1.0  56
Program:CTSP    45
Program:HCMI    7
Program:VAREPOP 7

Cohort Mutation Counts

Find the number of mutations per gene for TCGA cohort

import pandas
import gripql
conn = gripql.Connection("https://bmeg.io/api", credential_file="bmeg_credentials.json")
O = conn.graph("bmeg_rc2")

Select all the tumor samples in the TCGA KIRC cohort, and aggregate across the ensembl_gene field.

q = O.query().V("Project:TCGA-KIRC")
q = q.out("cases").out("samples").has(gripql.eq("gdc_attributes.sample_type", "Primary Tumor"))
q = q.out("aliquots").out("somatic_callsets").outE("alleles")
q = q.has(gripql.contains("methods", "MUTECT"))
q = q.aggregate(gripql.term("geneCount", "ensembl_gene"))

res = list(q)
counts = {}
for i in res[0]['geneCount']['buckets']:
    counts[i['key']] = i['value']

[INFO]  2019-07-24 17:37:39,468 1 results received in 1 seconds

Create a Pandas.Series with the output and find all the genes with 20 or more mutations

countDF = pandas.Series(counts)
goi = list(countDF.index[countDF >= 20])
for e,g in O.query().V(goi).render(["$._gid" ,"$.symbol"]):
    print("%s (%s) = %d" % (e,g, countDF[e]))
[INFO]  2019-07-24 17:37:39,515 10 results received in 0 seconds


ENSG00000007174 (DNAH9) = 22
ENSG00000081479 (LRP2) = 22
ENSG00000134086 (VHL) = 113
ENSG00000151914 (DST) = 27
ENSG00000155657 (TTN) = 120
ENSG00000163930 (BAP1) = 30
ENSG00000163939 (PBRM1) = 74
ENSG00000181143 (MUC16) = 42
ENSG00000181555 (SETD2) = 34
ENSG00000198793 (MTOR) = 34

Gene Mutation Hotstops

For this example, we will start from a single gene, and identify all mutations that occur on it.

import matplotlib.pyplot as plt
import pandas
import gripql
conn = gripql.Connection("https://bmeg.io/api", credential_file="bmeg_credentials.json")
O = conn.graph("bmeg_rc2")

Get BRCA1 start and stop locations

loc = list( O.query().V().hasLabel("Gene").has(gripql.eq("symbol", "BRCA1")).render(["$.start", "$.end"]) )[0]
[INFO]  2019-07-26 17:48:09,443 1 results received in 0 seconds

Run an aggregation query to count up all the mutations

counts = [0] * (loc[1]-loc[0])
q = O.query().V().hasLabel("Gene").has(gripql.eq("symbol", "BRCA1")).out("alleles").has(gripql.eq("type", "SNP"))
q = q.aggregate(gripql.term("brac1_pos", "start"))
res = list(q)[0]
for v in res.brac1_pos.buckets:
    counts[ v['key'] - loc[0] ] = v['value']
[INFO]  2019-07-26 17:51:52,104 1 results received in 0 seconds

Save as a dataframe

s = pandas.DataFrame(counts)

Plot the hotspots

s.rolling(500).sum().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7fdf9c6ba6a0>

png

Cohort Genomics

Connect to BMEG server

import matplotlib.pyplot as plt
import gripql
conn = gripql.Connection("https://bmeg.io/api", credential_file="bmeg_credentials.json")
O = conn.graph("bmeg_rc2")

Do a query that starts on the TCGA BRCA cohort, goes though Cases -> Samples -> Aliquots -> SomaticCallsets -> Alleles. Once at the alleles, do an aggrigation to count the number of times each chromsome occurs

q = O.query().V("Project:TCGA-BRCA").out("cases").out("samples")
q = q.has(gripql.eq("gdc_attributes.sample_type", "Primary Tumor"))
q = q.out("aliquots").out("somatic_callsets").out("alleles")
q = q.has(gripql.eq("type", "SNP"))
q = q.aggregate(gripql.term("chrom", "chromosome"))
res = list(q)
[INFO]  2019-07-26 18:41:25,205 1 results received in 7 seconds

Visualize the results

name = []
count = []
for i in res[0].chrom.buckets:
    name.append(i["key"])
    count.append(i["value"])
plt.bar(name, count, width=0.35)
<BarContainer object of 24 artists>

png

Expression Data

import seaborn as sns
import pandas
import gripql
conn = gripql.Connection("https://bmeg.io/api", credential_file="bmeg_credentials.json")
O = conn.graph("bmeg_rc2")

Download gene expression values from TCGA-READ cohort and build matrix with submitter id as label

c = O.query().V("Project:TCGA-READ").out("cases").out("samples").as_("sample")
c = c.out("aliquots").out("gene_expressions").as_("exp")
c = c.render( ["$sample._data.gdc_attributes.submitter_id", "$exp._data.values"])

data = {}
for row in c.execute(stream=True):
    data[row[0]] = row[1]
[INFO]  2019-07-26 18:25:17,143 177 results received in 34 seconds

Take the data we downloaded and turn it into a Pandas data frame

samples = pandas.DataFrame(data).transpose().fillna(0.0)

Take a look at the top corner of the dataframe

samples.iloc[:5,:5]
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460
TCGA-AG-3883-01A 56.417170 0.532873 25.468040 1.398758 2.519876
TCGA-DC-6158-01A 213.685187 1.031829 151.992725 7.119824 8.481721
TCGA-EI-6507-01A 25.749271 0.000000 40.100776 7.127823 6.484232
TCGA-AF-3913-01A 254.599210 4.348810 119.589229 4.213013 7.268218
TCGA-EI-6883-01A 75.138424 0.406882 95.234971 5.180626 3.090609

Take a quick look to see the top expressing samples for the gene ENSG00000000003

samples["ENSG00000000003"].sort_values(ascending=False).head()
TCGA-DC-5869-01A    281.005900
TCGA-DC-6683-01A    258.010380
TCGA-AF-3913-01A    254.599210
TCGA-EF-5831-01A    253.774315
TCGA-DC-6157-01A    251.170374
Name: ENSG00000000003, dtype: float64
sns.kdeplot(samples['ENSG00000000003'], color="g")
<matplotlib.axes._subplots.AxesSubplot at 0x7fd0990397b8>

png

Differential Expression

import matplotlib.pyplot as plt
import seaborn as sns
import pandas
from scipy.stats import ttest_ind
import gripql
conn = gripql.Connection("https://bmeg.io/api", credential_file="bmeg_credentials.json")
O = conn.graph("bmeg_rc2")

Look at the expression in the TCGA LUAD project

PROJECT="Project:TCGA-LUAD"

Starting from the TCGA-LUAD project, follow the edges from Project -> Case -> Sample -> Aliquot -> GeneExpression. At the Sample node, we select from Solid Tissue Normal so we only pull the normals during this query.

Once on the gene expression node, extract the values, which hold the gene expression values in a Map[String,Float] format with Ensembl Gene ids as the keys, and the Gene expression TPMs as the values. We then load that into a Pandas data frame, and transpose so that the rows are the sample ids.

c = O.query().V(PROJECT).out("cases").out("samples").as_("sample")
c = c.has(gripql.eq("gdc_attributes.sample_type", "Solid Tissue Normal"))
c = c.out("aliquots").out("gene_expressions").as_("exp")
c = c.render( ["$sample._data.gdc_attributes.submitter_id", "$exp._data.values"])
data = {}
for row in c.execute(stream=True):
    data[row[0]] = row[1]
normalDF = pandas.DataFrame(data).transpose()
[INFO]  2019-07-26 20:44:50,476 59 results received in 12 seconds

Do the Project to Gene Expression traversal again, but this time only select the tumor samples.

c = O.query().V(PROJECT).out("cases").out("samples").as_("sample")
c = c.has(gripql.eq("gdc_attributes.sample_type", "Primary Tumor"))
c = c.out("aliquots").out("gene_expressions").as_("exp")
c = c.render( ["$sample._data.gdc_attributes.submitter_id", "$exp._data.values"])
data = {}
for row in c.execute(stream=True):
    data[row[0]] = row[1]
tumorDF = pandas.DataFrame(data).transpose()
[INFO]  2019-07-26 20:46:41,760 539 results received in 110 seconds

For each gene, run t-test to determine the genes with the most differential expression between the tumor and normal sets.

stats = {}
for gene in tumorDF:
    s = ttest_ind(tumorDF[gene], normalDF[gene])
    stats[gene] = { 'statistic': s.statistic, 'pvalue' : s.pvalue }
statsDF = pandas.DataFrame(stats).transpose()

Display the results

statsDF[ statsDF['pvalue'] < 0.0001 ].sort_values('statistic').head()
pvalue statistic
ENSG00000168484 6.625251e-227 -53.352962
ENSG00000135604 5.332727e-208 -48.705092
ENSG00000204305 1.356906e-197 -46.256957
ENSG00000114854 1.756643e-188 -44.172306
ENSG00000022267 5.421931e-173 -40.745618

Using the top gene from the T-Test experiment (ENSG00000168484), plot the expression of the gene across the tumor and normal samples

sns.kdeplot(tumorDF['ENSG00000168484'], color="r")
sns.kdeplot(normalDF['ENSG00000168484'], color="g")
<matplotlib.axes._subplots.AxesSubplot at 0x7f272d6aeb38>

png

Do a quick search of ENSG00000168484 to identify the Gene Ontology terms that it is linked to

for row in O.query().V("ENSG00000168484").out("gene_ontology_terms"):
    print(row.gid, row.data.definition)
[INFO]  2019-07-26 20:47:15,413 10 results received in 0 seconds


GeneOntologyTerm:GO:0005515 Interacting selectively and non-covalently with any protein or protein complex (a complex of two or more proteins that may include other nonprotein molecules).
GeneOntologyTerm:GO:0005576 The space external to the outermost structure of a cell. For cells without external protective or external encapsulating structures this refers to space outside of the plasma membrane. This term covers the host cell environment outside an intracellular parasite.
GeneOntologyTerm:GO:0005615 That part of a multicellular organism outside the cells proper, usually taken to be outside the plasma membranes, and occupied by fluid.
GeneOntologyTerm:GO:0042599 A membrane-bounded organelle, specialized for the storage and secretion of various substances (surfactant phospholipids, glycoproteins and acid phosphates) which are arranged in the form of tightly packed, concentric, membrane sheets or lamellae. Has some similar properties to, but is distinct from, a lysosome.
GeneOntologyTerm:GO:0042802 Interacting selectively and non-covalently with an identical protein or proteins.
GeneOntologyTerm:GO:0045334 A clathrin-coated, membrane-bounded intracellular vesicle formed by invagination of the plasma membrane around an extracellular substance.
GeneOntologyTerm:GO:0097486 The volume enclosed by the outermost membrane of a multivesicular body.
GeneOntologyTerm:GO:0005789 The lipid bilayer surrounding the endoplasmic reticulum.
GeneOntologyTerm:GO:0007585 The process of gaseous exchange between an organism and its environment. In plants, microorganisms, and many small animals, air or water makes direct contact with the organism's cells or tissue fluids, and the processes of diffusion supply the organism with dioxygen (O2) and remove carbon dioxide (CO2). In larger animals the efficiency of gaseous exchange is improved by specialized respiratory organs, such as lungs and gills, which are ventilated by breathing mechanisms.
GeneOntologyTerm:GO:0044267 The chemical reactions and pathways involving a specific protein, rather than of proteins in general, occurring at the level of an individual cell. Includes cellular protein modification.

CNA Histogram

import matplotlib.pyplot as plt
import numpy as np
import gripql
conn = gripql.Connection("https://bmeg.io/api", credential_file="bmeg_credentials.json")
O = conn.graph("bmeg_rc2")

Get Ensembl Gene ids for genes of interest

GENES = ["PTEN", "TP53", "RB1"]
gene_ids = {}
for g in GENES:
    for i in O.query().V().hasLabel("Gene").has(gripql.eq("symbol", g)):
        gene_ids[g] = i.gid
[INFO]  2019-07-26 18:24:02,481 1 results received in 0 seconds
[INFO]  2019-07-26 18:24:02,616 1 results received in 0 seconds
[INFO]  2019-07-26 18:24:02,741 1 results received in 0 seconds
gene_ids
{'PTEN': 'ENSG00000171862',
 'TP53': 'ENSG00000141510',
 'RB1': 'ENSG00000139687'}

For each gene of interest, obtain the copy number alteration values and aggregate them by gene.

q = O.query().V("Project:TCGA-PRAD").out("cases").out("samples").out("aliquots")
q = q.has(gripql.eq("$.gdc_attributes.sample_type", 'Primary Tumor')).out("copy_number_alterations")
q = q.aggregate(
    list( gripql.term( g, "values.%s" % (g), 5) for g in gene_ids.values() )
)

res = list(q)
for r in res[0]:
    for b in res[0][r]['buckets']:
        print("%s\t%s:%s" % (r, b['key'], b['value']))
[INFO]  2019-07-26 18:24:06,428 1 results received in 3 seconds


ENSG00000139687 0:269
ENSG00000139687 -1:139
ENSG00000139687 -2:81
ENSG00000139687 1:3
ENSG00000141510 0:329
ENSG00000141510 -1:126
ENSG00000141510 -2:37
ENSG00000171862 0:327
ENSG00000171862 -2:95
ENSG00000171862 -1:64
ENSG00000171862 1:5
ENSG00000171862 2:1

Create a barchart showing the counts of copy number altered samples in the cohort.

val = []
count = []
for i in sorted(res[0]['ENSG00000139687']['buckets'], key=lambda x:int(x["key"])):
    val.append(int(i["key"]))
    count.append(i["value"])
plt.bar(val, count, width=0.35)
<BarContainer object of 4 artists>

png

Drug Response

import pandas
import gripql
import itertools
import scipy.stats as stats

conn = gripql.Connection("https://bmeg.io/api", credential_file="bmeg_credentials.json")
O = conn.graph("bmeg_rc2")

Find all of the samples in the CTRP Breast Cancer experiment

q = O.query().V("Project:CTRP_Breast_Cancer").out("cases").distinct("_gid")
all_cases = []
for row in q:
    all_cases.append(row.gid)
[INFO]  2019-07-26 20:29:14,278 40 results received in 0 seconds

For the genes of interest, get Ensembl gene ids, from the HUGO symbols

GENES = ["PTEN", "TP53"]
gene_ids = {}
for i in O.query().V().hasLabel("Gene").has(gripql.within("symbol", GENES)):
    gene_ids[i.data.symbol] = i.gid
[INFO]  2019-07-26 20:29:14,423 2 results received in 0 seconds

The CTRP doesn’t have direct mutation calling, but rather they used the same cell lines as the CCLE, and we can use the mutation calls from that project. So use the same_as edge to identify the cases from CCLE that are the same as the ones in CTRP. Then use the mutations from those samples.

gene_ids
{'PTEN': 'ENSG00000171862', 'TP53': 'ENSG00000141510'}

For each of the genes, find the set of samples that have a mutation in that gene

mut_cases = {}
norm_cases = {}

q = O.query().V(all_cases).as_("ctrp").out("same_as").has(gripql.eq("project_id", "Project:CCLE_Breast_Cancer"))
q = q.out("samples").out("aliquots").out("somatic_callsets")
q = q.outE("alleles").has(gripql.within("ensembl_gene", list(gene_ids.values())))
q = q.render({"case" : "$ctrp._gid", "gene" : "$._data.ensembl_gene"})

for res in q:
    mut_cases[res.gene] = mut_cases.get(res.gene, set()) | set([res.case])

#get CCLE samples without mutation
for i in gene_ids.values():
    norm_cases[i] = list(set(all_cases).difference(mut_cases[i]))

    print( "%s Positive Set: %d" % (i, len(mut_cases[i])) )
    print( "%s Negative Set: %d" % (i, len(norm_cases[i])) )

[INFO]  2019-07-26 20:29:14,690 43 results received in 0 seconds


ENSG00000171862 Positive Set: 9
ENSG00000171862 Negative Set: 31
ENSG00000141510 Positive Set: 29
ENSG00000141510 Negative Set: 11
pos_response = {}
for g in gene_ids.values():
    pos_response[g] = {}
    q = O.query().V(list(mut_cases[g])).as_("a").out("samples").out("aliquots")
    q = q.out("drug_response").as_("a").out("compounds").as_("b")
    q = q.select(["a", "b"])    
    for row in q:
        v = row['a']['data']['auc']
        compound = row['b']['gid']
        if compound not in pos_response[g]:
            pos_response[g][compound] = [ v ]
        else:
            pos_response[g][compound].append(v)
   
[INFO]  2019-07-26 20:29:16,035 3,592 results received in 1 seconds
[INFO]  2019-07-26 20:29:19,540 12,224 results received in 3 seconds
neg_response = {}
for g in gene_ids.values():
    neg_response[g] = {}
    q = O.query().V(list(norm_cases[g])).as_("a").out("samples").out("aliquots")
    q = q.out("drug_response").as_("a").out("compounds").as_("b")
    q = q.select(["a", "b"])    
    for row in q:
        v = row['a']['data']['auc']
        compound = row['b']['gid']
        if compound not in neg_response[g]:
            neg_response[g][compound] = [ v ]
        else:
            neg_response[g][compound].append(v)
   
[INFO]  2019-07-26 20:29:23,489 13,250 results received in 3 seconds
[INFO]  2019-07-26 20:29:25,137 4,618 results received in 1 seconds
drugs = set(itertools.chain.from_iterable( i.keys() for i in pos_response.values() ))
out = []
for drug in drugs:
    for g in gene_ids.values():
        if drug in pos_response[g] and drug in neg_response[g]:
            row = {"drug" : drug, "mutation" : g}
            mut_values = pos_response[g][drug]
            norm_values = neg_response[g][drug]
            if len(mut_values) > 5 and len(norm_values) > 5:
                s = stats.ttest_ind(mut_values, norm_values, equal_var=False)
                row["t-statistic"] = s.statistic
                row["t-pvalue"] = s.pvalue
                s = stats.f_oneway(mut_values, norm_values)
                row["a-statistic"] = s.statistic
                row["a-pvalue"] = s.pvalue
                out.append(row)
pandas.DataFrame(out, columns=["drug", "mutation", "t-statistic", "t-pvalue", "a-statistic", "a-pvalue"]).sort_values("a-pvalue").head(10)
drug mutation t-statistic t-pvalue a-statistic a-pvalue
369 Compound:CID9967941 ENSG00000141510 4.098840 0.000438 12.705330 0.001136
136 Compound:CID11433190 ENSG00000141510 2.941631 0.011004 12.336689 0.001189
315 Compound:CID56949517 ENSG00000171862 -2.611946 0.041333 13.650130 0.001197
191 Compound:CID11152667 ENSG00000141510 -3.003887 0.008382 9.205017 0.004602
41 Compound:NO_ONTOLOGY~BRD-K97651142 ENSG00000141510 2.556805 0.023745 8.381266 0.006489
44 Compound:CID11609586 ENSG00000171862 -1.952776 0.068235 7.382291 0.008403
649 Compound:CID52947560 ENSG00000141510 2.319721 0.036225 7.511026 0.009386
682 Compound:CID46943432 ENSG00000171862 -2.160075 0.055895 7.443883 0.009680
538 Compound:CID5566 ENSG00000141510 1.736441 0.118978 7.362028 0.010777
377 Compound:CID31703 ENSG00000171862 -2.835380 0.009463 6.628285 0.012186

Kaplan Meier

Install the lifelines library

!pip install lifelines
from lifelines import KaplanMeierFitter
import matplotlib.pyplot as plt

import pandas
import gripql
conn = gripql.Connection("https://bmeg.io/api", credential_file="bmeg_credentials.json")
O = conn.graph("bmeg_rc2")

Look at the TCGA-BRCA cohort, and find all of the cases where there is a recorded days_to_death

q = O.query().V("Project:TCGA-BRCA").out("cases")

data = {}
for i in q:
    if i.data.gdc_attributes.demographic is not None and i.data.gdc_attributes.demographic.vital_status == "Dead":
        if 'days_to_death' in i.data.gdc_attributes.demographic:
            data[ i.gid ] = i.data.gdc_attributes.demographic.days_to_death
survival = pandas.Series(data)
[INFO]  2019-07-26 21:19:47,291 1,098 results received in 0 seconds

Gene ensembl gene id for TP53

gene = list(O.query().V().hasLabel("Gene").has(gripql.eq("symbol", "TP53")))[0].data.gene_id
print(gene)
[INFO]  2019-07-26 21:18:49,921 1 results received in 0 seconds


ENSG00000141510

Starting from the cases with attached survival information, find all of the cases that have a mutation in the gene on interest

q = O.query().V(list(survival.keys())).as_("case").out("samples").out("aliquots").out("somatic_callsets").outE("alleles")
q = q.has(gripql.eq("ensembl_gene", gene))
q = q.select("case").distinct("$._gid").render("$._gid")
mut_cases = list(q)
[INFO]  2019-07-26 21:20:14,268 52 results received in 0 seconds

Plot a Kaplan Meirer curve to demonstrate the different in survival of the somatic mutation group and those with no somatic mutation.

kmf = KaplanMeierFitter()

ax = plt.subplot(111)
kmf.fit(survival[mut_cases], label="%s Mutations" % (gene))
ax = kmf.plot(ax=ax)
kmf.fit(survival[ survival.index.difference(mut_cases) ], label="No Mutation")
kmf.plot(ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x7f835d6eb160>

png