Decomposing the DWPC to assess intermediate node or edge contributions

As a reminder, the degree-weighted path count (DWPC) measures the prevalence of a metapath between a specific source and target node [1]. It equals the sum of path degree products (PDPs), each of which scores a single path based on the degrees along that path.
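For reference, the two quantities can be written out as below. This is a sketch in notation chosen for this post: the symbols $p$, $D_p$, $P_M(s, t)$, and $w$ are assumptions, not taken from [1]. $D_p$ is the list of degrees along path $p$ (two per edge, one for each endpoint, counted for that edge's type), $P_M(s, t)$ is the set of paths from source $s$ to target $t$ that follow metapath $M$, and $w = 0.4$ matches the `d ^ -0.4` term in the Cypher queries later in this discussion.

```latex
\mathrm{PDP}(p) = \prod_{d \in D_p} d^{-w},
\qquad
\mathrm{DWPC}(s, t, M) = \sum_{p \in P_M(s, t)} \mathrm{PDP}(p)
```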

Traditionally, the DWPC sums the PDPs for all paths connecting the source and target node along a specified metapath. Here I propose a new type of DWPC that sums only the paths that traverse the same intermediate node at a specified position. In other words, traditional DWPCs are defined for a source–target–metapath combination, whereas the proposed DWPCs are defined for a source–target–metapath–position combination. Here, position refers to an intermediate metanode, although the approach would also work with an intermediate metaedge as the position. Note that choosing either the source or target metanode as the position is equivalent to the traditional DWPC.

The purpose of this approach is to assess the contribution of intermediate nodes (or edges) in composing the DWPC. Note that the sum of all "partial" DWPCs equals the traditional DWPC. This approach doesn't replace the need for traditional DWPCs: the two serve different needs and answer different questions.
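The decomposition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy paths, node names, and degrees are hypothetical, the function names are invented for this sketch, and the 0.4 exponent is taken from the `d ^ -0.4` term in the Cypher queries in this discussion.

```python
from collections import defaultdict
from math import isclose, prod

DAMPING = 0.4  # exponent assumed from the `d ^ -0.4` term in the Cypher queries


def path_degree_product(degrees, damping=DAMPING):
    """Multiply the degrees along one path, each raised to -damping."""
    return prod(d ** -damping for d in degrees)


def partial_dwpcs(paths, position):
    """Sum PDPs separately for each node that sits at `position` in a path."""
    totals = defaultdict(float)
    for nodes, degrees in paths:
        totals[nodes[position]] += path_degree_product(degrees)
    return dict(totals)


# Hypothetical toy paths: (nodes along the path, two degrees per edge).
toy_paths = [
    (("compound", "gene_a", "pathway_x", "gene_b", "disease"), [2, 3, 4, 10, 10, 5, 6, 3]),
    (("compound", "gene_a", "pathway_x", "gene_c", "disease"), [2, 3, 4, 10, 10, 2, 1, 3]),
    (("compound", "gene_d", "pathway_y", "gene_b", "disease"), [2, 1, 2, 50, 50, 5, 6, 3]),
]

# Traditional DWPC: sum the PDP over every path.
traditional = sum(path_degree_product(degrees) for _, degrees in toy_paths)

# Partial DWPCs: one score per pathway (position 2 in the path).
partials = partial_dwpcs(toy_paths, position=2)

# The partial DWPCs add back up to the traditional DWPC.
assert isclose(sum(partials.values()), traditional)
```

Choosing position 0 or 4 in this sketch puts every path in a single group, which recovers the traditional DWPC, as noted above.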

I'm not satisfied with the traditional versus partial nomenclature. @alizee, any advice?

Daniel Himmelstein Researcher

Enalapril for coronary artery disease example

Prelude: I recently helped @cgreene with a grant proposal titled "Network-based algorithms for drug discovery from genetic associations" (application 1R01HG009516-01A1). For this proposal, we wanted to show an example where considering the tissue-specificity of paths helped identify the mechanisms of drug efficacy. In the course of this analysis, we came up with the partial DWPC method and the following example (the tissue-specific additions are not included below).

Enalapril treats coronary artery disease (CAD) by inhibiting angiotensin-converting enzyme (ACE) [1]. Traditionally, if we were interested in potential pathways contributing to drug efficacy, we might search for CbGpPWpGaD paths between enalapril and CAD. Below is the Cypher query to return all paths, ranked by PDP (run the query at https://neo4j.het.io):

MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-
  (n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.name = 'Enalapril'
  AND n4.name = 'coronary artery disease'
  AND n1 <> n3
WITH
  path,
  // Two degrees per edge: one for each endpoint, counted for that edge type
  [
    size((n0)-[:BINDS_CbG]-()),
    size(()-[:BINDS_CbG]-(n1)),
    size((n1)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n2)),
    size((n2)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n3)),
    size((n3)-[:ASSOCIATES_DaG]-()),
    size(()-[:ASSOCIATES_DaG]-(n4))
  ] AS degrees
RETURN
  // Node names along the path, joined by en dashes (substring drops the leading dash)
  substring(reduce(s = '', node IN nodes(path) | s + '–' + node.name), 1) AS nodes,
  // PDP: product of the degrees, each raised to the -0.4 power
  reduce(pdp = 1.0, d IN degrees | pdp * d ^ -0.4) AS PDP
ORDER BY PDP DESC

Overall, 757 paths were returned. The top 3 paths are:

| nodes | PDP |
| --- | --- |
| Enalapril–ACE–Metabolism of Angiotensinogen to Angiotensins–ACE2–coronary artery disease | 0.000258 |
| Enalapril–ACE–ACE Inhibitor Pathway–NR3C2–coronary artery disease | 0.000252 |
| Enalapril–ACE–ACE Inhibitor Pathway–ACE2–coronary artery disease | 0.000245 |

Now let's assume we're more interested in the contributions of specific pathway nodes than in specific paths. In other words, we don't really care which genes got us to a pathway; we just want an overall score per pathway. In this case, we can select n2 as the position. Now we're computing a DWPC for Enalapril–binds–Gene–participates–Pathway–participates–Gene–associates–coronary artery disease, where Pathway (n2) is the position. The query becomes:

MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-
  (n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.name = 'Enalapril'
  AND n4.name = 'coronary artery disease'
  AND n1 <> n3
WITH
  path,
  // n2 is the position: paths are grouped by this pathway node
  n2 AS pathway,
  [
    size((n0)-[:BINDS_CbG]-()),
    size(()-[:BINDS_CbG]-(n1)),
    size((n1)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n2)),
    size((n2)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n3)),
    size((n3)-[:ASSOCIATES_DaG]-()),
    size(()-[:ASSOCIATES_DaG]-(n4))
  ] AS degrees
RETURN
  pathway.identifier AS pathway_id,
  pathway.name AS pathway_name,
  // PC: number of paths through the pathway; DWPC: sum of their PDPs
  count(*) AS PC,
  sum(reduce(pdp = 1.0, d IN degrees | pdp * d ^ -0.4)) AS DWPC
ORDER BY DWPC DESC, pathway_name

40 pathways are returned, of which the top 5 are displayed below:

| pathway_id | pathway_name | PC | DWPC |
| --- | --- | --- | --- |
| WP554_r84372 | ACE Inhibitor Pathway | 11 | 0.0015 |
| PC7_8339 | Transmembrane transport of small molecules | 150 | 0.0008 |
| PC7_5323 | Metabolism of Angiotensinogen to Angiotensins | 3 | 0.0005 |
| PC7_7290 | SLC-mediated transmembrane transport | 40 | 0.0004 |
| PC7_5322 | Metabolism | 309 | 0.0004 |

As shown, we now have a ranking of pathways based on their contribution to the overall CbGpPWpGaD metapath. Currently, I don't see a huge role for this approach for feature extraction, but think it's useful for following up on specific predictions and highlighting mechanisms of drug efficacy.

• Pouya Khankhanian: Agree with "I think it's useful for following up on specific predictions and highlighting mechanisms of drug efficacy". Especially if the function to display this result is embedded in a button on the neo4j interface.

I'd love to see the weight given to various nodes in the top predictions for epilepsy, especially the ones in the top 100 which were not classified as AEDs.

Daniel Himmelstein Researcher

Grouping paths by their source or target edge

The previous comment discussed grouping paths by an intermediate node and then calculating partial DWPCs. This comment introduces an alternative grouping method: grouping either by the source edge (first edge in the path) or target edge (last edge in the path).

Here's the intuition behind this approach. In a hetnet, a node derives its meaning from its relationships. For example, our algorithm is based solely on relationships. Therefore, a good way to investigate a prediction is to consider which edges of either the source compound or target disease mattered. We can do this for a specific source–target–metapath combination by grouping paths by their source or target edge.

For example, the following query takes the enalapril–CAD example and asks which target edges are composing the CbGpPWpGaD paths.

MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-
  (n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.name = 'Enalapril'
  AND n4.name = 'coronary artery disease'
  AND n1 <> n3
WITH
  path,
  [
    size((n0)-[:BINDS_CbG]-()),
    size(()-[:BINDS_CbG]-(n1)),
    size((n1)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n2)),
    size((n2)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n3)),
    size((n3)-[:ASSOCIATES_DaG]-()),
    size(()-[:ASSOCIATES_DaG]-(n4))
  ] AS degrees, n3, n4
RETURN
  n4.name AS target_name,
  type(relationships(path)[3]) AS target_edge_type,
  n3.name AS n3_name,
  sum(reduce(pdp = 1.0, d IN degrees | pdp * d ^ -0.4)) AS DWPC
ORDER BY DWPC DESC

The top five results are:

| target_name | target_edge_type | n3_name | DWPC |
| --- | --- | --- | --- |
| coronary artery disease | BINDS_CbG | SLC22A3 | 0.00072 |
| coronary artery disease | BINDS_CbG | ACE2 | 0.00058 |
| coronary artery disease | BINDS_CbG | REN | 0.00044 |
| coronary artery disease | BINDS_CbG | SLC6A6 | 0.00038 |
| coronary artery disease | BINDS_CbG | NR3C2 | 0.00025 |

These are the top-ranking CAD-associated genes that participate in pathways with enalapril targets. As shown by the DWPC column, several of the top target edges contribute to a similar extent. No single CAD-associated gene is responsible for the bulk of the CbGpPWpGaD DWPC.
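The same grouping can be sketched outside Neo4j. The Python below is a minimal illustration, not the query's implementation: the toy paths and degrees are hypothetical, the function names are invented for this sketch, and an edge is identified simply by its two endpoint nodes, which assumes the metapath fixes the edge type at each step.

```python
from collections import defaultdict
from math import isclose, prod


def pdp(degrees, damping=0.4):
    """Path degree product with the 0.4 exponent assumed from the queries."""
    return prod(d ** -damping for d in degrees)


def dwpc_by_edge(paths, which):
    """Group paths by their first ('source') or last ('target') edge and sum PDPs."""
    if which not in ("source", "target"):
        raise ValueError("which must be 'source' or 'target'")
    totals = defaultdict(float)
    for nodes, degrees in paths:
        # An edge is keyed by its pair of endpoint nodes.
        edge = nodes[:2] if which == "source" else nodes[-2:]
        totals[edge] += pdp(degrees)
    return dict(totals)


# Hypothetical toy paths: (nodes along the path, two degrees per edge).
toy_paths = [
    (("compound", "gene_a", "pathway_x", "gene_b", "disease"), [2, 3, 4, 10, 10, 5, 6, 3]),
    (("compound", "gene_a", "pathway_x", "gene_c", "disease"), [2, 3, 4, 10, 10, 2, 1, 3]),
    (("compound", "gene_d", "pathway_y", "gene_b", "disease"), [2, 1, 2, 50, 50, 5, 6, 3]),
]

total = sum(pdp(degrees) for _, degrees in toy_paths)
by_source = dwpc_by_edge(toy_paths, "source")
by_target = dwpc_by_edge(toy_paths, "target")

# Either grouping partitions the paths, so each set of scores sums to the total.
assert isclose(sum(by_source.values()), total)
assert isclose(sum(by_target.values()), total)

for edge, score in sorted(by_target.items(), key=lambda item: -item[1]):
    print(edge, f"{score / total:.1%} of total DWPC")
```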

In instances where a single edge composes the bulk of the total DWPC, you know that a single relationship is driving the score. For example, we can rewrite the above query to analyze the source edge:

MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-
  (n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.name = 'Enalapril'
  AND n4.name = 'coronary artery disease'
  AND n1 <> n3
WITH
  path,
  [
    size((n0)-[:BINDS_CbG]-()),
    size(()-[:BINDS_CbG]-(n1)),
    size((n1)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n2)),
    size((n2)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n3)),
    size((n3)-[:ASSOCIATES_DaG]-()),
    size(()-[:ASSOCIATES_DaG]-(n4))
  ] AS degrees, n0, n1
RETURN
  n0.name AS source_name,
  // The source edge is the first relationship in the path
  type(relationships(path)[0]) AS source_edge_type,
  n1.name AS n1_name,
  sum(reduce(pdp = 1.0, d IN degrees | pdp * d ^ -0.4)) AS DWPC
ORDER BY DWPC DESC

The top results are:

| source_name | source_edge_type | n1_name | DWPC |
| --- | --- | --- | --- |
| Enalapril | BINDS_CbG | ACE | 0.00273 |
| Enalapril | BINDS_CbG | SLCO1A2 | 0.00081 |
| Enalapril | BINDS_CbG | ABCB1 | 0.00081 |
| Enalapril | BINDS_CbG | SLC22A7 | 0.00068 |

These results show that enalapril's binding of ACE is driving the CbGpPWpGaD DWPC. In other words, if enalapril did not bind ACE, the CbGpPWpGaD DWPC would be ~40% lower (the total CbGpPWpGaD DWPC between enalapril and CAD is 0.00677).
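The ~40% figure is each source edge's DWPC divided by the total. A small Python sketch of that arithmetic, using only the numbers reported in this comment (the variable names are just for illustration):

```python
# Total CbGpPWpGaD DWPC between enalapril and CAD, as reported above.
total_dwpc = 0.00677

# DWPC per source edge, keyed by the bound gene (values from the results above).
source_edge_dwpcs = {
    "ACE": 0.00273,
    "SLCO1A2": 0.00081,
    "ABCB1": 0.00081,
    "SLC22A7": 0.00068,
}

# Share of the total DWPC attributable to each source edge.
shares = {gene: dwpc / total_dwpc for gene, dwpc in source_edge_dwpcs.items()}

for gene, share in shares.items():
    print(f"{gene}: {share:.1%}")
```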

Cite this as
Daniel Himmelstein (2016) Decomposing the DWPC to assess intermediate node or edge contributions. Thinklab. doi:10.15363/thinklab.d228