In this section, we describe the three main kernel methods studied in this paper, namely the tree kernels [10, 19, 23], the All-Paths Graph (APG) kernel and the Approximate Subgraph Matching (ASM) kernel [15].
Tree kernels
Tree kernels [8] using constituency or dependency parse trees have been widely applied to relation extraction tasks [13, 18, 24]. They estimate similarity by counting the number of common substructures between two trees. Owing to the recursive nature of trees, the computation of the common subtrees can be addressed efficiently using dynamic programming; algorithms that run in linear time on average are discussed in [10].
Different variants of tree kernels can be obtained based on the definition of a tree fragment, namely subtree, subset tree and partial tree. A subtree satisfies the constraint that if a node is included in the subtree, then all its descendants are also included. A subset tree only requires that, for each node included in the subset tree, either all of its children are included or none is. A partial tree is the most general tree fragment, which allows for partial expansion of a node, i.e., for a given node in the partial tree fragment, any subset of its children may be included in the fragment. Subset trees are most relevant with constituency parse trees, where the inner nodes refer to grammatical production rules; partial expansion of a grammatical production rule leads to inconsistent grammatical structures, so subset trees restrict the expansion of a node to include all of its children or none. For dependency parse trees with no such grammatical constraints, partial trees are more suitable to explore a wider set of possible tree fragments. We experiment with subset tree kernels (SSTK) with constituency parses and partial tree kernels (PTK) with dependency parses and report the results on both. We illustrate the constituency parse tree for a sample sentence in Fig. 1.
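As a toy illustration (ours, not from the paper), consider the constituency parse (S (NP he) (VP (V ate) (NP pasta))). The snippet below lists sample fragments admitted by each fragment definition:

```python
# Toy fragment inventory for the parse (S (NP he) (VP (V ate) (NP pasta))).
# Fragments are written in bracket notation; each family is a superset of
# the previous one.
subtrees = [               # an included node brings all of its descendants
    "(NP he)", "(V ate)", "(NP pasta)", "(VP (V ate) (NP pasta))",
    "(S (NP he) (VP (V ate) (NP pasta)))",
]
subset_trees_extra = [     # per node: expand all children or none
    "(VP V NP)",           # VP's children present but left unexpanded
    "(S NP VP)", "(S NP (VP V NP))",
]
partial_trees_extra = [    # any subset of a node's children may be kept
    "(VP V)",              # partial expansion of the VP production: PTK only
    "(S VP)",
]
```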
Here, we present the formal definition of tree kernels. Let \(T_{1}\) and \(T_{2}\) denote two trees and let \(F=\{f_{1},f_{2},\ldots\}\) denote the set of all possible tree fragments. Let \(I_{i}(n)\) be an indicator function that evaluates to 1 when the fragment \(f_{i}\) is rooted at node n and 0 otherwise. The unnormalized kernel score is given by:
$$ K(T_{1},T_{2}) = \sum_{n_{1} \in N_{T_{1}}} \sum_{n_{2} \in N_{T_{2}}} \Delta (n_{1},n_{2}) $$
(1)
where \(N_{T_{1}}\) and \(N_{T_{2}}\) are the sets of nodes of \(T_{1}\) and \(T_{2}\), respectively, and \(\Delta (n_{1},n_{2})= \sum _{i=1}^{|F|} I_{i}(n_{1}) I_{i}(n_{2})\).
We used the tree kernel implementations provided in Kelp [22].
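To make Eq. (1) concrete, the following is a minimal Python sketch of the subset tree kernel using the classic Collins-Duffy recursion for Δ; the tuple encoding of trees and the decay factor lam are our choices, and this quadratic version omits the optimizations of [10] and of Kelp:

```python
# A minimal sketch of the subset tree kernel (SSTK): K(T1, T2) is the sum of
# Delta(n1, n2) over all node pairs, per Eq. (1). Trees are nested tuples
# (label, child, child, ...); leaves are plain strings.

def nodes(t):
    """Yield every internal node (tuple) of a tree."""
    if isinstance(t, tuple):
        yield t
        for child in t[1:]:
            yield from nodes(child)

def production(t):
    """The production at a node: its label plus the labels of its children."""
    return (t[0],) + tuple(c[0] if isinstance(c, tuple) else c for c in t[1:])

def delta(n1, n2, lam=1.0):
    """Number of common subset-tree fragments rooted at n1 and n2,
    with a decay factor lam penalizing larger fragments."""
    if production(n1) != production(n2):
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple) and isinstance(c2, tuple):
            score *= 1.0 + delta(c1, c2, lam)  # child expanded or not
    return score

def tree_kernel(t1, t2, lam=1.0):
    """Unnormalized kernel score K(T1, T2) from Eq. (1)."""
    return sum(delta(n1, n2, lam) for n1 in nodes(t1) for n2 in nodes(t2))

t1 = ("S", ("NP", "he"), ("VP", ("V", "ate"), ("NP", "pasta")))
t2 = ("S", ("NP", "she"), ("VP", ("V", "ate"), ("NP", "rice")))
print(tree_kernel(t1, t2))  # 6.0: the six fragments shared by the two parses
```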
APG kernel
The APG kernel [14] is designed to work with edge-weighted graphs. A given dependency graph G must first be modified to remove edge labels and introduce edge weights. Let e=l(a,b) denote an edge e with label l from vertex a to vertex b. For every such edge in the original graph, we introduce a new node with label l and two unlabeled edges (a,l) and (l,b) in the new graph. The APG kernel recommends an edge weight of 0.3 as the default setting for all edges. To accord greater importance to the entities in the graph, the edges along the shortest path between the two entities are given a larger weight of 0.9. This constitutes the subgraph derived from the dependency graph of a sentence. A second subgraph is derived from the linear order of the tokens in the sentence: n vertices are created to represent the n tokens, the lemma of each token is set as the label of the corresponding node, and the vertices are connected by n−1 edges from left to right, i.e., edges are introduced between token i and token i+1. These two disconnected subgraphs together form the final edge-weighted graph over which the APG kernel operates; a sketch of this construction follows.
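The following is a minimal sketch of the graph construction on a toy sentence, using networkx as the graph library; the node naming scheme and the 0.3 weight on the linear-order edges (the stated default) are assumptions of this sketch, not prescribed by [14]:

```python
# A hedged sketch of the APG input graph: a dependency subgraph with
# edge-label nodes, plus a disconnected linear-order subgraph.
import networkx as nx

tokens = ["seizures", "caused", "by", "fatigue"]          # toy sentence
deps = [("caused", "nsubj", "seizures"),                  # toy dependency parse
        ("caused", "nmod:by", "fatigue")]
on_entity_path = {"seizures", "caused", "fatigue"}        # shortest path between entities

g = nx.Graph()
# Dependency subgraph: each labeled edge l(a, b) becomes a new node labeled l
# plus two unlabeled edges (a, l) and (l, b).
for tok in tokens:
    g.add_node(("dep", tok), label=tok.lower())           # lemma as node label
for a, label, b in deps:
    mid = ("dep-edge", a, label, b)                       # node carrying the edge label
    g.add_node(mid, label=label)
    w = 0.9 if a in on_entity_path and b in on_entity_path else 0.3
    g.add_edge(("dep", a), mid, weight=w)
    g.add_edge(mid, ("dep", b), weight=w)

# Linear subgraph, disconnected from the one above: one vertex per token,
# chained left to right with n-1 edges.
for i, tok in enumerate(tokens):
    g.add_node(("lin", i), label=tok.lower())
    if i > 0:
        g.add_edge(("lin", i - 1), ("lin", i), weight=0.3)
```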
Let A denote the adjacency matrix of the combined graph, and let the "connectivity" of a path refer to the product of the edge weights along the path. Intuitively, longer paths or paths with smaller edge weights have connectivity closer to 0, while shorter paths or paths with larger edge weights have connectivity closer to 1. Note that the entry \(A^{i}[u,v]\) represents the sum of the connectivities of all paths of length i between the vertices u and v. The matrix W is defined as the sum of the powers of A, i.e., \(W=\sum _{i=1}^{\infty }A^{i}\). Since \(\sum _{i=0}^{\infty }A^{i}=(I-A)^{-1}\) whenever the series converges, W is efficiently computed in closed form as \(W=(I-A)^{-1}-I\), where subtracting the identity removes the trivial length-zero (self loop) contribution. W therefore denotes the sum of connectivity over all non-trivial paths. Finally, the APG kernel computes the matrix \(G^{m}=LWL^{T}\), where L is the label allocation matrix such that L[i,j]=1 if the label \(l_{i}\) is present in the vertex \(v_{j}\) and 0 otherwise. The resultant matrix \(G^{m}\) represents the total connectivity between any two labels in the given graph G. Let \(G_{1}^{m}\) and \(G_{2}^{m}\) denote the matrices constructed as described above for the two input graphs G1 and G2. The APG kernel score is then defined as:
$$ K(G_{1},G_{2}) = \sum_{i=1}^{|L|} \sum_{j=1}^{|L|} G_{1}^{m}\left[l_{i},l_{j}\right] \times G_{2}^{m}\left[l_{i},l_{j}\right] $$
(2)
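Given the adjacency matrix A and label allocation matrix L of each graph, the closed-form computation above reduces to a few lines of linear algebra. The following is a minimal numpy sketch (ours, not Kelp's implementation), assuming the geometric series converges:

```python
# A minimal numpy sketch of the APG kernel score from Eq. (2).
import numpy as np

def connectivity_matrix(A):
    """W = sum_{i>=1} A^i = (I - A)^{-1} - I (assumes the series converges)."""
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - A) - np.eye(n)

def label_connectivity(A, L):
    """G^m = L W L^T: total connectivity between every pair of labels."""
    return L @ connectivity_matrix(A) @ L.T

def apg_kernel(A1, L1, A2, L2):
    """K(G1, G2): elementwise product of the two label-connectivity
    matrices, summed over all label pairs. The rows of L1 and L2 must
    index the same label vocabulary."""
    return float(np.sum(label_connectivity(A1, L1) * label_connectivity(A2, L2)))
```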
Impact of linear subgraph
We observed substantially lower performance with the APG kernel when the labels marking the relative position of the tokens with respect to the entities, i.e., labels such as "before", "middle" and "after" in the linear subgraph, are left out. For example, the F-score on AIMed in the Protein-Protein Interaction (PPI) task drops by 8 points, from 42% to 34%, when these labels are left out. This highlights the importance of the information contained in the linear order of the sentence, in addition to the dependency parse graph.
ASM kernel
The ASM kernel [15] is based on the principles of graph isomorphism. Given two graphs \(G_{1}=(V_{1},E_{1})\) and \(G_{2}=(V_{2},E_{2})\), graph isomorphism seeks a bijective mapping of nodes \(M: V_{1} \Leftrightarrow V_{2}\) such that, for every edge e between two vertices \(v_{i},v_{j} \in G_{1}\), there exists an edge between the matched nodes \(M(v_{i}),M(v_{j}) \in G_{2}\), and vice versa. The ASM kernel, though, seeks an "approximate" measure of graph isomorphism between the two graphs, as described below. Let L be the vocabulary of node labels. In the first step, ASM seeks a bijective mapping \(M_{1}: L \Leftrightarrow V_{1}\) between the vocabulary and the nodes, such that \(M_{1}(l_{i})=v_{j}\), with \(v_{j} \in V_{1}\), when the vertex \(v_{j}\) has the node label \(l_{i}\). To enable this, all nodes in the graph are assumed to have distinct labels. For every vocabulary label \(l_{i}\) that is missing from the graph, a special disconnected (dummy) node \(v_{j}\) with the label \(l_{i}\) is introduced, as sketched below. Next, ASM does not seek matching edges between matched node pairs; instead, it evaluates the similarity of the shortest paths between them.
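This first step might be sketched as follows (the dummy-vertex naming scheme is our assumption):

```python
# A minimal sketch of ASM's first step, assuming node labels are distinct:
# map every label in the shared vocabulary to a vertex, adding a disconnected
# dummy vertex for each label absent from the graph.
def label_to_vertex(node_labels, vocabulary):
    """node_labels: dict mapping each vertex of one graph to its label."""
    mapping = {label: v for v, label in node_labels.items()}
    for label in vocabulary:
        if label not in mapping:
            mapping[label] = f"dummy_{label}"  # hypothetical naming scheme
    return mapping
```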
Consider two labels \(l_{i}, l_{j}\). Let x,y be the vertices in the first graph carrying these labels, respectively, i.e., \(M_{1}(l_{i})=x\), \(M_{1}(l_{j})=y\) and \(x,y \in V_{1}\). Let \(P_{x,y}^{1}\) be the shortest path between the vertices x and y in the graph G1. Similarly, let x′,y′ denote the matching vertices in the second graph, i.e., \(M_{2}(l_{i})=x^{\prime}\), \(M_{2}(l_{j})=y^{\prime}\) and \(x^{\prime},y^{\prime} \in V_{2}\). Let \(P_{x^{\prime},y^{\prime}}^{2}\) denote the shortest path between the vertices x′ and y′ in the graph G2. The feature map ϕ that maps a shortest path P into a feature vector is described following the ASM kernel definition below.
The ASM kernel score is computed as:
$$ \begin{aligned} K(G_{1},G_{2}) &= \sum_{i=1}^{|L|} \sum_{j=1}^{|L|} \phi\left(P_{x,y}^{1}\right) \cdot \phi\left(P_{x^{\prime},y^{\prime}}^{2}\right) \\ \text{s.t}\ M_{1}(l_{i}) &= x, M_{1}(l_{j}) = y\ \text{and}\ x,y \in V_{1} \\ \text{and}\ M_{2}(l_{i}) &= x^{\prime}, M_{2}(l_{j}) = y^{\prime}\ \text{and}\ x^{\prime},y^{\prime} \in V_{2} \end{aligned} $$
(3)
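Putting the pieces together, Eq. (3) can be sketched as below, assuming hop-count shortest paths found via networkx and a feature map phi supplied as a callable returning equal-length vectors (the feature map itself is sketched in the next subsection):

```python
# A hedged sketch of Eq. (3): sum, over all label pairs, the dot product of
# the feature maps of the matched shortest paths. Hop-count shortest paths
# are an assumption of this sketch; the edge weights enter through phi.
import networkx as nx
import numpy as np

def asm_kernel(g1, map1, g2, map2, vocabulary, phi):
    """map1/map2: label -> vertex mappings (including dummy vertices)."""
    score = 0.0
    for li in vocabulary:
        for lj in vocabulary:
            try:
                p1 = nx.shortest_path(g1, map1[li], map1[lj])
                p2 = nx.shortest_path(g2, map2[li], map2[lj])
            except (nx.NetworkXNoPath, nx.NodeNotFound):
                continue  # disconnected dummy vertices contribute nothing
            score += float(np.dot(phi(g1, p1), phi(g2, p2)))
    return score
```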
Feature space
The feature space of the ASM kernel is revealed by examining the feature map ϕ that is evaluated for each shortest path P. The ASM kernel explores path similarity along three aspects, namely structure, directionality and edge labels, as described below. We use the notation \(W_{e}\) to denote the weight of an edge e, and an indicator function \(I_{e}^{l}\) to indicate whether an edge e has the edge label l. As in the APG graph, we set the edge weights to 0.9 for edges on the shortest dependency path between the two entities and 0.3 for the others.
Structural similarity is estimated by comparing "path lengths". Note that similar, or approximately isomorphic, graphs are expected to have similar path lengths for matching shortest paths. Therefore, a single feature \(\phi _{\text {distance}}(P) = \prod _{e \in P} W_{e}\) is computed to incorporate structural similarity, where \(W_{e}\) denotes the weight of an edge e in the path P.
Directional similarity is computed in the same manner, but taking edge directions into account. The ASM kernel computes two features, \(\phi _{\text {forward edges}} (P) = \prod _{f \in P}W_{f}\) and \(\phi _{\text {backward edges}} (P) = \prod _{b \in P}W_{b} \), where f and b denote a forward facing and a backward facing edge in the path P, respectively.
Edge directions may themselves be regarded as special edge labels of type “forward” or “backward”. Edge label similarity generalizes the above notion to an arbitrary vocabulary of edge labels E. In particular, E is the set of dependency types or edge labels generated by the syntactic parser. For each such edge label l∈E, ASM kernel computes the feature \(\phi _{l} (P) = \prod _{e \in P} W_{e}^{I_{e}^{l}} \), where \(I_{e}^{l}\) denotes an indicator function that takes a value 1 when the edge e has a label l and 0 otherwise.
The full feature map ϕ(P) is the concatenation of the features described above for structural, directional and edge label similarity. We illustrate this feature map for the sample enhanced dependency graph in Fig. 1. For the label pair ("seizures", "fatigue"), the shortest path P passes through the single intermediate vertex "caused". For this path, the non-zero features are: \(\phi (P)=\{\phi _{\text {distance}}=(0.9)^{2}, \phi _{\text {forward edge}}=0.9, \phi _{\text {backward edge}}=0.9, \phi _{\text {nsubj}}=0.9, \phi _{\text {nmod:by}}=0.9\}\).
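The standalone sketch below reproduces this worked example; encoding each edge of P as a (weight, direction, label) triple and returning a dict keyed by feature name are simplifications of ours (a fixed feature ordering would yield the vector used in Eq. (3)):

```python
# A hedged sketch of the feature map phi on one shortest path.
def phi(path_edges, edge_label_vocab):
    feats = {"distance": 1.0, "forward": 1.0, "backward": 1.0}
    feats.update({label: 1.0 for label in edge_label_vocab})
    seen = {"distance"} if path_edges else set()
    for w, direction, label in path_edges:
        feats["distance"] *= w   # phi_distance: product over all edges
        feats[direction] *= w    # phi_forward / phi_backward
        feats[label] *= w        # phi_l for the dependency type l
        seen.update({direction, label})
    # features whose indicator never fired are zero rather than one
    return {k: (v if k in seen else 0.0) for k, v in feats.items()}

# Path from "seizures" to "fatigue" via "caused" in Fig. 1; both edges lie
# on the entity shortest path and hence carry weight 0.9.
path = [(0.9, "backward", "nsubj"), (0.9, "forward", "nmod:by")]
print(phi(path, ["nsubj", "nmod:by"]))
# distance ~ 0.81, forward 0.9, backward 0.9, nsubj 0.9, nmod:by 0.9
```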
Implementation details
We implemented the APG and ASM kernel in the Java based Kelp framework [22]. The Kelp framework provides several tree kernels and an SVM classifier that we used for our experiments. We did not perform tuning for the regularization parameter for SVM, and used the default settings (C-Value =1) in Kelp. Dependency parses were generated using Stanford CoreNLP [7] for the CDR dataset. For the Protein-Protein-Interaction task, we used the pre-converted corpora available from [14]. The corpus contains the dependency parse graphs derived from Charniak-Lease Parser, which was used as input for our graph kernels. All software implemented by us for reproducing the experiments in this paper, including the graph kernels APG and ASM implementations are available in a public repository.