您好,欢迎来到九壹网。
搜索
您的当前位置:首页RASCAL Calculation of graph similarity using maximum common edge subgraphs

RASCAL Calculation of graph similarity using maximum common edge subgraphs

来源:九壹网
promoting access to White Rose research papers

Universities of Leeds, Sheffield and York

http://eprints.whiterose.ac.uk/

This is an author produced version of a paper published in The Computer Journal.

White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/3568/

Published paper

J.W., Gardiner, E.J. and Willett, P. (2002) RASCAL: Calculation of Graph

Similarity Using Maximum Common Edge Subgraphs, The Computer Journal, Volume 45 (6), 631-4.

White Rose Research Online eprints@whiterose.ac.uk

RASCAL: Calculation of Graph Similarity Using Maximum Common Edge

Subgraphs

John W. Raymond (john.raymond@pfizer.com)

Pfizer Global Research and Development, Ann Arbor Laboratories, 2800 Plymouth Road, Ann

Arbor, Michigan 48105

Eleanor J. Gardiner (e.gardiner@sheffield.ac.uk), and Peter Willett (p.willett@sheffield.ac.uk)

Krebs Institute for Biomolecular Research and Department of Information Studies, University of

Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom

ABSTRACT

A new graph similarity calculation procedure is introduced for comparing labeled graphs. Given a minimum similarity threshold, the procedure consists of an initial screening process to determine whether it is possible for the measure of similarity between the two graphs to exceed the minimum threshold, followed by a rigorous maximum common edge subgraph (MCES) detection algorithm to compute the exact degree and composition of similarity. The proposed MCES algorithm is based on a maximum clique formulation of the problem and is a significant improvement over other published algorithms. It presents new approaches to both lower and upper bounding as well as vertex selection. 1.0 INTRODUCTION

It is often desired in many applications to compare objects represented as graphs and to determine the degree of similarity between the objects. This is sometimes referred to as inexact graph matching. Inexact graph matching is of importance in many applications such as protein ligand docking [1], video indexing [2], and computer vision [3, 4]. One algorithmic approach to this problem is to use the maximum common subgraph (MCS) between the graphs being compared. In this paper, a novel and efficient algorithm based on the edge induced formulation of the MCS problem is presented. The algorithm has been developed for use in chemical similarity searching applications where molecular structures are represented as graphs such that

1

the atoms correspond to labeled vertices and the chemical bonds correspond to labeled edges, but it is directly applicable to a comparison of any collection of labeled graphs. For a more detailed description of the graphical representation of a chemical structure, the reader is referred to a text on chemical graph theory [5].

The similarities between the graphs representing molecules play an important role in many aspects of chemistry and, increasingly, biology. Examples include database searching [6], the prediction of biological activity [7], the design of combinatorial syntheses [8], and the interpretation of molecular spectra [9]. A wide range of types of similarity measures has been described [10]. However, many of these are too time-consuming when very large numbers of molecules need to be considered, as is the case for applications involving the searching and clustering of chemical databases which typically contain tens or even hundreds of thousands of molecules. In such cases, molecules are represented by a simple bit-string, in which bits are set to denote the presence in a molecule of pre-defined molecular substructures, such as a particular ring system or functional group. The similarity between two molecules is then calculated using an association coefficient based on the numbers of bits in common between the bit-strings describing those two molecules [10]. Although widely used for similarity purposes, such simple molecular representations have several limitations [11]; most importantly, a bit-string does not preserve the connectivity of the parent molecule. MCS-based similarity measures do take full account of molecular connectivity but have been little used to date in chemical information systems because of their computational requirements. The work reported here has been undertaken to address this situation.

The proposed inexact graph matching procedure RASCAL (Rapid Similarity CALculation) consists of two components, screening and rigorous graph matching. The screening procedure is intended to determine rapidly whether the graphs being compared exceed some specified minimum similarity threshold (without resulting in any false dismissals) in order to avoid unnecessary calls to the more computationally demanding, graph matching procedure. The screening procedure is described in section 2 along with an example calculation. The latter graph matching process consists of an efficient determination of the MCS using the graph similarity concept. Section 3 formulates the MCS graph matching procedure in terms of the maximum clique problem and presents improvements to existing clique detection algorithms for this purpose, including several new contributions. In section 4, the RASCAL algorithm is

2

compared with other published algorithms in several experimental trials where it is shown to be considerably more efficient.

1.1 Definitions and Terminology

All graphs referred to in the following text are assumed to be simple, undirected graphs. For an introduction to graph related concepts and notation, the reader is referred to an introductory text on graph theory [12]. N(vi) will denote the set of vertices adjacent to a vertex vi. A maximum clique, ω(G), of a graph G is the largest set of mutually adjacent vertices in G. A line graph L(G) is a graph whose vertex set consists of the edge set of G; therefore, if (vi, vj) is an edge in G it is also a vertex in L(G). A pair of vertices in L(G) are adjacent if the two corresponding edges in G are incident on each other.

A pair of graphs are said to be isomorphic if there is a one-to-one correspondence between their vertices and an edge only exists between two vertices in one graph if an edge exists between the two corresponding vertices in the other graph. An induced subgraph is a set S of vertices of a graph G and those edges of G with both endpoints in S. A graph G12 is a common induced subgraph of graphs G1 and G2 if G12 is isomorphic to induced subgraphs of G1 and G2. A maximum common induced subgraph (MCIS) consists of a graph G12 with the largest number of vertices meeting the aforementioned property. Related to the MCIS is the maximum common edge subgraph (MCES). An MCES is a subgraph consisting of the largest number of edges common to both G1 and G2. Note that the MCIS or MCES between two graphs is not necessarily connected or unique by definition. Figure 1(a) illustrates an MCIS between two graphs (highlighted in bold), and Figure 1(b) demonstrates an MCES between the same two graphs. 2.0 SCREENING PROCEDURE

Using the classification of Sanfeliu and Fu [13], graph distance/similarity measures can be categorized into two classes: 1)

Feature-Based Distances: A set of features or invariants is established from a structural description of a graph, and these features are then used in a vector representation to which various distance or similarity measures can be applied. 2)

Cost-Based Distance: The distance or similarity between two graphs reflects the number The proposed two-tiered screening system is a cost-based method that has been developed to be used in conjunction with an MCES algorithm. It will be referred to as the cost-of prescribed edit operations that are required in order to transform one graph into the other.

3

vector approach in order to differentiate it from feature-based methods. While the new screening system is almost as simple as the feature-based approach from a computational perspective, it has the desirable features that it cannot result in any false dismissals of graphs that are in fact similar, and it provides an upper bound on the size of the MCES between two molecular graphs. The procedure forms a screening hierarchy whereby the calculated upper-bound for the second tier is always less than or equal in magnitude to the value calculated in the first tier. Following the formal treatment in sections 2.1 and 2.2, we present an example illustrating the proposed screening procedure. 2.1 1st Tier Cost-Vector

Degree sequences of a graph have been used by other authors to establish upper bounds on graph invariants [14, 15] and recently for indexing graph databases [16]. In our proposed 1st tier screening procedure, we have used degree sequences to develop a novel method to calculate an upper bound on the size of an MCES between a pair of graphs. First, the set of vertices in each graph are partitioned into l partitions by label type, and then sorted in non-increasing order by degree. Let L1i and L2i denote the sorted degree sequences in partition i in graphs G1 and G2, respectively. An upper-bound on the similarity between a pair of graphs can be given as follows:

2

V(G1,G2)=∑min{L1i,Li}

i=1l

⎢lmax{L1i,L2i}min{d(v1),d(v2)}⎥

jj

⎥ E(G1,G2)=⎢∑∑2⎢i=1⎥j=1

⎣⎦(V(G1,G2)+E(G1,G2))2

sim(G1,G2)=

(V(G1)+E(G1))⋅(V(G2)+E(G2))

cv

Note that the term to the right of E(G1,G2) is divided by two since each edge in this

2

formulation is counted twice, and if L1i≠Li, then entries of value zero are appended to the

shorter sequence to make the sequences of equal length. It is clear that the similarity measure simcv(G1,G2) ranges from 0 to 1 and that it obeys the following inequality,

(V(G12)+E(G12))2

sim(G1,G2)≥,

(V(G1)+E(G1))⋅(V(G2)+E(G2))

cv

where G12 is the MCES between graphs G1 and G2. This measure has been investigated in depth by Johnson [17]. The value simcv(G1,G2) can serve as an upper bound on the size of the MCES

4

between any two graphs. For screening purposes, all that is necessary is to specify a minimum acceptable value for the MCES-based graph similarity measure. If the value determined by simcv(G1,G2) is less than the minimum acceptable similarity, then a rigorous MCES comparison can be avoided. Since chemical graphs are of bounded degree, this procedure can be performed using a bucket sort in O(n) time for chemical graphs or O(n log n) otherwise, where n =

2

max{L1,Lii}.

2.2 2nd Tier Cost-Vector

The 1st tier procedure provides a rapid screening mechanism which takes advantage of local connectivity and vertex labels to help eliminate unnecessary and costly MCES

comparisons, but it doesn’t take account of edge labeling. For this purpose, a 2nd tier-screening scheme has been developed. Since it is more costly from a computational perspective, it is intended to be used on pairs of graphs that have passed the initial cost-vector screen.

An unambiguous integer code is assigned to each edge incident on a given vertex which incorporates the edge label along with the labels of both vertex endpoints. Instead of simply

2assigning the respective degree to each vertex in L1i or Li, each vertex is assigned its set of

incident edge codes. If f(L1i,L2i) is defined to be a linear assignment of compatible edge codes

21associated with each pair of vertices in the sequences L1i and Li such that each vertex in Li is

ndcompared to each vertex in L2i, then the 2 tier cost-vector upper bound can be given as

2

⎢lf(L1⎥i,Li)E(G1,G2)=⎢∑⎥

2⎣i=1⎦

with V(G1,G2) and simcv(G1,G2) again being calculated as previously discussed. It has been found that the linear assignment algorithm described by Carpaneto et al. [18] performs well for this purpose. This algorithm has a worst case time complexity of O(n3) where n =

2

max{L1i,Li}.

2.3 Example of Cost-Vector Screening

The following example illustrates the operation of the cost-vector screening approach.

Figure 2 depicts two molecular graphs, methadone and meperidine, with their respective MCES highlighted in bold. Since there are 16 bonds in the MCES, E(G12)=16, and since the number of atoms in common between the two structures is 17 (including any isolated atoms),

5

V(G12)=17. In order to calculate an MCES based similarity measure, the following parameters are used:

methadone:

meperidine: Namep = 18

The MCES based similarity is then calculated using the Johnson metric to be 0.63.

Applications of similarity in chemical information systems focus on pairs of molecules having a high degree of structural resemblance. For example, in drug discovery programs, given a molecule that has been shown to exhibit some useful biological activity (e.g., lowering blood pressure or restricting tumor growth), molecules in a chemical database having a high degree of similarity are also likely to exhibit some activity and can hence be selected for biological testing [19, 20]. It is hence common to consider just the nearest neighbors of a query molecule (i.e., those having a similarity greater than some minimum threshold value).

In the present case, if we were to set this threshold to 0.70 then it is clear that this pair of molecular graphs does not fulfill the requirement, but since a rigorous MCES comparison is computationally expensive, it is desirable to see if it is possible to skip the unnecessary MCES comparison. 1st Tier Screen

In order to initiate the 1st tier cost-vector comparison, the vertices are first separated into partitions according to label type. Any vertex whose corresponding label is not present in both graphs is removed from consideration. Since in this case, both molecules have only carbon, oxygen, and nitrogen atoms, this pruning is not applicable. This is shown in Figure 3. The

1

entry d(v1)= 4 (7) in column L11 represents vertex number 7 in graph 1 (i.e., methadone) with

Nameth = 23 Nemeth = 24

(number of atoms in methadone) (number of bonds in methadone) (number of atoms in meperidine) (number of bonds in meperidine)

Nemep = 19

degree 4 and of type carbon. Note that since column L12 has one fewer element than L22, it is padded with a zero to make the columns the same size.

6

V(G1,G2) is calculated by summing the size of the set Li with the fewer non-null

2

elements in graphs 1 and 2. Since L1 has 15 non-null elements and L11 has 21, V(G1,G2) is

incremented by 15. Considering the other two partitions L2 and L3 results in V(G1,G2)=17.

E(G1,G2) is determined using min{d(v1j),d(v2j)}. This procedure is also illustrated in Figure 3. E(G1,G2) is calculated by summing each column in the min{d(v1j),d(v2j)} set of columns, adding the resulting values together, and then integer dividing the result by two. Since the resulting sum equals 36, E(G1,G2)=18. Now the cost-vector similarity value simcv(G1,G2) is calculated to be 0.70 using the Johnson measure. Since this value is equal to the minimum acceptable similarity of 0.70, the 2nd tier screening process must be implemented. 2nd Tier Screen

Since there are three distinct vertex labels present in both molecular graphs, three separate linear assignments will need to be performed in order to calculate the updated value of

E(G1,G2). The three cost matrices associated with these assignments are presented in Figure 4.

Considering the assignment of L3 first, it is clear from Figure 3 that the assignment cost

matrix will consist of only one element. In Figure 3, d(v1) in the methadone molecular graph corresponds to vertex 2 in methadone, and d(v1) in the meperidine molecular graph represents vertex 2 in meperidine. Since both vertices have three incident edges and all incident edges are the same type (N-C), all incident edges in one graph have a compatible edge in the other graph. The corresponding element in the cost matrix will therefore be 3. Since there is only one

2element in the cost matrix, the linear assignment is it is trivial and f(L13,L3)= 3.

This process has been repeated in Figure 4 for L1 and L2 consisting of the vertices of type carbon and nitrogen. The L1 cost matrix is considerably more complex than the other two. In this case an efficient linear assignment algorithm is necessary. The maximum assignment in each cost matrix is highlighted. For the L1 cost matrix, the resulting maximum score is

f(L11,L12)=29 (i.e., the sum of the boldface elements). Note that it is irrelevant whether the computed assignment is unique as the upper bound is a scalar value. Therefore ,

2

⎢lf(L1⎥⎢2913⎥,Lii)E(G1,G2)=⎢∑⎥=⎢++⎥=16.

2⎣i=1⎦⎣222⎦

7

The updated 2nd tier cost-vector similarity is then calculated to be 0.63.

This upper bound value is less than 0.7, the minimum acceptable similarity; in this case, indeed, it is equal to the actual MCES-based similarity value that we would have obtained if the MCES calculation was carried out. The proposed screening method would, therefore, prevent the unnecessary MCES comparison between these two molecules. In concluding this section, it is important to emphasize that the currently proposed screening system is highly dependent upon the selection of an acceptable minimum similarity index (MSI). This implies that the user knows a priori what MSI constitutes a sensible threshold for similarity. 3.0 MCES ALGORITHM 3.1 Modular Product

The maximum common subgraph problem can be reduced to determining the maximum clique in the modular product graph, denoted as G1◊G2. This concept has been discovered on several occasions [21, 22, 23]. The modular product of two labeled factor graphs G1 and G2 is defined on the vertex set V(G1◊G2)) = V(G1)×V(G2) where the respective vertex labels are compatible with two vertices (ui,vi) and (uj,vj) being adjacent whenever

(ui,uj)∈E(G1)and(vi,vj)∈E(G2)andw(ui,uj)=w(vi,vj),or(ui,uj)∉E(G1)and(vi,vj)∉E(G2)

, where w(ui,uj) = w(vi,vj) indicates that the vertex and edge labels for each respective pair of vertices are compatible.

3.2 Edge-Induced Isomorphism

The transformation of the MCIS formulation to the MCES problem is based upon the work of Whitney [24], as illustrated in Figure 5. Figure 5(a) shows two graphs G1=K3 and G2=K1,3, respectively. It is evident by visual inspection that the two graphs in Figure 5(a) are not isomorphic. Figure 5(b) depicts the line graphs of G1 and G2, respectively, and it is clear by inspection that the line graphs are isomorphic. This is referred to as a ΔY exchange. Whitney proved that provided that a ΔY exchange does not occur, an isomorphism between two line graphs L(G1) and L(G2) induces an edge isomorphism between the root graphs (G1 and G2) of the two line graphs. By preventing the occurrence of a ΔY exchange during clique detection, it is then possible to use the modular product approach to determine the MCES between two graphs rather than the MCIS [25, 26].

8

This procedure is illustrated with the example in Figure 6. The chemical graphs G1 and G2 are shown in Figure 6(a), and their respective line graphs are depicted in Figure 6(b). Each vertex in the line graph L(G1) is labeled with its respective edge and vertex endpoint labels in G1. Take for instance edge number 1 in graph G1 of Figure 6(a). This corresponds to vertex number 1 in the line graph L(G1) in Figure 6(b). This vertex is labeled with the adjacent vertex pair C−O.

The edges in a labeled line graph are also labeled. Each edge between two vertices, vi

and vj, in a labeled line graph L(G1) is labeled with the vertex label of the incident vertex in G1 between the two edges in G1 corresponding to vi and vj. For instance, edges 3 and 4 in graph G1 of Figure 6(a) correspond to vertices 3 and 4 in the line graph L(G1) in Figure 6(b). Since edges 3 and 4 in G1 are incident on an atom of type carbon (C), then the corresponding edge in L(G1) is also labeled as type carbon.

Since a labeled line graph can then simply be assumed to be an arbitrary labeled graph, the modular product can be constructed as previously described. The modular product graph for the line graphs depicted in Figure 6(b) is presented in Figure 6(c). Since no ΔY exchange has occurred, the clique depicted in the modular product graph corresponds to the MCES between the two labeled graphs in Figure 6(a). The MCES is highlighted in bold face in the respective factor graphs.

In practice, it is straightforward to verify whether a ΔY exchange has occurred during clique detection. Since a clique in the modular product graph corresponds to a subgraph common to both G1 and G2, the respective degree sequences for G1 and G2 should be identical when the vertices are sorted according to degree. However, when a ΔY exchange has occurred the degree sequences of the common subgraph in G1 and G2 will not be the same, as evidenced by Figure 5. 3.3

Proposed Clique-Detection Algorithm

Our clique-detection procedure is a branch and bound algorithm that incorporates improvements in both lower and upper bounding as well as node selection and pruning. A branch and bound algorithm attempts to explore the search tree by iteratively obtaining subproblems of previous problem instances visualized as branches in the search tree and is a common approach to the maximum clique problem [27]. This process can be described

9

succinctly as the recursive implementation of the relation,

ω(G)=max{1+ω(NG(vi)),ω(G\\vi)}.

Bounding techniques for traversing the search tree can generated using this procedure be categorized into two specific types, 1) lower bounds and 2) upper bounds. A straightforward approach to reducing the number of branch traversals necessary to determine the maximum clique involves using the size of the largest clique so far detected [27] as a lower bound. Upper bounds set a largest possible size on the maximum clique that can exist at a specific search tree node. If this size is smaller than the size required to increase the size of the maximum clique thus far discovered, then it is not necessary to extend the search tree from the current search tree node, and a backtrack can be instanced. Lower Bounds

Since it is widely known that the order of the largest clique thus far detected during a clique detection procedure can be useful in potential pruning heuristics, it seems reasonable to assume that an efficient heuristic capable of determining a relatively large clique in a graph G prior to initiating the clique detection could potentially prevent the investigation of numerous unnecessary search tree branches. In the proposed algorithm, an MCES-based minimum graph similarity heuristic capable of establishing a relatively large lower bound on the size of the maximum clique in a modular product graph is developed.

As noted previously, the MCES between two molecular graphs only has potential value if the two molecules are meaningfully similar; it follows that it is reasonable to determine the MCES between two molecular structures only if the similarity between the two structures is greater than some minimum established threshold. If a value for the minimum similarity index (MSI) is established as the minimum threshold, a lower bound on the size of the maximum clique, |E(G12)|, can be constructed by observing sim(G1,G2)≥MSI which for the previously discussed Johnson metric [17] can be shown to yield

|E(G12)|≥MSI⋅(|V(G1)|+|E(G1)|)⋅(|V(G2)|+|E(G2)|)−|V(G1)|+ΔV(G1),

where E(G1)≤E(G2) and ΔV(G1) is defined as the number of vertices in graph G1 with no equivalent label in G2. Upper Bounds

10

As previously stated, an upper bound sets a maximum value for the largest clique that can exist at a specific search tree node. The two schemes that have been investigated include approximate coloring and edge projection onto the factor graphs

Coloring Approaches

A vertex k-coloring of a graph G is a partition of the vertices V(G) into independent sets {P1,P2,..Pk} such that no vertex in any one partition set Pi is adjacent to any other vertex in the same partition set. The minimum k for which a graph G can be colored (i.e., partitioned) is called the chromatic number of G denoted χ(G). The problem of determining χ(G) has been shown to be in the same class of problems as the maximum clique and MCES problems known as NP-complete for which no known polynomial-time complexity solution is known [28]. It is clear from the definition of the chromatic number that ω(G)≤χ(G); therefore, a requirement for forward traversal in the search tree can be given by h+χ(G)>M where h is the current depth in the search tree and M is the size of the current maximum clique (i.e., lower bound).

Grimmett and McDiarmid [29] and Manvel [30] have shown that as the number of nodes of a graph G increases, the chromatic number starts to become an arbitrarily poor estimate for the upper bound of the maximum clique. Moreover, approximate coloring methods such as the greedy algorithm can give arbitrarily poor estimates of the chromatic number further frustrating efforts to bound the size of the maximum clique in a graph. We have found that the estimated chromatic number obtained using the greedy algorithm [31] provides an effective upper bound within the range 125≤V(G)≤175 when used in conjunction with edge projection with poorer performances resulting if the value is set outside of this range. The median value of V(G)=150 has hence been selected as the threshold for initiating greedy coloring in the proposed algorithm.

Projection Bounds

Bessonov [32] has implemented a bounding technique based on the projection of vertex labels onto the graphs under consideration for the MCIS problem. Although a simple bound, the projective bound is almost always superior to the chromatic bound in practice. While the Bessonov implementation involved a projection bound for the MCIS problem, it is a simple matter to extend it to the MCES problem. The projective bounds are easily conceptualized by considering the modular product in a matrix representation (see Figure 6). Since each row in the modular product represents an edge in the graph G1 and each column represents an edge in graph

11

G2, each node in the modular product graph is assigned two labels, e1i for the ith edge in graph G1 and e2j for the jth edge in graph G2. To determine the two projective upper bounds Kα and Kβ, all the vertices under consideration in the modular product graph are projected back onto the factor graphs G1 and G2; the number of distinct edges conserved in the projection of the modular product onto each respective factor graph G1 and G2 then represents the projective bounds Kα and Kβ, respectively.

When the simple modular product in Figure 6 is projected onto the factor graphs L(G1)

and L(G2), edges {2, 3, 4} and {2’, 3’, 4’} are conserved in L(G1) and L(G2), respectively. This results in Kα = 3 and Kβ= 3. The Bessonov projection upper bound is then given as

min{Kα,Kβ}, and in this case, the projection bound is trivial since the maximum clique

constitutes the modular product.

Although the simple projection of the modular product onto the factor graphs G1 and G2 provides a fast and effective upper bound for the size of the maximum clique in the modular product, we have found that this simple procedure does not take optimal advantage of vertex and edge labeling in the factor graphs. Let i represent an unambiguous identifier for pair of adjacent vertices in a graph incorporating the edge label as well as the labels for the two end point vertices. Let Si1 indicate the set of all adjacent vertex pairs with the unique identifier i in graph

1

G1 and S1 denote the set of all such Si1. S1={S11,S21,...,SN} where N is the number of unique Si1.

We propose a new and sharper upper bound (Kwp) that is given as

Kwp=∑min{Si1,Si2}.

i=1N

This process is best illustrated using an example. Figure 7 demonstrates the procedure for two molecules, juglone and scopoletin. It is clear from the figure that all bonded pairs of atoms in both molecular graphs have a corresponding bonded pair of atoms with the same label identifier in the other graph; therefore, the simple projection procedure will yield the upper bounds Kα=14 and Kβ=15. The value of Kwp is calculated as shown in Figure 7 by 6+2+1+1+1=11.

The initial partition used for the proposed vertex selection procedure is based on the initial labeled projection (i.e. h=0), and the upper bound resulting from this partitioning process

12

will be referred to as Ka. Since the value of Ka at h>0 will be dependent upon the algorithm’s vertex selection procedure, the upper bound Kwp will be computed at each forward search tree node instance to account for the possibility that KwpIn the event of a backtrack instance, a modular product vertex is removed from consideration at a particular search tree node. It may then be possible to decrement Kwp. If the vertex vi under consideration during a backtrack instance is the only vertex in a particular partition used to calculate Kwp such as the vertex corresponding to edge 9 in juglone and edge 8 in scopoletin, then Kwp can be decremented. Vertex Selection

Selkow [33] has shown that an upper bound for the clique number in a graph can be given as ω(G)≤maxmin{1+ϕ(NG(vi)),k}, where vi∈V(G) and k represents the index for the independent set partition Pk of vertices of V(G) containing vi and ϕ(NG(vi)) denotes an upper bound for the maximum clique of the graph induced by the vertices adjacent to vertex vi. This observation serves as the impetus for the novel method for node selection in the proposed branch and bound algorithm. For instance, an arbitrary partitioning of the graph in Figure 8 into independent partitions is depicted in Figure 9(a). If the current estimate for the maximum clique is M=2 (i.e., the lower bound), then it is apparent that ϕ(NG(vi)) need only be calculated for vertices 1, 5, and 6, but if the partitions are sorted in non-increasing order according to the number of vertices in each partition, it is only necessary to calculate ϕ(NG(vi)) for vertices 1 and 3 (Figure 9(b)).

Figure 9(b) serves to illustrate the concept that when trying to determine the maximum clique in a graph, the branch and bound procedure need only consider the vertices contained in the excess partitions Pex={Pk, Pk-1,…, PM+1}. Furthermore, the number of vertices in the

{}partitions Pex and possibly the number of partitions k can be reduced by reassigning vertices in

Pex to partitions in PM={PM, PM-1,…, P1}. If there exists a vertex vi in a partition {Pj|Pj∈Pex}

and a partition {Pm|Pm∈PM}such that NG(vi)IPm=∅, then vi can be reassigned to partition Pm. This is illustrated in Figure 9(b) and 9(c) where node vi = 3 in partition P3 can be placed into partition P1, thus reducing the number of partitions in Pex and the number of nodes in Pex that

must be investigated.

13

The branching and node selection procedure then consists of the following operations: 1) Perform an initial partitioning of the vertices for the root node of the search tree (i.e., the graph G) using the labeled edge projection procedure. Sort the resulting partitions Ph=0={Pk0, Pk0-1,…, P1} such that

ex

PK0≤PK0−1≤...≤P1 and reassign vertices in excess partitions P={Pk0, Pk0-M

1,…, PM+1} to partitions P={PM, PM-1,…, P1} if possible.

2) Choose a vertex vi from partition Pkh in Pex and copy the nodes to partitions in the set Ph+1

using Ph+1:=PhING(vi). Increment the depth h by one. Sort the partitions Ph={Pkh, Pkh-1,…, P1} as before and reassign vertices in excess partitions Pex={Pkh, Pkh-1,…, PM+1} to M-partitions

PM={PM, PM-1,…, P1} if possible. Repeat (2).

The node selection procedure just outlined has the desirable property that the node vi will

typically have large degree in NG(vi) since, on average, it is more difficult to partition (color) vertices of large degree than those with a lower degree, resulting in sparsely populated partitions in Ph. This has the tendency to cause the maximum clique to be detected early in the traversal of the search tree. It has been found that when |V(G)|>400 in the modular product of chemical graphs, the efficiency of the algorithm can be improved by re-sorting the partition blocks according to the non-increasing numbers of vertices in each partition block following the Pex to

PM vertex re-assignment procedure. This re-sorting should only be performed at depth h=0 (i.e.,

the root node). Performing the re-sorting at search tree nodes h>0 does not improve the run time for the algorithm. Horizontal Pruning

Horizontal pruning can be defined as a procedure for reducing the number of nodes or edges in NG(vi) given an easily calculable property of NG(vi). The horizontal pruning techniques proposed here attempt to take maximum advantage of the separation of Ph into Pex and PM.

Since any clique larger than M, the current maximum clique, must have at least one node in Pex,

only the nodes contained in Pex need be considered in order to find any clique larger than M.

Therefore, any horizontal pruning to eliminate any vertices from Ph need only be applied to the vertices in Pex.

Kikusts Pruning

Kikusts [34] has proposed a recursive relation for the determination of the maximum independent set of a graph. This heuristic can be transformed in terms of the maximum clique by the following recursive relations,

14

ω(G)=max{1+ω(NG(vi)),ϖ(vi)}

and

ϖ(vi)=max{C,C⊆(G\\vi)iscompleteandCING(vi)≥2}.

Assume that a node vi is selected at a particular search tree node at depth h, and the recursive search for whether vi is contained in a clique larger than M has backtracked to the search tree node from which vi was selected. At this point, the current value of M represents the term ω(G)=1+ω(NG(vi)) in the previous equation. It is then known that the largest clique that NG(vi) can be involved in is M-1. Therefore, if there is a vertex vj in G that is not adjacent to vi, then the largest clique that vjUNG(vi) can be involved in is also size M. So, in order for vj to be involved in a clique larger than M, it must be adjacent to another vertex vm such that vm is not adjacent to vi. This is illustrated in Figure 10(a).

Similarly, if vj is adjacent to vi, then vj is in NG(vi). This means that the largest clique in

vjUNG(vi) is M-1. Therefore, in order for vj to be involved in a clique larger than M, it must be

adjacent to at least two vertices, vm and vn, not adjacent to vi. This can be strengthened slightly by requiring that vm and vn be adjacent. This situation is represented in Figure 10(b).

If any node vj at depth h, fails to satisfy either requirement illustrated in Figure 10 then it

can be deleted from G because it cannot possibly be involved in a clique larger than M. This means that it may be possible to remove other vertices in addition to vi upon a backtrack instance. Rather than implementing this test for all vertices in G\\vi, the separation of G (i.e., Ph) into Pex and PM allows the simplification of this procedure as this test only need be performed on

the vertices contained in Pex. This procedure will be referred to as Kikusts pruning, denoted

KP(vi, Pex) and is performed during each backtrack instance in an attempt to reduce the number

of vertices in Pex beyond Pex\\vi.

Equivalence Class Pruning

One inherent difficulty associated with the computation of an MCES is degeneracy due to symmetry in a set of query graphs, as many molecules exhibit a fair degree of structural symmetry (e.g., the two molecules shown in Figure 11). This often results in a modular product graph containing large numbers of MCES or near MCES maximal cliques making clique detection more difficult as the upper-bounding operations become less effective. We have found

15

that by addressing symmetry in the graphs being compared prior to clique detection, the efficiency of the matching process can be significantly improved in many cases.

Table 1 lists the edges for the molecular graphs in Figure 11 in their respective

equivalence classes. Take edge 3 in Figure 11(a) and edge 6’ in Figure 11(b). It is clear that the edge pair (3, 6’) is equivalent to edge pairs (3, 3’), (3, 23’), (3, 26’), (6, 3’), (6, 6’), (6, 23’), (6, 26’), (24, 3’), (24, 6’), (24, 23’), (24, 26’), (27, 3’), (27, 6’), (27, 23’), (27, 26’). Any MCES rooted at any of these edge pairs cannot be larger than an MCES rooted at edge pair (3, 6’). Therefore once the MCES rooted at edge pair (3, 6’) is discovered, it is no longer necessary to consider any of the clique search tree branches rooted at these other equivalent edge pairs. Since the equivalent edge pairs correspond to equivalence classes 5 (Graph (a)) and 6 (Graph (b)) in Table 1, it is simple to determine which edge pairs in the modular product graph are equivalent once the factor graphs have been processed using an equivalence testing algorithm. Faulon [35] has shown that equivalence class testing of planar molecular graphs which comprise almost all molecular graphs can be performed in O(n2) time. We have used the algorithm of Fan et al. in our experiments [36, 37] also of O(n2) time.

Since the MCES modular product is constructed using the line graphs of the factor graphs, it must first be determined for what cases the automorphism groups of line graphs, L(G), correspond to the automorphism groups of the root graphs, G. It has been shown that this correspondence will hold provided that the molecular graphs being compared are not isomorphic to the graphs illustrated in Figure 12 [38]. Therefore, in the absence of such a forbidden isomorphism, a simple horizontal pruning heuristic based on equivalence classes can be employed to simplify clique detection. If one or both of the graphs being matched happen to be isomorphic to one of the forbidden graphs in Figure 12, it is still possible to use the equivalence class pruning heuristic provided that the equivalence class detection algorithm is modified to account for the forbidden graphs.

The function Peq(vi, Pex) is defined as the pruning of symmetrically equivalent vertices

vi

from the modular product graph and is only performed at depth h = 0. Let EQ1viand EQ2 be the

equivalence class labels of the edges in graphs G1 and G2, respectively, corresponding to the currently selected vertex vi in the modular product graph. Then Peq(vi, Pex) prunes any vertices vj

vivjin Pex of the modular product graph where EQ1vi=EQ1vj and EQ2=EQ2.

Overall Algorithm

16

The proposed algorithm can be summarized in the following pseudo-code: Algorithm RASCAL Step 0: (Construct Modular Product) If the current comparison passes both cost-vector screening stages then construct the modular product graph.

Step 1: (Initialization) Set h:=0 and M:= the clique lower bound determined using a specified MSI. Partition V(G) into independent sets Ph=0:={PKa0,PKa0−1,...,P1} using the labeled projection technique. Sort P0 in order of non-increasing number of vertices in each partition block (i.e.,

exMPKah≤PKah−1≤...≤P1) and re-assign vertices in P to P if possible. Update Kah if re-assignment resulted the reduction in the number of partitions. If Ui=1Pi≥400, then re-sort the partitions Pi in order of non-increasing number of vertices in each partition block.

Step 2: (Upper Bound) Determine Kwph. If KwphKahi=1

Kah

U

Pi≤150 and h+Kh>M, then use the greedy coloring algorithm to estimate the chromatic

number χ(G). If χ(G)Step 3: (Branching) Choose a vertex vh∈PKah|PKah∈Pex. Set Ph+1:=PhING(vh), Kah+1 := |Ph+1|, and PKah:=PKah\\vh. If PKah=∅ then decrement Kah. Set h:=h+1. If Kah=0 then {M:=

U

h−1i=0

vi

and go to Step 4 } else {sort Ph+1 in order of non-increasing number of vertices in each partition

exMblock (i.e., PKah≤PKah−1≤...≤P1) and reassign vertices in P to P if possible}. Update Kah if

re-assignment resulted the reduction in the number of partitions. Go to Step 2.

Step 4: (Backtracking) If h=0 then end (M=max clique). While h+Kh≤M, decrement h. If h=0 then Peq(vh, Pex). Perform KP(vh, Pex). Update Kwph. Go to Step 2.

4.0 EXPERIMENTAL EVALUATION

Unlike other graph problems such as the maximum clique or the coloring problem, the MCIS /MCES problem does not have a standardized set of benchmark graphs with which to test

17

the efficiency of published algorithms. Since RASCAL has been developed for chemical information handling and no standardized test graphs are available, the experimental evaluation of the algorithm has been performed using a data set of chemical graphs. While chemical graphs are sparse with respect to many of the types of graphs of interest to other research fields, it is assumed that they can be used to satisfactorily test the efficiency of the proposed algorithm. Using the definition of the modular product, it is clear that the resulting modular product will be very dense as most edge pairs in both factor graphs will not be adjacent.

The data set of compounds selected for testing purposes consisted of a collection of 200 compounds comprising steroids, melatonins, statins, anti-leukemia actives, and other various compounds from miscellaneous activity classes. The average number of vertices is 25.6 with a standard deviation of 6.7. In order to investigate the efficiency and robustness of the proposed algorithm, all possible pair-wise MCES comparisons between two molecules in a data set were examined. For this data set, 19,900 pair-wise comparisons were performed.

For comparative purposes, three separate algorithms were coded and tested against the RASCAL algorithm. Two of these algorithms are advanced maximum clique detection algorithms for arbitrary graphs. These are the MC1 algorithm of Wood [31] employing a greedy coloring upper bound and the Ostergard [39] algorithm. The other algorithm is due to Bessonov [32] which was developed specifically for maximum clique detection in a modular product graph and uses the simple edge projection technique as an upper bounding procedure. All three of these algorithms employ different vertex selection schemes for traversing the search tree. Of the three, the vertex selection procedure used by Wood is most like the one employed in the RASCAL algorithm. For the sake of uniformity, all algorithms were coded in Visual C++ 6.0 using the same data structures where appropriate and executed on a 400 MHz Intel Celeron processor with 128 MB RAM using Windows 98. To further ensure uniformity, all four algorithms were run in conjunction with the cost-vector screening procedure previously presented as well as the minimum similarity index (MSI) lower bound for the size of the maximum clique with the exception of the Ostergard algorithm which was run using only the cost-vector screening. The unique branching procedure of the Ostergard algorithm precludes the use of a lower-bounding procedure for the size of the maximum clique.

The specification of an acceptable MSI value will depend in large part on the application as well as the connectivity of the graphs being considered. It will also be dependent upon the

18

number of vertices and density of the graphs. Figure 13 demonstrates a simple example depicting various similarity measures calculated for a set of small molecules. In this figure, dyphylline is compared to a set of other small xanthine compounds. For the efficiency evaluation, trial simulations were run at MSI values of ranging from 0.5 to 0.85, and modular product simplification heuristics [40] were used to simplify the modular product based on chemical knowledge prior to clique detection whenever possible.

To evaluate the performance of the cost-vector screening procedure, the “screen-out” was calculated at the same MSI values. The “screen-out” is defined as the percentage of comparisons correctly excluded from further consideration relative to the total number of comparisons whose MCES-based similarity measure is calculated as being below the specified MSI threshold. Table 2 lists the results of the “screen-out” simulation. Table 2 shows that the proposed screening method is effective in screening unnecessary comparisons and that the 1st tier cost-vector approach becomes increasingly effective as the value of MSI increases; whereas, the 2nd tier cost-vector screen becomes more effective as the MSI value decreases.

Table 3 presents the total time for each simulation trial. A time limit of 24 hours was used for each trial. It is clear from Table 3 that the RASCAL algorithm outperformed the other published algorithms. This is especially so at the lower MSI thresholds, which permit the identification of more distant structural relationships between molecules and which are of most potential interest in the discovery of novel bioactive molecules. While all of the heuristics presented in the RASCAL algorithm provide efficiency improvements, the majority of the decrease in run-time is due to the combination of the labeled edge projection upper bound in association with the partition re-ordering procedure. The equivalence testing heuristic was also found to have a significant impact on run-time for molecular graphs with a high degree of symmetry.

To further test the efficiency and durability of the RASCAL algorithm, it was tested on three separate data sets of molecular structures. These consisted of a set of 5,386 chemical structures supplied by Asinex (at URL http://www.asinex.com), another 2,715 chemical structures acquired from Nanosyn (at URL http://www.nanosyn.com), and a set of 16,000 central nervous system targeted molecules obtained from ChemBridge (at URL

http://www.chembridge.com). These data sets are denoted ASX, NAN, and CNS, respectively. All possible pair-wise comparisons for each data set were evaluated using MSI values of 0.7,

19

0.75, 0.8, 0.85. The minimum MSI of 0.7 was selected because MSI values below 0.7 failed to discriminate between the molecular structures in a chemically sensible manner to the degree desired. Table 4 presents the results of these trial simulations. These results indicate that rates in the range of thousands of comparisons per second using the specified processor can be achieved, even at the lower MSI threshold values, making it feasible to carry out MCES-based searching of large chemical databases in reasonable amounts of time, something that has not previously been possible in systems for chemical information management [41].

Table 4 shows that the per comparison time results for the CNS and NAN are comparable. However, the time results for the ASX data set differ significantly. Although the average size of a molecular graph for the ASX data set is larger than for both the CNS and NAN data sets, this difference does not fully explain the discrepancy in the performance of the algorithm. The difference in performance is chiefly attributable to the relative similarity of the graphs in each data set. A simple measure of diversity was calculated by determining the percentage of pair-wise comparisons whose MCES-based similarity exceeds the specified MSI relative to the total number of pair-wise comparisons possible in a data set. Each cell in Table 5 lists the number of comparisons as well as the percentage of comparisons exceeding the stated MSI value. It is evident from Table 5 that the proportion of comparisons whose similarity exceeds the specified MSI value is larger for the ASX dataset than for the CNS and NAN data sets, indicating that the ASX data set requires proportionately more calls to the computationally expensive graph matching routine. This difference corresponds closely with the noted difference in per comparison time results between the data sets. 5.0 CONCLUSION

An efficient graph similarity procedure RASCAL has been proposed based on the concept of the maximum common edge subgraph. The proposed algorithm consists of several heuristics discovered in the literature as well as several new heuristics such as the 1st and 2nd tier screening procedure, the suggested vertex selection procedure, a sharper upper bound based on graph projections, and the equivalence class pruning method to account for symmetry in the graphs being compared. It is relatively straightforward to implement and has proven to be an efficient tool for similarity calculations in chemical graph databases, thus providing an

alternative to existing measures of chemical similarity based on bit-strings. While RASCAL has

20

been designed for use in chemical information management, the algorithm is conceptually general and is directly applicable to any graph-based similarity application.

It may be possible to significantly improve the performance of the algorithm by incorporating other more efficient upper-bounding techniques in addition to the approaches suggested here. This possibility remains for future work on the algorithm, as does its extension to the calculation of similarity for 3D chemical graphs and its application to chemical problems such as searching and the prediction of biological activity. ACKNOWLEDGEMENTS

The authors would like to extend their gratitude to Pfizer (Ann Arbor) for funding of this project and thank Alain Calvet, George Cowan, Eric Gifford, Christine Humblet, and David Wild of Pfizer and Mark Johnson of Pananugget Consulting for their support. The authors would also like to thank the reviewers for their helpful comments and suggestions for improving the manuscript. The Krebs Institute for Biomolecular Research is a designated centre of the Biotechnology and Biological Science Research Grant.

21

TABLES

Table 1. Equivalence Class Listing of Figure 10

Graph (a)

Equivalence Class

1 2 4 5 6 7 8 9 10 11

Edge ID 1, 12, 30, 34 2, 11, 29, 33 10, 32 3, 6, 24, 27 7, 8, 25, 26 4, 5, 23, 28 13, 22 15, 21 14, 20 16, 19

Graph (b)

Equivalence Class

1 2 4 5 6 7 8 9 10 11

Edge ID 1', 12', 29', 33' 2', 11', 28', 32' 10', 31' 7', 8', 24', 25' 3', 6', 23', 26' 4', 5', 22', 27' 13', 21' 14', 20' 15', 19' 16', 18' 17' 3 9, 31 3 9', 30' 12 17, 18 12

Table 2. “Screen-out” Performance

1st tier

MSI

0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 24.7% 34.3% 46.2% 59.8% 72.9% 85.0% 93.4% 97.6% 2nd tier 59.0% 54.9% 47.3% 36.8% 25.4% 14.3% 6.2% 2.1% Total 83.7% .2% 93.5% 96.6% 98.3% 99.3% 99.6% 99.7%

Table 3. Algorithm Time Comparison Results (seconds)

MSI Algorithm 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 Bessonov Ostergard Wood

>24 hr >24 hr >24 hr

>24 hr >24 hr >24 hr

>24 hr >24 hr >24 hr

>24 hr >24 hr >24 hr

>24 hr >24 hr >24 hr

>24 hr >24 hr

>24 hr >24 hr

93 >24 hr 13

>24 hr 12,8

RASCAL 304 188 120 80 43 22 11 7

22

Table 4. RASCAL Time Trial Results [total (per comparison) in seconds (milliseconds)]

Num. of Num. of Molecules Comparisons

Avg. |V(G)|

Std. Dev. |V(G)|

MSI Data Set

0.7 0.75 0.8 0.85 37,229 (2.57) 22,216 (0.174) 826 (0.224)

17,570 (1.21) 8,462 (0.0661) 349 (0.0947)

7,058 (0.487) 3,269 (0.0255) 150 (0.0407)

71,806 (4.95) 55,340

CNS 16,000 127,992,000 23.3 4.7 (0.432) 1,883

NAN 2,715 3,684,255 26.3 5.5 (0.511) ASX 5,386 14,501,805 32.5 5.0

Table 5. Proportion of MCES Similarity Comparisons Exceeding MSI Threshold

Data Set

MSI

0.7 0.75 0.8 0.85 ASX 1,698,872/11.7% 2,315/6.15% 422,383/2.91% 150,385/1.04% CNS 1,425,165/1.11% 485,830/0.38% 167,584/0.13% 50,920/0.040% NAN 48,279/1.31% 22,224/0.60% 9,749/0.26% 3,522/0.096%

23

FIGURES

a)b) Figure 1. a) Maximum Common Subgraph b) Maximum Common Edge Subgraph

a)OCCCCCOCCCCNCCCCCCCCCCCCCCCCCCCCCCCCNCCOb)

Figure 2. Example MCES Comparison (a) meperidine (b) methadone

24

C11C10C

198C7C9O

C

6

4C13C5C1C2N

3C

C18

17

OCC

1615O14C5C4C

C20C21

22

C116C10C9CC1823C

12C12CC13C7

8C

C

1714C2N

C16

15

C3

1

C

methadone (1)

C

meperidine (2)

L1

1=CL1

2=OL1

3=NL21=CL2

2=OL2

3=Nd(v1)= 4 (7)d(v1)= 1 (9)d(v1)= 3 (2)d(v1)= 4 (5)d(v1)= 2 (16)d(v1)= 3 (2)d(v2)= 3 (8)0d(v2)= 3 (6)d(v2)= 1 (15)d(v3)= 3 (4)d(v3)= 3 (14)d(v4)= 3 (18)d(v4)= 2 (3)d(v5)= 3 (12)d(v5)= 2 (4)d(v6)= 2 (6)d(v6)= 2 (7)d(v7)= 2 (10)d(v7)= 2 (8)d(v8)= 2 (13)d(v8)= 2 (9)d(v9)= 2 (14)d(v9)= 2 (10)d(v10)= 2 (15)d(v10)= 2 (11)d(v11)= 2 (16)d(v11)= 2 (12)d(v12)= 2 (17)d(v12)= 2 (13)d(v13)= 2 (19)d(v13)= 2 (17)d(v14)= 2 (20)d(v14)= 1 (1)d(v15)= 2 (21)d(v15)= 1 (18)

d(v16)= 2 (22)0d(v17)= 2 (23)0d(v18)= 1 (1)0d(v19)= 1 (3)0d(v20)= 1 (5)0d(v21)= 1 (11)0

SUM =

Figure 3. Example 1st Tier Cost-Vector Screening

25

min {d(v12

j), d(vj) }

L1=CL2=OL3=Nd(v1) = 4d(v1)= 1d(v1)= 3d(v2) = 3 d(v2)= 0d(v3) = 3d(v4) = 2d(v5) = 2d(v6) = 2d(v7) = 2d(v8) = 2d(v9) = 2d(v10)= 2d(v11)= 2d(v12)= 2d(v13)= 2d(v14)= 1d(v15)= 1d(v16)= 0d(v17)= 0d(v18)= 0d(v19)= 0d(v20)= 0d(v21)= 0

32

1

3

= 36

11CC20C21C22

C1019C

18C23C

8C7C9O

C18

CC132N17

OC

16methadone (1)

6C4C13C14C

1C2N3C15O14C5C4C

C116C10C9C12C17C16

12CC5

C78C15C1

3C

C

meperidine (2)

L2 (oxygen) :

1161

9

2(i)15(vi)

L3 (nitrogen):

1(i)2(vi)

1

2

013(i)(vi)(i)(vi)

L1 (carbon) :

151234567101112131415161718192021

7841812610131415161719202122231251126314435467710101111121213131714115

(i)18(vi)

422112200000000000011111331122222222220011121111100000000000011112111100000000001111222112200000000000011000220022222222220000000220022222222220000000220022222222220000000220022222222220000000220022222222220000222112200000000000011112111100000000001111111111100000000000011001000000000000001100111111100000000000011(i)(vi)

Figure 4. 2nd Tier Linear Assignment Cost Matrices

26

a)e'1eeG11=K33

G2=K1,3e'3e'2e2b)e1e'1L(G1)L(G2)e3

e2

e'3

e'2

Figure 5. ΔY Exchange

a)

b)

O1CG1

O3L(G1)CO

1

2C4CCC2NCCC4NCC3OC1'OG2

O3'

L(G2)CO1'C4'C2'CC2'NCCC4'NCC3'Oc)

L(G2)

1'2'

3'

4'

1

2

L(G1)

L(G1)◊L(G2)

3

4

Figure 6. Modular Product of Line Graphs

27

juglone

Oscopoletin

12

O12

234

O176514

111013

O10O6

98

7

115

432

14

O13O

1

654321

15

111087

juglone

14

9

1312

C:C123456

C-C79

C-O1011131415

C=C8

C=O12

scopoletin

Figure 7. Example of Labeled Projection Upper Bound

415

PexP4a)b)c)

326

Figure 8. Undirected Graph G

PMP3131M=2

P22,42,42,4P135,63,5,65,61

Figure 9. Vertex Partitioning Procedure

28

vmvjvmvnvjNG(vi)NG(vi)a)vib)vi Figure 10. Kikusts Heuristic

a)

11034O2873414279O332631OO32O11612O161718O201928222321242529513O3015OO33'Ob)

1'O2'3'8'9'O10'7'14'15'4'13'O16'5'6'11'O12'O32'26'17'18'O19'27'21'20'O25'30'O24'31'28'23'22'O29' Figure 11. Symmetrical Molecular Graphs

29

Figure 12. Forbidden Subgraphs for Line Graph Automorphism

OONONNON0.57HONHO0.51NNNONNOONNONNN0.63NN0.35OOSNONNONNN0.78ONONOHONNN0.80ONONcaffeineuricacidcafaminolNNdyphyllinecaptagonenprofylline

Viagra

Figure 13. Example of Chemical Graph Similarity

30

REFERENCES [1] [2] [3] [4] [5] [6]

Kuhl, F., Crippen, G. and Friesen, D. (1984) A combinatorial algorithm for calculating ligand binding. J. Comp. Chem., 5, 24-34.

Shearer, K., Bunke, H. and Venkatesh, S. (2001) Video indexing and similarity retrieval by largest common subgraph detection using decision trees. Patt. Recog., 34, 1075-1091. Horaud, R. and Skordas, T. (19) Stereo correspondence through feature grouping and Maximal Cliques. IEEE Trans. Pattern Anal. Mach. Intell., 11, 1168-1180.

Pelillo, M., Siddiqi, K. and Zucker, S. W. (1999) Matching hierarchical structures using association graphs. IEEE Trans. Patt. Anal. Mach. Intell., 21, 1105-1120. Trinajstic, N. (1992) Chemical Graph Theory. CRC Press, Boca Raton, FL. Willett, P. (1999) Matching of chemical and biological structures using subgraph and maximal common subgraph isomorphism algorithms. IMA Vol. Math. Appl., 108, 11-38. [7] [8]

Gifford, E., Johnson, M., Smith, D. and Tsai, C. (1996) Structure-reactivity maps as a tool for visualizing xenobiotic structure-reactivity. Network Science, 2, 1-33.

Lewis, R. A., Mason, J. S. and McLay, I. M. (1997) Similarity measures for rational Set selection and analysis of combinatorial libraries: the diverse property-derived (DPD) approach. J. Chem. Inf. Comput. Sci., 37, 599-614. [9]

Chen, L. and Robien, W. (1994) Application of the maximal common substructure algorithm to automatic interpretation of 13C NMR spectra. J. Chem. Inf. Comput. Sci., 34, 934-941. [10] [11] [12] [13] [14]

Gillet, V. J., Wild, D. J., Willett, P. and Bradshaw, J. (1998) Similarity and dissimilarity methods for processing chemical structure databases. Comput. J., 41, 547-558. Flower, D. R. (1998) On the properties of bit string-based measures of chemical similarity. J. Chem. Inf. Comput. Sci., 38, 379-386.

Diestel, R. (2000) Graph Theory. Springer-Verlag, New York.

Sanfeliu, A. and Fu, K. S. (1983) A distance measure between attributed relational graphs for pattern recognition. IEEE Trans. Syst. Man, Cybern., 13, 353-362. Hansen, P. (1979) Upper bounds for the stability number of a graph. Rev. Roumaine Math. Pures Appl., 24, 1195-1199.

31

[15] [16]

Liu, R. Y. (19) An upper bound on the chromatic number of a graph. J. Xinjiang Univ. Natur. Sci., 6, 24-27.

Papadopoulos, A. N. and Manolopoulos, Y. (1999) Structure-based similarity search with graph histograms. 10th International Workshop on Database & Expert Systems Applications, Florence, Italy, September 1-3, 1999, pp. 174-178. Springer-Verlag, Berlin. [17]

Johnson, M. (1985) Relating metrics, lines and variables defined on graphs to problems in medicinal chemistry. In Alavi, Y., Chartrand, G., Lesniak, L., Lick, D. and Wall, C. (eds), Graph Theory and Its Applications to Algorithms and Computer Science. Wiley, New York. [18] [19]

Carpaneto, G., Martello, S. and Toth, P. (1988) Algorithms and codes for the assignment problem. Ann. Oper. Res., 13, 193-223.

Brown, R. D. and Martin, Y. C. (1996) Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection. J. Chem. Inf. Comput. Sci., 36, 572-584. [20]

Patterson, D. E., Cramer, R. D., Ferguson, A. M., Clark, R. D. and Weinberger, L. E. (1996) Neighborhood behavior: A useful concept for validation of \"molecular diversity\" descriptors. J. Med. Chem., 39, 3049-3059. [21] [22] [23]

Levi, G. (1972) A note on the derivation of maximal common subgraphs of two directed or undirected graphs. Calcolo, 9, 341-352.

Barrow, H. and Burstall, R. (1976) Subgraph isomorphism, matching relational structures and maximal cliques. Inf. Proc. Lett., 4, 83-84.

Vizing, V. G. (1974) Reduction of the problem of isomorphism and isomorphic entrance to the task of finding the nondensity of a graph (in Russian). Third All-Union Conference on Problems of Theoretical Cybernetics, Novosibirsk, pp. 124. [24] [25]

Whitney, H. (1932) Congruent graphs and the connectivity of graphs. Amer. J. Math., 54, 150-168.

Nicholson, V., Tsai, C., Johnson, M. and Naim, M. (1987) A subgraph isomorphism theorem for molecular graphs. In King, R.B. and Rouvray, D.H. (eds), Graph Theory and Topology in Chemistry. Elsevier, Athens, GA.

32

[26] [27] [28] [29] [30] [31] [32]

Kvasnicka, V. and Pospichal, J. (1990) Maximal common subgraphs of molecular graphs. Reports Molecular Theory, 1, 99-106.

Pardalos, P. and Xue, J. (1994) The maximum clique problem. J. Global Optim., 4, 301-328.

Garey, M. R. and Johnson, D. S. (1979) Computers and Intractability. W. H. Freeman, San Fancisco, CA.

Grimmett, G. R. and McDiarmid, C. J. H. (1975) On colouring random graphs. Math. Proc. Camb. Phil. Soc., 77, 313-324.

Manvel, B. (1981) Coloring large graphs. Congr. Numer., 33, 197-204.

Wood, D. (1997) An algorithm for finding a maximum clique in a graph. Oper. Res. Lett., 21, 211-217.

Bessonov, Y. E. (1985) On the solution of a problem on the search for the best intersection of graphs on the basis of an analysis of the projections of the subgraphs of the modular product (in Russian). Vychisl. Sistemy, 3-22,121. [33] [34] [35]

Selkow, S. (1978) New bounds for the clique number of a graph. Inf. Proc. Lett., 7, 173-174.

Kikusts, P. (1986) Another algorithm determining the independence number of a graph. Elektron. Inf. Verarb. Kybern., 22, 157-166.

Faulon, J. L. (1998) Isomorphism, automorphism partitioning, and canonical labeling can be solved in polynomial-time for molecular graphs. J. Chem. Inf. Comput. Sci., 38, 432-444. [36] [37]

Fan, B. T., Barbu, A., Panaye, A. and Doucet, J. P. (1996) Detection of constitutionally equivalent sites from a connection table. J. Chem. Inf. Comput. Sci., 36, 654-659. Fan, B. T., Panaye, A. and Doucet, J. P. (1999) Comment on \"Isomorphism,

automorphism partitioning, and canonical labeling can be solved in polynomial-time for molecular graphs\". J. Chem. Inf. Comput. Sci., 39, 630-631. [38] [39]

Teh, H. H. and Chen, C. C. (1970) On the automorphism groups of interchanged graphs. Nanta Math., 4, 72-78.

Ostergard, P. (2002) A fast algorithm for the maximum clique problem. Discrete Appl. Math., 120, 195-205.

33

[40]

Raymond, J., Gardiner, E. and Willett, P. (2002) Heuristics for rapid similarity searching of chemical graphs using a maximum common edge subgraph algorithm. J. Chem. Inf. Comput. Sci., 42, 305-316. [41]

Willett, P., Barnard, J. and Downs, G. (1998) Chemical similarity searching. J. Chem. Inf. Comput. Sci., 38, 983-996.

34

因篇幅问题不能全部显示,请点此查看更多更全内容

Copyright © 2019- 91gzw.com 版权所有 湘ICP备2023023988号-2

违法及侵权请联系:TEL:199 18 7713 E-MAIL:2724546146@qq.com

本站由北京市万商天勤律师事务所王兴未律师提供法律服务