Faulon J.L., Bender A. Handbook of Chemoinformatics Algorithms

Computer-Aided Molecular Design: Inverse Design

Donald P. Visco, Jr.

CONTENTS

9.1 Introduction
9.2 CAMD and Quantitative Structure–Activity Relationship (QSAR)/Inverse-QSAR (iQSAR)
  9.2.1 QSAR
  9.2.2 The Origins of CAMD
  9.2.3 Inverse QSAR
9.3 General Features of CAMD
9.4 Generate and Test Approach of Gani and Coworkers
  9.4.1 Hybrid-CAMD
    9.4.1.1 Predesign Phase
    9.4.1.2 Design Phase
    9.4.1.3 Postdesign Phase
  9.4.2 Case Study: Chemical Process Industry Application
  9.4.3 Case Study: Bio-Related Application
9.5 CAMD as Optimization
  9.5.1 Mixed-Integer Linear Programming Algorithm for CAMD
    9.5.1.1 Molecular Representation
    9.5.1.2 Constraint Equations
    9.5.1.3 Case Study
    9.5.1.4 Case Study: Bio-Related Application
  9.5.2 CAMD Using Signature
    9.5.2.1 What Is Signature?
    9.5.2.2 Inverse Design Algorithm Using Signature
    9.5.2.3 Case Study: Chemical Process Industry Application
    9.5.2.4 Case Study: Bio-Related Application
9.6 Concluding Remarks
References

9.1 INTRODUCTION

“Molecular design” is a term that has many connotations in a variety of fields. Cer-

tainly one can perform experimental work on various compounds and, through an

analysis of their properties, propose a new substance with a desired property value.

This would be an example of molecular design. However, in this chapter we will

focus on the in silico approach to molecular design, which goes by the catch-all term

“computer-aided molecular design,” or CAMD.

At its most basic level, CAMD is the application of computer-implemented algo-

rithms that are utilized to design a molecule for a particular application. Normally,

when one considers the term “molecular design,” a common thought is in the area

of therapeutics. Many researchers in industry and academia alike are involved in the

design of drugs and, accordingly, extensive effort has been afforded to developing

techniques specific to these types of systems [1–4]. However, while not as visible

or attractive as marketing the latest pharmaceutical, CAMD is a popular and useful

technique in many other areas, such as for polymers [5,6] or in solvent design [7].

In general, CAMD has a practically infinite solution space wherein to search

for candidates. As we shall see in this chapter, when the desired molecules are for

biological systems, the solution space is estimated to be at least 10^60 [8], which is

a relatively tiny fraction of the space as a whole. Large search spaces are both a

blessing and a curse. With a vast number of compounds to evaluate, there is more of a

possibility to find a higher-quality and/or novel candidate. This could, in turn, lead to

a discovery with the potential for great economic impact for a particular company. On

the other hand, with such a big “haystack,” enormous time and effort could be spent in a search that leads nowhere. Accordingly, efforts are made

a priori to limit the search space using techniques such as full-fledged templating [9]

or requiring the presence of certain features in a candidate molecule [10].

The two most visible industries using CAMD are the chemical process industry

and the pharmaceutical industry. While both industries are solving CAMD problems,

they differ in substantial ways. For example, the chemical process industry regularly

uses CAMD in the area of solvent design [7]. Solvents are designed to have certain

properties for applications in a particular area and outputs of CAMD algorithms are

scored based on predicted properties (most often from group-contribution methods).

In the pharmaceutical industry, CAMD is often used in a de novo approach [11,12].

In its most popular implementation, ligands are built within an active receptor site

through a CAMD algorithm, although in reality the term de novo has been used loosely

to encompass virtually any sort of computational drug design [11]. Hence, while both

industries use algorithms that share the common features of computational molecular

design and scoring of candidates, the scoring functions used and (ultimately) the

algorithms employed to make (or revise) the selected candidates are different. For

more details on specific de novo design approaches, many good reviews exist [11,12].

This chapter presents general molecular design methods where compounds are

designed using structure–activity or structure–property relationships. Chapter 10

focuses on drug design and, in particular, de novo drug design. With de novo design,

compounds are constructed ab initio to complement a given target receptor. In con-

trast to the next chapter, the techniques presented here are not limited to drugs

and do not require the knowledge of potential targets. Specifically, we focus on three inverse design techniques that have been used during the last decade to design compounds for both product design/engineering applications and bio-related applications. We have chosen these three because they present a broad overview of some of the important issues associated with the design of molecules. The first algorithm presented, a group-based generate and test approach, derives from the work of Gani and coworkers during the early 1980s [13,14]; it has been modified since that time and remains an important CAMD approach to this day [15].

The second algorithm is representative of the approach to treat molecular design as an

optimization problem solved using a mixed-integer nonlinear approach, popularized

by the work of Maranas [6,16]. Here, topological indices are incorporated in con-

junction with adjacency matrices and we present a recent implementation based on

the work of Camarda and Sunderesan [17]. The third algorithm comes from the work

of Faulon and Visco using fragments of molecules (called Signatures) in conjunction

with a powerful structure generator to solve the molecular design problem [18,19].

For each of the three approaches, two case studies are presented that show molecular design results from the implementation of the algorithm: one for an engineering application and one for a bio-related application.

9.2 CAMD AND QUANTITATIVE STRUCTURE–ACTIVITY RELATIONSHIP (QSAR)/INVERSE-QSAR (iQSAR)

Evaluating the fitness of molecular candidates derived from CAMD algorithms is

regularly performed by using models that are trained on experimental data. QSARs

for a particular property of interest are normally how candidates are scored. In this

section, we provide an overview of QSARs and CAMD as well as a way to utilize a

QSAR in a reverse fashion as an inverse design technique.

9.2.1 QSAR

A QSAR is a quantitative structure–activity (or property) relationship, which pur-

ports to describe something about a molecule (its activity against a certain protein, its

boiling point, etc.) based on the molecule’s structure. It was introduced in the 1960s

with the work of Hansch [20] and is still an active area of research [21] with a rich

history [22], although its utility as a predictive tool has been called into question [23].

While molecular properties themselves or whole-molecule descriptors can be used

as independent variables in a QSAR, a popular approach is to use independent vari-

ables based on subparts of the molecule. For example, group-contribution techniques

decompose a molecule into smaller groups where each group provides some contri-

bution to a predicted molecular property. Such approaches are well highlighted in

The Properties of Gases and Liquids [24]. Other techniques examine a 2D graphical

representation of a molecule where atoms are nodes and bonds are edges. Here, an

operator on some portion of the molecular graph plays the role of independent variable

and many of these descriptors exist in the literature today [25]. Note that “QSAR”

is a bit of a catch-all term and we use it in this chapter to denote any property of

interest and not just biological activity. Further information on QSARs is given in

Chapters 6 and 7.
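The group-contribution idea described above can be sketched in a few lines. This is a minimal illustration, assuming a purely additive model; the base value and the per-group contributions are hypothetical placeholders chosen for readability, not fitted parameters from any published method.

```python
from collections import Counter

# Hypothetical base value and group contributions (illustrative numbers only).
BASE = 200.0
CONTRIBUTION = {"CH3": 24.0, "CH2": 23.0, "OH": 93.0}


def predict_property(groups):
    """Additive estimate: base + sum over groups of (count * contribution)."""
    counts = Counter(groups)
    return BASE + sum(n * CONTRIBUTION[g] for g, n in counts.items())


# 1-propanol decomposed into first-order groups: CH3-CH2-CH2-OH
propanol = ["CH3", "CH2", "CH2", "OH"]
estimate = predict_property(propanol)
```

Because the estimate depends only on group counts, any two isomers built from the same groups receive the same predicted value, which is one reason higher-order structural information is brought in later in the chapter.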

9.2.2 THE ORIGINS OF CAMD

Computer-aided molecular design has its origins in the early 1980s with the work

of Gani and Brignole [13,14]. Here, functional groups derived to estimate activity

coefficients of nonelectrolytes in an approach called UNIFAC [26] were used in a

generate and test approach for use in solvent selection. This technique, in general,

suffered greatly from combinatorial explosion since many nonfeasible structures are

generated through the algorithm. Additionally, accounting for steric effects is often

not successful using group-contribution techniques [27].

9.2.3 INVERSE QSAR

An alternative to using prescribed functional groups in a CAMD algorithm is the family of so-called iQSAR techniques [28]. Here, rather than using a QSAR to score potential molecules created from combining groups, one fixes a desired value (or range of values) and attempts to solve for the set of independent variables (descriptors) that satisfy the QSAR. Once this is done, a molecule (or molecules) is generated (normally from a structure generator) based on those values of the independent variables, if such a molecule exists.

The first complete attempt at iQSAR was reported by Zefirov and coworkers in

1990 in relation to connectivity indices [29]. In this algorithm, a value of the order-1

connectivity index is used as input and is rewritten in terms of the distribution of edge

types. Valence-type distributions are determined from this edge type and structures

are generated from a structural isomer generation code. Reported degeneracy issues

exist that are associated with many structures possessing the same connectivity index.

Additionally, since the edge and valence-type distributions do not follow a one-to-one

correspondence, structures that are generated from the valence-type distributions are

not guaranteed to possess the required value of the connectivity index.
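The rewriting of the order-1 connectivity index in terms of edge types, and the degeneracy that follows, can be illustrated with a short sketch. It assumes the standard Randić definition, in which each bond between atoms of degree i and j contributes 1/sqrt(i*j); the function names and the hydrogen-suppressed edge lists are our own illustration, not code from the cited work.

```python
from collections import Counter
from math import sqrt


def edge_type_distribution(edges):
    """Count bonds by the (sorted) degrees of their two end atoms."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return Counter(tuple(sorted((deg[a], deg[b]))) for a, b in edges)


def chi1_from_edge_types(dist):
    """Order-1 connectivity index written as a sum over edge types."""
    return sum(n / sqrt(i * j) for (i, j), n in dist.items())


# Hydrogen-suppressed graphs: a 7-carbon chain (atoms 0..6) plus one methyl (atom 7).
chain = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
methylheptane_3 = chain + [(2, 7)]
methylheptane_4 = chain + [(3, 7)]

dist_3 = edge_type_distribution(methylheptane_3)
dist_4 = edge_type_distribution(methylheptane_4)
same_index = chi1_from_edge_types(dist_3) == chi1_from_edge_types(dist_4)
```

In this example 3-methylheptane and 4-methylheptane share the same edge-type distribution, so they necessarily share the same index value, which is the kind of degeneracy noted above.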

A few years later, Zefirov and coworkers again looked at iQSAR, but this time in

relation to another topological index, here the kappa (κ) indices [30]. The algorithm

is based on setting a desired number of vertices in a 2D graph of a molecule and

partitioning that number into the number of nodes of degrees 1, 2, 3, and 4. From the

conditions set on graph existence associated with a particular partition and by fixing

a desired value (or range of values) from a QSAR based on the κ indices, bond path

equations are determined and act as constraints on the partitions. The partitions that

pass are then converted into structures using a structure generator. While the approach is straightforward, it is limited to QSARs built on κ indices. Additionally, there are known degeneracy issues associated with κ indices.
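The partitioning step can be sketched as follows. This is a minimal illustration that assumes acyclic graphs only, for which the graph-existence condition reduces to the degree sum equaling 2(N - 1); the κ-based bond path constraints of the original algorithm are not reproduced, and the function name is hypothetical.

```python
def degree_partitions(n_vertices):
    """List (n1, n2, n3, n4): numbers of vertices of degree 1..4 in an acyclic graph.

    A partition is kept when the counts add up to N and the degree sum equals
    2 * (N - 1), i.e. twice the number of edges in a tree on N vertices.
    """
    target = 2 * (n_vertices - 1)
    partitions = []
    for n4 in range(n_vertices + 1):
        for n3 in range(n_vertices - n4 + 1):
            for n2 in range(n_vertices - n4 - n3 + 1):
                n1 = n_vertices - n4 - n3 - n2
                if n1 + 2 * n2 + 3 * n3 + 4 * n4 == target:
                    partitions.append((n1, n2, n3, n4))
    return partitions


five_vertex = degree_partitions(5)
```

For five vertices the three surviving partitions correspond to the carbon skeletons of n-pentane, isopentane, and neopentane; in the full method each surviving partition would then be screened against the QSAR constraints before structure generation.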

Other iQSAR approaches for different independent variables have been devel-

oped as well, such as Kier and Hall’s use of molecular connectivity indices [31,32]

and Zefirov’s use of the Hosoya Index [33] as well as other information topological

indices [34].

9.3 GENERAL FEATURES OF CAMD

In CAMD, molecules are made from fragments or groups. However, there is no

standard as to what are considered groups. Accordingly, the same algorithms, which

utilize different starting fragments, can end up with different optimally predicted

solutions.

Computer-aided molecular design can be broadly described in three general steps,

which are presented below.

Step 1: Selection of groups or fragments. CAMD requires a pool of groups or

fragments in order to build molecules. The selection of these groups is not standard

and is normally a function of the problem to be solved. In fact, even what is considered

a fragment is problem dependent since fragments can be as small as a single atom or

contain many atoms [35].

Step 2: Making molecules (by combining groups/modifying candidates). If one

is working with groups, they must be merged together to form molecules. However,

there are two issues here. First, how the groups are selected is an algorithmic issue.

While one can use a technique that exhaustively selects groups (a so-called generate

and test paradigm), a variety of constraints can be implemented at this stage such

as requiring certain groups to be present (but not exceeding a certain number) [36].

Second, rules must be developed on how the various groups can merge together

based on valence arguments (among other user-defined constraints) [17,36,37]. Those

molecules that have been deemed structurally feasible are ready to be evaluated for

fitness. Note also that candidate molecules can be modified in this step as well (if

part of a feedback loop) through stochastic techniques such as genetic algorithms

[38,39].

Step 3: Evaluating candidate fitness. Depending on the problem to be solved,

a single scoring function may be used or many scoring functions may be layered

together, to either filter out or rank solutions. While the scoring is normally based

on group contribution (through QSARs), other factors such as molecular stability

[40] or synthetic feasibility [41] can be used as well. It is also here where, if using a

stochastic algorithm, candidates can be modified (via step 2) in an attempt to improve

the rating of potential candidates. The best candidates are ranked and the top ones

move onto further analysis, through either experimental verification or additional

testing.

9.4 GENERATE AND TEST APPROACH OF GANI AND COWORKERS

One of the most popular CAMD algorithms was implemented by Gani and coworkers based on the use of the UNIFAC groups. We describe that algorithm here. In 1991, Gani

and coworkers refined their previous approaches [13,14] to present a methodology

for CAMD based on the previously identified UNIFAC groups [36]. Working with

the UNIFAC groups provided direct access to various parameters for these groups

that are needed during property estimation.

Gani created six classes of compounds based on the number of attachments avail-

able in a particular group. Class 0 is just molecules themselves (such as methanol),

while Class 5 is specific for aromatic groups. In between those classes, the label refers

to the number of attachments available for a particular group (e.g., CH3 is in class 1). Within each class, five categories exist that house information on type of attachment, basically encoding chemical feasibility and stability. In general, the lower the category number, the fewer restrictions are placed on a particular group.

While there have been modifications to this approach since its inception such

as the use of second-order groups [35], we focus on the broader methodology that

implements these ideas, the Hybrid-CAMD algorithm [40,42,43].

9.4.1 HYBRID-CAMD

The Computer Aided Process-Product Engineering Center (CAPEC) of the Techni-

cal University of Denmark offers their Integrated Computer-Aided System (ICAS),

inside of which is housed a computer-aided molecular design code called Pro-

CAMD (v 3.6). The educational version of ICAS boasts at least 50 academic users

worldwide and the CAPEC counts 32 companies from a varying range of indus-

tries as industrial consortium members. ProCAMD is based on the Hybrid-CAMD

[40,42,43] approach, which combines the generate and test paradigm of previous

techniques with inclusion of higher-order information (such as molecular connectiv-

ity through topological indices). Owing to its popularity, the algorithm is described

below.

The Hybrid-CAMD approach to molecular design comprises three steps: (1) pre-

design phase, (2) design phase, and (3) postdesign phase. We will describe each step

individually below.

9.4.1.1 Predesign Phase

The predesign phase is important since it speaks to the practicality of solving a particular CAMD problem. If equations or data are not available to estimate a particular property, that gap is identified at this step.

ALGORITHM 9.1 HYBRID-CAMD: PREDESIGN PHASE ALGORITHM

1. Problem specification, including specific information on properties of interest, range of application, and so on
2. List equations/methods (e.g., QSARs) available to predict required properties
3. Based on step 2, select proper groups to be used to build molecules

9.4.1.2 Design Phase

The design phase is a set of four generate-and-test steps (called levels), each of which uses more sophisticated structural information than the one before. In effect, this is a set of four filters with increasingly fine (and computationally expensive) thresholds. Many solutions can be evaluated quickly at the lower levels, while the higher levels provide more stringent tests of fitness but take longer to evaluate. Such a multistep approach is purported to mitigate the combinatorial

explosion that was associated with earlier versions of this type of generate and test

technique [40].

9.4.1.2.1 Level 1

Level 1 is a generate-and-test algorithm that finds sets (called vectors) of feasible

groups that also pass any design constraints based on these groups. The groups are

the UNIFAC first-order groups and are characterized by five classes (based on valency)

and five categories (based on chemical feasibility/stability). Sets of feasible groups

passing this level are then feasible from a valency/feasibility/stability standpoint as

well as from a property standpoint. This algorithm, which is presented below, follows

from that given by the authors.

ALGORITHM 9.2 HYBRID-CAMD: DESIGN PHASE ALGORITHM (LEVEL 1)

1. For a given set of building blocks (i.e., groups), set the min/max number of groups in a compound (Kmin and Kmax) and the type of compound (acyclic, cyclic, or aromatic)
2. For K = Kmin, Kmax
   a. Solve feasibility constraints associated with classes for a given value of K
      i. Solve constraints associated with categories based on chemical feasibility and stability
      ii. Generate all possible combinations of groups within a particular solution
      iii. Screen each combination against property constraint(s) based on knowledge of number and type of group only
3. K = K + 1
4. If K > Kmax, stop; else go to step 2

Note that step 2a is simply a sum of the number of partitions of length 4 that

can be obtained from a given value of K since the classes reflect the number of available attachments for that class (e.g., class 1 has one free attachment, such as –CH3). Also, if a cyclic compound is chosen, an additional constraint

is added to the partition, which is based on the maximum number of rings in a

molecule.
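The Level 1 loop can be sketched as below. This is a simplified illustration, not the ProCAMD implementation: it assumes acyclic compounds only, replaces the category rules with nothing at all, and scores candidates with a hypothetical additive property model. The group table, base value, and function name are all assumptions made for the example.

```python
from itertools import combinations_with_replacement

# Hypothetical group table: name -> (class, i.e. free attachments; contribution).
GROUPS = {
    "CH3": (1, 24.0),
    "OH": (1, 93.0),
    "CH2": (2, 23.0),
    "CH": (3, 22.0),
    "C": (4, 18.0),
}
BASE = 200.0  # hypothetical constant of the additive property model


def level1(k_min, k_max, p_min, p_max):
    """Return group vectors that are valence-feasible and meet the property window."""
    passed = []
    for k in range(k_min, k_max + 1):
        for vector in combinations_with_replacement(sorted(GROUPS), k):
            # Acyclic feasibility: K groups joined by K - 1 bonds consume
            # exactly 2 * (K - 1) free attachments.
            if sum(GROUPS[g][0] for g in vector) != 2 * (k - 1):
                continue
            # Screen using the number and type of groups only.
            estimate = BASE + sum(GROUPS[g][1] for g in vector)
            if p_min <= estimate <= p_max:
                passed.append((vector, estimate))
    return passed


candidates = level1(k_min=2, k_max=4, p_min=335.0, p_max=370.0)
```

With these placeholder numbers, three vectors survive: one with three groups and two with four groups. The two four-group vectors receive the same estimate even though they lead to different skeletons, so the screen at this level cannot separate them.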

9.4.1.2.2 Level 2

Level 2 is the step that takes the vectors that pass through Level 1 and generates 2D

structures in a recursive algorithm that also removes duplicate structures. In essence,

the algorithm is finding the feasible spanning trees from the base graph, which pro-

vides all of the potential connections between groups. Caution is taken to split up

nonsymmetric groups that have more than one free connection. A key aspect of the

algorithm is an accounting of which groups have free connections available and which

do not as the algorithm proceeds. Note that an additional step is required if cyclic

compounds are to be generated.

ALGORITHM 9.3 HYBRID-CAMD: DESIGN PHASE ALGORITHM (LEVEL 2)

1. Set list of generated compounds to 0
2. Choose a solution vector, V
3. Choose a starting group from V
4. For all free connections in V
   a. Select a free connection
      i. For all unused groups in V
         1. If connection is allowed between free connection and unused group, make a copy of V with the new connection and add to list of generated compounds
5. If all groups have been used, STOP
6. Remove duplicate solutions
7. If all groups have not been used, yet no free connections exist, remove solution
8. Go to step 1, if remaining vectors exist
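A compact sketch of this kind of recursive structure generation is given below for a single solution vector. It is an illustration under stated assumptions rather than the authors' code: only acyclic structures are built, the "connection is allowed" test is reduced to having a free valence, and duplicates are removed with a canonical nested-tuple form of the tree (a technique we substitute here for whatever duplicate check the original uses). The valence table and all function names are hypothetical.

```python
# Free connections per group (illustrative subset of first-order groups).
VALENCE = {"CH3": 1, "OH": 1, "CH2": 2, "CH": 3, "C": 4}


def rooted_form(adj, labels, node, parent):
    """Canonical nested tuple for the subtree hanging from `node`."""
    children = sorted(
        rooted_form(adj, labels, c, node) for c in adj[node] if c != parent
    )
    return (labels[node], tuple(children))


def canonical(edges, labels):
    """Root-independent form of a tree, used to detect duplicate structures."""
    adj = {i: [] for i in range(len(labels))}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    return min(rooted_form(adj, labels, root, None) for root in adj)


def generate_structures(vector):
    """Return the distinct acyclic group graphs that use every group in `vector`."""
    n = len(vector)
    found = set()

    def extend(edges, used, free):
        if len(used) == n:
            if not any(free):              # every attachment is satisfied
                found.add(canonical(edges, vector))
            return
        open_nodes = [a for a in used if free[a] > 0]
        if not open_nodes:                 # unused groups remain, nothing is free
            return
        a = open_nodes[0]                  # select a free connection
        for b in range(n):                 # try every unused group
            if b in used:
                continue
            new_free = list(free)
            new_free[a] -= 1
            new_free[b] -= 1
            extend(edges + [(a, b)], used + [b], new_free)

    extend([], [0], [VALENCE[g] for g in vector])
    return found


isomers = generate_structures(["CH3", "CH3", "OH", "CH", "CH2"])
```

For this vector the sketch returns two group graphs, corresponding to 2-butanol and isobutanol. The search revisits the same structure along several construction orders, which is why duplicate removal is an essential part of the level.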

Once the structures have been generated, the potential exists to use additional

property estimation techniques to remove those structures that do not fall within

constraint limits. In the Hybrid-CAMD approach, the notion of second-order groups

[35] is used, which are, in essence, combinations of smaller first-order groups. Here,

a pattern matching technique for these second-order groups is used on the adjacency

matrix to determine the absence or presence of these second-order groups. This infor-

mation, along with QSARs available based on second-order groups, can be used to

predict the physical properties of the candidate molecule. Note that the adjacency

matrix at this level is group based and provides a 2D table of connectivity between

groups in the structure.

9.4.1.2.3 Level 3

The third level aims to transform connectivity information between groups to that

between atoms and to utilize constraints that are based on atomistic connectivity (such

as topological indices). In this level, the groups of the group-based adjacency matrix

are expanded atomistically to create an atom-based adjacency matrix. However, there

can be a degeneracy associated with isomers of some groups when transforming to

an atomic description. Thus, some group-based adjacency matrices will provide more

than one atom-based adjacency matrix.

ALGORITHM 9.4 HYBRID-CAMD: DESIGN PHASE ALGORITHM (LEVEL 3)

1. For each group-based adjacency matrix
   a. Expand the group into constituent atoms in the matrix and fill in “1” to establish connectivity within the group
   b. Where degeneracy exists, create new matrices that account for this connectivity
2. Find where original groups connect and identify with a “1” in the new atom-based adjacency matrix
3. Stop
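The expansion from a group-based to an atom-based adjacency matrix can be sketched as follows. This is a minimal illustration with assumptions stated up front: hydrogens are suppressed, every bond is marked with "1" regardless of bond order, and each group has a single attachment atom, so the degeneracy of step 1b does not arise and is not handled. The template table and function name are hypothetical.

```python
# Hypothetical templates: group -> (heavy atoms, internal bonds, attachment atom index).
TEMPLATES = {
    "CH3": (["C"], [], 0),
    "CH2": (["C"], [], 0),
    "COOH": (["C", "O", "O"], [(0, 1), (0, 2)], 0),
}


def expand(groups, group_adj):
    """Build the atom list and atom-based adjacency matrix from a group-based one."""
    offsets, atoms = [], []
    for g in groups:
        offsets.append(len(atoms))         # index of the group's first atom
        atoms.extend(TEMPLATES[g][0])
    n = len(atoms)
    atom_adj = [[0] * n for _ in range(n)]

    def bond(i, j):
        atom_adj[i][j] = atom_adj[j][i] = 1

    # Step 1a: connectivity within each group.
    for g, off in zip(groups, offsets):
        for i, j in TEMPLATES[g][1]:
            bond(off + i, off + j)

    # Step 2: connectivity between groups, via each group's attachment atom.
    for a in range(len(groups)):
        for b in range(a + 1, len(groups)):
            if group_adj[a][b]:
                bond(offsets[a] + TEMPLATES[groups[a]][2],
                     offsets[b] + TEMPLATES[groups[b]][2])
    return atoms, atom_adj


# Propanoic acid as CH3-CH2-COOH.
groups = ["CH3", "CH2", "COOH"]
group_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
atoms, atom_adj = expand(groups, group_adj)
```

A group with two attachment sites on different atoms would need one output matrix per assignment of neighbors to sites, which is where the multiple atom-based matrices mentioned in the text come from.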

Determining atomistic-level connectivity opens up the possibility to use many

QSARs that have already been developed based on topological indices. Connectivity

indices [44,45] and shape indices [46–48], for example, have been used as molecular

descriptors in previous QSARs for a variety of physical properties. Accordingly, this

information can be used as a further screen of potential candidate solutions, especially

where different isomers possess widely different properties.

9.4.1.2.4 Level 4

The final level of the design phase algorithm converts the 2D representation of the

molecule to a 3D representation, with an accounting for potential stereoisomers.

Once the 3D representation comes into being, additional techniques to evaluate the

fitness of the structure may be employed.

ALGORITHM 9.5 HYBRID-CAMD: DESIGN PHASE ALGORITHM (LEVEL 4)

1. Choose an atom-based adjacency matrix
2. Select a single-bonded atom, J, and assign its position as the origin
3. Select the atom it is bonded to, K, and a direction
4. Find number and types of bonds K participates in
5. Determine position of K based on bond length and position of J
6. Find other atoms bonded to K and set distance and direction
7. Repeat until all atoms are used
8. For each atom
   a. If chiral centers possible, duplicate structure and make appropriate position swaps for R/S isomers
9. For each double bond
   a. Analyze if possible for Z/E isomers. If so, duplicate structure and make appropriate positional swaps for Z/E isomers

Note that not included in the algorithm above is an additional step to create cyclic

structures. If this occurs, an analysis of the possibility for cis/trans isomerism is

performed as well. Additionally, there is no step in the algorithm for the proper

inclusion of both torsional angles and uniformity of bond lengths in a ring.

Once the 3D structure(s) are obtained from the 2D atom-based adjacency matrix,

various structure-based analytical techniques can be used involving molecular force

fields. This can potentially address issues of torsional angles or the uniformity of bond

lengths in a ring. Potential stability issues in a structure that had not been a factor