318 Handbook of Chemoinformatics Algorithms
11.2 THE CHALLENGES OF GENERATING NETWORKS
Given an ensemble of possible chemical reactions, the network generation process essentially consists of finding all species that can be synthesized from an initial set of reactants. The process is recursive, as reactions can be applied to generated species as well. It terminates according to the goal of the study: for instance, when a specific compound has been produced, or when all compounds with predefined physical, chemical, or biological properties have been generated. The same process can be used to search for all species producing a given target by reversing the reactions. This reverse process is used in retrosynthesis analysis, where reversed reactions are named transforms and the generated species are named synthons [1]. The network generation algorithms given in this chapter apply to both synthesis and retrosynthesis, as it is just a matter of defining the direction of the reactions. As reviewed by Ugi et al. [2], three main approaches have been developed for network generation: empirical approaches, whose strategies are based on data from reaction libraries; semiformal approaches based on heuristic algorithms, where reactions are derived from a few mechanistic elementary reactions; and formal techniques, based on graph theory. This chapter focuses on the last approach. The formal approach has spurred most of the network generation algorithms, which historically started with the work of Corey et al. [3] and the retrosynthesis code Lhasa (http://www.lhasalimited.org).
Ugi et al. argue that formal techniques are general enough to be applicable to any type of reacting system in organic or inorganic chemistry; furthermore, formal techniques are the only methods capable of generating new reaction mechanisms, and therefore of elucidating unresolved chemical processes such as those encountered in synthesis planning.
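The recursive generation process described above can be sketched as a breadth-first expansion of a species pool. The following is a minimal illustrative sketch, not an implementation from this chapter: the callback apply_reactions and the toy rule set are hypothetical stand-ins for a real reaction operator (e.g., one applying graph-rewriting or SMIRKS rules), and the same loop serves retrosynthesis if the callback applies reversed reactions (transforms).

```python
from collections import deque

def generate_network(seeds, apply_reactions, is_goal=None, max_steps=10000):
    """Breadth-first network generation (illustrative sketch).

    apply_reactions(species, known) is a hypothetical callback returning
    the set of species producible from `species` given the known pool.
    Stops when a goal species is found, the network closes, or the
    step budget is exhausted.
    """
    known = set(seeds)
    frontier = deque(seeds)
    while frontier and max_steps > 0:
        max_steps -= 1
        s = frontier.popleft()
        for product in apply_reactions(s, known):
            if product not in known:          # recurse on new species only
                known.add(product)
                frontier.append(product)
                if is_goal is not None and is_goal(product):
                    return known              # target compound produced
    return known

# Toy "reactions": append one of two functional groups, capped at
# chain length 3, to mimic recursive application to generated species.
def toy_reactions(species, known):
    return {species + g for g in ("A", "B")} if len(species) < 3 else set()

net = generate_network({"X"}, toy_reactions)
```

Starting from the single seed "X", the toy rules close after two generations, yielding seven species in total.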
While formal techniques are robust, their computational scaling limits their appli-
cability to real reacting systems. Indeed, as has been shown by several authors [4–9],
for many processes related to retrosynthesis, combustion, and petroleum refining, the
number of reactions and intermediate species that are created generally scales expo-
nentially with the number of atoms of the reactants. With metabolic and signaling
networks, the number of possible species also grows prohibitively [10,11] when one
considers all the possible states a species can fall into (phosphorylation at several
sites, and complex formation with other proteins and ligands). As a matter of fact, it has been shown that even simple models of the epidermal growth factor (EGF) receptor signaling network can generate more than 10^23 different species [12].
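The combinatorial origin of this explosion is easy to reproduce: if each modification site and each binding partner of a protein can independently be in one of two states, the number of distinct species multiplies. The counting function below is an illustrative back-of-the-envelope model (assuming fully independent binary states, which real networks only approximate), not a formula from the text.

```python
def species_count(phospho_sites, binding_partners):
    """Distinct states of one protein, assuming each phosphorylation
    site and each binding partner is independently on or off."""
    return 2 ** phospho_sites * 2 ** binding_partners

# A modest 10 phosphorylation sites and 5 binding partners already
# yield 2**15 = 32768 distinct species for a single protein; complexes
# of several such proteins multiply these counts together.
n = species_count(10, 5)
```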
With all these systems, not only may the time taken to generate networks scale exponentially, but simulating the dynamics of large networks may also become computationally prohibitive. The most common approach to studying the dynamics of a reaction network is to calculate species concentrations over time by integrating a system of ordinary differential equations (ODEs). The computational cost of integrating ODEs depends nonlinearly on N, the number of chemical species. For stiff ODEs (cf. definition in Section 11.4), that cost scales as N^3, and thus simulating a system for which N is larger than 10^4 to 10^5 becomes impractical.
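A concrete illustration of stiffness, not taken from this chapter, is Robertson's classic three-species kinetics problem, whose rate constants span nine orders of magnitude. Implicit methods such as BDF handle the stiffness, but each implicit step solves a dense N x N linear system, which is the source of the roughly N^3 per-step cost mentioned above. The sketch below uses SciPy's solve_ivp.

```python
from scipy.integrate import solve_ivp

def robertson(t, y):
    """Robertson's stiff kinetics problem: A -> B (slow),
    B + B -> C + B (very fast), B + C -> A + C (fast)."""
    y1, y2, y3 = y
    return [-0.04 * y1 + 1e4 * y2 * y3,
            0.04 * y1 - 1e4 * y2 * y3 - 3e7 * y2 ** 2,
            3e7 * y2 ** 2]

# BDF is an implicit multistep method suited to stiff systems; an
# explicit solver would need absurdly small steps on this problem.
sol = solve_ivp(robertson, (0.0, 100.0), [1.0, 0.0, 0.0],
                method="BDF", rtol=1e-8, atol=1e-10)
final = sol.y[:, -1]  # total mass is conserved: the components sum to 1
```

For three species the linear algebra is trivial, but the same implicit machinery applied to a generated network with N in the tens of thousands makes each Jacobian factorization prohibitively expensive.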
An alternative that mitigates the exponential scaling problem of formal techniques is reduction of the reaction mechanism. The question raised when mechanism reduction is used is how to choose a reaction subset that correctly describes