Textbook - Essentials

Introduction
Representation of Chemical Compounds
Representation of Chemical Reactions
The Data
Databases/Datasources
Structure Search Methods
Calculation of Physical and Chemical Data
Calculation of Structure Descriptors
Methods for Data Analysis
Applications

1. Introduction

2. Representation of Chemical Compounds

Chemical structures can be transformed into a language for computer representation via line notations such as ROSDAL, SMILES, Sybyl.
Chemical structures can also be represented and handled in matrices or connection tables.
The constitution can be represented in an unambiguous and unique manner by canonicalization (Morgan Algorithm).
Ring perception and isomorphism are also very important for processing chemical compounds.
Molfile, SDfile, and PDB-file are the most well-known data exchange formats.
Stereochemistry can be represented graphically in 2D structures, but also by permutation descriptors. It is included in all line notations and exchange formats.
3D structures can be generated with fragment-based, data-based, and numerical methods.
Molecular surfaces can express various chemical and physical properties, such as electrostatic potential, atomic charges or hydrophobicity, using colored mapping.
Chemical structures can be visualized with a variety of models.
Many programs exist which can be used to generate and visualize molecular structures.
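
As a minimal sketch of the connection table and Morgan-style canonicalization summarized above, the following Python fragment stores a hydrogen-suppressed structure as an adjacency list and computes extended connectivity (EC) values. The function name morgan_ec and the dictionary layout are illustrative assumptions. Only the relaxation step of the Morgan algorithm is shown; the subsequent atom renumbering and the tie-breaking by atom and bond type are left out.

```python
def morgan_ec(adjacency):
    """Extended connectivity values for a hydrogen-suppressed molecular graph.

    adjacency maps each atom index to the list of its neighbors
    (a minimal connection table without atom or bond types).
    """
    # Start from the number of neighbors of each atom.
    ec = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(ec.values()))
    while True:
        # New value of an atom = sum of the current values of its neighbors.
        new_ec = {atom: sum(ec[n] for n in nbrs)
                  for atom, nbrs in adjacency.items()}
        new_classes = len(set(new_ec.values()))
        # Stop as soon as the number of distinct values no longer increases
        # and keep the values from the previous iteration.
        if new_classes <= n_classes:
            return ec
        ec, n_classes = new_ec, new_classes


# n-Pentane, C1-C2-C3-C4-C5 (hydrogens omitted)
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
ec_values = morgan_ec(pentane)
print(ec_values)
```

In this example the central atom receives the highest value, while atoms 1 and 5 as well as atoms 2 and 4 share a value, which reflects their topological equivalence.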

3. Representation of Chemical Reactions

The representation of a chemical reaction should include the connection table of all participating species (starting materials, reagents, solvents, catalysts, products) as well as information on reaction conditions (temperature, concentration, time, etc.) and observations (yield, reaction rates, heat of reaction, etc.).
However, the structures of the starting materials and products alone represent a reaction insufficiently.
It is also essential to indicate the reaction center and the bonds broken and made in a reaction; in essence, to specify how electrons are shifted during a reaction.
In this sense, the representation of chemical reactions should consider some essential features of a reaction mechanism.
Further insight into the driving forces of chemical reactions can be gained by considering essential physicochemical effects at the reaction center.
Specification of the reaction center is important for many queries to reaction databases.
Reaction types can be derived automatically through classification of reaction instances.
Reaction classification is an essential step in knowledge acquisition from reaction databases.
There are two fundamental approaches to automatic reaction classification: model-driven and data-driven methods.
The stereochemistry of reactions can be treated by means of permutation group theory.
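
The idea of a reaction center as the set of bonds broken and made can be sketched as a comparison of two bond tables. This is a simplified illustration, assuming that reactants and products share one atom numbering (an atom mapping) and that each bond table is a dictionary from an atom pair to a bond order. The names reaction_center and bond are hypothetical helpers.

```python
def bond(a, b):
    """Order-independent key for the bond between atoms a and b."""
    return frozenset((a, b))


def reaction_center(reactant_bonds, product_bonds):
    """Compare two bond tables that use the same atom numbering."""
    broken = {b for b in reactant_bonds if b not in product_bonds}
    made = {b for b in product_bonds if b not in reactant_bonds}
    # Bonds present on both sides whose bond order differs.
    changed = {b for b in reactant_bonds
               if b in product_bonds and reactant_bonds[b] != product_bonds[b]}
    # Every atom that takes part in a broken, made, or changed bond.
    atoms = set().union(*broken, *made, *changed)
    return {"broken": broken, "made": made, "changed": changed, "atoms": atoms}


# CH3Br + OH- -> CH3OH + Br-   (heavy atoms only: 1 = C, 2 = Br, 3 = O)
reactants = {bond(1, 2): 1}
products = {bond(1, 3): 1}
center = reaction_center(reactants, products)
print(center)
```

Here the C-Br bond is reported as broken and the C-O bond as made, so the reaction center consists of atoms 1, 2, and 3.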

4. The Data

Relationship of the data to the information and knowledge
How to prepare the data to learn from them
Evaluation of the complexity of a system
Data exchange and how to perform it
Data pre-processing
Data transformations
Variable and pattern selection in a dataset
Compilation of training, test, and control datasets
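
Two of the topics listed above, data transformation and the compilation of training, test, and control datasets, can be illustrated with a short sketch. Autoscaling (zero mean, unit standard deviation) is shown as one common transformation. The split fractions, the seed, and the function names are assumptions chosen for illustration only.

```python
import random
import statistics


def autoscale(column):
    """Scale one variable to zero mean and unit (sample) standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(value - mean) / sd for value in column]


def split_dataset(items, fractions=(0.6, 0.2), seed=0):
    """Randomly split items into training, test, and control sets.

    fractions gives the training and test shares; the rest forms the control set.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = round(fractions[0] * len(shuffled))
    n_test = round(fractions[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])


scaled = autoscale([1.0, 2.0, 3.0])
train, test, control = split_dataset(range(10))
print(scaled, train, test, control)
```

A purely random split is only one option; selections based on the distribution of the data in descriptor space may be preferable for small or clustered datasets.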

5. Databases/Datasources

The hierarchical, network, relational, and object-oriented database models are the four most important ones.
Databases are classified into bibliographic, factual, and structure databases.
The Chemical Abstracts (CA) File is the main abstracting and indexing service for biochemistry, chemistry, and chemical engineering.
Beilstein and Gmelin are the world's largest factual databases in chemistry.
The Cambridge Structural Database (CSD) and the Inorganic Crystal Structure Database (ICSD) contain information obtained from X-ray structure analysis.
Compounds are stored in structure and reaction databases as connection tables (CT), e.g., Beilstein, Gmelin, CAS Registry, and CASREACT.
INPADOC is the most comprehensive bibliographic database of scientific and technological patent documents.
Scientific and chemical information is increasingly becoming available on the Internet.

6. Structure Search Methods

A substructure search algorithm is usually the first step in the implementation of other important topological procedures for the analysis of chemical structures such as identification of equivalent atoms, determination of maximal common substructure, ring detection, calculation of topological indices, etc.
The search for structural fragments (substructures) is very important in medicinal chemistry, QSAR, spectroscopy and many other fields in the process of pharmacophore, chromophore or other -phore perceptions.
The similarity property principle states: "structurally similar molecules are expected to exhibit similar physical properties or biological activities".
Similarity searching is an extremely useful tool for computer-aided structure elucidation as well as molecular design.
3D substructure search is usually known as pharmacophore searching in QSAR. In all of the 3D search methods the conformational flexibility creates considerable difficulties.
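
A similarity search needs a numerical similarity measure. The text above does not name one; the sketch below assumes the widely used Tanimoto coefficient on fragment-based fingerprints, represented here as sets of fragment keys. The database entries and key numbers are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared keys divided by keys present in either set."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def similarity_search(query, database, threshold=0.5):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints: each integer stands for one structural fragment.
database = {"mol_a": {1, 2, 3, 4}, "mol_b": {2, 3, 9}, "mol_c": {7, 8}}
query = {1, 2, 3}
hits = similarity_search(query, database)
print(hits)
```

In contrast to a substructure search, which gives a yes/no answer, the result is a ranked list whose length depends on the chosen threshold.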

7. Calculation of Physical and Chemical Data

Additivity schemes allow the calculation of important molecular properties.
Additivity schemes for estimating molecular properties play an important role in chemical engineering.
The accuracy of an additivity scheme can be increased by going from atomic contributions through bond contributions to group contributions.
Heats of formation can be estimated with reasonable accuracy by additivity of group increments and corrections for ring effects.
Average binding energies from an additivity scheme can be used to recognize weak and strong binding of drugs to the corresponding receptor.
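
A group additivity scheme for heats of formation can be sketched as a weighted sum of group increments plus an optional ring correction. The increments below are rounded values of the kind found in Benson-type tables (kcal/mol); they are not taken from this text, are included only for illustration, and should be checked against a published table before any real use.

```python
# Illustrative, rounded group increments in kcal/mol (assumption, see lead-in).
GROUP_INCREMENTS = {
    "C-(C)(H)3": -10.2,   # methyl group bonded to one carbon
    "C-(C)2(H)2": -4.9,   # methylene group bonded to two carbons
    "C-(C)3(H)": -1.9,    # methine group bonded to three carbons
}


def heat_of_formation(group_counts, ring_correction=0.0):
    """Sum the group increments of a molecule and add a ring correction."""
    increments = sum(GROUP_INCREMENTS[group] * count
                     for group, count in group_counts.items())
    return increments + ring_correction


n_butane = {"C-(C)(H)3": 2, "C-(C)2(H)2": 2}
isobutane = {"C-(C)(H)3": 3, "C-(C)3(H)": 1}
print(heat_of_formation(n_butane), heat_of_formation(isobutane))
```

Because the groups carry information about their neighbors, the scheme distinguishes the two isomers, which a purely atomic scheme could not do.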
The PEOE method allows a rapid calculation of the charge distribution in σ-bonded systems.
The PEOE method in conjunction with a modified Hückel molecular orbital (HMO) method allows charge calculation in conjugated π-systems.
Residual electronegativity values obtained by the PEOE method are useful quantitative measures of the inductive effect.
The polarizability effect can be calculated by a simple attenuation model.
Fundamental enthalpies of gas-phase reactions such as proton affinities or gas-phase acidities can be correlated with the values of the inductive and the polarizability effect.
Molecular mechanics calculations are a very useful tool for the spatial and energetic description of small molecules as well as macromolecular systems like proteins or DNA.
Current trends in ongoing development are in areas such as the treatment of polarization or applications to transition metal systems.
In combination with quantum mechanical methods, a QM/MM methodology allows the description of reaction mechanisms including whole enzymes.
Molecules should never be treated as static systems that cannot undergo conformational changes.
Molecular dynamics simulations provide information about the motion of molecules, which facilitates the interpretation of experimental results and allows the statistically meaningful sampling of (thermodynamic) data.
The development of efficient algorithms and the sophisticated description of long-range electrostatic effects allow calculations on systems with 100 000 atoms and more, which address biochemical problems like membrane-bound protein complexes or the action of "molecular machines".
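
How a molecular dynamics simulation propagates motion can be shown on the smallest possible system. The sketch below assumes the velocity Verlet integrator, a common choice that the text above does not prescribe, and applies it to a single particle in a harmonic potential in reduced units. The force constant, mass, and time step are hypothetical values.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Propagate one particle in one dimension; returns the list of (x, v) states."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * (f / mass) * dt ** 2   # position update
        f_new = force(x)                              # force at the new position
        v = v + 0.5 * (f + f_new) / mass * dt         # velocity update, averaged force
        f = f_new
        trajectory.append((x, v))
    return trajectory


K = 1.0      # hypothetical force constant (reduced units)
MASS = 1.0   # hypothetical mass (reduced units)


def harmonic_force(x):
    return -K * x


def total_energy(x, v):
    return 0.5 * MASS * v ** 2 + 0.5 * K * x ** 2


trajectory = velocity_verlet(x=1.0, v=0.0, force=harmonic_force,
                             mass=MASS, dt=0.01, n_steps=1000)
e_start = total_energy(*trajectory[0])
e_end = total_energy(*trajectory[-1])
print(e_start, e_end)
```

The total energy should stay nearly constant along the trajectory; a drift would indicate a time step that is too large. Real simulations apply the same update to all atoms with forces from a molecular mechanics force field.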

8. Calculation of Structure Descriptors

A structure descriptor is a mathematical representation of a molecule resulting from a procedure transforming the chemical information encoded within a symbolic representation of a molecule.
The abbreviation QSAR stands for quantitative structure-activity relationships. QSPR means quantitative structure-property relationships. As the properties of an organic compound usually cannot be predicted directly from its molecular structure, an indirect approach is used in order to tackle this problem. In the first step numerical descriptors encoding information about the molecular structure are calculated for a set of compounds. Secondly, statistical and artificial neural network models are used to predict the property or activity of interest, based on these descriptors or a suitable subset. A typical QSAR/QSPR study comprises the following steps: structure entry (or start from an existing structure database), descriptor calculation, descriptor selection, model building, model validation.
Molecules can be represented by structure descriptors in a hierarchic way with respect to a) the descriptor data type, and b) the molecular representation of the compound.
The QSAR methodology can also be applied to materials and mixtures where no structural information is available. Instead of descriptors derived from the compound's structure, various physicochemical properties, including spectra, can be used. In particular, spectra are valuable in this context as they reflect the structure in a sensitive way.
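
As a small example of the descriptor calculation step, the following sketch transforms a connection table into a single number. The Wiener index (the sum of the topological distances between all pairs of heavy atoms) is chosen here as an assumption; it is one of many topological descriptors and is not singled out by the text above.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path lengths (in bonds) over all atom pairs."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives the topological distances from start.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if neighbor not in distance:
                    distance[neighbor] = distance[atom] + 1
                    queue.append(neighbor)
        total += sum(distance.values())
    return total // 2  # each pair was counted from both ends


# Hydrogen-suppressed graphs of two C4H10 isomers
n_butane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
isobutane = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}
print(wiener_index(n_butane), wiener_index(isobutane))
```

The more compact, branched isomer receives the smaller value, which is why such indices can serve as inputs for QSPR models of properties that depend on molecular shape.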

9. Methods for Data Analysis

Machine learning deals with the ability of machines or generally of computer programs to enhance their performance based on previous results. Machine learning can follow different learning strategies, e.g., supervised or unsupervised learning.
Decision trees give a graphical representation of a procedure for classification. They consist of nodes and branches; the leaf nodes give the classification of an instance.
Correlation analysis reveals the interdependence between variables. The statistical measure for the interdependence is the correlation coefficient.
Multiple linear regression (MLR) models a linear relationship between a dependent variable and one or more independent variables.
Principal Component Analysis (PCA) transforms a number of correlated variables into a smaller number of uncorrelated variables, the so-called principal components.
Principal Component Regression (PCR) is a combination of PCA and MLR: The scores gained by PCA are used for MLR.
PLS is a linear regression extension of PCA which is used to connect the information in two blocks of variables X and Y to each other. It can be applied even if the features are highly correlated.
Neural networks model the functionality of the brain. They learn from examples, whereby the weights of the neurons are adapted on the basis of training data.
A Kohonen network is a neural network which uses an unsupervised learning strategy. It can be used for, e.g., similarity perception, clustering, or classification tasks.
A counterpropagation neural network is a method for supervised learning which can be used for predictions.
A backpropagation neural network is also trained using a supervised learning strategy. The weights of neurons are adjusted so that the error of the output signal is minimized. A backpropagation neural network can be used for predictions.
Fuzzy logic extends the Boolean logic so as to handle information about truth values which are between "absolutely true" and "absolutely false".
Data mining provides methods for the extraction of implicit or hidden information from large data sets, and comprises procedures for the generation of reasonable and dependable secondary information.
Visual data mining allows the visualization and detection of hidden relationships in sets of data.
Expert systems, also called knowledge-based systems, use a knowledge base of human expertise and a so-called inference engine to solve problems. The inference engine contains strategies for using the information contained in the knowledge base to draw conclusions.
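
The correlation coefficient and the simplest case of linear regression (one independent variable) described earlier in this chapter can be written out in a few lines. This is a sketch using ordinary least squares and the Pearson coefficient; the descriptor and property values are hypothetical.

```python
import math


def _centered_sums(x, y):
    """Return the sums of squares and cross products about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return mean_x, mean_y, sxx, syy, sxy


def pearson_r(x, y):
    """Pearson correlation coefficient between two variables."""
    _, _, sxx, syy, sxy = _centered_sums(x, y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept of y = slope * x + intercept."""
    mean_x, mean_y, sxx, _, sxy = _centered_sums(x, y)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: one descriptor and one measured property
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
prop = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(descriptor, prop)
slope, intercept = fit_line(descriptor, prop)
print(r, slope, intercept)
```

With several independent variables the same least-squares idea leads to MLR; when those variables are strongly correlated with each other, PCR or PLS is usually the safer choice.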

10. Applications

Building a QSPR model consists of three steps: descriptor calculation, descriptor analysis and optimization, and establishment of a mathematical relationship between descriptors and property.
Successful predictive models exist in toxicology; however, they are of a rather local nature. Effects considered in toxicology can be caused by different mechanisms. Efforts to move from a compound-class perspective to one that is more consistent with modes of toxic action are still a subject of ongoing research.
A proper representation of the molecular structure is crucial for the prediction of spectra. Fragment-based methods, topological descriptors, physicochemical descriptors, and 3D descriptors have been used for this endeavor.
NMR spectra have been predicted using quantum chemistry calculations, database searches, additive methods, regressions, and neural networks.
Several methods have been developed for establishing correlations between IR vibrational bands and substructure fragments. Counterpropagation neural networks were used to make predictions of the full spectra from RDF codes of the molecules.
Correlations between structure and mass spectra were established on the basis of multivariate analysis of the spectra, database searching, or the development of knowledge-based systems, some including explicit management of chemical reactions.
Robust implementations can currently propose correct structures from spectroscopic data, especially when the molecular formula and 13C NMR spectrum are available, or from 2D NMR spectra.
Reaction prediction treats chemical reactions in their forward direction, and synthesis design in their backward, retrosynthetic direction.
Reaction databases present a rich source of information for the extraction of knowledge for reaction prediction and synthesis design.
Reaction prediction, or reaction simulation, has to concentrate on the reaction center, i.e., the bonds broken and made in a reaction.
A reaction should be modeled as closely as possible to its mechanism.
The evaluation of chemical reactions can be performed to various levels of sophistication; heats of reaction allow for a consideration of the thermodynamics of a reaction, whereas reaction rates consider its kinetic aspects.
The more details are available about a reaction type, the better a reaction of that type can be modeled.
The evaluation of the impact of a chemical on the environment must also consider its degradation products.
The metabolism of a potential drug has to be considered at an early phase of the drug development process.
Biochemical pathways and metabolic reaction networks have recently attracted much interest and are an active and rich field for research.
The disconnection approach has fundamentally changed the way a synthesis is planned.
The retrosynthetic analysis of a target compound is a systematic approach in developing a synthesis plan starting with the target structure and working backward to available starting materials.
Several concepts for the implementation of synthesis design systems have appeared since the early 1970s.
The drug discovery process comprises the following steps: a) target identification; b) target validation; c) lead finding (including design and synthesis of compound libraries as well as screening of compound libraries); d) lead optimization (searching for an acceptable pharmacokinetic profile, toxicity and mutagenicity); e) preclinical studies; f) clinical studies; g) drug approval (FDA approval).
A lead structure is "a representative of a compound series with sufficient potential (as measured by potency, selectivity, pharmacokinetics, physicochemical properties, absence of toxicity, and novelty) to progress to a full drug development program".
Chemoinformatics affects the lead finding and lead optimization steps within the drug discovery process. In particular the following tasks are involved:
- analysis of HTS data,
- similarity search,
- design of combinatorial libraries,
- design of focused libraries,
- comparison of the similarity/diversity of libraries,
- virtual screening,
- docking,
- de-novo design,
- pharmacophore perception,
- prediction of binding affinities, physicochemical properties (such as solubility, log P, pKa), and pharmacokinetic properties (ADMET profile),
- establishment of QSAR models which can be interpreted and guide the further development of a new drug.
Following the "similar property" principle, structurally similar molecules are expected to exhibit similar properties or biological activities.
The term "virtual screening" or "in-silico screening" is defined as the selection of compounds by evaluating their desirability in a computational model. The desirability comprises high potency, selectivity, appropriate pharmacokinetic properties, and favorable toxicology.
Lipinski's "Rule of Five":
- molecular weight < 500,
- lipophilicity (log P) < 5,
- number of H-bond donors < 5,
- number of H-bond acceptors (number of N+O) < 10,
- (number of rotatable bonds < 10).
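
The rule can be turned into a simple filter of the kind used in virtual screening. The sketch below applies the thresholds exactly as listed above and treats the rotatable-bond criterion as optional, as the parentheses suggest. The property names and the example compound are hypothetical; the property values would normally come from descriptor calculations.

```python
# Upper limits as listed above; a compound passes a criterion if value < limit.
RULE_OF_FIVE = {
    "molecular_weight": 500,
    "log_p": 5,
    "h_bond_donors": 5,
    "h_bond_acceptors": 10,
}
OPTIONAL_RULES = {"rotatable_bonds": 10}


def rule_of_five_violations(properties, include_optional=False):
    """Return the names of all criteria that the compound does not fulfill."""
    rules = dict(RULE_OF_FIVE)
    if include_optional:
        rules.update(OPTIONAL_RULES)
    return [name for name, limit in rules.items()
            if not properties[name] < limit]


# Hypothetical compound with precomputed properties
compound = {
    "molecular_weight": 532.6,
    "log_p": 3.2,
    "h_bond_donors": 2,
    "h_bond_acceptors": 7,
    "rotatable_bonds": 12,
}
violations = rule_of_five_violations(compound)
all_violations = rule_of_five_violations(compound, include_optional=True)
print(violations, all_violations)
```

Returning the list of violated criteria rather than a single yes/no value leaves it to the user to decide how many violations are still acceptable.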