Roberto Todeschini
Viviana Consonni
Manuela Pavan
Andrea Mauri
Davide Ballabio
Alberto Manganaro
chemometrics
molecular descriptors
QSAR
multicriteria decision making
environmetrics
experimental design
artificial neural networks
statistical process control
Milano Chemometrics and QSAR Research Group
Department of Environmental Sciences
University of Milano - Bicocca
P.za della Scienza, 1 - 20126 Milano (Italy)
Website: michem.unimib.it/chm/
Molecular descriptors
Constitutional descriptors and graph invariants
Iran - February 2009
Content
- Counting descriptors
- Empirical descriptors
- Fragment descriptors
- Molecular graphs
- Topological descriptors
Counting descriptors
Each descriptor represents the number of elements of some defined chemical quantity.
For example:
- the number of atoms or bonds
- the number of carbon or chlorine atoms
- the number of OH or C=O functional groups
- the number of benzene rings
- the number of defined molecular fragments
Counting descriptors
... also a sum of some atomic / bond property is considered as a count descriptor, as well as its average:
$$MW = \sum_{i=1}^{A} m_i \qquad AMW = \frac{MW}{A} \qquad P = \sum_{i=1}^{A} w_i$$
where A is the number of atoms, m_i the atomic mass and w_i a generic atomic property.
For example:
- molecular weight and average molecular weight
- sum of the atomic electronegativities
- sum of the atomic polarizabilities
- sum of the bond orders
A counting descriptor n is a semi-positive variable, i.e. n ≥ 0.
Its statistical distribution is usually a Poisson distribution.
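As a minimal sketch of a sum-type count descriptor and its average, the snippet below computes MW and AMW for ethanol. The atomic masses are rounded standard values, the names `ATOMIC_MASS`, `molecular_weight` and `average_molecular_weight` are illustrative and not taken from the slides, and A is assumed to count all atoms, hydrogens included.

```python
# Rounded standard atomic weights (illustrative subset).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "O": 15.999, "Cl": 35.45}


def molecular_weight(atoms):
    """MW: sum of the atomic masses over all atoms of the molecule."""
    return sum(ATOMIC_MASS[symbol] for symbol in atoms)


def average_molecular_weight(atoms):
    """AMW: molecular weight divided by the number of atoms A."""
    return molecular_weight(atoms) / len(atoms)


# Ethanol, C2H6O, written as a flat list of element symbols.
ethanol = ["C"] * 2 + ["H"] * 6 + ["O"]
mw = molecular_weight(ethanol)
amw = average_molecular_weight(ethanol)
```

Any other additive atomic property (electronegativity, polarizability) could be handled the same way by swapping the lookup table.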
Counting descriptors
Main characteristics
• simple
• the most used
• local information
• high degeneracy
• discriminant modelling power
Empirical descriptors
Descriptors based on specific structural aspects present in sets of congeneric compounds and usually not applicable (or giving a single default value) to compounds of different classes.
Empirical descriptors
Index of Taillander (Taillander et al., 1983)
It is a descriptor dedicated to the modelling of the benzene rings and is defined as the sum of the six lengths joining the adjacent substituent groups.
[Figure: benzene ring bearing H, CH3 and Cl substituents]
Empirical descriptors
Hydrophilicity index (Hy) (Todeschini et al., 1999)
It is a descriptor dedicated to the modelling of hydrophilicity and is based on a function of the counting of hydrophilic groups (OH-, SH-, NH-, ...) and carbon atoms.
$$Hy = \frac{(1 + n_{Hy}) \log(1 + n_{Hy}) + n_C \left( \frac{1}{n} \log \frac{1}{n} \right) + \sqrt{\frac{n_{Hy}}{n^2}}}{\log(1 + n)}$$
nHy number of hydrophilic groups
nC number of carbon atoms
n total number of non-hydrogen atoms
-1 ≤ Hy ≤ 3.64
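The Hy formula can be sketched as a short function. The base of the logarithm is not legible in the extracted slide; the natural logarithm is assumed here because it reproduces tabulated values such as 3.64 for hydrogen peroxide and 1.40 for methanol. The function name `hydrophilicity_index` is hypothetical.

```python
import math


def hydrophilicity_index(n_hy, n_c, n):
    """Hy from the counts of hydrophilic groups, carbons and non-H atoms.

    Assumption: natural logarithm; n must be at least 1.
    """
    first = (1 + n_hy) * math.log(1 + n_hy)
    second = n_c * (1.0 / n) * math.log(1.0 / n)
    third = math.sqrt(n_hy / n ** 2)
    return (first + second + third) / math.log(1 + n)


hy_methanol = hydrophilicity_index(1, 1, 2)
hy_peroxide = hydrophilicity_index(2, 0, 2)
hy_decane = hydrophilicity_index(0, 10, 10)
```

For hydrocarbons (nHy = 0) the value depends only on n and tends to -1 as the chain grows, which is the lower bound quoted above.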
Empirical descriptors
Compound                  nHy   nC     n      Hy
hydrogen peroxide          2     0     2     3.64
carbonic acid              2     1     3     3.48
water                      2     0     1     3.44
butanetetraol              4     4     8     3.30
propanetriol               3     3     6     2.54
ethanediol                 2     2     4     1.84
methanol                   1     1     2     1.40
ethanol                    1     2     3     0.71
decanediol                 2    10    12     0.52
propanol                   1     3     4     0.37
butanol                    1     4     5     0.17
pentanol                   1     5     6     0.03
methane                    0     1     1     0.00
nHy = 0 and nC = 0         0     0     N     0.00
decanol                    1    10    11    -0.28
ethane                     0     2     2    -0.63
pentane                    0     5     5    -0.90
decane                     0    10    10    -0.96
alkane with nC = 1000      0  1000  1000    -1.00
Fragment approach
- Parametric approach (Hammett – Hansch, 1964)
- Substituent approach (Free-Wilson, Fujita-Ban, 1976)
- DARC-PELCO approach (Dubois, 1966)
- Sterimol approach (Verloop, 1976)
Fragment approach
The biological activity of a molecule is the sum of its fragment properties:
- common reference skeleton
- molecule properties gradually modified by substituents
Congenericity principle
QSAR strategies can be applied ONLY to classes of similar compounds.
Hansch approach
$$\text{Biological response} = f_1(L) + f_2(E) + f_3(S) + f_4(M)$$
Corwin Hansch, 1964
1 L: Lipophilic properties
2 E: Electronic properties
3 S: Steric properties
4 M: Other molecular properties
Hansch approach
1 Congenericity approach
2 Linear additive scheme
3 Limited representation of global molecular properties
4 No 3D and conformational information
Free-Wilson approach (Free-Wilson, 1964)
[Figure: five congeneric molecules sharing a common methyl-substituted skeleton, with H, F, Br or I at substituent positions 1 and 2]
          Pos. 1        Pos. 2
          F  Br  I      F  Br  I
mol.1     0  0   0      0  0   0
mol.2     0  0   1      0  0   0
mol.3     0  0   0      0  0   1
mol.4     1  0   0      1  0   0
mol.5     0  1   0      1  0   0
$$y_i = b_0 + \sum_{s=1}^{S} \sum_{k=1}^{N_s} b_{ks} I_{i,ks}$$
For the example above (k = F, Br, I at Pos. 1 and at Pos. 2):
$$y_i = b_0 + b_{11} I_{i,11} + b_{21} I_{i,21} + b_{31} I_{i,31} + b_{12} I_{i,12} + b_{22} I_{i,22} + b_{32} I_{i,32}$$
$I_{ks}$ absence/presence of the k-th substituent in the s-th site
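A minimal sketch of how the Free-Wilson indicator variables and the additive model could be coded. The substituent assignments are read off the indicator table (hydrogen is taken as the reference and gets no column); the coefficient values used in the example call are purely hypothetical.

```python
SUBSTITUENTS = ["F", "Br", "I"]  # columns per site; H is the reference

# (substituent at Pos. 1, substituent at Pos. 2), as implied by the table.
MOLECULES = {
    "mol.1": ("H", "H"),
    "mol.2": ("I", "H"),
    "mol.3": ("H", "I"),
    "mol.4": ("F", "F"),
    "mol.5": ("Br", "F"),
}


def indicator_row(substituents_at_sites):
    """Concatenate one 0/1 block per site: 1 if substituent k is present."""
    row = []
    for present in substituents_at_sites:
        row.extend(1 if present == k else 0 for k in SUBSTITUENTS)
    return row


def free_wilson_response(row, b0, coefficients):
    """y = b0 + sum of b_ks * I_ks over all sites s and substituents k."""
    return b0 + sum(b * i for b, i in zip(coefficients, row))


design = {name: indicator_row(subs) for name, subs in MOLECULES.items()}

# Hypothetical coefficients, only to show the additive scheme at work.
y5 = free_wilson_response(design["mol.5"], 1.0, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
```

In practice the coefficients would be estimated by regression of measured responses on the rows of `design`.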
Fragment approach
Fingerprints
binary vector
1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
1 = presence of a fragment, 0 = absence of a fragment
Used for similarity searching
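Similarity searching on fingerprints can be sketched as follows. The slides do not name a similarity coefficient; the Tanimoto coefficient is assumed here as a common choice. The first vector is the one shown above, the second is a hypothetical query.

```python
def tanimoto(fp_a, fp_b):
    """Shared 'on' bits divided by bits 'on' in at least one fingerprint."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 1.0


# Fingerprint from the slide (22 fragment positions).
fp_slide = [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
# Hypothetical second molecule: one fragment lost, another one gained.
fp_query = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

similarity = tanimoto(fp_slide, fp_query)
```

Ranking a database by this value against a query fingerprint is the basic form of a similarity search.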
Molecular graph
[Figure: H-depleted molecular graph with vertices numbered 1 to 7]
Molecular graph
Mathematical object defined as
G = (V, E)
set V: vertices (atoms)
set E: edges (bonds)
[Figure: H-depleted molecular graph with vertices numbered 1 to 7]
Usually in the molecular graph hydrogen atoms are not considered: H-depleted molecular graph.
Molecular graph
A walk in G is a sequence of vertices $w = (v_1, v_2, v_3, \ldots, v_k)$ such that $\{v_j, v_{j+1}\} \in E$.
The length of a walk is the number of edges traversed by the walk.
A path in G is a walk without any repeated vertices.
The length of a path $(v_1, v_2, v_3, \ldots, v_{k+1})$ is k.
[Figure: molecular graph with six numbered vertices]
v1 v2 v3 v2 v5: walk of length 4
v1 v2 v3 v4 v5: path of length 4
Molecular graph
The topological distance $d_{ij}$ is the length of the shortest path between the vertices $v_i$ and $v_j$.
[Figure: molecular graph with six numbered vertices]
$d_{15} = 2$
The detour distance $\Delta_{ij}$ is the length of the longest path between the vertices $v_i$ and $v_j$.
$\Delta_{15} = 4$
Molecular graph
A self returning walk is a walk closed in itself, i.e. a walk starting and ending on the same vertex.
A cycle is a walk with no repeated vertices other than its first and last ones ($v_1 = v_k$).
[Figure: molecular graph with six numbered vertices]
v1 v2 v3 v2 v1: self returning walk of length 4
v2 v3 v4 v5 v2: cycle of length 4
Molecular graph
The molecular walk (path) count $MWC_k$ ($MPC_k$) of order k is the total number of walks (paths) of k-th length in the molecular graph.
MWC0 = nSK (no. of atoms)
MWC1 = nBO (no. of bonds)
Information encoded: molecular size, branching, graph complexity
DRAGON: MWC1, MWC2, …, MWC10
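Walk counts can be obtained from powers of the adjacency matrix, since the (i, j) entry of the k-th power counts the walks of length k from i to j. The sketch below assumes that each walk and its reverse are counted once (the sum of the entries is halved), a convention chosen only so that MWC1 equals the number of bonds as stated above. The edge list is an assumption: a seven-vertex tree consistent with the vertex degrees (1, 4, 3, 1, 1, 1, 1) given elsewhere in these slides.

```python
# Assumed edge list of the seven-vertex example graph (vertices 1 to 7).
EDGES = [(1, 2), (2, 3), (3, 4), (2, 5), (3, 6), (2, 7)]
N_ATOMS = 7


def adjacency_matrix(edges, n_atoms):
    """Symmetric 0/1 matrix with a 1 for every bonded pair of vertices."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in edges:
        adj[i - 1][j - 1] = 1
        adj[j - 1][i - 1] = 1
    return adj


def mat_mul(x, y):
    """Plain product of two square matrices given as nested lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def molecular_walk_counts(adj, max_order):
    """Return [MWC0, MWC1, ..., MWC(max_order)] under the halving convention."""
    counts = [len(adj)]          # order 0: one trivial walk per atom
    power = adj                  # adjacency matrix to the power 1
    for _ in range(max_order):
        counts.append(sum(map(sum, power)) / 2)
        power = mat_mul(power, adj)
    return counts


walk_counts = molecular_walk_counts(adjacency_matrix(EDGES, N_ATOMS), 3)
```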
Molecular graph
The self-returning walk count $SRW_k$ of order k is the total number of self-returning walks of length k in the graph.
The SRW counts are the spectral moments of the adjacency matrix, i.e. linear combinations of counts of certain fragments contained in the molecular graph, i.e. embedding frequencies.
SRW1 = nSK
SRW2 = nBO
DRAGON: SRW1, SRW2, …, SRW10
Molecular graph
Local vertex invariants (LOVIs) are quantities associated to each vertex of a molecular graph.
Graph invariants are molecular descriptors representing graph properties that are preserved by isomorphism.
Examples: the characteristic polynomial; invariants derived from local vertex invariants.
Molecular graph and more
Molecular graph → Topological matrix → Algebraic operator → Local Vertex Invariants / Graph invariants
Molecular descriptors
molecular graph → graph invariants
topostructural descriptors:
• Wiener index, Hosoya Z index
• Zagreb indices, Mohar indices
• Randic connectivity index
• Balaban distance connectivity index
• Schultz molecular topological index
• Kier shape descriptors
• eigenvalues of the adjacency matrix
• eigenvalues of the distance matrix
• Kirchhoff number
• detour index
• topological charge indices
• ...
topological information indices:
• total information content on ...
• mean information content on ...
topochemical descriptors:
• Kier-Hall valence connectivity indices
• Burden eigenvalues
• BCUT descriptors
• Kier alpha-modified shape descriptors
• 2D autocorrelation descriptors
• ...
molecular geometry (x, y, z coordinates) → topographic descriptors:
• 3D-Wiener index
• 3D-Balaban index
• D/D index
• ...
Molecule graph invariants
Numerical chemical information extracted from
molecular graphs.
The mathematical representation of a molecular graph
is made by the topological matrices:
• adjacency matrix
• atom connectivity matrix
• distance matrix
• edge distance matrix
• incidence matrix
... more than 60 matrix representations of the molecular structure
Local vertex invariants
Local vertex invariants (LOVIs) are quantities associated to each vertex of a molecular graph.
Examples:
• atom vertex degree
• valence vertex degree
• sum of the vertex distance degree
• maximum vertex distance degree
Topological matrices
Adjacency matrix
Derived from a molecular graph, it represents the whole set of connections between adjacent pairs of atoms.
$$a_{ij} = \begin{cases} 1 & \text{if atoms } i \text{ and } j \text{ are bonded} \\ 0 & \text{otherwise} \end{cases}$$
Bond number B
It is the simplest graph invariant obtained from the adjacency matrix.
It is the number of bonds in the molecular graph calculated as:
$$B = \frac{1}{2} \sum_{i=1}^{A} \sum_{j=1}^{A} a_{ij}$$
where $a_{ij}$ is the entry of the adjacency matrix.
Topological matrices
atom vertex degree $\delta_i$
It is the row sum of the vertex adjacency matrix.
[Figure: H-depleted molecular graph with vertices numbered 1 to 7]
      1  2  3  4  5  6  7    δi
 1    0  1  0  0  0  0  0     1
 2    1  0  1  0  1  0  1     4
 3    0  1  0  1  0  1  0     3
 4    0  0  1  0  0  0  0     1
 5    0  1  0  0  0  0  0     1
 6    0  0  1  0  0  0  0     1
 7    0  1  0  0  0  0  0     1
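The bond number and the vertex degrees follow directly from the adjacency matrix. The sketch below rebuilds the matrix of the seven-vertex example from an edge list; that edge list is an assumption, chosen to be consistent with the degree column (1, 4, 3, 1, 1, 1, 1) of the slide.

```python
# Assumed edge list of the seven-vertex example graph (vertices 1 to 7).
EDGES = [(1, 2), (2, 3), (3, 4), (2, 5), (3, 6), (2, 7)]
N_ATOMS = 7


def adjacency_matrix(edges, n_atoms):
    """a_ij = 1 if atoms i and j are bonded, 0 otherwise."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in edges:
        adj[i - 1][j - 1] = 1
        adj[j - 1][i - 1] = 1
    return adj


adj = adjacency_matrix(EDGES, N_ATOMS)

# Bond number B: half the sum of all the entries of the adjacency matrix.
bond_number = sum(map(sum, adj)) // 2

# Atom vertex degrees: row sums of the adjacency matrix.
vertex_degrees = [sum(row) for row in adj]
```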
Local vertex invariants
The vertex degree of the i-th atom is the count of edges incident with the i-th atom, i.e. the count of σ bonds or σ electrons.
valence vertex degree
$$\delta_i^v = Z_i^v - h_i$$
$Z_i^v$ number of valence electrons of the i-th atom
$h_i$ number of hydrogens bonded to the i-th atom
for atoms of the 2nd principal quantum number (C, N, O, F)
Local vertex invariants
valence vertex degree
$$\delta_i^v = \frac{Z_i^v - h_i}{Z_i - Z_i^v - 1}$$
$Z_i$ total number of electrons of the i-th atom (Atomic Number)
for atoms with principal quantum number > 2
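The two valence vertex degree formulas can be sketched with a single function, because for C, N, O and F the denominator Z - Z^v - 1 equals 1 and the general expression reduces to Z^v - h. The function name and the example atoms are illustrative choices, not taken from the slides.

```python
def valence_vertex_degree(z_valence, n_hydrogens, atomic_number):
    """(Z^v - h) / (Z - Z^v - 1); the denominator is 1 for C, N, O and F."""
    return (z_valence - n_hydrogens) / (atomic_number - z_valence - 1)


# Second-period atoms: the denominator is 1.
dv_methyl_carbon = valence_vertex_degree(4, 3, 6)    # carbon of a CH3 group
dv_hydroxyl_oxygen = valence_vertex_degree(6, 1, 8)  # oxygen of an OH group
dv_carbonyl_oxygen = valence_vertex_degree(6, 0, 8)  # oxygen of a C=O group

# Third-period atom: chlorine, Z = 17 with 7 valence electrons.
dv_chlorine = valence_vertex_degree(7, 0, 17)
```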
Topological descriptors
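Zagreb indices (Gutman, 1975)
$$M_1 = \sum_{a=1}^{A} \delta_a^2 \qquad M_2 = \sum_{b} (\delta_i \delta_j)_b$$
$\delta_i$ vertex degree of the i-th atom; the first sum runs over the atoms a, the second over the bonds b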
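A minimal sketch of the two Zagreb indices. The edge list is an assumption: a seven-vertex tree consistent with the vertex degrees (1, 4, 3, 1, 1, 1, 1) of the example graph used in these slides.

```python
# Assumed edge list of the seven-vertex example graph (vertices 1 to 7).
EDGES = [(1, 2), (2, 3), (3, 4), (2, 5), (3, 6), (2, 7)]


def vertex_degrees(edges):
    """Map each vertex to the number of edges incident with it."""
    degrees = {}
    for i, j in edges:
        degrees[i] = degrees.get(i, 0) + 1
        degrees[j] = degrees.get(j, 0) + 1
    return degrees


def zagreb_indices(edges):
    """Return (M1, M2): squared degrees over atoms, degree products over bonds."""
    delta = vertex_degrees(edges)
    m1 = sum(d * d for d in delta.values())
    m2 = sum(delta[i] * delta[j] for i, j in edges)
    return m1, m2


m1, m2 = zagreb_indices(EDGES)
```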
Topological descriptors
Kier-Hall connectivity indices (1986)
They are based on molecular graph decomposition into fragments (subgraphs) of different size and complexity and use atom vertex degrees as subgraph weight.
Randic branching index (1975)
$$\chi_R = {}^1\chi = \sum_{b} (\delta_i \delta_j)_b^{-1/2}$$
$(\delta_i \delta_j)^{-1/2}$ is called edge connectivity
Topological descriptors
mean Randic branching index
$$\bar{\chi}_R = \frac{\chi_R}{B}$$
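The Randic branching index and its mean can be sketched as below: each bond contributes its edge connectivity, and the mean divides the sum by the number of bonds B. The edge list is an assumption, a seven-vertex tree consistent with the degrees (1, 4, 3, 1, 1, 1, 1) of the example graph in these slides.

```python
import math

# Assumed edge list of the seven-vertex example graph (vertices 1 to 7).
EDGES = [(1, 2), (2, 3), (3, 4), (2, 5), (3, 6), (2, 7)]


def vertex_degrees(edges):
    """Map each vertex to the number of edges incident with it."""
    degrees = {}
    for i, j in edges:
        degrees[i] = degrees.get(i, 0) + 1
        degrees[j] = degrees.get(j, 0) + 1
    return degrees


def randic_index(edges):
    """Sum over the bonds of (delta_i * delta_j) ** -1/2."""
    delta = vertex_degrees(edges)
    return sum(1.0 / math.sqrt(delta[i] * delta[j]) for i, j in edges)


def mean_randic_index(edges):
    """Randic index divided by the number of bonds B."""
    return randic_index(edges) / len(edges)


chi = randic_index(EDGES)
chi_mean = mean_randic_index(EDGES)
```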
Topological descriptors
atom connectivity indices of m-th order
$${}^0\chi = \sum_{a} \delta_a^{-1/2} \qquad {}^1\chi = \sum_{b} (\delta_i \delta_j)_b^{-1/2} \qquad {}^2\chi = \sum_{k=1}^{{}^2P} (\delta_i \delta_l \delta_j)_k^{-1/2}$$
$${}^m\chi_q = \sum_{k=1}^{{}^mP} \left( \prod_{a=1}^{n} \delta_a \right)_k^{-1/2}$$
mP number of m-th order paths
q subgraph type (Path, Cluster, Path/Cluster, Chain)
n = m for Chain (Ring) subgraph type
n = m + 1 otherwise
The immediate bonding environment of each atom is encoded by the subgraph weight.
The number of terms in the sum depends on the molecular structure.
The connectivity indices show a good capability of isomer discrimination and reflect some features of molecular branching.
valence connectivity indices of m-th order
$${}^0\chi^v = \sum_{a} (\delta_a^v)^{-1/2} \qquad {}^1\chi^v = \sum_{b} (\delta_i^v \delta_j^v)_b^{-1/2} \qquad {}^2\chi^v = \sum_{k=1}^{{}^2P} (\delta_i^v \delta_l^v \delta_j^v)_k^{-1/2}$$
$${}^m\chi_q^v = \sum_{k=1}^{{}^mP} \left( \prod_{a=1}^{n} \delta_a^v \right)_k^{-1/2}$$
They encode atom identities as well as the connectivities in the molecular graph.
Topological descriptors
Kier-Hall electronegativity
$$X_{KH} = \delta_i^v - \delta_i$$
correlation with the Mulliken-Jaffe electronegativity:
$$X_{MJ} = 1.99 \, (\delta_i^v - \delta_i) + 6.99$$
Kier-Hall relative electronegativity
$$X_{KH} = \frac{\delta_i^v - \delta_i}{N_i^2}$$
$N_i$ principal quantum number
electronegativity of carbon sp3 taken as zero
$$X_{MJ} = 7.99 \, \frac{\delta_i^v - \delta_i}{N_i^2} + 7.07$$
Distance matrix
vertex distance matrix degree $s_i$
It is the row sum of the vertex distance matrix.
The distance $d_{ij}$ between two vertices is the smallest number of edges between them.
[Figure: H-depleted molecular graph with vertices numbered 1 to 7]
      1  2  3  4  5  6  7    si   ηi
 1    0  1  2  3  2  3  2    13    3
 2    1  0  1  2  1  2  1     8    2
 3    2  1  0  1  2  1  2     9    2
 4    3  2  1  0  3  2  3    14    3
 5    2  1  2  3  0  3  2    13    3
 6    3  2  1  2  3  0  3    14    3
 7    2  1  2  3  2  3  0    13    3
$s_i$ is high for terminal vertices and low for central vertices.
Local vertex invariants
The eccentricity $\eta_i$ of the i-th atom is the upper bound of the distance $d_{ij}$ between the atom i and the other atoms j.
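The distance matrix, the distance sums and the eccentricities can be computed with a breadth-first search from every vertex. This is a minimal sketch for a connected graph; the edge list is an assumption, a seven-vertex tree consistent with the row sums (13, 8, 9, 14, 13, 14, 13) of the slide.

```python
from collections import deque

# Assumed edge list of the seven-vertex example graph (vertices 1 to 7).
EDGES = [(1, 2), (2, 3), (3, 4), (2, 5), (3, 6), (2, 7)]
N_ATOMS = 7


def distance_matrix(edges, n_atoms):
    """Topological distances d_ij of a connected graph, via BFS."""
    neighbours = {v: [] for v in range(1, n_atoms + 1)}
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    matrix = []
    for start in range(1, n_atoms + 1):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in neighbours[v]:
                if w not in dist:      # first visit gives the shortest path
                    dist[w] = dist[v] + 1
                    queue.append(w)
        matrix.append([dist[v] for v in range(1, n_atoms + 1)])
    return matrix


D = distance_matrix(EDGES, N_ATOMS)
distance_sums = [sum(row) for row in D]    # s_i, row sums
eccentricities = [max(row) for row in D]   # eta_i, largest distance per row
```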
Topological descriptors
Petitjean shape index (1992)
$$I_{PJ} = \frac{D - R}{R}$$
where R (radius) is the minimum and D (diameter) the maximum atom eccentricity.
A simple shape descriptor
IPJ = 0 for structure strictly cyclic
IPJ = 1 for structure strictly acyclic and with an even diameter
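A small sketch of the Petitjean shape index computed from a list of atom eccentricities, taking the radius as the smallest and the diameter as the largest eccentricity. The first input is the eccentricity column of the seven-vertex example in these slides; the other two are hypothetical illustrations of the limiting cases.

```python
def petitjean_shape_index(eccentricities):
    """(D - R) / R, with R the minimum and D the maximum eccentricity."""
    radius = min(eccentricities)
    diameter = max(eccentricities)
    return (diameter - radius) / radius


# Eccentricities of the seven-vertex example graph.
ipj_example = petitjean_shape_index([3, 2, 2, 3, 3, 3, 3])

# A six-membered ring: every atom has eccentricity 3.
ipj_ring = petitjean_shape_index([3, 3, 3, 3, 3, 3])

# A linear chain of five atoms: acyclic, diameter 4, radius 2.
ipj_chain = petitjean_shape_index([4, 3, 2, 3, 4])
```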
Topological descriptors
Wiener index (1947)
$$W = \frac{1}{2} \sum_{i=1}^{A} \sum_{j=1}^{A} d_{ij}$$
$d_{ij}$ topological distances
high values for big molecules and for linear molecules
low values for small molecules and for branched or cyclic molecules
Average Wiener index
$$\bar{W} = \frac{2W}{A(A-1)}$$
The Average Wiener index is independent from the molecular size.
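Since the double sum over the distance matrix equals the sum of its row sums, the Wiener index can be sketched directly from the distance sums. The input values are the row sums of the seven-vertex example given in these slides; the function names are illustrative.

```python
def wiener_index(distance_sums):
    """W: half the sum of all the entries of the distance matrix."""
    return sum(distance_sums) / 2


def average_wiener_index(distance_sums):
    """W divided by the number of atom pairs, A(A - 1) / 2."""
    n_atoms = len(distance_sums)
    return 2 * wiener_index(distance_sums) / (n_atoms * (n_atoms - 1))


# Row sums of the distance matrix of the seven-vertex example graph.
DISTANCE_SUMS = [13, 8, 9, 14, 13, 14, 13]

w = wiener_index(DISTANCE_SUMS)
w_avg = average_wiener_index(DISTANCE_SUMS)
```

The average value is simply the mean topological distance between two distinct atoms.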
Topological descriptors
Balaban distance connectivity index (1982)
$$J = \frac{B}{C + 1} \sum_{b} (s_i s_j)_b^{-0.5}$$
B number of bonds
C number of cycles, $C = B - A + 1$
A number of atoms
$s_i$ sum of the i-th row distances
one of the most discriminant indices
Averaged form:
$$\bar{J} = \frac{B}{C + 1} \sum_{b} (\bar{s}_i \bar{s}_j)_b^{-0.5} \qquad \bar{s}_i = \frac{s_i}{B}$$
$\bar{s}_i$ average sum of the i-th row distances
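A minimal sketch of the Balaban index J. The distance sums are those of the seven-vertex example in these slides; the edge list is an assumption consistent with them, and the number of cycles is obtained as C = B - A + 1.

```python
# Assumed edge list of the seven-vertex example graph (vertices 1 to 7).
EDGES = [(1, 2), (2, 3), (3, 4), (2, 5), (3, 6), (2, 7)]
# Row sums of the distance matrix, keyed by vertex number.
DISTANCE_SUMS = {1: 13, 2: 8, 3: 9, 4: 14, 5: 13, 6: 14, 7: 13}


def balaban_j(edges, distance_sums):
    """J = B / (C + 1) * sum over the bonds of (s_i * s_j) ** -0.5."""
    n_bonds = len(edges)
    n_atoms = len(distance_sums)
    n_cycles = n_bonds - n_atoms + 1
    total = sum((distance_sums[i] * distance_sums[j]) ** -0.5
                for i, j in edges)
    return n_bonds / (n_cycles + 1) * total


j_index = balaban_j(EDGES, DISTANCE_SUMS)
```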
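Edge descriptors
[Figure: the seven-atom molecular graph (atoms 1 to 7) with its six bonds labelled a to f]
Edge distance matrix, with edge distance sums $^E s_i$ and edge eccentricities $^E \eta_i$:
      a  b  c  d  e  f    Esi   Eηi
 a    0  1  2  1  2  1     7     2
 b    1  0  1  1  1  1     5     1
 c    2  1  0  2  1  2     8     2
 d    1  1  2  0  2  1     7     2
 e    2  1  1  2  0  2     8     2
 f    1  1  2  1  2  0     7     2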
Molecular geometry
The geometry matrix G (or geometric distance matrix) is a square symmetric matrix whose entry $r_{st}$ is the geometric distance calculated as the Euclidean distance between the atoms s and t:
$$r_{st} = \sqrt{(x_s - x_t)^2 + (y_s - y_t)^2 + (z_s - z_t)^2}$$
$$G = \begin{pmatrix} 0 & r_{12} & \cdots & r_{1A} \\ r_{21} & 0 & \cdots & r_{2A} \\ \vdots & \vdots & \ddots & \vdots \\ r_{A1} & r_{A2} & \cdots & 0 \end{pmatrix}$$
Topographic descriptors
Some geometrical descriptors are derived from the corresponding topological descriptors substituting the topological distances $d_{st}$ by the geometrical distances $r_{st}$.
They are called topographic descriptors.
For example, the 3D-Wiener index:
$${}^{3D}W = \frac{1}{2} \sum_{i=1}^{A} \sum_{j=1}^{A} r_{ij}$$
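The geometry matrix and the 3D-Wiener index can be sketched from a list of Cartesian coordinates. The three-atom coordinate set below is hypothetical and only serves to make the arithmetic easy to follow; `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations


def geometry_matrix(coords):
    """Square symmetric matrix of Euclidean interatomic distances r_st."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def wiener_3d(coords):
    """Sum of r_ij over unordered atom pairs, i.e. half the full double sum."""
    return sum(math.dist(p, q) for p, q in combinations(coords, 2))


# Hypothetical (x, y, z) coordinates of three atoms.
COORDS = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]

G = geometry_matrix(COORDS)
w3d = wiener_3d(COORDS)
```

Replacing the Euclidean distances with topological distances in the same sum gives back the ordinary Wiener index.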
THANK YOU
Website: michem.disat.unimib.it/chm/
Hansch approach
Hansch molecular descriptors
lipophilic properties:
• partition coefficients - logP, logKow
• chromatographic parameters - Rf, RT
• solubility
• ...
electronic properties:
• Hammett constants
• molar refraction
• dipole moment
• HOMO, LUMO
• ionization potential
• ...
steric properties:
• molecular weight
• VDW volume
• molar volume
• surface area
• ...