Nonlinear test statistics
The optimal decision boundary may not be a hyperplane → nonlinear test statistic
[Figure: curved decision boundary, with the accept-H0 region on one side and H1 on the other]
There are many multivariate statistical methods:
Particle physics benefits here from progress made in the field of machine learning (for example in artificial-intelligence research):
Neural networks, kernel density methods, decision trees, ...
Glen Cowan, Multivariate Statistical Methods in Particle Physics
Linear decision boundaries
A linear decision boundary is only optimal when both classes follow multivariate Gaussians with equal covariances and different means.
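As a minimal sketch of this case, the optimal linear boundary for two equal-covariance Gaussians can be computed in closed form; the means and covariance below are made up for illustration:

```python
import numpy as np

# Two classes with different means but the SAME covariance V:
# the likelihood-ratio boundary is then linear, t(x) = w.x + b = 0,
# with w = V^{-1} (mu1 - mu0) (the Fisher discriminant direction).
mu0 = np.array([0.0, 0.0])   # mean under H0 (hypothetical)
mu1 = np.array([2.0, 1.0])   # mean under H1 (hypothetical)
V = np.array([[1.0, 0.5],
              [0.5, 1.0]])   # common covariance (hypothetical)

Vinv = np.linalg.inv(V)
w = Vinv @ (mu1 - mu0)                         # boundary normal
b = -0.5 * (mu1 + mu0) @ Vinv @ (mu1 - mu0)    # offset for equal priors

def t(x):
    """Linear test statistic; t(x) > 0 favours H1."""
    return w @ x + b

print(t(mu0) < 0, t(mu1) > 0)  # each class mean lands on its own side
```

The boundary passes through the midpoint of the two means, as expected for equal priors.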
[Figure: two Gaussian classes in the (x1, x2) plane separated by a linear boundary]
In other cases a linear boundary is almost useless.
[Figure: classes in the (x1, x2) plane that no linear boundary can separate]
Nonlinear transformation of inputs
[Figure: data in the original (x1, x2) plane and in the transformed (φ1, φ2) plane]
We can try to find a transformation (x1, ..., xn) → (φ1(x), ..., φm(x)) so that the transformed "feature space" variables can be separated better by a linear boundary, e.g.

φ1 = tan−1(x2/x1),  φ2 = x1² + x2²

Here the basis functions are guessed and fixed (no free parameters).
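A minimal sketch of such a fixed transformation, using a made-up ring-shaped signal around a central background blob (the shapes and sizes are invented for illustration):

```python
import numpy as np

# Fixed (guessed) basis functions: map (x1, x2) to polar-like features
# phi1 = atan2(x2, x1), phi2 = x1^2 + x2^2.  A circular "signal ring"
# becomes a horizontal band in (phi1, phi2), separable by a line.
def transform(x1, x2):
    phi1 = np.arctan2(x2, x1)
    phi2 = x1**2 + x2**2
    return phi1, phi2

rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 500)
signal = np.stack([2 * np.cos(theta), 2 * np.sin(theta)], axis=1)  # ring, r = 2
background = rng.normal(0, 0.5, size=(500, 2))                     # blob at origin

_, r2_sig = transform(signal[:, 0], signal[:, 1])
_, r2_bkg = transform(background[:, 0], background[:, 1])

# In the transformed space the simple linear cut phi2 > 2 separates the classes.
print((r2_sig > 2).mean(), (r2_bkg > 2).mean())
```

No linear cut in (x1, x2) could achieve this; in (φ1, φ2) a single threshold on φ2 suffices.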
Introduction to neural networks
They are used in neurobiology, pattern recognition, financial mathematics, ...; here they are simply one type of test statistic.
Suppose t(x) has the sigmoidal form

t(x) = s(a0 + ∑i ai xi)

This is a neural network with a single layer of nodes (single-layer perceptron). If s(u) is monotonic → it is equivalent to a linear t(x).
Multilayer neural networks
The outputs of the first layer become the input values of a subsequent layer.
The value of node i in the intermediate (hidden) layer is given by

φi(x) = s(wi0 + ∑j wij xj)

and the network output is

t(x) = s(a0 + ∑i ai φi(x))

where the wij and ai are the weights (connection strengths).
Discussion of neural networks
Easy to generalize to an arbitrary number of layers of nodes.
Feed-forward network: the values of a node depend only on the preceding layer.
More nodes → the closer t(x) is to the optimum, but a larger number of parameters must be determined.
The parameters are determined by minimizing an error function of the form

E(w) = E[(t(x; w) − t(0))² | H0] + E[(t(x; w) − t(1))² | H1]

where t(0), t(1) are preassigned values, for example 0 and 1 for the sigmoid. The expectation values are computed with a Monte Carlo sample (training sample). The procedure is complicated, and standard software packages are used.
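To make the procedure concrete, here is a minimal sketch (not the standard software mentioned above): a single-hidden-layer network trained by gradient descent on the squared-error function with targets 0 and 1, using a made-up two-Gaussian sample. The network size, learning rate, and number of cycles are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training sample: background (target 0) and signal (target 1).
x_bkg = rng.normal(-1.0, 1.0, size=(200, 2))
x_sig = rng.normal(+1.0, 1.0, size=(200, 2))
X = np.vstack([x_bkg, x_sig])
target = np.concatenate([np.zeros(200), np.ones(200)])

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# One hidden layer with 5 nodes (size chosen arbitrarily).
W1 = rng.normal(0, 0.5, size=(2, 5)); b1 = np.zeros(5)
W2 = rng.normal(0, 0.5, size=(5, 1)); b2 = np.zeros(1)

def forward(X):
    phi = sigmoid(X @ W1 + b1)        # hidden-layer values phi_i(x)
    t = sigmoid(phi @ W2 + b2)[:, 0]  # network output t(x)
    return phi, t

# Minimize E(w) = mean of (t(x_a; w) - target_a)^2 by gradient descent.
lr = 0.5
for cycle in range(2000):
    phi, t = forward(X)
    delta_out = 2 * (t - target) * t * (1 - t)           # backprop through output
    grad_W2 = phi.T @ delta_out[:, None] / len(X)
    delta_hid = delta_out[:, None] * W2.T[0] * phi * (1 - phi)
    grad_W1 = X.T @ delta_hid / len(X)
    W2 -= lr * grad_W2; b2 -= lr * delta_out.mean()
    W1 -= lr * grad_W1; b1 -= lr * delta_hid.mean(axis=0)

_, t = forward(X)
err = ((t > 0.5) != (target > 0.5)).mean()
print("training error rate:", err)
```

In practice one would use an established package rather than hand-written backpropagation.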
Network architecture: one hidden layer
Theorem: An MLP with a single hidden layer having a sufficiently large number of nodes can approximate arbitrarily well the Bayes optimal decision boundary.
Holds for any continuous non-polynomial activation function. Leshno, Lin, Pinkus and Schocken (1993), Neural Networks 6, 861-867.
In practice one often chooses a single hidden layer and tries increasing the number of nodes until no further improvement in performance is found.
More than one hidden layer
“Relatively little is known concerning the advantages and disadvantages of using a single hidden layer with many units (neurons) over many hidden layers with fewer units. The mathematics and approximation theory of the MLP model with more than one hidden layer is not well understood.”
“Nonetheless there seems to be reason to conjecture that the two hidden layer model may be significantly more promising than the single hidden layer model, ...”
A. Pinkus, Approximation theory of the MLP model in neural networks, Acta Numerica (1999), pp. 143-195.
Overtraining
If the network has too many nodes, after training it will tend to conform too closely to the training data:
The classification error rate on the training sample may be very low, but it would be much higher on an independent data sample.
Therefore it is important to evaluate the error rate with a statistically independent validation sample.
Monitoring overtraining
If we monitor the value of the error function E(w) at every cycle of the minimization, for the training sample it will continue to decrease.
But for the validation sample it may initially decrease and then at some point increase, indicating overtraining.
[Figure: error vs. training cycle for the training sample (steadily falling) and the validation sample (falling, then rising)]
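The stopping rule this implies can be sketched in a few lines; the error sequences below are invented purely to illustrate the two curve shapes described above:

```python
# Track the validation error at each training cycle and stop training
# at the cycle where it is minimal (before it starts to rise).
def best_stopping_cycle(val_errors):
    """Return the index of the minimum validation error."""
    return min(range(len(val_errors)), key=lambda i: val_errors[i])

train_err = [0.50, 0.40, 0.32, 0.26, 0.21, 0.17, 0.14, 0.12]  # keeps falling
val_err   = [0.52, 0.43, 0.36, 0.31, 0.30, 0.33, 0.38, 0.45]  # turns upward

stop = best_stopping_cycle(val_err)
print("stop at cycle", stop)  # -> stop at cycle 4
```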
Validation and testing
The validation sample can be used to make various choices about the network architecture, e.g., adjust the number of hidden nodes so as to obtain good “generalization performance” (the ability to correctly classify unseen data).
If the validation stage is iterated many times, the estimated error rate based on the validation sample has a bias, so strictly speaking one should finally estimate the error rate with an independent test sample.
Rule of thumb if data are not too expensive (Narsky): train : validate : test = 50 : 25 : 25.
But this depends on the type of classifier. Often the bias in the error rate from the validation sample is small and one can omit the test step.
Example: a neural network at LEP II
Signal: e+e− → W+W− (4 well-separated jets)
Background: e+e− → qqgg (4 less well-separated jets)
← input variables based on jet structure, event shape, etc., none of which alone separates signal from background. The neural network gives a better separation.
(Garrido, Juste and Martinez, ALEPH 96-144)
Probability Density Estimation (PDE) techniques
See e.g. K. Cranmer, Kernel Estimation in High Energy Physics, CPC 136 (2001) 198; hep-ex/0011057; T. Carli and B. Koblitz, A multi-variate discrimination technique based on range-searching, NIM A 501 (2003) 576; hep-ex/0211019
Construct non-parametric estimators p̂(x|H0) and p̂(x|H1) of the pdfs and use these to construct the likelihood ratio t(x) = p̂(x|H1) / p̂(x|H0).
(An n-dimensional histogram is a brute-force example of this.) More clever estimation techniques can get this to work in (somewhat) higher dimensions.
Product of one-dimensional pdfs
First rotate to uncorrelated variables, i.e., find a matrix A such that for y = Ax we have cov[yi, yj] = 0.
Estimate the d-dimensional joint pdf as the product of 1-d pdfs,

p̂(x) = ∏i=1..d p̂i(xi)   (here x decorrelated)
This does not exploit non-linear features of the joint pdf, but it is simple and may be a good approximation in practical examples.
Correlation vs. independence
In general a multivariate distribution p(x) does not factorize into a product of the marginal distributions for the individual variables:
p(x) = ∏i=1..n pi(xi)   holds only if the components of x are independent
Most importantly, the components of x will generally have nonzero covariances (i.e. they are correlated):
Vij = cov[xi, xj] = E[xi xj] − E[xi] E[xj] ≠ 0
Decorrelation of input variables
But we can define a set of uncorrelated input variables by a linear transformation, i.e., find the matrix A such that for y = Ax the covariances cov[yi, yj] = 0.
For the following, suppose that the variables are “decorrelated” in this way for each of p(x|H0) and p(x|H1) separately (since in general their correlations are different).
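A sketch of finding such a matrix A by diagonalizing the sample covariance; the toy covariance below is an arbitrary choice (in practice this would be done separately for the H0 and H1 samples, as noted above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated toy sample with covariance C (hypothetical)
C = np.array([[1.0, 0.8],
              [0.8, 1.0]])
L = np.linalg.cholesky(C)
x = rng.normal(size=(10000, 2)) @ L.T   # cov[x] approx C

# Find A such that y = Ax has cov[y_i, y_j] = 0:
# diagonalize the sample covariance, V = U D U^T, and take A = U^T.
V = np.cov(x, rowvar=False)
eigvals, U = np.linalg.eigh(V)
A = U.T

y = x @ A.T                             # apply y = Ax to each event
Vy = np.cov(y, rowvar=False)
print("off-diagonal covariance:", Vy[0, 1])  # ~ 0 up to rounding
```

Equivalently one can whiten (also rescale to unit variances) by dividing each yi by its standard deviation.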
Decorrelation is not enough
But even with zero correlation, a multivariate pdf p(x) will in general have nonlinearities, and thus the decorrelated variables are still not independent.
[Figure: pdf in the (x1, x2) plane with zero covariance but components still not independent, since clearly]

p(x2|x1) ≡ p(x1, x2) / p1(x1) ≠ p2(x2)

and therefore

p(x1, x2) ≠ p1(x1) p2(x2)
Naive Bayes
But if the nonlinearities are not too great, it is reasonable to first decorrelate the inputs and take as our estimator for each pdf
p̂(x) = ∏i=1..n p̂i(xi)
So this at least reduces the problem to one of finding estimates of one-dimensional pdfs.
The resulting estimated likelihood ratio gives the Naive Bayes classifier (in HEP sometimes called the “likelihood method”).
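A minimal sketch of the resulting classifier, with histogram estimates of the 1-d pdfs on made-up Gaussian samples (the decorrelation step is skipped here, since the toy variables are generated independently):

```python
import numpy as np

rng = np.random.default_rng(2)

# Training samples for the two hypotheses (toy, independent coordinates).
x_h0 = rng.normal(0.0, 1.0, size=(5000, 2))
x_h1 = rng.normal(1.5, 1.0, size=(5000, 2))

# Estimate each 1-d marginal pdf with a histogram.
def make_1d_pdf(sample, bins=50, lo=-5.0, hi=7.0):
    counts, edges = np.histogram(sample, bins=bins, range=(lo, hi), density=True)
    def pdf(v):
        i = np.clip(np.searchsorted(edges, v) - 1, 0, bins - 1)
        return np.maximum(counts[i], 1e-12)   # floor to avoid division by zero
    return pdf

pdfs_h0 = [make_1d_pdf(x_h0[:, i]) for i in range(2)]
pdfs_h1 = [make_1d_pdf(x_h1[:, i]) for i in range(2)]

def t(x):
    """Naive Bayes test statistic: ratio of products of 1-d pdf estimates."""
    p0 = np.prod([p(x[i]) for i, p in enumerate(pdfs_h0)])
    p1 = np.prod([p(x[i]) for i, p in enumerate(pdfs_h1)])
    return p1 / p0

print(t(np.array([1.5, 1.5])) > 1, t(np.array([0.0, 0.0])) < 1)
```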
Histograms
Start by considering the one-dimensional case; the goal is to estimate the pdf p(x) of a continuous r.v. x.
Simplest non-parametric estimate of p(x) is a histogram:
Bishop Section 2.5
p̂(x) = ni / (N Δxi)   for x in bin i

where ni = number of entries in bin i, Δxi = width of bin i, N = total number of entries.
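The histogram estimator in code, on a made-up Gaussian sample (binning chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
sample = rng.normal(0.0, 1.0, 2000)   # toy data from a standard Gaussian

# Histogram estimate: p_hat(x) = n_i / (N * dx_i) for x in bin i
edges = np.linspace(-4, 4, 21)        # 20 bins of width 0.4
n_i, _ = np.histogram(sample, bins=edges)
N = len(sample)
dx = np.diff(edges)
p_hat = n_i / (N * dx)

# The estimate integrates to (almost) 1 by construction.
print(np.sum(p_hat * dx))
```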
Histograms (2)
Small bin width: estimate is very spiky, structure not really part of underlying distribution.
Medium bin width: best
Large bin width: too smooth, and thus fails to capture e.g. the bimodal character of the parent distribution.
Bishop Section 2.5
Counting events in a local volume
Consider a small volume V centred about x = (x1, ..., xD).
This is in contrast to the histogram where the bin edges were fixed.
Suppose from N total events we find K in V.
Take as estimate for p(x):

p̂(x) = K / (N V)
Two approaches:
Fix V and determine K from the data
Fix K and determine V from the data
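A sketch of the second approach (fix K, determine V from the data) in one dimension, with a made-up Gaussian sample and an arbitrary choice of K:

```python
import numpy as np

rng = np.random.default_rng(4)
sample = rng.normal(0.0, 1.0, 1000)   # 1-d toy data, N = 1000

def knn_density(x, sample, K=50):
    """Fix K, determine V: take the smallest interval around x
    containing K events, then p_hat = K / (N * V)."""
    dist = np.sort(np.abs(sample - x))
    V = 2 * dist[K - 1]               # interval length = "volume" in 1-d
    return K / (len(sample) * V)

print(knn_density(0.0, sample))  # should be near 1/sqrt(2*pi) ~ 0.40
```

Fixing V and determining K instead leads to the kernel estimators below.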
Kernels
E.g. take V to be a hypercube centered at the x where we want p(x).
Define k(u) = 1 for |ui| ≤ 1/2, i = 1, ..., D, and 0 otherwise; i.e., the function is nonzero inside a unit hypercube centred about x and zero outside.
k(u) is an example of a kernel function (here called a Parzen window).
Kernel-based PDE (KDE, Parzen window)
Consider d dimensions and N training events x1, ..., xN; estimate f(x) with a sum of kernels centred on the events, using e.g. a Gaussian kernel:

f̂(x) = (1/N) ∑i=1..N (2πh²)^(−d/2) exp(−(x − xi)² / (2h²))

where h is the kernel bandwidth (smoothing parameter).
One needs to sum N terms to evaluate the function (slow); faster algorithms only count events in the vicinity of x (k-nearest neighbour, range search).
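A sketch of the Gaussian-kernel estimate in one dimension (sample and bandwidth are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
sample = rng.normal(0.0, 1.0, 1000)   # training events x_1..x_N, d = 1

def kde(x, sample, h=0.3):
    """Gaussian-kernel estimate: average of kernels centred on the events."""
    norm = np.sqrt(2 * np.pi) * h     # (2*pi*h^2)^(d/2) for d = 1
    return np.mean(np.exp(-0.5 * ((x - sample) / h) ** 2)) / norm

# Naive evaluation sums all N kernel terms (slow for large N);
# range-search / k-NN style algorithms only visit events near x.
print(kde(0.0, sample))  # near the true density 1/sqrt(2*pi) ~ 0.40
```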
Decision trees
A training sample of signal and background data is repeatedly split by successive cuts on its input variables. The order in which the variables are used is based on the best separation between signal and background.
Example from MiniBooNE, B. Roe et al., NIM A 543 (2005) 577
Iterate until a stop criterion is reached, based e.g. on purity or a minimum number of events in a node. The resulting set of cuts is a ‘decision tree’.
Decision tree size and stability
Usually one grows the tree first to a very large (e.g. maximum) size and then applies pruning.
For example one can recombine leaves based on some measure of generalization performance (e.g. using statistical error of purity estimates).
Decision trees tend to be very sensitive to statistical fluctuations in the training sample.
Methods such as boosting can be used to stabilize the tree.
Boosted decision trees
Boosting combines a number of classifiers into a stronger one; it improves stability with respect to fluctuations in the input data. To use it with decision trees, increase the weights of misclassified events and reconstruct the tree. Iterate → a forest of trees (perhaps > 1000). For the m-th tree:
Define a score αm based on the error rate of the m-th tree.
Boosted tree = weighted sum of the trees:

t(x) = ∑m αm tm(x)
Algorithms: AdaBoost (Freund & Schapire), ε-boost (Friedman).
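A minimal sketch of the boosting loop with decision stumps (one-cut trees) as the weak classifiers, following the AdaBoost weighting; the sample and all settings are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy 1-d sample with labels y = +1 (signal), -1 (background)
x = np.concatenate([rng.normal(-1, 1, 300), rng.normal(+1, 1, 300)])
y = np.concatenate([-np.ones(300), +np.ones(300)])

w = np.ones(len(x)) / len(x)         # event weights
stumps, alphas = [], []
for m in range(50):
    # Find the best single cut on the weighted sample.
    best = None
    for cut in np.linspace(-2, 2, 41):
        for sign in (+1, -1):
            pred = sign * np.sign(x - cut)
            err = w[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, cut, sign)
    err, cut, sign = best
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))  # score of m-th tree
    pred = sign * np.sign(x - cut)
    w *= np.exp(-alpha * y * pred)   # raise weights of misclassified events
    w /= w.sum()
    stumps.append((cut, sign)); alphas.append(alpha)

# Boosted classifier = weighted sum of the trees
def boosted(xv):
    return sum(a * s * np.sign(xv - c) for a, (c, s) in zip(alphas, stumps))

err_rate = (np.sign(boosted(x)) != y).mean()
print("training error rate:", err_rate)
```

Real analyses use full decision trees and larger forests, but the weighting logic is the same.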
Comparison of multivariate methods (TMVA)
One chooses the method that gives the best result.
Given a test variable, the subsequent steps are, for example, to select n events and estimate a cross section of signal:
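Schematically, for a counting experiment this is the standard estimate; the symbols εs (signal efficiency of the cut), L (integrated luminosity) and b (expected background) are the usual ones, not defined in the slides:

```latex
% Expected events after the cut t(x) > t_{\mathrm{cut}}:
n = \varepsilon_s \,\sigma\, L + b
% Inverting for the cross section:
\hat{\sigma} = \frac{n_{\mathrm{obs}} - \hat{b}}{\varepsilon_s \, L}
```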
Discussion of multivariate analyses
But we must also estimate the systematic error. If the training sample (MC) ≠ Nature, our estimates of the backgrounds and efficiencies can be wrong (this is also true for simple cuts).
It is best to start with only 1-2 variables (those with the greatest discriminating power) and add the others only if the improvements are significant.
- With fewer variables there is no ‘over-training’ problem.
- Correlations often make it useless to add another variable, and they are potentially dangerous for the systematics.