SupervisedLearning Coursework 2, GI01/M055, 2017/2018 Solution toQuestion 1 :(a)· A prior distribution – probability distribution over a function dp(f)· An additive noise model – inaccuracy in the output observations.The Noise for each measure assumed to be independent. (b) Accordingto the Bayes theorem – Posterior probability of a parameters (f) given an inputdata (X) is the product of the Likelihood function and Prior probability dP(f) (prior knowledge before performing theexperiment) divide by probability of input data.

Letus assume the inputs data x = {Xi, where i = 1,2,3…..N} and itscorresponding random function of f = { fi, where i = 1,2,3…..N } dPpost(f|X,y) = (dP(f) P(y|X, f))/p(y|X) Wecan eliminate the denominator p(y|X) since there is no dependency on theparameter ‘f’. Therefore:dPpost(f|X,y) = (dP(f) P(y|X, f))/p(y|X) ? dP(f) P(y|X, f) dPpost(f)= dP(f) P(y|X, f)) (c) Weknow that in the case of Gaussian process regression the prior and additivenoise model are Gaussian.Let =, where P(y|X,f)) = ,where X Î and y Î Thedata corrupted by likelihood function with zero mean and some variance Prioris given by dP(f) for symmetrical Gaussian distribution – centred at the originover the weight vector w.

(d) Fora finite collections of random variable stochastic Gaussian process hasmultivariate normal distribution while for infinite random variable it has ajoint distribution. In our case, it is a1-dimensional Gaussian(normal) distribution know as marginal distribution. (e) Marginaldistribution is the integral of prior multiplied by additive noise model.

Letus assume the inputs data x = {Xi, where i = 1,2,3…..N} itscorresponding random function of f = { fi, where i = 1,2,3…..

N } andoutput v alues yi …… yn P(y|X,f)) = Theposterior distribution: Bymanipulating the above expression, we can get the mean and covariance matrix. Theargmin of the weight vector( ) would maximise the posterior i.e it wouldgive us the mean of the distribution. is the mean distribution, where is standard optimiser, is the identity matrix which is And the covariance can be expressed as: (f) Thefull Bayesian inference adds up all the posterior distribution output and thentakes the average as its output. The maximum a posteriori probability (MAP) picksa weight vector which maximises the posterior distribution. If the prior and additive noise model areGaussian then the Posterior is also a Gaussian. In linear Gaussian, theconvergence of the posterior mean and the MAP estimator coincide.

The sample mean of random linear function is equivalentto the MAP mean of Gaussian for any value of standard deviation. Sincewe have prior knowledge about pervious posterior distribution, we can use it toenhance the MAP hence we can get the complete posterior distribution – as shownabove in part ‘e’ the Gaussian have a mean of in high dimension which is more accurate than sample meanfrom the linear function. Solution toQuestion 2 :(a) When the value of a =b =1, the Beta distributionis equivalent to uniform distribution. If we integrate B(a,b) and compute in interval of0,1 we get 0.5 mean and a /a +b gives 0.5 mean as well. (b) Let X be the data X ={1,….

n}H ={h:i?X ?0,1}=0,1X Prior given as, P(h) = and noise model P(y = 1|h,(i,y)) = h(i) and P(y = 0|h,(i,y)) = 1?h(i) Posterior distribution isthe proportional to the product of prior and noise model. Prior of the betadistribution is given byP(h) = Ppost = , This meansBeta is conjugate to the Bernoulli. (c) The mode is equivalent to themaximum a posterior distribution, so Mode of the posterior The maximumlikelihood estimate is the empirical probability, i.e. , Mean of theposterior E(p|X) = For uniform distributionwhere a =b =1P(y = 1|h,(i,y)) = h(i) = 0.3P(y = 0|h,(i,y)) = 1-h(i) = 0.7Therefore, Mean = 0.4333Mode = = 0.

3 Solution toQuestion 3 :(a)Overfitting or High variance is when the learned hypothesis fitstraining data well but fails to generalize for unseen dataset. The model mightbe too wiggly which is not a good model to predict future(test) dataset. Inother words, the predictor is too complex and not consistent for unseen data. (b) Empirical risk minimization(ERM) – the learner algorithm gets X trainingdata input and model function H which has an output predictor H(x) from unknowndistribution D. The goal of the learner algorithm is to minimize hypothesis H*for fixed function H for which E(L(H(x), y)) is minimum.

For such unknown distribution, we minimizethe empirical risk over the given training dataset such technique in statisticallearning algorithm known as Empirical risk minimization(ERM.) (c) Structural Risk Minimization(SRM) is on finite dataset, the taskof selecting best model class function within a nested family of space . The model may be too complex at HQ and poor at generalizing it to a newdataset which would consequent an overfitting problem. The SRM addressesoverfitting problem by solving such problem by balancing the model’s complexityand training data error. Let in Hq minimizer be As we increase q, the gets reduced, the function increase in between we get abalanced model.

(d) Regularization addresses overfitting problem by penalizing complexlearning algorithm – it selects one complex hypothesis class and adds regularizer function to empirical error to prevent overfitting. Our goal is to choose best learning algorithm among many possiblehypothesis space – the regularization parameter can be chosen best in case of ridge regration or best in . If we have large dataset – we can split the dataset into threedifferent parts: Training dataset – where we model our functions using differentlearning algorithms.Development or Validation dataset – on this we tune the parameterson held out dataset (development or validation set.

) Testdataset – to test best model we have developed on the training dataset. (e) The kernel trick enables us to work with linear function ininfinite dimensional feature space. subject to yi?w,xi??1??i,?i ?0, ?i = 1,…,m.??Hinge loss function is defined as: Basically, we minimize the hinge loss while minimizing the norm ofthe weight vector(w) that corresponds to maximising the margin.

Maximising themargin gives a good generalization learning algorithm even though we have ahigh dimensional feature space. The nonnegative xi(?i) measures the hinge loss – computesthe amount by which fails to meet a margin of 1. (f) By changing the hinge loss function, we can create a linearprogramming boosting. The primal subject to yiHia?p??i,?i ?0, ?= 1,..

.,m.? and the dual would have thefollowing form:; subject to where is the distribution, is additional weak learner, Optimal solution canbe achieved by doing the dual programming iteratively. We assume the set of weaklearners are finite. The boosting algorithm would give optimum error bound.We might have more than one weak learner’s optimal solution whichare approximate to the cost function. Solution toQuestion 4 : (a) In statistical learning theory generalisation error is a measure howwell a predictive of a learning algorithm perform on unseen dataset. The expected error is approximately equivalentto the empirical error and the error reduces when we increase the dataset.

Sincethe generalisation of the learning algorithm is done using randomly drawn froma finite independent and identically distributed samples the model might besensitive. One way of overcoming this problem is by relating thetraining(given) error to the generalisation error – which possibly would avoidoverfitting. There are many theories forgeneralisation error – each theory has strengths and weaknesses. (b) For a fixed size of ‘m’training datasets None linearity in the lostfunction would give a small difference in the average of the distribution.Analysing the mean usingtraditional statistics , If we use consistency ofclassification method for big dataset where if p(x,1)>p(x,0), otherwise This might not be a good solution when we have small sample size.We are taking limit m tends to infinity so for small m the Bayes risk mightfail. The 95th percentile is 0.

15 which is above the mean value 0.08.This means the probability we have been misled by 0.

05 which is less than themean value. (c) Structural risk minimization(SRM) is a best learning algorithm selectionprocedure – choosing the model that minimizes VC upper bound on risk. ,For fixed datasets, as the complexity of the learning algorithm C(Hi)increases, the training error (eˆrr(c)) will decrease but fails to generalize for unseen dataset. On the other hand, as the VC dimensionincreases, will increase since this depends on the VCdimension. SRC chooses a best model from the sequence of hypothesis ( which minimizes the right-hand side of theabove inequality.

As the complex models might over fit and at the same time the moretraining set we have the lower empirical error will be. SRC resolves the trade-offby providing a measure of characterization between complexity and empiricalerror – bybalancing the model’s complexity and training set (empirical) error. (d)As discussed above(b) is the probability that the training data misled. AssumingHk is finite and confidence at least 1-, where is small positive number, we can compute the likelihoodof being misled by evaluating training error equals to zero. In our case, Since the functions are drawnrandomly from a distribution we must add prior weight. By applying Hoeffding’s inequalityto the randomly drawn probability we can get the generalization error isbounded – for model k. , Reference:I have used the lecture notesfor this assignment.https://moodle.ucl.ac.uk/course/view.php?id=11543