I agree. As a Bayesian hoping to understand my data, P(X|M1) is useful: it's the probability I have for X under M1's modelling assumptions. Of course M1 is an approximation, but that's how science is done. You get to understand how your model behaves, and you may say "Well, X is a bit higher than it should be, but that's because M1 assumes a linear response, and we know that's not quite true".
Bayesian model averaging entails P(X) = P(X|M1)P(M1) + P(X|M2)P(M2). It assumes that either M1 or M2 is true, so no conclusions can be drawn from it. It might be useful from a purely predictive standpoint (maybe), but it has no place inside the scientific pipeline.
There is a related quantity, the Bayes factor P(X|M1)/P(X|M2). That's how much the data favours M1 over M2, and it's a sensible formula, because it doesn't rely on the abominable P(M1) + P(M2) = 1.
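To make the contrast concrete, here is a toy sketch (all numbers made up) of the two quantities: the averaged predictive needs model priors that sum to 1 over the list, while the ratio needs only the two marginal likelihoods.

```python
# Toy illustration of model averaging vs. the Bayes factor.
# The marginal likelihoods below are hypothetical numbers, not real data.

p_x_given_m1 = 0.08   # P(X|M1), made up
p_x_given_m2 = 0.02   # P(X|M2), made up

# Model averaging: forced to assume P(M1) + P(M2) = 1 over just these two models
p_m1, p_m2 = 0.5, 0.5
p_x_averaged = p_x_given_m1 * p_m1 + p_x_given_m2 * p_m2   # ~0.05

# Bayes factor: a ratio, so the "M1 or M2 exhausts the space" assumption drops out
bayes_factor = p_x_given_m1 / p_x_given_m2   # data favour M1 four to one
```

The point of the sketch: adding an M3 to the list would change p_x_averaged (the weights must be renormalised), but leaves bayes_factor untouched.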
Yeah good perspective -- I guess I was thinking about this more from the perspective of predictive modelling than science.
Model averaging can be quite useful when you're averaging over versions of the same model with different hyperparameters, e.g. the number of clusters in a mixture model.
You still need a good hyper-prior over the hyperparameters to avoid overfitting in these cases, though. As an example, IIRC Dirichlet process mixture models can often overfit the number of clusters.
Agreed that model averaging could be harder to justify as a scientist comparing models which are qualitatively quite different.
Model averaging can be quite useful when you're averaging over versions of the same model with different hyperparameters, e.g. the number of clusters in a mixture model.
Yeah, but in this case, there's a crucial difference: within the assumptions of a mixture model M, N=1, 2, ... clusters do make an exhaustive partition of the space, whereas if I compute a distribution for models M1 and M2, there is always M3, M4, ... lurking unexpressed and unaccounted for. In other words,
P(N=1|M) + P(N=2|M) + ... = 1
but
P(M1) + P(M2) << 1
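That exhaustiveness is what makes the within-model average well defined: the posterior weights over N can legitimately be normalised to 1. A minimal sketch, with hypothetical per-N marginal likelihoods and a uniform prior over a capped cluster count:

```python
import math

# Hypothetical log marginal likelihoods log P(X | N, M) for N = 1..4 clusters
# within a single mixture model M (numbers made up for illustration).
log_ml = {1: -120.0, 2: -101.5, 3: -100.9, 4: -103.2}

# Uniform prior P(N | M) over the exhaustive list of cluster counts
log_prior = {n: -math.log(len(log_ml)) for n in log_ml}

# Posterior P(N | X, M), normalised via log-sum-exp for numerical stability
log_post = {n: log_ml[n] + log_prior[n] for n in log_ml}
m = max(log_post.values())
log_z = m + math.log(sum(math.exp(v - m) for v in log_post.values()))
posterior = {n: math.exp(v - log_z) for n, v in log_post.items()}

# These weights sum to 1 because N = 1..4 partitions M's hypothesis space;
# a posterior over {M1, M2} has no such guarantee, since M3, M4, ... are missing.
```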
Is the number of clusters even a hyperparameter? Wiki says that hyperparameters are parameters of the prior distribution. What do you think?