I agree. As a Bayesian hoping to understand my data, P(X|M1) is useful: it's the probability I have for X under M1's modelling assumptions. Of course M1 is an approximation, but that's how science is done. You get to understand how your model behaves, and you may say "Well, X is a bit higher than it should be, but that's because M1 assumes a linear response, and we know that's not quite true".
Bayesian model averaging entails P(X) = P(X|M1)P(M1) + P(X|M2)P(M2). It assumes that either M1 or M2 is true, so no conclusions can be drawn from it. It might be useful from a purely predictive standpoint (maybe), but it has no place inside the scientific pipeline.
There is a related quantity, the Bayes factor P(X|M1)/P(X|M2). That's how much the data favours M1 over M2, and it's a sensible formula, because it doesn't rely on the abominable P(M1) + P(M2) = 1.
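To make the contrast concrete, here is a toy sketch (all numbers made up) of the two quantities: the averaged predictive needs model priors that sum to 1 over the list, while the ratio needs only the two marginal likelihoods.

```python
# Toy illustration of model averaging vs. the Bayes factor.
# The marginal likelihoods below are hypothetical numbers, not real data.

p_x_given_m1 = 0.08   # P(X|M1), made up
p_x_given_m2 = 0.02   # P(X|M2), made up

# Model averaging: forced to assume P(M1) + P(M2) = 1 over just these two models
p_m1, p_m2 = 0.5, 0.5
p_x_averaged = p_x_given_m1 * p_m1 + p_x_given_m2 * p_m2   # ~0.05

# Bayes factor: a ratio, so the "M1 or M2 exhausts the space" assumption drops out
bayes_factor = p_x_given_m1 / p_x_given_m2   # data favour M1 four to one
```

The point of the sketch: adding an M3 to the list would change p_x_averaged (the weights must be renormalised), but leaves bayes_factor untouched.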
Yeah good perspective -- I guess I was thinking about this more from the perspective of predictive modelling than science.
Model averaging can be quite useful when you're averaging over versions of the same model with different hyperparameters, e.g. the number of clusters in a mixture model.
You still need a good hyper-prior over the hyperparameters to avoid overfitting in these cases, though. As an example, IIRC Dirichlet process mixture models can often overfit the number of clusters.
Agreed that model averaging could be harder to justify as a scientist comparing models which are qualitatively quite different.
Model averaging can be quite useful when you're averaging over versions of the same model with different hyperparameters, e.g. the number of clusters in a mixture model.
Yeah, but in this case, there's a crucial difference: within the assumptions of a mixture model M, N=1, 2, ... clusters do make an exhaustive partition of the space, whereas if I compute a distribution for models M1 and M2, there is always M3, M4, ... lurking unexpressed and unaccounted for. In other words,
P(N=1|M) + P(N=2|M) + ... = 1
but
P(M1) + P(M2) << 1
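That exhaustiveness is what makes the within-model average well defined: the posterior weights over N can legitimately be normalised to 1. A minimal sketch, with hypothetical per-N marginal likelihoods and a uniform prior over a capped cluster count:

```python
import math

# Hypothetical log marginal likelihoods log P(X | N, M) for N = 1..4 clusters
# within a single mixture model M (numbers made up for illustration).
log_ml = {1: -120.0, 2: -101.5, 3: -100.9, 4: -103.2}

# Uniform prior P(N | M) over the exhaustive list of cluster counts
log_prior = {n: -math.log(len(log_ml)) for n in log_ml}

# Posterior P(N | X, M), normalised via log-sum-exp for numerical stability
log_post = {n: log_ml[n] + log_prior[n] for n in log_ml}
m = max(log_post.values())
log_z = m + math.log(sum(math.exp(v - m) for v in log_post.values()))
posterior = {n: math.exp(v - log_z) for n, v in log_post.items()}

# These weights sum to 1 because N = 1..4 partitions M's hypothesis space;
# a posterior over {M1, M2} has no such guarantee, since M3, M4, ... are missing.
```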
Is the number of clusters even a hyperparameter? Wiki says that hyperparameters are parameters of the prior distribution. What do you think?