>the baseline models they test and blend are really terrible as well.
If the effect is there, I would guess a blend of a few bad models should outperform a single mediocre one, and a blend of a few mediocre ones should outperform a single state-of-the-art one.
Of course it would be good to show the same again with GPT-4 and maybe three GPT-3.5-sized models, but that isn't necessary to show that such an effect exists, and it may be cost-prohibitive for them as a research team. Whether their methodology for proving this effect is correct is another discussion.
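For concreteness, here's a minimal sketch of what I understand the blending setup to be: each conversational turn is answered by one model picked at random, while all models share the same conversation history. The model names and stub replies are placeholders, not the paper's actual models:

```python
import random

# Stand-ins for a few small chat models. In practice each would wrap a real
# endpoint (e.g. a local inference server); here they return canned text so
# the sketch runs on its own.
def model_a(history: list[str]) -> str:
    return f"[model_a] reply to: {history[-1]}"

def model_b(history: list[str]) -> str:
    return f"[model_b] reply to: {history[-1]}"

def model_c(history: list[str]) -> str:
    return f"[model_c] reply to: {history[-1]}"

MODELS = [model_a, model_b, model_c]

def blended_reply(history: list[str]) -> str:
    """Pick one model uniformly at random for this turn. Every model still
    conditions on the full shared history, which is what makes the ensemble
    read like a single 'blended' system to the user."""
    return random.choice(MODELS)(history)

# A short conversation where each turn may come from a different model.
history: list[str] = []
for user_msg in ["Hi there!", "Tell me a joke."]:
    history.append(user_msg)
    reply = blended_reply(history)
    history.append(reply)
    print(reply)
```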
Personally, I don't find these results surprising: our brain is also somewhat compartmentalized, so why wouldn't the same hold for a good AI system?
The more difficult part is how to train these subnetworks optimally.
A Yi 34B or Mixtral finetune on the same data would blow them out of the water, and probably ChatGPT 3.5 as well.