Bingo. The kind of problems OP talks about are fairly frequent, but this problem...

Bingo. The kind of problems OP talks about are fairly frequent, but this problem of train speed estimation is not it.

Also worth discussing is what happens when instead of one you put 30 sensors to improve your estimate of the speed. Good luck figuring out the closed form Doppler expression in that case (you technically _can_ use Kalman filtering but you are assuming each sensor is independent - they would not be, they would be correlated based on their spatial location and closeness to train).

With deep learning, all you need to extend your 1 microphone solution to 30 is a lil bit of pytorch code to add more neurons and some plumbing to pass in 30 audio streams but that's it.

Not to mention extensions to more complicated scenarios - people talking nearby, cars nearby etc. With deep learning you probably wont even need to modify any code, just throw training data (assuming your original model architecture is well designed).