choppaface on March 6, 2024 | on: Training LLMs from ground zero as a startup

Really telling quote:

> I was completely taken aback by the failure rate of GPUs as opposed to my experiences on TPUs at Google

Should be "I was completely unaware of the failure modes of GPUs, because all my career I've been inside Google and used Google TPUs and was well-acquainted with those failure modes."

I've mostly used GPUs, and when I tried TPUs the jobs failed all the time for really hard-to-debug reasons. The indirection between the x86 chip and the TPU device often caused hours of hair-pulling, the kind of thing you never get with x86+NVIDIA+PyTorch.

10-15 years ago, Google minted many $10M+ data scientists (aka Sawzall engineers) who also ventured "into the wilderness" and had very similar reactions. This blog post is much more about the OP hyping his company and personal brand than about contributing useful notes to the community.

quadrature on March 7, 2024

I think the OP is referring to hardware failures rather than software not playing well together.

StarCyan on March 7, 2024

When was this? I use JAX+TPUs to train LLMs and haven't experienced many issues. IMO it was way easier to set up distributed training, sharding, etc. compared to PyTorch+GPUs.
