Training is what creates the model; inference is what uses it. When you call a provider's API or run Ollama on your laptop, that's inference. Latency, throughput, and cost all live at the inference layer. Custom silicon (Groq's LPU, Cerebras's wafer-scale chip) optimises inference specifically, which is why those providers tend to post some of the lowest latency numbers.
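To make "latency, throughput, and cost" concrete, here is a minimal Python sketch of how you might measure them for a single request. It assumes a local Ollama server on its default port (`http://localhost:11434`) and relies on the `eval_count` and `eval_duration` fields that Ollama's `/api/generate` response reports. The function names and the per-token price are hypothetical, chosen only for illustration.

```python
import json
import time
import urllib.request

# Assumption: a local Ollama server on its default port, with a model already pulled.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Throughput: generated tokens divided by generation time in seconds."""
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


def cost_usd(output_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of the generated tokens at a (hypothetical) per-million-token price."""
    return output_tokens / 1_000_000 * usd_per_million_tokens


def measure(model: str, prompt: str) -> dict:
    """Send one non-streaming request and report latency and throughput."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    latency_s = time.perf_counter() - start  # wall-clock time for the whole call
    return {
        "latency_s": latency_s,
        # eval_duration is reported in nanoseconds.
        "tokens_per_s": tokens_per_second(data["eval_count"], data["eval_duration"]),
    }
```

Calling `measure("llama3", "Say hello in five words.")` would return both numbers for one request, where `llama3` stands in for whichever model you have pulled. Running the same prompt against different backends is the simplest way to see how much these figures vary between inference providers.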
Engineering notes from the Digitorn team. No marketing, no launch announcements, no "10 prompts that will change your life". Just the things we write that we'd want to read.