In depth
Training creates the model; inference uses it. When you call a provider's API or run Ollama on your laptop, you are running inference. Latency, throughput, and per-token cost are all determined at the inference layer. Custom silicon (Groq's LPU, Cerebras's wafer-scale engine) targets inference speed, which is one reason those providers tend to post some of the lowest latency numbers.
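To make latency, throughput, and cost concrete, here is a minimal sketch that measures all three over a stream of tokens. Everything in it is an assumption for illustration: `measure_stream`, `InferenceMetrics`, and `fake_stream` are hypothetical names, the price is a made-up number, and the "model" is a generator that sleeps instead of a real provider or Ollama call. It also counts output tokens only, whereas real providers usually bill input tokens too.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class InferenceMetrics:
    time_to_first_token_s: float  # latency until the first token arrives
    total_time_s: float           # wall-clock time for the whole response
    output_tokens: int            # number of tokens received
    tokens_per_second: float      # throughput over the whole response
    cost_usd: float               # output-token cost at the assumed price


def measure_stream(
    tokens: Iterable[str],
    usd_per_million_output_tokens: float,
    clock: Callable[[], float] = time.perf_counter,
) -> InferenceMetrics:
    """Consume a token stream and record latency, throughput, and cost."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = clock()  # first token has just arrived
        count += 1
    end = clock()

    total = end - start
    # With an empty stream there is no first token; fall back to total time.
    ttft = (first_token_at - start) if first_token_at is not None else total
    tps = count / total if total > 0 else 0.0
    cost = count * usd_per_million_output_tokens / 1_000_000
    return InferenceMetrics(ttft, total, count, tps, cost)


def fake_stream(n_tokens: int, first_delay_s: float, per_token_delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model: waits, then yields tokens one by one."""
    time.sleep(first_delay_s)  # simulated prompt processing before the first token
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_delay_s)  # simulated per-token generation time
        yield f"tok{i}"


if __name__ == "__main__":
    # 20 tokens, 50 ms to first token, 10 ms per token after that,
    # priced at an assumed 10 USD per million output tokens.
    metrics = measure_stream(fake_stream(20, 0.05, 0.01), 10.0)
    print(metrics)
```

Swapping `fake_stream` for a real streaming response would give real numbers. Time to first token and tokens per second are kept separate because they can move independently: a provider can start answering quickly and still generate slowly, or the reverse.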