Models & inference

Inference

The act of running a trained model to produce output. The thing you pay for per token.

also known as: LLM inference

In depth

Training creates the model; inference uses it. When you call a provider's API or run Ollama on your laptop, that's inference. Latency, throughput, and cost are all determined at the inference layer. Custom silicon (Groq's LPU, Cerebras's wafer-scale chip) is built to speed up inference specifically, which is why those providers tend to post some of the lowest latency numbers.
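To make latency, throughput, and cost concrete, here is a minimal sketch of how they might be computed for a single request. Everything in it is an assumption for illustration: the InferenceCall fields, the helper names, the token counts, the timings, and the prices are hypothetical and do not come from any specific provider. It also assumes input and output tokens are billed at separate per-million-token rates, which is a common pricing scheme but not a universal one.

```python
from dataclasses import dataclass


@dataclass
class InferenceCall:
    """Measurements from one inference request (all values hypothetical)."""
    prompt_tokens: int             # tokens sent to the model
    completion_tokens: int         # tokens the model generated
    time_to_first_token_s: float   # latency until the first output token
    total_time_s: float            # wall-clock time for the whole response


def cost_usd(call, input_price_per_mtok, output_price_per_mtok):
    """Per-token billing, assuming separate input and output prices
    quoted in USD per million tokens."""
    return (call.prompt_tokens * input_price_per_mtok
            + call.completion_tokens * output_price_per_mtok) / 1_000_000


def output_tokens_per_second(call):
    """Throughput of the generation phase, excluding time to first token."""
    generation_time = call.total_time_s - call.time_to_first_token_s
    return call.completion_tokens / generation_time


# Hypothetical request: 1,200 prompt tokens in, 300 tokens out.
call = InferenceCall(prompt_tokens=1_200, completion_tokens=300,
                     time_to_first_token_s=0.5, total_time_s=3.5)

# Hypothetical prices: $1 per million input tokens, $4 per million output.
print(f"cost: ${cost_usd(call, 1.0, 4.0):.4f}")
print(f"throughput: {output_tokens_per_second(call):.1f} tokens/s")
```

Time to first token and tokens per second are kept separate because they describe different things: the first is how long a user waits before anything appears, the second is how fast the rest of the response arrives.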

Related concepts

LLM: A neural network trained on text that takes a prompt and returns text, optionally including structured tool calls.

Self-hosting: Running your agent stack on infrastructure you control, with your own model provider keys.

More in Models & inference

Context window (/glossary/context-window)
Frontier model (/glossary/frontier-model)
LLM (/glossary/llm)
Open-weight model (/glossary/open-weight-model)
Streaming (/glossary/streaming)
Temperature (/glossary/temperature)