Separate fast path from heavy path
Some requests should stay cheap and fast; others can invoke tools, retrieval, or larger multimodal flows.
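A minimal sketch of that split, assuming a simple rule-based router (all names and thresholds here are illustrative, not from any real Gemini implementation):

```python
# Hypothetical request router: cheap requests go straight to a small,
# fast model; anything that needs tools, retrieval, or multimodal
# handling escalates to the heavy orchestration path.

def route(request: dict) -> str:
    needs_heavy = bool(
        request.get("attachments")              # images/audio -> multimodal flow
        or request.get("tools_requested")       # explicit tool use
        or len(request.get("text", "")) > 2000  # long input -> retrieval/summarization
    )
    return "heavy_path" if needs_heavy else "fast_path"

print(route({"text": "What's 2+2?"}))                           # fast_path
print(route({"text": "Summarize", "attachments": ["a.png"]}))   # heavy_path
```

In practice the routing decision is often made by a small classifier model rather than hand-written rules, but the shape of the decision is the same.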
Free Gemini breakdown
Strong answers treat Gemini-like systems as orchestration problems across models, tools, memory, and latency, not as “one LLM behind an API.”
The pivot
Grounded assistant behavior depends on deciding when to call tools, not just on how capable the model is.
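One way to make that concrete is a gating check before any tool call. The heuristic below is purely illustrative (a real system would typically use a learned classifier or let the model emit a tool-call token):

```python
# Illustrative tool-gating check: call a search tool only when the
# query likely needs fresh or external facts; otherwise answer from
# the model's own knowledge and skip the tool's latency cost.
FRESHNESS_CUES = ("today", "latest", "current", "price", "news", "score")

def should_call_search(query: str) -> bool:
    q = query.lower()
    return any(cue in q for cue in FRESHNESS_CUES)

print(should_call_search("latest GPU pricing"))   # True
print(should_call_search("explain recursion"))    # False
```

The point is that the gate itself is a product decision: calling tools too eagerly burns latency, while calling them too rarely produces ungrounded answers.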
Latency is a product feature in assistants. Retrieval, memory, safety, and multimodal processing all compete for it.
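One way to model that competition is an explicit per-request budget that each stage draws from, skipping or downgrading stages that would overrun. This is a sketch under assumed stage costs, not a description of any real system:

```python
# Sketch of a per-request latency budget: retrieval, memory, safety,
# and generation all draw from one shared deadline; a stage runs only
# if its estimated cost still fits. Costs below are illustrative.
import time

class LatencyBudget:
    def __init__(self, total_ms: float):
        self.deadline = time.monotonic() + total_ms / 1000.0

    def remaining_ms(self) -> float:
        return max(0.0, (self.deadline - time.monotonic()) * 1000.0)

    def can_afford(self, cost_ms: float) -> bool:
        return self.remaining_ms() >= cost_ms

budget = LatencyBudget(total_ms=800)
stages = [("retrieval", 200), ("memory", 50), ("safety", 30), ("generation", 400)]
plan = [name for name, cost in stages if budget.can_afford(cost)]
print(plan)  # all four stages fit an 800 ms budget
```

Shrink the budget (or inflate a stage's cost) and later stages get dropped first, which is exactly the tradeoff the prose describes.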
Want the full version?
The full breakdown covers multimodal input, memory, tool use, orchestration, latency budgeting, and productized assistant tradeoffs.