Training improves the model, cloud inference serves users, and edge inference controls the physical world.
| Dimension | Training | Cloud Inference | Edge Inference |
|---|---|---|---|
| Objective | Model quality | Scale + cost | Real-time control |
| Mode | Offline | Online | Closed-loop |
| Batch | Large | Medium | 1 |
| Latency Requirements | None | Low | Hard real-time |
| Scaling | Scale-out | Scale-out | No scale-out |
| Bottleneck | Compute / HBM / comms | KV / scheduling / $ | Latency / jitter / power |
| Memory Preference | HBM | HBM + KV optimizations | SRAM-first |
| Cost Model | $ per training run | $ per token | $ per latency + Watt |
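The batch-size row above drives the latency row: larger batches amortize per-step overhead, which is why cloud inference tolerates medium batches while closed-loop edge control runs at batch 1. A toy model (illustrative numbers of my own, not measurements) makes the trade-off concrete:

```python
# Toy batching model: each step pays a fixed overhead plus per-sequence work.
# Numbers are illustrative, chosen only to show the shape of the trade-off.
def step_time_ms(batch, overhead_ms=5.0, per_seq_ms=0.5):
    return overhead_ms + per_seq_ms * batch

for batch in (1, 16, 256):
    t = step_time_ms(batch)
    throughput = batch / t * 1000  # sequences per second
    print(f"batch={batch:>3}  latency={t:6.1f} ms  throughput={throughput:8.1f} seq/s")
```

Latency grows with batch while throughput grows faster, so training and cloud serving push batch up; a real-time control loop cannot, and loses the amortization.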
Groq was acquired by Nvidia in December 2025 for 20 billion USD.
GPUs win throughput economics; Groq wins latency economics.
Groq's LPU (Language Processing Unit) keeps working data (KV cache, activations) on-chip in SRAM and pre-schedules all compute and dataflow to eliminate stalls, jitter, and DRAM round trips.
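The determinism claim can be sketched with a toy scheduler (my own simplified model, not Groq's actual compiler): if every op has a known cycle count and data never leaves SRAM, total runtime is a compile-time constant; a cache miss to DRAM reintroduces jitter.

```python
import random

# Toy schedule: (op, cycle count) pairs, all known ahead of time.
STATIC_SCHEDULE = [("load_sram", 1), ("matmul", 40), ("store_sram", 1)] * 3

def run_static(schedule):
    # Pre-scheduled, SRAM-only: runtime is just the sum of known cycle counts.
    return sum(cycles for _, cycles in schedule)

def run_dynamic(schedule, miss_rate=0.3, miss_penalty=100, rng=random):
    # With off-chip DRAM in the loop, a load may miss and stall unpredictably.
    total = 0
    for op, cycles in schedule:
        total += cycles
        if op.startswith("load") and rng.random() < miss_rate:
            total += miss_penalty  # the source of jitter
    return total

print(run_static(STATIC_SCHEDULE))                        # identical every run
print({run_dynamic(STATIC_SCHEDULE) for _ in range(5)})   # varies run to run
```

The static path is not just faster on average; its worst case equals its best case, which is what hard real-time control actually needs.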
| Property | SRAM | DRAM |
|---|---|---|
| Structure | Six-transistor (6T) latch | Charge in a capacitor (1T1C) |
| Latency | Very low (~1ns) | Higher (~50–100ns) |
| Bandwidth | Very high (on-chip) | Medium (DDR) to high (HBM) |
| Need refresh? | No | Yes |
| Density | Low | High |
| Power | Higher (static leakage) | Lower |
| Cost | Higher | Lower |
| Location | On-chip (cache) | Off-chip (DDR/LPDDR/HBM) |
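A back-of-envelope calculation using the latencies in the table shows why SRAM-first matters. The access count below is a hypothetical working-set figure for illustration only; the point is the ratio, not the absolute numbers:

```python
# Order-of-magnitude cost of off-chip round trips vs on-chip SRAM.
SRAM_NS = 1
DRAM_NS = 75  # midpoint of the ~50-100 ns range in the table

accesses = 1_000_000  # hypothetical working-set accesses per token
print(f"SRAM: {accesses * SRAM_NS / 1e6:.1f} ms per token")
print(f"DRAM: {accesses * DRAM_NS / 1e6:.1f} ms per token")
print(f"ratio: {DRAM_NS // SRAM_NS}x")
```

Real memory systems hide much of this with caching and prefetching, but the hiding is what introduces the jitter that closed-loop workloads cannot tolerate.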
```
Memory
└── Volatile
    ├── SRAM (on-chip cache)
    └── DRAM
        ├── DDR / LPDDR
        └── HBM
```
| Type of DRAM | Power | Bandwidth | Capacity | Cost | Use Case |
|---|---|---|---|---|---|
| DDR | Med | Med | High | Low | PCs / Servers |
| LPDDR | Low | Med | Med | Med | Mobile / Edge |
| HBM | Med | High | Med | High | AI / GPU / HPC |
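The bandwidth column sets a hard ceiling on batch-1 decode: each generated token must read every weight once, so tokens/s cannot exceed bandwidth divided by model size. A sketch with ballpark bandwidth figures (rough assumptions, not spec-sheet values) for a hypothetical 7B-parameter FP16 model:

```python
# Memory-bandwidth bound on batch-1 decode: tokens/s <= bandwidth / model bytes.
def max_tokens_per_s(model_bytes, bandwidth_bytes_per_s):
    return bandwidth_bytes_per_s / model_bytes

model_bytes = 7e9 * 2  # hypothetical 7B params in FP16 (2 bytes each)

# Bandwidths below are rough ballpark figures for illustration.
for name, gb_per_s in [("DDR5 (dual channel)", 90), ("LPDDR5X", 120), ("HBM3", 3000)]:
    print(f"{name:>20}: {max_tokens_per_s(model_bytes, gb_per_s * 1e9):7.1f} tok/s")
```

This is why the HBM row pairs with the AI/GPU use case: at batch 1, moving from DDR-class to HBM-class bandwidth raises the decode ceiling by more than an order of magnitude, and why keeping weights in SRAM removes the ceiling entirely.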
