Estimating GPU memory for deploying the latest open-source LLMs

If you’re like me, you probably get excited about the latest and greatest open-source LLMs, from models like Llama 3 to the more compact Phi-3 Mini. But before you jump into deploying your language model, there’s one crucial factor you need to plan for: GPU memory. Misjudge this, and your shiny new web app might choke, run sluggishly, or rack up hefty cloud bills. To make things easier, I’ll explain what quantization is, and I’ve prepared a 2024 GPU Memory Planning Cheat Sheet: a handy summary of the latest open-source LLMs on the market and what you need to know before deploying them.

When deploying LLMs, guessing how much GPU memory you need is risky. Too little, and your model crashes. Too much, and you’re burning money for no reason.

Understanding these memory requirements upfront is like knowing how much luggage you can fit in your car before a road trip — it saves headaches and keeps things efficient.
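To make that concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes the common rule of thumb that the weights alone take roughly (parameters × bits per parameter ÷ 8) bytes, plus about 20% overhead for things like the KV cache and activations; the function name and the exact overhead factor are my own illustrative assumptions, not values taken from any library.

```python
def estimate_gpu_memory_gb(n_params_billion: float,
                           bits_per_param: int = 16,
                           overhead: float = 1.2) -> float:
    """Rough GPU memory estimate for serving an LLM.

    n_params_billion: model size in billions of parameters (e.g. 7 for a 7B model)
    bits_per_param:   precision the weights are loaded in (32, 16, 8, or 4)
    overhead:         assumed ~20% multiplier for KV cache, activations, etc.
    """
    weight_bytes = n_params_billion * 1e9 * bits_per_param / 8  # bytes for the weights alone
    return weight_bytes * overhead / 1e9  # convert to GB

# Example: a 7B model loaded in FP16 needs roughly 7 * 2 * 1.2 ≈ 16.8 GB
print(f"{estimate_gpu_memory_gb(7, bits_per_param=16):.1f} GB")
```

Under the same rule of thumb, loading that 7B model in 4-bit precision drops the estimate to roughly 4.2 GB, which is exactly why quantization matters and is where we turn next.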

Quantization: What’s It For?