Quantisation and token generation speed on local consumer hardware.

I've been testing local LLMs with different types of quantisation to compare token generation speed and output quality. For reference, I'm using an NVIDIA RTX 3090 Founders Edition at stock core clocks and a Ryzen 5 5600G CPU with 64GB of DDR4-3200 RAM.
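
For anyone wanting to reproduce this kind of measurement: tokens per second is just the number of new tokens divided by wall-clock generation time. A minimal sketch with Hugging Face transformers on a CUDA GPU could look something like the following (the model id is a placeholder, not a specific checkpoint I tested):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "some-org/some-7b-gptq-model"  # placeholder, not a real checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = "Explain quantisation in one short paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up generation so one-off setup costs don't skew the timing.
model.generate(**inputs, max_new_tokens=16)

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=256)
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```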

For my specific hardware, GPTQ with 8-bit quantisation has been the best balance of speed and quality: roughly 30 tokens/s on 7B models and roughly 21 tokens/s on 13B models. GGUF models have been running well above expectations on colleagues' Apple Silicon hardware, reaching around 25 tokens/s on 30B models, whereas GGUF on my NVIDIA setup has only been getting around 10-15 tokens/s.
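
GGUF throughput on an NVIDIA card also depends heavily on how many layers are offloaded to the GPU, so the figure above will vary with settings. A similar measurement with llama-cpp-python (assuming a CUDA-enabled build; the model path is a placeholder) might look like:

```python
import time

from llama_cpp import Llama  # needs a CUDA-enabled build for GPU offload

llm = Llama(
    model_path="models/some-13b.Q8_0.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to the GPU; 0 keeps everything on the CPU
    n_ctx=4096,
)

start = time.perf_counter()
result = llm("Explain quantisation in one short paragraph.", max_tokens=256)
elapsed = time.perf_counter() - start

completion_tokens = result["usage"]["completion_tokens"]
print(f"{completion_tokens / elapsed:.1f} tokens/s")
```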

At some point I will do like-for-like comparisons with identical(ish) models at similar context lengths. That said, because quantisation methods differ and the details (which tools, libraries and scripts were used) aren't always published, this is challenging, if not impossible, unless I do all the quantisation myself.
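
If I do end up quantising everything myself, the GPTQ side could be done with something like AutoGPTQ. A rough sketch with placeholder names and a calibration set far smaller than you'd use in practice (the exact API differs between versions):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

BASE_MODEL = "some-org/some-7b-model"  # placeholder base checkpoint
OUT_DIR = "some-7b-gptq-8bit"          # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# 8-bit GPTQ settings; group_size=128 is a common default rather than a tuned choice.
quantize_config = BaseQuantizeConfig(bits=8, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(BASE_MODEL, quantize_config)

# GPTQ needs calibration data; real runs use many samples, one is just for the sketch.
examples = [tokenizer("Quantisation trades precision for memory and speed.")]
model.quantize(examples)

model.save_quantized(OUT_DIR)
tokenizer.save_pretrained(OUT_DIR)
```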

Key takeaways for now: high-end Mac hardware is surprisingly good, though as it's effectively limited to GGUF this may not last long, and GPTQ remains the better-performing option for NVIDIA hardware running on Linux.

