I recently downloaded ollama on my Linux machine and even with 3060 12gb gpu and...

M4v3R · on Dec 19, 2023

You need to pick the correct model size and quantization for the amount of GPU RAM you have. For any given model, don’t download the default file; instead, go to the Tags section on Ollama’s page and pick a quantization whose size in GB is at most 2/3 of your available GPU RAM, and it should work. For example, in your case Mistral-7B q4_0 and even q8_0 should work perfectly.
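The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function name, the `fraction` parameter, and the file sizes are assumptions (the sizes are rough figures for Mistral-7B, so check the Tags page for the real ones).

```python
def pick_quantization(sizes_gb, vram_gb, fraction=2 / 3):
    """Return the largest quantization whose file fits in fraction * vram_gb, or None."""
    budget = vram_gb * fraction
    # Keep only the quantizations that fit inside the budget.
    fitting = {tag: size for tag, size in sizes_gb.items() if size <= budget}
    if not fitting:
        return None
    # Among those that fit, prefer the largest file (least aggressive quantization).
    return max(fitting, key=fitting.get)


# Approximate, illustrative file sizes in GB (assumed, not authoritative).
mistral_7b_sizes = {"q4_0": 4.1, "q8_0": 7.7, "fp16": 14.5}

print(pick_quantization(mistral_7b_sizes, vram_gb=12))
```

With a 12 GB card the budget is 8 GB, so both q4_0 and q8_0 fit and the sketch prefers q8_0.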

swyx · on Dec 19, 2023

what's the intuition for 2/3 of RAM?

M4v3R · on Dec 20, 2023

Because there’s always some overhead during inference, plus you don’t want to fill all your available RAM, because you risk swapping to disk, which will slow everything to a crawl.

swyx · on Dec 20, 2023

so why is the overhead a 1/3 ratio instead of a constant amount? just testing the scaling assumption

avereveard · on Dec 19, 2023

you need some leftover for holding the context
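To put a rough number on the memory needed for context, here is a sketch of the usual key/value cache estimate. It assumes a 16-bit cache and a Mistral-7B shape of 32 layers, 8 KV heads, and head dimension 128; the function name is made up for illustration, and real runtimes add further buffers on top.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Estimate KV cache size: keys and values, per layer, per KV head, per token."""
    # The leading 2 accounts for storing both a key and a value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Assumed Mistral-7B shape: 32 layers, 8 KV heads (grouped-query attention), head dim 128.
for ctx in (4096, 32768):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"{ctx} tokens: {gib:.2f} GiB")
```

Under these assumptions the cache grows linearly with context length, so the leftover you need depends on how much context you use rather than being a fixed share of the model size.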

ilaksh · on Dec 19, 2023

Try https://github.com/ggerganov/llama.cpp

It builds very quickly with make. If inference is slow when you try it, make sure to enable the CUDA-related build flags and then rebuild.

A key parameter is the one that tells it how many layers to offload to the GPU: -ngl (--n-gpu-layers), I think.

Also, download a 4-bit GGUF from HuggingFace and try that. It uses much less memory.
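As a rough way to guess a starting value for the layer-offload parameter, the sketch below divides the model file evenly across its layers and sees how many fit in VRAM. Everything here is an assumption for illustration: the function name, the equal-sized-layers simplification, and the `reserved_gb` headroom for context and scratch buffers. Treat the result as a first guess to tune, not as what llama.cpp computes.

```python
import math


def layers_to_offload(model_file_gb, n_layers, vram_gb, reserved_gb=1.5):
    """Estimate how many layers fit in VRAM, assuming all layers are the same size."""
    per_layer_gb = model_file_gb / n_layers
    # Leave some headroom (a guess) for the context cache and scratch buffers.
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Illustrative numbers: a ~4.1 GB 4-bit 7B file with 32 layers.
print(layers_to_offload(4.1, 32, vram_gb=12))
print(layers_to_offload(4.1, 32, vram_gb=4))
```

The number returned would then be passed as the value of the layer-offload flag, and lowered if the run fails with an out-of-memory error.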

avereveard · on Dec 19, 2023

with llama.cpp and a 12gb 3060 they can fit an entire mistral model at Q5_K_M in RAM with the full 32k context. I recommend openhermes-2.5-mistral-7b-16k with USER: ASSISTANT: instructions, it's working surprisingly well for content production (let's say everything except logic and math, but that's not the strong suit of 7b models in general)

mgreg · on Dec 19, 2023

Some details that might interest you from SemiAnalysis [1], published just yesterday. There's quite a bit that goes into optimizing inference, with lots of dials to turn. One thing that does seem to have a large impact is batch size, which is a benefit of scale.

1. https://www.semianalysis.com/p/inference-race-to-the-bottom-...

TheMatten · on Dec 19, 2023

I can reasonably run (quantized) Mistral-7B on a 16GB machine without GPU, using ollama. Are you sure it isn't a configuration error or bug?

ilaksh · on Dec 19, 2023

How many tokens per second and what are the specs of the machine? My attempts at CPU only have been really slow.

berkut · on Dec 19, 2023

In my experience, llama.cpp on the CPU (on Linux) is very slow compared to a GPU or NPU running the same models, such as my M1 MacBook Pro using Metal (or maybe it's the shared memory allowing the speedup?).

Even with 12 threads of my 5900X (I've tried using the full 24 SMT threads, and that doesn't really seem to help) with the dolphin-2.5-mixtral-8x7b.Q5_K_M model, my MacBook Pro is around 5-6x faster in terms of tokens per second...

ilaksh · on Dec 19, 2023

I think that Metal or something is actually a built-in graphics/matrix accelerator that those Macs have now. It's not really using the CPU, although it seems like Apple may be trying to market it a little bit as though it's just a powerful CPU. It's more like an accelerator integrated with the CPU.

But whatever it is, it's great, and I hope that Intel and AMD will catch up.

AMD has had APUs for a while, but I think they aren't at the same level at all as the new Mac acceleration.

stavros · on Dec 20, 2023

There must be something wrong, my 3060 does double the tokens per second of my M2 Mac (with Metal).

TheMatten · on Dec 19, 2023

Seems to be around 3 tokens/s on my laptop, which is faster than the average human, but not too fast of course. On a desktop with a mid-range GPU used for offloading, I can get around 12 tokens/s, which is plenty fast for chatting.
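For anyone who wants to compare numbers like these, Ollama's generate API response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating), from which tokens per second can be derived. A minimal sketch, with a made-up helper name and a hypothetical response dict rather than a real measurement:

```python
def tokens_per_second(response):
    """Compute generation speed from a response dict with Ollama-style timing fields."""
    seconds = response["eval_duration"] / 1e9  # eval_duration is in nanoseconds
    return response["eval_count"] / seconds


# Hypothetical values for illustration only, not a real benchmark.
sample = {"eval_count": 120, "eval_duration": 10_000_000_000}
print(tokens_per_second(sample))
```

This covers only the generation phase; prompt processing time is reported separately, so very long prompts can feel slower than this figure suggests.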

ignoramous · on Dec 19, 2023

> optimisation is done to make this all work

Obviously still a nascent area, but https://lmsys.org/blog does a good job of diving into the engineering challenges behind running these LLMs.

(I'm sure there are others)

idonotknowwhy · on Dec 20, 2023

You can run a 7b Q4 model in your 12gb of vram, no problem.