5 min read

Running LLM models locally using Ollama

While my GitHub Student Pack was stuck in the middle of verification, I decided to try running Ollama models locally, since I have a dedicated GPU.

This was a fun rabbit hole and a bittersweet ride. Here’s my experience with the models as an absolute newbie.


What the hell is Ollama even about?

According to their page:

"Ollama is the easiest way to get up and running with large language models such as gpt-oss, Gemma 3, DeepSeek-R1, Qwen3 and more."

Simply put, it allows you to run open models on your local system.
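For example, once Ollama is installed, getting a model running locally is just a couple of commands. A minimal sketch (the llama3.2:3b tag is only an example; check the Ollama library for the model and size you actually want):

# download a model from the Ollama library
ollama pull llama3.2:3b

# start an interactive chat session with it
ollama run llama3.2:3b

# or send a one-off prompt straight from the shell
ollama run llama3.2:3b "Explain what a context window is in one paragraph."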

My machine configurations

You need some kind of GPU to run the models fast. They may work on a CPU, but the speed will be damn slow if you're using a bigger model (yes, I've tried it).

System configurations:

  • Processor: Intel i5-12500H
  • RAM: 16GB DDR4 @ 3200MHz
  • GPU: Nvidia RTX 3050 Mobile (4GB)
  • VS Code with the Continue.dev extension to test out different model modes (chat, agent, etc.)
  • Ollama (v0.12.9)

Models tested

KC = Knowledge Cutoff; CS = Context Size

| Model | Params | KC | CS | Notes |
|---|---|---|---|---|
| phi3:latest | 3.8b | early 2023 | 128K | Explanations are decent but not ideal for coding. |
| gemma3 | 4b | Sept 2021 | 128K (image input) | Good chat; a bit slow on my specs but usable. |
| llama3.2 | 3b | Dec 2023 | 128K | Better chat than gemma3; good explanations; weak agent; non-coding chats are solid. |
| deepseek-coder | 6.7b | (unknown) | 16K | Stronger agent mode than most; somewhat slow on my machine. |
| phi4-mini | 3.8b | Apr 2023 | 128K | Good for autocomplete and chat; agent mode manageable. |
| Qwen3 | 4b | Oct 2024 | 256K | Partly CPU-loaded (≈80:20 GPU:CPU); detailed but verbose; agent mode not great; occasional thinking loops. |
| Granite 3.1 MoE | 3b | Mar 2023 | 128K | Short, to-the-point suggestions; good tab completions. |
| Granite 3.3 | 2b | Apr 2023 | 128K | More detailed than 3.1 MoE; good suggestions; agent mode not good. |
| Qwen 2.5 | 3b | 2023 | 32K | Great explanations; good agent mode. |
| StarCoder2 | 3b | (no valid resp.) | 16K | Responses were irrelevant in my tests; needs retesting. |
| Qwen 2.5 Coder | 3b | Oct 2023 | 32K | Excellent for coding queries; likely my main LLM when Copilot expires. |
| Cogito | 3b | Oct 2023 | 128K | Good code improvement suggestions. |
| DeepSeek V3.1 (cloud) | 671b | Oct 2023 | 160K | Cloud model (~404 GB); great explanations; deep thinking; sometimes long delays (3–5 min). |
| Kimi K2 (cloud) | 1t | Oct 2023 | 256K | Excellent explanations; very detailed when using large token outputs; strong code rationale. |
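If you want to verify the parameter count and context size of a model yourself instead of trusting my table, Ollama can print that metadata. A quick sketch (qwen2.5-coder:3b is just an example tag):

# list every model you have downloaded locally
ollama list

# print a model's metadata: architecture, parameter count, context length, quantization, etc.
ollama show qwen2.5-coder:3b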

How to check model status?

You can check your currently running model's status with the following command:

ollama ps


Workload sharing between your CPU and GPU

If a model does not fit completely on the GPU, Ollama will offload part of its workload to the CPU. You can see these stats by running the command mentioned above.

Don't forget to check the load split between the CPU and GPU. That will tell you whether the model fits your hardware well or whether you need to switch to a larger or smaller one.
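For reference, the output of ollama ps looks roughly like the sketch below (the values are illustrative, not real measurements from my machine); the PROCESSOR column is the one that shows the CPU/GPU split:

NAME          ID              SIZE      PROCESSOR          UNTIL
qwen3:4b      a1b2c3d4e5f6    4.1 GB    21%/79% CPU/GPU    4 minutes from now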

The context (token) size also affects how memory is shared between the CPU and GPU, which is an interesting thing to note.
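If you want to see this effect yourself, you can change the context window from inside the interactive prompt and then re-check ollama ps. A rough sketch, reusing a tag from my table (num_ctx is the context length in tokens):

ollama run qwen3:4b
# inside the interactive session, raise the context window;
# the new size applies from the next prompt, and the model may be reloaded
/set parameter num_ctx 8192

# then, from another terminal, watch how the CPU/GPU split changes
ollama ps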


Additional Notes

Even if you don't have a dedicated GPU (dGPU), you can still configure Ollama to use your integrated GPU (iGPU). The speed won't be comparable, but it will get your work done. Try out different models and figure out which one gives you the best balance of speed and accuracy. You can choose from the models mentioned above, as well as their lighter-weight variants, to better suit your use case.

Some models are called "embedding models" and contain comparatively few parameters. These are mostly text-only models and are best suited to systems with limited resources, such as mobile devices or IoT systems like a Raspberry Pi or Arduino. A few examples: Granite (by IBM) and EmbeddingGemma (by Google DeepMind).
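As a rough sketch of how an embedding model is used, Ollama exposes an embeddings endpoint on its local API; the snippet below assumes the embeddinggemma tag from the Ollama library and the default port 11434:

# pull a small embedding model
ollama pull embeddinggemma

# ask the local Ollama server for an embedding vector of a piece of text
curl http://localhost:11434/api/embed -d '{
  "model": "embeddinggemma",
  "input": "Ollama makes it easy to run models locally."
}'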

Another workaround for running bigger models on resource-limited devices is to use Ollama Cloud and connect to heavier cloud models like kimi-k2-1t. Their free plan is quite good, with daily and weekly limits that cover general use, and you can opt for a paid plan anytime if you need more.
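A minimal sketch of what that looks like, assuming you have an Ollama account and that the model you want is published under a -cloud tag (the exact tag names may differ, so check the Ollama library):

# link your local Ollama install to your Ollama account
ollama signin

# run a model that executes on Ollama's cloud hardware instead of your local GPU
ollama run kimi-k2:1t-cloud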


Ending thoughts

Overall, it was quite a fun exploration of these models, and I've picked up some hands-on knowledge about the LLM world. There is still much more to learn, like hosting these models in the cloud and serving them directly to your users, customizing them to your requirements, and so on.

I would like to hear your thoughts about LLMs and your experiences with them. You can tag me on X (Twitter) and we'll start a conversation. Looking forward to it!

Thank you for reading and I’ll see you in the next one.

Happy Tinkering!