home > notes

llama.cpp on a raspberry pi 4

updated: 2026-04-12

Notes on experimenting with llama.cpp on an 8GB Raspberry Pi 4.


llama-cli

Before building llama.cpp on Debian trixie, the system needs these dependencies:

$ sudo apt-get install pciutils build-essential cmake \
    libcurl4-openssl-dev libssl-dev ccache

Build llama.cpp following the instructions in the CPU Build section of build.md:

$ git clone https://github.com/ggml-org/llama.cpp.git
$ cd llama.cpp
$ cmake -B build
$ cmake --build build --config Release

An alternative approach I took was to build only llama-cli, with the llama.cpp libraries linked statically and ccache as a compiler cache:

$ cmake . -B build -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
$ cmake --build build --config Release -j$(nproc) --clean-first \
    --target llama-cli

This build took a total of 20 minutes and 59 seconds and produced a 9.4M llama-cli binary with the llama.cpp libraries linked in statically. As the ldd output shows, the system libraries are still linked dynamically:

$ du -h build/bin/llama-cli && ldd build/bin/llama-cli
9.4M    build/bin/llama-cli
        linux-vdso.so.1 (0x0000ffff89d4f000)
        libssl.so.3 => /lib/aarch64-linux-gnu/libssl.so.3 (0x0000ffff89360000)
        libcrypto.so.3 => /lib/aarch64-linux-gnu/libcrypto.so.3 (0x0000ffff88d50000)
        libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000ffff88ce0000)
        libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffff88a70000)
        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff889c0000)
        libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff88980000)
        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff887c0000)
        /lib/ld-linux-aarch64.so.1 (0x0000ffff89d00000)
        libz.so.1 => /lib/aarch64-linux-gnu/libz.so.1 (0x0000ffff88780000)
        libzstd.so.1 => /lib/aarch64-linux-gnu/libzstd.so.1 (0x0000ffff886b0000)

Copy llama-cli to a directory within PATH and it’s ready to be invoked:

$ sudo cp build/bin/llama-cli /usr/local/bin
$ llama-cli --version
version: 8683 (d0a6dfeb2)
built with GNU 14.2.0 for Linux aarch64

llama-cli can now be used to download a model and run a prompt:

$ llama-cli -hf ggml-org/gemma-3-270m-it-GGUF \
    --single-turn -p "Write a haiku about a computer"
[...]
Silent screen glows bright,
Code whispers in the code,
A digital mind.

This prompt took 6.7 seconds to complete, excluding the time taken to download the model.

A more complex prompt:

$ time llama-cli -hf ggml-org/gemma-3-270m-it-GGUF --single-turn \
    -p "If a bat and a ball cost 1.10 in total, and the bat costs 1.00
        more than the ball, how much does the ball cost?" 
[...]
Let the cost of the bat be $B$ and the cost of the ball be $L$.
We are given that the bat costs 1.00 more than the ball, so $B = L + 1.00$.
We are also given that the total cost is 1.10, so $B + L = 1.10$.
We can substitute $B = L + 1.00$ into the equation $B + L = 1.10$:
$$(L + 1.00) + L = 1.10$$
$$2L + 1.00 = 1.10$$
$$2L = 1.10 - 1.00$$
$$2L = 0.10$$
$$L = \frac{0.10}{2} = 0.05$$
So the cost of the ball is 0.05.
Now we can find the cost of the bat: $B = L + 1.00 = 0.05 + 1.00 = 1.05$.
So the cost of the bat is 1.05.
The total cost is $1.05 + 0.05 = 1.10$.
Therefore, the ball costs 0.05.

Final Answer: The final answer is $\boxed{0.05}$

This prompt took 37.6 seconds to complete. The prompt processing speed was 81.4 tokens per second and the text generation speed was 10 tokens per second. Running this same prompt several times in a row produced a different, incorrect answer each time.
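A quick arithmetic sketch shows how these two rates combine into total time. The token counts below are illustrative assumptions, not values reported by llama-cli:

```python
# Rough timing model: prompt processing and text generation each run
# at their own measured rate; total time is the sum of the two phases.
pp_speed = 81.4    # prompt processing speed, tokens/s (measured above)
tg_speed = 10.0    # text generation speed, tokens/s (measured above)

n_prompt = 40      # assumed prompt token count (illustrative)
n_generated = 330  # assumed generated token count (illustrative)

total = n_prompt / pp_speed + n_generated / tg_speed
print(f"estimated time: {total:.1f}s")  # prints "estimated time: 33.5s"
```

At these rates nearly all of the wall-clock time goes to text generation; prompt processing is almost free by comparison.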


llama-server

llama.cpp provides a fast, OpenAI API-compatible HTTP server.

To build llama-server with SSL support and without the web UI:

$ cmake . -B build -DBUILD_SHARED_LIBS=OFF \
    -DLLAMA_OPENSSL=ON \
    -DLLAMA_BUILD_WEBUI=OFF \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache 
$ cmake --build build --config Release -j$(nproc) --clean-first \
    --target llama-server

Create a llama system account and move llama-server into place:

$ sudo useradd llama --home-dir /opt/llama --create-home --system \
    --shell /bin/false
$ sudo mkdir -p /opt/llama/bin
$ sudo cp build/bin/llama-server /opt/llama/bin/
$ sudo chown llama: /opt/llama/bin/llama-server

Create a systemd service unit: /etc/systemd/system/llama-server.service

[Unit]
Description=Llama.cpp Server
Documentation=https://github.com/ggerganov/llama.cpp
After=network-online.target
Wants=network-online.target

[Service]
User=llama
Group=llama
WorkingDirectory=/opt/llama
EnvironmentFile=/etc/default/llama-server
ExecStart=/opt/llama/bin/llama-server
Restart=always
RestartSec=5s
ProtectSystem=strict
ProtectHome=true
PrivateTmp=yes
NoNewPrivileges=yes
RestrictAddressFamilies=AF_INET AF_INET6
CapabilityBoundingSet=
SystemCallFilter=@system-service
ReadWritePaths=/opt/llama
ReadWritePaths=/opt/llama/.cache

[Install]
WantedBy=multi-user.target

Create the environment variable file which the systemd service will read: /etc/default/llama-server

# Recommended: set LLAMA_ARG_HOST to a specific IP address of the
# system running llama-server. 0.0.0.0 listens on all interfaces.
LLAMA_ARG_HOST=0.0.0.0
LLAMA_ARG_PORT=8080

# If you have a TLS key and certificate, set the SSL variables to
# their file paths.
LLAMA_ARG_SSL_KEY_FILE=/etc/ssl/llama/tls.key
LLAMA_ARG_SSL_CERT_FILE=/etc/ssl/llama/tls.crt

LLAMA_ARG_HF_REPO=ggml-org/gemma-3-270m-it-GGUF
LLAMA_ARG_CTX_SIZE=8192
LLAMA_ARG_N_PARALLEL=1
LLAMA_ARG_MMPROJ_AUTO=0

See the llama-server README.md for all options.

Reload systemd and start the service:

$ sudo systemctl daemon-reload
$ sudo systemctl start llama-server.service

Send a prompt to llama-server:

$ curl https://${LLAMA_ARG_HOST}:8080/v1/chat/completions -k -s \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is the capital of Argentina?"}
    ]
  }' | jq .

Use http:// and remove -k if LLAMA_ARG_SSL_KEY_FILE and LLAMA_ARG_SSL_CERT_FILE are not set within the environment variable file for llama-server.
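The server can also be called from Python with no extra dependencies. A minimal standard-library sketch against the OpenAI-compatible chat completions endpoint; the host address below is a placeholder, and certificate verification is disabled to match curl's -k:

```python
import json
import ssl
import urllib.request

def build_chat_request(host, port, prompt, scheme="https"):
    """Build the URL and JSON body for llama-server's
    OpenAI-compatible chat completions endpoint."""
    url = f"{scheme}://{host}:{port}/v1/chat/completions"
    body = {"messages": [{"role": "user", "content": prompt}]}
    return url, json.dumps(body).encode()

def send_chat(url, body):
    """POST the request, skipping certificate verification to
    match curl's -k (needed for a self-signed certificate)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, context=ctx) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]

# 192.0.2.10 is a placeholder -- substitute the host and port set
# in /etc/default/llama-server.
url, body = build_chat_request("192.0.2.10", 8080,
                               "What is the capital of Argentina?")
# print(send_chat(url, body))
```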


python clients

langchain supports llama.cpp through the langchain-community package:

$ uv init lctest1
$ cd lctest1
$ uv add langchain-community llama-cpp-python

Create test.py:

#!/usr/bin/env python3

from langchain_community.chat_models import ChatLlamaCpp
from langchain_core.messages import HumanMessage

local_model = "gemma-3-270m-it-qat-Q4_0.gguf"
chat = ChatLlamaCpp(model_path=local_model, verbose=False)
message = [HumanMessage(content="Write a short poem about summer")]
for chunk in chat.stream(message):
    print(chunk.text, end="", flush=True)

Download the model referenced in local_model:

$ wget https://huggingface.co/ggml-org/gemma-3-270m-it-qat-GGUF/resolve/main/gemma-3-270m-it-qat-Q4_0.gguf

Then run test.py:

$ time uv run test.py
Sunbeams warm the soul,
A golden haze begins to bloom.
The air is thick with scent of heat,
Days long and shadows deep.
From fields of green, a vibrant hue,
Summer's magic way anew.

The waves crash on shore, a gentle sway,
As summer melts away in hazy ray.
Fresh flowers blooming, sweet perfume sighs,
In the sun-kissed world, days fly by.

The air is thick with warmth and light,
Summer's beauty, endless night.
From fields of green to skies so bright,
A golden hour, filled with delight.
real    0m15.319s
user    0m25.799s
sys     0m0.574s
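One detail worth noting in the timings: user time exceeds real time because llama.cpp spreads inference across multiple cores, so CPU seconds accumulate faster than wall-clock seconds. The ratio gives the average number of busy cores:

```python
# Average core utilization implied by the time(1) output above.
real = 15.319   # wall-clock seconds
user = 25.799   # CPU seconds summed over all cores
print(f"average busy cores: {user / real:.2f}")  # prints "average busy cores: 1.68"
```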

Python bindings for llama.cpp are also provided by llama-cpp-python.

langchain-openai works with llama-server so you can access the llama-server endpoints over the network:

#!/usr/bin/env python3

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

chat = ChatOpenAI(
    base_url="https://${LLAMA_ARG_HOST}:${LLAMA_ARG_PORT}/v1",
    api_key="",
)
message = [HumanMessage(content="Write a short poem about summer")]
for chunk in chat.stream(message):
    print(chunk.text, end="", flush=True)

Swap out ${LLAMA_ARG_HOST} and ${LLAMA_ARG_PORT} with the values used in the llama-server section.