updated: 2026-04-12
Notes on experimenting with llama.cpp on an 8GB Raspberry Pi 4.
Before building llama.cpp on Debian trixie, the system needs these
dependencies:
$ sudo apt-get install pciutils build-essential cmake \
libcurl4-openssl-dev libssl-dev ccache
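The build below uses the stock Debian trixie toolchain; confirming the compiler and CMake versions first is a quick sanity check (plain toolchain commands, nothing llama.cpp specific):
$ gcc --version
$ cmake --version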
Install llama.cpp following the instructions in the CPU Build section of build.md:
$ git clone https://github.com/ggml-org/llama.cpp.git
$ cd llama.cpp
$ cmake -B build
$ cmake --build build --config Release
An alternative approach I took was to build just the llama-cli target with the llama libraries linked statically, using ccache as a compiler cache:
$ cmake . -B build -DBUILD_SHARED_LIBS=OFF \
-DCMAKE_C_COMPILER_LAUNCHER=ccache \
-DCMAKE_CXX_COMPILER_LAUNCHER=ccache
$ cmake --build build --config Release -j$(nproc) --clean-first \
--target llama-cli
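Because ccache is set as the compiler launcher, a later --clean-first rebuild should be mostly cache hits; the cache statistics can be inspected with standard ccache usage:
$ ccache --show-stats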
This build took a total of 20 minutes and 59 seconds and produced a 9.4M llama-cli binary with the llama and ggml libraries linked in statically; as ldd shows, the system libraries are still linked dynamically:
$ du -h build/bin/llama-cli && ldd build/bin/llama-cli
9.4M build/bin/llama-cli
linux-vdso.so.1 (0x0000ffff89d4f000)
libssl.so.3 => /lib/aarch64-linux-gnu/libssl.so.3 (0x0000ffff89360000)
libcrypto.so.3 => /lib/aarch64-linux-gnu/libcrypto.so.3 (0x0000ffff88d50000)
libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000ffff88ce0000)
libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffff88a70000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff889c0000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff88980000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff887c0000)
/lib/ld-linux-aarch64.so.1 (0x0000ffff89d00000)
libz.so.1 => /lib/aarch64-linux-gnu/libz.so.1 (0x0000ffff88780000)
libzstd.so.1 => /lib/aarch64-linux-gnu/libzstd.so.1 (0x0000ffff886b0000)
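To double-check the architecture the binary was built for, file gives a quick summary (generic file(1) usage; it should report an ELF aarch64 executable):
$ file build/bin/llama-cli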
Copy llama-cli to a directory within PATH and it’s ready to be
invoked:
$ sudo cp build/bin/llama-cli /usr/local/bin
$ llama-cli --version
version: 8683 (d0a6dfeb2)
built with GNU 14.2.0 for Linux aarch64
llama-cli can now be used to download a model and run a prompt:
$ llama-cli -hf ggml-org/gemma-3-270m-it-GGUF \
--single-turn -p "Write a haiku about a computer"
[...]
Silent screen glows bright,
Code whispers in the code,
A digital mind.
This prompt took 6.7 seconds to complete, excluding the time taken to download the model.
A more complex prompt:
$ time llama-cli -hf ggml-org/gemma-3-270m-it-GGUF --single-turn \
-p "If a bat and a ball cost 1.10 in total, and the bat costs 1.00
more than the ball, how much does the ball cost?"
[...]
Let the cost of the bat be $B$ and the cost of the ball be $L$.
We are given that the bat costs 1.00 more than the ball, so $B = L + 1.00$.
We are also given that the total cost is 1.10, so $B + L = 1.10$.
We can substitute $B = L + 1.00$ into the equation $B + L = 1.10$:
$$(L + 1.00) + L = 1.10$$
$$2L + 1.00 = 1.10$$
$$2L = 1.10 - 1.00$$
$$2L = 0.10$$
$$L = \frac{0.10}{2} = 0.05$$
So the cost of the ball is 0.05.
Now we can find the cost of the bat: $B = L + 1.00 = 0.05 + 1.00 = 1.05$.
So the cost of the bat is 1.05.
The total cost is $1.05 + 0.05 = 1.10$.
Therefore, the ball costs 0.05.
Final Answer: The final answer is $\boxed{0.05}$
This prompt took 37.6 seconds to complete, with a prompt processing speed of 81.4 tokens per second and a text generation speed of 10 tokens per second. Running the same prompt several times in a row produced different, incorrect answers.
llama.cpp provides a fast, OpenAI API compatible HTTP server.
To build llama-server with SSL support and without the web UI:
$ cmake . -B build -DBUILD_SHARED_LIBS=OFF \
-DLLAMA_OPENSSL=ON \
-DLLAMA_BUILD_WEBUI=OFF \
-DCMAKE_C_COMPILER_LAUNCHER=ccache \
-DCMAKE_CXX_COMPILER_LAUNCHER=ccache
$ cmake --build build --config Release -j$(nproc) --clean-first \
--target llama-server
Create a llama system account and move llama-server into place:
$ sudo useradd llama --home-dir /opt/llama --create-home --system \
--shell /bin/false
$ sudo mkdir -p /opt/llama/bin
$ sudo cp build/bin/llama-server /opt/llama/bin/
$ sudo chown llama: /opt/llama/bin/llama-server
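llama-server downloads Hugging Face models into the llama user's cache directory; since the systemd unit below only grants write access to paths under /opt/llama, it does no harm to create the cache directory up front (a precaution, assuming the default ~/.cache location):
$ sudo mkdir -p /opt/llama/.cache
$ sudo chown llama: /opt/llama/.cache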
Create a systemd service unit: /etc/systemd/system/llama-server.service
[Unit]
Description=Llama.cpp Server
Documentation=https://github.com/ggerganov/llama.cpp
After=network.target
Wants=network-online.target
[Service]
User=llama
Group=llama
WorkingDirectory=/opt/llama
EnvironmentFile=/etc/default/llama-server
ExecStart=/opt/llama/bin/llama-server
Restart=always
RestartSec=5s
ProtectSystem=strict
ProtectHome=true
PrivateTmp=yes
NoNewPrivileges=yes
RestrictAddressFamilies=AF_INET AF_INET6
CapabilityBoundingSet=
SystemCallFilter=@system-service
ReadWritePaths=/opt/llama
ReadWritePaths=/opt/llama/.cache
[Install]
WantedBy=multi-user.target
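Before starting the service, the unit file can be linted with systemd's own checker (optional, generic systemd usage):
$ systemd-analyze verify /etc/systemd/system/llama-server.service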
Create the environment variable file which the systemd service will
source: /etc/default/llama-server
# recommend setting LLAMA_ARG_HOST to an IP address on the system
# running llama-server.
LLAMA_ARG_HOST=0.0.0.0
LLAMA_ARG_PORT=8080
# If you have a TLS key and crt set the SSL variables to the file paths
# of the key and cert.
LLAMA_ARG_SSL_KEY_FILE=/etc/ssl/llama/tls.key
LLAMA_ARG_SSL_CERT_FILE=/etc/ssl/llama/tls.crt
LLAMA_ARG_HF_REPO=ggml-org/gemma-3-270m-it-GGUF
LLAMA_ARG_CTX_SIZE=8192
LLAMA_ARG_N_PARALLEL=1
LLAMA_ARG_MMPROJ_AUTO=0
See the llama-server README.md for all options.
Reload systemd and start the service:
$ sudo systemctl daemon-reload
$ sudo systemctl start llama-server.service
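On first start llama-server downloads the model referenced by LLAMA_ARG_HF_REPO, which takes a while on the Pi. Enabling the service at boot and following the journal makes it easy to watch progress; once the model has loaded, the /health endpoint should respond (standard systemctl/journalctl usage; adjust the URL to match the environment file):
$ sudo systemctl enable llama-server.service
$ sudo journalctl -u llama-server.service -f
$ curl -k -s https://${LLAMA_ARG_HOST}:8080/health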
Send a prompt to llama-server:
$ curl https://${LLAMA_ARG_HOST}:8080/v1/messages -k -s \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "What is the capital of Argentina?"}
]
}' | jq -r
Use http:// and remove -k if LLAMA_ARG_SSL_KEY_FILE and
LLAMA_ARG_SSL_CERT_FILE are not set within the environment variable
file for llama-server.
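llama-server also serves the standard OpenAI chat completions route, so the same question can be asked via /v1/chat/completions and the answer extracted with jq (same host, port, and TLS assumptions as above):
$ curl https://${LLAMA_ARG_HOST}:8080/v1/chat/completions -k -s \
  -H "Content-Type: application/json" \
  -d '{
  "messages": [
  {"role": "user", "content": "What is the capital of Argentina?"}
  ]
  }' | jq -r '.choices[0].message.content'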
langchain supports llama.cpp via langchain-community:
$ uv init lctest1
$ cd lctest1
$ uv add langchain_community
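ChatLlamaCpp in langchain-community drives the model through the llama-cpp-python bindings; if importing it complains about a missing llama_cpp module, the bindings likely need to be added to the project too (my reading of the ChatLlamaCpp dependency, not something langchain-community pulls in automatically):
$ uv add llama-cpp-python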
Create test.py:
#!/usr/bin/env python3
from langchain_community.chat_models import ChatLlamaCpp
from langchain_core.messages import HumanMessage
# Path to the locally downloaded GGUF model file.
local_model = "gemma-3-270m-it-qat-Q4_0.gguf"
chat = ChatLlamaCpp(model_path=local_model, verbose=False)
message = [HumanMessage(content="Write a short poem about summer")]
# Stream the response and print tokens as they arrive.
for chunk in chat.stream(message):
    print(chunk.text, end="", flush=True)
Download the model referenced in local_model:
$ wget https://huggingface.co/ggml-org/gemma-3-270m-it-qat-GGUF/resolve/main/gemma-3-270m-it-qat-Q4_0.gguf
Then run test.py:
$ time uv run test.py
Sunbeams warm the soul,
A golden haze begins to bloom.
The air is thick with scent of heat,
Days long and shadows deep.
From fields of green, a vibrant hue,
Summer's magic way anew.
The waves crash on shore, a gentle sway,
As summer melts away in hazy ray.
Fresh flowers blooming, sweet perfume sighs,
In the sun-kissed world, days fly by.
The air is thick with warmth and light,
Summer's beauty, endless night.
From fields of green to skies so bright,
A golden hour, filled with delight.
real 0m15.319s
user 0m25.799s
sys 0m0.574s
Python bindings for llama.cpp are also provided by
llama-cpp-python.
langchain-openai works
with llama-server so you can access the llama-server endpoints over
the network:
#!/usr/bin/env python3
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
# Point the OpenAI-compatible client at llama-server.
chat = ChatOpenAI(
    base_url="https://${LLAMA_ARG_HOST}:${LLAMA_ARG_PORT}/v1",
    api_key="",
)
message = [HumanMessage(content="Write a short poem about summer")]
# Stream the response and print tokens as they arrive.
for chunk in chat.stream(message):
    print(chunk.text, end="", flush=True)
Swap out ${LLAMA_ARG_HOST} and ${LLAMA_ARG_PORT} with what was used
in the llama-server section.