Today we will try running an LLM locally on a MacBook Pro with Apple Silicon. To do this we will use Ollama, since it is one of the easiest ways to get started.
First, download the installer for your OS from ollama.com. This also installs the ollama CLI command, which you can use to run LLMs in the terminal:
# use 'ollama list' to see which models are already downloaded locally
ollama run llama3.2
Calling ollama run <model name and version> in the terminal opens a prompt where you can interact with the LLM by typing messages. In the example above we interact with Llama 3.2. If the model is not yet present on the system, it is downloaded first; in my case the download took around 2 GB.
Some useful commands are:
- ollama help: list and explain all supported commands.
- ollama run mistral: run the Mistral model (downloading it first if it is not already on the system).
- ollama list: list the models downloaded to your machine.
For a full list of supported models and further information about Ollama, refer to the Ollama GitHub page.
Calling the LLM from Python
However, what we really want is to interact with the LLM programmatically from Python. The way to do this with Ollama is to run a local server that exposes a REST API. I would have preferred calling the model without running a local server, but this is how it is done for now.
Starting the server locally:
# Not necessary if you ran the installer, since it will start the server in the background.
ollama serve
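With the server running, you can already talk to it over plain HTTP. As a minimal sketch, the snippet below calls the local REST API directly with the third-party requests package; it assumes Ollama's default port 11434 and the /api/chat endpoint, and that llama3.2 has already been downloaded:
# Minimal sketch: call the local Ollama REST API directly over HTTP.
# Assumes the default port 11434 and the /api/chat endpoint, and that
# llama3.2 has already been downloaded.
import requests
payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": False,  # ask for a single JSON response instead of a stream
}
resp = requests.post("http://localhost:11434/api/chat", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["message"]["content"])
In practice you will usually let the Python library below build these requests for you.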
Install the Python library that wraps this API:
pip install ollama
Now you can run this Python code to interact with the LLM:
import ollama
prompt = "Tell me a joke"
# send a single user message to the locally served Llama 3.2 model
response = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
# the reply text is in the "message" field of the response
print(response["message"]["content"])
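If you would rather print the answer as it is generated instead of waiting for the full reply, the ollama package also supports streaming. The sketch below assumes the stream=True parameter, which returns an iterator of partial message chunks:
import ollama
# Sketch: stream the reply piece by piece (assumes stream=True is supported
# by your installed version of the ollama package).
stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
print()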