Must-Know for Running LLMs Locally! Ollama Complete Guide: From Model Fundamentals to Advanced Environment Variable Tuning

In recent years, AI technology has spread rapidly. Running AI models is no longer limited to the cloud, and local execution has gained significant attention, because cloud-based AI often raises concerns about privacy, security, cost, and network latency.

The emergence of Ollama provides an accessible option for running AI models locally. Ollama is an open-source tool that can be downloaded and run on macOS, Linux, and Windows. It can also be executed via Docker.

This article explains the internal structure of models, common Ollama commands, and environment variable configurations.

1. Model Concept

A model is composed of several components, the largest of which is the weight file. This file describes a network of nodes connected to one another through numerical values called weights and biases. Together, these weights and biases are referred to as parameters.

As a simplified mental picture, you can think of a node as representing a concept, such as a word or a phrase. During training, the parameters are adjusted so that related concepts are connected more strongly and unrelated concepts more weakly, sometimes bringing them closer together and sometimes pushing them further apart.

The relationship between two concepts is not captured by a single weight. It is spread across many weights, and which of them matter depends on the context and on the role each node plays.

This is how vast amounts of world knowledge are compressed into a relatively small file.

The size of this file depends on how the parameters are represented. During training, each parameter is commonly stored as a 16-bit or 32-bit floating-point number. These values are highly precise but take up a lot of space. If they are grouped and compressed, the file can be reduced significantly while still preserving a high level of accuracy.
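To make the relationship between representation and file size concrete, here is a rough back-of-the-envelope calculation. The parameter count of 3.8 billion is an illustrative assumption, and the function name is made up for this sketch. Real weight files also contain metadata and may mix precisions, so actual sizes will differ somewhat.

```python
def estimate_weight_size_gb(num_parameters: int, bits_per_parameter: float) -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 10**9 bytes)."""
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 1e9


# Illustrative assumption: a model with 3.8 billion parameters.
NUM_PARAMETERS = 3_800_000_000

for bits in (32, 16, 4):
    size_gb = estimate_weight_size_gb(NUM_PARAMETERS, bits)
    print(f"{bits:>2} bits per parameter: ~{size_gb:.1f} GB")
```

Going from 16 bits to 4 bits per parameter cuts the estimate to a quarter, which is often the difference between a model that fits in local memory and one that does not.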

This compression is called quantization. A common choice for models that run locally is 4 bits per parameter, known as 4-bit quantization.
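The idea of grouping and compressing values can be sketched in a few lines. The example below is a deliberately simplified symmetric scheme, not the exact format used by GGUF files: each group of weights shares one scale factor, and every weight is stored as a signed 4-bit integer between -8 and 7. The function names and sample weights are made up for illustration.

```python
def quantize_group(values: list[float]) -> tuple[float, list[int]]:
    """Map one group of floats to 4-bit integer codes in [-8, 7] plus a shared scale."""
    max_abs = max(abs(v) for v in values)
    # The largest magnitude in the group maps to code 7 (or -7).
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes


def dequantize_group(scale: float, codes: list[int]) -> list[float]:
    """Recover approximate floats from the codes and the shared scale."""
    return [scale * c for c in codes]


# Hypothetical weights for one group.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]
scale, codes = quantize_group(weights)
restored = dequantize_group(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:", codes)
print(f"scale: {scale:.4f}, largest rounding error: {max_error:.4f}")
```

Each weight now needs only 4 bits plus a small share of the group's scale factor, and the rounding error stays within half a scale step.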

2. Ollama CLI (Commands) Introduction

ollama -h (or ollama --help): Displays a list of all available Ollama commands.
ollama create: Creates a new model in Ollama. A model in Ollama consists of multiple components, including a large GGUF weight file, a template, a system prompt, and more.
ollama show phi3: Displays the high-level configuration of a model.
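As a hedged illustration of what ollama create works from, the sketch below generates the text of a minimal Modelfile. FROM, PARAMETER, and SYSTEM are Modelfile instructions; the helper name, the system prompt, the temperature value, and the model name my-phi3 are assumptions made up for this example.

```python
def render_modelfile(base_model: str, system_prompt: str, parameters: dict[str, object]) -> str:
    """Build the text of a minimal Modelfile from a base model, parameters, and a system prompt."""
    lines = [f"FROM {base_model}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    base_model="phi3",
    system_prompt="You are a concise technical assistant.",  # illustrative prompt
    parameters={"temperature": 0.2},  # illustrative value
)
print(modelfile_text)

# To build the model, save the text to a file named Modelfile and run:
#   ollama create my-phi3 -f Modelfile
```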

ollama run phi3 --format json: Instructs the model to output its response as JSON.
--keepalive: A flag for ollama run that sets how long the model stays loaded after the request (for example, 5m) before it is unloaded.
--verbose: A flag for ollama run that prints timing statistics for each response.
ollama ls: Lists all models installed on your local machine.
ollama cp: Copies a model and gives it a new name. This does not copy the entire model file; instead, it only creates a reference. As a result, the copied model may only take around 500 bytes, while the original model may be 20GB.
ollama rm: Removes a specified model. However, if you delete a copied model, it will not recover much disk space because the original model still exists.
ollama pull phi3: Downloads the Phi-3 model from the Ollama model registry. Depending on model size and network speed, this process may take from several minutes to tens of minutes.
ollama run phi3: Starts an interactive chat session with the model. You can directly interact with it by entering prompts, and it will respond in real time. If the model is not already downloaded, Ollama will automatically download it before starting the session.
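If you want to drive these commands from a script, one possible approach is to assemble the argument list in Python and hand it to subprocess. This is a minimal sketch: the helper names are made up, and the flags are the ones described above. The code only runs the command if the ollama binary is found on the PATH.

```python
import shlex
import shutil
import subprocess
from typing import Optional


def build_run_command(
    model: str,
    prompt: Optional[str] = None,
    output_format: Optional[str] = None,
    keepalive: Optional[str] = None,
    verbose: bool = False,
) -> list[str]:
    """Assemble the argument list for `ollama run` with optional flags."""
    command = ["ollama", "run", model]
    if output_format:
        command += ["--format", output_format]
    if keepalive:
        command += ["--keepalive", keepalive]
    if verbose:
        command.append("--verbose")
    if prompt is not None:
        command.append(prompt)  # without a prompt, the CLI opens an interactive session
    return command


def run_if_installed(command: list[str]) -> Optional[str]:
    """Run the command and return its output, or None when ollama is not installed."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


command = build_run_command("phi3", output_format="json", verbose=True)
print(shlex.join(command))
```

Calling run_if_installed(build_run_command("phi3", prompt="List three colors as JSON.", output_format="json")) would send a single prompt and return the model's reply as text.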

3. Ollama Environment Variables

In most local deployments, configuring environment variables is not required. However, in certain scenarios, adjusting them can improve flexibility and performance.

For example:

When the client and server are deployed on different machines.
When running multiple models simultaneously.
When handling a large number of concurrent requests.
When separating frontend and backend services.
When improving parallel processing capabilities.

In these cases, Ollama environment variables can be configured as follows:

OLLAMA_HOST

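Used when the client and the Ollama server run on different machines. On the server, it sets the address and port that Ollama listens on. On the client, it sets which server to connect to. By default, Ollama listens on 127.0.0.1:11434, which is reachable only from the local machine. Setting the address to 0.0.0.0 makes the server listen on all network interfaces.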

Example:

OLLAMA_HOST=0.0.0.0:11434
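On the client side, a script can read the same variable to decide which server to talk to. The sketch below is a simplified approximation of that lookup, not Ollama's own parsing logic: it assumes the value already includes a port, and the address 192.168.1.20 is a hypothetical server. The /api/tags endpoint lists the models available on the server.

```python
import json
import os
import urllib.request
from collections.abc import Mapping

DEFAULT_HOST = "127.0.0.1:11434"  # Ollama's default local address and port


def base_url_from_env(env: Mapping[str, str] = os.environ) -> str:
    """Turn OLLAMA_HOST (or the default) into a base URL such as http://host:port."""
    host = env.get("OLLAMA_HOST", "").strip() or DEFAULT_HOST
    if "://" not in host:
        host = "http://" + host
    return host.rstrip("/")


def list_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Ask the server which models it has (requires a reachable Ollama server)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        payload = json.load(response)
    return [model["name"] for model in payload.get("models", [])]


# Hypothetical remote server; replace with your own address.
remote = base_url_from_env({"OLLAMA_HOST": "192.168.1.20:11434"})
print(remote)
print(base_url_from_env({}))
```

Note that 0.0.0.0 is meaningful only as a listening address on the server. A client should point at the server's actual IP address or hostname.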

OLLAMA_KEEP_ALIVE

Controls how long a model remains loaded in memory after its last request. By default, a model is unloaded after 5 minutes of inactivity.

Extending this time reduces frequent model loading and improves response speed.

Example:

OLLAMA_KEEP_ALIVE=-1

This means the model stays in memory indefinitely until the Ollama service is stopped.
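The same setting can also be applied to a single request through the keep_alive field of the REST API, which takes precedence for that request. The sketch below only builds the request for the /api/generate endpoint and does not send it. The server address, model name, and prompt are placeholder assumptions.

```python
import json
import urllib.request
from typing import Union


def build_generate_request(
    base_url: str, model: str, prompt: str, keep_alive: Union[int, str]
) -> urllib.request.Request:
    """Build a POST request for /api/generate with a per-request keep_alive value."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,           # ask for one complete JSON response
        "keep_alive": keep_alive,  # e.g. "10m", 0 (unload right away), or -1 (keep loaded)
    }
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request(
    "http://127.0.0.1:11434", "phi3", "Why is the sky blue?", keep_alive=-1
)
print(request.full_url)
print(request.data.decode("utf-8"))

# To actually send it (requires a running Ollama server):
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response)["response"])
```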

OLLAMA_MODELS

Specifies the directory where model files are stored, replacing the default location.

Example:

OLLAMA_MODELS=D:\Ollama\Models

This is useful for storing large models on a drive with more available space.
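To check which directory is in effect and how much space it uses, a small script along these lines may help. It is a sketch under assumptions: the fallback of .ollama/models under the home directory matches a typical per-user installation, but a Linux system-service installation stores models elsewhere, so adjust the fallback to your setup.

```python
import os
from collections.abc import Mapping
from pathlib import Path


def resolve_models_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Return OLLAMA_MODELS if set, else an assumed per-user default location."""
    custom = env.get("OLLAMA_MODELS", "").strip()
    if custom:
        return Path(custom)
    return Path.home() / ".ollama" / "models"  # assumption: per-user install


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root does not exist)."""
    if not root.is_dir():
        return 0
    return sum(path.stat().st_size for path in root.rglob("*") if path.is_file())


models_dir = resolve_models_dir()
size_gb = directory_size_bytes(models_dir) / 1e9
print(f"{models_dir}: {size_gb:.2f} GB")
```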

OLLAMA_MAX_LOADED_MODELS

Sets the maximum number of models that can be loaded into memory at the same time. Increasing this value can reduce reload time when switching models frequently.

OLLAMA_NUM_PARALLEL

Defines how many requests a single model can handle concurrently.

Increasing this improves parallel processing but also increases CPU, GPU, and memory usage.
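These variables are read by the server process, so they need to be set in the environment that starts Ollama. One way to do that from a script is sketched below. The values 2 and 4 are illustrative assumptions, not recommendations, and the helper name is made up for this example.

```python
import os
from collections.abc import Mapping


def server_environment(
    base_env: Mapping[str, str], max_loaded_models: int, num_parallel: int
) -> dict[str, str]:
    """Copy the environment and add the concurrency settings for the Ollama server."""
    env = dict(base_env)
    env["OLLAMA_MAX_LOADED_MODELS"] = str(max_loaded_models)
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    return env


# Illustrative values: keep up to 2 models loaded, 4 parallel requests per model.
env = server_environment(os.environ, max_loaded_models=2, num_parallel=4)
print(env["OLLAMA_MAX_LOADED_MODELS"], env["OLLAMA_NUM_PARALLEL"])

# To start the server with these settings (requires Ollama to be installed):
#   import subprocess
#   subprocess.Popen(["ollama", "serve"], env=env)
```

Copying the existing environment first keeps settings such as PATH intact, so the ollama binary can still be found.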

OLLAMA_MAX_QUEUE

Defines the maximum number of requests allowed in the queue.

When incoming requests exceed processing capacity, they are placed in a queue. If the queue is full, additional requests are rejected with an HTTP 503 error indicating that the server is overloaded.
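How the parallel limit and the queue limit interact can be pictured with a toy model. The function below is a conceptual illustration only, not Ollama's scheduler: it takes a burst of requests that arrive before any of them finishes and sorts them into those being processed, those waiting in the queue, and those rejected. The small limits are chosen to keep the example readable.

```python
def admit_requests(
    request_ids: list[int], num_parallel: int, max_queue: int
) -> tuple[list[int], list[int], list[int]]:
    """Classify a burst of requests as processing, queued, or rejected."""
    processing: list[int] = []
    queued: list[int] = []
    rejected: list[int] = []
    for request_id in request_ids:
        if len(processing) < num_parallel:
            processing.append(request_id)  # a processing slot is free
        elif len(queued) < max_queue:
            queued.append(request_id)      # wait for a slot
        else:
            rejected.append(request_id)    # queue is full
    return processing, queued, rejected


# Seven requests arrive at once; 2 can run in parallel and 3 can wait.
processing, queued, rejected = admit_requests(list(range(1, 8)), num_parallel=2, max_queue=3)
print("processing:", processing)
print("queued:", queued)
print("rejected:", rejected)
```

In this burst, requests 1 and 2 are processed, 3 to 5 wait in the queue, and 6 and 7 are turned away.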

OLLAMA_DEBUG

Enables debug mode in Ollama.

When enabled, Ollama outputs more detailed logs, such as:

Model loading process
GPU detection results
Memory usage
API request logs
Performance statistics

This is useful for troubleshooting issues such as model loading failures, performance problems, or GPU detection issues.

Example:

OLLAMA_DEBUG=1

In general, most users do not need to modify Ollama environment variables. However, when scaling systems, serving multiple users, deploying across machines, or optimizing performance, these settings can significantly improve flexibility and control.

4. Conclusion

Future articles will cover more Ollama-related topics such as embedding models and how to update Ollama. Hopefully, this article provides a solid introduction to understanding Ollama.
