Extra 1: Deploying Local Large Models with Ollama

Getting the most out of local large models with Ollama

As a rule of thumb (one of the numbers every LLM developer should know), inference for a model at 16-bit floating-point precision (FP16) needs roughly twice its parameter count (in billions) in GB of VRAM. For example, Llama 2 7B (7 billion parameters) needs about 14 GB of VRAM by this rule, which is clearly beyond the hardware of an ordinary home PC; a single GeForce RTX 4060 Ti 16GB retails for more than 3,000 RMB.

Model quantization, however, can lower the VRAM requirement substantially. Take 4-bit quantization as an example: it compresses weights from FP16 down to 4-bit integer precision, shrinking both the weight files and the inference VRAM footprint to roughly 1/4 to 1/3 of FP16. In other words, about 4 GB of VRAM is enough to start inference on a 7B model (though actual usage keeps growing as the context fills up).
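As a quick back-of-envelope check (a rough sketch only; actual usage also includes the KV cache, activations, and runtime overhead), the estimate is simply parameter count times bytes per weight:

// Rough VRAM lower bound: parameters (in billions) x bytes per weight.
// KV cache and runtime overhead come on top of these figures.
double paramsB = 7;                // e.g. Llama 2 7B
double fp16Gb = paramsB * 2.0;     // FP16: 2 bytes/weight   -> ~14 GB
double q4Gb   = paramsB * 0.5;     // 4-bit: 0.5 bytes/weight -> ~3.5 GB
Console.WriteLine($"FP16 ~{fp16Gb} GB, 4-bit ~{q4Gb} GB");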

Ollama is an easy-to-use framework for running large models locally. As the ecosystem around it has grown, it has become straightforward for ordinary users to run large models on their own machines.

Installation

Open the official download page https://ollama.com/download and download the client for your operating system.

After downloading, it is recommended to run the installer from the command line so that you can change the install path: .\OllamaSetup.exe /DIR="d:\Ollama"

Double-clicking the installer instead installs to the default location: %USERPROFILE%\AppData\Local\Programs\Ollama

Downloaded models are stored under %USERPROFILE%\.ollama by default.
You can change the model directory by setting the OLLAMA_MODELS environment variable, e.g. to D:\OllamaModels (quit Ollama first).
Default configuration directory: %USERPROFILE%\AppData\Local\Ollama

Once installation finishes, Ollama starts automatically; when it is running, its icon appears in the system tray at the bottom right of the taskbar.

Usage

Besides double-clicking the Ollama app to launch it, you can also start and stop it from the command line. After installation, open a terminal and run ollama to list the commands it supports:

PS C:\Users\shengjie> ollama
Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

PS C:\Users\shengjie> ollama -v
ollama version is 0.6.1

Pulling a model

Once ollama --version successfully reports a version number, the installation is complete, and you can use the pull command to download models from the online library (https://ollama.com/library). Let's pull the `deepseek-r1:7b` model and try it out; run the following:
PS C:\Users\shengjie> ollama pull deepseek-r1:7b

Running a model

Once the download finishes, start the model from the command line with ollama run {model}, like so:

PS C:\Users\shengjie> ollama run deepseek-r1:7b

Setting a system prompt

Inside an interactive ollama run session, type /set to list the available REPL commands, then use /set system to install a system message:

/set
Available Commands:
  /set parameter ...     Set a parameter
  /set system <string>   Set system message
  /set history           Enable history
  /set nohistory         Disable history
  /set wordwrap          Enable wordwrap
  /set nowordwrap        Disable wordwrap
  /set format json       Enable JSON mode
  /set noformat          Disable formatting
  /set verbose           Show LLM stats
  /set quiet             Disable LLM stats
/set system You are a Xiaohongshu-style copywriter. Rewrite the copy the user provides in an emoji-rich style, featuring a catchy title, emoji in every paragraph, and relevant hashtags at the end. Be sure to preserve the original meaning.
Set system message.
/show system
You are a Xiaohongshu-style copywriter. Rewrite the copy the user provides in an emoji-rich style, featuring a catchy title, emoji in every paragraph, and relevant hashtags at the end. Be sure to preserve the original meaning.

Setting parameters

  1. /set format json: enable JSON mode
  2. /set parameter num_ctx 4096: set the context window to 4096 tokens (temperature, top_k, top_p, etc. can also be set; see modelfile.md#parameter for the full list)

Customizing a model

Ollama lets you customize a model via a Modelfile, which plays a role similar to a Dockerfile: with it you can tailor a downloaded model. See https://github.com/ollama/ollama/blob/main/docs/modelfile.md for details.

You can also inspect the Modelfile of any model with ollama show --modelfile {model}.

Next, let's turn deepseek-r1:7b into a writing-improvement assistant:

1. Define the Modelfile

Create a file named rewrite-ai-modelfile with the following content:

FROM deepseek-r1:7b
# sets the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1
# sets the context window size to 4096, this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 4096

# sets a custom system message to specify the behavior of the chat assistant
SYSTEM You are a Chinese writing-improvement assistant. Your task is to improve the spelling, grammar, clarity, concision, and overall readability of the provided text, breaking up long sentences, reducing repetition, and offering suggestions for improvement. Provide only the corrected version of the text; do not include explanations.

2. Create the custom model
Run ollama create rewrite-ai -f rewrite-ai-modelfile

3. Run the custom model
Run ollama run rewrite-ai

4. Test

ollama create rewrite-ai -f .\rewrite-ai-modelfile
transferring model data
using existing layer sha256:87f26aae09c7f052de93ff98a2282f05822cc6de4af1a2a159c5bd1acbd10ec4
using existing layer sha256:7c7b8e244f6aa1ac8c32b74f56d42c41a0364dd2dabed8d9c6030a862e805b54
using existing layer sha256:1da0581fd4ce92dcf5a66b1da737cf215d8dcf25aa1b98b44443aaf7173155f5
creating new layer sha256:f0a0557006bab292d768f2581992580aeb1eb35d38bfa558638fa79e4099df04
creating new layer sha256:0a73740ea421e924afe53e0b59fff35edd9cc156549ea84b96815ec0ba75b509
creating new layer sha256:f703e3ac557059df3e2ac9f0482a1262967664933408281b266914122986eb46
writing manifest
success
PS C:\Users\shengjie\Config> ollama run rewrite-ai

Calling the API

By default, Ollama serves its API at `127.0.0.1:11434`. It mainly exposes two REST endpoints; see https://github.com/ollama/ollama/blob/main/docs/api.md for the full reference.

POST /api/generate

generate without stream

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "stream": false,
  "prompt": "Why is the sky blue?"
}'

generate with JSON mode

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "What color is the sky at different times of the day? Respond using JSON",
  "format": "json",
  "stream": false,
  "options": {
    "temperature": 0.8
  }
}'
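The same request can of course be made from code. Below is a minimal C# sketch using HttpClient against the endpoint above; the model name and prompt are simply the ones from the curl example, and this is an illustrative client, not part of Ollama itself:

using System.Net.Http;
using System.Text;
using System.Text.Json;

using var http = new HttpClient();
var payload = JsonSerializer.Serialize(new
{
    model = "deepseek-r1:7b",
    prompt = "Why is the sky blue?",
    stream = false
});
var response = await http.PostAsync(
    "http://localhost:11434/api/generate",
    new StringContent(payload, Encoding.UTF8, "application/json"));
var body = await response.Content.ReadAsStringAsync();
// With "stream": false the reply is a single JSON object whose
// "response" field holds the completion.
Console.WriteLine(body);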

POST /api/chat

chat without stream
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:7b",
  "stream": false,
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ]
}'
chat with tools
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather today in Paris?"
    }
  ],
  "stream": false,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The location to get the weather for, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "description": "The format to return the weather in, e.g. 'celsius' or 'fahrenheit'",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}'
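When the model decides to call one of the tools, the assistant message in the reply carries a tool_calls array (documented in the Ollama API reference) instead of plain text, and it is up to the caller to run the function and send its result back as a follow-up message with role "tool". A minimal C# parsing sketch; SendChatRequestAsync is a hypothetical helper that performs the POST above and returns the response body:

using System.Text.Json;

// SendChatRequestAsync is a hypothetical helper: it POSTs the JSON
// payload shown above to /api/chat and returns the raw response body.
string body = await SendChatRequestAsync();

using var doc = JsonDocument.Parse(body);
var message = doc.RootElement.GetProperty("message");
if (message.TryGetProperty("tool_calls", out var toolCalls))
{
    foreach (var call in toolCalls.EnumerateArray())
    {
        var fn = call.GetProperty("function");
        Console.WriteLine($"tool: {fn.GetProperty("name").GetString()}");
        Console.WriteLine($"args: {fn.GetProperty("arguments")}");
        // Execute the real function here, then append its output to the
        // conversation as a message with role "tool" and call the API again.
    }
}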

Reference: https://sspai.com/post/85193

Running GGUF models

Ollama can also run GGUF-format models from Hugging Face directly; see "Use Ollama with any GGUF Model on Hugging Face Hub" for details.

  1. ollama run hf.co/{username}/{repository}

    ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
    ollama run hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF
    ollama run hf.co/arcee-ai/SuperNova-Medius-GGUF
    ollama run hf.co/bartowski/Humanish-LLama3-8B-Instruct-GGUF
  2. ollama run hf.co/{username}/{repository}:{quantization}

    e.g. ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0

Integrating Ollama local models with SK

You need to install Microsoft.SemanticKernel.Connectors.Ollama. It is currently an experimental release, so take care to suppress the corresponding warning:

#pragma warning disable SKEXP0070

#r "nuget: Microsoft.SemanticKernel.Connectors.Ollama,*-*"
Installed Packages
  • Microsoft.SemanticKernel.Connectors.Ollama, 1.41.0-alpha
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.Ollama;

var builder = Kernel.CreateBuilder();
var chatModelId = "llama3.2";
var embeddingModelId = "quentinz/bge-large-zh-v1.5:latest";
var endpoint = new Uri("http://localhost:11434");

#pragma warning disable SKEXP0070
builder.Services.AddOllamaChatCompletion(chatModelId, endpoint);
builder.Services.AddOllamaTextEmbeddingGeneration(embeddingModelId, endpoint);
#pragma warning restore SKEXP0070
var kernel = builder.Build();

Chat completion example:

var response = await kernel.InvokePromptAsync("Who are you?");
response.Display();

Function calling example:

kernel.Plugins.Clear();
kernel.ImportPluginFromFunctions("HelperFunctions",
[
    kernel.CreateFunctionFromMethod(() => DateTime.UtcNow.ToString("R"),
        "GetCurrentDateTimeInUtc", "Retrieves the current date time in UTC."),
    kernel.CreateFunctionFromMethod((string location) =>
    {
        return $"The weather in {location} is sunny.";
    }, "GetWeather", "Retrieves the weather for a location.")
]);
#pragma warning disable SKEXP0070
OllamaPromptExecutionSettings settings = new()
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};
var response = await kernel.InvokePromptAsync(
    "What is the weather today in Beijing?", new(settings));
response.Display();

response = await kernel.InvokePromptAsync("What is the time now?", new(settings));
response.Display();

Embedding example:

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Embeddings;

#pragma warning disable SKEXP0001
var embeddingGenerator = kernel.GetRequiredService<ITextEmbeddingGenerationService>();

var response = await embeddingGenerator
.GenerateEmbeddingsAsync(["Ollama:Get up and running with large language models."]);

response.Display();
[ -0.015619896, -0.001123916, -0.047033306, 0.031700864, 0.018831065, -0.018030616, -0.062909976, -0.0098228175, -0.011624538, 0.02440735, 0.014996011, -0.011616529, 0.01240978, -0.022910211, -0.015572232, 0.0152670555, -0.02139899, -0.022848004, 0.0015667947, -0.049193863 ... (1004 more) ]
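Embedding vectors are typically compared with cosine similarity. As a small usage sketch on top of the embeddingGenerator above (plain C#, no extra packages; the two sample strings are just illustrations):

#pragma warning disable SKEXP0001
// Cosine similarity between two embedding vectors: closer to 1 = more similar.
static float Cosine(ReadOnlyMemory<float> a, ReadOnlyMemory<float> b)
{
    float dot = 0, na = 0, nb = 0;
    var sa = a.Span;
    var sb = b.Span;
    for (int i = 0; i < sa.Length; i++)
    {
        dot += sa[i] * sb[i];
        na += sa[i] * sa[i];
        nb += sb[i] * sb[i];
    }
    return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb));
}

var vectors = await embeddingGenerator.GenerateEmbeddingsAsync(
    ["Ollama: run large language models locally.", "A local LLM runtime."]);
Console.WriteLine(Cosine(vectors[0], vectors[1]));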