Extra 1: Deploying Local Large Models with Ollama

Getting the most out of local large models with Ollama

As a rule of thumb (one of the numbers every LLM developer should know), inference for a model at 16-bit floating-point precision (FP16) needs roughly twice its parameter count (in billions) in GB of VRAM. For example, Llama 2 7B (7 billion parameters) needs about 14 GB of VRAM by this rule, which is clearly beyond the hardware of an ordinary home PC; a single GeForce RTX 4060 Ti 16GB retails for more than 3,000 RMB.

Model quantization, however, can lower the VRAM requirement substantially. Take 4-bit quantization as an example: it compresses weights from FP16 down to 4-bit integer precision, shrinking both the weight files and the inference VRAM footprint to roughly 1/4 to 1/3 of FP16. In other words, about 4 GB of VRAM is enough to start inference on a 7B model (though actual usage keeps growing as the context fills up).
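As a quick back-of-envelope check (a rough sketch only; actual usage also includes the KV cache, activations, and runtime overhead), the estimate is simply parameter count times bytes per weight:

// Rough VRAM lower bound: parameters (in billions) x bytes per weight.
// KV cache and runtime overhead come on top of these figures.
double paramsB = 7;                // e.g. Llama 2 7B
double fp16Gb = paramsB * 2.0;     // FP16: 2 bytes/weight   -> ~14 GB
double q4Gb   = paramsB * 0.5;     // 4-bit: 0.5 bytes/weight -> ~3.5 GB
Console.WriteLine($"FP16 ~{fp16Gb} GB, 4-bit ~{q4Gb} GB");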

Ollama is an easy-to-use framework for running large models locally. As the ecosystem around it has grown, it has become straightforward for ordinary users to run large models on their own machines.

Installation

Open the official download page https://ollama.com/download and download the client for your operating system.

After downloading, it is recommended to run the installer from the command line so that you can change the install path: .\OllamaSetup.exe /DIR="d:\Ollama"

Double-clicking the installer instead installs to the default location: %USERPROFILE%\AppData\Local\Programs\Ollama

Downloaded models are stored under %USERPROFILE%\.ollama by default.
You can change the model directory by setting the OLLAMA_MODELS environment variable, e.g. to D:\OllamaModels (quit Ollama first).
Default configuration directory: %USERPROFILE%\AppData\Local\Ollama

Once installation finishes, Ollama starts automatically; when it is running, its icon appears in the system tray at the bottom right of the taskbar.

Usage

Besides double-clicking the Ollama app to launch it, you can also start and stop it from the command line. After installation, open a terminal and run ollama to list the commands it supports:

PS C:\Users\shengjie> ollama
Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

PS C:\Users\shengjie> ollama -v
ollama version is 0.6.1

Pulling a model

Once ollama --version successfully reports a version number, the installation is complete, and you can use the pull command to download models from the online library (https://ollama.com/library). Let's pull the `deepseek-r1:7b` model and try it out; run the following:
PS C:\Users\shengjie> ollama pull deepseek-r1:7b

Running a model

Once the download finishes, start the model from the command line with ollama run {model}, like so:

PS C:\Users\shengjie> ollama run deepseek-r1:7b

Setting a system prompt

Inside an interactive ollama run session, type /set to list the available REPL commands, then use /set system to install a system message:

/set
Available Commands:
  /set parameter ...     Set a parameter
  /set system <string>   Set system message
  /set history           Enable history
  /set nohistory         Disable history
  /set wordwrap          Enable wordwrap
  /set nowordwrap        Disable wordwrap
  /set format json       Enable JSON mode
  /set noformat          Disable formatting
  /set verbose           Show LLM stats
  /set quiet             Disable LLM stats
/set system You are a Xiaohongshu-style copywriter. Rewrite the copy the user provides in an emoji-rich style, featuring a catchy title, emoji in every paragraph, and relevant hashtags at the end. Be sure to preserve the original meaning.
Set system message.
/show system
You are a Xiaohongshu-style copywriter. Rewrite the copy the user provides in an emoji-rich style, featuring a catchy title, emoji in every paragraph, and relevant hashtags at the end. Be sure to preserve the original meaning.

Setting parameters

  1. /set format json: enable JSON mode
  2. /set parameter num_ctx 4096: set the context window to 4096 tokens (temperature, top_k, top_p, etc. can also be set; see modelfile.md#parameter for the full list)

Customizing a model

Ollama lets you customize a model via a Modelfile, which plays a role similar to a Dockerfile: with it you can tailor a downloaded model. See https://github.com/ollama/ollama/blob/main/docs/modelfile.md for details.

You can also inspect the Modelfile of any model with ollama show --modelfile {model}.

Next, let's turn deepseek-r1:7b into a writing-improvement assistant:

1. Define the Modelfile

Create a file named rewrite-ai-modelfile with the following content:

FROM deepseek-r1:7b
# sets the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1
# sets the context window size to 4096, this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 4096

# sets a custom system message to specify the behavior of the chat assistant
SYSTEM You are a Chinese writing-improvement assistant. Your task is to improve the spelling, grammar, clarity, concision, and overall readability of the provided text, breaking up long sentences, reducing repetition, and offering suggestions for improvement. Provide only the corrected version of the text; do not include explanations.

2. Create the custom model
Run ollama create rewrite-ai -f rewrite-ai-modelfile

3. Run the custom model
Run ollama run rewrite-ai

4. Test

ollama create rewrite-ai -f .\rewrite-ai-modelfile
transferring model data
using existing layer sha256:87f26aae09c7f052de93ff98a2282f05822cc6de4af1a2a159c5bd1acbd10ec4
using existing layer sha256:7c7b8e244f6aa1ac8c32b74f56d42c41a0364dd2dabed8d9c6030a862e805b54
using existing layer sha256:1da0581fd4ce92dcf5a66b1da737cf215d8dcf25aa1b98b44443aaf7173155f5
creating new layer sha256:f0a0557006bab292d768f2581992580aeb1eb35d38bfa558638fa79e4099df04
creating new layer sha256:0a73740ea421e924afe53e0b59fff35edd9cc156549ea84b96815ec0ba75b509
creating new layer sha256:f703e3ac557059df3e2ac9f0482a1262967664933408281b266914122986eb46
writing manifest
success
PS C:\Users\shengjie\Config> ollama run rewrite-ai

Calling the API

By default, Ollama serves its API at `127.0.0.1:11434`. It mainly exposes two REST endpoints; see https://github.com/ollama/ollama/blob/main/docs/api.md for the full reference.

POST /api/generate

generate without stream

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "stream": false,
  "prompt": "Why is the sky blue?"
}'

generate with JSON mode

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "What color is the sky at different times of the day? Respond using JSON",
  "format": "json",
  "stream": false,
  "options": {
    "temperature": 0.8
  }
}'
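The same request can of course be made from code. Below is a minimal C# sketch using HttpClient against the endpoint above; the model name and prompt are simply the ones from the curl example, and this is an illustrative client, not part of Ollama itself:

using System.Net.Http;
using System.Text;
using System.Text.Json;

using var http = new HttpClient();
var payload = JsonSerializer.Serialize(new
{
    model = "deepseek-r1:7b",
    prompt = "Why is the sky blue?",
    stream = false
});
var response = await http.PostAsync(
    "http://localhost:11434/api/generate",
    new StringContent(payload, Encoding.UTF8, "application/json"));
var body = await response.Content.ReadAsStringAsync();
// With "stream": false the reply is a single JSON object whose
// "response" field holds the completion.
Console.WriteLine(body);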

POST /api/chat

chat without stream
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:7b",
  "stream": false,
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ]
}'
chat with tools
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather today in Paris?"
    }
  ],
  "stream": false,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The location to get the weather for, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "description": "The format to return the weather in, e.g. 'celsius' or 'fahrenheit'",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}'
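When the model decides to call one of the tools, the assistant message in the reply carries a tool_calls array (documented in the Ollama API reference) instead of plain text, and it is up to the caller to run the function and send its result back as a follow-up message with role "tool". A minimal C# parsing sketch; SendChatRequestAsync is a hypothetical helper that performs the POST above and returns the response body:

using System.Text.Json;

// SendChatRequestAsync is a hypothetical helper: it POSTs the JSON
// payload shown above to /api/chat and returns the raw response body.
string body = await SendChatRequestAsync();

using var doc = JsonDocument.Parse(body);
var message = doc.RootElement.GetProperty("message");
if (message.TryGetProperty("tool_calls", out var toolCalls))
{
    foreach (var call in toolCalls.EnumerateArray())
    {
        var fn = call.GetProperty("function");
        Console.WriteLine($"tool: {fn.GetProperty("name").GetString()}");
        Console.WriteLine($"args: {fn.GetProperty("arguments")}");
        // Execute the real function here, then append its output to the
        // conversation as a message with role "tool" and call the API again.
    }
}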

Reference: https://sspai.com/post/85193

Running GGUF models

Ollama can also run GGUF-format models from Hugging Face directly; see "Use Ollama with any GGUF Model on Hugging Face Hub" for details.

  1. ollama run hf.co/{username}/{repository}

    ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
    ollama run hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF
    ollama run hf.co/arcee-ai/SuperNova-Medius-GGUF
    ollama run hf.co/bartowski/Humanish-LLama3-8B-Instruct-GGUF
  2. ollama run hf.co/{username}/{repository}:{quantization}

    e.g. ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0

Integrating Ollama local models with SK

You need to install Microsoft.SemanticKernel.Connectors.Ollama. It is currently an experimental release, so take care to suppress the corresponding warning:

#pragma warning disable SKEXP0070

#r "nuget: Microsoft.SemanticKernel.Connectors.Ollama,*-*"
Installed Packages
  • Microsoft.SemanticKernel.Connectors.Ollama, 1.41.0-alpha
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.Ollama;

var builder = Kernel.CreateBuilder();
var chatModelId = "llama3.2";
var embeddingModelId = "quentinz/bge-large-zh-v1.5:latest";
var endpoint = new Uri("http://localhost:11434");

#pragma warning disable SKEXP0070
builder.Services.AddOllamaChatCompletion(chatModelId, endpoint);
builder.Services.AddOllamaTextEmbeddingGeneration(embeddingModelId, endpoint);
#pragma warning restore SKEXP0070
var kernel = builder.Build();

Chat completion example:

var response = await kernel.InvokePromptAsync("Who are you?");
response.Display();

Function calling example:

kernel.Plugins.Clear();
kernel.ImportPluginFromFunctions("HelperFunctions",
[
    kernel.CreateFunctionFromMethod(() => DateTime.UtcNow.ToString("R"),
        "GetCurrentDateTimeInUtc", "Retrieves the current date time in UTC."),
    kernel.CreateFunctionFromMethod((string location) =>
    {
        return $"The weather in {location} is sunny.";
    }, "GetWeather", "Retrieves the weather for a location.")
]);
#pragma warning disable SKEXP0070
OllamaPromptExecutionSettings settings = new()
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};
var response = await kernel.InvokePromptAsync(
    "What is the weather today in Beijing?", new(settings));
response.Display();

response = await kernel.InvokePromptAsync("What is the time now?", new(settings));
response.Display();

Embedding example:

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Embeddings;

#pragma warning disable SKEXP0001
var embeddingGenerator = kernel.GetRequiredService<ITextEmbeddingGenerationService>();

var response = await embeddingGenerator
.GenerateEmbeddingsAsync(["Ollama:Get up and running with large language models."]);

response.Display();
[ -0.015619896, -0.001123916, -0.047033306, 0.031700864, 0.018831065, -0.018030616, -0.062909976, -0.0098228175, -0.011624538, 0.02440735, 0.014996011, -0.011616529, 0.01240978, -0.022910211, -0.015572232, 0.0152670555, -0.02139899, -0.022848004, 0.0015667947, -0.049193863 ... (1004 more) ]
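Embedding vectors are typically compared with cosine similarity. As a small usage sketch on top of the embeddingGenerator above (plain C#, no extra packages; the two sample strings are just illustrations):

#pragma warning disable SKEXP0001
// Cosine similarity between two embedding vectors: closer to 1 = more similar.
static float Cosine(ReadOnlyMemory<float> a, ReadOnlyMemory<float> b)
{
    float dot = 0, na = 0, nb = 0;
    var sa = a.Span;
    var sb = b.Span;
    for (int i = 0; i < sa.Length; i++)
    {
        dot += sa[i] * sb[i];
        na += sa[i] * sa[i];
        nb += sb[i] * sb[i];
    }
    return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb));
}

var vectors = await embeddingGenerator.GenerateEmbeddingsAsync(
    ["Ollama: run large language models locally.", "A local LLM runtime."]);
Console.WriteLine(Cosine(vectors[0], vectors[1]));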