A step-by-step guide to deploying ChatGLM-6B, plus an online inference service built on Ray

Today's goal is to deploy the ChatGLM-6B model, one step at a time.

ChatGLM-6B demo: Gradio

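First, ChatGLM is built on the PyTorch deep learning framework. The service in this experiment is deployed on an AutoDL server, using the following image configuration.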

PyTorch  1.11.0
Python  3.8(ubuntu20.04)
Cuda  11.3

Second, here is a list of PyPI packages with which the model is known to deploy successfully.

torch                          1.13.1
torchvision                    0.14.1
gradio                         3.21.0
protobuf>=3.19.5,<3.20.1
transformers>=4.26.1
icetk
cpm_kernels
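A quick way to compare an existing environment against the list above is a small check script. The following is a minimal sketch, not part of the ChatGLM repository: the helper names (`parse_version`, `satisfies`, `check_environment`) are hypothetical, the exact pins for torch and gradio are simply the versions listed above, and the version parser only understands plain dotted numbers.

```python
import re
from importlib import metadata

# Versions and ranges copied from the list above; "" means "any version".
REQUIREMENTS = {
    "torch": "==1.13.1",
    "torchvision": "==0.14.1",
    "gradio": "==3.21.0",
    "protobuf": ">=3.19.5,<3.20.1",
    "transformers": ">=4.26.1",
    "icetk": "",
    "cpm_kernels": "",
}

OPS = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    "==": lambda a, b: a == b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
}


def parse_version(text):
    """Turn '1.13.1+cu117' into (1, 13, 1); local and non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split("+", 1)[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:  # pad so that '3.20' compares equal to '3.20.0'
        parts.append(0)
    return tuple(parts)


def satisfies(installed, spec):
    """Check a version against a comma-separated spec such as '>=3.19.5,<3.20.1'."""
    have = parse_version(installed)
    for clause in (c.strip() for c in spec.split(",")):
        if not clause:
            continue
        match = re.match(r"(>=|<=|==|<|>)\s*(.+)", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, wanted = match.groups()
        if not OPS[op](have, parse_version(wanted)):
            return False
    return True


def check_environment(requirements):
    """Map each package name to 'ok', 'missing', or 'mismatch (<installed version>)'."""
    results = {}
    for name, spec in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            results[name] = "missing"
            continue
        results[name] = "ok" if satisfies(installed, spec) else f"mismatch ({installed})"
    return results


if __name__ == "__main__":
    for name, status in check_environment(REQUIREMENTS).items():
        print(f"{name:15} {status}")
```

A "mismatch" is not necessarily fatal; it only means the installed version differs from the combination reported to work above.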

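Third, let's walk through the complete deployment process:

Step 1: git clone github.com/THUDM/ChatG…

Step 2: pip install -r requirements.txt

Step 3: pip install gradio

Step 4: python web_demo.py

The log will print two addresses. Open either one to try the model.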

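Question: what if the download is slow?

Users in mainland China are advised to download the model with the following code, which goes through a mirror: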

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", mirror='https://mirror.nju.edu.cn/hugging-face-models', trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", mirror='https://mirror.nju.edu.cn/hugging-face-models', trust_remote_code=True).half().cuda()

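Question: which GPU servers are cheap and also reachable from the public internet?

GPU: a server with at least 24 GB of VRAM, for example a 3090, 4090, A100, or the 32 GB version of the V100.

Provider: AutoDL, Featurize, and 趋动云 can all expose a public endpoint.

AutoDL exposes a single public port, 6006, so the launch line in web_demo.py needs to be changed as follows.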

demo.queue().launch(server_port=6006,share=True)
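Because only one port is exposed, it can be worth confirming that nothing else is already listening on 6006 before starting the demo. The sketch below uses only the standard library; `port_is_free` is a hypothetical helper, not part of Gradio or AutoDL.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if this process can bind (host, port), i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Raised when the port is in use (or when binding is not permitted).
            return False
    return True


if __name__ == "__main__":
    print("port 6006 free:", port_is_free(6006))
```

If the check prints False, stop whatever service is holding the port before running web_demo.py.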

How to open a port on Featurize:

featurize port export <port>

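趋动云 exposes the endpoint through its web console.

Question: how can model inference be made faster?

Answer: use the OpenVINO toolchain. Export the ChatGLM model to an ONNX model (the format ONNX Runtime runs), then convert the ONNX model to OpenVINO's .xml model format.

Question: how to deploy chatglm-6b on a server with less than 24 GB of VRAM?

I tried deploying chatglm-6b on a 3060. Changing the model-loading line as follows makes it fit.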

model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().quantize(4).cuda()
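A back-of-the-envelope estimate shows why quantization matters on a smaller card. The sketch below counts weight storage only and takes "6B" at face value as 6e9 parameters; `weight_memory_gib` is a hypothetical helper. Real usage is higher, since activations, the attention cache, and any layers left unquantized also need memory.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


PARAMS = 6e9  # assumption: "6B" taken literally

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gib(PARAMS, bits):.1f} GiB")
# FP16: 11.2 GiB
# INT8: 5.6 GiB
# INT4: 2.8 GiB
```

Assuming the 12 GB variant of the 3060, FP16 weights alone would nearly fill the card, while 4-bit weights leave room for inference.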

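Quantization costs some model quality. In testing, ChatGLM-6B still generates natural, fluent text under 4-bit quantization. Quantization schemes such as GPT-Q can push the precision lower or improve model quality at the same precision; pull requests in this direction are welcome.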
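To make the quality loss concrete, here is a toy illustration of 4-bit quantization in pure Python. It is a simplified symmetric absmax scheme chosen for clarity, not the kernel that ChatGLM's `quantize(4)` actually uses; the function names and the sample weights are made up for the example.

```python
def quantize_int4(values):
    """Symmetric absmax quantization: map floats to integer codes in [-7, 7]."""
    largest = max(abs(v) for v in values)
    scale = largest / 7 if largest else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.33, 0.12, 0.01, -0.66]  # made-up sample values
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)  # [7, -3, 1, 0, -7]
print(f"max error {max_error:.3f}, bound {scale / 2:.3f}")
```

Each weight is stored as one of only 15 levels, so the rounding error per weight can reach half a quantization step. That rounding is the source of the quality loss mentioned above.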

I picked some sentences at random as inputs to see how ChatGLM responds.


The speed is not great so far, so I decided to use Ray's multi-process capabilities to speed up generation.

import csv
import json

import requests
from ray import serve
from starlette.requests import Request
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer

# Prompts to send to the model, one string per entry.
with open("word_set_list.json", "r") as f:
    word_set_list = json.load(f)


# 1: Wrap the ChatGLM-6B model in a Serve deployment.
# (The class name is left over from the Ray Serve sentiment-analysis example.)
@serve.deployment(route_prefix="/", num_replicas=1, ray_actor_options={"num_cpus": 8, "num_gpus": 1})
class SentimentAnalysisDeployment:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
        # Download through the mirror, cache under ./chatglm-6b, and quantize to 4 bits.
        self.model = AutoModel.from_pretrained("THUDM/chatglm-6b", cache_dir='./chatglm-6b',
                                               mirror='https://mirror.nju.edu.cn/hugging-face-models',
                                               trust_remote_code=True).half().quantize(4).cuda()
        self._model = self.model.eval()

    def __call__(self, request: Request) -> str:
        # model.chat returns (response, history); return only the response text.
        return self._model.chat(self.tokenizer, request.query_params["text"], history=[])[0]


# 2: Deploy the deployment.
serve.run(SentimentAnalysisDeployment.bind())

# 3: Query the deployment once per prompt and save each (prompt, response) pair.
with open("chatglm.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    for word_one in tqdm(word_set_list):
        response = requests.get(
            "http://localhost:8000/", params={"text": word_one}
        ).text
        writer.writerow([word_one, response])
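The client loop above sends one request at a time, so any additional replicas would mostly sit idle. A concurrent client is one way to keep several replicas busy. The sketch below uses only the standard library; `ask_all` and `ask_via_http` are hypothetical helpers, and the URL and the `text` query parameter are taken from the deployment above. With `num_replicas=1` the requests are likely still handled largely one by one, so this only pays off when more replicas are configured.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen


def ask_via_http(prompt, url="http://localhost:8000/"):
    """Send one prompt to the Serve endpoint and return the response body as text."""
    with urlopen(f"{url}?{urlencode({'text': prompt})}") as reply:
        return reply.read().decode("utf-8")


def ask_all(prompts, ask=ask_via_http, max_workers=4):
    """Run `ask` on every prompt concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(zip(prompts, pool.map(ask, prompts)))


if __name__ == "__main__":
    # Stand-in for the real endpoint so the sketch runs without a server.
    for prompt, answer in ask_all(["hello", "ray"], ask=str.upper):
        print(prompt, "->", answer)
```

To use it against the running service, call `ask_all(word_set_list)` and write the returned pairs to the CSV file; `max_workers` should roughly match the number of replicas.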

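A bit of background here. You may be surprised that a model with 6 billion parameters can predict faster than some models with 300 million parameters. Three techniques deserve the credit:

4-bit quantization

fast transformer

a C++ implementation

I am not sure whether "fast transformer" here refers to NVIDIA's product or ByteDance's. Judging from search results, it looks more like NVIDIA's FasterTransformer: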

NVIDIA/FasterTransformer: Transformer related optimization, including BERT, GPT (github.com)

I hope this information helps. Remember to like, bookmark, and follow.

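Author: 路人与大师
Link: https://juejin.cn/post/7212893989033246779
Source: 稀土掘金 (Juejin)
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.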
