granite3.2-vision is a compact and efficient vision-language model built on the Granite architecture and designed for visual document understanding. It automatically extracts content from tables, charts, infographics, plots, diagrams, and more. The model was trained on a carefully curated instruction-following dataset that combines a wide range of public and synthetic datasets, covering a broad spectrum of document understanding and general image tasks. It was created by fine-tuning a Granite large language model with both image and text modalities.
Intended use: The model is designed for enterprise applications that involve processing both visual and textual data. In particular, it is well suited to a range of visual document understanding tasks, such as analyzing tables and charts, performing optical character recognition (OCR), and answering questions based on document content. Its capabilities also extend to general image understanding, enabling a broader set of business applications. For tasks that involve only text-based input, we recommend the Granite large language models, which are optimized for text-only processing and deliver better performance than this model on such workloads.
{ "num_ctx": 16384, "temperature": 0 }
Granite Vision 3.2 was evaluated against other vision-language models (VLMs) in the 1B-4B parameter range using the standard llms-eval benchmark suite. The evaluation spans multiple public benchmarks, with a particular focus on document understanding tasks, and also includes general visual question answering benchmarks.
| Benchmark | Molmo-E | InternVL2 | Phi3v | Phi3.5v | Granite Vision |
|---|---|---|---|---|---|
| Document benchmarks | | | | | |
| DocVQA | 0.66 | 0.87 | 0.87 | 0.88 | 0.89 |
| ChartQA | 0.60 | 0.75 | 0.81 | 0.82 | 0.87 |
| TextVQA | 0.62 | 0.72 | 0.69 | 0.70 | 0.78 |
| AI2D | 0.63 | 0.74 | 0.79 | 0.79 | 0.76 |
| InfoVQA | 0.44 | 0.58 | 0.55 | 0.61 | 0.64 |
| OCRBench | 0.65 | 0.75 | 0.64 | 0.64 | 0.77 |
| LiveXiv VQA | 0.47 | 0.51 | 0.61 | - | 0.61 |
| LiveXiv TQA | 0.36 | 0.38 | 0.48 | - | 0.57 |
| Other benchmarks | | | | | |
| MMMU | 0.32 | 0.35 | 0.42 | 0.44 | 0.37 |
| VQAv2 | 0.57 | 0.75 | 0.76 | 0.77 | 0.78 |
| RealWorldQA | 0.55 | 0.34 | 0.60 | 0.58 | 0.63 |
| VizWiz VQA | 0.49 | 0.46 | 0.57 | 0.57 | 0.63 |
| OK VQA | 0.40 | 0.44 | 0.51 | 0.53 | 0.56 |
```
{{- /* Tools */ -}}
{{- if .Tools -}}
<|start_of_role|>available_tools<|end_of_role|>
{{- range $index, $_ := .Tools }}
{{- $last := eq (len (slice $.Tools $index)) 1 }}
{{ . }}
{{- if not $last }}
{{ end}}
{{- end -}}
<|end_of_text|>
{{ end }}
{{- /* System Prompt */ -}}
{{- if and (gt (len .Messages) 0) (eq (index .Messages 0).Role "system") -}}
<|system|>
{{(index .Messages 0).Content}}
{{- else -}}
<|system|>
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
{{- end }}
{{- /*Main message loop*/ -}}
{{- range $index, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $index)) 1 }}
{{- if eq .Role "system" }}
{{- else if eq .Role "user" }}
<|user|>
{{.Content}}
{{- else if eq .Role "assistant" }}
<|assistant|>
{{- if .Content }}
{{.Content}}
<|end_of_text|>
{{ end }}
{{- else if eq .Role "assistant_tool_call" }}
<|start_of_role|>assistant<|end_of_role|><|tool_call|>{{.Content}}<|end_of_text|>
{{- else if eq .Role "tool_response" }}
<|start_of_role|>tool_response<|end_of_role|>{{.Content}}<|end_of_text|>
{{- end }}
{{- /* Add generation prompt */ -}}
{{ if $last }}
{{- if eq .Role "assistant" }}
{{- else }}
<|assistant|>
{{- end }}
{{- end }}
{{- end }}
```
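To make the template easier to read, here is a rough, illustrative sketch of the prompt string it renders for a single user turn with no tools and no explicit system message. The whitespace is approximate, the image itself is injected separately by the Ollama runtime, and the question string is just an example:

```python
# Illustrative only: an approximation of the prompt the template above produces
# for one user message, using the default system prompt. Ollama assembles this
# for you; exact whitespace may differ slightly.
rendered_prompt = (
    "<|system|>\n"
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
    "<|user|>\n"
    "What is the highest scoring model on ChartQA and what is its score?\n"
    "<|assistant|>\n"
)
print(rendered_prompt)
```

This matches the prompt layout used in the vLLM example further down.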
The Granite Vision model is natively supported in transformers>=4.49. Below is a simple example of how to use the granite-vision-3.2-2b model.
First, make sure you have an up-to-date version of transformers installed:
```shell
pip install "transformers>=4.49"
```
Then run the following code:
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import hf_hub_download
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "ibm-granite/granite-vision-3.2-2b"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)

# prepare image and text prompt, using the appropriate prompt template
img_path = hf_hub_download(repo_id=model_path, filename='example.png')

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": img_path},
            {"type": "text", "text": "What is the highest scoring model on ChartQA and what is its score?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(device)

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```
The model can also be loaded with vLLM. First, make sure the following libraries are installed:
```shell
pip install torch torchvision torchaudio
pip install vllm==0.6.6
```
Then, copy the code snippet from the section that is relevant to your use case:
```python
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from huggingface_hub import hf_hub_download
from PIL import Image

model_path = "ibm-granite/granite-vision-3.2-2b"

model = LLM(
    model=model_path,
    limit_mm_per_prompt={"image": 1},
)

sampling_params = SamplingParams(
    temperature=0.2,
    max_tokens=64,
)

# Define the question we want to answer and format the prompt
image_token = "<image>"
system_prompt = "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n"

question = "What is the highest scoring model on ChartQA and what is its score?"
prompt = f"{system_prompt}<|user|>\n{image_token}\n{question}\n<|assistant|>\n"
img_path = hf_hub_download(repo_id=model_path, filename='example.png')
image = Image.open(img_path).convert("RGB")
print(image)

# Build the inputs to vLLM; the image is passed as `multi_modal_data`.
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image,
    }
}

outputs = model.generate(inputs, sampling_params=sampling_params)
print(f"Generated text: {outputs[0].outputs[0].text}")
```