granite3.2-vision, built on the Granite architecture, is a compact and efficient vision-language model designed for visual document understanding. It automatically extracts content from tables, charts, infographics, plots, diagrams, and more. The model was trained on a carefully curated instruction-following dataset spanning a variety of public and synthetic datasets, intended to support a broad range of document understanding and general image tasks. It was produced by fine-tuning a Granite large language model with both image and text modalities.
Intended use: the model is designed for enterprise applications that involve processing visual and textual data. In particular, it is well suited to a range of visual document understanding tasks, such as analyzing tables and charts, performing optical character recognition (OCR), and answering questions based on document content. Its capabilities also extend to general image understanding, enabling a broader set of business applications. For tasks that involve only text-based input, we recommend our Granite large language models, which are optimized for text-only processing and outperform this model on such workloads.
{
"num_ctx": 16384,
"temperature": 0
}
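The parameters above (a 16K context window and greedy decoding) are the model's defaults; they can also be overridden per request. A minimal sketch, assuming a local Ollama server at its default port, of the JSON payload for the `/api/generate` endpoint:

```python
import json

# Request payload for Ollama's /api/generate endpoint, overriding the
# model defaults shown above on a per-request basis.
payload = {
    "model": "granite3.2-vision",
    "prompt": "Describe the chart in this image.",
    # "images": [...],  # base64-encoded image bytes would go here
    "options": {
        "num_ctx": 16384,   # context window, matching the model default
        "temperature": 0,   # greedy decoding for deterministic extraction
    },
    "stream": False,
}
print(json.dumps(payload, indent=2))
# To send it: requests.post("http://localhost:11434/api/generate", json=payload)
```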
Granite Vision 3.2 was evaluated alongside other vision-language models (VLMs) in the 1B-4B parameter range using the standard lmms-eval harness. The evaluation covers multiple public benchmarks, with particular emphasis on document understanding tasks, and also includes general visual question answering benchmarks.
| Benchmark | Molmo-E | InternVL2 | Phi3v | Phi3.5v | Granite Vision |
|---|---|---|---|---|---|
| Document benchmarks | |||||
| DocVQA | 0.66 | 0.87 | 0.87 | 0.88 | 0.89 |
| ChartQA | 0.60 | 0.75 | 0.81 | 0.82 | 0.87 |
| TextVQA | 0.62 | 0.72 | 0.69 | 0.7 | 0.78 |
| AI2D | 0.63 | 0.74 | 0.79 | 0.79 | 0.76 |
| InfoVQA | 0.44 | 0.58 | 0.55 | 0.61 | 0.64 |
| OCRBench | 0.65 | 0.75 | 0.64 | 0.64 | 0.77 |
| LiveXiv VQA | 0.47 | 0.51 | 0.61 | - | 0.61 |
| LiveXiv TQA | 0.36 | 0.38 | 0.48 | - | 0.57 |
| Other benchmarks | |||||
| MMMU | 0.32 | 0.35 | 0.42 | 0.44 | 0.37 |
| VQAv2 | 0.57 | 0.75 | 0.76 | 0.77 | 0.78 |
| RealWorldQA | 0.55 | 0.34 | 0.60 | 0.58 | 0.63 |
| VizWiz VQA | 0.49 | 0.46 | 0.57 | 0.57 | 0.63 |
| OK VQA | 0.40 | 0.44 | 0.51 | 0.53 | 0.56 |
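As a quick sanity check on the document-benchmark rows, the Granite Vision scores can be averaged (values transcribed directly from the table above):

```python
# Granite Vision 3.2 scores on the document benchmarks, copied from
# the table above.
granite_doc_scores = {
    "DocVQA": 0.89,
    "ChartQA": 0.87,
    "TextVQA": 0.78,
    "AI2D": 0.76,
    "InfoVQA": 0.64,
    "OCRBench": 0.77,
    "LiveXiv VQA": 0.61,
    "LiveXiv TQA": 0.57,
}
avg = sum(granite_doc_scores.values()) / len(granite_doc_scores)
print(f"mean document-benchmark score: {avg:.3f}")  # → 0.736
```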
{{- /* Tools */ -}}
{{- if .Tools -}}
<|start_of_role|>available_tools<|end_of_role|>
{{- range $index, $_ := .Tools }}
{{- $last := eq (len (slice $.Tools $index)) 1 }}
{{ . }}
{{- if not $last }}
{{ end}}
{{- end -}}
<|end_of_text|>
{{ end }}
{{- /* System Prompt */ -}}
{{- if and (gt (len .Messages) 0) (eq (index .Messages 0).Role "system") -}}
<|system|>
{{(index .Messages 0).Content}}
{{- else -}}
<|system|>
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
{{- end }}
{{- /*Main message loop*/ -}}
{{- range $index, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $index)) 1 }}
{{- if eq .Role "system" }}
{{- else if eq .Role "user" }}
<|user|>
{{.Content}}
{{- else if eq .Role "assistant" }}
<|assistant|>
{{- if .Content }}
{{.Content}}
<|end_of_text|>
{{ end }}
{{- else if eq .Role "assistant_tool_call" }}
<|start_of_role|>assistant<|end_of_role|><|tool_call|>{{.Content}}<|end_of_text|>
{{- else if eq .Role "tool_response" }}
<|start_of_role|>tool_response<|end_of_role|>{{.Content}}<|end_of_text|>
{{- end }}
{{- /* Add generation prompt */ -}}
{{ if $last }}
{{- if eq .Role "assistant" }}
{{- else }}
<|assistant|>
{{- end }}
{{- end }}
{{- end }}
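The Go template above is what renders a message list into the model's special-token prompt format. As an illustration only (this is not the template engine Ollama actually runs), the same formatting for plain system/user/assistant turns can be sketched in Python:

```python
DEFAULT_SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

def render_prompt(messages):
    """Mimic the template above for simple turns
    (tool-call and tool-response branches omitted for brevity)."""
    parts = []
    system = next((m["content"] for m in messages if m["role"] == "system"),
                  DEFAULT_SYSTEM)
    parts.append(f"<|system|>\n{system}\n")
    for m in messages:
        if m["role"] == "user":
            parts.append(f"<|user|>\n{m['content']}\n")
        elif m["role"] == "assistant":
            parts.append(f"<|assistant|>\n{m['content']}\n<|end_of_text|>\n")
    # Generation prompt: if the last turn is not from the assistant,
    # open an assistant turn for the model to complete.
    if messages and messages[-1]["role"] != "assistant":
        parts.append("<|assistant|>\n")
    return "".join(parts)

prompt = render_prompt([{"role": "user", "content": "What is in this image?"}])
print(prompt)
```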
Granite Vision models are natively supported in transformers>=4.49. Below is a simple example of how to use the granite-vision-3.2-2b model.
First, make sure you have an up-to-date version of transformers installed (the quotes prevent the shell from treating `>=` as a redirection):
pip install "transformers>=4.49"
Then run the code:
from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import hf_hub_download
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = "ibm-granite/granite-vision-3.2-2b"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)
# prepare image and text prompt, using the appropriate prompt template
img_path = hf_hub_download(repo_id=model_path, filename='example.png')
conversation = [
{
"role": "user",
"content": [
{"type": "image", "url": img_path},
{"type": "text", "text": "What is the highest scoring model on ChartQA and what is its score?"},
],
},
]
inputs = processor.apply_chat_template(
conversation,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt"
).to(device)
# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
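Note that `processor.decode(output[0], ...)` decodes the prompt tokens along with the model's answer, because `model.generate` returns the full sequence. A small helper (hypothetical, not part of the transformers API) that keeps only the newly generated tokens:

```python
def new_tokens_only(generated_ids, prompt_len):
    """Slice off the prompt so only the model's continuation remains.
    `generated_ids` is one row of `model.generate`'s output; in the
    example above, `prompt_len` would be inputs["input_ids"].shape[1]."""
    return generated_ids[prompt_len:]

# Toy illustration with plain token-id lists instead of tensors:
prompt_ids = [101, 7592, 2088]
full_output = prompt_ids + [2023, 2003, 102]
print(new_tokens_only(full_output, len(prompt_ids)))  # → [2023, 2003, 102]
```

In the real example you would then pass the sliced row to `processor.decode(..., skip_special_tokens=True)` to print only the answer.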
The model can also be loaded with vLLM. First make sure the following libraries are installed:
pip install torch torchvision torchaudio
pip install vllm==0.6.6
Then copy the snippet from the section that is relevant to your use case:
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from huggingface_hub import hf_hub_download
from PIL import Image
model_path = "ibm-granite/granite-vision-3.2-2b"
model = LLM(
model=model_path,
limit_mm_per_prompt={"image": 1},
)
sampling_params = SamplingParams(
temperature=0.2,
max_tokens=64,
)
# Define the question we want to answer and format the prompt
image_token = "<image>"
system_prompt = "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
question = "What is the highest scoring model on ChartQA and what is its score?"
prompt = f"{system_prompt}<|user|>\n{image_token}\n{question}\n<|assistant|>\n"
img_path = hf_hub_download(repo_id=model_path, filename='example.png')
image = Image.open(img_path).convert("RGB")
print(image)
# Build the inputs to vLLM; the image is passed as `multi_modal_data`.
inputs = {
"prompt": prompt,
"multi_modal_data": {
"image": image,
}
}
outputs = model.generate(inputs, sampling_params=sampling_params)
print(f"Generated text: {outputs[0].outputs[0].text}")
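`LLM.generate` also accepts a list of such input dicts, so several questions about the same image can be answered in one batched call. A sketch of building the batch (only the input construction is shown; the `model.generate` call is the same as above):

```python
# Same prompt scaffolding as in the vLLM snippet above.
system_prompt = (
    "<|system|>\nA chat between a curious user and an artificial intelligence "
    "assistant. The assistant gives helpful, detailed, and polite answers to "
    "the user's questions.\n"
)
image_token = "<image>"

questions = [
    "What is the highest scoring model on ChartQA and what is its score?",
    "Which model has the lowest DocVQA score?",
]

def batched_inputs(image):
    # One input dict per question; each reuses the same PIL image object.
    return [
        {
            "prompt": f"{system_prompt}<|user|>\n{image_token}\n{q}\n<|assistant|>\n",
            "multi_modal_data": {"image": image},
        }
        for q in questions
    ]

# outputs = model.generate(batched_inputs(image), sampling_params=sampling_params)
batch = batched_inputs(image=None)  # placeholder image for illustration
print(len(batch))  # → 2
```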