granite3.2-vision is a compact and efficient vision-language model built on the Granite architecture and designed for visual document understanding. It automatically extracts content from tables, charts, infographics, plots, diagrams, and more. The model was trained on a carefully curated instruction-following dataset that combines a wide range of public and synthetic datasets, covering a broad spectrum of document understanding and general image tasks. It was created by fine-tuning a Granite large language model with both image and text modalities.
Intended use: The model is designed for enterprise applications that involve processing both visual and textual data. In particular, it is well suited to a range of visual document understanding tasks, such as analyzing tables and charts, performing optical character recognition (OCR), and answering questions based on document content. Its capabilities also extend to general image understanding, enabling a broader set of business applications. For tasks that involve only text-based input, we recommend the Granite large language models, which are optimized for text-only processing and deliver better performance than this model on such workloads.
{ "num_ctx": 16384, "temperature": 0 }
Granite Vision 3.2 was evaluated against other vision-language models (VLMs) in the 1B-4B parameter range using the standard llms-eval benchmark suite. The evaluation spans multiple public benchmarks, with a particular focus on document understanding tasks, and also includes general visual question answering benchmarks.
| Benchmark | Molmo-E | InternVL2 | Phi3v | Phi3.5v | Granite Vision |
|---|---|---|---|---|---|
| Document benchmarks | | | | | |
| DocVQA | 0.66 | 0.87 | 0.87 | 0.88 | 0.89 |
| ChartQA | 0.60 | 0.75 | 0.81 | 0.82 | 0.87 |
| TextVQA | 0.62 | 0.72 | 0.69 | 0.70 | 0.78 |
| AI2D | 0.63 | 0.74 | 0.79 | 0.79 | 0.76 |
| InfoVQA | 0.44 | 0.58 | 0.55 | 0.61 | 0.64 |
| OCRBench | 0.65 | 0.75 | 0.64 | 0.64 | 0.77 |
| LiveXiv VQA | 0.47 | 0.51 | 0.61 | - | 0.61 |
| LiveXiv TQA | 0.36 | 0.38 | 0.48 | - | 0.57 |
| Other benchmarks | | | | | |
| MMMU | 0.32 | 0.35 | 0.42 | 0.44 | 0.37 |
| VQAv2 | 0.57 | 0.75 | 0.76 | 0.77 | 0.78 |
| RealWorldQA | 0.55 | 0.34 | 0.60 | 0.58 | 0.63 |
| VizWiz VQA | 0.49 | 0.46 | 0.57 | 0.57 | 0.63 |
| OK VQA | 0.40 | 0.44 | 0.51 | 0.53 | 0.56 |
```
{{- /* Tools */ -}}
{{- if .Tools -}}
<|start_of_role|>available_tools<|end_of_role|>
{{- range $index, $_ := .Tools }}
{{- $last := eq (len (slice $.Tools $index)) 1 }}
{{ . }}
{{- if not $last }}
{{ end}}
{{- end -}}
<|end_of_text|>
{{ end }}
{{- /* System Prompt */ -}}
{{- if and (gt (len .Messages) 0) (eq (index .Messages 0).Role "system") -}}
<|system|>
{{(index .Messages 0).Content}}
{{- else -}}
<|system|>
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
{{- end }}
{{- /*Main message loop*/ -}}
{{- range $index, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $index)) 1 }}
{{- if eq .Role "system" }}
{{- else if eq .Role "user" }}
<|user|>
{{.Content}}
{{- else if eq .Role "assistant" }}
<|assistant|>
{{- if .Content }}
{{.Content}}
<|end_of_text|>
{{ end }}
{{- else if eq .Role "assistant_tool_call" }}
<|start_of_role|>assistant<|end_of_role|><|tool_call|>{{.Content}}<|end_of_text|>
{{- else if eq .Role "tool_response" }}
<|start_of_role|>tool_response<|end_of_role|>{{.Content}}<|end_of_text|>
{{- end }}
{{- /* Add generation prompt */ -}}
{{ if $last }}
{{- if eq .Role "assistant" }}
{{- else }}
<|assistant|>
{{- end }}
{{- end }}
{{- end }}
```
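To make the template easier to read, here is a rough, illustrative sketch of the prompt string it renders for a single user turn with no tools and no explicit system message. The whitespace is approximate, the image itself is injected separately by the Ollama runtime, and the question string is just an example:

```python
# Illustrative only: an approximation of the prompt the template above produces
# for one user message, using the default system prompt. Ollama assembles this
# for you; exact whitespace may differ slightly.
rendered_prompt = (
    "<|system|>\n"
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
    "<|user|>\n"
    "What is the highest scoring model on ChartQA and what is its score?\n"
    "<|assistant|>\n"
)
print(rendered_prompt)
```

This matches the prompt layout used in the vLLM example further down.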
The Granite Vision model is natively supported in transformers>=4.49. Below is a simple example of how to use the granite-vision-3.2-2b model.
First, make sure you have an up-to-date version of transformers installed:
```shell
pip install "transformers>=4.49"
```
Then run the following code:
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import hf_hub_download
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "ibm-granite/granite-vision-3.2-2b"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)

# prepare image and text prompt, using the appropriate prompt template
img_path = hf_hub_download(repo_id=model_path, filename='example.png')

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": img_path},
            {"type": "text", "text": "What is the highest scoring model on ChartQA and what is its score?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(device)

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```
The model can also be loaded with vLLM. First, make sure the following libraries are installed:
```shell
pip install torch torchvision torchaudio
pip install vllm==0.6.6
```
Then, copy the code snippet from the section that is relevant to your use case:
```python
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from huggingface_hub import hf_hub_download
from PIL import Image

model_path = "ibm-granite/granite-vision-3.2-2b"

model = LLM(
    model=model_path,
    limit_mm_per_prompt={"image": 1},
)

sampling_params = SamplingParams(
    temperature=0.2,
    max_tokens=64,
)

# Define the question we want to answer and format the prompt
image_token = "<image>"
system_prompt = "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n"

question = "What is the highest scoring model on ChartQA and what is its score?"
prompt = f"{system_prompt}<|user|>\n{image_token}\n{question}\n<|assistant|>\n"
img_path = hf_hub_download(repo_id=model_path, filename='example.png')
image = Image.open(img_path).convert("RGB")
print(image)

# Build the inputs to vLLM; the image is passed as `multi_modal_data`.
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image,
    }
}

outputs = model.generate(inputs, sampling_params=sampling_params)
print(f"Generated text: {outputs[0].outputs[0].text}")
```