MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series, built on SigLip-400M and Qwen2-7B, with a total of 8 billion parameters. The model introduces significant performance improvements and new features for multi-image and video understanding, marking a substantial advance over its predecessor, MiniCPM-Llama3-V 2.5.
Key Features of MiniCPM-V 2.6:
- Leading Performance: MiniCPM-V 2.6 achieves an average score of 65.2 on OpenCompass, a comprehensive evaluation across eight popular benchmarks. With its 8 billion parameters, the model surpasses prominent proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding.
- Multi-Image Understanding and In-context Learning: Capable of conversation and reasoning over multiple images, MiniCPM-V 2.6 achieves state-of-the-art results on multi-image benchmarks including Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv. It also shows promising in-context learning abilities.
- Video Understanding: Accepting video inputs, MiniCPM-V 2.6 supports conversation and dense captioning over spatial-temporal information. It outperforms models such as GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B on Video-MME, both with and without subtitles.
- Strong OCR Capability: Processing images with various aspect ratios and up to 1.8 million pixels, MiniCPM-V 2.6 sets a new standard on OCRBench, outperforming proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. Leveraging the latest RLAIF-V and VisCPM techniques, it delivers trustworthy behavior with significantly lower hallucination rates on Object HalBench, and it supports multilingual capabilities across English, Chinese, German, French, Italian, and Korean.
- Superior Efficiency: Despite its compact size, MiniCPM-V 2.6 exhibits state-of-the-art token density, encoding a 1.8-million-pixel image into just 640 tokens, 75% fewer than most models (see the rough arithmetic after this list). This improves inference speed, first-token latency, memory usage, and power consumption, enabling efficient real-time video understanding on devices such as iPads.
- Ease of Use: MiniCPM-V 2.6 is versatile in deployment, supporting efficient CPU inference on local devices via llama.cpp and ollama, quantized models in int4 and GGUF formats in 16 sizes, vLLM support for high-throughput and memory-efficient inference, domain-specific fine-tuning, quick local WebUI demo setup with Gradio, and online web demos. A minimal inference sketch follows this list.
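For rough intuition on the token-density claim: 640 tokens for a 1.8-million-pixel image works out to roughly 2,800 pixels per visual token, and "75% fewer than most models" implies a typical baseline of about 2,560 tokens for the same image, or around 700 pixels per token. Since visual tokens feed directly into the language model's input sequence, fewer tokens per image is what drives the gains in speed, latency, and memory cited above.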
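To illustrate the ease-of-use point, here is a minimal single-image chat sketch following the usage pattern shown on the Hugging Face model card for openbmb/MiniCPM-V-2_6. The chat() method and message format come from the model's remote code rather than core Transformers, and the image path and question below are placeholders, so treat this as an illustrative sketch rather than a definitive recipe.

```python
# Minimal MiniCPM-V 2.6 chat sketch (assumes a CUDA GPU and a local image file).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6"

# trust_remote_code is required: the chat() interface ships with the model repo.
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Placeholder image and question; replace with your own inputs.
image = Image.open("example.jpg").convert("RGB")
question = "What is in this image?"

# For multi-image conversation, the content list can hold several images
# followed by the question, e.g. [img1, img2, question].
msgs = [{"role": "user", "content": [image, question]}]

answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```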
MiniCPM-V 2.6 represents a significant leap in machine learning for visual understanding, offering strong performance, efficiency, and usability across single-image, multi-image, and video processing tasks.
Check out the HF Model and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.