[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
An open-source implementation for fine-tuning the Qwen-VL series by Alibaba Cloud.
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
A Framework of Small-scale Large Multimodal Models
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.
A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".
Official implementation of SEED-LLaMA (ICLR 2024).
Multimodal Chinese LLaMA & Alpaca large language models (VisualCLA)
[IEEE Transactions on Medical Imaging/TMI 2023] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
[ICCV2021 & TPAMI2023] Vision-Language Transformer and Query Generation for Referring Segmentation
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
💐Kaleido-BERT: Vision-Language Pre-training on Fashion Domain