video-captioning

X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).

Python
970
2 years ago

pytorch implementation of video captioning

Python
399
6 years ago

Video to Text: natural language description generator for a given video. [Video Captioning]

Python
343
3 years ago

We introduce temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of multimodal foundation models (MFMs). This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across QA, captioning, and retrieval tasks.

Python
307
3 months ago
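
A purely hypothetical sketch of what a "plug-and-play" temporal selection module can look like in front of a frozen multimodal foundation model; the class name, scoring mechanism, and dimensions below are assumptions for illustration and are not the paper's actual TWM implementation.

```python
# Hypothetical plug-in sketch: score per-frame visual tokens and keep the top-k
# (in temporal order) before they reach a frozen multimodal foundation model.
# NOT the paper's TWM; the scorer here is a simple query-agnostic linear layer.
import torch
import torch.nn as nn

class FrameTokenSelector(nn.Module):
    def __init__(self, dim=768, keep=8):
        super().__init__()
        self.keep = keep
        self.score = nn.Linear(dim, 1)  # assumed relevance scorer

    def forward(self, frame_tokens):
        # frame_tokens: (batch, num_frames, dim) visual tokens from the backbone
        scores = self.score(frame_tokens).squeeze(-1)                    # (batch, num_frames)
        idx = scores.topk(self.keep, dim=1).indices.sort(dim=1).values   # keep temporal order
        idx = idx.unsqueeze(-1).expand(-1, -1, frame_tokens.size(-1))
        return torch.gather(frame_tokens, 1, idx)                        # (batch, keep, dim)

tokens = torch.randn(2, 32, 768)           # 32 frame tokens per video
print(FrameTokenSelector()(tokens).shape)  # torch.Size([2, 8, 768]) -> fed to the MFM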

[NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale

Jupyter Notebook
190
1 year ago

[ACL 2020] PyTorch code for MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

Jupyter Notebook
170
4 years ago

This repository contains the code for a video captioning system inspired by Sequence to Sequence -- Video to Text. This system takes as input a video and generates a caption in English describing the video.

Python
166
6 years ago
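
A minimal sketch of the kind of encoder-decoder pipeline the entry above describes (a Sequence to Sequence -- Video to Text style model), assuming frame features have already been extracted by a CNN; the module names, dimensions, and greedy decoding loop are illustrative, not the repository's actual API.

```python
# Minimal S2VT-style encoder-decoder sketch (illustrative, not the repo's code).
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000, bos_id=1):
        super().__init__()
        self.bos_id = bos_id
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)  # reads frame features
        self.embed = nn.Embedding(vocab_size, hidden)                # word embeddings
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)     # generates the caption
        self.proj = nn.Linear(hidden, vocab_size)                    # hidden state -> vocab logits

    @torch.no_grad()
    def greedy_caption(self, frame_feats, max_len=20):
        # frame_feats: (batch, num_frames, feat_dim) pre-extracted CNN features
        _, state = self.encoder(frame_feats)         # video summary lives in the final LSTM state
        tok = torch.full((frame_feats.size(0), 1), self.bos_id, dtype=torch.long)
        out = []
        for _ in range(max_len):
            emb = self.embed(tok)                    # (batch, 1, hidden)
            dec, state = self.decoder(emb, state)
            tok = self.proj(dec).argmax(-1)          # pick the most likely next word
            out.append(tok)
        return torch.cat(out, dim=1)                 # (batch, max_len) token ids

feats = torch.randn(2, 30, 2048)  # 2 videos, 30 frames of features each
print(VideoCaptioner().greedy_caption(feats).shape)  # torch.Size([2, 20])
```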

[NeurIPS 2022 Spotlight] Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

Python
133
1 year ago

A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions.

Python
129
3 months ago

A summary of Video-to-Text datasets. This repository is part of the review paper *Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review*

Jupyter Notebook
122
1 year ago

A curated list of multimodal-captioning-related research (including image captioning, video captioning, and text captioning)

110
3 years ago

[ECCV 2020] PyTorch code of MMT (a multimodal transformer captioning model) on TVCaption dataset

Python
90
2 years ago

A video captioning deep learning model implemented on the PyTorch platform with a Transformer architecture. The video captioning task: given an input video, output one sentence describing the content of the whole video (assuming the video is short enough to be described in a single sentence). The main goal of this repo is to help visually impaired people enjoy online videos and perceive their surroundings, and to promote the development of "accessible video".

Python
87
3 years ago

A PyTorch implementation of state-of-the-art video captioning models from 2015-2019 on the MSVD and MSRVTT datasets.

Python
71
2 years ago

What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment [CVPR 2019]

Python
68
5 months ago

[AAAI 2023 Oral] VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

Jupyter Notebook
66
1 year ago

CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations, ICCV 2021

Python
62
3 years ago
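
A generic symmetric video-text contrastive (InfoNCE) loss of the kind this line of work builds on; this is not CrossCLR's exact objective (which adds intra-modality and sample-weighting terms), only the standard cross-modal baseline with assumed embedding shapes.

```python
# Generic video-text contrastive (InfoNCE) loss sketch -- a common baseline for
# cross-modal representation learning; not CrossCLR's exact objective.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim) embeddings of paired clips and captions
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                  # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0))               # matching pairs lie on the diagonal
    loss_v2t = F.cross_entropy(logits, targets)     # video -> text retrieval direction
    loss_t2v = F.cross_entropy(logits.T, targets)   # text -> video retrieval direction
    return (loss_v2t + loss_t2v) / 2

video_emb, text_emb = torch.randn(8, 256), torch.randn(8, 256)
print(cross_modal_contrastive_loss(video_emb, text_emb).item())
```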

Video captioning baseline models on the Video2Commonsense dataset.

Python
56
4 years ago