FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing

Text-driven motion generation has achieved substantial progress with the emergence of diffusion models. However, existing methods still struggle to generate complex motion sequences that correspond to fine-grained descriptions, depicting detailed and accurate spatio-temporal actions. This lack of fi...

Bibliographic Details
Main Authors:	Zhang, Mingyuan, Li, Huirong, Cai, Zhongang, Ren, Jiawei, Yang, Lei, Liu, Ziwei
Format:	Text
Language:	unknown
Published:	2023
Subjects:	Computer Science - Computer Vision and Pattern Recognition
Online Access:	http://arxiv.org/abs/2312.15004

id	ftarxivpreprints:oai:arXiv.org:2312.15004
record_format	openpolar
spelling	ftarxivpreprints:oai:arXiv.org:2312.15004 2024-01-28T10:08:57+01:00 FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing Zhang, Mingyuan Li, Huirong Cai, Zhongang Ren, Jiawei Yang, Lei Liu, Ziwei 2023-12-22 http://arxiv.org/abs/2312.15004 unknown http://arxiv.org/abs/2312.15004 Computer Science - Computer Vision and Pattern Recognition text 2023 ftarxivpreprints 2023-12-31T02:11:58Z Text-driven motion generation has achieved substantial progress with the emergence of diffusion models. However, existing methods still struggle to generate complex motion sequences that correspond to fine-grained descriptions, depicting detailed and accurate spatio-temporal actions. This lack of fine controllability limits the usage of motion generation to a larger audience. To tackle these challenges, we present FineMoGen, a diffusion-based motion generation and editing framework that can synthesize fine-grained motions, with spatial-temporal composition to the user instructions. Specifically, FineMoGen builds upon diffusion model with a novel transformer architecture dubbed Spatio-Temporal Mixture Attention (SAMI). SAMI optimizes the generation of the global attention template from two perspectives: 1) explicitly modeling the constraints of spatio-temporal composition; and 2) utilizing sparsely-activated mixture-of-experts to adaptively extract fine-grained features. To facilitate a large-scale study on this new fine-grained motion generation task, we contribute the HuMMan-MoGen dataset, which consists of 2,968 videos and 102,336 fine-grained spatio-temporal descriptions. Extensive experiments validate that FineMoGen exhibits superior motion generation quality over state-of-the-art methods. Notably, FineMoGen further enables zero-shot motion editing capabilities with the aid of modern large language models (LLM), which faithfully manipulates motion sequences with fine-grained instructions. 
Project Page: https://mingyuan-zhang.github.io/projects/FineMoGen.html Comment: Accepted to NeurIPS 2023 Text sami ArXiv.org (Cornell University Library) Mogen ENVELOPE(87.933,87.933,68.133,68.133)
institution	Open Polar
collection	ArXiv.org (Cornell University Library)
op_collection_id	ftarxivpreprints
language	unknown
topic	Computer Science - Computer Vision and Pattern Recognition
spellingShingle	Computer Science - Computer Vision and Pattern Recognition Zhang, Mingyuan Li, Huirong Cai, Zhongang Ren, Jiawei Yang, Lei Liu, Ziwei FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing
topic_facet	Computer Science - Computer Vision and Pattern Recognition
description	Text-driven motion generation has achieved substantial progress with the emergence of diffusion models. However, existing methods still struggle to generate complex motion sequences that correspond to fine-grained descriptions, depicting detailed and accurate spatio-temporal actions. This lack of fine controllability limits the usage of motion generation to a larger audience. To tackle these challenges, we present FineMoGen, a diffusion-based motion generation and editing framework that can synthesize fine-grained motions, with spatial-temporal composition to the user instructions. Specifically, FineMoGen builds upon diffusion model with a novel transformer architecture dubbed Spatio-Temporal Mixture Attention (SAMI). SAMI optimizes the generation of the global attention template from two perspectives: 1) explicitly modeling the constraints of spatio-temporal composition; and 2) utilizing sparsely-activated mixture-of-experts to adaptively extract fine-grained features. To facilitate a large-scale study on this new fine-grained motion generation task, we contribute the HuMMan-MoGen dataset, which consists of 2,968 videos and 102,336 fine-grained spatio-temporal descriptions. Extensive experiments validate that FineMoGen exhibits superior motion generation quality over state-of-the-art methods. Notably, FineMoGen further enables zero-shot motion editing capabilities with the aid of modern large language models (LLM), which faithfully manipulates motion sequences with fine-grained instructions. Project Page: https://mingyuan-zhang.github.io/projects/FineMoGen.html Comment: Accepted to NeurIPS 2023
format	Text
author	Zhang, Mingyuan Li, Huirong Cai, Zhongang Ren, Jiawei Yang, Lei Liu, Ziwei
author_facet	Zhang, Mingyuan Li, Huirong Cai, Zhongang Ren, Jiawei Yang, Lei Liu, Ziwei
author_sort	Zhang, Mingyuan
title	FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing
title_short	FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing
title_full	FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing
title_fullStr	FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing
title_full_unstemmed	FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing
title_sort	finemogen: fine-grained spatio-temporal motion generation and editing
publishDate	2023
url	http://arxiv.org/abs/2312.15004
long_lat	ENVELOPE(87.933,87.933,68.133,68.133)
geographic	Mogen
geographic_facet	Mogen
genre	sami
genre_facet	sami
op_relation	http://arxiv.org/abs/2312.15004
_version_	1789338278900531200

Similar Items