This&That: Language-Gesture Controlled Video Generation for Robot Planning

Boyang Wang, Nikhil Sridhar, Chao Feng, Mark Van der Merwe, Adam Fishman, Nima Fazeli, Jeong Joon Park

[Github] [ArXiv] [Project Page]

This&That is a language-gesture-image-conditioned video generation model for robot planning. This demo targets a robotics scenario based on the Bridge dataset.

This demo focuses on the video diffusion model. Only the VGL mode (image + language + gesture conditioned) is provided here; the complete test code and all pretrained weights are available in the GitHub repository.
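As a rough illustration of what VGL-mode conditioning involves, the sketch below packages the three signals (initial image, language instruction, gesture clicks). All names here are illustrative placeholders, not the project's actual API; see the test code in the GitHub repository for the real interface.

```python
from PIL import Image

# Minimal sketch (not the project's API): the three conditioning signals of VGL mode.
first_frame = Image.new("RGB", (384, 256))       # image condition: the initial observation
language = "put the red cup into the sink"       # language condition: the task instruction
gesture_points = [(120, 200), (310, 150)]        # gesture condition: (x, y) clicks for "this" and "that"

conditions = {"image": first_frame, "prompt": language, "gestures": gesture_points}
# A VGL-conditioned video diffusion model would consume these conditions and return
# a short video plan, e.g. frames = model.generate(**conditions)  # hypothetical call
```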

Note: The default gesture point indices are [4, 10] (5th and 11th) for two gesture points, or [4] (5th) for one gesture point.

Note: Currently, the supported resolution is 256x384.
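If your input image is not already at the supported resolution, it needs to be resized first. Below is a minimal sketch using Pillow; it assumes 256 is the height and 384 the width (check the repository's preprocessing for the exact convention and resampling filter).

```python
from PIL import Image

def resize_to_demo_resolution(path: str) -> Image.Image:
    """Resize an arbitrary input image to the demo resolution.

    Assumes 256x384 means height=256, width=384; PIL's resize takes (width, height).
    """
    img = Image.open(path).convert("RGB")
    return img.resize((384, 256), Image.BICUBIC)
```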

Note: Click "Clear All" to reset everything, or "Undo Point" to remove the last gesture point.

Note: The first run may take longer. Clicking "Clear All" before each run is the safest option.

If you find This&That helpful, please star the GitHub repo. Thank you!
