ShapeTalk: A Language Dataset and Framework for 3D Shape Edits and Deformations

Snap Inc. · KAIST · Stanford University

News

  • [August 22, 2023] The codebase and the pretrained models are publicly released.
  • [March 5, 2023] A version of this work was accepted to CVPR 2023.

Abstract

Editing 3D geometry is a challenging task requiring specialized skills. In this work, we aim to facilitate the task of editing the geometry of 3D models through the use of natural language. For example, we may want to modify a 3D chair model to “make its legs thinner” or to “open a hole in its back”. To tackle this problem in a manner that promotes open-ended language use and enables fine-grained shape edits, we introduce the most extensive existing corpus of natural language utterances describing shape differences: ShapeTalk. ShapeTalk contains over half a million discriminative utterances produced by contrasting the shapes of common 3D objects across a variety of object classes and degrees of similarity. We also introduce a generic framework, ChangeIt3D, which builds on ShapeTalk and can use an arbitrary 3D generative model of shapes to produce edits that better align the output with the edit or deformation description. Finally, we introduce metrics for the quantitative evaluation of language-assisted shape editing methods that reflect key desiderata of this editing setup. We note that our modules are trained and deployed directly in a latent space of 3D shapes, bypassing the ambiguities of lifting 2D to 3D that arise when using extant foundation models, and thus opening a new avenue for 3D object-centric manipulation through language.

The ShapeTalk Dataset

ShapeTalk covers 30 common object classes, with 536K contrastive utterances. Samples of those utterances are shown above. In each sub-box, the shape differences between a target and a distractor object of the same class are enumerated by an annotator, in decreasing order of importance according to the annotator's judgment. Interestingly, both continuous and discrete geometric features that shapes share across categories emerge in the language of ShapeTalk; e.g., humans describe the “thinness” of a chair leg or of a vase lip (top row), or the presence of an “arm” that a lamp or a clock might have (bottom row).
Key characteristics of ShapeTalk. ShapeTalk's corpus describes the shapes of a large variety of common 3D objects in a rich and (by construction) discriminative manner. Shape parts, geometric attributes, and dimensional specifications are among the main properties that annotators include in their references; see prototypical words for these properties (right, top). Interestingly, when the compared objects are on average more similar in shape ("all hard class"), part-based and local references are more frequent than when contrasting less similar ("all easy class") shapes.
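
For readers who want to inspect the corpus programmatically, the sketch below shows one hypothetical way to load and filter the annotations with pandas; the file name and column names are illustrative placeholders, not the released schema.

# Minimal sketch for exploring ShapeTalk annotations, assuming they ship as a CSV.
# NOTE: "shapetalk_annotations.csv" and the column names below are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("shapetalk_annotations.csv")  # hypothetical path

# Each row contrasts a target shape with a same-class distractor via one discriminative utterance.
print("total utterances:", len(df))
print(df.columns.tolist())  # e.g., ["utterance", "object_class", "target_id", "distractor_id", ...]

# Example: keep only chair comparisons and peek at a few utterances.
chairs = df[df["object_class"] == "chair"]  # "object_class" is an assumed column name
print(chairs["utterance"].head())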

Browse

You can browse the ShapeTalk annotations here.


License & Download

ChangeIt3D Architecture

Overview of ChangeIt3DNet, our modular framework for the ChangeIt3D task. In Stage 1, we pretrain a shape autoencoder (using traditional reconstruction losses), freeze the encoder, and use the encoded latents of the target and distractor to pretrain a neural listener (using classification losses). In Stage 2, we use the pretrained autoencoder and neural listener to train a shape-editor module that edits shapes within the encoded latent space in a way that is both consistent with the language instruction and minimal. Modules marked with locks have frozen weights.
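
To make the Stage-2 training step concrete, below is a minimal, hypothetical PyTorch sketch: a frozen neural listener scores whether the edited latent matches the utterance, while an L2 term keeps the edit minimal with respect to the input shape. Module names, dimensions, and the loss weight are illustrative assumptions, not the released implementation.

# Hedged sketch of latent-space editing guided by a frozen neural listener.
import torch
import torch.nn as nn

LATENT_DIM, LANG_DIM = 256, 512  # assumed sizes

class NeuralListener(nn.Module):
    """Scores how well an utterance discriminates latent A from latent B."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * LATENT_DIM + LANG_DIM, 256), nn.ReLU(),
            nn.Linear(256, 2))  # logits over {A is the target, B is the target}

    def forward(self, z_a, z_b, lang):
        return self.mlp(torch.cat([z_a, z_b, lang], dim=-1))

class LatentEditor(nn.Module):
    """Maps (original latent, utterance embedding) to an edited latent."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(LATENT_DIM + LANG_DIM, 512), nn.ReLU(),
            nn.Linear(512, LATENT_DIM))

    def forward(self, z, lang):
        return z + self.mlp(torch.cat([z, lang], dim=-1))  # predict a residual edit

# Stage-1 modules are assumed pretrained; the listener stays frozen here.
listener = NeuralListener().eval()
for p in listener.parameters():
    p.requires_grad_(False)

editor = LatentEditor()
opt = torch.optim.Adam(editor.parameters(), lr=1e-4)

# One illustrative training step on random stand-in data.
z_src = torch.randn(8, LATENT_DIM)  # latents of shapes to edit (from the frozen encoder)
lang = torch.randn(8, LANG_DIM)     # embeddings of edit utterances (e.g., "make its legs thinner")
z_edit = editor(z_src, lang)

# Listener loss: the edited shape should be judged as the one matching the utterance...
logits = listener(z_edit, z_src, lang)
listener_loss = nn.functional.cross_entropy(logits, torch.zeros(8, dtype=torch.long))
# ...while staying close to the original shape (edit minimality).
identity_loss = (z_edit - z_src).pow(2).mean()

loss = listener_loss + 0.1 * identity_loss  # weight is an arbitrary placeholder
loss.backward()
opt.step()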

Qualitative Results

Qualitative edits produced by ChangeIt3DNet. The results are derived using an ImNet-based AE operating on an implicit shape field. The achieved edits are oftentimes local, e.g., thinner legs, fine-grained, as in slatted back, or entail high-level and complex shape understanding, e.g., it appears more sturdy. Remarkably, these edits are produced by ChangeIt3DNet without any form of explicit local geometric prior on shapes (part-based or otherwise); it learns solely from the implicit bias of training with referential language.
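
As a rough illustration of how an edited latent can be turned back into geometry with an implicit decoder (in the spirit of the ImNet-based AE mentioned above), the sketch below queries a stand-in occupancy network on a dense grid and extracts a surface with marching cubes. The decoder architecture and the iso-level handling are assumptions, and scikit-image is assumed to be available.

# Hedged sketch: decode an edited latent into a mesh via an implicit field + marching cubes.
import torch
import torch.nn as nn
from skimage.measure import marching_cubes

LATENT_DIM = 256

class ImplicitDecoder(nn.Module):
    """Predicts an occupancy value for each query point, conditioned on a shape latent."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(LATENT_DIM + 3, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, z, pts):
        z_rep = z.expand(pts.shape[0], -1)
        return self.mlp(torch.cat([z_rep, pts], dim=-1)).squeeze(-1)

decoder = ImplicitDecoder().eval()          # stand-in for a trained implicit decoder
z_edited = torch.randn(1, LATENT_DIM)       # stand-in for an editor output

# Evaluate the occupancy field on a dense grid.
res = 64
lin = torch.linspace(-1.0, 1.0, res)
grid = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), dim=-1).reshape(-1, 3)
with torch.no_grad():
    occ = decoder(z_edited, grid).reshape(res, res, res).numpy()

# With a trained occupancy decoder one would typically extract the 0.5 level set;
# the mean is used here only so the untrained stand-in stays within range.
verts, faces, _, _ = marching_cubes(occ, level=float(occ.mean()))
print(verts.shape, faces.shape)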

Citations

If you find our work useful in your research, please consider citing:

@inproceedings{achlioptas2023shapetalk,
    title={{ShapeTalk}: A Language Dataset and Framework for 3D Shape Edits and Deformations},
    author={Achlioptas, Panos and Huang, Ian and Sung, Minhyuk and Tulyakov, Sergey and Guibas, Leonidas},    
    booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},    
    year={2023}}

If you use the ShapeTalk dataset, please also consider citing our previous paper/dataset, ShapeGlot, which was critical in building and analyzing ShapeTalk:

@inproceedings{achlioptas2019shapeglot,
    title={{ShapeGlot}: Learning Language for Shape Differentiation},
    author={Achlioptas, Panos and Fan, Judy and Hawkins, Robert and Goodman, Noah and Guibas, Leonidas},
    booktitle={International Conference on Computer Vision (ICCV)},
    year={2019}}

Acknowledgements

This work is funded by a Vannevar Bush Faculty Fellowship, ARL grant W911NF-21-2-0104, and a gift from Snap Inc. Panos Achlioptas wishes to thank the following researchers for their advice and help: Iro Armeni (data collection), Nikos Gkanatsios (neural listening), Ahmed Abdelreheem (rendering), Yan Zheng and Ruojin Cai (SGF deployment), Antonia Saravanou and Mingyi Lu (relevant discussions), and Menglei Chai (CLIP-NeRF). Last but not least, the authors want to express their gratitude to all the hard-working Amazon Mechanical Turkers, without whom this work would not have been possible.