Paper Review - PolyGen: An Autoregressive Generative Model of 3D Meshes

November 08, 2020
3D Shape Generation

📖 Link to the Paper - PolyGen: An Autoregressive Generative Model of 3D Meshes
Nash, Charlie, Yaroslav Ganin, S. M. Eslami, and Peter W. Battaglia. "PolyGen: An autoregressive generative model of 3D meshes." arXiv preprint arXiv:2002.10880 (2020).
💡 Link to Project Website - PolyGen

Main Contribution

The research problem in this paper is 3D object generation, specifically the generation of polygon meshes. The problem arises when creating virtual worlds for AR/VR applications, games, and other virtual AI environments where objects are built from 3D meshes. Meshes are known to be difficult to work with due to their unordered elements and discrete face structures, so they are often produced in a post-processing step from other representations. Inspired by the success of neural autoregressive models on complex, high-dimensional data, the authors propose a neural generative model that works directly with 3D meshes.

The main contribution of this paper is the proposal of PolyGen, a generative model of 3D objects that models the mesh directly, predicting vertices and faces sequentially with a Transformer-based architecture.

Method

As we know, a mesh consists of vertices and faces. The model is split into two parts that tackle these separately: an autoregressive vertex model that unconditionally models the mesh vertices, and an autoregressive face model that models the mesh faces conditioned on the input vertices. The model is trained on an augmented ShapeNet dataset, and renders of the processed ShapeNet meshes are created using Blender.
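In equations, this decomposition amounts to factorizing the joint distribution over a mesh M = (V, F) into a vertex model and a face model, each autoregressive over its sequence elements (a simplified sketch of the paper's factorization, with notation lightly adapted):

```latex
p(\mathcal{M}) = p(\mathcal{V}, \mathcal{F}) = p(\mathcal{F} \mid \mathcal{V})\, p(\mathcal{V}),
\qquad
p(\mathcal{V}) = \prod_{n} p(v_n \mid v_{<n}),
\qquad
p(\mathcal{F} \mid \mathcal{V}) = \prod_{n} p(f_n \mid f_{<n}, \mathcal{V})
```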

Before the vertex model, there is a preprocessing step that converts the triangulated ShapeNet meshes into n-gon meshes, i.e. meshes with variable-length polygons. This simplifies the modelling task by reducing the size of meshes and removing triangulation variability (the same polygon can be triangulated in different ways). The authors acknowledge a limitation of n-gons: a polygon with n > 3 vertices does not uniquely define a 3D surface. This is minor here because most of the resulting n-gons are planar or close to planar.
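As a toy illustration of the n-gon idea (not the authors' actual preprocessing pipeline), the sketch below merges two consistently oriented triangles that share an edge into a single quad when their normals agree; the function names and angle tolerance are my own choices:

```python
import numpy as np

def triangle_normal(verts, tri):
    """Unit normal of a triangle given by three vertex indices."""
    a, b, c = verts[tri[0]], verts[tri[1]], verts[tri[2]]
    n = np.cross(b - a, c - a)
    return n / (np.linalg.norm(n) + 1e-12)

def try_merge_pair(verts, tri1, tri2, angle_tol_deg=1.0):
    """Merge two consistently oriented triangles sharing an edge into a quad,
    if their normals agree within angle_tol_deg. Returns None otherwise."""
    n1, n2 = triangle_normal(verts, tri1), triangle_normal(verts, tri2)
    if np.degrees(np.arccos(np.clip(np.dot(n1, n2), -1.0, 1.0))) > angle_tol_deg:
        return None
    shared = set(tri1) & set(tri2)
    if len(shared) != 2:
        return None  # the triangles do not share exactly one edge
    # Find the shared edge as a directed edge (u, w) in tri1, then insert the
    # remaining vertex of tri2 between u and w to keep a consistent winding order.
    for i in range(3):
        u, w = tri1[i], tri1[(i + 1) % 3]
        if {u, w} == shared:
            x = (set(tri2) - shared).pop()
            return (tri1[(i + 2) % 3], u, x, w)
    return None

# Example: two triangles forming a unit square merge into one quad.
square = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
print(try_merge_pair(square, (0, 1, 2), (0, 2, 3)))  # (1, 2, 3, 0)
```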

The two models are trained individually: one models the vertices, and the other models the faces given the vertices. First, the vertex model takes in the mesh vertices, treated as one long sequence of coordinates sorted by z, y, x values and followed by a stopping token. Coordinates are quantized into discrete variables and modelled with a categorical distribution, similar to PixelCNN and WaveNet.
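A rough sketch of how such a vertex sequence could be built (the exact token bookkeeping and the coordinate order within each triple are my assumptions, not taken from the paper's code):

```python
import numpy as np

def vertices_to_sequence(vertices, n_bits=8):
    """Quantize vertex coordinates, sort vertices by (z, y, x), and flatten them
    into one long token sequence ending in a stopping token (id 0)."""
    n_bins = 2 ** n_bits
    v = np.asarray(vertices, dtype=np.float32)
    # Normalize to [0, 1] and quantize each coordinate into one of n_bins values.
    v = (v - v.min(0)) / (v.max(0) - v.min(0) + 1e-9)
    v = np.clip(np.round(v * (n_bins - 1)), 0, n_bins - 1).astype(np.int64)
    # Sort vertices by z, then y, then x (np.lexsort uses the last key as primary).
    v = v[np.lexsort((v[:, 0], v[:, 1], v[:, 2]))]
    # Flatten to (z, y, x, z, y, x, ...) and shift ids by 1 to reserve 0 for stopping.
    return (v[:, ::-1].reshape(-1) + 1).tolist() + [0]
```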

The authors mention a trade-off between mesh fidelity and mesh size in the choice of the number of quantization bins: fewer bins compress the mesh more (nearby vertices collapse together) but lose geometric detail. They find 8-bit quantization to be a good balance. With the current sequence of vertex coordinates as context, the autoregressive model outputs a predictive distribution for the next coordinate, and training maximizes the log-probability of the observed sequences. Second, the face model takes in a sequence of mesh faces, ordered by their lowest vertex index, then second-lowest, and so on. The faces are again flattened into one long sequence, with a new-face token separating faces and a stopping token at the end. To condition on the input vertex set, the vertices are first embedded with an encoder to obtain contextual vertex embeddings; each face-sequence element is then embedded using its corresponding vertex embedding, and a Transformer decoder outputs a pointer at each step.
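And a similar hedged sketch for flattening the faces into a sequence; the exact placement of the special tokens is my guess:

```python
def faces_to_sequence(faces, stop_token=0, new_face_token=1):
    """Flatten a list of n-gon faces (tuples of vertex indices) into one long
    index sequence: faces are sorted by their lowest vertex index (then the next
    lowest, and so on), separated by a new-face token, and ended by a stop token.
    Vertex indices are shifted by 2 so the two special tokens stay reserved."""
    faces = sorted(faces, key=lambda f: tuple(sorted(f)))
    seq = []
    for i, face in enumerate(faces):
        if i > 0:
            seq.append(new_face_token)
        seq.extend(idx + 2 for idx in face)
    seq.append(stop_token)
    return seq

# Example: two quads sharing an edge.
print(faces_to_sequence([(4, 5, 6, 7), (0, 1, 5, 4)]))
# -> [2, 3, 7, 6, 1, 6, 7, 8, 9, 0]
```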

The authors adopt the idea from Pointer Networks (replacing the LSTM with Transformers): the pointer vector is compared with the input embeddings via a dot product, and the resulting scores are normalized with a softmax to obtain a distribution over the input set. Evaluation is conducted mainly through log-likelihood, along with sanity checks such as Chamfer distance and comparing statistics of generated samples against the real data.
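A minimal sketch of that pointer mechanism, assuming the pointer and the contextual vertex embeddings share the same dimensionality (special tokens would simply be extra rows in the embedding matrix):

```python
import numpy as np

def pointer_distribution(pointer, vertex_embeddings):
    """Dot-product the decoder's pointer vector against every contextual vertex
    embedding and softmax the scores, yielding a distribution over the input set."""
    scores = vertex_embeddings @ pointer      # (num_vertices,) similarity scores
    scores = scores - scores.max()            # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()
```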

What do I think?

The authors demonstrated that mesh generation with an autoregressive model working directly on the mesh representation is feasible. The novelty of this model is that it directly models human-crafted mesh data, rather than the alternative 3D representations used in prior works. On the other hand, conditional mesh generation still relies on object representations such as images and voxels as context, which could be limiting.

Another limitation lies in the available data. As the authors note in the paper, training only on the ShapeNet dataset can result in overfitting due to its relatively small size. Besides, the Transformer has high memory requirements, which limits training meshes to at most 800 vertices and 2800 face indices (larger meshes are filtered out), so the model may not be suited to large and complex meshes.

We have seen examples of taking a successful model from one domain and repurposing it to solve a problem in a different domain; for example, Mask R-CNN extends a successful object detection network to instance segmentation, and Mesh R-CNN extends it further to 3D shape prediction. Once again, in this paper, PolyGen takes the Transformer, with its success on NLP tasks and its capacity to model complex data, and turns it into a high-performing 3D generative model.