Paper Review: Deep Convolutional Priors for Indoor Scene Synthesis
November 08, 2020

Main Contribution
The research problem in this paper is room-scale indoor scene synthesis. Generating virtual indoor environments and furniture layouts is a practical problem in many applications, such as gaming, AR/VR, furniture retail, and robot navigation. Inspired by CNNs' ability to learn to recognize and build internal representations of 2D images, and by the recent availability of large-scale 3D scene datasets, the authors propose to train a CNN-based model for the scene synthesis problem.
The main contribution of this paper is the first CNN-based system for synthesizing indoor scenes that takes only the geometry of the room as input and iteratively generates the room by adding one object at a time. This system is made possible by a second contribution: a novel orthographic top-down view of scenes that is both semantically meaningful and 2D-based.
Method
Before getting into the details of the generative model, the authors introduce the top-down view representation as a key piece of their approach. A guiding observation in indoor scene synthesis is that a 3D indoor scene can often be characterized by its 2D object layout, as in a floor plan. Taking this idea further, the authors convert each 3D scene into an image-based representation compatible with deep convolutional networks: an orthographic top-down depth render of the room, plus additional channels encoding semantic features such as masks, orientation, and category information. The conversion is performed on scenes selected from the SUNCG dataset. The resulting top-down views can then be learned from and analyzed by CNNs.
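To make this concrete, below is a minimal NumPy sketch of how such a multi-channel top-down tensor could be assembled. The exact channel layout and the `make_topdown_view` helper are my assumptions for illustration, not the paper's exact encoding.

```python
import numpy as np

def make_topdown_view(depth, room_mask, wall_mask, object_mask,
                      orientation, category_ids, num_categories):
    """Stack per-pixel channels into one (C, H, W) array.

    depth:        (H, W) orthographic top-down depth render
    room_mask:    (H, W) float mask of the walkable floor area
    wall_mask:    (H, W) float mask of wall locations
    object_mask:  (H, W) float mask of already-placed objects
    orientation:  (2, H, W) per-pixel sin/cos of object orientation
    category_ids: (H, W) integer category id per pixel (0 = empty)
    """
    # One-hot category channels so the CNN sees semantics, not raw ids.
    category_channels = np.stack(
        [(category_ids == c).astype(np.float32)
         for c in range(num_categories)]
    )
    channels = [depth[None], room_mask[None], wall_mask[None],
                object_mask[None], orientation]
    return np.concatenate(channels + [category_channels], axis=0)
```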
The generative model synthesizes a scene iteratively by placing one object at a time; each iteration consists of three steps: deciding whether to add another object, deciding the category and location of that object, and placing a concrete object instance in the scene. In the first step, a multilayer feed-forward neural network processes the current per-category object counts together with high-level features of the top-down view representation extracted by a CNN.
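Here is a hedged PyTorch sketch of what this first step could look like; the encoder architecture, layer sizes, and the `ContinuePredictor` name are my assumptions, with only the overall structure (CNN features concatenated with object counts, fed to a feed-forward classifier) taken from the paper's description.

```python
import torch
import torch.nn as nn

class ContinuePredictor(nn.Module):
    """Step 1: should another object be added to the scene?"""

    def __init__(self, in_channels, num_categories, hidden=256):
        super().__init__()
        # Small conv stack standing in for the paper's CNN feature extractor.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Feed-forward head over CNN features + per-category object counts.
        self.head = nn.Sequential(
            nn.Linear(128 + num_categories, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, topdown, counts):
        feats = self.encoder(topdown).flatten(1)  # (B, 128)
        x = torch.cat([feats, counts], dim=1)
        return self.head(x)  # (B, 1) logit for "add another object"
```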
As the authors point out, the first two steps are strongly correlated, so they learn a conditional distribution for the second step. An additional attention-mask channel is fed to a second CNN-based model, which produces a softmax distribution over object categories at each location. Intuitively, a finished room layout should retain enough unoccupied space to permit movement, so the authors augment the set of object categories with several auxiliary categories, which make up the majority of the training set in this step.
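Below is a sketch of one plausible realization of this step: a fully convolutional network outputs per-pixel logits over real plus auxiliary categories, and the softmax is taken jointly over all (category, location) pairs. The architecture and the joint-sampling helper are assumptions on my part, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoryLocationModel(nn.Module):
    """Step 2: per-pixel logits over real + auxiliary categories."""

    def __init__(self, in_channels, num_categories, num_aux):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels + 1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_categories + num_aux, 1),
        )

    def forward(self, topdown, attention_mask):
        # The extra attention-mask channel makes the distribution
        # conditional, as described above.
        x = torch.cat([topdown, attention_mask], dim=1)
        return self.net(x)  # (B, C, H, W)

def sample_category_and_location(logits):
    """Sample one (category, row, col) from the joint softmax."""
    B, C, H, W = logits.shape
    probs = F.softmax(logits.view(B, -1), dim=1)
    idx = torch.multinomial(probs, 1).squeeze(1)
    cat = torch.div(idx, H * W, rounding_mode="floor")
    yx = idx % (H * W)
    return cat, torch.div(yx, W, rounding_mode="floor"), yx % W
```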
Lastly, given the location and object category, the model retrieves a 3D model and places it in the room with an orientation. The instance orientation is predicted by another CNN that takes the top-down view and a mask of the instance's geometry as inputs.
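The sketch below shows one workable formulation of this orientation step; regressing the sine and cosine of the rotation angle is my assumption, not necessarily the authors' exact output parameterization.

```python
import torch
import torch.nn as nn

class OrientationModel(nn.Module):
    """Step 3: predict the placed instance's orientation."""

    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels + 1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 2),  # (sin, cos), an assumed parameterization
        )

    def forward(self, topdown, instance_mask):
        # Condition on the scene plus a mask of the candidate instance.
        x = torch.cat([topdown, instance_mask], dim=1)
        sincos = self.net(x)
        # Project onto the unit circle so the pair encodes a valid angle.
        return sincos / sincos.norm(dim=1, keepdim=True).clamp_min(1e-6)
```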
What do I think?
To train the above three steps, the training data is filtered from the SUNCG dataset and manipulated in intuitively sensible ways: for example, an even split of positive and negative examples for steps 1 and 3, and 95% auxiliary-category examples in the training set for step 2.
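As a toy illustration of this kind of balancing, here is a sketch of assembling a step-2 batch with a fixed auxiliary-category fraction; the helper name and batching scheme are mine, and only the 95% ratio comes from the paper's setup as summarized above.

```python
import random

def build_step2_batch(real_examples, aux_examples,
                      batch_size=128, aux_fraction=0.95):
    """Mix auxiliary-category and real-category examples at a fixed ratio."""
    n_aux = int(batch_size * aux_fraction)
    batch = random.sample(aux_examples, n_aux)
    batch += random.sample(real_examples, batch_size - n_aux)
    random.shuffle(batch)
    return batch
```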
However, I think the component-wise training introduces more uncertainty and time complexity compared to an end-to-end trainable network, and it could be difficult to optimize each component separately. Since the model is iterative and reasons only about local scene plausibility, inference can be relatively slow (about 4 minutes per scene, as noted in the paper).
This limitation has been addressed in the follow-up paper [1], which factorizes the step of adding each object into a different sequence of decisions to enable global reasoning. This also resolves some failure cases, such as multiple nightstands clustering around a bed because the bed is a strong cue for nightstand placement. It is noteworthy that the paper includes a detailed section acknowledging limitations and possible extensions, including the restricted problem domain and the speed limitation.
Overall, I think the top-down view representation is promising for studying indoor scene synthesis because it reaps the benefits of deep convolutional networks and their powerful image-reasoning capabilities.
Reference
[1] Ritchie, D., Wang, K., & Lin, Y. A. (2019). Fast and flexible indoor scene synthesis via deep convolutional generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6182-6190).