Paper Review: Building Rome in a Day

October 18, 2020
Reconstruction

📖 Link to the Paper - Building Rome in a Day
Agarwal, Sameer, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M. Seitz, and Richard Szeliski. "Building rome in a day." Communications of the ACM 54, no. 10 (2011): 105-112.

Main Contribution

The research problem in this paper is image matching and 3D reconstruction. The authors set out to reconstruct the city of Rome in 3D from online 2D photo collections within 24 hours, hence “building Rome in a day”. The work was inspired by the massive image collections available on the Internet, which offer an unprecedented opportunity for computer vision algorithms to recover the 3D information hidden in rich user-generated content.

The main contribution of this work is a novel parallel, distributed system that matches vast collections of 2D images and solves the large non-linear least squares problems of the 3D reconstruction stage effectively and efficiently.

Method

The pipeline the authors propose has several distinct stages: feature extraction (pre-processing), image matching, track generation, and geometric estimation. The method targets two major problems posed by massive, unorganized collections of 2D images taken with different cameras and from different viewpoints: the correspondence problem and the structure from motion (SFM) problem. For each stage of the pipeline, the authors experiment with different algorithms to explore their performance and scalability.

First, for the correspondence problem, the authors propose matching SIFT features with approximate nearest neighbor (ANN) search and cleaning up the matches with RANSAC by imposing the rigid scene constraint. At large scale, this problem is better framed as a graph estimation problem: with the images as vertices, we want to propose and then verify the set of edges connecting images that share correspondences. This graph is called the match graph. In this stage, the proposals are generated by whole-image similarity and query expansion. I find the idea of measuring whole-image similarity close to the concept of word embedding in natural language processing. Applying this idea to computer vision, the authors cluster the SIFT features into “visual words”, so that images become documents containing such visual words. Scoring images with a vocabulary tree under the TF-IDF scheme yields a sparsely connected match graph as the initial proposal, after which query expansion is used to increase the density of connections within each component of the graph.
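To make the proposal step concrete, here is a minimal sketch of TF-IDF scoring over visual words. It is my own simplification, not the authors' implementation: it assumes each image has already been reduced to a list of visual-word ids, uses a flat vocabulary rather than a vocabulary tree, and the function names `tfidf_vectors` and `propose_edges` and the parameter `k` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each image (a list of visual-word ids) into a unit TF-IDF vector."""
    n = len(docs)
    # Document frequency: in how many images does each visual word appear?
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        # L2-normalise so that a dot product becomes a cosine similarity.
        vectors.append({w: v / norm for w, v in vec.items()} if norm > 0 else {})
    return vectors


def propose_edges(docs, k=1):
    """Propose match-graph edges from each image to its k most similar images."""
    vecs = tfidf_vectors(docs)
    edges = set()
    for i, vi in enumerate(vecs):
        scores = []
        for j, vj in enumerate(vecs):
            if i == j:
                continue
            score = sum(v * vj.get(w, 0.0) for w, v in vi.items())
            scores.append((score, j))
        # Highest similarity first; break ties by image index.
        scores.sort(key=lambda t: (-t[0], t[1]))
        for score, j in scores[:k]:
            if score > 0:
                edges.add((min(i, j), max(i, j)))
    return edges


# Four toy "images": the first two share words, and so do the last two.
images = [[1, 2, 3, 3], [1, 2, 3, 4], [7, 8, 9], [7, 8, 10]]
proposals = propose_edges(images, k=1)
```

These proposed edges are only candidates. In the pipeline described above, each one would still have to be verified by actual feature matching and RANSAC before it enters the match graph.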

Later, in track generation, tracks can be viewed as the connected components of the graph formed by the individual feature matches behind the match graph described above. The second problem to be solved is the SFM problem: given corresponding points, we look for the 3D coordinates of the points of interest together with the camera parameters, including focal lengths.
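Track generation as connected components can be sketched with a small union-find over pairwise feature matches. This is an illustrative single-machine version, not the distributed procedure from the paper. The `(image, feature_index)` node format and the function name `build_tracks` are my own assumptions, and so is the rule that drops tracks visiting the same image twice.

```python
def build_tracks(feature_matches):
    """Group matched features into tracks.

    feature_matches: iterable of ((image_a, feat_a), (image_b, feat_b)) pairs.
    Returns a sorted list of tracks, each a sorted list of (image, feature) nodes.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Union the two endpoints of every verified feature match.
    for a, b in feature_matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each connected component.
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)

    tracks = []
    for members in groups.values():
        image_ids = [image for image, _ in members]
        # Assumption: a track that hits the same image twice is inconsistent.
        if len(members) >= 2 and len(image_ids) == len(set(image_ids)):
            tracks.append(sorted(members))
    return sorted(tracks)


matches = [
    (("A", 0), ("B", 3)),
    (("B", 3), ("C", 1)),
    (("A", 5), ("B", 7)),
    (("B", 7), ("A", 6)),  # links two different features of image A
]
tracks = build_tracks(matches)
```

In this toy input the first two matches chain into one track across images A, B, and C, while the last two form a component that visits image A twice and is therefore discarded.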

In this paper, this is done by first reconstructing a skeletal set of images and then incrementally improving the result with bundle adjustment. This design decision is intuitively justified by the large amount of redundancy found in Internet collections. However, it is unclear to me how many of the images can be bundled for such improvement. Lastly, the final reconstruction of the scene geometry within each cluster is done by a multiview stereo algorithm.
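As a reminder of what the non-linear least squares problem looks like, bundle adjustment is commonly written as the minimization of reprojection error. The notation below is my own, not the paper's: $C_i$ stands for the parameters of camera $i$ (pose and focal length), $X_j$ for the 3D position of point $j$, $x_{ij}$ for the observed 2D location of point $j$ in image $i$, $\pi$ for the projection function, and $V(i)$ for the set of points visible in image $i$.

```latex
\min_{\{C_i\},\,\{X_j\}} \; \sum_{i} \sum_{j \in V(i)} \left\| x_{ij} - \pi(C_i, X_j) \right\|^2
```

Every image added to the reconstruction contributes new camera parameters and new residual terms, which is why restricting the first pass to a skeletal set keeps the problem tractable.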

What do I think?

Overall, I think the authors came up with a neat pipeline and a distributed implementation for this task. One concern I have is that the pipeline seems to rely heavily on the feature embeddings that give rise to the connected components in the match graph. I wonder if this means that the most prominent components will dominate the process, so that the reconstructed 3D model is accurate only in these areas while the less prominent parts of the model are neglected. This leads to my question about the metrics used to evaluate the accuracy of the results. After reading the paper, it isn’t clear to me what metric they used to measure the quality of the results. Because the goal of this paper is so specific and custom-defined (building a 3D model of a city from Internet images in under 24 hours), it is harder to compare it to previous work. However, I wonder if the authors have compared their results with ground truth or a highly accurate 3D model of, say, the Colosseum, which would be a helpful addition to make this paper more sound and complete.

Another limitation I found is the lack of analysis of the problem space. The authors certainly addressed some problems with images downloaded from the Internet, such as diverse viewing angles and different zoom levels. I am curious to know whether the proposed pipeline can be used on a variety of scenes, whether it performs equally well on complex and simple scenes, and what role the skeletal sets play in 3D reconstruction. I think these are all interesting directions to explore and extend from this paper.

I found the skeletal sets idea similar to the critical point concept from the PointNet paper. On this note, I think it would be useful to have an algorithm that identifies the skeletal sets in an extensive unstructured image collection, capturing the most significant information so that one can navigate such a collection while avoiding as much redundancy as possible. Such an algorithm could help determine the crucial keyframes in a collection, and could also point to what is missing from the collection to guide further image collection.