
Mesoscopic photogrammetry with an unstabilized phone camera

Kevin C. Zhou,* Colin Cooke, Jaehee Park, Ruobing Qian,
Roarke Horstmeyer, Joseph A. Izatt, Sina Farsiu
Duke University, Durham, NC
*[email protected]
Abstract

We present a feature-free photogrammetric technique that enables quantitative 3D mesoscopic (mm-scale height variation) imaging with tens-of-micron accuracy from sequences of images acquired by a smartphone at close range (several cm) under freehand motion without additional hardware. Our end-to-end, pixel-intensity-based approach jointly registers and stitches all the images by estimating a coaligned height map, which acts as a pixel-wise radial deformation field that orthorectifies each camera image to allow homographic registration. The height maps themselves are reparameterized as the output of an untrained encoder-decoder convolutional neural network (CNN) with the raw camera images as the input, which effectively removes many reconstruction artifacts. Our method also jointly estimates both the camera’s dynamic 6D pose and its distortion using a nonparametric model, the latter of which is especially important in mesoscopic applications when using cameras not designed for imaging at short working distances, such as smartphone cameras. We also propose strategies for reducing computation time and memory, applicable to other multi-frame registration problems. Finally, we demonstrate our method using sequences of multi-megapixel images captured by an unstabilized smartphone on a variety of samples (e.g., painting brushstrokes, circuit board, seeds).

Figure 1: Our method jointly stitches multi-megapixel images acquired under freehand motion at close range and reconstructs high-accuracy height maps without precalibration of camera distortion.

1 Introduction

The photogrammetric problem of reconstructing 3D representations of an object or scene from 2D images taken from multiple viewpoints is common and well studied, featuring prominently in techniques such as multi-view stereo (MVS) [10], structure from motion (SfM) [42, 48, 38], and simultaneous localization and mapping (SLAM) [8]. Implicit in these 3D reconstructions is knowledge of the camera parameters, such as camera position, orientation, and distortions, which are jointly estimated in SfM and SLAM. Photogrammetry tools have been developed and applied to both long-range, macro-scale applications [34], such as building-scale reconstructions or aerial topographical mapping, and close-range, meter-scale applications [27], such as industrial metrology. However, comparatively less work has been done to push photogrammetry to mesoscopic (mm variation) and microscopic scales, where additional issues arise, such as more limited depths of field and increased impact of camera distortion. Existing approaches at smaller scales typically require very careful camera distortion precalibration, expensive cameras, dedicated setups that allow well-controlled camera or sample motion (e.g., with a dedicated rig), or attachment of control points to the object [27].

Here, we show that a smartphone is capable of obtaining quantitative 3D mesoscopic images of objects with 100 µm- to mm-scale height variations at tens-of-micron accuracies with unstabilized, freehand motion and without precalibration of camera distortion (Fig. 1). To achieve this, we present a new photogrammetric reconstruction algorithm that simultaneously stitches the multi-perspective images after warping to a common reference frame, reconstructs the sample's 3D height profile, and estimates the camera's position, orientation, and distortion (via a piecewise linear, nonparametric model) in an end-to-end fashion without relying on feature point extraction and matching. We emphasize that our careful modeling of distortion is especially important for mesoscopic applications. Our method also features a reparameterization of the camera-centric height maps as the outputs of a single untrained convolutional neural network (CNN) with the raw camera images as the input (akin to the deep image prior (DIP) [43]), which is optimized instead of the height maps themselves. Since the camera-centric height maps are by design coaligned with the camera images, they are automatically registered once the camera images are registered. As we will demonstrate, both the use of an untrained CNN and careful modeling of distortion substantially reduce reconstruction artifacts, thus allowing high-accuracy height map estimation without camera precalibration or stabilization.

2 Related work

The majority of general-purpose implementations of SfM, MVS, and photogrammetry, as well as of more general image registration and stitching pipelines, are centered around feature points [48, 9, 38, 31, 37]. A typical multi-step pipeline starts with the extraction of feature points, such as scale-invariant features (SIFT [26]), which are then matched across multiple images. The pipeline then crudely estimates the 3D point cloud of the object along with the cameras' positions and poses, which are then refined with bundle adjustment (BA). Our method differs from BA in that it is based on pixel-wise image registration without requiring inference of point correspondences, and it operates on rasterized 2D height maps rather than 3D point clouds, making it more amenable to the incorporation of CNNs.

Prior to the development of robust feature point descriptors, pixel-intensity-based (direct) techniques for depth estimation from image sequences were also common [30, 19, 29, 18, 15, 39, 33]; our method falls in this category. In recent years, there has been a resurgence of interest in direct approaches with the advent of deep learning. In particular, our work is similar to previous self-supervised deep learning methods that train a CNN on consecutive frames of videos to generate camera-centric depth estimates based on dense pairwise viewpoint warping [55, 44, 45, 25, 12, 35, 50, 28, 51, 13]. However, our method differs in that we jointly warp each frame to a common, potentially unseen reference (e.g., a world reference frame) and stitch the frames together, whereas these methods consider consecutive pairs of frames and stay within camera reference frames without stitching to form a larger field of view. Although these methods do not require labels, they still require datasets for training, unlike our method. Other deep learning methods have used multiple frames to estimate depth; however, unlike our method, many of these techniques require known camera poses [46, 21, 49, 22, 20, 32] or reference a keyframe [46, 21, 49, 22, 20, 40, 41, 52, 47], and all of these multi-frame methods require supervision and ignore camera distortion. More generally, none of the deep learning approaches mentioned in this paragraph estimate or acknowledge camera distortion, with one exception [13], which uses a quartic polynomial radial distortion model. However, as we will show, such a model is too limited for obtaining high-accuracy results in high-resolution mesoscopic imaging applications.

While our method uses a CNN, we emphasize that it does not require training on, and therefore does not inherit any biases from, a dataset. Rather, the CNN serves as a drop-in, compression-based regularizer without the requirement for training or generalization beyond the current sample under investigation. As such, our work is also somewhat similar to recent work on using DIP [43] to fill in gaps in camera-centric depth images based on warping to nearby views, which requires knowledge of the camera poses [11]. While our CNN-based regularizer was inspired by DIP, it differs in that our CNN takes the camera images as input rather than random noise, thus allowing a single shared CNN and therefore a fixed number of parameters regardless of the number of images in the sequence.

3 Our approach

Our new photogrammetric approach is at its core an end-to-end, feature-free, multi-image registration technique formulated as an inverse optimization problem. It is reminiscent of a technique previously developed for jointly registering and combining microscopic images from multiple angles for optical superresolution [54], and of other pixel-intensity-based multi-image alignment techniques developed for a variety of tasks, such as digital superresolution [36, 2]. Here, our goal is to register and stitch the 2D camera images from multiple views into a single consistent composite mosaic, where the image deformation model is parameterized by the camera model parameters and the sample height map. The key insight of our approach is to use these parameters to warp and co-rectify the camera images so that they appear to have been taken from a single common perspective, thus allowing joint estimation of the stitched RGB image and coaligned height map.

In the following subsections, we describe in detail the camera model we employed to accurately account for distortion and smartphone autofocusing, the perspective rectification models, the multi-image registration framework, and the CNN-based regularization framework. We also propose strategies for dealing with large, multi-megapixel images that alleviate time and memory costs to produce larger, high-resolution RGBH reconstructions (H = height).

Figure 2: Components of the image deformation model. The chief rays of the thin lens model (a) and the pinhole camera model (b) are equivalent, but have differing definitions of focal length. In general, the scene is not flat (height variation $h(x,y)$) and the camera is tilted (orientation $\mathbf{\hat{n}_{im}}$). The orthorectification model (c) assumes a reference frame whose projection center is at infinity, such that all projection lines are parallel to $\hat{\mathbf{z}}$. The radial vector field $\mathbf{r_{rectify}}(x,y)$, which converges at the vanishing point ($\mathbf{R}$) and whose magnitudes are $\propto h(x,y)$, performs such a per-pixel rectification, so that the resulting image no longer has perspective distortion. Rectifying the $i^{\mathit{th}}$ image to an arbitrary reference (d) requires a different $\mathbf{r_{rectify}}$. Backprojection of the camera images (e) to facilitate their stitching/registration requires optimization of the 6D camera pose, the distortion, and the height map, which performs the orthorectification. See Sec. 3.1.

3.1 Image deformation model

Pinhole camera and thin lens model. Assuming an ideal camera, image formation can be described by a pinhole camera model, whereby a 3D scene is projected onto a 2D image plane along lines that converge at a point, the center of projection, with a pinhole camera focal length, $f_{\mathit{ph}}$. The camera itself, however, is governed by an effective focal length, $f_{\mathit{eff}}$, which is related to $f_{\mathit{ph}}$ by the thin lens equation,

1/z_{\mathit{obj}} + 1/f_{\mathit{ph}} = 1/f_{\mathit{eff}}, (1)

where $z_{\mathit{obj}}$ is the distance of the object being imaged from the pinhole camera focus, which corresponds to the position of the thin lens. Thus, the projection lines of the pinhole camera model correspond to chief rays in the lens model (Fig. 2a,b). When imaging scenes at very far distances ($z_{\mathit{obj}} \gg f_{\mathit{ph}}$), $f_{\mathit{ph}} \approx f_{\mathit{eff}}$, a good assumption for long-range but less so for very close-range applications.
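As a quick numerical illustration (our own example, not from the paper), using the phone's nominal $f_{\mathit{eff}}$ = 4.3 mm from Sec. 4 and an assumed object distance of 75 mm (within the 5-10 cm range used in our experiments):

```python
# Worked example (ours, not from the paper): Eq. 1 at a close working distance,
# using the phone's nominal f_eff = 4.3 mm (Sec. 4) and an assumed object
# distance of 75 mm.
f_eff = 4.3    # effective focal length (mm)
z_obj = 75.0   # assumed object distance (mm)

# Eq. 1 rearranged for the pinhole focal length: 1/f_ph = 1/f_eff - 1/z_obj
f_ph = 1.0 / (1.0 / f_eff - 1.0 / z_obj)
print(f_ph)    # ~4.56 mm, noticeably different from f_eff; for z_obj >> f_ph
               # the two converge, as noted above.
```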

These models do not predict the limited depths of field associated with closer working distances. To avoid this issue, we assume that the user only attempts to manipulate 2 ($xy$ translation) out of the 6 camera pose parameters, which justifies designating an object plane ($xy$ plane) to which the sample height variations may be referenced, assumed to be within the depth of field. In practice, to account for freehand instability, we still model all 6 degrees of freedom for each image: let the camera's 3D orientation be parameterized by a unit normal vector, $\mathbf{\hat{n}_{im}}$, pointing along the optical axis, and an in-image-plane rotation angle, $\theta$; let the camera's 3D position be designated by the location of its center of projection, $(X, Y, Z = z_{\mathit{obj}})$. In the supplement, we describe the procedure for homographic rectification of the camera images in a common 2D world reference (Fig. 2e).

Camera (un)distortion. Camera lenses and sensor placement are not perfect, giving rise to image distortion. This can pose problems for close-range, mesoscale applications, as the 3D-information-encoding parallax shifts become more similar in magnitude to image deformation due to camera distortion. Distortion models commonly separate radial and tangential components, which are often expanded as even-order polynomials [5, 7]. In some cases, the distortion center (the principal point) should also be optimized [16]. However, as we will show, we found even an order-64 polynomial radial undistortion model insufficiently expressive for our mesoscopic application. Instead, we used a nonparametric, piecewise linear radial undistortion model, whereby the radially dependent relative magnification factor is discretized into $n_r$ points, $\{\widetilde{M}_t\}_{t=0}^{n_r-1}$, spaced by $\delta_r$, with intermediate points linearly interpolated:

\widetilde{M}(r) = \left(1 + \left\lfloor\tfrac{r}{\delta_r}\right\rfloor - \tfrac{r}{\delta_r}\right)\widetilde{M}_{\lfloor r/\delta_r\rfloor} + \left(\tfrac{r}{\delta_r} - \left\lfloor\tfrac{r}{\delta_r}\right\rfloor\right)\widetilde{M}_{\lfloor r/\delta_r\rfloor + 1}, (2)

where $\lfloor\cdot\rfloor$ is the flooring operation, $0 \leq r < (n_r - 1)\delta_r$ is the radial distance from the distortion center, which is also optimized, and $\widetilde{M}(r) = 1$ if there is no distortion. Thus, for a given point in the image, $\mathbf{r_{im}}$, the distortion correction operation is given by

\mathbf{r_{im}} \leftarrow \widetilde{M}(|\mathbf{r_{im}}|)\,\mathbf{r_{im}}, (3)

which is applied before backprojection. A piecewise linear model, unlike high-order polynomials, also has the advantage of being trivially analytically invertible, allowing easy computation of both image distortion and undistortion. This is important because, while BA typically uses a distortion model, our method requires an undistortion model, as we first backproject camera images to form the reconstruction.
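To make the model concrete, below is a minimal numpy sketch of Eqs. 2-3 (our illustration, not the released implementation); the function name, array shapes, and the choice to measure coordinates relative to the (optimized) distortion center are our assumptions.

```python
import numpy as np

# Illustrative sketch of the piecewise linear radial undistortion model of
# Eqs. 2-3: the relative magnification M~(r) is stored at n_r control points
# spaced by delta_r and linearly interpolated in between.
def undistort(r_im, M_knots, delta_r, center):
    """r_im: (N, 2) image coordinates; M_knots: (n_r,) magnification samples;
    center: (2,) distortion center. Assumes r < (n_r - 1) * delta_r."""
    d = r_im - center                     # coordinates relative to distortion center
    r = np.linalg.norm(d, axis=1)         # radial distance from the center
    t = r / delta_r
    i = np.floor(t).astype(int)           # lower knot index
    frac = t - i                          # interpolation weight
    M = (1 - frac) * M_knots[i] + frac * M_knots[i + 1]   # Eq. 2
    return center + M[:, None] * d        # Eq. 3: radially scale about the center

# With M_knots all ones (no distortion), undistort() returns r_im unchanged.
```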

Orthorectification. To extend our image deformation model to allow registration of scenes with height variation, we need to warp each backprojected image to a common reference in a pixel-wise fashion. One such option is orthorectification (Fig. 2c), which can be interpreted as rectifying to a world reference. As such, the effective camera origin is at infinity, so that our images are governed by true length scales, regardless of proximity to the camera (i.e., no perspective distortion). For each camera image backprojection, we estimate a radial deformation field,

\mathbf{r_{rectify}}(\mathbf{r_{obj}}) = \Delta r\,\frac{\mathbf{r_{obj}} - \mathbf{R}}{|\mathbf{r_{obj}} - \mathbf{R}|}, (4)

which is a function of position in the object plane, $\mathbf{r_{obj}} = (x_{\mathit{obj}}, y_{\mathit{obj}})$, and moves each pixel a signed distance of $\Delta r$ towards the vanishing point, $\mathbf{R} = (X, Y)$, the point to which all lines parallel to $\hat{\mathbf{z}}$ appear to converge in the camera image. $\Delta r$ is itself directly proportional to the height at the new rectified location,

h(\mathbf{r_{obj}} + \mathbf{r_{rectify}}) = -Z\,\frac{\Delta r}{|\mathbf{r_{obj}} - \mathbf{R}|}. (5)

Each camera image has its own height map, forming an augmented RGBH image, which is pixel-wise orthorectified by Eq. 5.
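A minimal sketch of the orthorectification field of Eqs. 4-5 follows (again our illustration; in particular, how the height values at the rectified locations are obtained, e.g., by interpolation, is an implementation detail we do not specify here).

```python
import numpy as np

# Sketch of the orthorectification field of Eqs. 4-5: each backprojected pixel
# is pulled toward the vanishing point R by a signed distance proportional to
# the local height.
def ortho_rectify_field(r_obj, h, R, Z):
    """r_obj: (N, 2) object-plane coordinates; h: (N,) heights sampled at the
    rectified locations (assumed available here); R: (2,) vanishing point
    (X, Y); Z: height of the camera's center of projection."""
    d = r_obj - R
    dist = np.linalg.norm(d, axis=1, keepdims=True)
    delta_r = -h[:, None] * dist / Z      # Eq. 5 solved for the signed shift Δr
    return delta_r * d / dist             # Eq. 4: radial field toward R
```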

Rectification to an arbitrary perspective. Orthorectification is just a special case of a more general rectification procedure (Fig. 2d), where we instead warp to an arbitrary camera-centric reference frame, specified by its vanishing point, $\mathbf{R_{ref}}$, projection center height, $Z_{\mathit{ref}}$, and an orientation such that the image and object planes are parallel. Given the $i^{\mathit{th}}$ camera image, whose extrinsics are similarly specified by $\mathbf{R_i}$, $Z_i$, and $\mathbf{\hat{n}_{im,i}}$, the vector that warps a point, $\mathbf{r_{obj}}$, to the reference frame is given by

\mathbf{r_{rectify}}(\mathbf{r_{obj}}) = \frac{h}{Z_i}\,\frac{Z_i - Z_{\mathit{ref}}}{Z_{\mathit{ref}} - h}\,(\mathbf{r_{obj}} - \mathbf{R_i}) + \frac{h}{Z_{\mathit{ref}} - h}\,(\mathbf{R_i} - \mathbf{R_{ref}}). (6)

If we allow $Z_{\mathit{ref}} \rightarrow \infty$, the result is consistent with Eqs. 4 and 5, and we recover the orthorectification case, as expected.
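Explicitly (our algebra, not reproduced from the paper), in this limit the first factor of Eq. 6 has $(Z_i - Z_{\mathit{ref}})/(Z_{\mathit{ref}} - h) \rightarrow -1$ and the second term vanishes, so

\lim_{Z_{\mathit{ref}} \rightarrow \infty} \mathbf{r_{rectify}}(\mathbf{r_{obj}}) = -\frac{h}{Z_i}\,(\mathbf{r_{obj}} - \mathbf{R_i}),

which matches substituting Eq. 5 into Eq. 4, i.e., $\mathbf{r_{rectify}} = -(h/Z)(\mathbf{r_{obj}} - \mathbf{R})$.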

Accounting for autofocus. Most smartphones have an autofocus feature where the sensor or lens position automatically adjusts to sharpen some part of the image. This feature is only relevant for close-range applications, as can be seen in Eq. 1, and manifests as a dynamically adjusted $f_{\mathit{ph},i}$ for the $i^{\mathit{th}}$ image, while for long-range applications, $f_{\mathit{ph},i}$ remains fixed at $f_{\mathit{eff}}$. This issue is intertwined with the well-known limitation of photogrammetry that inferring absolute scale requires something of known length in the scene (ground control points). In particular, Eq. 5 by itself is insufficient to obtain quantitative height maps, because $Z_i$ and $f_{\mathit{ph},i}$ are ambiguous up to a scale factor related to the camera's magnification for the $i^{\mathit{th}}$ image:

M_i = f_{\mathit{ph},i}/Z_i \approx M_0 Z_0/Z_i, (7)

where the approximation assumes that the camera has autofocused once and maintained approximately the same $f_{\mathit{ph}}$ throughout data acquisition. The more consistent the height of the camera during acquisition, the better this approximation. The advantage of this approximation is not having to calibrate the magnification for each image in the sequence. Combining Eqs. 1, 5, and 7, we obtain the ambiguity-free height estimate from the $i^{\mathit{th}}$ image,

h_i(\mathbf{r_{obj}} + \mathbf{r_{rectify}}) \approx -f_{\mathit{eff}}\,\frac{\Delta r_i}{|\mathbf{r_{obj}} - \mathbf{R_i}|}\left(1 + \frac{1}{M_0}\frac{Z_i}{Z_0}\right). (8)
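For completeness, the intermediate step (our restatement of how the cited equations combine) is that Eq. 1 gives $f_{\mathit{ph},i} = f_{\mathit{eff}} Z_i/(Z_i - f_{\mathit{eff}})$, so that

M_i = \frac{f_{\mathit{eff}}}{Z_i - f_{\mathit{eff}}} \quad\Rightarrow\quad Z_i = f_{\mathit{eff}}\left(1 + \frac{1}{M_i}\right) \approx f_{\mathit{eff}}\left(1 + \frac{1}{M_0}\frac{Z_i}{Z_0}\right),

where the approximation uses Eq. 7; substituting this expression for $Z_i$ into Eq. 5 yields Eq. 8.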

Putting it all together. Homographic rectification using the pinhole/thin-lens models, camera undistortion, and orthorectification (or arbitrary-reference rectification) via the height map, as described in this section (Sec. 3.1), together constitute the backprojection step (Figs. 2e, 3). Let the associated image deformation parameters be collectively denoted as $\mathbf{w}$.

3.2 Multi-frame image stitching and registration

Given the current estimate of the image deformation parameters, $\mathbf{w}$, we simultaneously backproject all the images to form an estimate of the RGBH reconstruction, $\mathbf{B}$, with the coaligned height map stacked as the fourth channel:

\mathbf{B} \leftarrow \mathbf{0}, \qquad \mathbf{B}[\mathbf{x_w}, \mathbf{y_w}] \leftarrow \mathbf{D_{RGBH}}, (9)

where $(\mathbf{x_w}, \mathbf{y_w})$ are the flattened coordinates corresponding to the pixels of $\mathbf{D_{RGBH}}$, which are the flattened RGB images augmented with the camera-centric height maps. If a pixel of $\mathbf{B}$ is visited multiple times, the values are averaged.

To guide the optimization, we next generate forward predictions of the camera images, $\mathbf{\hat{D}_{RGBH}}$, by using the exact same backprojection coordinates, $(\mathbf{x_w}, \mathbf{y_w})$, to reproject back into the camera frames of reference and compute the mean square error (MSE) with the original camera images (Fig. 3). The idea is that if the backprojected images are consistent with each other at the pixels where they overlap, then the forward predictions will be more accurate. We then use gradient descent to minimize the MSE with respect to the image deformation parameters, that is,

\min_{\mathbf{w}}\ \|\mathbf{\hat{D}_{RGBH}} - \mathbf{D_{RGBH}}\|^2. (10)

To avoid local minima, we adopted a multi-scale strategy, whereby both $\mathbf{D_{RGBH}}$ and $\mathbf{B}$ were subject to a downsampling procedure that was relaxed over time. Further, we did not update the height map until we reached the lowest downsampling factor. If the scene consisted of non-repetitive structures and the camera images exhibited substantial overlap, initializing each image to the same position was often a good initial guess. However, if this failed, we initialized using sequential cross-correlation-based estimates.
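The bookkeeping of Eqs. 9-10 can be sketched as follows (a simplified numpy illustration with integer coordinates and no interpolation; the actual implementation is differentiable and written in TensorFlow, and all names here are our own):

```python
import numpy as np

# Schematic of one loss evaluation in Sec. 3.2: scatter-average the warped RGBH
# pixels into the reconstruction B (Eq. 9), gather back at the same coordinates
# to form the prediction, and take the MSE (Eq. 10).
def registration_loss(x_w, y_w, D_rgbh, recon_shape):
    """x_w, y_w: (N,) integer warped pixel coordinates; D_rgbh: (N, 4) RGB +
    camera-centric height values; recon_shape: (H, W) of the reconstruction."""
    B_sum = np.zeros(recon_shape + (4,))
    counts = np.zeros(recon_shape)
    np.add.at(B_sum, (y_w, x_w), D_rgbh)              # accumulate backprojections
    np.add.at(counts, (y_w, x_w), 1.0)                # visit counts for averaging
    B = B_sum / np.maximum(counts, 1.0)[..., None]    # multiply-visited pixels averaged
    D_pred = B[y_w, x_w]                              # reproject with the same coordinates
    return np.mean((D_pred - D_rgbh) ** 2)            # Eq. 10, minimized w.r.t. w
```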

Figure 3: Method overview. A stack of RGB images (only one shown for clarity) is augmented with their camera-centric height maps (RGBH), which are reparameterized as outputs of an untrained CNN. The camera-centric height maps and camera parameters serve as the image deformation parameters, which rectify the RGBH images so that they can be registered and stitched (RGBH reconstruction). We use the same warping coordinates to reproject the images to form a prediction, whose MSE with the original RGBH images is minimized with respect to the CNN and camera parameters. Arrows indicate forward differentiable operations, but only red arrows indicate paths that allow backpropagation (Sec. 3.4).

3.3 Regularization

Reparameterization with a CNN. We reparameterized the camera-centric height maps as the output of a CNN with the respective RGB images as the inputs. Instead of optimizing for the per-image height maps, we optimize the weights of a single untrained CNN as a DIP, whose structure alone exhibits a bias towards “natural” images, as empirically demonstrated on multiple image-reconstruction-related tasks [43], including 3D reconstruction [53]. While the DIP is often an overparameterization, in our case the single CNN has fewer parameters than the total number of height map pixels that we otherwise would have directly optimized, thus offering further means of regularization [17]. Furthermore, we used an encoder-decoder network architecture without skip connections, forcing the information to flow through a bottleneck. Thus, the degree of compression in the CNN is an interpretable regularization hyperparameter, where restricting information flow may force the network to discard artifacts (see supplement for architecture). Finally, we note that the network does not need to generalize beyond the current image sequence.
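For concreteness, a hypothetical encoder-decoder of this flavor is sketched below; this is not the architecture used in the paper (which is specified in the supplement), only an illustration of how the bottleneck width would act as the compression/regularization knob.

```python
import tensorflow as tf

# Illustrative encoder-decoder without skip connections (NOT the paper's exact
# architecture): an RGB camera image goes in, a camera-centric height map comes
# out, and the bottleneck width sets the degree of compression.
def height_map_cnn(bottleneck_channels=16):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, strides=2, padding='same', activation='relu'),
        tf.keras.layers.Conv2D(64, 3, strides=2, padding='same', activation='relu'),
        tf.keras.layers.Conv2D(bottleneck_channels, 3, strides=2, padding='same',
                               activation='relu'),      # information bottleneck
        tf.keras.layers.UpSampling2D(2),
        tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
        tf.keras.layers.UpSampling2D(2),
        tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
        tf.keras.layers.UpSampling2D(2),
        tf.keras.layers.Conv2D(1, 3, padding='same'),    # single height channel
    ])

# The same shared, untrained network maps every RGB frame to its height map;
# its weights, not the height pixels, are what gradient descent updates.
```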

Camera-centric height map consistency. Although we directly optimize for camera-centric height maps, they ultimately follow the same backprojection and reprojection procedure that the RGB images undergo, as described earlier. However, while the contribution of RGB pixels to the MSE in Eq. 10 serves as feedback for registration, the contribution of the height values to the MSE is primarily to make the camera-centric height maps more consistent irrespective of the backprojection result. This can be useful, for example, when filling in height values at the vanishing points, which are blind spots, as $h \propto \mathbf{r_{rectify}}(\mathbf{R}) = 0$. Since RGB values and height values are not directly comparable, we introduce a regularization hyperparameter that scales their relative contributions.

3.4 Reducing computation time and memory

To reduce computation time and memory, we used gradient checkpointing [6, 14] and CPU memory swapping [24], as well as two novel strategies, which we discuss next.

Blocking backpropagation through the reconstruction. Instead of computing the total gradient of the loss with respect to the image deformation parameters, which would require backpropagation across every path that leads to the deformation parameters, we compute partial gradients using only the paths that lead to the deformation parameters without going through the reconstruction (red arrows in Fig. 3). Writing out the relevant terms in the chain rule expansion of the gradient of the loss with respect to the image deformation parameters (Sec. 3.1), we have

\frac{dL}{d\mathbf{w}} = \frac{\partial L}{\partial\mathbf{\hat{D}_{RGBH}}}\left(\mathbf{J}_{\mathbf{\hat{D}_{RGBH}}}(\mathbf{w}) + \cancelto{0}{\mathbf{J}_{\mathbf{\hat{D}_{RGBH}}}(\mathbf{B})\,\mathbf{J}_{\mathbf{B}}(\mathbf{w})}\right), (11)

where $\mathbf{J_y}(\mathbf{x})$ denotes the Jacobian of $\mathbf{y}$ with respect to $\mathbf{x}$ and $L$ is the loss. Doing so saves memory and time because it avoids computing expensive derivatives associated with the reconstruction backprojection and reprojection steps. An intuitive interpretation is that at each iteration the reconstruction serves as a temporarily static reference to which all the images are being registered, as opposed to also explicitly registering the reconstruction to the images. Note, however, that both registration directions are governed by the same parameters, and that this “static” reference still updates at every iteration.
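In an automatic-differentiation framework such as TensorFlow, this partial-gradient scheme can be realized by simply stopping gradients at the reconstruction; the sketch below is our illustration (the functions backproject and reproject are hypothetical placeholders, not the released API).

```python
import tensorflow as tf

# Sketch of the gradient-blocking trick of Eq. 11: the reconstruction is
# treated as a temporarily static reference by wrapping it in tf.stop_gradient,
# so backpropagation only follows the reprojection path from the loss to the
# deformation parameters w.
def loss_with_blocked_recon(backproject, reproject, D_rgbh, w):
    B = backproject(D_rgbh, w)       # build reconstruction from current w
    B = tf.stop_gradient(B)          # kill the J_B(w) path in Eq. 11
    D_pred = reproject(B, w)         # prediction still depends on w directly
    return tf.reduce_mean(tf.square(D_pred - D_rgbh))   # MSE loss of Eq. 10
```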

Figure 4: Stitched orthomosaics and height maps for various samples with mm-scale height variation. Height values for the cut cards and PCB components (red ×'s) are quantified in Tables 1 and 2, respectively. COLMAP consistently underestimates heights. Scale bars, 1 cm.

Batching with a running-average reconstruction. At every iteration of the optimization, we require an estimate of the reconstruction, which itself requires joint participation of all images in the dataset to maximize the available information. This can be problematic because it requires both the reconstruction and the entire dataset to be in GPU memory at the same time. A standard approach to address this problem is batching, which at first glance would only work for the reprojection step, as the backprojection step requires all images to form the reconstruction. A two-step approach to overcome this requirement exploits our strategy of blocking backpropagation paths through the reconstruction: we can simply generate the reconstruction incrementally in batches without worrying about accumulating gradients. Once the temporarily static reconstruction is generated given the current estimates of the image deformation parameters, we could then update the parameters by registering batches of images to the reconstruction. This approach, however, is inefficient because it requires two passes through the dataset per epoch, and the parameters are only updated during one of the passes.

We introduce a more efficient, one-step, end-to-end strategy in which each batch updates both the reconstruction and the parameters by keeping track of a running average of the reconstruction. In particular, the update rule for the reconstruction after the $(j+1)^{\mathit{th}}$ gradient step, when presented with the $j^{\mathit{th}}$ batch as a list of warped coordinates and their associated RGBH values, $(\mathbf{x}_{\mathit{w},j}, \mathbf{y}_{\mathit{w},j}, \mathbf{D}_j)$, is given by

\mathbf{B}_{j+1} \leftarrow \mathbf{B}_j, \qquad \mathbf{B}_{j+1}[\mathbf{x}_{\mathit{w},j}, \mathbf{y}_{\mathit{w},j}] \leftarrow m\,\mathbf{B}_j[\mathbf{x}_{\mathit{w},j}, \mathbf{y}_{\mathit{w},j}] + (1-m)\,\mathbf{D}_j, (12)

where $0 < m < 1$ is the momentum controlling how rapidly to update $\mathbf{B}$. The batch is specified very generally in Eq. 12 and can correspond to any subset of pixels from the dataset, whether grouped by image or chosen from random spatial coordinates. Only the spatial positions of the reconstruction visited by the batch are updated in the backprojection step, and the loss is computed with the same batch after the reprojection step. As a result, we only need one pass through the dataset per epoch. This method is general and can be applied to other multi-image registration problems.
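A minimal sketch of the update rule of Eq. 12 for one batch follows (our illustration; integer coordinates assumed, and duplicate coordinates within a batch are not averaged in this simplified version).

```python
import numpy as np

# Running-average reconstruction update of Eq. 12 for one batch (names are
# illustrative): only the reconstruction pixels visited by the batch are
# blended toward the new backprojected values with momentum m.
def update_reconstruction(B, x_wj, y_wj, D_j, m=0.5):
    """B: (H, W, 4) running RGBH reconstruction; x_wj, y_wj: (N,) warped integer
    coordinates of the j-th batch; D_j: (N, 4) its RGBH values; m: momentum."""
    B = B.copy()
    B[y_wj, x_wj] = m * B[y_wj, x_wj] + (1 - m) * D_j
    return B
```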

4 Experiments

Using the rear wide-angle camera of a Samsung Galaxy S10+ ($f_{\mathit{eff}}$ = 4.3 mm) and freehand motion, we collected multiple image sequence datasets, each consisting of 21-23 RGB 1512×2016 images (2× downsampled from 3024×4032). While our method does not require it, we attempted to keep the phone approximately parallel to and at a constant height (5-10 cm) from the sample while translating the phone laterally, to keep as much of the sample as possible within the limited depth of field associated with such close working distances. To obtain absolute scale, we estimated the magnification of the first image of each sequence using reference points of known separation in the background.

We implemented our algorithm in TensorFlow [1] (code and data available at github.com/kevinczhou/mesoscopic-photogrammetry) and performed reconstructions on an Intel Xeon Silver 4116 processor augmented with an 11-GB GPU (Nvidia RTX 2080 Ti). We used the same CNN architecture for all experiments, tuned on an independent sample to balance resolution and artifact reduction (see supplement for architectures). Gradient descent via Adam [23] was performed for 10,000 iterations for each sample with a batch size of 6. We set $n_r = 30$ (Eq. 2) and $m = 0.5$ (Eq. 12).

Card # (G.T.)   Ours Acc.   Ours Prec.   CM Acc.   CM Prec.   CM (scaled) Acc.   CM (scaled) Prec.
bkgd (0)        59.4        45.5         235.6     59.6       19.2               157.8
#1 (295)        34.9        37.7         74.9      26.8       41.8               71.0
#2 (350)        2.0         36.5         24.0      24.6       2.4                65.2
#3 (420)        38.5        54.9         6.7       18.4       31.7               48.9
#4 (485)        6.0         26.2         55.6      20.6       9.4                54.7
#5 (555)        32.5        35.5         120.6     15.4       47.3               40.8
#6 (625)        10.5        28.4         151.6     18.1       14.1               47.9
mean            26.3        37.8         95.6      26.2       23.7               69.5
Table 1: Accuracy (abs. error from ground truth (G.T.)) and precision (st. dev.) of our method vs. COLMAP (CM) vs. CM rescaled to match the G.T. of the cut cards sample (Fig. 4). All units are µm.
            bkgd  C120  C121  C126  C142  C147  C181  C182  C203  C57   R135  R144  R175  R184  U70   U71   U72   mean
G.T.        0     1253  1257  1269  618   1282  632   632   1354  677   533   548   476   427   1772  1771  1778
Ours Acc.   119.8 41.4  15.8  79.0  60.1  163.6 5.8   56.0  43.5  94.3  122.4 89.3  21.0  52.7  44.3  27.5  31.9  62.9
Ours Prec.  138.4 123.5 133.2 32.2  175.1 45.3  34.2  86.7  139.2 117.0 61.0  17.6  52.9  101.1 141.6 134.3 95.8
CM Acc.     455.7 435.4 557.7 193.9 116.6 90.7  401.6 225.0 456.9 548.0 74.6  173.3 452.5 305.4 502.1 388.9 168.3 326.3
CM Prec.    108.0 140.2 83.4  248.1 163.4 118.5 95.5  133.8 51.7  98.7  91.3  105.0 70.4  90.9  95.3  123.6 113.6
Table 2: Quantification of accuracy and precision (in µm) of our method and COLMAP (CM) on PCB components (Fig. 4, red ×'s).

We compare our method to the open-source, feature-based SfM tool, COLMAP [38], which has been shown to outperform competing general-purpose SfM tools [4]. See supplement for COLMAP hyperparameter settings.

Figure 5: Visual comparison of height map reconstructions of the cut cards sample using different undistortion models. Note the ring artifacts with the polynomial models. Scale bar, 1 cm.
Figure 6: Strong ring artifacts in the painting sample due to uncompensated distortion in polynomial models. Scale bar, 1 cm.
Figure 7: Height estimation performance on the 6 cut cards, background, and whole image for various undistortion models.
Figure 8: Reconstructions of the cut cards and PCB samples using high- and low-compression CNNs (architectures 1 and 4; see supplement for architectures and additional results) and high and low levels of TV regularization. Zoom in to see fine features. Scale bars, 1 cm.

Accuracy and precision characterization. We first created a calibrated phantom sample consisting of standard playing cards (∼0.3 mm thick), cut into six 1-2-cm² squares, with 0-5 layers of tape (50-70 µm thick per layer) attached to alter their heights. We measured the thicknesses of the tape-backed cut cards using calipers with 20-µm accuracy (Starrett EC799A-12/300) and arranged them on a flat, feature-rich surface [3]. We regarded these measurements as the ground truths, $\mathbf{h_{gt}}$ (column 1 of Table 1).

The stitched orthomosaics and height map reconstructions by our method and COLMAP are shown in the first column of Fig. 4; note that COLMAP underestimates the card heights. To quantify the accuracy and precision of both methods, we manually segmented the six cut cards and the background based on the RGB orthomosaic, and computed the mean and standard deviation of the height values of these seven regions (Table 1). Because the height maps have an arbitrary global shift, we used the shift that minimizes the MSE between the mean height estimates and the ground truths: $\Delta h = \mathrm{mean}(\mathbf{h_{gt}} - \mathbf{h_{est}})$. While COLMAP underestimates heights and therefore has low absolute accuracy, we hypothesized that its relative accuracy might be high. To test this, we additionally scaled COLMAP's height estimates by the factor that minimizes the MSE between mean height estimates and ground truths, given by $\mathrm{cov}(\mathbf{h_{gt}}, \mathbf{h_{est}})/\mathrm{var}(\mathbf{h_{est}}) \approx 2.65$ (Table 1). Only our method has simultaneously high accuracy (26.3 µm) and precision (37.8 µm), without the need to rescale.
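The global shift and scale fits referenced above amount to the following (our restatement of the formulas in the text, with illustrative function names):

```python
import numpy as np

# Global shift/scale fits used for Table 1 (our restatement; names illustrative).
def fit_shift(h_gt, h_est):
    """Shift minimizing the MSE between mean height estimates and ground truths."""
    return np.mean(np.asarray(h_gt) - np.asarray(h_est))        # Δh = mean(h_gt - h_est)

def fit_scale(h_gt, h_est):
    """Scale factor cov(h_gt, h_est) / var(h_est); ~2.65 for COLMAP on the cards."""
    h_gt, h_est = np.asarray(h_gt, float), np.asarray(h_est, float)
    cov = np.mean((h_gt - h_gt.mean()) * (h_est - h_est.mean()))
    return cov / np.var(h_est)
```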

Results on multiple mm-scale samples. Next, we compared our method and COLMAP on a printed circuit board (PCB), helicopter seeds, and the brush strokes on a painting (Fig. 4). Unlike our method, COLMAP not only consistently underestimates heights, but also exhibits unpredictable surface curvatures. We quantified the height accuracy and precision on 16 of the PCB components, using caliper estimates as the ground truth and manual segmentation based on the RGB orthomosaics (Table 2). We caution that this analysis is less rigorous than that of the cut cards sample, as some components are not completely flat (slight elevations along the shorter edges), and because estimating height with the calipers was more difficult (we alleviated this problem statistically by using the mean of 10 measurements as the ground truth). These caveats may explain why the accuracy (62.5 µm) and precision estimates (95.8 µm) are inflated compared to those of the cut cards sample. However, it is clear that our method does not have COLMAP's underestimation issue. Further evidence of this is that the optimal rescale value for our method is $1.028 \approx 1$ (which we did not use). While we do not have ground truths for the helicopter seed or painting samples, the calipers estimated the helicopter seeds to be ∼2.3 mm at their thickest part, qualitatively consistent with our method; while this is not a rigorous measurement due to the non-rigidity of the sample, it at least suggests that COLMAP is underestimating heights.

Importance of undistortion. Figs. 5 and 6 show spurious rings when camera distortion is not sufficiently modeled. In particular, although for conventional macro-scale applications it is often sufficient to use a low-order even polynomial (e.g., order-4 [13]), for mesoscopic applications even an order-64 polynomial undistortion model leaves behind noticeable ring artifacts, while our piecewise linear model (30 segments) effectively eliminates them. Fig. 7 quantifies the performance of multiple undistortion models via the root-MSE (RMSE), precision, and accuracy of the heights of the cut cards sample. Not only does our piecewise linear model generally have better precision and overall RMSE than the polynomial models, but it also does not exhibit the directional biases caused by the ring artifacts (e.g., card 2), which cannot be corrected by a global scale or shift. See the supplement for a full comparison of undistortion models on all four samples in Fig. 4, as well as the estimated radial undistortion profiles and distortion centers.

Effectiveness of CNN regularizer. CNN reparameterization is crucial to our method. As the CNN is an encoder-decoder network without skip connections, the degree of compression can be adjusted by the number of parameters and downsampling layers. Fig. 8 shows that varying the degree of compression controls the amount of fine detail transferred to the height map without affecting the flatness of the field of view or blurring edges. Thus, the degree of compression in the CNN is an interpretable means of tuning the regularization. However, if we optimize the camera-centric height maps directly with total variation (TV) regularization, we see many artifacts, even when the regularization is strong enough to blur sharp edges. See the supplement for results with more CNN architectures and TV regularization levels for all four samples in Fig. 4, plus the hyperparameter-tuning sample.

5 Conclusion

We have presented a feature-free, end-to-end photogrammetric algorithm, applied to mesoscopic samples with tens-of-µm accuracy over cm-scale fields of view. Our method features a novel use of CNNs/DIPs, which effectively removes many artifacts from the height maps. We also showed that careful modeling of distortion is important for obtaining accurate height values. Qualitative and quantitative comparisons show that our method outperforms COLMAP, a feature-based SfM tool. Although we applied our method to mesoscopic samples with the orthorectification model, it would be interesting to apply it to macro-scale scenes, where photogrammetry is more typically applied, using the arbitrary-reference rectification model (Fig. 2d).

References

  • [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
  • [2] Cecilia Aguerrebere, Mauricio Delbracio, Alberto Bartesaghi, and Guillermo Sapiro. A practical guide to multi-image alignment. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1927–1931. IEEE, 2018.
  • [3] Yoshua Bengio, Ian Goodfellow, and Aaron Courville. Deep learning, volume 1. MIT Press, 2017.
  • [4] Simone Bianco, Gianluigi Ciocca, and Davide Marelli. Evaluating the performance of structure from motion pipelines. Journal of Imaging, 4(8):98, 2018.
  • [5] Duane C Brown. Decentering distortion of lenses. Photogrammetric Engineering and Remote Sensing, 1966.
  • [6] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
  • [7] Andrew W Fitzgibbon. Simultaneous linear estimation of multiple view geometry and lens distortion. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, volume 1, pages I–I. IEEE, 2001.
  • [8] Jorge Fuentes-Pacheco, José Ruiz-Ascencio, and Juan Manuel Rendón-Mancha. Visual simultaneous localization and mapping: a survey. Artificial intelligence review, 43(1):55–81, 2015.
  • [9] Simon Fuhrmann, Fabian Langguth, and Michael Goesele. Mve-a multi-view reconstruction environment. In GCH, pages 11–18. Citeseer, 2014.
  • [10] Yasutaka Furukawa and Carlos Hernández. Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision, 9(1-2):1–148, 2015.
  • [11] Pallabi Ghosh, Vibhav Vineet, Larry S Davis, Abhinav Shrivastava, Sudipta Sinha, and Neel Joshi. Deep depth prior for multi-view stereo. arXiv preprint arXiv:2001.07791, 2020.
  • [12] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE international conference on computer vision, pages 3828–3838, 2019.
  • [13] Ariel Gordon, Hanhan Li, Rico Jonschkowski, and Anelia Angelova. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In Proceedings of the IEEE International Conference on Computer Vision, pages 8977–8986, 2019.
  • [14] Andreas Griewank and Andrea Walther. Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Transactions on Mathematical Software (TOMS), 26(1):19–45, 2000.
  • [15] KJ Hanna. Direct multi-resolution estimation of ego-motion and structure from motion. In Proceedings of the IEEE workshop on visual motion, pages 156–157. IEEE Computer Society, 1991.
  • [16] Richard Hartley and Sing Bing Kang. Parameter-free radial distortion correction with center of distortion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(8):1309–1321, 2007.
  • [17] Reinhard Heckel and Paul Hand. Deep decoder: Concise image representations from untrained non-convolutional networks. arXiv preprint arXiv:1810.03982, 2018.
  • [18] Joachim Heel. Direct estimation of structure and motion from multiple frames. Technical report, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge, MA, 1990.
  • [19] Berthold KP Horn and EJ Weldon. Direct methods for recovering motion. International Journal of Computer Vision, 2(1):51–76, 1988.
  • [20] Yuxin Hou, Juho Kannala, and Arno Solin. Multi-view stereo by temporal nonparametric fusion. In Proceedings of the IEEE International Conference on Computer Vision, pages 2651–2660, 2019.
  • [21] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2821–2830, 2018.
  • [22] Sunghoon Im, Hae-Gon Jeon, Stephen Lin, and In So Kweon. Dpsnet: End-to-end deep plane sweep stereo. arXiv preprint arXiv:1905.00538, 2019.
  • [23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [24] Tung D Le, Haruki Imai, Yasushi Negishi, and Kiyokuni Kawachiya. Tflms: Large model support in tensorflow by graph rewriting. arXiv preprint arXiv:1807.02037, 2018.
  • [25] Ruihao Li, Sen Wang, Zhiqiang Long, and Dongbing Gu. Undeepvo: Monocular visual odometry through unsupervised deep learning. In 2018 IEEE international conference on robotics and automation (ICRA), pages 7286–7291. IEEE, 2018.
  • [26] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
  • [27] Thomas Luhmann. Close range photogrammetry for industrial applications. ISPRS journal of photogrammetry and remote sensing, 65(6):558–569, 2010.
  • [28] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5667–5675, 2018.
  • [29] Larry Matthies, Takeo Kanade, and Richard Szeliski. Kalman filter-based algorithms for estimating depth from image sequences. International Journal of Computer Vision, 3(3):209–238, 1989.
  • [30] Larry H Matthies, Richard Szeliski, and Takeo Kanade. Incremental estimation of dense depth maps from image sequences. In CVPR, pages 366–374, 1988.
  • [31] Pierre Moulon, Pascal Monasse, Romuald Perrot, and Renaud Marlet. Openmvg: Open multiple view geometry. In International Workshop on Reproducible Research in Pattern Recognition, pages 60–74. Springer, 2016.
  • [32] Zak Murez, Tarrence van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. arXiv preprint arXiv:2003.10432, 2020.
  • [33] John Oliensis. Direct multi-frame structure from motion for hand-held cameras. In Proceedings 15th International Conference on Pattern Recognition. ICPR-2000, volume 1, pages 889–895. IEEE, 2000.
  • [34] GN Peggs, Paul G Maropoulos, EB Hughes, AB Forbes, S Robson, M Ziebart, and Balasubramanian Muralikrishnan. Recent developments in large-scale dimensional metrology. Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture, 223(6):571–595, 2009.
  • [35] Anurag Ranjan, Varun Jampani, Lukas Balles, Kihwan Kim, Deqing Sun, Jonas Wulff, and Michael J Black. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 12240–12249, 2019.
  • [36] Dirk Robinson, Sina Farsiu, and Peyman Milanfar. Optimal registration of aliased images using variable projection with applications to super-resolution. The Computer Journal, 52(1):31–42, 2009.
  • [37] Ewelina Rupnik, Mehdi Daakir, and Marc Pierrot Deseilligny. Micmac–a free, open-source solution for photogrammetry. Open Geospatial Data, Software and Standards, 2(1):1–9, 2017.
  • [38] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
  • [39] Gideon P Stein and Amnon Shashua. Model-based brightness constraints: On direct estimation of structure and motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):992–1015, 2000.
  • [40] Chengzhou Tang and Ping Tan. Ba-net: Dense bundle adjustment network. arXiv preprint arXiv:1806.04807, 2018.
  • [41] Zachary Teed and Jia Deng. Deepv2d: Video to depth with differentiable structure from motion. arXiv preprint arXiv:1812.04605, 2018.
  • [42] Shimon Ullman. The interpretation of structure from motion. Proceedings of the Royal Society of London. Series B. Biological Sciences, 203(1153):405–426, 1979.
  • [43] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454, 2018.
  • [44] Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, and Katerina Fragkiadaki. Sfm-net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804, 2017.
  • [45] Chaoyang Wang, José Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2022–2030, 2018.
  • [46] Kaixuan Wang and Shaojie Shen. Mvdepthnet: Real-time multiview depth estimation neural network. In 2018 International Conference on 3D Vision (3DV), pages 248–257. IEEE, 2018.
  • [47] Xingkui Wei, Yinda Zhang, Zhuwen Li, Yanwei Fu, and Xiangyang Xue. Deepsfm: Structure from motion via deep bundle adjustment. arXiv preprint arXiv:1912.09697, 2019.
  • [48] Changchang Wu. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision-3DV 2013, pages 127–134. IEEE, 2013.
  • [49] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pages 767–783, 2018.
  • [50] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1983–1992, 2018.
  • [51] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera, Kejie Li, Harsh Agarwal, and Ian Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 340–349, 2018.
  • [52] Huizhong Zhou, Benjamin Ummenhofer, and Thomas Brox. Deeptam: Deep tracking and mapping. In Proceedings of the European conference on computer vision (ECCV), pages 822–838, 2018.
  • [53] Kevin C Zhou and Roarke Horstmeyer. Diffraction tomography with a deep image prior. Optics Express, 28(9):12872–12896, 2020.
  • [54] Kevin C Zhou, Ruobing Qian, Simone Degan, Sina Farsiu, and Joseph A Izatt. Optical coherence refraction tomography. Nature Photonics, 13(11):794–802, 2019.
  • [55] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1851–1858, 2017.