Chair: Feng Zheng (Magic Leap, Inc.)
In the last few years, advances in head-mounted display technology and optics have opened up many new possibilities for the field of Augmented Reality. However, many commercial and prototype systems have a single display modality, a fixed field of view, or an inflexible form factor. In this paper, we introduce Modular Augmented Reality (ModulAR), a hardware and software framework designed to improve the flexibility and hands-free control of video see-through augmented reality displays and augmentative functionality. To accomplish this goal, we introduce the use of integrated eye tracking for on-demand control of vision augmentations such as optical zoom or field of view expansion. Physical modification of the device's configuration can be accomplished on the fly using interchangeable camera-lens modules that provide different types of vision enhancements. We implement and test functionality for several primary configurations using telescopic and fisheye camera-lens systems, though many other customizations are possible. We also implement a number of eye-based interactions to engage and control the vision augmentations in real time, and explore different methods for merging streams of augmented vision into the user's normal field of view. In a series of experiments, we conduct an in-depth analysis of visual acuity and head and eye movement during search and recognition tasks. Results show that methods with a larger field of view that utilize binary on/off and gradual zoom mechanisms outperform snapshot and sub-windowed methods, and that the type of eye engagement has little effect on performance.
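The eye-based engagement mechanisms are described above only at a high level; as a rough illustration of what a dwell-triggered binary on/off engagement with a gradual zoom ramp might look like, here is a minimal Python sketch (class, parameter names, and values are hypothetical, not taken from the paper):

```python
import time

class GazeZoomToggle:
    """Hypothetical dwell-based engagement for a vision augmentation:
    fixating an activation region for `dwell_s` seconds toggles the
    augmentation, and the zoom factor then ramps gradually toward the
    target instead of switching in a single step."""

    def __init__(self, dwell_s=0.8, max_zoom=4.0, ramp_per_s=2.0):
        self.dwell_s = dwell_s
        self.max_zoom = max_zoom
        self.ramp_per_s = ramp_per_s
        self.active = False
        self.zoom = 1.0
        self._dwell_start = None

    def update(self, gaze_in_region, dt):
        """Call once per frame with the gaze-hit test result and frame time."""
        if gaze_in_region:
            if self._dwell_start is None:
                self._dwell_start = time.monotonic()
            elif time.monotonic() - self._dwell_start >= self.dwell_s:
                self.active = not self.active   # binary on/off engagement
                self._dwell_start = None
        else:
            self._dwell_start = None
        # Gradual zoom ramp toward the current target magnification.
        target = self.max_zoom if self.active else 1.0
        step = self.ramp_per_s * dt
        if self.zoom < target:
            self.zoom = min(self.zoom + step, target)
        else:
            self.zoom = max(self.zoom - step, target)
        return self.zoom
```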
A fundamental issue in Augmented Reality (AR) is how to naturally mediate reality with virtual content as seen by the user. In AR applications with Optical See-Through Head-Mounted Displays (OST-HMDs), this often comes down to rendering colors on the OST-HMD consistently with the input colors. However, due to various display constraints and eye properties, it remains challenging to reproduce colors on OST-HMDs indistinguishably. One approach to this problem is to pre-process the input color so that the user perceives the output color on the display to be the same as the input.
We propose a color calibration method for OST-HMDs. We start by modeling the physical optics in the rendering and perception process between the HMD and the eye. We treat the color distortion as a semi-parametric model that separates the non-linear color distortion from the linear color shift. We demonstrate that calibrated images regain their original appearance on two OST-HMD setups with both synthetic and real datasets. Furthermore, we analyze the limitations of the proposed method and the remaining problems of color reproduction in OST-HMDs. We then discuss how to realize more practical color reproduction methods for future HMD-eye systems.
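The semi-parametric separation into a non-linear distortion and a linear shift can be made concrete with a toy model. The Python sketch below is our own simplification (a gamma-like curve stands in for the non-parametric part, and a fixed 3x3 matrix for the linear shift); it illustrates the pre-correction idea, not the paper's actual calibration procedure:

```python
import numpy as np

# Toy display model (illustration only): perceived = M @ g(rendered), where g
# is a per-channel nonlinearity and M a 3x3 linear color shift.  Both M and
# gamma are made-up values standing in for calibrated quantities.
M = np.array([[0.90, 0.06, 0.02],
              [0.05, 0.85, 0.04],
              [0.03, 0.05, 0.80]])
gamma = np.array([2.2, 2.1, 2.3])

def display_model(c):
    """Forward model: color perceived on the HMD for rendered color c."""
    return M @ (c ** gamma)

def precorrect(target):
    """Pre-process the input so the perceived color matches it:
    invert the linear shift, then the per-channel nonlinearity."""
    lin = np.linalg.solve(M, target)
    return np.clip(lin, 0.0, 1.0) ** (1.0 / gamma)

target = np.array([0.5, 0.4, 0.6])
print(display_model(precorrect(target)))   # ~= target
```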
In Augmented Reality (AR) with an Optical See-Through Head-Mounted Display (OST-HMD), the spatial calibration between a user's eye and the display screen is a crucial issue in realizing seamless AR experiences. A successful calibration hinges upon proper modeling of the display system which is conceptually broken down into an eye part and an HMD part. This paper breaks the HMD part down even further to investigate optical aberration issues. The display optics causes two different optical aberrations that degrade the calibration quality: the distortion of incoming light from the physical world, and that of light from the image source of the HMD. While methods exist for correcting either of the two distortions independently, there is, to our knowledge, no method which corrects for both simultaneously.
This paper proposes a calibration method that corrects both distortions simultaneously for an arbitrary eye position given an OST-HMD system. We extend a light-field (LF) correction approach [8] originally designed for the former distortion. Our method is camera-based and consists of an offline learning step and an online correction step. We verify our method in exemplary calibrations of two different OST-HMDs: a professional and a consumer device. The results show that our method significantly improves the calibration quality compared to a conventional method, with accuracy comparable to 20/50 visual acuity. The results also indicate that the quality improves only when both distortions are corrected simultaneously.
Chair: Yoshinari Kameda (University of Tsukuba)
In this work, we present a new automatic system for scene reconstruction which delivers high-level structural models. We start by identifying planar regions in depth images obtained with a SLAM system. Our main contribution is an approach which identifies constraints such as incidence and orthogonality between planar surfaces and uses them in an incremental optimization framework to extract high-level structural models. The result is a manifold mesh with a low number of polygons, immediately useful in many Augmented Reality applications.
Volumetric methods provide efficient, flexible and simple ways of integrating multiple depth images into a full 3D model. They provide dense and photorealistic 3D reconstructions, and parallelised implementations on GPUs achieve real-time performance on modern graphics hardware. However, running such methods on mobile devices, which would give users freedom of movement and instantaneous reconstruction feedback, remains challenging. In this paper we present a range of modifications to existing volumetric integration methods based on voxel block hashing, considerably improving their performance and making them applicable to tablet computer applications. We present (i) optimisations for the basic data structure and its allocation and integration; (ii) a highly optimised raycasting pipeline; and (iii) extensions to the camera tracker to incorporate IMU data. In total, our system thus achieves frame rates of up to 43 Hz on an Nvidia Shield Tablet and 820 Hz on an Nvidia GTX Titan X GPU, or even beyond 1 kHz without visualisation.
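For readers unfamiliar with voxel block hashing, the underlying data-structure idea can be sketched as follows; the block size, table size, and hash constants are illustrative assumptions (the primes are those commonly used in voxel-hashing implementations), not the paper's exact implementation:

```python
import numpy as np

BLOCK_SIZE = 8        # voxels per block side (assumed)
TABLE_SIZE = 1 << 20  # number of hash buckets (assumed)

def block_hash(bx, by, bz):
    """Spatial hash of integer block coordinates."""
    return ((bx * 73856093) ^ (by * 19349669) ^ (bz * 83492791)) % TABLE_SIZE

hash_table = {}  # bucket index -> list of (block coords, voxel data)

def allocate_block(px, py, pz, voxel_size=0.005):
    """Allocate, if needed, the voxel block containing a 3D point (metres).
    Only blocks near observed surfaces ever get allocated, which is what
    keeps the memory footprint small enough for mobile hardware."""
    bx, by, bz = (int(np.floor(p / (voxel_size * BLOCK_SIZE)))
                  for p in (px, py, pz))
    bucket = hash_table.setdefault(block_hash(bx, by, bz), [])
    if not any(coords == (bx, by, bz) for coords, _ in bucket):
        bucket.append(((bx, by, bz),
                       np.zeros((BLOCK_SIZE,) * 3, dtype=np.float32)))
```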
We present the first pipeline for real-time volumetric surface reconstruction and dense 6DoF camera tracking running purely on standard, off-the-shelf mobile phones. Using only the embedded RGB camera, our system allows users to scan objects of varying shape, size, and appearance in seconds, with real-time feedback during the capture process. Unlike existing state-of-the-art methods, which produce only point-based 3D models on the phone or require cloud-based processing, our hybrid GPU/CPU pipeline is unique in that it creates a connected 3D surface model directly on the device at 25 Hz. In each frame, we perform dense 6DoF tracking, which continuously registers the RGB input to the incrementally built 3D model, minimizing a noise-aware photoconsistency error metric. This is followed by efficient key-frame selection and dense per-frame stereo matching. These depth maps are fused volumetrically using a method akin to KinectFusion, producing compelling surface models. For each frame, the implicit surface is extracted for live user feedback and pose estimation. We demonstrate scans of a variety of objects, and compare to a Kinect-based baseline, showing on average ~1.5 cm error. We qualitatively compare to a state-of-the-art point-based mobile phone method, demonstrating an order of magnitude faster scanning times and fully connected surface models.
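The fusion step described as "akin to KinectFusion" can be sketched as the usual truncated signed distance function (TSDF) running average. The Python illustration below assumes a dense volume with its origin at the world origin and simplified parameters; it is not the paper's hybrid GPU/CPU pipeline:

```python
import numpy as np

def fuse_depth_map(tsdf, weights, depth, K, T_cw, voxel_size, trunc=0.02):
    """Integrate one depth map (metres) into a TSDF volume.
    tsdf, weights: (X, Y, Z) running signed distance and weight per voxel.
    K: 3x3 intrinsics, T_cw: 4x4 world-to-camera pose."""
    X, Y, Z = tsdf.shape
    ii, jj, kk = np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z),
                             indexing='ij')
    # Voxel centres in world coordinates, then in the camera frame.
    pts_w = np.stack([ii, jj, kk], -1).reshape(-1, 3) * voxel_size
    pts_c = (T_cw[:3, :3] @ pts_w.T + T_cw[:3, 3:4]).T
    z = pts_c[:, 2]
    # Project voxel centres into the depth image.
    uv = (K @ pts_c.T).T
    u = np.round(uv[:, 0] / np.maximum(z, 1e-6)).astype(int)
    v = np.round(uv[:, 1] / np.maximum(z, 1e-6)).astype(int)
    H, W = depth.shape
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d = np.zeros_like(z)
    d[valid] = depth[v[valid], u[valid]]
    # Keep voxels with valid depth that are not far behind the surface.
    valid &= (d > 0) & (d - z > -trunc)
    # Truncated signed distance and weighted running average.
    sdf = (np.clip(d - z, -trunc, trunc) / trunc).reshape(tsdf.shape)
    upd = valid.reshape(tsdf.shape)
    tsdf[upd] = (tsdf[upd] * weights[upd] + sdf[upd]) / (weights[upd] + 1)
    weights[upd] += 1
```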
Chair: Georg Klein (Microsoft Corporation)
We present a method which can quickly and robustly match 2D and 3D point patterns based solely on their spatial distribution, though it can also handle other cues if available. The method can be easily adapted to many transformations, such as similarity transformations in 2D/3D and affine and perspective transformations in 2D. It is based on local geometric consensus among several local matchings and a refinement scheme. We provide two implementations of this general scheme, one for the 2D homography case (which can be used for marker or image tracking) and one for the 3D similarity case. We demonstrate the robustness and speed of our proposal on both synthetic and real images and show that our method can be used to augment not only planar objects (textured or textureless) but also 3D objects.
Current methods for non-rigid augmented reality only provide an augmented view as long as the topology of the tracked object is not modified, which is an important limitation. In this paper we address this shortcoming by introducing a method for physics-based non-rigid augmented reality. Singularities caused by topological changes are detected by analyzing the displacement field of the underlying deformable model. These topological changes are then applied to the physics-based model to approximate the real cut. All these steps, from deformation to cutting simulation, are performed in real time. This significantly improves the coherence between the actual view and the model, and provides added value to the augmentation.
We propose a novel formulation for determining the absolute pose of a single or multi-camera system given a known vertical direction. The vertical direction may be easily obtained by detecting the vertical vanishing points with computer vision techniques, or with the aid of IMU sensor measurements from a smartphone. Our solver is general and able to compute absolute camera pose from two 2D-3D correspondences for single or multi-camera systems. We run several synthetic experiments that demonstrate our algorithm's improved robustness to image and IMU noise compared to the current state of the art. Additionally, we run an image localization experiment that demonstrates the accuracy of our algorithm in real-world scenarios. Finally, we show that our algorithm provides increased performance for real-time model-based tracking compared to solvers that do not utilize the vertical direction and show our algorithm in use with an augmented reality application running on a Google Tango tablet.
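The benefit of the known vertical direction can be made concrete: once the measured "up" direction is rotated onto the world vertical axis, only a rotation about that axis (yaw) and a 3D translation remain, i.e. 4 unknowns, which the 4 constraints from two 2D-3D correspondences can determine. A minimal sketch of that gravity-alignment step (our own helper based on Rodrigues' formula, not the paper's solver) could look like this:

```python
import numpy as np

def gravity_alignment_rotation(up_measured):
    """Rotation that maps the measured 'up' direction (e.g. the accelerometer
    reading of a static IMU, which points opposite to gravity) onto the world
    up axis [0, 0, 1].  After applying it, only yaw and translation are left
    to estimate from the 2D-3D correspondences."""
    g = np.asarray(up_measured, dtype=float)
    g /= np.linalg.norm(g)
    up = np.array([0.0, 0.0, 1.0])
    v = np.cross(g, up)
    s, c = np.linalg.norm(v), np.dot(g, up)
    if s < 1e-9:                        # already aligned or anti-aligned
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    # Rodrigues' formula for the rotation taking g onto up.
    return np.eye(3) + vx + vx @ vx * ((1.0 - c) / s**2)
```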
We present a method for large-scale geo-localization and global tracking of mobile devices in urban outdoor environments. In contrast to existing methods, we instantaneously initialize and globally register a SLAM map by localizing the first keyframe with respect to widely available untextured 2.5D maps. Given a single image frame and a coarse sensor pose prior, our localization method estimates the absolute camera orientation from straight line segments and the translation by aligning the city map model with a semantic segmentation of the image. We use the resulting 6DOF pose, together with information inferred from the city map model, to reliably initialize and extend a 3D SLAM map in a global coordinate system, applying a model-supported SLAM mapping approach. We show the robustness and accuracy of our localization approach on a challenging dataset, and demonstrate unconstrained global SLAM mapping and tracking of arbitrary camera motion on several sequences.
Chair: Stephan Lukosch (Delft University of Technology)
Coloring books capture the imagination of children and provide them with one of their earliest opportunities for creative expression. However, given the proliferation and popularity of digital devices, real-world activities like coloring can seem unexciting, and children become less engaged in them. Augmented reality holds unique potential to impact this situation by providing a bridge between real-world activities and digital enhancements. In this paper, we present an augmented reality coloring book app in which children color characters in a printed coloring book and inspect their work using a mobile device. The drawing is detected and tracked, and the video stream is augmented with an animated 3-D version of the character that is textured according to the child's coloring. This is possible thanks to several novel technical contributions. We present a texturing process that applies the captured texture from a 2-D colored drawing to both the visible and occluded regions of a 3-D character in real time. We develop a deformable surface tracking method designed for colored drawings that uses a new outlier rejection algorithm for real-time tracking and surface deformation recovery. We present a content creation pipeline to efficiently create the 2-D and 3-D content. Finally, we validate our work with two user studies that examine the quality of our texturing algorithm and the overall app experience.
In this paper we present a dual, wide-area, collaborative augmented reality (AR) system that consists of standard live-view augmentation, e.g., from a helmet, and zoomed-in view augmentation, e.g., from binoculars. The proposed advanced scouting capability allows long-range, high-precision augmentation of live unaided and zoomed-in imagery with aerial and terrain-based synthetic objects, vehicles, people, and effects. The inserted objects must appear stable in the display and not jitter or drift as the user moves around and examines the scene. The AR insertions for the binoculars must work instantly when they are picked up anywhere as the user moves around. The design of both AR modules is based on two different cameras with wide and narrow field of view (FoV) lenses. The wide FoV gives context and enables the recovery of the location and orientation of the prop in 6 degrees of freedom (DoF) much more robustly, whereas the narrow FoV is used for the actual augmentation and increased precision in tracking. Furthermore, the narrow camera of the unaided-eye module and the wide camera on the binoculars are jointly used for global yaw (heading) correction. We present our navigation algorithms, which use monocular cameras in combination with IMU and GPS in an Extended Kalman Filter (EKF) framework to obtain robust, real-time pose estimation for precise augmentation and cooperative tracking.
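The EKF-based fusion can be sketched at skeleton level; the state vector, motion model, and noise values below are illustrative simplifications of a camera/IMU/GPS filter, not the system described above:

```python
import numpy as np

class PoseEKF:
    """Bare-bones EKF skeleton: state = position (3), velocity (3), yaw (1)."""

    def __init__(self):
        self.x = np.zeros(7)
        self.P = np.eye(7)

    def predict(self, accel_world, yaw_rate, dt, q=1e-2):
        # IMU-driven constant-acceleration motion model.
        self.x[:3] += self.x[3:6] * dt + 0.5 * accel_world * dt**2
        self.x[3:6] += accel_world * dt
        self.x[6] += yaw_rate * dt
        F = np.eye(7)
        F[:3, 3:6] = np.eye(3) * dt
        self.P = F @ self.P @ F.T + q * np.eye(7)

    def update(self, z, h, H, R):
        # Generic EKF update for a measurement z with predicted value h and
        # Jacobian H (e.g. a GPS position fix or a camera-derived pose).
        S = H @ self.P @ H.T + R
        K = self.P @ H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - h)
        self.P = (np.eye(7) - K @ H) @ self.P

# Example GPS position update: H selects the position block of the state.
# ekf.update(z=gps_xyz, h=ekf.x[:3],
#            H=np.hstack([np.eye(3), np.zeros((3, 4))]), R=np.eye(3) * 4.0)
```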
Omnidirectional videos of real-world environments viewed on head-mounted displays with real-time head motion tracking can offer immersive visual experiences. For live streaming applications, compression is critical to reduce the bitrate. Omnidirectional videos, which are spherical in nature, are mapped onto one or more planes before encoding to interface with modern video coding standards. In this paper, we consider the problem of evaluating coding efficiency in the context of viewing with a head-mounted display. We extract viewport-based head motion trajectories and compare the original and coded videos on the viewport. With this approach, we compare different sphere-to-plane mappings. We show that the average viewport quality can be approximated by a weighted spherical PSNR.
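A commonly used form of weighted spherical PSNR for equirectangular frames weights each row by the cosine of its latitude, compensating for the oversampling near the poles; the sketch below shows that computation (our illustration, not necessarily the exact weighting used in the paper):

```python
import numpy as np

def weighted_spherical_psnr(ref, test, peak=255.0):
    """Weighted-to-spherically-uniform PSNR for two H x W grayscale
    equirectangular frames: each pixel row is weighted by cos(latitude)."""
    ref, test = ref.astype(np.float64), test.astype(np.float64)
    H, W = ref.shape
    lat = (np.arange(H) + 0.5) / H * np.pi - np.pi / 2   # row latitudes
    w = np.repeat(np.cos(lat)[:, None], W, axis=1)
    wmse = np.sum(w * (ref - test) ** 2) / np.sum(w)
    return 10.0 * np.log10(peak ** 2 / wmse)
```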
Chair: Dieter Schmalstieg (Graz University of Technology)
In the Shader Lamps concept, a projector-camera system augments physical objects with projected virtual textures, provided that a precise intrinsic and extrinsic calibration of the system is available. Calibrating such systems has in the past been an elaborate and lengthy task requiring a special calibration apparatus. Self-calibration methods, in turn, can estimate the calibration parameters automatically with no effort; however, they inherently lack global scale and are fairly sensitive to input data.
We propose a new semi-automatic calibration approach for projector-camera systems that - unlike existing auto-calibration approaches - additionally recovers the necessary global scale by projecting on an arbitrary object of known geometry. To this end, our method combines surface registration with bundle adjustment optimization on points reconstructed from structured light projections to refine a solution computed from the decomposition of the fundamental matrix. In simulations on virtual data and experiments with real data, we demonstrate that our approach estimates the global scale robustly and is furthermore able to improve incorrectly guessed intrinsic and extrinsic calibration parameters, thus outperforming comparable metric rectification algorithms.
This paper proposes a novel radiometric compensation technique for cooperative projection systems based on distributed optimization. To achieve high scalability and robustness, we assume a cooperative projection environment in which (1) each projector has no information about the other projectors or the target images, (2) the camera has no information about the projectors but does have the target images, and (3) only broadcast communication from the camera to the projectors is allowed, in order to limit the data transfer bandwidth. To this end, we first investigate a feedback mechanism based on distributed optimization that is suitable for this decentralized information processing environment. Next, we show that this mechanism works well for still-image projection but not for moving images, due to its lack of dynamic responsiveness. To overcome this issue, we exploit the specific structure of the distributed projector-camera system under consideration and propose an additional feedforward mechanism. Such a 2-Degree-Of-Freedom (2-DOF) control structure is well known in the control engineering community as a typical way of enhancing disturbance rejection and reference tracking simultaneously. We can theoretically guarantee that this 2-DOF structure yields moving-image projection accuracy exceeding the best performance achievable by the distributed optimization mechanism alone. The effectiveness of the proposed method is demonstrated through physical projection experiments.
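The combination of the distributed-optimization feedback with a target-driven feedforward path can be illustrated with a scalar toy controller; the gains and update rule below are assumptions for illustration, not the paper's control law:

```python
class TwoDofProjectorController:
    """Toy 2-DOF update for one projector: a feedback term driven by the
    broadcast camera error (disturbance rejection) plus a feedforward term
    driven by changes in the target image (reference tracking for video)."""

    def __init__(self, k_fb=0.3, k_ff=1.0):
        self.k_fb = k_fb            # feedback gain (assumed)
        self.k_ff = k_ff            # feedforward gain (assumed)
        self.output = 0.0
        self._prev_target = 0.0

    def step(self, broadcast_error, target):
        # Feedback: iterative correction toward the compensated image, as in
        # the distributed-optimization mechanism.
        fb = self.k_fb * broadcast_error
        # Feedforward: react immediately to target changes, restoring the
        # responsiveness needed for moving images.
        ff = self.k_ff * (target - self._prev_target)
        self._prev_target = target
        self.output += fb + ff
        return self.output
```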
Mobile devices are part of our everyday life and enable augmented reality (AR) using their integrated camera image. Recent research has shown that even photorealistic augmentations with consistent illumination are possible. The method that first achieved this distributed the lighting computations and the extraction of the important light sources. To reach real-time frame rates on a mobile device, the number of these extracted light sources must be kept low, limiting the scope of possible illumination scenarios and the quality of shadows. In this paper, we show how to reduce the computational cost per light using a combination of tile-based rendering and frustum culling techniques tailored for AR applications. Our approach runs entirely on the GPU and does not require any precomputation. Without reducing the displayed image quality, we achieve up to 2.2x speedup for typical AR scenarios.
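Tile-based light culling can be illustrated with a CPU-side sketch: project each light's bounding sphere into the image and record the light only in the screen tiles it overlaps, so per-pixel shading loops over a short per-tile list instead of all extracted lights. The tile size and the conservative screen-space bound below are assumptions for illustration; the paper's GPU implementation will differ:

```python
import numpy as np

TILE = 16  # tile size in pixels (assumed)

def cull_lights(lights, K, width, height):
    """lights: list of (center_xyz_in_camera_frame, radius).
    Returns a dict mapping (tile_x, tile_y) -> list of light indices."""
    tiles = {}
    for idx, (c, r) in enumerate(lights):
        if c[2] - r <= 0:            # behind or straddling the camera plane
            continue
        # Conservative screen-space bounds of the light's bounding sphere.
        uv = K @ np.asarray(c, dtype=float)
        u, v = uv[0] / uv[2], uv[1] / uv[2]
        px_radius = K[0, 0] * r / (c[2] - r)
        x0 = max(int((u - px_radius) // TILE), 0)
        x1 = min(int((u + px_radius) // TILE), (width - 1) // TILE)
        y0 = max(int((v - px_radius) // TILE), 0)
        y1 = min(int((v + px_radius) // TILE), (height - 1) // TILE)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                tiles.setdefault((tx, ty), []).append(idx)
    return tiles
```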
Chair: Greg Welch (The University of Central Florida)
Augmented Reality (AR) in microscopic surgery has been the subject of several studies over the past two decades. Nevertheless, AR has not found its way into everyday microsurgical workflows. The introduction of new surgical microscopes equipped with Optical Coherence Tomography (OCT) enables surgeons to perform multimodal (optical and OCT) imaging in the operating room. Taking full advantage of such an elaborate source of information requires sophisticated intraoperative image fusion, information extraction, guidance, and visualization methods. Medical AR is a unique approach to facilitate the utilization of multimodal medical imaging devices. Here we propose a novel medical AR solution to the long-known problem of determining the distance between the surgical instrument tip and the underlying tissue in ophthalmic surgery, to further pave the way for AR into the surgical theater. Our method brings augmented reality to OCT for the first time by augmenting the surgeon's view of the OCT images with an estimated instrument cross-section shape and distance to the retinal surface, using only information from the shadow of the instrument in intraoperative OCT images. We demonstrate the applicability of our method in retinal surgery using a phantom eye and evaluate the accuracy of the augmented information using a micromanipulator.
Image-guided medical interventions increasingly rely on Augmented Reality (AR) visualization to enable surgical navigation. Current systems use 2-D monitors to present the view from external cameras, which does not provide an ideal perception of the 3-D position of the region of interest. Despite this problem, most research targets the direct overlay of diagnostic imaging data, and only a few studies attempt to improve the perception of occluded structures in external camera views. The focus of this paper lies on improving the 3-D perception of an augmented external camera view by combining auditory and visual stimuli in a dynamic multi-sensory AR environment for medical applications. Our approach is based on Temporal Distance Coding (TDC) and an active surgical tool used to interact with occluded virtual objects of interest in the scene in order to gain an improved perception of their 3-D location. Users performed a simulated needle biopsy by targeting virtual lesions rendered inside a patient phantom. Experimental results demonstrate that our TDC-based visualization technique significantly improves localization accuracy, while the addition of auditory feedback results in increased intuitiveness and faster completion of the task.
This paper presents the first design of a mirror-based RGBD X-ray imaging system and includes an evaluation study of the depth errors induced by the mirror when used in combination with an infrared pattern-emission RGBD camera. Our evaluation consisted of three experiments. The first demonstrated almost no difference in the camera's depth measurements with and without the mirror. The final two experiments demonstrated that there were no relative or location-specific errors induced by the mirror, showing the feasibility of the RGBD X-ray imaging system. Lastly, we showcase the potential of the RGBD X-ray system in a visualization application in which an X-ray image is fused with the 3D reconstruction of the surgical scene obtained via the RGBD camera, using automatic C-arm pose estimation.
Chair: Stephen R. Ellis
We present SoftAR, a novel spatial augmented reality (AR) technique based on a pseudo-haptics mechanism in the human brain that visually manipulates the sense of softness perceived by a user pushing a soft physical object. Considering the limitations of projection-based approaches that change only the surface appearance of a physical object, we propose two projection visual effects, i.e., surface deformation effect (SDE) and body appearance effect (BAE), on the basis of the observations of humans pushing physical objects. The SDE visualizes a two-dimensional deformation of the object surface with a controlled softness parameter, and BAE changes the color of the pushing hand. Through psychophysical experiments, we confirm that the SDE can manipulate softness perception such that the participant perceives significantly greater softness than the actual softness. Furthermore, fBAE, in which BAE is applied only for the finger area, significantly enhances manipulation of the perception of softness. On the basis of the experimental results, we create a computational model that estimates perceived softness when SDE+fBAE is applied. We construct a prototype SoftAR system in which two application frameworks are implemented, i.e., softness adjustment and softness transfer. The former framework allows a user to adjust the softness parameter of a physical object, and the latter allows the user to replace the softness with that of another object. Through a user study of the prototype, we confirm that perceived softness can be manipulated with accuracy that is less than the just noticeable difference of softness perception. SoftAR does not require user-worn/hand-held equipment and allows users to feel significantly different softness perception without changing materials; therefore, we believe that it will be useful for various applications, particularly the design process of soft products, such as furniture, plush toys, and imitation materials.
Many compelling augmented reality (AR) applications, such as image-guided surgery, manufacturing, and maintenance, that involve the dexterous manipulation of real and virtual objects at reaching distances, require users to correctly perceive the location of virtual objects. Some of these applications require aligning real and virtual objects with accuracies as tight as 1 mm or less. However, measuring the perceived depth of AR objects at these accuracies has not yet been demonstrated. In this paper, we address this challenge by employing two different depth judgment methods from the literature, blind reaching and perceptual matching, in a series of three experiments, where observers judged the depth of real and AR target objects presented at reaching distances. Both depth judgment methods are promising solutions to the measurement challenge, but both also have limitations that warrant additional study. Our experiments found that observers can very accurately match the distance of a real target, but systematically overestimate the distance to an AR target viewed through collimating optics, resulting in 0.5 to 4.0 cm of error. However, a model in which the collimating optics cause the eyes' vergence angle to rotate outward by a constant angular amount explains these results, which were replicated three times. These findings give error bounds on using collimating AR displays at reaching distances, and suggest that AR displays need to provide variable focus for these reaching-distance applications. Our experiments further found that observers initially reach ~4 cm too short, but reaching accuracy improves with both consistent proprioception and corrective visual feedback to become nearly as accurate as matching. An additional contribution is the design of an apparatus that affords measuring depth judgments to within a few millimeters of precision.
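One way to write down such a constant-offset vergence model (our notation; the paper's exact formulation may differ): for interpupillary distance I and true target distance d, the geometric vergence angle is theta(d), and if the collimating optics rotate the eyes outward by a constant angle beta, the distance consistent with the reduced vergence is the overestimated value d-hat:

```latex
\theta(d) = 2\arctan\!\left(\frac{I}{2d}\right), \qquad
\hat{d} = \frac{I}{2\tan\!\bigl(\tfrac{\theta(d)-\beta}{2}\bigr)} > d .
```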
An effective interaction in augmented reality (AR) requires the utilization of different modalities. In this study, we investigated orienting the user in bimodal AR. Using auditory perception to support visual perception provides a useful approach for orienting the user towards directions that are outside the visual field of view (FOV). In particular, this is important in path-finding, where points of interest (POIs) can be all around the user. However, the ability to perceive audio POIs is affected by the ventriloquism effect (VE), which means that audio POIs are captured by visual POIs. We measured the spatial limits of the VE in AR using a video see-through head-worn display. The results showed that the amount of the VE in AR was approximately 5-15 deg higher than in a real environment. In AR, the spatial disparity between an audio and a visual POI should be at least 30 deg of azimuth angle in order for the audio and visual POIs to be perceived as separate. This limit was affected by the azimuth angle of the visual POI and the magnitude of head rotations. These results provide guidelines for designing bimodal AR systems.