We propose an active object detection and localization framework that combines a robust untextured object detection and 3D pose estimation algorithm with a novel next-best-view selection strategy. We address the detection and localization problems with an edge-based registration algorithm (D²CO) that refines the object position by minimizing a cost directly extracted from a 3D image tensor encoding the minimum distance to an edge point in a joint direction/location space. We face the next-best-view problem with a sequential decision process that, at each step, selects the camera position maximizing the mutual information between the state and the next observations. We overcome the intrinsic intractability of this computation by generating observations that represent scene realizations, i.e., sampled combinations of the object hypotheses provided by the object detector, while modeling the state by means of a set of constantly resampled particles.
D²CO Object Localization
We believe that vision, possibly coupled with depth information used to provide scale and location priors, remains the primary source of information for detecting and localizing objects in challenging environments. In many cases, edge-based algorithms still provide superior performance. Unfortunately, in our experience we found that: (i) the huge 6D search space imposes a coarse-grained viewpoint discretization, so many time-consuming object registration steps must usually be performed over a large set of object candidates in order to accurately detect the true best matches; (ii) given a single view of the scene as input, often none of the tested state-of-the-art matching algorithms provide the best, true-positive matches as their first output. To address these problems, in this work we present an effective active perception framework based on the Direct Directional Chamfer Optimization registration method (D²CO). Our method provides: (i) fast and robust 3D object registration; (ii) an effective active perception strategy able to resolve detection ambiguities and to improve the object localization accuracy. The key idea of the D²CO registration algorithm is to refine the parameters (i.e., the object pose) using a cost function that exploits the Directional Chamfer Distance (DCD) tensor in a direct way, i.e., by retrieving the costs and the derivatives directly from the tensor. Being a piecewise smooth function of both the image translation and the (edge) orientation, the DCD ensures a wide basin of convergence. Unlike other registration algorithms based on the ICP method, D²CO does not require re-computing the point-to-point correspondences, since the data association is implicitly encoded in the DCD tensor.
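The direct use of the DCD tensor described above can be sketched as follows: the model's edge points (with their edge orientations) are transformed by a candidate pose, and the cost is read straight out of the precomputed tensor, with no explicit data association. This is only a minimal illustration of the lookup idea, not the paper's implementation: it uses a simplified 2D rigid pose in place of the full 6D projection and nearest-neighbor tensor lookups instead of interpolation, and all names (`dcd_cost`, `dcd_tensor`, etc.) are hypothetical.

```python
import numpy as np

def dcd_cost(dcd_tensor, edge_points, edge_dirs, pose):
    """Sum of DCD lookups for model edge points under a candidate pose.

    dcd_tensor: (H, W, n_theta) array; entry [y, x, t] is the minimum
        distance to an image edge in the joint location/orientation space.
    edge_points: (N, 2) model edge points; edge_dirs: (N,) orientations.
    pose: (tx, ty, theta) -- a simplified 2D rigid pose standing in for
        the full 6D object pose of the real method.
    """
    tx, ty, theta = pose
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    pts = edge_points @ R.T + np.array([tx, ty])
    dirs = (edge_dirs + theta) % np.pi          # edge orientation is mod pi
    H, W, n_theta = dcd_tensor.shape
    xs = np.clip(np.round(pts[:, 0]).astype(int), 0, W - 1)
    ys = np.clip(np.round(pts[:, 1]).astype(int), 0, H - 1)
    ts = np.clip(np.round(dirs / np.pi * n_theta).astype(int), 0, n_theta - 1)
    # Cost (and, in the real method, its derivatives) come directly from
    # the tensor: no point-to-point correspondences are recomputed.
    return float(dcd_tensor[ys, xs, ts].sum())
```

A gradient-based optimizer can then minimize this cost over the pose parameters, exploiting the wide basin of convergence that the piecewise smooth DCD provides.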
Active Object Recognition and Localization Strategy
In many cases, a single view of a scene does not provide sufficient information to detect and accurately locate the objects of interest: objects in the working area can be mutually occluded; moreover, different objects may look very similar from certain viewpoints. We address these problems first by introducing a very simple but effective multi-view extension of the D²CO algorithm, then by proposing a novel solution to the next-best-view (NBV) problem that aims to resolve detection ambiguities while maximizing the confidence and the localization accuracy. The NBV problem is the sensor placement problem that, given the previous sensor measurements, asks for the next sensor position that yields a better understanding of the scene.
In our approach we employ a sequential decision process that, at each step, selects the camera position maximizing the mutual information between the state (i.e., the object position) and the observations. Unfortunately, a direct computation of the mutual information is often intractable, since it requires iterating over the whole observation space and the whole state space. We address this problem by introducing a novel model-based observation sampling algorithm.
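To make the intractability concrete, the sketch below computes the mutual information I(X; Z) exactly for small discrete spaces; the double sum over every state and every observation is precisely what blows up when X is a 6D pose space and Z an image space. This is a textbook computation for illustration, not the paper's algorithm, and the function names are our own.

```python
import numpy as np

def mutual_information(p_x, p_z_given_x):
    """Exact I(X; Z) for a discrete state space X and observation space Z.

    p_x: (n_x,) prior over states.
    p_z_given_x: (n_x, n_z) observation likelihoods, rows summing to 1.
    """
    p_z = p_x @ p_z_given_x                       # marginal over observations
    mi = 0.0
    for i in range(len(p_x)):                     # iterate the whole state space ...
        for j in range(p_z_given_x.shape[1]):     # ... and the whole observation space
            joint = p_x[i] * p_z_given_x[i, j]
            if joint > 0:
                mi += joint * np.log(joint / (p_x[i] * p_z[j]))
    return mi
```

For a continuous 6D pose and high-dimensional observations, neither loop can be enumerated, which motivates the sampling scheme described next.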
The idea is to generate observations by means of “scene realizations”, i.e., by sampling combinations of the object hypotheses provided by the object detector. The next observations can be synthesized by projecting the scene realizations in an efficient way. We then select the next view as the one that maximizes the mutual information between the system state and the synthesized observations. We model the probability density function over the state by means of a set of particles that represent object positions: this allows us to represent multi-modal probability functions and implicitly enables our system to detect multiple instances of an object type. At each new scene observation, we improve the localization accuracy by applying the multi-view D²CO algorithm to all the collected images.
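The sampling strategy above can be sketched as follows: scene realizations are drawn as subsets of the detector's hypotheses, observations are synthesized per candidate view, and the view whose synthesized observations are most informative is selected. This is a strongly simplified sketch under our own assumptions: hypotheses are included independently with a probability derived from their detector scores, observations are abstract hashable values produced by a caller-supplied `synthesize` function, and with deterministic synthesis the mutual-information proxy reduces to the entropy of the observations. All names are illustrative, not the paper's.

```python
import numpy as np
from collections import Counter

def sample_realizations(hypotheses, scores, n_samples, rng):
    """Draw scene realizations: each hypothesis is included independently
    with probability proportional to its detector score (an illustrative
    sampling scheme, not necessarily the paper's)."""
    probs = np.asarray(scores) / max(scores)
    return [tuple(h for h, p in zip(hypotheses, probs) if rng.random() < p)
            for _ in range(n_samples)]

def entropy(counts, total):
    ps = np.array([c / total for c in counts.values()])
    return float(-(ps * np.log(ps)).sum())

def next_best_view(views, hypotheses, scores, synthesize, n_samples=200, seed=0):
    """Pick the view maximizing an MI proxy H(Z) - H(Z|X), where Z are
    observations synthesized from sampled scene realizations."""
    rng = np.random.default_rng(seed)
    realizations = sample_realizations(hypotheses, scores, n_samples, rng)
    best_view, best_mi = None, -np.inf
    for v in views:
        obs = [synthesize(v, r) for r in realizations]   # synthesized observations
        h_z = entropy(Counter(obs), len(obs))
        # Here each observation is a deterministic function of its
        # realization, so H(Z|X) = 0 and the proxy reduces to H(Z).
        if h_z > best_mi:
            best_mi, best_view = h_z, v
    return best_view
```

A view from which the sampled realizations all project to the same observation carries no information and is never selected, which matches the intuition behind the mutual-information criterion.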
The top-right image shows the object localization result given a set of object pose candidates (top-left). The object-position particles are shown in the second and third rows.
A detailed description of D²CO and of the proposed active detection and localization framework can be found in the two publications below. Please cite these papers to refer to the proposed methods. The second work, currently an arXiv preprint, has also been submitted to Computer Vision and Image Understanding.
- Marco Imperoli and Alberto Pretto, “D²CO: Fast and Robust Registration of 3D Textureless Objects Using the Directional Chamfer Distance”, in Proceedings of the 10th International Conference on Computer Vision Systems (ICVS 2015), July 6-9, 2015, Copenhagen, Denmark, pages 316-328.
- Marco Imperoli and Alberto Pretto, “Active Detection and Localization of Textureless Objects in Cluttered Environments”, arXiv preprint arXiv:1603.07022.