Is it possible to implement an artificial intelligence (AI) in metallic devices? This question stayed with me after I took a class called "Computer and Mind." It led to another question: whether an AI could perceive the physical world fundamentally through the criterion of survival. I was also fascinated to learn from the book "On Intelligence" that a single unit of computation in the brain, the neuron, and its connectivity can process various types of input data. This fascination, combined with my interest in visual data, led me to study computer vision as the first step toward building an artificial intelligence. My primary research goal is to understand the underlying nature of input data and to predict meaningful structures using efficient representations.
My first exposure to computer vision came during my Master’s. I extensively analyzed the effect of the bounding-box representation of an object, which, due to its simplicity, is widely used for object tracking and object detection. In particular, I focused on handling the ambiguity induced by the mismatch between the shape of the object and the bounding box. In my thesis, I proposed appearance models for accurately discriminating the object region from the background. I also participated in a study that used two bounding boxes to avoid using information from the ambiguous region around the conventional single bounding box; this work was presented at ECCV 2014. After graduation, I joined the Korea Institute of Science and Technology (KIST) as a research staff member and conducted research on scene flow estimation from a pair of RGB-D images. Dense correspondence estimation is an essential problem in modeling a dynamic 3D object such as a human. With the help of an RGB-D camera, I had access to both image and depth data. I tried to generalize total variation (TV), a widely used motion prior that is robust near boundaries. Employing total generalized variation (TGV) made the estimator prefer more natural solutions. Furthermore, I adopted a deformation graph, a graph that efficiently leverages the geometry of the surface, to estimate motion with large displacements.
Through these research projects, I realized that efficient representation is of significant importance in many computer vision problems, and I decided to develop methods for combining multiple structures with informative data. For instance, the superpixel, the element I used as a segment in my thesis, has an irregular shape and carries less information than a patch. Although superpixels reduced the computational cost, they lack distinctive properties such as contours and repeatability. Therefore, additional computation was required to handle superpixels. In 3D motion estimation, combining two different types of information, texture and geometry, was a challenging problem, since each type of data was represented and transformed into an energy term independently even though the two are correlated. There is no guarantee that the minimizer of a simple weighted sum of energy terms is a true global solution when the individual energy terms have discordant solutions. Furthermore, we rely heavily on the motion prior, which is not data dependent; we use only our own conjecture instead of the abundant data available on the web.
I believe that learning the relationship between multiple types of input data and the desired output can solve these problems. Deep learning can be an effective way to realize this; it automatically builds hierarchical representations of the input data and effectively combines them to estimate the desired output. There is much left to explore in the analysis of deep neural networks, both theoretically and practically. Therefore, I aim to open the black box and develop efficient methods for various estimation problems. In particular, I am interested in structured output prediction and in delving deeper into the elements of the network. I recently began studying human pose estimation using a convolutional neural network (CNN) with a structured loss. The goal of this work is to develop an image-dependent pairwise term for structured prediction that does not require manual clustering or discretization of the input data. In my doctoral study, I hope to continue working in these areas, seeking precise prediction of informative structures by learning the underlying properties of data.