Vineet Gandhi


 

  (v)(gandhi)@iiit.ac.in
+91-40-66531238

F25, Centre for Visual Information Technology (CVIT),
2nd floor, KCIS Research Block
Gachibowli, Hyderabad, India, 500032


 

I am currently an assistant professor at IIIT Hyderabad, where I am affiliated with the Centre for Visual Information Technology (CVIT). I completed my PhD at INRIA Rhône-Alpes/University of Grenoble in applied mathematics and computer science (mathématique appliquée et informatique), funded by a CIBLE scholarship from Région Rhône-Alpes. Before that, I completed my Master's with an Erasmus Mundus scholarship under the CIMET consortium. I am extremely thankful to the European Union for this opportunity, which had a huge impact on both my professional and personal life. I spent a semester each in Spain, Norway and France, then joined INRIA for my master's thesis and continued there for my PhD. I was also lucky to travel and explore Europe deeply (from the south of Spain to the north of Norway), at times surviving purely on gestures for communication. I obtained my Bachelor of Technology degree from the Indian Institute of Information Technology, Design and Manufacturing (IIITDM) Jabalpur, India (I belong to the first batch of the institute).

I like to focus on problems with tangible goals, and I try to build end-to-end solutions (with neat engineering). My current research interests are in applied machine learning for applications in computer vision and multimedia. In recent years, I have been exploring specific problems in computational videography/cinematography; image/video editing; multi-sensor fusion for 3D analytics; sports analytics; document analysis; and visual detection, tracking and recognition. Outside work, I like to spend time with my family, play cards, go on bicycle rides, read ancient literature and explore mountains.

CV [pdf] [LinkedIn]
Current/Past Affiliations



News

Jan 30, 2018   One paper accepted at Eurographics, two at ICASSP, one at WACV and one at ICRA. Congratulations to Moneish, Kranthi, Pranjal, Sajal, Vatsal, Rahul, Bharath, Ashar and Krishnam.
April 28, 2017   Awarded the Early Career Research Award (ECRA) by the Science and Engineering Research Board (SERB), Govt. of India
April 24, 2017   We are organizing a workshop on Intelligent Cinematography and Editing (WICED 2017) at Eurographics this year
Feb 1, 2017   "Zooming On All Actors: Automatic Focus+Context Split Screen Video Generation" accepted at Eurographics 2017
Dec 27, 2016   Gave a talk at the Mysore Park Workshop on Vision, Language and AI, (VLAI 2016)

 


Selected Publications

List of all publications @ Google Scholar

  Javed Syed Ashar, Saxena Shreyas and Gandhi Vineet
Learning Unsupervised Visual Grounding Through Semantic Self-Supervision
arXiv 2018
[abstract] [paper] [www] [bibTex]
Localizing natural language phrases in images is a challenging problem that requires joint understanding of both the textual and visual modalities. In the unsupervised setting, the lack of supervisory signals exacerbates this difficulty. In this paper, we propose a novel framework for unsupervised visual grounding which uses concept learning as a proxy task to obtain self-supervision. The simple intuition behind this idea is to encourage the model to localize to regions which can explain some semantic property in the data, in our case, the property being the presence of a concept in a set of images. We present thorough quantitative and qualitative experiments to demonstrate the efficacy of our approach and show a 5.6% improvement over the current state of the art on the Visual Genome dataset, a 5.8% improvement on the ReferItGame dataset and performance comparable to the state of the art on the Flickr30k dataset.
@inproceedings{Javed-arxiv-2018, AUTHOR = {Javed Syed Ashar, Shreyas Saxena and Gandhi Vineet}, TITLE = {Learning Unsupervised Visual Grounding Through Semantic Self-Supervision}, BOOKTITLE = {arXiv:1803.06506}, YEAR = {2018}}
  Gupta Krishnam, Javed Syed Ashar, Gandhi Vineet and Krishna Madhava K.
MergeNet: A Deep Net Architecture for Small Obstacle Discovery
ICRA 2018
[abstract] [paper] [www] [bibTex]
We present MergeNet, a novel network architecture for discovering small obstacles in on-road scenes in the context of autonomous driving. The basis of the architecture rests on the central consideration of training with a small amount of data, since the physical setup and the annotation process for small obstacles are hard to scale. For making effective use of the limited data, we propose a multi-stage training procedure involving weight-sharing, separate learning of low- and high-level features from the RGBD input, and a refining stage which learns to fuse the obtained complementary features. The model is trained and evaluated on the Lost and Found dataset and achieves state-of-the-art results with just 135 images, compared to the 1000 images used by the previous benchmark. Additionally, we compare our results with recent methods trained on 6000 images and show that our method achieves comparable performance with only 1000 training samples.
@inproceedings{Gupta-icra-2018, AUTHOR = {Gupta Krishnam, Javed Syed Ashar, Gandhi Vineet and Krishna Madhava K.}, TITLE = {{MergeNet: A Deep Net Architecture for Small Obstacle Discovery}}, BOOKTITLE = {ICRA}, YEAR = {2018}}
  Kumar Kranthi, Kumar Moneish, Gandhi Vineet, Subramanian Ramanathan
Watch to Edit: Video Retargeting using Gaze
Eurographics 2018
[abstract] [paper] [www] [bibTex]
We present a novel approach to optimally retarget videos for varied displays with differing aspect ratios by preserving salient scene content discovered via eye tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions, and (ii) adhere to the principles of cinematography. Our approach is (a) content agnostic as the same methodology is employed to re-edit a wide-angle video recording or a close-up movie sequence captured with a static or moving camera, and (b) independent of video length and can in principle re-edit an entire movie in one shot. Our algorithm consists of two steps. The first step employs gaze transition cues to detect time stamps where new cuts are to be introduced in the original video via dynamic programming. A subsequent step optimizes the cropping window path (to create pan and zoom effects), while accounting for the original and new cuts. The cropping window path is designed to include maximum gaze information, and is composed of piecewise constant, linear and parabolic segments. It is obtained via L1-regularized convex optimization which ensures a smooth viewing experience. We test our approach on a wide variety of videos and demonstrate significant improvement over the state-of-the-art, both in terms of computational complexity and qualitative aspects. A study performed with 16 users confirms that our approach results in a superior viewing experience as compared to gaze-driven re-editing [JSSH15] and letterboxing methods, especially for wide-angle static camera recordings.
@inproceedings{kumar-eg-2018, title = {{Watch to Edit: Video Retargeting using Gaze}}, author = {Kumar Kranthi, Kumar Moneish, Gandhi Vineet, Subramanian Ramanathan}, booktitle = {Eurographics}, year = {2018}, }
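A minimal sketch of the kind of optimization described above, assuming per-frame gaze x-coordinates and the cvxpy library; the weights, the single horizontal axis and the least-squares data term are illustrative simplifications, not the paper's implementation. The L1 penalties on successive differences are what make the solution decompose into piecewise constant, linear and parabolic segments (static shots, pans and eased movements).

# Minimal sketch (not the paper's code): fit a smooth horizontal crop-centre
# path to hypothetical per-frame gaze positions with L1 penalties on the
# first three differences of the path.
import numpy as np
import cvxpy as cp

np.random.seed(0)
gaze_x = np.cumsum(np.random.randn(300)) + 640       # hypothetical gaze x per frame

x = cp.Variable(gaze_x.shape[0])                      # crop-window centre per frame
data_term = cp.sum_squares(x - gaze_x)                # keep gazed-at content inside the crop
smoothness = (10 * cp.norm1(cp.diff(x, k=1))          # piecewise constant -> static shots
              + 100 * cp.norm1(cp.diff(x, k=2))       # piecewise linear -> constant-velocity pans
              + 1000 * cp.norm1(cp.diff(x, k=3)))     # piecewise parabolic -> eased moves

cp.Problem(cp.Minimize(data_term + smoothness)).solve()
crop_centres = x.value                                # smooth virtual pan path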
  Rai Pranjal Kumar, Maheshwari Sajal, and Gandhi Vineet
Document Quality Estimation using Spatial Frequency Response
ICASSP (oral) 2018
[abstract] [paper] [www] [bibTex]
The current Document Image Quality Assessment (DIQA) algorithms directly relate the Optical Character Recognition (OCR) accuracies with the quality of the document to build supervised learning frameworks. This direct correlation has two major limitations: (a) OCR may be affected by factors independent of the quality of the capture and (b) it cannot account for blur variations within an image. An alternate possibility is to quantify the quality of capture using human judgement, however, it is subjective and prone to error. In this work, we build upon the idea of Spatial Frequency Response (SFR) to reliably quantify the quality of a document image. We present through quantitative and qualitative experiments that the proposed metric leads to significant improvement in document quality prediction in contrast to using OCR as ground truth.
@inproceedings{Rai-ICASSP-2018, title = {Document Quality Estimation using Spatial Frequency Response}, author = {Rai Pranjal Kumar, Maheshwari Sajal, and Gandhi Vineet}, booktitle = {ICASSP}, year = {2018}, }
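The SFR computation itself can be illustrated with a small, generic slanted-edge-style sketch: differentiate an edge-spread profile to obtain a line-spread function and take its Fourier magnitude. The synthetic edges and the windowing choice below are assumptions for illustration; this is not the DIQA pipeline from the paper.

# Generic illustration of a spatial frequency response computed from an edge
# profile (assumed inputs; not the paper's document-quality pipeline).
import numpy as np

def sfr_from_edge_profile(edge_profile):
    """edge_profile: 1-D intensity samples across a step edge (the ESF)."""
    esf = np.asarray(edge_profile, dtype=np.float64)
    lsf = np.diff(esf)                        # line spread function = d(ESF)/dx
    lsf = lsf * np.hanning(len(lsf))          # window to reduce spectral leakage
    mtf = np.abs(np.fft.rfft(lsf))
    return mtf / (mtf[0] + 1e-12)             # normalised response; faster fall-off => blurrier

sharp = np.r_[np.zeros(32), np.ones(32)]                     # synthetic sharp edge
blurred = np.convolve(sharp, np.ones(7) / 7, mode="same")    # synthetic blurred edge
print(sfr_from_edge_profile(sharp)[:5])
print(sfr_from_edge_profile(blurred)[:5])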
  Shah Vatsal, and Gandhi Vineet
An Iterative approach for Shadow Removal in Document Images
ICASSP 2018
[abstract] [paper] [www] [bibTex]
Uneven illumination and shadows in document images cause a challenge for digitization applications and automated workflows. In this work, we propose a new method to recover unshadowed document images from images with shadows/uneven illumination. We pose this problem as one of estimating the shading and reflectance components of the given original image. Our method first estimates the shading and uses it to compute the reflectance. The output reflectance map is then used to improve the shading and the process is repeated in an iterative manner. The iterative procedure allows for a gradual compensation and allows our algorithm to even handle difficult hard shadows without introducing any artifacts. Experiments over two different datasets demonstrate the efficacy of our algorithm and the low computation complexity makes it suitable for most practical applications.
@inproceedings{Shah-ICASSP-2018, title = {An Iterative approach for Shadow Removal in Document Images}, author = {Shah Vatsal and Gandhi Vineet}, booktitle = {ICASSP}, year = {2018}, }
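To make the shading/reflectance idea concrete, here is a toy sketch under assumed choices (a Gaussian low-pass shading estimate, a fixed iteration count and simple normalisation); it illustrates the iterative decomposition described above but is not the published algorithm.

# Toy sketch of iterative shading/reflectance decomposition for document
# shadow removal. Illustrative only; the filter, iteration count and
# normalisation are assumptions, not the ICASSP 2018 algorithm.
import cv2
import numpy as np

def deshadow(gray, iterations=3, sigma=25):
    img = gray.astype(np.float32) + 1.0
    reflectance = np.ones_like(img)
    for _ in range(iterations):
        # improve the shading estimate using the current reflectance ...
        shading = cv2.GaussianBlur(img / reflectance, (0, 0), sigma)
        # ... then recompute reflectance from image = reflectance * shading
        reflectance = np.clip(img / (shading + 1e-3), 0.0, 2.0)
    out = cv2.normalize(reflectance, None, 0, 255, cv2.NORM_MINMAX)
    return out.astype(np.uint8)

page = cv2.imread("shadowed_page.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input
if page is not None:
    cv2.imwrite("page_deshadowed.png", deshadow(page))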
  Sharma Rahul Anand, Bhat Bharath, Gandhi Vineet, Jawahar CV
Automated Top View Registration of Broadcast Football Videos
WACV 2018
[abstract] [paper] [bibTex]
In this paper, we propose a fully automatic method to register football broadcast video frames on the static top view model of the playing surface. Automatic registration has been difficult due to the challenge of finding sufficient point correspondences. We investigate an alternate approach exploiting the edge information from the line markings on the field. We formulate the registration problem as a nearest neighbour search over a synthetically generated dictionary of edge map and homography pairs. The synthetic dictionary generation allows us to exhaustively cover a wide variety of camera angles and positions and reduces this problem to a minimal per-frame edge map matching problem. We show that the per-frame results can be further improved in videos using an optimization framework for temporal camera stabilization. We demonstrate the efficacy of our approach by presenting extensive results on a dataset collected from matches of the football World Cup 2014 and show significant improvement over the current state of the art.
@inproceedings{anand-WACV-2018, title = {{Automated Top View Registration of Broadcast Football Videos}}, author = {Sharma Rahul Anand, Bhat Bharath, Gandhi Vineet, Jawahar CV}, booktitle = {{WACV}}, year = {2018}, }
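The dictionary-lookup formulation can be sketched in a few lines, assuming the synthetic dictionary of edge maps and homographies has already been rendered at the frame resolution; the Canny edge extraction and the dot-product matching score below are simplifications chosen for illustration, not the matching scheme used in the paper.

# Conceptual sketch of registering a frame by nearest-neighbour lookup in a
# dictionary of (edge map, homography) pairs. Dictionary rendering is omitted
# and the matching score is a simple overlap; treat this purely as an illustration.
import cv2
import numpy as np

def query_edge_map(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return (cv2.Canny(gray, 50, 150) > 0).astype(np.float32).ravel()

def register_frame(frame_bgr, edge_dictionary, homographies):
    """edge_dictionary: (N, H*W) binary edge maps rendered from known homographies."""
    query = query_edge_map(frame_bgr)
    scores = edge_dictionary @ query          # overlap with each synthetic edge map
    return homographies[int(np.argmax(scores))]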
  Moudgil Abhinav and Gandhi Vineet
Long-Term Visual Object Tracking Benchmark
arXiv 2017
[abstract] [paper] [dataset] [www] [bibTex]
In this paper, we propose a new long video dataset (called Track Long and Prosper - TLP) and benchmark for visual object tracking. The dataset consists of 50 videos from real world scenarios, encompassing a duration of over 400 minutes (676K frames), making it more than 20 times larger in average duration per sequence and more than 8 times larger in terms of total covered duration, as compared to existing generic datasets for visual tracking. The proposed dataset paves the way to suitably assess long term tracking performance and possibly train better deep learning architectures (avoiding/reducing augmentation, which may not reflect realistic real world behavior). We benchmark the dataset on 17 state-of-the-art trackers and rank them according to tracking accuracy and run time speeds. We further categorize the test sequences with different attributes and present a thorough quantitative and qualitative evaluation. Our most interesting observations are (a) existing short sequence benchmarks fail to bring out the inherent differences in tracking algorithms, which widen while tracking on long sequences, and (b) the accuracy of most trackers drops abruptly on challenging long sequences, suggesting the potential need of research efforts in the direction of long term tracking.
@inproceedings{moudgil-arxiv-2017, title = {{Long-Term Visual Object Tracking Benchmark}}, author = {Moudgil Abhinav and Gandhi Vineet}, booktitle = {arxiv}, year = {2017}, }
  Rai Pranjal Kumar*, Maheshwari Sajal*, Mehta Ishit, Sakurikar Parikshit and Gandhi Vineet
Beyond OCRs for Document Blur Estimation
ICDAR 2017
[abstract] [paper] [dataset] [www] [bibTex]
The current document blur/quality estimation algorithms rely on the OCR accuracy to measure their success. A sharp document image, however, at times may yield lower OCR accuracy owing to factors independent of blur or quality of capture. The necessity to rely on OCR is mainly due to the difficulty in quantifying the quality otherwise. In this work, we overcome this limitation by proposing a novel dataset for document blur estimation, for which we physically quantify the blur using a capture set-up which computationally varies the focal distance of the camera. We also present a selective search mechanism to improve upon the recently successful patch-based learning approaches (using codebooks or convolutional neural networks). We present a thorough analysis of the improved blur estimation pipeline using correlation with OCR accuracy as well as the actual amount of blur. Our experiments demonstrate that our method outperforms the current state-of-the-art by a significant margin.
@inproceedings{pranjal-icdar-2017, title = {{Beyond OCRs for Document Blur Estimation}}, author = {Pranjal Kumar Rai, Sajal Maheshwari, Ishit Mehta, Parikshit Sakurikar and Vineet Gandhi}, booktitle = {ICDAR}, year = {2017}, }
  Kumar Moneish, Gandhi Vineet, Ronfard Remi and Gleicher Michael
Zooming On All Actors: Automatic Focus+Context Split Screen Video Generation
Eurographics 2017
[abstract] [paper] [slides (original keynote) ] [www] [bibTex]
Recordings of stage performances are easy to capture with a high-resolution camera, but are difficult to watch because the actors' faces are too small. We present an approach to automatically create a split screen video that transforms these recordings to show both the context of the scene as well as close-up details of the actors. Given a static recording of a stage performance and tracking information about the actors' positions, our system generates videos showing a focus+context view based on computed close-up camera motions using crop-and-zoom. The key to our approach is to compute these camera motions such that they are cinematically valid close-ups and to ensure that the set of views of the different actors is properly coordinated and presented. We pose the computation of camera motions as a convex optimization that creates detailed views and smooth movements, subject to cinematic constraints such as not cutting faces with the edge of the frame. Additional constraints link the close-up views of each actor, causing them to merge seamlessly when actors are close. Generated views are placed in a resulting layout that preserves the spatial relationships between actors. We demonstrate our results on a variety of staged theater and dance performances.
@inproceedings{kumar-eg-2017, title = {{Zooming On All Actors: Automatic Focus+Context Split Screen Video Generation}}, author = {Kumar Moneish, Gandhi Vineet, Ronfard Remi and Gleicher Michael}, booktitle = {Eurographics}, year = {2017}, }
  Maheshwari Sajal, Rai Pranjal, Sharma Gopal and Gandhi Vineet
Document Blur Detection using Edge Profile Mining
ICVGIP 2016
[abstract] [paper] [www] [bibTex]
We present an algorithm for automatic blur detection of document images using a novel approach based on edge intensity profiles. Our main insight is that the edge profiles are a strong indicator of the blur present in the image, with steep profiles implying sharper regions and gradual profiles implying blurred regions. Our approach first retrieves the profiles for each point of intensity transition (each edge point) along the gradient and then uses them to output a quantitative measure indicating the extent of blur in the input image. The real time performance of the proposed approach makes it suitable for most applications. Additionally, our method works for both hand written and digital documents and is agnostic to the font types and sizes, which gives it a major advantage over the currently prevalent learning based approaches. Extensive quantitative and qualitative experiments over two different datasets show that our method outperforms almost all algorithms in the current state of the art by a significant margin, especially in cross dataset experiments.
@inproceedings{sajal-icvgip-2016, title = {{Document Blur Detection using Edge Profile Mining}}, author = {Maheshwari Sajal, Rai Pranjal, Sharma Gopal and Gandhi Vineet}, booktitle = {ICVGIP}, year = {2016}, }
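As a rough illustration of the edge-profile intuition (wide intensity transitions at edge points indicate blur, steep narrow ones indicate sharpness), here is a simplified sketch; the horizontal-only walk, the thresholds and the mean-width score are assumptions, not the measure proposed in the paper.

# Rough sketch of an edge-profile blur measure (illustration only, not the
# ICVGIP 2016 method).
import cv2
import numpy as np

def blur_score(gray, max_width=20):
    edges = cv2.Canny(gray, 50, 150)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)          # horizontal gradient
    widths = []
    ys, xs = np.nonzero(edges)
    for y, x in zip(ys[::25], xs[::25]):            # subsample edge points for speed
        step = 1 if gx[y, x] > 0 else -1
        w = 1
        # walk along the gradient while the intensity keeps changing noticeably,
        # i.e. measure how many pixels the edge transition spans
        while (w < max_width and 0 <= x + step * w < gray.shape[1] and
               abs(int(gray[y, x + step * w]) - int(gray[y, x + step * (w - 1)])) > 2):
            w += 1
        widths.append(w)
    return float(np.mean(widths)) if widths else 0.0  # larger score => more blur

doc = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)    # hypothetical input image
if doc is not None:
    print("estimated blur:", blur_score(doc))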
  Gandhi Vineet and Ronfard Remi
A Computational Framework for Vertical Video Editing
Eurographics Workshop on Intelligent Cinematography and Editing (WICED) 2016
[abstract] [paper] [bibTex]
Vertical video editing is the process of digitally editing the image within the frame as opposed to horizontal video editing, which arranges the shots along a timeline. Vertical editing can be a time-consuming and error-prone process when using manual key-framing and simple interpolation. In this paper, we present a general framework for automatically computing a variety of cinematically plausible shots from a single input video suitable to the special case of live performances. Drawing on working practices in traditional cinematography, the system acts as a virtual camera assistant to the film editor, who can call novel shots in the edit room with a combination of high-level instructions and manually selected keyframes.
@inproceedings{gandhi-egw-2016, title = {{A Computational Framework for Vertical Video Editing }}, author = {Gandhi Vineet and Ronfard Remi}, booktitle = { Eurographics workshop on intelligent cinematography and editing (WICED)}, year = {2016}, }
  Gandhi Vineet
Automatic Rush Generation with Application to Theatre Performances
PhD dissertation
[abstract] [pdf] [bibTex]
Professional quality videos of live staged performances are created by recording them from different appropriate viewpoints. These are then edited together to portray an eloquent story that draws out the intended emotion from the viewers. Creating such competent videos typically requires a team of skilled camera operators to capture the scene from multiple viewpoints. In this thesis, we explore an alternative approach where we automatically compute camera movements in post-production using specially designed computer vision methods.

A high-resolution static camera replaces the camera crew, and efficient camera movements are then simulated by virtually panning, tilting and zooming within the original recordings. We show that multiple virtual cameras can be simulated by choosing different trajectories of cropping windows inside the original recording. One of the key novelties of this work is an optimization framework for computing the virtual camera trajectories using the information extracted from the original video based on computer vision techniques.

The actors present on stage are considered as the most important elements of the scene. For the task of localizing and naming actors, we introduce generative models for learning view independent person and costume specific detectors from a set of labeled examples. We explain how to learn the models from a small number of labeled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum likelihood framework. We demonstrate that such actor specific models can accurately localize actors despite changes in view point and occlusions, and significantly improve the detection recall rates over generic detectors.

The thesis then proposes an offline algorithm for tracking objects and actors in long video sequences using these actor specific models. Detections are first performed to independently select candidate locations of the actor/object in each frame of the video. The candidate detections are then combined into smooth trajectories by minimizing a cost function accounting for false detections and occlusions.

Using the actor tracks, we then describe a method for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Our method requires only minimal user input to define the subject matter of each sub-clip. The composition of each sub-clip is automatically computed in a novel convex optimization framework. Our approach encodes several common cinematographic practices into a single convex cost function minimization problem, resulting in aesthetically-pleasing sub-clips which can easily be edited together using off-the-shelf multiclip video editing software.

The proposed methods have been tested and validated on a challenging corpus of theatre recordings. They open the way to novel applications of computer vision methods for cost-effective video production of live performances including, but not restricted to, theatre, music and opera.
@inproceedings{gandhi-dissertation-2014, title = {{Automatic Rush Generation with Application to Theatre Performances}}, author = {Gandhi, Vineet}, booktitle = {PhD dissertation}, year = {2014}, }
  Gandhi Vineet, Ronfard Remi, Gleicher Michael
Multi-Clip Video Editing from a Single Viewpoint
European Conference on Visual Media Production (CVMP) 2014
[abstract] [paper] [Presentation slides] [www] [bibTex]
We propose a framework for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Assuming important actors and objects can be localized using computer vision techniques, our method requires only minimal user input to define the subject matter of each sub-clip. The composition of each sub-clip is automatically computed in a novel L1-norm optimization framework. Our approach encodes several common cinematographic practices into a single convex cost function minimization problem, resulting in aesthetically-pleasing sub-clips which can easily be edited together using off-the-shelf multi-clip video editing software. We demonstrate our approach on five video sequences of a live theatre performance by generating multiple synchronized sub-clips for each sequence.
@inproceedings{gandhi-cvmp-2014, title = {{Multi-Clip Video Editing from a Single Viewpoint}}, author = {Gandhi, Vineet and Ronfard, Remi and Gleicher, Michael}, booktitle = {European Conference on Visual Media Production (CVMP)}, year = {2014}, }
  Gandhi Vineet, Ronfard Remi
Detecting and naming actors in movies using generative appearance models
Computer Vision and Pattern Recognition (CVPR) 2013
[abstract] [paper] [CVPR 2013 Poster (1.8mb)] [www] [bibTex]
We introduce a generative model for learning person and costume specific detectors from labeled examples. We demonstrate the model on the task of localizing and naming actors in long video sequences. More specifically, the actor's head and shoulders are each represented as a constellation of optional color regions. Detection can proceed despite changes in view-point and partial occlusions. We explain how to learn the models from a small number of labeled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum likelihood framework. We present results on a challenging movie example, with 81% recall in actor detection (coverage) and 89% precision in actor identification (naming).
@inproceedings{gandhi-cvpr-2013, title = {{Detecting and Naming Actors in Movies using Generative Appearance Models}}, author = {Gandhi, Vineet and Ronfard, Remi}, booktitle = {CVPR}, year = {2013}, }
  Gandhi Vineet, Cech Jan and Horaud Radu
High-Resolution Depth Maps Based on TOF-Stereo Fusion
IEEE International Conference on Robotics and Automation (ICRA) 2012
[abstract] [paper] [www] [bibTex]
The combination of range sensors with color cameras can be very useful for robot navigation, semantic perception, manipulation, and telepresence. Several methods of combining range- and color-data have been investigated and successfully used in various robotic applications. Most of these systems suffer from the problems of noise in the range-data and resolution mismatch between the range sensor and the color cameras, since the resolution of current range sensors is much less than the resolution of color cameras. High-resolution depth maps can be obtained using stereo matching, but this often fails to construct accurate depth maps of weakly/repetitively textured scenes, or if the scene exhibits complex self-occlusions. Range sensors provide coarse depth information regardless of presence/absence of texture. The use of a calibrated system, composed of a time-of-flight (TOF) camera and of a stereoscopic camera pair, allows data fusion thus overcoming the weaknesses of both individual sensors. We propose a novel TOF-stereo fusion method based on an efficient seed-growing algorithm which uses the TOF data projected onto the stereo image pair as an initial set of correspondences. These initial ''seeds'' are then propagated based on a Bayesian model which combines an image similarity score with rough depth priors computed from the low-resolution range data. The overall result is a dense and accurate depth map at the resolution of the color cameras at hand. We show that the proposed algorithm outperforms 2D image-based stereo algorithms and that the results are of higher resolution than off-the-shelf color-range sensors, e.g., Kinect. Moreover, the algorithm potentially exhibits real-time performance on a single CPU.
@inproceedings{gandhi-icra-2012, AUTHOR = {Gandhi, Vineet and Cech, Jan and Horaud, Radu}, TITLE = {{High-Resolution Depth Maps Based on TOF-Stereo Fusion}}, BOOKTITLE = {ICRA}, YEAR = {2012}}
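The seed-growing step can be sketched as a best-first propagation from the projected TOF points, assuming rectified images and precomputed seeds; the real method also re-estimates each pixel's disparity with a Bayesian model combining image similarity and range priors, which is omitted in this illustration.

# Conceptual sketch of seed-growing TOF-stereo fusion (illustration only, not
# the ICRA 2012 implementation): start from sparse disparities given by the
# projected TOF points and greedily grow them to neighbouring pixels whose
# left-right matching cost stays low.
import heapq
import numpy as np

def grow_disparity(left, right, seeds, max_cost=10.0):
    """left/right: float32 rectified grayscale images; seeds: {(y, x): disparity}."""
    h, w = left.shape
    disp = np.full((h, w), np.nan, dtype=np.float32)
    heap = [(0.0, y, x, d) for (y, x), d in seeds.items()]
    heapq.heapify(heap)
    while heap:
        cost, y, x, d = heapq.heappop(heap)
        if not np.isnan(disp[y, x]):
            continue                                  # already assigned by a cheaper path
        disp[y, x] = d
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w and nx - int(round(d)) >= 0
                    and np.isnan(disp[ny, nx])):
                c = abs(float(left[ny, nx]) - float(right[ny, nx - int(round(d))]))
                if c < max_cost:
                    heapq.heappush(heap, (c, ny, nx, d))  # propagate the seed's disparity
    return disp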
  Ronfard Remi, Gandhi Vineet and Boiron Laurent
The Prose Storyboard Language: A Tool for Annotating and Directing Movies
Workshop on Intelligent Cinematography and Editing (WICED), part of Foundations of Digital Games (FDG) 2013
[abstract] [paper] [www] [bibTex]
The prose storyboard language is a formal language for describing movies shot by shot, where each shot is described with a unique sentence. The language uses a simple syntax and limited vocabulary borrowed from working practices in traditional movie-making, and is intended to be readable both by machines and humans. The language is designed to serve as a high-level user interface for intelligent cinematography and editing systems.
@inproceedings{ronfard-wiced-2013, title = {{The Prose Storyboard Language: A Tool for Annotating and Directing Movies}}, author = {Ronfard, Remi and Gandhi, Vineet and Boiron, Laurent}, booktitle = {{2nd Workshop on Intelligent Cinematography and Editing part of Foundations of Digital Games - FDG 2013}}, year = {2013}, }

 


Teaching

Statistical Methods in Artificial Intelligence - Monsoon 2017
Digital Image Processing - Monsoon 2015, Monsoon 2016
Digital Signal Analysis and Applications - Spring 2016, Spring 2017, Spring 2018

 


Students

Coming soon

 


Misc

Coming soon

 

 

Last updated: April 2018