Vineet Gandhi

Home PublicationsTeachingStudentsMisc



F25, Centre for Visual Information Technology (CVIT),
2nd floor, KCIS Research Block
Gachibowli, Hyderabad, India, 500032


I am currently an assistant professor at IIIT Hyderabad, where I am affiliated with Center for Visual Information Techonology ( CVIT ). I also advise a beautiful animation startup on their AI and ML related efforts ( I completed my PhD degree at INRIA Rhone Alpes/Univesity of Grenoble in applied mathematics and computer science (mathématique appliquée et informatique) under the guidance of Remi Ronfard. I was funded by the CIBLE scholarship by Region Rhone Alpes. Prior to this, I completed my Masters with Erasmus Mundus scholarship under CIMET consortium. I am extremely thankful to European Union for giving me this opportunity which had a huge impact on both my professional and personal life. I spent a semester each in Spain, Norway and France and later joined INRIA for my master thesis and continued for my PhD there. I was also lucky to travel and deeply explore Europe (from south of Spain to North of Norway), at times purely surviving on gestural expressions for communication. I obtained my Bachelor of Technology degree from Indian Institute of Information Technology, Design and Manufacturing (IIITDM) Jabalpur, India (I belong to the first batch of the insititute).

I like to focus on problems with tangible goals and I try to build end to end solutions (with neat engineering). My current research interests are in applied machine learning for applications in computer vision and multimedia. The current primary threads of my research are around creative AI, multimodal ML and exploring generalization aspect of ML models. In personal space, I like to spend time with my family, play tennis, play cards, read ancient literature and explore mountains.


June 26, 2021   Our team is super winner of Qualcomm Innovation Fellowship (QIF) 2020. Congratulations to Unni, Nivedita, Kanishk and Prof. Madhav Krishna. As a result, the fellowship has been extended for another year.
March 10, 2021   Moneish gets admit to CMU. With two full papers in Eurographics, one full paper in CHI, during his MS, it was more than deserved. Best Wishes for him to keep excelling. As an interesting observation, my first own MS student (Moneish), first jointly advised MS student (Rahul Anand), first RA (Ashar) and first honours student (Sajal Maheshwari) all end up at CMU.


Selected Publications

List of all publications @ Google Scholar

  Saiteja Kosgi, Sarath Sivaprasad, Niranjan Pedanekar, Anil Nelakanti, Vineet Gandhi
Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems
NAACL 2022
[abstract] [paper] [code(coming soon)] [bibTex]
We present a method to control the emotional prosody of Text to Speech (TTS) systems by using phoneme-level intermediate features (pitch, energy, and duration) as levers. As a key idea, we propose Differential Scaling (DS) to disentangle features relating to affective prosody from those arising due to acoustics conditions and speaker identity. With thorough experimental studies, we show that the proposed method improves over the prior art in accurately emulating the desired emotions while retaining the naturalness of speech. We extend the traditional evaluation of using individual sentences for a more complete evaluation of HCI systems. We present a novel experimental setup by replacing an actor with a TTS system in offline and live conversations. The emotion to be rendered is either predicted or manually assigned. The results show that the proposed method is strongly preferred over the state-of-the-art TTS system and adds the much-coveted "human touch" in machine dialogue. Audio samples from our experiments are available at:
@inproceedings{Sai-naacl-2022, AUTHOR = {Saiteja Kosgi, Sarath Sivaprasad, Niranjan Padnekar, Anil Nelakanti, Vineet Gandhi}, TITLE = {Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems}, BOOKTITLE = {NAACL}, YEAR = {2022}}
  Kanishk Jain and Vineet Gandhi
Comprehensive Multi-Modal Interactions for Referring Image Segmentation
Findings of ACL 2022
[abstract] [paper] [code(coming soon)] [bibTex]
We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the given natural language description. To solve RIS efficiently, we need to understand each word's relationship with other words, each region in the image to other regions, and cross-modal alignment between linguistic and visual domains. Recent methods model these three types of interactions sequentially. We argue that such a modular approach limits these methods' performance, and joint simultaneous reasoning can help resolve ambiguities. To this end, we propose a Joint Reasoning (JRM) module and a novel Cross-Modal Multi-Level Fusion (CMMLF) module for tackling this task. JRM effectively models the referent's multi-modal context by jointly reasoning over visual and linguistic modalities (performing word-word, image region-region, word-region interactions in a single module). CMMLF module further refines the segmentation masks by exchanging contextual information across visual hierarchy through linguistic features acting as a bridge. We present thorough ablation studies and validate our approach's performance on four benchmark datasets, and show that the proposed method outperforms the existing state-of-the-art methods on all four datasets by significant margins.
@inproceedings{Kanishk-arxiv-2021, AUTHOR = {Kanishk Jain, Vineet Gandhi}, TITLE = {Comprehensive Multi-Modal Interactions for Referring Image Segmentation}, BOOKTITLE = {Findings of ACL}, YEAR = {2022}}
  Sarath Sivaprasad, Akshay Goindani, Vaibhav Garg, Vineet Gandhi
Reappraising Domain Generalization in Neural Networks
arXiv 2021
[abstract] [paper] [code (coming soon)] [bibTex]
Domain generalization (DG) of machine learning algorithms is defined as their ability to learn a domain agnostic hypothesis from multiple training distributions, which generalizes onto data from an unseen domain. DG is vital in scenarios where the target domain with distinct characteristics has sparse data for training. Aligning with recent work~\cite{gulrajani2020search}, we find that a straightforward Empirical Risk Minimization (ERM) baseline consistently outperforms existing DG methods. We present ablation studies indicating that the choice of backbone, data augmentation, and optimization algorithms overshadows the many tricks and trades explored in the prior art. Our work leads to a new state of the art on the four popular DG datasets, surpassing previous methods by large margins. Furthermore, as a key contribution, we propose a classwise-DG formulation, where for each class, we randomly select one of the domains and keep it aside for testing. We argue that this benchmarking is closer to human learning and relevant in real-world scenarios. We comprehensively benchmark classwise-DG on the DomainBed and propose a method combining ERM and reverse gradients to achieve the state-of-the-art results. To our surprise, despite being exposed to all domains during training, the classwise DG is more challenging than traditional DG evaluation and motivates more fundamental rethinking on the problem of DG.
@inproceedings{bib_dg_2021, AUTHOR = {Sarath Sivaprasad, Akshay Goindani, Vaibhav Garg, Vineet Gandhi}, TITLE = {Reappraising Domain Generalization in Neural Networks}, BOOKTITLE = {arXiv:2110.07981}, YEAR = {2021}}
  Jeet Vora, Swetanjal Dutta, Shyamgopal Karthik, Vineet Gandhi
Bringing Generalization to Deep Multi-view Detection
arXiv 2021
[abstract] [paper] [code] [bibTex]
Multi-view Detection (MVD) is highly effective for occlusion reasoning and is a mainstream solution in various applications that require accurate top-view occupancy maps. While recent works using deep learning have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. The key novelty of our work is to formalize three critical forms of generalization and propose experiments to investigate them: i) generalization across a varying number of cameras, ii) generalization with varying camera positions, and finally, iii) generalization to new scenes. We find that existing state-of-the-art models show poor generalization by overfitting to a single scene and camera configuration. We propose modifications in terms of pre-training, pooling strategy, regularization, and loss function to an existing state-of-the-art framework, leading to successful generalization across new camera configurations and new scenes. We perform a comprehensive set of experiments on the WildTrack and MultiViewX datasets to (a) motivate the necessity to evaluate MVD methods on generalization abilities and (b) demonstrate the efficacy of the proposed approach. The code is publicly available at
@inproceedings{mvdet_jeet_2021, AUTHOR = {Jeet Vora, Swetanjal Dutta, Shyamgopal Karthik, Vineet Gandhi}, TITLE = {Bringing Generalization to Deep Multi-view Detection}, BOOKTITLE = {arXiv:2109.12227}, YEAR = {2021}}
  Nivedita Rufus, Kanishk Jain, Unni Krishnan R Nair, Vineet Gandhi and K Madhava Krishna
Grounding Linguistic Commands to Navigable Regions
IROS 2021
[abstract] [paper] [code (coming soon)] [bibTex]
Humans have a natural ability to effortlessly comprehend linguistic commands such as “park next to the yellow sedan” and instinctively know which region of the road the vehicle should navigate. Extending this ability to autonomous vehicles is the next step towards creating fully autonomous agents that respond and act according to human commands. To this end, we propose the novel task of Referring Navigable Regions (RNR), i.e., grounding regions of interest for navigation based on the linguistic command. RNR is different from Referring Image Segmentation (RIS), which focuses on grounding an object referred to by the natural language expression instead of grounding a navigable region. For example, for a command “park next to the yellow sedan,” RIS will aim to segment the referred sedan, and RNR aims to segment the suggested parking region on the road. We introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2car [1] dataset with segmentation masks for the regions described by the linguistic commands. A separate test split with concise manoeuvre-oriented commands is provided to assess the practicality of our dataset. We benchmark the proposed dataset using a novel transformer-based architecture. We present extensive ablations and show superior performance over baselines on multiple evaluation metrics. A downstream path planner generating trajectories based on RNR outputs confirms the efficacy of the proposed framework.
@inproceedings{iros_rnr_2021, AUTHOR = {Nivedita Rufus, Kanishk Jain, Unni Krishnan R Nair, Vineet Gandhi and K Madhava Krishna}, TITLE = {Grounding Linguistic Commands to Navigable Regions}, BOOKTITLE = {IROS}, YEAR = {2021}}
  Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi
ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction
IROS 2021
[abstract] [paper] [code] [bibTex]
We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first network to do so. We also explore a variation of ViNet architecture by augmenting audio features into the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and provides the same output irrespective of the input. Interestingly, we also observe similar behaviour in the previous state-of-the-art models \cite{tsiami2020stavis} for audio-visual saliency prediction. Our findings contrast with previous works on deep learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations incorporating audio in a more effective manner. The code and pre-trained models are available at
@inproceedings{samyak-arxiv-2020, AUTHOR = {Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi}, TITLE = {ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction}, BOOKTITLE = {IROS}, YEAR = {2021}}
  Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi
Emotional Prosody Control for Speech Generation
[abstract] [paper] [code (coming soon)] [bibTex]
Machine-generated speech is characterized by its limited or unnatural emotional variation. Current text to speech systems generates speech with either a flat emotion, emotion selected from a predefined set, average variation learned from prosody sequences in training data or transferred from a source style. We propose a text to speech(TTS) system, where a user can choose the emotion of generated speech from a continuous and meaningful emotion space (Arousal-Valence space). The proposed TTS system can generate speech from the text in any speaker's style, with fine control of emotion. We show that the system works on emotion unseen during training and can scale to previously unseen speakers given his/her speech sample. Our work expands the horizon of the state-of-the-art FastSpeech2 backbone to a multi-speaker setting and gives it much-coveted continuous (and interpretable) affective control, without any observable degradation in the quality of the synthesized speech. Audio samples are available at
@inproceedings{Sarath-Interspeech-2021, AUTHOR = {Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi}, TITLE = {Emotional Prosody Control for Speech Generation}, BOOKTITLE = {Interspeech}, YEAR = {2021}}
  Sarath Sivaprasad, Ankur Singh, Naresh Manwani, Vineet Gandhi
The Curious Case of Convex Neural Networks
[abstract] [paper] [code] [bibTex]
In this paper, we investigate a constrained formulation of neural networks where the output is a convex function of the input. We show that the convexity constraints can be enforced on both fully connected and convolutional layers, making them applicable to most architectures. The convexity constraints include restricting the weights (for all but the first layer) to be non-negative and using a non-decreasing convex activation function. Albeit simple, these constraints have profound implications on the generalization abilities of the network. We draw three valuable insights: (a) Input Output Convex Networks (IOC-NN) self regularize and almost uproot the problem of overfitting; (b) Although heavily constrained, they come close to the performance of the base architectures; and (c) The ensemble of convex networks can match or outperform the non convex counterparts. We demonstrate the efficacy of the proposed idea using thorough experiments and ablation studies on MNIST, CIFAR10, and CIFAR100 datasets with three different neural network architectures. The code for this project is publicly available at:
@inproceedings{Sarath-IOCNN-2020, AUTHOR = {Sarath Sivaprasad, Naresh Manwani, Vineet Gandhi}, TITLE = {The Curious Case of Convex Networks}, BOOKTITLE = {ECML}, YEAR = {2021}}
  Shyamgopal Karthik, Ameya Prabhu, Puneet Dokania and Vineet Gandhi
No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks
International Conference on Learning Representations (ICLR) 2021
[abstract] [paper] [code] [bibTex]
There has been increasing interest in building deep hierarchy-aware classifiers, aiming to quantify and reduce the severity of mistakes and not just count the number of errors. The idea is to exploit the label hierarchy (e.g., WordNet ontology) and consider graph distances as a proxy for mistake severity. Surprisingly, on examining mistake-severity distributions of the top-1 prediction, we find that current state-of-the-art hierarchy-aware deep classifiers do not show practical improvement in making better mistakes than the standard cross-entropy baseline. In fact, they reduce the average mistake-severity metric by largely making additional low-severity or easily avoidable mistakes. This might explain the noticeable accuracy drop. To this end, we resort to the classical Conditional Risk Minimization (CRM) framework for hierarchy aware classification. Given a cost matrix and a reliable estimate of likelihoods (obtained from a trained network), CRM simply amends mistakes at inference time; it needs no extra parameters; it requires adding just one line of code to the standard cross-entropy baseline. It significantly outperforms the state-of-the-art and consistently obtains large reductions in the average hierarchical distance of top-k predictions across datasets, with very little loss in accuracy. Since CRM does not require retraining or fine-tuning of any hyperparameter, it can be used with any off-the-shelf cross-entropy trained model.
@inproceedings{Shyam-iclr-2021, AUTHOR = {Shyamgopal Karthik, Ameya Prabhu, Puneet Dokania and Vineet Gandhi}, TITLE = {No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks}, BOOKTITLE = {ICLR}, YEAR = {2021}}
  Shyamgopal Karthik, Ameya Prabhu, Vineet Gandhi
Simple Unsupervised Multi-Object Tracking
arXiv 2020
[abstract] [paper] [code] [bibTex]
Multi-object tracking has seen a lot of progress recently, albeit with substantial annotation costs for developing better and larger labeled datasets. In this work, we remove the need for annotated datasets by proposing an unsupervised re-identification network, thus sidestepping the labeling costs entirely, required for training. Given unlabeled videos, our proposed method (SimpleReID) first generates tracking labels using SORT and trains a ReID network to predict the generated labels using crossentropy loss. We demonstrate that SimpleReID performs substantially better than simpler alternatives, and we recover the full performance of its supervised counterpart consistently across diverse tracking frameworks. The observations are unusual because unsupervised ReID is not expected to excel in crowded scenarios with occlusions, and drastic viewpoint changes. By incorporating our unsupervised SimpleReID with CenterTrack trained on augmented still images, we establish a new state-of-the-art performance on popular datasets like MOT16/17 without using tracking supervision, beating current best (CenterTrack) by 0.2-0.3 MOTA and 4.4-4.8 IDF1 scores. We further provide evidence for limited scope for improvement in IDF1 scores beyond our unsupervised ReID in the studied settings. Our investigation suggests reconsideration towards more sophisticated, supervised, end-to-end trackers by showing promise in simpler unsupervised alternatives.
@inproceedings{Shyam-Arxiv-2020, AUTHOR = {Shyamgopal Karthik, Ameya Prabhu, Vineet Gandhi}, TITLE = {Simple Unsupervised Multi-Object Tracking}, BOOKTITLE = {arXiv:2006.02609}, YEAR = {2020}}
  Aasheesh Singh, Aditya Kamireddypalli, Vineet Gandhi, K Madhava Krishna
LiDAR guided Small obstacle Segmentation
IROS 2020
[abstract] [paper] [code] [bibTex]
Detecting small obstacles on the road is critical for autonomous driving. In this paper, we present a method to reliably detect such obstacles through a multi-modal framework of sparse LiDAR(VLP-16) and Monocular vision. LiDAR is employed to provide additional context in the form of confidence maps to monocular segmentation networks. We show significant performance gains when the context is fed as an additional input to monocular semantic segmentation frameworks. We further present a new semantic segmentation dataset to the community, comprising of over 3000 image frames with corresponding LiDAR observations. The images come with pixel-wise annotations of three classes off-road, road, and small obstacle. We stress that precise calibration between LiDAR and camera is crucial for this task and thus propose a novel Hausdorff distance based calibration refinement method over extrinsic parameters. As a first benchmark over this dataset, we report our results with 73% instance detection up to a distance of 50 meters on challenging scenarios. Qualitatively by showcasing accurate segmentation of obstacles less than 15 cms at 50m depth and quantitatively through favourable comparisons vis a vis prior art, we vindicate the method's efficacy. Our project-page and Dataset is hosted at
@inproceedings{aavk-IROS-2020, AUTHOR = {Aasheesh Singh, Aditya Kamireddypalli, Vineet Gandhi, K Madhava Krishna}, TITLE = {LiDAR guided Small obstacle Segmentation}, BOOKTITLE = {IROS}, YEAR = {2020}}
  Navyasri Reddy, Samyak Jain, Pradeep Yarlagadda, Vineet Gandhi
Tidying Deep Saliency Prediction Architectures
IROS 2020
[abstract] [paper] [code] [bibTex]
Learning computational models for visual attention (saliency estimation) is an effort to inch machines/robots closer to human visual cognitive abilities. Data-driven efforts have dominated the landscape since the introduction of deep neural network architectures. In deep learning research, the choices in architecture design are often empirical and frequently lead to more complex models than necessary. The complexity, in turn, hinders the application requirements. In this paper, we identify four key components of saliency models, i.e., input features, multi-level integration, readout architecture, and loss functions. We review the existing state of the art models on these four components and propose novel and simpler alternatives. As a result, we propose two novel end-to-end architectures called SimpleNet and MDNSal, which are neater, minimal, more interpretable and achieve state of the art performance on public saliency benchmarks. SimpleNet is an optimized encoder-decoder architecture and brings notable performance gains on the SALICON dataset (the largest saliency benchmark). MDNSal is a parametric model that directly predicts parameters of a GMM distribution and is aimed to bring more interpretability to the prediction maps. The proposed saliency models can be inferred at 25fps, making them suitable for real-time applications. Code and pre-trained models are available at
@inproceedings{Navya-IROS-2020, AUTHOR = {Navyasri Reddy, Samyak Jain, Pradeep Yarlagadda, Vineet Gandhi}, TITLE = {Tidying Deep Saliency Prediction Architectures}, BOOKTITLE = {IROS}, YEAR = {2020}}
  K L Bhanu Moorthy, Moneish Kumar, Ramanathan Subramanian, Vineet Gandhi
GAZED – Gaze-guided Cinematic Editing of Wide-Angle Monocular Video Recordings
ACM CHI 2020 (Conference on Human Factors in Computing Systems)
[abstract] [paper] [www] [bibTex]
We present GAZED– eye GAZe-guided EDiting for videos captured by a solitary, static, wide-angle and high-resolution camera. Eye-gaze has been effectively employed in computational applications as a cue to capture interesting scene content; we employ gaze as a proxy to select shots for inclusion in the edited video. Given the original video, scene content and user eye-gaze tracks are combined to generate an edited video comprising cinematically valid actor shots and shot transitions to generate an aesthetic and vivid representation of the original narrative. We model cinematic video editing as an energy minimization problem over shot selection, whose constraints capture cinematographic editing conventions. Gazed scene locations primarily determine the shots constituting the edited video. Effectiveness of GAZED against multiple competing methods is demonstrated via a psychophysical study involving 12 users and twelve performance videos.
@inproceedings{Bhanu-CHI-2020, AUTHOR = {Moorthy Bhanu K L, Kumar Moneish, Subramanian Ramanathan, Gandhi Vineet}, TITLE = {GAZED– Gaze-guided Cinematic Editing of Wide-Angle Monocular Video Recordings}, BOOKTITLE = {CHI}, YEAR = {2020}}
  Shyamgopal Karthik, Abhinav Moudgil, Vineet Gandhi
Exploring 3 R's of Long-term Tracking: Re-detection, Recovery and Reliability
WACV 2020
[abstract] [paper] [www] [bibTex]
Recent works have proposed several long term tracking benchmarks and highlight the importance of moving towards long-duration tracking to bridge the gap with application requirements. The current evaluation methodologies, however, do not focus on several aspects that are crucial in a long term perspective like Re-detection, Recovery, and Reliability. In this paper, we propose novel evaluation strategies for a more in-depth analysis of trackers from a long-term perspective. More specifically, (a) we test re-detection capability of the trackers in the wild by simulating virtual cuts, (b) we investigate the role of chance in the recovery of tracker after failure and (c) we propose a novel metric allowing visual inference on the ability of a tracker to track contiguously (without any failure) at a given accuracy. We present several original insights derived from an extensive set of quantitative and qualitative experiments.
@inproceedings{Sudheer-cine-2019, AUTHOR = {Karthik Shyamgopal, Moudgil Abhinav, Gandhi Vineet}, TITLE = {Exploring 3 R's of Long-term Tracking: Re-detection, Recovery and Reliability}, BOOKTITLE = {WACV}, YEAR = {2020}}
  Sudheer Achary, Syed Ashar Javed, Nikita Shravan, K L Bhanu Moorthy, Vineet Gandhi, Anoop Namboodiri
CineFilter: Unsupervised Filtering for Real Time Autonomous Camera Systems
arXiv 2019
[abstract] [paper] [www] [bibTex]
Learning to mimic the smooth and deliberate camera movement of a human cameraman is an essential requirement for autonomous camera systems. This paper presents a novel formulation for online and real-time estimation of smooth camera trajectories. Many works have focused on global optimization of the trajectory to produce an offline output. Some recent works have tried to extend this to the online setting, but lack either in the quality of the camera trajectories or need large labeled datasets to train their supervised model. We propose two models, one a convex optimization based approach and another a CNN based model, both of which can exploit the temporal trends in the camera behavior. Our model is built in an unsupervised way without any ground truth trajectories and is robust to noisy outliers. We evaluate our models on two different settings namely a basketball dataset and a stage performance dataset and compare against multiple baselines and past approaches. Our models outperform other methods on quantitative and qualitative metrics and produce smooth camera trajectories that are motivated by cinematographic principles. These models can also be easily adopted to run in real-time with a low computational cost, making them fit for a variety of applications.
@inproceedings{Sudheer-cine-2019, AUTHOR = {Achary Sudheer, Ashar Javed Syed, Shravan Nikita, Bhanu Moorthy K L , Gandhi Vineet , Namboodiri Anoop }, TITLE = {CineFilter: Unsupervised Filtering for Real Time Autonomous Camera Systems}, BOOKTITLE = {arXiv:1912.05636}, YEAR = {2019}}
  N. N. Sriram, Maniar Tirth, Kalyanasundaram Jayaganesh, Gandhi Vineet, Bhowmick Brojeshwar, Krishna Madhava K.
Talk to the Vehicle: Language Conditioned Autonomous Navigation of Self Driving Cars
IROS 2019
[abstract] [paper] [www] [bibTex]
We propose a novel pipeline that blends encodings from natural language and 3D semantic maps obtained from computer vision data to generate local trajectories that are executed by a low-level controller. The pipeline precludes the need for a prior registered map through a local waypoint generator neural network. The waypoint generator network (WGN) maps semantics and natural language encodings (NLE) to local waypoints. A local planner then generates a trajectory from the ego location of the vehicle (an outdoor car in this case) to these locally generated waypoints while a low-level controller executes these plans faithfully. The efficacy of the pipeline is verified in the CARLA simulator environment as well as on local semantic maps built from real-world KITTI dataset. In both these environments (simulated and real-world) we show the ability of the WGN to generate waypoints accurately by mapping NLE of varying sequence lengths and levels of complexity. We compare with baseline approaches and show significant performance gain over them. And finally, we show real implementations on our electric car verifying that the pipeline lends itself to practical and tangible realizations in uncontrolled outdoor settings. In loop execution of the proposed pipeline that involves repetitive invocations of the network is critical for any such language-based navigation framework. This effort successfully accomplishes this thereby bypassing the need for prior metric maps or strategies for metric level localization during traversal..
@inproceedings{Sriram-IROS-2019, AUTHOR = {N. N. Sriram, Maniar Tirth, Kalyanasundaram Jayaganesh, Gandhi Vineet, Bhowmick Brojeshwar and Krishna Madhava K.}, TITLE = {Talk to the Vehicle: Language Conditioned Autonomous Navigation of Self Driving Cars}, BOOKTITLE = {IROS}, YEAR = {2019}}
  Javed Syed Ashar, Saxena Shreyas and Gandhi Vineet
Learning Unsupervised Visual Grounding Through Semantic Self-Supervision
IJCAI 2019
[abstract] [paper] [www] [bibTex]
Localizing natural language phrases in images is a challenging problem that requires joint understanding of both the textual and visual modalities. In the unsupervised setting, lack of supervisory signals exacerbate this difficulty. In this paper, we propose a novel framework for unsupervised visual grounding which uses concept learning as a proxy task to obtain self-supervision. The simple intuition behind this idea is to encourage the model to localize to regions which can explain some semantic property in the data, in our case, the property being the presence of a concept in a set of images. We present thorough quantitative and qualitative experiments to demonstrate the efficacy of our approach and show a 5.6% improvement over the current state of the art on Visual Genome dataset, a 5.8% improvement on the ReferItGame dataset and comparable to state-of-art performance on the Flickr30k dataset.
@inproceedings{Javed-IJCAI-2018, AUTHOR = {Javed Syed Ashar, Shreyas Saxena and Gandhi Vineet}, TITLE = {Learning Unsupervised Visual Grounding Through Semantic Self-Supervision}, BOOKTITLE = {IJCAI}, YEAR = {2019}}
  Gupta Aryaman, Thakkar Kalpit, Gandhi Vineet and Narayanan P J
Nose, Eyes and Ears: Head Pose Estimation By Locating Facial Keypoints
[abstract] [paper] [www] [bibTex]
Monocular head pose estimation requires learning a model that computes the intrinsic Euler angles for pose (yaw, pitch, roll) from an input image of human face. Annotating ground truth head pose angles for images in the wild is difficult and requires ad-hoc fitting procedures (which provides only coarse and approximate annotations). This highlights the need for approaches which can train on data captured in controlled environment and generalize on the images in the wild (with varying appearance and illumination of the face). Most present day deep learning approaches which learn a regression function directly on the input images fail to do so. To this end, we propose to use a higher level representation to regress the head pose while using deep learning architectures. More specifically, we use the uncertainty maps in the form of 2D soft localization heatmap images over five facial keypoints, namely left ear, right ear, left eye, right eye and nose, and pass them through an convolutional neural network to regress the head-pose. We show head pose estimation results on two challenging benchmarks BIWI and AFLW and our approach surpasses the state of the art on both the datasets.
@inproceedings{Gupta-arxiv-2018, AUTHOR = {Gupta Aryaman, Thakkar Kalpit, Gandhi Vineet and Narayanan P J}, TITLE = {Nose, Eyes and Ears: Head Pose Estimation By Locating Facial Keypoints}, BOOKTITLE = {arXiv:1812.00739}, YEAR = {2018}}
  Moudgil Abhinav and Gandhi Vineet
Long-Term Visual Object Tracking Benchmark
ACCV 2018
[abstract] [paper] [dataset] [www] [bibTex]
In this paper, we propose a new long video dataset (called Track Long and Prosper - TLP) and benchmark for visual object tracking. The dataset consists of 50 videos from real world scenarios, encompassing a duration of over 400 minutes (676K frames), making it more than 20 folds larger in average duration per sequence and more than 8 folds larger in terms of total covered duration, as compared to existing generic datasets for visual tracking. The proposed dataset paves a way to suitably assess long term tracking performance and possibly train better deep learning architectures (avoiding/reducing augmentation, which may not reflect realistic real world behavior). We benchmark the dataset on 17 state of the art trackers and rank them according to tracking accuracy and run time speeds. We further categorize the test sequences with different attributes and present a thorough quantitative and qualitative evaluation. Our most interesting observations are (a) existing short sequence benchmarks fail to bring out the inherent differences in tracking algorithms which widen up while tracking on long sequences and (b) the accuracy of most trackers abruptly drops on challenging long sequences, suggesting the potential need of research efforts in the direction of long term tracking.
@inproceedings{moudgil-arxiv-2017, title = {{Long-Term Visual Object Tracking Benchmark}}, author = {Moudgil Abhinav and Gandhi Vineet}, booktitle = {arxiv}, year = {2017}, }
  Gupta Krishnam, Javed Syed Ashar, Gandhi Vineet and Krishna Madhava K.
MergeNet: A Deep Net Architecture for Small Obstacle Discovery
ICRA 2018
[abstract] [paper] [www] [bibTex]
We present here, a novel network architecture called MergeNet for discovering small obstacles for on-road scenes in the context of autonomous driving. The basis of the architecture rests on the central consideration of training with less amount of data since the physical setup and the annotation process for small obstacles is hard to scale. For making effective use of the limited data, we propose a multi-stage training procedure involving weight-sharing, separate learning of low and high level features from the RGBD input and a refining stage which learns to fuse the obtained complementary features. The model is trained and evaluated on the Lost and Found dataset and is able to achieve state-of-art results with just 135 images in comparison to the 1000 images used by the previous benchmark. Additionally, we also compare our results with recent methods trained on 6000 images and show that our method achieves comparable performance with only 1000 training samples.
@inproceedings{Gupta-icra-2018, AUTHOR = {Gupta Krishnam, Javed Syed Ashar, Gandhi Vineet and Krishna Madhava K.}, TITLE = {{MergeNet: A Deep Net Architecture for Small Obstacle Discovery}}, BOOKTITLE = {ICRA}, YEAR = {2018}}
  Kumar Kranthi, Kumar Moneish, Gandhi Vineet, Subramanian Ramanathan
Watch to Edit: Video Retargeting using Gaze
Eurographics 2018
[abstract] [paper] [www] [bibTex]
We present a novel approach to optimally retarget videos for varied displays with differing aspect ratios by preserving salient scene content discovered via eye tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions, and (ii) adhere to the principles of cinematography. Our approach is (a) content agnostic as the same methodology is employed to re-edit a wide-angle video recording or a close-up movie sequence captured with a static or moving camera, and (b) independent of video length and can in principle re-edit an entire movie in one shot. Our algorithm consists of two steps. The first step employs gaze transition cues to detect time stamps where new cuts are to be introduced in the original video via dynamic programming. A subsequent step optimizes the cropping window path (to create pan and zoom effects), while accounting for the original and new cuts. The cropping window path is designed to include maximum gaze information, and is composed of piecewise constant, linear and parabolic segments. It is obtained via L(1) regularized convex optimization which ensures a smooth viewing experience. We test our approach on a wide variety of videos and demonstrate significant improvement over the state-of-the-art, both in terms of computational complexity and qualitative aspects. A study performed with 16 users confirms that our approach results in a superior viewing experience as compared to gaze driven re-editing [JSSH15] and letterboxing methods, especially for wide-angle static camera recordings.
@inproceedings{kumar-eg-2018, title = {{Watch to Edit: Video Retargeting using Gaze}}, author = {Kumar Kranthi, Kumar Moneish, Gandhi Vineet, Subramanian Ramanathan}, booktitle = {Eurographics}, year = {2018}, }
  Rai Pranjal Kumar, Maheshwari Sajal, and Gandhi Vineet
Document Quality Estimation using Spatial Frequency Response
ICASSP (oral) 2018
[abstract] [paper] [www] [bibTex]
The current Document Image Quality Assessment (DIQA) algorithms directly relate the Optical Character Recognition (OCR) accuracies with the quality of the document to build supervised learning frameworks. This direct correlation has two major limitations: (a) OCR may be affected by factors independent of the quality of the capture and (b) it cannot account for blur variations within an image. An alternate possibility is to quantify the quality of capture using human judgement, however, it is subjective and prone to error. In this work, we build upon the idea of Spatial Frequency Re- sponse (SFR) to reliably quantify the quality of a document image. We present through quantitative and qualitative exper- iments that the proposed metric leads to significant improve- ment in document quality prediction in contrast to using OCR as ground truth.
@inproceedings{Rai-ICASSP-2018, title = {Document Quality Estimation using Spatial Frequency Response}, author = {Rai Pranjal Kumar, Maheshwari Sajal, and Gandhi Vineet}, booktitle = {ICASSP}, year = {2018}, }
  Shah Vatsal, and Gandhi Vineet
An Iterative approach for Shadow Removal in Document Images
[abstract] [paper] [www] [bibTex]
Uneven illumination and shadows in document images cause a challenge for digitization applications and automated work- flows. In this work, we propose a new method to recover un- shadowed document images from images with shadows/un- even illumination. We pose this problem as one of estimating the shading and reflectance components of the given original image. Our method first estimates the shading and uses it to compute the reflectance. The output reflectance map is then used to improve the shading and the process is repeated in an iterative manner. The iterative procedure allows for a gradual compensation and allows our algorithm to even handle diffi- cult hard shadows without introducing any artifacts. Experi- ments over two different datasets demonstrate the efficacy of our algorithm and the low computation complexity makes it suitable for most practical applications.
@inproceedings{Shah-ICASSP-2018, title = {An Iterative approach for Shadow Removal in Document Images}, author = {Shah Vatsal and Gandhi Vineet}, booktitle = {ICASSP}, year = {2018}, }
  Sharma Rahul Anand, Bhat Bharath, Gandhi Vineet, Jawahar CV
Automated Top View Registration of Broadcast Football Videos
WACV 2018
[abstract] [paper] [bibTex]
In this paper, we propose a fully automatic method to register football broadcast video frames on the static top view model of the playing surface. Automatic registration has been difficult due to the difficulty of finding sufficient point correspondences. We investigate an alternate ap- proach exploiting the edge information from the line mark- ings on the field. We formulate the registration problem as a nearest neighbour search over a synthetically generated dictionary of edge map and homography pairs. The syn- thetic dictionary generation allows us to exhaustively cover a wide variety of camera angles and positions and reduces this problem to a minimal per-frame edge map matching problem. We show that the per-frame results can be fur- ther improved in videos using an optimization framework for temporal camera stabilization. We demonstrate the ef- ficacy of our approach by presenting extensive results on a dataset collected from matches of the football World Cup 2014 and show significant improvement over the current state of the art.
@inproceedings{anand-WACV-2018, title = {{Automated Top View Registration of Broadcast Football Videos}}, author = {Sharma Rahul Anand, Bhat Bharath, Gandhi Vineet, Jawahar CV}, booktitle = {{WACV}}, year = {2018}, }
  Rai Pranjal Kumar*, Maheshwari Sajal*, Mehta Ishit, Sakurikar Parikshit and Gandhi Vineet
Beyond OCRs for Document Blur Estimation
ICDAR 2017
[abstract] [paper] [dataset] [www] [bibTex]
The current document blur/quality estimation algorithms rely on the OCR accuracy to measure their success. A sharp document image, however, at times may yield lower OCR accuracy owing to factors independent of blur or quality of capture. The necessity to rely on OCR is mainly due to the difficulty in quantifying the quality otherwise. In this work, we overcome this limitation by proposing a novel dataset for document blur estimation, for which we physically quantify the blur using a capture set-up which computationally varies the focal distance of the camera. We also present a selective search mechanism to improve upon the recently successful patch-based learning approaches (using codebooks or convolutional neural networks). We present a thorough analysis of the improved blur estimation pipeline using correlation with OCR accuracy as well as the actual amount of blur. Our experiments demonstrate that our method outperforms the current state-of-the-art by a significant margin.
@inproceedings{pranjal-icdar-2017, title = {{Beyond OCRs for Document Blur Estimation}}, author = {Pranjal Kumar Rai, Sajal Maheshwari, Ishit Mehta, Parikshit Sakurikar and Vineet Gandhi}, booktitle = {ICDAR}, year = {2017}, }
  Kumar Moneish, Gandhi Vineet, Ronfard Remi and Gleicher Michael
Zooming On All Actors: Automatic Focus+Context Split Screen Video Generation
Eurographics 2017
[abstract] [paper] [slides (original keynote) ] [www] [bibTex]
Recordings of stage performances are easy to capture with a high-resolution camera, but are difficult to watch because the actors’ faces are too small. We present an approach to automatically create a split screen video that transforms these recordings to show both the context of the scene as well as close-up details of the actors. Given a static recording of a stage performance and tracking information about the actors positions, our system generates videos showing a focus+context view based on computed close-up camera motions using crop-and zoom. The key to our approach is to compute these camera motions such that they are cinematically valid close-ups and to ensure that the set of views of the different actors are properly coordinated and presented. We pose the computation of camera motions as convex optimization that creates detailed views and smooth movements, subject to cinematic constraints such as not cutting faces with the edge of the frame. Additional constraints link the close up views of each actor, causing them to merge seamlessly when actors are close. Generated views are placed in a resulting layout that preserves the spatial relationships between actors. We demonstrate our results on a variety of staged theater and dance performances.
@inproceedings{kumar-eg-2017, title = {{Zooming On All Actors: Automatic Focus+Context Split Screen Video Generation}}, author = {Kumar Moneish, Gandhi Vineet, Ronfard Remi and Gleicher Michael}, booktitle = {Eurographics}, year = {2017}, }
  Maheshwari Sajal, Rai Pranjal, Sharma Gopal and Gandhi Vineet
Document Blur Detection using Edge Profile Mining
[abstract] [paper] [www] [bibTex]
We present an algorithm for automatic blur detection of document images using a novel approach based on edge intensity profiles. Our main insight is that the edge profiles are a strong indicator of the blur present in the image, with steep profiles implying sharper regions and gradual profiles implying blurred regions. Our approach first retrieves the profiles for each point of intensity transition (each edge point) along the gradient and then uses them to output a quantitative measure indicating the extent of blur in the input image. The real time performance of the proposed approach makes it suitable for most applications. Additionally, our method works for both hand written and digital documents and is agnostic to the font types and sizes, which gives it a major advantage over the currently prevalent learning based approaches. Extensive quantitative and qualitative experiments over two different datasets show that our method outperforms almost all algorithms in current state of the art by a significant margin, especially in cross dataset experiments.
@inproceedings{sajal-icvgip-2016, title = {{Document Blur Detection using Edge Profile Mining}}, author = {Maheshwari Sajal, Rai Pranjal, Sharma Gopal and Gandhi Vineet}, booktitle = {ICVGIP}, year = {2016}, }
  Gandhi Vineet and Ronfard Remi
A Computational Framework for Vertical Video Editing
Eurographics workshop on intelligent cinematography and editing (WICED) 2016
[abstract] [paper] [bibTex]
Vertical video editing is the process of digitally editing the image within the frame as opposed to horizontal video editing, which arranges the shots along a timeline. Vertical editing can be a time-consuming and error-prone process when using manual key-framing and simple interpolation. In this paper, we present a general framework for automatically computing a variety of cinematically plausible shots from a single input video suitable to the special case of live performances. Drawing on working practices in traditional cinematography, the system acts as a virtual camera assistant to the film editor, who can call novel shots in the edit room with a combination of high-level instructions and manually selected keyframes.
@inproceedings{gandhi-egw-2016, title = {{A Computational Framework for Vertical Video Editing }}, author = {Gandhi Vineet and Ronfard Remi}, booktitle = { Eurographics workshop on intelligent cinematography and editing (WICED)}, year = {2016}, }
  Gandhi Vineet
Automatic Rush Generation with Application to Theatre Performances
PhD dissertation
[abstract] [pdf] [bibTex]
Professional quality videos of live staged performances are created by recording them from different appropriate viewpoints. These are then edited together to portray an eloquent story replete with the ability to draw out the intended emotion from the viewers. Creating such competent videos typically requires a team of skilled camera operators to capture the scene from multiple viewpoints. In this thesis, we explore an alternative approach where we automatically compute camera movements in post-production using specially designed computer vision methods.

A high resolution static camera replaces the plural camera crew and their efficient cam- era movements are then simulated by virtually panning - tilting - zooming within the original recordings. We show that multiple virtual cameras can be simulated by choosing different trajectories of cropping windows inside the original recording. One of the key novelties of this work is an optimization framework for computing the virtual camera trajectories using the in- formation extracted from the original video based on computer vision techniques.

The actors present on stage are considered as the most important elements of the scene. For the task of localizing and naming actors, we introduce generative models for learning view independent person and costume specific detectors from a set of labeled examples. We explain how to learn the models from a small number of labeled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum likelihood framework. We demonstrate that such actor specific models can accurately localize actors despite changes in view point and occlusions, and significantly improve the detection recall rates over generic detectors.

The thesis then proposes an offline algorithm for tracking objects and actors in long video sequences using these actor specific models. Detections are first performed to independently select candidate locations of the actor/object in each frame of the video. The candidate detections are then combined into smooth trajectories by minimizing a cost function accounting for false detections and occlusions.

Using the actor tracks, we then describe a method for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Our method requires only minimal user input to define the subject matter of each sub-clip. The composition of each sub-clip is automatically computed in a novel convex optimization framework. Our approach encodes several common cinematographic practices into a single convex cost function minimization problem, resulting in aesthetically-pleasing sub-clips which can easily be edited together using off-the-shelf multiclip video editing software.

The proposed methods have been tested and validated on a challenging corpus of theatre recordings. They open the way to novel applications of computer vision methods for cost- effective video production of live performances including, but not restricted to, theatre, music and opera.
@inproceedings{gandhi-dissertation-2014, title = {{Automatic Rush Generation with Application to Theatre Performances}}, author = {Gandhi, Vineet}, booktitle = {PhD dissertation}, year = {2014}, }
  Gandhi Vineet, Ronfard Remi, Gleicher Michael
Multi-Clip Video Editing from a Single Viewpoint
European Conference on Visual Media Production (CVMP) 2014
[abstract] [paper] [Presentation slides] [www] [bibTex]
We propose a framework for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Assuming important actors and objects can be localized using computer vision techniques, our method requires only minimal user input to define the subject matter of each sub-clip. The composition of each sub-clip is automatically computed in a novel L1-norm optimization framework. Our approach encodes several common cinematographic practices into a single convex cost function minimization problem, resulting in aesthetically-pleasing sub-clips which can easily be edited together using off-the-shelf multi-clip video editing software. We demonstrate our approach on five video sequences of a live theatre performance by generating multiple synchronized sub-clips for each sequence.
@inproceedings{gandhi-cvmp-2014, title = {{Multi-Clip Video Editing from a Single Viewpoint}}, author = {Gandhi, Vineet and Ronfard, Remi and Gleicher, Michael}, booktitle = {European Conference on Visual Media Production (CVMP)}, year = {2014}, }
  Gandhi Vineet, Ronfard Remi
Detecting and naming actors in movies using generative appearance models
Computer Vision and Pattern Recognition (CVPR) 2013
[abstract] [paper] [CVPR 2013 Poster (1.8mb)] [www] [bibTex]
We introduce a generative model for learning person and costume specific detectors from labeled examples. We demonstrate the model on the task of localizing and nam- ing actors in long video sequences. More specifically, the actor’s head and shoulders are each represented as a con- stellation of optional color regions. Detection can proceed despite changes in view-point and partial occlusions. We explain how to learn the models from a small number of la- beled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum likelihood frame- work. We present results on a challenging movie example, with 81% recall in actor detection (coverage) and 89% pre- cision in actor identification (naming).
@inproceedings{gandhi-cvpr-2013, title = {{Detecting and Naming Actors in Movies using Generative Appearance Models}}, author = {Gandhi, Vineet and Ronfard, Remi}, booktitle = {CVPR}, year = {2013}, }
  Gandhi Vineet, Cech Jan and Horaud Radu
High-Resolution Depth Maps Based on TOF-Stereo Fusion
IEEE International Conference on Robotics and Automation (ICRA) 2012
[abstract] [paper] [www] [bibTex]
The combination of range sensors with color cameras can be very useful for robot navigation, semantic perception, manipulation, and telepresence. Several methods of combining range- and color-data have been investigated and successfully used in various robotic applications. Most of these systems suffer from the problems of noise in the range-data and resolution mismatch between the range sensor and the color cameras, since the resolution of current range sensors is much less than the resolution of color cameras. High-resolution depth maps can be obtained using stereo matching, but this often fails to construct accurate depth maps of weakly/repetitively textured scenes, or if the scene exhibits complex self-occlusions. Range sensors provide coarse depth information regardless of presence/absence of texture. The use of a calibrated system, composed of a time-of-flight (TOF) camera and of a stereoscopic camera pair, allows data fusion thus overcoming the weaknesses of both individual sensors. We propose a novel TOF-stereo fusion method based on an efficient seed-growing algorithm which uses the TOF data projected onto the stereo image pair as an initial set of correspondences. These initial ''seeds'' are then propagated based on a Bayesian model which combines an image similarity score with rough depth priors computed from the low-resolution range data. The overall result is a dense and accurate depth map at the resolution of the color cameras at hand. We show that the proposed algorithm outperforms 2D image-based stereo algorithms and that the results are of higher resolution than off-the-shelf color-range sensors, e.g., Kinect. Moreover, the algorithm potentially exhibits real-time performance on a single CPU.
@inproceedings{gandhi-icra-2012, AUTHOR = {Gandhi, Vineet and Cech, Jan and Horaud, Radu}, TITLE = {{High-Resolution Depth Maps Based on TOF-Stereo Fusion}}, BOOKTITLE = {ICRA}, YEAR = {2012}}
  Ronfard Remi, Gandhi Vineet and Boiron Laurent
The Prose Storyboard Language: A Tool for Annotating and Directing Movies
Workshop on Intelligent Cinematography and Editing part of Foundations of Digital Games (FDG) 2013
[abstract] [paper] [www] [bibTex]
The prose storyboard language is a formal language for describing movies shot by shot, where each shot is described with a unique sentence. The language uses a simple syntax and limited vocabulary borrowed from working practices in traditional movie-making, and is intended to be readable both by machines and humans. The language is designed to serve as a high-level user interface for intelligent cinematography and editing systems.
@inproceedings{ronfard-wiced-2013, title = {{The Prose Storyboard Language: A Tool for Annotating and Directing Movies}}, author = {Ronfard, Remi and Gandhi, Vineet and Boiron, Laurent}, booktitle = {{2nd Workshop on Intelligent Cinematography and Editing part of Foundations of Digital Games - FDG 2013}}, year = {2013}, }



Statistical Methods in Artificial Intelligence - Monsoon 2017, Spring 2020
Digital Image Processing - Monsoon 2015, Monsoon 2016
Digital Signal Analysis and Applications - Spring 2016, Spring 2017, Spring 2018
Computer Programming - Monsoon 2018, Monsoon 2019
Advanced Data Structure and Algorithms - Monsoon 2020





Coming soon



Last updated: April 2018