Vineet Gandhi

Home • Publications • Teaching • Students • Misc

(v)(gandhi)@iiit.ac.in
+91-40-66531238

F25, Centre for Visual Information Technology (CVIT),
2nd floor, KCIS Research Block
Gachibowli, Hyderabad, India, 500032

(Curriculum Vitae)

This website is no longer being updated. Please visit my current website at https://vineet-gandhi.github.io/ for the latest information.

I am currently an associate professor at IIIT Hyderabad, where I am affiliated with Center for Visual Information Techonology ( CVIT ). I also advise a beautiful animation startup on their AI and ML related efforts (Animaker.com). I completed my PhD degree at INRIA Rhone Alpes/Univesity of Grenoble in applied mathematics and computer science (mathématique appliquée et informatique) under the guidance of Remi Ronfard. I was funded by the CIBLE scholarship by Region Rhone Alpes. Prior to this, I completed my Masters with Erasmus Mundus scholarship under CIMET consortium. I am extremely thankful to European Union for giving me this opportunity which had a huge impact on both my professional and personal life. I spent a semester each in Spain, Norway and France and later joined INRIA for my master thesis and continued for my PhD there. I was also lucky to travel and deeply explore Europe (from south of Spain to North of Norway), at times purely surviving on gestural expressions for communication. I obtained my Bachelor of Technology degree from Indian Institute of Information Technology, Design and Manufacturing (IIITDM) Jabalpur, India (I belong to the first batch of the insititute).

I like to focus on problems with tangible goals and I try to build end to end solutions (with neat engineering). My current research interests are in applied machine learning for applications in computer vision and multimedia. In recent years, I have been exploring specific problems of computational videography/cinematography; image/video editing; multiple sensor fusion for 3D analytics; sports analytics; document analytics; visual detection,tracking and recognition. In personal space, I like to spend time with my family, play cards, go on bicycle rides, read ancient literature and explore mountains.

News

Aug 1, 2024

Our work on converting skin vibrations to speech is accepted at UBICOMP 2024. In this innovation, we demonstrate that skin vibrations recorded by a clinical stethoscope can be cleanly transformed into speech. Congratulations to Neil, Neha and Vishal for the excellent effort.

Selected Publications

List of all publications @ Google Scholar

		Neil Shah, Neha Sahipjohn, Vishal Tambrahalli, Ramanathan Subramanian, Vineet Gandhi StethoSpeech: Speech Generation Through a Clinical Stethoscope Attached to the Skin Ubicomp 2024 [abstract] [paper] [bibTex] We introduce StethoSpeech, a silent speech interface that transforms flesh-conducted vibrations behind the ear into speech. This innovation is designed to improve social interactions for those with voice disorders, and furthermore enable discreet public communication. Unlike prior efforts, StethoSpeech does not require (a) paired-speech data for recorded vibrations and (b) a specialized device for recording vibrations, as it can work with an off-the-shelf clinical stethoscope. The novelty of our framework lies in the overall design, simulation of the ground-truth speech, and a sequence-to-sequence translation network, which works in the latent space. We present comprehensive experiments on the existing CSTR NAM TIMIT Plus corpus and our proposed StethoText: a large-scale synchronized database of non-audible murmur and text for speech research. Our results show that StethoSpeech provides natural-sounding and intelligible speech, significantly outperforming existing methods on several quantitative and qualitative metrics. Additionally, we showcase its capacity to extend its application to speakers not encountered during training and its effectiveness in challenging, noisy environments. Speech samples are available at https://stethospeech.github.io/StethoSpeech/ @inproceedings{bib_ubicomp_2024, AUTHOR = {Neil Shah, Neha Sahipjohn, Vishal Tambrahalli, Ramanathan Subramanian, Vineet Gandhi}, TITLE = {StethoSpeech: Speech Generation Through a Clinical Stethoscope Attached to the Skin}, BOOKTITLE = {UBICOMP}, YEAR = {2024}}
		Kawshik Manikantan, Shubham Toshniwal, Makarand Tapaswi, Vineet Gandhi Major Entity Identification: A Generalizable Alternative to Coreference Resolution EMNLP 2024 [abstract] [paper] [bibTex] The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task’s broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative formulation of the CR task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, the MEI task fits the classification framework, which enables the use of classification-based metrics that are more robust than the current CR metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest @inproceedings{bib_mei_2024, AUTHOR = {Kawshik Manikantan, Shubham Toshniwal, Makarand Tapaswi, Vineet Gandhi}, TITLE = {Major Entity Identification: A Generalizable Alternative to Coreference Resolution}, BOOKTITLE = {EMNLP}, YEAR = {2024}}
		Darshana Saravanan, Darshan Singh, Varun Gupta, Zeeshan Khan, Vineet Gandhi, Makarand Tapaswi VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time? arXiv 2024 [abstract] [paper] [bibTex] Compositionality is a fundamental aspect of vision-language understanding and is especially required for videos since they contain multiple entities (e.g. persons, actions, and scenes) interacting dynamically over time. Existing benchmarks focus primarily on perception capabilities. However, they do not study binding, the ability of a model to associate entities through appropriate relationships. To this end, we propose VELOCITI, a new benchmark building on complex movie clips and dense semantic role label annotations to test perception and binding in video language models (contrastive and Video-LLMs). Our perception-based tests require discriminating video-caption pairs that share similar entities, and the binding tests require models to associate the correct entity to a given situation while ignoring the different yet plausible entities that also appear in the same video. While current state-of-the-art models perform moderately well on perception tests, accuracy is near random when both entities are present in the same video, indicating that they fail at binding tests. Even the powerful Gemini 1.5 Flash has a substantial gap (16-28%) with respect to human accuracy in such binding tests. @inproceedings{bib_velociti_2024, AUTHOR = {Darshana Saravanan, Darshan Singh, Varun Gupta, Zeeshan Khan, Vineet Gandhi, Makarand Tapaswi}, TITLE = {VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?}, BOOKTITLE = {arXiv}, YEAR = {2024}}
		Darshana Saravanan, Naresh Manwani, Vineet Gandhi Pseudo-labelling meets Label Smoothing for Noisy Partial Label Learning arXiv 2024 [abstract] [paper] [bibTex] Partial label learning (PLL) is a weakly-supervised learning paradigm where each training instance is paired with a set of candidate labels (partial label), one of which is the true label. Noisy PLL (NPLL) relaxes this constraint by allowing some partial labels to not contain the true label, enhancing the practicality of the problem. Our work centres on NPLL and presents a minimalistic framework that initially assigns pseudo-labels to images by exploiting the noisy partial labels through a weighted nearest neighbour algorithm. These pseudo-label and image pairs are then used to train a deep neural network classifier with label smoothing. The classifier's features and predictions are subsequently employed to refine and enhance the accuracy of pseudo-labels. We perform thorough experiments on seven datasets and compare against nine NPLL and PLL methods. We achieve state-of-the-art results in all studied settings from the prior literature, obtaining substantial gains in fine-grained classification and extreme noise scenarios. Further, we show the promising generalisation capability of our framework in realistic crowd-sourced datasets. @inproceedings{bib_npll_2024, AUTHOR = {Darshana Saravanan, Naresh Manwani, Vineet Gandhi}, TITLE = {Pseudo-labelling meets Label Smoothing for Noisy Partial Label Learning}, BOOKTITLE = {arXiv}, YEAR = {2024}}
		Neil Shah, Saiteja Kosgi, Vishal Tambrahalli, Neha Sahipjohn, Anil Kumar Nelakanti and Vineet Gandhi ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations EACL 2024 [abstract] [paper] [bibTex] We present ParrotTTS, a modularized text-to-speech synthesis model leveraging disentangled self-supervised speech representations. It can train a multi-speaker variant effectively using transcripts from a single speaker. ParrotTTS adapts to a new language in low resource setup and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving the speaker-specific characteristics, e.g., synthesizing fluent Hindi speech using a French speaker’s voice and accent. We present extensive results in monolingual and multi-lingual scenarios. ParrotTTS outperforms state-of-the-art multi-lingual text-to-speech (TTS) models using only a fraction of paired data as latter. Speech samples from ParrotTTS and code can be found at https://parrot-tts.github.io/tts/ @inproceedings{bib_parrotTTS_2024, AUTHOR = {Neil Shah, Saiteja Kosgi, Vishal Tambrahalli, Neha Sahipjohn, Anil Kumar Nelakanti and Vineet Gandhi}, TITLE = {ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations}, BOOKTITLE = {Findings of EACL}, YEAR = {2024}}
		Kanishk Jain, Shyamgopal Karthik and Vineet Gandhi Test-Time Amendment with a Coarse Classifier for Fine-Grained Classification NeurIPS 2023 [abstract] [paper] [bibTex] We investigate the problem of reducing mistake severity for fine-grained classification. Fine-grained classification can be challenging, mainly due to the requirement of knowledge or domain expertise for accurate annotation. However, humans are particularly adept at performing coarse classification as it requires relatively low levels of expertise. To this end, we present a novel approach for Post-Hoc Correction called Hierarchical Ensembles (HiE) that utilizes label hierarchy to improve the performance of fine-grained classification at test-time using the coarse-grained predictions. By only requiring the parents of leaf nodes, our method significantly reduces avg. mistake severity while improving top-1 accuracy on the iNaturalist-19 and tieredImageNet-H datasets, achieving a new state-of-the-art on both benchmarks. We also investigate the efficacy of our approach in the semi-supervised setting. Our approach brings notable gains in top-1 accuracy while significantly decreasing the severity of mistakes as training data decreases for the fine-grained classes. The simplicity and post-hoc nature of HiE render it practical to be used with any off-the-shelf trained model to improve its predictions further. @inproceedings{bib_hie_2023, AUTHOR = {Kanishk Jain, Shyamgopal Karthik and Vineet Gandhi}, TITLE = {Test-Time Amendment with a Coarse Classifier for Fine-Grained Classification}, BOOKTITLE = {Neurips}, YEAR = {2023}}
		Kanishk Jain, Varun Chhangani, Amogh Tiwari, K. Madhava Krishna and Vineet Gandhi Ground then Navigate: Language-guided Navigation in Dynamic Scenes ICRA 2023 [abstract] [paper] [bibTex] We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings. We solve the problem by explicitly grounding the navigable regions corresponding to the textual command. At each timestamp, the model predicts a segmentation mask corresponding to the intermediate or the final navigable region. Our work contrasts with existing efforts in VLN, which pose this task as a node selection problem, given a discrete connected graph corresponding to the environment. We do not assume the availability of such a discretised map. Our work moves towards continuity in action space, provides interpretability through visual feedback and allows VLN on commands requiring finer manoeuvres like "park between the two cars". Furthermore, we propose a novel meta-dataset CARLA-NAV to allow efficient training and validation. The dataset comprises pre-recorded training sequences and a live environment for validation and testing. We provide extensive qualitative and quantitive empirical results to validate the efficacy of the proposed approach. @inproceedings{bib_icra_2023, AUTHOR = {Kanishk Jain, Varun Chhangani, Amogh Tiwari, K. Madhava Krishna and Vineet Gandhi}, TITLE = {Ground then Navigate: Language-guided Navigation in Dynamic Scenes}, BOOKTITLE = {ICRA}, YEAR = {2023}}
		Ritvik Agrawal, Shreyank Jyoti, Rohit Girmaji, Sarath Sivaprasad, Vineet Gandhi Does Audio help in deep Audio-Visual Saliency prediction models? ICMI 2022 (Best Student Paper Award) [abstract] [paper] [code(coming soon)] [bibTex] Despite existing works of Audio-Visual Saliency Prediction (AVSP) models claiming to achieve promising results by fusing audio modality over visual-only models, these models fail to leverage audio information. In this paper, we investigate the relevance of audio cues in conjunction with the visual ones and conduct extensive analysis by employing well-established audio modules and fusion techniques from diverse correlated audio-visual tasks. Our analysis on ten diverse saliency datasets suggests that none of the methods worked for incorporating audio. Furthermore, we bring to light, why AVSP models show a gain in performance over visual-only models, though the audio branch is agnostic at inference. Our work questions the role of audio in current deep AVSP models and motivates the community to a clear avenue for reconsideration of the complex architectures by demonstrating that simpler alternatives work equally well. @inproceedings{Ritvik-icmi-2022, AUTHOR = {Ritvik Agrawal, Shreyank Jyoti, Rohit Girmaji, Sarath Sivaprasad, Vineet Gandhi}, TITLE = {Does Audio help in deep Audio-Visual Saliency prediction models?}, BOOKTITLE = {ICMI}, YEAR = {2022}}
		Saiteja Kosgi, Sarath Sivaprasad, Niranjan Pedanekar, Anil Nelakanti, Vineet Gandhi Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems NAACL 2022 [abstract] [paper] [bibTex] We present a method to control the emotional prosody of Text to Speech (TTS) systems by using phoneme-level intermediate features (pitch, energy, and duration) as levers. As a key idea, we propose Differential Scaling (DS) to disentangle features relating to affective prosody from those arising due to acoustics conditions and speaker identity. With thorough experimental studies, we show that the proposed method improves over the prior art in accurately emulating the desired emotions while retaining the naturalness of speech. We extend the traditional evaluation of using individual sentences for a more complete evaluation of HCI systems. We present a novel experimental setup by replacing an actor with a TTS system in offline and live conversations. The emotion to be rendered is either predicted or manually assigned. The results show that the proposed method is strongly preferred over the state-of-the-art TTS system and adds the much-coveted “human touch” in machine dialogue. Audio samples for our experiments and the code are available at: https://emtts.github.io/tts-demo/ @inproceedings{Sai-naacl-2022, AUTHOR = {Saiteja Kosgi, Sarath Sivaprasad, Niranjan Pedanekar, Anil Nelakanti, Vineet Gandhi}, TITLE = {Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems}, BOOKTITLE = {NAACL}, YEAR = {2022}}
		Kanishk Jain and Vineet Gandhi Comprehensive Multi-Modal Interactions for Referring Image Segmentation ACL 2022 [abstract] [paper] [code(coming soon)] [bibTex] We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the given natural language description. To solve RIS efficiently, we need to understand each word's relationship with other words, each region in the image to other regions, and cross-modal alignment between linguistic and visual domains. Recent methods model these three types of interactions sequentially. We argue that such a modular approach limits these methods' performance, and joint simultaneous reasoning can help resolve ambiguities. To this end, we propose a Joint Reasoning (JRM) module and a novel Cross-Modal Multi-Level Fusion (CMMLF) module for tackling this task. JRM effectively models the referent's multi-modal context by jointly reasoning over visual and linguistic modalities (performing word-word, image region-region, word-region interactions in a single module). CMMLF module further refines the segmentation masks by exchanging contextual information across visual hierarchy through linguistic features acting as a bridge. We present thorough ablation studies and validate our approach's performance on four benchmark datasets, and show that the proposed method outperforms the existing state-of-the-art methods on all four datasets by significant margins. @inproceedings{Kanishk-arxiv-2021, AUTHOR = {Kanishk Jain, Vineet Gandhi}, TITLE = {Comprehensive Multi-Modal Interactions for Referring Image Segmentation}, BOOKTITLE = {Findings of ACL}, YEAR = {2022}}
		Sarath Sivaprasad, Akshay Goindani, Vaibhav Garg, Vineet Gandhi Reappraising Domain Generalization in Neural Networks arXiv 2022 [abstract] [paper] [code (coming soon)] [bibTex] Domain generalization (DG) of machine learning algorithms is defined as their ability to learn a domain agnostic hypothesis from multiple training distributions, which generalizes onto data from an unseen domain. DG is vital in scenarios where the target domain with distinct characteristics has sparse data for training. Aligning with recent work~\cite{gulrajani2020search}, we find that a straightforward Empirical Risk Minimization (ERM) baseline consistently outperforms existing DG methods. We present ablation studies indicating that the choice of backbone, data augmentation, and optimization algorithms overshadows the many tricks and trades explored in the prior art. Our work leads to a new state of the art on the four popular DG datasets, surpassing previous methods by large margins. Furthermore, as a key contribution, we propose a classwise-DG formulation, where for each class, we randomly select one of the domains and keep it aside for testing. We argue that this benchmarking is closer to human learning and relevant in real-world scenarios. We comprehensively benchmark classwise-DG on the DomainBed and propose a method combining ERM and reverse gradients to achieve the state-of-the-art results. To our surprise, despite being exposed to all domains during training, the classwise DG is more challenging than traditional DG evaluation and motivates more fundamental rethinking on the problem of DG. @inproceedings{bib_dg_2022, AUTHOR = {Sarath Sivaprasad, Akshay Goindani, Vaibhav Garg, Vineet Gandhi}, TITLE = {Reappraising Domain Generalization in Neural Networks}, BOOKTITLE = {arXiv:2110.07981}, YEAR = {2022}}
		Jeet Vora, Swetanjal Dutta, Shyamgopal Karthik, Vineet Gandhi Bringing Generalization to Deep Multi-view Detection arXiv 2021 [abstract] [paper] [code] [bibTex] Multi-view Detection (MVD) is highly effective for occlusion reasoning and is a mainstream solution in various applications that require accurate top-view occupancy maps. While recent works using deep learning have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. The key novelty of our work is to formalize three critical forms of generalization and propose experiments to investigate them: i) generalization across a varying number of cameras, ii) generalization with varying camera positions, and finally, iii) generalization to new scenes. We find that existing state-of-the-art models show poor generalization by overfitting to a single scene and camera configuration. We propose modifications in terms of pre-training, pooling strategy, regularization, and loss function to an existing state-of-the-art framework, leading to successful generalization across new camera configurations and new scenes. We perform a comprehensive set of experiments on the WildTrack and MultiViewX datasets to (a) motivate the necessity to evaluate MVD methods on generalization abilities and (b) demonstrate the efficacy of the proposed approach. The code is publicly available at https://github.com/jeetv/GMVD @inproceedings{mvdet_jeet_2021, AUTHOR = {Jeet Vora, Swetanjal Dutta, Shyamgopal Karthik, Vineet Gandhi}, TITLE = {Bringing Generalization to Deep Multi-view Detection}, BOOKTITLE = {arXiv:2109.12227}, YEAR = {2021}}
		Nivedita Rufus, Kanishk Jain, Unni Krishnan R Nair, Vineet Gandhi and K Madhava Krishna Grounding Linguistic Commands to Navigable Regions IROS 2021 [abstract] [paper] [code (coming soon)] [bibTex] Humans have a natural ability to effortlessly comprehend linguistic commands such as “park next to the yellow sedan” and instinctively know which region of the road the vehicle should navigate. Extending this ability to autonomous vehicles is the next step towards creating fully autonomous agents that respond and act according to human commands. To this end, we propose the novel task of Referring Navigable Regions (RNR), i.e., grounding regions of interest for navigation based on the linguistic command. RNR is different from Referring Image Segmentation (RIS), which focuses on grounding an object referred to by the natural language expression instead of grounding a navigable region. For example, for a command “park next to the yellow sedan,” RIS will aim to segment the referred sedan, and RNR aims to segment the suggested parking region on the road. We introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2car [1] dataset with segmentation masks for the regions described by the linguistic commands. A separate test split with concise manoeuvre-oriented commands is provided to assess the practicality of our dataset. We benchmark the proposed dataset using a novel transformer-based architecture. We present extensive ablations and show superior performance over baselines on multiple evaluation metrics. A downstream path planner generating trajectories based on RNR outputs confirms the efficacy of the proposed framework. @inproceedings{iros_rnr_2021, AUTHOR = {Nivedita Rufus, Kanishk Jain, Unni Krishnan R Nair, Vineet Gandhi and K Madhava Krishna}, TITLE = {Grounding Linguistic Commands to Navigable Regions}, BOOKTITLE = {IROS}, YEAR = {2021}}
		Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction IROS 2021 [abstract] [paper] [code] [bibTex] We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first network to do so. We also explore a variation of ViNet architecture by augmenting audio features into the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and provides the same output irrespective of the input. Interestingly, we also observe similar behaviour in the previous state-of-the-art models \cite{tsiami2020stavis} for audio-visual saliency prediction. Our findings contrast with previous works on deep learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations incorporating audio in a more effective manner. The code and pre-trained models are available at https://github.com/samyak0210/ViNet @inproceedings{samyak-arxiv-2020, AUTHOR = {Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi}, TITLE = {ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction}, BOOKTITLE = {IROS}, YEAR = {2021}}
		Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi Emotional Prosody Control for Speech Generation INTERSPEECH 2021 [abstract] [paper] [code (coming soon)] [bibTex] Machine-generated speech is characterized by its limited or unnatural emotional variation. Current text to speech systems generates speech with either a flat emotion, emotion selected from a predefined set, average variation learned from prosody sequences in training data or transferred from a source style. We propose a text to speech(TTS) system, where a user can choose the emotion of generated speech from a continuous and meaningful emotion space (Arousal-Valence space). The proposed TTS system can generate speech from the text in any speaker's style, with fine control of emotion. We show that the system works on emotion unseen during training and can scale to previously unseen speakers given his/her speech sample. Our work expands the horizon of the state-of-the-art FastSpeech2 backbone to a multi-speaker setting and gives it much-coveted continuous (and interpretable) affective control, without any observable degradation in the quality of the synthesized speech. Audio samples are available at https://researchweb.iiit.ac.in/~sarath.s/emotts/ @inproceedings{Sarath-Interspeech-2021, AUTHOR = {Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi}, TITLE = {Emotional Prosody Control for Speech Generation}, BOOKTITLE = {Interspeech}, YEAR = {2021}}
		Sarath Sivaprasad, Ankur Singh, Naresh Manwani, Vineet Gandhi The Curious Case of Convex Neural Networks ECML-PKDD 2021 [abstract] [paper] [code] [bibTex] In this paper, we investigate a constrained formulation of neural networks where the output is a convex function of the input. We show that the convexity constraints can be enforced on both fully connected and convolutional layers, making them applicable to most architectures. The convexity constraints include restricting the weights (for all but the first layer) to be non-negative and using a non-decreasing convex activation function. Albeit simple, these constraints have profound implications on the generalization abilities of the network. We draw three valuable insights: (a) Input Output Convex Networks (IOC-NN) self regularize and almost uproot the problem of overfitting; (b) Although heavily constrained, they come close to the performance of the base architectures; and (c) The ensemble of convex networks can match or outperform the non convex counterparts. We demonstrate the efficacy of the proposed idea using thorough experiments and ablation studies on MNIST, CIFAR10, and CIFAR100 datasets with three different neural network architectures. The code for this project is publicly available at: https://github.com/sarathsp1729/Convex-Networks. @inproceedings{Sarath-IOCNN-2020, AUTHOR = {Sarath Sivaprasad, Naresh Manwani, Vineet Gandhi}, TITLE = {The Curious Case of Convex Networks}, BOOKTITLE = {ECML}, YEAR = {2021}}
		Shyamgopal Karthik, Ameya Prabhu, Puneet Dokania and Vineet Gandhi No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks International Conference on Learning Representations (ICLR) 2021 [abstract] [paper] [code] [bibTex] There has been increasing interest in building deep hierarchy-aware classifiers, aiming to quantify and reduce the severity of mistakes and not just count the number of errors. The idea is to exploit the label hierarchy (e.g., WordNet ontology) and consider graph distances as a proxy for mistake severity. Surprisingly, on examining mistake-severity distributions of the top-1 prediction, we find that current state-of-the-art hierarchy-aware deep classifiers do not show practical improvement in making better mistakes than the standard cross-entropy baseline. In fact, they reduce the average mistake-severity metric by largely making additional low-severity or easily avoidable mistakes. This might explain the noticeable accuracy drop. To this end, we resort to the classical Conditional Risk Minimization (CRM) framework for hierarchy aware classification. Given a cost matrix and a reliable estimate of likelihoods (obtained from a trained network), CRM simply amends mistakes at inference time; it needs no extra parameters; it requires adding just one line of code to the standard cross-entropy baseline. It significantly outperforms the state-of-the-art and consistently obtains large reductions in the average hierarchical distance of top-k predictions across datasets, with very little loss in accuracy. Since CRM does not require retraining or fine-tuning of any hyperparameter, it can be used with any off-the-shelf cross-entropy trained model. @inproceedings{Shyam-iclr-2021, AUTHOR = {Shyamgopal Karthik, Ameya Prabhu, Puneet Dokania and Vineet Gandhi}, TITLE = {No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks}, BOOKTITLE = {ICLR}, YEAR = {2021}}
		Shyamgopal Karthik, Ameya Prabhu, Vineet Gandhi Simple Unsupervised Multi-Object Tracking arXiv 2020 [abstract] [paper] [code] [bibTex] Multi-object tracking has seen a lot of progress recently, albeit with substantial annotation costs for developing better and larger labeled datasets. In this work, we remove the need for annotated datasets by proposing an unsupervised re-identification network, thus sidestepping the labeling costs entirely, required for training. Given unlabeled videos, our proposed method (SimpleReID) first generates tracking labels using SORT and trains a ReID network to predict the generated labels using crossentropy loss. We demonstrate that SimpleReID performs substantially better than simpler alternatives, and we recover the full performance of its supervised counterpart consistently across diverse tracking frameworks. The observations are unusual because unsupervised ReID is not expected to excel in crowded scenarios with occlusions, and drastic viewpoint changes. By incorporating our unsupervised SimpleReID with CenterTrack trained on augmented still images, we establish a new state-of-the-art performance on popular datasets like MOT16/17 without using tracking supervision, beating current best (CenterTrack) by 0.2-0.3 MOTA and 4.4-4.8 IDF1 scores. We further provide evidence for limited scope for improvement in IDF1 scores beyond our unsupervised ReID in the studied settings. Our investigation suggests reconsideration towards more sophisticated, supervised, end-to-end trackers by showing promise in simpler unsupervised alternatives. @inproceedings{Shyam-Arxiv-2020, AUTHOR = {Shyamgopal Karthik, Ameya Prabhu, Vineet Gandhi}, TITLE = {Simple Unsupervised Multi-Object Tracking}, BOOKTITLE = {arXiv:2006.02609}, YEAR = {2020}}
		Aasheesh Singh, Aditya Kamireddypalli, Vineet Gandhi, K Madhava Krishna LiDAR guided Small obstacle Segmentation IROS 2020 [abstract] [paper] [code] [bibTex] Detecting small obstacles on the road is critical for autonomous driving. In this paper, we present a method to reliably detect such obstacles through a multi-modal framework of sparse LiDAR(VLP-16) and Monocular vision. LiDAR is employed to provide additional context in the form of confidence maps to monocular segmentation networks. We show significant performance gains when the context is fed as an additional input to monocular semantic segmentation frameworks. We further present a new semantic segmentation dataset to the community, comprising of over 3000 image frames with corresponding LiDAR observations. The images come with pixel-wise annotations of three classes off-road, road, and small obstacle. We stress that precise calibration between LiDAR and camera is crucial for this task and thus propose a novel Hausdorff distance based calibration refinement method over extrinsic parameters. As a first benchmark over this dataset, we report our results with 73% instance detection up to a distance of 50 meters on challenging scenarios. Qualitatively by showcasing accurate segmentation of obstacles less than 15 cms at 50m depth and quantitatively through favourable comparisons vis a vis prior art, we vindicate the method's efficacy. Our project-page and Dataset is hosted at https://small-obstacle-dataset.github.io/ @inproceedings{aavk-IROS-2020, AUTHOR = {Aasheesh Singh, Aditya Kamireddypalli, Vineet Gandhi, K Madhava Krishna}, TITLE = {LiDAR guided Small obstacle Segmentation}, BOOKTITLE = {IROS}, YEAR = {2020}}
		Navyasri Reddy, Samyak Jain, Pradeep Yarlagadda, Vineet Gandhi Tidying Deep Saliency Prediction Architectures IROS 2020 [abstract] [paper] [code] [bibTex] Learning computational models for visual attention (saliency estimation) is an effort to inch machines/robots closer to human visual cognitive abilities. Data-driven efforts have dominated the landscape since the introduction of deep neural network architectures. In deep learning research, the choices in architecture design are often empirical and frequently lead to more complex models than necessary. The complexity, in turn, hinders the application requirements. In this paper, we identify four key components of saliency models, i.e., input features, multi-level integration, readout architecture, and loss functions. We review the existing state of the art models on these four components and propose novel and simpler alternatives. As a result, we propose two novel end-to-end architectures called SimpleNet and MDNSal, which are neater, minimal, more interpretable and achieve state of the art performance on public saliency benchmarks. SimpleNet is an optimized encoder-decoder architecture and brings notable performance gains on the SALICON dataset (the largest saliency benchmark). MDNSal is a parametric model that directly predicts parameters of a GMM distribution and is aimed to bring more interpretability to the prediction maps. The proposed saliency models can be inferred at 25fps, making them suitable for real-time applications. Code and pre-trained models are available at https://github.com/samyak0210/saliency @inproceedings{Navya-IROS-2020, AUTHOR = {Navyasri Reddy, Samyak Jain, Pradeep Yarlagadda, Vineet Gandhi}, TITLE = {Tidying Deep Saliency Prediction Architectures}, BOOKTITLE = {IROS}, YEAR = {2020}}
		K L Bhanu Moorthy, Moneish Kumar, Ramanathan Subramanian, Vineet Gandhi GAZED – Gaze-guided Cinematic Editing of Wide-Angle Monocular Video Recordings ACM CHI 2020 (Conference on Human Factors in Computing Systems) [abstract] [paper] [www] [bibTex] We present GAZED– eye GAZe-guided EDiting for videos captured by a solitary, static, wide-angle and high-resolution camera. Eye-gaze has been effectively employed in computational applications as a cue to capture interesting scene content; we employ gaze as a proxy to select shots for inclusion in the edited video. Given the original video, scene content and user eye-gaze tracks are combined to generate an edited video comprising cinematically valid actor shots and shot transitions to generate an aesthetic and vivid representation of the original narrative. We model cinematic video editing as an energy minimization problem over shot selection, whose constraints capture cinematographic editing conventions. Gazed scene locations primarily determine the shots constituting the edited video. Effectiveness of GAZED against multiple competing methods is demonstrated via a psychophysical study involving 12 users and twelve performance videos. @inproceedings{Bhanu-CHI-2020, AUTHOR = {Moorthy Bhanu K L, Kumar Moneish, Subramanian Ramanathan, Gandhi Vineet}, TITLE = {GAZED– Gaze-guided Cinematic Editing of Wide-Angle Monocular Video Recordings}, BOOKTITLE = {CHI}, YEAR = {2020}}
		Shyamgopal Karthik, Abhinav Moudgil, Vineet Gandhi Exploring 3 R's of Long-term Tracking: Re-detection, Recovery and Reliability WACV 2020 [abstract] [paper] [www] [bibTex] Recent works have proposed several long term tracking benchmarks and highlight the importance of moving towards long-duration tracking to bridge the gap with application requirements. The current evaluation methodologies, however, do not focus on several aspects that are crucial in a long term perspective like Re-detection, Recovery, and Reliability. In this paper, we propose novel evaluation strategies for a more in-depth analysis of trackers from a long-term perspective. More specifically, (a) we test re-detection capability of the trackers in the wild by simulating virtual cuts, (b) we investigate the role of chance in the recovery of tracker after failure and (c) we propose a novel metric allowing visual inference on the ability of a tracker to track contiguously (without any failure) at a given accuracy. We present several original insights derived from an extensive set of quantitative and qualitative experiments. @inproceedings{Sudheer-cine-2019, AUTHOR = {Karthik Shyamgopal, Moudgil Abhinav, Gandhi Vineet}, TITLE = {Exploring 3 R's of Long-term Tracking: Re-detection, Recovery and Reliability}, BOOKTITLE = {WACV}, YEAR = {2020}}
		Sudheer Achary, Syed Ashar Javed, Nikita Shravan, K L Bhanu Moorthy, Vineet Gandhi, Anoop Namboodiri CineFilter: Unsupervised Filtering for Real Time Autonomous Camera Systems arXiv 2019 [abstract] [paper] [www] [bibTex] Learning to mimic the smooth and deliberate camera movement of a human cameraman is an essential requirement for autonomous camera systems. This paper presents a novel formulation for online and real-time estimation of smooth camera trajectories. Many works have focused on global optimization of the trajectory to produce an offline output. Some recent works have tried to extend this to the online setting, but lack either in the quality of the camera trajectories or need large labeled datasets to train their supervised model. We propose two models, one a convex optimization based approach and another a CNN based model, both of which can exploit the temporal trends in the camera behavior. Our model is built in an unsupervised way without any ground truth trajectories and is robust to noisy outliers. We evaluate our models on two different settings namely a basketball dataset and a stage performance dataset and compare against multiple baselines and past approaches. Our models outperform other methods on quantitative and qualitative metrics and produce smooth camera trajectories that are motivated by cinematographic principles. These models can also be easily adopted to run in real-time with a low computational cost, making them fit for a variety of applications. @inproceedings{Sudheer-cine-2019, AUTHOR = {Achary Sudheer, Ashar Javed Syed, Shravan Nikita, Bhanu Moorthy K L , Gandhi Vineet , Namboodiri Anoop }, TITLE = {CineFilter: Unsupervised Filtering for Real Time Autonomous Camera Systems}, BOOKTITLE = {arXiv:1912.05636}, YEAR = {2019}}
		N. N. Sriram, Maniar Tirth, Kalyanasundaram Jayaganesh, Gandhi Vineet, Bhowmick Brojeshwar, Krishna Madhava K. Talk to the Vehicle: Language Conditioned Autonomous Navigation of Self Driving Cars IROS 2019 [abstract] [paper] [www] [bibTex] We propose a novel pipeline that blends encodings from natural language and 3D semantic maps obtained from computer vision data to generate local trajectories that are executed by a low-level controller. The pipeline precludes the need for a prior registered map through a local waypoint generator neural network. The waypoint generator network (WGN) maps semantics and natural language encodings (NLE) to local waypoints. A local planner then generates a trajectory from the ego location of the vehicle (an outdoor car in this case) to these locally generated waypoints while a low-level controller executes these plans faithfully. The efficacy of the pipeline is verified in the CARLA simulator environment as well as on local semantic maps built from real-world KITTI dataset. In both these environments (simulated and real-world) we show the ability of the WGN to generate waypoints accurately by mapping NLE of varying sequence lengths and levels of complexity. We compare with baseline approaches and show significant performance gain over them. And finally, we show real implementations on our electric car verifying that the pipeline lends itself to practical and tangible realizations in uncontrolled outdoor settings. In loop execution of the proposed pipeline that involves repetitive invocations of the network is critical for any such language-based navigation framework. This effort successfully accomplishes this thereby bypassing the need for prior metric maps or strategies for metric level localization during traversal.. @inproceedings{Sriram-IROS-2019, AUTHOR = {N. N. Sriram, Maniar Tirth, Kalyanasundaram Jayaganesh, Gandhi Vineet, Bhowmick Brojeshwar and Krishna Madhava K.}, TITLE = {Talk to the Vehicle: Language Conditioned Autonomous Navigation of Self Driving Cars}, BOOKTITLE = {IROS}, YEAR = {2019}}
		Javed Syed Ashar, Saxena Shreyas and Gandhi Vineet Learning Unsupervised Visual Grounding Through Semantic Self-Supervision IJCAI 2019 [abstract] [paper] [www] [bibTex] Localizing natural language phrases in images is a challenging problem that requires joint understanding of both the textual and visual modalities. In the unsupervised setting, lack of supervisory signals exacerbate this difficulty. In this paper, we propose a novel framework for unsupervised visual grounding which uses concept learning as a proxy task to obtain self-supervision. The simple intuition behind this idea is to encourage the model to localize to regions which can explain some semantic property in the data, in our case, the property being the presence of a concept in a set of images. We present thorough quantitative and qualitative experiments to demonstrate the efficacy of our approach and show a 5.6% improvement over the current state of the art on Visual Genome dataset, a 5.8% improvement on the ReferItGame dataset and comparable to state-of-art performance on the Flickr30k dataset. @inproceedings{Javed-IJCAI-2018, AUTHOR = {Javed Syed Ashar, Shreyas Saxena and Gandhi Vineet}, TITLE = {Learning Unsupervised Visual Grounding Through Semantic Self-Supervision}, BOOKTITLE = {IJCAI}, YEAR = {2019}}
		Gupta Aryaman, Thakkar Kalpit, Gandhi Vineet and Narayanan P J Nose, Eyes and Ears: Head Pose Estimation By Locating Facial Keypoints ICASSP 2019 [abstract] [paper] [www] [bibTex] Monocular head pose estimation requires learning a model that computes the intrinsic Euler angles for pose (yaw, pitch, roll) from an input image of human face. Annotating ground truth head pose angles for images in the wild is difficult and requires ad-hoc fitting procedures (which provides only coarse and approximate annotations). This highlights the need for approaches which can train on data captured in controlled environment and generalize on the images in the wild (with varying appearance and illumination of the face). Most present day deep learning approaches which learn a regression function directly on the input images fail to do so. To this end, we propose to use a higher level representation to regress the head pose while using deep learning architectures. More specifically, we use the uncertainty maps in the form of 2D soft localization heatmap images over five facial keypoints, namely left ear, right ear, left eye, right eye and nose, and pass them through an convolutional neural network to regress the head-pose. We show head pose estimation results on two challenging benchmarks BIWI and AFLW and our approach surpasses the state of the art on both the datasets. @inproceedings{Gupta-arxiv-2018, AUTHOR = {Gupta Aryaman, Thakkar Kalpit, Gandhi Vineet and Narayanan P J}, TITLE = {Nose, Eyes and Ears: Head Pose Estimation By Locating Facial Keypoints}, BOOKTITLE = {arXiv:1812.00739}, YEAR = {2018}}
		Moudgil Abhinav and Gandhi Vineet Long-Term Visual Object Tracking Benchmark ACCV 2018 [abstract] [paper] [dataset] [www] [bibTex] In this paper, we propose a new long video dataset (called Track Long and Prosper - TLP) and benchmark for visual object tracking. The dataset consists of 50 videos from real world scenarios, encompassing a duration of over 400 minutes (676K frames), making it more than 20 folds larger in average duration per sequence and more than 8 folds larger in terms of total covered duration, as compared to existing generic datasets for visual tracking. The proposed dataset paves a way to suitably assess long term tracking performance and possibly train better deep learning architectures (avoiding/reducing augmentation, which may not reflect realistic real world behavior). We benchmark the dataset on 17 state of the art trackers and rank them according to tracking accuracy and run time speeds. We further categorize the test sequences with different attributes and present a thorough quantitative and qualitative evaluation. Our most interesting observations are (a) existing short sequence benchmarks fail to bring out the inherent differences in tracking algorithms which widen up while tracking on long sequences and (b) the accuracy of most trackers abruptly drops on challenging long sequences, suggesting the potential need of research efforts in the direction of long term tracking. @inproceedings{moudgil-arxiv-2017, title = {{Long-Term Visual Object Tracking Benchmark}}, author = {Moudgil Abhinav and Gandhi Vineet}, booktitle = {arxiv}, year = {2017}, }
		Gupta Krishnam, Javed Syed Ashar, Gandhi Vineet and Krishna Madhava K. MergeNet: A Deep Net Architecture for Small Obstacle Discovery ICRA 2018 [abstract] [paper] [www] [bibTex] We present here, a novel network architecture called MergeNet for discovering small obstacles for on-road scenes in the context of autonomous driving. The basis of the architecture rests on the central consideration of training with less amount of data since the physical setup and the annotation process for small obstacles is hard to scale. For making effective use of the limited data, we propose a multi-stage training procedure involving weight-sharing, separate learning of low and high level features from the RGBD input and a refining stage which learns to fuse the obtained complementary features. The model is trained and evaluated on the Lost and Found dataset and is able to achieve state-of-art results with just 135 images in comparison to the 1000 images used by the previous benchmark. Additionally, we also compare our results with recent methods trained on 6000 images and show that our method achieves comparable performance with only 1000 training samples. @inproceedings{Gupta-icra-2018, AUTHOR = {Gupta Krishnam, Javed Syed Ashar, Gandhi Vineet and Krishna Madhava K.}, TITLE = {{MergeNet: A Deep Net Architecture for Small Obstacle Discovery}}, BOOKTITLE = {ICRA}, YEAR = {2018}}
		Kumar Kranthi, Kumar Moneish, Gandhi Vineet, Subramanian Ramanathan Watch to Edit: Video Retargeting using Gaze Eurographics 2018 [abstract] [paper] [www] [bibTex] We present a novel approach to optimally retarget videos for varied displays with differing aspect ratios by preserving salient scene content discovered via eye tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions, and (ii) adhere to the principles of cinematography. Our approach is (a) content agnostic as the same methodology is employed to re-edit a wide-angle video recording or a close-up movie sequence captured with a static or moving camera, and (b) independent of video length and can in principle re-edit an entire movie in one shot. Our algorithm consists of two steps. The first step employs gaze transition cues to detect time stamps where new cuts are to be introduced in the original video via dynamic programming. A subsequent step optimizes the cropping window path (to create pan and zoom effects), while accounting for the original and new cuts. The cropping window path is designed to include maximum gaze information, and is composed of piecewise constant, linear and parabolic segments. It is obtained via L(1) regularized convex optimization which ensures a smooth viewing experience. We test our approach on a wide variety of videos and demonstrate significant improvement over the state-of-the-art, both in terms of computational complexity and qualitative aspects. A study performed with 16 users confirms that our approach results in a superior viewing experience as compared to gaze driven re-editing [JSSH15] and letterboxing methods, especially for wide-angle static camera recordings. @inproceedings{kumar-eg-2018, title = {{Watch to Edit: Video Retargeting using Gaze}}, author = {Kumar Kranthi, Kumar Moneish, Gandhi Vineet, Subramanian Ramanathan}, booktitle = {Eurographics}, year = {2018}, }
		Rai Pranjal Kumar, Maheshwari Sajal, and Gandhi Vineet Document Quality Estimation using Spatial Frequency Response ICASSP (oral) 2018 [abstract] [paper] [www] [bibTex] The current Document Image Quality Assessment (DIQA) algorithms directly relate the Optical Character Recognition (OCR) accuracies with the quality of the document to build supervised learning frameworks. This direct correlation has two major limitations: (a) OCR may be affected by factors independent of the quality of the capture and (b) it cannot account for blur variations within an image. An alternate possibility is to quantify the quality of capture using human judgement, however, it is subjective and prone to error. In this work, we build upon the idea of Spatial Frequency Re- sponse (SFR) to reliably quantify the quality of a document image. We present through quantitative and qualitative exper- iments that the proposed metric leads to significant improve- ment in document quality prediction in contrast to using OCR as ground truth. @inproceedings{Rai-ICASSP-2018, title = {Document Quality Estimation using Spatial Frequency Response}, author = {Rai Pranjal Kumar, Maheshwari Sajal, and Gandhi Vineet}, booktitle = {ICASSP}, year = {2018}, }
		Shah Vatsal, and Gandhi Vineet An Iterative approach for Shadow Removal in Document Images ICASSP 2018 [abstract] [paper] [www] [bibTex] Uneven illumination and shadows in document images cause a challenge for digitization applications and automated work- flows. In this work, we propose a new method to recover un- shadowed document images from images with shadows/un- even illumination. We pose this problem as one of estimating the shading and reflectance components of the given original image. Our method first estimates the shading and uses it to compute the reflectance. The output reflectance map is then used to improve the shading and the process is repeated in an iterative manner. The iterative procedure allows for a gradual compensation and allows our algorithm to even handle diffi- cult hard shadows without introducing any artifacts. Experi- ments over two different datasets demonstrate the efficacy of our algorithm and the low computation complexity makes it suitable for most practical applications. @inproceedings{Shah-ICASSP-2018, title = {An Iterative approach for Shadow Removal in Document Images}, author = {Shah Vatsal and Gandhi Vineet}, booktitle = {ICASSP}, year = {2018}, }
		Sharma Rahul Anand, Bhat Bharath, Gandhi Vineet, Jawahar CV Automated Top View Registration of Broadcast Football Videos WACV 2018 [abstract] [paper] [bibTex] In this paper, we propose a fully automatic method to register football broadcast video frames on the static top view model of the playing surface. Automatic registration has been difficult due to the difficulty of finding sufficient point correspondences. We investigate an alternate ap- proach exploiting the edge information from the line mark- ings on the field. We formulate the registration problem as a nearest neighbour search over a synthetically generated dictionary of edge map and homography pairs. The syn- thetic dictionary generation allows us to exhaustively cover a wide variety of camera angles and positions and reduces this problem to a minimal per-frame edge map matching problem. We show that the per-frame results can be fur- ther improved in videos using an optimization framework for temporal camera stabilization. We demonstrate the ef- ficacy of our approach by presenting extensive results on a dataset collected from matches of the football World Cup 2014 and show significant improvement over the current state of the art. @inproceedings{anand-WACV-2018, title = {{Automated Top View Registration of Broadcast Football Videos}}, author = {Sharma Rahul Anand, Bhat Bharath, Gandhi Vineet, Jawahar CV}, booktitle = {{WACV}}, year = {2018}, }
		Rai Pranjal Kumar^, Maheshwari Sajal^, Mehta Ishit, Sakurikar Parikshit and Gandhi Vineet Beyond OCRs for Document Blur Estimation ICDAR 2017 [abstract] [paper] [dataset] [www] [bibTex] The current document blur/quality estimation algorithms rely on the OCR accuracy to measure their success. A sharp document image, however, at times may yield lower OCR accuracy owing to factors independent of blur or quality of capture. The necessity to rely on OCR is mainly due to the difficulty in quantifying the quality otherwise. In this work, we overcome this limitation by proposing a novel dataset for document blur estimation, for which we physically quantify the blur using a capture set-up which computationally varies the focal distance of the camera. We also present a selective search mechanism to improve upon the recently successful patch-based learning approaches (using codebooks or convolutional neural networks). We present a thorough analysis of the improved blur estimation pipeline using correlation with OCR accuracy as well as the actual amount of blur. Our experiments demonstrate that our method outperforms the current state-of-the-art by a significant margin. @inproceedings{pranjal-icdar-2017, title = {{Beyond OCRs for Document Blur Estimation}}, author = {Pranjal Kumar Rai, Sajal Maheshwari, Ishit Mehta, Parikshit Sakurikar and Vineet Gandhi}, booktitle = {ICDAR}, year = {2017}, }
		Kumar Moneish, Gandhi Vineet, Ronfard Remi and Gleicher Michael Zooming On All Actors: Automatic Focus+Context Split Screen Video Generation Eurographics 2017 [abstract] [paper] [slides (original keynote) ] [www] [bibTex] Recordings of stage performances are easy to capture with a high-resolution camera, but are difficult to watch because the actors’ faces are too small. We present an approach to automatically create a split screen video that transforms these recordings to show both the context of the scene as well as close-up details of the actors. Given a static recording of a stage performance and tracking information about the actors positions, our system generates videos showing a focus+context view based on computed close-up camera motions using crop-and zoom. The key to our approach is to compute these camera motions such that they are cinematically valid close-ups and to ensure that the set of views of the different actors are properly coordinated and presented. We pose the computation of camera motions as convex optimization that creates detailed views and smooth movements, subject to cinematic constraints such as not cutting faces with the edge of the frame. Additional constraints link the close up views of each actor, causing them to merge seamlessly when actors are close. Generated views are placed in a resulting layout that preserves the spatial relationships between actors. We demonstrate our results on a variety of staged theater and dance performances. @inproceedings{kumar-eg-2017, title = {{Zooming On All Actors: Automatic Focus+Context Split Screen Video Generation}}, author = {Kumar Moneish, Gandhi Vineet, Ronfard Remi and Gleicher Michael}, booktitle = {Eurographics}, year = {2017}, }
		Maheshwari Sajal, Rai Pranjal, Sharma Gopal and Gandhi Vineet Document Blur Detection using Edge Profile Mining ICVGIP 2016 [abstract] [paper] [www] [bibTex] We present an algorithm for automatic blur detection of document images using a novel approach based on edge intensity profiles. Our main insight is that the edge profiles are a strong indicator of the blur present in the image, with steep profiles implying sharper regions and gradual profiles implying blurred regions. Our approach first retrieves the profiles for each point of intensity transition (each edge point) along the gradient and then uses them to output a quantitative measure indicating the extent of blur in the input image. The real time performance of the proposed approach makes it suitable for most applications. Additionally, our method works for both hand written and digital documents and is agnostic to the font types and sizes, which gives it a major advantage over the currently prevalent learning based approaches. Extensive quantitative and qualitative experiments over two different datasets show that our method outperforms almost all algorithms in current state of the art by a significant margin, especially in cross dataset experiments. @inproceedings{sajal-icvgip-2016, title = {{Document Blur Detection using Edge Profile Mining}}, author = {Maheshwari Sajal, Rai Pranjal, Sharma Gopal and Gandhi Vineet}, booktitle = {ICVGIP}, year = {2016}, }
		Gandhi Vineet and Ronfard Remi A Computational Framework for Vertical Video Editing Eurographics workshop on intelligent cinematography and editing (WICED) 2016 [abstract] [paper] [bibTex] Vertical video editing is the process of digitally editing the image within the frame as opposed to horizontal video editing, which arranges the shots along a timeline. Vertical editing can be a time-consuming and error-prone process when using manual key-framing and simple interpolation. In this paper, we present a general framework for automatically computing a variety of cinematically plausible shots from a single input video suitable to the special case of live performances. Drawing on working practices in traditional cinematography, the system acts as a virtual camera assistant to the film editor, who can call novel shots in the edit room with a combination of high-level instructions and manually selected keyframes. @inproceedings{gandhi-egw-2016, title = {{A Computational Framework for Vertical Video Editing }}, author = {Gandhi Vineet and Ronfard Remi}, booktitle = { Eurographics workshop on intelligent cinematography and editing (WICED)}, year = {2016}, }
		Gandhi Vineet Automatic Rush Generation with Application to Theatre Performances PhD dissertation [abstract] [pdf] [bibTex] Professional quality videos of live staged performances are created by recording them from different appropriate viewpoints. These are then edited together to portray an eloquent story replete with the ability to draw out the intended emotion from the viewers. Creating such competent videos typically requires a team of skilled camera operators to capture the scene from multiple viewpoints. In this thesis, we explore an alternative approach where we automatically compute camera movements in post-production using specially designed computer vision methods. A high resolution static camera replaces the plural camera crew and their efficient cam- era movements are then simulated by virtually panning - tilting - zooming within the original recordings. We show that multiple virtual cameras can be simulated by choosing different trajectories of cropping windows inside the original recording. One of the key novelties of this work is an optimization framework for computing the virtual camera trajectories using the in- formation extracted from the original video based on computer vision techniques. The actors present on stage are considered as the most important elements of the scene. For the task of localizing and naming actors, we introduce generative models for learning view independent person and costume specific detectors from a set of labeled examples. We explain how to learn the models from a small number of labeled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum likelihood framework. We demonstrate that such actor specific models can accurately localize actors despite changes in view point and occlusions, and significantly improve the detection recall rates over generic detectors. The thesis then proposes an offline algorithm for tracking objects and actors in long video sequences using these actor specific models. Detections are first performed to independently select candidate locations of the actor/object in each frame of the video. The candidate detections are then combined into smooth trajectories by minimizing a cost function accounting for false detections and occlusions. Using the actor tracks, we then describe a method for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Our method requires only minimal user input to define the subject matter of each sub-clip. The composition of each sub-clip is automatically computed in a novel convex optimization framework. Our approach encodes several common cinematographic practices into a single convex cost function minimization problem, resulting in aesthetically-pleasing sub-clips which can easily be edited together using off-the-shelf multiclip video editing software. The proposed methods have been tested and validated on a challenging corpus of theatre recordings. They open the way to novel applications of computer vision methods for cost- effective video production of live performances including, but not restricted to, theatre, music and opera. @inproceedings{gandhi-dissertation-2014, title = {{Automatic Rush Generation with Application to Theatre Performances}}, author = {Gandhi, Vineet}, booktitle = {PhD dissertation}, year = {2014}, }
		Gandhi Vineet, Ronfard Remi, Gleicher Michael Multi-Clip Video Editing from a Single Viewpoint European Conference on Visual Media Production (CVMP) 2014 [abstract] [paper] [Presentation slides] [www] [bibTex] We propose a framework for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Assuming important actors and objects can be localized using computer vision techniques, our method requires only minimal user input to define the subject matter of each sub-clip. The composition of each sub-clip is automatically computed in a novel L1-norm optimization framework. Our approach encodes several common cinematographic practices into a single convex cost function minimization problem, resulting in aesthetically-pleasing sub-clips which can easily be edited together using off-the-shelf multi-clip video editing software. We demonstrate our approach on five video sequences of a live theatre performance by generating multiple synchronized sub-clips for each sequence. @inproceedings{gandhi-cvmp-2014, title = {{Multi-Clip Video Editing from a Single Viewpoint}}, author = {Gandhi, Vineet and Ronfard, Remi and Gleicher, Michael}, booktitle = {European Conference on Visual Media Production (CVMP)}, year = {2014}, }
		Gandhi Vineet, Ronfard Remi Detecting and naming actors in movies using generative appearance models Computer Vision and Pattern Recognition (CVPR) 2013 [abstract] [paper] [CVPR 2013 Poster (1.8mb)] [www] [bibTex] We introduce a generative model for learning person and costume specific detectors from labeled examples. We demonstrate the model on the task of localizing and nam- ing actors in long video sequences. More specifically, the actor’s head and shoulders are each represented as a con- stellation of optional color regions. Detection can proceed despite changes in view-point and partial occlusions. We explain how to learn the models from a small number of la- beled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum likelihood frame- work. We present results on a challenging movie example, with 81% recall in actor detection (coverage) and 89% pre- cision in actor identification (naming). @inproceedings{gandhi-cvpr-2013, title = {{Detecting and Naming Actors in Movies using Generative Appearance Models}}, author = {Gandhi, Vineet and Ronfard, Remi}, booktitle = {CVPR}, year = {2013}, }
		Gandhi Vineet, Cech Jan and Horaud Radu High-Resolution Depth Maps Based on TOF-Stereo Fusion IEEE International Conference on Robotics and Automation (ICRA) 2012 [abstract] [paper] [www] [bibTex] The combination of range sensors with color cameras can be very useful for robot navigation, semantic perception, manipulation, and telepresence. Several methods of combining range- and color-data have been investigated and successfully used in various robotic applications. Most of these systems suffer from the problems of noise in the range-data and resolution mismatch between the range sensor and the color cameras, since the resolution of current range sensors is much less than the resolution of color cameras. High-resolution depth maps can be obtained using stereo matching, but this often fails to construct accurate depth maps of weakly/repetitively textured scenes, or if the scene exhibits complex self-occlusions. Range sensors provide coarse depth information regardless of presence/absence of texture. The use of a calibrated system, composed of a time-of-flight (TOF) camera and of a stereoscopic camera pair, allows data fusion thus overcoming the weaknesses of both individual sensors. We propose a novel TOF-stereo fusion method based on an efficient seed-growing algorithm which uses the TOF data projected onto the stereo image pair as an initial set of correspondences. These initial ''seeds'' are then propagated based on a Bayesian model which combines an image similarity score with rough depth priors computed from the low-resolution range data. The overall result is a dense and accurate depth map at the resolution of the color cameras at hand. We show that the proposed algorithm outperforms 2D image-based stereo algorithms and that the results are of higher resolution than off-the-shelf color-range sensors, e.g., Kinect. Moreover, the algorithm potentially exhibits real-time performance on a single CPU. @inproceedings{gandhi-icra-2012, AUTHOR = {Gandhi, Vineet and Cech, Jan and Horaud, Radu}, TITLE = {{High-Resolution Depth Maps Based on TOF-Stereo Fusion}}, BOOKTITLE = {ICRA}, YEAR = {2012}}
		Ronfard Remi, Gandhi Vineet and Boiron Laurent The Prose Storyboard Language: A Tool for Annotating and Directing Movies Workshop on Intelligent Cinematography and Editing part of Foundations of Digital Games (FDG) 2013 [abstract] [paper] [www] [bibTex] The prose storyboard language is a formal language for describing movies shot by shot, where each shot is described with a unique sentence. The language uses a simple syntax and limited vocabulary borrowed from working practices in traditional movie-making, and is intended to be readable both by machines and humans. The language is designed to serve as a high-level user interface for intelligent cinematography and editing systems. @inproceedings{ronfard-wiced-2013, title = {{The Prose Storyboard Language: A Tool for Annotating and Directing Movies}}, author = {Ronfard, Remi and Gandhi, Vineet and Boiron, Laurent}, booktitle = {{2nd Workshop on Intelligent Cinematography and Editing part of Foundations of Digital Games - FDG 2013}}, year = {2013}, }

Teaching

Statistical Methods in Artificial Intelligence - Monsoon 2017, Spring 2020-2023
Digital Image Processing - Monsoon 2015, Monsoon 2016
Digital Signal Analysis and Applications - Spring 2016 - 2018
Computer Programming - Monsoon 2018, 2019
Advanced Data Structure and Algorithms - Monsoon 2020-2023
Human Values (part of collective faculty group) - 2015-2020
PG Certification in AI and Machine Learning (a program by IIIT outreach for Industry Professionals) - Since 2018

Students

Misc

Coming soon

Last updated: April 2018