I am an M.S. student in Computer Science (AI specialization) at UC San Diego. My work spans computer vision, multimodal learning, and learning-based systems for robotics and media.
Most recently, I was a Deep Learning Researcher at Rephrase.ai, where I worked on single-image talking-head video generation from audio and built a prosody-correction model that aligns word-level prosody with sentence-level speech. Earlier, I was an engineer at Udaan (≈10 months), developing perception and planning components for warehouse robotics. I began my career as a full-stack developer at SAP Labs India.
I received my B.Tech. in Information Technology from IIIT Allahabad, where I worked with Dr. Rahul Kala on improving V-SLAM localization using deep learning.
I work on computer vision, machine learning, and reinforcement learning, with an emphasis on multimodal reasoning, 3D and generative media, and physical intuition in learned models. I am motivated by applications in robotics, healthcare, and assistive technology where robust perception and interaction matter.
Industrial Projects
Single-shot talking-head video generation
[Code (private)]
An internal pipeline for synthesizing talking-head video from one portrait image and a speech track. Audio drives an expression network that predicts per-frame 3D morphable model (3DMM) coefficients; those coefficients and the source image feed a neural face renderer. Frames are composed into a temporally coherent clip aligned to the input audio.
Goal: transfer the prosody of a full spoken sentence onto an isolated target word while preserving lexical content. Inputs are (i) a short word clip from a resynthesized or TTS voice and (ii) a reference sentence from the target speaker.
A two-stage model first maps the word’s mel spectrogram with a CNN so that prosody matches the sentence while content stays fixed; a Wasserstein GAN then refines the mel before vocoding to waveform. The result is seamless audio with corrected stress and timing relative to the reference.
Autonomous forklift prototype for navigation and pick/place in a warehouse. Front-mounted stereo cameras and additional sensors support localization; strategically placed AprilTags provide robust pose cues and define control points for Bézier-curve trajectories.
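As a rough illustration of how control points can define a Bézier-curve trajectory, the sketch below evaluates a curve with De Casteljau's algorithm and samples it into waypoints. It is not the forklift's actual planner: the function names, the 2D floor frame, and the example control points are all assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bezier_point(control_points: List[Point], t: float) -> Point:
    """Evaluate a Bézier curve at t in [0, 1] with De Casteljau's algorithm."""
    if not control_points:
        raise ValueError("need at least one control point")
    pts = list(control_points)
    # Repeatedly interpolate between neighbouring points until one remains.
    while len(pts) > 1:
        pts = [
            ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
            for (x0, y0), (x1, y1) in zip(pts, pts[1:])
        ]
    return pts[0]


def sample_path(control_points: List[Point], n: int = 20) -> List[Point]:
    """Sample n evenly spaced (in t) waypoints along the curve."""
    if n < 2:
        raise ValueError("need at least two samples")
    return [bezier_point(control_points, i / (n - 1)) for i in range(n)]


# Hypothetical control points in metres on the warehouse floor,
# standing in for positions derived from tag detections.
tag_control_points = [(0.0, 0.0), (2.0, 0.0), (2.0, 3.0), (4.0, 3.0)]
path = sample_path(tag_control_points, n=5)
```

The curve starts at the first control point and ends at the last, while the interior points only shape the turn. That makes the start and goal poses easy to pin down while keeping the path smooth.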
Research Projects
Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding
Tanay Agrawal, Dhruv Agarwal, Michal Balazia, Neelabh Sinha, Francois Bremond
International Conference on Computer Vision Theory and Applications (VISAPP), 2022
arXiv
Personality recognition from audio, visual, and behavioral cues using cross-attention Transformers and hand-crafted behavior encodings.
From Multimodal to Unimodal Attention in Transformers using Knowledge Distillation
Dhruv Agarwal, Tanay Agrawal, Laura M Ferrari, Francois Bremond
Advanced Video and Signal-based Surveillance (AVSS), 2021
arXiv · Slides
Distills a multimodal Transformer into a unimodal student via attention-level supervision, reducing modality dependence at inference.
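One possible reading of attention-level supervision is a penalty that pulls the student's attention maps toward the teacher's, added to the usual task loss. The sketch below uses a row-wise KL divergence for that penalty. The choice of KL, the `weight` parameter, and all function names are assumptions for illustration and may differ from what the paper does.

```python
import math
from typing import List

Matrix = List[List[float]]  # one attention map: rows are queries, columns are keys


def attention_kl(teacher_attn: Matrix, student_attn: Matrix, eps: float = 1e-12) -> float:
    """Mean row-wise KL(teacher || student) between two attention maps."""
    if not teacher_attn or len(teacher_attn) != len(student_attn):
        raise ValueError("attention maps must be non-empty and have the same number of rows")
    total = 0.0
    for t_row, s_row in zip(teacher_attn, student_attn):
        if len(t_row) != len(s_row):
            raise ValueError("attention rows must have the same length")
        # Terms with zero teacher probability contribute nothing to the KL sum.
        total += sum(
            t * math.log((t + eps) / (s + eps))
            for t, s in zip(t_row, s_row)
            if t > 0
        )
    return total / len(teacher_attn)


def distillation_loss(task_loss: float, teacher_attn: Matrix,
                      student_attn: Matrix, weight: float = 1.0) -> float:
    """Student objective: task loss plus a weighted attention-matching term."""
    return task_loss + weight * attention_kl(teacher_attn, student_attn)


# Toy attention maps (each row sums to 1); the values are made up.
teacher = [[0.9, 0.1], [0.2, 0.8]]
student = [[0.5, 0.5], [0.5, 0.5]]
loss = distillation_loss(task_loss=0.5, teacher_attn=teacher, student_attn=student, weight=0.1)
```

The penalty is zero when the two maps agree and grows as the student's attention drifts from the teacher's. In this sketch the teacher's attention is only needed during training.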
Solving Physics Puzzles by Reasoning about Paths
Augustin Harter, Andrew Melnik, Gaurav Kumar, Dhruv Agarwal, Animesh Garg, Helge Ritter
NeurIPS workshop on Interpretable Inductive Biases and Physically Structured Learning, 2020
arXiv · Video · Code
Neural model for PHYRE-style physics puzzles: plan interventions by reasoning about object trajectories and stable paths to goals.
SLAM and Map Learning using Hybrid Semantic Graph Optimization
Ambuj Agrawal, Dhruv Agarwal, Mehul Arora, Ritik Mahajan, Shivansh Beohar, Lhilo Kenye, Rahul Kala (equal contribution)
Mediterranean Conference on Control and Automation, 2022
Paper
V-SLAM with richer semantics: corner-like features and detected objects support place recognition and correspondence, improving localization and loop closure on a mobile robot.
Similarity assessment and model migration for measurement processes
Dhruv Agarwal, Meike Huber, Robert Schmitt
International Journal of Quality & Reliability Management (IJQRM), 2022
Paper
Framework for deciding when an existing uncertainty model can be migrated to a related measurement process, reducing repeated metrology modeling effort while guarding against invalid reuse.
First place, new-venture track, Soonami Venturethon (2023). AutoChart turns clinician–patient dialogue into structured chart notes so providers can focus on care rather than documentation.
Classic RL algorithms (DQN, A3C, PPO, and related baselines) applied to Atari-style and FPS environments (e.g., Doom, Space Invaders, Sonic the Hedgehog 2).