Dhruv Agarwal

I am an M.S. student in Computer Science (AI specialization) at UC San Diego. My work spans computer vision, multimodal learning, and learning-based systems for robotics and media.

Most recently, I was a Deep Learning Researcher at Rephrase.ai, where I worked on single-image talking-head video generation from audio and built a prosody-correction model that aligns word-level prosody with sentence-level speech. Earlier, I was an engineer at Udaan (≈10 months), developing perception and planning components for warehouse robotics. I began my career as a full-stack developer at SAP Labs India.

As an undergraduate, I interned with the STARS team at Inria (Sophia Antipolis) with Dr. François Brémond, focusing on multimodal emotion and personality recognition. I also collaborated remotely with Dr. Andrew Melnik at Bielefeld University on neural networks for physics-style reasoning tasks, and with the WZL Lab, RWTH Aachen, applying machine learning to measurement uncertainty in manufacturing.

I received my B.Tech. in Information Technology from IIIT Allahabad, where I worked with Dr. Rahul Kala on improving V-SLAM localization using deep learning.

Résumé / CV

Email  /  GitHub  /  Google Scholar  /  LinkedIn  /  Unsplash

profile photo
Research interests

I work on computer vision, machine learning, and reinforcement learning, with an emphasis on multimodal reasoning, 3D and generative media, and physical intuition in learned models. I am motivated by applications in robotics, healthcare, and assistive technology where robust perception and interaction matter.

Industrial Projects
project image
Single-shot talking-head video generation [Code (private)]

An internal pipeline for synthesizing talking-head video from one portrait image and a speech track. Audio drives an expression network that predicts per-frame 3D morphable model (3DMM) coefficients; those coefficients and the source image feed a neural face renderer. Frames are composed into a temporally coherent clip aligned to the input audio. Hover over the thumbnail to preview a sample result.
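The pipeline is proprietary, but its structure can be sketched with placeholder models standing in for the trained networks (the shapes, the linear audio-to-expression map, and the dummy renderer below are all illustrative assumptions, not the real system):

```python
import numpy as np

rng = np.random.default_rng(0)
W_expr = rng.standard_normal((80, 64)) * 0.01  # stand-in for the audio-to-expression network

def audio_to_coeffs(mel_frames):
    """Map each audio (mel) frame to 3DMM expression coefficients."""
    return mel_frames @ W_expr  # (T, 80) -> (T, 64)

def render_frame(source_image, coeffs):
    """Placeholder for the neural face renderer: one image per coefficient vector."""
    return source_image + coeffs.mean()  # dummy op; the real renderer is a CNN

def generate_clip(source_image, mel_frames):
    coeffs = audio_to_coeffs(mel_frames)     # per-frame 3DMM coefficients
    return np.stack([render_frame(source_image, c) for c in coeffs])

mel = rng.standard_normal((25, 80))          # 1 s of audio at 25 fps
portrait = rng.standard_normal((64, 64, 3))  # single source portrait
clip = generate_clip(portrait, mel)          # (frames, H, W, 3), aligned to audio
```

The point of the sketch is the data flow: audio frames drive per-frame coefficients, and the renderer consumes one coefficient vector per output video frame, which is what keeps the clip temporally aligned to the speech.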

project image Prosody correction (audio) [Code (private)]

Goal: transfer the prosody of a full spoken sentence onto an isolated target word while preserving lexical content. Inputs are (i) a short word clip from a resynthesized or TTS voice and (ii) a reference sentence from the target speaker.

A two-stage model first transforms the word's mel spectrogram with a CNN, conditioning on the reference sentence so that prosody matches while lexical content stays fixed; a Wasserstein GAN then refines the mel spectrogram before it is vocoded to a waveform. The result is seamless audio whose stress and timing follow the reference.
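As a toy illustration only (this is not the actual model), one can think of the simplest form of prosody transfer as matching per-band energy statistics of the word's mel spectrogram to those of the reference sentence:

```python
import numpy as np

def match_prosody_stats(word_mel, sentence_mel, eps=1e-8):
    """Shift/scale each mel band of the word to match the sentence's statistics."""
    w_mu, w_sd = word_mel.mean(0), word_mel.std(0) + eps
    s_mu, s_sd = sentence_mel.mean(0), sentence_mel.std(0) + eps
    return (word_mel - w_mu) / w_sd * s_sd + s_mu

rng = np.random.default_rng(1)
word = rng.standard_normal((40, 80))                 # (frames, mel bands)
sentence = rng.standard_normal((300, 80)) * 2.0 + 1.0
corrected = match_prosody_stats(word, sentence)      # band stats now follow the reference
```

The learned CNN replaces this crude mean/variance matching with a content-preserving mapping, and the WGAN stage removes the artifacts such a transformation introduces.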

project image
Airavat (warehouse robotics) [Code (private)]

Autonomous forklift prototype for navigation and pick/place in a warehouse. Front-mounted stereo cameras and additional sensors support localization; strategically placed AprilTags provide robust pose cues and define control points for Bézier-curve trajectories. Hover over the thumbnail to see a simulation of a planned path.
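The trajectory idea can be sketched in a few lines: detected AprilTag poses (here, hypothetical 2D floor coordinates) act as control points for a Bézier curve evaluated via the Bernstein basis:

```python
import numpy as np
from math import comb

def bezier(control_points, n_samples=50):
    """Sample a Bézier curve of arbitrary degree from its control points."""
    pts = np.asarray(control_points, dtype=float)     # (k+1, 2)
    k = len(pts) - 1
    t = np.linspace(0.0, 1.0, n_samples)
    basis = np.stack([comb(k, i) * t**i * (1 - t)**(k - i)
                      for i in range(k + 1)], axis=1)  # (n, k+1) Bernstein basis
    return basis @ pts                                 # (n, 2) planned path

tags = [(0.0, 0.0), (2.0, 3.0), (5.0, 3.0), (6.0, 0.0)]  # example tag positions (m)
path = bezier(tags)   # smooth path from the first tag to the last
```

A Bézier curve passes through its first and last control points and stays inside the convex hull of the rest, which makes tag-derived control points a convenient way to shape smooth, bounded trajectories.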

Research Projects
project image Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding

Tanay Agrawal, Dhruv Agarwal, Michal Balazia, Neelabh Sinha, Francois Bremond
International Conference on Computer Vision Theory and Applications (VISAPP), 2022
arXiv

Personality recognition from audio, visual, and behavioral cues using cross-attention Transformers and hand-crafted behavior encodings.

project image From Multimodal to Unimodal Attention in Transformers using Knowledge Distillation

Dhruv Agarwal, Tanay Agrawal, Laura M Ferrari, Francois Bremond
Advanced Video and Signal-based Surveillance (AVSS), 2021
arXiv · Slides

Distills a multimodal Transformer into a unimodal student via attention-level supervision, reducing modality dependence at inference.
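A simplified, numpy-only sketch of the core idea (the real method operates inside trained Transformers; the softmax logits below are placeholders): the unimodal student's attention maps are pushed toward the multimodal teacher's via a row-wise KL divergence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_kd_loss(teacher_logits, student_logits, eps=1e-12):
    """Mean KL(teacher || student) over attention rows."""
    p = softmax(teacher_logits)   # teacher attention distribution
    q = softmax(student_logits)   # student attention distribution
    return float((p * (np.log(p + eps) - np.log(q + eps))).sum(-1).mean())

rng = np.random.default_rng(2)
t = rng.standard_normal((8, 16, 16))   # (heads, queries, keys)
loss_same = attention_kd_loss(t, t)                             # identical maps
loss_diff = attention_kd_loss(t, rng.standard_normal(t.shape))  # mismatched maps
```

Supervising attention rather than only output logits transfers *where* the teacher looks, so the student can approximate multimodal behavior from a single modality at inference.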

project image Solving Physics Puzzles by Reasoning about Paths

Augustin Harter, Andrew Melnik, Gaurav Kumar, Dhruv Agarwal, Animesh Garg, Helge Ritter
NeurIPS Workshop on Interpretable Inductive Biases and Physically Structured Learning, 2020
arXiv · Video · Code

Neural model for PHYRE-style physics puzzles: plan interventions by reasoning about object trajectories and stable paths to goals.

project image SLAM and Map Learning using Hybrid Semantic Graph Optimization

Ambuj Agrawal, Dhruv Agarwal, Mehul Arora, Ritik Mahajan, Shivansh Beohar, Lhilo Kenye, Rahul Kala (equal contribution)
Mediterranean Conference on Control and Automation (MED), 2022
Paper

V-SLAM with richer semantics: corner-like features and detected objects support place recognition and correspondence, improving localization and loop closure on a mobile robot.

project image Similarity assessment and model migration for measurement processes

Dhruv Agarwal, Meike Huber, Robert Schmitt
International Journal of Quality & Reliability Management (IJQRM), 2022
Paper

A framework for deciding when an existing uncertainty model can be migrated to a related measurement process, reducing repeated metrology modeling effort while guarding against invalid reuse.

Hackathon Highlights
project image AutoChart

Dhruv Agarwal, Mehul Arora, Gillian McMahon, Jillian Sweetland, Shivansh Tiwari, Shriya Shetty. [Hackathon Presentation]

First place, new-venture track, Soonami Venturethon (2023). AutoChart turns clinician–patient dialogue into structured chart notes so providers can focus on care rather than documentation.

ChatComix

Dhruv Agarwal, Apoorv Agnihotri, Amit. [Code]

First place, generative-AI track, internal hackathon at Rephrase.ai. ChatComix generates interactive comics from high-level controls (genre, length, tone), combining LLM planning with a comic-style presentation layer.

Independent Projects
project image Unofficial Implementation of Implicit Neural Representation with Phase Loss and Fourier Features [Code]

Unofficial reimplementation of Phase Transitions, Distance Functions, and Implicit Neural Representations, including phase loss for sharper INR fits and an optional Fourier feature layer for high-frequency detail.
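The optional Fourier feature layer can be sketched directly (a minimal version; `scale` and the feature count are assumed hyperparameters): low-dimensional coordinates are projected through random frequencies before the MLP so the network can fit high-frequency detail.

```python
import numpy as np

def fourier_features(coords, n_features=64, scale=10.0, seed=0):
    """coords: (N, d) in [0, 1] -> (N, 2 * n_features) sinusoidal encoding."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((coords.shape[1], n_features)) * scale  # random frequencies
    proj = 2.0 * np.pi * coords @ B
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

# 2D query coordinates on an 8x8 grid, as an INR would sample them
xy = np.stack(np.meshgrid(np.linspace(0, 1, 8), np.linspace(0, 1, 8)),
              axis=-1).reshape(-1, 2)
feats = fourier_features(xy)   # fed to the MLP in place of raw coordinates
```

Larger `scale` widens the frequency bandwidth: too small and fine detail is lost, too large and the fit becomes noisy.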

project image Unofficial Implementation of ViViT [Code]

Unofficial PyTorch implementation of the Video Vision Transformer (ViViT) for video classification and spatiotemporal feature extraction.

Deep Reinforcement Learning on Games [Code]

Classic RL algorithms (DQN, A3C, PPO, and related baselines) applied to Atari-style and FPS environments (e.g., Doom, Space Invaders, Sonic the Hedgehog 2).

project image Face Generation [Code]

DCGAN variants trained on CelebA to synthesize realistic face images.

project image Art Maker [Code]

Neural style transfer with a VGG backbone: combine a content image and a reference style to produce stylized outputs.
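The style side of this can be sketched in a few lines: style is compared via Gram matrices of feature maps (in the real project these come from a VGG backbone; here they are plain arrays for illustration).

```python
import numpy as np

def gram_matrix(features):
    """features: (C, H, W) -> (C, C) normalized channel correlations."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(gen_feats, style_feats):
    """Squared Frobenius distance between Gram matrices."""
    g, s = gram_matrix(gen_feats), gram_matrix(style_feats)
    return float(((g - s) ** 2).sum())

rng = np.random.default_rng(3)
style = rng.standard_normal((16, 32, 32))   # one conv layer's activations
gen = rng.standard_normal((16, 32, 32))
loss = style_loss(gen, style)               # minimized w.r.t. the generated image
```

Because Gram matrices discard spatial layout and keep channel co-occurrence statistics, minimizing this loss transfers texture and color while a separate content loss preserves the scene.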


Blogs

Occasional writing on ML practice and hackathons, e.g. "Getting prepped with machine-learning skills for a hackathon" (dev.to).


Design and source code from Jon Barron's website