Souvik Ghosh

Education

MS by Research, CSE

IIIT Hyderabad | Grade: 8.5

2024-2026

Courses taken: Statistical Methods in AI, Digital Image Processing, Advanced NLP (LLMs), Computer Vision, Technology, Product and Entrepreneurship
Research focus: multimodal AI, speech technologies, LLMs, and representation learning.
Working with the Audio Visual Team at CVIT Lab, guided by Professor CV Jawahar and Professor Vinay Namboodiri.

BTech, Applied Electronics and Instrumentation

HITK, Kolkata | Grade: 8.12

2019-2023

Explored the beauty of interdisciplinary education.
Final-year thesis on IoT and edge ML for patients with epilepsy.
Major projects in harassment detection and women's safety. Runner-up at Nasscom Lab 2 Market.

Publications

LipAdapter: Text-to-Video Alignment is All You Need for Lip-to-Speech

Under Submission

A generic and efficient modular framework that adapts an existing frozen pre-trained Text-to-Speech model into a lip-synchronized speech generator.

Aavaz: An Early Attempt to Give Voice to the Voiceless

Under Submission

A mobile app for silent-video-to-speech communication, leveraging an adapted VSR model with a novel constrained beam search strategy.

HindiOCR-VLM: Adapting Vision-Language Models for OCR in Indian Languages

ICDAR 2025

The first single-stage, multi-domain OCR for Indian languages, showing better results than industry-grade OCR systems such as those from Google, AWS, and Azure, as well as open-source SOTA models. In keeping with our commitment to open source, the model and code have been released publicly.

ReSenseNet: Ensemble Early Fusion Deep Learning Architecture for Multimodal Sentiment Analysis

IHCI 2021

Explores multimodal sentiment analysis using a novel ensemble early-fusion deep learning architecture. Opens the door to targeted sentiment analysis in constrained environments.

Speech@SCIS: Annotated Indian Video Dataset

SCI 2021

With the advent of AI-based content creation, clean, annotated datasets for Indian languages are necessary. This work proposes a dataset with a balanced set of male and female speakers for Indian languages.

Recent Updates

June 2025

Paper Accepted at ICDAR 2025

HindiOCR-VLM: Adapting Vision-Language Models for OCR in Indian Languages

April 2025

Paper Accepted at DG-EBF@CVPR 2025

CLIP based domain adaptation via Residual Hypernetworks

August 2024

Started MS by Research at IIIT Hyderabad

Began my research journey at IIIT Hyderabad, focusing on multimodal AI and speech technologies under the guidance of Professor CV Jawahar.

February 2024

Joined Sync Labs (YC W24)

Excited to join Sync Labs as a Research Engineer, working on cutting-edge lip-sync and facial animation technologies.

Featured in leading Indian tech magazines, including IndiaAI and Analytics India Magazine.