Research Scientist

CANTINA RESEARCH SINGAPORE PTE. LTD.

About the Role

Cantina is expanding, and we're looking for a Research Scientist to join our growing Singapore team! In this role, you will drive foundational research on video generation models, taking ownership across the full research cycle and leading post-training research. You'll also collaborate closely with data, infrastructure, and adjacent modeling teams to translate research findings into durable model improvements.

What You’ll Do

  • Build and maintain scalable systems for ingesting, preprocessing, and delivering large-scale video data for model training

  • Design and scale distributed data pipelines for preprocessing, dataset generation, and repeated dataset refreshes

  • Own workflow orchestration, job scheduling, monitoring, and failure recovery for large-scale data processing jobs

  • Implement and maintain containerized pipeline infrastructure using Kubernetes or equivalent orchestration systems

  • Optimize cloud-based data storage and movement across providers (AWS, GCP, or Azure) for cost, throughput, and operational efficiency

  • Define and implement best practices for dataset storage layout, versioning, caching, retention, and access patterns

  • Build tooling to support deduplication workflows at scale, including near-dedup pipelines over large video corpora

  • Research and develop distillation methods for large-scale diffusion and flow-based video generation models, including guidance distillation and adversarial distillation, with a focus on preserving or improving generation quality while reducing inference cost

  • Develop reward models and preference-based fine-tuning pipelines that align video generation quality with human judgments across dimensions such as aesthetics, motion quality, and prompt adherence

  • Analyze the relationship between base model behavior and post-training outcomes, and work with the foundation model team to inform pretraining decisions accordingly

What You’ll Bring

  • Strong hands-on experience building or scaling large-scale data systems or pipelines for machine learning workflows

  • Experience with distributed data processing frameworks such as PySpark or Ray, and orchestration tools such as Airflow or equivalent

  • Familiarity with containerization and container orchestration, including Docker and Kubernetes

  • Experience working with cloud-based data storage and compute (AWS, GCP, and/or Azure), including tradeoffs around cost, throughput, storage layout, and access patterns

  • Familiarity with video and media processing tools such as FFmpeg, PyAV, DALI, or OpenCV

  • Familiarity with multimodal or media data, including video, image, text, and audio

  • Strong research background in post-training methods for large-scale diffusion or flow-based generative models, with deep hands-on experience in distillation that improves inference efficiency while preserving generation quality

  • Experience with reward modeling or preference-based fine-tuning for generative models, including RLHF, DPO, or equivalent alignment approaches

  • Solid understanding of the interplay between pretraining and post-training, and how base model properties affect distillation and fine-tuning outcomes

  • Proficiency in Python and modern machine learning frameworks, with a strong preference for PyTorch or JAX

  • Track record of independent research, with the ability to drive projects from initial idea through experimental validation

  • Publications at top-tier venues (NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV) preferred

  • Good understanding of the practical challenges involved in building reliable, scalable, and reproducible data workflows for machine learning systems