I am a PhD student in Computer Science at Clemson University, advised by Prof. Siyu Huang. Previously, I obtained my M.S. in Computer Science from the University of Southern California, where I was advised by Prof. Yajie Zhao. I received my B.E. in Computer Science and Technology from Nanjing University of Posts and Telecommunications in 2020.
My research interests include computer vision and multimodal learning, particularly in 3D reconstruction and visual understanding.
We introduce FF3R, a fully annotation-free feed-forward framework that unifies geometric and semantic reasoning from unconstrained multi-view image sequences. Extensive experiments on ScanNet and DL3DV-10K demonstrate FF3R's superior performance in novel-view synthesis, open-vocabulary semantic segmentation, and depth estimation, with strong generalization to in-the-wild scenarios, paving the way for embodied intelligence systems that demand both spatial and semantic understanding.
We introduce FlexMap, a flexible HD map construction method that adapts to variable camera configurations without architectural changes or retraining. Unlike prior methods fixed to specific multi-camera setups, FlexMap uses a geometry-aware foundation model with cross-frame attention to implicitly encode 3D scene understanding, maintaining robustness to missing views and sensor variations for practical autonomous driving deployment.
This work introduces a new differentiable vector graphics (VG) representation, dubbed Bézier splatting, that enables fast yet high-fidelity VG rasterization. Bézier splatting samples 2D Gaussians along Bézier curves, which naturally provide positional gradients at object boundaries.
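The core sampling idea can be illustrated with a minimal NumPy sketch: place 2D Gaussian centers at uniform parameter values along a cubic Bézier curve and orient each one along the local tangent. The function names and the uniform-parameter sampling are illustrative assumptions; the paper's actual rasterizer is a differentiable, GPU-based splatting pipeline.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    # Bernstein form of a cubic Bezier curve; t has shape (n,), points shape (2,)
    t = t[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

def sample_gaussians_along_curve(ctrl_pts, n=16):
    """Hypothetical helper: Gaussian centers and orientations along one curve."""
    p0, p1, p2, p3 = ctrl_pts
    t = np.linspace(0.0, 1.0, n)
    centers = cubic_bezier(p0, p1, p2, p3, t)          # (n, 2) splat centers
    # Curve derivative gives the tangent, used to orient each 2D Gaussian
    tc = t[:, None]
    deriv = (3 * (1 - tc) ** 2 * (p1 - p0)
             + 6 * (1 - tc) * tc * (p2 - p1)
             + 3 * tc ** 2 * (p3 - p2))
    angles = np.arctan2(deriv[:, 1], deriv[:, 0])      # (n,) rotation per splat
    return centers, angles
```

Because the centers are smooth functions of the control points, gradients from a rendering loss flow directly back to the curve geometry, which is what makes the boundaries optimizable.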
In this work, we propose a method to learn 3D-aware 2D representations and enable 3D reconstruction in the latent space. Our latent radiance field (LRF) performs 3D reconstruction in the 2D latent space instead of the image space, and renders high-quality, photorealistic novel views, even for unbounded scenes.
3DGS-Enhancer restores view-consistent latent features of rendered novel views and integrates them with the input views through a spatial-temporal decoder. The enhanced views are then used to fine-tune the initial 3DGS model, significantly improving its rendering performance.