This talk addresses fine-grained video understanding and generation, which pose new challenges compared to conventional settings and have wide-ranging applications in sports, cooking, entertainment, and beyond. It begins with an overview of our work on instructional video analysis, including the COIN dataset and a novel condensed action space learning method for procedure planning in instructional videos. Next, we introduce an uncertainty-aware score distribution learning method and a group-aware attention method for action quality assessment. Finally, we discuss how we leverage multimodal information, such as language and music, to improve referring segmentation and dance generation.