Proposes some state-of-the-art multimodal learning theories. To demonstrate the effectiveness of these models, the authors apply them to three practical tasks of micro-video understanding: popularity prediction, venue category estimation, and micro-video routing.