Multimodal and Generative Representation Learning