Mar 17, 2024

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Apple's MM1 paper distilled into practical lessons on data mixtures, image resolution, synthetic data, and multimodal architecture.

Researchers from Apple have taken a big step towards building open multimodal models. In the paper, they share lessons on model and data design, along with a general recipe for building MM1.

Data ablation

An ablation study is a research technique in which we remove or alter parts of a system and check how that changes its behavior. The paper uses data ablation, which applies the same idea to the training data: we systematically remove (or alter) parts of the input data to investigate the impact on the model's performance. This helps identify which elements of the data matter most for achieving high accuracy.

In the context of training multimodal large language models (MLLMs), data ablation involves conducting experiments where certain types of data (e.g., text-only data, image data, interleaved image-text data, or specific features within these data types) are either excluded from or specifically included in the training process. By comparing the performance of the model across these different training configurations, researchers can glean insights into how different types of data contribute to the model's ability to understand and generate responses based on the input it receives.
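The procedure described above can be sketched as a leave-one-out loop. This is a minimal illustration, not the paper's experimental setup: the source names and the `train_and_evaluate` callback are hypothetical, and in practice each call would be a full training run followed by evaluation.

```python
# Hypothetical labels for the data types discussed above.
DATA_SOURCES = ("captioned", "interleaved", "text_only")


def ablation_configs(sources):
    """Return the full mix plus every variant with exactly one source removed."""
    full = tuple(sources)
    configs = [full]
    for removed in full:
        configs.append(tuple(s for s in full if s != removed))
    return configs


def run_ablation(sources, train_and_evaluate):
    """Score each leave-one-out configuration relative to the full mix.

    `train_and_evaluate` is supplied by the caller (an assumption of this
    sketch): it takes a tuple of source names and returns a single score.
    """
    configs = ablation_configs(sources)
    baseline = train_and_evaluate(configs[0])
    report = {}
    for config in configs[1:]:
        removed = (set(sources) - set(config)).pop()
        # A negative value means removing this source hurt the score.
        report[removed] = train_and_evaluate(config) - baseline
    return report
```

A strongly negative entry in the report suggests that the removed data type was contributing to performance, which is the kind of signal a data ablation is meant to surface.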

How to select training data? Top lessons.

After performing the data ablation studies, the authors arrived at the following lessons:

Mixing interleaved data (text and images together) with captioned data (an image with a separate caption) in the right ratio positively affects model performance. It's also worth including text-only data alongside image data, which was crucial for maintaining the model's language understanding capabilities. In general, use a variety of data types. Example mix: 45% interleaved image-text documents, 45% image-text pair documents, and 10% text-only documents.
Consider including high-quality synthetic data, such as synthetic captions (VeCap), to improve the model's learning.
Prioritize high-resolution images and use an image encoder capable of handling such resolutions to ensure the model can extract and learn from fine-grained details.
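One simple way to apply the example 45/45/10 mix is to pick a data source for each training example with matching probabilities. The sketch below only shows that sampling step; the source labels and function names are made up for illustration, and the paper's actual data pipeline may work differently.

```python
import random

# Example mix quoted above; the keys are labels chosen for this sketch.
MIXTURE = {"interleaved": 0.45, "image_text_pairs": 0.45, "text_only": 0.10}


def sample_sources(mixture, batch_size, seed=0):
    """Draw one source name per example, proportionally to the mixture weights."""
    if abs(sum(mixture.values()) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)


def source_fractions(samples):
    """Return the observed share of each source in a list of sampled names."""
    return {name: samples.count(name) / len(samples) for name in set(samples)}


if __name__ == "__main__":
    batch = sample_sources(MIXTURE, batch_size=10_000)
    print(source_fractions(batch))  # shares should be close to the weights
```

Sampling per example keeps the ratio stable over long runs without requiring every batch to contain exact counts of each data type.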

Model architecture and training

Because we are dealing with multiple modalities, each kind of input needs its own processing path. In models like MM1, images go through an image encoder (e.g., a Vision Transformer, or ViT) that produces a set of visual tokens, while the input text is tokenized and converted into textual tokens.

In the paper, they use a Vision Transformer (ViT-H, at a resolution of 378x378 pixels) as the image encoder, pre-trained with a CLIP objective. This choice underscores the importance of image resolution for image encoding.
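To see why resolution matters for cost as well as detail, it helps to count how many patch tokens a ViT produces for a given image size. The sketch below assumes a patch size of 14 pixels, which is common for ViT-H variants but is not stated in this article, and it ignores any class token.

```python
def num_patch_tokens(image_size, patch_size=14):
    """Number of patch tokens a ViT produces for a square image (no class token).

    The default patch_size of 14 is an assumption of this sketch.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


if __name__ == "__main__":
    print(num_patch_tokens(224))  # 256
    print(num_patch_tokens(378))  # 729
```

Under that assumption, moving from 224x224 to 378x378 nearly triples the number of visual tokens the encoder emits, which is the trade-off behind preferring higher resolution.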

Next comes the Vision-Language (VL) connector. Its role is to integrate the visual tokens from the image encoder with the textual tokens derived from the input text.

After the VL connector combines visual and textual tokens, a Transformer-based language model processes the combined sequence, attending over both modalities to produce its output.
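The flow from image encoder to connector to language model can be sketched at the level of token lists. This is a toy illustration under several assumptions: plain average pooling stands in for the connector (real connectors are learned and also project into the language model's embedding space), visual tokens are simply placed in front of the text, and all function names are hypothetical.

```python
def average_pool_tokens(tokens, num_output_tokens):
    """Reduce a list of token vectors to a fixed number by averaging equal groups."""
    if len(tokens) % num_output_tokens != 0:
        raise ValueError("token count must be divisible by num_output_tokens")
    group = len(tokens) // num_output_tokens
    dim = len(tokens[0])
    pooled = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        pooled.append([sum(vec[i] for vec in chunk) / group for i in range(dim)])
    return pooled


def build_multimodal_sequence(visual_tokens, text_embeddings, num_visual_tokens):
    """Pool the visual tokens, then place them in front of the text embeddings."""
    if len(visual_tokens[0]) != len(text_embeddings[0]):
        raise ValueError("visual and text vectors must share one dimension")
    return average_pool_tokens(visual_tokens, num_visual_tokens) + text_embeddings


if __name__ == "__main__":
    visual = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]]  # 4 patch tokens
    text = [[0.0, 0.0]]                                        # 1 text token
    print(build_multimodal_sequence(visual, text, num_visual_tokens=2))
    # [[2.0, 2.0], [6.0, 6.0], [0.0, 0.0]]
```

The resulting list is what the language model would consume as a single sequence. Reducing the number of visual tokens in the connector is one way to keep that sequence short when the encoder runs at high resolution.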

Conclusion

The findings are not surprising, but it's good that the authors took the time to confirm them. They showed that it's important to gather a balanced and diverse dataset that includes a mix of image-text pairs, interleaved documents, and text-only documents to support training across modalities. It's all about quality over quantity: when training multimodal models, we should try to use high-quality, high-resolution images and well-curated text data. Also, as its rising popularity suggests, carefully incorporated synthetic data can be useful.
