Apple shows off open AI prowess: new models outperform Mistral ...
As the world continues to gush over the all-new GPT-4o-mini, Apple has chosen to expand its own family of small models. A few hours ago, the research team at Apple, working as part of the DataComp for Language Models (DCLM) project, released a family of open DCLM models on Hugging Face.
Apple's New Open DCLM Models
The release centers on two models: one with 7 billion parameters and the other with 1.4 billion. Both perform well on benchmarks, and the larger one has outperformed Mistral-7B and is closing in on other leading open models, including Llama 3 and Gemma.
Vaishaal Shankar of the Apple ML team described these as the “best-performing” open-source models out there. Notably, the project is truly open source: the release includes the model weights, the training code, and the pretraining dataset.
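For readers who want to try the released weights, loading them should look roughly like the sketch below. This is a minimal illustration rather than official usage: the repository id apple/DCLM-7B is an assumption here, and the Hugging Face model card may list additional dependencies (such as the open_lm package), so follow the card for exact instructions.

```python
# Minimal sketch: loading the released weights via Hugging Face transformers.
# The repo id "apple/DCLM-7B" is an assumption; check the model card for the
# exact repository name and any extra dependencies it requires.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "apple/DCLM-7B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Open datasets matter because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```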
The DataComp Project
Led by a multidisciplinary team of researchers from Apple, the University of Washington, Tel Aviv University, and the Toyota Research Institute, the DataComp project is a collaborative effort to design high-quality datasets for training AI models, particularly in the multimodal domain.
The idea is pretty simple here: use a standardized framework – with fixed model architectures, training code, hyperparameters, and evaluations – to run different experiments and figure out which data curation strategy works best for training a highly performant model.
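In code, that setup can be pictured something like the sketch below. This is a conceptual illustration rather than the project's actual tooling: the FixedRecipe fields and the train_and_evaluate stub are hypothetical placeholders standing in for a full training and evaluation run.

```python
# Conceptual sketch of the DataComp-style recipe (not the project's real code):
# hold the architecture, hyperparameters, and evaluation fixed, and vary only the
# data curation strategy, so score differences can be attributed to the data.
from dataclasses import dataclass

@dataclass(frozen=True)
class FixedRecipe:
    architecture: str = "decoder-only-transformer"  # never changes between runs
    tokens_seen: int = 10**9                        # fixed token budget
    learning_rate: float = 3e-4                     # fixed hyperparameters
    eval_suite: str = "core-benchmarks"             # fixed evaluation

def train_and_evaluate(recipe: FixedRecipe, dataset_name: str) -> float:
    """Hypothetical stand-in for a full training + evaluation run."""
    # In a real experiment this would train a model on `dataset_name`
    # under `recipe` and return its benchmark score.
    return 0.0

recipe = FixedRecipe()
for curation_strategy in ["raw-crawl", "heuristic-filters", "model-based-filtering"]:
    score = train_and_evaluate(recipe, curation_strategy)
    print(f"{curation_strategy}: {score:.3f}")
```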
Model Performance and Experiments
Work on the project started a while ago, and the experiments led the team to conclude that model-based filtering, where machine learning (ML) models automatically filter and select high-quality data from larger datasets, can be key to assembling a high-quality training set.
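Model-based filtering itself is a simple idea, and a generic version of it can be sketched as follows. This is not DCLM's actual filtering pipeline; the tiny classifier, seed documents, and keep fraction below are hypothetical, chosen only to show the score-then-threshold pattern.

```python
# Generic sketch of model-based filtering: train a small quality classifier on
# examples of "good" vs. "bad" text, score a larger pool of documents, and keep
# only the highest-scoring fraction as training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical seed data: high-quality reference text vs. low-quality web text.
good_docs = ["A clear explanation of how transformers process token sequences.",
             "Researchers compared data curation strategies under a fixed recipe."]
bad_docs = ["click here buy now best deals!!!", "lorem ipsum asdf qwerty zzzz"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(good_docs + bad_docs)
y = [1] * len(good_docs) + [0] * len(bad_docs)
classifier = LogisticRegression().fit(X, y)

# Score a larger candidate pool and keep the top fraction.
candidate_pool = ["An overview of benchmark results for open language models.",
                  "free prizes win win win subscribe now"]
scores = classifier.predict_proba(vectorizer.transform(candidate_pool))[:, 1]
keep_fraction = 0.5
threshold = sorted(scores, reverse=True)[int(len(scores) * keep_fraction) - 1]
filtered = [doc for doc, s in zip(candidate_pool, scores) if s >= threshold]
print(filtered)
```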
To demonstrate the effectiveness of the curation technique, the resulting dataset, DCLM-Baseline, was used to train the new DCLM decoder-only transformer English language models with 7 billion and 1.4 billion parameters from scratch.
Model Enhancements
The larger model’s performance across the Core and Extended benchmarks improved further when the researchers extended its context length and carried out additional training on the same dataset using the Dataset Decomposition technique.
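Roughly speaking, Dataset Decomposition splits each tokenized document into chunks whose lengths are powers of two and buckets the chunks by length, so every training batch can be drawn at a single, uniform sequence length. The sketch below is an illustrative simplification of that idea, not the team's training code.

```python
# Illustrative simplification of the Dataset Decomposition idea: greedily split
# each tokenized document into power-of-two sized chunks, then bucket chunks by
# length so a batch can be sampled from one bucket at a uniform sequence length.
from collections import defaultdict

def decompose(tokens: list[int], max_len: int = 2048) -> list[list[int]]:
    """Greedily split a token sequence into power-of-two sized chunks."""
    chunks, i = [], 0
    while i < len(tokens):
        remaining = len(tokens) - i
        # Largest power of two that fits in what is left, capped at max_len.
        size = min(max_len, 2 ** (remaining.bit_length() - 1))
        chunks.append(tokens[i:i + size])
        i += size
    return chunks

# Bucket chunks by length; training batches would then be drawn per bucket.
buckets: dict[int, list[list[int]]] = defaultdict(list)
for doc in [list(range(5000)), list(range(700))]:  # toy "tokenized documents"
    for chunk in decompose(doc):
        buckets[len(chunk)].append(chunk)

print({length: len(chunks) for length, chunks in buckets.items()})
```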
Just like DCLM-7B, the smaller 1.4B version of the model, trained jointly with the Toyota Research Institute, delivers impressive performance across various tests.
Commercial Use and Open Source
Currently, the larger model is available under Apple’s Sample Code License, while the smaller one has been released under Apache 2.0, allowing for commercial use, distribution, and modification.
It is important to note that this is early-stage research highlighting the effectiveness of data curation. The models are not intended for use on Apple devices, and they may exhibit biases from their training data or produce harmful responses.