What Matters for Model Merging at Scale?

Categories: papers, summary, transformers, research, LLMs, model_merging

Author: Aakash Kumar Nain (@A_K_Nain)

Published: October 15, 2024

Links: arXiv | annotated_paper

Yesterday, I read this banger paper titled What Matters For Model Merging At Scale? Though I recommend reading the full paper, I am including a summary here in case you are interested in the main points. I will also provide a link to an annotated version of this paper.

Introduction

Model merging is not a new concept. It has been tried and tested extensively as a way to build a better or more capable model by combining two or more (expert) models. Model merging has several advantages:

  • Reduced storage and serving costs
  • Improved generalization to new tasks due to combined capabilities
  • Decentralized and modular model development

One potential gap in this area is the lack of a comprehensive study to evaluate its effectiveness as we scale the model size. Most people are either merging models at a small scale (7B-13B models) or merging a limited number of expert models. This paper provides insights into the scalability of model merging.

Problem Statement

  • Focuses on model merging with large models
  • N expert tasks and a base model
  • An expert for each of the N tasks is obtained by fully fine-tuning the base model on that specific task.
  • Four merging methods: Averaging, Task Arithmetic, TIES, and Dare-TIES (a minimal sketch of the first two is shown below).
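To make the first two methods concrete, here is a minimal sketch (my own illustration, not code from the paper) of Averaging and Task Arithmetic over checkpoints stored as dictionaries of NumPy arrays. The scaling coefficient `alpha` is an assumed hyperparameter; TIES and Dare-TIES build on the same task-vector idea but additionally trim low-magnitude entries, resolve sign conflicts, and/or randomly drop and rescale entries before merging.

```python
import numpy as np

def average_merge(expert_ckpts):
    """Simple Averaging: element-wise mean of the expert checkpoints."""
    return {name: np.mean([ckpt[name] for ckpt in expert_ckpts], axis=0)
            for name in expert_ckpts[0]}

def task_arithmetic_merge(base_ckpt, expert_ckpts, alpha=0.3):
    """Task Arithmetic: add the scaled sum of task vectors
    (expert - base) back onto the base model's weights."""
    merged = {}
    for name, base_w in base_ckpt.items():
        task_vectors = [ckpt[name] - base_w for ckpt in expert_ckpts]
        merged[name] = base_w + alpha * np.sum(task_vectors, axis=0)
    return merged
```

In the paper, each expert checkpoint is a fully fine-tuned PaLM-2 model, so in practice these dictionaries hold billions of parameters and the merge is applied tensor by tensor.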

Experimental Design for Large-Scale Evaluation of Model Merging

  • Data
    • Data settings from the T0 mixture containing 8 held-in and 4 held-out task categories.
    • The 8 held-in task categories (with a total of 16 datasets) include Multiple-choice QA, Extractive QA, Closed-Book QA, Sentiment Analysis, Topic Classification, Structure-to-text, Summarization, and Paraphrase Identification. The 4 held-out task categories are Sentence Completion, Natural Language Inference, Co-reference Resolution, and Word Sense Disambiguation.


  • Models
    • PaLM-2 with sizes 1B, 8B, 24B, and 64B as the base models. For each of these sizes, the authors also build an instruction-tuned version, PaLM-2-IT.
    • For each of the two base model types (non-IT vs. IT) and four model sizes, they perform full fine-tuning on the 8 held-in task categories, resulting in 64 specialized expert models.
    • The authors create a large merging experiment grid with the two base models (PaLM-2 and PaLM-2-IT), four model sizes (1B, 8B, 24B, 64B), four merging methods (Averaging, Task Arithmetic, Dare-TIES, and TIES), the number of constituent models (2, 4, 6, 8), and three seeds to randomly select the constituent tasks, resulting in a total of 384 merging experiments.
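As a quick sanity check on the arithmetic, the snippet below (variable names are mine) enumerates the same grid: 2 base models × 4 sizes × 4 merging methods × 4 constituent-model counts × 3 seeds = 384 merging experiments.

```python
from itertools import product

base_models = ["PaLM-2", "PaLM-2-IT"]
model_sizes = ["1B", "8B", "24B", "64B"]
merging_methods = ["Averaging", "Task Arithmetic", "Dare-TIES", "TIES"]
num_constituent_models = [2, 4, 6, 8]
seeds = [0, 1, 2]  # each seed randomly picks which held-in experts get merged

experiments = list(product(base_models, model_sizes, merging_methods,
                           num_constituent_models, seeds))
print(len(experiments))  # 384
```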


Evaluation

  • Performance evaluated on both held-in and held-out tasks
  • ∼9000 model evaluations across all the experiments.


Metrics

  • For held-in tasks, the merged model performance is normalized against the corresponding task expert model’s performance.
  • For held-out tasks, normalization is performed relative to the base model’s performance.
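Here is a minimal sketch of how I read the normalization (the helper function and the example scores are made up for illustration): a value above 1.0 means the merged model beats its reference, which is the task expert on held-in tasks and the base model on held-out tasks.

```python
def normalized_performance(merged_score, reference_score):
    """Normalize a merged model's score against a reference model:
    the task expert for held-in tasks, the base model for held-out tasks."""
    return merged_score / reference_score

# Held-in task: compare against the corresponding task expert
print(normalized_performance(merged_score=78.0, reference_score=80.0))  # 0.975
# Held-out task: compare against the base model
print(normalized_performance(merged_score=55.0, reference_score=50.0))  # 1.1
```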


Experimental Results and Insights

  • Instruction-Tuned Models Facilitate Easier Merging
    • Merging experiments done with fully fine-tuned experts from PaLM-2 and PaLM-2-IT
    • Held-in performance is measured over three trials to minimize the impact of selected expert models and their data distributions.
    • Merged models built from PaLM-2-IT experts consistently outperform those built from the PaLM-2 base model across all merging methods. The authors hypothesize that large-scale instruction tuning further disentangles model weights, which facilitates effective model merging and also improves the base model's zero-shot performance.


  • Model Merging Becomes Easier With Bigger Models
    • The authors noticed that merged models outperform their corresponding base models in zero-shot generalization to held-out tasks irrespective of the model size, merging method, or number of constituent models.
    • For weak base models (PaLM-2), increasing model size significantly improves the merged model's performance over the base model. Strong base models (PaLM-2-IT) show a different trend: their zero-shot generalization improves monotonically as more expert models are merged.


  • Bigger Model Sizes Can Merge More Experts
    • For weak base models (PaLM-2) at small sizes (1B-8B), merging more models leads to a significant drop in performance, whereas for strong base models (PaLM-2-IT), the drop is negligible.
    • The above trend doesn't hold for bigger model sizes (64B). Merging more experts for a weak base model (PaLM-2 64B) leads to significant improvements in performance, whereas for strong base models (PaLM-2-IT), it leads to better generalization.



  • Merging Methods Become Similar at Scale
    • At scale, all merging methods exhibit very similar performance for strong base models, suggesting that we can simply use the Averaging strategy and still get optimal performance.


