Training large language models requires substantial computational resources. Model distillation is a promising technique for mitigating this cost by transferring knowledge from a large teacher model to a smaller student model. Scaling distillation to large language models involves several key aspects. First, it requires thorough
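As a concrete illustration of the knowledge transfer described above, the following is a minimal sketch of a standard distillation loss in PyTorch, blending a temperature-softened KL term against the teacher's logits with hard-label cross-entropy. The function name, temperature, and weighting factor are illustrative assumptions, not details taken from this text.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft KL term (teacher -> student) with hard-label cross-entropy.

    temperature and alpha are illustrative hyperparameters, not values
    prescribed by the text.
    """
    # Soften both distributions with the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable
    # to the unsoftened loss.
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss


if __name__ == "__main__":
    # Random tensors stand in for teacher and student model outputs.
    batch, vocab = 4, 100
    student_logits = torch.randn(batch, vocab, requires_grad=True)
    teacher_logits = torch.randn(batch, vocab)
    labels = torch.randint(0, vocab, (batch,))
    loss = distillation_loss(student_logits, teacher_logits, labels)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")
```

In practice, the teacher's logits would come from a frozen forward pass of the larger model over the same batch, and only the student's parameters would be updated.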