singhsidhukuldeep posted an update May 27
Remember stacking in ensemble ML? πŸ€”

What happens if you do the reverse of that but with LLMs? 🀯

Basically, an MoE created by merging multiple existing models (instead of being pre-trained as one, like Mixtral)? 🧠

Frankenstein MoE! (not an official name) πŸ§Ÿβ€β™‚οΈ

That's the new Kraken architecture! πŸ™

It uses a sequence classification model to route inputs to the most suitable language model based on the input's characteristics. 🚦

Yup, multiple full-fledged LLMs are loaded into memory, and then a classification layer decides who gets to generate an output! 🎰
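A minimal sketch of that routing idea, not the actual Kraken code; the model names below are placeholders and the routing labels are assumptions:

```python
# Sketch of classifier-based routing over several full LLMs (Kraken-style idea).
# All model IDs and labels here are hypothetical placeholders.
from transformers import pipeline

# A sequence classification model acts as the router.
router = pipeline("text-classification", model="router-model-placeholder")

# Each "expert" is a full, independently loaded LLM kept in memory.
experts = {
    "code": pipeline("text-generation", model="code-expert-placeholder"),
    "math": pipeline("text-generation", model="math-expert-placeholder"),
    "chat": pipeline("text-generation", model="chat-expert-placeholder"),
}

def generate(prompt: str) -> str:
    # The router's predicted label decides which LLM handles this prompt.
    label = router(prompt)[0]["label"]
    expert = experts.get(label, experts["chat"])  # fall back to a default expert
    return expert(prompt, max_new_tokens=128)[0]["generated_text"]

print(generate("Write a Python function that reverses a string."))
```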

Tell me you have too many GPUs without telling me you have too many GPUs! πŸ–₯️πŸ”₯

Jokes aside, this is extremely fascinating research, but I don't understand why this couldn't just be a single big model with multiple LoRA adapters that are selected on the fly. 🤷‍♂️
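For comparison, a rough sketch of that alternative using the peft library's adapter switching; the base model and adapter paths are hypothetical, and the routing rule is assumed to come from the same kind of classifier Kraken uses:

```python
# Sketch: one base model with multiple LoRA adapters switched at inference time.
# All model and adapter IDs are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "base-model-placeholder"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id)

# Load the first adapter, then attach the others under distinct names.
model = PeftModel.from_pretrained(base, "lora-code-placeholder", adapter_name="code")
model.load_adapter("lora-math-placeholder", adapter_name="math")
model.load_adapter("lora-chat-placeholder", adapter_name="chat")

def generate(prompt: str, adapter: str) -> str:
    # Switching adapters is cheap compared to holding several full LLMs in memory.
    model.set_adapter(adapter)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(generate("Explain the quadratic formula.", adapter="math"))
```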

Model: cognitivecomputations/Kraken
Github: https://github.com/cognitivecomputations/kraken

Maybe @alexsherstinsky can answer the question about the multiple LoRA adapters :)

·

@elsatch Thank you for tagging me in this conversation! I think that while the approach with LoRA adapters would, in my opinion, rely on a different technique, the results could indeed be favorable. Here are some recent papers that point to the high effectiveness of LoRA as a specific PEFT method across a variety of examples and application domains:

Happy to discuss! Thanks again!