Post 1052
Remember stacking in ensemble ML? π€
What happens if you do the reverse of that but with LLMs? π€―
Basically, an MoE created by merging multiple existing models (instead of being pre-trained as one, like Mixtral)? 🧐
Frankenstein MoE! (not an official name) π§ββοΈ
That's the new Kraken architecture! π
It uses a sequence classification model to route inputs to the most suitable language model based on the input's characteristics. π¦
Yup, multiple full-fledged LLMs are loaded into memory, and then a classification layer decides which one gets to generate the output! 💰
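For the curious, here's a minimal sketch of that routing flow (illustrative only — every checkpoint name below is a placeholder I made up, not an actual Kraken component):

```python
# Kraken-style routing, sketched with Hugging Face transformers.
# NOTE: all model names below are hypothetical placeholders.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

# Router: a small sequence classifier whose labels map to experts.
ROUTER = "my-org/router-classifier"  # hypothetical checkpoint
router_tok = AutoTokenizer.from_pretrained(ROUTER)
router = AutoModelForSequenceClassification.from_pretrained(ROUTER)

# Experts: several full LLMs, all resident in memory at once.
EXPERTS = {
    0: "my-org/code-llm",  # hypothetical code-specialized model
    1: "my-org/chat-llm",  # hypothetical general chat model
    2: "my-org/math-llm",  # hypothetical math-specialized model
}
experts = {
    i: (AutoTokenizer.from_pretrained(n), AutoModelForCausalLM.from_pretrained(n))
    for i, n in EXPERTS.items()
}

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    # 1) Classify the prompt to pick the most suitable expert.
    enc = router_tok(prompt, return_tensors="pt")
    with torch.no_grad():
        label = int(router(**enc).logits.argmax(dim=-1))
    # 2) Only the chosen expert generates the output.
    tok, model = experts[label]
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=max_new_tokens)
    return tok.decode(out[0], skip_special_tokens=True)
```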
Tell me you have too many GPUs without telling me you have too many GPUs! π₯οΈπ₯
Jokes aside, this is extremely fascinating research, but I don't understand why it couldn't just be one big model with multiple LoRA adapters selected on the fly? 🤷‍♀️ (rough sketch of that idea below)
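For comparison, here's roughly what that LoRA-adapter version could look like with the PEFT library (again, the base model and adapter paths are hypothetical placeholders, not anything from the Kraken repo):

```python
# One base model + multiple LoRA adapters, switched per request.
# NOTE: base model and adapter paths are hypothetical placeholders.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "my-org/base-llm"  # hypothetical base checkpoint
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)

# Attach several adapters under distinct names.
model = PeftModel.from_pretrained(base, "my-org/lora-code", adapter_name="code")
model.load_adapter("my-org/lora-chat", adapter_name="chat")
model.load_adapter("my-org/lora-math", adapter_name="math")

def generate_with(adapter: str, prompt: str) -> str:
    # Activate whichever adapter a router picked, then generate.
    model.set_adapter(adapter)
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=128)
    return tok.decode(out[0], skip_special_tokens=True)
```

Same routing idea, but only one set of base weights in memory.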
Model: cognitivecomputations/Kraken
Github: https://github.com/cognitivecomputations/kraken