Some interesting findings in this paper:
- They consider o1 a Large Reasoning Model (LRM) with a different architecture from SOTA LLMs.
- Creative justifications: "It is almost as if o1 has gone from hallucinating to gaslighting!" So true; I also noticed it can "hallucinate" its chain of thought lol.
- Accuracy/cost tradeoffs: o1 provides high accuracy, but at significant computational and monetary cost due to hidden "reasoning tokens."

Paper: https://www.arxiv.org/abs/2409.13373
nanoGPT with Sigmoid Self-Attention. I couldn't resist, had to give it a try :)
Some observations (training on an M2): compared to softmax, SSA was ~5-10% faster in training with similar final loss, slightly less coherent text generation, marginally higher perplexity, and lower memory usage.
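For anyone curious what the swap looks like, here's a minimal sketch assuming the change lands in nanoGPT's CausalSelfAttention; the -log(n) bias follows the sigmoid-attention paper (arXiv:2409.04431), and the function name and bias_n parameter are mine:

```python
import math
import torch

def sigmoid_attention(q, k, v, bias_n=None):
    """Causal sigmoid attention: softmax(QK^T/sqrt(d)) -> sigmoid(QK^T/sqrt(d) + b).

    q, k, v: (B, nh, T, hs) tensors, shaped as in nanoGPT's CausalSelfAttention.
    b = -log(n) keeps the output scale close to softmax's at init
    (per arXiv:2409.04431); bias_n defaults to the sequence length T.
    """
    T, hs = q.size(-2), q.size(-1)
    b = -math.log(bias_n if bias_n is not None else T)
    att = (q @ k.transpose(-2, -1)) / math.sqrt(hs)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    att = att.masked_fill(~mask, float('-inf'))  # sigmoid(-inf) = 0, so causal masking still works
    att = torch.sigmoid(att + b)                 # elementwise, no row normalization like softmax
    return att @ v
```

In nanoGPT this would replace the F.softmax(att, dim=-1) path. Since sigmoid doesn't normalize across keys, each attention weight is independent of the others; whether that explains the slight coherence dip above, I can't say.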