Yi Cui

onekq

https://onekq.ai

AI & ML interests

Benchmark, Code Generation,

Articles

Does Daily Software Engineering Work Need Reasoning Models?

1 day ago

• 4

All LLMs Write Great Code, But Some Make (A Lot) Fewer Mistakes

13 days ago

• 3

Organizations

onekq's activity

posted an update about 14 hours ago

Post

529

Here is my latest study on OpenAI🍓o1🍓.
A Case Study of Web App Coding with OpenAI Reasoning Models (2409.13773)

I wrote an easy-to-read blog post to explain the findings.
https://huggingface.co./blog/onekq/daily-software-engineering-work-reasoning-models

INSTRUCTION FOLLOWING is the key.

100% instruction following + Reasoning = new SOTA

But if the model misses or misunderstands one instruction, it can perform far worse than non-reasoning models.

posted an update 8 days ago

Post

347

Announcing 🎉 WebApp1K-Duo 🎉
onekq-ai/WebApp1K-Duo-React

This is to keep up the challenge after OpenAI o1 models saturated the WebApp1K benchmark. The new benchmark brings SOTA to 67%. Let the hill climbing commence!
onekq-ai/WebApp1K-models-leaderboard

PS: I will publish more findings soon.

replied to zhabotorabi's post 8 days ago

The Mistral API? The model name is probably different. I wanted mistral-large-2 but had to use the name mistral-large-latest. The team will help you via chat.

posted an update 10 days ago

Post

528

🐋 DeepSeek 🐋2.5 is hands-down the best open-source model, leaving its peers way behind. It even beats GPT-4o mini.

onekq-ai/WebApp1K-models-leaderboard

The inference of the official API is painfully slow though. I heard the team is short on GPUs (well, who isn't).

replied to their post 12 days ago

pass@1 for 🍓o1-mini🍓: 0.94!!

💸💸💸💸

#gpt #o1 #inference #RL #selfplay #WebApp1K

posted an update 12 days ago

Post

1104

If your plan keeps changing, it's a sign that you are living in the moment.

I just got the pass@1 result of GPT 🍓o1-preview🍓 : 0.95!!!

This means my benchmark is cast into oblivion, so I need to up the ante. I am all ears for suggestions. onekq-ai/WebApp1K-models-leaderboard
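
For readers new to the metric: pass@1 is the share of benchmark problems solved by a single generated sample. The posts do not say how these scores were computed, so the sketch below only assumes the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c pass, averaged over problems. The function names and the toy data are hypothetical and are not taken from the WebApp1K code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k draws contain a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical toy run: 4 problems, 1 sample each, 3 of them pass.
toy_results = [(1, 1), (1, 0), (1, 1), (1, 1)]
score = benchmark_pass_at_k(toy_results, k=1)
```

With one sample per problem and k=1, this reduces to the plain fraction of problems whose single sample passes.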

1 reply