PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation Paper • 2409.06820 • Published 15 days ago • 56
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? Paper • 2403.14624 • Published Mar 21 • 50
Premise Order Matters in Reasoning with Large Language Models Paper • 2402.08939 • Published Feb 14 • 24