Paper Title
Are Deep Neural Networks SMARTer than Second Graders?
Paper Authors
Paper Abstract
Recent times have witnessed an increasing number of applications of deep neural networks to tasks that require superior cognitive abilities, e.g., playing Go, generating art, ChatGPT, etc. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6--8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and solving it requires a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale the dataset for training deep neural networks, we programmatically generate entirely new instances of each puzzle while retaining its solution algorithm. To benchmark performance on SMART-101, we propose a vision-and-language meta-learning model using varied state-of-the-art backbones. Our experiments reveal that while powerful deep models offer reasonable performance on puzzles in a supervised setting, they perform no better than random chance when analyzed for generalization. We also evaluate the recent ChatGPT and other large language models on a subset of SMART-101 and find that while these models show convincing reasoning abilities, their answers are often incorrect.
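To make the instance-generation idea concrete, here is a minimal, hypothetical Python sketch (not the authors' released code): a puzzle is modeled as a fixed solution algorithm paired with a parameterized generator, so arbitrarily many new instances can be sampled while the solver never changes. All names below (PuzzleInstance, solve, generate_instance) are illustrative assumptions, and the text-only heads-and-legs puzzle merely stands in for SMART-101's picture-based puzzles.

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of SMART-101-style instance generation: the solution
# algorithm is fixed per puzzle, while the surface parameters are resampled.

@dataclass
class PuzzleInstance:
    question: str        # natural-language question (stand-in for picture + text)
    options: list[int]   # multiple-choice candidates
    answer: int          # index of the correct option

def solve(total_legs: int, total_heads: int) -> int:
    """Fixed solution algorithm: classic heads-and-legs counting puzzle."""
    # With 2 legs per head as a baseline, each 4-legged animal adds 2 extra legs.
    return (total_legs - 2 * total_heads) // 2

def generate_instance(rng: random.Random) -> PuzzleInstance:
    """Sample fresh parameters; the solver above is reused unchanged."""
    cows = rng.randint(1, 10)
    hens = rng.randint(1, 10)
    heads, legs = cows + hens, 4 * cows + 2 * hens
    correct = solve(legs, heads)
    assert correct == cows  # the fixed algorithm certifies the ground truth
    # Distractors drawn near the answer, as in typical multiple-choice puzzles.
    distractors: set[int] = set()
    while len(distractors) < 4:
        d = correct + rng.randint(-3, 5)
        if d != correct and d > 0:
            distractors.add(d)
    options = sorted(distractors | {correct})
    return PuzzleInstance(
        question=(f"A farm has {heads} animals with {legs} legs in total. "
                  f"How many cows are there?"),
        options=options,
        answer=options.index(correct),
    )

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        inst = generate_instance(rng)
        print(inst.question, inst.options, "->", inst.options[inst.answer])
```

Separating the generator from the solver is what lets such a dataset scale: the solver certifies the ground-truth answer for every sampled instance at no additional annotation cost.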