😶‍🌫️
Psych
  • Preface
  • [4/9/2025] A One-Stop Calculator and Guide for 95 Effect-Size Variants
  • [4/9/2025] The People Make the Place
  • [4/9/2025] Personality predicts things
  • [3/31/2025] Response surface analysis with multilevel data
  • [3/11/2025] A Complete Guide to Natural Language Processing
  • [3/4/2025] Personality - Self and Identity
  • [3/1/2025] Updating Vocational Interests Information
  • [2/25/2025] Abilities & Skills
  • [2/22/2025] APA table format
  • [2/19/2025] LLMs that replace human participants can harmfully misportray and flatten identity groups
  • [2/18/2025] Research Methods Knowledge Base
  • [2/17/2025] Personality - Motives/Interests
  • [2/11/2025] Trait structure
  • [2/10/2025] Higher-order construct
  • [2/4/2025] RL for CAT
  • [2/4/2025] DoWhy | An end-to-end library for causal inference
  • [2/4/2025] DAGitty — draw and analyze causal diagrams
  • [2/2/2025] Personality States
  • [2/2/2025] Psychometric Properties of Automated Video Interview Competency Assessments
  • [2/2/2025] How to diagnose abhorrent science
  • [1/28/2025] LLM and personality/interest items
  • [1/28/2025] Personality - Dispositions
  • [1/28/2025] Causal inference in statistics
  • [1/27/2025] Personality differences between birth order categories and across sibship sizes
  • [1/27/2025] Nomological network meta-analysis
  • [1/25/2025] Classic Papers on Scale Development/Validation
  • [1/17/2025] Personality Reading
  • [1/15/2025] Artificial Intelligence: Redefining the Future of Psychology
  • [1/13/2025] R for Psychometrics
  • [12/24/2024] Comparison of interest congruence indices
  • [12/24/2024] Most recent article on interest fit measures
  • [12/24/2024] Grammatical Redundancy in Scales: Using the “ConGRe” Process to Create Better Measures
  • [12/24/2024] Confirmatory Factor Analysis with Word Embeddings
  • [12/24/2024] Can ChatGPT Develop a Psychometrically Sound Situational Judgment Test?
  • [12/24/2024] Using NLP to replace human content coders
  • [11/21/2024] AI Incident Database
  • [11/20/2024] Large Language Model-Enhanced Reinforcement Learning
  • [11/05/2024] Self-directed search
  • [11/04/2024] Interview coding and scoring
  • [11/04/2024] What if there were no personality factors?
  • [11/04/2024] BanditCAT and AutoIRT
  • [10/29/2024] LLM for Literature/Survey
  • [10/27/2024] Holland's Theory of Vocational Choice and Adjustment
  • [10/27/2024] Item Response Warehouse
  • [10/26/2024] EstCRM - Samejima's Continuous IRT Model
  • [10/23/2024] Idiographic Personality Gaussian Process for Psychological Assessment
  • [10/23/2024] The experience sampling method (ESM)
  • [10/21/2024] Ecological Momentary Assessment (EMA)
  • [10/20/2024] Meta-Analytic Structural Equation Modeling
  • [10/20/2024] Structure of vocational interests
  • [10/17/2024] LLMs for psychological assessment
  • [10/16/2024] Can Deep Neural Networks Inform Theory?
  • [10/16/2024] Cognition & Decision Modeling Laboratory
  • [10/14/2024] Time-Invariant Confounders in Cross-Lagged Panel Models
  • [10/13/2024] Polynomial regression
  • [10/13/2024] Bayesian Mixture Modeling
  • [10/10/2024] Response surface analysis (RSA)
  • [10/10/2024] Text-Based Personality Assessment with LLM
  • [10/09/2024] Circular unidimensional scaling: A new look at group differences in interest structure.
  • [10/07/2024] Video Interview
  • [10/07/2024] Relationship between Measurement and ML
  • [10/07/2024] Conscientiousness × Interest Compensation (CONIC) model
  • [10/03/2024] Response modeling methodology
  • [10/02/2024] Conceptual Versus Empirical Distinctions Among Constructs
  • [10/02/2024] Construct Proliferation
  • [09/23/2024] Psychological Measurement Paradigm through Interactive Fiction Games
  • [09/20/2024] A Computational Method to Reveal Psychological Constructs From Text Data
  • [09/18/2024] H is for Human and How (Not) To Evaluate Qualitative Research in HCI
  • [09/17/2024] Automated Speech Recognition Bias in Personnel Selection
  • [09/16/2024] Congruency Effect
  • [09/11/2024] privacy, security, and trust perceptions
  • [09/10/2024] Measurement, Scale, Survey, Questionnaire
  • [09/09/2024] Reporting Systematic Reviews
  • [09/09/2024] Evolutionary Neuroscience
  • [09/09/2024] On Personality Measures and Their Data
  • [09/09/2024] Two Dimensions of Professor-Student Rapport Differentially Predict Student Success
  • [09/05/2024] The SAPA Personality Inventory
  • [09/05/2024] Moderated mediation
  • [09/03/2024] BiGGen Bench
  • [09/02/2024] LMSYS Chatbot Arena
  • [09/02/2024] Introduction to Measurement Theory Chapters 1, 2 (2.1-2.8) and 3.
  • [09/01/2024] HCI measurement
  • [08/30/2024] Randomization Test
  • [08/30/2024] Interview Quantitative Statistical
  • [08/29/2024] Cascading Model
  • [08/29/2024] Introduction: The White House (IS_202)
  • [08/29/2024] Circular unidimensional scaling
  • [08/28/2024] Sex and Gender Differences (Neur_542_Week2)
  • [08/26/2024] Workplace Assessment and Social Perceptions (WASP) Lab
  • [08/26/2024] Computational Organizational Research Lab
  • [08/26/2024] Reading List (Recommended by Bo)
  • [08/20/2024] Illinois NeuroBehavioral Assessment Laboratory (INBAL)
  • [08/14/2024] Quantitative text analysis
  • [08/14/2024] Measuring complex psychological and sociological constructs in large-scale text
  • [08/14/2024] LLM for Social Science Research
  • [08/14/2024] GPT for multilingual psychological text analysis
  • [08/12/2024] Questionable Measurement Practices and How to Avoid Them
  • [08/12/2024] NLP for Interest (from Dan Putka)
  • [08/12/2024] O*NET Interest Profiler (Long and Short Scale)
  • [08/12/2024] O*NET Interests Data
  • [08/12/2024] The O*NET-SOC Taxonomy
  • [08/12/2024] ML Ratings for O*NET
  • [08/09/2024] Limited ability of LLMs to simulate human psychological behaviours
  • [08/08/2024] A large-scale, gamified online assessment
  • [08/08/2024] Text-Based Trait and Cue Judgments
  • [08/07/2024] Chuan-Peng Lab
  • [08/07/2024] Modern psychometrics: The science of psychological assessment
  • [08/07/2024] Interactive Survey
  • [08/06/2024] Experimental History
  • [08/06/2024] O*NET Research reports
  • [07/30/2024] Creating a psychological assessment tool based on interactive storytelling
  • [07/24/2024] My Life with a Theory
  • [07/24/2024] NLP for Interest Job Ratings
  • [07/17/2024] Making vocational choices
  • [07/17/2024] Taxonomy of Psychological Situation
  • [07/12/2024] PathChat 2
  • [07/11/2024] Using games to understand the mind
  • [07/10/2024] Gamified Assessments
  • [07/09/2024] Poldracklab Software and Data
  • [07/09/2024] Consensus-based Recommendations for Machine-learning-based Science
  • [07/08/2024] Using AI to assess personal qualities
  • [07/08/2024] AI Psychometrics And Psychometrics Benchmark
  • [07/02/2024] Prompt Engineering Guide
  • [06/28/2024] Observational Methods and Qualitative Data Analysis 5-6
  • [06/28/2024] Observational Methods and Qualitative Data Analysis 3-4
  • [06/28/2024] Interviewing Methods 5-6
  • [06/28/2024] Interviewing Methods 3-4
  • [06/28/2024] What is Qualitative Research 3
  • [06/27/2024] APA Style
  • [06/27/2024] Statistics in Psychological Research 6
  • [06/27/2024] Statistics in Psychological Research 5
  • [06/23/2024] Bayesian Belief Network
  • [06/18/2024] Fair Comparisons in Heterogenous Systems Evaluation
  • [06/18/2024] What should we evaluate when we use technology in education?
  • [06/16/2024] Circumplex Model
  • [06/12/2024] Ways of Knowing in HCI
  • [06/09/2024] Statistics in Psychological Research 1-4
  • [06/08/2024] Mathematics for Machine Learning
  • [06/08/2024] Vocational Interests SETPOINT Dimensions
  • [06/07/2024] How's My PI Study
  • [06/06/2024] Best Practices in Supervised Machine Learning
  • [06/06/2024] SIOP
  • [06/06/2024] Measurement, Design, and Analysis: An Integrated Approach (Chu Recommended)
  • [06/06/2024] Classical Test Theory
  • [06/06/2024] Introduction to Measurement Theory (Bo Recommended)
  • [06/03/2024] EDSL: AI-Powered Research
  • [06/03/2024] Perceived Empathy of Technology Scale (PETS)
  • [06/02/2024] HCI area - Quantitative and Qualitative Modeling and Evaluation
  • [05/26/2024] Psychometrics with R
  • [05/26/2024] Programming Grammar Design
  • [05/25/2024] Psychometric Network Analysis
  • [05/23/2024] Item Response Theory
  • [05/22/2024] Nature Human Behaviour (Jan - 20 May, 2024)
  • [05/22/2024] Nature Human Behaviour - Navigating the AI Frontier
  • [05/22/2024] Computer Adaptive Testing
  • [05/22/2024] Personality Scale (Jim Shared)
  • [05/22/2024] Reliability
  • [05/19/2024] Chatbot (Jim Shared)
  • [05/17/2024] GOMS and Keystroke-Level Model
  • [05/17/2024] The Psychology of Human-Computer Interaction
  • [05/14/2024] Computational Narrative (Mark's Group)
  • [05/14/2024] Validity Coding
  • [05/14/2024] LLM as an Evaluator
  • [05/14/2024] Social Skill Training via LLMs (Diyi's Group)
  • [05/14/2024] AI Persona
  • [05/09/2024] Psychological Methods Journal Sample Articles
  • [05/08/2024] Meta-Analysis
  • [05/07/2024] Mturk
  • [05/06/2024] O*NET Reports and Documents
  • [05/04/2024] NLP and Chatbot on Personality Assessment (Tianjun)
  • [05/02/2024] Reads on Construct Validation
  • [04/25/2024] Reads on Validity
  • [04/18/2024] AI for Assessment
  • [04/17/2024] Interest Assessment
  • [04/16/2024] Personality Long Reading List (Jim)
    • Personality Psychology Overview
      • Why Study Personality Assessment
    • Dimensions and Types
    • Reliability
    • Traits: Two Views
    • Validity - Classical Articles and Reflections
    • Validity - Recent Proposals
    • Multimethod Perspective and Social Desirability
    • Paradigm of Personality Assessment: Multivariate
    • Heritability of personality traits
    • Classical Test-Construction
    • IRT
    • Social desirability in scale construction
    • Traits and culture
    • Paradigms of personality assessment: Empirical
    • Comparison of personality test construction strategies
    • Clinical versus Actuarial (AI) Judgement and Diagnostics
    • Decisions: Importance of base rates
    • Paradigms of Personality Assessment: Psychodynamic
    • Paradigms of Assessment: Interpersonal
    • Paradigms of Personality Assessment: Personological
    • Retrospective reports
    • Research Paradigms
    • Personality Continuity and Change

[05/22/2024] Nature Human Behaviour (Jan - 20 May, 2024)


Testing theory of mind in large language models and humans

Abstract

At the core of what defines us as humans is the concept of theory of mind: the ability to track other people’s mental states. The recent development of large language models (LLMs) such as ChatGPT has led to intense debate about the possibility that these models exhibit behaviour that is indistinguishable from human behaviour in theory of mind tasks. Here we compare human and LLM performance on a comprehensive battery of measurements that aim to measure different theory of mind abilities, from understanding false beliefs to interpreting indirect requests and recognizing irony and faux pas. We tested two families of LLMs (GPT and LLaMA2) repeatedly against these measures and compared their performance with those from a sample of 1,907 human participants. Across the battery of theory of mind tests, we found that GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas. Faux pas, however, was the only test where LLaMA2 outperformed humans. Follow-up manipulations of the belief likelihood revealed that the superiority of LLaMA2 was illusory, possibly reflecting a bias towards attributing ignorance. By contrast, the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference. These findings not only demonstrate that LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans but also highlight the importance of systematic testing to ensure a non-superficial comparison between human and artificial intelligences.
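The repeated-testing protocol described here is straightforward to sketch. Below is a minimal illustration, not the authors' pipeline: `query_model` is a hypothetical stand-in for a real LLM API, the two items and the keyword scorer are invented placeholders, and the 15 repetitions per item follow the per-item observation count given in the figure caption further down.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for a real LLM API call; replace with an actual
# client (e.g., a GPT or LLaMA2 wrapper).
def query_model(prompt: str) -> str:
    return random.choice(["She didn't know.", "She knew.", "I'm not sure."])

# Placeholder scorer (assumption): 1 if the response displays the target
# mentalistic inference, 0 otherwise. The study used test-specific coding.
def score_response(text: str) -> int:
    return int("didn't know" in text.lower())

# Toy items standing in for the battery (false belief, faux pas, irony, ...).
items = {
    "false_belief_1": "Sally puts her ball in the basket and leaves...",
    "faux_pas_1": "Helen's husband is throwing a surprise party...",
}

N_REPS = 15  # each model was queried repeatedly per item
scores = defaultdict(list)
for item_id, prompt in items.items():
    for _ in range(N_REPS):
        scores[item_id].append(score_response(query_model(prompt)))

for item_id, s in scores.items():
    print(f"{item_id}: mean score {sum(s) / len(s):.2f} over {len(s)} runs")
```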

Trusting young children to help causes them to cheat less

Abstract

Trust and honesty are essential for human interactions. Philosophers since antiquity have long posited that they are causally linked. Evidence shows that honesty elicits trust from others, but little is known about the reverse: does trust lead to honesty? Here we experimentally investigated whether trusting young children to help can cause them to become more honest (total N = 328 across five studies; 168 boys; mean age, 5.94 years; s.d., 0.28 years). We observed kindergarten children’s cheating behaviour after they had been entrusted by an adult to help her with a task. Children who were trusted cheated less than children who were not trusted. Our study provides clear evidence for the causal effect of trust on honesty and contributes to understanding how social factors influence morality. This finding also points to the potential of using adult trust as an effective method to promote honesty in children.

Dynamic computational phenotyping of human cognition

Abstract

Computational phenotyping has emerged as a powerful tool for characterizing individual variability across a variety of cognitive domains. An individual’s computational phenotype is defined as a set of mechanistically interpretable parameters obtained from fitting computational models to behavioural data. However, the interpretation of these parameters hinges critically on their psychometric properties, which are rarely studied. To identify the sources governing the temporal variability of the computational phenotype, we carried out a 12-week longitudinal study using a battery of seven tasks that measure aspects of human learning, memory, perception and decision making. To examine the influence of state effects, each week, participants provided reports tracking their mood, habits and daily activities. We developed a dynamic computational phenotyping framework, which allowed us to tease apart the time-varying effects of practice and internal states such as affective valence and arousal. Our results show that many phenotype dimensions covary with practice and affective factors, indicating that what appears to be unreliability may reflect previously unmeasured structure. These results support a fundamentally dynamic understanding of cognitive variability within an individual.
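The key move, separating practice and state effects from noise in a fitted model parameter, can be sketched with an ordinary regression. This is a minimal illustration on invented data, not the authors' dynamic framework: the log-practice trend, the single "arousal" covariate, and all effect sizes are assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_weeks = 12  # one phenotype dimension tracked over the 12-week protocol

# Simulate a weekly fitted parameter (e.g., a learning rate): a practice
# trend plus a state effect (self-reported arousal) plus noise.
week = np.arange(1, n_weeks + 1)
arousal = rng.normal(size=n_weeks)
param = 0.4 + 0.02 * np.log(week) + 0.05 * arousal + rng.normal(0, 0.01, n_weeks)

# Regress the weekly parameter on log-practice and the state covariate;
# systematic variance captured here is structure, not unreliability.
X = sm.add_constant(np.column_stack([np.log(week), arousal]))
fit = sm.OLS(param, X).fit()
print(fit.params)    # [intercept, practice slope, state slope]
print(fit.rsquared)  # share of week-to-week variance explained
```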

A collaborative realist review of remote measurement technologies for depression in young people

Abstract

Digital mental health is becoming increasingly common. This includes use of smartphones and wearables to collect data in real time during day-to-day life (remote measurement technologies, RMT). Such data could capture changes relevant to depression for use in objective screening, symptom management and relapse prevention. This approach may be particularly accessible to young people of today as the smartphone generation. However, there is limited research on how such a complex intervention would work in the real world. We conducted a collaborative realist review of RMT for depression in young people. Here we describe how, why, for whom and in what contexts RMT appear to work or not work for depression in young people and make recommendations for future research and practice. Ethical, data protection and methodological issues need to be resolved and standardized; without this, RMT may be currently best used for self-monitoring and feedback to the healthcare professional where possible, to increase emotional self-awareness, enhance the therapeutic relationship and monitor the effectiveness of other interventions.

More than nature and nurture, indirect genetic effects on children’s academic achievement are consequences of dynastic social processes

Abstract

Families transmit genes and environments across generations. When parents’ genetics affect their children’s environments, these two modes of inheritance can produce an ‘indirect genetic effect’. Such indirect genetic effects may account for up to half of the estimated genetic variance in educational attainment. Here we tested if indirect genetic effects reflect within-nuclear-family transmission (‘genetic nurture’) or instead a multi-generational process of social stratification (‘dynastic effects’). We analysed indirect genetic effects on children’s academic achievement in their fifth to ninth years of schooling in N = 37,117 parent–offspring trios in the Norwegian Mother, Father, and Child Cohort Study (MoBa). We used pairs of genetically related families (parents were siblings, children were cousins; N = 10,913) to distinguish within-nuclear-family genetic-nurture effects from dynastic effects shared by cousins in different nuclear families. We found that indirect genetic effects on children’s academic achievement cannot be explained by processes that operate exclusively within the nuclear family.
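The basic genetic-nurture regression implied here can be sketched on simulated trios. This is an illustrative toy, not the MoBa analysis: the effect sizes are invented, and the crucial cousin-pair comparison that separates dynastic effects from within-nuclear-family genetic nurture is omitted.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 10_000  # simulated parent-offspring trios

# Parent polygenic scores; the child inherits half of each parent's score
# plus segregation noise (scores standardized for simplicity).
pgs_mom = rng.normal(size=n)
pgs_dad = rng.normal(size=n)
midparent = 0.5 * (pgs_mom + pgs_dad)
pgs_child = midparent + rng.normal(0, np.sqrt(0.5), size=n)

# Achievement = direct genetic effect + indirect effect of parental
# genotype acting through the environment (illustrative sizes).
DIRECT, INDIRECT = 0.30, 0.15
achievement = DIRECT * pgs_child + INDIRECT * midparent + rng.normal(size=n)

# Conditioning on the child's own PGS, the midparent coefficient picks up
# the indirect path; telling genetic nurture apart from dynastic effects
# additionally requires related families (cousins), omitted here.
X = sm.add_constant(np.column_stack([pgs_child, midparent]))
print(sm.OLS(achievement, X).fit().params)  # approx [0, 0.30, 0.15]
```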

Modelling dataset bias in machine-learned theories of economic decision-making

Abstract

Normative and descriptive models have long vied to explain and predict human risky choices, such as those between goods or gambles. A recent study reported the discovery of a new, more accurate model of human decision-making by training neural networks on a new online large-scale dataset, choices13k. Here we systematically analyse the relationships between several models and datasets using machine-learning methods and find evidence for dataset bias. Because participants’ choices in stochastically dominated gambles were consistently skewed towards equipreference in the choices13k dataset, we hypothesized that this reflected increased decision noise. Indeed, a probabilistic generative model adding structured decision noise to a neural network trained on data from a laboratory study transferred best, that is, outperformed all models apart from those trained on choices13k. We conclude that a careful combination of theory and data analysis is still required to understand the complex interactions of machine-learning models and data of human risky choices.
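The decision-noise hypothesis has a compact form: observed choice probabilities are the model's predictions compressed toward equipreference (0.5). Below is a minimal sketch with invented data and a single lapse-style noise weight fitted by maximum likelihood; the paper's "structured decision noise" is richer than this one-parameter version.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)

# "Clean" model predictions for 200 gamble pairs: P(choose gamble B).
p_model = rng.uniform(0.01, 0.99, size=200)

# Simulate an online-style dataset compressed toward equipreference,
# mimicking the skew reported for choices13k (noise weight is invented).
TRUE_NOISE = 0.3
n_subj = 100
k = rng.binomial(n_subj, (1 - TRUE_NOISE) * p_model + TRUE_NOISE * 0.5)

def neg_log_lik(noise: float) -> float:
    p = (1 - noise) * p_model + noise * 0.5
    return -np.sum(k * np.log(p) + (n_subj - k) * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(0.0, 0.99), method="bounded")
print(f"recovered decision-noise weight: {res.x:.2f}")  # close to 0.30
```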

Greater variability in judgements of the value of novel ideas

Abstract

Understanding the factors that hinder support for creative ideas is important because creative ideas fuel innovation—a goal prioritized across the arts, sciences and business. Here we document one obstacle faced by creative ideas: as ideas become more novel—that is, they depart more from existing norms and standards—disagreement grows about their potential value. Specifically, across multiple contexts, using both experimental methods (four studies, total n = 1,801) and analyses of archival data, we find that there is more variability in judgements of the value of more novel (versus less novel) ideas. We also find that people interpret greater variability in others’ judgements about an idea’s value as a signal of risk, reducing their willingness to invest in the idea. Our findings show that consensus about an idea’s worth diminishes the newer it is, highlighting one reason creative ideas may fail to gain traction in the social world.
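The core statistic, dispersion of value judgements increasing with novelty, is simple to compute. A toy sketch with simulated judges and ideas (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n_ideas, n_judges = 200, 20

# Simulate ideas whose between-judge rating spread grows with novelty.
novelty = rng.uniform(0, 1, size=n_ideas)
spread = 0.5 + 1.0 * novelty
ratings = rng.normal(5.0, spread[:, None], size=(n_ideas, n_judges))

# Per-idea disagreement: standard deviation across judges.
judge_sd = ratings.std(axis=1, ddof=1)
r = np.corrcoef(novelty, judge_sd)[0, 1]
print(f"novelty vs. judgement dispersion: r = {r:.2f}")  # clearly positive
```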

The motivating effect of monetary over psychological incentives is stronger in WEIRD cultures

Abstract

Motivating effortful behaviour is a problem employers, governments and nonprofits face globally. However, most studies on motivation are done in Western, educated, industrialized, rich and democratic (WEIRD) cultures. We compared how hard people in six countries worked in response to monetary incentives versus psychological motivators, such as competing with or helping others. The advantage money had over psychological interventions was larger in the United States and the United Kingdom than in China, India, Mexico and South Africa (N = 8,133). In our last study, we randomly assigned cultural frames through language in bilingual Facebook users in India (N = 2,065). Money increased effort over a psychological treatment by 27% in Hindi and 52% in English. These findings contradict the standard economic intuition that people from poorer countries should be more driven by money. Instead, they suggest that the market mentality of exchanging time and effort for material benefits is most prominent in WEIRD cultures.

Accelerated demand for interpersonal skills in the Australian post-pandemic labour market

Abstract

The COVID-19 pandemic has led to a widespread shift to remote work, reducing the level of face-to-face interaction between workers and changing their modes and patterns of communication. This study tests whether this transformation in production processes has been associated with disruptions in the longstanding labour market trend of increasing demand for interpersonal skills. To address this question, we integrate a skills taxonomy with the text of over 12 million Australian job postings to measure skills demand trends at the aggregate and occupational levels. We find that since the start of the pandemic, there has been an acceleration in the aggregate demand for interpersonal skills. We also find a strong positive association between an occupation’s propensity for remote work and the acceleration in interpersonal skills demand for the occupation. Our findings suggest that interpersonal skills continue to grow in importance for employment in the post-pandemic, remote work friendly labour market.
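The measurement step, matching a skills taxonomy against posting text and tracking the share of postings demanding interpersonal skills over time, can be sketched with plain keyword matching. The term list and postings below are invented stand-ins for the study's taxonomy and its 12 million postings.

```python
import re
from collections import Counter

# Illustrative interpersonal-skill terms (the study used a full taxonomy).
TERMS = ["communication", "teamwork", "collaboration", "negotiation",
         "empathy", "stakeholder management"]
pattern = re.compile("|".join(re.escape(t) for t in TERMS), re.IGNORECASE)

postings = [  # (month, posting text) - invented examples
    ("2019-06", "Seeking an engineer with strong communication skills."),
    ("2021-03", "Remote role requiring collaboration and negotiation."),
    ("2021-03", "Data entry clerk; attention to detail essential."),
]

totals, hits = Counter(), Counter()
for month, text in postings:
    totals[month] += 1
    hits[month] += bool(pattern.search(text))

for month in sorted(totals):  # share of postings mentioning the skills
    print(month, f"{hits[month] / totals[month]:.0%}")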

Driving and suppressing the human language network using large language models

Abstract

Transformer models such as GPT generate human-like language and are predictive of human brain responses to language. Here, using functional-MRI-measured brain responses to 1,000 diverse sentences, we first show that a GPT-based encoding model can predict the magnitude of the brain response associated with each sentence. We then use the model to identify new sentences that are predicted to drive or suppress responses in the human language network. We show that these model-selected novel sentences indeed strongly drive and suppress the activity of human language areas in new individuals. A systematic analysis of the model-selected sentences reveals that surprisal and well-formedness of linguistic input are key determinants of response strength in the language network. These results establish the ability of neural network models to not only mimic human language but also non-invasively control neural activity in higher-level cortical areas, such as the language network.
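At its core, the encoding model described here is a ridge regression from sentence embeddings to averaged fMRI responses, then used to rank a large candidate pool. A minimal sketch with random stand-ins for the GPT2-XL activations and BOLD data; the dimensions and regularization grid are assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Stand-ins: embeddings of 1,000 baseline sentences and the averaged
# left-hemisphere language-network response to each (simulated here).
emb = rng.normal(size=(1000, 256))
bold = emb @ rng.normal(size=256) + rng.normal(0, 5.0, size=1000)

# Fit the encoding model; evaluate on held-out baseline sentences.
X_tr, X_te, y_tr, y_te = train_test_split(emb, bold, test_size=0.2,
                                          random_state=0)
model = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X_tr, y_tr)
print(f"held-out R^2 = {model.score(X_te, y_te):.2f}")

# Rank a large candidate pool by predicted response: top = "drive"
# sentences, bottom = "suppress" sentences.
pool = rng.normal(size=(100_000, 256))
order = np.argsort(model.predict(pool))
suppress_idx, drive_idx = order[:10], order[-10:]
print(drive_idx, suppress_idx)
```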

Figure captions (from the articles above; the figures themselves were not preserved in this export)

From "Testing theory of mind in large language models and humans": a, Scores of the two GPT models on the original framing of the faux pas question ("Did they know…?") and the likelihood framing ("Is it more likely that they knew or didn't know…?"). Dots show the average score across trials (n = 15 LLM observations) on particular items, allowing comparison between the original faux pas test and the new faux pas likelihood test. Half-eye plots show distributions, medians (black points), 66% quantiles (thick grey lines) and 99% quantiles (thin grey lines) of the response scores on different items (n = 15 different stories involving faux pas). b, Response scores for three variants of the faux pas test: faux pas (pink), neutral (grey) and knowledge-implied (teal). Responses were coded categorically as "didn't know", "unsure" or "knew" and assigned numerical codes of −1, 0 and +1. Filled balloons are shown for each model and variant; the size of each balloon indicates the count frequency, which was the categorical data used to compute chi-square tests. Bars show the direction bias score, computed as the average of the coded responses. On the right of the plot, P values (one-sided) of Holm-corrected chi-square tests are shown, comparing the distribution of response-type frequencies in the faux pas and knowledge-implied variants against neutral.

From "Trusting young children to help causes them to cheat less": a, The test sheet of the math test that children were asked to complete. b, The answer key.

From "Dynamic computational phenotyping of human cognition": a, Participants performed 7 cognitive tasks and a survey on a weekly basis for 12 consecutive weeks. The left column shows an example screen from each task. The middle column shows the computational model used to fit participants' behavioural data from each task. The right column shows the free parameters fit to individual participants. The collection of these parameters constitutes the computational phenotype. b, Three potential sources of temporal variability: random noise, practice and state effects reflecting changes in mood and daily activities.

From "A collaborative realist review of remote measurement technologies for depression in young people" (search flow diagram): Shows the type of search and search terms used, and the number of records found and included from each search.

From the same review (initial programme theory): Blue text boxes list factors potentially influencing the ways in which RMT may work. Pink text boxes list factors potentially influencing who RMT may work for. The purple text box lists factors potentially influencing the contexts in which RMT may work. The orange text box lists potential mechanisms (M) linking these influencing factors (C) to outcomes (O), with those linked to intended outcomes in green and those linked to unintended outcomes in red. RMT, remote measurement technologies; MDD, major depressive disorder; HCPs, healthcare practitioners; WEIRD, Western, educated, industrialised, rich and democratic; LMIC, low-middle-income country.

From the same review (refined programme theory): Blue text boxes list factors found to influence the ways in which RMT may work. The pink text box lists factors found to influence who RMT may work for. The purple text box lists factors found to influence the contexts in which RMT may work. The orange text boxes list mechanisms (M) found to link these influencing factors (C) to intended (green) and unintended (red) outcomes (O). Question marks indicate areas where evidence is lacking and further research is needed. RMT, remote measurement technologies; EMA, ecological momentary assessment; HCPs, healthcare practitioners; LMIC, low-middle-income country.

From "Modelling dataset bias in machine-learned theories of economic decision-making": a, Because all pairs of gambles considered in this study can be parameterized in one common way, the decision problems' features can be used to compute a two-dimensional embedding (uniform manifold approximation and projection) representing the problem space. Each dot corresponds to a decision problem consisting of two gambles, and the colours indicate the dataset of origin. b, Pairs of gambles from this problem space, together with the proportion of choices, constitute the datasets: the CPC15 dataset with human decisions from a laboratory study, the choices13k dataset with human decisions from a large-scale online experiment, and a much larger synthetic dataset (synth15) generated by predictions from the psychological model BEAST. d, We trained six different NNs on the basis of two architectures: NN-Bourgin (NN-B), based on Bourgin et al., and NN-Peterson (NN-P), based on Peterson et al. c, The target of training was the proportion of trials in which gamble B was chosen, averaged over all human participants and 5 trials (P(B)), from either CPC15 or choices13k. However, because of the small size of the CPC15 dataset, we first pre-trained on synth15 and then fine-tuned on CPC15. To test for dataset bias, we also pre-trained some NNs on synth15 and then fine-tuned on choices13k. We can now investigate the relationship between models and datasets by comparing predictions of NNs on decision problems. Because all pairs of gambles reside in the same problem space but the overlap in decision problems across datasets is small, we compute the difference in predictions between any two models on problems sampled from the problem space. e, Subsequently, we investigated the source of the differences in predictions between different combinations of models and datasets. First, we use linear regressions (top), relating individual or sets of features of the gambles to the difference in model predictions. Second, we use SHAP, an XAI method, which returns linear additive feature-importance values for each gamble (bottom).

From "Driving and suppressing the human language network using large language models": a, We developed an encoding model (M) of the LH language network in the human brain with the goal of identifying novel sentences that activate the language network to a maximal or minimal extent (see "Encoding model development" in Methods). Five participants (train participants) read a large sample (n = 1,000) of six-word corpus-extracted sentences, the baseline set (sampled to maximize linguistic diversity; see Supplementary Information), in a rapid, event-related design while their brain activity was recorded using fMRI. BOLD responses from voxels in the LH language network were averaged within each train participant and then across participants to yield an average language-network response to each of the 1,000 baseline-set sentences. We trained a ridge regression model from the representations of the unidirectional-attention Transformer language model GPT2-XL (identified as the most brain-aligned language base model in Schrimpf et al.) to the 1,000 averaged fMRI responses. Given that GPT2-XL can generate a representation for any sentence, the encoding model (M) can predict the LH language-network response for arbitrary sentences. To select the top-performing layer for our encoding model, we evaluated all 49 layers of GPT2-XL and selected the layer with the highest predictivity on brain responses to held-out baseline-set sentences (layer 22; see Supplementary Information). b, To evaluate the encoding model (M), we identified a set of sentences intended to activate the language network to a maximal extent (drive sentences) or a minimal extent (suppress sentences) (see "Encoding model evaluation" in Methods). To do so, we obtained GPT2-XL embeddings for approximately 1.8 million sentences from diverse, large text corpora, generated predicted language-network responses, and ranked these responses to select the sentences predicted to increase or decrease brain responses relative to the baseline set. Finally, we collected brain responses to these novel sentences in new participants (evaluation participants).
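The faux pas caption above names a concrete analysis: response types coded as "didn't know"/"unsure"/"knew" (−1/0/+1), count frequencies compared against the neutral variant with Holm-corrected chi-square tests, and a direction bias score. Below is a sketch of that analysis with invented counts, not the study's data.

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

# Invented response-type counts: ["didn't know", "unsure", "knew"].
neutral = np.array([20, 10, 15])
variants = {
    "faux pas": np.array([38, 4, 3]),
    "knowledge-implied": np.array([5, 6, 34]),
}

# Chi-square test of each variant's response frequencies against neutral,
# then Holm correction across the family of comparisons.
pvals = [chi2_contingency(np.vstack([counts, neutral]))[1]
         for counts in variants.values()]
reject, p_adj, _, _ = multipletests(pvals, method="holm")
print(dict(zip(variants, p_adj)), reject)

# Direction bias score: mean response under the -1/0/+1 coding.
coding = np.array([-1, 0, 1])
for name, counts in variants.items():
    print(name, (counts * coding).sum() / counts.sum())
```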