😶‍🌫️
Psych
  • Preface
  • [4/9/2025] A One-Stop Calculator and Guide for 95 Effect-Size Variants
  • [4/9/2025] The People Make the Place
  • [4/9/2025] Personality predicts things
  • [3/31/2025] Response surface analysis with multilevel data
  • [3/11/2025] A Complete Guide to Natural Language Processing
  • [3/4/2025] Personality - Self and Identity
  • [3/1/2025] Updating Vocational Interests Information
  • [2/25/2025] Abilities & Skills
  • [2/22/2025] APA table format
  • [2/19/2025] LLMs that replace human participants can harmfully misportray and flatten identity groups
  • [2/18/2025] Research Methods Knowledge Base
  • [2/17/2025] Personality - Motives/Interests
  • [2/11/2025] Trait structure
  • [2/10/2025] Higher-order construct
  • [2/4/2025] RL for CAT
  • [2/4/2025] DoWhy | An end-to-end library for causal inference
  • [2/4/2025] DAGitty — draw and analyze causal diagrams
  • [2/2/2025] Personality States
  • [2/2/2025] Psychometric Properties of Automated Video Interview Competency Assessments
  • [2/2/2025] How to diagnose abhorrent science
  • [1/28/2025] LLM and personality/interest items
  • [1/28/2025] Personality - Dispositions
  • [1/28/2025] Causal inference in statistics
  • [1/27/2025] Personality differences between birth order categories and across sibship sizes
  • [1/27/2025] Nomological network meta-analysis
  • [1/25/2025] Classic Papers on Scale Development/Validation
  • [1/17/2025] Personality Reading
  • [1/15/2025] Artificial Intelligence: Redefining the Future of Psychology
  • [1/13/2025] R for Psychometrics
  • [12/24/2024] Comparison of interest congruence indices
  • [12/24/2024] Most recent article on interest fit measures
  • [12/24/2024] Grammatical Redundancy in Scales: Using the “ConGRe” Process to Create Better Measures
  • [12/24/2024] Confirmatory Factor Analysis with Word Embeddings
  • [12/24/2024] Can ChatGPT Develop a Psychometrically Sound Situational Judgment Test?
  • [12/24/2024] Using NLP to replace human content coders
  • [11/21/2024] AI Incident Database
  • [11/20/2024] Large Language Model-Enhanced Reinforcement Learning
  • [11/05/2024] Self-directed search
  • [11/04/2024] Interview coding and scoring
  • [11/04/2024] What if there were no personality factors?
  • [11/04/2024] BanditCAT and AutoIRT
  • [10/29/2024] LLM for Literature/Survey
  • [10/27/2024] Holland's Theory of Vocational Choice and Adjustment
  • [10/27/2024] Item Response Warehouse
  • [10/26/2024] EstCRM - Samejima's Continuous IRT Model
  • [10/23/2024] Idiographic Personality Gaussian Process for Psychological Assessment
  • [10/23/2024] The experience sampling method (ESM)
  • [10/21/2024] Ecological Momentary Assessment (EMA)
  • [10/20/2024] Meta-Analytic Structural Equation Modeling
  • [10/20/2024] Structure of vocational interests
  • [10/17/2024] LLMs for psychological assessment
  • [10/16/2024] Can Deep Neural Networks Inform Theory?
  • [10/16/2024] Cognition & Decision Modeling Laboratory
  • [10/14/2024] Time-Invariant Confounders in Cross-Lagged Panel Models
  • [10/13/2024] Polynomial regression
  • [10/13/2024] Bayesian Mixture Modeling
  • [10/10/2024] Response surface analysis (RSA)
  • [10/10/2024] Text-Based Personality Assessment with LLM
  • [10/09/2024] Circular unidimensional scaling: A new look at group differences in interest structure.
  • [10/07/2024] Video Interview
  • [10/07/2024] Relationship between Measurement and ML
  • [10/07/2024] Conscientiousness × Interest Compensation (CONIC) model
  • [10/03/2024] Response modeling methodology
  • [10/02/2024] Conceptual Versus Empirical Distinctions Among Constructs
  • [10/02/2024] Construct Proliferation
  • [09/23/2024] Psychological Measurement Paradigm through Interactive Fiction Games
  • [09/20/2024] A Computational Method to Reveal Psychological Constructs From Text Data
  • [09/18/2024] H is for Human and How (Not) To Evaluate Qualitative Research in HCI
  • [09/17/2024] Automated Speech Recognition Bias in Personnel Selection
  • [09/16/2024] Congruency Effect
  • [09/11/2024] privacy, security, and trust perceptions
  • [09/10/2024] Measurement, Scale, Survey, Questionnaire
  • [09/09/2024] Reporting Systematic Reviews
  • [09/09/2024] Evolutionary Neuroscience
  • [09/09/2024] On Personality Measures and Their Data
  • [09/09/2024] Two Dimensions of Professor-Student Rapport Differentially Predict Student Success
  • [09/05/2024] The SAPA Personality Inventory
  • [09/05/2024] Moderated mediation
  • [09/03/2024] BiGGen Bench
  • [09/02/2024] LMSYS Chatbot Arena
  • [09/02/2024] Introduction to Measurement Theory Chapters 1, 2 (2.1-2.8) and 3.
  • [09/01/2024] HCI measurement
  • [08/30/2024] Randomization Test
  • [08/30/2024] Interview Quantitative Statistical
  • [08/29/2024] Cascading Model
  • [08/29/2024] Introduction: The White House (IS_202)
  • [08/29/2024] Circular unidimensional scaling
  • [08/28/2024] Sex and Gender Differences (Neur_542_Week2)
  • [08/26/2024] Workplace Assessment and Social Perceptions (WASP) Lab
  • [08/26/2024] Computational Organizational Research Lab
  • [08/26/2024] Reading List (Recommended by Bo)
  • [08/20/2024] Illinois NeuroBehavioral Assessment Laboratory (INBAL)
  • [08/14/2024] Quantitative text analysis
  • [08/14/2024] Measuring complex psychological and sociological constructs in large-scale text
  • [08/14/2024] LLM for Social Science Research
  • [08/14/2024] GPT for multilingual psychological text analysis
  • [08/12/2024] Questionable Measurement Practices and How to Avoid Them
  • [08/12/2024] NLP for Interest (from Dan Putka)
  • [08/12/2024] O*NET Interest Profiler (Long and Short Scale)
  • [08/12/2024] O*NET Interests Data
  • [08/12/2024] The O*NET-SOC Taxonomy
  • [08/12/2024] ML Ratings for O*NET
  • [08/09/2024] Limited ability of LLMs to simulate human psychological behaviours
  • [08/08/2024] A large-scale, gamified online assessment
  • [08/08/2024] Text-Based Trait and Cue Judgments
  • [08/07/2024] Chuan-Peng Lab
  • [08/07/2024] Modern psychometrics: The science of psychological assessment
  • [08/07/2024] Interactive Survey
  • [08/06/2024] Experimental History
  • [08/06/2024] O*NET Research reports
  • [07/30/2024] Creating a psychological assessment tool based on interactive storytelling
  • [07/24/2024] My Life with a Theory
  • [07/24/2024] NLP for Interest Job Ratings
  • [07/17/2024] Making vocational choices
  • [07/17/2024] Taxonomy of Psychological Situation
  • [07/12/2024] PathChat 2
  • [07/11/2024] Using games to understand the mind
  • [07/10/2024] Gamified Assessments
  • [07/09/2024] Poldracklab Software and Data
  • [07/09/2024] Consensus-based Recommendations for Machine-learning-based Science
  • [07/08/2024] Using AI to assess personal qualities
  • [07/08/2024] AI Psychometrics And Psychometrics Benchmark
  • [07/02/2024] Prompt Engineering Guide
  • [06/28/2024] Observational Methods and Qualitative Data Analysis 5-6
  • [06/28/2024] Observational Methods and Qualitative Data Analysis 3-4
  • [06/28/2024] Interviewing Methods 5-6
  • [06/28/2024] Interviewing Methods 3-4
  • [06/28/2024] What is Qualitative Research 3
  • [06/27/2024] APA Style
  • [06/27/2024] Statistics in Psychological Research 6
  • [06/27/2024] Statistics in Psychological Research 5
  • [06/23/2024] Bayesian Belief Network
  • [06/18/2024] Fair Comparisons in Heterogenous Systems Evaluation
  • [06/18/2024] What should we evaluate when we use technology in education?
  • [06/16/2024] Circumplex Model
  • [06/12/2024] Ways of Knowing in HCI
  • [06/09/2024] Statistics in Psychological Research 1-4
  • [06/08/2024] Mathematics for Machine Learning
  • [06/08/2024] Vocational Interests SETPOINT Dimensions
  • [06/07/2024] How's My PI Study
  • [06/06/2024] Best Practices in Supervised Machine Learning
  • [06/06/2024] SIOP
  • [06/06/2024] Measurement, Design, and Analysis: An Integrated Approach (Chu Recommended)
  • [06/06/2024] Classical Test Theory
  • [06/06/2024] Introduction to Measurement Theory (Bo Recommended)
  • [06/03/2024] EDSL: AI-Powered Research
  • [06/03/2024] Perceived Empathy of Technology Scale (PETS)
  • [06/02/2024] HCI area - Quantitative and Qualitative Modeling and Evaluation
  • [05/26/2024] Psychometrics with R
  • [05/26/2024] Programming Grammar Design
  • [05/25/2024] Psychometric Network Analysis
  • [05/23/2024] Item Response Theory
  • [05/22/2024] Nature Human Behaviour (Jan - 20 May, 2024)
  • [05/22/2024] Nature Human Behaviour - Navigating the AI Frontier
  • [05/22/2024] Computer Adaptive Testing
  • [05/22/2024] Personality Scale (Jim Shared)
  • [05/22/2024] Reliability
  • [05/19/2024] Chatbot (Jim Shared)
  • [05/17/2024] GOMS and Keystroke-Level Model
  • [05/17/2024] The Psychology of Human-Computer Interaction
  • [05/14/2024] Computational Narrative (Mark's Group)
  • [05/14/2024] Validity Coding
  • [05/14/2024] LLM as an Evaluator
  • [05/14/2024] Social Skill Training via LLMs (Diyi's Group)
  • [05/14/2024] AI Persona
  • [05/09/2024] Psychological Methods Journal Sample Articles
  • [05/08/2024] Meta-Analysis
  • [05/07/2024] Mturk
  • [05/06/2024] O*NET Reports and Documents
  • [05/04/2024] NLP and Chatbot on Personality Assessment (Tianjun)
  • [05/02/2024] Reads on Construct Validation
  • [04/25/2024] Reads on Validity
  • [04/18/2024] AI for Assessment
  • [04/17/2024] Interest Assessment
  • [04/16/2024] Personality Long Reading List (Jim)
    • Personality Psychology Overview
      • Why Study Personality Assessment
    • Dimensions and Types
    • Reliability
    • Traits: Two Views
    • Validity - Classical Articles and Reflections
    • Validity - Recent Proposals
    • Multimethod Perspective and Social Desirability
    • Paradigm of Personality Assessment: Multivariate
    • Heritability of personality traits
    • Classical Test-Construction
    • IRT
    • Social desirability in scale construction
    • Traits and culture
    • Paradigms of personality assessment: Empirical
    • Comparison of personality test construction strategies
    • Clinical versus Actuarial (AI) Judgement and Diagnostics
    • Decisions: Importance of base rates
    • Paradigms of Personality Assessment: Psychodynamic
    • Paradigms of Assessment: Interpersonal
    • Paradigms of Personality Assessment: Personological
    • Retrospective reports
    • Research Paradigms
    • Personality Continuity and Change

[05/22/2024] Nature Human Behaviour (Jan - 20 May, 2024)


Testing theory of mind in large language models and humans

Abstract

At the core of what defines us as humans is the concept of theory of mind: the ability to track other people’s mental states. The recent development of large language models (LLMs) such as ChatGPT has led to intense debate about the possibility that these models exhibit behaviour that is indistinguishable from human behaviour in theory of mind tasks. Here we compare human and LLM performance on a comprehensive battery of measurements that aim to measure different theory of mind abilities, from understanding false beliefs to interpreting indirect requests and recognizing irony and faux pas. We tested two families of LLMs (GPT and LLaMA2) repeatedly against these measures and compared their performance with those from a sample of 1,907 human participants. Across the battery of theory of mind tests, we found that GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas. Faux pas, however, was the only test where LLaMA2 outperformed humans. Follow-up manipulations of the belief likelihood revealed that the superiority of LLaMA2 was illusory, possibly reflecting a bias towards attributing ignorance. By contrast, the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference. These findings not only demonstrate that LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans but also highlight the importance of systematic testing to ensure a non-superficial comparison between human and artificial intelligences.
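The repeated-testing protocol described here is straightforward to sketch. Below is a minimal illustration, not the authors' pipeline: `query_model` is a hypothetical stand-in for a real LLM API, the two items and the keyword scorer are invented placeholders, and the 15 repetitions per item follow the per-item observation count given in the figure caption further down.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for a real LLM API call; replace with an actual
# client (e.g., a GPT or LLaMA2 wrapper).
def query_model(prompt: str) -> str:
    return random.choice(["She didn't know.", "She knew.", "I'm not sure."])

# Placeholder scorer (assumption): 1 if the response displays the target
# mentalistic inference, 0 otherwise. The study used test-specific coding.
def score_response(text: str) -> int:
    return int("didn't know" in text.lower())

# Toy items standing in for the battery (false belief, faux pas, irony, ...).
items = {
    "false_belief_1": "Sally puts her ball in the basket and leaves...",
    "faux_pas_1": "Helen's husband is throwing a surprise party...",
}

N_REPS = 15  # each model was queried repeatedly per item
scores = defaultdict(list)
for item_id, prompt in items.items():
    for _ in range(N_REPS):
        scores[item_id].append(score_response(query_model(prompt)))

for item_id, s in scores.items():
    print(f"{item_id}: mean score {sum(s) / len(s):.2f} over {len(s)} runs")
```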

Trusting young children to help causes them to cheat less

Abstract

Trust and honesty are essential for human interactions. Philosophers since antiquity have long posited that they are causally linked. Evidence shows that honesty elicits trust from others, but little is known about the reverse: does trust lead to honesty? Here we experimentally investigated whether trusting young children to help can cause them to become more honest (total N = 328 across five studies; 168 boys; mean age, 5.94 years; s.d., 0.28 years). We observed kindergarten children’s cheating behaviour after they had been entrusted by an adult to help her with a task. Children who were trusted cheated less than children who were not trusted. Our study provides clear evidence for the causal effect of trust on honesty and contributes to understanding how social factors influence morality. This finding also points to the potential of using adult trust as an effective method to promote honesty in children.

Dynamic computational phenotyping of human cognition

Abstract

Computational phenotyping has emerged as a powerful tool for characterizing individual variability across a variety of cognitive domains. An individual’s computational phenotype is defined as a set of mechanistically interpretable parameters obtained from fitting computational models to behavioural data. However, the interpretation of these parameters hinges critically on their psychometric properties, which are rarely studied. To identify the sources governing the temporal variability of the computational phenotype, we carried out a 12-week longitudinal study using a battery of seven tasks that measure aspects of human learning, memory, perception and decision making. To examine the influence of state effects, each week, participants provided reports tracking their mood, habits and daily activities. We developed a dynamic computational phenotyping framework, which allowed us to tease apart the time-varying effects of practice and internal states such as affective valence and arousal. Our results show that many phenotype dimensions covary with practice and affective factors, indicating that what appears to be unreliability may reflect previously unmeasured structure. These results support a fundamentally dynamic understanding of cognitive variability within an individual.
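The key move, separating practice and state effects from noise in a fitted model parameter, can be sketched with an ordinary regression. This is a minimal illustration on invented data, not the authors' dynamic framework: the log-practice trend, the single "arousal" covariate, and all effect sizes are assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_weeks = 12  # one phenotype dimension tracked over the 12-week protocol

# Simulate a weekly fitted parameter (e.g., a learning rate): a practice
# trend plus a state effect (self-reported arousal) plus noise.
week = np.arange(1, n_weeks + 1)
arousal = rng.normal(size=n_weeks)
param = 0.4 + 0.02 * np.log(week) + 0.05 * arousal + rng.normal(0, 0.01, n_weeks)

# Regress the weekly parameter on log-practice and the state covariate;
# systematic variance captured here is structure, not unreliability.
X = sm.add_constant(np.column_stack([np.log(week), arousal]))
fit = sm.OLS(param, X).fit()
print(fit.params)    # [intercept, practice slope, state slope]
print(fit.rsquared)  # share of week-to-week variance explained
```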

A collaborative realist review of remote measurement technologies for depression in young people

Abstract

Digital mental health is becoming increasingly common. This includes use of smartphones and wearables to collect data in real time during day-to-day life (remote measurement technologies, RMT). Such data could capture changes relevant to depression for use in objective screening, symptom management and relapse prevention. This approach may be particularly accessible to young people of today as the smartphone generation. However, there is limited research on how such a complex intervention would work in the real world. We conducted a collaborative realist review of RMT for depression in young people. Here we describe how, why, for whom and in what contexts RMT appear to work or not work for depression in young people and make recommendations for future research and practice. Ethical, data protection and methodological issues need to be resolved and standardized; without this, RMT may be currently best used for self-monitoring and feedback to the healthcare professional where possible, to increase emotional self-awareness, enhance the therapeutic relationship and monitor the effectiveness of other interventions.

More than nature and nurture, indirect genetic effects on children’s academic achievement are consequences of dynastic social processes

Abstract

Families transmit genes and environments across generations. When parents’ genetics affect their children’s environments, these two modes of inheritance can produce an ‘indirect genetic effect’. Such indirect genetic effects may account for up to half of the estimated genetic variance in educational attainment. Here we tested if indirect genetic effects reflect within-nuclear-family transmission (‘genetic nurture’) or instead a multi-generational process of social stratification (‘dynastic effects’). We analysed indirect genetic effects on children’s academic achievement in their fifth to ninth years of schooling in N = 37,117 parent–offspring trios in the Norwegian Mother, Father, and Child Cohort Study (MoBa). We used pairs of genetically related families (parents were siblings, children were cousins; N = 10,913) to distinguish within-nuclear-family genetic-nurture effects from dynastic effects shared by cousins in different nuclear families. We found that indirect genetic effects on children’s academic achievement cannot be explained by processes that operate exclusively within the nuclear family.
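The basic genetic-nurture regression implied here can be sketched on simulated trios. This is an illustrative toy, not the MoBa analysis: the effect sizes are invented, and the crucial cousin-pair comparison that separates dynastic effects from within-nuclear-family genetic nurture is omitted.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 10_000  # simulated parent-offspring trios

# Parent polygenic scores; the child inherits half of each parent's score
# plus segregation noise (scores standardized for simplicity).
pgs_mom = rng.normal(size=n)
pgs_dad = rng.normal(size=n)
midparent = 0.5 * (pgs_mom + pgs_dad)
pgs_child = midparent + rng.normal(0, np.sqrt(0.5), size=n)

# Achievement = direct genetic effect + indirect effect of parental
# genotype acting through the environment (illustrative sizes).
DIRECT, INDIRECT = 0.30, 0.15
achievement = DIRECT * pgs_child + INDIRECT * midparent + rng.normal(size=n)

# Conditioning on the child's own PGS, the midparent coefficient picks up
# the indirect path; telling genetic nurture apart from dynastic effects
# additionally requires related families (cousins), omitted here.
X = sm.add_constant(np.column_stack([pgs_child, midparent]))
print(sm.OLS(achievement, X).fit().params)  # approx [0, 0.30, 0.15]
```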

Modelling dataset bias in machine-learned theories of economic decision-making

Abstract

Normative and descriptive models have long vied to explain and predict human risky choices, such as those between goods or gambles. A recent study reported the discovery of a new, more accurate model of human decision-making by training neural networks on a new online large-scale dataset, choices13k. Here we systematically analyse the relationships between several models and datasets using machine-learning methods and find evidence for dataset bias. Because participants’ choices in stochastically dominated gambles were consistently skewed towards equipreference in the choices13k dataset, we hypothesized that this reflected increased decision noise. Indeed, a probabilistic generative model adding structured decision noise to a neural network trained on data from a laboratory study transferred best, that is, outperformed all models apart from those trained on choices13k. We conclude that a careful combination of theory and data analysis is still required to understand the complex interactions of machine-learning models and data of human risky choices.
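The decision-noise hypothesis has a compact form: observed choice probabilities are the model's predictions compressed toward equipreference (0.5). Below is a minimal sketch with invented data and a single lapse-style noise weight fitted by maximum likelihood; the paper's "structured decision noise" is richer than this one-parameter version.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)

# "Clean" model predictions for 200 gamble pairs: P(choose gamble B).
p_model = rng.uniform(0.01, 0.99, size=200)

# Simulate an online-style dataset compressed toward equipreference,
# mimicking the skew reported for choices13k (noise weight is invented).
TRUE_NOISE = 0.3
n_subj = 100
k = rng.binomial(n_subj, (1 - TRUE_NOISE) * p_model + TRUE_NOISE * 0.5)

def neg_log_lik(noise: float) -> float:
    p = (1 - noise) * p_model + noise * 0.5
    return -np.sum(k * np.log(p) + (n_subj - k) * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(0.0, 0.99), method="bounded")
print(f"recovered decision-noise weight: {res.x:.2f}")  # close to 0.30
```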

Greater variability in judgements of the value of novel ideas

Abstract

Understanding the factors that hinder support for creative ideas is important because creative ideas fuel innovation—a goal prioritized across the arts, sciences and business. Here we document one obstacle faced by creative ideas: as ideas become more novel—that is, they depart more from existing norms and standards—disagreement grows about their potential value. Specifically, across multiple contexts, using both experimental methods (four studies, total n = 1,801) and analyses of archival data, we find that there is more variability in judgements of the value of more novel (versus less novel) ideas. We also find that people interpret greater variability in others’ judgements about an idea’s value as a signal of risk, reducing their willingness to invest in the idea. Our findings show that consensus about an idea’s worth diminishes the newer it is, highlighting one reason creative ideas may fail to gain traction in the social world.
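The core statistic, dispersion of value judgements increasing with novelty, is simple to compute. A toy sketch with simulated judges and ideas (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n_ideas, n_judges = 200, 20

# Simulate ideas whose between-judge rating spread grows with novelty.
novelty = rng.uniform(0, 1, size=n_ideas)
spread = 0.5 + 1.0 * novelty
ratings = rng.normal(5.0, spread[:, None], size=(n_ideas, n_judges))

# Per-idea disagreement: standard deviation across judges.
judge_sd = ratings.std(axis=1, ddof=1)
r = np.corrcoef(novelty, judge_sd)[0, 1]
print(f"novelty vs. judgement dispersion: r = {r:.2f}")  # clearly positive
```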

The motivating effect of monetary over psychological incentives is stronger in WEIRD cultures

Abstract

Motivating effortful behaviour is a problem employers, governments and nonprofits face globally. However, most studies on motivation are done in Western, educated, industrialized, rich and democratic (WEIRD) cultures. We compared how hard people in six countries worked in response to monetary incentives versus psychological motivators, such as competing with or helping others. The advantage money had over psychological interventions was larger in the United States and the United Kingdom than in China, India, Mexico and South Africa (N = 8,133). In our last study, we randomly assigned cultural frames through language in bilingual Facebook users in India (N = 2,065). Money increased effort over a psychological treatment by 27% in Hindi and 52% in English. These findings contradict the standard economic intuition that people from poorer countries should be more driven by money. Instead, they suggest that the market mentality of exchanging time and effort for material benefits is most prominent in WEIRD cultures.

Accelerated demand for interpersonal skills in the Australian post-pandemic labour market

Abstract

The COVID-19 pandemic has led to a widespread shift to remote work, reducing the level of face-to-face interaction between workers and changing their modes and patterns of communication. This study tests whether this transformation in production processes has been associated with disruptions in the longstanding labour market trend of increasing demand for interpersonal skills. To address this question, we integrate a skills taxonomy with the text of over 12 million Australian job postings to measure skills demand trends at the aggregate and occupational levels. We find that since the start of the pandemic, there has been an acceleration in the aggregate demand for interpersonal skills. We also find a strong positive association between an occupation’s propensity for remote work and the acceleration in interpersonal skills demand for the occupation. Our findings suggest that interpersonal skills continue to grow in importance for employment in the post-pandemic, remote work friendly labour market.
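The measurement step, matching a skills taxonomy against posting text and tracking the share of postings demanding interpersonal skills over time, can be sketched with plain keyword matching. The term list and postings below are invented stand-ins for the study's taxonomy and its 12 million postings.

```python
import re
from collections import Counter

# Illustrative interpersonal-skill terms (the study used a full taxonomy).
TERMS = ["communication", "teamwork", "collaboration", "negotiation",
         "empathy", "stakeholder management"]
pattern = re.compile("|".join(re.escape(t) for t in TERMS), re.IGNORECASE)

postings = [  # (month, posting text) - invented examples
    ("2019-06", "Seeking an engineer with strong communication skills."),
    ("2021-03", "Remote role requiring collaboration and negotiation."),
    ("2021-03", "Data entry clerk; attention to detail essential."),
]

totals, hits = Counter(), Counter()
for month, text in postings:
    totals[month] += 1
    hits[month] += bool(pattern.search(text))

for month in sorted(totals):  # share of postings mentioning the skills
    print(month, f"{hits[month] / totals[month]:.0%}")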

Driving and suppressing the human language network using large language models

Abstract

Transformer models such as GPT generate human-like language and are predictive of human brain responses to language. Here, using functional-MRI-measured brain responses to 1,000 diverse sentences, we first show that a GPT-based encoding model can predict the magnitude of the brain response associated with each sentence. We then use the model to identify new sentences that are predicted to drive or suppress responses in the human language network. We show that these model-selected novel sentences indeed strongly drive and suppress the activity of human language areas in new individuals. A systematic analysis of the model-selected sentences reveals that surprisal and well-formedness of linguistic input are key determinants of response strength in the language network. These results establish the ability of neural network models to not only mimic human language but also non-invasively control neural activity in higher-level cortical areas, such as the language network.
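At its core, the encoding model described here is a ridge regression from sentence embeddings to averaged fMRI responses, then used to rank a large candidate pool. A minimal sketch with random stand-ins for the GPT2-XL activations and BOLD data; the dimensions and regularization grid are assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Stand-ins: embeddings of 1,000 baseline sentences and the averaged
# left-hemisphere language-network response to each (simulated here).
emb = rng.normal(size=(1000, 256))
bold = emb @ rng.normal(size=256) + rng.normal(0, 5.0, size=1000)

# Fit the encoding model; evaluate on held-out baseline sentences.
X_tr, X_te, y_tr, y_te = train_test_split(emb, bold, test_size=0.2,
                                          random_state=0)
model = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X_tr, y_tr)
print(f"held-out R^2 = {model.score(X_te, y_te):.2f}")

# Rank a large candidate pool by predicted response: top = "drive"
# sentences, bottom = "suppress" sentences.
pool = rng.normal(size=(100_000, 256))
order = np.argsort(model.predict(pool))
suppress_idx, drive_idx = order[:10], order[-10:]
print(drive_idx, suppress_idx)
```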

Figure captions (from the articles above; the figures themselves were not preserved in this export)

From "Testing theory of mind in large language models and humans": a, Scores of the two GPT models on the original framing of the faux pas question ("Did they know…?") and the likelihood framing ("Is it more likely that they knew or didn't know…?"). Dots show the average score across trials (n = 15 LLM observations) on particular items, allowing comparison between the original faux pas test and the new faux pas likelihood test. Half-eye plots show distributions, medians (black points), 66% quantiles (thick grey lines) and 99% quantiles (thin grey lines) of the response scores on different items (n = 15 different stories involving faux pas). b, Response scores for three variants of the faux pas test: faux pas (pink), neutral (grey) and knowledge-implied (teal). Responses were coded categorically as "didn't know", "unsure" or "knew" and assigned numerical codes of −1, 0 and +1. Filled balloons are shown for each model and variant; the size of each balloon indicates the count frequency, which was the categorical data used to compute chi-square tests. Bars show the direction bias score, computed as the average of the coded responses. On the right of the plot, P values (one-sided) of Holm-corrected chi-square tests are shown, comparing the distribution of response-type frequencies in the faux pas and knowledge-implied variants against neutral.

From "Trusting young children to help causes them to cheat less": a, The test sheet of the math test that children were asked to complete. b, The answer key.

From "Dynamic computational phenotyping of human cognition": a, Participants performed 7 cognitive tasks and a survey on a weekly basis for 12 consecutive weeks. The left column shows an example screen from each task. The middle column shows the computational model used to fit participants' behavioural data from each task. The right column shows the free parameters fit to individual participants. The collection of these parameters constitutes the computational phenotype. b, Three potential sources of temporal variability: random noise, practice and state effects reflecting changes in mood and daily activities.

From "A collaborative realist review of remote measurement technologies for depression in young people" (search flow diagram): Shows the type of search and search terms used, and the number of records found and included from each search.

From the same review (initial programme theory): Blue text boxes list factors potentially influencing the ways in which RMT may work. Pink text boxes list factors potentially influencing who RMT may work for. The purple text box lists factors potentially influencing the contexts in which RMT may work. The orange text box lists potential mechanisms (M) linking these influencing factors (C) to outcomes (O), with those linked to intended outcomes in green and those linked to unintended outcomes in red. RMT, remote measurement technologies; MDD, major depressive disorder; HCPs, healthcare practitioners; WEIRD, Western, educated, industrialised, rich and democratic; LMIC, low-middle-income country.

From the same review (refined programme theory): Blue text boxes list factors found to influence the ways in which RMT may work. The pink text box lists factors found to influence who RMT may work for. The purple text box lists factors found to influence the contexts in which RMT may work. The orange text boxes list mechanisms (M) found to link these influencing factors (C) to intended (green) and unintended (red) outcomes (O). Question marks indicate areas where evidence is lacking and further research is needed. RMT, remote measurement technologies; EMA, ecological momentary assessment; HCPs, healthcare practitioners; LMIC, low-middle-income country.

From "Modelling dataset bias in machine-learned theories of economic decision-making": a, Because all pairs of gambles considered in this study can be parameterized in one common way, the decision problems' features can be used to compute a two-dimensional embedding (uniform manifold approximation and projection) representing the problem space. Each dot corresponds to a decision problem consisting of two gambles, and the colours indicate the dataset of origin. b, Pairs of gambles from this problem space, together with the proportion of choices, constitute the datasets: the CPC15 dataset with human decisions from a laboratory study, the choices13k dataset with human decisions from a large-scale online experiment, and a much larger synthetic dataset (synth15) generated by predictions from the psychological model BEAST. d, We trained six different NNs on the basis of two architectures: NN-Bourgin (NN-B), based on Bourgin et al., and NN-Peterson (NN-P), based on Peterson et al. c, The target of training was the proportion of trials in which gamble B was chosen, averaged over all human participants and 5 trials (P(B)), from either CPC15 or choices13k. However, because of the small size of the CPC15 dataset, we first pre-trained on synth15 and then fine-tuned on CPC15. To test for dataset bias, we also pre-trained some NNs on synth15 and then fine-tuned on choices13k. We can now investigate the relationship between models and datasets by comparing predictions of NNs on decision problems. Because all pairs of gambles reside in the same problem space but the overlap in decision problems across datasets is small, we compute the difference in predictions between any two models on problems sampled from the problem space. e, Subsequently, we investigated the source of the differences in predictions between different combinations of models and datasets. First, we use linear regressions (top), relating individual or sets of features of the gambles to the difference in model predictions. Second, we use SHAP, an XAI method, which returns linear additive feature-importance values for each gamble (bottom).

From "Driving and suppressing the human language network using large language models": a, We developed an encoding model (M) of the LH language network in the human brain with the goal of identifying novel sentences that activate the language network to a maximal or minimal extent (see "Encoding model development" in Methods). Five participants (train participants) read a large sample (n = 1,000) of six-word corpus-extracted sentences, the baseline set (sampled to maximize linguistic diversity; see Supplementary Information), in a rapid, event-related design while their brain activity was recorded using fMRI. BOLD responses from voxels in the LH language network were averaged within each train participant and then across participants to yield an average language-network response to each of the 1,000 baseline-set sentences. We trained a ridge regression model from the representations of the unidirectional-attention Transformer language model GPT2-XL (identified as the most brain-aligned language base model in Schrimpf et al.) to the 1,000 averaged fMRI responses. Given that GPT2-XL can generate a representation for any sentence, the encoding model (M) can predict the LH language-network response for arbitrary sentences. To select the top-performing layer for our encoding model, we evaluated all 49 layers of GPT2-XL and selected the layer with the highest predictivity on brain responses to held-out baseline-set sentences (layer 22; see Supplementary Information). b, To evaluate the encoding model (M), we identified a set of sentences intended to activate the language network to a maximal extent (drive sentences) or a minimal extent (suppress sentences) (see "Encoding model evaluation" in Methods). To do so, we obtained GPT2-XL embeddings for approximately 1.8 million sentences from diverse, large text corpora, generated predicted language-network responses, and ranked these responses to select the sentences predicted to increase or decrease brain responses relative to the baseline set. Finally, we collected brain responses to these novel sentences in new participants (evaluation participants).
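The faux pas caption above names a concrete analysis: response types coded as "didn't know"/"unsure"/"knew" (−1/0/+1), count frequencies compared against the neutral variant with Holm-corrected chi-square tests, and a direction bias score. Below is a sketch of that analysis with invented counts, not the study's data.

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

# Invented response-type counts: ["didn't know", "unsure", "knew"].
neutral = np.array([20, 10, 15])
variants = {
    "faux pas": np.array([38, 4, 3]),
    "knowledge-implied": np.array([5, 6, 34]),
}

# Chi-square test of each variant's response frequencies against neutral,
# then Holm correction across the family of comparisons.
pvals = [chi2_contingency(np.vstack([counts, neutral]))[1]
         for counts in variants.values()]
reject, p_adj, _, _ = multipletests(pvals, method="holm")
print(dict(zip(variants, p_adj)), reject)

# Direction bias score: mean response under the -1/0/+1 coding.
coding = np.array([-1, 0, 1])
for name, counts in variants.items():
    print(name, (counts * coding).sum() / counts.sum())
```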