In this study, we developed a standardized social intelligence test, the Situational Evaluation of Social Intelligence (SESI), to assess the social intelligence of large language models (LLMs) on real-world social scenarios. Our evaluation of 13 popular LLM agents on SESI revealed that, while these models have made significant progress in academic intelligence, their social intelligence still has room for improvement. Errors in social intelligence were primarily due to superficial friendliness. The correlation between social and academic intelligence in LLMs was low, indicating that they are distinct abilities. Although LLMs do not fully understand what social intelligence is, their social intelligence is influenced by various social factors, much as it is in humans.
Further analysis of the SESI benchmark showed that it features long, complex, and diverse social contexts: the average context is 44.2 words long, and 50% of situations involve three or more active characters. The benchmark also covers a wide range of social relationship types, which adds to its difficulty. SESI assesses multiple dimensions of social intelligence, extending beyond understanding contexts to achieving characters' social goals. It also encourages detailed and specific answers, whose average length of 25.8 words exceeds that of other common-sense reasoning benchmarks. The distribution of correct and incorrect answer lengths suggests that the benchmark rewards substance rather than length.
The evaluation covered mainstream LLMs such as OpenAI's GPT series, Vicuna, LLaMA 2-Chat, and Mixtral, with benchmarks such as Natural Questions and Massive Multitask Language Understanding serving as baselines for their knowledge and capabilities.
Overall, this study highlights the need for further development in the social intelligence of LLMs and emphasizes the importance of considering both academic and social factors in evaluating these models' performance.
- Developed a standardized social intelligence test, the Situational Evaluation of Social Intelligence (SESI), for assessing large language models (LLMs)
- LLMs show significant progress in academic intelligence but have room for improvement in social intelligence
- Errors in social intelligence are mainly due to superficial friendliness
- Low correlation between social and academic intelligence in LLMs, indicating distinct abilities
- LLMs' social intelligence is influenced by various social factors, as in humans
- SESI features long, complex, diverse social contexts with an average length of 44.2 words; 50% of situations involve three or more active characters
- Wide range of social relationship types included in the benchmark, making it challenging
- SESI assesses multiple dimensions of social intelligence, from understanding contexts to achieving characters' social goals
- Encourages detailed answers with an average length of 25.8 words, rewarding substance rather than length
- Evaluation covered mainstream LLMs such as OpenAI's GPT series, Vicuna, LLaMA 2-Chat, and Mixtral, with baseline benchmarks used to assess their knowledge and capabilities
- Emphasizes the need for further development of LLMs' social intelligence and the importance of considering both academic and social factors in evaluation
Summary:
1. Scientists made a test called SESI to check how well big language models understand social situations.
2. Big language models are good at school stuff but still need to get better at understanding people.
3. Most of their mistakes come from being friendly on the surface without really understanding the situation.
4. Being good at school things doesn't always mean being good with people. They are different skills.
5. Like people, big language models are affected by social factors, such as who is involved and how they are related.
Definitions:
- Social intelligence: Understanding and interacting well with other people in different situations.
- Language models: Computer programs that can understand and generate human language.
- Academic intelligence: Knowledge and skills related to school subjects and learning.
- Correlation: How two things are connected or related to each other.
- Benchmark: A standard or point of reference used for comparison or measurement.
Introduction:
In recent years, artificial intelligence (AI) has advanced significantly, particularly in large language models (LLMs). These models have shown impressive capabilities on academic tasks that test knowledge and language understanding. Their social intelligence, however, remains an area that requires further development.
To assess the social intelligence of LLMs, researchers developed a standardized test called the Situational Evaluation of Social Intelligence (SESI). The test is based on real-world social scenarios and evaluates how well LLMs can navigate complex social situations.
The Study:
The study evaluated 13 popular LLM agents on SESI and analyzed their performance. The results showed that while these models have made significant progress in academic intelligence, they still struggle with social intelligence. The errors observed were primarily due to superficial friendliness, indicating a lack of depth in understanding human emotions and interactions.
Furthermore, the study revealed a low correlation between social and academic intelligence in LLMs. This suggests that these are distinct abilities and need to be evaluated separately. It also highlights the need for further development in the social intelligence aspect of LLMs.
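As an illustration of what a "low correlation" between the two abilities means, the sketch below computes a Pearson correlation coefficient between per-model scores on an academic benchmark and on a social benchmark. The summary does not say which correlation measure the study used, so Pearson is an assumption here, and the score lists are hypothetical placeholders, not results from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-model accuracies (one entry per model), NOT study results.
academic_scores = [0.45, 0.52, 0.61, 0.70, 0.86]
social_scores = [0.40, 0.55, 0.38, 0.57, 0.49]

r = pearson(academic_scores, social_scores)
print(f"Pearson r between academic and social scores: {r:.2f}")
```

A value of r near 0 would mean that knowing a model's academic score says little about its social score, which is the pattern the study reports; values near 1 or -1 would indicate that the two move together.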
Factors Influencing Social Intelligence:
The study also examined which factors influence the social intelligence of LLMs. Although LLMs do not fully understand what social intelligence is, the study found that, like humans, they are influenced by social factors such as context and relationships.
SESI Benchmark:
One of the key contributions of this research is the SESI benchmark, which assesses multiple dimensions of social intelligence rather than only the understanding of contexts. The benchmark features long, complex, and diverse social contexts: the average context is 44.2 words long, and 50% of situations involve three or more active characters.
Moreover, it covers a wide range of relationship types, which makes it challenging for LLMs to navigate these scenarios accurately. This highlights the need for more sophisticated social intelligence in these models.
SESI also encourages detailed and specific answers, whose average length of 25.8 words exceeds that of other common-sense reasoning benchmarks. The distribution of correct and incorrect answer lengths suggests that the benchmark rewards substance rather than length.
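Statistics of this kind (average context length in words, share of situations with three or more active characters) can be computed with a few lines of code. The sketch below is a minimal illustration: the `Item` structure, its field names, and the two sample items are hypothetical and are not taken from SESI, and whitespace splitting is only one possible way to count words.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """Hypothetical benchmark item: a social context plus its active characters."""
    context: str
    characters: List[str]


def mean_word_count(texts):
    """Average number of whitespace-separated words per text."""
    return sum(len(text.split()) for text in texts) / len(texts)


def share_with_min_characters(items, minimum=3):
    """Fraction of items that involve at least `minimum` distinct characters."""
    matching = sum(1 for item in items if len(set(item.characters)) >= minimum)
    return matching / len(items)


# Two made-up items for illustration only.
items = [
    Item("Maya told her manager that a teammate missed the deadline again.",
         ["Maya", "manager", "teammate"]),
    Item("Leo forgot his sister's birthday and wants to apologize.",
         ["Leo", "sister"]),
]

print(mean_word_count([item.context for item in items]))
print(share_with_min_characters(items))
```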
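One way to check that a benchmark rewards substance rather than length is to measure how well a "pick the longest answer" shortcut performs. The sketch below assumes a multiple-choice format in which each question has several candidate answers and one correct index; that format, the function name, and the sample questions are all assumptions made for illustration, not details given in this summary.

```python
def longest_option_accuracy(questions):
    """Accuracy of always choosing the option with the most words.

    `questions` is a list of (options, correct_index) pairs.
    Ties are broken in favor of the earliest option.
    """
    hits = 0
    for options, correct_index in questions:
        lengths = [len(option.split()) for option in options]
        picked = lengths.index(max(lengths))
        if picked == correct_index:
            hits += 1
    return hits / len(questions)


# Made-up questions for illustration only.
sample_questions = [
    (["Apologize and offer to help fix it", "Ignore it"], 0),
    (["Listen first", "Explain at length why they are wrong"], 0),
]

print(longest_option_accuracy(sample_questions))
```

If correct and incorrect answers have similar length distributions, this shortcut should score close to chance (for example, about 1/k with k options per question), which would support the claim that length alone does not reveal the correct answer.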
Evaluation:
The evaluation covered mainstream LLMs such as OpenAI's GPT series, Vicuna, LLaMA 2-Chat, and Mixtral, with benchmarks such as Natural Questions and Massive Multitask Language Understanding serving as baselines for their knowledge and capabilities. This allowed a fair comparison of the models' performance on SESI.
Conclusion:
In conclusion, this study highlights the need for further development in the social intelligence of LLMs. While they have shown impressive progress in academic tasks, their social intelligence still has room for improvement. The SESI benchmark provides a comprehensive assessment tool that goes beyond understanding contexts to evaluating how well these models can achieve characters' social goals.
Moreover, the study emphasizes the importance of considering both academic and social factors when evaluating the performance of LLMs. This research opens up avenues for future work on improving the social intelligence of AI models and narrowing the gap between artificial agents and human-like interaction.