LawBench: Benchmarking Legal Knowledge of Large Language Models

AI-generated keywords: Legal Knowledge, LLMs, LawBench, GPT-4, Evaluation

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Large language models (LLMs) have shown strong capabilities in various aspects
  • Performance of LLMs in the legal domain is uncertain and safety-critical
  • Authors propose a comprehensive evaluation benchmark called LawBench
  • LawBench assesses LLMs' legal capabilities at three cognitive levels: knowledge memorization, understanding, and applying
  • Benchmark consists of 20 tasks covering classification, regression, extraction, and generation
  • Evaluations conducted on 51 LLMs including multilingual, Chinese-oriented, and legal-specific models
  • GPT-4 performs best among LLMs in the legal domain by a significant margin
  • Fine-tuning LLMs on legal-specific text brings some improvements, but more work is needed before LLMs are usable and reliable for legal tasks
  • LawBench provides an in-depth understanding of LLM capabilities and aims to accelerate their development in the legal domain
  • All data, model predictions, and evaluation code are publicly available on GitHub
  • Research contributes valuable insights into evaluating and advancing LLM performance in law

Authors: Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, Jidong Ge

Abstract: Large language models (LLMs) have demonstrated strong capabilities in various aspects. However, when applying them to the highly specialized, safety-critical legal domain, it is unclear how much legal knowledge they possess and whether they can reliably perform legal tasks. To address this gap, we propose a comprehensive evaluation benchmark, LawBench. LawBench has been meticulously crafted to precisely assess LLMs' legal capabilities at three cognitive levels: (1) Legal knowledge memorization: whether LLMs can memorize the needed legal concepts, articles and facts; (2) Legal knowledge understanding: whether LLMs can comprehend entities, events and relationships within legal text; (3) Legal knowledge applying: whether LLMs can properly utilize their legal knowledge and make the necessary reasoning steps to solve realistic legal tasks. LawBench contains 20 diverse tasks covering 5 task types: single-label classification (SLC), multi-label classification (MLC), regression, extraction and generation. We perform extensive evaluations of 51 LLMs on LawBench, including 20 multilingual LLMs, 22 Chinese-oriented LLMs and 9 legal-specific LLMs. The results show that GPT-4 remains the best-performing LLM in the legal domain, surpassing the others by a significant margin. While fine-tuning LLMs on legal-specific text brings certain improvements, we are still a long way from obtaining usable and reliable LLMs for legal tasks. All data, model predictions and evaluation code are released at https://github.com/open-compass/LawBench/. We hope this benchmark provides an in-depth understanding of LLMs' domain-specific capabilities and speeds up the development of LLMs in the legal domain.
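The structure described in the abstract (three cognitive levels, five task types, and three model categories) can be written down as a small data model. The following is only an illustrative sketch based on the numbers in the abstract; the names `CognitiveLevel`, `TaskType`, `Task`, `MODEL_COUNTS` and `total_models` are hypothetical and are not taken from the LawBench repository, and the per-task assignment of levels and types is not given in the metadata, so no task list is filled in.

```python
from dataclasses import dataclass
from enum import Enum


class CognitiveLevel(Enum):
    """The three cognitive levels named in the abstract."""
    MEMORIZATION = "legal knowledge memorization"
    UNDERSTANDING = "legal knowledge understanding"
    APPLYING = "legal knowledge applying"


class TaskType(Enum):
    """The five task types named in the abstract."""
    SLC = "single-label classification"
    MLC = "multi-label classification"
    REGRESSION = "regression"
    EXTRACTION = "extraction"
    GENERATION = "generation"


@dataclass(frozen=True)
class Task:
    """Hypothetical record for one benchmark task (fields are assumptions)."""
    name: str
    level: CognitiveLevel
    task_type: TaskType


# Number of evaluated models per category, as stated in the abstract.
MODEL_COUNTS = {
    "multilingual": 20,
    "chinese_oriented": 22,
    "legal_specific": 9,
}


def total_models(counts: dict[str, int]) -> int:
    """Sum the per-category model counts."""
    return sum(counts.values())


if __name__ == "__main__":
    # The three categories should add up to the 51 models reported.
    print(total_models(MODEL_COUNTS))
```

The category counts add up to the 51 evaluated models reported in the abstract (20 + 22 + 9).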

Submitted to arXiv on 28 Sep. 2023

Results of the summarizing process for the arXiv paper: 2309.16289v1

Large language models (LLMs) have demonstrated strong capabilities in various aspects. However, their performance in the highly specialized and safety-critical legal domain remains uncertain. To address this gap, the authors propose a comprehensive evaluation benchmark called LawBench. LawBench assesses LLMs' legal capabilities at three cognitive levels: legal knowledge memorization, legal knowledge understanding, and legal knowledge applying. The benchmark consists of 20 diverse tasks covering single-label classification, multi-label classification, regression, extraction and generation. The authors evaluated 51 LLMs on LawBench, including multilingual LLMs, Chinese-oriented LLMs and legal-specific LLMs. The results show that GPT-4 is the best-performing LLM in the legal domain by a significant margin. Although fine-tuning LLMs on legal-specific text brings some improvements, there is still a long way to go before LLMs are usable and reliable for legal tasks. LawBench provides an in-depth understanding of the domain-specific capabilities of LLMs and aims to accelerate their development in the legal domain. All data, model predictions and evaluation code are publicly available on GitHub. This research contributes valuable insights into evaluating and advancing LLM performance in the complex field of law.
Created on 06 Oct. 2023

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.