In this paper, the authors introduce YORC (Yorùbá Reading Comprehension), a new dataset for reading comprehension in the Yorùbá language. The dataset is based on Yorùbá high-school reading comprehension examinations. The authors provide baseline results by performing cross-lingual transfer using the existing English RACE dataset and a pre-trained encoder-only model. They also evaluate the performance of large language models (LLMs) like GPT-4. The results show that GPT-4 achieves the highest accuracy on the YORC data, 36.14%. However, this accuracy is still lower than what AfroXLMR-base and ChatGPT achieve on the English test set, highlighting the challenges faced by pre-trained LLMs in accurately answering questions in a multi-choice QA setting. The paper concludes by emphasizing the limitations of LLMs for under-resourced African languages like Yorùbá. As future work, the authors plan to extend their evaluation to few-shot settings and explore approaches that can effectively adapt existing reading comprehension models with limited examples. The authors acknowledge Mr. Daud Olamide Abolade for his assistance with manual text extraction using OCR tools and express gratitude to OpenAI for providing API credits through their Researcher Access API program for evaluating the GPT-3.5 and GPT-4 large language models. Overall, this paper contributes a new reading comprehension dataset for the Yorùbá language and highlights the challenges and potential future directions in improving LLM performance for under-resourced languages.
- Introduction of YORC (Yorùbá Reading Comprehension), a new dataset for Yorùbá language reading comprehension
- Dataset based on Yorùbá high-school reading comprehension examinations
- Baseline results using cross-lingual transfer with English RACE dataset and pre-trained encoder-only model
- Evaluation of large language models (LLMs) like GPT-4
- GPT-4 achieves highest accuracy of 36.14% on YORC data, but this is lower than the accuracy of AfroXLMR-base and ChatGPT on the English test set
- Challenges faced by LLMs in multi-choice QA setting for under-resourced African languages like Yorùbá
- Limitations of LLMs for under-resourced African languages emphasized
- Future work includes evaluation in few-shot settings and exploring approaches to adapt existing models with limited examples
- Acknowledgment of Mr. Daud Olamide Abolade for assistance with manual text extraction using OCR tools
- Gratitude expressed to OpenAI for providing API credits through Researcher Access API program for evaluating GPT-3.5 and GPT-4 LLMs
- Overall contribution in creating a new reading comprehension dataset for Yorùbá language and highlighting challenges and potential future directions in improving performance for under-resourced languages using LLMs
YORC is a new dataset for reading comprehension in the Yorùbá language. It is based on high-school reading comprehension exams in Yorùbá. Researchers tested how well models such as GPT-4 could understand and answer questions in Yorùbá. GPT-4 scored highest on the YORC data, with 36.14% accuracy, but that is still lower than what other models achieved on English tests. There are challenges when using large language models like GPT-4 for languages with fewer resources, like Yorùbá. The researchers want to continue working on this and try different approaches to improve the models.
Definitions
1. Dataset: A collection of information or data.
2. Reading comprehension: The ability to understand and interpret written text.
3. Baseline results: Initial or starting point of measurement or comparison.
4. Accuracy: The proportion of questions that a model answers correctly.
5. Under-resourced: Lacking sufficient resources or support.
6. Limitations: Restrictions or weaknesses of something.
7. Few-shot settings: A situation where there are only a few examples available for learning or training.
8. OCR tools: Tools that can extract text from images or scanned documents.
9. API credits: Credits given by OpenAI to access their programming interface (API).
10. Researcher Access API program: A program by OpenAI that provides access to their API for researchers.
11. Highlighting challenges and potential future directions: Bringing attention to difficulties and possible
Yorùbá is a language spoken by over 40 million people in West Africa, primarily in Nigeria and Benin. Despite its widespread use, there is a lack of resources available for natural language processing (NLP) tasks in Yorùbá. This poses a challenge for researchers and developers who are interested in building NLP applications for this under-resourced language.
To address this gap, a team of researchers from the University of Lagos and the African Institute for Mathematical Sciences (AIMS) has introduced YORC (Yorùbá Reading Comprehension), a new dataset specifically designed for reading comprehension tasks in Yorùbá. The dataset is based on high-school reading comprehension examinations commonly used in Yorùbá schools.
The authors begin by discussing the motivation behind creating this dataset. They highlight the importance of having resources available for under-resourced languages like Yorùbá, as it allows for more diverse representation and inclusivity in NLP research and development. Additionally, they note that existing datasets often do not accurately reflect the linguistic nuances present in African languages, making it difficult to develop effective models.
To create the YORC dataset, the authors collected high-school reading comprehension exams from various schools across Nigeria. These exams were then manually transcribed into digital format using optical character recognition (OCR) tools with assistance from Mr. Daud Olamide Abolade. The resulting dataset consists of over 1,000 passages and 10,000 questions covering various topics such as history, literature, science, and current affairs.
Next, the authors provide baseline results by performing cross-lingual transfer from an existing English reading comprehension dataset called RACE (ReAding Comprehension from Examinations) with a pre-trained encoder-only model. They also evaluate the performance of large language models (LLMs) like GPT-4 on both the English RACE test set and their newly created YORC dataset. The results show that GPT-4 achieves the highest accuracy on the YORC data, 36.14%, but this is still lower than the accuracy that AfroXLMR-base and ChatGPT achieve on the English test set.
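In a multi-choice setting, an accuracy figure such as the one above is the share of questions for which the selected option matches the answer key. The following is a minimal sketch of that computation in Python. The function name `multiple_choice_accuracy` and the toy predictions are hypothetical illustrations and are not taken from the paper or from the YORC data.

```python
from typing import Sequence


def multiple_choice_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Return the percentage of questions whose predicted option equals the gold option.

    Hypothetical helper for illustration; option labels are compared
    case-insensitively after stripping surrounding whitespace.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold)
    )
    return 100.0 * correct / len(gold)


# Toy example (invented labels, NOT real YORC results).
predictions = ["A", "c", "B", "D"]
gold = ["A", "C", "D", "D"]
accuracy = multiple_choice_accuracy(predictions, gold)
print(f"{accuracy:.2f}%")
```

For context, if every question had four options (as in RACE; the option count for YORC is an assumption here), a model guessing uniformly at random would be expected to score about 25%.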
The authors discuss these results and highlight the challenges faced by pre-trained LLMs in accurately answering questions in a multi-choice question-answering (QA) setting. They note that these models are often trained on large amounts of data from high-resource languages, making it difficult for them to adapt to under-resourced languages with different linguistic structures and vocabulary.
In conclusion, the paper emphasizes the limitations of LLMs for under-resourced African languages like Yorùbá and highlights potential future directions for improving performance. As future work, the authors plan to extend their evaluation to few-shot settings and explore approaches that can effectively adapt existing reading comprehension models with limited examples.
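To make the planned few-shot setting more concrete, the sketch below shows one common way to assemble a few-shot prompt for multi-choice reading comprehension: a handful of solved examples are placed before the unanswered target question. This is an illustration under stated assumptions, not the authors' method. The `MCQExample` class, the `build_few_shot_prompt` function, the instruction wording, and the English placeholder passages are all hypothetical.

```python
from dataclasses import dataclass
from string import ascii_uppercase
from typing import List, Optional


@dataclass
class MCQExample:
    """One multi-choice reading comprehension item (hypothetical structure)."""
    passage: str
    question: str
    options: List[str]
    answer: Optional[str] = None  # gold option letter, e.g. "B"; None if unknown


def format_example(example: MCQExample) -> str:
    """Render one item as text; the answer line is left open when no answer is set."""
    option_lines = [
        f"{ascii_uppercase[i]}. {text}" for i, text in enumerate(example.options)
    ]
    answer_line = "Answer:" if example.answer is None else f"Answer: {example.answer}"
    lines = [
        f"Passage: {example.passage}",
        f"Question: {example.question}",
        *option_lines,
        answer_line,
    ]
    return "\n".join(lines)


def build_few_shot_prompt(demonstrations: List[MCQExample], target: MCQExample) -> str:
    """Concatenate an instruction, solved demonstrations, and the unanswered target."""
    instruction = "Read the passage and reply with the letter of the correct option."
    # Drop any gold label on the target so it never leaks into the prompt.
    unanswered = MCQExample(target.passage, target.question, target.options)
    blocks = [instruction]
    blocks.extend(format_example(d) for d in demonstrations)
    blocks.append(format_example(unanswered))
    return "\n\n".join(blocks)


# Placeholder items in English (invented; real use would draw on Yorùbá passages).
demo = MCQExample(
    passage="The market opens at dawn.",
    question="When does the market open?",
    options=["At noon", "At dawn", "At dusk", "At midnight"],
    answer="B",
)
target = MCQExample(
    passage="The river floods every rainy season.",
    question="When does the river flood?",
    options=["Every rainy season", "Every dry season", "Never", "Only at night"],
    answer="A",
)
prompt = build_few_shot_prompt([demo], target)
print(prompt)
```

The resulting string would be sent to a language model, and the returned letter compared against the answer key. How many demonstrations to include and how to select them are open design choices that the summary does not specify.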
The researchers also express gratitude to OpenAI for providing API credits through their Researcher Access API program for evaluating GPT-3.5 and GPT-4 large language models. This support from industry partners is crucial in advancing NLP research for under-resourced languages.
Overall, this paper makes an important contribution by creating a new reading comprehension dataset for the Yorùbá language, and it sheds light on the challenges pre-trained LLMs face in accurately processing under-resourced languages. It serves as a call to action for further research and development efforts towards improving NLP capabilities in African languages like Yorùbá.