Evaluating LLM Performance

An experimental study evaluating the performance of both open-source and closed-source LLMs on Named Entity Recognition (NER)

Named Entity Recognition (NER), a sub-task of Information Extraction (IE), plays a pivotal role in many applications such as information retrieval, question answering, document summarization, text mining, and machine translation, to name a few. The current state of the art in NER is to fine-tune transformer-based language models such as BERT or RoBERTa on human-annotated training datasets. This fine-tuning approach faces several challenges: annotating training data requires domain knowledge and is labor-intensive and expensive; fine-tuned NER models struggle with out-of-distribution datasets; and annotated NER training and test data may contain incorrectly labeled examples, which makes evaluation results questionable. Few-shot learning-based fine-tuned NER models aim to alleviate these issues, but they still struggle to generalize when test datasets are out-of-distribution. An emerging approach is to use pre-trained Large Language Models (LLMs) such as ChatGPT or its variants to extract named entities directly with only a few examples, or none at all, and commercial LLMs used in this way can achieve NER performance on par with fine-tuned NER models.
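As a concrete illustration of the prompting-based approach, the sketch below asks a hosted LLM to extract entities with a zero-shot prompt. It assumes the OpenAI Python SDK and an API key in the environment; the model name, prompt wording, and entity types are illustrative choices, not necessarily the setup used in this study.

```python
# Minimal sketch of zero-shot NER via LLM prompting.
# Assumes the OpenAI Python SDK (>=1.0) and OPENAI_API_KEY set in the environment.
# Model name, prompt wording, and entity types are placeholders for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_entities(text: str) -> str:
    """Ask the LLM to list PERSON, ORGANIZATION, and LOCATION entities in `text`."""
    prompt = (
        "Extract all named entities from the text below. "
        "Return one entity per line in the form <entity> :: <type>, "
        "using only the types PERSON, ORGANIZATION, LOCATION.\n\n"
        f"Text: {text}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder closed-source model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for evaluation
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(extract_entities("Barack Obama visited Microsoft headquarters in Redmond."))
```

The returned lines can then be parsed and scored against gold annotations with the usual precision, recall, and F1 measures.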

While these research outcomes are encouraging for reducing annotation costs, using closed-source proprietary LLMs is not only expensive for large-scale NER tasks in organizations and businesses, but also raises data ethics issues: all data must be sent to the service provider for NER processing, which brings privacy and security concerns. Our research addresses both issues by running open-source pretrained LLMs such as LLAMA2 on an on-premise computer with consumer-grade GPUs. The total hardware cost is less than the six-month rental cost of a cloud GPU server of similar performance, and both the models and the NER processing stay on the on-premise machine.
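To indicate what on-premise inference might look like, the sketch below runs a Llama 2 chat model locally via the Hugging Face transformers library and prompts it for entities. The model identifier (meta-llama/Llama-2-7b-chat-hf, a gated repository that requires accepting Meta's license), the prompt format, and the generation settings are assumptions for illustration rather than the project's actual configuration.

```python
# Minimal sketch of on-premise NER-style prompting with an open-source LLM.
# Assumes the transformers and accelerate libraries, a consumer-grade GPU, and
# access to the gated meta-llama/Llama-2-7b-chat-hf weights on Hugging Face.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,  # half precision to fit consumer-grade GPU memory
    device_map="auto",          # place model layers on the available GPU(s)
)

prompt = (
    "[INST] Extract all named entities (PERSON, ORGANIZATION, LOCATION) from the "
    "following text and list them one per line as <entity> :: <type>.\n"
    "Text: Barack Obama visited Microsoft headquarters in Redmond. [/INST]"
)

output = generator(prompt, max_new_tokens=128, do_sample=False)
print(output[0]["generated_text"])  # prompt followed by the model's entity list
```

Because the model weights and all text stay on the local machine, no documents need to be transmitted to an external service provider.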

Outcomes:

• Conduct empirical testing to demonstrate the efficacy of modern LLMs in the domain of named entity recognition

• Address data privacy and governance concerns associated with the use of LLMs hosted by commercial service providers for these tasks

Project Team:

Dr Simon Zhu
Dr Siriu Li
A/Prof Nik Thompson