Named Entity Recognition (NER) with Autoregressive Large Language Models
Exploring the Feasibility of Using LLaMA 3.2 1B for NER Tasks
Named Entity Recognition (NER) is a key task in Natural Language Processing (NLP), essential for extracting structured information from unstructured text. Traditionally, encoder-based models like BERT have dominated this space due to their ability to process text bidirectionally, capturing rich contextual information. However, recent advancements in large autoregressive language models (LLMs), such as LLaMA, have opened up new possibilities for NER tasks.
I recently ran a small experiment to explore whether autoregressive models like LLaMA 3.2 (1B), typically used for generative tasks, can be adapted for NER. Specifically, I tested two approaches: one keeping the causal attention mask and another without it (i.e., bidirectional attention). The results provide early insights into the potential of decoder-only models for NER tasks.
Why LLaMA 3.2 (1B)?
LLaMA 3.2 is a decoder-only transformer-based model that has shown strong performance in generative tasks. While traditionally used for text generation, its architecture can be adapted for tasks like NER by disabling causal masking, which allows the model to leverage bidirectional context—similar to how encoder-only models work. I chose LLaMA 3.2 (1B) for this experiment for several reasons:
Scalability: The 1B variant strikes a balance between performance and resource requirements. Larger models could potentially offer better results but come with significantly higher computational costs and longer training times.
Efficiency: For a task like NER, which doesn't necessarily require the largest models available, the 1B variant offers a practical solution that can still capture meaningful patterns in the data without overburdening hardware resources.
Versatility: LLaMA's architecture is flexible enough to handle both generative and understanding tasks, making it an interesting candidate for exploring whether decoder-only models can compete with encoder-based models in NER tasks.
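To make the adaptation concrete, the sketch below shows the core idea: a per-token classification head on top of a transformer backbone's hidden states. This is a minimal illustration, not the experiment's actual code; the random tensor stands in for LLaMA 3.2 1B's final hidden states, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class TokenClassifierHead(nn.Module):
    """Linear head mapping per-token hidden states to NER label logits."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        return self.classifier(hidden_states)  # (batch, seq_len, num_labels)

# Toy stand-in for the backbone output; in the real experiment these would be
# LLaMA 3.2 1B's final hidden states.
hidden = torch.randn(2, 16, 64)        # batch=2, seq_len=16, hidden=64
head = TokenClassifierHead(64, 9)      # e.g. 9 BIO labels, CoNLL-style
logits = head(hidden)

# Standard token-level cross-entropy against per-token gold labels.
labels = torch.randint(0, 9, (2 * 16,))
loss = nn.CrossEntropyLoss()(logits.view(-1, 9), labels)
```

The same head works regardless of whether the backbone attends causally or bidirectionally, which is what makes the two configurations below directly comparable.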
Experiment Setup
The experiment focused on comparing two configurations of LLaMA 3.2 for NER:
Bidirectional Attention: Causal masking was disabled to allow the model to infer context from both directions.
Masked Attention: Causal masking remained enabled, preserving the model's usual autoregressive attention pattern.
Both approaches were tested over a few epochs to evaluate their performance and training efficiency.
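The two configurations differ only in the attention mask given to the model. A minimal sketch of that difference (function and variable names here are illustrative, not taken from the experiment's code):

```python
import torch

def attention_mask(seq_len: int, causal: bool) -> torch.Tensor:
    """Boolean attention mask: True means the position may be attended to."""
    if causal:
        # Lower-triangular: token i sees only tokens 0..i (autoregressive).
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # All-ones: every token sees every other token (bidirectional).
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

causal_mask = attention_mask(4, causal=True)
full_mask = attention_mask(4, causal=False)
```

In PyTorch, for example, `torch.nn.functional.scaled_dot_product_attention` accepts a boolean `attn_mask` with this "True = attend" convention, so switching between the two configurations amounts to swapping the mask passed into each attention layer.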
Key Insights
It Works: Both approaches—masked and bidirectional—successfully performed NER tasks, demonstrating that autoregressive models like LLaMA can be adapted for this purpose.
Training Time:
The masked attention approach was faster to train, as expected, and after some training, it performed reasonably well.
The bidirectional model took significantly longer to train and showed slightly better performance than the masked version. However, the performance improvement was not substantial enough to justify the increased training time.
Decoder-Only Models Take Longer: Training with a decoder-only model like LLaMA takes longer, largely because its parameter count (1B here) is far greater than that of typical encoder-only models like BERT (roughly 110M for BERT-base). Despite this, both attention configurations worked well enough to be considered viable alternatives for NER tasks.
Performance vs Cost Trade-off: Disabling the attention mask didn't seem to provide enough performance improvement to justify the extra execution time and computational cost in most cases.
Potential Use Cases for Bidirectionality: While bidirectional attention may not always be worth the extra cost, it could be beneficial in specific scenarios where context from both sides of a token is crucial, e.g.:
Legal Document Analysis: Where understanding the relationship between entities across long sentences is important.
Medical Text Processing: Where context from both sides of a term could clarify ambiguous medical entities.
Specific Languages: In languages where word order is flexible or context is inferred from both directions (e.g., Japanese or Arabic).
Conclusion
This experiment shows that autoregressive large language models can indeed be used for Named Entity Recognition, although further experimentation is needed to optimize performance and training efficiency. Bidirectional attention offers slight accuracy improvements, but at a computational cost that may not always be justified; for now, keeping the causal mask seems like the more practical approach unless a specific use case demands bidirectional context. This project is only an initial exploration of decoder-only models for NER, and further work on other models, parameter sizes, hyperparameter settings, and regularization techniques will be needed to refine these findings.
You can find the full project implementing the bidirectional approach on GitHub.