"I see SLMs becoming major players in the near future. Their cost and energy efficiency, together with the ability to provide more domain-specific and pertinent answers, are more in line with our high expectations on domain expertise. Ultimately, LLMs may actually become routers or SLM aggregators able to package the best answer from more specialised models."
Over the past few years, the rapid evolution of GenAI, particularly Large Language Models (LLMs), has significantly changed how we interact with technology, delivering transformative value to organisations across all sectors.
In fact, since the release of ChatGPT in November 2022, our fascination with AI has surged, leading to widespread exploration of concepts such as LLMs and their implications. 2023 was hailed as the breakout year for GenAI, as evidenced by rapidly increasing interest in LLMs and their potential to solve challenging problems.
New LLMs are now being introduced every week with new capabilities, trained on vast amounts of text to understand existing content and generate original content. As a result, they possess an unparalleled depth and breadth of knowledge and deliver exceptional performance across various domains and tasks. GenAI has emerged as an indispensable tool, offering a seemingly magical shortcut for tasks spanning content creation to project management and catering to professionals across diverse sectors.
However, this paradigm shift comes at a cost – both literal and figurative. The high cost of deploying LLMs, especially for less commonly supported languages, such as Arabic, presents a barrier for many organisations.
The estimated dynamic computing cost in the case of GPT-3 is equivalent to two or three full Boeing 767s flying round-trip from New York to San Francisco; the current provision of consumer LLMs may be more like a Boeing 767 carrying one passenger at a time on that same journey [7].
In response to these challenges, the open-source community has collaborated to democratise access to GenAI technology by releasing smaller open-source alternatives, known as Small Language Models (SLMs).
SLMs are lightweight GenAI models. The term “small” in this context refers to the size of the model’s neural network and the number of parameters it uses. As organisations grapple with the escalating complexity and cost associated with LLM adoption, SLMs offer an attractive solution, promising efficiency without compromising on performance.
In this article, we explore the evolving landscape of SLMs, delving into their strengths and weaknesses and the overarching implications for industries at large. By examining the details of this transformative technology, we aim to equip readers with the insights and knowledge necessary to navigate this dynamic field and harness the full potential of GenAI in driving innovation and fostering digital transformation.
LLMs are trained on vast text datasets that enable them to generate extensive text, summarise documents, translate between languages and answer complex enquiries. SLMs possess the same capabilities; however, they are notably smaller in size, which gives them the advantage when it comes to cost, efficiency and customisability.
To make an informed decision between LLMs and SLMs, we must benchmark and evaluate the performance of both across various criteria.
As the name suggests, LLMs are considerably large-scale models, typically containing billions, or even hundreds of billions, of parameters with a deep architecture comprising multiple layers.
SLMs have fewer parameters, simpler architectures and fewer layers, which makes them far more computationally efficient than their massive LLM counterparts. Their reduced model size also means they require less storage for weights and less memory for processing.
The road ahead for SLMs holds immense promise in reshaping the landscape of GenAI and Natural Language Processing (NLP). Their advantages position them as a viable alternative to LLM adoption. There are many strategies to optimise model size and maximise performance and efficiency even further, such as quantisation, sparsity/pruning, distillation and adaptation, each outlined below.
1. Quantisation
Quantisation is a technique to reduce a model’s size by lowering the precision of its weights and activations. The compressed model therefore uses less memory, requires less storage space and performs faster. There are two types of LLM quantisation: Post-Training Quantisation (PTQ) and Quantisation-Aware Training (QAT). The main difference between the two is when the quantisation process takes place. PTQ optimises the model after it has been fully trained; while simpler and faster, some accuracy is lost in the process. QAT integrates quantisation during the training process; this is more computationally intensive and requires additional training time, but results in higher accuracy since the model learns to adapt to the lower-precision representation.
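To make the idea concrete, the following is a minimal sketch of symmetric 8-bit post-training quantisation applied to a toy weight matrix. The function names are illustrative rather than taken from any particular framework, and a real PTQ pipeline would also calibrate activation ranges on sample data.

```python
import numpy as np

def quantise_int8(weights: np.ndarray):
    """Symmetric post-training quantisation: float32 weights -> int8."""
    # Map the largest absolute weight onto the int8 range [-127, 127].
    scale = np.max(np.abs(weights)) / 127.0
    quantised = np.round(weights / scale).astype(np.int8)
    return quantised, scale

def dequantise(quantised: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float32 weights."""
    return quantised.astype(np.float32) * scale

# Each parameter now occupies 1 byte instead of 4, at the cost of a
# small reconstruction error.
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantise_int8(w)
print("max reconstruction error:", np.max(np.abs(w - dequantise(q, scale))))
```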
2. Sparsity/pruning
This optimisation method is usually paired with quantisation to achieve maximum efficiency. The technique involves trimming non-essential parameters, those near zero, and replacing them with zeros. The resulting sparse matrix occupies less space than its fully dense counterpart while the significant parameters remain unaffected, preserving the model’s accuracy.
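The sketch below illustrates the core idea with simple magnitude pruning; the quantile threshold is one common heuristic, and production methods typically prune iteratively and fine-tune afterwards to recover any lost accuracy.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero out the given fraction of weights with the smallest magnitudes."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

w = np.random.randn(512, 512).astype(np.float32)
pruned = magnitude_prune(w, sparsity=0.7)  # keep only the largest 30% of weights
print(f"sparsity achieved: {np.mean(pruned == 0):.0%}")
```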
3. Distillation
Distillation is like a master transferring knowledge to a student: the technique moves knowledge from an LLM (the teacher) to a smaller model with a simpler architecture (the student). Take, for example, BERT, one of the most renowned transformer-based deep learning models. DistilBERT shrinks the BERT model by 40% while maintaining 97% of its language-understanding abilities, all at a speed that’s 60% faster.
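As a sketch of how this works in practice, the following implements a typical soft-target distillation loss, assuming PyTorch; the temperature and blending weight are illustrative hyperparameters rather than recommended values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend a soft loss against the teacher with the usual hard-label loss."""
    # Soft targets: push the student's softened distribution towards the
    # teacher's. The T^2 factor keeps gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```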
4. Adaptation
There are a couple of common methods to adapt LLMs and SLMs downstream for a specific task or use case, and the choice depends on the nature of your dataset. If your data is structured for training, you may opt for a fine-tuning approach. For instance, Parameter-Efficient Fine-Tuning (PEFT) is a technique for fine-tuning SLMs to specific tasks by training only a small number of additional parameters. PEFT has become a popular approach as it incurs significantly lower computational costs and storage requirements while maintaining performance comparable to full fine-tuning. On the other hand, if the data is unstructured, the Retrieval-Augmented Generation (RAG) framework can boost the model’s accuracy by performing indexing and semantic search. RAG grounds the model’s responses in relevant search results, thus generating accurate outputs.
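To illustrate the fine-tuning path (PEFT) described above, here is a minimal sketch of a LoRA-style adapter wrapped around a frozen linear layer, assuming PyTorch. The class name, rank and scaling values are illustrative; libraries such as Hugging Face's peft provide production-ready implementations.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a small trainable low-rank update,
    in the spirit of LoRA, a popular PEFT technique (illustrative sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the original weights
            p.requires_grad_(False)
        # Only these two small matrices are trained:
        # rank * (in + out) parameters instead of in * out.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the scaled low-rank correction B(Ax).
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

# Example: adapting a 768-dimensional projection trains ~12k parameters
# instead of ~590k.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
```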
LLMs and SLMs offer different benefits; each has its own unique strengths and limitations.
Overall, SLMs deliver strong ROI through cost savings from reduced computational needs and faster deployment, improved performance due to task-specific customisation, and enhanced business applications such as streamlined customer service and efficient content generation. These factors combine to lower expenses, increase productivity, and unlock new revenue opportunities.
Due to their lightweight nature, SLMs enable faster inference times, making them suitable for applications such as chatbots and IoT devices. They also consume less power and demand less hardware, promoting efficient resource use and cost-effectiveness.
The advent of SLMs marks a pivotal shift in the landscape of AI and natural language processing. These models, characterised by their efficiency, adaptability and domain-specific capabilities, are poised to democratise access to AI technologies, making them readily available for a wider range of applications and industries.
SLMs have demonstrated remarkable potential in use cases such as chatbots, real-time data analysis and customer service, offering cost-effective and scalable solutions. Their ability to be fine-tuned for specific tasks and domains ensures that they can be tailored to meet the unique needs of various organisations, making them invaluable tools for enhancing productivity, automating processes and improving decision-making.
While SLMs represent a significant step forward, it is important to acknowledge the ongoing research and development in this field. As the technology continues to mature, we can anticipate further improvements in efficiency, accuracy and capabilities, paving the way for even more innovative and impactful applications.
In the years to come, SLMs are poised to play an increasingly vital role in shaping the future of AI. Their accessibility, versatility and potential for customisation make them a compelling choice for businesses and organisations seeking to harness the power of AI to drive growth, innovation and success. As research and development continue to push the boundaries of what is possible, we can look forward to a future where SLMs become ubiquitous, empowering individuals and organisations alike to achieve their goals through the transformative power of AI.
1. Large Language Models (LLMs)
3. Language Model Quantisation Explained
4. Mastering LLM Techniques: Inference Optimisation
5. Mastering LLM Optimisation With These 5 Essential Techniques
6. A Guide to Quantisation in LLMs
7. Environmental Impact of Large Language Models
8. When Small is Beautiful: How Small Language Models (SLM) Could Help Democratise AI
9. Energy and Policy Considerations for Deep Learning in NLP