Three Takeaways After Exploring the Capabilities of Smaller, Open-Source LLMs
Our AI experts tested six open-source large language models (LLMs) of different sizes on a diverse set of questions and discovered some interesting insights and takeaways.
ChatGPT and generative AI have taken the world by storm, with ChatGPT's record-breaking user adoption sparking a wave of additional AI-enabled applications that have been met with equal excitement by users around the globe.
ChatGPT is built on top of a large language model (LLM) — neural networks that leverage extensive datasets to determine the likelihood of the next word, phrase or thought in a sentence and generate responses that align seamlessly with the ongoing conversation, prioritizing coherence and relevance.
Despite their fantastic performance, utilizing state-of-the-art LLMs for enterprise-specific use cases can be a daunting task. Many of these models — such as the top-of-the-line models in the GPT-3, GPT-3.5 and Bloom families — have 175 billion or more parameters, which require a whopping 350 GB of memory to run.
Moreover, the options for using existing LLM APIs for commercial purposes are currently limited, offer minimal flexibility, and keep the model architecture under wraps.
Exploring smaller open-source LLMs
Smaller open-source LLMs may offer an alternative for enterprises seeking a more fine-tuned model to achieve domain-specific business tasks. These LLMs have far fewer parameters and can be loaded and fine-tuned on-premises with limited storage and memory requirements.
However, open-source LLMs may present a tradeoff between quality and cost. For enterprises to consider smaller LLMs as a viable alternative for their natural language processing (NLP) business needs, it becomes important to evaluate their text generation capabilities and the resource demands for their deployment.
We tested six open-source LLMs of different sizes on a diverse set of 12 questions and discovered some interesting insights and takeaways.
Insight 1: Smaller LLMs' responses may lack specificity
The relatively smaller LLMs (i.e., those with 100 million to 3 billion parameters) tend to generate responses that are related to the question posed but do not specifically answer it. This can be observed in the example below:
| Question | Answer (GPT-2 Medium) |
| --- | --- |
| When did the United States declare Independence? | The United States was founded on the principle of self-government. The Declaration of Independence was signed by the Continental Congress on July 4, 1776. |
Here, the model's response contains the answer but does not seem to specifically address the question. In another example, the model gets confused about the question and produces a factually incorrect answer altogether.
| Question | Answer (GPT-2 Large) |
| --- | --- |
| What is the largest state in the United States by size? | The USA is the fifth largest, just behind California, New York, Washington, and Texas. That means it is the sixth largest nation in the world by area. Even if you include the District, it is still eighth largest. But remember, the country is divided into 29 states, so we're talking about the same amount of land. |
Why does this happen?
These LLMs were not tuned to perform any specific task (e.g., text summarization, question answering, text classification). Instead, they were trained for next-word prediction — to guess the set of words most likely to follow a given prompt or sequence of words (i.e., the input sentence).
So, using an LLM as-is to perform specific tasks such as question answering, summarization, etc., can be like asking a general physician to lead a specialized surgery.
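To make this concrete, the sketch below shows how a raw model can be queried with the Hugging Face Transformers library: it simply continues the prompt with whatever tokens it deems most likely. This is an illustrative sketch, not our exact test harness; the model name and generation settings are assumptions.

```python
# Minimal sketch (assumed setup): query a raw, untuned LLM via Hugging Face Transformers.
# The base model performs next-word prediction only; it continues the prompt rather than
# "answering" it, which explains the behavior seen in the examples above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-medium"  # e.g., "gpt2", "gpt2-large", "gpt2-xl", "bigscience/bloom-3b", "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "When did the United States declare Independence?"
inputs = tokenizer(prompt, return_tensors="pt")

# Continue the prompt up to a maximum of 100 tokens (the limit used for the scorecards below).
outputs = model.generate(**inputs, max_length=100, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```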
Insight 2: Prompt engineering has varying impacts on different models
To explore the capabilities of open-source LLMs of different sizes, we divided the landscape into groups based on the number of parameters:
- Small LLMs (up to 600 million parameters): GPT-2 Small, GPT-2 Medium
- Medium LLMs (600 million to 3 billion parameters): GPT-2 Large, GPT-2 XL
- Large LLMs (3 billion parameters or more): Bloom-3B, GPT-J 6B
These models were fed a diverse set of questions curated to capture different aspects of response quality, such as sentence formation and structure, factual accuracy, creativity and bias. They were benchmarked against ada, the smallest model in the GPT-3 model family.
The set of questions fed to each model included a mix of three categories of questions:
- Factual questions
- Questions requiring descriptive or creative responses
- Questions requiring subjective, opinion-based responses
Each category received four input questions, for a total of 12.
Finally, to facilitate the evaluation and comparison of model capabilities, we scored responses for factual correctness, response structure, vocabulary and coherence. Scoring criteria were tailored to each category: accuracy was weighted more heavily for factual questions, while cohesiveness and structure were given more importance for descriptive and subjective questions.
Answers that contained irrelevant information or repeated phrases were docked points.
Note: While we could have used benchmark datasets and well-defined, quantitative metrics to score the performance of the models, we merely sought to gain a simple, intuitive understanding of the different models' capabilities and shortcomings for the purpose of this Research Note.
To reduce bias and subjectivity, the scores produced by this exercise were crowdsourced:
| Model | Factual | Descriptive | Subjective | Overall |
| --- | --- | --- | --- | --- |
| GPT-2 Small | 1.75 | 0.5 | 1.25 | 1.2 |
| GPT-2 Medium | 1.75 | 0.75 | 3.75 | 2.1 |
| GPT-2 Large | 0.25 | 2.5 | 2.5 | 1.8 |
| GPT-2 XL | 0.5 | 5 | 2.5 | 2.7 |
| Bloom-3B | 2 | 2 | 2.25 | 2.1 |
| GPT-J 6B | 6.5 | 5.5 | 2.5 | 4.8 |
| GPT-3 ada | 6 | 6.75 | 4.75 | 5.8 |
Table 1: Scores of different models (with token length of 100)
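For reference, the Overall column is consistent with a simple (unweighted) average of the three category scores, rounded to one decimal place. A quick sketch of this assumed aggregation, using three rows from Table 1:

```python
# Reproduce the "Overall" column as a simple average of the three category scores
# (assumed aggregation; it matches the published values when rounded to one decimal place).
category_scores = {
    "GPT-2 Small": {"factual": 1.75, "descriptive": 0.5, "subjective": 1.25},
    "GPT-J 6B": {"factual": 6.5, "descriptive": 5.5, "subjective": 2.5},
    "GPT-3 ada": {"factual": 6.0, "descriptive": 6.75, "subjective": 4.75},
}

for model, scores in category_scores.items():
    overall = round(sum(scores.values()) / len(scores), 1)
    print(f"{model}: overall = {overall}")  # 1.2, 4.8 and 5.8, matching Table 1
```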
The scorecard above captures the response scores generated by the different LLMs as-is, with a maximum token length of 100. In addition to token length, several other hyper-parameters (i.e., configs) govern the nature of the responses. The meaning and impact of these hyper-parameters on model outputs will be discussed in detail in a separate Research Note.
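As a brief illustration of where these knobs live, they are typically passed to the model's generation call. The sketch below reuses `model`, `tokenizer` and `inputs` from the earlier snippet; the specific values are placeholders, not the settings used for the scorecards.

```python
# Illustrative generation call showing common hyper-parameters (placeholder values,
# not necessarily the configuration used for Tables 1 and 2).
outputs = model.generate(
    **inputs,
    max_length=100,          # maximum token length, as fixed for the scorecards
    do_sample=True,          # sample from the next-token distribution instead of greedy decoding
    temperature=0.7,         # assumed value; sharpens or flattens the distribution
    top_k=50,                # assumed value; sample only from the 50 most likely tokens
    top_p=0.95,              # assumed value; nucleus-sampling cutoff
    repetition_penalty=1.2,  # assumed value; discourages repeated phrases
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```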
The scorecard below captures response scores after the questions were modified through prompt engineering: formatting and structuring the input query in a specific, explicit manner to steer the model's response toward an intended task or outcome.
| Model | Factual | Descriptive | Subjective | Overall | Overall (without prompt engineering) |
| --- | --- | --- | --- | --- | --- |
| GPT-2 Small | 1.75 | 0 | 0.75 | 0.83 | 1.2 |
| GPT-2 Medium | 0.5 | 2.5 | 5 | 2.67 | 2.1 |
| GPT-2 Large | 2.5 | 3.75 | 3.5 | 3.3 | 1.8 |
| GPT-2 XL | 1.5 | 3 | 3.5 | 2.7 | 2.7 |
| Bloom-3B | 5.5 | 3.5 | 4.5 | 4.5 | 2.1 |
| GPT-J 6B | 5.75 | 6.25 | 4.25 | 5.4 | 4.8 |
| GPT-3 ada | 5.5 | 6.75 | 4.5 | 5.6 | 5.8 |
Table 2: Scores with token length of 100 (with prompt-engineered input questions)
We modified all the questions posed to the models in the manner illustrated below:
| Raw Question | Modified Question |
| --- | --- |
| What is the largest state in the United States by size? | Q: What is the largest state in the United States by size? \nA: |
The addition of the "Q:" prefix and "\nA:" suffix is an attempt to explicitly guide the model toward interpreting the prompt as a question-answering task.
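In code, this prompt-engineering step amounts to a one-line template (a sketch; the helper name is ours):

```python
# Wrap each raw question in an explicit Q/A template before passing it to the model.
def to_qa_prompt(question: str) -> str:
    return f"Q: {question}\nA:"

raw_question = "What is the largest state in the United States by size?"
print(repr(to_qa_prompt(raw_question)))
# 'Q: What is the largest state in the United States by size?\nA:'
```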
The models that saw the most improvement from prompt engineering were GPT-2 Large and Bloom-3B. For both models, prompt engineering produced factually accurate responses to multiple questions that had previously generated incorrect answers. GPT-2 Large's answers to descriptive questions and Bloom-3B's answers to subjective questions also became more relevant than before, even if they remained far from ideal.
On the other hand, GPT-2 Small, the smallest model tested at just 124 million parameters, fared worse with prompt-engineered inputs. There was no impact on factual questions, but answers to descriptive and subjective questions became repetitive or, in some cases, tangential and written in the first person.
Prompt-engineered inputs had no effect on the output of either GPT-2 XL or OpenAI's proprietary GPT-3 ada model.
Insight 3: The memory footprint increases with model size
We ran our experiments on a single Tesla P100 GPU with 16 GB of memory on an Nvidia DGX server. The memory needed was typically twice the model size: 1x for the initial weights and another 1x to load the model checkpoint file.
A 1.5B-parameter model such as GPT-2 XL takes up 6.5 GB of disk space and requires at least 13 GB of dedicated memory to load (excluding the memory consumed by system libraries and other GPU overhead). See the table below for results.
This means that models larger than 3B parameters will not fit into the memory of commonly used on-premises hardware. Not only that, but inference time (latency) also increases significantly with model size.
| Model Name | Number of parameters (millions) | Dedicated memory footprint (GB of GPU VRAM) |
| --- | --- | --- |
| GPT-2 Small | 124 | 1 |
| GPT-2 Medium | 355 | 3 |
| GPT-2 Large | 774 | 6.5 |
| GPT-2 XL | 1,500 | 13 |
| Bloom-3B | 3,000 | 12 |
| GPT-J 6B | 6,000 | 48 |
Table 3: Size and memory footprint of the different LLMs
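The rule of thumb above can be sketched as back-of-the-envelope arithmetic, assuming 32-bit (4-byte) weights and the 2x factor for weights plus checkpoint. It tracks Table 3 only approximately; Bloom-3B, for example, has a smaller measured footprint than this estimate suggests.

```python
# Rough memory estimate: parameters x 4 bytes (fp32) x 2 (weights + checkpoint copy).
# This is an approximation; actual footprints depend on precision and framework overhead.
def estimated_memory_gb(num_params_millions: float,
                        bytes_per_param: int = 4,
                        overhead_factor: float = 2.0) -> float:
    return num_params_millions * 1e6 * bytes_per_param * overhead_factor / 1e9

for name, params_m in [("GPT-2 Small", 124), ("GPT-2 XL", 1_500), ("GPT-J 6B", 6_000)]:
    print(f"{name}: ~{estimated_memory_gb(params_m):.0f} GB")
# GPT-2 Small: ~1 GB, GPT-2 XL: ~12 GB, GPT-J 6B: ~48 GB
```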
For this exercise, the models that could be loaded directly into memory were the Small, Medium and Large versions of GPT-2.
Multiple techniques exist to load larger models onto on-premises GPU hardware. One such technique, which utilizes sharded model checkpoint files, was implemented to load bigger models such as the GPT-J 6B for this exercise. The details of this sharded checkpoint technique will be explained in a separate Research Note.
Conclusions
LLMs typically need to be fine-tuned for a specific task such as summarization, question answering, classification, etc., for the response to match user expectations and achieve good performance. This is especially true of smaller LLMs whose sizes range from 100 million to 6 billion parameters.
Using prompt engineering to modify the format of the input question improved response quality for some models, such as GPT-2 Large, Bloom-3B and GPT-J 6B, but had little to no impact on others, such as GPT-2 XL and OpenAI's GPT-3 ada, and actually degraded the responses of the smallest model, GPT-2 Small.
As the size and capability of LLMs increase, the hardware required to use them becomes more restrictive. That's why it is important to balance your performance objectives for accuracy against resource demands when choosing the appropriate LLM for a use case.
The table below summarizes our findings on the capabilities of the different models against different types of questions:
| | Small Models (<750M) | Medium Models (750M–3B) | Large Models (≥3B) |
| --- | --- | --- | --- |
| Factual questions | Not able to grasp the specific context of questions that required factual accuracy. | Mid-sized Bloom model was factually quite accurate, but other models underperformed. | Underwhelming results but somewhat accurate. Bloom models showed better accuracy than GPT models. |
| Descriptive questions | Somewhat understood context but responses could be off the mark. | Great response for some questions, very bad for others. | Compute time was very high. Responses were somewhat relevant. |
| Opinion- or perspective-based questions | Polarized responses. Answers were straight yes/no instead of an unbiased perspective. | Somewhat context-based and unbiased. | High compute time and most responses didn't address the question. |
Table 4: Cheat sheet for the capability of different open-source LLMs
This report may not be copied, reproduced, distributed, republished, downloaded, displayed, posted or transmitted in any form or by any means, including, but not limited to, electronic, mechanical, photocopying, recording, or otherwise, without the prior express written permission of WWT Research. It consists of the opinions of WWT Research and, as such, should not be construed as statements of fact. WWT provides the Report "AS-IS", although the information contained in the Report has been obtained from sources that are believed to be reliable. WWT disclaims all warranties as to the accuracy, completeness or adequacy of the information.