Introduction

How does your word processing software spot your spelling or grammatical mistakes? How does Google always seem to know what you want to search for? How do Siri and Alexa understand your voice message? And how is ChatGPT able to generate impressive answers to a wide range of questions? These are some of the capabilities of natural language processing (NLP), a branch of artificial intelligence (AI) that focuses on the interactions between computers and natural languages, specifically how to make computers process large amounts of natural language data and ultimately understand even their contextual nuances.

Natural language refers to languages that develop naturally through use (e.g., spoken languages), as opposed to artificial, constructed languages (e.g., programming languages and international auxiliary languages). Natural language data are ubiquitous, often in the form of unstructured free text from which patterns, insights, and knowledge can be extracted. NLP makes a wide range of tasks possible, from automatically processing information in legal or financial documents to informing data-driven business decisions with customer reviews of products or services.

NLP models can also answer questions or even converse with humans. Can machines engage in conversations so well that a human judge cannot reliably tell whether they are communicating with a human or a machine? 

The famous "Turing test," originally proposed as the "imitation game" by Alan Turing, first explored NLP in the 1950s [1] and has since become not only a criterion of artificial intelligence, but a philosophical question as well. Demonstrated in 1954, the first machine translation model, the Georgetown-IBM experiment, translated more than 60 Russian sentences into English [2]. A decade later, ELIZA, developed between 1964 and 1966, became one of the first chatbots and one of the first programs capable of attempting the Turing test [3]. In the most famous script, DOCTOR, ELIZA simulated a psychotherapist who parrots back what the patient says. The following is an example of a typical conversation [3]. 

The exchange is quite impressive, especially for a chatbot built in 1966. In the latter half of the 20th century, NLP tasks were performed according to complex sets of hand-written rules, an approach termed symbolic NLP. In the ELIZA example, the program first identifies the most important keyword in the prompt and constructs the answer accordingly. If the DOCTOR script encounters keywords pertaining to similarity, such as "alike" or "same," it responds with "In what way?" In other cases, after locating a keyword, ELIZA tries to match the prompt against a list of pre-defined patterns, decomposes the prompt, and constructs the response from a pre-defined phrase and the decomposed prompt segment. For example, the prompt "You are very helpful" would first be matched with the pattern "you are {something}," which corresponds to the response "What makes you think I am {something}?" The response to this prompt would therefore be "What makes you think I am very helpful?" If there is no keyword in the prompt, DOCTOR simply replies with "I see" or "Please go on," which carries no specific information [3].
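A minimal sketch of this style of keyword-and-pattern matching is shown below (in Python; the rules are loosely modeled on the DOCTOR behavior described above, not taken from the original ELIZA source):

```python
import random
import re

# A tiny set of DOCTOR-style rules: (keyword pattern, response template).
# These are loosely modeled on the examples above, not the original ELIZA rules.
RULES = [
    (re.compile(r"\b(alike|same)\b", re.IGNORECASE), "In what way?"),
    (re.compile(r"\byou are (?P<something>.+)", re.IGNORECASE),
     "What makes you think I am {something}?"),
]
FALLBACKS = ["I see.", "Please go on."]

def respond(prompt: str) -> str:
    """Match the prompt against hand-written rules and fill in the response template."""
    for pattern, template in RULES:
        match = pattern.search(prompt)
        if match:
            return template.format(**match.groupdict())
    return random.choice(FALLBACKS)  # no keyword found: reply with a content-free phrase

print(respond("You are very helpful"))  # -> What makes you think I am very helpful?
print(respond("They are all alike"))    # -> In what way?
print(respond("I went to the store"))   # -> I see. / Please go on.
```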

The drawbacks of early symbolic NLP models are obvious. First, symbolic NLP models are not scalable. To expand the capability of such a model, for example, to build a chatbot that converses as a teacher instead of a psychotherapist, you would need to define a new set of rules from scratch. Moreover, natural languages are constantly evolving, which forces engineers to keep updating and adding to the rules.

Starting in the 1990s, NLP tasks began to incorporate machine learning algorithms, partly due to the increase in computing power. These approaches are termed statistical NLP: statistical models learn a weight for each input feature (e.g., tf-idf scores or embeddings, as explained in the following sections) and make soft, probabilistic decisions. The models use statistical inference to learn the rules automatically, a significant improvement over the hand-written rules required in symbolic NLP. In addition, statistical NLP models are more robust to unfamiliar or erroneous inputs than symbolic NLP models, where handling such inputs requires adding numerous rules and is extremely time-consuming.

Statistical NLP models work well for simple tasks related to information retrieval, such as spam filtering and sentiment analysis, when combined with techniques such as bag-of-words and term frequency-inverse document frequency (tf-idf), which we will discuss in more detail in the next sections. Progress was also made on statistical machine translation models during this period. These models are trained on parallel bilingual text documents and learn translations at the word, phrase, or syntax level. For example, consider bilingual training data made up of English-French sentence pairs such as the following (illustrative examples):

  • "I am a student." → "Je suis étudiant."
  • "I am an engineer." → "Je suis ingénieur."

If we train a statistical machine translation model on many sentence pairs like these, we can expect it to learn that "I am" has a high probability of translating to "Je suis," "engineer" to "ingénieur," and so on. We can therefore expect it to translate sentences it has never seen before: given a new English sentence such as "I am a doctor," it can assemble the French translation "Je suis médecin" from the word and phrase translations it has learned.

Although the principle is straightforward, the implementation of these models requires a pipeline of separate intermediate tasks, such as elaborate feature engineering, word/phrase alignment between the two languages, and handling of different word orders in the target language. 

In contrast, neural NLP models, which use neural networks instead of statistical models, perform NLP tasks in an end-to-end fashion, and the field has largely transitioned from statistical NLP to neural NLP models since 2015. The field has subsequently experienced rapid growth and achieved stunning results on high-level tasks such as machine translation, summarization, and question answering. We will explore some of these models in depth in this article.

Bag of Words

A concept found as early as 1957 in the context of automatic indexing for information retrieval [4].


Fig. 1 Suspicious email. 

If you have ever received a similar email, you might have realized immediately that it is spam. Some common-sense reasoning can help conclude that it is too good to be true for Mark Zuckerberg to personally email you and gift you $1.5 million, and that the poorly formatted email is unexpected from a CEO of a multibillion-dollar company. But you do not need to thoroughly reason, nor do you need to understand the meaning of the sentences, to reach this conclusion. It would suffice to just look for certain words, because you have seen enough spam emails to learn that you are often the "lucky individual" who is "randomly selected" to "win" a large sum of money. In other words, the existence of certain words makes the email much more likely to be spam. Similarly, when reading restaurant or film reviews, you can often get a good idea of each review's sentiment just by looking for certain positive or negative words. 

Ideally, a machine learning model can learn to classify spam emails or perform sentiment analysis for us automatically. The bag-of-words model uses simple word counts, or term frequencies, to extract features from a text document: words that appear more often in the document receive proportionally more weight and are treated as more important. A simple way is to put all the words of a document into a bag, tally the count of every word, and use the counts as features of the document for machine learning tasks. In theory, these features should be a good representation of the document.
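As a minimal sketch (assuming scikit-learn is available; the two example "documents" are invented for illustration), bag-of-words features can be built like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy "documents": a spam-like email and an ordinary one (invented for illustration).
documents = [
    "You are the lucky individual randomly selected to win a large sum of money",
    "Please find attached the meeting notes from yesterday",
]

# CountVectorizer tokenizes each document and counts how often every word occurs.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(documents)  # rows = documents, columns = vocabulary words

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(features.toarray())                  # raw word counts used as document features
```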

Advantages:

  • Low computing cost
  • Simple and easy to understand

Disadvantages:

  • Weights are directly proportional to word counts, so they are likely dominated by "stop words" (frequent but unimportant words)

Term frequency-inverse document frequency (tf-idf)

A weighting scheme used in machine learning since 1998 [6].

The bag-of-words model, which is based on word count or term frequency, ideally identifies features that are a good representation of the document. However, in English, the most frequent words in documents are often the least important, such as "the," "is," and "for." These words are called "stop words" in natural language processing. Despite their high frequency, they are not characteristic of individual documents because of their ubiquity across all documents. Again using the spam email as an example, terms such as "idea," "get back to me," and "regards" are also seen frequently in non-spam emails, so they are less characteristic than terms such as "lucky individual," "randomly selected," and "win," which are more unique to spam emails.

Therefore, instead of using the term frequency as the weight of a term in individual documents, we can adjust the weight using inverse document frequency (idf), first introduced as term specificity in 1972 [5]. A higher document frequency means lower term specificity, or idf. Considering both the term frequency (tf) and idf gives us better features describing the documents. This approach is called tf-idf, and it has been implemented in machine learning since 1998 [6].
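A rough sketch of one common tf-idf variant follows (weights computed as tf(t, d) × log(N / df(t)); real implementations such as scikit-learn's TfidfVectorizer add smoothing and normalization that are omitted here, and the corpus is invented for illustration):

```python
import math
from collections import Counter

# A toy corpus of three "documents" (invented for illustration).
corpus = [
    "the lucky individual randomly selected to win",
    "the meeting is scheduled for monday",
    "the report is attached for review",
]
tokenized = [doc.split() for doc in corpus]
N = len(tokenized)

# Document frequency: in how many documents does each term appear at least once?
df = Counter()
for doc in tokenized:
    df.update(set(doc))

def tf_idf(doc_tokens):
    """tf-idf weight of each term: term frequency times log inverse document frequency."""
    tf = Counter(doc_tokens)
    return {term: count * math.log(N / df[term]) for term, count in tf.items()}

weights = tf_idf(tokenized[0])
# Ubiquitous words like "the" get a weight of 0 (log(3/3) = 0), while rarer, more
# characteristic words like "lucky" and "win" keep a positive weight.
print(sorted(weights.items(), key=lambda item: -item[1]))
```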

Advantages:

  • Low computing cost
  • Weights better represent the relative importance of the words, thereby serving as better features

Disadvantages:

  • Lacks the ability to understand the word order, context, and semantic relationship between words

Embedding algorithms

How do we teach machines the semantic relationship between words? An intuition comes from the famous quote by linguist J. R. Firth, "You shall know a word by the company it keeps." [7]

A computer can look at two strings and tell you whether they are the same or not. But can it learn to recognize different yet closely related words? To address the shortcomings of bag-of-words and tf-idf, we can use word embeddings: vector representations of the words in a document's vocabulary that capture the context of a word and its semantic similarity to other words.

Fig. 2 Examples of embeddings of related words.

Words that are similar in a semantic sense have a smaller distance between them (whether Euclidean, cosine, or another metric) than words that have no semantic relationship. For example, "man" and "woman" should be closer together than "woman" and "ketchup" or "man" and "cloud." Another fascinating property is that the embeddings can capture word relations: the difference between man and woman is similar to that between king and queen, and similarly for country-capital pairs. This makes arithmetic such as queen = king − man + woman, or Japan = Germany − Berlin + Tokyo, possible (Fig. 2). The distances between the vectors help the computer understand how related or unrelated different words are.
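As a small illustration of this arithmetic (the 3-dimensional vectors below are invented purely for demonstration; real embeddings are learned from data and typically have hundreds of dimensions):

```python
import numpy as np

# Invented 3-dimensional toy embeddings; real word vectors are learned from large corpora.
vec = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "man":   np.array([0.75, 0.10, 0.12]),
    "woman": np.array([0.72, 0.09, 0.60]),
    "queen": np.array([0.78, 0.63, 0.58]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land close to queen in the embedding space.
analogy = vec["king"] - vec["man"] + vec["woman"]
print(cosine_similarity(analogy, vec["queen"]))  # close to 1.0
print(cosine_similarity(analogy, vec["man"]))    # noticeably lower
```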

Consider the following sentence: "We found a cute, hairy wampimuk sleeping behind the tree." [8] Although you have never seen the made-up word wampimuk before, you can still guess that it is a small furry animal, perhaps resembling a squirrel or a sloth, just from the other words surrounding it. In other words, the word embeddings of squirrel, sloth, and wampimuk should be close together, because these words often appear around similar sets of words, such as "cute," "hairy," and "tree."

Using the wampimuk example, let's understand two of the most popular word embedding techniques available today – word2vec and GloVe (Global Vectors for Word Representation).

Word2vec

Word2vec is a neural network model that learns word associations from a large corpus of text. It was developed by Tomas Mikolov and colleagues at Google in 2013 [9]; here we illustrate its skip-gram variant. To determine the vector representation of the word wampimuk, a window size of 2 is used for illustration purposes, but it can be varied based on the application. Wampimuk is the "target word," and the nearest 2 words on either side of it are the "context words," namely cute, hairy, sleeping, and behind. As a result, we get the following training data.

Target word    Context word    Label
wampimuk       cute            1
wampimuk       hairy           1
wampimuk       sleeping        1
wampimuk       behind          1
wampimuk       geisha          0
wampimuk       frivolous       0
wampimuk       insinuate       0

Note that in addition to the first 4 entries of actual target-context pairs with labels of 1, there are also random context words that do not appear near the target word, hence the labels of 0. These random words are termed "negative samples." The word2vec model is trained to produce a vector representation of the target word, wampimuk, that has small distances to the representations of the true context words (cute, hairy, sleeping, and behind) but large distances to those of the false context words (geisha, frivolous, and insinuate). The model goes through each word in the document, uses it as a target, creates positive and negative context word samples, and tweaks the vector representation of each word accordingly.

If two different words have very similar "contexts" (the words likely to appear around them), the representations the model learns for the two words will be similar. You could expect synonyms like "intelligent" and "smart" to have very similar contexts, and related words like "engine" and "transmission" to have similar contexts as well.
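A minimal sketch of training a skip-gram word2vec model with the gensim library is shown below (assuming gensim 4.x; the two-sentence corpus is far too small to learn meaningful vectors and is only meant to show the moving parts):

```python
from gensim.models import Word2Vec

# A tiny tokenized toy corpus; real models are trained on millions of sentences.
sentences = [
    ["we", "found", "a", "cute", "hairy", "wampimuk", "sleeping", "behind", "the", "tree"],
    ["a", "cute", "hairy", "squirrel", "was", "sleeping", "behind", "the", "tree"],
]

# sg=1 selects the skip-gram variant; negative=5 draws five negative samples per positive pair.
model = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of the learned word vectors
    window=2,        # context window of 2 words on each side of the target
    sg=1,
    negative=5,
    min_count=1,
)

print(model.wv["wampimuk"][:5])                     # first few dimensions of the vector
print(model.wv.similarity("wampimuk", "squirrel"))  # cosine similarity between two words
```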

GloVe

However, word2vec only considers whether two words co-occur, not how often they do. Co-occurrence frequency is used in the training of the GloVe model, which stands for Global Vectors for word representation. It was developed at Stanford and generates word embeddings by aggregating a global word-word co-occurrence matrix from a corpus [10]. The co-occurrence matrix records how often each particular word pair occurs together within a specified window size.

The method constructs a large matrix of co-occurrence information: how often each word (rows) appears in each context (columns). Below is part of an example co-occurrence matrix. Note that unrelated words (geisha and frivolous) also appear in the matrix; because they never occur near the other words, their co-occurrence counts are zero.

 

            cute   hairy   wampimuk   sleeping   behind   geisha   frivolous
cute          -      10         25          4        3        0           0
hairy        10       -         20          5        3        0           0
wampimuk     25      20          -         18        9        0           0
sleeping      4       5         18          -       17        0           0
behind        3       3          9         17        -        0           0
geisha        0       0          0          0        0        -           0
frivolous     0       0          0          0        0        0           -

During training, the GloVe model adjusts how close two words' representations are based on how often the words co-occur. As a result, the vector representation of wampimuk would be closest to that of cute (co-occurrence = 25), followed by hairy (20), sleeping (18), behind (9), and finally geisha and frivolous (co-occurrence = 0).
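A rough sketch of how such a co-occurrence matrix can be counted is shown below (plain Python with a symmetric window of 2; fitting the GloVe vectors to these counts is a separate step not shown here):

```python
from collections import defaultdict

sentences = [
    ["we", "found", "a", "cute", "hairy", "wampimuk", "sleeping", "behind", "the", "tree"],
]
window = 2

# cooccurrence[w1][w2] counts how often w2 appears within `window` words of w1.
cooccurrence = defaultdict(lambda: defaultdict(int))
for tokens in sentences:
    for i, target in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                cooccurrence[target][tokens[j]] += 1

print(dict(cooccurrence["wampimuk"]))
# -> {'cute': 1, 'hairy': 1, 'sleeping': 1, 'behind': 1}
```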

The embedding algorithms word2vec and GloVe aim to build a vector space where the position of each word is influenced by its neighboring words, based on their context and semantics. Word2vec starts from local, individual examples of word co-occurrence pairs, whereas GloVe starts from co-occurrence statistics aggregated globally across all words in the corpus. However, using these word embeddings becomes complicated when the same word has different meanings. How do you make a computer understand that "Apple" in "Apple is a tasty fruit" is a fruit that can be eaten and not an organization? There is only one word2vec representation for "apple" the company and "apple" the fruit. The vector representation tries to generalize, which can lead to one context overpowering the other, or to the different meanings being blurred in the resulting vector. The problem of the same word having different meanings in different contexts was addressed by more advanced models built on an encoder-decoder architecture.

Advantages:

  • Interrelationships between words are encoded in the word embeddings

Disadvantages:

  • The same word always has the same embedding, even when its meanings differ
  • Embeddings are static and do not reflect the context

Encoder-decoder Architecture

A machine learning architecture that converts a prompt into a context and the context into a response.

Machines can also perform more sophisticated NLP tasks, such as engaging in conversations. Let's take a minute to think about how we converse. We listen to what someone says, digest the sentences to form an overall understanding or idea, and then compose the response accordingly. In NLP, this two-step process of converting the prompt into a context, and then converting the context into a response, is realized using an encoder and a decoder, respectively. An encoder-decoder architecture is shown in Fig. 3.

The NLP task of converting a prompt into a response is more generally termed a sequence-to-sequence (seq2seq) problem, because both the input and output are sentences composed of sequences of words. The use cases for sequence-to-sequence models include question answering, text summarization, and translation.

Let's use machine translation as an example to illustrate the principle of a language model with an encoder-decoder architecture. Our task is to translate a French sentence into an English one. Of course, the model can be trained to translate between any two languages, even programming languages such as Java, Python, and SQL.

Machine Translation Implemented using LSTM

In Fig. 3, we input the French sentence "Je suis étudiant" into the translation model, and the output should be the English translation, "I am a student." In an encoder-decoder architecture, the input sentence is first converted by the encoder into some context, represented by a vector, which is then passed into the decoder to generate the translation. Note that the input and output sequences do not have to be the same length.

Fig. 3 Encoder-decoder Architecture. The encoder converts the prompt into a context, which is then converted by the decoder into a response.

The encoder and decoder were first implemented using LSTMs (long short-term memory), a variant of the recurrent neural network (RNN). In Fig. 4, the LSTM is represented by rectangles, with the numbers inside indicating the corresponding time steps. After the first French word ("Je" in this example) is passed in, the LSTM generates a hidden state (h0), which is fed back into the same LSTM in the next step together with the second word ("suis") to generate the updated hidden state (h1), now encoding the information of the first two words. This process repeats until all words have been passed into the encoder. The resulting hidden state (h2), sometimes called the "context," contains information from each of the input French words and is then decoded by the decoder. The decoder LSTM similarly takes in the context and generates a hidden state (h3) along with a prediction of the first English word of the translation ("I"). In the next time step, the generated word ("I") and the previous hidden state (h3) are passed back into the decoder LSTM to generate the updated hidden state and predict the next word.

Fig. 4 Encoder-decoder architecture implemented using long short-term memory (LSTM). Both the encoder (left green box) and decoder (right purple box) are implemented using LSTMs, represented as rectangles. Each rectangle corresponds to a time step, labeled by the number inside. After each time step t, the LSTM generates a corresponding hidden state ht. Note that the hidden state between the encoder and decoder (h2 in this example) is the context shown in Fig. 3. The <EOS> token is the "end of sentence" token, prompting the decoder to start and stop generating words.
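A minimal sketch of this encoder-decoder setup in PyTorch follows (toy vocabulary sizes and random token ids stand in for a real tokenizer, <EOS> handling, and training loop):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy encoder-decoder translator; vocabulary sizes and dimensions are invented."""

    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder: read the whole source sentence and keep only the final (hidden, cell)
        # state, which plays the role of the fixed-size "context" in Fig. 3.
        _, context = self.encoder(self.src_emb(src_ids))
        # Decoder: generate the target sentence conditioned on that context. During
        # training, the ground-truth previous words are fed in (teacher forcing).
        decoded, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(decoded)  # a score over the target vocabulary at every time step

model = Seq2Seq()
src = torch.randint(0, 1000, (1, 3))  # e.g. token ids for "Je suis étudiant"
tgt = torch.randint(0, 1000, (1, 4))  # e.g. token ids for "I am a student"
print(model(src, tgt).shape)          # torch.Size([1, 4, 1000])
```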

LSTMs process text in an intuitive way, respecting the sequential nature of language by taking in words one at a time. However, there are limitations to implementing a translation model with an LSTM. For example, consider translating a longer sentence, "The agreement on the European Economic Area was signed in August 1992," into "L'accord sur la zone économique européenne a été signé en août 1992" (Fig. 5a). The LSTM model essentially has to memorize the entire sentence before generating the first word of the translation, and when the sentence is long, the useful information becomes diluted, reducing the performance of the language model.

Advantages:

  • Generates dynamic, contextual representations
  • Able to perform high level tasks, such as translation and question answering

Disadvantages:

  • High computing cost due to lack of parallelizability

Attention Mechanism and Transformer

"Attention is All You Need" demonstrated an encoder-decoder architecture composed of only the attention mechanisms.

Let's think about human conversation again, or more specifically, how a human translates language. Instead of first memorizing a whole sentence and coming up with a translation all at once, a human translator pays attention to a small part of the sentence at a time and generates a part of the translation, working through the whole sentence little by little.


Fig. 5 Cross attention between original and translated sentences. (a) Illustration of how human translators may pay attention to only part of the original sentence when coming up with the translation. (b) Cross-attention map generated by a machine translation model using the attention mechanism. Brighter pixels indicate more attention, or closer relation, between the two words. Figure taken from ref. [11].

Consider the sentence in Fig. 5a as an example: when generating the first French words, "L'accord," we pay attention mostly to the first English words, "The agreement," more than to any other words in the sentence. The fact that the agreement was signed in August 1992, specified at the end of the sentence, does not help with translating the very first part. Similarly, when generating the next part, "sur la zone économique européenne," we mainly pay attention to "on the European Economic Area," and so on. The weights given to every word in the sentence, specifying how important each word is at each step of the translation, are computed by the "attention mechanism." Specifically, attention that maps the relationship between two different sequences is termed "cross-attention" (Fig. 5b).

The attention mechanism improves the performance of machine translation and helps language models understand the context of sentences. Take the following sentence as an example: "The animal didn't cross the street because it was too tired." How do we understand the word "it" in this sentence? The pronoun could refer to either the animal or the street. Later the sentence says "it was too tired," suggesting that "it" refers to the animal. In short, to understand the word "it," we pay more attention to the words "animal," "street," and "tired" to form a contextual understanding of the word. This is termed "self-attention," which maps the relationships between words within the same sentence. A self-attention map of this sentence is shown in Fig. 6.


Fig. 6 Self attention within a sentence with respect to the word "it." Two attention heads, green and red, are shown. The green attention head pays attention to the "it was too tired" clause, whereas the red attention head pays more attention to the words "animal" and "street." Figure generated at ref. [12].
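A minimal sketch of the underlying computation, scaled dot-product attention [13], is shown below (the query, key, and value matrices are random stand-ins for the learned projections of the word embeddings):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights  # weighted sum of values, plus the attention map itself

rng = np.random.default_rng(0)
n_words, d_k = 10, 8                 # e.g. 10 tokens of "The animal didn't cross the street ..."
Q = rng.normal(size=(n_words, d_k))  # queries (stand-ins for learned projections)
K = rng.normal(size=(n_words, d_k))  # keys
V = rng.normal(size=(n_words, d_k))  # values

output, attention_map = scaled_dot_product_attention(Q, K, V)
print(attention_map.shape)  # (10, 10): how much each word attends to every other word
```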

When the attention mechanism was first proposed in 2014, it was used together with LSTM language models to improve their performance. In 2017, however, the Transformer was introduced in the paper "Attention is All You Need," demonstrating an encoder-decoder architecture composed of only attention mechanisms (Fig. 7) [13], without any RNN components. Excluding the RNN also removes the RNN's major disadvantage: it is not parallelizable, because it ingests only one word at a time. The Transformer, built on the attention mechanism, is parallelizable, taking in all words of the input sentence at once. This means that, given the same computing resources, we can build much deeper networks (many Transformer blocks stacked on top of one another), which has been shown to lead to better model performance.


Fig. 7 Transformer architecture, with an encoder block on the left and a decoder block on the right. The Transformer has three kinds of attention: encoder self-attention, encoder-decoder cross-attention, and decoder self-attention. Figure modified from ref. [13].

The Transformer architecture has since been implemented in many state-of-the-art NLP models. Introduced in 2018, GPT (Generative Pre-trained Transformer) [14], with 110 million parameters, is composed of a stack of Transformer decoders and is thus able to generate text. Use cases of GPT include translation into natural or programming languages, question answering, and summarization. The scaled-up and more powerful versions, GPT-2 and GPT-3, were introduced in 2019 and 2020, with 1.5 billion and 175 billion parameters, respectively [15-16].

Introduced in 2018, BERT (Bidirectional Encoder Representations from Transformers), with up to 340 million parameters [17], is composed of a stack of Transformer encoders and is thus able to generate contextual representations of the input words and sentences. These contextual representations enable downstream ML models to perform more accurate classification tasks, including next-word prediction, sentiment classification, and named entity recognition. BERT is a superior alternative to the embedding algorithms introduced above (word2vec and GloVe) for generating word representations, albeit one requiring more computing resources. Instead of learning one static representation for each word, BERT learns individual word representations based on their contexts. Consequently, the two "bars" would have distinct representations in the following example: "Alice just passed the bar exam so she's going to the bar to celebrate."
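As a brief sketch using the Hugging Face transformers library (assuming the bert-base-uncased checkpoint is available), we can verify that the two occurrences of "bar" receive different contextual vectors:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "Alice just passed the bar exam so she's going to the bar to celebrate."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
hidden = outputs.last_hidden_state[0]  # one contextual vector per token

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
bar_positions = [i for i, token in enumerate(tokens) if token == "bar"]

# The two occurrences of "bar" get different vectors because their contexts differ.
first_bar, second_bar = hidden[bar_positions[0]], hidden[bar_positions[1]]
print(torch.cosine_similarity(first_bar, second_bar, dim=0))  # similar, but not identical
```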

The Transformer architecture has proven extremely powerful not only in NLP but also in other AI domains, such as computer vision and speech recognition. Some successes have already been achieved by applying the Transformer architecture in those domains, and researchers are currently developing Transformer-based models that can learn from data across all domains [18].

Advantages:

  • Lower computing cost than LSTM due to parallelizability, allowing for deeper networks and better performance using the same computing resources

Disadvantages:

  • High computing complexity prevents the full attention mechanism from being used in other modalities with more tokens, such as audio and images

Conclusion

It is fascinating to see how far we have come since we first started exploring natural language processing in the 1950s. The older rule-based models could only handle rather limited use cases, such as a translation model that translates only dozens of sentences between two specific languages and a chatbot that operates only in a specific setting.

Fast-forward to 2023, and you might be surprised by what NLP models can do now. Over the past few years, many basic NLP use cases have become an integral part of our lives, including grammar and spelling correction, text completion, search engines, translation, and summarization. Recent milestones include:

  • OpenAI released DALL-E 2, which creates realistic images and art from a natural language description [19].
  • GPT-3 wrote an academic paper on itself and was listed as the first author [20].
  • Meta's Sphere is a knowledge-intensive NLP model that can help check whether the references in Wikipedia articles are appropriate [21].
  • Google's Minerva is a language model trained on scientific papers containing mathematical expressions, capable of solving mathematical and scientific questions using step-by-step reasoning [22].
  • OpenAI announced ChatGPT [23], a chatbot built on top of the GPT-3 family models that can generate detailed and articulate responses across multiple domains of knowledge. ChatGPT surpassed 100 million monthly active users within two months of release [24].

Nowadays, so much is possible with NLP. WWT has implemented BERT to generate embeddings for tweets in our debias-GAN project. We have also built an ensemble model for real-time information retrieval using models including BERT, word2vec, and tf-idf. If you have a large amount of text data in your organization, chances are that NLP models can help you process the documents and/or extract actionable insights. 

Learn more about how data analytics and AI can benefit your organization.

References

  1. Turing, A. (1950). Computing Machinery and Intelligence, Mind. LIX (236): 433-460.
  2. Hutchins, W. J. (2004). The Georgetown-IBM Experiment Demonstrated in January 1954. In Machine Translation: From Real Users to Research, Springer, Berlin, Heidelberg.
  3. Weizenbaum, J. (1966). ELIZA—A Computer Program for the Study of Natural Language Communication Between Man and Machine, Communications of the ACM, 9: 36-45.
  4. Luhn, H. P. (1957). A Statistical Approach to Mechanized Encoding and Searching of Literary Information, IBM Journal of Research and Development. 1 (4): 309-317.
  5. Jones, K. S. (1972). A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation. 28 (1): 11-21.
  6. Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of the 10th European Conference on Machine Learning, pp. 137-142, Chemnitz, Germany.
  7. Firth, J. (1957). A Synopsis of Linguistic Theory, 1930-55. In Studies in Linguistic Analysis, pp. 1-31, Oxford, The United Kingdom.
  8. Lazaridou, A.; Bruni, E.; Baroni, M. (2014). Is This a Wampimuk? Cross-modal Mapping Between Distributional Semantics and the Visual World. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 1403-1414, Baltimore, Maryland.
  9. Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv: 1301.3781.
  10. Pennington, J.; Socher, R.; Manning, C. (2014). GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, Doha, Qatar.
  11. Bahdanau, D.; Cho, K.; Bengio, Y. (2016). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv: 1409.0473.
  12. https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb.
  13. Vaswani, A. et al. (2017). Attention is All You Need. arXiv: 1706.03762.
  14. Radford, A. et al. (2018). Improving Language Understanding by Generative Pre-training.
  15. "GPT-2: 1.5B Release." OpenAI. 2019-11-05.
  16. Brown, T. B. et al. (2020). Language Models are Few-Shot Learners. arXiv: 2005.14165.
  17. Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv: 1810.04805.
  18. Baevski, A. et al. (2022). data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. arXiv: 2202.03555.
  19. Ramesh, A. et al. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv: 2204.06125.
  20. GPT-3; Osmanovic-Thunström, A; Steingrimsson, S. (2022). Can GPT-3 Write an Academic Paper on Itself, with Minimal Human Input? hal-03701250.
  21. Piktus, A. et al. (2022). The Web Is Your Oyster — Knowledge-Intensive NLP against a Very Large Web Corpus. arXiv: 2112.09924.
  22. Lewkowycz, A. et al. (2022). Solving Quantitative Reasoning Problems with Language Models. arXiv: 2206.14858.
  23. "ChatGPT: Optimizing Language Models for Dialogue." OpenAI. 2022-11-30.
  24. https://archive.is/XRl0R.