Teaching computers language
— tech — 13 min read
Can you remember the last time you learned a language that you're now fluent in? It meant mastering four different skills: reading, writing, listening, and speaking. These are challenging skills to teach a computer, but let's step it up and focus on an abstract reading skill: recognizing style. Style is intuitive, but often difficult to describe comprehensively (think about a book written by your favourite author, or the difference between reading an essay written by a child vs one by a university graduate).
These days, with the amount of artificially generated content shared on the internet, recognizing style is more relevant than ever. Here's a simple game where you can see if your writing matches the style of popular large language models (try it out here if it takes too long to load; best viewed on desktop).
This game takes in a piece of text for a fixed prompt and predicts the LLM (out of ChatGPT, Gemini and Claude) that would most likely produce text with a similar style. Building it required designing a method to express writing styles and evaluate text against them, and in this article I'll explain how I did so and the principles behind it.
Basics
Let's start by reading sentences and recognizing their language, and unpacking some of what goes on behind the scenes as you do so.
Hey, how's it going?
As you read that sentence you quickly assessed that it was written in English. Can you pinpoint the exact word in the sentence where you were most confident? Was it after reading 'Hey', 'how's', 'it', or 'going'? Or was it related to how symbols like the comma (,) and apostrophe (') were used? It's clear that recognizing words is important when identifying language, but there are clues in other parts of the sentence as well.
What if the sentence was just the word 'Hey'? If you spoke Dutch, Danish or German, you might be undecided on the language, as it's common in those languages to use this spelling (also 'hej') for the English word 'hey'. You couldn't be sure how this article was going to unfold (how thrilling).
Floccinaucinihilipilification, amidst sesquipedalian obfuscations, epitomizes anfractuous inefficacy
It's unlikely that you've come across this sentence before, but if you had to guess, you would still classify it as English. Though several words were new to you, they have a certain English-ness to them. They didn't make your brain immediately think that a stray cat had walked across my keyboard; they were just words that you've never come across before1.
To understand why, let's break down one of the complex words 'Floccinaucinihilipilification' into 'Floccinaucinihilipil' and 'ification'. The English-ness comes from two factors: the first is that you have come across other words that end with 'ification' (e.g identification, gamification, justification) or just 'cation'/'tion' (e.g education, application, election, emotion). This pattern is so common in English that you've learned to recognize and be comfortable with it. The second, more subtle reason is the absence of weird combinations of letters. Sequences of letters like 'cci' and 'auc', while uncommon in English, didn't raise alarm bells. They might have if they were 'kje' or 'dzi', since you rarely come across English words that contain them. (If you happen to also speak Dutch or Polish, those sequences would feel natural to you.) Interestingly, no one taught you that these sequences are uncommon; rather, it's something that you learned through use of the language.
These examples highlight two important points that we'll make use of. First, we break sentences down into smaller parts (sometimes sub-word) and look for patterns to establish familiarity, and this is what we'll teach the computer to do. Second, recognition of language can be done without understanding meaning. This is important because it means we can teach a computer to read and recognize style without needing to teach it to understand or produce content. Armed with this knowledge, let's move on to building.
Machinery
To begin, we need to let the computer know what kinds of symbols it should recognize, called its vocabulary. In our case, we include the English alphabet, numbers, and common punctuation symbols (the full list can be found here). For any given text, the computer breaks it down into individual characters and only processes symbols that are in its vocabulary, ignoring the rest. This is an important design decision that will influence the capabilities of our computer. One consequence is that it cannot recognize languages written in non-English scripts (e.g Cyrillic, Greek, Tamil).
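As a rough sketch, this filtering step might look like the following in Python. The vocabulary here (lowercase letters, digits, and a few punctuation marks) is an illustrative stand-in, not the game's full list:

```python
import string

# Illustrative vocabulary: a stand-in for the game's full symbol list.
VOCAB = set(string.ascii_lowercase + string.digits + " .,'?!")

def clean(text: str) -> str:
    """Lowercase the text and drop any character not in the vocabulary."""
    return "".join(ch for ch in text.lower() if ch in VOCAB)
```

Characters outside the vocabulary, such as accented letters or Cyrillic script, are silently dropped, which is exactly why the model can't recognize languages written in other alphabets.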
We now have a computer that can read paragraphs of English, but we need it to recognize the style expressed in these paragraphs. Style boils down to the choice of words used and their relative arrangement and we need a way to capture or model this. One approach is to examine how things of a similar style are produced and find common patterns to capture it.
In the world of art, paintings are often described as being of a certain style belonging to a period in time (e.g Renaissance, Modern) or a particular artist (e.g Botticelli, Caravaggio), and what characterizes them as such are details like the colors and materials used (e.g oil vs watercolor paints) and the subjects (e.g figures vs abstract). These details are used to classify new paintings as belonging to one style more strongly than others. You will note that it's certainly possible to blend styles, and classifying such paintings accurately becomes harder, especially as you increase the number of styles you want to recognize. In our game, we have limited ourselves to recognizing the writing style of three popular LLMs and content produced for a fixed prompt. This was done to make the problem more manageable, but the principles used can be extended to handle more general cases.
Taking inspiration from art, in order to characterise style in text, we need to find useful details and identify common patterns. To do this, we can first split paragraphs into individual characters and group them in threes, called trigrams. Each character can be part of multiple trigrams2, and each word could be made of multiple trigrams. Further, we can keep track of each trigram we come across and count how many times they occur in a piece of text. For example, mouse over a character in the sentence below to see a trigram it is a part of, and the number of times that trigram occurs in this sentence.
You might have noticed that symbols like spaces and periods are also part of trigrams, and that's because they are part of the vocabulary we defined earlier. Given a paragraph, we can break it down into trigrams and count the number of times each trigram occurs. Naturally, common three-letter patterns like 'the' or 'ion' will have higher counts in English paragraphs. Moreover, we are able to capture patterns within words and across word boundaries. The intuition here is that different styles will have different counts for the same trigram. For example, some styles may use more common triplets while others may use more rare triplets since they use rarer words (or punctuation).
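Counting trigrams takes only a few lines; this sketch assumes the text has already been reduced to the vocabulary:

```python
from collections import Counter

def trigram_counts(text: str) -> Counter:
    """Slide a window of width three across the text and count each trigram."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))
```

For example, `trigram_counts("the cat")` produces five trigrams: 'the', 'he ', 'e c', ' ca', and 'cat'. Note how 'e c' and ' ca' capture the word boundary, just as described above.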
Building our style benchmarks
So, we have the idea of using trigram counts to understand writing style. But to make comparisons, the computer first needs clear examples or reference texts for each LLM style we want to recognize. And who better to provide these stylistic signatures than the LLMs themselves? To create these benchmarks, I prompted three popular LLMs (ChatGPT, Gemini, and Claude) with the exact same prompt3 multiple times, to gather a collection of text of about 5000 words by each LLM.
As the original authors, these LLMs themselves would be the most accurate judges of style similarities, but consulting them directly is not possible as they are closed source. This means that direct inspection of their internal architecture, or of the precise data upon which they were originally trained (which could reveal their built-in 'counts' or stylistic patterns), is not possible. Therefore, by collecting the output of these LLMs in response to the same prompt, we can create an approximation of their stylistic tendencies within that specific context. This is similar to attempting to understand a particular artist's style by studying several paintings they created based on the same theme, instead of directly asking them (possibly since they aren't around anymore).
This also means that the reference texts collected directly affect our game's characterisation of writing styles. Firstly, the volume of data gathered is a key factor; generally, the more text collected from each LLM, the more accurately it will reflect its typical writing patterns. Secondly, prompt specificity plays a crucial role. Because all reference texts were generated from the same prompt, the game will be most adept at analyzing text written in response to similar prompts or on closely related topics. Its accuracy could diminish when faced with texts on vastly different subjects, as the stylistic nuances exhibited by the LLMs might change depending on the context.
Now, we can use trigram counts to represent paragraphs and capture their style. However, we still need a way to quantify how well a text fits a particular style to predict which LLM could have produced it. To do this, we can apply a simple trick to convert trigram counts into probabilities, which we use to express the confidence or likelihood of a text fitting a particular style. Let's explore an initial way to use these probabilities, and then a more refined approach.
Prediction
A simple approach is to calculate the probability of each trigram occurring in each reference text. For each of our LLM-generated reference texts, we calculate how often every possible trigram appears. For instance, if the trigram 't-h-e' shows up 5 times in an LLM's sample text that contains 1,000 total trigrams, its probability is 5/1000 = 0.005. By doing this for all trigrams, we create a kind of stylistic fingerprint, a 'probability profile', for each LLM.
Then, when a user writes their text, we break it into its own trigrams. To see how well it matches LLM A's style, we'd find the probability of each of the user's trigrams according to LLM A's fingerprint and multiply these probabilities together. The LLM whose fingerprint results in the highest product of probabilities would seem like the closest match. If the user's text uses many trigrams common in LLM A's writing, the combined score for LLM A will be higher.
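Sketched in Python, the fingerprint and the naive product score might look like this. One liberty is taken here: trigrams absent from the reference profile get a small floor probability (my assumption; without it, a single unseen trigram would zero out the entire score):

```python
from collections import Counter

def trigram_profile(reference: str) -> dict:
    """Convert raw trigram counts into probabilities (a stylistic fingerprint)."""
    counts = Counter(reference[i:i + 3] for i in range(len(reference) - 2))
    total = sum(counts.values())
    return {tri: c / total for tri, c in counts.items()}

def naive_score(text: str, profile: dict, floor: float = 1e-6) -> float:
    """Multiply together the profile probability of every trigram in the text."""
    score = 1.0
    for i in range(len(text) - 2):
        score *= profile.get(text[i:i + 3], floor)
    return score
```

The `floor` is a crude stand-in for proper smoothing; it only keeps the product from collapsing to exactly zero on unseen trigrams.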
However, this method has a couple of key issues. Firstly, it's very sensitive to the length of the user's text. Because each trigram's probability is a small number (less than 1), multiplying many of them together for a longer text results in an incredibly tiny final score. This makes it hard to compare texts of different lengths fairly. Secondly, this approach treats each trigram in isolation, as if 't-h-e' has no bearing on what might come next, which isn't quite how language works.
So, how can we improve this? Instead of just counting complete trigrams, we can think predictively. Given the first two characters of a potential trigram (a bigram), we could ask what the likelihood of the next character is. For example, if we see the characters 't-h', what's the probability that the following character will be 'e'? To calculate this conditional probability, we go back to each LLM's reference text. Every time we find the bigram 't-h', we look at the character that immediately follows it. If 'e' follows 't-h' frequently in LLM A’s text, then the conditional probability of 'e' given 't-h' will be high for LLM A’s style. For example, mouse over the different characters below to see their conditional probabilities for that bigram based on the reference text:
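The bigram-to-next-character counting just described can be sketched as follows (illustrative, not the game's actual code):

```python
from collections import Counter, defaultdict

def conditional_probs(reference: str) -> dict:
    """Estimate P(next character | bigram) from a reference text."""
    following = defaultdict(Counter)
    for i in range(len(reference) - 2):
        bigram, nxt = reference[i:i + 2], reference[i + 2]
        following[bigram][nxt] += 1
    # Normalize each bigram's counts into a probability distribution.
    return {
        bigram: {ch: c / sum(counts.values()) for ch, c in counts.items()}
        for bigram, counts in following.items()
    }
```

For the toy reference 'the then', the bigram 'th' is always followed by 'e', so P('e' | 'th') = 1.0, while 'he' is followed by a space half the time and by 'n' the other half.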
We can do this for all possible next characters and all possible bigrams. This approach inherently considers the sequence, acknowledging that characters influence what comes next. A text is then a 'good fit' for an LLM if it's composed of many such likely predictions. While this predictive approach is more robust, simply multiplying these conditional probabilities across a long text can still lead to vanishingly small numbers and sensitivity to text length. To get a single, reliable score that isn't thrown off by text length, we use a measure called perplexity. Perplexity is normalized by the length of the text and measures how 'surprised' a style model is by the text. If the model predicts the sequences in the user's text well (meaning the characters are ones it expects), the text is less surprising, and the perplexity score will be low. A lower perplexity indicates a better stylistic match, and this method overcomes both the length sensitivity and the issue of treating trigrams in isolation.
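Concretely, perplexity here is the exponential of the average negative log probability of each character given its preceding bigram. In this sketch, the floor probability for bigram/character pairs the model has never seen is again my own assumption:

```python
import math

def perplexity(text: str, cond_probs: dict, floor: float = 1e-6) -> float:
    """Exponential of the average negative log probability per character.
    Lower means the style model is less 'surprised' by the text."""
    log_sum, n = 0.0, 0
    for i in range(len(text) - 2):
        bigram, nxt = text[i:i + 2], text[i + 2]
        p = cond_probs.get(bigram, {}).get(nxt, floor)
        log_sum += math.log(p)
        n += 1
    return math.exp(-log_sum / n) if n else float("inf")
```

Because the log probabilities are averaged over the number of predictions before exponentiating, a 50-character text and a 5,000-character text are scored on the same scale.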
Putting it together
With the foundational concepts and building blocks now in place, we can see how our game puts it all together. When a new piece of text is provided, the computer breaks it down into trigrams and calculates the perplexity for each LLM's reference style. It then compares these perplexities to find the LLM with the lowest perplexity, indicating the most stylistic similarity. If you followed along this far, thank you for sticking with me and congratulations on building the core mechanics of an n-gram language model! In our case, since we're using three-character sequences (trigrams), the 'n' in n-gram is 3.
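The whole flow can be condensed into a toy end-to-end sketch; the reference texts and the floor probability for unseen sequences below are placeholders, not the game's real data:

```python
import math
from collections import Counter, defaultdict

def build_model(reference: str) -> dict:
    """Conditional P(next character | bigram) from a reference text."""
    following = defaultdict(Counter)
    for i in range(len(reference) - 2):
        following[reference[i:i + 2]][reference[i + 2]] += 1
    return {bg: {ch: c / sum(cnt.values()) for ch, c in cnt.items()}
            for bg, cnt in following.items()}

def perplexity(text: str, model: dict, floor: float = 1e-6) -> float:
    """Length-normalized surprise of the model at the text."""
    logs = [math.log(model.get(text[i:i + 2], {}).get(text[i + 2], floor))
            for i in range(len(text) - 2)]
    return math.exp(-sum(logs) / len(logs)) if logs else float("inf")

def classify(text: str, references: dict) -> str:
    """Return the name of the reference style least surprised by the text."""
    return min(references,
               key=lambda name: perplexity(text, build_model(references[name])))
```

In the game, `references` would map each LLM's name to its roughly 5,000-word collection of responses, and `classify` would be called on the user's submission.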
When you interact with the game, you might observe a few things. For instance, strategically using names that occur in the reference texts of specific LLMs in your writing (like Clover for ChatGPT, or Jamie for Gemini) might subtly nudge the stylistic analysis towards that particular model. Moreover, given the relatively modest size of the reference text used to teach the computer each LLM's style, its judgments won't always be definitive, even when analyzing text genuinely produced by one of these LLMs! This is because language generation is a dynamic process; LLMs themselves can vary their output for the same prompt due to internal settings (like 'temperature,' which controls the randomness of their word choices).
Furthermore, while insightful, our n-gram model is a simplification. It captures local patterns effectively but doesn't possess the deeper semantic understanding or grasp of long-range dependencies found in the sophisticated, large-scale architectures of modern LLMs like ChatGPT, Gemini, or Claude themselves. These advanced systems use far more complex techniques to understand context, meaning, and stylistic nuance over much more extensive passages of text. Nevertheless, this n-gram approach provides a transparent and understandable window into one of the foundational techniques in language modeling!
Thanks to Yogesh P for reviewing drafts of this post. View the source code on GitHub.
If you liked reading this post, let me know here
Footnotes
1. This is a real sentence by the way, which when worded more simply is "Calling things worthless in overly complex language is the peak of confusing ineffectiveness." ↩
2. Three at most, as the first, middle or last part. ↩
3. "Write a paragraph about a boy with a red ball and a puppy in a vivid, narrative style, using descriptive language and varied sentence structures. Include a surprising event and write as long as you can." ↩