From number names to language families
This post is a continuation of a previous post: On sorting numbers alphabetically in different languages and other absurdities, where we briefly explored number names from 1 to 10 across +5000 languages and learnt some curiosities along the way. That said, you don’t need to read it to understand this one; the two are largely independent.
Given the title and description of the post, you can imagine what we will see today. The idea is to try some classification, clustering and dimensionality reduction techniques to see what these 10 words can tell us about different language families.
As I mentioned in the previous post, I’ve also created a Shiny app where you can visualize +3800 languages in an interactive 3D plot. I think it’s pretty cool. I highly recommend checking it out: https://olafmeneses.com/apps/LangNet
Before we start getting into the matter, let’s take a look at the data. For each language, we have the following information:
- Language name
- Whether it is extinct or not
- Whether it has more than 1 million speakers or not
- The parent language
- Group 1-7: Group 1 corresponds to the language family. Group 2 is the subgroup within Group 1. Group 3 is the subgroup within Group 2… Each subsequent group is a subset of the previous one.
Data is taken from (Rosenfelder). For example, in the case of English, we would have:
We can have a look at all the languages in the following table, where we show the language, its family and the names of its numbers.
The Big Q
With numbers 1-10, just 10 words, can we classify languages into known language families?
We will proceed as follows:
- Identify different distance functions to apply to our data (a dataset of +3000 languages)
- For each distance, calculate the distance matrix
- Based on k-nearest neighbors (k-NN) classification (since we know the correct groups), determine which distance method works best and compute its distance matrix
- Perform Multidimensional Scaling (MDS) and tSNE to visualize the data
- Generate an interactive visualization of the k-NN graph
- Construct a dendrogram to reveal hierarchical language relationships (only with the Indo-European language family)
Note: k-NN classification relies on known language family groupings. We keep only languages with an identified family, excluding the Almean, Constructed languages, and Pidgins and Creoles families because of their artificial or mixed origins. We also exclude the entries named written or numerals, since they represent the written form of their respective languages.
Which distance is better?
My initial idea (carried over from the previous post) was to compare languages by looking at the positions of their numbers when sorted alphabetically. I didn’t have much hope that it would work well. It didn’t. At all. This distance is essentially random and retains little relevant information. However, I think that stupid ideas sometimes (and only sometimes) lead you to interesting places.
After the initial failure, I tried with a well-known distance for strings: Levenshtein distance.
The Levenshtein distance counts the minimum number of single-character edits (insertions, deletions, or substitutions) needed to change one string into the other. So, the smaller the Levenshtein distance between two strings, the more similar they are.
Let’s now see an example of why we should modify it a little to define a new distance. If you have the strings “hello” and “hello”, the Levenshtein distance is 0. That makes sense. But here’s the problem:
| String 1 | String 2 | Levenshtein Distance |
|---|---|---|
| hello | hellozzzzz | 5 |
| hello | azxyp | 5 |
Although both pairs have the same Levenshtein distance of 5, would you say the strings in each pair are equally distant (equally similar/dissimilar)? We can address this by normalizing the distance.
To do this, we divide the computed distance by the maximum length of the two strings s1 and s2 being compared, \(\max(length(\text{s1}),\,length(\text{s2}))\), where length is the number of characters of each string. The minimum distance is then 0, while the maximum is capped at 1:
| String 1 | String 2 | Normalized Levenshtein Distance |
|---|---|---|
| hello | hellozzzzz | 0.5 |
| hello | azxyp | 1 |
This way, we can compare distances between strings of different lengths more fairly.
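As a quick illustration, here is a minimal sketch of this normalized distance in R, using the stringdist package (base R’s `adist()` would work just as well):

```r
library(stringdist)

# Normalized Levenshtein: edit distance divided by the length of the longer string
normalized_lev <- function(s1, s2) {
  stringdist(s1, s2, method = "lv") / pmax(nchar(s1), nchar(s2))
}

normalized_lev("hello", c("hellozzzzz", "azxyp"))
#> [1] 0.5 1.0
```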
We will compare different distance functions: the one based on alphabetical ordering, DL (Damerau-Levenshtein), Levenshtein and OSA (Optimal String Alignment). The last three are string distance functions built on Levenshtein’s idea. We will also compare the unmodified and normalized versions.
As we already said, we will calculate the accuracy with k-NN. First, we will see for which value of \(k\) we get the highest mean accuracy. As we can see in the table below, the best option is \(k=1\).

For this value of \(k\), the next plot shows which distances achieve the highest accuracy. The normalized versions are slightly better than the unmodified ones. The accuracy of the distance based on alphabetical sorting is awful.
Finally, we show the mean accuracy for each distance.
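For the curious, here is a rough sketch of how this comparison could be done in R with the stringdist package. Note the assumptions: a data frame `langs` with a `family` column and number-name columns `n1`–`n10` (made-up names), and a language-to-language distance defined as the average of the normalized per-number-word distances; the post doesn’t spell out these details, so treat this as one possible implementation, not the exact code behind the results.

```r
library(stringdist)

# Sketch: distance between two languages = mean of the normalized
# per-number-word distances (aggregation by mean is an assumption)
lang_dist <- function(words, method = "dl", normalize = TRUE) {
  n <- nrow(words)
  D <- matrix(0, n, n)
  for (j in seq_len(ncol(words))) {
    w <- as.character(words[[j]])
    d <- stringdistmatrix(w, w, method = method)
    if (normalize) d <- d / outer(nchar(w), nchar(w), pmax)
    D <- D + d
  }
  D / ncol(words)
}

# Leave-one-out 1-NN accuracy: does the closest other language share the family?
knn1_accuracy <- function(D, family) {
  diag(D) <- Inf                      # a language can't be its own neighbour
  nearest <- apply(D, 1, which.min)
  mean(family[nearest] == family)
}

D <- lang_dist(langs[, paste0("n", 1:10)], method = "dl")
knn1_accuracy(D, langs$family)
```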
Best distance
The distance we will use from now on is the normalized version of Damerau-Levenshtein. The accuracy scores for the different family/subfamily levels can be found in the next table. Keep in mind that this method cannot correctly classify languages that are the only sample of their family (for example, Basque, a language isolate), since with 1-NN their closest neighbor necessarily belongs to a different family. Because of this, correct classification is achievable for at most 3557 out of the 3567 languages in this set.
Why do we get such good results?
When I first saw these accuracy results, I was a little bit surprised. I expected the accuracy for the first-level family to range between 0.6 and 0.8. Instead, we correctly classified 3273 out of 3567 languages (91.76%).
I had some intuition about why it worked so well, and my curiosity prompted me to investigate further. I stumbled upon (Calude 2021), which sheds light on the matter:
Lexical replacement rates vary enormously among words and among languages. In words that linguists believe to be the least rapidly changing within a given language, namely words that designate basic vocabulary terms, like foot, green, man, dirty, husband, wife, mother, and including numbers one through to five—a collection of words termed the Swadesh List (named after Morris Swadesh, who formulated various such lists)—rates of lexical replacement can still vary between word-forms as much as 100-fold [7, p. 8]. But number words stand out as being among the most conservatively preserved word-forms even in such basic vocabulary lists [7]. Remarkably, in the Indo-European language family, a single cognate set can be traced throughout its entire history, indicating astonishing agreement across speakers and time [7]. Put another way, speakers of Indo-European languages have preserved ancestral forms for low-limit numbers with extreme fidelity over thousands of years of language change.
This led me to the referenced article [7] (Pagel and Meade 2017), which I highly recommend reading. In it, the authors propose three hypotheses to explain the unusual conservation of number names (I have consistently used the expression number names but the authors refer to them as number words):
- Evolutionarily conserved brain regions associated with numerosity (somehow) influence the learning and use of linguistic-symbolic number words
- Number words are unambiguous in their meanings and therefore less likely to admit alternatives
- Number words occupy a region of the phonetic space that is relatively full
While all three hypotheses are valid, the second one seems the most reasonable to me. An explanation I liked from (Calude 2021):
All the evidence thus shows that low-limit numbers in languages that have productive higher numbers behave in a stable, uniform manner across large time scales and varied speaker populations. The question is, why are these low-limit numbers so resistant to change? A highly plausible hypothesis comes from the lack of variation in the system [7]. Owing to their concrete and specific meanings [32], there is less room for near-synonyms to develop and even when they do, these remain context-restricted and low in frequency (compare twelve with dozen), leading to fixation. This is precisely what was observed of the LAMSAS and LAGS American English data [31]. The findings also support a more general law of semantic change, the Law of Innovation, proposed by Hamilton et al. [33], which contends that polysemous words tend to change their meanings faster, showing Social Conformist Bias effects in language change. Yet, it is still unclear what keeps the variation among number words so low; why do we entertain various words for parlour but only one for three?
That final question is key. Words describe what we observe or imagine (a reality). The word house could be used to describe thousands of realities (from a small cabin in the woods to a skyscraper in the city), so it describes a highly variable concept. To reduce ambiguity, we need synonyms that better fit the reality we are trying to describe (like mansion or apartment). Number names, however, describe the quantitative dimension with very little ambiguity. This means that, when describing quantities, we rarely need a synonym.
I must say, though, that this extends beyond numbers. Consider the names of days and months, which are human-created conventions for organizing and navigating time. These terms make precise communication possible, yet they are neither necessary nor innate: humans can live without precise counting or timekeeping. Another question arises: how far could a society progress without them? We’ll leave that for another day.
Apologies for this boring dissertation; let’s get back to our problem.
If the names of numbers 1-10 change so slowly, it seems feasible to find a neighbor within the same language family using the Levenshtein distance (or a similar one). And the more samples we have from a particular language family, the more probable it becomes that the closest neighbor of a given language belongs to the same family.
Some insights into the misclassification
In the following table we can see ten of the families and their classification accuracy, sorted by the number of languages. It seems to have worked for almost all families. But why is there such an abrupt difference with Indo-Pacific? While for families like Austronesian or Indo-European we achieved almost 100% accuracy, for this one we couldn’t even reach 60%.
After reading the Wikipedia article about Indo-Pacific languages, it turns out this is a hypothetical language family that is not accepted by specialists. We can see some of the incorrect predictions for the Indo-Pacific family in the table below. Most of the closest neighbors identified belong to the Austronesian language family. Even though these predictions count as wrong under the reference classification, they arguably make more sense:
Dimensionality reduction
In order to visualize the different clusters corresponding to the language families, we will use two dimensionality reduction techniques: MDS and tSNE. We will not use the full set of +3000 languages, but a smaller one of 178 languages with more than 1 million speakers.
I will not explain how MDS or tSNE (Maaten and Hinton 2008) work (for that, I recommend this and this), since it’s not the main goal of this post.
MDS
Using MDS, it appears feasible to distinguish families with a larger number of languages, such as Indo-European and Niger-Congo, while the remaining languages are clustered together. It’s important to note that in MDS, the x, y, and z axes lack an interpretable meaning.
Note: You can click on any image to make it bigger.
Here’s a 3D plot that you can interact with. You can observe that adding a third dimension helps to separate some clusters like Sino-Tibetan or Tai (yellow/orange colors). Still, there are many points clustered in the center.
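If you want to reproduce something like this, classical MDS from a precomputed distance matrix is essentially a one-liner in R. Here I reuse the distance matrix `D` from the earlier sketch (restricted to the 178-language subset); again, this is just a sketch, not the exact code behind the plots:

```r
# Classical (metric) MDS into three dimensions, starting from the distance matrix D
mds_coords <- cmdscale(as.dist(D), k = 3)

# Quick 2D view of the first two dimensions, coloured by language family
plot(mds_coords[, 1], mds_coords[, 2], col = factor(langs$family),
     xlab = "Dim 1", ylab = "Dim 2", main = "MDS of number-name distances")
```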
tSNE
With tSNE, it seems that we can capture both the local and global structure of the data. The different language families are more clearly separated from each other and, moreover, families that are somewhat related (Tai and Sino-Tibetan) end up close to each other. As in MDS, the x, y, and z axes do not have an interpretable meaning.
In the 3D plot, it becomes easier to assess how well tSNE performed.
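Again as a sketch (not the exact code used for these figures), the Rtsne package can work directly from a precomputed distance matrix; the perplexity below is just a reasonable default for roughly 178 points:

```r
library(Rtsne)

set.seed(42)  # tSNE is stochastic; fix the seed so the layout is reproducible
tsne_fit <- Rtsne(as.matrix(D), dims = 3, is_distance = TRUE, perplexity = 30)

# 2D view of the first two tSNE dimensions, coloured by language family
plot(tsne_fit$Y[, 1], tsne_fit$Y[, 2], col = factor(langs$family),
     xlab = "tSNE 1", ylab = "tSNE 2", main = "tSNE of number-name distances")
```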
k-Nearest Neighbors graph
If you prefer a simpler interface, you can explore a k-NN graph here. There are selectors to highlight family groups or languages, and you can zoom in to observe closely related languages. We use the same point layout from tSNE, but with some jitter added to prevent the points from overlapping with the language name labels.
I highly recommend interacting with it: click on any language, and it will highlight its neighbors. As in previous examples, we use \(k=1\).
Hierarchical clustering
We carried out a hierarchical cluster analysis using the McQuitty agglomeration (linkage) method to explore the linguistic relationships within the Indo-European language family. The decision to focus on Indo-European languages was intentional for two reasons. Firstly, the Indo-European language family stands out as one of the most thoroughly researched and widely recognized language families in linguistics; therefore, we have greater certainty about the established subfamilies. Secondly, by selecting a reduced subset, we simplify the visualization.
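In R terms, this boils down to a call to `hclust()` with the mcquitty method. Here is a minimal sketch, where `D_ie` stands for the normalized Damerau-Levenshtein distance matrix restricted to the Indo-European languages (the name and the plotting details are mine, not the exact code used for the dendrograms below):

```r
# McQuitty (WPGMA) agglomerative clustering on the Indo-European distance matrix
hc <- hclust(as.dist(D_ie), method = "mcquitty")

# Basic dendrogram; colouring the leaves by subfamily (as in the figures below)
# can be done with a package such as dendextend
plot(as.dendrogram(hc), horiz = TRUE, main = "Indo-European number names")
```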
Note: Every dendrogram will be preceded by a legend, providing clarity on the color scheme used to represent different language subfamilies.
For this first dendrogram, we color-coded each language based on its first subfamily. By visualizing the dendrogram in this manner, we gain insights into the higher-level relationships among language subfamilies, such as Germanic, Romance, Slavic, and others.
In this second dendrogram, we opted to color-code each language according to its deepest possible subfamily within the Indo-European family tree. For instance, if two languages belong to the same subfamily but diverge at deeper levels, they will be represented with distinct colors in the dendrogram. This approach tries to show in finer detail how Indo-European languages are connected and how they’ve evolved. Although the dendrogram is not perfect, I would say it is quite good and approximates the established subfamilies.
A look back
As we conclude this journey, it’s amazing to think back to how it all started: a simple note written down while I was on the metro, as mentioned in On sorting numbers alphabetically in different languages and other absurdities. Now, after more than a month of work, it culminates in two detailed posts and a Shiny app. I’m more than happy with the results I got. Are they gonna change the world? Probably not. Did I enjoy the process of reading about number systems in different languages? Absolutely! And I hope that you enjoyed it too.
Special thanks to Mark Rosenfelder. Without his extensive work in compiling number names from over 5000 languages, this would have been way less interesting.
Let me repeat a sentence from this post (but now using the quotation style, as if it made it sound smarter):
Stupid ideas sometimes (and only sometimes) lead you to interesting places.
Don’t forget to check out the Shiny app! https://olafmeneses.com/apps/LangNet.
Feel free to comment below or email me at menesesolaf@gmail.com with any questions or suggestions. See you in the next post!