
Echoes of English: Exploring language similarities and differences

Around two-thirds of the UK population speak only English, and while many say they don’t feel the need to learn another language, multilingual skills can support a wide range of learning and communication abilities. The truth is, far more languages resemble English than we might initially realise.

Here at Verve Search, working with a multi-national and multi-lingual team, we know these benefits first-hand. Bringing a huge range of viewpoints when brainstorming concepts for clients – from the most lucrative foreign languages to Spain’s most beautiful road trips – producing campaigns about culture, languages, and linguistics comes naturally.

So, to show off the benefits of bilingualism, we’ve analysed which languages are the closest to English (and hence likely to be the easiest for English speakers to learn, too).

Investigating spelling, pronunciation and even using some maths, we can reveal the best languages to start your learning journey with below…

Key findings

In this study, we analysed which languages are closest to English by measuring the similarity of selected language features. We used a range of natural language processing (NLP) methods to do this.

We found:

  • Scandinavian languages (Norwegian, Danish, Swedish) are the most similar languages to English, topping the board with their pronunciation and spelling.
  • Finnish is the most different in all three categories, making it the hardest for English speakers to learn.
  • Dutch is the closest in terms of phonetic sounds, whilst Finnish and Turkish are the most different when spoken.
  • Looking at the 1,000 most used words, ‘Radio’ has the most consistent spelling and pronunciation across all languages studied.

Methodology

There are three main elements to our data process. To summarise, we:

1. Gathered a list of the 1,000 most common words in the English language.

2. Translated each word into multiple languages using the Google Translate API.

3. Compared each translated word to its English equivalent to measure similarity.

Things to remember:

  • We analysed the most widely spoken languages in Europe that use the Latin alphabet. Where a language has some additional characters, these were still included.
  • Non-Latin alphabets are not compatible with this type of analysis, so languages such as Greek and Ukrainian were excluded, as they use the Greek and Cyrillic alphabets, respectively.
  • Stopwords (e.g. ‘and’, ‘I’, ‘the’…) have been excluded.
  • Disclaimer: This study analyses the similarity of individual words across languages, rather than conversational coherence or fluency. Language features around grammar (e.g. verb conjugations) and sentence structures are not considered.
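As a rough illustration of the filtering described above, the sketch below drops stopwords and checks whether a word is written in Latin script. The stopword set is a tiny illustrative sample and the function names are hypothetical; the study does not publish its actual list or code.

```python
import unicodedata

# Illustrative stopword sample only; the study's actual list is not given.
STOPWORDS = {"and", "i", "the"}

def is_latin_script(word: str) -> bool:
    """Return True if every alphabetic character is a Latin letter (accents allowed)."""
    return all(
        unicodedata.name(ch, "").startswith("LATIN")
        for ch in word
        if ch.isalpha()
    )

def keep_word(word: str) -> bool:
    """Keep a seed word only if it is not a stopword."""
    return word.lower() not in STOPWORDS

seed = ["the", "radio", "and", "ocean"]
print([w for w in seed if keep_word(w)])  # ['radio', 'ocean']
print(is_latin_script("océan"))           # True
print(is_latin_script("ωκεανός"))         # False (Greek alphabet)
```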

How we measured the similarities of words

Now, let’s dive into the nitty-gritty of this study. We investigated two key features of words to understand their similarities and differences: their orthography and their phonetics.

First up, we have the orthographic distance between words.

“Orthography differences (spelling of words) measure how different spelling between languages is, considering alphabets, characters, and accents.”

To do this, we calculated the ‘Levenshtein distance’ between each English word in our seed list and its translated version. Bear with us here.
 
Synonymous with edit distance, the Levenshtein distance calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string (word) into another.
 
To break it down, ‘cat’ and ‘cut’ have a distance of 1, as 1 single-character substitution is required to match each word.
 
Whereas the distance between ‘hello’ and ‘halo’ = 2, as 1 substitution and 1 deletion are required to match each word.
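For readers who want to try this themselves, here is a minimal sketch of the standard dynamic-programming Levenshtein calculation. It is a textbook version written for illustration, not necessarily the exact code used in this study.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] is the distance between the part of `a` seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("cat", "cut"))     # 1
print(levenshtein("hello", "halo"))  # 2
```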

Boxes showing the Levenshtein difference by moving letters.

So, with learning languages in mind, we’ve allowed accents with the same character symbols to be considered identical, only for this orthography part of the analysis. For example, ‘Ocean’ and ‘Océan’ will have a Levenshtein distance of 0, as the accented character is considered the same as the original character. Still with us?
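One possible way to treat accented characters as their base letters is Unicode decomposition, sketched below. The function name `fold_accents` is hypothetical, and this is an assumption about how such folding could be done rather than a description of the study's own code. Note that letters without a Unicode decomposition, such as ‘ø’ or ‘ł’, pass through unchanged with this approach.

```python
import unicodedata

def fold_accents(word: str) -> str:
    """Strip combining accent marks so that 'é' compares equal to 'e'."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_accents("Océan"))             # Ocean
print(fold_accents("Océan") == "Ocean")  # True, so the edit distance is 0
print(fold_accents("ø"))                 # ø (no decomposition, left as-is)
```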

Let’s move on to the phonetic differences between words.

“Phonetic differences (verbal) measure the difference in pronunciation between languages. This includes individual phonemes as well as accent and tone emphasis.”

Doing this gets a little bit technical. We used a method called the Double Metaphone algorithm, plus a few more NLP steps. This allowed us to measure the difference between the original English word and its pronunciation in a different language by comparing encodings of the sounds in each word.

Example of phonetic transcriptions between words including this, book, wall and which. Credit: ResearchGate

Firstly, we generated Double Metaphone encodings for both words (the English word and its translated counterpart, for each language) to represent how each word sounds.

Then, we measured the distance between the two encodings through Levenshtein and maximum distance calculations. This distance was normalised and used as a similarity score for each word pair.
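The normalisation step can be sketched as follows. This assumes the score is one minus the Levenshtein distance divided by the length of the longer encoding, scaled to 100; the exact formula is not stated here, so treat it as one plausible reading. The phonetic codes in the example are made-up placeholders, not real Double Metaphone output, which would come from a separate library.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,
                current[j - 1] + 1,
                previous[j - 1] + (char_a != char_b),
            ))
        previous = current
    return previous[-1]

def phonetic_similarity(code_a: str, code_b: str) -> float:
    """Similarity of two phonetic codes on a 0-100 scale (assumed formula)."""
    longest = max(len(code_a), len(code_b))  # largest possible edit distance
    if longest == 0:
        return 100.0  # two empty codes are treated as identical
    return 100.0 * (1 - levenshtein(code_a, code_b) / longest)

# Placeholder codes for illustration only.
print(phonetic_similarity("KT", "KT"))             # 100.0
print(round(phonetic_similarity("KT", "KTS"), 1))  # 66.7
print(phonetic_similarity("KT", "PL"))             # 0.0
```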

And breathe. That’s all for our method, but just a note on our scoring:

Our phonetic similarity score ranges from 1 to 100:

  • High Scores (70-100): The words sound very similar or phonetically close.
  • Mid Scores (30-70): Some phonetic characteristics are shared, but the words are not very similar. A score of 50 indicates that the words have a balanced mix of similar and dissimilar phonetic features.
  • Low Scores (1-30): The words are quite different, phonetically.
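The bands above could be expressed as a small helper like the one below. Because the listed ranges share the values 30 and 70, this sketch assumes a boundary score belongs to the higher band; the function name `score_band` is hypothetical.

```python
def score_band(score: float) -> str:
    """Map a phonetic similarity score to the band it falls in."""
    if score >= 70:
        return "high"  # words sound very similar
    if score >= 30:
        return "mid"   # some shared phonetic characteristics
    return "low"       # words sound quite different

print(score_band(85))  # high
print(score_band(50))  # mid
print(score_band(12))  # low
```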

We know, it’s a bit of a mouthful. But it may make more sense once we apply it to real data. Let’s see what our results show…

Analysis

Overall language similarity: Which languages are the easiest to learn?

We found that Scandinavian languages were the most similar to English, taking all 3 top spots. Norwegian came in first, followed by Danish and Swedish.

English speakers should find these languages the easiest to pick up, thanks to the high proportion of familiar spellings and pronunciations.

Wondering why languages from this region register as the most similar? It goes back to the Vikings!

The Norwegian Viking invasions and settlement in England led to a significant Old Norse influence on Old English, introducing many words and impacting grammar. You’ll see this from words like ‘muck’, ‘skull’, ‘knife’ and ‘die’. Looks like they were having a particularly malicious time during this period…

However, whilst these Scandinavian languages topped the table, another Nordic language actually ranked last: Finnish.

This language differs primarily because it belongs to the Finno-Ugric branch of the Uralic language family, distinct from the Indo-European family that includes English and most other European languages. To put it simply, Finnish has fundamentally different roots.

Table showing overall language similarities.

Orthography similarity (written): Which languages have the closest written vocabulary to English?

In line with the overall index, Scandinavian languages take the top places for their written similarities too. In fact, Scandinavian dialects took the top four places for this ranking.

With an average Levenshtein distance of 3.85, Norwegian words are the closest to their English counterparts: fewer than four single-character edits apart on average. The next two languages here are Danish (3.90) and Swedish (3.94).

Written Finnish (once again) and Polish take the crown for being the furthest away from English, with whopping average Levenshtein distances of 5.73 and 5.64 respectively. This means the average word in these languages requires roughly 5.6 to 5.7 single-character edits to match its English translation.

Anyone who does speak Polish will know its vocabulary is largely distinct from English, with far fewer cognates. Although Polish has borrowed some terms from Latin, German, and other languages, its core vocabulary doesn’t align with that seen in English, making it a lot more difficult if you’re trying to learn.

Table showing orthography similarity between languages.

Phonetic similarity (verbal): Which languages sound the closest to English?

Phonetically speaking, we measured Dutch as the closest language to English, with an average phonetic similarity score of 48.2 out of 100. Because Old English and Old Dutch were both West Germanic languages, their evolution from these common roots means they retain many phonetic similarities.

On the other side of the table, you’ll find Finnish last once again, but this time followed closely by Turkish. The two languages score averages of just 21.3 and 23 out of 100, respectively.

What makes Turkish so different to English? That’s down to its set of phonemes and phonological rules. For example, Turkish has vowel harmony, where the vowels within a word harmonise to be either all front or all back vowels – a feature that’s not present in English.

Table showing phonetic similarity between words and languages.

Which words are the most similar across all studied languages?

Of the 1,000 English words analysed across 13 languages, the top three words with the most consistency in spelling and pronunciation are ‘Radio’, ‘Atom’ and ‘Dollar’.

‘Radio’ comes top with the same spelling across all languages studied, except in Turkish (‘Radyo’). The radio was invented in the late 19th century (1894), a time when technological advancements and global communication were becoming more interconnected. After this, the term ‘radio’ was adopted quickly around the world to describe this new technology, making it a lot easier to pick up across languages.

‘Atom’ takes second place, coming from the Greek word ‘atomos’ which means ‘indivisible’. It was adopted into scientific vocabulary in the 19th century, and with science being a global discipline, the term was retained in its original form across many languages.

‘Dollar’ rounds out the top three. Its consistency across languages is due to its historical origins in the European ‘thaler’, its widespread use in global trade and finance, and the influence of the U.S. dollar as a primary reserve currency.

Table showing the most similar words across languages.

Conclusion

Our Echoes of English analysis found that Scandinavian languages – particularly Norwegian, Danish, and Swedish – are the most accessible for English speakers to learn due to their high degree of similarity in both vocabulary and pronunciation.

These insights offer guidance for bilingual-curious English speakers to understand which languages will be the easiest to pick up, on an objective scale. This study also emphasises the importance of considering orthographic and phonetic aspects when evaluating language learning difficulty, aiding learners, language learning platforms, and language teachers.

Whilst Scandinavian tongues topped the tables, languages like Finnish and Turkish present the greatest challenges due to their significant linguistic differences in both spelling and pronunciation.

By analysing the 1,000 most common English words with both Levenshtein distance for orthography and Double Metaphone encoding for phonetics, this study offers a comparative analysis of language similarity. It also highlights the words that stay most consistent across languages: ‘Radio’, ‘Atom’ and ‘Dollar’.

This underscores the historical and linguistic connections that make language learning easier, such as the impact of Old Norse on English and the shared Germanic roots of English and Dutch.

Glossary

Accents: Difference in pronunciation specific to regions or groups within a language, often marked by different intonation and sound patterns.

Cognates: Words in different languages that have a common etymological origin and similar meanings.

Double Metaphone: An algorithm used in natural language processing to encode words by their phonetic pronunciation. Helpful in comparing how words sound across different languages.

Language Similarity Score: A composite measure of how similar a language is to English, based on both orthographic and phonetic analyses.

Latin Alphabet: The writing system originally used by the Romans, which is the basis for the alphabet used in English and many other languages.

Levenshtein Distance: A measure of the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another. Used to assess orthographic similarity between words.

Natural Language Processing (NLP): A field of artificial intelligence focused on the interaction between computers and human (natural) languages, involving the analysis and synthesis of language data.

Old English: The earliest form of the English language, spoken in England from roughly the 5th to the 11th century.

Old Norse: The North Germanic language spoken by the inhabitants of Scandinavia during the Viking Age, which influenced the development of Old English.

Orthography: The conventional spelling system of a language, including alphabet, characters, and accents.

Phonemes: The smallest units of sound in a language that can distinguish words from each other.

Phonetic Similarity: The degree to which words sound alike when pronounced, analysed using the Double Metaphone algorithm.

Similarity Index Score: A composite measure of how similar a language is to English, based on both orthographic and phonetic analyses.

Stopwords: Commonly used words (e.g., ‘and’, ‘the’, ‘I’) that are often filtered out in language processing tasks because they carry less meaningful content.

Xenoglossophobia: The fear of learning or using foreign languages.

