Learning languages very quickly  —  with the help of some very basic Data Science

data-science.jpg

I have moved to Sweden 6 months ago with my girlfriend. It is a great country for expats like us, because almost everyone does speak really good English here. Even if it’s so, we would like to learn some Swedish, just to understand a bit more from the daily conversations and about the Swedish culture.

For me this is gonna be a hobby-language, so I don’t want to put too much effort in it. Let’s say, I’d like to spend top 10–20 hours and my target is to understand the 80% of the daily chit-chats. Is it doable? We will see! Here’s my approach — backed with some beginner-level data science.


The concept

What I’ll present here is a pretty unorthodox method, I guess. But sometimes you need to be creative if you want things to be easier, right?

What I did:

  1. I took two popular TV shows, Friends and How I met your mother.
  2. I got the Swedish subtitles of these TV shows — for all seasons, all episodes.
  3. Analyzed the most common words and sorted the words by the number of occurrences.
  4. I made a cut at the top 1000 words.

The question is, how far can I go, if I learn only the top 1000 words of a language. Or being more pragmatic: How many percent of Swedish conversations are covered by the top 1000 words of Swedish language? I have set up a short data script to find it out — and by the end of this article I’ll share it with you as well! But first, I’ll show you, how did I get my answer for this question. (Note: my method works with every European language.)

(UPDATE: since I published this article, I got some nice tweets on Twitter from folks, who had used this method before me! So it’s not so unorthodox after all! :-))


The code

The first step is to get the subtitles. It can be any and as much movies as you want. I picked Friends and HIMYM, because I have already watched both series in English before and I liked the vocabulary of them. But if you want to speak more like Rocky, get the subtitles of the Rocky movies. (An important note about subtitles! As far as I know, to use subtitles for non-profit research projects like this one is not illegal. But to stay on the safe side, I invite you to do the same, that I did: ask a native speaker friend to listen and type in the whole script for you instead of downloading. ;-))

Either way you go — once you have the subtitles, you will need a tool to do the text analysis part. I’ve described in this article step by step before, how to get, install and start with data coding tools like Python, R, SQL and bash. You could use any of those for this project.

I chose the simplest one: bash. It’s so simple, that the whole analysis needs actually only one line of code. On the screenshot below I broke it down to 14 lines just to make it more readable.

So here’s my exact bash code:

 Founder of Chineasy, Shaolan, delivering a talk at  TED - full speech here.

Line by line

What the code does? In general, the first 9 lines will be about data cleaning and the rest 5 will do the actual analysis. Here it is line by line:

  • Line 1: Read-in all the subtitle files, that I have put previously in my “friends” and “himym” folders on my data server. From now on you can imagine all the subtitles as one big temporary text file. Every further modification I’ll do below, will be processed on this “one big temporary text file”. Here’s a small sample of that:
02-line-code.png
  • Line 2: Remove all the lines, that starts with a number. (eg. the timestamps)
  • Line 3: Replace all the question marks, exclamation marks, dots and pipes with spaces.
  • Line 4: Replace all the multiple spaces with single spaces.
  • Line 5: Remove everything, but alphabetical characters, apostrophes and spaces.
  • Line 6: Remove all the unnecessary spaces from the beginning of the lines.
  • Line 7: Turn every spaces into line-breaks (means we will have one word per line in our file).
  • Line 8: Remove all the empty lines.
  • Line 9: Turn every uppercase characters into lowercase characters.

Now we have a clean list with all the words that were used in the TV shows. Every word appears as many times as it was used. Here’s a sample:

03-line.png
  • Line10: Sort the words into alphabetical order.
  • Line11: Make a list from every unique word and print the number of occurrences next to them.
  • Line12: Order this word-list by the number of occurrences. (The most occurrences the top of the list.)
  • Line13: Remove spaces from the beginning of the lines.
  • Line14: Print the top 1000 lines (top 1000 used words) into a file called 1000.csv.

Done! Don’t worry, if you don’t get the whole script immediately. Even if it’s a pretty beginner-level data script, obviously, you need some initial knowledge about bash to fully understand it. I’ve published 4 articles about the very basics of data coding in bash on my blog. Here you can read them, if you want to learn more:

Data Coding 101 — Introduction to Bash: ep#1ep#2ep#3, and ep#4


The results

We already have the top 1000 unique words, that we were looking for. Nice! The only question left that how much will I understand if I learn only this 1000 words? To answer it, I need to have two numbers:

  1. I need to count the total number of words (multiple used words counted multiple times).
  2. I need to count, how many times the top 1000 unique words occur in the subtitles.

Comparing these two numbers will give back the usage ratio of the TOP 1000 words.

Back to my code! To answer 1) I have to change Line 14 to this code:

awk ‘{sum += $1} END {print sum}’

It will give you the total word count: 1 131 360

Then I’ll do the same with my freshly created 1000.csv file…

awk ‘{sum += $1} END {print sum}’ 1000.csv

I’ll get the total occurrences of the top 1000 words: 964 682

This is just amazing! 964682/1131360= 85.26%

Yes, it means if I learn the top 1000 words from Friends and How I met your mother, I will understand the 85% of them!
04-ikea.jpg

Reality check! I shoot this picture at the IKEA on the weekend. All the words of the headline are understandable by learning only the top 200 words.

Another interesting stat! This chart below shows that how many percent (Y) you cover by learning the TOP (X) words.

04-results.png

The result: a beautiful logarithmic function on a line chart. (This is not a trend or a smoothened line, this is the actual result!) After all this means that even if you invest the same amount of time to learn the first 1000 words as the second 1000 words, the first 1000 words will help you to understand 85%, the second 1000 only 5%. And so on.


The next steps

Okay, I have to admit, vocabulary is not everything. So my next steps will be:

  1. Learn the top 1000 Swedish words.
  2. At the same time do the Swedish Duolingo course to get some grammar knowledge. (Note that Swedish looks pretty similar to German and English, so I guess I won’t have serious trouble to learn the grammar part.)
  3. Watch some movies with Swedish dub+subtitle, that I watched already in English or in Hungarian (that’s my native language by the way). This is good to match the pronunciation with the written format. And of course to learn some special expressions.

And then… we will see!

05-netflix.png

Don’t forget: my goal was to understand, not to fluently speak the language.


Disclaimers

To be frank I would not even call this project a “data science” project — as there is nothing very scientific in it. It’s just common sense and some text mining with basic data coding. And before NLP professionals start stoning me — I have to mention two known issues:

  1. This whole little language learning experiment of mine is based on a fair, but yet not proved assumption. I presumed, that the vocabulary of these TV shows is pretty similar to the vocabulary of the daily discussions. Maybe this is just not true. Maybe. But it looks so logical, that I’ve just decided to take this minimal risk anyway.
  2. My bash script I’ve presented above is not 100%. Eg. I didn’t implement any stemming algorithm, that reduces the words like “cooking”, “cooked” or “cooks” to the root format: “cook”. My estimation is, that these kind of issues cause a top 5% inaccuracy in my results. And then I just asked myself: What’s the ROI (return-of-investment) here? Is that 5% accuracy really worth the extra effort? The answer was: definitely not.
    I was looking for a quick and dirty method to learn a language, so it’s kinda self-explanatory that the script I wrote for that should be quick and dirty as well. If I had spent days (and not minutes) with optimizing my data script to 99.9% (and not to 95%), the whole project would have just lost the main objective of it: be time-efficient.

And one more note. I haven’t tested this method yet and until I do, it will remain a hypothesis. But the learning-experiment starts now, and I promise, that I’ll let you know, how it worked! (UPDATE: I’ve learnt the top 200 words — and I already feel much more confident with Swedish! :-)) Also, I invite you to be a part of this and share your experience, if this method worked for you or not.

Here are the top 1000 words in Swedish and in English. If you want to do the same for other languages, feel free to use my code.


Conclusion

You can apply data in almost every segment of your life to do things smarter. Be creative and take advantage of the fact that there is so much available data around us — and it is so easy to analyze it and turn it into meaningful findings! And always stay critical, think about the return of investment and don’t be afraid to do quick and dirty things!

If you want to learn more about data science, check my blog (data36.com) and/or subscribe to my Newsletter! And also check my new data coding tutorial series: SQL for Data Analysis!

Thanks for reading! Enjoyed the article? Please just let me know by sharing below. It also helps other people see the story!

Tomi Mester
author of data36.com
Twitter: @data36_com


Written by Tomi Mester: Data analyst & researcher. “Data helps you to understand your customers.” Focusing on #datadriven #startup #ecommerce #bigdata #analytics