Note of data science training EP 11: NLP & Spacy – Languages are borderless



Computers are capable of learning human languages.

Natural Language Processing (NLP)

NLP is a set of methods for turning human language into data that we can analyse. For instance, “I Love You” can be labelled as “positive”, “romantic”, and “sentimental”.

One basic term is “Tokenization”, which means splitting a text into words (tokens). We understand what we hear by combining the meanings of all the words, and so does a computer.
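To make the idea concrete, here is a minimal sketch of the simplest possible tokenization, splitting on whitespace with plain Python. Real tokenizers such as Spacy's handle punctuation and many special cases; this is only an illustration.

```python
# Naive tokenization: split a sentence into words wherever there is whitespace.
sentence = "I Love You"
tokens = sentence.split()
print(tokens)
```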

Python has many libraries for this task. One is Spacy.

Spacy

This problem comes from my final project: predicting a cartoon's rating from its name. The steps are to split the names into tokens, transform the tokens into numbers, and use a Random Forest estimator as the predictor.

Let’s go.

1. Install

Find Spacy package here.

2. Prepare a dataset

The dataset is from Kaggle via this link.

3. Import libraries and files

Import Pandas and load the file with .read_csv().
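A minimal sketch of this step is below. The rows are made up and read from an in-memory string so the snippet runs on its own. With the real Kaggle file we would pass its path instead, and the file name "anime.csv" in the comment is an assumption.

```python
import io
import pandas as pd

# Made-up stand-in for the Kaggle CSV; the real file has more columns and rows.
csv_text = io.StringIO(
    "name,rating\n"
    "Example Show A,7.5\n"
    "Example Show B,\n"
)

# With the real file this would be something like pd.read_csv("anime.csv") (assumed name).
anime = pd.read_csv(csv_text)
print(anime)
```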

4. Import spacy

As the dataset is in English, we need the Spacy English model "en_core_web_sm". After downloading the model, we load it with spacy.load(), which returns a pipeline object. At this step, we can use that object to tokenize (split into words), as in the figure below.

We can display tokenized text with .text and their parts of speech with .pos_.
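Here is a small sketch of loading the model and printing each token. The sample title is made up, and the variable name `processor` is chosen to match the code used later in this note. The fallback to a blank English pipeline is an assumption added so the snippet still runs when the model is not installed. In that case only tokenization works and .pos_ is empty.

```python
import spacy

try:
    # Small English pipeline; install with: python -m spacy download en_core_web_sm
    processor = spacy.load("en_core_web_sm")
except OSError:
    # Fallback for this sketch only: tokenizer without a tagger, so pos_ is "".
    processor = spacy.blank("en")

doc = processor("Example Hero: Second Season")
for token in doc:
    print(token.text, token.pos_)

texts = [token.text for token in doc]
```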

5. Custom tokenization

We want only letters and numbers, not special characters, so we improve the tokenizer with a regular expression in this function.

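import re

def splitter(val, processor):
    # Keep only runs of ASCII letters and digits, lowercased.
    pattern = r'[0-9a-zA-Z]+'
    # processor(val).text is the text of the Spacy document built from val.
    return [r.group().lower() for r in re.finditer(pattern, processor(val).text)]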

[0-9a-zA-Z]+ captures only digits (0 to 9), lowercase letters (a to z), and uppercase letters (A to Z). The plus sign (+) means each capture is one or more of these characters.
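To see what the pattern does on its own, here is a quick check with a made-up title. It uses only the standard re module, without Spacy.

```python
import re

pattern = r'[0-9a-zA-Z]+'
sample = "Robo-Cat!! 2nd Season"  # made-up title

# Every run of letters/digits becomes one token; symbols and spaces are dropped.
tokens = [m.lower() for m in re.findall(pattern, sample)]
print(tokens)
```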

6. Tokenize them all

OK, we now have to tokenize all the names.

pattern_splitter = [splitter(n, processor) for n in anime.name]
pattern_splitter

Then we add the tokenized values as a new column “name_token”.

anime.loc[:, 'name_token'] = pd.Series(pattern_splitter)
anime

7. Cleanse before use

As the rating is what we want to predict, we have to remove the rows that have no rating value here.
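One way to do this, as a sketch with made-up rows, is .dropna() restricted to the rating column. Resetting the index afterwards is my own choice here so that row positions stay contiguous.

```python
import pandas as pd

# Made-up rows; the second one has no rating.
anime = pd.DataFrame({
    "name": ["Example A", "Example B", "Example C"],
    "rating": [7.5, None, 6.0],
})

# Drop rows whose rating is missing, then renumber the rows.
anime = anime.dropna(subset=["rating"]).reset_index(drop=True)
print(anime)
```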

8. Make train and test sets

We are going to split all 12,064 rows into a train set and a test set. We put 70% into the train set here.
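A 70/30 split can be sketched with train_test_split from Scikit-learn. The ten rows and the random_state value below are made up for illustration, so the exact call used in the project may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows standing in for the 12,064 real ones.
anime = pd.DataFrame({
    "name_token": [[f"title{i}"] for i in range(10)],
    "rating": [float(i) for i in range(10)],
})

# 70% of the rows go to the train set, the rest to the test set.
train, test = train_test_split(anime, train_size=0.7, random_state=42)
print(len(train), len(test))
```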

9. Vectorizer

A vectorizer in Scikit-learn transforms words into a matrix. TfidfVectorizer applies the TF-IDF formula, which weights the frequency of each word in a name by how rare that word is across all names.
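To show the idea behind TF-IDF, here is a hand-made sketch of the textbook formula (term frequency times log of inverse document frequency) on made-up token lists. Note that Scikit-learn's TfidfVectorizer uses a smoothed variant with normalization by default, so its numbers will not match these exactly.

```python
import math
from collections import Counter

# Made-up tokenized names.
docs = [["magic", "school"], ["magic", "robot"], ["robot", "war", "robot"]]
n_docs = len(docs)

# Document frequency: in how many names each word appears.
df = Counter(word for doc in docs for word in set(doc))

def tf_idf(doc):
    # Textbook TF-IDF: (count / length of name) * log(number of names / document frequency).
    counts = Counter(doc)
    return {w: (c / len(doc)) * math.log(n_docs / df[w]) for w, c in counts.items()}

weights = tf_idf(docs[2])
print(weights)
```

Here “war” gets a higher weight than “robot” even though “robot” appears twice, because “war” is rarer across the names.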

First, create a TfidfVectorizer object.

Run .fit_transform() on the train set to learn the words and store them in the matrix, then run .transform() on the test set.
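These two calls can be sketched as follows, with made-up token lists. Because our names are already tokenized, this sketch passes a function as analyzer that returns the tokens unchanged. That setup is an assumption and not necessarily how the project configured the vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up tokenized names.
train_tokens = [["magic", "school"], ["robot", "war"], ["magic", "robot"]]
test_tokens = [["magic", "war", "dragon"]]

# Tokens are already split, so the analyzer just passes them through.
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)

X_train = vectorizer.fit_transform(train_tokens)  # learn the vocabulary and build the matrix
X_test = vectorizer.transform(test_tokens)        # reuse the learned vocabulary only
print(X_train.shape, X_test.shape)
```

The word “dragon” never appears in the train set, so it has no column in the matrix and is ignored in the test set.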

10. Random Forest

Now it is time to train. Start by creating a Regressor.

Assign “y” as the rating of the train set.

Finally, run .fit() with the matrix and “y”. Now we have an estimator.
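The three steps above can be sketched like this. The token lists, the ratings, and the n_estimators and random_state values are all made up for illustration.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up tokenized names and ratings.
train_tokens = [["magic", "school"], ["robot", "war"], ["magic", "robot"], ["school", "war"]]
y = [7.5, 6.0, 8.0, 5.5]

vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
X_train = vectorizer.fit_transform(train_tokens)

# Create the regressor, then fit it with the matrix and "y".
regressor = RandomForestRegressor(n_estimators=50, random_state=0)
regressor.fit(X_train, y)

predictions = regressor.predict(X_train)
print(predictions)
```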

11. Scores of Random Forest

After that, we have to score the estimator. Here we have mse = 1.64.
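MSE (mean squared error) is the average of the squared differences between the real and predicted ratings. A small sketch with made-up numbers, using mean_squared_error from Scikit-learn:

```python
from sklearn.metrics import mean_squared_error

# Made-up real and predicted ratings.
y_true = [7.0, 6.0, 8.0]
y_pred = [6.5, 6.0, 9.0]

# (0.5**2 + 0**2 + 1**2) / 3
mse = mean_squared_error(y_true, y_pred)
print(mse)
```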

Try comparing the predicted and real ratings.

Then plot a graph. It might show that there is little relationship between the names and the ratings of cartoons. Anyway, the prediction results are OK.

12. Interesting features

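We can find the feature rankings with .feature_importances_ of the Random Forest and the feature values (the words) with .get_feature_names() of the vectorizer (renamed to .get_feature_names_out() in recent Scikit-learn versions). Use them together to find which words influence the predicted rating the most. Note that an importance tells us how much a word matters to the prediction, not whether it raises or lowers the rating.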

This is a DataFrame of feature names and feature importances.

13. Linear Regression version

We are curious how Linear Regression performs. As a result, its mse = 2.98, which is higher than the Random Forest's. OK, this one is worse.
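The comparison can be sketched by fitting both models on the same matrix and scoring them on the same test set. All data below is made up, so the scores it prints say nothing about the real results above.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Made-up tokenized names and ratings, for illustration only.
train_tokens = [["magic", "school"], ["robot", "war"], ["magic", "robot"], ["school", "war"]]
y_train = [7.5, 6.0, 8.0, 5.5]
test_tokens = [["magic", "war"], ["robot", "school"]]
y_test = [7.0, 6.5]

vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
X_train = vectorizer.fit_transform(train_tokens)
X_test = vectorizer.transform(test_tokens)

models = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "linear_regression": LinearRegression(),
}

# Fit each model on the train set and compute its MSE on the test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_squared_error(y_test, model.predict(X_test))
print(scores)
```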

NLP with Thai language

The teacher recommended pythainlp. This library can process Thai text in a similar way to Spacy.


This blog is just an introduction. We can go further by learning Content Classification, Sentiment Analysis, etc.

See you next time, Bye.

