Note of data science training EP 11: NLP & Spacy – Languages are borderless
prev: Note of data science training EP 10: Cluster – collecting and clustering
Computers are capable of learning human languages.
Natural Language Processing (NLP)
It is the methodology of translating human language into data that can be analysed. For instance, “I Love You” can be labelled as “positive”, “romantic”, and “sentimental”.
One basic term is “tokenization”, which means splitting a text into words or groups of words. We understand what we hear by combining the meanings of the individual words, and so does a computer.
Python has many libraries for this task. One is Spacy.
Spacy
This problem is from my final project: predicting a cartoon’s rating from its name. The steps are to split the names into tokens, transform the tokens into numbers, and use a Random Forest estimator as the predictor.
Let’s go.
1. Install
Find the Spacy package here.
2. Prepare a dataset
The dataset is from Kaggle via this link.
3. Import libraries and files
Import Pandas and use .read_csv() to load the file.
4. Import spacy
As the dataset is in English, we have to download the Spacy model "en_core_web_sm" and load it with .load(). This returns a Language object. At this step, we can use that object to tokenize (split into words) a text, as in the figure below.
We can display each token's text with .text and its part of speech with .pos_.
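A minimal sketch of this step, assuming Spacy and the "en_core_web_sm" model are already installed. The helper names load_processor and describe_tokens are hypothetical and not part of Spacy. Only spacy.load(), .text, and .pos_ come from the library.

```python
def load_processor(model_name="en_core_web_sm"):
    """Load a Spacy pipeline (needs the spacy package and the model downloaded)."""
    import spacy  # imported lazily so describe_tokens can be used without Spacy
    return spacy.load(model_name)


def describe_tokens(doc):
    """Return (text, part of speech) pairs for every token in a parsed document."""
    return [(token.text, token.pos_) for token in doc]


# Typical use, once the model is available:
# processor = load_processor()
# print(describe_tokens(processor("Hello world")))
```

Calling the loaded object on a string returns a document that can be iterated token by token, which is all describe_tokens relies on.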
5. Custom tokenization
We want only letters and numbers, not special characters, so we improve the tokenizer with a regular expression in the function below.
import re

def splitter(val, processor):
    # Keep only runs of ASCII letters and digits, lowercased
    pattern = r'[0-9a-zA-Z]+'
    return [r.group().lower() for r in re.finditer(pattern, processor(val).text)]
[0-9a-zA-Z]+ means capturing only digits (0 – 9), lowercase letters (a – z), and uppercase letters (A – Z). The plus sign (+) means that each match is one or more of these characters.
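To see what the pattern keeps and drops, here is a small standalone sketch that uses only the re module. The function name simple_tokens and the sample string are made up for illustration.

```python
import re

# Same character class as above: digits, lowercase and uppercase ASCII letters
PATTERN = r'[0-9a-zA-Z]+'


def simple_tokens(text):
    """Return the lowercased alphanumeric runs found in text."""
    return [match.group().lower() for match in re.finditer(PATTERN, text)]


tokens = simple_tokens("Hello, World! Part-2")
print(tokens)  # ['hello', 'world', 'part', '2']
```

Punctuation and spaces act as separators, so "Part-2" becomes two tokens.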
6. Tokenize them all
OK, now we have to tokenize all the names.
pattern_splitter = [splitter(n, processor) for n in anime.name]
pattern_splitter
Then we add the tokenized value in a new column “name_token”.
anime.loc[:, 'name_token'] = pd.Series(pattern_splitter)
anime
7. Cleanse before use
As we need the rating as the prediction target, we have to remove the rows with a missing rating here.
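One common way to do this in Pandas is .dropna() with the subset argument. The tiny DataFrame below is a hypothetical stand-in for the real Kaggle data, and the variable name anime_clean is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the anime DataFrame loaded from the CSV
anime = pd.DataFrame({
    "name": ["Title A", "Title B", "Title C"],
    "rating": [8.1, None, 6.4],
})

# Drop rows whose rating is missing, then renumber the index from 0
anime_clean = anime.dropna(subset=["rating"]).reset_index(drop=True)
print(len(anime_clean))
```

Resetting the index keeps row labels contiguous, which avoids misaligned rows when new columns are assigned later.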
8. Make train and test sets
From all 12,064 rows, we are going to separate them into a train set and a test set. We put 70% into the train set here.
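The idea of a 70/30 split can be sketched in plain Python as below. This is only an illustration with a hypothetical helper name, split_train_test. In practice a library helper such as Scikit-learn's train_test_split does the same job.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows reproducibly and cut them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_train_test(range(10))
print(len(train), len(test))  # 7 3
```

Shuffling before cutting matters because the rows in a file are often sorted, for example by rating or popularity.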
9. Vectorizer
The vectorizer in Scikit-learn transforms words into a matrix. It applies the TF-IDF formula to weight each word by how often it appears in a name and how rare it is across all names.
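For reference, with its default settings Scikit-learn's TfidfVectorizer uses a smoothed version of the formula, where n is the number of documents (here, names), tf(t, d) is the count of term t in document d, and df(t) is the number of documents containing t:

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Each row of the resulting matrix is then scaled to unit Euclidean length, so long and short names are comparable.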
First, create a TfidfVectorizer object.
Run .fit_transform() on the train set to learn the words and store them in the matrix, then run .transform() on the test set.
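A minimal sketch of the fit/transform pattern, assuming the tokens of each name have been joined back into one space-separated string. The sample names are made up for illustration, and the vectorizer uses default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical token strings, one per name
train_names = ["naruto shippuden", "dragon ball", "dragon ball super"]
test_names = ["naruto movie"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_names)  # learns the vocabulary
X_test = vectorizer.transform(test_names)        # reuses it, ignores unseen words

print(X_train.shape, X_test.shape)
```

The test set only goes through .transform(), so both matrices share the same columns and a word the vectorizer never saw, such as "movie" here, contributes nothing.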
10. Random Forest
Now it is time to train the model. Start by creating a Regressor.
Assign “y” as the rating of the train set.
Finally, run .fit() with the matrix and “y”. Now we have an estimator.
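The training step might look like the sketch below. The names, ratings, and parameter values (n_estimators, random_state) are assumptions for illustration, not the values used in the project.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training data: token strings and their ratings
train_names = ["naruto shippuden", "dragon ball", "dragon ball super", "slow life"]
y = [7.9, 8.2, 8.4, 6.5]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_names)

# Fit the forest on the TF-IDF matrix and the ratings
regressor = RandomForestRegressor(n_estimators=50, random_state=0)
regressor.fit(X_train, y)

# Predict the rating of an unseen name
predicted = regressor.predict(vectorizer.transform(["dragon ball movie"]))
print(predicted)
```

A fixed random_state makes the forest reproducible between runs, which helps when comparing scores.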
11. Scores of Random Forest
After that, we have to score the estimator. Here we have mse = 1.64.
Try comparing the predicted and real ratings.
Then plot a graph. It suggests that there is only a weak relationship between the name and the rating of a cartoon. Anyway, the prediction results are acceptable.
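The mean squared error (mse) is the average of the squared differences between real and predicted ratings. Here is a small sketch in plain Python with made-up numbers. The helper name mse_score is hypothetical, and Scikit-learn offers mean_squared_error in sklearn.metrics for the same purpose.

```python
def mse_score(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Squared errors are 1, 0 and 4, so the mean is 5/3
mse = mse_score([8.0, 6.0, 7.0], [7.0, 6.0, 9.0])
print(mse)
```

Because the errors are squared, the value is in squared rating units and a lower number means a better fit.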
12. Interesting features
We can find feature rankings with .feature_importances_ of the Random Forest and feature names with .get_feature_names() of the vectorizer (in Scikit-learn 1.2 and later, this method is .get_feature_names_out()). Use them together to find which words influence the predicted rating the most.
This is a DataFrame of feature names and feature importances.
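Pairing the two lists and sorting them can be sketched as below. The helper name rank_features and the sample values are hypothetical. With real fitted objects, the two arguments would be the vectorizer's feature names and the forest's .feature_importances_.

```python
def rank_features(names, importances):
    """Pair each feature name with its importance, highest importance first."""
    pairs = zip(names, importances)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up values for illustration only
ranking = rank_features(["ball", "dragon", "movie"], [0.2, 0.5, 0.3])
print(ranking)  # [('dragon', 0.5), ('movie', 0.3), ('ball', 0.2)]
```

Note that an importance says how much a word affects the prediction, not whether it pushes the rating up or down.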
13. Linear Regression version
We are curious about how Linear Regression performs. As a result, its mse = 2.98, which is higher than the Random Forest's. OK, this one is worse.
NLP with Thai language
The teacher recommended pythainlp. This library can process Thai text in a similar style to Spacy.
This blog is just an introduction. We can go further by learning Content Classification, Sentiment Analysis, etc.
See you next time, Bye.
next: Note of data science training EP 12: skimage – Look out carefully