A Guide to Text Vectorization in Python

Common vectorization techniques employed in a typical NLP machine learning pipeline, demonstrated on the real or fake news dataset from Kaggle.



In this article, we will learn what vectorization is and explore the different vectorization techniques employed in an NLP model. Then, we will apply these concepts in the context of a concrete problem.

We will work with a dataset that classifies news as fake or real. The dataset is available on Kaggle at the link below:

https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

The initial step in a typical machine learning text pipeline is data cleaning. This step is covered in detail in a previous article, linked below:

Dataset after data cleaning

The raw news titles were transformed into a cleaned format containing only the essential information (last column of the picture above). The next step is to further transform the cleaned text into a form that the machine learning model can understand. This process is known as vectorization. In our context, each news title is converted into a numerical vector that represents that particular title. There are many vectorization techniques, but in this article we will focus on three widely used ones: Count vectorization, N-Grams, and TF-IDF, along with their implementation in Python.

1. Count vectorization

As discussed above, vectorization is the process of converting text into numerical entries in a matrix form. In the count vectorization technique, a document term matrix is generated, where a column is dedicated to each word in the corpus and each cell holds the number of times that word appears in the corresponding news title, also known as the term frequency. The count relates directly to how well a word correlates with the category of the news title: if a particular word appears many times in fake news titles or real news titles, then that word has high predictive power for determining whether a news title is fake or real.

import re
import string
import nltk  # stopwords require a one-time nltk.download('stopwords')
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

ps = PorterStemmer()

def clean_title(text):
    # lowercase, strip punctuation, and split on non-word characters
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    # stem every token that is not an English stopword
    return [ps.stem(word) for word in tokens
            if word not in nltk.corpus.stopwords.words('english')]

count_vectorize = CountVectorizer(analyzer=clean_title)
vectorized = count_vectorize.fit_transform(news['title'])

Dissecting the above code: the function "clean_title" lowercases the news title and removes its punctuation, splits the text on any non-word character, and finally stems the remaining non-stopwords, returning them as a list of tokens. A detailed description of the cleaning process is given in this article.

Next, we have made use of the "CountVectorizer" class available in the sklearn library under sklearn.feature_extraction.text. The default values and definitions are available in the scikit-learn CountVectorizer documentation. In the above code, we instantiated CountVectorizer and set one parameter, analyzer; the other parameters keep their default values. The analyzer parameter accepts a string or a callable, and we have passed a function that takes in raw text and returns a list of cleaned tokens.

The shape of the document term matrix is (44898, 15824): there are 44898 news titles and 15824 unique words across all the titles.
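As a minimal sketch of how to inspect this (assuming the "vectorized" matrix and "count_vectorize" object created above):

print(vectorized.shape)
# (44898, 15824)
print(count_vectorize.get_feature_names_out()[:10])
# the first few unique words; use get_feature_names() on scikit-learn < 1.0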

A subset of the 15824 unique words in the news title

The vectorizer produces a sparse matrix as output, as shown in the picture. Only the locations and values of the non-zero entries are stored to save space. So, an output of the vectorization will look something like this:

<20x158 sparse matrix of type '<class 'numpy.int64'>'
	with 206 stored elements in Compressed Sparse Row format>

Converting the above to an array form yields the result below:
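As a minimal sketch of that conversion (assuming the objects defined earlier; slicing off 20 titles is just a hypothetical subset so the dense array fits on screen):

import pandas as pd

# densify a small slice of the sparse matrix for inspection
sample = pd.DataFrame(vectorized[:20].toarray(),
                      columns=count_vectorize.get_feature_names_out())
print(sample.head())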

As shown above, most of the cells contain a 0 value; such a matrix is known as a sparse matrix. Many vectorized outputs look similar to this since, naturally, most titles won't contain any given word.

2. N-Grams

Similar to the count vectorization technique, in the N-Gram method a document term matrix is generated and each cell represents a count. The difference is that in the N-Grams method the count represents combinations of adjacent words of length n in the title. Count vectorization is the special case of N-Grams where n=1. For example, "I love this article" has four words, so the largest possible n is 4:

if n=2, i.e. bigrams, then the columns would be ["I love", "love this", "this article"]

if n=3, i.e. trigrams, then the columns would be ["I love this", "love this article"]

if n=4, i.e. four-grams, then the column would be ["I love this article"]

The value of n is chosen based on model performance.
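As a quick sanity check of the bigram example above, here is a minimal sketch; the widened token_pattern is an assumption so that the one-letter word "I" survives, since CountVectorizer drops single-character tokens and lowercases everything by default:

from sklearn.feature_extraction.text import CountVectorizer

bigram = CountVectorizer(ngram_range=(2, 2), token_pattern=r"(?u)\b\w+\b")
bigram.fit(["I love this article"])
print(bigram.get_feature_names_out())
# ['i love' 'love this' 'this article']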

For the Python code, the cleaning process is performed similarly to the count vectorization technique, except that the words are not left in a tokenized list form. The tokenized words are joined back into a single string, so that adjacent words can be gathered to effectively perform N-Grams.

The cleaned title text is shown below:

The remaining vectorization steps are the same as in the Count Vectorization method above.
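A minimal sketch of that pipeline, reusing the helpers from the count vectorization section; the name clean_title_ngram and the bigram-only setting are illustrative assumptions, not the exact code from the article:

# same cleaning as before, but the stemmed tokens are re-joined into one string
def clean_title_ngram(text):
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    return " ".join(ps.stem(word) for word in tokens
                    if word not in nltk.corpus.stopwords.words('english'))

news['clean_title'] = news['title'].apply(clean_title_ngram)

# ngram_range=(2, 2) builds the document term matrix from bigrams only
ngram_vectorize = CountVectorizer(ngram_range=(2, 2))
ngram_vectorized = ngram_vectorize.fit_transform(news['clean_title'])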

The trade-off lies in the choice of n. A small n value may not capture enough useful information, whereas a high n value yields a huge matrix with a great many features. N-Grams can be powerful, but they need a little more care.

3. Term Frequency-Inverse Document Frequency (TF-IDF)

Similar to the count vectorization method, in the TF-IDF method a document term matrix is generated and each column represents a single unique word. The difference in the TF-IDF method is that each cell doesn't hold a raw term frequency; instead, the cell value is a weighting that highlights how important that particular word is to the document.

TF-IDF formula:

w(x, y) = tf(x, y) × log(N / df(x))

where tf(x, y) is the term frequency of word x in document y, df(x) is the number of documents containing word x, and N is the total number of documents.

The second term of the equation helps in pulling out the rare words. What does that mean? If a word appears across many documents, then the denominator df increases, shrinking N/df and hence the value of the second term. Term frequency, or tf, is the number of times a word x occurs in document y divided by the total number of words in y.
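As a hypothetical worked example (the numbers are made up for illustration): suppose the word "vote" occurs 3 times in a 100-word title, so tf = 3/100 = 0.03, and it appears in 10 of the 44898 titles, so the second term is log(44898/10) ≈ 8.41 using the natural log. The TF-IDF weight is then 0.03 × 8.41 ≈ 0.25.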

For the Python code, we will use the same cleaning process as in the Count Vectorizer method. Sklearn's TfidfVectorizer can be used for the vectorization portion in Python.
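A minimal sketch, reusing the clean_title function from the count vectorization section. Note that scikit-learn's implementation uses a smoothed variant of the formula, log((1 + N) / (1 + df)) + 1, and L2-normalizes each row by default, so its weights differ slightly from the textbook formula above:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorize = TfidfVectorizer(analyzer=clean_title)
tfidf_vectorized = tfidf_vectorize.fit_transform(news['title'])
print(tfidf_vectorized.shape)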

The sparse matrix output for this method contains decimals representing the weight of each word in the document. A high weight means the word occurs many times within a few documents; a low weight means the word occurs rarely in a document or appears across many documents.

Concluding thoughts

There is no rule of thumb for choosing a vectorization method. I often decide based on the business problem at hand. If there are no constraints, I start with the simplest method, which is usually also the fastest.

I’d love to hear your thoughts and feedback on my articles. Please do leave them in the comment section below.