Natural Language Processing (Part 4)

EHzh...6AzB
19 Mar 2024
24

Sentiment Analysis (Logistic Regression):Vocabulary & Feature Extraction

Python Notbook
In this Tutorial, you’re going to learn how to represent a text as a vector. In order for you to do so, you first have to build vocabulary and that will allow you to encode any text or any tweet as an array of numbers.

Vocabulary

def:In the context of natural language processing (NLP), vocabulary refers to the set of unique words that appear in a text corpus.The vocabulary is used to represent text in a machine-readable format. For example, if the vocabulary contains 10,000 words, then each text document can be represented as a vector of 10,000 numbers, where each number represents the frequency of a particular word in the document.
The vocabulary is an important part of NLP because it allows us to represent text in a way that can be processed by computers. Without a vocabulary, it would be very difficult to develop NLP algorithms that can understand and process natural language.
There are a few different ways to create a vocabulary for NLP. One common approach is to use a statistical method to select the most frequent words in a text corpus. Another approach is to use a dictionary of words that are considered to be important for a particular task.
So let’s dive in and see how you can do this. Picture a list of tweets, visually it would look like this. Then your vocabulary, V, would be the list of unique words from your list of tweets. To get that list, you’ll have to go through all the words from all your tweets and save every new word that appears in your search. So in this example, you’d have the word I, then the word, am and happy, because, and so forth. But note that the word I and the word am would not be repeated
in the vocabulary.

Feature Extraction
Sparse Representations

Let’s take these tweets and extract features using your vocabulary. To do so, you’d have to check if every word from your vocabulary appears in the tweet. If it does like in the case of the word I, you would assign a value of 1 to that feature, like this. If it doesn’t appear, you’d assign a value of 0, like that.
In this example, the representation of your tweet would have six ones and many zeros. These correspond to every unique word from your vocabulary that isn’t in the tweet.
Now, this type of representation with a small relative number of non-zero values is called a sparse representation. Now let’s take a closer look at this representation of these tweets. In the last slides, I walked you through extracting features to represent the tweet based on a vocabulary and I arrived at this vector.
This representation would have a number of features equal to the size of your entire vocabulary. This would have a lot of features equal to 0 for every tweet. With the sparse representation, a logistic regression model would have to learn n plus 1 parameters, where n would be equal to the size of your vocabulary and you can imagine that for large vocabulary sizes, this would be problematic. It would take an excessive amount of time to train your model and much more time than necessary to make predictions.
Given a text, you learned how to represent this text as a vector of dimension V. Specifically, you did this for a tweet and you were able to build a vocabulary of dimension V. Now as V gets larger and larger, you will face certain problems. In the next video, you will learn to identify these problems.
Please Follow coursesteach to see latest updates on this story
If you want to learn more about these topics: Python, Machine Learning Data Science, Statistic For Machine learning, Linear Algebra for Machine learning Computer Vision and Research
Then Login and Enroll in Coursesteach to get fantastic content in the data field.
Stay tuned for our upcoming articles where we will explore specific topics related to NLP in more detail!
Remember, learning is a continuous process. So keep learning and keep creating and sharing with others!💻✌️

References

1- Natural Language Processing with Classification and Vector Spaces
2-Vocabulary & Feature Extraction

Write & Read to Earn with BULB

Learn More

Enjoy this blog? Subscribe to jihad

1 Comment

B
No comments yet.
Most relevant comments are displayed, so some may have been filtered out.