Beginning with Text Analytics


Text Analytics

Text analytics is the process of extracting meaning from written communication. It derives high-value information from established patterns and trends in a piece of text. It works by dividing paragraphs, sentences, and phrases into their components and evaluating each part's role and meaning using software rules and machine learning algorithms.


How to do Text Analytics?

  1. Bag-of-Words approach
  2. Term Frequency-Inverse Document Frequency (TF-IDF)
  3. Lexicons

1. Bag-of-Words Approach

A bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. It is simple to understand and implement and has produced strong results on problems such as language modeling and document classification. A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, for example with machine learning algorithms.

The approach is very simple and flexible and can be used in many ways to extract features from documents. A bag of words is a representation of text that describes the occurrence of words within a document. It involves two things:

  • A vocabulary of known words. 
  • A measure of the presence of known words.

It is called a "bag" of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where they occur.

Example:

Here's a sample of reviews about a particular horror movie.

  • Review 1: This movie is very scary and long.
  • Review 2: This movie is not scary and is slow.
  • Review 3: This movie is spooky and good.

We can see that the reviews contrast in their opinions of the movie as well as in its length and pace. Imagine a dataset with thousands of such reviews. Clearly, there are a lot of interesting insights we can draw from them and build upon to gauge how well the movie performed.

However, as noted above, we cannot simply feed these sentences to a machine learning model and ask it to tell us whether a review is positive or negative. We first need to perform specific text preprocessing steps.

We will first build a vocabulary from all the unique words in the three reviews above. The vocabulary consists of these 11 words: "This", "movie", "is", "very", "scary", "and", "long", "not", "slow", "spooky", "good". We can now take each of these words and mark its occurrence in the three movie reviews with 1s and 0s. This gives us 3 vectors for the 3 reviews:


            This  movie  is  very  scary  and  long  not  slow  spooky  good    Length of the review
Review 1      1     1     1    1     1     1    1     0     0      0      0             7
Review 2      1     1     2    0     1     1    0     1     1      0      0             8
Review 3      1     1     1    0     0     1    0     0     0      1      1             6

Table 1: Bag-of-Words representation of the three reviews.

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]
Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]
Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]

And that's the core idea behind a Bag-of-Words (BoW) model.
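
As a rough illustration, the same kind of vectors can be produced programmatically. The sketch below uses scikit-learn's CountVectorizer; the library choice is an assumption on our part, since any tokenizer plus word counting would do. Note that CountVectorizer lowercases the text and orders the vocabulary alphabetically, so the columns will not match the order above, but the counts are the same.

    # Minimal Bag-of-Words sketch with scikit-learn (assumed library choice).
    from sklearn.feature_extraction.text import CountVectorizer

    reviews = [
        "This movie is very scary and long.",      # Review 1
        "This movie is not scary and is slow.",    # Review 2
        "This movie is spooky and good.",          # Review 3
    ]

    vectorizer = CountVectorizer()            # lowercases and tokenizes by default
    bow = vectorizer.fit_transform(reviews)   # sparse document-term count matrix

    print(vectorizer.get_feature_names_out()) # vocabulary (alphabetical order)
    print(bow.toarray())                      # one count vector per review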

2. Term Frequency-Inverse Document Frequency (TF-IDF)

Term frequency-inverse document frequency is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

Term Frequency (TF) is a measure of how frequently a term, t, appears in a document, d:

TF(t, d) = n / (total number of terms in document d)

Here, in the numerator, n is the number of times the term "t" appears in the document "d". Hence, each document and term pair has its own TF value. Take the same vocabulary we built in the Bag-of-Words model to show how to calculate the TF for Review 2:

Review 2: This movie is not scary and is slow.

Here, 

  • Vocabulary: "This", "movie", "is", "very", "scary", "and", "long", "not", "slow", "spooky", "good".
  • The number of words in Review 2 = 8.
  • TF for the word "this" = (number of times "this" appears in Review 2) / (number of terms in Review 2) = 1/8.

Similarly, 

  • TF("movie") = 1/8 
  • TF("is") = 2/8 = 1/4
  • TF("very") = 0/8 = 0
  • TF("scary") = 1/8
  • TF("and") = 1/8 
  • TF("long") = 0/8 = 0 
  • TF("not") = 1/8 
  • TF("slow") = 1/8 
  • TF("spooky") = 0/8 = 0 
  • TF("good") = 0/8 = 0

Now the term frequencies for all the terms and all the reviews can be calculated following the above methodology: 


Term      Review 1   Review 2   Review 3    TF1    TF2    TF3
this         1          1          1        1/7    1/8    1/6
movie        1          1          1        1/7    1/8    1/6
is           1          2          1        1/7    1/4    1/6
very         1          0          0        1/7     0      0
scary        1          1          0        1/7    1/8     0
and          1          1          1        1/7    1/8    1/6
long         1          0          0        1/7     0      0
not          0          1          0         0     1/8     0
slow         0          1          0         0     1/8     0
spooky       0          0          1         0      0     1/6
good         0          0          1         0      0     1/6

Table 2: Term frequency (TF) values for all terms across the three reviews.
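
For illustration, the TF values in Table 2 can be reproduced with a few lines of plain Python (no particular library is assumed):

    # Reproduce the TF values from Table 2:
    # TF(t, d) = count of t in d / total number of terms in d.
    reviews = [
        "this movie is very scary and long",
        "this movie is not scary and is slow",
        "this movie is spooky and good",
    ]

    def term_frequencies(document):
        words = document.split()
        return {w: words.count(w) / len(words) for w in set(words)}

    for i, review in enumerate(reviews, start=1):
        print("Review", i, term_frequencies(review))
    # e.g. for Review 2, "is" -> 2/8 = 0.25, matching the table.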

Inverse Document Frequency (IDF)

IDF is a measure of how important a term is. We need the IDF value because computing the TF alone is not enough to understand the importance of words.

We can calculate the IDF values for all the words in Review 2:

IDF("this") = log (number of documents / number of documents containing the word "this") = log (3/3) = log (1) = 0

Similarly,

  • IDF("movie") = log (3/3) = 0 
  • IDF("is") = log (3/3) = 0
  • IDF("not") = log (3/1) = log (3) = 0.48
  • IDF("scary") = log (3/2) = 0.18
  • IDF("and") = log (3/3) = 0 
  • IDF("slow") = log (3/1) = 0.48

We can calculate the IDF value for each word in this way, and thus the IDF values for the entire vocabulary can be determined.

Hence, we see that words like "is", "this", "and", and so on are reduced to 0 and carry little importance, while words like "scary", "long", "good", and so on carry more importance and consequently receive a higher value. Similarly, the TF-IDF scores for the other two reviews can be calculated. The most common application of this approach is spam/ham detection.
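
Before turning to that application, here is a minimal sketch that combines the two formulas above (TF multiplied by IDF, with IDF taken as log base 10, as in the hand calculation). Library implementations such as scikit-learn's TfidfVectorizer use a smoothed IDF, so their numbers will differ slightly from these:

    # TF-IDF following the article's formulas:
    # TF(t, d) = count(t in d) / len(d), IDF(t) = log10(N / df(t)).
    import math

    reviews = [
        "this movie is very scary and long",
        "this movie is not scary and is slow",
        "this movie is spooky and good",
    ]
    docs = [r.split() for r in reviews]
    vocab = sorted({w for d in docs for w in d})

    def tf(term, doc):
        return doc.count(term) / len(doc)

    def idf(term):
        df = sum(1 for d in docs if term in d)   # documents containing the term
        return math.log10(len(docs) / df)

    tfidf_review2 = {w: round(tf(w, docs[1]) * idf(w), 3) for w in vocab}
    print(tfidf_review2)   # "not" and "slow" receive the highest weights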

Spam Ham Detection
  • Spam emails or messages belong to the broad category of unsolicited messages received by a user.
  • The bag-of-words representation, a natural language processing (NLP) technique, can be used together with a classifier to label messages as ham or spam, as in the sketch below.
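
A minimal sketch of this idea: a bag-of-words representation feeding a Naive Bayes classifier. The four training messages and their labels are invented purely for illustration.

    # Toy spam/ham classifier: bag-of-words features + Naive Bayes.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Tiny, made-up training set purely for illustration.
    messages = [
        "Win a free prize now, click here",    # spam
        "Lowest price guaranteed, buy now",    # spam
        "Are we still meeting for lunch?",     # ham
        "Please review the attached report",   # ham
    ]
    labels = ["spam", "spam", "ham", "ham"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(messages, labels)

    print(model.predict(["Click here to claim your free prize"]))  # likely 'spam'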




3. Lexicons for Sentiment Analysis

Sentiment analysis with lexicons is a process by which text data is analyzed using natural language processing (NLP). Lexicons calculate the sentiment from the semantic orientation of the words or phrases that occur in a text. Commonly used lexicons are:

  • Afinn 
  • TextBlob 
  • VADER   

Afinn

It is one of the simplest yet most popular lexicons used for sentiment analysis, developed by Finn Årup Nielsen. It contains more than 3,300 words, each with an associated polarity score. In Python, a ready-made package exposes this lexicon, as sketched below.
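
A short usage sketch, assuming the third-party afinn package (installable with pip install afinn):

    # Sum the AFINN word scores found in a sentence; a positive total
    # indicates positive sentiment overall.
    from afinn import Afinn

    afinn = Afinn()
    print(afinn.score("This movie is spooky and good"))
    print(afinn.score("This movie is not scary and is slow"))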


TextBlob

TextBlob is a straightforward Python library that offers API access to various NLP tasks such as sentiment analysis, spelling correction, and so on. The TextBlob sentiment analyzer returns two properties for a given input sentence:

  • Polarity is a float that lies in the range [-1, 1]; -1 indicates negative sentiment and +1 indicates positive sentiment. 
  • Subjectivity is also a float that lies in the range [0, 1]. Subjective sentences generally refer to personal opinions, emotions, or judgments.
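
A minimal usage sketch, assuming the textblob package is installed (e.g. pip install textblob):

    from textblob import TextBlob

    blob = TextBlob("This movie is spooky and good")
    print(blob.sentiment.polarity)      # float in [-1, 1]
    print(blob.sentiment.subjectivity)  # float in [0, 1]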


VADER

Valence Aware Dictionary and sEntiment Reasoner (VADER) is another popular rule-based sentiment analyzer. It uses a list of lexical features (for example, words) labeled as positive or negative according to their semantic orientation to calculate the sentiment of a text. VADER returns the probability of a given input sentence being positive, negative, and neutral.


For example:

"The food was great!"

Positive: 99%

Negative: 1%

Neutral: 0%

These three probabilities will add up to 100%.
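
A minimal usage sketch, assuming the vaderSentiment package (pip install vaderSentiment); polarity_scores() returns the negative, neutral, and positive proportions along with a compound score:

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    print(analyzer.polarity_scores("The food was great!"))
    # dict with 'neg', 'neu', 'pos' proportions and a 'compound' score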

The most common application of the approach is Sentiment Analysis.


Sanskriti Jain, Apr 26, 2022
