Text Analysis Using Turicreate

What is Text Analysis?
Text is a sequence of words or sentences. Text analysis is the process of examining text and extracting information from it. Text data is one of the biggest factors that can make or break a company. For example:

  • On an e-commerce website, people buy things. With text analysis, the website can learn what its customers like and use this data to improve its productivity.
  • Voice assistants such as Alexa and Google Home Mini work using text analysis together with machine learning algorithms. Both are based on Natural Language Processing.
  • Using text analysis, we can decide whether an e-mail is spam or not.

Text analysis can be done using text mining. Text data can be structured as well as unstructured, and text mining techniques help us tell the two apart.

Now let’s do some text analysis using Turicreate. We will build a model that classifies a message as spam or ham. Link to the dataset: https://www.kaggle.com/team-ai/spam-text-message-classification

Step 1: Import the Turicreate Library

import turicreate as tc

Step 2: Load the data set.

data = tc.SFrame("data.csv")
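
To make the expected layout of the file concrete, here is a minimal sketch that parses a few rows with Python's standard csv module instead of Turicreate. The two sample rows are made up for illustration, and load_rows is a hypothetical helper; the column names Category and Message are the ones used by the code in the later steps.

```python
import csv
import io

# Hypothetical sample rows in the two-column layout (Category, Message).
SAMPLE_CSV = """Category,Message
ham,"Ok, see you at lunch"
spam,WINNER! Claim your free prize now
"""

def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(SAMPLE_CSV)
for row in rows:
    print(row["Category"], "->", row["Message"])
```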

Step 3: Explore the data.



# Print the first few rows of the data
data.head()

Output:

[Image: first rows of the dataset]

Step 4: Add a word count column to the data set.
The data has two columns, Category and Message. Adding a word count for each message gives the model a feature to learn from.

# The text analytics module has a count_words function.
# It counts the words separately for each row
# of the Message column.
data['word_count'] = tc.text_analytics.count_words(data['Message'])

# Now the data has one more column, word_count.
data.head()

Output:

Here one more column, word_count, is added to the data set.
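
As a rough illustration of what a per-row word count looks like, the sketch below builds a word-to-count dictionary for each message using only the standard library. It is not Turicreate's implementation: the lowercasing and whitespace splitting are simplifying assumptions, and count_words_sketch is a hypothetical helper name.

```python
from collections import Counter

def count_words_sketch(message):
    """Return a {word: count} dict for one message (hypothetical helper)."""
    # Assumption: lowercase the text and split on whitespace only.
    return dict(Counter(message.lower().split()))

# Made-up messages for illustration.
messages = ["Free prize free entry", "See you at lunch"]
word_counts = [count_words_sketch(m) for m in messages]
print(word_counts[0])
```

Each row ends up with its own dictionary, so a word that appears twice in a message carries a count of 2 for that row only.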

Step 5: Split the data into training and test sets.

train_data, test_data = data.random_split(.8, seed = 0)
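
The idea behind a seeded random split can be sketched in plain Python. The helper below is an assumption for illustration, not how Turicreate implements random_split: it sends each row to the first list with probability equal to the given fraction, and a fixed seed makes the split repeatable.

```python
import random

def random_split_sketch(rows, fraction, seed=0):
    """Split rows into two lists; each row lands in the first with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so every run produces the same split
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

# Stand-in rows; in the article these would be the rows of the data set.
rows = list(range(1000))
train_rows, test_rows = random_split_sketch(rows, 0.8, seed=0)
print(len(train_rows), len(test_rows))
```

With this approach the first list holds roughly, not exactly, 80% of the rows.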

Step 6: Build a model that classifies messages as spam or ham.

# We will use word_count as the feature and
# "Category" as the target, to predict spam or ham.

model = tc.logistic_classifier.create(
    train_data, target='Category',
    features=['word_count'],
    validation_set=test_data)
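
To show how a logistic classifier can learn from word counts, here is a small from-scratch sketch. It is not Turicreate's solver: the tiny training set, the learning rate, the number of epochs and all helper names are assumptions chosen for illustration. Each word gets a weight, and the sigmoid of the weighted sum of the counts is read as the probability that a message is spam.

```python
import math
from collections import Counter

def featurize(message):
    """Bag-of-words features: {word: count} (lowercased, whitespace split)."""
    return Counter(message.lower().split())

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of 'spam' under the current weights."""
    z = bias + sum(weights.get(w, 0.0) * c for w, c in features.items())
    return sigmoid(z)

def train(examples, epochs=200, lr=0.5):
    """Fit weights with plain stochastic gradient descent on the log loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = predict_proba(weights, bias, features) - label
            bias -= lr * error
            for w, c in features.items():
                weights[w] = weights.get(w, 0.0) - lr * error * c
    return weights, bias

# Made-up training messages; label 1 means spam, 0 means ham.
raw = [
    ("win a free prize now", 1),
    ("free entry claim your prize", 1),
    ("are we meeting for lunch", 0),
    ("see you at home tonight", 0),
]
examples = [(featurize(text), label) for text, label in raw]
weights, bias = train(examples)

def classify(message):
    """Label a message as 'spam' when its predicted probability is at least 0.5."""
    p = predict_proba(weights, bias, featurize(message))
    return "spam" if p >= 0.5 else "ham"

print(classify("claim your free prize"))
print(classify("lunch at home"))
```

Words seen only in spam messages end up with positive weights and words seen only in ham messages with negative ones, which is what lets the word counts separate the two classes.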

Step 7: Check the accuracy of the model.

model.evaluate(test_data)
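
For intuition about the accuracy figure, it is the fraction of test rows whose predicted label equals the true label. The sketch below computes it by hand; the label lists are made up, and the accuracy helper is hypothetical, standing in for the model's predictions and the Category column.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Made-up labels for illustration only.
true_labels = ["ham", "spam", "ham", "ham", "spam", "ham", "ham", "ham"]
predicted = ["ham", "spam", "ham", "spam", "spam", "ham", "ham", "ham"]

score = accuracy(true_labels, predicted)
print(score)  # 7 of 8 labels match, so 0.875
```

Spam data sets usually contain far more ham than spam, so a high accuracy can hide missed spam messages; it is worth looking at precision and recall as well.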

Output:

The accuracy is 0.975, that is, 97.5%.

Step 8: Manually check a prediction against the test data to see whether the model gives the right answer.

Code:

test_data.head()
# We will pick the first message that is labelled spam
# and check whether the model also predicts spam.

Step 9: Predict on the test data.

model.predict(test_data[1])

Output:

The result is spam, so the model predicts this message correctly.
