Python | Gender Identification by name using NLTK

Natural Language Toolkit (NLTK) is a platform for building Python programs that analyze text. Male and female names have some distinctive characteristics: names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male. Let’s build a classifier to model these differences more precisely.
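Before training anything, the rule of thumb above can be written down directly as a hand-made baseline. This is only a sketch: the function name guess_gender_by_rule and the 'unknown' fallback are assumptions made for illustration, and the letter sets are simply the ones listed in the paragraph above.

```python
# Letter sets taken from the rule of thumb stated above.
FEMALE_ENDINGS = set("aei")
MALE_ENDINGS = set("korst")


def guess_gender_by_rule(name):
    """Return 'female', 'male', or 'unknown' from the final letter alone."""
    if not name:
        return "unknown"
    last = name[-1].lower()
    if last in FEMALE_ENDINGS:
        return "female"
    if last in MALE_ENDINGS:
        return "male"
    return "unknown"


for sample in ["Maria", "Marco", "Saurabh"]:
    print(sample, "->", guess_gender_by_rule(sample))
```

A fixed rule like this cannot say how strongly each letter points to one label, and it gives no answer for letters outside the two sets. A trained classifier can estimate both from data.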

To run the Python program below, you must first install NLTK:

pip install nltk

The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we’ll start by just looking at the final letter of a given name. The following feature extractor function builds a dictionary containing relevant information about a given name.

Example :

Input : gender_features('saurabh')
Output : {'last_letter': 'h'}

def gender_features(word):
    # Use only the final letter of the name as a feature.
    return {'last_letter': word[-1]}

print(gender_features('mahavir'))
# Output: {'last_letter': 'r'}
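The same dictionary format can carry more than one feature. The sketch below is a hypothetical extension, not part of the original program: the name gender_features_extended and the particular extra features (last two letters, first letter, length) are assumptions chosen only to show how further cues would be encoded.

```python
def gender_features_extended(name):
    """Hypothetical extension of gender_features with a few more cues."""
    lowered = name.lower()
    return {
        "last_letter": lowered[-1],
        "last_two_letters": lowered[-2:],
        "first_letter": lowered[0],
        "length": len(lowered),
    }


print(gender_features_extended("Mahavir"))
# {'last_letter': 'r', 'last_two_letters': 'ir', 'first_letter': 'm', 'length': 7}
```

Like the original extractor, this raises an IndexError on an empty string, so callers should filter out empty names first. More features do not automatically help: each added feature should be checked against held-out data.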

Next, download the NLTK data by running the command below. A GUI will pop up; choose “all” to select every package, and then click ‘download’. This gives you all of the tokenizers, chunkers, other algorithms, and all of the corpora, which is why the installation takes quite a while.

import nltk
nltk.download()
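This tutorial only reads the names corpus, so downloading everything is optional. A lighter alternative, sketched below, fetches that one package if it is missing. The helper name ensure_names_corpus is made up for illustration; nltk.data.find and nltk.download('names') are existing NLTK calls, and the download step assumes network access.

```python
def ensure_names_corpus():
    """Download the 'names' corpus only if it is not already installed."""
    import nltk  # imported here so the helper can be defined before NLTK is set up

    try:
        # Raises LookupError when the corpus is not in any NLTK data directory.
        nltk.data.find("corpora/names")
    except LookupError:
        nltk.download("names")
```

Calling ensure_names_corpus() once before the main program is enough; later calls find the corpus locally and return without downloading.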

Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are:

  1. Deciding whether an email is spam or not.
  2. Deciding what the topic of a news article is, from a fixed list of topic areas such as “sports,” “technology,” and “politics.”
  3. Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.

The basic classification task has a number of interesting variants. For example, in multi-class classification, each instance may be assigned multiple labels; in open-class classification, the set of labels is not defined in advance; and in sequence classification, a list of inputs are jointly classified.

A classifier is called supervised if it is built from training corpora that contain the correct label for each input. The framework used by supervised classification is shown in the figure.
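To make the supervised framework concrete, here is a from-scratch sketch of training on labeled examples and then labeling unseen input. It is a deliberately tiny naive Bayes over a single feature (the last letter) with add-one smoothing. The function names and the toy data are invented for illustration, and this is not how NLTK implements its classifier internally.

```python
from collections import Counter, defaultdict
import math


def train_last_letter_nb(labeled_names):
    """Count labels and (label, last letter) pairs from labeled examples."""
    label_counts = Counter()
    letter_counts = defaultdict(Counter)
    for name, label in labeled_names:
        label_counts[label] += 1
        letter_counts[label][name[-1].lower()] += 1
    return label_counts, letter_counts


def classify_last_letter_nb(model, name, alphabet_size=26):
    """Pick the label maximising log P(label) + log P(last letter | label)."""
    label_counts, letter_counts = model
    total = sum(label_counts.values())
    letter = name[-1].lower()
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Add-one smoothing keeps unseen letters from getting zero probability.
        likelihood = (letter_counts[label][letter] + 1) / (count + alphabet_size)
        score = math.log(count / total) + math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration only.
toy_training = [
    ("Anna", "female"), ("Maria", "female"), ("Sophie", "female"),
    ("Mark", "male"), ("Peter", "male"), ("Marco", "male"),
]
model = train_last_letter_nb(toy_training)
print(classify_last_letter_nb(model, "Julia"))   # 'a' was seen only with female names
print(classify_last_letter_nb(model, "Victor"))  # 'r' was seen only with male names
```

The training step sees the correct labels; the classification step does not. That separation is what makes the approach supervised.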

The training set is used to train the model, and the dev-test set is used to perform error analysis. The test set serves in our final evaluation of the system. It is important to use a separate dev-test set for error analysis rather than the test set itself: once test examples have guided changes to the system, accuracy on them no longer reflects performance on unseen data.

The division of the corpus data into different subsets is shown in the following figure.
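The three-way division into training, dev-test and test sets can be sketched with plain list slicing. The helper name split_three_ways, its parameters and the sizes used in the demonstration are assumptions for illustration; the right sizes depend on how much labeled data is available.

```python
import random


def split_three_ways(items, test_size, devtest_size, seed=0):
    """Shuffle a copy of items and return (train, devtest, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    devtest = shuffled[test_size:test_size + devtest_size]
    train = shuffled[test_size + devtest_size:]
    return train, devtest, test


train, devtest, test = split_three_ways(range(100), test_size=10, devtest_size=20)
print(len(train), len(devtest), len(test))  # 70 20 10
```

Shuffling before slicing matters for the names corpus, because the labeled list is built by concatenating all male names and then all female names.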

The text files used can be obtained in either of two ways:

  • Directly from their text URLs: male.txt, female.txt
  • Through NLTK: male.txt and female.txt are downloaded automatically once nltk.download() has executed successfully. Path on the local system:
    path of nltk: C:\Users\currentUserName\AppData\Roaming
    path for files inside nltk: \nltk_data\corpora\names
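If the two text files were obtained directly rather than through NLTK, they can be read with the standard library alone. The sketch below assumes one name per line and UTF-8 compatible text; the function name load_labeled_names and the directory argument are hypothetical.

```python
from pathlib import Path


def load_labeled_names(directory):
    """Read male.txt and female.txt from directory into (name, label) pairs."""
    labeled = []
    for filename, label in [("male.txt", "male"), ("female.txt", "female")]:
        text = (Path(directory) / filename).read_text(encoding="utf-8")
        for line in text.splitlines():
            name = line.strip()
            if name:  # skip blank lines
                labeled.append((name, label))
    return labeled
```

The returned list has the same shape as the labeled_names list built from names.words(), so it can be shuffled and passed through the feature extractor in the same way.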

# importing libraries
import random

import nltk
from nltk.corpus import names


def gender_features(word):
    # Use only the final letter of the name as a feature.
    return {'last_letter': word[-1]}


# Prepare a list of examples and their corresponding class labels.
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

random.shuffle(labeled_names)

# Use the feature extractor to process the names data.
featuresets = [(gender_features(n), gender)
               for (n, gender) in labeled_names]

# Divide the resulting list of feature sets into a training set
# and a test set (the first 500 shuffled examples).
train_set, test_set = featuresets[500:], featuresets[:500]

# The training set is used to train a new "naive Bayes" classifier.
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Expected to print 'male', since the name ends in 'r'.
print(classifier.classify(gender_features('mahavir')))

# Measure accuracy on the held-out test set, not on the training data.
# With only the last letter as a feature it is typically roughly
# 0.75 to 0.78 (not above 0.99) and varies slightly with the shuffle.
print(nltk.classify.accuracy(classifier, test_set))
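The accuracy reported for a classifier is the fraction of labeled examples it labels correctly. The sketch below spells that out in plain Python; it illustrates the idea and is not NLTK's source code. The stand-in classifier toy_classify and the four held-out examples are invented for the demonstration.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    if not labeled_featuresets:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)


# Stand-in classifier, invented for illustration: 'a' means female, else male.
def toy_classify(features):
    return "female" if features["last_letter"] == "a" else "male"


held_out = [
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "k"}, "male"),
    ({"last_letter": "e"}, "female"),
    ({"last_letter": "o"}, "male"),
]
print(accuracy(toy_classify, held_out))  # 0.75
```

The number is only meaningful when the examples were not used for training. Scoring a classifier on its own training data tends to overstate how well it handles new names.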

Getting informative features from Classifier:

classifier.show_most_informative_features(10)
# 10 is the number of features (rows) to display
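The informative-features listing ranks feature values by how much more likely they are under one label than the other. The sketch below computes such likelihood ratios for last letters from raw counts, using add-one smoothing. It is an illustration under simplifying assumptions (two labels, one feature, invented toy data and the made-up name last_letter_ratios), not a reproduction of NLTK's calculation.

```python
from collections import Counter


def last_letter_ratios(labeled_names, label_a="female", label_b="male"):
    """Return {letter: P(letter | label_a) / P(letter | label_b)}, add-one smoothed."""
    counts = {label_a: Counter(), label_b: Counter()}
    for name, label in labeled_names:
        if label in counts:
            counts[label][name[-1].lower()] += 1
    letters = set(counts[label_a]) | set(counts[label_b])
    totals = {label: sum(c.values()) for label, c in counts.items()}
    ratios = {}
    for letter in letters:
        p_a = (counts[label_a][letter] + 1) / (totals[label_a] + len(letters))
        p_b = (counts[label_b][letter] + 1) / (totals[label_b] + len(letters))
        ratios[letter] = p_a / p_b
    return ratios


# Toy data, invented for illustration only.
toy = [("Anna", "female"), ("Maria", "female"), ("Eva", "female"),
       ("Mark", "male"), ("Luca", "male"), ("Erik", "male")]
ratios = last_letter_ratios(toy)
for letter, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(letter, round(ratio, 2))
```

A ratio well above 1 marks a letter that points towards the first label, and a ratio well below 1 marks one that points towards the second.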

Output:


