
Load text in Tensorflow


In this article, we are going to see how to load text in TensorFlow using Python.

TensorFlow is an open-source Machine Learning platform that helps create production-ready Machine Learning pipelines. Using TensorFlow, one can easily manage large datasets and develop a neural network model in a few lines of code. These large datasets may include audio, images, video, or text. In this article, we will focus on text datasets.

How to load text in TensorFlow?

Text is one of the most common forms of data today: documentation, media posts, social media conversations, and blog articles all come as text. This text arrives in raw form and must be loaded and prepared before a Machine Learning model can use it. TensorFlow provides utility support for loading text.

Let’s take an example to demonstrate how to load and preprocess text.

Before we proceed, let us first import the required modules and download the dataset.

Python3
import tensorflow as tf
import tensorflow.keras as keras
import pathlib

# Stack Overflow questions dataset used in the TensorFlow text tutorials
url = "https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz"

# Download and extract the archive; get_file returns the path to the downloaded file
download = keras.utils.get_file(
    origin=url, untar=True, cache_dir='stack_overflow')
DATA_DIR = pathlib.Path(download).parent
print(pathlib.os.listdir(DATA_DIR))
print(pathlib.os.listdir(f"{DATA_DIR}/train"))


Output:

['train', 'stack_overflow_16k.tar.gz', 'test', 'README.md']
['java', 'python', 'csharp', 'javascript']

We downloaded the Stack Overflow question text dataset in the above code using the Keras API. The utils.get_file method takes an origin URL that points to the actual data. Setting untar=True extracts the archive automatically and saves it in the cache directory. A Machine Learning model is trained on training data, tuned on validation data, and evaluated on testing data.

text_dataset_from_directory

TensorFlow enables us to read or load text directly from a directory and, moreover, lets us split the dataset into training and validation sets, all with the same method.

The training directory contains Java, Python, C#, and JavaScript questions, each class holding 2,000 text files.

Python3
TRAIN_DIR = f"{DATA_DIR}/train"
TEST_DIR = f"{DATA_DIR}/test"
  
# Count the text files in each class folder
for i in pathlib.os.listdir(TRAIN_DIR):
    text_len = len(pathlib.os.listdir(f"{TRAIN_DIR}/{i}"))
    print(f"{i} contains {text_len} texts")


Output:

java contains 2000 texts
python contains 2000 texts
csharp contains 2000 texts
javascript contains 2000 texts

To split off validation data and assign labels to the data automatically, we now use the text_dataset_from_directory method, which loads text directly from the directory.

Python3
training_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR,
    batch_size=32,
    validation_split=0.2,
    subset='training',
    seed=42)
  
validation_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR,
    batch_size=32,
    validation_split=0.2,
    subset='validation',
    seed=42)


Output:

Found 8000 files belonging to 4 classes.
Using 6400 files for training.

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
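To verify what these datasets yield, you can iterate over a batch: text_dataset_from_directory produces (text_batch, label_batch) pairs, where each label is an integer index into the alphabetically sorted folder names. Below is a minimal sketch using a tiny in-memory dataset as a stand-in for training_data, since the actual batch contents depend on the downloaded files (the sample questions and label indices here are illustrative):

```python
import tensorflow as tf

# A tiny in-memory stand-in for `training_data`: text_dataset_from_directory
# yields (text_batch, label_batch) pairs of exactly this form.
texts = [
    "how do i read a file in java",
    "what is a list comprehension in python",
    "how does async await work in javascript",
    "how to use linq in csharp",
]
labels = [1, 3, 2, 0]  # integer indices into the sorted class names
dataset = tf.data.Dataset.from_tensor_slices((texts, labels)).batch(2)

# Inspect the first batch, just as you would with `training_data.take(1)`
for text_batch, label_batch in dataset.take(1):
    for text, label in zip(text_batch.numpy(), label_batch.numpy()):
        print(label, text.decode("utf-8"))
```

With the real training_data, the attribute training_data.class_names gives the label-to-folder mapping, e.g. ['csharp', 'java', 'javascript', 'python'].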

This is how you can load text in TensorFlow.
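Once loaded, raw text usually needs to be vectorized before it reaches a model. A minimal sketch using Keras's TextVectorization layer (the vocabulary size, sequence length, and sample sentences below are illustrative choices, not values from this article):

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Illustrative settings: cap the vocabulary at 1,000 tokens and
# pad/truncate every question to 10 tokens.
vectorizer = TextVectorization(max_tokens=1000, output_sequence_length=10)

sample_texts = tf.constant([
    "how to read a file in python",
    "how to read a file in java",
])

# adapt() builds the vocabulary from the text itself
vectorizer.adapt(sample_texts)

# Each string becomes a fixed-length vector of integer token ids
vectorized = vectorizer(sample_texts)
print(vectorized.shape)
```

In practice you would adapt the layer on the training texts only, e.g. via training_data.map(lambda text, label: text), and then map the vectorizer over both the training and validation datasets.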



Last Updated : 21 Nov, 2022