Load text in TensorFlow

Last Updated : 21 Nov, 2022

In this article, we will see how to load text in TensorFlow using Python.

TensorFlow is an open-source machine learning platform for building production-ready machine learning pipelines. With TensorFlow, one can manage large datasets and define a neural network model in a few lines of code. These datasets may contain audio, images, video, or text. This article focuses on text datasets.

How to load text in TensorFlow?

Text is one of the most common forms of data. Documentation, media posts, social media conversations, and blog articles all arrive as text. This text is raw and has to be loaded and prepared before a machine learning model can use it. TensorFlow provides utilities for loading text.

Let’s walk through an example of loading text.

Before we proceed, let us first import the required modules and download the dataset.

Python3

import pathlib

import tensorflow as tf
import tensorflow.keras as keras

url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

# Download the archive into the stack_overflow cache directory and extract it
download = keras.utils.get_file(
    origin=url, untar=True, cache_dir='stack_overflow')

# List what sits alongside the downloaded file, then the class folders in train/
DATA_DIR = pathlib.Path(download).parent
print([p.name for p in DATA_DIR.iterdir()])
print([p.name for p in (DATA_DIR / 'train').iterdir()])

Output:

['train', 'stack_overflow_16k.tar.gz', 'test', 'README.md']
['java', 'python', 'csharp', 'javascript']

The code above downloads Stack Overflow question text using the Keras API. The keras.utils.get_file function takes the origin URL that hosts the data. Setting untar=True extracts the archive automatically after the download and saves its contents in the cache directory. The extracted data already contains train and test directories. A machine learning model is typically fitted on training data, checked against validation data during development, and evaluated on test data at the end, so a validation set still has to be created.
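To make the extraction step concrete, here is a minimal sketch that uses only the Python standard library. It builds a small stand-in archive and unpacks it, which is roughly the kind of work untar=True saves you from doing by hand. The archive name tiny_dataset.tar.gz and its contents are hypothetical, and this is not how get_file is implemented internally.

```python
import pathlib
import tarfile
import tempfile

# Scratch directory for this sketch (hypothetical paths throughout)
work_dir = pathlib.Path(tempfile.mkdtemp())

# Create a tiny stand-in for the downloaded archive: train/python/0.txt
src = work_dir / "src"
(src / "train" / "python").mkdir(parents=True)
(src / "train" / "python" / "0.txt").write_text("how do I reverse a list")

archive = work_dir / "tiny_dataset.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(src / "train", arcname="train")

# Extract the archive into its own directory
extract_dir = work_dir / "extracted"
extract_dir.mkdir()
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(extract_dir)

# Collect the extracted text files as paths relative to extract_dir
extracted_files = sorted(
    p.relative_to(extract_dir).as_posix()
    for p in extract_dir.rglob("*.txt"))
print(extracted_files)
```

The extracted tree keeps the same class-per-folder layout that the archive had, which is the layout the rest of this article relies on.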

text_dataset_from_directory

TensorFlow can load text directly from a directory and, with the same function, split it into training and validation sets.

The training directory contains Java, Python, C#, and JavaScript questions, with 2000 text files for each language.

Python3

TRAIN_DIR = f"{DATA_DIR}/train"
TEST_DIR = f"{DATA_DIR}/test"

# Count the text files in each class subdirectory
for class_dir in pathlib.Path(TRAIN_DIR).iterdir():
    text_len = len(list(class_dir.iterdir()))
    print(f"{class_dir.name} contains {text_len} text")

Output:

java contains 2000 text
python contains 2000 text
csharp contains 2000 text
javascript contains 2000 text

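To create validation data and assign labels, we now use the text_dataset_from_directory function. It loads text files from a directory and infers the label of each file from the name of the subdirectory that contains it. Passing validation_split=0.2 reserves 20% of the files for validation, and the subset argument selects which part of the split is returned.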

Python3

# Use the same seed and validation_split in both calls
# so that the training and validation subsets do not overlap
training_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR,
    batch_size=32,
    validation_split=0.2,
    subset='training',
    seed=42)

validation_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR,
    batch_size=32,
    validation_split=0.2,
    subset='validation',
    seed=42)

Output:

Found 8000 files belonging to 4 classes.
Using 6400 files for training.

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
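The file counts in this output follow directly from the arguments used. The short sketch below reproduces them with plain arithmetic and also works out how many batches each subset should yield at batch_size=32. It is a worked example of the numbers only, not a copy of the Keras internals.

```python
import math

total_files = 8000        # 4 classes x 2000 files each
validation_split = 0.2
batch_size = 32

# 20% of the files are held out for validation, the rest are used for training
validation_files = int(total_files * validation_split)
training_files = total_files - validation_files

# Number of batches per subset (the last batch may be smaller in general)
training_batches = math.ceil(training_files / batch_size)
validation_batches = math.ceil(validation_files / batch_size)

print(training_files, validation_files)      # 6400 1600
print(training_batches, validation_batches)  # 200 50
```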

This is how you can load text in TensorFlow.
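To see what a dataset returned by text_dataset_from_directory contains, the following minimal sketch builds a tiny directory tree with one folder per class and loads it. The folder names, file names, and sample sentences are made up for illustration, and shuffle=False is set only to keep the example simple. It assumes a TensorFlow version that exposes tf.keras.utils.text_dataset_from_directory.

```python
import pathlib
import tempfile

import tensorflow as tf

# Build a tiny, hypothetical dataset: one subdirectory per class,
# one .txt file per example (the same layout as the train/ directory)
root = pathlib.Path(tempfile.mkdtemp())
samples = {
    "java": ["how do I declare a final variable", "what is a static method"],
    "python": ["how do list comprehensions work", "what does self mean"],
}
for class_name, texts in samples.items():
    class_dir = root / class_name
    class_dir.mkdir()
    for i, text in enumerate(texts):
        (class_dir / f"{i}.txt").write_text(text)

dataset = tf.keras.utils.text_dataset_from_directory(
    str(root), batch_size=2, shuffle=False)

# Class names come from the subdirectory names; labels index into this list
class_names = dataset.class_names
print(class_names)

# Each element of the dataset is a (texts, labels) batch
all_labels = []
for text_batch, label_batch in dataset:
    for text, label in zip(text_batch.numpy(), label_batch.numpy()):
        all_labels.append(int(label))
        print(class_names[int(label)], "->", text.decode("utf-8"))
```

Each batch pairs a tensor of raw strings with a tensor of integer labels, so the text still needs to be vectorized before it is fed to a model.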
