Load text in TensorFlow
In this article, we will see how to load text in TensorFlow using Python.
TensorFlow is an open-source machine learning platform for building production-ready machine learning pipelines. With TensorFlow, one can manage large datasets and define a neural network model in a few lines of code. These datasets may contain audio, images, video, or text. This article focuses on text datasets.
How to load text in TensorFlow?
Text is one of the most common forms of data today. Documentation, media posts, social media conversations, and blog articles all come as text. This text arrives in raw form and has to be loaded and preprocessed before a machine learning model can use it. TensorFlow provides utilities for loading text.
Let’s take an example to demonstrate how to load and preprocess text.
Before we proceed, let us first import the required modules and download the dataset.
Python3
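import pathlib

import tensorflow as tf
from tensorflow import keras

# Assumed source: the Stack Overflow archive hosted by TensorFlow.
# The original snippet left `url` undefined; the file name matches the output below.
url = "https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz"

# Download the archive and extract it under ./stack_overflow
download = keras.utils.get_file(
    origin=url, untar=True, cache_dir='stack_overflow')

# The extracted files sit next to the downloaded archive.
# Note: the returned path can differ across Keras versions; check the listing below.
DATA_DIR = pathlib.Path(download).parent

print([p.name for p in DATA_DIR.iterdir()])
print([p.name for p in (DATA_DIR / "train").iterdir()])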
Output:
['train', 'stack_overflow_16k.tar.gz', 'test', 'README.md']
['java', 'python', 'csharp', 'javascript']
The code above downloads Stack Overflow question text using the Keras API. The keras.utils.get_file function takes the origin URL that hosts the data. Setting untar=True extracts the archive automatically and saves its contents in the cache directory. A machine learning model is typically trained on training data, checked against validation data during training, and finally evaluated on test data. The archive ships with only train and test folders, so the validation set has to be split off from the training data.
text_dataset_from_directory
TensorFlow can load text directly from a directory and, using the same function, split the dataset into training and validation sets.
The training directory contains four subdirectories (java, python, csharp, and javascript), one per question topic, each holding 2,000 text files.
Python3
TRAIN_DIR = f"{DATA_DIR}/train"
TEST_DIR = f"{DATA_DIR}/test"

# Count the text files in each class folder
for class_dir in pathlib.Path(TRAIN_DIR).iterdir():
    text_len = len(list(class_dir.iterdir()))
    print(f"{class_dir.name} contains {text_len} text")
Output:
java contains 2000 text
python contains 2000 text
csharp contains 2000 text
javascript contains 2000 text
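The folder names listed above double as class labels. The sketch below shows how the integer labels are expected to line up with those folders. It assumes the behavior described in the Keras documentation for inferred labels, where class names are taken from the subdirectory names in alphanumeric order. The variable names class_dirs, class_names, and label_of are illustrative only.

```python
# Class folders as listed in the training directory above.
class_dirs = ["java", "python", "csharp", "javascript"]

# Assumption: with inferred labels, Keras orders class names
# alphanumerically and uses the position as the integer label.
class_names = sorted(class_dirs)
label_of = {name: index for index, name in enumerate(class_names)}

print(class_names)  # ['csharp', 'java', 'javascript', 'python']
print(label_of)     # {'csharp': 0, 'java': 1, 'javascript': 2, 'python': 3}
```

After loading, the class_names attribute of the returned dataset is the place to confirm the actual order.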
To create the validation set and assign labels to the data, we now use the text_dataset_from_directory function, which loads text files from a directory.
Python3
# Use the same seed and split in both calls so the two subsets do not overlap
training_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR,
    batch_size=32,
    validation_split=0.2,
    subset='training',
    seed=42)

validation_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR,
    batch_size=32,
    validation_split=0.2,
    subset='validation',
    seed=42)
Output:
Found 8000 files belonging to 4 classes.
Using 6400 files for training.
Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
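The file counts in this output follow directly from the arguments passed in. As a quick worked check, the arithmetic below reproduces the 6400/1600 split from validation_split=0.2 and also estimates how many batches of 32 each subset yields. The variable names are illustrative, and the batch counts are plain arithmetic rather than values reported by TensorFlow.

```python
import math

total_files = 8000        # 4 classes x 2000 files
validation_split = 0.2
batch_size = 32

# 20% of the files are held out for validation, the rest are used for training.
validation_files = int(total_files * validation_split)
training_files = total_files - validation_files

# Each subset is served in batches of 32; a final partial batch would count too.
training_batches = math.ceil(training_files / batch_size)
validation_batches = math.ceil(validation_files / batch_size)

print(training_files, validation_files)      # 6400 1600
print(training_batches, validation_batches)  # 200 50
```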
This is how you can load text in TensorFlow.
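As an optional check, it can help to pull one batch from a loaded dataset and look at the text next to its label. The sketch below is self-contained, so it builds a tiny toy directory instead of using the downloaded data. The folder layout mirrors the one above, but the sample questions, file names, and the variables root and samples are hypothetical. It assumes a TensorFlow version that provides tf.keras.utils.text_dataset_from_directory, which reads only .txt files.

```python
import pathlib
import tempfile

import tensorflow as tf

# Build a tiny, hypothetical dataset: one subdirectory per class.
root = pathlib.Path(tempfile.mkdtemp())
samples = {
    "java": ["How do I declare an interface?",
             "What causes a NullPointerException?"],
    "python": ["How do list comprehensions work?",
               "What does __init__ do?"],
}
for label, texts in samples.items():
    class_dir = root / label
    class_dir.mkdir()
    for i, text in enumerate(texts):
        (class_dir / f"{i}.txt").write_text(text, encoding="utf-8")

# Load the toy directory; labels are inferred from the folder names.
dataset = tf.keras.utils.text_dataset_from_directory(
    str(root), batch_size=2, seed=42)
print(dataset.class_names)

# Each element is a (texts, labels) pair of tensors for one batch.
for texts, labels in dataset.take(1):
    for text, label in zip(texts.numpy(), labels.numpy()):
        print(dataset.class_names[label], "->", text.decode("utf-8"))
```

The same loop should work on training_data or validation_data from the example above, since they are datasets of the same kind.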