
How can Tensorflow be used to download and explore the Iliad dataset using Python?

Last Updated : 27 Jun, 2022

TensorFlow is a free, open-source machine learning and artificial intelligence library that is widely used for training and deploying neural networks. It was developed by the Google Brain team and supports a wide range of platforms. In this tutorial, we will learn to download, load, and explore the famous Iliad dataset.

The Iliad dataset consists of several English translations of the same text, Homer's Iliad. TensorFlow has modified the documents so that each file contains only a single translator's work. The dataset is available at the following URL.

https://storage.googleapis.com/download.tensorflow.org/data/illiad/

Example: In the following example, we will take the works of three translators: William Cowper; Edward, Earl of Derby; and Samuel Butler. With the help of TensorFlow, we will load the texts and label each line with its translator.

Install the TensorFlow text package:

pip install "tensorflow-text==2.8.*"

Download and load the Iliad dataset

We need to label each dataset individually, so we use the Dataset.map function together with a labeler function. This returns example-label pairs.

Python3
import pathlib
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization
import tensorflow_datasets as tfds
import tensorflow_text as tf_text

print("Welcome to GeeksforGeeks")
print("Loading the Iliad dataset")
DIRECTORY_URL = 'https://storage.googleapis.com/\
download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

# Download each translator's file into the Keras cache directory
for name in FILE_NAMES:
    text_dir = utils.get_file(name,
                              origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent

# Pair each line of text with its translator's index as an int64 label
def labeler(example, index):
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(str(parent_dir / file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

print(labeled_data_sets)


Output:

[<MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>,
<MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>,
<MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>]
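To see what the labeler produces without downloading anything, here is a minimal, self-contained sketch that applies the same Dataset.map labeling step to a couple of in-memory lines (the sample strings and the label value 2 are made up for illustration):

```python
import tensorflow as tf

# Hypothetical stand-in for one translator's file: two lines of text.
lines = tf.data.Dataset.from_tensor_slices(
    ["Sing, O goddess, the anger of Achilles",
     "son of Peleus, that brought countless ills"])

# Same idea as labeler() above: pair every line with an int64 label.
labeled = lines.map(lambda ex: (ex, tf.cast(2, tf.int64)))

for text, label in labeled:
    print(text.numpy(), label.numpy())
```

Each element comes out as a (string tensor, int64 tensor) pair, matching the element_spec shown in the output above.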

Next, we concatenate and shuffle the datasets. They are combined using the Dataset.concatenate function, and the shuffle function shuffles the data. We then print out a few examples.

Python3
BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000

# Concatenate the three labeled datasets into one
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

# Shuffle once so that lines from the three translators are interleaved
all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)

for text, label in all_labeled_data.take(5):
    print("Sentence: ", text.numpy())
    print("Label:", label.numpy())


Output:

Sentence:  b"Of brass, and color'd with a ring of gold."
Label: 0
Sentence:  b'drove the horses in among the others.'
Label: 2
Sentence:  b'Into the boundless ether. Reaching soon'
Label: 0
Sentence:  b"Drive to the ships, for pain weigh'd down his soul."
Label: 1
Sentence:  b"Not one is station'd to protect the camp."
Label: 1


