PyBrain – Dataset Types

Last Updated : 21 Feb, 2022

Datasets give convenient access to training, test, and validation data. Instead of juggling raw arrays, PyBrain provides a more sophisticated data structure that makes working with your data easier.

Datasets in PyBrain

The most commonly used datasets that PyBrain supports are SupervisedDataSet and ClassificationDataSet.

SupervisedDataSet: As the name says, this simplest form of a dataset is meant to be used with supervised learning tasks. It comprises the fields ‘input’ and ‘target’, whose pattern sizes must be set upon creation:

Python3

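from pybrain.datasets import SupervisedDataSet

# Each sample has 3 input values and 2 target values
DS = SupervisedDataSet(3, 2)

# Add one linked input/target pair
DS.appendLinked([1, 2, 3], [4, 5])

# Number of samples and the contents of the 'input' field
print(len(DS))
print(DS['input'])

Output:

1
[[1. 2. 3.]]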
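
To make the idea of linked fields with a fixed pattern size more concrete, here is a minimal pure-Python sketch. The class `LinkedFields` and its method names are hypothetical stand-ins written for illustration, not PyBrain's implementation. It only mimics two behaviours described above: every input row is stored together with a target row, and row widths are checked against the sizes given at creation.

```python
class LinkedFields:
    """Hypothetical stand-in for a dataset with linked 'input' and 'target' fields."""

    def __init__(self, input_size, target_size):
        # Pattern sizes are fixed when the dataset is created
        self.sizes = {'input': input_size, 'target': target_size}
        self.fields = {'input': [], 'target': []}

    def append_linked(self, inp, target):
        # Validate both rows before storing so the fields never get out of step
        if len(inp) != self.sizes['input'] or len(target) != self.sizes['target']:
            raise ValueError('pattern size does not match the size set at creation')
        self.fields['input'].append([float(v) for v in inp])
        self.fields['target'].append([float(v) for v in target])

    def __len__(self):
        # Number of linked samples
        return len(self.fields['input'])

    def __getitem__(self, name):
        # Field access by name, e.g. ds['input']
        return self.fields[name]


ds = LinkedFields(3, 2)
ds.append_linked([1, 2, 3], [4, 5])
print(len(ds))
print(ds['input'])
```

Because both rows are validated before either is stored, a sample with the wrong width is rejected as a whole and the two fields always stay the same length.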

ClassificationDataSet: It is mainly used for classification problems. In addition to the input and target fields, it has an extra field called “class”, which is an automated backup of the targets given. The targets are class labels: for example, the output is either 0 or 1, or, more generally, each sample falls into exactly one of a fixed set of classes.
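
The example further down calls `_convertToOneOfMany()` on its datasets. As a rough illustration of how a one-of-many (one-hot) target can sit alongside a backup of the original labels, here is a minimal pure-Python sketch. The function `convert_to_one_of_many` is a hypothetical helper written for this article, not PyBrain's code, and the labels are made-up values.

```python
def convert_to_one_of_many(targets, nb_classes):
    """Return (one_hot_targets, class_backup) for a list of integer class labels."""
    # Keep an untouched copy of the original labels, like the "class" field
    class_backup = list(targets)
    one_hot = []
    for label in targets:
        if not 0 <= label < nb_classes:
            raise ValueError('label outside the range of known classes')
        row = [0] * nb_classes
        row[label] = 1  # exactly one position is set per sample
        one_hot.append(row)
    return one_hot, class_backup


targets = [0, 2, 1]
one_hot, classes = convert_to_one_of_many(targets, nb_classes=3)
print(one_hot)   # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
print(classes)   # [0, 2, 1]
```

Each one-hot row has as many entries as there are classes, which is why a network trained on such targets needs one output unit per class.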

Python3

# Importing all the necessary libraries
from sklearn import datasets
from pybrain.datasets import ClassificationDataSet
from pybrain.utilities import percentError
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure.modules import SoftmaxLayer
from numpy import ravel
  
# Loading iris dataset from sklearn datasets 
iris = datasets.load_iris() 
  
# Defining feature variables and target variable 
X_data = iris.data 
y_data = iris.target 
  
# Defining classification dataset model 
classification_dataset = ClassificationDataSet(4, 1, nb_classes=3) 
  
# Adding sample into classification dataset 
for i in range(len(X_data)): 
    classification_dataset.addSample(ravel(X_data[i]), y_data[i]) 
  
# Splitting data into testing and training data
# (30% for testing, 70% for training)
testing_data, training_data = classification_dataset.splitWithProportion(0.3) 
  
# Classification dataset for test data 
test_data = ClassificationDataSet(4, 1, nb_classes=3) 
  
# Adding sample into testing classification dataset 
for n in range(0, testing_data.getLength()): 
    test_data.addSample(testing_data.getSample( 
        n)[0], testing_data.getSample(n)[1]) 
  
# Classification dataset for train data 
train_data = ClassificationDataSet(4, 1, nb_classes=3) 
  
# Adding sample into training classification dataset 
for n in range(0, training_data.getLength()): 
    train_data.addSample(training_data.getSample( 
        n)[0], training_data.getSample(n)[1]) 
  
# Converting the integer class targets to one-of-many encoding
# (one output unit per class)
test_data._convertToOneOfMany()
train_data._convertToOneOfMany()
  
# Building network with outclass as SoftmaxLayer 
# on training data 
build_network = buildNetwork( 
    train_data.indim, 4, train_data.outdim, outclass=SoftmaxLayer) 
  
# Building a backproptrainer on training data 
trainer = BackpropTrainer( 
    build_network, dataset=train_data, learningrate=0.01, verbose=True) 
  
# Training for 20 epochs on the training data
trainer.trainEpochs(20) 
  
# Testing data 
print('Error percentage on testing data=>', percentError( 
    trainer.testOnClassData(dataset=test_data), test_data['class'])) 

Output:

Total error:  0.0892390931641
Total error:  0.0821479733597
Total error:  0.0759327938967
Total error:  0.0722385583142
Total error:  0.0690818068826
Total error:  0.0667645311923
Total error:  0.0647079622731
Total error:  0.0630345245312
Total error:  0.0608030839912
Total error:  0.0595356750412
Total error:  0.0586635639408
Total error:  0.0573043661487
Total error:  0.0559188704413
Total error:  0.0548155819544
Total error:  0.0535537679931
Total error:  0.0527051106108
Total error:  0.0515783629912
Total error:  0.0501025301423
Total error:  0.0499123823243
Total error:  0.0482250742606
Error percentage on testing data=> 20.0
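
The final figure is the share of test samples whose predicted class differs from the stored class label. A minimal pure-Python sketch of that calculation follows. The function `percent_error` is a hypothetical helper written for illustration, not the source of PyBrain's `percentError`, and the label lists are made-up values rather than results from the run above.

```python
def percent_error(predicted, actual):
    """Percentage of positions where the predicted and actual labels differ."""
    if len(predicted) != len(actual):
        raise ValueError('both sequences must have the same length')
    if not actual:
        raise ValueError('cannot compute an error rate on empty data')
    # Count the mismatches, then scale to a percentage
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return 100.0 * wrong / len(actual)


predicted = [0, 1, 2, 2]
actual = [0, 1, 1, 2]
print(percent_error(predicted, actual))  # 25.0
```

One of the four made-up predictions is wrong, so the sketch reports 25.0. Read the same way, an error of 20.0 on the test data means that one in five test samples was misclassified.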
