SageMaker – Exploring Ground Truth Labeling | ML

Last Updated : 03 Jun, 2020

Have you ever wanted to do machine learning completely from scratch but not known where to start? If so, there is a place you can enter holding nothing but a dataset and leave with a fully trained machine learning model that is ready to be deployed in real-life scenarios. Amazon SageMaker provides services for labelling datasets, training models, tuning hyperparameters, and setting up inference for deployment. SageMaker offers the following services:

Ground Truth
Notebook Instances
Training Jobs and Hyperparameter Tuning
Inference

In this article, we explore Ground Truth labelling and how it reduces the burden of labelling datasets on developers.

Amazon SageMaker Ground Truth:
Ask any machine learning expert, “Which task consumes the most time in a machine learning project?” First, they will say preparing and cleaning the datasets, and second, labelling them accurately enough to suit the models. In general, labelling the datasets consumes around 70% of the time, and it takes even longer if the model being fed the dataset is sensitive to label quality. Ground Truth provides a labelling service that not only does the labelling but also creates the files that the models need in order to read the labels. Ground Truth offers three workforce options, namely:

Mechanical Turk workers: human workers from Amazon Mechanical Turk, useful for labelling small datasets.
Private labelling workforce: employees from your own organization label the dataset.
Third-party vendors: as the name implies, the datasets are labelled by external vendors.

Ground Truth also provides five built-in data labelling tasks:

Bounding Boxes
Image classification
Semantic segmentation
Text classification
Named Entity Recognition.

Customers can also bring their own custom labelling tasks. So how do we provide the input data in the first place, and what output format can we expect from Ground Truth?

Initially, the dataset that we intend to label has to be placed in an S3 bucket. While creating the labelling job, we specify the location of the dataset in the S3 bucket so that Ground Truth creates the input manifest file automatically. We also have to specify the output folder where the labelled data will be stored. The labelling can be done either by automated labelling, in which machine learning models complete the job, or by human labelling, in which human workers do the job. Automated labelling is typically used when the dataset is very large. Label consolidation then combines the annotations collected for each item (image, text, audio, etc.) in the dataset, using majority voting or by choosing the label with the higher probability. After the labelling is done, Ground Truth produces an augmented manifest file, which is used to train the model. The augmented manifest file consists of:
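To make the two ideas in this paragraph concrete, here is a minimal sketch in plain Python. It assumes the input manifest is a text file with one JSON object per line, each pointing at an S3 object through a "source-ref" key, and it uses a simple majority vote to stand in for label consolidation. The bucket name, object keys, and worker answers are hypothetical, and the majority vote is only an illustration of the idea, not the consolidation logic that Ground Truth itself runs.

```python
import json
from collections import Counter


def build_input_manifest(s3_uris):
    """Return manifest text: one JSON object per line, each with a "source-ref" key."""
    return "\n".join(json.dumps({"source-ref": uri}) for uri in s3_uris)


def majority_vote(worker_labels):
    """Consolidate several workers' labels for one item by simple majority.

    Returns (label, share_of_votes). Ties resolve to the label seen first.
    """
    counts = Counter(worker_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(worker_labels)


# Hypothetical bucket and object keys, used only for illustration.
uris = [
    "s3://example-bucket/birds/img_001.jpg",
    "s3://example-bucket/birds/img_002.jpg",
]
manifest_text = build_input_manifest(uris)
print(manifest_text)

# Three hypothetical workers labelled the same image.
consolidated = majority_vote(["Bird", "Bird", "Plane"])
print(consolidated)
```

In this sketch, two of the three workers agree, so the consolidated label is "Bird" with a vote share of two thirds.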

source-ref: The location of the source object in the S3 bucket.
Labelling job name: The name that was given to the labelling job when it was created.
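As a rough illustration of how these two parts might be read back, the sketch below parses a single manifest line with Python's json module. The sample line is made up: the job name "bird-labelling-job", the S3 path, and the value stored under the job name are all assumptions, since the real label payload depends on the task type.

```python
import json

# Illustrative line only: the job name and S3 path are hypothetical, and the
# value stored under the job name depends on the labelling task type.
sample_line = json.dumps({
    "source-ref": "s3://example-bucket/birds/img_001.jpg",
    "bird-labelling-job": {"label": "Bird"},
})


def read_manifest_line(line, job_name):
    """Split one augmented-manifest line into its source object and its label data."""
    record = json.loads(line)
    return record["source-ref"], record.get(job_name)


source, label_data = read_manifest_line(sample_line, "bird-labelling-job")
print(source)
print(label_data)
```

Using `record.get` means a line without an entry for the given job name yields `None` for the label data instead of raising an error.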

Code: The output of a labelling job that was performed on a bird image looks like this.

[
  {
    "boundingBox": {
      "boundingBoxes": [
        {
          "height": 845,
          "label": "Bird",
          "left": 54,
          "top": 19,
          "width": 765
        }
      ],
      "inputImageProperties": {
        "height": 1024,
        "width": 968
      }
    }
  }
]
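The output above stores the box as left, top, width and height in pixels, while some training code expects corner coordinates instead. The following sketch, which is an illustration and not part of Ground Truth itself, loads the same JSON, converts the box to corner coordinates, and computes how much of the image the box covers. The helper names `to_corners` and `area_fraction` are hypothetical.

```python
import json

# The labelling output shown above, reproduced as a JSON string.
raw_output = """
[{"boundingBox": {
    "boundingBoxes": [
      {"height": 845, "label": "Bird", "left": 54, "top": 19, "width": 765}
    ],
    "inputImageProperties": {"height": 1024, "width": 968}
}}]
"""


def to_corners(box):
    """Convert a left/top/width/height box to (xmin, ymin, xmax, ymax) in pixels."""
    return (
        box["left"],
        box["top"],
        box["left"] + box["width"],
        box["top"] + box["height"],
    )


def area_fraction(box, image):
    """Fraction of the image area covered by the box."""
    return (box["width"] * box["height"]) / (image["width"] * image["height"])


result = json.loads(raw_output)[0]["boundingBox"]
image = result["inputImageProperties"]
bird = result["boundingBoxes"][0]

corners = to_corners(bird)
coverage = area_fraction(bird, image)
print(bird["label"], corners)  # Bird (54, 19, 819, 864)
print(round(coverage, 3))      # 0.652
```

The box lies inside the 968 x 1024 image and covers roughly 65% of it, which is a quick sanity check worth running on labels before training.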

We have now successfully labelled the dataset of our choice, and we have the augmented manifest files, which are ready to be fed to model training inside a SageMaker notebook!
