Have you ever thought of doing machine learning completely from scratch but didn't know where to start? If so, there is a place you can enter with just a dataset in your hands and leave with a fully trained machine learning model that is ready to be deployed in real-life scenarios. Amazon SageMaker provides services for labelling datasets, training models, tuning hyperparameters, and hosting trained models for inference. SageMaker offers the following services:
- Ground Truth
- Notebook Instances
- Training Jobs/Hyperparameter Tuning
In this article, we explore Ground Truth labelling and how it helps reduce the burden of labelling datasets on developers.
Amazon SageMaker Ground Truth:
Ask any machine learning expert, “Which task consumes the most time in a machine learning project?” First, they will say preparing and cleaning datasets, and second, labelling them accurately so that they fit the model. Labelling datasets is often said to consume around 70% of the total time, and it can take even longer when the model being trained is sensitive to label quality. Amazon SageMaker Ground Truth provides a labelling service that not only labels the data but also creates the files that models need in order to consume it. Ground Truth offers three workforce options:
- Amazon Mechanical Turk workers: a public workforce of human labellers, useful for labelling small datasets.
- Private labelling workforce: employees from your own organization label the dataset.
- Third-party vendors: as the name implies, the datasets are labelled by external vendors.
Ground Truth also provides five built-in data labelling tasks:
- Bounding boxes
- Image classification
- Semantic segmentation
- Text classification
- Named entity recognition
Customers can also bring their own custom labelling tasks. So how do we provide the input data, and what output format can we expect from Ground Truth?
First, the dataset that we intend to label has to be placed in an S3 bucket. While creating the labelling job, we specify the location of the dataset in the S3 bucket so that Ground Truth can create the input manifest file automatically. We also have to specify the output folder where the labelled data will be stored. As shown in the image, the labelling can be done either by automatic labelling, in which machine learning models complete the job, or by human labelling, in which human workers do the job. Automatic labelling is typically used when the dataset is large. Label consolidation combines the annotations that several workers gave to the same image, text, audio clip, etc., for example by majority voting or by choosing the label with the highest probability. After the labelling is done, Ground Truth produces an augmented manifest file, which can be used to train the model. The augmented manifest file consists of:
- source-ref: the S3 location of the object that was labelled.
- Labelling job name: the name given to the labelling job when it was created. By default, the label for the object is stored under this name.
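The input side and the consolidation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bucket name and object keys are hypothetical, `build_input_manifest` and `majority_vote` are helper names invented for this example, and a plain majority vote stands in for the consolidation idea in its simplest form rather than for what Ground Truth actually runs. The input manifest itself is a text file with one JSON object per line, each carrying a `source-ref` key.

```python
import json
from collections import Counter


def build_input_manifest(s3_uris):
    """Return manifest text: one JSON object per line with a source-ref key."""
    return "".join(json.dumps({"source-ref": uri}) + "\n" for uri in s3_uris)


def majority_vote(worker_labels):
    """Return (winning_label, share_of_votes) for one object.

    On a tie, Counter.most_common keeps the label that was seen first.
    """
    counts = Counter(worker_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(worker_labels)


# Hypothetical bucket and keys, used for illustration only.
image_uris = [
    "s3://example-bucket/birds/bird-001.jpg",
    "s3://example-bucket/birds/bird-002.jpg",
]
manifest_text = build_input_manifest(image_uris)
print(manifest_text)

# Three hypothetical workers labelled the same image.
winner, share = majority_vote(["bird", "bird", "cat"])
print(winner, round(share, 2))
```

In practice the manifest text would be uploaded to S3 (or generated by the console), and the share of votes gives a rough idea of how much the workers agreed.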
Code: what the output of a labelling job performed on a bird image looks like.
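As a rough illustration of such a record, the sketch below builds one augmented manifest line for an image classification job and reads it back. It is not the listing from the original job: the bucket, the job name `bird-labelling-job`, and all values are made up, and `parse_record` is a helper written for this example. It assumes the default layout, in which the label sits under the job name and extra details sit under a `<job name>-metadata` key.

```python
import json

# Illustrative record only: bucket, job name, and values are hypothetical.
SAMPLE_LINE = json.dumps({
    "source-ref": "s3://example-bucket/birds/bird-001.jpg",
    "bird-labelling-job": 0,
    "bird-labelling-job-metadata": {
        "class-name": "bird",
        "confidence": 0.92,
        "human-annotated": "yes",
        "type": "groundtruth/image-classification",
        "job-name": "labeling-job/bird-labelling-job",
        "creation-date": "2020-01-01T00:00:00",
    },
})


def parse_record(line, label_attribute):
    """Extract the source object, label, and basic metadata from one line."""
    record = json.loads(line)
    metadata = record.get(f"{label_attribute}-metadata", {})
    return {
        "source": record["source-ref"],
        "label": record[label_attribute],
        "class_name": metadata.get("class-name"),
        "confidence": metadata.get("confidence"),
    }


parsed = parse_record(SAMPLE_LINE, "bird-labelling-job")
print(parsed)
```

Other task types such as bounding boxes store a richer structure under the label attribute, so the parsing step would need to be adapted for them.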
We have now successfully labelled the dataset of our choice, and we have the augmented manifest files, which are ready to be fed to a model training job from a SageMaker notebook!
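To give an idea of how the augmented manifest could be handed to a training job, here is a minimal sketch of one input channel written as a plain Python dictionary. The field names follow the shape of the SageMaker CreateTrainingJob request as best understood here and should be checked against the API reference before use. The S3 path, the job name, and the helper `augmented_manifest_channel` are assumptions made for illustration.

```python
import json


def augmented_manifest_channel(manifest_s3_uri, label_attribute, channel_name="train"):
    """Describe a training input channel that reads an augmented manifest."""
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "AugmentedManifestFile",
                "S3Uri": manifest_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
                # Attributes to read from every manifest line, in this order.
                "AttributeNames": ["source-ref", label_attribute],
            }
        },
        "ContentType": "application/x-recordio",
        "RecordWrapperType": "RecordIO",
        # Augmented manifests are streamed to the algorithm in Pipe mode.
        "InputMode": "Pipe",
    }


# Hypothetical output location and job name.
channel = augmented_manifest_channel(
    "s3://example-bucket/output/bird-labelling-job/manifests/output/output.manifest",
    "bird-labelling-job",
)
print(json.dumps(channel, indent=2))
```

A dictionary like this would go into the list of input channels of the training job request, next to the algorithm, role, and output settings, which are left out here.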