AWS Machine Learning Blog
Verifying and adjusting your data labels to create higher quality training datasets with Amazon SageMaker Ground Truth
Building a highly accurate training dataset for your machine learning (ML) algorithm is an iterative process. It is common to review and continuously adjust your labels until you are satisfied that the labels accurately represent the ground truth, or what is directly observable in the real world. ML practitioners often build custom systems to review and update data labels because accurately labeled data is critical to ML model quality. If there are issues with the labels, the ML model can’t effectively learn the ground truth, which leads to inaccurate predictions.
One way that ML practitioners have improved the accuracy of their labeled data is by using audit workflows. Audit workflows enable a group of reviewers to verify the accuracy of labels (a process called label verification) or adjust them if needed (a process called label adjustment).
Amazon SageMaker Ground Truth now features built-in workflows for label verification and label adjustment for bounding boxes and semantic segmentation. With these new workflows, you can chain an existing Amazon SageMaker Ground Truth labeling job to a verification or adjustment job, or you can import your existing labels for a verification or adjustment job.
This post walks you through both options for bounding box labels. The walkthrough assumes that you are familiar with running a labeling job or have existing labels. For more information, see Amazon SageMaker Ground Truth – Build Highly Accurate Datasets and Reduce Labeling Costs by up to 70%.
Chaining a completed Amazon SageMaker Ground Truth labeling job
To chain a completed labeling job, complete the following steps.
- From the Amazon SageMaker Ground Truth console, choose Labeling jobs.
- Select your desired job.
- From the Actions drop-down menu, choose Chain.
The following screenshot shows the Labeling jobs page:
For more information, see Chaining labeling jobs.
The Job overview page carries forward the configurations you used for the labeling job that you are chaining from. If you have no changes to make, you can move on to the next section, Task type.
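If you prefer to script this step rather than use the console, chaining roughly amounts to using the completed job's output manifest as the input of a new job. The following is a minimal sketch under stated assumptions, not the console's actual implementation: the helper name `chained_job_input`, the bucket, and the job name are hypothetical, and the `manifests/output/output.manifest` layout is assumed to match where your completed job wrote its output, so confirm the path in your own bucket. The returned dictionary has the shape of the `InputConfig` parameter of the `CreateLabelingJob` API; the other required parameters of that call are not shown.

```python
def chained_job_input(previous_output_path, previous_job_name):
    """Build an InputConfig-shaped dict that points at a completed job's output manifest.

    Assumption: the completed job wrote its consolidated labels to
    <output path>/<job name>/manifests/output/output.manifest.
    """
    base = previous_output_path.rstrip("/")
    manifest_uri = f"{base}/{previous_job_name}/manifests/output/output.manifest"
    return {"DataSource": {"S3DataSource": {"ManifestS3Uri": manifest_uri}}}


# Hypothetical bucket and job name, for illustration only.
input_config = chained_job_input("s3://example-bucket/output/", "my-bounding-box-job")
print(input_config["DataSource"]["S3DataSource"]["ManifestS3Uri"])
```

Keeping this as a pure function makes it easy to check the manifest location before you submit anything to the service.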
Configuring label verification
To use label verification, from Task type, choose Label verification.
See the following screenshot of the Task type page:
The Workers section is preconfigured to the selections you made for the chained labeling job. You can opt to choose a different workforce or stick with the same configurations for your label verification job. For more information, see Managing Your Workforce.
You can define your verification labels, for example, Label Correct, Label Incorrect – Object(s) Missed, and Label Incorrect – Box(es) Not Tightly Drawn.
You can also specify the instructions in the left-hand panel to guide reviewers on how to verify the labels.
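The console collects these verification labels for you. If you create jobs through the API instead, labels are typically supplied in a label category configuration file. The sketch below shows what such a file might look like for the three example labels; the helper name `label_category_config` is hypothetical, and verification jobs may need additional fields beyond the two shown here, so treat this as a starting point and check the Ground Truth documentation for the full schema.

```python
import json

# The three example verification labels used in this post.
VERIFICATION_LABELS = [
    "Label Correct",
    "Label Incorrect – Object(s) Missed",
    "Label Incorrect – Box(es) Not Tightly Drawn",
]


def label_category_config(labels):
    """Return a label category configuration dict for the given label names."""
    if len(set(labels)) != len(labels):
        raise ValueError("label names must be unique")
    return {
        "document-version": "2018-11-28",
        "labels": [{"label": name} for name in labels],
    }


config = label_category_config(VERIFICATION_LABELS)
# ensure_ascii=False keeps the en dashes in the label names readable.
print(json.dumps(config, indent=2, ensure_ascii=False))
```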
See the following screenshot of the Label verification tool page:
Configuring label adjustment
To perform label adjustment, from the Task type section, choose Bounding box. See the following screenshot of the Task type page:
The following steps for configuring the Workers section and setting up the labeling tool are similar to creating a verification job. The one exception is that you must opt into displaying existing labels in the Existing-labels display options section. See the following screenshot:
Uploading your existing labels from outside Amazon SageMaker Ground Truth
If you labeled your data outside of Amazon SageMaker Ground Truth, you can still use the service to verify or adjust your labels. Import your existing labels by following these steps.
- Create an augmented manifest with both your data and existing labels. In each line of the manifest, the source-ref attribute points to the image that was labeled, and the bound-box attribute holds the label.
- Save your augmented manifest in Amazon S3. You should save the manifest in the same S3 bucket as your images. Also, remember the attribute name of your labels (in this post, bound-box) because you need to point to this when you set up your jobs. Additionally, make sure that the labels conform to the label format prescribed by Amazon SageMaker Ground Truth. For example, you can see the label format for bounding boxes in Bounding Box Job Output. You are now ready to create verification and adjustment jobs.
- From the Amazon SageMaker Ground Truth console, create a new labeling job.
- In Job overview, for Input dataset location, point to the S3 path of the augmented manifest that you created. See the following screenshot of the Job overview page:
- Follow the steps previously outlined to configure Task Type, Workers, and the labeling tool when setting up your verification or adjustment job.
- In Existing-labels display options, for Label attribute name, select the attribute name of your labels from your augmented manifest (in this post, bound-box) from the drop-down menu. See the following screenshot of Existing-labels display options:
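To make the import steps above more concrete, here is a minimal sketch of how one line of an augmented manifest could be built and sanity-checked before you upload it. It assumes your existing labels are stored as corner coordinates and converts them to the left/top/width/height form used by Ground Truth bounding box output. The helper names, bucket, image size, and class map are all hypothetical, and the metadata fields mirror what Ground Truth writes for its own jobs; which of them are required for imported labels is something to confirm against Bounding Box Job Output.

```python
import json


def corners_to_box(x_min, y_min, x_max, y_max, class_id):
    """Convert corner coordinates to left/top/width/height form."""
    return {"class_id": class_id, "left": x_min, "top": y_min,
            "width": x_max - x_min, "height": y_max - y_min}


def manifest_line(image_uri, image_size, boxes, class_map, label_attr="bound-box"):
    """Serialize one image and its boxes as a single JSON line."""
    width, height, depth = image_size
    record = {
        "source-ref": image_uri,
        label_attr: {
            "image_size": [{"width": width, "height": height, "depth": depth}],
            "annotations": boxes,
        },
        # Assumption: metadata modeled on Ground Truth's own output.
        f"{label_attr}-metadata": {
            "class-map": {str(k): v for k, v in class_map.items()},
            "type": "groundtruth/object-detection",
            "human-annotated": "yes",
        },
    }
    return json.dumps(record)


def find_box_problems(record, label_attr="bound-box"):
    """Return a list of human-readable problems found in a record's boxes."""
    label = record[label_attr]
    size = label["image_size"][0]
    problems = []
    for i, box in enumerate(label["annotations"]):
        if box["width"] <= 0 or box["height"] <= 0:
            problems.append(f"box {i}: width and height must be positive")
        if box["left"] < 0 or box["top"] < 0:
            problems.append(f"box {i}: left and top must be non-negative")
        if (box["left"] + box["width"] > size["width"]
                or box["top"] + box["height"] > size["height"]):
            problems.append(f"box {i}: extends past the image edge")
    return problems


# Hypothetical image and label, for illustration only.
box = corners_to_box(10, 20, 110, 220, class_id=0)
line = manifest_line("s3://example-bucket/images/img1.jpg", (640, 480, 3), [box], {0: "dog"})
print(line)
print(find_box_problems(json.loads(line)))
```

An augmented manifest holds one such JSON object per line, so you would write each line followed by a newline to a single file and then save that file in Amazon S3 alongside your images.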
Conclusion
A highly accurate training dataset is critical to the success of your ML initiatives, and you now have built-in workflows to perform label verification and adjustment through Amazon SageMaker Ground Truth. This post walked you through how to use the new label verification and adjustment features. You can chain a completed labeling job, or you can upload existing labels. Visit the Amazon SageMaker Ground Truth console to get started.
As always, AWS welcomes feedback. Please submit any comments or questions.
About the Authors
Sifan Wang is a Software Development Engineer for AWS AI. His focus is on building scalable systems to process big data and intelligent systems to learn from the data. In his spare time, he enjoys traveling and hitting the gym.
Carter Williams is a Web Development Engineer on the Mechanical Turk Requester CX team with a focus on Computer Vision UIs. He strives to learn and develop new ways to gather accurate annotation data in intuitive ways using web technologies. In his free time, he enjoys paintball, hockey, and snowboarding.
Vikram Madan is the Product Manager for Amazon SageMaker Ground Truth. He focuses on delivering products that make it easier to build machine learning solutions. In his spare time, he enjoys running long distances and watching documentaries.