For most organizations, the bottleneck in building a high-quality dataset is not acquiring data but labeling it. The human labor involved in labeling is costly in both time and money. Gradio’s data engine makes the most effective use of an organization’s labeling budget by identifying the datapoints in the unlabeled dataset that would provide the highest value to the model if they were labeled. Gradio then performs augmentations on these datapoints to expand coverage. The resulting Gradio-powered labeled dataset allows the model to perform at a level that would normally require a dataset many times larger.
We’ll see this in action on a US government-provided dataset of consumer financial complaints, in which each complaint is labeled with its domain. Each complaint belongs to one of five categories, such as “debt” or “credit”. There are 2000 complaints in this dataset.
The first technique that we present is data valuation. Imagine that all 2000 datapoints from the training set are available to us, but none of them are labeled. Because labeling is expensive in this scenario, we only have the budget to label a sample of the data. We’ll simulate budgets that allow for sample sizes of 25, 100, 250, 500, and 1000 complaints. As a baseline, let’s randomly select which datapoints to label at each budget level and train a model at each corresponding dataset size.
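The random baseline can be sketched in a few lines of Python. This is only an illustration: the function name `random_label_subsets` and the fixed seed are hypothetical, and the choice to nest the subsets (so that each larger budget extends the smaller one) is an assumption, since the text only says that datapoints are selected at random at each budget level.

```python
import random

DATASET_SIZE = 2000
BUDGETS = [25, 100, 250, 500, 1000]

def random_label_subsets(n_points, budgets, seed=0):
    """Pick which datapoint indices get labeled at each budget level.

    Assumption: subsets are nested, i.e. the 100-sample budget contains
    the 25-sample budget, and so on. Independent draws per budget would
    be an equally valid reading of the text.
    """
    rng = random.Random(seed)
    order = list(range(n_points))
    rng.shuffle(order)  # one random ordering shared by all budgets
    return {budget: sorted(order[:budget]) for budget in budgets}

subsets = random_label_subsets(DATASET_SIZE, BUDGETS)
for budget in BUDGETS:
    print(budget, "datapoints to label, first few indices:", subsets[budget][:5])
```

A model would then be trained on the labeled complaints behind each entry of `subsets`, giving one baseline model per budget.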
With data valuation, Gradio can direct labelers towards the datapoints that would provide the most value in improving the model. For example, if the labeling budget only allowed for 500 of the 2000 complaints to be labeled, Gradio’s algorithms would tell labelers which 100 datapoints are best to label first. After the model is trained on these 100 datapoints, Gradio would analyze the model and the remaining dataset, and select the next 100 datapoints that would give the most value to the model. The model would then be retrained with these additional 100 datapoints. This process would repeat until 500 datapoints are labeled, at which point the budget is exhausted.
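The label, train, and reselect loop described above has the shape of pool-based active learning. The sketch below is not Gradio’s algorithm, which is not described here; it swaps in plain uncertainty sampling as a stand-in for the valuation step. All names (`directed_labeling`, `train_threshold`, `threshold_uncertainty`) are hypothetical, the data is a toy one-dimensional problem rather than complaint text, and the first batch is drawn at random because no model exists yet, which is also an assumption.

```python
import random

def directed_labeling(pool, oracle, train, uncertainty, budget, batch_size, seed=0):
    """Label `budget` points from `pool` in batches of `batch_size`.

    Stand-in selection rule: after the first (random) batch, each round
    picks the unlabeled points the current model is least sure about.
    """
    rng = random.Random(seed)
    unlabeled = list(range(len(pool)))
    labeled = {}  # index -> label supplied by the human labeler (oracle)
    model = None
    while len(labeled) < budget and unlabeled:
        k = min(batch_size, budget - len(labeled), len(unlabeled))
        if model is None:
            batch = rng.sample(unlabeled, k)  # no model yet: random start
        else:
            ranked = sorted(unlabeled,
                            key=lambda i: uncertainty(model, pool[i]),
                            reverse=True)
            batch = ranked[:k]  # the k most uncertain points
        for i in batch:
            labeled[i] = oracle(pool[i])
        chosen = set(batch)
        unlabeled = [i for i in unlabeled if i not in chosen]
        # Retrain on everything labeled so far.
        model = train([(pool[i], y) for i, y in labeled.items()])
    return labeled, model

def train_threshold(examples):
    """Toy model: a decision threshold halfway between the class means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    if not pos or not neg:  # only one class seen so far
        return sum(x for x, _ in examples) / len(examples)
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def threshold_uncertainty(threshold, x):
    """Points closer to the threshold get a higher uncertainty score."""
    return -abs(x - threshold)

data_rng = random.Random(1)
pool = [data_rng.random() for _ in range(2000)]
labeled, model = directed_labeling(
    pool,
    oracle=lambda x: int(x > 0.5),  # simulated labeler
    train=train_threshold,
    uncertainty=threshold_uncertainty,
    budget=500,
    batch_size=100,
)
print(len(labeled), "datapoints labeled; learned threshold:", round(model, 3))
```

With a budget of 500 and a batch size of 100, the loop runs five rounds, matching the example in the text. Any other scoring function can be passed as `uncertainty` without changing the loop.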
We now have two datasets at each sample size: one built from randomly sampled datapoints, and one built from Gradio-directed sampling.
We now apply the second technique, augmentation, to the Gradio-directed datasets. We augment each of these datasets by making copies of its datapoints and transforming the copies using Gradio’s extensive library of transformations. This expands the effective size of each dataset significantly.
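To make the copy-and-transform idea concrete, here is a minimal sketch for text data. The two transformations used (`drop_word` and `swap_adjacent`) are simple illustrative stand-ins, not taken from Gradio’s transformation library, and the function names, sample complaints, and `copies` parameter are all hypothetical.

```python
import random

def drop_word(words, rng):
    """Return a copy of `words` with one randomly chosen word removed."""
    if len(words) < 2:
        return list(words)
    i = rng.randrange(len(words))
    return words[:i] + words[i + 1:]

def swap_adjacent(words, rng):
    """Return a copy of `words` with one random adjacent pair swapped."""
    if len(words) < 2:
        return list(words)
    i = rng.randrange(len(words) - 1)
    out = list(words)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def augment(dataset, copies=2, seed=0):
    """Keep every original (text, label) pair and add `copies` transformed
    variants of each, all carrying the original label."""
    rng = random.Random(seed)
    transforms = [drop_word, swap_adjacent]
    augmented = list(dataset)
    for text, label in dataset:
        words = text.split()
        for _ in range(copies):
            transform = rng.choice(transforms)
            augmented.append((" ".join(transform(words, rng)), label))
    return augmented

sample = [
    ("the collector keeps calling about a debt I already paid", "debt"),
    ("my credit report lists an account I never opened", "credit"),
]
expanded = augment(sample, copies=3)
for text, label in expanded:
    print(label, "|", text)
```

Each transformed copy keeps its source label, so the labeling budget is spent only on the original datapoints while the training set grows by a factor of `copies + 1`.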
With two datasets at each sample size, we can now plot the performance of models trained on the randomly sampled dataset against models trained on the Gradio-powered dataset.
We see that a Gradio-powered dataset lets the model reach accuracy levels that would otherwise require a much larger dataset. With just 25 samples, the Gradio-driven dataset outperforms a randomly sampled dataset of 500 samples. At 100 samples, the Gradio dataset outperforms the 1000-sample training set built without Gradio’s directed labeling and augmentations.