Why use it?
So why should you use synthetic data? Well, synthetic data is ...
... quicker to iterate.
Synthetic data makes it easy to change the annotation style, or to add an extra label that can drive an additional training loss for the model. It also makes it easy to generate more examples of a specific edge case that may be causing issues in production. Generating and iterating on synthetic data should be easy, and it should happen in concert with adjustments to the model in order to achieve one’s goals.
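To make that concrete, here is a minimal sketch of what this kind of iteration can look like in code. It is a toy generator, not Zumo Labs' actual tooling: the function names, the `occluded` edge case, and the `depth_m` label are all hypothetical and chosen only for illustration. The point is that the annotation style, an extra label, and the frequency of an edge case are each just a parameter.

```python
import random


def generate_example(rng, occluded_prob=0.1, annotation_style="xywh",
                     include_depth=False):
    """Generate the labels for one toy 'scene' containing a single box.

    Everything here is hypothetical. Because we create the scene ourselves,
    the labels are known by construction rather than annotated by hand.
    """
    x, y = rng.uniform(0.0, 0.8), rng.uniform(0.0, 0.8)
    w, h = rng.uniform(0.05, 0.2), rng.uniform(0.05, 0.2)

    # Edge case: how often the object is occluded is a tunable parameter.
    occluded = rng.random() < occluded_prob

    # Annotation style: same underlying box, different label format.
    if annotation_style == "xywh":
        bbox = (x, y, w, h)
    elif annotation_style == "xyxy":
        bbox = (x, y, x + w, y + h)
    else:
        raise ValueError(f"unknown annotation style: {annotation_style}")

    label = {"bbox": bbox, "occluded": occluded}

    # Extra label: could feed an auxiliary training loss.
    if include_depth:
        label["depth_m"] = rng.uniform(1.0, 50.0)
    return label


def generate_dataset(n, seed=0, **kwargs):
    """Generate n labeled examples; the size is limited only by compute."""
    rng = random.Random(seed)
    return [generate_example(rng, **kwargs) for _ in range(n)]


# First iteration: default settings.
baseline = generate_dataset(1000)

# Next iteration: new box format, an added depth label, and many more
# occluded examples, all without re-annotating anything.
edge_heavy = generate_dataset(1000, occluded_prob=0.5,
                              annotation_style="xyxy", include_depth=True)
```

In a real pipeline the generator would render images as well as labels, but the iteration loop has the same shape: change a parameter, regenerate, retrain.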
... virtually infinite.
Modern machine learning requires larger and larger datasets. With synthetic data, you can scale a dataset to the size required to train high-performing models.
... perfectly labeled.
At Zumo Labs, many of our incoming customers have a common pain point: labeled training data has become a significant bottleneck. By some estimates, data wrangling (that is, sourcing labeled data and managing the training data pipeline) can take up to 80% of AI project time.
... free of privacy risks and biases.
Traditionally, the problem has been that compiling useful datasets requires infringing on people’s privacy by collecting their personal information, while guaranteeing privacy means either settling for smaller or lower-quality datasets, or stripping them of information to the point that they are no longer useful.