Synthetic Data: A Tutorial

Do you have questions about what synthetic data is, or whether synthetic data is right for you? In this tutorial, we're going to cover:

  • What is synthetic data useful for?
  • How does synthetic data compare to other privacy techniques?
  • Does synthetic data affect downstream analytics or machine learning?
  • How can you try synthetic data on your own dataset?

Let's start with a definition: synthetic data is artificial data that is designed to be similar in structure to real data, but without any sensitive information. It is generated based on a model of how we expect the data to be distributed. Since it does not include any real data, it eliminates the need for masking or subsetting, and can be used safely for testing and sharing.
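To make "generated based on a model" concrete, here is a minimal sketch in Python. It is not how any particular product works. The column names (`age`, `plan`), the toy records, and the helper functions `fit_model` and `sample_synthetic` are all made up for illustration. The sketch fits a very simple model (a normal distribution for a numeric column, observed frequencies for a categorical one) and then samples brand-new rows from it. Because each column is sampled independently, it ignores relationships between columns, and on its own it offers no formal privacy guarantee.

```python
import random
import statistics
from collections import Counter


def fit_model(rows):
    """Fit a per-column model: normal for 'age', frequencies for 'plan'."""
    ages = [r["age"] for r in rows]
    plans = Counter(r["plan"] for r in rows)
    return {
        "age_mean": statistics.mean(ages),
        "age_stdev": statistics.stdev(ages),
        "plan_values": list(plans.keys()),
        "plan_weights": list(plans.values()),
    }


def sample_synthetic(model, n, seed=0):
    """Draw n artificial rows from the fitted model (columns are independent)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = round(rng.gauss(model["age_mean"], model["age_stdev"]))
        plan = rng.choices(model["plan_values"], weights=model["plan_weights"])[0]
        rows.append({"age": max(age, 0), "plan": plan})  # clamp to a valid age
    return rows


# Toy stand-in for a sensitive table (hypothetical values).
real_rows = [
    {"age": 34, "plan": "basic"},
    {"age": 45, "plan": "premium"},
    {"age": 29, "plan": "basic"},
    {"age": 52, "plan": "premium"},
    {"age": 41, "plan": "basic"},
    {"age": 38, "plan": "basic"},
    {"age": 27, "plan": "free"},
    {"age": 60, "plan": "premium"},
]

model = fit_model(real_rows)
synthetic_rows = sample_synthetic(model, n=1000)
```

The synthetic rows have the same structure as the real ones and similar summary statistics, yet none of them is copied from an actual record.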

With Gradio, synthetic data is generated based on the real data that you collect. This is valuable because you can keep getting value from the data you collect while still preserving privacy (more on that ahead!).

When to Use Synthetic Data?

When you need to work with sensitive data, you should be using synthetic data instead of real data.

For example, if you are sharing customer data with your development team, there is a risk that the data might leak and compromise your customers' privacy. Such exposures carry financial and reputational risks that your company can avoid by using synthetic data, which Gradio can generate on demand to mimic the structure and characteristics of real data. Here are some common reasons why a company like yours might use synthetic data:

  • You want to build a test suite with representative data, but can't use real data for regulatory reasons.
  • Your data scientists are building tools to analyze patient data, but you don't want to share the data even internally, for the sake of patient privacy.
  • You want to share data externally with a third-party analytics vendor, but want to make sure no intellectual property is leaked.

Synthetic Data vs. Anonymization vs. Encryption

Is synthetic data the only tool for protecting sensitive data? Nope. You may be familiar with encryption, in which your data is converted into ciphertext that hides the information's true meaning from anyone who does not hold the key. The main limitation of encryption is that the data has to be decrypted before it can be analyzed, which makes it difficult to use for downstream analysis or machine learning without exposing it again. Encryption is a good tool for data at rest, not data at work.

Alternatively, there are classical anonymization methods that remove certain kinds of sensitive information or metadata from data. While sometimes useful, anonymization is a blunt instrument: it often removes a lot of the information's value as well. Synthetic data can often provide a more fine-grained approach that preserves the value in the information while still mitigating privacy and security risks.
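As a small illustration of how anonymization can remove value along with the sensitive details, the sketch below applies two common techniques: suppression (dropping a name) and generalization (replacing an exact age with a ten-year band). The records and the `generalize_age` helper are hypothetical; the text above does not name specific anonymization methods, so treat these two as assumed examples.

```python
def generalize_age(age, band=10):
    """Replace an exact age with a coarse band such as '20-29'."""
    low = (age // band) * band
    return f"{low}-{low + band - 1}"


# Toy records (hypothetical values).
records = [
    {"name": "Ana", "age": 23, "spend": 120.0},
    {"name": "Ben", "age": 27, "spend": 80.0},
    {"name": "Cy", "age": 41, "spend": 200.0},
]

# Suppress the direct identifier (name) and generalize the age.
anonymized = [
    {"age_band": generalize_age(r["age"]), "spend": r["spend"]}
    for r in records
]

# A question that needs exact ages is answerable on the original records...
under_25 = sum(1 for r in records if r["age"] < 25)

# ...but after generalization, ages 23 and 27 fall into the same band,
# so the anonymized table can no longer tell them apart.
bands = [r["age_band"] for r in anonymized]
```

Here `under_25` can only be computed from the original records. In the anonymized table, the two youngest customers are indistinguishable by age.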

How Does Synthetic Data Affect Downstream Analysis?

As mentioned earlier, one of the key advantages of synthetic data is that you can keep your machine learning or analytics pipelines as they are and simply swap the original sensitive data for synthetic data. In a variety of experiments, synthetic data consistently preserves more of the data's value than other anonymization methods.
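One way to check this on your own data is to train the same model twice, once on real data and once on synthetic data, and evaluate both on held-out real data. The sketch below does that with a simple least-squares line. Everything in it is an assumption for illustration: the "real" dataset is simulated, and the synthetic generator is a toy model fitted to the training split, not any vendor's method.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def mean_abs_error(line, xs, ys):
    """Average absolute prediction error of a fitted line."""
    intercept, slope = line
    return statistics.mean([abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)])


rng = random.Random(42)

# Stand-in for a sensitive "real" dataset (simulated here, not actual data).
real_x = [rng.uniform(20, 60) for _ in range(500)]
real_y = [10 + 2.5 * x + rng.gauss(0, 5) for x in real_x]
train_x, train_y = real_x[:400], real_y[:400]
test_x, test_y = real_x[400:], real_y[400:]

# Hypothetical generator: fit a simple model to the real training data...
real_model = fit_line(train_x, train_y)
residuals = [y - (real_model[0] + real_model[1] * x) for x, y in zip(train_x, train_y)]
x_mean, x_sd = statistics.mean(train_x), statistics.stdev(train_x)
resid_sd = statistics.stdev(residuals)

# ...then sample entirely new (x, y) pairs from that model.
syn_x = [rng.gauss(x_mean, x_sd) for _ in range(400)]
syn_y = [real_model[0] + real_model[1] * x + rng.gauss(0, resid_sd) for x in syn_x]

# Train on synthetic data, then score both models on held-out real data.
syn_model = fit_line(syn_x, syn_y)
mae_real = mean_abs_error(real_model, test_x, test_y)
mae_syn = mean_abs_error(syn_model, test_x, test_y)
```

If `mae_syn` is close to `mae_real`, the synthetic data has preserved the relationship this particular analysis depends on. A large gap suggests the generator is missing structure that matters downstream.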

Will synthetic data work for you? The best way to find out is to try our simple APIs yourself. Our team is here to show you additional experiments and help you try synthetic data on your own dataset.