Data ownership

The use of personal data by web platforms has always been a slippery slope. In order to justify their ad-supported business strategy to users, Big Tech's marketing departments manufactured the ostensible benefit of "personalization" in return for the capturing, re-use, and re-selling of personal data. The conceit was, they are going to show you ads and targeted posts no matter what, but at least those interruptions might be more in line with your interests.

This already feels invasive when it comes to capturing what you are interacting with, which groups you are in, which keywords in text are most likely to compel your engagement, and so on. It's an extraction of user intent.

What happens when the system is capturing more than your clicks though? When you are talking to an AI, or working with it to create content, where is the line drawn between what belongs to you, what belongs to the platform, and what to do with the stuff in the middle?

While the nuances and principles of this market are ironed out, many AI companies have implemented the Data Retention pattern, leading to its near ubiquity.

ChatGPT's settings show the common pattern

The pattern is generally the same in all uses: located in the user or company settings, a pithy statement about the need to improve the company's models for the benefit of all is paired with an on/off toggle.

Where its interface is consistent, the principles of interaction vary.

Opt-in vs. opt-out

Most commonly, this setting defaults to "on." This is the case for ChatGPT, Substack, GitHub, and many other big players in the space.

Notably, when Figma announced their AI features at their Config conference in 2024, they took the approach of defaulting the option to "off." This could signal a departure from the norm towards customer-first defaults.

Figma has published their approach to AI including data storage and model training. They link to this explainer from the settings pane

Paid v. Unpaid

In many cases, the option to opt out of sharing your data or your company's data is only available on premium plans. Free users may never even see the setting, and won't be aware of the tradeoff.

This makes sense commercially, as the servers to run AI are expensive, so free users pay in the form of their data. As legislation related to AI becomes more concrete, we should expect to see pressure on this approach.

Enterprise vs. consumer

As Enterprise-grade accounts roll out across AI platforms, settings like this are being placed within the admin settings instead of within individual settings. Figma offers a good example of this approach. This is logically so that company admins don't have to rely on the discretion of employees to follow their standard security policies.

Full privacy

Some companies take this a step forward and fully guarantee a private experience. Limitless.ai doesn't include this setting because it doesn't train models with user data, and explicitly calls out in its privacy policy that it does not allow any third-party partners to use their users' data to train models either.

If you take this approach, you might want to include these details in your settings panel as well so that users don't think the lack of the setting relates to lack of privacy, when the opposite is true.

Design considerations

Default to privacy. Users should never be surprised to learn that their data can be consumed for training purposes. Allow users to opt in at any time, but set the default for data ownership settings to restrict training.
Separate training from retention. Granting permission for an AI service to retain your data for customer service and aggregate analytical purposes should not be combined with the permission for models to be trained on personal data. Give each its own control and page so users can change one without affecting the other and make the distinction clear.
State the default in the UI, not only in policy. Provide sufficient details within the UI where the setting is controlled so users can understand what they are opting into or out of. Support this with links to deeper documentation and policy.
Present both sides. Explain to users what happens to their data. Some might be willing to share it if they understood that it helps the overall experience. Be clear about the differences between data collection and data training, and how data is used in the aggregate in both cases.