Training Input and Consent

grimesi

AI Inputs

AI inputs can lead to harm when AI systems are trained on problematic data. AI models require massive amounts of training data to function, and in the past, developers had to hand-curate AI datasets. Today, however, large-scale datasets are readily available thanks to the abundance of content on the internet, and these datasets have enabled AI models of unprecedented power.

Along with the availability of these massive datasets, concerns have arisen over the content they contain and how they are being used. Abeba Birhane, who audits large datasets as part of her research, has found illegal and/or unethical content in datasets that are used to train common AI models. Using this data in turn embeds biases into AI models, which tends to harm marginalized groups disproportionately. Even if the content within these datasets is not illegal, it may have been obtained without affirmative consent from the creators or subjects of that data. People are also rarely able to withdraw their data from these datasets. Indeed, because the datasets contain millions or even billions of data points, it may be practically impossible to remove individual pieces of content from them.

At the same time, obtaining individualized consent from everyone whose data appears in one of these datasets would be incredibly difficult, if not impossible. Moreover, requiring consent would likely constrain AI development considerably, because the large datasets that have driven innovation in AI contain far too many elements to clear one by one.

Unfortunately, there is no clear solution to the problems associated with AI inputs. A Creative Commons panel that met on the 9th and 10th of November 2022 agreed that some legal regulation of AI training data can be useful for improving the quality and ethics of AI inputs. But regulation alone is unlikely to solve the many problems that arise. We also need guidelines for researchers who use data, including openly licensed data, for AI training purposes. For instance, guidelines may discourage the use of openly licensed content as AI inputs where such use could lead to problematic outcomes, as with facial recognition technology, even if the use does not violate the license. Public discussions like this are essential for building consensus among stakeholders about how to use AI inputs ethically, raising awareness of these issues, and improving AI models going forward.

“Experts Weigh In: AI Inputs, AI Outputs, and the Public Commons.” Creative Commons, 2 Dec. 2022, https://creativecommons.org/2022/12/02/experts-weigh-in-ai-inputs-ai-outputs-and-the-public-commons/. CC BY 4.0.

License

Balancing AI, Copyright, and Data Privacy in Education: A Guidebook for Educators Copyright © by grimesi is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.
