Pauline Vengeroff

PhD Candidate
Dissertation Title
Garbage to Governance: Shaping the Regulation of AI Training Data
The rapid advancement of AI has transformed many sectors, exposing gaps and risks in the regulatory landscape, particularly around the data used to train AI and machine learning systems. My research critically evaluates how existing laws can strengthen individual rights while recognizing their limits in fully safeguarding privacy during AI training. It also examines the potential of emerging Canadian privacy legislation to improve AI data governance, underscoring the need for a comprehensive legal approach. Advocating trustworthy-AI principles such as transparency, the research seeks to balance innovation with rigorous AI regulation, with the objective of proposing policy reforms that make the law more adaptable to technological change.

The research takes a comprehensive approach to the ethical governance of AI training data, integrating insights from fields as diverse as private-sector regulation, data protection, ethics, healthcare, copyright, and emerging technologies. It goes beyond assessing AI's societal impacts to advocate for technologies that proactively meet societal needs, emphasizing proactive data cleaning and multidisciplinary dialogue as foundations for effective oversight. The goal is a framework for robust, safe, and practical regulation that offers actionable tools technologists can apply in the real world.

In the Canadian context, legislation directly regulating the use of AI training data is notably absent. Some areas of law apply by extension: training data that incorporates personal information, for example, is indirectly governed by privacy laws. This proxy approach faces challenges, including the potential to bypass it through anonymization techniques or synthetic data. Legal scholarship has largely overlooked the foundational role of training data, focusing predominantly on the harms caused by AI outputs.
This research therefore pursues a proactive approach, targeting a potential root of the problem, the training data itself, rather than merely reacting to biased AI outputs. It emphasizes the need for a legal framework that integrates innovation with comprehensive data governance. The analysis tackles the regulation of AI training data from three pivotal perspectives: the privacy implications of using personal information; the use of personal health information under Canadian health-information laws; and the challenges that data used for AI training poses to copyright law, including the protectability of database rights. This entails a critical examination of how current privacy legislation fails to meet the needs of training data, the legal and ethical considerations of using health data in AI-driven healthcare devices, and the complexities of copyright law as applied to AI training datasets.

The research aims to illuminate the shortcomings of the current legal framework and to advocate for anticipatory legal reforms that protect against algorithmic discrimination. It strives to guide the development of AI systems that are ethical, trustworthy, and aligned with societal values. In doing so, it aims to set a precedent for how countries can adapt their legal frameworks to the evolving landscape of AI, making a significant contribution to Canadian and global legal scholarship on AI regulation and training data.