Provenance is just knowing where something came from. In art, provenance answers two questions: Who created the work? And does the current possessor have the right to transfer? In the context of data, provenance answers: What individuals or organizations are described in the data? And does the current possessor have the right to transfer or use?
Answering these questions is critical for three reasons. First, knowing your data: how much value can you create from data you don't understand or trust? The further data travels from its original source, second-hand or third-hand, the more that trust matters. Second, contractual obligations: contracts may explicitly prohibit reusing or redistributing data, so downstream use can create breach-of-contract risk.
Third, regulatory compliance: federal and state laws may require specific documentation about the data an organization holds. Failing to comply with data protection regulations can severely affect an organization's machine learning models, operations, and finances. If consent is the basis for processing data, there should be a means of tracing each piece of data back to the consent that covers it.
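To make the idea of tracing data back to consent concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `ConsentRecord` and `DataRow` structures, the field names, and the `"model_training"` purpose label are hypothetical, and real consent rules depend on the applicable law.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ConsentRecord:
    consent_id: str
    subject_id: str
    purpose: str                      # hypothetical label, e.g. "model_training"
    granted_on: date
    withdrawn_on: Optional[date] = None


@dataclass(frozen=True)
class DataRow:
    row_id: str
    subject_id: str
    consent_id: Optional[str]         # None means no consent was recorded


def rows_without_valid_consent(rows, consents, purpose, as_of):
    """Return ids of rows that cannot be traced to active consent for `purpose`."""
    by_id = {c.consent_id: c for c in consents}
    failures = []
    for row in rows:
        consent = by_id.get(row.consent_id) if row.consent_id else None
        valid = (
            consent is not None
            and consent.subject_id == row.subject_id
            and consent.purpose == purpose
            and consent.granted_on <= as_of
            and (consent.withdrawn_on is None or consent.withdrawn_on > as_of)
        )
        if not valid:
            failures.append(row.row_id)
    return failures


consents = [
    ConsentRecord("c1", "alice", "model_training", date(2023, 1, 10)),
    ConsentRecord("c2", "bob", "model_training", date(2023, 2, 1),
                  withdrawn_on=date(2023, 6, 1)),
]
rows = [
    DataRow("r1", "alice", "c1"),
    DataRow("r2", "bob", "c2"),       # consent withdrawn before the check date
    DataRow("r3", "carol", None),     # no consent on file
]

flagged = rows_without_valid_consent(rows, consents, "model_training", date(2024, 1, 1))
print(flagged)  # ['r2', 'r3']
```

The point of the sketch is that every row carries a pointer to a consent record, so the question "can we use this data for this purpose today?" becomes a lookup rather than an investigation.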
Copyright considerations also play a significant role in data usage rights. If data has been obtained from a third party, the issue of provenance is even more important. Organizations should document the lineage of third-party data and any limitations on that data.
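One way to document the lineage of third-party data and its limitations is to keep a small record for each dataset that points to the record it was derived from. The sketch below is illustrative only: the `ProvenanceRecord` structure, its field names, and the supplier and contract names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProvenanceRecord:
    dataset: str
    supplier: str                     # who the data was obtained from
    license_ref: str                  # pointer to the governing contract or license
    prohibited_uses: List[str] = field(default_factory=list)
    derived_from: Optional["ProvenanceRecord"] = None  # upstream record, if known

    def lineage(self):
        """Walk back through upstream records, nearest first."""
        chain, node = [], self
        while node is not None:
            chain.append(node)
            node = node.derived_from
        return chain

    def blocks(self, use):
        """A use is blocked if any record in the lineage prohibits it."""
        return any(use in rec.prohibited_uses for rec in self.lineage())


# Hypothetical chain: original collector -> broker -> our organization.
original = ProvenanceRecord(
    "survey_raw", "Example Research Co", "contract-001",
    prohibited_uses=["redistribution"],
)
purchased = ProvenanceRecord(
    "survey_clean", "Example Data Broker", "contract-002",
    derived_from=original,
)

print([r.supplier for r in purchased.lineage()])
# ['Example Data Broker', 'Example Research Co']
print(purchased.blocks("redistribution"))   # True
print(purchased.blocks("model_training"))   # False
```

Checking the whole chain, not just the most recent supplier, reflects the concern above: a restriction imposed by the original source can still bind data acquired second-hand or third-hand.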
Technology is both enabling the exponential growth of data, which complicates provenance, and offering potential solutions. MLOps platforms like MLflow allow organizations to create and version datasets, record their provenance and lineage, and use those datasets to train versioned machine learning models.
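The underlying idea of dataset versioning can be shown without any particular platform. The sketch below uses only the Python standard library and is not MLflow's API: the `fingerprint`, `register_version`, and `record_training_run` helpers, the in-memory registry, and the dataset names are assumptions made for illustration.

```python
import hashlib
import json


def fingerprint(rows):
    """Hash the dataset content so that any change yields a new version id."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def register_version(registry, name, rows, source, parent=None):
    """Store a version entry keyed by content hash, linked to its parent version."""
    version = fingerprint(rows)
    registry[version] = {
        "name": name,
        "source": source,
        "parent": parent,
        "n_rows": len(rows),
    }
    return version


def record_training_run(runs, model_name, dataset_version):
    """Note which dataset version a model was trained on."""
    runs.append({"model": model_name, "dataset_version": dataset_version})


registry, runs = {}, []
raw = [{"id": 1, "age": 34}, {"id": 2, "age": None}]
cleaned = [{"id": 1, "age": 34}]

v_raw = register_version(registry, "customers_raw", raw,
                         source="vendor export (hypothetical)")
v_clean = register_version(registry, "customers_clean", cleaned,
                           source="internal cleaning job", parent=v_raw)
record_training_run(runs, "churn_model_v1", v_clean)

# Trace a trained model back to the dataset it used, and that dataset's parent.
trained_on = registry[runs[0]["dataset_version"]]
print(trained_on["name"], "<-", registry[trained_on["parent"]]["name"])
# customers_clean <- customers_raw
```

Because the version id is derived from the content, a silently modified dataset cannot keep its old id, which is what makes the link between a model and its training data auditable.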
Data provenance is not just a technical concern but a fundamental aspect of responsible AI governance. For board members, understanding and overseeing data provenance practices is crucial for ensuring the reliability, legality, and ethical use of AI and machine learning models. In the world of AI and data, knowing where your data comes from is just as important as knowing where you're going with it.