Data Quality in Artificial Intelligence
Haukkala, Mikko (2022)
All rights reserved. This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2022120927650
Abstract
This thesis is part of the AI-TIE project coordinated by Haaga-Helia University of Applied Sciences. The main goal of the project is to support SMEs in Finland in developing and growing their business by utilizing artificial intelligence solutions. The aim of the thesis, which was carried out in 2022, is to study the importance of data quality in AI development, to examine the dimensions of data quality, and to identify the common problems and good practices affecting data quality in companies that are already using or planning to implement artificial intelligence.
The theory section explains what is meant by artificial intelligence and what good data quality means from the perspective of artificial intelligence. In addition, it explores what data is and how data quality can be measured and evaluated. The framework for the interview and survey conducted in the research is selected by examining and comparing methods.
The research part of the thesis uses a concurrent mixed-methods approach. Based on interviews and surveys, it examines the views of professionals in the field on the different dimensions of data quality and on the related challenges and good practices from the perspective of AI development.
Based on the results of the study, relevancy was considered the most challenging dimension of data quality in AI development. It was selected as one of the most challenging data quality dimensions in six of the seven surveys. The reasons given included the difficulty of predicting what kind of data should be collected for future needs and the difficulty of gaining a sufficient contextual understanding of the business and its needs. A comprehensive understanding of business problems, from both a technical and a business perspective, was considered important for starting to collect relevant data. In addition, the study revealed dimension-specific development suggestions and good practices for improving each data quality dimension.
The results of the thesis can be used to evaluate and improve the quality of existing data and to support the planning of future data needs from the perspective of artificial intelligence. In addition, the results can be utilized in developing a data quality maturity model on the path toward implementing a production-ready AI application.