My iPhone screen time is over four hours every day. Over the last month I’ve booked restaurant reservations and doctor’s appointments, received motorcycle maintenance records, loaded new applications and ordered clothes. All of these actions involved the sort of data exchanges that today’s information-based tech companies crave. Applying machine learning tools to personal data can uncover valuable knowledge and generate tremendous business value.
With data increasingly seen as “the new oil,” many economists, politicians, and others are suggesting people should be paid for the data they produce. Although the price of commodities is generally set by a market, the mechanisms and metrics for evaluating data remain immature.
A team of researchers from the University of California at Berkeley, ETH Zurich, Zhejiang University, and the University of Illinois at Urbana-Champaign set out to put a price on data in the context of machine learning in their 2019 research papers Towards Efficient Data Valuation Based on the Shapley Value and Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms.
Current approaches to data valuation focus mostly on query-based pricing, data attribute-based pricing, and auction-based pricing. These approaches offer pricing guidelines instead of price-tagging the data directly. Query-based pricing, for example, attaches values to user-initiated queries, while data attribute-based pricing builds price models that consider parameters such as data age, credibility, and potential benefits. Auction-based pricing, meanwhile, creates auctions that set the price dynamically based on buyers’ and sellers’ bids.
What are we missing? The researchers argue the following important considerations are not included in the existing data valuation schemes:
Task-specificness: The value of data depends on the task it helps to fulfill.
Fairness: The quality of data from different sources varies dramatically.
Efficiency: Practical machine learning tasks may involve thousands or billions of data contributors; thus, data valuation techniques should be capable of scaling up.
The researchers’ method for determining the relative value of data uses the “Shapley value,” a game theory solution proposed in 1953 by American mathematician Lloyd Shapley to determine how rewards are apportioned among individual members of a team. “The Shapley value attaches a real-value number to each player in the game to indicate the relative importance of their contributions,” wrote the two papers’ first author Ruoxi Jia in a BAIR blog post.
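A practical problem facing researchers, however, is that computing the exact Shapley value takes time exponential in the number of data points, because the definition averages each point’s marginal contribution over every possible subset of the other points.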
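To make the cost concrete, here is a minimal sketch that computes Shapley values straight from the textbook definition by enumerating every coalition. The function name `exact_shapley` and the toy utility are hypothetical illustrations, not code from the papers; in a real data-valuation setting the utility would typically be the performance of a model trained on the coalition, which makes each of the exponentially many evaluations expensive.

```python
from itertools import combinations
from math import factorial


def exact_shapley(players, utility):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of p when joining coalition s.
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


def toy_utility(coalition):
    # Hypothetical utility: "a" and "b" are redundant with each other,
    # while "c" contributes independently.
    score = 0.0
    if "a" in coalition or "b" in coalition:
        score += 0.6
    if "c" in coalition:
        score += 0.4
    return score


values = exact_shapley(["a", "b", "c"], toy_utility)
print(values)
```

In this toy game the two redundant contributors split their shared 0.6 equally, and the values sum to the utility of the full set. With n contributors the inner loops visit 2^(n-1) coalitions per contributor, which is why this direct approach does not scale.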
The studies demonstrate that for K-nearest neighbors (KNN) classification there is no need to re-train models, and the exact Shapley value can be computed in quasi-linear time, an exponential improvement in computational efficiency.
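The sketch below illustrates how such a quasi-linear computation can look. It assumes the recursion described in the KNN paper for an unweighted KNN classifier and a single test point, where the utility of a training subset is the fraction of its K nearest points whose label matches the test label; readers should check the details against the paper. The function name `knn_shapley_single` and the toy data are hypothetical.

```python
def knn_shapley_single(train_x, train_y, test_x, test_y, k):
    """Shapley value of each training point for one test point (unweighted KNN)."""
    n = len(train_x)

    def sqdist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Sorting by distance to the test point dominates the cost: O(n log n).
    order = sorted(range(n), key=lambda i: sqdist(train_x[i], test_x))
    match = [1.0 if train_y[i] == test_y else 0.0 for i in order]

    values = [0.0] * n
    # The farthest point starts the recursion.
    s = match[n - 1] / n
    values[order[n - 1]] = s
    # Walk from the second-farthest point towards the nearest one.
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank in the sorted order
        s += (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        values[order[pos]] = s
    return values


# Hypothetical toy data: four 1-D training points, one test point at 0.0.
train_x = [[0.1], [0.4], [0.9], [2.0]]
train_y = [1, 0, 1, 0]
knn_values = knn_shapley_single(train_x, train_y, [0.0], 1, k=2)
print(knn_values)
```

No model is re-trained here: one sort and one linear pass replace the enumeration of all subsets. For a whole test set, the per-test-point values would be averaged, under the same assumptions.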
Evaluated on up to 10 million data points, the proposed exact algorithm was up to three orders of magnitude faster than the baseline approximation algorithm while still computing the exact Shapley value. The researchers’ Locality Sensitive Hashing (LSH) approximation algorithm was even faster, significantly outperforming the exact algorithm, especially with large training sizes.
The proposed algorithms introduce a new way for researchers to calculate data values for a large database. In machine learning, efficient data valuation can play a critical role in facilitating the collection of the data used for training models. This is a huge marketplace: as Synced previously reported, the global annotation tools market alone is expected to reach US$1.6 billion by 2025.
The researchers believe their approach offers new possibilities for creating theoretical and computational tools to estimate the value of data, and that the Shapley value can help produce practical and robust tools for machine learning researchers.
The papers Towards Efficient Data Valuation Based on the Shapley Value and Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms are available on arXiv.
Journalist: Fangyu Cai | Editor: Michael Sarazen