Theoretical Analysis of Fair Data Pruning with Gaussian Mixtures


Too Long; Didn't Read

Explore our theoretical analysis of data pruning for fair binary classification using a mixture of Gaussians

Abstract and 1 Introduction

2 Data Pruning & Fairness

3 Method & Results

4 Theoretical Analysis

5 Discussion, Acknowledgments and Disclosure of Funding, and References

A Implementation Details

B Theoretical Analysis for a Mixture of Gaussians

4 Theoretical Analysis

In this section, we derive analytical results for data pruning in a toy model of binary classification on a mixture of two univariate Gaussians with linear (threshold) classifiers[1]. Perhaps surprisingly, in this framework we can derive optimal fair class-wise densities that motivate our method, as well as demonstrate how prototypical pruning algorithms fail with respect to worst-class accuracy.
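To make the setup concrete, below is a minimal, self-contained sketch of the toy model (all numeric values are assumptions chosen for illustration, not taken from the paper): each class is a univariate Gaussian, the classifier is a single threshold, and the empirical risk minimizer is found by a simple grid search.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical class parameters: class 0 is the "easier", lower-variance class.
mu0, sigma0, n0 = -1.0, 0.5, 5000
mu1, sigma1, n1 = +1.0, 1.5, 5000

x0 = rng.normal(mu0, sigma0, n0)  # samples of class 0
x1 = rng.normal(mu1, sigma1, n1)  # samples of class 1

def class_errors(t):
    """Per-class error rates of the threshold classifier 'predict 1 iff x > t'."""
    return np.mean(x0 > t), np.mean(x1 <= t)

def empirical_risk(t):
    """Average 0-1 loss over the whole (possibly imbalanced) dataset."""
    e0, e1 = class_errors(t)
    return (n0 * e0 + n1 * e1) / (n0 + n1)

# Brute-force ERM over a grid of candidate thresholds.
grid = np.linspace(-4.0, 4.0, 2001)
t_star = min(grid, key=empirical_risk)
e0, e1 = class_errors(t_star)
print(f"ERM threshold t* = {t_star:.2f}")
print(f"class errors: {e0:.3f} / {e1:.3f}, worst-class error = {max(e0, e1):.3f}")
```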





The class-conditional independence assumption we made above is crucial. While it is respected when subsampling randomly within each class, it clearly does not hold for existing, more sophisticated data pruning algorithms. Therefore, even though these algorithms tend to prune easier classes more aggressively, as evident from Figure 3, they rarely enjoy any improvement in worst-class performance over the original dataset. We further illustrate this observation by replicating a supervised variant of Self-Supervised Pruning (SSP), developed by Sorscher et al. [2022] for ImageNet. In this algorithm, we remove samples globally (i.e., irrespective of their class membership) that fall within a certain margin M > 0 of their class means. Intuitively, this discards the easiest or most representative samples from the most densely populated regions of the distribution. As illustrated in Figure 8 (middle), this method fortuitously prunes the easier class more aggressively; indeed, the number of samples removed from a class with mean µ and variance σ² is proportional to the area under its probability density function over the pruning interval [µ − M, µ + M], which is larger for smaller values of σ.

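As a quick sanity check of that last claim, the following sketch computes the per-class pruned fraction analytically as the Gaussian mass over [µ − M, µ + M], using an assumed margin M and assumed class parameters (none of these values come from the paper); the lower-variance class indeed loses a much larger share of its samples.

```python
from scipy.stats import norm

# Assumed class parameters and pruning margin (for illustration only).
classes = {"class 0 (easier)": (-1.0, 0.5), "class 1 (harder)": (+1.0, 1.5)}
M = 0.6

# Fraction of a class removed = area under its pdf over [mu - M, mu + M],
# which only depends on M / sigma and is larger for smaller sigma.
for name, (mu, sigma) in classes.items():
    pruned = norm.cdf(mu + M, loc=mu, scale=sigma) - norm.cdf(mu - M, loc=mu, scale=sigma)
    print(f"{name}: fraction pruned = {pruned:.2%}")
```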
Nevertheless, if the original dataset has classes mixed in proportions N0 and N1, the solution remains close to t*(N0/N1) even after pruning, as we show formally in Appendix B.3. On the other hand, random subsampling according to the class-wise pruning proportions defined by SSP does improve worst-class accuracy, as illustrated in the right plots of Figure 8. This corresponds to our observation in Figure 4 that random pruning respecting the class proportions discovered by GraNd and Forgetting often improves fairness compared to these methods themselves.
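Both observations can also be simulated directly. The sketch below (assumed parameters throughout, with a brute-force grid search standing in for training) fits the threshold on the full data, on the margin-pruned data, and on data randomly subsampled at the same class-wise proportions, and then evaluates worst-class error on the full mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy parameters (illustrative only): class 0 is easier (lower variance).
mu0, sigma0, N0 = -1.0, 0.5, 5000
mu1, sigma1, N1 = +1.0, 1.5, 5000
M = 0.6  # assumed pruning margin

x0 = rng.normal(mu0, sigma0, N0)
x1 = rng.normal(mu1, sigma1, N1)

def erm_threshold(a0, a1):
    """Brute-force ERM of 'predict 1 iff x > t' on finite samples a0 (class 0), a1 (class 1)."""
    grid = np.linspace(-4.0, 4.0, 2001)
    errors = [np.sum(a0 > t) + np.sum(a1 <= t) for t in grid]
    return grid[int(np.argmin(errors))]

def worst_class_error(t):
    """Worst per-class error evaluated on the full (unpruned) data."""
    return max(np.mean(x0 > t), np.mean(x1 <= t))

# (1) SSP-style margin pruning: drop samples within M of their own class mean.
x0_ssp = x0[np.abs(x0 - mu0) > M]
x1_ssp = x1[np.abs(x1 - mu1) > M]
d0, d1 = len(x0_ssp) / N0, len(x1_ssp) / N1  # class-wise keep fractions

# (2) Random subsampling at the same class-wise proportions d0, d1.
x0_rnd = rng.choice(x0, size=len(x0_ssp), replace=False)
x1_rnd = rng.choice(x1, size=len(x1_ssp), replace=False)

t_full = erm_threshold(x0, x1)
t_ssp = erm_threshold(x0_ssp, x1_ssp)
t_rnd = erm_threshold(x0_rnd, x1_rnd)

print(f"keep fractions: d0 = {d0:.2f}, d1 = {d1:.2f}")
print(f"thresholds: full = {t_full:.2f}, margin-pruned = {t_ssp:.2f}, random = {t_rnd:.2f}")
print("worst-class errors:",
      f"full = {worst_class_error(t_full):.3f},",
      f"margin-pruned = {worst_class_error(t_ssp):.3f},",
      f"random = {worst_class_error(t_rnd):.3f}")
```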


A priori, it is unclear how exactly the variance-based worst-case optimal pruning quotas in Equation 5 generalize to deep learning, and in particular how they can motivate the error-based MetriQ proposed in Section 3. One could connect our theory to the feature distribution in the penultimate layer of neural networks and determine class densities from cluster variances around class means. In fact, SSP [Sorscher et al., 2022] uses such metrics to compute pruning scores for each sample using a pretrained model. However, such variances are noisy, especially for smaller class sizes, and are hard to connect to class accuracies. Therefore, we instead examine MetriQ in our toy framework. As argued above, given a dataset with class priors N0 and N1, the optimal class densities must satisfy d0 N0 σ1 = d1 N1 σ0, while MetriQ requires d0 R1[t*(N0/N1)] = d1 R0[t*(N0/N1)], where Rk[t] denotes the error rate of class k under the threshold t.
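The comparison behind Figure 9a can be sketched in a few lines. The snippet below uses an assumed mixture configuration with fairly similar class variances (all values illustrative): it locates the population risk minimizer t*(N0/N1), reads off the class risks R0 and R1 at that threshold, and compares the density ratio d0/d1 implied by each of the two conditions.

```python
from scipy.stats import norm
from scipy.optimize import minimize_scalar

# Assumed mixture parameters and class priors (illustrative only).
mu0, sigma0, N0 = -1.0, 1.0, 5000
mu1, sigma1, N1 = +1.0, 1.2, 5000

def class_risks(t):
    """Population error rates of 'predict 1 iff x > t' for each class."""
    R0 = 1.0 - norm.cdf(t, loc=mu0, scale=sigma0)  # class-0 mass above t
    R1 = norm.cdf(t, loc=mu1, scale=sigma1)        # class-1 mass below t
    return R0, R1

def avg_risk(t):
    R0, R1 = class_risks(t)
    return (N0 * R0 + N1 * R1) / (N0 + N1)

# t*(N0/N1): the risk minimizer on the original (unpruned) mixture.
t_star = minimize_scalar(avg_risk, bounds=(-5.0, 5.0), method="bounded").x
R0, R1 = class_risks(t_star)

# Density ratios d0/d1 implied by the two prescriptions.
optimal_ratio = (N1 * sigma0) / (N0 * sigma1)  # from d0*N0*sigma1 = d1*N1*sigma0
metriq_ratio = R0 / R1                         # from d0*R1[t*] = d1*R0[t*]
print(f"optimal d0/d1 = {optimal_ratio:.3f}, MetriQ d0/d1 = {metriq_ratio:.3f}")
```

Under the assumed parameters above, the two ratios come out within roughly ten percent of each other.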


Figure 9a plots the two proportions for different Gaussian mixtures; we find that MetriQ approximates the optimal quotas quite well, especially when σ1/σ0 is small. Figure 9b demonstrates that random pruning according to MetriQ lands the average empirical risk minimizer (ERM) near the optimal value. Thus, even though MetriQ does not enjoy simple theoretical guarantees, it still operates near-optimally in this toy setup.


This paper is available on arXiv under a CC BY 4.0 DEED license.


[1] Our results generalize to isotropic multivariate Gaussians as well; see Appendix B.

Authors:

(1) Artem Vysogorets, Center for Data Science, New York University ([email protected]);

(2) Kartik Ahuja, Meta FAIR;

(3) Julia Kempe, New York University, Meta FAIR.

