# Proposed Robust Weighting Strategy For Federated Learning

*3 main points*

✔️ This study improves the aggregation weighting method for Federated Learning. The upper and lower bounds on the generalization performance of each local model are used for weighting, taking into account the effects of statistical heterogeneity and noisy data.

✔️ Generalization performance is decomposed based on the bias-variance trade-off, and the robustness of model performance against shifts in the data distribution is considered.

✔️ Experimental results show that the proposed weighting strategy improves the performance and robustness of Federated Learning algorithms.

Aggregation Weighting of Federated Learning via Generalization Bound Estimation

written by Mingwei Xu, Xiaofeng Cao, Ivor W. Tsang, James T. Kwok

(Submitted on 10 Nov 2023)

Comments: Accepted on arXiv.

Subjects: Machine Learning (cs.LG)

code：

The images used in this article are from the paper, the introductory slides, or were created based on them.

## Summary

In Federated Learning (FL), model parameters collected from clients are usually weighted and aggregated according to the number of samples held by each client. However, this simple weighting method can lead to poor or unfair model performance when the data across clients is statistically heterogeneous or noisy. Theoretical studies have shown that the generalization performance of a model under distribution shift can be bounded. This motivates rethinking the weighting method in federated learning.

The authors propose replacing the traditional weighting method with a new strategy that takes into account the generalization bounds of each client's model. Specifically, in each communication round, they estimate the upper and lower bounds of the second-order origin moment of the loss for the current client model under distribution shift, and use the discrepancy between these bounds to set the aggregation weight. Experimental results confirm that the proposed method significantly improves the performance of representative FL algorithms on benchmark datasets.

This study provides a new perspective on how to weight model aggregation in federated learning. By taking data heterogeneity among clients into account, it aims to build more robust and fairer models. In the future, it will be important to test the effectiveness of the proposed method on a variety of real-world data.

## Introduction

Data security and privacy protection are critical topics in the field of data mining and machine learning. A distributed machine learning framework called Federated Learning (FL) has attracted much attention: in FL, multiple clients collaborate to train models, but they do not need to share data directly. Only parameters are communicated between clients and servers. In recent years, FL has been applied in a variety of fields, including IoT, computer vision, automated driving, and medicine.

However, a major challenge in FL is the statistical heterogeneity of data among clients: the data distribution differs from client to client, and noise and class imbalance are present. As a result, the optimization directions of the local and global loss functions can be misaligned, which can significantly degrade model performance and hinder convergence. Additional communication rounds may be needed to compensate.

There have been a number of studies addressing statistical heterogeneity in FL. Their main goal is to limit model drift so that local updates do not deviate significantly from the global model. For example, FedProx adds a regularization term to the loss function, SCAFFOLD introduces global and local control variates to reduce divergence, and FedDyn takes a different approach by establishing a dynamic regularization term for each client.

In these studies, it is common practice to weight client model parameters by each client's number of samples when aggregating them at the server. However, this simple weighting method can lead to unfair results and a loss of robustness under statistical heterogeneity among clients. This points to the need to rethink the weighting method in FL.

In machine learning, reweighting is known as an effective and robust learning technique for noisy or imbalanced data. For example, assigning small weights to noisy data and large weights to clean data reduces the influence of samples that hinder learning. For imbalanced data, assigning large weights to minority classes likewise reduces learning bias.

However, when noise and imbalance are present simultaneously in FL, conventional weighting methods may not address them adequately, and more sophisticated weighting strategies are needed to handle the complexity of statistical heterogeneity. New weighting methods that take the characteristics of FL into account are therefore called for.

## Motivation

In distributional robustness analysis, the generalization performance on a shifted, heterogeneous data distribution can be bounded, which makes it possible to control the worst-case performance of a trained model in a model-agnostic way. Specifically, this generalization bound is positively correlated with the degree of distribution shift. In other words, the more heterogeneous the distribution, the more difficult it is to estimate generalization performance accurately.

Based on these insights, the authors propose a novel weighting strategy for parameter aggregation in federated learning that exploits the boundary discrepancy, that is, the gap between the upper and lower generalization bounds, under shifted heterogeneous distributions. The estimated boundary discrepancy theoretically reflects how difficult training is on the client's data distribution: the narrower the boundary discrepancy, the more robust the training performance.

In this setting, clients whose shifted distributions are highly heterogeneous should be assigned a small weight because of their large generalization discrepancy with the server. By accounting for the boundary discrepancy, the proposed weighting strategy aims to improve the robustness and fairness of parameter aggregation in federated learning.

Theoretically, the first-order and second-order origin moments are both expected values of different forms of the robustness loss function. The main difference is that the second-order origin moment is flatter than the first-order origin moment for loss values below 1 and more strongly convex for values above 1. In the context of Sharpness-Aware Minimization (SAM), flat minima are preferred over sharp minima because they tend to be more stable. In FL robust weighting, the second-order origin moment is therefore used to favor convergence to a flat minimum.

Therefore, the second-order origin moment is used when estimating the generalization bounds.

**Contribution.** Instead of relying on weighting based on sample proportions, the paper introduces a weighting scheme based on estimated generalization bounds of the local models. Specifically, it exploits the flatness and convexity of the second-order origin moment described above to estimate the generalization bounds. By calculating the boundary discrepancy, the aggregation weights are adjusted dynamically in each communication round so that client participation in training becomes more equitable. Clients with a narrower boundary discrepancy, which indicates greater uniformity, are assigned higher aggregation weights.

The main contributions of this work are as follows.

1) Distributional robustness perspective: The authors reconsider aggregation weighting in federated learning from a distributional robustness perspective, which makes it possible to bound the generalization performance of local models under shifted distributions.

2) New theoretical insight: The authors use the second-order origin moment of the loss function, which exhibits better generalization behavior than the first-order origin moment and avoids aggregation weights approaching zero at sharp values. Specifically, they place upper and lower bounds on this generalization performance measure under shifts in the data distribution. In terms of a bias-variance trade-off analysis, the second-order moment of the loss approximates the sum of the squared bias and the variance.

3) Robust aggregation weighting: The authors propose a novel approach to address the inherent unfairness of traditional sample-proportion weighting in federated learning. The strategy implements a boundary-discrepancy weighting scheme that estimates generalization bounds and improves aggregation in the presence of statistical heterogeneity. The approach is evaluated extensively with popular federated learning algorithms such as FedAvg, FedProx, SCAFFOLD, and FedDyn, and the experimental results show significant improvements.

## Related Research

Federated learning is widely recognized as a method to protect data privacy by aggregating local models without sharing raw data. Research in this area focuses primarily on three key aspects: privacy and security, communication efficiency, and heterogeneity.

With respect to heterogeneity, Kairouz et al. classified non-IID (not independent and identically distributed) scenarios into five different data distribution situations. Li et al. proposed a benchmark of partitioning strategies that provides comprehensive guidelines and datasets covering non-IID scenarios, while Zhao et al. introduced a solution that creates a small shared dataset for initial model training. In addition, Li et al. performed a convergence analysis of FedAvg on non-IID data and showed that heterogeneity can slow the convergence rate and lead to deviations from the optimal solution. Luo et al. studied the hidden representations of different layers of neural networks and identified large classifier bias as the main cause of performance degradation on non-IID data.

Previous work has addressed statistical heterogeneity in federated learning from a variety of perspectives, but has used only plain sample proportions when aggregating local models. This paper revisits the weighting approach, with a particular focus on robust aggregation weighting. Robust reweighting is a widely used concept in machine learning: Zhou et al. proposed an effective reweighting of training samples to improve out-of-distribution (OOD) generalization and mitigate overfitting in large overparameterized models, and Shen et al. developed a sample reweighting technique to address collinearity among input variables.

Ren et al. reweighted samples using an unbiased validation set, and Shu et al. used an MLP trained on a small unbiased validation set to learn how to reweight the losses of different samples. In the context of federated learning, Pillutla et al. presented an update aggregation method based on the geometric median to make the aggregation process more robust against potential poisoning attacks on local data and model parameters. Li et al. focused on a learning-based reweighting approach to mitigate the effects of label corruption in federated learning.

## Background knowledge

### Trade-off between bias and variance

In machine learning, the expected error of a trained model decomposes into three non-negative components: intrinsic target noise, squared bias, and variance. The bias-variance trade-off is a useful statistical tool for understanding how the predictions of a trained model generalize, and the optimal trade-off yields a more accurate model that avoids both overfitting and underfitting.

The training dataset $D$ consists of $N$ independent and identically distributed samples drawn from the distribution $P(X, Y)$, where $x$ denotes a test sample and $y$ its true label. $h_D(x)$ denotes the hypothesis trained by the machine learning algorithm on dataset $D$, and $h(x)$ is the expected label given input $x$. The expected hypothesis is the average of $h_D$ over training sets $D$, and the expected test error is the expected squared difference between $h_D(x)$ and $y$, taken over both the test sample and the training set.

The bias-variance decomposition splits this expected test error into the variance, the squared bias, and the noise. Its derivation is given in Appendix A of the paper. In practical applications the noise term is usually difficult to detect and is therefore treated as a constant, so the paper approximates the expected test error by the sum of the variance and the squared bias (Equation (3)).

Writing $l(Z) = h_D(x) - y$, Equation (3) can be rewritten in terms of $l(Z)$. By the law of large numbers, if $N$ is sufficiently large, the bias is approximately $\mathbb{E}[l(Z)]$. For simplicity, the expected test error can then be written as

$$\mathbb{E}[l^2(Z)] = \mathbb{V}[l(Z)] + \mathbb{E}^2[l(Z)]. \tag{5}$$

Equation (5) has the same form as the variance formula $\mathbb{V}[x] = \mathbb{E}[x^2] - \mathbb{E}^2[x]$ in statistics. Note, however, that it contains both the bias and the variance.
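As a hedged reconstruction of the chain of equations this subsection describes in words, the standard bias-variance decomposition for squared error can be written as below. The symbols $\bar{h}$ (expected hypothesis) and $\bar{y}$ (expected label) are notation assumed here for illustration and may differ from the paper's; only the last line is identified by the text as Equation (5).

```latex
\begin{align*}
\bar{h}(x) &= \mathbb{E}_{D}\big[h_D(x)\big]
  && \text{(expected hypothesis)} \\
\mathbb{E}_{(x,y),D}\big[(h_D(x)-y)^2\big]
  &= \underbrace{\mathbb{E}_{x,D}\big[(h_D(x)-\bar{h}(x))^2\big]}_{\text{variance}}
   + \underbrace{\mathbb{E}_{x}\big[(\bar{h}(x)-\bar{y}(x))^2\big]}_{\text{bias}^2}
   + \underbrace{\mathbb{E}_{x,y}\big[(\bar{y}(x)-y)^2\big]}_{\text{noise}} \\
\mathbb{E}\big[l^2(Z)\big] &= \mathbb{V}\big[l(Z)\big] + \mathbb{E}^2\big[l(Z)\big],
  \qquad l(Z) = h_D(x) - y
\end{align*}
```

The last line is an exact identity for any random variable. The approximation in the text enters when the squared mean $\mathbb{E}^2[l(Z)]$ is read as the squared bias and the noise term is dropped.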

### Robustness analysis of distributions

**Distributional robustness** Distributionally robust optimization improves the robustness of a model by optimizing it against a worst-case distribution. In this approach, the input $x \in X$ and the output $y \in Y$ are drawn from a joint data distribution $P(X, Y)$, and $h : X \to Y$ is the machine learning model. Given a loss function $L : Y \times Y \to \mathbb{R}_+$, the objective is to minimize the following:

where $U_P \subseteq P(X, Y)$ is the uncertainty set of probability distributions. Solving this optimization problem yields model parameters that are sufficiently robust. Within the distributional robustness framework, Werber et al. investigated the discrepancy in generalization performance of model-agnostic models caused by discrepancies between data distributions, and provided upper and lower bounds on model generalization performance. Inspired by Theorem 2.2 of Werber's paper and the analysis above, the authors extend this result to use the second-order origin moment instead of the first-order moment used in the original work. By introducing a distance parameter $\epsilon$, they bound the robust performance of $h$ on a shifted data distribution $Q$:

where $P$ is the actual distribution and B2L is a bound that depends on the distance $\epsilon$ and the current data distribution $P$. Dist(·, ·) denotes the Hellinger distance, which is used in machine learning to quantify the similarity between two probability distributions. The upper and lower bounds on $\mathbb{E}_{(X, Y) \sim Q}[L^2(h(X), Y)]$ follow from Theorem 3.1 and Theorem 3.2.
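To make the distance concrete, the following minimal sketch computes the Hellinger distance between two discrete distributions. It assumes the common normalisation under which the distance lies in [0, 1]; the function name `hellinger` is hypothetical and not taken from the paper.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support.

    Assumes the normalisation H(P, Q) = sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)),
    so that 0 <= H <= 1.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


# Identical distributions are at distance 0, disjoint ones at distance 1.
same = hellinger([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger([1.0, 0.0], [0.0, 1.0])
```

A shifted distribution $Q$ belongs to the Hellinger ball of radius $\epsilon$ around $P$ exactly when this value is at most $\epsilon$.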

**Theorem 3.1**: Upper bound on the generalization performance of a model-agnostic model under the shifted distribution:

Assume that $L : Y \times Y \to \mathbb{R}_+$ is a non-negative function and that $\sup_{(x,y) \in (X, Y)} |L(h(x), y)| \le M$ for some $M > 0$. Then $\sup_{(x,y) \in (X, Y)} |L^2(h(x), y)| \le M^2$, and for any probability measure $P$ and any $\epsilon > 0$ the following holds:

where $\lambda_\epsilon = [\epsilon^2 (2 - \epsilon^2)(1 - \epsilon^2)^2]^{1/2}$ and $B_\epsilon(P) = \{Q \in P(X, Y) : H(P, Q) \le \epsilon\}$ is the Hellinger ball of radius $\epsilon$ centered at $P$. The radius $\epsilon$ must satisfy the following condition:

**Theorem 3.2**: Lower bound on the generalization performance of a model-agnostic model under the shifted distribution:

If $L : Y \times Y \to \mathbb{R}_+$ is a function taking non-negative values on $(X, Y)$, then for any probability measure $P$ and any $\epsilon > 0$, we have

where $\lambda_\epsilon = [\epsilon^2 (1 - \epsilon^2)^2 (2 - \epsilon^2)]^{1/2}$ and $B_\epsilon(P) = \{Q \in P(X, Y) : H(P, Q) \le \epsilon\}$ is the Hellinger ball of radius $\epsilon$ centered at $P$. The radius $\epsilon$ must be sufficiently small:

Theorems 3.1 and 3.2 provide upper and lower bounds on the generalization performance of a model-agnostic model when the data distribution is shifted by a discrepancy of $\epsilon$. Both bounds combine the expectation and the variance of the loss, which can be viewed as computing the second-order origin moment of the loss function with the variance acting as a regularization term. For local models in federated learning, these upper and lower bounds can be used to estimate the discrepancy in generalization performance.

## Federated learning with robust weighting

**Problem formulation** In a typical federated learning study, the aggregation weights of the local models satisfy $\sum_{k=1}^{K} p_k = 1$, where $p_k$ is the ratio of the $k$-th client's local training samples to the total number of training samples. This approach is intended to account for the contribution of each local model in proportion to its data. However, in heterogeneous scenarios the data distribution may differ among clients, and determining aggregation weights from sample proportions fails to take into account the potential negative effects of heterogeneous data.

From the bias-variance trade-off, the second-order origin moment consists of two important statistical indicators, bias and variance, which provide valuable insight into the accuracy and generalizability of the learned model. More importantly, according to the Sharpness-Aware Minimization analysis described earlier, the second-order origin moment exhibits better stability and convexity. Based on this analysis, the authors' goal is to estimate upper and lower bounds for the second-order origin moment of the local model's loss under a distributionally robust setting. This provides a fuller picture of model performance for weighted aggregation and takes potential variability and uncertainty into account.

The problem is formalized as follows. To mitigate the adverse effects of heterogeneous data on aggregation weighting, a distance is first specified to quantify the discrepancy in the data distribution, representing the degree of distribution shift. Next, the upper and lower generalization bounds of the local model are estimated. Finally, the discrepancy between these generalization bounds provides the basis for determining the weights used for aggregation.

### Federated Learning

**General federated learning.** In a typical FL setting [1], the learning objective can be written as the following optimization problem:

where $L_k$ is the total learning loss of the $k$-th client, $h$ is the hypothesis of the learning model, and $K$ is the number of local clients participating in training. If the $k$-th client has $n_k$ training samples $\{(x_{k,1}, y_{k,1}), (x_{k,2}, y_{k,2}), \dots, (x_{k,n_k}, y_{k,n_k})\}$, the local objective function $L_k(\cdot)$ can be defined as follows:

where $L(\cdot, \cdot)$ is the loss function. The server sends the aggregated global parameters to the clients, and the $k$-th local client performs $E$ local update steps on its model $h^t_{k,e}$:

Finally, the global model is generated by aggregating the locally trained models.
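The round structure described above can be sketched with plain Python lists standing in for model parameters. This is a minimal illustration of sample-proportion (FedAvg-style) aggregation under simplifying assumptions: full-batch gradient steps, and the hypothetical helper names `local_update`, `sample_proportion_weights`, and `aggregate`, none of which come from the paper.

```python
def local_update(global_params, grad_fn, lr, steps):
    """Run `steps` gradient-descent steps on a copy of the global parameters."""
    h = list(global_params)
    for _ in range(steps):
        grad = grad_fn(h)
        h = [w - lr * g for w, g in zip(h, grad)]
    return h


def sample_proportion_weights(sample_counts):
    """p_k = n_k / sum_j n_j, so that the weights sum to 1."""
    total = sum(sample_counts)
    return [n / total for n in sample_counts]


def aggregate(client_params, weights):
    """Weighted average of the client parameter vectors."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("aggregation weights must sum to 1")
    dim = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(dim)
    ]


# Two clients holding 10 and 30 samples respectively.
weights = sample_proportion_weights([10, 30])
new_global = aggregate([[0.0, 4.0], [4.0, 0.0]], weights)
```

Any weighting rule that returns non-negative weights summing to 1 can be passed to `aggregate` in place of the sample proportions, which is the slot the robust weighting strategy fills.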

### Generalization bound estimation

The distributional robustness analysis is incorporated into the estimation of the bounds of each local model. In heterogeneous data scenarios, using upper and lower bounds provides a more robust and fairer measure of client training performance, and this step is crucial for the robust weighted aggregation strategy. Based on Theorem 3.1 for the upper bound and Theorem 3.2 for the lower bound, the following theorem is derived:

**Theorem 4.1** The upper and lower bounds on model performance under a shifted distribution are as follows:

Theorem 4.1 gives the upper and lower bounds used to estimate generalization performance from the actual data distribution of each local client. These bounds depend primarily on the expectation and variance of the loss under the actual data distribution. To estimate them, one draws samples from the actual data distribution, computes the learning loss on them, and sets a given distance to quantify the discrepancy in the data distribution.
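The closed form of Theorem 4.1 is not reproduced here. The sketch below only computes the ingredients the text names: the empirical expectation and variance of per-sample losses, their second-order origin moment, and $\lambda_\epsilon$ as defined in Theorems 3.1 and 3.2. The function names are hypothetical, and using the population variance is an assumption.

```python
import math
from statistics import fmean, pvariance


def loss_moments(losses):
    """Return (mean, variance, second-order origin moment) of per-sample losses."""
    values = list(losses)
    mean = fmean(values)
    var = pvariance(values, mu=mean)          # population variance (assumption)
    second = fmean([v * v for v in values])   # E[l^2]
    return mean, var, second


def lambda_eps(eps):
    """lambda_eps = [eps^2 (2 - eps^2) (1 - eps^2)^2]^(1/2), as stated in the theorems."""
    e2 = eps * eps
    return math.sqrt(e2 * (2.0 - e2) * (1.0 - e2) ** 2)


# 0-1 losses on a small batch: half of the predictions are wrong.
mean, var, second = loss_moments([0.0, 1.0, 1.0, 0.0])
```

The returned second moment equals the variance plus the squared mean, which is the identity behind the bias-variance reading of the second-order origin moment.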

### Robust weighting for FL

This section introduces a robust aggregation weighting strategy based on Theorem 4.1. Rather than relying solely on sample proportions, the approach exploits discrepancies in the generalization bounds to obtain a more robust and fairer weighting scheme. In federated learning, the data within each client is assumed to remain unchanged during each round of training. However, clients with diverse data distributions may exhibit different generalization performance under the same model hypothesis. If the upper and lower bounds are $u$ and $l$, respectively, as defined in Equations (16) and (17), the discrepancy $\sigma$ for the $j$-th client at a given distance is computed as follows:

To obtain more information about the discrepancy of the generalization bounds, several different distance values are set and the boundary discrepancy is computed at each of them; these can be viewed as neighborhood values. The total boundary discrepancy $\eta_j$ of the $j$-th client ($j = 1, \dots, K$) is then:

In communication round $t + 1$, the aggregation weight is formulated as follows:

The resulting coefficient is the robust aggregation weight. The framework of the robust weighting strategy is depicted in Figure 1 and Algorithm 1. Note that the upper and lower bounds in Theorem 4.1 hold under different conditions. As a result, they must be estimated separately; the discrepancy cannot be obtained by directly subtracting one closed-form expression from the other.
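The steps in this subsection can be sketched as follows. The per-distance discrepancy and its sum over several distance values follow the description in the text. The final normalisation is an assumption: the text only states that clients with a narrower discrepancy receive a higher weight, so the inverse-discrepancy rule in `robust_weights` is one simple choice with that property and is not claimed to be the paper's formula. All function names are hypothetical.

```python
def total_discrepancy(bounds_per_distance):
    """Sum of (upper - lower) over several distance values for one client.

    `bounds_per_distance` is a list of (upper, lower) pairs, one per distance,
    where each bound was estimated separately.
    """
    return sum(upper - lower for upper, lower in bounds_per_distance)


def robust_weights(total_discrepancies):
    """ASSUMPTION: weights proportional to 1 / eta_j, normalised to sum to 1."""
    if any(eta <= 0 for eta in total_discrepancies):
        raise ValueError("total discrepancies must be positive")
    inverse = [1.0 / eta for eta in total_discrepancies]
    norm = sum(inverse)
    return [value / norm for value in inverse]


# Client 0 has tight bounds, client 1 has loose bounds, at two distance values.
eta_0 = total_discrepancy([(1.5, 0.5), (2.0, 1.0)])   # 1.0 + 1.0
eta_1 = total_discrepancy([(3.0, 1.0), (3.5, 1.5)])   # 2.0 + 2.0
weights = robust_weights([eta_0, eta_1])
```

With these toy values the client with the tighter bounds receives twice the weight of the other, regardless of how many samples each client holds.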

Figure 1. Overview of robust aggregate weighting. Each client estimates the generalization discrepancy of the model and performs training and aggregation weighting.

### Robust aggregate weighting algorithm

This section abstracts the key steps described above into an overview of the algorithm. Algorithm 1 presents the robust aggregation weighting strategy within a standard federated learning framework, which consists of two steps: ClientUpdate and ServerExecute. The four classical baselines remain applicable within this algorithm. Instead of sample proportions, the estimated boundary discrepancies are used as the weights for aggregating local model parameters: in the ClientUpdate step, each client trains its local model and estimates its boundary discrepancy, and in the ServerExecute step, the server receives the discrepancies and client models and aggregates all clients.

## Experiment

### Boundary discrepancies in IID and non-IID data sets

**・Implementation**

The authors investigate how the estimated boundary discrepancy differs between the IID and non-IID cases on the CIFAR10 dataset. In the IID case, the training dataset consists of 2,000 randomly selected samples from each category, for a total of 20,000 training samples. In the non-IID case, the training dataset also contains 20,000 samples, with the following randomly drawn sample size for each category: [913, 994, 2254, 2007, 1829, 1144, 840, 4468, 713, 4838]. In all cases, the test set consists of 10,000 samples, with 1,000 samples in each category. The model is ResNet-20, and two loss functions are considered: 0-1 loss and JSD loss. Throughout the experiment, 100 communication rounds are performed with a batch size of 64.
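The two training subsets can be drawn with a short helper such as the one below. It is a sketch that assumes the labels are available as a list of integers from 0 to 9 and that each class has enough samples; `sample_per_class` is a hypothetical name, and the stand-in label list only mimics the 5,000-per-class layout of the CIFAR10 training set.

```python
import random

IID_COUNTS = [2000] * 10
NON_IID_COUNTS = [913, 994, 2254, 2007, 1829, 1144, 840, 4468, 713, 4838]


def sample_per_class(labels, counts, seed=0):
    """Pick counts[c] random sample indices from each class c."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    chosen = []
    for cls, count in enumerate(counts):
        chosen.extend(rng.sample(by_class[cls], count))
    return chosen


# Stand-in labels: 10 classes with 5,000 samples each.
labels = [cls for cls in range(10) for _ in range(5000)]
iid_indices = sample_per_class(labels, IID_COUNTS)
non_iid_indices = sample_per_class(labels, NON_IID_COUNTS)
```

Both count vectors sum to 20,000, so the two subsets differ only in how the samples are spread across classes.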

*・Results*

Figure 2 shows trends in loss and test accuracy during training. The results clearly show that the model performs better in the IID case than in the non-IID case. In Figure 3, upper and lower bounds are shown using 0-1 and JSD losses, with blue lines representing IID data and red lines representing non-IID data. The total boundary discrepancy is calculated at 10 equally spaced discrete points: for 0-1 loss, the total boundary discrepancy for non-IID data is 2.28 and for IID data is 2.10; for JSD loss, the total boundary discrepancy for non-IID data is 2.38 and for IID data is 2.16. These findings indicate that in the IID scenario, the data are similar to each other and follow the same distribution, resulting in a smaller range of possible predicted outcomes and tighter bounds for disagreement between different models and algorithms. Conversely, in the non-IID scenario, the data are more diverse and may follow different distributions, resulting in a wider range of possible predictive outcomes and looser bounds of disagreement between models.

These results confirm that shifts in heterogeneous data distributions can be evaluated effectively by estimating bounds on generalization performance. They also provide an initial basis for the subsequent experiments on robust aggregation weighting.

Figure 2: Test accuracy and training loss on IID and non-IID CIFAR10 data sets.

Figure 3: Upper and lower bound discrepancies for 0-1 loss and JSD loss in the CIFAR10 data set, blue line for IID data and red line for non-IID data, showing that IID has a tighter bound discrepancy than non-IID.

### Robust aggregate weighting for FL

These experiments test the effectiveness of robust aggregation weighting based on boundary discrepancies in FL. The baselines follow the FedDyn paper and include FedAvg, FedProx, SCAFFOLD, and FedDyn. Under the same hyperparameter settings, sample-proportion weighting is compared with the proposed robust aggregation weighting.

**・Experimental setup**

**Datasets** To assess data heterogeneity, four datasets widely used in federated learning research are used: CIFAR10, MNIST, CIFAR100, and EMNIST. To simulate non-IID data more realistically, the class distribution is made heterogeneous across clients, allowing some classes to be missing on a client. For this purpose, the data is sampled according to an unbalanced Dirichlet distribution. For each client, a random vector $p_k \sim \mathrm{Dir}(\alpha)$ is drawn from the Dirichlet distribution, and the fraction of images of each category $c$ in the data assigned to the $k$-th client is $(100 \cdot p_{k,c})\%$. In the experiments, the parameter of the lognormal distribution is set to `unbalanced_sgm = 0.9` and the parameter of the Dirichlet distribution to `rule_arg = 0.3`. In addition, to simulate noisy data in a real-world scenario, 20% noisy data is introduced into the four datasets by setting some of the labels to 0.
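A minimal sketch of the two ingredients described above, using only the standard library: a symmetric Dirichlet draw for one client's class proportions, and label noise that overwrites a fraction of labels with 0. The helper names are hypothetical, the lognormal client-size step is omitted, and the exact way noisy samples are selected in the paper is not stated, so uniform random selection is an assumption.

```python
import random


def dirichlet_proportions(alpha, n_classes, rng):
    """Draw p_k ~ Dir(alpha, ..., alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]


def add_label_noise(labels, noise_ratio, rng, noisy_label=0):
    """Overwrite a random `noise_ratio` fraction of labels with `noisy_label`."""
    noisy = list(labels)
    n_noisy = int(round(noise_ratio * len(noisy)))
    for index in rng.sample(range(len(noisy)), n_noisy):
        noisy[index] = noisy_label
    return noisy


rng = random.Random(0)
proportions = dirichlet_proportions(0.3, 10, rng)      # alpha = 0.3 as in the text
noisy_labels = add_label_noise([1] * 100, 0.2, rng)    # 20% of labels set to 0
```

With a small concentration such as 0.3, most of the probability mass tends to fall on a few classes, which is how some classes end up nearly or entirely missing on a client.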

**Setup** In all experiments, all clients participate in every communication round, i.e., each client participates in training with probability 1. The number of communication rounds is set to [200, 500, 700] depending on the dataset, as shown in Figure 4. The weight decay is 1e-3 and the batch size is 50. For each client, the number of local epochs is 5 and the learning rate is 0.1. For each dataset, experiments are performed with 10, 20, 50, 100, and 200 clients. For the MNIST and EMNIST datasets, a fully connected neural network with two hidden layers of 200 and 100 neurons is used. For the CIFAR10 and CIFAR100 datasets, the CNN model of (McMahan et al., 2017) is used, which has two convolutional layers with 64 5 × 5 filters, followed by two fully connected layers with 394 and 192 neurons and a softmax layer.

*・Experimental results on model performance*

**Overall summary** The authors applied both the robust aggregation weighting strategy and the original sample-proportion method to four classical baseline algorithms. Figure 4 shows the test accuracy results for 10, 20, and 50 clients, with 20% additional noisy data. In the figure, the solid line represents the robust aggregation weighting strategy and the dashed line represents the sample-proportion strategy. The corresponding test accuracies for all experiments are shown in Table 1, where Propto stands for sample proportions and Robust for robust aggregation weighting. The table shows that the authors' strategy achieves higher test accuracy than the original strategy in most cases, with significant improvements for FedAvg and FedProx and slight improvements for SCAFFOLD and FedDyn. These results indicate that the robust aggregation weighting is fairer and more robust when dealing with heterogeneous and noisy data.

Figure 4: Test accuracy on CIFAR10, MNIST, CIFAR100, and EMNIST using Dirichlet (0.3, 0.9). Communication rounds are set to [200, 500, 700] for different data sets.

Table 1: Test accuracy using Dirichlet (0.3, 0.9).

**Test accuracy** Looking at the test accuracies in Table 1, the proposed weighting method consistently achieves significant improvements for the FedAvg and FedProx baselines. Also, when the proportion of noisy data is high and the data distribution among clients is heterogeneous, FedProx, which improves on FedAvg by adding a regularization term, does not consistently outperform FedAvg. For SCAFFOLD and FedDyn, across all client counts and datasets, the proposed weighting outperforms the original method in 77.5% of the tests: it wins in every case on MNIST, while the cases where it fails amount to 2.5% on EMNIST, 5% on CIFAR10, and 15% on CIFAR100. The authors attribute these failures to dimensional collapse caused by dataset heterogeneity during training when the bias correction is introduced for the weight estimation, which leads to the loss of some representational information, ineffective weights, and poor model performance. Even with the new weights, SCAFFOLD, FedProx, and FedAvg still find it difficult to match FedDyn with its original weighting, given FedDyn's superior performance. FedDyn's client-side strategy of driving each local model toward the global optimum makes it difficult to capture an effective boundary discrepancy for certain well-performing clients, but it remains an excellent solution.

**Robustness analysis** In addition to the above analysis, the authors performed a robustness analysis to evaluate their robust aggregation weighting strategy. The experimental results show that the strategy promotes fairness by assigning smaller weights to clients with a high proportion of heterogeneous data during parameter aggregation. In particular, it yields significant accuracy improvements for FedAvg and FedProx and marginal improvements for SCAFFOLD and FedDyn. The reason is that FedAvg and FedProx are classical methods that do not adequately correct the local model bias caused by heterogeneous data, whereas SCAFFOLD and FedDyn focus on correcting the local model shift caused by heterogeneous data. SCAFFOLD and FedDyn improve accuracy but require more training time than FedAvg and FedProx. By incorporating the robust aggregation weighting strategy, FedAvg and FedProx can achieve performance comparable to SCAFFOLD and FedDyn while requiring less computation time.

**Balance of communication and performance** Comparing across algorithms, the gap in test accuracy among the four baselines narrows when the new weighting method replaces the original one. For example, FedAvg and FedProx consistently perform significantly worse than SCAFFOLD in the experiments with sample-proportion weighting. With weighting based on the boundary discrepancy, however, the performance difference between the algorithms shrinks rapidly, and FedAvg and FedProx outperform SCAFFOLD in a small number of cases. Similarly, after boundary information is incorporated into the aggregation weights, FedAvg and FedProx still lag slightly behind FedDyn, but the difference is much smaller than with sample-proportion weighting. This is particularly relevant to scenarios with tight communication and computational constraints but more relaxed accuracy requirements.

To further evaluate the stability of the results, additional experiments with 100 and 200 clients were conducted, as shown in Figure 5. The results are consistent with the improvements observed in the 10, 20, and 50 client experiments, and more detailed results are presented in Table 1, which shows significant accuracy improvements for FedAvg and FedProx. It should be noted, however, that the weighting strategy does not consistently improve the accuracy of all federated learning algorithms. This may be due to the stability of FL algorithms such as SCAFFOLD and FedDyn and the potential risk of overfitting during training. Overall, the experimental results provide compelling evidence for the effectiveness of the weighting strategy, especially for FedAvg and FedProx.

**・Analysis of Variance of Test Accuracy**

Experiments are performed on four baselines and four datasets, with 10 and 100 randomly selected clients, to compute the variance of test accuracy after model convergence. The results are shown in Table 2. As in the previous table, Propto. denotes the sample-ratio weighting method and Robust. denotes the authors' new method based on generalization bound estimation. The variance is calculated from the performance of models that continue to participate in communication rounds after convergence.

The authors perform this analysis because, in environments with high noise ratios and strong data heterogeneity, some methods such as SCAFFOLD [4] exhibit significant oscillations after convergence, which increases the uncertainty of model training. They therefore compared the stability of several baselines, which can also be observed visually in Figure 4 and Figure 5. Table 2 shows that the authors' weighting method had a lower test accuracy variance than the original weighting method in 90.625% of cases. For FedAvg and FedProx, the proposed weighting scheme reduced the test accuracy variance relative to sample-ratio weighting.

For FedDyn and SCAFFOLD, the variance exceeded that of the original weighting method in only a few cases. This indicates that the authors' method achieves robust training in most cases but may fail in some extreme scenarios, which requires further exploration.
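The 90.625% figure can be sanity-checked with simple arithmetic. The sketch below assumes that Table 2 contains one comparison per baseline, dataset, and client-count setting (4 x 4 x 2 = 32); this count is inferred from the setup described above, not read from the table itself.

```python
from fractions import Fraction

# Assumed grid of comparisons, inferred from the described setup.
num_baselines = 4        # FedAvg, FedProx, SCAFFOLD, FedDyn
num_datasets = 4         # CIFAR10, MNIST, CIFAR100, EMNIST
num_client_settings = 2  # 10 and 100 clients
total_cases = num_baselines * num_datasets * num_client_settings

reported_share = Fraction(90625, 100000)  # 90.625%, reduces to 29/32
lower_variance_cases = reported_share * total_cases
higher_variance_cases = total_cases - lower_variance_cases
```

Under this assumption the reported share corresponds to a whole number of cases, with the small remainder being the FedDyn and SCAFFOLD configurations mentioned above.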

Table 2: Variance of test accuracy with Dirichlet(0.3, 0.9).

Figure 5. CIFAR10, MNIST, CIFAR100, EMNIST with Dirichlet(0.3, 0.9).

### Noisy data and percentage of participating clients in FL

Experiments were conducted to test the effectiveness of the authors' robust aggregate weighting strategy under different proportions of noisy data and participating clients. To introduce noise, the authors added 40% noise to the CIFAR10 and EMNIST datasets. For client participation, they set prob=0.7 as the proportion of clients participating in training on the CIFAR10 dataset. The rest of the experimental setup remained the same.
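The two perturbations can be sketched as follows. This is only an illustration under stated assumptions: the article does not say whether the noise is applied to labels or to features, so label noise is assumed here, and `prob` is interpreted as a fixed fraction of clients sampled each round. The function names are hypothetical.

```python
import numpy as np

def add_label_noise(labels, noise_ratio, num_classes, rng):
    """Return a noisy copy of `labels` and the indices that were touched.
    Assumption: noise means replacing labels with uniformly drawn classes
    (a drawn label may coincide with the original one)."""
    noisy = np.array(labels, copy=True)
    n_noisy = int(round(noise_ratio * noisy.size))
    idx = rng.choice(noisy.size, size=n_noisy, replace=False)
    noisy[idx] = rng.integers(0, num_classes, size=n_noisy)
    return noisy, idx

def sample_clients(num_clients, prob, rng):
    """Pick a fixed fraction `prob` of clients for one round (assumption)."""
    k = max(1, int(round(prob * num_clients)))
    return rng.choice(num_clients, size=k, replace=False)

rng = np.random.default_rng(0)
clean = np.zeros(1000, dtype=int)
noisy, touched = add_label_noise(clean, noise_ratio=0.4, num_classes=10, rng=rng)
participants = sample_clients(num_clients=10, prob=0.7, rng=rng)
```

An alternative reading of `prob` is that each client joins a round independently with probability 0.7; that would replace the fixed-size draw with a per-client Bernoulli trial.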

The results of these experiments are summarized in Figure 6, Table 3, and Table 4. Compared with the earlier experiments, the authors' robust aggregate weighting strategy remains effective with a higher percentage of noisy data and when only a subset of clients participates in training. The strategy consistently improves test accuracy across these scenarios, which demonstrates its robustness and adaptability even when a large share of the data is corrupted.

The ability of the authors' strategy to adapt to such demanding scenarios is a major advantage and supports reliable and accurate model training in real-world settings.

Figure 6. (a) Test accuracy on CIFAR10 with 40% data noise and Dirichlet(0.3, 0.9). (b) Test accuracy on CIFAR10 with 70% of clients involved in training and Dirichlet(0.3, 0.9). Solid lines show the authors' weighting strategy; dashed lines show weighting by the proportion of training samples.

Table 3: Test accuracy with Dirichlet(0.3, 0.9) and 40% noisy data.

Table 4: Test accuracy with Dirichlet(0.3, 0.9) and 70% of clients participating.

## Conclusion

In Federated Learning (FL), traditional aggregate weighting based on sample proportions can produce unfair results because the data distribution differs from client to client. To address this issue, the authors revisited aggregate weighting from a new perspective and proposed an approach that takes into account the performance of each client's local model. Drawing on distributionally robust analysis, they introduced a weighting method based on the discrepancy between the generalization bounds of each local model, rather than on sample proportions alone.

To obtain these bounds, the authors used the second-order moment of the robust loss, which yields smoother generalization bounds while avoiding the assignment of extremely small weights to some clients. This approach was shown to be more effective in realistic FL scenarios such as those with noisy data and class imbalance.

The experimental results show that the proposed weighting strategy significantly improves the performance and robustness of existing FL algorithms. Future work includes developing automatically adaptive weighting methods for gradient aggregation when training on datasets with different distributions. These efforts are expected to further advance Federated Learning.
