FedGCS: A Generative Framework for Efficient Client Selection in Federated Learning via Gradient-based Optimization (2024)

Zhiyuan Ning1,2*  Chunlin Tian4*  Meng Xiao1,2  Wei Fan5  Pengyang Wang4  Li Li4†
Pengfei Wang1,2†  Yuanchun Zhou1,2,3
1Computer Network Information Center, Chinese Academy of Sciences
2University of Chinese Academy of Sciences  3Hangzhou Institute for Advanced Study, UCAS
4Department of Computer and Information Science, IOTSC, University of Macau  5University of Oxford
ningzhiyuan@cnic.cn, yc27402@um.edu.mo, shaow@cnic.cn, wei.fan@wrh.ox.ac.uk,
{pywang, llili}@um.edu.mo, {pfwang, zyc}@cnic.cn
* These authors contributed equally.  † Corresponding author.

Abstract

Federated Learning faces significant challenges in statistical and system heterogeneity, along with high energy consumption, necessitating efficient client selection strategies. Traditional approaches, including heuristic and learning-based methods, fall short of addressing these complexities holistically. In response, we propose FedGCS, a novel generative client selection framework that innovatively recasts the client selection process as a generative task. Drawing inspiration from the methodologies used in large language models, FedGCS efficiently encodes abundant decision-making knowledge within a continuous representation space, enabling efficient gradient-based optimization to search for the optimal client selection, which is finally output via generation. The framework comprises four steps: (1) automatic collection of diverse “selection-score” pair data using classical client selection methods; (2) training an encoder-evaluator-decoder framework on this data to construct a continuous representation space; (3) employing gradient-based optimization in this space to find the optimal client selection; (4) generating the final optimal client selection via beam search with the well-trained decoder. FedGCS outperforms traditional methods by being more comprehensive, generalizable, and efficient, simultaneously optimizing for model performance, latency, and energy consumption. The effectiveness of FedGCS is proven through extensive experimental analyses.

1 Introduction

Federated Learning (FL), as introduced in McMahan et al. (2017), enables multiple client devices to collaboratively train a shared model without exposing users’ raw data, preserving privacy. Despite its advantages, the practical deployment of FL still faces various challenges: (1) statistical heterogeneity, i.e., data owned by different clients may come from different distributions, resulting in non-independent and identically distributed (Non-IID) training data, which can seriously affect the convergence and overall performance of the global model Hsieh et al. (2020); (2) system heterogeneity, i.e., clients participating in FL have different degrees of computational resources, communication capabilities, and fault tolerance, which may result in stragglers that hinder the FL training process and inflict intolerable latency Bonawitz et al. (2019); (3) high energy consumption, i.e., the devices perform compute-intensive data processing and model training and frequently exchange model parameters and gradients with the central server, which leads to substantial energy consumption. Thus, the inclusion of too many clients may lead to suboptimal performance, high latency, and wasted energy. Selecting a “good” subset of clients as FL participants for each training round is critical to mitigating all three of these issues Lai et al. (2021).


Existing client selection methods only partially address the above three challenges and can be categorized into two types: (1) heuristic-based selection, as shown in Figure 1(a) right, in which manually engineered heuristic rules are used to choose participating devices to improve model performance and training efficiency Chai et al. (2020); Lai et al. (2021); Tian et al. (2022). However, such methods are subjectively designed by human experts and are not comprehensive, considering only a limited number of optimization objectives and deployment scenarios. As a result, when confronted with unfamiliar and complex scenarios, a great deal of domain expertise and manual tuning is often required, and the final performance may be suboptimal; (2) learning-based selection, as shown in Figure 1(a) left, which explores the use of reinforcement learning (RL) Sutton and Barto (2018) to train continuously self-optimizing agents for optimal client selection Wang et al. (2020); Zhang et al. (2022); Tian et al. (2023). However, the objectives designated by human experts for RL training still do not handle evolving and complex real-world deployment scenarios well Kim and Wu (2021). Moreover, RL makes decisions in massive discrete spaces and therefore has difficulty converging compared to solving continuous optimization problems Dulac-Arnold et al. (2021).

Recently, large language models (LLMs) trained on large volumes of text data have shown impressive abilities to encode world knowledge into model parameters and solve a wide variety of natural language processing tasks via text generation Brown et al. (2020); Touvron et al. (2023). Such success has driven various domains to migrate to generative models Ramesh et al. (2022); Gruver et al. (2023), where all the information and knowledge required for a task is encoded into the model parameters and the task is ultimately solved in a generative manner. This is a general, flexible, comprehensive, and efficient modeling style that prompts us to ask: can client selection in FL also be effectively addressed in a similar generative way? To this end, we propose FedGCS, a novel Generative Client Selection framework for Federated learning, which formulates the discrete client device selection task as a generative task. Like LLMs, FedGCS encodes the broad knowledge of discrete decisions (i.e., selection or deselection of clients) into a continuous representation space, where gradient-based optimization can be efficiently applied to find the optimal representation, which is eventually output in the discrete selection format via generation (as shown in Figure 1(b)). Specifically, FedGCS includes four steps: (1) utilizing classical client selection approaches, i.e., heuristic-based and learning-based selection, to automatically collect sufficient, diverse, and high-quality “selection-score” pair data as training data for subsequent models; (2) training an encoder-evaluator-decoder framework by simultaneously optimizing the sequence-to-sequence loss Sutskever et al. (2014) and the score estimation loss on the collected pair data, thereby obtaining a continuous representation space on which selection optimization will be performed; (3) adopting gradient-based optimization in the continuous representation space to find the optimal client selection representation; (4) applying the beam search strategy Freitag and Al-Onaizan (2017) to the well-trained decoder to generate the optimal client selection based on the optimal representation.

FedGCS has the following advantages over previous methods: (1) FedGCS encodes the abundant experience of various classical algorithms into the same continuous space to determine the final decision, and simultaneously considers the three metrics of model performance, latency, and energy, making it more comprehensive than traditional methods; (2) through data-driven learning, FedGCS only needs the automatically collected “selection-score” pair data as input and avoids the overly complex and cumbersome manual interventions and tuning of previous methods, making FedGCS more flexible and generalizable; (3) modeling the process of determining the optimal client selection as gradient-based optimization in a continuous space is more efficient than performing a heuristic rule-based search in a large discrete space or training hard-to-converge RL agents. Finally, we conduct extensive experiments and analyses to demonstrate the effectiveness and superiority of FedGCS.

2 Related Works


2.1 Client Selection in Federated Learning

Client selection in FL is critical to optimizing communication costs, computational resources, and overall performance Lai et al. (2021), and existing work falls into two main categories: (1) heuristic-based selection, which predominantly relies on heuristics rooted in isolated considerations such as data heterogeneity and energy efficiency Li et al. (2019); Cho et al. (2020). Specifically, Oort Lai et al. (2021) employs analytical strategies that more comprehensively address the multifaceted nature of device selection. However, adapting these heuristics to unseen scenarios often demands substantial domain-specific expertise and extensive tuning. (2) learning-based selection, which devises policies for device selection through RL Sutton and Barto (2018); Tian et al. (2024); Fan et al. (2020) by formulating decisions as Markov decision processes. Specifically, AutoFL Kim and Wu (2021) leverages a Q-table, Favor Wang et al. (2020) adopts Q-learning, and FedMarl Zhang et al. (2022) utilizes multi-agent RL to achieve their objectives. However, RL agents are difficult to train to convergence Dulac-Arnold et al. (2021); Fan et al. (2021), especially in noisy and complex real-world environments (as shown in Figure 2 left). Finally, as shown in Figure 2 right, all the methods mentioned above usually focus on only one or two optimization metrics and lack a comprehensive global consideration of performance, latency, and energy.


2.2 Generative Models

Recent advancements in LLMs and generative models mark a shift in the machine learning paradigm, attesting to the evolution from task-specific architectures Ning et al. (2022) to highly generalized generic models. These models, exemplified by the likes of GPT-3 Brown et al. (2020) in the language domain and DALL·E Ramesh et al. (2022) in the multi-modal domain, have driven the migration to generative models across various domains Gruver et al. (2023); Xiao et al. (2023); Wang et al. (2024). Generative models have showcased an unparalleled ability to encode multifaceted knowledge representations from expansive data into their parameters Brown et al. (2020). This encoded knowledge can be adeptly maneuvered to generate contextually coherent and semantically rich outputs, addressing a wide spectrum of downstream tasks Ning et al. (2021). Despite the wide range of applications in language and vision, generative models have not been applied extensively to decision-making tasks such as client selection in FL.

3 Problem Formulation

In this section, we formulate the discrete client selection task in FL as a generative task and conduct gradient-based optimization in a continuous representation space to obtain the optimal selection. Firstly, at the beginning of a training round, given a candidate client device pool set $\mathcal{C}=\{c_{1},c_{2},\dots,c_{J}\}$ of $J$ client devices, we use classic client selection algorithms to collect $n$ “selection-score” pairs as training data for subsequent model training. We denote the collected records as $\mathcal{R}=\{(\mathbf{s_{i}},p_{i})\}_{1}^{n}$, where $\mathbf{s_{i}}=\{s_{1},s_{2},\dots,s_{T}\}$ is the client selection record made by a classic algorithm ($\mathbf{s_{i}}\subset\mathcal{C}$; each $s_{1\leq t\leq T}\in\mathcal{C}$ is a selected device ID; $T$ is not fixed but is determined by the classical algorithm used for this selection), and $p_{i}$ is the corresponding comprehensive score of the FL performance achieved by the client selection $\mathbf{s_{i}}$. To characterize $p_{i}$ of a given $\mathbf{s_{i}}$, unlike previous FL approaches Li et al. (2019); Zhan et al. (2020); Wang et al. (2020) that focus on different metrics and typically consider only one or two of them to guide the selection, in this paper we simultaneously target: (1) good performance of the global model; (2) low processing latency; (3) low energy consumption, to perform comprehensive and multi-dimensional optimization. Here, the comprehensive metric function $M$ of a client selection $\mathbf{s_{i}}$ is defined as:

$$p_{i}=M(\mathbf{s_{i}})=p_{perf.}\times\left(\frac{L}{p_{L}}\right)^{\mathbf{1}\left(L<p_{L}\right)\times a}\times\left(\frac{E}{p_{E}}\right)^{\mathbf{1}\left(E<p_{E}\right)\times b},\qquad(1)$$

where $p_{perf.}$ denotes the downstream task performance, $L$ and $E$ are developer-specified latency and energy budgets for the devices, and $p_{L}$ and $p_{E}$ represent the actual total latency and energy consumption, encompassing communication and computation, for a given client selection $\mathbf{s_{i}}$. $\mathbf{1}(x)$ is an indicator function that takes the value 1 if $x$ is true and 0 otherwise. Therefore, client selections exceeding the desired latency $L$ and energy $E$ budgets are penalized by developer-specified factors $a$ and $b$ Lai et al. (2021), both set to 2 in our implementation. Then, we aim to learn a continuous representation space $\mathbb{S}$ for the “selection-score” pair data. Specifically, we learn three modules: (1) an encoder $\phi$ which maps a selection $\mathbf{s_{i}}$ to its corresponding continuous representation $E_{\mathbf{s_{i}}}$; (2) an evaluator $\omega$ which evaluates the comprehensive score $p_{i}$ of a selection $\mathbf{s_{i}}$ based on its representation $E_{\mathbf{s_{i}}}$; (3) a decoder $\psi$ which maps a continuous representation $E_{\mathbf{s_{i}}}$ back to its corresponding discrete client selection $\mathbf{s_{i}}$. We optimize all these modules simultaneously on the collected records $\mathcal{R}$. After obtaining the space $\mathbb{S}$ and the three well-learned modules, we can adopt gradient-based optimization in $\mathbb{S}$ to find the optimal client selection $\mathbf{s}^{*}$, given by:

$$\mathbf{s}^{*}=\psi\left(E_{\mathbf{s}^{*}}\right)=\psi\left(\operatorname{argmax}_{E_{\mathbf{s_{i}}}\in\mathbb{S}}\;\omega(E_{\mathbf{s_{i}}})\right).\qquad(2)$$
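To make Equation 1 concrete, the following is a minimal Python sketch of the comprehensive metric (not the authors' released implementation); the argument names are illustrative.

```python
def comprehensive_score(perf: float, latency: float, energy: float,
                        latency_budget: float, energy_budget: float,
                        a: float = 2.0, b: float = 2.0) -> float:
    """Comprehensive score p_i of a client selection (Equation 1).

    perf            -- downstream task performance p_perf.
    latency, energy -- actual total latency p_L and energy p_E of the selection
    latency_budget  -- developer-specified latency budget L
    energy_budget   -- developer-specified energy budget E
    a, b            -- penalty factors, applied only when a budget is exceeded
    """
    score = perf
    if latency_budget < latency:                   # indicator 1(L < p_L)
        score *= (latency_budget / latency) ** a   # (L / p_L)^a < 1, shrinking the score
    if energy_budget < energy:                     # indicator 1(E < p_E)
        score *= (energy_budget / energy) ** b     # (E / p_E)^b < 1
    return score


# Example: a selection exceeding its latency budget is down-weighted.
print(comprehensive_score(perf=0.85, latency=30.0, energy=1.5,
                          latency_budget=20.0, energy_budget=2.0))
```

Selections that stay within both budgets keep their raw performance as the score, which matches the multiplicative form of Equation 1.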

4 Methodology

4.1 Framework Overview

Figure 3 shows the overview of our framework, which consists of four steps: (1) “selection-score” data collection: we collect client selections and their corresponding comprehensive scores as training data for subsequent models; (2) continuous space for selection optimization: we develop an encoder-evaluator-decoder framework to learn a continuous representation space $\mathbb{S}$ for selection optimization based on the pair data collected in Step 1; (3) gradient-based selection optimization: we adopt gradient-based optimization in $\mathbb{S}$ to find the optimal client selection representation; (4) optimal client selection generation: we use the well-trained decoder to generate the optimal client selection based on the optimal representation obtained in Step 3.

4.2 “Selection-Score” Data Collection

To ensure that subsequent models and algorithms perform well, we need to collect sufficient, diverse, comprehensive, and high-quality “selection-score” pair data $\mathcal{R}=\{(\mathbf{s_{i}},p_{i})\}_{1}^{n}$ as training data: the data should be large enough to represent the entire distribution, and it should include high-performance selection cases as well as a certain amount of random exploration samples. Intuitively, we could use random selections and compute their corresponding comprehensive scores as training data. However, this strategy is inefficient because the number of possible random selections is exceptionally large and most of them are low-performance samples. Therefore, we must also consider the efficiency of the data collection process.

Classical client selection algorithms are capable of outputting well-performing selection samples, and they differ in their methods and the metrics they focus on. With the help of these algorithms, we can automatically and efficiently collect training data that fulfills our requirements in two ways: (1) heuristic-based collection: using heuristic-based selection methods (e.g., Oort Lai et al. (2021)) to generate records. This category of algorithms consists of manually engineered heuristic rules, and different algorithms reflect different experts’ considerations, based on which we can produce comprehensive and high-quality “selection-score” records reflecting various forms of expert intelligence. (2) learning-based collection: treating learning-based selection methods (e.g., FedMarl Zhang et al. (2022)) as exploration-based training data collectors. Such methods utilize RL to train continuously self-optimizing agents which progressively explore device selections and find optimal ones. This stochastic and self-optimizing learning process can be viewed as an exploratory and automated data collection tool that leverages machine intelligence to help collect diverse yet high-quality records. For the collected selection samples, the comprehensive metric function $M$ allows us to obtain their corresponding scores easily, thus completing the pair data collection.
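A minimal sketch of this collection loop is given below; it is an illustration rather than the authors' code, and `selector.select(pool)`, `run_fl_round`, and `score_fn` are assumed interfaces (the latter standing for the metric $M$ of Equation 1).

```python
def collect_selection_score_pairs(selectors, device_pool, run_fl_round, score_fn,
                                  runs_per_selector=100):
    """Collect the "selection-score" records R = {(s_i, p_i)} with classical selectors.

    selectors    -- classical client selection methods, each exposing select(pool)
    run_fl_round -- profiles one selection and returns (performance, latency, energy)
    score_fn     -- the comprehensive metric M of Equation 1
    """
    records = []
    for selector in selectors:                        # heuristic- and learning-based collectors
        for _ in range(runs_per_selector):
            selection = selector.select(device_pool)  # a list of selected device IDs
            perf, latency, energy = run_fl_round(selection)
            records.append((selection, score_fn(perf, latency, energy)))
    return records
```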

4.3 Continuous Space for Selection Optimization

After collecting the “selection-score” pair data $\mathcal{R}$ that reflects the broad and abundant experience of classical client device selection algorithms, we train an encoder-evaluator-decoder framework on $\mathcal{R}$ to embed these records into a continuous representation space $\mathbb{S}$ for the follow-up gradient-based selection optimization. Each point in $\mathbb{S}$ is associated with a client selection and its corresponding comprehensive score, denoted below by $\mathbf{s}$ and $p$, respectively.

Data Augmentation.

The sequence-to-sequence (seq2seq) architecture Sutskever et al. (2014) is an important machine learning method: it uses an encoder to capture the context of the input sequence and passes it to a decoder, which then produces the final output sequence. Its flexibility and powerful capabilities have led to a wide range of applications where the data can be naturally organized as a sequence Raffel et al. (2020). The encoder and decoder in our framework together also follow the seq2seq paradigm, with the particularity that the inputs and outputs (i.e., client selections) are a special kind of sequence, namely sets. Each client selection is a subset of the device pool that contains an unordered sequence of device IDs. To model the order-independent property of sets in the seq2seq component, we use data augmentation to increase the diversity of sequences corresponding to the same set. Specifically, given a selection and its corresponding score from $\mathcal{R}$, we randomly shuffle the device IDs contained in the selection to obtain a new order, then pair the shuffled device ID sequence with the original score and add it to the training data.
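A minimal sketch of this permutation-based augmentation is shown below (assumed helper names; the default of 25 shuffles follows the setting reported in Section 5.1).

```python
import random

def augment_selection(selection, score, num_shuffles=25):
    """Permutation augmentation: a selection is a set, so every ordering of its
    device IDs should be paired with the same comprehensive score."""
    augmented = [(list(selection), score)]
    for _ in range(num_shuffles):
        shuffled = list(selection)
        random.shuffle(shuffled)                 # a new ordering of the same device IDs
        augmented.append((shuffled, score))      # paired with the original score
    return augmented

# Example: one record becomes 26 training sequences with identical scores.
print(len(augment_selection([3, 17, 42, 58], score=0.91)))
```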

Encoder $\phi$.

The encoder maps any given selection $\mathbf{s}=\{s_{1},s_{2},\dots,s_{T}\}$ to a continuous representation $E_{\mathbf{s}}=\phi(\mathbf{s})$ that serves as the input context. Here, we adopt a single-layer long short-term memory (LSTM) network Hochreiter and Schmidhuber (1997) as the encoder $\phi$. Specifically, we input the device IDs from $\mathbf{s}$ sequentially into the encoder and collect their corresponding output embeddings to form $E_{\mathbf{s}}=[h_{1},h_{2},\dots,h_{T}]\in\mathbb{R}^{T\times d}$, where $T$ is the length of the selection $\mathbf{s}$ and $d$ is the embedding dimension.
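A minimal PyTorch sketch consistent with this description is given below (single-layer LSTM; the 32-dimensional token embeddings and 64-dimensional hidden states follow Section 5.1, while the class and argument names, and the extra <EOS> token in the vocabulary, are assumptions).

```python
import torch
import torch.nn as nn

class SelectionEncoder(nn.Module):
    """Single-layer LSTM encoder mapping a device-ID sequence s to E_s in R^{T x d}."""

    def __init__(self, num_tokens: int, embed_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        self.embedding = nn.Embedding(num_tokens, embed_dim)   # one token per device ID
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)

    def forward(self, device_ids: torch.Tensor) -> torch.Tensor:
        # device_ids: (batch, T) integer IDs of a client selection
        tokens = self.embedding(device_ids)       # (batch, T, embed_dim)
        states, _ = self.lstm(tokens)             # (batch, T, hidden_dim) = [h_1, ..., h_T]
        return states

# Example: encode one selection of 10 device IDs drawn from a pool of 100 devices.
enc = SelectionEncoder(num_tokens=101)            # 100 devices plus an assumed <EOS> token
E_s = enc(torch.randint(0, 100, (1, 10)))         # shape (1, 10, 64)
```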

Decoder $\psi$.

The decoder generates the device IDs of a client selection $\mathbf{s}$ based on $E_{\mathbf{s}}$, denoted by $\mathbf{s}=\psi(E_{\mathbf{s}})$. Similar to the encoder, we employ a single-layer LSTM to implement the decoder and train it in an autoregressive way Sutskever et al. (2014); Brown et al. (2020). Firstly, the initial input of the decoder is the last device ID embedding in $E_{\mathbf{s}}$ (i.e., $h_{T}$). Then, at step $i$ of the decoding process, we obtain the current decoder hidden state $\hat{h}_{i}$ from the LSTM, utilize dot-product attention Luong et al. (2015) to aggregate the input context $E_{\mathbf{s}}$ from the encoder, and finally obtain an enhanced input embedding $\tilde{h}_{i}$ for this step, defined as:

$$\tilde{h}_{i}=\sum_{h_{j}\in E_{\mathbf{s}}}a_{ij}h_{j},\quad\text{where }a_{ij}=\frac{\exp\left(\hat{h}_{i}\cdot h_{j}\right)}{\sum_{h_{k}\in E_{\mathbf{s}}}\exp\left(\hat{h}_{i}\cdot h_{k}\right)},\qquad(3)$$

where $a_{ij}$ is the attention weight between $\hat{h}_{i}$ and $h_{j}$. We then concatenate $\hat{h}_{i}$ and $\tilde{h}_{i}$ and feed them into a fully connected layer followed by a softmax layer to produce the predictive distribution of step $i$, formulated as:

$$P_{\psi}\left(s_{i}\mid\mathbf{s}_{<i},E_{\mathbf{s}}\right)=\frac{\exp\left(W_{s_{i}}\left[\hat{h}_{i};\tilde{h}_{i}\right]\right)}{\sum_{c\in\mathcal{C}}\exp\left(W_{c}\left[\hat{h}_{i};\tilde{h}_{i}\right]\right)},\qquad(4)$$

where $s_{i}\in\mathbf{s}$ is the $i$-th ID in $\mathbf{s}$, $\mathbf{s}_{<i}$ represents the predictions of the previous steps (or the initial input), $\mathcal{C}$ is the candidate client device pool set (i.e., the token set of the seq2seq model), and $W$ stands for the parameters of the fully connected layer. By multiplying the probabilities of each step, we can derive the distribution of the whole client selection $\mathbf{s}$ as follows:

$$P_{\psi}(\mathbf{s}\mid E_{\mathbf{s}})=\prod_{i=1}^{T}P_{\psi}\left(s_{i}\mid\mathbf{s}_{<i},E_{\mathbf{s}}\right).\qquad(5)$$

In order to make the generated sequence similar to the true sequence, we minimize the negative log-likelihood of the distribution, which is defined as:

$$\mathcal{L}_{seq2seq}=-\log P_{\psi}(\mathbf{s}\mid E_{\mathbf{s}})=-\sum_{i=1}^{T}\log P_{\psi}\left(s_{i}\mid\mathbf{s}_{<i},E_{\mathbf{s}}\right).\qquad(6)$$
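Below is a minimal PyTorch sketch of the attention decoder and the seq2seq loss of Equations 3-6. It is an illustration under simplifying assumptions: training uses teacher forcing on embedded target IDs, and the paper's choice of $h_{T}$ as the initial decoder input is not reproduced; the class and function names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectionDecoder(nn.Module):
    """Single-layer LSTM decoder with dot-product attention over E_s (Equations 3-4)."""

    def __init__(self, num_tokens: int, embed_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        self.embedding = nn.Embedding(num_tokens, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, num_tokens)   # W applied to [h_hat_i; h_tilde_i]

    def forward(self, target_ids: torch.Tensor, enc_states: torch.Tensor) -> torch.Tensor:
        # target_ids: (batch, T) teacher-forced inputs; enc_states: (batch, T_enc, hidden) = E_s
        dec_states, _ = self.lstm(self.embedding(target_ids))                  # h_hat_i
        attn = torch.softmax(dec_states @ enc_states.transpose(1, 2), dim=-1)  # a_ij (Eq. 3)
        context = attn @ enc_states                                            # h_tilde_i (Eq. 3)
        return self.out(torch.cat([dec_states, context], dim=-1))              # logits of Eq. 4

def seq2seq_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the ground-truth selection (Equation 6)."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
```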

Evaluator $\omega$.

The evaluator estimates the comprehensive score $p$ of a selection $\mathbf{s}$ based on its representation $E_{\mathbf{s}}$. Specifically, we first apply a mean pooling operation over the device ID embeddings $h_{(\cdot)}$ in $E_{\mathbf{s}}$ to aggregate the information and obtain the integrated selection embedding $\bar{E}_{\mathbf{s}}\in\mathbb{R}^{d}$. We then feed $\bar{E}_{\mathbf{s}}$ into the evaluator $\omega$ (a feedforward neural network) to estimate the score, given as $\hat{p}=\omega(\bar{E}_{\mathbf{s}})$. To minimize the difference between the estimated score $\hat{p}$ and the collected real one $p$, we leverage the Mean Squared Error (MSE), defined by:

$$\mathcal{L}_{score}=\operatorname{MSE}(p,\hat{p})=(p-\hat{p})^{2}.\qquad(7)$$

Joint Training Loss $\mathcal{L}$.

We optimize the encoder $\phi$, decoder $\psi$, and evaluator $\omega$ simultaneously by integrating Equation 6 (i.e., the seq2seq loss) and Equation 7 (i.e., the score estimation loss) into the joint training loss $\mathcal{L}$:

$$\mathcal{L}=\alpha\mathcal{L}_{seq2seq}+(1-\alpha)\mathcal{L}_{score},\qquad(8)$$

where $\alpha$ is a trade-off hyperparameter that controls the contributions of $\mathcal{L}_{seq2seq}$ and $\mathcal{L}_{score}$ during training.
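A minimal sketch of the evaluator and the joint objective of Equation 8 follows (the two-layer feed-forward network with a 200-dimensional hidden layer and $\alpha=0.8$ mirror Section 5.1; class and function names are assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectionEvaluator(nn.Module):
    """Mean-pools E_s into E_bar_s and regresses the comprehensive score (Equation 7)."""

    def __init__(self, hidden_dim: int = 64, eval_hidden: int = 200):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, eval_hidden), nn.ReLU(),
                                 nn.Linear(eval_hidden, 1))

    def forward(self, enc_states: torch.Tensor) -> torch.Tensor:
        pooled = enc_states.mean(dim=1)           # E_bar_s: (batch, hidden_dim)
        return self.mlp(pooled).squeeze(-1)       # estimated score p_hat

def joint_loss(logits, target_ids, p_hat, p_true, alpha: float = 0.8):
    """Joint objective of Equation 8: alpha * L_seq2seq + (1 - alpha) * L_score."""
    l_seq = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
    l_score = F.mse_loss(p_hat, p_true)
    return alpha * l_seq + (1 - alpha) * l_score
```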

4.4 Gradient-based Selection Optimization

After obtaining the trained encoder, evaluator, and decoder, we can adopt gradient-based optimization in $\mathbb{S}$ to find the optimal client selection. Good initialization is crucial for gradient-based optimization approaches Glorot and Bengio (2010), so we first select the top-$K$ client selections ranked by their comprehensive score $p$ and use the encoder to embed these selections into continuous representations that serve as starting points for the subsequent optimization. Assuming that one starting point representation is $E_{\mathbf{s}}$, in order to obtain a selection with a better comprehensive score, we move $E_{\mathbf{s}}$ along the gradient direction induced by the evaluator $\omega$:

$$E_{\mathbf{s}}^{+}=E_{\mathbf{s}}+\eta\frac{\partial\omega(E_{\mathbf{s}})}{\partial E_{\mathbf{s}}},\qquad(9)$$

where $E_{\mathbf{s}}^{+}$ is the optimized selection representation and $\eta$ is the step size. The comprehensive score of $E_{\mathbf{s}}^{+}$ is expected to be better than that of $E_{\mathbf{s}}$ since $\omega(E_{\mathbf{s}}^{+})\geq\omega(E_{\mathbf{s}})$. For all $K$ starting points, we perform the above update several times to obtain a set of candidate selection representations $\{\tilde{E}_{\mathbf{s}_{i}}\}_{1}^{K}$. Finally, we choose the optimal selection representation $E_{\mathbf{s}^{*}}$ from $\{\tilde{E}_{\mathbf{s}_{i}}\}_{1}^{K}$ according to the candidates' estimated comprehensive scores, i.e., $E_{\mathbf{s}^{*}}=\operatorname{argmax}_{\tilde{E}_{\mathbf{s_{i}}}}\{\omega(\tilde{E}_{\mathbf{s_{i}}})\}_{1}^{K}$.
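A minimal PyTorch sketch of this latent-space ascent and candidate selection is given below; the step count, function names, and batch handling are illustrative assumptions rather than the authors' implementation.

```python
import torch

def optimize_in_latent_space(enc_state, evaluator, steps: int = 10, eta: float = 1e-3):
    """Move a selection representation E_s along the evaluator's gradient (Equation 9)."""
    e = enc_state.clone().detach().requires_grad_(True)
    for _ in range(steps):
        score = evaluator(e).sum()                 # omega(E_s); sum() handles a batch dimension
        grad, = torch.autograd.grad(score, e)      # d omega / d E_s
        e = (e + eta * grad).detach().requires_grad_(True)   # one gradient-ascent step
    return e.detach()

def pick_best(candidates, evaluator):
    """E_{s*} = argmax over the K optimized candidates of the estimated score."""
    with torch.no_grad():                          # each candidate has batch size 1 here
        scores = torch.stack([evaluator(c).squeeze() for c in candidates])
    return candidates[int(scores.argmax())]
```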

4.5 Optimal Client Selection Generation

To identify the optimal client selection $\mathbf{s}^{*}$, we feed $E_{\mathbf{s}^{*}}$ into the well-trained decoder $\psi$; this process can be denoted by $\mathbf{s}^{*}=\psi(E_{\mathbf{s}^{*}})$. Specifically, we adopt beam search Freitag and Al-Onaizan (2017) to generate the client selection, just like text generation in natural language processing Sutskever et al. (2014). Instead of setting the length of the selection in advance, we use the decoder to iteratively generate device IDs until it encounters the stop token <EOS>, which makes our selection process more adaptive.
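For illustration, a simplified beam search over device IDs might look as follows. The single-step interface `decode_step(prev_id, state, enc_states)` is an assumption wrapped around the trained decoder, and using the <EOS> token as the step-0 input stands in for the paper's use of $h_{T}$ as the initial decoder input.

```python
import heapq

def beam_search(decode_step, init_state, enc_states, eos_id: int,
                beam_width: int = 3, max_len: int = 20):
    """Beam search that emits device IDs until <EOS> is produced (simplified sketch).

    decode_step(prev_id, state, enc_states) -> (log_probs over all tokens, new_state)
    """
    beams = [(0.0, [], init_state, False)]           # (neg. cumulative log-prob, IDs, state, done)
    for _ in range(max_len):
        candidates = []
        for neg_lp, seq, state, done in beams:
            if done:
                candidates.append((neg_lp, seq, state, True))
                continue
            prev_id = seq[-1] if seq else eos_id     # assumed stand-in for the initial input h_T
            log_probs, new_state = decode_step(prev_id, state, enc_states)
            top_lp, top_ids = log_probs.topk(beam_width)
            for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((neg_lp - lp, seq + [tok], new_state, tok == eos_id))
        beams = heapq.nsmallest(beam_width, candidates, key=lambda b: b[0])
        if all(b[3] for b in beams):                 # every surviving beam has emitted <EOS>
            break
    best = min(beams, key=lambda b: b[0])
    return [tok for tok in best[1] if tok != eos_id] # the generated device IDs
```

Because decoding stops at <EOS>, the length of the generated selection adapts per round instead of being fixed in advance.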

5 Experiments

5.1 Experimental Setup

Table 1: Final test performance of all client selection methods (↑: accuracy in %, higher is better; ↓: perplexity, lower is better).

| Method  | MNIST (LeNet5) IID ↑ | MNIST Non-IID² ↑ | CIFAR10 (ResNet18) IID ↑ | CIFAR10 Non-IID² ↑ | CINIC10 (VGG16) IID ↑ | CINIC10 Non-IID² ↑ | TinyImageNet (ShuffleNet) IID ↑ | TinyImageNet Non-IID² ↑ | Shakespeare (LSTM) Non-IID² ↑ | Wikitext (GPT-2¹) Non-IID² ↓ |
|---------|----------------------|------------------|--------------------------|--------------------|-----------------------|--------------------|---------------------------------|-------------------------|-------------------------------|------------------------------|
| FedAvg  | 98.84 | 36.70 | 84.78 | 42.02 | 64.78 | 37.40 | 34.62 | 23.82 | 39.18 | 31.13 |
| FedProx | 99(35)³ | 67.67 | 85.11 | 52.40 | 64.51 | 39.59 | 35.38 | 24.72 | 40.68 | 30.07 |
| AFL     | 98.61 | 89.60 | 83.33 | 51.81 | 65.86 | 41.03 | 36.77 | 25.90 | 42.67 | 27.42 |
| TiFL    | 98.91 | 91.21 | 84.96 | 56.39 | 67.32 | 42.16 | 38.82 | 25.85 | 44.12 | 25.61 |
| Oort    | 99(49)³ | 92.47 | 85.69 | 53.51 | 67.07 | 43.57 | 40.57 | 26.62 | 45.42 | 19.12 |
| Favor   | 98.87 | 85.48 | 85.98 | 51.34 | 65.47 | 39.26 | 37.62 | 25.34 | 43.11 | 28.45 |
| FedMarl | 98.82 | 90.13 | 86.37 | 56.73 | 66.74 | 44.16 | 39.48 | 26.31 | 44.72 | 21.84 |
| FedGCS  | 99(28)³ | 95.83 | 88.42 | 61.62 | 69.09 | 46.27 | 43.35 | 29.51 | 47.65 | 12.58 |

¹ We use perplexity (PPL) to evaluate GPT-2, which reflects the text generation capability of the model.
² We set β = 0.01 for MNIST and β = 0.1 for the others to emulate Non-IID; Shakespeare and Wikitext are naturally Non-IID.
³ Early exit when the target accuracy (99%) is achieved; numbers in parentheses indicate the exiting round.

Infrastructure.

To evaluate the performance and effectiveness of FedGCS, we first build a simulator with a server/client architecture based on PyTorch Paszke et al. (2019), where distinct processes emulate the central server and the participating devices. To emulate data heterogeneity, we allocate training data with different distributions across devices. To mimic system heterogeneity, we establish an FL system comprising six device types with diverse hardware configurations, including Google Pixel 6, OnePlus 10 Pro, Redmi Note 10, Realme Q3s, NVIDIA Jetson Nano, and NVIDIA Jetson TX2. We utilize the Monsoon Power Monitor mon (2023) to track latency and energy consumption during training. Furthermore, we integrate end-user interaction traces from LiveLab Shepard et al. (2011) to emulate concurrent applications that impact training capability at runtime.

Baselines.

FedGCS is compared with three types of client selection methods, including random-based (FedAvg McMahan et al. (2017) and FedProx Li et al. (2021)), heuristic-based (AFL Goetz et al. (2019), TiFL Chai et al. (2020), and Oort Lai et al. (2021)), and learning-based (Favor Wang et al. (2020) and FedMarl Zhang et al. (2022)).

Datasets and Models.

We use typical models and datasets in the Computer Vision (CV) and Natural Language Processing (NLP) domains for evaluation, including LeNet5 LeCun et al. (1998a) on MNIST Lecun et al. (1998b), ResNet18 He et al. (2016) on CIFAR10 Krizhevsky et al. (2009), VGG16 Simonyan (2014) on CINIC10 Darlow et al. (2018), and ShuffleNet Zhang et al. (2018) on TinyImageNet Deng et al. (2009) for image classification, as well as LSTM Hochreiter and Schmidhuber (1997) on Shakespeare Shakespeare (2017) and GPT-2 Radford et al. (2019) on Wikitext Merity et al. (2016) for text generation. For the four CV datasets, a Dirichlet distribution $p_{k}\sim Dir_{N}(\beta)$ is utilized to simulate the Non-IID data distribution across devices.

Hyperparameter Settings and Reproducibility.

We run the heuristic-based method Oort and the learning-based methods Favor and FedMarl 100 times each, totaling 300 runs, to collect “selection-score” pair data for subsequent model training. For data augmentation, each selection is randomly shuffled 25 times to model set order-independence. We adopt single-layer LSTMs as the Encoder and Decoder backbones, and a two-layer feed-forward network for the Evaluator. The hidden state sizes are 64 (Encoder), 64 (Decoder), and 200 (Evaluator), with a 32-dimensional embedding for each device ID token. For FedGCS training, we set the batch size to 1024, the learning rate to 0.001, and $\alpha=0.8$, and we use the top-25 device selections as starting points for gradient-based optimization. In the FL setting, $J=100$ devices are available, with $T=10$ ($T=20$ for FedMarl) selected per round over $r=10$ rounds for GPT-2 ($r=50$ for the others). Clients perform $ep=5$ local training epochs each round. The code is available at https://github.com/zhiyuan-ning/GenerativeFL.

5.2 Overall Performance

Model Performance.

Table 1 shows the final test performance of all client selection methods. We can observe that FedGCS significantly outperforms the other baselines across all domains and tasks. Specifically, in the IID setting, FedGCS improves test accuracy by 5.56% over FedAvg and 2.20% over the SOTA value on average. Furthermore, FedGCS is particularly effective in the more challenging Non-IID setting, outperforming the baselines by considerable margins: improving test accuracy by 20.35% over FedAvg and 3.16% over the SOTA on average, and reducing perplexity (PPL, lower values correspond to stronger text generation capability) by 18.55 over FedAvg and 6.54 over the SOTA for the GPT-2 model. Overall, this experiment demonstrates the practical and robust ability of FedGCS to scale to complex workloads and application scenarios. This may be because FedGCS converts extensive discrete decision knowledge into a continuous representation space in a data-driven manner and identifies a superior client selection in that space via gradient-based optimization.

Latency and Energy Consumption.

Then, we evaluate the system efficiency of FedGCS from two perspectives: Time to Accuracy (ToA) and Energy to Accuracy (EoA). Figure 4 left shows that FedGCS effectively speeds up the training process with faster convergence and achieves superior accuracy. In addition, FedGCS also achieves significant energy savings, as shown in Figure 4 right. This can be attributed to FedGCS thoughtfully integrating considerations of both latency and energy consumption during training. In contrast, Oort emphasizes solely the training duration as its system effectiveness metric, and FedMarl takes into account the energy costs associated with communication but overlooks the intrinsic energy overhead of the training process itself, a critical oversight, especially for devices operating on battery power.

Table 2: Comparison of FedGCS and its variants.

| Scheme | Acc. (%) ↑ | Latency (min) ↓ | Energy (KJ) ↓ |
|--------|------------|-----------------|---------------|
| FedGCS$^{-c}$ | 86.23 | 18.2 | 2.12 |
| FedGCS$^{-a}$ | 85.37 | 24.82 | 2.53 |
| FedGCS | 88.42 | 15.34 | 1.70 |


5.3 Framework Analysis

Impact of Data Collection and Augmentation.

To explore the impact of the different modules of FedGCS, we develop two model variants: (1) FedGCS$^{-c}$, in which we randomly collect pair data without relying on classical client selection methods; (2) FedGCS$^{-a}$, in which we disable the data augmentation process of FedGCS. Table 2 shows the comparison results. We can find that the performance of FedGCS is much better than that of FedGCS$^{-c}$. This suggests that the quality of the collected data is critical to constructing a continuous representation space, and that a better space in turn helps to identify better client selections. Moreover, we observe that FedGCS is superior to FedGCS$^{-a}$. The key driver is that data augmentation enhances data diversity and models the order-independent property of client selections, thereby enabling more robust and effective learning for FedGCS. Consequently, these results confirm that data collection and augmentation are essential for sustaining the performance of FedGCS.

Study of the Generalization Ability of FedGCS.

Figure 5 outlines, when training the ResNet18 model on CIFAR10 (IID), the generalization ability of FedGCS in combination with different client selection methods, and the effectiveness of using multiple selection methods for data collection. Adopting a single selection method as the data collector can significantly improve results over the corresponding selection method itself, highlighting the merits of the continuous space paradigm in enhancing generalization ability. Furthermore, using diverse, multiple selection methods as data collectors (i.e., the FedGCS in Figure 5, which uses Oort, Favor, and FedMarl as data collectors simultaneously) outperforms the single-collector strategies, emphasizing the value of diversity in data collectors for comprehensive space construction.


Study of the Selection Strategy Made by FedGCS.

We analyze the selection strategy of FedGCS by comparing its selected device ratio with that of the best fixed client selection method, Oort, and the dynamic selection approach FedMarl. Figure 6 shows the results, where we can observe that the ratio of clients selected by FedGCS changes dynamically in each training round, indicating that decision making via generation is more adaptive compared to fixed client selection methods. In addition, the ratio of FedGCS is the smallest, suggesting that FedGCS selects the most reasonable number of devices even when compared to dynamic selection approaches. Overall, these results demonstrate that FedGCS achieves the best performance with the smallest device selection ratio, indicating that its selection strategy is superior to that of the other methods.

Study of the Hyperparameter Sensitivity of FedGCS.

There are two major hyperparameters: the training trade-off $\alpha$ and the top-$K$ records used as starting points for gradient-based optimization. A higher $\alpha$ makes the model concentrate more on the loss $\mathcal{L}_{seq2seq}$, and a higher $K$ makes the search milder. We vary $\alpha$ from 0.1 to 0.9 and $K$ from 5 to 50, and train FedGCS with ResNet18 on CIFAR10 (IID). The model performance is reported in Figure 7. Overall, we observe that model performance and system efficiency vary only slightly for different $K$, but better results require $K$ to be greater than 20. Further, setting $\alpha$ to 0.8 brings slightly higher model performance. These findings provide insight into how $\alpha$ and $K$ impact model performance and system efficiency and how to choose the optimal hyperparameters.


5.4 Overhead Analysis

As shown in Figure 8, we delineate the overhead analysis of the FedGCS framework and employ the Monsoon Power Monitor for the empirical measurement of the local training runtime overhead, specifically focusing on the ResNet18-CIFAR10 model with 500 local samples on an NVIDIA Jetson Nano platform. The empirical data reveal an average local training duration per round of 28.24 minutes, accompanied by an average energy expenditure of 370.5 Joules. Conversely, the deployment of the FedGCS framework on a server manifests markedly different metrics: a mere 0.68 minutes of processing time and an average energy consumption of 3.9 Joules. This stark contrast underscores the negligible runtime overhead of FedGCS when operating in a server environment. Further, the memory requirements of FedGCS are remarkably low, with a total memory footprint under 0.1 MB. This is negligible when compared to the memory size of the ResNet18 model, which is approximately 42.70 MB, and the expansive 40 GB DRAM capacity available on cloud servers equipped with NVIDIA A40 GPUs. Consequently, we assert that the runtime overhead of the FedGCS framework is not a significant factor affecting the overall efficiency and progress of the training process.

6 Conclusion

FedGCS marks a significant advancement in FL by introducing a generative framework for client selection, effectively addressing statistical and system heterogeneity and high energy consumption challenges.This framework transcends traditional methods by encapsulating decision-making processes within a continuous representation space, enabling precise and efficient identification of optimal client subsets.Our novel approach, which includes data collection, model training, gradient-based optimization, and generation of client selection, not only outperforms existing methods in terms of model performance, latency, and energy efficiency but also simplifies the process, reducing reliance on domain expertise and manual intervention.The empirical validation of FedGCS underscores its superiority and potential to revolutionize client selection strategies in FL.

Acknowledgments

This research is supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB38030300), the Natural Science Foundation of China under Grant No. 61836013, the Postdoctoral Fellowship Program of CPSF (No. GZC20232736), the China Postdoctoral Science Foundation Funded Project (No. 2023M743565), the Science and Technology Development Fund (FDCT), Macau SAR (file no. 0123/2023/RIA2, 001/2024/SKL), and the Start-up Research Grant of University of Macau (File no. SRG2021-00017-IOTSC).

References

  • Bonawitz et al. [2019]Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečnỳ, Stefano Mazzocchi, Brendan McMahan, etal.Towards federated learning at scale: System design.Proceedings of machine learning and systems, 1:374–388, 2019.
  • Brown et al. [2020]Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, JaredD Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, etal.Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chai et al. [2020]Zheng Chai, Ahsan Ali, Syed Zawad, Stacey Truex, Ali Anwar, Nathalie Baracaldo, YiZhou, Heiko Ludwig, Feng Yan, and Yue Cheng.Tifl: A tier-based federated learning system.In Proceedings of the 29th international symposium on high-performance parallel and distributed computing, pages 125–136, 2020.
  • Cho et al. [2020]YaeJee Cho, Jianyu Wang, and Gauri Joshi.Client selection in federated learning: Convergence analysis and power-of-choice selection strategies.arXiv preprint arXiv:2010.01243, 2020.
  • Darlow et al. [2018]LukeN Darlow, ElliotJ Crowley, Antreas Antoniou, and AmosJ Storkey.Cinic-10 is not imagenet or cifar-10.arXiv preprint arXiv:1810.03505, 2018.
  • Deng et al. [2009]Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and LiFei-Fei.Imagenet: A large-scale hierarchical image database.In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Dulac-Arnold et al. [2021]Gabriel Dulac-Arnold, Nir Levine, DanielJ Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester.Challenges of real-world reinforcement learning: definitions, benchmarks and analysis.Machine Learning, 110(9):2419–2468, 2021.
  • Fan et al. [2020]Wei Fan, Kunpeng Liu, Hao Liu, Pengyang Wang, Yong Ge, and Yanjie Fu.Autofs: Automated feature selection via diversity-aware interactive reinforcement learning.In 2020 IEEE International Conference on Data Mining (ICDM), pages 1008–1013. IEEE, 2020.
  • Fan et al. [2021]Wei Fan, Kunpeng Liu, Hao Liu, Yong Ge, Hui Xiong, and Yanjie Fu.Interactive reinforcement learning for feature selection with decision tree in the loop.IEEE Transactions on Knowledge and Data Engineering, 2021.
  • Freitag and Al-Onaizan [2017]Markus Freitag and Yaser Al-Onaizan.Beam search strategies for neural machine translation.In Thang Luong, Alexandra Birch, Graham Neubig, and Andrew Finch, editors, Proceedings of the First Workshop on Neural Machine Translation, pages 56–60, Vancouver, August 2017. Association for Computational Linguistics.
  • Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
  • Goetz et al. [2019] Jack Goetz, Kshitiz Malik, Duc Bui, Seungwhan Moon, Honglei Liu, and Anuj Kumar. Active federated learning. arXiv preprint arXiv:1909.12641, 2019.
  • Gruver et al. [2023] Nate Gruver, Marc Anton Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large language models are zero-shot time series forecasters. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Hsieh et al. [2020] Kevin Hsieh, Amar Phanishayee, Onur Mutlu, and Phillip Gibbons. The non-IID data quagmire of decentralized machine learning. In International Conference on Machine Learning, pages 4387–4398. PMLR, 2020.
  • Kim and Wu [2021] Young Geun Kim and Carole-Jean Wu. AutoFL: Enabling heterogeneity-aware energy efficient federated learning. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pages 183–198, 2021.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Citeseer, 2009.
  • Lai et al. [2021] Fan Lai, Xiangfeng Zhu, Harsha V. Madhyastha, and Mosharaf Chowdhury. Oort: Efficient federated learning via guided participant selection. In OSDI, pages 19–35, 2021.
  • LeCun et al. [1998a] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • LeCun et al. [1998b] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • Li et al. [2019] Li Li, Haoyi Xiong, Zhishan Guo, Jun Wang, and Cheng-Zhong Xu. SmartPC: Hierarchical pace control in real-time federated learning system. In 2019 IEEE Real-Time Systems Symposium (RTSS), pages 406–418. IEEE, 2019.
  • Li et al. [2021] Qinbin Li, Bingsheng He, and Dawn Song. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10713–10722, 2021.
  • Luong et al. [2015] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Lluís Màrquez, Chris Callison-Burch, and Jian Su, editors, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal, September 2015. Association for Computational Linguistics.
  • McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
  • Merity et al. [2016] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
  • mon [2023] Monsoon high voltage power monitor. https://www.msoon.com/online-store/High-Voltage-Power-Monitor-p90002590, 2023.
  • Ning et al. [2021] Zhiyuan Ning, Ziyue Qiao, Hao Dong, Yi Du, and Yuanchun Zhou. LightCAKE: A lightweight framework for context-aware knowledge graph embedding. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 181–193. Springer, 2021.
  • Ning et al. [2022] Zhiyuan Ning, Pengfei Wang, Pengyang Wang, Ziyue Qiao, Wei Fan, Denghui Zhang, Yi Du, and Yuanchun Zhou. Graph soft-contrastive learning via neighborhood ranking. arXiv preprint arXiv:2209.13964, 2022.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
  • Radford et al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • Shakespeare [2017] Shakespeare dataset. https://www.gutenberg.org/files/100/, 2017.
  • Shepard et al. [2011] Clayton Shepard, Ahmad Rahmati, Chad Tossell, Lin Zhong, and Phillip Kortum. LiveLab: Measuring wireless networks and smartphone users in the field. ACM SIGMETRICS Performance Evaluation Review, 38(3):15–20, 2011.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Sunehag et al. [2017] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.
  • Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27, 2014.
  • Sutton and Barto [2018] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
  • Tian et al. [2022] Chunlin Tian, Li Li, Zhan Shi, Jun Wang, and Cheng-Zhong Xu. HARMONY: Heterogeneity-aware hierarchical management for federated learning system. In 55th IEEE/ACM International Symposium on Microarchitecture, MICRO 2022, Chicago, IL, USA, October 1-5, 2022, pages 631–645. IEEE, 2022.
  • Tian et al. [2023] Chunlin Tian, Zhan Shi, and Li Li. Learn to select: Efficient cross-device federated learning via reinforcement learning. In Krystal Maughan, Rosanne Liu, and Thomas F. Burns, editors, The First Tiny Papers Track at ICLR 2023, Tiny Papers @ ICLR 2023, Kigali, Rwanda, May 5, 2023. OpenReview.net, 2023.
  • Tian et al. [2024] Chunlin Tian, Zhan Shi, Xinpeng Qin, Li Li, and Chengzhong Xu. Ranking-based client selection with imitation learning for efficient federated learning. arXiv preprint arXiv:2405.04122, 2024.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Wang et al. [2020] Hao Wang, Zakhary Kaplan, Di Niu, and Baochun Li. Optimizing federated learning on non-IID data with reinforcement learning. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pages 1698–1707. IEEE, 2020.
  • Wang et al. [2024] Dongjie Wang, Meng Xiao, Min Wu, Yuanchun Zhou, Yanjie Fu, et al. Reinforcement-enhanced autoregressive feature transformation: Gradient-steered search in continuous space for postfix expressions. Advances in Neural Information Processing Systems, 36, 2024.
  • Xiao et al. [2023] Meng Xiao, Dongjie Wang, Min Wu, Pengfei Wang, Yuanchun Zhou, and Yanjie Fu. Beyond discrete selection: Continuous embedding space optimization for generative feature selection. In 2023 IEEE International Conference on Data Mining (ICDM), pages 688–697. IEEE, 2023.
  • Zhan et al. [2020] Yufeng Zhan, Peng Li, and Song Guo. Experience-driven computational resource allocation of federated learning by deep reinforcement learning. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 234–243. IEEE, 2020.
  • Zhang et al. [2018] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
  • Zhang et al. [2022] Sai Qian Zhang, Jieyu Lin, and Qi Zhang. A multi-agent reinforcement learning approach for efficient client selection in federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 9091–9099, 2022.