Near-Optimal No-Regret Learning for Correlated Equilibria in Multi-Player General-Sum Games

Ioannis Anagnostides^† Constantinos Daskalakis^‡ Gabriele Farina^† Maxwell Fishelson^‡ Noah Golowich^‡ Tuomas Sandholm^†,§,¶,# ^† Carnegie Mellon University Computer Science Department ^‡ MIT CSAIL^§ Strategy Robot Inc.^¶ Optimized Markets Inc.^# Strategic Machine Inc. {ianagnos,gfarina,sandholm}@cs.cmu.edu, {costis,maxfish,nzg}@csail.mit.edu

(November 2021)

Abstract

Recently, Daskalakis, Fishelson, and Golowich ([Daskalakis21:Near] NeurIPS ‘21) showed that if all agents in a multi-player general-sum normal-form game employ Optimistic Multiplicative Weights Update (OMWU), the external regret of every player is $O(\operatorname{polylog}(T))$ after $T$ repetitions of the game. In this paper we extend their result from external regret to internal and swap regret, thereby establishing uncoupled learning dynamics that converge to an approximate correlated equilibrium at the rate of $\widetilde{O}\left(T^{-1}\right)$ . This substantially improves over the prior best rate of convergence for correlated equilibria of $O(T^{-3/4})$ due to Chen and Peng ([Chen20:Hedging] NeurIPS ‘20), and it is optimal up to polylogarithmic factors in $T$ .

To obtain these results, we develop new techniques for establishing higher-order smoothness for learning dynamics involving fixed point operations. Specifically, we first establish that the no-internal-regret learning dynamics of Stoltz and Lugosi ([Stoltz05:Internal] Mach Learn ‘05) are equivalently simulated by no-external-regret dynamics on a combinatorial space. This allows us to trade the computation of the stationary distribution on a polynomial-sized Markov chain for a (much more well-behaved) linear transformation on an exponential-sized set, enabling us to leverage similar techniques as [Daskalakis21:Near] to near-optimally bound the internal regret.

Moreover, we establish an $O(\operatorname{polylog}(T))$ no-swap-regret bound for the classic algorithm of Blum and Mansour ([Blum07:From] JMLR ‘07). We do so by introducing a technique based on the Cauchy Integral Formula that circumvents the more limited combinatorial arguments of [Daskalakis21:Near]. In addition to shedding clarity on the near-optimal regret guarantees of [Blum07:From], our arguments provide insights into the various ways in which the techniques by [Daskalakis21:Near] can be extended and leveraged in the analysis of more involved learning algorithms.

1 Introduction

Online learning and game theory share an intricately connected history tracing back to Robinson’s analysis of fictitious play [Robinson51:iterative], as well as Blackwell’s seminal approachability theorem [Blackwell56:analog], which served as the advent of the modern no-regret framework [Hart00:Simple, Abernethy11:Blackwell]. These connections have since led to the discovery of broad learning paradigms such as Online Mirror Descent, encompassing algorithms such as the celebrated Multiplicative Weights Update (MWU) [Littlestone94:Weighted]. Importantly, uncoupled learning dynamics overcome the often unreasonable assumption that players have perfect knowledge of the game, while they have also emerged as a central component in several recent landmark results in computational game solving [Brown17:Superhuman, Moravvcik17:DeepStack]. Moreover, another compelling feature of the no-regret framework is that it guarantees robustness even against adversarial opponents. Indeed, there are broad families of learning paradigms [Shalev-Shwartz12:Online] that accumulate $O(\sqrt{T})$ regret after $T$ iterations, a barrier which is known to be insuperable in fully adversarial environments [Cesa-Bianchi06:Prediction]. However, this begs the question: What if players do not face adversarial losses, but instead face predictable losses?

This question was first addressed by [Daskalakis11:Near]. They devised a decentralized variant of Nesterov’s excessive gap technique [Nesterov05:Smooth], enjoying a near-optimal rate of convergence of $O(\log T/T)$ to Nash equilibrium when employed by both players in a two-player zero-sum normal-form game. (For brevity we will henceforth omit the specification “normal-form” when referring to games.) At the same time, their algorithm also guarantees optimal (external) regret under worst-case losses. Subsequently, [Rakhlin13:Optimization, Rakhlin13:Online] introduced an optimistic variant of Online Mirror Descent—considerably simpler than the algorithm proposed in [Daskalakis11:Near]—achieving optimal convergence rate to Nash equilibrium, again in zero-sum games. Then [Syrgkanis15:Fast] identified a broad class of predictive learning algorithms that induce no-regret learning dynamics in multi-player general-sum games that guarantee $O(T^{1/4})$ regret if followed by each player. This line of work culminated in a recent advancement by [Daskalakis21:Near], where it was shown that, when all players employ an optimistic variant of MWU, each player incurs only $O(\operatorname{polylog}(T))$ regret. In turn, this implies that the average product distribution of play induced by optimistic MWU is an $\widetilde{O}\left(T^{-1}\right)$ -approximate¹¹1As usual, we use the notation $\widetilde{O}(\cdot)$ to suppress polylogarithmic factors of $T$ . Also note that for simplicity, and with a slight abuse of notation, in our introductory section we use the big- $O$ notation to hide game-specific parameters. coarse correlated equilibrium (CCE) after $T$ repetitions of the game.

Yet, it is well-understood that a CCE prescribes a rather weak notion of equilibrium [Gordon08:No]. An arguably more compelling solution concept²²2In general-sum multi-player games it is typical to search for solution concepts more permissive than Nash equilibria [Nash50:Equilibrium] as the latter is known to be computationally intractable under reasonable assumptions [Daskalakis09:Complexity, Chen09:Settling, Etessami07:Complexity, Rubinstein16:Settling, Babichenko17:Communication]. in multi-player general-sum games is that of correlated equilibria (CE) [Aumann74:Subjectivity]. Like CCE, it is known that CE can be computed through uncoupled learning dynamics. Thus, our paper is concerned with the following central question: The main contribution of our paper is to answer this question in the affirmative. Unlike in the case of CCE, typical no-external-regret dynamics such as MWU are known not to guarantee convergence to CE. Instead, specialized no-internal-regret or no-swap-regret algorithms have to be employed to converge to CE [Cesa-Bianchi06:Prediction]. Compared to no-external-regret dynamics, these learning dynamics are considerably more complex in that all known algorithms require the computation of the stationary distribution of a certain Markov chain at every iteration. Our main primary technical contribution is to develop techniques to overcome these additional challenges.

1.1 Contributions

Our work presents a refined analysis of the no-internal-regret algorithm of [Stoltz05:Internal], as well as the no-swap-regret algorithm of [Blum07:From], both instantiated with Optimistic Multiplicative Weights Update (OMWU). Going forward, we will refer to those learning dynamics as SL-OMWU and BM-OMWU, respectively. Our primary contribution is to show that both of these algorithms exhibit a near-optimal convergence rate of $\widetilde{O}(T^{-1})$ , settling Section 1 in the affirmative. More precisely, for SL-OMWU our main theorem is summarized as follows.

Theorem 1.1.

Consider a general-sum multi-player game with $m$ players, with each player $i\in having$ n_i $actions.Thereexistsauniversalconstant$ C ¿ 0 $suchthat,whenallplayersselectstrategiesaccordingtoalgorithm$ SL-OMWU $withstepsize$ η= 1/(C⋅m log^4 T) $,theinternalregretofeveryplayer$ i ∈ is bounded by $O\left(m\log n_{i}\log^{4}T\right)$ . As a result, the average product distribution of play is an $O\left({(m\log n\log^{4}T)}/{T}\right)$ -approximate correlated equilibrium.

This matches, up to constant factors, the rate of convergence for coarse correlated equilibria as follows by the result in [Daskalakis21:Near], and it is optimal, within the no-regret framework,³³3Finding a correlated equilibrium can be phrased as a linear programming problem, and thus $\epsilon$ -approximate correlated equilibria can be found in time $\operatorname{poly}(m,n,\log(1/\epsilon))$ , where $n=\max_{i\in\{n_{i}\}$,forsuccinctmulti-playergames~{}\cite[cite]{[\@@bibref{}{Papadimitriou08:Computing}{}{}]}.However,theprocedurefordoingso,\emph{ellipsoid against hope},cannotbephrasedasuncoupleddynamicsandisunlikelytoberunbyplayerscompetinginarepeatedgame.}$ uptopolylogarithmicfactors[Daskalakis11:Near].ThisalsosubstantiallyimprovesupontheO(T-3/4)rateofconvergenceforcorrelatedequilibriarecentlyshownby[Chen20:Hedging].Moreover,sinceswapregretonann-simplexistriviallyatmostntimeslargerthaninternalregret(e.g.,see[Blum07:From, pp. 1311 ]),Theorem 1.1directlygivesaboundintermsofswapregretaswell,statedasfollows.

Corollary 1.2.

Ifallplayersselectstrategiesaccordingtoalgorithm

SL-OMWU,theswapregretofeveryplayeri∈isboundedbyO(mnilognilog4T).

Forthepopularandmoreinvolvedalgorithm

BM-OMWU,ourmaintheoremissummarizedasfollows.

Theorem 1.3.

Considerageneral-summulti-playergamewithmplayers,witheachplayeri∈havingniactions.ThereexistsauniversalconstantC>0suchthat,whenallplayersselectstrategiesaccordingtoalgorithm

BM-OMWUwithstepsizeη=1/(C⋅mni3log4(T)),theswapregretofeveryplayeri∈isboundedbyO(mni4lognilog4(T)).Asaresult,theaverageproductdistributionofplayisanO((mn4lognlog4T)/T)-approximatecorrelatedequilibrium.

Finally,weremarkthat

SL-OMWUandBM-OMWUinstantiatedwiththelearningratesdescribedinTheorems 1.1and1.3guaranteenear-optimalswapregret(inT)whenallplayersusethesamedynamics,butmightnotagainstgeneral,adversariallosses.Toguaranteenear-optimalswapregretinboththeadversarialandthenon-adversarialregime,anadaptivechoiceoflearningratesimilartothatin[Daskalakis21:Near]canbeused(seeLABEL:corollary:adversarial).

1.2 OverviewofTechniques

Therecentworkof

[Daskalakis21:Near]identifiedhigher-order smoothnessofno-external-regretlearningdynamicsasakeypropertyforobtainingnear-optimalexternalregretbounds.Inparticular,theyshowedthatfortheno-external-regretdynamicsOMWU,thehigher-orderdifferences(LABEL:definition:fin-dif)ofthesequenceoflossvectorsdecayexponentiallyatordersuptoroughlylogT.However,establishingsuchhigher-ordersmoothnessforno-internal-andno-swap-regretlearningdynamicsisaconsiderablechallengesincetheknownalgorithmsinvolvecomputingthestationarydistributionofacertainMarkovchainateveryiteration.Ourmaintechnicalcontributionistodevelopnewtechniquestoeffectivelyaddressthischallenge.

ProofofTheorem 1.1:AnalyzingSL-OMWU.

First,weshowthatinternalregretminimizationonann-simplexcanbesimulatedbyno-external-regretdynamicsonthecombinatorialspaceofalln-node

directed trees(LABEL:theorem:equivalence).Ourequivalenceresultenablesustotradethecomputationofastationarydistributionofapolynomial-sizedMarkovchainfora(muchmorewell-behaved)lineartransformationonanexponential-sizedset.Toourknowledge,thisisthefirstno-internal-to-no-external-regretreductionthatsidestepsthecomputationofstationarydistributionsofMarkovchains,andmighthaveapplicationsbeyondthecharacterizationofhigher-ordersmoothnessofthedynamics.Basedonourequivalenceresult,wethenadaptandleveragetheknownhigher-ordersmoothnesstechniquesforno-external-regretdynamics[Daskalakis21:Near].Westressthatouranalysisiseventuallybroughtbacktoa``low-dimensional′′regretminimizer,insteadofsolelyoperatingoverthespaceofdirectedtrees;thisstepiscrucialforobtainingthelogarithmicdependenceonthenumberofactionsofeachplayerforno-internal-regretdynamics(Theorem 1.1).

TheequivalenceresultmentionedinthepreviousparagrapharisesasaconsequenceoftheclassicMarkovchaintreetheorem,whichprovidesaclosed-formcombinatorialformulaforthestationarydistributionofanergodicMarkovchain,andcruciallyreliesonthemultiplicativestructureoftheupdateruleof(O)MWU.Specifically,weprovethatthestationarydistributionsofcertainMarkovchainswhosetransitionprobabilitiesareupdatedthrough(O)MWUarethemselveslineartransformationsofiteratesproducedby(O)MWU.Ourequivalencegivesadirectwaytoargueaboutthehigher-ordersmoothnessofstationarydistributionsofMarkovchains,substantiallyextendingthefirst-ordersmoothnessobservationof

[Chen20:Hedging].Furthermore,weexpecttheequivalencetocontinuetoholdbeyondstationarydistributionsofMarkovchains,tothemoregeneralproblemofcomputingfixed pointsoflineartransformationsrequiredintheframeworkofPhi-regret[Stoltz07:Learning, Greenwald03:General, Farina21:Simple].

ProofofTheorem 1.3:AnalyzingBM-OMWU.

Thetechniqueswedescribedsofarenableustoestablishthenear-optimalinternalandswapregretboundsfor

SL-OMWU(Theorem 1.1andCorollary 1.2),aswellasthecorrespondingconvergencetocorrelatedequilibrium.However,differenttechniquesarenecessarytoestablishtheregretboundofBM-OMWU(Theorem 1.3).Atahighlevel,BM-OMWUrunsniindependentexternalregretminimizersforeachplayeri,aggregatestheoutputsintoatransitionmatrixofaMarkovchain,andthencomputesitsstationarydistribution.

Ratherthanarguingindirectlyintermsofasupplementaryexternal-regretminimizer,we

directlyanalyzethehigher-ordersmoothnessofasequenceofstationarydistributionsofMarkovchains,atthecostofultimatelyobtainingaworsedependenceonthenumberofactionsniinourswapregretbounds.Usingthemachineryof[Daskalakis21:Near](inparticular,theboundednesschainruleofLABEL:lemma:boundedness-chain-rule),doingsoboilsdowntoobtainingaboundontheTaylorseriescoefficientsofthefunctionthatmapstheentriesofanergodicmatrixQtoQ′sstationarydistribution.Takenliterally,suchaboundisnotquitepossible,sincethestationarydistributionmayhavesingularitiesaroundthenon-ergodicmatrices.However,weshowthatbyusingtheMarkovchaintreetheoremtogetherwiththemulti-dimensionalversionofCauchy′sintegralformula,itispossibletoboundtheTaylorseriesofthefunctionmappingthelogarithmsoftheentriesofQtoitsstationarydistribution(seeLABEL:lem:mct-taylor).Leveragingtheexponential-typestructureoftheOMWUupdates,wethenusethisboundtoobtainthedesiredguaranteeonthehigher-orderdifferencesofthestationarydistributions(LABEL:lem:dh-bound).

1.3 FurtherRelatedWork

No-internal-regretalgorithmsthatrequireblack-boxaccesstoa

singleno-external-regretminimizerareknownintheliterature[Stoltz05:Internal, Cesa-Bianchi06:Prediction].Thisisincontrastwiththeconstructionof[Blum07:From]forthestrongernotionofswapregret,whichrequiresnindependentno-external-regretminimizers---onepereachactionoftheplayer.Nevertheless,bothclassesofalgorithmsinvolvecomputingthestationarydistributionofacertainMarkovchainateveryiteration.Theintrinsiccomplexityassociatedwiththecomputationofastationarydistributionwasarguablythemainfactorlimitingourabilitytogiveacceleratedconvergenceguaranteesforeitherclassofalgorithms.Indeed,whilelearningdynamicsguaranteeingexternalregretboundedbyO(T1/4)havebeenknownforseveralyears[Syrgkanis15:Fast],amatchingboundforswapregretwasonlyrecentlyshownby[Chen20:Hedging].

Thesettingstudiedinourpaper(learningdynamicsforcorrelatedequilibrium)issubstantiallymorechallengingthantheproblemofgivingacceleratedlearningdynamicsforNashequilibriaintwo-playerzero-sumgames,aswelllearningdynamicsfor

smooth games[Roughgarden15:Intrinsic].Indeed,whileinoursettingtheconvergencetotheequilibriumisdrivenbythemaximuminternal(orswap)regretcumulatedbytheplayers,inthelattertwosettingsthequalitymetricisdrivenbythesumoftheexternalregrets.Asshownby[Syrgkanis15:Fast],itispossibletoguaranteeaconstant sumofexternalregretsunderabroadclassofpredictiveno-regretalgorithmswhichincludesoptimisticOMDandoptimisticFTRLunderverygeneraldistance-generatingfunctions.Thisisincontrastwiththecaseofno-regretdynamicsforCCEandCE,whereitremainsanopenquestiontogivebroadclassesofalgorithmsthatcanachievenear-optimalconvergence.

Finally,wepointoutthatoptimisticvariantsofFTRLsuchasOMWUhavebeenshowntoalsoconvergeinthe

last-iterate sense,andtheconvergenceisknowntobelinear⁴⁴4However, the convergence rate of the last-iterate may not depend polynomially in the size of the game [Wei21:Linear].[Daskalakis18:Training, Daskalakis18:The, Daskalakis19:Last, Wei21:Linear],butthisholdsonlyforrestrictedclassesofgames,suchastwo-playerzero-sumgames.

2 Preliminaries

Considerafinite

normal-formgameΓconsistingofasetofmplayers{1,2,…,m}suchthateveryplayerihasanaction spaceAi≔,forsomeni∈N.ThejointactionspacewillberepresentedwithA≔A1×⋯×Am.Foragivenactionprofilea=(a1,…,am)∈A,theloss functionΛi:A→[0,1]specifiesthelossofplayeriundertheactionprofilea;notethatthenormalizationofthelossescomeswithoutanylossofgenerality.Amixed strategyxi∈Δ(Ai)foraplayeri∈isaprobabilitydistributionover