Faster Fundamental Graph Algorithms via Learned Predictions

Justin Y. Chen
MIT
[email protected]
   Sandeep Silwal
MIT
[email protected]
   Ali Vakilian
TTIC
[email protected]
   Fred Zhang
UC Berkeley
[email protected]
Abstract

We consider the question of speeding up classic graph algorithms with machine-learned predictions. In this model, algorithms are furnished with extra advice learned from past or similar instances. Given the additional information, we aim to improve upon the traditional worst-case run-time guarantees. Our contributions are the following:

  (i) We give a faster algorithm for minimum-weight bipartite matching via learned duals, improving the recent result by Dinitz, Im, Lavastida, Moseley and Vassilvitskii (NeurIPS, 2021);

  (ii) We extend the learned dual approach to the single-source shortest path problem (with negative edge lengths), achieving an almost linear runtime given sufficiently accurate predictions, which improves upon the classic fastest algorithm due to Goldberg (SIAM J. Comput., 1995);

  (iii) We provide a general reduction-based framework for learning-based graph algorithms, leading to new algorithms for degree-constrained subgraph and minimum-cost 0-1 flow, based on reductions to bipartite matching and the shortest path problem.

Finally, we give a set of general learnability theorems, showing that the predictions required by our algorithms can be efficiently learned in a PAC fashion.

1 Introduction

There has been recent interest in moving beyond the traditional and often pessimistic worst-case analysis of algorithms by using machine-learned predictions. This paradigm of learning-augmented algorithms is inspired by the great success of machine learning (ML) and aims to utilize ML predictions to improve the performance of classic algorithms.

The extra information assumed in learning-augmented algorithms can be supplied in a variety of settings. For example, in data streams, the observation that underlying patterns in real-world datasets do not change quickly over time has led to the development of an oracle capable of predicting frequently occurring stream elements. In distribution learning, it is natural to have access to different but related distributions that can aid our learning tasks. In many other applications, the current input can be similar to past instances that might help us to avoid computing the solution from scratch. All these scenarios fall under the general umbrella of a “warm start”, which enables better initialization of the algorithms to improve their performance.

This learning-based paradigm has been successfully applied in many algorithmic domains, all sharing the underlying goal of optimizing some resource: in online algorithms, predictions are used to make better future decisions and reduce regret and competitive ratios [LV18]. In streaming algorithms and data structures, predictors have been developed to optimize space usage [KBC+18, HIKV19]. In sublinear algorithms, predictors can reduce the sample complexity of a task [EIN+21].

Despite this activity, only recently have there been works on provably improving the time complexity of algorithms under this framework. Two recent works which consider this resource include the work of [EFS+21] on the k-means clustering problem and the work of [DIL+21] on graph matching; our paper relates to the latter. In [DIL+21], the authors give a learning-based algorithm for (min-cost) bipartite matching and show that predictions provably result in faster algorithms. Our paper is motivated by three natural follow-up questions:

  (i) Can we derive learning-augmented algorithms which exploit warm starts and other auxiliary information for other important graph optimization problems besides bipartite matching?

  (ii) Do we need a tailor-made learning-augmented algorithm for every different graph optimization problem?

  (iii) Can we understand when warm starts are learnable for general problems?

1.1 Our Results

Our main contributions provide answers to the three motivating questions. We address each question in turn and describe our relevant contributions.

Can we derive learning-augmented algorithms for other classic graph optimization problems besides bipartite matching?

Towards answering this question, we first provide a more efficient learning-augmented algorithm for bipartite matching than the one in [DIL+21], in Section 3. The algorithm of [DIL+21] uses dual variables from the linear programming formulation of bipartite matching as predictions. We achieve better runtime by utilizing the interplay of this dual and another set of related dual variables, called reduced edge duals, arising from viewing bipartite matching as a max-flow problem. This result extends to b-matching as well.

Theorem 1.1 (Informal; see Theorem 3.2).

Given a weighted bipartite graph and predicted dual \hat{y}, there exists an algorithm that finds a minimum weight perfect matching in time O(m\sqrt{n}+(m+n\log n)\|y^{*}-\hat{y}\|_{0}), where y^{*} is an optimal dual solution.

This significantly improves upon the prior bound of \widetilde{O}(\min\{m\sqrt{n}\|y^{*}-\hat{y}^{\prime}\|_{1},mn\}) due to [DIL+21].

Beyond the problem of minimum-weight matching, we also use reduced edge duals, which allow us to obtain the first learning-augmented algorithm for the single-source shortest-paths problem with negative edge lengths.

Theorem 1.2 (Informal; see Theorem D.4).

Given a directed graph with negative edge weights and predicted dual \hat{y}, there exists an algorithm that finds single-source shortest paths in time O(m\min\{\|\hat{y}-y^{*}\|_{1}\cdot\|\hat{y}-y^{*}\|_{\infty},\sqrt{n}\log(\|\hat{y}-y^{*}\|_{\infty})\}), where y^{*} is an optimal dual solution.

To properly utilize these duals, we give an efficient rounding scheme which takes as input a set of predicted reduced edge duals and rounds them to a feasible instance. See Section 4 for the formal notion of feasibility and our rounding algorithm.

Do we need a ‘tailor made’ learning-augmented algorithm for every different graph optimization problem?

The prior work [DIL+21] outlined the three challenges of “feasibility, optimization, and learnability” needed to put warm start heuristics on theoretical footing. We leverage reductions to avoid addressing these challenges from scratch for each new graph problem. Specifically, we introduce a general framework of reductions that takes existing learning-augmented algorithms and applies them to new problems. Note that in the context of learning-augmented algorithms, we need reductions that efficiently convert instances of a given problem to instances of another problem which we know how to solve using predictions. Therefore, we must judiciously choose the problems and reductions to apply in this framework. Nonetheless, the benefits of our reduction framework include faster learning-augmented algorithms for shortest paths and new algorithms for other problems, such as degree-constrained subgraph and unit-capacity maximum flow.

Can we understand when warm starts are learnable for general graph problems?

Given the wide range of problems we consider, we need to understand when good hints and predictions that generalize are learnable. (Note that [DIL+21] is only concerned with learnability of duals for the specific problem of bipartite matching.) We answer this question in Section 6 by generalizing the arguments of [DIL+21] beyond bipartite matching.

1.2 Related Work

Learning-augmented graph algorithms.

The most relevant work to ours is [DIL+21]. We improve and extend their results in several ways, as discussed earlier. For the shortest-path problem, a recent work [EIX22] investigates the theory of learning-based labeling schemes for the A^{*} search heuristic, whereas a few others approach it empirically [BCS17, YTB+21, CLDS20]. Several previous papers focus on online graph problems [AGKK20, LMRX21, XM21, APT22]. The scope of our paper differs from theirs, as we study only offline problems.

Classic graph algorithms.

There is a vast body of literature addressing the graph optimization problems considered in this paper. We only mention a few works that are most relevant. Similar to [DIL+21], our matching algorithm builds upon the classic Hungarian method. There are other theoretically faster (exact) algorithms for (bipartite) minimum-cost perfect matching, including [OA92, GK97, DS12]. However, these procedures are fairly involved, and it is hard to incorporate predictions into them. Our learning-based algorithm for single-source negative-length shortest paths is inspired by [Gol95]. In the classic setting, a web of reductions among graph problems was introduced by [Gab83, Gab85, GT89].

The learning-based algorithm paradigm has been applied to a number of other problems. See Appendix A for more related works.

1.3 Organization

The remainder of the paper proceeds as follows. We set up preliminary background and notation in Section 2. In Section 3, we give an improved algorithm for learning-augmented minimum-cost bipartite matching. We then extend the approach to shortest paths in Section 4 and present our general reduction framework in Section 5. We address the question of learnability in Section 6. Finally, we provide numerical evaluations of our shortest-path algorithm via reductions in Section 7.

2 Preliminaries

Notation.

Let G=(V,E) be a graph with m edges and n vertices. We will specify its directedness in different settings. For a vector x\in\mathbb{R}^{m}, we write \|x\|_{p} to denote its pth norm, for any p\geq 0.

Minimum-Weight Bipartite Perfect Matching.

Let G=(V,E) be a bipartite graph with non-negative edge costs, and let C be the maximum cost. The objective of this problem is to find a perfect matching M with minimum total cost in G. In the minimum-weight b-matching problem, we are also given a demand vector b\in\mathbb{Z}_{+}^{V}. The goal is to match each vertex u exactly b_{u} times, with minimum cost.

Maximum Flow.

Given a directed graph G=(V,E) with capacity vector c\in\mathbb{R}_{+}^{E}, let s and t be distinct vertices of G. A feasible s-t flow is a vector f\in\mathbb{R}_{+}^{E}, with each entry representing flow along an edge, such that the sum of incoming flow along edges (v,u) equals the sum of outgoing flow along edges (u,v) for all u\in V\setminus\{s,t\}, and f_{uv}\leq c_{uv} for every edge (u,v). An s-t flow f is maximum if it maximizes the outgoing flow from s. For a feasible flow f, G_{f} denotes its residual graph. A classic procedure for finding a maximum flow is Ford-Fulkerson; see [CLRS09].
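
For concreteness, the following is a minimal Python sketch of the Ford-Fulkerson scheme referenced above, using BFS to find augmenting paths (the Edmonds-Karp variant). The graph representation and function names are our own illustration, not the implementation used in this paper.

from collections import deque, defaultdict

def max_flow(edges, s, t):
    """Edmonds-Karp: repeatedly augment along a shortest residual s-t path.
    edges: list of (u, v, capacity) for a directed graph."""
    cap = defaultdict(int)
    adj = defaultdict(set)
    for u, v, c in edges:
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)  # reverse (residual) arc, initially with zero capacity
    def bfs():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    if v == t:
                        return parent
                    queue.append(v)
        return None
    flow = 0
    while (parent := bfs()) is not None:
        path, v = [], t
        while parent[v] is not None:  # walk back from t to s
            path.append((parent[v], v))
            v = parent[v]
        delta = min(cap[e] for e in path)  # bottleneck residual capacity
        for u, v in path:
            cap[(u, v)] -= delta
            cap[(v, u)] += delta
        flow += delta
    return flow

# toy usage on a unit-capacity instance: the maximum s-t flow is 2
print(max_flow([(0, 1, 1), (0, 2, 1), (1, 3, 1), (2, 3, 1)], 0, 3))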

3 Improved Learning-Based Minimum-Weight Matching

The work of [DIL+21] on matching contains two main results: (1) given predicted duals for a minimum-weight matching problem, there is an efficient near-optimal algorithm to round the duals to feasibility, and (2) after rounding, these feasible predicted duals can be used to quickly find a solution to the primal. We provide a new approach to the second problem of using a feasible prediction to quickly solve minimum-weight matching that significantly improves upon prior work.

First, we restate the theorem from [DIL+21] which established an algorithm for using a feasible predicted dual to quickly solve minimum-weight matching.

Theorem 3.1 (Theorem 13 in [DIL+21]).

There exists an algorithm which takes as input a feasible integer dual solution \hat{y}^{\prime} and finds a minimum-weight bipartite perfect matching in \widetilde{O}(\min\{m\sqrt{n}\|y^{*}-\hat{y}^{\prime}\|_{1},mn\}) time, where y^{*} is an optimal dual solution.

With small modifications to the algorithm and an improved analysis, we present the following improved time complexity.

Theorem 3.2 (Faster Matching from Predicted Duals).

There exists an algorithm which takes as input a feasible integer dual solution \hat{y}^{\prime} and finds a minimum-weight bipartite perfect matching in O(m\sqrt{n}+(m+n\log n)\|y^{*}-\hat{y}^{\prime}\|_{0}) time, where y^{*} is an optimal dual solution.

If the feasible dual is within O(\sqrt{n}) of an optimal dual in \ell_{1} distance (which is the case in which the algorithm from [DIL+21] attains an improved runtime over the classical algorithm), our algorithm improves upon the time complexity by a factor of

\sqrt{n}\left(\frac{\|y^{*}-\hat{y}^{\prime}\|_{1}}{\|y^{*}-\hat{y}^{\prime}\|_{0}}\right).

Note that \|y^{*}(c)-\hat{y}^{\prime}(c)\|_{0}\leq\|y^{*}(c)-\hat{y}^{\prime}(c)\|_{1} as we are considering only integral duals.

While the algorithm from [DIL+21] improves upon the classic Hungarian algorithm only when \|y^{*}(c)-\hat{y}^{\prime}(c)\|_{1}=o(\sqrt{n}), our algorithm improves upon the Hungarian algorithm as long as \|y^{*}(c)-\hat{y}^{\prime}(c)\|_{0}=o(n), a much milder condition on the predictions.

As a corollary, when combined with the linear-time rounding procedure from [DIL+21], this algorithm gives a fast framework for taking a predicted (possibly infeasible) dual and using it to speed up minimum-weight matching.

Corollary 3.3.

There exists an algorithm which takes as input a (possibly infeasible) integral dual solution \hat{y}, produces a feasible dual \hat{y}^{\prime} s.t. \|\hat{y}^{\prime}-y^{*}\|_{1}\leq 3\|\hat{y}-y^{*}\|_{1}, and finds a minimum-weight bipartite perfect matching in O(m\sqrt{n}+(m+n\log n)\|y^{*}-\hat{y}^{\prime}\|_{0}) time, where y^{*} is an optimal dual solution.

Our algorithm is given in Algorithm 1. The main difference from prior work in the algorithm and analysis is that [DIL+21] essentially consider running an O(m\sqrt{n}) matching algorithm at each step and then reason that the dual variables increase by at least one on each call to the algorithm, yielding the \ell_{1} dependence on the error.

Our improvements are based on the following observation. If the predicted duals are accurate enough to get improvements over the standard Hungarian algorithm, then the first call to a maximum cardinality matching algorithm will already match many edges. We can then account for the amount of work done in subsequent iterations by the small number of edges remaining to be matched, via a flow interpretation of the matching problem.

Algorithm 1 Faster Primal-Dual Scheme for MWPM
1:procedure MWPM-PD++(G=(L\cup R,E),c,y)
2:     E^{\prime}\leftarrow\{e=(i,j)\in E\mid y_{i}+y_{j}=c_{ij}\}
3:     G^{\prime}\leftarrow(V,E^{\prime})
4:     M\leftarrow Maximum cardinality matching in G^{\prime}
5:     Give all edges in E unit capacity and direct them from left to right \triangleright Flow representation
6:     Add nodes s,t to G along with unit capacity, zero cost edges (s,i) for all i\in L and (j,t) for all j\in R
7:     Associate a flow f with M s.t. \forall(i,j)\in M, f_{si}=f_{ij}=f_{jt}=1 and otherwise f_{e}=0
8:     z_{i}\leftarrow-y_{i}\quad\forall i\in L
9:     z_{j}\leftarrow y_{j}\quad\forall j\in R
10:     c^{\prime}_{e}\leftarrow c_{e}+z_{i}-z_{j}\quad\forall e=(i,j)\in E\text{ s.t. }i,j\notin\{s,t\} and c^{\prime}_{e}\leftarrow 0 for all other edges
11:     while f has flow value less than n do
12:         z_{u}\leftarrow z_{u}+d(s,u)\;\forall u\in L\cup R where d(\cdot,\cdot) is shortest path distance in G_{f} w.r.t. c^{\prime} \triangleright Dijkstra
13:         c^{\prime}_{e}\leftarrow c_{e}+z_{i}-z_{j}\quad\forall e=(i,j)\in E\text{ s.t. }i,j\notin\{s,t\}
14:         E_{f}^{\prime}\leftarrow\{e\in E_{f}\mid c^{\prime}_{e}=0\}
15:         G_{f}^{\prime}\leftarrow(V,E_{f}^{\prime})
16:         g\leftarrow Maximum flow in G_{f}^{\prime} \triangleright Ford-Fulkerson
17:         Augment along g in G_{f}
18:     end while
19:     Return \{e=(i,j)\in E:i\in L,j\in R,f_{e}=1\}
20:end procedure

The formal analysis of the algorithm is somewhat technical and appears in Appendix B, where we prove Theorem 3.2.
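
As a concrete illustration of the setup phase of Algorithm 1 (steps 2-10), the following Python sketch extracts the tight-edge subgraph and computes the initial potentials and reduced costs from a feasible dual. Variable names mirror the pseudocode, but the data representation and the toy instance are ours, and the main while-loop (Dijkstra updates plus zero-reduced-cost max-flow augmentation) is omitted.

def setup_mwpm_pd(L, R, edges, c, y):
    """Steps 2-10 of Algorithm 1 (setup only).
    L, R: left/right vertices; edges: list of (i, j) with i in L, j in R;
    c[(i, j)]: edge cost; y[v]: feasible dual, i.e., y_i + y_j <= c_ij on every edge."""
    # Step 2: tight edges, on which the dual constraint holds with equality.
    E_tight = [(i, j) for (i, j) in edges if y[i] + y[j] == c[(i, j)]]
    # Steps 8-9: node potentials z for the flow view (negated on the left side).
    z = {i: -y[i] for i in L}
    z.update({j: y[j] for j in R})
    # Step 10: reduced costs c'_ij = c_ij + z_i - z_j = c_ij - y_i - y_j >= 0.
    c_reduced = {(i, j): c[(i, j)] + z[i] - z[j] for (i, j) in edges}
    return E_tight, z, c_reduced

# toy usage with a feasible dual (y_i + y_j <= c_ij everywhere)
L_nodes, R_nodes = ["l1", "l2"], ["r1", "r2"]
edges = [("l1", "r1"), ("l1", "r2"), ("l2", "r2")]
c = {("l1", "r1"): 2, ("l1", "r2"): 3, ("l2", "r2"): 1}
y = {"l1": 1, "l2": 0, "r1": 1, "r2": 1}
print(setup_mwpm_pd(L_nodes, R_nodes, edges, c, y))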

Extension to b-matching

We extend the improvements for learning-based minimum-weight perfect bipartite matching to the more general problem of minimum-weight perfect b-matching. For two sets of dual variables over the vertices y and z, we will use as a distance measure the weighted \ell_{p} error:

\|y-z\|_{b,p}=\sum_{i\in V}b_{i}|y_{i}-z_{i}|_{p}.

We will restate a theorem from [DIL+21].

Theorem 3.4 (Theorem 31 in [DIL+21]).

There exists an algorithm which takes as input a feasible integer dual solution \hat{y}^{\prime} and finds a minimum-weight perfect b-matching in O(mn\|y^{*}-\hat{y}^{\prime}\|_{b,1}) time, where y^{*} is an optimal dual solution.

Using the same algorithm (Algorithm 4 shown in Appendix C), but with an improved analysis, we show the following improved runtime.

Theorem 3.5.

There exists an algorithm which takes as input a feasible integer dual solution \hat{y}^{\prime} and finds a minimum-weight perfect b-matching in O(mn+m\|y^{*}-\hat{y}^{\prime}\|_{b,0}) time, where y^{*} is an optimal dual solution.

As before, since the duals are integral, \|y^{*}-\hat{y}^{\prime}\|_{b,0}\leq\|y^{*}-\hat{y}^{\prime}\|_{b,1}. Note that this runtime improves upon prior work by a factor of

\min\left\{n\frac{\|y^{*}-\hat{y}^{\prime}\|_{b,1}}{\|y^{*}-\hat{y}^{\prime}\|_{b,0}},\|y^{*}-\hat{y}^{\prime}\|_{b,1}\right\}.

The full details and proof are in Appendix C.

4 Fast Learning-Based Shortest Paths

In this section, we introduce reduced edge length duals and show how to round predicted duals to feasible ones efficiently. Reduced edge length duals are defined as follows.

Definition 4.1 (Reduced Edge Length Duals (RE Duals)).

Let G=(V,E), with |V|=n and |E|=m, denote a directed graph and let \ell:E\rightarrow\mathbb{Z} denote the length of the edges, which may be negative. A vector y\in\mathbb{Z}^{V} is a valid or feasible reduced edge length dual (RE Dual) if

\ell_{y}(u,v):=\ell(u,v)+y_{u}-y_{v}\geq 0

for all edges e=(u,v)\in E.

It is natural to study these duals as they appear in many fundamental combinatorial optimization problems. For example, consider the dual linear program for the shortest paths problem on the graph G (where we wish to compute the shortest path from vertex s to t). It is given by:

\max \quad y_{t}
\text{s.t.} \quad y_{v}-y_{u}\leq\ell(u,v) \quad \forall(u,v)\in E
\phantom{\text{s.t.}} \quad y_{s}=0.

Note that the constraints y_{v}-y_{u}\leq\ell(u,v) exactly encode \ell_{y}(u,v)\geq 0 in Definition 4.1. Furthermore, given a valid dual solution y to the dual linear program, one can quickly compute the shortest paths in near linear time via an application of Dijkstra’s algorithm, since all reduced edge lengths are non-negative by Definition 4.1. This works because the sum of the lengths of edges along any path (v_{1},v_{2},\cdots,v_{k}) is the same under \ell and \ell_{y} up to an additive term y_{v_{1}}-y_{v_{k}} due to telescoping. Thus, this transformation preserves the identity of shortest paths from a starting vertex. Furthermore, many shortest paths algorithms on general graphs, such as the Bellman-Ford algorithm, also implicitly calculate the dual y: in the Bellman-Ford algorithm, the dual can be constructed in linear time after the algorithm terminates.
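
To make this concrete, the following Python sketch shows the standard Johnson-style use of a feasible RE dual: one Bellman-Ford pass produces a dual y, and Dijkstra on the non-negative reduced lengths then recovers true shortest-path distances. The function names and the toy instance are ours; this illustrates how a feasible dual is consumed, not the prediction-based algorithm of this section.

import heapq

def bellman_ford_duals(vertices, edges):
    """Bellman-Ford from a virtual source attached to every vertex with zero-length
    edges (equivalently, all duals start at 0); the distances form a feasible RE dual."""
    y = {v: 0 for v in vertices}
    for _ in range(len(vertices) - 1):
        for u, v, l in edges:
            if y[u] + l < y[v]:
                y[v] = y[u] + l
    # feasibility: l_y(u, v) = l + y_u - y_v >= 0; this fails only under a negative cycle
    assert all(l + y[u] - y[v] >= 0 for u, v, l in edges)
    return y

def dijkstra_with_duals(vertices, edges, y, s):
    """Dijkstra on the reduced lengths l_y, then undo the telescoping shift."""
    adj = {v: [] for v in vertices}
    for u, v, l in edges:
        adj[u].append((v, l + y[u] - y[v]))  # non-negative by feasibility of y
    dist, heap = {s: 0}, [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    # true distance = reduced distance - y_s + y_v
    return {v: d - y[s] + y[v] for v, d in dist.items()}

V = ["s", "a", "b", "t"]
E = [("s", "a", 2), ("a", "b", -3), ("b", "t", 1), ("s", "t", 4)]
duals = bellman_ford_duals(V, E)
print(dijkstra_with_duals(V, E, duals, "s"))  # true distances from s: s=0, a=2, b=-1, t=0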

Now suppose predictions \hat{y}:V\rightarrow\mathbb{Z} for the duals y are given. The main result of this section is that there exists an efficient algorithm, Algorithm 2, which outputs a feasible \hat{y}^{\prime} according to Definition 4.1.

Theorem 4.1 (Fast Shortest-Path from Predicted Duals).

Let \hat{y}:V\rightarrow\mathbb{Z} be predicted duals and let y^{*}:V\rightarrow\mathbb{Z} be a feasible set of reduced edge length duals according to Definition 4.1 such that \|\hat{y}-y^{*}\|_{1} is minimized. Algorithm 2 returns a feasible \hat{y}^{\prime}:V\rightarrow\mathbb{Z} in time

O(m\min\{\|\hat{y}-y^{*}\|_{1}\cdot\|\hat{y}-y^{*}\|_{\infty},\sqrt{n}\log(\|\hat{y}-y^{*}\|_{\infty})\}).

If we define reduced edge lengths according to the predicted dual \hat{y}, the non-negativity constraints of some edges may become violated, i.e., \ell_{\hat{y}}(e)<0. The goal of Algorithm 2 is to modify some coordinates of \hat{y} to fix these negative edge weights. The algorithm uses a key subroutine of Goldberg’s shortest paths algorithm [Gol95]. It proceeds by mending negative edges by reducing the dual value of one of their endpoints. At every iteration, we greedily maximize the number of dual variables which are updated. The vertices which are updated are picked through a layering structure utilized in [Gol95]. Algorithm 2 presents the formal details.

Note that we implicitly assume the given graph G with edge lengths \ell does not have a negative weight cycle. This assumption is necessary since otherwise there exists no valid RE Dual for G: the length of a cycle under any valid dual y must be non-negative by definition, but the cost of any cycle is the same under \ell and \ell_{y} due to telescoping, which leads to a contradiction if \ell induces a negative weight cycle.
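
To spell out the telescoping step, which follows directly from Definition 4.1, for any cycle (v_{1},v_{2},\ldots,v_{k},v_{1}) and any dual y the reduced length of the cycle equals its original length:

\sum_{i=1}^{k}\ell_{y}(v_{i},v_{i+1})=\sum_{i=1}^{k}\left(\ell(v_{i},v_{i+1})+y_{v_{i}}-y_{v_{i+1}}\right)=\sum_{i=1}^{k}\ell(v_{i},v_{i+1}),\qquad\text{where }v_{k+1}:=v_{1}.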

Algorithm 2 Rounding Predictions for Reduced Edge Length Duals
1:Input: Graph G=(V,E), predicted duals \hat{y}:V\rightarrow\mathbb{Z}
2:procedure Round-RE-Duals(G,\hat{y})
3:     while there exists an edge e such that \ell_{\hat{y}}(e)<0 do
4:         G^{-}=(V,E^{-})\leftarrow subgraph of G containing the edges that have weight at most 0 under \ell_{\hat{y}}
5:         Contract all strongly connected components in G^{-} \triangleright All edges connecting vertices in the same strongly connected component are 0 [Gol95]
6:         Add a vertex x to G^{-} and connect it with zero length edges to all of V
7:         L_{i}\leftarrow\{v\in V\mid d(x,v)=-i\} \triangleright d is graph distance in G^{-} using reduced edge lengths given by \ell_{\hat{y}}
8:         i^{*}\leftarrow\operatorname*{arg\,max}_{i}|L_{i}|
9:         Lower the value of \hat{y}_{v} for all vertices v in \cup_{t\geq i^{*}}L_{t} by 1
10:     end while
11:     Return \hat{y}
12:end procedure
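
As a small illustration, the check driving the while-loop of Algorithm 2 simply scans for edges whose reduced length under the predicted dual is negative; the repair step itself (SCC contraction and the layering of [Gol95]) is omitted here. The function name and toy instance below are ours.

def violated_edges(edges, y_hat):
    """Edges whose reduced length l_yhat(u, v) = l(u, v) + y_u - y_v is negative.
    edges: list of (u, v, length); y_hat: dict of predicted duals."""
    return [(u, v, l) for (u, v, l) in edges if l + y_hat[u] - y_hat[v] < 0]

E = [("a", "b", -2), ("b", "c", 5), ("a", "c", 1)]
y_hat = {"a": 0, "b": 1, "c": 0}
print(violated_edges(E, y_hat))  # [('a', 'b', -2)], since -2 + 0 - 1 < 0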

The analysis of the algorithm and the proof of Theorem 4.1 appear in Appendix D. The theorem also implies an algorithm for all-pairs shortest paths; see Section D.2 for details.

5 A General Framework for Learning-Based Reductions

In this section, we introduce a general framework for obtaining learning-augmented algorithms via reductions. Suppose we have an oracle which provides hints or a warm start to instances of problem P_{1}. If we are instead interested in solving instances of another problem P_{2}, we can hope to transform our instance at hand to an instance of P_{1} in order to utilize the available predictions. If there exists an efficient reduction from P_{2} to P_{1}, we can indeed use this reduction to transform our instance of P_{2} to that of P_{1}, use the hints available for P_{1} to efficiently solve our new problem, and use the solution found to solve our original instance of P_{2}. This will be the basis of our framework for learning-based reductions.

Why is such a framework useful? First, hints might be easier to learn for problem P_{1}, or there may not be a natural notion of hints for instances of P_{2}. In addition, there might already exist a learning-based algorithm for P_{1} which efficiently utilizes hints. Therefore, using reductions from other problems to P_{1} would eliminate the need to create new algorithms and thereby increase the power and usability of the existing learning-based algorithms.

We formally define reductions as follows.

Definition 5.1 (Reductions).

Let P_{1} and P_{2} be two problems. We say that R:P_{2}\rightarrow P_{1} is a reduction from P_{2} to P_{1} if for any instance I\in P_{2}, R(I) maps to an instance I^{\prime} of P_{1}. Furthermore, there exists a mapping which takes a solution of I^{\prime} and converts it to a solution for I.

Note that the definition of a reduction by itself is not quite useful: by the Cook-Levin theorem, any problem in the complexity class P can be reduced to 3SAT. However, in this paper, we are interested in efficient reductions which take linear or almost linear time in the size of the input. Such reductions are extremely fast to execute in practice, and the final algorithm, which solves an instance I of P_{2} by solving an instance of P_{1} using a learned oracle, is overall faster than solving I with no hints.

5.1 General Framework

Our framework is given in Algorithm 3. Note that there, \mathcal{A} is an existing algorithm which solves instances of problem P_{1} using hints given by a predictor y:P_{1}\rightarrow\mathbb{R}^{d}, i.e., the hints are d-dimensional vectors.

Algorithm 3 General Reduction Framework for Learning-Based Algorithms
1:Input: Problem instance I\in P_{2}, predictor y:P_{1}\rightarrow\mathbb{R}^{d}, reduction R:P_{2}\rightarrow P_{1}, algorithm \mathcal{A}(P_{1},y(P_{1}))
2:procedure Reduction-Solve(P_{1},P_{2},y,R,\mathcal{A})
3:     I^{\prime}\leftarrow R(I) \triangleright Use reduction R to get an instance of P_{1}
4:     \hat{y}\leftarrow y(I^{\prime}) \triangleright Get hints for instance I^{\prime} using predictor y
5:     Execute \mathcal{A}(I^{\prime},\hat{y}) \triangleright Solve instance I^{\prime} using hints and learning-based algorithm \mathcal{A}
6:     Return solution to I using solution for I^{\prime} given by \mathcal{A}(I^{\prime},\hat{y})
7:end procedure

Note that Step 6 in Algorithm 3 depends on the problems P_{1},P_{2} and the reduction R. In some cases, some post-processing of the solution \mathcal{A}(I^{\prime},\hat{y}) could be required. In the examples we study in this paper, both this step and the reduction R are efficient. We give concrete instantiations of Algorithm 3 in Section 5.2.
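
Since Algorithm 3 is essentially function composition, a minimal Python sketch is given below; the problems, predictor, and solver passed in are placeholders for illustration only.

def reduction_solve(instance, predictor, reduce_to_p1, algo, lift):
    """Algorithm 3 as a higher-order function.
    reduce_to_p1: the reduction R from P2 to P1; predictor: returns hints for a P1 instance;
    algo: learning-based solver for P1 taking (instance, hints);
    lift: maps a P1 solution back to a solution of the original P2 instance."""
    p1_instance = reduce_to_p1(instance)        # Step 3: I' <- R(I)
    y_hat = predictor(p1_instance)              # Step 4: hints for I'
    p1_solution = algo(p1_instance, y_hat)      # Step 5: solve I' with hints
    return lift(instance, p1_solution)          # Step 6: recover a solution of I

# toy usage: "find the minimum of a list" (P2) via "find the maximum" (P1) of the negated list
print(reduction_solve(
    [3, 1, 4],
    predictor=lambda inst: max(inst),           # here the hint happens to be the answer itself
    reduce_to_p1=lambda xs: [-x for x in xs],
    algo=lambda inst, hint: hint,               # a solver that trusts a perfect hint
    lift=lambda original, sol: -sol,
))  # prints 1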

Note that we still need to understand the learnability of the hints \hat{y} in Step 4 of Algorithm 3: even if there exists an efficient algorithm \mathcal{A} for problem P_{1}, we might not have a predictor y at hand. Note that in [DIL+21], the question of learnability of predictors was tackled by assuming access to multiple instances of a particular problem class drawn from some distribution. In our case, we might have lots of training data on instances of problem P_{2}, but our goal is to train a predictor y for P_{1} in hopes of utilizing \mathcal{A}. To do so, we can just go through the reduction R to get samples of problem instances drawn from P_{1}. Note that the distribution on these problems will be different than that on P_{2}. We introduce general learnability results which imply one can learn a good predictor y. For details, see Section 6.

5.2 Reductions and Their Implications

We now present one application of the reduction framework outlined in the previous section. We demonstrate three more reductions, for degree-constrained subgraph, minimum-cost 0-1 flow, and graph diameter, in Appendix E.

Shortest Path from Matching.

We leverage the following reduction from shortest path (with negative edge lengths) to maximum-weight perfect matching on bipartite graphs, due to [Gab85]. Given a directed graph G with edge lengths given by \ell, construct a weighted bipartite graph H=(L,R,E).

  • For each vertex u\in G, make two copies u_{1}\in L and u_{2}\in R.

  • For each arc (u,v)\in G, create an edge e=(u_{1},v_{2}) of weight -\ell(e) in H.

  • Finally, create an edge (u_{1},u_{2}) of weight 0 for each vertex u\in G.
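
The construction of H can be sketched in a few lines of Python (the representation and names are ours):

def shortest_path_to_matching_instance(vertices, edges):
    """Build the bipartite graph H of [Gab85] from a directed graph G.
    vertices: vertex names of G; edges: list of (u, v, length).
    Returns weighted bipartite edges ((u, 1), (v, 2), weight), where (u, 1) stands
    for u_1 in L and (v, 2) for v_2 in R."""
    H = []
    for u, v, length in edges:
        H.append(((u, 1), (v, 2), -length))  # arc (u, v) becomes edge (u_1, v_2) of weight -l(u, v)
    for u in vertices:
        H.append(((u, 1), (u, 2), 0))        # zero-weight edge (u_1, u_2) for every vertex u
    return H

G_vertices = ["a", "b", "c"]
G_edges = [("a", "b", 4), ("b", "c", -1), ("a", "c", 2)]
print(shortest_path_to_matching_instance(G_vertices, G_edges))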

Now suppose we find the maximum weight perfect matching of H and its corresponding dual variables y_{u_{i}} for all u\in G and i\in\{1,2\}. By the construction, we immediately have:

Lemma 5.1.

The maximum weight perfect matching of H has positive weight if and only if the graph G has negative cycles.

Otherwise, we can let \pi_{u}=y_{u_{1}} for each u\in G. Then by feasibility of y, we have

-\ell(e)\leq y_{u_{1}}+y_{v_{2}}\leq\pi_{u}-\pi_{v},

for any e=(u,v)\in G. It follows that \pi is a feasible dual for the shortest path problem on G, i.e., it satisfies Definition 4.1.

Observe that the graph H contains m+n edges. By using the runtimes for faster matching from [DIL+21] as well as our runtime from Section 3, we have the following corollary of Algorithm 3.

Theorem 5.2.

Given a shortest path problem on input graph G=(V,E) with n vertices and m edges, there exists an algorithm which takes as input a predicted dual solution \hat{y} to an instance of maximum weight perfect matching derived from G, near-optimally rounds the dual to a feasible solution \hat{y}^{\prime}, and finds feasible reduced edge length duals for G in time O(m\sqrt{n}+(m+n\log n)\|y^{*}-\hat{y}^{\prime}\|_{0}).

Remark 5.1.

Recall that in Section 4 we derived an alternative runtime of

O(m\min\{\|\hat{y}-y^{*}\|_{1}\cdot\|\hat{y}-y^{*}\|_{\infty},\sqrt{n}\log(\|\hat{y}-y^{*}\|_{\infty})\})

for the problem of rounding a predicted dual \hat{y} to a feasible dual \hat{y}^{\prime} to satisfy the reduced edge property of Definition 4.1. These results are incomparable since Theorem 5.2 uses dual predictions for matchings on a transformed graph whereas Theorem 4.1 uses predictions for shortest path duals (RE duals) on the original graph.

6 General Learnability of Hints

We now present two general learnability theorems on vector-valued hints for graph optimization problems. We consider the model where the edge weights (or capacities) are drawn from a distribution, while the vertex set is fixed. The goal is to learn a hint vector in \mathbb{R}^{d} that is close to the optimal hint on average (in \ell_{1} or \ell_{\infty} norm), given i.i.d. samples of the edge weights. We require no assumption either on the edge weight distribution or on the notion of optimality of a hint. Indeed, the latter needs to depend on the particular problem. For a variety of graph problems, though, the optimal hint could be taken to be the optimal dual solution of a certain LP relaxation.

Throughout the section, we assume that the edge weights c are drawn from an unknown distribution \mathcal{D}. We search for a hint within a range \mathcal{H}\subseteq\mathbb{R}^{d}.

6.1 Learnability from Bounded Pseudo-Dimension

For a fixed problem, given edge weights c\in\mathbb{R}^{m} and graph G, let h^{*}(c) denote an optimal hint with respect to the instance (G,c). We consider the \ell_{1} and \ell_{\infty} losses:

\ell_{1}(h,c)=\|h^{*}(c)-h\|_{1}=\sum_{i=1}^{d}\left|h^{*}_{i}(c)-h_{i}\right|, (1)
\ell_{\infty}(h,c)=\|h^{*}(c)-h\|_{\infty}=\max_{i}\left|h^{*}_{i}(c)-h_{i}\right|. (2)

The goal of the algorithm is to find a hint \widehat{h}\in\mathcal{H} such that the expected loss \operatorname*{\mathbb{E}}_{c\sim\mathcal{D}}\ell(\widehat{h},c) is minimized, for \ell=\ell_{1} or \ell_{\infty}. Let h^{*}\in\operatorname*{arg\,min}_{h\in\mathcal{H}}\operatorname*{\mathbb{E}}_{c\sim\mathcal{D}}\ell(h,c).

\bm{L}_{1} Loss.

Our first result is a straightforward abstraction of the main learnability theorem of [DIL+21], for the \ell_{1} loss. In particular, we show that one can find a hint vector \widehat{h}\in\mathbb{R}^{d} that approximately minimizes the population loss \operatorname*{\mathbb{E}}_{c\sim\mathcal{D}}\ell_{1}(\widehat{h},c), under the following conditions:

Theorem 6.1 (\ell_{1}-learnability; see also Theorem 14 of [DIL+21]).

For any graph problem with optimal hint h^{*}(c)\in\mathcal{H} for c\sim\mathcal{D}, assume that

  • (bounded range) for any h\in\mathcal{H} we have h_{i}\in[-M,M] for all i, for some M; and

  • (efficient optimization) there exists a polynomial time algorithm that finds a hint vector h\in\mathcal{H} that minimizes \sum_{i=1}^{s}\|h^{*}(c_{i})-h\|_{1}, given i.i.d. samples c_{1},c_{2},...,c_{s}\sim\mathcal{D}.

Then there is a polynomial-time algorithm that given s=O\left(\left(\frac{dM}{\epsilon}\right)^{2}(d\log d+\log(1/\delta))\right) samples returns a hint h\in\mathcal{H} such that \operatorname*{\mathbb{E}}_{c\sim\mathcal{D}}\ell_{1}(h,c)\leq\operatorname*{\mathbb{E}}_{c\sim\mathcal{D}}\ell_{1}(h^{*}(c),c)+\epsilon with probability at least 1-\delta.
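
To illustrate the efficient-optimization condition for the \ell_{1} loss in a simple special case: if \mathcal{H} is the box [-M,M]^{d} (an assumption made here purely for concreteness), the empirical minimizer is just the coordinate-wise median of the sampled optimal hints, as in the Python sketch below.

import statistics

def erm_hint_l1(sample_hints):
    """Coordinate-wise median of h*(c_1), ..., h*(c_s): it minimizes
    sum_i ||h*(c_i) - h||_1 over all of R^d, and hence over any box containing the samples."""
    d = len(sample_hints[0])
    return [statistics.median(h[i] for h in sample_hints) for i in range(d)]

samples = [[0, 3, -1], [2, 3, 0], [1, 5, -2]]  # hypothetical optimal hints from three sampled instances
print(erm_hint_l1(samples))  # [1, 3, -1]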

\bm{L}_{\infty} Loss.

We now give a learnability result for the \ell_{\infty} loss.

Theorem 6.2 (\ell_{\infty}-learnability).

For any graph problem with optimal hint h^{*}(c)\in\mathcal{H} for c\sim\mathcal{D}, assume that

  • (bounded range) for any h\in\mathcal{H} we have h_{i}\in[-M,M] for all i, for some M; and

  • (efficient optimization) there exists a polynomial time algorithm that finds a hint vector h\in\mathcal{H} that minimizes \sum_{i=1}^{s}\|h^{*}(c_{i})-h\|_{\infty}, given i.i.d. samples c_{1},c_{2},...,c_{s}\sim\mathcal{D}.

Then there is a polynomial-time algorithm that given s=O\left(\left(\frac{dM}{\epsilon}\right)^{2}(d+\log(1/\delta))\right) samples returns a hint h\in\mathcal{H} such that \operatorname*{\mathbb{E}}_{c\sim\mathcal{D}}\ell_{\infty}(h,c)\leq\operatorname*{\mathbb{E}}_{c\sim\mathcal{D}}\ell_{\infty}(h^{*}(c),c)+\epsilon with probability at least 1-\delta.

The proofs of the two theorems appear in Appendix F.

6.2 Learnability from Arithmetic Complexity

We give an alternative argument for learning predictions. Informally, we show that good predictions can be learned efficiently if the loss function can be ‘computed efficiently’. This provides a more general framework that goes beyond \ell_{1} or \ell_{\infty} norm error. See Section F.3 for formal details.

7 Empirical Simulations

We demonstrate the applicability of our learning-based reductions framework with a real-world case study on foreign exchange markets. The reduction we focus on is the shortest paths to bipartite matching reduction outlined in Section 5.2. We focus our evaluation on this task since prior work in [DIL+21] has already demonstrated the empirical advantage of learning-based methods for bipartite matching.

Our graph dataset is constructed as follows. We have a weighted directed graph where nodes represent countries and all possible directed edges between all pairs are present. The weight of the directed edge from country i to country j represents the average monthly exchange rate between the currencies of the two countries, i.e., the amount of currency j we can obtain starting from one unit of currency i, as set by the foreign exchange rate market (dataset scraped from https://fxtop.com/en/historical-exchange-rates.php). We transform these weights by taking the natural logarithm and negating. The shortest path from country i to country j on the transformed graph then represents the optimal way to convert one unit of currency i to the currency of j, i.e., the set of conversions which maximizes the amount of currency j.
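
The transformation itself is a one-liner; the following Python sketch uses made-up exchange rates purely for illustration.

import math

def fx_graph(rates):
    """Directed graph with weight -ln(rate) on each ordered currency pair.
    rates: dict mapping (i, j) to the amount of currency j obtained per unit of currency i.
    A shortest path under these weights maximizes the product of rates along the path;
    a negative cycle corresponds to an arbitrage opportunity."""
    return {(i, j): -math.log(r) for (i, j), r in rates.items()}

rates = {("EUR", "USD"): 1.10, ("USD", "JPY"): 130.0, ("EUR", "JPY"): 140.0}  # made-up rates
print(fx_graph(rates))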

Experimental Setup.

We first describe our training dataset. We construct the graph described above for each month of the year 2019, where we use the average monthly exchange rates as edge weights before performing the transformation. Our testing dataset consists of similarly constructed graphs for each month of 2020 and 2021. For each graph, we construct the reduction from shortest paths to matching outlined in Section 5.2. By Lemma 5.1, the output of the maximum weight perfect matching on the bipartite graph obtained via the reduction gives us feasible reduced edge length duals, which we can subsequently use to solve shortest paths in nearly linear time. The resulting bipartite graphs have \sim 500 vertices each and \sim 5\cdot 10^{4} edges.

We use the code from [DIL+21] (available at https://github.com/tlavastida/LearnedDuals) for the maximum weight bipartite matching algorithm. As in [DIL+21], we measure the runtime in terms of the number of steps used by the Hungarian algorithm to solve the matching instances derived from our training and test graph datasets when we initialize the algorithm with predicted duals versus when we start the algorithm “from scratch.”

We instantiate predictions in two distinct ways, similar to the methodology of [DIL+21]: (a) first, we consider the batch version where we compute the optimal dual variables in the training set, take their median, and use these as the predicted dual variables for each of the graphs in the test set. (b) The second method is the online version where we use the optimal dual variables from the graph for the prior month in the test set as initialization for the current month.
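
The two prediction rules can be sketched as follows (the data layout, with one dictionary of optimal duals per month, is our own simplification):

import statistics

def batch_predictions(train_duals):
    """Batch rule: coordinate-wise median of the optimal duals over the training months."""
    vertices = train_duals[0].keys()
    return {v: statistics.median(month[v] for month in train_duals) for v in vertices}

def online_predictions(test_duals):
    """Online rule: predict each month's duals by the optimal duals of the previous month."""
    return [test_duals[t - 1] for t in range(1, len(test_duals))]

train = [{"a": 1, "b": 2}, {"a": 3, "b": 2}, {"a": 2, "b": 4}]  # toy optimal duals per training month
print(batch_predictions(train))  # {'a': 2, 'b': 2}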

Results.

Our results are shown in Figures 1 and 2. Figure 1 shows up to an order of magnitude reduction in the number of iterations taken by the learning-augmented algorithm versus the classical and widely used Hungarian algorithm. As expected, the online method performs slightly better than the batch version as it is able to offer more accurate predictions for the next graph instance. This is very intuitive: it is rare for the foreign exchange market to experience drastic shifts over the span of one month since such a shift implies a major global event.

Our results also validate the dependence on prediction error derived in our theoretical bounds. In Figure 2, we plot the excess dual objective, defined as \sum_{e}y^{*}_{e}-\sum_{e}\hat{y}_{e} where recall that y^{*} represents the optimal dual variables and \hat{y} denotes the predictions, versus the number of steps saved in the Hungarian algorithm in our batch setting; we obtained a qualitatively similar result for the online setting. We see there is a direct linear relationship between the excess dual objective, which represents the prediction error, and the decrease in runtime, measured by the number of Hungarian iterations saved. Note that we removed three outlier points from Figure 2 which represent data from October 2021 to December 2021. The outlier points showed a large excess dual as well as large savings in runtime (which can be inferred from Figure 1). We hypothesize that this is because of a distribution shift which occurred during these months in the foreign exchange markets. Indeed, examining the conversion rate from the Euro to US Dollars for example, we see a 3\% decrease in the exchange rate, which represents the biggest decrease in the time frame of our training and test dataset. This can be explained by global events such as the rise of the Omicron strain or concerns about increased inflation.

In addition to extending and complementing the experimental results of [DIL+21], we summarize our results in the following points: (a) Our theory is predictive of experimental performance. Both figures demonstrate that better predictions imply better empirical runtime. In addition, Figure 2 demonstrates a direct relationship between prediction error and runtime, as implied by our theoretical bounds. (b) The reduction framework is efficient to carry out and execute in practice. (c) Learning-augmented graph algorithms can be applied to real world datasets varying over time, such as in the analysis of graphs derived from the foreign exchange rate market.

Figure 1: Comparison of the classical Hungarian algorithm (blue) versus learning-augmented algorithms. Predictions lead to up to an order of magnitude reduction in the number of iterations.
Figure 2: Excess dual objective versus the number of saved Hungarian iterations in the batch version. There is a negative correlation between the excess dual and the # of saved iterations.

Acknowledgment

We thank Piotr Indyk for helpful comments on an early draft of the paper and Robert Tarjan for providing us with the reference [Gab83].

Justin and Sandeep are supported by an NSF Graduate Research Fellowship under Grant No. 1745302, the NSF TRIPODS program (award DMS-2022448), NSF award CCF-2006798, and a Simons Investigator Award. Justin is also supported by a MathWorks Engineering Fellowship. Ali is supported by NSF award CCF-1934843.

References

  • [ABB99] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations, volume 9. Cambridge University Press, 1999.
  • [ACI22] Anders Aamand, Justin Y. Chen, and Piotr Indyk. (Optimal) Online Bipartite Matching with Predicted Degrees. CoRR, abs/2110.11439, 2022.
  • [AEMP22] Sara Ahmadian, Hossein Esfandiari, Vahab Mirrokni, and Binghui Peng. Robust load balancing with machine learned advice. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2022.
  • [AGKK20] Antonios Antoniadis, Themis Gouleakis, Pieter Kleer, and Pavel Kolev. Secretary and online matching problems with machine learned advice. In Advanced in Neural Information Processing Systems (NeurIPS), 2020.
  • [AGKP21] Keerti Anand, Rong Ge, Amit Kumar, and Debmalya Panigrahi. A regression approach to learning-augmented online algorithms. Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.
  • [AGP20] Keerti Anand, Rong Ge, and Debmalya Panigrahi. Customizing ml predictions for online algorithms. In International Conference on Machine Learning (ICML), 2020.
  • [AKL+19] Daniel Alabi, Adam Tauman Kalai, Katrina Liggett, Cameron Musco, Christos Tzamos, and Ellen Vitercik. Learning to prune: Speeding up repeated computations. In Conference on Learning Theory (COLT), 2019.
  • [AMK21] Mohammad Ali Alomrani, Reza Moravej, and Elias B Khalil. Deep policies for online bipartite matching: A reinforcement learning approach. arXiv preprint arXiv:2109.10380, 2021.
  • [AMO93] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, 1993.
  • [AMW19] Saeed Amizadeh, Sergiy Matusevych, and Markus Weimer. Learning to solve circuit-SAT: An unsupervised differentiable approach. In International Conference on Learning Representations (ICLR), 2019.
  • [APT22] Yossi Azar, Debmalya Panigrahi, and Noam Touitou. Online graph algorithms with predictions. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM, 2022.
  • [BCK+22] Nikhil Bansal, Christian Coester, Ravi Kumar, Manish Purohit, and Erik Vee. Learning-augmented weighted paging. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM, 2022.
  • [BCS17] Mohak Bhardwaj, Sanjiban Choudhury, and Sebastian Scherer. Learning heuristic search via imitation. In Conference on Robot Learning (CoRL), 2017.
  • [BDD+21] Maria-Florina Balcan, Dan F. DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, and Ellen Vitercik. How much data is sufficient to learn high-performing algorithms? generalization guarantees for data-driven algorithm design. Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2021.
  • [BDSV18a] Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik. Learning to branch. In International Conference on Machine Learning (ICML), 2018.
  • [BDSV18b] Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik. Learning to branch. In International Conference on Machine Learning (ICML), 2018.
  • [BDV18] Maria-Florina Balcan, Travis Dick, and Ellen Vitercik. Dispersion for data-driven algorithm design, online learning, and private optimization. 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), 2018.
  • [BDW18] Maria-Florina Balcan, Travis Dick, and Colin White. Data-driven clustering via parameterized Lloyd’s families. In Advanced in Neural Information Processing Systems (NeurIPS), 2018.
  • [BLP21] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. European Journal of Operational Research, 290(2):405–421, 2021.
  • [BMS20] Etienne Bamas, Andreas Maggiori, and Ola Svensson. The primal-dual method for learning augmented algorithms. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • [BPL+16] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. CoRR, abs/1611.09940, 2016.
  • [CGP20] Edith Cohen, Ofir Geri, and Rasmus Pagh. Composable sketches for functions of frequencies: Beyond the worst case. In International Conference on Machine Learning (ICML), 2020.
  • [CGT+20] Shuchi Chawla, Evangelia Gergatsouli, Yifeng Teng, Christos Tzamos, and Ruimin Zhang. Pandora’s box with correlations: Learning and approximation. In IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), 2020.
  • [CLDS20] Binghong Chen, Chengtao Li, Hanjun Dai, and Le Song. Retro*: learning retrosynthetic planning with neural guided a* search. In International Conference on Machine Learning (ICML), 2020.
  • [CLRS09] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2009.
  • [DIL+21] Michael Dinitz, Sungjin Im, Thomas Lavastida, Benjamin Moseley, and Sergei Vassilvitskii. Faster matchings via learned duals. Advances in Neural Information Processing Systems (NeurIPS), 2021.
  • [DIRW20] Yihe Dong, Piotr Indyk, Ilya P Razenshteyn, and Tal Wagner. Learning space partitions for nearest neighbor search. International Conference on Learning Representations (ICLR), 2020.
  • [DKT+21] Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, Ali Vakilian, and Nikos Zarifis. Learning online algorithms with distributional advice. In International Conference on Machine Learning (ICML), 2021.
  • [DS12] Ran Duan and Hsin-Hao Su. A scaling algorithm for maximum weight matching in bipartite graphs. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms (SODA), 2012.
  • [DW21] Mina Dalirrooyfard and Nicole Wein. Tight conditional lower bounds for approximating diameter in directed graphs. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2021.
  • [DWM21] Elbert Du, Franklyn Wang, and Michael Mitzenmacher. Putting the “learning” into learning-augmented algorithms for frequency estimation. In International Conference on Machine Learning (ICML), 2021.
  • [EFS+21] Jon Ergun, Zhili Feng, Sandeep Silwal, David P. Woodruff, and Samson Zhou. Learning-augmented k-means clustering. CoRR, abs/2110.14094, 2021.
  • [EIN+21] Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, and Tal Wagner. Learning-based support estimation in sublinear time. In International Conference on Learning Representations (ICLR), 2021.
  • [EIX22] Talya Eden, Piotr Indyk, and Haike Xu. Embeddings and labeling schemes for A*. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022), 2022.
  • [EKM15] Hossein Esfandiari, Nitish Korula, and Vahab Mirrokni. Online allocation with traffic spikes: Mixing adversarial and stochastic models. In Proceedings of the Sixteenth ACM Conference on Economics and Computation (EC), 2015.
  • [FLV20] Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. Why are learned indexes so effective? In International Conference on Machine Learning (ICML), 2020.
  • [Gab83] Harold N Gabow. An efficient reduction technique for degree-constrained subgraph and bidirected network flow problems. In Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing (STOC), 1983.
  • [Gab85] Harold N Gabow. Scaling algorithms for network problems. Journal of Computer and System Sciences, 31(2):148–168, 1985.
  • [GK97] Andrew V Goldberg and Robert Kennedy. Global price updates help. SIAM Journal on Discrete Mathematics, 10(4):551–572, 1997.
  • [Gol95] Andrew V Goldberg. Scaling algorithms for the shortest paths problem. SIAM Journal on Computing, 24(3):494–504, 1995.
  • [GP19] Sreenivas Gollapudi and Debmalya Panigrahi. Online algorithms for rent-or-buy with expert advice. In International Conference on Machine Learning (ICML), 2019.
  • [GR17] Rishi Gupta and Tim Roughgarden. A PAC approach to application-specific algorithm selection. SIAM J. Comput., 46:992–1017, 2017.
  • [GT89] Harold N Gabow and Robert E Tarjan. Faster scaling algorithms for network problems. SIAM Journal on Computing, 18(5):1013–1036, 1989.
  • [HIKV19] Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. Learning-based frequency estimation algorithms. In International Conference on Learning Representations (ICLR), 2019.
  • [IKMQP21] Sungjin Im, Ravi Kumar, Mahshid Montazer Qaem, and Manish Purohit. Non-clairvoyant scheduling with predictions. In Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2021.
  • [ISZ21] Zachary Izzo, Sandeep Silwal, and Samson Zhou. Dimensionality reduction for wasserstein barycenter. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  • [IVY19] Piotr Indyk, Ali Vakilian, and Yang Yuan. Learning-based low-rank approximations. In Advanced in Neural Information Processing Systems (NeurIPS), 2019.
  • [JLL+20] Tanqiu Jiang, Yi Li, Honghao Lin, Yisong Ruan, and David P. Woodruff. Learning-augmented data stream algorithms. In International Conference on Learning Representations (ICLR), 2020.
  • [KBC+18] Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD), pages 489–504, 2018.
  • [KDN+17] Elias B Khalil, Bistra Dilkina, George L Nemhauser, Shabbir Ahmed, and Yufen Shao. Learning to run heuristics in tree search. In IJCAI, 2017.
  • [KDZ+17] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • [KLBS+16] Elias Khalil, Pierre Le Bodic, Le Song, George Nemhauser, and Bistra Dilkina. Learning to branch in mixed integer programming. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2016.
  • [KvHW19] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems! In International Conference on Learning Representations (ICLR), 2019.
  • [LCK18] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Combinatorial optimization with graph convolutional networks and guided tree search. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [LFKF18] Mario Lucic, Matthew Faulkner, Andreas Krause, and Dan Feldman. Training gaussian mixture models at scale via coresets. Journal of Machine Learning Research (JMLR), 18(160):1–25, 2018.
  • [LLMV20] Silvio Lattanzi, Thomas Lavastida, Benjamin Moseley, and Sergei Vassilvitskii. Online scheduling via learned weights. In Proceedings of the 31st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2020.
  • [LMRX21] Thomas Lavastida, Benjamin Moseley, R. Ravi, and Chenyang Xu. Learnable and instance-robust predictions for online matching, flows and load balancing. In 29th Annual European Symposium on Algorithms (ESA), 2021.
  • [LV18] Thodoris Lykouris and Sergei Vassilvitskii. Competitive caching with machine learned advice. In International Conference on Machine Learning (ICML). PMLR, 2018.
  • [Mit18] Michael Mitzenmacher. A model for learned bloom filters and optimizing by sandwiching. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [Mit20] Michael Mitzenmacher. Scheduling with predictions and the price of misprediction. In 11th Innovations in Theoretical Computer Science Conference (ITCS), 2020.
  • [MNS12] Mohammad Mahdian, Hamid Nazerzadeh, and Amin Saberi. Online optimization with uncertain information. ACM Transactions on Algorithms (TALG), 8(1):1–29, 2012.
  • [MRT18] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
  • [MSIB21] Nina Mazyavkina, Sergey Sviridov, Sergei Ivanov, and Evgeny Burnaev. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, page 105400, 2021.
  • [MSV+19] Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. Learning scheduling algorithms for data processing clusters. In Proceedings of the ACM Special Interest Group on Data Communication. 2019.
  • [MV20] Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions. arXiv preprint arXiv:2006.09123, 2020.
  • [NOST18] Mohammadreza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Martin Takác. Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [OA92] James B Orlin and Ravindra K Ahuja. New scaling algorithms for the assignment and minimum mean cycle problems. Mathematical programming, 54(1):41–56, 1992.
  • [PSK18] Manish Purohit, Zoya Svitkina, and Ravi Kumar. Improving online algorithms via ml predictions. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [RBL19] Jack Rae, Sergey Bartunov, and Timothy Lillicrap. Meta-learning neural bloom filters. In International Conference on Machine Learning (ICML), 2019.
  • [Roh20] Dhruv Rohatgi. Near-optimal bounds for online caching with machine learned advice. In Proceedings of the 31st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2020.
  • [SLB+19] Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L. Dill. Learning a SAT solver from single-bit supervision. In International Conference on Learning Representations (ICLR), 2019.
  • [VKKM21] Kapil Vaidya, Eric Knorr, Tim Kraska, and Michael Mitzenmacher. Partitioned learned bloom filter. In International Conference on Learning Representations (ICLR), 2021.
  • [Wei20] Alexander Wei. Better and simpler learning-augmented online caching. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020), 2020.
  • [WLKC16] Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang. Learning to hash for indexing big data - a survey. Proceedings of the IEEE, 104(1):34–57, 2016.
  • [WZ20] Alexander Wei and Fred Zhang. Optimal robustness-consistency trade-offs for learning-augmented online algorithms. Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • [XM21] Chenyang Xu and Benjamin Moseley. Learning-augmented algorithms for online steiner tree. arXiv preprint arXiv:2112.05353, 2021.
  • [YTB+21] Ryo Yonetani, Tatsunori Taniai, Mohammadamin Barekatain, Mai Nishimura, and Asako Kanezaki. Path planning using neural a* search. In International Conference on Machine Learning (ICML), 2021.
  • [ZGA+21] Hang Zhu, Varun Gupta, Satyajeet Singh Ahuja, Yuandong Tian, Ying Zhang, and Xin Jin. Network planning with deep reinforcement learning. In Proceedings of the 2021 ACM SIGCOMM 2021 Conference, pages 258–271, 2021.

Appendix A More Related work

Learning-augmented approaches have found success in a wide array of algorithmic tasks. This includes improving classic space complexity in streaming algorithms [HIKV19, IVY19, CGP20, JLL+20, EIN+21, DWM21], and achieving better competitive ratios in online algorithms [MNS12, EKM15, LV18, PSK18, GP19, Roh20, Wei20, LLMV20, BMS20, ACI22, AGP20, WZ20, DKT+21, AGKP21, BCK+22]. Other application domains include data structures [KBC+18, FLV20, Mit18, RBL19, VKKM21], similarity search [WLKC16, DIRW20], and machine scheduling [Mit20, AEMP22, LLMV20, IKMQP21].

A number of works study deep and reinforcement learning for combinatorial optimization and integer programming [BPL+16, KLBS+16, KDN+17, KDZ+17, BDSV18a, NOST18, LCK18, KvHW19, SLB+19, MSV+19, AMW19, AKL+19, ZGA+21, AMK21]. Most of the work in this direction is empirical in nature. See [MSIB21, BLP21] for two recent surveys.

There are also recent works in the related area of data-driven algorithm design, whose focus is to learn a generalizable algorithm from a family using a small number of samples rather than to study how a predictor can aid in an algorithmic task [GR17, BDD+21, BDV18, BDSV18b, BDW18, CGT+20]. See the article [MV20] for a recent survey.

Appendix B Omitted Details from Section 3

B.1 Proof of Theorem 3.2

Correctness of Algorithm 1.

We now prove the correctness of Algorithm 1. The following three claims are well known facts about the minimum-weight perfect matching problem (see, for instance [AMO93]).

Claim B.1.

There exists a flow of value nn if and only if there exists a perfect matching.

Claim B.2.

If the flow ff at the end of the algorithm is a minimum-cost flow of value nn, then the returned set of edges is a minimum-cost perfect matching.

Claim B.3.

A flow ff is a minimum-cost flow if and only if GfG_{f} contains no negative cost cycles.

Lemma B.4.

The residual graph GfG_{f} at the end of Algorithm 1 contains no cycles which include nodes ss or tt.

Proof.

At the end of the algorithm, ff will have value nn and will saturate all edges leaving ss as well as all edges entering tt. Therefore, all edges containing ss will be incoming edges and all edges containing tt will be outgoing edges. For both nodes, as all of their edges are oriented in the same direction, they cannot be included in any cycles. ∎

Lemma B.5.

From their initialization in step 10 of Algorithm 1, all reduced edge costs cc^{\prime} in the residual graph GfG_{f} are non-negative.

Proof.

Note that any edges containing ss or tt always have reduced cost 0. Starting from the costs set at step 10 of the algorithm, we will prove inductively that for all edges (i,j)(i,j) in the residual graph s.t. i,j{s,t}i,j\notin\{s,t\}, cij0c^{\prime}_{ij}\geq 0. From the feasibility of the dual variables yy, for all (i,j)E(i,j)\in E s.t. iL,jRi\in L,j\in R,

yi+yjcij.y_{i}+y_{j}\leq c_{ij}.

Therefore, for all left to right edges in the residual graph,

cij=cij+zizj=cijyiyj0.c^{\prime}_{ij}=c_{ij}+z_{i}-z_{j}=c_{ij}-y_{i}-y_{j}\geq 0.

It remains to consider edges (j,i)Ef(j,i)\in E_{f} s.t. jR,iLj\in R,i\in L. These backwards edges exist if and only if fij=1f_{ij}=1. By step 7, fij=1f_{ij}=1 only if the matching MM derived in step 4 contains (i,j)(i,j). This implies that yi+yj=cijy_{i}+y_{j}=c_{ij}. Therefore, for such edges (j,i)Ef(j,i)\in E_{f},

cji=cij+zjzi=cij+yj+yi=0.c^{\prime}_{ji}=-c_{ij}+z_{j}-z_{i}=-c_{ij}+y_{j}+y_{i}=0.

This completes the base case: all edges in the residual graph have non-negative reduced cost at step 10.

For the inductive step, assume going into the while loop that all cec^{\prime}_{e} for e=(i,j)Gfe=(i,j)\in G_{f} such that i,j{s,t}i,j\notin\{s,t\} are non-negative. Consider the new reduced costs for any such edge (i,j)(i,j) at step 13, noting that d(s,j)d(s,i)+cijd(s,j)\leq d(s,i)+c^{\prime}_{ij}. Let c′′c^{\prime\prime} denote the updated costs and cc^{\prime} denote the previous reduced costs:

cij′′=cij+zizj\displaystyle c^{\prime\prime}_{ij}=c_{ij}+z_{i}-z_{j} =cij+d(s,i)d(s,j)\displaystyle=c^{\prime}_{ij}+d(s,i)-d(s,j)
cij+d(s,i)(d(s,i)+cij)=0.\displaystyle\geq c^{\prime}_{ij}+d(s,i)-(d(s,i)+c^{\prime}_{ij})=0.

So, the new costs are also non-negative.

Now, consider the augmentation in step 17 which will change the edges in the residual graph GfG_{f}. In particular, some edges (i,j)(i,j) will disappear and be replaced by edges (j,i)(j,i) with reduced costs cji=cijc^{\prime}_{ji}=-c^{\prime}_{ij}. As we only augment along edges with reduced cost cij=0c^{\prime}_{ij}=0, the reverse edges we create in the residual graph will also have reduced cost 0. Therefore, the invariant holds that costs remain non-negative, completing the proof. ∎
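The following Python sketch illustrates the inductive step above; it is not part of Algorithm 1 itself, and the function name and data layout are ours. It assumes every vertex is reachable from ss so that all distances are finite: given shortest-path distances under the current non-negative reduced costs, shifting the potentials by those distances keeps every residual reduced cost non-negative.

```python
# A minimal sketch of the potential update in the inductive step of Lemma B.5.
def update_reduced_costs(residual_edges, cost, z, d):
    """residual_edges: iterable of (i, j) pairs in G_f; cost[(i, j)]: cost of the
    residual edge (negated for backward edges); z: current potentials;
    d: shortest-path distances from s under c'_{ij} = cost_{ij} + z_i - z_j."""
    new_z = {v: z[v] + d[v] for v in z}               # z_i <- z_i + d(s, i)
    new_reduced = {}
    for (i, j) in residual_edges:
        c_new = cost[(i, j)] + new_z[i] - new_z[j]    # = c'_{ij} + d(s, i) - d(s, j)
        # d is a shortest-path metric for c', so d(s, j) <= d(s, i) + c'_{ij},
        # hence the updated reduced cost is non-negative.
        assert c_new >= 0
        new_reduced[(i, j)] = c_new
    return new_z, new_reduced
```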

Lemma B.6.

Assume that there exists a perfect matching. If ff has flow value less than nn, then the flow gg computed in step 16 of Algorithm 1 will have value at least 1.

Proof.

First, note that the flow ff maintained by the algorithm will always have value in {0,1}\{0,1\} on each edge as the graph GG has unit capacities and we always augment by a maximum flow which will have value either 0 or 1 along each edge. If ff has value less than nn, as the max flow value is nn (due to the existence of a perfect matching), there must exist some path of flow value 1 from ss to tt in GfG_{f}. This implies that d(s,t)d(s,t) is finite. Let PP be a shortest path from ss to tt in GfG_{f} as measured by the reduced costs cc^{\prime}. Note that for any edge (i,j)P(i,j)\in P, d(s,j)=d(s,i)+cijd(s,j)=d(s,i)+c^{\prime}_{ij} (or else, PP is not a shortest path). So, after updating the reduced costs in step 13, every edge in PP must have reduced cost 0. Therefore, PEfP\subseteq E^{\prime}_{f} and the maximum flow computed in step 16 must have value at least 1. ∎

Lemma B.7.

Algorithm 1 returns a minimum-cost perfect matching, if one exists.

Proof.

Assume that a perfect matching exists. By Claim B.1 and Lemma B.6, the algorithm will terminate with a flow of value nn. By Claims B.2 and B.3, it suffices to show that there are no negative cycles in the final residual graph GfG_{f}. By Lemma B.4, any cycles in GfG_{f} must only contain edges (i,j)(i,j) where i,j{s,t}i,j\notin\{s,t\}. By Lemma B.5, all reduced costs cijc^{\prime}_{ij} of edges (i,j)Ef(i,j)\in E_{f} are non-negative. Let CEfC\subset E_{f} be any cycle in GfG_{f}. The total cost of CC is

(i,j)Ccij=(i,j)Ccijzi+zj(i,j)Czi+zj.\sum_{(i,j)\in C}c_{ij}=\sum_{(i,j)\in C}c^{\prime}_{ij}-z_{i}+z_{j}\geq\sum_{(i,j)\in C}-z_{i}+z_{j}.

As CC is a cycle, each vertex incident on CC has equally many incoming and outgoing edges, cancelling out the contributions of each zuz_{u} so that

(i,j)Czi+zj=0.\sum_{(i,j)\in C}-z_{i}+z_{j}=0.

Hence, there are no negative cycles in GfG_{f}, completing the proof of the correctness of Algorithm 1. ∎

Runtime of Algorithm 1

We now analyze the runtime of Algorithm 1.

Lemma B.8.

After the first call to maximum cardinality matching in step 4, the partial matching MM will contain at least nyy^0n-\|y^{*}-\hat{y}^{\prime}\|_{0} edges.

Proof.

Consider the minimum-weight perfect matching MM^{*} corresponding to the optimal duals yy^{*}. Note that the edges in MM^{*} are vertex disjoint since they constitute a matching. Let M′′M^{\prime\prime} be the set of edges in MM^{*} that are tight under y^\hat{y}^{\prime}:

M′′={e=ijM:y^i+y^j=ce}.M^{\prime\prime}=\{e=ij\in M^{*}:\hat{y}^{\prime}_{i}+\hat{y}^{\prime}_{j}=c_{e}\}.

As yy^{*} is tight for all edges in MM^{*}, if the predicted and optimal duals agree on both endpoints of an edge in MM^{*}, that edge is also tight under y^\hat{y}^{\prime}. Therefore, out of the nn edges in MM^{*}, at most yy^0\|y^{*}-\hat{y}^{\prime}\|_{0} of them do not have tight constraints under y^\hat{y^{\prime}}. Equivalently, |M′′|nyy^0|M^{\prime\prime}|\geq n-\|y^{*}-\hat{y}^{\prime}\|_{0}.

Consider the call to maximum cardinality matching in step 4. As MM^{*} is a valid matching and M′′MM^{\prime\prime}\subseteq M^{*}, M′′M^{\prime\prime} is a valid matching as well. In addition, all of M′′M^{\prime\prime}’s edges are contained in EE^{\prime} (the set of tight edges under y^\hat{y}^{\prime}). Therefore, M′′M^{\prime\prime} is a valid matching for step 4. So, the maximum cardinality matching returned by step 4 must have size at least |M′′|nyy^0|M^{\prime\prime}|\geq n-\|y^{*}-\hat{y}^{\prime}\|_{0}, completing the proof. ∎
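For concreteness, the warm start analyzed in Lemma B.8 can be sketched as follows. This is a minimal illustration using networkx rather than the paper's implementation, and the function and variable names are ours; it assumes integral costs and feasible integral duals so that tightness can be tested with exact equality.

```python
import networkx as nx
from networkx.algorithms import bipartite

def initial_matching(left, right, costs, y_hat):
    """left, right: vertex ids of L and R; costs: dict {(i, j): c_ij} with i in L, j in R;
    y_hat: feasible duals, i.e. y_hat[i] + y_hat[j] <= c_ij for every edge."""
    tight = [(i, j) for (i, j), c in costs.items() if y_hat[i] + y_hat[j] == c]
    H = nx.Graph()
    H.add_nodes_from(left, bipartite=0)
    H.add_nodes_from(right, bipartite=1)
    H.add_edges_from(tight)                           # the tight edges E'
    # Maximum-cardinality matching on tight edges (step 4 of Algorithm 1);
    # by Lemma B.8 it contains at least n - ||y* - y_hat||_0 edges.
    return bipartite.hopcroft_karp_matching(H, top_nodes=left)
```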

Lemma B.9.

The total number of iterations of the while loop in step 11 will be at most yy^0\|y^{*}-\hat{y}^{\prime}\|_{0}.

Proof.

By Lemma B.6, we will increase the flow value of ff by at least 1 in each iteration of the while loop. By Lemma B.8, ff enters the while loop with value at least nyy^0n-\|y^{*}-\hat{y}^{\prime}\|_{0}. Therefore, there can be at most yy^0\|y^{*}-\hat{y}^{\prime}\|_{0} iterations. ∎

Lemma B.10.

The total amount of work done by calls to Ford-Fulkerson in step 16 is O(myy^0)O(m\|y^{*}-\hat{y}^{\prime}\|_{0}).

Proof.

The runtime of Ford-Fulkerson is O(mf)O(mf) where ff is the value of the max flow. By Lemma B.8, the flow value can increase by at most yy^0\|y^{*}-\hat{y}^{\prime}\|_{0} over all calls to Ford-Fulkerson before we reach a flow of value nn. So, Ford-Fulkerson does a total of O(myy^0)O(m\|y^{*}-\hat{y}^{\prime}\|_{0}) work, as required. ∎

We are now ready to prove the main theorem.

Proof of Theorem 3.2.

By Lemma B.7, the algorithm returns a minimum-cost perfect matching, if one exists. It remains to prove the runtime. The first call to matching takes O(mn)O(m\sqrt{n}) time. Constructing the initial flow and residual graph as well as the corresponding costs takes O(m)O(m) time. By Lemma B.9, there are at most yy^0\|y^{*}-\hat{y}^{\prime}\|_{0} iterations of the while loop. In each iteration, calculating the shortest path distances can be done via Dijkstra’s algorithm in O(m+nlogn)O(m+n\log n) time as the reduced costs are always non-negative by Lemma B.5. Updating the costs and constructing the subgraph GfG^{\prime}_{f} takes O(m)O(m) time. Augmenting along the flow gg takes O(m)O(m) time. So, ignoring calls to Ford-Fulkerson, the total work done in the while loop is O((m+nlogn)yy^0)O((m+n\log n)\|y^{*}-\hat{y}^{\prime}\|_{0}). By Lemma B.10, all calls to Ford-Fulkerson take a total of O(myy^0)O(m\|y^{*}-\hat{y}^{\prime}\|_{0}) time.

So, the total runtime of the algorithm is O(mn+(m+nlogn)yy^0)O(m\sqrt{n}+(m+n\log n)\|y^{*}-\hat{y}^{\prime}\|_{0}), as required. ∎

Appendix C Improved Learning-Based Minimum-Weight bb-Matching

Algorithm 4 Primal-Dual Scheme for MWBM from [DIL+21]
1:procedure MWBM-PD(G=(V,E),c,yG=(V,E),c,y)
2:     E{ijEyi+yj=cijE^{\prime}\leftarrow\{ij\in E\mid y_{i}+y_{j}=c_{ij} } \triangleright Set of tight edges in the dual
3:     G(LR{s,t},E{siiL}{jtjR})G^{\prime}\leftarrow(L\cup R\cup\{s,t\},E^{\prime}\cup\{si\mid i\in L\}\cup\{jt\mid j\in R\}) \triangleright Network of tight edges
4:     eE(G)\forall e\in E(G^{\prime}) s.t. e=sie=si or e=ite=it, uebiu_{e}\leftarrow b_{i}
5:     ueu_{e}\leftarrow\infty for all other edges of GG^{\prime}
6:     ff\leftarrow Maximum sts-t flow in GG^{\prime} with capacities uu
7:     while Value of ff is <iLbi<\sum_{i\in L}b_{i} do
8:         Find a set SLS\subseteq L such that iSbi>jΓ(S)bj\sum_{i\in S}b_{i}>\sum_{j\in\Gamma(S)}b_{j} \triangleright Can be found in O(m+n)O(m+n) time
9:         ϵminiS,jRΓ(S){cijyiyj}\epsilon\leftarrow\min_{i\in S,j\in R\setminus\Gamma(S)}\{c_{ij}-y_{i}-y_{j}\}
10:         iS\forall i\in S, yiyi+εy_{i}\leftarrow y_{i}+\varepsilon
11:         jΓ(S)\forall j\in\Gamma(S), yjyjεy_{j}\leftarrow y_{j}-\varepsilon
12:         Update E,G,uE^{\prime},G^{\prime},u
13:         ff\leftarrow Maximum sts-t flow in GG^{\prime} with capacities uu
14:     end while
15:     xfx\leftarrow f restricted to edges of GG
16:     Return xx
17:end procedure

As a corollary to Theorem 3.5, when combined with the near-linear time rounding procedure from [DIL+21], this algorithm gives a fast framework for taking a predicted (possibly infeasible) dual and using it to speed up minimum-weight bb-matching.

Corollary C.1.

There exists an algorithm which takes as input a (possibly infeasible) integral dual solution y^\hat{y}, produces a feasible dual y^\hat{y}^{\prime} s.t. y^yb,15y^yb,1\|\hat{y}^{\prime}-y^{*}\|_{b,1}\leq 5\|\hat{y}-y^{*}\|_{b,1}, and finds a minimum-weight perfect bb-matching in O(mn+myy^b,0)O(mn+m\|y^{*}-\hat{y}^{\prime}\|_{b,0}) time, where yy^{*} is an optimal dual solution.

The runtime of O(mnyy^b,1)O(mn\|y^{*}-\hat{y}^{\prime}\|_{b,1}) from [DIL+21] is derived from the fact that each time a maximum flow is found in step 13, the flow value is increased by at least 1 (due to the integrality of the problem), and each maximum flow can be found in O(nm)O(nm) time.

To get the improved runtime in Theorem 3.5, we follow essentially the same analysis as that for minimum-weight perfect matching: we show that the first call to maximum flow pushes a significant amount of flow and then bound the rest of the work in terms of the remaining flow to be pushed.

Proof of Theorem 3.5.

The correctness of Algorithm 4 comes from prior work, so it suffices to prove the time complexity. After the first call to max flow in step 6, the flow value will be at least ibiyy^b,0\sum_{i}b_{i}-\|y^{*}-\hat{y}^{\prime}\|_{b,0} by the same argument as Lemma B.8. In particular, consider an optimal flow (corresponding to a minimum-weight bb-matching) ff^{*}. Consider the flow gg where ge=min{fe,ue}g_{e}=\min\{f^{*}_{e},u^{\prime}_{e}\} where ueu^{\prime}_{e} is the capacity of edge ee in GG^{\prime} in step 6 of the algorithm (gg is the part of ff^{*} that satisfies the capacities in GG^{\prime}). For each edge e=(u,v)e=(u,v) where yu=y^uy^{*}_{u}=\hat{y}^{\prime}_{u} and yv=y^vy^{*}_{v}=\hat{y}^{\prime}_{v}, ge=feg_{e}=f^{*}_{e}. Conversely, if there is some vertex uu where yuy^uy^{*}_{u}\neq\hat{y}^{\prime}_{u}, at worst this vertex can invalidate bub_{u} edges as uu can be incident on at most bub_{u} edges in ff^{*}. So, the value of gg is at most yy^b,0\|y^{*}-\hat{y}^{\prime}\|_{b,0} less than that of ff^{*}. As ff^{*} is optimal, it has value ibi\sum_{i}b_{i}, and the first call to max flow must push at least ibiyy^b,0\sum_{i}b_{i}-\|y^{*}-\hat{y}^{\prime}\|_{b,0} units of flow.

Subsequently, over all calls to maximum flow in step 13, the total amount of flow pushed is at most yy^b,0\|y^{*}-\hat{y}^{\prime}\|_{b,0} and the total number of iterations of the while loop is at most yy^b,0\|y^{*}-\hat{y}^{\prime}\|_{b,0}. By implementing the max flow in step 13 with the Ford-Fulkerson algorithm, the total amount of work done in the while loop will be O(myy^b,0)O(m\|y^{*}-\hat{y}^{\prime}\|_{b,0}) as Ford-Fulkerson takes linear time per unit of flow and all other work done in the while loop takes linear time per iteration. As the first call to max flow in step 6 can take time O(mn)O(mn), we get a total runtime of O(mn+myy^b,0)O(mn+m\|y^{*}-\hat{y}^{\prime}\|_{b,0}), as required. ∎
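A minimal sketch (again with networkx, with names of our choosing) of the flow computation in steps 2–6 of Algorithm 4 is given below: the source and sink edges receive capacity bib_{i}, the tight edges are left uncapacitated, and a maximum s-t flow is pushed. It assumes the labels 's' and 't' do not collide with existing vertex ids and that tightness can be tested with exact equality.

```python
import networkx as nx

def tight_network_flow(left, right, costs, b, y):
    """left, right: vertex ids of L and R; costs: dict {(i, j): c_ij}; b: degree
    demands; y: feasible duals. Returns the value and dict of a max s-t flow."""
    G = nx.DiGraph()
    for i in left:
        G.add_edge('s', i, capacity=b[i])             # u_{si} = b_i
    for j in right:
        G.add_edge(j, 't', capacity=b[j])             # u_{jt} = b_j
    for (i, j), c in costs.items():
        if y[i] + y[j] == c:                          # tight edge of the dual
            G.add_edge(i, j)                          # no capacity attribute = infinite
    value, flow = nx.maximum_flow(G, 's', 't')
    return value, flow
```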

Appendix D Omitted Details from Section 4

D.1 Proof of Main Theorem: Theorem 4.1

The goal of this section is to prove the correctness and runtime of Algorithm 2. We first need the following auxiliary lemmas, starting with an observation from [Gol95].

Lemma D.1.

The graph GG^{-} (after contracting all strongly connected components) is acyclic.

Lemma D.2.

Consider any edge ee such that y^(e)0\ell_{\hat{y}}(e)\geq 0 at any stage of Algorithm 2. Then ee will always continue to satisfy y^(e)0\ell_{\hat{y}}(e)\geq 0.

Proof.

Let e=(u,v)e=(u,v). We prove the lemma by showing that y^(e)0\ell_{\hat{y}}(e)\geq 0 continues to hold after every iteration of the while loop in step 3. The only way for y^(e)\ell_{\hat{y}}(e) to change is if one of yuy_{u} or yvy_{v} is updated in step 9 of Algorithm 2. If both u,viiLiu,v\in\cup_{i\geq i^{*}}L_{i} or both u,viiLiu,v\not\in\cup_{i\geq i^{*}}L_{i} then the edge is unchanged. Now if viiLiv\in\cup_{i\geq i^{*}}L_{i} but not uu, then y^(e)\ell_{\hat{y}}(e) increases by 1 so y^(e)0\ell_{\hat{y}}(e)\geq 0 continues to hold for this iteration. Now suppose that uiiLiu\in\cup_{i\geq i^{*}}L_{i} but not vv. In this case, if y^(e)\ell_{\hat{y}}(e) was strictly greater than zero, i.e., y^(e)1\ell_{\hat{y}}(e)\geq 1, then y^(e)\ell_{\hat{y}}(e) only decreases by 1 so y^(e)0\ell_{\hat{y}}(e)\geq 0 is maintained. Lastly we need to consider the possibility that y^(e)=0\ell_{\hat{y}}(e)=0. In this case, it must be that viiLiv\in\cup_{i\geq i^{*}}L_{i} since we can go from xx to vv via xuvx\rightarrow u\rightarrow v, which means vv is either in the same layer as uu or possibly a higher layer. Both of these scenarios were addressed previously so we are done. ∎

Lemma D.3.

At any iteration of the while loop, i2y^y1i^{*}\leq 2\|\hat{y}-y^{*}\|_{1} where the quantity y^y1\|\hat{y}-y^{*}\|_{1} denotes the initial predictor error.

Proof.

We first provide a bound for the very first iteration of the while loop on Line 3 of Algorithm 2. Note that GG^{-} is acyclic due to Lemma D.1. All paths in GG^{-} must use non-positive edges. Any negative edge e=(u,v)e=(u,v) has reduced length at least (|yuy^u|+|yvy^v|)-(|y^{*}_{u}-\hat{y}_{u}|+|y^{*}_{v}-\hat{y}_{v}|). This is because we know that (u,v)+yuyv0\ell(u,v)+y^{*}_{u}-y^{*}_{v}\geq 0. Thus, the absolute value of y^(e)\ell_{\hat{y}}(e) is at most

|y(e)y^(e)||yuy^u|+|yvy^v|.|\ell_{y^{*}}(e)-\ell_{\hat{y}}(e)|\leq|y^{*}_{u}-\hat{y}_{u}|+|y^{*}_{v}-\hat{y}_{v}|.

Now consider the max ii for which LiL_{i} exists. This means there is a path x=u1,,uk=vx=u_{1},\cdots,u_{k}=v of total length i-i. We have

i=j=1k1|y^(uj,uj+1)|j=1k1|yujy^uj|+|yuj+1y^uj+1|2yy^1i=\sum_{j=1}^{k-1}|\ell_{\hat{y}}(u_{j},u_{j+1})|\leq\sum_{j=1}^{k-1}|y^{*}_{u_{j}}-\hat{y}_{u_{j}}|+|y^{*}_{u_{j+1}}-\hat{y}_{u_{j+1}}|\leq 2\|y^{*}-\hat{y}\|_{1}

since every vertex uu can appear at most twice in the middle summation above.

We now claim that the maximum ii is always at most 2yy^12\|y^{*}-\hat{y}\|_{1}. Lemma D.2 implies that non-negative edges (under y^\ell_{\hat{y}}) always stay non-negative. Furthermore, the proof of Lemma D.2 tells us that the length of any negative edge is monotonically increasing until the edge becomes non-negative. Therefore any path from xx to vv in GG^{-} in any iteration of the while loop must have also existed in the very first iteration. It follows that the negative distances in GG^{-} are monotonically non-decreasing over the iterations, i.e., they become less negative. Therefore, the same bound on the number of layers LiL_{i} also continues to hold for all instances of GG^{-}. ∎

Proof of Theorem 4.1.

The correctness of Theorem 4.1 follows from the fact that the while loop only stops when all reduced edge lengths are non-negative. Therefore, the main challenge is to bound the number of iterations. From the standard analysis of Goldberg’s algorithm [Gol95], we know that each iteration of the while loop takes O(m)O(m) time. This is because GG^{-} is an acyclic graph and thus, finding the layers LiL_{i} and all subsequent computations can be done in O(m)O(m) time. Thus, it remains to bound the number of while loop iterations.

Now call a vertex vv touched if vLiv\in L_{i^{*}} for ii^{*} defined in step 8 of Algorithm 2. Note that for a vertex to be touched, it must exist in some layer and therefore has a negative incoming edge. We now claim that every time a vertex is touched, its most negative incoming edge increases in length by 1.

Indeed, let (u,v)(u,v) be the most negative incoming edge to vv and suppose that vLiv\in L_{i*}. Vertex uu cannot exist in layer LtL_{t} for some t>it>i^{*} since we can consider the path xuvx\rightarrow u\rightarrow v which implies vv must exist in a larger layer than uu. Thus when vv is touched, the length of the edge (u,v)(u,v) must increase by 1. From Lemma D.3, we know that layer LiL_{i*} has at least n/(2y^y1)n/(2\|\hat{y}-y^{*}\|_{1}) many vertices since there are at most 2y^y12\|\hat{y}-y^{*}\|_{1} layers which partition all nn vertices. Therefore, at least n/(2y^y1)n/(2\|\hat{y}-y^{*}\|_{1}) many vertices get touched in every iteration of the while loop. Each vertex can only get touched at most O(yy^)O(\|y^{*}-\hat{y}\|_{\infty}) times since the most negative edge length in the very beginning of Algorithm 2 has absolute value at most O(yy^)O(\|y^{*}-\hat{y}\|_{\infty}). This implies that the number of while loop iterations is at most O(y^y1yy^)O(\|\hat{y}-y^{*}\|_{1}\cdot\|y^{*}-\hat{y}\|_{\infty}). Since every while loop iteration takes O(m)O(m) time, the bound of O(my^y1yy^)O(m\|\hat{y}-y^{*}\|_{1}\cdot\|y^{*}-\hat{y}\|_{\infty}) follows. Note that we could have also used Goldberg’s algorithm after getting the reduced edge lengths from y^\ell_{\hat{y}} with no further modifications to get a running time of O(mnlog(yy^))O(m\sqrt{n}\log(\|y^{*}-\hat{y}\|_{\infty})). Therefore, running these two algorithms in parallel implies the claimed running time. ∎
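The objects manipulated in this proof can be sketched in a few lines of Python (illustrative names; networkx is used only for the SCC contraction): the reduced lengths under the predicted duals, the subgraph GG^{-} of non-positive reduced-length edges, and its condensation, which is acyclic as stated in Lemma D.1.

```python
import networkx as nx

def negative_part(edges, length, y_hat):
    """edges: directed edges (u, v); length[(u, v)]: integer length; y_hat: predicted duals.
    Reduced length: l_yhat(u, v) = l(u, v) + y_hat[u] - y_hat[v]."""
    G_minus = nx.DiGraph()
    for (u, v) in edges:
        if length[(u, v)] + y_hat[u] - y_hat[v] <= 0:  # non-positive reduced length
            G_minus.add_edge(u, v)
    dag = nx.condensation(G_minus)                     # contract SCCs; the result is a DAG
    return G_minus, dag
```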

D.2 All-Pairs Shortest Paths

We observe that Theorem 4.1 implies the following runtime for finding all pairs shortest paths on a graph.

Theorem D.4.

There exists an algorithm which takes as input predicted reduced edge duals \hat{y}:V\rightarrow\mathbb{Z} and outputs all-pairs shortest paths in O(m\min\{\|\hat{y}-y^{*}\|_{1}\cdot\|\hat{y}-y^{*}\|_{\infty},\sqrt{n}\log(\|\hat{y}-y^{*}\|_{\infty})\}+mn+n^{2}\log n) time, where y^{*}:V\rightarrow\mathbb{Z} denotes a feasible set of reduced edge length duals.

Proof.

Consider Algorithm 5. It applies Algorithm 2 to round y^\hat{y} into a feasible RE dual y^\hat{y}^{\prime}. Then we can run Dijkstra’s algorithm starting from all vertices in time O(mn+n2logn)O(mn+n^{2}\log n); a sketch of this phase follows Algorithm 5 below. The running time follows from Theorem 4.1. ∎

Algorithm 5 Learning-based Shortest Paths
1:Input: Graph G=(V,E)G=(V,E), predicted duals y^:V\hat{y}:V\rightarrow\mathbb{Z}
2:procedure Faster-Shortest-Paths(G,y^G,\hat{y})
3:     y^\hat{y}^{\prime}\leftarrowRound-RE-Duals(G,y^G,\hat{y}) \triangleright y^\hat{y}^{\prime} is a feasible RE Dual
4:     for all vVv\in V do
5:         Run Dijkstra’s algorithm starting from vv
6:     end for
7:     Return all shortest paths found from all vertices
8:end procedure
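The Dijkstra phase of Algorithm 5 follows the standard Johnson-style reweighting: each edge is replaced by its (non-negative) reduced length under the feasible dual produced by Round-RE-Duals, and the original distances are recovered by undoing the potentials. A minimal sketch using networkx (not the paper's implementation; names are ours) is given below, assuming y_round is a feasible RE dual.

```python
import networkx as nx

def all_pairs_with_feasible_dual(edges, length, y_round):
    """edges: directed edges (u, v); length[(u, v)]: integer length;
    y_round: feasible RE dual, so every reweighted edge is non-negative."""
    H = nx.DiGraph()
    for (u, v) in edges:
        w = length[(u, v)] + y_round[u] - y_round[v]   # feasibility => w >= 0
        H.add_edge(u, v, weight=w)
    dist = {}
    for u in H.nodes:
        d = nx.single_source_dijkstra_path_length(H, u, weight='weight')
        # undo the reweighting: dist_G(u, v) = d(u, v) - y_round[u] + y_round[v]
        dist[u] = {v: dv - y_round[u] + y_round[v] for v, dv in d.items()}
    return dist
```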

Appendix E Additional Reductions for Learning-Based Graph Algorithms

E.1 Degree-Constrained Subgraph from Matching

The degree constrained subgraph (DCS) problem is defined as follows. We are given an undirected multigraph G=(V,E)G=(V,E) (we will only be considering bipartite graphs) as well as a set of desired upper and lower bounds on each vertex’s degree: lidiuil_{i}\leq d_{i}\leq u_{i} for all iVi\in V. A DCS is an edge-induced subgraph of GG where the degree conditions are satisfied. A DCS is called complete if each degree achieves its upper bound: di=uid_{i}=u_{i} for all iVi\in V.

The maximum perfect DCS and maximum weight DCS problems correspond to maximum perfect matching and maximum weight matching, respectively. Note that the DCS versions of these problems generalize the matching versions by setting li=0l_{i}=0 and ui=1u_{i}=1 for all vertices. Next, we will show that DCS can also be reduced to matching following the reduction given in [Gab85].

First consider the maximum perfect DCS problem. Let G=(L,R,E)G=(L,R,E) be the corresponding multigraph with degree bounds li,uil_{i},u_{i} for iVi\in V. We will build a corresponding bipartite graph H=(L,R,E)H=(L^{\prime},R^{\prime},E^{\prime}) as follows.

  • For each vertex ii in GG, create a complete bipartite graph Kδ,dK_{\delta,d} where d=did=d_{i} is the degree of ii in GG and δ=diui\delta=d_{i}-u_{i} is how many edges need to be removed from ii to meet the upper bound. We will call the δ\delta side of Kδ,dK_{\delta,d} internal nodes and the dd side external nodes. Without loss of generality, assume iLi\in L. Then, the external side of Kδ,dK_{\delta,d} is in LL^{\prime} and the internal side is in RR^{\prime}.

  • Associate each of ii’s edges in GG with one of its external nodes in HH. Specifically, for each (i,j)E(i,j)\in E, there will be an edge between one of ii’s and one of jj’s external nodes in HH, and neither of those external nodes will be adjacent to any other external node. Note that as GG is bipartite, with these added edges, HH will still be bipartite.

  • For each of these external-external edges, give them costs in HH corresponding to their costs in GG.

First, note that a perfect matching in HH corresponds to a perfect DCS in GG. For each node ii in GG, all of its δ\delta internal nodes in HH will be matched, meaning that exactly uiu_{i} of its external nodes are matched with other external nodes. As each of these external-external edges correspond to edges in the original edgeset EE, this means that ii will have degree uiu_{i} in the subgraph induced by the external-external edges in the perfect matching, as required.

Similarly, it is easy to see that every perfect DCS in GG corresponds to a perfect matching in HH, so optimizing over perfect matchings and perfect DCSs is equivalent, completing the reduction.
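A minimal Python sketch of the gadget construction above (following [Gab85]) is given below. The node labels ('int', i, a) and ('ext', i, b) and the input format are illustrative assumptions of ours.

```python
from collections import defaultdict
import networkx as nx

def dcs_to_matching(edges, u):
    """edges: list of (i, j, cost) edges of the bipartite multigraph G;
    u[i]: degree upper bound of vertex i."""
    deg = defaultdict(int)
    for (i, j, _) in edges:
        deg[i] += 1
        deg[j] += 1
    H = nx.Graph()
    # One K_{delta, d} gadget per vertex: delta = d_i - u_i internal nodes,
    # d_i external nodes, every internal node joined to every external node (cost 0).
    for i, d in deg.items():
        for a in range(d - u[i]):
            for b in range(d):
                H.add_edge(('int', i, a), ('ext', i, b), cost=0)
    # Associate the k-th edge incident on i with its k-th external node.
    slot = defaultdict(int)
    for (i, j, c) in edges:
        a, b = slot[i], slot[j]
        slot[i] += 1
        slot[j] += 1
        H.add_edge(('ext', i, a), ('ext', j, b), cost=c)   # external-external edge
    return H
```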

Assume that GG has nn vertices and mm edges (counting copies). In HH, we will have O(m)O(m) total vertices and O(mdmax)O(m\cdot d_{max}) total edges where dmaxd_{max} is the maximum degree of any vertex in GG. Algorithm 3 then gives us the following result.

Theorem E.1.

Given a maximum weight perfect DCS problem on input graph G=(V,E)G=(V,E) with nn vertices, mm edges, and maximum degree dmaxd_{max}, there exists an algorithm which takes as input a predicted dual solution y^\hat{y} to an instance of maximum weight perfect matching derived from GG, near-optimally rounds the dual to a feasible solution y^\hat{y}^{\prime}, and solves the DCS problem in time O(m3/2dmax+(mdmax+mlogm)yy^0)O(m^{3/2}d_{max}+(md_{max}+m\log m)||y^{*}-\hat{y}^{\prime}||_{0}).

E.2 Minimum-Cost 0-1 Flow from Degree-Constrained Subgraph

The reduction bears resemblance to the reduction from shortest paths to matching and is also due to [Gab85]. We are given a directed graph GG with unit capacities and integral edge costs aija_{ij}. We want to find a minimum-cost flow of flow value vv. We will construct a bipartite multigraph H=(L,R,E)H=(L,R,E) for the DCS problem as follows.

  • For each vertex iGi\in G, make two copies i1Li_{1}\in L and i2Ri_{2}\in R.

  • Add mindegree(i)mindegree(i) copies of the edge (i1,i2)(i_{1},i_{2}) to HH, each with weight 0, where mindegree(i)mindegree(i) is the minimum of ii’s indegree and outdegree.

  • For each edge (j,k)(j,k) in GG, add an edge (j1,k2)(j_{1},k_{2}) to HH with weight ajk-a_{jk}.

  • Set the degree constraints ui1=ui2=mindegree(i)u_{i_{1}}=u_{i_{2}}=mindegree(i) for all is,ti\neq s,t. Set us2=mindegree(s)u_{s_{2}}=mindegree(s), us1=us2+vu_{s_{1}}=u_{s_{2}}+v, ut2=mindegree(t)u_{t_{2}}=mindegree(t), ut1=ut2+vu_{t_{1}}=u_{t_{2}}+v.

Note that the numbers of vertices and edges in HH are at most twice those in GG.
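A minimal Python sketch of this construction is given below; the node labels ('L', i) and ('R', i) are illustrative, and the degree bounds at ss and tt are then set as in the last bullet above.

```python
from collections import defaultdict
import networkx as nx

def flow_to_dcs(arcs, cost):
    """arcs: directed unit-capacity edges (j, k) of G; cost[(j, k)] = a_jk."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for (j, k) in arcs:
        outdeg[j] += 1
        indeg[k] += 1
    mindeg = {i: min(indeg[i], outdeg[i]) for i in set(indeg) | set(outdeg)}
    H = nx.MultiGraph()
    for i, d in mindeg.items():
        for _ in range(d):                             # mindegree(i) parallel (i_1, i_2) copies
            H.add_edge(('L', i), ('R', i), weight=0)
    for (j, k) in arcs:                                # edge (j, k) of G becomes (j_1, k_2)
        H.add_edge(('L', j), ('R', k), weight=-cost[(j, k)])
    return H, mindeg                                   # mindeg is used to set the bounds u
```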

Theorem E.2.

Given a minimum-cost 0-1 flow problem on input graph G=(V,E)G=(V,E) with nn vertices, mm edges, and maximum degree dmaxd_{max}, there exists an algorithm which takes as input a predicted dual solution y^\hat{y} to an instance of maximum weight perfect matching derived from GG, near-optimally rounds the dual to a feasible solution y^\hat{y}^{\prime}, and solves the minimum-cost 0-1 flow problem in time O(m3/2dmax+(mdmax+mlogm)yy^0)O(m^{3/2}d_{max}+(md_{max}+m\log m)||y^{*}-\hat{y}^{\prime}||_{0}).

E.3 Diameter to Shortest Paths

The diameter of a graph is defined as the largest distance between any pair of vertices. All exact algorithms for calculating the diameter on general weighted graphs rely on computing all-pairs shortest paths (and there is evidence that this approach is unavoidable [DW21]). Our learning-augmented algorithm for computing shortest paths from Section 4 gives the following corollary for computing the diameter of an input graph: first round the prediction to a valid reduced edge length dual as in Definition 4.1, then run all-pairs shortest paths using Dijkstra’s algorithm on the resulting graph with non-negative weights.

Theorem E.3.

Given an input graph GG with nn vertices and mm edges with possibly negative integer edge lengths given by \ell, there exists an algorithm which takes as input a predicted dual solution y^\hat{y} to the reduced edge length dual on GG and computes the diameter of GG in time

O(mmin{y^y1y^y,nlog(y^y)})+O~(mn).O(m\min\{\|\hat{y}-y^{*}\|_{1}\cdot\|\hat{y}-y^{*}\|_{\infty},\sqrt{n}\log(\|\hat{y}-y^{*}\|_{\infty})\})+\tilde{O}(mn).
Remark E.1.

Note that an algorithm which does not use any learned predictions for computing shortest paths on a graph with negative weights, such as the Bellman-Ford algorithm, would take O(mn2)O(mn^{2}) time to compute the diameter. Note that we could have also reduced the diameter problem to matching by using the reduction from shortest paths to matching. However, the reduction used in Theorem E.3 is simpler as we do not need to compute any new graphs.

Appendix F Omitted Details from Section 6

Our results generally follow from bounding the pseudo-dimension of the loss function and applying standard uniform convergence for PAC learning.

Definition F.1 (pseudo-dimension).

Let \mathcal{F} be a class of functions f:X.f:X\rightarrow\mathbb{R}. Let S={x1,x2,,xs}XS=\left\{x_{1},x_{2},\ldots,x_{s}\right\}\subset X. We say that SS is shattered by \mathcal{F} if there exist real numbers r1,,rsr_{1},\ldots,r_{s} so that for all SSS^{\prime}\subseteq S, there is a function ff\in\mathcal{F} such that f(xi)rif\left(x_{i}\right)\leq r_{i} if and only if xiSx_{i}\in S^{\prime} for all i[s]i\in[s]. The pseudo-dimension of \mathcal{F} is the largest ss such that there exists an SXS\subseteq X with |S|=s|S|=s that is shattered by \mathcal{F}.

For a class of loss functions with bounded range and pseudo-dimension, the following lemma provides a PAC learning guarantee.

Lemma F.1 (uniform convergence; e.g., [ABB99]).

Let 𝒟\mathcal{D} be a distribution over a domain XX and \mathcal{F} be a class of functions f:X[0,H]f:X\rightarrow[0,H] with pseudo-dimension dd_{\mathcal{F}}. Consider ss i.i.d. samples x1,x2,,xsx_{1},x_{2},\ldots,x_{s} from 𝒟.\mathcal{D}. There is a universal constant c0c_{0}, such that for any ϵ>0\epsilon>0 and δ(0,1)\delta\in(0,1), if sc0(Hϵ)2(d+ln(1/δ)),s\geq c_{0}\left(\frac{H}{\epsilon}\right)^{2}\left(d_{\mathcal{F}}+\ln(1/\delta)\right), then we have

|1si=1sf(xi)𝔼x𝒟f(x)|ϵ\left|\frac{1}{s}\sum_{i=1}^{s}f\left(x_{i}\right)-\mathbb{E}_{x\sim\mathcal{D}}f(x)\right|\leq\epsilon

for all ff\in\mathcal{F} with probability at least 1δ.1-\delta.
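As a small numeric illustration of Lemma F.1, the following snippet evaluates the sufficient sample size; the constant c0 below is a placeholder of ours for the unspecified universal constant, and the parameter values are hypothetical.

```python
import math

def sufficient_samples(H, epsilon, delta, pseudo_dim, c0=1.0):
    """Number of i.i.d. samples sufficient for epsilon-accurate empirical averages
    with probability at least 1 - delta, per the bound in Lemma F.1."""
    return math.ceil(c0 * (H / epsilon) ** 2 * (pseudo_dim + math.log(1.0 / delta)))

# Example: range [0, H] with H = 10, epsilon = 0.5, delta = 0.05, pseudo-dimension 20.
print(sufficient_samples(10, 0.5, 0.05, 20))           # 9199 samples for c0 = 1
```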

F.1 Proof of Theorem 6.1

To prove Theorem 6.1, let f1,h(c)=1(h,c)f_{1,h}(c)=\ell_{1}(h,c) for hh\in\mathcal{H}. The following lemma provides a bound on the pseudo-dimension of the family 1={f1,h(c):h}\mathcal{F}_{1}=\{f_{1,h}(c):h\in\mathcal{H}\}.

Lemma F.2 ([DIL+21]).

The pseudo-dimension of 1\mathcal{F}_{1} is bounded above by O(dlogd)O(d\log d).

Now we are ready to prove our learnability result Theorem 6.1.

Proof of Theorem 6.1.

Given ss samples c1,c2,csc_{1},c_{2}\cdots,c_{s}, the algorithm performs empirical risk minimization on the loss ^(h)=i=1sh(ci)h1\hat{\ell}(h)=\sum_{i=1}^{s}\|h^{*}(c_{i})-h\|_{1}. The algorithm runs in polynomial time by the efficient optimization assumption.

Moreover, since d\mathcal{H}\subseteq\mathbb{R}^{d} and \mathcal{H} has bounded range, we have that any function in 1\mathcal{F}_{1} is bounded by dMdM. Therefore, the sample complexity and error bound follow from Lemma F.1 and Lemma F.2. ∎
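For concreteness, under the extra assumption that \mathcal{H} is a box in \mathbb{R}^{d} containing the sampled optimal duals, one efficient empirical risk minimizer for the \ell_{1} loss is the coordinate-wise median, since the objective separates across coordinates. The sketch below is our illustration (using numpy), not part of the proof.

```python
import numpy as np

def erm_l1(optimal_duals):
    """optimal_duals: array of shape (s, d) whose i-th row is h*(c_i).
    Returns a minimizer of sum_i ||h*(c_i) - h||_1 over h in R^d."""
    return np.median(np.asarray(optimal_duals), axis=0)

# Example with s = 3 samples of d = 4 dual coordinates.
print(erm_l1([[0, 2, 5, 1], [1, 2, 4, 3], [0, 3, 6, 1]]))   # -> [0. 2. 5. 1.]
```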

F.2 Proof of Theorem 6.2

Proof of Theorem 6.2.

Similar to the 1\ell_{1} learnability theorem, the algorithm simply finds the empirical minimizer of ^(h)=i=1sh(ci)h\hat{\ell}(h)=\sum_{i=1}^{s}\|h^{*}(c_{i})-h\|_{\infty}. The algorithm runs in polynomial time by the efficient optimization assumption.

Let f,h(c)=(h,c)f_{\infty,h}(c)=\ell_{\infty}(h,c) for hh\in\mathcal{H}. It now suffices to bound the pseudo-dimension of the family ={f,h(c):h}\mathcal{F_{\infty}}=\{f_{\infty,h}(c):h\in\mathcal{H}\} and then apply the uniform convergence lemma (Lemma F.1). Now observe that the pseudo-dimension of \mathcal{F}_{\infty} can in turn be bounded by the VC dimension of axis-aligned hyperrectangles in d\mathbb{R}^{d}, which is known to be 2d2d [MRT18]. ∎

F.3 Details on Learnability Via Arithmetic Complexity

Suppose we have any loss function L(h,G)L(h,G)\in\mathbb{R} which represents how well a hint vector performs on some input GG. For notational simplicity, we define 𝒜\mathcal{A} as the class of functions obtained by composing each hh\in\mathcal{H} with LL:

𝒜:={Lh:h}.\mathcal{A}:=\{L\circ h:h\in\mathcal{H}\}.

We also assume that the range of LL is equal to [0,H][0,H] and that all graphs GG can be represented as a feature vector in m\mathbb{R}^{m}.

Again, we aim to learn the best function hh\in\mathcal{H} which minimizes the following objective:

𝔼GD[L(h,G)].\operatorname*{\mathbb{E}}_{G\sim\mathcal{D}}[L(h,G)]. (3)

Towards this end, we let hh^{*} denote the optimal hh\in\mathcal{H}. We also assume that for each instance GG and each hh\in\mathcal{H}, Lh(G)L\circ h(G) can be computed in time T(m,d)T(m,d). For example, suppose graphs drawn from 𝒟\mathcal{D} possess edge features in d\mathbb{R}^{d} for some dd and our family \mathcal{H} is parameterized by a single vector θd\theta\in\mathbb{R}^{d} and represents linear functions which report the dot product of each edge feature with θ\theta. Then it is clear that T(m,d)T(m,d) is a (small) polynomial in the relevant parameters.

The result of this section is to bound the pseudo-dimension of 𝒜\mathcal{A}. After obtaining a bound, we can readily apply Lemma F.1 as we did in the proof of Theorem 6.1 in Section 6.1.

Theorem F.3 (Learnability via computational complexity).

Suppose that any a𝒜a\in\mathcal{A} takes T(m,d)T(m,d) time to compute given any graph GG drawn from 𝒟\mathcal{D}. Then the pseudo-dimension of 𝒜\mathcal{A} is O(poly(T(m,d)))O(\text{poly}(T(m,d))).

To prove Theorem F.3, we first relate the pseudo-dimension to the VC dimension of a related class of threshold functions. This relationship has been fruitful in obtaining learning bounds in a variety of works such as [LFKF18, ISZ21].

Lemma F.4 (Pseudo-dimension to VC dimension, Lemma 10 in [LFKF18]).

For any a𝒜a\in\mathcal{A}, let BaB_{a} be the indicator function of the region on or below the graph of aa, i.e., Ba(x,y)=sgn(a(x)y)B_{a}(x,y)=\text{sgn}(a(x)-y). The pseudo-dimension of 𝒜\mathcal{A} is equivalent to the VC-dimension of the subgraph class B𝒜={Baa𝒜}B_{\mathcal{A}}=\{B_{a}\mid a\in\mathcal{A}\}.

The following theorem then relates the VC dimension of a given function class to its computational complexity, i.e., the complexity of computing a function in the class in terms of the number of operations needed.

Lemma F.5 (Theorem 8.14 in [ABB99]).

Let w:α×β{0,1}w:\mathbb{R}^{\alpha}\times\mathbb{R}^{\beta}\rightarrow\{0,1\}, determining the class

𝒲={xw(θ,x):θα}.\mathcal{W}=\{x\rightarrow w(\theta,x):\theta\in\mathbb{R}^{\alpha}\}.

Suppose that any ww can be computed by an algorithm that takes as input the pair (θ,x)α×β(\theta,x)\in\mathbb{R}^{\alpha}\times\mathbb{R}^{\beta} and returns w(θ,x)w(\theta,x) after no more than rr of the following operations:

  • arithmetic operations +, -, \times, and / on real numbers,

  • jumps conditioned on >, \geq, <, \leq, =, and \neq comparisons of real numbers, and

  • output 0 or 1,

then the VC dimension of 𝒲\mathcal{W} is O(α2r2+r2αlogα)O(\alpha^{2}r^{2}+r^{2}\alpha\log\alpha).

Combining the previous results allows us to prove Theorem F.3. At a high level, we are instantiating Lemma F.5 with the complexity of computing any function in the function class 𝒜\mathcal{A}.