Proof of sublinear regret of UCB algorithms for bandits

11 Jun 2021

This post will be concerned with the proof that the UCB algorithm achieves sublinear regret for multi-armed bandits. First, let’s set up the preliminaries of the problem.

UCB and bandits are part of fast reinforcement learning, a problem setting that is concerned with making sample efficient decisions, especially in applications where experience is costly or difficult to achieve. Bandits are a simplified version of an MDP–in particular one with only one state. We have a finite set of actions \(\mathcal{A}\), each of which induces a reward distribution \(\mathcal{R}^{a}(r) = P[r \vert a]\). At each step, the agent selects some \(a_{t} \in \mathcal{A}\) and the overall goal is to maximize the cumulative reward, \(\sum_{t=1}^T r_{t}\), or equivalently, to minimize the cumulative regret, \(l_{T} = \mathbb{E}[\sum_{t=1}^T V^{*} - Q(a_{t})]\).

Here we define:

Action-value: \(Q(a) = \mathbb{E}[r \vert a]\)
Optimal value: \(V^{*} = Q(a^{*}) = \max_{a} Q(a)\)

Lower Bound

Lai and Robbins in 1985 proved that the asymptotic total regret is at least logarithmic in the number of steps. The lower bound gives a measure of the inherent difficulty of the problem, and establishes a goal to shoot for when designing algorithms. We make a sidenote that the performance of any bandit algorithm is determined by how similar the optimal arm is to other arms. Similarly, how difficult it is to distinguish an optimal arm from a suboptimal arm. Thus, hard bandit problems have similar looking, i.e. small \(D_{KL}(\mathcal{R}^{a} \| \mathcal{R}^{a^{*}})\), arms with different means.

Optimism in the Face of Uncertainty

A high level heuristic we use to solve this problem is optimism under uncertainty. We repeatedly choose actions that look like they have high value. Why? Either the action does turn out to be good, and we reap a good reward, or the action has low reward, and we update our knowledge of it, leading us to choose it less over time (minimizing cumulative regret).

For each arm \(a\), we estimate an upper confidence bound \(U_{t}(a)\), such that with high probability, \(Q(a) \le U_{t}(a)\). I.e., with high confidence, our estimate is an upper bound on the true arm value. The algorithm then, at each time step, selects the action with maximum UCB.

At each time step, as long as all of the UCB bounds simultaneously hold, we’re in good shape to prove sublinear regret. Why? Let’s first do a proof-sketch so we can get a birds-eye view of how this will go.

If all UCB bounds hold, then no matter what action, \(a_{t}\), we take, we have that \(U_{t}(a_{t}) > Q(a^{*})\). That is, the UCB of whatever action the algorithm takes is an upper bound on the optimal action. There are two cases.

Case 1: \(a_{t} = a^{*}\)

The action we select is actually the optimal action. Then \(U_{t}(a_{t}) = U_{t}(a^{*}) > Q(a^{*})\).

Case 2: \(a_{t} \ne a^{*}\)

Then \(U_{t}(a_{t}) > U_{t}(a^{*}) > Q(a^{*})\).

To see how this is relevant, let’s look at the outline of the regret proof.

\[\begin{align*} regret(UCB,T) &= \sum_{t=1}^{T} (Q(a^{*}) - Q(a_{t})) \\ &= \sum_{t=1}^{T} U_{t}(a_{t}) - Q(a_{t}) + Q(a^{*}) - U_{t}(a_{t}) \\ & \le \sum_{t=1}^{T} U_{t}(a_{t}) - Q(a_{t}) \end{align*}\]

We have the inequality on the third line because the simultaneous UCB bounds give us \(Q(a^{*}) - U_{t}(a_{t}) \le 0\). Our estimate of \(U_{t}(a_{t})\) will be of the form \(U_{t}(a_{t}) = \hat{Q}(a_{t}) + d\) where \(\sum_{t=1}^T d_{t}\) will be a sublinear term. Ok, now onto the details.

The basic idea is to use a concentration bound for the sample mean of the value on an arm, since each pull is an i.i.d. estimate of the reward distribution. We use the Chernoff-Hoeffding Bound because it casts the deviation of the sample mean from the mean as an inverse exponential of the number of samples. The inverse exponential is important for getting sublinear regret.

Chernoff-Hoeffding Bound

Let \(X_1,...,X_n\) be i.i.d. random variables in \([0,1]\) and let \(\bar{X}_{n} = \dfrac{1}{n} \sum_{\tau = 1}^n X_{\tau}\) be the sample mean. Then

\[\begin{equation} P[E[X] > \bar{X}_{n} + u] \le exp(-2nu^2) \end{equation}\]

In our interpretation, we let

\[\begin{equation} P[Q(a_{t}) > \hat{Q}(a_{t}) + u] \le exp(-2tu^2) = \dfrac{\delta}{t^2} \end{equation}\]

Where \(\delta\) is a parameter and \(t\) is the current timestep and \(n(a_{t})\) is action selection count. We’ll see later why setting the bound to be \(\dfrac{\delta}{t^2}\) is useful.

We use the Chernoff-Hoeffding equation to derive the design of the estimate \(U(a_{t})\). Solving for \(u\), we get,

\[\begin{equation} u = \sqrt{\frac{1}{n(a_{t})} \log(t^2/\delta)} \end{equation}\]

So setting \(U_{t}(a_{t}) = \hat{Q}(a_{t}) + \sqrt{\frac{1}{n(a_{t})} \log(t^2/\delta)}\) gives us an UCB bound that holds with probability at least \(1 - \frac{\delta}{t^2}\)

Proof Details

The assumption we make for our sublinear regret bound to hold is that the Chernoff-Hoeffding Bounds hold for all arms at all time steps. Let’s formally derive this quantity by looking at the probability of failure, i.e. the probability that at some timestep the bound for some arm is incorrect.

\[\begin{align*} \bigcup\limits_{t=1}^{T} Pr(Q(a^{*}) > U_{t}(a_{t})) & \le \bigcup\limits_{t=1}^{T} \bigcup\limits_{i=1}^{m} Pr(\vert Q(a_{t}) - \hat{Q}(a_{t}) \vert > u) \\ & \le \bigcup\limits_{t=1}^{T} \bigcup\limits_{i=1}^{m} \dfrac{\delta}{t^2} \\ & \le 2m \delta \\ \end{align*}\]

For the last inequality we used \(\sum_{t=1}^{\infty} t^{-2} = \dfrac{\pi^2}{6} < 2\), which puts our analysis in the infinite horizon case.

Showing that the Chernoff bound gives us simultaneous success with probability at least \(1 - 2m\delta\).

Given that we have simultaneous success of our bounds, let’s now prove that the regret of the optimistic algorithm is indeed sublinear. Going back to the proof sketch from above, we fill out the details

Again, the Chernoff-Hoeffding bounds give us \(Q(a^{*}) - U_{t}(a_{t}) \le 0\). Now we plug in for our estimator \(U\).

\[\begin{align*} \sum_{t=1}^{T} U_{t}(a_{t}) - Q(a_{t}) &= \sum_{t=1}^{T} \hat{Q}(a_{t}) + \sqrt{\frac{1}{n(a_{t})} \log(t^2/\delta)} - Q(a_{t}) \\ & \le 2 \sum_{t=1}^T \sqrt{\dfrac{1}{n_t({a_t})} \log \dfrac{t^2}{\delta}} \\ &= 2 \sqrt{\dfrac{1}{2} \log \dfrac{T^2}{\delta}} \sum_{i=1}^m \sum_{n=1}^{n_{t}(i)} \sqrt{\dfrac{1}{n}}\\ & \le 2 \sqrt{\dfrac{1}{2} \log \dfrac{T^2}{\delta}} \sum_{i=1}^m \sum_{n=1}^{T/m} \sqrt{\dfrac{1}{n}}\\ & \le 2 \sqrt{\dfrac{1}{2} \log \dfrac{T^2}{\delta}} \sum_{i=1}^m 2 \sqrt{T/m} \\ & \le 4 \sqrt{\dfrac{1}{2} \log \dfrac{T^2}{\delta} Tm} \end{align*}\]

w.p. at least \(1 - 2m \delta\).

To comment on some of the steps above, we note that the second line comes from noting that \(\vert \hat{Q}(a_{t}) - Q(a_{t})\vert\) is bounded by \(u\) when the Chernoff-Hoeffdoing bounds hold. In the third line, we upper bound \(t\) by \(T\) and break the counts down to the individual arms. When then note that the entire sum is maximized when each arm is visited an equal number of times, and so we upper bound by that. In the fifth line, we use the analytical bound for a harmonic series. Finally, we clean up the term.

Remarks

It’s important to note that what looks like a local update for a specific arm in estimating \(U_{t}(a_{t})\) actually results in a global update for all arms at timestep \(t\). This is because the \(u\) term \(\sqrt{\frac{1}{n(a_{t})} \log(t^2/\delta)}\) includes \(t\) which is a global counter, in addition to \(n(a_{t})\) which is the local counter for an arm. As we visit an arm more times, our estimate of it becomes more accurate, as \(u\) decreases with \(n(a_{t})\). However, this is balanced by the slower growing term depedent on \(t\). This is an example of the local-vs-global theme in CS, which I’ve also seen in property testing and optimization, but apparently is also common in algebraic geometry and number theory.
Earlier, we remarked that the difficulty of a particular bandit problem is characterized by how difficult it is to distinguish a suboptimal arm from an optimal arm. This is evocative of another theme in CS, the duality between distinguishing and predicting, or decision and search. For example, in complexity theory, we can turn a decider of correct SAT solutions into a SAT solver.
People have taken inspiration from UCB algorithm in the deep reinforcement learning setting. The specific strategy is to make actions look more optimistic than they are, thereby encouraging exploration. Here is an example of an approach which hashes states.

Ann He

Proof of sublinear regret of UCB algorithms for bandits

Lower Bound

Optimism in the Face of Uncertainty

Chernoff-Hoeffding Bound

Proof Details

Remarks

Related Posts

How structured do PCP queries need to be? 14 Apr 2021

Understanding and Implementing Policy Gradients 12 Apr 2021