Proof of sublinear regret of UCB algorithms for bandits

This post will be concerned with the proof that the UCB algorithm achieves sublinear regret for multi-armed bandits. First, let’s set up the preliminaries of the problem.

UCB and bandits are part of fast reinforcement learning, a problem setting concerned with making sample-efficient decisions, especially in applications where experience is costly or difficult to acquire. Bandits are a simplified version of an MDP, in particular one with only a single state. We have a finite set of actions \(\mathcal{A}\), each of which induces a reward distribution \(\mathcal{R}^{a}(r) = P[r \vert a]\). At each step, the agent selects some \(a_{t} \in \mathcal{A}\), and the overall goal is to maximize the cumulative reward, \(\sum_{t=1}^T r_{t}\), or equivalently, to minimize the cumulative regret, \(l_{T} = \mathbb{E}\left[\sum_{t=1}^T V^{*} - Q(a_{t})\right]\), where \(Q(a) = \mathbb{E}[r \mid a]\) is the expected reward of action \(a\) and \(V^{*} = \max_{a \in \mathcal{A}} Q(a)\).
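
To make the setup concrete, here is a minimal sketch of UCB1 on a Bernoulli bandit. This is only an illustration of the regret being measured, not the formal algorithm analyzed in the proof; the arm means, exploration constant `c`, and horizon are arbitrary assumptions for the example.

```python
import numpy as np

def ucb1(arm_means, T, c=2.0, seed=0):
    """Run UCB1 on a Bernoulli bandit and return its cumulative (pseudo-)regret."""
    rng = np.random.default_rng(seed)
    n_arms = len(arm_means)
    counts = np.zeros(n_arms)      # N_t(a): how many times each arm was pulled
    estimates = np.zeros(n_arms)   # empirical mean reward of each arm
    best_mean = max(arm_means)
    regret = 0.0

    for t in range(1, T + 1):
        if t <= n_arms:
            a = t - 1              # pull each arm once to initialize
        else:
            bonus = np.sqrt(c * np.log(t) / counts)     # optimism bonus
            a = int(np.argmax(estimates + bonus))
        r = float(rng.random() < arm_means[a])          # Bernoulli reward
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]  # running mean update
        regret += best_mean - arm_means[a]              # V* - Q(a_t)
    return regret

print(ucb1([0.2, 0.5, 0.7], T=10_000))
```

The regret accumulated here grows roughly logarithmically in \(T\), which is the sublinear behavior the proof establishes.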

Read more

How structured do PCP queries need to be?

Although lower bounds in complexity theory usually refer to impossibility results about how time- or space-efficient an algorithm for a class of problems can be, I do think that this question, about the entropy of a PCP verifier’s queries, is a sort of lower bound on how oblivious or lazy a verifier can be. Entropy and information gain are important and old concepts in machine learning, but I think that in the age of deep learning, people don’t realize just how far the concept of a learning algorithm can be stretched. Fields like cryptography often focus on transmitting some very specific, structured knowledge while hiding everything else, which is a nice dual perspective to machine learning. Verifiers and extractors are learners, too. Unsurprisingly, techniques from cryptography can show up in machine learning: Alex Irpan has a cool blog post showing how the hybrid argument (which shows up in my proof of the not-uniform-but-still-independent case) appears in imitation learning (RL) proofs.

Read more

Understanding and Implementing Policy Gradients

Policy gradients are a pretty cool class of reinforcement learning algorithms. They focus on directly optimizing the metric we care most about in reinforcement learning (the expected reward from acting in an environment). Because of this, they enjoy an elegant formulation that looks very similar to supervised machine learning, and they have stability benefits over approaches like Q-learning, which learn a policy indirectly and can suffer from too much approximation error.
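
To sketch the "looks like supervised learning" point: below is a minimal REINFORCE-style update on a one-state environment, where the gradient of log-probability weighted by reward plays the role of a weighted classification loss. The reward values, learning rate, and step count are illustrative assumptions, not anything from the full post.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.1, 0.4, 0.8])   # assumed expected reward per action
theta = np.zeros(3)                        # logits of a softmax policy

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

alpha = 0.1
for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    r = true_rewards[a] + 0.1 * rng.standard_normal()   # noisy observed reward
    # Score-function (REINFORCE) estimator: grad of log pi(a) w.r.t. the logits
    # is (one-hot(a) - probs); scale it by the reward and ascend.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi

print(softmax(theta))   # probability mass should concentrate on the best action
```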

While this blog post is in no sense comprehensive, I hope to show a good mixture of theory and practice. There’s a lot of

Read more