Practical Session 2

In [1]:
import numpy as np
from forban import *
from forban.bandits import NormalBandit, Normal
from forban.sequentialg import SequentiAlg
from forban.utils import *
import matplotlib.pyplot as plt

Structured Bandit

We study the classical multi-armed bandit problem specified by a set of real-valued Gaussian distributions $ (\nu_a)_{a \in \mathcal{A}}$ with means $(\mu_a)_{a \in \mathcal{A}} \in \mathbb{R}^\mathcal{A}$ and unit variances, where $\mathcal{A}$ is a finite set of arms. We denote $\mu_* = \max \{ \mu_a | a \in \mathcal{A} \}$.

At each time $t \geq 1$, an agent must choose an arm $a_t \in \mathcal{A}$, based only on past observations. A reward $X_t$ is then drawn from the chosen distribution $\nu_{a_t}$ and observed by the agent. The goal of the agent is to maximize the expected sum of rewards received over time, or equivalently to minimize the regret with respect to the strategy that constantly plays an arm with the highest mean reward.

Part 0 - Bandit environment, how to?

Distributions - Arms

In [2]:
mean_1 = 0.
mean_2 = 1.
normal_1 = Normal(mean_1)
normal_2 = Normal(mean_2)
sample_1 = normal_1.sample()
sample_2 = normal_2.sample()
means = [mean_1, mean_2]
print(f"A sample from a normal distribution with mean {mean_1} and standard deviation 1 is {sample_1:.3f}")
print(f"A sample from a normal distribution with mean {mean_2} and standard deviation 1 is {sample_2:.3f}")
A sample from a normal distribution with mean 0.0 and standard deviation 1 is -1.386
A sample from a normal distribution with mean 1.0 and standard deviation 1 is -0.093

Bandit instance

In [3]:
bandit_instance = NormalBandit([0.1, 0.2, 0.32, 0.24, 0.22])
print(f"A bandit instance is {bandit_instance}")
A bandit instance is Bandit([Normal (mean=0.100, std=1.000), Normal (mean=0.200, std=1.000), Normal (mean=0.320, std=1.000), Normal (mean=0.240, std=1.000), Normal (mean=0.220, std=1.000)])
In [4]:
plot_bandit(bandit_instance)

Sequential algorithms

In [5]:
# Constantly play the same arm
class Constant(SequentiAlg):
    def __init__(self, bandit, name="Constant", params={'init': 1, 'choice': 0 }):
        SequentiAlg.__init__(self, bandit, name=name, params=params)
        self.choice = params['choice']
        self.name = f"{self.name} (arm {self.choice})"
        assert self.choice < bandit.nbr_arms, f"'choice' ({self.choice}) should be one of \
the arms indices (<{bandit.nbr_arms})"
    
    def compute_indices(self):
        self.indices[self.choice] = 0
        
# Non-optimized Explore Then Commit strategy        
class Etc(SequentiAlg):
    def __init__(self, bandit, name="ETC", params={'init': 0, 'exploration': 200 }):
        SequentiAlg.__init__(self, bandit, name=name, params=params)
        self.exploration = params['exploration']
        
    def compute_indices(self):
        if self.time <= self.exploration:
            # Exploration phase: drawing random indices makes choose_an_arm
            # pick an arm uniformly at random (a faster implementation is possible)
            self.indices = np.random.rand(self.bandit.nbr_arms)
        else:
            self.indices = self.means
            
    def choose_an_arm(self):
        return randamax(self.indices)
In [6]:
# Set of means
means = [0.2, 0.5, 0., 0.05, 0.3, 0.4]
# Create a bandit problem
bandit = NormalBandit(means)
# Create sequential algorithms
seqalg1 = Constant(bandit) # Constantly play the same arm
seqalg2 = SequentiAlg(bandit, name="Random") # Sequentially choose an arm using a uniform distribution on the set of arms
seqalg3 = Constant(bandit, params = {'init': 1, 'choice': 3}) # Constantly play the same arm
seqalg4 = Etc(bandit) # Explore Then Commit strategy
In [7]:
print(seqalg4)
ETC algorithm - time step = 0
  Normal (mean=0.200, std=1.000) : est. mean = 0.000 - nbr. pulls = 0
  Normal (mean=0.500, std=1.000) : est. mean = 0.000 - nbr. pulls = 0
  Normal (mean=0.000, std=1.000) : est. mean = 0.000 - nbr. pulls = 0
  Normal (mean=0.050, std=1.000) : est. mean = 0.000 - nbr. pulls = 0
  Normal (mean=0.300, std=1.000) : est. mean = 0.000 - nbr. pulls = 0
  Normal (mean=0.400, std=1.000) : est. mean = 0.000 - nbr. pulls = 0

In [8]:
test_experiment = Experiment([seqalg1, seqalg2, seqalg3, seqalg4], bandit,
                             statistics={'mean':True, 'std':True, 'quantile':True, 'pulls':False},
                             complexity=False)
In [9]:
test_experiment.run(50, 550)
test_experiment.plot()
In [10]:
print(seqalg4)
ETC algorithm - time step = 550
  Normal (mean=0.200, std=1.000) : est. mean = 0.184 - nbr. pulls = 26
  Normal (mean=0.500, std=1.000) : est. mean = 0.317 - nbr. pulls = 30
  Normal (mean=0.000, std=1.000) : est. mean = -0.077 - nbr. pulls = 39
  Normal (mean=0.050, std=1.000) : est. mean = 0.396 - nbr. pulls = 87
  Normal (mean=0.300, std=1.000) : est. mean = 0.403 - nbr. pulls = 244
  Normal (mean=0.400, std=1.000) : est. mean = 0.395 - nbr. pulls = 124

Part 1 - UCB algorithm

The UCB strategy selects at each round $t$ the arm with the highest UCB index, which is defined for arm $a$ as $$ UCB_{a}(t) = \widehat{\mu}_{a}(t) + R\sqrt{\frac{\log\frac{1}{\delta_t}}{2N_a(t-1)}} $$ where $$ \begin{align} &\widehat{\mu}_{a}(t) = \frac{1}{N_a(t-1)}\sum_{s=1}^{t-1} \text{Reward}(s)\, \mathbb{1}_{\text{$a$ is played at round $s$}},\\ &N_a(t-1) = \text{number of times $a$ has been played up to round $t-1$ (inclusive)},\\ &\delta_t \text{ is a function of $t$, for instance } \delta_t=\frac{1}{t^3},\\ &R > 0 \text{ is a constant (theory says it can be the standard deviation of the reward distributions, but you may try higher or lower values to promote more or less exploration).}\\ \end{align} $$

Note that initially $N_a(0)=0$ so you may initialise UCB by pulling every arm at least once.

Question 1

Using the SequentiAlg class, implement the UCB algorithm.
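
A possible implementation sketch (not the only one): it follows the pattern of the Etc and IMED classes in this notebook and assumes the same SequentiAlg interface (self.means, self.nbr_pulls, self.time, self.all_selected, indices initialised to params['init'], and the randamax helper). It uses $\delta_t = 1/t^3$, hence $\log(1/\delta_t) = 3\log t$.

In [ ]:
# Sketch of a UCB strategy, following the pattern of Etc and IMED above
class UCB(SequentiAlg):
    def __init__(self, bandit, name="UCB", params={'init': np.inf, 'R': 1.0}):
        SequentiAlg.__init__(self, bandit, name=name, params=params)
        self.R = params.get('R', 1.0)

    def compute_indices(self):
        # log(1/delta_t) = 3 log(t); the +1 avoids log(0) at the very first steps
        log_term = 3 * np.log(self.time + 1)
        if self.all_selected:
            self.indices = self.means + self.R * np.sqrt(log_term / (2 * self.nbr_pulls))
        else:
            # Arms never pulled keep their +inf initial index and are selected first
            for arm in np.where(self.nbr_pulls != 0)[0]:
                self.indices[arm] = self.means[arm] \
                    + self.R * np.sqrt(log_term / (2 * self.nbr_pulls[arm]))

    def choose_an_arm(self):
        # UCB maximises its index (the default rule of the base class is used
        # by minimal-index strategies such as IMED below)
        return randamax(self.indices)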

Question 2

Using the Experiment class, run experiments on (at least) two bandit problems of your choice.
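
A possible setup (a sketch, reusing the UCB class sketched above and the Experiment interface from Part 0; the means, number of runs and horizon are arbitrary choices):

In [ ]:
# Two arbitrary Gaussian bandit problems: one with small suboptimality gaps,
# one with more arms and larger gaps
bandit_hard = NormalBandit([0.45, 0.5, 0.4])
bandit_easy = NormalBandit([0.1, 0.8, 0.3, 0.2, 0.5, 0.0])

for pb in [bandit_hard, bandit_easy]:
    xp = Experiment([UCB(pb), Etc(pb)], pb,
                    statistics={'mean': True, 'std': True, 'quantile': False, 'pulls': False},
                    complexity=False)
    xp.run(50, 1000)  # 50 independent runs over a horizon of 1000 steps
    xp.plot()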

Question 3

Comment on the plots (impact of the suboptimality gaps, of the number of arms, ...).

Part 2 - IMED algorithms

On a Gaussian bandit instance, the Lai and Robbins lower bound tells us that the regret is asymptotically no smaller than

$$\left(\sum_{a \in \mathcal{A}:\, \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}(\mu_a,\mu_*)}\right) \log(T),$$

where $\Delta_a = \mu_* - \mu_a$ is the suboptimality gap of arm $a$ and $\mathrm{KL}(\mu_a,\mu_*)$ is the KL-divergence between the Gaussian distribution of mean $\mu_a$ and the Gaussian distribution of mean $\mu_*$ (unit variances). The constant in front of the $\log(T)$ may be called the complexity of the bandit problem.

For all arm $a \in \mathcal{A}$, for all time step $t \geq 1$, the empirical mean of arm $a$ at time step $t$ is $$\hat \mu_a(t) = \dfrac{1}{N_a(t)} \sum\limits_{s = 1}^t \mathbb{1}_{\{a_s = a\}}X_s, \text{ if } N_a(t) > 0,\ 0 \text{ otherwise}, $$ where $N_a(t) = \sum\limits_{ s=1}^t \mathbb{1}_{\{ a_s = a\}}$ is the number of pulls of arm $a$ at time $t$. We write $\hat \mu_*(t) = \max\limits_{a \in \mathcal{A}}\hat\mu_a(t)$.

IMED strategy - Honda and Takemura (2011)

For an arm $a \in \mathcal{A}$ and a time step $t\geq 1$, the IMED index is defined as follows: $$ I_a(t) = N_a(t)\,\mathrm{KL}\!\left(\hat\mu_a(t),\hat \mu_*(t)\right)+ \log\!\left(N_a(t)\right) \,. $$

This quantity can be seen as a transportation cost for "moving" a sub-optimal arm to an optimal one, plus an exploration term (the logarithm of the number of pulls). When a currently optimal arm is considered, the transportation cost is null and only the exploration part remains. Note that, as stated in Honda and Takemura (2011), $I_a(t)$ is an index only in a weak sense, since it cannot be computed from samples of arm $a$ alone: it also uses the empirical means of the current optimal arms.

IMED is the strategy that consists in pulling an arm with minimal index at each time step: $$ a_{t+1} = \arg\!\min_{a \in \mathcal{A}}I_a(t) \,.$$

Bonus Question

Derive the formula for the Kullback-Leibler divergence between two Gaussian distributions.
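
For reference (a standard computation), for two Gaussian distributions $\mathcal{N}(\mu_1,\sigma_1^2)$ and $\mathcal{N}(\mu_2,\sigma_2^2)$:

$$\mathrm{KL}\big(\mathcal{N}(\mu_1,\sigma_1^2),\mathcal{N}(\mu_2,\sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}\,,$$

which, in the unit-variance case used throughout this notebook (and by the klGaussian function below), reduces to $\mathrm{KL}(\mu_1,\mu_2) = \frac{(\mu_1-\mu_2)^2}{2}$.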

Implementation of the IMED strategy

In [11]:
class IMED(SequentiAlg):
    def __init__(self, bandit, name="IMED", params={'init': -np.inf, 'kl':klGaussian}):
        SequentiAlg.__init__(self, bandit, name=name, params=params)
        self.kl = params['kl']
    
    def compute_indices(self):
        max_mean = np.max(self.means)
        if self.all_selected:
            self.indices = self.nbr_pulls*self.kl(self.means, max_mean) + np.log(self.nbr_pulls)
        else:
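            # Arms never pulled keep their initial -inf index and are therefore selected first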
            for arm in np.where(self.nbr_pulls != 0)[0]:
                self.indices[arm] = self.nbr_pulls[arm]*self.kl(self.means[arm], max_mean) \
                + np.log(self.nbr_pulls[arm])

Experiment with the IMED strategy

In [12]:
horizon = 700
agent = IMED(bandit)
experiment_imed = Experiment([agent], bandit,
                             statistics={'mean':True, 'std':False, 'quantile':False, 'pulls':False},
                             complexity=False)

experiment_imed.run(1, horizon)
experiment_imed.plot()
# Visualize results (on one run) 

# Histogram of the number of arms selections
plt.figure(figsize=(12,5))
plt.xlabel("Arms", fontsize=14)
plt.ylabel("Number of arms selections", fontsize=14)
plt.bar(range(bandit.nbr_arms), agent.nbr_pulls, width=0.4, tick_label=[str(i) for i in range(bandit.nbr_arms)])
plt.title("Number of selections of each arm", fontsize=14)
plt.show()

Compare strategies

In [13]:
nbr_exp = 50
horizon = 500
etc = Etc(bandit) # Explore Then Commit strategy
imed = IMED(bandit) # IMED strategy
experiment = Experiment([etc, imed], bandit,
                        statistics={'mean':False, 'std':False, 'quantile':True, 'pulls':False},
                        complexity=True)
experiment.run(nbr_exp, horizon)
experiment.plot()

Part 3 - Unimodal bandits

We assume that $\mathcal{A} = \left\{0, \dots, A-1\right\}$, $A \geq 1$, and that $ \mu : \begin{cases} \mathcal{A} & \to \mathbb{R} \\ a &\mapsto \mu_a \end{cases} $ is unimodal. That is, there exists $ a_* \!\in\! \mathcal{A}$ such that $\mu$ is increasing on $[\![0,a_*]\!] $ and decreasing on $ [\![ a_*, A-1]\!] $. It is further assumed that for each arm $a$, $ \nu_{a} $ is a Gaussian distribution $ \mathcal{N}(\mu_a,1) $, where $ \mu_a \in \mathbb{R} $ is the mean of the distribution $\nu_{a}$. We denote the structured set of Gaussian unimodal bandit distributions by $$\mathcal{D}_{\text{unimodal}} = \bigg\{\nu = (\nu_a)_{a \in \mathcal{A}}:\ \forall a \!\in\! \mathcal{A},\ \nu_a \sim \mathcal{N}(\mu_a,1) \text{ with } \mu_a \!\in\! \mathbb{R}\text{ and }\mu \text{ is unimodal}\bigg\}\,.$$

On a Gaussian unimodal bandit instance, the Lai and Robbins lower bound, adapted to the unimodal structure, tells us that the regret is asymptotically no smaller than

$$\left(\sum_{a \in \mathcal{V}_{a_*}} \frac{\Delta_a}{\mathrm{KL}(\mu_a,\mu_*)}\right) \log(T),$$

where $\mathcal{V}_{a_*} = \{a_* - 1, a_* + 1\}\cap\mathcal{A}$ and $\mathrm{KL}(\mu_a,\mu_*)$ is the KL-divergence between the Gaussian distribution of mean $\mu_a$ and the Gaussian distribution of mean $\mu_*$ (variances equal to $1$). The constant in front of the $\log(T)$ may be called the unimodal complexity of the bandit problem.

1) Computing regret lower bound for a unimodal bandit instance

Question 4

Write a function that computes the complexity of a unimodal Gaussian bandit instance.
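
A possible sketch (the names kl_gaussian and unimodal_complexity are illustrative; the klGaussian function imported from forban.utils could replace the local helper):

In [ ]:
def kl_gaussian(mu1, mu2):
    # KL divergence between N(mu1, 1) and N(mu2, 1)
    return (mu1 - mu2) ** 2 / 2

def unimodal_complexity(means):
    """C_unimodal(nu) = sum over the neighbours a of the best arm of Delta_a / KL(mu_a, mu_*)."""
    means = np.asarray(means, dtype=float)
    a_star = int(np.argmax(means))
    mu_star = means[a_star]
    neighbours = [a for a in (a_star - 1, a_star + 1) if 0 <= a < len(means)]
    return sum((mu_star - means[a]) / kl_gaussian(means[a], mu_star)
               for a in neighbours if means[a] < mu_star)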

Question 5

Write a function that generates at random a unimodal Gaussian bandit instance.
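
A possible sketch (random_unimodal_bandit is an illustrative name; NormalBandit is the class used in Part 0):

In [ ]:
def random_unimodal_bandit(nbr_arms, low=0., high=1.):
    """Draw nbr_arms means in [low, high] and rearrange them into a unimodal sequence."""
    values = np.sort(np.random.uniform(low, high, size=nbr_arms))
    peak = np.random.randint(nbr_arms)
    means = np.empty(nbr_arms)
    # The peak receives the largest mean, the arms to its left the next largest
    # values in increasing order, the remaining ones go to the right in decreasing order
    means[:peak + 1] = values[nbr_arms - peak - 1:]
    means[peak + 1:] = values[:nbr_arms - peak - 1][::-1]
    return NormalBandit(means)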

Question 6

On a unimodal Gaussian bandit instance $\nu $ of your choice, add the theoretical lower bound $t \mapsto C_{\text{unimodal}}(\nu) \log(t)$, where $C_{\text{unimodal}}(\nu)$ is the unimodal complexity of the problem, to the regret curve of IMED.

Add a plot with experiments of your choice using the previous algorithms.

2) IMED for Unimodal Bandit

For an arm $a \in \mathcal{A}$ and a time step $t\geq 1$, the IMED4UB index is defined as follows: $$ I_a(t) = N_a(t)\,\mathrm{KL}\!\left(\hat\mu_a(t),\hat \mu_*(t)\right)+ \log\!\left(N_a(t)\right) \,. $$

IMED4UB is the strategy that consists in pulling, at each time step, an arm with minimal index among the current empirically best arm (also called the leader, i.e. the arm with the largest empirical mean) and its neighbours: $$ a_{t+1} = \arg\!\min_{a \in \mathcal{V}_{\hat a_*(t)}\cup \{\hat a_*(t)\}}I_a(t) \,.$$

Question 7

Write a class that implements an IMED-type strategy for unimodal bandits, inspired by the regret lower bound for the unimodal structure.

Run experiments as in Part 1. You can comment on them and add experiments of your choice.
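
A possible sketch, reusing the IMED class above (the class name IMED4UB is illustrative): the usual IMED indices are computed, then every arm outside the neighbourhood of the empirical leader gets an infinite index, so the inherited minimal-index selection rule never picks it.

In [ ]:
class IMED4UB(IMED):
    def __init__(self, bandit, name="IMED4UB", params={'init': -np.inf, 'kl': klGaussian}):
        IMED.__init__(self, bandit, name=name, params=params)

    def compute_indices(self):
        # Usual IMED indices...
        IMED.compute_indices(self)
        # ...then restrict the minimisation to {leader - 1, leader, leader + 1}
        leader = randamax(self.means)
        neighbourhood = [a for a in (leader - 1, leader, leader + 1)
                         if 0 <= a < self.bandit.nbr_arms]
        mask = np.ones(self.bandit.nbr_arms, dtype=bool)
        mask[neighbourhood] = False
        self.indices = np.asarray(self.indices, dtype=float)
        self.indices[mask] = np.inf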

Part 4 - Lipschitz bandits

We assume that $\mathcal{A} = \left\{0, \dots, A-1\right\}, A \geq 1$, and $ \mu : \begin{cases} \mathcal{A} & \to \mathbb{R} \\ a &\mapsto \mu_a \end{cases} $ is $k$-Lipschitz, where $k$ is assumed to be known. That is, for all $ a, a' \!\in\! \mathcal{A}$, $|\mu_a - \mu_{a'}| \leq k\!\times\! |a - a'|$. It is further assumed that for each arm $a$, $ \nu_{a} $ is a Gaussian distribution $ \mathcal{N}(\mu_a,1) $, where $ \mu_a \in \mathbb{R} $ is the mean of the distribution $\nu_{a}$. We denote the structured set of Gaussian $k$-Lipschitz bandit distributions by $$\mathcal{D}_{k\text{-Lip}} = \bigg\{\nu = (\nu_a)_{a \in \mathcal{A}}:\ \forall a \!\in\! \mathcal{A},\ \nu_a \sim \mathcal{N}(\mu_a,1) \text{ with } \mu_a \!\in\! \mathbb{R}\text{ and }\mu \text{ is } k\text{-Lipschitz}\bigg\}\,.$$

On a Gaussian $k$-Lipschitz bandit instance, the lower bound states that the numbers of pulls asymptotically satisfy the following inequalities:

$$\forall a \in \mathcal{A} \text{ such that } \mu_a < \mu_*, \quad \sum_{a' \in \mathcal{V}_{a}} \mathrm{KL}\!\left(\mu_{a'},\,\mu_*\!-\!k|a \!-\! a'|\right)\,N_{a'}(T) \geq \log(T)\,,$$

where $\mathcal{V}_{a} = \big\{a'\!\in\!\mathcal{A}\!: \mu_{a'}\!<\! \mu_*\!-\!k|a \!-\! a'|\big\}$ and $\mathrm{KL}(\mu,\mu')$ is the KL-divergence between the Gaussian distribution of mean $\mu$ and the Gaussian distribution of mean $\mu'$ (unit variances). The constant $C_{k\text{-Lip}}(\nu)$ resulting from the following linear programming problem may be called the Lipschitz complexity of the bandit problem: $$ C_{k\text{-Lip}}(\nu) = \min \bigg\{\sum\limits_{a \in \mathcal{A}} (\mu_*-\mu_a)\,n_{a}:\ n \in \mathbb{R}_+^{\mathcal{A}} \text{ s.t. } \forall a \in \mathcal{A} \text{ with } \mu_a < \mu_*,\ \sum_{a' \in \mathcal{V}_{a}} \mathrm{KL}\!\left(\mu_{a'},\,\mu_*\!-\!k|a \!-\! a'|\right)\,n_{a'} \geq 1\bigg\}\,. $$

1) Computing regret lower bound for a k-Lipschitz bandit instance

Question 8

Write a function that computes the complexity of a $k$-Lipschitz bandit instance.
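
A possible sketch using scipy.optimize.linprog (assuming SciPy is available; the names kl_gaussian and lipschitz_complexity are illustrative). The constraints are written for the suboptimal arms only, as in the lower bound above.

In [ ]:
from scipy.optimize import linprog

def kl_gaussian(mu1, mu2):
    # KL divergence between N(mu1, 1) and N(mu2, 1)
    return (mu1 - mu2) ** 2 / 2

def lipschitz_complexity(means, k):
    """Value of the linear programme defining C_{k-Lip}(nu)."""
    means = np.asarray(means, dtype=float)
    nbr_arms = len(means)
    mu_star = means.max()
    gaps = mu_star - means                 # objective: minimise sum_a gaps[a] * n[a]
    suboptimal = np.where(gaps > 0)[0]
    if len(suboptimal) == 0:
        return 0.0
    # One constraint per suboptimal arm a:
    #   sum_{a' in V_a} KL(mu_{a'}, mu_* - k|a - a'|) * n_{a'} >= 1
    # linprog only handles "<=" constraints, hence the sign flip
    A_ub, b_ub = [], []
    for a in suboptimal:
        row = np.zeros(nbr_arms)
        for a_prime in range(nbr_arms):
            target = mu_star - k * abs(a - a_prime)
            if means[a_prime] < target:    # a' belongs to V_a
                row[a_prime] = kl_gaussian(means[a_prime], target)
        A_ub.append(-row)
        b_ub.append(-1.0)
    res = linprog(c=gaps, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, None)] * nbr_arms)
    return res.fun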

Question 9

Write a function that generates at random a Lipschitz Gaussian bandit instance.
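
A possible sketch (random_lipschitz_bandit is an illustrative name): successive means differ by at most $k$ by construction, so the resulting mean function is $k$-Lipschitz.

In [ ]:
def random_lipschitz_bandit(nbr_arms, k, low=0., high=1.):
    """Generate a k-Lipschitz Gaussian bandit by drawing bounded increments."""
    means = np.empty(nbr_arms)
    means[0] = np.random.uniform(low, high)
    for a in range(1, nbr_arms):
        means[a] = means[a - 1] + np.random.uniform(-k, k)
    return NormalBandit(means)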

Question 10

On a $k$-Lipschitz Gaussian bandit instance $\nu $ of your choice, add the theoretical lower bound $t \mapsto C_{k\text{-Lip}}(\nu) \log(t)$, where $C_{k\text{-Lip}}(\nu)$ is the Lipschitz complexity of the problem, to the regret curve of IMED.

2) IMED for Lipschitz Bandit

For an arm $a \in \mathcal{A}$ and a time step $t\geq 1$, the IMED4LB index is defined as follows: $$ I_a(t) = \sum\limits_{a ' \in \hat{\mathcal{V}}_{a}(t)}\left[ N_{a'}(t)\,\mathrm{KL}\!\left(\hat\mu_{a'}(t),\hat \mu_*(t) - k |a - a'|\right)+ \log\!\left(N_{a'}(t)\right)\right] \,, $$ where $\hat{\mathcal{V}}_{a}(t) = \big\{a'\!\in\!\mathcal{A}\!: \hat\mu_{a'}(t)\!\leq\! \hat\mu_*(t)\!-\!k|a \!-\! a'|\big\}$.

IMED4LB is the strategy that consists in pulling an arm with minimal index at each time step: $$ a_{t+1} = \arg\!\min_{a \in \hat{\mathcal{V}}_{\hat a_*(t)}}I_a(t) \,.$$

Question 11

Write a class that implements an IMED-type strategy for Lipschitz bandits, inspired by the lower bounds on the numbers of pulls for the Lipschitz structure.

Run experiments as in Part 1. You can comment on them and add experiments of your choice.
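
A possible sketch in the spirit of the IMED class above (the class name IMED4LB is illustrative, and klGaussian is assumed to apply element-wise to arrays, as in the IMED class). During the initial phase every arm is pulled once; afterwards the index of arm $a$ sums the transportation costs over the empirical set $\hat{\mathcal{V}}_a(t)$, and, for simplicity, the inherited minimal-index rule is applied over all arms as plain IMED does.

In [ ]:
class IMED4LB(SequentiAlg):
    def __init__(self, bandit, k, name="IMED4LB", params={'init': -np.inf, 'kl': klGaussian}):
        SequentiAlg.__init__(self, bandit, name=name, params=params)
        self.kl = params['kl']
        self.k = k

    def compute_indices(self):
        max_mean = np.max(self.means)
        if not self.all_selected:
            # Initial phase, as in IMED: unpulled arms keep their -inf index
            for arm in np.where(self.nbr_pulls != 0)[0]:
                self.indices[arm] = self.nbr_pulls[arm] * self.kl(self.means[arm], max_mean) \
                                    + np.log(self.nbr_pulls[arm])
        else:
            arms = np.arange(self.bandit.nbr_arms)
            indices = np.zeros(self.bandit.nbr_arms)
            for a in arms:
                # Lipschitz targets mu_*(t) - k |a - a'| and empirical set V_a(t)
                targets = max_mean - self.k * np.abs(arms - a)
                neighbours = self.means <= targets
                indices[a] = np.sum(
                    self.nbr_pulls[neighbours] * self.kl(self.means[neighbours], targets[neighbours])
                    + np.log(self.nbr_pulls[neighbours])
                )
            self.indices = indices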

Part 5 - Linear bandits

Let us consider the set of arms $\mathcal{A} = \left\{0, \dots, A-1\right\}$, $A \geq 1$, indexing the discretised space $ \mathcal{X} = \bigg\{x_a=\dfrac{a}{A}\ ,\ a\!\in\!\mathcal{A} \bigg\} \subset [0,1]$. Let us also consider the parameter space $\Theta = \mathcal{B}(0,1)\subset\mathbb{R}^d$ (the unit Euclidean ball), $d = 2p+1$, $p\geq 1$, and the related function space $\mathcal{F}_{\Theta} = \bigg\{ f_\theta: x\in \mathcal{X} \mapsto \theta^{\text{T}}\phi(x)\ ,\ \theta \in \Theta \bigg\}$, where $ \forall x\in\mathcal{X},\ \phi(x) = \!\left(1, \cos(2\pi x),\sin(2\pi x),\dots,\cos(2\pi p x), \sin(2\pi p x)\right) $. It is further assumed that for every parameter $\theta\in\Theta$ and every arm $a\in\mathcal{A}$, $ \nu_{a}(\theta) $ is a Gaussian distribution $ \mathcal{N}(\mu_a(\theta),1) $, where $ \mu_a(\theta) = f_\theta(x_a) $ is the mean of the distribution $\nu_{a}(\theta)$. We denote the structured set of linear bandit distributions by $$\mathcal{D}_{\Theta} = \bigg\{\nu(\theta) = \left(\nu_a(\theta)\right)_{a \in \mathcal{A}}\ ,\ \theta\in\Theta \bigg\}\,.$$

Question 12

Write a function that generates at random a Linear Gaussian bandit instance.

In [ ]:
class LinearBandit:
    """
    Linear bandit environment with N(0, sigma^2) noise. 
    The reward for selecting context X_k is <X_k, theta> + N(0, sigma^2).
    """
    def __init__(self, theta, nbr_arms, sigma=1.0):
        self.theta = np.array(theta)
        self.d = len(theta)
        self.nbr_arms = int(nbr_arms)
        self.sigma = sigma

        # Generate arm features:
        # phi(x) = (1, cos(2*pi*x), sin(2*pi*x), ..., cos(2*pi*p*x), sin(2*pi*p*x))
        self.arms = np.empty((self.nbr_arms, self.d))
        for k in range(self.nbr_arms):
            xk = k / self.nbr_arms
            for j in range(self.d):
                if j == 0:
                    self.arms[k, j] = 1
                elif j % 2 == 1:
                    self.arms[k, j] = np.cos(2 * np.pi * ((j + 1) // 2) * xk)
                else:
                    self.arms[k, j] = np.sin(2 * np.pi * (j // 2) * xk)
        
        mean_rewards = np.array([self.theta @ arm for arm in self.arms])
        self.regrets = np.max(mean_rewards) - mean_rewards
        
    def __repr__(self):
        res = f"Linear bandit with trigonometric features.\n"
        res += f"Number of arms: {self.nbr_arms}\n"
        res += f"Dimension of features: {self.d}\n"
        res += f"True parameter theta (unknown to the bandit algorithm!):\n {self.theta}\n"
        return res
    
    def __str__(self):
        # Same representation as __repr__
        return self.__repr__()
    
    def pull(self, arm):
        return self.theta @ self.arms[arm] + self.sigma * np.random.randn()


def generate_random_linear_bandit():
    # Pick context dimension and number of arms at random
    p = np.random.randint(2, 5)
    d = 2 * p + 1
    nbr_arms = np.random.randint(2, 5)
    
    # Pick each coordinate of theta uniformly at random in [-1/sqrt(d), 1/sqrt(d)]
    # so that its L2 norm is <= 1.
    theta = np.random.uniform(-1 / np.sqrt(d), 1 / np.sqrt(d), size=d)
    return LinearBandit(theta, nbr_arms)

1) The Linear UCB algorithm

The LinUCB strategy is very similar to the UCB one, namely it selects at each round $t$ the arm with the highest UCB index. However, instead of relying on an empirical estimator $\widehat{\mu}_{a}(t)$ of the mean, it uses the features attached to each arm to get a more accurate prediction.

More precisely, for arm $a$: $$ UCB_{a}(t) = f_{\widehat{\theta}(t)}(x_a) + R\sqrt{\phi(x_a)^\top V_{t}^{-1}\phi(x_a)} $$ where $R>0$ is a constant (again, theory can help you find a reasonable value for it, but you may see it as a hyperparameter to tune).

The predictor $\widehat{\theta}(t)$ is calculated via regularised least-squares regression on the past rewards

$$ \widehat{\theta}(t) = \arg\min_{\theta\in\mathbb{R}^d} \sum_{s=1}^{t-1}\left( \text{Reward}(s) - X_s^\top \theta \right)^2 +\lambda \lVert \theta \rVert^2, $$

where $X_s = \phi(x_{a_s})$ denotes the feature vector of the arm played at round $s$. This problem has the following closed-form solution:

$$ \widehat{\theta}(t) = V_t^{-1}\sum_{s=1}^{t-1} \text{Reward}(s)\, X_s $$

where $V_t = \lambda I_d + \sum_{s=1}^{t-1} X_sX_s^\top.$

Therefore, running one step of LinUCB requires inverting a $d\times d$ matrix. This can be sped up by noticing that $V_{t+1} = V_{t} + X_{t}X_{t}^\top$ and using the Sherman-Morrison inversion formula:

$$ (V_t + X_{t}X_{t}^\top)^{-1} = V_t^{-1} - \frac{V_t^{-1} X_t X_t^{\top} V_t^{-1}}{1 + X_t^\top V_t^{-1} X_t}. $$

Question 13

Implement the linear UCB algorithm: LinearUCB($\lambda=1$, $\delta= 0.05$ , $\sigma^2 = 1$).

In [ ]:
########################################
#  Sherman-Morrison matrix inversion   #
########################################
def sherman_morrison_inverse(A_inv, u):
    """Compute the inverse of A + u*u^T
    given the matrix A^{-1} and the 
    vector u.
    """
    Au = np.dot(A_inv, u)
    A_inv -= np.outer(Au, Au)/(1+np.dot(u.T, Au))
    return A_inv


class LinUCB(SequentiAlg):
    def __init__(self, 
                 bandit, 
                 name="LinUCB", 
                 params={'init': 0, 'reg': 1.0, 'exploration_bonus': 1.0}
                 ):
        # Ridge regularisation parameter 
        self.reg = params.get('reg', 1.0)
        # Multiplicative factor in the UCB term
        self.exploration_bonus = params.get('exploration_bonus', 1.0)
        name += ' {:.2f}'.format(self.exploration_bonus)
        SequentiAlg.__init__(self, bandit, name=name, params=params)
    
    def __repr__(self):
        res = f"{self.name} algorithm - time step = {self.time}\n"
        res += f"theta_hat:\n {self.theta_hat}\n"
        for i in range(self.bandit.nbr_arms):
            res += "  "
            res += str(self.bandit.arms[i])
            res += " : "
            res += f"est. mean = {self.means[i]:.3f} - "
            res += f"nbr. pulls = {self.nbr_pulls[i]}\n"
        return res
    
    def __str__(self):
        # Same representation as __repr__
        return self.__repr__()
    
    def reset(self, horizon=None):
        self.means = np.zeros(self.bandit.nbr_arms)
        self.nbr_pulls = np.zeros(self.bandit.nbr_arms, int)
        
        # We need two things to sequentially compute the 
        # regularised least-squares predictor:
        # 1. Inverse of the matrix V = reg * I + sum X_s X_s^T
        # 2. S = sum Y_s X_s
        # where Y = <theta, X> + noise is the reward model.
        # Then theta_hat = V^{-1} S.
        self.V_inv = 1 / self.reg * np.eye(self.bandit.d)
        self.sum_arms_rewards = np.zeros(self.bandit.d)
        self.theta_hat = np.zeros(self.bandit.d)

        self.indices = np.zeros(self.bandit.nbr_arms) + self.params['init']
        self.time = 0
                
    def update_statistics(self, arm, r):
        # Context vector associated with arm
        arm_context = self.bandit.arms[arm]
        
        # We need two things to sequentially compute the 
        # regularised least-squares predictor:
        # 1. Inverse of the matrix V = reg * I + sum X_s X_s^T
        # 2. S = sum Y_s X_s
        # where Y = <theta, X> + noise is the reward model.
        # Then theta_hat = V^{-1} S.
        self.sum_arms_rewards += r * arm_context
        self.V_inv = sherman_morrison_inverse(self.V_inv, arm_context)
        self.theta_hat = self.V_inv @ self.sum_arms_rewards
        
        self.nbr_pulls[arm] += 1
        # Least-squares predicted mean for an arm with features X is
        # <theta_hat, X>, where X is the feature vector of that arm
        self.means = np.array([self.theta_hat @ arm for arm in self.bandit.arms])
        
        self.time += 1        
        self.compute_indices()
    
    def compute_indices(self):
        """UCB index for arm with features X is of the form
        predicted mean + bonus * sqrt(<X, V^{inv} X>)
        """
        self.indices = [
            self.means[i] + self.exploration_bonus * np.sqrt(arm.T @ self.V_inv @ arm)
            for i, arm in enumerate(self.bandit.arms)
        ]
        
    def choose_an_arm(self):
        return randamax(self.indices)
In [ ]:
## Generate a given bandit...
# theta = [0.7, -0.7]
# bandit = LinearBandit(theta, 2, sigma=1.0)

### ... or generate it at random!
bandit = generate_random_linear_bandit()

print(bandit)
In [ ]:
seqalglinucb_1 = LinUCB(
    bandit, 
    params={'init': 0, 'reg': 1.0, 'exploration_bonus': 1.0}
)

seqalglinucb_2 = LinUCB(
    bandit, 
    params={'init': 0, 'reg': 1.0, 'exploration_bonus': 2.0}
)
seqalglinucb_05 = LinUCB(
    bandit, 
    params={'init': 0, 'reg': 1.0, 'exploration_bonus': 0.5}
)

# Compare with a non-contextual UCB
seqalgucb = UCB(bandit)

print(seqalglinucb_1)
print(seqalglinucb_2)
print(seqalglinucb_05)
print(seqalgucb)
In [ ]:
test_experiment = Experiment(
    [seqalglinucb_1, seqalglinucb_2, seqalglinucb_05, seqalgucb], 
    bandit,
    statistics={'mean':True, 'std':True, 'quantile':False, 'pulls':False},
    complexity=False
)

Comments

Although UCB exhibits sublinear regret growth, it is beaten by all instances of LinUCB.

Do you find it surprising? How would you explain it?

Feel free to repeat the experiments with different exploration bonuses.

In [ ]:
test_experiment.run(10, 1000)
test_experiment.plot()
In [ ]:
print(seqalglinucb_1)
print(seqalglinucb_2)
print(seqalglinucb_05)

2) The Linear IMED algorithm

For an arm $a \in \mathcal{A}$ and a time step $t\geq 1$, the LinearIMED index is defined as follows: $$ I_a(t) = N_a^{\text{eff}}(t)\,\dfrac{\left(\max_{x \in \mathcal{X}}\hat f_{t,\lambda}(x) - \hat f_{t,\lambda}(x_a)\right)^2}{2} + \log\!\left(N_a^{\text{eff}}(t)\right) \,, $$ where $ \hat f_{t,\lambda} $ is the current regularised least-squares estimate of $f_\theta$ and $N_a^{\text{eff}}(t) = \!\left( \lVert \phi(x_a) \rVert_{G_t^{-1}}^2 \right)^{-1} $ is the effective number of pulls of arm $a$. Here $ G_t$ is the current Gram matrix (used for the regression).

LinearIMED is the strategy that consists in pulling an arm with minimal index at each time step: $$ a_{t+1} = \arg\!\min_{a \in \mathcal{A}}I_a(t) \,.$$

Question 14

Write a class that implements an IMED-type strategy for linear bandits, inspired by the lower bounds on the numbers of pulls for the linear structure.
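
A possible sketch (the class name LinIMED is illustrative), reusing the least-squares machinery of the LinUCB class above: update_statistics already maintains theta_hat, V_inv (the inverse of the regularised Gram matrix $G_t$) and the predicted means, so only the index computation and the selection rule change.

In [ ]:
class LinIMED(LinUCB):
    def __init__(self, bandit, name="LinIMED",
                 params={'init': 0, 'reg': 1.0, 'exploration_bonus': 1.0}):
        # exploration_bonus is unused by the index below; it is only kept for
        # compatibility with the parent constructor
        LinUCB.__init__(self, bandit, name=name, params=params)

    def compute_indices(self):
        # Effective number of pulls: N_a^eff(t) = 1 / ||phi(x_a)||^2_{G_t^{-1}}
        eff_pulls = np.array([1.0 / (arm @ self.V_inv @ arm) for arm in self.bandit.arms])
        gaps = np.max(self.means) - self.means
        self.indices = eff_pulls * gaps ** 2 / 2 + np.log(eff_pulls)

    def choose_an_arm(self):
        # Minimal index, ties broken at random (randamax of the negated indices)
        return randamax(-np.asarray(self.indices))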

Question 15

Compare LinearUCB and LinearIMED.
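
A possible comparison, mirroring the LinUCB experiment above (the number of runs and the horizon are arbitrary choices):

In [ ]:
bandit = generate_random_linear_bandit()

linucb = LinUCB(bandit, params={'init': 0, 'reg': 1.0, 'exploration_bonus': 1.0})
linimed = LinIMED(bandit, params={'init': 0, 'reg': 1.0, 'exploration_bonus': 1.0})

comparison = Experiment([linucb, linimed], bandit,
                        statistics={'mean': True, 'std': True, 'quantile': False, 'pulls': False},
                        complexity=False)
comparison.run(10, 1000)
comparison.plot()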