An improved upper bound on the expected regret of UCB-type policies for a matching-selection bandit problem

この論文をさがす

説明

We improved an upper bound on the expected regret of a UCB-type policy LLR for a bandit problem that repeats the following rounds: a player selects a maximal matching on a complete bipartite graph K M , N and receives a reward for each component edge of the selected matching. Rewards are assumed to be generated independently of its previous rewards according to an unknown fixed distribution. Our upper bound is smaller than the best known result (Chen et?al., 2013) by a factor of ? ( M 2 / 3 ) .

収録刊行物

参考文献 (9)*注記

もっと見る

関連プロジェクト

もっと見る

詳細情報 詳細情報について

問題の指摘

ページトップへ