We consider the problem of non-stationary reinforcement learning (RL) in the
infinite-horizon average-reward setting. We model it as a Markov Decision
Process with time-varying rewards and transition probabilities, under a
variation budget of $\Delta_T$. Existing non-stationary RL algorithms focus on
model-based and model-free value-based methods. Policy-based methods, despite
their flexibility in practice, are not theoretically well understood in
non-stationary RL. We propose and analyze the first model-free policy-based
algorithm, Non-Stationary Natural Actor-Critic (NS-NAC), a policy gradient
method with restart-based exploration to adapt to environment changes and a
novel interpretation of the learning rates as adapting factors. Further, we
present a bandit-over-RL-based, parameter-free algorithm, BORL-NS-NAC, that
does not require prior knowledge of the variation budget $\Delta_T$. We
establish a dynamic regret bound of
$\tilde{\mathscr O}(|S|^{1/2}|A|^{1/2}\Delta_T^{1/6}T^{5/6})$ for both
algorithms, where $T$ is the time horizon, and $|S|$, $|A|$ are the sizes of
the state and action spaces. The regret analysis leverages a novel adaptation
of the Lyapunov function analysis of NAC to dynamic environments and
characterizes the combined effects of simultaneous updates to the policy and
the value-function estimates and of changes in the environment.
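
For context, a standard formalization of the two quantities in the bound is sketched below; the paper's exact definitions (choice of norms, split of the budget, and the per-step optimum $J_t^*$) may differ in detail.
\[
\Delta_T \;\ge\; \sum_{t=1}^{T-1} \Big( \max_{s,a} \big| r_{t+1}(s,a) - r_t(s,a) \big| \;+\; \max_{s,a} \big\| P_{t+1}(\cdot \mid s,a) - P_t(\cdot \mid s,a) \big\|_1 \Big),
\qquad
\mathrm{Dyn\mbox{-}Reg}(T) \;=\; \sum_{t=1}^{T} \big( J_t^* - r_t(s_t, a_t) \big),
\]
where $J_t^*$ is the optimal long-run average reward of the MDP $(r_t, P_t)$ in effect at step $t$, and $(s_t, a_t)$ is the state-action pair visited by the algorithm.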
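
To illustrate the restart mechanism described above, the following is a minimal, hypothetical sketch of a restart-based tabular natural actor-critic loop on a slowly drifting MDP. The epoch length, step sizes, drift model, and the differential TD critic are illustrative assumptions, not the paper's NS-NAC algorithm or its tuning.

import numpy as np

rng = np.random.default_rng(0)
S, A, T = 5, 3, 20_000
epoch_len = 2_000                    # restart period (illustrative choice)
alpha, beta, eta = 0.05, 0.1, 0.01   # actor / critic / average-reward step sizes (two timescales)

# Slowly drifting MDP: rewards and transition kernels change a little at every step.
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
r = rng.uniform(size=(S, A))

def drift(P, r, scale=1e-4):
    """Apply a small random perturbation and re-project onto valid rewards/kernels."""
    r = np.clip(r + scale * rng.normal(size=r.shape), 0.0, 1.0)
    P = np.clip(P + scale * rng.normal(size=P.shape), 1e-6, None)
    return P / P.sum(axis=-1, keepdims=True), r

theta = np.zeros((S, A))   # softmax policy parameters (actor)
Q = np.zeros((S, A))       # differential action-value estimate (critic)
rho = 0.0                  # running estimate of the average reward
s = 0
for t in range(T):
    if t % epoch_len == 0:           # restart: reset actor and critic after accumulated drift
        theta[:], Q[:], rho = 0.0, 0.0, 0.0
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    a = rng.choice(A, p=probs)
    s_next = rng.choice(S, p=P[s, a])
    reward = r[s, a]
    # Critic: average-reward (differential) TD update, expected-SARSA style.
    probs_next = np.exp(theta[s_next] - theta[s_next].max())
    probs_next /= probs_next.sum()
    delta = reward - rho + probs_next @ Q[s_next] - Q[s, a]
    Q[s, a] += beta * delta
    rho += eta * delta
    # Actor: for tabular softmax policies, a natural-gradient step amounts to
    # adding the estimated advantages to the logits of the visited state.
    theta[s] += alpha * (Q[s] - probs @ Q[s])
    P, r = drift(P, r)               # the environment keeps changing between steps
    s = s_next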