
SQN

Overview

Soft Q-learning builds on several previous works, including Leverage the Average: an Analysis of KL Regularization in Reinforcement Learning, Soft Actor-Critic Algorithms and Applications, and Equivalence Between Policy Gradients and Soft Q-Learning, among others. It is an off-policy Q-learning algorithm that adopts the maximum entropy framework, establishing a connection between policy gradients and Q-learning.

Quick Facts

  1. SQN is implemented for environments with discrete action spaces (e.g. Atari, Go).

  2. SQN is an off-policy and model-free algorithm, and uses a Boltzmann policy for exploration.

  3. SQN is based on the Q-learning algorithm: it optimizes a Q-function and constructs a policy from it.

  4. SQN is implemented for multi-discrete action spaces as well.

Key Equations or Key Graphs

An entropy-regularized version of the RL objective can lead to better exploration and stability. The most general way to define the entropy-augmented return is

\[\sum_{t=0}^{\infty} \gamma^{t}\left(r_{t}-\tau \mathrm{KL}_{t}\right),\]

where \(\bar{\pi}\) is some “reference” policy, \(\tau\) is a “temperature” parameter, and \(\mathrm{KL}_{t}=D_{\mathrm{KL}}\left[\pi\left(\cdot \mid s_{t}\right) \| \bar{\pi}\left(\cdot \mid s_{t}\right)\right]\) is the Kullback-Leibler divergence from the reference policy at time \(t\). Note that the temperature \(\tau\) can be eliminated by rescaling the rewards. The definition of the Q-function then becomes:

\[Q^{\pi}(s, a)=\mathbb{E}\left[r_{0}+\sum_{t=1}^{\infty} \gamma^{t}\left(r_{t}-\tau \mathrm{KL}_{t}\right) \mid s_{0}=s, a_{0}=a\right]\]
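
For completeness (this is the standard soft Bellman backup in the KL-regularized / soft Q-learning formulation), the optimal Q-function has an associated soft state value, the temperature-scaled log-partition of the Q-values under the reference policy, which also appears as the normalizing constant of the Boltzmann policy below:

\[V^{*}(s)=\tau \log \mathbb{E}_{a \sim \bar{\pi}}\left[\exp \left(Q^{*}(s, a) / \tau\right)\right], \qquad Q^{*}(s, a)=\mathbb{E}\left[r_{0}+\gamma V^{*}\left(s_{1}\right) \mid s_{0}=s, a_{0}=a\right]\]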

Additionally, the optimal policy, called the Boltzmann policy, can be derived, with action probabilities varying as a graded function of the estimated value:

\[\begin{split}\begin{aligned} \pi^Q(\cdot \mid s) &=\underset{\pi}{\arg \max }\left\{\mathbb{E}_{a \sim \pi}[Q(s, a)]-\tau D_{\mathrm{KL}}[\pi \| \bar{\pi}](s)\right\} \\ &=\frac{\bar{\pi}(a \mid s) \exp (Q(s, a) / \tau)}{\underbrace{\mathbb{E}_{a^{\prime} \sim \bar{\pi}}\left[\exp \left(Q\left(s, a^{\prime}\right) / \tau\right)\right]}_{\text {normalizing constant }}} \end{aligned}\end{split}\]
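
As a concrete illustration (a minimal sketch, not the implementation shown later), with a uniform reference policy the Boltzmann policy reduces to a temperature-scaled softmax over the Q-values:

import torch
import torch.nn.functional as F

# Hypothetical Q-values for a single state with 4 discrete actions.
q_values = torch.tensor([1.0, 2.0, 0.5, -1.0])
tau = 0.5  # temperature

# With a uniform reference policy, pi^Q(a|s) = softmax(Q(s, a) / tau):
# higher-Q actions receive exponentially more probability mass.
boltzmann_pi = F.softmax(q_values / tau, dim=-1)
action = torch.multinomial(boltzmann_pi, 1)  # sample an action for exploration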

Generally speaking, we can use a uniform reference policy \(\bar{\pi} = \mathcal{U}\), so that \(r_t - \tau \mathrm{KL}_t = r_t + \tau \mathcal{H}_t - \tau \log N\), where \(\mathcal{H}_t\) is the entropy of \(\pi(\cdot \mid s_t)\) and \(N\) is the number of actions. The optimal policy therefore additionally aims to maximize its entropy at each visited state, and the TD target becomes:

\[y_t = r_t + \gamma\, \mathbb{E}_{\mathbf{a}' \sim \pi^{Q}}\left[Q_{\bar{\theta}}\left(\mathbf{s}', \mathbf{a}'\right)-\tau \log \pi^{Q}\left(\mathbf{a}' \mid \mathbf{s}'\right)\right]\]
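
The identity for the uniform reference policy follows directly from the definition of the KL divergence to a uniform distribution over \(N\) actions:

\[\mathrm{KL}_t=D_{\mathrm{KL}}\left[\pi\left(\cdot \mid s_t\right) \| \mathcal{U}\right]=\sum_{a} \pi\left(a \mid s_t\right) \log \frac{\pi\left(a \mid s_t\right)}{1 / N}=\log N-\mathcal{H}_t,\]

so \(r_t-\tau \mathrm{KL}_t=r_t+\tau \mathcal{H}_t-\tau \log N\), and the constant \(\tau \log N\) merely shifts the returns.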

Pseudocode

\begin{algorithm}[tp]
\setstretch{1.35}
\DontPrintSemicolon
\SetAlgoLined
\SetKwInOut{Input}{Input}\SetKwInOut{Output}{Output}
\Input{Initial Q-function parameters $\theta_1$, $\theta_2$ \\
        Temperature $\tau$ \\
        Empty replay buffer $\mathcal{D}$ \\
}

\textbf{Initialize: }
        $\overline{\theta}_1 \leftarrow \theta_1$, $\overline{\theta}_2 \leftarrow \theta_2$

\While(Train){not converge}{


        % \tcc{comments on code}
        \For(Collect){each environment step}{
                $a_t \sim \pi^{Q}(a_t|s_t)$ \\
                $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$ \\
                $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r_t, s_{t+1}, d_t)\}$
                }

        \For(Update){each gradient step}{
        $\{(s_i, a_i, r_i, s^\prime_i, d_i)\}^B_{i=1} \sim \mathcal{D}$

        Compute Q-loss $\mathcal{L}_Q(\theta)$


        $\theta \leftarrow \theta - \lambda_{\theta} \nabla_{\theta} \mathcal{L}_Q(\theta)$

        Compute temperature loss $\mathcal{L}(\tau)$

        $\tau \leftarrow \tau - \lambda_{\tau} \nabla_{\tau} \mathcal{L}(\tau)$

        % Update target network\\
        Update Target: \\
        $\overline{\theta}_j \leftarrow \rho {\overline{\theta}}_j + (1-\rho) {\theta}_j, \ \ \text{for} \ j \in \{1,2\}$
        }
}
\caption{SQN Algorithm}
\end{algorithm}
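
The pseudocode maps to a compact training loop. The following is a minimal, self-contained sketch under simplifying assumptions (a single Q-head, a fixed temperature, and random tensors standing in for the environment and replay buffer); all names here are illustrative and not the library's API:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy dimensions and hyperparameters (illustrative values only).
obs_dim, n_act, tau, gamma, rho = 8, 4, 0.2, 0.99, 0.995

q_net = nn.Linear(obs_dim, n_act)        # stand-in for the Q-network
target_q = nn.Linear(obs_dim, n_act)     # target network \bar{\theta}
target_q.load_state_dict(q_net.state_dict())
optim = torch.optim.Adam(q_net.parameters(), lr=3e-4)

for step in range(100):
    # (Collect) in practice: a_t ~ softmax(Q(s_t, .) / tau), store (s, a, r, s', d) in the buffer.
    s = torch.randn(32, obs_dim)          # fake batch "sampled from the replay buffer"
    a = torch.randint(0, n_act, (32,))
    r = torch.randn(32)
    s_next = torch.randn(32, obs_dim)
    d = torch.zeros(32)

    # (Update) soft TD target: r + gamma * E_{a'~pi}[Q_targ(s', a') - tau * log pi(a'|s')]
    with torch.no_grad():
        q_next = target_q(s_next)
        log_pi = F.log_softmax(q_next / tau, dim=-1)
        pi = log_pi.exp()
        v_next = (pi * (q_next - tau * log_pi)).sum(dim=-1)
        y = r + (1 - d) * gamma * v_next

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)
    optim.zero_grad()
    loss.backward()
    optim.step()

    # Soft target update: theta_bar <- rho * theta_bar + (1 - rho) * theta
    with torch.no_grad():
        for p_t, p in zip(target_q.parameters(), q_net.parameters()):
            p_t.mul_(rho).add_((1 - rho) * p)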

Extensions

SQN can be combined with:

  • SQN can use a separate policy network, which yields SAC-Discrete.

  • SQN is closely related to general regularized reinforcement learning, which can take many forms; our implementation uses automatic temperature adjustment and removes several unclear components (see Leverage the Average).

  • SQN constructs its policy from the Q-function using the Boltzmann policy, which is also called the softmax policy.

  • Some analyses draw a connection between Soft Q-learning and policy gradient algorithms, such as Equivalence Between Policy Gradients and Soft Q-Learning.

  • Some recent research treats RL as a problem of probabilistic inference, e.g. MPO and VMPO; these methods are closely related to SQN, SAC and the max-entropy framework, and this remains an active research area.

Implementation

Soft Q loss

# Target
with torch.no_grad():
    q0_targ = target_q_value[0]
    q1_targ = target_q_value[1]
    q_targ = torch.min(q0_targ, q1_targ)
    # discrete policy
    alpha = torch.exp(self._log_alpha.clone())
    # TODO use q_targ or q0 for pi
    log_pi = F.log_softmax(q_targ / alpha, dim=-1)
    pi = torch.exp(log_pi)
    # v = \sum_a \pi(a | s) (Q(s, a) - \alpha \log(\pi(a|s)))
    target_v_value = (pi * (q_targ - alpha * log_pi)).sum(dim=-1)
    # q = r + \gamma v
    q_backup = reward + (1 - done) * self._gamma * target_v_value
    # alpha_loss
    entropy = (-pi * log_pi).sum(dim=-1)
    expect_entropy = (pi * self._target_entropy).sum(dim=-1)

# Q loss
q0_loss = F.mse_loss(q0_a, q_backup)
q1_loss = F.mse_loss(q1_a, q_backup)
total_q_loss = q0_loss + q1_loss
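
Taking the element-wise minimum of the two target Q-heads is the clipped double-Q technique (as in TD3 and SAC), which mitigates overestimation of the soft target value; both heads are then regressed onto the same backup q_backup.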

Sample from policy

logits = output['logit'] / math.exp(self._log_alpha.item())
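# subtracting the row-wise max before the softmax is a numerical-stability measure; softmax is invariant to this shift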
prob = torch.softmax(logits - logits.max(dim=-1, keepdim=True).values, dim=-1)
pi_action = torch.multinomial(prob, 1)

Alpha loss

alpha_loss = self._log_alpha * (entropy - expect_entropy).mean()
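
Gradient descent on \(\log \alpha\) with this loss lowers the temperature \(\alpha\) (denoted \(\tau\) above) whenever the current policy entropy exceeds the target entropy and raises it otherwise, so the temperature is adjusted automatically rather than fixed by hand.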

Other Public Implementations

DRL