What is On-Policy and Off-Policy Reinforcement Learning? What are the key differences between them?

To start with the basics, a reinforcement learning setup includes four components: an agent, a policy, a reward signal, and a value function.

The policy can be defined as the mapping that links perception to action in an environment: given what the agent observes, the policy tells it what to do. The agent's behaviour is defined entirely in terms of its policy. Now, let's discuss both kinds of policies.
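As a rough illustration, a tabular policy can be as simple as a value-table lookup plus an action-selection rule. The sketch below is a minimal epsilon-greedy policy over a hypothetical `q_values` array indexed by state and action; the names and hyperparameters are illustrative, not part of any particular library.

```python
import numpy as np

def epsilon_greedy(q_values, state, epsilon=0.1):
    """Map a state to an action: mostly exploit the best known action,
    occasionally explore a random one."""
    n_actions = q_values.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(q_values[state]))    # exploit
```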

What is Off-Policy?

The agent learns the optimal policy with the help of a greedy policy. Q-learning is off-policy because it estimates the value of future actions using the greedy (maximizing) action in the new state, even though the agent is not required to follow that greedy policy when it actually acts.

An off-policy method is independent of the agent's own actions: it figures out the optimal policy regardless of how the agent behaves while collecting experience. Example: Q-learning is an off-policy learner.
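A minimal tabular sketch of the Q-learning update, assuming a numpy Q-table indexed by state and action (the function name and the alpha/gamma defaults are illustrative). Note that the bootstrap target uses the maximum over next actions, which is what makes the update off-policy.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update: the target bootstraps from the greedy (max)
    action in the next state, regardless of which action the behaviour
    policy will actually take there."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```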

What is On-Policy?

When the policy used for updating and the policy used for acting are the same, the method is known as on-policy. Here the agent learns the optimal policy and uses that same policy to act.

The SARSA algorithm is on-policy because it estimates the value of the policy that is actually being followed (see the sketch after this list). SARSA takes its name from the tuple S, A, R, S', A', which stands for:

  • Current state S,
  • Current action A,
  • Reward R,
  • New state S’ and
  • Future action A’
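A minimal tabular sketch of the SARSA update, assuming the same kind of numpy Q-table as above (names and defaults are illustrative). The only change from Q-learning is the bootstrap target: it uses the action A' that the agent's own policy actually selected in the next state.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update: the target bootstraps from the next action
    a_next chosen by the policy the agent is actually following."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```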

How are these two different?

Differentiating the two kinds of reinforcement learning methods through hyperparameter optimization is expensive and too speculative. Instead, we can evaluate the performance of these algorithms through the policy's interactions with the target environment.

These interactions can help get insights about the kind of policy that the agent is implementing.

  • On-policy methods can be used when we want to optimize the value of an agent that is actively exploring, whereas off-policy reinforcement learning may be more appropriate when the agent does not explore much.
  • On-policy methods attempt to evaluate or improve the very policy that is used to make decisions, while off-policy methods evaluate or improve a policy different from the one used to generate the data.

Off-policy methods are good at predicting movement in robotics and can be very cost-effective when deploying reinforcement learning in real-world scenarios.

Evaluation becomes challenging when there is too much to explore. In such settings, an off-policy evaluation method is often assumed to be more accurate than on-policy evaluation for assessing performance.

But agents trained from past experience may act very differently from newly learned agents, which makes it hard to get good estimates of performance.