Introduction to Reinforcement Learning in R

Motivation for the study

The tremendous increase in the application of robotic intelligence, together with the growing consolidation of everyday human activities and transactions with automation, motivated the development of this article. Reinforcement learning is an aspect of machine learning in which an agent interacts with an environment, performing specific actions and receiving rewards with respect to prior actions. The R programming language will be used to develop the systems considered in this study.

Figure 1: The agent-environment interaction in a Markov decision process

Over time, there have been several applications of this technique in fascinating fields of study such as:

i. Bidding and Advertising

ii. Games

iii. Traffic light control

iv. Robotics

v. Chemistry

vi. Resources management in computer clusters

vii. Healthcare

viii. Finance, etc.

In this article, our focus will be on how reinforcement learning makes decisions when solving some specific problems, from both stationary and non-stationary perspectives. Sit back and enjoy!

One distinguishing factor between reinforcement learning and other machine learning techniques is the use of evaluative feedback instead of the classical instructive feedback used in other forms of learning. Evaluative feedback in RL indicates how good the action taken was, but not whether that action was the best or worst one possible. Instructive feedback, by contrast, signals the correct action irrespective of the action actually taken, and this is the basis for pattern classification, artificial neural networks, etc. Distinctively, evaluative feedback depends on the actions previously taken, while instructive feedback is independent of them.
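To make the distinction concrete, here is a toy illustration in R; the action names and values are invented for this sketch and are not tied to any particular library or environment.

```r
# Toy illustration: evaluative vs. instructive feedback
# (the actions and numbers below are made up for this sketch).

actions <- c("left", "right", "up", "down")
chosen  <- "left"                 # the action the agent actually took

# Evaluative feedback: a scalar score for the chosen action only,
# with no indication of whether another action would have been better.
evaluative_feedback <- 5          # e.g. the reward received after moving left

# Instructive feedback: the correct action itself, given independently
# of what the agent chose (as in supervised pattern classification).
instructive_feedback <- "up"
```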

In this article, we focus on the evaluative part of reinforcement learning. Studying this aspect enables us to understand how important it is to RL, and how it can be integrated with instructive feedback to design fast and intelligent systems.

For instance, imagine the rabbit-tiger environment in figure 1 above. It consists of a rabbit in an environment where pellets and a carrot are scattered on a grid, and a tiger waiting in a corner for the rabbit to come closer so it can prey on it. The rabbit's aim is to maximize its expected reward over some period of time, for instance 50 actions (steps). When the rabbit eats a pellet, its reward is +5; when it eats the carrot, +10; being eaten by the tiger means death; but if it eventually eats the pellets and the carrot within the step limit without being eaten by the tiger, it moves to the next phase (regarded as success). For each selected action, the agent receives a numerical reward. The RL problem in this context can be modeled as:

1. The agent (rabbit) is present in a state in the environment, and based on the entry state, say S0, the agent takes an action A0.

2. The environment transitions to a new state S1 based on the selected action A0, and the agent takes another action A1.

3. As the agent moves closer to eating pellets (+5) and closer to eating the carrot (+10), the environment continues to transition, eventually placing the rabbit in a difficult position where it either struggles to finish the phase or is eaten by the tiger (-100), in which case the episode ends (the rabbit's lifespan is cut short) and its chances of survival are reduced, as displayed in figure 2. A minimal sketch of this loop in R is shown below.
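The following is a minimal sketch of the agent-environment loop in R, assuming the reward values above (+5, +10, -100); the action set, transition behaviour, and reward probabilities are invented placeholders rather than a faithful model of the grid world.

```r
# Minimal agent-environment loop for the rabbit-tiger example.
# Rewards follow the text above; the probabilities are made up.
set.seed(42)

actions <- c("up", "down", "left", "right")

# Hypothetical reward draw: +5 pellet, +10 carrot, -100 eaten by tiger, 0 nothing.
draw_reward <- function() {
  sample(c(5, 10, -100, 0), size = 1, prob = c(0.30, 0.10, 0.05, 0.55))
}

total_reward <- 0
for (t in 1:50) {                    # 50 actions (steps), as in the example
  action <- sample(actions, 1)       # placeholder policy: act at random
  reward <- draw_reward()
  total_reward <- total_reward + reward
  if (reward == -100) break          # eaten by the tiger: the episode ends
}
total_reward
```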

In the rabbit-tiger context, each action has an expected (mean) reward given that the action is selected. In this study, we refer to this as the value of that action. Let us represent the action selected at time step t as At and the corresponding reward as Rt. Then the value of an arbitrary action a, denoted q*(a), is the expected reward given that a is chosen:

q*(a) = E[Rt | At = a]

Assuming we knew the value of each action, we would always select the action with the highest value. This is referred to as the greedy action. When a greedy action is selected, we say that the current knowledge of the values is being exploited. Otherwise, when a non-greedy action is selected, we say the agent is exploring, because this helps to improve the estimates of the non-greedy actions' values. Exploitation is better for maximizing the expected reward on the next step, but exploration may produce a greater total reward in the long run. For example, in the rabbit-tiger context, suppose the nearest pellet to the left of the grid corresponds to the greedy action, while several other actions (the pellet to the right, which puts the rabbit in a difficult position to survive; the carrot almost next to the tiger in front; etc.) are estimated to be almost as good but are surrounded by uncertainty.
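As a small illustration, greedy selection simply picks the action with the highest current estimate. The estimated values below are invented for this sketch:

```r
# Hypothetical estimated action values (numbers made up for illustration).
q_est <- c(pellet_left = 4.8, pellet_right = 4.5,
           carrot_front = 4.6, wait = 0.2)

# Greedy action: exploit current knowledge by taking the highest estimate.
greedy_action <- names(q_est)[which.max(q_est)]
greedy_action   # "pellet_left"
```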

One of these uncertain actions (say, the carrot almost next to the tiger in front) might actually be better than the greedy action, but the agent is unaware of it. If there are enough time steps left to select actions, it may be better to explore the non-greedy actions and discover which of them is better than the greedy action. Note that reward is comparatively lower in the next step (the short run) during exploration, but can be much higher over later steps (the long run), because once the better actions have been discovered, they can be exploited many times. Whether to explore or exploit depends on three factors: the uncertainties, the estimated values, and the number of steps left. Since it is impossible to both explore and exploit with a single selected action, there is a conflict between exploration and exploitation. There are many techniques to strike a balance between the two, which will be discussed extensively in our next article; one common example is sketched below.
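One common balancing technique is epsilon-greedy selection: with a small probability epsilon the agent explores a random action, and otherwise it exploits the greedy one. The sketch below pairs it with sample-average value estimates; the "true" mean rewards are invented for illustration, and the details are left to the next article.

```r
# Epsilon-greedy action selection with sample-average estimates.
# The true mean rewards are made up and unknown to the agent.
set.seed(1)
true_means <- c(5, 4.5, 6, 0)
epsilon    <- 0.1
n_steps    <- 1000

q_est  <- rep(0, length(true_means))   # estimated action values
counts <- rep(0, length(true_means))   # times each action was selected

for (t in 1:n_steps) {
  if (runif(1) < epsilon) {
    a <- sample(seq_along(q_est), 1)   # explore: pick a random action
  } else {
    a <- which.max(q_est)              # exploit: pick the greedy action
  }
  r <- rnorm(1, mean = true_means[a])  # noisy reward from the chosen action
  counts[a] <- counts[a] + 1
  q_est[a]  <- q_est[a] + (r - q_est[a]) / counts[a]   # incremental mean
}
q_est   # the estimates should approach the true means, with action 3 best
```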

I hope this article has been of help to you in some way; look out for the continuation in the next part of this series.

It is good to note that understanding the framework is important and necessary before diving into writing code.

My name is Ogundepo Olumide Benjamin; thanks for reading. Hope to see you in the next episode.
