Reinforcement learning is a fascinating learning paradigm because it maps a series of inputs to outputs with dependencies between them, rather than mapping a single input to a single output (Markov decision processes, for example).
Reinforcement learning takes place in the context of environmental states and the actions taken at each state.
During the learning process, the algorithm explores state-action pairs in a given environment at random (building a table of state-action rewards), then uses that table in practice to choose, for each state, the action most likely to lead to a goal state.
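Once the table exists, the "use the table in practice" step is just a greedy lookup over the learned rewards. A minimal sketch in Python (the table values and the state/action names here are purely illustrative, not learned from a real environment):

```python
def best_action(q_table, state, actions):
    """Greedy choice: the action with the highest learned reward for this state."""
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

# An illustrative state-action reward table as it might look after exploration.
q_table = {("low", "hit"): 0.9, ("low", "stand"): 0.1,
           ("high", "hit"): 0.05, ("high", "stand"): 0.95}

print(best_action(q_table, "low", ["hit", "stand"]))   # → hit
print(best_action(q_table, "high", ["hit", "stand"]))  # → stand
```

Unseen state-action pairs default to a reward of 0.0, so the lookup never fails; a real agent would also need an exploration strategy to fill the table in the first place.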
Consider the case of a simple blackjack agent.
- The states represent the player's card total. The actions reflect what a blackjack player might do in this scenario, such as hit or stand.
- The agent would be taught to play by dealing many hands of blackjack, with a reward for winning or losing attached to each state-action pair. For example, a state of 10 might have a value of 1.0 for hit and 0.0 for stand (indicating that hit is the optimal choice).
- The learned reward for state 20 would most likely be 0.0 for hit and 1.0 for stand. A less straightforward hand, such as a state of 17, might have action values of 0.95 for stand and 0.05 for hit.
- Following these probabilities, the agent would then stand 95% of the time and hit 5% of the time. These rewards would be computed across many hands of blackjack, indicating the best option for a specific state.
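The learning loop described above can be sketched as a short Monte Carlo estimate over a toy blackjack environment. This is a simplified sketch, not real blackjack: only the player's total matters, aces are ignored, and the win condition (standing on 18-21) is an assumption standing in for beating a dealer:

```python
import random
from collections import defaultdict

ACTIONS = ("hit", "stand")

def draw_card():
    # Cards 2-10; tens are over-weighted because J, Q, K also count as 10.
    # Aces are ignored entirely -- an assumption to keep the sketch short.
    return min(random.randint(2, 14), 10)

def play_hand(policy):
    """Play one hand; return the visited (state, action) pairs and the
    final reward (1.0 for a win, 0.0 for a loss)."""
    total = draw_card() + draw_card()
    visited = []
    while True:
        action = policy(total)
        visited.append((total, action))
        if action == "stand":
            # Assumed win condition: standing on 18-21 beats the house.
            return visited, 1.0 if 18 <= total <= 21 else 0.0
        total += draw_card()
        if total > 21:          # bust
            return visited, 0.0

def learn(num_hands=50_000, seed=0):
    """Monte Carlo estimate of the average reward per state-action pair,
    using a purely random policy for exploration."""
    random.seed(seed)
    totals, counts = defaultdict(float), defaultdict(int)
    for _ in range(num_hands):
        visited, reward = play_hand(lambda total: random.choice(ACTIONS))
        for pair in visited:
            totals[pair] += reward
            counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}

q_table = learn()
print("state 10:", q_table[(10, "hit")], q_table[(10, "stand")])
print("state 20:", q_table[(20, "hit")], q_table[(20, "stand")])
```

After enough hands, the table shows the pattern from the bullets above: at a total of 20, standing earns the full reward and hitting earns nothing, while at low totals hitting scores better than standing.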
In contrast to supervised learning, where a critic grades every example, in reinforcement learning the critic provides a reward only when the goal state is reached (having a hand with a state of 21).
