Deep Reinforcement Learning - Write an AI to play Pong with Q learning

In this post, we will implement Q learning to play Pong.
By the end of this post, you will be able to

  1. Design your own game with the Python Pygame library.
  2. Understand the basics of Q learning.
  3. Implement an efficient policy for the agent.

To follow this tutorial, it is highly recommended to have at least a little experience with

  1. Python
  2. Backpropagation
  3. Linear algebra
  4. Matrices

If you know the basics of these then we can move on.

I am using Python 3.5, and for the coding part I am using Sublime Text 3, but you can also use the default Python IDLE editor.

Before starting, we need to install the pygame library. To do that, open the folder where Python is installed, go to the Scripts folder, and open a command prompt from that location.

Now type the following:

  pip install pygame 

Let it finish downloading, then type

  pip install numpy 

Now let's get to the problem.

The Pong game basically has a rectangular bar with which we have to bounce the ball every time it approaches. If the bar misses the ball, the reward is -1; otherwise it is +1.

from pygame.locals import *
This imports all the packages from the pygame library

import numpy as np
This imports the numpy library and renames it to 'np' for easy coding.

import pygame as pg
This imports the pygame library and renames it to 'pg' for easy coding.

import random
This imports the random library in order to generate some random numbers.

import time
This imports the time library which I will use here to calculate the time taken to learn from experience.

start = time.time() 
The variable 'start' is storing the initial time at which the script was loaded.

FPS = xxx
A high FPS value makes the game run faster and a low value makes it slower, in terms of frames. A high FPS will make your agent learn in less wall-clock time, in case you lack patience ;)

fpsClock = pg.time.Clock()
It creates a Clock object that tracks time and lets us cap the frame rate.

pg.init()
This initializes all the pygame modules.

window = pg.display.set_mode((800,600))
It will create a window 800 pixels wide and 600 pixels tall. Change according to your desire.

pg.display.set_caption('Q learning Example')
It will display 'Q learning Example' on the title bar

Left = 400
The x-coordinate of the bar's left edge

Top = 570
The y-coordinate of the bar's top edge

Width = 100
Width of the rectangular bar

Height = 20
Height of the rectangular bar

LR = 0.01
Y = 0.99
The learning rate (LR) and the discount factor gamma (Y)

Black = (0, 0, 0)
White = (255, 255, 255)
Green = (0, 255, 0)
RGB values of the black, white and green colours

rct = pg.Rect(Left, Top, Width, Height)
It creates a rectangle object from the pygame library and stores the coordinates specified by the left, top, width and height values.

storage = {}
It will store the value of each state.

action = 2
It defines the action of the agent: 2 stands for right, 1 stands for left and 0 stands for rest.

jumpY = 6
jumpX = 8
Number of pixels the ball jumps per frame along the horizontal x-axis (jumpX) and the vertical y-axis (jumpY)

Q = np.zeros([25000, 3])
This creates a numpy array with 25000 rows and 3 columns. Each of the three columns corresponds to an action and each row corresponds to a state. Each cell stores the Q value of taking that action in that state.

cenX = 10
cenY = 50
radius = 10
score = 0
missed = 0
reward = 0
cenX and cenY store the coordinates of the centre of the circle, radius is the circle's radius, and the rest track the score, the reward, and the number of times the rectangular bar has missed the ball ('missed').

The calculate_store function calculates the reward: it returns 1 if the ball lands on the rectangular bar, or -1 if the bar fails to deflect it. Whenever the rectangular bar misses the ball, the game regenerates the ball at a random location; that random x-coordinate is determined by the newXforCircle function.
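As a rough illustration of the two functions described above (the exact signatures in the full code may differ; the argument names here are assumptions), a minimal sketch could look like this:

```python
import random

def calculate_store(barLeft, barTop, barWidth, ballX, ballY, radius):
    # Hypothetical signature: reward +1 when the ball touches the bar,
    # -1 when it slips past the bar.
    if barLeft <= ballX <= barLeft + barWidth and ballY + radius >= barTop:
        return 1
    return -1

def newXforCircle():
    # Respawn the ball at a random x position inside the 800-pixel-wide window.
    return random.randint(20, 780)
```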

The State class stores the location of the rectangular bar: it holds the bar's coordinates as well as the coordinates of the circle. The Circle class stores the coordinates of the circle's centre.
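A minimal sketch of these two classes, assuming the layout described above (the attribute names are guesses based on the variables defined earlier):

```python
class Circle:
    # Stores the centre coordinates of the ball.
    def __init__(self, cenX, cenY):
        self.cenX = cenX
        self.cenY = cenY

class State:
    # Bundles the bar's rectangle together with the ball's position.
    def __init__(self, rect, circle):
        self.rect = rect
        self.circle = circle
```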

The convert function converts a state into a number, and this number is used as the row index into the numpy array Q, among its 25000 rows. The max function returns the index of the maximum value present in that row.
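The post does not show the exact encoding, but one hypothetical way to map a state to one of the 25000 rows is to discretise the bar and ball positions into bins and combine them into a single index, for example:

```python
def convert(barLeft, cenX, cenY):
    # One possible encoding (not necessarily the author's exact formula):
    # 8 bar slots of 100 px, and a 50x50 grid of 16-px cells for the ball,
    # giving 8 * 50 * 50 = 20000 distinct indices, which fits in 25000 rows.
    bar = barLeft // 100           # 0..7
    bx = min(cenX // 16, 49)       # 0..49
    by = min(cenY // 16, 49)       # 0..49
    return bar * 2500 + bx * 50 + by
```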

The action function returns the index of the action (0, 1 or 2) with the maximum Q value for the current state; the argmax function returns the indices of the maximum values along a given axis. The afteraction function takes the current state and the action taken in that state and returns the next state. For example, if the rectangle's x-coordinate is 200 and the action is 2 (move right), then in the next state it will be 200 + 100, which is 300.
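Sketched in code, assuming a step of one bar-width (100 px) as in the example above, these two functions might look like:

```python
import numpy as np

Q = np.zeros([25000, 3])

def action(stateIdx):
    # argmax over the row: the column (0 = rest, 1 = left, 2 = right)
    # with the highest Q value for this state.
    return int(np.argmax(Q[stateIdx]))

def afteraction(left, act):
    # Assumed step of one bar-width: 2 moves right, 1 moves left, 0 rests.
    if act == 2:
        return left + 100
    if act == 1:
        return left - 100
    return left
```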

The newRect function returns a new rectangle with updated coordinates based on the action taken. If the rectangle is at the right border of the window (800 pixels), it returns the original rectangle; otherwise it returns an updated rectangle moved 100 pixels to the right. Similarly, if the rectangle is at the left border of the window (0 pixels), it returns the original rectangle; otherwise it returns an updated rectangle moved 100 pixels to the left.
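A sketch of that boundary logic, working on the bar's left coordinate for simplicity (the real function presumably returns a pygame Rect; the parameter names here are assumptions):

```python
def newRect(left, act, width=100, windowW=800):
    # Stay put at the window edges, otherwise shift by 100 px.
    if act == 2:                       # move right
        if left + width >= windowW:
            return left                # already touching the right border
        return left + 100
    if act == 1:                       # move left
        if left <= 0:
            return left                # already touching the left border
        return left - 100
    return left                        # act == 0: rest
```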

Quite Simple isn't it? :)

Now coming to the training and the infinite loop part. Hold your horses for it's a bit long.

#The for loop at line 2 must be present whenever you are making a game with
#the pygame library. np.savetxt() saves the Q-value matrix. COLL stores
#random RGB values for the ball, which change whenever the ball strikes the
#rectangular bar.
#window.fill() fills the entire window with a given RGB colour value.
#The if-else block describes what happens whenever the ball hits any of the
#edges: the top, the bottom, the left side (0 pixels) and the right side
#(800 pixels). It defines the behaviour of the ball, i.e. how and in which
#direction it bounces, by updating the values of the rectangle and the
#circle, i.e. by calling the respective functions.
#The Q update is the engine at work here; it is the most important part of
#Q learning. The update rule follows the Bellman equation.

#It states:

Q(s, a) = Q(s, a) + lr*[R + y*max(Q(s', a')) - Q(s, a)]

# where Q(s, a) is the current Q value of state s and action a
# lr is the learning rate
# y is the gamma (discount factor)
# R is the immediate reward of that action
# s' and a' represent the next state and its action

Take an example where the rectangle coordinates are

Left = 400 Top = 400 Height = 30 Width = 100

This will be stored in the self.rect variable of the State class. Similarly, the centre coordinates of the circle will be stored in a variable in the State class. Then this state is converted into a number, i.e. each state is assigned a number. This number is the index into the Q table. Hence whenever the agent faces a state that is already in the Q table, it calculates the argmax of that row and returns the index with the maximum Q value. The action (Q-table column) with the maximum value tells the agent about the reward it has so far received by taking that action in that state. So it is pretty easy to see that the maximum value reflects the maximum expected reward for that action.
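For instance, with made-up Q values for a single state's row, the argmax lookup works like this:

```python
import numpy as np

# Hypothetical Q values for one state: columns are actions 0 (rest),
# 1 (left) and 2 (right).
Q_row = np.array([0.2, -0.1, 0.7])

best = int(np.argmax(Q_row))   # index of the largest value: action 2, move right
```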

For the full code, click here.

Eva :)
