SHANGHAI JIAO TONG UNIVERSITY
Project Title: Playing the Game of Flappy Bird with Deep Reinforcement Learning
Group Number: G-07
Group Members: Wang Wenqing, Gao Xiaoning, Qian Chen 11603

Contents
1 Introduction
2 Deep Q-learning Network
2.1 Q-learning
2.1.1 Reinforcement Learning Problem
2.1.2 Q-learning Formulation
2.2 Deep Q-learning Network
2.3 Input Pre-processing
2.4 Experience Replay and Stability
2.5 DQN Architecture and Algorithm
3 Experiments
3.1 Parameters Settings
3.2 Results Analysis
4 Conclusion
5 References

Playing the Game of Flappy Bird with Deep Reinforcement Learning

Abstract
Letting machines play games has been one of the popular topics in AI today. Using game theory and search algorithms to play games requires specific domain knowledge and lacks scalability. In this project, we use a convolutional neural network to represent the environment of the game and update its parameters with Q-learning, a reinforcement learning algorithm. We call this overall algorithm deep reinforcement learning, or Deep Q-learning Network (DQN). Moreover, we only use the raw images of the game Flappy Bird as the input of the DQN, which guarantees scalability to other games. After training with some tricks, the DQN can greatly outperform human players.
1 Introduction
Flappy Bird has been a popular game around the world in recent years. The player's goal is to guide the bird on the screen through the gaps between pairs of pipes by tapping the screen. When the player taps the screen, the bird jumps up; when the player does nothing, the bird falls at a constant rate. The game is over when the bird crashes into a pipe or the ground, and the score increases by one each time the bird passes through a gap. Figure 1 shows three different states of the bird: (a) the normal flight state, (b) the crash state, and (c) the passing state.
Figure 1: (a) normal flight state (b) crash state (c) passing state
Our goal in this paper is to design an agent that plays Flappy Bird automatically with the same input a human player receives, which means that we use only raw images and rewards to teach our agent how to play this game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.
In recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning extracts high-dimensional features from raw images. It is therefore natural to ask whether deep learning can also be used in reinforcement learning. However, there are four challenges in doing so. First, most successful deep learning applications to date have required large amounts of hand-labelled training data, whereas RL algorithms must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. Second, the delay between actions and resulting rewards, which can be thousands of time steps long, seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning. Third, most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviors, which can be problematic for deep learning methods that assume a fixed underlying distribution. This paper will demonstrate that a Convolutional Neural Network (CNN) can overcome these challenges and learn successful control policies from raw image data in the game Flappy Bird. The network is trained with a variant of the Q-learning algorithm [6]. Using the Deep Q-learning Network (DQN), we construct an agent that makes the right decisions in Flappy Bird based purely on consecutive raw images.
2 Deep Q-learning Network
Recent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets.
By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network that operates directly on raw images and efficiently updates its parameters using stochastic gradient descent. In the following section, we describe the Deep Q-learning Network (DQN) algorithm and how its model is parameterized.
2.1 Q-learning
2.1.1 Reinforcement Learning Problem
Q-learning is a specific reinforcement learning (RL) algorithm. As Figure 2 shows, an agent interacts with its environment in discrete time steps. At each time $t$, the agent receives a state $s_t$ and a reward $r_t$. It then chooses an action $a_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $s_{t+1}$, and the reward $r_{t+1}$ associated with the transition is determined [4].
Figure 2: Traditional reinforcement learning scenario
The goal of the agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection.
Note that in order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize its future income), even though the immediate reward associated with such an action might be negative [5].
2.1.2 Q-learning Formulation [6]
In a Q-learning problem, the set of states and actions, together with the rules for transitioning from one state to another, make up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:
$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$
Here $s_i$ represents the state, $a_i$ is the action and $r_{i+1}$ is the reward received after performing the action. The episode ends with the terminal state $s_n$. To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. Define the total future reward from time point $t$ onward as:
$R_t = r_t + r_{t+1} + r_{t+2} + \ldots + r_n$    (1)
In order to ensure convergence and to balance the immediate reward against the future reward, the total reward must use a discounted future reward:
$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{n-t} r_n$    (2)
Here $\gamma$ is the discount factor between 0 and 1: the further into the future a reward lies, the less we take it into consideration. Transforming equation (2) gives:
$R_t = r_t + \gamma R_{t+1}$    (3)
In Q-learning, define a function $Q(s_t, a_t)$ representing the maximum discounted future reward when we perform action $a_t$ in state $s_t$:
$Q(s_t, a_t) = \max R_{t+1}$    (4)
It is called the Q-function because it represents the "quality" of a certain action in a given state. A good strategy for an agent is to always choose the action that maximizes the discounted future reward:
$\pi(s) = \arg\max_a Q(s, a)$    (5)
Here $\pi$ represents the policy, the rule by which we choose an action in each state. Given a transition $(s, a, r, s')$, equation (3) yields the following Bellman equation: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state:
$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$    (6)
The only way to collect information about the environment is by interacting with it. Q-learning is the process of learning the optimal function $Q(s, a)$, which is stored as a table over states and actions. The overall procedure is Algorithm 1:
Algorithm 1 Q-learning
Initialize Q[num_states, num_actions] arbitrarily
Observe initial state s0
Repeat
    Select and carry out an action a
    Observe reward r and new state s'
    Q[s, a] := Q[s, a] + α (r + γ max_a' Q[s', a'] - Q[s, a])
    s := s'
Until terminated
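As an illustration of Algorithm 1 (not the original implementation), tabular Q-learning can be written in Python as follows; the environment interface env.reset() and env.step(action), as well as the hyper-parameter values, are assumptions made for this sketch:

import random
from collections import defaultdict

def tabular_q_learning(env, num_actions, episodes=1000,
                       alpha=0.1, gamma=0.95, epsilon=0.1):
    # Q-table: maps a (hashable) state to a list of action values.
    Q = defaultdict(lambda: [0.0] * num_actions)
    for _ in range(episodes):
        state = env.reset()
        terminated = False
        while not terminated:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.randrange(num_actions)
            else:
                action = max(range(num_actions), key=lambda a: Q[state][a])
            next_state, reward, terminated = env.step(action)
            # Update toward r + gamma * max_a' Q[s', a'], cf. equation (6)
            target = reward if terminated else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q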
2.2 Deep Q-learning Network
In Q-learning, the state space is often too big to fit into main memory. A single binary game frame of 80x80 pixels already has $2^{80 \times 80}$ possible states, which is impossible to represent with a Q-table. What is more, when tabular Q-learning encounters a state it has not seen during training, it can only perform a random action; it does not generalize. To overcome these two problems, we approximate the Q-table with a convolutional neural network (CNN) [7][8]. This variation of Q-learning is called the Deep Q-learning Network (DQN) [9][10]. After training, the multilayer neural network approximates the optimal Q-table:
$Q(s, a; \theta) \approx Q^*(s, a)$    (7)
For playing Flappy Bird, the screenshot $s_t$ is fed into the CNN, and the outputs are the Q-values of the actions, as shown in Figure 3.
Figure 3: In DQN, the CNN's input is the raw game image while its outputs are the Q-values Q(s, a), with one output neuron corresponding to one action's Q-value.
In order to update the CNN's weights, we define the cost function and gradient update rule as [9][10]:
$L(\theta) = \mathbb{E}\left[\left(y - Q(s, a; \theta)\right)^2\right]$    (8)
$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$    (9)
$\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$    (10)
Here $\theta$ are the DQN parameters that get trained and $\theta^-$ are the non-updated (frozen) parameters used for the target Q-value. During training, equation (10) is used to update the weights of the CNN.
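As a minimal illustration of equations (8) and (9), the snippet below computes the target y and the squared error for a single transition; the callables q_theta and q_theta_minus (returning the vector of Q-values for a state) are assumed for this example and are not part of the original report:

import numpy as np

def td_target(r, s_next, terminal, q_theta_minus, gamma=0.99):
    # Equation (9): y = r for terminal states, otherwise r + gamma * max_a' Q(s', a'; theta-)
    if terminal:
        return r
    return r + gamma * np.max(q_theta_minus(s_next))

def td_loss(s, a, r, s_next, terminal, q_theta, q_theta_minus, gamma=0.99):
    # Equation (8): squared error between the target y and the estimate Q(s, a; theta)
    y = td_target(r, s_next, terminal, q_theta_minus, gamma)
    return (y - q_theta(s)[a]) ** 2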
Meanwhile, obtaining the optimal reward in every episode requires a balance between exploring the environment and exploiting experience. The ε-greedy approach achieves this: during training, a random action is selected with probability ε, and otherwise the optimal action $\arg\max_a Q(s, a; \theta)$ is chosen. The ε anneals linearly towards zero as the number of updates increases.
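A minimal sketch of ε-greedy selection with linear annealing is given below; the 0.1 to 0.001 range follows Section 3.1, while the annealing horizon EXPLORE_STEPS is a placeholder, and q_values is assumed to be the CNN's output for the current state:

import random
import numpy as np

INITIAL_EPSILON = 0.1      # from Section 3.1
FINAL_EPSILON = 0.001      # from Section 3.1
EXPLORE_STEPS = 1_000_000  # placeholder annealing horizon

def select_action(q_values, step):
    # Linearly anneal epsilon from INITIAL_EPSILON to FINAL_EPSILON over EXPLORE_STEPS updates.
    frac = min(step / EXPLORE_STEPS, 1.0)
    epsilon = INITIAL_EPSILON + frac * (FINAL_EPSILON - INITIAL_EPSILON)
    if random.random() < epsilon:
        return random.randrange(len(q_values))   # explore
    return int(np.argmax(q_values))              # exploit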
2.3 Input Pre-processing
Working directly with raw game frames, which are RGB images, can be computationally demanding, so we apply a basic pre-processing step aimed at reducing the input dimensionality.
Figure 4: Pre-processing of game frames. First convert the frames to gray-scale images, then down-sample them to a fixed size. Afterwards, convert them to binary images, and finally stack up the last 4 frames as a state.
In order to improve the accuracy of the convolutional network, the background of the game is removed and substituted with a pure black image to remove noise. As Figure 4 shows, the raw game frames are pre-processed by first converting their RGB representation to gray-scale and down-sampling it to an 80x80 image. The gray image is then converted to a binary image. In addition, the last 4 game frames are stacked up as one state for the CNN. The current frame is overlapped with the previous frames at slightly reduced intensities, and the intensity decreases the farther we move from the most recent frame. Thus, the input gives good information about the trajectory the bird is currently on.
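One way to implement this pipeline with OpenCV and NumPy is sketched below; the 80x80 resolution follows Table 1, while the threshold value and the frame-stacking details are assumptions rather than the authors' exact settings:

import cv2
import numpy as np

def preprocess_frame(frame_rgb):
    # RGB -> gray-scale, down-sample to 80x80, then binarize.
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (80, 80))
    _, binary = cv2.threshold(small, 1, 255, cv2.THRESH_BINARY)
    return binary.astype(np.float32) / 255.0

def make_state(frame_rgb, previous_state=None):
    # Stack the last 4 processed frames into one 80x80x4 state.
    frame = preprocess_frame(frame_rgb)
    if previous_state is None:
        return np.stack([frame] * 4, axis=2)
    # Drop the oldest frame and append the newest one.
    return np.concatenate([previous_state[:, :, 1:], frame[:, :, None]], axis=2)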
2.4 Experience Replay and Stability
By now we can estimate the future reward in each state using Q-learning and approximate the Q-function with a convolutional neural network. However, approximating Q-values with non-linear functions is not very stable. In Q-learning, the experiences are recorded sequentially and are therefore highly correlated. If they are used in that order to update the DQN parameters, the training process might get stuck in a poor local minimum or even diverge. To ensure the stability of DQN training, we use a technique called experience replay. During game play, a certain number of experiences are stored in a replay memory. When training the network, random mini-batches from the replay memory are used instead of only the most recent transition. This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum. As a result of this randomness in the choice of the mini-batch, the data used to update the DQN parameters are likely to be de-correlated.
Furthermore, to improve the stability of the convergence of the loss function, we use a clone of the DQN model with parameters $\theta^-$. The parameters $\theta^-$ are updated to $\theta$ after every C updates of the DQN.
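A minimal replay memory along these lines can be written with a bounded deque, as sketched here; the default capacity of 50,000 and the batch size of 32 follow Table 2, while the class and method names are our own:

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=50000):
        # Oldest transitions are discarded automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size=32):
        # Uniformly sampled mini-batch: breaks the correlation of consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)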
2.5 DQN Architecture and Algorithm
As shown in Figure 5, we first grab the Flappy Bird game frame and, after the pre-processing described in Section 2.3, stack up the last 4 frames as a state. This state is fed into the CNN, whose outputs are the Q-values of each action in the given state. With probability $1-\epsilon$ the agent performs the action $a_t = \arg\max_a Q(s_t, a; \theta)$; otherwise it performs a random action. The current experience is stored in a replay memory, and a random mini-batch of experiences is sampled from the memory and used to perform a gradient descent step on the CNN's parameters. This interactive process continues until some stopping criterion is satisfied.
Figure 5: DQN's training architecture: the upper data flow shows the training process, while the lower data flow shows the interaction between the agent and the environment.
The complete DQN training process is shown in Algorithm 2. Note that the exploration factor ε is set to zero during testing, while during training we use a decaying value to balance exploration and exploitation.
Algorithm 2 Deep Q-learning Network
Initialize replay memory D to capacity N
Initialize the CNN with random weights θ
Initialize θ⁻ := θ
for games = 1 : maxGames do
    for snapShots = 1 : T do
        With probability ε select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a; θ)
        Execute a_t and observe r_{t+1} and next state s_{t+1}
        Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
        Sample a mini-batch of transitions from D
        for j = 1 : batchSize do
            if the game terminates at the next state then
                Q_pred := r_j
            else
                Q_pred := r_j + γ max_{a'} Q(s_{j+1}, a'; θ⁻)
            end if
            Perform gradient descent on θ according to equation (10)
        end for
        Every C steps reset θ⁻ := θ
    end for
end for
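The inner loop of Algorithm 2 can be vectorized over the mini-batch; the PyTorch sketch below performs one such update with a target network. It assumes policy_net and target_net are two copies of the CNN and that batch contains tensors of states, integer actions, rewards, next states and terminal flags; the use of PyTorch and the variable names are our own illustration, not necessarily the original implementation:

import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, optimizer, batch, gamma=0.99):
    # batch: tuple of tensors (states, actions, rewards, next_states, terminals)
    states, actions, rewards, next_states, terminals = batch

    # Q(s, a; theta) for the actions actually taken.
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y = r for terminal transitions, otherwise r + gamma * max_a' Q(s', a'; theta-).
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - terminals)

    # Equation (8) as a mean squared error, followed by the gradient step of equation (10).
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()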
3 Experiments
This section describes our algorithm's parameter settings and the analysis of the experimental results.
3.1 Parameters Settings
Figure 6 illustrates our CNN's layer settings. The neural network has 3 convolutional hidden layers followed by 2 fully connected hidden layers. Table 1 shows the detailed parameters of every layer. We use max pooling only after the first convolutional layer, and we use the ReLU activation function to produce the neural outputs.
Figure 6: The layer setting of the CNN: this CNN has 3 convolutional layers followed by 2 fully connected layers. For training, we use the Adam optimizer to update the CNN's parameters.
Table 1: The detailed layer settings of the CNN
Layer    | Input    | Filter size | Stride | Num filters | Activation | Output
conv1    | 80x80x4  | 8x8         | 4      | 32          | ReLU       | 20x20x32
max_pool | 20x20x32 | 2x2         | 2      | -           | -          | 10x10x32
conv2    | 10x10x32 | 4x4         | 2      | 64          | ReLU       | 5x5x64
conv3    | 5x5x64   | 3x3         | 1      | 64          | ReLU       | 5x5x64
fc4      | 5x5x64   | -           | -      | 512         | ReLU       | 512
fc5      | 512      | -           | -      | 2           | Linear     | 2
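For reference, the architecture in Table 1 can be written down as the following PyTorch module; this is a sketch on our part, and the padding values are chosen so that the layer output sizes match Table 1, since the report does not state them explicitly:

import torch
import torch.nn as nn

class FlappyBirdDQN(nn.Module):
    def __init__(self, num_actions=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4, padding=2),   # 80x80x4 -> 20x20x32
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),                  # 20x20x32 -> 10x10x32
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),  # 10x10x32 -> 5x5x64
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),  # 5x5x64 -> 5x5x64
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(5 * 5 * 64, 512),
            nn.ReLU(),
            nn.Linear(512, num_actions),   # linear output: one Q-value per action
        )

    def forward(self, x):
        # x: batch of states shaped (batch, 4, 80, 80)
        return self.head(self.features(x))

# Example usage: Q-values for a single all-zero state.
# net = FlappyBirdDQN()
# q = net(torch.zeros(1, 4, 80, 80))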
Table 2 lists the training parameters of the DQN. We use a decayed ε ranging from 0.1 to 0.001 to balance exploration and exploitation. What is more, Table 2 shows that the mini-batch stochastic gradient descent optimizer is Adam with a batch size of 32. Finally, we also allocate a large replay memory.
Table 2: The training parameters of DQN
Parameters             | Value
Observe steps          | 100000
Explore steps          |
Initial_epsilon        | 0.1
Final_epsilon          | 0.001
Replay_memory          | 50000
Batch size             | 32
Learning rate          |
FPS                    | 30
Optimization algorithm | Adam
3.2 Results Analysis
We train our model for about 4 million epochs. Figure 7 shows the weights and biases of the CNN's first hidden layer. The weights and biases eventually concentrate around 0 with low variance, which directly stabilizes the CNN's output Q-values and reduces the probability of random actions. The stability of the CNN's parameters leads to obtaining the optimal policy.
Figure 7: The left (right) figure is the histogram of the weights (biases) of the CNN's first hidden layer.