
IEEE WIRELESS COMMUNICATIONS LETTERS, VOL. 9, NO. 5, MAY 2020

Deep Reinforcement Learning Based Intelligent Reflecting Surface

Optimization for MISO Communication Systems

Keming Feng, Qisheng Wang, Xiao Li , Member, IEEE, and Chao-Kai Wen , Member, IEEE

Abstract—This letter investigates an intelligent reflecting surface (IRS)-aided multiple-input single-output (MISO) wireless transmission system. In particular, we consider the optimization of the passive phase shift of each IRS element to maximize the downlink received signal-to-noise ratio (SNR). Inspired by the success of deep reinforcement learning (DRL) in solving complicated control problems, we develop a DRL-based framework to solve this non-convex optimization problem. Numerical results reveal that the proposed DRL-based framework achieves almost the upper bound of the received SNR with relatively low time consumption.

Index Terms—Intelligent reflecting surface, non-convex optimization, deep reinforcement learning, phase shift design.

I. INTRODUCTION

RECENTLY, the intelligent reflecting surface (IRS) technology has drawn a great amount of attention due to its capability of providing remarkable massive MIMO-like gains at low cost [1]–[4]. These surfaces are usually made of nearly passive reconfigurable units, each of which can reflect the incident signal independently with a different phase shift. By adjusting these phase shifts dynamically, a more favorable propagation condition can be obtained. Additionally, these surfaces can easily be coated on the facades of outdoor buildings or on indoor walls, and can thus be deployed with low complexity.

To utilize the IRS effectively and efficiently, some work has been done on the configuration of its phase shifts [5]–[8]. In [5], a semidefinite relaxation (SDR) method was introduced to optimize the phase shift of each unit so as to maximize the received signal-to-noise ratio (SNR). Since the SDR method has high computational complexity, a lower-complexity fixed point iteration (FPI) algorithm was proposed in [6]; however, when the user is located far away from the BS, its performance loss is relatively high. In [7] and [8], the phase shifts of the units are optimized one by one in a greedy, iterative manner, which is less efficient for large-scale systems.

Manuscript received December 13, 2019; accepted January 18, 2020. Date of publication January 24, 2020; date of current version May 8, 2020. The work of Xiao Li was supported by the National Natural Science Foundation of China under Grant 61971126 and Grant 61831013. The work of Chao-Kai Wen was supported by the Ministry of Science and Technology of Taiwan under Grant MOST 108-2628-E-110-001-MY3. The associate editor coordinating the review of this article and approving it for publication was J. Zhang. (Corresponding author: Xiao Li.)

Keming Feng, Qisheng Wang, and Xiao Li are with the National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China (e-mail: keming_feng@seu.edu.cn; qishengw@seu.edu.cn; li_xiao@seu.edu.cn).

Chao-Kai Wen is with the Institute of Communications Engineering, National Sun Yat-sen University, Kaohsiung 80424, Taiwan (e-mail: chaokai.wen@mail.nsysu.edu.tw).

Digital Object Identifier 10.1109/LWC.2020.2969167


Due to recent advances of artificial intelligence, especially deep learning (DL), in wireless communications, [9] and [10] applied DL methods to the phase shift design. However, such supervised learning requires an enormous number of training labels to be calculated in advance, and in many cases these labels themselves are difficult, if not impossible, to obtain. In contrast, deep reinforcement learning (DRL) based methods do not need training labels and support online learning and sample generation, which makes them more storage-efficient.

In this letter, we investigate the phase shift design of the IRS using DRL. A DRL-based framework is proposed to tackle the non-convexity induced by the unit modulus constraints; specifically, we introduce the deep deterministic policy gradient (DDPG) algorithm into the DRL framework. Simulation results indicate that the proposed algorithm surpasses state-of-the-art algorithms in terms of received SNR and running time.

II. SYSTEM MODEL AND PROBLEM FORMULATION

Consider a single-user multiple-input single-output (MISO) downlink system, as illustrated in Fig. 1. The BS employs a uniform linear array (ULA) with M antenna elements, and the IRS is deployed with $N = N_x \times N_y$ passive phase shifters, where $N_x$ and $N_y$ are the numbers of passive units in each row and column, respectively. All phase shifters on the IRS are configurable via a smart controller. All channels are assumed to be quasi-static, frequency flat-fading, and known at both the BS and the IRS. The channels of the BS-user, IRS-user, and BS-IRS links are denoted as $\mathbf{h}_d \in \mathbb{C}^{M \times 1}$, $\mathbf{h}_r \in \mathbb{C}^{N \times 1}$, and $\mathbf{G} \in \mathbb{C}^{N \times M}$, respectively.

For the considered system, the received signal at the user is
$$y = \left( \mathbf{h}_r^{H} \boldsymbol{\Phi} \mathbf{G} + \mathbf{h}_d^{H} \right) \mathbf{b}\, s + n, \qquad (1)$$
where $\boldsymbol{\Phi} = \mathrm{diag}(e^{j\theta_1}, e^{j\theta_2}, \ldots, e^{j\theta_N})$ is the phase shift matrix at the IRS, $\mathrm{diag}(a_1, \ldots, a_N)$ denotes a diagonal matrix with $a_1, \ldots, a_N$ as its diagonal entries, $\theta_i \in [0, 2\pi]$ represents the phase shift of the $i$-th element of the IRS, $\mathbf{b} \in \mathbb{C}^{M \times 1}$ is the beamforming vector at the BS with the constraint $\|\mathbf{b}\|^2 \le P_{\max}$, $P_{\max}$ is the maximum transmit power of the BS, $s$ is the transmitted signal satisfying $\mathbb{E}[|s|^2] = 1$, and $n \sim \mathcal{CN}(0, \sigma^2)$ is the noise. The received SNR is then
$$\gamma = \left| \left( \mathbf{h}_r^{H} \boldsymbol{\Phi} \mathbf{G} + \mathbf{h}_d^{H} \right) \mathbf{b} \right|^{2} / \sigma^{2}. \qquad (2)$$


Fig. 1. IRS-aided single-user MISO system.

Note that, for a fixed phase shift matrix $\boldsymbol{\Phi}$, the optimal beamforming that maximizes the received SNR is maximum-ratio transmission (MRT) [8], i.e.,
$$\mathbf{b}^{*} = \sqrt{P_{\max}}\, \frac{\left( \mathbf{h}_r^{H} \boldsymbol{\Phi} \mathbf{G} + \mathbf{h}_d^{H} \right)^{H}}{\left\| \mathbf{h}_r^{H} \boldsymbol{\Phi} \mathbf{G} + \mathbf{h}_d^{H} \right\|}. \qquad (3)$$

The optimization problem for the phase shift matrix $\boldsymbol{\Phi}$ to maximize $\gamma$ can be formulated as
$$\text{(P1):} \quad \max_{\boldsymbol{\Phi}}\ \left\| \mathbf{h}_r^{H} \boldsymbol{\Phi} \mathbf{G} + \mathbf{h}_d^{H} \right\|^{2}, \quad \text{s.t.}\ |\Phi_{i,i}| = 1,\ \forall i = 1, 2, \ldots, N, \qquad (4)$$
where $\Phi_{i,i}$ is the $i$-th diagonal element of $\boldsymbol{\Phi}$. Note that (P1) is NP-hard owing to the non-convexity of the objective function and the unit modulus constraints. An SDR method was proposed in [5] to solve this problem; however, it is computationally expensive, with a complexity of $\mathcal{O}((N+1)^6)$ [6]. In this letter, we focus on the design of the phase shift matrix $\boldsymbol{\Phi}$ and propose a robust DRL-based framework to deal with (P1) efficiently, as described in the next section.
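As a concrete illustration of (2)–(4), the following minimal Python/NumPy sketch evaluates the received SNR (2) under the MRT beamformer (3) for a given phase shift vector. It is only an illustrative fragment under our own variable names (h_d, h_r, G, theta, P_max, sigma2), not code released with this letter.

import numpy as np

def received_snr(theta, h_d, h_r, G, P_max, sigma2):
    """Received SNR (2) under the MRT beamformer (3) for phase shifts theta (length N)."""
    Phi = np.diag(np.exp(1j * theta))                              # IRS phase shift matrix
    h_eff = h_r.conj().T @ Phi @ G + h_d.conj().T                  # effective 1 x M channel
    b = np.sqrt(P_max) * h_eff.conj().T / np.linalg.norm(h_eff)    # MRT vector, ||b||^2 = P_max
    return np.abs(h_eff @ b).item() ** 2 / sigma2

# Toy usage with random channels (M = 4 BS antennas, N = 8 IRS elements).
rng = np.random.default_rng(0)
M, N = 4, 8
h_d = (rng.standard_normal((M, 1)) + 1j * rng.standard_normal((M, 1))) / np.sqrt(2)
h_r = (rng.standard_normal((N, 1)) + 1j * rng.standard_normal((N, 1))) / np.sqrt(2)
G = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
theta = rng.uniform(0.0, 2.0 * np.pi, N)
print(received_snr(theta, h_d, h_r, G, P_max=1.0, sigma2=1e-3))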

III. DRL BASED FRAMEWORK

In this section, we first briefly introduce the DRL techniques

involved. Then, the proposed DRL based framework will be

described in detail.

A. Deep Reinforcement Learning Basics

A reinforcement learning (RL) system consists of two

major parts, i.e., the agent and the environment. Interactions

between them can be described as a Markov Decision Process

(MDP) [11]. During time step t in each episode, the agent

obtains the state st from the environment, and chooses an

action at from the action space based on a policy π. Once the

action is done, the environment updates the current state to

st+1, and emits a reward rt which measures the performance

of at under current state. Learning of the agent is to deter-

mine the optimal policy that maximizes the long-term reward.

Two kinds of algorithms, i.e., the value based and policy based

algorithms, are usually applied to determine the optimal policy.

Deep Q network (DQN) [12] is a value-based algorithm for discrete action spaces. Under a policy $\pi$, the action-state Q function of the agent for an action $a$ under state $s$, which evaluates the current action-state pair, is defined as
$$Q_{\pi}(s, a; \boldsymbol{\theta}) = \mathbb{E}_{\pi}\left[ G_t \,\middle|\, s_t = s,\ a_t = a \right], \qquad (5)$$
where $\mathbb{E}[\cdot]$ denotes expectation, $G_t = \sum_{k=0}^{\infty} \lambda^{k} r_{t+k}$ is the discounted cumulative reward, $\lambda \in (0, 1]$ is a discount factor, and $\boldsymbol{\theta}$ represents the parameters of the deep neural network (DNN) used in DQN. This algorithm aims at maximizing the Q value (5) of a given action-state pair by training the DNN [11]. The training batch is randomly sampled from a replay buffer, in which $\{s_t, a_t, r_t, s_{t+1}\}$ constitutes one piece of previously collected data.

Fig. 2. The DRL-based phase shift design framework using DDPG.

Policy gradient (PG) is a policy-based algorithm that aims at maximizing the expected discounted cumulative reward of each episode when the action space is continuous. At each time step t, the agent chooses its action according to a policy $\pi_{\boldsymbol{\theta}}$. Training of the policy can therefore be represented as a gradient ascent procedure [13]
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_{t} + \beta\, \mathbb{E}_{\pi_{\boldsymbol{\theta}_t}}\!\left[ \nabla_{\boldsymbol{\theta}_t} \log \pi_{\boldsymbol{\theta}}(s, a)\, Q_{\pi_{\boldsymbol{\theta}_t}}(s, a) \right], \qquad (6)$$
where $\beta$ is the learning rate and $Q_{\pi_{\boldsymbol{\theta}_t}}(s, a)$ is the action-state Q function under the current policy $\pi_{\boldsymbol{\theta}_t}$. The drawback of this algorithm is that the policy network can be updated only after an episode is completed, which slows down the convergence rate.

B. Phase Shift Design Framework Using DDPG

According to the above description, the DQN algorithm is not suitable for solving problem (P1), since it can only deal with discrete action spaces. As for the PG algorithm, its convergence performance is unsatisfactory in the wireless communication context. In this letter, a DDPG-based algorithm is developed to solve problem (P1); it overcomes the limitations of both the DQN and PG algorithms. The proposed framework is illustrated in Fig. 2.

1) Deep Deterministic Policy Gradient: DDPG is a model-free, off-policy actor-critic (AC) algorithm that combines the advantages of DQN and PG [14]. It can learn a deterministic policy over a high-dimensional continuous action space. In DDPG, a deterministic policy network (DPN) is used as an actor to choose actions from a continuous action space $\mathcal{A}$, i.e., $a = \mu(s; \boldsymbol{\theta}_{\mu})$, where $\boldsymbol{\theta}_{\mu}$ denotes the parameters of the DPN. A Q network $Q(s, a; \boldsymbol{\theta}_q)$ is modeled as a critic to measure the performance of the chosen action, where $\boldsymbol{\theta}_q$ denotes the parameters of the critic network. The goal of DDPG is to maximize the output Q value. To achieve this goal, as in DQN, an experience replay is maintained to reduce the correlation between different training samples. Moreover, to address the problem that the Q value update is prone to divergence with a single Q network [12], a copy is created for each of the actor and critic networks, i.e., $\mu'(s; \boldsymbol{\theta}_{\mu'})$ and $Q'(s, a; \boldsymbol{\theta}_{q'})$, which are referred to as the target networks and are used to calculate the corresponding target values. The networks being copied are referred to as the evaluation networks.


Note that each target network shares the same structure as its corresponding evaluation network, but with different parameters, i.e., $\boldsymbol{\theta}_q \neq \boldsymbol{\theta}_{q'}$ and $\boldsymbol{\theta}_{\mu} \neq \boldsymbol{\theta}_{\mu'}$. The target networks are then updated through a soft update, which can be written as
$$\boldsymbol{\theta}_{i'} = \tau \boldsymbol{\theta}_{i} + (1 - \tau)\, \boldsymbol{\theta}_{i'}, \quad i = \mu \text{ or } q, \qquad (7)$$
where $\tau \ll 1$. The soft update mitigates the instability in learning the action-state Q function and accelerates the convergence of the AC method [14]. Since the DPN learns a deterministic policy, DDPG treats exploration independently of the learning process: an exploration policy $\tilde{\mu}$ is constructed by adding a noise sample drawn from a stochastic process $\mathcal{N}$,
$$\tilde{\mu}(s_t) = \mu(s_t; \boldsymbol{\theta}_{\mu}) + \mathcal{N}, \qquad (8)$$
where $\mathcal{N}$ can be chosen to suit the environment.
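The soft update (7) amounts to an exponential moving average of the evaluation parameters. A minimal sketch, assuming the parameters of each network are stored as NumPy arrays in a dictionary keyed by layer name (our own convention, not the authors'), is:

import numpy as np

def soft_update(target_params, eval_params, tau=0.005):
    """In-place soft update (7): theta_target <- tau * theta_eval + (1 - tau) * theta_target."""
    for name, theta_eval in eval_params.items():
        target_params[name] = tau * theta_eval + (1.0 - tau) * target_params[name]

# Usage: dictionaries of weights for the actor (or critic) target/evaluation networks.
eval_params = {"W1": np.ones((4, 3)), "b1": np.zeros(3)}
target_params = {"W1": np.zeros((4, 3)), "b1": np.zeros(3)}
soft_update(target_params, eval_params, tau=0.005)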

2) The DRL Formulation: In this letter, the communication system is regarded as the environment and the IRS is treated as the agent. The corresponding elements are defined as follows (a minimal sketch of this environment interaction is given after the list).
• State space: The state $s_t$ is defined as
$$s_t = \left[ \gamma^{(t-1)}, \theta_1^{(t-1)}, \ldots, \theta_N^{(t-1)} \right], \qquad (9)$$
where $\gamma^{(t-1)}$ is the received SNR at time step $t-1$.
• Action space: At time step $t$, the agent uses the state $s_t$ as input to update the phase shifts applied by the IRS under the current channel state. When the update is done, new phase shifts are obtained. Therefore, the action vector $a_t \in \mathbb{R}^{N}$ is defined as
$$a_t = \left[ \theta_1^{(t)}, \ldots, \theta_N^{(t)} \right]. \qquad (10)$$
• Reward function: In this letter, the objective is to maximize the received SNR. Thus, the received SNR in (2) is used as the reward, i.e., $r_t = \gamma^{(t)}$.
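The interaction defined by (9), (10), and $r_t = \gamma^{(t)}$ can be sketched as a tiny environment class. This reuses the hypothetical received_snr helper introduced after Section II and is an illustration only, not the authors' implementation.

import numpy as np

class IRSEnv:
    """Toy environment: state (9) = [previous SNR, previous phase shifts]; action (10) = new phase shifts."""

    def __init__(self, h_d, h_r, G, P_max, sigma2):
        self.h_d, self.h_r, self.G = h_d, h_r, G
        self.P_max, self.sigma2 = P_max, sigma2
        self.N = h_r.shape[0]

    def reset(self, rng):
        theta0 = rng.uniform(0.0, 2.0 * np.pi, self.N)        # random initial phase shifts
        gamma0 = received_snr(theta0, self.h_d, self.h_r, self.G, self.P_max, self.sigma2)
        return np.concatenate(([gamma0], theta0))              # initial state s_1 per (9)

    def step(self, action):
        theta = np.mod(action, 2.0 * np.pi)                    # keep phases in [0, 2*pi]
        gamma = received_snr(theta, self.h_d, self.h_r, self.G, self.P_max, self.sigma2)
        next_state = np.concatenate(([gamma], theta))          # s_{t+1} per (9)
        return next_state, gamma                               # reward r_t = gamma^(t)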

3) Working Procedure: At the initialization stage, four networks are generated, i.e., the actor target net $\boldsymbol{\theta}_{\mu'}$, the actor evaluation net $\boldsymbol{\theta}_{\mu}$, the critic target net $\boldsymbol{\theta}_{q'}$, and the critic evaluation net $\boldsymbol{\theta}_{q}$, whose parameters are all initialized with uniformly distributed values. Besides, an experience replay $\mathcal{D}$ with capacity $C$ is built as well. Without loss of generality, the phase shifts of all elements are chosen randomly from $[0, 2\pi]$ at the beginning of each episode. In each episode, we first calculate all channels involved. Then, taking the state $s_t$ as input, the actor evaluation net outputs a corresponding action $a_t$. The action $a_t$ is reformed into a phase shift matrix $\boldsymbol{\Phi}^{(t)} = \mathrm{diag}(e^{j\theta_1^{(t)}}, \ldots, e^{j\theta_N^{(t)}})$ to calculate the current reward $r_t$ by (2), and the next state $s_{t+1}$ is then obtained by (9). The tuple $\{s_t, a_t, r_t, s_{t+1}\}$ is stored as one transition in $\mathcal{D}$. The critic evaluation net then samples an $N_B$-size minibatch $\{s_j, a_j, r_j, s_{j+1}\}$ ($j = 1, \ldots, N_B$) from the experience replay $\mathcal{D}$ to calculate the target Q values $y_j$, i.e.,
$$y_j = \begin{cases} r_j, & j = N_B, \\ r_j + \lambda\, Q'\!\left( s_{j+1}, \mu'(s_{j+1}; \boldsymbol{\theta}_{\mu'}); \boldsymbol{\theta}_{q'} \right), & j < N_B. \end{cases} \qquad (11)$$
The loss function of the critic evaluation net is given by
$$L(\boldsymbol{\theta}_q) = \frac{1}{N_B} \sum_{j=1}^{N_B} \left( y_j - Q(s_j, a_j; \boldsymbol{\theta}_q) \right)^{2}. \qquad (12)$$
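The target values (11) and the critic loss (12) can be computed as in the following sketch, which assumes the target critic and target actor are available as plain callables; it is an illustrative fragment rather than the authors' code.

import numpy as np

def critic_targets(batch, critic_target, actor_target, lam=0.95):
    """Target Q values (11) for a minibatch of transitions (s_j, a_j, r_j, s_next_j)."""
    y = np.empty(len(batch))
    for j, (s, a, r, s_next) in enumerate(batch):
        if j == len(batch) - 1:              # last sample of the minibatch: y_j = r_j
            y[j] = r
        else:                                # bootstrapped target through the target networks
            y[j] = r + lam * critic_target(s_next, actor_target(s_next))
    return y

def critic_loss(batch, y, critic_eval):
    """Mean-squared error loss (12) of the critic evaluation network."""
    q = np.array([critic_eval(s, a) for (s, a, _, _) in batch])
    return np.mean((y - q) ** 2)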

Algorithm 1 The DRL-Based Framework
Input: The discount factor $\lambda$, the soft update coefficient $\tau$, the learning rate $\alpha$, the experience replay capacity $C$, and the batch size $N_B$.
Randomly initialize the critic evaluation network $Q(s, a; \boldsymbol{\theta}_q)$ and the actor evaluation network $\mu(s; \boldsymbol{\theta}_{\mu})$. Initialize the critic target network $Q'(s, a; \boldsymbol{\theta}_{q'})$ and the actor target network $\mu'(s; \boldsymbol{\theta}_{\mu'})$ with the parameters of the corresponding evaluation networks. Empty the experience replay $\mathcal{D}$.
Output: The optimal phase shift matrix $\boldsymbol{\Phi}^{*}$ and the maximized received SNR $\gamma^{*}$ under the current channel state.
1: for episode $j = 1, \ldots, K$ do
2:   Obtain the current CSI $(\mathbf{h}_r^{(j)}, \mathbf{G}^{(j)}, \mathbf{h}_d^{(j)})$;
3:   Randomly choose phase shifts to obtain $\boldsymbol{\Phi}^{(0)}$ and $\gamma^{(0)}$ as the initial state $s_1$;
4:   Initialize a random process $\mathcal{N}$;
5:   for $t = 1, \ldots, T$ do
6:     Take the action $a_t = \mu(s_t; \boldsymbol{\theta}_{\mu}) + \mathcal{N}$;
7:     Reform $a_t$ into the phase shift matrix $\boldsymbol{\Phi}^{(t)} = \mathrm{diag}(e^{j\theta_1^{(t)}}, \ldots, e^{j\theta_N^{(t)}})$ to calculate $\gamma^{(t)}$. Obtain the next state $s_{t+1}$. Then, store the transition $\{s_t, a_t, r_t, s_{t+1}\}$ in $\mathcal{D}$.
8:     Sample a minibatch of $N_B$ transitions $\{s_j, a_j, r_j, s_{j+1}\}$ from $\mathcal{D}$.
9:     Set the target Q values according to (11).
10:    Update $Q(s, a; \boldsymbol{\theta}_q)$ by minimizing the loss in (12).
11:    Update the policy $\mu(s; \boldsymbol{\theta}_{\mu})$ using the sampled policy gradient in (13).
12:    Soft update the target networks according to (7).
13:    Update the state: $s_t = s_{t+1}$.
14:   end for
15: end for

The critic evaluation net can then be updated by stochastic gradient descent (SGD). Afterwards, the policy gradient is used to update the actor evaluation net with the ascent direction
$$\Delta_{\boldsymbol{\theta}_{\mu}} = \frac{1}{N_B} \sum_{j=1}^{N_B} \nabla_{a} Q(s_j, a; \boldsymbol{\theta}_q)\big|_{a = \mu(s_j; \boldsymbol{\theta}_{\mu})}\, \nabla_{\boldsymbol{\theta}_{\mu}} \mu(s_j; \boldsymbol{\theta}_{\mu}). \qquad (13)$$
Finally, the actor target net and the critic target net are updated using the soft update (7). The details of the DRL-based framework are summarized in Algorithm 1.
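Putting (11)–(13) and the soft update (7) together, one DDPG update step could look roughly like the PyTorch sketch below. The module and optimizer objects are assumptions for illustration (the critic is assumed to accept the state and action as two arguments), and, for simplicity, every sample in the minibatch is bootstrapped rather than following the case split in (11); the letter does not publish an implementation.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, actor_tgt, critic, critic_tgt,
                actor_opt, critic_opt, lam=0.95, tau=0.005):
    """One DDPG update on a minibatch: critic loss (12), actor ascent (13), soft update (7)."""
    s, a, r, s_next = batch   # tensors of shape (N_B, N+1), (N_B, N), (N_B, 1), (N_B, N+1)

    # Target Q values, bootstrapped through the target networks as in (11)
    with torch.no_grad():
        y = r + lam * critic_tgt(s_next, actor_tgt(s_next))

    # Critic update: minimize the mean-squared error loss (12)
    critic_opt.zero_grad()
    F.mse_loss(critic(s, a), y).backward()
    critic_opt.step()

    # Actor update: ascend the sampled policy gradient (13), i.e., minimize -Q(s, mu(s))
    actor_opt.zero_grad()
    (-critic(s, actor(s)).mean()).backward()
    actor_opt.step()

    # Soft update (7) of both target networks
    for tgt, src in ((actor_tgt, actor), (critic_tgt, critic)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)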

IV. NUMERICAL RESULTS

This section demonstrates the performance of the proposed framework. The channel between the BS and the user is assumed to be Rayleigh fading, which corresponds to the line-of-sight path between them being blocked (other fading models could be used as well), i.e.,
$$\mathbf{h}_d = \sqrt{PL_d}\, \tilde{\mathbf{h}}_d, \qquad (14)$$
where $\tilde{\mathbf{h}}_d \in \mathbb{C}^{M \times 1}$ contains independent and identically distributed (i.i.d.) $\mathcal{CN}(0, 1)$ elements. The channels between the BS and the IRS and between the IRS and the user are modeled as Rician fading, i.e.,

$$\mathbf{G} = \sqrt{PL_G}\left( \sqrt{\tfrac{K_1}{K_1+1}}\, \overline{\mathbf{G}} + \sqrt{\tfrac{1}{K_1+1}}\, \tilde{\mathbf{G}} \right), \qquad (15)$$
$$\mathbf{h}_r = \sqrt{PL_r}\left( \sqrt{\tfrac{K_2}{K_2+1}}\, \overline{\mathbf{h}}_r + \sqrt{\tfrac{1}{K_2+1}}\, \tilde{\mathbf{h}}_r \right), \qquad (16)$$
where $K_1$ and $K_2$ are the Rician K-factors, and $\tilde{\mathbf{G}} \in \mathbb{C}^{N \times M}$ and $\tilde{\mathbf{h}}_r \in \mathbb{C}^{N \times 1}$ are the random components with i.i.d. $\mathcal{CN}(0, 1)$ elements. The spacing between two adjacent antenna elements at both the BS and the IRS is half of the carrier wavelength. The deterministic components $\overline{\mathbf{G}}$ and $\overline{\mathbf{h}}_r$ can then be expressed as [15, Eq. (3)], [16, Eq. (6)]
$$\overline{\mathbf{G}} = \left[ \mathbf{a}_{N_x}^{H}(\theta_{\mathrm{AoA,h}}) \otimes \mathbf{a}_{N_y}^{H}(\theta_{\mathrm{AoA,v}}) \right] \mathbf{a}_{M}(\theta_{\mathrm{AoD,b}}), \qquad (17)$$
$$\overline{\mathbf{h}}_r = \mathbf{a}_{N_y}^{H}(\theta_{\mathrm{AoD,v}}) \otimes \tilde{\mathbf{a}}_{N_x}^{H}(\theta_{\mathrm{AoD,v}}, \theta_{\mathrm{AoD,h}}), \qquad (18)$$
with
$$\mathbf{a}_{i}(\theta) = \left[ 1, e^{-j2\pi \frac{d}{\lambda}\sin(\theta)}, \ldots, e^{-j2\pi (i-1)\frac{d}{\lambda}\sin(\theta)} \right], \qquad (19)$$
$$\tilde{\mathbf{a}}_{N_x}(\theta_{\mathrm{AoD,v}}, \theta_{\mathrm{AoD,h}}) = \left[ 1, e^{-j2\pi \frac{d}{\lambda}\phi}, \ldots, e^{-j2\pi (N_x-1)\frac{d}{\lambda}\phi} \right], \qquad (20)$$
where $\phi = \cos(\theta_{\mathrm{AoD,v}}) \sin(\theta_{\mathrm{AoD,h}})$, $\theta_{\mathrm{AoA/AoD,h/v}}$ denote the angles of arrival/departure in the horizontal/vertical direction at the IRS, and $\theta_{\mathrm{AoD,b}}$ is the angle of departure at the BS.
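One realization of the channels (14)–(20) can be generated with the NumPy sketch below. The helper names, the dictionary of angles, and the element spacing d/λ = 0.5 are our own illustrative choices; the sketch is not part of the letter's simulation code.

import numpy as np

def steer(n, angle, d_over_lambda=0.5):
    """ULA steering vector (19) of length n for an angle in radians."""
    k = np.arange(n)
    return np.exp(-1j * 2.0 * np.pi * k * d_over_lambda * np.sin(angle))

def steer_phi(nx, theta_v, theta_h, d_over_lambda=0.5):
    """Modified steering vector (20) with phi = cos(theta_v) * sin(theta_h)."""
    phi = np.cos(theta_v) * np.sin(theta_h)
    k = np.arange(nx)
    return np.exp(-1j * 2.0 * np.pi * k * d_over_lambda * phi)

def rician_channels(M, Nx, Ny, K1, K2, PL_G, PL_r, ang, rng):
    """One realization of G per (15)/(17) and h_r per (16)/(18)."""
    N = Nx * Ny
    # Deterministic (LOS) components per (17) and (18)
    a_irs = np.kron(steer(Nx, ang["AoA_h"]).conj(), steer(Ny, ang["AoA_v"]).conj())
    G_los = np.outer(a_irs, steer(M, ang["AoD_b"]))                     # N x M
    h_los = np.kron(steer(Ny, ang["AoD_v"]).conj(),
                    steer_phi(Nx, ang["AoD_v"], ang["AoD_h"]).conj())   # length N
    # Random (NLOS) components with i.i.d. CN(0, 1) entries
    G_nlos = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
    h_nlos = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
    G = np.sqrt(PL_G) * (np.sqrt(K1 / (K1 + 1)) * G_los + np.sqrt(1.0 / (K1 + 1)) * G_nlos)
    h_r = np.sqrt(PL_r) * (np.sqrt(K2 / (K2 + 1)) * h_los + np.sqrt(1.0 / (K2 + 1)) * h_nlos)
    return G, h_r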

The distance between the BS and the IRS is 51 m, M = 10, N = 50 ($N_x = 10$, $N_y = 5$) unless otherwise specified, $P_{\max} = 5$ dBm, and $\sigma^2 = -80$ dBm. The user moves along a line parallel to the line connecting the BS and the IRS, and the vertical distance between these two lines is 1.5 m. The path loss is modeled as $PL = PL_0 - 10\,\xi \log_{10}(d/D_0)$ dB, where $PL_0 = -30$ dB, $D_0 = 1$ m, $\xi$ is the path loss exponent, and $d$ is the BS-user horizontal distance. A penetration loss of 5 dB is assumed in both the BS-user and IRS-user links. An antenna gain of 0 dBi is assumed at both the BS and the user, and 5 dBi at the IRS. The path loss exponents of the BS-IRS, BS-user, and IRS-user links are set to $\xi_{bi} = 2$ and $\xi_{bu} = \xi_{iu} = 2.8$, respectively. The simulation results are averaged over 500 realizations of the channels' random components.
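For reference, the path loss model above converts to the linear scale used in (14)–(16) as in the short sketch below; the antenna gains and penetration losses quoted above would be added to the link budget separately. The function name and defaults are ours.

import math

def path_loss_linear(d, xi, PL0_db=-30.0, D0=1.0):
    """Linear-scale path loss from PL = PL0 - 10 * xi * log10(d / D0) [dB]."""
    pl_db = PL0_db - 10.0 * xi * math.log10(d / D0)
    return 10.0 ** (pl_db / 10.0)

# Example: BS-IRS link of 51 m with exponent xi_bi = 2.
PL_G = path_loss_linear(51.0, xi=2.0)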

In the proposed DRL framework, all neural networks are four-layer DNNs. Both the actor evaluation net and the critic evaluation net use the Adam optimizer for parameter updates. The input layer of the actor network contains N + 1 neurons and its output layer contains N neurons (these two numbers change to 2N + 1 and 1, respectively, in the critic network). The two hidden layers contain 300 and 200 neurons, respectively. The first three layers are each followed by a ReLU activation, while the output layer uses a tanh(·) activation to provide sufficient gradient. Furthermore, we set the batch size $N_B = 16$, the number of steps per episode $T = 1000$, the learning rate $\alpha = 10^{-3}$, the discount factor $\lambda = 0.95$, the soft update coefficient $\tau = 0.005$, and the experience replay capacity $C = 50000$. The exploration noise $\mathcal{N}$ is selected as complex Gaussian noise with zero mean and variance 0.1.
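The layer sizes described above translate into the following PyTorch sketch (input layer, two hidden layers with 300 and 200 neurons, output layer). The module code is our own rendering of the stated architecture; the letter itself releases no implementation, and the tanh at the critic output (which the text states applies to all networks) is kept here for faithfulness.

import torch.nn as nn

def make_actor(N):
    """Actor: input N+1 (state), hidden 300/200 with ReLU, output N phase-shift values via tanh."""
    return nn.Sequential(
        nn.Linear(N + 1, 300), nn.ReLU(),
        nn.Linear(300, 200), nn.ReLU(),
        nn.Linear(200, N), nn.Tanh(),
    )

def make_critic(N):
    """Critic: input 2N+1 (state and action concatenated), hidden 300/200, scalar Q output."""
    return nn.Sequential(
        nn.Linear(2 * N + 1, 300), nn.ReLU(),
        nn.Linear(300, 200), nn.ReLU(),
        nn.Linear(200, 1), nn.Tanh(),   # tanh on the output layer, as stated in the text
    )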

Fig. 3 demonstrates the received SNR of the proposed algorithm vs. the horizontal distance between the BS and the user, denoted as d. In this figure, we consider a scenario similar to [5], where the IRS is coated on the facade of a tall building and is aware of the BS's location, so that $K_1 \to \infty$ and $K_2 = 0$. The performance of the SDR algorithm [5], serving as an upper bound, the fixed point iteration algorithm [6] with random initialization, and the system without an IRS are also shown. It can be observed that the proposed DRL-based framework almost achieves the upper bound of the received SNR, which testifies to its near-optimality. Note that it brings less gain when the user is close to the BS; this is because, in this setting, the user is far from the IRS and thus receives less signal power from it. In the absence of the IRS, the received SNR decreases rapidly as the user moves away from the BS. This performance degradation can be substantially mitigated by placing an IRS between the BS and the user. It is also noted that the performance of the proposed algorithm is clearly superior to the fixed point iteration when d ≥ 40 m.

Fig. 3. Received SNR vs. BS-user horizontal distance.

Fig. 4. Received SNR vs. number of elements on the IRS.

In Fig. 4, the received SNRs for different numbers of passive elements on the IRS are compared. In this figure, we consider a scenario similar to [7], where the IRS is placed without knowledge of the BS's location, the Rician K-factors are set to $K_1 = K_2 = 10$, and the horizontal BS-user distance is d = 48 m. As can be observed, the performance of all algorithms improves as the number of passive units on the IRS increases, because more power is reflected by the IRS. In particular, the difference between the received SNRs with N = 50 and N = 100 is approximately 6 dB, i.e., doubling the number of passive units roughly quadruples the received SNR, which suggests that an $\mathcal{O}(N^2)$ gain can be attained.

TABLE I
RUNNING TIME COMPARISON

The running times of the three algorithms for different numbers of passive elements on the IRS are given in Table I. The other simulation parameters are the same as in Fig. 4. We can see that the SDR algorithm is extremely time-consuming, and its running time increases enormously as N increases. The fixed point iteration algorithm has the lowest running time [6], but the time consumed grows quickly as N increases, which is expected since more optimization variables are involved. In contrast, the time consumed by the proposed DRL-based framework remains around 34 ms for all N values, which is explicable since the number of hidden-layer neurons remains unchanged as N grows. This property verifies that the proposed framework is efficient and robust. More importantly, it achieves an almost optimal received SNR with relatively low time consumption.

Fig. 5. Received SNR vs. number of elements at the BS/IRS.

Fig. 5 compares the performance of the proposed framework for different numbers of antenna elements at the BS and different numbers of passive units at the IRS. The curve "DRL-based-M" is obtained by fixing the number of passive units on the IRS to 30 ($N_x = 10$ and $N_y = 3$) and varying the number of antennas at the BS from 30 to 100. The curve "DRL-based-N" is obtained by fixing the number of antennas at the BS to 30 and varying the number of passive units on the IRS from 30 to 100 ($N_y$ changes from 3 to 10); for this curve, the framework is retrained for each point. The other simulation parameters are the same as in Fig. 4. It can be seen that increasing the number of passive units on the IRS leads to a higher performance gain, which indicates that increasing the number of low-cost passive elements on the IRS is more energy-efficient than enlarging the number of costly RF chains at the BS. It is worth noting that the performance gain becomes more pronounced as the number of elements grows.

V. CONCLUSION

In this letter, we investigated the phase shift design for an IRS-aided downlink MISO wireless communication system to maximize the received SNR. An efficient DRL-based framework was proposed to tackle the non-convex unit modulus constraints, which are the major difficulty in optimizing the phase shifts introduced by the IRS. Numerical results reveal that the proposed framework obtains significant performance gains compared to the fixed point iteration algorithm and achieves almost the upper bound calculated by the SDR algorithm with much less time consumption.

REFERENCES

[1] Q. Wu and R. Zhang, "Towards smart and reconfigurable environment: Intelligent reflecting surface aided wireless network," 2019. [Online]. Available: arXiv:1905.00152.
[2] W. Tang et al., "Wireless communications with programmable metasurface: New paradigms, opportunities, and challenges on transceiver design," 2019. [Online]. Available: arXiv:1907.01956.
[3] W. Tang et al., "Programmable metasurface-based RF chain-free 8PSK wireless transmitter," Electron. Lett., vol. 55, no. 7, pp. 417–420, Apr. 2019.
[4] S. Abeywickrama, R. Zhang, and C. Yuen, "Intelligent reflecting surface: Practical phase shift model and beamforming optimization," 2019. [Online]. Available: arXiv:1907.06002.
[5] Q. Wu and R. Zhang, "Intelligent reflecting surface enhanced wireless network: Joint active and passive beamforming design," in Proc. IEEE Glob. Commun. Conf. (GLOBECOM), 2018, pp. 1–6.
[6] X. Yu, D. Xu, and R. Schober, "MISO wireless communication systems via intelligent reflecting surfaces," 2019. [Online]. Available: arXiv:1904.12199.
[7] Y. Han, W. Tang, S. Jin, C.-K. Wen, and X. Ma, "Large intelligent surface-assisted wireless communication exploiting statistical CSI," IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 8238–8242, Aug. 2019.
[8] Q. Wu and R. Zhang, "Beamforming optimization for intelligent reflecting surface with discrete phase shifts," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2019, pp. 7830–7833.
[9] C. Huang, G. C. Alexandropoulos, C. Yuen, and M. Debbah, "Indoor signal focusing with deep learning designed reconfigurable intelligent surfaces," 2019. [Online]. Available: arXiv:1905.07726.
[10] A. Taha, M. Alrabeiah, and A. Alkhateeb, "Enabling large intelligent surfaces with compressive sensing and deep learning," 2019. [Online]. Available: arXiv:1904.10136.
[11] R. W. Picard et al., "Affective learning—A manifesto," BT Technol. J., vol. 22, no. 4, pp. 253–269, 2004.
[12] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[13] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2000, pp. 1057–1063.
[14] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," 2015. [Online]. Available: arXiv:1509.02971.
[15] Y. Cao and T. Lv, "Intelligent reflecting surface aided multi-user millimeter-wave communications for coverage enhancement," 2019. [Online]. Available: arXiv:1910.02398.
[16] X. Li, S. Jin, H. A. Suraweera, J. Hou, and X. Gao, "Statistical 3-D beamforming for large-scale MIMO downlink systems over Rician fading channels," IEEE Trans. Commun., vol. 64, no. 4, pp. 1529–1543, Apr. 2016.
