
IEEE WIRELESS COMMUNICATIONS LETTERS, VOL. 9, NO. 5, MAY 2020

Deep Reinforcement Learning Based Intelligent Reflecting Surface

Optimization for MISO Communication Systems

Keming Feng, Qisheng Wang, Xiao Li , Member, IEEE, and Chao-Kai Wen , Member, IEEE

Abstract—This letter investigates an intelligent reflecting surface (IRS)-aided multiple-input single-output (MISO) wireless transmission system. In particular, we consider the optimization of the passive phase shift of each IRS element to maximize the downlink received signal-to-noise ratio (SNR). Inspired by the success of deep reinforcement learning (DRL) in solving complicated control problems, we develop a DRL-based framework to solve this non-convex optimization problem. Numerical results reveal that the proposed DRL-based framework achieves almost the upper bound of the received SNR with relatively low time consumption.

Index Terms—Intelligent reflecting surface, non-convex optimization, deep reinforcement learning, phase shift design.

I. INTRODUCTION

RECENTLY, the intelligent reflecting surface (IRS) technology has drawn a great amount of attention due to its capability of providing remarkable massive MIMO-like gains at low cost [1]–[4]. These surfaces are usually made of nearly passive reconfigurable units, each of which can reflect the incident signal independently with a different phase shift. By adjusting these phase shifts dynamically, a more favorable propagation condition can be obtained. Additionally, these surfaces can easily be coated on the facades of outdoor buildings or on indoor walls, and can thus be deployed with low complexity.

To utilize the IRS effectively and efficiently, some work has been done on the configuration of its phase shifts [5]–[8]. In [5], a semidefinite relaxation (SDR) method was introduced to optimize the phase shift of each unit so as to maximize the received signal-to-noise ratio (SNR). Since the SDR method has high computational complexity, a lower-complexity fixed point iteration (FPI) algorithm was proposed in [6]; however, when the user is located far away from the BS, its performance loss is relatively high. In [7] and [8], the phase shifts of the units are optimized one by one in a greedy, iterative manner, which is less efficient for large-scale systems.

Manuscript received December 13, 2019; accepted January 18, 2020. Date of publication January 24, 2020; date of current version May 8, 2020. The work of Xiao Li was supported by the National Natural Science Foundation of China under Grant 61971126 and Grant 61831013. The work of Chao-Kai Wen was supported by the Ministry of Science and Technology of Taiwan under Grant MOST 108-2628-E-110-001-MY3. The associate editor coordinating the review of this article and approving it for publication was J. Zhang. (Corresponding author: Xiao Li.)

Keming Feng, Qisheng Wang, and Xiao Li are with the National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China (e-mail: keming_feng@seu.edu.cn; qishengw@seu.edu.cn; li_xiao@seu.edu.cn).

Chao-Kai Wen is with the Institute of Communications Engineering, National Sun Yat-sen University, Kaohsiung 80424, Taiwan (e-mail: chaokai.wen@mail.nsysu.edu.tw).

Digital Object Identifier 10.1109/LWC.2020.2969167


Due to recent advances of artificial intelligence, especially deep learning (DL), in wireless communications, [9] and [10] applied DL methods to the phase shift design. However, such supervised learning requires an enormous number of training labels to be calculated in advance, and in many cases these labels themselves are difficult, if not impossible, to obtain. In contrast, deep reinforcement learning (DRL) based methods do not need training labels and support online learning and sample generation, which makes them more storage-efficient.

In this letter, we investigate the phase shift design of the IRS using DRL. A DRL-based framework is proposed to tackle the non-convexity induced by the unit modulus constraints; specifically, we introduce the deep deterministic policy gradient (DDPG) algorithm into the DRL framework. Simulation results indicate that the proposed algorithm surpasses state-of-the-art algorithms in terms of received SNR and running time.

II. SYSTEM MODEL AND PROBLEM FORMULATION

Consider a single-user multiple-input single-output (MISO) downlink system, as illustrated in Fig. 1. The BS employs a uniform linear array (ULA) with M antenna elements, and the IRS is deployed with $N = N_x \times N_y$ passive phase shifters, where $N_x$ and $N_y$ are the numbers of passive units in each row and column, respectively. All phase shifters on the IRS are configurable via a smart controller. All channels are assumed to be quasi-static, frequency flat-fading, and known at both the BS and the IRS. The channels of the BS-user, IRS-user, and BS-IRS links are denoted as $\mathbf{h}_d \in \mathbb{C}^{M \times 1}$, $\mathbf{h}_r \in \mathbb{C}^{N \times 1}$, and $\mathbf{G} \in \mathbb{C}^{N \times M}$, respectively.

For the considered system, the received signal at the user is
$$y = \left( \mathbf{h}_r^{H} \boldsymbol{\Phi} \mathbf{G} + \mathbf{h}_d^{H} \right) \mathbf{b}\, s + n, \qquad (1)$$
where $\boldsymbol{\Phi} = \mathrm{diag}(e^{j\theta_1}, e^{j\theta_2}, \ldots, e^{j\theta_N})$ is the phase shift matrix at the IRS, $\mathrm{diag}(a_1, \ldots, a_N)$ denotes a diagonal matrix with $a_1, \ldots, a_N$ as its diagonal entries, $\theta_i \in [0, 2\pi]$ represents the phase shift of the $i$-th element of the IRS, $\mathbf{b} \in \mathbb{C}^{M \times 1}$ is the beamforming vector at the BS with the constraint $\|\mathbf{b}\|^2 \le P_{\max}$, $P_{\max}$ is the maximum transmit power of the BS, $s$ is the transmitted signal satisfying $\mathbb{E}[|s|^2] = 1$, and $n \sim \mathcal{CN}(0, \sigma^2)$ is the noise. The received SNR is then
$$\gamma = \left| \left( \mathbf{h}_r^{H} \boldsymbol{\Phi} \mathbf{G} + \mathbf{h}_d^{H} \right) \mathbf{b} \right|^{2} / \sigma^{2}. \qquad (2)$$


Fig. 1. IRS-aided single-user MISO system.

Note that, for a fixed phase shift matrix $\boldsymbol{\Phi}$, the optimal beamforming that maximizes the received SNR is maximum-ratio transmission (MRT) [8], i.e.,
$$\mathbf{b}^{*} = \sqrt{P_{\max}}\, \frac{\left( \mathbf{h}_r^{H} \boldsymbol{\Phi} \mathbf{G} + \mathbf{h}_d^{H} \right)^{H}}{\left\| \mathbf{h}_r^{H} \boldsymbol{\Phi} \mathbf{G} + \mathbf{h}_d^{H} \right\|}. \qquad (3)$$

The optimization problem for the phase shift matrix $\boldsymbol{\Phi}$ to maximize $\gamma$ can be formulated as
$$\text{(P1):} \quad \max_{\boldsymbol{\Phi}}\ \left\| \mathbf{h}_r^{H} \boldsymbol{\Phi} \mathbf{G} + \mathbf{h}_d^{H} \right\|^{2}, \quad \text{s.t.}\ |\Phi_{i,i}| = 1,\ \forall i = 1, 2, \ldots, N, \qquad (4)$$
where $\Phi_{i,i}$ is the $i$-th diagonal element of $\boldsymbol{\Phi}$. Note that (P1) is NP-hard owing to the non-convexity of the objective function and the unit modulus constraints. An SDR method was proposed in [5] to solve this problem; however, it is computationally expensive, with a complexity of $\mathcal{O}((N+1)^6)$ [6]. In this letter, we focus on the design of the phase shift matrix $\boldsymbol{\Phi}$ and propose a robust DRL-based framework to deal with (P1) efficiently, as described in the next section.
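As a concrete illustration of (2)–(4), the following minimal Python/NumPy sketch evaluates the received SNR (2) under the MRT beamformer (3) for a given phase shift vector. It is only an illustrative fragment under our own variable names (h_d, h_r, G, theta, P_max, sigma2), not code released with this letter.

import numpy as np

def received_snr(theta, h_d, h_r, G, P_max, sigma2):
    """Received SNR (2) under the MRT beamformer (3) for phase shifts theta (length N)."""
    Phi = np.diag(np.exp(1j * theta))                              # IRS phase shift matrix
    h_eff = h_r.conj().T @ Phi @ G + h_d.conj().T                  # effective 1 x M channel
    b = np.sqrt(P_max) * h_eff.conj().T / np.linalg.norm(h_eff)    # MRT vector, ||b||^2 = P_max
    return np.abs(h_eff @ b).item() ** 2 / sigma2

# Toy usage with random channels (M = 4 BS antennas, N = 8 IRS elements).
rng = np.random.default_rng(0)
M, N = 4, 8
h_d = (rng.standard_normal((M, 1)) + 1j * rng.standard_normal((M, 1))) / np.sqrt(2)
h_r = (rng.standard_normal((N, 1)) + 1j * rng.standard_normal((N, 1))) / np.sqrt(2)
G = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
theta = rng.uniform(0.0, 2.0 * np.pi, N)
print(received_snr(theta, h_d, h_r, G, P_max=1.0, sigma2=1e-3))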

III. DRL BASED FRAMEWORK

In this section, we first briefly introduce the DRL techniques

involved. Then, the proposed DRL based framework will be

described in detail.

A. Deep Reinforcement Learning Basics

A reinforcement learning (RL) system consists of two

major parts, i.e., the agent and the environment. Interactions

between them can be described as a Markov Decision Process

(MDP) [11]. During time step t in each episode, the agent

obtains the state st from the environment, and chooses an

action at from the action space based on a policy π. Once the

action is done, the environment updates the current state to

st+1, and emits a reward rt which measures the performance

of at under current state. Learning of the agent is to deter-

mine the optimal policy that maximizes the long-term reward.

Two kinds of algorithms, i.e., the value based and policy based

algorithms, are usually applied to determine the optimal policy.

Deep Q network (DQN) [12] is a value-based algorithm for discrete action spaces. Under a policy $\pi$, the action-state Q function of the agent for an action $a$ under state $s$, which evaluates the current action-state pair, is defined as
$$Q_{\pi}(s, a; \boldsymbol{\theta}) = \mathbb{E}_{\pi}\left[ G_t \,\middle|\, s_t = s,\ a_t = a \right], \qquad (5)$$
where $\mathbb{E}[\cdot]$ denotes expectation, $G_t = \sum_{k=0}^{\infty} \lambda^{k} r_{t+k}$ is the discounted cumulative reward, $\lambda \in (0, 1]$ is a discount factor, and $\boldsymbol{\theta}$ represents the parameters of the deep neural network (DNN) used in DQN. This algorithm aims at maximizing the Q value (5) of a given action-state pair by training the DNN [11]. The training batch is randomly sampled from a replay buffer, in which $\{s_t, a_t, r_t, s_{t+1}\}$ constitutes one piece of previously collected data.

Fig. 2. The DRL-based phase shift design framework using DDPG.

Policy gradient (PG) is a policy-based algorithm that aims at maximizing the expected discounted cumulative reward of each episode when the action space is continuous. At each time step t, the agent chooses its action according to a policy $\pi_{\boldsymbol{\theta}}$. Training of the policy can therefore be represented as a gradient ascent procedure [13]
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_{t} + \beta\, \mathbb{E}_{\pi_{\boldsymbol{\theta}_t}}\!\left[ \nabla_{\boldsymbol{\theta}_t} \log \pi_{\boldsymbol{\theta}}(s, a)\, Q_{\pi_{\boldsymbol{\theta}_t}}(s, a) \right], \qquad (6)$$
where $\beta$ is the learning rate and $Q_{\pi_{\boldsymbol{\theta}_t}}(s, a)$ is the action-state Q function under the current policy $\pi_{\boldsymbol{\theta}_t}$. The drawback of this algorithm is that the policy network can be updated only after an episode is completed, which slows down the convergence rate.

B. Phase Shift Design Framework Using DDPG

According to the above description, the DQN algorithm is not suitable for solving problem (P1), since it can only deal with discrete action spaces. As for the PG algorithm, its convergence performance is unsatisfactory in the wireless communication context. In this letter, a DDPG-based algorithm is developed to solve problem (P1); it overcomes the limitations of both the DQN and PG algorithms. The proposed framework is illustrated in Fig. 2.

1) Deep Deterministic Policy Gradient: DDPG is a model-free, off-policy actor-critic (AC) algorithm that combines the advantages of DQN and PG [14]. It can learn a deterministic policy over a high-dimensional continuous action space. In DDPG, a deterministic policy network (DPN) is used as an actor to choose actions from a continuous action space $\mathcal{A}$, i.e., $a = \mu(s; \boldsymbol{\theta}_{\mu})$, where $\boldsymbol{\theta}_{\mu}$ denotes the parameters of the DPN. A Q network $Q(s, a; \boldsymbol{\theta}_q)$ is modeled as a critic to measure the performance of the chosen action, where $\boldsymbol{\theta}_q$ denotes the parameters of the critic network. The goal of DDPG is to maximize the output Q value. To achieve this goal, as in DQN, an experience replay is maintained to reduce the correlation between different training samples. Moreover, to address the problem that the Q value update is prone to divergence with a single Q network [12], a copy is created for each of the actor and critic networks, i.e., $\mu'(s; \boldsymbol{\theta}_{\mu'})$ and $Q'(s, a; \boldsymbol{\theta}_{q'})$, which are referred to as the target networks and are used to calculate the corresponding target values. The networks being copied are referred to as the evaluation networks.


Note that each target network shares the same structure as its corresponding evaluation network, but with different parameters, i.e., $\boldsymbol{\theta}_q \neq \boldsymbol{\theta}_{q'}$ and $\boldsymbol{\theta}_{\mu} \neq \boldsymbol{\theta}_{\mu'}$. The target networks are then updated through a soft update, which can be written as
$$\boldsymbol{\theta}_{i'} = \tau \boldsymbol{\theta}_{i} + (1 - \tau)\, \boldsymbol{\theta}_{i'}, \quad i = \mu \text{ or } q, \qquad (7)$$
where $\tau \ll 1$. The soft update mitigates the instability in learning the action-state Q function and accelerates the convergence of the AC method [14]. Since the DPN learns a deterministic policy, DDPG treats exploration independently of the learning process: an exploration policy $\tilde{\mu}$ is constructed by adding a noise sample drawn from a stochastic process $\mathcal{N}$,
$$\tilde{\mu}(s_t) = \mu(s_t; \boldsymbol{\theta}_{\mu}) + \mathcal{N}, \qquad (8)$$
where $\mathcal{N}$ can be chosen to suit the environment.
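The soft update (7) amounts to an exponential moving average of the evaluation parameters. A minimal sketch, assuming the parameters of each network are stored as NumPy arrays in a dictionary keyed by layer name (our own convention, not the authors'), is:

import numpy as np

def soft_update(target_params, eval_params, tau=0.005):
    """In-place soft update (7): theta_target <- tau * theta_eval + (1 - tau) * theta_target."""
    for name, theta_eval in eval_params.items():
        target_params[name] = tau * theta_eval + (1.0 - tau) * target_params[name]

# Usage: dictionaries of weights for the actor (or critic) target/evaluation networks.
eval_params = {"W1": np.ones((4, 3)), "b1": np.zeros(3)}
target_params = {"W1": np.zeros((4, 3)), "b1": np.zeros(3)}
soft_update(target_params, eval_params, tau=0.005)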

2) The DRL Formulation: In this letter, the communication system is regarded as the environment and the IRS is treated as the agent. The corresponding elements are defined as follows (a minimal sketch of this environment interaction is given after the list).
• State space: The state $s_t$ is defined as
$$s_t = \left[ \gamma^{(t-1)}, \theta_1^{(t-1)}, \ldots, \theta_N^{(t-1)} \right], \qquad (9)$$
where $\gamma^{(t-1)}$ is the received SNR at time step $t-1$.
• Action space: At time step $t$, the agent uses the state $s_t$ as input to update the phase shifts applied by the IRS under the current channel state. When the update is done, new phase shifts are obtained. Therefore, the action vector $a_t \in \mathbb{R}^{N}$ is defined as
$$a_t = \left[ \theta_1^{(t)}, \ldots, \theta_N^{(t)} \right]. \qquad (10)$$
• Reward function: In this letter, the objective is to maximize the received SNR. Thus, the received SNR in (2) is used as the reward, i.e., $r_t = \gamma^{(t)}$.
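The interaction defined by (9), (10), and $r_t = \gamma^{(t)}$ can be sketched as a tiny environment class. This reuses the hypothetical received_snr helper introduced after Section II and is an illustration only, not the authors' implementation.

import numpy as np

class IRSEnv:
    """Toy environment: state (9) = [previous SNR, previous phase shifts]; action (10) = new phase shifts."""

    def __init__(self, h_d, h_r, G, P_max, sigma2):
        self.h_d, self.h_r, self.G = h_d, h_r, G
        self.P_max, self.sigma2 = P_max, sigma2
        self.N = h_r.shape[0]

    def reset(self, rng):
        theta0 = rng.uniform(0.0, 2.0 * np.pi, self.N)        # random initial phase shifts
        gamma0 = received_snr(theta0, self.h_d, self.h_r, self.G, self.P_max, self.sigma2)
        return np.concatenate(([gamma0], theta0))              # initial state s_1 per (9)

    def step(self, action):
        theta = np.mod(action, 2.0 * np.pi)                    # keep phases in [0, 2*pi]
        gamma = received_snr(theta, self.h_d, self.h_r, self.G, self.P_max, self.sigma2)
        next_state = np.concatenate(([gamma], theta))          # s_{t+1} per (9)
        return next_state, gamma                               # reward r_t = gamma^(t)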

3) Working Procedure: At the initialization stage, four networks are generated, i.e., the actor target net $\boldsymbol{\theta}_{\mu'}$, the actor evaluation net $\boldsymbol{\theta}_{\mu}$, the critic target net $\boldsymbol{\theta}_{q'}$, and the critic evaluation net $\boldsymbol{\theta}_{q}$, whose parameters are all initialized with uniformly distributed values. Besides, an experience replay $\mathcal{D}$ with capacity $C$ is built as well. Without loss of generality, the phase shifts of all elements are chosen randomly from $[0, 2\pi]$ at the beginning of each episode. In each episode, we first calculate all channels involved. Then, taking the state $s_t$ as input, the actor evaluation net outputs a corresponding action $a_t$. The action $a_t$ is reformed into a phase shift matrix $\boldsymbol{\Phi}^{(t)} = \mathrm{diag}(e^{j\theta_1^{(t)}}, \ldots, e^{j\theta_N^{(t)}})$ to calculate the current reward $r_t$ by (2), and the next state $s_{t+1}$ is then obtained by (9). The tuple $\{s_t, a_t, r_t, s_{t+1}\}$ is stored as one transition in $\mathcal{D}$. The critic evaluation net then samples an $N_B$-size minibatch $\{s_j, a_j, r_j, s_{j+1}\}$ ($j = 1, \ldots, N_B$) from the experience replay $\mathcal{D}$ to calculate the target Q values $y_j$, i.e.,
$$y_j = \begin{cases} r_j, & j = N_B, \\ r_j + \lambda\, Q'\!\left( s_{j+1}, \mu'(s_{j+1}; \boldsymbol{\theta}_{\mu'}); \boldsymbol{\theta}_{q'} \right), & j < N_B. \end{cases} \qquad (11)$$
The loss function of the critic evaluation net is given by
$$L(\boldsymbol{\theta}_q) = \frac{1}{N_B} \sum_{j=1}^{N_B} \left( y_j - Q(s_j, a_j; \boldsymbol{\theta}_q) \right)^{2}. \qquad (12)$$
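The target values (11) and the critic loss (12) can be computed as in the following sketch, which assumes the target critic and target actor are available as plain callables; it is an illustrative fragment rather than the authors' code.

import numpy as np

def critic_targets(batch, critic_target, actor_target, lam=0.95):
    """Target Q values (11) for a minibatch of transitions (s_j, a_j, r_j, s_next_j)."""
    y = np.empty(len(batch))
    for j, (s, a, r, s_next) in enumerate(batch):
        if j == len(batch) - 1:              # last sample of the minibatch: y_j = r_j
            y[j] = r
        else:                                # bootstrapped target through the target networks
            y[j] = r + lam * critic_target(s_next, actor_target(s_next))
    return y

def critic_loss(batch, y, critic_eval):
    """Mean-squared error loss (12) of the critic evaluation network."""
    q = np.array([critic_eval(s, a) for (s, a, _, _) in batch])
    return np.mean((y - q) ** 2)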

Algorithm 1 The DRL-Based Framework
Input: The discount factor $\lambda$, the soft update coefficient $\tau$, the learning rate $\alpha$, the experience replay capacity $C$, and the batch size $N_B$.
Randomly initialize the critic evaluation network $Q(s, a; \boldsymbol{\theta}_q)$ and the actor evaluation network $\mu(s; \boldsymbol{\theta}_{\mu})$. Initialize the critic target network $Q'(s, a; \boldsymbol{\theta}_{q'})$ and the actor target network $\mu'(s; \boldsymbol{\theta}_{\mu'})$ with the parameters of the corresponding evaluation networks. Empty the experience replay $\mathcal{D}$.
Output: The optimal phase shift matrix $\boldsymbol{\Phi}^{*}$ and the maximized received SNR $\gamma^{*}$ under the current channel state.
1: for episode $j = 1, \ldots, K$ do
2:   Obtain the current CSI $(\mathbf{h}_r^{(j)}, \mathbf{G}^{(j)}, \mathbf{h}_d^{(j)})$;
3:   Randomly choose phase shifts to obtain $\boldsymbol{\Phi}^{(0)}$ and $\gamma^{(0)}$ as the initial state $s_1$;
4:   Initialize a random process $\mathcal{N}$;
5:   for $t = 1, \ldots, T$ do
6:     Take the action $a_t = \mu(s_t; \boldsymbol{\theta}_{\mu}) + \mathcal{N}$;
7:     Reform $a_t$ into the phase shift matrix $\boldsymbol{\Phi}^{(t)} = \mathrm{diag}(e^{j\theta_1^{(t)}}, \ldots, e^{j\theta_N^{(t)}})$ to calculate $\gamma^{(t)}$. Obtain the next state $s_{t+1}$. Then, store the transition $\{s_t, a_t, r_t, s_{t+1}\}$ in $\mathcal{D}$.
8:     Sample a minibatch of $N_B$ transitions $\{s_j, a_j, r_j, s_{j+1}\}$ from $\mathcal{D}$.
9:     Set the target Q values according to (11).
10:    Update $Q(s, a; \boldsymbol{\theta}_q)$ by minimizing the loss in (12).
11:    Update the policy $\mu(s; \boldsymbol{\theta}_{\mu})$ using the sampled policy gradient in (13).
12:    Soft update the target networks according to (7).
13:    Update the state: $s_t = s_{t+1}$.
14:   end for
15: end for

The critic evaluation net can then be updated by stochastic gradient descent (SGD). Afterwards, the policy gradient is used to update the actor evaluation net with the ascent direction
$$\Delta_{\boldsymbol{\theta}_{\mu}} = \frac{1}{N_B} \sum_{j=1}^{N_B} \nabla_{a} Q(s_j, a; \boldsymbol{\theta}_q)\big|_{a = \mu(s_j; \boldsymbol{\theta}_{\mu})}\, \nabla_{\boldsymbol{\theta}_{\mu}} \mu(s_j; \boldsymbol{\theta}_{\mu}). \qquad (13)$$
Finally, the actor target net and the critic target net are updated using the soft update (7). The details of the DRL-based framework are summarized in Algorithm 1.
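Putting (11)–(13) and the soft update (7) together, one DDPG update step could look roughly like the PyTorch sketch below. The module and optimizer objects are assumptions for illustration (the critic is assumed to accept the state and action as two arguments), and, for simplicity, every sample in the minibatch is bootstrapped rather than following the case split in (11); the letter does not publish an implementation.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, actor_tgt, critic, critic_tgt,
                actor_opt, critic_opt, lam=0.95, tau=0.005):
    """One DDPG update on a minibatch: critic loss (12), actor ascent (13), soft update (7)."""
    s, a, r, s_next = batch   # tensors of shape (N_B, N+1), (N_B, N), (N_B, 1), (N_B, N+1)

    # Target Q values, bootstrapped through the target networks as in (11)
    with torch.no_grad():
        y = r + lam * critic_tgt(s_next, actor_tgt(s_next))

    # Critic update: minimize the mean-squared error loss (12)
    critic_opt.zero_grad()
    F.mse_loss(critic(s, a), y).backward()
    critic_opt.step()

    # Actor update: ascend the sampled policy gradient (13), i.e., minimize -Q(s, mu(s))
    actor_opt.zero_grad()
    (-critic(s, actor(s)).mean()).backward()
    actor_opt.step()

    # Soft update (7) of both target networks
    for tgt, src in ((actor_tgt, actor), (critic_tgt, critic)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)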

IV. NUMERICAL RESULTS

This section demonstrates the performance of the proposed framework. The channel between the BS and the user is assumed to be Rayleigh fading, which corresponds to the line-of-sight path between them being blocked (other fading models could be used as well), i.e.,
$$\mathbf{h}_d = \sqrt{PL_d}\, \tilde{\mathbf{h}}_d, \qquad (14)$$
where $\tilde{\mathbf{h}}_d \in \mathbb{C}^{M \times 1}$ contains independent and identically distributed (i.i.d.) $\mathcal{CN}(0, 1)$ elements. The channels between the BS and the IRS and between the IRS and the user are modeled as Rician fading, i.e.,

$$\mathbf{G} = \sqrt{PL_G}\left( \sqrt{\tfrac{K_1}{K_1+1}}\, \overline{\mathbf{G}} + \sqrt{\tfrac{1}{K_1+1}}\, \tilde{\mathbf{G}} \right), \qquad (15)$$
$$\mathbf{h}_r = \sqrt{PL_r}\left( \sqrt{\tfrac{K_2}{K_2+1}}\, \overline{\mathbf{h}}_r + \sqrt{\tfrac{1}{K_2+1}}\, \tilde{\mathbf{h}}_r \right), \qquad (16)$$
where $K_1$ and $K_2$ are the Rician K-factors, and $\tilde{\mathbf{G}} \in \mathbb{C}^{N \times M}$ and $\tilde{\mathbf{h}}_r \in \mathbb{C}^{N \times 1}$ are the random components with i.i.d. $\mathcal{CN}(0, 1)$ elements. The spacing between two adjacent antenna elements at both the BS and the IRS is half of the carrier wavelength. The deterministic components $\overline{\mathbf{G}}$ and $\overline{\mathbf{h}}_r$ can then be expressed as [15, Eq. (3)], [16, Eq. (6)]
$$\overline{\mathbf{G}} = \left[ \mathbf{a}_{N_x}^{H}(\theta_{\mathrm{AoA,h}}) \otimes \mathbf{a}_{N_y}^{H}(\theta_{\mathrm{AoA,v}}) \right] \mathbf{a}_{M}(\theta_{\mathrm{AoD,b}}), \qquad (17)$$
$$\overline{\mathbf{h}}_r = \mathbf{a}_{N_y}^{H}(\theta_{\mathrm{AoD,v}}) \otimes \tilde{\mathbf{a}}_{N_x}^{H}(\theta_{\mathrm{AoD,v}}, \theta_{\mathrm{AoD,h}}), \qquad (18)$$
with
$$\mathbf{a}_{i}(\theta) = \left[ 1, e^{-j2\pi \frac{d}{\lambda}\sin(\theta)}, \ldots, e^{-j2\pi (i-1)\frac{d}{\lambda}\sin(\theta)} \right], \qquad (19)$$
$$\tilde{\mathbf{a}}_{N_x}(\theta_{\mathrm{AoD,v}}, \theta_{\mathrm{AoD,h}}) = \left[ 1, e^{-j2\pi \frac{d}{\lambda}\phi}, \ldots, e^{-j2\pi (N_x-1)\frac{d}{\lambda}\phi} \right], \qquad (20)$$
where $\phi = \cos(\theta_{\mathrm{AoD,v}}) \sin(\theta_{\mathrm{AoD,h}})$, $\theta_{\mathrm{AoA/AoD,h/v}}$ denote the angles of arrival/departure in the horizontal/vertical direction at the IRS, and $\theta_{\mathrm{AoD,b}}$ is the angle of departure at the BS.
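One realization of the channels (14)–(20) can be generated with the NumPy sketch below. The helper names, the dictionary of angles, and the element spacing d/λ = 0.5 are our own illustrative choices; the sketch is not part of the letter's simulation code.

import numpy as np

def steer(n, angle, d_over_lambda=0.5):
    """ULA steering vector (19) of length n for an angle in radians."""
    k = np.arange(n)
    return np.exp(-1j * 2.0 * np.pi * k * d_over_lambda * np.sin(angle))

def steer_phi(nx, theta_v, theta_h, d_over_lambda=0.5):
    """Modified steering vector (20) with phi = cos(theta_v) * sin(theta_h)."""
    phi = np.cos(theta_v) * np.sin(theta_h)
    k = np.arange(nx)
    return np.exp(-1j * 2.0 * np.pi * k * d_over_lambda * phi)

def rician_channels(M, Nx, Ny, K1, K2, PL_G, PL_r, ang, rng):
    """One realization of G per (15)/(17) and h_r per (16)/(18)."""
    N = Nx * Ny
    # Deterministic (LOS) components per (17) and (18)
    a_irs = np.kron(steer(Nx, ang["AoA_h"]).conj(), steer(Ny, ang["AoA_v"]).conj())
    G_los = np.outer(a_irs, steer(M, ang["AoD_b"]))                     # N x M
    h_los = np.kron(steer(Ny, ang["AoD_v"]).conj(),
                    steer_phi(Nx, ang["AoD_v"], ang["AoD_h"]).conj())   # length N
    # Random (NLOS) components with i.i.d. CN(0, 1) entries
    G_nlos = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
    h_nlos = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
    G = np.sqrt(PL_G) * (np.sqrt(K1 / (K1 + 1)) * G_los + np.sqrt(1.0 / (K1 + 1)) * G_nlos)
    h_r = np.sqrt(PL_r) * (np.sqrt(K2 / (K2 + 1)) * h_los + np.sqrt(1.0 / (K2 + 1)) * h_nlos)
    return G, h_r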

The distance between the BS and the IRS is 51 m, M = 10, N = 50 ($N_x = 10$, $N_y = 5$) unless otherwise specified, $P_{\max} = 5$ dBm, and $\sigma^2 = -80$ dBm. The user moves along a line parallel to the line connecting the BS and the IRS, and the vertical distance between these two lines is 1.5 m. The path loss is modeled as $PL = PL_0 - 10\,\xi \log_{10}(d/D_0)$ dB, where $PL_0 = -30$ dB, $D_0 = 1$ m, $\xi$ is the path loss exponent, and $d$ is the BS-user horizontal distance. A penetration loss of 5 dB is assumed in both the BS-user and IRS-user links. An antenna gain of 0 dBi is assumed at both the BS and the user, and 5 dBi at the IRS. The path loss exponents of the BS-IRS, BS-user, and IRS-user links are set to $\xi_{bi} = 2$ and $\xi_{bu} = \xi_{iu} = 2.8$, respectively. The simulation results are averaged over 500 realizations of the channels' random components.
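For reference, the path loss model above converts to the linear scale used in (14)–(16) as in the short sketch below; the antenna gains and penetration losses quoted above would be added to the link budget separately. The function name and defaults are ours.

import math

def path_loss_linear(d, xi, PL0_db=-30.0, D0=1.0):
    """Linear-scale path loss from PL = PL0 - 10 * xi * log10(d / D0) [dB]."""
    pl_db = PL0_db - 10.0 * xi * math.log10(d / D0)
    return 10.0 ** (pl_db / 10.0)

# Example: BS-IRS link of 51 m with exponent xi_bi = 2.
PL_G = path_loss_linear(51.0, xi=2.0)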

In the proposed DRL framework, all neural networks are four-layer DNNs. Both the actor evaluation net and the critic evaluation net use the Adam optimizer for parameter updates. The input layer of the actor network contains N + 1 neurons and its output layer contains N neurons (these two numbers change to 2N + 1 and 1, respectively, in the critic network). The two hidden layers contain 300 and 200 neurons, respectively. The first three layers are each followed by a ReLU activation, while the output layer uses a tanh(·) activation to provide sufficient gradient. Furthermore, we set the batch size $N_B = 16$, the number of steps per episode $T = 1000$, the learning rate $\alpha = 10^{-3}$, the discount factor $\lambda = 0.95$, the soft update coefficient $\tau = 0.005$, and the experience replay capacity $C = 50000$. The exploration noise $\mathcal{N}$ is selected as complex Gaussian noise with zero mean and variance 0.1.
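The layer sizes described above translate into the following PyTorch sketch (input layer, two hidden layers with 300 and 200 neurons, output layer). The module code is our own rendering of the stated architecture; the letter itself releases no implementation, and the tanh at the critic output (which the text states applies to all networks) is kept here for faithfulness.

import torch.nn as nn

def make_actor(N):
    """Actor: input N+1 (state), hidden 300/200 with ReLU, output N phase-shift values via tanh."""
    return nn.Sequential(
        nn.Linear(N + 1, 300), nn.ReLU(),
        nn.Linear(300, 200), nn.ReLU(),
        nn.Linear(200, N), nn.Tanh(),
    )

def make_critic(N):
    """Critic: input 2N+1 (state and action concatenated), hidden 300/200, scalar Q output."""
    return nn.Sequential(
        nn.Linear(2 * N + 1, 300), nn.ReLU(),
        nn.Linear(300, 200), nn.ReLU(),
        nn.Linear(200, 1), nn.Tanh(),   # tanh on the output layer, as stated in the text
    )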

Fig. 3 demonstrates the received SNR of the proposed algorithm vs. the horizontal distance between the BS and the user, denoted as d. In this figure, we consider a scenario similar to [5], where the IRS is coated on the facade of a tall building and is aware of the BS's location, so that $K_1 \to \infty$ and $K_2 = 0$. The performance of the SDR algorithm [5], serving as an upper bound, the fixed point iteration algorithm [6] with random initialization, and the system without an IRS are also shown. It can be observed that the proposed DRL-based framework almost achieves the upper bound of the received SNR, which testifies to its near-optimality. Note that it brings less gain when the user is close to the BS; this is because, in this setting, the user is far from the IRS and thus receives less signal power from it. In the absence of the IRS, the received SNR decreases rapidly as the user moves away from the BS. This performance degradation can be substantially mitigated by placing an IRS between the BS and the user. It is also noted that the performance of the proposed algorithm is clearly superior to the fixed point iteration when d ≥ 40 m.

Fig. 3. Received SNR vs. BS-user horizontal distance.

Fig. 4. Received SNR vs. number of elements on the IRS.

In Fig. 4, the received SNRs for different numbers of passive elements on the IRS are compared. In this figure, we consider a scenario similar to [7], where the IRS is placed without knowledge of the BS's location, the Rician K-factors are set to $K_1 = K_2 = 10$, and the horizontal BS-user distance is d = 48 m. As can be observed, the performance of all algorithms improves as the number of passive units on the IRS increases, because more power is reflected by the IRS. In particular, the difference between the received SNRs with N = 50 and N = 100 is approximately 6 dB, i.e., doubling the number of passive units roughly quadruples the received SNR, which suggests that an $\mathcal{O}(N^2)$ gain can be attained.

TABLE I
RUNNING TIME COMPARISON

The running times of the three algorithms for different numbers of passive elements on the IRS are given in Table I. The other simulation parameters are the same as in Fig. 4. We can see that the SDR algorithm is extremely time-consuming, and its running time increases enormously as N increases. The fixed point iteration algorithm has the lowest running time [6], but the time consumed grows quickly as N increases, which is expected since more optimization variables are involved. In contrast, the time consumed by the proposed DRL-based framework remains around 34 ms for all N values, which is explicable since the number of hidden-layer neurons remains unchanged as N grows. This property verifies that the proposed framework is efficient and robust. More importantly, it achieves an almost optimal received SNR with relatively low time consumption.

Fig. 5. Received SNR vs. number of elements at the BS/IRS.

Fig. 5 compares the performance of the proposed framework for different numbers of antenna elements at the BS and different numbers of passive units at the IRS. The curve "DRL-based-M" is obtained by fixing the number of passive units on the IRS to 30 ($N_x = 10$ and $N_y = 3$) and varying the number of antennas at the BS from 30 to 100. The curve "DRL-based-N" is obtained by fixing the number of antennas at the BS to 30 and varying the number of passive units on the IRS from 30 to 100 ($N_y$ changes from 3 to 10); for this curve, the framework is retrained for each point. The other simulation parameters are the same as in Fig. 4. It can be seen that increasing the number of passive units on the IRS leads to a higher performance gain, which indicates that increasing the number of low-cost passive elements on the IRS is more energy-efficient than enlarging the number of costly RF chains at the BS. It is worth noting that the performance gain becomes more pronounced as the number of elements grows.

V. CONCLUSION

In this letter, we investigated the phase shift design for an IRS-aided downlink MISO wireless communication system to maximize the received SNR. An efficient DRL-based framework was proposed to tackle the non-convex unit modulus constraints, which are the major difficulty in optimizing the phase shifts introduced by the IRS. Numerical results reveal that the proposed framework obtains significant performance gains compared to the fixed point iteration algorithm and achieves almost the upper bound calculated by the SDR algorithm with much less time consumption.

REFERENCES

[1] Q. Wu and R. Zhang, "Towards smart and reconfigurable environment: Intelligent reflecting surface aided wireless network," 2019. [Online]. Available: arXiv:1905.00152.
[2] W. Tang et al., "Wireless communications with programmable metasurface: New paradigms, opportunities, and challenges on transceiver design," 2019. [Online]. Available: arXiv:1907.01956.
[3] W. Tang et al., "Programmable metasurface-based RF chain-free 8PSK wireless transmitter," Electron. Lett., vol. 55, no. 7, pp. 417–420, Apr. 2019.
[4] S. Abeywickrama, R. Zhang, and C. Yuen, "Intelligent reflecting surface: Practical phase shift model and beamforming optimization," 2019. [Online]. Available: arXiv:1907.06002.
[5] Q. Wu and R. Zhang, "Intelligent reflecting surface enhanced wireless network: Joint active and passive beamforming design," in Proc. IEEE Glob. Commun. Conf. (GLOBECOM), 2018, pp. 1–6.
[6] X. Yu, D. Xu, and R. Schober, "MISO wireless communication systems via intelligent reflecting surfaces," 2019. [Online]. Available: arXiv:1904.12199.
[7] Y. Han, W. Tang, S. Jin, C.-K. Wen, and X. Ma, "Large intelligent surface-assisted wireless communication exploiting statistical CSI," IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 8238–8242, Aug. 2019.
[8] Q. Wu and R. Zhang, "Beamforming optimization for intelligent reflecting surface with discrete phase shifts," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2019, pp. 7830–7833.
[9] C. Huang, G. C. Alexandropoulos, C. Yuen, and M. Debbah, "Indoor signal focusing with deep learning designed reconfigurable intelligent surfaces," 2019. [Online]. Available: arXiv:1905.07726.
[10] A. Taha, M. Alrabeiah, and A. Alkhateeb, "Enabling large intelligent surfaces with compressive sensing and deep learning," 2019. [Online]. Available: arXiv:1904.10136.
[11] R. W. Picard et al., "Affective learning—A manifesto," BT Technol. J., vol. 22, no. 4, pp. 253–269, 2004.
[12] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[13] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2000, pp. 1057–1063.
[14] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," 2015. [Online]. Available: arXiv:1509.02971.
[15] Y. Cao and T. Lv, "Intelligent reflecting surface aided multi-user millimeter-wave communications for coverage enhancement," 2019. [Online]. Available: arXiv:1910.02398.
[16] X. Li, S. Jin, H. A. Suraweera, J. Hou, and X. Gao, "Statistical 3-D beamforming for large-scale MIMO downlink systems over Rician fading channels," IEEE Trans. Commun., vol. 64, no. 4, pp. 1529–1543, Apr. 2016.
