Methodology approach to Machine Learning-based Cognitive Network Management in Virtualized Multi-Tenant 5G Networks
ElasticMON Regression Models
Berkay KÖKSAL
Fall 2018

Table of Contents
Project Description
Dataset Preprocessing
Dataset Analysis
    Correlation Matrix
    Final Dataset
Regression Models and Validation
    LASSO - Least Absolute Shrinkage and Selection Operator
    Elastic Net
    Tree Based Models - Random Forest and XGBoost
        Random Forest Regressor
        XGBoost
    Combined Model
Use of Model in Real Time ElasticMON
Further Work

Project Description

This report analyses the mac_stats JSON fields of FlexRAN and creates a regression model using the correlations between the different metrics. The main purpose of the model is to predict the wbCqi of a UE from a given set of important statistics. A properly trained wbCqi-based model with high accuracy can be used for many applications, such as:

● Predicting the wbCqi in the future
○ Possible UE connection loss can be predicted in advance and actions taken beforehand
○ The velocity of the UE can be estimated from the rate of change over time
● Predicting possible eNB issues
○ It can be detected when all UEs of one eNB are about to lose wbCqi, and actions taken

As the initial purpose, the quality of a UE is defined as its wbCqi for this application. Keep in mind that any other metric can be used in the future to create more complex models that predict different metrics and different situations, depending on the purpose. In this context, the project is required to contain the following steps:

● Data cleaning
● Data preprocessing
● Data analysis
● Regression model training
● Save/load of trained models
● Validation and evaluation of results
● (further work) A real-time predictor

Dataset Preprocessing

The FlexRAN controller provides mac_stats measurements for UE statistics in a JSON format with more than 100 metrics per measurement. Please examine the fields described in the FlexRAN northbound API documentation. In order to monitor and train on data over a given time span, we also had to add the measurement timestamp to the JSON tree, so that the time elapsed between measurements is known.

Furthermore, some of these metrics remain constant regardless of whether the UE is in motion or stationary. We therefore applied a second preprocessing step to remove fields that do not change over time. Even after removing the static metrics and keeping only the 42 related/dynamic columns in the dataset, there are still problems originating from the southbound side of the FlexRAN measurements when the recording frequency is high (one measurement per millisecond). Please notice the integer overflow that occurred on macStats_phr during the recording. Southbound API development is out of scope for this project, so for the moment we address this issue in the data cleaning step.

Cleaning steps:
● Aggregate multiple measurements, using the mean or median of the other columns, to remove overflows and sudden metric spikes
● Replace the datetime format with a date_index that simply grows towards the future, rather than exact dates (exact dates give no useful information)

The preprocessing of the dataset is now complete. We have 42 fields left that may or may not be used for the regression model. The recordings contain no bad data, and the date index problem is resolved.
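The report describes these cleaning steps without showing the code. The following is a minimal pandas sketch of how they could look; the file name, the timestamp column name, and the aggregation window k are assumptions for illustration, not values from the report.

import pandas as pd

# Load the recorded mac_stats measurements (file name and a flat JSON
# layout are assumptions; a nested recording may need json_normalize).
df = pd.read_json("mac_stats_recording.json")

# Replace exact datetimes with a growing date_index (exact dates carry
# no useful information for the model).
df = df.sort_values("timestamp").reset_index(drop=True)
df["date_index"] = df.index
df = df.drop(columns=["timestamp"])

# Drop static fields: columns whose value never changes over the recording.
static_cols = [c for c in df.columns if df[c].nunique() <= 1]
df = df.drop(columns=static_cols)

# Aggregate every k consecutive measurements with the median to remove
# integer overflows (e.g. on macStats_phr) and sudden metric spikes.
k = 5  # aggregation window, chosen here for illustration only
df = df.groupby(df.index // k).median(numeric_only=True).reset_index(drop=True)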
Following is the list of the 42 features left in the dataset for the further analysis part:

['date_index', 'rsrp', 'rsrq', 'wbcqi', 'macStats_phr', 'dlCqiReport_sfnSn', 'macStats_totalBytesSdusDl', 'macStats_totalTbsUl', 'macStats_mcs1Ul', 'macStats_totalPduDl', 'macStats_totalBytesSdusUl', 'macStats_tbsDl', 'macStats_totalPrbUl', 'macStats_macSdusDl_sduLength', 'macStats_macSdusDl_lcid', 'macStats_prbUl', 'macStats_totalPduUl', 'macStats_mcs1Dl', 'macStats_mcs2Dl', 'macStats_prbDl', 'macStats_totalPrbDl', 'macStats_prbRetxDl', 'macStats_totalTbsDl', 'ulCqiReport_sfnSn', 'pdcpStats_pktRx', 'pdcpStats_pktRxW', 'pdcpStats_pktRxAiatW', 'pdcpStats_pktRxOo', 'pdcpStats_pktRxBytesW', 'pdcpStats_pktRxSn', 'pdcpStats_pktTxBytesW', 'pdcpStats_pktTxSn', 'pdcpStats_pktTxBytes', 'pdcpStats_pktRxAiat', 'pdcpStats_pktRxBytes', 'pdcpStats_pktTx', 'pdcpStats_pktTxW', 'pdcpStats_pktTxAiatW', 'pdcpStats_sfn', 'pdcpStats_pktTxAiat', 'rnti', 'quality']

Dataset Analysis

We have recorded many patterns from different scenarios of a UE moving away from the eNB, moving closer to it, or staying at a stable distance. Now we can use further analysis tools to see which fields in fact correlate with which.

Correlation Matrix

The correlation matrix gives us the correlation level of each feature in a DataFrame against every other field. We used DataFrame.corr for this purpose. This is an important step, as it provides us a list of useful and non-useful fields. As we are focusing on wbCqi, we also created a list of correlation levels (from 0 to 1) against the other fields and sorted it, to see which fields are the most correlated with wbCqi.

From the matrix, what we have learnt is:

● About quality (wbCqi):
○ As expected, rsrp, rsrq, and phr correlate strongly with it!
○ macStats_mcs1Dl and macStats_mcs1Ul also correlate quite well (which might change for other patterns; more work is needed)
○ We definitely learnt that pktRx and pktTx have no correlation at all, as they are not related to wbCqi
● About the other fields:
○ rsrp, rsrq, and phr correlate quite well amongst each other as well
○ tbsDl correlates with the Tx-related fields
● About unnecessary fields:
○ We also identified unnecessary fields such as prbUl, rnti, date_index (we already knew this one), pdcpStats_sfn, etc.

NOTE: mcs1Dl is created and calculated directly from wbCqi, so it has an almost one-to-one correlation with wbCqi. Using this field in the training would cause all models to put their coefficients on it and deliver 100% accuracy. We cannot allow this, as it would discredit the whole purpose of the project; therefore we dropped it from the training set, obtaining less accurate but more valuable models.

Final Dataset

Now we can drop the unnecessary columns from the dataset and focus only on the top 10-15 fields we have for the prediction.

NOTE: We also drop wbCqi itself, as it is the value we try to predict and we do not want it among the predictors.

We have almost 30000 recorded measurements and keep the top 15 fields that correlate the most with wbCqi in our training set. All of them can be represented as integers, which makes the regression models much more reliable and easier to train. The minimum correlation in the dataset is 0.415038, for the field pdcpStats_pktTxSn, with all other kept fields above that. As all fields can be represented as integers, there is no need for transformations, or even support vector machines, to convert anything.
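Since the report states that DataFrame.corr was used, a minimal sketch of this selection step could look as follows, assuming the cleaned DataFrame df from the preprocessing stage; the column names come from the feature list above.

# Pairwise correlation of all fields in the cleaned dataset.
corr = df.corr()

# Absolute correlation of every field against wbcqi, sorted descending.
wbcqi_corr = corr["wbcqi"].abs().sort_values(ascending=False)

# Drop the target itself and the leaking field mcs1Dl (derived directly
# from wbCqi), then keep the 15 strongest remaining predictors.
wbcqi_corr = wbcqi_corr.drop(["wbcqi", "macStats_mcs1Dl"])
top15 = wbcqi_corr.head(15).index.tolist()

X = df[top15]     # predictor matrix
y = df["wbcqi"]   # prediction target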
Regression Models and Validation

We modelled several different regression and tree-based techniques, which are constructed, explained, and validated in the following sections. We then pick the best of all approaches and create an ensemble model, which represents our final model and the end result of this project.

Regression models aim at modelling a linear relationship between one or several independent variables and a dependent variable (the target variable). In our case, the dependent variable is quality (wbCqi for now), while the independent variables are all the variables we kept in the dataset to predict wbCqi.

Before starting the regression, knowing our best 15 features but not their exact relation to wbCqi, it is good practice to also include x^2, x^3, sqrt(x), and x^(1/3) of each feature in the data frame, so that polynomial relations can be recognised as well. This increases the number of features in the table and makes the calculations more complex, yet it might increase our accuracy, so let us give it a shot. The total feature count is now 15 + 4x15 = 75 fields.

Now we split our dataset into training and validation sets, to be sure the validation set is never used for training purposes.

NOTE: Many techniques use a percentage-based separation into training and validation sets (e.g. 95% to 5%). However, we chose a scenario-based split, where the training set includes patterns from certain recordings and the validation set has its own patterns. This is useful, as it provides more insight into where the models fail to predict and lose accuracy; with a percentage-based split we would not know the exact point in the time series where the degradation started. We still kept roughly a 90% to 10% training-to-validation ratio for this part. The training set consists of 26082 measurements and the validation set of 2959 measurements.

LASSO - Least Absolute Shrinkage and Selection Operator

LASSO is a regression method that penalizes the absolute size of the coefficients, which can result in parameter estimates that are exactly zero; the more penalty is applied, the more rapidly the estimates shrink towards zero. LASSO is a good approach when dealing with highly correlated predictors, where standard regression usually yields very large coefficients because of their high collinearity.

The LASSO technique is a regression analysis method that is powerful when we have a large number of features. It is a great model for avoiding overfitting, and it is also helpful for overcoming computational challenges. LASSO works by penalizing the magnitude of the feature coefficients while minimizing the error between predicted and actual observations; methods of this kind are called regularization techniques. The parameters of the Python LassoCV implementation are explained in the scikit-learn documentation. We need to adjust parameters such as alpha, the tolerance of the optimization (important, as it stops the model), the epoch count, and the features taken from our dataset.

Sample predictions from LASSO:

Please notice that the predictions are rounded up or down to integer values: wbCqi is an integer field, so our predictions should be integers in the end as well. According to LASSO, the most important 6 fields for its predictions were the square root of rsrp, rsrq, the square root of phr, the 1/3th power of rsrp, rsrp, and phr, and so on and so forth up to all 75 fields in total. The mean error is 1.239, and around 4 predictions were very wrong. There were 689 mispredictions out of 2959 in total (76.72% success).

Highest errors:
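The report presents these results as figures and links to the LassoCV documentation rather than showing code. Below is a minimal sketch of the feature expansion, split, and training steps, reusing X and y from the selection sketch above. The report's actual split is scenario-based; since the recording labels are not reproduced here, a contiguous 90/10 split stands in for it, and the LassoCV parameters are illustrative rather than the report's tuned values.

import numpy as np
from sklearn.linear_model import LassoCV

# Expand each of the 15 predictors with x^2, x^3, sqrt(x), and x^(1/3),
# giving 15 + 4x15 = 75 features in total.
X_poly = X.copy()
for col in X.columns:
    X_poly[col + "_sq"] = X[col] ** 2
    X_poly[col + "_cube"] = X[col] ** 3
    # clip guards against negative inputs to the square root
    X_poly[col + "_sqrt"] = np.sqrt(X[col].clip(lower=0))
    X_poly[col + "_cbrt"] = np.cbrt(X[col])

# Stand-in for the scenario-based ~90%/10% split used in the report.
n_train = int(len(X_poly) * 0.9)
X_train, X_val = X_poly.iloc[:n_train], X_poly.iloc[n_train:]
y_train, y_val = y.iloc[:n_train], y.iloc[n_train:]

# Cross-validated LASSO: alpha is chosen automatically; tol and max_iter
# play the role of the tolerance and epoch count mentioned above.
lasso = LassoCV(cv=5, tol=1e-3, max_iter=10000)
lasso.fit(X_train, y_train)

# wbCqi is an integer field, so predictions are rounded to integers.
pred = np.rint(lasso.predict(X_val))
print("mean error:", np.abs(pred - y_val.to_numpy()).mean())

The same pipeline applies unchanged to the Elastic Net of the next section by swapping LassoCV for sklearn.linear_model.ElasticNetCV.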
Elastic Net

The Elastic Net is a combination of LASSO and ridge regression. It aims at overcoming the individual limitations that are prevalent in LASSO and ridge, while taking advantage of each model's strengths. The Elastic Net enforces sparsity. Sparsity refers to the relationship between the number of predictors and the number of samples: if the number of predictors is greater than or equal to the number of samples, an ordinary model is impossible to fit, and one would have to use a subset of predictors smaller than the number of samples. This leads to the great advantage that the Elastic Net has no limitation on how many predictors we can use. In our case, with this dataset, the number of samples is far greater than the number of predictors. However, it is still nice to take advantage of another strength of the Elastic Net: it encourages a grouping effect amongst highly correlated predictors, which allows for better control of the impact of each predictor.

Some predictions of ElasticNet:

According to the ElasticNet model, the most important fields were again the square root of rsrp, rsrq, the square root of phr, the 1/3th power of rsrp, and rsrp. The mean error is 1.357, and about 4 predictions were very wrong. There were 815 mispredictions out of 2959 in total (72.46% success).

Highest errors:

Tree Based Models - Random Forest and XGBoost

Tree-based learning algorithms are among the best and most widely used supervised learning methods. They have the advantage that they can model non-linear relationships quite well. Random Forest works for both categorical and continuous input and output variables; a tree is said to be categorical if the target variable is categorical. In our case we grow a continuous (regression) forest, as the target variable quality is continuous. The advantages of random forests are that they are easy to understand, useful in data exploration, and not constrained by the data type. Data does not need to be cleaned as intensively for random forests, and random forest is a non-parametric method, which means the trees make no assumptions about the space distribution or the classifier structure. The disadvantages of trees are that they tend to overfit, and that when used for continuous variables a tree loses information when it buckets the variable into different categories. Both disadvantages must be accounted for in our wbCqi model!

Random Forest Regressor

Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It also handles dimensionality reduction, missing values, outlier values and other essential steps of data exploration, and does a good job in general. It is a type of ensemble learning method, where a group of weak models combines to form a powerful model. If we have N instances in the training set, a sample of these N instances is taken at random but with replacement; this becomes the training set for growing a tree. From the M input variables, a number m << M is selected at random at each node, and the best split on these m variables is used to split the node.
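As a sketch of this procedure with scikit-learn's RandomForestRegressor, reusing X_train, y_train, X_val, and y_val from the LASSO sketch; n_estimators and max_features here are illustrative choices, not the report's tuned values.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Each tree is grown on a bootstrap sample (N instances drawn with
# replacement); at every node only a random subset of m features is
# considered for the split.
forest = RandomForestRegressor(
    n_estimators=100,     # number of trees, an illustrative default
    max_features="sqrt",  # m is roughly sqrt(M) features per split
    random_state=0,
)
forest.fit(X_train, y_train)

# Round to integers, as wbCqi is an integer-valued field.
pred = np.rint(forest.predict(X_val))
print("mean error:", np.abs(pred - y_val.to_numpy()).mean())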