Methodology approach to Machine Learning-based Cognitive Network Management in Virtualized Multi-Tenant 5G Networks
ElasticMON Regression Models
Berkay KÖKSAL
Fall 2018

Table of Contents
Project Description
Dataset Preprocessing
Dataset Analysis
    Correlation Matrix
    Final Dataset
Regression Models and Validation
    LASSO - Least Absolute Shrinkage and Selection Operator
    Elastic Net
    Tree Based Models - Random Forest and XGBoost
        Random Forest Regressor
        XGBoost
    Combined Model
Use of Model in Real Time ElasticMON
Further Work

Project Description

This report analyses the mac_stats JSON fields of FlexRAN and creates a regression model using the correlations between the different metrics. The main purpose of the model is to predict the wbCqi of a UE from a given set of important statistics. A properly trained wbCqi-based model with high accuracy can be used for many applications, such as:

● Predicting the wbCqi in the future
○ Possible UE connection loss can be predicted in advance and actions taken beforehand
○ The velocity of the UE can be estimated from the rate of change over time
● Predicting possible eNB issues
○ It can be detected when all UEs of one eNB are about to lose wbCqi, and actions taken

As the initial purpose, the quality of a UE is defined as its wbCqi for this application. Keep in mind that any other metric can be used in the future to create more complex models that predict different metrics and different situations, depending on the purpose. In this context, the project is required to contain the following steps:

● Data cleaning
● Data preprocessing
● Data analysis
● Regression model training
● Save/load of trained models
● Validation and evaluation of results
● (further work) A real-time predictor

Dataset Preprocessing

The FlexRAN controller provides mac_stats measurements for UE statistics in a JSON format with more than 100 metrics per measurement. Please examine the fields described in the FlexRAN northbound API documentation. In order to monitor and train on data over a given time span, we also had to add the measurement timestamp to the JSON tree, so that the time elapsed between measurements is known.

Furthermore, some of these metrics remain constant regardless of whether the UE is in motion or stationary. We therefore applied a second preprocessing step to remove fields that do not change over time. Even after removing the static metrics and keeping only the 42 related/dynamic columns in the dataset, there are still problems originating from the southbound side of the FlexRAN measurements when the recording frequency is high (one measurement per millisecond). Please notice the integer overflow that occurred on macStats_phr during the recording. Southbound API development is out of scope for this project, so for the moment we address this issue in the data cleaning step.

Cleaning steps:
● Aggregate multiple measurements, using the mean or median of the other columns, to remove overflows and sudden metric spikes
● Replace the datetime format with a date_index that simply grows towards the future, rather than exact dates (exact dates give no useful information)

The preprocessing of the dataset is now complete. We have 42 fields left that may or may not be used for the regression model. The recordings contain no bad data, and the date index problem is resolved.
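The report describes these cleaning steps without showing the code. The following is a minimal pandas sketch of how they could look; the file name, the timestamp column name, and the aggregation window k are assumptions for illustration, not values from the report.

import pandas as pd

# Load the recorded mac_stats measurements (file name and a flat JSON
# layout are assumptions; a nested recording may need json_normalize).
df = pd.read_json("mac_stats_recording.json")

# Replace exact datetimes with a growing date_index (exact dates carry
# no useful information for the model).
df = df.sort_values("timestamp").reset_index(drop=True)
df["date_index"] = df.index
df = df.drop(columns=["timestamp"])

# Drop static fields: columns whose value never changes over the recording.
static_cols = [c for c in df.columns if df[c].nunique() <= 1]
df = df.drop(columns=static_cols)

# Aggregate every k consecutive measurements with the median to remove
# integer overflows (e.g. on macStats_phr) and sudden metric spikes.
k = 5  # aggregation window, chosen here for illustration only
df = df.groupby(df.index // k).median(numeric_only=True).reset_index(drop=True)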
Following is the list of the 42 features left in the dataset for the further analysis part:

['date_index', 'rsrp', 'rsrq', 'wbcqi', 'macStats_phr', 'dlCqiReport_sfnSn', 'macStats_totalBytesSdusDl', 'macStats_totalTbsUl', 'macStats_mcs1Ul', 'macStats_totalPduDl', 'macStats_totalBytesSdusUl', 'macStats_tbsDl', 'macStats_totalPrbUl', 'macStats_macSdusDl_sduLength', 'macStats_macSdusDl_lcid', 'macStats_prbUl', 'macStats_totalPduUl', 'macStats_mcs1Dl', 'macStats_mcs2Dl', 'macStats_prbDl', 'macStats_totalPrbDl', 'macStats_prbRetxDl', 'macStats_totalTbsDl', 'ulCqiReport_sfnSn', 'pdcpStats_pktRx', 'pdcpStats_pktRxW', 'pdcpStats_pktRxAiatW', 'pdcpStats_pktRxOo', 'pdcpStats_pktRxBytesW', 'pdcpStats_pktRxSn', 'pdcpStats_pktTxBytesW', 'pdcpStats_pktTxSn', 'pdcpStats_pktTxBytes', 'pdcpStats_pktRxAiat', 'pdcpStats_pktRxBytes', 'pdcpStats_pktTx', 'pdcpStats_pktTxW', 'pdcpStats_pktTxAiatW', 'pdcpStats_sfn', 'pdcpStats_pktTxAiat', 'rnti', 'quality']

Dataset Analysis

We have recorded many patterns from different scenarios of a UE moving away from the eNB, moving closer to it, or staying at a stable distance. Now we can use further analysis tools to see which fields in fact correlate with which.

Correlation Matrix

The correlation matrix gives us the correlation level of each feature in a DataFrame against every other field. We used DataFrame.corr for this purpose. This is an important step, as it provides us a list of useful and non-useful fields. As we are focusing on wbCqi, we also created a list of correlation levels (from 0 to 1) against the other fields and sorted it, to see which fields are the most correlated with wbCqi.

From the matrix, what we have learnt is:

● About quality (wbCqi):
○ As expected, rsrp, rsrq, and phr correlate strongly with it!
○ macStats_mcs1Dl and macStats_mcs1Ul also correlate quite well (which might change for other patterns; more work is needed)
○ We definitely learnt that pktRx and pktTx have no correlation at all, as they are not related to wbCqi
● About the other fields:
○ rsrp, rsrq, and phr correlate quite well amongst each other as well
○ tbsDl correlates with the Tx-related fields
● About unnecessary fields:
○ We also identified unnecessary fields such as prbUl, rnti, date_index (we already knew this one), pdcpStats_sfn, etc.

NOTE: mcs1Dl is created and calculated directly from wbCqi, so it has an almost one-to-one correlation with wbCqi. Using this field in the training would cause all models to put their coefficients on it and deliver 100% accuracy. We cannot allow this, as it would discredit the whole purpose of the project; therefore we dropped it from the training set, obtaining less accurate but more valuable models.

Final Dataset

Now we can drop the unnecessary columns from the dataset and focus only on the top 10-15 fields we have for the prediction.

NOTE: We also drop wbCqi itself, as it is the value we try to predict and we do not want it among the predictors.

We have almost 30000 recorded measurements and keep the top 15 fields that correlate the most with wbCqi in our training set. All of them can be represented as integers, which makes the regression models much more reliable and easier to train. The minimum correlation in the dataset is 0.415038, for the field pdcpStats_pktTxSn, with all other kept fields above that. As all fields can be represented as integers, there is no need for transformations, or even support vector machines, to convert anything.
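Since the report states that DataFrame.corr was used, a minimal sketch of this selection step could look as follows, assuming the cleaned DataFrame df from the preprocessing stage; the column names come from the feature list above.

# Pairwise correlation of all fields in the cleaned dataset.
corr = df.corr()

# Absolute correlation of every field against wbcqi, sorted descending.
wbcqi_corr = corr["wbcqi"].abs().sort_values(ascending=False)

# Drop the target itself and the leaking field mcs1Dl (derived directly
# from wbCqi), then keep the 15 strongest remaining predictors.
wbcqi_corr = wbcqi_corr.drop(["wbcqi", "macStats_mcs1Dl"])
top15 = wbcqi_corr.head(15).index.tolist()

X = df[top15]     # predictor matrix
y = df["wbcqi"]   # prediction target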
Regression Models and Validation

We modelled several different regression and tree-based techniques, which are constructed, explained, and validated in the following sections. We then pick the best of all approaches and create an ensemble model, which represents our final model and the end result of this project.

Regression models aim at modelling a linear relationship between one or several independent variables and a dependent variable (the target variable). In our case, the dependent variable is quality (wbCqi for now), while the independent variables are all the variables we kept in the dataset to predict wbCqi.

Before starting the regression, knowing our best 15 features but not their exact relation to wbCqi, it is good practice to also include x^2, x^3, sqrt(x), and x^(1/3) of each feature in the data frame, so that polynomial relations can be recognised as well. This increases the number of features in the table and makes the calculations more complex, yet it might increase our accuracy, so let us give it a shot. The total feature count is now 15 + 4x15 = 75 fields.

Now we split our dataset into training and validation sets, to be sure the validation set is never used for training purposes.

NOTE: Many techniques use a percentage-based separation into training and validation sets (e.g. 95% to 5%). However, we chose a scenario-based split, where the training set includes patterns from certain recordings and the validation set has its own patterns. This is useful, as it provides more insight into where the models fail to predict and lose accuracy; with a percentage-based split we would not know the exact point in the time series where the degradation started. We still kept roughly a 90% to 10% training-to-validation ratio for this part. The training set consists of 26082 measurements and the validation set of 2959 measurements.

LASSO - Least Absolute Shrinkage and Selection Operator

LASSO is a regression method that penalizes the absolute size of the coefficients, which can result in parameter estimates that are exactly zero; the more penalty is applied, the more rapidly the estimates shrink towards zero. LASSO is a good approach when dealing with highly correlated predictors, where standard regression usually yields very large coefficients because of their high collinearity.

The LASSO technique is a regression analysis method that is powerful when we have a large number of features. It is a great model for avoiding overfitting, and it is also helpful for overcoming computational challenges. LASSO works by penalizing the magnitude of the feature coefficients while minimizing the error between predicted and actual observations; methods of this kind are called regularization techniques. The parameters of the Python LassoCV implementation are explained in the scikit-learn documentation. We need to adjust parameters such as alpha, the tolerance of the optimization (important, as it stops the model), the epoch count, and the features taken from our dataset.

Sample predictions from LASSO:

Please notice that the predictions are rounded up or down to integer values: wbCqi is an integer field, so our predictions should be integers in the end as well. According to LASSO, the most important 6 fields for its predictions were the square root of rsrp, rsrq, the square root of phr, the 1/3th power of rsrp, rsrp, and phr, and so on and so forth up to all 75 fields in total. The mean error is 1.239, and around 4 predictions were very wrong. There were 689 mispredictions out of 2959 in total (76.72% success).

Highest errors:
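The report presents these results as figures and links to the LassoCV documentation rather than showing code. Below is a minimal sketch of the feature expansion, split, and training steps, reusing X and y from the selection sketch above. The report's actual split is scenario-based; since the recording labels are not reproduced here, a contiguous 90/10 split stands in for it, and the LassoCV parameters are illustrative rather than the report's tuned values.

import numpy as np
from sklearn.linear_model import LassoCV

# Expand each of the 15 predictors with x^2, x^3, sqrt(x), and x^(1/3),
# giving 15 + 4x15 = 75 features in total.
X_poly = X.copy()
for col in X.columns:
    X_poly[col + "_sq"] = X[col] ** 2
    X_poly[col + "_cube"] = X[col] ** 3
    # clip guards against negative inputs to the square root
    X_poly[col + "_sqrt"] = np.sqrt(X[col].clip(lower=0))
    X_poly[col + "_cbrt"] = np.cbrt(X[col])

# Stand-in for the scenario-based ~90%/10% split used in the report.
n_train = int(len(X_poly) * 0.9)
X_train, X_val = X_poly.iloc[:n_train], X_poly.iloc[n_train:]
y_train, y_val = y.iloc[:n_train], y.iloc[n_train:]

# Cross-validated LASSO: alpha is chosen automatically; tol and max_iter
# play the role of the tolerance and epoch count mentioned above.
lasso = LassoCV(cv=5, tol=1e-3, max_iter=10000)
lasso.fit(X_train, y_train)

# wbCqi is an integer field, so predictions are rounded to integers.
pred = np.rint(lasso.predict(X_val))
print("mean error:", np.abs(pred - y_val.to_numpy()).mean())

The same pipeline applies unchanged to the Elastic Net of the next section by swapping LassoCV for sklearn.linear_model.ElasticNetCV.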
Elastic Net

The Elastic Net is a combination of LASSO and ridge regression. It aims at overcoming the individual limitations that are prevalent in LASSO and ridge, while taking advantage of each model's strengths. The Elastic Net enforces sparsity. Sparsity refers to the relationship between the number of predictors and the number of samples: if the number of predictors is greater than or equal to the number of samples, an ordinary model is impossible to fit, and one would have to use a subset of predictors smaller than the number of samples. This leads to the great advantage that the Elastic Net has no limitation on how many predictors we can use. In our case, with this dataset, the number of samples is far greater than the number of predictors. However, it is still nice to take advantage of another strength of the Elastic Net: it encourages a grouping effect amongst highly correlated predictors, which allows for better control of the impact of each predictor.

Some predictions of ElasticNet:

According to the ElasticNet model, the most important fields were again the square root of rsrp, rsrq, the square root of phr, the 1/3th power of rsrp, and rsrp. The mean error is 1.357, and about 4 predictions were very wrong. There were 815 mispredictions out of 2959 in total (72.46% success).

Highest errors:

Tree Based Models - Random Forest and XGBoost

Tree-based learning algorithms are among the best and most widely used supervised learning methods. They have the advantage that they can model non-linear relationships quite well. Random Forest works for both categorical and continuous input and output variables; a tree is said to be categorical if the target variable is categorical. In our case we grow a continuous (regression) forest, as the target variable quality is continuous. The advantages of random forests are that they are easy to understand, useful in data exploration, and not constrained by the data type. Data does not need to be cleaned as intensively for random forests, and random forest is a non-parametric method, which means the trees make no assumptions about the space distribution or the classifier structure. The disadvantages of trees are that they tend to overfit, and that when used for continuous variables a tree loses information when it buckets the variable into different categories. Both disadvantages must be accounted for in our wbCqi model!

Random Forest Regressor

Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It also handles dimensionality reduction, missing values, outlier values and other essential steps of data exploration, and does a good job in general. It is a type of ensemble learning method, where a group of weak models combines to form a powerful model. If we have N instances in the training set, a sample of these N instances is taken at random but with replacement; this becomes the training set for growing a tree. From the M input variables, a number m << M is selected at random at each node, and the best split on these m variables is used to split the node.
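As a sketch of this procedure with scikit-learn's RandomForestRegressor, reusing X_train, y_train, X_val, and y_val from the LASSO sketch; n_estimators and max_features here are illustrative choices, not the report's tuned values.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Each tree is grown on a bootstrap sample (N instances drawn with
# replacement); at every node only a random subset of m features is
# considered for the split.
forest = RandomForestRegressor(
    n_estimators=100,     # number of trees, an illustrative default
    max_features="sqrt",  # m is roughly sqrt(M) features per split
    random_state=0,
)
forest.fit(X_train, y_train)

# Round to integers, as wbCqi is an integer-valued field.
pred = np.rint(forest.predict(X_val))
print("mean error:", np.abs(pred - y_val.to_numpy()).mean())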