Comparison of the M, MM, and S Estimators in Robust Regression Analysis on Indonesian Literacy Index Data 2018

Regression analysis is a method used to determine the relationship between one dependent variable and one or more independent variables. However, the presence of outliers in the 2018 Community Literacy Development Index data calls for statistical methods that are not sensitive to outliers. This motivated the adoption of robust regression methods, namely the M, S, and MM estimations. The three estimation methods are estimators with high breakdown points. This study compares the three estimation methods to determine which is better at estimating the regression coefficients in terms of the residual standard error and adjusted R-squared values. The smaller the residual standard error and the greater the adjusted R-squared, the better the estimation method. Descriptive and inferential analysis with robust regression was used because the data contain several outliers and in order to obtain a good regression model with unbiased values. The S-estimator and MM-estimator were found to be the best methods because they have the smallest Residual Standard Error (RSE) of 1.856 and an $R^2$ of 0.9778.


Introduction
Regression analysis is a method normally used to determine the relationship between one dependent and one or more independent variables [1]. Moreover, the estimation method is usually applied to estimate the value of the response variable influenced by the independent variable, and this is often conducted using the Ordinary Least Squares (OLS) or least squares method.
The classical approach to linear regression is the Ordinary Least Squares (OLS) technique, in which the sum of squared errors is minimized. Minimizing this sum means that the deviation of each observation Y from the fitted regression line is minimized. The error term in linear regression analysis must fulfil three underlying assumptions: normality, constant variance, and independence [2]. In general, the errors are assumed to be independently and identically distributed normal random variables with mean zero and constant variance. Violating these assumptions leads to a misleading analysis or even undermines the validity of the linear regression model fitted to the variables [3] [4]. The estimates computed by linear regression become unreliable and cannot provide useful information about the data if the assumptions are violated [5] [6]. In real-life data, unusual observations are a widespread issue [7]. An unusual observation in the direction of the response variable is called an outlier, while an extreme value in the predictor variables is called a high leverage point. In regression, an outlier is defined as an observation that does not follow the general pattern of the whole data set [8]. Meanwhile, a high leverage point is an x-value that lies far away from the rest of the data [9]. Not all outliers and high leverage points are influential points [10]. An observation is categorized as influential if removing it, singly or in combination with other points, changes the fitted model and hence the parameter estimates. An influential point affects the precision of an analysis based on the OLS regression method. The situation worsens when the presence of such a point causes a violation of the linear regression assumptions [11].
In regression analysis, robust regression is introduced to deal with the effect of influential outliers or high leverage points [12], [13].
The S and MM estimation methods were also compared in [14], which reported that S estimation is effective for reading ability data from a group of children. LTS and MM were compared in [15], which found that MM is not more efficient than the LTS method, which also has a smoother objective function, leading to its high sensitivity to local effects and to a high breakdown point value.
Therefore, this research aims to determine a robust regression model with M, MM, and S estimations for the literacy index in the provinces throughout Indonesia and compare their effectiveness.

Data
This research simulated the 2018 data on Literacy Index, Collection Sufficiency, Sufficiency of Library Staff, and Number of Libraries with NES (National Education Standards) from every province in Indonesia. The data were obtained from the Center for Library Development and Reading Interest Correction (P3MB) of the National Library of the Republic of Indonesia (Perpusnas RI). The research variables are the Literacy Index per province as the dependent variable ($Y$), while Collection Sufficiency ($X_1$), Library Staff Sufficiency ($X_2$), and Number of Libraries with NES ($X_3$) are the independent variables ($X$).

Methods
The analysis process applied is described using the following flow chart: The research process begins with inputting data on the Literacy Index in Indonesia in 2018 using R-3.6.1 software.

Ordinary Least Square (OLS) Method
Perform analysis of Ordinary Least Square (OLS) parameter estimation using R-3.6.1 software.

Classic Assumption Test
Carrying out the classic assumption tests, namely the normality test, linearity test, heteroscedasticity test, autocorrelation test, and multicollinearity test. If the classical assumptions are not met, outliers may be present, so the analysis continues with outlier detection using R-3.6.1 software.

Outlier Detection
Detecting outliers using R-3.6.1 software with the R-student, DFFITS, and plot methods.

Robust Regression S-Estimator, M-Estimator, and MM-Estimator
Estimating the robust regression parameters with S estimation, M estimation, and MM estimation using R-3.6.1 software.

Choose The Best Method
Comparing the results of the three estimates and selecting the best estimation method in terms of the MSE and R^2 values using software R-3.6.1.
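The selection criteria can be illustrated with a short sketch. It uses the classical definitions of the residual standard error (the square root of the residual mean square) and of the adjusted $R^2$. The function name `fit_summary` and the toy numbers are hypothetical and are not taken from the study. Robust fitting functions in R may report a robust scale estimate instead of this classical RSE, so the sketch shows the idea rather than the exact output of the software.

```python
import math


def fit_summary(y, fitted, k):
    """Return (rse, r2, adj_r2) for a model with k predictors and an intercept."""
    n = len(y)
    mean_y = sum(y) / n
    # Residual and total sums of squares.
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    df = n - k - 1  # residual degrees of freedom
    rse = math.sqrt(sse / df)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / df
    return rse, r2, adj_r2


# Hypothetical observed and fitted values for a one-predictor model.
y_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_fit = [1.1, 1.9, 3.2, 3.8, 5.0]
rse, r2, adj_r2 = fit_summary(y_obs, y_fit, k=1)
```

Among candidate models fitted to the same data, the one with the smaller `rse` and the larger `adj_r2` would be preferred under the rule stated above.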

Multiple Linear Regression Model
Regression is a statistical analysis used to model the relationship between response and predictor variables [16]. The model can also be used to determine the significance of the effect of the independent variables on the dependent variable. A multiple regression model has more than one independent variable and can be written as follows:
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i$ (1)
where
$y_i$ = the value of the dependent variable on the $i$-th observation
$x_{i1}, x_{i2}, \dots, x_{ik}$ = the values of the independent variables on the $i$-th observation
$\beta_0, \beta_1, \dots, \beta_k$ = regression parameters
$\varepsilon_i$ = the normally distributed remainder (error) of the $i$-th observation
with $i = 1, 2, \dots, n$, where $n$ is the number of observations, and $j = 1, 2, \dots, k$, where $k$ is the number of predictor variables.

Ordinary Least Square
Ordinary Least Squares (OLS) is often used to estimate regression model parameters, but it can give inefficient results when outliers are present in the data [16]. The method minimizes the sum of the squared remainders to obtain the estimated values of $\beta_0, \beta_1, \beta_2, \dots, \beta_k$ as follows:
$S = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \dots - \beta_k x_{ik} \right)^2$ (2)
The partial derivatives of $S$ with respect to $\beta_0, \beta_1, \beta_2, \dots, \beta_k$ are then determined and set equal to zero in order to obtain the estimators of the linear regression model.
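Setting the partial derivatives to zero leads to the normal equations $X^{T}X\beta = X^{T}y$. The following is a minimal sketch of solving them in plain Python, assuming a small, well-conditioned problem. The helper names `solve_linear_system` and `ols_fit` and the toy data are hypothetical. The study itself used R-3.6.1, so this only illustrates the principle.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations act on both sides at once.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols_fit(rows, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free toy data generated from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * a + 3 * b for a, b in rows]
beta = ols_fit(rows, y)
```

Because the toy data lie exactly on a plane, `beta` recovers the generating coefficients up to rounding. With real data the same code returns the least squares estimates.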

Regression Analysis Assumption Test
The regression model obtained from OLS has regression coefficients that are unbiased and best among linear estimators, a property commonly known as the Best Linear Unbiased Estimator (BLUE) [17]. The classical multiple regression assumption tests are applied to the residuals, that is, the differences between the observed and estimated values in the multiple regression model. The assumptions discussed in this research are normality, heteroscedasticity, and autocorrelation, all tested on the residuals of the literacy index model, which examines the influence of internal library factors, namely the adequacy of library collections, the adequacy of library staff, and NES libraries, in all provinces in Indonesia.
The normality test was used to determine whether the residuals were normally distributed, and it was conducted using the Kolmogorov-Smirnov and Shapiro-Wilk tests. The basic concept of the Kolmogorov-Smirnov test is to compare the distribution of the data being tested with the standard normal distribution; the data are considered normally distributed when their distribution does not differ significantly from the standard normal distribution [3]. The heteroscedasticity test was applied to check whether the residual variance of the regression model is equal or unequal from one observation to another [18]. It was conducted using the Breusch-Pagan test: heteroscedasticity is believed to have occurred when the p-value < α and $H_0$ is rejected, whereas the assumption is met when $H_0$ is accepted. The autocorrelation test was used to determine the correlation between residuals in the regression model at irregular intervals, since autocorrelation often occurs in data containing an element of time (time series). It was detected using the Durbin-Watson test [19]: autocorrelation is believed to exist in the residuals when the p-value < α and $H_0$ is rejected, whereas the assumption is fulfilled when $H_0$ is accepted. Furthermore, a multicollinearity test was used to determine whether the independent variables are significantly related to one another, and this was assessed using the Variance Inflation Factor (VIF) [20]. There is no multicollinearity when the VIF is below the chosen threshold and $H_0$ is accepted, which indicates that the assumption is fulfilled.
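As an illustration of the autocorrelation check, the Durbin-Watson statistic itself is simple to compute from the residuals. The sketch below shows only the statistic. The p-value used for the decision rule above comes from the statistical software and is not reproduced here. The function name and the toy residuals are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: squared successive differences over squared residuals."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residuals that alternate in sign (negative autocorrelation).
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
# Hypothetical residuals that drift slowly (positive autocorrelation).
dw_drifting = durbin_watson([1.0, 1.1, 1.2, 1.3])
```

The statistic lies between 0 and 4. Values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.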

Outlier Detection
Outliers are data that do not follow the overall data pattern or the general pattern of the fitted regression model (Seheult, A. H., et al., 2005). An outlier can be identified using Cook's Distance method, with the test statistic given in Equation (3).
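For reference, one common textbook form of Cook's distance is shown below. It is given as an assumption about the statistic intended here, since other algebraically equivalent forms exist. The symbols $e_i$ (the OLS residual), $p = k + 1$ (the number of estimated parameters), $s^2$ (the residual mean square), and $h_{ii}$ (the $i$-th diagonal element of the hat matrix) are introduced only for this illustration.

```latex
D_i = \frac{e_i^{2}}{p\, s^{2}} \cdot \frac{h_{ii}}{\left(1 - h_{ii}\right)^{2}},
\qquad
H = X \left( X^{T} X \right)^{-1} X^{T},
\qquad
s^{2} = \frac{1}{n - p} \sum_{i=1}^{n} e_i^{2}
```

An observation has a large $D_i$ when it combines a large residual with high leverage, which is why the measure is used to flag influential points.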

Robust
The term "robust" was introduced into the statistical literature by Box in 1953 [22], although robust procedures such as trimming had been used sporadically for more than a century, as indicated in Anonymous (1821) [23]. However, Tukey (1960) was the first to recognize the extreme sensitivity of some conventional statistical procedures to small deviations from assumptions [24]. The realization that statistical methods optimized for conventional Gaussian models are unstable under small perturbations was also essential for the further theoretical developments initiated by Huber (1964) [25] and Hampel (1968) [26]. According to [27], the estimation methods in robust regression include:
a. M estimation (maximum likelihood type) is a simple estimation method, both in calculation and in theory, introduced by Huber (1973). It analyzes the data by assuming that most of the outliers occur in the dependent variable.
b. LTS (Least Trimmed Squares) estimation is a method with a high breakdown point introduced by Rousseeuw (1984). The breakdown point is the smallest proportion of the observations that, when contaminated with outliers, can make the estimate unreliable.
c. S (Scale) estimation is a method with a high breakdown point introduced by Rousseeuw and Yohai (1984). It has a higher efficiency than LTS at the same breakdown value.
d. MM estimation combines a high breakdown point estimator with M estimation. It was introduced by Yohai (1987) and has a higher efficiency than S estimation.

M Estimation
M estimation is an extension of the maximum likelihood method and is a robust estimation method [28].
11. Conduct a test to determine whether the independent variables have a significant effect on the dependent variable.
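M estimation is commonly computed by iteratively reweighted least squares (IRLS). The sketch below is one generic version of that idea and is not the authors' exact procedure. It assumes Huber weights with the conventional tuning constant 1.345, a scale estimate based on the median absolute deviation (MAD) divided by 0.6745, and an OLS starting value. All function names and the toy data are hypothetical.

```python
import statistics


def solve(a, b):
    """Gaussian elimination with partial pivoting for a x = b."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def wls(design, y, w):
    """Weighted least squares: solve (X'WX) beta = X'Wy."""
    p = len(design[0])
    xtwx = [[sum(wi * d[i] * d[j] for d, wi in zip(design, w)) for j in range(p)]
            for i in range(p)]
    xtwy = [sum(wi * d[i] * yi for d, yi, wi in zip(design, y, w)) for i in range(p)]
    return solve(xtwx, xtwy)


def huber_weight(u, c=1.345):
    """Huber weight: 1 inside [-c, c], c/|u| outside."""
    return 1.0 if abs(u) <= c else c / abs(u)


def m_estimate(rows, y, c=1.345, tol=1e-8, max_iter=100):
    """IRLS sketch of a Huber M-estimate with a MAD-based scale."""
    design = [[1.0] + list(r) for r in rows]
    beta = wls(design, y, [1.0] * len(y))  # start from the OLS fit
    for _ in range(max_iter):
        resid = [yi - sum(b * d for b, d in zip(beta, row))
                 for row, yi in zip(design, y)]
        med = statistics.median(resid)
        scale = statistics.median([abs(e - med) for e in resid]) / 0.6745
        if scale == 0:
            break  # residuals are (nearly) all identical; nothing to reweight
        weights = [huber_weight(e / scale, c) for e in resid]
        new_beta = wls(design, y, weights)
        converged = max(abs(a - b) for a, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


# Hypothetical data: y = 2 + 3x plus small noise, with one outlier at x = 2.
xs = list(range(10))
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1]
ys = [2 + 3 * x + e for x, e in zip(xs, noise)]
ys[2] += 30
beta_ols = wls([[1.0, x] for x in xs], ys, [1.0] * len(xs))
beta_m = m_estimate([(x,) for x in xs], ys)
```

On this toy data the OLS slope is pulled well away from 3 by the single outlier, while the reweighted fit stays close to it, because the outlier receives a weight far below 1.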

S Estimation
According to Rousseeuw and Yohai [29], the S estimator is an estimator with a high breakdown point but low efficiency. It is obtained by minimizing an M estimate of the residual scale. The weakness of M estimation is that it pays little attention to the distribution of the data and is not a function of the overall data, because it only uses the median as a weighting value. The S estimator uses the residual standard deviation to overcome this weakness.

MM estimation
The MM estimation procedure starts with the S estimator and is continued with the M estimator. The regression parameters are first estimated using the S estimator in order to minimize the residual scale, and the aim is to obtain both a high breakdown point and higher efficiency. The breakdown value is a general measure of the proportion of outliers that can be handled before they affect the model [27]. In the MM estimator, $\hat{\sigma}$ is the standard deviation obtained from the residuals of the S estimation and $\rho(u_i)$ is the Tukey bisquare objective function, where $u_i = e_i / \hat{\sigma}$ and $\hat{\sigma}$ is the estimated scale, so that Equation (5) is rewritten in terms of $u_i$. According to [30], the population estimate chosen for $\sigma$ is $\hat{\sigma}_{sn}$, which is fixed and denotes the scale of the S estimator at the $n$th iteration.
f. Calculate the parameter estimate $\hat{\beta}$ with the WLS method using the weights.
g. Repeat the steps until the value of $\hat{\beta}$ converges.
8. Calculate $\hat{\beta}$ using the WLS method with the weights.
9. Repeat steps 5-8 until $\hat{\beta}$ converges.
10. Test whether the independent variables have a significant effect on the dependent variable.
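The Tukey bisquare objective function and the weights it implies for the WLS steps can be sketched as follows. The form used is the standard textbook one, and the tuning constant c = 4.685 is the conventional default for the bisquare. Both are assumptions here, since the equations and the constant used in the study are not given above. The function names are hypothetical.

```python
def bisquare_rho(u, c=4.685):
    """Tukey bisquare objective: bounded, so large residuals stop adding loss."""
    if abs(u) <= c:
        return u ** 2 / 2 - u ** 4 / (2 * c ** 2) + u ** 6 / (6 * c ** 4)
    return c ** 2 / 6


def bisquare_weight(u, c=4.685):
    """Weight w(u) = psi(u) / u, where psi is the derivative of rho."""
    if abs(u) <= c:
        return (1 - (u / c) ** 2) ** 2
    return 0.0


# Scaled residuals u_i = e_i / sigma for a few hypothetical values.
scaled_residuals = [0.0, 1.0, 3.0, 6.0]
weights = [bisquare_weight(u) for u in scaled_residuals]
```

Unlike the Huber weights, the bisquare weights fall to exactly zero once the scaled residual exceeds c, so extreme outliers are removed from the WLS step altogether.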

Results and Discussions
Ordinary Least Square (OLS) Method
The relationship between the literacy index and internal factors, namely the adequacy of library collections, the adequacy of library staff, and libraries with NES, per province in Indonesia for 2018 was analyzed using multiple regression. A linearity test was applied to determine whether a linear relationship exists between the variables tested, and the resulting p-value = 0.6794 is greater than α = 0.05. This means the model is linear and feasible to use. The parameter estimates obtained through the OLS method are presented in Table 4. Table 4 shows that the initial regression model using the OLS method can be written as follows:
$Y = -0.1998 + 146.5686 X_1 - 5695.8156 X_2 + 2721.8545 X_3$ (7)
Model (7) does not fully explain the dependent variable because of errors. Based on Table 4, an $R^2$ value of 0.6702 is obtained, which means that 67.02% of the dependent variable ($Y$) can be explained by the variables $X_1$, $X_2$, and $X_3$, while the rest is explained by other variables. The OLS method gives the best fit only when the classical assumptions are fulfilled, which avoids biased values and ensures a valid interpretation of the regression coefficients.

Classic Assumption Test
Several assumptions must be fulfilled when using regression analysis, and they are checked with the normality, homoscedasticity, autocorrelation, and multicollinearity tests. The Shapiro-Wilk normality test produced a p-value of 9.043 × 10^−7, which is smaller than α = 0.05; this means the residuals are not normally distributed and the normality assumption is not fulfilled. Homoscedasticity was assessed using the Breusch-Pagan test, and the p-value < 2.2 × 10^−16 is smaller than α = 0.05, indicating that the assumption of residual homoscedasticity is not satisfied.
The autocorrelation test was performed using the Durbin-Watson test, and the p-value = 0.2455 is greater than α = 0.05, indicating that the assumption has been satisfied.

Outlier Detection
The inability to satisfy the homoscedasticity assumption led to the detection of the outliers using the Cook Distance (Cook's D) method and the results are presented in Figure 2.

Robust Regression
The robust regression parameters estimated using the M-estimator are presented in Table 5, and the resulting model is Equation (8). It implies that an increase of one unit in $X_1$, $X_2$, and $X_3$ is expected to increase $Y$ by 105.2857 units, 2324.9940 units, and 4083.6711 units, respectively. The robust regression parameters estimated using the S-estimator are presented in Table 6.
This shows that an increase of one unit in $X_1$, $X_2$, and $X_3$ is expected to increase $Y$ by 89.150 units, 17258.317 units, and 5398.865 units, respectively. The robust regression parameters estimated using the MM-estimator are presented in Table 7.
This indicates that an increase of one unit in $X_1$, $X_2$, and $X_3$ is expected to increase $Y$ by 96.328 units, 5855.951 units, and 721.923 units, respectively.

The Best Method
The best estimate is selected based on the smallest RSE and the largest $R^2$ among the values presented in Table 8.
The robust regression model with S-estimation has an $R^2$ value of 0.9778, which means that 97.78% of the dependent variable $Y$ can be explained by the $X$ variables, while the rest is explained or influenced by other variables outside the model. The regression equation (11) can be interpreted as follows:
1. The regression coefficient of the Collection Sufficiency variable ($X_1$) is 89.150, meaning that, assuming the other independent variables are constant, every change of 1 unit in Collection Sufficiency changes the literacy index per province by 89.150 units.
2. The regression coefficient of the Library Staff Sufficiency variable ($X_2$) is 17258.31, meaning that, assuming the other independent variables are constant, every change of 1 unit in Library Staff Sufficiency changes the literacy index per province by 17258.31 units.
3. The regression coefficient of the Number of Libraries with NES variable ($X_3$) is 5398.865, meaning that, assuming the other independent variables are constant, every change of 1 unit in the Number of Libraries with NES changes the literacy index per province by 5398.865 units.

Conclusions
Regression analysis is a method normally applied to determine the relationship between one dependent variable and one or more independent variables while the Ordinary Least Squares (OLS) method is usually applied for estimation. However, the existence of outliers in the 2018 Community Literacy Development Index data requires the use of a statistical analysis method that is not sensitive to outliers.
This research was conducted to overcome the problem of regression analysis when the existing data assumptions are not met due to different reasons such as the presence of outliers. Robust regression methods including M, S, and MM estimations were, therefore, used and the findings showed that the S estimation was the best model to determine the factors mostly influencing the literacy index in each province of Indonesia in 2018. The R 2 value was also recorded to be 0.9778 and this implies 97.78% of the dependent variable was explained by the independent variables while the remaining is associated with other variables outside the model.
It is recommended that other robust methods with higher accuracy are used in future studies to reduce the overall effect of outlier interference while increasing the prediction accuracy. Attention should also be placed on more complex problems such as the use of multivariable robustness.