Modeling Pollution Index Using Artificial Neural Network and Multiple Linear Regression Coupled with Genetic Algorithm

The Shatt Al-Arab River in Basrah province, Iraq, was assessed by applying the comprehensive pollution index (CPI) at fifteen sampling locations from 2011 to 2020, considering twelve physicochemical parameters: pH, Tur., TDS, EC, TH, Na+, K+, Ca+2, Mg+2, Alk., SO4-2, and Cl-. The effectiveness of multiple linear regression (MLR) and an artificial neural network (ANN) for predicting the comprehensive pollution index was examined. A multi-layer feed-forward neural network trained with the backpropagation algorithm was used. The optimal ANN structure consisted of three layers: the input layer, one hidden layer, and one output layer. To determine the values of the predictor parameters that lead to the lowest CPI value, a genetic algorithm coupled with multiple linear regression (GA-MLR) was used: the prediction equation for the comprehensive pollution index was obtained by regression and then served as the objective function of the genetic algorithm. The minimum predicted comprehensive pollution index value recommended by the GA-MLR approach was 0.3777.


INTRODUCTION
Water is essential to human and ecological survival and health [Abyaneh, 2014]. According to the World Health Organization, water pollution is any alteration in the physical, chemical, or biological characteristics of water that has a harmful impact on living beings [Salihu et al., 2017]. Water pollution is the primary cause of the water crisis; water must not be polluted to the point where it can no longer be used for irrigation and drinking [Singh et al., 2020]. The study of water quality gives a clear picture of a river's suitability for various uses [Al-Asadi et al., 2020].
The Shatt Al-Arab River (SAR) is the principal source of surface water in the Basrah governorate. Its water comes from the Tigris and Euphrates rivers in Iraq, as well as the Karkheh and Karon rivers in Iran. Due to water scarcity, the Euphrates river was blocked as a supplier for the Shatt Al-Arab River, while Iran blocked the waters of the Karon and Karkheh rivers from reaching the Shatt Al-Arab. As a result, the Tigris river became the only supply of fresh water for the Shatt Al-Arab [Al-Asadi and Alhello, 2019]. Because the river and its branches have become receptacles for pollutants from many sources, the river's freshwater has been significantly degraded. Monitoring the river's pollution levels is therefore critical for human health in the area [Al-Asadi et al., 2020].
Neural networks have recently been applied in a wide range of scientific fields. Since the 1990s, ANNs have been used in water engineering and environmental sciences. Compared with conventional modeling methods, the artificial neural network has a flexible mathematical structure capable of finding complicated nonlinear correlations between the input and output data [Najah et al., 2013]. ANNs are effectively used to predict water quality in a variety of water bodies [Kulisz et al., 2020]. The basic concept of the genetic algorithm was first introduced in 1975 by John Holland, when he was delivering a lecture on adaptive systems theory at Michigan University [Azad et al., 2016]. The genetic algorithm is a search method based on Darwin's concept of evolution [Mijwel, 2016].
In this research, the comprehensive pollution index (CPI) was used to classify the pollution of the Shatt Al-Arab River water. Several researchers have examined the water quality of the SAR. The goals of the research are as follows: define the extent of water pollution in the SAR at many water treatment plants (WTPs) using the CPI, determine the optimum structure of the ANN, and determine the values of the predictor parameters that lead to the lowest CPI value by using the GA-MLR method.

Study area
The Shatt Al-Arab River rises at the confluence of the Euphrates and Tigris rivers in Qurna City and flows southeast for 101 kilometers before forming the border between Iraq and Iran for the final 91 kilometers of its main course and discharging into the Arabian Gulf [Allafta and Opp, 2020]. The SAR lies between latitudes 29° 45' 0" and 31° 15' 0" N and longitudes 47° 10' 20" and 48° 45' 0" E [Abdulla, 2013]. The main water source in the Basrah province is the SAR, a natural river that flows through the Basrah governorate at a rate of 25-75 m³/s [Almuktar et al., 2020]. The water quality of the Shatt Al-Arab has deteriorated dramatically during the last three decades because of anthropogenic activities. The river is receiving growing volumes of untreated wastewater as well as runoff from the surrounding oil fields. As a result, the important functions the Shatt Al-Arab plays in maintaining healthy populations and sustaining a balanced ecology are considerably imperiled [Allafta and Opp, 2020].

Data description
The Directorate of Basrah Water provided monthly data on twelve water quality parameters collected at each of the fifteen water treatment plants over the period 2011-2020. The parameters are pH, Tur., TDS, EC, TH, Na+, K+, Ca+2, Mg+2, Alk., SO4-2, and Cl-. Table 1 presents the statistical analysis of the twelve physical and chemical parameters for raw water in this study.

Sampling sites
The physical and chemical properties were obtained at fifteen water treatment plants. Table 2 presents the coordinates of the WTPs. Figure 1 shows the locations of the WTPs considered for this study.

Comprehensive Pollution Index (CPI)
CPI has been used in several studies to categorize water quality. The steps for calculating CPI are [Ezzat and Elkorashey, 2020]:
• The pollution index (PI) of every water quality parameter is computed as the ratio of the measured concentration to the standard permitted concentration [Ezzat and Elkorashey, 2020]:
$$PI_i = \frac{C_i}{S_i}$$
where C_i is the measured concentration of parameter i and S_i is its standard permitted concentration.
• The standard permitted concentrations for every parameter selected for this study were taken from the World Health Organization (WHO 2011), as shown in Table 3.
• CPI is computed by averaging the pollution indices over all parameters [Ezzat and Elkorashey, 2020]:
$$CPI = \frac{1}{n}\sum_{i=1}^{n} PI_i$$
where n is the number of parameters that have been chosen.
• The CPI values can be used to categorize the water quality level, as shown in Table 4 [Matta et al., 2018].
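As an illustration, the calculation steps above can be sketched in a few lines of Python. The sketch assumes that PI is the ratio of the measured concentration to the permitted concentration and that CPI is the arithmetic mean of the PI values. The sample concentrations and limits below are hypothetical placeholders, not values from Table 3.

```python
def pollution_index(measured, standard):
    """PI of one parameter: measured concentration over permitted concentration."""
    return measured / standard


def comprehensive_pollution_index(measured, standards):
    """CPI: arithmetic mean of the single-parameter pollution indices."""
    pis = [pollution_index(measured[name], standards[name]) for name in standards]
    return sum(pis) / len(pis)


# Hypothetical sample and limits (mg/l), for illustration only.
sample = {"TDS": 1500.0, "Cl": 500.0, "SO4": 250.0}
limits = {"TDS": 1000.0, "Cl": 250.0, "SO4": 250.0}

cpi = comprehensive_pollution_index(sample, limits)
print(cpi)  # mean of 1.5, 2.0 and 1.0
```

The resulting value would then be compared with the class limits of Table 4 to assign a pollution category.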

Artificial Neural Network (ANN)
An ANN is a mathematical model that mimics the functioning of the human brain. An ANN can emulate brain processes: it can decide, arrive at a solution from current knowledge when data are insufficient, absorb continuous data input, learn, and remember. The greatest advantage of a neural network is its capability to model complicated nonlinear relations without making prior assumptions about the nature of the relation [Banejad and Olyaie, 2011]. An ANN comprises multiple nodes that represent neurons. The independent variables are represented by the input nodes, while the dependent variables are represented by the output nodes [Nwobi and Ochieze, 2018]. The main purpose of the learning procedure is to identify the set of weights that gives the best output for the given inputs. The network output is compared to the target answer to calculate the error [Najah et al., 2013]. Neural networks can have different structures; in principle, feed-forward and recurrent networks can be distinguished. In feed-forward networks, information flows only forward, from the input nodes through the hidden nodes to the output nodes. Recurrent networks contain links through which information can travel both forwards and backwards through the node connections. Recurrent networks are also called feedback networks [Mijwel and Alsaadi, 2019].

Back Propagation Algorithm (BP)
Back propagation (BP) is the most common and widely applied of the existing learning algorithms for neural network models. This algorithm is employed in supervised learning [Banejad and Olyaie, 2011]. The primary training concept of BP is founded on the gradient descent algorithm, which modifies the weights to reduce the Mean Square Error (MSE) [AlTobi et al., 2016]. The BP algorithm is divided into two phases: a forward phase and a backward phase. In the forward phase, the network input data is propagated to the following layer and so forth, after which the network error is calculated. In the backward phase, the network error is propagated backwards, and the weights are adjusted accordingly [Gallo, 2015]. As illustrated in Figure 2, the network structure consists of three layers, each of which has n neurons.
The number of input variables determines the number of neurons in the first layer (input layer). This layer takes the inputs from the external world and transfers them without any alteration to the hidden layer. Since they are only indirectly related to the outside environment, intermediate layers are usually known as hidden layers.
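The forward and backward phases can be illustrated with a minimal sketch of a network with one hidden layer trained by plain stochastic gradient descent. This is not the configuration used in the study: the toy data, the layer size, the learning rate, and the function names are assumptions chosen only to make the two phases visible.

```python
import math
import random


def logsig(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, n_hidden=3, lr=0.2, epochs=2000, seed=0):
    """Train a one-hidden-layer network; return the MSE recorded at each epoch."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Each weight row ends with a bias term.
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    history = []
    for _ in range(epochs):
        squared_error = 0.0
        for x, target in samples:
            # Forward phase: input -> hidden (sigmoid) -> output (linear).
            h = [logsig(sum(w[i] * x[i] for i in range(n_in)) + w[n_in]) for w in w_h]
            out = sum(w_o[j] * h[j] for j in range(n_hidden)) + w_o[n_hidden]
            err = out - target
            squared_error += err * err
            # Backward phase: propagate the error and adjust the weights.
            for j in range(n_hidden):
                delta = err * w_o[j] * h[j] * (1.0 - h[j])
                for i in range(n_in):
                    w_h[j][i] -= lr * delta * x[i]
                w_h[j][n_in] -= lr * delta
            for j in range(n_hidden):
                w_o[j] -= lr * err * h[j]
            w_o[n_hidden] -= lr * err
        history.append(squared_error / len(samples))
    return history


# Toy data: two inputs, one target (hypothetical values).
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.3), ([1.0, 0.0], 0.5), ([1.0, 1.0], 0.8)]
history = train(samples)
print(history[0], history[-1])
```

The hidden-layer weights are updated before the output-layer weights so that the backward phase uses the same weights as the forward phase.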

Performance criteria
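The models were evaluated using the mean squared error (MSE) and the correlation coefficient (R), as follows [Kulisz et al., 2021]:
$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(T_i - O_i\right)^2$$
$$R = \frac{\sum_{i=1}^{N}\left(T_i - \bar{T}\right)\left(O_i - \bar{O}\right)}{\sqrt{\sum_{i=1}^{N}\left(T_i - \bar{T}\right)^2 \sum_{i=1}^{N}\left(O_i - \bar{O}\right)^2}}$$
where N is the number of data points, T_i is the target value, O_i is the output value of the network, and \bar{T} and \bar{O} are the means of the target and output values.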
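Both criteria are straightforward to compute. The sketch below assumes that MSE is the mean of the squared differences between targets and outputs and that R is the Pearson correlation between them; the function names and the sample values are illustrative only.

```python
import math


def mse(targets, outputs):
    """Mean squared error between target and network output values."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


def correlation(targets, outputs):
    """Pearson correlation coefficient between targets and outputs."""
    n = len(targets)
    t_mean = sum(targets) / n
    o_mean = sum(outputs) / n
    cov = sum((t - t_mean) * (o - o_mean) for t, o in zip(targets, outputs))
    var_t = sum((t - t_mean) ** 2 for t in targets)
    var_o = sum((o - o_mean) ** 2 for o in outputs)
    return cov / math.sqrt(var_t * var_o)


# Hypothetical target and output values.
targets = [1.0, 2.0, 3.0, 4.0]
outputs = [1.1, 1.9, 3.2, 3.8]

error = mse(targets, outputs)
r = correlation(targets, outputs)
print(error, r)
```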

Genetic Algorithm (GA)
John Holland invented the genetic algorithm and presented the idea in his 1975 book "Adaptation in Natural and Artificial Systems". Holland proposed the GA as a computational method based on the survival-of-the-fittest principle [Sivanandam and Deepa, 2008]. The genetic algorithm is a population-based stochastic algorithm whose three main operators are selection, crossover, and mutation. Because the GA is random, one can wonder how trustworthy it is. What makes the algorithm dependable and capable of estimating the global optimum for a particular problem is the technique of keeping the best solutions of each generation and using them to improve subsequent solutions. As a result, the entire population improves with each passing generation [Mirjalili, 2019]. The GA works with a group of chromosomes (also called individuals). Each chromosome represents a feasible solution to the problem under study. The biologically inspired genetic operators are used to generate the offspring chromosomes. The offspring are expected to inherit good genes from their parents, resulting in a higher average quality of solutions than in previous generations. The GA is iterative in its approach, and each iteration is called a generation. The fitness function evaluates the fitness of each chromosome in each generation. A chromosome becomes fitter when its fitness function value goes up, indicating that it has a better chance of surviving into the next generation. This process of evolution is repeated until certain stopping requirements are met [Guo and Wong, 2013].

Implementation of Genetic Algorithm
The steps below explain what the genetic algorithm does [Abuiziah and Nidal, 2013]:
• The GA begins with an initial population that is generated at random.
• The fitness of the population is calculated. The fitness function is applied to each individual chromosome to produce a fitness score.
• The solutions used to create the next solutions are chosen depending on their fitness values. The solutions with a larger fitness value have a better probability of being chosen for reproduction, whereas those with a lesser fitness value are less likely to be chosen.
• The present population is replaced by the new population.
• This evolution process is repeated until a predetermined termination criterion is met. For example, satisfaction with the improvement of the best solutions might be used as the criterion.
Figure 3 shows how the GA performs [Tabassum and Mathew, 2014].
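The steps above can be sketched as a small real-coded genetic algorithm. The sketch is written for minimization, so a lower objective value plays the role of a higher fitness. The operators chosen here (tournament selection, uniform crossover, uniform mutation, two elite individuals) and all parameter values are assumptions made for illustration; they are not the settings of Table 12 or of the MATLAB toolbox.

```python
import random


def genetic_minimize(fitness, bounds, pop_size=30, generations=60,
                     crossover_rate=0.8, mutation_rate=0.1, seed=0):
    """Minimize `fitness` within box `bounds`; return (best, best value per generation)."""
    rng = random.Random(seed)

    def tournament(population):
        # Selection: the better of two randomly drawn individuals.
        a, b = rng.sample(population, 2)
        return a if fitness(a) < fitness(b) else b

    # Random initial population.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        pop.sort(key=fitness)
        best_history.append(fitness(pop[0]))
        new_pop = [pop[0][:], pop[1][:]]  # keep the two best solutions (elitism)
        while len(new_pop) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            if rng.random() < crossover_rate:
                child = [rng.choice(pair) for pair in zip(p1, p2)]  # uniform crossover
            else:
                child = p1[:]
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    child[i] = rng.uniform(lo, hi)  # mutation within the bounds
            new_pop.append(child)
        pop = new_pop  # the new population replaces the present one
    return min(pop, key=fitness), best_history


def sphere(x):
    """Simple test objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best, ga_history = genetic_minimize(sphere, [(-1.0, 1.0)] * 3)
print(sphere(best))
```

Because the best individuals are copied unchanged into the next generation, the best objective value never gets worse from one generation to the next.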

Normalization data
The term normalization refers to the process of converting data values to a range between 0 and 1. The actual data is first normalized using the formula [Chopra et al., 2019]:
$$x_n = \frac{x_i - x_{min}}{x_{max} - x_{min}}$$
where x_i is the i-th data value to be normalized, x_n is the normalized value, x_min is the minimum value of the data, and x_max is the maximum value of the data.
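A minimal sketch of this min-max normalization is given below. The function name and the sample values are illustrative, and the guard for constant data is an added assumption, since the formula is undefined when the maximum equals the minimum.

```python
def min_max_normalize(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("cannot normalize constant data: x_max equals x_min")
    return [(x - x_min) / (x_max - x_min) for x in values]


normalized = min_max_normalize([2.0, 4.0, 6.0, 10.0])
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```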

Comprehensive pollution index
According to the CPI classification of all WTPs in this study for the ten years from 2011 to 2020, the water of the Shatt Al-Arab River is classified as moderately polluted to seriously polluted, as shown in Figures 4 to 8. The year 2018 was the most polluted for all WTPs, with the highest TDS value reaching 22,954 mg/l at Al-Labanie (WTP No. 15). This was attributable to the salt tide in that year, in addition to the pollutants from domestic, industrial, and agricultural activities, which increased the salinity of the river.

Estimation of CPI by multiple linear regression
The multiple linear regression model makes it possible to investigate the impact of several independent variables on the dependent variable. Here the dependent variable is CPI and the independent variables are pH, Tur., TDS, EC, TH, Na+, K+, Ca+2, Mg+2, Alk., SO4-2, and Cl-. The data were analyzed with the SPSS program. In this model, the multiple correlation coefficient R is 0.996 and the coefficient of determination R² is 0.991; that is, the model explains 99.1% of the variance in CPI. pH, TH, and Alk. are not statistically significant in this model, since their p-values exceed the 5% level of significance (p-value = 0.834 for pH, 0.916 for TH, and 0.848 for Alk.), as shown in Table 5.
The model was then refitted in SPSS with the remaining nine predictors (Tur., TDS, EC, Na+, K+, Ca+2, Mg+2, SO4-2, and Cl-). In this model, the multiple correlation coefficient R is 1 and the coefficient of determination R² is 1, which means that the model predicts CPI values extremely accurately. Because the p-value of every predictor variable is less than 0.001, all predictors are statistically significant, as illustrated in Table 6. The correlation between the measured and regression values was positive, as presented in Table 7.
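For readers who want to reproduce this kind of analysis outside SPSS, the following sketch fits a multiple linear regression by ordinary least squares and reports R². Solving the normal equations by Gaussian elimination is a choice made here to keep the example dependency-free, not the procedure used in the study, and the data are synthetic.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_mlr(rows, y):
    """Least-squares fit; returns [intercept, b1, b2, ...]."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(rows, y, coef):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    pred = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]
    y_mean = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


# Synthetic, exactly linear data with two hypothetical predictors.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.2)]
y = [0.1 + 0.2 * a + 0.3 * b for a, b in rows]

coef = fit_mlr(rows, y)
r2 = r_squared(rows, y, coef)
print(coef, r2)
```

An R² close to 1 on data of this kind, as in the refitted model, indicates that the response is very nearly a linear function of the predictors.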

Estimation of CPI by Artificial Neural Network
The back propagation algorithm was used to train the ANN models. Multiple linear regression analysis was used to determine the input variables: Tur., TDS, EC, Na+, K+, Ca+2, Mg+2, SO4-2, and Cl- were used as input variables to predict the CPI. 70% of the data was used for the training set, 20% for the testing set, and 10% for the validation set, because this proportion produced the best performance in terms of the lowest MSE and highest R values. To find the optimal number of nodes in the hidden layer, many ANN models were created and evaluated. The effectiveness of the ANN models was assessed using the coefficient of correlation (R) and the mean squared error (MSE).
The maximum regression coefficient and minimum mean squared error for the training, validation, and testing sets, for each training function used with one and two hidden layers, are presented in Tables 8 and 9, respectively.
From Tables 8 and 9, the optimum prediction model was found to be the 9-16-1 network structure. This network was trained with the Levenberg-Marquardt algorithm because, compared with the other training functions, it produced the best performance in terms of the lowest MSE and highest R values. The logsig activation function was chosen for the hidden layer and the purelin activation function for the output layer. This structure produced the lowest MSE values of 9.755×10^-7 for the training set, 1.945×10^-7 for the validation set, and 8.388×10^-8 for the testing set, as shown in Figure 9, and maximum R values of 0.99996 for the training set, 0.99998 for the validation set, and 1 for the testing set, as shown in Figure 10. Table 10 lists the properties of the selected ANN model. Figure 11 shows the optimal ANN structure used in this study.
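The 70/20/10 partition can be sketched as a random split of sample indices. Random assignment, the fixed seed, and the sample count of 100 are assumptions made for illustration; the text does not state how samples were assigned to the three sets.

```python
import random


def split_indices(n, train=0.7, test=0.2, seed=0):
    """Shuffle indices 0..n-1 and split them into training, testing and validation sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * train)
    n_test = round(n * test)
    # Whatever remains after training and testing goes to validation.
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]


train_idx, test_idx, val_idx = split_indices(100)
print(len(train_idx), len(test_idx), len(val_idx))  # 70 20 10
```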
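To make the 9-16-1 structure concrete, the sketch below runs one forward pass through a network with nine inputs, sixteen logsig hidden neurons, and one purelin (identity) output neuron. The weights are random placeholders rather than the trained weights of the study, and the input vector is hypothetical.

```python
import math
import random


def logsig(z):
    """Logistic sigmoid, the hidden-layer activation."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: logsig hidden layer followed by a linear (purelin) output."""
    hidden = [logsig(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


rng = random.Random(0)
n_in, n_hidden = 9, 16

# Placeholder weights and biases (not the trained values).
w_hidden = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
b_hidden = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
w_out = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
b_out = rng.uniform(-1.0, 1.0)

# Weights plus biases of both layers: 9*16 + 16 + 16 + 1.
n_params = n_in * n_hidden + n_hidden + n_hidden + 1

x = [0.5] * n_in  # hypothetical normalized inputs
output = forward(x, w_hidden, b_hidden, w_out, b_out)
print(n_params, output)
```

A network of this size has 177 adjustable parameters, which is what the training algorithm tunes.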

Genetic algorithm optimization solution
GA is a method for solving both constrained and unconstrained optimization problems. The goal of the optimization procedure in this study was to identify the values of the independent variables that lead to the minimum CPI value. The comprehensive pollution index estimation model described in Eq. (6) is chosen as the fitness function and written as follows: Minimize CPI (Tur., TDS, EC, Na+, K+, Ca+2, Mg+2, SO4-2, Cl-) = min (-0.016 + 0.106 Tur. + 0.206 TDS + 0.202 EC + 0.229 Na+ + 0.065 K+ + 0.046 Ca+2 + 0.013 Mg+2 + 0.146 SO4-2 + 0.191 Cl-).
The minimization of the objective function is subject to limits on the predictor variable values. The ranges of the measured predictor variables are used as the constraints of the optimization, as presented in Table 11. The MATLAB Optimization Toolbox was employed to identify the lowest CPI value by applying the fitness function of Eq. (7) and the limits of the predictor variables in Table 11. The genetic algorithm generated a global minimum of -0.0159 for the CPI, as shown in Figure 12. The ideal values of the predictor variables (normalized values) are zero for all predictor variables. The best solution was found at iteration 51 of the genetic algorithm. Table 12 shows the combination of genetic algorithm parameters that leads to the lowest fitness function value.
Figure 11. The ideal network architecture for predicting the CPI value
Figure 12. Best fitness value and mean fitness value
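The reported optimum can be checked by hand. Since every coefficient of the fitness function is positive, the linear objective is smallest when each predictor sits at its lower limit. The sketch below assumes that the limits correspond to the normalized range from 0 to 1, which is an assumption because Table 11 is not reproduced here; the dictionary keys are shorthand labels for the nine predictors.

```python
# Coefficients of the regression-based fitness function given in the text.
INTERCEPT = -0.016
COEFFS = {
    "Tur": 0.106, "TDS": 0.206, "EC": 0.202, "Na": 0.229, "K": 0.065,
    "Ca": 0.046, "Mg": 0.013, "SO4": 0.146, "Cl": 0.191,
}


def cpi_fitness(x):
    """Linear CPI model evaluated at the normalized predictor values in x."""
    return INTERCEPT + sum(COEFFS[name] * x[name] for name in COEFFS)


# Assumed normalized limits: 0 (lower) and 1 (upper) for every predictor.
lower = {name: 0.0 for name in COEFFS}
upper = {name: 1.0 for name in COEFFS}

cpi_min = cpi_fitness(lower)
cpi_max = cpi_fitness(upper)
print(cpi_min, cpi_max)
```

Under this assumption the minimum equals the intercept, -0.016, which agrees with the reported global minimum of -0.0159 up to the rounding of the coefficients.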

RESULTS OF THIS STUDY
Compared with the actual data and with the MLR and ANN models, the GA-MLR method produces the lowest CPI value at the ideal values of the predictor variables. As presented in Table 13, the minimum CPI value for the GA-MLR approach is 0.3777.

CONCLUSIONS
The water of the Shatt Al-Arab River is categorized as moderately polluted to seriously polluted by the CPI classification in this study. The performance of the MLR and ANN models for estimating CPI was evaluated, and both models were found to be very suitable for predicting the CPI. The optimum prediction model was the 9-16-1 network structure. This structure produced the lowest MSE values of 9.755×10^-7 for the training set, 1.945×10^-7 for the validation set, and 8.388×10^-8 for the testing set, and maximum R values of 0.99996 for the training set, 0.99998 for the validation set, and 1 for the testing set. According to the results of this study, the GA-MLR technique is capable of estimating the parameter values that result in the minimum CPI value. The minimum predicted CPI value recommended by the GA-MLR approach was 0.3777.