Predicting Water Quality Parameters in a Complex River System

This research applied a machine learning technique to predict the water quality parameters of the Kelantan River using historical data collected from various stations. Support Vector Machine (SVM) was used to develop the prediction model. Six water quality parameters (dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), pH, ammonia nitrogen (NH3-N), and suspended solids (SS)) were predicted. The dataset was obtained from measurements at 14 stations along the Kelantan River from September 2005 to December 2017, giving a total of 148 monthly samples. We defined 3 prediction schemes to investigate the effect of the number of attributes on model performance. The results showed that the prediction of the SS parameter gave the best performance, as indicated by the highest R2 scores. Meanwhile, the prediction of the COD parameter gave the lowest R2 score, indicating that this dataset is difficult to model with SVM. The analysis of the number of attributes shows that the prediction of four parameters (DO, BOD, NH3-N, and SS) improves as the number of attributes increases. Conversely, the best prediction of the pH parameter is obtained with the smallest number of attributes, in scheme 1.


INTRODUCTION
sciences, which involves the calculation and description of the water quality parameters and the contamination transmission mechanism. Moreover, the advent of innovative soft computing and artificial intelligence (AI) techniques has led researchers to evaluate the components of water quality and their internal relationships in time series. Recent studies have reported applications of AI-based methods in addressing water resources management issues (Slaughter et al., 2017; Tomas et al., 2017; Wu et al., 2018). Similarly, radial basis function (RBF) networks, multilayer perceptron (MLP), and adaptive neuro-fuzzy inference system (ANFIS) were found to be suitable for predicting the water quality parameters of Karoon River (Emamgholizadeh et al., 2013). Additionally, Najah et al. (2009) used artificial neural network (ANN) approaches to predict the water quality parameters of Johor River Basin. The study indicated that the ANN models performed well, with a mean absolute percentage error of 10% in the prediction of the water quality parameters. Zhang et al. (2010) proposed a tool for analysing the water allocation schemes of Jiaojiang River basin using a water quantity-quality model. Nikoo and Mahjouri (2013) applied a fuzzy inference system and probabilistic support vector machines to estimate the probabilistic water quality of water resources. The study indicated that the models could be used in feasibility studies of water conservation projects. Antanasijević et al. (2014) estimated the dissolved oxygen (DO) concentration in Danube River using a general regression neural network (GRNN) model. The predicted outcome was compared with the output of Monte Carlo simulations. The authors concluded that the GRNN model is an efficient tool for estimating the DO concentration in rivers.
Heddam (2016a; 2016b) predicted the water quality parameters using ANN in several case studies. He claimed that AI methods are sufficient for modelling the water quality parameters in time series. Elkiran et al. (2018) estimated the DO concentration of Mathura River in India using a feed-forward neural network, multilinear regression, and ANFIS. In the study, the DO concentration, biochemical oxygen demand (BOD), temperature and pH of the river were used for the prediction. The findings indicated that the ANFIS models performed considerably better than the feed-forward neural network and multilinear regression in the validation step.
In view of the past research work mentioned, a comparative study on the implementation of AI techniques using different software packages is necessary to improve their accuracy and applicability. However, several data analysis programs do not allow comprehensive modification of the implementation of the AI techniques. Hence, this study explores the application of one AI approach, namely support vector machines (SVM), for monitoring and predicting river water quality parameters.

Collection of Data
In this study, a historical dataset of water quality parameters (dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), pH, ammonia nitrogen (NH3-N), and suspended solids (SS)) was used. The dataset was obtained from measurements at 14 stations along the Kelantan River from September 2005 to December 2017, giving a total of 148 monthly samples. The dataset contains missing values because no measurements were performed on some occasions; these were filled by interpolation. The locations of the measurement stations along the river are presented in Figure 1. Figure 1 shows that the river flows toward the area of measurement station 1 (area 1). Hence, the water properties in area 1 are affected by the water quality in the other areas, because water from all areas gathers and flows to area 1. Therefore, to improve measurement efficiency, the conventional water quality measurement in area 1 can be replaced by a prediction model. The model for the six parameters was developed using the water quality data of the other areas as input variables. We also considered that the prediction of a parameter is affected by the values of the other parameters. For example, the value of the COD parameter is used to predict the value of the BOD parameter. Therefore, the number of input variables equals the number of areas multiplied by six parameters. The model was developed using 3 schemes, which differ in the number of areas considered to affect the water quality in area 1. In the first scheme, we considered the edge areas only, i.e. areas 8, 10, 11, 13 and 14, for a total of 30 input variables. In the second scheme, we considered the edge and branch areas, i.e. areas 4, 8, 10, 11, 12, 13 and 14, for a total of 42 input variables. In the third scheme, we considered all the remaining areas, i.e. areas 2-14, for a total of 78 input variables. The detailed information on these prediction model schemes can be seen in Table 1.
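The input-variable counts of the three schemes follow from multiplying the number of areas by the six parameters. The sketch below illustrates this bookkeeping only. The column-naming convention is hypothetical, and scheme 3 is assumed to cover areas 2 to 14, i.e. the 13 areas other than the target area 1, which matches the stated 78 inputs.

```python
# Six water quality parameters measured at every station.
PARAMETERS = ["DO", "BOD", "COD", "pH", "NH3-N", "SS"]

# Areas whose measurements feed the model for area 1, per scheme.
SCHEME_AREAS = {
    1: [8, 10, 11, 13, 14],          # edge areas only
    2: [4, 8, 10, 11, 12, 13, 14],   # edge and branch areas
    3: list(range(2, 15)),           # assumed: every area except target area 1
}


def input_columns(scheme):
    """Return hypothetical column names, one per (area, parameter) pair."""
    return [f"area{a}_{p}" for a in SCHEME_AREAS[scheme] for p in PARAMETERS]


for scheme in SCHEME_AREAS:
    # Number of input variables = number of areas x 6 parameters.
    print(scheme, len(input_columns(scheme)))
```

Keeping the area lists in one mapping makes it easy to check that each scheme yields the number of input variables stated in the text (30, 42 and 78).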

Support Vector Machine
In the present study, the prediction model was built using Support Vector Machine (SVM). SVM is a machine learning (ML) technique developed from statistical learning theory. The basic principle of SVM in pattern recognition is to map the input vectors into a feature space of possibly higher dimension, either linearly or non-linearly. The mapping process is controlled by the type of kernel function. Then, an optimal hyperplane is constructed to obtain the maximal separation of two classes, which can be extended to multiple classes. The SVM training is performed by seeking a globally optimized solution while managing the over-fitting problem. Therefore, the SVM method has an advantage in processing a large number of features (Vapnik, 1998). SVM is also known as the largest margin classifier, since it tries to find an optimal hyperplane that results in the largest margin. The representation of the hyperplane and margin used in SVM is presented in Figure 2.
The main goal of SVM is to construct a classifier from the available samples that avoids misclassification in future predictions. The separating hyperplane used in the classifier is expressed as $\vec{w} \cdot \vec{x} + b = 0$, subject to the constraint $y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1$, $i = 1, \ldots, N$. During training, SVM looks for an optimal separating hyperplane by minimizing $\frac{1}{2}\|\vec{w}\|^2$ subject to this constraint. Here, $\|\vec{w}\|$ is the Euclidean norm of $\vec{w}$; minimizing it maximizes the distance between the hyperplane and the support vectors. The training procedure of SVM is converted into a convex Quadratic Programming (QP) problem by utilizing Lagrange multipliers $\alpha_i$. The solution of the QP problem is a global optimum expressed as:

$$\vec{w} = \sum_{i=1}^{N} \alpha_i y_i \vec{x}_i \qquad (1)$$

where: $\vec{x}_i$ represents a support vector when $\alpha_i > 0$.
After the training process, the decision function used in prediction is formulated as:

$$f(\vec{x}) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i (\vec{x}_i \cdot \vec{x}) + b\right) \qquad (2)$$

where: sgn() represents the sign function.
Moreover, to allow errors during training, slack variables $\zeta_i \geq 0$, $i = 1, \ldots, N$, were introduced by Cortes and Vapnik (Vapnik, 1995). This technique is known as a soft margin, which is effective in preventing overfitting. By considering the slack variables, the relaxed separation constraint is formulated as

$$y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1 - \zeta_i, \quad i = 1, \ldots, N \qquad (3)$$

and the optimal hyperplane is obtained by minimizing

$$\frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{N} \zeta_i \qquad (4)$$

where: C represents a regularization parameter that controls the trade-off between the optimal margin and the training error. Similarly, to obtain an optimal hyperplane, the input vector is mapped into a higher-dimensional Hilbert space, a process controlled by the kernel function. The kernel functions commonly used in the SVM model are the RBF, linear, and polynomial kernel functions. The polynomial kernel function can be expressed as:

$$K(\vec{x}_i, \vec{x}_j) = (\langle \vec{x}_i, \vec{x}_j \rangle + 1)^E \qquad (5)$$

where: E represents the exponent. In the case of the linear kernel, the exponent is 1. Meanwhile, the RBF kernel function can be expressed as:

$$K(\vec{x}_i, \vec{x}_j) = \exp\left(-\gamma \|\vec{x}_i - \vec{x}_j\|^2\right) \qquad (6)$$

where: $\gamma$ represents the kernel coefficient.
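The kernel and decision functions described above can be illustrated with a short sketch in plain Python. It assumes the common RBF form exp(-gamma * ||u - v||^2) and replaces the dot product in the decision function with a kernel. The support vectors, multipliers and bias are hypothetical toy values, not results from this study.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, exponent=1):
    """(<u, v> + 1) ** E; E = 1 is the linear case described in the text."""
    return (dot(u, v) + 1) ** exponent


def rbf_kernel(u, v, gamma=1.0):
    """exp(-gamma * ||u - v||^2), with gamma as the kernel coefficient."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def decision(x, support_vectors, labels, alphas, bias, kernel):
    """Sign of sum_i alpha_i * y_i * K(x_i, x) + b (a zero score maps to +1)."""
    score = bias + sum(
        a * y * kernel(sv, x)
        for sv, y, a in zip(support_vectors, labels, alphas)
    )
    return 1 if score >= 0 else -1


# Hypothetical toy model: two support vectors, one per class.
svs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alphas = [0.5, 0.5]

print(decision((2.0, 0.5), svs, ys, alphas, 0.0, rbf_kernel))
print(decision((-2.0, -0.5), svs, ys, alphas, 0.0, rbf_kernel))
```

Passing the kernel as an argument mirrors the role it plays in the text: the same decision rule is reused while the kernel choice controls the mapping into the feature space.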

Hyperparameter Tuning
The performance of the SVM model was improved by performing a hyperparameter tuning procedure, which aims to find the optimal parameters to be used in model development. The SVM parameters tuned in this step are the regularization parameter (C), the kernel coefficient (gamma), and the kernel function. The candidate parameter values used in the hyperparameter tuning are presented in Table 2.
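A tuning procedure of this kind can be sketched as an exhaustive grid search over the three tuned settings. The candidate values below are hypothetical placeholders, since the actual options are those listed in Table 2. The scoring function is a stand-in so that the sketch runs on its own; in practice it would train an SVM with the given settings and return a validation score.

```python
from itertools import product

# Hypothetical candidate values; the study's actual options are in Table 2.
PARAM_GRID = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.1, 1.0, 10.0],
    "gamma": [0.01, 0.1, 1.0],
}


def grid_search(score_fn, grid):
    """Score every combination in the grid; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in for 'train an SVM with params and return its validation score'."""
    bonus = 0.1 if params["kernel"] == "rbf" else 0.0
    return bonus - abs(params["C"] - 1.0) - abs(params["gamma"] - 0.1)


best, score = grid_search(toy_score, PARAM_GRID)
print(best, score)
```

The search is repeated independently for each scheme and each water quality parameter, which is why the selected settings can differ between them.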

Model Validation
The performance of the SVM model was measured by calculating two validation parameters, i.e. the coefficient of determination (R2) and the mean squared error (MSE). These parameters were used as a reference to determine the validity of the model for each scheme and water quality parameter. They were formulated as:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (A_i - P_i)^2}{\sum_{i=1}^{n} (A_i - \bar{A})^2} \qquad (7)$$

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (A_i - P_i)^2 \qquad (8)$$

where: $A_i$, $\bar{A}$ and $P_i$ represent the actual value in the i-th month, the average of the actual values and the predicted value in the i-th month, respectively, while n represents the number of data. The MSE was calculated on a scaled dataset to allow the results to be compared amongst the water quality parameters.
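The two validation measures can be computed directly from the actual and predicted series. The sketch below assumes the usual coefficient-of-determination form, R2 = 1 - SS_res / SS_tot, and uses made-up numbers purely for illustration. The scaling step mentioned in the text is omitted because the scaling method is not specified.

```python
def r2_score(actual, predicted):
    """1 - (residual sum of squares) / (total sum of squares about the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # Undefined (division by zero) when all actual values are identical.
    return 1.0 - ss_res / ss_tot


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up illustrative values, not data from the study.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

print(r2_score(actual, predicted))
print(mse(actual, predicted))
```

An R2 close to 1 together with a small MSE indicates that the predicted values track the actual ones closely.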

Hyperparameter tuning
The performance of the SVM model was improved by conducting hyperparameter tuning for each scheme and water quality parameter. The optimized parameters of the SVM model for schemes 1, 2 and 3 are presented in Tables 3, 4 and 5, respectively. We found that the sigmoid kernel function is not suitable for our study, as it was never selected in the hyperparameter tuning results. The kernel function selected for all water quality parameters is the RBF function, except for NH3-N. The optimized values of the regularization parameter (C) vary for each water quality parameter. This is related to the tolerance of the SVM model for errors during training. The variation of the C parameter reflects the different characteristics of the datasets of the water quality parameters.

Model validation
The SVM models developed with the optimized hyperparameters were evaluated by comparing the predicted values with the actual ones. The plot of predicted values against actual values for scheme 1 is presented in Figure 3. According to Figure 3, the data points of all parameters except BOD lie close to the diagonal line, indicating low error. The deviation of the data points of the BOD parameter is quite large compared to the other parameters.
The results of the validation parameters, i.e. R2 and MSE, for schemes 1, 2 and 3 are presented in Tables 6, 7 and 8, respectively. As for scheme 1, we found that the R2 score of train data for all water quality parameters is more than 0.80, which signifies a satisfactory result in predicting the train data. However, the true quality of the model is evaluated by its ability to predict external data, as represented by the R2 score of test data. We found that the prediction of the SS parameter gave the best performance with an R2 score of 0.901. Meanwhile, the worst performance was found in the prediction of the COD parameter with an R2 score of 0.241. This indicates that the data set of the COD parameter is more complex than the others. In this case, the number of attributes used seems insufficient to reveal the pattern of the COD data set.
As for scheme 2, we found that the R2 of train data for all the water quality parameters is satisfactory, as all the R2 values were more than 0.90. However, the R2 of test data differs for each parameter. The best performance is obtained from the prediction of the SS parameter with an R2 score of 0.940. Meanwhile, the prediction of COD gives the worst performance with an R2 score of 0.499. Compared with the COD prediction in scheme 1, the additional attributes in scheme 2 improve the R2 score from 0.241 to 0.499. Even though the R2 score is still low, the improvement indicates that the number of attributes contributes to the R2 score of COD prediction.
As for scheme 3, we found that the R2 score of the train data for all water quality parameters is good, with scores of more than 0.90. According to the R2 score of test data, the prediction of the SS parameter gives the best result with an R2 score of 0.936. Meanwhile, the worst performance is obtained from the prediction of COD with an R2 score of 0.490. The R2 score of COD prediction in scheme 3 is not significantly different from that in scheme 2. This indicates that the additional attributes in scheme 3 failed to improve the results of COD prediction. Generally, the best and worst results of all schemes were obtained from the prediction of the SS and COD parameters, respectively. This indicates that the data quality of the
The contribution of the number of attributes in each scheme to the model performance was investigated by comparing the R2 score of test data for all water quality parameters, as presented in Figure 4. The number of attributes increases from scheme 1 to scheme 3, which increases the model complexity. A positive correlation was found for the R2 scores of the DO, BOD, NH3-N and SS parameters. For these parameters, increasing the number of attributes leads to an increase in the R2 score.
This shows that a larger number of attributes can improve the performance of the model. Conversely, the R2 score of the pH parameter decreases as the number of attributes increases. This suggests that increasing the number of attributes made the model too complex and caused overfitting. In the case of the prediction of the COD parameter, we found that the best R2 score was obtained from scheme 2. However, the difference in the R2 score between scheme 2 and scheme 3 is not significant. The overall results reveal the importance of the number of attributes in obtaining satisfactory results.
Moreover, we found that no single scheme gives the best performance for all parameters.

CONCLUSION
The values of six water quality parameters, i.e. dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), pH, ammonia nitrogen (NH3-N) and suspended solids (SS), at station 1 were predicted using the SVM model. The prediction was performed by defining 3 schemes according to the number of attributes used for model development. Amongst the water quality parameters, the prediction of the SS parameter gave the best results, with the highest R2 scores for both the train and test data. Meanwhile, the worst results were obtained from the prediction of the COD parameter. Regarding the contribution of the number of attributes in each scheme, we found that the prediction of four parameters, i.e. DO, BOD, NH3-N and SS, improved as the number of attributes increased. Conversely, the best prediction of the pH parameter was obtained from scheme 1, which uses the smallest number of attributes.

Figure 4. Comparison of the R2 scores of test data for the water quality parameters calculated using the different schemes.