Water Quality Classification by Integration of Attribute-Realization and Support Vector Machine for the Chao Phraya River

The water quality index (WQI) is an essential indicator for managing water usage properly. This study applied a machine learning-based approach integrating attribute-realization (AR) and the support vector machine (SVM) algorithm to classify the water quality of the Chao Phraya River. The historical monitoring dataset for 2008-2019, comprising biological oxygen demand (BOD), conductivity (Cond), dissolved oxygen (DO), faecal coliform bacteria (FCB), total coliform bacteria (TCB), ammonia (NH3-N), nitrate (NO3-N), salinity (Sal), suspended solids (SS), total nitrogen (TN), total dissolved solids (TDS), and turbidity (Turb), was processed in four steps: data pre-processing by the mean substitution method, evaluation of contributing parameters by pattern recognition, examination of the mathematical functions for quality classification, and validation of the resulting approach. The results showed that NH3-N, TCB, FCB, BOD, DO, and Sal were, in that order, the main attributes contributing to water quality classification, with confidence values of 0.80, 0.79, 0.78, 0.76, 0.69, and 0.64, respectively. The linear function was more suitable for classifying the river water data than the sigmoid, radial basis, and polynomial functions. Classification performance and accuracy varied with the number of attributes and the mathematical function used. The validation confirmed that AR-SVM is an effective approach for classifying river water quality, with an accuracy of 0.86-0.95 when three to six attributes are applied.


INTRODUCTION
River water quality data are essential for decisions on water availability and usage management. Human activities are the main factors influencing the quality of water resources, and water quality is in turn closely related to public health. Quality classification is therefore crucial for monitoring, predicting, and managing water resources [Shakhman and Bystriantseva, 2021]. In this context, the water quality index (WQI) is a universal indicator. WQI is a mathematical function that integrates the effects of multiple variables into a single value [Yan et al., 2015]. Various applications of WQI have been reported, e.g., the protection of urban, lake, and groundwater environments [Talalaj, 2014] and the development of a specific index for a river ecosystem [Naubi, 2016]. Several conditions and criteria have been applied to determine WQI; for instance, Gradilla-Hernández et al. [2020] used seven physical parameters, e.g., oxygen, nutrients, organics, and heavy metals, to develop their WQI from historical data. On the other hand, nine parameters, such as aluminium, iron, copper, Escherichia coli and nitrate, were used in another study [2020], and between 11 and 14 parameters have also been used for WQI. In Thailand, WQI was formerly calculated from eight parameters: pH, DO, BOD, NO3-N, FCB, TP, TS, and SS. It was then reduced to five parameters: DO, BOD, NH3-N, FCB, and TCB. The stable variation of temperature and the seasonal fluctuation of TP, TS, and SS are the main reasons for omitting these parameters [Thammarak et al., 2020]. NH3-N replaced NO3-N because it directly reflects contamination by wastewater from human activities. TCB is considered a critical parameter indicating coliform and faecal coliform contamination, while pH is omitted because of its low variation [PCD, 2018].
The determination of WQI usually relies on a multivariable system. This system is valued for its higher accuracy, but it also has weaknesses: interference among variables, longer processing time, difficulty in processing large amounts of data, errors from the inclusion of less accurate data, and higher analysis costs [Tung and Yaseen, 2020]. Using a smaller number of significant variables is preferred to minimize these limitations. Moreover, recent advances in automated programming and supervised algorithms have demonstrated their practical applicability to environmental monitoring data [Dezfooli et al., 2018]. Machine learning (ML) is widely used for environmental prediction purposes, for example, clustering and image classification [Okwuash and Ndehedehe, 2020; Najah et al., 2016], data classification [Braun et al., 2011], information discovery in mobile message data and malware data [Chen, 2020], selecting monitoring sites for the design of a hydrometeorological monitoring network from groundwater-level data [Asquith, 2020], and creating a visualization map for river water quality prediction [Kausar et al., 2011]. ML involves the scientific study of statistical models and mathematical functions in programming. Statistical theory is applied to build mathematical models and improve accuracy by recognizing patterns in historical data, which are split into a larger training set and a smaller testing set [Alpaydin, 2020]. ML adaptation approaches include improving accuracy via ensemble learning, scaling up supervised and reinforcement learning algorithms, and learning complex stochastic models [Dietterich, 1997].
The ML methods most commonly used in quality classification are supervised learning algorithms such as the support vector machine (SVM), Naïve Bayes, and decision trees. SVM is a powerful method because of its roots in statistical learning theory and its optimization-based formulation for solving convex and non-convex problems. Combined ML algorithms have also been studied, e.g., for predicting the suspended sediment concentration from the turbidity of a stream [Bayram et al., 2012], predicting water quality parameters such as dissolved oxygen, biological oxygen demand, ammonia nitrogen, and suspended sediment concentration in a complex river system [Kurniawan et al., 2021], designing the water quality parameters and sampling frequency of a surface water quality management network [Khalil et al., 2014], and evaluating carbon dioxide emissions together with the related influencing factors [Wei and Wang, 2017]. SVM has been reported to provide more accurate results [Singh et al., 2011], to require less time, and to operate with a smaller amount of data than other algorithms under the same data and conditions [Gamble and Babbar-Sebens, 2012]. However, resolving variable interference remains complicated for each environmental dataset. Therefore, developing water quality classification using a combination of learning algorithms is an attractive approach. In this study, Attribute-Realization (AR) combined with a Support Vector Machine (SVM) algorithm was implemented to classify the water quality of the Chao Phraya River. The aims were to provide an alternative, practical method that optimizes the number of variables needed to classify river water quality, and to assess its validity when AR-SVM is applied to a new dataset.

MATERIAL AND METHODS
In this study, four steps were applied to develop an integrated approach for water quality classification: data preparation to minimize missing values using the mean substitution method, attribute-realization by pattern recognition to identify the primary contributing parameters, selection of a mathematical algorithm to build the classification approach, and validation of the proposed approach on a new dataset. The methodological procedure used in this study is summarized in Figure 1.
The research was based on the monitoring dataset of the Chao Phraya River, the largest river in Thailand. The river rises in the north of the country from two smaller rivers, the Ping and the Nan, and becomes the Chao Phraya River in Nakhon Sawan province, as shown in Figure 2. It then flows through the central region, including Bangkok, and exits into the Gulf of Thailand in Samut Prakarn province [Muttamara and Sales, 1994]. The classification approach was developed using the water quality monitoring data collected by the Pollution Control Department (PCD), Ministry of Natural Resources and Environment of Thailand [IWIS, 2019]. PCD is responsible for river water quality monitoring. About 18 stations are installed along the Chao Phraya River, divided into three zones. The first zone is downstream, starting in Samut Prakarn province, at latitudes 13.59697 to 13.81063 and longitudes 100.59439 to 100.51880, and consists of six monitoring stations designated by the PCD as CH01, CH03, CH06, CH08, CH10, and CH12. The second zone is midstream, with five stations, CH15, CH16.1, CH17, CH18, and CH20, located at latitudes 13.94527 to 14.34268 and longitudes 100.53825 to 100.57916. The last zone is upstream, in the northern region, with seven stations, CH21, CH24, CH25, CH27, CH28, CH30, and CH33, at latitudes 14.58753 to 15.68577 and longitudes 100.45550 to 100.25335. Monitoring data are collected four times a year, divided roughly into two main seasons: the wet season (two samplings, in January-March and April-June) and the dry season (two samplings, in July-September and October-December). A new dataset from the Tha Chin River, a branch of the Chao Phraya River, was used in the validation step. This river starts in Chi Nart province, runs through the western part of the country, and flows into the Gulf of Thailand at Samut Sakorn. There are 14 monitoring stations along the river from Chi Nart to Samut Sakorn.

Data collection and data preprocessing step
From January 2008 to February 2019, the raw dataset of the Chao Phraya River was collected by the PCD. The monitoring data consist of 12 parameters covering physical, chemical, and biological characteristics, as shown in Table 1. These parameters indicate the water quality as influenced by anthropogenic activities, i.e., agriculture, households, and industry located along the river, which are the major contamination sources of river water bodies. In practice, the raw data obtained from water monitoring stations contain some missing values. Incomplete, noisy, and inconsistent data hinder data processing [Balderas, ]. Therefore, pre-processing was applied first to reduce the impact of incomplete and noisy values and to normalize all the monitoring data. The data preprocessing step consisted of data cleaning and data integration. Data cleaning corrected inconsistencies by filling in missing values and minimizing noise using the attribute mean. In data integration, all monitored data were checked for redundancies using schema integration. The dataset was then converted to CSV (UTF-8) format for database creation and machine learning analysis. All realized water quality parameters in each dataset are hereafter referred to as attributes. Table 1 shows the dataset of 703-815 points per parameter for the Chao Phraya River. The Chao Phraya River receives wastewater from agricultural, industrial, and household activities. The average values of DO, BOD, TCB and FCB, which are 4.05 mg/L, 2.29 mg/L, 3.0×10^4 MPN/100 mL and 1.0×10^4 MPN/100 mL, respectively, indicate sufficient quality compared with the standard values.
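The mean substitution used in data cleaning can be illustrated with a minimal Python sketch. It assumes that each monitoring record is a dictionary and that a missing value is stored as None; the function name and the sample values are hypothetical and are not taken from the PCD dataset.

```python
from statistics import mean


def impute_with_attribute_mean(records, attributes):
    """Return a copy of records where each missing value (None) is replaced
    by the mean of the observed values of the same attribute."""
    filled = [dict(r) for r in records]  # copy so the raw data stay untouched
    for attr in attributes:
        observed = [r[attr] for r in records if r.get(attr) is not None]
        if not observed:
            continue  # no observed value to average; leave the gaps as they are
        attr_mean = mean(observed)
        for r in filled:
            if r.get(attr) is None:
                r[attr] = attr_mean
    return filled


# Hypothetical three-sample example (values in mg/L)
sample = [
    {"DO": 4.0, "BOD": 2.0},
    {"DO": None, "BOD": 3.0},
    {"DO": 5.0, "BOD": None},
]
cleaned = impute_with_attribute_mean(sample, ["DO", "BOD"])
print(cleaned)
```

Mean substitution keeps the attribute mean unchanged but reduces its variance, which is one reason the amount of missing data per attribute is worth reporting alongside the cleaned dataset.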

Attribute realization step
The attribute-realization step was implemented to quantify the contribution of each monitoring parameter, hereafter called an attribute, to the water quality classification approach. These contributing attributes are crucial because they provide the main constituents for index calculation [Khalil et al., 2014]. To determine the attributes in each constituent index, the water quality parameters were grouped by the characteristic they represent. The realization was performed using three groups of monitored parameters: (i) Turb, Cond, TDS, and SS for physical characteristics; (ii) DO, BOD, NO3-N, NH3-N, Sal, and TN for chemical characteristics; and (iii) TCB and FCB for biological characteristics. The realization was carried out using PostgreSQL and MySQL Workbench, with pseudocode used in programming the algorithms. To determine the parameters contributing significantly to water quality classification, the essential attributes representing each water characteristic were evaluated against the traditional method of WQI calculation, hereafter called calculated WQI. The Apriori algorithm was implemented to identify the importance of each parameter in producing the same class as the WQI calculated by the traditional method of the PCD and USEPA. Two criteria were used. First, the value of each water quality attribute was labelled according to the surface water quality standard: A for a concentration lower than the standard value and B for a concentration higher than the standard value. Second, the classification followed the Inland Water Quality Information System of PCD (IWIS-PCD), which classifies water quality into four classes: I (good), II (fair), III (poor), and IV (very poor) [IWIS, 2016].
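One possible reading of the contributed support is sketched below in Python. It is an assumption based on the two criteria above and on the note to Table 2, not the authors' implementation: the share of A-labelled records that fall in class I or II is averaged with the share of B-labelled records that fall in class III or IV. The function names, the standard value, and the sample records are hypothetical, and the special handling of DO mentioned in the source is not modelled.

```python
def label_level(value, standard):
    """Label a measurement A (lower than the standard) or B (otherwise)."""
    return "A" if value < standard else "B"


def contributed_support(records):
    """records: list of (level, wqi_class) pairs, with level in {"A", "B"}
    and wqi_class in {"I", "II", "III", "IV"} taken from the calculated WQI."""
    a_classes = [c for level, c in records if level == "A"]
    b_classes = [c for level, c in records if level == "B"]
    if not a_classes or not b_classes:
        raise ValueError("both A and B records are needed")
    # Share of low-concentration records that land in the good or fair class
    support_a = sum(c in ("I", "II") for c in a_classes) / len(a_classes)
    # Share of high-concentration records that land in the poor or very poor class
    support_b = sum(c in ("III", "IV") for c in b_classes) / len(b_classes)
    return (support_a + support_b) / 2


# Hypothetical measurements (mg/L) against a hypothetical standard of 0.5 mg/L
measurements = [(0.2, "I"), (0.3, "II"), (0.4, "III"), (0.1, "I"),
                (0.9, "III"), (1.2, "IV"), (0.7, "II"), (2.0, "IV")]
records = [(label_level(v, 0.5), c) for v, c in measurements]
support = contributed_support(records)
print(support)
```

Under this reading, an attribute whose A/B label always agrees with the good-fair versus poor-very poor split of the calculated WQI would reach a contributed support of 1.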

Algorithm selection step
Four mathematical functions, namely the linear, sigmoid, radial basis, and polynomial functions, were examined for their suitability and compatibility with the monitoring data in developing the classification approach for river water quality. In this procedure, the SVM algorithm was used to transform the original water attributes into a multidimensional feature space. Clustered data groups were then identified, and a hyperplane for data classification was constructed. The procedure was conducted using PostgreSQL, Python, and pseudocode. The six steps in the application procedure were: (i) the system divided the dataset by automated random sampling into a training set, accounting for 80% of the total monitoring data, and a testing set, accounting for the remaining 20%; (ii) the system acquired the dataset from the database; (iii) the system set up the algorithm for classification; (iv) the system determined the optimum hyperplane by maximizing the margin between the two classes on either side of the hyperplane; (v) the kernel function was optimized by comparing four functions, namely the linear, sigmoid, radial basis, and polynomial functions; and (vi) the performance of the algorithms was evaluated, and the most suitable algorithm was selected based on its precision, recall, F1-score, and accuracy. The SVM algorithm used for supervised learning and performance evaluation is detailed in Figure 5.
Note: a nd, non-detectable when Turb < 5.0 NTU, SS < 25 mg/L, BOD < 1.5 mg/L, NH3-N < 0.5 mg/L, Sal < 1.00 ppt, and TN < 6.0 mg/L; b Standard value for surface water of PCD; c Standard value for surface water of USEPA.
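For reference, the four functions compared in step (v) are commonly written as SVM kernels in the standard forms below. The hyperparameters \gamma, r, and d are generic symbols; their values are not reported in this study.

```latex
\begin{aligned}
\text{linear:} \quad & K(x, x') = x^{\top} x' \\
\text{polynomial:} \quad & K(x, x') = \left(\gamma\, x^{\top} x' + r\right)^{d} \\
\text{radial basis:} \quad & K(x, x') = \exp\!\left(-\gamma \lVert x - x' \rVert^{2}\right) \\
\text{sigmoid:} \quad & K(x, x') = \tanh\!\left(\gamma\, x^{\top} x' + r\right)
\end{aligned}
```

The linear kernel corresponds to a separating hyperplane in the original attribute space, whereas the other three implicitly map the attributes into a higher-dimensional feature space.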

Validation step
Various indicators are used to assess the performance of mathematical functions, depending on whether the approach is prediction or classification. For example, a linear regression model is verified by the mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), or R-squared (R²). In this study, however, a classification model was developed. The classification performance of each mathematical function in the ML approach was determined using precision, recall, F1-score, and accuracy [Muharemi et al., 2019]. Each of these measures ranges from 0 to 1, where the minimum and maximum values indicate poor and perfect classification, respectively [Chicco and Jurman, 2020]. In this study, the evaluation criteria were
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN},$$
$$\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
where: TP - true positive predicted results; TN - true negative predicted results; FP - false positive predicted results; FN - false negative predicted results.
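As an illustration, the standard definitions of precision, recall, F1-score, and accuracy can be computed from the four confusion counts as in the following Python sketch. The function name and the example counts are hypothetical and are not results of this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1-score and accuracy from confusion counts.
    Ratios with a zero denominator are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for one class treated as the positive class
m = classification_metrics(tp=8, tn=9, fp=2, fn=1)
print(m)
```

For a four-class problem such as the WQI classes, these counts are usually taken per class (one class against the rest) and the resulting scores are then averaged.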
The new dataset of the Tha Chin River from January 2017 to February 2019 was used to validate the model. The proposed AR-SVM approach was evaluated for its ability to classify the water quality class, applying the findings of the attribute realization and algorithm selection steps. The validation results are discussed for the greatest contributing attributes, based on precision, recall, F1-score, and accuracy.

Preprocessed data
The application of AR-SVM to river water quality classification was investigated for the Chao Phraya River. The approach sought the minimum number of attributes needed by the SVM algorithm to classify water quality. The results were based on the analytical pattern and statistical correlative frequency in the analysis of water attributes. The historical data indicated substantial variations in most of the contributing attributes, which affected water quality classification. Table 1 summarizes the average values of water quality parameters in the Chao Phraya River. After preprocessing for missing data correction and noise minimization, each attribute showed consistent trends. The dataset consisted of 815 points. Of the 9,780 values in these 815 points, 561 were missing: Turb (6), Cond (12), Sal (16), DO (2), BOD (3), TCB (2), FCB (3), NO3-N (23), NH3-N (67), TN (67), SS (23), and TDS (337). The data indicated that the river water was good (DO above 4.0 mg/L) at Nakhon Sawan. Contamination was then found where the river passed through communities and industrial areas, owing to the inflow of wastewater from agricultural and industrial activity, recreation, and households.
Missing data in this study arose from limitations during water quality sampling, such as severe weather and equipment problems, which are of the missing completely at random (MCAR) type, and from limitations during laboratory analysis, such as values below the equipment detection limit and missing records, which are of the missing at random (MAR) type. Such missing values introduce errors into the analysis results. Pre-processing therefore improved the quality of the dataset.
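The missing-data counts reported for the 815 points can be cross-checked with a few lines of Python. The per-attribute counts are those given in this section; the derived percentages are computed here and are not stated in the source.

```python
# Missing values per attribute, as reported for the 815 monitoring points
missing_per_attribute = {
    "Turb": 6, "Cond": 12, "Sal": 16, "DO": 2, "BOD": 3, "TCB": 2,
    "FCB": 3, "NO3-N": 23, "NH3-N": 67, "TN": 67, "SS": 23, "TDS": 337,
}

n_points = 815
n_attributes = len(missing_per_attribute)  # 12 monitored parameters
total_values = n_points * n_attributes
total_missing = sum(missing_per_attribute.values())

# Share of all values that are missing, and share of the gaps due to TDS
missing_percent = 100 * total_missing / total_values
tds_share_percent = 100 * missing_per_attribute["TDS"] / total_missing

print(total_values, total_missing)
print(round(missing_percent, 2), round(tds_share_percent, 2))
```

The totals agree with the 9,780 values and 561 missing values stated in the text; TDS alone accounts for most of the gaps, so mean substitution affects this attribute far more than the others.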

Realized attributes
The attribute realization identified the contribution of each monitoring parameter to the water quality classification. The highest contributing attributes were NH3-N, TCB, and FCB, from the chemical and biological characteristics, as shown in Table 2. Based on frequency correlation and the pattern of occurrences, the realization of the chemical parameters indicated that the change in NH3-N concentration made the highest contribution to water quality classification, with a contributed support of 0.80, compared with 0.79, 0.78, 0.76, 0.69 and 0.64 for TCB, FCB, BOD, DO, and Sal, respectively. This finding confirmed that NH3-N was one of the main attributes indicating contamination of river water by agricultural and household activities. NH3-N is the product of aquatic organism excretion and organic residue decomposition in amino acid catabolism [Mallasen and Valenti, 2015], and also originates from precipitation, anthropogenic sources, and bacterial activities [Frazier et al., 1996]. The second and third highest contributions to water quality classification came from TCB and FCB. These biological parameters provided contributed support values of 0.79 and 0.78, respectively. The data pattern of TCB had the highest frequency effect on the water quality class, while that of FCB was slightly lower. Both TCB and FCB are indicators used in recreational water quality standards and represent gastrointestinal illness [USEPA, 1986; Francy et al., 1993]. Water in which the concentrations of these attributes were lower than the standard limit was defined as being of good quality.
In addition, we found that both TCB and FCB were highly essential attributes for water quality analysis. These parameters have an enormous impact on water quality because they indicate intestinal bacteria and are correlated with water-borne disease. TCB and FCB are the leading indicators of the recreational water quality standard and are used to quantify gastrointestinal illness [USEPA, 1986; Francy et al., 1993; Cude, 2005]. In particular, TCB is a crucial parameter that can affect human health as the initiator of many illnesses, while FCB represents contamination from the intestinal tracts of warm-blooded animals and from other sources such as plants, soil, or seeds [Cude, 2005]. The fourth and fifth highest contributions to water quality classification came from BOD and DO, with contributed support of 0.76 and 0.69, as shown in Table 2. Both BOD and DO also provide a direct indication of the quality level of river water. These parameters depict the carbonaceous biological oxygen demand for digesting the remaining organic matter via biological metabolism and the oxygen in the water. In practice, DO analysis is easier and quicker than BOD analysis. DO is an essential indicator of clean water, while BOD measurements are sometimes affected by nitrogenous contaminants that also demand oxygen. This extra oxygen demand can arise from algal respiration during intensive radiation [Bayram et al., 2012]. DO is also a critical factor for aquatic life and the aquatic ecosystem, making it one of the most crucial water quality attributes. The realized DO data pattern had the highest frequency effect on water quality classification and was a significant attribute regarding chemical characteristics (Table 2). The classification level was defined as good where the concentration of a parameter was lower than the standard amount.
Table 2. Realized result of the attribute contributed to the classification of river water quality
Note: a A is the lower concentration of the attribute's value compared with the surface water quality standard value from PCD and USEPA, except DO; b B is the higher concentration of the attribute's value compared with the surface water quality standard value from PCD and USEPA, except DO; c-f I-IV are the surface water quality classes calculated from WQI: c I (good water class), d II (fair water class), e III (poor water class), and f IV (very poor water class); g Avg of contributed support is the average of the lower concentration of attribute's value than standard value that contribute to water quality class I and II and the higher concentration of attribute's value than standard value that contribute to water quality class III and IV.
The quality of this river water was then classified as good. This finding was consistent with the assumption of the traditional calculation that these physical characteristics affect the overall water quality less significantly. Turb and TDS are related to the amounts of suspended solids, colloids, and organic and inorganic particles. However, Turb refers to particles in water and is determined by the amount of light scattered by them. Turb includes dissolved particles in the water, like TDS, and is affected by colour, fluorescent dissolved organic matter, SS, and TS. These are particulate matter from sediment, soil erosion, runoff, discharges, and algal blooms. Turb is a clearer indicator of water quality than TS and SS because it includes the colour of dissolved organic matter (DOM). Furthermore, it is not affected by settled solids during the rainy and dry seasons. Although Turb is not an inherent property of water, it is an indicator of the environmental health of water bodies and is used to regulate drinking water and to determine water clarity for aquatic organisms [Anderson, 2005] and marine ecosystems [Srivastava and Kumar, 2013; Parra et al., 2018].
The water quality of the Chao Phraya River was significantly affected by activities on both riverbanks, such as community waste discharge, commercial waste, agricultural contamination, and industrial waste [PCD, 2018]. These activities influenced the contamination and quality of the river. The upper part of the river tended to have lower-density communities than the middle and downstream parts, so better water could be expected in the upper reaches. For example, Cond, TS, and SS values were similar to WQI of the upstream part of the river (CH01, 03, 06, 08, 10, and 12). Slight differences were observed from the midstream to the beginning of the downstream part (CH15, 16.1, 17, 18, 20, 21, 24, 25, 27, and 28), while a significant change was evident downstream (CH30 and CH32). From the attribute-realization step, it was concluded that NH3-N, TCB and FCB were the realized parameters contributing most to the quality index classification. These attributes were then applied to develop the classification approach via algorithm selection.
The probability distribution and distribution frequency of the data are related to the water quality classification and the weighting order criteria. NH3-N, TCB, FCB, and BOD follow a continuous exponential probability distribution (ordered values) whose occurrence coincides with the water classes, so they have a highly accurate relationship with the water quality class (represented as contributed support). These results are similar to the findings of Rodrigues et al. [2016]. DO follows a normal probability distribution, which results in some misclassification in the dataset. Sal combines a detached-island type and an exponential probability distribution, which also causes misclassification. Two criteria were applied to analyze the contributed attributes, with the expectation of obtaining more accurate values and classification. According to the attribute concentration, a value lower than the standard (A) is classified into the good (I) or fair (II) water class, while a value higher than the standard (B) is classified into the poor (III) or very poor (IV) water class. In contrast, WQI classification is based on aggregated weighting criteria; therefore, errors arise when the detailed classification is compared with the WQI classification.

Selected algorithm for classification approach development
In this step, the classification performance of four mathematical functions, namely the linear, sigmoid, radial basis, and polynomial functions, was evaluated. The dataset was divided into two sets: training data, which accounted for 80% of the total data or around 653 data points, and testing data, around 162 data points. The testing results from this approach were compared with the calculated WQI. The crucial attributes were selected as a primary set, representing pollutants from anthropogenic activities and harmful pathogens that cause disease and illness through surface water use. The effect of supplementing each further attribute on classification performance under the different classification functions was also considered. The classification performance is summarized in Table 3. The results demonstrated that different numbers of attributes and different mathematical functions produced different performance in terms of accuracy, precision, recall and F1-score. The linear function was the most suitable for river water quality classification. Its best classification performance was obtained when six attributes (NH3-N, TCB, FCB, BOD, DO, and Sal) were applied, giving an accuracy of 0.94. However, three to six attributes also gave satisfactory classification performance, with accuracy between 0.79 and 0.94. Table 3 shows the training results obtained by adding attributes in decreasing order of contributed support. Accuracy increased with the number of attributes for each classification function. On the other hand, classification with more than six attributes showed a slightly decreasing trend, owing to the lower contributed support values and data patterns that are not distributed over a wide range.
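The 80/20 random division of the 815 data points can be sketched in Python as follows. This is a minimal illustration that assumes a simple seeded shuffle; the function name and the seed are hypothetical, and the sampling routine actually used in the study is not specified.

```python
import random


def split_train_test(points, train_fraction=0.8, seed=0):
    """Shuffle the points reproducibly and cut them into training and testing sets."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Indices standing in for the 815 monitoring data points
points = list(range(815))
train, test = split_train_test(points)
print(len(train), len(test))
```

With this rounding the split is 652 training and 163 testing points, which is consistent with the approximate counts given in the text; the exact figures depend on how the 80% boundary is rounded.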
Classification using different attributes and mathematical functions produced different performance, as shown in Table 3 and Figure 6. Three to six attributes were found to be a suitable condition for classification. Applying a larger number of attributes appeared unnecessary and subsequently reduced classification performance. The linear function gave higher classification performance than the other mathematical functions. However, a similar trend was found in all functions: classification performance increased as the number of attributes increased from three to six. For example, six attributes provided the highest accuracy for the linear function, 0.86-0.94. These were followed by five, four, and three attributes, which gave 0.83-0.91, 0.78-0.89, and 0.73-0.79, respectively. With three to six attributes, the classification results showed that the linear function was suitable for classification. The best performance was obtained by the linear function with six attributes: NH3-N, TCB, FCB, BOD, DO, and Sal. The classified results of the developed AR-SVM approach, using one to twelve attributes and four functions, were compared with the traditional calculation of WQI. The conventional water classification results from WQI were used for comparison and are represented as calculated WQI. NH3-N, TCB, and FCB were the highest contributing attributes from the chemical and biological characteristics and are the primary pollutants from municipal wastewater. These attributes provided a classification performance of over 0.70. When BOD, DO, and Sal were added, classification accuracy increased owing to the more comprehensive data range (4-10 ranges) and high data distribution frequency.
Turb gave lower classification accuracy than Sal when used as the sixth attribute, which might be caused by (1) missed training of class I, where the classification accuracy of Sal is 96% (110 of 115 data) while that of Turb is 0% (0 of 115), and (2) a limited ability to classify classes I and II: Turb classifies fewer than 1% of classes I and II, whereas Sal classifies over 95% of classes I and II and 63% of class IV. In addition, Sal represents saltwater intrusion in the current situation of the Chao Phraya River. Because of the lower amount of water in the downstream part of the river, saltwater intrudes from the Gulf of Thailand into the river, especially during the dry season and periods of delayed rain. In addition, typhoon Linda in 1997 and tropical storm Pabuk in 2019 had widespread effects along the Gulf of Thailand and raised sea levels downstream of the river [Charoensuk et al., 2019]. Sal concentrations above the standard also affect agricultural and irrigation use, the water supply process and quality, aquatic life and plants in the river, the ecosystem along the river, and human health. Therefore, Sal could be included in water quality analysis and classification for river water sources. Figure 6 compares the classification performance of the four algorithms as the number of attributes increases. Furthermore, the results from each function follow a trend similar to those of Bui et al. [2012], Ravi [2016] and Kalcheva et al. [2020], who reported that the linear function performed better than the radial basis, sigmoid, and polynomial functions. This is because the linear function is best suited to linear data and to binary and multiple classes [Fan et al., 2008], while the radial basis, polynomial, and sigmoid functions are powerful for classifying nonlinear and S-curve data [Keskes and Braham, 2014].
The prediction results for water quality classification of the Chao Phraya River showed that the optimum condition was six attributes with the linear function. The optimum condition provided over 80% accuracy for each class.
Class I: classification accuracy was 85.71% (48 correct classifications out of 56 calculated-WQI results). The remaining 14.29% (eight results) were misclassified into other classes.
Class II: classification accuracy was 81.82% (18 correct classifications out of 22). The 18.18% error was caused by misclassification into class I (9.09%, two misclassifications) and class III (9.09%, two misclassifications).
Class III: classification accuracy was 86.76% (59 correct classifications out of 68). The 13.24% error was caused by misclassification into class I (8.82%, six misclassifications) and class IV (4.41%, three misclassifications).
Class IV: classification accuracy was 86.67% (13 correct classifications out of 15). The two misclassifications (13.33%) were assigned to classes I and III. The results of classification under the optimum condition are shown in Figure 7.
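The per-class percentages above can be checked directly from the reported counts. The following is a minimal sketch; the function names are illustrative, and the overall figure assumes that the four classes listed make up the complete evaluation set.

```python
# (correct classifications, total calculated-WQI results) per class,
# taken from the counts reported in the text above.
REPORTED_COUNTS = {
    "I": (48, 56),
    "II": (18, 22),
    "III": (59, 68),
    "IV": (13, 15),
}


def per_class_accuracy(counts):
    """Return the fraction of correctly classified results for each class."""
    return {label: correct / total for label, (correct, total) in counts.items()}


def overall_accuracy(counts):
    """Return the fraction of correct results pooled over all classes."""
    total_correct = sum(correct for correct, _ in counts.values())
    total_results = sum(total for _, total in counts.values())
    return total_correct / total_results


class_accuracy = per_class_accuracy(REPORTED_COUNTS)
for label, value in class_accuracy.items():
    print(f"Class {label}: {value * 100:.2f}% correct, {(1 - value) * 100:.2f}% error")

print(f"Pooled accuracy: {overall_accuracy(REPORTED_COUNTS):.4f}")
```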

Validation of the developed approach
The AR-SVM approach was validated using a new set of monitoring data from the Tha Chin River. The attributes and function identified in the realization and SVM function selection steps were adopted. The results showed that the proposed approach could classify river water quality with accuracies of 0.95, 0.90, 0.86, and 0.86 for six to three attributes, respectively. Table 4 compares the water quality classification results of the proposed approach with the calculated WQI at 16 sampling points (TC01-TC28) in 2017-2019. The predictions were accurate and corresponded well with the traditional WQI values, giving the same result for 15 of the 16 datasets, or 93.75% accuracy. The one inaccurate result (1 of 16 datasets, or 6.25%) differed from the traditional WQI result by two classes. The comparison of the predicted classification with the traditional WQI of the Tha Chin River is shown in Figure 8.
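The agreement figure above is a simple match rate between predicted and calculated-WQI classes. A minimal sketch follows; the class labels are placeholders constructed only to reproduce the reported pattern (15 matches and one result off by two classes) and are not the values in Table 4.

```python
def agreement_rate(predicted, reference):
    """Return the fraction of sampling points whose predicted class matches."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def class_offsets(predicted, reference):
    """Return how many classes each prediction is away from the reference."""
    return [abs(p - r) for p, r in zip(predicted, reference)]


# Placeholder labels for 16 sampling points (classes coded as integers).
# These are NOT the study's data: they only mimic 15 matches and
# one prediction that is two classes away from the reference.
reference_classes = [2] * 16
predicted_classes = list(reference_classes)
predicted_classes[-1] = 4

rate = agreement_rate(predicted_classes, reference_classes)
worst_offset = max(class_offsets(predicted_classes, reference_classes))
print(f"Agreement: {rate * 100:.2f}%, largest class offset: {worst_offset}")
```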
Several water parameters are used in water quality studies for monitoring and evaluation, based on their properties. In Thailand, the primary contamination of water resources, including the Chao Phraya River, comes from wastewater discharged mainly from municipal and industrial sources and from anthropogenic activities beside the water resources. It is worth noting that the results demonstrated that water quality classification by AR-SVM with multiple attributes is possible and comparable to the conventional calculated WQI for a water body whose condition changes dynamically with time, pollution sources, and the environment. The attribute realization showed that the Chao Phraya River's water quality has been affected by several variables during the past ten years. The approach was based on static analysis and data pattern learning to deal with dynamically changing water quality data. The results demonstrated that an ML approach integrating attribute realization and SVM can identify the attributes and their order and classify accurately, representing the quality of the water resource. In addition, the classification results and trends were similar to those of the traditional method. A minority of the results showed a slightly decreasing trend in performance, although that performance is still usable for water quality classification. In detail, the attribute realization step provided the minimum number of water quality attributes for WQI classification that covered the three characteristics of the water. The contributing attributes were appropriate for classifying water quality with the linear function in place of the full set of variables. The outcome based on the realized attributes agreed with Ye and Chiang [2006], who found a regular water class due to each water parameter.
A similar result was reported for chronic responses in aquatic ecotoxicology analysis, where a linear function and multifactor probit analysis provided accurate predictions [Slaughter et al., 2007]. The suitable AR-SVM gave a trend similar to the linear weighting method of the calculated WQI classification, consistent with the linear function of SVM. A few errors occurred because of the inherent complexity of the classification process, such as the margin of the classifier, which changes when new data are inserted [Zhou and Jetter, 2006; Gorriz et al., 2017]. Furthermore, the approach could be applied to another river system (Tha Chin) as a case study. The results were similar to the calculated WQI and established the minimum number of (dimensionless) variables necessary.
The AR step also included Sal, reflecting the current situation in Thailand, which faces saltwater intrusion in the dry season and periods of delayed rain. The physical characteristics also vary strongly: during the rainy (flooding) season, soil erosion causes higher concentrations of SS and Turb, while the concentrations of TDS and Sal are lower. In the dry season, the physical characteristics are affected by salt intrusion, which leads to higher concentrations of Sal and TDS but lower concentrations of SS and Turb. Physical characteristics could therefore be of limited use for water quality analysis and classification in Thailand and other developing countries where water quality varies strongly with the season. Therefore, this study could serve as an alternative for water quality analysis and classification in such areas.

CONCLUSIONS
The results demonstrated the possibility of applying a machine learning tool integrating AR and SVM algorithms to classify river water quality. AR identified the attributes contributing most to the river's water quality classification. In order, these were NH3-N, TCB, FCB, BOD, DO, and Sal, with contribution values of 0.80, 0.79, 0.78, 0.76, 0.69, and 0.64, respectively, compared to 0.25-0.64 for TDS, Turb, TN, SS, NO3-N, and Cond. The SVM linear function was the most suitable for river water quality classification with six attributes. It gave the highest classification performance, with an accuracy of 0.94, an average precision of 0.84, an average recall of 0.84, and an average F1-score of 0.84. The minimum condition of three attributes also allowed classification, with an accuracy of 0.73. The validation of the developed approach on the Tha Chin River dataset confirmed the possibility of applying this alternative approach to classify river water, with satisfactory and reliable accuracies of 0.95, 0.90, 0.86, and 0.86 for six to three attributes, respectively. The prediction results in 2019 were 93.75% accurate and corresponded well with the traditional WQI values. The findings demonstrate a beneficial application of ML for the classification of river water quality and the possibility of using different attributes, which influence classification performance and relate significantly to the contamination sources.