Predictive Modelling for Characterisation of Organics in Pit Latrine Sludge from Unplanned Settlements in Cities of Malawi

The limited availability of data on faecal sludge characteristics remains one of the major challenges faced by developing countries in the proper management of faecal sludge. In view of the limited financial resources and expertise in these countries, there is a need for less-resource-intensive approaches to faecal sludge characterisation. Although predictive modelling is used substantially in wastewater, there is limited evidence of its use as a tool for cost-effective characterisation of faecal sludge. In this study, first order multiple linear regression modelling is investigated as a less-resource-intensive approach for accurate prediction of organics (biochemical oxygen demand and chemical oxygen demand) in pit latrine sludge. The predictor variables explored in the modelling include pH, electrical conductivity, total solids, total volatile solids, fixed solids and moisture content. The modelling uses data collected from 80 latrines in unplanned settlements of four cities in Malawi. The study shows that it is possible to reliably predict chemical oxygen demand and biochemical oxygen demand in pit latrine sludge using electrical conductivity and total solids, which require low levels of resources and expertise to determine.


INTRODUCTION
The limited availability of data on faecal sludge characteristics remains one of the major challenges faced by developing countries. This is attributed to a lack of financial resources and expertise, among other factors (Strande et al., 2014). The existing body of knowledge shows high spatial and temporal variability, necessitating the generation of context-specific data (Bassan et al., 2013). Generating such data using the traditional lab-based approach calls for high levels of resourcing. Resource constraints in developing countries thus present a need for less-resource-intensive approaches to the characterisation of sludge (Strande et al., 2014). One such approach is the application of predictive models in the characterisation of faecal sludge. Predictive modelling provides a cost-effective way of generating accurate information (Aguado et al., 2006). Despite its substantial use in wastewater, there is a lack of literature on the use of predictive modelling as a tool for cost-effective characterisation of faecal sludge (Brdjanovic et al., 2007; Singh et al., 2010; Khataee and Kasiri, 2011; Nasr et al., 2012). This study, therefore, explored the applicability of multiple linear regression modelling as a less-resource-intensive method for accurate prediction of organics (biochemical oxygen demand and chemical oxygen demand) in pit latrine sludge from four cities in Malawi.

DATA AND METHODOLOGY
The data used for modelling was collected from 80 pit latrines in unplanned settlements of four cities in Malawi (Blantyre, Lilongwe, Mzuzu and Zomba). Predictive models for organics (biochemical oxygen demand and chemical oxygen demand) in pit latrine sludge were generated using first order multiple linear regression modelling. The generic form of a first order regression model with n predictor variables is:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon \tag{1}$$
where $y$ is the dependent variable, $x_1, \dots, x_n$ are the predictor variables, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_n$ are the regression coefficients and $\varepsilon$ is the random error term.
The procedure for model building and selection is shown in Figure 1. The model building and selection were carried out in Minitab 17 and Microsoft Excel at a significance level of 0.05. The first stage in the model building and selection process was data preparation, during which the data was checked for missing values and outliers. In addition, a visual inspection of the probability plot was performed to check for normality of the untransformed values of the dependent variable. Where normality was not satisfied, the dependent variable was transformed following the order in the Tukey Ladder of Powers until normality was attained (Barker and Shaw, 2015).
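The transformation step can be illustrated with a minimal Python sketch. It is not the Minitab procedure used in the study: the visual inspection of the probability plot is replaced here by the probability-plot correlation coefficient (PPCC), and the rungs of the ladder, their order, the Blom plotting positions and the 0.99 cut-off are all assumptions made for illustration.

```python
import math
from statistics import NormalDist

# Ladder of powers, walked from no transformation towards stronger ones.
# The rungs and their order are an assumption; the study does not list them.
LADDER = [
    ("identity", lambda y: y),
    ("sqrt", math.sqrt),
    ("log", math.log),
    ("reciprocal_sqrt", lambda y: -1.0 / math.sqrt(y)),
    ("reciprocal", lambda y: -1.0 / y),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def normal_scores(n):
    """Standard-normal quantiles at Blom plotting positions."""
    nd = NormalDist()
    return [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def ppcc(values):
    """Correlation between sorted data and normal scores.

    Values close to 1.0 correspond to a nearly straight probability plot.
    """
    return pearson(sorted(values), normal_scores(len(values)))


def first_normalising_transform(values, threshold=0.99):
    """Return (name, ppcc) for the first rung whose PPCC meets the threshold.

    Assumes strictly positive values, as is the case for BOD and COD.
    The threshold is a hypothetical stand-in for visual judgement.
    Returns (None, None) if no rung qualifies.
    """
    for name, fn in LADDER:
        score = ppcc([fn(v) for v in values])
        if score >= threshold:
            return name, score
    return None, None
```

Calling `first_normalising_transform(bod_values)` on a list of positive measurements (`bod_values` is a hypothetical name) returns the name of the first transformation that straightens the probability plot, which mirrors the "transform until normality is attained" rule described above.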
The second stage was the identification of predictor variables and their combinations for building competing models. The latrine sludge parameters requiring low skill and resourcing levels were selected as predictor variables. These included pH, electrical conductivity (EC), moisture content (MC), total solids (TS), total volatile solids (TVS) and fixed solids (FS). The pH and EC of the pit latrine sludge were determined using potentiometric methods. These methods require only the basic skills of dipping meter probes and reading values directly from the meter (for both the buffer solution during calibration and the sample solution). The sludge moisture content and solids were determined using gravimetric methods, the core skills of which include weighing, setting the right furnace temperature and reading measurements directly from the weighing scale.
In order to reduce the probability of multicollinearity in competing models built from these predictor variables, Pearson correlation was used to identify highly correlated predictor variables. Highly correlated predictor variables were those with |r| ≥ 0.7 (Vatcheva et al., 2016). Subsets of predictor variables were formulated in such a way that no subset contained highly correlated predictor variables. The Best Subset function in Minitab 17 was used to generate candidate models from the different subsets of predictor variables. The best model was selected from the list of candidate models using the prediction sum of squares (PRESS) statistic: the model with the lowest PRESS was selected. Where competing models had the same PRESS value, Akaike Information Criterion (AIC) values were calculated and the model with the smallest AIC value was selected. The AIC statistic was chosen because it aims at achieving parsimony, which fits in well with the resource maximisation that is desirable in resource-constrained settings (Bozdogan, 2000).
The selected model was then investigated for model significance, homoscedasticity, randomness of residuals, outliers, sufficiency of data for precise estimation of the strength of the regression relationship, and multicollinearity of predictor variables. A model was deemed to be significant when its p-value was less than the significance level of 0.05. Homoscedasticity, normality of residuals and outliers were checked through visual inspection of the Residuals vs Fitted Values plots. Specifically, the plots were checked for randomness of points on both sides of zero and for large residuals that could have a strong influence on the model. Large residuals and unusual values were identified and traced back to the untransformed data to investigate their unusual nature. The observation-to-predictor ratio was used to check the sufficiency of data for precise estimation of the strength of the regression relationship. In the literature, the minimum observation-to-predictor ratio ranges from 10 to 30 (Pedhazur and Schmelkin, 2013). The variance inflation factor (VIF) was used to check for multicollinearity of the predictor variables fitted in the model. VIF values below 5 suggest that there is no multicollinearity problem, values of 5 ≤ VIF ≤ 10 show moderate multicollinearity, and VIF > 10 is indicative of significant multicollinearity (Moustris et al., 2012).
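The two multicollinearity safeguards described above, screening predictor subsets by Pearson correlation and banding VIF values, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Minitab workflow used in the study: the function names are hypothetical, the example values are synthetic and are not measurements from the study, and a VIF of exactly 10 is treated as moderate.

```python
import math
from itertools import combinations

R_THRESHOLD = 0.7  # |r| >= 0.7 marks a highly correlated pair, as in the text


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def highly_correlated_pairs(data, threshold=R_THRESHOLD):
    """Return the set of predictor-name pairs whose |r| meets the threshold."""
    flagged = set()
    for a, b in combinations(sorted(data), 2):
        if abs(pearson(data[a], data[b])) >= threshold:
            flagged.add((a, b))
    return flagged


def admissible_subsets(data, threshold=R_THRESHOLD):
    """List every non-empty predictor subset free of highly correlated pairs."""
    names = sorted(data)
    flagged = highly_correlated_pairs(data, threshold)
    subsets = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            if not any(pair in flagged for pair in combinations(subset, 2)):
                subsets.append(subset)
    return subsets


def vif_band(vif):
    """Classify a VIF value using the bands quoted in the text.

    Assumption: a value of exactly 10 is counted as moderate.
    """
    if vif < 5:
        return "none"
    if vif <= 10:
        return "moderate"
    return "significant"


# Synthetic illustration only: MC is constructed as 100 - TS, so the two
# are perfectly correlated and never appear together in a subset.
example = {
    "TS": [10.0, 15.0, 22.0, 30.0, 41.0, 55.0],
    "MC": [90.0, 85.0, 78.0, 70.0, 59.0, 45.0],
    "EC": [5.1, 2.3, 4.8, 1.9, 3.7, 4.2],
}
candidate_subsets = admissible_subsets(example)
```

Each tuple in `candidate_subsets` would then be passed to a model-fitting routine to produce one competing model, and `vif_band` would be applied to the VIF values of the model finally selected.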
Model validation was performed using the predicted R-squared value (R²pred), which measures how well a model predicts responses for new observations. Predictions were validated by comparing the model R² and R²pred values. A model was judged to provide valid predictions if the values of R²pred and R² were close to each other (Frost, 2013).
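The PRESS statistic used for model selection and the R²pred used for validation are closely linked: PRESS is the sum of squared leave-one-out prediction errors, and R²pred = 1 − PRESS/SST, where SST is the total sum of squares. The Python sketch below shows this for a single-predictor model to keep it short; the function names are hypothetical and the code is an illustration, not the Minitab computation used in the study.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b1 * mx, b1


def total_sum_of_squares(ys):
    """SST: squared deviations of the response from its mean."""
    my = sum(ys) / len(ys)
    return sum((y - my) ** 2 for y in ys)


def press(xs, ys):
    """PRESS: refit without each observation, then predict that observation.

    xs and ys are expected to be lists of equal length.
    """
    total = 0.0
    for i in range(len(xs)):
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return total


def r_squared(xs, ys):
    """Ordinary R-squared of the model fitted to all observations."""
    b0, b1 = fit_line(xs, ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - sse / total_sum_of_squares(ys)


def predicted_r_squared(xs, ys):
    """Predicted R-squared: 1 - PRESS / SST."""
    return 1.0 - press(xs, ys) / total_sum_of_squares(ys)
```

Because PRESS can never be smaller than the ordinary residual sum of squares, R²pred is always at most R². A large gap between the two suggests that the model fits the sample better than it predicts new observations, which is the comparison the validation step relies on.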

Predictive model for biochemical oxygen demand (BOD)
The study found that linear regression modelling can be used to reliably predict biochemical oxygen demand (BOD) in latrine sludge from unplanned settlements across the four cities. The predictive model arrived at relates BOD to TS, where BOD is biochemical oxygen demand (mg/g TS) and TS is total solids (%).
The BOD model statistics are shown in Table 1. The relationship between the model variables is significant (p < 0.0001) and explains about 91% (R²) of the variability in the data. Since R² > 75%, the variability explained is substantial enough to have confidence in the model (Hair et al., 2013). There is no effect of multicollinearity, since the VIF (1) for the model is less than 5. The observation-to-predictor ratio for the model (240) is greater than the minimum of 10 to 30. No observable trend was found in the Residuals vs Fitted values plot, implying homoscedasticity and randomness of residuals. The model provides valid predictions, as R²pred (90.9%) is close to R² (91.0%). This prediction model presents a way of cutting down on the time required to analyse faecal sludge for BOD. It takes at least 5 days to obtain BOD results with the method used in this study, whereas total solids determination takes less than 24 hours. However, it should be noted that this level of reliability holds only for the BOD range of 3.65 to 1139.7 mg/g TS within which the model was developed.

Predictive model for chemical oxygen demand (COD)
Linear regression modelling produced a model that allows a reliable prediction of chemical oxygen demand (COD) from TS and EC, where COD is chemical oxygen demand (mg/g TS), TS is total solids (%) and EC is electrical conductivity (µS/cm).
The COD model statistics are shown in Table 2. The COD model is significant (p < 0.0001) and explains a substantial part of the variability in the data, with R² (91.8%) > 75% (Hair et al., 2013). No multicollinearity of the fitted predictor variables exists in the model, as both variables have a VIF of 1.03, which is less than 5. The observation-to-predictor ratio (160) for the model is greater than the minimum range of 10 to 30. Valid predictions can be made from the model, since R²pred (91.6%) and R² (91.8%) are close to each other. The Residuals vs Fitted values plot did not display any observable trend, implying homoscedasticity and randomness of residuals. Though COD takes a shorter duration (2 hours) to obtain results, the model still provides a cost-effective way of generating data on the COD of pit latrine sludge, since the reagents and expertise required to conduct a COD test outweigh the requirements of gravimetric methods. Just like the BOD model, the level of reliability for this model holds within the COD range of 33.8 to 9604.4 mg/g TS.

CONCLUSIONS
The study has demonstrated that it is possible to reliably predict BOD and COD in pit latrine sludge using electrical conductivity and total solids, which require low levels of resources and expertise to determine. This predictive characterisation seems to be applicable across different spatial settings/localities. Since the models were developed using data from latrines from only four sites, there is a need to evaluate the performance of these models with sludge from other urban areas of Malawi for generalisability at a national level.

Figure 1. Model building and selection flow chart

Table 2. COD model statistics