Geospatial Assessment of Regression Analysis Between the Hydrocarbon Content in Surface Waters and Snow Cover on the Example of the Territories of the Far North of Russia

The article presents generalized results of an analysis of oil pollution of surface waters in the fields of the Far North. The study area, defined by the administrative-territorial division of the Russian Federation, is the Khanty-Mansi Autonomous Okrug – Yugra (KhMAO). The results are based on field sampling data collected over one year. The relationship between the hydrocarbon content in surface waters and in snow cover was assessed. The aim of the work was to consider the snow cover as a natural source of pollutants affecting their accumulation in surface waters and snow cover. The results obtained can be used for subsequent observations of snow cover and surface waters, and can serve as a basis for planning further research and developing environmental protection solutions for the Far North. The dependencies between the indicators of hydrocarbon pollution in surface waters and snow cover were analyzed using correlation analysis and parametric multivariate regression analysis, together with geoinformation analysis and GIS technologies. It was revealed that the state of the snow cover and its role as an indicator of atmospheric and soil pollution require further research. On the one hand, the snow cover retains metals, and polluted soil areas are formed locally; on the other hand, after the snow melts, the pollutants remaining on the surface enter rivers with surface runoff and are carried by the wind over quite long distances.


INTRODUCTION
In matters of climate protection, the Russian Federation positions itself as a developing country facing a dual task: economic development and environmental protection. Therefore, as part of its overall modernization program, it declares environmental protection one of its main national goals, considers sustainable development an important strategy, and implements measures to prevent and control pollution and to protect the environment.
The Russian Federation takes an active part in the climate agenda: an impetus has been given at the federal level, and discussions have been launched at the level of the Government of the Russian Federation, the State Duma, and the economic sectors. Today, most of the instruments of global climate policy have been implemented or are under development in Russia, such as carbon regulation, renewable energy incentives, green financing mechanisms, a green certificates market, and an ESG taxonomy. Within the framework of the 2015 climate conference in Paris, a goal was set for the Russian Federation: to reduce polluting emissions by 2030 to 70-75% of the 1990 level, provided that the absorption capacity of forests is taken into account as much as possible. The main objectives of environmental protection include urban air quality control, improvement of surface water quality, and reduction of total carbon dioxide emissions.
The Arctic, the Arctic zone of the Russian Federation, the Russian Arctic, and the Far North are all territories undergoing intensive industrial development, especially subsurface development, that is, the development of oil fields and oil and gas fields. The oil industry occupies one of the first places in the fuel and energy complex in terms of technogenic impact, material intensity, and labor intensity.
As a result of half a century of development of oil and gas-bearing territories, the natural environment of the Far North has undergone significant transformations and disturbances, and their consequences are unlikely to diminish in the near future. In the current situation, the preservation and restoration of natural resources, the prevention of negative man-made impacts, and the elimination of their consequences are urgent tasks of the environmental policy of the Far North.
The surface waters of the Far North are experiencing a powerful anthropogenic load associated with the active development in recent decades of urban infrastructure and of the largest oil and gas production complex in Russia. As a result of man-made impact on the water bodies of the Far North, the state of surface waters is characterized as unfavorable. Thus, the Ob River within the Far North is classed as "dirty". The Irtysh River is one of the most polluted water bodies that require priority implementation of environmental measures. Many rivers of the Far North are classed as "very polluted" or "dirty". Water bodies are contaminated with nitrite nitrogen, ammonium nitrogen, petroleum products, and iron, copper, zinc, and manganese compounds.
Therefore, the assessment of surface water quality, identification of pollution sources, its scale and dynamics are the basis for making the most important management decisions in the field of environmental management.
The aim of the work was to consider the snow cover as a natural source of hydrocarbon pollutants affecting their accumulation in surface waters and snow cover. To achieve this goal, the following tasks were solved:
• an overview of the problem of soil and snow cover pollution under the conditions of the Far North was made according to the available literature data;
• sampling and chemical-analytical studies of snow cover and surface waters were carried out;
• based on the authors' own research results, the state of snow cover and soil in the studied territories was characterized.
Geoinformation systems allow creating information in digital form (Klemmer, 2021), which can then be used for continuous monitoring of environmental problems. Therefore, the geoinformation system ArcGIS Pro was adopted as a tool for solving the problems in the study.

Research area
Environmental pollution by oil and petroleum products is one of the most important issues on the global climate agenda.
In order to analyze the oil pollution of surface waters in the fields of the Far North, the territory of the Khanty-Mansiysk Autonomous Okrug – Yugra (KhMAO), which is an administrative-territorial unit of the Russian Federation, was selected.
Khanty-Mansiysk Autonomous Okrug – Yugra is one of the world leaders in the production of hydrocarbons. The areas of large and unique deposits of the Middle Ob region that have been developed for a long time are characterized by a very high technogenic load from various environmentally unsafe industries and transport systems. Extreme technogenic load is recorded in most of the "old" large oil fields of the Nizhnevartovsk, Surgut and Nefteyugansk districts. The absence or poor development of the extractive industry and transport communications in a significant part of the west and east of the district has led to a relatively favorable environmental situation there: most of the Berezovsky and Beloyarsk districts and the east of the Nizhnevartovsk district were practically unaffected by industrial influence (Kurakova and Chalov, 2020; Bogdanov et al., 2020).
This study (carried out in the laboratories of the Industrial University of Tyumen) was conducted on the territory of the Khanty-Mansi Autonomous Okrug – Yugra, located in the middle part of Russia and the Eurasian continent (Figure 1). From west to east, the region stretches for 1400 km, from the eastern slopes of the Northern Urals almost to the banks of the Yenisei; from north to south, it stretches for 900 km, from the Siberian Uvalas to the Kondinsky taiga. The entire territory of Yugra belongs to the regions of the Far North.

Sampling
In this study, the samples were taken from the territory of the KhMAO (Figure 2).
The study area was divided into a grid with a cell size of 20,000 m. Next, random selection was used to choose the sampling units within each grid element, which makes it possible to analyze and cover sample sites throughout the study area (the study was carried out in the field and funded by the Industrial University of Tyumen).
As initial data for spatial reference, the methods of satellite-geodetic determination of the coordinates of sampling sites were used. The coordinate system adopted is the Geographic Coordinate System (GCS).

Timing of sampling data for the year
The surface water data cover April-May (high water), and the snow cover data cover March-April.
Using the functionality of the ArcGIS Pro geoinformation system, a number of models for establishing the dependence and independence of the data were obtained. Going through all the stages of building a correct regression model yielded different variants of data representation in the models (with and without transformed variables), which make it possible to study different aspects of the data and to analyze the coefficient surfaces. This provides and expands the knowledge needed to model the investigated process (Chabuk, 2021; Fischer, 2009; Boori, 2021).
The following methods of analysis were used: 1. Average nearest neighbor. 2. Spatial autocorrelation.

Average nearest neighbor method
Nearest neighbor analysis is a generally accepted method of point distribution analysis: the distance from each point to its nearest neighbor is determined, and this value is compared with the average distance between neighbors. The average nearest neighbor distance gives a measure of the sparsity of points in the distribution. This is valuable in itself, since in some cases point objects can conflict if they are located too close to each other (Nath, 2021).
The method is based on an algorithm that calculates the distance between the center of each object and the center of its nearest neighbor. These distances are then averaged over all objects. The resulting average is compared with the theoretical value expected for a random distribution. Based on this comparison, the software classifies the distribution of objects as either clustered or dispersed.
The average nearest neighbor value is calculated using formula 1.
where: d_i is the distance between object i and its nearest neighbor, n is the total number of objects, S is the area of the minimum rectangle covering all objects.
The obtained value is evaluated using formula 2.
where: M is the RMS measurement error.
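As a minimal sketch, assuming that formulas 1 and 2 follow the standard average nearest neighbor formulation (observed mean distance divided by the expected mean distance 0.5/sqrt(n/S), with a z-score based on the standard error 0.26136/sqrt(n²/S), taken here to play the role of M), the calculation could look as follows. The function name and the brute-force neighbor search are illustrative assumptions, not the ArcGIS Pro implementation.

```python
import math


def average_nearest_neighbor(points, area):
    """Return (observed mean, expected mean, ratio, z-score) for a point set.

    points: list of (x, y) tuples in a projected coordinate system (assumed)
    area:   S, the area of the minimum rectangle covering all points
    """
    n = len(points)
    # d_i: distance from point i to its nearest neighbor (brute force, O(n^2))
    nearest = []
    for i, (xi, yi) in enumerate(points):
        d_i = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        nearest.append(d_i)

    observed = sum(nearest) / n                    # mean of d_i
    expected = 0.5 / math.sqrt(n / area)           # mean under a random pattern
    std_error = 0.26136 / math.sqrt(n * n / area)  # assumed meaning of M
    ratio = observed / expected                    # < 1 clustered, > 1 dispersed
    z_score = (observed - expected) / std_error
    return observed, expected, ratio, z_score
```

A ratio below 1 points to clustering and a ratio above 1 to dispersion; the z-score indicates whether the departure from a random pattern is statistically significant.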

Spatial autocorrelation method
Spatial autocorrelation implements the basic principle of geography: close objects are more similar than distant ones.
The essence of the method is to measure spatial autocorrelation based simultaneously on the location of objects and their values. Given a set of objects and their associated attributes, the method assesses whether the objects are clustered, dispersed, or randomly distributed.
Analysis of the obtained semivariogram graphs shows that when the distance between the reference points is small, the semivariance (half-dispersion) is also small. This means that the measured values are close and, therefore, interrelated due to their spatial proximity. The semivariance increases with the distance between the points, which shows a rapid decline in the spatial correlation of values. Thus, the semivariance is a measure of the relationship between the measured values, depending on how close they are to each other.
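The semivariance (half-dispersion) described above can be illustrated with a short sketch. It assumes the classical empirical estimator, half the mean squared difference between values of point pairs whose separation falls into a given distance bin; the function name and the binning scheme are illustrative assumptions rather than the procedure used in the study.

```python
import numpy as np


def empirical_semivariance(coords, values, bin_edges):
    """Empirical semivariance per distance bin.

    coords:    (n, 2) array of point coordinates
    values:    (n,) array of measured values (e.g. hydrocarbon content)
    bin_edges: increasing distances defining the bins [edge_k, edge_k+1)
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)

    # All unordered point pairs (i < j)
    i, j = np.triu_indices(len(values), k=1)
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq_diff = (values[i] - values[j]) ** 2

    gamma = np.full(len(bin_edges) - 1, np.nan)  # NaN marks empty bins
    for b in range(len(bin_edges) - 1):
        in_bin = (dist >= bin_edges[b]) & (dist < bin_edges[b + 1])
        if in_bin.any():
            gamma[b] = 0.5 * sq_diff[in_bin].mean()
    return gamma
```

Plotting the returned values against the bin centers gives the kind of graph discussed above: small values at short distances and larger values as the distance grows.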

High/Low clustering method
The High/Low clustering method is a multidimensional statistical procedure that collects the data containing the information about a sample of objects, and then arranges objects into relatively homogeneous groups.
The essence of the method is to divide the research area into cells and assign a degree of clustering of high or low values.
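A minimal sketch of a high/low clustering statistic is given below. It assumes the Getis-Ord General G form, in which products of neighboring values within a fixed distance band are compared with products over all pairs; the function name and the binary distance-band weights are illustrative assumptions.

```python
import numpy as np


def general_g(coords, values, distance_band):
    """Getis-Ord General G with binary fixed-distance weights (assumed form).

    Returns the observed G and its expected value under no spatial clustering.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)

    # Pairwise distances between all objects
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # w_ij = 1 if j lies within the distance band of i (i != j), else 0
    weights = (dist <= distance_band).astype(float)
    np.fill_diagonal(weights, 0.0)

    # x_i * x_j for all pairs with i != j
    cross = np.outer(values, values)
    np.fill_diagonal(cross, 0.0)

    observed = (weights * cross).sum() / cross.sum()
    expected = weights.sum() / (n * (n - 1))
    return observed, expected
```

An observed value above the expected one suggests that high values cluster together, while a value below it suggests clustering of low values.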

Ordinary Least Squares (OLS) method
The Ordinary Least Squares (OLS) method performs global linear regression to predict or model a dependent variable based on its relationship with independent variables. It is one of the basic methods of regression analysis for estimating unknown parameters of regression models from sample data.
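For illustration, a least squares fit with one independent variable can be sketched as follows. The function name and the choice of numpy's `lstsq` are assumptions made for this example; it is not the ArcGIS Pro tool itself.

```python
import numpy as np


def fit_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (intercept, slope, r_squared, residuals).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Design matrix with a column of ones for the intercept
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)

    residuals = y - design @ coef
    ss_res = (residuals ** 2).sum()        # unexplained variation
    ss_tot = ((y - y.mean()) ** 2).sum()   # total variation
    r_squared = 1.0 - ss_res / ss_tot
    return coef[0], coef[1], r_squared, residuals
```

The returned R-squared is the share of the variation of the dependent variable described by the model, and the residuals are the input for the diagnostic checks.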

RESULTS
The results of the study (carried out in the laboratories of the Industrial University of Tyumen) were obtained by desk processing of the laboratory data using the ArcGIS Pro software.
The Average Nearest Neighbor method was used to calculate the range of distances to a given number of neighboring objects. The tool returns the minimum, maximum, and average distances to the specified Nth nearest neighbor (N is an input parameter) for a set of objects. The results are written to the tool messages.
Next, an incremental (step-by-step) spatial autocorrelation analysis was performed. It measures spatial autocorrelation for a series of distances and, optionally, creates a line graph of these distances and the corresponding z-scores. Z-scores reflect the intensity of spatial clustering; statistically significant peak z-scores indicate the distances at which the spatial processes that promote clustering are most pronounced. These peak distances are often suitable values for tools with a Distance Band or Distance Radius parameter.
The first peak for surface waters is at 110474.90 meters, whereas for snow cover it is at 56388.62 meters. The results of the step-by-step spatial autocorrelation are shown in Figures 3 and 4.
Identification of the "hot spots". Optimized hot spot analysis takes incident points or weighted objects (points or polygons) and creates a map of statistically significant "hot" and "cold" spots based on the Getis-Ord Gi* statistic in ArcGIS Pro. At the same time, the characteristics of the input object class are evaluated to obtain optimal results.
The optimal fixed distance band, based on the peak clustering, is 34578.0 meters for surface waters and 29107.23 meters for snow cover. The results of the neighborhood calculation are presented in Table 1.
A grid with a cell edge of 20,000 m was built over the study area. The hydrocarbon values for surface waters and snow cover were then aggregated into the grid cells.
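The aggregation step can be sketched as follows. The sketch assumes point coordinates already projected to meters (geographic coordinates would first need to be projected) and assumes that values falling into the same cell are averaged; the text does not state which aggregation statistic was used, and the function name is hypothetical.

```python
import numpy as np


def aggregate_to_grid(xy, values, cell_size=20_000.0):
    """Average point values per square grid cell.

    xy:        (n, 2) array of projected coordinates in meters (assumed)
    values:    (n,) array of measured concentrations
    cell_size: edge of a grid cell in meters
    Returns a dict mapping (column, row) cell indices to the mean value.
    """
    xy = np.asarray(xy, dtype=float)
    values = np.asarray(values, dtype=float)

    # Grid origin at the lower-left corner of the data extent
    origin = xy.min(axis=0)
    cells = np.floor((xy - origin) / cell_size).astype(int)

    sums, counts = {}, {}
    for (col, row), value in zip(cells.tolist(), values.tolist()):
        key = (col, row)
        sums[key] = sums.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}
```

Running the function separately for the surface water and snow cover samples yields two sets of cell values that can be paired cell by cell for the regression.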
Using the least squares method, a linear regression of the dependent variable on the independent variable was calculated. The independent variable is the concentration of hydrocarbons in surface waters, whereas the dependent variable is the concentration of hydrocarbons in the snow cover. Figure 5 shows the results of the least squares method.
The regression equation was obtained for this case, and six checks were carried out:
1. The independent variable is statistically significant.
2. The coefficient of the independent variable is 0.124196; the sign of the relationship is positive.
3. The variance inflation factor (VIF) was not calculated in this case.
4. The Jarque-Bera statistic is statistically significant. The distribution of residuals has a positive skew, so the model is biased. The distribution of residuals is not normal, so the model is incorrect. The graph of residuals against the predicted values of the dependent variable is structured (Figures 6 and 7).
5. R² is 0.018754, which indicates that the model is not correct.
6. No key independent variables were found.
Figure 5. Results of the least squares method for hydrocarbons in surface waters and snow cover
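The residual normality check listed above can be illustrated with a short sketch of the Jarque-Bera statistic, computed from the sample skewness and kurtosis of the residuals. The function name is hypothetical, and the sketch is not the ArcGIS Pro implementation.

```python
import numpy as np


def jarque_bera(residuals):
    """Return (JB statistic, skewness, kurtosis) of regression residuals.

    JB = n / 6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness
    and K is the sample kurtosis (3 for a normal distribution).
    """
    r = np.asarray(residuals, dtype=float)
    n = r.size
    centered = r - r.mean()

    m2 = (centered ** 2).mean()                       # second central moment
    skewness = (centered ** 3).mean() / m2 ** 1.5
    kurtosis = (centered ** 4).mean() / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    return jb, skewness, kurtosis
```

For large samples the statistic is compared with a chi-squared distribution with two degrees of freedom (critical value about 5.99 at the 5% level); a positive skewness corresponds to the positive asymmetry of residuals reported here.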
However, despite the failed checks, the Koenker test is statistically significant, so the investigated model can be improved by switching to geographically weighted regression, a local form of linear regression used to model relationships that vary in space (Figures 8 and 9).
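The idea of a local regression can be sketched as follows: at every location, a weighted least squares fit is made in which nearby observations receive larger weights. The Gaussian kernel, the fixed bandwidth, and the function name are assumptions made for this illustration; the kernel and bandwidth settings used in ArcGIS Pro are not stated in the text.

```python
import numpy as np


def gwr_local_coefficients(coords, x, y, bandwidth):
    """Local (intercept, slope) at every observation location.

    coords:    (n, 2) array of projected coordinates
    x, y:      (n,) arrays of the independent and dependent variables
    bandwidth: kernel bandwidth in coordinate units (assumed fixed)
    """
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    design = np.column_stack([np.ones_like(x), x])
    local = np.empty((len(y), 2))
    for i in range(len(y)):
        # Gaussian distance-decay weights centered on location i
        dist = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (dist / bandwidth) ** 2)

        # Weighted normal equations: (X^T W X) beta = X^T W y
        xtw = design.T * w
        local[i] = np.linalg.solve(xtw @ design, xtw @ y)
    return local
```

Mapping the second column of the result gives a coefficient surface that shows where the relationship between the two variables is stronger or weaker.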
The ArcGIS Pro tool revised the model characteristics obtained when running the least squares tool and, in theory, should show improved AICc and R-squared results. AICc decreased (from -1,634 to -1,745), which is a good indicator, since an improved model should reduce the AICc value by more than 3 points. The R-squared value increased (from 0.02 to 0.24), which means an improvement in the model, since the model now describes a share of 0.24 of the dependent variable.
Attention should be paid to the R-squared indicator. The Relationship graph shows the distribution of all the residuals; their form is structured, which means the model is incorrect. The relationships were tracked along the diagonal of the R-squared line, where the maximum value is 0.40 and the minimum value is -0.03. However, for a reliable result, the R-squared indicator must be 0.5 or more; otherwise the model should not be trusted.
Next, using the Local Bivariate Relationships method, the two variables were analyzed for statistically significant relationships using local entropy. Each object was classified into one of six categories based on the type of relationship. The output data can be used to visualize the areas where relationships between the variables exist and to study changes in the relationships within the study area (Figure 10). The tool showed that there are no statistically significant relationships for any of the tested combinations of the number of neighbors, the number of permutations, and the confidence level. The dependent variable was the concentration of hydrocarbons in surface waters, and the independent variable was the concentration of hydrocarbons in snow cover. None of the relationships between the values is significant.

CONCLUSIONS
The content of petroleum hydrocarbons in surface waters on the territory of KhMAO – Yugra, as a rule, does not exceed the maximum permissible values. This is significantly lower than 10-20 years ago, which indicates the effectiveness of the environmental policy carried out in the district. Regression analysis showed the presence of a statistically reliable dependence between the concentration of hydrocarbons in surface waters and the concentration of hydrocarbons in snow cover. Other studies have confirmed the data on the sources of pollution. Approximately half of the oil hydrocarbons in the fields of KhMAO – Yugra enter natural waters from man-made sources, whereas the remaining share is of natural origin.
The variety of sources of oil pollution makes it urgent to assess the impact of each of them on chemical runoff. The paper assesses how the quality of water resources depends on hydrocarbon pollution.
Mathematical analysis of indicators of the composition of natural waters in deposits that differ in the intensity of man-made load provides ample opportunities for environmental assessment of large territories of the Far North.