Parallelization of Concise Convolutional Neural Networks for Plant Classification

Monitoring the agricultural field is key to preventing the spread of disease and handling it quickly. A computer-based automatic monitoring system can meet the needs of large-scale and real-time monitoring. Plant classifiers that can work quickly on computers with limited resources are needed to realize this monitoring system. This study proposes a convolutional neural network (CNN) architecture as a plant classifier based on leaf imagery. This architecture was built by parallelizing two concise CNN channels with different filter sizes and fusing them with an addition operation. GoogleNet, SqueezeNet, and MobileNetV2 were used to compare the performance of the proposed architecture. The classification performance of all these architectures was tested using the PlantVillage dataset, which consists of 38 classes and 14 plant types. The experimental results indicated that the proposed architecture, with a smaller number of parameters, achieved nearly the same accuracy as the comparison architectures. In addition, the proposed architecture classified images 5.12 times faster than SqueezeNet, 8.23 times faster than GoogleNet, and 9.40 times faster than MobileNetV2. These findings suggest that, when implemented in the agricultural field, the proposed architecture can be a reliable and faster plant classifier that requires fewer resources.


INTRODUCTION
Monitoring the agricultural environment plays a major role in the early detection of plant diseases, so that further damage can be prevented and disease spread can be controlled more quickly and at a lower cost (Najdenovska et al., 2021). Manual monitoring, however, requires a large amount of manpower, takes a long time, and is ultimately expensive (Ma et al., 2018). Computerized monitoring can significantly reduce these costs while increasing monitoring efficiency and effectiveness (Lajoie-O'Malley et al., 2020).
The ability of a computer system to classify plant species and distinguish between healthy and disease-exposed plants is absolutely necessary for automated monitoring (Knoll et al., 2018). Plant classifiers have been developed using a variety of algorithms, including Naive Bayes, random forest, support vector machine (SVM), K-nearest neighbor (KNN), decision trees, and artificial neural networks (ANNs). Leaf features play an important role as distinguishing descriptors in the methods described above. With the advancement of computing technology, it is relatively simple to extract more than 100 features from a leaf, but it remains difficult to determine which features contribute the most to classification. Therefore, feature engineering (FE) and the presence of an expert are still required for this task (Zhang et al., 2019).
Since its introduction in 2012, the convolutional neural network (CNN) has outperformed other classification algorithms in a variety of fields, with classification accuracy close to 100% (Hassan et al., 2021; Mohanty et al., 2016). Furthermore, CNN can automatically extract important and unique features of each class from image and video data without the need for feature engineering or the presence of an expert to select the most discriminative features (Kamilaris and Prenafeta-Boldú, 2018; Yamashita et al., 2018). These advantages are driving the widespread use of CNN in the automotive, health, business, and other industries, including agriculture (Too et al., 2019).
However, implementing CNN as a classifier in agricultural monitoring systems faces numerous challenges, particularly in developing countries. These difficulties arise because the implementation of CNN necessitates large computing resources and a lengthy training period (Y. Wu et al., 2020), and farmers in developing countries are generally unable to provide adequate computing resources and internet connections to run CNN monitoring systems (Rahman et al., 2020). Therefore, trade-off between classification performance and computing resources is required so that CNN can be used in a low-cost monitoring system in agriculture (Karthik et al., 2020).
This study aimed to develop a concise CNN model by fusing two CNN channels with different filter sizes using an addition operation, as well as to provide a reliable CNN model that is faster than comparison architectures in classifying plants based on leaf images. This study also investigated the performance of the proposed CNN and comparison CNN models on plant classification using datasets with varying number of classes.
This work's main contributions are: (1) the proposed CNN model has fewer parameters and performs classification faster than all of the comparison architectures in this study, and the smaller number of parameters allows for less expensive implementation; (2) the classification accuracy of the proposed CNN model is nearly identical to that of all comparison architectures in this study.

Plant classification
Tomato diseases were identified using the SVM classifier in (Mokhtar et al., 2015). The dataset was classified into two classes, beginning with segmentation of each image. The classification accuracy of the SVM classifier with five different kernels was 92%. The SVM classifier was used again in (Kaur et al., 2018) to classify soybean leaf images from the PlantVillage dataset with 90% accuracy; that SVM was built using a combination of texture and color features. The study in (Chouhan et al., 2018) classified six fungal diseases using a Radial Basis Function Neural Network (RBFNN). Bacterial foraging optimization (BFO) was used to improve the speed and accuracy of the RBFNN, and this method outperformed the K-means (KM) and Genetic Algorithm (GA) algorithms.
The use of CNN in plant classification is dominated by architectures that are known to be reliable, with AlexNet and VGG being the most widely used (Abade et al., 2021). AlexNet and VGG architectures were also employed in (Liu et al., 2021). The reliable performance of these architectures in such studies confirms that a concise CNN architecture can reliably perform the classification task on datasets with relatively few classes.

Batch normalization
A batch normalization layer follows each convolution layer in the CNN architecture used in this work. Sergey Ioffe and Christian Szegedy introduced batch normalization in CNNs in 2015 to lessen internal covariate shift (ICS) during CNN training. Reducing ICS accelerates CNN training toward convergence (Ioffe and Szegedy, 2015).
Several studies cast doubt on the role of batch normalization layers in ICS reduction. The study in (Santurkar et al., 2018) shows that batch normalization does not completely solve the ICS problem, although it still confirms that batch normalization increases the speed at which deep learning training reaches convergence by controlling the mean and variance of the activations.
The study in (Bjorck et al., 2018) also questions the contribution of ICS reduction to the success of batch normalization in expediting deep learning network training. The results of this study provide the evidence that batch normalization produces more reliable gradient updates, enabling deep learning networks to operate at greater learning rates and expediting the training of the network toward convergence.
Ioffe and Szegedy's argument is strengthened by a study from (Awais et al., 2021). A series of experiments in that study shows that ICS reduction is a major factor in improving the convergence of a deep learning network, achieved not only by batch normalization but also by all other methods that contribute to ICS reduction.
The batch-normalized value x̂_i can be determined using Eq. 1:

x̂_i = (x_i − μ_B) / √(σ²_B + ε)   (1)

where μ_B is the mean of the inputs x_i and σ²_B is their variance. The value ε is used to avoid division by zero when σ_B is very small, thereby increasing numerical stability. The final batch normalization output y_i is then calculated by Eq. 2:

y_i = γ·x̂_i + β   (2)

where γ is the scaling factor and β is the shifting factor; these two values are among the parameters learned during CNN training.
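The two steps of Eqs. 1 and 2 can be sketched in a few lines of Python (a minimal illustration of the arithmetic for a batch of scalars, not the Matlab implementation used in this study):

```python
import math

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch (Eq. 1), then scale by gamma and shift by beta (Eq. 2)."""
    n = len(x)
    mu = sum(x) / n                          # mini-batch mean
    var = sum((v - mu) ** 2 for v in x) / n  # mini-batch variance
    x_hat = [(v - mu) / math.sqrt(var + eps) for v in x]  # Eq. 1
    return [gamma * v + beta for v in x_hat]              # Eq. 2

y = batch_norm([1.0, 2.0, 3.0, 4.0])  # normalized values are centered on zero
```

With the default γ = 1 and β = 0, the output has (approximately) zero mean and unit variance; during training, γ and β are learned so the network can undo the normalization where useful.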

Fusion of CNN channel
Various recently proposed CNN architectures use modules consisting of parallel channels of convolution layers. Channels with different filter sizes are intended to improve the ability of the CNN model to handle objects at multiple scales. Concatenation and addition operations can be used to combine two or more CNN channels. Concatenation is a channel-wise operation and is more commonly employed than addition, which is an element-wise operation (Wu and Wang, 2019). While in the addition operation the size of the fused layer is the same as the initial size of the two channels, in concatenation the depth of the fused layer is the sum of the depths of the two channels. Equation 3 shows the concatenation of tensors A and B forming tensor C, while Eq. 4 is the addition of tensors A and B:

C = concat(A, B),  C ∈ R^(h×w×(d_A + d_B))   (3)

C = A + B,  c_(i,j,k) = a_(i,j,k) + b_(i,j,k),  A, B ∈ R^(h×w×d)   (4)
The GoogleNet CNN model uses 9 concatenation operations to combine multiple parallel channels in its inception modules: a first channel containing one convolution layer with a filter size of 1 × 1, a second channel with a filter size of 3 × 3, and a third channel with a filter size of 5 × 5 are combined with the concatenation operation (Szegedy et al., 2015). SqueezeNet uses 8 concatenation operations in its architecture (Iandola et al., 2016). MobileNetV2 uses 10 addition operations, and ShuffleNet uses both operations, with 3 concatenations and 13 additions (Sandler et al., 2018). However, the addition operations in ShuffleNet and MobileNetV2 add the feature maps of a previous convolution block to those of the following block rather than fusing the feature maps of two parallel channels. In this work, the proposed approach combines the two feature maps produced by two parallel CNN channels with the same depth but different filter sizes.
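The difference between the two fusion operations can be illustrated with plain Python lists standing in for feature-map channels (a shape-level sketch only; real implementations operate on 4-D tensors):

```python
def concat_channels(a, b):
    """Channel-wise fusion (Eq. 3): output depth is the sum of the input depths."""
    return a + b  # stack the channel lists

def add_channels(a, b):
    """Element-wise fusion (Eq. 4): channels are summed pairwise, depth unchanged."""
    assert len(a) == len(b), "addition requires identical shapes"
    return [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]

# two 'feature maps', each with 2 channels of 3 values
A = [[1, 2, 3], [4, 5, 6]]
B = [[10, 20, 30], [40, 50, 60]]

C_concat = concat_channels(A, B)  # 4 channels: depth grows
C_add = add_channels(A, B)        # 2 channels: depth preserved
```

This is why addition keeps the fused layer the same size as either input (and so adds no downstream parameters), whereas concatenation doubles the depth that subsequent layers must process.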

Dataset
The dataset used in this study contains leaf images grouped into 38 classes covering 14 plant species, comprising both healthy leaves and leaves exposed to various diseases. All images in this dataset come from the open-access repository of the PlantVillage project (Hughes and Salathe, 2015). The PlantVillage dataset is one of the most important datasets in the field of plant classification (Brahimi, 2018). Each image in the PlantVillage dataset is a single-leaf RGB image of size 256 × 256. The dataset contains 70,846 images in total and was divided into 80% for training and 20% for testing. This dataset was used both for transfer learning on the comparison architectures and for training the proposed CNN model from scratch. The image size was kept at 256 × 256 when training the proposed CNN model. For transfer learning on the comparison architectures, the images were resized to the default input size of each architecture. In both training and testing, no further image preprocessing was applied to the dataset.
The proposed architecture
Figure 1 shows the proposed architecture. The CNN model, called SlimPlantNet, was built by fusing two CNN channels with different filter sizes in some convolution layers. Each channel was a concise CNN consisting of 6 convolution layers. The element-wise addition operation was used to combine the two channels, adding the features obtained from the first channel to those obtained from the second channel. This channel summation was used so that the features extracted by the two channels complement each other across the difference in scale. A 256 × 256 color image was used as the input for both CNN channels.
The first channel consisted of convolution layers with smaller filter sizes than the second channel. The first channel began with a convolution layer of 8 filters, each of size 7 × 7, while the second channel began with a convolution layer of 8 filters, each of size 15 × 15. The difference in filter sizes was intended to capture features at different scales: the first channel was responsible for extracting more detailed, local features, whereas the second channel extracted more global features. The filter sizes of both channels were reduced in subsequent convolution layers to decrease the computational burden and the number of parameters involved. The details of the size of each layer of the two channels are shown in Table 1.
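The effect of filter size on parameter count can be checked with a short calculation using the standard formula for a convolution layer (a generic sketch; the exact layer dimensions are those listed in Table 1):

```python
def conv_params(n_filters, k, in_channels):
    """Learnable parameters of a standard k x k conv layer:
    n_filters * (k*k*in_channels weights) + n_filters biases."""
    return n_filters * (k * k * in_channels + 1)

# first layers of the two channels: 8 filters applied to a 3-channel RGB input
p7 = conv_params(8, 7, 3)    # 7x7 filters  -> 1,184 parameters
p15 = conv_params(8, 15, 3)  # 15x15 filters -> 5,408 parameters
```

The 15 × 15 first layer carries roughly 4.6 times the parameters of the 7 × 7 one, which is why shrinking filters in later layers matters for keeping the model concise.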
The difference in filter size between the two channels extracts features with varying sharpness, resulting in different feature maps. The feature map m extracted by a convolution layer applying filter F to input tensor I is given by Eq. 5:

m(i, j) = Σ_u Σ_v I(i + u, j + v) · F(u, v)   (5)
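Eq. 5 can be sketched directly as a single-channel 2-D convolution in Python (valid padding, stride 1, no bias; a didactic illustration rather than the study's implementation):

```python
def conv2d_valid(I, F):
    """Feature map m (Eq. 5): slide filter F over input I, summing
    the element-wise products at each position."""
    hi, wi = len(I), len(I[0])
    hf, wf = len(F), len(F[0])
    out = []
    for i in range(hi - hf + 1):
        row = []
        for j in range(wi - wf + 1):
            s = sum(I[i + u][j + v] * F[u][v]
                    for u in range(hf) for v in range(wf))
            row.append(s)
        out.append(row)
    return out

# 3x3 input, 2x2 diagonal filter
m = conv2d_valid([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                 [[1, 0], [0, 1]])  # -> [[6, 8], [12, 14]]
```

A larger filter F covers a wider neighborhood of I at each output position, which is the mechanism behind the more global features of the 15 × 15 channel.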
The batch normalization layer, ReLU activation function, and maxPooling layer followed the first to fifth convolution layers in both channels, while no maxPooling layer followed the sixth convolution layer. Stride [2 2] was used in all maxPooling layers and most convolution layers to reduce the size of the feature maps generated by these layers, decrease the number of parameters involved in computation, and speed up the training and classification tasks. Padding and stride [1 1] were used to keep the output sizes of the two channels identical so that the addition operation could be performed at the ends of the two channels.

GoogleNet, SqueezeNet, and MobileNetV2 were used as comparison architectures in this study. Transfer learning was used to replace the classification layer of each of the three architectures with a new layer scaled to the number of classes in the dataset. All layers in the comparison architectures except the classification layer were preserved and frozen, and only the new layer was retrained on the PlantVillage dataset. The proposed architecture, named SlimPlantNet, was trained from scratch. SlimPlantNet training and transfer learning of the comparison architectures were run for 20 epochs using the stochastic gradient descent with momentum (SGDM) optimization function. All comparison architectures used a learning rate of 0.0003, whereas the proposed architecture used a rate of 0.05. The test was performed after each epoch, and the classification time per image was measured sequentially after all training and testing were completed. The tests were conducted on a computer with a Core i5 @ 2.7 GHz processor and 16 GB of RAM in a Matlab 2019 environment.
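The downsizing effect of the stride-2 layers follows the standard output-size relation for convolution and pooling layers, floor((n − f + 2p)/s) + 1. A quick check in Python (the specific padding values here are illustrative assumptions, not taken from Table 1):

```python
def out_size(n, f, s=1, p=0):
    """Spatial output side length of a conv/pool layer:
    floor((n - f + 2p) / s) + 1."""
    return (n - f + 2 * p) // s + 1

# a 256-pixel side under a hypothetical 7x7 conv, stride 2, padding 3 ...
n1 = out_size(256, 7, s=2, p=3)  # -> 128
# ... followed by 2x2 max pooling with stride 2
n2 = out_size(n1, 2, s=2)        # -> 64
```

Each stride-2 layer roughly halves the spatial dimensions, which quarters the per-layer computation and is the main lever behind SlimPlantNet's short classification time.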
Further investigation examined the effects of the two channel-fusion methods, addition and concatenation, on SlimPlantNet. In Table 2, the architecture formed by the addition operation is named SlimPlantNet_addition, and the architecture formed by the concatenation operation is named SlimPlantNet_concat. The performance of each channel of the proposed architecture was also evaluated as a single CNN architecture, created by connecting the channel's sixth ReLU layer to the averagePooling layer to form a complete CNN. SlimPlantNet_filter7 is the single CNN obtained from the channel with 7 × 7 filters, and SlimPlantNet_filter15 is obtained from the channel with 15 × 15 filters.
The performance of all CNN architectures involved in this study was measured using classification accuracy, loss, and the average time required to classify each leaf image. Classification accuracy and loss are the most widely used parameters for testing the performance of a CNN (Maeda-Gutiérrez et al., 2020). Accuracy and loss were measured on the training and test sets after every training epoch, whereas the average classification time was measured during testing. Classification accuracy CA was measured based on Eq. 6:

CA = (TruePos + TrueNeg) / (TruePos + TrueNeg + FalPos + FalNeg)   (6)

where TruePos (true positive) is the number of positive images classified as positive; TrueNeg (true negative) is the number of negative images classified as negative; FalPos denotes the number of negative images classified as positive; and FalNeg denotes the number of positive images classified as negative.

The test results show that the classification accuracy, as well as the loss value, are nearly identical between the proposed CNN model and the comparison CNN architectures. In this study, the accuracy of GoogleNet is almost identical to the GoogleNet accuracy reported in (Maeda-Gutiérrez et al., 2020), which is 99.39% on 10 tomato classes from the PlantVillage dataset.
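Eq. 6 translates directly into code; the counts below are hypothetical values for illustration only:

```python
def classification_accuracy(tp, tn, fp, fn):
    """CA (Eq. 6): correctly classified images over all classified images."""
    return (tp + tn) / (tp + tn + fp + fn)

# hypothetical confusion counts for one class
ca = classification_accuracy(tp=95, tn=880, fp=10, fn=15)  # -> 0.975
```

For a multi-class problem such as PlantVillage, these counts are typically taken per class (one-vs-rest) and then averaged.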

RESULTS AND DISCUSSION
Significant differences can be seen in the training time and the classification time required per image. SlimPlantNet, using either addition or concatenation, takes 0.0043 seconds to classify one image on the test computer used in this study, which is 5.12 times faster than SqueezeNet, 8.23 times faster than GoogleNet, and 9.40 times faster than MobileNetV2. The SlimPlantNet model also has a much shorter training time. Training time, classification time, and the number of parameters involved will naturally be the advantages of the SlimPlantNet model when implemented in a real-time agricultural monitoring system. Because there are fewer parameters, it can be implemented on computing devices with fewer resources and at a lower cost, and the faster the classification time, the faster the monitoring system can run in real time.
This study also carried out further testing of the SlimPlantNet model, comparing fusion via the addition operation and the concatenation operation based on test accuracy and loss. SlimPlantNet's constituent channels were also evaluated as single CNNs. This test was run for 60 epochs with varying numbers of classes, ranging from 5 to 38, to observe trends and performance limits at different data scales. Figure 2 depicts the results of this comparison. The results show that SlimPlantNet_addition classifies the dataset better than SlimPlantNet_concat and its constituent channels: SlimPlantNet_addition's accuracy is always higher, and the trend remains quite stable as the number of classified classes increases. The performance comparison based on the test loss values shown in Figure 3 confirms the accuracy-based results: the SlimPlantNet model with the addition operation outperforms channel merging with the concatenation operation. The SlimPlantNet_addition loss value is clearly always lower than those of the other three models.
The last comparison carried out in this study examined the trends of test accuracy and loss over 60 training epochs for SlimPlantNet, both fused with the addition and concatenation operations, as well as for its individual channels. Training on the PlantVillage dataset with 38 classes was highlighted, and the comparison results are shown in Figure 4. SlimPlantNet with addition-operation fusion appears more accurate and stable over the 60 training epochs, as does its loss value, which remains relatively low throughout the training period.
SlimPlantNet_filter7 and SlimPlantNet_filter15 generally perform worse than SlimPlantNet_addition and SlimPlantNet_concat, but SlimPlantNet_filter7 outperforms SlimPlantNet_concat in classification from 5 to 25 dataset classes. Although the accuracy of SlimPlantNet_filter7 is slightly lower than that of SlimPlantNet_addition, if the number of parameters and classification speed are the most important factors, the SlimPlantNet_filter7 model can be recommended for classification tasks with fewer than 25 classes.

CONCLUSIONS
The results of the tests show that SlimPlantNet, which was formed by fusing two concise CNN channels using the addition operation, achieves reliable and competitive performance. Although classification speed and training time do not differ, the SlimPlantNet model's classification accuracy and loss are better and more stable than those of the SlimPlantNet variant that uses channel fusion with the concatenation operation. SlimPlantNet's classification accuracy on the 38 classes from 14 plant types is nearly identical to the classification accuracy of the comparison architectures.
SlimPlantNet classifies images faster and with fewer parameters than the comparison architectures. The classification speed and smaller number of parameters are the advantages of the SlimPlantNet model when implemented in a real-time monitoring system in the agricultural field with limited computer resources. Therefore, the future work after this study will be to integrate SlimPlantNet into an agricultural monitoring system. The use of the SlimPlantNet model in classification tasks other than plants is also interesting to investigate, particularly when the classification task involves a small number of classes, as in the PlantVillage dataset.