OVERSAMPLING METHOD FOR ENVIRONMENTAL MONITORING DATA AUGMENTATION IN CADMIUM-POLLUTED PADDY FIELD

Peitong Hao∗,† Yue Li∗∗,∗∗∗,† Weiman Xu,∗∗ Cong Li,∗∗ Mingzhu Huo,∗ Xiaoyu Zhang,∗ and Yi An∗

Keywords

Cadmium-contaminated rice, risk prediction, classifier models, data augmentation, oversampling method

∗ Agro-environmental Protection Institute, Ministry of Agriculture, Tianjin, 300071, China; e-mail: {809495047, 884796615, 1747162724}@qq.com, simon8601@126.com
∗∗ College of Computer Science, Nankai University, Tianjin, 300350, China; e-mail: liyue80@nankai.edu.cn, {xuweiman, 2120180519}@mail.nankai.edu.cn
∗∗∗ Key Laboratory for Medical Data Analysis and Statistical Research of Tianji

Abstract

Because the cadmium over-standard rate (OSR) of Chinese rice does not exceed 10%, accurately identifying cadmium-polluted rice in low-OSR regions is more valuable for Cd pollution remediation than doing so in high-OSR regions. However, the large gap between the number of positive samples (rice samples exceeding the cadmium limit) and negative samples (rice samples not exceeding the limit) causes classifier prediction models to collapse in low-OSR regions. We applied oversampling methods to improve the feasibility of the classifier models. The improvement was evaluated by the F1 value, which is the harmonic mean of precision and recall. The results show that oversampling can raise collapsed classifier models (F1 close or equal to 0) to a usable level (F1 greater than 0.3). In addition, the stability of six classifier models enhanced by Adaptive Synthetic (ADASYN) sampling and the Synthetic Minority Oversampling Technique (SMOTE) improved significantly in the low-OSR region: the coefficient of variation decreased from 156.29% to 12.45% and 7.23%, respectively. An analysis of the correlation coefficients among the parameters in three regions indicates that high correlation among the index parameters affects the performance of the classifiers. This method is also suitable and valuable for risk identification of other pollutants, especially because polluted regions are usually in the minority.
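The quantities named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the function names (`f1_score`, `smote_like`, `coefficient_of_variation`), the two-feature sample values, and the choice of the sample standard deviation for the coefficient of variation are all assumptions made for illustration. `smote_like` only imitates the core idea of SMOTE (interpolating between a minority sample and one of its nearest minority neighbours) and omits ADASYN's density-based weighting.

```python
import random
import statistics


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: the model has "collapsed"
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by linear interpolation.

    Each new point lies on the segment between a randomly chosen minority
    sample and one of its k nearest minority neighbours (Euclidean distance).
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def coefficient_of_variation(values):
    """Coefficient of variation in percent (assumes the sample standard deviation)."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


# A low-OSR data set: 1 positive out of 20 samples (5%).
y_true = [1] + [0] * 19
# A collapsed classifier predicts "negative" for everything.
y_pred = [0] * 20
collapsed_f1 = f1_score(y_true, y_pred)  # 0.0 although accuracy is 95%

# Hypothetical minority samples with two made-up features per sample.
minority = [(0.20, 5.1), (0.30, 5.4), (0.25, 4.9), (0.40, 5.0)]
synthetic = smote_like(minority, n_new=6)
```

The collapsed case shows why accuracy is not used as the evaluation metric here: predicting every sample as negative is 95% accurate on this toy data yet identifies no polluted rice, so its F1 is 0. Oversampling adds synthetic positives to the training set only; F1 should still be computed on real, unaugmented samples.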
