{"@context":{"@vocab":"https://cir.nii.ac.jp/schema/1.0/","rdfs":"http://www.w3.org/2000/01/rdf-schema#","dc":"http://purl.org/dc/elements/1.1/","dcterms":"http://purl.org/dc/terms/","foaf":"http://xmlns.com/foaf/0.1/","prism":"http://prismstandard.org/namespaces/basic/2.0/","cinii":"http://ci.nii.ac.jp/ns/1.0/","datacite":"https://schema.datacite.org/meta/kernel-4/","ndl":"http://ndl.go.jp/dcndl/terms/","jpcoar":"https://github.com/JPCOAR/schema/blob/master/2.0/"},"@id":"https://cir.nii.ac.jp/crid/1361699995403605376.json","@type":"Article","productIdentifier":[{"identifier":{"@type":"DOI","@value":"10.1145/1007730.1007735"}},{"identifier":{"@type":"URI","@value":"https://dl.acm.org/doi/10.1145/1007730.1007735"}},{"identifier":{"@type":"URI","@value":"https://dl.acm.org/doi/pdf/10.1145/1007730.1007735"}}],"dc:title":[{"@value":"A study of the behavior of several methods for balancing machine learning training data"}],"description":[{"type":"abstract","notation":[{"@value":"<jats:p>There are several aspects that might influence the performance achieved by existing learning systems. It has been reported that one of these aspects is related to class imbalance in which examples in training data belonging to one class heavily outnumber the examples in the other class. In this situation, which is found in real world data describing an infrequent but important event, the learning system may have difficulties to learn the concept related to the minority class. In this work we perform a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem in thirteen UCI data sets. Our experiments provide evidence that class imbalance does not systematically hinder the performance of learning systems. In fact, the problem seems to be related to learning with too few minority class examples in the presence of other complicating factors, such as class overlapping. 
Two of our proposed methods deal with these conditions directly, allying a known over-sampling method with data cleaning methods in order to produce better-defined class clusters. Our comparative experiments show that, in general, over-sampling methods provide more accurate results than under-sampling methods considering the area under the ROC curve (AUC). This result seems to contradict results previously published in the literature. Two of our proposed methods, Smote + Tomek and Smote + ENN, presented very good results for data sets with a small number of positive examples. Moreover, Random over-sampling, a very simple over-sampling method, is very competitive with more complex over-sampling methods. Since the over-sampling methods provided very good performance results, we also measured the syntactic complexity of the decision trees induced from over-sampled data. Our results show that these trees are usually more complex than the ones induced from original data. Random over-sampling usually produced the smallest increase in the mean number of induced rules and Smote + ENN the smallest increase in the mean number of conditions per rule, when compared among the investigated over-sampling methods.</jats:p>"}]}],"creator":[{"@id":"https://cir.nii.ac.jp/crid/1381699995403605377","@type":"Researcher","foaf:name":[{"@value":"Gustavo E. A. P. A. Batista"}],"jpcoar:affiliationName":[{"@value":"Instituto de Ciências Matemáticas e de Computação, São Carlos - SP, Brazil"}]},{"@id":"https://cir.nii.ac.jp/crid/1381699995403605378","@type":"Researcher","foaf:name":[{"@value":"Ronaldo C. 
Prati"}],"jpcoar:affiliationName":[{"@value":"Instituto de Ciências Matemáticas e de Computação, São Carlos - SP, Brazil"}]},{"@id":"https://cir.nii.ac.jp/crid/1381699995403605376","@type":"Researcher","foaf:name":[{"@value":"Maria Carolina Monard"}],"jpcoar:affiliationName":[{"@value":"Instituto de Ciências Matemáticas e de Computação, São Carlos - SP, Brazil"}]}],"publication":{"publicationIdentifier":[{"@type":"PISSN","@value":"19310145"},{"@type":"EISSN","@value":"19310153"}],"prism:publicationName":[{"@value":"ACM SIGKDD Explorations Newsletter"}],"dc:publisher":[{"@value":"Association for Computing Machinery (ACM)"}],"prism:publicationDate":"2004-06","prism:volume":"6","prism:number":"1","prism:startingPage":"20","prism:endingPage":"29"},"reviewed":"false","dc:rights":["https://www.acm.org/publications/policies/copyright_policy#Background"],"url":[{"@id":"https://dl.acm.org/doi/10.1145/1007730.1007735"},{"@id":"https://dl.acm.org/doi/pdf/10.1145/1007730.1007735"}],"createdAt":"2007-01-17","modifiedAt":"2025-06-18","relatedProduct":[{"@id":"https://cir.nii.ac.jp/crid/1360002218095802496","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning"}]},{"@id":"https://cir.nii.ac.jp/crid/1360004235180726272","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"State-of-the-Art Deep Learning: Evolving Machine Intelligence Toward Tomorrow’s Intelligent Network Traffic Control Systems"}]},{"@id":"https://cir.nii.ac.jp/crid/1360005519232363520","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"Applying machine learning to market analysis: Knowing your luxury 
consumer"}]},{"@id":"https://cir.nii.ac.jp/crid/1360025430175576192","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"Handling missing values and imbalanced classes in machine learning to predict consumer preference: Demonstrations and comparisons to prominent methods"}]},{"@id":"https://cir.nii.ac.jp/crid/1360285705289122048","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"Quasi‐linear SVM classifier with segmented local offsets for imbalanced data classification"}]},{"@id":"https://cir.nii.ac.jp/crid/1360285711754320512","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"Classifying protein-protein interaction articles using word and syntactic features"}]},{"@id":"https://cir.nii.ac.jp/crid/1360290617411217920","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"Adopting deep learning methods for airborne RGB fluvial scene classification"}]},{"@id":"https://cir.nii.ac.jp/crid/1360298336571568640","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"Relationship Between Personality Patterns and Harmfulness"}]},{"@id":"https://cir.nii.ac.jp/crid/1360298755618788992","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"An Interpretable Machine Learning Approach to Predict Fall Risk Among Community-Dwelling Older Adults: a Three-Year Longitudinal Study"}]},{"@id":"https://cir.nii.ac.jp/crid/1360298757165626240","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"Event reconstruction of Compton telescopes using a multi-task neural 
network"}]},{"@id":"https://cir.nii.ac.jp/crid/1360302864782371200","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"Heterogeneous Defect Prediction Based on Federated Prototype Learning"}]},{"@id":"https://cir.nii.ac.jp/crid/1360576118709210752","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"Fast, exact, and parallel-friendly outlier detection algorithms with proximity graph in metric spaces"}]},{"@id":"https://cir.nii.ac.jp/crid/1360576118721151488","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"Table-structure recognition method using neural networks for implicit ruled line estimation and cell estimation"}]},{"@id":"https://cir.nii.ac.jp/crid/1360576118730629248","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"A Resampling Method for Imbalanced Datasets Considering Noise and Overlap"}]},{"@id":"https://cir.nii.ac.jp/crid/1360576118810663040","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"Predicting Final User Satisfaction Using Momentary UX Data and Machine Learning Techniques"}]},{"@id":"https://cir.nii.ac.jp/crid/1360580232176778368","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"Using Random Ordering in User Experience Testing to Predict Final User Satisfaction"}]},{"@id":"https://cir.nii.ac.jp/crid/1360584339759548544","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"Evaluating Behavior Recognition Pipeline of Laying Hens Using Wearable Inertial 
Sensors"}]},{"@id":"https://cir.nii.ac.jp/crid/1360584340717174144","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"An End-to-End Table Structure Analysis Method Using Graph Attention Networks"}]},{"@id":"https://cir.nii.ac.jp/crid/1360848660863996544","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"In Silico Prediction of Major Drug Clearance Pathways by Support Vector Machines with Feature-Selected Descriptors"}]},{"@id":"https://cir.nii.ac.jp/crid/1360848661287964032","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"A Framework of Rebalancing Imbalanced Healthcare Data for Rare Events’ Classification: A Case of Look-Alike Sound-Alike Mix-Up Incident Detection"}]},{"@id":"https://cir.nii.ac.jp/crid/1360849945089597056","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"Achieving Accurate Ubiquitous Sleep Sensing with Consumer Wearable Activity Wristbands Using Multi-class Imbalanced Classification"}]},{"@id":"https://cir.nii.ac.jp/crid/1361412895298986112","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@value":"ContractWard: Automated Vulnerability Detection Models for Ethereum Smart Contracts"}]},{"@id":"https://cir.nii.ac.jp/crid/1390008445625882368","@type":"Article","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@language":"en","@value":"An Effective Feature Extraction Mechanism for Intrusion Detection System"}]},{"@id":"https://cir.nii.ac.jp/crid/1390288082109950464","@type":"Article","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@language":"en","@value":"Class Imbalanced Fault Diagnosis via Combining K-Means Clustering Algorithm with Generative Adversarial 
Networks"}]},{"@id":"https://cir.nii.ac.jp/crid/1390573242789886336","@type":"Article","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@language":"en","@value":"A Data Augmentation Method for Cow Behavior Estimation Systems Using 3-Axis Acceleration Data and Neural Network Technology"}]},{"@id":"https://cir.nii.ac.jp/crid/1390867344224956416","@type":"Article","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@language":"en","@value":"A Boosting Method Based on Center-of-Gravity Oversampling and Pruning for Classifying Imbalanced Data"}]},{"@id":"https://cir.nii.ac.jp/crid/1390870064890417152","@type":"Article","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@language":"ja","@value":"変動する状況下でリアルタイムかつプロアクティブに看護師のワークロードを予測するモデルの構築"},{"@language":"en","@value":"Construction of a Model to Estimate Nurse Workload in Real-Time and Proactively under Fluctuating Conditions"},{"@language":"ja-Kana","@value":"ヘンドウ スル ジョウキョウ カ デ リアルタイム カツ プロアクティブ ニ カンゴシ ノ ワークロード オ ヨソク スル モデル ノ コウチク"}]},{"@id":"https://cir.nii.ac.jp/crid/1391975276373919232","@type":"Article","resourceType":"学術雑誌論文(journal article)","relationType":["isReferencedBy"],"jpcoar:relatedTitle":[{"@language":"en","@value":"A Machine Learning Method for Automatic Copyright Notice Identification of Source 
Files"}]}],"dataSourceIdentifier":[{"@type":"CROSSREF","@value":"10.1145/1007730.1007735"},{"@type":"CROSSREF","@value":"10.1080/23270012.2019.1692254_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1016/j.eswa.2023.121694_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1002/tee.22808_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1186/1471-2105-12-s8-s9_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1016/j.rse.2020.112107_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1109/tkde.2012.232_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.4018/ijitwe.298654_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1109/dasc/picom/cbdcom/cyberscitech.2019.00143_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1007/s11606-022-07394-8_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1016/j.nima.2022.166897_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1109/access.2023.3313001_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1587/transinf.2024edp7147_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.20965/jaciii.2021.p0346_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1007/s00778-022-00729-1_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1145/3469096.3469870_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1016/j.procs.2020.08.043_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.3390/jtaer16070171_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1109/tnse.2020.2968505_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1587/transfun.2021smp0003_references_DO
I_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.3390/informatics9040085_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.3390/s23115077_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1007/978-981-99-8088-8_20_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1124/dmd.114.057893_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1155/2018/6275435_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1587/transinf.2020edl8089_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1587/transinf.2021ngp0007_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.1109/comst.2017.2707140_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"},{"@type":"CROSSREF","@value":"10.5100/jje.61.355_references_DOI_I1cVyd4ZvmuO82NMcR6DHqnWNGi"}]}