Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Semi-random partitioning of data into training and test sets in granular computing context

Liu, Han and Cocea, Mihaela 2017. Semi-random partitioning of data into training and test sets in granular computing context. Granular Computing 2 (4), pp. 357-386. 10.1007/s41066-017-0049-2

Abstract

Due to the vast and rapid increase in the size of data, machine learning has become an increasingly popular approach to knowledge discovery and predictive modelling. Both purposes require a data set to be partitioned into a training set and a test set: the training set is used to learn a model, and the test set is then used to evaluate the performance of that model. However, the split of the data into the two sets, and its influence on model performance, has only been investigated with respect to the optimal proportion of the two sets, with no attention paid to the characteristics of the data within the training and test sets. Thus, the current practice is to randomly split the data into approximately 70% for training and 30% for testing. In this paper, we show that this way of partitioning the data leads to two major issues: (a) class imbalance and (b) sample representativeness issues. Class imbalance is known to affect the performance of many classifiers by introducing a bias towards the majority class. The representativeness of the training set affects a model's performance because an unrepresentative training set deprives the algorithm of relevant examples to learn from, similar to testing a student on material that was not taught. To solve these two issues, we propose a semi-random data partitioning framework in the setting of granular computing. While we discuss how the framework can address both issues, in this paper we focus on avoiding class imbalance when partitioning the data. The results show that avoiding class imbalance results in better model performance.
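
The idea of avoiding class imbalance during partitioning can be illustrated with a minimal sketch. It assumes that "semi-random" can be read as splitting each class separately at random, so that the training and test sets keep the class proportions of the full data set. This is an illustrative reading of the abstract, not the framework from the paper itself; the function name `class_wise_split` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def class_wise_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices into train/test sets one class at a time.

    Hypothetical helper: each class is shuffled and cut independently,
    so both sets keep roughly the original class proportions.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        # Random within the class, deterministic across classes.
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# Toy data set (assumed for illustration): two classes of ten samples each.
labels = ["a"] * 10 + ["b"] * 10
train_idx, test_idx = class_wise_split(labels)
train_counts = {c: sum(labels[i] == c for i in train_idx) for c in set(labels)}
```

With the toy labels above, each class contributes seven samples to the training set and three to the test set, whereas a purely random 70/30 split gives no such guarantee for any single draw.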

Item Type: Article
Date Type: Publication
Status: Published
Schools: Computer Science & Informatics
Publisher: Springer
ISSN: 2364-4966
Date of First Compliant Deposit: 5 September 2017
Date of Acceptance: 27 July 2017
Last Modified: 11 Jan 2018 10:19
URI: http://orca-mwe.cf.ac.uk/id/eprint/104334
