In real-world applications, training a classifier on an unbalanced dataset is a major problem, as it degrades the performance of machine learning algorithms. Unbalanced datasets can be classified effectively with a Support Vector Machine (SVM), which uses the kernel technique to find a decision boundary. High dimensionality and an uneven distribution of the data have a significant impact on this decision boundary. Feature selection (FS) addresses the high dimensionality of the data by selecting the most prominent features. It is usually applied as a pre-processing step in both soft computing and machine learning tasks. FS is employed in different applications for a variety of purposes: to overcome the curse of dimensionality, to speed up the construction of the classification model, to help unravel and interpret the innate structure of data sets, to streamline data collection when the measurement cost of attributes is considered, and to remove irrelevant and redundant features, thus improving classification performance. Hence, in this paper, two different FS approaches are proposed, namely Fuzzy Rough set based FS and Fuzzy Soft set based FS. After FS, the reduced dataset is given to the proposed Iterative Fuzzy Support Vector Machine (IFSVM) for classification, which considers two different membership functions. Experiments have been carried out on four different data sets, namely Thyroid, Breast Cancer, Thoracic Surgery, and Heart Disease. The results show that the classification accuracy is better for Fuzzy Rough set based FS than for the other approach.

 

Keywords: Support Vector Machine, Fuzzy logic, Rough
Sets, Soft Sets, Feature selection.


 

1. Introduction:

 

     SVM is one of the most well-known supervised machine learning algorithms for classification and prediction, developed by Cortes and Vapnik [1] in the 1990s as a result of collaboration between the statistical and machine learning research communities. SVM classifies cases by finding a separating boundary called a hyperplane. The main advantage of the SVM is that it can, with relative ease, overcome the 'high dimensionality problem', i.e., the problem that arises when there is a large number of input variables relative to the number of available observations [2]. Also, because the SVM approach is data-driven and does not require a theoretical framework, it can have considerable discriminative power for classification, especially in cases where sample sizes are small and a large number of features (variables) are involved (i.e., a high-dimensional space). This technique has recently been used to improve methods for detecting diseases in clinical settings [3, 4]. Moreover, SVM has demonstrated high performance in solving classification problems in bioinformatics [5, 6].
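
To make the idea concrete, the following is a minimal sketch of training a kernel SVM with scikit-learn on a small imbalanced synthetic dataset. It is not part of the proposed method; the data, kernel and hyperparameters are purely illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Imbalanced toy data (90%/10%) standing in for a real clinical dataset.
X, y = make_classification(n_samples=200, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# RBF kernel: the kernel technique used to find a non-linear decision boundary.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_tr, y_tr)                      # learns the separating hyperplane
print("test accuracy:", clf.score(X_te, y_te))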

      In many practical engineering applications, the obtained training data is often contaminated by noise. Furthermore, some points in the training data set are misplaced far away from the main body, or even lie on the wrong side in feature space. One of the main drawbacks of the standard SVM is that its training process is sensitive to outliers or noise in the training dataset due to overfitting. When outliers or noise exist, as in many real-world classification problems, a training data point may not exactly belong to either of the two classes. A data point near the decision boundary may belong to one of the classes, or it may be a noisy point. Yet these kinds of uncertain points may be more important than others for making a decision, which leads to the problem of overfitting. Fuzzy approaches are effective in solving such uncertain problems, since they reduce the sensitivity to less important data [7]. This approach assigns a fuzzy membership value as a weight to each training data point and uses this weight to control the importance of the corresponding data point. Many fuzzy approaches have been developed and proposed in the literature to reduce the effect of outliers. A similarity measure function to compute fuzzy memberships was introduced in [8]; however, it had to assume that outliers are somewhat separate from the normal data. In [9], the effect of the trade-off parameter C on the conventional two-class SVM model was analysed, and a triangular membership function was introduced to assign higher grades to data points in regions containing data of both classes; however, that method can only be applied under certain assumptions. The above two problems are addressed by the Fuzzy SVM.
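
As a rough illustration of this weighting idea (an assumption on our part, not the exact formulation of [7]), per-point fuzzy memberships can be passed to a standard SVM implementation as sample weights, so that points suspected to be noise or outliers influence the solution less:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                                 # toy data, illustrative only
y = (X[:, 0] + 0.3 * rng.randn(100) > 0).astype(int)

memberships = np.ones(len(y))                         # 1.0 = fully trusted point
memberships[:5] = 0.1                                 # pretend the first 5 are outliers

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y, sample_weight=memberships)              # low-weight points matter less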

     The method proposed in [10] is based on the supposition that outliers in the training vector set are less trustworthy, and hence of less significance than the other training vectors. Since outliers are detected based solely on their relative distance from their class mean, this method can be expected to produce good results if the distribution of training vectors xi of each class is spherical about its mean (in the space used to calculate the memberships). In general, however, this assumption may not hold, which motivates us to seek a more universally applicable method. Hence, computing fuzzy memberships is still a challenge. This problem is addressed by the proposed IFSVM.
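
The class-centre rule described above can be sketched as follows (our reading of the approach in [10], not the authors' code): each point's membership decreases with its distance from its own class mean, so far-away points are treated as probable outliers.

import numpy as np

def class_center_memberships(X, y, delta=1e-6):
    # Membership shrinks linearly with the distance to the point's own class mean,
    # so the farthest point of each class receives a membership close to delta.
    s = np.empty(len(y), dtype=float)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        d = np.linalg.norm(X[idx] - X[idx].mean(axis=0), axis=1)
        s[idx] = 1.0 - d / (d.max() + delta)
    return s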

     Generally, fuzzy-approach-based machine learning techniques face two main difficulties: how to set the fuzzy memberships and how to reduce the computational complexity. It has been found that the performance of a fuzzy SVM depends strongly on the determination of the fuzzy memberships. Therefore, in this paper we propose a new method with two ways of computing fuzzy memberships: one that calculates membership values only for the misclassified points, and one that calculates membership values for all training data points. For the misclassified points, an iterative method is employed in which membership values are generated iteratively based on the positions of the training vectors relative to the SVM decision surface itself. For all training data points, a fuzzy-clustering-based technique is adopted: a clustering method is applied to the data to determine the clusters in mixed regions, whose points are assigned a fuzzy membership value of 1, while the fuzzy memberships of the other data points are determined by their closest cluster accordingly.
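
A minimal sketch of the iterative part of this idea is given below, assuming scikit-learn's SVC as the underlying classifier. It is an illustrative approximation of the IFSVM described above, not the exact algorithm used in our experiments: memberships of misclassified points are lowered according to their distance from the current decision surface, and the SVM is retrained with the updated weights.

import numpy as np
from sklearn.svm import SVC

def iterative_fuzzy_svm(X, y, n_iter=5):
    w = np.ones(len(y))                          # start with full membership everywhere
    clf = SVC(kernel="rbf")
    for _ in range(n_iter):
        clf.fit(X, y, sample_weight=w)
        f = clf.decision_function(X)             # signed distance to the current surface
        y_pm = np.where(y == clf.classes_[1], 1.0, -1.0)
        wrong = y_pm * f < 0                     # misclassified under the current model
        # Down-weight misclassified points, the more so the farther they sit
        # on the wrong side of the decision surface (illustrative update rule).
        w[wrong] = 1.0 / (1.0 + np.abs(f[wrong]))
    return clf, w

The fuzzy-clustering-based memberships computed for all training points can be supplied through the same sample_weight mechanism, either in place of or combined with the iteratively updated weights.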