Analyzing Breast Cancer Data with the kNN Method

Contents

1. Data source
2. Understanding the data
3. Data processing

Data source

Data: taken from the UCI repository, a classic machine learning dataset
http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data

Data description: for an explanation of the variables and other notes, see
http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names

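Understanding the data

According to the data description, the dataset has 569 samples and 32 columns. The target to predict is the second column, diagnosis: B = benign, M = malignant, that is, whether the tumor is benign or malignant.

The first column is the sample ID. Columns 3 through 32 are three different measurements of each of ten features, namely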

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)

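Columns 3-12 are the mean of these ten features, columns 13-22 are their standard error, and columns 23-32 are the "worst" value (the mean of the three largest values).

The best previous prediction, quoted from the data description: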

best predictive accuracy obtained using one separating plane
in the 3-D space of Worst Area, Worst Smoothness and
Mean Texture.  Estimated accuracy 97.5% using repeated
10-fold crossvalidations.  Classifier has correctly
diagnosed 176 consecutive new patients as of November
1995.

For this dataset, we will implement the kNN algorithm with the class package and with the caret package in turn.
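For orientation, the idea behind kNN can be sketched in a few lines of base R: measure the distance from a new point to every training point, keep the k closest, and return the majority label. The function name `knn_predict_one` and the toy data below are illustrative assumptions and are not taken from either package.

```r
# Minimal kNN classifier for a single query point (illustrative sketch).
# train_x: numeric matrix, one row per training sample
# train_y: factor of class labels, one per row of train_x
# query:   numeric vector with one value per column of train_x
knn_predict_one <- function(train_x, train_y, query, k = 5) {
  # Euclidean distance from the query to every training row
  diffs <- sweep(train_x, 2, query)
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k nearest training rows
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote (ties go to the first factor level)
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

# Hypothetical toy data: two well-separated clusters
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_y <- factor(c("B", "B", "B", "M", "M", "M"))

pred_near_origin <- knn_predict_one(train_x, train_y, c(0.5, 0.5), k = 3)
pred_far <- knn_predict_one(train_x, train_y, c(5.5, 5.5), k = 3)
print(c(pred_near_origin, pred_far))
```

The packages used below handle ties and multiple query points more carefully; this sketch only shows the distance-then-vote logic.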

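Data processing

Reading the data

First, read the data from the web and check its dimensions, which match the data description.

# stringsAsFactors = TRUE keeps diagnosis a factor (not the default since R 4.0)
wdbc <- read.table("http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data", sep = ",", stringsAsFactors = TRUE)
dim(wdbc)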

## [1] 569 32

Data preprocessing

We need to name the variables.

# The ten base features, in the order given in the data description
wdbc.names <- c("Radius","Texture","Perimeter","Area","Smoothness","Compactness","Concavity","Concave points","Symmetry","Fractal dimension")
# Each feature appears three times: mean, standard error, and worst value
wdbc.names <- c(paste0(wdbc.names, "_mean"), paste0(wdbc.names, "_se"), paste0(wdbc.names, "_worst"))
names(wdbc) <- c("id", "diagnosis", wdbc.names)

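Although all 30 features could be used to predict the diagnosis, we only check the best case mentioned in the data description, which uses three features: Worst Area, Worst Smoothness, and Mean Texture. We extract the relevant columns.

wdbc.short <- wdbc[c("diagnosis","Area_worst","Smoothness_worst","Texture_mean")]
summary(wdbc.short)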

##  diagnosis   Area_worst     Smoothness_worst   Texture_mean  
##  B:357     Min.   : 185.2   Min.   :0.07117   Min.   : 9.71  
##  M:212     1st Qu.: 515.3   1st Qu.:0.11660   1st Qu.:16.17  
##            Median : 686.5   Median :0.13130   Median :18.84  
##            Mean   : 880.6   Mean   :0.13237   Mean   :19.29  
##            3rd Qu.:1084.0   3rd Qu.:0.14600   3rd Qu.:21.80  
##            Max.   :4254.0   Max.   :0.22260   Max.   :39.28

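The ranges of these three features differ by several orders of magnitude, and kNN relies on distances, so without rescaling Area_worst would dominate. We therefore standardize them with the scale function.

wdbc.short[2:4] <- scale(wdbc.short[2:4])
summary(wdbc.short)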

##  diagnosis   Area_worst      Smoothness_worst   Texture_mean    
##  B:357     Min.   :-1.2213   Min.   :-2.6803   Min.   :-2.2273  
##  M:212     1st Qu.:-0.6416   1st Qu.:-0.6906   1st Qu.:-0.7253  
##            Median :-0.3409   Median :-0.0468   Median :-0.1045  
##            Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##            3rd Qu.: 0.3573   3rd Qu.: 0.5970   3rd Qu.: 0.5837  
##            Max.   : 5.9250   Max.   : 3.9519   Max.   : 4.6478
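By default, scale subtracts each column's mean and divides by its sample standard deviation. The following sketch checks that against a manual computation; the data frame `toy` and its values are made up for illustration.

```r
# Hypothetical toy data with columns on very different scales
toy <- data.frame(area = c(200, 500, 900, 1500),
                  smoothness = c(0.08, 0.12, 0.13, 0.20))

# Manual z-score: (x - mean) / sd, column by column
manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() returns a matrix; compare it element-wise with the manual result
max_diff <- max(abs(as.matrix(manual) - scale(toy)))
print(max_diff)

# After scaling, every column has mean 0 and standard deviation 1
scaled <- as.data.frame(scale(toy))
col_means <- colMeans(scaled)
col_sds <- sapply(scaled, sd)
print(round(col_means, 10))
print(col_sds)
```

This is why the Mean row in the summary above shows 0.0000 for all three features.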

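kNN prediction with the caret package

Next, we first use the caret package to make predictions on this dataset.

library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

First, split the dataset into a training set and a test set, each with half of the samples.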

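# Stratified split on diagnosis; the default p = 0.5 puts half of the rows in training
index <- createDataPartition(wdbc.short$diagnosis)
# Fit kNN with k = 5 on the training rows only
caret.fit <- knn3(diagnosis ~ ., data = wdbc.short, subset = index[[1]], k = 5)
# Predict the class of the held-out rows
predict.caret <- predict(caret.fit, newdata = wdbc.short[-index[[1]], 2:4], type = "class")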
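createDataPartition samples within each class, so both halves keep roughly the original B/M ratio. The same idea can be sketched in base R; the helper `stratified_split` is hypothetical and is not caret's implementation, and the label vector only mimics the class counts of wdbc.

```r
# Return training row indices, sampled separately within each class
stratified_split <- function(y, p = 0.5) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- lapply(idx_by_class, function(idx) {
    # Index via sample.int() to avoid sample()'s special case for length-1 input
    idx[sample.int(length(idx), size = ceiling(length(idx) * p))]
  })
  sort(unlist(train_idx, use.names = FALSE))
}

# Hypothetical labels with the same class counts as wdbc (357 B, 212 M)
y <- factor(rep(c("B", "M"), times = c(357, 212)))

set.seed(1)
train_idx <- stratified_split(y, p = 0.5)

print(table(y[train_idx]))   # class counts in the training half
print(table(y[-train_idx]))  # class counts in the test half
```

With these class counts the two halves are not exactly equal in size: rounding up within each class gives 285 training rows and 284 test rows.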

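Let's look at how accurately caret predicts the breast cancer test set.

# Share of correct predictions among the test rows; mean() divides by the test-set size
mean(wdbc.short[-index[[1]], 1] == predict.caret)

## [1] 0.943662

The accuracy is quite good.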
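A single accuracy figure hides which kind of error was made, and for a diagnosis a missed malignant case matters more than a false alarm. A cross-tabulation of predicted against true labels shows both. The two vectors below are made-up stand-ins for predict.caret and the test labels, not results from the data.

```r
# Hypothetical true labels and predictions for ten test samples
truth <- factor(c("B", "B", "B", "B", "B", "B", "M", "M", "M", "M"))
predicted <- factor(c("B", "B", "B", "B", "B", "M", "M", "M", "M", "B"),
                    levels = levels(truth))

# Rows: predicted class, columns: true class
confusion <- table(predicted = predicted, truth = truth)
print(confusion)

# Accuracy is the share of samples on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
# Sensitivity: share of truly malignant cases predicted as malignant
sensitivity <- confusion["M", "M"] / sum(confusion[, "M"])
print(c(accuracy = accuracy, sensitivity = sensitivity))
```

caret also provides a confusionMatrix function that reports these and related statistics.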

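10-fold cross-validation

The data description reports an accuracy of 97.5% with repeated 10-fold cross-validation. We use the cvFolds function from the cvTools package to generate the indices for 10-fold cross-validation. Folds are drawn separately for each class and mapped to row positions by offset, which requires the rows to be sorted by class, so we sort the dataset with sort_df from the reshape package.

library(cvTools)

## Loading required package: robustbase

library(reshape)
wdbc.short <- sort_df(wdbc.short, "diagnosis")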

Next, split the benign and the malignant samples into ten groups each. cvFolds returns an object of its own class, so we extract the parts we need and combine them with rbind and cbind into the index matrix index.matrix.

n.benign <- sum(wdbc.short$diagnosis == "B")
set.seed(2014)
Benign.folds <- cvFolds(n = n.benign, K = 10, type = "random")
set.seed(2014)
Malignant.folds <- cvFolds(n = sum(wdbc.short$diagnosis == "M"), K = 10, type = "random")
# Column 1: fold number; column 2: row index (malignant rows are offset by n.benign)
index.matrix <- rbind(
  cbind(Benign.folds$which, as.vector(Benign.folds$subsets)),
  cbind(Malignant.folds$which, as.vector(Malignant.folds$subsets) + n.benign))
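The same per-class fold assignment can also be sketched without cvTools, by shuffling the fold numbers 1..K within each class. The helper `stratified_folds` is hypothetical; it is an alternative to the code above, not a reproduction of cvFolds, and it does not need the rows to be sorted.

```r
# Assign each sample a fold number 1..K, balanced within each class
stratified_folds <- function(y, K = 10) {
  folds <- integer(length(y))
  for (idx in split(seq_along(y), y)) {
    # Repeat 1..K up to the class size, then shuffle the labels
    labels <- rep(seq_len(K), length.out = length(idx))
    folds[idx] <- labels[sample.int(length(idx))]
  }
  folds
}

# Hypothetical labels with the same class counts as wdbc (357 B, 212 M)
y <- factor(rep(c("B", "M"), times = c(357, 212)))
set.seed(2014)
folds <- stratified_folds(y, K = 10)

# Fold sizes per class; within a class they differ by at most one
fold_sizes <- table(folds, y)
print(fold_sizes)
```

Rows belonging to fold i are then simply `which(folds == i)`.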

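With 569 samples in total, the data cannot be split evenly into 10 groups, so we use a list, index.folds, to hold the row indices of each group. For each group, we fit the model on the other 9 groups and validate on that group; this is cross-validation. The vector prediction.right stores the number of correct predictions in each group.

index.folds <- list()
prediction.right <- numeric(10)
# Collect the row indices that belong to each fold
for(i in 1:10){
  index.folds[[i]] <- index.matrix[index.matrix[,1] == i,2]
}
for(i in 1:10){
  # Train on the other nine folds (k defaults to 5), predict the held-out fold
  current.model <- knn3(diagnosis ~ ., data = wdbc.short, subset = -index.folds[[i]])
  prediction <- predict(current.model, newdata = wdbc.short[index.folds[[i]],2:4], type = "class")
  # Count the correct predictions in this fold
  prediction.right[i] <- sum(prediction == wdbc.short[index.folds[[i]],"diagnosis"])
}
# Overall accuracy across all folds
sum(prediction.right)/nrow(wdbc.short)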

## [1] 0.9507909

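In the end, 10-fold cross-validation gives an accuracy of only 0.9507909, below the reported 97.5%. This may be because we did not repeat the cross-validation, and because the reported figure comes from a separating-plane classifier rather than kNN.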
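Repeating the cross-validation means rerunning the whole 10-fold procedure with fresh random folds and averaging the results. A minimal sketch of that follows, under several assumptions: it uses knn from the class package instead of knn3, the built-in iris data as a stand-in for wdbc.short, plain (not stratified) folds, and a hypothetical helper `cv_accuracy`.

```r
library(class)  # provides knn(); a recommended package normally installed with R

# One pass of K-fold cross-validation; returns the overall accuracy
cv_accuracy <- function(x, y, K = 10, k = 5) {
  # Random fold labels (not stratified, to keep the sketch short)
  folds <- sample(rep(seq_len(K), length.out = nrow(x)))
  correct <- 0
  for (i in seq_len(K)) {
    test <- folds == i
    pred <- knn(train = x[!test, , drop = FALSE],
                test = x[test, , drop = FALSE],
                cl = y[!test], k = k)
    correct <- correct + sum(pred == y[test])
  }
  correct / nrow(x)
}

# Stand-in data: the built-in iris set, standardized as in the article
x <- scale(iris[, 1:4])
y <- iris$Species

set.seed(2014)
# Repeat the whole 10-fold procedure five times with new random folds
repeats <- replicate(5, cv_accuracy(x, y, K = 10, k = 5))
print(repeats)
print(mean(repeats))
```

The spread of `repeats` shows how much a single 10-fold estimate depends on the fold assignment. Within caret, trainControl(method = "repeatedcv") together with train is another way to set this up.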