資料探勘Chapter5資料前置處理
5.1 資料集載入
A. 安裝三個軟體套件(lattice,MASS,nnet)
install.packages("lattice")
install.packages("MASS")
install.packages("nnet")
library(lattice)
library(MASS)
library(nnet)
B. 載入mice軟體套件。選擇使用nhanes2資料集示範,該資料集是個含有遺漏值的小規模資料集
install.packages("mice")
library(mice)
data("nhanes2")
nrow(nhanes2);ncol(nhanes2)
[1] 25
[1] 4
summary(nhanes2)
age bmi hyp chl
20-39:12 Min. :20.40 no :13 Min. :113.0
40-59: 7 1st Qu.:22.65 yes : 4 1st Qu.:185.0
60-99: 6 Median :26.75 NA's: 8 Median :187.0
Mean :26.56 Mean :191.4
3rd Qu.:28.93 3rd Qu.:212.0
Max. :35.30 Max. :284.0
NA's :9 NA's :10
NA:
1.觀測職缺失
2.存在計算錯誤或計算值不符合要求
C. 為了將要使用的資料集有直觀的把握,將前六筆資料展示
head(nhanes2)
age bmi hyp chl
1 20-39 NA <NA> NA
2 40-59 22.7 no 187
3 20-39 NA no 187
4 60-99 NA <NA> NA
5 20-39 20.4 no 113
6 60-99 NA <NA> 184
5.2 資料清理
通常使用統計圖表來探索規律
畫圖:對pay中66個資料進行探索性資料分析(長條圖、點圖、箱型圖、Q-Q圖)
pay=c(11,19,14,28,13,81,12,43,11,16,31,16,23,42,22,26,17,22,13,27,180,16,43,82,14,11,51,76,28,66,29,14,14,65,37,16,37,35,39,27,14,17,13,38,28,40,85,32,25,26,16,12,54,40,18,27,16,14,33,29,77,50,19,34)
par(mfrow=c(2,2))
hist(pay)
dotchart(pay)
boxplot(pay,horizontal = T)
qnorm(pay);qqline(pay)
5.2.1遺漏值處理
A. 用函數is.na計算遺漏值的數量
sum(is.na(nhanes2)) #計算nhanes2中遺漏值的數量
[1] 27
B. 用函數complete.cases計算nhanes2中完整樣本的數量
sum(complete.cases(nhanes2)) #計算nhanes2中完整樣本的數量
[1] 13
C. 資料缺失的情況下,需要進一步對資料缺失情況進行觀測,判斷缺失資料是否隨機。可以利用mice套件中的md.pattern函數觀測遺漏值情況。
md.pattern(nhanes2)
age hyp bmi chl
13 1 1 1 1 0
1 1 1 0 1 1
3 1 1 1 0 1
1 1 0 0 1 2
7 1 0 0 0 3
0 8 9 10 27
D. mice套件中的函數實現多重差補法
基本形式 mice(data,m=5,...)
E. 用mice()函數建構模型
imp=mice(nhanes2,m=4) #產生4組完整的資料庫並指定給imp
iter imp variable
1 1 bmi hyp chl
1 2 bmi hyp chl
1 3 bmi hyp chl
1 4 bmi hyp chl
2 1 bmi hyp chl
2 2 bmi hyp chl
2 3 bmi hyp chl
2 4 bmi hyp chl
3 1 bmi hyp chl
3 2 bmi hyp chl
3 3 bmi hyp chl
3 4 bmi hyp chl
4 1 bmi hyp chl
4 2 bmi hyp chl
4 3 bmi hyp chl
4 4 bmi hyp chl
5 1 bmi hyp chl
5 2 bmi hyp chl
5 3 bmi hyp chl
5 4 bmi hyp chl
fit=with(imp,lm(chl~age+hyp+bmi))
pooled=pool(fit)
summary(pooled)
est se t df Pr(>|t|) lo 95
(Intercept) 26.813103 71.663294 0.3741539 8.613855 0.71733321 -136.4151163
age2 51.887524 32.830729 1.5804560 3.904344 0.19088242 -40.1527272
age3 68.792827 29.656688 2.3196396 8.090160 0.04860053 0.5368177
hyp2 -15.849396 23.327260 -0.6794366 11.760671 0.51001290 -66.7900807
bmi 5.103029 2.584499 1.9744746 6.971265 0.08907247 -1.0134517
hi 95 nmis fmi lambda
(Intercept) 190.04132 NA 0.4406377 0.3242718
age2 143.92777 NA 0.7151216 0.5989477
age3 137.04884 NA 0.4638688 0.3459102
hyp2 35.09129 NA 0.3208435 0.2143983
bmi 11.21951 9 0.5181282 0.3972260
nmis:缺失資料個數
fmi:由缺失資料貢獻的變異
F. 插補法
使用隨機差補
sub=which(is.na(nhanes2[,4])==TRUE)
dataTR=nhanes2[-sub,]
dataTE=nhanes2[sub,]
dataTE[,4]=sample(dataTR[,4],length(dataTE[,4]),replace = T)
dataTE
age bmi hyp chl
1 20-39 NA <NA> 184
4 60-99 NA <NA> 187
10 40-59 NA <NA> 113
11 20-39 NA <NA> 238
12 40-59 NA <NA> 131
15 20-39 29.6 no 238
16 20-39 NA <NA> 204
20 60-99 25.5 yes 206
21 20-39 NA <NA> 199
24 60-99 24.9 no 204
使用平均值法
sub=which(is.na(nhanes2[,4])==TRUE)
dataTR=nhanes2[-sub,]
dataTE=nhanes2[sub,]
dataTE[,4]=mean(dataTR[,4])
dataTE
age bmi hyp chl
1 20-39 NA <NA> 191.4
4 60-99 NA <NA> 191.4
10 40-59 NA <NA> 191.4
11 20-39 NA <NA> 191.4
12 40-59 NA <NA> 191.4
15 20-39 29.6 no 191.4
16 20-39 NA <NA> 191.4
20 60-99 25.5 yes 191.4
21 20-39 NA <NA> 191.4
24 60-99 24.9 no 191.4
透過建立回歸模型預測出因變數的值缺失變數進行插補
sub=which(is.na(nhanes2[,4])==TRUE)
dataTR=nhanes2[-sub,]
dataTE=nhanes2[sub,]
dataTE
age bmi hyp chl
1 20-39 NA <NA> NA
4 60-99 NA <NA> NA
10 40-59 NA <NA> NA
11 20-39 NA <NA> NA
12 40-59 NA <NA> NA
15 20-39 29.6 no NA
16 20-39 NA <NA> NA
20 60-99 25.5 yes NA
21 20-39 NA <NA> NA
24 60-99 24.9 no NA
lm=lm(chl~age,data=dataTR)
nhanes2[sub,4]=round(predict(lm,dataTE))
head(nhanes2)
age bmi hyp chl
1 20-39 NA <NA> 169
2 40-59 22.7 no 187
3 20-39 NA no 187
4 60-99 NA <NA> 225
5 20-39 20.4 no 113
6 60-99 NA <NA> 184
熱平台插補
在非遺漏值資料集中找到一個與遺漏值所在樣本(比對樣本),利用其中的觀測值對遺漏值進行插補。
accept=nhanes2[which(apply(is.na(nhanes2),1,sum)!=0),]
donate=nhanes2[which(apply(is.na(nhanes2),1,sum)==0),]
accept[1,]
age bmi hyp chl
1 20-39 NA <NA> 169
donate[1,]
age bmi hyp chl
2 40-59 22.7 no 187
對於accept中的第2個樣本
accept[2,]
age bmi hyp chl
3 20-39 NA no 187
sa=donate[which(donate[,1]==accept[2,1]&donate[,3]==accept[2,3]&donate[,4]==accept[2,4]),]
sa
age bmi hyp chl
8 20-39 30.1 no 187
accept[2,2]=sa[1,2]
accept[2,]
age bmi hyp chl
3 20-39 30.1 no 187
冷平台插補法
當變數數量很多時,通常很難找到與需要插補樣本完全相同的樣本,此時可以按照某些變數將資料分層,在曾中隊遺漏值使用平均值插補
level1=nhanes2[which(nhanes2[,3]=="yes"),]
level1
age bmi hyp chl
14 40-59 28.7 yes 204
17 60-99 27.2 yes 284
18 40-59 26.3 yes 199
20 60-99 25.5 yes 225
level1[4,4]=mean(level1[1:3,4])
level1
age bmi hyp chl
14 40-59 28.7 yes 204
17 60-99 27.2 yes 284
18 40-59 26.3 yes 199
20 60-99 25.5 yes 229
.
5.2.2雜訊資料處理
先安裝outliers軟體
install.packages("outliers")
A. 尋找異常值
函數的主要形式outlier(x,opposite=FALSE,logical=FALSE)
B.
library(outliers)
set.seed(1);s1=.Random.seed
y=rnorm(100)
outlier(y)
[1] -2.2147
outlier(y,opposite = TRUE)
[1] 2.401618
dotchart(y)
C.
dim(y) <- c(20,5)
outlier(y)
[1] -2.214700 -1.989352 1.980400 2.401618 -1.523567
outlier(y,opposite = TRUE)
[1] 1.595281 1.358680 -1.129363 -1.804959 1.586833
set.seed(1);s1=.Random.seed
y=rnorm(10)
outlier(y,logical = TRUE)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
plot(y)
D.
先建立亂數
install.packages(.Random.seed)
寬箱平均值光滑方法為例
set.seed(1); s1=.Random.seed
x=rnorm(12)
x=sort(x)
dim(x)=c(3,4)
x[1,]=apply(x,1,mean)[1]
x[2,]=apply(x,1,mean)[2]
x[3,]=apply(x,1,mean)[3]
x
[,1] [,2] [,3] [,4]
[1,] -0.003212265 -0.003212265 -0.003212265 -0.003212265
[2,] 0.340596290 0.340596290 0.340596290 0.340596290
[3,] 0.468529029 0.468529029 0.468529029 0.468529029
5.2.3資料不一致處理
A. vapply函數
基本形式 vapply(X, FUN.VALUE,...,USE.NAMES = TRUE)
B. 下面程式中的rt.value變數設定傳回值長度和類型,如果FUN函數獲得的結果和rt.value設定的不一樣會出錯:
x<- list(a=1:10,beta=exp(-3:3),logic=c(TRUE,FALSE,FALSE,TRUE))
x
$a
[1] 1 2 3 4 5 6 7 8 9 10
$beta
[1] 0.04978707 0.13533528 0.36787944 1.00000000 2.71828183 7.38905610
[7] 20.08553692
$logic
[1] TRUE FALSE FALSE TRUE
probs <- c(1:3/4)
rt.value <- c(0,0,0) #設定傳回值為3個數字
vapply(x, quantile,FUN.VALUE = rt.value,probs=probs)
a beta logic
25% 3.25 0.2516074 0.0
50% 5.50 1.0000000 0.5
75% 7.75 5.0536690 1.0
C. 將probs <- c(1:3/4)改成probs <- c(1:4/4),會導致傳回值與要求格式不一致,進而提示錯誤。
probs <- c(1:4/4) #設定四個分為點 > vapply(x, quantile,FUN.VALUE = rt.value,probs=probs) Error in vapply(x, quantile, FUN.VALUE = rt.value, probs = probs) : values must be length 3, but FUN(X[[1]]) result is length 4 |
D. 將要求值長度改成4
rt.value <- c(0,0,0,0) #設定傳回值為4個數字
vapply(x, quantile,FUN.VALUE = rt.value,probs=probs)
a beta logic
25% 3.25 0.2516074 0.0
50% 5.50 1.0000000 0.5
75% 7.75 5.0536690 1.0
100% 10.00 20.0855369 1.0
rt.value <- c(0,0,0,"") #設定傳回值為3個數字和1個字串
vapply(x, quantile,FUN.VALUE = rt.value,probs=probs)
Error in vapply(x, quantile, FUN.VALUE = rt.value, probs = probs) :
values must be type 'character',
but FUN(X[[1]]) result is type 'double'
5.3 資料整合
A. 對矩陣x進行卡方檢定,檢查兩列是否相關。
x=cbind(sample(c(1:50),10),sample(c(1:50),10))
chisq.test(x)
Pearson's Chi-squared test
data: x
X-squared = 80.226, df = 9, p-value = 1.458e-13
B. 相關係數和協方差的R實現如下:
x=cbind(rnorm(10),rnorm(10))
cor(x) #求兩列資料的相關係數
[,1] [,2]
[1,] 1.0000000 -0.3839065
[2,] -0.3839065 1.0000000
cov(x) #求兩列資料的協方差
[,1] [,2]
[1,] 0.9187874 -0.2380315
[2,] -0.2380315 0.4184109
C.檢測觀測值是否存在重複
x=cbind(sample(c(1:10),10,replace = T),rnorm(10),rnorm(10))
> head(x)
[,1] [,2] [,3]
[1,] 5 -0.1324555 -3.0008140
[2,] 7 -0.3057390 -0.8859022
[3,] 1 0.4773073 -0.7555930
[4,] 3 0.2899630 1.6394230
[5,] 6 -0.4389847 -0.2607102
[6,] 8 -0.7283198 -0.5318854
y=unique(x[,1])
sub=rep(0,length(y))
for(i in 1:length(y))
+ sub[i]=which(x[,1]==y[i])[1]
x=x[sub,]
head(x)
[,1] [,2] [,3]
[1,] 5 -0.1324555 -3.0008140
[2,] 7 -0.3057390 -0.8859022
[3,] 1 0.4773073 -0.7555930
[4,] 3 0.2899630 1.6394230
[5,] 6 -0.4389847 -0.2607102
[6,] 8 -0.7283198 -0.5318854
5.4 資料轉換
A.標準化
set.seed(1);s1=.Random.seed
a=rnorm(5)
b=scale(a)
b
[,1]
[1,] -0.78636077
[2,] 0.05657773
[3,] -1.00401553
[4,] 1.52544305
[5,] 0.20835553
attr(,"scaled:center") #原資料平均值
[1] 0.1292699
attr(,"scaled:scale") #原資料標準差
[1] 0.9610394
B.離散化
a=c(0.7063422,0.7533599,0.6675749,0.6100253,0.9341495,0.6069284,0.3462011)
n=length(a)
la=rep(0,n)
la[which(a>0.5)]=1
[1] 1 1 1 1 1 1 0
C. 資料泛化可視為資料合併
city=c(6,7,6,2,2,6,2,1,5,7,2,1,1,6,1,3,8,8,1,1)
province=rep(0,20)
province[which(city>4)]=1
province
[1] 1 1 1 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0
5.5 資料精簡
安裝glmnet
install.packages("glmnet")
library(glmnet)
install.packages("Matrix")
install.packages("foreach")
library(foreach)
library(Matrix)
- AIC準則是赤池資訊準則的簡稱,通常用來評價模型的複雜度和擬合效果
計算公式 AIC=-2ln(L)+2k
- 使用glmnet()函數實現對不同分布資料進行LASSO變數選擇
x=matrix(rnorm(100*20),100,20)
y=rnorm(100)
fit1=glmnet(x,y)
b=coef(fit1,s=0.01)
b
21 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -0.0886071507
V1 0.0189077134
V2 -0.2255922175
V3 0.2290716761
V4 -0.0337900690
V5 0.0290566569
V6 0.1017043781
V7 -0.1249554544
V8 0.0024999142
V9 0.1183096398
V10 0.2108410505
V11 .
V12 .
V13 0.1189089764
V14 -0.0004527807
V15 -0.1126944051
V16 0.0444677393
V17 0.1477912729
V18 -0.0480913783
V19 0.0052171596
V20 -0.0637364512
predict(fit1,newx = x[1:10,],s=c(0.01,0.005))
1 2
[1,] 0.62248794 0.65001683
[2,] -0.18558974 -0.18122650
[3,] -0.95325106 -0.99672620
[4,] 0.25398562 0.25783962
[5,] -0.30157582 -0.31235142
[6,] 0.33770641 0.35028559
[7,] -0.10093579 -0.10975594
[8,] 0.47216618 0.48768391
[9,] -0.65937676 -0.70485304
[10,] 0.04617174 0.02387368
-------------------------------------------------------------------------
參考書為「利用R語言 打通大數據的經脈(第二版)」
本文為看書練習的成果~