DataBaser.Net: 변수간의상관정도에대한분석

Contents

1 상관분석(Correlation Analysis)
2 예제데이터
3 공분산(Covariance)
4 피어슨 상관계수
5 스피어만 상관계수
6 켄달 상관계수
7 data frame에서 상관계수 다루기
8 시각화1
9 시각화2
10 sql로 상관계수 구하기
11 로버스트 상관 분석
12 편상관분석
13 참고자료

1 상관분석(Correlation Analysis) #

두 확률 변수의 상관 정도를 분석 (원인이 아님)
가계소득이 높아지면 자녀의 성적도 높아지는가?
공분산 값 자체의 의미보다는 부호가 중요한 의미를 가짐 (공분산은 방향성만 파악)
피어슨 상관계수는 방향 및 정도에 대한 2가 의미 (두 변수가 모두 정규분포이어야..)
스피어만, 켄달 상관계수는 분포를 모를때 사용 (켄달 상관계수가 모집단의 추정치이므로 더 선호된다고 함)
상관분석
- 산점도(scatter plot)
- 상관계수(correlation coefficient)

2 예제데이터 #

허리둘레(girth)와 몸무게(weigth)

tmp <- textConnection( 
"girth  weight
35.00  3.45
32.00	3.20
30.00	3.00
31.50	3.20
32.70	3.30
30.00	3.20
36.00	3.85
30.50	3.15
34.70	3.65
30.50	3.40
33.00	3.50
35.00	4.00
31.80	3.10
38.00	4.20
33.00	3.45") 
x <- read.table(tmp, header=TRUE) 
close.connection(tmp)
plot(x)

산점도

3 공분산(Covariance) #

> cov(x)
           girth    weight
girth  5.7183810 0.7461667
weight 0.7461667 0.1210238

공분산이 양수이므로 양의 상관관계를 가짐

4 피어슨 상관계수 #

> cor.test(girth, weight, data=x, method="pearson") #default: pearson

	Pearson's product-moment correlation

data:  girth and weight 
t = 7.3142, df = 13, p-value = 5.881e-06
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval:
 0.7116680 0.9655589 
sample estimates:
     cor 
0.896941

피어슨 상관계수가 0.896941로 강한 양의 상관관계가 있다고 볼 수 있음
해석(http://ko.wikipedia.org/wiki/%EC%83%81%EA%B4%80%EB%B6%84%EC%84%9D#.ED.94.BC.EC.96.B4.EC.8A.A8_.EC.83.81.EA.B4.80.EA.B3.84.EC.88.98_.28Pearson_correlation_coefficient.29)
- 1.0 ~ 0.7: 강
- 0.7 ~ 0.3: 중
- 0.3 ~ 0.1: 약
- 0.1 ~ 0.0: 무시

5 스피어만 상관계수 #

> cor.test(girth, weight, data=x, method = "spearman")

	Spearman's rank correlation rho

data:  girth and weight 
S = 70.0628, p-value = 1.963e-05
alternative hypothesis: true rho is not equal to 0 
sample estimates:
      rho 
0.8748878 

Warning message:
In cor.test.default(girth, weight, data = x, method = "spearman") :
  Cannot compute exact p-values with ties

순위를 가지고 하는 것이므로 같은 값이 있어 warning이 보임.
상관계수가 0.8748878로 강한 양의 상관관계

참고:

6 켄달 상관계수 #

> cor.test(girth, weight, data=x, method = "kendall")

	Kendall's rank correlation tau

data:  girth and weight 
z = 3.7508, p-value = 0.0001762
alternative hypothesis: true tau is not equal to 0 
sample estimates:
      tau 
0.7425743 

Warning message:
In cor.test.default(girth, weight, data = x, method = "kendall") :
  ties 때문에 정확한 p값을 계산할 수가 없습니다

상관계수가 0.7425743로 역시 강한 상관관계 임.

7 data frame에서 상관계수 다루기 #

install.packages("Hmisc")
library("Hmisc")
rcorr(as.matrix(x), type="pearson")
rcorr(as.matrix(x), type="spearman")

8 시각화1 #

#일반적인 산점도
pairs(x1)

install.packages("GGally")
library("GGally")

#정보가 많은 산점도
ggpairs(x1)

9 시각화2 #

출처: R Graphics Cookbook

panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...) {
    usr <- par("usr")
    on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y, use="complete.obs"))
    txt <- format(c(r, 0.123456789), digits=digits)[1]
    txt <- paste(prefix, txt, sep="")
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex =  cex.cor * (1 + r) / 2)
}
    
panel.hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  opar=par(ps=30) #font-size
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks
  nB <- length(breaks)
  y <- h$counts
  y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col="white", ...)
}

panel.lm <- function (x, y, col = par("col"), bg = NA, pch = par("pch"),
                            cex = 1, col.smooth = "black", ...) {
    points(x, y, pch = pch, col = col, bg = bg, cex = cex)
    abline(stats::lm(y ~ x),  col = col.smooth, ...)

    #로버스한거
    #abline(MASS::rlm(y ~ x),  col = col.smooth, ...)
}

pairs(c2009[,2:5], pch=".",
                   upper.panel = panel.cor,
                   diag.panel  = panel.hist,
                   lower.panel = panel.lm)

10 sql로 상관계수 구하기 #

select 
    (count(*) * sum(x * y) - sum(x) * sum(y)) / 
    (sqrt(count(*) * sum(x * x) - sum(x) * sum(x)) * 
    sqrt(count(*) * sum(y * y) - sum(y) * sum(y))) r
from (select x,y from data) t

11 로버스트 상관 분석 #

library("robust")
covRob(stackloss) #Robust Estimate of Covariance
covRob(stackloss, corr = TRUE) #Robust Estimate of Correlation

결과

> covRob(stackloss, estim = "mcd")
Call:
covRob(data = stackloss, estim = "mcd")

Robust Estimate of Covariance: 
           Air.Flow Water.Temp Acid.Conc. stack.loss
Air.Flow      77.32      24.02      54.91      67.30
Water.Temp    24.02      18.28      18.40      25.16
Acid.Conc.    54.91      18.40      99.03      48.99
stack.loss    67.30      25.16      48.99      60.93

Robust Estimate of Location: 
  Air.Flow Water.Temp Acid.Conc. stack.loss 
     56.15      20.23      85.38      13.15 
> covRob(stackloss, corr = TRUE)
Call:
covRob(data = stackloss, corr = TRUE)

Robust Estimate of Correlation: 
           Air.Flow Water.Temp Acid.Conc. stack.loss
Air.Flow     1.0000     0.6677     0.6174     0.9514
Water.Temp   0.6677     1.0000     0.4960     0.7868
Acid.Conc.   0.6174     0.4960     1.0000     0.5389
stack.loss   0.9514     0.7868     0.5389     1.0000

Robust Estimate of Location: 
  Air.Flow Water.Temp Acid.Conc. stack.loss 
     56.92      20.43      86.29      13.73

12 편상관분석 #

두 변수간 상관관계에 영향을 줄 수 있는 제 3의 변수가 존재할 수도 있으므로 이를 통제하고 상관분석 수행

library(ggm)
height <- c(160, 155, 170, 160, 180, 177)
weight <- c(55, 50, 56, 70,80,73)
gender <- c(0,0,0,1,1,1)
mydata <- data.frame(height, weight, gender)
pcor(c("height", "weight", "gender"), var(mydata))
# partial corr between [height] and [weight] controlling for [gender]
cor(height, weight)

결과

> pcor(c("height", "weight", "gender"), var(mydata))
[1] 0.8267963
> cor(height, weight)
[1] 0.7598541
>

--http://wolfpack.hnu.ac.kr/2015_Fall/MDA/MDA_correlation.pdf

ds=read.csv("SMSA_USA.csv") #데이터 읽기
attach(ds) #데이터 사용 선언; names(ds) #변수이름 리스트
library(car) #행렬 산점도 패키지 설치
scatterplot.matrix(~d_index+edu+non_white+wc_ratio+income)
library(Hmisc) #상관계수 행렬 계산 패키지 설치
rcorr(as.matrix(ds[6:13]),type=“spearman")
#부분결정계수
summary(lm(d_index~income)) #사망률(Y)=소득(X) 회귀분석
fit=lm(d_index~edu) # 회귀분석 Y=Z(통제변수, 교육수준)
fit2=lm(income~edu) # 회귀분석 X=Z
plot(fit$residuals, fit2$residuals) #산점도 (X, Y) : Z 통제
summary(lm(fit$residuals~fit2$residuals)) #통제 후 Y=X 회귀분석
cor.test(fit$residuals,fit2$residuals) #부분상관계수

13 참고자료 #

정준상관분석 --> 독립변수군과 종속변수군의 상관관계

오 좋네요 -- sion 2018-09-12 20:55:04