#title 평균의 차이에 대한 검정
[[TableOfContents]]

==== SE, Standard Error ====
몽둥의 길이를 5회 측정했다. 
{{{
x <- c(76.2, 76.3, 76.1, 76,3, 76.4)
se <- sd(x)/sqrt(length(x)) #0.6009252
se #1.643168
}}} 


학생 2,000명의 수학 모의고사 성적이 있다. 
{{{
set.seed(1000)
x <- rnorm(2000, mean = 70, sd = 10)
summary(x)
}}} 
결과
{{{
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  36.38   63.33   69.89   69.94   76.54   99.19 
}}}

표준오차는..
{{{
mu <- c()
for(i in 1:100){
    mu <- c(mean(sample(x, 5)), mu)
}

se <- sd(mu) / sqrt(5)
mean(mu) #69.82971
se #1.935552
}}}


==== 평균을 비교하는 검정 방법 ====
||정규분포 여부||검정법||
||Y||t 검정, 분산분석||
||N||Wilcoxon 검정, Kurskal-Wallis 검정||

 * Two-sample t-test <-> Wilcoxonrank sum test
 * Paired t-test <-> Wilcoxonsigned rank test

==== t 검정 개요 ====
 * 평균을 비교하여 해당 집단의 대표값으로의 역할을 하고 있다는 것을 의미함.
 * 이상치(outlier)가 있는 정규분포가 아닌 데이터는 t 검정 하면 안됨.

여기서 필요한 기술
 * 정규분포여부 확인 하는 방법
 * 이상치 제거 방법

==== One Sample t-test ====
2학년 1반의 2011년 하루 평균 게임시간 2.1시간 이었다. 2012년에 10명을 무작위로 선발하여 게임 시간을 조사하였다. 2011년과 2012년이 다른가?
{{{
x <- c(3.3, 2.8, 3.0, 2.7, 2.7, 2.0, 1.9, 3.4, 1.4, 1.4)
t.test(x, mu=2.1)
}}}

정규성 검정
{{{
> shapiro.test(x)

	Shapiro-Wilk normality test

data:  x 
W = 0.9085, p-value = 0.2711
}}}
 * 가설
  * 귀무가설: 정규분포와 차이가 없다.
  * 대립가설: 정규분포와 차이가 있다. 
 * 유의수준 0.05라고 했을 때, p-value가 0.2711이므로 귀무가설 기각하지 못함. 즉, 정규분포다.

{{{
> t.test(x, mu=2.1)

	One Sample t-test

data:  x 
t = 1.5454, df = 9, p-value = 0.1567
alternative hypothesis: true mean is not equal to 2.1 
95 percent confidence interval:
 1.933026 2.986974 
sample estimates:
mean of x 
     2.46 
}}}
 * 가설
  * 귀무가설: 평균에 차이가 없다.
  * 대립가설: 평균에 차이가 있다.
 * 유의수준 0.05라고 했을 때, p-value가 0.1567이므로 귀무가설 기각하지 못함.

아래와 같이 구간추정도 한다.
{{{
> t.test(x)

	One Sample t-test

data:  x 
t = 10.6, df = 9, p-value = 2.269e-06
alternative hypothesis: true mean is not equal to 0 
95 percent confidence interval:
 1.93 2.99 
sample estimates:
mean of x 
     2.46 
}}}
==== Two Sample t-test ====
 * 두 집단의 평균 비교
 * 가정
  * 두 집단이 정규분포
  * 두 집단이 분산이 같음

예제
{{{
x1 <- c(15,10,13,7,9,8,21,9,14,8)
x2 <- c(15,14,12,8,14,7,16,10,15,12)
}}}

정규성 검정 --> x1, x2가 유의수준 0.05에서 정규분포임.
{{{
> shapiro.test(x1)

	Shapiro-Wilk normality test

data:  x1 
W = 0.8666, p-value = 0.09131

> shapiro.test(x2)

	Shapiro-Wilk normality test

data:  x2 
W = 0.9125, p-value = 0.2986
}}}

분산이 동일한가?
{{{
> var.test(x1, x2)

	F test to compare two variances

data:  x1 and x2 
F = 1.9791, num df = 9, denom df = 9, p-value = 0.3237
alternative hypothesis: true ratio of variances is not equal to 1 
95 percent confidence interval:
 0.491579 7.967821 
sample estimates:
ratio of variances 
          1.979094 
}}}
 * 가설
  * 귀무가설: x1과 x2는 분산의 차이가 없다.
  * 대립가설: x1과 x2는 분산의 차이가 있다.
 * 유의수준 0.05라고 했을 때, p-value가 0.3237이므로 귀무가설 기각하지 못함. 즉, 분산의 차이가 없다. 

두 집단이 평균이 동일한지 검정
{{{
> t.test(x1, x2, var.equal=T)

	Two Sample t-test

data:  x1 and x2 
t = -0.5331, df = 18, p-value = 0.6005
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -4.446765  2.646765 
sample estimates:
mean of x mean of y 
     11.4      12.3 
}}}
 * 가설
  * 귀무가설: x1과 x2는 평균의 차이가 없다.
  * 대립가설: x1과 x2는 평균의 차이가 있다.
 * 유의수준 0.05라고 했을 때, p-value가 0.6005이므로 귀무가설 기각하지 못함. 즉, 두 집단간 평균의 차이가 없다. 

단측 검정일 경우 (왼쪽 즉, x1이 작다는 검정)
{{{
> t.test(x1, x2, alternative="less", var.equal=T)

	Two Sample t-test

data:  x1 and x2
t = -0.5331, df = 18, p-value = 0.3002
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 2.027436
sample estimates:
mean of x mean of y 
     11.4      12.3 
}}}
==== Paired t-test ====
 * Paired -> 11월에 30명을 추출하여 몸무게를 측정했고, 12월에 11월에 추출된 30명의 몸무게를 다시 측정하여 평균차이가 있는지 검정 할 때..
 * 나머지는 Two Sample t-test와 동일
 * Two Sample t-test 예제에서의 데이터가 Paired 데이터라면..

{{{
> t.test(x1, x2, var.equal=T, paired=T)

	Paired t-test

data:  x1 and x2 
t = -0.9612, df = 9, p-value = 0.3616
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -3.018069  1.218069 
sample estimates:
mean of the differences 
                   -0.9 
}}}
 * 가설
  * 귀무가설: x1과 x2는 평균의 차이가 없다.
  * 대립가설: x1과 x2는 평균의 차이가 있다.
 * 유의수준 0.05라고 했을 때, p-value가 0.3616이므로 귀무가설 기각하지 못함. 즉, 두 집단간 평균의 차이가 없다.

==== 분산분석 ====
 * 3개 이상의 집단에 대한 평균의 차이 검정(two-sample test의 확장)
 * 분산분석
  * 일원분산분석(one-way ANOVA) - 1개의 그룹변수
  * 이원분산분석(two-way ANOVA) - 2개의 그룹변수
  * 공분산분석(ANCOVA; Analysis of Covariance) - 일원분산분석에 연속형 변수 공변량(covariate)추가
 * 기본적인 가정
  * 집단간 독립
  * 정규분포를 따라야 함
  * 동일한 분산
 * 기본적인 가정에 맞지 않는 경우
  * 분산분석 대신 Kurskal-Wallis Test


==== 일원분산분석 ====
{{{
#http://code.google.com/p/sonya/source/browse/trunk/r-project/sample/PlantGrowth.csv
plantGrowth = read.csv("c:\\data\\PlantGrowth.csv")
head(plantGrowth)
boxplot(weight ~ group, data=plantGrowth)
out <- lm(weight ~ group, data=plantGrowth)
summary(out)
anova(out)
}}}

{{{
> summary(out)

Call:
lm(formula = weight ~ group, data = plantGrowth)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4180 -0.0060  0.2627  1.3690 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.0320     0.1971  25.527   <2e-16 ***
grouptrt1    -0.3710     0.2788  -1.331   0.1944    
grouptrt2     0.4940     0.2788   1.772   0.0877 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared: 0.2641,	Adjusted R-squared: 0.2096 
F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591 

> anova(out)
Analysis of Variance Table

Response: weight
          Df  Sum Sq Mean Sq F value  Pr(>F)  
group      2  3.7663  1.8832  4.8461 0.01591 *
Residuals 27 10.4921  0.3886                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
}}}
 * 가설
  * 귀무가설: 3그룹의 평균이 같다.
  * 대립가설: 3그룹의 평균이 다르다.
 * 유의수준 0.05라고 했을 때, p-value가 0.01591이므로 귀무가설 기각. 즉, 3그룹이 평균의 차이가 있다.
 * 그러나, 회귀진단을 해봐야 적합한지 알 수 있음.

{{{
par(mfrow=c(2,2))
plot(out)
}}}
attachment:평균의차이에대한검정/1.png

정규성 -> p-value = 0.4379이므로 정규분포다.
{{{
> shapiro.test(resid(out))

	Shapiro-Wilk normality test

data:  resid(out) 
W = 0.9661, p-value = 0.4379
}}}

등분산성 -> p-value = 0.1714로 귀무가설 지지. 즉, 등분산
{{{
#library("lmtest")
> bptest(out)

	studentized Breusch-Pagan test

data:  out 
BP = 3.5273, df = 2, p-value = 0.1714
}}}

독립성
{{{
> dwtest(out) #library("lmtest")

	Durbin-Watson test

data:  out 
DW = 2.704, p-value = 0.9502
alternative hypothesis: true autocorrelation is greater than 0 
}}}

 * d통계량
  * 0 : 양의 자기상관, p-value = 1
  * 2 : 독립, p-value = 0
  * 4 : 음의 자기상관, p-value = -1
 * 대략적으로 DW값이 1보다 작거나 3보다 크면 자기상관이 확실히 있다고 판달할 수 있으며, 1.5~2.5 사이에 있을 경우 독립이라고 판단
 * 대략 독립.

정규성, 등분산성, 독립성, Q-Q Plot의 직선성을 확인하여 모형이 적합한 것을 확인했다. 위에서 3개의 그룹에 대한 평균 차이 검정 결과 차이가 있음을 알았다. 어떤 그룹끼리의 차이가 발생했는지 보자. 

방법1: Dunnett -> 평균의 차이가 없는 조합을 보여준다. (control 대비 비교법)
{{{
install.packages("multcomp")
library("multcomp")
out <- lm(weight ~ group, data=PlantGrowth)
dunnett <- glht(out, linfct=mcp(group="Dunnett")) #여기서 group은 plantGrowth$group 이다.
summary(dunnett)
plot(dunnett)
}}}
attachment:평균의차이에대한검정/2.png
''95% 신뢰구간이 0을 포함''

{{{
> summary(dunnett)

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Dunnett Contrasts


Fit: lm(formula = weight ~ group, data = PlantGrowth)

Linear Hypotheses:
                 Estimate Std. Error t value Pr(>|t|)
trt1 - ctrl == 0  -0.3710     0.2788  -1.331    0.323
trt2 - ctrl == 0   0.4940     0.2788   1.772    0.153
(Adjusted p values reported -- single-step method)
}}}
 * trt1 - ctrl의 p-value = 0.323 -> 귀무가설 지지, 평균의 차이 없음
 * trt2 - ctrl의 p-value = 0.153 -> 귀무가설 지지, 평균의 차이 없음

방법2: Tukey -> 평균의 차이가 없는 조합과 평균의 차이가 있는 조합을 모두 보여준다.
{{{
install.packages("multcomp")
library("multcomp")
out <- lm(weight ~ group, data=PlantGrowth)
tukey <- glht(out, linfct=mcp(group="Tukey")) #여기서 group은 PlantGrowth$group 이다.
summary(tukey)
plot(tukey)
}}}
attachment:평균의차이에대한검정/3.png

{{{
> summary(tukey)

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = weight ~ group, data = plantGrowth)

Linear Hypotheses:
                 Estimate Std. Error t value Pr(>|t|)  
trt1 - ctrl == 0  -0.3710     0.2788  -1.331    0.391  
trt2 - ctrl == 0   0.4940     0.2788   1.772    0.198  
trt2 - trt1 == 0   0.8650     0.2788   3.103    0.012 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
(Adjusted p values reported -- single-step method)
}}}
 * trt1 - ctrl의 p-value = 0.323 -> 귀무가설 지지, 평균의 차이 없음
 * trt2 - ctrl의 p-value = 0.153 -> 귀무가설 지지, 평균의 차이 없음
 * trt2 - trt1의 p-value = 0.153 -> 귀무가설 기각, 평균의 차이 있음

==== 이원분산분석 ====
{{{
#http://code.google.com/p/sonya/source/browse/trunk/r-project/sample/warpbreaks.csv?r=653
warpbreaks = read.csv("c:\\data\\warpbreaks.csv")
}}}

이산형 변수인 wool과 tension의 순서 확인
{{{
> levels(warpbreaks$wool)
[1] "A" "B"
> levels(warpbreaks$tension)
[1] "L" "M" "H"
}}}

 * wool -> 순서가 맞음.
 * tension -> 순서가 맞지 않음. L, M, H 순서이어야 함.

tension의 순서를 바로 잡음.
{{{
> warpbreaks$tension = factor(warpbreaks$tension, level = c("L", "M", "H"))
> levels(warpbreaks$tension)
[1] "L" "M" "H"
}}}

분산분석
{{{
out <- lm(breaks ~ wool*tension, data = warpbreaks)
summary(out)
}}}

{{{
> summary(out)

Call:
lm(formula = breaks ~ wool * tension, data = warpbreaks)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.5556  -6.8889  -0.6667   7.1944  25.4444 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      44.556      3.647  12.218 2.43e-16 ***
woolB           -16.333      5.157  -3.167 0.002677 ** 
tensionM        -20.556      5.157  -3.986 0.000228 ***
tensionH        -20.000      5.157  -3.878 0.000320 ***
woolB:tensionM   21.111      7.294   2.895 0.005698 ** 
woolB:tensionH   10.556      7.294   1.447 0.154327    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 10.94 on 48 degrees of freedom
Multiple R-squared: 0.3778,	Adjusted R-squared: 0.3129 
F-statistic: 5.828 on 5 and 48 DF,  p-value: 0.0002772 
}}}
 * 평균에 차이가 있음(아직 회귀진단을 하지는 않은 상태)
 * 교호작용이 있음을 확인
  * woolB:tensionM -> p-value = 0.005698, 유의수준 0.05에서 귀무가설 기각 => 교호작용 확인함!!
  * woolB:tensionH -> p-value = 0.154327, 유의수준 0.05에서 귀무가설 지지

정규성 -> p-value = 0.8162이므로 정규분포다.
{{{
> shapiro.test(resid(out))

	Shapiro-Wilk normality test

data:  resid(out) 
W = 0.9869, p-value = 0.8162
}}}

등분산성 -> __p-value = 0.0006307로 귀무가설 기각. 즉, 등분산이 아님. 종속변수 breaks에 log()나 sqrt()하자.__
{{{
#library("lmtest")
> bptest(out)

	studentized Breusch-Pagan test

data:  out 
BP = 21.5744, df = 5, p-value = 0.0006307
}}}

독립성
{{{
> dwtest(out)

	Durbin-Watson test

data:  out 
DW = 2.2376, p-value = 0.575
alternative hypothesis: true autocorrelation is greater than 0 
}}}

 * d통계량
  * 0 : 양의 자기상관, p-value = 1
  * 2 : 독립, p-value = 0
  * 4 : 음의 자기상관, p-value = -1
 * 대략적으로 DW값이 1보다 작거나 3보다 크면 자기상관이 확실히 있다고 판달할 수 있으며, 1.5~2.5 사이에 있을 경우 독립이라고 판단
 * 대략 독립.

종속변수 변환(log) 후 다시 검정
{{{
> out <- lm(log(breaks) ~ wool*tension, data = warpbreaks)
> shapiro.test(resid(out))

	Shapiro-Wilk normality test

data:  resid(out) 
W = 0.9729, p-value = 0.2583

> bptest(out)

	studentized Breusch-Pagan test

data:  out 
BP = 4.8045, df = 5, p-value = 0.4402

> dwtest(out)

	Durbin-Watson test

data:  out 
DW = 2.06, p-value = 0.3167
alternative hypothesis: true autocorrelation is greater than 0 
}}}
 * 정규성, 등분산성, 독립성이 모두 갠춘함.

어떤 조합의 평균이 다른가?
{{{
library("multcomp")
out <- lm(log(breaks) ~ wool + tension, data=warpbreaks)
tukey1 <- glht(out, linfct=mcp(wool="Tukey"))
tukey2 <- glht(out, linfct=mcp(tension="Tukey"))
summary(tukey1)
summary(tukey2)
}}}

{{{
> summary(tukey1)

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = log(breaks) ~ wool + tension, data = warpbreaks)

Linear Hypotheses:
           Estimate Std. Error t value Pr(>|t|)
B - A == 0  -0.1522     0.1063  -1.431    0.159
(Adjusted p values reported -- single-step method)

> summary(tukey2)

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = log(breaks) ~ wool + tension, data = warpbreaks)

Linear Hypotheses:
           Estimate Std. Error t value Pr(>|t|)   
M - L == 0  -0.2871     0.1302  -2.205  0.08018 . 
H - L == 0  -0.4893     0.1302  -3.758  0.00133 **
H - M == 0  -0.2022     0.1302  -1.553  0.27550   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
(Adjusted p values reported -- single-step method)
}}}
==== 공분산분석(ANCOVA; Analysis of Covariance) ====
 * lm(종속변수 ~ 공변량변수 + 그룹변수)로 하면 된다.
 * 공변량변수
  * 연속형 변수
  * 종속변수에 영향을 끼치는 이미 알려진 사실에 대한 변수
  * 예: 몸무게가 큰 사람일수록 치료와 상관없이 치료의 전/후 차이도 크다. -> 치료 전 몸무게를 공변량 변수로 추가
 * 공변량 변수를 추가하는 것 빼고는 분산분석과 동일

==== 차이의 조합 ====
{{{
out <- aov(lifetime ~ q, data=x8)
TukeyHSD(out)
}}}

==== 참고도서 ====
 * [http://www.kyobobook.co.kr/product/detailViewKor.laf?ejkGb=KOR&mallGb=KOR&barcode=9788955661101&orderClick=LAG&Kc=SETLBkserp1_5 R을 이용한 누구나 하는 통계분석]
 * [http://www.kyobobook.co.kr/product/detailViewKor.laf?ejkGb=KOR&mallGb=KOR&barcode=9788931428216&orderClick=LEA&Kc=SETLBkserp1_5 SAS 강좌와 통계컨설팅]
----
Hey, that's the grseetat! So with ll this brain power AWHFY? -- Earng 2015-12-11 04:19:24