#title Random Forest
[[TableOfContents]]

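A forest contains trees, and here each tree is a decision tree. Both the input and the forest are random: each tree is grown on a random (bootstrap) sample of the training data, and only a random subset of the variables is tried at each split. To classify an observation, every tree in the forest predicts a class and the forest takes a majority vote over those predictions. The method runs efficiently on large data sets, can use many variables without first eliminating any, and tends to be accurate. It is also said to work well for populations with unbalanced classes. -- See ''R을 이용한 빅데이터 분석'' (Big Data Analysis Using R), 김경태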
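
The voting step described above can be inspected directly. The following is a minimal sketch, assuming the randomForest package is installed; it uses the built-in iris data and a small forest of 25 trees purely for illustration, and the names `rf`, `newx`, and `manual` are made up for this example.

{{{
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 25)

# Three observations to classify (one row from each species)
newx <- iris[c(1, 51, 101), ]

# predict.all = TRUE also returns the prediction of every individual tree
p <- predict(rf, newx, predict.all = TRUE)
dim(p$individual)   # 3 rows (observations) by 25 columns (trees)

# Majority vote recomputed by hand for each observation
manual <- apply(p$individual, 1, function(v) names(which.max(table(v))))
manual

# The forest's own aggregated prediction, for comparison
p$aggregate
}}}

The hand-computed vote and `p$aggregate` should normally agree; randomForest breaks exact ties at random, so a tied vote could differ.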

==== Example data ====
The example data comes from [http://www.kyobobook.co.kr/product/detailViewKor.laf?ejkGb=KOR&mallGb=KOR&barcode=9788983257000&orderClick=LAG&Kc=SETLBkserp11_15 EXCEL에 의한 조사방법 및 통계분석] (a book on survey methods and statistical analysis with Excel).
{{{
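# Columns: ID, purchased brand, age, annual household income, household size, visit frequency, years of residence
cname <- c("ID", "구매브랜드", "연령","세대연수입", "세대사람수", "방문빈도", "거주년수")
# stringsAsFactors = TRUE makes 구매브랜드 a factor, which randomForest() needs for classification
x <- read.table("c:\\data\\disc.txt", col.names = cname, stringsAsFactors = TRUE)
head(x)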
}}}
attachment:판별분석/disc.txt

{{{
> head(x)
  ID  구매브랜드 연령  세대연수입  세대사람수  방문빈도  거주년수
1  1          A   48       9000          4        5        6
2  2          A   58       8000          6        4       20
3  3          A   52       7000          6        4       12
4  4          A   63       7000          6        4       15
5  5          A   59       8000          4        6        6
6  6          A   38      11000          5        4       10
> 
}}}

==== Random forest ====
{{{
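library(randomForest)  # run install.packages("randomForest") first if it is not installed
# Predict the purchased brand from the other variables (ID is left out)
tree <- randomForest(구매브랜드 ~ 연령 + 세대연수입 + 세대사람수 + 방문빈도 + 거주년수, data=x)
print(tree)       # OOB error estimate and confusion matrix
importance(tree)  # variable importance (mean decrease in Gini)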
}}}

Result:
{{{
> print(tree) # view results 

Call:
 randomForest(formula = 구매브랜드 ~ 연령 + 세대연수입 + 세대사람수 + 방문빈도 + 거주년수, data = x) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 5%
Confusion matrix:
   A B class.error
A 10 0         0.0
B  1 9         0.1
> importance(tree)
           MeanDecreaseGini
연령               2.624620
세대연수입         1.815804
세대사람수         1.263035
방문빈도           1.196015
거주년수           2.576659
> 
}}}
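Measured by mean decrease in Gini, the variables rank in importance for determining 구매브랜드 (purchased brand) as 연령 (age) > 거주년수 (years of residence) > 세대연수입 (annual household income) > 세대사람수 (household size) > 방문빈도 (visit frequency). With only 20 observations, and with 연령 and 거주년수 nearly tied, this ordering can change from run to run.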
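
The OOB error estimate and confusion matrix shown in the printed output can also be read from the fitted object. The sketch below uses the built-in iris data rather than disc.txt so that it is self-contained; `rf` is a name chosen for this example.

{{{
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris)

# Out-of-bag confusion matrix, including the class.error column
rf$confusion

# OOB error rate after the last tree was added
rf$err.rate[rf$ntree, "OOB"]

# Error rates as a function of the number of trees
plot(rf)
}}}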

{{{
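# Train/test template: x6 (training data) and test are data frames assumed to exist
# A factor response already makes this a classification forest, so no type argument is needed
rf <- randomForest(factor(t3) ~ diff_cnt + diff_time, data=x6, importance=TRUE, na.action=na.omit)
pred <- predict(rf, newdata=test)
table(pred, test$t3)  # rows: predicted class, columns: observed class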
}}}

{{{
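library(randomForest)
data(iris)
set.seed(111)
# Randomly assign about 80% of the rows to training (1) and 20% to testing (2)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,])
iris.pred <- predict(iris.rf, iris[ind == 2,])
# Cross-tabulate observed against predicted species on the held-out rows
table(observed = iris[ind==2, "Species"], predicted = iris.pred)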
}}}

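Plotting a single tree from a forest (here a conditional inference forest fitted with cforest from the party package):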
{{{
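install.packages("party")
library("party")  # cforest() and the BinaryTree class come from party, not rpart
cf <- cforest(Species ~ ., data = iris)
# Extract the first tree of the ensemble; prettytree() is an unexported party internal
pt <- party:::prettytree(cf@ensemble[[1]], names(cf@data@get("input")))
pt
# Wrap the tree in a BinaryTree object so that party's plot method can draw it
nt <- new("BinaryTree")
nt@tree <- pt
nt@data <- cf@data
nt@responses <- cf@responses
nt
plot(nt)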
}}}

{{{
# Fit and print a single classification tree with the tree package
install.packages("tree")
library(tree)
tr <- tree(Species ~ ., data=iris)
tr
}}}

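==== Variable importance and Gini impurity ====
'''Gini impurity'''
The Gini index is one measure of impurity. If an observation drawn at random belongs to category i of the target variable and is classified as category j, the probability of this misclassification is P(i)P(j). Here P(i) is the probability that an observation in a given node belongs to category i of the target variable. Summing these misclassification probabilities gives

attachment:랜덤포레스트/rf100.png

which is an estimate of the misclassification probability under this classification rule. Here c is the number of categories of the target variable.
--http://kostat.go.kr/attach/journal/4-1-3.PDF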
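
As a small numeric illustration of the index described above, the following sketch computes it from the class labels in one node. It assumes the formula in the image is the usual Gini index, the sum of P(i)P(j) over all pairs i ≠ j, which equals 1 minus the sum of P(i)^2. `gini_impurity` is a name made up for this example, not a function from any package.

{{{
# gini_impurity() is a hypothetical helper, for illustration only
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)  # P(i): class proportions in the node
  1 - sum(p^2)                         # equals the sum of P(i)P(j) over i != j
}

gini_impurity(c("A", "A", "A", "A"))  # pure node: 0
gini_impurity(c("A", "A", "B", "B"))  # evenly mixed two-class node: 0.5
}}}

The MeanDecreaseGini column reported by importance() adds up how much splits on a variable reduce this impurity, averaged over the trees in the forest.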

'''Variable importance'''
{{{
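# model is a fitted randomForest object; fit it with importance=TRUE to get both importance measures
imp <- data.frame(importance(model))
imp[order(imp$MeanDecreaseGini, decreasing=TRUE),]  # sort by mean decrease in Gini
varImpPlot(model)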
}}}
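If the model was fitted with importance=TRUE, varImpPlot draws two panels: importance measured by mean decrease in accuracy and importance measured by mean decrease in Gini impurity. Without importance=TRUE, only the Gini-based measure is available.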
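
The following is a minimal, self-contained sketch of the two importance measures, assuming the randomForest package and using the built-in iris data in place of the undefined `model` above. `fit` is a name chosen for this example.

{{{
library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)

imp <- importance(fit)
colnames(imp)  # per-class columns plus MeanDecreaseAccuracy and MeanDecreaseGini

# Sort by the accuracy-based measure and show both measures side by side
imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE),
    c("MeanDecreaseAccuracy", "MeanDecreaseGini")]

importance(fit, type = 1)  # type = 1: mean decrease in accuracy
importance(fit, type = 2)  # type = 2: mean decrease in node impurity (Gini)

varImpPlot(fit)            # one panel per measure
}}}

The two measures do not have to rank the variables in the same order, so it can be worth checking both before dropping a variable.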

==== RRF ====
Regularized Random Forest
{{{
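install.packages("RRF")
library("RRF")
library("caret")  # confusionMatrix() comes from the caret package
# training and test2 are data frames assumed to exist, each with a class column is_out
model <- RRF(factor(is_out) ~ ., data=training, importance=TRUE)
pred <- predict(model, newdata=test2)
# confusionMatrix() needs both arguments as factors with the same levels
confusionMatrix(pred, factor(test2$is_out, levels=levels(pred)))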
}}}


==== Visualization ====
 * http://stats.stackexchange.com/questions/41443/how-to-actually-plot-a-sample-tree-from-randomforestgettree
 * [https://cran.r-project.org/web/packages/randomForestExplainer/vignettes/randomForestExplainer.html randomForestExplainer]
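
The first link above concerns getTree from randomForest. As a rough sketch of what that function returns, assuming the randomForest package and the built-in iris data (`rf` and `tr1` are names chosen for this example):

{{{
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 10)

# Extract the first tree as a data frame with one row per node;
# labelVar = TRUE shows variable names and class labels instead of indices
tr1 <- getTree(rf, k = 1, labelVar = TRUE)
head(tr1)

# Terminal nodes have status -1 and carry the predicted class
tr1[tr1$status == -1, c("status", "prediction")]
}}}

getTree only returns the tree structure as a table; drawing it as a diagram takes extra code, which is what the linked discussion covers.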


==== References ====
 * attachment:랜덤포레스트/Classification_with_Random_Forests.doc
 * [https://gist.github.com/shanebutler/96f0e78a02c84cdcf558 sql.export.randomForest.R]