#title Classifier Examples
[[TableOfContents]]

Using an algorithm that looks complex and sophisticated does not by itself make the results better. What matters are the variables.

==== sampling ====
An example of splitting the data into training and test sets. With p=0.3, createDataPartition assigns 30% of the rows to the training set and the remaining 70% to the test set.
{{{
nrow(iris) #150

# training set sampling, test set sampling
library("caret")
partition_idx <- createDataPartition(iris$Species, p=0.3)$Resample1
training <- iris[partition_idx, ]
test <- iris[-partition_idx, ]
nrow(training); nrow(test) #45; 105
}}}

over/under sampling
{{{
library("DMwR")
# DMwR::SMOTE expects a binary class variable, so collapse iris to two classes first
iris2 <- iris[, c(1, 2, 5)]
iris2$Species <- factor(ifelse(iris2$Species == "setosa", "rare", "common"))
resample.df <- SMOTE(Species ~ ., data = iris2, perc.over = 100, perc.under = 50)
table(resample.df$Species)
}}}
When the two classes differ by a ratio of about 2:1, the smaller class is over-sampled (perc.over=100) and the larger class is under-sampled (perc.under=50). SMOTE reportedly finds a point's nearest neighbours and then adds synthetic data by multiplying the difference to a neighbour by a random value between 0 and 1; a rough sketch of this synthesis step is shown below.
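As an illustration of that synthesis step, here is a minimal sketch in base R. It is not the DMwR::SMOTE implementation; the choice of the versicolor rows, the 5-neighbour count, and the variable names are assumptions made purely for demonstration.
{{{
# Minimal sketch of SMOTE-style synthesis (illustrative only, not DMwR internals).
# x: numeric matrix of minority-class rows.
x <- as.matrix(iris[iris$Species == "versicolor", 1:4])

set.seed(1)
i  <- sample(nrow(x), 1)        # pick a random minority-class point
d  <- as.matrix(dist(x))[i, ]   # distances from that point to all the others
nn <- order(d)[2:6]             # its 5 nearest neighbours (position 1 is the point itself)
j  <- sample(nn, 1)             # choose one neighbour at random

# synthetic point: move a random fraction (0~1) of the way toward the neighbour
synthetic <- x[i, ] + runif(1) * (x[j, ] - x[i, ])
synthetic
}}}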
==== ctree (conditional inference tree) ====
Decision trees as implemented in the rpart package have two problems:
 * overfitting
 * splits are chosen without a test of statistical significance
ctree addresses both of these issues.
{{{
library(party)
library(caret)
model <- ctree(Species ~ ., data = training)
pred <- predict(model, newdata = test, type = "response")
confusionMatrix(pred, test$Species)
}}}

==== SVM ====
{{{
library(e1071)
library(caret)
model <- svm(Species ~ ., data = training)
pred <- predict(model, newdata = test)
confusionMatrix(pred, test$Species)

# parameter tuning and kernlab::ksvm; these lines use the is_out data set
# from Example 2 below (test2 is a second test set, assumed defined elsewhere)
obj <- tune.svm(factor(is_out) ~ ., data = training, sampling = "fix",
                gamma = 2^c(-8, -4, 0, 4), cost = 2^c(-8, -4, -2, 0))
plot(obj, transform.x = log2, transform.y = log2)
plot(obj, type = "perspective", theta = 120, phi = 45)

library("kernlab")
model <- ksvm(factor(is_out) ~ ., data = training, kernel = "rbfdot")
pred <- predict(model, newdata = test2, type = "response")
confusionMatrix(pred, test2$is_out)
}}}

==== kNN ====
{{{
library("class")
library(caret)
pred <- knn(training[, 1:4], test[, 1:4], training$Species, k = 5, prob = TRUE)
confusionMatrix(pred, test$Species)
}}}

==== randomForest ====
{{{
library("randomForest")
library(caret)
# classification mode is inferred from the factor response
model <- randomForest(Species ~ ., data = training, importance = TRUE)
pred <- predict(model, newdata = test)
confusionMatrix(pred, test$Species)
}}}

==== neural networks ====
{{{
library(nnet)
library(caret)
model <- nnet(Species ~ ., data = training, size = 5)
pred <- predict(model, newdata = test, type = "class")
# predict() returns a character vector, so convert to factor for confusionMatrix()
confusionMatrix(factor(pred, levels = levels(test$Species)), test$Species)
}}}

==== logistic regression ====
{{{
library(nnet)
library(caret)
model <- multinom(Species ~ ., data = training)
#head(fitted(model))  # fitted values are class probabilities
pred <- predict(model, newdata = test, type = "class")
confusionMatrix(pred, test$Species)
}}}
summary(model)
{{{
> summary(model)
Call:
multinom(formula = Species ~ ., data = training)

Coefficients:
           (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor    157.4558    -45.15929   -27.99315     79.58368   -58.51255
virginica    -157.1791     13.78356   -82.12478     54.58758    80.74923

Std. Errors:
           (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor    17101.14     9088.735    10475.50     4091.601    7197.552
virginica     17101.14     9088.740    10475.49     4091.600    7197.552

Residual Deviance: 0.0001210002
AIC: 20.00012
}}}

==== ada boosting ====
{{{
library(ada)
library(caret)
# note: ada handles binary classification only; with the 3-class iris data
# a two-class response (e.g. iris2 from the sampling section) is needed
model <- ada(Species ~ ., data = training)
pred <- predict(model, newdata = test)
confusionMatrix(pred, test$Species)
}}}

==== naive bayes ====
{{{
library(e1071)
library(caret)
model <- naiveBayes(Species ~ ., data = training)
pred <- predict(model, newdata = test)
confusionMatrix(pred, test$Species)
}}}

==== gam ====
{{{
library("mgcv")
library(caret)
# 이탈여부: churn flag with levels "이탈"/"잔존" (churned/retained); 변수1, 변수2: predictors
model <- gam(이탈여부 ~ s(변수1) + s(변수2), family = binomial, data = training)
summary(model)
pred <- predict(model, test, type = "response")
confusionMatrix(factor(ifelse(pred < 0.5, "이탈", "잔존")), test$이탈여부)
}}}

==== som ====
{{{
library("kohonen")
library(caret)
training.class <- training$Species
test.class <- test$Species
training <- scale(training[, 1:4])
test <- scale(test[, 1:4])
model <- som(training, grid = somgrid(5, 5, "hexagonal"))
# supervised prediction via trainX/trainY follows the kohonen 2.x predict() interface
pred <- predict(model, newdata = test, trainX = training, trainY = factor(training.class))
confusionMatrix(pred$prediction, test.class)
}}}

==== Example 2 ====
The same classifiers applied to a churn data set with a binary target is_out (assumed defined elsewhere).
{{{
#ctree
library(party)
library(caret)
model <- ctree(factor(is_out) ~ ., data = training)
#model <- ctree(factor(is_out) ~ ., data = training, weights = ifelse(training$누적구매건수 >= 0, 100, 1))
pred <- predict(model, newdata = test, type = "response")
confusionMatrix(pred, test$is_out)

#randomForest
library(randomForest)
model <- randomForest(factor(is_out) ~ ., data = training, importance = TRUE, proximity = TRUE)
pred <- predict(model, newdata = test)
confusionMatrix(pred, test$is_out)
imp <- data.frame(importance(model))
imp[order(imp$MeanDecreaseGini, decreasing = TRUE), ]
varImpPlot(model)

#CART
library(rpart)
model <- rpart(factor(is_out) ~ ., data = training, method = "class")
pred <- predict(model, newdata = test, type = "class")
confusionMatrix(pred, test$is_out)

#SVM
library(e1071)
model <- svm(factor(is_out) ~ ., data = training)
pred <- predict(model, newdata = test)
confusionMatrix(pred, test$is_out)

#NN
library(nnet)
model <- nnet(factor(is_out) ~ ., data = training, size = 40)
pred <- predict(model, newdata = test, type = "class")
confusionMatrix(factor(pred), test$is_out)

#kNN
library("class")
pred <- knn(training[, 2:ncol(training)], test[, 2:ncol(training)], training$is_out, k = 7, prob = TRUE)
confusionMatrix(pred, test$is_out)

#logistic regression
library(nnet)
model <- multinom(is_out ~ ., data = training)
pred <- predict(model, newdata = test, type = "class")
confusionMatrix(pred, test$is_out)
}}}

==== train ====
{{{
library(caret)
library(rpart)
library(e1071)
data(iris)
formula <- as.formula(Species ~ .)
# pass rpart options through rpart.control; caret tunes cp over its own grid
t <- train(formula, iris, method = "rpart",
           control = rpart.control(cp = 0.002, maxdepth = 8))
plot(t)
plot(t$finalModel)
text(t$finalModel)

library(rattle)
library("rpart.plot")
fancyRpartPlot(t$finalModel)
}}}

==== References ====
 * http://www.slideshare.net/kmettler/caret-package-for-r
 * http://r4pda.co.kr/r4pda_2013_12_01.pdf