Background (ref. http://groupware.les.inf.puc-rio.br/har)

Human activity recognition research has traditionally focused on discriminating between different activities. The “how (well)” question has received only little attention, even though it potentially provides useful information for a large variety of applications, such as sports training.

Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).

In this project, we will use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har.

Our goal is to predict which of the 5 ways the exercise was performed for each of 20 separate test measurements.

Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Load required libraries

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

Read data

pmltraining <- read.csv(url("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"))
pmltesting <- read.csv(url("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"))

Data cleaning and partition for validation

dim(pmltraining);dim(pmltesting)
## [1] 19622   160
## [1]  20 160

We checked the data quality of the training dataset before modeling. We first converted ‘#DIV/0!’ entries in numeric fields to NA, then used the colSums function to identify columns that contain missing values.

These columns, together with the row index and timestamp columns, are removed from the training dataset. The same cleaning rule is then applied to the testing dataset.

(We also tried removing records with missing values row by row, but this turned out to be impractical because about 98% of rows contain at least one missing value in some field.)
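
This can be checked directly on the raw training data, for example:

# Fraction of rows with at least one missing value; roughly 0.98 here,
# which is why we drop sparse columns rather than rows
mean(rowSums(is.na(pmltraining)) > 0)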

We then split the cleaned training dataset into training and validation partitions. A seed is set so the same result can be reproduced.

# use training data only to remove NA columns
prePro<-pmltraining
prePro[prePro=='#DIV/0!']<-NA      # treat '#DIV/0!' entries as missing
nona<-colSums(is.na(prePro)) == 0  # keep only columns with no missing values
nona[c(1,3:5)]<-FALSE              # also drop the row index and timestamp columns

set.seed(1234)
mlData<-pmltraining[,nona]
inTrain<-createDataPartition(y=mlData$classe,p=0.5,list=F)
training<-mlData[inTrain,]
validating<-mlData[-inTrain,]

#do the same variable selection on testing dataset
testing<-pmltesting[,nona]

The cleaned data has 56 columns.
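They can be listed with, for example:

names(training)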

##  [1] "user_name"            "new_window"           "num_window"          
##  [4] "roll_belt"            "pitch_belt"           "yaw_belt"            
##  [7] "total_accel_belt"     "gyros_belt_x"         "gyros_belt_y"        
## [10] "gyros_belt_z"         "accel_belt_x"         "accel_belt_y"        
## [13] "accel_belt_z"         "magnet_belt_x"        "magnet_belt_y"       
## [16] "magnet_belt_z"        "roll_arm"             "pitch_arm"           
## [19] "yaw_arm"              "total_accel_arm"      "gyros_arm_x"         
## [22] "gyros_arm_y"          "gyros_arm_z"          "accel_arm_x"         
## [25] "accel_arm_y"          "accel_arm_z"          "magnet_arm_x"        
## [28] "magnet_arm_y"         "magnet_arm_z"         "roll_dumbbell"       
## [31] "pitch_dumbbell"       "yaw_dumbbell"         "total_accel_dumbbell"
## [34] "gyros_dumbbell_x"     "gyros_dumbbell_y"     "gyros_dumbbell_z"    
## [37] "accel_dumbbell_x"     "accel_dumbbell_y"     "accel_dumbbell_z"    
## [40] "magnet_dumbbell_x"    "magnet_dumbbell_y"    "magnet_dumbbell_z"   
## [43] "roll_forearm"         "pitch_forearm"        "yaw_forearm"         
## [46] "total_accel_forearm"  "gyros_forearm_x"      "gyros_forearm_y"     
## [49] "gyros_forearm_z"      "accel_forearm_x"      "accel_forearm_y"     
## [52] "accel_forearm_z"      "magnet_forearm_x"     "magnet_forearm_y"    
## [55] "magnet_forearm_z"     "classe"

Exploratory plots

We first used featurePlot from the caret package to look at the relationships among the aggregated statistics total_accel_belt, total_accel_arm, total_accel_dumbbell and total_accel_forearm, and we observed two clusters in total_accel_belt against the other three variables. Three individual plots are provided below.
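
The plots can be generated along these lines (a sketch; the exact plotting chunk is not shown):

# Pairs plot of the four aggregate acceleration variables, grouped by classe
featurePlot(x = training[, c("total_accel_belt", "total_accel_arm",
                             "total_accel_dumbbell", "total_accel_forearm")],
            y = training$classe, plot = "pairs")
# Individual views of total_accel_belt against the other three variables
qplot(total_accel_belt,total_accel_arm,col=classe,data=training)
qplot(total_accel_belt,total_accel_dumbbell,col=classe,data=training)
qplot(total_accel_belt,total_accel_forearm,col=classe,data=training)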

(Three scatter plots: total_accel_belt against total_accel_arm, total_accel_dumbbell and total_accel_forearm.)

Data preprocessing

We tried using PCA to reduce the dimension of the predictor space, but it did not improve the accuracy, nor did it speed up the computation. More detail is given in the Modeling section.

lastCol<-ncol(training)  # index of the outcome column (classe)
# PCA preprocessing, retaining components that explain 99.5% of the variance
preProc<-preProcess(training[,-c(1:6,lastCol)],method='pca',thresh=.995)
trainPC<-predict(preProc,training[,-c(1:6,lastCol)])
validPC<-predict(preProc,validating[,-c(1:6,lastCol)])
testPC<-predict(preProc,testing[,-c(1:6,lastCol)])

Modeling

The first model fits a random forest to the trainPC dataset; it is then compared to a random forest model that uses all variables in the training set.

The following results show that the trainPC model does not meaningfully reduce the computing time, and its OOB error rate is visibly higher than that of the model using all predictors.

As a side note, caret's train function is considerably slower than calling randomForest directly.

# very slow!
# system.time(modFit<-train(classe~.,method='rf',data=trainPC))
# modFit$finalModel

(a) Principal component random forest model

system.time(PCFit<-randomForest(training$classe~.,data=trainPC))
##    user  system elapsed 
##  17.105   0.043  17.145
PCFit
## 
## Call:
##  randomForest(formula = training$classe ~ ., data = trainPC) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 3.13%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 2769   11    3    3    4    0.007527
## B   55 1816   23    1    4    0.043707
## C    2   38 1653   17    1    0.033898
## D    1    1   70 1525   11    0.051617
## E    2   16   23   21 1742    0.034368

(b) All predictor random forest model

system.time(modFit<-randomForest(classe~.,data=training)) # reduce to ntree=200 if this is too slow
##    user  system elapsed 
##  19.261   0.048  19.305
modFit
## 
## Call:
##  randomForest(formula = classe ~ ., data = training) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.58%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 2788    1    0    0    1   0.0007168
## B   11 1887    1    0    0   0.0063191
## C    0   10 1700    1    0   0.0064290
## D    0    0   21 1585    2   0.0143035
## E    0    0    0    9 1795   0.0049889

In the following analysis, we will stay with the random forest model using all predictors in the training dataset.

We have a decent model, with an out-of-bag (OOB) error rate estimate of 0.58%.

Cross-Validation

To estimate the out-of-sample error, we provide the following cross-validation analysis.

This is a 2-fold random sampling cross-validation.

For the first fold, with the model trained on the training partition and evaluated on the validation partition, we have the following result.

pred<-predict(modFit,validating)
a<-table(pred,validating$classe)
oos1<-round(1-sum(diag(a))/sum(a),4)
a
##     
## pred    A    B    C    D    E
##    A 2790    2    0    0    0
##    B    0 1893    8    0    0
##    C    0    3 1702   18    0
##    D    0    0    1 1589    3
##    E    0    0    0    1 1800

We now swap the training and validation datasets: we fit a new random forest model on the validation partition and then use the training partition for validation. Here is the result.

system.time(modFit2<-randomForest(classe~.,data=validating))
##    user  system elapsed 
##  19.185   0.047  19.229
pred2<-predict(modFit2,training)
a<-table(pred2,training$classe)
oos2<-round(1-sum(diag(a))/sum(a),4)
a
##      
## pred2    A    B    C    D    E
##     A 2790    5    0    0    0
##     B    0 1891   18    0    0
##     C    0    3 1693   19    0
##     D    0    0    0 1588    4
##     E    0    0    0    1 1800

The estimated out-of-sample error is oos = 0.44%, which is the average of the out-of-sample error rates oos1 and oos2.
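
The averaged figure is obtained with, for example:

# Mean of the two fold error rates from the 2-fold cross-validation
oos<-round(mean(c(oos1,oos2)),4)  # about 0.0044, i.e. 0.44%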

The following plot shows where the predictions went wrong during cross-validation.

validating$predRight<-pred==validating$classe
qplot(total_accel_belt,total_accel_arm,col=predRight,data=validating,main='Prediction Errors')

(Scatter plot of total_accel_arm against total_accel_belt, coloured by whether the prediction was correct.)

Comparing with the confusion matrix above, most prediction errors occurred between classes C and D.

Testing

We now come to the testing stage. We need to predict the outcomes from the 20 testing data records labeled by ‘problem_id_1’ to ‘problem_id_20’ using the random forest model we just built.

Note that the variable new_window in the testing dataset does not have the same factor levels as in the training data. We have to add the level ‘yes’ to it before the testing data can be passed to the trained model.

levels(testing$new_window)<-c(levels(testing$new_window),'yes')
answers<-predict(modFit,testing)
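
For completeness, the 20 predictions can be written out, one text file per problem_id, with a small loop along these lines (a sketch; the file-writing step is not shown above):

# Write one file per prediction: problem_id_1.txt ... problem_id_20.txt
for (i in seq_along(answers)) {
  write.table(answers[i],file=paste0('problem_id_',i,'.txt'),
              quote=FALSE,row.names=FALSE,col.names=FALSE)
}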

Our answers are 100% correct. Here is the GitHub page.