Saturday, December 29, 2012

Speed skating 10 km

It is winter, which makes it time for one of the Netherlands' beloved sports: speed skating. Speed skating is done over various distances, but for me the most beautiful is the 10 km. The top men do this in about 13 minutes. In this post I try to use a simple PLS model to predict the finish time as a function of lap times. It turns out that is not so easy: the later laps are the ones most related to the finish time.

The races are held on a 400 m skating rink. To quote Wikipedia: 'All races are run counter-clockwise. In all individual competition forms, only two skaters are allowed to race at once, and must remain in their respective lanes. Skaters must change lanes every lap. The skater changing from the outside lane to the inside has right-of-way.' It follows that a 10 km race consists of 25 laps, and hence there are 25 lap times.

Source data

Data for the 10 km were downloaded from schaatsstatistieken (a PDF file; Dutch and English language versions are available) at the beginning of December. It contains lap times for races where the final time was under 13'30 at that point in calendar time. Since then some additional results have been added, and the Dutch championships this weekend will surely add more. The data were transformed to a .txt file in Calibre. Unfortunately the result has rather randomly placed line breaks and no delimiters except the space, so it took me some lines of code to read the data. It was not possible to just use the counts of spaces, since these vary with skaters' names and event locations. I put this script in the appendix since it breaks the text too much.

Plotting data

The plot shows all lap times in the data. The pattern is fairly simple: the first lap is slow while speed is built up, the intermediate laps are skated at a fairly constant speed, and in the last laps skaters either speed up or slow down.
# plot data
long <- reshape(scating4,varying=list(colnames(times),colnames(rtimes)),
    v.names=c('total','round'),direction='long',times=seq(400,10000,400),
    timevar='Distance')
xyplot(round ~ Distance ,groups= nums,data=long,type='l')


The second plot shows a selected detail: all results from the year 2000. These are typical results. Frank Dittrich kept increasing his speed. Gianni Romme in Heerenveen slowed slightly over the laps. Bob de Jong skated nicely flat lap times.
long$year <- substr(long$date,1,4)
xyplot(round ~ Distance ,groups= paste(name,' (',location,')',
    sep=''),data=long[long$year=='2000',],
    type='l',ylab='lap time',main='Year 2000',
    auto.key=list(space='bottom',lines=TRUE,points=FALSE))

Predictive model

To model these correlated data it seems obvious to use PLS. The question is then: at which distance is the final time well predicted? At 2 km it is not so good. A Root Mean Squared Error of Prediction (RMSEP) of about 7.5 at 4 components in the model translates into a fair chance of an error of more than 7 seconds; see also the observed-predicted plot.
pls1 <- plsr(m10000 ~ rm400 + rm800 +rm1200 + rm1600 + rm2000,data=scating4,validation='CV',ncomp=5)
RMSEP(pls1)
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
CV           9.879    7.766    7.569    7.492    7.469    7.469
adjCV        9.879    7.764    7.566    7.490    7.466    7.465
formattl <- function(x) 
  paste(x %/% 60,"'",
      sprintf('%02i',round(x %% 60))
      ,sep='')

obspredplot <- function(y,main) {

  plot(x=scating4$m10000,y=y,axes=FALSE,
      xlab='Observed time',ylab='Predicted time',
      main=main,
      xlim=c(759,811),ylim=c(759,821))
  axis(1,at=seq(760,810,10),labels=formattl(seq(760,810,10)))
  axis(2,at=seq(760,820,10),labels=formattl(seq(760,820,10)))
  abline(a=0,b=1,col='blue')
  return(invisible(0))
}
obspredplot(y=as.numeric(predict(pls1,ncomp=2,type='response')),
    'Data up to 2 km')
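As a rough check of the 'fair chance of more than 7 seconds' claim above (a sketch of my own, assuming roughly normally distributed prediction errors with standard deviation equal to the RMSEP):
# chance of an absolute prediction error above 7 s when errors ~ N(0, 7.5^2)
2*pnorm(-7/7.5)  # roughly 0.35, about one in three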


As more data is added, the predictions get better. With the data up to 8 km the RMSEP is about 2.4, so a good prediction can be made.

pls2 <- plsr(m10000 ~ rm400 + rm800 +rm1200 + rm1600 + rm2000
            +rm2400 + rm2800 + rm3200 + rm3600 + rm4000,data=scating4,
    validation='CV',ncomp=9)
RMSEP(pls2)
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           9.879    6.808    6.484    6.354    6.329    6.320    6.315
adjCV        9.879    6.807    6.482    6.351    6.323    6.312    6.307
       7 comps  8 comps  9 comps
CV       6.319    6.319    6.319
adjCV    6.311    6.311    6.311
pls3 <- plsr(m10000 ~ rm400 + rm800 +rm1200 + rm1600 + rm2000
        +rm2400 + rm2800 + rm3200 + rm3600 + rm4000
        +rm4400 + rm4800 + rm5200 + rm5600 +rm6000,data=scating4,
    validation='CV',ncomp=14)
RMSEP(pls3)
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           9.879    5.219    4.645    4.507    4.474    4.492    4.525
adjCV        9.879    5.217    4.643    4.505    4.469    4.485    4.514
       7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
CV       4.536    4.538    4.539     4.540     4.540     4.540     4.540
adjCV    4.525    4.526    4.527     4.528     4.528     4.528     4.528
       14 comps
CV        4.540
adjCV     4.528
pls4 <- plsr(m10000 ~ rm400 + rm800 +rm1200 + rm1600 + rm2000
        +rm2400 + rm2800 + rm3200 + rm3600 + rm4000
        +rm4400 + rm4800 + rm5200 + rm5600 +rm6000
        +rm6400 + rm6800 + rm7200 + rm7600 +rm8000,data=scating4,
    validation='CV',ncomp=14)
RMSEP(pls4)
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV           9.879    2.806    2.558    2.411    2.366    2.351    2.372
adjCV        9.879    2.805    2.557    2.409    2.362    2.347    2.366
       7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
CV       2.385    2.386    2.387     2.389     2.390     2.390     2.391
adjCV    2.377    2.378    2.380     2.381     2.382     2.383     2.383
       14 comps
CV        2.391
adjCV     2.383
obspredplot(as.numeric(predict(pls4,ncomp=2,type='response')),'Data up to 8 km')
Further calculation shows that it is not the amount of data that determines the prediction quality: just using the lap times at 2, 4, 6 and 8 km already gives good predictions.
pls4b <- plsr(m10000 ~  rm2000 + rm4000 +rm6000 +rm8000,data=scating4,
     validation='CV',ncomp=4)
RMSEP(pls4b)
       (Intercept)  1 comps  2 comps  3 comps  4 comps
CV           9.879    2.979    2.903    2.891    2.892
adjCV        9.879    2.978    2.902    2.889    2.889

In hindsight I could have found this in a much simpler way. Just looking at correlations shows that the lap times just before 6 km and just before 8 km have the highest correlation with the final time.

plot(y=cor(cbind(rtimes,times[,25]),use='complete.obs')[26,1:25],x=seq(400,10000,400),
    ylab='Correlation with 10 km time',xlab='Lap (distance)',
    main='Correlation between lap time and 10 km time')


Appendix: reading and transforming data

library(lattice)
library(pls)
library(randomForest)
r1 <- readLines("Rondetijden 10000 meter mannen onder 13.30,00 - SchaatsStatistieken.nl.txt",encoding='UTF-8')
# removing unneeded input
nc <- sapply(1:length(r1),function(x) nchar(r1[x]))
r2 <- r1[nc!=0]
r2 <- r2[r2!="datum"]
r2 <- r2[r2!="# naam"]
r2 <- r2[-grep("pagina",r2)]
r2 <- r2[-grep("SchaatsStatistieken",r2)]
r2 <- r2[-grep("^400m",r2)]
r2 <- r2[-grep("^nat plaats",r2)]
r4 <- sapply(1:length(r2),function(x) 
      gsub("800m 1200m 1600m 2000m 2400m 2800m 3200m 3600m 4000m 4400m 4800m 5200m 5600m 6000m 6400m 6800m 7200m 7600m 8000m 8400m 8800m 9200m 9600m 10000m ",
           "",  r2[x]))
lengths <- function(x) {
  nc <- sapply(1:length(x),function(xx) nchar(x[xx]))
  nc
}
#appending lines - each line should start with a number
while (length(grep('^[0-9]{1,3} ',r4,invert=TRUE))>0) {
  le <- grep('^[0-9]{1,3} ',r4,invert=TRUE)[1]
  r4[le-1] <- paste(r4[le-1],r4[le])
  r4 <- r4[-le]
}

r5 <- r4[-grep('- - - -',r4)]

# extracting columns
num <- regexpr(" ",r5)
scating1 <- data.frame(nums = as.numeric(substr(r5,1,num-1)),
    strings = substring(r5,num+1))
#country is three uppercase letters, the name comes before it
num <- regexpr("[A-Z]{3}",scating1$strings)
scating1$name <- substr(scating1$strings,1,num-1)
scating1$country <- substr(scating1$strings,num,num+2)
scating1$strings <- substring(scating1$strings,num+4)
#head(scating1)
#date starts with 4 digits characterising the year
#location comes before the date
num <- regexpr("[0-9]{4}",scating1$strings)
#table(num)
scating1$location <- substr(scating1$strings,1,num-1)
scating1$date <- substr(scating1$strings,num,num+10)
scating1$strings <- substring(scating1$strings,num+11)
#head(scating1)
#notation of , and .
scating1$strings <- gsub('.','#',scating1$strings,fixed=TRUE)
scating1$strings <- gsub(',','.',scating1$strings,fixed=TRUE)
#head(scating1)

distance <- paste('m',seq(400,10000,400),sep='')
scating1$strings <- paste(scating1$strings,' ')
times <- matrix(0.0,nrow=nrow(scating1),
          ncol=25,dimnames=list(1:nrow(scating1),distance))

for (i in 1:25) {
   num <- regexpr(" ",scating1$strings)
   time <- substr(scating1$strings,1,num-1)
   time[time=='-'] <- '0#0'
   ntime <- 60*as.numeric(substr(time,1,regexpr("#",time)-1))+
       as.numeric(substr(time,regexpr("#",time)+1,lengths(time)))
   ntime[time=='0#0'] <- NA   
   times[,i] <- ntime
   scating1$strings <- substr(scating1$strings,num+1,
      lengths(scating1$strings))
 }
scating2 <- cbind(scating1[,!(names(scating1)=='strings')],times)
rtimes <- times
colnames(rtimes) <- paste('r',colnames(times),sep='')
for (i in 2:25) rtimes[,i] <- times[,i] - times[,i-1] 
scating3 <- cbind(scating2,rtimes)

scating4 <- scating3[complete.cases(scating3),]
# next two corrections have been applied in the source data (website)
scating4 <- scating4[scating4$m400>20,]    # wrong data in 1st lap
scating4 <- scating4[scating4$num != 336,] # wrong data at 6400 m

Tuesday, December 25, 2012

Common words in the Gathering Storm

The Wheel of Time is a series of books started by Robert Jordan. Unfortunately he died too early. Like all fans of the series I feel very lucky that Brandon Sanderson was able to continue these books. The first book Sanderson wrote was the Gathering Storm; the last one is due in January 2013. In this post I examine how common words can be used to differentiate between books written by Sanderson and those written by Jordan.

Data

The training data used are some of the books by Sanderson and Jordan. They form three categories: Robert Jordan Wheel of Time, Robert Jordan other, and Brandon Sanderson various.
  • the Eye of the World (Wheel of Time) by Robert Jordan
  • the Fires of Heaven (Wheel of Time) by Robert Jordan
  • Elantris by Brandon Sanderson
  • Warbreaker by Brandon Sanderson
  • Prince of the Blood (other) by Robert Jordan
  • Conan the Defender (other) by Robert Jordan
The test set consists of three books:
  • Knife of Dreams (Wheel of Time) by Robert Jordan
  • Mistborn by Brandon Sanderson
  • the Gathering Storm (Wheel of Time) by Brandon Sanderson and Robert Jordan
All books were acquired via darknet and read into R as a vector with one element per chapter. Prologue and epilogue count as separate chapters. The relative amount of common words is counted in each chapter. In this case, common words are defined as stopwords from the tm package. For example:
tm::stopwords("English")[1:5]
[1] "a"      "about"  "above"  "across" "after" 
Two functions were devised to count the relative occurrence of these words per chapter:
numwords <- function(what,where) {
  g1 <- gregexpr(paste('[[:blank:]]+[[:punct:]]*',what,'[[:punct:]]*[[:blank:]]+',sep=''),where,perl=TRUE,ignore.case=TRUE)
  if (g1[[1]][1]==-1) 0L
  else length(g1[[1]])
}
countwords <- function(book) {
  sw <- tm::stopwords("English")
  la <- lapply(book,function(where) {        
        sa <- sapply(sw,function(what) numwords(what,where))
        ntot <- length(gregexpr('[[:blank:]]+',
                       where,perl=TRUE,ignore.case=TRUE)[[1]])
        sa/ntot
      } )
  # one row per chapter, one column per stopword
  t(do.call(cbind,la))
}
# words are counted
wtEotW <- countwords(tEotW)
wElantris <- countwords(Elantris)
wtFoH <- countwords(tFoH)
wWarbreaker <- countwords(Warbreaker)
wPotB <- countwords(PotB)
wConan <- countwords(Conan)
wtGS <- countwords(tGS)
wMistborn <- countwords(Mistborn)
wKoD <- countwords(KoD)

Model

Random forest is used since the number of variables is much larger than the number of observations.
#combine the counts and make predictions
all <- rbind(wElantris,wWarbreaker,wtEotW,wtFoH,wPotB,wConan)
cats <-  factor(c(
        rep('BS',nrow(wElantris)),
        rep('BS',nrow(wWarbreaker)),
        rep('WoT',nrow(wtEotW)),
        rep('WoT',nrow(wtFoH)),
        rep('RJ',nrow(wPotB)),
        rep('RJ',nrow(wConan))
    ),levels=c('BS','WoT','RJ'))
rf1 <- randomForest(y=cats,x=all,importance=TRUE)
rf1

Call:
 randomForest(x = all, y = cats, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 22

        OOB estimate of  error rate: 3.93%
Confusion matrix:
     BS WoT RJ class.error
BS  124   0  1 0.008000000
WoT   0 110  1 0.009009009
RJ    5   4 35 0.204545455
varImpPlot(rf1)


Words which discriminate between the three categories include 'and', 'didn't' and 'not'. The next figure shows the typical usage of nine words. Note that the data have been scaled at this point in order to make the display easier to read.
im <- importance(rf1)
toshow <- rownames(im)[order(-im[,'MeanDecreaseGini'])][1:9]
tall <- as.data.frame(scale(all[,toshow]))
tall$chapters <- rownames(tall)
tall$cats <- cats
rownames(tall) <- 1:nrow(tall)
propshow <- reshape(tall,direction='long',
    timevar='Word',
    v.names='ScaledScore',
    times=toshow,
    varying=list(toshow))
bwplot(  cats ~ScaledScore  | Word,data=propshow)
Based on this it seems Sanderson uses contractions such as 'didn't', which Jordan did not. Jordan used 'not', 'and' and 'or' more often. 'However' is very much Sanderson.

Predictions

For the predictions I took the proportion of trees voting for each category, as this shows a bit of the uncertainty in the categorization, which I find of interest. To display the predictions, density plots are used. Each panel in the plot shows the strength of the association between a book and a category; the higher the values, the stronger the association. Each row represents a book, each column a category.
ptGS <- predict(rf1,wtGS,type='prob')
pMistborn <- predict(rf1,wMistborn,type='prob')
pKoD <- predict(rf1,wKoD,type='prob')
preds <- as.data.frame(rbind(ptGS,pMistborn,pKoD))
preds$Book <- c(rep('the Gathering Storm',nrow(ptGS)),
    rep('Mistborn',nrow(pMistborn)),rep('Knife of Dreams',nrow(pKoD)))
predshow <- reshape(preds,direction='long',
    timevar='Prediction',v.names='Score',times=c('BS','WoT','RJ'),
    varying=list(w=c('BS','WoT','RJ')))
densityplot(~Score | Prediction + Book,data=predshow)

Interpretation

Knife of Dreams is correctly categorized as Wheel of Time, and Mistborn is correctly categorized as Sanderson. This shows the predictions are indeed performing well, so the item of interest can be examined: the Gathering Storm. It sits solidly in the Sanderson category. Interestingly, it sits a little bit less firmly in Sanderson than Mistborn does, and a bit more in Wheel of Time than Mistborn.

Sunday, December 16, 2012

The Eye of the World as word cloud

The Eye of the World is the first book of Robert Jordan's Wheel of Time books. As the last of these books will be published soon, I was wondering if natural language processing can be used to examine books like these. For this purpose I downloaded a copy from somewhere undisclosed and analyzed it.

During my experiments with this file I found that a word cloud was actually a good way to look at it. My first attempts, using correspondence analysis, did not give anything useful: with everything plotted on top of each other the result is not an interesting plot. Clustering of chapters did not reveal anything nice either. The wordcloud package has comparison clouds, which can be used to differentiate between chapters.
I am sure readers can do their own interpretation of this. Myself, I am surprised by the massive number of names of places and persons in this first book, even though I know the number of persons in the series is large.


R code
r1 <- readLines("Robert Jordan - Wheel Of Time 01 - The Eye Of The World.txt")
#remove text page xxx
pagina <- grep('^Page [[:digit:]]+$',r1)
r1 <- r1[-pagina]
r1 <- sub('Page [[:digit:]]+$','',r1)
# remove empty lines
r1 <- r1[r1!='']
#extract chapter headers
chapterrow <- grep('^(CHAPTER [[:digit:]]+)|(PROLOGUE)$',r1)
chapterrow <- c(chapterrow,length(r1)+1)
#extract chapters
chapters <- sapply(1:(length(chapterrow)-1),function(i) 
      paste(r1[(chapterrow[i]+2):(chapterrow[i+1]-1)],collapse=' '))
chapterrow <- chapterrow[-length(chapterrow)]
#name the chapters
chapternames <- paste(sub('CHAPTER ','',r1[chapterrow]),r1[chapterrow+1])
names(chapters) <- chapternames

# use example processing from tm
library(tm)
EotW <- Corpus(VectorSource(chapters))
EotW <- tm_map(EotW,stripWhitespace)
EotW <- tm_map(EotW,tolower)
EotW <- tm_map(EotW,removeWords,stopwords("English"))
EotW <- tm_map(EotW,stemDocument)
EotW <- tm_map(EotW,removePunctuation)

library(wordcloud)
tdmEotW <- TermDocumentMatrix(EotW)

h1 <- hclust(dist(t(sqrt(as.matrix(tdmEotW )))),method='ward')
# hclust to put related chapters together

# and make a cloud
library(colorspace)
tdmEotW2 <- as.matrix(tdmEotW)[,h1$order]

comparison.cloud(tdmEotW2,random.order=FALSE,scale=c(1.4,.6),title.size=.7,
    colors=rainbow_hcl(n=57))

 

Thursday, December 6, 2012

To reject random walk in climate

I read the post The surprisingly weak case for global warming and the rebuttal Climate: Misspecified. Based on the first I wanted to write a post, just to say I agree with the second.

The post features a number of plots like this one.

For me, one of the noticeable features of this figure is how little the red line deviates from the middle. Hence, in this post I want to examine how extreme this closeness to the middle is.
The red line is
# cumsum(changes)
The blue lines are (repeated):

# jumps = sample(changes, 130, replace=T)
# lines(cumsum(c(0,jumps)), col=rgb(0, 0, 1, alpha = .1))

As a measure of how close the line is to the centre I use
sd(cumsum(changes))
[1] 12.81393
And the simulated distributions:
simdes <- sapply(1:10000,
   function(x) sd(c(0,cumsum(sample(changes, 130, replace=TRUE)))))
Plot of distributions:
plot(density(simdes))
abline(v=sd(cumsum(changes)),col='red')
All of the simulations have a bigger sd than the original:
sum(simdes>sd(cumsum(changes)))
[1] 10000
And that is why I can only say that this random jump model is not the correct model for these data.
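One way to express this more formally (my own addition, a sketch): since none of the 10000 simulated sds is smaller than the observed one, the one-sided Monte Carlo p-value is at most about 1/10001.
# Monte Carlo p-value; add 1 to numerator and denominator so it is never exactly zero
(sum(simdes <= sd(cumsum(changes))) + 1)/(length(simdes) + 1)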

R code

The first part of the code is (obviously) copied, final lines are mine 
theData = read.table("C:\\Users\\...\\GLB.Ts+dSSTcleaned.txt", 
    header=TRUE,na.strings='**') 
theData <- theData[complete.cases(theData),]

# There has to be a more elegant way to do this
theData$means = rowMeans(aggregate(theData[,c("DJF","MAM","JJA","SON")], by=list(theData$Year), FUN="mean")[,2:5])

# Get a single vector of Year over Year changes
rawChanges = diff(theData$means, 1)

# SD on yearly changes
sd(rawChanges,na.rm=TRUE)

# Subtract off the mean, so that the distribution now has an expectation of zero
changes = rawChanges - mean(rawChanges)

# Find the total range, 1881 to 2011
(theData$means[131] - theData$means[1])/100

# Year 1 average, year 131 average, difference between them in hundredths
y1a = theData$means[1]/100 + 14
y131a = theData$means[131]/100 + 14
netChange = (y131a - y1a)*100 

# First simulation, with plotting
plot.ts(cumsum(c(0,rawChanges)), col="red", ylim=c(-300,300), lwd=3, xlab="Year", ylab="Temperature anomaly in hundredths of a degree Celsius")

trials = 1000
finalResults = rep(0,trials)

for(i in 1:trials) {
  jumps = sample(changes, 130, replace=T)
  # Add lines to plot for this, note the "alpha" term for transparency
  lines(cumsum(c(0,jumps)), col=rgb(0, 0, 1, alpha = .1))
  finalResults[i] = sum(jumps)
}
# Re-plot red line again on top, so it's visible again
lines(cumsum(c(0,rawChanges)), col="red", ylim=c(-300,300), lwd=3) 
#
# new lines 
sd(cumsum(changes))
simdes <- sapply(1:10000,function(x) sd(c(0,cumsum(sample(changes, 130, replace=TRUE)))))
plot(density(simdes))
abline(v=sd(cumsum(changes)),col='red')
sum(simdes>sd(cumsum(changes)))

Sunday, December 2, 2012

Triangle tests

Introduction

A triangle test is a test beloved by sensory scientists for its simplicity and general usefulness in detecting the presence of product differences. The principle is simple. Test subjects are served three samples: one of these contains A, two of them contain B. The test subject has the task of indicating which is the odd sample. If enough subjects correctly select the odd sample, then the products are declared different. The reverse may also happen: if few enough odd samples are selected, the products are declared 'same'. This is a commercially relevant question, as A may contain ingredients which are cheaper or from a competing supplier.

Statistical Testing 

As follows from the setup, the chance to select the odd sample under H0 (the products are the same) is 1/3. This allows calculation of the critical values. For 10 to 50 triangles these are plotted below.
The alpha is not quite 5%; it is only guaranteed to be at most 5%. The figure below shows that most of the time the test is conservative. The jumps are due to the discrete nature of the triangle test: for 10 triangles the critical value is the same as for 11 triangles, resulting in a difference in alpha.
The same effect is seen in the error of the second kind. The figure below shows the power against the alternative of 60% correct answers. It also shows that in the neighborhood of 30 triangles must be done to have faith in the ability to detect a sample which gives 60% correct.
This leads to the question: what could be detected with 90% power? This is plotted below. Extreme cases, say 70% or 80% correct, can be detected quite easily. The problem is that these are not relevant; in general, creation or application staff would reject such samples before going to the sensory lab. Hence 30 to 40 triangles is generally found to be the minimum number to test when the aim is to declare samples close enough.
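Once the triangles have actually been run, a single result can also be evaluated directly with a one-sided binomial test. A small sketch with made-up counts:
# hypothetical result: 17 correct identifications out of 30 triangles
binom.test(x=17, n=30, p=1/3, alternative='greater')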


R script

triangle <- data.frame(n=10:50)
triangle$crit.val95 <- qbinom(.95,triangle$n,prob=1/3)
triangle$ph0 <- pbinom(triangle$crit.val95,triangle$n,1/3)
triangle$pow.6 <- pbinom(triangle$crit.val95,triangle$n,.6,lower.tail=FALSE)
triangle$get.pow.90 <- sapply(1:nrow(triangle), function(row) {
      uniroot(function(x) 
          pbinom(triangle$crit.val95[row],
                 triangle$n[row],x,lower.tail=FALSE)
            -.90,
          interval=c(1/3,1))$root
    }
)

plot(crit.val95~n,data=triangle,xlab='Number of triangles',type='b',
    ylab='Critical value',main='Triangle test at 95%')

plot(ph0~n,data=triangle,xlab='Number of triangles',type='b',
    ylab='1-alpha',main='Triangle test at 95%')

plot(pow.6~n,data=triangle,xlab='Number of triangles',type='b',
    ylab='Power for H1 (60% correct)',main='Triangle test at 95%')

plot(get.pow.90~n,data=triangle,xlab='Number of triangles',type='b',
    ylab='Proportion correct for 90% power',main='Triangle test at 95%')





Saturday, November 24, 2012

Secret Santa - unfinished business

Last week I wrote:
This is actually a more difficult calculation (or I forgot too much probability). Luckily a bit of brute force comes in handy. To reiterate, in general simulated data shows 0.54 redraws because of the first person etc.
colSums(countstop)/nrep
[1] 0.54276 0.40735 0.31383 0.24834 0.20415
In this last week I figured out how to finish this more elegantly/without too much brute force.
To reiterate, the chance within a particular round of drawing names are:
(p.onedraw <- c(colSums(redraw)/nrow(redraw),p.succes) )
[1] 0.200 0.150 0.117 0.0917 0.075 0.367
This means that a 'round' of Secret Santa with 5 persons can have 6 outcomes: the first person draws his own name, the second person draws his own name, the same for persons three, four and five, or, the sixth outcome, a valid drawing of names. If a person self-draws, a next round occurs; in principle an infinite number of rounds and drawings may occur. The brute force solution was to simulate a lot of these sequences again and again. But we can calculate the probability of a finish after, for example, twice a fail at person 1 and three times a fail at person 3. It is the chance of that combination, 0.2^2 * 0.117^3 * 0.367, times the number of orders in which it can happen, (2+3)!/(2!*3!). Obviously, this is a generalization of the binomial formula.
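As a worked example (my own sketch, using the rounded chances above), the probability of exactly two fails at person 1, three fails at person 3 and then a valid draw is:
p1 <- 0.2          # person 1 draws his own name
p3 <- 0.11666667   # person 3 draws his own name
pok <- 0.36666667  # a valid drawing
p1^2 * p3^3 * pok * factorial(2+3)/(factorial(2)*factorial(3))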
Now this has to be done for all reasonable combinations of fails. These may be obtained e.g. by
ncalc <- 8
xx <- expand.grid(0:ncalc,0:ncalc,0:ncalc,0:ncalc,0:ncalc)
nrow(xx)
[1] 59049

Implementation phase 1

There are quite a number of probabilities to calculate, and as this will be wholesale, every shortcut is welcome. The main approach is to do the whole calculation on a logarithmic scale and transform back at the end. 
With a bit of preparation log(factorial) is just getting a number out of a vector.
lnumb <- c(0,log(1:40))
clnumb <- cumsum(lnumb)
lfac <- function(x) clnumb[x+1]
exp(lfac(4))
[1] 24
To store the logs of the chances:
lp1d <- log(p.onedraw)
Combine it all into functions
lpoccur <- function(x,lpvals) {
  x <- as.numeric(x)
  crossprod(c(x,1),lpvals) + lfac(sum(x))-sum(sapply(x,lfac))
}
poccur <- function(x,lpvals) exp(lpoccur(x,lpvals))
And call it over the matrix xx
chances <- sapply(1:nrow(xx),function(x) poccur(xx[x,],lp1d))
newcalc.res <- as.data.frame(cbind(xx,chances))
newcalc.res <- newcalc.res[order(-newcalc.res$chances),]
head(newcalc.res)
     Var1 Var2 Var3 Var4 Var5    chances
1       0    0    0    0    0 0.36666667
2       1    0    0    0    0 0.07333333
10      0    1    0    0    0 0.05500000
82      0    0    1    0    0 0.04277778
730     0    0    0    1    0 0.03361111
6562    0    0    0    0    1 0.02750000

Implementation phase 2

The idea was to not store the matrix xx, but rather generate the rows on the fly, because I thought xx would be too large. But it seems that is not needed and a suitable xx fits into memory.
sum(chances)
[1] 0.9998979
Hence the calculation can be vectorised a bit more, reducing calculation time again:
ntrial <- rowSums(xx)
accu <- lfac(ntrial) + crossprod(t(as.matrix(xx)),lp1d[1:(length(lp1d)-1)])
for (i in 1:ncol(xx) ) accu <- accu-lfac(xx[,i])
accu <- accu+lp1d[length(lp1d)]
chance2 <- exp(accu)
newcalc2.res <- cbind(xx,chance2)
newcalc2.res <- newcalc2.res[order(-newcalc2.res$chance2),]
So the values from the simulation are recreated:
sum(newcalc2.res$Var1*newcalc2.res$chance2)
[1] 0.5445788
sum(newcalc2.res$Var5*newcalc2.res$chance2)
[1] 0.2044005
Calculation done! In the end it was not such a difficult or long calculation. However, a simulation is much easier to construct, and it clearly pointed out the direction to take.

Sunday, November 18, 2012

Secret Santa - again

Based on comments by cellocgw I decided to look at last week's Secret Santa again. This time, the moment a person, whoever that is, draws his/her own name, the drawing starts again at the first person.

Introduction

A group of n persons sequentially draws names for Secret Santa. Each person may not draw his/her own name. If a person draws his/her own name then all names are returned and the drawing starts again. Questions are, for example: how often do you need to draw names?

Simulation 

To understand the problem a simulation is built. The simulation draws the names, with restarts if needed; these are placed into outcome. On top of that it counts, in countstop, where a person self-draws.
selsan1 <- function(persons) {
  startoutcome <- rep(0,persons)
  countstop <- rep(0,persons)
  outcome <- startoutcome
  done <- FALSE
  possible <- 1:persons
  who <- 1           
  while(!done) {
    remaining <- possible[!(possible %in% outcome)]
    if (length(remaining)==1) {
      if (who==remaining) {
        countstop[who] <- countstop[who]+1
        outcome <- startoutcome
        who <- 1
      } else {
        outcome[who] <- remaining
        who <- who+1
        done <- TRUE
      }
    }     else {
      select <- sample(possible[!(possible %in% outcome)],1)
      if (who==select) {
        countstop[who] <- countstop[who]+1
        outcome <- startoutcome
        who <- 1
      } else {
        outcome[who] <- select
        who <- who+1
      }
    }
  }     
  return(list(outcome=outcome,countstop=countstop))
}
In an unlucky draw for five persons there were four restarts, at person 1, 2, 3 and 5.
selsan1(5)
$outcome
[1] 2 1 5 3 4

$countstop
[1] 1 1 1 0 1
The aim of the simulation is to do this a lot of times and draw conclusions from there. 
nrep=1e5
simulations <- lapply(1:nrep,function(x) selsan1(5))
The first question concerns the outcomes. There are 44 allowed results, and they are obtained about equally often.
outcomes <- sapply(simulations,function(x) paste(x$outcome,collapse=' '))
as.data.frame(table(outcomes))
    outcomes Freq
1  2 1 4 5 3 2301
2  2 1 5 3 4 2263
3  2 3 1 5 4 2192
4  2 3 4 5 1 2323
5  2 3 5 1 4 2268
6  2 4 1 5 3 2230
7  2 4 5 1 3 2280
8  2 4 5 3 1 2242
9  2 5 1 3 4 2203
10 2 5 4 1 3 2247
11 2 5 4 3 1 2327
12 3 1 2 5 4 2321
13 3 1 4 5 2 2243
14 3 1 5 2 4 2183
15 3 4 1 5 2 2346
16 3 4 2 5 1 2264
17 3 4 5 1 2 2216
18 3 4 5 2 1 2269
19 3 5 1 2 4 2301
20 3 5 2 1 4 2359
21 3 5 4 1 2 2301
22 3 5 4 2 1 2206
23 4 1 2 5 3 2291
24 4 1 5 2 3 2249
25 4 1 5 3 2 2263
26 4 3 1 5 2 2321
27 4 3 2 5 1 2265
28 4 3 5 1 2 2318
29 4 3 5 2 1 2329
30 4 5 1 2 3 2220
31 4 5 1 3 2 2309
32 4 5 2 1 3 2228
33 4 5 2 3 1 2291
34 5 1 2 3 4 2284
35 5 1 4 2 3 2314
36 5 1 4 3 2 2304
37 5 3 1 2 4 2339
38 5 3 2 1 4 2213
39 5 3 4 1 2 2285
40 5 3 4 2 1 2194
41 5 4 1 2 3 2255
42 5 4 1 3 2 2336
43 5 4 2 1 3 2218
44 5 4 2 3 1 2289
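As an aside (my own check): the 44 allowed results are exactly the derangements of 5 elements, which can be counted directly; note that 44/120 is the roughly 0.367 chance of a valid draw that appears below.
# number of derangements of 5 elements: 5! * sum_{k=0..5} (-1)^k/k!
factorial(5) * sum((-1)^(0:5)/factorial(0:5))  # 44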
The number of times a redraw is taken can also be obtained. 
countstop <- t(sapply(simulations,function(x) x$countstop))
table(rowSums(countstop))/nrep
      0       1       2       3       4       5       6       7       8       9 
0.36938 0.23036 0.14788 0.09337 0.05929 0.03673 0.02220 0.01440 0.00988 0.00603 
     10      11      12      13      14      15      16      17      18      19 
0.00396 0.00240 0.00168 0.00089 0.00053 0.00041 0.00021 0.00014 0.00012 0.00005 
     20      21      22      28 
0.00004 0.00003 0.00001 0.00001 
We can also extract where the redraws occur. On average there are about 0.54 redraws caused by the first person, 0.41 by the second, etc. The numbers do not add up to 1; they are not probabilities.
colSums(countstop)/nrep
[1] 0.54276 0.40735 0.31383 0.24834 0.20415

Calculations

The question is: can we achieve the same with a calculation? For this we obtain the chances of the various results by building three matrices. All permutations are in pp. Whether a sequence is still permitted up to a given person is in permitted. Finally, redraw marks the person who causes a new draw. The trick here is that if person 2 causes a redraw, then no subsequent person can cause a redraw in that round, hence only person 2 is marked in redraw.
nn <- 5 # number of persons
pp <- randtoolbox::permut(nn)
redraw <- matrix(FALSE,nrow(pp),nn)
permitted <- redraw
redraw[,1] <- pp[,1] ==1
permitted[,1] <- pp[,1]!=1

for(i in 2:nn) {
  permitted[,i] <- pp[,i]!=i & permitted[,i-1]
  redraw[,i] <- pp[,i] == i & permitted[,i-1]
}
The sequences start like this.
head(pp)
     [,1] [,2] [,3] [,4] [,5]
[1,]    5    4    3    1    2
[2,]    5    4    3    2    1
[3,]    5    4    1    3    2
[4,]    5    4    2    3    1
[5,]    5    4    1    2    3
[6,]    5    4    2    1    3
head(permitted)
     [,1] [,2]  [,3]  [,4]  [,5]
[1,] TRUE TRUE FALSE FALSE FALSE
[2,] TRUE TRUE FALSE FALSE FALSE
[3,] TRUE TRUE  TRUE  TRUE  TRUE
[4,] TRUE TRUE  TRUE  TRUE  TRUE
[5,] TRUE TRUE  TRUE  TRUE  TRUE
[6,] TRUE TRUE  TRUE  TRUE  TRUE
head(redraw)
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] FALSE FALSE  TRUE FALSE FALSE
[2,] FALSE FALSE  TRUE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE

Chance of success

The chance of a successful drawing is the mean of the last column of permitted. Below is a comparison with the simulation result. First the observed proportions.
byrow <- as.data.frame(table(rowSums(countstop))/nrep)
head(byrow)
  Var1    Freq
1    0 0.36938
2    1 0.23036
3    2 0.14788
4    3 0.09337
5    4 0.05929
6    5 0.03673
Now the matching calculation. The numbers can be calculated easily.
(p.succes <- mean(permitted[,nn]))
[1] 0.3666667
byrow$n <- as.numeric(levels(byrow$Var1)[byrow$Var1])
byrow$p <- sapply(byrow$n,function(x) p.succes*(1-p.succes)^x)
byrow[,c(1,3,4,2)]
   Var1  n            p    Freq
1     0  0 3.666667e-01 0.36938
2     1  1 2.322222e-01 0.23036
3     2  2 1.470741e-01 0.14788
4     3  3 9.314691e-02 0.09337
5     4  4 5.899305e-02 0.05929
6     5  5 3.736226e-02 0.03673
7     6  6 2.366277e-02 0.02220
8     7  7 1.498642e-02 0.01440
9     8  8 9.491398e-03 0.00988
10    9  9 6.011219e-03 0.00603
11   10 10 3.807105e-03 0.00396
12   11 11 2.411167e-03 0.00240
13   12 12 1.527072e-03 0.00168
14   13 13 9.671458e-04 0.00089
15   14 14 6.125256e-04 0.00053
16   15 15 3.879329e-04 0.00041
17   16 16 2.456908e-04 0.00021
18   17 17 1.556042e-04 0.00014
19   18 18 9.854933e-05 0.00012
20   19 19 6.241457e-05 0.00005
21   20 20 3.952923e-05 0.00004
22   21 21 2.503518e-05 0.00003
23   22 22 1.585561e-05 0.00001
24   28 28 1.023239e-06 0.00001
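Since the number of failed rounds follows this geometric distribution, the expected total number of redraws can also be checked against the simulation (a quick sketch of my own; both come out around 1.7):
# expected number of redraws per completed drawing vs. the simulated average
(1 - p.succes)/p.succes
mean(rowSums(countstop))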

Where the redraws fall

This is actually a more difficult calculation (or I forgot too much probability). Luckily a bit of brute force comes in handy. To reiterate, in general simulated data shows 0.54 redraws because of the first person etc.
colSums(countstop)/nrep
[1] 0.54276 0.40735 0.31383 0.24834 0.20415
So what happens in a drawing? The outcomes follow from the matrix redraw. There is a 20% chance the first person draws a 1, a 15% chance the second person draws a 2, etc. Finally, as established, there is a 36.7% chance of a good draw.
(p.onedraw <- c(colSums(redraw)/nrow(redraw),p.succes) )
[1] 0.20000000 0.15000000 0.11666667 0.09166667 0.07500000 0.36666667
The function below takes these chances, the current status vector and its probability, and returns a data frame with the probability of each outcome, the location of the fail, and the success status in the column finish.
one.draw <- function(status.now,p.now,p.onedraw) {
  la <- lapply(1:(nn+1),function(x) {
        status <- status.now
        if (x>nn) finish=TRUE
        else {          
          status[x] <- status[x] +1
          finish=FALSE}
        list(status=status,p=p.onedraw[x],finish=finish)
      })
  res <- as.data.frame(do.call(rbind, lapply(la,function(x) x$status)))
  res$p <- sapply(la,function(x) x$p*p.now)
  res$finish <- sapply(la,function(x) x$finish)
  res
}
status.now <- rep(0,nn)
names(status.now) <- paste('person',1:5,sep='')
od <- one.draw(status.now,1,p.onedraw)
od
  person1 person2 person3 person4 person5          p finish
1       1       0       0       0       0 0.20000000  FALSE
2       0       1       0       0       0 0.15000000  FALSE
3       0       0       1       0       0 0.11666667  FALSE
4       0       0       0       1       0 0.09166667  FALSE
5       0       0       0       0       1 0.07500000  FALSE
6       0       0       0       0       0 0.36666667   TRUE
Where the sequence is not finished, the same chances apply again. For this a second function is built. Identical outcomes are combined to restrict the number of rows.
one.draw.wrap <- function(x,p.onedraw) {
  if (x$finish) return(x)
  one.draw(x[grep('person',names(x))],x$p,p.onedraw)
}

new.draw <- function(od,p.onedraw) {
  todo <- od[!od$finish,]
  done <- od[od$finish,]
  la <- lapply(1:nrow(todo),function(x) one.draw.wrap(todo[x,],p.onedraw))
  snd <- do.call(rbind,la)
  snd <- snd[do.call(order,snd),]
  i <- 1
  # combine rows with identical fail counts by adding their probabilities
  while(i<nrow(snd) ) {
    if(all(snd[i,!(colnames(snd)=='p')] == snd[i+1,!(colnames(snd)=='p')])) {
      snd$p[i] <- snd$p[i] + snd$p[i+1]
      snd <- snd[-(i+1),]
    }
    else i=i+1
  }
  snd <- rbind(snd, done)
  snd[order(-snd$p),]
}
head(new.draw(od,p.onedraw))

   person1 person2 person3 person4 person5          p finish
61       0       0       0       0       0 0.36666667   TRUE
6        1       0       0       0       0 0.07333333   TRUE
2        1       1       0       0       0 0.06000000  FALSE
25       0       1       0       0       0 0.05500000   TRUE
3        1       0       1       0       0 0.04666667  FALSE
35       0       0       1       0       0 0.04277778   TRUE
There are still quite a few unfinished drawings, so we can redo this a number of times. This is where a fast processor is practical.
myfin <- list(od=od)
for (i in 1:15) myfin[[i+1]] <- new.draw(myfin[[i]],p.onedraw)
Even after 15 steps there are some unfinished sequences. Nevertheless I stop here.
xtabs(p ~finish,data=myfin[[15]])
finish
      FALSE        TRUE 
0.001057999 0.998942001 
In the finished series there are on average 0.54 redraws at person 1. That is pretty close to the 0.54 from the simulation.
with(myfin[[15]],sum(person1[finish]*p[finish]))
[1] 0.5398659
However, the 0.1% of unfinished sequences would still contribute quite a bit, even if they were finished at step 15.
with(myfin[[15]],sum(person1*p))
[1] 0.5448775