Sunday, April 21, 2013

Ordinal data, models with observers

I recently made three posts regarding the analysis of ordinal data: a post looking at all methods I could find in R, a post with an additional method and a post using JAGS. Common to all three was the use of the cheese data, a data set in which only product information is available; there are no observers. The real world is different: there are observers, and they differ quite a bit from each other. So, in this post I add observers and look at two models again: simple ANOVA and, at the complex end, an ordinal model with random observers.

Data

To spare myself looking for data, and keeping in mind that I may want to run some simulations, I have created a data generator. Into this generator I put some of the properties I could imagine observers having. First of all, they have different sensitivities: what is very salty for one is not salty at all for another. This is expressed as an additive and a multiplicative effect. Second, they have observation error; the same product does not always give the same response. Third, there are the categories, which observers use slightly differently. I am well aware that these are far more properties than I could ever estimate: a typical consumer tastes only two or three products in a test, while I have many more parameters per observer. Finally, the outer limits (categories 1 and 9) are placed a bit further out, because there is an aversion to using these extreme categories.
# convert numeric intensities into categories, given the cut-off points
num2scale <- function(x,scale) {
  cut(x,c(-Inf,scale,Inf),labels=1:(length(scale)+1))
}
# individual cut-off points: population cut-offs plus noise, kept in order
pop.limits2ind.limits <- function(scale,sd=.5) {
  newscale <- scale+rnorm(length(scale),sd=sd)
  newscale[order(newscale)]
}

obs.score <- function(obs,pop.score,pop.limits,scale.sd=0.5,
    sensitivity.sd=.1, precision.sd=1,
    additive.sd=2,center.scale=5,
    labels=LETTERS[1:length(pop.score)]) {
  # individual sensitivity (multiplicative)
  obs.errorfreeintensity <- center.scale + 
      (pop.score-center.scale)*rlnorm(1,sd=sensitivity.sd)
  #individual (additive) 
  obs.errorfreeintensity <- obs.errorfreeintensity +
      rnorm(1,mean=0,sd=additive.sd)
  # individual observation error 
  obs.intensity <- obs.errorfreeintensity+
      rnorm(length(pop.score),sd=precision.sd)
  # individual cut offs between categories  
  obs.limits <- pop.limits2ind.limits(pop.limits,sd=scale.sd)
  obs.score <- num2scale(obs.intensity,obs.limits)
  data.frame(obs=obs,
      score = obs.score,
      product=labels
  )
}

panel.score <- function(n=100,pop.score,pop.limits,scale.sd=0.5,
    sensitivity.sd=.1,precision.sd=1,
    additive.sd=2,center.scale=5,
    labels=LETTERS[1:length(pop.score)]) {
  la <- lapply(1:n,function(x) {
        obs.score(x,pop.score,pop.limits,scale.sd,sensitivity.sd,
            precision.sd,additive.sd,center.scale,labels=labels)
      })
  dc <- do.call(rbind,la)
  dc$obs <- factor(dc$obs)
  dc
}

# population cut-offs (9 categories) and product means; simulate 40 observers
pop.limits <- c(1,3:8,10)
prodmeans <- c(7,7,8)
scores <- panel.score(40,prodmeans,pop.limits)
# numeric version of the scores, needed for ANOVA
scores$numresponse <- as.numeric(levels(scores$score))[scores$score]
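
For a quick check of the simulated data (not part of the analysis below) one can look at the first rows and cross-tabulate product against the categories actually used:
head(scores)
with(scores,table(product,score))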

Analysis - ANOVA

The first analysis method is ANOVA. Plain ANOVA. Not because I dislike variance components or such, but because this is what is done most commonly. On top of that, I cannot imagine somebody taking these data, pulling them through a mixed model, while forgetting that they do not come from a continuous scale.
library(multcomp)
Res.aov <- aov( numresponse ~ obs + product , data=scores)
anova(Res.aov)

Analysis of Variance Table

Response: numresponse
          Df  Sum Sq Mean Sq F value    Pr(>F)    
obs       39 239.300  6.1359  6.3686 2.321e-12 ***
product    2   9.517  4.7583  4.9388   0.00956 ** 
Residuals 78  75.150  0.9635                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
mt <- model.tables(Res.aov,type='means',cterms='product') 

data.frame(dimnames(mt$tables$product)[1],
    Response=as.numeric(mt$tables$product),
    LSD=cld(glht(Res.aov,linfct=mcp(product='Tukey'))
    )$mcletters$monospacedLetters
)    
  product Response LSD
A       A    6.725  a 
B       B    6.850  a 
C       C    7.375   b

Analysis - cumulative link mixed model (clmm)

The alternative is the other extreme: do it as correctly as currently available, using both a cumulative link and random observers. This model is readily available in the ordinal package. This means two intermediate models are not shown; both lack the simplicity of ANOVA while being less correct than clmm, hence are inferior to both clmm and ANOVA.
library(ordinal) 
Res.clmm <- clmm(score  ~ product + (1|obs),data=scores)
summary(Res.clmm)

Cumulative Link Mixed Model fitted with the Laplace approximation

formula: score ~ product + (1 | obs)
data:    scores

 link  threshold nobs logLik  AIC    niter    max.grad cond.H
 logit flexible  120  -179.57 379.14 37(5046) 8.77e-06 Inf   

Random effects:
      Var Std.Dev
obs 8.425   2.903
Number of groups:  obs 40 

Coefficients:
         Estimate Std. Error z value Pr(>|z|)   
productB   0.1309     0.4225   0.310  0.75680   
productC   1.4640     0.4718   3.103  0.00192 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Threshold coefficients:
    Estimate Std. Error z value
2|3  -6.0078         NA      NA
3|4  -5.6340         NA      NA
4|5  -4.1059     0.4528  -9.069
5|6  -3.0521     0.4769  -6.400
6|7  -1.0248     0.4970  -2.062
7|8   0.5499     0.5303   1.037
8|9   4.0583     0.7429   5.463
Res.clmm2 <- clmm(score ~ 1 + (1|obs),data=scores)

anova(Res.clmm2,Res.clmm)

Likelihood ratio tests of cumulative link models:

          formula:                    link: threshold:
Res.clmm2 score ~ 1 + (1 | obs)       logit flexible  
Res.clmm  score ~ product + (1 | obs) logit flexible  

          no.par    AIC  logLik LR.stat df Pr(>Chisq)   
Res.clmm2      8 387.41 -185.70                         
Res.clmm      10 379.14 -179.57  12.266  2    0.00217 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
# top 2 box (categories 8 and 9) for an average observer:
# invlogit(product effect - threshold 7|8)
co <- coef(Res.clmm)[c('7|8','productB','productC')]
vc <- vcov(Res.clmm)[c('7|8','productB','productC'),
    c('7|8','productB','productC')]
names(co) <- levels(scores$product)
# standard errors of -threshold and -threshold + product effect (delta method)
sd <- sqrt(c(vc[1,1],diag(vc)[-1]+vc[1,1]-2*vc[1,-1]))
data.frame(
    `top 2 box`=arm::invlogit(c(-co[1],-co[1]+co[-1])),
    `2.5% limit`=arm::invlogit(c(-co[1],-co[1]+co[-1])+qnorm(.025)*sd),
    `97.5% limit`=arm::invlogit(c(-co[1],-co[1]+co[-1])+qnorm(.975)*sd),
    check.names=FALSE
)

  top 2 box 2.5% limit 97.5% limit
A 0.3658831  0.1694873   0.6199713
B 0.3967401  0.1753347   0.6704337
C 0.7138311  0.4498042   0.8838691
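
The same intervals can also be approximated by simulation instead of the delta method. A minimal sketch, assuming the MASS package is available and the relevant part of the covariance matrix is well behaved; it draws from the approximate normal distribution of the three coefficients and transforms each draw:
library(MASS)
sim <- mvrnorm(10000,mu=co,Sigma=vc)
# top 2 box per draw: invlogit(product effect - threshold 7|8)
top2 <- arm::invlogit(cbind(-sim[,1],-sim[,1]+sim[,2],-sim[,1]+sim[,3]))
colnames(top2) <- levels(scores$product)
apply(top2,2,quantile,probs=c(.025,.975))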

Sunday, April 14, 2013

Continuing Sync

I am continuing with Sync: How Order Emerges from Chaos in the Universe, Nature, and Daily Life by Steven Strogatz. To get a feeling for it, I was building a group of things which, while having only a minute influence on each other, are able to synchronize their behavior. I have since gotten some more explanation of how things fit together and adapted my calculations. Two things have changed.
You can actually use one of the items as a kind of clock and only register the status when it goes to zero. This makes it possible to plot behavior over much longer times. Also, I found my function of last week a bit primitive, as it goes step-wise through time. At first I thought of reducing the step size, but why not do the more exact thing and follow a function? So, I scaled the logit function a bit to describe the behavior of an item and built some functions to determine when an item goes over the threshold. Code is at the bottom.
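
As a small illustration of that idea (a toy example, not part of the code below): the state of an item is a rescaled logit of time, so state and elapsed time can be converted back and forth, and the exact moment at which the threshold is crossed can be computed rather than stepped towards.
library(arm)
t <- seq(0,6,by=.5)
s <- 2*invlogit(t)-1       # state as a function of elapsed time
t.back <- logit((1+s)/2)   # and back from state to time
all.equal(t,t.back)        # TRUE, the mapping is invertible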

Even with a small additive effect they synchronize, as shown below. They are not completely synchronized, but close enough for me.



R code

library(ggplot2)
library(arm)


# plot a (thinned) selection of the stored cycles; a low alpha shows where lines bunch
plotpart <- function(xmat,fname='een.png',thin=1) {
  nstep <- ncol(xmat)
  coluse <- seq(1,nstep,by=thin)
  xmat <- xmat[,coluse]
  fin <- length(coluse)
  nitems <- nrow(xmat)
  df <- data.frame(score=as.numeric(xmat),
      cycle=rep((coluse),each=nitems),
      item =rep(1:nitems,fin))
  g<- ggplot(df,aes(x=cycle,y=score,group=item,alpha=score)) + 
      geom_line() + 
      scale_alpha_continuous(range=c(.02,.0201) )+
      theme(legend.position='none') +
      xlab('Iteration') + theme(panel.background = element_rect(fill='white'))
  png(fname)
  print(g)
  dev.off()
}

# state of an item as a rescaled logit of time (0 at time 0, approaching 1)
time2score <- function(x) 2*invlogit(x)-1
# inverse: the time at which a given state is reached
score2time <- function(x) logit((1+x)/2)

# time needed to move from score1 to score2
time_delta_score <- function(score1,score2=.99) {
  score2time(score2)-score2time(score1)
}
# state after a further tdelta units of time
score_delta_time <- function(scorenow,tdelta) {
  tnow <- score2time(scorenow)
  time2score(tnow+tdelta)
}

# advance all items by the time the leading item needs to reach the limit;
# items at the limit reset to zero and spill a small amount to the others,
# which may in turn push more items over the limit
onestep <- function(score,limit=0.99,spil=1e-4) {
  mscore <- max(score)
  dtnew <- time_delta_score(mscore,limit)
  scorenew <- score_delta_time(score,dtnew)
  maxed1 <- maxed <- scorenew==max(scorenew) #>limit*.99999999
  while(sum(maxed)>0) {
    scorenew[!maxed] <- scorenew[!maxed] + sum(maxed)*spil
    scorenew[maxed] <- 0
    maxed <- scorenew>limit
    maxed1 <- maxed | maxed1
  }
  scorenew[maxed1] <- 0
  scorenew
}

nitems <- 500
niter <- 10000
xmat <- matrix(0,nrow=nitems,ncol=niter)

# random starting states; item 1 acts as the clock: a column is stored each time it resets
score  <- time2score(runif(nitems,0,score2time(.99)))
n <- 0
while (n < niter) {
  score <- onestep(score,spil=1e-7)
  if (score[1]==0) {
    n<- n+1
    xmat[,n] <- score
  }
}

plotpart(xmat[1:250,],'een.png',thin=20)

Sunday, April 7, 2013

Sync

I am listening to the audiobook Sync: How Order Emerges from Chaos in the Universe, Nature, and Daily Life by Steven Strogatz, which I got from Audible. Obviously a mathematical book is not ideal to listen to, but since the illustrations are lacking I can make them myself. On top of that, you learn more from programming than from listening.

He mentioned a simple system which synchronizes itself. As far as I understand it, a large number of items is considered, each with more or less the same properties. Time proceeds in small steps. At every step each item gets a small additive increase and a multiplicative decrease. When an item gets over a limit, it resets to zero and causes a small increase in all the other items.

What is supposed to happen is that at low values you get steep slopes and at high values shallow slopes. As time progresses, the small increases due to resets cause synchronization.
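
As a small illustration of this rule (a toy example with made-up numbers, not the parameters used in the code below):
# one step for three items: shrink, add, reset items over the limit,
# and spill a little to the remaining items (illustrative values only)
x <- c(.2,.5,.95)
shrink <- .99; add <- .06; limit <- 1; spil <- .01
x <- x*shrink + add                        # multiplicative decrease plus additive effect
maxed <- x > limit                         # which items went over the limit?
x[!maxed] <- x[!maxed] + sum(maxed)*spil   # the others get a small boost
x[maxed] <- 0                              # reset to zero
x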

Programming this was not difficult, except for the initialization. If you start the items at random levels you have too many at low levels. In the end I decided to add one item at each time point during an init period. What did surprise me was that calculating what happens actually takes less computer time than plotting it. But maybe that is because I wanted to use ggplot with a low alpha to show the bunching of items. I also discovered that it is not wise to display too complex a figure in ggplot when running in Eclipse; it can freeze the whole program. I took to saving all plots rather than displaying them within Eclipse.

In the init phase every item has a different start and ending time. Note the black line at level 0, which consists of items that are not yet running.
After 10000 steps it is still chaos, with a few somewhat fatter lines.
But at 50000 steps it is clear that the things are synchronizing.
Not that they are perfect after 300000 steps, but they get closer and closer.
It is really happening. So simple, and it works. Having said that, it took some tweaking to find parameters which take a while to result in sync, yet do not degenerate either. I guess it gets easier to sync when more items are syncing, but that rapidly runs into the limits of my hardware. I must say I look forward to listening to the rest of the book, and I hope to make more plots and pick up a bit of the math which is supposed to be behind all this.

R code



library(ggplot2)

# plot the stored items over time, offsetting the time axis by xstart
plotpart <- function(xmat,fname='een.png',xstart=1) {
  fin <- ncol(xmat)
  nitems <- nrow(xmat)
  df <- data.frame(score=as.numeric(xmat),
      time=rep((1:fin)+xstart,each=nitems),
      item =rep(1:nitems,fin))
  g<- ggplot(df,aes(x=time,y=score,group=item,alpha=score)) + 
      geom_line() + 
      scale_alpha_continuous(range=c(.05,.0501) )+
      theme(legend.position='none')
  png(fname)
  print(g)
  dev.off()
}

# one time step: shrink, add, reset items over the limit and spill a small
# amount to the others
nextgen <- function(x,add,shrink=.99,limit=1,spil=.0105) {
  x <- x*shrink+add
  maxed <- x>limit
  x[maxed] <- .01 #add[maxed]
  x[!maxed] <- x[!maxed] + sum(maxed)*spil
  x[x>limit] <- limit
  x
}

# run niter steps and keep all intermediate states
ngenmat <- function(xstart,niter,add,spil) {
  nitems <- length(xstart)
  xmat <- matrix(0,nrow=nitems,ncol=niter)
  xmat[,1 ] <- xstart
  for ( i in 2:niter) {
    xmat[,i] <- nextgen(xmat[,i-1],add=add,spil=spil)
  }
  xmat
}

# run niter steps and keep only the final state
ngenmove <- function(xstart,niter,add,spil) {
  for ( i in 1:niter) {
    xstart <- nextgen(xstart,add=add,spil=spil)
  }
  xstart
}

# init a run
nitems <- 1000
niter <- 1000
xmat <- matrix(0,nrow=nitems,ncol=niter)
for (i in 1:nitems) xmat[i,i] <- runif(1,min=.01,max=.02)
add <- rnorm(nitems,.0103,.00005)
spil <- .000025
for ( i in 2:niter) {
  todo <- xmat[,i-1]>0
  xmat[todo,i] <- nextgen(xmat[todo,i-1],add=add[todo],spil=spil)
}
#and show it
show <- sample(1:nrow(xmat),50)
plotpart(xmat[show,],fname='een.png',xstart=1)

# move 9000 steps
x1 <- ngenmove(xmat[,ncol(xmat)],niter=9000,add=add,spil=spil)
xm1 <- ngenmat(x1,niter=1000,add=add,spil=spil)
plotpart(xm1[show,],fname='twee.png',xstart=10000)

x2 <- ngenmove(xm1[,ncol(xm1)],niter=39000,add=add,spil=spil)
xm2 <- ngenmat(x2,niter=1000,add=add,spil=spil)
plotpart(xm2[show,],fname='drie.png',xstart=50000)

x3 <- ngenmove(xm2[,ncol(xm2)],niter=49000,add=add,spil=spil)
xm3 <- ngenmat(x3,niter=1000,add=add,spil=spil)
plotpart(xm3[show,],fname='vier.png',xstart=99000)

x4 <- ngenmove(xm3[,ncol(xm3)],niter=49000,add=add,spil=spil)
xm4 <- ngenmat(x4,niter=1000,add=add,spil=spil)
plotpart(xm4[show,],fname='vijf.png',xstart=150000)

x5 <- ngenmove(xm4[,ncol(xm4)],niter=49000,add=add,spil=spil)
xm5 <- ngenmat(x5,niter=1000,add=add,spil=spil)
plotpart(xm5[show,],fname='zes.png',xstart=200000)

x6 <- ngenmove(xm5[,ncol(xm5)],niter=99000,add=add,spil=spil)
xm6 <- ngenmat(x6,niter=1000,add=add,spil=spil)
plotpart(xm6[show,],fname='zeven.png',xstart=300000)