Sunday, January 27, 2013

European Fishing

I am playing around with Eurostat data and ggplot2 a bit more. As I progress, the plotting seems to get easier, the data pre-processing a bit simpler, and the surprises in the data remain.

Eurostat data

The data used are fish_fleet (number of ships) and fish_pr (production = catch + aquaculture). After a bit of year selection, 1992 and later, I decided to pull the data not as xls but as csv with number formatting like '1 234.56'. The consequence is that the data now come tall and skinny, which may actually be better. However, the number format and ':' for missing values still require a bit of processing.
library(ggplot2)
# Value uses ' ' as thousands separator and ':' for missing,
# so read it as character and convert afterwards
fleet <- read.csv("fish_fleet_1_Data.csv",colClasses=c(rep(NA,4),'character'))
fleet$Number <- scan(textConnection(gsub(' ', '',fleet$Value)),na.strings=':')
catch <- read.csv("fish_pr_00_1_Data.csv",colClasses=c(rep(NA,5),'character'))
catch$Tonnes <- scan(textConnection(gsub(' ', '',catch$Value)),na.strings=':')
The GEO labels still need to be shortened a bit:
shortlevels <- function(xx) {
  levels(xx) <- gsub('European Economic Area','EEA' ,levels(xx))
  levels(xx) <- gsub(' plus IS, LI, NO','+' ,levels(xx),fixed=TRUE)
  levels(xx) <- sub(' countries)',')' ,levels(xx),fixed=TRUE)
  levels(xx) <- sub(' (under United Nations Security Council Resolution 1244/99)','' ,levels(xx),fixed=TRUE)
  levels(xx) <- sub('European Free Trade Association','EFTA' ,levels(xx),fixed=TRUE)
  levels(xx) <- sub('Former Yugoslav Republic of Macedonia, the','FYROM' ,levels(xx),fixed=TRUE)
  levels(xx) <- sub('including +former GDR','Incl GDR' ,levels(xx))
  levels(xx) <- sub('European Union','EU' ,levels(xx))
  levels(xx)[grep('Germany',levels(xx))] <- 'Germany'
  xx
}
catch$GEO <- shortlevels(catch$GEO)
fleet$GEO <- shortlevels(fleet$GEO)

Plot about the fleet

The only preparation needed was to select Tonnage as the property and to keep countries only; the EFTA, EEA and EU aggregates have a number like 15, 25 or 27 in them, which makes them easy to drop.
f2 <- fleet[grep('Tonnage',as.character(fleet$VESSIZE)) ,]
f2 <- f2[-grep('15|25|27',f2$GEO),]
f2 <- f2[complete.cases(f2),]
f2$VESSIZE <- factor(f2$VESSIZE)
f2$GEO <- factor(f2$GEO)
Order the levels of VESSIZE by numeric value for a nice display:
lev <- gsub('(-|\\+).*','',levels(f2$VESSIZE))
nlev <- as.numeric(gsub('^[[:alpha:]]* ','',lev))
f2$VESSIZE <- factor(as.character(f2$VESSIZE),
    levels= levels(f2$VESSIZE)[order(nlev,lev)])
levels(f2$VESSIZE) <-    gsub('Tonnage ','',levels(f2$VESSIZE))
The first aim is a dotplot of the last year (2010), with countries ordered by fleet size:
f3 <- f2[f2$TIME==2010 ,]
f4 <- f3[f3$VESSIZE=='Total all Classes',]
f3$GEO <- factor(as.character(f3$GEO),
    levels=as.character(f4$GEO[order(f4$Number)]))
ggplot(f3,
        aes(y=GEO,x=Number,colour=VESSIZE))  + 
    geom_point() +
    labs(colour='Tonnage')
It seems Greece had the largest fleet. All my thoughts that the Netherlands was a fishing country have been erased. 
For a time-related plot I chose to put the number of vessels on a logarithmic scale. As the number of countries is a bit large, only the biggest ones have been selected.
mfleet <- aggregate(f4$Number,list(GEO=f4$GEO),max)
bigfleet <- mfleet$GEO[mfleet$x>quantile(mfleet$x,1-9/nrow(mfleet))]
ggplot(f2[f2$GEO %in% bigfleet & f2$VESSIZE!='Total all Classes'  ,],
        aes(x=TIME,y=Number,colour=VESSIZE))  + 
    geom_line() +
    facet_wrap( ~ GEO, drop=TRUE) + 
    scale_y_log10()  +
    labs(colour='Tonnage')
The interesting thing about this plot is that the number of vessels is decreasing, except for one category: the biggest, more than 2000 tonnage. There are only a few tens of those, but each must count for loads of smaller vessels.

Catch

The fish catch is probably a similar story. In this case, SPECIES and GEO have far too many levels for a decent display, so only the biggest catches are shown. On top of that, three SPECIES categories are almost the same: 'Total', 'Aquatic animals' and 'Finfish and invertebrates'. 
Finfish probably needs an explanation. To quote Wikipedia: "Many types of aquatic animals commonly referred to as 'fish' are not fish in the sense given above; examples include shellfish, cuttlefish, starfish, crayfish and jellyfish. In earlier times, even biologists did not make a distinction – sixteenth century natural historians classified also seals, whales, amphibians, crocodiles, even hippopotamuses, as well as a host of aquatic invertebrates, as fish. However, according to the definition above, all mammals, including cetaceans like whales and dolphins, are not fish. In some contexts, especially in aquaculture, the true fish are referred to as finfish (or fin fish) to distinguish them from these other animals."
c2010 <- catch[catch$TIME==2010,] 
c2010 <- c2010[complete.cases(c2010),]
mcatch <- aggregate(c2010$Tonnes,list(GEO=c2010$GEO),max)
bigcatch <- mcatch$GEO[mcatch$x>quantile(mcatch$x,.5)]
c2010 <- c2010[c2010$GEO %in% bigcatch,]
mcatch <- aggregate(c2010$Tonnes,list(SPECIES=c2010$SPECIES),max)
bigcatch <- mcatch$SPECIES[mcatch$x>quantile(mcatch$x,.5)]
bigcatch <- bigcatch[!(bigcatch %in% 
          c('Aquatic animals','Finfish and invertebrates'))]
c2010 <- c2010[c2010$SPECIES %in% bigcatch,]
c2010$SPECIES <- factor(c2010$SPECIES)

ggplot(c2010,
        aes(y=GEO,x=Tonnes,colour=SPECIES))  + 
    geom_point() +
    labs(x='Tonnes live weight', colour='Species')
The surprise here is Denmark: it is getting loads of fish. The same is true for the UK and Spain.

Combination of fleet and catch

Since we have both data sets, they can be combined. The merging ids are GEO and TIME, which means the data have to be reshaped to wide format beforehand. The newly created variable names get 'Number.' and 'Tonnes.' prefixes, which I do not need, so they are stripped.
tfl <- reshape(fleet,direction='wide',idvar=c('TIME','GEO'),
    timevar='VESSIZE',drop=c('Value','Flag.and.Footnotes','UNIT'),v.names='Number')
names(tfl) <- gsub('Number.','',names(tfl),fixed=TRUE)
rca <- reshape(catch,direction='wide',idvar=c('TIME','GEO'),
    timevar='SPECIES',drop=c('Value','FISHREG','UNIT'),v.names='Tonnes')
names(rca) <- gsub('Tonnes.','',names(rca),fixed=TRUE)
both <- merge(tfl,rca,by=c('TIME','GEO'))
both2 <- both[-grep('15|25|27',both$GEO),]
ggplot(both2[!(both2$GEO %in% c('Belgium','Bulgaria','Cyprus','Estonia',
                      'Latvia','Lithuania','Malta','Romania','Slovenia','Poland',
                      'Germany','Finland','Ireland','Netherlands','Sweden')),],
        aes(y=`Total fishery products`,
            x=`Total all Tonnage Classes`,colour=TIME))  + 
    geom_point() +
    facet_wrap( ~ GEO, drop=TRUE)
I very much like how ggplot2 used TIME as colour variable by default. It shows very nicely how catches and fleets are getting smaller; the latter is obviously not true for the biggest ships, as seen above. It also shows that Denmark and Iceland have remarkably efficient fleets: small, but catching loads of fish. In contrast, Greece has a big fleet but a small catch. That does not seem economical, but tonnes do not equal euros. Regarding the UK and Spain: yes, the Spanish catch is just a bit bigger than the UK's, so that pain may exist. 
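To put a rough number on that efficiency, catch per vessel can be derived directly from the merged data. A minimal sketch, using the column names created by the reshape above; eff and TonnesPerVessel are just names made up here:
# tonnes caught per vessel in 2010, as a crude efficiency measure
eff <- both2[both2$TIME==2010,
    c('GEO','Total fishery products','Total all Tonnage Classes')]
eff$TonnesPerVessel <- eff$`Total fishery products` /
    eff$`Total all Tonnage Classes`
eff[order(-eff$TonnesPerVessel),c('GEO','TonnesPerVessel')]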

Catch per species

As a final step, I wanted to look per species. However, that would be a bit too long for this blog, so I only show one. It runs in a function which takes a bit of string to match against the SPECIES variable. To keep the plot simple, only the six largest countries are taken. facet_wrap does two things here: it puts a title on the panel even if there is only one species, and it makes separate panels if more than one species matches the string.
byspecies <- function(species) {
  ca <- catch[grep(species,catch$SPECIES,ignore.case=TRUE),c(-4,-5,-6)]
  ca <- ca[complete.cases(ca),]
  ca <- ca[!(ca$GEO %in% c('EFTA','EU (15)','EU (27)')),]
  ag <- aggregate(ca$Tonnes,list(GEO=ca$GEO),median)
  ag <- ag[order(-ag$x),]
  ca <- ca[ca$GEO %in% ag$GEO[1:6],]
  ca$GEO <- factor(ca$GEO)    
  ggplot(ca   ,   aes(y=Tonnes,x=TIME,colour=GEO))  + 
      geom_line() +
      facet_wrap( ~ SPECIES, drop=TRUE)
}
byspecies('octop')

Sunday, January 20, 2013

Unemployment

I want to practice a bit more with ggplot2, and there is always interesting data to be had from Eurostat. In the Netherlands, the statistics agency (CBS) ran these headlines (translated with http://translate.google.nl/):
  • Unemployment rose to 7.2 percent in December
  • From mid-2011, almost continuous increase of unemployment
  • Unemployment benefits in December increased by 18,000
  • In 2012, fewer benefits were terminated because of work resumption than in 2011
I lost my overview of unemployment ages ago, save to know it is not good at all. Hence, unemployment plots.

Data introduction

Data from Eurostat (Harmonised unemployment rate % (seasonally adjusted)). I fiddled around with data ranges, put age categories in columns and country and month in rows, exported to .xls, cleaned it a bit in LibreOffice and exported to .csv. It was important to have time as a variable; my first attempt had more month columns than the process could cope with. So, the R script can start with preparations:
library(ggplot2)
library(KernSmooth)
library(plyr)

r1 <- read.csv("unemployment.csv")
levels(r1$Region) <- sub(' countries)',')' ,levels(r1$Region),fixed=TRUE)
levels(r1$Region) <- sub(' (under United Nations Security Council Resolution 1244/99)','' ,levels(r1$Region),fixed=TRUE)
levels(r1$Region) <- sub('including +former GDR','Incl GDR' ,levels(r1$Region))
levels(r1$Region) <- sub('European Union','EU' ,levels(r1$Region))
levels(r1$Region)[levels(r1$Region)=='United Kingdom'] <- 'UK'
levels(r1$Region)[levels(r1$Region)=='United States'] <- 'US'
levels(r1$Region)[levels(r1$Region)=='Germany (Incl GDR from 1991)'] <- 'Germany'
levels(r1$Region)
 [1] "Austria"        "Belgium"        "Bulgaria"       "Croatia"   
 [5] "Cyprus"         "Czech Republic" "Denmark"        "Estonia"   
 [9] "Euro area (17)" "EU (27)"        "Finland"        "France" 
[13] "Germany"        "Greece"         "Hungary"        "Iceland" 
[17] "Ireland"        "Italy"          "Japan"          "Latvia" 
[21] "Lithuania"      "Luxembourg"     "Malta"          "Netherlands"
[25] "Norway"         "Poland"         "Portugal"       "Romania"
[29] "Slovakia"       "Slovenia"       "Spain"          "Sweden" 
[33] "Turkey"         "UK"             "US"   
r2 <- reshape(r1,varying=list(names(r1)[c(-1,-2)]),
    idvar=c('Region','TIME'),timevar='Age',direction='long',
    v.names=c('Percentage'),
    times=c(names(r1)[c(-1,-2)]))
r2$Age <- gsub('.',' ',r2$Age,fixed=TRUE)
Agelevels <- unique(r2$Age)
Agelevels <- Agelevels[order(Agelevels)]
Agelevels
[1] "From 25 to 74 years" "Less than 25 years"  "Total"              
r2$Age <- factor(r2$Age,levels=Agelevels[c(2,1,3)]) 
r2$Date <- as.Date(paste(gsub('M','-',as.character(r2$TIME)),'-01',sep=''))

Plots

As far as I understand, Eurostat makes the numbers comparable between countries. We also get the US and Japan, so this gives a somewhat global picture. To give each country enough space, the countries are split into three groups: low, middle and high maximum unemployment.
maxi <- aggregate(r2$Percentage,by=list(Region=r2$Region),FUN=max,na.rm=TRUE)
low <- maxi$Region[maxi$x<quantile(maxi$x,1/3)]
# >= so a country sitting exactly on a quantile boundary is not dropped
middle <- maxi$Region[maxi$x>=quantile(maxi$x,1/3) & maxi$x<quantile(maxi$x,2/3)]
high <- maxi$Region[maxi$x>=quantile(maxi$x,2/3)]
ggplot(r2[r2$Region %in% low,],aes(x=Date,y=Percentage,colour=Age)) +
    facet_wrap( ~ Region, drop=TRUE) +
    geom_line()  +
    theme(legend.position = "bottom") +
    ylab('% Unemployment') + xlab('Year')
The plots show a bit too much wobble for my taste, in some countries more than others, but ggplot has a smoothing solution for that:
ggplot(r2[r2$Region %in% low,],aes(x=Date,y=Percentage,colour=Age)) +
    facet_wrap( ~ Region, drop=TRUE) +
    theme(legend.position = "bottom") + 
    stat_smooth(span=.1,method='loess') +
    ylab('% Unemployment') + xlab('Year')
I am not completely happy with that. If I look at the results and remember correctly, these smoothers may choose their own smoothing parameter, which is not what I intend here. So I dragged out a kernel smoother, which also gave me the opportunity to try plyr. I used a normal kernel with bandwidth 90 (days), about three months, which means the smoothing draws on data over roughly plus or minus a year, with most of the information coming from the neighbouring months.
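To see how far the information actually reaches, the relative kernel weights can be inspected directly. A small sketch, on the assumption that locpoly's bandwidth acts as the standard deviation of the normal kernel; the weight is about 60% of the maximum at three months and practically zero beyond nine:
# relative weight given to an observation k months away, normal kernel,
# bandwidth 90 days (assuming bandwidth = sd of the kernel)
months <- 0:12
round(dnorm(months*30.4, sd=90)/dnorm(0, sd=90), 3)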
r2$class <- interaction(r2$Region,r2$Age)
r3 <- r2[complete.cases(r2),]
r3$class <- factor(r3$class)
# smooth each Region x Age series separately on the numeric date scale
Perc <- ddply(.data=r3,.variables=.(class), 
    function(piece,...) {
      lp <- locpoly(x=as.numeric(piece$Date),y=piece$Percentage,
          drv=0,bandwidth=90)
      data.frame(Date=as.Date(lp$x,origin='1970-01-01'),
          sPerc=lp$y,Age=piece$Age[1],Region=piece$Region[1])}
    ,.inform=FALSE
)
ggplot(Perc[Perc$Region %in% low ,],aes(x=Date,y=sPerc,colour=Age)) +
    facet_wrap( ~ Region, drop=TRUE) +
    geom_line()  +
    theme(legend.position = "bottom")+
    ylab('% Unemployment') + xlab('Year')
# removed incantation for middle and high



I will leave the interpretation alone. Personally, I mainly look at Europe, the US, Germany and the Netherlands. The first two give me a global view. Locally, I live in the Netherlands, and Germany is so influential there that it needs looking at. 
In these plots I was surprised by Iceland (low numbers) and Slovenia (big increase), and horrified by Spain and Greece. I am sure each of us has their own interests and at least as much insight into these things as I do. 

Second part: derivative

The big question is: where are we going? For this I made the first derivative, smoothed with a somewhat bigger bandwidth of half a year. Derivatives are much noisier and suffer much more from the rounding to one decimal, so the wider bandwidth is suitable.
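A small illustration of why the rounding hurts the derivative much more than the level; the (artificial) series itself is never more than 0.05 off, but the month-to-month differences can be off by up to 0.1, which is large compared to a typical slope:
# smooth artificial series, rounded to one decimal as in the source data
m <- 1:60                      # months
y <- 5 + 2*sin(m/12)           # true level, slope at most ~0.17 per month
yr <- round(y, 1)              # what we actually observe
max(abs(yr - y))               # level error: at most 0.05
range(diff(yr) - diff(y))      # derivative error: up to +/- 0.1 per month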
# same per-series smoothing, now returning the first derivative (drv=1)
dPerc <- ddply(.data=r3,.variables=.(class), 
    function(piece,...) {
      lp <- locpoly(x=as.numeric(piece$Date),y=piece$Percentage,
          drv=1,bandwidth=365/2)
      data.frame(Date=as.Date(lp$x,origin='1970-01-01'),
          dPerc=lp$y,Age=piece$Age[1],Region=piece$Region[1])}
    ,.inform=FALSE
)
ggplot(dPerc[dPerc$Region %in% low ,],aes(x=Date,y=dPerc,colour=Age)) +
    facet_wrap( ~ Region, drop=TRUE) +
    geom_line()  +
    theme(legend.position = "bottom")+
    ylab('Change in % Unemployment') + xlab('Year')
# removed incantation middle and high
To me, these plots show a bit of where we are going. Especially youth unemployment seems to be the weather vane for the total. What I pick out here is Portugal, which seems to be stabilizing. Denmark, the UK and Ireland may be improving. The US is improving, but although they were first, the numbers are not so spectacular. Germany gets a bit of headwind, the Netherlands a bit more. However, France and Slovenia may become a point of worry. Latvia also looks bad.

Sunday, January 13, 2013

European Migration

Last week on the radio I heard a story of southern Europeans and Irish looking for better times in northern Europe. I heard the tale of an Italian academic who left Italy to end up waiting tables in an Italian restaurant in the Netherlands. Obviously this is not good: not good for Italy, which loses its academics, and not good for somebody who actually desires to wait tables. I cannot blame the Italian academic though. I also heard the tale of an Irish guy who ended up working for ASML, which is much better (except for Ireland obviously).
Based on this tale, I wondered if this migration is big enough to be visible in population statistics and would make for nice plots. Eurostat, the European statistical agency, is the place to get such data. Since I tend to look at data first and decide later, I took all years and countries. My chosen format was year and country in columns, with the population, migration, natural change, births and deaths next to them.

First thing to do, was have a go at the labels, which were extremely long.

r1 <- read.csv("population.eu.csv")
levels(r1$Region) <- gsub('European Economic Area','EEA' ,levels(r1$Region))
levels(r1$Region) <- gsub(' plus IS, LI, NO','+' ,levels(r1$Region),fixed=TRUE)
levels(r1$Region) <- sub(' countries)',')' ,levels(r1$Region),fixed=TRUE)
levels(r1$Region) <- sub(' (under United Nations Security Council Resolution 1244/99)','' ,levels(r1$Region),fixed=TRUE)
levels(r1$Region) <- sub('European Free Trade Association','EFTA' ,levels(r1$Region),fixed=TRUE)
levels(r1$Region) <- sub('Former Yugoslav Republic of Macedonia, the','FYROM' ,levels(r1$Region),fixed=TRUE)
levels(r1$Region) <- sub('including +former GDR','Incl GDR' ,levels(r1$Region))
levels(r1$Region) <- sub('European Union','EU' ,levels(r1$Region))
levels(r1$Region)

 [1] "Albania"                      "Andorra"                     
 [3] "Armenia"                      "Austria"                     
 [5] "Azerbaijan"                   "Belarus"                     
 [7] "Belgium"                      "Bosnia and Herzegovina"      
 [9] "Bulgaria"                     "Croatia"                     
[11] "Cyprus"                       "Czech Republic"              
[13] "Denmark"                      "Estonia"                     
[15] "Euro area (15)"               "Euro area (16)"              
[17] "Euro area (17)"               "EEA (EU-25+)"                
[19] "EEA (EU-27+)"                 "EFTA"                        
[21] "EU (25)"                      "EU (27)"                     
[23] "Finland"                      "FYROM"                       
[25] "France"                       "France (metropolitan)"       
[27] "Georgia"                      "Germany (Incl GDR from 1991)"
[29] "Germany (Incl GDR)"           "Greece"                      
[31] "Hungary"                      "Iceland"                     
[33] "Ireland"                      "Italy"                       
[35] "Kosovo"                       "Latvia"                      
[37] "Liechtenstein"                "Lithuania"                   
[39] "Luxembourg"                   "Malta"                       
[41] "Moldova"                      "Monaco"                      
[43] "Montenegro"                   "Netherlands"                 
[45] "Norway"                       "Poland"                      
[47] "Portugal"                     "Romania"                     
[49] "Russia"                       "San Marino"                  
[51] "Serbia"                       "Slovakia"                    
[53] "Slovenia"                     "Spain"                       
[55] "Sweden"                       "Switzerland"                 
[57] "Turkey"                       "Ukraine"                     
[59] "United Kingdom"              
As can be seen, the list of countries is fairly long. After checking, I found that "France (metropolitan)" is the part of France within Europe, while France additionally contains parts outside Europe. 
Next step was to make relative changes and a plot.
r1$RelMig <- 100*r1$MigrationA/r1$Population
r1$RelNC <- 100*r1$NatChange/r1$Population
r1$RelCh <- 100*r1$TotChange/r1$Population
r1$RelB <- 100*r1$LiveBirths/r1$Population
r1$RelD <- 100*r1$Deaths/r1$Population
library(lattice)
xyplot(RelMig+RelNC+RelCh+RelB+RelD ~ Year | Region,data=r1,type='l',
    auto.key=TRUE)
The most obvious 'features' of this plot are too many countries and still too long country names. The data show big migration in FYROM (Macedonia), Albania and Bosnia-Herzegovina, which swamps all other information. Clearly war has a bigger effect on migration than the economy. There are also a number of countries with data for only a limited number of years. 
r2 <- r1[! (r1$Region %in% grep('FYROM|Albania|Bosnia|15|16|25|1991',levels(r1$Region),value=TRUE)),]
xt <- xtabs(!is.na(RelMig) ~ Region,data=r1)
table(xt)
xt
 1  6  7 14 15 16 17 40 45 48 49 51 52 
 1  1  1  1  1  7  2  1  1  1  1  8 33 
r3 <- r2[! (r2$Region %in% names(xt[xt<40])),]
levels(r3$Region)[levels(r3$Region)=='United Kingdom'] <- 'UK'
levels(r3$Region)[levels(r3$Region)=='Germany (Incl GDR)'] <- 'Germany'
levels(r3$Region)[levels(r3$Region)=='France (metropolitan)'] <- 'France'

Lattice

library(lattice)
xyplot(RelMig + RelNC + RelCh ~ Year | Region,
    data=r3[r3$Region %in% c("Portugal","Iceland","Ireland","Italy","Greece","Spain",
            "Sweden","Finland","Denmark","Germany","Netherlands","Austria",
            "UK","Luxembourg","France"),],
    type='l',
    abline = list(h = c(0,NA,NA),v=c(NA,1999,2007),col=c('grey','blue','red')),
    ylim=c(-1.6,3),ylab='% Change in Population',
    xlim=c(1989,2012),
    auto.key=list(space='bottom',columns=3,
        text=c('Migration','Natural causes','Total'),
        points=FALSE,lines=TRUE)
)


In the plot, the vertical blue line represents the start of the Euro zone, while the red line represents the start of the crisis. The data held quite some surprises for me. Large crisis effects in Iceland and Ireland, much larger than in southern Europe. Big immigration into Luxembourg. Big immigration into Italy, Spain, Iceland and Ireland before the crisis. And yes, the crisis did cause emigration in Spain and Greece; not big, but 2012 data are not yet available. Finally, regarding all those Poles working in western Europe, I don't see an effect in Poland.

Ggplot2 

Finally, I wanted to do the same in ggplot2. It took me ages to learn lattice, and now it starts again with ggplot2. Unfortunately, I did not manage to make the countries cover both rows and columns, so the number of countries is reduced further. Nor was I able to place the legend title above the legend (a possible fix is sketched after the plot below). 
r4 <- r3[r3$Region %in% c("Portugal","Ireland","Italy","Greece","Spain",
        "Germany",
        "UK","France"),]
r4$Region <- factor(r4$Region)
table(r4$Region)
r5 <- reshape(r4,varying=list(c('RelMig','RelNC','RelCh')),
    idvar=c('Year','Region'),timevar='Source',direction='long',
    v.names=c('Percentage'),
    times=c('Migration','Natural Causes','Total Change'),
    drop=c('LiveBirths','Deaths','NatChange','Migration','TotChange','RelB','RelD'))

ggplot(r5,aes(x=Year,y=Percentage,colour=Source))  + #group=Source,
    geom_line() +
    facet_grid(. ~ Region, drop=TRUE) +
    scale_x_continuous(breaks=c(1990,2000,2010),
        labels=c("'90","2000","'10"),limits=c(1989,2012)) +
    scale_y_continuous('% Change',limits=c(-1,3)) +
    theme(legend.position = "bottom") 
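Regarding the legend title: guide_legend has a title.position argument (added around ggplot2 0.9.0, if memory serves) that should place the title above the keys. A sketch, not verified against the ggplot2 version used at the time:
# sketch: put the legend title above the keys rather than beside them
last_plot() + guides(colour=guide_legend(title.position='top'))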

Postscript. Following a comment, facet_wrap was what I needed:
r4 <- r3[r3$Region %in% c("Portugal","Iceland","Ireland","Italy","Greece","Spain",
        "Sweden","Finland","Denmark","Germany","Netherlands","Austria",
        "UK","Luxembourg","France","Poland"),]
r4$Region <- factor(r4$Region)
table(r4$Region)
r5 <- reshape(r4,varying=list(c('RelMig','RelNC','RelCh')),
    idvar=c('Year','Region'),timevar='Source',direction='long',
    v.names=c('Percentage'),
    times=c('Migration','Natural Causes','Total Change'),
    drop=c('LiveBirths','Deaths','NatChange','Migration','TotChange','RelB','RelD'))
ggplot(r5,aes(x=Year,y=Percentage,colour=Source))  + #group=Source,
    geom_line() +
    facet_wrap( ~ Region, drop=TRUE) +
    scale_x_continuous(breaks=c(1990,2000,2010),
        labels=c("'90","2000","'10"),limits=c(1989,2012)) +
    scale_y_continuous('% Change',limits=c(-1,3)) +
    theme(legend.position = "bottom") 

Sunday, January 6, 2013

Sequential testing in a triangle test setting

It is well known that the binomial test never has an error of exactly 5%. You aim for at most 5%, calculate the number correct needed to get there, and end up with an error of e.g. 2%. This is a shame, but there is no solution. However, it is also an opportunity: the 'unused' error may be employed for additional testing. For instance, in a triangle test, why not aim for, say, 30 persons, do a pre-test at 17 persons where H0 is rejected at less than 1% error level, and, when not rejected, continue to 30, reject at the original 5%, and still have an overall error level of less than 5%?

Example

As described before (blog entry), a triangle test is a sensory test where there is a 1/3 chance of selecting the correct product by guessing. When the proportion correct is significantly larger than 1/3, the products are deemed different. This latter part is performed with a binomial test. 
With 30 trials the number correct needs to exceed the critical value of 14 (i.e. at least 15 correct), with resulting alpha 0.043:
n_2 <- 30
(critVal_2 <- qbinom(.95,n_2,1/3))
[1] 14
pbinom(critVal_2,n_2,1/3,lower.tail=FALSE)
[1] 0.04348228
With 17 trials:
n_1 <- 17
(critVal_1 <- qbinom(.98,n_1,1/3))
[1] 10
# stored in p_H0_1, as it is needed again below
(p_H0_1 <- pbinom(critVal_1,n_1,1/3,lower.tail=FALSE))
[1] 0.00800819
As can be seen, the two errors add up to more than 0.05. However, they do not need to be added outright: the only way to get to the 30-person situation is to have at most 10 correct in the first phase. This conditional error can be calculated easily with a few functions. For each number correct in the first phase, we compute the chance of getting sufficiently many correct in the second phase to exceed the critical value; these chances are weighted by the probability of that first-phase outcome and summed, as in the formula below. The two functions after it implement the calculation.
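In my notation, with X1 the number correct among the first n_1 tasters and X2 among the remaining n_2 - n_1 (both Binomial with p = 1/3), and c_1, c_2 the two critical values, the overall error is

alpha_total = P(X1 > c_1) + sum over k = 0, ..., c_1 of P(X1 = k) * P(X2 > c_2 - k)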
condPval_S1 <- function(nFound_1,n_1,n_2,critVal_2) {
  nAdditional <- n_2-n_1
  if (nAdditional < critVal_2-nFound_1+1) 0
  else {
    pbinom(critVal_2-nFound_1,
            nAdditional,1/3,lower.tail=FALSE)
  }
}
condPval <- function(n_1,critVal_1,n_2,alpha2=0.05) {
  critVal_2 <- qbinom(1-alpha2,n_2,1/3)
  nFound <- 0:critVal_1
  sa <- sapply(nFound,function(nFound_1) 
        dbinom(nFound_1,n_1,1/3)*
        condPval_S1(nFound_1,n_1,n_2,critVal_2))
  sum(sa)
}
condPval(n_1,critVal_1,n_2)+p_H0_1
[1] 0.04568412
The total error level is slightly less than 5%. Hence we can do this while still keeping to the promised 5% level.
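As a sanity check on the conditional calculation, the procedure can also be simulated under H0 (pure guessing). A quick sketch, reusing critVal_1 and critVal_2 from above; the fraction of rejections should land close to the 0.0457 computed above:
# simulate 30 guessing tasters; reject at 17 if more than critVal_1 (10)
# correct, otherwise reject at 30 if more than critVal_2 (14) in total
set.seed(42)
reject <- replicate(1e5, {
  x <- rbinom(30, 1, 1/3)
  sum(x[1:17]) > critVal_1 || sum(x) > critVal_2
})
mean(reject)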

A bit more extensive

In practice, not everybody asked to come and do the triangle test will actually be there to taste. What if we end up a few trials short, or with a few extra? Obviously this can be calculated as well; the sapply function helps greatly. The overall level for all these cases is under 5%.
range(sapply(25:35,function(n_2)
      condPval(n_1,critVal_1,n_2)+p_H0_1 ))
[1] 0.02484166 0.04711329
This can also be put into a function with a bit more details:
McondPval <- function(n_1,
    n_2_min = round(n_1*1.5),
    n_2_max = 3*n_1) {
  critVal_1 <- qbinom(.98,n_1,1/3)
  p_H0_1 <- pbinom(critVal_1,n_1,1/3,lower.tail=FALSE)
  n_2 <- n_2_min:n_2_max
  alpha <- sapply(n_2,function(n_2)
        condPval(n_1,critVal_1,n_2)+p_H0_1 )
  critVal_2 <- qbinom(.95,n_2,1/3)
  alpha_orig <- pbinom(critVal_2,n_2,1/3,lower.tail=FALSE)
  return(data.frame(n_1,n_2,alpha,alpha_orig))
}
McondPval(17,25,35)
   n_1 n_2      alpha alpha_orig
1   17  25 0.04277008 0.04151368
2   17  26 0.02729441 0.02475400
3   17  27 0.03792652 0.03592712
4   17  28 0.02484166 0.02156168
5   17  29 0.03384347 0.03113864
6   17  30 0.04568412 0.04348228
7   17  31 0.03037324 0.02702409
8   17  32 0.04047668 0.03765334
9   17  33 0.02740706 0.02348101
10  17  34 0.03605287 0.03265134
11  17  35 0.04711329 0.04419916
Unfortunately, it is not always this nice. With these settings, at 16 trials in the first phase it may go wrong: look at 30 and 35 trials total. The 30-trial case is just over 5%, while 35 is clearly over it. Either the test in phase 1 should be more stringent, or it should be ensured that testing does not end at 35 trials. It does not matter which of these is chosen, but we have to choose. Ideally, the level of testing at phase 1 should be determined before knowing how many correct there are. 
McondPval(16,25,35)
   n_1 n_2      alpha alpha_orig
1   16  25 0.04648649 0.04151368
2   16  26 0.03245610 0.02475400
3   16  27 0.04236557 0.03592712
4   16  28 0.03048510 0.02156168
5   16  29 0.03885603 0.03113864
6   16  30 0.05006582 0.04348228
7   16  31 0.03584845 0.02702409
8   16  32 0.04538488 0.03765334
9   16  33 0.03326027 0.02348101
10  16  34 0.04140208 0.03265134
11  16  35 0.05194322 0.04419916
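One way to fix the phase 1 level up front is to search for the smallest phase 1 critical value that keeps the worst-case overall alpha under 5% for every total we might end up with. A sketch building on condPval above; safeCritVal is a name invented here:
# raise the phase 1 critical value until the worst-case overall alpha
# over the whole range of plausible totals stays at or below the target
safeCritVal <- function(n_1, n_2_range=25:35, alpha=0.05) {
  critVal_1 <- qbinom(.98, n_1, 1/3)
  repeat {
    p_H0_1 <- pbinom(critVal_1, n_1, 1/3, lower.tail=FALSE)
    worst <- max(sapply(n_2_range, function(n_2)
              condPval(n_1, critVal_1, n_2) + p_H0_1))
    if (worst <= alpha) return(c(critVal_1=critVal_1, worstAlpha=worst))
    critVal_1 <- critVal_1 + 1   # make phase 1 more stringent
  }
}
safeCritVal(16)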

Conclusion

With a few simple functions and a bit of care, an extra hypothesis test can be added during a triangle test. This gives the opportunity to declare differences at an intermediate step while retaining the original error level.