Saturday, January 11, 2014

Converting a JAGS model to Stan

For my first experience with Stan I wanted to convert my most recent JAGS program into Stan. This was a bit more difficult than I expected. The JAGS program modeled Fe concentration in rainwater, including values below the detection limit.

Data

The data have been explained before. The only difference is that I now read the data using read.xls from the gdata package. The new code is at the bottom of this post.

Converting the code

Initially I thought the conversion would be simple. I made some simplifications and the code worked. However, my initial code turned out not to be very fast. When I then added the out-of-range data, I ran into a problem I had seen before: data initialized out of range, and hence no MCMC samples. After muddling along for a bit, my second step was to build a simple regression model and get to know Stan a bit better. With that knowledge I set up a simple model and extended it until it had all the features I wanted, except the out-of-range data. To add the out-of-range data I decided to initialize the more complex model using the last samples from the simpler model (init2 in the code below). This worked well, and the first model also serves as a kind of warm-up. The way I set it up produces some warnings, but only in the first model, which is run for just a few cycles. The final running time was disappointing: close to two hours. This could probably be reduced a bit by using fewer but longer chains.

Lessons learned

  • Stan is not so similar to JAGS that you can just drop in code; it is better to start a program fresh than to attempt a quick conversion. 
  • Using normal(0,1) and a multiplication factor gives faster code than using normal(0,f). It says so in the manual, and yes, it does make a difference (see the sketch after this list).
  • Having aliased parameters slows things down.
  • Stan's compilation step makes for slower development than JAGS, but it is still best to start small and extend the model.
    • At some point I just worked on code with few samples and two chains, which helped a bit.
  • The Stan code does not feel faster than the JAGS code.
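
A minimal sketch of the scaling trick in the second point (illustrative only, not taken from the models below; it mirrors how FLoc_r and sfloc are used further down):

parameters {
  real<lower=0> sf;    // scale factor
  vector[10] b_r;      // raw effects
}
model {
  sf ~ cauchy(0,4);
  b_r ~ normal(0,1);   // standard normal; use sf*b_r wherever the effect is needed
  // this samples faster than declaring vector[10] b; and writing b ~ normal(0,sf);
}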

Run results

Output during fitting of the first model

TRANSLATING MODEL 'Fe_code1' FROM Stan CODE TO C++ CODE NOW.
DIAGNOSTIC(S) FROM PARSER:

Warning (non-fatal): sampling statement (~) contains a transformed parameter or local variable. 
You must increment lp__ with the log absolute determinant of the Jacobian of the transform. 
Sampling Statement left-hand-side expression:
          LAmountC ~ normal_log(...)

COMPILING THE C++ CODE FOR MODEL 'Fe_code1' NOW.
SAMPLING FOR MODEL 'Fe_code1' NOW (CHAIN 1).

Informational Message: The current Metropolis proposal is about to be rejected becuase of the following issue:
Error in function stan::prob::normal_log(N4stan5agrad3varE): Scale parameter[0]  is 0:0, but must be > 0!
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Iteration:  1 / 75 [  1%]  (Warmup)
Informational Message: The current Metropolis proposal is about to be rejected becuase of the following issue:
Error in function stan::prob::normal_log(N4stan5agrad3varE): Location parameter[0]  is inf:0, but must be finite!
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Iteration:  7 / 75 [  9%]  (Warmup)
Iteration: 14 / 75 [ 18%]  (Warmup)
Iteration: 21 / 75 [ 28%]  (Warmup)
Iteration: 28 / 75 [ 37%]  (Warmup)
Iteration: 35 / 75 [ 46%]  (Warmup)
Iteration: 42 / 75 [ 56%]  (Warmup)
Iteration: 49 / 75 [ 65%]  (Warmup)
Iteration: 56 / 75 [ 74%]  (Sampling)
Iteration: 63 / 75 [ 84%]  (Sampling)
Iteration: 70 / 75 [ 93%]  (Sampling)
Iteration: 75 / 75 [100%]  (Sampling)
Elapsed Time: 55.0709 seconds (Warm-up)
              56.8683 seconds (Sampling)
              111.939 seconds (Total)

[Etc chains 2 to 4]

fit1

Inference for Stan model: Fe_code1.
4 chains, each with iter=75; warmup=50; thin=1;
post-warmup draws per chain=25, total post-warmup draws=100.

             mean se_mean  sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
intercept     0.0     0.0 0.1   -0.1    0.0    0.0    0.1    0.1    46  1.1
rate         -0.4     0.0 0.0   -0.5   -0.4   -0.4   -0.4   -0.3    45  1.1
FLoc_r[1]    -0.2     0.0 0.3   -0.8   -0.5   -0.2    0.0    0.4    58  1.1
FLoc_r[2]     1.4     0.1 0.4    0.8    1.1    1.3    1.6    2.2    25  1.1
FLoc_r[3]    -0.3     0.1 0.3   -0.9   -0.5   -0.3   -0.1    0.2    37  1.1
.
...output lines deleted
.
lp__       4194.4     0.9 4.4 4186.6 4190.8 4194.9 4197.5 4202.4    26  1.2

Samples were drawn using NUTS(diag_e) at Thu Jan  9 19:54:38 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
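
To inspect these diagnostics programmatically rather than reading the printed table, the fit summary can be queried; a minimal sketch (assuming rstan's summary method for stanfit objects):

s1 <- summary(fit1)$summary
s1[, c('n_eff','Rhat')]  # effective sample size and potential scale reduction per parameter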

Running the second model

TRANSLATING MODEL 'Fe_code2' FROM Stan CODE TO C++ CODE NOW.
COMPILING THE C++ CODE FOR MODEL 'Fe_code2' NOW.
SAMPLING FOR MODEL 'Fe_code2' NOW (CHAIN 1).

Iteration:   1 / 500 [  0%]  (Warmup)
Informational Message: The current Metropolis proposal is about to be rejected becuase of the following issue:
Error in function stan::prob::normal_log(N4stan5agrad3varE): Location parameter[0]  is inf:0, but must be finite!
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Iteration:  50 / 500 [ 10%]  (Warmup)
Iteration: 100 / 500 [ 20%]  (Warmup)
Iteration: 150 / 500 [ 30%]  (Sampling)
Iteration: 200 / 500 [ 40%]  (Sampling)
Iteration: 250 / 500 [ 50%]  (Sampling)
Iteration: 300 / 500 [ 60%]  (Sampling)
Iteration: 350 / 500 [ 70%]  (Sampling)
Iteration: 400 / 500 [ 80%]  (Sampling)
Iteration: 450 / 500 [ 90%]  (Sampling)
Iteration: 500 / 500 [100%]  (Sampling)
Elapsed Time: 322.58 seconds (Warm-up)
              1212.62 seconds (Sampling)
              1535.2 seconds (Total)

[etc chains 2 to 4]

fit2

Inference for Stan model: Fe_code2.
4 chains, each with iter=500; warmup=100; thin=1;
post-warmup draws per chain=400, total post-warmup draws=1600.

             mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
intercept     0.2     0.0  0.1     0.1     0.2     0.2     0.3     0.4   562  1.0
yearlydec     0.9     0.0  0.0     0.9     0.9     0.9     0.9     0.9   601  1.0
sigmaF[1]     0.5     0.0  0.0     0.5     0.5     0.5     0.5     0.5   394  1.0
.
...output lines deleted
.
lp__      -1836.8     3.6 56.4 -1944.1 -1876.3 -1830.9 -1802.3 -1723.9   245  1.0

Samples were drawn using NUTS(diag_e) at Thu Jan  9 21:32:53 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Code

Reading data

# http://www.lml.rivm.nl/data/gevalideerd/data/lmre_1992-2005.xls
# http://www.r-bloggers.com/read-excel-files-from-r/
library(gdata)

# read one sheet of the workbook into a tall data frame
readsheet <- function(cursheet, fname) {
  df <- read.xls(fname,
      sheet = cursheet,
      blank.lines.skip = FALSE,
      colClasses = "character")
  topline <- 8
  # the measurement columns sit directly right of the 'C' (status) columns
  test <- which(df[6, ] == 'C') + 1
  numin <- df[topline:nrow(df), test]
  names(numin) <- make.names(gsub(' ', '', df[6, test]))
  # values use a decimal comma
  for (i in 1:ncol(numin)) numin[, i] <- as.numeric(gsub(',', '.', numin[, i]))
  status <- df[topline:nrow(df), test - 1]
  names(status) <- paste('C', make.names(gsub(' ', '', df[6, test])), sep = '')
  numin$StartDate <- as.Date(substr(df[topline:nrow(df), 1], 1, 10))
  numin$EndDate <- as.Date(substr(df[topline:nrow(df), 2], 1, 10))
  numin <- cbind(numin, status)
  numin <- numin[!is.na(numin$StartDate), ]
  # reshape to one row per compound per sampling period
  tall <- reshape(numin,
      varying = list(
          make.names(gsub(' ', '', df[6, test])),
          paste('C', make.names(gsub(' ', '', df[6, test])), sep = '')),
      v.names = c('Amount', 'Status'),
      timevar = 'Compound',
      idvar = c('StartDate', 'EndDate'),
      times = paste(df[6, test], '(', df[5, test], ')', sep = ''),
      direction = 'long')
  tall$Compound <- factor(gsub(' )', ')', gsub(' +', ' ', tall$Compound)))
  row.names(tall) <- NULL
  tall$Location <- cursheet
  tall
}

# read all sheets of one file, skipping the first
readfile <- function(fname) {
  sheets <- sheetNames(fname)[-1]
  tt <- lapply(sheets, readsheet, fname = fname)
  tt2 <- do.call(rbind, tt)
  tt2$Location <- factor(tt2$Location)
  tt2$fname <- fname
  tt2
}

fnames <- dir(pattern = '\\.xls$')  # match files ending in .xls ('*.xls' is a glob, not a regex)
rf <- lapply(fnames, readfile)
rf2 <- do.call(rbind, rf)

# the two files correspond to two sampling devices
rf2$machine <- factor(ifelse(rf2$fname == "lmre_1992-2005.xls",
    'Van Essen/ECN vanger',
    ' Eigenbrodt NSA 181 KHT'))

Select Fe data and prepare 

# select Fe (the ninth compound level); amounts below the 0.4 detection limit are censored
Fe <- rf2[rf2$Compound == levels(rf2$Compound)[9], ]
Fe$LAmount <- log10(Fe$Amount)
Fe$LAmount[Fe$Amount < 0.4] <- NA

Fe2 <- Fe[!is.na(Fe$Status), ]
Fe3 <- Fe2[!is.na(Fe2$LAmount), ]  # observed values
Fe4 <- Fe2[is.na(Fe2$LAmount), ]   # values below the detection limit
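
A quick check of how the records split between observed and censored values (illustrative):

c(observed = nrow(Fe3), censored = nrow(Fe4))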

library(rstan) 
datain <- list(
    LAmount = Fe3$LAmount,                  # observed log10 amounts
    Machine = (Fe3$machine == 'Van Essen/ECN vanger') + 1,
    Location = (1:nlevels(Fe3$Location))[Fe3$Location],
    time = as.numeric(Fe3$StartDate),
    nobs = nrow(Fe3),
    nloc = nlevels(Fe3$Location),
    MachineC = (Fe4$machine == 'Van Essen/ECN vanger') + 1,  # censored records
    LocationC = (1:nlevels(Fe4$Location))[Fe4$Location],
    timeC = as.numeric(Fe4$StartDate),
    nobsC = nrow(Fe4),
    L = log10(0.399)                        # just below the 0.4 detection limit
)

First model

parameters1 <- c('intercept', 'rate', 'FLoc_r',
    'sigmaF', 'sfmach', 'sfloc', 'FMach_r')

Fe_code1 <- ' 
data {
    int<lower=0> nloc; 
    int<lower=0> nobs;
    vector[nobs] LAmount;
    int<lower=1,upper=2> Machine[nobs];
    int<lower=1,upper=nloc> Location[nobs];
    vector[nobs] time;
    int<lower=0> nobsC;
    real<upper=min(LAmount)> L;
    int<lower=1,upper=2> MachineC[nobsC];
    int<lower=1,upper=nloc> LocationC[nobsC];
    vector[nobsC] timeC;
}
parameters {
    real intercept;
    real rate;  
    vector[nloc] FLoc_r;
    vector<lower=0>[2] sigmaF;
    real<lower=0> sfmach;
    real<lower=0> sfloc;
    vector[2] FMach_r;
}
transformed parameters {
    vector[nobs] ExpLAmount;
    vector<lower=0>[nobs] sigma;
    real mean_FLoc;
    real mean_FMach;
    vector[nloc] FLoc;
    vector[2] FMach;
    vector[nobsC] ExpLAmountC;
    vector<lower=0>[nobsC] sigmaC;
    vector[nobsC] LAmountC;
    // center the raw effects and scale them (the normal(0,1) trick above)
    mean_FLoc <- mean(FLoc_r);
    FLoc <- (FLoc_r - mean_FLoc) * sfloc;
    mean_FMach <- mean(FMach_r);
    FMach <- (FMach_r - mean_FMach) * sfmach;
    for (i in 1:nobs) {
        ExpLAmount[i] <- intercept
            + .0001 * rate * time[i]
            + FMach[Machine[i]]
            + FLoc[Location[i]];
        sigma[i] <- sigmaF[Machine[i]];
    }
    for (i in 1:nobsC) {
        ExpLAmountC[i] <- intercept
            + .0001 * rate * timeC[i]
            + FMach[MachineC[i]]
            + FLoc[LocationC[i]];
        sigmaC[i] <- sigmaF[MachineC[i]];
        LAmountC[i] <- log10(.2);  // fixed placeholder for the censored values
    }
}  
model {        
    sfmach ~ cauchy(0,4);
    sfloc ~ cauchy(0,4);
    FMach_r ~ normal(0,1);
    FLoc_r ~ normal(0,1);
    sigmaF ~ cauchy(0,4);
    LAmount ~ normal(ExpLAmount, sigma);
    // sampling statement on a transformed parameter; this is what
    // the parser warning above refers to
    LAmountC ~ normal(ExpLAmountC, sigmaC);
    rate ~ normal(0,1);
}
generated quantities {
    real dailydec;
    real yearlydec; 
    dailydec <- pow(10, rate * .0001);
    yearlydec <- pow(dailydec, 365);
}
'

# short run: these draws only serve as initial values for the second model
fit1 <- stan(model_code = Fe_code1,
    data = datain,
    pars = parameters1,
    warmup = 50,
    iter = 75,
    chains = 4)
fit1

Second model
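
Relative to the first model, the censored values now enter as genuine parameters, bounded above by the detection limit L, following the censored-data treatment in the Stan manual (section 10.3):

vector<upper=L>[nobsC] LAmountC;

This replaces the fixed log10(.2) placeholder of the first model. The chains are initialized from the last samples of the first fit: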

samples <- as.array(fit1)     # iterations x chains x parameters
lastsample <- dim(samples)[1]
nchain <- dim(samples)[2]
# build one init list per chain from the last draw of fit1;
# the slices below follow the order of parameters1 in the output
init2 <- lapply(1:nchain, function(i) {
  fin <- samples[lastsample, i, ]
  list(
      intercept = fin[1],
      rate = fin[2],
      FLoc_r = fin[3:(3+16)],   # 17 locations
      sigmaF = fin[20:21],
      sfmach = fin[22],
      sfloc = fin[23],
      FMach_r = fin[24:25],
      LAmountC = rep(log10(.2), datain$nobsC)
  )
})
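
The positional slices depend on the order in which Stan reports the parameters, so it is worth checking them against the names; a minimal sketch (assuming as.array keeps the parameter names in its dimnames):

dimnames(samples)$parameters  # should list intercept, rate, FLoc_r[1]...FLoc_r[17], sigmaF[1], sigmaF[2], ...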

parameters2 <- c('intercept', 'yearlydec',
    'sigmaF', 'FMach', 'sfmach', 'sfloc',
    'FLoc', 'rate')

Fe_code2 <- ' 
data {
    int<lower=0> nloc; 
    int<lower=0> nobs;
    vector[nobs] LAmount;
    int<lower=1,upper=2> Machine[nobs];
    int<lower=1,upper=nloc> Location[nobs];
    vector[nobs] time;
    int<lower=0> nobsC;
    real<upper=min(LAmount)> L;
    int<lower=1,upper=2> MachineC[nobsC];
    int<lower=1,upper=nloc> LocationC[nobsC];
    vector[nobsC] timeC;
}
parameters {
    real intercept;
    real rate;  
    vector[nloc] FLoc_r;
    vector<lower=0>[2] sigmaF;
    real<lower=0> sfmach;
    real<lower=0> sfloc;
    vector[2] FMach_r;
    // censored observations, bounded above by the detection limit
    vector<upper=L>[nobsC] LAmountC;
}
transformed parameters {
    vector[nobs] ExpLAmount;
    vector<lower=0>[nobs] sigma;
    real mean_FLoc;
    real mean_FMach;
    vector[nloc] FLoc;
    vector[2] FMach;
    vector[nobsC] ExpLAmountC;
    vector<lower=0>[nobsC] sigmaC;
    mean_FLoc <- mean(FLoc_r);
    FLoc <- (FLoc_r - mean_FLoc) * sfloc;
    mean_FMach <- mean(FMach_r);
    FMach <- (FMach_r - mean_FMach) * sfmach;
    for (i in 1:nobs) {
        ExpLAmount[i] <- intercept
            + .0001 * rate * time[i]
            + FMach[Machine[i]]
            + FLoc[Location[i]];
        sigma[i] <- sigmaF[Machine[i]];
    }
    for (i in 1:nobsC) {
        ExpLAmountC[i] <- intercept
            + .0001 * rate * timeC[i]
            + FMach[MachineC[i]]
            + FLoc[LocationC[i]];
        sigmaC[i] <- sigmaF[MachineC[i]];
    }
}  
model {        
    sfmach ~ cauchy(0,4);
    sfloc ~ cauchy(0,4);
    FMach_r ~ normal(0,1);
    FLoc_r ~ normal(0,1);
    sigmaF ~ cauchy(0,4);
    LAmount ~ normal(ExpLAmount, sigma);
    LAmountC ~ normal(ExpLAmountC, sigmaC);
    rate ~ normal(0,1);
}
generated quantities {
    real dailydec;
    real yearlydec; 
    dailydec <- pow(10, rate * .0001);
    yearlydec <- pow(dailydec, 365);
}
'

fit2 <- stan(model_code = Fe_code2,
    data = datain,
    pars = parameters2,
    init = init2,
    warmup = 100,
    iter = 500,
    chains = 4)

fit2

Comments

  1. Thanks for writing up your experience with Stan. I'm glad you got a model this complex working.

    I'm curious how you measured speed. The only thing that matters for MCMC is effective samples per second. For instance, Metropolis is super-fast per iteration, but its random walk nature means it takes many iterations to generate an effective sample. In most of the measurements we've done, Stan is faster than JAGS in terms of effective samples per second. Another item on our to-do list is making it easier to do apples-to-apples comparisons here. Stan uses a very conservative measure of effective sample size based on the same kind of within-chain and between-chain variance estimates as are used to do R-hat, with the added twist that each chain is split in half to diagnose non-stationarity (definitions in the manual and also in the 3rd edition of Andrew et al.'s Bayesian Data Analysis).

    We're working on making the compilation faster --- we made a conscious decision from the get go that it would be worth the hit in compilation time to make the models run faster. What it means in practice is that for simple models, and by that I mean ones that take only a few seconds to generate all the effective samples you need, JAGS will smoke Stan in overall time taken. Also, it's really annoying just having to sit and wait while the compiler churns away.

    Another issue is that configuring the compilers in R is a pain. We've had a lot of trouble, particularly with people upgrading versions of Stan, where the compilation isn't being done with full optimization. There's an order of magnitude or more difference between optimization level 0 and optimization level 3 for the standard C++ compilers.

    There are some steps you can take with this code to speed up your Stan model. You define a variable LAmountC as a transformed parameter that only depends on data. It's much, much more efficient to define this in the transformed data block. The efficiency is because (a) the calculation is only done once as the data is loaded, not once per leapfrog step within an iteration, (b) there's no overhead on the automatic differentiation which comes from defining constant transformed parameters, and (c) there's no I/O overhead in writing the value every iteration.

    There are also minor issues like being able to replace .0001*rate*time[i] with a transformed data variable defined to be 0.0001 * time[i]. Again, this cuts down on having to do the multiplication every leapfrog step and cuts down on auto-diff time, which has to do the multiplication again on the derivatives.

    I couldn't figure out what was going on with the two sampling statements, LAmount ~ normal(ExpLAmount,sigma); and LAmountC ~ normal(ExpLAmountC,sigmaC);.

    Finally, it looks like both sigma and sigmaC are the same values, so I'd just drop sigmaC and replace all of its uses with sigma.

    We're always happy to help out on our users group, too.

    1. Dear Lingpipe,

      Thank you for your comments. There was quite some learning to do before I got this done and I am sure the code could be better.
      Regarding speed, in the end it is a subjective matter. I did not pick up on the difference in how effective sample sizes are calculated. Maybe I should have: I had a number of occasions where four chains gave two effective samples, which should have told me the measure was being conservative. This probably means I could live with fewer samples.
      On the model, LAmountC sits in the parameters section because that is how y_cens is treated in section 10.3 (page 81) of the Stan manual. I did not save that parameter, so I hope Stan did not either. Sigma and sigmaC are vectors, depending on the machine of the data record, so they cannot be collapsed. I have to agree that adding this 'feature' to the model did not help speed at all. I will try to use some of the comments and might come back on the users group.
      Thanks again.

    2. If you're only getting two effective samples from four chains, that's an indication that the model hasn't converged. If you only care about posterior means for parameters, you typically only need 20 or so effective samples --- you can judge how much you need by the MCMC standard error estimates compared to the posterior variance.

      LAmountC is defined by LAmountC[i] <- log10(.2);, so it doesn't have to be a transformed parameter. A variable only needs to be a transformed parameter if it depends on a parameter or another transformed parameter. (I was looking at your first model. Maybe it all got cleaned up in the second model.)

      As to sigma and sigmaC, I was suggesting only using one of them because they have the same definition. I realize they can't be collapsed into a scalar.

      All the transformed parameters are saved in the output.

      - Bob Carpenter (LingPipe's just my old blog and WordPress login)

    3. Dear Bob,

      I had the feeling the two effective samples had to do with confounded parameters: one goes up, the other goes down, and neither is estimated well. Two effective samples was obviously a sign that the model was not acceptable.

      Your looking at model one explains a bit. In the first model I agree that LAmountC is treated wrongly. But that is just a stripped-down version of the model I want to use; it is model two that is eating the time.
