rdirichlet in Rcpp -- 3/16/16

A paper (and R package) I am currently working on requires me to sample quite a bit from a Dirichlet distribution. There are a number of nice options for doing this in many languages, including R, but as far as I am aware, none of these interface nicely with Rcpp (and are easily portable in an R package). While browsing StackExchange and Dirk Eddelbuettel's website, hoping to find a magical solution, I came across a link to the Wikipedia page on the Dirichlet_distribution. It points to a pretty simple way to sample from a Dirichlet distribution using a bunch of Gammas, which Rcpp does provide, so I decided to code it up myself! Lets take a look at the source code :

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::mat rdirichlet_cpp(int num_samples,
                         arma::vec alpha_m) {
    int distribution_size = alpha_m.n_elem;
    // each row will be a draw from a Dirichlet
    arma::mat distribution = arma::zeros(num_samples, distribution_size);

    for (int i = 0; i < num_samples; ++i) {
        double sum_term = 0;
        // loop through the distribution and draw Gamma variables
        for (int j = 0; j < distribution_size; ++j) {
            double cur = R::rgamma(alpha_m[j],1.0);
            distribution(i,j) = cur;
            sum_term += cur;
        }
        // now normalize
        for (int j = 0; j < distribution_size; ++j) {
            distribution(i,j) = distribution(i,j)/sum_term;
        }
    }
    return(distribution);
}

I decided to try this out against the rdirichlet() functions included in the gtools and MCMCpack packages. It looks like about a 25% speedup. The more important thing is that this code is easy to stick in your Rcpp project, which makes for much faster sampling.

> alpha <- 5
> topics <- 100
> alpha_m <- rep(alpha/topics,topics)
> 
> system.time({
+     temp <- gtools::rdirichlet(1000000, alpha_m)
+ })
   user  system elapsed 
 11.704   0.947  13.638 
> 
> system.time({
+     temp2 <- MCMCpack::rdirichlet(1000000, alpha_m)
+ })
   user  system elapsed 
 11.151   0.927  12.482 
> 
> system.time({
+     temp3 <- rdirichlet_cpp(1000000,alpha_m)
+ })
   user  system elapsed 
  8.543   0.567   9.245

Finally, we can take a look at an example draw using the following snippet of code.

alpha <- 5
 draws <- 1
 dimen <- 20
 alpha_m <- rep(alpha/dimen, dimen)
 x <- rdirichlet_cpp(draws,alpha_m)

 dat <- data.frame(Category = factor(1:dimen),
                   Density = as.vector(t(x)))

 library(ggplot2)
 ggplot(dat,aes(x = Category,y = Density,ymin = 0, ymax = Density)) +
     geom_point(colour = "darkblue",fill = "darkblue") +
     geom_linerange(colour = "darkblue") +
     scale_y_continuous(lim=c(0,0.5))

Which produces a plot that should look something like the following:

Building an Academic Website -- 5/8/15

I have now had a few people ask me about building an academic website so I decided to write up a little blog post on the tools I used to build mine and my strategy for expanding it and adding more functionality. In general, this tutorial/approach is meant for people without a ton of experience building a website but who like to tinker and code and have the time to do so. The approach I have taken has not yielded the slickest website with minimal effort, but I understand how everything works. Because I expand the functionality of my site bit by bit, I have also avoided wasted features that I do not know how to use. Nothing says unprofessional like a search bar or blog archive without any need for it. I am also willing to spend about $100 a year on my website, something that some folks cannot or will not do.

The purpose of my website is to be an outlet for work related things I think are cool, a place to post my academic work, and an online business card to help me get a job. Beyond having a super simple page with your name, a picture, and a link to download your CV, you will not get much credit for having an awesome website, so treat anything you do beyond the basics as part of your fun time. I have elected to go with private hosting and my own domain name. For many folks, your university will give you lots of tools to build a nice static website (meaning no databases or PHP code) right on the university servers -- for free! This is a great option but it will mean that people will have to find your new website when you eventually change institutions. This is not a big deal if you are faculty but can be a pain if you are a grad student, as your university may not let you keep a redirect page up after you leave. That said, google scholar will index your academic page if it is on a university server, something it will not necessarily do if you are on a private server. I actually started hosting my own website because I was doing a lot of large scale web experiments on MTurk and my university did not support PHP (which I wanted to use to roll my own experiments). I will write a blog post about MTurk at some point but on to the practical stuff!

I registered my domain name with GoDaddy. I am not sure how I feel about their advertising and they are probably not the cheapest but I have not had much trouble with them. I think it is worth shelling out for at least 5 years when you register your site so you do not have to think about it while in grad school.
I went with FatCow for actual hosting of my site as they seem to have a decent reputation and are headquartered in Vermont as far as I know, which is cool. They run about $80 a year (if you sign up for 3 years at a time) and have good tech support, lots of integrated tools and a reasonably easy and intuitive website management interface. One thing that can be a bit of a pain is if you use the online file editor to make changes to the HTML code for your web pages repeatedly in a short period of time, there can be a lag of up to half an hour before the changes show up on what gets served to you. This is likely because your page gets cached to some server that is closer to you and if you spam changes it will make you wait a bit. Not a big deal, but can be frustrating at times.
My strategy for starting a website was to pick a simple template for my home page, hack it down to be even simpler, and then build it up as needed. I started with a template from OpenDesigns and then edited it from there. I would suggest keeping your background white and the theme as simple as possible. Remember, you want people to take you seriously.
There are lots of great GUI web page editors available out there (WordPress of course), but I like to actually edit the HTML myself. I would suggest that you just try tweaking stuff repeatedly (like hundreds of times) and see what happens, and you should be able to figure out enough about HTML in a weekend to get things looking decent. You also need to practice Googling! Your brain should not try to commit too much web development stuff to memory -- that is where probability distributions are supposed to live. I did not know how to make an ordered list before writing this post so I just Googled "how to make a numbered list in html" and the first link got me there. There are so many people who post solutions to doing anything in HTML on StackExchange that you should never need to buy a book or even necessarily follow a tutorial (although they can be helpful). If you are a decent R programmer, then HTML is not a big jump.
If you want to blog about your awesome R code, then you will want to have some sort of code highlighting. I like prism.js a lot although it can be a bit finicky to make work correctly in my experience.

As I mentioned earlier, working on your webpage will not yield a large professional payoff, however, if you are going to spend time on your website, here are the things I would suggest you do:

Start with a free web space from your university if it is available and try to tweak a home page with no links or anything until you get it looking the way you want. Only consider upgrading to hosting your own site when you have outgrown the resources available to you for free. For example, they have a ton of awesome WordPress resources available at Penn State, so if I did not need to run experiments off my own server, I would probably take a beak from hosting my own website.
Your website should be clean and simple, so de-cluttering is a good use of time (although you should not let it get messy in the first place)
All presentations, papers, posters and a copy of you most current CV should be up there -- keep this stuff updated! This is the face you present to the world so keep it current.
Keeping an academic blog (where you do not give away good research ideas) but show people helpful tips for doing something useful or present cool graphics is a good way to build a name for yourself. Just be sure to share what you do on Twitter/FaceBook.

One more thing, here is a little snippet to embed your CV in a webpage:

 <object data="Denny_CV.pdf" type="application/pdf" width="100%" height="800">
<p>It appears you don't have a PDF plugin for this browser. No biggie... you can <a href="Denny_CV.pdf">click here to download the PDF file.</a></p>
</object>

Thats all for now -- will add more thoughts as they come to mind, enjoy!

The %in% operator in R -- 4/18/15

I was just reminded about the %in% operator in R the other day, it will tell you which and how many elements of one vector are in another, very useful for text analysis. Since I have been working a lot with legislative texts lately I decided to speed test the %in% operator find out the position that the unique noun-noun phrases I have been working with fall inside the larger class of Short noun phrases (bigrams and trigrams). I was expecting this to take at least ten seconds given the size of the vocabularies but was surprised with the results below -- less than 0.2 seconds, this operator is fast! Will post more about the uses of this operator soon, it is definitely a hidden gem!

rm(list = ls())
library(slam)
load("Document_Term_Matrix_List.Rdata")

NN <- colnames(Document_Term_Matrix_List$NN)
length(NN)
## 320523

ShortNP <- colnames(Document_Term_Matrix_List$JK)
length(ShortNP)
## 1480674
system.time({
     which(NN %in% ShortNP)
})
##   user  system elapsed 
##  0.178   0.002   0.181

Out of Memory! -- 4/5/15

There is no such thing as too much RAM...

Calculating Mutual Information in Rcpp -- 4/5/15

I have been thinking a lot lately about using concepts from information theory to identify meaningful phrases in legislative texts, and in particular I have been using mutual information for vocabulary partitioning as part of a paper I am working on with Brendan O'Connor and Hanna Wallach. I looked for a fast implementation of a mutual information calculator but decided to code up my own in RcppArmadillo. Lets take a look at the source code :

#include <RcppArmadillo.h>
#include <cmath.h>
//[[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

// [[Rcpp::export]]
double Mutual_Information(
    arma::mat joint_dist
    ){
    joint_dist = joint_dist/sum(sum(joint_dist));
    double mutual_information = 0;
    int num_rows = joint_dist.n_rows;
    int num_cols = joint_dist.n_cols;
    arma::mat colsums = sum(joint_dist,0);
    arma::mat rowsums = sum(joint_dist,1);
    for(int i = 0; i < num_rows; ++i){
        for(int j = 0; j <  num_cols; ++j){
            double temp = log((joint_dist(i,j)/(colsums[j]*rowsums[i])));
            if(!std::isfinite(temp)){
                temp = 0;
            }
            mutual_information += joint_dist(i,j) * temp; 
        }
    } 
    return mutual_information;    
}

I decided to try this out agianst the mi.plugin function included in the entropy package and it looks like about a 10x speedup. Enjoy!

> freqs2d = rbind(runif(10000000),runif(10000000) )
 > system.time({
 +     print(mi.plugin(freqs2d))
 + })
 [1] 0.1022667
    user  system elapsed 
   7.662   4.497  25.685 
 > system.time({
 +     print(Mutual_Information(freqs2d))
 + })
 [[1]]
 [1] 0.1022667

    user  system elapsed 
   0.815   0.463   2.400

Short Tutorials

rdirichlet in Rcpp -- 3/16/16

Building an Academic Website -- 5/8/15

The %in% operator in R -- 4/18/15

Out of Memory! -- 4/5/15

Calculating Mutual Information in Rcpp -- 4/5/15