27 February 2017
30 December 2015
Over the past few months I've mostly been blogging at a number of other venues. These include:
A piece with Mark Hallerberg in Democracy Audit UK summarising our research on how, contrary to previous findings, democratic governments run up bank bailout tabs about as sizable as those of autocracies. Previous work missed this because democratic governments have incentives (the possibility of losing elections) to shift the realisation of these costs into the future.
A post over at Bruegel introducing the Financial Supervisory Transparency Index that Mark Copelovitch, Mark Hallerberg, and I created. We also discuss supervisory transparency's implications for a European capital markets union.
I didn't write this one, but my co-author Tom Pepinsky wrote a nice piece about a new working paper we have on the (difficulty) of predicting financial crises.
20 June 2015
Previously on The Political Methodologist, I posted updates to the Database of Political Institutions' election timing variable: yrcurnt. That set of corrections was only for the current 28 EU member states.
I’ve now expanded the corrections to include most other OECD countries.[1] Again, there were many missing elections:
| Country | Correction |
|---|---|
| Australia | Corrects missing 1998 election year. |
| Canada | Corrects missing 2000, 2006, 2008, 2011 election years. |
| Iceland | Corrects missing 2009 election year. |
| Ireland | Corrects missing 2011 election. |
| Japan | Corrects missing 2005 and 2012 elections. Corrects misclassification of the 2003 and 2009 elections, which were originally erroneously labeled as being in 2004 and 2008, respectively. |
Import into R
To import the most recent corrected version of the data into R simply use:
election_time <- rio::import('https://raw.githubusercontent.com/christophergandrud/yrcurnt_corrected/master/data/yrcurnt_original_corrected.csv')
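Once the data is loaded, the variable can be turned into an election-year indicator. The following is a minimal sketch, assuming the DPI convention that yrcurnt counts the full years left in the current term, so that 0 marks an election year. The toy data frame, its values, and its column names (country, year, yrcurnt) are hypothetical stand-ins and not the layout of the corrected file.

```r
# Toy data: hypothetical values and column names, for illustration only.
toy <- data.frame(
  country = "A",
  year    = 2000:2005,
  yrcurnt = c(0, 2, 1, 0, 2, 1)
)

# Flag election years: yrcurnt == 0 means no full years are left in the term.
toy$election_year <- as.integer(toy$yrcurnt == 0)

# Years in which an election is recorded.
election_years <- toy$year[toy$election_year == 1]
print(election_years)
```

A missing election shows up in this kind of check as a term that never reaches 0, which is the sort of gap the corrections address.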
[1] Australia, Canada, Iceland, Israel, Japan, South Korea, New Zealand, Norway, Switzerland, USA
8 May 2015
Carson Sievert and Kenny Shirley have put together the really nice LDAvis R package. It provides a Shiny-based interactive interface for exploring the output from Latent Dirichlet Allocation topic models. If you've never used it, I highly recommend checking out their XKCD example (this paper also has some nice background).
LDAvis doesn't fit topic models; it just visualises the output. As such it is agnostic about what package you use to fit your LDA topic model. They have a useful example of how to use output from the lda package.
I wanted to use LDAvis with output from the topicmodels package. It works really nicely with texts preprocessed using the tm package. The trick is extracting the information LDAvis requires from the model and placing it into a specifically structured JSON formatted object.
To make the conversion from topicmodels output to LDAvis JSON input easier, I created a linking function called topicmodels_json_ldavis. The full function is below. To use it follow these steps:
1. Create a VCorpus object using the tm package.
2. Convert this to a document term matrix using DocumentTermMatrix, also from tm.
3. Run your model using the topicmodels package.
4. Convert the output into JSON format using topicmodels_json_ldavis. The function requires the output from steps 1-3.
5. Visualise with LDAvis.
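To make the shape of that conversion concrete, here is a rough sketch of what such a linking step could look like. It is an illustrative stand-in and not the topicmodels_json_ldavis function itself: the name sketch_topicmodels_to_ldavis is hypothetical, it takes only the fitted model and the document term matrix, and the choice of VCorpus(), LDA() and serVis() for steps 1, 3 and 5 in the commented usage is my assumption about which functions would be used.

```r
# Illustrative stand-in, NOT the topicmodels_json_ldavis function from the post.
# fitted:   a model returned by topicmodels::LDA()
# doc_term: the tm DocumentTermMatrix the model was fitted on
sketch_topicmodels_to_ldavis <- function(fitted, doc_term) {
  post  <- topicmodels::posterior(fitted)
  phi   <- as.matrix(post$terms)   # topics x terms probabilities
  theta <- as.matrix(post$topics)  # documents x topics probabilities
  vocab <- colnames(phi)

  # Dense copy of the counts; fine for a small corpus, costly for a large one.
  counts <- as.matrix(doc_term)

  LDAvis::createJSON(
    phi            = phi,
    theta          = theta,
    vocab          = vocab,
    doc.length     = rowSums(counts),        # term count per document in doc_term
    term.frequency = colSums(counts)[vocab]  # corpus-wide count per term, in vocab order
  )
}

# Usage following the five steps (texts is a hypothetical character vector):
# corpus   <- tm::VCorpus(tm::VectorSource(texts))             # step 1
# doc_term <- tm::DocumentTermMatrix(corpus)                   # step 2
# fitted   <- topicmodels::LDA(doc_term, k = 5)                # step 3
# json     <- sketch_topicmodels_to_ldavis(fitted, doc_term)   # step 4
# LDAvis::serVis(json)                                         # step 5
```

The sketch reads document lengths and term frequencies from the document term matrix, so both reflect whatever preprocessing was applied before the model was fitted.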
16 December 2014
I just finished teaching a new course on collaborative data science to social science students. The materials are on GitHub if you're interested.
What did we do and why?
Maybe the most unusual thing about this class from a statistics pedagogy perspective was that it was entirely focused on real world data; data that the students gathered themselves. I gave them virtually no instruction on what data to gather. They gathered data they felt would help them answer their research questions.
Students directly confronted the data warts that usually consume a large proportion of researchers' actual time. My intention was that the students systematically learn tools and best practices for how to address these warts.
This is in contrast to many social scientists' statistics education. Typically, students are presented with pre-arranged data. They are then asked to perform some statistical function with it. The end.
This leaves students underprepared for actually using statistics in an undirected project (their thesis, in a job). When confronted with data gathering and transformation issues in the real world, most muddle through, piecing together ad hoc techniques as they go along in a decidedly inefficient manner and often with poor results. A fair number of students will become frustrated and may never actually succeed in using any of the statistical tools they did learn.
What kind of data?
How does this course fit into a broader social science statistical education?
Zachary Jones had a really nice post the other day advocating that statistics courses use Monte Carlo simulation rather than real world data. The broad argument is that the messiness of real world data distracts students from carefully learning the statistical properties that instructors intend them to learn.
Superficially, it would seem that the course I just finished and Zachary's prescription are opposed. We could think of stats courses as using one of two different types of data:
simulated --- real world
Simulated vs. Real?
As you'll see, I almost entirely agree with Zachary's post, but I think there is a more important difference between the social science statistics course status quo and less commonly taught courses such as mine and what (I think) Zachary is proposing. The difference is where the data comes from: is it gathered/generated by students or is it prepackaged by an instructor?
Many status quo courses use data that is prepackaged by instructors. Both simulated and real world data can be prepackaged. I suppose there are many motivations for this, but an important one surely is that it is easier to teach. As an instructor, you know what the results will be and you know the series of clicks or code that will generate this answer. There are no surprises. Students may also find prepackaged data comforting as they know that there is a correct answer out there. They just need to decode the series of clicks to get it.
Though prepackaged data is easier for instructors and students, it surely is counterproductive in terms of learning how to actually answer research questions with data analysis.
Students will not learn the skills needed to gather and transform real world data so that it can be analysed. Students who simply load a prepackaged data set of simulated values will often not understand where it came from. They can succumb to the temptation to just click through until they get the right answer.
On the other hand I've definitely had the experience teaching with student simulated data that Zachary describes:
I think many students find [hypothesis testing] unintuitive and end up leaving with a foggy understanding of what tests do. With simulation I don't think it is so hard to explain since you can easily show confidence interval coverage, error rates, power, etc.
The more important distinction for thinking about what makes social science statistics education more or less effective is:
student gathered/generated --- instructor gathered/generated
Prepackaged vs. student generated data
There is of course a pedagogical difference between data that students gathered from the real world and data they simulated with a computer. Simulated data is useful for teaching the behaviour of statistical methods. Real world data is useful for teaching students how to plan and execute a project using these methods to answer research questions in a way that is reproducible and introduces fewer data munging biases into estimates. Though almost certainly too much to take on together in one course, both should be central to a well-rounded social science statistics education.
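As a small illustration of the kind of student-generated simulation discussed above, the sketch below checks confidence interval coverage by Monte Carlo in base R. The sample size, number of simulations, and true mean are arbitrary choices of mine, not values from any course material.

```r
set.seed(2014)

n_sims    <- 2000  # number of simulated data sets (arbitrary)
n_obs     <- 30    # observations per data set (arbitrary)
true_mean <- 5     # the population mean students are trying to recover

# For each simulated sample, record whether the 95% t-interval
# contains the true mean.
covered <- replicate(n_sims, {
  x  <- rnorm(n_obs, mean = true_mean, sd = 2)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Share of intervals that contain the true mean; should be close to 0.95.
coverage <- mean(covered)
print(coverage)
```

Because students generate the data themselves, they know the true mean and can see directly how often the interval captures it.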
10 December 2014
A few months ago I posted the script that I use to set up my R/JAGS working environment on an Amazon EC2 instance.
Since then I've largely transitioned to using R/Stan to estimate my models. So, I've updated my setup script (see below).
There are a few other changes:
I don't install/use RStudio on Amazon EC2. Instead, I just use R from the terminal. Don't get me wrong, I love RStudio. But since what I'm doing on EC2 is just running simulations (I handle the results on my local machine), RStudio is overkill.
I don't install git anymore. Instead I use source_url (from devtools) and source_data (from repmis) to source scripts from GitHub. Again, all of the manipulation I'm doing to these scripts is on my local machine.
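A minimal sketch of that git-free approach might look like the following. The helper name fetch_from_github and its two arguments are hypothetical. Only devtools::source_url() and repmis::source_data() come from the packages mentioned above, and the caller supplies the raw file URLs.

```r
# Hypothetical helper: run a remote script and load a remote data set
# without cloning a repository. Both URLs are supplied by the caller.
fetch_from_github <- function(script_url, data_url) {
  # Download and run the R script found at script_url.
  devtools::source_url(script_url)

  # Download the plain-text data file at data_url and return it as a data frame.
  repmis::source_data(data_url)
}
```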