Cal Newport over at Study Hacks has a great article on building a research paper database.

While his post is aimed at undergraduates, there are basic principles here that someone organizing a larger database for a dissertation or a scholarly career can learn from.

1) Keys – the source identification number is what in database terminology is referred to as a primary key.  It is one string/number that refers you back to a location that has all the information you may need about a source. 

In database design (my prior life) there are two types of keys: intelligent and surrogate.

  • A surrogate key is just a number – you could start at 1 and count up.  It has no meaning embedded in it.
  • An intelligent key has some meaning embedded in it.  For example, I could build a key by combining the first author’s last name, the year published, and a predefined abbreviation for the journal, resulting in something that looked like this: Baker2006EdPol.

At first glance, the intelligent key may seem like a really neat idea – it is easier to remember, so you don’t have to keep going back to your source list to look up where something came from.  However, it has some problems:

  1. What if there were more than one Baker to publish in the journal Education Policy in 2006?  Or if this Baker published more than once in that journal in that year?
  2. You now have to memorize all the journal abbreviations you invented, which is a challenge.
  3. What if you have, rather than a journal article, a book chapter?  What about a newspaper article or report by a think tank?
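
The first problem is easy to see in a few lines of Python. This is only a rough sketch: the two sources, the `intelligent_key` helper, and the dictionary names are made up for illustration, with both articles by a Baker in the same journal and year.

```python
from itertools import count

# Hypothetical sources; titles and the journal abbreviation are invented.
sources = [
    {"author": "Baker", "year": 2006, "journal": "EdPol", "title": "First article"},
    {"author": "Baker", "year": 2006, "journal": "EdPol", "title": "Second article"},
]

def intelligent_key(source):
    """Build a key like Baker2006EdPol from the source's own fields."""
    return f"{source['author']}{source['year']}{source['journal']}"

# Intelligent keys: both articles produce the same key, so the second
# one silently replaces the first.
by_intelligent = {}
for s in sources:
    by_intelligent[intelligent_key(s)] = s

# Surrogate keys: start at 1 and count up; every source gets its own key.
next_id = count(1)
by_surrogate = {next(next_id): s for s in sources}

print(len(by_intelligent), "source(s) stored under intelligent keys")
print(len(by_surrogate), "source(s) stored under surrogate keys")
```

With the intelligent key, one of the two articles is lost without any warning; with the surrogate key, both survive.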

There are more, but the short answer is that you are better off just making up consecutive numbers.  If you insist on some organization to your database (unnecessary to the database/spreadsheet – this is just for you) you can always create a partial intelligent key by, for example, tacking the year onto the beginning.  If you’ve ever gotten a number from technical support like this: 20071001-558, that is what they have done.  The date is intelligent, but the number AFTER the date is just the next one in sequence.
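
A partial intelligent key like that might be generated as follows. This is a sketch, not how any particular support system works: `key_generator` is a hypothetical helper, and it assumes the sequence number keeps counting across dates rather than restarting each day.

```python
from datetime import date
from itertools import count

def key_generator(start=1):
    """Return a function that hands out keys shaped like YYYYMMDD-N."""
    seq = count(start)  # the surrogate part: just the next number in line

    def new_key(day):
        # The date prefix carries meaning; the number after the dash does not.
        return f"{day:%Y%m%d}-{next(seq)}"

    return new_key

new_key = key_generator(start=558)
print(new_key(date(2007, 10, 1)))
print(new_key(date(2007, 10, 2)))
```

Because the number after the dash never repeats, the key stays unique even if the date part turns out to be wrong or is shared by many sources.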

2) Extracted Quotes – it seems so much easier to just highlight/underline a meaningful sentence and keep going, but it is almost impossible to find the right sentence again when you sit down to write.  By keeping quotes in a searchable, sortable format that is linked to where they came from, you make your own writing process faster and easier.
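
If you would rather use a small database than a spreadsheet, the same idea could look something like this with Python's built-in sqlite3 module. The table names, column names, and placeholder rows are all assumptions for illustration; the original article uses spreadsheet pages, not SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE sources (
        source_id INTEGER PRIMARY KEY,   -- surrogate key
        citation  TEXT NOT NULL
    );
    CREATE TABLE quotes (
        quote_id  INTEGER PRIMARY KEY,
        source_id INTEGER NOT NULL REFERENCES sources(source_id),
        page      INTEGER,
        quote     TEXT NOT NULL
    );
""")

# Placeholder data; swap in real citations and quotes.
conn.executemany(
    "INSERT INTO sources (source_id, citation) VALUES (?, ?)",
    [(1, "Placeholder citation A"), (2, "Placeholder citation B")],
)
conn.executemany(
    "INSERT INTO quotes (source_id, page, quote) VALUES (?, ?, ?)",
    [
        (1, 12, "Placeholder quote about school funding."),
        (2, 40, "Placeholder quote about teacher retention."),
        (1, 15, "Another placeholder quote about funding formulas."),
    ],
)

def find_quotes(term):
    """Return (citation, page, quote) rows whose quote text contains term."""
    return conn.execute(
        """
        SELECT s.citation, q.page, q.quote
        FROM quotes AS q
        JOIN sources AS s ON s.source_id = q.source_id
        WHERE q.quote LIKE ?
        ORDER BY s.source_id, q.page
        """,
        (f"%{term}%",),
    ).fetchall()

for row in find_quotes("funding"):
    print(row)
```

Each quote row carries only the source number, so the full citation lives in one place and every search result still tells you exactly where the sentence came from.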

3) Extensible – while this isn’t really mentioned in the original article, you can extend this solution to handle multiple projects or many different media types.  You can add new pages for additional types of information, such as a page on themes instead of quotes if you are working with literature.  You can then give each major THEME a code and link that to the quotes as well, so that now you know not just what the source was but also what major theme of the work you are relating this quote to.  You can also add whatever additional columns are meaningful to you: URLs, whether there is a useful graphic, tags, etc.
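
As a rough sketch of the theme idea, here is one way the extra page and extra columns might be modeled in plain Python. The theme codes, field names, and sample rows are all hypothetical; use whatever columns are meaningful to you.

```python
# Hypothetical "themes" page: one short code per major theme.
themes = {
    "T1": "Isolation",
    "T2": "Memory",
}

# Each quote row links to a source number and a theme code, and carries
# whatever extra columns you find useful (here: tags and a graphic flag).
quotes = [
    {"source_id": 1, "theme": "T1", "quote": "Placeholder quote one",
     "tags": ["opening"], "has_graphic": False},
    {"source_id": 2, "theme": "T2", "quote": "Placeholder quote two",
     "tags": [], "has_graphic": True},
    {"source_id": 3, "theme": "T1", "quote": "Placeholder quote three",
     "tags": ["ending"], "has_graphic": False},
]

def quotes_for_theme(code):
    """Return every quote row linked to the given theme code."""
    if code not in themes:
        raise KeyError(f"Unknown theme code: {code}")
    return [q for q in quotes if q["theme"] == code]

for q in quotes_for_theme("T1"):
    print(themes[q["theme"]], "| source", q["source_id"], "|", q["quote"])
```

Adding a new column later is just a matter of adding another field to each row; nothing about the source numbers or theme codes has to change.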

Personally, my goal is to get Zotero to be the database so that I don’t HAVE to manage a database/spreadsheet of this stuff.  Its ability to keep notes and tags should make searching/reorganizing possible, but I’ll let you know after I use it for a few things.  Still, if you prefer to work with something you know, this method is great.
