Exporting Mercurial Data

16 February 2014

Yesterday, I came across a report card for GitHub users. It inspired me to mine the data from my current project, which uses Mercurial. For those of you who do not know, Mercurial is similar to git in that it is a distributed source control system. It is written in Python, which makes Python the natural choice for exporting the data I am interested in.

For now, I only want to get the revision information into MongoDB so that I can play with the data later. For this, I needed a few packages that I installed via pip.

The first package I installed is hgapi. Mercurial has a Python API that it uses internally. However, it is not an official API because (I suspect) the Mercurial team wants to keep its options open to change it. When Mercurial is installed, it also puts the API on the file system, but not in a location where Python can find it. There is a workaround to use it. However, to keep things simple, I opted to follow Mercurial's suggestion and used hgapi. Simply install hgapi via pip by running:

pip install hgapi

Since I am putting the data into MongoDB, I also needed a MongoDB driver. I am using PyMongo.

pip install pymongo

The Mercurial API returns the timestamp as a string. To store it properly in MongoDB, I want to parse that string into a datetime, so I also installed dateutil.

pip install python-dateutil
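
To make the string-to-datetime step concrete, here is a minimal sketch that uses only the standard library instead of dateutil. It assumes the date string looks like Mercurial's "isodate" style (for example "2014-02-16 10:30 -0500"); that format, the sample value, and the helper name parse_hg_date are assumptions for illustration. dateutil's parser.parse is more forgiving if the format varies.

```python
from datetime import datetime, timezone

# Assumed input format: an isodate-style string such as "2014-02-16 10:30 -0500".
ISODATE_FORMAT = "%Y-%m-%d %H:%M %z"


def parse_hg_date(text):
    """Parse an isodate-style string into a timezone-aware UTC datetime."""
    local = datetime.strptime(text, ISODATE_FORMAT)
    # Normalize to UTC so every stored timestamp uses the same offset.
    return local.astimezone(timezone.utc)


if __name__ == "__main__":
    # Hypothetical sample value, not taken from a real repository.
    print(parse_hg_date("2014-02-16 10:30 -0500"))
```

Normalizing to UTC is a design choice here: MongoDB stores dates as UTC, so converting up front keeps the stored value and the Python value in agreement.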

With all of the dependencies installed, it is time to put it all together. The code below opens the local Mercurial repository, loops through each of its revisions, and inserts each revision's metadata into a MongoDB collection.

import sys

import hgapi
from dateutil import parser
from pymongo import MongoClient

# Open the existing local repository folder.
repo = hgapi.Repo("c:/ProjectFolder/")

# Connect to the local MongoDB server and select the database.
c = MongoClient('mongodb://localhost/')
db = c.work_database

for rev in repo:
    try:
        print(rev.node)
        # The API returns the date as a string; parse it into a datetime.
        dt = parser.parse(rev.date)
        # Upsert keyed on the node so re-running does not duplicate commits.
        # replace_one stands in for save, which newer PyMongo releases removed.
        db.commits_collection.replace_one(
            {"_id": rev.node},
            dict(_id=rev.node,
                 timestamp=dt,
                 author=rev.author,
                 branch=rev.branch,
                 descr=rev.desc,
                 tags=rev.tags),
            upsert=True)
    except Exception:
        print("Unexpected error:", sys.exc_info()[0])
        # For some reason, an exception seems to be thrown at the end.
        # For my purposes, this is not something I am worried about.
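
The field mapping in the loop above can be pulled into a small function and checked without a repository or a running MongoDB server. This is only a sketch: revision_to_document is a hypothetical helper, and fake_rev is a made-up stand-in that merely carries the same attribute names the loop reads from an hgapi revision.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def revision_to_document(rev, parse_date):
    """Build the MongoDB document for one revision-like object."""
    return {
        "_id": rev.node,
        "timestamp": parse_date(rev.date),
        "author": rev.author,
        "branch": rev.branch,
        "descr": rev.desc,
        "tags": rev.tags,
    }


# Hypothetical stand-in for a revision; every value here is made up.
fake_rev = SimpleNamespace(
    node="0123456789ab",
    date="2014-02-16T10:30:00+00:00",
    author="Jane Doe <jane@example.com>",
    branch="default",
    desc="Fix typo",
    tags="tip",
)

# Any callable that turns the date string into a datetime can be passed in.
doc = revision_to_document(fake_rev, datetime.fromisoformat)
print(doc["_id"], doc["timestamp"].isoformat())
```

Passing the date parser in as an argument keeps the mapping independent of whichever parser you settle on, so dateutil's parser.parse would slot in the same way.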

I would love to know how the report card for GitHub users runs so fast. Looping through all of the revisions with the Mercurial API is not exactly fast. But, then again, I do not find Mercurial to be all that fast to begin with. That being said, the data is now in MongoDB, and I can use its speed to quickly map-reduce the data for reports.
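
As a rough idea of what one such report could look like, here is a sketch that counts commits per author. It swaps in MongoDB's aggregation pipeline rather than map-reduce, since the pipeline expresses this kind of grouping more directly. The names COMMITS_PER_AUTHOR, commits_per_author, and FakeCollection are hypothetical, and the fake collection with its made-up rows exists only so the sketch runs without a server.

```python
# Group commits by author, count them, and list the busiest authors first.
COMMITS_PER_AUTHOR = [
    {"$group": {"_id": "$author", "commits": {"$sum": 1}}},
    {"$sort": {"commits": -1}},
]


def commits_per_author(collection):
    """Return (author, commit count) pairs from a PyMongo-style collection."""
    return [(row["_id"], row["commits"])
            for row in collection.aggregate(COMMITS_PER_AUTHOR)]


class FakeCollection:
    """Hypothetical stand-in so the sketch runs without a MongoDB server."""

    def aggregate(self, pipeline):
        # Record the pipeline and return made-up rows in the expected shape.
        self.pipeline = pipeline
        return [{"_id": "alice", "commits": 3}, {"_id": "bob", "commits": 1}]


print(commits_per_author(FakeCollection()))
```

Against the real data, the call would be commits_per_author(db.commits_collection), assuming the collection was populated by the export loop.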