About 425 days ago (at the time of this writing), I started scraping Hacker News via its shiny new API. And then I promptly forgot about it. That is, until I noticed my cronjob had been throwing errors constantly for a few weeks:
```
Traceback (most recent call last):
  File "/home/dummy/projects/hn-cron/hn.py", line 62, in <module>
    main()
  File "/home/dummy/projects/hn-cron/hn.py", line 53, in main
    log_line = str(details['id']) + "\t" + details['title'] + "\t" + details['url'] + "\t" + str(details['score']) + "\n"
KeyError: 'url'
```
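For the record, the crash has a boring explanation: Ask HN and job posts have no `url` field in the HN API, so `details['url']` raises a KeyError. A sketch of the guard I should have written (not the actual code, which I clearly never fixed):

```python
def format_log_line(details):
    """Build one tab-delimited log line from an HN item dict (a sketch, not the real script)."""
    return "%d\t%s\t%s\t%d\n" % (
        details["id"],
        details["title"],
        details.get("url", ""),   # Ask HN / job posts have no 'url' key
        details.get("score", 0),  # belt and suspenders
    )
```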
Instead of fixing anything, I just commented out the cronjob. But now I feel somewhat obligated to do at least a rudimentary analysis of this data. In keeping with my extreme negligence/laziness throughout this project, I hacked together a few bash commands to do just that.
A few notes about this data, and the (in)accuracy thereof:
- The script ran once every 40 minutes, collecting the 30 most popular stories (i.e. those on the front page), and adding them to the list if they were new (roughly the loop sketched after this list)
- I only know I started roughly 425 days ago because the first link in log.txt was this one right here (Who needs timestamps? I have IDs!)
- A not-insignificant percentage of the time (probably ~10%), the script would fail because the stupid(, stupid, stupid) Python 2 script I banged out in 10 minutes didn’t know how to handle Unicode characters properly (oops).
- I saved everything to a flat file with tab delimiters. I probably should’ve used something else, but I didn’t, so here we are.
- I only saved the score from the first time a story was found, so in theory any given post had at most an arbitrary 40-minute window to accumulate points. This is probably not strictly true for a number of reasons, but I’m going to pretend it is.
- These bash commands grew organically (often with much help from StackOverflow), so they made sense to me at the time, but YMMV
- The data is probably inaccurate in a million small ways, but overall, it’s at least worth poking at.
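Since the notes above basically describe the whole pipeline, here’s a rough Python 3 reconstruction of the collector (a sketch from memory, not the actual long-gone Python 2 script): pull the 30 front-page IDs from `/v0/topstories`, fetch each new item, and append it to the tab-delimited log.

```python
import json
import urllib.request

API = "https://hacker-news.firebaseio.com/v0"
LOG = "log.txt"

def fetch(path):
    """GET a JSON resource from the HN Firebase API."""
    with urllib.request.urlopen("%s/%s.json" % (API, path)) as resp:
        return json.load(resp)

def main():
    # IDs we've already logged; column 0 of the tab-delimited log.
    try:
        with open(LOG, encoding="utf-8") as f:
            seen = {line.split("\t", 1)[0] for line in f}
    except FileNotFoundError:
        seen = set()

    # Front page = first 30 entries of /topstories.
    # encoding="utf-8" is the part Python 2 me got wrong.
    with open(LOG, "a", encoding="utf-8") as log:
        for item_id in fetch("topstories")[:30]:
            if str(item_id) in seen:
                continue
            details = fetch("item/%d" % item_id)
            log.write("%d\t%s\t%s\t%d\n" % (
                details["id"],
                details["title"],
                details.get("url", ""),   # Ask HN / job posts have no 'url'
                details.get("score", 0),
            ))

if __name__ == "__main__":
    main()
```

Run that from cron every 40 minutes and you get more or less the log.txt this post is about, Unicode crashes not included.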
Okay, let’s get down to it!