Analyzing ~425 days of Hacker News posts with standard shell commands

(About) 425 days ago (at the time of this writing) I started scraping Hacker News via its shiny new API. And then I promptly forgot about it. That is, until I noticed my cronjob had been throwing errors constantly for a few weeks:

Traceback (most recent call last):
  File "/home/dummy/projects/hn-cron/hn.py", line 62, in <module>
    main()
  File "/home/dummy/projects/hn-cron/hn.py", line 53, in main
    log_line = str(details['id']) + "\t" + details['title'] + "\t" + details['url'] + "\t" + str(details['score']) + "\n"
KeyError: 'url'

Instead of fixing anything, I just commented out the cronjob. But now I feel somewhat obligated to do at least a rudimentary analysis of this data. In keeping with my extreme negligence/laziness throughout this project, I hacked together a few bash commands to do just that.
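
For the record, the traceback points at the likely cause: line 53 indexes details['url'] directly, and not every Hacker News item carries a url field (text-only posts such as Ask HN threads generally do not). Here is a minimal sketch of a more forgiving version of that line, assuming details is the dict decoded from one item's JSON. The helper name format_log_line and the empty-string fallback are illustrative choices, not what hn.py actually did:

```python
def format_log_line(details):
    """Build one tab-separated log line: id, title, url, score.

    Hypothetical helper; `details` is assumed to be the dict decoded
    from one item's JSON. Missing fields fall back to defaults instead
    of raising KeyError.
    """
    fields = [
        str(details["id"]),            # every item is expected to have an id
        details.get("title", ""),
        details.get("url", ""),        # assumed absent on text-only posts
        str(details.get("score", 0)),
    ]
    # Tabs or newlines inside a field would corrupt the flat file.
    cleaned = [f.replace("\t", " ").replace("\n", " ") for f in fields]
    return "\t".join(cleaned) + "\n"
```

Leaving the URL column empty keeps every line at four columns, so anything that splits on tabs later still sees a consistent shape.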

A few notes about this data, and the (in)accuracy thereof:

  1. The script ran once every 40 minutes, collecting the 30 most popular stories (i.e. those on the front page), and adding them to the list if they were new.
  2. I only know I started roughly 425 days ago because the first link in log.txt was this one right here (Who needs timestamps? I have IDs!)
  3. A not-insignificant percent (probably ~10%) of the time, the script would fail because the stupid(, stupid, stupid) Python 2 script I banged out in 10 minutes didn’t know how to handle Unicode characters properly (oops).
  4. I saved everything to a tab-delimited flat file. I probably should’ve used something else, but I didn’t, so here we are.
  5. I only saved the score from the first time a story was found, so theoretically any given post had, at most, an arbitrary 40-minute window to accumulate points. This is probably not strictly true for a number of reasons, but I’m going to pretend it is.
  6. These bash commands grew organically (often with much help from Stack Overflow), so they made sense to me at the time, but YMMV.
  7. The data is probably inaccurate in a million small ways, but overall, it’s at least worth poking at.
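
Notes 1 and 4 together describe a simple append-if-new log. Here is a rough sketch of that bookkeeping in Python 3, where reading and writing the file as UTF-8 sidesteps the Unicode failures from note 3. The function name append_new_stories, the column order (borrowed from the traceback above), and the assumption that stories arrive as already-fetched dicts are all illustrative, since the original script is not shown here:

```python
def append_new_stories(log_path, stories):
    """Append stories whose id is not yet in the tab-separated log.

    Hypothetical sketch: `stories` is assumed to be a list of dicts with
    id/title/url/score keys, already fetched elsewhere. Returns the
    number of lines written.
    """
    seen = set()
    try:
        with open(log_path, encoding="utf-8") as log:
            for line in log:
                seen.add(line.split("\t", 1)[0])  # first column is the id
    except FileNotFoundError:
        pass  # first run: there is no log yet

    written = 0
    with open(log_path, "a", encoding="utf-8") as log:
        for story in stories:
            story_id = str(story["id"])
            if story_id in seen:
                continue  # keep only the score from the first sighting
            row = [
                story_id,
                story.get("title", ""),
                story.get("url", ""),
                str(story.get("score", 0)),
            ]
            log.write("\t".join(row) + "\n")
            seen.add(story_id)
            written += 1
    return written
```

Because a story is skipped once its ID is in the log, the recorded score is whatever it was on the first run that saw it, which matches note 5.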

Okay, let’s get down to it!


dot-man

I recently hacked together dot-man, a little 300-line bash script to manage my dotfiles. Basically, it lets you keep your dotfiles in a git repository, and you can run it every so often to keep your local and remote dotfiles up to date.

Install is as simple as:

git clone git@github.com:cneill/dot-man.git
OR
git clone https://github.com/cneill/dot-man.git

Let me know what you think! You can find me on Twitter @ccneill.

Announcing DefectDojo v1.0.2!

I’m happy to announce the latest version of a project that the Security Engineering team at Rackspace has been working on: DefectDojo! DefectDojo is an open source defect tracking system that our team created to keep up with security engagements, but it can be useful for tracking any type of application testing. It supports features like Finding templates, PDF report generation, metrics graphs, charts, and some self-service tools (for doing port scans, for example).

Checking out DefectDojo

A view of the DefectDojo dashboard


To get the latest version, you can download a zip file or view the source on GitHub. Want to check out a demo before installing it on your machine? We have you covered.

Log in as admin:

Log in as product owner / non-staff user:
