(About) 425 days ago (at the time of this writing) I started scraping Hacker News via its shiny new API. And then I promptly forgot about it. That is, until I noticed my cronjob had been throwing errors constantly for a few weeks:
Traceback (most recent call last):
  File "/home/dummy/projects/hn-cron/hn.py", line 62, in <module>
    main()
  File "/home/dummy/projects/hn-cron/hn.py", line 53, in main
    log_line = str(details['id']) + "\t" + details['title'] + "\t" + details['url'] + "\t" + str(details['score']) + "\n"
KeyError: 'url'
Instead of fixing anything, I just commented out the cronjob. But now I feel somewhat obligated to do at least a rudimentary analysis of this data. In keeping with my extreme negligence/laziness throughout this project, I hacked together a few bash commands to do just that.
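For the record, the KeyError in that traceback most likely comes from stories that have no url field at all (text-only posts such as Ask HN). Here is a minimal sketch of a more forgiving version of that line, assuming details is a plain dict like the one in the traceback. The helper name format_log_line and the "-" placeholder are my own inventions for illustration, not part of the original hn.py:

```python
def format_log_line(details, placeholder="-"):
    """Build one tab-separated log line, tolerating a missing 'url' key."""
    fields = [
        str(details["id"]),
        details["title"],
        # Text-only posts have no 'url' key, so fall back to a placeholder.
        details.get("url", placeholder),
        str(details["score"]),
    ]
    return "\t".join(fields) + "\n"
```

The placeholder is a dash rather than an empty string on purpose: the awk commands further down use FS = "\t+", which treats a run of tabs as a single separator, so an empty URL field would shift the score into the URL column.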
A few notes about this data, and the (in)accuracy thereof:
- The script ran once every 40 minutes, collecting the 30 most popular stories (i.e. those on the front page), and adding them to the list if they were new
- I only know I started roughly 425 days ago because the first link in log.txt was this one right here (Who needs timestamps? I have IDs!)
- A not-insignificant percent (probably ~10%) of the time, the script would fail because the stupid(, stupid, stupid) Python 2 script I banged out in 10 minutes didn’t know how to handle Unicode characters properly (oops).
- I saved everything to a tab-delimited flat file. I probably should’ve used something else, but I didn’t, so here we are.
- I only saved the score from the first time a story was found, so theoretically any given post had at most an arbitrary 40-minute window to accumulate points. This is probably not strictly true for a number of reasons, but I’m going to pretend it is.
- These bash commands grew organically (often with much help from StackOverflow), so they made sense to me at the time, but YMMV
- The data is probably inaccurate in a million small ways, but overall, it’s at least worth poking at.
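A couple of the problems in the notes above (duplicate lines, Unicode crashes) are cheap to avoid. Here is a rough Python 3 sketch, assuming each story is a dict with id, title, score, and (optionally) url keys. The function name append_new_stories and the "-" placeholder are illustrative, and this is not what the original script did:

```python
def append_new_stories(stories, log_path):
    """Append stories whose ID is not yet in the log; return how many were added."""
    seen = set()
    try:
        # Reading and writing as UTF-8 sidesteps the Unicode crashes.
        with open(log_path, encoding="utf-8") as log:
            for line in log:
                seen.add(line.split("\t", 1)[0])
    except FileNotFoundError:
        pass  # first run: there is no log yet

    added = 0
    with open(log_path, "a", encoding="utf-8") as log:
        for story in stories:
            story_id = str(story["id"])
            if story_id in seen:
                continue  # already logged on an earlier run
            fields = [story_id, story["title"],
                      story.get("url", "-"), str(story["score"])]
            log.write("\t".join(fields) + "\n")
            seen.add(story_id)
            added += 1
    return added
```

Re-reading the whole log on every run is wasteful, but at 30 stories every 40 minutes it would hardly matter.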
Okay, let’s get down to it!
15 Most Popular Domains
Script
cat log.txt | uniq | awk 'BEGIN {FS = "\t+" }; {print $3}' | grep -o '^h.*' | sed 's/https\?:\/\///' | grep -o '^[^/]*' | sed 's/^www\.//' | sort | uniq -c | sort -nr | head -15
WTF does that do?
- Drops repeated lines in the file (strictly speaking, uniq only catches duplicates that sit right next to each other; I couldn’t trust myself to actually get that part right with the script)
- Get the link, chop off junk (http(s)://, trailing slash, www.)
- Sort results (uniq -c only counts adjacent identical lines, so the domains have to be grouped together first)
- Get unique items again, outputting the number of repeats for each domain (i.e. number of links containing that domain)
- Sort this by its numeric, not lexicographic, value (i.e. where 100 > 11), in reverse (descending order)
- Get the first 15 lines/domains
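For anyone who finds pipelines like that unreadable, here is roughly the same idea as a Python sketch. It assumes the log lines have four tab-separated fields (id, title, url, score); top_domains is a name I made up. It is not a line-for-line port: it drops all duplicate lines (not just adjacent ones) and lowercases the host.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_domains(lines, n=15):
    """Count link domains in tab-separated log lines (id, title, url, score)."""
    counts = Counter()
    for line in set(lines):  # drop duplicate lines, adjacent or not
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or not fields[2].startswith("http"):
            continue  # skip malformed lines and stories without a link
        host = urlsplit(fields[2]).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # same as the sed 's/^www\.//' step
        counts[host] += 1
    return counts.most_common(n)
```

Usage would be something like `with open("log.txt", encoding="utf-8") as f: print(top_domains(f))`.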
And?
2152 github.com
1387 nytimes.com
916 medium.com
731 techcrunch.com
486 washingtonpost.com
477 bbc.com
472 theguardian.com
420 wired.com
406 bloomberg.com
388 nautil.us
354 youtube.com
329 bbc.co.uk
324 newyorker.com
323 theatlantic.com
316 arstechnica.com
Most of these aren’t exactly shocking, though I suppose I didn’t realize just how popular nautil.us had become. Well done, chaps.
50 Most Popular Words
Script
cat log.txt | awk 'BEGIN {FS = "\t+" }; {print $2}' | grep -o "[^ ]*" | tr '[:upper:]' '[:lower:]' | tr -cd '[[:alnum:]\n]' | sort | uniq -c | sort -nr | head -50
WTF does that do?
- Gets titles
- Splits into words by spaces
- Converts to lowercase
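- Deletes anything that’s not a letter, number, or newline (square brackets also survive, because the outer [ ] in the tr set are treated as literal characters, which is why [pdf] shows up below)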
- Sorts, so identical words end up mashed together
- Counts instances
- Sorts in reverse, numeric order
- Gets 50 lines/words (I couldn’t settle on where to draw the line on useless words, so I figured I’d just include the top 50)
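The same word count can be sketched in a few lines of Python. This again assumes tab-separated lines with the title in the second field; top_words is a hypothetical helper name. Two deliberate differences from the pipeline: tokens that end up empty after stripping punctuation are skipped (my guess is that those account for the blank row in the results below), and square brackets are stripped too, so "[pdf]" would be counted as "pdf".

```python
from collections import Counter


def top_words(lines, n=50):
    """Count lowercase words in the title field of tab-separated log lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # no title field on this line
        for token in fields[1].lower().split():
            # Keep only letters and digits within each token.
            word = "".join(ch for ch in token if ch.isalnum())
            if word:  # pure-punctuation tokens such as "-" become empty
                counts[word] += 1
    return counts.most_common(n)
```

Since most_common returns (word, count) pairs, looking up specific words afterwards is as easy as turning the result into a dict.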
And?
12446 the
7474 a
7166 of
6208 to
5296 in
5022 and
4513 for
3220 <- wtf? oops..
2561 is
2557 hn
2501 on
2392 with
2134 how
1877 show
1358 from
1347 new
1325 an
1313 [pdf]
1209 why
1103 your
978 are
969 you
920 at
896 what
830 by
816 data
811 that
805 i
729 google
692 as
678 ask
655 it
624 using
608 its
607 we
604 be
590 can
551 about
547 programming
545 web
543 us
543 not
524 code
510 do
501 my
483 open
471 go
471 first
467 c
465 language
Of course “The” is at the top of the list. But the order of common question words is (maybe) more interesting:
- How - 2134
- Why - 1209
- What - 896
- Who - 366
- When - 347
- Where - 157
So we care a lot about how stuff works, and why, and just what that stuff is, but we’re a global group of post-linear-time robots, so we don’t care about whos/whens/wheres.
Top 20 Hacker News Posts
Script
cat log.txt | uniq | awk 'BEGIN {FS = "\t+" }; {print $4" "$2" - "$3" ("$1")"}' | sort -nr | uniq | grep -vE "\((85|90)" | sed -r "s/\([0-9]+\)$//g" | head -20
WTF does that do?
- Print the fields in a different order (score title - URL(ID))
- Sort in reverse, numeric order
- De-dupe (again?)
- Remove some arbitrary stories that got lucky (i.e. I started the script when they were already popular) based on their ID
- Remove the ID from output
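Those steps translate to Python fairly directly. A sketch under the same assumptions as before (four tab-separated fields: id, title, url, score); top_posts and its skip_id_prefixes parameter are names I made up. Unlike the grep, which matches "(85" or "(90" anywhere in the line, this only checks the start of the ID, and it de-dupes by ID rather than by whole line:

```python
def top_posts(lines, n=20, skip_id_prefixes=("85", "90")):
    """Return the n highest-scoring posts as (score, title, url) tuples."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or not fields[3].isdigit():
            continue  # malformed line
        story_id, title, url = fields[0], fields[1], fields[2]
        score = int(fields[3])
        if story_id.startswith(skip_id_prefixes):
            continue  # stories that were already popular when logging began
        # De-dupe by ID, keeping the highest score seen for each story.
        if story_id not in best or score > best[story_id][0]:
            best[story_id] = (score, title, url)
    ranked = sorted(best.values(), key=lambda post: post[0], reverse=True)
    return ranked[:n]
```

Converting the score to an int before sorting also removes the need for the numeric-versus-lexicographic trick that sort -nr handles in the shell version.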
And?
- Sir Terry Pratchett has died - 448
- Pro Rata - 330
- Unreal Engine 4 is now available to everyone for free - 326
- “Swift will be open source later this year” - 289
- Leonard Nimoy, Spock of ‘Star Trek,’ Dies at 83 - 285
- Airbnb, My $1B Lesson - 263
- Announcing Rust 1.0 - 257
- JRuby 9000 released - 246
- US to ban soaps and other products containing microbeads - 244
- Handwriting Generation with Recurrent Neural Networks - 217
- Snowden Meets the IETF - 187
- Fired - 178
- Symple Introduces the $89 Planet Friendly Ubuntu Linux Web Workstation - 167
- Jessica Livingston - 166
- FCC Passes Strict Net Neutrality Regulations on 3-2 Vote - 164
- Ellen Pao Is Stepping Down as Reddit’s Chief - 164
- YC Research - 158
- New Star Trek Series Premieres January 2017 - 158
- Just doesn’t feel good - 154
- Gay Marriage Upheld by Supreme Court - 154
From this list, it’s clear that there are a few things you can do to ensure you fit in with the HN zeitgeist and make it to the top of the front page:
- Be famous (BONUS: be Paul Graham)
- Be a life-changing programming language or framework
- Change politics forever (in America)
- Die or get fired
So there you have it, everything you never wanted to know about Hacker News! Thanks for reading, and I hope you enjoyed this slightly tongue-in-cheek analysis as much as I enjoyed writing it :)