Analyzing ~425 days of Hacker News posts with standard shell commands
(About) 425 days ago (at the time of this writing) I started scraping Hacker News via its shiny new API. And then I promptly forgot about it. That is, until I noticed my cronjob had been throwing errors constantly for a few weeks:
Traceback (most recent call last):
  File "/home/dummy/projects/hn-cron/hn.py", line 62, in <module>
    main()
  File "/home/dummy/projects/hn-cron/hn.py", line 53, in main
    log_line = str(details['id']) + "\t" + details['title'] + "\t" + details['url'] + "\t" + str(details['score']) + "\n"
KeyError: 'url'
Instead of fixing anything, I just commented out the cronjob. But now I feel somewhat obligated to do at least a rudimentary analysis of this data. In keeping with my extreme negligence/laziness throughout this project, I hacked together a few bash commands to do just that.
A few notes about this data, and the (in)accuracy thereof:
- The script ran once every 40 minutes, collecting the 30 most popular stories (i.e. those on the front page), and adding them to the list if they were new
- I only know I started roughly 425 days ago because the first link in log.txt was this one right here (Who needs timestamps? I have IDs!)
- A not-insignificant percent (probably ~10%) of the time, the script would fail because the stupid(, stupid, stupid) Python 2 script I banged out in 10 minutes didn’t know how to handle Unicode characters properly (oops).
- I saved everything to a tab-delimited flat file. I probably should’ve used something else, but I didn’t, so here we are.
- I only saved the score from the first time a story was found, so theoretically any given post had at most an arbitrary 40-minute window to accumulate points. This is probably not strictly true for a number of reasons, but I’m going to pretend it is.
- These bash commands grew organically (often with much help from StackOverflow), so they made sense to me at the time, but YMMV
- The data is probably inaccurate in a million small ways, but overall, it’s at least worth poking at.
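For reference, the log_line in the traceback above implies that each row of log.txt holds four tab-separated fields: ID, title, URL, score. Here is a minimal sketch of that layout. The rows are invented for illustration (they are not from the real log), and the helper names sample_log and field are made up too:

```shell
# Build a tiny stand-in for log.txt. Every row here is invented sample data.
sample_log() {
  printf '%s\t%s\t%s\t%s\n' \
    1001 'Show HN: A thing' 'https://www.example.com/thing' 12 \
    1002 'Why tabs are fine' 'http://blog.example.org/tabs/' 7 \
    1003 'A paper about data [pdf]' 'https://example.com/paper.pdf' 31
}

# Pull out one column by number, splitting on tabs the same way
# the pipelines in this post do.
field() {
  awk -v n="$1" 'BEGIN { FS = "\t+" } { print $n }'
}

sample_log | field 2   # titles
sample_log | field 4   # scores
```

Because the separator is "\t+" (one or more tabs), an empty field would silently shift the later columns left. That probably never happens here, since the scraper crashed on stories with no URL before it could write them.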
Okay, let’s get down to it!
15 Most Popular Domains
Script
cat log.txt | uniq | awk 'BEGIN {FS = "\t+" }; {print $3}' | grep -o '^h.*' | sed 's/https\?:\/\///' | grep -o '^[^/]*' | sed 's/^www\.//' | sort | uniq -c | sort -nr | head -15
WTF does that do?
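- Drops repeated lines (couldn’t trust myself to actually get that part right with the script). Note that uniq only collapses duplicates that are right next to each other, so on an unsorted file this is a partial de-dupe at best
- Get the link, chop off junk (http(s)://, everything from the first slash onward, www.)
- Sort results (uniq -c only counts adjacent repeats, so identical domains have to be lined up next to each other first)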
- Get unique items again, outputting the number of repeats for each domain (i.e. number of links containing that domain)
- Sort this by its numeric, not lexicographic, value (i.e. where 100 > 11), in reverse (descending order)
- Get the first 15 lines/domains
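A quick way to see why the sort has to come before uniq -c, followed by a sketch of the same steps wrapped in a function that reads rows from stdin. The name top_domains is made up for this example, and it swaps the leading uniq for sort -u, so its counts may differ slightly from the ones reported in this post:

```shell
# uniq alone misses duplicates that are not adjacent; sorting first catches them.
printf 'a\nb\na\n' | uniq | wc -l          # still 3 lines
printf 'a\nb\na\n' | sort | uniq | wc -l   # 2 lines

# Hypothetical helper: count domains from tab-separated log rows on stdin.
# Optional first argument: how many domains to show (default 15).
top_domains() {
  sort -u |                                    # de-dupe whole rows, adjacent or not
    awk 'BEGIN { FS = "\t+" } { print $3 }' |  # URL column
    grep -o '^h.*' |                           # keep only values starting with "h" (http/https)
    sed -E 's#^https?://##' |                  # drop the scheme
    grep -o '^[^/]*' |                         # keep everything before the first slash
    sed 's/^www\.//' |                         # drop a leading www.
    sort | uniq -c | sort -nr | head -n "${1:-15}"
}
```

Something like top_domains 15 < log.txt should then give roughly the same list, give or take whatever duplicates the plain uniq let through.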
And?
2152 github.com
1387 nytimes.com
916 medium.com
731 techcrunch.com
486 washingtonpost.com
477 bbc.com
472 theguardian.com
420 wired.com
406 bloomberg.com
388 nautil.us
354 youtube.com
329 bbc.co.uk
324 newyorker.com
323 theatlantic.com
316 arstechnica.com
Most of these aren’t exactly shocking, though I suppose I didn’t realize just how popular nautil.us had become. Well done, chaps.
50 Most Popular Words
Script
cat log.txt | awk 'BEGIN {FS = "\t+" }; {print $2}' | grep -o "[^ ]*" | tr '[:upper:]' '[:lower:]' | tr -cd '[[:alnum:]\n]' | sort | uniq -c | sort -nr | head -50
WTF does that do?
- Gets titles
- Splits into words by spaces
- Converts to lowercase
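- Deletes anything that’s not a letter, number, or newline (square brackets survive too, because the brackets in the tr set are taken literally)
- Sorts, so identical words end up next to each other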
- Counts instances
- Sorts in reverse, numeric order
- Gets 50 lines/words (I couldn’t settle on where to draw the line on useless words, so I figured I’d just include the top 50)
And?
12446 the
7474 a
7166 of
6208 to
5296 in
5022 and
4513 for
3220 <- wtf? oops..
2561 is
2557 hn
2501 on
2392 with
2134 how
1877 show
1358 from
1347 new
1325 an
1313 [pdf]
1209 why
1103 your
978 are
969 you
920 at
896 what
830 by
816 data
811 that
805 i
729 google
692 as
678 ask
655 it
624 using
608 its
607 we
604 be
590 can
551 about
547 programming
545 web
543 us
543 not
524 code
510 do
501 my
483 open
471 go
471 first
467 c
465 language
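The blank entry with 3220 hits is most likely what remains of tokens made up entirely of punctuation (a lone dash between two words, say) once tr -cd has deleted every character in them. Here is a sketch of a variant that drops those empty lines and also strips the brackets from things like [pdf]. The function name top_words is made up, and this has not been run against the real log:

```shell
# Hypothetical helper: most common title words from tab-separated rows on stdin.
# Optional first argument: how many words to show (default 50).
top_words() {
  awk 'BEGIN { FS = "\t+" } { print $2 }' |  # title column
    tr '[:upper:]' '[:lower:]' |             # lowercase
    tr -s ' ' '\n' |                         # one token per line, split on spaces
    tr -cd '[:alnum:]\n' |                   # delete everything except letters, digits, newlines
    grep -v '^$' |                           # drop tokens that ended up empty
    sort | uniq -c | sort -nr | head -n "${1:-50}"
}
```

With the empty tokens filtered out, every later entry would move up one place and a fiftieth real word would appear at the bottom.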
Of course “The” is at the top of the list. But the order of common question words is (maybe) more interesting:
- How – 2134
- Why – 1209
- What – 896
- Who – 366
- When – 347
- Where – 157
So we care a lot about how stuff works, and why, and just what that stuff is, but we’re a global group of post-linear-time robots, so we don’t care about whos/whens/wheres.
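Who, when and where fall outside the top 50, so those counts need a separate lookup. One way to do that, sketched here with a made-up helper name (count_word) and the same tokenising steps as the word-count pipeline:

```shell
# Hypothetical helper: how many title tokens are exactly the given word?
# Reads tab-separated log rows on stdin; the word should be lowercase.
count_word() {
  awk 'BEGIN { FS = "\t+" } { print $2 }' |  # title column
    tr '[:upper:]' '[:lower:]' |             # lowercase
    tr -s ' ' '\n' |                         # one token per line
    tr -cd '[:alnum:]\n' |                   # strip punctuation
    grep -cxF "$1"                           # count lines equal to the word
}

# Example loop, assuming log.txt sits in the current directory:
#   for w in how why what who when where; do
#     printf '%s %s\n' "$w" "$(count_word "$w" < log.txt)"
#   done
```

The -x flag keeps "how" from also matching "show", which matters a lot on a site where every other title starts with "Show HN".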
Top 20 Hacker News Posts
Script
cat log.txt | uniq | awk 'BEGIN {FS = "\t+" }; {print $4" "$2" - "$3" ("$1")"}' | sort -nr | uniq | grep -vE "\((85|90)" | sed -r "s/\([0-9]+\)$//g" | head -20
WTF does that do?
- Print the fields in a different order (score title – URL(ID))
- Sort in reverse, numeric order
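- De-dupe again (this one works properly, since the lines are now sorted)
- Remove some arbitrary stories that got lucky (i.e. I started the script when they were already popular) based on their ID (anything starting with 85 or 90)
- Remove the ID from output
- Get the first 20 lines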
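A possible simplification: sort can order rows by the score column directly using -t and -k, so the fields don’t need to be shuffled around first and the ID never has to be stripped back out. A sketch follows. The name top_posts is made up, the ID filter mirrors the 85/90 prefixes from the command above, and the output may differ slightly from the original because duplicates are removed with sort -u up front:

```shell
# Hypothetical helper: highest-scoring posts from tab-separated rows on stdin.
# Optional first argument: how many posts to show (default 20).
top_posts() {
  tab=$(printf '\t')
  sort -u |                        # de-dupe whole rows
    grep -vE '^(85|90)' |          # skip rows whose ID starts with 85 or 90
    sort -t "$tab" -k4,4nr |       # numeric, descending, on the score column
    head -n "${1:-20}" |
    awk 'BEGIN { FS = "\t" } { print $4 " " $2 " - " $3 }'
}
```

Anchoring the grep at the start of the line also avoids accidentally dropping a story whose title happens to contain "(85" or "(90".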
And?
- Sir Terry Pratchett has died – 448
- Pro Rata – 330
- Unreal Engine 4 is now available to everyone for free – 326
- “Swift will be open source later this year” – 289
- Leonard Nimoy, Spock of ‘Star Trek,’ Dies at 83 – 285
- Airbnb, My $1B Lesson – 263
- Announcing Rust 1.0 – 257
- JRuby 9000 released – 246
- US to ban soaps and other products containing microbeads – 244
- Handwriting Generation with Recurrent Neural Networks – 217
- Snowden Meets the IETF – 187
- Fired – 178
- Symple Introduces the $89 Planet Friendly Ubuntu Linux Web Workstation – 167
- Jessica Livingston – 166
- FCC Passes Strict Net Neutrality Regulations on 3-2 Vote – 164
- Ellen Pao Is Stepping Down as Reddit’s Chief – 164
- YC Research – 158
- New Star Trek Series Premieres January 2017 – 158
- Just doesn’t feel good – 154
- Gay Marriage Upheld by Supreme Court – 154
From this list, it’s clear that there are a few things you can do to ensure you fit in with the HN zeitgeist and make it to the top of the front page:
- Be famous (BONUS: be Paul Graham)
- Be a life-changing programming language or framework
- Change politics forever (in America)
- Die or get fired
So there you have it, everything you never wanted to know about Hacker News! Thanks for reading, and I hope you enjoyed this slightly tongue-in-cheek analysis as much as I enjoyed writing it 🙂