Mark Pilgrim is fuming again. This time over NewsMonster’s method of aggregating feeds. Here’s his gripe:
[...] it has one particularly disturbing feature: extracting full HTML content from linked RSS items. The feature is off by default, but once turned on (one checkbox during installation), every time it finds a new RSS item in your feed, it will automatically download the linked HTML page (as specified in the RSS item’s link element), along with all relevant stylesheets, Javascript files, and images.
I fail to see the horror in this. It would indeed be annoying to site owners if, once enabled, it downloaded new content automagically at regular intervals, and more so if the intervals were extremely short. Like that guy who hammered my Audrey RSS every two minutes before I denied him with a .htaccess.
I find his arguments on robots.txt compliance to be spot on, though. If it’s an automated content crawler, it should honor robots.txt. Onward:
If I showed you a program that downloaded your home page (or any random page) and then followed all the links on that page, and downloaded all of those pages and all of the images on all of those pages, and then I told you that there was a simple standard way to control such programs but that this particular program didn’t support that standard, you’d scream bloody murder. (There are such programs, and they are considered the scourge of the industry, in the same league as spambots and image leechers.)
No, I wouldn’t scream bloody murder. Nor would I if someone used a downloading thingie to download my entire site for offline reading. Heck, your average RSS aggregator fits into the description of “downloading thingie for reading offline,” though it just grabs your RSS (well, unless it’s NewsMonster) instead of sucking down the entire site with all resources (images, style sheets, the whole nine yards). While the latter may be a bandwidth hit, I don’t see a problem with it. The only difference between this and reading the entire site with a browser is that the former is much quicker, but it uses exactly the same amount of your bandwidth.
Personally, I see this as a completely unnecessary feature in NewsMonster. When I read my daily dose of RSS items, I skip a large chunk of them after skimming through the headlines. Automatic caching of the permalink would be a complete waste if I never read it. And I like to read my stuff in Straw with just black text on a white background, without whatever layout the site owner decided I should look at.
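For what it’s worth, honoring robots.txt, the one point above where I agree with Mark, takes very little code. Here is a minimal sketch using Python’s standard urllib.robotparser. The robots.txt contents, the example.org URLs, the allowed() helper, and the idea that the aggregator would identify itself with the token "NewsMonster" are all assumptions for illustration, not anything taken from NewsMonster itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish. The "NewsMonster"
# user-agent token is an assumption made for this example.
ROBOTS_TXT = """\
User-agent: NewsMonster
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would instead call
    # set_url() and read() to fetch the site's actual robots.txt.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    post = "http://example.org/2003/post.html"
    for agent in ("NewsMonster", "Straw"):
        verdict = "may fetch" if allowed(agent, post) else "must skip"
        print(agent, verdict, post)
```

An aggregator that ran a check like this before each automatic page download would give site owners the off switch Mark is asking for, without taking the feature away from people who want it.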