Mark Pilgrim is fuming again. This time over NewsMonster’s method of aggregating feeds. Here’s his gripe:
[...] it has one particularly disturbing feature: extracting full HTML content from linked RSS items. The feature is off by default, but once turned on (one checkbox during installation), every time it finds a new RSS item in your feed, it will automatically download the linked HTML page (as specified in the RSS item’s link element), along with all relevant stylesheets, Javascript files, and images.
I fail to see the horror in this. It would indeed be annoying to the site owners when enabled, if it downloads new content automagically at even intervals, more so if the intervals are extremely short. Like that guy who hammered my Audrey RSS every two minutes before I denied him with a .htaccess.
I find his arguments on robots.txt compliance to be spot on, though. If it’s an automated content crawler, it should honor robots.txt. Onward:
If I showed you a program that downloaded your home page (or any random page) and then followed all the links on that page, and downloaded all of those pages and all of the images on all of those pages, and then I told you that there was a simple standard way to control such programs but that this particular program didn’t support that standard, you’d scream bloody murder. (There are such programs, and they are considered the scourge of the industry, in the same league as spambots and image leechers.)
No, I wouldn’t scream bloody murder. Nor would I if someone used a downloading thingie to download my entire site for offline reading. Heck, your average RSS aggregator fits into the description of “downloading thingie for reading offline,” though it just grabs your RSS (well, unless it’s NewsMonster) instead of sucking down the entire site with all resources (images, style sheets, the whole nine yards). While the latter may be a bandwidth hit, I don’t see a problem with it. The only difference between this and reading the entire site with a browser is that the former is much quicker, but it uses exactly the same amount of your bandwidth.
Personally, I see this as a completely unnecessary feature in NewsMonster. When I read my daily dose of RSS items, I skip a large chunk of them after skimming through the headlines. Automatic caching of the permalink would be a complete waste if I never read it. And I like to read my stuff in Straw with just black text on a white background, without whatever layout the site owner decided I should look at.
One Comment to “Molehill?”
For the (un?)lucky bloggers (or any other content providers), like Mark, for whom bandwidth usage actually costs a significant amount of money, it’s justifiably a big problem.
The point of RSS is a) to iterate posts in a standardized way and b) to just iterate the posts in the standardized way, not the images, not the stylesheets, not the javascripts, nothing but the posts. It’s short for Rich Site Summary (depending on who you ask) and the keyword here is summary.
Also, I don’t know about you, but some of us don’t read all the posts the RSS reader aggregates; some posts just seem uninteresting and we ignore those. Thus, not all posts aggregated are page-download equivalents. Furthermore, browsers cache separate files like images (and sometimes CSS and JavaScript too); do NewsMonster and other RSS readers?
In conclusion, I think NewsMonster, and all other RSS readers, are robots and therefore should respect robots.txt. It goes without saying that they should use a distinct user-agent string and not identify themselves as a popular browser.
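For what it’s worth, honoring robots.txt under a distinct user-agent does not take much. Here is a minimal sketch in Python using the standard library’s urllib.robotparser. The user-agent token (ExampleAggregator), the sample robots.txt, the example.org URLs, and the helper names are all made up for illustration; none of this is NewsMonster’s actual code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent for an aggregator; a real one would pick its own
# distinct token instead of posing as a popular browser.
USER_AGENT = "ExampleAggregator/0.1"

# Made-up robots.txt: the site owner keeps this aggregator out of the
# archives but places no restrictions on anyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAggregator
Disallow: /archives/

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """Return True if the site's rules let our user-agent fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
feed_allowed = may_fetch(parser, "http://example.org/index.rss")
page_allowed = may_fetch(parser, "http://example.org/archives/2003/post.html")
```

With this sample file the feed may be fetched, while the linked page under /archives/ may not. An aggregator would run a check like this before each automatic page download and skip anything the site owner has disallowed.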