Saturday, 1 January 2011

PubSubHubbub Content Distribution without charset encoding

I'm having an issue with feeds that don't declare their content
encoding in the XML declaration.

The content distribution defined in the PSHB spec (7.3) doesn't appear
to allow specification of a character encoding. The Content-Type is
defined as "application/rss+xml" or "application/atom+xml".

This would not be an issue if the feed XML specified the encoding in
its XML declaration; however, not all feeds do.

For example, Google Alert Feeds:
http://www.google.com/alerts/feeds/08979446703162538414/13217883862269731888

If I simply HTTP GET a Google Alert feed, the HTTP response reports
"Content-Type: text/xml; charset=UTF-8", which allows me to decode it
correctly. However, when the same feed arrives as part of a Content
Distribution, I don't have this information.

The HTTP standard http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html
states (in 3.7.1) that the default charset for text media types is
ISO-8859-1. My Java HttpServletRequest reader appears to be falling
back to this charset, which means I can't decode the stream correctly.
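
One subscriber-side workaround is to skip the servlet reader, read the
raw bytes, and pick the charset myself. The sketch below is only an
illustration under some assumptions: the class and method names
(FeedDecoder, chooseCharset, decode) are made up for this post, the
fallback to UTF-8 relies on XML's own default for documents without an
encoding declaration, and BOM / UTF-16 detection is left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any PSHB library.
class FeedDecoder {

    // charset parameter of a Content-Type header, e.g. "text/xml; charset=UTF-8"
    private static final Pattern HEADER_CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    // encoding pseudo-attribute of an XML declaration at the very start of the body
    private static final Pattern XML_ENCODING =
            Pattern.compile("^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    // Order of preference: header charset, XML declaration, then UTF-8 (assumed default).
    static Charset chooseCharset(String contentType, byte[] body) {
        if (contentType != null) {
            Matcher m = HEADER_CHARSET.matcher(contentType);
            if (m.find()) {
                return lookup(m.group(1));
            }
        }
        // ISO-8859-1 maps every byte to a char, so it is safe for peeking at the prefix.
        String head = new String(body, 0, Math.min(body.length, 256),
                StandardCharsets.ISO_8859_1);
        Matcher m = XML_ENCODING.matcher(head);
        if (m.find()) {
            return lookup(m.group(1));
        }
        return StandardCharsets.UTF_8;
    }

    // Unknown or malformed charset names fall back to UTF-8 instead of throwing.
    private static Charset lookup(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return StandardCharsets.UTF_8;
        }
    }

    // Reads the whole stream into memory, then decodes it with the chosen charset.
    static String decode(String contentType, InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        byte[] body = buffer.toByteArray();
        return new String(body, chooseCharset(contentType, body));
    }
}
```

In the servlet this would be called as
FeedDecoder.decode(request.getContentType(), request.getInputStream()),
using getInputStream() rather than getReader() so the container never
applies its own default.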

Simplest thing would be to get some Googler to fix the Google Alert
XML ;-)

Or could we consider specifying that hubs replicate the charset from
the fetch through into the Content Distribution?
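
To make the suggestion concrete, here is roughly what I imagine a hub
doing when it builds the Content-Type for a distribution request. This
is a sketch of the proposal, not anything in the current spec or in any
hub implementation I know of; CharsetPropagation and
distributionContentType are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical hub-side helper illustrating the proposal only.
class CharsetPropagation {

    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    // feedMediaType: "application/atom+xml" or "application/rss+xml"
    // fetchedContentType: Content-Type header the hub saw when fetching the feed (may be null)
    static String distributionContentType(String feedMediaType, String fetchedContentType) {
        if (fetchedContentType != null) {
            Matcher m = CHARSET.matcher(fetchedContentType);
            if (m.find()) {
                // Carry the publisher's charset through to the subscriber.
                return feedMediaType + "; charset=" + m.group(1);
            }
        }
        // No charset on the fetch: send the bare media type, as happens today.
        return feedMediaType;
    }
}
```

For the Google Alert feed above, the fetched header
"text/xml; charset=UTF-8" would turn into
"application/atom+xml; charset=UTF-8" on the distribution request.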
