Over the last six months or so, part of the tech team at dotMobi has been working on Instant Mobilizer, a site mobilizer service to help small businesses dip their toes in the waters of the mobile web. One of the big problems with content transcoding is deciding when it should be applied: if there is little hope of it working well, an alternative should be offered.
To work out whether content transcoding is likely to work well on a particular site or page, we decided to try out some Bayesian logic. Bayesian logic, named after its creator, the Reverend Thomas Bayes, appears to have become popular only about 270 years after it was devised, thanks to the invention of email and the subsequent deluge of spam. Thankfully, the advances in computing that made internet email possible have also made it possible to apply Bayesian probability fast enough to be practical, something the Reverend Bayes never lived to see. As ever, new problems and new solutions advance in lockstep.
We trained up a home-grown Bayes implementation using a corpus of human-rated known-good and known-bad sites (sites that do or don't transcode well to mobile versions). In our first iteration of the test we used simple HTML tags as the training tokens, and gave the Bayes algorithm no prior knowledge of the world apart from what humans had previously rated as good or bad experiences. After a good deal of number crunching, the following HTML tags emerged as the leading indicators of the likely success of page transcoding:
Good tags:
- h4
- blockquote
- dl
- address
Bad tags (the "viagra" of tags):
- frame
- frameset
- noframes
- object
The bad tags are pretty obvious: even on a PC browser, frames are usually a usability disaster. The object tag also needs no explanation: there is not much that can be done with a Flash object to make it work well on mobile devices.
The good tags are more interesting:
- h4 – probably indicates a semantically designed page
- blockquote – tends to be used in "crafted", semantically-designed pages
- dl (definition list) – rarely used tag, mostly used by HTML cognoscenti
- address – ditto
In summary, and this isn't really surprising in hindsight, pages that are semantically (rather than visually) designed lend themselves best to being transcoded into other formats.
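For anyone who wants to experiment with the idea, here is a minimal sketch of the kind of training and scoring described above. It is not our production code: the function names (`extract_tags`, `train`, `tag_score`, `classify`), the add-one smoothing, and the choice to score only the tags present in a page are all assumptions made for illustration. Each page is reduced to the set of tags it contains, each tag gets a log-likelihood ratio from the rated corpus, and a page's score is the corpus prior plus the sum of its tag scores.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the set of distinct tag names that appear in a page."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.tags.add(tag)


def extract_tags(html):
    """Return the set of tag names used in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def train(rated_pages):
    """Count pages and tag occurrences per label.

    rated_pages is an iterable of (html, label) pairs, where label is
    'good' or 'bad' as assigned by a human rater (hypothetical format).
    """
    page_counts = Counter()
    tag_counts = {"good": Counter(), "bad": Counter()}
    for html, label in rated_pages:
        page_counts[label] += 1
        tag_counts[label].update(extract_tags(html))
    return page_counts, tag_counts


def tag_score(tag, page_counts, tag_counts):
    """Log-likelihood ratio for one tag: above 0 leans good, below 0 leans bad.

    Add-one smoothing (an assumption of this sketch) keeps unseen tags
    from producing a zero probability.
    """
    p_good = (tag_counts["good"][tag] + 1) / (page_counts["good"] + 2)
    p_bad = (tag_counts["bad"][tag] + 1) / (page_counts["bad"] + 2)
    return math.log(p_good / p_bad)


def classify(html, page_counts, tag_counts):
    """Return the log-odds that a page transcodes well.

    Simplification: only tags present in the page contribute, in the
    style of a spam filter; absent tags are ignored.
    """
    log_odds = math.log((page_counts["good"] + 1) / (page_counts["bad"] + 1))
    for tag in extract_tags(html):
        log_odds += tag_score(tag, page_counts, tag_counts)
    return log_odds
```

To rank tags the way the lists above do, call `tag_score` for every tag seen in training and sort the results: the most positive scores are the "good" indicators and the most negative are the "bad" ones. A page whose `classify` result is above zero would be a candidate for transcoding; where exactly to put the threshold is a judgment call that depends on the corpus.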