Google says nearly 50pc of web is now in Unicode

31 Jan 2010

Almost half of the web is now encoded in Unicode, a standard that can represent the characters of thousands of languages, from Arabic to Chinese to Zulu, Google says, adding that it is determined to get beyond the 50pc mark.

Mark Davis, senior international software architect with Google, says there has been an exponential increase in the use of Unicode, which allows universal searches for documents in any number of scripts and languages.

“Web pages can use a variety of different character encodings, like ASCII, Latin-1, or Windows 1252 or Unicode,” Davis says in the official Google blog.

“Most encodings can only represent a few languages, but Unicode can represent thousands: from Arabic to Chinese to Zulu. We have long used Unicode as the internal format for all the text we search: any other encoding is first converted to Unicode for processing.”
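The conversion step Davis describes can be illustrated with a minimal Python sketch. It assumes the source encoding of each page is already known; the helper name `to_unicode` and the sample byte strings are hypothetical and are not taken from Google's systems.

```python
# Hypothetical helper: decode bytes from a known legacy encoding
# into a Unicode string (Python's str type).
def to_unicode(raw: bytes, encoding: str) -> str:
    return raw.decode(encoding)

# The byte 0xE9 is "é" in Latin-1.
latin1_text = to_unicode(b"caf\xe9", "latin-1")

# Bytes 0x93 and 0x94 are curly quotation marks in Windows-1252.
cp1252_text = to_unicode(b"\x93hello\x94", "cp1252")

# Once text is Unicode, it can be written out in a Unicode encoding such as UTF-8.
utf8_bytes = latin1_text.encode("utf-8")

print(latin1_text, cp1252_text, utf8_bytes)
```

In practice the hard part is often detecting which encoding a page actually uses, which this sketch does not attempt.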

Google recently upgraded to the latest version of Unicode, version 5.2 (via ICU and CLDR). This adds more than 6,600 new characters: some of mostly academic interest, such as Egyptian hieroglyphs, but many others for living languages.

“We’re constantly improving our handling of existing characters. For example, the characters ‘fi’ can either be represented as two characters (‘f’ and ‘i’), or a special display form ‘ﬁ’.

“A Google search for (financials) or (office) used to not see these as equivalent — to the software they would just look like *nancials and of*ce. There are thousands of characters like this, and they occur in surprisingly many pages on the web, especially generated PDF documents.

“But no longer — after extensive testing, we just recently turned on support for these and thousands of other characters; your searches will now also find these documents.
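One way to treat a ligature and its component letters as equivalent is Unicode compatibility normalisation (NFKC). The article does not say which mechanism Google uses, so the following Python sketch is only an illustration under that assumption; the function name `fold_compat` is hypothetical.

```python
import unicodedata

# Hypothetical helper: apply NFKC normalisation, which replaces
# compatibility characters such as ligatures with their plain equivalents.
def fold_compat(text: str) -> str:
    return unicodedata.normalize("NFKC", text)

# U+FB01 is the single-character "fi" ligature.
with_fi_ligature = "\ufb01nancials"

# U+FB03 is the single-character "ffi" ligature.
with_ffi_ligature = "o\ufb03ce"

print(fold_compat(with_fi_ligature))   # financials
print(fold_compat(with_ffi_ligature))  # office
```

Normalising both the indexed text and the query in the same way lets a search for the plain spelling match documents that contain the ligature form.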

“Further steps in our mission to organise the world’s information and make it universally accessible and useful,” Davis says.

By John Kennedy