UTF-8, Indic and Stub Length Article in Wikipedia

  • Access to Knowledge

U.B.Pavanaja

20 October 2016

One of the activities conducted as part of Wiki Conference India 2016 was the Punjab Editathon. It was about adding articles related to Punjab to Indian language Wikipedias and English Wikipedia. An award for the highest contribution was also announced.

See the original blog post at Dr. Pavanaja Blog


This led to continued discussion in a closed chat group about how to decide the winner. People thought it would be simple: announce the winner based on the highest number of bytes added. At first glance it looked like a trivial case. I pointed out during the discussion that the encoding used in Wikipedia is UTF-8, which uses a different number of bytes for English and for Indian languages. Before giving more details, I would like to draw your attention to a simple experiment.

I typed the Kannada letter ಅ (a) in my Sandbox in Kannada Wikipedia and saved it. Then I checked the RecentChanges page in Kannada Wikipedia. It showed that I had added 3 bytes to my Sandbox page, even though I had added just one Kannada character. I repeated the experiment in English Wikipedia: I added a single letter, the English letter “A”, to my Sandbox and checked the number of bytes added. It showed just one byte.

[Screenshots: the Kannada letter ಅ added to the Kannada Wikipedia Sandbox and the byte count reported for it; the English letter “A” added to the English Wikipedia Sandbox, reported as 1 byte.]
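The same experiment can be repeated offline. The following is a minimal sketch, assuming that the byte count shown on RecentChanges corresponds to the UTF-8 size of the added text; the helper name `utf8_size` is made up for this illustration.

```python
def utf8_size(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


kannada_a = "\u0c85"  # KANNADA LETTER A (ಅ)
english_a = "A"       # LATIN CAPITAL LETTER A

print(utf8_size(kannada_a))  # 3
print(utf8_size(english_a))  # 1

# The three bytes that make up the single Kannada letter.
print(kannada_a.encode("utf-8").hex(" "))  # e0 b2 85
```

One character on screen, three bytes on disk: that is the difference the two Sandbox edits exposed.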

What is going on? Here is the explanation. Unicode text can be stored in different ways, of which UTF-8, UTF-16 and UTF-32 are the most prominent. UTF-16 uses 2 bytes for most characters in everyday use (including all Indic scripts) and 4 bytes for characters outside the Basic Multilingual Plane. UTF-32 uses 4 bytes for every character. UTF-8 is a variable-length encoding: it represents each character as a sequence of one to four single bytes. A file may begin with an optional Byte Order Mark (BOM), which signals the byte order in UTF-16 and UTF-32 and can also serve as a hint about which encoding is in use. The Unicode website has more details on these. UTF-8 became the encoding of choice for the web largely because it works with software and protocols that process text one 8-bit byte at a time, and because it is backward compatible with ASCII, the 7-bit encoding that predates Unicode: every ASCII character is stored as the same single byte in UTF-8. Even today, UTF-8 is the standard encoding for HTML.
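To make the comparison between the three encodings concrete, here is a small sketch using Python's built-in codecs. The function name `encoded_sizes` is an assumption made for this example; the little-endian codec variants are used so that no BOM is counted in the sizes.

```python
def encoded_sizes(ch: str) -> dict:
    """Size in bytes of one character in each encoding, without a BOM."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


samples = [
    ("Latin A", "A"),
    ("Kannada a", "\u0c85"),
    ("emoji (outside the BMP)", "\U0001F600"),
]
for label, ch in samples:
    print(label, encoded_sizes(ch))

# The plain "utf-16" codec prepends a 2-byte BOM, so "A" takes 4 bytes.
print(len("A".encode("utf-16")))  # 4
# "utf-8-sig" prepends the 3-byte UTF-8 signature EF BB BF.
print("A".encode("utf-8-sig")[:3] == b"\xef\xbb\xbf")  # True
```

For Kannada text, UTF-16 would actually be smaller than UTF-8 (2 bytes against 3 per character), but Wikipedia reports sizes in UTF-8, so that is the figure that matters here.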

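Does this answer our original question? Not yet. I said UTF-8 uses a sequence of single bytes. The number of bytes depends on the character: 1 byte for the basic English letters, digits and punctuation (ASCII), 2 bytes for accented Latin letters and scripts such as Greek and Cyrillic, and 3 bytes for Indian-language scripts, including Kannada. That is why we saw 3 bytes for one Kannada character.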
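The byte count follows directly from the numeric value (code point) of the character. The sketch below applies the standard UTF-8 range boundaries and checks the result against Python's own encoder; `utf8_length` is a name chosen for this illustration, and surrogate code points (U+D800 to U+DFFF), which cannot be encoded, are not handled.

```python
def utf8_length(code_point: int) -> int:
    """Number of UTF-8 bytes needed for a Unicode code point."""
    if code_point < 0x80:        # U+0000..U+007F: ASCII
        return 1
    if code_point < 0x800:       # U+0080..U+07FF: accented Latin, Greek, Cyrillic, ...
        return 2
    if code_point < 0x10000:     # U+0800..U+FFFF: includes the Indic script blocks
        return 3
    return 4                     # U+10000 and above: supplementary planes


samples = ["A", "\u00e9", "\u0c85", "\U0001F600"]  # A, é, ಅ, 😀
for ch in samples:
    # Cross-check the range rule against the real encoder.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_length(ord(ch))} byte(s)")
```

The Kannada block starts at U+0C80, well above the 2-byte limit of U+07FF, which is why every Kannada letter lands in the 3-byte range.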

This raises another interesting question about the definition of a stub article in Wikipedia. As per Wikipedia, an article of less than 2048 bytes is considered a stub. On any language Wikipedia, type Special:ShortPages in the search box to list the shortest articles along with their sizes in bytes. Converted to a number of characters, 2048 bytes is 2048 characters for English but only about 682 for Indic text (2048 ÷ 3). That means the effective length of a stub article differs between English and Indian language Wikipedias. Should we then have a different yardstick for the definition of a stub article in Indian language Wikipedias? I think yes.
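The arithmetic behind these numbers can be sketched as follows. The 2048-byte threshold is the figure quoted above; the names `STUB_THRESHOLD_BYTES`, `chars_that_fit` and `size_report` are invented for this example, and "characters" here means Unicode code points, so vowel signs and the virama each count separately.

```python
STUB_THRESHOLD_BYTES = 2048  # threshold quoted in this post


def chars_that_fit(threshold_bytes: int, bytes_per_char: int) -> int:
    """How many characters of a given UTF-8 width fit in the byte threshold."""
    return threshold_bytes // bytes_per_char


def size_report(text: str) -> tuple:
    """Return (number of code points, number of UTF-8 bytes) for the text."""
    return len(text), len(text.encode("utf-8"))


print(chars_that_fit(STUB_THRESHOLD_BYTES, 1))  # 2048 (English letters)
print(chars_that_fit(STUB_THRESHOLD_BYTES, 3))  # 682  (Indic letters)

kannada_word = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"  # ಕನ್ನಡ
print(size_report(kannada_word))  # (5, 15)
print(size_report("Kannada"))     # (7, 7)
```

Real articles mix Indic letters with spaces, digits, punctuation and wiki markup, all of which take 1 byte each, so the exact character count at the threshold will vary from page to page. Counting characters instead of bytes, as `size_report` does, is one possible way to compare contributions across languages on a more equal footing.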
