Arduino and UTF-8

Introduction

The Arduino editor and the compiler (avr-gcc) use UTF-8 for special characters. When UTF-8 is used in the program code itself, a single character can occupy one to four bytes.

More information about UTF-8 on Wikipedia : http://en.wikipedia.org/wiki/UTF-8

Examples

These examples show the UTF-8 bytes (in hexadecimal) of a few characters:
Normal character 'a' = 0x61
Micro µ = 0xC2 0xB5
Degrees ° = 0xC2 0xB0
Euro € = 0xE2 0x82 0xAC
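
The byte values above follow from the UTF-8 encoding rules. The sketch below is a minimal illustration of those rules, not part of the Arduino core: the function name utf8Encode is hypothetical, and it takes a Unicode code point (for example 0xB5 for µ, 0x20AC for €) and writes one to four bytes.

```cpp
#include <stdint.h>
#include <stddef.h>

// Hypothetical helper (not an Arduino API): encode one Unicode code point
// as UTF-8. Writes 1 to 4 bytes into 'out' and returns the number of bytes,
// or 0 when the code point cannot be encoded.
size_t utf8Encode(uint32_t cp, uint8_t out[4])
{
  if (cp < 0x80) {                         // 1 byte:  0xxxxxxx
    out[0] = (uint8_t)cp;
    return 1;
  }
  if (cp < 0x800) {                        // 2 bytes: 110xxxxx 10xxxxxx
    out[0] = (uint8_t)(0xC0 | (cp >> 6));
    out[1] = (uint8_t)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {                      // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    if (cp >= 0xD800 && cp <= 0xDFFF)
      return 0;                            // surrogates are not valid characters
    out[0] = (uint8_t)(0xE0 | (cp >> 12));
    out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {                    // 4 bytes: 11110xxx followed by 3 x 10xxxxxx
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;                                // beyond the Unicode range
}
```

Calling utf8Encode(0x20AC, buf) fills buf with 0xE2 0x82 0xAC and returns 3, which matches the Euro example.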

Unicode

Unicode is not UTF-8. Unicode assigns a number (a code point) to every useful character in the world. UTF-8 is one of several encodings that specify how those code points are stored as bytes, and it can encode every Unicode character.
It therefore does not make sense to say that a device should receive "Unicode characters", since those could be encoded as UTF-16, UTF-8, or another format.

Using UTF-8 in the Arduino Editor

Using UTF-8 in the editor is no problem for comments and explanations. They are stored in the *.ino file, which is a UTF-8 file.

When a webpage is created in the sketch and that webpage uses UTF-8 encoding, UTF-8 characters can be used without problems. They will be shown correctly in the browser.

UTF-8 in the Arduino serial monitor

At this moment (November 2014), UTF-8 is not yet supported by the Arduino serial monitor. When UTF-8 characters are sent to the serial port, a serial terminal program that supports UTF-8 should be used. Most serial terminal programs on Linux support UTF-8.

To use extended characters in the serial monitor, the UTF-8 strings have to be converted to extended ASCII. This can be done with the short conversion routine given on http://playground.arduino.cc/Main/Utf8ascii.

UTF-8 for displays

Displays that use extended ASCII characters conflict with the UTF-8 encoding.

For example, when writing "µ", the Arduino editor stores two bytes (0xC2, 0xB5), while extended ASCII uses a single byte (0xB5).

To display extended ASCII characters, strings have to be converted with the short conversion routine given on http://playground.arduino.cc/Main/Utf8ascii.
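
To illustrate what such a conversion does, here is a minimal sketch. It is not the routine from the playground page: the function name utf8ToLatin1 is hypothetical, and it assumes that "extended ASCII" means ISO 8859-1 (Latin-1). Many displays use a different character table, in which case the mapping has to be adapted.

```cpp
#include <stdint.h>
#include <stddef.h>

// Hypothetical helper (not the playground routine): converts a UTF-8 string
// in place to ISO 8859-1. Characters above U+00FF have no Latin-1 equivalent
// and are replaced by '?'. Returns the new length in bytes.
size_t utf8ToLatin1(char *s)
{
  uint8_t *in  = (uint8_t *)s;
  uint8_t *out = (uint8_t *)s;   // output never gets ahead of the input

  while (*in) {
    uint8_t c = *in++;
    if (c < 0x80) {
      *out++ = c;                                  // plain ASCII, copy as-is
    } else if ((c == 0xC2 || c == 0xC3) && (*in & 0xC0) == 0x80) {
      // U+0080..U+00FF: lead byte 0xC2 or 0xC3 plus one continuation byte
      *out++ = (uint8_t)(((c & 0x03) << 6) | (*in++ & 0x3F));
    } else {
      // anything else: skip its continuation bytes and emit a single '?'
      while ((*in & 0xC0) == 0x80)
        in++;
      *out++ = '?';
    }
  }
  *out = 0;
  return (size_t)(out - (uint8_t *)s);
}
```

With this sketch, the UTF-8 string "5µA" (four bytes) becomes the three bytes 0x35 0xB5 0x41, while "60€" becomes "60?" because the Euro sign is not part of ISO 8859-1.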

Using UTF-8 in the code

Since the Arduino editor allows UTF-8 in the source, a string can be created with UTF-8 characters.

The size in bytes of a string with UTF-8 characters is larger than its number of characters. This applies to character arrays as well as to the String object.

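char c = 'µ';          // Wrong: 'µ' is two bytes and does not fit in one char
char bad[4] = "5µA";   // Wrong: needs 5 bytes (4 bytes of text plus the zero terminator)
char good[] = "5µA";   // Good: the compiler determines the size (5 bytes)
String okay = "5µA";   // Good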

With the string declarations above, the compiler does not have to know that it is a UTF-8 string. It copies the UTF-8 bytes from the file into the string.
In most cases UTF-8 can be used as long as both the origin (the sketch) and the destination (the computer or a device) handle it correctly.

The String object treats UTF-8 characters as a plain sequence of bytes and has no knowledge of UTF-8. For example, String.length() returns the number of bytes, not the number of characters.

When string operations are performed (either with a character array or with the String object), or when a buffer is declared, special care should be taken to allow for characters of variable size.
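
When the number of characters is needed rather than the number of bytes, the continuation bytes can be skipped while counting. The following is a minimal sketch; the function name utf8Length is hypothetical, and it assumes the string is valid UTF-8.

```cpp
#include <stdint.h>
#include <stddef.h>

// Hypothetical helper (not an Arduino API): counts the characters (code
// points) in a zero-terminated UTF-8 string. Every byte of the form
// 10xxxxxx is a continuation byte and is not counted.
size_t utf8Length(const char *s)
{
  size_t count = 0;
  for (; *s; s++) {
    if (((uint8_t)*s & 0xC0) != 0x80)
      count++;
  }
  return count;
}
```

For a UTF-8 encoded "60€", strlen() reports 5 bytes, while utf8Length() reports 3 characters. A buffer still has to be sized by the byte count plus one for the zero terminator.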

Wide characters

The avr-gcc compiler supports 16-bit or 32-bit characters, called "wide characters". Special characters are padded with zero bytes to fill 16 or 32 bits. The format is defined by the compiler and is not necessarily compatible with other devices.

UTF-8, on the other hand, is very well defined and should be compatible between all devices that support it. It is the most popular format and has overtaken the other formats by far.
Wide characters were an effort to standardize a character format with special characters, but they have been overtaken by UTF-8. Wide characters are still in the compiler because they were used before and files have been created with them.

A string with wide characters gets a capital 'L' prefix.

wchar_t TextOne[] = L"5µA"; // wide characters

PROGMEM can be used with wide characters, but the 'F()' macro and 'PSTR' cannot, since those cast to normal characters.

In November 2014, wchar.h is not yet implemented, and WCHAR_MAX and WCHAR_MIN are not defined. The wchar_t type and L"<text>" literals can be used, but there is, for example, no wsprintf function.
There is almost no support for wide characters in the libraries. For example, common functions like "Serial.println" and "strlen" do not support wide characters.
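
Because library functions such as strlen do not work on wide strings, even simple operations have to be written by hand. As an illustration only, a length function could look like the sketch below. The name wideLength is hypothetical, and the size of wchar_t depends on the compiler (it is assumed here to be 16 bits on avr-gcc and is 32 bits on many desktop compilers).

```cpp
#include <stddef.h>

// Hypothetical stand-in for the missing wcslen(): counts the wide
// characters in a zero-terminated wchar_t string. wchar_t is a built-in
// type in C++, so no header is needed for it.
size_t wideLength(const wchar_t *s)
{
  size_t n = 0;
  while (s[n] != L'\0')
    n++;
  return n;
}
```

For the wide string L"5µA" this returns 3, since every character occupies exactly one wchar_t. The number of bytes is 3 * sizeof(wchar_t) plus one wchar_t for the terminator.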

Sketch

The next sketch shows basic use of normal strings with UTF-8 characters.

The 'Get code' link did not translate the UTF-8 characters correctly. Please use the file "utf-8.ino" at the bottom instead.

// Test for normal strings with UTF-8
// Public Domain

char three[] = "3µV";
const char four[] PROGMEM = "4µA";
String five = "5µF";
char six[] = "60€";
String seven = "70€";

char buffer[40];

void setup() 
{
  Serial.begin( 9600);

  Serial.println("\n+++++++++++++++++++++++++++++++++++++++++");
  Serial.println("Use a serial terminal that supports UTF-8");

  Serial.println(F("1µH"));        // Good, text in flash

  // copy a string from flash memory to a buffer.  
  sprintf_P( buffer, PSTR("2µH")); // Good, text in flash
  Serial.println( buffer);

  // copy a string in ram to a buffer
  strcpy( buffer, three);          // Good
  Serial.println( buffer);

  // add one to strlen for the zero terminator
  strncpy( buffer, three, strlen(three) + 1); // Good, strlen works with UTF-8 string
  Serial.println( buffer);

  strcpy_P( buffer, four);         // Good, text in flash with PROGMEM
  Serial.println( buffer);

  // copy a string in flash to buffer byte for byte
  for( size_t i = 0 ; i < sizeof( four) ; i++)  // Good, sizeof works with UTF-8 string (includes the zero terminator)
  {
    buffer[i] = pgm_read_byte( four + i);
  }
  Serial.println( buffer);

  Serial.println( five);           // Good, a String class with UTF-8 character

  Serial.print( "array of char: \"");
  Serial.print( six);
  Serial.print( "\", strlen=");
  Serial.println( strlen(six));

  Serial.print( "String object: \"");
  Serial.print( seven);
  Serial.print( "\", String.length()=");
  Serial.println( seven.length());

  Serial.println( "+++++++++++++++++++++++++++++++++++++++++");

  Serial.println( "Enter a UTF-8 character and press <enter>");
  Serial.println( "The hexadecimal value will be displayed.");
}

void loop()
{
  if( Serial.available())
  {
    Serial.print( "You have entered: ");
    delay(100);      // allow the rest of the line to be received.
    while( Serial.available())
    {
      byte c = Serial.read();
      if( c != '\r' && c != '\n')  // ignore trailing CR and LF
      {
        if( c <= 0x0F)
          Serial.print( "0");
        Serial.print( c, HEX);
        Serial.print( ", ");
      }
    }
    Serial.println();
  }
}

The file : http://forum.arduino.cc/index.php?action=dlattach;topic=276985.0;attach=101943
