UTF8 to extended ASCII

Sooner or later you may want to display special characters like m², äöü or € on the serial monitor or an external display. You will find that this is not as easy as expected. On many displays, characters with a code above 127 come out wrong or are mixed up with strange extra characters, because many smaller displays don't support UTF-8.

Arduino supports UTF-8, which allows it to handle thousands of characters. Internally, the IDE and the gcc compiler use UTF-8 encoding, which needs two or more bytes for each special character. The euro symbol char* s="€" is represented internally in UTF-8 as 3 bytes plus the terminating zero: char s[]="\xE2\x82\xAC", i.e. the bytes 0xE2, 0x82, 0xAC.
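
To make the byte counts concrete, here is a minimal sketch in plain C++ (not Arduino-specific). The helper name utf8SequenceLength is hypothetical and only illustrates how the first byte of a UTF-8 sequence announces how many bytes belong to the character.

```cpp
#include <cstddef>
#include <cstdint>

// The euro sign written as explicit UTF-8 bytes, so the example does not
// depend on the encoding of the source file.
const char euro[] = "\xE2\x82\xAC";   // 3 bytes plus the terminating zero

// Hypothetical helper: number of bytes in a UTF-8 sequence, judged from its
// first byte. Returns 0 for a continuation byte or an invalid first byte.
std::size_t utf8SequenceLength(std::uint8_t lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx: e.g. 0xC2, 0xC3
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: e.g. 0xE2 (euro)
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or invalid
}
```

Called with the first byte of the euro sign, utf8SequenceLength(0xE2) reports a three-byte sequence, while any code below 128 is a single byte.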

The serial monitor and most smaller displays only know 256 character codes, using some extended ASCII set. Many of them *can* display a few of the most common special characters such as m², äöü or €.

To show strings as expected on such a display, we first need to convert the UTF8-strings into whichever extended ASCII set is supported by that specific display. Luckily the code definitions for UTF8 and the most common versions of extended ASCII are very similar for the first 256 codes:

  • codes 0..127 are identical in ASCII and UTF8
  • codes 160-191 in ISO-8859-1 and Windows-1252 are two-byte characters in UTF-8 -- 0xC2 as a first byte, the second byte is identical to the extended ASCII-code.

  • codes 192-255 in ISO-8859-1 and Windows-1252 are two-byte characters in UTF-8 -- 0xC3 as a first byte, the second byte differs from the extended ASCII code only in bit 6 (the extended ASCII code is the second byte plus 0x40).

  • codes 128-159 in Windows-1252 are different, but usually only the €-symbol will be needed from this range. The euro symbol is 0x80 in Windows-1252, 0xa4 in ISO-8859-15, and 0xe2 0x82 0xac in UTF-8.
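
The two-byte rules in the list above can be checked with a short sketch in plain C++. The function names latin1ToUtf8 and utf8ToLatin1 are hypothetical and chosen only for illustration; the sketch assumes codes in the range 160..255 and does not cover the euro sign.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: encode an ISO-8859-1 code (160..255) as two UTF-8 bytes.
void latin1ToUtf8(std::uint8_t code, std::uint8_t out[2]) {
    assert(code >= 0xA0);  // this sketch only covers 160..255
    out[0] = static_cast<std::uint8_t>(0xC0 | (code >> 6));    // 0xC2 or 0xC3
    out[1] = static_cast<std::uint8_t>(0x80 | (code & 0x3F));  // 10xxxxxx
}

// Hypothetical helper: the reverse mapping described in the list.
// Returns 0 if the first byte is neither 0xC2 nor 0xC3.
std::uint8_t utf8ToLatin1(std::uint8_t first, std::uint8_t second) {
    if (first == 0xC2) return second;  // 160..191: second byte is the code
    if (first == 0xC3) return static_cast<std::uint8_t>(second | 0x40);  // 192..255: set bit 6
    return 0;
}
```

For example, ä (code 0xE4) becomes the UTF-8 bytes 0xC3 0xA4, and ² (code 0xB2) becomes 0xC2 0xB2; feeding those byte pairs back into utf8ToLatin1 gives the original codes again.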

It turns out that it is easy to write a simple UTF8-decoder that works for most characters in ISO-8859-1, ISO-8859-15, and Windows-1252.

A single-character conversion routine and two string conversion routines are shown below. The single-byte version returns 0 if a byte must not be displayed in the output. The string versions are given for String objects and zero-terminated strings (C strings). The C string version can convert a UTF-8 string to an extended ASCII string "in place", since the extended ASCII string is never longer than the UTF-8 string.

// ****** UTF8-Decoder: convert UTF8-string to extended ASCII *******
static byte c1;  // Last character buffer

// Convert a single Character from UTF8 to Extended ASCII
// Return "0" if a byte has to be ignored
byte utf8ascii(byte ascii) {
    if ( ascii<128 )   // Standard ASCII-set 0..0x7F handling  
    {   c1=0;
        return( ascii );
    }

    // get previous input
    byte last = c1;   // get last char
    c1=ascii;         // remember current character

    switch (last)     // conversion depending on first UTF8-character
    {   case 0xC2: return  (ascii);  break;
        case 0xC3: return  (ascii | 0xC0);  break;
        case 0x82: if(ascii==0xAC) return(0x80);       // special case Euro-symbol (0x80 is its Windows-1252 code)
                   break;
    }

    return  (0);                                     // otherwise: return zero, if character has to be ignored
}

// convert String object from UTF8 String to Extended ASCII
String utf8ascii(String s)
{      
        String r="";
        char c;
        for (int i=0; i<s.length(); i++)
        {
                c = utf8ascii(s.charAt(i));
                if (c!=0) r+=c;
        }
        return r;
}

// In Place conversion UTF8-string to Extended ASCII (the result is never longer!)
void utf8ascii(char* s)
{
        int k=0;                        // write position
        char c;
        for (int i=0; s[i]!=0; i++)     // read until the terminating zero
        {
                c = utf8ascii(s[i]);
                if (c!=0)
                        s[k++]=c;
        }
        s[k]=0;                         // terminate the shortened string
}
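
The decoder above always returns 0x80 for the euro sign, which is its Windows-1252 code. For a display that uses ISO-8859-15, where the euro sign is 0xA4, a variant along the following lines might help. This is only a sketch in plain C++ under stated assumptions: the names Utf8Decoder, feed and convertInPlace are hypothetical, the state is kept in an object instead of a static variable, and the euro sign is only recognized after the full sequence 0xE2 0x82 0xAC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical variant of the decoder with a configurable euro code.
struct Utf8Decoder {
    std::uint8_t euroCode;   // 0x80 for Windows-1252, 0xA4 for ISO-8859-15
    std::uint8_t prev1 = 0;  // previous non-ASCII byte
    std::uint8_t prev2 = 0;  // the byte before that

    explicit Utf8Decoder(std::uint8_t euro) : euroCode(euro) {}

    // Feed one UTF-8 byte. Returns the extended ASCII code, or 0 if the
    // byte produces no output (first bytes and unsupported characters).
    std::uint8_t feed(std::uint8_t b) {
        if (b < 0x80) { prev1 = prev2 = 0; return b; }  // plain ASCII
        const std::uint8_t p1 = prev1, p2 = prev2;
        prev2 = prev1;
        prev1 = b;
        if (p1 == 0xC2) return b;                                    // 160..191
        if (p1 == 0xC3) return static_cast<std::uint8_t>(b | 0x40);  // 192..255
        if (p2 == 0xE2 && p1 == 0x82 && b == 0xAC) return euroCode;  // euro sign
        return 0;
    }
};

// Convert a zero-terminated UTF-8 string in place and return the new length.
// The write index k never overtakes the read index i, so the terminating
// zero is still intact when the loop reaches it.
std::size_t convertInPlace(char* s, std::uint8_t euroCode) {
    assert(s != nullptr);
    Utf8Decoder dec(euroCode);
    std::size_t k = 0;
    for (std::size_t i = 0; s[i] != '\0'; ++i) {
        const std::uint8_t c = dec.feed(static_cast<std::uint8_t>(s[i]));
        if (c != 0) s[k++] = static_cast<char>(c);
    }
    s[k] = '\0';
    return k;
}
```

Calling convertInPlace(s, 0xA4) targets an ISO-8859-15 display, while convertInPlace(s, 0x80) behaves like the Windows-1252 version above. As in the original routine, s must be a writable character array, not a string literal.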

Here is a simple demonstration of the usage:

char s[]="abcABC äöüß ÄÖÜ €xm²/kg³<";   // a writable array, because the string is converted in place

  Serial.println("UTF8-decoder Test");
  Serial.print("Original: ");
  Serial.println(s);
  utf8ascii(s);
  Serial.print("Extended ASCII-Version ");
  Serial.println(s);
