RFC 5: Unicode support in GDAL

Author: Andrey Kiselev

Status: Development

Summary

This document proposes a way to make the GDAL core locale-independent while preserving support for native character sets.

Main concepts

GDAL should be modified to support the following three main ideas:

Users work in a localized environment using their native languages. That means we cannot assume the ASCII character set when working with string data passed to GDAL.
GDAL uses the UTF-8 encoding internally when working with strings.
GDAL uses the Unicode versions of third-party APIs whenever possible.

So all strings used in GDAL are in UTF-8, not in plain ASCII. That means we should convert user input from the local encoding to UTF-8 during interactive sessions, and do the opposite for GDAL output. For example, when a user passes a filename as a command-line parameter to a GDAL utility, that filename should be converted to UTF-8 immediately and only afterwards passed to functions like GDALOpen() or OGROpen(). All functions that take character strings as parameters assume UTF-8 (except for a few that convert between encodings, see Implementation). The same holds for output functions. The output functions embedded in GDAL (CPLError/CPLDebug) should convert all strings from UTF-8 to the local encoding just before printing them. Custom error handlers should be aware of the UTF-8 issue and perform the proper transformation of the strings passed to them.
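
The input-side conversion described above can be sketched with the standard C library alone. This is only an illustration under stated assumptions, not the proposed implementation: LocalToUTF8() and AppendUTF8() are hypothetical names that do not exist in GDAL, the application is assumed to have selected the user's locale already, and wchar_t is assumed to hold whole code points (true on most Unix systems, not on Windows, where wchar_t is UTF-16).

```cpp
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not part of GDAL): append one code point as UTF-8.
static void AppendUTF8(std::string &out, unsigned long cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert a string in the current locale's encoding
// to UTF-8. Assumes wchar_t holds whole code points; surrogate pairs
// (16-bit wchar_t) are not handled in this sketch.
// Returns an empty string if the input is not valid in the current locale.
std::string LocalToUTF8(const char *pszLocal)
{
    // The number of wide characters never exceeds the number of bytes.
    std::vector<wchar_t> wide(std::strlen(pszLocal) + 1);
    const std::size_t nLen = std::mbstowcs(wide.data(), pszLocal, wide.size());
    if (nLen == static_cast<std::size_t>(-1))
        return std::string();

    std::string out;
    for (std::size_t i = 0; i < nLen; ++i)
        AppendUTF8(out, static_cast<unsigned long>(wide[i]));
    return out;
}
```

A utility would call such a helper once on each command-line argument and pass only the result to the GDAL API.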

The string encoding comes up again when GDAL needs to call a third-party API. UTF-8 should be converted to an encoding suitable for that API. In particular, that means we should convert UTF-8 to UTF-16 before calling the CreateFile() function in the Windows implementation of VSIFOpenL(). Another example is the PostgreSQL API. When a PostgreSQL database uses the UTF8 server encoding, strings are stored in UTF-8 internally, so we should notify the server that the passed strings are already in UTF-8; they will then be stored as is, without any conversion or loss.
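
As an illustration of the UTF-8 to UTF-16 step, here is a minimal portable sketch. UTF8ToUTF16() is a hypothetical name, not a GDAL function; on Windows the same job could be done with the Win32 call MultiByteToWideChar() and the CP_UTF8 code page. The sketch is deliberately simplified: malformed sequences become U+FFFD, and overlong encodings are not rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of GDAL): decode UTF-8 into UTF-16 code
// units. Malformed or truncated sequences are replaced with U+FFFD.
// Simplification: overlong forms and encoded surrogates are not rejected.
std::u16string UTF8ToUTF16(const std::string &utf8)
{
    const char16_t kReplacement = static_cast<char16_t>(0xFFFD);
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char c = static_cast<unsigned char>(utf8[i++]);
        char32_t cp = 0;
        int nExtra = 0;
        if (c < 0x80)                { cp = c;        nExtra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; nExtra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; nExtra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; nExtra = 3; }
        else { out += kReplacement; continue; }

        // Consume the expected continuation bytes (10xxxxxx).
        bool bOK = true;
        for (int k = 0; k < nExtra; ++k) {
            if (i >= utf8.size() ||
                (static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) {
                bOK = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
            ++i;
        }
        if (!bOK || cp > 0x10FFFF) { out += kReplacement; continue; }

        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Code points above the BMP are split into a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```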

For file format drivers the string representation should be worked out on a per-driver basis. Not all file formats support non-ASCII characters. For example, various .HDR labeled rasters are just 7-bit ASCII text files, and it is not a good idea to write 8-bit strings to such files. When we need to pass strings extracted from such a file outside the driver (e.g., in a SetMetadata() call), we should convert them to UTF-8. If the extracted strings are only used internally in the driver, no conversion is needed.

In some cases the file encoding can differ from the local system encoding, and we have no way to know the file encoding other than to ask the user (for example, imagine that someone added an 8-bit non-ASCII string field to the plain text .HDR file mentioned above). In that case we cannot convert from the local encoding to UTF-8; we must convert from the file encoding to UTF-8. So we need a way to obtain the file encoding on a per-datasource basis. The natural solution is to introduce an optional open parameter "ENCODING" for the GDALOpen/OGROpen functions. Unfortunately, those functions do not accept options; that should be introduced in another RFC. Fortunately, there is no need to add the encoding parameter immediately, because it is independent of the general i18n process. We can add UTF-8 support as defined in this RFC and add support for forcing a per-datasource encoding later, when open options are introduced.
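
To make the file-encoding case concrete, the sketch below recodes a string read from a file using an encoding name supplied by the user. Everything here is hypothetical: neither function exists in GDAL, the encoding name only stands in for the value of the future "ENCODING" option, and ISO-8859-1 is handled simply because its bytes map one-to-one onto the first 256 Unicode code points.

```cpp
#include <string>

// Hypothetical helper (not part of GDAL): recode ISO-8859-1 (Latin-1)
// bytes into UTF-8. Each Latin-1 byte is the code point of the same value.
std::string Latin1ToUTF8(const std::string &latin1)
{
    std::string out;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            // Code points 0x80..0xFF need two UTF-8 bytes.
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Hypothetical dispatch on a user-supplied encoding name, standing in for
// the value of a future per-datasource "ENCODING" option.
std::string FileStringToUTF8(const std::string &raw,
                             const std::string &encoding)
{
    if (encoding == "ISO-8859-1")
        return Latin1ToUTF8(raw);
    // "UTF-8" or an unknown name: pass the bytes through unchanged.
    return raw;
}
```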

Implementation

New character conversion functions will be introduced in the CPLString class. Objects of that class always contain a UTF-8 string internally.

// Get string in local encoding from the internal UTF-8 encoded string.
// Out-of-range characters are replaced with '?' in the output string.
// nEncoding A codename of the encoding. If 0, the local system
// encoding will be used.
char* CPLString::recode( int nEncoding = 0 );

// Construct UTF-8 string object from a string in another encoding.
// nEncoding A codename of the encoding. If 0, the local system
// encoding will be used.
CPLString::CPLString( const char*, int nEncoding );

// Construct UTF-8 string object from array of wchar_t elements.
// Source encoding is system specific.
CPLString::CPLString( wchar_t* );

// Convert string from UTF-8 encoding into an array of wchar_t elements.
// Destination encoding is system specific.
operator wchar_t* (void) const;
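
The '?' replacement rule stated for recode() can be illustrated with the simplest possible target, 7-bit ASCII. UTF8ToASCII() is a hypothetical stand-in written for this example, not the proposed method itself, and it assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the recode() behaviour with a 7-bit ASCII
// target: every character that does not fit is replaced by one '?'.
std::string UTF8ToASCII(const std::string &utf8)
{
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);
            ++i;
            continue;
        }
        // Non-ASCII character: emit a single '?' and skip the lead byte
        // together with its continuation bytes (10xxxxxx).
        out += '?';
        ++i;
        while (i < utf8.size() &&
               (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80)
            ++i;
    }
    return out;
}
```

Note that one multi-byte character yields one '?', so the output keeps the character count of the input rather than its byte count.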

To use non-ASCII characters in user input, every application should call setlocale(LC_ALL, "") right after the entry point.
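
A minimal sketch of that call follows. InitUserLocale() is a hypothetical wrapper introduced only for this example; the one real API used is the standard setlocale(), which returns a null pointer when the locale requested by the environment cannot be honoured.

```cpp
#include <clocale>
#include <cstdio>

// Hypothetical wrapper (not part of GDAL): select the user's locale from
// the environment. Returns false if the request could not be honoured, in
// which case the program keeps running in the default "C" locale.
bool InitUserLocale()
{
    const char *pszLocale = std::setlocale(LC_ALL, "");
    if (pszLocale == nullptr) {
        std::fprintf(stderr,
                     "Warning: cannot set the locale from the environment.\n");
        return false;
    }
    return true;
}
```

A utility would call this as the first statement of main(), before any argument is examined.
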
Code example. Let's look at how the GDAL utilities and core code should be changed with regard to Unicode.

For input instead of

pszFilename = argv[i];
if( pszFilename )
    hDataset = GDALOpen( pszFilename, GA_ReadOnly );

we should do

CPLString oFilename(argv[i], 0); // <-- Conversion from local encoding to UTF-8
hDataset = GDALOpen( oFilename.c_str(), GA_ReadOnly );

For output instead of

printf( "Description = %s\n", GDALGetDescription(hBand) );

we should do

CPLString oDescription( GDALGetDescription(hBand) );
printf( "Description = %s\n", oDescription.recode( 0 ) ); // <-- Conversion
                            // from UTF-8 to local

The UTF-8 encoded filename passed to GDALOpen() in the code snippet above is then processed further in the GDAL core. On Windows, instead of

hFile = CreateFile( pszFilename, dwDesiredAccess,
    FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, dwCreationDisposition,
    dwFlagsAndAttributes, NULL );

we do

CPLString oFilename( pszFilename );
// I prefer to call the wide character version explicitly
// rather than rely on the UNICODE/_UNICODE compile-time switches.
hFile = CreateFileW( (wchar_t *)oFilename, dwDesiredAccess,
        FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
        dwCreationDisposition,  dwFlagsAndAttributes, NULL );

The actual implementation of the character conversion functions is not specified in this document yet; it needs additional discussion. The main problem is that we need not only local<->UTF-8 conversions, but arbitrary<->UTF-8 ones as well. That requires significant support on the software side.
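
One candidate for the arbitrary<->UTF-8 case is the POSIX iconv interface, shown here purely as an option for discussion; this RFC does not select it. RecodeWithIconv() is a hypothetical name, the set of supported encoding names depends on the system's iconv implementation, and some platforms declare the input argument of iconv() as const char ** or require linking with a separate libiconv.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of GDAL): recode `in` from encoding
// pszFrom to encoding pszTo with POSIX iconv. Returns false if the pair
// of encodings is unsupported or the input cannot be converted.
// Simplification: no final flush call, so stateful target encodings are
// not handled (UTF-8 is stateless).
bool RecodeWithIconv(const std::string &in, const char *pszFrom,
                     const char *pszTo, std::string &out)
{
    iconv_t cd = iconv_open(pszTo, pszFrom);
    if (cd == reinterpret_cast<iconv_t>(-1))
        return false;

    std::vector<char> inBuf(in.begin(), in.end());
    char *pIn = inBuf.data();
    std::size_t nInLeft = inBuf.size();

    out.clear();
    bool bOK = true;
    char chunk[256];
    while (nInLeft > 0) {
        char *pOut = chunk;
        std::size_t nOutLeft = sizeof(chunk);
        const std::size_t nRet = iconv(cd, &pIn, &nInLeft, &pOut, &nOutLeft);
        // E2BIG only means the chunk is full; anything else is a failure.
        const bool bFailed =
            (nRet == static_cast<std::size_t>(-1) && errno != E2BIG);
        out.append(chunk, sizeof(chunk) - nOutLeft);
        if (bFailed) {
            bOK = false;
            break;
        }
    }
    iconv_close(cd);
    return bOK;
}
```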

Backward Compatibility

This new functionality breaks GDAL/OGR backward compatibility in the way 8-bit characters are handled. Previously, users could rely on all 8-bit character strings being passed through GDAL/OGR unchanged, containing exactly the same data all the way. Now that is only true for 7-bit ASCII and for 8-bit strings encoded in UTF-8. Note that if you used only the ASCII subset with GDAL, you are not affected by these changes.
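
Users who want to know whether their existing strings are affected could run a check along the following lines. Both functions are hypothetical and written only for this example. LooksLikeUTF8() is a structural test: it verifies lead and continuation byte patterns but does not reject overlong forms, so a true result means "plausibly UTF-8", not "certainly UTF-8".

```cpp
#include <cstddef>
#include <string>

// Hypothetical check: true if the string uses only 7-bit ASCII and is
// therefore unaffected by the change.
bool IsPlainASCII(const std::string &s)
{
    for (unsigned char c : s)
        if (c >= 0x80)
            return false;
    return true;
}

// Hypothetical check: true if every byte sequence has the shape of a
// UTF-8 sequence (a lead byte followed by the right number of 10xxxxxx
// continuation bytes). Overlong forms are not rejected.
bool LooksLikeUTF8(const std::string &s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i++]);
        int nExtra = 0;
        if (c < 0x80)                nExtra = 0;
        else if ((c & 0xE0) == 0xC0) nExtra = 1;
        else if ((c & 0xF0) == 0xE0) nExtra = 2;
        else if ((c & 0xF8) == 0xF0) nExtra = 3;
        else return false;  // stray continuation byte or invalid lead byte

        for (; nExtra > 0; --nExtra, ++i) {
            if (i >= s.size() ||
                (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
                return false;
        }
    }
    return true;
}
```

A string that fails both checks is most likely in a legacy 8-bit encoding and would need an explicit conversion to UTF-8 before being passed to GDAL.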

From The Unicode Standard, chapter 5:

The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text.
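
The portability concern in the quotation can be checked at compile time. The names below are hypothetical and serve only to illustrate the point; the fixed-width types char16_t and char32_t are standard C++11 alternatives whose minimum sizes do not depend on the compiler's choice for wchar_t.

```cpp
#include <climits>
#include <cstddef>

// Number of bits in wchar_t on this compiler (commonly 16 on Windows and
// 32 on Linux, but the language does not guarantee either value).
constexpr std::size_t kWcharBits = sizeof(wchar_t) * CHAR_BIT;

// Unicode code points go up to U+10FFFF, which needs 21 bits.
constexpr bool kWcharHoldsAnyCodePoint = (kWcharBits >= 21);

// Hypothetical aliases for code that must not depend on wchar_t.
using UTF16Unit = char16_t;  // at least 16 bits
using CodePoint = char32_t;  // at least 32 bits

static_assert(sizeof(CodePoint) * CHAR_BIT >= 32,
              "char32_t must be able to hold any Unicode code point");
```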

References

The Unicode Standard, Version 4.0 - Implementation Guidelines - Chapter 5 (PDF)
FAQ on how to use Unicode in software: http://www.cl.cam.ac.uk/~mgk25/unicode.html
FLTK implementation of string conversion functions: http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c
http://www.easysw.com/~mike/fltk/doc-2.0/html/utf_8h.html
Ticket #1494 : UTF-8 encoding for GML output.
Filenames are also covered in [[wiki:rfc30_utf8_filenames]].