
Cyrillic characters in the Windows console

 

There are persistent rumors that Linux has no problems with Cyrillic in the console. This article is for those who are less fortunate: Windows users.

The problem is that when Microsoft created Windows, it also created a new encoding for the Cyrillic alphabet. It's hard to say why, but it did. The old encoding, used in MS DOS, was kept, apparently for backward compatibility. That is where the mess began. With each new version of Windows the situation only got worse, because the console, now part of the operating system, inherited its Cyrillic encoding from MS DOS.

As a result, we now have two encodings for the Cyrillic alphabet: cp866, the old DOS encoding, and cp1251 (aka windows-1251), the new one from Windows. Matters are further complicated by the fact that Unicode has finally matured, which adds several more encodings that are incompatible with cp1251 and cp866 and not quite compatible with each other. But Unicode is a topic for another time.

By the way, the letters "cp" in the encoding name stand for "code page", in the sense of "character encoding page".

In addition, the console has its own code page setting. For Russia, the default is cp866. (The localization of Windows itself doesn't seem to matter.) The console code page can be changed with the command chcp <code page number>.

So, when writing programs, strings can occur in two different encodings in the following places:

  1. In the source code of programs in the form of literals.
  2. Output to the console.
  3. Input from the console.
  4. Output to a file.
  5. Input from a file.

In addition, Windows, when reading and writing Cyrillic text, may perform encoding conversions of its own somewhere in its depths, producing results that cannot be deciphered.

The first rule when working with national alphabets: all strings must be in a single encoding.

If you do not follow this rule, you will not be able to compare or sort strings (or characters) correctly, and it will be difficult to read and write strings to the console or a file.
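To see why mixed encodings break comparison, here is a minimal sketch. The byte values are the cp1251 and cp866 codes of the word "да"; the names da_cp1251, da_cp866 and same_bytes are made up for this illustration.

```cpp
#include <cstring>

// The Russian word "да" spelled as raw bytes, so the example does not
// depend on the encoding of this source file.
const char da_cp1251[] = "\xE4\xE0"; // д = 0xE4, а = 0xE0 in cp1251
const char da_cp866[]  = "\xA4\xA0"; // д = 0xA4, а = 0xA0 in cp866

// Hypothetical helper: a byte-wise comparison, which is all strcmp() does.
bool same_bytes(const char* a, const char* b)
{
    return std::strcmp(a, b) == 0;
}
```

Here same_bytes(da_cp1251, da_cp866) returns false even though both arrays spell the same word, while comparing a string with itself returns true.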

Another pitfall:

A console window that uses Raster fonts correctly displays only the original equipment manufacturer (OEM) code page installed with Windows XP. Other code pages display correctly in full-screen mode or in a console window that uses True Type fonts.

Quote from MS TechNet.

At the same time, Cyrillic typed at the command line itself is displayed correctly even with raster fonts.

The second rule when working with national alphabets: in the console settings, set the font to a TrueType font for output.

You can do this in the following way:

  1. Start a command prompt
  2. Open the context menu and go to the console settings
  3. Change the font to Lucida Console.

And one more thing. It seems that neither Unicode, nor UTF-8, nor multibyte strings are directly supported in the console. At least, I had no success with them.

Here are some tips to combat these issues.

The easiest (and most inconvenient) way

Work in the console's "native" code page, cp866. That is, all strings containing Cyrillic in the program's source code must be written in the cp866 encoding. All input files for the program must be in the same encoding, and all output files will be in it too. It's as if we went back 20 years, to MS DOS.
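If the editor at hand cannot save the source in cp866, one possible workaround (not part of the recipe above, just a sketch) is to spell the cp866 bytes as hex escapes. The names privet_cp866 and privet_len are invented for the example.

```cpp
#include <cstddef>

// "Привет" in cp866, byte by byte:
// П = 0x8F, р = 0xE0, и = 0xA8, в = 0xA2, е = 0xA5, т = 0xE2.
// Written as hex escapes, the literal stays the same no matter which
// encoding the editor uses for the source file.
const char privet_cp866[] = "\x8F\xE0\xA8\xA2\xA5\xE2";

// Number of characters, not counting the terminating null.
const std::size_t privet_len = sizeof(privet_cp866) - 1;
```

Sending privet_cp866 to cout should show the greeting in a console that is set to cp866. The price is readability, so this is only practical for a handful of strings.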

If you have an IDE at hand that works in the console, there will be no particular problems. If you use a development environment under the Windows GUI, the difficulties are obvious, since IDEs usually work in cp1251, the "native" encoding for Windows. By the way, however much MS Visual Studio gets criticized, it can work with program sources in various encodings and displays them correctly in its editor.

Output to the Windows 7 console

The method is suitable if you only need to output Cyrillic to the console and you are working under Windows 7. Under Windows XP, this doesn't work (editor's note: everything works).

The simplest way is to use the setlocale() function:

#include <iostream>
#include <clocale> // may be optional - compiler dependent

using namespace std;

int main() {

     setlocale(LC_ALL, "Russian");

     cout << "Привет!\n";   // Cyrillic literal; the source file is saved in cp1251

     cin.get();
     return 0;
}

The setlocale() function sets or modifies the national settings for the current program (the ones specified in the Region and Language applet of the Control Panel). A description of the function can be found on MSDN.

Also popular is the stripped-down version of the call:

    setlocale(0, "");

This relies on the symbol LC_ALL being equal to 0 and assumes that the operating system's region is set to Russia (the localization of Windows itself plays no role). Still, it is better to use the full form.

The function is quite capricious, particularly with regard to its second parameter. Some values listed in the documentation may not work on some systems (or compilers?).
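Since a given locale name may or may not be accepted, it seems safer to check what setlocale() returns: a null pointer means the name was rejected. Below is a minimal sketch of that check; the helper name set_first_working_locale is hypothetical.

```cpp
#include <clocale>

// Hypothetical helper: tries each name in turn and returns the first one
// that setlocale() accepts, or a null pointer if none of them worked.
const char* set_first_working_locale(const char* const* names, int count)
{
    for (int i = 0; i < count; ++i) {
        // setlocale() returns a null pointer when the name is rejected;
        // the current locale is left unchanged in that case.
        if (std::setlocale(LC_ALL, names[i]) != nullptr)
            return names[i];
    }
    return nullptr;
}
```

A caller could pass a list such as "Russian" followed by "" (the system default); which names are actually accepted depends on the compiler and its runtime library, so treat any such list as a guess to be tested.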

Console input and output

For correct input and output of Cyrillic in the console, you need to use a pair of functions: SetConsoleOutputCP() and SetConsoleCP(). Both are described on MSDN.

Both functions take a single parameter, the code page number. In our case (Cyrillic) it is 1251.

This method works for both Windows XP and Windows 7. Tested with Dev-C++ 5.6.3 (TDM-GCC 4.8.1 32-bit compiler) and MS Visual Studio 2012.

The following test program demonstrates output of Cyrillic to the console, input of a Cyrillic string from the console, a control printout of the entered string, comparison of the entered string with a reference string, and output of the entered string to a file.

The source is encoded in cp1251:

#include <iostream>
#include <iomanip>
#include <windows.h>

#include <cstring>
#include <fstream>

using namespace std;

char str[128];
char pattern[] = "йцукен";   // Cyrillic reference string, stored as cp1251 bytes

int main(int argc, char** argv)
{
     //setlocale(LC_ALL, "Russian");
     SetConsoleOutputCP(1251);
     SetConsoleCP(1251);

     cout << "Привет!" << endl;
     // Print the character table for codes 32..255, eight per row
     for (int i = 32; i < 256; i++) {
         if (i % 8 == 0)
             cout << endl;
         if (i == 128)
             cout << endl;   // extra blank line before the upper half
         cout << setw(5) << i << setw(3) << (char)i;
     }
     cout << endl << "Enter the string \"йцукен\": ";
     cin >> setw(sizeof(str)) >> str;   // setw() keeps the input within the buffer
     cout << "You entered: " << str << endl;
     cout << "The strings are " << (strcmp(str, pattern) == 0 ? "" : "NOT ") << "identical" << endl;

     ofstream fout("cp-test.txt");
     fout << str << endl;
     fout.close();

     cin.get();
     cin.get();
     return 0;
}

This program is also useful for experimenting with various code pages and their combinations.

For practical purposes, you can use the template:

#include <iostream>
//#include <clocale> // may be optional - compiler dependent
#include <windows.h>

int main(int argc, char** argv)
{
     //setlocale(LC_ALL, "Russian");
     SetConsoleOutputCP(1251);
     SetConsoleCP(1251);

     // program should be here

     return 0;
}

Here I intentionally left a commented-out call to setlocale(LC_ALL, "Russian");. It no longer affects the input and output of Cyrillic, but it may be needed for other national settings (the decimal separator, date format, time format, etc.).
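One thing the template does not do is put the console back the way it was: as far as I can tell, the code page set by these calls stays in effect in the console window after the program ends. A possible refinement, sketched below under that assumption, is a small class that remembers the old code pages and restores them. The class name ConsoleCodePage is hypothetical; the four API calls are the real GetConsoleCP(), GetConsoleOutputCP(), SetConsoleCP() and SetConsoleOutputCP().

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper class: switches the console to the given code page
// and puts the previous code pages back when the object is destroyed.
class ConsoleCodePage {
public:
    explicit ConsoleCodePage(unsigned codePage) : ok_(false)
    {
#ifdef _WIN32
        oldInput_ = GetConsoleCP();        // remember the current settings
        oldOutput_ = GetConsoleOutputCP();
        // Both Set... calls return zero on failure.
        ok_ = SetConsoleCP(codePage) != 0 && SetConsoleOutputCP(codePage) != 0;
#else
        (void)codePage;                    // nothing to do outside Windows
#endif
    }

    ~ConsoleCodePage()
    {
#ifdef _WIN32
        SetConsoleCP(oldInput_);
        SetConsoleOutputCP(oldOutput_);
#endif
    }

    // True if both the input and the output code page were switched.
    bool ok() const { return ok_; }

    ConsoleCodePage(const ConsoleCodePage&) = delete;
    ConsoleCodePage& operator=(const ConsoleCodePage&) = delete;

private:
    bool ok_;
#ifdef _WIN32
    UINT oldInput_;
    UINT oldOutput_;
#endif
};
```

Declaring ConsoleCodePage cp(1251); at the top of main() would then replace the two Set... calls, and the old code pages come back when main() returns.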

Transcoding functions

In the Windows API there are four functions, in two pairs, that transcode OEM <-> ANSI (as the documentation puts it). Simply put, in the context of the issue at hand, this is transcoding between cp866 (OEM) and cp1251 (ANSI).
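To make it concrete what such transcoding amounts to for Russian letters, here is a hand-written sketch of the cp866 to cp1251 direction. It is not how the API functions are implemented, only an illustration based on the layout of the two code pages; the function name cp866_to_cp1251 is hypothetical, and everything except ASCII and the Russian letters is simply replaced with '?'.

```cpp
// Hypothetical helper: converts one cp866 byte to the matching cp1251 byte.
// Only ASCII and the 66 Russian letters are handled; everything else
// (box-drawing characters and so on) becomes '?'.
unsigned char cp866_to_cp1251(unsigned char c)
{
    if (c < 0x80)
        return c;                                    // ASCII is the same in both
    if (c <= 0xAF)
        return static_cast<unsigned char>(c + 0x40); // А..Я and а..п
    if (c >= 0xE0 && c <= 0xEF)
        return static_cast<unsigned char>(c + 0x10); // р..я
    if (c == 0xF0)
        return 0xA8;                                 // Ё
    if (c == 0xF1)
        return 0xB8;                                 // ё
    return '?';
}
```

The two separate ranges for lowercase letters exist because cp866 keeps the box-drawing characters between "п" and "р", while cp1251 stores the whole alphabet contiguously.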

"Dangerous" functions without controlling the length of the line:

  • OemToChar
  • CharToOem

They take pointers to the input and output buffers as parameters. The null character is treated as the end of the input string.

"Safe" functions with line length control:

  • OemToCharBuff
  • CharToOemBuff

The parameters are pointers to the input and output buffers and the number of characters for the input string. Null characters are not considered the end of a string. The specified number of characters is converted.

In fact, these functions are macros that expand, for example for CharToOemBuff, into CharToOemBuffW (with Unicode support) or CharToOemBuffA (ANSI, without Unicode support). But such subtleties can usually be ignored.

These functions are useful when the output is in one encoding and the input is in another. This situation arises, for example, when only the setlocale() function is used: output is done in cp1251, while input arrives in cp866. Following the first rule, the entered string must be converted to cp1251. To do this, use OemToCharBuff().
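A minimal sketch of that conversion follows. The wrapper name oem_to_ansi is hypothetical; OemToCharBuffA() is the ANSI variant of the real API function. The conversion goes from the system OEM code page to the system ANSI code page, which I am assuming to be cp866 and cp1251 as on a Russian Windows. Outside Windows the wrapper returns the string unchanged, so that the example still compiles.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical wrapper: returns a copy of `oem` converted from the system
// OEM code page to the system ANSI code page.
std::string oem_to_ansi(const std::string& oem)
{
    std::string ansi(oem.size(), '\0');
#ifdef _WIN32
    if (!oem.empty()) {
        // The length is passed explicitly, so embedded null characters
        // do not cut the conversion short.
        OemToCharBuffA(oem.data(), &ansi[0], static_cast<DWORD>(oem.size()));
    }
#else
    ansi = oem;   // no Windows API available: leave the bytes as they are
#endif
    return ansi;
}
```

For a string line read from the console, the program would then compare oem_to_ansi(line) with its cp1251 literals instead of comparing line itself.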

These functions can also be used before writing a string to a file or after reading a string from a file, in case the encodings do not match.
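As an illustration of the file case, suppose the program keeps its strings in cp1251 but the output file has to be in cp866. The sketch below does the conversion by hand, for ASCII and Russian letters only, so that it does not depend on the Windows API; on Windows, CharToOemBuff() would be the natural choice instead. Both function names are hypothetical.

```cpp
#include <string>

// Hypothetical helper: converts one cp1251 byte to cp866.
// Only ASCII and the Russian letters are handled; the rest becomes '?'.
unsigned char cp1251_to_cp866(unsigned char c)
{
    if (c < 0x80)
        return c;                                    // ASCII
    if (c >= 0xC0 && c <= 0xEF)
        return static_cast<unsigned char>(c - 0x40); // А..Я and а..п
    if (c >= 0xF0)
        return static_cast<unsigned char>(c - 0x10); // р..я
    if (c == 0xA8)
        return 0xF0;                                 // Ё
    if (c == 0xB8)
        return 0xF1;                                 // ё
    return '?';
}

// Returns a copy of `text` (held in cp1251) re-encoded as cp866.
std::string to_cp866(const std::string& text)
{
    std::string result(text);
    for (std::string::size_type i = 0; i < result.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(result[i]);
        result[i] = static_cast<char>(cp1251_to_cp866(c));
    }
    return result;
}
```

Writing to the file then becomes fout << to_cp866(str) instead of fout << str.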

P. S. Don't judge too harshly: this is my first attempt at writing an article. I saw that other people write, so why shouldn't I? Especially since I had something to share. It turned out to be hard work: writing, checking for mistakes, and preparing illustrations.

I was prompted to write this article by a question about garbled characters ("krakozyabry") in XP. For the occasion, an old XP computer was even dug out of the ruins. It turned out that the problem really exists. I had to do a little research, the results of which I have laid out here. I also gathered the information available on this topic from the site's materials, so that everything would be in one place. How well I succeeded is for you to judge. I will be glad to see remarks and additions in the comments to the article.