Portability of Encoding Conversion Functions in C

Keywords: encoding github C

Moved from another blog. Original address (encoding issues in C): https://langzi989.github.io/2017/07/17/C/

Introduction to Encodings

Some Chinese text is unavoidable in code. When it appears, we need to consider which encoding the Chinese text uses; if we ignore this, the result may be garbled output or corrupted data. Common encodings for Chinese text include GBK, GB2312, and Unicode. For a detailed introduction, see the following articles:

Encoding Conversion in C

In C, if you need to convert text between encodings, you can use the iconv family of functions.
Header files and common functions:

#include <iconv.h>
typedef void* iconv_t;

extern iconv_t iconv_open(const char* to_code, const char* from_code);

extern size_t iconv(iconv_t cd, char** restrict inbuf, size_t* in_left_buf, char** restrict outbuf, size_t* out_left_buf);

extern int iconv_close(iconv_t cd);

iconv_open

Function Description

This function specifies the two encodings to convert between and returns a conversion descriptor. On failure it returns (iconv_t)(-1) and sets errno.

Parameter Description

  • to_code: target encoding
  • from_code: source encoding
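As a minimal sketch of the call described above, the helper below checks whether a given pair of encodings can be opened. The name can_convert is hypothetical and not part of iconv; which encoding names are accepted depends on the iconv implementation installed on the system.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of iconv): returns 1 if a conversion
 * from `from_code` to `to_code` can be opened on this system, else 0. */
int can_convert(const char *to_code, const char *from_code) {
    /* Note the argument order: the target encoding comes first. */
    iconv_t cd = iconv_open(to_code, from_code);
    if (cd == (iconv_t)(-1)) {
        /* errno is typically EINVAL when the pair is not supported. */
        fprintf(stderr, "iconv_open(%s, %s): %s\n",
                to_code, from_code, strerror(errno));
        return 0;
    }
    iconv_close(cd);
    return 1;
}
```

For example, can_convert("UTF-8", "GBK") reports whether GBK input can be converted to UTF-8 on the current machine.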

iconv

extern size_t iconv(iconv_t cd, char** restrict inbuf, size_t* in_left_buf, char** restrict outbuf, size_t* out_left_buf);

Function Description

This function reads data from inbuf, converts it to the target encoding, and writes the result to outbuf. On success it returns the number of characters converted in a non-reversible way (often 0); on failure it returns (size_t)(-1) and sets errno.

Parameter Description

  • cd: Conversion descriptor, obtained by iconv_open
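  • inbuf: address of a pointer to the input buffer; iconv advances the pointer past the bytes it has consumed
  • in_left_buf: pointer to the number of bytes in the input buffer that have not been converted yet
  • outbuf: address of a pointer to the output buffer; iconv advances the pointer past the bytes it has written
  • out_left_buf: pointer to the number of bytes of free space remaining in the output buffer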
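The way iconv updates its pointer and counter arguments can be sketched as follows. The helper name convert_buffer is hypothetical, and the sketch assumes the glibc-style prototype shown above, where inbuf is a char**. Both counters are declared as size_t, so their addresses can be passed without any cast.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts `in_len` bytes of `in` into `out`.
 * Returns the number of bytes written, or (size_t)(-1) on failure. */
size_t convert_buffer(const char *to_code, const char *from_code,
                      const char *in, size_t in_len,
                      char *out, size_t out_cap) {
    iconv_t cd = iconv_open(to_code, from_code);
    if (cd == (iconv_t)(-1)) return (size_t)(-1);

    /* iconv takes char**, but it does not modify the input bytes. */
    char *in_ptr = (char *)in;
    char *out_ptr = out;
    size_t in_left = in_len;    /* bytes not yet converted */
    size_t out_left = out_cap;  /* free space left in `out` */

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)(-1)) return (size_t)(-1);

    /* out_ptr has advanced and out_left has shrunk by the bytes written. */
    return out_cap - out_left;
}
```

The number of bytes written is recovered from the counters rather than from the return value, which only counts non-reversible conversions.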

iconv_close

extern int iconv_close(iconv_t cd);

Closes the conversion descriptor that was opened by iconv_open.

Example Conversion Function

#include <iostream>
#include <string>
#include <iconv.h>
#include <cstring>
#include <errno.h>
using namespace std;


string convertCode(const string& p_str, const char* from, const char* to) {
  const size_t BUF_LEN = 10240;
  char bufOut[BUF_LEN];
  memset(bufOut, 0x0, sizeof(bufOut));

  iconv_t cd = iconv_open(to, from);
  if (cd == (iconv_t)(-1)) {
    std::cout << "open iconv error" << std::endl;
    return "";
  }

  // iconv expects size_t counters. Declaring them as int and casting the
  // pointers to size_t* corrupts the lengths on 64-bit systems.
  size_t lenin = p_str.length();
  size_t lenout = BUF_LEN - 1;  // keep the last byte '\0' for printing below

  // iconv takes char**, but it does not modify the input bytes.
  char* sin = const_cast<char*>(p_str.c_str());
  char* sout = bufOut;

  size_t ret = iconv(cd, &sin, &lenin, &sout, &lenout);

  // EILSEQ (84 on Linux): invalid or incomplete multibyte or wide character
  if (ret == (size_t)(-1)) {
    int err = errno;  // save errno before any other call can change it
    std::cout << strerror(err) << std::endl;
    if (err != EILSEQ) {
      iconv_close(cd);
      return "";
    }
  }
  std::cout << "bufout:" << bufOut << std::endl;
  std::cout << "bufout end" << std::endl;
  iconv_close(cd);

  // sout now points just past the last byte that iconv wrote.
  return string(bufOut, sout - bufOut);
}

int main() {
  string s = "Ha-ha";
  std::cout << s.length() << std::endl;
  s = convertCode(s, "gbk", "utf-8//IGNORE");
  //std::cout << s << std::endl;
  std::cout << s.length() << std::endl;
}
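The example above uses a fixed 10240-byte output buffer, so larger results do not fit. One possible way around this, sketched below in C, is to grow the output buffer whenever iconv fails with E2BIG and then continue from where it stopped. The helper name convert_alloc and its growth policy are assumptions made for illustration, not part of the original code.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd buffer holding the converted
 * bytes and stores their count in *out_len. Returns NULL on failure.
 * The caller frees the result. */
char *convert_alloc(const char *to_code, const char *from_code,
                    const char *in, size_t in_len, size_t *out_len) {
    iconv_t cd = iconv_open(to_code, from_code);
    if (cd == (iconv_t)(-1)) return NULL;

    size_t cap = in_len + 16;  /* initial guess; grown on demand */
    size_t used = 0;
    char *out = malloc(cap);
    if (out == NULL) { iconv_close(cd); return NULL; }

    char *in_ptr = (char *)in;  /* iconv does not modify the input bytes */
    size_t in_left = in_len;

    while (in_left > 0) {
        char *out_ptr = out + used;
        size_t out_left = cap - used;
        size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        used = cap - out_left;
        if (rc != (size_t)(-1)) break;  /* all input consumed */
        if (errno != E2BIG) {           /* a real conversion error */
            free(out);
            iconv_close(cd);
            return NULL;
        }
        /* Output buffer full: double it and resume the conversion. */
        char *bigger = realloc(out, cap * 2);
        if (bigger == NULL) { free(out); iconv_close(cd); return NULL; }
        out = bigger;
        cap *= 2;
    }

    iconv_close(cd);
    *out_len = used;
    return out;
}
```

Because iconv advances in_ptr and decrements in_left even when it stops with E2BIG, the loop can simply call it again after enlarging the buffer.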

Cause of segmentation faults in the iconv function

Segmentation faults may occur when converting with the iconv function. To see why, look closely at its function prototype:

extern size_t iconv(iconv_t cd, char** restrict inbuf, size_t* in_left_buf, char** restrict outbuf, size_t* out_left_buf);

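Casting an int pointer to a size_t pointer (for example, passing (size_t*)&len where len is an int) causes problems on some systems: iconv reads and writes a full size_t through that pointer, which results in wrong lengths, out-of-bounds memory access, and segmentation faults. The error message is as follows: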

Program received signal SIGSEGV, Segmentation fault.
from_gbk (irreversible=0x7fffffffb188, outend=0x61d7c0 "", outptrp=<synthetic pointer>,
    inend=0xa7ffffffdb76 <error: Cannot access memory at address 0xa7ffffffdb76>,
    inptrp=0x7fffffffb2e8, step_data=0x6157d0, step=0x615030) at ../iconv/loop.c:325
325	../iconv/loop.c: No such file or directory.
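If a length is only available as an int, one way to avoid the crash is to copy the value into a real size_t variable instead of casting the pointer. The sketch below shows this under that assumption; the helper name length_to_size is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: copies an int length into a size_t by value.
 * Returns 0 on success, or -1 if the length is negative. */
int length_to_size(int len, size_t *out) {
    if (len < 0) return -1;  /* a negative length would become a huge size_t */
    *out = (size_t)len;      /* value conversion, not pointer reinterpretation */
    return 0;
}
```

The resulting size_t variable can then be passed to iconv by address, so iconv reads and updates storage that really is sizeof(size_t) bytes wide.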

size_t and int types

The size_t type is defined in the stddef.h file. Its definition is platform-dependent; on a 32-bit architecture it is generally defined as:

typedef unsigned int size_t;

On a 64-bit architecture it is generally defined as:

typedef unsigned long size_t;

The int type is 4 bytes long on both 32-bit and 64-bit machines, while long is 4 bytes on 32-bit machines and 8 bytes on 64-bit machines (on Linux and other LP64 systems). Therefore, on a 64-bit machine, treating an int pointer as a size_t pointer makes iconv access 8 bytes where only 4 belong to the int. On a 32-bit system the two types have the same size, so non-negative values come through unchanged, but negative values are still misread as very large unsigned values.
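The size mismatch can be illustrated without triggering undefined behavior by copying bytes with memcpy. The sketch below returns the value a size_t-sized read would see when it starts at an int that is followed in memory by another int. The function name read_as_size_t and the two-int layout are assumptions made purely for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Returns what a size_t-sized read sees when it starts at `len` and
 * `neighbor` happens to sit directly after it in memory. */
size_t read_as_size_t(int len, int neighbor) {
    /* Large enough to hold sizeof(size_t) bytes on 32- and 64-bit systems. */
    int cells[sizeof(size_t) / sizeof(int) + 1] = { len, neighbor };
    size_t seen;
    memcpy(&seen, cells, sizeof seen);  /* well-defined byte copy */
    return seen;
}
```

On a typical little-endian 64-bit Linux system, read_as_size_t(5, 1) yields 4294967301 (5 plus 1 shifted left by 32 bits) rather than 5, which is the kind of corrupted length that leads iconv to run past the end of the buffer. Where size_t and int have the same size, the call simply returns 5.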

Posted by chris_2001 on Fri, 10 May 2019 15:00:30 -0700