edbee - Qt Editor Library
edbee::TextCodecDetector Class Reference

This class is used to detect the encoding of a given string. The detector is based on the Java code of Guillaume LAFORGE.

#include <textcodecdetector.h>


Public Member Functions

 TextCodecDetector (const QByteArray *buffer=0, TextCodec *preferedCodec=0)
 
 TextCodecDetector (const char *buffer, int length=0, TextCodec *preferedCodec=0)
 
virtual ~TextCodecDetector ()
 
virtual TextCodec * detectCodec ()
 Detects the encoding of the provided buffer. If a byte order mark (BOM) is encountered at the beginning of the buffer, we immediately return the charset implied by this BOM; otherwise, the file would not be a human-readable text file.
 
virtual void setBuffer (const char *buf, int length)
 Sets the buffer reference.
 
virtual const char * buffer () const
 Returns the buffer reference.
 
virtual int bufferLength ()
 Returns the buffer length.
 
virtual void setPreferedCodec (TextCodec *codec=0)
 Sets the preferred codec.
 
virtual TextCodec * preferedCodec ()
 
virtual void setFallbackCodec (TextCodec *codec=0)
 Sets the fallback text codec.
 
virtual TextCodec * fallbackCodec () const
 

Static Public Member Functions

static TextCodec * globalPreferedCodec ()
 Returns the static global preferred codec.
 
static void setGlobalPreferedCodec (TextCodec *codec)
 
static bool hasUTF8Bom (const char *buffer, int length)
 Has a Byte Order Marker for UTF-8.
 
static bool hasUTF16LEBom (const char *buffer, int length)
 Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).
 
static bool hasUTF16BEBom (const char *buffer, int length)
 Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).
 
static bool hasUTF32LEBom (const char *buffer, int length)
 Has a Byte Order Marker for UTF-32 Little Endian.
 
static bool hasUTF32BEBom (const char *buffer, int length)
 Has a Byte Order Marker for UTF-32 Big Endian.
 

Protected Member Functions

virtual bool isContinuationChar (char b)
 If the byte has the form 10xxxxxx, then it's a continuation byte of a multi-byte character.
 
virtual bool isTwoBytesSequence (char b)
 If the byte has the form 110xxxxx, then it's the first byte of a two-byte sequence character.
 
virtual bool isThreeBytesSequence (char b)
 If the byte has the form 1110xxxx, then it's the first byte of a three-byte sequence character.
 
virtual bool isFourBytesSequence (char b)
 If the byte has the form 11110xxx, then it's the first byte of a four-byte sequence character.
 
virtual bool isFiveBytesSequence (char b)
 If the byte has the form 111110xx, then it's the first byte of a five-byte sequence character.
 
virtual bool isSixBytesSequence (char b)
 

Detailed Description

This class is used to detect the encoding of a given string. The detector is based on the Java code of Guillaume LAFORGE.

Utility class to guess the encoding of a given byte array. The guess is unfortunately not 100% reliable, especially for 8-bit charsets: without statistical analysis it is not possible to know which 8-bit charset is used. In that case we infer that the charset encountered is the same as the default standard charset.

On the other hand, Unicode files encoded in UTF-16 (little or big endian), or UTF-8 files with a Byte Order Marker, are easy to identify. For UTF-8 files with no BOM, the encoding is easy to guess if the buffer is large enough.

A byte buffer of 4KB or 8KB is sufficient to be able to guess the encoding.

TextCodecDetector detector( &arr );            // arr is a QByteArray
detector.setFallbackCodec( fallback );         // fallback is a TextCodec*
TextCodec* encoding = detector.detectCodec();
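
Since only the first 4 KB to 8 KB are needed, the caller does not have to load a whole file before detection. The following is a minimal sketch of reading such a prefix with the standard library only. `readHead` is a hypothetical helper, not part of edbee, and the 8192-byte default is simply the upper figure quoted above.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of edbee): returns at most maxBytes from the
// start of the file at path, or an empty vector if the file cannot be opened.
std::vector<char> readHead(const std::string& path, std::size_t maxBytes = 8192)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return {};
    }
    std::vector<char> head(maxBytes);
    in.read(head.data(), static_cast<std::streamsize>(head.size()));
    // gcount() reports how many bytes were actually read, which is less than
    // maxBytes when the file is shorter than the requested prefix.
    head.resize(static_cast<std::size_t>(in.gcount()));
    return head;
}
```

The result could then be handed to the `const char *` constructor listed above, for example `TextCodecDetector detector( head.data(), static_cast<int>( head.size() ) );`.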

Constructor & Destructor Documentation

◆ TextCodecDetector() [1/2]

edbee::TextCodecDetector::TextCodecDetector ( const QByteArray *  buffer = 0,
TextCodec *  preferedCodec = 0 
)
explicit

◆ TextCodecDetector() [2/2]

edbee::TextCodecDetector::TextCodecDetector ( const char *  buffer,
int  length = 0,
TextCodec *  preferedCodec = 0 
)
explicit

◆ ~TextCodecDetector()

edbee::TextCodecDetector::~TextCodecDetector ( )
virtual

Member Function Documentation

◆ buffer()

virtual const char* edbee::TextCodecDetector::buffer ( ) const
inline virtual

Returns the buffer reference.

◆ bufferLength()

virtual int edbee::TextCodecDetector::bufferLength ( )
inline virtual

Returns the buffer length.

◆ detectCodec()

TextCodec * edbee::TextCodecDetector::detectCodec ( )
virtual

Detects the encoding of the provided buffer. If a byte order mark (BOM) is encountered at the beginning of the buffer, we immediately return the charset implied by this BOM; otherwise, the file would not be a human-readable text file.

If there is no BOM, this method tries to discern whether the file is UTF-8 or not. If it is not UTF-8, we assume the encoding is the default system encoding (of course, it might be any 8-bit charset, but usually an 8-bit charset is the default one).

It is possible to discern UTF-8 thanks to the bit patterns of its multi-byte sequences:

UCS-4 range (hex.)     UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

With UTF-8, 0xFE and 0xFF never appear.

Returns
the TextCodec that was detected
Todo:
the buffer is not read up to the end, but up to length - 6
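
The byte patterns in the table above suggest a simple validity scan. The following is a minimal sketch of that idea and should not be read as edbee's implementation: `continuationCount` and `looksLikeUtf8` are hypothetical names. As assumptions of this sketch, it accepts the 5- and 6-byte forms from the table, does not reject overlong encodings, and tolerates a sequence that is cut off at the end of the buffer (which can happen when only a prefix of a file is examined).

```cpp
#include <cstddef>

// Hypothetical helper: number of continuation bytes expected after lead byte b,
// or -1 if b cannot start a sequence (stray 10xxxxxx byte, 0xFE or 0xFF).
int continuationCount(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 0; // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 1; // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 2; // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 3; // 11110xxx
    if ((b & 0xFC) == 0xF8) return 4; // 111110xx
    if ((b & 0xFE) == 0xFC) return 5; // 1111110x
    return -1;
}

// Hypothetical helper: true if every complete sequence in the buffer follows
// the lead-byte / continuation-byte patterns of the table.
bool looksLikeUtf8(const char* buffer, std::size_t length)
{
    std::size_t i = 0;
    while (i < length) {
        const int n = continuationCount(static_cast<unsigned char>(buffer[i]));
        if (n < 0) {
            return false;
        }
        const std::size_t count = static_cast<std::size_t>(n);
        if (i + count >= length && count > 0) {
            break; // sequence truncated by the end of the buffer: tolerated
        }
        for (std::size_t k = 1; k <= count; ++k) {
            // every continuation byte must have the form 10xxxxxx
            if ((static_cast<unsigned char>(buffer[i + k]) & 0xC0) != 0x80) {
                return false;
            }
        }
        i += count + 1;
    }
    return true;
}
```

Pure ASCII input passes this check as well, so a caller that wants to distinguish "ASCII only" from "contains multi-byte sequences" would have to track whether any byte with the high bit set was seen.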

◆ fallbackCodec()

virtual TextCodec* edbee::TextCodecDetector::fallbackCodec ( ) const
inline virtual

◆ globalPreferedCodec()

TextCodec * edbee::TextCodecDetector::globalPreferedCodec ( )
static

Returns the static global preferred codec.

◆ hasUTF16BEBom()

bool edbee::TextCodecDetector::hasUTF16BEBom ( const char *  buffer,
int  length 
)
static

Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).

◆ hasUTF16LEBom()

bool edbee::TextCodecDetector::hasUTF16LEBom ( const char *  buffer,
int  length 
)
static

Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).

◆ hasUTF32BEBom()

bool edbee::TextCodecDetector::hasUTF32BEBom ( const char *  buffer,
int  length 
)
static

Has a Byte Order Marker for UTF-32 Big Endian.

◆ hasUTF32LEBom()

bool edbee::TextCodecDetector::hasUTF32LEBom ( const char *  buffer,
int  length 
)
static

Has a Byte Order Marker for UTF-32 Little Endian.

◆ hasUTF8Bom()

bool edbee::TextCodecDetector::hasUTF8Bom ( const char *  buffer,
int  length 
)
static

Has a Byte Order Marker for UTF-8.
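
The five BOM checks can be illustrated with a standalone sketch. The byte signatures below are the standard Unicode ones; the `Bom` enum, `startsWith` and `detectBom` are hypothetical names, and the order of the tests is an assumption of this sketch rather than something the edbee documentation states. The 4-byte UTF-32 signatures are tested first because the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).

```cpp
#include <cstddef>
#include <cstring>

enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// True if the buffer is at least signatureLength bytes long and starts with
// exactly those bytes.
bool startsWith(const char* buffer, std::size_t length,
                const char* signature, std::size_t signatureLength)
{
    return length >= signatureLength
        && std::memcmp(buffer, signature, signatureLength) == 0;
}

// Hypothetical helper: classifies the BOM at the start of the buffer.
Bom detectBom(const char* buffer, std::size_t length)
{
    // UTF-32 first, so that FF FE 00 00 is not reported as UTF-16 LE.
    if (startsWith(buffer, length, "\xFF\xFE\x00\x00", 4)) return Bom::Utf32LE;
    if (startsWith(buffer, length, "\x00\x00\xFE\xFF", 4)) return Bom::Utf32BE;
    if (startsWith(buffer, length, "\xEF\xBB\xBF", 3))     return Bom::Utf8;
    if (startsWith(buffer, length, "\xFF\xFE", 2))         return Bom::Utf16LE;
    if (startsWith(buffer, length, "\xFE\xFF", 2))         return Bom::Utf16BE;
    return Bom::None;
}
```

One consequence of this ordering is that a UTF-16 little-endian buffer whose first character after the BOM is U+0000 would be classified as UTF-32 little endian; the two cases cannot be told apart from the leading four bytes alone.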

◆ isContinuationChar()

virtual bool edbee::TextCodecDetector::isContinuationChar ( char  b)
inline protected virtual

If the byte has the form 10xxxxxx, then it's a continuation byte of a multi-byte character.

◆ isFiveBytesSequence()

virtual bool edbee::TextCodecDetector::isFiveBytesSequence ( char  b)
inline protected virtual

If the byte has the form 111110xx, then it's the first byte of a five-byte sequence character.

◆ isFourBytesSequence()

virtual bool edbee::TextCodecDetector::isFourBytesSequence ( char  b)
inline protected virtual

If the byte has the form 11110xxx, then it's the first byte of a four-byte sequence character.

◆ isSixBytesSequence()

virtual bool edbee::TextCodecDetector::isSixBytesSequence ( char  b)
inline protected virtual

◆ isThreeBytesSequence()

virtual bool edbee::TextCodecDetector::isThreeBytesSequence ( char  b)
inline protected virtual

If the byte has the form 1110xxxx, then it's the first byte of a three-byte sequence character.

◆ isTwoBytesSequence()

virtual bool edbee::TextCodecDetector::isTwoBytesSequence ( char  b)
inline protected virtual

If the byte has the form 110xxxxx, then it's the first byte of a two-byte sequence character.
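
Each of these predicates amounts to a mask-and-compare on the leading bits of the byte. The sketch below uses free functions that borrow the member names for readability; it is not the library's code, and `matches` is a hypothetical helper. The six-byte pattern (1111110x) is inferred from the UTF-8 table in the `detectCodec()` description, since `isSixBytesSequence()` has no description of its own. The cast to `unsigned char` matters because plain `char` may be signed, in which case bytes of 0x80 and above would be negative.

```cpp
// Hypothetical helper: keeps only the bits selected by mask and compares them
// with the expected leading-bit pattern.
inline bool matches(char b, unsigned char mask, unsigned char pattern)
{
    return (static_cast<unsigned char>(b) & mask) == pattern;
}

inline bool isContinuationChar(char b)   { return matches(b, 0xC0, 0x80); } // 10xxxxxx
inline bool isTwoBytesSequence(char b)   { return matches(b, 0xE0, 0xC0); } // 110xxxxx
inline bool isThreeBytesSequence(char b) { return matches(b, 0xF0, 0xE0); } // 1110xxxx
inline bool isFourBytesSequence(char b)  { return matches(b, 0xF8, 0xF0); } // 11110xxx
inline bool isFiveBytesSequence(char b)  { return matches(b, 0xFC, 0xF8); } // 111110xx
inline bool isSixBytesSequence(char b)   { return matches(b, 0xFE, 0xFC); } // 1111110x
```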

◆ preferedCodec()

virtual TextCodec* edbee::TextCodecDetector::preferedCodec ( )
inline virtual

◆ setBuffer()

virtual void edbee::TextCodecDetector::setBuffer ( const char *  buf,
int  length 
)
inline virtual

Sets the buffer reference.

◆ setFallbackCodec()

void edbee::TextCodecDetector::setFallbackCodec ( TextCodec *  codec = 0)
virtual

Sets the fallback text codec.

Parameters
codec  the codec to use. When you pass 0, the system codec is used.

prefer System

◆ setGlobalPreferedCodec()

void edbee::TextCodecDetector::setGlobalPreferedCodec ( TextCodec *  codec)
static

◆ setPreferedCodec()

void edbee::TextCodecDetector::setPreferedCodec ( TextCodec *  codec = 0)
virtual

Sets the preferred codec.

prefer UTF-8

