edbee - Qt Editor Library
This class is used to detect the encoding of a given string. The detector is based on the Java code of Guillaume LAFORGE.
#include <textcodecdetector.h>
Public Member Functions
TextCodecDetector (const QByteArray *buffer=0, TextCodec *preferedCodec=0)
TextCodecDetector (const char *buffer, int length=0, TextCodec *preferedCodec=0)
virtual | ~TextCodecDetector ()
virtual TextCodec * | detectCodec ()
Detects the encoding of the provided buffer. If a byte order mark (BOM) is found at the beginning of the buffer, the charset implied by that BOM is returned immediately, because a file that starts with a BOM but uses another charset would not be a human-readable text file.
virtual void | setBuffer (const char *buf, int length)
Sets the buffer reference.
virtual const char * | buffer () const
Returns the buffer reference.
virtual int | bufferLength ()
Returns the buffer length.
virtual void | setPreferedCodec (TextCodec *codec=0)
Sets the preferred codec.
virtual TextCodec * | preferedCodec ()
virtual void | setFallbackCodec (TextCodec *codec=0)
Sets the fallback text codec.
virtual TextCodec * | fallbackCodec () const
Static Public Member Functions
static TextCodec * | globalPreferedCodec ()
Returns the static global preferred codec.
static void | setGlobalPreferedCodec (TextCodec *codec)
static bool | hasUTF8Bom (const char *buffer, int length)
Has a Byte Order Marker for UTF-8.
static bool | hasUTF16LEBom (const char *buffer, int length)
Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).
static bool | hasUTF16BEBom (const char *buffer, int length)
Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).
static bool | hasUTF32LEBom (const char *buffer, int length)
Has a Byte Order Marker for UTF-32 Little Endian.
static bool | hasUTF32BEBom (const char *buffer, int length)
Has a Byte Order Marker for UTF-32 Big Endian.
Protected Member Functions
virtual bool | isContinuationChar (char b)
If the byte has the form 10xxxxxx, then it's a continuation byte of a multi-byte character.
virtual bool | isTwoBytesSequence (char b)
If the byte has the form 110xxxxx, then it's the first byte of a two-byte sequence character.
virtual bool | isThreeBytesSequence (char b)
If the byte has the form 1110xxxx, then it's the first byte of a three-byte sequence character.
virtual bool | isFourBytesSequence (char b)
If the byte has the form 11110xxx, then it's the first byte of a four-byte sequence character.
virtual bool | isFiveBytesSequence (char b)
If the byte has the form 111110xx, then it's the first byte of a five-byte sequence character.
virtual bool | isSixBytesSequence (char b)
Utility class to guess the encoding of a given byte array. The guess is unfortunately not 100% reliable, especially for 8-bit charsets: without statistical analysis it is not possible to know which 8-bit charset is used. In that case the detector infers that the charset is the same as the default standard charset.
On the other hand, Unicode files encoded in UTF-16 (little or big endian) or UTF-8 files with a byte order mark (BOM) are easy to identify. For UTF-8 files without a BOM, the encoding is easy to guess if the buffer is large enough.
A byte buffer of 4 KB or 8 KB is sufficient to guess the encoding.
// byteArray is a QByteArray holding the first bytes of the file
TextCodecDetector detector( &byteArray );
TextCodec* codec = detector.detectCodec();
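TextCodecDetector (const QByteArray *buffer=0, TextCodec *preferedCodec=0) [explicit]
TextCodecDetector (const char *buffer, int length=0, TextCodec *preferedCodec=0) [explicit]
virtual ~TextCodecDetector () [virtual]
virtual const char * buffer () const [inline, virtual]
Returns the buffer reference.
virtual int bufferLength () [inline, virtual]
Returns the buffer length.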
virtual TextCodec * detectCodec () [virtual]
Detects the encoding of the provided buffer. If a byte order mark (BOM) is found at the beginning of the buffer, the charset implied by that BOM is returned immediately, because a file that starts with a BOM but uses another charset would not be a human-readable text file.
If there is no BOM, this method tries to determine whether the file is UTF-8. If it is not UTF-8, the encoding is assumed to be the default system encoding. It could of course be any 8-bit charset, but usually an 8-bit charset is the default one.
UTF-8 can be recognized by the bit patterns of the bytes that make up its multi-byte sequences.
With UTF-8, 0xFE and 0xFF never appear.
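The UTF-8 check described above can be illustrated with a small standalone sketch. This is not edbee's implementation: `looksLikeUtf8` is a hypothetical name, and the sketch makes two assumptions of its own. It accepts only lead bytes for sequences of up to four bytes (the limit in current UTF-8), and it tolerates a sample that is cut off in the middle of a sequence. It does not check for overlong encodings or surrogate code points.

```cpp
// Hypothetical helper, not part of edbee: reports whether a buffer is plausible UTF-8.
bool looksLikeUtf8(const char* buffer, int length)
{
    int i = 0;
    while (i < length) {
        const unsigned char lead = static_cast<unsigned char>(buffer[i++]);
        int continuationBytes = 0;
        if (lead < 0x80) {
            continue;                      // 0xxxxxxx: plain ASCII
        } else if ((lead & 0xE0) == 0xC0) {
            continuationBytes = 1;         // 110xxxxx
        } else if ((lead & 0xF0) == 0xE0) {
            continuationBytes = 2;         // 1110xxxx
        } else if ((lead & 0xF8) == 0xF0) {
            continuationBytes = 3;         // 11110xxx
        } else {
            return false;                  // stray continuation byte, 0xFE/0xFF, or a 5/6-byte lead
        }
        for (int k = 0; k < continuationBytes; ++k) {
            if (i >= length) {
                return true;               // sample ends mid-sequence: accept what was checked
            }
            if ((static_cast<unsigned char>(buffer[i++]) & 0xC0) != 0x80) {
                return false;              // every following byte must be 10xxxxxx
            }
        }
    }
    return true;
}
```

A pure ASCII buffer passes this check, since ASCII is a subset of UTF-8. A caller that wants to fall back to an 8-bit codec for such input would need to track whether any multi-byte sequence was actually seen.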
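virtual TextCodec * fallbackCodec () const [inline, virtual]
static TextCodec * globalPreferedCodec () [static]
Returns the static global preferred codec.
static bool hasUTF16BEBom (const char *buffer, int length) [static]
Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).
static bool hasUTF16LEBom (const char *buffer, int length) [static]
Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).
static bool hasUTF32BEBom (const char *buffer, int length) [static]
Has a Byte Order Marker for UTF-32 Big Endian.
static bool hasUTF32LEBom (const char *buffer, int length) [static]
Has a Byte Order Marker for UTF-32 Little Endian.
static bool hasUTF8Bom (const char *buffer, int length) [static]
Has a Byte Order Marker for UTF-8.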
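As an illustration of what the BOM checks look for, here is a minimal standalone sketch. It is not the library's code: `startsWith` and `encodingFromBom` are hypothetical names, and the returned strings are plain labels rather than `TextCodec` objects. The byte values are the standard Unicode BOM signatures.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: true when `buffer` starts with the `bomLength` bytes in `bom`.
static bool startsWith(const char* buffer, int length, const char* bom, int bomLength)
{
    return buffer != nullptr && length >= bomLength
        && std::memcmp(buffer, bom, static_cast<std::size_t>(bomLength)) == 0;
}

// Hypothetical function: returns the encoding implied by a BOM, or "" when there is none.
std::string encodingFromBom(const char* buffer, int length)
{
    // UTF-32 is tested before UTF-16 because FF FE 00 00 also begins with FF FE.
    if (startsWith(buffer, length, "\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (startsWith(buffer, length, "\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
    if (startsWith(buffer, length, "\xEF\xBB\xBF", 3))     return "UTF-8";
    if (startsWith(buffer, length, "\xFE\xFF", 2))         return "UTF-16BE";
    if (startsWith(buffer, length, "\xFF\xFE", 2))         return "UTF-16LE";
    return "";
}
```

The ordering matters in this sketch: a UTF-32 little-endian BOM would otherwise be reported as UTF-16 little-endian. A UTF-16 little-endian file whose first character is U+0000 is indistinguishable from UTF-32 little-endian by BOM alone, so this remains a heuristic.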
virtual bool isContinuationChar (char b) [inline, protected, virtual]
If the byte has the form 10xxxxxx, then it's a continuation byte of a multi-byte character.
virtual bool isFiveBytesSequence (char b) [inline, protected, virtual]
If the byte has the form 111110xx, then it's the first byte of a five-byte sequence character.
virtual bool isFourBytesSequence (char b) [inline, protected, virtual]
If the byte has the form 11110xxx, then it's the first byte of a four-byte sequence character.
virtual bool isSixBytesSequence (char b) [inline, protected, virtual]
virtual bool isThreeBytesSequence (char b) [inline, protected, virtual]
If the byte has the form 1110xxxx, then it's the first byte of a three-byte sequence character.
virtual bool isTwoBytesSequence (char b) [inline, protected, virtual]
If the byte has the form 110xxxxx, then it's the first byte of a two-byte sequence character.
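The bit patterns above translate directly into mask-and-compare tests. The following is a sketch with hypothetical free functions, not the class's protected methods; it assumes the original UTF-8 layout that allowed five- and six-byte sequences (current UTF-8 stops at four bytes).

```cpp
// Hypothetical free functions, not edbee's implementation.
// The byte is converted to unsigned first so the masks behave the same
// whether plain char is signed or unsigned on the target platform.
static inline unsigned char toByte(char b) { return static_cast<unsigned char>(b); }

bool isContinuationByte(char b) { return (toByte(b) & 0xC0) == 0x80; } // 10xxxxxx
bool isTwoByteLead(char b)      { return (toByte(b) & 0xE0) == 0xC0; } // 110xxxxx
bool isThreeByteLead(char b)    { return (toByte(b) & 0xF0) == 0xE0; } // 1110xxxx
bool isFourByteLead(char b)     { return (toByte(b) & 0xF8) == 0xF0; } // 11110xxx
bool isFiveByteLead(char b)     { return (toByte(b) & 0xFC) == 0xF8; } // 111110xx
bool isSixByteLead(char b)      { return (toByte(b) & 0xFE) == 0xFC; } // 1111110x

// Length of the sequence introduced by `b`: 1 for ASCII, 2 to 6 for a lead byte,
// and 0 for a byte that cannot start a sequence (continuation byte, 0xFE, 0xFF).
int sequenceLength(char b)
{
    if ((toByte(b) & 0x80) == 0x00) return 1;
    if (isTwoByteLead(b))   return 2;
    if (isThreeByteLead(b)) return 3;
    if (isFourByteLead(b))  return 4;
    if (isFiveByteLead(b))  return 5;
    if (isSixByteLead(b))   return 6;
    return 0;
}
```

Each mask keeps one bit more than the prefix it tests, so the terminating 0 bit is part of the comparison and the patterns cannot overlap.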
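virtual TextCodec * preferedCodec () [inline, virtual]
virtual void setBuffer (const char *buf, int length) [inline, virtual]
Sets the buffer reference.
virtual void setFallbackCodec (TextCodec *codec=0) [virtual]
Sets the fallback text codec.
Parameters: codec | the codec to use. When 0 is passed, the system codec is used (prefer System).
static void setGlobalPreferedCodec (TextCodec *codec) [static]
virtual void setPreferedCodec (TextCodec *codec=0) [virtual]
Sets the preferred codec (prefer UTF-8).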