This class is used to detect the encoding of a given string. The detector is based on the Java code of Guillaume LAFORGE.

#include <textcodecdetector.h>

Public Member Functions
	TextCodecDetector (const QByteArray *buffer=0, TextCodec *preferedCodec=0)

	TextCodecDetector (const char *buffer, int length=0, TextCodec *preferedCodec=0)

virtual	~TextCodecDetector ()

virtual TextCodec *	detectCodec ()
	Detects the encoding of the provided buffer. If a Byte Order Marker (BOM) is found at the beginning of the buffer, the charset implied by that BOM is returned immediately, since a file starting with those bytes would otherwise not be human-readable text.

virtual void	setBuffer (const char *buf, int length)
	Sets the buffer reference.

virtual const char *	buffer () const
	Returns the buffer reference.

virtual int	bufferLength ()
	Returns the buffer length.

virtual void	setPreferedCodec (TextCodec *codec=0)
	Sets the preferred codec.

virtual TextCodec *	preferedCodec ()

virtual void	setFallbackCodec (TextCodec *codec=0)
	Sets the fallback text codec.

virtual TextCodec *	fallbackCodec () const

Static Public Member Functions
static TextCodec *	globalPreferedCodec ()
	Returns the static global preferred codec.

static void	setGlobalPreferedCodec (TextCodec *codec)

static bool	hasUTF8Bom (const char *buffer, int length)
	Has a Byte Order Marker for UTF-8.

static bool	hasUTF16LEBom (const char *buffer, int length)
	Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).

static bool	hasUTF16BEBom (const char *buffer, int length)
	Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).

static bool	hasUTF32LEBom (const char *buffer, int length)
	Has a Byte Order Marker for UTF-32 Little Endian.

static bool	hasUTF32BEBom (const char *buffer, int length)
	Has a Byte Order Marker for UTF-32 Big Endian.

Protected Member Functions
virtual bool	isContinuationChar (char b)
	If the byte has the form 10xxxxxx, then it's a continuation byte of a multi-byte character.

virtual bool	isTwoBytesSequence (char b)
	If the byte has the form 110xxxxx, then it's the first byte of a two-byte sequence.

virtual bool	isThreeBytesSequence (char b)
	If the byte has the form 1110xxxx, then it's the first byte of a three-byte sequence.

virtual bool	isFourBytesSequence (char b)
	If the byte has the form 11110xxx, then it's the first byte of a four-byte sequence.

virtual bool	isFiveBytesSequence (char b)
	If the byte has the form 111110xx, then it's the first byte of a five-byte sequence.

virtual bool	isSixBytesSequence (char b)

Detailed Description

This class is used to detect the encoding of a given string. The detector is based on the Java code of Guillaume LAFORGE.

Utility class to guess the encoding of a given byte array. The guess is not 100% reliable, especially for 8-bit charsets: without statistical analysis it is not possible to know which 8-bit charset is used. In that case the detector infers that the charset is the default standard charset.

On the other hand, Unicode files encoded in UTF-16 (little or big endian), or UTF-8 files with a Byte Order Marker, are easy to identify. UTF-8 files without a BOM are also easy to recognize, provided the buffer is large enough.

A byte buffer of 4KB or 8KB is sufficient to be able to guess the encoding.

Example, where bytes is a QByteArray holding the data and fallback is a TextCodec pointer:

	TextCodecDetector detector( &bytes );
	detector.setFallbackCodec( fallback );
	TextCodec* codec = detector.detectCodec();
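The 4KB to 8KB guideline above suggests that only the head of a file needs to be loaded before detection. The following is a minimal sketch of that idea using only the C++ standard library; the function name readSample and its default of 8192 bytes are assumptions made for illustration and are not part of edbee.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads at most maxBytes from the start of the file.
// Returns an empty vector if the file cannot be opened.
std::vector<char> readSample(const std::string& path, std::size_t maxBytes = 8192)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return {};
    }
    std::vector<char> sample(maxBytes);
    in.read(sample.data(), static_cast<std::streamsize>(sample.size()));
    // Keep only the bytes that were actually read (the file may be shorter).
    sample.resize(static_cast<std::size_t>(in.gcount()));
    return sample;
}
```

The resulting sample could then be handed to the detector through the documented (const char *, int) constructor or setBuffer(), for example with sample.data() and static_cast<int>( sample.size() ).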

Constructor & Destructor Documentation

◆ TextCodecDetector() [1/2]

edbee::TextCodecDetector::TextCodecDetector	(	const QByteArray *	buffer = 0,
		TextCodec *	preferedCodec = 0 )

explicit

◆ TextCodecDetector() [2/2]

edbee::TextCodecDetector::TextCodecDetector	(	const char *	buffer,
		int	length = 0,
		TextCodec *	preferedCodec = 0 )

explicit

◆ ~TextCodecDetector()

edbee::TextCodecDetector::~TextCodecDetector ( )

virtual

Member Function Documentation

◆ buffer()

virtual const char * edbee::TextCodecDetector::buffer ( ) const

inline virtual

Returns the buffer reference.

◆ bufferLength()

virtual int edbee::TextCodecDetector::bufferLength ( )

inline virtual

Returns the buffer length.

◆ detectCodec()

TextCodec * edbee::TextCodecDetector::detectCodec ( )

virtual

Detects the encoding of the provided buffer. If a Byte Order Marker (BOM) is found at the beginning of the buffer, the charset implied by that BOM is returned immediately, since a file starting with those bytes would otherwise not be human-readable text.

If there is no BOM, this method tries to determine whether the buffer is UTF-8. If it is not UTF-8, the default system encoding is assumed (it might be any 8-bit charset, but an 8-bit file usually uses the default one).

UTF-8 can be recognized by the bit pattern of its multi-byte sequences:

UCS-4 range (hex.)        UTF-8 octet sequence (binary)
0000 0000-0000 007F       0xxxxxxx
0000 0080-0000 07FF       110xxxxx 10xxxxxx
0000 0800-0000 FFFF       1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF       111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF       1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

With UTF-8, 0xFE and 0xFF never appear.
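The pattern check described above can be sketched as a loop that reads a lead byte, derives the expected sequence length, and then verifies that the following bytes are continuation bytes. This is an illustrative sketch only, not the edbee implementation: the names sequenceLength and looksLikeUtf8 are hypothetical, a sequence cut off by the end of the buffer is tolerated on the assumption that the buffer is only the head of a file, and overlong forms are not rejected. The five- and six-byte forms are accepted because they appear in the table, although RFC 3629 limits UTF-8 to four bytes.

```cpp
#include <cstddef>

// Hypothetical helper: total length of the sequence introduced by lead byte b
// (1 to 6), or 0 if b cannot start a sequence.
static std::size_t sequenceLength(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    if ((b & 0xFC) == 0xF8) return 5;  // 111110xx
    if ((b & 0xFE) == 0xFC) return 6;  // 1111110x
    return 0;                          // continuation byte, 0xFE or 0xFF
}

// True if every byte in the buffer fits the lead/continuation pattern.
bool looksLikeUtf8(const char* buf, std::size_t length)
{
    std::size_t i = 0;
    while (i < length) {
        const std::size_t n = sequenceLength(static_cast<unsigned char>(buf[i]));
        if (n == 0) {
            return false;  // stray continuation byte, 0xFE or 0xFF
        }
        for (std::size_t k = 1; k < n; ++k) {
            if (i + k >= length) {
                return true;  // sequence truncated by the end of the buffer
            }
            if ((static_cast<unsigned char>(buf[i + k]) & 0xC0) != 0x80) {
                return false;  // expected 10xxxxxx
            }
        }
        i += n;
    }
    return true;
}
```

Pure ASCII input passes this check, since ASCII is a subset of UTF-8; a caller that wants to distinguish the two would need to track whether any multi-byte sequence was seen.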

Returns: the TextCodec that was detected

Todo: the buffer is not read up to the end, but up to length - 6

◆ fallbackCodec()

virtual TextCodec * edbee::TextCodecDetector::fallbackCodec ( ) const

inline virtual

◆ globalPreferedCodec()

TextCodec * edbee::TextCodecDetector::globalPreferedCodec ( )

static

Returns the static global preferred codec.

◆ hasUTF16BEBom()

bool edbee::TextCodecDetector::hasUTF16BEBom	(	const char *	buffer,
		int	length )

static

Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).

◆ hasUTF16LEBom()

bool edbee::TextCodecDetector::hasUTF16LEBom	(	const char *	buffer,
		int	length )

static

Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).

◆ hasUTF32BEBom()

bool edbee::TextCodecDetector::hasUTF32BEBom	(	const char *	buffer,
		int	length )

static

Has a Byte Order Marker for UTF-32 Big Endian.

◆ hasUTF32LEBom()

bool edbee::TextCodecDetector::hasUTF32LEBom	(	const char *	buffer,
		int	length )

static

Has a Byte Order Marker for UTF-32 Little Endian.

◆ hasUTF8Bom()

bool edbee::TextCodecDetector::hasUTF8Bom	(	const char *	buffer,
		int	length )

static

Has a Byte Order Marker for UTF-8.
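For orientation, the byte order marks tested by the hasUTF…Bom() functions are EF BB BF (UTF-8), FE FF and FF FE (UTF-16 big and little endian), and 00 00 FE FF and FF FE 00 00 (UTF-32 big and little endian). The sketch below combines these checks in one standalone function; the enum Bom and the function detectBom are hypothetical names and do not reflect how edbee implements its static helpers. One point worth noting is the order of the tests: the UTF-32 little-endian mark begins with the UTF-16 little-endian mark, so the sketch tests the longer mark first.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical result type for the sketch.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Returns which byte order mark, if any, starts the buffer.
Bom detectBom(const char* buf, std::size_t length)
{
    static const unsigned char utf32be[] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char utf32le[] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char utf8[]    = {0xEF, 0xBB, 0xBF};
    static const unsigned char utf16be[] = {0xFE, 0xFF};
    static const unsigned char utf16le[] = {0xFF, 0xFE};

    // UTF-32 is tested before UTF-16 because FF FE 00 00 starts with FF FE.
    if (length >= 4 && std::memcmp(buf, utf32be, 4) == 0) return Bom::Utf32BE;
    if (length >= 4 && std::memcmp(buf, utf32le, 4) == 0) return Bom::Utf32LE;
    if (length >= 3 && std::memcmp(buf, utf8, 3) == 0)    return Bom::Utf8;
    if (length >= 2 && std::memcmp(buf, utf16be, 2) == 0) return Bom::Utf16BE;
    if (length >= 2 && std::memcmp(buf, utf16le, 2) == 0) return Bom::Utf16LE;
    return Bom::None;
}
```

With this ordering, a UTF-16 little-endian buffer whose first character after the mark is U+0000 would be reported as UTF-32; that ambiguity is inherent in the marks themselves.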

◆ isContinuationChar()

virtual bool edbee::TextCodecDetector::isContinuationChar ( char b )

inline protected virtual

If the byte has the form 10xxxxxx, then it's a continuation byte of a multi-byte character.

◆ isFiveBytesSequence()

virtual bool edbee::TextCodecDetector::isFiveBytesSequence ( char b )

inline protected virtual

If the byte has the form 111110xx, then it's the first byte of a five-byte sequence.

◆ isFourBytesSequence()

virtual bool edbee::TextCodecDetector::isFourBytesSequence ( char b )

inline protected virtual

If the byte has the form 11110xxx, then it's the first byte of a four-byte sequence.

◆ isSixBytesSequence()

virtual bool edbee::TextCodecDetector::isSixBytesSequence ( char b )

inline protected virtual

◆ isThreeBytesSequence()

virtual bool edbee::TextCodecDetector::isThreeBytesSequence ( char b )

inline protected virtual

If the byte has the form 1110xxxx, then it's the first byte of a three-byte sequence.

◆ isTwoBytesSequence()

virtual bool edbee::TextCodecDetector::isTwoBytesSequence ( char b )

inline protected virtual

If the byte has the form 110xxxxx, then it's the first byte of a two-byte sequence.
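The byte-form tests documented for isContinuationChar() and the is…BytesSequence() helpers all follow one rule: a lead byte for an n-byte sequence starts with n one-bits followed by a zero-bit, and a continuation byte starts with 10. The sketch below derives the masks from n instead of hard-coding them. The free functions isContinuationByte and isLeadByteOfLength are hypothetical and are not edbee members. Because the documented helpers take a plain char, which may be signed, the sketch converts to unsigned char before masking.

```cpp
// Hypothetical helper: true if b has the form 10xxxxxx.
inline bool isContinuationByte(char b)
{
    return (static_cast<unsigned char>(b) & 0xC0u) == 0x80u;
}

// Hypothetical helper: true if b is a lead byte for an n-byte sequence
// (n in 2..6), i.e. n one-bits followed by a zero-bit.
inline bool isLeadByteOfLength(char b, int n)
{
    if (n < 2 || n > 6) {
        return false;
    }
    const unsigned mask    = (0xFFu << (7 - n)) & 0xFFu;  // top n+1 bits
    const unsigned pattern = (0xFFu << (8 - n)) & 0xFFu;  // top n bits set
    return (static_cast<unsigned char>(b) & mask) == pattern;
}
```

For n = 2 this yields the mask 0xE0 and the pattern 0xC0, which matches the 110xxxxx form given for isTwoBytesSequence().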

◆ preferedCodec()

virtual TextCodec * edbee::TextCodecDetector::preferedCodec ( )

inline virtual

◆ setBuffer()

virtual void edbee::TextCodecDetector::setBuffer	(	const char *	buf,
		int	length )

inline virtual

Sets the buffer reference.

◆ setFallbackCodec()

void edbee::TextCodecDetector::setFallbackCodec ( TextCodec * codec = 0 )

virtual

Sets the fallback text codec.

Parameters

codec The codec to use. When 0 is passed, the system codec is preferred and used as the fallback.

◆ setGlobalPreferedCodec()

void edbee::TextCodecDetector::setGlobalPreferedCodec ( TextCodec * codec )

static

◆ setPreferedCodec()

void edbee::TextCodecDetector::setPreferedCodec ( TextCodec * codec = 0 )

virtual

Sets the preferred codec.

When no codec is given (0), UTF-8 is preferred.

The documentation for this class was generated from the following files:

edbee/util/textcodecdetector.h
edbee/util/textcodecdetector.cpp
