Guide: getting the encoding of a string in PHP

(PHP 4 >= 4.0.6, PHP 5, PHP 7, PHP 8)

mb_detect_encoding — Detect character encoding

Description

mb_detect_encoding(string $string, array|string|null $encodings = null, bool $strict = false): string|false

Automatic detection of the intended character encoding can never be entirely reliable; without some additional information, it is similar to decoding an encrypted string without the key. It is always preferable to use an indication of character encoding stored or transmitted with the data, such as a "Content-Type" HTTP header.
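
As a rough sketch of preferring a declared encoding over a guessed one, the helper below reads the charset parameter from a Content-Type header value and falls back to strict detection only when none is declared. The function name encoding_from_content_type and the fallback candidate list are assumptions made for illustration; they are not part of mbstring.

```php
<?php
// Hypothetical helper (not part of mbstring): prefer the declared charset,
// fall back to strict detection only when the header gives none.
function encoding_from_content_type(string $contentType, string $body)
{
    // Look for a charset parameter such as: text/html; charset="ISO-8859-1"
    if (preg_match('/;\s*charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m)) {
        return $m[1];
    }
    // No declaration: guess from an explicit candidate list (an assumption).
    // Returns false when the body is valid in none of the candidates.
    return mb_detect_encoding($body, ['UTF-8', 'ISO-8859-1'], true);
}

var_dump(encoding_from_content_type('text/html; charset=ISO-8859-2', 'abc'));
var_dump(encoding_from_content_type('text/plain', "\xE1\xE9"));
```

The declared value is returned as written in the header; whether mbstring or iconv knows that name is a separate check.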

This function is most useful with multibyte encodings, where not all sequences of bytes form a valid string. If the input string contains such a sequence, that encoding will be rejected, and the next encoding checked.

Parameters

string

The string being inspected.

encodings

A list of character encodings to try, in order. The list may be specified as an array of strings, or a single string separated by commas.

If encodings is omitted or null, the current detect_order (set with the mbstring.detect_order configuration option, or mb_detect_order() function) will be used.

strict

Controls the behaviour when string is not valid in any of the listed encodings. If strict is set to false, the closest matching encoding will be returned; if strict is set to true, false will be returned.

The default value for strict can be set with the mbstring.strict_detection configuration option.

Return Values

The detected character encoding, or false if the string is not valid in any of the listed encodings.

Examples

Example #1 mb_detect_encoding() example

<?php
// Detect character encoding with current detect_order
echo mb_detect_encoding($str);

// "auto" is expanded according to mbstring.language
echo mb_detect_encoding($str, "auto");

// Specify "encodings" parameter by list separated by comma
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");

// Use array to specify "encodings" parameter
$encodings = [
    "ASCII",
    "JIS",
    "EUC-JP"
];
echo mb_detect_encoding($str, $encodings);
?>

Example #2 Effect of strict parameter

<?php
// 'áéóú' encoded in ISO-8859-1
$str = "\xE1\xE9\xF3\xFA";

// The string is not valid ASCII or UTF-8, but UTF-8 is considered a closer match
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], true));

// If a valid encoding is found, the strict parameter does not change the result
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], true));
?>

The above example will output:

string(5) "UTF-8"
bool(false)
string(10) "ISO-8859-1"
string(10) "ISO-8859-1"

In some cases, the same sequence of bytes may form a valid string in multiple character encodings, and it is impossible to know which interpretation was intended. For instance, among many others, the byte sequence "\xC4\xA2" could be:

"Ä¢" (U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS followed by U+00A2 CENT SIGN) encoded in any of ISO-8859-1, ISO-8859-15, or Windows-1252
"ФЂ" (U+0424 CYRILLIC CAPITAL LETTER EF followed by U+0402 CYRILLIC CAPITAL LETTER DJE) encoded in ISO-8859-5
"Ģ" (U+0122 LATIN CAPITAL LETTER G WITH CEDILLA) encoded in UTF-8

Example #3 Effect of order when multiple encodings match

<?php
$str = "\xC4\xA2";

// The string is valid in all three encodings, so the first one listed will be returned
var_dump(mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1', 'ISO-8859-5']));
var_dump(mb_detect_encoding($str, ['ISO-8859-1', 'ISO-8859-5', 'UTF-8']));
var_dump(mb_detect_encoding($str, ['ISO-8859-5', 'UTF-8', 'ISO-8859-1']));
?>

The above example will output:

string(5) "UTF-8"
string(10) "ISO-8859-1"
string(10) "ISO-8859-5"

Gerg Tisza ¶

11 years ago

If you use mb_detect_encoding to check whether a string is valid UTF-8, use strict mode; the result is pretty worthless otherwise.

<?php
$str = 'áéóú'; // ISO-8859-1
mb_detect_encoding($str, 'UTF-8');       // 'UTF-8'
mb_detect_encoding($str, 'UTF-8', true); // false
?>

Chrigu ¶

17 years ago

If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding list:

mb_detect_encoding($string, 'UTF-8, ISO-8859-1');

If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.

chris AT w3style.co DOT uk ¶

16 years ago

Based upon that snippet below using preg_match(), I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something purely to check whether a string contains UTF-8 characters, so that I could switch the character encoding from iso-8859-1 to utf-8.

I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range, and to stop once it finds at least one multibyte sequence. This is quite a lot faster.

<?php
function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )+%xs', $string);
}
?>

rl at itfigures dot nl ¶

15 years ago

I used Chris's function "detectUTF8" to detect the need for conversion from utf8 to 8859-1, which works fine. I did have a problem with the subsequent iconv conversion.

The problem is that the iconv conversion to 8859-1 (with //TRANSLIT) replaces the euro sign with EUR, although in practice \x80 is commonly used as the euro sign in text labelled 8859-1 (strictly, that byte is the euro sign in Windows-1252).

I could not use 8859-15 since that mangled some other characters, so I added 2 str_replace's:

if (detectUTF8($str)) {
    $str = str_replace("\xE2\x82\xAC", "&euro;", $str);
    $str = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);
    $str = str_replace("&euro;", "\x80", $str);
}

If html-output is needed the last line is not necessary (and even unwanted).
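
An alternative worth considering, sketched here under the assumption that the consumer of the output really expects Windows-1252 bytes: convert to Windows-1252 directly, since that encoding assigns the euro sign to byte 0x80 and no placeholder replacement is needed. The sample string is made up for illustration.

```php
<?php
// "\xE2\x82\xAC" is the UTF-8 encoding of U+20AC EURO SIGN.
$utf8 = "5 \xE2\x82\xAC";

// Windows-1252 maps the euro sign to the single byte 0x80.
$cp1252 = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');

echo bin2hex($cp1252), "\n"; // prints 352080
```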

nat3738 at gmail dot com ¶

13 years ago

A simple way to detect the UTF-8/16/32 encoding of a file by its BOM (this does not work with a string or file that has no BOM):

<?php
// Unicode BOM is U+FEFF, but after encoding it looks like this.
define('UTF32_BIG_ENDIAN_BOM',    chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF));
define('UTF32_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00));
define('UTF16_BIG_ENDIAN_BOM',    chr(0xFE) . chr(0xFF));
define('UTF16_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE));
define('UTF8_BOM',                chr(0xEF) . chr(0xBB) . chr(0xBF));

function detect_utf_encoding($filename)
{
    $text = file_get_contents($filename);
    $first2 = substr($text, 0, 2);
    $first3 = substr($text, 0, 3);
    $first4 = substr($text, 0, 4); // UTF-32 BOMs are 4 bytes long

    // The 4-byte checks come before the 2-byte ones because the
    // UTF-32LE BOM starts with the same bytes as the UTF-16LE BOM.
    if ($first3 == UTF8_BOM) return 'UTF-8';
    elseif ($first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';
    elseif ($first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';
    elseif ($first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';
    elseif ($first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';
}
?>
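
The same idea can be applied to a string that is already in memory. The sketch below uses a hypothetical helper name, detect_bom, and returns null when no BOM is present; it is an illustration of the technique rather than a replacement for the note's function.

```php
<?php
// Hypothetical helper: identify a Unicode BOM at the start of a string.
function detect_bom(string $text): ?string
{
    // 4-byte marks are listed before 2-byte marks, because the
    // UTF-32LE mark begins with the UTF-16LE mark.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $name => $bom) {
        // strncmp() is binary safe, so NUL bytes in the mark are compared too.
        if (strncmp($text, $bom, strlen($bom)) === 0) {
            return $name;
        }
    }
    return null; // no BOM found
}

var_dump(detect_bom("\xEF\xBB\xBFabc"));
var_dump(detect_bom("abc"));
```

One limitation to keep in mind: UTF-16LE text whose first character after the BOM is U+0000 starts with the same four bytes as the UTF-32LE mark and would be reported as UTF-32LE.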

php-note-2005 at ryandesign dot com ¶

17 years ago

Much simpler UTF-8-ness checker using a regular expression created by the W3C:

<?php
// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string)
{
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string);
} // function is_utf8
?>

hmdker at gmail dot com ¶

14 years ago

Function to detect UTF-8, when mb_detect_encoding is not available it may be useful.

recentUser at example dot com ¶

4 years ago

In my environment (PHP 7.1.12), "mb_detect_encoding()" doesn't work where "mb_detect_order()" is not set appropriately.

To enable "mb_detect_encoding()" to work in such a case, simply put "mb_detect_order('...')" before "mb_detect_encoding()" in your script file.

Both "ini_set('mbstring.language', '...');" and "ini_set('mbstring.detect_order', '...');" DON'T work in script files for this purpose whereas setting them in PHP.INI file may work.
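
A minimal sketch of that approach, assuming a candidate list of UTF-8 and ISO-8859-1 (chosen only for illustration): set the order at runtime with mb_detect_order(), then call mb_detect_encoding() without an explicit list.

```php
<?php
// Set the detection order for the current request.
mb_detect_order(['UTF-8', 'ISO-8859-1']);

// Called without arguments, mb_detect_order() returns the current order.
print_r(mb_detect_order());

// $encodings is null, so the order set above is used.
// These bytes are not valid UTF-8, which leaves ISO-8859-1 in strict mode.
var_dump(mb_detect_encoding("\xE1\xE9\xF3\xFA", null, true));
```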

garbage at iglou dot eu ¶

5 years ago

To detect UTF-8, you can use:

if (preg_match('!!u', $str)) { echo 'utf-8'; }

- Norihiori
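
For comparison, a short sketch that puts the empty-pattern trick next to mb_check_encoding(), which validates a string against one named encoding. The two sample byte strings are taken from the examples earlier on this page.

```php
<?php
$valid   = "\xC4\xA2";         // U+0122 encoded in UTF-8
$invalid = "\xE1\xE9\xF3\xFA"; // ISO-8859-1 bytes, not valid UTF-8

var_dump(mb_check_encoding($valid, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8')); // bool(false)

// With the /u modifier, preg_match() returns false when the
// subject is not valid UTF-8, and 1 for the empty pattern otherwise.
var_dump(preg_match('//u', $valid));   // int(1)
var_dump(preg_match('//u', $invalid)); // bool(false)
```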

bmrkbyet at web dot de ¶

9 years ago

a) if the FUNCTION mb_detect_encoding is not available:

### mb_detect_encoding ... iconv ###

<?php
// -------------------------------------------
if (!function_exists('mb_detect_encoding')) {
    function mb_detect_encoding($string, $enc = null)
    {
        static $list = array('utf-8', 'iso-8859-1', 'windows-1251');

        foreach ($list as $item) {
            $sample = iconv($item, $item, $string);
            if (md5($sample) == md5($string)) {
                if ($enc == $item) { return true; } else { return $item; }
            }
        }
        return null;
    }
}
// -------------------------------------------
?>

b) if the FUNCTION mb_convert_encoding is not available:

### mb_convert_encoding ... iconv ###

<?php
// -------------------------------------------
if (!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($string, $target_encoding, $source_encoding)
    {
        $string = iconv($source_encoding, $target_encoding, $string);
        return $string;
    }
}
// -------------------------------------------
?>

emoebel at web dot de ¶

8 years ago

if the function " mb_detect_encoding" does not exist ...

... try:

<?php
// ----------------------------------------------------
if (!function_exists('mb_detect_encoding')) {
    // ----------------------------------------------------------------
    function mb_detect_encoding($string, $enc = null, $ret = null)
    {
        static $enclist = array(
            'UTF-8', 'ASCII',
            'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4', 'ISO-8859-5',
            'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10',
            'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16',
            'Windows-1251', 'Windows-1252', 'Windows-1254',
        );

        $result = false;

        foreach ($enclist as $item) {
            $sample = iconv($item, $item, $string);
            if (md5($sample) == md5($string)) {
                if ($ret === NULL) { $result = $item; } else { $result = true; }
                break;
            }
        }

        return $result;
    }
    // ----------------------------------------------------------------
}
// ----------------------------------------------------
?>

example / usage of: mb_detect_encoding()

<?php
// ------------------------------------------------------
function str_to_utf8($str)
{
    if (mb_detect_encoding($str, 'UTF-8', true) === false) {
        $str = utf8_encode($str);
    }
    return $str;
}
// ------------------------------------------------------
?>

$txtstr = str_to_utf8($txtstr);
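
Note that utf8_encode() is deprecated as of PHP 8.2. A hedged equivalent, assuming (as utf8_encode() does) that any input which is not valid UTF-8 is ISO-8859-1, could look like the sketch below; the name str_to_utf8_mb is hypothetical.

```php
<?php
// Hypothetical variant of str_to_utf8() that avoids utf8_encode().
// Assumption: input that is not valid UTF-8 is ISO-8859-1.
function str_to_utf8_mb(string $str): string
{
    if (!mb_check_encoding($str, 'UTF-8')) {
        $str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
    }
    return $str;
}

echo bin2hex(str_to_utf8_mb("\xE9")), "\n";     // prints c3a9
echo bin2hex(str_to_utf8_mb("\xC3\xA9")), "\n"; // prints c3a9 (left unchanged)
```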

d_maksimov ¶

5 months ago

It was helpful for my exec(...) call. When it returned cp866 or cp1251:

try {
    $line = iconv('CP866', 'CP1251', $line);
} catch (Exception $e) {
}
return iconv('CP1251', 'UTF-8', $line);

telemach ¶

17 years ago

Beware: even if you need to distinguish between UTF-8 and ISO-8859-1, and you use the following detection order (as Chrigu suggests),

mb_detect_encoding('accentuée', 'UTF-8, ISO-8859-1')

returns ISO-8859-1, while

mb_detect_encoding('accentué', 'UTF-8, ISO-8859-1')

returns UTF-8.

Bottom line: an ending 'é' (and probably other accented chars) misleads mb_detect_encoding.

maarten ¶

17 years ago

Sometimes mb_detect_encoding is not what you need. When using pdflib, for example, you want to VERIFY the correctness of UTF-8. mb_detect_encoding reports some ISO-8859-1 encoded text as UTF-8. To verify UTF-8, use the following:

//
// utf8 encoding validation developed based on Wikipedia entry at:
// http://en.wikipedia.org/wiki/UTF-8
//
// Implemented as a recursive descent parser based on a simple state machine
// copyright 2005 Maarten Meijer
//
// This cries out for a C-implementation to be included in PHP core
//
function valid_1byte($char) {
    if (!is_int($char)) return false;
    return ($char & 0x80) == 0x00;
}

function valid_2byte($char) {
    if (!is_int($char)) return false;
    return ($char & 0xE0) == 0xC0;
}

function valid_3byte($char) {
    if (!is_int($char)) return false;
    return ($char & 0xF0) == 0xE0;
}

function valid_4byte($char) {
    if (!is_int($char)) return false;
    return ($char & 0xF8) == 0xF0;
}

function valid_nextbyte($char) {
    if (!is_int($char)) return false;
    return ($char & 0xC0) == 0x80;
}

for a drawing of the statemachine see: http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png

yaqy at qq dot com ¶

14 years ago

<?php
/*
 * QQ: 290359552
 * convert to UTF-8 if $str is not detected as 'UTF-8'
 */
function convToUtf8($str)
{
    if (mb_detect_encoding($str, "UTF-8, ISO-8859-1, GBK") != "UTF-8") {
        return iconv("gbk", "utf-8", $str);
    } else {
        return $str;
    }
}
?>

Anonymous ¶

8 years ago

// -----------------------------------------------------------
if (!function_exists('mb_detect_encoding')) {
    function mb_detect_encoding($string, $enc = null, $ret = true)
    {
        $out = $enc;
        static $list = array('utf-8', 'iso-8859-1', 'iso-8859-15', 'windows-1251');
        foreach ($list as $item) {
            $sample = iconv($item, $item, $string);
            if (md5($sample) == md5($string)) {
                $out = ($ret !== false) ? true : $item;
            }
        }
        return $out;
    }
}
// -----------------------------------------------------------

matthijs at ischen dot nl ¶

13 years ago

I seriously underestimated the importance of setlocale...

<?php
$strings = array(
    "mais coisas a pensar sobre diário ou dois!",
    "plus de choses à penser à journalier ou à deux !",
    "¡más cosas a pensar en diario o dos!",
    "più cose da pensare circa giornaliere o due!",
    "flere ting å tenke på hver dag eller to!",
    "Další věcí, přemýšlet o každý den nebo dva!",
    "mehr über Spaß spät schönen",
    "më vonë gjatë fun bukur",
    "több mint szórakozás késő csodálatos kenyér"
);

$convert = array();
setlocale(LC_CTYPE, 'de_DE.UTF-8');
foreach ($strings as $string)
    $convert[] = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
?>

Produces the following:

Array
(
    [0] => mais coisas a pensar sobre diario ou dois!
    [1] => plus de choses a penser a journalier ou a deux !
    [2] => ?mas cosas a pensar en diario o dos!
    [3] => piu cose da pensare circa giornaliere o due!
    [4] => flere ting aa tenke paa hver dag eller to!
    [5] => Dalsi veci, premyslet o kazdy den nebo dva!
    [6] => mehr ueber Spass spaet schoenen
    [7] => me vone gjate fun bukur
    [8] => toebb mint szorakozas keso csodalatos kenyer
)

whereas

<?php
$convert = array();
setlocale(LC_CTYPE, 'nl_NL.UTF-8');
foreach ($strings as $string)
    $convert[] = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
?>

produces:

Array
(
    [0] => mais coisas a pensar sobre di?rio ou dois!
    [1] => plus de choses ? penser ? journalier ou ? deux !
    [2] => ?m?s cosas a pensar en diario o dos!
    [3] => pi? cose da pensare circa giornaliere o due!
    [4] => flere ting ? tenke p? hver dag eller to!
    [5] => Dal?? v?c?, p?em??let o ka?d? den nebo dva!
    [6] => mehr ?ber Spass sp?t sch?nen
    [7] => m? von? gjat? fun bukur
    [8] => t?bb mint sz?rakoz?s k?s? csod?latos keny?r
)

This might be of interest when trying to convert UTF-8 strings into ASCII suitable for URLs and such. It was never obvious to me, since I had only used the us and nl locales.

jaaks at playtech dot com ¶

17 years ago

The last example for verifying UTF-8 has one little bug. If a 10xxxxxx byte occurs alone, i.e. not inside a multibyte char, it is accepted although that is against UTF-8 rules. Make the following replacement to repair it.

Replace

} // goto next char

with

} else {
    return false; // 10xxxxxx occurring alone
} // goto next char
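
The validator being patched is not reproduced in full on this page, so the following is only a sketch of the idea, not the original author's code: a byte-wise structural check in which a stray continuation byte is rejected, as the fix above requires. The function name has_utf8_structure is hypothetical, and the check deliberately ignores overlong forms and surrogates.

```php
<?php
// Structural check only: each lead byte must be followed by the right
// number of continuation bytes. Overlong forms and surrogates are NOT
// rejected here.
function has_utf8_structure(string $s): bool
{
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($s[$i]);
        if (($b & 0x80) === 0x00) {
            $need = 0;                 // 0xxxxxxx: single byte
        } elseif (($b & 0xE0) === 0xC0) {
            $need = 1;                 // 110xxxxx: one continuation byte
        } elseif (($b & 0xF0) === 0xE0) {
            $need = 2;                 // 1110xxxx: two continuation bytes
        } elseif (($b & 0xF8) === 0xF0) {
            $need = 3;                 // 11110xxx: three continuation bytes
        } else {
            return false;              // 10xxxxxx occurring alone, or 11111xxx
        }
        // Consume the expected continuation bytes (10xxxxxx).
        while ($need-- > 0) {
            $i++;
            if ($i >= $len || (ord($s[$i]) & 0xC0) !== 0x80) {
                return false;
            }
        }
    }
    return true;
}

var_dump(has_utf8_structure("\xC4\xA2")); // bool(true)
var_dump(has_utf8_structure("\x80"));     // bool(false)
```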

lexonight at yahoo dot com ¶

5 years ago

<?php
$file = file_get_contents("somefile.txt");
$encodings = implode(',', mb_list_encodings());
echo mb_detect_encoding($file, $encodings, true);
?>

seems to work

prgss at bk dot ru ¶

13 years ago

Another light way to detect character encoding:

<?php
function detect_encoding($string)
{
    static $list = array('utf-8', 'windows-1251');

    foreach ($list as $item) {
        $sample = iconv($item, $item, $string);
        if (md5($sample) == md5($string))
            return $item;
    }
    return null;
}
?>

sunggsun ¶

16 years ago

from PHPDIG

function isUTF8($str)
{
    if ($str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32")) {
        return true;
    } else {
        return false;
    }
}
