What string encoding does PHP use?

A PHP string is just a sequence of bytes, with no encoding tagged to it whatsoever. String values can come from various sources: the client (over HTTP), a database, a file, or from string literals in your source code. PHP reads all these as byte sequences, and it never extracts any encoding information.

As long as all your data sources and destinations use the same encoding, the worst thing that can happen is that string positions are wrong (if you use multi-byte encodings), since PHP will count bytes, not characters.

But if the encodings don't match (e.g. you write a string literal in a source file stored as UTF-8, and then send it to a database that expects Latin-1), PHP will not perform any conversion for you: it will happily copy the bytes over raw.
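To make the mismatch concrete, here is a small sketch of what a Latin-1 consumer "sees" when it receives UTF-8 bytes unchanged. It assumes the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// "é" written as a UTF-8 literal is the two bytes C3 A9.
$utf8 = "\u{e9}";
echo bin2hex($utf8), "\n"; // c3a9

// A consumer that believes the data is ISO-8859-1 reads those two bytes as
// two separate characters, "Ã" and "©". Re-encoding that misreading as
// UTF-8 makes the damage visible.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($misread), "\n"; // c383c2a9
```

PHP itself did nothing wrong here; it only passed the bytes along. The garbling comes from the two sides disagreeing about what the bytes mean.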

The sanest solution is this:

  • Set PHP's internal encoding to UTF-8.
  • Save all your source files as UTF-8.
  • Use UTF-8 as your output encoding (don't forget to send suitable Content-type headers).
  • Set the database connection to use UTF-8 (SET NAMES UTF8 in MySQL).
  • Configure everything else to be UTF-8 if at all possible.
  • For anything that you can't control (e.g. third-party web services), make sure you know the encoding, and convert to UTF-8 as early as possible, and back to the other encoding as late as possible.
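The checklist above could be sketched roughly as follows. This is only one possible setup, assuming the mbstring extension is available. The helper names from_external() and to_external() are hypothetical, and the PDO line is a commented-out illustration with made-up connection values (utf8mb4 is MySQL's name for full UTF-8).

```php
<?php
// Tell PHP (and mbstring) to default to UTF-8.
ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');

// Announce the output encoding to the client (skipped on the command line).
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// For MySQL via PDO, the charset can go in the DSN (hypothetical values):
// $pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', $user, $pass);

/** Hypothetical helper: convert incoming data from a known encoding to UTF-8. */
function from_external(string $bytes, string $encoding): string
{
    return mb_convert_encoding($bytes, 'UTF-8', $encoding);
}

/** Hypothetical helper: convert UTF-8 back to the external encoding on the way out. */
function to_external(string $utf8, string $encoding): string
{
    return mb_convert_encoding($utf8, $encoding, 'UTF-8');
}

$incoming = "caf\xE9";                                    // "café" as ISO-8859-1 bytes
$internal = from_external($incoming, 'ISO-8859-1');       // convert as early as possible
echo bin2hex($internal), "\n";                            // 636166c3a9
echo bin2hex(to_external($internal, 'ISO-8859-1')), "\n"; // 636166e9
```

Keeping the conversions in two small functions at the edges means everything in between can assume UTF-8.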

Why UTF-8? Because it can represent all Unicode characters and thus supersedes all the existing 7-bit and 8-bit encodings, and because it is binary compatible with ASCII, that is, every valid ASCII string is also a valid UTF-8 string (but not vice versa).
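The ASCII compatibility is easy to check. A minimal sketch, assuming mbstring is loaded:

```php
<?php
$ascii  = "hello";   // plain 7-bit ASCII
$latin1 = "caf\xE9"; // "café" as ISO-8859-1 bytes

var_dump(mb_check_encoding($ascii, 'ASCII'));  // bool(true)
var_dump(mb_check_encoding($ascii, 'UTF-8'));  // bool(true): ASCII is valid UTF-8
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false): a lone E9 byte is not valid UTF-8
```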

In your example, what happens is this.

First, you save your source file; your text editor is probably configured to use UTF-8, so your string literal ends up UTF-8 encoded on disk. PHP reads this file, interpreting the string as a series of bytes; $original now holds a UTF-8 encoded string of 7 characters, which is just a byte sequence (though it contains more than 7 bytes, because each character is represented by two or more bytes). If you then call echo $original, the encoded string is sent to the client as-is; if you have told the client to expect UTF-8, everything is fine, but if you haven't, PHP has no way to tell the difference, and you'll end up with garbage in the browser. As an experiment, try this:

$original = "शक्नोम्यत्तुम्";
echo strlen($original);

strlen is not encoding-aware: it effectively assumes a fixed-width 8-bit encoding, that is, one byte per character, so it counts bytes, not characters.
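For comparison, a short sketch of the byte count next to the character count, assuming the mbstring extension is loaded. The sample string is an illustrative one, not the one from the question.

```php
<?php
$s = "caf\u{e9}"; // "café" in UTF-8; the "é" takes two bytes

echo strlen($s), "\n";             // 5 (bytes)
echo mb_strlen($s, 'UTF-8'), "\n"; // 4 (characters)
```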

(PHP 4, PHP 5, PHP 7, PHP 8)

utf8_encode - Converts a string from ISO-8859-1 to UTF-8

Warning

This function has been DEPRECATED as of PHP 8.2.0. Relying on this function is highly discouraged.

Description

utf8_encode(string $string): string

Note:

This function does not attempt to guess the current encoding of the provided string; it assumes it is encoded as ISO-8859-1 (also known as "Latin 1") and converts to UTF-8. Since every sequence of bytes is a valid ISO-8859-1 string, this never results in an error, but will not result in a useful string if a different encoding was intended.

Many web pages marked as using the ISO-8859-1 character encoding actually use the similar Windows-1252 encoding, and web browsers will interpret ISO-8859-1 web pages as Windows-1252. Windows-1252 features additional printable characters, such as the Euro sign (€) and curly quotes (“ ”), instead of certain ISO-8859-1 control characters. This function will not convert such Windows-1252 characters correctly. Use a different function if Windows-1252 conversion is required.
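The difference shows up on a single byte. The sketch below uses mb_convert_encoding() as the "different function" (an assumption on my part; iconv() would be another option) and requires the mbstring extension.

```php
<?php
$byte = "\x80"; // a C1 control character in ISO-8859-1, the Euro sign in Windows-1252

$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c280   (U+0080, an invisible control character)
echo bin2hex($asCp1252), "\n"; // e282ac (U+20AC, the Euro sign)
```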

Parameters

string

An ISO-8859-1 string.

Return Values

Returns the UTF-8 translation of string.

Changelog

Version   Description
8.2.0     This function has been deprecated.
7.2.0     This function has been moved from the XML extension to the core of PHP. In previous versions, it was only available if the XML extension was installed.

Examples
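A minimal sketch of the conversion this page describes. Because utf8_encode() is deprecated as of PHP 8.2.0, the sketch calls mb_convert_encoding() with an explicit ISO-8859-1 source instead (this assumes mbstring is loaded); on versions where utf8_encode() still exists, utf8_encode($iso8859_1_string) is expected to give the same bytes.

```php
<?php
// The string "Zoë" encoded as ISO-8859-1: Z = 5A, o = 6F, ë = EB.
$iso8859_1_string = "\x5A\x6F\xEB";

// Convert from ISO-8859-1 to UTF-8.
$utf8_string = mb_convert_encoding($iso8859_1_string, 'UTF-8', 'ISO-8859-1');

// Print the resulting bytes in hexadecimal.
echo bin2hex($utf8_string), "\n"; // 5a6fc3ab
```

The single byte EB becomes the two-byte UTF-8 sequence C3 AB, while the ASCII bytes are unchanged.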


See Also

  • utf8_decode() - Converts a string from UTF-8 to ISO-8859-1, replacing invalid or unrepresentable characters
  • mb_convert_encoding() - Convert a string from one character encoding to another
  • UConverter::transcode() - Convert a string from one character encoding to another
  • iconv() - Convert a string from one character encoding to another

deceze at gmail dot com

11 years ago

Please note that utf8_encode only converts a string encoded in ISO-8859-1 to UTF-8. A more appropriate name for it would be "iso88591_to_utf8". If your text is not encoded in ISO-8859-1, you do not need this function. If your text is already in UTF-8, you do not need this function. In fact, applying this function to text that is not encoded in ISO-8859-1 will most likely simply garble that text.

If you need to convert text from any encoding to any other encoding, look at iconv() instead.

Aidan Kehoe

18 years ago

Here's some code that addresses the issue that Steven describes in the previous comment;

bisqwit at iki dot fi

17 years ago

For reference, it may be insightful to point out that:
  utf8_encode($s)
is actually identical to:
  recode_string('latin1..utf8', $s)
and:
  iconv('iso-8859-1', 'utf-8', $s)
That is, utf8_encode is a specialized case of character set conversions.

If your string to be converted to utf-8 is something other than iso-8859-1 (such as iso-8859-2 (Polish/Croatian)), you should use recode_string() or iconv() rather than trying to devise complex str_replace statements.
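As a small illustration of why the source encoding matters, the same byte converts differently depending on which ISO-8859 variant you declare. This sketch assumes the iconv extension is available and that the underlying iconv library knows both charset names.

```php
<?php
$byte = "\xB3"; // "ł" in ISO-8859-2, but "³" in ISO-8859-1

$fromLatin2 = iconv('ISO-8859-2', 'UTF-8', $byte);
$fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);

echo bin2hex($fromLatin2), "\n"; // c582 (U+0142, ł)
echo bin2hex($fromLatin1), "\n"; // c2b3 (U+00B3, ³)
```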

a dot rueedlinger at gmail dot com

9 years ago

If you need a function which converts a string array into a utf8 encoded string array then this function might be useful for you:

Oscar Broman

10 years ago

Walk through nested arrays/objects and utf8 encode all strings.

Pini

6 years ago

My version of utf8_encode_deep,
In case you need one that returns a value without changing the original.

        /**
        * Convert Anything To UTF-8
        * @param mixed $var The variable you want to convert.
        * @param boolean $deep Deep conversion? (*Default: TRUE).
        * @return mixed
        */
        function anything_to_utf8($var,$deep=TRUE){
            if(is_array($var)){
                foreach($var as $key => $value){
                    if($deep){
                        $var[$key] = anything_to_utf8($value,$deep);
                    }elseif(!is_array($value) && !is_object($value) && !mb_detect_encoding($value,'utf-8',true)){
                         // Encode the element itself, not the whole container.
                         $var[$key] = utf8_encode($value);
                    }
                }
                return $var;
            }elseif(is_object($var)){
                // Objects are passed by handle, so work on a copy to leave the original untouched.
                $var = clone $var;
                foreach($var as $key => $value){
                    if($deep){
                        $var->$key = anything_to_utf8($value,$deep);
                    }elseif(!is_array($value) && !is_object($value) && !mb_detect_encoding($value,'utf-8',true)){
                         // Encode the property value, not the whole object.
                         $var->$key = utf8_encode($value);
                    }
                }
                return $var;
            }else{
                return (!mb_detect_encoding($var,'utf-8',true))?utf8_encode($var):$var;
            }
        }
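Since utf8_encode() is deprecated as of PHP 8.2.0, the same idea can be sketched with mb_check_encoding() and mb_convert_encoding(). This is not the code from the note above, just a variant under some assumptions: mbstring is loaded, the function name to_utf8_deep() is hypothetical, only public object properties are visited, and any string that already happens to be valid UTF-8 is left alone (which is a heuristic, not a guarantee about its real encoding).

```php
<?php
/**
 * Hypothetical helper: returns a copy of $var in which every string that is
 * not already valid UTF-8 is converted from $from (ISO-8859-1 by default).
 */
function to_utf8_deep($var, string $from = 'ISO-8859-1')
{
    if (is_string($var)) {
        return mb_check_encoding($var, 'UTF-8')
            ? $var
            : mb_convert_encoding($var, 'UTF-8', $from);
    }
    if (is_array($var)) {
        $out = [];
        foreach ($var as $key => $value) {
            $out[$key] = to_utf8_deep($value, $from);
        }
        return $out;
    }
    if (is_object($var)) {
        $copy = clone $var; // shallow copy, so the caller's object is not modified
        foreach (get_object_vars($copy) as $key => $value) {
            $copy->$key = to_utf8_deep($value, $from);
        }
        return $copy;
    }
    return $var; // numbers, booleans, null: returned unchanged
}

$data  = ['name' => "Zo\xEB", 'tags' => ["caf\u{e9}", 42]];
$clean = to_utf8_deep($data);
echo bin2hex($clean['name']), "\n"; // 5a6fc3ab
```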

Mark AT modernbill DOT com

17 years ago

If you haven't guessed already: If the UTF-8 character has no representation in the ISO-8859-1 codepage, a ? will be returned. You might want to wrap a function around this to make sure you aren't saving a bunch of ???? into your database.
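For the reverse direction (UTF-8 to ISO-8859-1), one way to build such a guard is to test up front whether every character fits into ISO-8859-1, that is, whether every code point is U+00FF or below. A sketch, where the function name is_latin1_representable() is hypothetical:

```php
<?php
/**
 * Hypothetical guard: true if $utf8 is valid UTF-8 and contains only
 * code points U+0000..U+00FF, i.e. characters that exist in ISO-8859-1.
 */
function is_latin1_representable(string $utf8): bool
{
    // With the /u modifier, preg_match() returns false for invalid UTF-8,
    // so malformed input is rejected as well.
    return preg_match('/\A[\x{00}-\x{FF}]*\z/u', $utf8) === 1;
}

var_dump(is_latin1_representable("caf\u{e9}")); // bool(true)
var_dump(is_latin1_representable("\u{20AC}"));  // bool(false): the Euro sign would become "?"
```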

rattones at gmail dot com

1 year ago

/**
* Convert all values of an array to utf8_encode
* @author Marcelo Ratton
* @version 1.0
*
* @param  array  $arr   array to encode values
* @param  bool   $keys  true to convert keys to UTF8
* @return array  same   array but with all values encoded to UTF8
*/
function arrayEncodeToUTF8(array $arr, bool $keys = false): array {
  $ret = [];
  foreach ($arr as $k => $v) {
    // Encode the key first so that entries holding nested arrays get converted keys too.
    if ($keys) {
      $k = utf8_encode((string)$k);
    }
    if (is_array($v)) {
      // Pass $keys down so nested levels are treated the same way.
      $ret[$k] = arrayEncodeToUTF8($v, $keys);
    } else {
      $ret[$k] = utf8_encode((string)$v);
    }
  }

  return $ret;
}

rogeriogirodo at gmail dot com

13 years ago

This function may be useful to encode array keys and values (and checks first to see if it's already in UTF format):



Hope this may help.

(NOTE BY danbrown AT php DOT net: Original function written by (cmyk777 AT gmail DOT com) on 28-JAN-09.)

powtac 4t gmx d0t de

11 years ago

I tried a lot of things, but this seems to be the final fail-safe method to convert any string to proper UTF-8.

Janci

16 years ago

I was searching for a function similar to Javascript's unescape(). In most cases it is OK to use the urldecode() function, but not if you've got UTF characters in the strings. They are converted into %uXXXX entities that urldecode() cannot handle.
I googled the net and found a function which actually converts these entities into HTML entities (&#xxx;) that your browser can show correctly. If you're OK with that, the function can be found here: //pure-essence.net/stuff/code/utf8RawUrlDecode.phps

But it was not OK with me because I needed a string in my charset to make some comparisons and other stuff. So I have modified the above function and, in conjunction with the code2utf() function mentioned in some other note here, I have managed to achieve my goal:
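A rough sketch of that kind of decoder, not the code from this note. It assumes PHP 7.2 or later with mbstring (for mb_chr()), uses a hypothetical function name, always produces UTF-8 rather than an arbitrary target charset, and does not combine surrogate pairs (a lone surrogate escape is left as-is).

```php
<?php
/**
 * Hypothetical helper: decode JavaScript escape() output (%uXXXX and %XX
 * sequences) into a UTF-8 string.
 */
function js_unescape_to_utf8(string $s): string
{
    $result = preg_replace_callback(
        '/%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})/',
        static function (array $m): string {
            // Group 1 holds the four hex digits of %uXXXX, group 2 the two of %XX.
            $hex  = $m[1] !== '' ? $m[1] : $m[2];
            $char = mb_chr(hexdec($hex), 'UTF-8');
            // mb_chr() returns false for invalid code points such as surrogates.
            return $char === false ? $m[0] : $char;
        },
        $s
    );

    return $result ?? $s; // fall back to the input if the regex engine fails
}

echo bin2hex(js_unescape_to_utf8('%u0161')), "\n"; // c5a1 (U+0161, š)
echo js_unescape_to_utf8('a%20b'), "\n";           // a b
```

If a charset other than UTF-8 is needed for comparisons, the result could then be passed through iconv() or mb_convert_encoding().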
