Check for non-UTF-8 characters in Python

I have a large number of files and a parser. What I have to do is strip all non-UTF-8 symbols and put the data into MongoDB. Currently I have code like this:

with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        line = line.decode('utf-8', 'ignore')
        line = line.encode('utf-8', 'ignore')

Somehow I still get an error:

bson.errors.InvalidStringData: strings in documents must be valid UTF-8: 
1/b62010montecassianomcir\xe2\x86\x90ta0\xe2\x86\x90008923304320733/290066010401040101506055soccorin

I don't get it. Is there some simple way to do it?

UPD: it seems Python and Mongo don't agree on the definition of a valid UTF-8 string.

maudulus


asked Oct 24, 2014 at 5:24


Try the line below instead of the last two lines. Hope it helps:

line = line.decode('utf-8', 'ignore').encode('utf-8')
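In Python 3, where lines read in text mode are already str, the same idea applies to raw bytes read in binary mode; a minimal sketch (the sample bytes are invented for illustration):

```python
# Decode with errors='ignore' to drop bytes that are not valid UTF-8,
# then re-encode to get a clean UTF-8 bytes object.
raw = b"soccorin\xff\xfe ok"  # \xff and \xfe can never occur in valid UTF-8
clean = raw.decode('utf-8', 'ignore').encode('utf-8')
print(clean)  # b'soccorin ok'
```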

answered Oct 24, 2014 at 13:45

Irshad Bhat


For Python 3, as mentioned in a comment in this thread, you can do:

line = bytes(line, 'utf-8').decode('utf-8', 'ignore')

The 'ignore' parameter prevents an error from being raised if any characters are unable to be decoded.

If your line is already a bytes object (e.g. b'my string') then you just need to decode it with decode('utf-8', 'ignore').
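A quick demonstration of the bytes case (the sample bytes are invented):

```python
# \x80 is a stray continuation byte and \xfe is never valid in UTF-8;
# decoding with 'ignore' drops both silently.
line = b"my \x80string\xfe"
print(line.decode('utf-8', 'ignore'))  # my string
```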

answered Apr 27, 2017 at 6:19

Alex


Example of handling non-UTF-8 characters:

import string

test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi \nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"

print ''.join(x for x in test if x in string.printable)
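Note that this filters against the ASCII-only string.printable, so it removes every non-ASCII character (including valid UTF-8 such as the £ in the sample), not just invalid bytes. A Python 3 version of the same filter might look like:

```python
import string

# Keep only ASCII-printable characters; valid non-ASCII text
# (e.g. '\xa3', i.e. '£') is dropped along with control characters.
test = "Hi \xa325 filler\x03 end"
print(''.join(x for x in test if x in string.printable))  # Hi 25 filler end
```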

answered Jan 5, 2017 at 5:23


with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        line = line.decode('cp1252').encode('utf-8')
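The idea here is to reinterpret the raw bytes under a single-byte encoding, where every byte maps to some character, and then re-encode as UTF-8. Note that cp1252 leaves a few byte values undefined, so a Python 3 sketch of the same approach might use latin-1 instead (the sample bytes are borrowed from the question's error output):

```python
# Every byte value 0-255 is a valid latin-1 character, so this
# decode can never fail; the re-encoded result is valid UTF-8.
raw = b"montecassiano\xe2\x86\x90ta0"
text = raw.decode('latin-1')
print(text.encode('utf-8'))
```

The trade-off is that multi-byte sequences are mangled into separate accented characters rather than preserved, but the output is always valid UTF-8.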

answered Jul 28, 2018 at 5:04

Willem


Remove the non-UTF-8 characters from a string in Python

To remove the non-UTF-8 characters from a string:

  1. Use the str.encode() method to encode the string to a bytes object.
  2. Set the errors keyword argument to ignore to drop any non-UTF-8 characters.
  3. Use the bytes.decode() method to decode the bytes object to a string.


# ✅ remove non utf-8 characters from string
my_str = 'abc'
result = my_str.encode('utf-8', errors='ignore').decode('utf-8')
print(result)  # 👉️ 'abc'

# ----------------------------------------------------

# ✅ remove non utf-8 characters when reading from file
with open('example.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    for line in lines:
        line = line.encode('utf-8', errors='ignore').decode('utf-8')
        print(line)

# ----------------------------------------------------

# ✅ if you are starting with a bytes object
my_bytes = 'abc'.encode('utf-8')
result = my_bytes.decode('utf-8', errors='ignore').encode('utf-8')
print(result)  # 👉️ b'abc'

The first example removes the non-UTF-8 characters from a string.

The str.encode method returns an encoded version of the string as a bytes object. The default encoding is utf-8.

When the errors keyword argument is set to ignore, any characters that cannot be encoded as UTF-8 are dropped from the string.

The next step is to decode the bytes object using the utf-8 encoding.


my_str = 'abc'
result = my_str.encode('utf-8', errors='ignore').decode('utf-8')
print(result)  # 👉️ 'abc'

The bytes.decode method returns a string decoded from the given bytes. The default encoding is utf-8.

The result is a string that doesn't contain any non-UTF-8 characters.

If you need to remove the non-utf-8 characters when reading from a file, use a for loop to iterate over the lines in the file and repeat the same process.


with open('example.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    for line in lines:
        line = line.encode('utf-8', errors='ignore').decode('utf-8')
        print(line)

Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.

If you are starting with a bytes object, you have to use the decode() method to decode the bytes object to a string first.


my_bytes = 'abc'.encode('utf-8')
result = my_bytes.decode('utf-8', errors='ignore').encode('utf-8')
print(result)  # 👉️ b'abc'

Make sure to set the errors keyword argument to ignore in the call to the decode() method to drop any non-UTF-8 characters when converting to a string.

How do you replace a non-UTF-8 character?

Use the str.encode() method to encode the string to a bytes object; the default encoding is utf-8.
Set the errors keyword argument to ignore to drop any non-UTF-8 characters.
Use the bytes.decode() method to decode the bytes object to a string.

What are non-UTF-8 characters?

Non-UTF-8 characters are characters that are not supported by UTF-8 encoding; they may include symbols or characters from unsupported foreign languages. We'll get an error if we attempt to store these characters in a variable or run a file that contains them.
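For instance, the byte 0x80 is a continuation byte that can never start a UTF-8 sequence, so a strict decode fails:

```python
# A strict decode of an invalid start byte raises UnicodeDecodeError.
try:
    b"\x80abc".decode('utf-8')
except UnicodeDecodeError as e:
    print(e.reason)  # invalid start byte
```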

How do I open a non-UTF-8 file?

Simply specify the encoding when opening the file:

with open("xxx.csv", encoding="latin-1") as fd:
    rd = csv.reader(fd)
    ...

How do I remove a non-UTF-8 character?

Use a charset that will accept any byte, such as iso-8859-15 (also known as latin9).
If the output should be UTF-8 but contains errors, use errors=ignore (silently removes non-UTF-8 characters) or errors=replace (replaces non-UTF-8 characters with a replacement marker, usually ?).
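A short side-by-side of the two error handlers (the sample bytes are invented):

```python
# 'ignore' drops the invalid byte; 'replace' substitutes U+FFFD (�).
raw = b"ab\xffcd"
print(raw.decode('utf-8', errors='ignore'))   # abcd
print(raw.decode('utf-8', errors='replace'))  # ab�cd
```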
