I have a large number of files and a parser. What I have to do is strip all non-UTF-8 symbols and put the data in MongoDB. Currently I have code like this:
with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        line = line.decode('utf-8', 'ignore')
        line = line.encode('utf-8', 'ignore')
Somehow I still get an error:
bson.errors.InvalidStringData: strings in documents must be valid UTF-8:
1/b62010montecassianomcir\xe2\x86\x90ta0\xe2\x86\x90008923304320733/290066010401040101506055soccorin
I don't get it. Is there some simple way to do it?
UPD: It seems that Python and MongoDB don't agree on the definition of a valid UTF-8 string.
maudulus
asked Oct 24, 2014 at 5:24
Try the line below instead of the last two lines. Hope it helps:
line = line.decode('utf-8', 'ignore').encode('utf-8')
answered Oct 24, 2014 at 13:45
Irshad Bhat
For python 3, as mentioned in a comment in this thread, you can do:
line = bytes(line, 'utf-8').decode('utf-8', 'ignore')
The 'ignore' parameter prevents an error from being raised if any characters are unable to be decoded.
If your line is already a bytes object (e.g. b'my string') then you just need to decode it with decode('utf-8', 'ignore').
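A minimal Python 3 sketch of the bytes case applied to a file: opening the file in binary mode yields bytes lines, so each one can be decoded with errors ignored. The helper name read_clean_lines and the temporary sample file are assumptions made for illustration, not part of the original question.

```python
import os
import tempfile


def read_clean_lines(path):
    """Yield each line of the file as str, dropping bytes that are not valid UTF-8."""
    # Binary mode gives bytes, so nothing is decoded until we ask for it.
    with open(path, "rb") as fp:
        for raw in fp:
            yield raw.strip().decode("utf-8", "ignore")


# Write a small sample file containing one invalid byte (0xff).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"good line\nbad \xff byte\n")
    sample_path = tmp.name

clean_lines = list(read_clean_lines(sample_path))
os.remove(sample_path)
print(clean_lines)
```

The invalid byte disappears from the second line while the surrounding text is kept.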
answered Apr 27, 2017 at 6:19
Alex
Example of handling non-UTF-8 characters by keeping only printable ASCII characters:
import string
test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi \nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"
print(''.join(x for x in test if x in string.printable))
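For comparison, a short Python 3 sketch of what this filter actually keeps. string.printable holds only ASCII characters, so the filter removes every non-ASCII character, including ones that are perfectly valid UTF-8. The sample text is made up for illustration.

```python
import string

# Contains '£' (\xa3), 'é' (\xe9) and a control character (\x03).
sample = "price \xa325, caf\xe9\x03"

# Keep only characters listed in string.printable (ASCII only).
ascii_only = "".join(ch for ch in sample if ch in string.printable)

# Round-trip through UTF-8: nothing is dropped, because every character encodes.
utf8_round_trip = sample.encode("utf-8").decode("utf-8")

print(repr(ascii_only))
print(repr(utf8_round_trip))
```

If accented letters or currency symbols must survive, the printable filter is too aggressive.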
answered Jan 5, 2017 at 5:23
with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        line = line.decode('cp1252').encode('utf-8')
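This approach assumes the files are really encoded as Windows-1252 (cp1252) rather than UTF-8. A minimal Python 3 sketch of that conversion follows. The sample bytes are made up, and errors='replace' is an added assumption so that the few byte values cp1252 leaves undefined (such as 0x81) do not raise.

```python
# cp1252 bytes: 0xe9 is 'é', 0x80 is the euro sign.
raw = b"caf\xe9 \x80 5"

# Decode from cp1252 to str, then encode the text as UTF-8 bytes.
text = raw.decode("cp1252")
utf8_bytes = text.encode("utf-8")
print(text)
print(utf8_bytes)

# 0x81 is undefined in cp1252; 'replace' substitutes U+FFFD instead of raising.
lenient = b"a\x81b".decode("cp1252", errors="replace")
print(repr(lenient))
```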
answered Jul 28, 2018 at 5:04
Willem
Remove the non-UTF-8 characters from a string in Python
To remove the non-UTF-8 characters from a string:
- Use the str.encode() method to encode the string to a bytes object.
- Set the errors keyword argument to ignore to drop any non-UTF-8 characters.
- Use the bytes.decode() method to decode the bytes object to a string.
# ✅ remove non utf-8 characters from string
my_str = 'abc'
result = my_str.encode('utf-8', errors='ignore').decode('utf-8')
print(result)  # 👉️ 'abc'

# ----------------------------------------------------

# ✅ remove non utf-8 characters when reading from file
# errors='ignore' keeps open() from raising on undecodable bytes
with open('example.txt', 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()

    for line in lines:
        line = line.encode('utf-8', errors='ignore').decode('utf-8')
        print(line)

# ----------------------------------------------------

# ✅ if you are starting with a bytes object
my_bytes = 'abc'.encode('utf-8')
result = my_bytes.decode('utf-8', errors='ignore').encode('utf-8')
print(result)  # 👉️ b'abc'
The first example removes the non utf-8 characters from a string.
The str.encode method returns an encoded version of the string as a bytes object. The default encoding is utf-8.
When the errors keyword argument is set to ignore, any characters that cannot be encoded using the utf-8 encoding are dropped from the string.
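In Python 3 every ordinary character can be encoded as UTF-8, so in practice the only characters this step drops are lone surrogates (which can appear, for example, when bytes were decoded with the surrogateescape handler). A minimal sketch with a made-up string:

```python
# '\udcff' is a lone surrogate: it has no valid UTF-8 encoding.
with_surrogate = "abc\udcffdef"

# A strict encode raises on the surrogate.
try:
    with_surrogate.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With errors='ignore' the surrogate is silently dropped.
cleaned = with_surrogate.encode("utf-8", errors="ignore").decode("utf-8")

print(strict_failed)
print(cleaned)
```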
The next step is to decode the bytes object using the utf-8 encoding.
my_str = 'abc'
result = my_str.encode('utf-8', errors='ignore').decode('utf-8')
print(result)  # 👉️ 'abc'
The bytes.decode method returns a string decoded from the given bytes. The default encoding is utf-8.
The result is a string that doesn't contain any non-utf-8 characters.
If you need to remove the non-utf-8 characters when reading from a file, pass errors='ignore' to open() so that undecodable bytes do not raise while the file is read, then use a for loop to iterate over the lines in the file and repeat the same process.
with open('example.txt', 'r', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()

    for line in lines:
        line = line.encode('utf-8', errors='ignore').decode('utf-8')
        print(line)
Encoding is the process of converting a string to a bytes object and decoding is the process of converting a bytes object to a string.
If you are starting with a bytes object, you have to use the decode() method to decode the bytes object to a string first.
my_bytes = 'abc'.encode('utf-8')
result = my_bytes.decode('utf-8', errors='ignore').encode('utf-8')
print(result)  # 👉️ b'abc'
Make sure to set the errors keyword argument to ignore in the call to the decode() method to drop any non-utf-8 characters when converting to a string.
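To make the effect visible, here is a small sketch using a made-up bytes value that contains one invalid byte. It also shows errors='replace', another standard error handler, which substitutes U+FFFD instead of silently dropping data.

```python
# 0xff can never appear in valid UTF-8.
my_bytes = b"abc\xffdef"

# 'ignore' drops the invalid byte; 'replace' marks where it was.
dropped = my_bytes.decode("utf-8", errors="ignore")
marked = my_bytes.decode("utf-8", errors="replace")

print(dropped)
print(repr(marked))

# Re-encoding the cleaned string gives valid UTF-8 bytes.
print(dropped.encode("utf-8"))
```

Using replace rather than ignore can be helpful when you want to spot how much data was affected.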