Discussion:
Chardet oddity
Albert-Jan Roskam
2024-10-23 17:07:14 UTC
Today I used chardet.detect in the REPL and it returned Windows-1252
(incorrect, because it later resulted in a UnicodeDecodeError). When I ran
chardet as a script (which uses UniversalDetector) this returned
MacRoman. Isn't chardet.detect the correct way? I've used this method many
times.
# Interpreter
>>> contents = open(FILENAME, "rb").read()
>>> chardet.detect(content)
{'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language': ''}
# Terminal
$ python -m chardet FILENAME
FILENAME: MacRoman with confidence 0.7167379080370483
Thanks!
Albert-Jan
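A detector's answer is only a guess, so one way to sanity-check it is to attempt a strict decode before relying on it. The sketch below uses only the standard library; `can_decode` and the sample bytes are hypothetical names and data made up for illustration, not anything from the file in question. It shows why a Windows-1252 guess can end in a UnicodeDecodeError while MacRoman does not: Python's cp1252 codec leaves a few byte values (0x81 among them) unassigned.

```python
def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes under `encoding` without errors."""
    try:
        data.decode(encoding)  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample: byte 0x81 is unassigned in Python's cp1252 codec,
# while MacRoman maps 0x8E to 'é' and 0x81 to 'Å'.
sample = b"caf\x8e \x81"

print(can_decode(sample, "cp1252"))     # False
print(can_decode(sample, "mac_roman"))  # True
```

A successful decode does not prove the guess is right (single-byte codecs accept almost anything), but a failed one does prove it is wrong.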
Stefan Ram
2024-10-23 17:43:51 UTC
Post by Albert-Jan Roskam
Today I used chardet.detect in the REPL and it returned Windows-1252
(incorrect, because it later resulted in a UnicodeDecodeError). When I ran
chardet as a script (which uses UniversalDetector) this returned
MacRoman. Isn't chardet.detect the correct way? I've used this method many
times.
Oof, that's a head-scratcher! Looks like chardet's throwing
you a curveball. Usually, chardet.detect() is the go-to method,
but it seems to be off its game here.

The script version's using UniversalDetector under the hood
(as you wrote), which might be giving it an edge in this case.

It's weird that the confidence levels are so close, though.
Maybe the file's got some quirks that are tripping up the
simpler detect() method.

I'd say stick with the script version for now if it's giving
you better results.

Here's how you can use it in your code:

from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open(FILENAME, 'rb') as file:
    for line in file:
        detector.feed(line)
        if detector.done:
            break
detector.close()
print(detector.result)
Mark Bourne
2024-10-23 19:42:00 UTC
Post by Albert-Jan Roskam
Today I used chardet.detect in the REPL and it returned Windows-1252
(incorrect, because it later resulted in a UnicodeDecodeError). When I ran
chardet as a script (which uses UniversalDetector) this returned
MacRoman. Isn't chardet.detect the correct way? I've used this method many
times.
# Interpreter
contents = open(FILENAME, "rb").read()
chardet.detect(content)
Is that copied and pasted from the terminal, or retyped with possible
transcription errors? As written, you've assigned the bytes read from the
file to `contents`, but passed `content` (with no "s") to `chardet.detect`,
so the result would depend on whatever was previously assigned to `content`.
--
Mark.
Roland Mueller
2024-10-24 15:51:47 UTC
On Wed, 23 Oct 2024 at 20:11, Albert-Jan Roskam via Python-list wrote:
Post by Albert-Jan Roskam
Today I used chardet.detect in the REPL and it returned Windows-1252
(incorrect, because it later resulted in a UnicodeDecodeError). When I ran
chardet as a script (which uses UniversalDetector) this returned
MacRoman. Isn't chardet.detect the correct way? I've used this method many
times.
# Interpreter
contents = open(FILENAME, "rb").read()
chardet.detect(content)
{'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language': ''}
# Terminal
$ python -m chardet FILENAME
FILENAME: MacRoman with confidence 0.7167379080370483
Thanks!
Albert-Jan
The entry point for the chardet module is chardet.cli.chardetect:main, and
main() calls the function description_of(lines, name).
'lines' is a file opened in 'rb' mode and 'name' holds the filename.

I tried this in interactive mode, as shown below. I think the crucial
difference is that description_of(lines, name) reads the opened file line
by line and stops as soon as the detector reports that it is done after
some line.

Reading the whole file into the variable contents will probably give a
different result, depending on the input.
I was not able to reproduce this behaviour.
I am assuming that you used the same Python for both tests.
>>> from chardet.cli import chardetect
>>> chardetect.description_of(open('/tmp/DATE', 'rb'), 'some file')
'some file: ascii with confidence 1.0'

Your approach:
>>> from chardet import detect
>>> detect(open('/tmp/DATE','rb').read())
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}


from /usr/lib/python3/dist-packages/chardet/cli/chardetect.py

def description_of(lines, name='stdin'):
    u = UniversalDetector()
    for line in lines:
        line = bytearray(line)
        u.feed(line)
        # shortcut out of the loop to save reading further - particularly
        # useful if we read a BOM.
        if u.done:
            break
    u.close()
    result = u.result
    ...
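To compare the two feeding strategies side by side, something like the following sketch may help. The function names (`detect_whole`, `detect_by_line`, `compare`) are made up for illustration. The helpers accept any object with the `feed()`, `done`, `close()` and `result` interface that `UniversalDetector` exposes, so chardet itself is only imported inside `compare`. `detect_whole` only approximates what `chardet.detect()` does, and that is an assumption about its internals rather than a copy of its source.

```python
def detect_whole(data, detector):
    """Feed the entire buffer in one call (roughly the chardet.detect() style)."""
    detector.feed(data)
    detector.close()
    return detector.result


def detect_by_line(lines, detector):
    """Feed line by line and stop once the detector says it is done
    (the description_of() style quoted above)."""
    for line in lines:
        detector.feed(line)
        if detector.done:
            break
    detector.close()
    return detector.result


def compare(path):
    """Hypothetical helper: run both strategies on the same file."""
    # Third-party import kept local so the helpers above work without chardet.
    from chardet.universaldetector import UniversalDetector

    with open(path, "rb") as fh:
        data = fh.read()
    whole = detect_whole(data, UniversalDetector())
    by_line = detect_by_line(data.splitlines(keepends=True), UniversalDetector())
    return whole, by_line
```

Calling `compare(FILENAME)` returns both result dicts, which makes it easy to see whether the early exit on `done` changes anything for a given file.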
Albert-Jan Roskam
2024-10-25 10:31:25 UTC
Hi Mark, Roland,
Thanks for your replies. I experimented a bit with both methods and the
derived encoding still differed, even after I removed the "if u.done:
break" (I removed that because I've seen cp1252 files with a UTF-8 BOM in
the past. I kid you not!). But the next day, on closer inspection, I saw
that the file was quite a mess: it contained mojibake. So I don't blame
chardet for not being able to figure out the encoding.
Albert-Jan
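For readers who have not run into either situation, here is a small standard-library sketch of both: a cp1252 body hiding behind a UTF-8 BOM, and mojibake produced by decoding UTF-8 bytes as cp1252. The helper names and the "café" sample are invented for illustration and say nothing about the actual file discussed in this thread.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def is_valid_utf8(data: bytes) -> bool:
    """True if the byte string decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A cp1252 body behind a UTF-8 BOM: the BOM is there, the body is not UTF-8.
mislabeled = codecs.BOM_UTF8 + "café".encode("cp1252")
print(has_utf8_bom(mislabeled), is_valid_utf8(mislabeled))  # True False

# Mojibake: UTF-8 bytes decoded as cp1252 turn one character into two.
garbled = "café".encode("utf-8").decode("cp1252")
print(garbled)  # cafÃ©
```

Once such garbled text has been saved again, the file holds a mix of byte patterns that no single encoding explains well, which fits the low and nearly equal confidence values reported earlier.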
