I have a file, which I can decompress under linux using the following command:
unxz < file.xz > file.txt
How can I do the same using python? If I use python3 and the tarfile module and do the following:
    import sys
    import tarfile
    try:
        with tarfile.open('temp.xz', 'r:xz') as t:
            t.extract()
    except Exception as e:
        print("Error:", e.strerror)
I get the exception: ReadError('invalid header',). So apparently it expects some file or directory information which is not present in the xz file.
So how can I decompress a file without header information?
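For a bare .xz file (compressed data with no tar archive inside), the standard-library lzma module is probably a closer match than tarfile. Below is a minimal sketch that mirrors the unxz command above; the helper name unxz_file is made up for illustration and the paths are whatever you pass in.

```python
import lzma
import shutil


def unxz_file(src_path, dst_path):
    """Decompress a plain .xz file, roughly like `unxz < src > dst`."""
    # lzma.open returns a file object that yields the decompressed bytes.
    with lzma.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        # Copy in chunks so that large files do not have to fit in memory.
        shutil.copyfileobj(fin, fout)


# Example call (file names taken from the question):
# unxz_file("file.xz", "file.txt")
```

tarfile with 'r:xz' only applies when the decompressed payload is itself a tar archive, which would explain the "invalid header" error.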
python-xz
Pure Python implementation of the XZ file format with random access support
Leveraging the lzma module for fast (de)compression
An XZ file can be composed of several streams and blocks. This allows for fast random access
when reading, but this is not supported by Python's builtin lzma
module (which would read all previous blocks for nothing).
| module type | builtin | cffi (C extension) | pure Python |
| --- | --- | --- | --- |
| 📄 read | | | |
| random access | ❌ no (1) | ✔️ yes (2) | ✔️ yes (2) |
| several blocks | ✔️ yes | ✔️✔️ yes (3) | ✔️✔️ yes (3) |
| several streams | ✔️ yes | ✔️ yes | ✔️✔️ yes (4) |
| stream padding | ❌ no (5) | ✔️ yes | ✔️ yes |
| 📝 write | | | |
| w mode | ✔️ yes | ✔️ yes | ✔️ yes |
| x mode | ✔️ yes | ❌ no | ✔️ yes |
| a mode | ✔️ new stream | ✔️ new stream | ⏳ planned |
| r+/w+/… modes | ❌ no | ❌ no | ✔️ yes |
| several blocks | ❌ no | ❌ no | ✔️ yes |
| several streams | ❌ no (6) | ❌ no (6) | ✔️ yes |
| stream padding | ❌ no | ❌ no | ⏳ planned |

1. Reading from a position will read the file from the very beginning
2. Reading from a position will read the file from the beginning of the block
3. Block positions available with the block_boundaries attribute
4. Stream positions available with the stream_boundaries attribute
5. Related issue
6. Possible by manually closing and re-opening in append mode
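Some of the builtin-module behaviour listed above can be checked with the standard library alone: opening in append mode starts a new stream, reading decodes the concatenated streams as one, and seeking backwards works only by decompressing again from the start. A small sketch; the function name append_demo and the temporary file name are arbitrary.

```python
import lzma
import os
import tempfile


def append_demo():
    """Write one stream, append a second one, then read both back with lzma."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.xz")
        with lzma.open(path, "wb") as fout:   # first stream
            fout.write(b"first stream\n")
        with lzma.open(path, "ab") as fout:   # append mode adds a second stream
            fout.write(b"second stream\n")
        with lzma.open(path, "rb") as fin:
            data = fin.read()   # both streams, decoded as one logical stream
            fin.seek(6)         # backward seek: lzma rewinds and decompresses again
            tail = fin.read(6)
    return data, tail
```

On a file this small the cost of the rewind is invisible; on a multi-gigabyte file the same seek would decompress everything before the target position.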
Usage
The API is similar to lzma: you can use either xz.open or xz.XZFile.
Read mode
>>> import xz
>>> with xz.open('example.xz') as fin:
...     fin.read(18)
...     fin.stream_boundaries  # 2 streams
...     fin.block_boundaries   # 4 blocks in first stream, 2 blocks in second stream
...     fin.seek(1000)
...     fin.read(31)
...
b'Hello, world! \xf0\x9f\x91\x8b'
[0, 2000]
[0, 500, 1000, 1500, 2000, 3000]
1000
b'\xe2\x9c\xa8 Random access is fast! \xf0\x9f\x9a\x80'
Opening in text mode works as well, but notice that seek arguments as well as boundaries are still in bytes (just like with lzma.open).
>>> with xz.open('example.xz', 'rt') as fin:
...     fin.read(15)
...     fin.stream_boundaries
...     fin.block_boundaries
...     fin.seek(1000)
...     fin.read(26)
...
'Hello, world! 👋'
[0, 2000]
[0, 500, 1000, 1500, 2000, 3000]
1000
'✨ Random access is fast! 🚀'
Write mode
Writing is only supported from the end of file. It is however possible to truncate the file first. Note that truncating is only supported on block boundaries.
>>> with xz.open('test.xz', 'w') as fout:
...     fout.write(b'Hello, world!\n')
...     fout.write(b'This sentence is still in the previous block\n')
...     fout.change_block()
...     fout.write(b'But this one is in its own!\n')
...
14
45
28
Advanced usage:
- Modes like r+/w+/x+ allow opening a file for both reading and writing at the same time; however, in the current implementation, a block with writing in progress is automatically closed when reading data from it.
- The check, preset and filters arguments to xz.open and xz.XZFile allow configuring the default values for new streams and blocks.
- Change block with the change_block method (the preset and filters attributes can be changed beforehand to apply to the new block).
- Change stream with the change_stream method (the check attribute can be changed beforehand to apply to the new stream).
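Since the API is described as similar to lzma, the meaning of check, preset and filters can be illustrated with the standard-library module itself. The sketch below assumes the arguments carry the same semantics in python-xz, which its documentation should confirm; the sample data and variable names are arbitrary.

```python
import lzma

data = b"Hello, world!\n" * 1000

# preset: compression level from 0 to 9.
# check: which integrity check is stored in the stream.
with_preset = lzma.compress(
    data, format=lzma.FORMAT_XZ, check=lzma.CHECK_CRC64, preset=9
)

# filters: an explicit filter chain, given instead of a preset.
custom_filters = [
    {"id": lzma.FILTER_DELTA, "dist": 1},
    {"id": lzma.FILTER_LZMA2, "preset": 6},
]
with_filters = lzma.compress(
    data, format=lzma.FORMAT_XZ, check=lzma.CHECK_CRC32, filters=custom_filters
)

# Both results are ordinary .xz data and decompress the same way.
restored_preset = lzma.decompress(with_preset)
restored_filters = lzma.decompress(with_filters)
```

In the lzma module, preset and filters are alternatives: passing both to the compressor raises an error, so a preset is folded into the LZMA2 filter entry when a custom chain is used.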
FAQ
How does random access work?
XZ files are made of a number of streams, and
each stream is composed of a number of blocks. This can be seen with xz --list:
$ xz --list file.xz
Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1      13     16.8 MiB    297.9 MiB  0.056  CRC64   file.xz
To read data from the middle of the 10th block, we decompress the 10th block from its start until we reach the middle (and drop that decompressed data), then return the decompressed data from that point.
Choosing a good block size is a trade-off between seeking time during random access and compression ratio.
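The idea can be modelled with the standard library only. The sketch below is a simplified stand-in, not python-xz's actual implementation: each "block" is an independent lzma stream kept in a list, and the function names build_blocks and read_at are invented for illustration.

```python
import bisect
import lzma


def build_blocks(data, block_size):
    """Compress data in independent chunks; return (chunks, boundaries)."""
    chunks, boundaries = [], []
    for start in range(0, len(data), block_size):
        boundaries.append(start)  # uncompressed offset where this block starts
        chunks.append(lzma.compress(data[start:start + block_size]))
    return chunks, boundaries


def read_at(chunks, boundaries, offset, size):
    """Return up to `size` bytes starting at uncompressed position `offset`."""
    out = bytearray()
    # Locate the block that contains `offset`.
    index = bisect.bisect_right(boundaries, offset) - 1
    while size > 0 and 0 <= index < len(chunks):
        # Only this block is decompressed; earlier blocks are never touched.
        block = lzma.decompress(chunks[index])
        skip = offset - boundaries[index]   # decompressed bytes to drop
        piece = block[skip:skip + size]
        if not piece:
            break                           # offset is past the end of the data
        out += piece
        offset += len(piece)
        size -= len(piece)
        index += 1                          # continue into the next block if needed
    return bytes(out)
```

Smaller blocks mean less data to decompress and drop per seek, but each block is compressed on its own, which usually costs some compression ratio.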
How can I create XZ files optimized for random-access?
You can open the file for writing and use the change_block method to create several blocks.
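As a rough sketch of that approach, the helper below starts a new block every block_size bytes. It only relies on the write and change_block methods shown in the write-mode example; the helper name write_in_blocks and the 16 MiB figure in the usage comment are illustrative choices, not part of python-xz.

```python
def write_in_blocks(fout, data, block_size):
    """Write `data` to `fout`, starting a new block every `block_size` bytes.

    `fout` is expected to be a file opened for writing with xz.open
    (python-xz); any object with write() and change_block() works.
    """
    blocks = 0
    for start in range(0, len(data), block_size):
        if start:
            # Close the current block before writing the next chunk.
            fout.change_block()
        fout.write(data[start:start + block_size])
        blocks += 1
    return blocks


# Possible usage with python-xz (requires the package to be installed):
# import xz
# with xz.open("out.xz", "w") as fout:
#     write_in_blocks(fout, payload, 16 * 1024 * 1024)
```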
Other tools can create XZ files with several blocks as well:
- XZ Utils needs to be called with flags:
  $ xz -T0 file                          # threading mode
  $ xz --block-size 16M file             # same size for all blocks
  $ xz --block-list 16M,32M,8M,42M file  # specific size for each block
- PIXZ creates files with several blocks by default:
Python version support
As a general rule, all Python versions that are both released and still officially supported are supported by python-xz
and tested against (both CPython and PyPy implementations).
Moreover, Python 3.6 is currently supported as well, but may be dropped in future releases.
If you have other use cases or find issues with some Python versions, feel free to open a ticket!