Examples
Read lines from a gzip-compressed file
The following snippet is an example of using CodecZlib.jl, which exports GzipDecompressorStream{S} as an alias of TranscodingStream{GzipDecompressor,S} where S<:IO:
using CodecZlib
stream = GzipDecompressorStream(open("data.txt.gz"))
for line in eachline(stream)
    # do something...
end
close(stream)

Note that the final close call closes the file as well. Alternatively, the open(<stream type>, <filepath>) do ... end syntax closes the file at the end of the block:
using CodecZlib
open(GzipDecompressorStream, "data.txt.gz") do stream
    for line in eachline(stream)
        # do something...
    end
end

Read compressed data from a pipe
The input is not limited to regular files. You can read data from a pipe (in fact, from any IO object that implements the basic I/O methods) as follows:
using CodecZlib
proc = open(`cat some.data.gz`)  # on Julia 0.7 and later, open(cmd) returns a readable process object
stream = GzipDecompressorStream(proc)
for line in eachline(stream)
    # do something...
end
close(stream)  # This will finish the process as well.

Save a data matrix with Zstd compression
Writing compressed data is easy. The one thing to keep in mind is to call close after writing the data; otherwise, the output file will be incomplete:
using CodecZstd
using DelimitedFiles  # provides writedlm on Julia 0.7 and later
mat = randn(100, 100)
stream = ZstdCompressorStream(open("data.mat.zst", "w"))
writedlm(stream, mat)
close(stream)

Of course, the open(<stream type>, ...) do ... end syntax works as well:
using CodecZstd
using DelimitedFiles  # provides writedlm on Julia 0.7 and later
mat = randn(100, 100)
open(ZstdCompressorStream, "data.mat.zst", "w") do stream
    writedlm(stream, mat)
end

Explicitly finish transcoding by writing TOKEN_END
When writing data, the end of a data stream is indicated by calling close, which may write an epilogue if necessary and flush all buffered data to the underlying I/O stream. If you want to explicitly specify the end position of a stream for some reason, you can write TranscodingStreams.TOKEN_END to the transcoding stream as follows:
using CodecZstd
using TranscodingStreams
buf = IOBuffer()
stream = ZstdCompressorStream(buf)
write(stream, "foobarbaz"^100, TranscodingStreams.TOKEN_END)
flush(stream)
compressed = take!(buf)
close(stream)

Use a noop codec
Sometimes, the Noop codec, which does nothing, may be useful. The following example creates a decompressor stream based on the extension of a filepath:
using CodecZlib
using CodecBzip2
using TranscodingStreams
function makestream(filepath)
    if endswith(filepath, ".gz")
        codec = GzipDecompressor()
    elseif endswith(filepath, ".bz2")
        codec = Bzip2Decompressor()
    else
        codec = Noop()
    end
    return TranscodingStream(codec, open(filepath))
end
makestream("data.txt.gz")
makestream("data.txt.bz2")
makestream("data.txt")

Change the codec of a file
Transcoding streams are composable: a stream can serve as the input or output of another stream. You can use this to change the codec of a file by composing different codecs as below:
using CodecZlib
using CodecZstd
input = open("data.txt.gz", "r")
output = open("data.txt.zst", "w")
stream = GzipDecompressorStream(ZstdCompressorStream(output))
write(stream, input)
close(stream)

Effectively, this is equivalent to the following pipeline:
cat data.txt.gz | gzip -d | zstd >data.txt.zst

Stop decoding on the end of a block
Most codecs support decoding concatenated data blocks. For example, if you concatenate two gzip files into one file and read it using GzipDecompressorStream, you will get the concatenated byte stream of the two files. If you need only the first part of the file, you can set stop_on_end to true to stop transcoding at the end of the first block:
using CodecZlib
# cat foo.txt.gz bar.txt.gz > foobar.txt.gz
stream = GzipDecompressorStream(open("foobar.txt.gz"), stop_on_end=true)
read(stream) #> the content of foo.txt
eof(stream) #> true

In the case where you need to reuse the wrapped stream, the code above must be slightly modified because the transcoding stream may read more bytes than necessary from the wrapped stream. By wrapping a stream with NoopStream, the problem of overreading is resolved:
using CodecZlib
using TranscodingStreams
stream = NoopStream(open("foobar.txt.gz"))
read(GzipDecompressorStream(stream, stop_on_end=true)) #> the content of foo.txt
read(GzipDecompressorStream(stream, stop_on_end=true)) #> the content of bar.txt

Check I/O statistics
TranscodingStreams.stats returns a snapshot of the I/O statistics. For example, the following function shows progress of decompression to the standard error:
using CodecZlib
using TranscodingStreams
function decompress(input, output)
    buffer = Vector{UInt8}(undef, 16 * 1024)
    while !eof(input)
        # copy whatever is currently buffered, at most one buffer's worth
        n = min(bytesavailable(input), length(buffer))
        unsafe_read(input, pointer(buffer), n)
        unsafe_write(output, pointer(buffer), n)
        stats = TranscodingStreams.stats(input)
        print(stderr, "\rin: $(stats.in), out: $(stats.out)")
    end
    println(stderr)
end
input = GzipDecompressorStream(open("foobar.txt.gz"))
output = IOBuffer()
decompress(input, output)

stats.in is the number of bytes supplied to the stream and stats.out is the number of bytes consumed out of the stream.
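As a smaller illustration of the same statistics, the sketch below decompresses gzip data held in memory and returns the two counters alongside the result. The helper name gunzip_with_stats is hypothetical (it is not part of any package), and the sketch assumes that TranscodingStreams.stats can still be called after the stream has been read to its end but before it is closed:

```julia
using CodecZlib
using TranscodingStreams

# Hypothetical helper (not part of TranscodingStreams.jl): decompress gzip
# bytes held in memory and return the data with the in/out byte counters.
function gunzip_with_stats(compressed::Vector{UInt8})
    stream = GzipDecompressorStream(IOBuffer(compressed))
    try
        data = read(stream)                     # read until the end of the stream
        st = TranscodingStreams.stats(stream)   # snapshot taken before close
        return data, st.in, st.out
    finally
        close(stream)
    end
end

original = Vector{UInt8}("foobarbaz"^100)
compressed = transcode(GzipCompressor, original)
data, nin, nout = gunzip_with_stats(compressed)
```

Here nin counts the compressed bytes supplied to the stream and nout counts the decompressed bytes read out of it, so comparing the two gives a rough compression ratio.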
Transcode data in one shot
TranscodingStreams.jl extends the transcode function to transcode data in one shot. transcode takes a codec type or a codec object as its first argument and a data vector as its second argument:
using CodecZlib
decompressed = transcode(ZlibDecompressor, b"x\x9cKL*JLNLI\x04R\x00\x19\xf2\x04U")
String(decompressed)

Transcode lots of strings
The transcode(<codec type>, data) method is convenient but suboptimal when transcoding many objects, because it allocates a new codec object on every call. Instead, you can use the transcode(<codec object>, data) method, which reuses the allocated object, as follows:
using CodecZstd
strings = ["foo", "bar", "baz"]
codec = ZstdCompressor()
try
    for s in strings
        data = transcode(codec, s)
        # do something...
    end
catch
    rethrow()
finally
    CodecZstd.TranscodingStreams.finalize(codec)
end

Unread data
TranscodingStream supports an unread operation, which inserts data at the current reading position. This is useful when you want to peek at the stream. The TranscodingStreams.unread and TranscodingStreams.unsafe_unread functions are provided for this purpose:
using TranscodingStreams
stream = NoopStream(open("data.txt"))
data1 = read(stream, 8)
TranscodingStreams.unread(stream, data1)
data2 = read(stream, 8)
@assert data1 == data2

The unread operation differs from the write operation in that the unread data are not written to the wrapped stream. Instead, the unread data are stored in the internal buffer of the transcoding stream.
Unfortunately, an unwrite operation is not provided because there is no way to cancel write operations that are already committed to the wrapped stream.
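One possible use of unread is format detection: peek at the leading bytes, push them back, and then choose a codec. The following is a minimal sketch under stated assumptions rather than a recommended implementation. The helper name open_maybe_gzip is hypothetical, and the check relies only on the two-byte gzip magic number (0x1f 0x8b):

```julia
using CodecZlib
using TranscodingStreams

# Hypothetical helper (not part of TranscodingStreams.jl): peek at the first
# two bytes to detect the gzip magic number, then push them back so that
# later reads still see the whole input.
function open_maybe_gzip(io::IO)
    stream = NoopStream(io)
    magic = read(stream, 2)
    TranscodingStreams.unread(stream, magic)
    return magic == [0x1f, 0x8b] ? GzipDecompressorStream(stream) : stream
end

plain = Vector{UInt8}("hello, world")
gz = transcode(GzipCompressor, plain)
from_gz = read(open_maybe_gzip(IOBuffer(gz)))        # decompressed on the fly
from_plain = read(open_maybe_gzip(IOBuffer(plain)))  # passed through unchanged
```

Because the peeked bytes go back into the internal buffer of the NoopStream, the decompressor wrapped around it still sees the complete gzip header.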