Examples

Read lines from a gzip-compressed file

The following snippet is an example of using CodecZlib.jl, which exports GzipDecompressorStream{S} as an alias of TranscodingStream{GzipDecompressor,S} where S<:IO:

using CodecZlib
stream = GzipDecompressorStream(open("data.txt.gz"))
for line in eachline(stream)
    # do something...
end
close(stream)

Note that the last close call will close the file as well. Alternatively, open(<stream type>, <filepath>) do ... end syntax will close the file at the end:

using CodecZlib
open(GzipDecompressorStream, "data.txt.gz") do stream
    for line in eachline(stream)
        # do something...
    end
end

Read compressed data from a pipe

The input is not limited to usual files. You can read data from a pipe (actually, any IO object that implements basic I/O methods) as follows:

using CodecZlib
pipe, proc = open(`cat some.data.gz`)
stream = GzipDecompressorStream(pipe)
for line in eachline(stream)
    # do something...
end
close(stream)  # This will finish the process as well.

Save a data matrix with Zstd compression

Writing compressed data is easy. One thing you need to keep in mind is to call close after writing data; otherwise, the output file will be incomplete:

using CodecZstd
mat = randn(100, 100)
stream = ZstdCompressorStream(open("data.mat.zst", "w"))
writedlm(stream, mat)
close(stream)

Of course, open(<stream type>, ...) do ... end works well:

using CodecZstd
mat = randn(100, 100)
open(ZstdCompressorStream, "data.mat.zst", "w") do stream
    writedlm(stream, mat)
end

Explicitly finish transcoding by writing `TOKEN_END`

When writing data, the end of a data stream is indicated by calling close, which may write an epilogue if necessary and flush all buffered data to the underlying I/O stream. If you want to explicitly specify the end position of a stream for some reason, you can write TranscodingStreams.TOKEN_END to the transcoding stream as follows:

using CodecZstd
using TranscodingStreams
buf = IOBuffer()
stream = ZstdCompressorStream(buf)
write(stream, "foobarbaz"^100, TranscodingStreams.TOKEN_END)
flush(stream)
compressed = take!(buf)
close(stream)

Use a noop codec

Sometimes, the Noop codec, which does nothing, may be useful. The following example creates a decompressor stream based on the extension of a filepath:

using CodecZlib
using CodecBzip2
using TranscodingStreams

function makestream(filepath)
    if endswith(filepath, ".gz")
        codec = GzipDecompressor()
    elseif endswith(filepath, ".bz2")
        codec = Bzip2Decompressor()
    else
        codec = Noop()
    end
    return TranscodingStream(codec, open(filepath))
end

makestream("data.txt.gz")
makestream("data.txt.bz2")
makestream("data.txt")

Change the codec of a file

TranscodingStreams are composable: a stream can be an input/output of another stream. You can use this to chage the codec of a file by composing different codecs as below:

using CodecZlib
using CodecZstd

input  = open("data.txt.gz",  "r")
output = open("data.txt.zst", "w")

stream = GzipDecompressorStream(ZstdCompressorStream(output))
write(stream, input)
close(stream)

Effectively, this is equivalent to the following pipeline:

cat data.txt.gz | gzip -d | zstd >data.txt.zst

Stop decoding on the end of a block

Most codecs support decoding concatenated data blocks. For example, if you concatenate two gzip files into a file and read it using GzipDecompressorStream, you will see the byte stream of concatenation of two files. If you need the first part of the file, you can set stop_on_end to true to stop transcoding at the end of the first block:

using CodecZlib
# cat foo.txt.gz bar.txt.gz > foobar.txt.gz
stream = GzipDecompressorStream(open("foobar.txt.gz"), stop_on_end=true)
read(stream)  #> the content of foo.txt
eof(stream)   #> true

In the case where you need to reuse the wrapped stream, the code above must be slightly modified because the transcoding stream may read more bytes than necessary from the wrapped stream. By wrapping a stream with NoopStream, the problem of overreading is resolved:

using CodecZlib
using TranscodingStreams
stream = NoopStream(open("foobar.txt.gz"))
read(GzipDecompressorStream(stream, stop_on_end=true))  #> the content of foo.txt
read(GzipDecompressorStream(stream, stop_on_end=true))  #> the content of bar.txt

Check I/O statistics

TranscodingStreams.stats returns a snapshot of the I/O statistics. For example, the following function shows progress of decompression to the standard error:

using CodecZlib

function decompress(input, output)
    buffer = Vector{UInt8}(16 * 1024)
    while !eof(input)
        n = min(nb_available(input), length(buffer))
        unsafe_read(input, pointer(buffer), n)
        unsafe_write(output, pointer(buffer), n)
        stats = TranscodingStreams.stats(input)
        print(STDERR, "\rin: $(stats.in), out: $(stats.out)")
    end
    println(STDERR)
end

input = GzipDecompressorStream(open("foobar.txt.gz"))
output = IOBuffer()
decompress(input, output)

stats.in is the number of bytes supplied to the stream and stats.out is the number of bytes consumed out of the stream.

Transcode data in one shot

TranscodingStreams.jl extends the transcode function to transcode a data in one shot. transcode takes a codec object as its first argument and a data vector as its second argument:

using CodecZlib
decompressed = transcode(ZlibDecompressor, b"x\x9cKL*JLNLI\x04R\x00\x19\xf2\x04U")
String(decompressed)

Transcode lots of strings

transcode(<codec type>, data) method is convenient but suboptimal when transcoding a number of objects. This is because the method reallocates a new codec object for every call. Instead, you can use transcode(<codec object>, data) method that reuses the allocated object as follows:

using CodecZstd
strings = ["foo", "bar", "baz"]
codec = ZstdCompressor()
try
    for s in strings
        data = transcode(codec, s)
        # do something...
    end
catch
    rethrow()
finally
    CodecZstd.TranscodingStreams.finalize(codec)
end

Unread data

TranscodingStream supports unread operation, which inserts data into the current reading position. This is useful when you want to peek from the stream. TranscodingStreams.unread and TranscodingStreams.unsafe_unread functions are provided:

using TranscodingStreams
stream = NoopStream(open("data.txt"))
data1 = read(stream, 8)
TranscodingStreams.unread(stream, data1)
data2 = read(stream, 8)
@assert data1 == data2

The unread operaion is different from the write operation in that the unreaded data are not written to the wrapped stream. The unreaded data are stored in the internal buffer of a transcoding stream.

Unfortunately, unwrite operation is not provided because there is no way to cancel write operations that are already commited to the wrapped stream.