TranscodingStreams.jl

TranscodingStreams.jl is a package for transcoding (e.g. compressing) data streams. It exports a type, TranscodingStream, which is a subtype of IO and supports various I/O operations, like the other I/O streams in the standard library.

Introduction

TranscodingStream has two type parameters, C<:Codec and S<:IO, so the full type is written TranscodingStream{C<:Codec,S<:IO}. It wraps an underlying I/O stream of type S with a codec of type C. The codec defines the transformation (or transcoding) applied to the stream. For example, when C is a lossless decompression type and S is a file, TranscodingStream{C,S} behaves like a data stream that incrementally decompresses data from the file.
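
As a small illustration of the two type parameters, the sketch below wraps an in-memory buffer in a gzip decompression codec and inspects the resulting type. It assumes CodecZlib.jl is installed and uses the codec names listed on this page (other releases of the codec packages may spell them differently). The IOBuffer merely stands in for a file.

```julia
using TranscodingStreams
using CodecZlib

# An in-memory buffer plays the role of the underlying stream S.
buf = IOBuffer()

# GzipDecompression is the codec C; wrapping the buffer reads no data yet.
stream = TranscodingStream(GzipDecompression(), buf)

# The codec type and the wrapped stream type both appear as type parameters.
println(typeof(stream) == TranscodingStream{GzipDecompression,typeof(buf)})

# Like every transcoding stream, it is also an IO.
println(stream isa IO)

close(stream)  # closes the wrapped buffer as well
```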

Codecs are defined in other packages listed below:

| Package | Library | Format | Codec | Stream | Description |
|---|---|---|---|---|---|
| CodecZlib.jl | zlib | RFC1952 | GzipCompression | GzipCompressionStream | Compress data in gzip (.gz) format. |
| CodecZlib.jl | zlib | RFC1952 | GzipDecompression | GzipDecompressionStream | Decompress data in gzip (.gz) format. |
| CodecZlib.jl | zlib | RFC1950 | ZlibCompression | ZlibCompressionStream | Compress data in zlib format. |
| CodecZlib.jl | zlib | RFC1950 | ZlibDecompression | ZlibDecompressionStream | Decompress data in zlib format. |
| CodecZlib.jl | zlib | RFC1951 | DeflateCompression | DeflateCompressionStream | Compress data in deflate format. |
| CodecZlib.jl | zlib | RFC1951 | DeflateDecompression | DeflateDecompressionStream | Decompress data in deflate format. |
| CodecBzip2.jl | bzip2 | | Bzip2Compression | Bzip2CompressionStream | Compress data in bzip2 (.bz2) format. |
| CodecBzip2.jl | bzip2 | | Bzip2Decompression | Bzip2DecompressionStream | Decompress data in bzip2 (.bz2) format. |
| CodecXz.jl | xz | The .xz File Format | XzCompression | XzCompressionStream | Compress data in xz (.xz) format. |
| CodecXz.jl | xz | The .xz File Format | XzDecompression | XzDecompressionStream | Decompress data in xz (.xz) format. |
| CodecZstd.jl | zstd | Zstandard Compression Format | ZstdCompression | ZstdCompressionStream | Compress data in zstd (.zst) format. |
| CodecZstd.jl | zstd | Zstandard Compression Format | ZstdDecompression | ZstdDecompressionStream | Decompress data in zstd (.zst) format. |

Install the packages you need by calling Pkg.add(<package name>) in a Julia session. For example, to read gzip-compressed files, call Pkg.add("CodecZlib") and use GzipDecompression or GzipDecompressionStream. By convention, codec type names match .*(Co|Deco)mpression, and stream type names are the codec name with a Stream suffix. All codecs are subtypes of TranscodingStreams.Codec, and all streams are subtypes of Base.IO.

Importantly, these packages depend on TranscodingStreams.jl and not vice versa, so you can install just the codec packages you need rather than all of them. You can also define your own codec in the same way these packages do: TranscodingStreams.jl requires a codec to implement a few interface functions, which are described later.

Examples

Read lines from a gzip-compressed file

The following snippet is an example of using CodecZlib.jl, which exports GzipDecompressionStream{S} as an alias of TranscodingStream{GzipDecompression,S} where S<:IO:

using CodecZlib
stream = GzipDecompressionStream(open("data.txt.gz"))
for line in eachline(stream)
    # do something...
end
close(stream)

Note that the final close call closes the underlying file as well. Alternatively, the open(<stream type>, <filepath>) do ... end syntax closes the file at the end of the block:

using CodecZlib
open(GzipDecompressionStream, "data.txt.gz") do stream
    for line in eachline(stream)
        # do something...
    end
end
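
To try the pattern above without an existing data.txt.gz, the following sketch first writes a small gzip file to a scratch directory and then reads it back line by line. It assumes CodecZlib.jl is installed and uses the stream names listed on this page. The scratch path and the sample text are made up for illustration.

```julia
using CodecZlib

# Hypothetical scratch location for the example file.
path = joinpath(mktempdir(), "data.txt.gz")

# Write two lines through the gzip compressor; the do-block closes the file.
open(GzipCompressionStream, path, "w") do stream
    write(stream, "first line\nsecond line\n")
end

# Read the lines back through the decompressor.
lines = String[]
open(GzipDecompressionStream, path) do stream
    for line in eachline(stream)
        push!(lines, line)
    end
end

println(lines)
```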

Save a data matrix with Zstd compression

Writing compressed data is easy. One thing you need to keep in mind is to call close after writing data; otherwise, the output file will be incomplete:

using CodecZstd
mat = randn(100, 100)
stream = ZstdCompressionStream(open("data.mat.zst", "w"))
writedlm(stream, mat)
close(stream)

Of course, open(<stream type>, ...) do ... end works well:

using CodecZstd
mat = randn(100, 100)
open(ZstdCompressionStream, "data.mat.zst", "w") do stream
    writedlm(stream, mat)
end

Explicitly finish transcoding by writing TOKEN_END

When writing data, the end of a data stream is indicated by calling close, which may write an epilogue if necessary and flush all buffered data to the underlying I/O stream. If you want to explicitly specify the end position of a stream for some reason, you can write TranscodingStreams.TOKEN_END to the transcoding stream as follows:

using CodecZstd
using TranscodingStreams
buf = IOBuffer()
stream = ZstdCompressionStream(buf)
write(stream, "foobarbaz"^100, TranscodingStreams.TOKEN_END)
flush(stream)
compressed = take!(buf)
close(stream)
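
One way to check that TOKEN_END really terminated the compressed data is to decode the bytes taken from the buffer before close is called. The sketch below repeats the snippet above and decodes the result with the one-shot transcode function described elsewhere on this page. It assumes CodecZstd.jl is installed and provides ZstdDecompression under that name.

```julia
using CodecZstd
using TranscodingStreams

data = "foobarbaz"^100
buf = IOBuffer()
stream = ZstdCompressionStream(buf)

# TOKEN_END marks the end of the compressed data; flush pushes it into buf.
write(stream, data, TranscodingStreams.TOKEN_END)
flush(stream)
compressed = take!(buf)

# If the data was terminated properly, the bytes decode on their own,
# even though close(stream) has not been called yet.
restored = String(transcode(ZstdDecompression(), compressed))
println(restored == data)

close(stream)
```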

Use an identity (no-op) codec

Sometimes, the Identity codec, which does nothing, may be useful. The following example creates a decompression stream based on the extension of a filepath:

using CodecZlib
using CodecBzip2
using TranscodingStreams
using TranscodingStreams.CodecIdentity

function makestream(filepath)
    if endswith(filepath, ".gz")
        codec = GzipDecompression()
    elseif endswith(filepath, ".bz2")
        codec = Bzip2Decompression()
    else
        codec = Identity()
    end
    return TranscodingStream(codec, open(filepath))
end

makestream("data.txt.gz")
makestream("data.txt.bz2")
makestream("data.txt")
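
A quick way to see that the identity codec leaves data untouched is to read through an IdentityStream wrapped around an in-memory buffer. This is a minimal sketch that relies on the CodecIdentity submodule and the IdentityStream constructor as they are documented on this page. The sample string is arbitrary.

```julia
using TranscodingStreams
using TranscodingStreams.CodecIdentity

# Wrap an in-memory buffer with the no-op codec.
stream = IdentityStream(IOBuffer("abracadabra"))

# Reading through the identity codec should yield the original bytes.
bytes = read(stream)
close(stream)

text = String(bytes)
println(text == "abracadabra")
```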

Transcode data in one shot

TranscodingStreams.jl extends the transcode function to transcode data in one shot. transcode takes a codec object as its first argument and a data vector as its second argument:

using CodecZlib
decompressed = transcode(ZlibDecompression(), b"x\x9cKL*JLNLI\x04R\x00\x19\xf2\x04U")
String(decompressed)

API

TranscodingStream(codec::Codec, stream::IO; bufsize::Integer=16384)

Create a transcoding stream with codec and stream.

Examples

julia> using TranscodingStreams

julia> using CodecZlib

julia> file = open(Pkg.dir("TranscodingStreams", "test", "abra.gzip"));

julia> stream = TranscodingStream(GzipDecompression(), file)
TranscodingStreams.TranscodingStream{CodecZlib.GzipDecompression,IOStream}(<state=idle>)

julia> readstring(stream)
"abracadabra"
transcode(codec::Codec, data::Vector{UInt8})::Vector{UInt8}

Transcode data by applying codec.

Examples

julia> using CodecZlib

julia> data = Vector{UInt8}("abracadabra");

julia> compressed = transcode(ZlibCompression(), data);

julia> decompressed = transcode(ZlibDecompression(), compressed);

julia> String(decompressed)
"abracadabra"

TranscodingStreams.TOKEN_END

A special token indicating the end of data.

TOKEN_END may be written to a transcoding stream like write(stream, TOKEN_END), which will terminate the current transcoding block.

Note

Call flush(stream) after write(stream, TOKEN_END) to make sure that all data are written to the underlying stream.

Identity()

Create an identity (no-op) codec.

IdentityStream(stream::IO)

Create an identity (no-op) stream.

Defining a new codec

TranscodingStreams.Codec

An abstract codec type.

Any codec supporting transcoding interfaces must be a subtype of this type.

initialize(codec::Codec)::Void

Initialize codec.

finalize(codec::Codec)::Void

Finalize codec.

startproc(codec::Codec, state::Symbol)::Symbol

Start data processing with codec in the given state.

process(codec::Codec, input::Memory, output::Memory)::Tuple{Int,Int,Symbol}

Do data processing with codec.
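
To make the interface above more concrete, here is a rough sketch of a toy codec that upper-cases ASCII letters. It is written against the three-argument process signature shown above and rests on several assumptions this page does not state: that Memory exposes ptr and size fields, that the returned symbol is :ok while more data may follow and :end once the input is exhausted, and that initialize, finalize and startproc have usable default methods. The Uppercase type is hypothetical. Treat this as an illustration of the shape of a codec, not a reference implementation.

```julia
using TranscodingStreams
import TranscodingStreams: Codec, Memory

# Hypothetical example codec: upper-cases ASCII letters, passes other bytes through.
struct Uppercase <: Codec end

# ASSUMPTION: Memory has `ptr` (a Ptr{UInt8}) and `size` fields.
# ASSUMPTION: the status is :ok while data may follow, :end when input is empty.
function TranscodingStreams.process(::Uppercase, input::Memory, output::Memory)
    # Process as many bytes as fit in both buffers.
    n = Int(min(input.size, output.size))
    for i in 1:n
        byte = unsafe_load(input.ptr, i)
        if UInt8('a') <= byte <= UInt8('z')
            byte -= 0x20  # ASCII lower case to upper case
        end
        unsafe_store!(output.ptr, byte, i)
    end
    # Report bytes consumed, bytes produced, and the status code.
    return n, n, (input.size == 0 ? :end : :ok)
end

# ASSUMPTION: default initialize, finalize and startproc methods exist,
# so only process is defined here.
shouted = String(transcode(Uppercase(), Vector{UInt8}("abracadabra")))
println(shouted)
```

Because this codec keeps no state between calls, it only needs process. A codec wrapping a native library would typically also allocate its state in initialize and release it in finalize.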