parquet_decode

EXPERIMENTAL

This component is experimental and therefore subject to change or removal outside of major version releases.

Decodes Parquet files into a batch of structured messages.

Introduced in version 4.4.0.

# Config fields, showing default values
label: ""
parquet_decode:
  byte_array_as_string: false

This processor uses https://github.com/segmentio/parquet-go, which is itself experimental. Changes could therefore be made to how this processor functions outside of major version releases.

By default, any BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY values are copied as []byte values, since they could contain arbitrary data. When messages are serialised as JSON documents, these values are therefore base64 encoded into strings, which is the default for arbitrary data fields. Binary values can be converted to strings (or other data types) using Bloblang transformations such as root.foo = this.foo.string() or root.foo = this.foo.encode("hex").

However, if all BYTE_ARRAY values in your data are strings, it may be easier to set the config field byte_array_as_string to true so that all of these values are automatically extracted as strings.
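When only some binary columns hold text, the Bloblang approach mentioned above could be sketched roughly as follows. This is a minimal illustration, not a definitive config: it assumes a Benthos version that includes the mapping processor, and the field names foo and bar are hypothetical placeholders for columns in your own files.

```yaml
pipeline:
  processors:
    # Leave byte_array_as_string at its default (false) so binary
    # columns arrive as byte slices.
    - parquet_decode: {}
    - mapping: |
        # Start from the decoded message so untouched fields are kept.
        root = this
        # Hypothetical text column: interpret the bytes as a string.
        root.foo = this.foo.string()
        # Hypothetical binary column: render the bytes as hex.
        root.bar = this.bar.encode("hex")
```

Any byte slice fields not mapped this way would still be base64 encoded when the message is serialised as JSON.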

Fields

byte_array_as_string

Whether to extract BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY values as strings rather than byte slices. Enabling this field makes serialising the data as JSON more intuitive; otherwise, the byte slice fields should be mapped via Bloblang in order to extract meaningful values.

Type: bool
Default: false

Examples

In this example we consume files from AWS S3 as they're written, by listening to an SQS queue for upload events. We use the all-bytes codec so that each file is read into memory in full, which allows a parquet_decode processor to expand each file into a batch of messages. Finally, we write the data out to local files as newline-delimited JSON.

input:
  aws_s3:
    bucket: TODO
    prefix: foos/
    codec: all-bytes
    sqs:
      url: TODO
  processors:
    - parquet_decode:
        byte_array_as_string: true

output:
  file:
    codec: lines
    path: './foos/${! meta("s3_key") }.jsonl'
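To try the processor without S3, a local variant of the example above might look like the following sketch. It assumes a Benthos version in which the file input accepts the codec field in the same way the aws_s3 input does above, and the ./data/*.parquet glob is a hypothetical path.

```yaml
input:
  file:
    # Hypothetical location of local Parquet files.
    paths:
      - './data/*.parquet'
    # Read each file fully into memory as a single message.
    codec: all-bytes
  processors:
    # Expand each file into a batch of structured messages.
    - parquet_decode:
        byte_array_as_string: true

output:
  # Print each decoded row as one line of JSON.
  stdout:
    codec: lines
```

Printing to stdout makes it easy to inspect how BYTE_ARRAY values come out before wiring up a real output.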