Kerchunk is a library that provides a unified way to represent a variety of
chunked, compressed data formats (e.g. NetCDF, HDF5, GRIB), allowing efficient
access to the data from traditional file systems or cloud object storage. It
also provides a flexible way to create virtual datasets from multiple files.
It does this by extracting the byte ranges, compression details and other
metadata from the source files and storing them in a new, separate object.
This means that you can create a virtual aggregate dataset over potentially
many source files, for efficient, parallel and cloud-friendly in-situ access
without having to copy or translate the originals. It is a gateway to massive
in-the-cloud data processing even while data providers continue to use legacy
formats for archival storage.
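
A minimal sketch of this single-file workflow, assuming a NetCDF4/HDF5 file
on S3 (the bucket path and file name are hypothetical; SingleHdf5ToZarr and
the fsspec "reference" filesystem are the actual entry points):

    import fsspec
    import xarray as xr
    from kerchunk.hdf import SingleHdf5ToZarr

    # Hypothetical source file; any fsspec-readable URL works.
    url = "s3://example-bucket/data/file.nc"

    # Scan the file once, recording the byte range and compression
    # details of every chunk as a set of references.
    with fsspec.open(url, "rb", anon=True) as f:
        refs = SingleHdf5ToZarr(f, url).translate()

    # Open the references as a zarr store: no data is copied, and
    # chunks are fetched from the original file on demand.
    fs = fsspec.filesystem("reference", fo=refs,
                           remote_protocol="s3",
                           remote_options={"anon": True})
    ds = xr.open_dataset(fs.get_mapper(""), engine="zarr",
                         backend_kwargs={"consolidated": False})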

Kerchunk provides the following:
- completely serverless architecture
- metadata consolidation, so you can understand a many-file dataset (metadata
  plus physical storage) in a single read
- read from all of the storage backends supported by fsspec, including object
  storage (s3, gcs, abfs, alibaba), http, cloud user storage (dropbox, gdrive)
  and network protocols (ftp, ssh, hdfs, smb...)
- loading of various file types (currently netcdf4/HDF, grib2, tiff, fits,
  zarr), potentially heterogeneous within a single dataset, without needing
  to go via the format-specific driver (e.g., no need for h5py)
- asynchronous concurrent fetch of many data chunks in one go, amortizing the
  cost of latency
- parallel access with a library like zarr without any locks
- logical datasets viewing many (potentially millions of) data files, with
  direct access/subselection via coordinate indexing across an arbitrary
  number of dimensions (see the combining sketch after this list)
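
As an illustration of the aggregation in the last point, a sketch using
kerchunk.combine.MultiZarrToZarr; the file list and the "time" concat
dimension are assumptions:

    import json
    from kerchunk.combine import MultiZarrToZarr

    # Paths to per-file reference sets written earlier (hypothetical),
    # e.g. each produced by SingleHdf5ToZarr and dumped to JSON.
    refs_list = ["refs_2020.json", "refs_2021.json", "refs_2022.json"]

    # Merge the per-file references into one logical dataset,
    # concatenated along the "time" coordinate.
    combined = MultiZarrToZarr(
        refs_list,
        concat_dims=["time"],
        remote_protocol="s3",            # where the original bytes live
        remote_options={"anon": True},
    ).translate()

    # The result is a plain dict of references; persist it as JSON and
    # open it later exactly like the single-file reference set above.
    with open("combined.json", "w") as f:
        json.dump(combined, f)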