diff options
Diffstat (limited to 'filesystems/py-kerchunk/pkg-descr')
-rw-r--r-- | filesystems/py-kerchunk/pkg-descr | 28 |
1 files changed, 28 insertions, 0 deletions
diff --git a/filesystems/py-kerchunk/pkg-descr b/filesystems/py-kerchunk/pkg-descr new file mode 100644 index 000000000000..a351e30943fe --- /dev/null +++ b/filesystems/py-kerchunk/pkg-descr @@ -0,0 +1,28 @@ +Kerchunk is a library that provides a unified way to represent a variety of +chunked, compressed data formats (e.g. NetCDF, HDF5, GRIB), allowing efficient +access to the data from traditional file systems or cloud object storage. It +also provides a flexible way to create virtual datasets from multiple files. It +does this by extracting the byte ranges, compression information and other +information about the data and storing this metadata in a new, separate object. +This means that you can create a virtual aggregate dataset over potentially many +source files, for efficient, parallel and cloud-friendly in-situ access without +having to copy or translate the originals. It is a gateway to in-the-cloud +massive data processing while the data providers still insist on using legacy +formats for archival storage. + +We provide the following things: +- completely serverless architecture +- metadata consolidation, so you can understand a many-file dataset (metadata + plus physical storage) in a single read +- read from all of the storage backends supported by fsspec, including object + storage (s3, gcs, abfs, alibaba), http, cloud user storage (dropbox, gdrive) + and network protocols (ftp, ssh, hdfs, smb...) +- loading of various file types (currently netcdf4/HDF, grib2, tiff, fits, + zarr), potentially heterogeneous within a single dataset, without a need to go + via the specific driver (e.g., no need for h5py) +- asynchronous concurrent fetch of many data chunks in one go, amortizing the + cost of latency +- parallel access with a library like zarr without any locks +- logical datasets viewing many (>~millions) data files, and direct + access/subselection to them via coordinate indexing across an arbitrary number + of dimensions |