 filesystems/py-kerchunk/pkg-descr | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+), 0 deletions(-)
diff --git a/filesystems/py-kerchunk/pkg-descr b/filesystems/py-kerchunk/pkg-descr
new file mode 100644
index 000000000000..a351e30943fe
--- /dev/null
+++ b/filesystems/py-kerchunk/pkg-descr
@@ -0,0 +1,28 @@
+Kerchunk is a library that provides a unified way to represent a variety of
+chunked, compressed data formats (e.g. NetCDF, HDF5, GRIB), allowing efficient
+access to the data from traditional file systems or cloud object storage. It
+also provides a flexible way to create virtual datasets from multiple files.
+It does this by extracting the byte ranges, compression details and other
+metadata from the source files and storing them in a new, separate object.
+This means that you can create a virtual aggregate dataset over potentially
+many source files, for efficient, parallel and cloud-friendly in-situ access,
+without having to copy or translate the originals. It is a gateway to massive
+in-the-cloud data processing even while data providers continue to rely on
+legacy formats for archival storage.
+
+Kerchunk provides the following:
+- completely serverless architecture
+- metadata consolidation, so you can understand a many-file dataset (metadata
+ plus physical storage) in a single read
+- read from all of the storage backends supported by fsspec, including object
+ storage (s3, gcs, abfs, alibaba), http, cloud user storage (dropbox, gdrive)
+ and network protocols (ftp, ssh, hdfs, smb...)
+- loading of various file types (currently netcdf4/HDF5, grib2, tiff, fits,
+  zarr), potentially heterogeneous within a single dataset, without needing
+  to go through a format-specific driver (e.g., no need for h5py)
+- asynchronous concurrent fetch of many data chunks in one go, amortizing
+  the cost of network latency
+- parallel access with a library like zarr without any locks
+- logical datasets viewing many (potentially millions of) data files, with
+  direct access and subselection via coordinate indexing across an arbitrary
+  number of dimensions
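
For illustration, here is a minimal sketch of the scan/combine/read workflow
the description outlines, assuming two netCDF4 files on S3. The bucket, file
names and the "time" concatenation dimension are hypothetical; the kerchunk,
fsspec and xarray calls are the libraries' actual APIs:

    # A sketch of the kerchunk workflow, not part of the port itself.
    import json

    import fsspec
    import xarray as xr
    from kerchunk.hdf import SingleHdf5ToZarr
    from kerchunk.combine import MultiZarrToZarr

    # Hypothetical source files; any fsspec-readable URLs would work.
    urls = ["s3://example-bucket/file1.nc",
            "s3://example-bucket/file2.nc"]

    # Step 1: scan each file once, recording byte ranges and compression
    # metadata as a set of Zarr-style references.
    refs = []
    for url in urls:
        with fsspec.open(url, "rb", anon=True) as f:
            refs.append(SingleHdf5ToZarr(f, url).translate())

    # Step 2: combine the per-file references into one virtual aggregate
    # dataset, concatenating along a shared coordinate (assumed "time").
    combined = MultiZarrToZarr(refs, concat_dims=["time"],
                               remote_protocol="s3",
                               remote_options={"anon": True}).translate()
    with open("combined.json", "w") as out:
        json.dump(combined, out)

    # Step 3: open the virtual dataset lazily through fsspec's "reference"
    # filesystem; chunks are fetched from the original files on demand.
    fs = fsspec.filesystem("reference", fo="combined.json",
                           remote_protocol="s3",
                           remote_options={"anon": True})
    ds = xr.open_dataset(fs.get_mapper(""), engine="zarr",
                         backend_kwargs={"consolidated": False})

Once written, the reference JSON can be shared: readers need only fsspec,
zarr and xarray to access the data, not the original format drivers.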