author | Po-Chuan Hsieh <sunpoet@FreeBSD.org> | 2025-05-29 12:39:19 +0800 |
---|---|---|
committer | Po-Chuan Hsieh <sunpoet@FreeBSD.org> | 2025-05-29 12:52:18 +0800 |
commit | 49cf560f18818c74f2e587a5bfb8bc1933ae9250 | |
tree | 539e699db835f365a70fa4f8754a02f6131b7649 /filesystems | |
parent | graphics/qb3: Add qb3 1.3.2 |
filesystems/py-kerchunk: Add py-kerchunk 0.2.7

Kerchunk is a library that provides a unified way to represent a variety of
chunked, compressed data formats (e.g. NetCDF, HDF5, GRIB), allowing efficient
access to the data from traditional file systems or cloud object storage. It
also provides a flexible way to create virtual datasets from multiple files. It
does this by extracting the byte ranges, compression information and other
information about the data and storing this metadata in a new, separate object.
This means that you can create a virtual aggregate dataset over potentially many
source files, for efficient, parallel and cloud-friendly in-situ access without
having to copy or translate the originals. It is a gateway to in-the-cloud
massive data processing while the data providers still insist on using legacy
formats for archival storage.
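
The byte-range idea in the paragraph above can be illustrated with a small sketch that uses only the Python standard library. It does not call kerchunk itself: the file layout, the `refs` dictionary and the `read_chunk` helper are hypothetical names chosen for illustration, and the reference layout is not guaranteed to match kerchunk's own format. The point is that the metadata (path, offset, length) lives in a separate object while the chunks are read in place from the original file.

```python
import os
import tempfile

# Hypothetical "legacy" file: a header followed by two data chunks.
header = b"HDR!"
chunk_a = b"temperature-bytes"
chunk_b = b"pressure-bytes"

tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
tmp.write(header + chunk_a + chunk_b)
tmp.close()

# Separate reference object: key -> [path, offset, length].
# Illustrative layout only; an assumption, not kerchunk's documented format.
refs = {
    "temp/0": [tmp.name, len(header), len(chunk_a)],
    "pres/0": [tmp.name, len(header) + len(chunk_a), len(chunk_b)],
}


def read_chunk(refs, key):
    """Read one chunk in place using its recorded byte range."""
    path, offset, length = refs[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


temp_bytes = read_chunk(refs, "temp/0")
pres_bytes = read_chunk(refs, "pres/0")
os.unlink(tmp.name)  # clean up the temporary file
```

Because `refs` only stores locations, the original file is never copied or translated; entries pointing at several different files could sit in the same dictionary to form one aggregate view.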

We provide the following things:
- completely serverless architecture
- metadata consolidation, so you can understand a many-file dataset (metadata
  plus physical storage) in a single read
- read from all of the storage backends supported by fsspec, including object
  storage (s3, gcs, abfs, alibaba), http, cloud user storage (dropbox, gdrive)
  and network protocols (ftp, ssh, hdfs, smb...)
- loading of various file types (currently netcdf4/HDF, grib2, tiff, fits,
  zarr), potentially heterogeneous within a single dataset, without a need to go
  via the specific driver (e.g., no need for h5py)
- asynchronous concurrent fetch of many data chunks in one go, amortizing the
  cost of latency
- parallel access with a library like zarr without any locks
- logical datasets viewing many (>~millions) data files, and direct
  access/subselection to them via coordinate indexing across an arbitrary number
  of dimensions
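
The "asynchronous concurrent fetch" item in the list above can be sketched with the standard `asyncio` module. This is an illustration of the general technique, not kerchunk or fsspec code: `fetch_chunk`, `fetch_many` and the `LATENCY` value are hypothetical, and the network request is simulated with a sleep. Issuing all requests at once means the total wait is close to one round trip rather than one round trip per chunk.

```python
import asyncio
import time

LATENCY = 0.05  # simulated per-request latency in seconds (assumed value)


async def fetch_chunk(key):
    """Stand-in for a remote byte-range request; sleeps to mimic latency."""
    await asyncio.sleep(LATENCY)
    return key, f"data-for-{key}".encode()


async def fetch_many(keys):
    """Issue all requests concurrently and gather the results into a dict."""
    results = await asyncio.gather(*(fetch_chunk(k) for k in keys))
    return dict(results)


keys = [f"temp/{i}" for i in range(20)]
start = time.perf_counter()
chunks = asyncio.run(fetch_many(keys))
elapsed = time.perf_counter() - start
```

Fetching the 20 chunks one after another would take at least `20 * LATENCY` seconds; here the sleeps overlap, so `elapsed` should be far smaller.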
Diffstat (limited to 'filesystems')
-rw-r--r-- | filesystems/Makefile | 1 |
-rw-r--r-- | filesystems/py-kerchunk/Makefile | 29 |
-rw-r--r-- | filesystems/py-kerchunk/distinfo | 3 |
-rw-r--r-- | filesystems/py-kerchunk/pkg-descr | 28 |
4 files changed, 61 insertions, 0 deletions
diff --git a/filesystems/Makefile b/filesystems/Makefile
index 7225d1423458..c61eae0c5e36 100644
--- a/filesystems/Makefile
+++ b/filesystems/Makefile
@@ -94,6 +94,7 @@
     SUBDIR += py-fsspec-xrootd
     SUBDIR += py-fusepy
     SUBDIR += py-gcsfs
+    SUBDIR += py-kerchunk
     SUBDIR += py-libzfs
     SUBDIR += py-llfuse
     SUBDIR += py-prometheus-zfs
diff --git a/filesystems/py-kerchunk/Makefile b/filesystems/py-kerchunk/Makefile
new file mode 100644
index 000000000000..4fef8b7643e6
--- /dev/null
+++ b/filesystems/py-kerchunk/Makefile
@@ -0,0 +1,29 @@
+PORTNAME=	kerchunk
+PORTVERSION=	0.2.7
+CATEGORIES=	filesystems python
+MASTER_SITES=	PYPI
+PKGNAMEPREFIX=	${PYTHON_PKGNAMEPREFIX}
+
+MAINTAINER=	sunpoet@FreeBSD.org
+COMMENT=	Functions to make reference descriptions for ReferenceFileSystem
+WWW=		https://fsspec.github.io/kerchunk/ \
+		https://github.com/fsspec/kerchunk
+
+LICENSE=	MIT
+LICENSE_FILE=	${WRKSRC}/LICENSE
+
+BUILD_DEPENDS=	${PYTHON_PKGNAMEPREFIX}setuptools>=42:devel/py-setuptools@${PY_FLAVOR} \
+		${PYTHON_PKGNAMEPREFIX}setuptools-scm>=7:devel/py-setuptools-scm@${PY_FLAVOR} \
+		${PYTHON_PKGNAMEPREFIX}wheel>=0:devel/py-wheel@${PY_FLAVOR}
+RUN_DEPENDS=	${PYTHON_PKGNAMEPREFIX}fsspec>=0:filesystems/py-fsspec@${PY_FLAVOR} \
+		${PYTHON_PKGNAMEPREFIX}numcodecs>=0:misc/py-numcodecs@${PY_FLAVOR} \
+		${PYTHON_PKGNAMEPREFIX}numpy>=0,1:math/py-numpy@${PY_FLAVOR} \
+		${PYTHON_PKGNAMEPREFIX}ujson>=0:devel/py-ujson@${PY_FLAVOR} \
+		${PYTHON_PKGNAMEPREFIX}zarr>=0.1<3,1:devel/py-zarr@${PY_FLAVOR}
+
+USES=		python
+USE_PYTHON=	autoplist concurrent pep517
+
+NO_ARCH=	yes
+
+.include <bsd.port.mk>
diff --git a/filesystems/py-kerchunk/distinfo b/filesystems/py-kerchunk/distinfo
new file mode 100644
index 000000000000..45262a4ccc15
--- /dev/null
+++ b/filesystems/py-kerchunk/distinfo
@@ -0,0 +1,3 @@
+TIMESTAMP = 1748107898
+SHA256 (kerchunk-0.2.7.tar.gz) = 0425aa0fbf56f898053ee4c4dd40b35cea12d2fc986e036086e99a4ad16bd4e6
+SIZE (kerchunk-0.2.7.tar.gz) = 709052
diff --git a/filesystems/py-kerchunk/pkg-descr b/filesystems/py-kerchunk/pkg-descr
new file mode 100644
index 000000000000..a351e30943fe
--- /dev/null
+++ b/filesystems/py-kerchunk/pkg-descr
@@ -0,0 +1,28 @@
+Kerchunk is a library that provides a unified way to represent a variety of
+chunked, compressed data formats (e.g. NetCDF, HDF5, GRIB), allowing efficient
+access to the data from traditional file systems or cloud object storage. It
+also provides a flexible way to create virtual datasets from multiple files. It
+does this by extracting the byte ranges, compression information and other
+information about the data and storing this metadata in a new, separate object.
+This means that you can create a virtual aggregate dataset over potentially many
+source files, for efficient, parallel and cloud-friendly in-situ access without
+having to copy or translate the originals. It is a gateway to in-the-cloud
+massive data processing while the data providers still insist on using legacy
+formats for archival storage.
+
+We provide the following things:
+- completely serverless architecture
+- metadata consolidation, so you can understand a many-file dataset (metadata
+  plus physical storage) in a single read
+- read from all of the storage backends supported by fsspec, including object
+  storage (s3, gcs, abfs, alibaba), http, cloud user storage (dropbox, gdrive)
+  and network protocols (ftp, ssh, hdfs, smb...)
+- loading of various file types (currently netcdf4/HDF, grib2, tiff, fits,
+  zarr), potentially heterogeneous within a single dataset, without a need to go
+  via the specific driver (e.g., no need for h5py)
+- asynchronous concurrent fetch of many data chunks in one go, amortizing the
+  cost of latency
+- parallel access with a library like zarr without any locks
+- logical datasets viewing many (>~millions) data files, and direct
+  access/subselection to them via coordinate indexing across an arbitrary number
+  of dimensions