pkgsrc-WIP-changes archive
py-distributed: Update to 2024.9.0
Module Name: pkgsrc-wip
Committed By: Matthew Danielson <matthewd%fastmail.us@localhost>
Pushed By: matthewd
Date: Mon Sep 16 05:14:20 2024 -0700
Changeset: 9dcc1a224439c89bfbae4955e8dfd6b2fab194fa
Modified Files:
py-distributed/Makefile
py-distributed/PLIST
py-distributed/distinfo
Log Message:
py-distributed: Update to 2024.9.0
2024.9.0
Highlights
Bump Bokeh minimum version to 3.1.0
bokeh>=3.1.0 is now required for diagnostics and the distributed cluster dashboard.
See GH#11375 and GH#8861 by James Bourbeau for more details.
Introduce new Task class
Add a Task class to replace tuples for task specification.
See GH#11248 by Florian Jetter for more details.
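For illustration, a minimal sketch of the difference (the module path dask._task_spec and the exact Task/TaskRef signatures are assumptions based on the description above, not confirmed API):
from operator import add
from dask._task_spec import Task, TaskRef  # assumed location of the new class
# previously, tasks were specified as plain tuples:
legacy_graph = {"x": 1, "y": (add, "x", 2)}
# with the new class, the callable, its arguments, and references to other
# tasks become explicit objects (assumed API):
new_graph = {"x": 1, "y": Task("y", add, TaskRef("x"), 2)}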
2024.8.2
Highlights
Automatic selection of rechunking method
To enable users to rechunk data at larger scales than before, Dask now automatically chooses an appropriate rechunking method when rechunking on a cluster. This requires no additional configuration and is enabled by default.
Specifically, Dask chooses between task-based and P2P rechunking. While task-based rechunking was the previous default, P2P rechunking is beneficial when rechunking requires almost all-to-all communication between the old and new chunks, e.g., when changing between spatial and temporal chunking. In these cases, P2P rechunking offers constant memory usage and creates smaller task graphs. As a result, it works for cases where task-based rechunking would previously have failed.
To disable automatic selection, users can select their preferred method via the configuration
import dask.config
dask.config.set({"array.rechunk.method": "tasks"})
or when rechunking
import dask.array as da
arr = da.random.random(size=(1000, 1000, 365), chunks=(-1, -1, "auto"))
arr = arr.rechunk(("auto", "auto", -1), method="tasks")
See GH#11337 by Hendrik Makait for more details.
New shuffle API for Dask Arrays
Dask added a shuffle() API for Dask Arrays. It shuffles the data along a single dimension, ensuring that every group of elements along that dimension ends up in exactly one chunk. This is a very useful operation for GroupBy-Map patterns in Xarray. See shuffle() for more information and the API signature.
See GH#11267, GH#11311 and GH#11326 by Patrick Hoefler for more details.
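A minimal sketch of the new API (the indexer format, a list of positional index groups along the shuffled axis, is an assumption based on the description above):
import dask.array as da
arr = da.random.random((12, 100), chunks=(4, 100))
# hypothetical groups of rows; after the shuffle, each group lives in
# exactly one chunk
groups = [[0, 3, 7], [1, 2], [4, 5, 6, 8, 9, 10, 11]]
shuffled = arr.shuffle(groups, axis=0)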
New reshape_blockwise API for Dask Arrays
The new reshape_blockwise() enables an embarrassingly parallel reshaping operation for cases where you don't care about the order of the underlying array. It no longer triggers a rechunking operation under the hood. This is useful when the order of the resulting array does not matter, i.e. if a reduction is applied to the array or if the reshaping is only temporary.
import dask.array as da
from dask.array import reshape_blockwise
arr = da.random.random(size=(100, 100, 48_000), chunks=(1000, 100, 83))
result = reshape_blockwise(arr, (10_000, 48_000))
result.sum()
result = reshape_blockwise(result, (100, 100, 48_000), chunks=arr.chunks)
Dask will automatically calculate the resulting chunks if the number of dimensions is reduced, but you have to specify the resulting chunks if the number of dimensions is increased.
Reshaping a Dask Array often creates a very complicated computation with rechunk operations in between, because Dask respects the C ordering of the array by default. This ensures that the resulting Dask Array is returned in the same order as the corresponding NumPy Array. However, it can lead to very inefficient computations. reshape_blockwise is a lot more efficient than the default implementation if you don't care about the order.
Warning
Blockwise reshape operations are more efficient than the default, but they return an array that is ordered differently. Use with care!
See GH#11328 by Patrick Hoefler for more details.
Multidimensional positional indexing keeping chunk sizes consistent
Indexing a Dask Array with vindex() previously created a single output chunk along the dimensions that were indexed. vindex is commonly used in Xarray when indexing multiple dimensions in a single step, e.g.:
import dask.array as da
import xarray as xr
arr = xr.DataArray(
    da.random.random((100, 100, 100), chunks=(5, 5, 50)),
    dims=["a", "b", "c"],
)
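For illustration, a hypothetical indexing step on this array that triggers vindex under the hood (the indexers are made up):
idx = xr.DataArray([0, 2, 4, 8], dims="points")
arr.isel(a=idx, b=idx)  # indexes two dimensions in a single step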
Previously, this put the indexed dimensions into a single chunk:
[Image: the size of each individual chunk increases to over 1 GB]
Dask now uses an improved algorithm that ensures that the chunksizes are kept consistent:
[Image: chunk sizes stay consistent after indexing]
See GH#11330 by Patrick Hoefler for more details.
2024.8.1
Highlights
Improve output chunksizes for reshaping Dask Arrays
Reshaping a Dask Array often squashed the dimensions being reshaped into a single chunk. This caused very large output chunks and, subsequently, a lot of out-of-memory errors and performance issues.
import dask.array as da
arr = da.ones(shape=(1000, 100, 48_000), chunks=(1000, 100, 83))
arr.reshape(1000, 100, 4, 12_000)
Previously, this put the last dimension into a single chunk of size 12_000.
[Image: the size of each individual chunk increases to over 1 GB]
The new algorithm ensures that the chunk size of the input and the output stays the same. This avoids large increases in chunk size and fragmentation of chunks.
[Image: the size of each individual chunk stays the same]
Improve scheduling efficiency for Xarray Rechunk-GroupBy-Reduce patterns
The scheduler previously created an inefficient execution graph for Xarray GroupBy-Reduction patterns that use the cohorts strategy:
import xarray as xr
from xarray.groupers import TimeResampler
arr = xr.open_zarr(...)
arr.chunk(time=TimeResampler("ME")).groupby("time.month").mean()
An issue in the algorithm that creates the execution order of the task graph led to an inefficient execution strategy that accumulated a lot of unnecessary memory on the cluster. The improvement is very similar to the previous ordering improvement in 2024.8.0.
Drop support for Python 3.9
This release drops support for Python 3.9 in accordance with NEP 29. Python 3.10 is now the required minimum version to run Dask.
See GH#11245 and GH#8793 by Patrick Hoefler for more details.
2024.8.0
Highlights
Improve efficiency and performance of slicing with positional indexers
Slicing a Dask Array with a positional indexer is now more efficient. Random access patterns are more stable and produce easier-to-use results.
x[slice(None), [1, 1, 3, 6, 3, 4, 5]]
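A self-contained version of this snippet (the array shape and chunks are illustrative):
import dask.array as da
x = da.random.random((10, 1000), chunks=(10, 100))
result = x[slice(None), [1, 1, 3, 6, 3, 4, 5]]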
Using a positional indexer was previously prone to drastically increasing the number of output chunks and generating a very large task graph. This has been fixed with a more efficient algorithm that keeps the chunk sizes along the indexed axis the same, avoiding fragmentation of chunks or a large increase in chunk size.
See GH#11262 and GH#11267 by Patrick Hoefler for more details and performance benchmarks.
Improve scheduling efficiency for Xarray GroupBy-Reduce patterns
The scheduler previously created an inefficient execution graph for Xarray GroupBy-Reduction patterns like:
import xarray as xr
arr = xr.open_zarr(...)
arr.groupby("time.month").mean()
An issue in the algorithm that creates the execution order of the task graph led to an inefficient execution strategy that accumulated a lot of unnecessary memory on the cluster.
[Image: memory keeps accumulating on the cluster while running an embarrassingly parallel operation]
The operation itself is embarrassingly parallel. With the proper execution strategy, the scheduler can now execute the operation with constant memory, avoiding spilling and allowing us to scale to larger datasets.
[Image: the same operation runs with constant memory usage throughout the computation and scales to bigger datasets]
See GH#8818 by Patrick Hoefler for more details and examples.
2024.7.1
Highlights
More resilient distributed lock
distributed.Lock is now resilient to worker failures. Previously, deadlocks were possible when a lock-holding worker was lost or failed to release the lock due to an error.
See GH#8770 by Florian Jetter for more details.
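For reference, typical usage of the lock looks like this (the cluster setup and lock name are illustrative):
from distributed import Client, Lock
client = Client()  # hypothetical local cluster
lock = Lock("shared-resource")
with lock:
    # critical section: only one holder at a time; a lost worker no
    # longer leaves the lock stuck
    pass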
2024.7.0
Highlights
Drop support for pandas 1.x
This release drops support for pandas<2. pandas 2.0 is now the required minimum version to run Dask DataFrame.
The minimum version of partd was also raised to 1.4.0. Versions before 1.4 are not compatible with pandas 2.
See GH#11199 by Patrick Hoefler for more details.
Publish-subscribe APIs deprecated
distributed.Pub and distributed.Sub have been deprecated and will be removed in a future release. Please switch to distributed.Client.log_event() and distributed.Worker.log_event() instead.
See GH#8724 by Hendrik Makait for more details.
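A minimal sketch of the replacement APIs (the topic name and payload are made up, and the exact shape of the event passed to the handler is an assumption):
from distributed import Client
client = Client()  # hypothetical cluster
def handler(event):
    print(event)  # assumed to arrive as a (timestamp, msg) tuple
client.subscribe_topic("my-topic", handler)  # replaces distributed.Sub
client.log_event("my-topic", {"status": "done"})  # replaces distributed.Pub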
To see a diff of this commit:
https://wip.pkgsrc.org/cgi-bin/gitweb.cgi?p=pkgsrc-wip.git;a=commitdiff;h=9dcc1a224439c89bfbae4955e8dfd6b2fab194fa
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
diffstat:
py-distributed/Makefile | 4 +---
py-distributed/PLIST | 12 +++++++++---
py-distributed/distinfo | 6 +++---
3 files changed, 13 insertions(+), 9 deletions(-)
diffs:
diff --git a/py-distributed/Makefile b/py-distributed/Makefile
index a32f764840..7e13248051 100644
--- a/py-distributed/Makefile
+++ b/py-distributed/Makefile
@@ -1,6 +1,6 @@
# $NetBSD$
-DISTNAME= distributed-2024.6.2
+DISTNAME= distributed-2024.9.0
PKGNAME= ${PYPKGPREFIX}-${DISTNAME}
CATEGORIES= devel net
GITHUB_PROJECT= distributed
@@ -12,8 +12,6 @@ HOMEPAGE= https://github.com/dask/distributed/
COMMENT= Distributed is the parallel scheduler for dask
LICENSE= modified-bsd
-PYTHON_VERSIONS_INCOMPATIBLE= 27 38
-
TOOL_DEPENDS+= ${PYPKGPREFIX}-wheel>=0:../../devel/py-wheel
TOOL_DEPENDS+= ${PYPKGPREFIX}-versioneer>=0.28:../../devel/py-versioneer
diff --git a/py-distributed/PLIST b/py-distributed/PLIST
index eaa2879f6b..a3942ec281 100644
--- a/py-distributed/PLIST
+++ b/py-distributed/PLIST
@@ -12,6 +12,9 @@ ${PYSITELIB}/${WHEEL_INFODIR}/top_level.txt
${PYSITELIB}/distributed/__init__.py
${PYSITELIB}/distributed/__init__.pyc
${PYSITELIB}/distributed/__init__.pyo
+${PYSITELIB}/distributed/_async_taskgroup.py
+${PYSITELIB}/distributed/_async_taskgroup.pyc
+${PYSITELIB}/distributed/_async_taskgroup.pyo
${PYSITELIB}/distributed/_asyncio.py
${PYSITELIB}/distributed/_asyncio.pyc
${PYSITELIB}/distributed/_asyncio.pyo
@@ -39,6 +42,9 @@ ${PYSITELIB}/distributed/batched.pyo
${PYSITELIB}/distributed/bokeh.py
${PYSITELIB}/distributed/bokeh.pyc
${PYSITELIB}/distributed/bokeh.pyo
+${PYSITELIB}/distributed/broker.py
+${PYSITELIB}/distributed/broker.pyc
+${PYSITELIB}/distributed/broker.pyo
${PYSITELIB}/distributed/cfexecutor.py
${PYSITELIB}/distributed/cfexecutor.pyc
${PYSITELIB}/distributed/cfexecutor.pyo
@@ -239,6 +245,9 @@ ${PYSITELIB}/distributed/event.pyo
${PYSITELIB}/distributed/exceptions.py
${PYSITELIB}/distributed/exceptions.pyc
${PYSITELIB}/distributed/exceptions.pyo
+${PYSITELIB}/distributed/gc.py
+${PYSITELIB}/distributed/gc.pyc
+${PYSITELIB}/distributed/gc.pyo
${PYSITELIB}/distributed/http/__init__.py
${PYSITELIB}/distributed/http/__init__.pyc
${PYSITELIB}/distributed/http/__init__.pyo
@@ -528,9 +537,6 @@ ${PYSITELIB}/distributed/utils.pyo
${PYSITELIB}/distributed/utils_comm.py
${PYSITELIB}/distributed/utils_comm.pyc
${PYSITELIB}/distributed/utils_comm.pyo
-${PYSITELIB}/distributed/utils_perf.py
-${PYSITELIB}/distributed/utils_perf.pyc
-${PYSITELIB}/distributed/utils_perf.pyo
${PYSITELIB}/distributed/utils_test.py
${PYSITELIB}/distributed/utils_test.pyc
${PYSITELIB}/distributed/utils_test.pyo
diff --git a/py-distributed/distinfo b/py-distributed/distinfo
index c6e954516e..d7d345e888 100644
--- a/py-distributed/distinfo
+++ b/py-distributed/distinfo
@@ -1,5 +1,5 @@
$NetBSD$
-BLAKE2s (distributed-2024.6.2.tar.gz) = 743304f5bcd47e43499cb57b29ae8a0fcf6db804659cb427dc64cdd65878c7c9
-SHA512 (distributed-2024.6.2.tar.gz) = ec270a5f03ac2ed62b39d40bede18f154567a1fcc71072a8f1c1dd8a9bd8eb088f02e48232758dbaeefabc696dca1f7afbd8b80df241ca5d8dd1a3a872de6300
-Size (distributed-2024.6.2.tar.gz) = 2556083 bytes
+BLAKE2s (distributed-2024.9.0.tar.gz) = 0eb629ddb58d036e58ccc7dd2080e9d3bb7631b939ada6d0f92edd9acab3ed07
+SHA512 (distributed-2024.9.0.tar.gz) = c74d769f801eaeb765f4d8211ccc6dd83ceb502a8c58b6523b9897b53ef22df8b73ad071f02a4c71d4e17383869e3b876e2be3baece0a0e06ed535967b8480f0
+Size (distributed-2024.9.0.tar.gz) = 2570218 bytes