clustering

Methods for merging intervals into "clusters"¶

This module contains utility functions related to clustering intervals into larger intervals.

There is currently one public method available:

cluster_intervals() -- clusters a list of intervals into a set of intervals that cover the input intervals, and are not larger than an input max_size. each original interval will be wholly covered by at least one cluster. The name attribute of each cluster will be set, and the original intervals are returned with a name attribute that matches that of a cluster that wholly contains it.

Examples¶

>>> from prymer.api.clustering import cluster_intervals
>>> from pybedlite.overlap_detector import Interval
>>> intervals = [Interval("chr1", 1, 2), Interval("chr1", 3, 4)]
>>> cluster_intervals(intervals, 10)
ClusteredIntervals(clusters=[Interval(refname='chr1', start=1, end=4, negative=False, name='chr1:1-4')], intervals=[Interval(refname='chr1', start=1, end=2, negative=False, name='chr1:1-4'), Interval(refname='chr1', start=3, end=4, negative=False, name='chr1:1-4')])
>>> cluster_intervals(intervals, 2)
ClusteredIntervals(clusters=[Interval(refname='chr1', start=1, end=2, negative=False, name='chr1:1-2'), Interval(refname='chr1', start=3, end=4, negative=False, name='chr1:3-4')], intervals=[Interval(refname='chr1', start=1, end=2, negative=False, name='chr1:1-2'), Interval(refname='chr1', start=3, end=4, negative=False, name='chr1:3-4')])

Classes¶

ClusteredIntervals `dataclass` ¶

The list of clusters (intervals) and the original source intervals. The source intervals must have the name corresponding to the cluster to which the source interval belongs. Each cluster must envelop ("wholly contain") the intervals associated with the cluster.

Attributes:

Name	Type	Description
`clusters`	`list[Interval]`	the clusters that wholly contain one or more source intervals.
`intervals`	`list[Interval]`	the source intervals, with name corresponding to the name of the associated cluster.

Source code in prymer/api/clustering.py

@dataclass(frozen=True, init=True, slots=True)
class ClusteredIntervals:
    """
    The list of clusters (intervals) and the original source intervals.  The source intervals must
    have the name corresponding to the cluster to which the source interval belongs. Each cluster
    must envelop ("wholly contain") the intervals associated with the cluster.

    Attributes:
        clusters: the clusters that wholly contain one or more source intervals.
        intervals: the source intervals, with name corresponding to the name of the associated
            cluster.
    """

    clusters: list[Interval]
    intervals: list[Interval]

    def __post_init__(self) -> None:
        cluster_names: set[str] = {c.name for c in self.clusters}
        interval_names: set[str] = {i.name for i in self.intervals}
        # Check that the names of clusters are unique
        if len(self.clusters) != len(cluster_names):
            raise ValueError("Cluster names are not unique")
        # Check that every interval has the name of one cluster
        for interval in self.intervals:
            if interval.name not in cluster_names:
                raise ValueError(f"Interval does not belong to a cluster: {interval}")
        # Check that every cluster has at least one interval associated with the cluster
        if len(interval_names) != len(cluster_names):
            raise ValueError("Cluster and interval names differ.")

Functions¶

cluster_intervals ¶

cluster_intervals(
    intervals: list[Interval], max_size: int
) -> ClusteredIntervals

Cluster a list of intervals into intervals that overlap the given intervals and are not larger than max_size.

Implements a greedy algorithm for hierarchical clustering, merging subsequent intervals (from a sorted list) as long as the maximal size is respected. Each "cluster" is replaced by an interval that spans it, and the algorithm terminates when it can no longer merge anything without creating a cluster that is larger than max_size.

Parameters:

Name	Type	Description	Default
`intervals`	`list[Interval]`	The intervals to cluster.	required
`max_size`	`int`	The maximum size (in bp) of the resulting clusters.	required

Returns:

Type	Description
`ClusteredIntervals`	A named tuple (`clusters`, `intervals`), where `clusters` contains one (named) interval per
`ClusteredIntervals`	cluster, defining the region spanned by the cluster, and `intervals` contains the original
`ClusteredIntervals`	set of intervals, each adorned with a `name` that agrees with that of a cluster in
`ClusteredIntervals`	`clusters` that wholly contains it.

Raises:

Type	Description
`ValueError`	If any of the input intervals are larger than `max_size`.