Skip to content

clustering

Methods for merging intervals into "clusters"

This module contains utility functions related to clustering intervals into larger intervals.

There is currently one public method available:

  • cluster_intervals() -- clusters a list of intervals into a set of intervals that cover the input intervals, and are not larger than an input max_size. each original interval will be wholly covered by at least one cluster. The name attribute of each cluster will be set, and the original intervals are returned with a name attribute that matches that of a cluster that wholly contains it.

Examples

>>> from prymer.api.clustering import cluster_intervals
>>> from pybedlite.overlap_detector import Interval
>>> intervals = [Interval("chr1", 1, 2), Interval("chr1", 3, 4)]
>>> cluster_intervals(intervals, 10)
ClusteredIntervals(clusters=[Interval(refname='chr1', start=1, end=4, negative=False, name='chr1:1-4')], intervals=[Interval(refname='chr1', start=1, end=2, negative=False, name='chr1:1-4'), Interval(refname='chr1', start=3, end=4, negative=False, name='chr1:1-4')])
>>> cluster_intervals(intervals, 2)
ClusteredIntervals(clusters=[Interval(refname='chr1', start=1, end=2, negative=False, name='chr1:1-2'), Interval(refname='chr1', start=3, end=4, negative=False, name='chr1:3-4')], intervals=[Interval(refname='chr1', start=1, end=2, negative=False, name='chr1:1-2'), Interval(refname='chr1', start=3, end=4, negative=False, name='chr1:3-4')])

Classes

ClusteredIntervals dataclass

The list of clusters (intervals) and the original source intervals. The source intervals must have the name corresponding to the cluster to which the source interval belongs. Each cluster must envelop ("wholly contain") the intervals associated with the cluster.

Attributes:

Name Type Description
clusters list[Interval]

the clusters that wholly contain one or more source intervals.

intervals list[Interval]

the source intervals, with name corresponding to the name of the associated cluster.

Source code in prymer/api/clustering.py
@dataclass(frozen=True, init=True, slots=True)
class ClusteredIntervals:
    """
    The list of clusters (intervals) and the original source intervals.  The source intervals must
    have the name corresponding to the cluster to which the source interval belongs. Each cluster
    must envelop ("wholly contain") the intervals associated with the cluster.

    Attributes:
        clusters: the clusters that wholly contain one or more source intervals.
        intervals: the source intervals, with name corresponding to the name of the associated
            cluster.
    """

    clusters: list[Interval]
    intervals: list[Interval]

    def __post_init__(self) -> None:
        cluster_names: set[str] = {c.name for c in self.clusters}
        interval_names: set[str] = {i.name for i in self.intervals}
        # Check that the names of clusters are unique
        if len(self.clusters) != len(cluster_names):
            raise ValueError("Cluster names are not unique")
        # Check that every interval has the name of one cluster
        for interval in self.intervals:
            if interval.name not in cluster_names:
                raise ValueError(f"Interval does not belong to a cluster: {interval}")
        # Check that every cluster has at least one interval associated with the cluster
        if len(interval_names) != len(cluster_names):
            raise ValueError("Cluster and interval names differ.")

Functions

cluster_intervals

cluster_intervals(
    intervals: list[Interval], max_size: int
) -> ClusteredIntervals

Cluster a list of intervals into intervals that overlap the given intervals and are not larger than max_size.

Implements a greedy algorithm for hierarchical clustering, merging subsequent intervals (from a sorted list) as long as the maximal size is respected. Each "cluster" is replaced by an interval that spans it, and the algorithm terminates when it can no longer merge anything without creating a cluster that is larger than max_size.

Parameters:

Name Type Description Default
intervals list[Interval]

The intervals to cluster.

required
max_size int

The maximum size (in bp) of the resulting clusters.

required

Returns:

Type Description
ClusteredIntervals

A named tuple (clusters, intervals), where clusters contains one (named) interval per

ClusteredIntervals

cluster, defining the region spanned by the cluster, and intervals contains the original

ClusteredIntervals

set of intervals, each adorned with a name that agrees with that of a cluster in

ClusteredIntervals

clusters that wholly contains it.

Raises:

Type Description
ValueError

If any of the input intervals are larger than max_size.

Source code in prymer/api/clustering.py
def cluster_intervals(
    intervals: list[Interval],
    max_size: int,
) -> ClusteredIntervals:
    """
    Cluster a list of intervals into intervals that overlap the given
    intervals and are not larger than `max_size`.

    Implements a greedy algorithm for hierarchical clustering, merging subsequent intervals
    (from a sorted list) as long as the maximal size is respected.
    Each "cluster" is replaced by an interval that spans it, and the algorithm terminates
    when it can no longer merge anything without creating a cluster that is larger than `max_size`.

    Args:
        intervals: The intervals to cluster.
        max_size: The maximum size (in bp) of the resulting clusters.

    Returns:
        A named tuple (`clusters`, `intervals`), where `clusters` contains one (named) interval per
        cluster, defining the region spanned by the cluster, and `intervals` contains the original
        set of intervals, each adorned with a `name` that agrees with that of a cluster in
        `clusters` that wholly contains it.

    Raises:
        ValueError: If any of the input intervals are larger than `max_size`.
    """

    intervals_by_refname: Dict[str, list[Interval]] = {
        key: list(group) for key, group in itertools.groupby(intervals, key=lambda x: x.refname)
    }
    # import ipdb; ipdb.set_trace()
    # iterate over the refnames and call _cluster_in_contig() per refname.

    # need to store this in a list since we iterate over it twice.
    per_refname_clusters_and_intervals = [
        x
        for x in (
            _cluster_in_contig(intervals_in_refname, max_size)
            for intervals_in_refname in intervals_by_refname.values()
        )
    ]

    # flatten the clusters and intervals
    clusters = [
        cluster
        for refname_result in per_refname_clusters_and_intervals
        for cluster in refname_result.clusters
    ]

    intervals_out = [
        interval
        for refname_results in per_refname_clusters_and_intervals
        for interval in refname_results.intervals
    ]

    return ClusteredIntervals(clusters=clusters, intervals=intervals_out)