clustering
Methods for merging intervals into "clusters"
This module contains utility functions related to clustering intervals into larger intervals.
There is currently one public method available:
`cluster_intervals()` -- clusters a list of intervals into a set of intervals that cover the input intervals and are no larger than the given `max_size`. Each original interval will be wholly covered by at least one cluster. The `name` attribute of each cluster will be set, and the original intervals are returned with a `name` attribute that matches that of a cluster that wholly contains them.
Examples
>>> from prymer.api.clustering import cluster_intervals
>>> from pybedlite.overlap_detector import Interval
>>> intervals = [Interval("chr1", 1, 2), Interval("chr1", 3, 4)]
>>> cluster_intervals(intervals, 10)
ClusteredIntervals(clusters=[Interval(refname='chr1', start=1, end=4, negative=False, name='chr1:1-4')], intervals=[Interval(refname='chr1', start=1, end=2, negative=False, name='chr1:1-4'), Interval(refname='chr1', start=3, end=4, negative=False, name='chr1:1-4')])
>>> cluster_intervals(intervals, 2)
ClusteredIntervals(clusters=[Interval(refname='chr1', start=1, end=2, negative=False, name='chr1:1-2'), Interval(refname='chr1', start=3, end=4, negative=False, name='chr1:3-4')], intervals=[Interval(refname='chr1', start=1, end=2, negative=False, name='chr1:1-2'), Interval(refname='chr1', start=3, end=4, negative=False, name='chr1:3-4')])
Classes
ClusteredIntervals (dataclass)
The list of clusters (intervals) and the original source intervals. The source intervals must have the name corresponding to the cluster to which the source interval belongs. Each cluster must envelop ("wholly contain") the intervals associated with the cluster.
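As an illustration of the containment rule above, the following minimal sketch checks that every source interval carries the name of a cluster that wholly contains it. It is not prymer's source: `SimpleInterval` and `check_clustering` are hypothetical stand-ins introduced here so the example runs without any third-party imports, and the choice of `ValueError` is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SimpleInterval:
    """Hypothetical stand-in for an interval (not the pybedlite class)."""

    refname: str
    start: int
    end: int
    name: Optional[str] = None


def check_clustering(
    clusters: List[SimpleInterval], intervals: List[SimpleInterval]
) -> None:
    """Raise ValueError if a source interval is not enveloped by its named cluster."""
    by_name = {cluster.name: cluster for cluster in clusters}
    for interval in intervals:
        cluster = by_name.get(interval.name)
        if cluster is None:
            raise ValueError(f"no cluster named {interval.name!r}")
        enveloped = (
            cluster.refname == interval.refname
            and cluster.start <= interval.start
            and interval.end <= cluster.end
        )
        if not enveloped:
            raise ValueError(f"{interval} is not contained in {cluster}")
```

A check of this kind passes for the clusters and intervals shown in the examples above, where both source intervals are named `chr1:1-4` and fall inside that cluster.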
Attributes:
| Name | Type | Description |
|---|---|---|
| `clusters` | `list[Interval]` | The clusters that wholly contain one or more source intervals. |
| `intervals` | `list[Interval]` | The source intervals, with name corresponding to the name of the associated cluster. |
Source code in prymer/api/clustering.py
Functions
cluster_intervals
Cluster a list of intervals into intervals that overlap the given
intervals and are not larger than max_size.
Implements a greedy algorithm for hierarchical clustering, merging subsequent intervals
(from a sorted list) as long as the maximal size is respected.
Each "cluster" is replaced by an interval that spans it, and the algorithm terminates
when it can no longer merge anything without creating a cluster that is larger than max_size.
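The greedy idea described above can be sketched as a single sorted sweep. This is a simplified variant for illustration, not prymer's implementation: `SimpleInterval` and `greedy_cluster` are hypothetical names, coordinates are assumed to be half-open (so length is `end - start`), the `ValueError` is this sketch's choice of exception, and the named intervals come back in sorted order rather than input order.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SimpleInterval:
    """Hypothetical stand-in for an interval; half-open coordinates assumed."""

    refname: str
    start: int
    end: int
    name: Optional[str] = None


def greedy_cluster(
    intervals: List[SimpleInterval], max_size: int
) -> Tuple[List[SimpleInterval], List[SimpleInterval]]:
    """Sweep sorted intervals, extending the current cluster while it fits."""
    for iv in intervals:
        if iv.end - iv.start > max_size:
            # Exception type is an assumption of this sketch.
            raise ValueError(f"interval is larger than max_size: {iv}")

    clusters: List[SimpleInterval] = []
    members: List[List[SimpleInterval]] = []
    for iv in sorted(intervals, key=lambda i: (i.refname, i.start, i.end)):
        if clusters:
            cur = clusters[-1]
            new_end = max(cur.end, iv.end)
            # Merge only on the same reference and if the span stays within max_size.
            if cur.refname == iv.refname and new_end - cur.start <= max_size:
                clusters[-1] = replace(cur, end=new_end)
                members[-1].append(iv)
                continue
        # Otherwise start a new cluster from this interval.
        clusters.append(SimpleInterval(iv.refname, iv.start, iv.end))
        members.append([iv])

    # Name each cluster after its span and copy that name onto its members.
    named_clusters = [
        replace(c, name=f"{c.refname}:{c.start}-{c.end}") for c in clusters
    ]
    named_intervals = [
        replace(iv, name=c.name)
        for c, group in zip(named_clusters, members)
        for iv in group
    ]
    return named_clusters, named_intervals
```

Because the sweep only ever extends the most recent cluster, each interval ends up in exactly one cluster that wholly contains it; a true hierarchical merge may group intervals differently when several merges are possible.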
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `intervals` | `list[Interval]` | The intervals to cluster. | required |
| `max_size` | `int` | The maximum size (in bp) of the resulting clusters. | required |
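Returns:
| Type | Description |
|---|---|
| `ClusteredIntervals` | A named tuple (`clusters`, `intervals`): `clusters` holds one interval per cluster, defining the region spanned by the cluster, and `intervals` is the set of source intervals, each adorned with a `name` attribute matching that of its cluster. |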
Raises:
| Type | Description |
|---|---|
| | If any of the input intervals are larger than `max_size`. |