Skip to content

bwa

Methods and Classes to run and interact with BWA

The BwaAlnInteractive class is used to map a list of queries to the reference genome. A single query may be mapped using map_one(), while multiple queries may be mapped using map_all(). The latter is more efficient than mapping queries one-by-one.

The queries are provided to these methods using the Query class, which contains the unique identifier for the query and the bases to query. These methods return a BwaResult, which represents zero or more hits (or alignments) found by BWA for the given query. Each hit is represented by a BwaHit object. In some cases, BWA will write fewer (or no) hits in the "XA" tag than the total number hits reported in the "HN". This occurs when BWA finds more hits than max_hits (see bwt aln -X).

Use of this module requires installation of a custom version of BWA named bwa-aln-interactive. See:

Example

>>> from pathlib import Path
>>> ref_fasta = Path("./tests/offtarget/data/miniref.fa")
>>> query = Query(bases="TCTACTAAAAATACAAAAAATTAGCTGGGCATGATGGCATGCACCTGTAATCCCGCTACT", id="NA")
>>> bwa = BwaAlnInteractive(ref=ref_fasta, max_hits=1)
>>> result = bwa.map_one(query=query.bases, id=query.id)
>>> result.hit_count
1
>>> len(result.hits)
1
>>> result.hits[0]
BwaHit(refname='chr1', start=61, negative=False, cigar=Cigar(elements=(CigarElement(length=60, operator=<CigarOp.M: (0, 'M', True, True)>),)), edits=0)
>>> query = Query(bases="AAAAAA", id="NA")
>>> bwa.map_all(queries=[query])
[BwaResult(query=Query(id='NA', bases='AAAAAA'), hit_count=3968, hits=[])]
>>> bwa.close()
True

Attributes

BWA_AUX_EXTENSIONS module-attribute

BWA_AUX_EXTENSIONS: list[str] = [
    ".amb",
    ".ann",
    ".bwt",
    ".pac",
    ".sa",
]

The file extensions for BWA index files

BWA_EXECUTABLE_NAME module-attribute

BWA_EXECUTABLE_NAME: str = 'bwa-aln-interactive'

The executable name for the interactive build of bwa aln.

MAX_GAP_EXTENSIONS module-attribute

MAX_GAP_EXTENSIONS: int = -1

The default maximum number of gap extensions allowed in the full query sequence

MAX_GAP_OPENS module-attribute

MAX_GAP_OPENS: int = 0

The default maximum number of gap opens allowed in the full query sequence

MAX_MISMATCHES module-attribute

MAX_MISMATCHES: int = 3

The default maximum number of mismatches allowed in the full query sequence

MAX_MISMATCHES_IN_SEED module-attribute

MAX_MISMATCHES_IN_SEED: int = 3

The default maximum number of mismatches allowed in the seed region

SEED_LENGTH module-attribute

SEED_LENGTH: int = 20

The default length of the seed region

Classes

BwaAlnInteractive

Bases: ExecutableRunner

Wrapper around a novel mode of 'bwa aln' that allows for "interactive" use of bwa to keep the process running and be able to send it chunks of reads periodically and get alignments back without waiting for a full batch of reads to be sent.

See:

Attributes:

Name Type Description
max_hits int

the maximum number of hits to report - if more than this number of seed hits are found, report only the count and not each hit.

reverse_complement bool

reverse complement each query sequence before alignment.

include_alt_hits bool

if True include hits to references with names ending in _alt, otherwise do not include them.

header

the SAM alignment header.

Source code in prymer/offtarget/bwa.py
class BwaAlnInteractive(ExecutableRunner):
    """Wrapper around a novel mode of 'bwa aln' that allows for "interactive" use of bwa to keep
    the process running and be able to send it chunks of reads periodically and get alignments
    back without waiting for a full batch of reads to be sent.

    See:

    - [https://github.com/fulcrumgenomics/bwa-aln-interactive](https://github.com/fulcrumgenomics/bwa-aln-interactive)
    - [https://bioconda.github.io/recipes/bwa-aln-interactive/README.html](https://bioconda.github.io/recipes/bwa-aln-interactive/README.html)

    Attributes:
        max_hits: the maximum number of hits to report - if more than this number of seed hits
                  are found, report only the count and not each hit.
        reverse_complement: reverse complement each query sequence before alignment.
        include_alt_hits: if True include hits to references with names ending in _alt, otherwise
                          do not include them.
        header: the SAM alignment header.
    """

    def __init__(
        self,
        ref: Path,
        max_hits: int,
        executable: str | Path = BWA_EXECUTABLE_NAME,
        max_mismatches: int = 3,
        max_mismatches_in_seed: int = 3,
        max_gap_opens: int = 0,
        max_gap_extensions: int = -1,
        min_indel_to_end_distance: int = 3,
        seed_length: int = 20,
        reverse_complement: bool = False,
        include_alt_hits: bool = False,
        threads: Optional[int] = None,
    ):
        """
        Args:
            ref: the path to the reference FASTA, which must be indexed with bwa.
            max_hits: the maximum number of hits to report - if more than this number of seed hits
                      are found, report only the count and not each hit.
            executable: string or Path representation of the `bwa-aln-interactive` executable path
            max_mismatches: the maximum number of mismatches allowed in the full query sequence
            max_mismatches_in_seed: the maximum number of mismatches allowed in the seed region
            max_gap_opens: the maximum number of gap opens allowed in the full query sequence
            max_gap_extensions: the maximum number of gap extensions allowed in the full query
                                sequence
            min_indel_to_end_distance: do not place an indel within this many bp of the ends of
                the query sequence
            seed_length: the length of the seed region
            reverse_complement: reverse complement each query sequence before alignment
            include_alt_hits: if true include hits to references with names ending in _alt,
                              otherwise do not include them.
            threads: the number of threads to use.  If `None`, use all available CPUs.
        """
        threads = os.cpu_count() if threads is None else threads
        executable_path = ExecutableRunner.validate_executable_path(executable=executable)
        self.reverse_complement: bool = reverse_complement
        self.include_alt_hits: bool = include_alt_hits
        self.max_hits: int = max_hits

        missing_aux_paths = []
        for aux_ext in BWA_AUX_EXTENSIONS:
            aux_path = Path(f"{ref}{aux_ext}")
            if not aux_path.exists():
                missing_aux_paths.append(aux_path)
        if len(missing_aux_paths) > 0:
            message: str
            if len(missing_aux_paths) > 1:
                message = "BWA index files do not exist:\n\t"
            else:
                message = "BWA index file does not exist:\n\t"
            message += "\t\n".join(f"{p}" for p in missing_aux_paths)
            raise FileNotFoundError(f"{message}\nIndex with: `{executable_path} index {ref}`")

        # -N = non-iterative mode: search for all n-difference hits (slooow)
        # -S = output SAM (run samse)
        # -Z = interactive mode (no input buffer and force processing with empty lines between recs)
        command: list[str] = [
            f"{executable_path}",
            "aln",
            "-t",
            f"{threads}",
            "-n",
            f"{max_mismatches}",
            "-o",
            f"{max_gap_opens}",
            "-e",
            f"{max_gap_extensions}",
            "-i",
            f"{min_indel_to_end_distance}",
            "-l",
            f"{seed_length}",
            "-k",
            f"{max_mismatches_in_seed}",
            "-X",
            f"{max_hits}",
            "-N",
            "-S",
            "-Z",
            "-D",
            f"{ref}",
            "/dev/stdin",
        ]

        super().__init__(command=command, stderr=subprocess.PIPE)

        header = []
        for line in self._subprocess.stdout:
            if line.startswith("@"):
                header.append(line)
            if line.startswith("@PG"):
                break

        self.header = AlignmentHeader.from_text("".join(header))

        # NB: ExecutableRunner will default to redirecting stderr to /dev/null. However, we would
        # like to preserve stderr messages from bwa for potential debugging. To do this, we create
        # a single thread to continuously read from stderr and redirect text lines to a debug
        # logger. The close() method of this class will additionally join the stderr logging thread.
        self._logger = logging.getLogger(self.__class__.__qualname__)
        self._stderr_thread = Thread(
            daemon=True,
            target=self._stream_to_sink,
            args=(self._subprocess.stderr, self._logger.debug),
        )
        self._stderr_thread.start()

    def __signal_bwa(self) -> None:
        """Signals BWA to process the queries."""
        self._subprocess.stdin.flush()
        # NB: the executable compiled on different platforms requires a different number of newlines
        # NB: it is not understood why, but 16 newlines seems to work for all platforms tested
        self._subprocess.stdin.write("\n" * 16)
        self._subprocess.stdin.flush()

    @override
    def close(self) -> bool:
        """
        Gracefully terminates the underlying subprocess if it is still running.

        Returns:
            True: if the subprocess was terminated successfully
            False: if the subprocess failed to terminate or was not already running
        """
        safely_closed: bool = super().close()
        self._stderr_thread.join()
        return safely_closed

    def map_one(self, query: str, id: str = "unknown") -> BwaResult:
        """Maps a single query to the genome and returns the result.

        Args:
            query: the query to map with BWA

        Returns:
            a `BwaResult` based on mapping the query
        """
        return self.map_all([Query(bases=query, id=id)])[0]

    def map_all(self, queries: list[Query]) -> list[BwaResult]:
        """Maps multiple queries and returns the results.  This is more efficient than using
        `map_one` on each query one-by-one as it batches reads to BWA.

        Args:
            queries: the queries to map with BWA

        Returns:
            one `BwaResult`s for each query
        """
        if len(queries) == 0:
            return []

        # Send the reads to BWA
        for query in queries:
            fastq_str = query.to_fastq(reverse_complement=self.reverse_complement)
            self._subprocess.stdin.write(fastq_str)

        # Force the input to be sent to the underlying process.
        self.__signal_bwa()

        # Read back the results
        results: list[BwaResult] = []
        for query in queries:
            # get the next alignment and convert to a result
            line: str = next(self._subprocess.stdout).strip()
            assert not line.startswith("@"), f"SAM record must not start with '@'! {line}"
            alignment = AlignedSegment.fromstring(line, self.header)
            results.append(self._to_result(query=query, rec=alignment))

        return results

    def _to_result(self, query: Query, rec: pysam.AlignedSegment) -> BwaResult:
        """
        Convert the query and alignment to a result.

        Args:
            query: the original query
            rec: the alignment
        """
        if query.id != rec.query_name:
            raise ValueError(
                "Query and Results are out of order" f"Query=${query.id}, Result=${rec.query_name}"
            )

        num_hits: int = int(rec.get_tag("HN")) if rec.has_tag("HN") else 0
        # `to_hits()` removes artifactual hits which span the boundary between concatenated
        # reference sequences. If we are reporting a list of hits, the number of hits should match
        # the size of this list. Otherwise, we either have zero hits, or more than the maximum
        # number of hits. In both of the latter cases, we have to rely on the count reported in the
        # `HN` tag.
        hits: list[BwaHit]
        if 0 < num_hits <= self.max_hits:
            hits = self.to_hits(rec=rec)
            num_hits = len(hits)
        else:
            hits = []

        return BwaResult(query=query, hit_count=num_hits, hits=hits)

    def to_hits(self, rec: AlignedSegment) -> list[BwaHit]:
        """Extracts the hits from the given alignment.  Beyond the current alignment
        additional alignments are parsed from the XA SAM tag.

        Args:
            rec: the given alignment
        """
        negative = rec.is_reverse != self.reverse_complement
        first_hit: BwaHit = BwaHit(
            refname=rec.reference_name,
            start=rec.reference_start + 1,  # NB: pysam is 0-based, Hit is 1-based
            negative=negative,
            cigar=Cigar.from_cigartuples(rec.cigartuples),
            edits=int(rec.get_tag("NM")),
        )

        # Add the hits
        # E.g. XA:Z:chr4,-97592047,24M,3;chr8,-32368749,24M,3;
        hits: list[BwaHit] = [first_hit]
        if rec.has_tag("XA"):
            for xa in cast(str, rec.get_tag("XA")).split(";"):
                if xa == "":
                    continue
                fields = xa.split(",")

                # If the reverse-complement of the query sequence was fed to BWA, various fields
                # should be negated/reversed in the Hit
                negative = fields[1][0] == "-"
                if self.reverse_complement:
                    negative = not negative

                hit: BwaHit = BwaHit(
                    refname=fields[0],
                    start=int(fields[1][1:]),  # the XA tag is 1-based
                    negative=negative,
                    cigar=Cigar.from_cigarstring(fields[2]),
                    edits=int(fields[3]),
                )
                hits.append(hit)

        if not self.include_alt_hits:
            hits = [hit for hit in hits if not hit.refname.endswith("_alt")]

        # Remove hits that extend beyond the end of a contig - these are artifacts of `bwa aln`'s
        # alignment process, which concatenates all reference sequences and reports hits which span
        # across contigs.
        # NB: the type ignore is necessary because pysam's type hint for `get_reference_length` is
        # incorrect.
        # https://github.com/pysam-developers/pysam/pull/1313
        hits = [hit for hit in hits if hit.end <= self.header.get_reference_length(hit.refname)]  # type: ignore[arg-type]  # noqa: E501

        return hits

Functions

__init__
__init__(
    ref: Path,
    max_hits: int,
    executable: str | Path = BWA_EXECUTABLE_NAME,
    max_mismatches: int = 3,
    max_mismatches_in_seed: int = 3,
    max_gap_opens: int = 0,
    max_gap_extensions: int = -1,
    min_indel_to_end_distance: int = 3,
    seed_length: int = 20,
    reverse_complement: bool = False,
    include_alt_hits: bool = False,
    threads: Optional[int] = None,
)

Parameters:

Name Type Description Default
ref Path

the path to the reference FASTA, which must be indexed with bwa.

required
max_hits int

the maximum number of hits to report - if more than this number of seed hits are found, report only the count and not each hit.

required
executable str | Path

string or Path representation of the bwa-aln-interactive executable path

BWA_EXECUTABLE_NAME
max_mismatches int

the maximum number of mismatches allowed in the full query sequence

3
max_mismatches_in_seed int

the maximum number of mismatches allowed in the seed region

3
max_gap_opens int

the maximum number of gap opens allowed in the full query sequence

0
max_gap_extensions int

the maximum number of gap extensions allowed in the full query sequence

-1
min_indel_to_end_distance int

do not place an indel within this many bp of the ends of the query sequence

3
seed_length int

the length of the seed region

20
reverse_complement bool

reverse complement each query sequence before alignment

False
include_alt_hits bool

if true include hits to references with names ending in _alt, otherwise do not include them.

False
threads Optional[int]

the number of threads to use. If None, use all available CPUs.

None
Source code in prymer/offtarget/bwa.py
def __init__(
    self,
    ref: Path,
    max_hits: int,
    executable: str | Path = BWA_EXECUTABLE_NAME,
    max_mismatches: int = 3,
    max_mismatches_in_seed: int = 3,
    max_gap_opens: int = 0,
    max_gap_extensions: int = -1,
    min_indel_to_end_distance: int = 3,
    seed_length: int = 20,
    reverse_complement: bool = False,
    include_alt_hits: bool = False,
    threads: Optional[int] = None,
):
    """
    Args:
        ref: the path to the reference FASTA, which must be indexed with bwa.
        max_hits: the maximum number of hits to report - if more than this number of seed hits
                  are found, report only the count and not each hit.
        executable: string or Path representation of the `bwa-aln-interactive` executable path
        max_mismatches: the maximum number of mismatches allowed in the full query sequence
        max_mismatches_in_seed: the maximum number of mismatches allowed in the seed region
        max_gap_opens: the maximum number of gap opens allowed in the full query sequence
        max_gap_extensions: the maximum number of gap extensions allowed in the full query
                            sequence
        min_indel_to_end_distance: do not place an indel within this many bp of the ends of
            the query sequence
        seed_length: the length of the seed region
        reverse_complement: reverse complement each query sequence before alignment
        include_alt_hits: if true include hits to references with names ending in _alt,
                          otherwise do not include them.
        threads: the number of threads to use.  If `None`, use all available CPUs.
    """
    threads = os.cpu_count() if threads is None else threads
    executable_path = ExecutableRunner.validate_executable_path(executable=executable)
    self.reverse_complement: bool = reverse_complement
    self.include_alt_hits: bool = include_alt_hits
    self.max_hits: int = max_hits

    missing_aux_paths = []
    for aux_ext in BWA_AUX_EXTENSIONS:
        aux_path = Path(f"{ref}{aux_ext}")
        if not aux_path.exists():
            missing_aux_paths.append(aux_path)
    if len(missing_aux_paths) > 0:
        message: str
        if len(missing_aux_paths) > 1:
            message = "BWA index files do not exist:\n\t"
        else:
            message = "BWA index file does not exist:\n\t"
        message += "\t\n".join(f"{p}" for p in missing_aux_paths)
        raise FileNotFoundError(f"{message}\nIndex with: `{executable_path} index {ref}`")

    # -N = non-iterative mode: search for all n-difference hits (slooow)
    # -S = output SAM (run samse)
    # -Z = interactive mode (no input buffer and force processing with empty lines between recs)
    command: list[str] = [
        f"{executable_path}",
        "aln",
        "-t",
        f"{threads}",
        "-n",
        f"{max_mismatches}",
        "-o",
        f"{max_gap_opens}",
        "-e",
        f"{max_gap_extensions}",
        "-i",
        f"{min_indel_to_end_distance}",
        "-l",
        f"{seed_length}",
        "-k",
        f"{max_mismatches_in_seed}",
        "-X",
        f"{max_hits}",
        "-N",
        "-S",
        "-Z",
        "-D",
        f"{ref}",
        "/dev/stdin",
    ]

    super().__init__(command=command, stderr=subprocess.PIPE)

    header = []
    for line in self._subprocess.stdout:
        if line.startswith("@"):
            header.append(line)
        if line.startswith("@PG"):
            break

    self.header = AlignmentHeader.from_text("".join(header))

    # NB: ExecutableRunner will default to redirecting stderr to /dev/null. However, we would
    # like to preserve stderr messages from bwa for potential debugging. To do this, we create
    # a single thread to continuously read from stderr and redirect text lines to a debug
    # logger. The close() method of this class will additionally join the stderr logging thread.
    self._logger = logging.getLogger(self.__class__.__qualname__)
    self._stderr_thread = Thread(
        daemon=True,
        target=self._stream_to_sink,
        args=(self._subprocess.stderr, self._logger.debug),
    )
    self._stderr_thread.start()
__signal_bwa
__signal_bwa() -> None

Signals BWA to process the queries.

Source code in prymer/offtarget/bwa.py
def __signal_bwa(self) -> None:
    """Signals BWA to process the queries."""
    self._subprocess.stdin.flush()
    # NB: the executable compiled on different platforms requires a different number of newlines
    # NB: it is not understood why, but 16 newlines seems to work for all platforms tested
    self._subprocess.stdin.write("\n" * 16)
    self._subprocess.stdin.flush()
close
close() -> bool

Gracefully terminates the underlying subprocess if it is still running.

Returns:

Name Type Description
True bool

if the subprocess was terminated successfully

False bool

if the subprocess failed to terminate or was not already running

Source code in prymer/offtarget/bwa.py
@override
def close(self) -> bool:
    """
    Gracefully terminates the underlying subprocess if it is still running.

    Returns:
        True: if the subprocess was terminated successfully
        False: if the subprocess failed to terminate or was not already running
    """
    safely_closed: bool = super().close()
    self._stderr_thread.join()
    return safely_closed
map_all
map_all(queries: list[Query]) -> list[BwaResult]

Maps multiple queries and returns the results. This is more efficient than using map_one on each query one-by-one as it batches reads to BWA.

Parameters:

Name Type Description Default
queries list[Query]

the queries to map with BWA

required

Returns:

Type Description
list[BwaResult]

one BwaResults for each query

Source code in prymer/offtarget/bwa.py
def map_all(self, queries: list[Query]) -> list[BwaResult]:
    """Maps multiple queries and returns the results.  This is more efficient than using
    `map_one` on each query one-by-one as it batches reads to BWA.

    Args:
        queries: the queries to map with BWA

    Returns:
        one `BwaResult`s for each query
    """
    if len(queries) == 0:
        return []

    # Send the reads to BWA
    for query in queries:
        fastq_str = query.to_fastq(reverse_complement=self.reverse_complement)
        self._subprocess.stdin.write(fastq_str)

    # Force the input to be sent to the underlying process.
    self.__signal_bwa()

    # Read back the results
    results: list[BwaResult] = []
    for query in queries:
        # get the next alignment and convert to a result
        line: str = next(self._subprocess.stdout).strip()
        assert not line.startswith("@"), f"SAM record must not start with '@'! {line}"
        alignment = AlignedSegment.fromstring(line, self.header)
        results.append(self._to_result(query=query, rec=alignment))

    return results
map_one
map_one(query: str, id: str = 'unknown') -> BwaResult

Maps a single query to the genome and returns the result.

Parameters:

Name Type Description Default
query str

the query to map with BWA

required

Returns:

Type Description
BwaResult

a BwaResult based on mapping the query

Source code in prymer/offtarget/bwa.py
def map_one(self, query: str, id: str = "unknown") -> BwaResult:
    """Maps a single query to the genome and returns the result.

    Args:
        query: the query to map with BWA

    Returns:
        a `BwaResult` based on mapping the query
    """
    return self.map_all([Query(bases=query, id=id)])[0]
to_hits
to_hits(rec: AlignedSegment) -> list[BwaHit]

Extracts the hits from the given alignment. Beyond the current alignment additional alignments are parsed from the XA SAM tag.

Parameters:

Name Type Description Default
rec AlignedSegment

the given alignment

required
Source code in prymer/offtarget/bwa.py
def to_hits(self, rec: AlignedSegment) -> list[BwaHit]:
    """Extracts the hits from the given alignment.  Beyond the current alignment
    additional alignments are parsed from the XA SAM tag.

    Args:
        rec: the given alignment
    """
    negative = rec.is_reverse != self.reverse_complement
    first_hit: BwaHit = BwaHit(
        refname=rec.reference_name,
        start=rec.reference_start + 1,  # NB: pysam is 0-based, Hit is 1-based
        negative=negative,
        cigar=Cigar.from_cigartuples(rec.cigartuples),
        edits=int(rec.get_tag("NM")),
    )

    # Add the hits
    # E.g. XA:Z:chr4,-97592047,24M,3;chr8,-32368749,24M,3;
    hits: list[BwaHit] = [first_hit]
    if rec.has_tag("XA"):
        for xa in cast(str, rec.get_tag("XA")).split(";"):
            if xa == "":
                continue
            fields = xa.split(",")

            # If the reverse-complement of the query sequence was fed to BWA, various fields
            # should be negated/reversed in the Hit
            negative = fields[1][0] == "-"
            if self.reverse_complement:
                negative = not negative

            hit: BwaHit = BwaHit(
                refname=fields[0],
                start=int(fields[1][1:]),  # the XA tag is 1-based
                negative=negative,
                cigar=Cigar.from_cigarstring(fields[2]),
                edits=int(fields[3]),
            )
            hits.append(hit)

    if not self.include_alt_hits:
        hits = [hit for hit in hits if not hit.refname.endswith("_alt")]

    # Remove hits that extend beyond the end of a contig - these are artifacts of `bwa aln`'s
    # alignment process, which concatenates all reference sequences and reports hits which span
    # across contigs.
    # NB: the type ignore is necessary because pysam's type hint for `get_reference_length` is
    # incorrect.
    # https://github.com/pysam-developers/pysam/pull/1313
    hits = [hit for hit in hits if hit.end <= self.header.get_reference_length(hit.refname)]  # type: ignore[arg-type]  # noqa: E501

    return hits

BwaHit dataclass

Represents a single hit or alignment of a sequence to a location in the genome.

Attributes:

Name Type Description
refname str

the reference name of the hit

start int

the start position of the hit (1-based inclusive)

negative bool

whether the hit is on the negative strand

cigar Cigar

the cigar string returned by BWA

edits int

the number of edits between the read and the reference

Source code in prymer/offtarget/bwa.py
@dataclass(init=True, frozen=True)
class BwaHit:
    """Represents a single hit or alignment of a sequence to a location in the genome.

    Attributes:
        refname: the reference name of the hit
        start: the start position of the hit (1-based inclusive)
        negative: whether the hit is on the negative strand
        cigar: the cigar string returned by BWA
        edits: the number of edits between the read and the reference
    """

    refname: str
    start: int
    negative: bool
    cigar: Cigar
    edits: int

    @property
    def mismatches(self) -> int:
        """The number of mismatches for the hit"""
        indel_sum = sum(elem.length for elem in self.cigar.elements if elem.operator.is_indel)
        if indel_sum > self.edits:
            raise ValueError(
                f"indel_sum ({indel_sum}) > self.edits ({self.edits}) with cigar: {self.cigar}"
            )
        return self.edits - indel_sum

    @property
    def end(self) -> int:
        """The end position of the hit (1-based inclusive)"""
        return coordmath.get_closed_end(self.start, self.cigar.length_on_target())

    @staticmethod
    def build(
        refname: str, start: int, negative: bool, cigar: str, edits: int, revcomp: bool = False
    ) -> "BwaHit":
        """Generates a hit object.

        Args:
            refname: the reference name of the hit
            start: the start position of the hit (1-based inclusive)
            negative: whether the hit is on the negative strand
            cigar: the cigar string returned by BWA
            edits: the number of edits between the read and the reference
            revcomp: whether the reverse-complement of the query sequence was fed to BWA, in which
                case various fields should be negated/reversed in the Hit

        Returns:
            A Hit that represents the mapping of the original query sequence that was supplied
        """
        negative = negative != revcomp
        return BwaHit(
            refname=refname,
            start=start,
            negative=negative,
            cigar=Cigar.from_cigarstring(cigar),
            edits=edits,
        )

    def __str__(self) -> str:
        """Returns the string representation in bwa's XA tag format."""
        # E.g. XA:Z:chr4,-97592047,24M,3;chr8,-32368749,24M,3;
        return ",".join(
            [
                self.refname,
                ("-" if self.negative else "+") + f"{self.start}",
                f"{self.cigar}",
                f"{self.edits}",
            ]
        )

Attributes

end property
end: int

The end position of the hit (1-based inclusive)

mismatches property
mismatches: int

The number of mismatches for the hit

Functions

__str__
__str__() -> str

Returns the string representation in bwa's XA tag format.

Source code in prymer/offtarget/bwa.py
def __str__(self) -> str:
    """Returns the string representation in bwa's XA tag format."""
    # E.g. XA:Z:chr4,-97592047,24M,3;chr8,-32368749,24M,3;
    return ",".join(
        [
            self.refname,
            ("-" if self.negative else "+") + f"{self.start}",
            f"{self.cigar}",
            f"{self.edits}",
        ]
    )
build staticmethod
build(
    refname: str,
    start: int,
    negative: bool,
    cigar: str,
    edits: int,
    revcomp: bool = False,
) -> BwaHit

Generates a hit object.

Parameters:

Name Type Description Default
refname str

the reference name of the hit

required
start int

the start position of the hit (1-based inclusive)

required
negative bool

whether the hit is on the negative strand

required
cigar str

the cigar string returned by BWA

required
edits int

the number of edits between the read and the reference

required
revcomp bool

whether the reverse-complement of the query sequence was fed to BWA, in which case various fields should be negated/reversed in the Hit

False

Returns:

Type Description
BwaHit

A Hit that represents the mapping of the original query sequence that was supplied

Source code in prymer/offtarget/bwa.py
@staticmethod
def build(
    refname: str, start: int, negative: bool, cigar: str, edits: int, revcomp: bool = False
) -> "BwaHit":
    """Generates a hit object.

    Args:
        refname: the reference name of the hit
        start: the start position of the hit (1-based inclusive)
        negative: whether the hit is on the negative strand
        cigar: the cigar string returned by BWA
        edits: the number of edits between the read and the reference
        revcomp: whether the reverse-complement of the query sequence was fed to BWA, in which
            case various fields should be negated/reversed in the Hit

    Returns:
        A Hit that represents the mapping of the original query sequence that was supplied
    """
    negative = negative != revcomp
    return BwaHit(
        refname=refname,
        start=start,
        negative=negative,
        cigar=Cigar.from_cigarstring(cigar),
        edits=edits,
    )

BwaResult dataclass

Represents zero or more hits or alignments found by BWA for the given query.

The number of hits found may be more than the number of hits listed in the hits attribute.

Attributes:

Name Type Description
query Query

the query (as given, no RC'ing)

hit_count int

the total number of hits found by bwa (this may be more than len(hits))

hits list[BwaHit]

the subset of hits that were reported by bwa

Source code in prymer/offtarget/bwa.py
@dataclass(init=True, frozen=True)
class BwaResult:
    """
    Represents zero or more hits or alignments found by BWA for the given query.

    The number of hits found may be more than the number of hits listed in the `hits` attribute.

    Attributes:
        query: the query (as given, no RC'ing)
        hit_count: the total number of hits found by bwa (this may be more than len(hits))
        hits: the subset of hits that were reported by bwa
    """

    query: Query
    hit_count: int
    hits: list[BwaHit]

Query dataclass

Represents a single query sequence for mapping.

Attributes:

Name Type Description
id str

the identifier for the query (e.g. read name)

bases str

the query bases

Source code in prymer/offtarget/bwa.py
@dataclass(init=True, frozen=True)
class Query:
    """Represents a single query sequence for mapping.

    Attributes:
        id: the identifier for the query (e.g. read name)
        bases: the query bases
    """

    id: str
    bases: str

    _DEFAULT_BASE_QUALITY: ClassVar[str] = "H"
    """The default base quality for a query"""

    def __post_init__(self) -> None:
        if self.bases is None or len(self.bases) == 0:
            raise ValueError("No bases were provided to query")

    def to_fastq(self, reverse_complement: bool = False) -> str:
        """Returns the multi-line FASTQ string representation of this query"""
        bases = sequence.reverse_complement(self.bases) if reverse_complement else self.bases
        quals = self._DEFAULT_BASE_QUALITY * len(self.bases)
        return f"@{self.id}\n{bases}\n+\n{quals}\n"

Functions

to_fastq
to_fastq(reverse_complement: bool = False) -> str

Returns the multi-line FASTQ string representation of this query

Source code in prymer/offtarget/bwa.py
def to_fastq(self, reverse_complement: bool = False) -> str:
    """Returns the multi-line FASTQ string representation of this query"""
    bases = sequence.reverse_complement(self.bases) if reverse_complement else self.bases
    quals = self._DEFAULT_BASE_QUALITY * len(self.bases)
    return f"@{self.id}\n{bases}\n+\n{quals}\n"

Modules