/usr/lib/R/site-library/IRanges/doc/IRangesOverview.Rnw

%\VignetteIndexEntry{An Introduction to IRanges}
%\VignetteDepends{}
%\VignetteKeywords{Ranges}
%\VignettePackage{IRanges}
\documentclass[10pt]{article}

\usepackage{times}
\usepackage{hyperref}

\textwidth=6.5in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=-.1in
\evensidemargin=-.1in
\headheight=-.3in

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

\newcommand{\software}[1]{\textsf{#1}}
\newcommand{\R}{\software{R}}
\newcommand{\IRanges}{\Rpackage{IRanges}}

\title{An Introduction to \IRanges{}}
\author{Patrick Aboyoun, Michael Lawrence, Herv\'e Pag\`es}
\date{\today}

\begin{document}

\maketitle

<<options,echo=FALSE>>=
options(width=72)
@

\section{Introduction}

The \IRanges{} package is designed to represent sequences, ranges
representing indices along those sequences, and data related to those
ranges. In this vignette, we will rely on simple, illustrative example
datasets, rather than large, real-world data, so that each data structure
and algorithm can be explained in an intuitive, graphical manner. We
expect that packages that apply \IRanges{} to a particular problem domain
will provide vignettes with relevant, realistic examples.

The \IRanges{} package is available at bioconductor.org and can be
downloaded via \Rfunction{biocLite}:

<<biocLite, eval=FALSE>>=
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
@
<<initialize, results=hide>>=
library(IRanges)
@

\section{Vector objects}

In the context of the \IRanges{} package, a sequence is an ordered finite
collection of elements. The \IRanges{} packages represents two types of
objects as sequences: (1) atomic sequences and (2) lists (or non-atomic
sequences). The following subsections describe each in turn.
All \IRanges{}-derived sequences inherit from the \Rclass{Vector}
virtual class.

\subsection{Atomic Vectors}

In \R{}, atomic sequences are typically stored in atomic vectors. The
\IRanges{} package includes an additional atomic sequence object type,
\Rclass{Rle}, which compresses an atomic sequence through run-length
encoding.

We begin our discussion of atomic sequences using two \Rclass{Rle} vectors.
<<initialize>>=
set.seed(0)
lambda <- c(rep(0.001, 4500), seq(0.001, 10, length = 500), 
            seq(10, 0.001, length = 500))
xVector <- Rle(rpois(1e7, lambda))
yVector <- Rle(rpois(1e7, lambda[c(251:length(lambda), 1:250)]))
@

All atomic sequences in \R{} have three main properties: (1) a
notion of length or number of elements, (2) the ability to extract
elements to create new atomic sequences, and (3) the ability to be
combined with one or more atomic sequences to form larger
atomic sequences. The main functions for these three operations are
\Rfunction{length}, \Rfunction{[}, and \Rfunction{c}.
<<basic-ops>>=
length(xVector)
xVector[1]
zVector <- c(xVector, yVector)
@

While these three methods may seem trivial, they provide a great deal of
power and many atomic sequence manipulations can be constructed
using them.

\subsubsection{Vector Subsetting}

As with ordinary \R{} atomic vectors, it is often necessary to subset
one sequence from another. When this subsetting does not duplicate or
reorder the elements being extracted, the result is called a
\textit{subsequence}. In general, the \Rfunction{[} function can be used
to construct a new sequence or extract a subsequence, but its interface
is often inconvenient and not amenable to optimization. To compensate for
this, the \IRanges{} package supports seven additional functions for
sequence extraction:

\begin{enumerate}
\item \Rfunction{window} - Extracts a subsequence over a specified region.
\item \Rfunction{subset} - Extracts the subsequence specified by a logical
  vector.
\item \Rfunction{head} - Extracts a consecutive subsequence containing the
  first n elements.
\item \Rfunction{tail} - Extracts a consecutive subsequence containing the
  last n elements.
\item \Rfunction{rev} - Creates a new sequence with the elements in the
  reverse order.
\item \Rfunction{rep} - Creates a new sequence by repeating sequence
  elements.
\end{enumerate}

The following code illustrates how these functions are used on an
ordinary \R{} \Rclass{integer} vector:

<<seq-extraction>>=
xSnippet <- xVector[IRanges(4751, 4760)]
xSnippet
head(xSnippet)
tail(xSnippet)
rev(xSnippet)
rep(xSnippet, 2)
subset(xSnippet, xSnippet >= 5L)
@

\subsubsection{Combining Vectors}

The \IRanges{} package uses two generic functions, \Rfunction{c} and
\Rfunction{append}, for combining two \Rclass{Vector} objects. The methods
for \Rclass{Vector} objects follow the definition that these two functions
are given the the \Rpackage{base} package.

<<seq-combine>>=
c(xSnippet, rev(xSnippet))
append(xSnippet, xSnippet, after = 3)
@

\subsubsection{Looping over Vectors and Vector subsets}

In \R{}, \Rfunction{for} looping can be an expensive operation. To
compensate for this, \IRanges{} uses three generics, \Rfunction{endoapply},
\Rfunction{lapply}, and \Rfunction{sapply}, for looping over sequences and
two generics, \Rfunction{aggregate} and \Rfunction{shiftApply}, to perform
calculations over subsequences. The \Rfunction{lapply} and \Rfunction{sapply}
functions are familiar to many \R{} users since they are the standard
functions for looping over the elements of an \R{} \Rclass{list} object.
The \Rfunction{endoapply} function performs an endomorphism equivalent to
\Rfunction{lapply}, i.e. returns a \Rclass{Vector} object of the same class
as the input rather than a \Rclass{list} object. More will be given on these
three functions in the Lists subsection.

The \Rfunction{aggregate} function combines sequence extraction functionality
of the \Rfunction{window} function with looping capabilities of the
\Rfunction{sapply} function. For example, here is some code to compute medians
across a moving window of width 3 using the function \Rfunction{aggregate}:

<<aggregate>>=
xSnippet
aggregate(xSnippet, start = 1:8, width = 3, FUN = median)
@

The \Rfunction{shiftApply} function is a looping operation involving two
sequences whose elements are lined up via a positional shift operation.
For example, the elements of \Robject{xVector} and \Robject{yVector} were
simulated from Poisson distributions with the mean of element i from
\Robject{yVector} being equivalent to the mean of element i + 250 from
\Robject{xVector}. If we did not know the size of the shift, we could
estimate it by finding the shift that maximizes the correlation between
\Robject{xVector} and \Robject{yVector}.

<<shiftApply-cor>>=
cor(xVector, yVector)
shifts <- seq(235, 265, by=3)
corrs  <- shiftApply(shifts, yVector, xVector, FUN = cor)
@
%
<<figshiftcorrs, fig=TRUE, include=FALSE, eps=FALSE, width=5, height=5>>=
plot(shifts, corrs)
@

The result is shown in Fig.~\ref{figshiftcorrs}.
\begin{figure}[tb]
  \begin{center}
     \includegraphics[width=0.5\textwidth]{IRangesOverview-figshiftcorrs}  
     \caption{\label{figshiftcorrs}%
      Correlation between \Robject{xVector} and \Robject{yVector} for various
      shifts.}
  \end{center}
\end{figure}


\subsubsection{Run Length Encoding}

Up until this point we have used \R{} atomic vectors to represent
atomic sequences, but there are times when these object become too
large to manage in memory. When there are lots of consecutive repeats in
the sequence, the data can be compressed and managed in memory through a
run-length encoding where a data value is paired with a run length. For
example, the sequence \{1, 1, 1, 2, 3,  3\} can be represented as
values = \{1, 2, 3\}, run lengths = \{3, 1, 2\}.

The \Rclass{Rle} class in \IRanges{} is used to represent a run-length
encoded (compressed) sequence of \Rclass{logical}, \Rclass{integer},
\Rclass{numeric}, \Rclass{complex},  \Rclass{character}, or \Rclass{raw}
values. One way to construct an \Rclass{Rle} object is through the
\Rclass{Rle} constructor function:

<<Rle-construction>>=
xRle <- Rle(xVector)
yRle <- Rle(yVector)
xRle
yRle
@

When there are lots of consecutive repeats, the memory savings through
an RLE can be quite dramatic. For example, the \Robject{xRle} object
occupies less than one quarter of the space of the original \Robject{xVector}
object, while storing the same information:

<<Rle-vector-compare>>=
as.vector(object.size(xRle) / object.size(xVector))
identical(as.vector(xRle), xVector)
@

The functions \Rfunction{runValue} and \Rfunction{runLength} extract
the run values and run lengths from an \Rclass{Rle} object respectively:

<<Rle-accessors>>=
head(runValue(xRle))
head(runLength(xRle))
@

The \Rclass{Rle} class supports many of the basic methods associated with
\R{} atomic vectors including the Ops, Math, Math2, Summary, and Complex
group generics. Here is a example of manipulating \Rclass{Rle} objects
using methods from the Ops group:

<<Rle-ops>>=
xRle > 0
xRle + yRle
xRle > 0 | yRle > 0
@

Here are some from the Summary group:

<<Rle-summary>>=
range(xRle)
sum(xRle > 0 | yRle > 0)
@

And here is one from the Math group:

<<Rle-math>>=
log1p(xRle)
@

As with the atomic vectors, the \Rfunction{cor} and \Rfunction{shiftApply}
functions operate on \Rclass{Rle} objects:

<<Rle-cor>>=
cor(xRle, yRle)
shiftApply(249:251, yRle, xRle, FUN = function(x, y) var(x, y) / (sd(x) * sd(y)))
@

For more information on the methods supported by the \Rclass{Rle} class,
consult the \Rcode{Rle} man page.

\subsection{Lists}

In many data analysis situation there is a desire to organize and
manipulate multiple objects simultaneously. Typically this is done in
\R{} through the usage of a list. While a list serves as a generic
container, it does not confer any information about the specific class
of its elements, provides no infrastructure to ensure type safety, and
the S3 and S4 method dispatch mechanisms do not support method selection
for lists with homogeneous object types.
The \Rclass{List} virtual class defined in the \IRanges{} package addresses
these issues. \Rclass{List} is a direct extension of \Rclass{Vector}.

\subsubsection{Lists of Atomic Vectors}

The first type of lists we consider are those containing atomic sequences
such as \Rclass{integer} vectors or \Rclass{Rle} objects. We may wish to
define a method that retrieves the length of each atomic sequence element,
without special type checking. To enable this, we define collection
classes such as \Rclass{IntegerList} and \Rclass{RleList}, which inherit
from the \Rclass{List} virtual class, for representing lists of
\Rclass{integer} vectors and \Rclass{Rle} objects respectively.

<<list-intro>>=
getClassDef("RleList")
@

As the class definition above shows, the \Rclass{RleList} class is virtual
with subclasses \Rclass{SimpleRleList} and \Rclass{CompressedRleList}. A
\Rclass{SimpleRleList} class uses a regular \R{} list to store the
underlying elements and the \Rclass{CompressedRleList} class stores the
elements in an unlisted form and keeps track of where the element breaks
are. The former ``simple list" class is useful when the Rle elements are
long and the latter ``compressed list" class is useful when the list is
long and/or sparse (i.e. a number of the list elements have length 0).

In fact, all of the atomic vector types (raw, logical, integer, numeric,
complex, and character) have similar list classes that derive from the
\Rclass{List} virtual class. For example, there is an
\Rclass{IntegerList} virtual class with subclasses
\Rclass{SimpleIntegerList} and \Rclass{CompressedIntegerList}.

Each of the list classes for atomic sequences, be they stored as vectors
or \Rclass{Rle} objects, have a constructor function with a name of the
appropriate list virtual class, such as \Rclass{IntegerList}, and an
optional argument \Rfunarg{compress} that takes an argument to specify
whether or not to create the simple list object type or the compressed
list object type. The default is to create the compressed list object
type.

<<list-construct>>=
args(IntegerList)
cIntList1 <- IntegerList(x = xVector, y = yVector)
cIntList1
sIntList2 <- IntegerList(x = xVector, y = yVector, compress = FALSE)
sIntList2
## sparse integer list
xExploded <- lapply(xVector[1:5000], function(x) seq_len(x))
cIntList2 <- IntegerList(xExploded)
sIntList2 <- IntegerList(xExploded, compress = FALSE)
object.size(cIntList2)
object.size(sIntList2)
@

The \Rfunction{length} function returns the number of elements in a
\Rclass{Vector}-derived object and, for a \Rclass{List}-derived object
like ``simple list" or ``compressed list", the \Rfunction{elementLengths}
function returns an integer vector containing the lengths of each of the
elements:

<<list-length>>=
length(cIntList2)
Rle(elementLengths(cIntList2))
@

Just as with ordinary \R{} \Rclass{list} objects, \Rclass{List}-derived
object support the \Rfunction{[[} for element extraction, \Rfunction{c}
for combining, and \Rfunction{lapply}/\Rfunction{sapply} for looping. When
looping over sparse lists, the ``compressed list" classes can be much faster
during computations since only the non-empty elements are looped over during
the \Rfunction{lapply}/\Rfunction{sapply} computation and all the empty
elements are assigned the appropriate value based on their status.

<<list-lapply>>=
system.time(sapply(xExploded, mean))
system.time(sapply(sIntList2, mean))
system.time(sapply(cIntList2, mean))
identical(sapply(xExploded, mean), sapply(sIntList2, mean))
identical(sapply(xExploded, mean), sapply(cIntList2, mean))
@

Unlist ordinary \R{} \Rclass{list} objects, \Rclass{AtomicList} objects support
the \Rfunction{Ops} (e.g. \Rfunction{+}, \Rfunction{==}, \Rfunction{\&}),
\Rfunction{Math} (e.g. \Rfunction{log}, \Rfunction{sqrt}),
\Rfunction{Math2} (e.g. \Rfunction{round}, \Rfunction{signif}),
\Rfunction{Summary} (e.g. \Rfunction{min}, \Rfunction{max}, \Rfunction{sum}),
and \Rfunction{Complex} (e.g. \Rfunction{Re}, \Rfunction{Im}) group generics.

<<list-groupgenerics>>=
xRleList <- RleList(xRle, 2L * rev(xRle))
yRleList <- RleList(yRle, 2L * rev(yRle))
xRleList > 0
xRleList + yRleList
sum(xRleList > 0 | yRleList > 0)
@

Since these atomic lists inherit from \Rclass{List}, they can also use
the looping function \Rfunction{endoapply} to perform endomorphisms.

<<list-endoapply>>=
safe.max <- function(x) { if(length(x)) max(x) else integer(0) }
endoapply(sIntList2, safe.max)
endoapply(cIntList2, safe.max)
endoapply(sIntList2, safe.max)[[1]]
@

\section{Data Tables}

To Do:
\Rclass{DataTable},
\Rclass{DataFrame},
\Rclass{DataFrameList},
\Rclass{SplitDataFrameList}

\section{Vector Annotations}

Often when one has a collection of objects, there is a need to attach
metadata that describes the collection in some way. 
Two kinds of metadata can be attached to a \Rclass{Vector} object:
\begin{enumerate}
\item Metadata about the object as a whole: this metadata is accessed
      via the \Rfunction{metadata} accessor and is represented as an
      ordinary \Rclass{list};
\item Metadata about the individual elements of the object: this metadata
      is accessed via the \Rfunction{mcols} accessor (\Rfunction{mcols}
      stands for {\it metadata columns}) and is represented as a
      \Rclass{DataTable} object (i.e. as an instance of a concrete
      subclass of \Rclass{DataTable}, e.g. a \Rclass{DataFrame} object).
      This \Rclass{DataTable} object can be thought of as the result of
      binding together one or several vector-like objects (the metadata
      columns) of the same length as the \Rclass{Vector} object. Each
      row of the \Rclass{DataTable} object annotates the corresponding
      element of the \Rclass{Vector} object.
\end{enumerate}

\section{Vector Ranges}

When analyzing sequences, we are often interested in particular
consecutive subsequences. For example, the \Sexpr{letters} vector could
be considered a sequence of lower-case letters, in alphabetical order.
We would call the first five letters (\textit{a} to \textit{e}) a
consecutive subsequence, while the subsequence containing only the
vowels would not be consecutive. It is not uncommon for an analysis
task to focus only on the geometry of the regions, while
ignoring the underlying sequence values. A list of indices would be a
simple way to select a subsequence. However, a sparser representation
for consecutive subsequences would be a range, a pairing of a start
position and a width, as used when extracting sequences with
\Rfunction{window}.

When analyzing subsequences in \IRanges{}, each range is treated as an
observation. The virtual \Rclass{Ranges} class represents lists of
ranges, or, equivalently and as a derivative \Rclass{IntegerList},
sequences of consecutive integers. The most commonly used
implementation of \Rclass{Ranges} is \Rclass{IRanges}, which stores
the starts and widths as ordinary integer vectors. To construct an
\Rclass{IRanges} instance, we call the \Rfunction{IRanges}
constructor. Ranges are normally specified by passing two out of the
three parameters: start, end and width (see \Rcode{help(IRanges)} for
more information).
%
<<iranges-constructor>>=
ir1 <- IRanges(start = 1:10, width = 10:1)
ir2 <- IRanges(start = 1:10, end = 11)
ir3 <- IRanges(end = 11, width = 10:1)
identical(ir1, ir2) & identical(ir2, ir3)
ir <- IRanges(c(1, 8, 14, 15, 19, 34, 40),
  width = c(12, 6, 6, 15, 6, 2, 7))
@
%
All of the above calls construct an \Rclass{IRanges} instance with the
same ranges, using different combinations of the \Rfunarg{start},
\Rfunarg{end} and \Rfunarg{width} parameters.

Accessing the starts, widths and ends is supported by every
\Rclass{Ranges} implementation.
<<iranges-start>>=
start(ir)
@
<<iranges-end>>=
end(ir)
@
<<iranges-width>>=
width(ir)
@

For \Rclass{IRanges} and some other \Rclass{Ranges} derivatives,
subsetting is also supported, by numeric and logical indices.
<<iranges-subset-numeric>>=
ir[1:4]
@
<<iranges-subset-logical>>=
ir[start(ir) <= 15]
@

One may think of each range as a sequence of integer ranges, and
\Rclass{Ranges} is, in fact, derived from \Rclass{IntegerList}.
<<ranges-extraction>>=
ir[[1]]
@

In order to illustrate range operations, we'll create a function to plot ranges.

<<plotRanges>>=
plotRanges <- function(x, xlim = x, main = deparse(substitute(x)),
                       col = "black", sep = 0.5, ...) 
{
  height <- 1
  if (is(xlim, "Ranges"))
    xlim <- c(min(start(xlim)), max(end(xlim)))
  bins <- disjointBins(IRanges(start(x), end(x) + 1))
  plot.new()
  plot.window(xlim, c(0, max(bins)*(height + sep)))
  ybottom <- bins * (sep + height) - height
  rect(start(x)-0.5, ybottom, end(x)+0.5, ybottom + height, col = col, ...)
  title(main)
  axis(1)
}
@

<<ir-plotRanges, fig=TRUE, include=FALSE, eps=FALSE, height=2.25>>=
plotRanges(ir)
@

\begin{figure}[tb]
  \begin{center}
     \includegraphics[width=0.5\textwidth]{IRangesOverview-ir-plotRanges}  
     \caption{\label{fig-ir-plotRanges}%
      Plot of original ranges.}
  \end{center}
\end{figure}

\subsection{Normality}

Sometimes, it is necessary to formally represent a
subsequence, where no elements are repeated and order is
preserved. Also, it is occasionally useful to think of a
\Rclass{Ranges} object as a set, where no elements are repeated and
order does not matter. While every \Rclass{Ranges} object, as a
\Rclass{Vector} derivative, has an implicit ordering, one can
enforce the same ordering for all such objects, so that ordering
becomes inconsequential within that context. 

The \Rclass{NormalIRanges} class formally represents either a
subsequence encoding or a set of integers. By definition
a Ranges object is said to be \textit{normal} when its ranges are: (a)
not empty (i.e. they have a non-null width); (b) not overlapping; (c)
ordered from left to right; (d) not even adjacent (i.e. there must be
a non empty gap between 2 consecutive ranges).

There are three main advantages of using a \textit{normal}
\Rclass{Ranges} object: (1) it guarantees a subsequence
encoding or set of integers, (2) it is compact in terms of the number
of ranges, and (3) it uniquely identifies its information, which
simplifies comparisons.

The \Rfunction{reduce} function reduces any \Rclass{Ranges} object to
a \Rclass{NormalIRanges} by merging redundant ranges.
<<ranges-reduce, fig=TRUE, include=FALSE, eps=FALSE, height=2.25>>=
reduce(ir)
plotRanges(reduce(ir))
@

\begin{figure}[tb]
  \begin{center}
     \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-reduce}  
     \caption{\label{fig-ranges-reduce}%
      Plot of reduced ranges.}
  \end{center}
\end{figure}

\subsection{Lists of \Rclass{Ranges} objects}

It is common to manipulate collections of \Rclass{Ranges} objects during
an analysis. Thus, the \IRanges{} package defines some specific classes
for working with multiple \Rclass{Ranges} objects.

The \Rclass{RangesList} class asserts that each element is a
\Rclass{Ranges} object and provides convenience methods, such as
\Rfunction{start}, \Rfunction{end} and \Rfunction{width} accessors
that return \Rclass{IntegerList} objects, aligning with the
\Rclass{RangesList} object.
To explicitly construct a \Rclass{RangesList}, use the
\Rfunction{RangesList} function.
<<rangeslist-contructor>>=
rl <- RangesList(ir, rev(ir))
@
%
<<rangeslist-start>>=
start(rl)
@

\subsection{Vector Extraction}

As the elements of a \Rclass{Ranges} object encode consecutive
subsequences, they may be used directly in sequence extraction. Note
that when a \textit{normal} \Rclass{Ranges} is given as the index, the
result is a subsequence, as no elements are repeated or reordered.
If the sequence is a \Rclass{Vector} subclass (i.e. not an ordinary
\Rclass{vector}), the canonical \Rfunction{[} function accepts a
\Rclass{Ranges} instance.
%
<<bracket-ranges>>=
irextract <- IRanges(start = c(4501, 4901) , width = 100)
xRle[irextract]
@
%

\subsection{Finding Overlapping Ranges}

The function \Rfunction{findOverlaps} detects overlaps between two
\Rclass{Ranges} objects.

<<overlap-ranges>>=
ol <- findOverlaps(ir, reduce(ir))
as.matrix(ol)
@

\subsection{Counting Overlapping Ranges}

The function \Rfunction{coverage} counts the number of ranges over each
position.

<<ranges-coverage, fig=TRUE, include=FALSE, eps=FALSE, height=2.25>>=
cov <- coverage(ir)
plotRanges(ir)
cov <- as.vector(cov)
mat <- cbind(seq_along(cov)-0.5, cov)
d <- diff(cov) != 0
mat <- rbind(cbind(mat[d,1]+1, mat[d,2]), mat)
mat <- mat[order(mat[,1]),]
lines(mat, col="red", lwd=4)
axis(2)
@

\begin{figure}[tb]
  \begin{center}
     \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-coverage}  
     \caption{\label{fig-ranges-coverage}%
      Plot of ranges with accumulated coverage.}
  \end{center}
\end{figure}

\subsection{Finding Neighboring Ranges}

The \Rfunction{nearest} function finds the nearest neighbor ranges (overlapping
is zero distance). The \Rfunction{precede} and \Rfunction{follow} functions
find the non-overlapping nearest neighbors on a specific side.

\subsection{Transforming Ranges}

Utilities are available for transforming a \Rclass{Ranges} object in a
variety of ways. Some transformations, like \Rfunction{reduce}
introduced above, can be dramatic, while others are simple per-range
adjustments of the starts, ends or widths.

\subsubsection{Adjusting starts, ends and widths}

Perhaps the simplest transformation is to adjust the start values by a
scalar offset, as performed by the \Rfunction{shift} function. Below,
we shift all ranges forward 10 positions.
%
<<ranges-shift>>=
shift(ir, 10)
@

There are several other ways to transform ranges.  These include
\Rfunction{narrow}, \Rfunction{resize}, \Rfunction{flank},
\Rfunction{reflect}, \Rfunction{restrict}, and \Rfunction{threebands}.
For example \Rfunction{narrow} supports the adjustment of start, end
and width values, which should be relative to each range. These adjustments
are vectorized over the ranges. As its name suggests, the ranges can only be
narrowed.
% 
<<ranges-narrow>>=
narrow(ir, start=1:5, width=2)
@

The \Rfunction{restrict} function ensures every range falls within a
set of bounds. Ranges are contracted as necessary, and the ranges that
fall completely outside of but not adjacent to the bounds are dropped,
by default.
% 
<<ranges-restrict>>=
restrict(ir, start=2, end=3)
@
The \Rfunction{threebands} function extends \Rfunction{narrow} so that
the remaining left and right regions adjacent to the narrowed region
are also returned in separate \Rclass{Ranges} objects.
% 
<<ranges-threebands>>=
threebands(ir, start=1:5, width=2)
@

The arithmetic operators \Rfunction{+}, \Rfunction{-} and
\Rfunction{*} change both the start and the end/width by symmetrically
expanding or contracting each range.  Adding or subtracting a numeric
(integer) vector to a \Rclass{Ranges} causes each range to be expanded
or contracted on each side by the corresponding value in the numeric
vector.
<<ranges-plus>>=
ir + seq_len(length(ir))
@
%
The \Rfunction{*} operator symmetrically magnifies a \Rclass{Ranges}
object by a factor, where positive contracts (zooms in) and negative
expands (zooms out).
%
<<ranges-asterisk>>=
ir * -2 # double the width
@

WARNING: The semantic of these arithmetic operators might be revisited
at some point. Please restrict their use to the context of interactive
visualization (where they arguably provide some convenience) but avoid
to use them programmatically.

\subsubsection{Making ranges disjoint}

A more complex type of operation is making a set of ranges disjoint,
\textit{i.e.} non-overlapping. For example, \Rfunction{threebands}
returns a disjoint set of three ranges for each input range.

The \Rfunction{disjoin} function makes a \Rclass{Ranges} object
disjoint by fragmenting it into the widest ranges where the set of
overlapping ranges is the same.

<<ranges-disjoin, fig=TRUE, include=FALSE, eps=FALSE, height=2.25>>=
disjoin(ir)
plotRanges(disjoin(ir))
@

\begin{figure}[tb]
  \begin{center}
     \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-disjoin}  
     \caption{\label{fig-ranges-disjoin}%
      Plot of disjoined ranges.}
  \end{center}
\end{figure}

A variant of \Rfunction{disjoin} is \Rfunction{disjointBins}, which
divides the ranges into bins, such that the ranges in each bin are
disjoint. The return value is an integer vector of the bins.
<<ranges-disjointBins>>=
disjointBins(ir)
@

\subsubsection{Other transformations}

Other transformations include \Rfunction{reflect} and
\Rfunction{flank}. The former ``flips'' each range within a set of common
reference bounds.
<<ranges-reflect>>=
reflect(ir, IRanges(start(ir), width=width(ir)*2))
@
%
The \Rfunction{flank} returns ranges of a specified width that flank,
to the left (default) or right, each input range. One use case of this
is forming promoter regions for a set of genes.
<<ranges-flank>>=
flank(ir, width = seq_len(length(ir)))
@
%

\subsection{Set Operations}

Sometimes, it is useful to consider a \Rclass{Ranges} object as a set
of integers, although there is always an implicit ordering. This is
formalized by \Rclass{NormalIRanges}, above, and we now present
versions of the traditional mathematical set operations
\textit{complement}, \textit{union}, \textit{intersect}, and
\textit{difference} for \Rclass{Ranges} objects. There are two
variants for each operation. The first treats each \Rclass{Ranges}
object as a set and returns a \textit{normal} value, while the other
has a ``parallel'' semantic like \Rfunction{pmin}/\Rfunction{pmax} and
performs the operation for each range pairing separately.

The \textit{complement} operation is implemented by the
\Rfunction{gaps} and \Rfunction{pgap} functions. By default,
\Rfunction{gaps} will return the ranges that fall between the ranges
in the (normalized) input. It is possible to specify a set of bounds,
so that flanking ranges are included.
<<ranges-gaps, fig=TRUE, include=FALSE, eps=FALSE, height=2.25>>=
gaps(ir, start=1, end=50)
plotRanges(gaps(ir, start=1, end=50), c(1,50))
@

\begin{figure}[tb]
  \begin{center}
     \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-gaps}  
     \caption{\label{fig-ranges-gap}%
      Plot of gaps from ranges.}
  \end{center}
\end{figure}


\Rfunction{pgap} considers each parallel pairing between two
\Rclass{Ranges} objects and finds the range, if any, between
them. Note that the function name is singular, suggesting that only
one range is returned per range in the input.
<<ranges-pgap>>=
@

The remaining operations, \textit{union}, \textit{intersect} and
\textit{difference} are implemented by the \Rfunction{[p]union},
\Rfunction{[p]intersect} and \Rfunction{[p]setdiff} functions,
respectively. These are relatively self-explanatory.
<<ranges-union>>=
@
<<ranges-punion>>=
@
<<ranges-intersect>>=
@
<<ranges-pintersect>>=
@
<<ranges-setdiff>>=
@
<<ranges-psetdiff>>=
@

% \subsection{Mapping Ranges Between Vectors}

% \Rfunction{mapCoords}

\section{Vector Views}

The \IRanges{} package provides the virtual \Rclass{Views} class, which
stores a sequence together with an \Rclass{IRanges} object defining ranges
on the sequence. Each range is said to represent a \textit{view} onto
the sequence.

Here, we will demonstrate the \Rclass{RleViews} class, where the
sequence is of class \Rclass{Rle}. Other \Rclass{Views}
implementations exist, such as \Rclass{XStringViews} in the
\Rpackage{Biostrings} package.

\subsection{Creating Views}

There are two basic constructors for creating views: the \Rfunction{Views}
function based on indicators and the \Rfunction{slice} based on numeric
boundaries.

<<Views-constructors>>=
xViews <- Views(xRle, xRle >= 1)
xViews <- slice(xRle, 1)
xViewsList <- slice(xRleList, 1)
@

\subsection{Aggregating Views}

While \Rfunction{sapply} can be used to loop over each window, the native
functions \Rfunction{viewMaxs}, \Rfunction{viewMins}, \Rfunction{viewSums},
and \Rfunction{viewMeans} provide fast looping to calculate their respective
statistical summaries.

<<views-looping>>=
head(viewSums(xViews))
viewSums(xViewsList)
head(viewMaxs(xViews))
viewMaxs(xViewsList)
@

\section{Data on Ranges}

IMPORTANT NOTE: Starting with BioC 2.12, the use of \Rclass{RangedData} and
\Rclass{RangedDataList} objects is discouraged in favor of \Rclass{GRanges}
and \Rclass{GRangesList} objects (those classes are defined in the
\Rpackage{GenomicRanges} package).

When analyzing ranges, there are often additional variables of
interest, besides the geometry (starts, ends and widths). To formally
represent a dataset where the ranges are the observations, \IRanges{}
defines the \Rclass{RangedData} class.

%\subsection{Manipulating \Rclass{RangedData}}

To create a \Rclass{RangedData} instance, one needs to provide a
\Rclass{Ranges} object and, optionally, any number of variables on
those ranges. The variable objects need not be vectors, but they must
satisfy the contract of \Rclass{DataFrame}.
<<RangedData-construct>>=
values <- rnorm(length(ir))
rd <- RangedData(ir, name = letters[seq_len(length(ir))], values)
rd
@

One might notice the term ``sequence'' in the above output. This
refers to an important feature of \Rclass{RangedData}: the ability to
segregate ranges by their sequence (or space). For example, when
analyzing genomic data, one is often working with ranges on different
chromosomes. In many cases, such as when calculating overlap, one
needs to separately treat ranges from different spaces, and
\Rclass{RangedData} aims to facilitate this mode of operation. The
segregation may be performed at construction time.
<<RangedData-construct-space>>=
rd <- RangedData(ir, name = letters[seq_len(length(ir))], values, 
                 space = rep(c("chr1", "chr2"), c(3, length(ir) - 3)))
rd
@

With the knowledge that the data is split into spaces, it should not
be surprising that the \Rfunction{ranges} accessor returns a
\Rclass{RangesList} and \Rfunction{values} returns a
\Rclass{SplitDataFrameList}.
<<RangedData-ranges>>=
ranges(rd)
@
<<RangedData-values>>=
values(rd)
@

To obtain a \Rclass{RangedData} for a specific set of spaces, one should
use the \Rfunction{[} function, which accepts logical, numeric and
character indices.
<<RangedData-subspace>>=
rd["chr1"]
@
%
<<RangedData-subspace-2>>=
all(identical(rd["chr1"], rd[1]),
    identical(rd[1], rd[c(TRUE, FALSE)]))
@
The \Rfunction{names} and \Rfunction{length} functions return the
names and number of spaces, respectively.
<<RangedData-names>>=
names(rd)
@
<<RangedData-length>>=
length(rd)
@

The \Rfunction{lapply} function operates over the spaces. The object
passed to the user function is a subset \Rclass{RangedData}.
<<RangedData-lapply>>=
lapply(rd, names)
@

The above would suggest that \Rclass{RangedData} is a sequence of
spaces. However, \Rclass{RangedData} also inherits from
\Rclass{DataTable}, so it in some ways behaves like a sequence of
columns. For example, one can extract a column via \Rfunction{\$} or
\Rfunction{[[}.
<<RangedData-extract>>=
rd[[2]]
@
<<RangedData-dollar>>=
rd$values
@
Note that the extracted columns are ``unlisted'' over the spaces,
which is usually much more convenient than obtaining them as lists. It
is important to note that the elements have been sorted by
the space factor and thus may not have the same order as the
objects passed to the constructor.

The two dimensional matrix-style subsetting is also supported. The
rows are indexed globally, independent of space.
<<RangedData-subset-2d>>=
rd[1:3, "name"]
@

\section{IRanges in Biological Sequence Analysis}

The \IRanges{} packages was primarily designed with biological sequence
analysis in mind and Table \ref{table:bioseq} shows how some biological
sequence analysis concepts are represented in the \IRanges{} class system.

\begin{table}[ht]
\begin{center}
\begin{tabular}{l|l}
\hline
Biological Entity                        & \Rclass{Vector} Subclass \\
\hline
Genome browser track(s)                  & \Rclass{GRanges}/\Rclass{GRangesList} \\
Coverage across chromosomes/contigs      & \Rclass{RleList} \\
Mapped ranges to genome                  & \Rclass{CompressedIRangesList} \\
Data (sans ranges) across chroms/contigs & \Rclass{SplitDataFrameList} \\
\hline
\end{tabular}
\end{center}
\caption{\Rclass{Vector} subclasses for Biological Sequence Analysis}
\label{table:bioseq}
\end{table}

\pagebreak[4]

\section{Session Information}

\begin{table*}[tbp]
\begin{minipage}{\textwidth}
<<sessionInfo, results=tex, print=TRUE>>=
toLatex(sessionInfo())
@
\end{minipage}
\caption{\label{tab:sessioninfo}%
The output of \Rfunction{sessionInfo} on the build system 
after running this vignette.}
\end{table*}

\end{document}
r-bioc-iranges 2.4.1-1 / usr / lib / R / site-library / IRanges / doc / IRangesOverview.Rnw