This file is indexed.

/usr/lib/R/site-library/Biostrings/doc/matchprobes.Rnw is in r-bioc-biostrings 2.46.0-1.

This file is owned by root:root, with mode 0o644.

The actual contents of the file can be viewed below.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
%\VignetteIndexEntry{Handling probe sequence information}
%\VignetteDepends{Biostrings, hgu95av2probe, hgu95av2cdf, affy, affydata}
%\VignetteKeywords{Expression Analysis}
%\VignettePackage{Biostrings}

\documentclass[11pt]{article}
\usepackage[margin=2cm,nohead]{geometry}

\usepackage{color}
\definecolor{darkblue}{rgb}{0.0,0.0,0.75}
\usepackage[%
baseurl={http://www.bioconductor.org},%
pdftitle={Basic infrastructure for using oligonucleotide microarray reporter 
sequence information for preprocessing and quality assessment},%
pdfauthor={Wolfgang Huber and Robert Gentleman},%
pdfkeywords={Bioconductor},%
pagebackref,bookmarks,colorlinks,linkcolor=darkblue,citecolor=darkblue,%
pagecolor=darkblue,raiselinks,plainpages,pdftex]{hyperref}

\SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE} 

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rfile}[1]{{\texttt{#1}}}

\newcommand{\myincfig}[3]{%
  \begin{figure}[tb]
    \begin{center}
      \includegraphics[width=#2]{#1}
      \caption{\label{#1}#3}
    \end{center}
  \end{figure}
}


\begin{document}
\title{Using oligonucleotide microarray reporter sequence information
for preprocessing and quality assessment}
\author{Wolfgang Huber and Robert Gentleman}
\maketitle
\tableofcontents

%------------------------------------------------------------
\section{Overview}
%------------------------------------------------------------
This document presents some basic and simple tools for dealing with
the oligonucleotide microarray reporter sequence information in the
Bioconductor \textit{probe} packages. This information is used, for
example, in the \Rpackage{gcrma} package.

\textit{Probe} packages are a convenient way for distributing and
storing the probe sequences (and related information) of a given
chip. 

As an example, the package \Rpackage{hgu95av2probe} provides microarray
reporter sequences for Affymetrix' \textit{HgU95a version 2} genechip,
and for almost all major Affymetrix genechips, the corresponding
packages can be downloaded from the Bioconductor website. If you
have the reporter sequence information of a particular chip, you can
also create such a package yourself. This is described in
in the makeProbePackage vignette of the \Rpackage{AnnotationForge}
package.

This document assumes some basic familiarity with R and with the design
of the \Rclass{AffyBatch} class in the \Rpackage{affy} package,
Bioconductor's basic container for Affymetrix genechip data.

First, let us load the \Rpackage{Biostrings} package and some other packages 
we will use.
%
<<startup,results=hide>>=
library(Biostrings)
library(hgu95av2probe)
library(hgu95av2cdf)
@

%------------------------------------------------------------
\section{Using probe packages}\label{sec.prbpkg}
%------------------------------------------------------------
Help for the probe sequence data packages can be accessed through
\begin{Schunk}
\begin{Sinput}
> ? hgu95av2probe
\end{Sinput}
\end{Schunk}

One of the issues that you have to deal
with is that the \textit{probe} packages do not provide the reporter
sequences of all the features present in an \Rclass{AffyBatch}. Some
sequences are missing, some are implied; in particular, the data
structure in the \textit{probe} packages does not explicitely
contain the sequences of the mismatch probes, since they are implied
by the perfect match probes. Also, some other features, typically
harbouring control probes or empty, do not have sequences. This is the
choice that Affymetrix made when they made files with probe sequences
available, and we followed it.

Practically, this means that the vector of probe sequences in a
\textit{probe} package does not align 1:1 with the rows of the
corresponding \Rpackage{AffyBatch}; you need to keep track of this
mapping, and some tools for this are provided and explained below (see
Section~\ref{subsec.relating}). It also means that some functions from
the \Rpackage{affy} package, such as \Rfunction{pm}, cannot be used
when the sequences of the probes corresponding to their result are
needed; since \Rfunction{pm} reports the intensities, but not the
identity of the probes it has selected, yet the latter would be needed
to retrieve their sequences.

%------------------------------------------------------------
\subsection{Basic functions}\label{subsec.fcts}
%------------------------------------------------------------
Let us look at some basic operations on the sequences.

\subsubsection{Reverse and complementary sequence}

DNA sequences can be reversed and/or complemented with the
\Rfunction{reverse}, \Rfunction{complement} and
\Rfunction{reverseComplement} functions.
\begin{Schunk}
\begin{Sinput}
> ? reverseComplement
\end{Sinput}
\end{Schunk}

\subsubsection{Matching sets of probes against each other}
<<matchprobes>>=
pm <- DNAStringSet(hgu95av2probe)
dict <- pm[3801:4000]
pdict <- PDict(dict)
m <- vcountPDict(pdict, pm)
dim(m)
table(rowSums(m))
which(rowSums(m) == 3)
ii <- which(m[77, ] != 0)
pm[ii]
@

\subsubsection{Base content}
The base content (number of occurrence of each character) of the
sequences can be computed with the function
\Rfunction{alphabetFrequency}.
%
<<basecontent>>=
bcpm <- alphabetFrequency(pm, baseOnly=TRUE)
head(bcpm)
alphabetFrequency(pm, baseOnly=TRUE, collapse=TRUE)
@


%------------------------------------------------------------
\subsection{Relating to the features of an \Rclass{AffyBatch}}
\label{subsec.relating}
%------------------------------------------------------------
<<hgu95av2dim,print=TRUE>>=
nc = hgu95av2dim$NCOL
nr = hgu95av2dim$NROW
@ 
Each column of an \Rclass{AffyBatch} corresponds to an array, each row
to a certain probe on the arrays. The probes are stored in a way that is
related to their geometrical position on the array. For example, the
\textit{hgu95av2} array is geometrically arranged into \Sexpr{nc} columns
and \Sexpr{nr} rows. We call the column and row indices the $x$- and
$y$-coordinates, respectively.
This results in \Sexpr{nc} $\times$ \Sexpr{nr} $=$
\Sexpr{nc*nr} probes of the \Rclass{AffyBatch}; we
also call them indices. To convert between $x$- and $y$-coordinates and
indices, you can use the functions \Rfunction{xy2indices} and
\Rfunction{indices2xy} from the \Rpackage{affy} package.

The sequence data in the \textit{probe} packages is addressed by their
$x$ and $y$--coordinates. Let us construct a vector \Robject{abseq}
that aligns with the indices of an \textit{hgu95av2} \Rclass{AffyBatch}
and fill in the sequences:
<<>>=
library(affy)
abseq = rep(as.character(NA), nc*nr) 
ipm = with(hgu95av2probe, xy2indices(x, y, nc=nc))
any(duplicated(ipm))  # just a sanity check
abseq[ipm] = hgu95av2probe$sequence
table(is.na(abseq))
@ 

The mismatch sequences are not explicitely stored in the probe
packages. They are implied by the match sequences, by flipping the
middle base. This can be done with the \Rfunction{pm2mm} function
defined below. For Affymetrix GeneChips the length of the
probe sequences is 25, and since we start counting at 1, the middle
position is 13.
%
<<pm2mm,>>=
mm <- pm
subseq(mm, start=13, width=1) <- complement(subseq(mm, start=13, width=1))
cat(as.character(pm[[1]]), as.character(mm[[1]]), sep="\n")
@
%
We compute \Robject{imm}, the indices of the mismatch probes, by
noting that each mismatch has the same $x$-coordinate as its
associated perfect match, while its $y$-coordinate is increased by 1.
%
<<mismatchSeq>>=
imm = with(hgu95av2probe, xy2indices(x, y+1, nc=nc))
intersect(ipm, imm)  # just a sanity check
abseq[imm] = as.character(mm)
table(is.na(abseq))
@
%
See Figures~\ref{matchprobes-bap}--\ref{matchprobes-p2p} for some applications 
of the probe sequence information to preprocessing and data quality related plots.


%------------------------------------------------------------
\section{Some sequence related ``preprocessing and quality'' plots}
\label{subsec.prqplots}
%------------------------------------------------------------
The function \Rfunction{alphabetFrequency} counts the number of
occurrences of each of the four bases A, C, G, T in each probe sequence.
<<bc>>=
freqs <- alphabetFrequency(DNAStringSet(abseq[!is.na(abseq)]), baseOnly=TRUE)
bc <- matrix(nrow=length(abseq), ncol=5)
colnames(bc) <- colnames(freqs)
bc[!is.na(abseq), ] <- freqs
head(na.omit(bc))
@ 
%
Let us define an ordered factor variable for GC content:
%
<<GC>>=
GC = ordered(bc[,"G"] + bc[,"C"])
colores = rainbow(nlevels(GC))
@

And let us create an \Rclass{AffyBatch} object.
<<abatch>>=
library(affydata)
f <- system.file("extracelfiles", "CL2001032020AA.cel", package="affydata")
pd <- new("AnnotatedDataFrame", data=data.frame(fromFile=I(f), row.names="f"))
abatch <- read.affybatch(filenames=f, compress=TRUE, phenoData=pd)
@

%
Figure~\ref{matchprobes-bap} shows a barplot of the frequencies of 
counts in \Robject{GC}:
%
<<bap,fig=TRUE,width=8,height=5>>=
barplot(table(GC), col=colores, xlab="GC", ylab="number")
@ 

Figure~\ref{matchprobes-bxp}:
<<bxp,fig=TRUE,width=8,height=5>>=
boxplot(log2(exprs(abatch)[,1]) ~ GC, outline=FALSE,
  col=colores, , xlab="GC", ylab=expression(log[2]~intensity))
@ 

Figure~\ref{matchprobes-p2p}:
<<p2p, results=hide>>=
png("matchprobes-p2p.png", width=900, height=480)
plot(exprs(abatch)[ipm,1], exprs(abatch)[imm,1], asp=1, pch=".", log="xy",
     xlab="PM", ylab="MM", col=colores[GC[ipm]])
abline(a=0, b=1, col="#404040", lty=3)
dev.off()
@
%
\myincfig{matchprobes-bap}{0.7\textwidth}{Distribution of probe GC content.
  The height of each bar corresponds to the number of probes with the 
  corresponing GC content.}
\myincfig{matchprobes-bxp}{0.7\textwidth}{Boxplots of $\log_2$ intensity 
  stratified by probe GC content.}
\myincfig{matchprobes-p2p}{0.7\textwidth}{Scatterplot of PM vs MM intensities, colored 
  by probe GC content.}

\end{document}