Analysis of Locust Data Set
Table of Contents
- 1. How to proceed?
- 2. Preliminary Analysis
- 3. Data renormalization
- 4. Interactive data exploration
- 5. Spike detection
- 6. Cuts
- 7. Getting "clean" events
- 8. Dimension reduction
- 9. Clustering
- 10. "Brute force" superposition resolution
- 11. Getting the spike trains
- 12. Custom developed code
- 12.1. Individual function definitions
  - 12.1.1. cstValueSgts definition
  - 12.1.2. Definitions of generic function and time-series specific method for data exploration
  - 12.1.3. peaks and associated methods definition
  - 12.1.4. Definition of an explore method for time series and eventsPos objects
  - 12.1.5. Function cutSglEvt
  - 12.1.6. Function mkEvents
  - 12.1.7. events methods
  - 12.1.8. mkNoise definition
  - 12.1.9. Definition of an explore method for pca results
  - 12.1.10. Definition of get_jitter
  - 12.1.11. Definition of mk_aligned_events
  - 12.1.12. Definition of mk_center_list
  - 12.1.13. Definition of classify_and_align_evt
  - 12.1.14. Definition of predict_data
Sys.setlocale(category="LC_MESSAGES",locale="C")
[1] "C"
1 How to proceed?
The locust data set is located at the following web site: http://xtof.disque.math.cnrs.fr/data. So you should start by downloading the four files Locust_x.dat.gz (where x = 1, 2, 3, 4) into some directory where you will perform your analysis (see the second sub-section).
1.1 Loading the data
The data are simply loaded, once we know where to find them and if we do not forget that they have been compressed with gzip. The data are stored with one file per recording channel in double format. They were sampled at 15 kHz and the recording lasts 20 s. Since they are compressed, we cannot read them directly into R from the repository and we must download them first:
reposName <-"http://xtof.disque.math.cnrs.fr/data/" dN <- paste("Locust_",1:4,".dat.gz",sep="") sapply(1:4, function(i) download.file(paste(reposName,dN[i],sep=""), dN[i],mode="wb") )
Once the data are in our working directory, we can load them into our R workspace:
dN <- paste("Locust_",1:4,".dat.gz",sep="")
nb <- 20*15000
lD <- sapply(dN, function(n) {
    mC <- gzfile(n,open="rb")
    x <- readBin(mC,what="double",n=nb)
    close(mC)
    x
})
colnames(lD) <- paste("site",1:4)
We can check that our lD object has the correct dimension:
dim(lD)
[1] 300000 4
1.2 Getting the code
The individual functions developed for this kind of analysis are defined at the end of this document (Sec. Individual function definitions). They can also be downloaded as a single file sorting_with_r.R which must then be sourced with:
source("sorting_with_r.R")
where it is assumed that the working directory of your R session is the directory where sorting_with_r.R was downloaded.
2 Preliminary Analysis
We are going to start our analysis by some "sanity checks" to make sure that nothing "weird" happened during the recording.
2.1 Five number summary
We should start by getting an overall picture of the data like the one provided by the summary method of R, which outputs a five-number summary plus the mean. The five numbers are the minimum, the first quartile, the median, the third quartile and the maximum:
summary(lD,digits=2)
     site 1            site 2            site 3            site 4      
 Min.   :-9.074    Min.   :-8.229    Min.   :-6.890    Min.   :-7.35  
 1st Qu.:-0.371    1st Qu.:-0.450    1st Qu.:-0.530    1st Qu.:-0.49  
 Median :-0.029    Median :-0.036    Median :-0.042    Median :-0.04  
 Mean   : 0.000    Mean   : 0.000    Mean   : 0.000    Mean   : 0.00  
 3rd Qu.: 0.326    3rd Qu.: 0.396    3rd Qu.: 0.469    3rd Qu.: 0.43  
 Max.   :10.626    Max.   :11.742    Max.   : 9.849    Max.   :10.56  
We see that the data range (maximum - minimum) is similar (close to 20) on the four recording sites. The inter-quartile ranges are also similar.
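To make this statement concrete, the ranges themselves can be computed directly (a simple check, not part of the original analysis):
apply(lD,2,function(x) diff(range(x))) ## data range (max - min) on each site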
2.2 Were the data normalized?
We can check next if some processing like a division by the standard deviation (SD) has been applied:
apply(lD,2,sd)
site 1 site 2 site 3 site 4 
     1      1      1      1 
2.3 Discretization step amplitude
We clearly see that these data have been scaled, that is, normalized to have an SD of 1. Since the data have been digitized we can easily obtain the apparent size of the digitization step:
apply(lD,2, function(x) min(diff(sort(unique(x)))))
     site 1      site 2      site 3      site 4 
0.006709845 0.009194500 0.011888433 0.009614042 
2.4 Detecting saturation
Before embarking on a comprehensive analysis of data that we did not record ourselves (or that we recorded so long ago that we do not remember any "remarkable" event concerning them), it can be wise to check that no amplifier or A/D card saturation occurred. We can quickly check for that by looking at the length of the longest segment of constant value. When saturation occurs the recorded value stays for many sampling points at the same upper or lower saturating level. We use function cstValueSgts, which does this job.
ndL <- lapply(1:4,function(i) cstValueSgts(lD[,i]))
sapply(ndL, function(l) max(l[2,]))
[1] 2 2 2 2
We see that for each recording site, the longest segment of constant value is two sampling points long, that is 2/15 ms. There is no ground to worry about saturation here.
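A rough cross-check can be done with base R's rle(), counting the longest run of exactly repeated values on each site; this is not identical to cstValueSgts (which uses a smoothed derivative) but it should flag saturation just as well:
sapply(1:4, function(i) max(rle(lD[,i])$lengths)) ## longest run of identical consecutive values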
2.5 Plot the data
We are going to profit from the time series (ts, and mts for multiple time series) objects of R by redefining our lD matrix as:
lD <- ts(lD,start=0,freq=15e3)
It is then straightforward to plot the whole data set:
plot(lD)
Figure 1: The whole (20 s) locust data set.
It is also good to "zoom in" and look at the data with a finer time scale:
plot(window(lD,start=0,end=0.2))
Figure 2: First 200 ms of the locust data set.
3 Data renormalization
We are going to use a median absolute deviation (MAD) based renormalization. The goal of the procedure is to scale the raw data such that the noise SD is approximately 1. Since it is not straightforward to obtain a noise SD on data where both signal (i.e., spikes) and noise are present, we use this robust type of statistic for the SD. Luckily this is simply obtained in R:
lD.mad <- apply(lD,2,mad)
lD <- t((t(lD)-apply(lD,2,median))/lD.mad)
lD <- ts(lD,start=0,freq=15e3)
where the last line of code ensures that lD is still an mts object. We can check on a plot how the MAD and the SD compare:
plot(window(lD[,1],0,0.2))
abline(h=c(-1,1),col=2)
abline(h=c(-1,1)*sd(lD[,1]),col=4,lty=2,lwd=2)
Figure 3: First 200 ms on site 1 of the locust data set. In red: +/- the MAD; in dashed blue: +/- the SD.
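Before moving on, we can verify that the renormalization did what we expect: the MAD of each trace should now be exactly 1 and the median 0 (a quick sanity check, not in the original text):
apply(lD,2,mad)    ## should all be 1 after renormalization
apply(lD,2,median) ## should all be 0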
3.1 A quick check that the MAD "does its job"
We can check that the MAD does its job as a robust estimate of the noise standard deviation by looking at Q-Q plots of the whole traces normalized with the MAD and normalized with the "classical" SD.
lDQ <- apply(lD,2,quantile, probs=seq(0.01,0.99,0.01))
lDnormSD <- apply(lD,2,function(x) x/sd(x))
lDnormSDQ <- apply(lDnormSD,2,quantile, probs=seq(0.01,0.99,0.01))
qq <- qnorm(seq(0.01,0.99,0.01))
matplot(qq,lDQ,type="n",xlab="Normal quantiles",ylab="Empirical quantiles")
abline(0,1,col="grey70",lwd=3)
col=c("black","orange","blue","red")
matlines(qq,lDnormSDQ,lty=2,col=col)
matlines(qq,lDQ,lty=1,col=col)
rm(lDnormSD,lDnormSDQ)
Figure 4: Performances of MAD based vs SD based normalizations. After normalizing the data of each recording site by its MAD (plain colored curves) or its SD (dashed colored curves), Q-Q plots against a standard normal distribution were constructed. Colors: site 1, black; site 2, orange; site 3, blue; site 4, red.
We see that the behavior of the "away from normal" fraction is much more homogeneous, for small as well as for large quantile values, with the MAD normalized traces than with the SD normalized ones. If we consider automatic rules like the three sigmas, we are going to reject fewer events (i.e., get fewer putative spikes) with the SD based normalization than with the MAD based one.
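We can make this concrete by counting, for each site, how many sampling points exceed 3 in absolute value under each normalization (a quick illustration, not in the original text; the MAD normalized traces should yield more threshold crossings):
apply(abs(lD) > 3, 2, sum)                           ## MAD normalized (current lD)
apply(lD, 2, function(x) sum(abs(x/sd(x)) > 3))      ## SD normalized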
4 Interactive data exploration
Although we can't properly illustrate this key step in a "static" document, it is absolutely necessary to look at the data in detail using the explore method (the corresponding function definition can be found in Sec. Definitions of generic function and time-series specific method for data exploration):
explore(lD,col=c("black","grey70"))
Upon using this command the user is invited to move forward (typing "n" + RETURN or simply RETURN), backward (typing "f" + RETURN), to change the abscissa or ordinate scale, etc.
5 Spike detection
We are going to filter the data slightly using a "box" filter of length 5. That is, the data points of the original trace are going to be replaced by the average of themselves with their two nearest neighbors. We will then scale the filtered traces such that the MAD is one on each recording site and keep only the parts of the signal which are above 4:
lDf <- filter(lD,rep(1,5)/5)
lDf[is.na(lDf)] <- 0
lDf.mad <- apply(lDf,2,mad)
lDf <- t(t(lDf)/lDf.mad)
thrs <- c(4,4,4,4)
bellow.thrs <- t(t(lDf) < thrs)
lDfr <- lDf
lDfr[bellow.thrs] <- 0
remove(lDf)
We can see the difference between the raw trace and the filtered and rectified one on which spikes are going to be detected with:
plot(window(lD[,1],0,0.2))
abline(h=4,col=4,lty=2,lwd=2)
lines(window(ts(lDfr[,1],start=0,freq=15e3),0,0.2),col=2)
Figure 5: First 200 ms on site 1 of data set lD. The raw data are shown in black, the detection threshold appears in dashed blue and the filtered and rectified trace on which spike detection is going to be performed appears in red.
Spikes are then detected as local maxima on the summed, filtered and rectified traces. To do that we use a peak detecting function defined in Sec. peaks and associated methods definition.
sp0 <- peaks(apply(lDfr,1,sum),15)
The returned object, sp0, is essentially a vector of integers containing the indexes of the detected spikes. To facilitate handling, it is in addition defined as an object of class eventsPos, meaning that entering its name on the command line and typing return, that is, calling the print method on the object, gives a short description of it:
sp0
eventsPos object with indexes of 1795 events. 
  Mean inter event interval: 166.81 sampling points, corresponding SD: 143.79 sampling points 
  Smallest and largest inter event intervals: 16 and 1157 sampling points.
We see that 1795 (length(sp0)) events were detected. Since the mean inter event interval is very close to the SD, the "compound process" (since it's likely to be the sum of the activities of many neurons) is essentially Poisson.
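One informal way to look at this Poisson-like character is to compare the empirical inter-event intervals with an exponential distribution of the same mean (a quick sketch, not part of the original analysis):
isi <- diff(unclass(sp0)) ## inter event intervals in sampling points
qqplot(qexp(ppoints(length(isi)), rate=1/mean(isi)), isi,
       xlab="Exponential quantiles", ylab="Empirical ISI quantiles")
abline(0,1,col=2)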
5.1 Interactive spike detection check
We can interactively check the detection quality with the tailored explore method (defined in Sec. Definition of an explore method for time series and eventsPos objects):
explore(sp0,lD,col=c("black","grey50"))
That leads to a display very similar to the one previously obtained with explore(lD), except that the detected events appear superposed on the raw data as red dots.
5.2 Remove useless objects
Since we are not going to use lDfr anymore we can save memory by removing it:
remove(lDfr)
5.3 Data set split
In order to get stronger checks for our procedure and to illustrate better how it works, we are going to split our data set in two parts, establish our model on the first and use this model on both parts:
(sp0E <- as.eventsPos(sp0[sp0 <= dim(lD)[1]/2]))
(sp0L <- as.eventsPos(sp0[sp0 > dim(lD)[1]/2]))
eventsPos object with indexes of 908 events. 
  Mean inter event interval: 164.88 sampling points, corresponding SD: 139.06 sampling points 
  Smallest and largest inter event intervals: 16 and 931 sampling points. 

eventsPos object with indexes of 887 events. 
  Mean inter event interval: 168.72 sampling points, corresponding SD: 148.59 sampling points 
  Smallest and largest inter event intervals: 16 and 1157 sampling points.
We see that eventsPos objects can be sub-set like classical vectors. We also see that the sub-setting based on total time results in sets with roughly the same number of events.
6 Cuts
6.1 Getting the "right" length for the cuts
After detecting our spikes, we must make our cuts in order to create our events' sample. That is, for each detected event we literally cut a piece of data, and we do that on the four recording sites. To this end we use function mkEvents which, in addition to an eventsPos argument (sp0E) and a "raw data" argument (lD), takes an integer argument (before) stating how many sampling points we want to keep within the cut before the reference time, as well as another integer argument (after) stating how many sampling points we want to keep within the cut after the reference time. The function returns essentially a matrix where each event is a column. The cuts on the different recording sites are put one after the other when the event is built. The obvious question we must first address is: How long should our cuts be? The pragmatic way to get an answer is:
- Make cuts much longer than what we think is necessary, like 50 sampling points on both sides of the detected event's time.
- Compute robust estimates of the "central" event (with the median) and of the dispersion of the sample around this central event (with the MAD).
- Plot the two together and check when the MAD trace reaches the background noise level (at 1 since we have normalized the data).
- Having the central event allows us to see if it outlasts significantly the region where the MAD is above the background noise level.
Clearly, cutting beyond the time at which the MAD hits back the noise level should not bring any useful information as far as classifying the spikes is concerned. So here we perform this task as follows:
evtsE <- mkEvents(sp0E,lD,49,50)
evtsE.med <- median(evtsE)
evtsE.mad <- apply(evtsE,1,mad)
plot(evtsE.med,type="n",ylab="Amplitude")
abline(v=seq(0,400,10),col="grey")
abline(h=c(0,1),col="grey")
lines(evtsE.med,lwd=2)
lines(evtsE.mad,col=2,lwd=2)
Figure 6: Robust estimates of the central event (black) and of the sample's dispersion around the central event (red) obtained with "long" (100 sampling points) cuts. We see clearly that the dispersion is back to noise level 15 points before the peak and 30 points after the peak (on all sites). We also see that the median event is not back to zero 50 points after the peak; we will have to keep this information in mind when we look for superpositions.
6.2 Events
Once we are satisfied with our spike detection, at least in a provisory way, and have decided on the length of our cuts, we proceed by making cuts around the detected events:
evtsE <- mkEvents(sp0E,lD,14,30)
Here we have decided to keep 14 points before and 30 points after our reference times. evtsE is a bit more than a matrix; it is an object of class events, meaning that a summary method is available:
summary(evtsE)
events object deriving from data set: lD.
 Events defined as cuts of 45 sampling points on each of the 4 recording sites.
 The 'reference' time of each event is located at point 15 of the cut.
 There are 908 events in the object.
A print method, which calls the plot method, is also available, giving:
evtsE[,1:200]
Figure 7: First 200 events of evtsE. Cuts from the four recording sites appear one after the other. The background (white / grey) changes with the site. In red, robust estimate of the "central" event obtained by computing the pointwise median. In blue, robust estimate of the scale (SD) obtained by computing the pointwise MAD.
Like eventsPos objects, events objects can be sub-set like usual matrices. Notice that a rather sophisticated plot was obtained with an extremely simple command: the beauty of the R class / method mechanism in action.
6.3 Noise
Getting an estimate of the noise statistical properties is an essential ingredient to build respectable goodness of fit tests. In our approach, "noise events" are essentially anything that is not an "event" in the sense of the previous section. I wrote essentially and not exactly since there is a little twist here, which is the minimal distance we are willing to accept between the reference time of a noise event and the reference time of the last preceding and of the first following "event". We could think that keeping a cut length on each side would be enough. That would indeed be the case if all events were starting from and returning to zero within a cut. But this is not the case with the cut parameters we took previously (that will become clear soon). You might wonder why we chose so short a cut length then. Simply to avoid having to deal with too many superposed events, which are the really bothersome events for anyone wanting to do proper sorting.
To obtain our noise events we are going to use function mkNoise (defined in Sec. mkNoise definition), which takes the same arguments as function mkEvents plus two numbers: safetyFactor, a number by which the cut length is multiplied and which sets the minimal distance between the reference times discussed in the previous paragraph, and size, the maximal number of noise events one wants to cut (the actual number obtained might be smaller depending on the data length, the cut length, the safety factor and the number of events).
We cut next noise events with a rather large safety factor:
noiseE <- mkNoise(sp0E,lD,14,30,safetyFactor=2.5,2000)
Here noiseE is also an events object and its summary is:
summary(noiseE)
events object deriving from data set: lD.
 Events defined as cuts of 45 sampling points on each of the 4 recording sites.
 The 'reference' time of each event is located at point 15 of the cut.
 There are 1330 events in the object.
The reader interested in checking the effect of the safetyFactor argument is invited to try something like:
noiseElowSF <- mkNoise(sp0E,lD,14,30,safetyFactor=1,2000)
plot(mean(noiseElowSF),type="l")
lines(mean(noiseE),col=2)
7 Getting "clean" events
Our spike sorting has two main stages: the first one consists in estimating a generative model and the second one consists in using this model to build a classifier before applying it to the data. Our generative model will include superposed events but it is going to be built out of reasonably "clean" ones. Here by clean we mean events which are not due to a nearly simultaneous firing of two or more neurons; and simultaneity is defined on the time scale of one of our cuts.
In order to eliminate the most obvious superpositions we are going to use a rather brute force approach, looking at the sides of the central peak of our median event and checking that individual events are not too large there, that is, do not exhibit extra peaks. We first define a function doing this job:
goodEvtsFct <- function(samp,thr=3) {
    samp.med <- apply(samp,1,median)
    samp.mad <- apply(samp,1,mad)
    above <- samp.med > 0
    samp.r <- apply(samp,2,function(x) {x[above] <- 0;x})
    apply(samp.r,2,function(x) all(abs(x-samp.med) < thr*samp.mad))
}
We then apply our new function to our sample of events:
goodEvts <- goodEvtsFct(evtsE,8)
Here goodEvts is a vector of logicals with as many elements as there are events in evtsE. Elements of goodEvts are TRUE if the corresponding event of evtsE is "good" (i.e., not a superposition) and FALSE otherwise. We can look at the first 200 good events easily with:
evtsE[,goodEvts][,1:200]
Figure 8: First 200 good events of evtsE.
We see that a few superpositions are left but the most obvious ones of our previous events figure are gone. We can also look at the 38 (sum(!goodEvts)) "bad" events with:
evtsE[,!goodEvts]
Figure 9: Bad events of evtsE.
8 Dimension reduction
8.1 Principal component analysis
Our events are living right now in a 180-dimensional space (our cuts are 45 sampling points long and we are working with 4 recording sites simultaneously). It turns out that it is hard for most humans to perceive structures in such spaces. It is also hard, not to say impossible with a realistic sample size, to estimate probability densities (which is what some clustering algorithms are actually doing) in such spaces, unless one is ready to make strong assumptions about these densities. It is therefore usually a good practice to try to reduce the dimension of the sample space used to represent the data. We are going to do that with principal component analysis (PCA), using it on our "good" events.
evtsE.pc <- prcomp(t(evtsE[,goodEvts]))
We have to be careful here since function prcomp assumes that the data matrix is built by stacking the events / observations as rows and not as columns like we did in our events object. We therefore apply the function to the transpose (t()) of our events.
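As a quick orientation before looking at individual components, we can inspect the fraction of variance captured by the leading PCs using the standard summary of a prcomp result (a small addition to the original text):
summary(evtsE.pc)$importance[,1:8] ## SD, proportion and cumulative proportion of variance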
8.2 Exploring PCA results
PCA is a rather abstract procedure to most of its users, at least when they start using it. But one way to grasp what it does is to plot the mean event plus or minus, say, twice each principal component, like:
layout(matrix(1:4,nr=2))
explore(evtsE.pc,1,5)
explore(evtsE.pc,2,5)
explore(evtsE.pc,3,5)
explore(evtsE.pc,4,5)
Figure 10: PCA of evtsE (for "good" events) exploration (PC 1 to 4). Each of the 4 graphs shows the mean waveform (black), the mean waveform + 5 x PC (red), the mean - 5 x PC (blue) for each of the first 4 PCs. The fraction of the total variance "explained" by the component appears between parentheses in the title of each graph.
We can see that the first 3 PCs correspond to pure amplitude variations. An event with a large projection (score) on the first PC is smaller than the average event on recording sites 1, 2 and 3, but not on 4. An event with a large projection on PC 2 is larger than average on site 1, smaller than average on sites 2 and 3 and identical to the average on site 4. An event with a large projection on PC 3 is larger than the average on site 4 only. PC 4 is the first principal component corresponding to a change in shape as opposed to amplitude. A large projection on PC 4 means that the event has a shallower first valley and a deeper second valley than the average event on all recording sites.
We now look at the next 4 principal components:
layout(matrix(1:4,nr=2))
explore(evtsE.pc,5,5)
explore(evtsE.pc,6,5)
explore(evtsE.pc,7,5)
explore(evtsE.pc,8,5)
Figure 11: PCA of evtsE (for "good" events) exploration (PC 5 to 8). Each of the 4 graphs shows the mean waveform (black), the mean waveform + 5 x PC (red), the mean - 5 x PC (blue). The fraction of the total variance "explained" by the component appears between parentheses in the title of each graph.
An event with a large projection on PC 5 tends to be "slower" than the average event. An event with a large projection on PC 6 exhibits a slower kinetics of its second valley than the average event. PC 5 and 6 correspond to effects shared among recording sites. PC 7 also corresponds to a "change of shape" effect, on all sites except the first. Events with a large projection on PC 8 rise slightly faster and decay slightly slower than the average event on all recording sites. Notice also that PC 8 has a "noisier" aspect than the others, suggesting that we are reaching the limit of the "events extra variability" compared to the variability present in the background noise.
This guess can be confirmed by comparing the variance of the "good" events sample with the one of the noise sample to which the variance contributed by the first 8 PCs is added:
sapply(1:15, function(n) sum(diag(cov(t(noiseE))))+sum(evtsE.pc$sdev[1:n]^2)-sum(evtsE.pc$sdev^2))
 [1] -277.4651543 -187.5634116 -128.0390777  -91.3186691  -58.8398876
 [6]  -36.3630674  -21.5437224   -8.2644952    0.2848893    6.9067336
[11]   13.3415488   19.4720891   25.2553356   29.1021047   32.5634359
This near equality, reached around the 8th component, means that we should not include components beyond the 8th one in our analysis; it even leaves room to use fewer components.
8.3 Static representation of the projected data
We can build a scatter plot matrix showing the projections of our "good" events sample onto the planes defined by pairs of the first few PCs:
panel.dens <- function(x,...) {
    usr <- par("usr")
    on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5))
    d <- density(x, adjust=0.5)
    x <- d$x
    y <- d$y
    y <- y/max(y)
    lines(x, y, col="grey50", ...)
}
pairs(evtsE.pc$x[,1:4],pch=".",gap=0,diag.panel=panel.dens)
Figure 12: Scatter plot matrix of the projections of the good events in evtsE onto the planes defined by the first 4 PCs. The diagonal shows a smooth (Gaussian kernel based) density estimate of the projection of the sample on the corresponding PC. Using the first 8 PCs does not make finer structure visible.
8.4 Dynamic representation of the projected data
The best way to discern structures in "high dimensional" data is to dynamically visualize them. To this end, the tool of choice is GGobi, an open source software available on Linux, Windows and MacOS. It is in addition interfaced to R thanks to the rggobi package. We therefore have two ways to use it: as a stand-alone program after exporting the data from R, or directly within R. We are going to use it in its stand-alone version here. We therefore start by exporting our data in csv format to our disk:
write.csv(evtsE.pc$x[,1:8],file="evtsE.csv")
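As mentioned above, GGobi can also be driven directly from R through the rggobi package; if that package is installed, something along the following lines should open the same projections without going through a csv file (not run here, shown only as an alternative):
## library(rggobi)                              ## requires a working GGobi + rggobi install
## ggobi(as.data.frame(evtsE.pc$x[,1:8]))       ## opens the data in a GGobi session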
What comes next is not part of this document but here is a brief description of how to get it:
- Launch GGobi.
- In menu: File -> Open, select evtsE.csv.
- Since the glyphs are rather large, start by changing them for smaller ones:
  - Go to menu: Interaction -> Brush.
  - On the Brush panel which appeared, check the Persistent box.
  - Click on Choose color & glyph....
  - On the chooser which pops out, click on the small dot on the upper left of the left panel.
  - Go back to the window with the data points.
  - Right click on the lower right corner of the rectangle which appeared on the figure after you selected Brush.
  - Drag the rectangle corner in order to cover the whole set of points.
  - Go back to the Interaction menu and select the first row to go back where you were at the start.
- Go to menu: View -> Rotation.
- Adjust the speed of the rotation in order to see things properly.
You should easily discern 10 rather well separated clusters, meaning that an automatic clustering with 10 clusters on the first 3 principal components should do the job.
9 Clustering
9.1 k-means clustering
Since our dynamic visualization shows 10 well separated clusters in 3 dimensions, a simple k-means should do the job:
set.seed(20061001,kind="Mersenne-Twister")
km10 <- kmeans(evtsE.pc$x[,1:3],centers=10,iter.max=100,nstart=100)
c10 <- km10$cluster
Since function kmeans of R uses a random initialization, we set the seed (as well as the kind) of our pseudo random number generator in order to ensure full reproducibility. In order to ensure reproducibility even if another seed is used, as well as to facilitate the interpretation of the results, we "order" the clusters by "size", using the integrated absolute value of the central / median event of each cluster as a measure of its size.
cluster.med <- sapply(1:10, function(cIdx) median(evtsE[,goodEvts][,c10==cIdx]))
sizeC <- sapply(1:10,function(cIdx) sum(abs(cluster.med[,cIdx])))
newOrder <- sort.int(sizeC,decreasing=TRUE,index.return=TRUE)$ix
cluster.mad <- sapply(1:10, function(cIdx) {ce <- t(evtsE)[goodEvts,];ce <- ce[c10==cIdx,];apply(ce,2,mad)})
cluster.med <- cluster.med[,newOrder]
cluster.mad <- cluster.mad[,newOrder]
c10b <- sapply(1:10, function(idx) (1:10)[newOrder==idx])[c10]
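A quick look at the number of events assigned to each (reordered) cluster can be obtained with:
table(c10b) ## events per cluster after reordering by size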
9.2 Results inspection with GGobi
We start by checking our clustering quality with GGobi. To this end we export the data and the labels of each event:
write.csv(cbind(evtsE.pc$x[,1:3],c10b),file="evtsEsorted.csv")
Again the dynamic visualization is not part of this document, but here is how to get it:
- Load the new data into GGobi like before.
- In menu: Display -> New Scatterplot Display, select evtsEsorted.csv.
- Change the glyphs like before.
- In menu: Tools -> Color Schemes, select a scheme with 10 colors, like Spectral, Spectral 10.
- In menu: Tools -> Automatic Brushing, select the evtsEsorted.csv tab and, within this tab, select variable c10b. Then click on Apply.
- Select View -> Rotation like before and see your result.
9.3 Cluster specific plots
Another way to inspect the clustering results is to look at cluster specific events plots:
layout(matrix(1:4,nr=4))
par(mar=c(1,1,1,1))
plot(evtsE[,goodEvts][,c10b==1],y.bar=5)
plot(evtsE[,goodEvts][,c10b==2],y.bar=5)
plot(evtsE[,goodEvts][,c10b==3],y.bar=5)
plot(evtsE[,goodEvts][,c10b==4],y.bar=5)
Figure 13: First 4 clusters. Cluster 1 at the top, cluster 4 at the bottom. Scale bar: 5 global MAD units. Red, cluster specific central / median event. Blue, cluster specific MAD.
Notice the increased MAD on the rising phase of cluster 2 on the first recording site: a sign of misalignment of the events of this cluster.
layout(matrix(1:4,nr=4))
par(mar=c(1,1,1,1))
plot(evtsE[,goodEvts][,c10b==5],y.bar=5)
plot(evtsE[,goodEvts][,c10b==6],y.bar=5)
plot(evtsE[,goodEvts][,c10b==7],y.bar=5)
plot(evtsE[,goodEvts][,c10b==8],y.bar=5)
Figure 14: Next 4 clusters. Cluster 5 at the top, cluster 8 at the bottom. Scale bar: 5 global MAD units. Red, cluster specific central / median event. Blue, cluster specific MAD.
Cluster 5 has few events while some "subtle" superpositions are present in cluster 7.
layout(matrix(1:2,nr=2))
par(mar=c(1,1,1,1))
plot(evtsE[,goodEvts][,c10b==9],y.bar=5)
plot(evtsE[,goodEvts][,c10b==10],y.bar=5)
Figure 15: Last 2 clusters. Cluster 9 at the top, cluster 10 at the bottom. Scale bar: 5 global MAD units. Red, cluster specific central / median event. Blue, cluster specific MAD.
Cluster 10 exhibits an extra variability on sites 1 and 4 around its first valley and its peak.
9.4 Dealing with the jitter
9.4.1 A worked out example
Cluster 2 is the cluster exhibiting the largest sampling jitter effects, since it has the largest time derivative, in absolute value, of its median event. This is clearly seen when we superpose the 51st event from this cluster with the median event:
c2_median = apply(evtsE[,goodEvts][,c10b==2],1,median)
plot(c2_median,col='red',lwd=2,type="l")
lines(unclass(evtsE[,goodEvts][,c10b==2][,51]),lwd=2)
Figure 16: The median event of cluster 2 (red) together with event 51 of the same cluster (black).
A Taylor expansion shows that if we write g(t) the observed 51st event, δ the sampling jitter and f(t) the actual waveform of the event then: \[g(t) = f(t+δ) + ε(t) \approx f(t) + δ \, f'(t) + δ^2/2 \, f''(t) + ε(t) \, ;\] where ε is a Gaussian process and where \(f'\) and \(f''\) stand for the first and second time derivatives of \(f\). Therefore, if we can get estimates of \(f'\) and \(f''\) we should be able to estimate δ by linear regression (if we neglect the \(δ^2\) term as well as the potentially non null correlation in ε) or by non linear regression (if we keep the latter).
We start by getting the derivatives estimates:
dataD = apply(lD,2,function(x) c(0,diff(x,2)/2,0))
evtsED = mkEvents(sp0E,dataD,14,30)
dataDD = apply(dataD,2,function(x) c(0,diff(x,2)/2,0))
evtsEDD = mkEvents(sp0E,dataDD,14,30)
c2D_median = apply(evtsED[,goodEvts][,c10b==2],1,median)
c2DD_median = apply(evtsEDD[,goodEvts][,c10b==2],1,median)
We then get something like:
plot(unclass(evtsE[,goodEvts][,c10b==2][,51])-c2_median,type="l",col='red',lwd=2)
lines(1.5*c2D_median,col='blue',lwd=2)
lines(1.5*c2D_median+1.5^2/2*c2DD_median,col='black',lwd=2)
Figure 17: The median event of cluster 2 subtracted from event 51 of the same cluster (red); 1.5 times the first derivative of the median event (blue), corresponding to δ=1.5; 1.5 times the first derivative + 1.5^2/2 times the second derivative (black), again corresponding to δ=1.5.
If we neglect the \(δ^2\) term we quickly arrive at: \[\hat{δ} = \frac{\mathbf{f'} \cdot (\mathbf{g} -\mathbf{f})}{\| \mathbf{f'} \|^2} \, ;\]
where the 'vectorial' notation like \(\mathbf{a} \cdot \mathbf{b}\) stands here for: \[\sum_{i=0}^{179} a_i b_i \, .\]
For the 51st event of the cluster we get:
(delta_hat = sum(c2D_median*(unclass(evtsE[,goodEvts][,c10b==2][,51])-c2_median))/sum(c2D_median^2))
1.4917182304327
We can use this estimated value of delta_hat as an initial guess for a procedure refining the estimate using also the \(δ^2\) term. The obvious quantity we should try to minimize is the residual sum of squares, RSS, defined by:
\[\mathrm{RSS}(δ) = \| \mathbf{g} - \mathbf{f} - δ \, \mathbf{f'} - δ^2/2 \, \mathbf{f''} \|^2 \; .\]
We can define a function returning the RSS for a given value of δ, taking as further arguments an event evt, a cluster center (median event of the cluster) center and its first two derivatives, centerD and centerDD:
rss_for_alignment <- function(delta, evt=unclass(evtsE[,goodEvts][,c10b==2][,51]), center=c2_median, centerD=c2D_median, centerDD=c2DD_median) {
    sapply(delta, function(d) sum((evt - center - d*centerD - d^2/2*centerDD)^2))
}
We then get the graph with:
layout(matrix(1:2,nc=2))
dd <- seq(-5,5,0.05)
plot(dd,rss_for_alignment(dd),type='l',col='black',lwd=2)
dd_fine <- seq(delta_hat-0.5,delta_hat+0.5,len=501)
plot(dd_fine,rss_for_alignment(dd_fine),type='l',col='black',lwd=2)
abline(v=delta_hat,col='red')
Figure 18: The RSS as a function of δ for event 51 of cluster 2. Left, \(δ \in [-5,5]\); right, \(δ \in [\hat{δ}-0.5,\hat{δ}+0.5]\), where the red vertical line shows \(\hat{δ}\).
The left panel of the above figure shows that our initial guess for \(\hat{δ}\) is not bad but still approximately 0.2 units away from the actual minimum. The classical way to refine our δ estimate, in 'nice situations' where the function we are trying to minimize is locally convex, is to use the Newton-Raphson algorithm, which consists in approximating locally the 'target function' (here our RSS function) by a parabola having locally the same first and second derivatives, before jumping to the minimum of this approximating parabola. If we develop our previous expression of \(\mathrm{RSS}(δ)\) we get:
\[\mathrm{RSS}(δ) = \| \mathbf{h} \|^2 - 2\, δ \, \mathbf{h} \cdot \mathbf{f'} + δ^2 \, \left( \|\mathbf{f'}\|^2 - \mathbf{h} \cdot \mathbf{f''}\right) + δ^3 \, \mathbf{f'} \cdot \mathbf{f''} + \frac{δ^4}{4} \|\mathbf{f''}\|^2 \, ;\]
where \(\mathbf{h}\) stands for \(\mathbf{g} - \mathbf{f}\). By differentiation with respect to δ we get:
\[\mathrm{RSS}'(δ) = - 2\, \mathbf{h} \cdot \mathbf{f'} + 2 \, δ \, \left( \|\mathbf{f'}\|^2 - \mathbf{h} \cdot \mathbf{f''}\right) + 3 \, δ^2 \, \mathbf{f'} \cdot \mathbf{f''} + δ^3 \|\mathbf{f''}\|^2 \, .\]
And a second differentiation leads to:
\[\mathrm{RSS}''(δ) = 2 \, \left( \|\mathbf{f'}\|^2 - \mathbf{h} \cdot \mathbf{f''}\right) + 6 \, δ \, \mathbf{f'} \cdot \mathbf{f''} + 3 \, δ^2 \|\mathbf{f''}\|^2 \, .\]
The equation of the approximating parabola at \(δ^{(k)}\) is then:
\[\mathrm{RSS}(δ^{(k)} + η) \approx \mathrm{RSS}(δ^{(k)}) + η \, \mathrm{RSS}'(δ^{(k)}) + \frac{η^2}{2} \, \mathrm{RSS}''(δ^{(k)})\; ,\]
and its minimum—if \(\mathrm{RSS}''(δ)\) > 0—is located at:
\[δ^{(k+1)} = δ^{(k)} - \frac{\mathrm{RSS}'(δ^{(k)})}{\mathrm{RSS}''(δ^{(k)})} \; .\]
Defining functions returning the required derivatives:
rssD_for_alignment <- function(delta, evt=unclass(evtsE[,goodEvts][,c10b==2][,51]), center=c2_median, centerD=c2D_median, centerDD=c2DD_median){
    h = evt - center
    -2*sum(h*centerD) + 2*delta*(sum(centerD^2) - sum(h*centerDD)) + 3*delta^2*sum(centerD*centerDD) + delta^3*sum(centerDD^2)
}
rssDD_for_alignment <- function(delta, evt=unclass(evtsE[,goodEvts][,c10b==2][,51]), center=c2_median, centerD=c2D_median, centerDD=c2DD_median){
    h = evt - center
    2*(sum(centerD^2) - sum(h*centerDD)) + 6*delta*sum(centerD*centerDD) + 3*delta^2*sum(centerDD^2)
}
we can get a graphical representation of a single step of the Newton-Raphson algorithm:
rss_at_delta_0 = rss_for_alignment(delta_hat)
rssD_at_delta_0 = rssD_for_alignment(delta_hat)
rssDD_at_delta_0 = rssDD_for_alignment(delta_hat)
delta_1 = delta_hat - rssD_at_delta_0/rssDD_at_delta_0
plot(dd_fine,rss_for_alignment(dd_fine),type="l",col='black',lwd=2)
abline(v=delta_hat,col='red')
lines(dd_fine, rss_at_delta_0 + (dd_fine-delta_hat)*rssD_at_delta_0 + (dd_fine-delta_hat)^2/2*rssDD_at_delta_0,col='blue',lwd=2)
abline(v=delta_1,col='grey')
Figure 19: The RSS as a function of δ for event 51 of cluster 2 (black); the red vertical line shows \(\hat{δ}\). In blue, the approximating parabola at \(\hat{δ}\). The grey vertical line shows the minimum of the approximating parabola.
Subtracting the second order in δ approximation of f(t+δ) from the observed 51st event of cluster 2 we get:
plot(evtsE[,goodEvts][,c10b==2][,51],type="l",col='black',lwd=2)
lines(evtsE[,goodEvts][,c10b==2][,51]-c2_median-delta_1*c2D_median-delta_1^2/2*c2DD_median,col='red',lwd=2)
lines(c2_median+delta_1*c2D_median+delta_1^2/2*c2DD_median,col='blue',lwd=1)
Figure 20: Event 51 of cluster 2 (black), second order approximation of f(t+δ) (blue) and residual (red) for δ equal to 1.3748 (the value of delta_1), obtained by a succession of a linear regression (order 1) and a single Newton-Raphson step.
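For readers who would rather iterate the Newton-Raphson update until it stabilizes instead of stopping after a single step, a minimal sketch reusing the functions defined above could be:
delta_nr <- delta_hat
for (k in 1:5) { ## a few iterations are more than enough on such a smooth RSS
    delta_nr <- delta_nr - rssD_for_alignment(delta_nr)/rssDD_for_alignment(delta_nr)
}
delta_nr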
9.4.2 Making the procedure systematic
Function get_jitter is defined in Sec. Definition of get_jitter. If we try it on the events attributed to cluster 2:
c2_jitter = get_jitter(evtsE[,goodEvts][,c10b==2],c2_median,c2D_median,c2DD_median)
we get the following summary statistic of the estimated jitters:
summary(c2_jitter,digits=2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -1.900  -0.380  -0.044   0.041   0.660   1.800 
We then get an estimate of the jitter-corrected events of cluster 2 with:
c2_noj = unclass(evtsE[,goodEvts][,c10b==2]) - c2D_median %o% c2_jitter - c2DD_median %o% c2_jitter^2/2
attributes(c2_noj) = attributes(evtsE[,goodEvts][,c10b==2])
A plot allows us to compare the "raw version" of the events with the jitter corrected version:
layout(matrix(1:2,nr=2))
par(mar=c(1,1,1,1))
evtsE[,goodEvts][,c10b==2]
c2_noj
Figure 21: Top: cluster 2 events before jitter correction. Bottom: cluster 2 events after second order jitter correction.
9.4.3 Applying the procedure systematically
We now use function mk_aligned_events, defined in Sec. Definition of mk_aligned_events, on each of the 10 clusters:
evtsE_noj = lapply(1:10, function(i) mk_aligned_events(sp0E[goodEvts][c10b==i],lD))
Looking at the first 5 clusters we get:
layout(matrix(1:5,nr=5))
evtsE_noj[[1]]
evtsE_noj[[2]]
evtsE_noj[[3]]
evtsE_noj[[4]]
evtsE_noj[[5]]
Figure 22: First 5 clusters after jitter cancellation. Cluster 1 at the top, cluster 5 at the bottom. Red, cluster specific central / median event. Blue, cluster specific MAD
.
Looking at the last 5 clusters we get:
layout(matrix(1:5,nr=5))
evtsE_noj[[6]]
evtsE_noj[[7]]
evtsE_noj[[8]]
evtsE_noj[[9]]
evtsE_noj[[10]]
Figure 23: Last 5 clusters after jitter cancellation. Cluster 6 at the top, cluster 10 at the bottom. Red, cluster specific central / median event. Blue, cluster specific MAD. Notice the vertical scale change compared to the previous figure.
The five numbers summary of the remaining jitters for each of the 10 clusters are:
t(sapply(evtsE_noj, function(l) summary(attr(l,"delta"),digits=3)))
        Min. 1st Qu.   Median     Mean 3rd Qu.  Max.
 [1,] -0.493  -0.253  0.02120 -0.01690   0.171 0.465
 [2,] -0.534  -0.225 -0.03800  0.00695   0.286 0.509
 [3,] -0.450  -0.271  0.02930  0.01750   0.270 0.496
 [4,] -0.475  -0.193  0.01990  0.01360   0.227 0.468
 [5,] -0.627  -0.320  0.04860 -0.05690   0.216 0.337
 [6,] -0.530  -0.235  0.01760  0.01860   0.320 0.473
 [7,] -0.513  -0.249 -0.00277 -0.01360   0.234 0.560
 [8,] -0.514  -0.233  0.00456  0.00753   0.256 0.613
 [9,] -0.574  -0.258  0.02530  0.08350   0.256 4.570
[10,] -0.955  -0.275  0.01010 -0.00546   0.255 2.960
10 "Brute force" superposition resolution
We are going to resolve (the most "obvious") superpositions by a "recursive peeling method":
- Events are detected and cut from the raw data or from an already peeled version of the data.
- The closest center (in terms of Euclidean distance) to the event is found; the jitter is always evaluated and compensated for when the distances are computed.
- If the residual sum of squares (RSS), that is: (actual data - best center)\(^2\), is smaller than the squared norm of a cut, the best center is subtracted from the data on which detection was performed; jitter is again compensated for at this stage.
- Go back to step 1 or stop.
We get a named list of cluster centers lists with:
centers = lapply(1:10, function(i) mk_center_list(attr(evtsE_noj[[i]],"positions"),lD))
names(centers) = paste("Cluster",1:10)
We then apply function classify_and_align_evt, defined in Sec. Definition of classify_and_align_evt, to every detected event:
round0 = lapply(as.vector(sp0),classify_and_align_evt, data=lD,centers=centers)
We can see how many events (out of a total of 1795, length(sp0)) were left unclassified with:
sum(sapply(round0, function(l) l[[1]] == '?'))
28
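To see how the classified events distribute over the clusters, we can also tabulate the assigned label (the first element of each result, '?' marking the unclassified events); a quick check, not in the original text:
table(sapply(round0, function(l) l[[1]])) ## events per cluster plus '?'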
10.1 First peeling
Using predict_data we get:
pred0 = predict_data(round0,centers)
We subtract pred0 from lD:
lD1 = lD - pred0
We can compare the original data with the result of the "first peeling":
ii = 1:1500 + 0.9*15000
tt = ii/15000
par(mar=c(1,1,1,1))
plot(tt, lD[ii,1], axes = FALSE, type="l",ylim=c(-50,20), xlab="",ylab="")
lines(tt, lD1[ii,1], col='red')
lines(tt, lD[ii,2]-15, col='black')
lines(tt, lD1[ii,2]-15, col='red')
lines(tt, lD[ii,3]-25, col='black')
lines(tt, lD1[ii,3]-25, col='red')
lines(tt, lD[ii,4]-40, col='black')
lines(tt, lD1[ii,4]-40, col='red')
Figure 24: 100 ms (between 0.9 and 1.0 s) of the locust data set. Black, original data; red, after first peeling.
10.2 Second peeling
We then take lD1 as our former lD and we repeat the procedure, detecting only on recording site 1 (and using a slightly shorter box filter):
lDf <- filter(lD1,rep(1,3)/3)
lDf[is.na(lDf)] <- 0
lDf.mad <- apply(lDf,2,mad)
lDf <- t(t(lDf)/lDf.mad)
bellow.thrs <- t(t(lDf) < thrs)
lDfr <- lDf
lDfr[bellow.thrs] <- 0
remove(lDf)
(sp1 <- peaks(lDfr[,1],15))
eventsPos object with indexes of 247 events. 
  Mean inter event interval: 1208.87 sampling points, corresponding SD: 1386.01 sampling points 
  Smallest and largest inter event intervals: 18 and 9140 sampling points.
We then apply again function classify_and_align_evt, defined in Sec. Definition of classify_and_align_evt, to every detected event:
round1 = lapply(as.vector(sp1),classify_and_align_evt, data=lD1,centers=centers)
We can see how many events (out of a total of 247, length(sp1)) were left unclassified with:
sum(sapply(round1, function(l) l[[1]] == '?'))
60
We then get our prediction:
pred1 = predict_data(round1,centers)
We subtract pred1 from lD1:
lD2 = lD1 - pred1
We can compare the first peeling with the second one:
ii = 1:1500 + 0.9*15000
tt = ii/15000
par(mar=c(1,1,1,1))
plot(tt, lD1[ii,1], axes = FALSE, type="l",ylim=c(-50,20), xlab="",ylab="")
lines(tt, lD2[ii,1], col='red')
lines(tt, lD1[ii,2]-15, col='black')
lines(tt, lD2[ii,2]-15, col='red')
lines(tt, lD1[ii,3]-25, col='black')
lines(tt, lD2[ii,3]-25, col='red')
lines(tt, lD1[ii,4]-40, col='black')
lines(tt, lD2[ii,4]-40, col='red')
Figure 25: 100 ms (between 0.9 and 1.0 s) of the locust data set. Black, first peeling; red, second peeling.
10.3 Third peeling
We then take lD2 as our former lD1 and we repeat the procedure, detecting only on recording site 2:
lDf <- filter(lD2,rep(1,3)/3)
lDf[is.na(lDf)] <- 0
lDf.mad <- apply(lDf,2,mad)
lDf <- t(t(lDf)/lDf.mad)
bellow.thrs <- t(t(lDf) < thrs)
lDfr <- lDf
lDfr[bellow.thrs] <- 0
remove(lDf)
(sp2 <- peaks(lDfr[,2],15))
eventsPos object with indexes of 124 events. 
  Mean inter event interval: 2404.16 sampling points, corresponding SD: 2381.06 sampling points 
  Smallest and largest inter event intervals: 16 and 10567 sampling points.
We then apply again function classify_and_align_evt, defined in Sec. Definition of classify_and_align_evt, to every detected event:
round2 = lapply(as.vector(sp2),classify_and_align_evt, data=lD2,centers=centers)
We can see how many events (out of a total of 124, length(sp2)) were left unclassified with:
sum(sapply(round2, function(l) l[[1]] == '?'))
22
We then get our prediction:
pred2 = predict_data(round2,centers)
We subtract pred2 from lD2:
lD3 = lD2 - pred2
We can compare the second peeling with the third one:
ii = 1:1500 + 0.9*15000
tt = ii/15000
par(mar=c(1,1,1,1))
plot(tt, lD2[ii,1], axes = FALSE, type="l",ylim=c(-50,20), xlab="",ylab="")
lines(tt, lD3[ii,1], col='red')
lines(tt, lD2[ii,2]-15, col='black')
lines(tt, lD3[ii,2]-15, col='red')
lines(tt, lD2[ii,3]-25, col='black')
lines(tt, lD3[ii,3]-25, col='red')
lines(tt, lD2[ii,4]-40, col='black')
lines(tt, lD3[ii,4]-40, col='red')
Figure 26: 100 ms (between 0.9 and 1.0 s) of the locust data set. Black, second peeling; red, third peeling.
10.4 Fourth peeling
We then take lD3 as our former lD2 and we repeat the procedure, detecting only on recording site 3:
lDf <- filter(lD3,rep(1,3)/3)
lDf[is.na(lDf)] <- 0
lDf.mad <- apply(lDf,2,mad)
lDf <- t(t(lDf)/lDf.mad)
bellow.thrs <- t(t(lDf) < thrs)
lDfr <- lDf
lDfr[bellow.thrs] <- 0
remove(lDf)
(sp3 <- peaks(lDfr[,3],15))
eventsPos object with indexes of 99 events. 
  Mean inter event interval: 2993.43 sampling points, corresponding SD: 3686.95 sampling points 
  Smallest and largest inter event intervals: 34 and 21217 sampling points.
We then apply again function classify_and_align_evt, defined in Sec. Definition of classify_and_align_evt, to every detected event:
round3 = lapply(as.vector(sp3),classify_and_align_evt, data=lD3,centers=centers)
We can see how many events (out of a total of 99, length(sp3)) were left unclassified with:
sum(sapply(round3, function(l) l[[1]] == '?'))
21
We then get our prediction:
pred3 = predict_data(round3,centers)
We subtract pred3 from lD3:
lD4 = lD3 - pred3
We can compare the third peeling with the fourth one:
ii = 1:1500 + 3.9*15000
tt = ii/15000
par(mar=c(1,1,1,1))
plot(tt, lD3[ii,1], axes = FALSE, type="l",ylim=c(-50,20), xlab="",ylab="")
lines(tt, lD4[ii,1], col='red')
lines(tt, lD3[ii,2]-15, col='black')
lines(tt, lD4[ii,2]-15, col='red')
lines(tt, lD3[ii,3]-25, col='black')
lines(tt, lD4[ii,3]-25, col='red')
lines(tt, lD3[ii,4]-40, col='black')
lines(tt, lD4[ii,4]-40, col='red')
Figure 27: 100 ms (between 3.9 and 4.0 s) of the locust data set. Black, third peeling; red, fourth peeling. Different time frame than on the previous plots.
10.5 Fifth peeling
We then take lD4 as our former lD3 and we repeat the procedure, detecting only on recording site 4:
lDf <- filter(lD4,rep(1,3)/3)
lDf[is.na(lDf)] <- 0
lDf.mad <- apply(lDf,2,mad)
lDf <- t(t(lDf)/lDf.mad)
bellow.thrs <- t(t(lDf) < thrs)
lDfr <- lDf
lDfr[bellow.thrs] <- 0
remove(lDf)
(sp4 <- peaks(lDfr[,4],15))
eventsPos object with indexes of 164 events. 
  Mean inter event interval: 1806.91 sampling points, corresponding SD: 1902.64 sampling points 
  Smallest and largest inter event intervals: 25 and 9089 sampling points.
We then apply again function classify_and_align_evt, defined in Sec. Definition of classify_and_align_evt, to every detected event:
round4 = lapply(as.vector(sp4),classify_and_align_evt, data=lD4,centers=centers)
We can see how many events (out of a total of 164, length(sp4)) were left unclassified with:
sum(sapply(round4, function(l) l[[1]] == '?'))
60
We then get our prediction:
pred4 = predict_data(round4,centers)
We subtract pred4 from lD4:
lD5 = lD4 - pred4
We can compare the fourth peeling with the fifth one:
ii = 1:1500 + 3.9*15000
tt = ii/15000
par(mar=c(1,1,1,1))
plot(tt, lD4[ii,1], axes = FALSE, type="l",ylim=c(-50,20), xlab="",ylab="")
lines(tt, lD5[ii,1], col='red')
lines(tt, lD4[ii,2]-15, col='black')
lines(tt, lD5[ii,2]-15, col='red')
lines(tt, lD4[ii,3]-25, col='black')
lines(tt, lD5[ii,3]-25, col='red')
lines(tt, lD4[ii,4]-40, col='black')
lines(tt, lD5[ii,4]-40, col='red')
Figure 28: 100 ms (between 3.9 and 4.0 s) of the locust data set. Black, fourth peeling; red, fifth peeling.
10.6 General comparison
We can compare the raw data with the fifth peeling on the first second:
ii = 1:15000
tt = ii/15000
par(mar=c(1,1,1,1))
plot(tt, lD[ii,1], axes = FALSE, type="l",ylim=c(-50,20), xlab="",ylab="")
lines(tt, lD5[ii,1], col='red')
lines(tt, lD[ii,2]-15, col='black')
lines(tt, lD5[ii,2]-15, col='red')
lines(tt, lD[ii,3]-25, col='black')
lines(tt, lD5[ii,3]-25, col='red')
lines(tt, lD[ii,4]-40, col='black')
lines(tt, lD5[ii,4]-40, col='red')
Figure 29: The first second of the locust data set. Black, raw data; red, fifth peeling.
We can also look at the remaining unclassified events (that don't look like any of our templates):
bad_ones = sapply(round4, function(l) l[[1]] == '?')
r4BE = mkEvents(sp4[bad_ones], lD4)
r4BE
Figure 30: The 60 remaining bad events after the fifth peeling.
10.7 Sixth peeling
We can push things one peeling step further detecting on all sites simultaneously like we did the first time:
lDf <- filter(lD5,rep(1,3)/3)
lDf[is.na(lDf)] <- 0
lDf.mad <- apply(lDf,2,mad)
lDf <- t(t(lDf)/lDf.mad)
bellow.thrs <- t(t(lDf) < thrs)
lDfr <- lDf
lDfr[bellow.thrs] <- 0
remove(lDf)
(sp5 <- peaks(apply(lDfr,1,sum),15))
eventsPos object with indexes of 234 events. 
  Mean inter event interval: 1283.21 sampling points, corresponding SD: 1325.54 sampling points 
  Smallest and largest inter event intervals: 17 and 8484 sampling points.
We then apply again function classify_and_align_evt, defined in Sec. Definition of classify_and_align_evt, to every detected event:
round5 = lapply(as.vector(sp5),classify_and_align_evt, data=lD5,centers=centers)
We can see how many events (out of a total of 234, length(sp5)) were left unclassified with:
sum(sapply(round5, function(l) l[[1]] == '?'))
184
This is indeed a large fraction of the total number of events. We can look at the whole sample of events:
mkEvents(sp5, lD5)
Figure 31: Detected events after fifth peeling (detection done on all sites simultaneously).
They also do not look like any of our templates.
11 Getting the spike trains
Once we have decided to stop the peeling iterations we can extract our spike trains with:
round_all = c(round0,round1,round2,round3,round4,round5)
spike_trains = lapply(paste("Cluster",1:10), function(cn) sapply(round_all[sapply(round_all, function(l) l[[1]]==cn)], function(l) l[[2]]+l[[3]]))
names(spike_trains) = paste("Cluster",1:10)
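As a quick sanity check of the outcome, we can count the number of spikes attributed to each neuron / cluster and, using the 15 kHz sampling rate, convert the first train from sampling points to seconds (a small addition, not in the original text):
sapply(spike_trains, length)              ## spikes per cluster
head(spike_trains[["Cluster 1"]]/15000)   ## first spike times of cluster 1 in seconds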
12 Custom developed code
12.1 Individual function definitions
12.1.1 cstValueSgts definition
Formal parameters:
- x: a numeric vector.
Returns: an integer matrix with as many columns as there are segments where the derivative is null; the first row contains the index of the beginning of each segment, the second row contains the segments' lengths.
cstValueSgts <- function(x) {
    dx <- as.integer(abs(diff(x,2))/2 <= .Machine$double.eps) ## use "order 2" derivative estimates
    ## ddx is zero where the derivative did not change,
    ## it is one if the derivative becomes null and -1
    ## if the derivative stops being null
    ddx <- diff(dx)
    ## s contains the indexes of the first
    ## points of segments where the
    ## derivative is null
    s <- (1:length(ddx))[ddx==1]
    sapply(s, function(b) {
        n <- ddx[(b+1):length(ddx)]
        c(b+1,min((1:length(n))[n==-1]))
    })
}
12.1.2 Definitions of generic function and time-series specific method for data exploration
Definition of the generic function:
explore <- function(x,...) { UseMethod("explore") }
Definition of the corresponding method for time series. Formal arguments:
- x: a ts (time series) object.
- offsetFactor: a numeric controlling the spacing between the recording sites on the plot. Smaller values lead to closer spacing.
- ...: additional arguments passed to the matplot and plot functions called internally.
explore.ts <- function(x, offsetFactor=0.5, ...) {
    stopifnot(inherits(x,"ts"))
    yRange <- range(x,na.rm=TRUE)
    plotPara <- list(tlim = start(x)[1] + c(0,0.1),
                     ylim = yRange,
                     yMin = yRange[1],
                     yMax = yRange[2],
                     firstTime = start(x)[1],
                     lastTime = end(x)[1],
                     keepGoing = TRUE)
    nFct <- function() {
        leftTime <- plotPara$tlim[1]
        rightTime <- plotPara$tlim[2]
        timeRange <- rightTime - leftTime
        rightTime <- rightTime + timeRange
        if (rightTime > plotPara$lastTime) {
            cat("Recording end reached.\n ")
            rightTime <- plotPara$lastTime
        }
        plotPara$tlim <- c(rightTime - timeRange, rightTime)
        plotPara
    }
    fFct <- function() {
        leftTime <- plotPara$tlim[1]
        rightTime <- plotPara$tlim[2]
        timeRange <- rightTime - leftTime
        leftTime <- leftTime - timeRange
        if (leftTime < plotPara$firstTime) {
            cat("Recording end reached.\n ")
            leftTime <- plotPara$firstTime
        }
        plotPara$tlim <- c(leftTime, leftTime + timeRange)
        plotPara
    }
    qFct <- function() {
        plotPara$keepGoing <- FALSE
        plotPara
    }
    ## Function tFct definition
    ## Allows the user to change the recording duration displayed on the window
    ## The user is invited to enter a factor which will be used to multiply the
    ## present duration displayed.
    ## If the resulting duration is too long a warning is given and the whole
    ## recording is shown.
    ## If possible the center of the displayed window is conserved.
    tFct <- function() {
        presentWindowLength <- diff(range(plotPara$tlim))
        tMessage <- paste("Present duration displayed: ", presentWindowLength, " \n", sep = "")
        tMessage <- paste(tMessage, "By what factor do you want to multiply it? \n", sep = "")
        theFactor <- as.numeric(readline(tMessage))
        if (theFactor <= 0) {
            cat("A negative or null factor does not make sense.\n")
            return(plotPara)
        } ## End of conditional on theFactor <= 0
        ## Check that the new display length is reasonable
        totalLength <- plotPara$lastTime - plotPara$firstTime
        if (theFactor * presentWindowLength >= totalLength) {
            cat("Cannot show more data than available but only the entire record.\n ")
            plotPara$tlim[1] <- plotPara$firstTime
            plotPara$tlim[2] <- plotPara$lastTime
            return(plotPara)
        }
        windowCenter <- plotPara$tlim[1] + presentWindowLength / 2
        newLeft <- windowCenter - theFactor * presentWindowLength / 2
        newRight <- windowCenter + theFactor * presentWindowLength / 2
        if (!(newLeft >= plotPara$firstTime & newRight <= plotPara$lastTime)) {
            if (newLeft <= plotPara$firstTime) {
                cat("Cannot show data before the recording started, the displayed center wont be conserved.\n ")
                plotPara$tlim[1] <- plotPara$firstTime
                plotPara$tlim[2] <- plotPara$tlim[1] + theFactor * presentWindowLength
            }
            if (newRight >= plotPara$lastTime) {
                cat("Cannot show data after the recording ended, the displayed center wont be conserved.\n ")
                plotPara$tlim[2] <- plotPara$lastTime
                plotPara$tlim[1] <- plotPara$tlim[2] - theFactor * presentWindowLength
            }
            return(plotPara)
        } ## End of conditional on !(newLeft >= plotPara$firstTime & newRight <= plotPara$lastTime)
        plotPara$tlim[1] <- newLeft
        plotPara$tlim[2] <- newRight
        return(plotPara)
    } ## End of function tFct definition
    ## Function rFct definition
    ## Allows the user to change the maximal value displayed on the abscissa
    ## The user is invited to enter a value.
    rFct <- function() {
        leftTime <- plotPara$tlim[1]
        rightTime <- plotPara$tlim[2]
        timeRange <- rightTime - leftTime
        tMessage <- paste("Present latest time displayed: ", rightTime, "\n", sep = "")
        tMessage <- paste(tMessage, "What new latest time do want (return leaves things unchanged)? \n", sep = "")
        theNewTime <- as.numeric(readline(tMessage))
        if (is.na(theNewTime)) {
            ## Nothing entered, leave things unchanged
            return(plotPara)
        } ## End of conditional on is.na(theFactor)
        if (theNewTime <= plotPara$firstTime) {
            ## This choice does not make sense
            cat("Cannot display data before recording started.\n")
            return(plotPara)
        }
        if (theNewTime > plotPara$lastTime) {
            cat("Recording end reached.\n ")
            rightTime <- plotPara$lastTime
        } else {
            if (theNewTime <= leftTime) {
                ## The new latest time entered is smaller that the earliest time displayed
                cat("The new latest time is smaller than the earliest, adjustement will be made.\n")
                leftTime <- theNewTime - timeRange
                if (leftTime < plotPara$firstTime) {
                    cat("Adjustment requires a change in displayed duration.\n")
                    leftTime <- plotPara$firstTime
                }
            } ## End of conditional on theNewTime <= leftTime
            rightTime <- theNewTime
        } ## End of conditional on theNewTime > plotPara$lastTime
        plotPara$tlim <- c(leftTime, rightTime)
        plotPara
    }
    ## Function lFct definition
    ## Allows the user to change the minimal value displayed on the abscissa
    ## The user is invited to enter a value.
    lFct <- function() {
        leftTime <- plotPara$tlim[1]
        rightTime <- plotPara$tlim[2]
        timeRange <- rightTime - leftTime
        tMessage <- paste("Present earliest time displayed: ", leftTime, "\n", sep = "")
        tMessage <- paste(tMessage, "What new earliest time do want (return leaves things unchanged)? \n", sep = "")
        theNewTime <- as.numeric(readline(tMessage))
        if (is.na(theNewTime)) {
            ## Nothing entered, leave things unchanged
            return(plotPara)
        } ## End of conditional on is.na(theFactor)
        if (theNewTime >= plotPara$lastTime) {
            ## This choice does not make sense
            cat("Cannot display data after recording ended.\n")
            return(plotPara)
        }
        if (theNewTime < plotPara$firstTime) {
            cat("Recording start reached.\n ")
            leftTime <- plotPara$firstTime
        } else {
            if (theNewTime >= rightTime) {
                ## The new earliest time entered is larger that the latest time displayed
                cat("The new earliest time is larger than the latest, adjustement will be made.\n")
                rightTime <- theNewTime + timeRange
                if (rightTime > plotPara$lastTime) {
                    cat("Adjustment requires a change in displayed duration.\n")
                    rightTime <- plotPara$lastTime
                }
            } ## End of conditional on theNewTime <= leftTime
            leftTime <- theNewTime
        } ## End of conditional on theNewTime > plotPara$lastTime
        plotPara$tlim <- c(leftTime, rightTime)
        plotPara
    }
    ## Function yMaxFct definition
    ## Allows the user to change the maximal value displayed on the ordinate
    ## The user is invited to enter a value.
    yMaxFct <- function() {
        presentWindowRange <- range(plotPara$ylim)
        tMessage <- paste("Present range displayed: [", paste(presentWindowRange, collapse = ","), "] \n", sep = "")
        tMessage <- paste(tMessage, "What new maximal ordinate value do want (return goes back to maximum)? \n", sep = "")
        theFactor <- as.numeric(readline(tMessage))
        if (is.na(theFactor)) {
            plotPara$ylim <- c(presentWindowRange[1],plotPara$yMax)
            return(plotPara)
        } ## End of conditional on is.na(theFactor)
        if (theFactor <= plotPara$ylim[1]) {
            cat("The maximum should be larger than the minimum.\n")
            return(plotPara)
        } ## End of conditional on theFactor <= plotPara$ylim[1]
        plotPara$ylim <- c(presentWindowRange[1],theFactor)
        return(plotPara)
    } ## End of function yMaxFct definition
    ## Function yMinFct definition
    ## Allows the user to change the minimal value displayed on the ordinate
    ## The user is invited to enter a value.
    yMinFct <- function() {
        presentWindowRange <- range(plotPara$ylim)
        tMessage <- paste("Present range displayed: [", paste(presentWindowRange, collapse = ","), "] \n", sep = "")
        tMessage <- paste(tMessage, "What new minimal ordinate value do want (return goes back to minimum)? \n", sep = "")
        theFactor <- as.numeric(readline(tMessage))
        if (is.na(theFactor)) {
            plotPara$ylim <- c(plotPara$yMin, presentWindowRange[2])
            return(plotPara)
        } ## End of conditional on is.na(theFactor)
        if (theFactor >= plotPara$ylim[2]) {
            cat("The minimum should be smaller than the maximum.\n")
            return(plotPara)
        } ## End of conditional on theFactor >= plotPara$ylim[2]
        plotPara$ylim <- c(theFactor, presentWindowRange[2])
        return(plotPara)
    } ## End of function yMinFct definition
    show <- function(x, plotPara, ...) {
        s <- plotPara$tlim[1]
        e <- plotPara$tlim[2]
        y.m <- plotPara$ylim[1]
        y.M <- plotPara$ylim[2]
        m <- unclass(window(x,start=s,end=e))
        if (class(m) == "matrix") {
            m <- apply(m,2,function(x) ifelse(x < y.m, y.m,x))
            m <- apply(m,2,function(x) ifelse(x > y.M, y.M,x))
            ns <- dim(m)[2]
            offset <- c(0,-(1:(ns-1))*(y.M-y.m))
            m <- t(t(m)+offset*offsetFactor)
            matplot(m,type="l",lty=1,axes=FALSE,xlab="",ylab="",...)
        } else {
            m[m<y.m] <- y.m
            m[m>y.M] <- y.M
            plot(m,type="l",lty=1,axes=FALSE,xlab="",ylab="",ylim=c(y.m,y.M),...)
        }
    }
    plot.new()
    par(mar=c(0.5,0.5,0.5,0.5))
    show(x,plotPara,...)
    myMessage <- "Make a choice:\n n or 'return' (next); f (former); l (lower abscissa limit); r (upper abscissa limit) \n t (time scale); Y (upper ordinate limit); y (lower ordinate limit); q (quit) \n "
    while(plotPara$keepGoing) {
        myChoice <- readline(myMessage)
        plotPara <- switch(myChoice,
                           n = nFct(),
                           f = fFct(),
                           l = lFct(),
                           r = rFct(),
                           t = tFct(),
                           Y = yMaxFct(),
                           y = yMinFct(),
                           q = qFct(),
                           nFct()
                           )
        show(x,plotPara,...)
    } ## End of while loop on keepGoing
    dev.off()
    invisible()
}
12.1.3 peaks and associated methods definition
Function peaks returns an object (essentially a vector of indices) of class eventsPos:
peaks <- function(x, minimalDist=15, notZero=1e-3) { dx <- c(0,diff(x,2)/2,0) dx[abs(dx) < notZero] <- 0 dx <- diff(sign(dx)) res <- (1:length(dx))[dx < 0] res <- res[-length(res)][diff(res) > minimalDist] attr(res,"call") <- match.call() attr(res,"nIDx") <- length(x) class(res) <- "eventsPos" res }
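A minimal usage sketch on synthetic data (in the actual analysis peaks is applied to a filtered and thresholded version of the normalized traces; the toy vector x_toy below is invented purely for illustration):

x_toy <- sin(seq(0, 20*pi, length.out=3000)) ## a toy oscillation, roughly one maximum per period
sp_toy <- peaks(x_toy, minimalDist=15)       ## local maxima at least 15 sampling points apart
sp_toy                                       ## displayed with the print.eventsPos method defined below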
Method as.eventsPos transforms a vector into an eventsPos object; its formal arguments are:
- x: an integer vector with strictly increasing elements.
- start: an integer, the sampling point at which observation started.
- end: an integer, the sampling point at which observation ended.
as.eventsPos <- function(x, start, end ) { x <- as.integer(x) stopifnot(all(diff(x)>0)) if (missing(start)) start <- floor(x) if (missing(end)) end <- ceiling(x) stopifnot(all(x>=start)) stopifnot(all(x<=end)) attr(x,"call") <- match.call() attr(x,"nIDx") <- end-start+1 class(x) <- "eventsPos" x }
Method print.eventsPos prints an eventsPos object:
print.eventsPos <- function(x, ...) { cat("\neventsPos object with indexes of ", length(x)," events. \n", sep = "") cat(" Mean inter event interval: ", round(mean(diff(x)),digits=2), " sampling points, corresponding SD: ", round(sd(diff(x)),digits=2), " sampling points \n", sep = "") cat(" Smallest and largest inter event intervals: ", paste(range(diff(x)),collapse=" and "), " sampling points. \n\n",sep= "") }
12.1.4 Definition of an explore method for time series and eventsPos objects
The formal parameters of the method are:
- x: an eventsPos object.
- y: a ts (time series) object.
- offsetFactor: a numeric controlling the spacing between the recording sites on the plot. Smaller values lead to closer spacing.
- events.pch: an integer or a character, the plotting character used to indicate events.
- events.col: an integer or a character string coding the color used to indicate the events.
- ...: additional arguments passed to the matplot and plot functions called internally.
explore.eventsPos <- function(x,y, offsetFactor=0.5, events.pch=16, events.col=2, ...) { stopifnot(inherits(y,"ts")) yRange <- range(y,na.rm=TRUE) plotPara <- list(tlim = start(y)[1] + c(0,0.1), ylim = yRange, yMin = yRange[1], yMax = yRange[2], firstTime = start(y)[1], lastTime = end(y)[1], keepGoing = TRUE) nFct <- function() { leftTime <- plotPara$tlim[1] rightTime <- plotPara$tlim[2] timeRange <- rightTime - leftTime rightTime <- rightTime + timeRange if (rightTime > plotPara$lastTime) { cat("Recording end reached.\n ") rightTime <- plotPara$lastTime } plotPara$tlim <- c(rightTime - timeRange, rightTime) plotPara } fFct <- function() { leftTime <- plotPara$tlim[1] rightTime <- plotPara$tlim[2] timeRange <- rightTime - leftTime leftTime <- leftTime - timeRange if (leftTime < plotPara$firstTime) { cat("Recording end reached.\n ") leftTime <- plotPara$firstTime } plotPara$tlim <- c(leftTime, leftTime + timeRange) plotPara } qFct <- function() { plotPara$keepGoing <- FALSE plotPara } ## Function tFct definition ## Allows the user to change the recording duration displayed on the window ## The user is invited to enter a factor which will be used to multiply the ## present duration displayed. ## If the resulting duration is too long a warning is given and the whole ## recording is shown. ## If possible the center of the displayed window is conserved. tFct <- function() { presentWindowLength <- diff(range(plotPara$tlim)) tMessage <- paste("Present duration displayed: ", presentWindowLength, " \n", sep = "") tMessage <- paste(tMessage, "By what factor do you want to multiply it? \n", sep = "") theFactor <- as.numeric(readline(tMessage)) if (theFactor <= 0) { cat("A negative or null factor does not make sense.\n") return(plotPara) } ## End of conditional on theFactor <= 0 ## Check that the new display length is reasonable totalLength <- plotPara$lastTime - plotPara$firstTime if (theFactor * presentWindowLength >= totalLength) { cat("Cannot show more data than available but only the entire record.\n ") plotPara$tlim[1] <- plotPara$firstTime plotPara$tlim[2] <- plotPara$lastTime return(plotPara) } windowCenter <- plotPara$tlim[1] + presentWindowLength / 2 newLeft <- windowCenter - theFactor * presentWindowLength / 2 newRight <- windowCenter + theFactor * presentWindowLength / 2 if (!(newLeft >= plotPara$firstTime & newRight <= plotPara$lastTime)) { if (newLeft <= plotPara$firstTime) { cat("Cannot show data before the recording started, the displayed center wont be conserved.\n ") plotPara$tlim[1] <- plotPara$firstTime plotPara$tlim[2] <- plotPara$tlim[1] + theFactor * presentWindowLength } if (newRight >= plotPara$lastTime) { cat("Cannot show data after the recording ended, the displayed center wont be conserved.\n ") plotPara$tlim[2] <- plotPara$lastTime plotPara$tlim[1] <- plotPara$tlim[2] - theFactor * presentWindowLength } return(plotPara) } ## End of conditional on !(newLeft >= plotPara$firstTime & newRight <= plotPara$lastTime) plotPara$tlim[1] <- newLeft plotPara$tlim[2] <- newRight return(plotPara) } ## End of function tFct definition ## Function rFct definition ## Allows the user to change the maximal value displayed on the abscissa ## The user is invited to enter a value. rFct <- function() { leftTime <- plotPara$tlim[1] rightTime <- plotPara$tlim[2] timeRange <- rightTime - leftTime tMessage <- paste("Present latest time displayed: ", rightTime, "\n", sep = "") tMessage <- paste(tMessage, "What new latest time do want (return leaves things unchanged)? 
\n", sep = "") theNewTime <- as.numeric(readline(tMessage)) if (is.na(theNewTime)) { ## Nothing entered, leave things unchanged return(plotPara) } ## End of conditional on is.na(theFactor) if (theNewTime <= plotPara$firstTime) { ## This choice does not make sense cat("Cannot display data before recording started.\n") return(plotPara) } if (theNewTime > plotPara$lastTime) { cat("Recording end reached.\n ") rightTime <- plotPara$lastTime } else { if (theNewTime <= leftTime) { ## The new latest time entered is smaller that the earliest time displayed cat("The new latest time is smaller than the earliest, adjustement will be made.\n") leftTime <- theNewTime - timeRange if (leftTime < plotPara$firstTime) { cat("Adjustment requires a change in displayed duration.\n") leftTime <- plotPara$firstTime } } ## End of conditional on theNewTime <= leftTime rightTime <- theNewTime } ## End of conditional on theNewTime > plotPara$lastTime plotPara$tlim <- c(leftTime, rightTime) plotPara } ## Function lFct definition ## Allows the user to change the minimal value displayed on the abscissa ## The user is invited to enter a value. lFct <- function() { leftTime <- plotPara$tlim[1] rightTime <- plotPara$tlim[2] timeRange <- rightTime - leftTime tMessage <- paste("Present earliest time displayed: ", leftTime, "\n", sep = "") tMessage <- paste(tMessage, "What new earliest time do want (return leaves things unchanged)? \n", sep = "") theNewTime <- as.numeric(readline(tMessage)) if (is.na(theNewTime)) { ## Nothing entered, leave things unchanged return(plotPara) } ## End of conditional on is.na(theFactor) if (theNewTime >= plotPara$lastTime) { ## This choice does not make sense cat("Cannot display data after recording ended.\n") return(plotPara) } if (theNewTime < plotPara$firstTime) { cat("Recording start reached.\n ") leftTime <- plotPara$firstTime } else { if (theNewTime >= rightTime) { ## The new earliest time entered is larger that the latest time displayed cat("The new earliest time is larger than the latest, adjustement will be made.\n") rightTime <- theNewTime + timeRange if (rightTime > plotPara$lastTime) { cat("Adjustment requires a change in displayed duration.\n") rightTime <- plotPara$lastTime } } ## End of conditional on theNewTime <= leftTime leftTime <- theNewTime } ## End of conditional on theNewTime > plotPara$lastTime plotPara$tlim <- c(leftTime, rightTime) plotPara } ## Function yMaxFct definition ## Allows the user to change the maximal value displayed on the ordinate ## The user is invited to enter a value. yMaxFct <- function() { presentWindowRange <- range(plotPara$ylim) tMessage <- paste("Present range displayed: [", paste(presentWindowRange, collapse = ","), "] \n", sep = "") tMessage <- paste(tMessage, "What new maximal ordinate value do want (return goes back to maximum)? \n", sep = "") theFactor <- as.numeric(readline(tMessage)) if (is.na(theFactor)) { plotPara$ylim <- c(presentWindowRange[1],plotPara$yMax) return(plotPara) } ## End of conditional on is.na(theFactor) if (theFactor <= plotPara$ylim[1]) { cat("The maximum should be larger than the minimum.\n") return(plotPara) } ## End of conditional on theFactor <= plotPara$ylim[1] plotPara$ylim <- c(presentWindowRange[1],theFactor) return(plotPara) } ## End of function yMaxFct definition ## Function yMinFct definition ## Allows the user to change the minimal value displayed on the ordinate ## The user is invited to enter a value. 
yMinFct <- function() { presentWindowRange <- range(plotPara$ylim) tMessage <- paste("Present range displayed: [", paste(presentWindowRange, collapse = ","), "] \n", sep = "") tMessage <- paste(tMessage, "What new minimal ordinate value do want (return goes back to minimum)? \n", sep = "") theFactor <- as.numeric(readline(tMessage)) if (is.na(theFactor)) { plotPara$ylim <- c(plotPara$yMin, presentWindowRange[2]) return(plotPara) } ## End of conditional on is.na(theFactor) if (theFactor >= plotPara$ylim[2]) { cat("The minimum should be smaller than the maximum.\n") return(plotPara) } ## End of conditional on theFactor >= plotPara$ylim[2] plotPara$ylim <- c(theFactor, presentWindowRange[2]) return(plotPara) } ## End of function yMinFct definition show <- function(x, y, plotPara, ...) { s <- plotPara$tlim[1] e <- plotPara$tlim[2] y.m <- plotPara$ylim[1] y.M <- plotPara$ylim[2] firstIdx <- round(max(1,s*frequency(y))) lastIdx <- round(min(end(y)[1]*frequency(y),e*frequency(y))) ii <- firstIdx:lastIdx xx <- x[firstIdx <= x & x <= lastIdx] if (class(y)[1] == "mts") { m <- y[ii,] if (length(xx) > 0) mAtx <- as.matrix(y)[xx,,drop=FALSE] } else { m <- y[ii] if (length(xx) > 0) mAtx <- y[xx] } if (class(m) == "matrix") { m <- apply(m,2,function(x) ifelse(x < y.m, y.m,x)) m <- apply(m,2,function(x) ifelse(x > y.M, y.M,x)) ns <- dim(m)[2] offset <- c(0,-(1:(ns-1))*(y.M-y.m)) m <- t(t(m)+offset*offsetFactor) matplot(m,type="l",lty=1,axes=FALSE,xlab="",ylab="",...) if (length(xx) > 0) { mAtx <- t(t(mAtx)+offset*offsetFactor) matpoints(xx-ii[1]+1,mAtx,pch=events.pch,col=events.col) } } else { m[m<y.m] <- y.m m[m>y.M] <- y.M plot(m,type="l",lty=1,axes=FALSE,xlab="",ylab="",ylim=c(y.m,y.M),...) if (length(xx) > 0) points(xx-ii[1]+1,mAtx,pch=events.pch,col=events.col) } } plot.new() par(mar=c(0.5,0.5,0.5,0.5)) show(x,y,plotPara,...) myMessage <- "Make a choice:\n n or 'return' (next); f (former); l (lower abscissa limit); r (upper abscissa limit) \n t (time scale); Y (upper ordinate limit); y (lower ordinate limit); q (quit) \n " while(plotPara$keepGoing) { myChoice <- readline(myMessage) plotPara <- switch(myChoice, n = nFct(), f = fFct(), l = lFct(), r = rFct(), t = tFct(), Y = yMaxFct(), y = yMinFct(), q = qFct(), nFct() ) show(x,y,plotPara,...) } ## End of while loop on keepGoing dev.off() invisible() }
12.1.5 Function cutSglEvt
Its formal parameters are:
- evtPos: a numeric or an integer interpretable as an index, the position at which the cuts will be produced.
- data: a numeric vector or matrix containing the data. If a vector, the argument is converted internally into a single-column matrix. The matrix rows are indexed by sampling points and its columns by recording sites / channels.
- before: an integer, the number of sampling points within the cut before the reference time given by evtPos.
- after: an integer, the number of sampling points after the reference time.
It returns:
- A numeric vector with the cut(s). When several recording sites are used, the cuts from each individual site are placed one after the other.
cutSglEvt <- function(evtPos, data, before=14, after=30 ) { evtPos <- as.integer(evtPos) ## make sure evtPos is an integer before <- as.integer(before) ## make sure before is an integer stopifnot(0 <= before) ## make sure before is positive or null after <- as.integer(after) stopifnot(0 <= after) ## make sure after is positive or null if (is.vector(data)) data <- matrix(data,nc=1) ns <- dim(data)[2] dl <- dim(data)[1] stopifnot(0<evtPos, evtPos<=dl) ## make sure evtPos is within range sl <- before+after+1 ## the length of the cut keep <- -before:after + evtPos within <- 1 <= keep & keep <= dl kw <- keep[within] res <- sapply(1:ns, function(idx) { v <- numeric(sl) v[within] <- data[kw,idx] v } ) as.vector(res) }
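A self-contained toy illustration (the objects toy, spike_shape and pos_true are invented for this sketch and do not correspond to anything in the locust analysis): a stereotyped "spike" is added to two channels of white noise and a single cut is extracted around one of its occurrences.

set.seed(20110928)
toy <- matrix(rnorm(2*3000, sd=0.1), ncol=2)    ## two channels, 3000 sampling points
spike_shape <- 5*exp(-(-14:30)^2/18)            ## a 45-point 'spike' waveform
pos_true <- c(500, 1000, 1500, 2000)            ## where the 'spikes' are inserted
for (p in pos_true) toy[p+(-14:30),] <- toy[p+(-14:30),] + spike_shape
evt1 <- cutSglEvt(pos_true[1], data=toy, before=14, after=30)
length(evt1)                                    ## (14+30+1) x 2 sites = 90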
12.1.6 Function mkEvents
Its formal parameters are:
- positions: an integer vector with the events' positions given as indices / sampling points, or an eventsPos object.
- data: a numeric vector or matrix containing the data, or a 'ts' or 'mts' object. If a vector, the argument is converted internally into a single-column matrix. The matrix rows are indexed by sampling points and its columns by recording sites / channels.
- before: an integer, the number of sampling points within the cut before the reference times given by positions.
- after: an integer, the number of sampling points within the cut after the reference times given by positions.
It returns a matrix with (before + after + 1) rows per recording site, that is, (before + after + 1) times the number of sites rows in total, and as many columns as elements in positions. Each column is an "event", that is, a set of cuts on the data. Attribute "positions" contains the value of the argument with the same name, attribute "data" contains the name of the corresponding argument, attributes "before" and "after" contain the values of the arguments with the same names, and attribute "numberOfSites" contains the number of recording sites. Attribute "delta" is used when events are realigned on the median waveform (during "jitter cancellation"). The returned matrix is given an "events" class.
mkEvents <- function(positions, data, before=14, after=30 ) { positions <- unclass(positions) data <- unclass(data) res <- sapply(positions, cutSglEvt, data, before, after) the.call <- match.call() attr(res,"positions") <- positions attr(res,"delta") <- NULL attr(res,"data") <- the.call[["data"]] attr(res,"before") <- before attr(res,"after") <- after attr(res,"numberOfSites") <- ifelse(is.matrix(data),dim(data)[2],1) attr(res,"call") <- match.call() class(res) <- "events" res }
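Continuing the toy example introduced with cutSglEvt (toy and pos_true are invented objects), cuts are made around all four positions at once:

evts_toy <- mkEvents(pos_true, data=toy, before=14, after=30)
dim(evts_toy)      ## 90 rows (45 sampling points x 2 sites) and 4 columns, one per event
summary(evts_toy)  ## uses the summary.events method defined in the next sub-section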
12.1.7 events methods
summary.events <- function(object, ...) { b <- attr(object,"before") a <- attr(object,"after") ns <- attr(object,"numberOfSites") cat("\nevents object deriving from data set: ",attr(object,"data"),".\n",sep="") cat(" Events defined as cuts of ", a+b+1, " sampling points on each of the ",ns, " recording sites.\n",sep="") cat(" The 'reference' time of each event is located at point ", b+1, " of the cut.\n",sep="") if (!is.null(attr(object,"delta"))) { cat(" Events were realigned on median event.\n",sep="") } cat(" There are ", length(attr(object,"positions")), " events in the object.\n\n",sep="") }
"[.events" <- function(x,i,j,drop = FALSE) { y <- NextMethod("[") if (!missing(i)) return(NULL) if (is.matrix(y) && dim(y)[2] > 1) { attr(y,"positions") <- attr(x,"positions")[j] attr(y,"delta") <- attr(x,"delta") attr(y,"data") <- attr(x,"data") attr(y,"before") <- attr(x,"before") attr(y,"after") <- attr(x,"after") attr(y,"numberOfSites") <- attr(x,"numberOfSites") attr(y,"call") <- match.call() class(y) <- "events" } y }
t.events <- function(x) { t(unclass(x)) }
mean.events <- function(x,...) { apply(unclass(x),1,mean,...) } median.events <- function(x,na.rm = FALSE) { apply(unclass(x),1,median,na.rm) }
'-.events' <- function(e1,e2) { stopifnot(length(e2) == dim(e1)[1]) res <- unclass(e1)-e2 attr(res,"positions") <- attr(e1,"positions") attr(res,"delta") <- attr(e1,"delta") attr(res,"data") <- attr(e1,"data") attr(res,"before") <- attr(e1,"before") attr(res,"after") <- attr(e1,"after") attr(res,"numberOfSites") <- attr(e1,"numberOfSites") attr(res,"call") <- match.call() class(res) <- "events" res }
plot.events <- function(x, y=NULL, evts.lwd = 0.1, medAndMad = TRUE, evts.col = "black", med.col = "red", mad.col = "blue", x.bar = NULL, y.bar = NULL) { nsites <- attr(x,"numberOfSites") ne <- dim(x)[2] cl <- dim(x)[1]/nsites ylim <- range(x) matplot(x,type="n",xlab="",ylab="",axes=FALSE,ylim=ylim) if (nsites > 1) { ii <- 2*(1:(nsites %/% 2)) rect((ii-1)*cl,ylim[1],ii*cl,ylim[2],col="grey80",border=NA) } matlines(x,col=evts.col,lty=1,lwd=evts.lwd) if (medAndMad) { med <- apply(x,1,median) mad <- apply(x,1,mad) lines(med,col=med.col) lines(mad,col=mad.col) } if (!is.null(x.bar)) segments(x0=0,y0=ylim[1]+0.1*diff(ylim),x1=x.bar) if (!is.null(y.bar)) segments(x0=0,y0=ylim[1]+0.1*diff(ylim),y1=ylim[1]+0.1*diff(ylim)+y.bar) }
lines.events <- function(x, evts.lwd = 0.1, evts.col = "black", ... ) { matlines(x,col=evts.col,lty=1,lwd=evts.lwd,...) }
print.events <- function(x, ... ) { plot.events(x,...) }
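These methods let events objects be manipulated like ordinary matrices while keeping their attributes. Continuing the toy example (evts_toy is the invented object built above):

centered_toy <- evts_toy - median(evts_toy)  ## '-.events': subtract the median waveform
plot(centered_toy, evts.lwd=1)               ## plot.events: residuals around the median
plot(centered_toy[,1:2], evts.lwd=1)         ## '[.events': keep the first two events only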
12.1.8 mkNoise definition
mkNoise <- function(positions, data, before=14, after=30, safetyFactor=2, size=2000) { positions <- unclass(positions) data <- unclass(data) if (!is.matrix(data)) data <- matrix(data,nc=1) size <- as.integer(size) stopifnot(0 < size) ## make sure size is a positive integer sl <- before+after+1 ns <- dim(data)[2] i1 <- diff(positions) ## inter events intervals nbI <- (i1-round(safetyFactor*sl))%/%sl ## number of noise sweeps ## one can cut from each ## interval nbPossible <- min(size, sum((nbI)[nbI>0]) ) ## allocate next the memory for the noise events noiseMatrix <- matrix(0, nr=ns*sl, nc=nbPossible ) iV <- (1:length(i1))[nbI>0] ## A vector containing the indexes of ## the (inter event) intervals from ## which at least one noise sweep can be ## cut. iIdx <- 1 ## an index running over the inter event intervals from ## which noise events can be cut. nInI <- nbI[iV[iIdx]] ## the number of noise sweeps that can be cut ## from the "non empty" inter event interval ## iV[iIdx]. nIdx <- 1 ## An index running over the noise sweeps. noisePositions <- integer(nbPossible) while (nIdx <= nbPossible) { uInI <- 1 ## An index running over the noise sweeps that will be ## cut from a given "non empty" inter event interval. iPos <- positions[iV[iIdx]] + round(safetyFactor*sl) noisePositions[nIdx] <- iPos while (uInI <= nInI & nIdx <= nbPossible ) { ii <- (-before:after) + iPos ns <- as.vector(data[ii,]) noiseMatrix[,nIdx] <- ns nIdx <- nIdx + 1 iPos <- iPos + sl uInI <- uInI + 1 } ## End of while loop on uInI iIdx <- iIdx + 1 nInI <- nbI[iV[iIdx]] } ## End of while loop on nIdx the.call <- match.call() attr(noiseMatrix,"positions") <- noisePositions attr(noiseMatrix,"delta") <- NULL attr(noiseMatrix,"data") <- the.call[["data"]] attr(noiseMatrix,"before") <- before attr(noiseMatrix,"after") <- after attr(noiseMatrix,"numberOfSites") <- ifelse(is.matrix(data),dim(data)[2],1) attr(noiseMatrix,"call") <- match.call() class(noiseMatrix) <- "events" noiseMatrix }
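Continuing the toy example (toy and pos_true are invented objects), noise sweeps are cut from the stretches between the "spike" positions:

noise_toy <- mkNoise(pos_true, data=toy, before=14, after=30, safetyFactor=2, size=10)
dim(noise_toy)  ## same number of rows as the events, here 10 noise sweeps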
12.1.9 Definition of an explore method for pca results
explore.prcomp <- function(x, pc=1, ##<< an integer: the pc index to add to the mean. factor=2, ##<< a numeric, the scaling factor; that is, the plot shows mean +/- factor * pc. m.col="black", ##<< a character string or an integer, the color used for mean. u.col="red", ##<< a character string or an integer, the color used for mean + factor * pc. l.col="blue", ##<< a character string or an integer, the color used for mean - factor * pc. xlab="Index", ##<< a character string with the abscissa label. ylab="Amplitude", ##<< a character string with the ordinate label. main, ##<< a character string with the title. If 'missing' (default) one is automatically generated. ... ##<< additional arguments passed to 'plot'. ) { if (missing(main)) { w <- x$sdev[pc]^2/sum(x$sdev^2) main <- paste("PC ",pc," (",round(100*w,digits=1),"%)",sep="") } u <- x$center + factor * x$rotation[,pc] l <- x$center - factor * x$rotation[,pc] ylim=range(c(l,u)) plot(x$center,type="l",xlab=xlab,ylab=ylab,col=m.col,main=main,ylim=ylim,...) lines(u,col=u.col,...) lines(l,col=l.col,...) }
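A usage sketch, still on the invented toy events (in the actual analysis the PCA is computed on the clean events): prcomp expects events as rows, hence the transposition, and the explore generic defined in Sec. 12.1.2 dispatches to the method above.

pca_toy <- prcomp(t(unclass(evts_toy)))
explore(pca_toy, pc=1, factor=2)  ## mean waveform +/- 2 x first principal component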
12.1.10 Definition of get_jitter
We define a function that estimates the jitter given:
- evts: a single event, or a matrix of events where individual events form the columns.
- center: the 'central' (median) event on which the alignment will be performed.
- centerD: the first derivative of the central event.
- centerDD: the second derivative of the central event.
The function returns a vector of estimated jitters giving the number of sampling points by which the central event should be shifted in order to best match each individual event. This sampling jitter estimation is performed by a two-stage procedure:
- Linear regression is first used to get an initial jitter estimate based on a first-order Taylor expansion.
- A Newton-Raphson step then refines this estimate (used as a starting point) based on a second-order Taylor expansion.
get_jitter <- function(evts, center, centerD, centerDD){ centerD_norm2 <- sum(centerD^2) centerDD_norm2 <- sum(centerDD^2) centerD_dot_centerDD <- sum(centerD*centerDD) if (is.null(dim(evts))) evts <- matrix(evts, nc=1) evts <- evts - center h_dot_centerD <- centerD %*% evts delta0 <- h_dot_centerD/centerD_norm2 h_dot_centerDD <- centerDD %*% evts first <- -2*h_dot_centerD + 2*delta0*(centerD_norm2 - h_dot_centerDD) + 3*delta0^2*centerD_dot_centerDD + delta0^3*centerDD_norm2 second <- 2*(centerD_norm2 - h_dot_centerDD) + 6*delta0*centerD_dot_centerDD + 3*delta0^2*centerDD_norm2 as.vector(delta0 - first/second) }
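To spell out the two steps (this derivation simply restates what get_jitter computes): write an observed event e as the center c shifted by an unknown jitter delta, expanded to second order, and set h = e - c. The jitter is chosen to minimize

$$f(\delta) = \left\| h - \delta\, c' - \frac{\delta^2}{2}\, c'' \right\|^2 .$$

Dropping the second-order term turns the minimization into a simple linear regression whose solution is the first estimate

$$\delta_0 = \frac{\langle h, c' \rangle}{\| c' \|^2},$$

and a single Newton-Raphson step on the full f, started at delta_0, gives the refined estimate delta_1 = delta_0 - f'(delta_0)/f''(delta_0), with

$$f'(\delta) = -2\langle h, c' \rangle + 2\delta\left(\| c' \|^2 - \langle h, c'' \rangle\right) + 3\delta^2 \langle c', c'' \rangle + \delta^3 \| c'' \|^2,$$

$$f''(\delta) = 2\left(\| c' \|^2 - \langle h, c'' \rangle\right) + 6\delta \langle c', c'' \rangle + 3\delta^2 \| c'' \|^2 ;$$

these are exactly the quantities first and second computed in the code above.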
12.1.11 Definition of mk_aligned_events
We define a function taking a vector of spike times, which should all come from the same cluster and correspond to reasonably "clean" events, and three arguments corresponding to the last three arguments of mkEvents; the function returns an events object. The positions attribute of the returned object gives the nearest sampling point to the actual peak. The delta attribute gives the offset between the previous spike position and the "actual" peak position (the actual position is attribute positions minus attribute delta):
mk_aligned_events <- function(positions, data, before=14, after=30){ dataD = apply(data,2,function(x) c(0,diff(x,2)/2,0)) dataDD = apply(dataD,2,function(x) c(0,diff(x,2)/2,0)) evts = mkEvents(positions, data, before, after) evtsD = mkEvents(positions, dataD, before, after) evtsDD = mkEvents(positions, dataDD, before, after) evts_median = apply(evts,1,median) evtsD_median = apply(evtsD,1,median) evtsDD_median = apply(evtsDD,1,median) evts_jitter = get_jitter(evts,evts_median,evtsD_median,evtsDD_median) ## positions = positions-[round(x.item(0)) for x in np.nditer(evts_jitter)] positions = positions-round(evts_jitter) evts = mkEvents(positions, data, before, after) evtsD = mkEvents(positions, dataD, before, after) evtsDD = mkEvents(positions, dataDD, before, after) evts_median = apply(evts,1,median) evtsD_median = apply(evtsD,1,median) evtsDD_median = apply(evtsDD,1,median) evts_jitter = get_jitter(evts,evts_median,evtsD_median,evtsDD_median) res = unclass(evts) - evtsD_median %o% evts_jitter - evtsDD_median %o% evts_jitter^2/2 attributes(res) = attributes(evts) attr(res,"positions") <- positions attr(res,"call") <- match.call() attr(res,"delta") <- evts_jitter res }
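A quick sanity check on the invented toy data from the previous sub-sections (with noiseless integer positions the estimated jitters should be close to zero; on real data they are typically fractions of a sampling period):

evts_al <- mk_aligned_events(pos_true, toy, before=14, after=30)
round(attr(evts_al, "delta"), 3)       ## residual (sub-sample) jitter of each event
attr(evts_al, "positions") - pos_true  ## integer part of the correction, here 0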
12.1.12 Definition of mk_center_list
We define function mk_center_list that, given:
- positions, a vector of spike times that should all come from the same cluster and correspond to reasonably "clean" events,
- data, a data matrix,
- before, the number of sampling points to keep before the peak,
- after, the number of sampling points to keep after the peak,
returns a named list with the following elements and their content:
- center: the estimate of the center (obtained from the median),
- centerD: the estimate of the center's derivative (obtained from the median of events cut on the derivative of data),
- centerDD: the estimate of the center's second derivative (obtained from the median of events cut on the second derivative of data),
- centerD_norm2: the squared norm of the center's derivative,
- centerDD_norm2: the squared norm of the center's second derivative,
- centerD_dot_centerDD: the scalar product of the center's first and second derivatives,
- center_idx: an array of indices going from -before to after.
The peeling procedure requires, for each cluster, estimates of its center and of its first two derivatives. Clusters' centers must moreover be built such that they can be used for subtraction; this implies that we should make them long enough, on both sides of the peak, to see them go back to baseline. Formal parameters before and after should therefore be set to larger values than the ones used for clustering.
mk_center_list = function(positions, data, before=49, after=80) { dataD = apply(data,2,function(x) c(0,diff(x,2)/2,0)) dataDD = apply(dataD,2,function(x) c(0,diff(x,2)/2,0)) evts = mkEvents(positions, data, before, after) evtsD = mkEvents(positions, dataD, before, after) evtsDD = mkEvents(positions, dataDD, before, after) evts_median = apply(evts,1,median) evtsD_median = apply(evtsD,1,median) evtsDD_median = apply(evtsDD,1,median) list("center" = evts_median, "centerD" = evtsD_median, "centerDD" = evtsDD_median, "centerD_norm2" = sum(evtsD_median^2), "centerDD_norm2" = sum(evtsDD_median^2), "centerD_dot_centerDD" = sum(evtsD_median*evtsDD_median), "center_idx" = -before:after) }
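Continuing the toy example (in the actual analysis, positions would hold the clean events of one cluster and data the normalized recording): a centers list with a single entry, using the longer cuts recommended above.

centers_toy <- list("Cluster 1" = mk_center_list(pos_true, toy, before=49, after=80))
names(centers_toy[["Cluster 1"]])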
12.1.13 Definition of classify_and_align_evt
The formal parameters of classify_and_align_evt are:
- a set of events' positions,
- a data matrix containing the raw (but normalized) data,
- a named list of centers,
- arguments before and after corresponding to the arguments with those names in mk_center_list.
It returns a list with the following elements:
- the name of the closest center in terms of Euclidean distance, or "?" if none of the clusters' waveforms does better than a uniformly null one,
- the new position of the event (the previous position corrected by the integer part of the estimated jitter),
- the remaining jitter.
classify_and_align_evt <- function(evt_pos, data, centers, before=14, after=30 ){ cluster_names = names(centers) n_sites = dim(data)[2] centersM = sapply(cluster_names, function(cn) centers[[cn]][["center"]][rep(-before <= centers[[cn]][["center_idx"]] & centers[[cn]][["center_idx"]] <= after, n_sites)]) evt = cutSglEvt(evt_pos,data=data,before=before, after=after) delta = -(centersM - evt) cluster_idx = which.min(apply(delta^2,2,sum)) good_cluster_name = cluster_names[cluster_idx] good_cluster_idx = rep(-before <= centers[[good_cluster_name]][["center_idx"]] & centers[[good_cluster_name]][["center_idx"]] <= after, n_sites) centerD = centers[[good_cluster_name]][["centerD"]][good_cluster_idx] centerD_norm2 = sum(centerD^2) centerDD = centers[[good_cluster_name]][["centerDD"]][good_cluster_idx] centerDD_norm2 = sum(centerDD^2) centerD_dot_centerDD = sum(centerD*centerDD) h = delta[,cluster_idx] h_order0_norm2 = sum(h^2) h_dot_centerD = sum(h*centerD) jitter0 = h_dot_centerD/centerD_norm2 h_order1_norm2 = sum((h-jitter0*centerD)^2) if (h_order0_norm2 > h_order1_norm2) { h_dot_centerDD = sum(h*centerDD) first = -2*h_dot_centerD + 2*jitter0*(centerD_norm2 - h_dot_centerDD) + 3*jitter0^2*centerD_dot_centerDD + jitter0^3*centerDD_norm2 second = 2*(centerD_norm2 - h_dot_centerDD) + 6*jitter0*centerD_dot_centerDD + 3*jitter0^2*centerDD_norm2 jitter1 = jitter0 - first/second h_order2_norm2 = sum((h-jitter1*centerD-jitter1^2/2*centerDD)^2) if (h_order1_norm2 <= h_order2_norm2) { jitter1 = jitter0 } } else { jitter1 = 0 } if (abs(round(jitter1)) > 0) { evt_pos = evt_pos - round(jitter1) evt = cutSglEvt(evt_pos,data=data,before=before, after=after) h = evt - centers[[good_cluster_name]][["center"]][good_cluster_idx] h_order0_norm2 = sum(h^2) h_dot_centerD = sum(h*centerD) jitter0 = h_dot_centerD/centerD_norm2 h_order1_norm2 = sum((h-jitter0*centerD)^2) if (h_order0_norm2 > h_order1_norm2) { h_dot_centerDD = sum(h*centerDD) first = -2*h_dot_centerD + 2*jitter0*(centerD_norm2 - h_dot_centerDD) + 3*jitter0^2*centerD_dot_centerDD + jitter0^3*centerDD_norm2 second = 2*(centerD_norm2 - h_dot_centerDD) + 6*jitter0*centerD_dot_centerDD + 3*jitter0^2*centerDD_norm2 jitter1 = jitter0 - first/second h_order2_norm2 = sum((h-jitter1*centerD-jitter1^2/2*centerDD)^2) if (h_order1_norm2 <= h_order2_norm2) { jitter1 = jitter0 } } else { jitter1 = 0 } } if (sum(evt^2) > sum((h-jitter1*centerD-jitter1^2/2*centerDD)^2)) return(list(cluster_names[cluster_idx], evt_pos, jitter1)) else return(list('?',evt_pos, jitter1)) }
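A toy illustration with the invented objects above: an event detected one sampling point after the true peak is classified against centers_toy (the values of before and after here match the ones used when cutting the events).

classify_and_align_evt(pos_true[2]+1, data=toy, centers=centers_toy,
                       before=14, after=30)
## returns the cluster name, the corrected position and the remaining jitter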
12.1.14 Definition of predict_data
The formal parameters of predict_data are:
- class_pos_jitter_list: a list of lists. Each sub-list corresponds to a single detected event and has three components: the cluster of origin (or "?" if none was found), the position and the jitter.
- centers_list: a list of lists. Each sub-list is the result of a call to mk_center_list.
- nb_channels: an integer, the number of recording channels.
- data_length: an integer, the length of the data.
The function returns a matrix with data_length rows and nb_channels columns containing the predicted observations based on class_pos_jitter_list and centers_list.
predict_data <- function(class_pos_jitter_list, centers_list, nb_channels=4, data_length=300000) { res = matrix(0,nc=nb_channels, nr=data_length) for (class_pos_jitter in class_pos_jitter_list) { cluster_name = class_pos_jitter[[1]] if (cluster_name != '?') { center = centers_list[[cluster_name]][["center"]] centerD = centers_list[[cluster_name]][["centerD"]] centerDD = centers_list[[cluster_name]][["centerDD"]] jitter = class_pos_jitter[[3]] pred = center + jitter*centerD + jitter^2/2*centerDD pred = matrix(pred,nc=nb_channels) idx = centers_list[[cluster_name]][["center_idx"]] + class_pos_jitter[[2]] within = 0 < idx & idx <= data_length kw = idx[within] res[kw,] = res[kw,] + pred[within,] } } res }
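Continuing the toy example, this is how the pieces fit together for one peeling round: every detected event is classified, an ideal trace is predicted from the fitted centers and jitters, and that prediction is subtracted from the data before the next detection round.

cpj_toy <- lapply(pos_true, classify_and_align_evt,
                  data=toy, centers=centers_toy, before=14, after=30)
pred_toy <- predict_data(cpj_toy, centers_toy, nb_channels=2, data_length=nrow(toy))
toy_peeled <- toy - pred_toy  ## residual handed to the next detection round
range(toy_peeled)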