
Point process models for sequence detection in high-dimensional neural spike trains

Alex H. Williams
Department of Statistics
Stanford University
Stanford, CA 94305
[email protected]
Anthony Degleris
Department of Electrical Engineering
Stanford University
Stanford, CA 94305
[email protected]
Yixin Wang
Department of Statistics
Columbia University
New York NY 10027
[email protected]
Scott W. Linderman
Department of Statistics
Stanford University
Stanford, CA 94305
[email protected]
Abstract

Sparse sequences of neural spikes are posited to underlie aspects of working memory [1], motor production [2], and learning [3, 4]. Discovering these sequences in an unsupervised manner is a longstanding problem in statistical neuroscience [5, 6, 7]. Promising recent work [4, 8] utilized a convolutive nonnegative matrix factorization model [9] to tackle this challenge. However, this model requires spike times to be discretized, utilizes a sub-optimal least-squares criterion, and does not provide uncertainty estimates for model predictions or estimated parameters. We address each of these shortcomings by developing a point process model that characterizes fine-scale sequences at the level of individual spikes and represents sequence occurrences as a small number of marked events in continuous time. This ultra-sparse representation of sequence events opens new possibilities for spike train modeling. For example, we introduce learnable time warping parameters to model sequences of varying duration, which have been experimentally observed in neural circuits [10]. We demonstrate these advantages on experimental recordings from songbird higher vocal center and rodent hippocampus.

1 Introduction

Identifying interpretable patterns in multi-electrode recordings is a longstanding and increasingly pressing challenge in neuroscience. Depending on the brain area and behavioral task, the activity of large neural populations may be low-dimensional as quantified by principal components analysis (PCA) or other latent variable models [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. However, many datasets do not conform to these modeling assumptions and instead exhibit high-dimensional behaviors [24]. Neural sequences are an important example of high-dimensional structure: if $N$ neurons fire sequentially with no overlap, the resulting dynamics are $N$-dimensional and cannot be efficiently summarized by PCA or other linear dimensionality reduction methods [4]. Such sequences underlie current theories of working memory [1, 25], motor production [2], and memory replay [10]. More generally, neural sequences are seen as flexible building blocks for learning and executing complex neural computations [26, 27, 28].

Prior Work In practice, neural sequences are usually identified in a supervised manner by correlating neural firing with salient sensory cues or behavioral actions. For example, hippocampal place cells fire sequentially when rodents travel down a narrow hallway, and these sequences can be found by averaging spike times over multiple traversals of the hallway. After identifying this sequence on a behavioral timescale lasting seconds, a template matching procedure can be used to show that these sequences reoccur on compressed timescales during wake [29] and sleep [30]. In other cases, sequences can be identified relative to salient features of the local field potential (LFP; [31, 32]).

Developing unsupervised alternatives that directly extract sequences from multineuronal spike trains would broaden the scope of this research and potentially uncover new sequences that are not linked to behaviors or sensations [33]. Several works have shown preliminary progress in this direction. Maboudi et al. [34] proposed fitting a hidden Markov model (HMM) and then identifying sequences from the state transition matrix. Grossberger et al. [35] and van der Meij & Voytek [36] applied clustering to features computed over a sliding window. Others use statistics such as time-lagged correlations to detect sequences and other spatiotemporal patterns in a bottom-up fashion [6, 7].

In this paper, we develop a Bayesian point process model that generalizes convolutive nonnegative matrix factorization (convNMF; [9]), which was recently used by [8] and [4] to model neural sequences. Briefly, the convNMF model discretizes each neuron's spike train into a vector of $B$ time bins, $\mathbf{x}_{n} \in \mathbb{R}_{+}^{B}$ for neuron $n$, and approximates this collection of vectors as a sum of convolutions, $\mathbf{x}_{n} \approx \sum_{r=1}^{R} \mathbf{w}_{n,r} * \mathbf{h}_{r}$. The model parameters ($\mathbf{w}_{n,r} \in \mathbb{R}_{+}^{L}$ and $\mathbf{h}_{r} \in \mathbb{R}_{+}^{B}$) are optimized with respect to a least-squares criterion. Each component of the model, indexed by $r \in \{1, \ldots, R\}$, consists of a neural factor $\mathbf{W}_{r} \in \mathbb{R}_{+}^{N \times L}$ and a temporal factor $\mathbf{h}_{r}$. The neural factor encodes a spatiotemporal pattern of neural activity over $L$ time bins (which is hoped to be a sequence), while the temporal factor indicates the times at which this pattern of neural activity occurs. A total of $R$ motifs or sequence types, each corresponding to a different component, are extracted by this model.

There are several compelling features of this approach. Similar to classical nonnegative matrix factorization [37], and in contrast to clustering methods, convNMF captures sequences with overlapping groups of neurons by an intuitive “parts-based” representation. Indeed, convNMF uncovered overlapping sequences in experimental data from songbird higher vocal center (HVC) [4]. Further, if the same sequence is repeated with different peak firing rates, convNMF can capture this by varying the magnitude of the entries in $\mathbf{h}_{r}$, unlike many clustering methods. Finally, convNMF efficiently pools statistical power across large populations of neurons to identify sequences even when the correlations between temporally adjacent neurons are noisy—other methods, such as HMMs and bottom-up agglomerative clustering, require reliable pairwise correlations to string together a full sequence.

Our Contributions We propose a point process model for neural sequences (PP-Seq) which extends and generalizes convNMF to continuous time and uses a fully probabilistic Bayesian framework. This enables us to better quantify uncertainty in key parameters—e.g. the overall number of sequences and the times at which they occur—and also characterize the data at finer timescales—e.g. whether individual spikes were evoked by a sequence. Most importantly, by achieving an extremely sparse representation of sequence event times, the PP-Seq model enables a variety of model extensions that are not easily incorporated into convNMF or other common methods. We explore one such extension that introduces time warping factors to model sequences of varying duration, as is often observed in neural data [10].

Though we focus on applications in neuroscience, our approach could be adapted to other temporal point processes, which are a natural framework to describe data that are collected at irregular intervals (e.g. social media posts, consumer behaviors, and medical records) [38, 39]. We draw a novel connection between Neyman-Scott processes [40], which encompass PP-Seq and other temporal point process models as special cases, and mixture of finite mixture models [63]. Exploiting this insight, we develop innovative Markov chain Monte Carlo (MCMC) methods for PP-Seq.

2 Model

Figure 1: (A) Example of a Neyman-Scott process over a 2D region. Latent events (purple dots) are first sampled from a homogeneous Poisson process. Each latent event then spawns a number of nearby observed events (gray dots), according to an inhomogeneous Poisson process. (B) A spike train can be modeled as a Neyman-Scott process with marked events over a 1D interval representing time. Latent events ($z_k$; sequences) evoke observed events ($x_s$; spikes) ordered in a sequence. An example with $K=3$ latent events evoking $R=2$ different sequence types (blue & red) is shown.

2.1 Point Process Models and Neyman-Scott Processes

Point processes are probabilistic models that generate discrete sets $\{x_{1}, x_{2}, \ldots\} \triangleq \{x_{s}\}_{s=1}^{S}$ over some continuous space $\mathcal{X}$. Each member of the set $x_{s} \in \mathcal{X}$ is called an “event.” A Poisson process [58] is a point process which satisfies three properties. First, the number of events falling within any region $V \subset \mathcal{X}$ is Poisson distributed. Second, there exists a function $\lambda(x): \mathcal{X} \mapsto \mathbb{R}_{+}$, called the intensity function, for which $\int_{V} \lambda(x)\,\mathrm{d}x$ equals the expected number of events in $V$. Third, the numbers of events in non-overlapping regions of $\mathcal{X}$ are independent. A Poisson process is said to be homogeneous when $\lambda(x) = c$ for some constant $c$. Finally, a marked point process extends the usual definition of a point process to incorporate additional information into each event. A marked point process on $\mathcal{X}$ generates random tuples $x_{s} = (\tilde{x}_{s}, m_{s})$, where $\tilde{x}_{s} \in \mathcal{X}$ represents the random location and $m_{s} \in \mathcal{M}$ is the additional “mark” specifying metadata associated with event $s$. See Supplement A for further background details and references.

Neyman-Scott Processes [40] use Poisson processes as building blocks to model clustered data. To simulate a Neyman-Scott process, first sample a set of latent events $\{z_{k}\}_{k=1}^{K} \subset \mathcal{Z}$ from an initial Poisson process with intensity function $\lambda(z): \mathcal{Z} \mapsto \mathbb{R}_{+}$. Thus, the number of latent events is a Poisson-distributed random variable, $K \sim \text{Poisson}(\int_{\mathcal{Z}} \lambda(z)\,\mathrm{d}z)$. Given the latent events, the observed events $\{x_{s}\}_{s=1}^{S}$ are drawn from a second Poisson process with conditional intensity function $\lambda(x) = \lambda^{\varnothing} + \sum_{k=1}^{K} g(x; z_{k})$. The nonnegative functions $g(x; z_{k})$ can be thought of as impulse responses—each latent event adds to the rate of observed events, and stochastically generates some number of observed offspring events. Finally, the scalar parameter $\lambda^{\varnothing} > 0$ specifies a “background” rate; thus, if $K = 0$, the observations follow a homogeneous Poisson process with intensity $\lambda^{\varnothing}$.

A simple example of a Neyman-Scott process in $\mathbb{R}^{2}$ is shown in fig. 1A. Latent events (purple dots) are first drawn from a homogeneous Poisson process and specify the cluster centroids. Each latent event induces an isotropic Gaussian impulse response. Observed events (gray dots) are then sampled from a second Poisson process, whose intensity function is found by summing all impulse responses.
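A minimal simulation sketch of this construction, written in Julia (the language of the reference implementation mentioned in Section 4) but not taken from the paper's code; the rate, offspring count, and bandwidth are arbitrary illustrative values, and the background process is omitted for brevity:

using Distributions, Random

# Simulate a Neyman-Scott process on the unit square, as in fig. 1A:
# latent events come from a homogeneous Poisson process, and each latent
# event spawns Poisson-many offspring around an isotropic Gaussian bump.
function simulate_neyman_scott(; rate=10.0, mean_offspring=20.0, bandwidth=0.03,
                               rng=Random.default_rng())
    K = rand(rng, Poisson(rate))                 # number of latent events
    latents = [rand(rng, 2) for _ in 1:K]        # uniform centroids on [0, 1]^2
    observed = Vector{Vector{Float64}}()
    for z in latents
        for _ in 1:rand(rng, Poisson(mean_offspring))
            push!(observed, z .+ bandwidth .* randn(rng, 2))   # Gaussian impulse response
        end
    end
    return latents, observed
end

latents, observed = simulate_neyman_scott()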

Importantly, we do not observe the number of latent events nor their locations. As described below, we use probabilistic inference to characterize this unobserved structure. A key idea is to attribute each observed event to a latent cause (either one of the latent events or the background process); these attributions are valid due to the additive nature of the latent intensity functions and the superposition principle of Poisson processes (see Supplement A). From this perspective, inferring the set of latent events is similar to inferring the number and location of clusters in a nonparametric mixture model.

2.2 Point Process Model of Neural Sequences (PP-Seq)

We model neural sequences as a Neyman-Scott process with marked events in a model we call PP-Seq. Consider a dataset with $N$ neurons, emitting a total of $S$ spikes over a time interval $[0, T]$. This can be encoded as a set of $S$ marked spike times—the observed events are tuples $x_{s} = (t_{s}, n_{s})$ specifying the time, $t_{s} \in [0, T]$, and neuron, $n_{s} \in \{1, \ldots, N\}$, of each spike. Sequences correspond to latent events, which are also tuples $z_{k} = (\tau_{k}, r_{k}, A_{k})$ specifying the time, $\tau_{k} \in [0, T]$, type, $r_{k} \in \{1, \ldots, R\}$, and amplitude, $A_{k} > 0$, of the sequence. The hyperparameter $R$ specifies the number of recurring sequence types, which is analogous to the number of components in convNMF.

To draw samples from the PP-Seq model, we first sample sequences (i.e., latent events) from a Poisson process with intensity $\lambda(z) \triangleq \lambda(\tau, r, A) = \psi\,\pi_{r}\,\mathrm{Ga}(A; \alpha, \beta)$. Here, $\psi > 0$ sets the rate at which sequences occur within the data, $\pi \in \Delta_{R}$ sets the probabilities of the $R$ sequence types, and $\alpha, \beta$ parameterize a gamma density which models the sequence amplitude, $A$. Note that the number of sequence events is a Poisson-distributed random variable, $K \sim \text{Poisson}(\psi T)$, where the rate parameter $\psi T$ is found by integrating $\lambda(z)$ over all sequence types, amplitudes, and times.
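Because this intensity factorizes over time, type, and amplitude, the latent events can be drawn in two steps: a Poisson count followed by independent marks. A minimal sketch with illustrative hyperparameter values (not the paper's defaults):

using Distributions

ψ, T = 0.1, 1000.0                  # sequence rate (events/s) and recording length (s)
πr   = [0.5, 0.5]                   # probabilities of the R = 2 sequence types
α, β = 20.0, 0.2                    # gamma prior on amplitudes (shape, rate)

K = rand(Poisson(ψ * T))            # number of sequence events
τ = rand(Uniform(0, T), K)          # sequence times
r = rand(Categorical(πr), K)        # sequence types
A = rand(Gamma(α, 1 / β), K)        # amplitudes (Distributions.jl uses shape/scale)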

Conditioned on the set of $K$ sequence events, the firing rate of neuron $n$ is given by a sum of nonnegative impulse responses:

$$\lambda_{n}(t) = \lambda^{\varnothing}_{n} + \sum_{k=1}^{K} g_{n}(t; z_{k}). \qquad (1)$$

We assume these impulse responses vary across neurons and follow a Gaussian form:

$$g_{n}(t; z_{k}) = A_{k} \cdot a_{n r_{k}} \cdot \mathcal{N}(t \mid \tau_{k} + b_{n r_{k}},\, c_{n r_{k}}), \qquad (2)$$

where $\mathcal{N}(t \mid \mu, \sigma^{2})$ denotes a Gaussian density. The parameters $a_{r} = (a_{1r}, \ldots, a_{Nr}) \in \Delta_{N}$, $b_{nr} \in \mathbb{R}$, and $c_{nr} \in \mathbb{R}_{+}$ correspond to the weight, latency, and width, respectively, of neurons' firing rates in sequences of type $r$. Since the firing rate is a sum of non-negative impulse responses, the superposition principle of Poisson processes (see Supplement A) implies that we can view the data as a union of “background” spikes and “induced” spikes from each sequence, justifying the connection to clustering. The expected number of spikes induced by sequence $k$ is:

$$\sum_{n=1}^{N} \int_{0}^{T} g_{n}(t; z_{k})\,\mathrm{d}t \;\approx\; \sum_{n=1}^{N} \int_{-\infty}^{\infty} g_{n}(t; z_{k})\,\mathrm{d}t \;=\; A_{k}, \qquad (3)$$

and thus we may view $A_{k}$ as the amplitude of sequence event $k$.
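For concreteness, the conditional intensity in eqs. (1)-(2) can be evaluated directly. A minimal sketch with arbitrary illustrative response parameters (note that $c_{nr}$ is a variance, so the Normal density takes its square root as the standard deviation), meant to compose with the latent events $(\tau, r, A)$ sampled in the previous sketch:

using Distributions

N, R = 50, 2
λ_bg = fill(0.3, N)                    # background rates λ_n^∅ (spikes/s), illustrative
a = rand(Dirichlet(fill(1.0, N)), R)   # a[n, r]: neuron weights, each column sums to 1
b = 0.2 .* randn(N, R)                 # b[n, r]: per-neuron delays (s)
c = fill(0.01, N, R)                   # c[n, r]: per-neuron response variances (s²)

# Firing rate of neuron n at time t, given latent events (τ, r, A).
function firing_rate(t, n; λ_bg, τ, r, A, a, b, c)
    rate = λ_bg[n]
    for k in eachindex(τ)
        rate += A[k] * a[n, r[k]] * pdf(Normal(τ[k] + b[n, r[k]], sqrt(c[n, r[k]])), t)
    end
    return rate
end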

Figure 1B schematizes a simple case containing $K=3$ sequence events and $R=2$ sequence types. A complete description of the model's generative process is provided in Supplement B, but it can be summarized by the graphical model in fig. 1B, where we have global parameters $\Theta = (\theta_{\varnothing}, \{\theta_{r}\}_{r=1}^{R})$ with $\theta_{r} = (a_{r}, \{b_{nr}\}_{n=1}^{N}, \{c_{nr}\}_{n=1}^{N})$ for each sequence type, and $\theta_{\varnothing} = \{\lambda^{\varnothing}_{n}\}_{n=1}^{N}$ for the background process. We place weak priors on each parameter: the neural response weights $\{a_{nr}\}_{n=1}^{N}$ follow a Dirichlet prior for each sequence type, and $(b_{nr}, c_{nr})$ follows a normal-inverse-gamma prior for every neuron and sequence type. The background rate, $\lambda^{\varnothing}_{n}$, follows a gamma prior. We set the sequence event rate, $\psi$, to be a fixed hyperparameter, though this assumption could be relaxed.

Time-warped sequences PP-Seq can be extended to model more diverse sequence patterns by using higher-dimensional marks on the latent sequences. For example, we can model variability in sequence duration by introducing a time warping factor, $\omega_{k} > 0$, to each sequence event and changing eq. 2 to,

$$g_{n}(t; z_{k}) = A_{k} \cdot a_{n, r_{k}} \cdot \mathcal{N}(t \mid \tau_{k} + \omega_{k} b_{n, r_{k}},\, \omega_{k}^{2} c_{n, r_{k}}). \qquad (4)$$

This has the effect of linearly compressing or stretching each sequence in time (when $\omega_{k} < 1$ or $\omega_{k} > 1$, respectively). Such time warping is commonly observed in neural data [43, 44], and indeed, hippocampal sequences unfold $\sim$15-20 times faster during replay than during lived experiences [10]. We characterize this model in Supplement E and demonstrate its utility below.
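The warped impulse response of eq. (4) differs from eq. (2) only in that each event's warp factor $\omega_k$ rescales the per-neuron delays and widths; a one-line sketch in the same illustrative notation as the previous sketch:

using Distributions

# Warped impulse response g_n(t; z_k): delays scale by ω[k], variances by ω[k]².
warped_impulse(t, n, k; A, a, b, c, r, τ, ω) =
    A[k] * a[n, r[k]] * pdf(Normal(τ[k] + ω[k] * b[n, r[k]], ω[k] * sqrt(c[n, r[k]])), t)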

In principle, it is equally possible to incorporate time warping into discrete-time convNMF. However, since convNMF involves a dense temporal factor matrix $\mathbf{H} \in \mathbb{R}_{+}^{R \times B}$, the most straightforward extension would be to introduce a time warping factor for each component $r \in \{1, \ldots, R\}$ and each time bin $b \in \{1, \ldots, B\}$. This results in $O(RB)$ new trainable parameters, which poses non-trivial challenges in terms of computational efficiency, overfitting, and human interpretability. In contrast, PP-Seq represents sequence events as a set of $K$ latent events in continuous time. Since $K \ll RB$, this ultra-sparse representation naturally lends itself to modeling additional sequence features, as each new per-event feature introduces only $O(K)$ new parameters.

3 Collapsed Gibbs Sampling for Neyman-Scott Processes

Developing efficient algorithms for parameter inference in Neyman-Scott process models is an area of active research [45, 46, 47, 48]. To address this challenge, we developed a collapsed Gibbs sampling routine for Neyman-Scott processes, which encompasses the PP-Seq model as a special case. The method resembles “Algorithm 3” of [62]—a well-known approach for sampling from a Dirichlet process mixture model—and the collapsed Gibbs sampling algorithm for “mixture of finite mixtures” models developed by [63]. The idea is to partition observed spikes into background spikes and spikes induced by latent sequences, integrating over the sequence times, types, and amplitudes. Starting from an initial partition, the sampler iterates over individual spikes and probabilistically re-assigns them to (a) the background, (b) one of the remaining sequences, or (c) a new sequence. The number of sequences in the partition, $K^{*}$, changes as spikes are removed and re-assigned; thus, the algorithm is able to explore the full trans-dimensional space of partitions.

The re-assignment probabilities are determined by the prior distribution over partitions under the Neyman-Scott process and by the likelihood of the induced spikes assigned to each sequence. We state the conditional probabilities below and provide a full derivation in Supplement D. Let $K^{*}$ denote the number of sequences in the current partition after spike $x_{s}$ has been removed from its current assignment. (Note that the number of latent sequences $K$ may exceed $K^{*}$ if some sequences produce zero spikes.) Likewise, let $u_{s}$ denote the sequence assignment of the $s$-th spike, where $u_{s} = 0$ indicates assignment to the background process and $u_{s} \in \{1, \ldots, K^{*}\}$ indicates assignment to one of the latent sequence events. Finally, let $X_{k} = \{x_{s'} : u_{s'} = k, s' \neq s\}$ denote the spikes in the $k$-th cluster, excluding $x_{s}$, and let $S_{k} = |X_{k}|$ denote its size. The conditional probabilities of the partition under the possible assignments of spike $x_{s}$ are:

$$p(u_{s} = 0 \mid x_{s}, \{X_{k}\}_{k=1}^{K^{*}}, \Theta) \;\propto\; (1+\beta)\,\lambda^{\varnothing}_{n_{s}} \qquad (5)$$
$$p(u_{s} = k \mid x_{s}, \{X_{k}\}_{k=1}^{K^{*}}, \Theta) \;\propto\; (\alpha + S_{k}) \left[\sum_{r_{k}=1}^{R} p(r_{k} \mid X_{k})\, a_{n_{s} r_{k}}\, p(t_{s} \mid X_{k}, r_{k}, n_{s})\right] \qquad (6)$$
$$p(u_{s} = K^{*}+1 \mid x_{s}, \{X_{k}\}_{k=1}^{K^{*}}, \Theta) \;\propto\; \alpha \left(\frac{\beta}{1+\beta}\right)^{\alpha} \psi \sum_{r=1}^{R} \pi_{r}\, a_{n_{s} r} \qquad (7)$$

The sampling algorithm iterates over all spikes $x_{s} \in \{x_{1}, \ldots, x_{S}\}$ and updates their assignments holding the other spikes' assignments fixed. The probability of assigning spike $x_{s}$ to an existing cluster marginalizes over the time, type, and amplitude of the sequence, resulting in a collapsed Gibbs sampler [50, 62]. The exact form of the posterior probability $p(r_{k} \mid X_{k})$ and the parameters of the posterior predictive $p(t_{s} \mid X_{k}, r_{k}, n_{s})$ in eq. 6 are given in Supplement C.

After attributing each spike to a latent cause, it is straightforward to draw samples over the remaining model parameters—the latent sequences $\{z_{k}\}_{k=1}^{K^{*}}$ and global parameters $\Theta$. Given the spikes and assignments $\{x_{s}, u_{s}\}_{s=1}^{S}$, we sample the sequences (i.e. their times, amplitudes, types, etc.) from the closed-form conditional $p(z_{k} \mid \{x_{s} : u_{s} = k\}, \Theta)$. Given the sequences and spikes, we sample the conditional distribution on global parameters, $p(\Theta \mid \{z_{k}\}_{k=1}^{K^{*}}, \{x_{s}, u_{s}\}_{s=1}^{S})$. Under conjugate formulations, these updates are straightforward. With these steps, the Markov chain targets the posterior distribution on model parameters and partitions. Complete derivations are in Supplement D.
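To make eqs. (5)-(7) concrete, the sketch below computes the three unnormalized reassignment weights for a single spike. The cluster-specific posterior $p(r_k \mid X_k)$ and predictive density $p(t_s \mid X_k, r_k, n_s)$ are derived in Supplement C; here they are supplied as plain illustrative numbers rather than computed, and all other values are likewise illustrative, so this is an outline of the arithmetic rather than the paper's implementation.

α, β, ψ = 20.0, 0.2, 0.1             # amplitude prior (shape, rate) and sequence rate
πr      = [0.5, 0.5]                 # sequence type probabilities, r = 1, 2
λ_bg_n  = 0.3                        # background rate λ_{n_s}^∅ of the spike's neuron
a_n     = [0.02, 0.01]               # response weights a_{n_s r} of the spike's neuron

# One existing cluster with S_k spikes and precomputed marginal terms:
S_k    = 40
p_type = [0.9, 0.1]                  # p(r_k | X_k)
p_time = [1.8, 0.2]                  # p(t_s | X_k, r_k, n_s) evaluated at the spike time

w_background = (1 + β) * λ_bg_n                            # eq. (5): background
w_existing   = (α + S_k) * sum(p_type .* a_n .* p_time)    # eq. (6): join existing cluster
w_new        = α * (β / (1 + β))^α * ψ * sum(πr .* a_n)    # eq. (7): open a new cluster

probs = [w_background, w_existing, w_new]
probs ./= sum(probs)                 # normalized reassignment probabilities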

Figure 2: (A) Schematic of train/test partitions. We propose a speckled holdout pattern (bottom). (B) A subset of a synthetic spike train containing two sequence types. (C) Same data, but with grey regions showing the censored test set and yellow dots denoting imputed spikes. (D) Log-likelihood over Gibbs samples; positive values denote excess nats per unit time relative to a homogeneous Poisson process baseline. (E) Box plots showing the range of log-likelihoods on the train and test sets for different choices of $R$; cross-validation favors $R=2$, in agreement with the ground truth shown in panel B. (F-G) Performance benefits of parallel MCMC on synthetic and experimental neural data.

Improving MCMC mixing times The intensity over sequence amplitudes $A_{k}$ is proportional to the gamma density $\mathrm{Ga}(A_{k}; \alpha, \beta)$, and these hyperparameters affect the mixing time of the Gibbs sampler. Intuitively, if there is little probability of low-amplitude sequences, the sampler is unlikely to create new sequences and is therefore slow to explore different partitions of spikes. (This problem is common to other nonparametric Bayesian mixture models as well [e.g., 63].) If, on the other hand, the variance of $\mathrm{Ga}(\alpha, \beta)$ is large relative to the mean, then the probability of forming new clusters is non-negligible and the sampler tends to mix more effectively. Unfortunately, this latter regime is also probably of lesser scientific interest, since neural sequences are typically large in amplitude—they can involve many thousands of cells, each potentially contributing a small number of spikes [28, 2].

To address this issue, we propose an annealing procedure to initialize the Markov chain. We fix the mean of $A_{k}$ and adjust $\alpha$ and $\beta$ to slowly lower the variance of the amplitude distribution. Initially, the sampler produces many small clusters of spikes, and as we lower the variance of $\mathrm{Ga}(\alpha, \beta)$ to its target value, the Markov chain typically combines these clusters into larger sequences. We further improve performance by interspersing “split-merge” Metropolis-Hastings updates [64, 65] between Gibbs sweeps (see Supplement D.6). Finally, though we have not found it necessary, one could use convNMF to initialize the MCMC algorithm.
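A minimal sketch of such an annealing schedule (the number of stages and initial variance inflation are arbitrary choices, not the paper's settings): hold the prior mean $m = \alpha/\beta$ fixed and shrink the variance $v = \alpha/\beta^2$ geometrically toward its target, re-solving $\alpha = m^2/v$ and $\beta = m/v$ at each stage.

# Geometric annealing of the amplitude prior Ga(α, β): fixed mean, shrinking variance.
anneal_schedule(mean_amp, target_var; start_var=1e4 * target_var, n_stages=10) =
    [(mean_amp^2 / v, mean_amp / v)        # (α, β) with α/β = mean_amp and α/β² = v
     for v in exp.(range(log(start_var), log(target_var), length=n_stages))]

for (α, β) in anneal_schedule(100.0, 200.0)
    # ... run several Gibbs sweeps with the current (α, β) before tightening further ...
end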

Parallel MCMC Resampling the sequence assignments is the primary computational bottleneck for the Gibbs sampler. One pass over the data requires $O(SKR)$ operations, which quickly becomes costly when the operations are executed serially. While this computational cost is manageable for many datasets, we can improve performance substantially by parallelizing the computation [53]. Given $P$ processors and a spike train lasting $T$ seconds, we divide the dataset into intervals lasting $T/P$ seconds and allocate one interval per processor. The current global parameters, $\Theta$, are first broadcast to all processors. In parallel, the processors update the sequence assignments for their assigned spikes and then send back sufficient statistics describing each sequence. After these sufficient statistics are collected on a single processor, the global parameters are re-sampled and then broadcast back to the processors to initiate another iteration. This algorithm introduces some error since clusters are not shared across processors. In essence, this introduces erroneous edge effects if a sequence of spikes is split across two processors. However, these errors are negligible when the sequence length is much less than $T/P$, which we expect is the practical regime of interest.
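A minimal sketch of the data split only (the per-worker Gibbs updates and the gather/broadcast of sufficient statistics are omitted); the function and variable names are illustrative, not the package's API:

# Bin spikes into P contiguous time intervals of length T/P, one per worker.
function partition_spikes(times::Vector{Float64}, T::Float64, P::Int)
    chunks = [Int[] for _ in 1:P]
    for (s, t) in enumerate(times)
        p = clamp(ceil(Int, t / (T / P)), 1, P)     # interval (worker) index of spike s
        push!(chunks[p], s)
    end
    return chunks                                   # chunks[p] = spike indices for worker p
end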

4 Experiments

4.1 Cross-Validation and Demonstration of Computational Efficiency

Figure 3: Zebra Finch HVC data. (A) Raw spike train (top) and sequences revealed by PP-Seq (left) and convNMF (right). (B) Box plots summarizing samples from the posterior on the number of sequences, $K$, derived from three independent MCMC chains. (C) Co-occupancy matrix summarizing probabilities of spike pairs belonging to the same sequence. (D) Credible intervals for evoked amplitudes for sequence type 1 (red) and 2 (blue). (E) Credible intervals for response offsets (same order and coloring as D). Estimates are suppressed for small-amplitude responses (gray dots).

We evaluate model performance by computing the log-likelihood assigned to held-out data. Partitioning the data into training and testing sets must be done somewhat carefully—we cannot withhold time intervals completely (as in fig. 2A, top) or else the model will not accurately predict latent sequences occurring in these intervals; likewise, we cannot withhold individual neurons completely (as in fig. 2A, middle) or else the model will not accurately predict the response parameters of those held out cells. Thus, we adopt a “speckled” holdout strategy [54] as diagrammed at the bottom of fig. 2A. We treat held-out spikes as missing data and sample them as part of the MCMC algorithm. (Their conditional distribution is given by the PP-Seq generative model.) This approach involving a speckled holdout pattern and multiple imputation of missing data may be viewed as a continuous time extension of the methods proposed by [27] for convNMF.
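A minimal sketch of constructing such a speckled mask (block length and holdout fraction are arbitrary illustrative choices, not the paper's settings): the recording is tiled into (neuron, time-block) cells, a random subset of cells is withheld, and spikes falling in a held-out cell form the test set.

using Random

# heldout[n, b] is true if the (neuron n, time block b) cell belongs to the test set.
function speckled_mask(N::Int, T::Float64; block_len=1.0, holdout_frac=0.2,
                       rng=Random.default_rng())
    B = ceil(Int, T / block_len)
    return rand(rng, N, B) .< holdout_frac
end

# Test-set membership of a spike from neuron n at time t.
is_test(mask, n, t; block_len=1.0) =
    mask[n, clamp(ceil(Int, t / block_len), 1, size(mask, 2))]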

Panels B-E in fig. 2 show the results of this cross-validation scheme on a synthetic dataset with $R=2$ sequence types. The predictions of the model in held-out test regions closely match the ground truth—missing spikes are reliably imputed when they are part of a sequence (fig. 2C). Further, the likelihood of the train and test sets improves over the course of MCMC sampling (fig. 2D), and can be used as a metric for model comparison—in agreement with the ground truth, test performance plateaus for models containing greater than $R=2$ sequence types (fig. 2E).

Finally, to be of practical utility, the algorithm needs to run in a reasonable amount of time. Figure 2G shows that our Julia [55] implementation can fit a recording of 120 hippocampal neurons with hundreds of thousands of spikes in a matter of minutes, on a 2017 MacBook Pro (3.1 GHz Intel Core i7, 4 cores, 16 GB RAM). Run-time grows linearly with the number of spikes, as expected, but even with a single thread it only takes six minutes to perform 1000 Gibbs sweeps on a 15-minute recording with ${\sim}1.3 \times 10^{5}$ spikes. With parallel MCMC, this laptop performs the same number of sweeps in under two minutes. Our open-source implementation is available at:

https://github.com/lindermanlab/PPSeq.jl.

4.2 Zebra Finch Higher Vocal Center (HVC)

We first applied PP-Seq to a recording of HVC premotor neurons in a zebra finch (data available at http://github.com/FeeLab/seqNMF; originally published in [4]), which generate sequences that are time-locked to syllables in the bird's courtship song. Figure 3A qualitatively compares the performance of convNMF and PP-Seq. The raw data (top panel) show no visible spike patterns; however, clear sequences are revealed by sorting the neurons lexicographically by preferred sequence type and the temporal offset parameter inferred by PP-Seq. While both models extract similar sequences, PP-Seq provides a finer-scale annotation of the final result, providing, for example, attributions at the level of individual spikes to sequences (bottom left of fig. 3A).

Further, PP-Seq can quantify uncertainty in key parameters by considering the full sequence of MCMC samples. Figure 3B summarizes uncertainty in the total number of sequence events, i.e. $K$, over three independent MCMC chains with different random seeds—all chains converge to similar estimates; the uncertainty is largely due to the rapid sequences (in blue) shown in panel A. Figure 3C displays a symmetric matrix whose element $(i, j)$ corresponds to the probability that spike $i$ and spike $j$ are attributed to the same sequence. Finally, fig. 3D-E shows the amplitude and offset of each neuron's sequence-evoked response with 95% posterior credible intervals. These results naturally fall out of the probabilistic construction of the PP-Seq model, but have no obvious analogue in convNMF.

4.3 Time Warping Extension and Robustness to Noise

Figure 4: PP-Seq is more robust to various forms of noise than convNMF. (A-D) Comparison of PP-Seq and convNMF to detect sequence times in synthetic data with varying amounts of noise. Panel D shows the performance of the time-warped variant of PP-Seq (see eq. 4). (E) convNMF reconstruction of a spike train containing sequences with 9-fold time warping. (F) Performance of time-warped PP-Seq (see eq. 4) on the same data as panel E.

Songbird HVC is a specialized circuit that generates unusually clean and easy-to-detect sequences. To compare the robustness of PP-Seq and convNMF under more challenging circumstances, we created a simple synthetic dataset with $R=1$ sequence type and $N=100$ neurons. We varied four parameters to manipulate the difficulty of sequence extraction: the rate of background spikes, $\lambda^{\varnothing}_{n}$ (“additive noise,” fig. 4A), the expected value of the sequence amplitudes, $A_{k}$ (“participation noise,” fig. 4B), the expected variance of the Gaussian impulse responses, $c_{nr}$ (“jitter noise,” fig. 4C), and, finally, the maximal time warping coefficient (see eq. 4; fig. 4D). All simulated datasets involved sequences with low spike counts ($\mathbb{E}[A_{k}] < 100$ spikes). In this regime, the Poisson likelihood criterion used by PP-Seq is better matched to the statistics of the spike train. Since convNMF optimizes an alternative loss function (squared error instead of Poisson likelihood), we compared the models by their ability to extract the ground-truth sequence event times. Using the area under receiver operating characteristic (ROC) curves as a performance metric (see Supplement F.1), we see favorable results for PP-Seq as noise levels are increased.

We demonstrate the abilities of time-warped PP-Seq further in fig. 4E-F. Here, we show a synthetic dataset containing sequences with 9-fold variability in their duration, which is similar to levels observed in some experimental systems [10]. While convNMF fails to reconstruct many of these warped sequences, the PP-Seq model identifies sequences that are closely matched to ground truth.

4.4 Rodent Hippocampal Sequences

Finally, we tested PP-Seq and its time warping variant on a hippocampal recording from a rat making repeated runs down a linear track (data available at http://crcns.org/data-sets/hc/hc-11; originally published in [56]). This dataset is larger ($T \approx 16$ minutes, $S = 137{,}482$ spikes) and contains less stereotyped sequences than the songbird data. From prior work [56], we expect to see two sequences with overlapping populations of neurons, corresponding to the two running directions on the track. PP-Seq reveals these expected sequences in an unsupervised manner—i.e. without reference to the rat's position—as shown in fig. 5A-C.

We performed a large cross-validation sweep over 2,000 random hyperparameter settings for this dataset (see Supplement F.2). This confirmed that models with $R=2$ sequence types performed well in terms of held-out likelihood (fig. 5D). Interestingly, despite variability in running speeds, this same analysis did not show a consistent benefit to including larger time warping factors in the model (fig. 5E). Higher-performing models were characterized by larger sequence amplitudes, i.e. larger values of $\mathbb{E}[A_{k}] = \alpha/\beta$, and smaller background rates, i.e. smaller values of $\lambda^{\varnothing}$. Other parameters had less pronounced effects on performance. Overall, these results demonstrate that PP-Seq can be fruitfully applied to large-scale and “messy” neural datasets, that its hyperparameters can be tuned by cross-validation, and that the unsupervised learning of neural sequences conforms to existing scientific understanding gained via supervised methods.

Figure 5: Automatic detection of place coding sequences in rat hippocampus. (A) Raw spike train ($\sim$20% of the full dataset is shown). Blue and red arrowheads indicate when the rat reaches the end of the track and reverses direction. (B) Neurons re-sorted by PP-Seq (with $R=2$). (C) Sequences annotated by PP-Seq. (D-I) Validation log-likelihoods as a function of hyperparameter value. Each boxplot summarizes the 50 highest scoring models, randomly sampling the other hyperparameters.

5 Conclusion

We proposed a point process model (PP-Seq) inspired by convolutive NMF [9, 8, 4] to identify neural sequences. Both models approximate neural activity as the sum of nonnegative features, drawn from a fixed number of spatiotemporal motif types. Unlike convNMF, PP-Seq restricts each motif to be a true sequence—the impulse responses are Gaussian and hence unimodal. Further, PP-Seq is formulated in a probabilistic framework that better quantifies uncertainty (see fig. 3) and handles low firing rate regimes (see fig. 4). Finally, PP-Seq produces an extremely sparse representation of sequence events in continuous time, opening the door to a variety of model extensions including the introduction of time warping (see fig. 4F), as well as other possibilities like truncated sequences and “clusterless” observations [57], which could be explored in future work.

Despite these benefits, fitting PP-Seq involves tackling a challenging trans-dimensional inference problem inherent to Neyman-Scott point processes. We took several important steps towards overcoming this challenge by connecting these models to a more general class of Bayesian mixture models [63], developing and parallelizing a collapsed Gibbs sampler, and devising an annealed sampling approach to promote fast mixing. These innovations are sufficient to fit PP-Seq on datasets containing hundreds of thousands of spikes in just a few minutes on a modern laptop.

Acknowledgements

A.H.W. received funding support from the National Institutes of Health BRAIN initiative (1F32MH122998-01), and the Wu Tsai Stanford Neurosciences Institute Interdisciplinary Scholar Program. S.W.L. was supported by grants from the Simons Collaboration on the Global Brain (SCGB 697092) and the NIH BRAIN Initiative (U19NS113201 and R01NS113119). We thank the Stanford Research Computing Center for providing computational resources and support that contributed to these research results.

Broader Impact

Understanding neural computations in biological systems and ultimately the human brain is a grand and long-term challenge with broad implications for human health and society. The field of neuroscience is still taking early and incremental steps towards this goal. Our work develops a general-purpose, unsupervised method for identifying an important structure—neural sequences—which have been observed in a variety of experimental datasets and have been studied extensively by theorists. This work will serve to advance this growing understanding by providing new analytical tools for neuroscientists. We foresee no immediate impacts, positive or negative, concerning the general public.

References

  • [1] Mark S Goldman “Memory without feedback in a neural network” In Neuron 61.4, 2009, pp. 621–634
  • [2] Richard H R Hahnloser, Alexay A Kozhevnikov and Michale S Fee “An ultra-sparse code underlies the generation of neural sequences in a songbird” In Nature 419.6902, 2002, pp. 65–70
  • [3] Howard Eichenbaum “Time cells in the hippocampus: a new dimension for mapping memories” In Nat. Rev. Neurosci. 15.11 nature.com, 2014, pp. 732–744
  • [4] Emily L Mackevicius, Andrew H Bahle, Alex H Williams, Shijie Gu, Natalia I Denisenko, Mark S Goldman and Michale S Fee “Unsupervised discovery of temporal sequences in high-dimensional datasets, with applications to neuroscience” In Elife 8, 2019
  • [5] Moshe Abeles and Itay Gat “Detecting precise firing sequences in experimental data” In J. Neurosci. Methods 107.1-2 Elsevier, 2001, pp. 141–154
  • [6] Eleonora Russo and Daniel Durstewitz “Cell assemblies at multiple time scales with arbitrary lag constellations” In Elife 6 elifesciences.org, 2017
  • [7] Pietro Quaglio, Vahid Rostami, Emiliano Torre and Sonja Grün “Methods for identification of spike patterns in massively parallel spike trains” In Biol. Cybern. 112.1-2, 2018, pp. 57–80
  • [8] Sven Peter, Elke Kirschbaum, Martin Both, Lee Campbell, Brandon Harvey, Conor Heins, Daniel Durstewitz, Ferran Diego and Fred A Hamprecht “Sparse convolutional coding for neuronal assembly detection” In Advances in Neural Information Processing Systems 30 Curran Associates, Inc., 2017, pp. 3675–3685
  • [9] Paris Smaragdis “Convolutive speech bases and their application to supervised speech separation” In IEEE Trans. Audio Speech Lang. Processing ieeexplore.ieee.org, 2006
  • [10] Thomas J Davidson, Fabian Kloosterman and Matthew A Wilson “Hippocampal replay of extended experience” In Neuron 63.4, 2009, pp. 497–507
  • [11] Anne C Smith and Emery N Brown “Estimating a state-space model from point process observations” In Neural Computation 15.5 MIT Press, 2003, pp. 965–991
  • [12] K L Briggman, H D I Abarbanel and W B Kristan “Optical imaging of neuronal populations during decision-making” In Science 307.5711 science.sciencemag.org, 2005, pp. 896–901
  • [13] Byron M Yu, John P Cunningham, Gopal Santhanam, Stephen I Ryu, Krishna V Shenoy and Maneesh Sahani “Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity” In J. Neurophysiol. 102.1, 2009, pp. 614–635
  • [14] Liam Paninski, Yashar Ahmadian, Daniel Gil Ferreira, Shinsuke Koyama, Kamiar Rahnama Rad, Michael Vidne, Joshua Vogelstein and Wei Wu “A new look at state-space models for neural data” In Journal of Computational Neuroscience 29.1-2 Springer, 2010, pp. 107–126
  • [15] Jakob H Macke, Lars Buesing, John P Cunningham, M Yu Byron, Krishna V Shenoy and Maneesh Sahani “Empirical models of spiking in neural populations” In Advances in Neural Information Processing Systems, 2011, pp. 1350–1358
  • [16] David Pfau, Eftychios A Pnevmatikakis and Liam Paninski “Robust learning of low-dimensional dynamics from large neural ensembles” In Advances in Neural Information Processing Systems, 2013, pp. 2391–2399
  • [17] Peiran Gao and Surya Ganguli “On simplicity and complexity in the brave new world of large-scale neuroscience” In Curr. Opin. Neurobiol. 32, 2015, pp. 148–155
  • [18] Yuanjun Gao, Evan Archer, Liam Paninski and John P Cunningham “Linear dynamical neural population models through nonlinear embeddings”, 2016 arXiv:1605.08454 [q-bio.NC]
  • [19] Yuan Zhao and Il Memming Park “Variational Latent Gaussian Process for Recovering Single-Trial Dynamics from Population Spike Trains” In Neural Comput. 29.5 MIT Press, 2017, pp. 1293–1316
  • [20] Anqi Wu, Stan Pashkovski, Sandeep R Datta and Jonathan W Pillow “Learning a latent manifold of odor representations from neural responses in piriform cortex” In Advances in Neural Information Processing Systems 31 Curran Associates, Inc., 2018, pp. 5378–5388
  • [21] Alex H Williams, Tony Hyun Kim, Forea Wang, Saurabh Vyas, Stephen I Ryu, Krishna V Shenoy, Mark Schnitzer, Tamara G Kolda and Surya Ganguli “Unsupervised discovery of demixed, low-dimensional neural dynamics across multiple timescales through tensor component analysis” In Neuron 98.6, 2018, pp. 1099–1115.e8
  • [22] Scott Linderman, Annika Nichols, David Blei, Manuel Zimmer and Liam Paninski “Hierarchical recurrent state space models reveal discrete and continuous dynamics of neural activity in C. elegans” In bioRxiv, 2019, pp. 621540
  • [23] Lea Duncker, Gergo Bohner, Julien Boussard and Maneesh Sahani “Learning interpretable continuous-time models of latent stochastic dynamical systems” In Proceedings of the 36th International Conference on Machine Learning 97, Proceedings of Machine Learning Research Long Beach, California, USA: PMLR, 2019, pp. 1726–1734
  • [24] Carsen Stringer, Marius Pachitariu, Nicholas Steinmetz, Matteo Carandini and Kenneth D Harris “High-dimensional geometry of population responses in visual cortex” In Nature 571.7765, 2019, pp. 361–365
  • [25] Christopher D Harvey, Philip Coen and David W Tank “Choice-specific sequences in parietal cortex during a virtual-navigation decision task” In Nature 484.7392, 2012, pp. 62–68
  • [26] Mikhail I Rabinovich, Ramón Huerta, Pablo Varona and Valentin S Afraimovich “Generation and reshaping of sequences in neural systems” In Biol. Cybern. 95.6 Springer, 2006, pp. 519–536
  • [27] Emily Lambert Mackevicius and Michale Sean Fee “Building a state space for song learning” In Curr. Opin. Neurobiol. 49 Elsevier, 2018, pp. 59–68
  • [28] György Buzsáki and David Tingley “Space and time: the hippocampus as a sequence generator” In Trends Cogn. Sci. 22.10, 2018, pp. 853–869
  • [29] Eva Pastalkova, Vladimir Itskov, Asohan Amarasingham and György Buzsáki “Internally generated cell assembly sequences in the rat hippocampus” In Science 321.5894, 2008, pp. 1322–1327
  • [30] Daoyun Ji and Matthew A Wilson “Coordinated memory replay in the visual cortex and hippocampus during sleep” In Nat. Neurosci. 10.1 nature.com, 2007, pp. 100–107
  • [31] Hannah R Joo and Loren M Frank “The hippocampal sharp wave-ripple in memory retrieval for immediate use and consolidation” In Nat. Rev. Neurosci. 19.12, 2018, pp. 744–757
  • [32] Wei Xu, Felipe Carvalho and Andrew Jackson “Sequential neural activity in primary motor cortex during sleep” In J. Neurosci. 39.19, 2019, pp. 3698–3712
  • [33] David Tingley and Adrien Peyrache “On the methods for reactivation and replay analysis” In Philos. Trans. R. Soc. Lond. B Biol. Sci. 375.1799 royalsocietypublishing.org, 2020, pp. 20190231
  • [34] Kourosh Maboudi, Etienne Ackermann, Laurel Watkins Jong, Brad E Pfeiffer, David Foster, Kamran Diba and Caleb Kemere “Uncovering temporal structure in hippocampal output patterns” In Elife 7, 2018
  • [35] Lukas Grossberger, Francesco P Battaglia and Martin Vinck “Unsupervised clustering of temporal patterns in high-dimensional neuronal ensembles using a novel dissimilarity measure” In PLoS Comput. Biol. 14.7, 2018, pp. e1006283
  • [36] Roemer van der Meij and Bradley Voytek “Uncovering neuronal networks defined by consistent between-neuron spike timing from neuronal spike recordings” In eNeuro 5.3 ncbi.nlm.nih.gov, 2018
  • [37] Daniel D Lee and H Sebastian Seung “Learning the parts of objects by non-negative matrix factorization” In Nature 401.6755, 1999, pp. 788–791
  • [38] Jesper Moller and Rasmus Plenge Waagepetersen “Statistical Inference and Simulation for Spatial Point Processes” Taylor & Francis, 2003
  • [39] Manuel Gomez Rodriguez and Isabel Valera “Learning with Temporal Point Processes”, Tutorial at ICML, 2018
  • [40] Jerzy Neyman and Elizabeth L Scott “Statistical approach to problems of cosmology” In J. R. Stat. Soc. Series B Stat. Methodol. 20.1, 1958, pp. 1–29
  • [41] Jeffrey W Miller and Matthew T Harrison “Mixture models with a prior on the number of components” In J. Am. Stat. Assoc. 113.521, 2018, pp. 340–356
  • [42] John Frank Charles Kingman “Poisson processes” Clarendon Press, 2002
  • [43] Lea Duncker and Maneesh Sahani “Temporal alignment and latent Gaussian process factor inference in population spike trains” In Advances in Neural Information Processing Systems 31 Curran Associates, Inc., 2018, pp. 10445–10455
  • [44] Alex H Williams, Ben Poole, Niru Maheswaranathan, Ashesh K Dhawale, Tucker Fisher, Christopher D Wilson, David H Brann, Eric M Trautmann, Stephen Ryu, Roman Shusterman, Dmitry Rinberg, Bence P Ölveczky, Krishna V Shenoy and Surya Ganguli “Discovering precise temporal patterns in large-scale neural recordings through robust and interpretable time warping” In Neuron 105.2, 2020, pp. 246–259.e8
  • [45] Ushio Tanaka, Yosihiko Ogata and Dietrich Stoyan “Parameter estimation and model selection for Neyman-Scott point processes” In Biometrical Journal: Journal of Mathematical Methods in Biosciences 50.1 Wiley Online Library, 2008, pp. 43–57
  • [46] Ushio Tanaka and Yosihiko Ogata “Identification and estimation of superposed Neyman–Scott spatial cluster processes” In Ann. Inst. Stat. Math. 66.4, 2014, pp. 687–702
  • [47] Jiří Kopecký and Tomáš Mrkvička “On the Bayesian estimation for the stationary Neyman-Scott point processes” In Appl. Math. 61.4, 2016, pp. 503–514
  • [48] Yosihiko Ogata “Cluster analysis of spatial point patterns: posterior distribution of parents inferred from offspring” In Japanese Journal of Statistics and Data Science, 2019
  • [49] Radford M Neal “Markov chain sampling methods for Dirichlet process mixture models” In J. Comput. Graph. Stat. 9.2 Taylor & Francis, 2000, pp. 249–265
  • [50] Jun S Liu, Wing Hung Wong and Augustine Kong “Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes” In Biometrika 81.1 Oxford Academic, 1994, pp. 27–40
  • [51] Sonia Jain and Radford M Neal “A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model” In J. Comput. Graph. Stat. 13.1 Taylor & Francis, 2004, pp. 158–182
  • [52] Sonia Jain and Radford M Neal “Splitting and merging components of a nonconjugate Dirichlet process mixture model” In Bayesian Anal. 2.3 International Society for Bayesian Analysis, 2007, pp. 445–472
  • [53] Elaine Angelino, Matthew James Johnson and Ryan P Adams “Patterns of scalable Bayesian inference” In Foundations and Trends® in Machine Learning 9.2-3 Now Publishers, 2016, pp. 119–247
  • [54] Svante Wold “Cross-validatory estimation of the number of components in factor and principal components models” In Technometrics 20.4 [Taylor & Francis, Ltd., American Statistical Association, American Society for Quality], 1978, pp. 397–405
  • [55] Jeff Bezanson, Alan Edelman, Stefan Karpinski and Viral B Shah “Julia: A fresh approach to numerical computing” In SIAM review 59.1 SIAM, 2017, pp. 65–98
  • [56] Andres D Grosmark and György Buzsáki “Diversity in neural firing dynamics supports both rigid and learned hippocampal sequences” In Science 351.6280, 2016, pp. 1440–1443
  • [57] Xinyi Deng, Daniel F Liu, Kenneth Kay, Loren M Frank and Uri T Eden “Clusterless decoding of position from multiunit activity using a marked point process filter” In Neural computation 27.7 MIT Press, 2015, pp. 1438–1460

References

  • [58] John Frank Charles Kingman “Poisson processes” Clarendon Press, 2002
  • [59] D J Daley and D Vere-Jones “An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods” Springer, New York, NY, 2003
  • [60] Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari and Donald B Rubin “Bayesian data analysis” CRC press, 2013
  • [61] Kevin P Murphy “Machine learning: a probabilistic perspective” MIT press, 2012
  • [62] Radford M Neal “Markov chain sampling methods for Dirichlet process mixture models” In J. Comput. Graph. Stat. 9.2 Taylor & Francis, 2000, pp. 249–265
  • [63] Jeffrey W Miller and Matthew T Harrison “Mixture models with a prior on the number of components” In J. Am. Stat. Assoc. 113.521, 2018, pp. 340–356
  • [64] Sonia Jain and Radford M Neal “A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model” In J. Comput. Graph. Stat. 13.1 Taylor & Francis, 2004, pp. 158–182
  • [65] Sonia Jain and Radford M Neal “Splitting and merging components of a nonconjugate Dirichlet process mixture model” In Bayesian Anal. 2.3 International Society for Bayesian Analysis, 2007, pp. 445–472


Supplementary Material: Point process models for sequence detection in high-dimensional spike trains

Appendix A Background

This section describes some basic properties and characteristics of Poisson processes that are relevant to understand the PP-Seq model. See [58] for a targeted introduction to Poisson processes and [59] for a broader introduction to point processes.

Definition of a Poisson Process — A point process model satisfying the following three properties is called a Poisson process.

  1. (i)

    The number of events falling within any region of the sample space $\mathcal{X}$ is a random variable following a Poisson distribution. For a region $R \subset \mathcal{X}$ we denote the number of events occurring in $R$ as $N(R)$.

  2. (ii)

    There exists a function $\lambda(x): \mathcal{X} \mapsto \mathbb{R}_{+}$, called the intensity function, which satisfies $N(R) \sim \text{Poisson}\big(\int_{R} \lambda(x)\,dx\big)$ for every region $R \subset \mathcal{X}$.

  3. (iii)

    For any two non-overlapping regions $R$ and $R'$, $N(R)$ and $N(R')$ are independent random variables.

We write $\{x_{i}\}_{i=1}^{n} \sim \mathrm{PP}(\lambda)$ to denote that a set of events $\{x_{i}\}_{i=1}^{n} = \{x_{1}, \ldots, x_{n}\}$ is distributed according to a Poisson process with intensity function $\lambda$.

Marked Poisson Processes — A marked Poisson process extends the above definition to sets of events marked with additional information. For example, we write $\{(x_{i}, m_{i})\}_{i=1}^{n} \sim \mathrm{PP}(\lambda(x, m))$ to denote a marked Poisson process where $m_{i} \in \mathcal{M}$ is the “mark” associated with each event. Note that the domain of the intensity function is extended to $\mathcal{X} \times \mathcal{M}$. To pick a concrete example of interest, the events could represent spike times, in which case $\mathcal{X}$ would be an interval of the real line, e.g. $[0, T]$, representing the recorded period. Further, $\mathcal{M}$ would be a discrete set of neuron labels, e.g. $\{1, \ldots, N\}$ for a recording of $N$ cells.

Likelihood function — Given a set of events $\{x_{i}\}_{i=1}^{n}$ in $\mathcal{X}$, the likelihood of this set with respect to a Poisson process $\mathrm{PP}(\lambda)$ is:

$$p(\{x_{i}\}_{i=1}^{n} \mid \lambda) = \prod_{i=1}^{n} \lambda(x_{i}) \cdot \exp\left\{-\int_{\mathcal{X}} \lambda(x)\,dx\right\} \qquad (8)$$

Sampling — Given an intensity function $\lambda(x)$ and a sample space $\mathcal{X}$, we can simulate data from a Poisson process by first sampling the number of events:

$$N \sim \text{Poisson}\left(\int_{\mathcal{X}} \lambda(x)\,dx\right) \qquad (9)$$

and then sampling the location of each event according to the normalized intensity function. That is, for $i \in \{1, \ldots, N\}$, we independently sample the events according to:

$$x_{i} \sim F \quad \text{where the p.d.f. of } F \text{ is } \frac{\lambda(x)}{\int_{\mathcal{X}} \lambda(x)\,dx} \qquad (10)$$
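A minimal sketch of this two-step recipe for a 1D example where the intensity is an unnormalized Gaussian bump, $\lambda(x) = A\,\mathcal{N}(x \mid \mu, \sigma^2)$, so that $\int \lambda(x)\,dx = A$ and the normalized intensity is simply $\mathcal{N}(\mu, \sigma^2)$ (the values are illustrative):

using Distributions

A, μ, σ = 50.0, 0.0, 1.0
N = rand(Poisson(A))              # step 1: number of events, eq. (9)
events = rand(Normal(μ, σ), N)    # step 2: locations from the normalized intensity, eq. (10)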

Superposition Principle — The union of $K$ independent Poisson processes with nonnegative intensity functions $\lambda_{1}(x), \ldots, \lambda_{K}(x)$ is a Poisson process with intensity function $\Lambda(x) = \sum_{k=1}^{K} \lambda_{k}(x)$.

As a consequence, we can sample from a complicated Poisson process by first breaking it down into a series of simpler Poisson processes, drawing independent samples from these simpler processes, and taking the union of all events to achieve a sample from the original Poisson process. We exploit this property in section B.1 to draw samples from the PP-Seq model.

Appendix B The PP-Seq Model

B.1 Generative Process

Recall the generative model described in the main text. Let $z_{k} = (\tau_{k}, r_{k}, A_{k})$ denote the $k$-th latent sequence, which consists of a time $\tau_{k}$, type $r_{k}$, and amplitude $A_{k}$. Extensions of the model introduce further properties, like time warping factors $\omega_{k}$, for each latent sequence. The latent events (i.e. instantiations of a sequence) are sampled from a Poisson process,

$$\{(\tau_{k}, r_{k}, A_{k})\}_{k=1}^{K} \sim \mathrm{PP}\big(\lambda(\tau, r, A)\big) \qquad (11)$$
$$\lambda(\tau, r, A) = \psi\,\pi_{r}\,\mathrm{Ga}(A; \alpha, \beta). \qquad (12)$$

Note that this is a homogeneous Poisson process since the intensity function does not depend on time. Conditioned on the latent events, the observed spikes are then sampled from a second, inhomogeneous, Poisson process,

$$\{(t_{s}, n_{s})\}_{s=1}^{S} \sim \mathrm{PP}\left(\lambda(t, n \mid \{(\tau_{k}, r_{k}, A_{k})\}_{k=1}^{K})\right) \qquad (13)$$
$$\lambda(t, n \mid \{(\tau_{k}, r_{k}, A_{k})\}_{k=1}^{K}) = \lambda^{\varnothing}(t, n) + \sum_{k=1}^{K} g(t, n \mid \tau_{k}, r_{k}, A_{k}) \qquad (14)$$
$$g(t, n \mid \tau_{k}, r_{k}, A_{k}) = A_{k} \cdot a_{n r_{k}} \cdot \mathcal{N}(t \mid \tau_{k} + b_{n r_{k}}, c_{n r_{k}}). \qquad (15)$$

In the main text we abbreviate $\lambda_{n}(t) \triangleq \lambda(t, n \mid \{(\tau_{k}, r_{k}, A_{k})\}_{k=1}^{K})$, $\lambda^{\varnothing}(t, n) \triangleq \lambda^{\varnothing}_{n}$, and $g_{n}(t \mid z_{k}) \triangleq g(t, n \mid \tau_{k}, r_{k}, A_{k})$.

Observe that the firing rates are sums of non-negative intensity functions. We can use the superposition principle of Poisson processes (see appendix A) to write the observed spikes as the union of spikes generated by the background and by each of the latent events.

$$\{(t_{s}^{(0)}, n_{s}^{(0)})\}_{s=1}^{S_{0}} \sim \mathrm{PP}\big(\lambda^{\varnothing}(t, n)\big) \qquad (16)$$
$$\{(t_{s}^{(k)}, n_{s}^{(k)})\}_{s=1}^{S_{k}} \sim \mathrm{PP}\big(A_{k} \cdot a_{n r_{k}} \cdot \mathcal{N}(t \mid \tau_{k} + b_{n r_{k}}, c_{n r_{k}})\big) \quad \text{for } k = 1, \ldots, K \qquad (17)$$
$$\{(t_{s}, n_{s})\}_{s=1}^{S} = \bigcup_{k=0}^{K} \{(t_{s}^{(k)}, n_{s}^{(k)})\}_{s=1}^{S_{k}}, \qquad (18)$$

where $S_{k}$ denotes the number of spikes generated by the $k$-th component, with $k=0$ denoting the background spikes.

Next, note that each of the constituent Poisson process intensity functions is simple, in that the normalized intensities correspond to simple densities. In the case of the background, the normalized intensity is categorical in the neuron indices and uniform in time. The impulse responses, once normalized, are categorical in the neuron indices and conditionally Gaussian in time. We use these facts to sample the Poisson processes as described in Appendix A.

The final sampling procedure for PP-Seq is as follows. First sample the global parameters,

$$\lambda^{\varnothing} \sim \mathrm{Gamma}(\alpha_{\varnothing}, \beta_{\varnothing}) \qquad (19)$$
$$\pi^{\varnothing} \sim \text{Dirichlet}(\gamma_{\varnothing} \cdot \boldsymbol{1}_{N}) \qquad (20)$$
$$\pi \sim \text{Dirichlet}(\gamma \cdot \boldsymbol{1}_{R}) \qquad (21)$$
$$\mathbf{a}_{r} \sim \text{Dirichlet}(\varphi \cdot \boldsymbol{1}_{N}) \quad \text{for } r \in \{1, \ldots, R\} \qquad (22)$$
$$c_{nr} \sim \text{Inv-}\chi^{2}(\nu, \sigma^{2}) \quad \text{for } (n, r) \in \{1, \ldots, N\} \times \{1, \ldots, R\} \qquad (23)$$
$$b_{nr} \sim \text{Normal}(0, c_{nr}/\kappa) \quad \text{for } (n, r) \in \{1, \ldots, N\} \times \{1, \ldots, R\}. \qquad (24)$$

Then sample the latent events,

K\displaystyle K Poisson(ψT)\displaystyle\sim\text{Poisson}(\psi\cdot T) (25)
rk\displaystyle r_{k} Categorical(π)\displaystyle\sim\text{Categorical}(\pi) for k{1,,K}k\in\{1,\ldots,K\} (26)
τk\displaystyle\tau_{k} Uniform([0,T])\displaystyle\sim\text{Uniform}(\,[0,T]\,) for k{1,,K}k\in\{1,\ldots,K\} (27)
Ak\displaystyle A_{k} Gamma(α,β)\displaystyle\sim\text{Gamma}(\alpha,\beta) for k{1,,K}.\displaystyle\text{for $k\in\{1,\ldots,K\}$}. (28)

Next sample the background spikes,

S\displaystyle S_{\varnothing} Poisson(λT)\displaystyle\sim\text{Poisson}\big{(}\lambda^{\varnothing}\cdot T\big{)} (29)
n0(s)\displaystyle n_{0}^{(s)} Categorical(π)\displaystyle\sim\text{Categorical}(\pi^{\varnothing}) for s{1,,S}s\in\{1,\ldots,S_{\varnothing}\} (30)
t0(s)\displaystyle t_{0}^{(s)} Uniform([0,T])\displaystyle\sim\text{Uniform}(\,[0,T]\,) for s{1,,S}s\in\{1,\ldots,S_{\varnothing}\} (31)

Note that we have decomposed each neuron’s background firing rate as the product: λn=λπn\lambda^{\varnothing}_{n}=\lambda^{\varnothing}\pi^{\varnothing}_{n}. Finally, sample the induced spikes in each sequence,

Sk\displaystyle S_{k} Poisson(Ak)\displaystyle\sim\text{Poisson}(A_{k}) for k{1,,K}k\in\{1,\ldots,K\} (32)
nk(s)\displaystyle n_{k}^{(s)} Categorical(𝐚rk)\displaystyle\sim\text{Categorical}(\mathbf{a}_{r_{k}}) for s{1,,Sk}s\in\{1,\ldots,S_{k}\}; for k{1,,K}k\in\{1,\ldots,K\} (33)
tk(s)\displaystyle t_{k}^{(s)} Normal(τk+bnsrk,cnsrk)\displaystyle\sim\text{Normal}(\tau_{k}+b_{n_{s}r_{k}},c_{n_{s}r_{k}}) for s{1,,Sk}s\in\{1,\ldots,S_{k}\}; for k{1,,K}k\in\{1,\ldots,K\} (34)

The hyperparameters of the model are ξ={γ,γ,α,β,φ,ν,σ2,κ,ψ,α,β}\xi=\big{\{}\gamma,\gamma_{\varnothing},\alpha_{\varnothing},\beta_{\varnothing},\varphi,\nu,\sigma^{2},\kappa,\psi,\alpha,\beta\big{\}}, and the global parameters consist of Θ={θ,{θr}r=1R}\Theta=\{\theta_{\varnothing},\{\theta_{r}\}_{r=1}^{R}\}, where θ={λ,{πn}n=1N}\theta_{\varnothing}=\{\lambda^{\varnothing},\{\pi_{n}^{\varnothing}\}_{n=1}^{N}\} denotes parameters related to the background spiking process and θr={πr,𝐚r,{bnr,cnr}n=1N}\theta_{r}=\{\pi_{r},\mathbf{a}_{r},\{b_{nr},c_{nr}\}_{n=1}^{N}\} denotes parameters associated with each sequence type for r{1,,R}r\in\{1,\ldots,R\}. The resulting dataset is the union of background and induced spikes, k=0Ks=1Sk{(nk(s),tk(s))}\bigcup_{k=0}^{K}\bigcup_{s=1}^{S_{k}}\big{\{}\big{(}n_{k}^{(s)},\,t_{k}^{(s)}\big{)}\big{\}}.
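To make this generative procedure concrete, here is a minimal Python sketch that follows Equations 19-34 step by step. It is only an illustration (the paper's implementation is written in Julia); all variable names and the specific hyperparameter values are our own choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- hyperparameters (illustrative values, not taken from the paper) ---
N, R, T = 50, 2, 100.0
alpha0, beta0 = 10.0, 1.0          # background amplitude prior Gamma(alpha0, beta0)
gamma0, gamma, varphi = 1.0, 1.0, 1.0
nu, sigma2, kappa = 4.0, 0.1, 1.0
psi, alpha, beta = 0.05, 30.0, 1.0

# --- global parameters (Eqs. 19-24) ---
lam0 = rng.gamma(alpha0, 1.0 / beta0)             # total background rate
pi0 = rng.dirichlet(gamma0 * np.ones(N))          # background neuron distribution
pi = rng.dirichlet(gamma * np.ones(R))            # sequence type probabilities
a = rng.dirichlet(varphi * np.ones(N), size=R).T  # a[n, r], each column sums to one
c = nu * sigma2 / rng.chisquare(nu, size=(N, R))  # widths ~ scaled Inv-chi^2(nu, sigma2)
b = rng.normal(0.0, np.sqrt(c / kappa))           # delays ~ Normal(0, c / kappa)

# --- latent events (Eqs. 25-28) ---
K = rng.poisson(psi * T)
r = rng.choice(R, size=K, p=pi)                   # sequence types
tau = rng.uniform(0.0, T, size=K)                 # sequence times
A = rng.gamma(alpha, 1.0 / beta, size=K)          # sequence amplitudes

# --- background spikes (Eqs. 29-31) ---
S0 = rng.poisson(lam0 * T)
bg_neurons = rng.choice(N, size=S0, p=pi0)
bg_times = rng.uniform(0.0, T, size=S0)

# --- induced spikes (Eqs. 32-34) ---
seq_neurons, seq_times = [], []
for k in range(K):
    Sk = rng.poisson(A[k])
    n_k = rng.choice(N, size=Sk, p=a[:, r[k]])
    t_k = rng.normal(tau[k] + b[n_k, r[k]], np.sqrt(c[n_k, r[k]]))
    seq_neurons.append(n_k)
    seq_times.append(t_k)

neurons = np.concatenate([bg_neurons, *seq_neurons])
times = np.concatenate([bg_times, *seq_times])
```

The arrays `neurons` and `times` then play the role of the dataset described above.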

B.2 Expected Sequence Sizes

As described in the main text, each sequence type induces a different pattern of activation across the neural population. In particular, each neuron’s response is modeled by a scaled normal distribution described by three variables—the amplitude (anra_{nr}), the mean (bnrb_{nr}), and the variance (cnrc_{nr}).

We can interpret bnrb_{nr} as the temporal offset or delay for neuron nn, while cnrc_{nr} sets the width of the response. They are drawn from a normal-inverse-chi-squared distribution, which is a standard choice for the conjugate prior when inferring the mean and variance of a univariate normal distribution. Refer to [60] (section 3.3) and [61] (section 4.6.3.7) for appropriate background material.

We can interpret anra_{nr} as the expected number of spikes emitted by neuron nn for a sequence of type rr with unit amplitude (i.e., when Ak=1A_{k}=1). By sampling 𝐚r+N\mathbf{a}_{r}\in\mathbb{R}_{+}^{N} from a symmetric Dirichlet prior, we introduce the constraint that nanr=1\sum_{n}a_{nr}=1, which avoids a degeneracy in the amplitude parameters over neurons (anra_{nr}) and sequences (AkA_{k}). As a consequence, we can interpret AkA_{k} as the expected number of spikes evoked by latent event kk over the full neural population:

\mathbb{E}[S_{k}]=\sum_{n=1}^{N}\int_{0}^{T}A_{k}\cdot a_{nr_{k}}\cdot\mathcal{N}(t\mid\tau_{k}+b_{nr_{k}},c_{nr_{k}})\,\mathrm{d}t=A_{k}\cdot\underbrace{\sum_{n=1}^{N}a_{nr_{k}}}_{=1}\cdot\underbrace{\int_{0}^{T}\mathcal{N}(t\mid\tau_{k}+b_{nr_{k}},c_{nr_{k}})\,\mathrm{d}t}_{\approx 1}\approx A_{k}. (35)

This approximation ignores boundary effects, but it is justified when 0τkT0\ll\tau_{k}\ll T and bnr,cnrTb_{nr},\sqrt{c_{nr}}\ll T; i.e. when the sequences are short relative to the interval [0,T][0,T]. This condition will generally be satisfied.

B.3 Log Likelihood

Given a set of latent events {zk}={zk}k=1K\{z_{k}\}=\{z_{k}\}_{k=1}^{K} and sequence type parameters {θr}={θr}r=1R\{\theta_{r}\}=\{\theta_{r}\}_{r=1}^{R}, we can evaluate the likelihood of the dataset {xs}={xs}s=1S\{x_{s}\}=\{x_{s}\}_{s=1}^{S} as:

\log p(\{x_{s}\}\mid\{z_{k}\},\{\theta_{r}\})=\underbrace{\sum_{s=1}^{S}\log\left[\lambda^{\varnothing}\pi^{\varnothing}_{n_{s}}+\sum_{k=1}^{K}A_{k}a_{n_{s}r_{k}}\mathcal{N}(t_{s}\mid\tau_{k}+b_{n_{s}r_{k}},c_{n_{s}r_{k}})\right]}_{(*)}\underbrace{-\lambda^{\varnothing}T-\sum_{k=1}^{K}A_{k}}_{(**)} (36)

where the terms (*) and (**) correspond, respectively, to the logarithms of \prod_{i}\lambda(x_{i}) and \exp[-\int\lambda(x)\,\mathrm{d}x] appearing in Equation 8, with the integral taken over the full domain [0,T]\times\{1,\ldots,N\} and the truncated Gaussian integrals approximated by one, as in Equation 35.
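The following sketch shows one way to evaluate Equation 36 with numpy and scipy; the function name and argument layout are our own and are not part of the paper's code.

```python
import numpy as np
from scipy.stats import norm

def pp_seq_log_likelihood(times, neurons, lam0, pi0, tau, r, A, a, b, c, T):
    """Evaluate Eq. (36), log p({x_s} | {z_k}, {theta_r}).

    times, neurons : spike times t_s and neuron indices n_s, shape (S,)
    lam0, pi0      : background rate lambda and neuron distribution, shape (N,)
    tau, r, A      : latent event times, types, and amplitudes, shape (K,)
    a, b, c        : (N, R) arrays of neuron weights, delays, and widths
    """
    # impulse response of every latent event evaluated at every spike, shape (S, K)
    loc = tau[None, :] + b[neurons[:, None], r[None, :]]
    scale = np.sqrt(c[neurons[:, None], r[None, :]])
    g = A[None, :] * a[neurons[:, None], r[None, :]] * norm.pdf(times[:, None], loc, scale)
    rates = lam0 * pi0[neurons] + g.sum(axis=1)        # lambda_{n_s}(t_s), term (*)
    return np.log(rates).sum() - lam0 * T - A.sum()    # (*) plus (**)
```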

Appendix C Complete Gibbs Sampling Algorithm

In this section we describe the collapsed Gibbs sampling routine for PP-Seq in detail. It adapts “Algorithm 3” of [62] and the extension by [63] to draw approximate samples from the model posterior. To do this, we introduce auxiliary parent variables u_{s}\in\{0\}\cup\mathbb{Z}_{+} for every spike. These variables denote assignments of each spike either to one of the latent sequence events, u_{s}=k for some k\in\{1,\ldots,K\}, or to the background process, u_{s}=0. Let X_{k}=\{x_{s}:u_{s}=k\} denote the set of spikes assigned to sequence k\in\{1,\ldots,K\} under parent assignments \{u_{s}\}_{s=1}^{S}. Similar assignment variables are used in standard Gibbs sampling routines for finite mixture models (see, e.g., section 24.2.4 in [61]) and for Dirichlet Process Mixture (DPM) models (see [62]; section 25.2.4 in [61]). In those contexts, each u_{s} may be thought of as a cluster assignment or indicator variable. In the present context of the PP-Seq model, the attribution of spikes to latent events follows from the superposition principle of Poisson processes (see appendix A), which allows us to decompose each neuron’s firing rate function into a sum of simple functions (in this case, scaled univariate Gaussians) and thus derive efficient Gibbs sampler updates.

Input: Spikes {x1,,xS}\{x_{1},\ldots,x_{S}\} and hyperparameters ξ\xi.
Initialize by assigning all spikes to the background: us=0u_{s}=0 for s{1S}s\in\{1\ldots S\}.
repeat MM times to draw MM samples
      
      1. Resample parent assignments, integrating over latent events:
      
            for s=1,\ldots,S: remove x_{s} from its current assignment and place it…
      
               a. in the background, us=0u_{s}=0, with probability (1+β)λns\propto(1+\beta)\,\lambda^{\varnothing}_{n_{s}}
      
               b. in cluster us=ku_{s}=k, with probability (α+|Xk|)[rk=1Rp(rkXk)ansrkp(tsXk,rk,ns)]\propto(\alpha+|X_{k}|)\,\left[\sum_{r_{k}=1}^{R}p(r_{k}\mid X_{k})\,a_{n_{s}r_{k}}\,p(t_{s}\mid X_{k},r_{k},n_{s})\right]
      
               c. in a new cluster, us=K+1u_{s}=K+1, with probability α(β1+β)αψr=1Rπransr\propto\alpha\left(\frac{\beta}{1+\beta}\right)^{\alpha}\psi\sum_{r=1}^{R}\pi_{r}\,a_{n_{s}r}
      
            end
      
      2. Resample latent events
            for k=1,,Kk=1,\,\ldots\,,K sample rk,τk,Akp(rk,τk,AkXk,Θ)r_{k},\tau_{k},A_{k}\sim p(r_{k},\tau_{k},A_{k}\mid X_{k},\Theta) with the following steps:
      
               a. Sample rkp(rkXk,Θ,ξ)r_{k}\sim p(r_{k}\mid X_{k},\Theta,\xi) where
p(rk=rXk,Θ,ξ)\displaystyle p(r_{k}=r\mid X_{k},\Theta,\xi) πrk(xsXkansrk)Z(xsXkJsk,xsXkhsk)xsXkZ(Jsk,hsk)\displaystyle\propto\pi_{r_{k}}\left(\prod_{x_{s}\in X_{k}}a_{n_{s}r_{k}}\right)\frac{Z({\textstyle\sum_{x_{s}\in X_{k}}}J_{sk},{\textstyle\sum_{x_{s}\in X_{k}}}h_{sk})}{\prod_{x_{s}\in X_{k}}Z(J_{sk},h_{sk})}
          and where J_{sk}=1/c_{n_{s}r_{k}}, h_{sk}=(t_{s}-b_{n_{s}r_{k}})/c_{n_{s}r_{k}}, and Z(J,h)=(2\pi)^{1/2}J^{-1/2}\exp\left\{\tfrac{1}{2}h^{2}J^{-1}\right\}
      
               b. Sample τkp(τkrk,Xk,Θ)\tau_{k}\sim p(\tau_{k}\mid r_{k},X_{k},\Theta) where,
p(τkrk,Xk,Θ)\displaystyle p(\tau_{k}\mid r_{k},X_{k},\Theta) =𝒩(τkμk,σk2)\displaystyle=\mathcal{N}(\tau_{k}\mid\mu_{k},\sigma_{k}^{2})
          and where σk2=(xsXkJsk)1\sigma_{k}^{2}=(\sum_{x_{s}\in X_{k}}J_{sk})^{-1} and μk=σk2(xsXkhsk)\mu_{k}=\sigma_{k}^{2}(\sum_{x_{s}\in X_{k}}h_{sk})
      
               c. Sample AkGa(α+|Xk|,β+1)A_{k}\sim\mathrm{Ga}(\alpha+|X_{k}|,\beta+1).
      
            end
      
      3. Resample global parameters
      
            a. Sample λnGa(α+xsX0𝕀[ns=n],β+T)\lambda_{n}^{\varnothing}\sim\mathrm{Ga}(\alpha_{\varnothing}+\sum_{x_{s}\in X_{0}}\mathbb{I}[n_{s}=n],\,\beta_{\varnothing}+T) for n=1,,Nn=1,\ldots,N.
      
            b. Sample πDir([γ+k𝕀[rk=1],,γ+k𝕀[rk=R]])\pi\sim\mathrm{Dir}\left(\left[\gamma+\sum_{k}\mathbb{I}[r_{k}=1],\ldots,\gamma+\sum_{k}\mathbb{I}[r_{k}=R]\right]\right)
      
            c. Sample 𝐚rDir([φ+kxsXk𝕀[rk=rns=1],,φ+kxsXk𝕀[rk=rns=N]])\mathbf{a}_{r}\sim\mathrm{Dir}\left(\left[\varphi+\sum_{k}\sum_{x_{s}\in X_{k}}\mathbb{I}[r_{k}=r\wedge n_{s}=1],\ldots,\varphi+\sum_{k}\sum_{x_{s}\in X_{k}}\mathbb{I}[r_{k}=r\wedge n_{s}=N]\right]\right)
             d. Sample b_{nr},c_{nr}\sim\mathrm{Inv-}\chi^{2}(c_{nr};\nu_{nr},\sigma_{nr}^{2})\,\mathcal{N}(b_{nr}\mid\mu_{nr},c_{nr}/\kappa_{nr}) for n=1,\ldots,N and r=1,\ldots,R, where
Snr\displaystyle S_{nr} =k=1KxsXk𝕀[rk=rns=n]\displaystyle=\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}{\mathbb{I}[r_{k}=r\wedge n_{s}=n]}
νnr\displaystyle\nu_{nr} =ν+Snr\displaystyle=\nu+S_{nr} σnr2\displaystyle\sigma_{nr}^{2} =νσ2+k=1KxsXk(tsτk)2𝕀[rk=rns=n]ν+Snr,\displaystyle=\frac{\nu\sigma^{2}+\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}(t_{s}-\tau_{k})^{2}\cdot\mathbb{I}[r_{k}=r\wedge n_{s}=n]}{\nu+S_{nr}},
κnr\displaystyle\kappa_{nr} =κ+Snr\displaystyle=\kappa+S_{nr} μnr\displaystyle\mu_{nr} =k=1KxsXk(tsτk)𝕀[rk=rns=n]κ+Snr.\displaystyle=\frac{\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}(t_{s}-\tau_{k})\cdot\mathbb{I}[r_{k}=r\wedge n_{s}=n]}{\kappa+S_{nr}}.
      
end
Algorithm 1 Collapsed Gibbs sampling for the PP-Seq model

The Gibbs sampling routine is summarized in Algorithm 1. While the approach closely mirrors existing work by [62, 63], there are a couple of notable differences. First, the background spikes in the PP-Seq model do not have an obvious counterpart in these prior algorithms, and this feature of the model renders the partition of datapoints only partially exchangeable (intuitively, the set of datapoints with u_{s}=0 is treated differently from all other groups). Second, the factors (\alpha+|X_{k}|) and \alpha\left(\frac{\beta}{1+\beta}\right)^{\alpha}\psi in the collapsed Gibbs updates are specific to the form of the PP-Seq model as a Neyman-Scott process; in the analogous Gibbs sampling routine for DPMs, for example, (\alpha+|X_{k}|) is replaced with |X_{k}|, and \alpha\,\psi\,T\left(\frac{\beta}{1+\beta}\right)^{\alpha} is replaced with the concentration parameter of the Dirichlet process.
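As an illustration of step 1 of Algorithm 1, the sketch below computes the three unnormalized assignment weights for a single spike. It uses the fact, derived in Appendix D, that given the spikes currently in a sequence the latent time τ_k is Gaussian with the information-form parameters of step 2b, so the posterior predictive density of a new spike time is Gaussian with variance σ_k² + c_{n_s r}. The data layout (a list of per-sequence spike arrays) and all names are assumptions of this sketch, not the paper's Julia implementation.

```python
import numpy as np
from scipy.stats import norm

def log_Z(J, h):
    """log of Z(J, h), the normalizer of a Gaussian in information form (Eq. 57)."""
    return 0.5 * (np.log(2 * np.pi) - np.log(J) + h ** 2 / J)

def parent_assignment_weights(t_s, n_s, sequences, lam0_n, pi, a, b, c,
                              alpha, beta, psi):
    """Unnormalized probabilities for resampling u_s (step 1 of Algorithm 1).

    `sequences` has one entry per existing sequence k: a dict with arrays
    "times" and "neurons" holding the spikes currently assigned to it, with
    x_s already removed.  lam0_n holds per-neuron background rates; a, b, c
    are (N, R) arrays of weights, delays, and widths.
    """
    R = len(pi)
    weights = [(1.0 + beta) * lam0_n[n_s]]                 # (a) background

    for X_k in sequences:                                  # (b) existing sequences
        nk, tk = X_k["neurons"], X_k["times"]
        log_pr = np.log(pi).copy()
        mu, var = np.zeros(R), np.zeros(R)
        for rr in range(R):
            J = 1.0 / c[nk, rr]                            # J_sk
            h = (tk - b[nk, rr]) * J                       # h_sk
            var[rr] = 1.0 / J.sum()
            mu[rr] = var[rr] * h.sum()
            log_pr[rr] += (np.log(a[nk, rr]).sum()
                           + log_Z(J.sum(), h.sum()) - log_Z(J, h).sum())
        pr = np.exp(log_pr - log_pr.max())
        pr /= pr.sum()                                     # p(r_k | X_k)
        # posterior predictive of t_s: tau_k ~ N(mu, var), t_s = tau_k + b + noise
        pred = norm.pdf(t_s, mu + b[n_s, :], np.sqrt(var + c[n_s, :]))
        weights.append((alpha + len(tk)) * np.sum(pr * a[n_s, :] * pred))

    # (c) new sequence
    weights.append(alpha * (beta / (1.0 + beta)) ** alpha * psi
                   * np.sum(pi * a[n_s, :]))
    return np.array(weights)
```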

Appendix D Derivations

D.1 Prior Distribution Over Partitions

First we derive the prior distribution on the partition of spikes under a Neyman-Scott process. Technically, we derive the prior distribution on partitions of the spike indices \{1,\ldots,S\}, not yet including the spike times and neurons. A partition of indices is a set of disjoint, non-empty sets whose union is \{1,\ldots,S\}. We represent the partition as \mathcal{I}=\{\mathcal{I}_{k}\}_{k=0}^{K^{*}}, where \mathcal{I}_{k}\subset\{1,\ldots,S\} contains the indices of spikes assigned to sequence k, with k=0 denoting the background spikes. Here, K^{*} denotes the number of non-empty sequences in the partition, |\mathcal{I}_{k}| denotes the number of spikes assigned to the k-th sequence, and |\mathcal{I}_{0}| denotes the number of spikes assigned to the background.

Proposition 1.

The prior probability of the partition, integrating over the latent event parameters, is,

p(Θ,ξ)=V(K;ξ)Po(|0|;λT)|0|!S!(1+β)S|0|k=1KΓ(|k|+α)Γ(α),\displaystyle p(\mathcal{I}\mid\Theta,\xi)=V(K^{*};\xi)\,\frac{\mathrm{Po}(|\mathcal{I}_{0}|;\lambda^{\varnothing}T)\,|\mathcal{I}_{0}|!}{S!\,(1+\beta)^{S-|\mathcal{I}_{0}|}}\prod_{k=1}^{K^{*}}\frac{\Gamma(|\mathcal{I}_{k}|+\alpha)}{\Gamma(\alpha)}, (37)

where α\alpha and β\beta are the hyperparameters of the gamma intensity on amplitudes, λT{\lambda^{\varnothing}T} is the expected number of background events, and

V(K;ξ)=K=KPo(K;ψT)K!(KK)!(β1+β)αK.\displaystyle V(K^{*};\xi)=\sum_{K=K^{*}}^{\infty}\mathrm{Po}(K;\psi T)\,\frac{K!}{(K-K^{*})!}\,\left(\frac{\beta}{1+\beta}\right)^{\alpha K}. (38)
Proof.

Our derivation roughly follows that of [63] for mixture of finite mixture models, but we adapt it to Neyman-Scott processes. The major difference is that the distribution above is over partitions of random size. First, return to the generative process described in Section B.1 and integrate over the latent event amplitudes to obtain the marginal distribution on sequence sizes given the total number of sequences KK. We have,

p(S_{0},\ldots,S_{K}\mid K,\Theta,\xi) =\mathrm{Po}(S_{0};\lambda^{\varnothing}T)\prod_{k=1}^{K}\int\mathrm{Po}(S_{k};A_{k})\,\mathrm{Ga}(A_{k};\alpha,\beta)\,\mathrm{d}A_{k} (39)
=\mathrm{Po}(S_{0};\lambda^{\varnothing}T)\prod_{k=1}^{K}\mathrm{NB}(S_{k};\alpha,(1+\beta)^{-1}) (40)
=\frac{1}{S_{0}!}(\lambda^{\varnothing}T)^{S_{0}}e^{-\lambda^{\varnothing}T}\prod_{k=1}^{K}\frac{\Gamma(S_{k}+\alpha)}{S_{k}!\,\Gamma(\alpha)}\left(\frac{\beta}{1+\beta}\right)^{\alpha}\left(\frac{1}{1+\beta}\right)^{S_{k}} (41)
=\frac{1}{S_{0}!}(\lambda^{\varnothing}T)^{S_{0}}e^{-\lambda^{\varnothing}T}\left(\frac{\beta}{1+\beta}\right)^{K\alpha}\left(\frac{1}{1+\beta}\right)^{S-S_{0}}\prod_{k=1}^{K}\frac{\Gamma(S_{k}+\alpha)}{S_{k}!\,\Gamma(\alpha)}. (42)

Let {us}s=1S\{u_{s}\}_{s=1}^{S} denote a set of sequence assignments. There are (SS0,,SK)\binom{S}{S_{0},\ldots,S_{K}} parent assignments consistent with the sequence sizes S0,,SKS_{0},\ldots,S_{K}, and they are all equally likely under the prior, so the conditional probability of the parent assignments is,

p(\{u_{s}\}_{s=1}^{S}\mid K,\Theta,\xi) =\frac{1}{S_{0}!}(\lambda^{\varnothing}T)^{S_{0}}e^{-\lambda^{\varnothing}T}\prod_{k=1}^{K}\frac{\Gamma(S_{k}+\alpha)}{S_{k}!\,\Gamma(\alpha)}\left(\frac{\beta}{1+\beta}\right)^{\alpha}\left(\frac{1}{1+\beta}\right)^{S_{k}}\,\binom{S}{S_{0},\ldots,S_{K}}^{-1} (43)
=\frac{1}{S!}(\lambda^{\varnothing}T)^{S_{0}}e^{-\lambda^{\varnothing}T}\left(\frac{\beta}{1+\beta}\right)^{K\alpha}\left(\frac{1}{1+\beta}\right)^{S-S_{0}}\prod_{k=1}^{K}\frac{\Gamma(S_{k}+\alpha)}{\Gamma(\alpha)}. (44)

The parent assignments above induce a partition, but technically they assume a particular ordering of the latent events. Moreover, if some of the latent events fail to produce any spikes, they will not be included in the partition. In performing a change of variables from parent assignments to partitions, we need to sum over latent event assignments that produce the same partition. There are \binom{K}{K^{*}}K^{*}!=\frac{K!}{(K-K^{*})!} such assignments when there are K latent events but only K^{*} non-empty sequences in the partition. Thus,

p(K,Θ,ξ)\displaystyle p(\mathcal{I}\mid K,\Theta,\xi) =K!(KK)!1S!(λT)|0|eλT(β1+β)Kα(11+β)S|0|k=1KΓ(|k|+α)Γ(α).\displaystyle=\frac{K!}{(K-K^{*})!}\,\frac{1}{S!}(\lambda^{\varnothing}T)^{|\mathcal{I}_{0}|}e^{-\lambda^{\varnothing}T}\left(\frac{\beta}{1+\beta}\right)^{K\alpha}\left(\frac{1}{1+\beta}\right)^{S-|\mathcal{I}_{0}|}\prod_{k=1}^{K^{*}}\frac{\Gamma(|\mathcal{I}_{k}|+\alpha)}{\Gamma(\alpha)}. (45)

Clearly, KK must be at least KK^{*} in order to produce the partition.

Finally, we sum over the number of latent events KK to obtain the marginal probability of the partition,

p(Θ,ξ)\displaystyle p(\mathcal{I}\mid\Theta,\xi) =K=KPo(K;ψT)p(K,Θ,ξ)\displaystyle=\sum_{K=K^{*}}^{\infty}\mathrm{Po}(K;\psi T)\,p(\mathcal{I}\mid K,\Theta,\xi) (46)
=V(K;ξ)(λT)|0|eλTS!(1+β)S|0|k=1KΓ(|k|+α)Γ(α),\displaystyle=V(K^{*};\xi)\,\frac{(\lambda^{\varnothing}T)^{|\mathcal{I}_{0}|}e^{-\lambda^{\varnothing}T}}{S!\,(1+\beta)^{S-|\mathcal{I}_{0}|}}\prod_{k=1}^{K^{*}}\frac{\Gamma(|\mathcal{I}_{k}|+\alpha)}{\Gamma(\alpha)}, (47)

where

V(K;ξ)\displaystyle V(K^{*};\xi) =K=KPo(K;ψT)K!(KK)!(β1+β)Kα.\displaystyle=\sum_{K=K^{*}}^{\infty}\mathrm{Po}(K;\psi T)\,\frac{K!}{(K-K^{*})!}\left(\frac{\beta}{1+\beta}\right)^{K\alpha}. (48)
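For completeness, V(K^{*};\xi) is straightforward to evaluate numerically by truncating the sum in Equation 48; the sketch below (names ours) works in log space for stability.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_V(K_star, psi, T, alpha, beta, n_terms=1000):
    """log of V(K*; xi) in Eq. (48), evaluated by truncating the infinite sum.

    Substituting K = K* + j, the K-th summand equals
    exp(-psi*T) * (psi*T)**K * (beta/(1+beta))**(K*alpha) / j!,
    so the terms decay at least as fast as a Poisson tail and the truncation
    error is negligible for moderate n_terms.
    """
    j = np.arange(n_terms)
    K = K_star + j
    log_theta = np.log(beta) - np.log1p(beta)   # log(beta / (1 + beta))
    log_terms = (-psi * T + K * np.log(psi * T)
                 - gammaln(j + 1)               # log (K - K*)! = log j!
                 + K * alpha * log_theta)
    return logsumexp(log_terms)
```

One can also check numerically that the truncated sum matches the closed form (\psi T\theta^{\alpha})^{K^{*}}\exp(\psi T(\theta^{\alpha}-1)) with \theta=\beta/(1+\beta), which follows from the same substitution.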

D.2 Marginal Likelihood of Spikes in a Partition

Now we combine the prior on partitions of spike indices with a likelihood of the spikes under that partition. To match the notation in the main text and Appendix C, let X=\{X_{k}\}_{k=0}^{K^{*}} denote a partition of the spikes, where X_{k}=\{x_{s}:s\in\mathcal{I}_{k}\} denotes the set of spikes in sequence k. In this section we derive the marginal probability of the spikes in a sequence, integrating out the sequence time and type. For the background spikes, there are no parameters to integrate and the marginal probability is simply,

p(X0Θ)\displaystyle p(X_{0}\mid\Theta) =xsX0Unif(ts;[0,T])Cat(ns[λ1/λ,,λN/λ])\displaystyle=\prod_{x_{s}\in X_{0}}\mathrm{Unif}(t_{s};[0,T])\,\mathrm{Cat}(n_{s}\mid[\lambda_{1}^{\varnothing}/\lambda^{\varnothing},\ldots,\lambda_{N}^{\varnothing}/\lambda^{\varnothing}]) (49)
=T|X0|xsX0λnsλ.\displaystyle=T^{-|X_{0}|}\prod_{x_{s}\in X_{0}}\frac{\lambda_{n_{s}}^{\varnothing}}{\lambda^{\varnothing}}. (50)

For the spikes attributed to a latent event, however, we have to marginalize over the type and time of the event. We have,

p(XkΘ)\displaystyle p(X_{k}\mid\Theta) =rk=1Rp(rk)0Tp(τk)skp(ts,nsrk,τk,Θ)dτk\displaystyle=\sum_{r_{k}=1}^{R}p(r_{k})\int_{0}^{T}p(\tau_{k})\prod_{s\in\mathcal{I}_{k}}p(t_{s},n_{s}\mid r_{k},\tau_{k},\Theta)\,\mathrm{d}\tau_{k} (51)
=rk=1Rπrk0TUnif(τk;[0,T])xsXkCat(nsrk,Θ)𝒩(tsτk+bnsrk,cnsrk)dτk\displaystyle=\sum_{r_{k}=1}^{R}\pi_{r_{k}}\int_{0}^{T}\mathrm{Unif}(\tau_{k};[0,T])\prod_{x_{s}\in X_{k}}\mathrm{Cat}(n_{s}\mid r_{k},\Theta)\,\mathcal{N}(t_{s}\mid\tau_{k}+b_{n_{s}r_{k}},c_{n_{s}r_{k}})\,\mathrm{d}\tau_{k} (52)
=1Trk=1Rπrk0TxsXkansrk2πcnsrkexp{12cnsrk(tsτkbnsrk)2}dτk.\displaystyle=\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}\int_{0}^{T}\prod_{x_{s}\in X_{k}}\frac{a_{n_{s}r_{k}}}{\sqrt{2\pi c_{n_{s}r_{k}}}}\exp\left\{-\frac{1}{2c_{n_{s}r_{k}}}(t_{s}-\tau_{k}-b_{n_{s}r_{k}})^{2}\right\}\,\mathrm{d}\tau_{k}. (53)

Let Jsk=1/cnsrkJ_{sk}=1/c_{n_{s}r_{k}}, and hsk=(tsbnsrk)/cnsrkh_{sk}=(t_{s}-b_{n_{s}r_{k}})/c_{n_{s}r_{k}}. Then,

p(XkΘ)\displaystyle p(X_{k}\mid\Theta) =1Trk=1Rπrk0TxsXkansrkJsk2πexp{12Jskτk2+hskτk12hsk2Jsk1}dτk\displaystyle=\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}\int_{0}^{T}\prod_{x_{s}\in X_{k}}a_{n_{s}r_{k}}\sqrt{\frac{J_{sk}}{2\pi}}\exp\left\{-\tfrac{1}{2}J_{sk}\tau_{k}^{2}+h_{sk}\tau_{k}-\tfrac{1}{2}h_{sk}^{2}J_{sk}^{-1}\right\}\,\mathrm{d}\tau_{k} (54)
=1Trk=1RπrkxsXkansrkZ(Jsk,hsk)0Texp{12(xsXkJsk)τk2+(xsXkhsk)τk}dτk\displaystyle=\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}\prod_{x_{s}\in X_{k}}\frac{a_{n_{s}r_{k}}}{Z(J_{sk},h_{sk})}\int_{0}^{T}\exp\left\{-\tfrac{1}{2}({\textstyle\sum_{x_{s}\in X_{k}}}J_{sk})\tau_{k}^{2}+({\textstyle\sum_{x_{s}\in X_{k}}}h_{sk})\tau_{k}\right\}\,\mathrm{d}\tau_{k} (55)
1Trk=1Rπrk(xsXkansrk)Z(xsXkJsk,xsXkhsk)xsXkZ(Jsk,hsk),\displaystyle\approx\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}\left(\prod_{x_{s}\in X_{k}}a_{n_{s}r_{k}}\right)\frac{Z({\textstyle\sum_{x_{s}\in X_{k}}}J_{sk},{\textstyle\sum_{x_{s}\in X_{k}}}h_{sk})}{\prod_{x_{s}\in X_{k}}Z(J_{sk},h_{sk})}, (56)

where

Z(J,h)=(2π)1/2J1/2exp{12h2J1}\displaystyle Z(J,h)=(2\pi)^{1/2}J^{-1/2}\exp\left\{\tfrac{1}{2}h^{2}J^{-1}\right\} (57)

is the normalizing constant of a Gaussian density in information form with precision JJ and linear coefficient hh. For a sequence with a single spike, this reduces to,

p(xsΘ)\displaystyle p(x_{s}\mid\Theta) 1Trk=1Rπrkansrk.\displaystyle\approx\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}a_{n_{s}r_{k}}. (58)

The approximations above are due to the truncated integral over [0,T][0,T] rather than the whole real line. In practice, this truncation is negligible for sequences far from the boundaries of the interval.

D.3 Collapsed Gibbs Sampling the Partitions

The conditional distribution on partitions given data and model parameters is given by,

p({xs}s=1S,Θ,ξ)\displaystyle p(\mathcal{I}\mid\{x_{s}\}_{s=1}^{S},\Theta,\xi) p(Θ,ξ)p({xs}s=1S,Θ)\displaystyle\propto p(\mathcal{I}\mid\Theta,\xi)\,p(\{x_{s}\}_{s=1}^{S}\mid\mathcal{I},\Theta) (59)
=p(Θ,ξ)p(X0Θ)k=1Kp(XkΘ).\displaystyle=p(\mathcal{I}\mid\Theta,\xi)\,p(X_{0}\mid\Theta)\,\prod_{k=1}^{K^{*}}p(X_{k}\mid\Theta). (60)

Let XxsX\setminus x_{s} denote the partition with spike xsx_{s} removed. Though it is slightly inconsistent with the proof in Section D.1, let usu_{s} denote the assignment of spike xsx_{s}, as in the main text. Moreover, let SkS_{k} denote the size of the kk-th sequence in the partition, not including spike xsx_{s}. We now consider the relative probability of the full partition XX when xsx_{s} is added to each one of the possible parts: the background, an existing sequence, or a new sequence.

For the background assignment,

p(u_{s}=0\mid x_{s},X\setminus x_{s},\Theta,\xi) \propto\frac{p(u_{s}=0,x_{s},X\setminus x_{s}\mid\Theta,\xi)}{p(X\setminus x_{s}\mid\Theta,\xi)} (61)
=V(K;ξ)(λT)S0+1eλTS!(1+β)SS01p(X0xsΘ)k=1KΓ(Sk+α)Γ(α)p(XkΘ)V(K;ξ)(λT)S0eλTS!(1+β)SS0p(X0Θ)k=1KΓ(Sk+α)Γ(α)p(XkΘ)\displaystyle=\frac{V(K^{*};\xi)\,\frac{(\lambda^{\varnothing}T)^{S_{0}+1}e^{-\lambda^{\varnothing}T}}{S!\,(1+\beta)^{S-S_{0}-1}}p(X_{0}\cup x_{s}\mid\Theta)\prod_{k=1}^{K^{*}}\frac{\Gamma(S_{k}+\alpha)}{\Gamma(\alpha)}p(X_{k}\mid\Theta)}{V(K^{*};\xi)\,\frac{(\lambda^{\varnothing}T)^{S_{0}}e^{-\lambda^{\varnothing}T}}{S!\,(1+\beta)^{S-S_{0}}}p(X_{0}\mid\Theta)\prod_{k=1}^{K^{*}}\frac{\Gamma(S_{k}+\alpha)}{\Gamma(\alpha)}p(X_{k}\mid\Theta)} (62)
=λT(1+β)1Tλnsλ\displaystyle=\lambda^{\varnothing}T(1+\beta)\frac{1}{T}\frac{\lambda_{n_{s}}^{\varnothing}}{\lambda^{\varnothing}} (63)
=(1+β)λns.\displaystyle=(1+\beta)\lambda_{n_{s}}^{\varnothing}. (64)

By a similar process, we arrive at the conditional probabilities of adding a spike to an existing sequence,

p(us=kxs,Xxs,Θ,ξ)\displaystyle p(u_{s}=k\mid x_{s},X\setminus x_{s},\Theta,\xi) Γ(Sk+α+1)Γ(Sk+α)p(XkxsΘ)p(XkΘ)\displaystyle\propto\frac{\Gamma(S_{k}+\alpha+1)}{\Gamma(S_{k}+\alpha)}\frac{p(X_{k}\cup x_{s}\mid\Theta)}{p(X_{k}\mid\Theta)} (65)
=(Sk+α)p(xsXk,Θ).\displaystyle=(S_{k}+\alpha)\,p(x_{s}\mid X_{k},\Theta). (66)

The predictive likelihood can be obtained via the ratio of marginal likelihoods derived above, or by explicitly calculating the categorical posterior distribution on types rkr_{k} and then the categorical and Gaussian posterior predictive distributions of nsn_{s} and tst_{s}, respectively, given the type and the other spikes. The latter approach is what is presented in the main text.

Finally, the conditional probability of assigning a spike to a new sequence simplifies as,

p(us=K+1Xxs,Θ,ξ)\displaystyle p(u_{s}=K^{*}+1\mid X\setminus x_{s},\Theta,\xi) (Γ(α+1)Γ(α))(V(K+1;ξ)V(K;ξ))p(xsΘ)\displaystyle\propto\left(\frac{\Gamma(\alpha+1)}{\Gamma(\alpha)}\right)\left(\frac{V(K^{*}+1;\xi)}{V(K^{*};\xi)}\right)p(x_{s}\mid\Theta) (67)
=α(J=K+11J!eψT(ψT)JJ!(JK1)!(β1+β)JαK=K1K!eψT(ψT)KK!(KK)!(β1+β)Kα)(1Trk=1Rπrkansrk)\displaystyle=\alpha\left(\frac{\sum_{J=K^{*}+1}^{\infty}\frac{1}{J!}e^{-\psi T}(\psi T)^{J}\,\frac{J!}{(J-K^{*}-1)!}\left(\frac{\beta}{1+\beta}\right)^{J\alpha}}{\sum_{K=K^{*}}^{\infty}\frac{1}{K!}e^{-\psi T}(\psi T)^{K}\,\frac{K!}{(K-K^{*})!}\left(\frac{\beta}{1+\beta}\right)^{K\alpha}}\right)\left(\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}a_{n_{s}r_{k}}\right) (68)
=α(K=K(ψT)K+11(KK)!(β1+β)(K+1)αK=K(ψT)K1(KK)!(β1+β)Kα)(1Trk=1Rπrkansrk)\displaystyle=\alpha\left(\frac{\sum_{K=K^{*}}^{\infty}(\psi T)^{K+1}\,\frac{1}{(K-K^{*})!}\left(\frac{\beta}{1+\beta}\right)^{(K+1)\alpha}}{\sum_{K=K^{*}}^{\infty}(\psi T)^{K}\,\frac{1}{(K-K^{*})!}\left(\frac{\beta}{1+\beta}\right)^{K\alpha}}\right)\,\left(\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}a_{n_{s}r_{k}}\right) (69)
=α(β1+β)αψT(1Trk=1Rπrkansrk)\displaystyle=\alpha\left(\frac{\beta}{1+\beta}\right)^{\alpha}\psi T\left(\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}a_{n_{s}r_{k}}\right) (70)
=α(β1+β)αψrk=1Rπrkansrk.\displaystyle=\alpha\,\left(\frac{\beta}{1+\beta}\right)^{\alpha}\psi\sum_{r_{k}=1}^{R}\pi_{r_{k}}a_{n_{s}r_{k}}. (71)

D.4 Gibbs Sampling the Latent Event Parameters

Given the partition and global parameters, it is straightforward to sample the latent events. In fact, most of the calculations were derived in Section D.2. The conditional distribution of the event type is,

p(rkXk,Θ)\displaystyle p(r_{k}\mid X_{k},\Theta) p(rk)0Tp(τk)xsXkp(ts,nsrk,τk,Θ)dτk\displaystyle\propto p(r_{k})\int_{0}^{T}p(\tau_{k})\prod_{x_{s}\in X_{k}}p(t_{s},n_{s}\mid r_{k},\tau_{k},\Theta)\,\mathrm{d}\tau_{k} (72)
πrk(xsXkansrk)Z(xsXkJsk,xsXkhsk)xsXkZ(Jsk,hsk),\displaystyle\approx\pi_{r_{k}}\left(\prod_{x_{s}\in X_{k}}a_{n_{s}r_{k}}\right)\frac{Z({\textstyle\sum_{x_{s}\in X_{k}}}J_{sk},{\textstyle\sum_{x_{s}\in X_{k}}}h_{sk})}{\prod_{x_{s}\in X_{k}}Z(J_{sk},h_{sk})}, (73)

where, again, the approximation comes from truncating the integral to the range [0,T][0,T], and is negligible in our cases.

Given the type, the latent event time is conditionally Gaussian,

p(τkrk,Xk,Θ)\displaystyle p(\tau_{k}\mid r_{k},X_{k},\Theta) p(τk)xsXkp(tsrk,ns,τk,Θ)\displaystyle\propto p(\tau_{k})\prod_{x_{s}\in X_{k}}p(t_{s}\mid r_{k},n_{s},\tau_{k},\Theta) (74)
=Unif(τk;[0,T])xsXk𝒩(tsτk+bnsrk,cnsrk)\displaystyle=\mathrm{Unif}(\tau_{k};[0,T])\prod_{x_{s}\in X_{k}}\mathcal{N}(t_{s}\mid\tau_{k}+b_{n_{s}r_{k}},c_{n_{s}r_{k}}) (75)
𝒩(τkμk,σk2)\displaystyle\propto\mathcal{N}(\tau_{k}\mid\mu_{k},\sigma_{k}^{2}) (76)

where

σk2\displaystyle\sigma_{k}^{2} =(xsXkJsk)1\displaystyle=\left(\sum_{x_{s}\in X_{k}}J_{sk}\right)^{-1} (77)
\mu_{k} =\sigma_{k}^{2}\left(\sum_{x_{s}\in X_{k}}h_{sk}\right). (78)

Finally, the amplitude is independent of the type and time, and its conditional distribution is given by,

p(AkXk,ξ)\displaystyle p(A_{k}\mid X_{k},\xi) Ga(Akα,β)Po(SkAk)\displaystyle\propto\mathrm{Ga}(A_{k}\mid\alpha,\beta)\,\mathrm{Po}(S_{k}\mid A_{k}) (79)
\propto\mathrm{Ga}(A_{k}\mid\alpha+S_{k},\,\beta+1). (80)
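Collecting Equations 72-80, step 2 of Algorithm 1 can be sketched as follows. This is an illustrative Python sketch with names of our choosing (not the paper's Julia code), and it neglects the truncation of τ_k to [0, T].

```python
import numpy as np

def log_Z(J, h):
    """log Z(J, h) from Eq. (57)."""
    return 0.5 * (np.log(2 * np.pi) - np.log(J) + h ** 2 / J)

def resample_latent_event(times, neurons, pi, a, b, c, alpha, beta, rng):
    """Sample (r_k, tau_k, A_k) given the spikes X_k assigned to sequence k."""
    R = len(pi)
    log_w, mu, var = np.empty(R), np.empty(R), np.empty(R)
    for r in range(R):
        J = 1.0 / c[neurons, r]                    # J_sk
        h = (times - b[neurons, r]) * J            # h_sk
        var[r] = 1.0 / J.sum()                     # sigma_k^2
        mu[r] = var[r] * h.sum()                   # mu_k
        log_w[r] = (np.log(pi[r]) + np.log(a[neurons, r]).sum()
                    + log_Z(J.sum(), h.sum()) - log_Z(J, h).sum())
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    r_k = rng.choice(R, p=w)                                 # Eq. (73)
    tau_k = rng.normal(mu[r_k], np.sqrt(var[r_k]))           # Eqs. (76)-(78)
    A_k = rng.gamma(alpha + len(times), 1.0 / (beta + 1.0))  # Eq. (80)
    return r_k, tau_k, A_k
```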

D.5 Gibbs Sampling the Global Parameters

Finally, the Gibbs updates of the global parameters are simple due to their conjugate priors.

Background rates

The conditional distribution of the background rates is,

p(λnX0,ξ)\displaystyle p(\lambda_{n}^{\varnothing}\mid X_{0},\xi) Ga(λnα,β)Po(|{xs:xsX0,ns=n}|λnT)\displaystyle\propto\mathrm{Ga}(\lambda_{n}^{\varnothing}\mid\alpha_{\varnothing},\beta_{\varnothing})\,\mathrm{Po}(|\{x_{s}:x_{s}\in X_{0},n_{s}=n\}|\mid\lambda_{n}^{\varnothing}T) (81)
Ga(λnα+xsX0𝕀[ns=n],β+T).\displaystyle\propto\mathrm{Ga}(\lambda_{n}^{\varnothing}\mid\alpha_{\varnothing}+\sum_{x_{s}\in X_{0}}\mathbb{I}[n_{s}=n],\,\beta_{\varnothing}+T). (82)

Sequence type probabilities

The conditional distribution of the sequence type probability vector π\pi is,

p(π{(rk,τk,Ak)}k=1K,ξ)\displaystyle p(\pi\mid\{(r_{k},\tau_{k},A_{k})\}_{k=1}^{K^{*}},\xi) Dir(πγ𝟏R)k=1KCat(rkπ)\displaystyle\propto\mathrm{Dir}(\pi\mid\gamma\mathbf{1}_{R})\prod_{k=1}^{K^{*}}\mathrm{Cat}(r_{k}\mid\pi) (83)
\propto\mathrm{Dir}\left(\pi\mid\left[\gamma+\sum_{k}\mathbb{I}[r_{k}=1],\ldots,\gamma+\sum_{k}\mathbb{I}[r_{k}=R]\right]\right). (84)

Neuron weights for each sequence type

The conditional distribution of the neuron weight vector 𝐚r\mathbf{a}_{r} is,

p(𝐚rX,{(rk,τk,Ak)}k=1K,ξ)\displaystyle p(\mathbf{a}_{r}\mid X,\{(r_{k},\tau_{k},A_{k})\}_{k=1}^{K^{*}},\xi) Dir(𝐚rφ𝟏N)k=1KxsXkCat(ns𝐚rk),\displaystyle\propto\mathrm{Dir}(\mathbf{a}_{r}\mid\varphi\mathbf{1}_{N})\prod_{k=1}^{K^{*}}\prod_{x_{s}\in X_{k}}\mathrm{Cat}(n_{s}\mid\mathbf{a}_{r_{k}}), (85)
Dir(𝐚rϕr)\displaystyle\propto\mathrm{Dir}\left(\mathbf{a}_{r}\mid\boldsymbol{\phi}_{r}\right) (86)
ϕrn\displaystyle\phi_{rn} =φ+kxsXk𝕀[rk=rns=n].\displaystyle=\varphi+\sum_{k}\sum_{x_{s}\in X_{k}}\mathbb{I}[r_{k}=r\wedge n_{s}=n]. (87)

Neuron widths and delays

The conditional distribution of the neuron delays bnrb_{nr} and widths cnrc_{nr} is,

p(bnr,cnrX,{(rk,τk,Ak)}k=1K,ξ)\displaystyle p(b_{nr},c_{nr}\mid X,\{(r_{k},\tau_{k},A_{k})\}_{k=1}^{K^{*}},\xi) Invχ2(cnrν,σ2)𝒩(bnr0,cnr/κ)k=1KxsXk(𝒩(tsτk+bnr,cnr))𝕀[rk=rns=n]\displaystyle\propto\mathrm{Inv-}\chi^{2}(c_{nr}\mid\nu,\sigma^{2})\,\mathcal{N}(b_{nr}\mid 0,c_{nr}/\kappa)\prod_{k=1}^{K^{*}}\prod_{x_{s}\in X_{k}}\left(\mathcal{N}(t_{s}\mid\tau_{k}+b_{nr},c_{nr})\right)^{\mathbb{I}[r_{k}=r\wedge n_{s}=n]} (88)
Invχ2(cnrνnr,σnr2)𝒩(bnrμnr,cnr/κnr)\displaystyle\propto\mathrm{Inv-}\chi^{2}(c_{nr}\mid\nu_{nr},\sigma_{nr}^{2})\,\mathcal{N}(b_{nr}\mid\mu_{nr},c_{nr}/\kappa_{nr}) (89)

where

Snr\displaystyle S_{nr} =k=1KxsXk𝕀[rk=rns=n]\displaystyle=\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}{\mathbb{I}[r_{k}=r\wedge n_{s}=n]} (90)
νnr\displaystyle\nu_{nr} =ν+Snr\displaystyle=\nu+S_{nr} (91)
σnr2\displaystyle\sigma_{nr}^{2} =νσ2+k=1KxsXk(tsτk)2𝕀[rk=rns=n]ν+Snr\displaystyle=\frac{\nu\sigma^{2}+\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}(t_{s}-\tau_{k})^{2}\cdot\mathbb{I}[r_{k}=r\wedge n_{s}=n]}{\nu+S_{nr}} (92)
κnr\displaystyle\kappa_{nr} =κ+Snr\displaystyle=\kappa+S_{nr} (93)
μnr\displaystyle\mu_{nr} =k=1KxsXk(tsτk)𝕀[rk=rns=n]κ+Snr.\displaystyle=\frac{\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}(t_{s}-\tau_{k})\cdot\mathbb{I}[r_{k}=r\wedge n_{s}=n]}{\kappa+S_{nr}}. (94)
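A sketch of the corresponding conjugate update, following Equations 90-94: the argument `dts` is assumed to hold the centered spike times t_s − τ_k for all spikes with n_s = n assigned to a sequence of type r (the function name and layout are ours).

```python
import numpy as np

def resample_delay_and_width(dts, nu, sigma2, kappa, rng):
    """Sample (b_nr, c_nr) from the normal-inverse-chi^2 conditional, Eqs. (90)-(94)."""
    S_nr = len(dts)
    nu_nr = nu + S_nr
    kappa_nr = kappa + S_nr
    sigma2_nr = (nu * sigma2 + np.sum(dts ** 2)) / nu_nr
    mu_nr = np.sum(dts) / kappa_nr
    # scaled inverse chi^2 draw: c = nu_nr * sigma2_nr / chi^2_{nu_nr}
    c_nr = nu_nr * sigma2_nr / rng.chisquare(nu_nr)
    b_nr = rng.normal(mu_nr, np.sqrt(c_nr / kappa_nr))
    return b_nr, c_nr
```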

D.6 Split-Merge Sampling Moves

As discussed in the main text, when sequences containing a small number of spikes are unlikely under the prior, Gibbs sampling can be slow to mix over the posterior of spike partitions, since the probability of forming new sequences (i.e. singleton clusters) is very low. In addition to the annealed sampling approach outlined in the main text, we adapted the split-merge sampling method proposed by [64] for Dirichlet process mixture models to PP-Seq (see also [65] for the case of non-conjugate priors). For simplicity, we implemented randomized split-merge proposals, i.e. without the additional restricted Gibbs sweeps proposed in [64]. In practice, we found that these random proposals were sufficient for the sampler to accept both split and merge moves at a satisfactorily high rate.

Briefly, our method starts by randomly choosing two spikes (t_{i},n_{i}) and (t_{j},n_{j}) that are not assigned to the background partition and that satisfy |t_{i}-t_{j}|<W for some user-specified time window W. One could set W=T, in which case all pairs of non-background spikes are considered; however, this generally results in proposals that are extremely likely to be rejected, so it is advisable to set W to an upper bound on the expected sequence length to encourage faster mixing. After identifying the pair of spikes, we propose to merge their sequences if u_{i}\neq u_{j}. If instead they belong to the same sequence (u_{i}=u_{j}=k), we propose to split it into two new sequences, each containing one of the two chosen spikes; the remaining |X_{k}|-2 spikes are assigned to one of the two new sequences independently with equal probability.

Given the proposed split or merge move, \mathcal{I}\rightarrow\mathcal{I}^{*}, the Metropolis-Hastings acceptance probability is:

\min\left[1,\;\frac{q(\mathcal{I}\mid\mathcal{I}^{*})}{q(\mathcal{I}^{*}\mid\mathcal{I})}\,\frac{p(\mathcal{I}^{*}\mid\{x_{s}\}_{s=1}^{S},\Theta,\xi)}{p(\mathcal{I}\mid\{x_{s}\}_{s=1}^{S},\Theta,\xi)}\right] (95)

where q()q(\mathcal{I}^{*}\mid\mathcal{I}) is the proposal density, and p({xs}s=1S,Θ,ξ)p(\mathcal{I}\mid\{x_{s}\}_{s=1}^{S},\Theta,\xi) is given in eqs. 59 and 60. The sampling method then directly follows [64]. The ratio of proposal probabilities for split moves is:

q(split)q(split)=(12)(|Xk|2)\frac{q(\mathcal{I}\mid\mathcal{I}^{\text{split}})}{q(\mathcal{I}^{\text{split}}\mid\mathcal{I})}=\left(\frac{1}{2}\right)^{-(|X_{k}|-2)} (96)

The corresponding ratio for merge moves is:

q(merge)q(merge)=(12)(|Xk|2)\frac{q(\mathcal{I}\mid\mathcal{I}^{\text{merge}})}{q(\mathcal{I}^{\text{merge}}\mid\mathcal{I})}=\left(\frac{1}{2}\right)^{(|X_{k}|-2)} (97)
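A minimal sketch of the resulting accept/reject computation is below. The caller is assumed to supply the two log partition posteriors, e.g. evaluated with Equations 59-60; `move`, `X_k_size`, and the function name are our own notation.

```python
import numpy as np

def log_accept_split_merge(move, X_k_size, log_post_new, log_post_old):
    """Log Metropolis-Hastings acceptance probability for the randomized
    proposals (Eqs. 95-97).  X_k_size is |X_k| for the sequence being split
    (for a split move) or for the proposed merged sequence (for a merge)."""
    if move == "split":
        log_q_ratio = (X_k_size - 2) * np.log(2.0)    # Eq. (96)
    else:
        log_q_ratio = -(X_k_size - 2) * np.log(2.0)   # Eq. (97)
    return min(0.0, log_q_ratio + log_post_new - log_post_old)
```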

Appendix E Time-Warped Sequences

To account for sequences that unfold at different speeds, we incorporate a time-warping component into the generative model. Each sequence is endowed with a warp value, which is drawn from a discrete distribution,

ωk\displaystyle\omega_{k} f=1Fηfδwf(ωk),\displaystyle\sim\sum_{f=1}^{F}\eta_{f}\delta_{w_{f}}(\omega_{k}), (98)

where ηf0\eta_{f}\geq 0 and f=1Fηf=1\sum_{f=1}^{F}\eta_{f}=1 are the probabilities of the corresponding warp values wf>0w_{f}>0. In this paper, we set

wf\displaystyle w_{f} =wF1+2(f1)/(F1)\displaystyle=w_{F}^{-1+2(f-1)/(F-1)} (99)
ηf\displaystyle\eta_{f} 𝒩(fF12,σw2).\displaystyle\propto\mathcal{N}\left(f\mid\tfrac{F-1}{2},\sigma_{w}^{2}\right). (100)

The hyperparameters include the number of warp values F, the maximum warp value w_{F}, and the variance of the warp probabilities \sigma_{w}^{2}. We use an odd number of warp values so that the middle warp value equals one, i.e. w_{(F+1)/2}=1.
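A small sketch that builds the warp grid and probabilities as written in Equations 99-100 (the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def warp_grid(F, w_max, sigma2_w):
    """Warp values and probabilities from Eqs. (99)-(100); assumes F >= 2."""
    f = np.arange(1, F + 1)
    w = w_max ** (-1.0 + 2.0 * (f - 1) / (F - 1))   # log-spaced between 1/w_max and w_max
    eta = norm.pdf(f, loc=(F - 1) / 2, scale=np.sqrt(sigma2_w))
    return w, eta / eta.sum()

# e.g. warp_grid(5, 1.5, 1.0)[0] is approximately [0.667, 0.816, 1.0, 1.225, 1.5]
```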

The warp value changes the distribution of spike times in the corresponding sequence by scaling the mean and standard deviation,

ts(k)\displaystyle t_{s}^{(k)} 𝒩(τk+ωkbns(k)rk,ωk2cns(k)rk).\displaystyle\sim\mathcal{N}(\tau_{k}+\omega_{k}b_{n_{s}^{(k)}r_{k}},\omega_{k}^{2}c_{n_{s}^{(k)}r_{k}}). (101)

Note that one could also choose to linearly scale the variance rather than the standard deviation, as in a warped random walk. Ultimately, this is a subjective modeling choice, and scaling the standard deviation leads to slightly easier Gibbs updates later on.

We can equivalently endow each sequence with a latent warp index fk{1,,F}f_{k}\in\{1,\ldots,F\} where p(fk)ηfkp(f_{k})\propto\eta_{f_{k}} and,

ts(k)\displaystyle t_{s}^{(k)} 𝒩(τk+wfkbns(k)rk,wfk2cns(k)rk).\displaystyle\sim\mathcal{N}\left(\tau_{k}+w_{f_{k}}b_{n_{s}^{(k)}r_{k}},w_{f_{k}}^{2}c_{n_{s}^{(k)}r_{k}}\right). (102)

Since there are only a finite number of warp values, we can treat the warp index and the sequence type as a single discrete latent variable (rk,fk){1,,R}×{1,,F}(r_{k},f_{k})\in\{1,\ldots,R\}\times\{1,\ldots,F\}. This formulation allows us to re-use all of the collapsed Gibbs updates derived above, replacing our inferences over rkr_{k} with inferences of (rk,fk)(r_{k},f_{k}). Computationally, this increases the run-time by a factor of FF.

Given the sequence types, times, and warp values, the conditional distribution of the neuron delays bnrb_{nr} and widths cnrc_{nr} is,

p(bnr,cnrX,\displaystyle p(b_{nr},c_{nr}\mid X, {(rk,τk,fk,Ak)}k=1K,ξ)\displaystyle\{(r_{k},\tau_{k},f_{k},A_{k})\}_{k=1}^{K^{*}},\xi) (103)
Invχ2(cnrν,σ2)𝒩(bnr0,cnr/κ)k=1KxsXk(𝒩(tsτk+wfkbnr,wfk2cnr))𝕀[rk=rns=n]\displaystyle\propto\mathrm{Inv-}\chi^{2}(c_{nr}\mid\nu,\sigma^{2})\,\mathcal{N}(b_{nr}\mid 0,c_{nr}/\kappa)\prod_{k=1}^{K^{*}}\prod_{x_{s}\in X_{k}}\left(\mathcal{N}(t_{s}\mid\tau_{k}+w_{f_{k}}b_{nr},w_{f_{k}}^{2}c_{nr})\right)^{\mathbb{I}[r_{k}=r\wedge n_{s}=n]} (104)
Invχ2(cnrνnr,σnr2)𝒩(bnrμnr,cnr/κnr)\displaystyle\propto\mathrm{Inv-}\chi^{2}(c_{nr}\mid\nu_{nr},\sigma_{nr}^{2})\,\mathcal{N}(b_{nr}\mid\mu_{nr},c_{nr}/\kappa_{nr}) (105)

where

Snr\displaystyle S_{nr} =k=1KxsXk𝕀[rk=rns=n]\displaystyle=\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}{\mathbb{I}[r_{k}=r\wedge n_{s}=n]} (106)
Δsk\displaystyle\Delta_{sk} =tsτkwfk\displaystyle=\frac{t_{s}-\tau_{k}}{w_{f_{k}}} (107)
νnr\displaystyle\nu_{nr} =ν+Snr\displaystyle=\nu+S_{nr} (108)
σnr2\displaystyle\sigma_{nr}^{2} =νσ2+k=1KxsXkΔsk2𝕀[rk=rns=n]ν+Snr\displaystyle=\frac{\nu\sigma^{2}+\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}\Delta_{sk}^{2}\cdot\mathbb{I}[r_{k}=r\wedge n_{s}=n]}{\nu+S_{nr}} (109)
κnr\displaystyle\kappa_{nr} =κ+Snr\displaystyle=\kappa+S_{nr} (110)
μnr\displaystyle\mu_{nr} =k=1KxsXkΔsk𝕀[rk=rns=n]κ+Snr.\displaystyle=\frac{\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}\Delta_{sk}\cdot\mathbb{I}[r_{k}=r\wedge n_{s}=n]}{\kappa+S_{nr}}. (111)

If one instead chooses to linearly scale the variances, as mentioned above, the conditional distribution of the delays and widths is no longer of the normal-inverse-\chi^{2} form. However, it is straightforward to sample the delays and widths one at a time from their univariate conditional distributions.

Appendix F Supplemental Details on Experiments

F.1 ROC curve comparison of convNMF and PP-Seq

Simulated datasets consisted of N=100N=100 neurons, T=2000T=2000 time units, R=1R=1 sequence type, sequence event rate ψ=0.02\psi=0.02, a default average background rate of λn=0.03\lambda_{n}^{\varnothing}=0.03, default settings of α=225\alpha=225 and β=7.5\beta=7.5. Further, we set κnr=cnr\kappa_{nr}=c_{nr} and cnr=0.04c_{nr}=0.04 so that neuron offsets had unit standard deviation and small jitter by default. For the case of “jitter noise” we modified cnrc_{nr} but continued to set κnr=cnr\kappa_{nr}=c_{nr} to preserve the expected distribution of neuron offsets.

For each synthetic dataset, we fit convNMF models with R=1 component and a time bin size of 0.2 time units. We optimized using alternating projected gradient descent with a backtracking line search. Each convNMF model produced a temporal factor \mathbf{h}\in\mathbb{R}_{+}^{B} (where B=T/0.2=40000 is the total number of time bins). We encoded the ground truth sequence times in a binary vector of length B and computed ROC curves by thresholding \mathbf{h} over a fine grid of values spanning the range [0,\text{max}(\mathbf{h})]. We count a time bin, indexed by b, as a false positive if h_{b} exceeds the threshold but no ground-truth sequence was present in that bin, and as a true positive if h_{b} exceeds the threshold and a sequence event did occur in that bin.

We repeated a similar analysis with 100 (approximate) samples from the PP-Seq posterior distribution. Specifically, we discretized the sampled sequence event times, i.e. \{\tau_{k}\}_{k=1}^{K} for every sample, into the same B time bins used by convNMF. For every time bin we computed the empirical probability that it contained a sampled event, averaging over MCMC samples. This resulted in a discretized, nonnegative temporal factor that is directly analogous to the factor \mathbf{h} produced by convNMF. We computed an ROC curve by the same method outlined above.

Unfortunately, both convNMF and PP-Seq parameters are only weakly identifiable. For example, in PP-Seq, one can add a constant to all latent event times, \tau_{k}\mapsto\tau_{k}+C, and subtract the same constant from all neuron offsets, b_{nr}\mapsto b_{nr}-C. This manipulation does not modify the likelihood of the data (it does, however, affect the probability of the parameters under the prior, which is why we say the model is “weakly identifiable”). As a result, both PP-Seq and convNMF may consistently misestimate the “true” sequence times by a small, constant number of time bins. To discount this nuisance source of model error, we repeated the above analysis on shifted copies of the temporal factor (up to 20 time bins in each direction); we then selected the optimal ROC curve for each model and computed the area under the curve to quantify accuracy.
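A rough sketch of this analysis is below, assuming `h` is the discretized temporal factor and `truth` is the boolean ground-truth vector; it uses a circular shift and a simple threshold grid purely for illustration, rather than the exact procedure used in the experiments.

```python
import numpy as np

def shifted_auc(h, truth, max_shift=20, n_thresh=200):
    """Threshold a temporal factor h against a binary ground-truth vector and
    take the best AUC over small integer shifts, to discount the time-offset
    ambiguity described above."""
    best_auc = 0.0
    for shift in range(-max_shift, max_shift + 1):
        h_s = np.roll(h, shift)                  # circular shift (sketch only)
        thresholds = np.concatenate(([-np.inf], np.linspace(0.0, h_s.max(), n_thresh)))
        tpr = np.array([(h_s[truth] > th).mean() for th in thresholds])
        fpr = np.array([(h_s[~truth] > th).mean() for th in thresholds])
        order = np.argsort(fpr)                  # integrate TPR over increasing FPR
        best_auc = max(best_auc, np.trapz(tpr[order], fpr[order]))
    return best_auc
```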

F.2 Hyperparameter sweep on hippocampal data

We performed MCMC inference on 2000 randomly sampled hyperparameter sets, masking out a random 7.5% of the data as a validation set on every run. We fixed the following hyperparameter values: γ=3\gamma=3, φ=1\varphi=1, μnr=0\mu_{nr}=0 for every neuron and sequence type, νnr=4\nu_{nr}=4 for every neuron and sequence type. We also fixed the following hyperparameters, pertaining to time warping: F=10F=10, σw2=100\sigma^{2}_{w}=100. We randomized the number of sequence types R{1,2,3,4}R\in\{1,2,3,4\} uniformly. We randomized the maximum warp value wFw_{F} uniformly on (1, 1.5). We randomized the mean sequence amplitude, α/β\alpha/\beta, log-uniformly over the interval (102,104)(10^{2},10^{4}), and set the variance of the sequence amplitude α/β2\alpha/\beta^{2} equal to the mean. We randomized the expected total background amplitude, λ\lambda^{\varnothing}, log-uniformly over the interval (102,104)(10^{2},10^{4}) and set the variance of this parameter equal to its mean. We randomized the sequence rate ψ\psi log-uniformly over (103,101)(10^{-3},10^{-1}). Finally, we randomized the neuron response widths cnrc_{nr} log-uniformly over (101,101)(10^{-1},10^{1}).

We randomly initialized all spike assignments either with convNMF or the annealed MCMC procedure outlined in the main text (we observed successful outcomes in both cases). In annealed runs, we set an initial temperature of 500 and exponentially relaxed the temperature to one over 20 stages, each containing 100 Gibbs sweeps. (Recall that this temperature parameter scales the variance of Ga(α,β)\mathrm{Ga}(\alpha,\beta) while preserving the mean of this prior distribution.) After annealing was completed we performed 100 final Gibbs sweeps over the data, the final half of which were used as approximate samples from the posterior. In this final stage of sampling, we also performed 1000 randomized split-merge Metropolis Hastings moves after every Gibbs sweep.
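One way to realize this temperature scaling is sketched below, together with the exponential schedule; this is our parameterization, which preserves the mean α/β while multiplying the variance α/β² by the temperature, and the experiments may use a different but equivalent scheme.

```python
import numpy as np

# Exponential temperature schedule: 20 stages, relaxing from 500 down to 1.
temperatures = np.geomspace(500.0, 1.0, num=20)

def tempered_amplitude_prior(alpha, beta, temperature):
    """Rescale the Ga(alpha, beta) amplitude prior so that its mean alpha/beta is
    unchanged while its variance alpha/beta^2 is multiplied by `temperature`."""
    return alpha / temperature, beta / temperature
```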

All MCMC runs were performed using our Julia implementation of PP-Seq, run in parallel on the Sherlock cluster at Stanford University. Each job was allocated 1 CPU and 8 GB of RAM.