
Point process models for sequence detection in high-dimensional neural spike trains

Alex H. Williams
Department of Statistics
Stanford University
Stanford, CA 94305
[email protected]
Anthony Degleris
Department of Electrical Engineering
Stanford University
Stanford, CA 94305
[email protected]
Yixin Wang
Department of Statistics
Columbia University
New York NY 10027
[email protected]
Scott W. Linderman
Department of Statistics
Stanford University
Stanford, CA 94305
[email protected]
Abstract

Sparse sequences of neural spikes are posited to underlie aspects of working memory [1], motor production [2], and learning [3, 4]. Discovering these sequences in an unsupervised manner is a longstanding problem in statistical neuroscience [5, 6, 7]. Promising recent work [4, 8] utilized a convolutive nonnegative matrix factorization model [9] to tackle this challenge. However, this model requires spike times to be discretized, utilizes a sub-optimal least-squares criterion, and does not provide uncertainty estimates for model predictions or estimated parameters. We address each of these shortcomings by developing a point process model that characterizes fine-scale sequences at the level of individual spikes and represents sequence occurrences as a small number of marked events in continuous time. This ultra-sparse representation of sequence events opens new possibilities for spike train modeling. For example, we introduce learnable time warping parameters to model sequences of varying duration, which have been experimentally observed in neural circuits [10]. We demonstrate these advantages on experimental recordings from songbird higher vocal center and rodent hippocampus.

1 Introduction

Identifying interpretable patterns in multi-electrode recordings is a longstanding and increasingly pressing challenge in neuroscience. Depending on the brain area and behavioral task, the activity of large neural populations may be low-dimensional as quantified by principal components analysis (PCA) or other latent variable models [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. However, many datasets do not conform to these modeling assumptions and instead exhibit high-dimensional behaviors [24]. Neural sequences are an important example of high-dimensional structure: if $N$ neurons fire sequentially with no overlap, the resulting dynamics are $N$-dimensional and cannot be efficiently summarized by PCA or other linear dimensionality reduction methods [4]. Such sequences underlie current theories of working memory [1, 25], motor production [2], and memory replay [10]. More generally, neural sequences are seen as flexible building blocks for learning and executing complex neural computations [26, 27, 28].

Prior Work In practice, neural sequences are usually identified in a supervised manner by correlating neural firing with salient sensory cues or behavioral actions. For example, hippocampal place cells fire sequentially when rodents travel down a narrow hallway, and these sequences can be found by averaging spike times over multiple traversals of the hallway. After identifying this sequence on a behavioral timescale lasting seconds, a template matching procedure can be used to show that these sequences reoccur on compressed timescales during wake [29] and sleep [30]. In other cases, sequences can be identified relative to salient features of the local field potential (LFP; [31, 32]).

Developing unsupervised alternatives that directly extract sequences from multineuronal spike trains would broaden the scope of this research and potentially uncover new sequences that are not linked to behaviors or sensations [33]. Several works have shown preliminary progress in this direction. Maboudi et al. [34] proposed fitting a hidden Markov model (HMM) and then identifying sequences from the state transition matrix. Grossberger et al. [35] and van der Meij & Voytek [36] applied clustering to features computed over a sliding window. Others use statistics such as time-lagged correlations to detect sequences and other spatiotemporal patterns in a bottom-up fashion [6, 7].

In this paper, we develop a Bayesian point process model that generalizes convolutive nonnegative matrix factorization (convNMF; [9]), which was recently used by [8] and [4] to model neural sequences. Briefly, the convNMF model discretizes each neuron's spike train into a vector of $B$ time bins, $\mathbf{x}_{n} \in \mathbb{R}_{+}^{B}$ for neuron $n$, and approximates this collection of vectors as a sum of convolutions, $\mathbf{x}_{n} \approx \sum_{r=1}^{R} \mathbf{w}_{n,r} * \mathbf{h}_{r}$. The model parameters ($\mathbf{w}_{n,r} \in \mathbb{R}_{+}^{L}$ and $\mathbf{h}_{r} \in \mathbb{R}_{+}^{B}$) are optimized with respect to a least-squares criterion. Each component of the model, indexed by $r \in \{1, \ldots, R\}$, consists of a neural factor $\mathbf{W}_{r} \in \mathbb{R}_{+}^{N \times L}$ and a temporal factor $\mathbf{h}_{r}$. The neural factor encodes a spatiotemporal pattern of neural activity over $L$ time bins (which is hoped to be a sequence), while the temporal factor indicates the times at which this pattern of neural activity occurs. A total of $R$ motifs or sequence types, each corresponding to a different component, are extracted by this model.

There are several compelling features of this approach. Similar to classical nonnegative matrix factorization [37], and in contrast to clustering methods, convNMF captures sequences with overlapping groups of neurons by an intuitive “parts-based” representation. Indeed, convNMF uncovered overlapping sequences in experimental data from songbird higher vocal center (HVC) [4]. Further, if the same sequence is repeated with different peak firing rates, convNMF can capture this by varying the magnitude of the entries in $\mathbf{h}_{r}$, unlike many clustering methods. Finally, convNMF efficiently pools statistical power across large populations of neurons to identify sequences even when the correlations between temporally adjacent neurons are noisy—other methods, such as HMMs and bottom-up agglomerative clustering, require reliable pairwise correlations to string together a full sequence.

Our Contributions We propose a point process model for neural sequences (PP-Seq) which extends and generalizes convNMF to continuous time and uses a fully probabilistic Bayesian framework. This enables us to better quantify uncertainty in key parameters—e.g. the overall number of sequences and the times at which they occur—and also characterize the data at finer timescales—e.g. whether individual spikes were evoked by a sequence. Most importantly, by achieving an extremely sparse representation of sequence event times, the PP-Seq model enables a variety of model extensions that are not easily incorporated into convNMF or other common methods. We explore one such extension that introduces time warping factors to model sequences of varying duration, as is often observed in neural data [10].

Though we focus on applications in neuroscience, our approach could be adapted to other temporal point processes, which are a natural framework to describe data that are collected at irregular intervals (e.g. social media posts, consumer behaviors, and medical records) [38, 39]. We draw a novel connection between Neyman-Scott processes [40], which encompass PP-Seq and other temporal point process models as special cases, and mixture of finite mixture models [63]. Exploiting this insight, we develop innovative Markov chain Monte Carlo (MCMC) methods for PP-Seq.

2 Model

Figure 1: (A) Example of a Neyman-Scott process over a 2D region. Latent events (purple dots) are first sampled from a homogeneous Poisson process. Each latent event then spawns a number of nearby observed events (gray dots), according to an inhomogeneous Poisson process. (B) A spike train can be modeled as a Neyman-Scott process with marked events over a 1D interval representing time. Latent events ($z_k$; sequences) evoke observed events ($x_s$; spikes) ordered in a sequence. An example with $K=3$ latent events evoking $R=2$ different sequence types (blue & red) is shown.

2.1 Point Process Models and Neyman-Scott Processes

Point processes are probabilistic models that generate discrete sets $\{x_{1}, x_{2}, \ldots\} \triangleq \{x_{s}\}_{s=1}^{S}$ over some continuous space $\mathcal{X}$. Each member of the set $x_{s} \in \mathcal{X}$ is called an “event.” A Poisson process [58] is a point process which satisfies three properties. First, the number of events falling within any region $V \subset \mathcal{X}$ is Poisson distributed. Second, there exists a function $\lambda(x): \mathcal{X} \mapsto \mathbb{R}_{+}$, called the intensity function, for which $\int_{V} \lambda(x)\,\mathrm{d}x$ equals the expected number of events in $V$. Third, the numbers of events in non-overlapping regions of $\mathcal{X}$ are independent. A Poisson process is said to be homogeneous when $\lambda(x) = c$ for some constant $c$. Finally, a marked point process extends the usual definition of a point process to incorporate additional information into each event. A marked point process on $\mathcal{X}$ generates random tuples $x_{s} = (\tilde{x}_{s}, m_{s})$, where $\tilde{x}_{s} \in \mathcal{X}$ represents the random location and $m_{s} \in \mathcal{M}$ is the additional “mark” specifying metadata associated with event $s$. See Supplement A for further background details and references.

Neyman-Scott Processes [40] use Poisson processes as building blocks to model clustered data. To simulate a Neyman-Scott process, first sample a set of latent events $\{z_{k}\}_{k=1}^{K} \subset \mathcal{Z}$ from an initial Poisson process with intensity function $\lambda(z): \mathcal{Z} \mapsto \mathbb{R}_{+}$. Thus, the number of latent events is a Poisson-distributed random variable, $K \sim \text{Poisson}(\int_{\mathcal{Z}} \lambda(z)\,\mathrm{d}z)$. Given the latent events, the observed events $\{x_{s}\}_{s=1}^{S}$ are drawn from a second Poisson process with conditional intensity function $\lambda(x) = \lambda^{\varnothing} + \sum_{k=1}^{K} g(x; z_{k})$. The nonnegative functions $g(x; z_{k})$ can be thought of as impulse responses—each latent event adds to the rate of observed events, and stochastically generates some number of observed offspring events. Finally, the scalar parameter $\lambda^{\varnothing} > 0$ specifies a “background” rate; thus, if $K = 0$, the observations follow a homogeneous Poisson process with intensity $\lambda^{\varnothing}$.

A simple example of a Neyman-Scott process in $\mathbb{R}^{2}$ is shown in fig. 1A. Latent events (purple dots) are first drawn from a homogeneous Poisson process and specify the cluster centroids. Each latent event induces an isotropic Gaussian impulse response. Observed events (gray dots) are then sampled from a second Poisson process, whose intensity function is found by summing all impulse responses.
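A minimal simulation sketch of this construction, written in Julia (the language of the reference implementation mentioned in Section 4) but not taken from the paper's code; the rate, offspring count, and bandwidth are arbitrary illustrative values, and the background process is omitted for brevity:

using Distributions, Random

# Simulate a Neyman-Scott process on the unit square, as in fig. 1A:
# latent events come from a homogeneous Poisson process, and each latent
# event spawns Poisson-many offspring around an isotropic Gaussian bump.
function simulate_neyman_scott(; rate=10.0, mean_offspring=20.0, bandwidth=0.03,
                               rng=Random.default_rng())
    K = rand(rng, Poisson(rate))                 # number of latent events
    latents = [rand(rng, 2) for _ in 1:K]        # uniform centroids on [0, 1]^2
    observed = Vector{Vector{Float64}}()
    for z in latents
        for _ in 1:rand(rng, Poisson(mean_offspring))
            push!(observed, z .+ bandwidth .* randn(rng, 2))   # Gaussian impulse response
        end
    end
    return latents, observed
end

latents, observed = simulate_neyman_scott()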

Importantly, we do not observe the number of latent events nor their locations. As described below, we use probabilistic inference to characterize this unobserved structure. A key idea is to attribute each observed event to a latent cause (either one of the latent events or the background process); these attributions are valid due to the additive nature of the latent intensity functions and the superposition principle of Poisson processes (see Supplement A). From this perspective, inferring the set of latent events is similar to inferring the number and location of clusters in a nonparametric mixture model.

2.2 Point Process Model of Neural Sequences (PP-Seq)

We model neural sequences as a Neyman-Scott process with marked events in a model we call PP-Seq. Consider a dataset with $N$ neurons, emitting a total of $S$ spikes over a time interval $[0, T]$. This can be encoded as a set of $S$ marked spike times—the observed events are tuples $x_{s} = (t_{s}, n_{s})$ specifying the time, $t_{s} \in [0, T]$, and neuron, $n_{s} \in \{1, \ldots, N\}$, of each spike. Sequences correspond to latent events, which are also tuples $z_{k} = (\tau_{k}, r_{k}, A_{k})$ specifying the time, $\tau_{k} \in [0, T]$, type, $r_{k} \in \{1, \ldots, R\}$, and amplitude, $A_{k} > 0$, of the sequence. The hyperparameter $R$ specifies the number of recurring sequence types, which is analogous to the number of components in convNMF.

To draw samples from the PP-Seq model, we first sample sequences (i.e., latent events) from a Poisson process with intensity $\lambda(z) \triangleq \lambda(\tau, r, A) = \psi\,\pi_{r}\,\mathrm{Ga}(A; \alpha, \beta)$. Here, $\psi > 0$ sets the rate at which sequences occur within the data, $\pi \in \Delta_{R}$ sets the probabilities of the $R$ sequence types, and $\alpha, \beta$ parameterize a gamma density which models the sequence amplitude, $A$. Note that the number of sequence events is a Poisson-distributed random variable, $K \sim \text{Poisson}(\psi T)$, where the rate parameter $\psi T$ is found by integrating $\lambda(z)$ over all sequence types, amplitudes, and times.
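Because this intensity factorizes over time, type, and amplitude, the latent events can be drawn in two steps: a Poisson count followed by independent marks. A minimal sketch with illustrative hyperparameter values (not the paper's defaults):

using Distributions

ψ, T = 0.1, 1000.0                  # sequence rate (events/s) and recording length (s)
πr   = [0.5, 0.5]                   # probabilities of the R = 2 sequence types
α, β = 20.0, 0.2                    # gamma prior on amplitudes (shape, rate)

K = rand(Poisson(ψ * T))            # number of sequence events
τ = rand(Uniform(0, T), K)          # sequence times
r = rand(Categorical(πr), K)        # sequence types
A = rand(Gamma(α, 1 / β), K)        # amplitudes (Distributions.jl uses shape/scale)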

Conditioned on the set of $K$ sequence events, the firing rate of neuron $n$ is given by a sum of nonnegative impulse responses:

$$\lambda_{n}(t) = \lambda^{\varnothing}_{n} + \sum_{k=1}^{K} g_{n}(t; z_{k}). \qquad (1)$$

We assume these impulse responses vary across neurons and follow a Gaussian form:

$$g_{n}(t; z_{k}) = A_{k} \cdot a_{n r_{k}} \cdot \mathcal{N}(t \mid \tau_{k} + b_{n r_{k}},\, c_{n r_{k}}), \qquad (2)$$

where $\mathcal{N}(t \mid \mu, \sigma^{2})$ denotes a Gaussian density. The parameters $a_{r} = (a_{1r}, \ldots, a_{Nr}) \in \Delta_{N}$, $b_{nr} \in \mathbb{R}$, and $c_{nr} \in \mathbb{R}_{+}$ correspond to the weight, latency, and width, respectively, of neurons' firing rates in sequences of type $r$. Since the firing rate is a sum of non-negative impulse responses, the superposition principle of Poisson processes (see Supplement A) implies that we can view the data as a union of “background” spikes and “induced” spikes from each sequence, justifying the connection to clustering. The expected number of spikes induced by sequence $k$ is:

$$\sum_{n=1}^{N} \int_{0}^{T} g_{n}(t; z_{k})\,\mathrm{d}t \;\approx\; \sum_{n=1}^{N} \int_{-\infty}^{\infty} g_{n}(t; z_{k})\,\mathrm{d}t \;=\; A_{k}, \qquad (3)$$

and thus we may view $A_{k}$ as the amplitude of sequence event $k$.
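For concreteness, the conditional intensity in eqs. (1)-(2) can be evaluated directly. A minimal sketch with arbitrary illustrative response parameters (note that $c_{nr}$ is a variance, so the Normal density takes its square root as the standard deviation), meant to compose with the latent events $(\tau, r, A)$ sampled in the previous sketch:

using Distributions

N, R = 50, 2
λ_bg = fill(0.3, N)                    # background rates λ_n^∅ (spikes/s), illustrative
a = rand(Dirichlet(fill(1.0, N)), R)   # a[n, r]: neuron weights, each column sums to 1
b = 0.2 .* randn(N, R)                 # b[n, r]: per-neuron delays (s)
c = fill(0.01, N, R)                   # c[n, r]: per-neuron response variances (s²)

# Firing rate of neuron n at time t, given latent events (τ, r, A).
function firing_rate(t, n; λ_bg, τ, r, A, a, b, c)
    rate = λ_bg[n]
    for k in eachindex(τ)
        rate += A[k] * a[n, r[k]] * pdf(Normal(τ[k] + b[n, r[k]], sqrt(c[n, r[k]])), t)
    end
    return rate
end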

Figure 1B schematizes a simple case containing $K=3$ sequence events and $R=2$ sequence types. A complete description of the model's generative process is provided in Supplement B, but it can be summarized by the graphical model in fig. 1B, where we have global parameters $\Theta = (\theta_{\varnothing}, \{\theta_{r}\}_{r=1}^{R})$ with $\theta_{r} = (a_{r}, \{b_{nr}\}_{n=1}^{N}, \{c_{nr}\}_{n=1}^{N})$ for each sequence type, and $\theta_{\varnothing} = \{\lambda^{\varnothing}_{n}\}_{n=1}^{N}$ for the background process. We place weak priors on each parameter: the neural response weights $\{a_{nr}\}_{n=1}^{N}$ follow a Dirichlet prior for each sequence type, and $(b_{nr}, c_{nr})$ follows a normal-inverse-gamma prior for every neuron and sequence type. The background rate, $\lambda^{\varnothing}_{n}$, follows a gamma prior. We set the sequence event rate, $\psi$, to be a fixed hyperparameter, though this assumption could be relaxed.

Time-warped sequences PP-Seq can be extended to model more diverse sequence patterns by using higher-dimensional marks on the latent sequences. For example, we can model variability in sequence duration by introducing a time warping factor, $\omega_{k} > 0$, to each sequence event and changing eq. 2 to,

$$g_{n}(t; z_{k}) = A_{k} \cdot a_{n, r_{k}} \cdot \mathcal{N}(t \mid \tau_{k} + \omega_{k} b_{n, r_{k}},\, \omega_{k}^{2} c_{n, r_{k}}). \qquad (4)$$

This has the effect of linearly compressing or stretching each sequence in time (when $\omega_{k} < 1$ or $\omega_{k} > 1$, respectively). Such time warping is commonly observed in neural data [43, 44], and indeed, hippocampal sequences unfold $\sim$15-20 times faster during replay than during lived experiences [10]. We characterize this model in Supplement E and demonstrate its utility below.
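The warped impulse response of eq. (4) differs from eq. (2) only in that each event's warp factor $\omega_k$ rescales the per-neuron delays and widths; a one-line sketch in the same illustrative notation as the previous sketch:

using Distributions

# Warped impulse response g_n(t; z_k): delays scale by ω[k], variances by ω[k]².
warped_impulse(t, n, k; A, a, b, c, r, τ, ω) =
    A[k] * a[n, r[k]] * pdf(Normal(τ[k] + ω[k] * b[n, r[k]], ω[k] * sqrt(c[n, r[k]])), t)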

In principle, it is equally possible to incorporate time warping into discrete-time convNMF. However, since convNMF involves a dense temporal factor matrix $\mathbf{H} \in \mathbb{R}_{+}^{R \times B}$, the most straightforward extension would be to introduce a time warping factor for each component $r \in \{1, \ldots, R\}$ and each time bin $b \in \{1, \ldots, B\}$. This results in $O(RB)$ new trainable parameters, which poses non-trivial challenges in terms of computational efficiency, overfitting, and human interpretability. In contrast, PP-Seq represents sequence events as a set of $K$ latent events in continuous time. Since $K \ll RB$, this ultra-sparse representation naturally lends itself to modeling additional sequence features, as each new per-event feature introduces only $O(K)$ new parameters.

3 Collapsed Gibbs Sampling for Neyman-Scott Processes

Developing efficient algorithms for parameter inference in Neyman-Scott process models is an area of active research [45, 46, 47, 48]. To address this challenge, we developed a collapsed Gibbs sampling routine for Neyman-Scott processes, which encompasses the PP-Seq model as a special case. The method resembles “Algorithm 3” of [62]—a well-known approach for sampling from a Dirichlet process mixture model—and the collapsed Gibbs sampling algorithm for “mixture of finite mixtures” models developed by [63]. The idea is to partition observed spikes into background spikes and spikes induced by latent sequences, integrating over the sequence times, types, and amplitudes. Starting from an initial partition, the sampler iterates over individual spikes and probabilistically re-assigns them to (a) the background, (b) one of the remaining sequences, or (c) a new sequence. The number of sequences in the partition, $K^{*}$, changes as spikes are removed and re-assigned; thus, the algorithm is able to explore the full trans-dimensional space of partitions.

The re-assignment probabilities are determined by the prior distribution over partitions under the Neyman-Scott process and by the likelihood of the induced spikes assigned to each sequence. We state the conditional probabilities below and provide a full derivation in Supplement D. Let $K^{*}$ denote the number of sequences in the current partition after spike $x_{s}$ has been removed from its current assignment. (Note that the number of latent sequences $K$ may exceed $K^{*}$ if some sequences produce zero spikes.) Likewise, let $u_{s}$ denote the sequence assignment of the $s$-th spike, where $u_{s} = 0$ indicates assignment to the background process and $u_{s} \in \{1, \ldots, K^{*}\}$ indicates assignment to one of the latent sequence events. Finally, let $X_{k} = \{x_{s'} : u_{s'} = k, s' \neq s\}$ denote the spikes in the $k$-th cluster, excluding $x_{s}$, and let $S_{k} = |X_{k}|$ denote its size. The conditional probabilities of the partition under the possible assignments of spike $x_{s}$ are:

$$p(u_{s} = 0 \mid x_{s}, \{X_{k}\}_{k=1}^{K^{*}}, \Theta) \;\propto\; (1+\beta)\,\lambda^{\varnothing}_{n_{s}} \qquad (5)$$
$$p(u_{s} = k \mid x_{s}, \{X_{k}\}_{k=1}^{K^{*}}, \Theta) \;\propto\; (\alpha + S_{k}) \left[\sum_{r_{k}=1}^{R} p(r_{k} \mid X_{k})\, a_{n_{s} r_{k}}\, p(t_{s} \mid X_{k}, r_{k}, n_{s})\right] \qquad (6)$$
$$p(u_{s} = K^{*}+1 \mid x_{s}, \{X_{k}\}_{k=1}^{K^{*}}, \Theta) \;\propto\; \alpha \left(\frac{\beta}{1+\beta}\right)^{\alpha} \psi \sum_{r=1}^{R} \pi_{r}\, a_{n_{s} r} \qquad (7)$$

The sampling algorithm iterates over all spikes $x_{s} \in \{x_{1}, \ldots, x_{S}\}$ and updates their assignments holding the other spikes' assignments fixed. The probability of assigning spike $x_{s}$ to an existing cluster marginalizes over the time, type, and amplitude of the sequence, resulting in a collapsed Gibbs sampler [50, 62]. The exact form of the posterior probability $p(r_{k} \mid X_{k})$ and the parameters of the posterior predictive $p(t_{s} \mid X_{k}, r_{k}, n_{s})$ in eq. 6 are given in Supplement C.

After attributing each spike to a latent cause, it is straightforward to draw samples over the remaining model parameters—the latent sequences $\{z_{k}\}_{k=1}^{K^{*}}$ and global parameters $\Theta$. Given the spikes and assignments $\{x_{s}, u_{s}\}_{s=1}^{S}$, we sample the sequences (i.e. their times, amplitudes, types, etc.) from the closed-form conditional $p(z_{k} \mid \{x_{s} : u_{s} = k\}, \Theta)$. Given the sequences and spikes, we sample the conditional distribution on global parameters, $p(\Theta \mid \{z_{k}\}_{k=1}^{K^{*}}, \{x_{s}, u_{s}\}_{s=1}^{S})$. Under conjugate formulations, these updates are straightforward. With these steps, the Markov chain targets the posterior distribution on model parameters and partitions. Complete derivations are in Supplement D.
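To make eqs. (5)-(7) concrete, the sketch below computes the three unnormalized reassignment weights for a single spike. The cluster-specific posterior $p(r_k \mid X_k)$ and predictive density $p(t_s \mid X_k, r_k, n_s)$ are derived in Supplement C; here they are supplied as plain illustrative numbers rather than computed, and all other values are likewise illustrative, so this is an outline of the arithmetic rather than the paper's implementation.

α, β, ψ = 20.0, 0.2, 0.1             # amplitude prior (shape, rate) and sequence rate
πr      = [0.5, 0.5]                 # sequence type probabilities, r = 1, 2
λ_bg_n  = 0.3                        # background rate λ_{n_s}^∅ of the spike's neuron
a_n     = [0.02, 0.01]               # response weights a_{n_s r} of the spike's neuron

# One existing cluster with S_k spikes and precomputed marginal terms:
S_k    = 40
p_type = [0.9, 0.1]                  # p(r_k | X_k)
p_time = [1.8, 0.2]                  # p(t_s | X_k, r_k, n_s) evaluated at the spike time

w_background = (1 + β) * λ_bg_n                            # eq. (5): background
w_existing   = (α + S_k) * sum(p_type .* a_n .* p_time)    # eq. (6): join existing cluster
w_new        = α * (β / (1 + β))^α * ψ * sum(πr .* a_n)    # eq. (7): open a new cluster

probs = [w_background, w_existing, w_new]
probs ./= sum(probs)                 # normalized reassignment probabilities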

Figure 2: (A) Schematic of train/test partitions. We propose a speckled holdout pattern (bottom). (B) A subset of a synthetic spike train containing two sequence types. (C) Same data, but with grey regions showing the censored test set and yellow dots denoting imputed spikes. (D) Log-likelihood over Gibbs samples; positive values denote excess nats per unit time relative to a homogeneous Poisson process baseline. (E) Box plots showing the range of log-likelihoods on the train and test sets for different choices of $R$; cross-validation favors $R=2$, in agreement with the ground truth shown in panel B. (F-G) Performance benefits of parallel MCMC on synthetic and experimental neural data.

Improving MCMC mixing times The intensity over sequence amplitudes $A_{k}$ is proportional to the gamma density $\mathrm{Ga}(A_{k}; \alpha, \beta)$, and these hyperparameters affect the mixing time of the Gibbs sampler. Intuitively, if there is little probability of low-amplitude sequences, the sampler is unlikely to create new sequences and is therefore slow to explore different partitions of spikes. (This problem is common to other nonparametric Bayesian mixture models as well [e.g., 63].) If, on the other hand, the variance of $\mathrm{Ga}(\alpha, \beta)$ is large relative to the mean, then the probability of forming new clusters is non-negligible and the sampler tends to mix more effectively. Unfortunately, this latter regime is also probably of lesser scientific interest, since neural sequences are typically large in amplitude—they can involve many thousands of cells, each potentially contributing a small number of spikes [28, 2].

To address this issue, we propose an annealing procedure to initialize the Markov chain. We fix the mean of $A_{k}$ and adjust $\alpha$ and $\beta$ to slowly lower the variance of the amplitude distribution. Initially, the sampler produces many small clusters of spikes, and as we lower the variance of $\mathrm{Ga}(\alpha, \beta)$ to its target value, the Markov chain typically combines these clusters into larger sequences. We further improve performance by interspersing “split-merge” Metropolis-Hastings updates [64, 65] between Gibbs sweeps (see Supplement D.6). Finally, though we have not found it necessary, one could use convNMF to initialize the MCMC algorithm.
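A minimal sketch of such an annealing schedule (the number of stages and initial variance inflation are arbitrary choices, not the paper's settings): hold the prior mean $m = \alpha/\beta$ fixed and shrink the variance $v = \alpha/\beta^2$ geometrically toward its target, re-solving $\alpha = m^2/v$ and $\beta = m/v$ at each stage.

# Geometric annealing of the amplitude prior Ga(α, β): fixed mean, shrinking variance.
anneal_schedule(mean_amp, target_var; start_var=1e4 * target_var, n_stages=10) =
    [(mean_amp^2 / v, mean_amp / v)        # (α, β) with α/β = mean_amp and α/β² = v
     for v in exp.(range(log(start_var), log(target_var), length=n_stages))]

for (α, β) in anneal_schedule(100.0, 200.0)
    # ... run several Gibbs sweeps with the current (α, β) before tightening further ...
end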

Parallel MCMC Resampling the sequence assignments is the primary computational bottleneck for the Gibbs sampler. One pass over the data requires $O(SKR)$ operations, which quickly becomes costly when the operations are executed serially. While this computational cost is manageable for many datasets, we can improve performance substantially by parallelizing the computation [53]. Given $P$ processors and a spike train lasting $T$ seconds, we divide the dataset into intervals lasting $T/P$ seconds and allocate one interval per processor. The current global parameters, $\Theta$, are first broadcast to all processors. In parallel, the processors update the sequence assignments for their assigned spikes and then send back sufficient statistics describing each sequence. After these sufficient statistics are collected on a single processor, the global parameters are re-sampled and then broadcast back to the processors to initiate another iteration. This algorithm introduces some error since clusters are not shared across processors. In essence, this introduces erroneous edge effects if a sequence of spikes is split across two processors. However, these errors are negligible when the sequence length is much less than $T/P$, which we expect is the practical regime of interest.
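A minimal sketch of the data split only (the per-worker Gibbs updates and the gather/broadcast of sufficient statistics are omitted); the function and variable names are illustrative, not the package's API:

# Bin spikes into P contiguous time intervals of length T/P, one per worker.
function partition_spikes(times::Vector{Float64}, T::Float64, P::Int)
    chunks = [Int[] for _ in 1:P]
    for (s, t) in enumerate(times)
        p = clamp(ceil(Int, t / (T / P)), 1, P)     # interval (worker) index of spike s
        push!(chunks[p], s)
    end
    return chunks                                   # chunks[p] = spike indices for worker p
end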

4 Experiments

4.1 Cross-Validation and Demonstration of Computational Efficiency

Figure 3: Zebra Finch HVC data. (A) Raw spike train (top) and sequences revealed by PP-Seq (left) and convNMF (right). (B) Box plots summarizing samples from the posterior on the number of sequences, $K$, derived from three independent MCMC chains. (C) Co-occupancy matrix summarizing probabilities of spike pairs belonging to the same sequence. (D) Credible intervals for evoked amplitudes for sequence type 1 (red) and 2 (blue). (E) Credible intervals for response offsets (same order and coloring as D). Estimates are suppressed for small-amplitude responses (gray dots).

We evaluate model performance by computing the log-likelihood assigned to held-out data. Partitioning the data into training and testing sets must be done somewhat carefully—we cannot withhold time intervals completely (as in fig. 2A, top) or else the model will not accurately predict latent sequences occurring in these intervals; likewise, we cannot withhold individual neurons completely (as in fig. 2A, middle) or else the model will not accurately predict the response parameters of those held out cells. Thus, we adopt a “speckled” holdout strategy [54] as diagrammed at the bottom of fig. 2A. We treat held-out spikes as missing data and sample them as part of the MCMC algorithm. (Their conditional distribution is given by the PP-Seq generative model.) This approach involving a speckled holdout pattern and multiple imputation of missing data may be viewed as a continuous time extension of the methods proposed by [27] for convNMF.
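A minimal sketch of constructing such a speckled mask (block length and holdout fraction are arbitrary illustrative choices, not the paper's settings): the recording is tiled into (neuron, time-block) cells, a random subset of cells is withheld, and spikes falling in a held-out cell form the test set.

using Random

# heldout[n, b] is true if the (neuron n, time block b) cell belongs to the test set.
function speckled_mask(N::Int, T::Float64; block_len=1.0, holdout_frac=0.2,
                       rng=Random.default_rng())
    B = ceil(Int, T / block_len)
    return rand(rng, N, B) .< holdout_frac
end

# Test-set membership of a spike from neuron n at time t.
is_test(mask, n, t; block_len=1.0) =
    mask[n, clamp(ceil(Int, t / block_len), 1, size(mask, 2))]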

Panels B-E in fig. 2 show the results of this cross-validation scheme on a synthetic dataset with $R=2$ sequence types. The predictions of the model in held-out test regions closely match the ground truth—missing spikes are reliably imputed when they are part of a sequence (fig. 2C). Further, the likelihood of the train and test sets improves over the course of MCMC sampling (fig. 2D), and can be used as a metric for model comparison—in agreement with the ground truth, test performance plateaus for models containing greater than $R=2$ sequence types (fig. 2E).

Finally, to be of practical utility, the algorithm needs to run in a reasonable amount of time. Figure 2G shows that our Julia [55] implementation can fit a recording of 120 hippocampal neurons with hundreds of thousands of spikes in a matter of minutes, on a 2017 MacBook Pro (3.1 GHz Intel Core i7, 4 cores, 16 GB RAM). Run-time grows linearly with the number of spikes, as expected, but even with a single thread it only takes six minutes to perform 1000 Gibbs sweeps on a 15-minute recording with ${\sim}1.3 \times 10^{5}$ spikes. With parallel MCMC, this laptop performs the same number of sweeps in under two minutes. Our open-source implementation is available at:

https://github.com/lindermanlab/PPSeq.jl.

4.2 Zebra Finch Higher Vocal Center (HVC)

We first applied PP-Seq to a recording of HVC premotor neurons in a zebra finch (data available at http://github.com/FeeLab/seqNMF; originally published in [4]), which generate sequences that are time-locked to syllables in the bird's courtship song. Figure 3A qualitatively compares the performance of convNMF and PP-Seq. The raw data (top panel) show no visible spike patterns; however, clear sequences are revealed by sorting the neurons lexicographically by preferred sequence type and the temporal offset parameter inferred by PP-Seq. While both models extract similar sequences, PP-Seq provides a finer-scale annotation of the final result, providing, for example, attributions at the level of individual spikes to sequences (bottom left of fig. 3A).

Further, PP-Seq can quantify uncertainty in key parameters by considering the full sequence of MCMC samples. Figure 3B summarizes uncertainty in the total number of sequence events, i.e. $K$, over three independent MCMC chains with different random seeds—all chains converge to similar estimates; the uncertainty is largely due to the rapid sequences (in blue) shown in panel A. Figure 3C displays a symmetric matrix whose element $(i, j)$ corresponds to the probability that spike $i$ and spike $j$ are attributed to the same sequence. Finally, fig. 3D-E shows the amplitude and offset of each neuron's sequence-evoked response with 95% posterior credible intervals. These results naturally fall out of the probabilistic construction of the PP-Seq model, but have no obvious analogue in convNMF.

4.3 Time Warping Extension and Robustness to Noise

Figure 4: PP-Seq is more robust to various forms of noise than convNMF. (A-D) Comparison of PP-Seq and convNMF to detect sequence times in synthetic data with varying amounts of noise. Panel D shows the performance of the time-warped variant of PP-Seq (see eq. 4). (E) convNMF reconstruction of a spike train containing sequences with 9-fold time warping. (F) Performance of time-warped PP-Seq (see eq. 4) on the same data as panel E.

Songbird HVC is a specialized circuit that generates unusually clean and easy-to-detect sequences. To compare the robustness of PP-Seq and convNMF under more challenging circumstances, we created a simple synthetic dataset with $R=1$ sequence type and $N=100$ neurons. We varied four parameters to manipulate the difficulty of sequence extraction: the rate of background spikes, $\lambda^{\varnothing}_{n}$ (“additive noise,” fig. 4A), the expected value of the sequence amplitudes, $A_{k}$ (“participation noise,” fig. 4B), the expected variance of the Gaussian impulse responses, $c_{nr}$ (“jitter noise,” fig. 4C), and, finally, the maximal time warping coefficient (see eq. 4; fig. 4D). All simulated datasets involved sequences with low spike counts ($\mathbb{E}[A_{k}] < 100$ spikes). In this regime, the Poisson likelihood criterion used by PP-Seq is better matched to the statistics of the spike train. Since convNMF optimizes an alternative loss function (squared error instead of Poisson likelihood), we compared the models by their ability to extract the ground-truth sequence event times. Using the area under receiver operating characteristic (ROC) curves as a performance metric (see Supplement F.1), we see favorable results for PP-Seq as noise levels are increased.

We demonstrate the abilities of time-warped PP-Seq further in fig. 4E-F. Here, we show a synthetic dataset containing sequences with 9-fold variability in their duration, which is similar to levels observed in some experimental systems [10]. While convNMF fails to reconstruct many of these warped sequences, the PP-Seq model identifies sequences that are closely matched to ground truth.

4.4 Rodent Hippocampal Sequences

Finally, we tested PP-Seq and its time warping variant on a hippocampal recording from a rat making repeated runs down a linear track (data available at http://crcns.org/data-sets/hc/hc-11; originally published in [56]). This dataset is larger ($T \approx 16$ minutes, $S = 137{,}482$ spikes) and contains less stereotyped sequences than the songbird data. From prior work [56], we expect to see two sequences with overlapping populations of neurons, corresponding to the two running directions on the track. PP-Seq reveals these expected sequences in an unsupervised manner—i.e. without reference to the rat's position—as shown in fig. 5A-C.

We performed a large cross-validation sweep over 2,000 random hyperparameter settings for this dataset (see Supplement F.2). This confirmed that models with $R=2$ sequence types performed well in terms of held-out likelihood (fig. 5D). Interestingly, despite variability in running speeds, this same analysis did not show a consistent benefit to including larger time warping factors in the model (fig. 5E). Higher-performing models were characterized by larger sequence amplitudes, i.e. larger values of $\mathbb{E}[A_{k}] = \alpha/\beta$, and smaller background rates, i.e. smaller values of $\lambda^{\varnothing}$. Other parameters had less pronounced effects on performance. Overall, these results demonstrate that PP-Seq can be fruitfully applied to large-scale and “messy” neural datasets, that its hyperparameters can be tuned by cross-validation, and that the unsupervised learning of neural sequences conforms to existing scientific understanding gained via supervised methods.

Figure 5: Automatic detection of place coding sequences in rat hippocampus. (A) Raw spike train ($\sim$20% of the full dataset is shown). Blue and red arrowheads indicate when the rat reaches the end of the track and reverses direction. (B) Neurons re-sorted by PP-Seq (with $R=2$). (C) Sequences annotated by PP-Seq. (D-I) Validation log-likelihoods as a function of hyperparameter value. Each boxplot summarizes the 50 highest scoring models, randomly sampling the other hyperparameters.

5 Conclusion

We proposed a point process model (PP-Seq) inspired by convolutive NMF [9, 8, 4] to identify neural sequences. Both models approximate neural activity as the sum of nonnegative features, drawn from a fixed number of spatiotemporal motif types. Unlike convNMF, PP-Seq restricts each motif to be a true sequence—the impulse responses are Gaussian and hence unimodal. Further, PP-Seq is formulated in a probabilistic framework that better quantifies uncertainty (see fig. 3) and handles low firing rate regimes (see fig. 4). Finally, PP-Seq produces an extremely sparse representation of sequence events in continuous time, opening the door to a variety of model extensions including the introduction of time warping (see fig. 4F), as well as other possibilities like truncated sequences and “clusterless” observations [57], which could be explored in future work.

Despite these benefits, fitting PP-Seq involves tackling a challenging trans-dimensional inference problem inherent to Neyman-Scott point processes. We took several important steps towards overcoming this challenge by connecting these models to a more general class of Bayesian mixture models [63], developing and parallelizing a collapsed Gibbs sampler, and devising an annealed sampling approach to promote fast mixing. These innovations are sufficient to fit PP-Seq on datasets containing hundreds of thousands of spikes in just a few minutes on a modern laptop.

Acknowledgements

A.H.W. received funding support from the National Institutes of Health BRAIN initiative (1F32MH122998-01), and the Wu Tsai Stanford Neurosciences Institute Interdisciplinary Scholar Program. S.W.L. was supported by grants from the Simons Collaboration on the Global Brain (SCGB 697092) and the NIH BRAIN Initiative (U19NS113201 and R01NS113119). We thank the Stanford Research Computing Center for providing computational resources and support that contributed to these research results.

Broader Impact

Understanding neural computations in biological systems and ultimately the human brain is a grand and long-term challenge with broad implications for human health and society. The field of neuroscience is still taking early and incremental steps towards this goal. Our work develops a general-purpose, unsupervised method for identifying an important structure—neural sequences—which have been observed in a variety of experimental datasets and have been studied extensively by theorists. This work will serve to advance this growing understanding by providing new analytical tools for neuroscientists. We foresee no immediate impacts, positive or negative, concerning the general public.

References

  • [1] Mark S Goldman “Memory without feedback in a neural network” In Neuron 61.4, 2009, pp. 621–634
  • [2] Richard H R Hahnloser, Alexay A Kozhevnikov and Michale S Fee “An ultra-sparse code underlies the generation of neural sequences in a songbird” In Nature 419.6902, 2002, pp. 65–70
  • [3] Howard Eichenbaum “Time cells in the hippocampus: a new dimension for mapping memories” In Nat. Rev. Neurosci. 15.11 nature.com, 2014, pp. 732–744
  • [4] Emily L Mackevicius, Andrew H Bahle, Alex H Williams, Shijie Gu, Natalia I Denisenko, Mark S Goldman and Michale S Fee “Unsupervised discovery of temporal sequences in high-dimensional datasets, with applications to neuroscience” In Elife 8, 2019
  • [5] Moshe Abeles and Itay Gat “Detecting precise firing sequences in experimental data” In J. Neurosci. Methods 107.1-2 Elsevier, 2001, pp. 141–154
  • [6] Eleonora Russo and Daniel Durstewitz “Cell assemblies at multiple time scales with arbitrary lag constellations” In Elife 6 elifesciences.org, 2017
  • [7] Pietro Quaglio, Vahid Rostami, Emiliano Torre and Sonja Grün “Methods for identification of spike patterns in massively parallel spike trains” In Biol. Cybern. 112.1-2, 2018, pp. 57–80
  • [8] Sven Peter, Elke Kirschbaum, Martin Both, Lee Campbell, Brandon Harvey, Conor Heins, Daniel Durstewitz, Ferran Diego and Fred A Hamprecht “Sparse convolutional coding for neuronal assembly detection” In Advances in Neural Information Processing Systems 30 Curran Associates, Inc., 2017, pp. 3675–3685
  • [9] Paris Smaragdis “Convolutive speech bases and their application to supervised speech separation” In IEEE Trans. Audio Speech Lang. Processing ieeexplore.ieee.org, 2006
  • [10] Thomas J Davidson, Fabian Kloosterman and Matthew A Wilson “Hippocampal replay of extended experience” In Neuron 63.4, 2009, pp. 497–507
  • [11] Anne C Smith and Emery N Brown “Estimating a state-space model from point process observations” In Neural Computation 15.5 MIT Press, 2003, pp. 965–991
  • [12] K L Briggman, H D I Abarbanel and W B Kristan “Optical imaging of neuronal populations during decision-making” In Science 307.5711 science.sciencemag.org, 2005, pp. 896–901
  • [13] Byron M Yu, John P Cunningham, Gopal Santhanam, Stephen I Ryu, Krishna V Shenoy and Maneesh Sahani “Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity” In J. Neurophysiol. 102.1, 2009, pp. 614–635
  • [14] Liam Paninski, Yashar Ahmadian, Daniel Gil Ferreira, Shinsuke Koyama, Kamiar Rahnama Rad, Michael Vidne, Joshua Vogelstein and Wei Wu “A new look at state-space models for neural data” In Journal of Computational Neuroscience 29.1-2 Springer, 2010, pp. 107–126
  • [15] Jakob H Macke, Lars Buesing, John P Cunningham, M Yu Byron, Krishna V Shenoy and Maneesh Sahani “Empirical models of spiking in neural populations” In Advances in Neural Information Processing Systems, 2011, pp. 1350–1358
  • [16] David Pfau, Eftychios A Pnevmatikakis and Liam Paninski “Robust learning of low-dimensional dynamics from large neural ensembles” In Advances in Neural Information Processing Systems, 2013, pp. 2391–2399
  • [17] Peiran Gao and Surya Ganguli “On simplicity and complexity in the brave new world of large-scale neuroscience” In Curr. Opin. Neurobiol. 32, 2015, pp. 148–155
  • [18] Yuanjun Gao, Evan Archer, Liam Paninski and John P Cunningham “Linear dynamical neural population models through nonlinear embeddings”, 2016 arXiv:1605.08454 [q-bio.NC]
  • [19] Yuan Zhao and Il Memming Park “Variational Latent Gaussian Process for Recovering Single-Trial Dynamics from Population Spike Trains” In Neural Comput. 29.5 MIT Press, 2017, pp. 1293–1316
  • [20] Anqi Wu, Stan Pashkovski, Sandeep R Datta and Jonathan W Pillow “Learning a latent manifold of odor representations from neural responses in piriform cortex” In Advances in Neural Information Processing Systems 31 Curran Associates, Inc., 2018, pp. 5378–5388
  • [21] Alex H Williams, Tony Hyun Kim, Forea Wang, Saurabh Vyas, Stephen I Ryu, Krishna V Shenoy, Mark Schnitzer, Tamara G Kolda and Surya Ganguli “Unsupervised discovery of demixed, low-dimensional neural dynamics across multiple timescales through tensor component analysis” In Neuron 98.6, 2018, pp. 1099–1115.e8
  • [22] Scott Linderman, Annika Nichols, David Blei, Manuel Zimmer and Liam Paninski “Hierarchical recurrent state space models reveal discrete and continuous dynamics of neural activity in C. elegans” In bioRxiv, 2019, pp. 621540
  • [23] Lea Duncker, Gergo Bohner, Julien Boussard and Maneesh Sahani “Learning interpretable continuous-time models of latent stochastic dynamical systems” In Proceedings of the 36th International Conference on Machine Learning 97, Proceedings of Machine Learning Research Long Beach, California, USA: PMLR, 2019, pp. 1726–1734
  • [24] Carsen Stringer, Marius Pachitariu, Nicholas Steinmetz, Matteo Carandini and Kenneth D Harris “High-dimensional geometry of population responses in visual cortex” In Nature 571.7765, 2019, pp. 361–365
  • [25] Christopher D Harvey, Philip Coen and David W Tank “Choice-specific sequences in parietal cortex during a virtual-navigation decision task” In Nature 484.7392, 2012, pp. 62–68
  • [26] Mikhail I Rabinovich, Ramón Huerta, Pablo Varona and Valentin S Afraimovich “Generation and reshaping of sequences in neural systems” In Biol. Cybern. 95.6 Springer, 2006, pp. 519–536
  • [27] Emily Lambert Mackevicius and Michale Sean Fee “Building a state space for song learning” In Curr. Opin. Neurobiol. 49 Elsevier, 2018, pp. 59–68
  • [28] György Buzsáki and David Tingley “Space and time: the hippocampus as a sequence generator” In Trends Cogn. Sci. 22.10, 2018, pp. 853–869
  • [29] Eva Pastalkova, Vladimir Itskov, Asohan Amarasingham and György Buzsáki “Internally generated cell assembly sequences in the rat hippocampus” In Science 321.5894, 2008, pp. 1322–1327
  • [30] Daoyun Ji and Matthew A Wilson “Coordinated memory replay in the visual cortex and hippocampus during sleep” In Nat. Neurosci. 10.1 nature.com, 2007, pp. 100–107
  • [31] Hannah R Joo and Loren M Frank “The hippocampal sharp wave-ripple in memory retrieval for immediate use and consolidation” In Nat. Rev. Neurosci. 19.12, 2018, pp. 744–757
  • [32] Wei Xu, Felipe Carvalho and Andrew Jackson “Sequential neural activity in primary motor cortex during sleep” In J. Neurosci. 39.19, 2019, pp. 3698–3712
  • [33] David Tingley and Adrien Peyrache “On the methods for reactivation and replay analysis” In Philos. Trans. R. Soc. Lond. B Biol. Sci. 375.1799 royalsocietypublishing.org, 2020, pp. 20190231
  • [34] Kourosh Maboudi, Etienne Ackermann, Laurel Watkins Jong, Brad E Pfeiffer, David Foster, Kamran Diba and Caleb Kemere “Uncovering temporal structure in hippocampal output patterns” In Elife 7, 2018
  • [35] Lukas Grossberger, Francesco P Battaglia and Martin Vinck “Unsupervised clustering of temporal patterns in high-dimensional neuronal ensembles using a novel dissimilarity measure” In PLoS Comput. Biol. 14.7, 2018, pp. e1006283
  • [36] Roemer van der Meij and Bradley Voytek “Uncovering neuronal networks defined by consistent between-neuron spike timing from neuronal spike recordings” In eNeuro 5.3 ncbi.nlm.nih.gov, 2018
  • [37] Daniel D Lee and H Sebastian Seung “Learning the parts of objects by non-negative matrix factorization” In Nature 401.6755, 1999, pp. 788–791
  • [38] Jesper Moller and Rasmus Plenge Waagepetersen “Statistical Inference and Simulation for Spatial Point Processes” Taylor & Francis, 2003
  • [39] Manuel Gomez Rodriguez and Isabel Valera “Learning with Temporal Point Processes”, Tutorial at ICML, 2018
  • [40] Jerzy Neyman and Elizabeth L Scott “Statistical approach to problems of cosmology” In J. R. Stat. Soc. Series B Stat. Methodol. 20.1, 1958, pp. 1–29
  • [41] Jeffrey W Miller and Matthew T Harrison “Mixture models with a prior on the number of components” In J. Am. Stat. Assoc. 113.521, 2018, pp. 340–356
  • [42] John Frank Charles Kingman “Poisson processes” Clarendon Press, 2002
  • [43] Lea Duncker and Maneesh Sahani “Temporal alignment and latent Gaussian process factor inference in population spike trains” In Advances in Neural Information Processing Systems 31 Curran Associates, Inc., 2018, pp. 10445–10455
  • [44] Alex H Williams, Ben Poole, Niru Maheswaranathan, Ashesh K Dhawale, Tucker Fisher, Christopher D Wilson, David H Brann, Eric M Trautmann, Stephen Ryu, Roman Shusterman, Dmitry Rinberg, Bence P Ölveczky, Krishna V Shenoy and Surya Ganguli “Discovering precise temporal patterns in large-scale neural recordings through robust and interpretable time warping” In Neuron 105.2, 2020, pp. 246–259.e8
  • [45] Ushio Tanaka, Yosihiko Ogata and Dietrich Stoyan “Parameter estimation and model selection for Neyman-Scott point processes” In Biometrical Journal: Journal of Mathematical Methods in Biosciences 50.1 Wiley Online Library, 2008, pp. 43–57
  • [46] Ushio Tanaka and Yosihiko Ogata “Identification and estimation of superposed Neyman–Scott spatial cluster processes” In Ann. Inst. Stat. Math. 66.4, 2014, pp. 687–702
  • [47] Jiří Kopecký and Tomáš Mrkvička “On the Bayesian estimation for the stationary Neyman-Scott point processes” In Appl. Math. 61.4, 2016, pp. 503–514
  • [48] Yosihiko Ogata “Cluster analysis of spatial point patterns: posterior distribution of parents inferred from offspring” In Japanese Journal of Statistics and Data Science, 2019
  • [49] Radford M Neal “Markov chain sampling methods for Dirichlet process mixture models” In J. Comput. Graph. Stat. 9.2 Taylor & Francis, 2000, pp. 249–265
  • [50] Jun S Liu, Wing Hung Wong and Augustine Kong “Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes” In Biometrika 81.1 Oxford Academic, 1994, pp. 27–40
  • [51] Sonia Jain and Radford M Neal “A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model” In J. Comput. Graph. Stat. 13.1 Taylor & Francis, 2004, pp. 158–182
  • [52] Sonia Jain and Radford M Neal “Splitting and merging components of a nonconjugate Dirichlet process mixture model” In Bayesian Anal. 2.3 International Society for Bayesian Analysis, 2007, pp. 445–472
  • [53] Elaine Angelino, Matthew James Johnson and Ryan P Adams “Patterns of scalable Bayesian inference” In Foundations and Trends® in Machine Learning 9.2-3 Now Publishers, 2016, pp. 119–247
  • [54] Svante Wold “Cross-validatory estimation of the number of components in factor and principal components models” In Technometrics 20.4 [Taylor & Francis, Ltd., American Statistical Association, American Society for Quality], 1978, pp. 397–405
  • [55] Jeff Bezanson, Alan Edelman, Stefan Karpinski and Viral B Shah “Julia: A fresh approach to numerical computing” In SIAM review 59.1 SIAM, 2017, pp. 65–98
  • [56] Andres D Grosmark and György Buzsáki “Diversity in neural firing dynamics supports both rigid and learned hippocampal sequences” In Science 351.6280, 2016, pp. 1440–1443
  • [57] Xinyi Deng, Daniel F Liu, Kenneth Kay, Loren M Frank and Uri T Eden “Clusterless decoding of position from multiunit activity using a marked point process filter” In Neural computation 27.7 MIT Press, 2015, pp. 1438–1460

References

  • [58] John Frank Charles Kingman “Poisson processes” Clarendon Press, 2002
  • [59] D J Daley and D Vere-Jones “An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods” Springer, New York, NY, 2003
  • [60] Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari and Donald B Rubin “Bayesian data analysis” CRC press, 2013
  • [61] Kevin P Murphy “Machine learning: a probabilistic perspective” MIT press, 2012
  • [62] Radford M Neal “Markov chain sampling methods for Dirichlet process mixture models” In J. Comput. Graph. Stat. 9.2 Taylor & Francis, 2000, pp. 249–265
  • [63] Jeffrey W Miller and Matthew T Harrison “Mixture models with a prior on the number of components” In J. Am. Stat. Assoc. 113.521, 2018, pp. 340–356
  • [64] Sonia Jain and Radford M Neal “A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model” In J. Comput. Graph. Stat. 13.1 Taylor & Francis, 2004, pp. 158–182
  • [65] Sonia Jain and Radford M Neal “Splitting and merging components of a nonconjugate Dirichlet process mixture model” In Bayesian Anal. 2.3 International Society for Bayesian Analysis, 2007, pp. 445–472


Supplementary Material: Point process models for sequence detection in high-dimensional spike trains

Appendix A Background

This section describes some basic properties and characteristics of Poisson processes that are relevant to understand the PP-Seq model. See [58] for a targeted introduction to Poisson processes and [59] for a broader introduction to point processes.

Definition of a Poisson Process — A point process model satisfying the following three properties is called a Poisson process.

  1. (i)

    The number of events falling within any region of the sample space $\mathcal{X}$ is a random variable following a Poisson distribution. For a region $R \subset \mathcal{X}$ we denote the number of events occurring in $R$ as $N(R)$.

  2. (ii)

    There exists a function $\lambda(x): \mathcal{X} \mapsto \mathbb{R}_{+}$, called the intensity function, which satisfies $N(R) \sim \text{Poisson}\big(\int_{R} \lambda(x)\,dx\big)$ for every region $R \subset \mathcal{X}$.

  3. (iii)

    For any two non-overlapping regions $R$ and $R'$, $N(R)$ and $N(R')$ are independent random variables.

We write $\{x_{i}\}_{i=1}^{n} \sim \mathrm{PP}(\lambda)$ to denote that a set of events $\{x_{i}\}_{i=1}^{n} = \{x_{1}, \ldots, x_{n}\}$ is distributed according to a Poisson process with intensity function $\lambda$.

Marked Poisson Processes — A marked Poisson process extends the above definition to sets of events marked with additional information. For example, we write $\{(x_{i}, m_{i})\}_{i=1}^{n} \sim \mathrm{PP}(\lambda(x, m))$ to denote a marked Poisson process where $m_{i} \in \mathcal{M}$ is the “mark” associated with each event. Note that the domain of the intensity function is extended to $\mathcal{X} \times \mathcal{M}$. To pick a concrete example of interest, the events could represent spike times, in which case $\mathcal{X}$ would be an interval of the real line, e.g. $[0, T]$, representing the recorded period. Further, $\mathcal{M}$ would be a discrete set of neuron labels, e.g. $\{1, \ldots, N\}$ for a recording of $N$ cells.

Likelihood function — Given a set of events $\{x_{i}\}_{i=1}^{n}$ in $\mathcal{X}$, the likelihood of this set with respect to a Poisson process $\mathrm{PP}(\lambda)$ is:

$$p(\{x_{i}\}_{i=1}^{n} \mid \lambda) = \prod_{i=1}^{n} \lambda(x_{i}) \cdot \exp\left\{-\int_{\mathcal{X}} \lambda(x)\,dx\right\} \qquad (8)$$

Sampling — Given an intensity function $\lambda(x)$ and a sample space $\mathcal{X}$, we can simulate data from a Poisson process by first sampling the number of events:

$$N \sim \text{Poisson}\left(\int_{\mathcal{X}} \lambda(x)\,dx\right) \qquad (9)$$

and then sampling the location of each event according to the normalized intensity function. That is, for $i \in \{1, \ldots, N\}$, we independently sample the events according to:

$$x_{i} \sim F \quad \text{where the p.d.f. of } F \text{ is } \frac{\lambda(x)}{\int_{\mathcal{X}} \lambda(x)\,dx} \qquad (10)$$
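A minimal sketch of this two-step recipe for a 1D example where the intensity is an unnormalized Gaussian bump, $\lambda(x) = A\,\mathcal{N}(x \mid \mu, \sigma^2)$, so that $\int \lambda(x)\,dx = A$ and the normalized intensity is simply $\mathcal{N}(\mu, \sigma^2)$ (the values are illustrative):

using Distributions

A, μ, σ = 50.0, 0.0, 1.0
N = rand(Poisson(A))              # step 1: number of events, eq. (9)
events = rand(Normal(μ, σ), N)    # step 2: locations from the normalized intensity, eq. (10)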

Superposition Principle — The union of $K$ independent Poisson processes with nonnegative intensity functions $\lambda_{1}(x), \ldots, \lambda_{K}(x)$ is a Poisson process with intensity function $\Lambda(x) = \sum_{k=1}^{K} \lambda_{k}(x)$.

As a consequence, we can sample from a complicated Poisson process by first breaking it down into a series of simpler Poisson processes, drawing independent samples from these simpler processes, and taking the union of all events to achieve a sample from the original Poisson process. We exploit this property in section B.1 to draw samples from the PP-Seq model.

Appendix B The PP-Seq Model

B.1 Generative Process

Recall the generative model described in the main text. Let $z_{k} = (\tau_{k}, r_{k}, A_{k})$ denote the $k$-th latent sequence, which consists of a time $\tau_{k}$, type $r_{k}$, and amplitude $A_{k}$. Extensions of the model introduce further properties, like time warping factors $\omega_{k}$, for each latent sequence. The latent events (i.e. instantiations of a sequence) are sampled from a Poisson process,

$$\{(\tau_{k}, r_{k}, A_{k})\}_{k=1}^{K} \sim \mathrm{PP}\big(\lambda(\tau, r, A)\big) \qquad (11)$$
$$\lambda(\tau, r, A) = \psi\,\pi_{r}\,\mathrm{Ga}(A; \alpha, \beta). \qquad (12)$$

Note that this is a homogeneous Poisson process since the intensity function does not depend on time. Conditioned on the latent events, the observed spikes are then sampled from a second, inhomogeneous, Poisson process,

$$\{(t_{s}, n_{s})\}_{s=1}^{S} \sim \mathrm{PP}\left(\lambda(t, n \mid \{(\tau_{k}, r_{k}, A_{k})\}_{k=1}^{K})\right) \qquad (13)$$
$$\lambda(t, n \mid \{(\tau_{k}, r_{k}, A_{k})\}_{k=1}^{K}) = \lambda^{\varnothing}(t, n) + \sum_{k=1}^{K} g(t, n \mid \tau_{k}, r_{k}, A_{k}) \qquad (14)$$
$$g(t, n \mid \tau_{k}, r_{k}, A_{k}) = A_{k} \cdot a_{n r_{k}} \cdot \mathcal{N}(t \mid \tau_{k} + b_{n r_{k}}, c_{n r_{k}}). \qquad (15)$$

In the main text we abbreviate $\lambda_{n}(t) \triangleq \lambda(t, n \mid \{(\tau_{k}, r_{k}, A_{k})\}_{k=1}^{K})$, $\lambda^{\varnothing}(t, n) \triangleq \lambda^{\varnothing}_{n}$, and $g_{n}(t \mid z_{k}) \triangleq g(t, n \mid \tau_{k}, r_{k}, A_{k})$.

Observe that the firing rates are sums of non-negative intensity functions. We can use the superposition principle of Poisson processes (see appendix A) to write the observed spikes as the union of spikes generated by the background and by each of the latent events.

$$\{(t_{s}^{(0)}, n_{s}^{(0)})\}_{s=1}^{S_{0}} \sim \mathrm{PP}\big(\lambda^{\varnothing}(t, n)\big) \qquad (16)$$
$$\{(t_{s}^{(k)}, n_{s}^{(k)})\}_{s=1}^{S_{k}} \sim \mathrm{PP}\big(A_{k} \cdot a_{n r_{k}} \cdot \mathcal{N}(t \mid \tau_{k} + b_{n r_{k}}, c_{n r_{k}})\big) \quad \text{for } k = 1, \ldots, K \qquad (17)$$
$$\{(t_{s}, n_{s})\}_{s=1}^{S} = \bigcup_{k=0}^{K} \{(t_{s}^{(k)}, n_{s}^{(k)})\}_{s=1}^{S_{k}}, \qquad (18)$$

where $S_{k}$ denotes the number of spikes generated by the $k$-th component, with $k=0$ denoting the background spikes.

Next, note that each of the constituent Poisson process intensity functions is simple, in that the normalized intensities correspond to simple densities. In the case of the background, the normalized intensity is categorical in the neuron indices and uniform in time. The impulse responses, once normalized, are categorical in the neuron indices and conditionally Gaussian in time. We use these facts to sample the Poisson processes as described in Appendix A.

The final sampling procedure for PP-Seq is as follows. First sample the global parameters,

$$\lambda^{\varnothing} \sim \mathrm{Gamma}(\alpha_{\varnothing}, \beta_{\varnothing}) \qquad (19)$$
$$\pi^{\varnothing} \sim \text{Dirichlet}(\gamma_{\varnothing} \cdot \boldsymbol{1}_{N}) \qquad (20)$$
$$\pi \sim \text{Dirichlet}(\gamma \cdot \boldsymbol{1}_{R}) \qquad (21)$$
$$\mathbf{a}_{r} \sim \text{Dirichlet}(\varphi \cdot \boldsymbol{1}_{N}) \quad \text{for } r \in \{1, \ldots, R\} \qquad (22)$$
$$c_{nr} \sim \text{Inv-}\chi^{2}(\nu, \sigma^{2}) \quad \text{for } (n, r) \in \{1, \ldots, N\} \times \{1, \ldots, R\} \qquad (23)$$
$$b_{nr} \sim \text{Normal}(0, c_{nr}/\kappa) \quad \text{for } (n, r) \in \{1, \ldots, N\} \times \{1, \ldots, R\}. \qquad (24)$$

Then sample the latent events,

K\displaystyle K Poisson(ψT)\displaystyle\sim\text{Poisson}(\psi\cdot T) (25)
rk\displaystyle r_{k} Categorical(π)\displaystyle\sim\text{Categorical}(\pi) for k{1,,K}k\in\{1,\ldots,K\} (26)
τk\displaystyle\tau_{k} Uniform([0,T])\displaystyle\sim\text{Uniform}(\,[0,T]\,) for k{1,,K}k\in\{1,\ldots,K\} (27)
Ak\displaystyle A_{k} Gamma(α,β)\displaystyle\sim\text{Gamma}(\alpha,\beta) for k{1,,K}.\displaystyle\text{for $k\in\{1,\ldots,K\}$}. (28)

Next sample the background spikes,

S\displaystyle S_{\varnothing} Poisson(λT)\displaystyle\sim\text{Poisson}\big{(}\lambda^{\varnothing}\cdot T\big{)} (29)
n0(s)\displaystyle n_{0}^{(s)} Categorical(π)\displaystyle\sim\text{Categorical}(\pi^{\varnothing}) for s{1,,S}s\in\{1,\ldots,S_{\varnothing}\} (30)
t0(s)\displaystyle t_{0}^{(s)} Uniform([0,T])\displaystyle\sim\text{Uniform}(\,[0,T]\,) for s{1,,S}s\in\{1,\ldots,S_{\varnothing}\} (31)

Note that we have decomposed each neuron’s background firing rate as the product: λn=λπn\lambda^{\varnothing}_{n}=\lambda^{\varnothing}\pi^{\varnothing}_{n}. Finally, sample the induced spikes in each sequence,

Sk\displaystyle S_{k} Poisson(Ak)\displaystyle\sim\text{Poisson}(A_{k}) for k{1,,K}k\in\{1,\ldots,K\} (32)
nk(s)\displaystyle n_{k}^{(s)} Categorical(𝐚rk)\displaystyle\sim\text{Categorical}(\mathbf{a}_{r_{k}}) for s{1,,Sk}s\in\{1,\ldots,S_{k}\}; for k{1,,K}k\in\{1,\ldots,K\} (33)
tk(s)\displaystyle t_{k}^{(s)} Normal(τk+bnsrk,cnsrk)\displaystyle\sim\text{Normal}(\tau_{k}+b_{n_{s}r_{k}},c_{n_{s}r_{k}}) for s{1,,Sk}s\in\{1,\ldots,S_{k}\}; for k{1,,K}k\in\{1,\ldots,K\} (34)

The hyperparameters of the model are ξ={γ,γ,α,β,φ,ν,σ2,κ,ψ,α,β}\xi=\big{\{}\gamma,\gamma_{\varnothing},\alpha_{\varnothing},\beta_{\varnothing},\varphi,\nu,\sigma^{2},\kappa,\psi,\alpha,\beta\big{\}}, and the global parameters consist of Θ={θ,{θr}r=1R}\Theta=\{\theta_{\varnothing},\{\theta_{r}\}_{r=1}^{R}\}, where θ={λ,{πn}n=1N}\theta_{\varnothing}=\{\lambda^{\varnothing},\{\pi_{n}^{\varnothing}\}_{n=1}^{N}\} denotes parameters related to the background spiking process and θr={πr,𝐚r,{bnr,cnr}n=1N}\theta_{r}=\{\pi_{r},\mathbf{a}_{r},\{b_{nr},c_{nr}\}_{n=1}^{N}\} denotes parameters associated with each sequence type for r{1,,R}r\in\{1,\ldots,R\}. The resulting dataset is the union of background and induced spikes, k=0Ks=1Sk{(nk(s),tk(s))}\bigcup_{k=0}^{K}\bigcup_{s=1}^{S_{k}}\big{\{}\big{(}n_{k}^{(s)},\,t_{k}^{(s)}\big{)}\big{\}}.
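To make this generative procedure concrete, here is a minimal Python sketch that follows Equations 19-34 step by step. It is only an illustration (the paper's implementation is written in Julia); all variable names and the specific hyperparameter values are our own choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- hyperparameters (illustrative values, not taken from the paper) ---
N, R, T = 50, 2, 100.0
alpha0, beta0 = 10.0, 1.0          # background amplitude prior Gamma(alpha0, beta0)
gamma0, gamma, varphi = 1.0, 1.0, 1.0
nu, sigma2, kappa = 4.0, 0.1, 1.0
psi, alpha, beta = 0.05, 30.0, 1.0

# --- global parameters (Eqs. 19-24) ---
lam0 = rng.gamma(alpha0, 1.0 / beta0)             # total background rate
pi0 = rng.dirichlet(gamma0 * np.ones(N))          # background neuron distribution
pi = rng.dirichlet(gamma * np.ones(R))            # sequence type probabilities
a = rng.dirichlet(varphi * np.ones(N), size=R).T  # a[n, r], each column sums to one
c = nu * sigma2 / rng.chisquare(nu, size=(N, R))  # widths ~ scaled Inv-chi^2(nu, sigma2)
b = rng.normal(0.0, np.sqrt(c / kappa))           # delays ~ Normal(0, c / kappa)

# --- latent events (Eqs. 25-28) ---
K = rng.poisson(psi * T)
r = rng.choice(R, size=K, p=pi)                   # sequence types
tau = rng.uniform(0.0, T, size=K)                 # sequence times
A = rng.gamma(alpha, 1.0 / beta, size=K)          # sequence amplitudes

# --- background spikes (Eqs. 29-31) ---
S0 = rng.poisson(lam0 * T)
bg_neurons = rng.choice(N, size=S0, p=pi0)
bg_times = rng.uniform(0.0, T, size=S0)

# --- induced spikes (Eqs. 32-34) ---
seq_neurons, seq_times = [], []
for k in range(K):
    Sk = rng.poisson(A[k])
    n_k = rng.choice(N, size=Sk, p=a[:, r[k]])
    t_k = rng.normal(tau[k] + b[n_k, r[k]], np.sqrt(c[n_k, r[k]]))
    seq_neurons.append(n_k)
    seq_times.append(t_k)

neurons = np.concatenate([bg_neurons, *seq_neurons])
times = np.concatenate([bg_times, *seq_times])
```

The arrays `neurons` and `times` then play the role of the dataset described above.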

B.2 Expected Sequence Sizes

As described in the main text, each sequence type induces a different pattern of activation across the neural population. In particular, each neuron’s response is modeled by a scaled normal distribution described by three variables—the amplitude (anra_{nr}), the mean (bnrb_{nr}), and the variance (cnrc_{nr}).

We can interpret bnrb_{nr} as the temporal offset or delay for neuron nn, while cnrc_{nr} sets the width of the response. They are drawn from a normal-inverse-chi-squared distribution, which is a standard choice for the conjugate prior when inferring the mean and variance of a univariate normal distribution. Refer to [60] (section 3.3) and [61] (section 4.6.3.7) for appropriate background material.

We can interpret anra_{nr} as the expected number of spikes emitted by neuron nn for a sequence of type rr with unit amplitude (i.e., when Ak=1A_{k}=1). By sampling 𝐚r+N\mathbf{a}_{r}\in\mathbb{R}_{+}^{N} from a symmetric Dirichlet prior, we introduce the constraint that nanr=1\sum_{n}a_{nr}=1, which avoids a degeneracy in the amplitude parameters over neurons (anra_{nr}) and sequences (AkA_{k}). As a consequence, we can interpret AkA_{k} as the expected number of spikes evoked by latent event kk over the full neural population:

\mathbb{E}[S_{k}]=\sum_{n=1}^{N}\int_{0}^{T}A_{k}\cdot a_{nr_{k}}\cdot\mathcal{N}(t\mid\tau_{k}+b_{nr_{k}},c_{nr_{k}})\,\mathrm{d}t=A_{k}\cdot\underbrace{\sum_{n=1}^{N}a_{nr_{k}}}_{=1}\cdot\underbrace{\int_{0}^{T}\mathcal{N}(t\mid\tau_{k}+b_{nr_{k}},c_{nr_{k}})\,\mathrm{d}t}_{\approx 1}\approx A_{k}. (35)

This approximation ignores boundary effects, but it is justified when 0τkT0\ll\tau_{k}\ll T and bnr,cnrTb_{nr},\sqrt{c_{nr}}\ll T; i.e. when the sequences are short relative to the interval [0,T][0,T]. This condition will generally be satisfied.

B.3 Log Likelihood

Given a set of latent events {zk}={zk}k=1K\{z_{k}\}=\{z_{k}\}_{k=1}^{K} and sequence type parameters {θr}={θr}r=1R\{\theta_{r}\}=\{\theta_{r}\}_{r=1}^{R}, we can evaluate the likelihood of the dataset {xs}={xs}s=1S\{x_{s}\}=\{x_{s}\}_{s=1}^{S} as:

\log p(\{x_{s}\}\mid\{z_{k}\},\{\theta_{r}\})=\underbrace{\sum_{s=1}^{S}\log\left[\lambda^{\varnothing}\pi^{\varnothing}_{n_{s}}+\sum_{k=1}^{K}A_{k}a_{n_{s}r_{k}}\mathcal{N}(t_{s}\mid\tau_{k}+b_{n_{s}r_{k}},c_{n_{s}r_{k}})\right]}_{(*)}\underbrace{-\lambda^{\varnothing}T-\sum_{k=1}^{K}A_{k}}_{(**)} (36)

where the terms (*) and (**) correspond, respectively, to the logarithms of \prod_{i}\lambda(x_{i}) and \exp[-\int\lambda(x)\,\mathrm{d}x] appearing in Equation 8, with the integral taken over the full domain [0,T]\times\{1,\ldots,N\} and the truncated Gaussian integrals approximated by one, as in Equation 35.
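The following sketch shows one way to evaluate Equation 36 with numpy and scipy; the function name and argument layout are our own and are not part of the paper's code.

```python
import numpy as np
from scipy.stats import norm

def pp_seq_log_likelihood(times, neurons, lam0, pi0, tau, r, A, a, b, c, T):
    """Evaluate Eq. (36), log p({x_s} | {z_k}, {theta_r}).

    times, neurons : spike times t_s and neuron indices n_s, shape (S,)
    lam0, pi0      : background rate lambda and neuron distribution, shape (N,)
    tau, r, A      : latent event times, types, and amplitudes, shape (K,)
    a, b, c        : (N, R) arrays of neuron weights, delays, and widths
    """
    # impulse response of every latent event evaluated at every spike, shape (S, K)
    loc = tau[None, :] + b[neurons[:, None], r[None, :]]
    scale = np.sqrt(c[neurons[:, None], r[None, :]])
    g = A[None, :] * a[neurons[:, None], r[None, :]] * norm.pdf(times[:, None], loc, scale)
    rates = lam0 * pi0[neurons] + g.sum(axis=1)        # lambda_{n_s}(t_s), term (*)
    return np.log(rates).sum() - lam0 * T - A.sum()    # (*) plus (**)
```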

Appendix C Complete Gibbs Sampling Algorithm

In this section we describe the collapsed Gibbs sampling routine for PP-Seq in detail. It adapts “Algorithm 3” of [62] and the extension by [63] to draw approximate samples from the model posterior. To do this, we introduce auxiliary parent variables u_{s}\in\{0\}\cup\mathbb{Z}_{+} for every spike. These variables denote assignments of each spike either to one of the latent sequence events, u_{s}=k for some k\in\{1,\ldots,K\}, or to the background process, u_{s}=0. Let X_{k}=\{x_{s}:u_{s}=k\} denote the set of spikes assigned to sequence k\in\{1,\ldots,K\} under parent assignments \{u_{s}\}_{s=1}^{S}. Similar assignment variables are used in standard Gibbs sampling routines for finite mixture models (see, e.g., section 24.2.4 in [61]) and for Dirichlet Process Mixture (DPM) models (see [62]; section 25.2.4 in [61]). In those contexts, each u_{s} may be thought of as a cluster assignment or indicator variable. In the present context of the PP-Seq model, the attribution of spikes to latent events follows from the superposition principle of Poisson processes (see appendix A), which allows us to decompose each neuron’s firing rate function into a sum of simple functions (in this case, scaled univariate Gaussians) and thus derive efficient Gibbs sampler updates.

Input: Spikes {x1,,xS}\{x_{1},\ldots,x_{S}\} and hyperparameters ξ\xi.
Initialize by assigning all spikes to the background: us=0u_{s}=0 for s{1S}s\in\{1\ldots S\}.
repeat MM times to draw MM samples
      
      1. Resample parent assignments, integrating over latent events:
      
            for s=1,\ldots,S: remove x_{s} from its current assignment and place it…
      
               a. in the background, us=0u_{s}=0, with probability (1+β)λns\propto(1+\beta)\,\lambda^{\varnothing}_{n_{s}}
      
               b. in cluster us=ku_{s}=k, with probability (α+|Xk|)[rk=1Rp(rkXk)ansrkp(tsXk,rk,ns)]\propto(\alpha+|X_{k}|)\,\left[\sum_{r_{k}=1}^{R}p(r_{k}\mid X_{k})\,a_{n_{s}r_{k}}\,p(t_{s}\mid X_{k},r_{k},n_{s})\right]
      
               c. in a new cluster, us=K+1u_{s}=K+1, with probability α(β1+β)αψr=1Rπransr\propto\alpha\left(\frac{\beta}{1+\beta}\right)^{\alpha}\psi\sum_{r=1}^{R}\pi_{r}\,a_{n_{s}r}
      
            end
      
      2. Resample latent events
            for k=1,,Kk=1,\,\ldots\,,K sample rk,τk,Akp(rk,τk,AkXk,Θ)r_{k},\tau_{k},A_{k}\sim p(r_{k},\tau_{k},A_{k}\mid X_{k},\Theta) with the following steps:
      
               a. Sample rkp(rkXk,Θ,ξ)r_{k}\sim p(r_{k}\mid X_{k},\Theta,\xi) where
p(rk=rXk,Θ,ξ)\displaystyle p(r_{k}=r\mid X_{k},\Theta,\xi) πrk(xsXkansrk)Z(xsXkJsk,xsXkhsk)xsXkZ(Jsk,hsk)\displaystyle\propto\pi_{r_{k}}\left(\prod_{x_{s}\in X_{k}}a_{n_{s}r_{k}}\right)\frac{Z({\textstyle\sum_{x_{s}\in X_{k}}}J_{sk},{\textstyle\sum_{x_{s}\in X_{k}}}h_{sk})}{\prod_{x_{s}\in X_{k}}Z(J_{sk},h_{sk})}
          and where J_{sk}=1/c_{n_{s}r_{k}}, h_{sk}=(t_{s}-b_{n_{s}r_{k}})/c_{n_{s}r_{k}}, and Z(J,h)=(2\pi)^{1/2}J^{-1/2}\exp\left\{\tfrac{1}{2}h^{2}J^{-1}\right\}
      
               b. Sample τkp(τkrk,Xk,Θ)\tau_{k}\sim p(\tau_{k}\mid r_{k},X_{k},\Theta) where,
p(τkrk,Xk,Θ)\displaystyle p(\tau_{k}\mid r_{k},X_{k},\Theta) =𝒩(τkμk,σk2)\displaystyle=\mathcal{N}(\tau_{k}\mid\mu_{k},\sigma_{k}^{2})
          and where σk2=(xsXkJsk)1\sigma_{k}^{2}=(\sum_{x_{s}\in X_{k}}J_{sk})^{-1} and μk=σk2(xsXkhsk)\mu_{k}=\sigma_{k}^{2}(\sum_{x_{s}\in X_{k}}h_{sk})
      
               c. Sample AkGa(α+|Xk|,β+1)A_{k}\sim\mathrm{Ga}(\alpha+|X_{k}|,\beta+1).
      
            end
      
      3. Resample global parameters
      
            a. Sample λnGa(α+xsX0𝕀[ns=n],β+T)\lambda_{n}^{\varnothing}\sim\mathrm{Ga}(\alpha_{\varnothing}+\sum_{x_{s}\in X_{0}}\mathbb{I}[n_{s}=n],\,\beta_{\varnothing}+T) for n=1,,Nn=1,\ldots,N.
      
            b. Sample πDir([γ+k𝕀[rk=1],,γ+k𝕀[rk=R]])\pi\sim\mathrm{Dir}\left(\left[\gamma+\sum_{k}\mathbb{I}[r_{k}=1],\ldots,\gamma+\sum_{k}\mathbb{I}[r_{k}=R]\right]\right)
      
            c. Sample 𝐚rDir([φ+kxsXk𝕀[rk=rns=1],,φ+kxsXk𝕀[rk=rns=N]])\mathbf{a}_{r}\sim\mathrm{Dir}\left(\left[\varphi+\sum_{k}\sum_{x_{s}\in X_{k}}\mathbb{I}[r_{k}=r\wedge n_{s}=1],\ldots,\varphi+\sum_{k}\sum_{x_{s}\in X_{k}}\mathbb{I}[r_{k}=r\wedge n_{s}=N]\right]\right)
             d. Sample b_{nr},c_{nr}\sim\mathrm{Inv-}\chi^{2}(c_{nr};\nu_{nr},\sigma_{nr}^{2})\,\mathcal{N}(b_{nr}\mid\mu_{nr},c_{nr}/\kappa_{nr}) for n=1,\ldots,N and r=1,\ldots,R, where
Snr\displaystyle S_{nr} =k=1KxsXk𝕀[rk=rns=n]\displaystyle=\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}{\mathbb{I}[r_{k}=r\wedge n_{s}=n]}
νnr\displaystyle\nu_{nr} =ν+Snr\displaystyle=\nu+S_{nr} σnr2\displaystyle\sigma_{nr}^{2} =νσ2+k=1KxsXk(tsτk)2𝕀[rk=rns=n]ν+Snr,\displaystyle=\frac{\nu\sigma^{2}+\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}(t_{s}-\tau_{k})^{2}\cdot\mathbb{I}[r_{k}=r\wedge n_{s}=n]}{\nu+S_{nr}},
κnr\displaystyle\kappa_{nr} =κ+Snr\displaystyle=\kappa+S_{nr} μnr\displaystyle\mu_{nr} =k=1KxsXk(tsτk)𝕀[rk=rns=n]κ+Snr.\displaystyle=\frac{\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}(t_{s}-\tau_{k})\cdot\mathbb{I}[r_{k}=r\wedge n_{s}=n]}{\kappa+S_{nr}}.
      
end
Algorithm 1 Collapsed Gibbs sampling for the PP-Seq model

The Gibbs sampling routine is summarized in Algorithm 1. While the approach closely mirrors existing work by [62, 63], there are a couple of notable differences. First, the background spikes in the PP-Seq model do not have an obvious counterpart in these prior algorithms, and this feature of the model renders the partition of datapoints only partially exchangeable (intuitively, the set of datapoints with u_{s}=0 is treated differently from all other groups). Second, the factors (\alpha+|X_{k}|) and \alpha\left(\frac{\beta}{1+\beta}\right)^{\alpha}\psi in the collapsed Gibbs updates are specific to the form of the PP-Seq model as a Neyman-Scott process; in the analogous Gibbs sampling routine for DPMs, for example, (\alpha+|X_{k}|) is replaced with |X_{k}|, and \alpha\,\psi\,T\left(\frac{\beta}{1+\beta}\right)^{\alpha} is replaced with the concentration parameter of the Dirichlet process.
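As an illustration of step 1 of Algorithm 1, the sketch below computes the three unnormalized assignment weights for a single spike. It uses the fact, derived in Appendix D, that given the spikes currently in a sequence the latent time τ_k is Gaussian with the information-form parameters of step 2b, so the posterior predictive density of a new spike time is Gaussian with variance σ_k² + c_{n_s r}. The data layout (a list of per-sequence spike arrays) and all names are assumptions of this sketch, not the paper's Julia implementation.

```python
import numpy as np
from scipy.stats import norm

def log_Z(J, h):
    """log of Z(J, h), the normalizer of a Gaussian in information form (Eq. 57)."""
    return 0.5 * (np.log(2 * np.pi) - np.log(J) + h ** 2 / J)

def parent_assignment_weights(t_s, n_s, sequences, lam0_n, pi, a, b, c,
                              alpha, beta, psi):
    """Unnormalized probabilities for resampling u_s (step 1 of Algorithm 1).

    `sequences` has one entry per existing sequence k: a dict with arrays
    "times" and "neurons" holding the spikes currently assigned to it, with
    x_s already removed.  lam0_n holds per-neuron background rates; a, b, c
    are (N, R) arrays of weights, delays, and widths.
    """
    R = len(pi)
    weights = [(1.0 + beta) * lam0_n[n_s]]                 # (a) background

    for X_k in sequences:                                  # (b) existing sequences
        nk, tk = X_k["neurons"], X_k["times"]
        log_pr = np.log(pi).copy()
        mu, var = np.zeros(R), np.zeros(R)
        for rr in range(R):
            J = 1.0 / c[nk, rr]                            # J_sk
            h = (tk - b[nk, rr]) * J                       # h_sk
            var[rr] = 1.0 / J.sum()
            mu[rr] = var[rr] * h.sum()
            log_pr[rr] += (np.log(a[nk, rr]).sum()
                           + log_Z(J.sum(), h.sum()) - log_Z(J, h).sum())
        pr = np.exp(log_pr - log_pr.max())
        pr /= pr.sum()                                     # p(r_k | X_k)
        # posterior predictive of t_s: tau_k ~ N(mu, var), t_s = tau_k + b + noise
        pred = norm.pdf(t_s, mu + b[n_s, :], np.sqrt(var + c[n_s, :]))
        weights.append((alpha + len(tk)) * np.sum(pr * a[n_s, :] * pred))

    # (c) new sequence
    weights.append(alpha * (beta / (1.0 + beta)) ** alpha * psi
                   * np.sum(pi * a[n_s, :]))
    return np.array(weights)
```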

Appendix D Derivations

D.1 Prior Distribution Over Partitions

First we derive the prior distribution on the partition of spikes under a Neyman-Scott process. Technically, we derive the prior distribution on partitions of the spike indices \{1,\ldots,S\}, not yet including the spike times and neurons. A partition of indices is a set of disjoint, non-empty sets whose union is \{1,\ldots,S\}. We represent the partition as \mathcal{I}=\{\mathcal{I}_{k}\}_{k=0}^{K^{*}}, where \mathcal{I}_{k}\subset\{1,\ldots,S\} contains the indices of spikes assigned to sequence k, with k=0 denoting the background spikes. Here, K^{*} denotes the number of non-empty sequences in the partition, |\mathcal{I}_{k}| denotes the number of spikes assigned to the k-th sequence, and |\mathcal{I}_{0}| denotes the number of spikes assigned to the background.

Proposition 1.

The prior probability of the partition, integrating over the latent event parameters, is,

p(Θ,ξ)=V(K;ξ)Po(|0|;λT)|0|!S!(1+β)S|0|k=1KΓ(|k|+α)Γ(α),\displaystyle p(\mathcal{I}\mid\Theta,\xi)=V(K^{*};\xi)\,\frac{\mathrm{Po}(|\mathcal{I}_{0}|;\lambda^{\varnothing}T)\,|\mathcal{I}_{0}|!}{S!\,(1+\beta)^{S-|\mathcal{I}_{0}|}}\prod_{k=1}^{K^{*}}\frac{\Gamma(|\mathcal{I}_{k}|+\alpha)}{\Gamma(\alpha)}, (37)

where α\alpha and β\beta are the hyperparameters of the gamma intensity on amplitudes, λT{\lambda^{\varnothing}T} is the expected number of background events, and

V(K;ξ)=K=KPo(K;ψT)K!(KK)!(β1+β)αK.\displaystyle V(K^{*};\xi)=\sum_{K=K^{*}}^{\infty}\mathrm{Po}(K;\psi T)\,\frac{K!}{(K-K^{*})!}\,\left(\frac{\beta}{1+\beta}\right)^{\alpha K}. (38)
Proof.

Our derivation roughly follows that of [63] for mixture of finite mixture models, but we adapt it to Neyman-Scott processes. The major difference is that the distribution above is over partitions of random size. First, return to the generative process described in Section B.1 and integrate over the latent event amplitudes to obtain the marginal distribution on sequence sizes given the total number of sequences KK. We have,

p(S_{0},\ldots,S_{K}\mid K,\Theta,\xi) =\mathrm{Po}(S_{0};\lambda^{\varnothing}T)\prod_{k=1}^{K}\int\mathrm{Po}(S_{k};A_{k})\,\mathrm{Ga}(A_{k};\alpha,\beta)\,\mathrm{d}A_{k} (39)
=\mathrm{Po}(S_{0};\lambda^{\varnothing}T)\prod_{k=1}^{K}\mathrm{NB}(S_{k};\alpha,(1+\beta)^{-1}) (40)
=\frac{1}{S_{0}!}(\lambda^{\varnothing}T)^{S_{0}}e^{-\lambda^{\varnothing}T}\prod_{k=1}^{K}\frac{\Gamma(S_{k}+\alpha)}{S_{k}!\,\Gamma(\alpha)}\left(\frac{\beta}{1+\beta}\right)^{\alpha}\left(\frac{1}{1+\beta}\right)^{S_{k}} (41)
=\frac{1}{S_{0}!}(\lambda^{\varnothing}T)^{S_{0}}e^{-\lambda^{\varnothing}T}\left(\frac{\beta}{1+\beta}\right)^{K\alpha}\left(\frac{1}{1+\beta}\right)^{S-S_{0}}\prod_{k=1}^{K}\frac{\Gamma(S_{k}+\alpha)}{S_{k}!\,\Gamma(\alpha)}. (42)

Let {us}s=1S\{u_{s}\}_{s=1}^{S} denote a set of sequence assignments. There are (SS0,,SK)\binom{S}{S_{0},\ldots,S_{K}} parent assignments consistent with the sequence sizes S0,,SKS_{0},\ldots,S_{K}, and they are all equally likely under the prior, so the conditional probability of the parent assignments is,

p(\{u_{s}\}_{s=1}^{S}\mid K,\Theta,\xi) =\frac{1}{S_{0}!}(\lambda^{\varnothing}T)^{S_{0}}e^{-\lambda^{\varnothing}T}\prod_{k=1}^{K}\frac{\Gamma(S_{k}+\alpha)}{S_{k}!\,\Gamma(\alpha)}\left(\frac{\beta}{1+\beta}\right)^{\alpha}\left(\frac{1}{1+\beta}\right)^{S_{k}}\,\binom{S}{S_{0},\ldots,S_{K}}^{-1} (43)
=\frac{1}{S!}(\lambda^{\varnothing}T)^{S_{0}}e^{-\lambda^{\varnothing}T}\left(\frac{\beta}{1+\beta}\right)^{K\alpha}\left(\frac{1}{1+\beta}\right)^{S-S_{0}}\prod_{k=1}^{K}\frac{\Gamma(S_{k}+\alpha)}{\Gamma(\alpha)}. (44)

The parent assignments above induce a partition, but technically they assume a particular ordering of the latent events. Moreover, if some of the latent events fail to produce any spikes, they will not be included in the partition. In performing a change of variables from parent assignments to partitions, we need to sum over latent event assignments that produce the same partition. There are \binom{K}{K^{*}}K^{*}!=\frac{K!}{(K-K^{*})!} such assignments when there are K latent events but only K^{*} non-empty sequences in the partition. Thus,

p(K,Θ,ξ)\displaystyle p(\mathcal{I}\mid K,\Theta,\xi) =K!(KK)!1S!(λT)|0|eλT(β1+β)Kα(11+β)S|0|k=1KΓ(|k|+α)Γ(α).\displaystyle=\frac{K!}{(K-K^{*})!}\,\frac{1}{S!}(\lambda^{\varnothing}T)^{|\mathcal{I}_{0}|}e^{-\lambda^{\varnothing}T}\left(\frac{\beta}{1+\beta}\right)^{K\alpha}\left(\frac{1}{1+\beta}\right)^{S-|\mathcal{I}_{0}|}\prod_{k=1}^{K^{*}}\frac{\Gamma(|\mathcal{I}_{k}|+\alpha)}{\Gamma(\alpha)}. (45)

Clearly, KK must be at least KK^{*} in order to produce the partition.

Finally, we sum over the number of latent events KK to obtain the marginal probability of the partition,

p(Θ,ξ)\displaystyle p(\mathcal{I}\mid\Theta,\xi) =K=KPo(K;ψT)p(K,Θ,ξ)\displaystyle=\sum_{K=K^{*}}^{\infty}\mathrm{Po}(K;\psi T)\,p(\mathcal{I}\mid K,\Theta,\xi) (46)
=V(K;ξ)(λT)|0|eλTS!(1+β)S|0|k=1KΓ(|k|+α)Γ(α),\displaystyle=V(K^{*};\xi)\,\frac{(\lambda^{\varnothing}T)^{|\mathcal{I}_{0}|}e^{-\lambda^{\varnothing}T}}{S!\,(1+\beta)^{S-|\mathcal{I}_{0}|}}\prod_{k=1}^{K^{*}}\frac{\Gamma(|\mathcal{I}_{k}|+\alpha)}{\Gamma(\alpha)}, (47)

where

V(K;ξ)\displaystyle V(K^{*};\xi) =K=KPo(K;ψT)K!(KK)!(β1+β)Kα.\displaystyle=\sum_{K=K^{*}}^{\infty}\mathrm{Po}(K;\psi T)\,\frac{K!}{(K-K^{*})!}\left(\frac{\beta}{1+\beta}\right)^{K\alpha}. (48)
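For completeness, V(K^{*};\xi) is straightforward to evaluate numerically by truncating the sum in Equation 48; the sketch below (names ours) works in log space for stability.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_V(K_star, psi, T, alpha, beta, n_terms=1000):
    """log of V(K*; xi) in Eq. (48), evaluated by truncating the infinite sum.

    Substituting K = K* + j, the K-th summand equals
    exp(-psi*T) * (psi*T)**K * (beta/(1+beta))**(K*alpha) / j!,
    so the terms decay at least as fast as a Poisson tail and the truncation
    error is negligible for moderate n_terms.
    """
    j = np.arange(n_terms)
    K = K_star + j
    log_theta = np.log(beta) - np.log1p(beta)   # log(beta / (1 + beta))
    log_terms = (-psi * T + K * np.log(psi * T)
                 - gammaln(j + 1)               # log (K - K*)! = log j!
                 + K * alpha * log_theta)
    return logsumexp(log_terms)
```

One can also check numerically that the truncated sum matches the closed form (\psi T\theta^{\alpha})^{K^{*}}\exp(\psi T(\theta^{\alpha}-1)) with \theta=\beta/(1+\beta), which follows from the same substitution.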

D.2 Marginal Likelihood of Spikes in a Partition

Now we combine the prior on partitions of spike indices with a likelihood of the spikes under that partition. To match the notation in the main text and Appendix C, let X=\{X_{k}\}_{k=0}^{K^{*}} denote a partition of the spikes, where X_{k}=\{x_{s}:s\in\mathcal{I}_{k}\} denotes the set of spikes in sequence k. In this section we derive the marginal probability of the spikes in a sequence, integrating out the sequence time and type. For the background spikes, there are no parameters to integrate and the marginal probability is simply,

p(X0Θ)\displaystyle p(X_{0}\mid\Theta) =xsX0Unif(ts;[0,T])Cat(ns[λ1/λ,,λN/λ])\displaystyle=\prod_{x_{s}\in X_{0}}\mathrm{Unif}(t_{s};[0,T])\,\mathrm{Cat}(n_{s}\mid[\lambda_{1}^{\varnothing}/\lambda^{\varnothing},\ldots,\lambda_{N}^{\varnothing}/\lambda^{\varnothing}]) (49)
=T|X0|xsX0λnsλ.\displaystyle=T^{-|X_{0}|}\prod_{x_{s}\in X_{0}}\frac{\lambda_{n_{s}}^{\varnothing}}{\lambda^{\varnothing}}. (50)

For the spikes attributed to a latent event, however, we have to marginalize over the type and time of the event. We have,

p(XkΘ)\displaystyle p(X_{k}\mid\Theta) =rk=1Rp(rk)0Tp(τk)skp(ts,nsrk,τk,Θ)dτk\displaystyle=\sum_{r_{k}=1}^{R}p(r_{k})\int_{0}^{T}p(\tau_{k})\prod_{s\in\mathcal{I}_{k}}p(t_{s},n_{s}\mid r_{k},\tau_{k},\Theta)\,\mathrm{d}\tau_{k} (51)
=rk=1Rπrk0TUnif(τk;[0,T])xsXkCat(nsrk,Θ)𝒩(tsτk+bnsrk,cnsrk)dτk\displaystyle=\sum_{r_{k}=1}^{R}\pi_{r_{k}}\int_{0}^{T}\mathrm{Unif}(\tau_{k};[0,T])\prod_{x_{s}\in X_{k}}\mathrm{Cat}(n_{s}\mid r_{k},\Theta)\,\mathcal{N}(t_{s}\mid\tau_{k}+b_{n_{s}r_{k}},c_{n_{s}r_{k}})\,\mathrm{d}\tau_{k} (52)
=1Trk=1Rπrk0TxsXkansrk2πcnsrkexp{12cnsrk(tsτkbnsrk)2}dτk.\displaystyle=\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}\int_{0}^{T}\prod_{x_{s}\in X_{k}}\frac{a_{n_{s}r_{k}}}{\sqrt{2\pi c_{n_{s}r_{k}}}}\exp\left\{-\frac{1}{2c_{n_{s}r_{k}}}(t_{s}-\tau_{k}-b_{n_{s}r_{k}})^{2}\right\}\,\mathrm{d}\tau_{k}. (53)

Let Jsk=1/cnsrkJ_{sk}=1/c_{n_{s}r_{k}}, and hsk=(tsbnsrk)/cnsrkh_{sk}=(t_{s}-b_{n_{s}r_{k}})/c_{n_{s}r_{k}}. Then,

p(XkΘ)\displaystyle p(X_{k}\mid\Theta) =1Trk=1Rπrk0TxsXkansrkJsk2πexp{12Jskτk2+hskτk12hsk2Jsk1}dτk\displaystyle=\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}\int_{0}^{T}\prod_{x_{s}\in X_{k}}a_{n_{s}r_{k}}\sqrt{\frac{J_{sk}}{2\pi}}\exp\left\{-\tfrac{1}{2}J_{sk}\tau_{k}^{2}+h_{sk}\tau_{k}-\tfrac{1}{2}h_{sk}^{2}J_{sk}^{-1}\right\}\,\mathrm{d}\tau_{k} (54)
=1Trk=1RπrkxsXkansrkZ(Jsk,hsk)0Texp{12(xsXkJsk)τk2+(xsXkhsk)τk}dτk\displaystyle=\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}\prod_{x_{s}\in X_{k}}\frac{a_{n_{s}r_{k}}}{Z(J_{sk},h_{sk})}\int_{0}^{T}\exp\left\{-\tfrac{1}{2}({\textstyle\sum_{x_{s}\in X_{k}}}J_{sk})\tau_{k}^{2}+({\textstyle\sum_{x_{s}\in X_{k}}}h_{sk})\tau_{k}\right\}\,\mathrm{d}\tau_{k} (55)
1Trk=1Rπrk(xsXkansrk)Z(xsXkJsk,xsXkhsk)xsXkZ(Jsk,hsk),\displaystyle\approx\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}\left(\prod_{x_{s}\in X_{k}}a_{n_{s}r_{k}}\right)\frac{Z({\textstyle\sum_{x_{s}\in X_{k}}}J_{sk},{\textstyle\sum_{x_{s}\in X_{k}}}h_{sk})}{\prod_{x_{s}\in X_{k}}Z(J_{sk},h_{sk})}, (56)

where

Z(J,h)=(2π)1/2J1/2exp{12h2J1}\displaystyle Z(J,h)=(2\pi)^{1/2}J^{-1/2}\exp\left\{\tfrac{1}{2}h^{2}J^{-1}\right\} (57)

is the normalizing constant of a Gaussian density in information form with precision JJ and linear coefficient hh. For a sequence with a single spike, this reduces to,

p(xsΘ)\displaystyle p(x_{s}\mid\Theta) 1Trk=1Rπrkansrk.\displaystyle\approx\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}a_{n_{s}r_{k}}. (58)

The approximations above are due to the truncated integral over [0,T][0,T] rather than the whole real line. In practice, this truncation is negligible for sequences far from the boundaries of the interval.

D.3 Collapsed Gibbs Sampling the Partitions

The conditional distribution on partitions given data and model parameters is given by,

p({xs}s=1S,Θ,ξ)\displaystyle p(\mathcal{I}\mid\{x_{s}\}_{s=1}^{S},\Theta,\xi) p(Θ,ξ)p({xs}s=1S,Θ)\displaystyle\propto p(\mathcal{I}\mid\Theta,\xi)\,p(\{x_{s}\}_{s=1}^{S}\mid\mathcal{I},\Theta) (59)
=p(Θ,ξ)p(X0Θ)k=1Kp(XkΘ).\displaystyle=p(\mathcal{I}\mid\Theta,\xi)\,p(X_{0}\mid\Theta)\,\prod_{k=1}^{K^{*}}p(X_{k}\mid\Theta). (60)

Let XxsX\setminus x_{s} denote the partition with spike xsx_{s} removed. Though it is slightly inconsistent with the proof in Section D.1, let usu_{s} denote the assignment of spike xsx_{s}, as in the main text. Moreover, let SkS_{k} denote the size of the kk-th sequence in the partition, not including spike xsx_{s}. We now consider the relative probability of the full partition XX when xsx_{s} is added to each one of the possible parts: the background, an existing sequence, or a new sequence.

For the background assignment,

p(u_{s}=0\mid x_{s},X\setminus x_{s},\Theta,\xi) \propto\frac{p(u_{s}=0,x_{s},X\setminus x_{s}\mid\Theta,\xi)}{p(X\setminus x_{s}\mid\Theta,\xi)} (61)
=V(K;ξ)(λT)S0+1eλTS!(1+β)SS01p(X0xsΘ)k=1KΓ(Sk+α)Γ(α)p(XkΘ)V(K;ξ)(λT)S0eλTS!(1+β)SS0p(X0Θ)k=1KΓ(Sk+α)Γ(α)p(XkΘ)\displaystyle=\frac{V(K^{*};\xi)\,\frac{(\lambda^{\varnothing}T)^{S_{0}+1}e^{-\lambda^{\varnothing}T}}{S!\,(1+\beta)^{S-S_{0}-1}}p(X_{0}\cup x_{s}\mid\Theta)\prod_{k=1}^{K^{*}}\frac{\Gamma(S_{k}+\alpha)}{\Gamma(\alpha)}p(X_{k}\mid\Theta)}{V(K^{*};\xi)\,\frac{(\lambda^{\varnothing}T)^{S_{0}}e^{-\lambda^{\varnothing}T}}{S!\,(1+\beta)^{S-S_{0}}}p(X_{0}\mid\Theta)\prod_{k=1}^{K^{*}}\frac{\Gamma(S_{k}+\alpha)}{\Gamma(\alpha)}p(X_{k}\mid\Theta)} (62)
=λT(1+β)1Tλnsλ\displaystyle=\lambda^{\varnothing}T(1+\beta)\frac{1}{T}\frac{\lambda_{n_{s}}^{\varnothing}}{\lambda^{\varnothing}} (63)
=(1+β)λns.\displaystyle=(1+\beta)\lambda_{n_{s}}^{\varnothing}. (64)

By a similar process, we arrive at the conditional probabilities of adding a spike to an existing sequence,

p(us=kxs,Xxs,Θ,ξ)\displaystyle p(u_{s}=k\mid x_{s},X\setminus x_{s},\Theta,\xi) Γ(Sk+α+1)Γ(Sk+α)p(XkxsΘ)p(XkΘ)\displaystyle\propto\frac{\Gamma(S_{k}+\alpha+1)}{\Gamma(S_{k}+\alpha)}\frac{p(X_{k}\cup x_{s}\mid\Theta)}{p(X_{k}\mid\Theta)} (65)
=(Sk+α)p(xsXk,Θ).\displaystyle=(S_{k}+\alpha)\,p(x_{s}\mid X_{k},\Theta). (66)

The predictive likelihood can be obtained via the ratio of marginal likelihoods derived above, or by explicitly calculating the categorical posterior distribution on types rkr_{k} and then the categorical and Gaussian posterior predictive distributions of nsn_{s} and tst_{s}, respectively, given the type and the other spikes. The latter approach is what is presented in the main text.

Finally, the conditional probability of assigning a spike to a new sequence simplifies as,

p(us=K+1Xxs,Θ,ξ)\displaystyle p(u_{s}=K^{*}+1\mid X\setminus x_{s},\Theta,\xi) (Γ(α+1)Γ(α))(V(K+1;ξ)V(K;ξ))p(xsΘ)\displaystyle\propto\left(\frac{\Gamma(\alpha+1)}{\Gamma(\alpha)}\right)\left(\frac{V(K^{*}+1;\xi)}{V(K^{*};\xi)}\right)p(x_{s}\mid\Theta) (67)
=α(J=K+11J!eψT(ψT)JJ!(JK1)!(β1+β)JαK=K1K!eψT(ψT)KK!(KK)!(β1+β)Kα)(1Trk=1Rπrkansrk)\displaystyle=\alpha\left(\frac{\sum_{J=K^{*}+1}^{\infty}\frac{1}{J!}e^{-\psi T}(\psi T)^{J}\,\frac{J!}{(J-K^{*}-1)!}\left(\frac{\beta}{1+\beta}\right)^{J\alpha}}{\sum_{K=K^{*}}^{\infty}\frac{1}{K!}e^{-\psi T}(\psi T)^{K}\,\frac{K!}{(K-K^{*})!}\left(\frac{\beta}{1+\beta}\right)^{K\alpha}}\right)\left(\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}a_{n_{s}r_{k}}\right) (68)
=α(K=K(ψT)K+11(KK)!(β1+β)(K+1)αK=K(ψT)K1(KK)!(β1+β)Kα)(1Trk=1Rπrkansrk)\displaystyle=\alpha\left(\frac{\sum_{K=K^{*}}^{\infty}(\psi T)^{K+1}\,\frac{1}{(K-K^{*})!}\left(\frac{\beta}{1+\beta}\right)^{(K+1)\alpha}}{\sum_{K=K^{*}}^{\infty}(\psi T)^{K}\,\frac{1}{(K-K^{*})!}\left(\frac{\beta}{1+\beta}\right)^{K\alpha}}\right)\,\left(\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}a_{n_{s}r_{k}}\right) (69)
=α(β1+β)αψT(1Trk=1Rπrkansrk)\displaystyle=\alpha\left(\frac{\beta}{1+\beta}\right)^{\alpha}\psi T\left(\frac{1}{T}\sum_{r_{k}=1}^{R}\pi_{r_{k}}a_{n_{s}r_{k}}\right) (70)
=α(β1+β)αψrk=1Rπrkansrk.\displaystyle=\alpha\,\left(\frac{\beta}{1+\beta}\right)^{\alpha}\psi\sum_{r_{k}=1}^{R}\pi_{r_{k}}a_{n_{s}r_{k}}. (71)

D.4 Gibbs Sampling the Latent Event Parameters

Given the partition and global parameters, it is straightforward to sample the latent events. In fact, most of the calculations were derived in Section D.2. The conditional distribution of the event type is,

p(rkXk,Θ)\displaystyle p(r_{k}\mid X_{k},\Theta) p(rk)0Tp(τk)xsXkp(ts,nsrk,τk,Θ)dτk\displaystyle\propto p(r_{k})\int_{0}^{T}p(\tau_{k})\prod_{x_{s}\in X_{k}}p(t_{s},n_{s}\mid r_{k},\tau_{k},\Theta)\,\mathrm{d}\tau_{k} (72)
πrk(xsXkansrk)Z(xsXkJsk,xsXkhsk)xsXkZ(Jsk,hsk),\displaystyle\approx\pi_{r_{k}}\left(\prod_{x_{s}\in X_{k}}a_{n_{s}r_{k}}\right)\frac{Z({\textstyle\sum_{x_{s}\in X_{k}}}J_{sk},{\textstyle\sum_{x_{s}\in X_{k}}}h_{sk})}{\prod_{x_{s}\in X_{k}}Z(J_{sk},h_{sk})}, (73)

where, again, the approximation comes from truncating the integral to the range [0,T][0,T], and is negligible in our cases.

Given the type, the latent event time is conditionally Gaussian,

p(τkrk,Xk,Θ)\displaystyle p(\tau_{k}\mid r_{k},X_{k},\Theta) p(τk)xsXkp(tsrk,ns,τk,Θ)\displaystyle\propto p(\tau_{k})\prod_{x_{s}\in X_{k}}p(t_{s}\mid r_{k},n_{s},\tau_{k},\Theta) (74)
=Unif(τk;[0,T])xsXk𝒩(tsτk+bnsrk,cnsrk)\displaystyle=\mathrm{Unif}(\tau_{k};[0,T])\prod_{x_{s}\in X_{k}}\mathcal{N}(t_{s}\mid\tau_{k}+b_{n_{s}r_{k}},c_{n_{s}r_{k}}) (75)
𝒩(τkμk,σk2)\displaystyle\propto\mathcal{N}(\tau_{k}\mid\mu_{k},\sigma_{k}^{2}) (76)

where

σk2\displaystyle\sigma_{k}^{2} =(xsXkJsk)1\displaystyle=\left(\sum_{x_{s}\in X_{k}}J_{sk}\right)^{-1} (77)
\mu_{k} =\sigma_{k}^{2}\left(\sum_{x_{s}\in X_{k}}h_{sk}\right). (78)

Finally, the amplitude is independent of the type and time, and its conditional distribution is given by,

p(AkXk,ξ)\displaystyle p(A_{k}\mid X_{k},\xi) Ga(Akα,β)Po(SkAk)\displaystyle\propto\mathrm{Ga}(A_{k}\mid\alpha,\beta)\,\mathrm{Po}(S_{k}\mid A_{k}) (79)
\propto\mathrm{Ga}(A_{k}\mid\alpha+S_{k},\,\beta+1). (80)
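Collecting Equations 72-80, step 2 of Algorithm 1 can be sketched as follows. This is an illustrative Python sketch with names of our choosing (not the paper's Julia code), and it neglects the truncation of τ_k to [0, T].

```python
import numpy as np

def log_Z(J, h):
    """log Z(J, h) from Eq. (57)."""
    return 0.5 * (np.log(2 * np.pi) - np.log(J) + h ** 2 / J)

def resample_latent_event(times, neurons, pi, a, b, c, alpha, beta, rng):
    """Sample (r_k, tau_k, A_k) given the spikes X_k assigned to sequence k."""
    R = len(pi)
    log_w, mu, var = np.empty(R), np.empty(R), np.empty(R)
    for r in range(R):
        J = 1.0 / c[neurons, r]                    # J_sk
        h = (times - b[neurons, r]) * J            # h_sk
        var[r] = 1.0 / J.sum()                     # sigma_k^2
        mu[r] = var[r] * h.sum()                   # mu_k
        log_w[r] = (np.log(pi[r]) + np.log(a[neurons, r]).sum()
                    + log_Z(J.sum(), h.sum()) - log_Z(J, h).sum())
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    r_k = rng.choice(R, p=w)                                 # Eq. (73)
    tau_k = rng.normal(mu[r_k], np.sqrt(var[r_k]))           # Eqs. (76)-(78)
    A_k = rng.gamma(alpha + len(times), 1.0 / (beta + 1.0))  # Eq. (80)
    return r_k, tau_k, A_k
```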

D.5 Gibbs Sampling the Global Parameters

Finally, the Gibbs updates of the global parameters are simple due to their conjugate priors.

Background rates

The conditional distribution of the background rates is,

p(λnX0,ξ)\displaystyle p(\lambda_{n}^{\varnothing}\mid X_{0},\xi) Ga(λnα,β)Po(|{xs:xsX0,ns=n}|λnT)\displaystyle\propto\mathrm{Ga}(\lambda_{n}^{\varnothing}\mid\alpha_{\varnothing},\beta_{\varnothing})\,\mathrm{Po}(|\{x_{s}:x_{s}\in X_{0},n_{s}=n\}|\mid\lambda_{n}^{\varnothing}T) (81)
Ga(λnα+xsX0𝕀[ns=n],β+T).\displaystyle\propto\mathrm{Ga}(\lambda_{n}^{\varnothing}\mid\alpha_{\varnothing}+\sum_{x_{s}\in X_{0}}\mathbb{I}[n_{s}=n],\,\beta_{\varnothing}+T). (82)

Sequence type probabilities

The conditional distribution of the sequence type probability vector π\pi is,

p(π{(rk,τk,Ak)}k=1K,ξ)\displaystyle p(\pi\mid\{(r_{k},\tau_{k},A_{k})\}_{k=1}^{K^{*}},\xi) Dir(πγ𝟏R)k=1KCat(rkπ)\displaystyle\propto\mathrm{Dir}(\pi\mid\gamma\mathbf{1}_{R})\prod_{k=1}^{K^{*}}\mathrm{Cat}(r_{k}\mid\pi) (83)
\propto\mathrm{Dir}\left(\pi\mid\left[\gamma+\sum_{k}\mathbb{I}[r_{k}=1],\ldots,\gamma+\sum_{k}\mathbb{I}[r_{k}=R]\right]\right). (84)

Neuron weights for each sequence type

The conditional distribution of the neuron weight vector 𝐚r\mathbf{a}_{r} is,

p(𝐚rX,{(rk,τk,Ak)}k=1K,ξ)\displaystyle p(\mathbf{a}_{r}\mid X,\{(r_{k},\tau_{k},A_{k})\}_{k=1}^{K^{*}},\xi) Dir(𝐚rφ𝟏N)k=1KxsXkCat(ns𝐚rk),\displaystyle\propto\mathrm{Dir}(\mathbf{a}_{r}\mid\varphi\mathbf{1}_{N})\prod_{k=1}^{K^{*}}\prod_{x_{s}\in X_{k}}\mathrm{Cat}(n_{s}\mid\mathbf{a}_{r_{k}}), (85)
Dir(𝐚rϕr)\displaystyle\propto\mathrm{Dir}\left(\mathbf{a}_{r}\mid\boldsymbol{\phi}_{r}\right) (86)
ϕrn\displaystyle\phi_{rn} =φ+kxsXk𝕀[rk=rns=n].\displaystyle=\varphi+\sum_{k}\sum_{x_{s}\in X_{k}}\mathbb{I}[r_{k}=r\wedge n_{s}=n]. (87)

Neuron widths and delays

The conditional distribution of the neuron delays bnrb_{nr} and widths cnrc_{nr} is,

p(bnr,cnrX,{(rk,τk,Ak)}k=1K,ξ)\displaystyle p(b_{nr},c_{nr}\mid X,\{(r_{k},\tau_{k},A_{k})\}_{k=1}^{K^{*}},\xi) Invχ2(cnrν,σ2)𝒩(bnr0,cnr/κ)k=1KxsXk(𝒩(tsτk+bnr,cnr))𝕀[rk=rns=n]\displaystyle\propto\mathrm{Inv-}\chi^{2}(c_{nr}\mid\nu,\sigma^{2})\,\mathcal{N}(b_{nr}\mid 0,c_{nr}/\kappa)\prod_{k=1}^{K^{*}}\prod_{x_{s}\in X_{k}}\left(\mathcal{N}(t_{s}\mid\tau_{k}+b_{nr},c_{nr})\right)^{\mathbb{I}[r_{k}=r\wedge n_{s}=n]} (88)
Invχ2(cnrνnr,σnr2)𝒩(bnrμnr,cnr/κnr)\displaystyle\propto\mathrm{Inv-}\chi^{2}(c_{nr}\mid\nu_{nr},\sigma_{nr}^{2})\,\mathcal{N}(b_{nr}\mid\mu_{nr},c_{nr}/\kappa_{nr}) (89)

where

Snr\displaystyle S_{nr} =k=1KxsXk𝕀[rk=rns=n]\displaystyle=\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}{\mathbb{I}[r_{k}=r\wedge n_{s}=n]} (90)
νnr\displaystyle\nu_{nr} =ν+Snr\displaystyle=\nu+S_{nr} (91)
σnr2\displaystyle\sigma_{nr}^{2} =νσ2+k=1KxsXk(tsτk)2𝕀[rk=rns=n]ν+Snr\displaystyle=\frac{\nu\sigma^{2}+\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}(t_{s}-\tau_{k})^{2}\cdot\mathbb{I}[r_{k}=r\wedge n_{s}=n]}{\nu+S_{nr}} (92)
κnr\displaystyle\kappa_{nr} =κ+Snr\displaystyle=\kappa+S_{nr} (93)
μnr\displaystyle\mu_{nr} =k=1KxsXk(tsτk)𝕀[rk=rns=n]κ+Snr.\displaystyle=\frac{\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}(t_{s}-\tau_{k})\cdot\mathbb{I}[r_{k}=r\wedge n_{s}=n]}{\kappa+S_{nr}}. (94)
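A sketch of the corresponding conjugate update, following Equations 90-94: the argument `dts` is assumed to hold the centered spike times t_s − τ_k for all spikes with n_s = n assigned to a sequence of type r (the function name and layout are ours).

```python
import numpy as np

def resample_delay_and_width(dts, nu, sigma2, kappa, rng):
    """Sample (b_nr, c_nr) from the normal-inverse-chi^2 conditional, Eqs. (90)-(94)."""
    S_nr = len(dts)
    nu_nr = nu + S_nr
    kappa_nr = kappa + S_nr
    sigma2_nr = (nu * sigma2 + np.sum(dts ** 2)) / nu_nr
    mu_nr = np.sum(dts) / kappa_nr
    # scaled inverse chi^2 draw: c = nu_nr * sigma2_nr / chi^2_{nu_nr}
    c_nr = nu_nr * sigma2_nr / rng.chisquare(nu_nr)
    b_nr = rng.normal(mu_nr, np.sqrt(c_nr / kappa_nr))
    return b_nr, c_nr
```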

D.6 Split-Merge Sampling Moves

As discussed in the main text, when sequences containing a small number of spikes are unlikely under the prior, Gibbs sampling can be slow to mix over the posterior of spike partitions, since the probability of forming new sequences (i.e. singleton clusters) is very low. In addition to the annealed sampling approach outlined in the main text, we adapted the split-merge sampling method proposed by [64] for Dirichlet process mixture models to PP-Seq (see also [65] for the case of non-conjugate priors). For simplicity, we implemented randomized split-merge proposals, i.e. without the additional restricted Gibbs sweeps proposed in [64]. In practice, we found that these random proposals were sufficient for the sampler to accept both split and merge moves at a satisfactorily high rate.

Briefly, our method starts by randomly choosing two spikes (t_{i},n_{i}) and (t_{j},n_{j}) that are not assigned to the background partition and that satisfy |t_{i}-t_{j}|<W for some user-specified time window W. One could set W=T, in which case all pairs of non-background spikes are considered; however, this generally results in proposals that are extremely likely to be rejected, so it is advisable to set W to an upper bound on the expected sequence length to encourage faster mixing. After identifying the pair of spikes, we propose to merge their sequences if u_{i}\neq u_{j}. If instead they belong to the same sequence (u_{i}=u_{j}=k), we propose to split it into two new sequences, each containing one of the two chosen spikes; the remaining |X_{k}|-2 spikes are assigned to one of the two new sequences independently with equal probability.

Given the proposed split or merge move, \mathcal{I}\rightarrow\mathcal{I}^{*}, the Metropolis-Hastings acceptance probability is:

\min\left[1,\;\frac{q(\mathcal{I}\mid\mathcal{I}^{*})}{q(\mathcal{I}^{*}\mid\mathcal{I})}\,\frac{p(\mathcal{I}^{*}\mid\{x_{s}\}_{s=1}^{S},\Theta,\xi)}{p(\mathcal{I}\mid\{x_{s}\}_{s=1}^{S},\Theta,\xi)}\right] (95)

where q()q(\mathcal{I}^{*}\mid\mathcal{I}) is the proposal density, and p({xs}s=1S,Θ,ξ)p(\mathcal{I}\mid\{x_{s}\}_{s=1}^{S},\Theta,\xi) is given in eqs. 59 and 60. The sampling method then directly follows [64]. The ratio of proposal probabilities for split moves is:

q(split)q(split)=(12)(|Xk|2)\frac{q(\mathcal{I}\mid\mathcal{I}^{\text{split}})}{q(\mathcal{I}^{\text{split}}\mid\mathcal{I})}=\left(\frac{1}{2}\right)^{-(|X_{k}|-2)} (96)

The corresponding ratio for merge moves is:

q(merge)q(merge)=(12)(|Xk|2)\frac{q(\mathcal{I}\mid\mathcal{I}^{\text{merge}})}{q(\mathcal{I}^{\text{merge}}\mid\mathcal{I})}=\left(\frac{1}{2}\right)^{(|X_{k}|-2)} (97)
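A minimal sketch of the resulting accept/reject computation is below. The caller is assumed to supply the two log partition posteriors, e.g. evaluated with Equations 59-60; `move`, `X_k_size`, and the function name are our own notation.

```python
import numpy as np

def log_accept_split_merge(move, X_k_size, log_post_new, log_post_old):
    """Log Metropolis-Hastings acceptance probability for the randomized
    proposals (Eqs. 95-97).  X_k_size is |X_k| for the sequence being split
    (for a split move) or for the proposed merged sequence (for a merge)."""
    if move == "split":
        log_q_ratio = (X_k_size - 2) * np.log(2.0)    # Eq. (96)
    else:
        log_q_ratio = -(X_k_size - 2) * np.log(2.0)   # Eq. (97)
    return min(0.0, log_q_ratio + log_post_new - log_post_old)
```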

Appendix E Time-Warped Sequences

To account for sequences that unfold at different speeds, we incorporate a time-warping component into the generative model. Each sequence is endowed with a warp value, which is drawn from a discrete distribution,

ωk\displaystyle\omega_{k} f=1Fηfδwf(ωk),\displaystyle\sim\sum_{f=1}^{F}\eta_{f}\delta_{w_{f}}(\omega_{k}), (98)

where ηf0\eta_{f}\geq 0 and f=1Fηf=1\sum_{f=1}^{F}\eta_{f}=1 are the probabilities of the corresponding warp values wf>0w_{f}>0. In this paper, we set

wf\displaystyle w_{f} =wF1+2(f1)/(F1)\displaystyle=w_{F}^{-1+2(f-1)/(F-1)} (99)
ηf\displaystyle\eta_{f} 𝒩(fF12,σw2).\displaystyle\propto\mathcal{N}\left(f\mid\tfrac{F-1}{2},\sigma_{w}^{2}\right). (100)

The hyperparameters include the number of warp values F, the maximum warp value w_{F}, and the variance of the warp probabilities \sigma_{w}^{2}. We use an odd number of warp values so that the middle warp value equals one, i.e. w_{(F+1)/2}=1.
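A small sketch that builds the warp grid and probabilities as written in Equations 99-100 (the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def warp_grid(F, w_max, sigma2_w):
    """Warp values and probabilities from Eqs. (99)-(100); assumes F >= 2."""
    f = np.arange(1, F + 1)
    w = w_max ** (-1.0 + 2.0 * (f - 1) / (F - 1))   # log-spaced between 1/w_max and w_max
    eta = norm.pdf(f, loc=(F - 1) / 2, scale=np.sqrt(sigma2_w))
    return w, eta / eta.sum()

# e.g. warp_grid(5, 1.5, 1.0)[0] is approximately [0.667, 0.816, 1.0, 1.225, 1.5]
```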

The warp value changes the distribution of spike times in the corresponding sequence by scaling the mean and standard deviation,

ts(k)\displaystyle t_{s}^{(k)} 𝒩(τk+ωkbns(k)rk,ωk2cns(k)rk).\displaystyle\sim\mathcal{N}(\tau_{k}+\omega_{k}b_{n_{s}^{(k)}r_{k}},\omega_{k}^{2}c_{n_{s}^{(k)}r_{k}}). (101)

Note that one could also choose to linearly scale the variance rather than the standard deviation, as in a warped random walk. Ultimately, this is a subjective modeling choice, and scaling the standard deviation leads to slightly easier Gibbs updates later on.

We can equivalently endow each sequence with a latent warp index fk{1,,F}f_{k}\in\{1,\ldots,F\} where p(fk)ηfkp(f_{k})\propto\eta_{f_{k}} and,

ts(k)\displaystyle t_{s}^{(k)} 𝒩(τk+wfkbns(k)rk,wfk2cns(k)rk).\displaystyle\sim\mathcal{N}\left(\tau_{k}+w_{f_{k}}b_{n_{s}^{(k)}r_{k}},w_{f_{k}}^{2}c_{n_{s}^{(k)}r_{k}}\right). (102)

Since there are only a finite number of warp values, we can treat the warp index and the sequence type as a single discrete latent variable (rk,fk){1,,R}×{1,,F}(r_{k},f_{k})\in\{1,\ldots,R\}\times\{1,\ldots,F\}. This formulation allows us to re-use all of the collapsed Gibbs updates derived above, replacing our inferences over rkr_{k} with inferences of (rk,fk)(r_{k},f_{k}). Computationally, this increases the run-time by a factor of FF.

Given the sequence types, times, and warp values, the conditional distribution of the neuron delays bnrb_{nr} and widths cnrc_{nr} is,

p(bnr,cnrX,\displaystyle p(b_{nr},c_{nr}\mid X, {(rk,τk,fk,Ak)}k=1K,ξ)\displaystyle\{(r_{k},\tau_{k},f_{k},A_{k})\}_{k=1}^{K^{*}},\xi) (103)
Invχ2(cnrν,σ2)𝒩(bnr0,cnr/κ)k=1KxsXk(𝒩(tsτk+wfkbnr,wfk2cnr))𝕀[rk=rns=n]\displaystyle\propto\mathrm{Inv-}\chi^{2}(c_{nr}\mid\nu,\sigma^{2})\,\mathcal{N}(b_{nr}\mid 0,c_{nr}/\kappa)\prod_{k=1}^{K^{*}}\prod_{x_{s}\in X_{k}}\left(\mathcal{N}(t_{s}\mid\tau_{k}+w_{f_{k}}b_{nr},w_{f_{k}}^{2}c_{nr})\right)^{\mathbb{I}[r_{k}=r\wedge n_{s}=n]} (104)
Invχ2(cnrνnr,σnr2)𝒩(bnrμnr,cnr/κnr)\displaystyle\propto\mathrm{Inv-}\chi^{2}(c_{nr}\mid\nu_{nr},\sigma_{nr}^{2})\,\mathcal{N}(b_{nr}\mid\mu_{nr},c_{nr}/\kappa_{nr}) (105)

where

Snr\displaystyle S_{nr} =k=1KxsXk𝕀[rk=rns=n]\displaystyle=\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}{\mathbb{I}[r_{k}=r\wedge n_{s}=n]} (106)
Δsk\displaystyle\Delta_{sk} =tsτkwfk\displaystyle=\frac{t_{s}-\tau_{k}}{w_{f_{k}}} (107)
νnr\displaystyle\nu_{nr} =ν+Snr\displaystyle=\nu+S_{nr} (108)
σnr2\displaystyle\sigma_{nr}^{2} =νσ2+k=1KxsXkΔsk2𝕀[rk=rns=n]ν+Snr\displaystyle=\frac{\nu\sigma^{2}+\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}\Delta_{sk}^{2}\cdot\mathbb{I}[r_{k}=r\wedge n_{s}=n]}{\nu+S_{nr}} (109)
κnr\displaystyle\kappa_{nr} =κ+Snr\displaystyle=\kappa+S_{nr} (110)
μnr\displaystyle\mu_{nr} =k=1KxsXkΔsk𝕀[rk=rns=n]κ+Snr.\displaystyle=\frac{\sum_{k=1}^{K^{*}}\sum_{x_{s}\in X_{k}}\Delta_{sk}\cdot\mathbb{I}[r_{k}=r\wedge n_{s}=n]}{\kappa+S_{nr}}. (111)

If one instead chooses to linearly scale the variances, as mentioned above, the conditional distribution of the delays and widths is no longer of the normal-inverse-\chi^{2} form. However, it is straightforward to sample the delays and widths one at a time from their univariate conditional distributions.

Appendix F Supplemental Details on Experiments

F.1 ROC curve comparison of convNMF and PP-Seq

Simulated datasets consisted of N=100N=100 neurons, T=2000T=2000 time units, R=1R=1 sequence type, sequence event rate ψ=0.02\psi=0.02, a default average background rate of λn=0.03\lambda_{n}^{\varnothing}=0.03, default settings of α=225\alpha=225 and β=7.5\beta=7.5. Further, we set κnr=cnr\kappa_{nr}=c_{nr} and cnr=0.04c_{nr}=0.04 so that neuron offsets had unit standard deviation and small jitter by default. For the case of “jitter noise” we modified cnrc_{nr} but continued to set κnr=cnr\kappa_{nr}=c_{nr} to preserve the expected distribution of neuron offsets.

For each synthetic dataset, we fit convNMF models with R=1 component and a time bin size of 0.2 time units. We optimized using alternating projected gradient descent with a backtracking line search. Each convNMF model produced a temporal factor \mathbf{h}\in\mathbb{R}_{+}^{B} (where B=T/0.2=40000 is the total number of time bins). We encoded the ground truth sequence times in a binary vector of length B and computed ROC curves by thresholding \mathbf{h} over a fine grid of values spanning the range [0,\text{max}(\mathbf{h})]. We count a time bin, indexed by b, as a false positive if h_{b} exceeds the threshold but no ground-truth sequence was present in that bin, and as a true positive if h_{b} exceeds the threshold and a sequence event did occur in that bin.

We repeated a similar analysis with 100 (approximate) samples from the PP-Seq posterior distribution. Specifically, we discretized the sampled sequence event times, i.e. \{\tau_{k}\}_{k=1}^{K} for every sample, into the same B time bins used by convNMF. For every time bin we computed the empirical probability that it contained a sampled event, averaging over MCMC samples. This resulted in a discretized, nonnegative temporal factor that is directly analogous to the factor \mathbf{h} produced by convNMF. We computed an ROC curve by the same method outlined above.

Unfortunately, both convNMF and PP-Seq parameters are only weakly identifiable. For example, in PP-Seq, one can add a constant to all latent event times, \tau_{k}\mapsto\tau_{k}+C, and subtract the same constant from all neuron offsets, b_{nr}\mapsto b_{nr}-C. This manipulation does not modify the likelihood of the data (it does, however, affect the probability of the parameters under the prior, which is why we say the model is “weakly identifiable”). As a result, both PP-Seq and convNMF may consistently misestimate the “true” sequence times by a small, constant number of time bins. To discount this nuisance source of model error, we repeated the above analysis on shifted copies of the temporal factor (up to 20 time bins in each direction); we then selected the optimal ROC curve for each model and computed the area under the curve to quantify accuracy.
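A rough sketch of this analysis is below, assuming `h` is the discretized temporal factor and `truth` is the boolean ground-truth vector; it uses a circular shift and a simple threshold grid purely for illustration, rather than the exact procedure used in the experiments.

```python
import numpy as np

def shifted_auc(h, truth, max_shift=20, n_thresh=200):
    """Threshold a temporal factor h against a binary ground-truth vector and
    take the best AUC over small integer shifts, to discount the time-offset
    ambiguity described above."""
    best_auc = 0.0
    for shift in range(-max_shift, max_shift + 1):
        h_s = np.roll(h, shift)                  # circular shift (sketch only)
        thresholds = np.concatenate(([-np.inf], np.linspace(0.0, h_s.max(), n_thresh)))
        tpr = np.array([(h_s[truth] > th).mean() for th in thresholds])
        fpr = np.array([(h_s[~truth] > th).mean() for th in thresholds])
        order = np.argsort(fpr)                  # integrate TPR over increasing FPR
        best_auc = max(best_auc, np.trapz(tpr[order], fpr[order]))
    return best_auc
```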

F.2 Hyperparameter sweep on hippocampal data

We performed MCMC inference on 2000 randomly sampled hyperparameter sets, masking out a random 7.5% of the data as a validation set on every run. We fixed the following hyperparameter values: γ=3\gamma=3, φ=1\varphi=1, μnr=0\mu_{nr}=0 for every neuron and sequence type, νnr=4\nu_{nr}=4 for every neuron and sequence type. We also fixed the following hyperparameters, pertaining to time warping: F=10F=10, σw2=100\sigma^{2}_{w}=100. We randomized the number of sequence types R{1,2,3,4}R\in\{1,2,3,4\} uniformly. We randomized the maximum warp value wFw_{F} uniformly on (1, 1.5). We randomized the mean sequence amplitude, α/β\alpha/\beta, log-uniformly over the interval (102,104)(10^{2},10^{4}), and set the variance of the sequence amplitude α/β2\alpha/\beta^{2} equal to the mean. We randomized the expected total background amplitude, λ\lambda^{\varnothing}, log-uniformly over the interval (102,104)(10^{2},10^{4}) and set the variance of this parameter equal to its mean. We randomized the sequence rate ψ\psi log-uniformly over (103,101)(10^{-3},10^{-1}). Finally, we randomized the neuron response widths cnrc_{nr} log-uniformly over (101,101)(10^{-1},10^{1}).

We randomly initialized all spike assignments either with convNMF or the annealed MCMC procedure outlined in the main text (we observed successful outcomes in both cases). In annealed runs, we set an initial temperature of 500 and exponentially relaxed the temperature to one over 20 stages, each containing 100 Gibbs sweeps. (Recall that this temperature parameter scales the variance of Ga(α,β)\mathrm{Ga}(\alpha,\beta) while preserving the mean of this prior distribution.) After annealing was completed we performed 100 final Gibbs sweeps over the data, the final half of which were used as approximate samples from the posterior. In this final stage of sampling, we also performed 1000 randomized split-merge Metropolis Hastings moves after every Gibbs sweep.
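One way to realize this temperature scaling is sketched below, together with the exponential schedule; this is our parameterization, which preserves the mean α/β while multiplying the variance α/β² by the temperature, and the experiments may use a different but equivalent scheme.

```python
import numpy as np

# Exponential temperature schedule: 20 stages, relaxing from 500 down to 1.
temperatures = np.geomspace(500.0, 1.0, num=20)

def tempered_amplitude_prior(alpha, beta, temperature):
    """Rescale the Ga(alpha, beta) amplitude prior so that its mean alpha/beta is
    unchanged while its variance alpha/beta^2 is multiplied by `temperature`."""
    return alpha / temperature, beta / temperature
```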

All MCMC runs were performed using our Julia implementation of PP-Seq, run in parallel on the Sherlock cluster at Stanford University. Each job was allocated 1 CPU and 8 GB of RAM.