This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Hidden Markov model analysis to fluorescence blinking of fluorescently labeled DNA

Tatsuhiro Furuta Graduate School of Engineering, Kyushu Institute of Technology, 1-1 Sensui-cho, Tobata-ku, Kitakyushu, Fukuoka 804-8550, Japan.    Shuya Fan SANKEN (The Institute of Scientific and Industrial Research), Osaka University, Mihogaoka 8-1, Ibaraki, Osaka 567-0047, Japan.    Tadao Takada Department of Applied Chemistry, Graduate School of Engineering, University of Hyogo, 2167 Shosha, Himeji, Hyogo 671-2280, Japan.    Yohei Kondo Department of Life Science and Technology, Institute of Science Tokyo, Nagatsuta, Midori-ku, Yokohama, Kanagawa 226-8501, Japan.    Mamoru Fujitsuka SANKEN (The Institute of Scientific and Industrial Research), Osaka University, Mihogaoka 8-1, Ibaraki, Osaka 567-0047, Japan.    Atsushi Maruyama Department of Life Science and Technology, Institute of Science Tokyo, Nagatsuta, Midori-ku, Yokohama, Kanagawa 226-8501, Japan.    Kiyohiko Kawai [email protected] Department of Life Science and Technology, Institute of Science Tokyo, Nagatsuta, Midori-ku, Yokohama, Kanagawa 226-8501, Japan.    Kazuma Nakamura [email protected] Graduate School of Engineering, Kyushu Institute of Technology, 1-1 Sensui-cho, Tobata-ku, Kitakyushu, Fukuoka 804-8550, Japan. Integrated Research Center for Energy and Environment Advanced Technology, Kyushu Institute of Technology, 1-1 Sensui-cho, Tobata-ku, Kitakyushu, Fukuoka 804-8550, Japan.
Abstract

We examine quantitatively the transition process from emitting to not-emitting states of fluorescent molecules with a machine learning technique. In a fluorescently labeled DNA, the fluorescence occurs continuously under irradiation, but it often transfers to the not-emitting state corresponding to a charge-separated state. The trajectory of the fluorescence consists of repetitions of light-emitting (ON) and not-emitting (OFF) states, called blinking, and it contains a very large amount of noise due to the several reasons, so in principle, it is difficult to distinguish the ON and OFF states quantitatively. The fluorescence trajectory is a typical stochastic process, and therefore requires advanced time-series data analysis. In the present study, we analyze the fluorescence trajectories using a hidden Markov model, and calculate the probability density of the ON and OFF duration. From the analysis, we found that the ON-duration probability density can be well described by an exponential function, and the OFF-duration probability density can be well described by a log-normal function, which are verified in terms of Kolmogorov-Smirnov test. The time-bin dependence in the fluorescence trajectory on the probability density is carefully analyzed. We also discuss the ON and OFF processes from failure-rate analysis used in life testing of semiconductor devices.

I Introduction

Materials science relating to fluorescence phenomena has long been one of the most active research fields. The luminescent properties of quantum dots are expected to have many applications [1, 2, 3, 4, 5] due to the variable fluorescence as a function of the system size and a high fluorescence quantum yield. In addition, fluorescently labeled DNA, in which a fluorescent dye is embedded to a specific DNA sequence, has become an essential tool in molecular biochemistry [6], biomedical research [7], and biophysics [8] to understand DNA dynamics and interactions.

One of characteristic and important aspects in the florescent property is fluorescence blinking [9, 10]. The state in which a fluorescent material repeatedly absorbs light and emits fluorescence under irradiation is called the ON state displayed in Fig. 1(a), and the state in which it does not emit light is called the OFF state in Fig. 1(b). When the fluorescence intensity is monitored as a function of time, the ON and OFF states are repeated, called fluorescence blinking, and the time series data shown in Fig. 1(e) are generated [11, 12]. This time series record the duration of the ON and OFF states.

Refer to caption
Figure 1: A schematic figure of fluorescence blinking. When illuminated with light, a fluorophore takes an ON state [panel (a)], where it repeats light-absorbing and light-emitting. The system also takes an OFF state [panel (b)], where the light-emitting is interrupted because of the transition to not-emitting state. Panels (c) and (d) are schematic chemical structures of the fluorescently labeled DNA (double helix structure) to be analyzed [11, 12]. The symbols A, G, T, and C are the bases that make up DNA, and X is the fluorescent molecule ATTO655, and ZG is a hole-trapping molecule. The panel (c) represents the ON state, and (d) represents the OFF state in which the molecule is trapped in a charge-separated state and does not emit light. See Sec. II.1 for more details. The panel (e) is a part of the observed fluorescence trajectory in Fig. 2 (a) of the present fluorecently labeled DNA, in which we see noticeable noises.

The blinking plot is obtained by calculating a probability density function of the duration of each states, providing a basis to understand the luminescent properties of materials. It should be noted here that the fluorescence trajectory contains a large amount of noise such as light emission from the substrate. When the noise becomes larger than the fluorescent signal, it becomes difficult to distinguish between the ON and OFF states, and the quantitative nature of the evaluated duration is ambiguous. As a result, the reliability of the blinking plot is also lost. A data analysis method that can remove noise and stably determine the ON/OFF states is desired.

In the present study, we report a hidden Markov model (HMM) analysis for the fluorescence trajectory of a fluorescently labeled DNA. The HMM is a machine learning method to handle time series data, and has been applied to various dynamic data such as stock price prediction [13] and anomaly detection [14]. In materials science, there are several applications to quantum dots [15], random telegraph noise [16, 17, 18, 19, 20], nitrogen-vacancy centers in diamond [21, 22], electron holograms [23], electron holography [24], atom quantum jump dynamics [25], and crack propagation [26]. The fluorescent trajectory is a typical stochastic process including noise, and it is important to understand the physics behind this process. The HMM introduces hidden variables characterizing states behind the observed data. The ON/OFF states can be identified with a time series of the hidden variables instead of real fluorescence trajectory. An important point here is that the time series of the hidden variable are noise-suppressive, and as a result, we can stably estimate ON/OFF duration and the blinking plot.

On the study about a fluorescently labeled single molecule, fluctuation and correlation in the fluorescent intensity are important objects, because they include details of the composition, shape, diffusion, dynamics, intramolecular interaction, and environment effects [27]. Methods based on the auto correlation function [28] and photon counting histogram [29] to the time series data play an important role. There are many reports with these techniques, including conformational dynamics of a single protein [30], enzymatic turnovers of single flavoenzyme molecules [31], electron transfer reaction of azurin molecules [32]. HMM analyses have also been applied to the analyses of fluorescence resonance energy transfer [33, 34, 35], diffusion of nanoparticles in the cytoplasm [36], biomolecular motors [37], multichromophore photobleaching of dextran polymers [38]. In the present study, we apply the HMM analysis to the understanding of fluorescence blinking of the fluorescently labeled DNA.

This paper is organized as follows: In Sec. II, we describe an experimental setup for single-molecule measurements of fluorescent signals of a fluorecently labeled DNA, and overviews of the HMM analysis and how to calculate the blinking plot. In Sec. III, we show results of the HMM analysis for the experimental fluorescence trajectory and model fitting to the probability density for the duration of the ON and OFF states. The best fitting functions are pursued in term of the statistical hypothesis testing of goodness-of-fit. In addition, we perform a failure-rate analysis to understand the ON\toOFF and OFF\toON transition processes. Section IV summarizes conclusions.

II Method

II.1 Single molecule measurement for fluorescence trajectory

In the present study, we analyze the fluorescence fluctuation data previously measured for charge-separation and charge-recombination processes in DNA [11, 12]. Figures 1 (c) and (d) are schematic chemical structures of the fluorescently labeled DNA (double helix structure) to be analyzed, where A, G, T, and C are the bases that make up DNA, X is the fluorescent molecule ATTO655 as a photosensitizer, and ZG is a deazaguanine as a hole trap to observe a charge-separation and charge recombination process in DNA at the single-molecule level. The panel (c) represents a schematic chemical structure for an emitting ON state, where ATTO655 repeats photon-absorption and emission. The panel (d) represents a schematic chemical structure of a not-emitting OFF state; during the charge-separated state, ATTO655 is in the radical anion form and cannot emit, therefore the duration of the OFF state corresponds to the lifetime of the charge-separated state.

Briefly, for the experimental setup, biotin group was introduced at the end of the duplex and was anchored on a glass surface through well-established biotinylated-BSA, streptavidin, biotinylated-DNA chemistry. The single molecule measurement was performed on a custom-made confocal fluorescence microscope consisting of an inverted optical microscope (IX73, Olympus). The output of the laser (OBIS 637 LX, Coherent, 637 nm, 4.6 μ\muW) was focused in the sample by an objective (UPlanXApo, x60, oil, NA 1.42, Olympus), and the detection volume (confocal volume) was regulated by a pinhole (diameter 25 μ\mum, Thorlabs MPH16). The scattered light was blocked by a band-pass filter (FF01-697/58-25, Semrock), and the emitted photons from ATTO655 were detected by an avalanche photodiode (SPCM-AQRH-14, Perkin-Elmer). A counting board (SPC-130EMN, Becker & Hickl GmbH) was used to count the output, and real-time monitoring of fluorescence intensity fluctuations has been achieved using the Spcm64 system. (Becker & Hickl GmbH). The time resolution of this device on the photon counting is 80 nsec.

II.2 Hidden Markov model

We next briefly describe our calculation method of HMM [15]. In the HMM, a probability density function combined with various stochastic variables is introduced. The joint distribution of the HMM employed in the present study is the following form as

p(𝑰,𝐒,𝚯,𝝅,𝐀)=p(𝑰|𝐒,𝚯)p(𝐒|𝝅,𝐀)p(𝚯)p(𝝅)p(𝐀).\displaystyle p({\bm{I}},\mathbf{S},\mathbf{\Theta},\bm{\pi},\mathbf{A})=p({\bm{I}}|\mathbf{S},\mathbf{\Theta})p(\mathbf{S}|\bm{\pi},\mathbf{A})p(\mathbf{\Theta})p(\bm{\pi})p(\mathbf{A}).
(1)

Here,

𝑰=(I1,I2,,IN)\displaystyle{\bm{I}}=(I_{1},I_{2},\cdots,I_{N}) (2)

is observed data corresponding to a fluorescence trajectory to be analyzed, and NN is the total numbers of the time grids. 𝐒\mathbf{S} is a time series of hidden variables as

𝐒=(𝐬1,𝐬2,,𝐬N),\displaystyle\mathbf{S}=({\bf s}_{1},{\bf s}_{2},\cdots,{\bf s}_{N}), (3)

which describes the internal state of each time in the fluorescence trajectory, and 𝐬i{\bf s}_{i} is a KK-dimensional vector in the time step ii. To identify an ON or OFF state, we set KK = 2 [15]. 𝚯\mathbf{\Theta} describes the parameter characterizing a distribution function of observed data 𝑰{\bm{I}}. In the present study, we assume that the distribution of each state follows the Gaussian distribution with a mean μ\mu and a precision λ\lambda, where λ1\lambda^{-1} represents a variance. Then, we write 𝚯\mathbf{\Theta} as {𝝁,𝝀}\{{\bm{\mu}},{\bm{\lambda}}\} with μ=(μON,μOFF){\bf\mu}=(\mu_{{\rm ON}},\mu_{{\rm OFF}}) and λ=(λON,λOFF){\bf\lambda}=(\lambda_{{\rm ON}},\lambda_{{\rm OFF}}). 𝝅\bm{\pi} represents the probability of the initial-step hidden variable, and 𝝅=(πON,πOFF)\bm{\pi}=(\pi_{{\rm ON}},\pi_{{\rm OFF}}). 𝐀\mathbf{A} describes a 2×\times2 transition matrix for the time evolution of the hidden variables. p(𝐒|𝝅,𝐀)p(\mathbf{S}|\bm{\pi},\mathbf{A}) in Eq. (1) describes the conditional probability distribution of 𝐒\mathbf{S} with 𝐀\mathbf{A} and 𝝅\bm{\pi} fixed, and similarly, p(𝑰|𝐒,𝚯)p(\bm{I}|\mathbf{S},\mathbf{\Theta}) is the conditional probability distribution of 𝑰\bm{I} after 𝐒\mathbf{S} and 𝚯\mathbf{\Theta} were determined.

The purpose of the HMM simulation is to infer the hidden-variable time series 𝐒\mathbf{S}. For this purpose, we need to calculate a posterior distribution which is a conditional distribution function under the observed trajectory 𝑰\bm{I} as

p(𝐒,𝝁,𝝀,𝝅,𝐀|𝑰)=p(𝑰,𝐒,𝝁,𝝀,𝝅,𝐀)p(𝑰),\displaystyle p(\mathbf{S},{\bm{\mu}},{\bm{\lambda}},\bm{\pi},\mathbf{A}|\bm{I})=\frac{p(\bm{I},\mathbf{S},\bm{\mu},\bm{\lambda},\bm{\pi},\mathbf{A})}{p(\bm{I})}, (4)

where p(𝑰)p(\bm{I}) is the marginal distribution written as

p(𝑰)=𝐒p(𝑰,𝐒,𝝁,𝝀,𝝅,𝐀)𝑑𝝁𝑑𝝀𝑑𝝅𝑑𝐀.\displaystyle p(\bm{I})=\sum_{\mathbf{S}}\int p(\bm{I},\mathbf{S},\bm{\mu},\bm{\lambda},\bm{\pi},\mathbf{A})d\bm{\mu}d\bm{\lambda}d\bm{\pi}d\mathbf{A}. (5)

To calculate the posterior distribution in Eq. (4) approximately, we use a blocking Gibbs sampling based on Bayesian inference. Optimization details can be found in Ref. 15. By evaluation of the posterior distribution, the hidden-variable time series 𝐒\mathbf{S} in Eq. (3) can be obtained as a simulation result. On the calculation condition, the total number of the iteration steps of the Gibbs sampling Nitr=1000N_{itr}=1000, and the hyperparameters of each distribution are set to the same as Ref. 15.

II.3 Probability density of duration

Next, we describe how to analyze the hidden-variable time series 𝐒\mathbf{S} in Eq. (3) obtained from the HMM simulation. For this purpose, we consider a probability density of duration τ\tau as

p(τ)=1Neα=1Neδ(ττα),\displaystyle p(\tau)=\frac{1}{N_{e}}\sum_{\alpha=1}^{N_{e}}\delta(\tau-\tau_{\alpha}), (6)

where p(τ)p(\tau) is normalized as p(τ)𝑑τ=1\int p(\tau)d\tau=1. We consider p(τON)p(\tau_{{\rm ON}}) and p(τOFF)p(\tau_{{\rm OFF}}), and τα\tau_{\alpha} in Eq. (6) is the α\alpha-th ON or OFF event duration. The {τα\tau_{\alpha}} data are collected from the hidden-variable time series 𝐒\mathbf{S} [15], and NeN_{e} is the total number of the {τα\tau_{\alpha}} data. In the practical calculation, p(τ)p(\tau) is evaluated discretely as

p(τk)=1Cα=1Neδ(τkτα),\displaystyle p(\tau_{k})=\frac{1}{C}\sum_{\alpha=1}^{N_{e}}\delta(\tau_{k}-\tau_{\alpha}), (7)

where p(τk)p(\tau_{k}) is proportional to the total number of the events with the duration from τk1\tau_{k-1} to τk\tau_{k}, and τk\tau_{k} is given as kΔτk\Delta_{\tau} with Δτ\Delta_{\tau} being a grid spacing of duration. CC is a normalization constant given as

C=k=1Nτα=1Neδ(τkτα)Δτ,\displaystyle C=\sum_{k=1}^{N_{\tau}}\sum_{\alpha=1}^{N_{e}}\delta(\tau_{k}-\tau_{\alpha})\Delta\tau, (8)

where NτN_{\tau} is the total number of the duration grids.

The calculation condition is as follows: We analyze the fluorescence trajectories of 40 molecules. The length of the fluorescence trajectories is ranging from 0.46 sec to 9.27 sec. For the time bin Δ\Delta of the trajectory, we consider the three size; 62.5 μ\musec, 125 μ\musec, and 250 μ\musec. The total number of the time grids of each trajectory, NN in Eqs. (2) and (3), depends on the trajectory length and Δ\Delta. The total number of the {τα\tau_{\alpha}} data, NeN_{e} in Eq. (6) or (7), is 2,078 for the ON case and 2,067 for the OFF case with the Δ\Delta=250-μ\musec case. Similarly, for the Δ\Delta=125-μ\musec case, NeN_{e} is 2,341 (ON) and 2,334 (OFF), and for the Δ\Delta=62.5-μ\musec case, NeN_{e} is 2,697 (ON) and 2,678 (OFF). The grid spacing of duration Δτ\Delta\tau is 500 μ\musec, and the total number of the duration grids NτN_{\tau} is 462 for the ON case and 122 for the OFF case with the Δ\Delta=250-μ\musec case. Similarly, for the Δ\Delta=125-μ\musec case, NτN_{\tau} is 383 (ON) and 122 (OFF), and for the Δ\Delta=62.5-μ\musec case, NτN_{\tau} is 322 (ON) and 122 (OFF).

III Results and Discussions

III.1 Analysis for fluorescence trajectory

Figure 2 shows three experimental fluorescence trajectories corresponding to Eq. (2), where photon counts are evaluated in the time bin Δ\Delta = (a) 250 μ\musec, (b) 125 μ\musec, and (c) 62.5 μ\musec. These data are all the same trajectory, but Δ\Delta is different, so the integrated values for each bin are different. It can clearly be seen that the noise becomes more appreciable as the Δ\Delta becomes smaller.

Figure 3 superimposes the hidden-variable time series 𝐒\mathbf{S} in Eq. (3) obtained from the HMM simulations, denoted by the green dashed lines, onto the original fluorescent trajectories (red solid lines). A black-solid line of 0.5 is the line to distinguish an ON state (above 0.5) and an OFF state (below 0.5) for the hidden variable time series. In the practical calculation, this threshold is slightly adjusted depending on the data within 0.4-0.6. As can be seen from the figure, the hidden-variable time series well capture the switching of the ON\toOFF and OFF\toON process, from which we properly evaluate the ON and OFF duration, τON\tau_{{\rm ON}} and τOFF\tau_{{\rm OFF}}. Examples of τON\tau_{{\rm ON}} and τOFF\tau_{{\rm OFF}} are illustrated in the Fig. 3 (a). Since the extremely noisy data affects the state identification, short-duration events might slightly be overestimated in the analysis of time series with smaller Δ\Delta. We note that the moving-average method [39] does not work for the ON/OFF identification from the present fluorescence trajectories. This is because the data is noisy, so there is a limitation for data smoothing. As a result, a large number of artificial short-duration events are generated, compared to the HMM. In this sense, the HMM is highly effective for the noise removing.

Refer to caption
Figure 2: Three experimental photon-count trajectories corresponding to Eq. (2), where the photon counts are evaluated in the time-bin Δ\Delta: (a) Δ\Delta = 250 μ\musec, (b) Δ\Delta = 125 μ\musec, and (c) Δ\Delta = 62.5 μ\musec.
Refer to caption
Figure 3: Three trajectories of hidden variables 𝐒\mathbf{S} in Eq. (3) obtained from HMM simulations (green-dashed lines): (a) Δ\Delta = 250 μ\musec, (b) Δ\Delta = 125 μ\musec, and (c) Δ\Delta = 62.5 μ\musec. Black-solid lines of 0.5 are drawn to distinguish an ON (above 0.5) state and an OFF (below 0.5) state for 𝐒\mathbf{S}. Examples of the ON and OFF duration, τON\tau_{{\rm ON}} and τOFF\tau_{{\rm OFF}}, are illustrated in the panel (a). Red-solid lines are the photon-count trajectories and how to see is the same as Fig. 2.

III.2 Probability density of duration

Figure 4 displays our calculated probability density for the ON [(a), (c), and (e)] and OFF [(b), (d), and (f)] duration, p(τk)p(\tau_{k}) in Eq. (7). The upper (a) and (b) describe the results for fluorescence trajectories with Δ\Delta = 250 μ\musec. The middle (c) and (d) and the lower (e) and (f) panels describe the results for Δ\Delta = 125 μ\musec and Δ\Delta = 62.5 μ\musec, respectively. Note that these data are collected from the fluorescence trajectories of the 40 molecules.

Refer to caption
Figure 4: Our calculated probability density for the ON/OFF duration in fluorescence blinking, p(τON)p(\tau_{{\rm ON}}) and p(τOFF)p(\tau_{{\rm OFF}}) in Eq. (7), denoted by red-bars. Blue-solid curves represent exponential distributions P(τON)P(\tau_{{\rm ON}}) in Eq. (9) and log-normal distributions P(τOFF)P(\tau_{{\rm OFF}}) in Eq. (10), whose optimized parameters are summarized in Table 1. The upper (a) and (b), middle (c) and (d), and lower (e) and (f) panels are the results for the time series with Δ\Delta = 250 μ\musec, 125 μ\musec, and 62.5 μ\musec, respectively. Also, the left (a), (c), and (e) panels are results for the ON duration, while the right (b), (d), and (f) panels correspond to the results for the OFF duration.

To analyze the trends of these data, we performed fittings of model probability density functions. For the ON distribution, we use an exponential distribution as

P(τON)=1sexp(τONs),\displaystyle P(\tau_{{\rm ON}})=\frac{1}{s}\exp\biggl{(}-\frac{\tau_{{\rm ON}}}{s}\biggr{)}, (9)

where ss is a parameter for the exponential function. For the OFF distribution, we consider a log-normal distribution as

P(τOFF)=12πσ2τOFFexp((lnτOFFμ)22σ2),\displaystyle P(\tau_{{\rm OFF}})=\frac{1}{\sqrt{2\pi\sigma^{2}}\tau_{{\rm OFF}}}\exp\biggl{(}-\frac{(\ln\tau_{{\rm OFF}}-\mu)^{2}}{2\sigma^{2}}\biggr{)}, (10)

where μ\mu and σ\sigma represent mean and standard deviation of the distribution, respectively. To evaluate the fitting accuracy, we calculate the following measures; root mean squared error (RMSE) as

RMSE=1NDk=1ND(p(τk)P(τk))2,\displaystyle{\rm RMSE}=\sqrt{\frac{1}{N_{D}}\sum_{k=1}^{N_{D}}\bigl{(}p(\tau_{k})-P(\tau_{k})\bigr{)}^{2}}, (11)

and coefficient of determination R2{\rm R}^{2} as

R2=1k=1ND(p(τk)P(τk))2k=1ND(p(τk)p¯)2\displaystyle{\rm R}^{2}=1-\frac{\sum_{k=1}^{N_{D}}\bigl{(}p(\tau_{k})-P(\tau_{k})\bigr{)}^{2}}{\sum_{k=1}^{N_{D}}\bigl{(}p(\tau_{k})-\bar{p}\bigr{)}^{2}} (12)

with the p¯\bar{p} in Eq. (12) being the mean of probability density as

p¯=1NDk=1NDp(τk).\displaystyle\bar{p}=\frac{1}{N_{D}}\sum_{k=1}^{N_{D}}p(\tau_{k}). (13)

Here, τk\tau_{k} represents the kk-th duration grid, introduced in Eq. (7), but the calculations of Eqs.(11), (12), and (13) exclude the data of p(τk)=0p(\tau_{k})=0. Thus, the total number of the data points NDN_{D} is 231 for the ON case and 88 for the OFF case with the Δ\Delta=250-μ\musec case. Similarly, for the Δ\Delta=125-μ\musec case, NDN_{D} is 218 (ON) and 85 (OFF), and for the Δ\Delta=62.5-μ\musec case, NDN_{D} is 209 (ON) and 87 (OFF). The P(τk)P(\tau_{k}) is the value of the fitting function of the kk-th duration grid.

The model parameters in Eqs. (9) and (10) are determined as follows: We first perform least-squared fitting. Then, we set the resulting parameters to the initial guess, and perform the Kolmogorov-Smirnov (KS) test [40]. In this test, we tune the model parameters to maximize pp-value for measuring goodness-of-fit, where the pp-value is defined as

pvalue=p(z)=2j=1(1)j1exp(2j2z2)\displaystyle p\mathchar 45\relax{\rm value}=p(z)=2\sum_{j=1}^{\infty}(-1)^{j-1}\exp(-2j^{2}z^{2}) (14)

with

z=Nemaxτ|SNe(τ)F(τ)|\displaystyle z=\sqrt{N_{e}}\max_{\tau}\bigl{|}S_{N_{e}}(\tau)-F(\tau)\bigr{|} (15)

and NeN_{e} being the total number of the {τα\tau_{\alpha}} data [Eq. (6)]. SNe(τ)S_{N_{e}}(\tau) is an empirical distribution function as

SNe(τ)=1Neα=1Ne𝟏τα<τ,\displaystyle S_{N_{e}}(\tau)=\frac{1}{N_{e}}\sum_{\alpha=1}^{N_{e}}{\bf 1}_{\tau_{\alpha}<\tau}, (16)

where 𝟏A{\bf 1}_{A} is an indicator function for the condition AA, and F(τ)F(\tau) is a cumulative distribution function as

F(τ)=0τP(τ)𝑑τ,\displaystyle F(\tau)=\int_{0}^{\tau}P(\tau^{\prime})d\tau^{\prime}, (17)

where P(τ)P(\tau) is a probability density function to be tested. The conventional significance level for the pp-value is 0.05, and if the obtained pp-value is above 0.05, the test model function is considered to have validity.

Figure 4 shows the resulting fitted curves with blue-solid lines. We see from the figure that the fittings are satisfactorily. Optimized parameters and fitting accuracy are given in Table 1, where the first and second columns contain the results for P(τON)P(\tau_{{\rm ON}}) and P(τOFF)P(\tau_{{\rm OFF}}), respectively. From the table, the RMSE and R2{\rm R}^{2} support good accuracy; very low RMSE and R2{\rm R}^{2} sufficiently close to 1. The resulting pp-values are also listed, where they are 0.2-0.6 for P(τON)P(\tau_{{\rm ON}}) and nearly 0.3 for P(τOFF)P(\tau_{{\rm OFF}}). These pp-values are larger than the conventional significance level of 0.05. Thus, the null hypothesis (the data distribution differs from the assumed distribution function) is rejected.

Table 1: Summary of fittings: We employ an exponential function P(τON)P(\tau_{{\rm ON}}) in Eq. (9) for the fitting of the ON-duration probability densities in Figs. 4 (a), (c), and (e), and a log-normal function P(τOFF)P(\tau_{{\rm OFF}}) in Eq. (10) for the OFF-duration probability density in Figs. 4 (b), (d), and (f). The ss is a parameter of P(τON)P(\tau_{{\rm ON}}) in Eq. (9), and the μ\mu and σ\sigma are parameters of P(τOFF)P(\tau_{{\rm OFF}}) in Eq. (10). RMSE and R2 are calculated from Eq. (11) and Eq. (12), respectively. The pp-value with the KS test in Eq. (14) are listed. The fitting accuracy of the normal distribution P(αOFF)P(\alpha_{{\rm OFF}}) in Eq. (26) of Appendix A to the data of Fig. 8 is also shown, with αOFF=lnτOFF\alpha_{{\rm OFF}}=\ln\tau_{{\rm OFF}}. It is noted that the parameters μ\mu and σ\sigma of the normal distribution are the same as those of the log-normal distribution, and the pp-value of P(αOFF)P(\alpha_{{\rm OFF}}) is the same as that of P(τOFF)P(\tau_{{\rm OFF}}).
P(τON)P(\tau_{{\rm ON}}) P(τOFF)P(\tau_{{\rm OFF}}) P(αOFF)P(\alpha_{{\rm OFF}})
Δ\Delta ss (ms) RMSE R2 pp-value μ\mu σ\sigma RMSE R2 pp-value RMSE R2
250 25.2 0.00339 0.896 0.196 1.95 0.787 0.00557 0.966 0.306 0.0399 0.955
125 22.0 0.00291 0.937 0.618 1.82 0.860 0.00601 0.964 0.275 0.0381 0.950
62.5 19.0 0.00294 0.950 0.460 1.68 0.908 0.00659 0.962 0.275 0.0341 0.953

Next, we discuss the characteristic relaxation time in terms of the resulting model parameters. For the relaxation time of the ON\toOFF process, it is estimated as 20-25 msec (see ss in Table 1). In addition, on the characteristic relaxation time eμe^{\mu} of the OFF\toON process, we obtain e1.95=7.03e^{1.95}=7.03 msec for the Δ=250\Delta=250-μ\musec case, e1.82=6.17e^{1.82}=6.17 msec for the Δ=125\Delta=125-μ\musec case, and e1.68=5.37e^{1.68}=5.37 msec for the Δ=62.5\Delta=62.5-μ\musec case. The estimate based on the auto-correlation function analysis [12, 11] is nearly 5 msec, and thus, the present HMM estimates are consistent with the estimates of the correlation analyses. We note that the both relaxation times of the ON and OFF states become short as the time-bin size Δ\Delta of the fluorescence trajectory becomes small. This may reflect that the longer duration event is split into the shorter duration events with considering the smaller Δ\Delta.

III.3 Best fit for probability density of the OFF duration

The ON distribution is basically a monotonically decreasing, so the validity of the exponential function is identified visibly. On the other hand, the OFF distribution has an asymmetric structure, so it is necessary to carefully verify the validity of the log-normal function. To find the best function for probability density of the OFF duration, we investigate two other representative asymmetric distributions, the Weibull and Gamma distributions. The Weibull distribution PW(τ)P_{W}(\tau) is written as

PW(τ)=mη(τη)m1exp{(τη)m},\displaystyle P_{W}(\tau)=\frac{m}{\eta}\biggl{(}\frac{\tau}{\eta}\biggr{)}^{m-1}\exp\biggl{\{}-\biggl{(}\frac{\tau}{\eta}\biggr{)}^{m}\biggr{\}}, (18)

where mm and η\eta are the shape and scale parameters of the distribution, respectively. Also, the Gamma distribution PG(τ)P_{G}(\tau) is written as

PG(τ)=baΓ(a)τa1exp(τb),\displaystyle P_{G}(\tau)=\frac{b^{-a}}{\Gamma(a)}\tau^{a-1}\exp\biggl{(}-\frac{\tau}{b}\biggr{)}, (19)

where aa and bb are the shape and scale parameters of the distribution, respectively. Γ(a)\Gamma(a) is the Gamma function.

Figure 5 compares the fittings to the OFF probability density: The panels (a), (d), and (g) are the fitting results of the Weibull distribution PW(τOFF)P_{W}(\tau_{{\rm OFF}}) in Eq. (18), based on the least-squared fitting plus the KS-pp value maximization, and (b), (e), and (h) are the fitting results of the Gamma distribution PG(τOFF)P_{G}(\tau_{{\rm OFF}}) in Eq. (19). The panels (c), (f), and (i) are the log-normal distribution P(τOFF)P(\tau_{{\rm OFF}}) in Eq. (10). Also, (a), (b), and (c) describe the results for the time bin Δ=250\Delta=250 μ\musec, (d), (e), and (f) describe the results for Δ=125\Delta=125 μ\musec, and (g), (h), and (i) are the results for Δ=62.5\Delta=62.5 μ\musec. The fitted parameters are summarized in Table 2. From the figure, we see that the log-normal function gives a good fit, while the Weibull and Gamma functions seem to underestimate the peak height around τOFF\tau_{{\rm OFF}}\sim 5 msec. In the Weibull and Gamma functions, the tail behavior in the low duration side is not described properly. The resulting pp-values for the Weibull and Gamma distributions are appreciably small compared to those of the log-normal distribution, and well below the significant level of 0.05.

Refer to caption
Figure 5: Comparison of fittings to the OFF-duration probability density in Fig. 4: Panels (a), (d), and (g) are the fitting results with the Weibull distribution PW(τOFF)P_{W}(\tau_{{\rm OFF}}) in Eq. (18), and (b), (e), and (h) are the Gamma distribution PG(τOFF)P_{G}(\tau_{{\rm OFF}}) in Eq. (19). The panels (c), (f), and (i) are the log-normal distribution P(τOFF)P(\tau_{{\rm OFF}}) in Eq. (10). Also, (a), (b), and (c) describe the results for the time bin Δ=250\Delta=250 μ\musec, and (d), (e), and (f) describe the results for Δ=125\Delta=125 μ\musec. The (g), (h), and (i) are the results for Δ=62.5\Delta=62.5 μ\musec. The fits are given with blue-solid curves, and fitted parameters are given in Table 2.
Table 2: Summary of fitting functions to the OFF-duration probability density: The fitting functions are the Weibull PW(τOFF)P_{W}(\tau_{{\rm OFF}}) in Eq. (18), Gamma PG(τOFF)P_{G}(\tau_{{\rm OFF}}) in Eq. (19), and log-normal P(τOFF)P(\tau_{{\rm OFF}}) in Eq. (10). The fitted curves are plotted in Fig. 5. The mm and η\eta are the fitted parameters of PW(τOFF)P_{W}(\tau_{{\rm OFF}}), and the aa and bb are the fitted parameters of PG(τOFF)P_{G}(\tau_{{\rm OFF}}). Also, the μ\mu and σ\sigma are fitted parameters of P(τOFF)P(\tau_{{\rm OFF}}). Evaluated pp-values with the KS test in Eq. (14) are listed.
Weibull Gamma log-normal
Δ\Delta mm η\eta (ms) pp-value aa bb (ms) pp-value μ\mu σ\sigma pp-value
250 1.56 9.47 4.32×105\times 10^{-5} 2.09 4.12 3.10×103\times 10^{-3} 1.95 0.787 0.306
125 1.44 8.57 1.82×104\times 10^{-4} 1.82 4.33 6.81×103\times 10^{-3} 1.82 0.860 0.275
62.5 1.35 7.56 7.59×105\times 10^{-5} 1.62 4.34 1.95×103\times 10^{-3} 1.68 0.908 0.275

The fact that τOFF\tau_{{\rm OFF}} follows the log-normal distribution means that τOFF\tau_{{\rm OFF}} is based on the multiplicative stochastic process as [41, 42, 43]

τOFF=i=1Mni,\displaystyle\tau_{{\rm OFF}}=\prod_{i=1}^{M}n_{i}, (20)

where nin_{i} is the ii-th factor affecting τOFF\tau_{{\rm OFF}} and MM is the total numbers of the factors. Now, taking the logarithm of Eq. (20), we obtain

lnτOFF=i=1Mlnni\displaystyle\ln\tau_{{\rm OFF}}=\sum_{i=1}^{M}\ln n_{i} (21)

If nin_{i} is independent with each other and MM is sufficiently large, lnτOFF\ln\tau_{{\rm OFF}} follows the normal distribution based on the central limit theorem, or equivalently, τOFF\tau_{{\rm OFF}} follows the log-normal distribution. Now, an important question is what is the factor nin_{i} affecting the stability of the OFF state, leaving to be explored.

III.4 Failure rate analysis

Using the above determined model distribution functions, we discuss aspects of the ON\toOFF and OFF\toON transition processes. The ON\toOFF transition corresponds to a process in which the photon-absorption and emission cycle changes to a charge-separated state. On the other hand, the OFF\toON transition corresponds to a process in which the charge-separated state returns to the ground state. In addition, it was confirmed that the lifetime distribution of the ON state decays exponentially, and that of the OFF state exhibits the log-normal distribution.

Based on these results, we will discuss aspects of the transition processes in terms of the failure analysis in the semiconductor device [44]. In this analysis, for example, the ON\toOFF process is discussed as follows: First, an ON state is defined as a non-faulty state, and an OFF state is defined as a faulty state. The failure analysis tells us the nature of the failures; we can classify whether the ON\toOFF process is based on an accidental event or a wear-out event, as well as the failure type of the semiconductor product.

We first consider a failure probability density f(τ)f(\tau) representing the probability density that a failure will occur between τ\tau and τ+dτ\tau+d\tau. A failure rate λ(τ)\lambda(\tau) is defined in terms of f(τ)f(\tau) as

λ(τ)=f(τ)R(τ),\displaystyle\lambda(\tau)=\frac{f(\tau)}{R(\tau)}, (22)

where R(τ)R(\tau) represents survival function defined as

R(τ)=τf(t)𝑑t=1F(τ)\displaystyle R(\tau)=\int^{\infty}_{\tau}f(t)dt=1-F(\tau) (23)

with F(τ)F(\tau) being the failure distribution function as

F(τ)=0τf(t)𝑑t.\displaystyle F(\tau)=\int^{\tau}_{0}f(t)dt. (24)

The failure rate λ(τ)\lambda(\tau) describes the frequency of breaking down after the usage time τ\tau. Based on λ(τ)\lambda(\tau), the nature of the failures in semiconductor products can properly classified: Figure 6 shows typical behavior of λ(τ)\lambda(\tau), called bathtub-like curve; initially, the λ(τ)\lambda(\tau) decreases because the initial defective products with a certain extent continue to fail. This period is called the early failure period. Then, the λ(τ)\lambda(\tau) becomes constant. This period is called the random failure period. Finally, the λ(τ)\lambda(\tau) rapidly increases as the product deteriorates due to wear. This is wear-out period.

Refer to caption
Figure 6: Schematic figure of a failure rate λ(τ)\lambda(\tau) of semiconductor products. It is known that the λ(τ)\lambda(\tau) exhibits a bathtub-like failure rate as a function of the usage time τ\tau [44]. Initially, the λ(τ)\lambda(\tau) decreases because the products continue to fail due to defective products with a certain extent, and then becomes constant. The former period is called the early failure period, and the latter is the random failure period. Finally, the λ(τ)\lambda(\tau) rapidly increases as the products deteriorate due to wear. This is called the wear-out period.

In Sec. III.2, we confirmed that the probability density of the ON duration follows the exponential function of P(τON)P(\tau_{{\rm ON}}) in Eq. (9) and that of the OFF duration follows the log-normal function P(τOFF)P(\tau_{{\rm OFF}}) in Eq. (10). Now, these P(τON)P(\tau_{{\rm ON}}) and P(τOFF)P(\tau_{{\rm OFF}}) correspond to f(τ)f(\tau) in Eq. (22): Thus, the λ(τON)=P(τON)/R(τON)\lambda(\tau_{{\rm ON}})=P(\tau_{{\rm ON}})\big{/}R(\tau_{{\rm ON}}) describes the failure rate where the ON state breaks to the OFF state after duration τON\tau_{{\rm ON}}. Similarly, the λ(τOFF)\lambda(\tau_{{\rm OFF}}) represents the failure rate where the OFF state changes to the ON state after duration τOFF\tau_{{\rm OFF}}. By comparing the λ(τON)\lambda(\tau_{{\rm ON}}) and λ(τOFF)\lambda(\tau_{{\rm OFF}}) with the bathtub-like curve in Fig. 6, we can discuss the nature of the transition processes of the ON\toOFF and OFF\toON.

Figure 7 shows our calculated failure rate λ(τ)\lambda(\tau) in Eq. (22), denoted by red-dashed curve. Survival function R(τ)R(\tau) in Eq. (23) is also plotted with green-dotted curve, and blue-solid curve is the model probability density P(τ)P(\tau) given in Fig. 4. The upper (a) and (b), middle (c) and (d), the lower (e) and (f) panels are the results for the time series with Δ\Delta = 250 μ\musec, Δ\Delta = 125 μ\musec, and Δ\Delta = 62.5 μ\musec, respectively. Also, the left (a), (c), and (e) panels are results for the ON\toOFF process, while the right (b), (d), and (f) panels correspond to the results for the OFF\toON process. We see that the λ(τON)\lambda(\tau_{{\rm ON}}) is constant, which is a consequence of the exponential decay of P(τON)P(\tau_{{\rm ON}}). The behavior of λ(τON)\lambda(\tau_{{\rm ON}}) is the same as the random failure period in the bathtub-like curve of Fig. 6, and then the ON\toOFF process would be recognized as an accidental event; the light-emitting ON state switches to the not-emitting OFF state accidentally.

Refer to caption
Figure 7: Our calculated failure rate λ(τ)\lambda(\tau) in Eq. (22) as a function of duration τON\tau_{{\rm ON}} and τOFF\tau_{{\rm OFF}}, denoted by red-dashed curves. Survival function R(τ)R(\tau) in Eq. (23) is described with green-dotted curves. Blue-solid curves are the probability density function P(τ)P(\tau) displayed in Fig. 4. The upper (a) and (b), middle (c) and (d), and lower (e) and (f) panels are the results for the time series with Δ\Delta = 250 μ\musec, Δ\Delta = 125 μ\musec, and Δ\Delta = 62.5 μ\musec, respectively. Also, left (a), (c), and (e) panels are results for the ON duration, while right (b), (d), and (f) panels correspond to the results for the OFF duration.

On the other hand, the OFF\toON process would not be so simple; the λ(τOFF)\lambda(\tau_{{\rm OFF}}) first increases, reaches a maximum, then decreases. The initial increasing trend of the λ(τOFF)\lambda(\tau_{{\rm OFF}}) seems to correspond to the wear-out period in Fig. 6. However, after that, it decreases, which is not observed in the bathtub curve of Fig. 6. In the case of practical semiconductor products, once the wear-out failures begin, all the products would break down within the wear-out period. Therefore, this decreasing trend may not be observed in the semiconductor-product case. In the present case, the situation might be a little different; at τOFF=0\tau_{{\rm OFF}}=0, P(τOFF)=0P(\tau_{{\rm OFF}})=0 and the OFF state does not react in the very beginning. Then, the OFF state begins to break down. Around the inflection point of R(τOFF)R(\tau_{{\rm OFF}}), λ(τOFF)\lambda(\tau_{{\rm OFF}}) takes a peak, and then decreases. The peak position is at around 5 msec, and most of the OFF states return to the ON states in about 30 msec. In any case, when it comes to wear-out failure, various factors, such as the details of the molecular structure and the environment, can be relevant to the failure.

IV Summary

In this study, we have analyzed the blinking phenomenon of fluorescently labeled DNA using HMM. HMM is an effective method for removing noise in the time series data. The time series after removing noise allows for quantitative determination of the ON (light-emitting) and OFF (not-emitting) states. The length of the fluorescence trajectory is short in the present system, about 1 sec, and the ON and OFF states switches drastically in the trajectory, but HMM works very stably. We have performed analyses for the fluorescence trajectories with different time bins and carefully discussed the time bin dependence of the results. We have found that the probability density of the ON duration is represented by an exponential function, and the OFF-duration probability density is expressed by a log-normal function, which have been verified in terms of the KS test. We have found that the pp-value of the exponential distribution is 0.2-0.6, and that of the log-normal distribution is \sim0.3, and these are larger than the significant level of 0.05. Based on the determined distribution functions, we have performed an analysis based on a failure rate used in the reliability engineering of semiconductor products. We have found that the ON\toOFF process is based on a random failure process, and the OFF\toON process in the early stage is similar to a wear-out failure process.

If the probability density of the OFF duration τOFF\tau_{{\rm OFF}} follows the log-normal distribution, the exponent αOFF\alpha_{{\rm OFF}} of τOFF\tau_{{\rm OFF}} follows the normal distribution (Appendix A). The mean and standard deviation of αOFF\alpha_{{\rm OFF}} describe something reaction details of the OFF\toON process; in view of the Arrhenius plot, the mean and standard deviation might be related to the activation energy and its fluctuation with describing the degree of the inhomogeneity effect. More quantitative discussions would be desirable.

In the present analysis for the fluorescence trajectory, we assumed that the noise follows Gaussian distribution (Sec. II.2). Although the noise in the time series data often follows Gaussian, the nature of the noise can vary depending on the system and environment. For single-molecule measurements, especially, the nature of the noise is likely to reflect the environmental effects, and machine learning techniques that can handle quantitative nature of the noise, including HMM, will be important.

V Acknowledgments

This research was supported by JSPS KAKENHI Grant Numbers JP22H01183, JP23H01126, and JP21H02059, the establishment of university fellowships towards the creation of science technology innovation, Grant Number JPMJFS2133. K.K. and A.M. acknowledge support from the Five-Star Alliance in NJRC Mater. & Dev.

Appendix A Probability density for logarithm value of the OFF duration

Here, we discuss the OFF\toON process in terms of the logarithm transform of τ\tau as

α=lnτ,\displaystyle\alpha=\ln\tau, (25)

where α\alpha represents an exponent when the τ\tau is expressed as an exponential function, and it can be seen as describing the right-hand side of Eq. (21). The log-normal distribution of τ\tau in Eq. (10) is rewritten to the normal distribution of α\alpha as

P(α)=12πσ2exp((αμ)22σ2),\displaystyle P(\alpha)=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\biggl{(}-\frac{(\alpha-\mu)^{2}}{2\sigma^{2}}\biggr{)}, (26)

where we use a transform of P(τ)dτ=P(α)dαP(\tau)d\tau=P(\alpha)d\alpha. Thus, the log-normal distribution of τ\tau is mathematically equivalent to the normal distribution of α\alpha. It is noted that parameters μ\mu and σ\sigma of the above normal distribution are the same as those of the log-normal distribution in Eq. (10).

Figure 8 is our calculated probability density for α\alpha, described by red bars. The grid spacing of α\alpha is 0.25, and the total number of the α\alpha grids is 21 with the Δ\Delta=250-μ\musec case, 26 with the Δ\Delta=125-μ\musec case, and 29 with the Δ\Delta=62.5-μ\musec case. The blue-solid curves describe the fitted normal distribution functions. Fitting accuracy is summarized in Table 1, where the accuracy is evaluated for the data points of p(αk)0p(\alpha_{k})\neq 0. Figures 8 (a), (b), and (c) correspond to the results for Δ\Delta = 250 μ\musec, 125 μ\musec, and 62.5 μ\musec, respectively. From the figure, we see that the fitting is reasonable, but as the Δ\Delta decreases, the probability around the small α\alpha appears. This is related to the short-lived OFF states, and it may be due to quantitative uncertainty of the OFF duration evaluated with the HMM simulation. We note that the KS-pp value of the normal distribution P(αOFF)P(\alpha_{{\rm OFF}}) is the same as that of the log-normal distribution P(τOFF)P(\tau_{{\rm OFF}}).

Refer to caption
Figure 8: Calculated probability density for αOFF=lnτOFF\alpha_{{\rm OFF}}=\ln\tau_{{\rm OFF}} in Eq. (25), denoted by red-bars. Blue-solid curves represent normal distributions P(αOFF)P(\alpha_{{\rm OFF}}) in Eq. (26), and model parameters and fitting accuracy are summarized in Table 1. The panels (a), (b), and (c) panels correspond to the results for Δ\Delta = 250 μ\musec, 125 μ\musec, 62.5 μ\musec, respectively.

References