
Optimization of array encoding for ultrasound imaging

Jacob Spainhour1, Korben Smart2, Stephen Becker1 and Nick Bottenus3
1 Department of Applied Mathematics, University of Colorado Boulder, USA
2 Department of Physics, University of Colorado Boulder, USA
3 Paul M. Rady Department of Mechanical Engineering, University of Colorado Boulder, USA
Abstract

Objective: The transmit encoding model for synthetic aperture imaging is a robust and flexible framework for understanding the effects of acoustic transmission on ultrasound image reconstruction. Our objective is to use machine learning (ML) to construct scanning sequences, parameterized by time delays and apodization weights, that produce high-quality B-mode images.

Approach: We use a custom ML model in PyTorch with simulated RF data from Field II to probe the space of possible encoding sequences for those that minimize a loss function that describes image quality. This approach is made computationally feasible by a novel formulation of the derivative for delay-and-sum beamforming.

Main Results: When trained for a specified experimental setting (imaging domain, hardware restrictions, etc.), our ML model produces optimized encoding sequences that, when deployed in the REFoCUS imaging framework, improve a number of standard quality metrics over conventional sequences including resolution, field of view, and contrast. We demonstrate these results experimentally on both wire targets and a tissue-mimicking phantom.

Significance: This work demonstrates that the set of commonly used encoding schemes represents only a narrow subset of those available. It also demonstrates the value, for ML tasks in synthetic transmit aperture imaging, of considering the beamformer within the model rather than purely as a post-processing step.

Keywords: synthetic aperture, spatial encoding, numerical optimization, machine learning, image quality

1 Introduction

In classical ultrasound imaging, transmissions from a transducer array are focused towards specific locations in the target medium, and the backscattered echoes are collected and processed into an image that is clear at these locations (Cobbold, 2007). More modern systems use a synthetic transmit aperture approach that combines the echoes from multiple transmissions to recreate this focus across multiple points in the region. In the ideal case, data acquisition is performed by firing each array element individually in sequence while echoes are received by every element in parallel. The full set of element-to-element signals is then combined into a B-mode image via post-processing that achieves focus even at depth (Jensen et al., 2006). However, the large number of transmissions required, paired with the relatively low power of each, means that this procedure, performed directly, creates images with poor frame rate and low signal-to-noise ratio (SNR), making it impractical for clinical application (Chiao et al., 1997).

Retrospective Encoding For Conventional Ultrasound Sequences, or REFoCUS, is an alternative imaging framework that considers the responses from an arbitrary transmission as a linear combination of these pairwise element responses (Bottenus, 2018; Ali et al., 2020). In this way, the transmission sequence, described by time delays and apodization weights, parameterizes an encoding of the ground-truth basis of element-to-element signals, the multistatic (STA) data set. Using knowledge of the specific sequence used to acquire data, the echoed response signals can be decoded to produce a robust approximation of this basis. This approximation can then be focused throughout the imaging domain as post-processing, producing a B-mode image that achieves the desired SNR and focal depth. Conventional transmission sequences are often selected according to geometric principles (e.g., focused, planewave, or diverging beams) or according to spatial codes (e.g., Hadamard (Chiao et al., 1997) or S-sequence (Harrison et al., 2014)). However, the REFoCUS framework generalizes this encoding to allow for a uniform treatment of other categories of transmissions (Bottenus et al., 2023).

The quality of the resultant B-mode image—measured in terms of resolution, field of view (FOV), and artifacts—is highly dependent on properties of the transmit sequence used for data acquisition. As an example of the importance of beam geometry, focused beams provide very high resolution, but only around a particular focal point. Another influential property is simply the number of transmissions. While the number of transmissions ideally matches or exceeds the number of array elements, practical frame rate considerations often restrict this. In the resulting underdetermined case, there is active research that seeks to minimize the loss of information during encoding and decoding. Some groups have shown success using randomly assigned delays, as such schemes lead to a very well conditioned encoding, in the sense that the relevant encoding matrices are full rank and therefore stably invertible (You et al., 2021). Similarly, a spatial code for element apodizations can be applied to produce a lossless encoding (or near-lossless when the matrix must be truncated (Zhang et al., 2022)), which in turn increases SNR during imaging.

However, the space of all possible encoding sequences dwarfs those with these sorts of immediately clear and desirable properties (Spainhour et al., 2022). In this paper, we present a novel machine learning (ML) model for data acquisition and imaging which we train to efficiently search the set of all possible transmission sequences, identifying those that lead to high-quality images using the REFoCUS encoding framework. Notably, our training workflow is designed to measure the quality of a sequence after image formation, better reflecting how performance is judged in a clinical setting. Furthermore, the only trainable parameters of this machine learning model are a physical description of the transmission sequence, such that a trained ML model is parameterized by an optimized encoding sequence. This optimized encoding sequence can then be considered independently of the ML model that generated it, allowing it to be deployed classically within the REFoCUS framework to achieve image quality beyond what is attainable with conventional sequences.

1.1 Related Work

Techniques in machine learning have gained a significant amount of research attention in the field of ultrasound imaging, particularly in the application of deep neural networks (DNNs) (Liu et al., 2019; Hyun et al., 2021). Following the unprecedented successes of the deep convolutional neural network (CNN) AlexNet in non-medical imaging tasks (Krizhevsky et al., 2017), similar networks are frequently trained and utilized for ultrasound-adjacent tasks as an additional post-processing step to a conventional imaging pipeline by performing classification, segmentation, etc. on image data (Mischi et al., 2020).

Other deep learning approaches perform image formation directly. In a prototypical setup, a DNN is fed the received echoes of a transducer array to produce a B-mode image, aiding or even supplanting the use of a conventional beamformer (van Sloun et al., 2020; Goudarzi and Rivaz, 2022). Often the goal is to create a network that can be evaluated faster than classical imaging techniques without degrading image quality (Chen et al., 2021). Other times, the DNN is designed to directly improve image quality. For example, Hyun et al. (2019) utilize a CNN as a beamformer that reduces speckle, citing considerable advantages over conventional delay-and-sum techniques.

While a well-trained DNN can offer these and other benefits, their use necessarily introduces complications. Most notably, neural networks are susceptible to hallucinations, in which a network unexpectedly generates features that are not present in the underlying data, an issue that is uniquely consequential in medical imaging (Bhadra et al., 2021). Our approach to machine learning entirely circumvents this and other issues by using a custom architecture whose forward operation is entirely acoustically motivated. Specifically, the proposed ML method improves imaging quality exclusively through an optimized parameterization of experimental hardware via the transmit sequence. After data is acquired with an optimized sequence, the B-mode image to be evaluated is generated through the REFoCUS imaging framework, which is itself based solely on the linear nature of pulse-echo responses (we describe the decoding and imaging processes in detail in Sections 2.1 and 2.2, respectively).

A similar strategy to our own is employed by Chen et al. (2021), in which a machine learning model is trained to optimize the apodization weights of planewave transmissions, although a DNN is used to decode the received echoes. Specifically, a stacked denoising autoencoder architecture (a type of DNN whose first trainable layer is taken to be the weights themselves) is trained to optimize the reconstruction of the multistatic data set. In contrast, our ML model allows full control of the encoding sequence through a continuous treatment of apodization weights and time delays, with no other trainable parameters. At the same time, our training workflow optimizes the encoding sequence according to direct measures of image quality. In doing so, we provide notable counterexamples to the principle that a better reconstruction of the multistatic data set leads to higher quality images.

By detaching our machine learning model from conventional neural network architectures, our approach obtains theoretical and practical advantages beyond avoiding hallucinations. First, an optimized encoding sequence generated by our ML model can be analyzed separately from the model itself, such that its beneficial features can be identified and further studied. This is in contrast to a trained DNN, whose per-layer weights and biases have no meaning outside the context of the DNN. Second, the smaller parameter space of our ML model dramatically reduces the training time needed to generate an optimized encoding sequence. This allows for greater flexibility in the overall ML workflow, as a different configuration of the ML model can be retrained at a minimal cost when the experimental setting changes (e.g., targeting a different viewing depth or prioritizing different image features via an alternate loss function). Finally, optimization of data acquisition through the transmit sequence makes our imaging process very robust to different targets. While conventional DNNs are typically trained for very specific imaging tasks on a constrained set of training data, we will show that the proposed approach not only generalizes well to out-of-sample training data, but also to data with dramatically different features. This is because our ML model is fundamentally optimized based on the underlying acoustic principles of ultrasound imaging, where the processing of backscattered echoes is influenced primarily through manipulation of the transmission sequence.

2 Methods

2.1 Transmit Encoding Theory

We consider a linear transducer array of $N_E$ physical elements, of which $N_T$ are capable of transmitting a diverging wave and $N_R$ can receive backscattered echoes. Moreover, it is typical to have elements that perform both functions, such that $N_T = N_R = N_E$. Taken across all pairs of transmit and receive elements, the collected RF (radio frequency) signals form a time-dependent $N_T \times N_R$ matrix $\mathbf{U}(t)$, within which the matrix element $u_{TR}(t)$ is the signal observed by transmitting on element $T$ and receiving on element $R$. In the synthetic transmit aperture model, the complete collection of pulse-echo response signals between pairs of transmit and receive elements makes up the multistatic data set. Because these signals can be delayed and summed across the receive channel to produce transmit focus throughout the domain, as in delay-and-sum (DAS) beamforming, these data can be considered the mathematical basis of our imaging (Jensen et al., 2006).

As previously discussed, it is inadvisable to construct $\mathbf{U}(t)$ directly by firing each transmit element in sequence across a total of $N_T$ transmissions (Chiao et al., 1997). Instead, the REFoCUS technique seeks to construct an approximation of $\mathbf{U}(t)$ using the responses from an arbitrary scanning sequence. We consider a scanning sequence of $N_M$ separate transmissions, each of which is parameterized by $N_T$ time delays $t_{MT}$ and apodization weights $w_{MT}$. The response observed by receive element $R$ from transmit $M$ can be described as a weighted sum of time-delayed matrix elements from the multistatic data set,

$$s_{MR}(t) = \sum_{T=1}^{N_T} w_{MT}\, u_{TR}(t - t_{MT})\,, \qquad (1)$$

which are similarly collected into a time-dependent $N_M \times N_R$ matrix $\mathbf{S}(t)$ of focused responses. Collected across each transmit, these time delays and apodization weights make up the encoding sequence, denoted $\boldsymbol{\sigma} = (t, w)_{MT}$.

It is now convenient to work in the frequency domain, where each time delay of $u_{TR}(t)$ becomes a complex phase shift of $u_{TR}(\omega)$ (Ali et al., 2020). For simplicity, we represent the frequency-dependent Fourier transforms of $\mathbf{U}(t)$ and $\mathbf{S}(t)$ as $\mathbf{U}(\omega)$ and $\mathbf{S}(\omega)$, respectively. In doing so, we can express the transmission response of receive element $R$ to transmission $M$ as

$$s_{MR}(\omega) = \sum_{T=1}^{N_T} w_{MT} \exp(-j\omega t_{MT})\, u_{TR}(\omega)\,. \qquad (2)$$

This describes a linear relationship between the multistatic data set $\mathbf{U}(\omega)$ and the transmission responses $\mathbf{S}(\omega)$ that is fully parameterized by $\boldsymbol{\sigma}$, given by

$$\mathbf{S}(\omega) = \mathbf{H}(\omega)\, \mathbf{U}(\omega)\,, \qquad (3)$$

where $\mathbf{H}(\omega)$ is a frequency-dependent $N_M \times N_T$ encoding matrix given by

𝐇(ω)=[w1,1exp(jωt1,1)w1,Texp(jωt1,T)w2,1exp(jωt2,1)w2,Texp(jωt2,T)wM,1exp(jωtM,1)wM,Texp(jωtM,T)].\mathbf{H}(\omega)=\begin{bmatrix}w_{1,1}\exp(-j\omega t_{1,1})&\dots&w_{1,T}\exp(j\omega t_{1,T})\\ w_{2,1}\exp(-j\omega t_{2,1})&\dots&w_{2,T}\exp(j\omega t_{2,T})\\ \vdots&\ddots&\vdots\\ w_{M,1}\exp(-j\omega t_{M,1})&\dots&w_{M,T}\exp(j\omega t_{M,T})\\ \end{bmatrix}\,. (4)

In practice, we collect only discrete, equispaced samples from $\mathbf{S}(t)$, from which we have a total of $N_\omega$ angular frequencies $\omega$. Because calculations involving the collection of matrices $\mathbf{S}(\omega)$ are typically performed in parallel across frequencies, we use calligraphic letters to denote the collection over all frequencies. As an example, we denote $\boldsymbol{\mathcal{S}}_\omega = \mathbf{S}(\omega)$, and represent the above equation more compactly as $\boldsymbol{\mathcal{S}} = \boldsymbol{\mathcal{H}} \boldsymbol{\mathcal{U}}$. When we wish to emphasize the dependence of our encoding matrices $\boldsymbol{\mathcal{H}}$ on the sequence $\boldsymbol{\sigma}$, we add $\boldsymbol{\sigma}$ as a subscript.
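To make this concrete, the per-frequency encoding matrices of Equation 4 can be assembled directly from the sequence parameters with a few tensor operations. The following is a minimal PyTorch sketch, assuming `w` and `t` are real $(N_M, N_T)$ tensors of weights and delays and `omegas` is a length-$N_\omega$ tensor of angular frequencies; all names are illustrative rather than taken from our released code.

    import torch

    def encoding_matrices(w, t, omegas):
        # Per-frequency phase shifts, shape (N_omega, N_M, N_T)
        phase = torch.exp(-1j * omegas[:, None, None] * t[None, :, :])
        # Apodization weights are frequency independent, so broadcast them
        return w[None, :, :].to(phase.dtype) * phase

    # Encoding the multistatic data as in Eq. (3), batched over frequency:
    # with U of shape (N_omega, N_T, N_R), S = H @ U has shape (N_omega, N_M, N_R)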

The objective of the REFoCUS framework is to create an approximation $\widehat{\boldsymbol{\mathcal{U}}}$ of the multistatic data set from the recorded echoes in $\boldsymbol{\mathcal{S}}$. This is done by applying a frequency-dependent decoder $\mathbf{H}^{\dagger}(\omega)$ to $\mathbf{S}(\omega)$, resulting in

$$\widehat{\boldsymbol{\mathcal{U}}} = \boldsymbol{\mathcal{H}}^{\dagger} \boldsymbol{\mathcal{S}} = (\boldsymbol{\mathcal{H}}^{\dagger} \boldsymbol{\mathcal{H}})\, \boldsymbol{\mathcal{U}}\,. \qquad (5)$$

This decoder is theoretically arbitrary, with options explored in the literature ranging from a per-frequency conjugate transpose (Bottenus, 2018) to a fully trained neural network (Chen et al., 2021). In this work, we follow Ali et al. (2020) and Bottenus et al. (2023) and apply a Tikhonov-regularized pseudoinverse that has been shown to work well experimentally. This choice of decoder depends only on the original encoding matrix, which is itself dependent only on our sequence $\boldsymbol{\sigma}$. The Tikhonov decoder is given explicitly by

$$\boldsymbol{\mathcal{H}}^{\dagger} = (\boldsymbol{\mathcal{H}}^{*} \boldsymbol{\mathcal{H}} + \gamma^{2} \boldsymbol{\mathcal{I}}_{N_E})^{-1} \boldsymbol{\mathcal{H}}^{*}\,, \qquad (6)$$

where $\boldsymbol{\mathcal{H}}^{*}$ is the conjugate transpose of $\boldsymbol{\mathcal{H}}$ and $\gamma$ is a regularization constant. This regularization both suppresses the contribution of noise to the recovered multistatic channel data and produces a more stable inversion of $\boldsymbol{\mathcal{H}}$, which together results in a more accurate and robust recovery of $\widehat{\boldsymbol{\mathcal{U}}}$. This is particularly important in the underdetermined system of interest, in which the number of transmits $N_M$ is significantly smaller than the number of array elements $N_E$. In all usages of regularization herein we take a value of $\gamma = 0.1\,\sigma_{\text{max}}(\omega)$, where $\sigma_{\text{max}}(\omega)$ is the spectral norm of each per-frequency encoding matrix $\mathbf{H}(\omega)$. This particular value has been observed empirically to avoid issues of over- or under-regularization in the presence of noise (Bottenus et al., 2023).
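In code, the decoder of Equation 6 reduces to a batched linear solve over frequencies. A minimal PyTorch sketch, assuming complex tensors `H` of shape $(N_\omega, N_M, N_T)$ and `S` of shape $(N_\omega, N_M, N_R)$; the function and variable names are ours for illustration.

    import torch

    def tikhonov_decode(H, S, gamma_frac=0.1):
        Hc = H.conj().transpose(-2, -1)                 # H*, shape (N_omega, N_T, N_M)
        # Per-frequency spectral norm used to scale the regularization constant
        sigma_max = torch.linalg.matrix_norm(H, ord=2)  # shape (N_omega,)
        gamma2 = (gamma_frac * sigma_max) ** 2
        eye = torch.eye(H.shape[-1], dtype=H.dtype, device=H.device)
        A = Hc @ H + gamma2[:, None, None] * eye        # (H*H + gamma^2 I) per frequency
        return torch.linalg.solve(A, Hc @ S)            # decoded multistatic data U-hat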

2.2 Beamforming and Imaging Procedure

Given a multistatic data set $\boldsymbol{\mathcal{U}}$, we produce the 2D B-mode image of interest with conventional DAS beamforming. This technique maps ground-truth or approximated multistatic IQ data (in-phase and quadrature data derived from the experimentally acquired RF data) at each frequency to a single IQ data matrix, taking a sum of the time-delayed data in $\widehat{\boldsymbol{\mathcal{U}}}$. By manipulating these post-acquisition time delays, an image is created that is focused at individual pixels in the imaging domain. Mathematically, we denote this beamforming operation as $\mathcal{B}: \mathbb{C}^{N_M \times N_R \times N_\omega} \to \mathbb{C}^{N_x \times N_z}$, where $(N_x, N_z)$ are the number of pixels in the lateral and axial directions of our image.

To make the resulting image suitable for interpretation by a human observer, we apply a number of non-linear post-processing steps to this re-focused IQ data. These include the computation of the signal envelope, log-scaling the dynamic range, and clipping the image to a minimum decibel value. In our notation, we represent these operations collectively as $|\cdot|_{\text{Im}}: \mathbb{C}^{N_x \times N_z} \to \mathbb{R}^{N_x \times N_z}$. When applied in the experimental context, the total forward operation of our ML model takes in transmission responses collected in $\boldsymbol{\mathcal{S}}$ and an encoding sequence $\boldsymbol{\sigma}$, and produces an image represented by the real matrix $|\mathcal{B}(\boldsymbol{\mathcal{H}}_{\boldsymbol{\sigma}}^{\dagger} \boldsymbol{\mathcal{S}})|_{\text{Im}}$.
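The post-processing operator $|\cdot|_{\text{Im}}$ is simple to express directly. A minimal sketch in PyTorch, where the 60 dB dynamic range is an illustrative choice rather than a value fixed by our method:

    import torch

    def to_bmode(iq_image, dyn_range_db=60.0):
        env = iq_image.abs()                         # signal envelope
        env = env / env.max()                        # normalize so the peak sits at 0 dB
        img_db = 20.0 * torch.log10(env + 1e-12)     # log-scale the dynamic range
        return img_db.clamp(min=-dyn_range_db)       # clip to a minimum decibel value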

To ensure that the generated encoding sequences are useful in a clinical setting, we train our ML model with a loss function $\mathcal{L}: \mathbb{R}^{N_x \times N_z} \to \mathbb{R}^{+}$ that directly measures image quality. While there is flexibility in the specific choice of loss function (Hyun, 2022), achieving a lower value for the loss should correspond to higher resolution, higher SNR, wider FOV, fewer artifacts, etc.

A clear choice is to use a loss function that quantifies these qualities directly. For example, the generalized contrast to noise ratio (gCNR) is a measure of histogram overlap between the brightness of predefined speckle and anechoic regions, such that a higher value indicates greater contrast in the image (Rodriguez-Molares et al., 2019). We can in principle improve the gCNR by minimizing the loss function

$$\mathcal{L}_{\text{gCNR}}(\widehat{\mathbf{X}}) := 1 - \text{gCNR}(\widehat{\mathbf{X}})\,. \qquad (7)$$

However, complications with implementation make this loss function impractical. In particular, the histogram operator is not differentiable and standard continuous approximations are unstable, which makes the metric incompatible with backpropagation during training.
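Outside of training, where differentiability is not required, the metric itself remains straightforward to evaluate. A sketch in Python, where the region arrays and bin count are illustrative:

    import numpy as np

    def gcnr(region_in, region_out, bins=256):
        # Histogram both regions over a common brightness range
        lo = min(region_in.min(), region_out.min())
        hi = max(region_in.max(), region_out.max())
        h_in, _ = np.histogram(region_in, bins=bins, range=(lo, hi))
        h_out, _ = np.histogram(region_out, bins=bins, range=(lo, hi))
        p_in = h_in / h_in.sum()
        p_out = h_out / h_out.sum()
        # gCNR is one minus the overlap of the two brightness distributions
        return 1.0 - np.minimum(p_in, p_out).sum()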

On the other hand, a natural and easily implementable choice is a normalized $\ell^2$ comparison to a reference image that has desirable qualities. This introduces further flexibility in the choice of imaging target, which we discuss in Section 2.4 and Section 4.3. With a given image $\widehat{\mathbf{X}}$ and reference image $\mathbf{X}$, this loss is then defined by

$$\mathcal{L}_{\ell^2}(\widehat{\mathbf{X}}; \mathbf{X}) := \frac{1}{N_x N_z} \big\| \mathbf{X} - \widehat{\mathbf{X}} \big\|_2^2\,. \qquad (8)$$

Importantly, we will show that training the ML model to minimize $\mathcal{L}_{\ell^2}$ still makes a meaningful improvement to the gCNR of produced images.
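In code, this loss is a single line; a sketch consistent with Equation 8:

    def l2_loss(X_hat, X):
        # Mean squared pixel difference equals (1 / (N_x N_z)) * ||X - X_hat||_2^2
        return ((X - X_hat) ** 2).mean()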

2.3 Implementation of the Proposed Machine Learning Model

We define an optimized encoding sequence $\boldsymbol{\sigma}^{*}$ as one that, for a given loss function $\mathcal{L}$, beamformer $\mathcal{B}$, and decoder $\boldsymbol{\mathcal{H}}^{\dagger}_{\boldsymbol{\sigma}}$, solves the non-convex optimization problem

$$\min_{\boldsymbol{\sigma} \in \Sigma}\; \mathbb{E}\, \mathcal{L}\left( |\mathcal{B}(\boldsymbol{\mathcal{H}}_{\boldsymbol{\sigma}}^{\dagger} \boldsymbol{\mathcal{S}})|_{\text{Im}} \right)\,, \qquad (9)$$

where this expectation is taken over all possible transmission responses $\boldsymbol{\mathcal{S}}$. As it is impossible to evaluate this expectation directly, we instead approach the minimization problem from a machine learning perspective, minimizing over a representative sample of training data $\{\boldsymbol{\mathcal{U}}_i\}$. Each piece of ground-truth multistatic data is related to a transmission response through our encoding matrix $\boldsymbol{\mathcal{H}}_{\boldsymbol{\sigma}}$ by $\boldsymbol{\mathcal{S}}_i = \boldsymbol{\mathcal{H}}_{\boldsymbol{\sigma}} \boldsymbol{\mathcal{U}}_i$.

Similarly, we must restrict the space of encoding sequences to those that are practically realizable on a physical system, which we denote as $\Sigma$. For example, we limit the total transmission power by restricting the set of apodization weights $\{w_{MT}\}$ to an $\ell^\infty$ ball, effectively clipping each value to the range $[-1, 1]$. Other restrictions stem from quantization of numerical values to match machine clock cycles. These small discretizations are much more dependent on the specific acquisition system, but also less impactful to the optimized result.
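During training, these constraints are enforced by projecting the sequence parameters back onto $\Sigma$ after each update. A minimal sketch of such a projection, where the optional clock-period quantization stands in for hardware-dependent constraints:

    import torch

    def project_onto_sigma(weights, delays, clock_dt=None):
        with torch.no_grad():
            # l-infinity ball on the apodization weights limits transmit power
            weights.clamp_(-1.0, 1.0)
            # Optionally quantize delays to the system clock period
            if clock_dt is not None:
                delays.copy_((delays / clock_dt).round() * clock_dt)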

Altogether, we are left to solve

$$\boldsymbol{\sigma}^{*} = \operatorname*{arg\,min}_{\boldsymbol{\sigma} \in \Sigma} \frac{1}{N_U} \sum_{i=1}^{N_U} \mathcal{L}\left( |\mathcal{B}(\boldsymbol{\mathcal{H}}_{\boldsymbol{\sigma}}^{\dagger} \boldsymbol{\mathcal{H}}_{\boldsymbol{\sigma}} \boldsymbol{\mathcal{U}}_i)|_{\text{Im}} \right)\,. \qquad (10)$$

Numerically, we implement and train the ML model that solves Equation 10 in PyTorch, a Python package for building deep learning and other machine learning models (Paszke et al., 2019). This platform is flexible enough to accommodate our unique architecture, which lacks the abstract layers of a neural network and instead propagates information only through acoustically motivated operations, namely the encoding/decoding of the multistatic data and the generation of a B-mode image with DAS beamforming. In all other respects, we mirror the conventional ML pipeline by minimizing our loss function over batches of training data. We perform the numerical optimization using the Adam algorithm, an adaptive version of standard stochastic gradient descent (Kingma and Ba, 2017), and accommodate the constraint set $\Sigma$ for physically realizable encoding sequences with a simple projection of our sequence during each update. The update loop of our ML model is depicted in Figure 1, in which we emphasize the portion that is deployed in an experimental context.
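Put together, the update loop has the familiar PyTorch shape. A schematic sketch under the configuration described above, where `model`, `loader`, `l2_loss`, and `project_onto_sigma` are the illustrative pieces sketched in this section (the forward pass computes $|\mathcal{B}(\boldsymbol{\mathcal{H}}_{\boldsymbol{\sigma}}^{\dagger}\boldsymbol{\mathcal{H}}_{\boldsymbol{\sigma}}\boldsymbol{\mathcal{U}})|_{\text{Im}}$):

    import torch

    optimizer = torch.optim.Adam([model.weights, model.delays], lr=0.1)
    for epoch in range(25):
        for U_batch, X_target in loader:          # simulated multistatic data + targets
            optimizer.zero_grad()
            X_hat = model(U_batch)                # encode, decode, beamform, post-process
            loss = l2_loss(X_hat, X_target)       # image-domain loss of Eq. (8)
            loss.backward()                       # uses the custom DAS adjoint below
            optimizer.step()
            project_onto_sigma(model.weights, model.delays)  # keep sigma inside Sigma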

Figure 1: The training procedure for the proposed ML model for data acquisition and imaging, which is fully parameterized by an encoding sequence. Simulated multistatic training data is encoded and decoded according to our model parameters, and an image is formed using delay-and-sum beamforming. The resulting B-mode image is evaluated by comparison to a target image, or with some other data obtained during the simulation, i.e., target position.

Effectively updating the parameters of any ML model requires knowledge of first-order derivatives, which we obtain using the reverse-mode automatic differentiation, or backpropagation, capabilities of PyTorch (Griewank and Walther, 2008). In the context of ultrasound imaging, standard techniques are often sufficient when the loss function depends primarily on multistatic data (Chen et al., 2021) or on individual pieces of RF data (Noda et al., 2020). This is because platforms like PyTorch have built-in analytic derivatives for operations that are common in deep learning applications, such as matrix multiplication and Fourier transforms. Gradients for all other operations can be constructed at runtime at the cost of additional computation time and memory overhead. However, this process is made difficult when an ML model directly incorporates beamforming, which we consider a critical component for producing high-quality encoding sequences (See Section 4.4). Backpropagation through the DAS beamformer rapidly becomes a computational bottleneck when it is implemented directly as a sum of linear interpolations of IQ data into per-pixel focused channel data, as the number of intermediate values that must be stored in memory grows rapidly. This is especially relevant while training the proposed ML model, as we must simultaneously perform this expensive calculation over entire batches of training data.

In our application, we circumvent this issue through a novel PyTorch implementation of the derivative of the DAS beamforming operator $\mathcal{B}$. Because the beamformed image $\mathcal{I}$ is linear as a function of the multistatic data $\boldsymbol{\mathcal{U}}$ (Perrot et al., 2021), the analytic derivative needed for backpropagation is simply its adjoint operator. We provide PyTorch the DAS adjoint for arbitrary data, which eliminates the additional cost and memory overhead.

We do not directly construct and transpose the DAS beamforming matrix, as it is impractically large for reasonably sized multistatic data, even under a sparse representation (Perrot et al., 2021). Instead, we apply this adjoint matrix-free, recognizing that $\mathcal{B}$ can be described as a sum of linear interpolations, or

$$\mathcal{B}(\boldsymbol{\mathcal{U}}) = \text{Sum}(\text{Interpolate}(\boldsymbol{\mathcal{U}}))\,.$$

From this, simple analytic formulas give the adjoint as

$$\mathcal{B}^{*}(\mathcal{I}) = \text{Interpolate}^{*}(\text{Sum}^{*}(\mathcal{I}))\,.$$

These two component adjoints have known, albeit obscure formulas (Claerbout, 2014), and so we provide pseudocode for both operations for clarity. The code used to perform the optimization, along with the RF data that support the findings of this study, are available at github.com/jcs15c/optimal_ultrasound_encoding (Spainhour et al., 2024).

Input: $\boldsymbol{\mathcal{U}}$: Multistatic Data
Output: $\mathcal{I}$: Beamformed Image
  Initialize the focused IQ data at each pixel to zero
  for each transmit-receive element pair do
      /* Linearly interpolate per-pixel time delays onto the channel data for this element pair */
      for each pixel in the image do
          Identify the pair of IQ samples that the pixel's focused time delay falls between
          Add a linear combination of these two IQ samples to the pixel's focused IQ data
Algorithm 1: DAS Beamformer $\mathcal{B}$
Input: $\widetilde{\mathcal{I}}$: Data of the same shape as $\mathcal{I}$
Output: $\widetilde{\boldsymbol{\mathcal{U}}}$: Data of the same shape as $\boldsymbol{\mathcal{U}}$
  Initialize the output to zeros
  for each transmit-receive element pair do
      /* Perform adjoint interpolation of $\widetilde{\mathcal{I}}$ based on per-pixel time delays */
      for each pixel in the image do
          Identify the pair of IQ samples that the pixel's focused time delay falls between
          To each index of the output where this pair is located, add the corresponding value of the linear combination in Algorithm 1
Algorithm 2: DAS Beamformer Adjoint $\mathcal{B}^{*}$
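For concreteness, the following is a minimal PyTorch sketch of how these two algorithms can be registered as a custom autograd pair, so that backpropagation uses the analytic adjoint instead of stored intermediates. It assumes real-valued channel data of shape (n_tx, n_rx, n_samples) and a precomputed table `tau` of per-pixel focusing delays in fractional samples over a flattened pixel grid; the names and shapes are illustrative, not those of our released implementation.

    import torch

    class DASBeamform(torch.autograd.Function):
        @staticmethod
        def forward(ctx, channels, tau):
            # channels: (n_tx, n_rx, n_samples); tau: (n_tx, n_rx, n_pixels)
            n_samples = channels.shape[-1]
            idx = tau.floor().long().clamp(0, n_samples - 2)  # left neighbor of each delay
            frac = tau - idx                                  # linear interpolation weight
            ctx.save_for_backward(idx, frac)
            ctx.n_samples = n_samples
            left = torch.gather(channels, -1, idx)            # sample before the delay
            right = torch.gather(channels, -1, idx + 1)       # sample after the delay
            # Interpolate, then coherently sum over all transmit-receive pairs
            return ((1 - frac) * left + frac * right).sum(dim=(0, 1))

        @staticmethod
        def backward(ctx, grad_image):
            idx, frac = ctx.saved_tensors
            # Adjoint of Sum: broadcast the image gradient to every element pair
            g = grad_image.expand(idx.shape[0], idx.shape[1], -1)
            # Adjoint of Interpolate: scatter-add weighted gradients back onto channels
            grad_channels = torch.zeros(idx.shape[0], idx.shape[1], ctx.n_samples,
                                        dtype=grad_image.dtype, device=grad_image.device)
            grad_channels.scatter_add_(-1, idx, (1 - frac) * g)
            grad_channels.scatter_add_(-1, idx + 1, frac * g)
            return grad_channels, None  # the delay table itself is not trainable here

In use, `DASBeamform.apply(channels, tau)` behaves like $\mathcal{B}$ in the forward pass while the backward pass stores only the index and weight tables, avoiding the per-pixel intermediates described above.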

2.4 Acquisition of Training and Testing Data

Parameter          Simulated array
Element count      64
Element pitch      0.3 mm
Element kerf       0.01 mm
Center frequency   3 MHz
Bandwidth          70%
Table 1: Simulated Field II Transducer Parameters

While training on true in vivo data would best reflect the experimental setting, it is understood that large collections of general purpose, labeled clinical data are rare (Liu et al., 2019). Instead, we utilize simulated multistatic data created with Field II (Jensen, 2004) to train and evaluate our proposed ML model. The simulated transducer configuration is described in Table 1. We have found that multistatic data depicting randomly placed anechoic lesions in an underdeveloped speckle pattern is a particularly effective class of training data, as encoding sequences trained on it generalize well to other classes of data (See Section 4.2).

For training, we generate with Field II a collection of 500 instances of underdeveloped speckle data, of which 20% form a validation set for by-hand tuning of our (reasonably few) hyperparameters. Notably, the proposed model is exposed only to this type of data during training. For thorough testing of the optimized encoding sequence, we use additional samples of underdeveloped speckle data, as well as collections of other types of simulated data. These include conventional imaging targets, namely isolated point scatterers and anechoic lesions in fully developed background speckle. These data are also used to evaluate the optimized encoding sequence according to standard ultrasound imaging quality metrics, such as the cystic resolution and gCNR respectively.

To explore performance on more complex RF data, we follow a procedure similar to Hyun et al. (2019) and generate arrangements of scatterers with amplitudes weighted according to a grayscale image. These images are derived from a set of 160 samples from the publicly available validation set of an ImageNet competition (Howard et al., 2018), where we have extracted and smoothed the middle 256×256 pixels from each and converted them to grayscale. The first type of image-derived data interpolates the pixel values of each grayscale image to nearby scatterers that are distributed throughout the spatial domain, allowing us to consider data with scatterers of arbitrary amplitude. The second type of image-derived data completely removes scatterers near pixels of sufficiently low brightness, resulting in anechoic regions in background speckle that are more arbitrary than standard circular cysts.

We also test our optimized encoding sequences on experimental hardware with both a tissue-mimicking phantom and a wire target, as described in Section 3.4.

We must similarly consider the imaging target used by the loss function in Equation 8 during supervised learning. A natural choice is an “unencoded” image, where each simulated ground-truth multistatic data set is directly focused with DAS beamforming, thereby removing artifacts induced by encoding. Alternatively, one can compare to a more “idealized” target that represents the ground-truth contrast, as this also reduces the influence of any particular speckle pattern during training. Each image of ground-truth contrast is generated alongside the associated simulated multistatic data, where spatial masks are used to create homogeneous anechoic and scattering regions. An alternative approach to describing ground-truth echogenicity would be to directly convolve a high-quality point spread function (PSF) with each scatterer (Goudarzi and Rivaz, 2022), but we find our approach using a spatial mask to be computationally simpler. For image-derived data, the ground-truth contrast is simply the original grayscale image. During training, we use ground-truth contrast targets, and discuss the consequences of this decision in Section 4.3.
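As an illustration of the spatial-mask approach, a ground-truth contrast target for circular anechoic lesions can be generated with a few array operations. A sketch in Python, where the coordinate grids, lesion list, and brightness floor are illustrative choices:

    import numpy as np

    def ground_truth_contrast(x, z, lesions, background_db=0.0, floor_db=-60.0):
        # Pixel grid for the imaging domain
        X, Z = np.meshgrid(x, z, indexing="ij")
        # Homogeneous scattering region everywhere by default
        target = np.full(X.shape, background_db)
        # Spatial mask: clip each circular lesion to the anechoic floor
        for cx, cz, r in lesions:
            inside = (X - cx) ** 2 + (Z - cz) ** 2 < r ** 2
            target[inside] = floor_db
        return target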

In Figure 2 we show an example of each target type for each category of simulated data, highlighting the particular configuration of data that is exclusively used during training.

Figure 2: The types of simulated data and imaging target type on which the ML model is trained and/or evaluated. (a) Conventional imaging targets generated from the uniform responses of individual scatterers in an anechoic field. (b) Imaging targets for image-derived data, where the amplitude of each scatterer is weighted according to a grayscale image. While all categories are used for validation of the ML model, numerical experiments suggest that training on the class of data emphasized in red (ground-truth contrast, underdeveloped speckle) produces the best optimized encoding sequences.

3 Results

3.1 Optimization of Both Time Delays and Apodization Weights

We now demonstrate our capability to effectively train the proposed ML model, thereby producing encoding sequences that meaningfully improve image quality. In the most general case, the available hardware allows us to manipulate both time delays and apodization weights for each array element. This allows the greatest freedom when designing encoding sequences, and predictably the greatest improvement over conventional sequences. As a prototypical configuration of the training workflow, we train the ML model using 400 simulated instances of the previously described underdeveloped speckle background with anechoic lesions, processed in batches of size 8. We evaluate the model with the $\ell^2$ loss function in Equation 8 using ground-truth contrast as the imaging target. The training is performed using the Adam optimizer with a learning rate of 0.1 over 25 epochs. The available Verasonics hardware is limited to a minimum duty cycle of two clock cycles, which effectively nullifies low values for our weights. As a result, we appropriately restrict the parameter space of encoding sequences $\Sigma$ during training (See Section 2.3).

While this is one of many possible configurations of the proposed training workflow (e.g., the choice of loss function can vary according to the specific imaging task), we will see that the optimized encoding sequence created in the above configuration improves image quality fairly broadly across a number of different metrics.

We first demonstrate these improvements by evaluating the optimized encoding sequence on several types of simulated ultrasound data in Figure 3, for which we uniformly observe significantly improved contrast and reduction of scattering artifacts. For comparison, we use two conventional planewave encodings of varying maximum angle extent, as such sequences characterize a common tradeoff between FOV and resolution at depth (Bottenus et al., 2023). For example, the first generates 15 beams with 1 degree of separation, which results in a higher resolution along a narrower FOV. This is particularly notable when imaging point targets, where some scatterers are essentially undetectable. Steering these 15 planewaves sufficiently far apart to cover the entire imaging domain results in shallower contrast for anechoic regions, as there is more significant scattering throughout the domain.

By contrast, the proposed ML model has generated a sequence that recovers this wider FOV simultaneously with improved contrast. Importantly, we observe through the improvements on the image-derived data that the optimized encoding sequence is performant regardless of the specific arrangement of scatterers in the field. Instead, the point spread function for each scatterer is improved more generally and consistently throughout the FOV.

Figure 3: Three choices for 15-transmit encodings applied to different imaging targets. We compare to imaging using planewaves with 1 degree of separation (left), planewaves with 10 degrees of separation (middle), and our novel optimized sequence (right). In all types of simulated data, we see considerable improvements in contrast and resolution over the conventional transmit encodings.

We show these improvements quantitatively in Table 2, where this prototypical optimized encoding sequence leads to improved $\ell^2$ error against the ground-truth contrast. This is the same metric that is minimized over the training set of anechoic lesions in underdeveloped speckle (See Equation 8), and it is therefore natural that our ML model improves this metric for this type of data. However, we see that these improvements in the $\ell^2$ loss are consistent across several other types of imaging targets that are visually quite distinct.

Average $\ell^2$ Loss against Ground-Truth Contrast Map

Target Type                                   1° Spacing    Full FOV Span    Truncated    Optimized
                                              Planewaves    Planewaves       Hadamard     Encoding
Isolated Point Targets                        12.55         20.03            33.87        8.12
Anechoic Lesions in Underdeveloped Speckle    583.9         331.6            366.3        252.9
Anechoic Lesions                              530.8         250.5            257.3        205.2
Image-Derived Contrast                        256.6         166.8            174.4        160.8
Binarized Image-Derived Contrast              369.9         258.8            274.7        230.1
Table 2: We compute the average $\ell^2$ loss against the ground-truth contrast across 50 pieces of simulated data outside the training set. Although the encoding sequence is trained using only the data set of anechoic lesions in underdeveloped speckle, this metric is improved for each other type of data, emphasized in bold.

We can also demonstrate that improvements in this $\ell^2$ loss are strongly associated with improvements in more conventional image quality metrics, namely the gCNR and cystic resolution, thereby justifying the decision to select an $\ell^2$ loss for training of the ML model.

The cystic resolution quantifies detectability of an anechoic cyst in background speckle with a given level of contrast (Ranganathan and Walker, 2007). We compute this metric through cystic contrast, which is the concentration of energy within a certain distance from a point target in an anechoic field. When cystic contrast is considered as a function of target radius, the cystic resolution is the minimum radius that achieves a desired level of contrast. For our examples we use a contrast of -20 dB, a common threshold for lesion detectability that represents an order of magnitude difference from the surrounding speckle. By explicitly plotting the cystic contrast for each encoding sequence in Figure 4(a), we can see that the optimized encoding sequence produces lesion detectability on par with narrowly concentrated planewaves, and that this conclusion is not sensitive to the exact choice of threshold.
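As a sketch of how this metric can be computed from a simulated point-target image, assuming `psf` is a linear-scale envelope image and `X`, `Z` are pixel coordinate grids (our naming, following the definition in Ranganathan and Walker, 2007):

    import numpy as np

    def cystic_contrast_db(psf, X, Z, cx, cz, radius):
        # Fraction of point-target energy falling outside a cyst of the given radius
        energy = np.abs(psf) ** 2
        inside = (X - cx) ** 2 + (Z - cz) ** 2 < radius ** 2
        frac_outside = 1.0 - energy[inside].sum() / energy.sum()
        return 20.0 * np.log10(np.sqrt(frac_outside))

    # The cystic resolution is then the smallest radius achieving a desired
    # contrast threshold, e.g. -20 dB, found by sweeping `radius`.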

Figure 4: (a) The contrast of an anechoic cyst as a function of cyst size, with a vertical line indicating the value of the cystic resolution. (b) The centered point targets for which the cystic resolution is measured. From this, we can see that the optimized encoding results in cystic resolution comparable to that of narrow span planewaves.

In contrast to these narrow planewaves, however, our optimal encoding sequence maintains this level of resolution throughout the imaging domain. To see this, we use Field II to simulate the responses from single, isolated scatterers distributed throughout the viewing window, and compute the cystic resolution for each when imaged with the optimized sequence. Shown in Figure 5, this procedure generates a map of lesion detectability as a function of target position. This strongly suggests that the benefits of our optimized encoding sequence stem from more general acoustic principles, rather than overfitting to the training data, as the ML model is never exposed to these isolated point targets during training.

We additionally measure contrast using the gCNR, which also varies according to the spatial position of the lesion target. To account for this variance, we consider simulated anechoic lesion data for which the position and radius of each lesion is random. We categorize these lesions based on position, and tabulate the average gCNR for each category in Figure 5. To consider more complicated lesion shapes, we also measure the gCNR of the binarized, image-derived data of Figure 2(b), which we record in Table 3. Altogether, we see that our optimized encoding sequence outperforms conventional planewave and Hadamard encodings in terms of both FOV and contrast.

Figure 5: (top) We plot the average gCNR for anechoic lesions in each region of the viewing range, averaged over randomly located targets over 50 instances of simulated data. Values are displayed for one instance of randomly located targets. (bottom) We plot the cystic resolution throughout the domain. Observe that only our optimized encoding sequence is able to maintain the same quality measured by gCNR and cystic resolution throughout the imaging domain.
Average gCNR

Target Type                          1° Spacing    Full FOV Span    Truncated    Optimized
                                     Planewaves    Planewaves       Hadamard     Encoding
Binarized Image-Derived Contrast     0.882         0.654            0.888        0.891
Table 3: We compute the average gCNR over 50 pieces of out-of-sample data whose ground-truth contrast is derived from a grayscale image. Observe that our optimized encoding results in the highest gCNR, emphasized in bold.

We also compare our optimized sequence to standard techniques in apodization, which similarly seek to improve the PSF by reducing side lobes (Jeong and Kwon, 2020). To do this, we consider planewaves with $10^\circ$ spacing, which offers good resolution at depth while imaging the full FOV, and apply a Tukey window to the receive channel. As expected, this improves cyst detectability in the image considerably. However, it does so at the cost of resolution, as shown in Figure 6. Specifically, there are still clear artifacts around point scatterers, and the striation in the cystic resolution map for the unapodized case is still present after apodization. These effects are not present in our optimized encoding sequence.

Figure 6: We compare our technique to a classical application of apodization. We do this visually (top), through the average gCNR across each row of lesions (middle) and through the cystic resolution (bottom). We observe that although the Tukey window does improve the PSF, there is greater improvement from our optimized sequence.

3.2 Restricted Optimization to Either Time Delays or Apodization Weights

We can also apply the proposed ML model and training framework to more restrictive hardware specifications. In doing so, we demonstrate that there is still considerable improvement present, although such sequences do not perform as well as a fully arbitrary system, as expected. To this end, we train two separate configurations of the ML model. In the first, the apodization weights are fixed to some uniform value, and the time delays are the only trainable parameters of the ML model. In the second, only the apodization weights can be changed, and each time delay is fixed to zero in the ML model. As in Section 3.1, we train the model on anechoic lesions in underdeveloped speckle and test the resulting sequence on data far outside the training set. We compare the two resulting optimized sequences to the appropriate conventional alternative: wide spanning planewaves for delay-only optimization, and a truncated Hadamard encoding for weight-only optimization.

The results of this comparison are compiled in Table 4. We see that in each case the optimized sequence performs better according to the $\ell^2$ loss metric than its conventional alternative, which we have now shown correlates with high quality as measured by gCNR and cystic resolution. Furthermore, we observe that while there is still considerable improvement when only delays are optimized through the training of our ML model, the magnitude of improvement when only optimized weights are generated is comparable to that of a fully optimized encoding sequence. This demonstrates that in this imaging scenario, optimization of the weights has a more significant influence on the quality of the resulting sequence than delay optimization alone.

Average $\ell^2$ Loss against Ground-Truth Contrast Map

                                              Delay Optimization         Weight Optimization        Both
Target Type                                   Optimized    Full FOV      Optimized    Truncated     Optimized
                                                           Planewaves                 Hadamard
Isolated Point Targets                        16.93        20.03         14.53        33.87         8.129
Underdeveloped Speckle                        310.2        331.6         303.3        366.3         252.9
Anechoic Lesions                              231.2        250.5         217.1        257.3         205.2
Image-Derived Contrast                        165.7        166.8         164.2        174.4         160.8
Binarized Image-Derived Contrast              243.3        258.8         241.2        274.7         230.1
Table 4: We consider training ML models whose trainable parameters are restricted to only a single component of the encoding sequence. In both cases, we see that the per-component optimized encoding sequence outperforms the conventional alternative on each type of imaging target.

3.3 Optimization in the Presence of Noise

These simulated results demonstrate the theoretical utility of applying an ML optimization strategy for data acquisition and imaging within the REFoCUS framework. However, we can also verify that our optimized sequences produce meaningful improvements in several practical scenarios. In the first, we consider a separate configuration of the ML model that is trained to emulate an imaging system under the influence of electronic noise. We recreate this during training by adding random noise to the encoded channel data prior to imaging. In the current example, this additive noise is sampled from a normal distribution, filtered to have a bandwidth equal to that of the transducer, and then scaled to produce an average channel signal-to-noise ratio of 20 dB. It is in this context that restricting the set of apodization weights to an $\ell^\infty$ ball is most critical, as otherwise all noise would be suppressed simply by arbitrarily increasing the apodization weights.
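A sketch of this noise-injection step during training, where the band-limiting filter matching the transducer bandwidth is abstracted as a callable `band_filter` (an assumption of this sketch), applied to real-valued channel data `S`:

    import torch

    def add_channel_noise(S, snr_db=20.0, band_filter=None):
        noise = torch.randn_like(S)                  # white Gaussian channel noise
        if band_filter is not None:
            noise = band_filter(noise)               # limit to the transducer bandwidth
        # Scale the noise so the average channel SNR matches the requested level
        scale = torch.sqrt(S.pow(2).mean() / noise.pow(2).mean()) * 10 ** (-snr_db / 20.0)
        return S + scale * noise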

By accounting for this noise during training, the ML model generates a new encoding sequence that specifically suppresses it. We see this in comparison to the optimized encoding sequence evaluated throughout Section 3.1. In Figure 7(a), we consider these two sequences, trained with and without the presence of noise, along with a conventional planewave configuration for comparison. We then apply each sequence to data to which noise either is or is not added. In each of the four cases, we see considerable improvement over the conventional planewave approach. However, we can also see that these improvements are most prominent when the ML model is trained with the same type of noise present during evaluation, the case emphasized in red in Figure 7(a). In Figure 7(b), we consider a more complete evaluation of the case where noisy data is imaged using a sequence trained with the same noise model. There is the same degree of improvement as in the noiseless case of Figure 3, featuring dramatic improvements in the gCNR and cystic resolution.

Figure 7: Effects of noise on training/evaluation of an ML model. (a) We compare encoding sequences generated by different ML models trained with and without the presence of electronic noise. (b) We make additional comparisons to the case emphasized in red in (a), comparing conventional transmit sequences on noisy RF data with a sequence optimized in the presence of additive noise. We see a greater qualitative suppression of noise (top), as well as improvements in gCNR (middle) and the point spread function as measured by the cystic resolution (bottom).

3.4 Experimental Verification

Finally, we verify these results by directly implementing the optimized encoding sequence discussed in Section 3.1 in physical hardware. In the following experiments, we use the Verasonics Vantage 256 research scanner (Verasonics, Inc., Kirkland, WA) with the P4-2v phased array transducer (3 MHz transmit frequency, 64 elements, 0.3 mm pitch, 10 V transmit voltage). Received channel data were stored for processing offline, where they were decoded and used to produce an image in the same manner as in simulation. For comparison, we use sequences of 15 transmits that generate planewaves of varying span (15, 60, and 150 degrees). To reproduce the effects of arbitrarily scaled weights (rather than restricting to $\pm 1$ through pulse inversion), we use the Verasonics apodization function to scale the duty cycle of the transmit excitation so that it matches the output amplitude.

The first experiment utilizes an ATS 539 multi-purpose phantom (CIRS, Inc., Norfolk, VA) to confirm the ability of optimized encoding sequences to improve imaging on tissue-like material. Within the phantom we image anechoic lesions of varying position and radius in background speckle, and compute the gCNR of each with respect to surrounding background speckle. We show these gCNR values below each lesion in Figure 8(a). For a fixed lateral position, we acquire data using each encoding sequence at each of seven elevation positions of the phantom. In this way, we image anechoic lesions in the same positions with different realizations of background speckle, and compute the average gCNR across all realizations.

We can see that these results match closely with our numerical simulations, in that our optimized sequence is able to produce the depth of focus and resolution of a very narrow span of planewaves, while still maintaining a full FOV throughout the range. Importantly, these improvements persist beyond the fixed viewing window of the training data, which extends only to a depth of 55 mm. This further emphasizes that our encoding sequence has not simply specialized to the specific characteristics of the training data.

Figure 8: Experimental validation of simulated results. (a) We image the ATS 539 multi-purpose phantom using different encoding sequences. Although the ML model is optimized for a symmetric FOV, the fixed lateral position of the transducer was intentionally placed off-center to demonstrate improved FOV, while fully capturing each target in the asymmetric phantom. The average gCNR over seven realizations of the speckle pattern is shown beneath each anechoic lesion. (b) We create a montage of 30 individual point target images. Each is obtained by translating the P4-2v transducer around a fixed wire target in a water tank. To provide an appropriate comparison between different targets within the montage, each image is displayed with the same amount of gain (top). We use these images to compute the cystic resolution throughout the domain (bottom).

We can also recreate our simulated point target images, which we can in turn use to measure the cystic resolution throughout the imaging domain (albeit on a much coarser grid). To accomplish this, we image a custom, single-target wire phantom (0.03 mm tungsten wire) in a water tank. Using the AIMS III Hydrophone Scanning System (Onda Corporation, Sunnyvale, CA), we mechanically translate the suspended P4-2v transducer around the fixed wire so that the target can be imaged from a 6×5 grid of target positions (~10 mm lateral spacing, ~8 mm axial spacing).

Although we only ever image one point target at a time with this experimental setup, in Figure 8(b) we present a montage of these images distributed appropriately in space. To ensure a fair visual comparison of each point target, we apply a uniform level of gain to each image in the domain. This allows us to very clearly see the effect of each sequence on the FOV of the imaging system: with the conventional sequences, points in the upper corners of the viewing window are either undetectable or suffer from very poor resolution. In contrast, the optimized encoding sequence provides a uniform degree of resolution at all points in the domain.

Furthermore, we can see that the off-axis scattering artifacts present are greatly reduced when imaged with the optimized sequence. As before, we quantify this effect using the cystic resolution, which we can compute for each point target in Figure 8(b). Although our ability to perfectly capture this metric is slightly hampered by irregularities in the experimental setup, we still observe near uniform improvement over each conventional sequence, analogous to results observed in simulation.

4 Discussion

Our machine learning framework represents a novel method of generating transmit sequences that ultimately result in higher quality images than current standards. However, as with most machine learning applications, maximizing the efficacy of the optimized sequence requires careful consideration of the ML model configuration. For example, one must select a number of standard ML hyperparameters, such as descent algorithm, batch size, learning rate, etc. Furthermore, this particular usage of ML in ultrasound imaging brings about additional, unique considerations. In this section, we discuss our own exploration of these considerations, and make observations that we believe will be relevant in any future ultrasound ML application that directly incorporates a beamformer into the architecture.

4.1 Significance of the Initial Condition

Because the minimization problem presented in Equation 10 is not convex, there is no expectation that any optimization procedure can find the global minimum, should one even exist. Instead, the particular local minimum approached by our optimization procedure is highly dependent on the initial condition. Because locally optimal encoding sequences vary in quality, it is important to select an initial condition that is at or near a good local minimum. The interpretability of the parameters of our ML model offers a distinct advantage over conventional deep learning applications in accomplishing this, as we can intelligently explore the effects of different initial conditions.

For example, an instance of our ML model initialized with a planewave encoding will converge to transmissions that are steered in the same way, albeit with less well-defined wavefronts. On the other hand, initializing with a truncated Hadamard code causes the parameters to converge to a sequence with no visible steering. We can see these qualitative behaviors in Figure 9(a). We have found experimentally that the truncated Hadamard sequence is the superior initial condition, despite being a generally lower quality transmission sequence. This can be seen through the $\ell^2$ loss plotted over the course of training epochs for each initial condition in Figure 9(b), and can be explained by the conventional Hadamard sequence allowing the ML model to begin with near-optimal FOV, as well as both positive and negative apodization weights. Except for cases in which delays or weights must be fixed, as in Section 3.2, each optimized encoding sequence used in this paper was generated by an ML model with this configuration as the initial condition.

Figure 9: A comparison between two different encoding sequences used as initial conditions. (a) For each, we plot the transmit sequence before and after optimization, along with one piece of testing data for each. In the delay plots, each transmission is represented by a single line. (b) For each initial condition, we plot the per-epoch $\ell^2$ loss function averaged over the training and testing data sets.

In our exploration of this topic, we have found that recovering a known, conventional sequence is exceptionally difficult except in the simplest toy problems. For example, although planewave transmissions are generally performant, the ML model cannot identifiably recover the same geometric properties that make them desirable. On the other hand, the optimal characteristics of the resulting sequence are quite opaque: while optimized sequences appear to be a simple perturbation of the initial condition, performing this perturbation randomly fails to achieve the same improvements. This suggests that it would not be possible to discover these sequences without the use of the proposed ML model.

4.2 Significance of the Training Data

In Figure 9(b), we also observe that for either initialization, our ML model performs nearly identically on the training and testing data. This generalization is expected in this context: ultrasound imaging can be formulated as the convolution of a spatially varying point spread function (PSF) with each scatterer in the domain, the results of which are summed to form an image. An effective way to optimize image quality for arbitrary data is therefore to optimize this PSF itself, concentrating it at zero offset (Goudarzi and Rivaz, 2022).
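In symbols, and as a brief aside in our own notation rather than one of this paper's numbered equations, the image b formed from scatterers with amplitudes a_i at positions r_i can be written as

```latex
b(\mathbf{r}) \;=\; \sum_{i} a_{i}\, \mathrm{psf}(\mathbf{r};\, \mathbf{r}_{i}),
```

so concentrating \mathrm{psf}(\,\cdot\,;\mathbf{r}_{i}) around each \mathbf{r}_{i} improves image quality for arbitrary scatterer distributions at once.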

Although our loss function does not reference features of the PSF directly, our choice to use underdeveloped speckle as training data implicitly encourages the same type of improvement at a lesser computational burden. This is in part because the scatterers within each sample are spread out enough that their off-axis scattering artifacts do not necessarily overlap one another, as would be the case with standard background speckle. The ℓ² loss function can therefore only be meaningfully decreased by acoustically concentrating the energy of each scatterer inwards, thereby improving the PSF. At the same time, scatterers across all samples in the training data are densely distributed throughout the imaging domain, so improvements that span the entire FOV can be learned from relatively few samples of this type of training data.
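As a minimal sketch of how one such training sample might be drawn, with the function name, scatterer count, and domain extents all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def underdeveloped_speckle_sample(n_scatterers=50,
                                  x_lim=(-20e-3, 20e-3),
                                  z_lim=(10e-3, 50e-3)):
    """Scatterers sparse enough within one sample that off-axis artifacts
    rarely overlap, yet distributed across samples so that the full FOV
    is densely covered by the training set as a whole."""
    positions = np.column_stack([
        rng.uniform(*x_lim, n_scatterers),   # lateral positions [m]
        np.zeros(n_scatterers),              # elevation [m]
        rng.uniform(*z_lim, n_scatterers),   # axial positions [m]
    ])
    amplitudes = rng.standard_normal(n_scatterers)
    return positions, amplitudes             # inputs to a Field II simulation
```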

4.3 Significance of the Imaging Target

Although we exclusively use a loss function that compares produced B-mode images against some ground-truth imaging target, there is still much flexibility within that choice. As stated in Section 2.4, a natural choice for the imaging target is an image derived from ground-truth multistatic data. While this produces target images free of encoding artifacts, we have found that an ML model trained with these targets produces poor image quality relative to the alternatives. This is because DAS beamforming of even unencoded multistatic data results in a PSF with undesirable features, such as visible side lobe artifacts throughout the domain. When the ML model is trained on such target images, the transmit sequence is implicitly encouraged to reconstruct these artifacts faithfully. Instead, we find that training against the exact ground-truth contrast produces images with a higher quality PSF, as seen in Figure 10.

Figure 10: The quality of the encoding sequence depends heavily on the imaging target used during training. (a) We initialize two ML models identically. (b) When the imaging target of the ℓ² loss is unencoded data (top), the resulting sequence has the same artifacts present in the beamformed image. By using an idealized imaging target, i.e., one that represents the ground-truth contrast, the resulting sequence can suppress these artifacts (bottom).

Additionally, we see the most uniform improvement in image quality when we “pad” images of underdeveloped speckle so that the final viewing window is itself located within an anechoic field. This ensures that the encoding sequence does not depend unnecessarily on the specific domain geometry. The consequences are best seen in the plots of cystic resolution in Figure 11, where the first encoding sequence is trained on images whose extent is exactly that of the final viewing window, while the second is trained on images located in a slightly wider anechoic void. Adding this empty space causes the PSF to vary more smoothly throughout the domain; failing to do so results in additional artifacts for point targets near the center of the region. In effect, these streaks mark positions whose scattering artifacts extend beyond the imaging domain, so the error they introduce is not captured by the loss function.
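The padding itself amounts to enlarging the beamformed extent of the training images relative to the scatterer region, roughly as in the sketch below; the extents and margin are hypothetical values for illustration.

```python
# Final viewing window (hypothetical extents, in meters).
view_x, view_z = (-15e-3, 15e-3), (15e-3, 45e-3)
pad = 5e-3   # anechoic margin added around the view during training

# Scatterers in the training data stay inside the viewing window, but the
# training images are beamformed over the padded region, so off-axis energy
# pushed just outside the view still contributes to the l2 loss.
train_x = (view_x[0] - pad, view_x[1] + pad)
train_z = (view_z[0], view_z[1] + pad)
```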

Figure 11: The imaging target must be carefully selected to ensure there is not an unnecessary dependence on the specifics of the imaging domain. For example, by expanding the window during training (right), we are able to suppress streaking artifacts (indicated with an arrow) that arise from scattering just beyond the imaging domain.

4.4 Significance of the Loss Function

By using our custom implementation of a differentiable beamformer, we are able to train our ML model according to improvements in an imaging metric, which is a novel departure from existing work that only considers the recovery of the multistatic data set. Importantly, we can demonstrate the existence of transmit sequences that produce images with higher resolution and contrast, even though the encoding itself produces a worse reconstruction of 𝓤 from 𝓢.

This is primarily because a perfect reconstruction of 𝓤 would only reproduce the original unencoded image, which lacks many desirable properties, as discussed in Section 4.3. As a concrete example of this phenomenon, we compare against an ML model within our framework that is instead configured to directly optimize recovery of the multistatic data set by minimizing

\mathcal{L}_{\text{STA}}(\widehat{\boldsymbol{\mathcal{U}}};\,\boldsymbol{\mathcal{U}}) := \frac{1}{N_{T}N_{R}N_{\omega}} \big\|\boldsymbol{\mathcal{U}} - \widehat{\boldsymbol{\mathcal{U}}}\big\|_{2}^{2}. \qquad (11)
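For concreteness, a direct PyTorch translation of Equation 11 might look like the following sketch, shown next to the image-domain ℓ² loss for comparison; the tensor shapes and the normalization of the image loss are illustrative assumptions.

```python
import torch

def sta_loss(U_hat, U):
    """Equation 11: mean squared error over the recovered multistatic
    data set (complex tensors of shape (N_T, N_R, N_omega)); note that
    no beamforming is involved in this objective."""
    return (U - U_hat).abs().pow(2).mean()

def image_loss(b_hat, b_target):
    """Image-domain l2 loss (normalization illustrative): because it
    compares beamformed images, its gradient must flow back through
    the differentiable beamformer."""
    return (b_target - b_hat).pow(2).mean()
```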

Consider Figure 12. This new configuration successfully generates an encoding sequence with a lower STA loss than the previously considered sequence optimized for image recovery. Yet despite this numerical improvement, image quality is clearly superior when the optimization process considers image formation, with improvements both visually and in the gCNR averaged over each lesion.

We take this as a counterexample to the common assumption that a better conditioned encoding necessarily corresponds to higher quality images. To account for the effects of Tikhonov regularization, we measure the condition number of the collection of encoding matrices 𝓗 as κ = ‖𝓗₀‖ · ‖𝓗₀†‖, where 𝓗₀ is the encoding matrix for the lowest frequency, which has the poorest conditioning among the collection. Both encoding sequences yield reasonably well-conditioned matrices after Tikhonov regularization, yet there remain clear qualitative differences between the two images.
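A minimal sketch of this measurement, assuming the lowest-frequency encoding matrix is available as a complex 2-D tensor H0 and writing the Tikhonov-regularized pseudoinverse explicitly (gamma is a hypothetical regularization parameter):

```python
import torch

def regularized_condition_number(H0, gamma):
    """kappa = ||H0|| * ||H0_dagger|| with the spectral norm, where
    H0_dagger = (H0^H H0 + gamma I)^{-1} H0^H is the Tikhonov-regularized
    pseudoinverse of the lowest-frequency encoding matrix."""
    Hh = H0.conj().T
    eye = torch.eye(H0.shape[1], dtype=H0.dtype)
    H_dag = torch.linalg.solve(Hh @ H0 + gamma * eye, Hh)
    spectral = lambda A: torch.linalg.matrix_norm(A, ord=2)
    return spectral(H0) * spectral(H_dag)
```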

Figure 12: Comparison of our optimized sequence (left) to one that is trained to optimize reconstruction of the multistatic data set (right). In spite of a worse multistatic reconstruction as measured by the STA loss and a comparable condition number κ for the encoding matrices, our strategy for optimization produces visibly improved contrast and resolution throughout the image.

5 Conclusions

In summary, we have shown that the principles of machine learning can be used to generate encoding sequences that improve the quality of ultrasound imaging within the context of the REFoCUS model. An important theoretical consequence of our ML procedure is its ability to discover currently unknown transmit sequences with desirable properties. For example, the various ML models trained for demonstration throughout this paper have each found transmit sequences that are “high-quality” in a sense that is acoustically meaningful, yet not directly encouraged by the training procedure. That is to say, although each sequence is selected to minimize one particular ℓ² imaging metric, it ultimately improves quality along a number of other metrics across disparate classes of data. At the same time, the exact acoustic properties that these sequences share remain obscure. This means that our ML model has uncovered a currently unexplored regime of transmit sequences, and there remain open questions as to their unifying features.

By using an ML model whose parameters are exactly those of an encoding sequence, this approach offers a high degree of flexibility to different imaging scenarios, making it an important foundation for future endeavors. For example, the high degree of generalizability shown in Figure 3 indicates that, in sharp contrast to other deep-learning image formation and analysis tasks, it may be possible to train with a limited amount of in vivo data without a severe degradation in image quality. The theoretical flexibility of the REFoCUS framework is a key component of this type of investigation, as it allows for uniform treatment of highly variable kinds of transmit sequences. Beyond the proposed framework, we wish to further explore the capabilities of machine learning within REFoCUS, motivated by our ability to incorporate beamforming as a layer of an ML architecture, potentially with its own set of trainable parameters. As has been the case in this work, such technology will ultimately allow us to continue developing techniques that directly improve image quality for applications of interest.

6 Data Availability

All data that support the findings of this study are included within the article (and any supplementary information files).

References

  • Ali et al. (2020) Ali, R., Herickhoff, C. D., Hyun, D., Dahl, J. J. and Bottenus, N. (2020). Extending retrospective encoding for robust recovery of the multistatic data set, IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 67(5): 943–956.
  • Bhadra et al. (2021) Bhadra, S., Kelkar, V. A., Brooks, F. J. and Anastasio, M. A. (2021). On hallucinations in tomographic image reconstruction, IEEE Transactions on Medical Imaging 40(11): 3249–3260.
  • Bottenus (2018) Bottenus, N. (2018). Recovery of the complete data set from focused transmit beams, IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 65(1): 30–38.
  • Bottenus et al. (2023) Bottenus, N., Spainhour, J. and Becker, S. (2023). Comparison of spatial encodings for ultrasound imaging, IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 70(1): 52–63.
  • Chen et al. (2021) Chen, Y., Liu, J., Luo, X. and Luo, J. (2021). ApodNet: Learning for high frame rate synthetic transmit aperture ultrasound imaging, IEEE Transactions on Medical Imaging 40(11): 3190–3204.
  • Chiao et al. (1997) Chiao, R., Thomas, L. and Silverstein, S. (1997). Sparse array imaging with spatially-encoded transmits, 1997 IEEE Ultrasonics Symposium pp. 1679–1682.
  • Claerbout (2014) Claerbout, J. (2014). Geophysical image estimation by example, Lulu.com.
  • Cobbold (2007) Cobbold, R. S. C. (2007). Foundations of Biomedical Ultrasound, Biomedical Engineering Series, Oxford University Press.
  • Goudarzi and Rivaz (2022) Goudarzi, S. and Rivaz, H. (2022). Deep reconstruction of high-quality ultrasound images from raw plane-wave data: A simulation and in vivo study, Ultrasonics 125: 106778.
  • Griewank and Walther (2008) Griewank, A. and Walther, A. (2008). Evaluating Derivatives, second edn, Society for Industrial and Applied Mathematics.
  • Harrison et al. (2014) Harrison, T., Sampaleanu, A. and Zemp, R. (2014). S-sequence spatially-encoded synthetic aperture ultrasound imaging, IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 61(5): 886–890.
  • Howard et al. (2018) Howard, A., Park, E. and Kan, W. (2018). Imagenet object localization challenge.
    https://kaggle.com/competitions/imagenet-object-localization-challenge
  • Hyun (2022) Hyun, D. (2022). A universal end-to-end description of pulse-echo ultrasound image reconstruction, in S. Aylward, J. A. Noble, Y. Hu, S.-L. Lee, Z. Baum and Z. Min (eds), Simplifying Medical Ultrasound, Springer International Publishing, Cham, pp. 128–138.
  • Hyun et al. (2019) Hyun, D., Brickson, L. L., Looby, K. T. and Dahl, J. J. (2019). Beamforming and speckle reduction using neural networks, IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 66(5): 898–910.
  • Hyun et al. (2021) Hyun, D., Wiacek, A., Goudarzi, S., Rothlübbers, S., Asif, A., Eickel, K., Eldar, Y. C., Huang, J., Mischi, M., Rivaz, H., Sinden, D., van Sloun, R. J. G., Strohm, H. and Bell, M. A. L. (2021). Deep learning for ultrasound image formation: Cubdl evaluation framework and open datasets, IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 68(12): 3466–3483.
  • Jensen (2004) Jensen, J. (2004). Simulation of advanced ultrasound systems using Field II, 2004 2nd IEEE International Symposium on Biomedical Imaging: Nano to Macro (IEEE Cat No. 04EX821), pp. 636–639 Vol. 1.
  • Jensen et al. (2006) Jensen, J. A., Nikolov, S. I., Gammelmark, K. L. and Pedersen, M. H. (2006). Synthetic aperture ultrasound imaging, Ultrasonics 44: e5–e15.
  • Jeong and Kwon (2020) Jeong, M. K. and Kwon, S.-J. (2020). A new method for assessing the performance of signal processing filters in suppressing the side lobe level, Ultrasonography 40: 289–300.
  • Kingma and Ba (2017) Kingma, D. P. and Ba, J. (2017). Adam: A method for stochastic optimization.
  • Krizhevsky et al. (2017) Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2017). Imagenet classification with deep convolutional neural networks, Commun. ACM 60(6): 84–90.
  • Liu et al. (2019) Liu, S., Wang, Y., Yang, X., Lei, B., Liu, L., Li, S. X., Ni, D. and Wang, T. (2019). Deep learning in medical ultrasound analysis: A review, Engineering 5(2): 261–275.
  • Mischi et al. (2020) Mischi, M., Lediju Bell, M. A., van Sloun, R. J. G. and Eldar, Y. C. (2020). Deep learning in medical ultrasound—from image formation to image analysis, IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 67(12): 2477–2480.
  • Noda et al. (2020) Noda, T., Tomii, N., Nakagawa, K., Azuma, T. and Sakuma, I. (2020). Shape estimation algorithm for ultrasound imaging by flexible array transducer, IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control PP: 1–1.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J. and Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library.
  • Perrot et al. (2021) Perrot, V., Polichetti, M., Varray, F. and Garcia, D. (2021). So you think you can DAS? a viewpoint on delay-and-sum beamforming, Ultrasonics 111: 106309.
  • Ranganathan and Walker (2007) Ranganathan, K. and Walker, W. F. (2007). Cystic resolution: A performance metric for ultrasound imaging systems, IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 54(4): 782–792.
  • Rodriguez-Molares et al. (2019) Rodriguez-Molares, A., Rindal, O. M. H., D'hooge, J., Måsøy, S.-E., Austeng, A., Lediju Bell, M. A. and Torp, H. (2019). The generalized contrast-to-noise ratio: a formal definition for lesion detectability, IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control pp. 1–12.
  • Spainhour et al. (2022) Spainhour, J., Becker, S. and Bottenus, N. (2022). A strategy for synthetic aperture sequence design using numerical optimization, 2022 IEEE International Ultrasonics Symposium (IUS), pp. 1–4.
  • Spainhour et al. (2024) Spainhour, J., Smart, K., Becker, S. and Bottenus, N. (2024). Source code and data repository for “Optimization of array encoding for ultrasound imaging”.
    https://doi.org/10.5281/zenodo.10689039
  • van Sloun et al. (2020) van Sloun, R. J. G., Cohen, R. and Eldar, Y. C. (2020). Deep learning in ultrasound imaging, Proceedings of the IEEE 108(1): 11–29.
  • You et al. (2021) You, Q., Dong, Z., Lowerison, M. R. and Song, P. (2021). Pixel-oriented adaptive apodization for planewave imaging based on recovery of the complete data set, IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control pp. 1–1.
  • Zhang et al. (2022) Zhang, J., Liu, J., Fan, W., Qiu, W. and Luo, J. (2022). Partial Hadamard encoded synthetic transmit aperture for high frame rate imaging with minimal l2-norm least squares method, Physics in Medicine & Biology 67(10): 1–18.