KST-Mixer: Kinematic Spatio-Temporal Data Mixer For Colon Shape Estimation
Abstract
We propose a spatio-temporal kinematic data mixing method to estimate the shape of the colon, which deforms during colonoscope insertion. Endoscope tracking or navigation systems that guide physicians to target positions are needed to reduce complications such as organ perforation. Although many previous methods track bronchoscopes and surgical endoscopes, few colonoscope tracking methods have been proposed. This is because the colon deforms largely during colonoscope insertion, and the deformation causes significant tracking errors; colon deformation should therefore be taken into account in the tracking process. We propose a colon shape estimation method using a Kinematic Spatio-Temporal data Mixer (KST-Mixer) that can be used during colonoscope insertion into the colon. Kinematic data of the colonoscope and the colon, including positions and directions along their centerlines, are obtained using electromagnetic and depth sensors. The proposed method separates the data into sub-groups along the spatial and temporal axes. The KST-Mixer extracts kinematic features and mixes them along the spatial and temporal axes multiple times. We evaluated colon shape estimation accuracy in phantom studies. The proposed method achieved a mean Euclidean distance error of 11.92 mm, the smallest among the compared methods. Statistical analysis indicated that the proposed method significantly reduced the error compared to previous methods. Code and data of the proposed method are available at: https://github.com/modafone/kst-mixer
keywords:
Colon; Colonoscope tracking; Kinematic data estimation; Multi-layer perceptron
1 Introduction
CT colonography (CTC) is performed to find colonic polyps from CT images. If colonic polyps or early-stage cancers are found in a CTC, a colonoscopic examination is performed to endoscopically remove them. A physician controls the colonoscope based on its camera view during a colonoscopic examination. However, the viewing field is limited and often unclear because the camera is frequently covered by fluid or the colonic wall. Furthermore, the colon changes shape significantly during colonoscope insertion. Physicians require much skill and experience to estimate how the colonoscope travels inside the colon. Inexperienced physicians may overlook polyps or cause complications such as colon perforation. A colonoscope navigation system that guides the physician to the polyp position is needed, and a colonoscope tracking method is necessary as its core.
Many endoscope tracking methods have been proposed (Peters 2008; Deligianni 2008; Rai 2008; Deguchi 2009; Visentini-Scarzanella 2017; Wang 2021; Banach 2021; Gildea 2006; Schwarz 2006; Mori 2005; Luo 2015; Liu 2013; Yao 2021; Ching 2010; Fukuzawa 2015; Oda 2017). They can be classified into image-based, sensor-based, and hybrid methods. The bronchoscope has been the main target of endoscope tracking, and many researchers have proposed image- and sensor-based methods in recent decades. Image-based tracking methods estimate camera movements based on image registration. Registration between temporally continuous bronchoscopic images (Peters 2008) or between real and virtual bronchoscopic images (Deligianni 2008; Rai 2008; Deguchi 2009) is used for tracking. Recent approaches use deep learning-based depth estimation results to improve image-based tracking accuracy (Visentini-Scarzanella 2017; Wang 2021; Banach 2021). Sensor-based tracking methods use small sensors to obtain the bronchoscope position (Gildea 2006; Schwarz 2006). Hybrid methods use both image and sensor information to accurately estimate the bronchoscope position (Mori 2005; Luo 2015). Some research groups have proposed tracking methods for the colonoscope. In colonoscope tracking, image-based methods (Liu 2013) are difficult to apply because unclear colonoscopic images appear frequently. Unclear image removal and disparity map estimation have been utilized to improve tracking accuracy (Yao 2021). Electromagnetic (EM) sensors are used to obtain colonoscope shapes (Ching 2010; Fukuzawa 2015). Unfortunately, these methods cannot guide physicians to polyp positions because they cannot map the colonoscope shape to a colon in a CT volume, which may contain polyp detection results. Combining the colonoscope position or shape information with polyp position information, which can be detected in a CT volume taken prior to colonoscope insertion, is essential in colonoscope navigation.
Such navigation is called colonoscope-CT-based navigation.
A few colonoscope tracking methods are applicable to colonoscope-CT-based navigation. Reference (Oda 2017) used a colonoscope shape measured by an EM sensor and a CT volume for colonoscope tracking. The method obtains two curved lines representing the colon and colonoscope shapes to estimate the colonoscope position in the CT volume coordinate system. It enables real-time tracking regardless of the colonoscopic image quality. However, it does not consider colon deformation caused by colonoscope insertion, and such deformation caused significant tracking errors. Large tracking errors were observed at the transverse and sigmoid colons, which deform significantly during colonoscope insertion. To reduce tracking errors, methods that estimate the shape of the colon with deformation caused by colonoscope insertion were proposed (Oda 2018a, b). One of them (Oda 2018a) uses a shape estimation network with a long short-term memory (LSTM) layer (Hochreiter 1997) to estimate the colon shape, and the other (Oda 2018b) uses regression forests. Their estimation accuracies need to be improved to perform accurate colonoscope tracking.
We propose a novel colon shape estimation method for colonoscope tracking using a Kinematic Spatio-Temporal data Mixer (KST-Mixer). The proposed method estimates the colon shape from time-series shape data of the colonoscope inserted into the colon. The KST-Mixer extracts kinematic features from the colonoscope data using simple multi-layer perceptrons (MLPs). The extracted features are mixed along the spatial and temporal axes in spatio-temporal mixing blocks to dynamically estimate the colon shape. Because the KST-Mixer has a simple processing flow, it provides estimation results in a short processing time and is suitable for real-time colonoscope navigation systems.
The contributions of this paper are summarized as: (1) a novel colon shape estimation method that utilizes spatial and temporal kinematic feature extraction and mixing, (2) a computation time short enough for use in real-time colonoscope tracking, and (3) the smallest shape estimation error among the compared methods.
2 Method
2.1 Overview
The proposed method estimates the colon shape from the colonoscope shape. They are time-series data measured at a specific time interval. The KST-Mixer is trained to estimate a colon shape from time-series colonoscope shapes. After the training, a trained model estimates a colon shape during a colonoscope insertion.
2.2 Colon and colonoscope shape representation
We represent the colon and colonoscope shapes as point sets. The colonoscope shape (Fig. 1 (a)) is a set of 3D positions $p_{t,j}$ and 3D directions $d_{t,j}$, represented as

$$Q_{t} = \{ (p_{t,j}, d_{t,j}) \mid j = 0, \ldots, J-1 \}, \qquad (1)$$

where $t \in \{0, \ldots, T-1\}$ is the index of time, $T$ is the total number of time frames, and $J$ is the number of points in the colonoscope shape. $p_{t,j}$ is a point aligned along the colonoscope centerline. $p_{t,0}$ corresponds to the position of the colonoscope tip. $d_{t,j}$ is a tangent direction of the colonoscope tube at $p_{t,j}$.
The colon shape (Fig. 1 (b)) is a set of 3D points represented as

$$S_{t} = \{ s_{t,k} \mid k = 0, \ldots, K-1 \}, \qquad (2)$$

where $K$ is the number of points in the colon shape. $s_{t,k}$ is a point aligned along the colon centerline. $s_{t,0}$ and $s_{t,K-1}$ correspond to the cecum and the anus positions, respectively.
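As a concrete illustration, the point sets above can be held as plain arrays. Below is a minimal NumPy sketch; the sizes $J = 6$ and $K = 12$ match the sensor and marker counts used later in the phantom study, and the variable names are illustrative:

```python
import numpy as np

# Colonoscope shape Q_t (Eq. 1): J points, each with a 3D position and direction.
J = 6   # number of EM sensors along the colonoscope (Section 3.1)
positions = np.zeros((J, 3))   # p_{t,j}: 3D positions along the centerline
directions = np.zeros((J, 3))  # d_{t,j}: tangent directions at each point
Q_t = np.concatenate([positions, directions], axis=1)  # (J, 6)

# Colon shape S_t (Eq. 2): K points along the colon centerline.
K = 12  # number of surface markers on the phantom (Section 3.2)
S_t = np.zeros((K, 3))  # s_{t,0} = cecum, s_{t,K-1} = anus

print(Q_t.shape, S_t.shape)  # (6, 6) (12, 3)
```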


2.3 Kinematic spatio-temporal data mixer (KST-Mixer)
2.3.1 Overview of KST-Mixer
The KST-Mixer estimates a colon shape from time-series colonoscope shapes. Its architecture is based on MLPs that are repeatedly applied across the spatial or temporal axes. This architecture is inspired by the MLP-Mixer (Tolstikhin 2021), which classifies images utilizing spatial locations and image features. The MLP-Mixer has image classification performance competitive with current methods such as Vision Transformers (ViT) (Dosovitskiy 2021) and provides a short processing time. We utilize the MLP-based architecture to process kinematic data in shape estimation tasks.
2.3.2 Data preparation
To generate the input data of the KST-Mixer, we rearrange the time-series colonoscope shape data as matrices with spatial and temporal axes. The components of the 3D point and direction are represented as $p_{t,j} = (x_{t,j}, y_{t,j}, z_{t,j})^{T}$ and $d_{t,j} = (u_{t,j}, v_{t,j}, w_{t,j})^{T}$. From $p_{t,j}$, we define the positional matrix $P_{t}$ of the time period $[t-N+1, t]$ as

$$P_{t} = \begin{pmatrix} x_{t-N+1,0} & y_{t-N+1,0} & z_{t-N+1,0} & \cdots & z_{t-N+1,J-1} \\ \vdots & & & & \vdots \\ x_{t,0} & y_{t,0} & z_{t,0} & \cdots & z_{t,J-1} \end{pmatrix}, \qquad (3)$$

where $N$ is the time length. We define the directional matrix $D_{t}$ from $d_{t,j}$ similarly.
Values in $P_{t}$ are normalized to the range $[0, 1]$. We regard the normalized matrix as a 2D image to generate non-overlapping, homogeneously sized image patches. The image size is $N \times 3J$ and the size of each patch is $P_{1} \times P_{2}$; hence the number of patches is $N_{\mathrm{P}} = (N \times 3J)/(P_{1}P_{2})$. Each patch contains spatially and temporally local data and is projected to a feature vector of hidden dimension $C$. As a result, we obtain an $N_{\mathrm{P}} \times C$ input matrix of positional data. The order of feature values in the matrix is sensitive to both the spatial and temporal axes; therefore, a position embedding technique such as that employed by ViT is not necessary. We also obtain an $N_{\mathrm{P}} \times C$ input matrix of directional data from $D_{t}$. We make a matrix $X_{t}$ of colonoscope shape data consisting of the positional and directional matrix elements. This process is illustrated in Fig. 2.
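The rearrangement and patch projection described above can be sketched in NumPy as follows; the sizes $N$, $P_1$, $P_2$, and $C$ below are illustrative placeholders, not the settings used in the experiments:

```python
import numpy as np

# Illustrative sizes: N time frames, J colonoscope points,
# patch size P1 x P2, hidden dimension C.
N, J = 16, 6
P1, P2 = 4, 3
C = 32
rng = np.random.default_rng(0)

# Positional matrix P_t (Eq. 3): rows are time frames, columns are the
# (x, y, z) components of the J points, giving an N x 3J "image".
P_t = rng.random((N, 3 * J))

# Normalize values to [0, 1].
P_t = (P_t - P_t.min()) / (P_t.max() - P_t.min())

# Split into non-overlapping P1 x P2 patches and flatten each patch.
n_rows, n_cols = N // P1, (3 * J) // P2
patches = (P_t.reshape(n_rows, P1, n_cols, P2)
              .transpose(0, 2, 1, 3)
              .reshape(n_rows * n_cols, P1 * P2))  # (N_P, P1*P2)

# Project each patch to a C-dimensional feature vector
# (a learned linear projection in the actual network).
W_embed = rng.standard_normal((P1 * P2, C))
X_pos = patches @ W_embed  # (N_P, C)
print(patches.shape, X_pos.shape)  # (24, 12) (24, 32)
```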
We generate additional input data for the KST-Mixer from the insertion length of the colonoscope. The insertion length at time $t$ is $l_{t}$. The set of insertion lengths in the time period $[t-N+1, t]$ is represented as $L_{t} = (l_{t-N+1}, \ldots, l_{t})^{T}$, a column vector of size $N$. The insertion length data is used in the estimation process of the KST-Mixer.

2.3.3 Architecture of KST-Mixer
The inputs of the KST-Mixer are $X_{t}$ and $L_{t}$, and its output is an estimated colon shape $\hat{S}_{t}$. The architecture of the KST-Mixer is shown in Fig. 3. It has spatio-temporal mixing blocks, each of which consists of two MLP blocks, similar to the MLP-Mixer (Tolstikhin 2021). The first is the spatio-temporal feature mixing MLP block (patch mixing MLP block); in this block, the input patch-wise feature matrix is transposed and processed by an MLP. The second is the patch-wise feature extraction MLP block.

Each MLP block has two fully connected layers and an activation function, and dropout is applied. The operations in a spatio-temporal mixing block can be represented as

$$U_{*,i} = X_{*,i} + W_{2} \, g( W_{1} \, \mathrm{LN}(X)_{*,i} ), \quad i = 1, \ldots, C, \qquad (4)$$

$$Y_{j,*} = U_{j,*} + W_{4} \, g( W_{3} \, \mathrm{LN}(U)_{j,*} ), \quad j = 1, \ldots, N_{\mathrm{P}}, \qquad (5)$$

where $X$, $U$, and $Y$ are the input, intermediate, and output feature matrices, and $W_{1}, W_{2}, W_{3}, W_{4}$ are the weight parameters of the fully connected layers. $\mathrm{LN}$ is a layer normalization function (Ba 2016), and $g$ is the GELU activation function (Hendrycks 2016). The subscripts $*,i$ and $j,*$ indicate the column and row vectors to which the operations are applied. Equation (4) is the calculation in the patch mixing MLP block, performed for each column of $X$. The number of hidden units of the first fully connected layer in this block controls the patch mixing. Equation (5) is the calculation in the patch-wise feature extraction MLP block, performed for each row of $U$. The number of hidden units of the first fully connected layer in this block controls the feature extraction from each patch.
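A minimal NumPy sketch of one spatio-temporal mixing block, implementing the patch mixing and patch-wise feature extraction MLPs with biases and dropout omitted; all sizes and weights below are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Layer normalization (Ba 2016) over the channel (last) axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # GELU activation (Hendrycks 2016), tanh approximation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mixing_block(X, W1, W2, W3, W4):
    """One spatio-temporal mixing block; X has shape (N_P, C).

    First line: patch mixing MLP, acting on each column of X (across patches).
    Second line: patch-wise feature extraction MLP, acting on each row of U.
    Biases and dropout are omitted for brevity.
    """
    U = X + W2 @ gelu(W1 @ layer_norm(X))
    Y = U + gelu(layer_norm(U) @ W3.T) @ W4.T
    return Y

# Illustrative sizes: N_P patches, C channels, D_S / D_C hidden units.
rng = np.random.default_rng(0)
N_P, C, D_S, D_C = 24, 32, 48, 64
X = rng.standard_normal((N_P, C))
W1 = rng.standard_normal((D_S, N_P)) * 0.02
W2 = rng.standard_normal((N_P, D_S)) * 0.02
W3 = rng.standard_normal((D_C, C)) * 0.02
W4 = rng.standard_normal((C, D_C)) * 0.02
Y = mixing_block(X, W1, W2, W3, W4)
print(Y.shape)  # (24, 32)
```

The two residual connections preserve the $N_{\mathrm{P}} \times C$ shape, so blocks can be stacked arbitrarily deep.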
After the spatio-temporal mixing blocks, the feature values are mapped to a vector. This vector is concatenated with feature values calculated from the insertion length data $L_{t}$ and then processed by fully connected layers with dropout. The last layer outputs an estimated colon shape $\hat{S}_{t}$.
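The output head described above can be sketched as follows, assuming the mixing blocks have already produced a feature matrix; the layer sizes, the ReLU on the insertion-length branch, and the single output layer are illustrative simplifications, not the paper's exact head:

```python
import numpy as np

rng = np.random.default_rng(0)
N_P, C, N, K, H = 24, 32, 16, 12, 128  # illustrative sizes

Y = rng.standard_normal((N_P, C))  # output of the mixing blocks
L_t = rng.random(N)                # insertion lengths l_{t-N+1}, ..., l_t

# Map the mixed features to a vector and compute insertion-length features.
feat = Y.reshape(-1)                       # (N_P * C,)
W_len = rng.standard_normal((H, N)) * 0.02
len_feat = np.maximum(W_len @ L_t, 0.0)    # (H,) hypothetical length branch

# Concatenate and regress the K x 3 colon point coordinates.
z = np.concatenate([feat, len_feat])
W_out = rng.standard_normal((K * 3, z.size)) * 0.02
S_hat = (W_out @ z).reshape(K, 3)          # estimated colon shape
print(S_hat.shape)  # (12, 3)
```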
3 Experimental Setup
We evaluated the colon shape estimation accuracy of the proposed method in a phantom study. We used a colon phantom (colonoscopy training model type I-B, Koken, Tokyo, Japan), a CT volume of the phantom, a colonoscope (CF-Q260AI, Olympus, Tokyo, Japan), an EM sensor (Aurora 5/6 DOF Shape Tool Type 1, NDI, Ontario, Canada), and a depth image sensor (Kinect v2, Microsoft, WA, USA). We measured colonoscope and colon shapes in the measurement environment shown in Fig. 4. We assume the colonoscope tip has been inserted up to the cecum when colonoscope tracking starts, because physicians observe and treat the colon while withdrawing the colonoscope after inserting it up to the cecum. The colonoscope was therefore moved from the cecum to the anus.

3.1 Colonoscope shape measurement
We measured colonoscope shapes using the EM sensor. The EM sensor is strap-shaped, with six sensors placed at its tip and at points along its body (one sensor is 6 DOF and the remaining five are 5 DOF). By inserting the sensor into the colonoscope working channel, each sensor provides a 3D position and a 3D (6 DOF) or 2D (5 DOF) direction along the colonoscope. The measured colonoscope shape $Q_{t}$ is obtained at every time $t$.
3.2 Colon shape measurement
We measured colon shapes of the colon phantom using the depth image sensor. Twelve position markers were attached to the surface of the colon phantom to scan its shape. The depth image sensor scanned colon shapes during the colonoscope insertions into the colon phantom. We automatically detected the marker positions from the scanned color and depth images using YOLOv5 (Ultralytics 2020), and the detection results were manually corrected. The detected markers at every time $t$ are used as the colon shape $S_{t}$, where $s_{t,0}$ and $s_{t,K-1}$ correspond to the cecum and the anus positions, respectively.
3.3 Training and testing of KST-Mixer
We measured both $Q_{t}$ and $S_{t}$ during colonoscope insertions into the colon phantom. The shape recording frequency was six times per second. $Q_{t}$ and $S_{t}$ belong to the EM and depth image sensor coordinate systems, respectively. We registered them to the CT coordinate system using the iterative closest point (ICP) algorithm (Besl 1992) and manual registration. The registered shape data was used to train and test the KST-Mixer. The network and training parameters (time length, patch size, hidden dimensions, and dropout probabilities) were fixed in all experiments, with a minibatch size of 50 and 200 training epochs. Mean squared error was used as the loss function in training. We implemented the KST-Mixer using the Keras API built into TensorFlow 2.4.0, running on a Windows PC equipped with an NVIDIA RTX A6000 GPU. The KST-Mixer used 2.5 GBytes of GPU memory during training.
In the test step, we provide colonoscope shapes for testing to the trained KST-Mixer and obtain from it the estimated colon shape $\hat{S}_{t}$ of the current time.
4 Experimental Results
We measured colonoscope and colon shapes during eight colonoscope insertions and recorded 1,388 shape pairs. An engineering researcher operated the colonoscope. Leave-one-colonoscope-insertion-out cross validation was performed for evaluation. We used the mean Euclidean distance (MED) (mm) between $S_{t}$ and $\hat{S}_{t}$ as the evaluation metric. It is defined as

$$\mathrm{MED} = \frac{1}{K} \sum_{k=0}^{K-1} \| s_{t,k} - \hat{s}_{t,k} \|, \qquad (6)$$

where $\| \cdot \|$ is the Euclidean norm.
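The MED metric is a short computation; below is a NumPy sketch with a toy example in which every estimated point is offset by exactly 3 mm, so the expected MED is 3.0 mm:

```python
import numpy as np

def mean_euclidean_distance(S, S_hat):
    """MED: mean Euclidean distance between the K ground-truth colon points S
    and the K estimated points S_hat, both of shape (K, 3), in mm."""
    return float(np.mean(np.linalg.norm(S - S_hat, axis=1)))

# Toy example: each estimated point is offset by 3 mm along one axis.
S = np.zeros((12, 3))
S_hat = S + np.array([3.0, 0.0, 0.0])
print(mean_euclidean_distance(S, S_hat))  # 3.0
```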
We compared the MED of the proposed method with previous colon shape estimation methods, including the SEN (LSTM-based method) (Oda 2018a) and the regression forests-based method (Oda 2018b). Table 1 shows the results of the comparison. The proposed method achieved the smallest MED among the methods. Statistical analysis indicated that the proposed method significantly reduced the MED compared to the SEN (Oda 2018a) (paired t-test of MED values). We also compared the computation times required to estimate one colon shape among these methods; the results are shown in Table 2. Both the proposed and previous methods achieve real-time performance. Examples of colon shape estimation results are shown in Fig. 5. The figure shows that the differences between the ground truth and the estimated colon shapes were small.
Method | MED (Mean ± S.D.) (mm) |
---|---|
KST-Mixer (Proposed) | |
SEN (LSTM-based method) (Oda 2018a) | |
Regression forests-based method (Oda 2018b) |
Method | Computation time (msec.) |
---|---|
KST-Mixer (Proposed) | 7.3 |
SEN (LSTM-based method) (Oda 2018a) | 2.9 |
Regression forests-based method (Oda 2018b) | 8.9 |


5 Discussion
The proposed KST-Mixer achieved the smallest colon shape estimation error among the compared methods. We designed the KST-Mixer to extract features from kinematic data using simple MLP blocks. The extracted features are then mixed in the spatio-temporal feature mixing MLP blocks to generate spatially and temporally global features. This structure is effective for processing time-series kinematic data, as we achieved better estimation results than the LSTM-based method (Oda 2018a). The proposed method can improve colonoscope tracking accuracy by accurately estimating the deformed colon shape during colonoscope insertion. Furthermore, its computation time was short enough for real-time applications.
The application of the proposed method is not limited to colon shape estimation. It can be applied to shape estimation of other elastic organs in diagnosis and treatment. Organ shape estimation is essential in computer-assisted surgery; accurate organ shape estimation contributes to the generation of real-time surgical navigation information and to the automation of surgical assistance robots.
Although we have obtained promising results in colon shape estimation, many challenges still remain before the proposed method can be applied to colonoscope tracking: (1) collecting data containing variations of operators and colon shapes, (2) collecting in-vivo data, (3) developing an intuitive visualization method for the tracking results, and (4) developing a colonoscope with embedded EM sensors. Challenge (1) is necessary to improve the robustness of the method in real situations. Colonoscope movements vary among physicians depending on their years of experience, and colon shapes vary among patients. Colon and colonoscope shape data containing such variations is necessary to build a better estimation model. We will measure shape data while the colonoscope is operated by physicians with various years of experience, and using many colon phantoms and 3D-printed phantoms with shape variations. Challenge (2) is necessary to advance the proposed method from the phantom level to a clinically applicable level. Challenge (3) requires a visualization that presents deformed colon shapes in real time during colonoscope insertion and helps physicians understand how the colonoscope travels in the colon. Challenge (4) is required to perform tracking in clinical situations.
6 Conclusions
This paper proposed a colon shape estimation method using the KST-Mixer, which operates on kinematic data. The KST-Mixer extracts kinematic features and mixes them along the spatial and temporal axes in multiple MLP blocks. We evaluated the method's accuracy in estimating the colon shape from colonoscope shapes in a phantom study. The proposed KST-Mixer achieved the smallest estimation error in the comparative experiments. Future work includes increasing the amount of data using other phantoms, evaluating the method using shape data measured during colonoscope operations by physicians, application to colonoscope tracking, and application to the human colon.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Funding
Parts of this research were supported by the MEXT/JSPS KAKENHI Grant Numbers 21K12723, 17H00867 and the JSPS Bilateral International Collaboration Grants.
References
- Peters (2008) Peters T, Cleary K. 2008. Image-guided interventions: technology and applications. New York (NY): Springer.
- Deligianni (2008) Deligianni F, Chung A, Yang GZ. 2005. Predictive camera tracking for bronchoscope simulation with CONDensation. In: Duncan JS, Gerig G, editors. MICCAI 2005. Medical Image Computing and Computer-Assisted Intervention; Oct 26–30; Palm Springs (CA), LNCS 3749, p. 910–916.
- Rai (2008) Rai L, Helferty JP, Higgins WE. 2008. Combined video tracking and image-video registration for continuous bronchoscopic guidance. International Journal of Computer Assisted Radiology and Surgery. 3:315–329.
- Deguchi (2009) Deguchi D, Mori K, Feuerstein M, Kitasaka T, Maurer CR Jr, Suenaga Y, Takabatake H, Mori M, Natori H. 2009. Selective image similarity measure for bronchoscope tracking based on image registration. Medical Image Analysis. 3(14):621–633.
- Visentini-Scarzanella (2017) Visentini-Scarzanella M, Sugiura T, Kaneko T, Koto S. 2017. Deep monocular 3D reconstruction for assisted navigation in bronchoscopy. International Journal of Computer Assisted Radiology and Surgery. 12:1089–1099.
- Wang (2021) Wang C, Hayashi Y, Oda M, Kitasaka T, Takabatake H, Mori M, Honma H, Natori H, Mori K. 2021. Depth-based branching level estimation for bronchoscopic navigation. International Journal of Computer Assisted Radiology and Surgery. 16:1795–1804.
- Banach (2021) Banach A, King F, Masaki F, Tsukada H, Hata N. 2021. Visually navigated bronchoscopy using three cycle-consistent generative adversarial network for depth estimation. Medical Image Analysis. 73:102164.
- Gildea (2006) Gildea TR, Mazzone PJ, Karnak D, Meziane M, Mehta A. 2006. Electromagnetic navigation diagnostic bronchoscopy: a prospective study. American Journal of Respiratory and Critical Care Medicine. 174(9):982–989.
- Schwarz (2006) Schwarz Y, Greif J, Becker HD, Ernst A, Mehta A. 2006. Real-time electromagnetic navigation bronchoscopy to peripheral lung lesions using overlaid CT images: the first human study. Chest. 129(4):988–994.
- Mori (2005) Mori K, Deguchi D, Akiyama K, Kitasaka T, Maurer CR Jr, Suenaga Y, Takabatake H, Mori M, Natori H. 2005. Hybrid bronchoscope tracking using a magnetic tracking sensor and image registration. In: Duncan JS, Gerig G, editors. MICCAI 2005. Medical Image Computing and Computer-Assisted Intervention; Oct 26–30; Palm Springs (CA), LNCS 3750, p. 543–550.
- Luo (2015) Luo X, Wan Y, He X, Mori K. 2015. Observation-driven adaptive differential evolution and its application to accurate and smooth bronchoscope three-dimensional motion tracking. Medical Image Analysis. 24(1):282–296.
- Liu (2013) Liu J, Subramanian KR, Yoo TS. 2013. An optical flow approach to tracking colonoscopy video. Computerized Medical Imaging and Graphics. 37(3):207–223.
- Yao (2021) Yao H, Stidham RW, Gao Z, Gryak J, Najarian K. 2021. Motion-based camera localization system in colonoscopy videos. Medical Image Analysis. 73:102180.
- Ching (2010) Ching LY, Moller K, Suthakorn J. 2010. Non-radiological colonoscope tracking image guided colonoscopy using commercially available electromagnetic tracking system. In: 2010 IEEE Conference on Robotics, Automation and Mechatronics; June 28–30; Singapore.
- Fukuzawa (2015) Fukuzawa M, Uematsu J, Kono S, Suzuki S, Sato T, Yagi N, Tsuji Y, Yagi K, Kusano C, Gotoda T, Kawai T, Moriyasu F. 2015. Clinical impact of endoscopy position detection unit (UPD-3) for a non-sedated colonoscopy. World Journal of Gastroenterology. 21(16):4903–4910.
- Oda (2017) Oda M, Kondo H, Kitasaka T, Furukawa K, Miyahara R, Hirooka Y, Goto H, Navab N, Mori K. 2017. Robust colonoscope tracking method for colon deformations utilizing coarse-to-fine correspondence findings. International Journal of Computer Assisted Radiology and Surgery. 12(1):39–50.
- Oda (2018a) Oda M, Roth H, Kitasaka T, Furukawa K, Miyahara R, Hirooka Y, Goto H, Navab N, Mori K. 2018. Colon shape estimation method for colonoscope tracking using recurrent neural networks. In: Frangi A, Schnabel J, Davatzikos C, Alberola-López C, Fichtinger G, editors. MICCAI 2018. Medical Image Computing and Computer Assisted Intervention; Sep 16–20; Granada, Spain, LNCS 11073, p. 176–184.
- Oda (2018b) Oda M, Kitasaka T, Furukawa K, Miyahara R, Hirooka Y, Goto H, Navab N, Mori K. 2018. Machine learning-based colon deformation estimation method for colonoscope tracking. In: Proceedings of SPIE Medical Imaging; Feb 10–15; Houston (TX), 10576:1057619.
- Hochreiter (1997) Hochreiter S, Schmidhuber J. 1997. Long short-term memory. Neural Computation. 9(8):1735–1780.
- Tolstikhin (2021) Tolstikhin I, Houlsby N, Kolesnikov A, Beyer L, Zhai X, Unterthiner T, Yung J, Steiner A, Keysers D, Uszkoreit J, Lucic M, Dosovitskiy A. 2021. MLP-Mixer: An all-MLP architecture for vision. arXiv:2105.01601.
- Dosovitskiy (2021) Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations; May 3–7; Vienna, Austria.
- Ba (2016) Ba JL, Kiros JR, Hinton GE. 2016. Layer normalization. arXiv:1607.06450.
- Hendrycks (2016) Hendrycks D, Gimpel K. 2016. Gaussian error linear units (GELUs). arXiv:1606.08415.
- Ultralytics (2020) Ultralytics YOLOv5. 2022. [accessed 2022 Feb 27]. https://github.com/ultralytics/yolov5.
- Besl (1992) Besl PJ, McKay ND. 1992. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence. 14(2):239–256.