
Teaser figure: High-level architecture overview of the single-device CAMRE framework with an external Unity networking framework server.

Experiences with CAMRE: Single-Device Collaborative Adaptive Mixed Reality Environment

Hung-Jui Guo
The University of Texas at Dallas
Corresponding author: Hung-Jui Guo, e-mail: [email protected] 0000-0003-2233-846X
   Omeed Eshaghi Ashtiani
The University of Texas at Dallas
e-mail: [email protected] 0000-0002-6598-0551
   Balakrishnan Prabhakaran
The University of Texas at Dallas
e-mail: [email protected] 0000-0003-0385-8662
Abstract

During collaboration in XR (eXtended Reality), users typically share and interact with virtual objects in a common, shared virtual environment. Collaboration among users in Mixed Reality (MR), in particular, requires knowing each user's position and movement as well as understanding the visual scene surrounding their physical environment; otherwise, one user could move an important virtual object to a position blocked by the physical environment for others. However, even for a single physical environment, 3D reconstruction takes a long time and the produced 3D data is typically very large. Streaming these large amounts of 3D data to receivers is also slow, making real-time updates of the rendered scene challenging. Furthermore, many collaboration systems in MR require multiple devices, which take up space and complicate setup. To address these challenges, in this paper we describe a single-device system called Collaborative Adaptive Mixed Reality Environment (CAMRE). We build CAMRE using the scene understanding capabilities of the HoloLens 2 to create shared MR virtual environments for each connected user and, using a Leader-Follower(s) paradigm, demonstrate faster reconstruction and scene update times due to the smaller data involved. Consequently, multiple users can receive shared, synchronized virtual scenes from a chosen Leader with close-to-real-time latency, based on the Leader's physical position and movement. We also illustrate other expanded features of the CAMRE MR virtual environment, such as navigation using a real-time virtual mini-map and X-ray vision based on adaptive wall opacity. We share several experimental results that evaluate the performance of CAMRE in terms of network latency when sharing virtual objects and other capabilities.

CCS Concepts: Human-centered computing → Mixed / augmented reality; Human-centered computing → Collaborative interaction

1 Introduction

Multi-user collaboration in Virtual Reality (VR) and Mixed Reality (MR) has potential applications in a wide variety of fields, such as education [13] and industrial settings [24]. For physically distributed users, collaboration systems in VR include networked, persistent, immersive virtual environments [26]. Although such systems are primarily for VR, the concept has been extended to MR. For example, in [25], the authors built a collaborative system combining Augmented Reality (AR) and VR devices to enable collaboration among users accessing different devices. By utilizing a Kinect camera, [7] captured the user's motion and projected it onto a humanoid robot located in the collaborator's physical space to create an MR collaboration system. In the context of MR, users typically see their own surrounding physical environments while interacting with virtual objects. Therefore, unlike collaborative VR environments, MR collaboration systems face several additional issues.

Only a limited number of existing MR collaboration systems handle cases where some of the users’ physical environments differ from the collaborators’ current physical environments [38]. Under such conditions, during the collaboration process, one of the users may move or rotate the virtual object to a place or position where the collaborator cannot see or operate it, which might have a negative effect on the collaboration process; an example is shown in Figure 1.

Figure 1: (a) One user may move or rotate a virtual object to a place or position where (b) the collaborator cannot see or operate it.

1.1 Challenges for Creating Physical Environment-based MR Collaboration System

Creating a collaborative mixed reality system that employs a shared virtual environment based on the user’s physical environment could lead to further challenges:

  • Constructing a 3D virtual environment based on the physical environment is computationally expensive due to the large data size; for instance, a complete 3D indoor environment takes about 100 Megabytes.

  • Constructing a shared virtual environment for a collaboration system often requires setting up multiple devices, which can make it difficult for users unfamiliar with MR to get started with and use the collaboration system.

  • Transferring large-scale 3D environments can result in large streaming and update latencies over the Internet.

  • Users tend to collaborate only with the virtual content within their line of sight and may have a limited understanding of the entire MR virtual environment. This limited understanding of the overall environment might restrict physical movement and constrain the usage of the collaboration system.

1.2 Collaborative Adaptive Mixed Reality Environment (CAMRE)

To address the above challenges, we established the Collaborative Adaptive Mixed Reality Environment (CAMRE) system with a single MR device, the Microsoft HoloLens 2 [16], which not only shares virtual objects but also shares a virtual environment built from one user's physical environment with other users connected to the same server (see the teaser figure for an overview). To address the first challenge of using large 3D environments, instead of reconstructing the whole physical environment as mesh data, we use the scene understanding feature in the Microsoft Mixed Reality Toolkit (MRTK) [17] to build virtual objects and a virtual environment from the objects and geometry of the physical environment, which reduces the data size from around 100 Megabytes (a living-room-sized 3D mesh) to approximately 0.3 Megabytes (a living-room-sized scene understanding-based virtual environment). By utilizing the scene understanding feature to create the virtual environment, users only need to deploy the CAMRE system on their HoloLens 2 device instead of setting up multiple sensors in their environment, which tackles the second challenge in terms of ease of use.

Based on the created small-sized virtual environment, when users move in their physical environment, CAMRE can update the virtual environment accordingly by transmitting a small amount of data. On top of this system structure, we incorporated a Leader-Follower paradigm: the Leader is responsible for observing and creating an MR virtual environment and sharing it with multiple Followers through a networking framework. CAMRE helps the Leader and Followers share the same knowledge of the virtual environment, thereby preventing users from moving virtual objects to places where other collaborators cannot see them. With CAMRE, the Leader can stream virtual information to Followers with minimal data usage, addressing the third challenge listed above.

1.3 CAMRE’s Expanded Navigation Features

CAMRE with its Leader-Follower paradigm provides a method for one Leader and multiple Followers to collaborate in a shared virtual environment. However, because the wall objects generated on the Leader's side occlude other rooms, users might stay in the initial room due to a limited understanding of the rest of the environment, which leads to the fourth challenge. To tackle this challenge, we incorporated three expanded navigation features into the CAMRE system that provide an overview of the created virtual environment before users physically move to a destination, assisting navigation and increasing usability.

  • Dynamic X-ray vision: Allows users to see through surrounding obstacles to gain additional information about another room. (This feature was published separately in our demo paper; to ensure anonymity in the reviewing process, we marked the authors' names as A. Anonymous in [2].)

  • Complete see-through virtual environment: Virtual walls will become partially transparent whenever the user approaches within 3 meters to provide information about all other rooms within range.

  • Real-time mini-map: The Leader can observe and build the entire CAMRE MR virtual environment and share it with Followers. Followers can explore either with the Leader or independently, without following the Leader. This is facilitated by the real-time mini-map, which shows a bird's eye view of the whole virtual environment and provides a complete view separately for each user. This mini-map feature makes the explicit assumption that such a complete view is available before the collaboration process starts and is given to the Followers.

When users are immersed in a virtual environment, their movements and interactions are significantly influenced by human depth perception [14, 8]. Unlike real objects, whose size and color act as fixed references in the human brain, virtual objects frequently give users inadequate references to judge their depth accurately, partly due to the vergence-accommodation conflict. Therefore, providing users with additional depth information in the virtual environment can help them better understand their surroundings. For example, [9] presented a series of virtual environment underestimation experiments suggesting that visual information is an important source of information for the calibration of movement. In CAMRE, besides providing an overview of the virtual environment, the three expanded features also provide additional depth information to further assist users with navigation. Dynamic X-ray vision provides motion parallax, since the X-ray vision window moves with the user's eye-gaze direction. The complete see-through virtual environment provides distance perception, since virtual walls become partially transparent when users move within 3 meters of them. The real-time mini-map provides camera position and field-of-view (FOV) options that users can adjust to convey the relative distance and scale of the virtual objects. Here, we make the explicit assumption that the needed information, such as the scene behind the obstacles, is available (perhaps through a pre-captured database) to the user. Detailed information is provided in Section 3.2.

1.4 Contributions

We designed an exhaustive set of experiments with a primary objective of measuring latencies incurred during a Leader-multiple Followers collaboration over the Internet involving different distances among the collaborators. These experiments were conducted with varying factors such as room sizes, networking frameworks for sharing virtual environments, different distances between the Leader and Followers, and the number of simultaneous network connections. We make the following contributions through the created CAMRE framework:

  1. Implemented a user-friendly, single-device setup for users who are unfamiliar with MR systems.

  2. Dynamically updated the virtual environment based on the corresponding physical environment with a small data size and low construction time.

  3. Achieved low-latency streaming of the virtual environment from the Leader to multiple Followers to realize the Leader-Follower paradigm.

  4. Provided expanded navigation features that give users an overview of the created virtual environment, with additional scene and depth information.

  5. Carried out an extensive performance evaluation of CAMRE on two commonly used Unity networking frameworks; the results can serve as a network benchmark for similar future systems.

Although some previous works created collaborative systems across multiple AR/VR/MR devices, to the best of our knowledge, the CAMRE system may be one of the earliest MR collaborative systems that includes dynamic environmental updates with real-time capability.

1.5 Using CAMRE

We will make the CAMRE system software available as open source (after the paper is published). The CAMRE system, along with the planned future work described in Section 6, can be very useful for the research community as well as for application developers dealing with collaborative use cases such as training and tele-mentoring using MR. We will also make the experimental data reported in Section 5 publicly available. The research community can use this data as a benchmark for comparing similar approaches. The data pertaining to network latencies in Section 5 can also be used for trace-driven simulation in human subject studies of Internet-based collaborative MR applications.

1.6 Limitations of Our Work

We also acknowledge some important limitations:

  1. Some previous MR collaboration systems (reported above) handle cases where some users' physical environments differ from the collaborators' current physical environments. CAMRE, however, specifically employs a single Leader-multiple Followers paradigm, resulting in a common, shared virtual environment for all users. While this could be a limitation for some use cases, the shared common virtual environment could be advantageous for training or telementoring types of applications.

  2. The performance studies reported in Sections 4 and 5 focus on network latencies in collaboration over the Internet. We have not carried out human subject studies to understand users' perception of the effect of degraded (or small-data-sized) virtual environments, nor of the effect of varying Internet latencies.

  3. Similarly, our work has not evaluated human perception of the effect of a synchronized virtual environment such as CAMRE's. For instance, when the Leader moves to a different environment, the Followers' view/understanding of the virtual environment also changes accordingly even though they (the Followers) never move. This unexpected change in the environment might affect user experience and/or cause VR sickness.

The above aspects of human perception need to be evaluated thoroughly. Considering the need for detailed and exhaustive user perception studies, we plan to do this as a separate, future research work. As mentioned earlier, we will use the network latencies reported in Section 5 to emulate Internet-level collaboration for these human perception studies.

2 Related Work

Many studies have been conducted to develop multi-user collaboration systems in AR/VR/MR that enable effective remote collaboration, particularly during the COVID-19 pandemic. One of the most common types of collaboration system involves creating a virtual environment where users can immerse themselves and interact with other users' avatars to achieve collaborative outcomes. The concept behind this type of system was proposed and discussed in 1998 [3] as "collaborative virtual environments," which used networked virtual reality systems to support group work. More recently, various techniques have been used to achieve collaborative virtual environments; for example, [32, 33] presented the 360Drops system, which provides 360-degree video sharing and 3D reconstructed scenes with photo-bubbles to convey environment details. Additionally, researchers have been working on cross-reality systems to enable multiplayer collaboration across various AR and VR devices [30, 25]. [40] developed the VRGit system to facilitate multi-user collaboration in VR; it helps users keep track of modifications made to the virtual environment and manage different versions of 3D content, making it easier for them to collaborate effectively. Summarizing the development of collaborative virtual environments, comprehensive surveys on collaboration and communication systems were conducted by [5] and [28], providing insights into the functionalities, advantages, and disadvantages of each collaboration system.

However, existing works rarely focus on sharing the whole surrounding physical environment, which can lead to the occlusion issue shown in Figure 1 and reduce collaboration efficiency and freedom of movement in the virtual environment. Still, some previous works have tried to share the surrounding physical environment to achieve collaboration; for instance, [6] proposed a model that includes users' surrounding physical environment in the virtual environment, building a collaborative environment in the VR world by taking the physical features into account and embedding them in the virtual environment. The authors of [21] introduced a system called PLANWELL that used handheld AR devices: an explorer scans outdoor geographical data to create a 3D model, which is then shared with an overseer for remote collaboration. Although this system is similar to our Leader-Follower paradigm, it has a higher data transfer time (2.4 seconds), which might not be suitable for real-time collaboration. [39] presented the DistanciAR system, which captures and reconstructs a remote environment with a LiDAR camera for viewing from a different location and improves the interface by adding Dollhouse (bird's-eye view) and Peek modes. However, using the complete system took around 13 minutes, which may pose a challenge for collaboration purposes. Most recently, [34] presented a 3D MR remote collaboration prototype system based on scanning the surrounding physical environment. However, to achieve real-time collaboration, this system relies on AR and VR head-mounted devices with three depth cameras and pre-scanned, reconstructed 3D mesh models of a room-scale workspace, which could be challenging for regular users to set up.

In addition, other research works have used humanoid robots to accomplish multi-user collaboration by reproducing actions captured from users, which could potentially solve the occlusion issue. For example, [20] used humanoid robots to imitate users' activities as surrogates to achieve cross-country collaboration, and [22] proposed a system integrating humanoid robots and video streams to build an MR-like collaborative environment for remote collaboration. To achieve better human-robot collaboration, [7] suggested that robots should be capable of perceiving and parsing a scene's information in real time. The authors noted that such environmental parsing is typically divided into three categories: scene graph, 2D map representation, and 3D map representation. This echoes our scene understanding-based CAMRE system, which performs scene understanding to build scene graphs, reconstructs a corresponding 3D model from the understood information for the 3D map representation, and provides a real-time mini-map as the 2D map representation.

The above summary of past and recent multi-user virtual environment works demonstrates a focus on sharing the same virtual environment in AR/VR. We believe that utilizing MR to immerse individuals in a common virtual environment based on their specific physical surroundings can provide additional information to facilitate collaboration. Although similar collaborative systems and expanded features have been presented in other works built in AR/VR, our CAMRE system is one of the few collaborative systems built in MR, enabling users to interact with both the virtual environment and their physical environment. Furthermore, CAMRE utilizes a small-data-scale virtual environment to achieve low data transfer latency while still providing real-time collaboration with expanded features, and it remains user-friendly by requiring only a single device to set up. In the following sections, we focus on the detailed settings and evaluations of the CAMRE MR virtual environment, with its low-latency sharing and expanded navigation features based on the surrounding physical environment.

3 CAMRE System Design

As mentioned earlier, in CAMRE we employ a Leader-Follower paradigm through which MR environments are adaptively generated with low latency based on the physical environment of the Leader and shared using the open-source networking frameworks in Unity (Unity Netcode for GameObjects [35] with Unity Relay [36], and Photon Unity Networking [23]) for collaboration among multiple users. Instead of giving every user the same level of authority, the Leader-Follower paradigm avoids multiple users building and sharing their own virtual environments, which would cause the virtual environments to overlap; only the Leader is authorized to observe and create the virtual environment. Next, we add three expanded features (dynamic X-ray vision window, complete see-through virtual environment, and a real-time mini-map; see below) to help users navigate and gain depth cues in the adaptive MR environments. The system is designed and built for the Microsoft HoloLens 2 and the HoloLens Unity emulator.

3.1 MR Virtual Environment Adapting to Physical Environment

Scene understanding is a pre-built feature in MRTK (the Mixed Reality Toolkit from Microsoft) [17] that observes and analyzes the target physical environment. It builds on the spatial mapping feature of HoloLens 2, which uses a long-throw depth camera to capture the structure of the physical environment as users walk around and scan their surroundings, creating multiple flat virtual planes (referred to as "scene objects" in the rest of this paper) that align with the corresponding physical planes to form a complete MR virtual environment, as shown in Figure 2 (a). In the CAMRE framework, we integrate this scene understanding feature to generate virtual environments that dynamically adapt to changes in the Leader's physical environment in close to real time. Users can operate the virtual environment update settings on the control panel to update manually or to auto-update at a specific time interval (5 seconds by default, adjustable). The scene understanding feature in MRTK generates simple virtual planes to construct the virtual environment while preserving proper scene information. As a result, the data size of the created virtual environment is relatively small (about 0.3 Megabytes) compared to standard virtual room mesh data (around 100 Megabytes). For instance, in a recent study [10], an MR collaboration system built and shared user avatars as mesh data, with the smallest avatar taking up 0.4 Megabytes, which is larger than our entire scene understanding-based virtual environment. Therefore, system load and environment creation time decrease, and the virtual environment can be updated with low latency as the user moves. In addition, these dynamically generated MR environments are shared with a set of Followers to facilitate collaboration. Whenever the Leader moves to or faces a new, unobserved physical environment, our system updates the virtual environment dynamically. Each scene object created or updated in the virtual environment on the Leader's HoloLens 2 is immediately sent to the server and forwarded to all Followers. Due to the small data size, the sharing process from the Leader to the Followers has close-to-real-time latency. Followers see the exact same MR environment on their devices and gain the same environmental information as the Leader; scene objects received from the Leader's side are shown in gray only, indicating that the current user is not the creator of these objects, to avoid confusion, as shown in Figure 2 (b).

Figure 2: Scene understanding-generated virtual environment shared between the Leader and Followers through the networking framework, with an avatar indicating the user's location. Different colors indicate different categories of scene objects: yellow indicates walls, red indicates medium-sized platforms, navy blue indicates ceilings, bright blue indicates floors, grass green indicates large horizontal platforms, and blue-green indicates unclassified scene objects. (a) Bird's eye view and indoor view of the virtual environment adapting to the physical environment, created on the Leader's device. (b) Follower's views of the MR environment; gray scene objects indicate that the current user is not their creator.
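The auto-update behavior described above can be sketched as a simple Leader-side update loop. The following is a minimal sketch only: the RebuildSceneObjects() helper is hypothetical and stands in for a wrapper around MRTK's scene understanding observer that queries the current scene and spawns or refreshes the flat scene-object planes before handing them to the networking layer.

```csharp
using System.Collections;
using UnityEngine;

// Sketch of the Leader-side auto-update loop (assumed behavior, not the exact CAMRE source).
public class SceneUpdateDriver : MonoBehaviour
{
    [Tooltip("Auto-update interval in seconds (5 s by default, adjustable from the control panel).")]
    public float updateInterval = 5f;

    [Tooltip("When false, the scene is only rebuilt when the user presses the 'update scene' button.")]
    public bool autoUpdate = true;

    private Coroutine updateLoop;

    private void OnEnable()
    {
        if (autoUpdate)
            updateLoop = StartCoroutine(AutoUpdateLoop());
    }

    private void OnDisable()
    {
        if (updateLoop != null)
            StopCoroutine(updateLoop);
    }

    // Called by the control panel's manual "update scene" button.
    public void UpdateSceneNow()
    {
        StartCoroutine(RebuildSceneObjects());
    }

    private IEnumerator AutoUpdateLoop()
    {
        var wait = new WaitForSeconds(updateInterval);
        while (true)
        {
            yield return RebuildSceneObjects();
            yield return wait;
        }
    }

    // Hypothetical wrapper: queries the scene understanding observer and
    // spawns/refreshes one flat quad per detected wall, floor, ceiling, or platform,
    // then hands the changed objects to the networking layer.
    private IEnumerator RebuildSceneObjects()
    {
        // ... query observer, instantiate/refresh scene-object quads, share changes ...
        yield return null;
    }
}
```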

3.1.1 Network Framework

In CAMRE, we use two state-of-the-art Unity networking frameworks, Unity Netcode for GameObjects with Unity Relay and Photon Unity Networking, to transfer observed scene objects from the Leader to an external server and then to the Followers to accomplish remote collaboration. We built the CAMRE system on both frameworks to compare their network latency and ease of use, which are evaluated in a later section; a minimal synchronization sketch follows the list below.

  1. Netcode for GameObjects is the latest package (first released in June 2022) recommended by Unity for multiplayer networking; it enables the system to synchronize virtual objects' position, rotation, and scale. Since Netcode for GameObjects only supports local network connections unless the user's router is reconfigured, we use Unity Relay, a Unity-provided relay service, for external network connections to avoid complicated setup and keep the system user-friendly. The combined setup allows up to 50 concurrent users for free ($0.16 per additional user) but requires the Leader to send access codes to Followers externally.

  2. The other networking framework we used is Photon Unity Networking, the primarily recommended multiplayer networking framework for HoloLens 2 multi-user collaboration. It offers similar functionality, sharing virtual objects' position, rotation, and scale through the Photon server, and it allows users to join a preset room without exchanging external messages. However, the free version supports only 20 concurrent users (up to 2,000 concurrent users for $370).
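To illustrate how a single scene object might be synchronized from the Leader to the Followers over Netcode for GameObjects, the sketch below uses server-written NetworkVariables for pose, size, and type, assuming the Leader acts as the Netcode host. The field layout and the SceneObjectKind enum are illustrative, not the exact CAMRE wire format; a comparable Photon implementation would expose the same state through an observed component.

```csharp
using Unity.Netcode;
using UnityEngine;

// Illustrative only: one networked flat plane ("scene object") shared Leader -> Followers.
public enum SceneObjectKind { Wall, Floor, Ceiling, Platform, Unclassified }

public class NetworkedSceneObject : NetworkBehaviour
{
    // Written by the Leader (host/server in this sketch), read by every Follower.
    private readonly NetworkVariable<Vector3> position = new NetworkVariable<Vector3>();
    private readonly NetworkVariable<Quaternion> rotation = new NetworkVariable<Quaternion>();
    private readonly NetworkVariable<Vector2> extents = new NetworkVariable<Vector2>();
    private readonly NetworkVariable<SceneObjectKind> kind = new NetworkVariable<SceneObjectKind>();

    // Leader-side: push the latest pose/size whenever scene understanding refreshes this object.
    public void LeaderUpdate(Vector3 pos, Quaternion rot, Vector2 size, SceneObjectKind k)
    {
        if (!IsServer) return;
        position.Value = pos;
        rotation.Value = rot;
        extents.Value = size;
        kind.Value = k;
    }

    private void Update()
    {
        // Apply the replicated pose; Followers additionally render the plane in gray
        // to show that they are not the creator of this object.
        transform.SetPositionAndRotation(position.Value, rotation.Value);
        transform.localScale = new Vector3(extents.Value.x, extents.Value.y, 0.01f);
        if (!IsServer && TryGetComponent<Renderer>(out var r))
            r.material.color = Color.gray;
    }
}
```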

3.2 CAMRE’s Expanded Navigation Features

In this section, we describe three expanded features of CAMRE to provide an overview and depth perception of the virtual environment that can assist in user navigation and understanding of the entire environment. Here, we make the following two assumptions for users before starting to use the expanded navigation features:

  1. The CAMRE MR virtual environment has been observed and built completely by the HoloLens 2.

  2. Information behind the obstacles is available to the user.

3.2.1 Dynamic X-ray Vision Window

In order to provide additional information and depth cues (such as motion parallax) about the surrounding scene, we built a dynamic X-ray vision window [2]. This feature allows users to see directly through the obstacles in front of them in the CAMRE MR virtual environment while still retaining a complete view of the surrounding environment, by utilizing the clipping primitive feature in MRTK. By attaching the clipping primitive to selected virtual objects to mimic a physical window and make the contact area partially transparent, users can gain information behind obstacles before physically moving to other rooms. To provide customization and avoid potential motion sickness, users can dynamically change the X-ray vision window's size with a slider to best fit their current viewing needs. Furthermore, we use the eye-tracking function of HoloLens 2 to make the X-ray vision window follow the eye-gaze direction, updating the window dynamically so that it moves smoothly and quickly. Updating the X-ray vision window based on the user's position, movement, and eye gaze provides motion parallax that more closely resembles the real world. To prevent users from experiencing virtual motion sickness while using the eye-gaze X-ray vision window, we offer an alternative head-gaze version (the window follows head movement), enabling users to choose the version they are comfortable with. With the help of X-ray vision in the CAMRE MR virtual environment, users can locate and perceive the distance of objects in adjacent rooms without physically moving. A depiction of this feature is shown in Figure 3 (a).
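A minimal sketch of how the eye-gaze-following window could be realized with MRTK 2 is shown below. It assumes the wall renderers have already been registered with a ClippingBox acting as the see-through region; the MRTK names used here (CoreServices, EyeGazeProvider, ClippingBox) are to the best of our knowledge, and the exact component wiring in CAMRE may differ.

```csharp
using Microsoft.MixedReality.Toolkit;
using Microsoft.MixedReality.Toolkit.Utilities;
using UnityEngine;

// Sketch: moves a box-shaped clipping region ("X-ray window") to where the
// user's gaze hits a wall, so the wall becomes see-through at that spot.
public class XRayVisionWindow : MonoBehaviour
{
    [Tooltip("ClippingBox whose volume carves the see-through window out of wall materials.")]
    public ClippingBox window;

    [Tooltip("Window edge length in meters; bound to the size slider.")]
    public float windowSize = 0.5f;

    [Tooltip("If false, follow head gaze instead of eye gaze (motion-sickness fallback).")]
    public bool useEyeGaze = true;

    private void Update()
    {
        var eyes = CoreServices.InputSystem?.EyeGazeProvider;

        Vector3 origin, direction;
        if (useEyeGaze && eyes != null && eyes.IsEyeTrackingEnabledAndValid)
        {
            origin = eyes.GazeOrigin;
            direction = eyes.GazeDirection;
        }
        else
        {
            origin = Camera.main.transform.position;   // head-gaze fallback
            direction = Camera.main.transform.forward;
        }

        // Place the clipping volume on the first wall the gaze ray hits.
        if (Physics.Raycast(origin, direction, out RaycastHit hit, 10f))
        {
            window.transform.position = hit.point;
            window.transform.rotation = Quaternion.LookRotation(hit.normal);
            window.transform.localScale = new Vector3(windowSize, windowSize, 0.2f);
        }
    }
}
```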

3.2.2 Complete See-through Virtual Environment

We also provide an option for users to have a complete direct view of the created MR virtual environment as they navigate their surroundings. When users approach a virtual wall object created by CAMRE's scene understanding feature within a 3-meter radius (the Euclidean distance between the user's current location and the wall object's location), the object becomes 30% transparent, allowing users to view information about adjacent rooms before leaving the current one and also helping them perceive the distance between themselves and the edges of the virtual environment. Conversely, when the user moves more than three meters away from a virtual wall object, the object becomes opaque again. We ensure that users are aware of significant changes in the virtual environment by alerting them with spatial sound cues played from the direction of the wall objects they approach as those objects become transparent. For instance, if a user moves towards a wall object on the right side, an alert sound is heard from the right to indicate the approaching movement. By generating sound cues from changing objects using the HoloLens 2's spatial sound capabilities, users can quickly notice which wall objects within three meters have changed. This helps users acquire spatial information for additional depth cues while obtaining the information behind obstacles. The wall object's transparency effect is shown in Figure 3 (b).

Figure 3: (a) Example dynamic X-ray vision window view in the virtual environment. (b) Complete see-through virtual environment making virtual wall objects transparent to help users see objects inside a room when they approach within 3 meters.
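The 3-meter see-through behavior reduces to a per-wall distance check like the sketch below. The threshold and the spatial alert sound are as described above, while the exact alpha value, the material/shader transparency setup, and the spatialized AudioSource configuration are assumptions for illustration.

```csharp
using UnityEngine;

// Sketch: a wall scene object that fades to partial transparency when the user is
// within 3 m and plays a spatial alert sound from its own position.
[RequireComponent(typeof(Renderer), typeof(AudioSource))]
public class SeeThroughWall : MonoBehaviour
{
    public float triggerDistance = 3f;     // Euclidean distance threshold
    public float transparentAlpha = 0.3f;  // partial transparency when in range (exact value assumed)

    private Renderer wallRenderer;
    private AudioSource alertSound;        // assumed to be configured as spatialized (3D) audio
    private bool isTransparent;

    private void Awake()
    {
        wallRenderer = GetComponent<Renderer>();
        alertSound = GetComponent<AudioSource>();
    }

    private void Update()
    {
        // On HoloLens 2 the main camera tracks the user's head position.
        float distance = Vector3.Distance(Camera.main.transform.position, transform.position);
        bool shouldBeTransparent = distance <= triggerDistance;

        if (shouldBeTransparent != isTransparent)
        {
            isTransparent = shouldBeTransparent;
            SetAlpha(isTransparent ? transparentAlpha : 1f);
            if (isTransparent)
                alertSound.Play();   // spatial cue from the direction of the approached wall
        }
    }

    private void SetAlpha(float alpha)
    {
        // Assumes a material whose rendering mode supports transparency.
        Color c = wallRenderer.material.color;
        wallRenderer.material.color = new Color(c.r, c.g, c.b, alpha);
    }
}
```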

3.2.3 Real-time Mini-map

We also include a mini-map to provide a bird's eye view of the entire CAMRE MR virtual environment for an overall understanding. Our assumption here is that the Leader's CAMRE MR virtual environment is completely observed and built beforehand and shared with the Followers. This allows Followers to explore the entire virtual environment independently with the real-time mini-map, without following the Leader in real time. This type of mini-map is a common feature in first-person shooter video games that helps players navigate their surroundings. Similarly, including a mini-map in the CAMRE framework allows users to maintain awareness and understanding of their surrounding environment. Therefore, we create a track-up mini-map (this is configurable; a north-up mini-map can also be chosen) that updates in real time with the user's physical movement (position and rotation) and is displayed at the bottom right corner of the user's FOV, as shown in Figure 4 (a). To identify users on the mini-map, we create a self-avatar that follows the user's position in real time and is shown on the mini-map to indicate the current position. The avatar on the Leader's side spawns at the origin point where the Leader starts the application, and avatars on the Followers' side also spawn at the Leader's origin point whenever they connect to the server. All users can locate the current location of other users to confirm whether the virtual object being shared is within the other user's FOV. Examples of the mini-map avatar and virtual objects displayed on the mini-map are shown in Figure 4 (b).

Figure 4: (a) Mini-map position in the user's FOV and (b) a larger display showing the user's location and the scene objects captured by CAMRE on the mini-map.

Our system offers users the ability to control the position of the camera and the field of view of the mini-map in real-time. Through two sliders, users can choose to view a close-up of a specific area to display detailed information or a full view of the entire CAMRE MR virtual environment to gather complete information about other rooms before physically moving to them. Furthermore, by combining different settings of the two sliders, the mini-map can display varying levels of detail to provide further information. Suppose the user chooses a high camera position value and a low FOV value. In this case, the mini-map will display a flatter and complete floor plan to help users see the top view of the virtual objects located in the surrounding virtual environment and avoid scene objects (such as wall objects) affecting the judgment of the virtual object’s position, as shown in Figure 5 (a). Conversely, if the user chooses a low camera position value and a high FOV value, the mini-map will present a higher perspective of the scene objects, allowing the user to see the three-dimensional view of the scene objects more clearly to help users locate and calculate the size of the scene objects, as shown in Figure 5 (b).

Figure 5: (a) A high camera position and low FOV result in a flatter floor plan; (b) a low camera position and high FOV result in a higher perspective.
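The mini-map camera rig can be sketched as follows, assuming a secondary top-down camera rendering into a RenderTexture that is shown on a quad in the corner of the user's FOV. The slider ranges and the track-up rotation math are illustrative, not CAMRE's exact values.

```csharp
using UnityEngine;

// Sketch: top-down mini-map camera that follows the user, with track-up rotation
// and slider-driven camera height / field of view.
public class MiniMapCamera : MonoBehaviour
{
    [Tooltip("Secondary camera rendering the bird's-eye view into a RenderTexture.")]
    public Camera mapCamera;

    [Tooltip("Track-up (rotate map with the user) vs. north-up (fixed orientation).")]
    public bool trackUp = true;

    public float cameraHeight = 8f;   // meters above the user; bound to the position slider
    public float fieldOfView = 60f;   // degrees; bound to the FOV slider

    private void LateUpdate()
    {
        Transform user = Camera.main.transform;   // HoloLens 2 head pose

        // Hover directly above the user and look straight down.
        mapCamera.transform.position = user.position + Vector3.up * cameraHeight;

        float yaw = trackUp ? user.eulerAngles.y : 0f;
        mapCamera.transform.rotation = Quaternion.Euler(90f, yaw, 0f);

        mapCamera.fieldOfView = fieldOfView;
    }

    // Hooked up to the two sliders (values assumed to arrive normalized to 0..1).
    public void SetCameraHeight(float normalized) => cameraHeight = Mathf.Lerp(2f, 20f, normalized);
    public void SetFieldOfView(float normalized) => fieldOfView = Mathf.Lerp(20f, 90f, normalized);
}
```

A high camera height combined with a low FOV yields the flatter floor-plan view of Figure 5 (a), while a low height with a high FOV yields the more perspective view of Figure 5 (b).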

Combined with the low-latency environmental update attribute of our CAMRE system, the above three expanded features provide additional information for users (including adjacent room layouts and depth cues) when they physically move in the virtual environment created from the surrounding physical environment. Typically, users have compressed depth perception when immersed in a virtual environment; this may be partially due to the lack of certain contextual information typical of the physical environment (such as shadows of physical objects). By including the above three expanded features, CAMRE helps users move smoothly to their desired location while also providing depth cues to enhance immersion.

4 CAMRE System Evaluation Experiment Design

In this section, we describe multiple experiments designed to evaluate the latency of each feature and detail the capabilities of the CAMRE system, including the time taken to construct a virtual environment, scene object transfer latency and throughput (average bytes transferred per second), transfer packet loss, X-ray vision latency, and mini-map latency. The major factors in the experiment design are:

  • Room size difference.

  • Testing different networking frameworks.

  • Number of concurrent connections.

  • Distance between Leader and Followers.

4.1 CAMRE MR Virtual Environment Evaluation

4.1.1 Virtual Environment Data Scale and Constructing Time

To understand the amount of data and time required for users to create and explore the virtual environment, we assessed the data size and construction time of the Leader's CAMRE MR virtual environment. We conducted this evaluation with three rooms of varying sizes: a personal room (3.81m X 3.02m X 2.40m, containing approximately 30 scene objects), a living room (7m X 3.92m X 2.97m, containing approximately 90 scene objects), and a large classroom (13m X 9.2m X 3m, containing approximately 130 scene objects). This evaluation aimed to determine whether the size of the room affects the data size and construction time. During this experiment, we record the time span from pressing the "update scene" button to spawning the last scene object by directly reading the system timer, and we repeat the experiment 20 times for each room to account for any variation that might occur.
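The construction-time measurement amounts to timing the span between the "update scene" button press and the last scene object being spawned. A minimal sketch of this instrumentation is shown below; the OnLastSceneObjectSpawned hook is hypothetical and would be invoked by whatever code instantiates the final scene object.

```csharp
using System.Diagnostics;

// Sketch of the construction-time measurement: start on the "update scene"
// button press, stop when the last scene object has been spawned.
public class ConstructionTimer
{
    private readonly Stopwatch stopwatch = new Stopwatch();

    // Wired to the "update scene" button.
    public void OnUpdateScenePressed()
    {
        stopwatch.Restart();
    }

    // Hypothetical callback invoked after the final scene object is instantiated.
    public void OnLastSceneObjectSpawned(int sceneObjectCount)
    {
        stopwatch.Stop();
        UnityEngine.Debug.Log(
            $"Built {sceneObjectCount} scene objects in {stopwatch.Elapsed.TotalSeconds:F2} s");
    }
}
```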

4.1.2 Virtual Environment Transfer Latency

In CAMRE, we use Unity Netcode for GameObjects with Unity Relay or Photon Unity Networking to transfer the observed virtual environment between multiple devices. Therefore, to compare the pros and cons of the two selected networking frameworks, we measure them using the following metrics (a small computation sketch follows this list):

  • Leader-to-Follower data transfer latency / standard deviation: the average time difference between the Leader creating each scene object and the Follower receiving it, together with the standard deviation across all observed time differences.

  • Room-size scene (50 scene objects) transfer time: During the experiment, we capture the average transfer time as the time difference between receiving the first and the last scene object, along with the average bytes transferred from Leader to Follower and the total number of transferred scene objects as a benchmark. To ensure a fair comparison of transfer times among different Leaders while accounting for varying room sizes, we use the following equation to normalize the transfer time to a standard virtual room with 50 scene objects (a standard indoor room size according to the other Leaders' virtual environments in our medium-distance scenario):

        NormalizedTransferTime = (TotalTransferTime / TotalSceneObjects) × 50

  • Average throughput (bytes per second): average bytes received per second on the Follower side during the transfer of the entire virtual environment.

  • Packet loss: packet loss percentage over the whole virtual environment transfer.
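For clarity, the derived metrics above reduce to the following simple calculations; this is a sketch with illustrative names, not the logging code used in the experiments.

```csharp
// Sketch of the three derived metrics used in the transfer-latency experiments.
public static class TransferMetrics
{
    // Normalize a Leader's total transfer time to a standard 50-scene-object room.
    public static double NormalizedTransferTime(double totalTransferSeconds, int totalSceneObjects)
        => totalTransferSeconds / totalSceneObjects * 50.0;

    // Average throughput observed on the Follower side, in bytes per second.
    public static double AverageThroughput(long totalBytesReceived, double totalTransferSeconds)
        => totalBytesReceived / totalTransferSeconds;

    // Packet loss percentage over the whole virtual environment transfer.
    public static double PacketLossPercent(int packetsLost, int packetsSent)
        => 100.0 * packetsLost / packetsSent;
}
```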

In this experiment, we investigated whether the number of concurrently connected users on the same server and the distance between the Leader and Followers affect transfer efficiency. Therefore, we divide the experiment into three different scenarios (each experiment is conducted 5 times to account for any variation); detailed settings are listed in Table 1:

Table 1: Experiment Scenarios
[Uncaptioned image]

To capture packet data transferred from the Leader to a Follower, we set up one Follower in a Unity emulator as the main evaluation target and used Wireshark [31] to capture the data. We run each experiment combination 5 times, sharing scene objects each time, to account for any variation that might occur.

4.2 CAMRE Expanded Features Evaluation

In addition to evaluating the base CAMRE MR virtual environment, we also assess the real-time behavior of the expanded features. The complete see-through virtual environment function is primarily designed to adjust the transparency of virtual wall objects when the user approaches within 3 meters of a wall. However, due to the short latency of the transparency adjustment and the natural slight movement of the user's head while walking and wearing the HoloLens 2, it is challenging to precisely measure the distance accuracy between the user and the wall object at millimeter scale from an external perspective. Therefore, in this subsection, we only evaluate the dynamic X-ray vision and mini-map features.

4.2.1 Dynamic X-ray Vision Window Display Latency

We conducted an evaluation to determine if the dynamic X-ray vision window has low latency, providing users with a real-time experience. To calculate the display latency, we recorded the timestamp of when the X-ray vision enabling switch was pressed and when the X-ray vision window was displayed, measuring the time gap between them. This experiment is repeated 100 times to account for any variation that might occur.

4.2.2 Dynamic X-ray Vision Window Moving Latency

To evaluate whether the dynamic X-ray vision window consistently follows the user's eye-gaze direction when physically moving in the CAMRE MR virtual environment, we measure the moving latency of the dynamic X-ray vision window by recording timestamps and calculating the time difference between the user's eye-gaze movement captured by HoloLens 2 (the gaze direction can be digitized using eye-gaze ray intersections with wall objects) and the moment the X-ray vision window's position updates to the same position. This experiment is also repeated 100 times to account for any variation that might occur.

4.2.3 Mini-map Moving Latency

Consistently following the user's movement is essential for the mini-map to help users determine their current location in the virtual environment and the related locations of other collaborators, since Followers can move independently without following the Leader in real time. Therefore, we conducted an evaluation to determine whether the mini-map accurately tracks the user's physical movements. Specifically, we measured the mini-map moving latency by recording timestamps and calculating the time difference between when the user rotated by 180 degrees (ensuring a noticeable angle difference between the direction displayed by the mini-map and the user's current facing direction) and when the mini-map rotation caught up to the same rotation. This experiment is also repeated 100 times to account for any variation that might occur.

5 CAMRE Evaluation Results and Discussion

Using the evaluation experiments proposed in the previous section, we collected various experimental results to analyze and discuss the performance of CAMRE in detail.

5.1 CAMRE MR Virtual Environment Data Size and Construction Time

In this experiment, we built the three different-sized rooms into 30, 90, and 130 scene objects to explore virtual environment construction time. Saving the three sets of scene objects into byte files, their respective data sizes are 0.18, 0.33, and 0.75 megabytes. According to the experimental results in Table 2, the personal room took an average of 0.96 seconds (standard deviation 0.13) to fully build the virtual environment, the living room took an average of 2.53 seconds (standard deviation 0.28), and the large classroom took an average of 3.69 seconds (standard deviation 0.10), showing a low construction time for a complete 3D indoor environment compared to [11] with 12 seconds and [41] with 60 seconds of indoor-scene-to-3D-mesh reconstruction computation time. The low data size and construction time further benefit CAMRE when updating and streaming virtual information between the Leader and Followers as the Leader physically moves in the surrounding environment.

Table 2: CAMRE MR Virtual Environment Construction Time
              Personal Room   Living Room   Large Classroom   [11]   [41]
Time (s)      0.96            2.53          3.69              12     60

5.2 CAMRE MR Virtual Environment Transfer Latency

To evaluate and demonstrate the low transfer latency of sharing virtual environments with the two networking frameworks, we conducted three experiment scenarios with different distances and different numbers of concurrent users between Leaders and Followers. We also consider Internet bandwidth as a potential factor affecting transfer latency; therefore, we asked all Leaders and Followers to report their Internet bandwidth as shown by https://fast.com/. Before starting the experiments, we calibrated all connected machines with the Network Time Protocol (NTP) to confirm latency accuracy with millisecond precision. Before the primary observed Follower connects to the server, we launched Wireshark to capture Internet packets and stopped capturing after receiving all the scene objects transferred from the Leader. Since HoloLens 2 does not provide software for users to capture Internet packet data, we can only capture and analyze packet data transferred from the networking servers on the Unity-emulator Follower side. According to the packet data, both Photon networking and Netcode for GameObjects use the User Datagram Protocol (UDP) to transfer data, and there was no packet loss in any of the scenarios discussed in the following.

5.2.1 Scenario 1: Short Distance (SD)

The Leader and Followers in this scenario are all physically at the same location with an Internet speed of 240 Mbps. The data shown in Table 3 indicate that there is no significant difference in the three evaluation metrics between connecting one Leader with one Follower (SD1) and one Leader with three Followers (SD2) under the two networking frameworks, which reflects the low-latency stability of the system even with multiple connected users. During our testing, we discovered that when transferring a room-sized scene, Netcode for GameObjects took longer than Photon networking but had a higher throughput. Our analysis of the captured packet data revealed that Netcode encrypts the transferred data, which could lead to a larger data size, while Photon transfers data directly.

Table 3: Short Distance (SD) and Long Distance (LD) Scenario
[Uncaptioned image]

5.2.2 Scenario 2: Long Distance (LD)

In this scenario, only the primary observed Follower is located elsewhere, about 12,400 km away, with a 530 Mbps Internet speed; the Leader and the other two Followers are located at the same location with a 240 Mbps Internet speed. Based on Table 3, connecting a Leader with one Follower (LD1) or three Followers (LD2) shows no significant difference in data transfer latency, still exhibiting stable behavior for each scene object. However, transferring a room-sized scene in the LD2 scenario takes a little longer to complete, which means that if even one user is located farther away from the other users, the scene transfer time is affected. Furthermore, the average throughput in the LD1 and LD2 scenarios using Photon networking shows no significant difference compared to the short-distance scenario. However, Netcode for GameObjects exhibits a higher throughput, suggesting that Internet speed might affect the throughput of Netcode but does not make a significant difference for the Photon networking framework.

Table 4: Medium Distance (MD) Scenario
[Uncaptioned image]

5.2.3 Scenario 3: Medium Distance (MD)

In this scenario, the Leader and all the Followers are located at different locations. MD1, MD2, and MD3 have different Leaders with Internet speeds of 90, 250, and 90 Mbps, respectively, and the same Follower with a 240 Mbps Internet speed. MD4, MD5, and MD6 have the Leader with a 90 Mbps Internet speed and Followers with 240, 250, and 90 Mbps, respectively. The experiment results shown in Table 4 indicate that data transfer latency is slightly higher in the MD4, MD5, and MD6 scenarios compared to MD1, MD2, and MD3, respectively. We observe a similar pattern when transferring a room-sized scene over the Netcode server, suggesting that connecting from multiple locations may impact transfer performance. Because the other two Followers used HoloLens 2 devices to connect with the Leader, we were unable to capture Internet packets for the MD5 and MD6 scenarios; therefore, we mark MD5 and MD6 as N/A in the average throughput section of Table 4. After analyzing the average throughput results, we observed that MD4 has a higher throughput compared to MD1, MD2, and MD3, which could indicate that the Photon and Netcode servers require higher throughput to efficiently transfer data with multiple users across different locations.

Based on the above experiment results, all of the Leader to Follower data transfer latencies are sub-0.15 seconds, and room size scene transfer time is lower than 1.6 seconds in both networking frameworks, which indicates that the CAMRE system can transfer 3D virtual environments with low latency by utilizing small data size to increase collaboration consistency between Leader and Followers.

Table 5: Networking Framework Comparison Baseline
[Uncaptioned image]

5.2.4 Comparison with Existing Systems

We also provide comparisons with other existing collaboration systems; however, only one AR collaboration system, PLANWELL [21], reports data transfer latency. Therefore, we consider multiple state-of-the-art VR multiuser platforms, including VRChat [37], AltspaceVR [18], Rec Room [27], Mozilla Hubs [19], and Horizon Worlds [15], that transfer data to multiple users and are measured in [4] as client-to-server and back-to-client round-trip time, which is similar to a Leader transferring packets to a server and then to a Follower. The comparison (Table 5) shows that PLANWELL has a data transfer latency of 2.4 seconds, which is higher than CAMRE's latency in any scenario. All the other VR systems have better data transfer latency than the CAMRE system. However, it is important to note that CAMRE is an MR collaboration system that must capture information from the physical environment, which could increase system load. The fact that CAMRE has data transfer latency similar to two of the VR systems indicates that its low-data-size design has a significant impact. In addition, [4] also measured the average throughput of the five state-of-the-art VR collaboration systems; therefore, we compared the average throughput of our CAMRE system with these systems. Even though the two networking frameworks we employed have lower throughput than the state-of-the-art VR systems, the CAMRE system still achieves low data transfer latency, despite the server resources we possess not being as high as those used by large companies.

5.3 Dynamic X-ray Vision Window Display Latency

During the experiment process, we conducted the switch-enabling procedure 100 times. The results are displayed in Figure 6. The average display time was 6.81 milliseconds, with a standard deviation of 2.63 milliseconds. Our findings indicate that the latency period for the X-ray vision window to appear on the display after the switch is pressed is consistently small enough to be considered a real-time feature, which improves usability and reduces the likelihood of virtual motion sickness [29].

Figure 6: Dynamic X-ray vision window display latency in milliseconds over 100 repetitions (average: 6.81 milliseconds, standard deviation: 2.63).

5.4 Dynamic X-ray Vision Window Moving Latency

Similarly, we conducted the eye-gaze movement process (moving 45 degrees from left to right) 100 times, and the results are shown in Figure 7. The average moving latency was 6.57 milliseconds, with a standard deviation of 2.92 milliseconds. The results show a consistently low latency, so the window can be considered a real-time feature that instantly reveals information inside adjacent rooms when users shift their eye-gaze direction. By having the physical-window-like X-ray vision follow the user's eye-gaze direction in real time, the X-ray vision window provides motion parallax, a cue that is normally available in a real environment but is often unavailable in a virtual environment.

Figure 7: Dynamic X-ray vision window moving latency in milliseconds over 100 repetitions (average: 6.57 milliseconds, standard deviation: 2.92).

5.5 Mini-map Moving Latency

We evaluate this feature by performing the rotation task 100 times; the results are shown in Figure 8. The average latency was 5.99 milliseconds, with a standard deviation of 2.42 milliseconds. The results indicate that this feature has a consistent latency under 10 milliseconds, demonstrating its real-time behavior. The mini-map displays the user's surrounding virtual environment in real time, and with the camera options (Figure 5) that give the user a broader view of the environment, users can quickly locate target virtual objects.

Figure 8: Mini-map moving latency in milliseconds over 100 repetitions (average: 5.99 milliseconds, standard deviation: 2.42).

6 Conclusion and Future Work

In the CAMRE framework, we demonstrate dynamically generated MR virtual environments with low latency and small data sizes in the HoloLens 2 device by utilizing the scene understanding feature. With the small data sizes of the virtual environment, we employ a Leader-Follower paradigm, representing the Leader’s surrounding physical environment and transferring these environments through networking frameworks to the Followers in close-to-real-time. This permits remote connections and collaboration, including real-time expanded features to assist users with navigation. As part of our research, we evaluated the performance of the CAMRE framework, which showed that users can construct virtual environments in a short amount of time. Our tests revealed that it takes around 2.5 seconds to build a living room using the framework. We also evaluated two networking frameworks for sharing a typical room size scene and found that their latencies were below 1.6 seconds in all evaluated scenarios. We then assessed the X-ray vision and mini-map display and found that their update latencies were all below 12 milliseconds, which suggests that these features can be used in real-time to help users navigate through the virtual environment.

In the future, to address the limitations described in Section 1.6, we plan to design and conduct an exhaustive set of behavioral studies to understand how users perceive CAMRE as a means of collaboration as well as the efficacy of the MR navigation features. Furthermore, we would also like to investigate the networking performance of CAMRE with more than four concurrent users. Currently, we do not allow the roles of Leader and Followers to be swapped in real-time, but we plan to investigate such a role swapping in the future. Lastly, we aim to enhance the collaborative capabilities of the CAMRE system by implementing real-time frame analysis for the creation of virtual object mesh and dynamic color adaptation of virtual objects to their surrounding environment. At present, the system generates a virtual environment with basic geometric scene objects. However, by fitting primitive geometry [12, 1] to these objects, we could potentially create more detailed virtual objects without overburdening computational resources. These new features may further increase the effectiveness of the CAMRE framework.

Acknowledgements.
This research was sponsored by the DEVCOM U.S. Army Research Laboratory under Cooperative Agreement Number W911NF-21-2-0145 to B.P.
The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the DEVCOM Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation.

References

  • [1] R. Alghofaili, C. Nguyen, V. Krs, N. Carr, R. Mĕch, and L.-F. Yu. Warpy: Sketching environment-aware 3d curves in mobile augmented reality. In 2023 IEEE Conference Virtual Reality and 3D User Interfaces (VR), pp. 367–377. IEEE, 2023.
  • [2] A. Anonymous. Hidden title. In Hidden, p. Hidden, Hidden. doi: Hidden
  • [3] S. Benford, C. Greenhalgh, G. Reynard, C. Brown, and B. Koleva. Understanding and constructing shared spaces with mixed-reality boundaries. ACM Trans. Comput.-Hum. Interact., 5(3):185–223, sep 1998. doi: 10 . 1145/292834 . 292836
  • [4] R. Cheng, N. Wu, M. Varvello, S. Chen, and B. Han. Are we ready for metaverse? a measurement study of social virtual reality platforms. In Proceedings of the 22nd ACM Internet Measurement Conference, IMC ’22, p. 504–518. Association for Computing Machinery, New York, NY, USA, 2022. doi: 10 . 1145/3517745 . 3561417
  • [5] R. Druta, C. Druta, P. Negirla, and I. Silea. A review on methods and systems for remote collaboration. Applied Sciences, 11(21), 2021. doi: 10 . 3390/app112110035
  • [6] T. Duval, T. T. H. Nguyen, C. Fleury, A. Chauffaut, G. Dumont, and V. Gouranton. Improving awareness for 3d virtual collaboration by embedding the features of users’ physical environments and by augmenting interaction tools with cognitive feedback cues. Journal on Multimodal User Interfaces, 8(2):187–197, 2014.
  • [7] J. Fan, P. Zheng, and S. Li. Vision-based holistic scene understanding towards proactive human–robot collaboration. Robotics and Computer-Integrated Manufacturing, 75:102304, 2022. doi: 10 . 1016/j . rcim . 2021 . 102304
  • [8] A. Jones, J. E. Swan, G. Singh, and E. Kolstad. The effects of virtual reality, augmented reality, and motion parallax on egocentric depth perception. In 2008 IEEE Virtual Reality Conference, pp. 267–268, 2008. doi: 10 . 1109/VR . 2008 . 4480794
  • [9] J. A. Jones, J. E. Swan, G. Singh, and S. R. Ellis. Peripheral visual information and its effect on distance judgments in virtual and augmented environments. In Proceedings of the ACM SIGGRAPH Symposium on Applied Perception in Graphics and Visualization, APGV ’11, p. 29–36. Association for Computing Machinery, New York, NY, USA, 2011. doi: 10 . 1145/2077451 . 2077457
  • [10] D. Laskos and K. Moustakas. Real-time upper body reconstruction and streaming for mixed reality applications. In 2020 International Conference on Cyberworlds (CW), pp. 129–132, 2020. doi: 10 . 1109/CW49994 . 2020 . 00027
  • [11] C. Li, L. Yu, and S. Fei. Large-scale, real-time 3d scene reconstruction using visual and imu sensors. IEEE Sensors Journal, 20(10):5597–5605, 2020. doi: 10 . 1109/JSEN . 2020 . 2971521
  • [12] Y. Li, X. Wu, Y. Chrysathou, A. Sharf, D. Cohen-Or, and N. J. Mitra. Globfit: Consistently fitting primitives by discovering global relations. ACM Trans. Graph., 30(4), jul 2011. doi: 10 . 1145/2010324 . 1964947
  • [13] J.-L. Lugrin, M. E. Latoschik, M. Habel, D. Roth, C. Seufert, and S. Grafe. Breaking bad behaviors: A new tool for learning classroom management using virtual reality. Frontiers in ICT, 3:26, 2016.
  • [14] X. Luo, R. Kenyon, D. Kamper, D. Sandin, and T. DeFanti. The effects of scene complexity, stereovision, and motion parallax on size constancy in a virtual environment. In 2007 IEEE Virtual Reality Conference, pp. 59–66, 2007. doi: 10 . 1109/VR . 2007 . 352464
  • [15] Meta. Horizon worlds, 2022. https://www.meta.com/horizon-worlds/.
  • [16] Microsoft. Hololens, 2015. https://www.microsoft.com/en-us/hololens.
  • [17] Microsoft. Mixed reality toolkit 2, 2018. https://docs.microsoft.com/en-us/windows/mixed-reality/mrtk-unity/mrtk2/.
  • [18] Microsoft. Altspacevr, 2022. https://altvr.com/.
  • [19] Mozilla. Mozilla hubs, 2022. https://hubs.mozilla.com/.
  • [20] A. Nagendran, A. Steed, B. Kelly, and Y. Pan. Symmetric telepresence using robotic humanoid surrogates. Computer Animation and Virtual Worlds, 26(3-4):271–280, 2015.
  • [21] A. S. Nittala, N. Li, S. Cartwright, K. Takashima, E. Sharlin, and M. C. Sousa. Planwell: Spatial user interface for collaborative petroleum well-planning. In SIGGRAPH Asia 2015 Mobile Graphics and Interactive Applications, SA ’15. Association for Computing Machinery, New York, NY, USA, 2015. doi: 10 . 1145/2818427 . 2818443
  • [22] O. Oyekoya, R. Stone, W. Steptoe, L. Alkurdi, S. Klare, A. Peer, T. Weyrich, B. Cohen, F. Tecchia, and A. Steed. Supporting interoperability and presence awareness in collaborative mixed reality environments. In Proceedings of the 19th ACM Symposium on Virtual Reality Software and Technology, VRST ’13, p. 165–174. Association for Computing Machinery, New York, NY, USA, 2013. doi: 10 . 1145/2503713 . 2503732
  • [23] Photon. Photon fusion, 2019. https://www.photonengine.com/.
  • [24] C. Pidel and P. Ackermann. Collaboration in virtual and augmented reality: a systematic overview. In Augmented Reality, Virtual Reality, and Computer Graphics: 7th International Conference, AVR 2020, Lecce, Italy, September 7–10, 2020, Proceedings, Part I 7, pp. 141–156. Springer, 2020.
  • [25] T. Piumsomboon, Y. Lee, G. Lee, and M. Billinghurst. Covar: A collaborative virtual and augmented reality system for remote collaboration. In SIGGRAPH Asia 2017 Emerging Technologies, SA ’17. Association for Computing Machinery, New York, NY, USA, 2017. doi: 10 . 1145/3132818 . 3132822
  • [26] A. Robertson and J. Peters. What is the metaverse, and do i have to care?, 2021. https://www.theverge.com/22701104/metaverse-explained-fortnite-roblox-facebook-horizon.
  • [27] Rec Room. Rec room, 2022. https://recroom.com/.
  • [28] A. Schäfer, G. Reis, and D. Stricker. A survey on synchronous augmented, virtual and mixed reality remote collaboration systems. ACM Comput. Surv., apr 2022. Just Accepted. doi: 10 . 1145/3533376
  • [29] J.-P. Stauffert, F. Niebling, and M. E. Latoschik. Latency and cybersickness: Impact, causes, and measures. a review. Frontiers in Virtual Reality, 1:582204, 2020.
  • [30] M. Tanaya, K. Yang, T. Christensen, S. Li, M. O’Keefe, J. Fridley, and K. Sung. A framework for analyzing ar/vr collaborations: An initial result. In 2017 IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications (CIVEMSA), pp. 111–116, 2017. doi: 10 . 1109/CIVEMSA . 2017 . 7995311
  • [31] The Wireshark team. Wireshark, 1998. https://www.wireshark.org/.
  • [32] T. Teo, G. A. Lee, M. Billinghurst, and M. Adcock. 360drops: Mixed reality remote collaboration using 360 panoramas within the 3d scene*. In SIGGRAPH Asia 2019 Emerging Technologies, SA ’19, p. 1–2. Association for Computing Machinery, New York, NY, USA, 2019. doi: 10 . 1145/3355049 . 3360517
  • [33] T. Teo, M. Norman, G. Lee, M. Billinghurst, and M. Adcock. Exploring interaction techniques for 360 panoramas inside a 3d reconstructed scene for mixed reality remote collaboration. Journal on Multimodal User Interfaces, 14, 07 2020. doi: 10 . 1007/s12193-020-00343-x
  • [34] H. Tian, G. A. Lee, H. Bai, and M. Billinghurst. Using virtual replicas to improve mixed reality remote collaboration. IEEE Transactions on Visualization and Computer Graphics, 29(5):2785–2795, 2023. doi: 10 . 1109/TVCG . 2023 . 3247113
  • [35] Unity. Netcode for gameobjects, 2022. https://docs-multiplayer.unity3d.com/.
  • [36] Unity. Unity relay, 2022. https://unity.com/products/relay.
  • [37] VRChat. Vrchat, 2022. https://hello.vrchat.com/.
  • [38] P. Wang, X. Bai, M. Billinghurst, S. Zhang, X. Zhang, S. Wang, W. He, Y. Yan, and H. Ji. Ar/mr remote collaboration on physical tasks: A review. Robotics and Computer-Integrated Manufacturing, 72:102071, 2021. doi: 10 . 1016/j . rcim . 2020 . 102071
  • [39] Z. Wang, C. Nguyen, P. Asente, and J. Dorsey. Distanciar: Authoring site-specific augmented reality experiences for remote environments. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21. Association for Computing Machinery, New York, NY, USA, 2021. doi: 10 . 1145/3411764 . 3445552
  • [40] L. Zhang, A. Agrawal, S. Oney, and A. Guo. Vrgit: A version control system for collaborative content creation in virtual reality. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23. Association for Computing Machinery, New York, NY, USA, 2023. doi: 10 . 1145/3544548 . 3581136
  • [41] I. Zhura, D. Davletshin, N. D. W. Mudalige, A. Fedoseev, R. Peter, and D. Tsetserukou. Neuroswarm: Multi-agent neural 3d scene reconstruction and segmentation with uav for optimal navigation of quadruped robot. arXiv preprint arXiv:2308.01725, 2023.