Experiences with ROS 2 on our robots

This blog post presents some issues which came up when we migrated from ROS 1 and started using ROS 2. We assume that other people might run into them too. Therefore, we will describe the problems and possible solutions below. Our use case of ROS 2 is a bit different than most systems that currently use ROS 2. We use a team of humanoid robots in RoboCup Soccer. Therefore we have on the one hand differences from using a humanoid robot (500+ Hz control loop cycle) and on the other hand from the league (no network connection to the robot after starting the code).

Scheduling

We have many nodes (~45) running concurrently when we start our full software stack. Most of them have additionally multiple threads. At the same time, some of them, e.g. the walking, need to be executed with a constant and high (500+ Hz) rate. This leads to some issues with the default Linux scheduler. Setting the nice value of the processes was not enough to solve this because it does not necessarily reduce latencies. In ROS 1 it was enough for us to assign the processes to specific CPUs using the “taskset” command, but with ROS 2 we still had issues with this. We finally resolved it by also using the “isolcpus” kernel parameter which forbids the scheduler of using the specified CPU core. Therefore, the process can run freely on its own CPU without any interruptions. Using a real time kernel would probably be a better, but also more complicated solution that we will investigate in the future. More information can be found in the following ROSCON talk: https://roscon.ros.org/2019/talks/roscon2019_concurrency.pdf

Executor Performance

One of our largest issues was the extreme performance drop between ROS 1 and ROS 2. Simple nodes that only took a few percent of a core before, now needed a complete core for themselves. Basically, almost every node was running at 100% CPU usage. It took us some time to realize that the issue comes from the fact that we have a lot of messages per second (e.g. /joint_states is sent with 500 Hz) and thus the badly implemented standard executor was totally overloaded. Luckily, iRobot already did a lot of work on this and created an “Events Executor” for rclcpp (https://github.com/ros2/design/pull/305). Unfortunately, this executor is not yet in the master. Furthermore, there is no such thing for rclpy yet. By using it, we reached node performance values similar to ROS 1 with our C++ nodes. But we also needed to rewrite some nodes from Python to C++ as it was otherwise not possible to run our complete software stack on our 8 core CPU. Unfortunately, all standard ROS 2 nodes, e.g. robot state publisher, use the default implementation and therefore needed to be manually patched by us. This also includes the tf listener. The repository containing the stand-alone Events Executor can be found here: https://github.com/irobot-ros/events-executor/ Our patched versions of rclcpp and other packages are linked here: https://github.com/ros2/design/pull/305#issuecomment-1133757777

Callbacks and Timers

When you are in a callback or timer thread (so anything that is handled by the spinning), it is not possible to receive other callbacks per default. This means for example that you can not wait to get a tf transformation, as you will never receive anything while waiting for it. The same is the case for simulation time callbacks in rclpy. This leads to the time not progressing inside callbacks which leads to other issues. The issue is solved in rclcpp by spinning a callback group containing the time callback in a separate thread, but the isolated execution of specific callback groups is not supported in the current implementation of the Python executor (see https://github.com/ros2/rclpy/issues/850). Although there are multi-threaded Executors available, they seem to not solve the issue completely. If there are many callbacks to handle, they might not manage to handle the correct one in time before you run into a timeout. Interestingly, the order in which the subscriptions are created influences this behavior and sometimes issues can be resolved by ordering them differently.

FastDDS

Sometimes FastDDS fails to list nodes / topics after restarting a node while other nodes are running. The FastDDS discovery server (see https://fast-dds.docs.eprosima.com/en/latest/fastdds/ros2/discovery_server/ros2_discovery_server.html) is similar to the concept of a rosmaster in ROS 1 and should fix this issue.

There are issues with callbacks not arriving in C++ on ROS 2 rolling under Ubuntu 22.04 and FastDDS, which was reintroduced as the default DDS for rolling. We observed these issues our self in our own code base, but they are also the reason why nav2 is not released for rolling. See https://github.com/ros-planning/navigation2/issues/2648 for information regarding the release of nav2 for rolling and humble as well as https://discourse.ros.org/t/nav2-issues-with-humble-binaries-due-to-fast-dds-rmw-regression/26128 and https://discourse.ros.org/t/fastdds-without-discovery-server/26117/14 for a more general discussion. Switching to CycloneDDS solved these issues for us, but we still need to build nav2 ourselves.

CycloneDDS Configuration

Cyclone allows a lot of further settings. Unfortunately, this is just documented in the Cyclone documentation. One important setting when working with large messages, e.g. images, is the kernel queue size (see https://github.com/ros2/rmw_cyclonedds). For Cyclone configuration related things one needs to go to https://cyclonedds.io/docs/cyclonedds/latest.

To activate CycloneDDS follow the steps on https://github.com/ros2/rmw_cyclonedds.

The biggest problem that we were facing with Cyclone is that it can only operate on a fixed network configuration which cannot be changed at runtime. The default network adapter that is used it determined by a ranking. Wired adapters have the highest priority, loopbacks the lowest. Adapters can also be specified manually in the cyclone config file. Issues arise if the selected adapter is not available, e.g. because the ethernet cable is unplugged from the robot after starting the software. In our case, that is quite often the case because an ethernet cable for debugging is removed or the wifi disconnects. We solved this issue by creating a bridge network containing the wired adapter. This results in the adapter being available even when the cable is unplugged. Cyclone is then configured to use this bridge network adapter.

There is a fix already proposed for cyclonedds which enables optional adapters, but it is not merged at the time of writing (see https://github.com/eclipse-cyclonedds/cyclonedds/pull/1336). It will also only partially fix the problem, as cyclone will continue to try to use the adapter that was connected when it was first started, and not dynamically switch to another one.

As our robot uses the builtin ethernet adapter internally for the camera, external devices such as laptops are connected via a usb2lan adapter. This introduced the problem that its network interface name might change due to different brands etc. We mitigate this by only using adapters from a single brand and adding a udev rule that always assigns the same name to it. It is then included by this name in the bridge interface described above.

ROS Domain ID

While the default in ROS 1 was that you use your local roscore, in ROS 2 the default is to communicate with anybody in the network. This can quite quickly lead to issues if you have multiple robots or workstations running in the same network. Therefore it is crucial to set the ROS_DOMAIN_ID environment variable differently for every machine, as isolation is done as opt-in instead of opt-out like in ROS 1. The number of domain ids is also quite limited for large setups and a ros domain with many nodes might “leak” into the namespace of another domain id.

Colcon

One of the most unnecessary issues which could have been all avoided by keeping catkin as a build tool are the issues with colcon. Just to get the same quality of usage that you had with catkin tools by default you need to invest a lot of effort. First, there are many additional packages that you need to install (colcon-common-extensions, colcon-edit, colcon-clean). Then you need to set the verbosity level of colcon, as you will otherwise get completely spammed with unnecessary printouts (export COLCON_LOG_LEVEL=30). Colors are still not supported although there are PRs for it (https://github.com/colcon/colcon-core/pull/487). You might want to create a lot of aliases because the colcon commands are very verbose compared to catkin (e.g. colcon build --packages-up-to PACKAGE --symlink-install to build a specific package including its present dependencies).

Rebuilds

Due to changes in the build process, notably the removal of catkin’s devel folder, a default build now installs to the install folder, and the install folder has to be sourced. This makes it necessary to rebuild your package for every change you do, even in configuration files, launch files, scripts, or python code. Colcon has an option --symlink-install, but it often cannot be relied on and it does not include things like launch files or config files. So get used to rebuilding for every parameter you change in a config file.

Rate in Simulation

When using rate.sleep() in simulation, the node does not always sleep correctly. Sometimes it does not sleep at all, which leads to executing the code again in the same time step. https://github.com/ros2/rclcpp/issues/465

Sim Time

The usage of lower than real-time simulations (such as https://humanoid.robocup.org/hl-vs2022/) can be tricky due to packages using the walltime instead of the ros time for some timeouts etc.. The use_sim_time parameter which tells a node to use the time from the /clock topic instead of the wall time is now present at each node separately. This is the case because the concept of global parameters does not exist in ROS 2. It is therefore necessary to pass a launch file argument, e.g. sim:=true, through the full launch file hierarchy to set the use_sim_time parameter at the launch of each node individually.

Parameters

ROS 2 handles parameters quite differently compared to ROS 1. Parameters exist only in the scope of a single node. They need to be declared in the code to be available. During this declaration it is also possible to define the type, value ranges etc. The declaration process is quite verbose if you have a large number of parameters. You can use the following trick to skip the declaration if you don’t care about a neatly generated rqt reconfigure gui:

For Python:

node = Node("example_node", automatically_declare_parameters_from_overrides=True)

# or if you inhert from the Node class

class MyNode(Node):
    def __init__(self):
        super().__init__('example_node', automatically_declare_parameters_from_overrides=True)

For C++:

MyNode::myNode(const std::string &ns, std::vector<rclcpp::Parameter> parameters) :
    Node(ns + "my_node", rclcpp::NodeOptions().allow_undeclared_parameters(true).parameter_overrides(parameters).automatically_declare_parameters_from_overrides(true)) {

If you have any global parameter values, a blackboard node which holds these parameters is needed. It could look like this:

<node name="parameter_blackboard" pkg="demo_nodes_cpp" exec="parameter_blackboard" args="--ros-args --log-level WARN">
	<param name="use_sim_time" value="$(var sim)"/>
	<param from="$(find-pkg-share my_config_package)/config/global_parameters.yaml" />
</node>

Retrieving the global parameters from that blackboard is not as straightforward as querying the parameters of your local node. You need to perform a manual service call. Note that depending on the executor setup service calls might timeout or block indefinitely when done inside of callbacks.

Here is some demo code to retrieve the params of another node like the mentioned parameters blackboard.

def get_parameters_from_other_node(own_node: Node,
                                   other_node_name: str,
                                   parameter_names: List[str],
                                   service_timeout_sec: float = 20.0) -> Dict:
    """
    Used to receive parameters from other running nodes.
    Returns a dict with requested parameter name as dict key and parameter value as dict value.
    """
    client = own_node.create_client(GetParameters, f'{other_node_name}/get_parameters')
    ready = client.wait_for_service(timeout_sec=service_timeout_sec)
    if not ready:
        raise RuntimeError(f'Wait for {other_node_name} parameter service timed out')
    request = GetParameters.Request()
    request.names = parameter_names
    future = client.call_async(request)
    rclpy.spin_until_future_complete(own_node, future)
    response = future.result()

    results = {}  # Received parameter
    for i, param in enumerate(parameter_names):
        results[param] = parameter_value_to_python(response.values[i])
    return results

TF Node Spam

By default in C++ a tf listener creates its own node just to listen to tf updates in parallel. This is not needed anymore, as rclcpp supports spinning of specific callback groups using their own executor. To lower the overhead and reduce spam in e.g. the rqt node view we therefore suggest that you should replace the following instantiation of the tf listener

tf2_ros::Buffer tfBuffer{node->get_clock()};
tf2_ros::TransformListener tfListener{tfBuffer};

with this one

tf2_ros::Buffer tfBuffer{node->get_clock()};
tf2_ros::TransformListener tfListener{tfBuffer, node};

The spinning of the callback group is done with the default executor. You would need to patch tf2_ros your self to use the more performant EventsExecutor. This is especially relevant as /tf callbacks might arrive at high frequency.

All of this does not apply to Python, as neither the separate callback groups nor the EventsExecutor are implemented for it…

Unreleased Packages

One thing that is not a major deal breaker but still annoying are missing releases for some packages. There are always some packages, e.g. rqt_tf_tree, that are not released as apt packages (for rolling) although their code works fine. It just seems like there is a lack of maintainers for ROS 2. We hope that this changes when support for ROS 1 runs out and people can concentrate on ROS 2.

Environment Setup

There are some things that you can set in your ~/.bashrc or ~/.zshrc:

# reduce colcon spam
export COLCON_LOG_LEVEL=30
# make logs colorful
export RCUTILS_COLORIZED_OUTPUT=1
# format logs in terminal
export RCUTILS_CONSOLE_OUTPUT_FORMAT="[{severity}] [{name}]: {message} ({function_name}() at {file_name}:{line_number})"
# force using CycloneDDS as FastDDS is currently broken
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
# source ros
source /opt/ros/rolling/setup.zsh
# source workspace
source $HOME/colcon_ws/install/setup.zsh
# enable autocompletion
eval "$(register-python-argcomplete3 ros2)"
eval "$(register-python-argcomplete3 colcon)"
# reduce depraction warning spam in colcon
export PYTHONWARNINGS='ignore:::setuptools.command.install,ignore:::setuptools.command.easy_install,ignore:::pkg_resources'
# always set a domain ID (should be unique in your network)
export ROS_DOMAIN_ID=42
FacebooktwitterredditpinterestlinkedinmailFacebooktwitterredditpinterestlinkedinmail

6 Comments

  1. Hey, that’s a great article, we faced almost all of the same issues. I think it would be beneficial to post that on ROS Discourse, it might get some traction

  2. Thanks for the great article!
    I just want a clarification about “colcon spamming unnecessary printouts”, what exactly does it spam out?
    If there are no warnings or errors during colcon build, it should print out very minimal information. Are you’re referring to the warning about overriding packages when you have overlayed packages in workspaces?

    1. We observed some deprecation warnings as well as printout related to different ament env variables. Some of that could also be related to our configuration. The default cli output is indeed pretty okay. In addition to that, it is discouraged to build in sourced terminals iirc. which also produces a bunch of printout and may introduce some weird behavior. But I am by no way an expert regarding colcon and ament.

  3. Thanks for sharing!
    As someone that hasn’t made the move to ROS2 yet, and it’s concerned about the little things that make you need to benchmark and triple check everything before you realize it’s not your code but the underlying framework… I really thank you!
    Specially because I really enjoy using python with ROS and it sounds half baked in ROS2.

Leave a Reply to Tony Najjar Cancel reply

Your email address will not be published. Required fields are marked *