
Red Hat OpenShift Virtualization 4.19 significantly improves performance for I/O-intensive workloads such as databases. Multiple IOThreads for Red Hat OpenShift Virtualization is a new feature that allows virtual machine (VM) disk I/O to be spread among multiple worker threads on the host, which are in turn mapped to disk queues inside the VM. This allows a VM to use both its vCPUs and the host CPUs efficiently for multi-stream I/O, improving performance.

This article is a companion to my colleague Jenifer Abrams's introduction of the feature. As a follow-up, I provide performance results to help you tune your VMs for better I/O throughput.
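If you want to try the feature while following along, the configuration lives in the VM's domain spec, as described in that feature introduction. Below is a minimal, hedged sketch of applying it programmatically with the Kubernetes Python client. The ioThreadsPolicy value and supplementalPoolThreadCount field are assumptions based on that introduction (verify them against your OpenShift Virtualization release), and the namespace and VM name are placeholders:

```python
# Hedged sketch: patch an existing VirtualMachine to use a pool of I/O threads.
# The ioThreadsPolicy/ioThreads field names below are assumed from the feature
# introduction linked above and may differ in your KubeVirt/OpenShift release.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
custom_api = client.CustomObjectsApi()

patch = {
    "spec": {
        "template": {
            "spec": {
                "domain": {
                    "ioThreadsPolicy": "supplementalPool",            # assumed policy name
                    "ioThreads": {"supplementalPoolThreadCount": 4},  # assumed field name
                }
            }
        }
    }
}

custom_api.patch_namespaced_custom_object(
    group="kubevirt.io",
    version="v1",
    plural="virtualmachines",
    namespace="default",   # placeholder namespace
    name="my-vm",          # placeholder VM name
    body=patch,
)
```

In practice you would more likely edit the VirtualMachine manifest directly; either way, the VM generally needs a restart for domain-level changes like this to take effect.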

For testing, I used fio with Linux VMs as a synthetic I/O workload. Other efforts are underway to test with applications and with Microsoft Windows.

For more background on how this feature is implemented in KVM, see this article on IOThread Virtqueue Mapping, as well as this companion article demonstrating performance improvements for database workloads in VMs running in a Red Hat Enterprise Linux (RHEL) environment.

Test description

I tested I/O throughput on two configurations: 

  1. A cluster with local storage using the Logical Volume Manager provisioned by the Local Storage Operator (LSO)
  2. A separate cluster using OpenShift Data Foundation (ODF) 

The two configurations are very different, so their results should not be compared with each other.

I tested on pods (for a baseline) and VMs. VMs were allocated 16 cores and 8 GB of RAM. I used test files of 512 GB with 1 VM and 256 GB with 2 VMs, and direct I/O in all tests. For VMs, I used persistent volume claims (PVCs) in block mode, formatted as ext4; for pods, I used filesystem-mode PVCs, also formatted as ext4. All tests were run with the libaio I/O engine.

I tested the following matrix:

| Parameter | Settings |
| --- | --- |
| Storage volume type | Local (LSO), ODF |
| Number of pods/VMs | 1, 2 |
| Number of I/O threads (VMs only) | None (baseline), 1, 2, 3, 4, 6, 8, 12, 16 |
| I/O operations | Sequential and random reads and writes |
| I/O block sizes (bytes) | 2K, 4K, 32K, 1M |
| Concurrent jobs | 1, 4, 16 |
| I/O depth (iodepth) | 1, 4, 16 |
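To make the matrix concrete, here is an illustrative sketch of the kind of fio sweep it implies. This is not the actual test harness (ClusterBuster orchestrated the real runs, as noted below), and the target file path is a placeholder:

```python
# Illustrative sketch of the parameter sweep; not the actual ClusterBuster configuration.
import itertools
import subprocess

OPERATIONS = ["read", "write", "randread", "randwrite"]
BLOCK_SIZES = ["2k", "4k", "32k", "1m"]
JOBS = [1, 4, 16]
IODEPTHS = [1, 4, 16]

for op, bs, jobs, iodepth in itertools.product(OPERATIONS, BLOCK_SIZES, JOBS, IODEPTHS):
    cmd = [
        "fio",
        "--name", f"{op}-{bs}-{jobs}j-{iodepth}d",
        "--filename", "/data/fio-testfile",  # placeholder path on the PVC
        "--size", "512g",                    # 512 GB with 1 VM, 256 GB with 2 VMs
        "--rw", op,
        "--bs", bs,
        "--numjobs", str(jobs),
        "--iodepth", str(iodepth),
        "--ioengine", "libaio",              # asynchronous I/O engine used in the tests
        "--direct", "1",                     # direct I/O, bypassing the page cache
        "--group_reporting",
        "--output-format", "json",
    ]
    subprocess.run(cmd, check=True)
```

Each (operation, block size, jobs, iodepth) combination corresponds to one cell in the result tables later in this article.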

I used ClusterBuster to orchestrate the tests. The VMs used CentOS Stream 9, and the pods likewise used CentOS Stream as a container image base.

Local storage

The local storage cluster was a 5-node (3 master + 2 worker) cluster of Dell R740xd nodes, each containing 2 Intel Xeon Gold 6130 CPUs with 16 cores and 2 threads per core (32 logical CPUs), for 32 cores and 64 logical CPUs per node. Each node contained 192 GB of RAM. The I/O subsystem consisted of four Dell-branded Kioxia CM6 MU 1.6 TB NVMe drives configured as a RAID0 striped multiple device (MD) array with default settings. Persistent volume claims were carved out of this MD device using the lvmcluster operator. Unfortunately, this rather modest configuration was all I had available, and it's quite possible that a faster I/O subsystem would show even more improvement from multiple I/O threads.

OpenShift Data Foundation 

The OpenShift Data Foundation (ODF) cluster was a 6-node (3 master + 3 worker) cluster of Dell PowerEdge R7625 nodes, each containing 2 AMD EPYC 9534 CPUs with 64 cores and 2 threads per core (128 logical CPUs), for 128 cores and 256 logical CPUs per node. Each node contained 512 GB of RAM. The I/O subsystem consisted of two 5.8 TB NVMe drives per node, with 3-way replication over the default pod network on 25 GbE. I did not have access to a faster network for this test, but newer network hardware would likely have yielded better uplift.

Summary of results

This test evaluates multiple I/O threads with specific I/O back ends that may not be representative of your use case. Differences in storage characteristics can have a major impact on the choice of number of I/O threads.

Here's what my tests revealed.

  • Maximum I/O throughput: For local storage, maximum throughput was about 7.3 GB/sec read and 6.7 GB/sec write for both pods and VMs, regardless of iodepth or number of jobs. That is substantially less than what would be expected from the hardware: the devices (each PCIe Gen4 x4) are rated at 6.9 GB/sec read and 4.2 GB/sec write. I have not investigated the reason for this, but I was running on aging hardware. The top performance is clearly better than single-drive performance, indicating that striping was having an effect. For ODF, the best I achieved was about 5 GB/sec read and 2 GB/sec write.
  • Large block I/O (1 MB) showed little if any improvement, because performance was already limited by the system.
  • The optimal choice for the number of I/O threads varies with workload and storage characteristics. As expected, workloads without significant I/O concurrency showed little benefit.
    • Local storage: For VMs with significant I/O concurrency, 4 to 8 I/O threads are generally a good starting point. For workloads with small I/O sizes and high concurrency in particular, more threads can yield additional benefit.
    • ODF: More than 1 I/O thread rarely yielded significant benefit, and in many cases none at all were needed. This is likely due to the comparatively slow pod network; it's likely that faster networking would yield different results.
  • Multiple I/O threads were more effective at delivering improvement with multiple concurrent jobs than with deep asynchronous I/O, at least with this test.
  • There was little difference in behavior between 1 and 2 concurrent VMs until underlying aggregate maximum I/O throughput (noted above) was reached.
  • Multiple I/O threads did not completely close the gap with pods at lower job counts or iodepths. At high iodepth with small operations, VMs actually outperformed pods by significant margins for write operations.

By the numbers

Here is the overall improvement in I/O throughput to be had using multiple I/O threads on my local storage-based system. As you can see, workloads involving smaller I/O sizes and lots of parallelism against a fast I/O system can benefit greatly. Below, I present more of my findings about the benefit I got from different numbers of I/O threads. I observed only minor improvement with a 1 MB block size because performance was already very close to the underlying system limit. With even faster hardware, it's possible that additional I/O threads would yield improvement even with large block sizes.

Best improvement over VM baseline with additional I/O threads

(Local storage)

| size | op | jobs=1, iodepth=1 | jobs=1, iodepth=4 | jobs=1, iodepth=16 | jobs=4, iodepth=1 | jobs=4, iodepth=4 | jobs=4, iodepth=16 | jobs=16, iodepth=1 | jobs=16, iodepth=4 | jobs=16, iodepth=16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2048 | randread | 18% | 31% | 30% | 30% | 103% | 192% | 151% | 432% | 494% |
| 2048 | randwrite | 81% | 59% | 24% | 153% | 199% | 187% | 458% | 433% | 353% |
| 2048 | read | 67% | 58% | 25% | 64% | 71% | 103% | 252% | 241% | 287% |
| 2048 | write | 103% | 64% | 0% | 143% | 99% | 84% | 410% | 250% | 203% |
| 2048 | Total | 67% | 53% | 20% | 97% | 118% | 141% | 318% | 339% | 334% |
| 4096 | randread | 18% | 34% | 28% | 33% | 101% | 208% | 156% | 432% | 492% |
| 4096 | randwrite | 95% | 69% | 20% | 149% | 200% | 187% | 471% | 543% | 481% |
| 4096 | read | 26% | 53% | 27% | 24% | 46% | 66% | 142% | 155% | 165% |
| 4096 | write | 103% | 69% | 0% | 144% | 86% | 48% | 438% | 256% | 161% |
| 4096 | Total | 60% | 56% | 19% | 87% | 108% | 127% | 302% | 346% | 325% |
| 32768 | randread | 16% | 23% | 26% | 23% | 71% | 124% | 99% | 160% | 129% |
| 32768 | randwrite | 75% | 71% | 28% | 108% | 132% | 116% | 203% | 123% | 115% |
| 32768 | read | 21% | 57% | 25% | 21% | 42% | 32% | 77% | 54% | 32% |
| 32768 | write | 79% | 64% | 26% | 104% | 59% | 24% | 195% | 45% | 27% |
| 32768 | Total | 48% | 53% | 26% | 64% | 76% | 74% | 143% | 96% | 76% |
| 1048576 | randread | 5% | 2% | 0% | 9% | 0% | 0% | 17% | 0% | 0% |
| 1048576 | randwrite | 10% | 0% | 1% | 6% | 0% | 2% | 9% | 0% | 2% |
| 1048576 | read | 12% | 18% | 0% | 9% | 0% | 0% | 16% | 0% | 0% |
| 1048576 | write | 19% | 0% | 0% | 7% | 0% | 0% | 9% | 0% | 0% |
| 1048576 | Total | 11% | 5% | 0% | 8% | 0% | 1% | 13% | 0% | 0% |

 

Next, here is the number of I/O threads necessary to achieve 90% of the best result achievable with up to 16 I/O threads. For example, if the best result achieved in my test with a particular combination of operation, block size, jobs, and iodepth was 1 GB/sec, then the metric here is the fewest threads needed to achieve 900 MB/sec. This allows setting a conservative number of threads while still achieving good performance.

Minimum iothread count to achieve 90% of best performance

(Local storage)

| size | op | jobs=1, iodepth=1 | jobs=1, iodepth=4 | jobs=1, iodepth=16 | jobs=4, iodepth=1 | jobs=4, iodepth=4 | jobs=4, iodepth=16 | jobs=16, iodepth=1 | jobs=16, iodepth=4 | jobs=16, iodepth=16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2048 | randread | 1 | 1 | 1 | 1 | 4 | 8 | 3 | 12 | 12 |
| 2048 | randwrite | 1 | 1 | 1 | 3 | 16 | 12 | 8 | 12 | 12 |
| 2048 | read | 1 | 1 | 1 | 1 | 4 | 4 | 4 | 6 | 8 |
| 2048 | write | 1 | 1 | 0 | 2 | 12 | 6 | 8 | 6 | 6 |
| 2048 | Total | 1 | 1 | 1 | 2 | 9 | 8 | 6 | 9 | 10 |
| 4096 | randread | 1 | 1 | 1 | 1 | 4 | 8 | 3 | 12 | 12 |
| 4096 | randwrite | 1 | 1 | 1 | 3 | 16 | 12 | 8 | 12 | 12 |
| 4096 | read | 1 | 1 | 1 | 1 | 2 | 3 | 3 | 6 | 8 |
| 4096 | write | 1 | 1 | 0 | 3 | 12 | 4 | 8 | 6 | 4 |
| 4096 | Total | 1 | 1 | 1 | 2 | 9 | 7 | 6 | 9 | 9 |
| 32768 | randread | 1 | 1 | 1 | 1 | 3 | 6 | 2 | 4 | 3 |
| 32768 | randwrite | 1 | 1 | 1 | 2 | 12 | 6 | 4 | 3 | 3 |
| 32768 | read | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 |
| 32768 | write | 1 | 1 | 1 | 2 | 6 | 2 | 4 | 2 | 1 |
| 32768 | Total | 1 | 1 | 1 | 2 | 6 | 4 | 3 | 3 | 2 |
| 1048576 | randread | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1048576 | randwrite | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1048576 | read | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1048576 | write | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1048576 | Total | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |

Detailed results

For each test case measured, I calculated the following figures of merit:

  1. Measurement of I/O throughput
  2. Best VM performance (not directly reported)
  3. Minimum number of iothreads to achieve 90% of the best VM performance
  4. Ratio of best VM performance to pod performance
  5. Improvement of best VM performance over baseline VM performance

I am NOT reporting the number of threads for the best performance, because in many cases the differences were very small, less than the normal variance in reporting I/O performance.
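As a rough sketch of how the last two figures of merit are computed for a single test case (the throughput numbers below are made up for illustration; 0 denotes the baseline VM with no dedicated I/O threads):

```python
# Rough sketch of figures of merit 3 and 5 for a single test case.
# `throughput` maps iothread count -> measured throughput; 0 is the baseline VM
# (no dedicated I/O threads). Values below are hypothetical.
throughput = {0: 1.0, 1: 1.4, 2: 1.9, 4: 2.6, 8: 2.9, 16: 3.0}  # GB/sec, illustrative

best = max(throughput.values())

# Figure of merit 5: improvement of the best VM result over the baseline VM.
improvement_over_baseline = best / throughput[0] - 1.0   # e.g. 2.0 means "200%"

# Figure of merit 3: fewest I/O threads reaching at least 90% of the best result.
min_threads_for_90pct = min(
    n for n, t in sorted(throughput.items()) if t >= 0.9 * best
)

print(f"improvement over baseline: {improvement_over_baseline:.0%}")
print(f"minimum iothreads for 90% of best: {min_threads_for_90pct}")
```

Applied across all test cases, these two numbers are what the "best improvement" and "minimum iothread count" tables above report.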

I provide separate summaries of the results for local storage and ODF, because their characteristics are so different.

All performance graphs below show, on the X axis, results for pods (pod), a baseline VM without I/O threads (0), and VMs with the specified number of I/O threads.

Local storage

If we look at raw performance, we see that in at least some cases using multiple I/O threads offers substantial benefit. For example, with 16 jobs and asynchronous I/O at iodepth 1 on local storage, additional I/O threads can yield a benefit of the better part of an order of magnitude:

[Figure: Threads for OpenShift]

Even with single-stream I/O, the use of an extra I/O thread can yield benefit. Not surprisingly, more than one does not help:

[Figure: Threads for OpenShift]

There are anomalous cases where extra I/O threads actually hurt performance. In this case, using deep asynchronous I/O with small blocks, the best performance (even better than pods) is achieved with VMs using no dedicated I/O threads. I have not determined why this happens.

[Figure: Threads for OpenShift]

All of this demonstrates that to get the best performance out of multiple I/O threads, you need to experiment with your particular workload.

ODF cluster results

In contrast to local storage, where small block random write demonstrated dramatic improvement with multiple I/O threads, I observed minimal improvement with ODF even with high job counts. It's likely that faster, or lower latency, networking would yield greater benefit. Read operations, particularly random read, did demonstrate modest benefit, but writes, and lower job counts, showed little (if any) benefit.

[Figure: Threads for OpenShift]

Conclusions

Multiple I/O threads for OpenShift Virtualization is an exciting new feature in OpenShift 4.19 that offers the potential for substantial improvements in I/O performance for workloads with concurrent I/O, particularly with fast I/O systems such as the local NVMe storage used in my tests. Faster I/O subsystems are expected to benefit the most from multiple I/O threads, because more CPU resources are needed to fully drive the underlying bare-metal I/O. As always with I/O, differences in storage systems and overall workloads can greatly affect performance, so I recommend testing your own workloads to take best advantage of this new feature. I hope that my test results help you make good choices for I/O threads!
