The Road to Introducing Virtual Memory Allocators


November 27, 2023

Slide Overview

■Overview
The memory allocator for RE ENGINE was developed in-house.

We will discuss its history, which has evolved over time, and explain how it's used in our latest titles.

Note: This is the contents of the publicly available CAPCOM Open Conference Professional RE:2023 videos, converted to slideshows, with some minor modifications.

■Prerequisites
Assumes knowledge of basic memory allocation and release, such as malloc and free.

I'll show you just a little bit of the content!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CAPCOM Open Conference Professional RE:2023
https://www.capcom-games.com/coc/2023/

Check the official Twitter account for the latest information on CAPCOM R&D!
https://twitter.com/capcom_randd
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Copyright (c) 2018-2021 Microsoft Corporation, Daan Leijen
Released under the MIT License: https://opensource.org/license/mit/

Microsoft, Windows, and Microsoft Teams are trademarks or registered trademarks of Microsoft Corporation in the United States and other countries.
Screenshots of Microsoft products are used with permission from Microsoft.


Text of each page
1.

The Road to Introducing Virtual Memory Allocators

In this presentation, titled "The Road to Introducing Virtual Memory Allocators," I will discuss how memory allocation for RE ENGINE's game runtime was changed to virtual memory management.

©CAPCOM

2.

Introduction

• Street Fighter 6 and Exoprimal use virtual memory allocators
• Other titles use a different memory allocator (details later)
• I will introduce memory management methods in RE ENGINE, and discuss a memory allocator that supports various game genres

This session deals with the topic of memory allocators. Street Fighter 6 and Exoprimal, which were released in 2023, use a virtual memory allocator. Other titles that have already been released use a different memory allocator. There are multiple memory allocators in RE ENGINE today, with the most recent implementation being the virtual memory allocator. We will discuss memory allocator strategies that support a variety of game genres.

3.

Agenda

• Memory Allocator Overview
• Memory Management Scheme
• Heap Allocator Overview
• Virtual Memory Allocator
• Summary and Prospects for the Future

This is the agenda. After giving an overview of memory allocators and memory management methods, I will move on to the allocator itself in operation at Capcom.

4.

Memory Allocator Overview

First, an overview of memory allocators.

5.

What is a memory allocator?

• Programs that carve out memory areas to be allocated and released
• Called when new and delete are used in C++
• Various elements in the game demand memory

mMesh = new Mesh();
mDrawPoints = new vec3[1024];
mBufferPtr = std::make_unique<u8[]>(1024);

A memory allocator is a program that carves out memory areas to be allocated and deallocated. In C++, for example, it is called when new and delete are performed. In a game, not only the visible objects on the screen, but also various elements such as programs that manage the game state and processes in the engine constantly require memory.

6.

Standards Required for Memory Allocators

• Numerous memory allocations/deallocations run during game execution
• Severe demand for execution speed
• Effective use of limited memory space is also required

During a game, a large number of memory allocations and releases are constantly taking place. This means that a slow memory allocator will slow down the overall execution speed of the game; it must be lightweight. However, this does not mean that the process can be a messy one. A fixed-size allocation like the one shown on the slide is fast because all you have to do is advance a counter, but if the required size is less than the allocated size, you will waste space. Modern game consoles often do not support page swapping and have severe limitations on the amount of memory space they can handle compared to PCs. While it is difficult to achieve zero wasted space, allocators must not only be lightweight, but must also make full use of the limited memory space available.
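The counter-advancing fixed-size allocation mentioned in the notes can be sketched as a bump (linear) allocator. This is an illustrative model of the trade-off being described, not RE ENGINE code; all names are invented:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal bump ("linear") allocator sketch: allocation is just advancing a
// counter, which is why it is fast; individually freed blocks are never
// reused, which is why it can waste space.
class BumpAllocator {
public:
    BumpAllocator(void* base, size_t capacity)
        : mBase(static_cast<uint8_t*>(base)), mCapacity(capacity), mOffset(0) {}

    void* allocate(size_t size, size_t align = 16) {
        size_t aligned = (mOffset + align - 1) & ~(align - 1);
        if (aligned + size > mCapacity) return nullptr; // out of budget
        mOffset = aligned + size;
        return mBase + aligned;
    }

    // Individual free is a no-op; the whole region is reset at once.
    void reset() { mOffset = 0; }

    size_t used() const { return mOffset; }

private:
    uint8_t* mBase;
    size_t mCapacity;
    size_t mOffset;
};
```

Note how a 10-byte request still advances the counter to the next 16-byte boundary: that rounding is exactly the wasted space the notes warn about.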

7.

Initiatives in RE ENGINE

• RE ENGINE uses memory allocators produced in-house
• Performance is predictable and can be optimized for usage
• Easy to implement leak detection and profiling processes

To meet these stringent requirements, RE ENGINE builds its memory allocators in-house. This allows for predictable performance and room for optimization. It also has the advantage of making it easy to introduce processes that improve development efficiency, such as leak detection and performance profiling.

8.

Fragmentation and Trends in Memory Allocation in Games

Now, to keep the terms we will touch on fresh in mind, let's look at fragmentation and memory allocation trends in games.

9.

What is memory fragmentation?

• Free and in-use areas are scattered throughout memory
• Not all of the apparent free space is available
• As fragmentation progresses, the program becomes unable to continue

In this presentation, the term memory fragmentation, or simply fragmentation, refers to a situation in which a memory area has free and in-use areas that are disjointed. Suppose we have a memory area like the one on the slide, where the areas in blue are in use. In this case, the free space is scattered, and less memory can be allocated than the apparent capacity. For example, if you want to allocate three contiguous blocks of memory, there are not enough contiguous free areas. As fragmentation gradually progresses, memory allocation will fail and the program will be unable to continue. Let's look at the characteristics of each game genre.

10.

Games Where Fragmentation Doesn't Occur Much

• Fighting games and stage-based games
• Often have a fixed flow, and what needs to be loaded is known
• Note: memory allocation trends will vary by stage and character even in the same game
• Memory tends to stay clean

In fighting games and stage-based games, what is to be loaded, and to a degree where the player will go, are already known, and memory tends to stay clean as the player enters and exits stages.

11.

Games Prone to Fragmentation

• Open-world games with no limits on the user's movements
• Games with online elements rich in randomness
• Memory that cannot be planned in advance is frequently allocated and released, and its lifetime is difficult to control

On the other hand, we have open-field games, or games with an online component, that do not allow for much limitation on user movement. Memory block lifetimes can only be pre-determined to a limited extent, and there are somewhat random cases, such as memory allocated in one location being retained forever.

12.

Memory Management in RE ENGINE

Now that we are all on the same page, let's talk about the internals of RE ENGINE. We will start with how memory is managed in the program.

13.

Memory Management Scheme

• Memory budget area divided into segments
• A memory allocator exists for each segment
• Provides a top-down view of overall memory usage
• Fragmentation is controlled to some extent

Segments: Default, Permanent, Resource, Develop, Temp

ScopedMemorySegment segment(Temp);
// This memory is taken from the Temp segment
auto scratchPtr = new u8[1024];

RE ENGINE divides the entire memory budget area into units called segments. The size of each segment is determined by the title developers. Examples of segments are Default for general memory, Permanent for permanent memory blocks such as singletons and managers, Temp for temporary memory, and Resource for in-game resources such as scene data and motion. For each segment, there is an instance of a memory allocator. In code, the segment is dynamically switched and allocated as shown on the slide. This allocation method has the advantage that it is easy to see at a glance what kind of memory is being used for what purpose. In addition, fragmentation is naturally controlled to some extent because memory areas are separated by use.
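The ScopedMemorySegment usage on the slide suggests an RAII switch of a thread-local "current segment" that allocation routines consult. The following is a guess at the shape of such a mechanism; the enum values and names are assumptions, not RE ENGINE's actual implementation:

```cpp
#include <cassert>

// Illustrative segment set, mirroring the names on the slide.
enum Segment { Default, Permanent, Resource, Develop, Temp };

// Hypothetical thread-local that allocation routines would consult.
thread_local Segment gCurrentSegment = Default;

// RAII switch: sets the current segment on entry, restores it on scope exit,
// so nested scopes compose correctly.
class ScopedMemorySegment {
public:
    explicit ScopedMemorySegment(Segment s) : mPrev(gCurrentSegment) {
        gCurrentSegment = s;
    }
    ~ScopedMemorySegment() { gCurrentSegment = mPrev; }

private:
    Segment mPrev;
};

Segment currentSegment() { return gCurrentSegment; }
```

The RAII shape matters: even if an allocation inside the scope throws, the destructor restores the previous segment.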

14.

Heap Allocator

Let's move on to a description of the heap allocator, a type of allocator that actually carves out memory. This heap allocator was employed in all RE ENGINE titles before the advent of the virtual memory allocator.

15.

Heap Allocator: Initialization to Termination

• Allocates all budgeted memory from the OS at startup
• Not returned until end of process
• Flow: process startup → game running → process end

The RE ENGINE heap allocator allocates all budgeted memory from the OS at process startup. Since the allocator for each segment obtains consecutive memory addresses from the OS, in the figure only five memory allocation requests are made to the OS, one per segment (Default, Permanent, Resource, Develop, Temp). Then, during game execution, memory blocks are carved out in response to memory allocation requests in the program. When the process terminates, the allocator for each segment returns its memory to the OS.
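This lifetime model can be sketched as one OS request per segment, held until process end. A minimal sketch, with malloc standing in for the real OS allocation call and all names invented:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <map>
#include <string>

// One contiguous region per segment, obtained once at startup.
struct SegmentHeap {
    uint8_t* base = nullptr;
    size_t capacity = 0;
};

class HeapAllocatorWorld {
public:
    // One "OS" request per segment, all up front (malloc is a stand-in for
    // the platform's real reservation call).
    void startup(const std::map<std::string, size_t>& budgets) {
        for (const auto& b : budgets) {
            mSegments[b.first] =
                { static_cast<uint8_t*>(std::malloc(b.second)), b.second };
        }
    }

    // Memory goes back to the OS only when the process ends.
    void shutdown() {
        for (auto& s : mSegments) std::free(s.second.base);
        mSegments.clear();
    }

    size_t capacityOf(const std::string& name) const {
        auto it = mSegments.find(name);
        return it == mSegments.end() ? 0 : it->second.capacity;
    }

private:
    std::map<std::string, SegmentHeap> mSegments;
};
```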

16.

Advantages of Heap Allocators

• Speed advantage: no system calls occur; placed for maximum page alignment
• Debugging hints: the relevant area can be inferred from the registers at the time of a crash; easy to find header information in adjacent memory

The fact that no system calls, whose execution time is difficult to predict, are made during game execution is a speed advantage. Another speed advantage is that TLB misses are minimized by applying the maximum appropriate page alignment. Furthermore, the memory area managed by the heap allocator is determined at startup and does not change during game execution, which is useful for debugging. The relevant segment can be inferred from the register information in the event of a crash.

17.

Advantages of Heap Allocators

If the address in a register points to the red memory block in the figure, the address range indicates that it refers to the Permanent segment. All memory blocks allocated from a segment have header information embedded in them to trace back to the allocation source. Therefore, by looking for header information in the memory blocks adjacent to an area destroyed by an overrun, it's possible to identify the cause of such bugs at an early stage.
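The header-based traceback can be illustrated with a small header placed in front of every block, so that walking backwards from trampled memory finds the neighboring allocation's metadata. The field layout and magic value here are assumptions for illustration, not RE ENGINE's actual header:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical per-block header embedded just before the user pointer.
struct AllocHeader {
    uint32_t magic;   // sanity marker to recognise a header in raw memory
    uint32_t size;    // user-visible size of the block
    const char* tag;  // e.g. segment name or allocation site
};

constexpr uint32_t kHeaderMagic = 0xA110CA7E;

void* debugAlloc(size_t size, const char* tag) {
    auto* h = static_cast<AllocHeader*>(
        std::malloc(sizeof(AllocHeader) + size));
    h->magic = kHeaderMagic;
    h->size = static_cast<uint32_t>(size);
    h->tag = tag;
    return h + 1;  // user memory starts right after the header
}

// Given a user pointer (e.g. a block adjacent to overrun damage), recover
// the header placed just before it.
const AllocHeader* headerOf(const void* userPtr) {
    const AllocHeader* hdr = reinterpret_cast<const AllocHeader*>(
        static_cast<const uint8_t*>(userPtr) - sizeof(AllocHeader));
    return hdr->magic == kHeaderMagic ? hdr : nullptr;
}

void debugFree(void* userPtr) {
    std::free(static_cast<uint8_t*>(userPtr) - sizeof(AllocHeader));
}
```

In a crash dump, scanning adjacent memory for the magic value recovers the tag of the allocation next to the destroyed area, which is the "early identification" benefit the notes describe.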

18.

Heap Allocator Issues

• Fixed capacity for each use
• Tends to work against you in games like open worlds
• Must be adjusted to stay within range throughout the entire game

Even the heap allocator, which has many such advantages, has encountered challenges as the scale of development has expanded. There are two major ones. One is that the capacity is fixed for each use. Open-world games, or games with multiple game modes within a single game, commonly have completely different memory consumption situations from mode to mode. However, since the capacity is fixed for each use, it must be adjusted to stay within range throughout the entire game.

19.

Heap Allocator Issues

• Vulnerable to fragmentation
• (Figure: heap diagrams with Used 24 MB / 60 MB, Free 40 MB / 4 MB, Avail 16 MB / 4 MB)

Another issue is susceptibility to fragmentation. The concept of segmentation naturally suppresses fragmentation to a certain extent by separation. However, when memory blocks are left behind, the allocator may have 40 MB of free space while only a small amount of contiguous memory can be allocated. It is necessary to ensure that memory is allocated and released thoroughly during game production to prevent this, but in some situations, such as games with open-world elements, it may be unavoidable.

20.

Virtual Memory Allocator

Then comes the virtual memory allocator.

21.

Virtual and Physical Addresses

• The memory address normally touched in a program = virtual address
• A physical address is mapped to a virtual address

Code → virtual address → physical address:
memset(bufferPtr, 0xff, size);  0x1fbcc000–0x1fbcd000  0x1000–0x2000
position->x += 1.0f;            0x20001000–0x20002000  0x2000–0x3000
int v = *srcPtr;                0x70004000–0x70005000  0x3000–0x4000

Before getting into the topic of virtual memory allocators, let's touch on virtual and physical addresses. Usually, it is the virtual address that we deal with directly in a program. The physical address is mapped to this virtual address to access the actual main memory. For example, when a program like this runs to write 0xff to a buffer, the address in the bufferPtr variable is a virtual address. This virtual address is tied to a physical address in main memory, and the actual value 0xff is written to the physical address via the virtual address. The same is true for reads and writes when adding values to members of a structure or class, or when reading data via a pointer variable.

22.

Relationship to Fragmentation

• Heap allocator mapping review: virtual and physical addresses are mapped at startup
• Fragmentation in the virtual address space directly leads to fragmentation of memory space
• (Figure: Used 24 MB, Free 40 MB, Avail 20 MB)

The heap allocator performs the mapping between virtual and physical addresses at startup. Therefore, fragmentation in the virtual address space directly leads to fragmentation of the available memory space. This fragmentation can be eliminated to some extent by finely tuning the mapping operation between virtual and physical addresses.

23.

Eliminating Fragmentation

• Virtual/physical address mapping operations can be performed on a page-by-page basis
• Physical addresses can be scraped together from a distance
• Virtual address space is so vast that fragmentation there is negligible: a vast virtual address space is available that cannot be compared to the physical memory space

u8* p = new u8[8192];
delete[] p;

Virtual: 0x10000000–0x10001000, 0x10001000–0x10002000
Physical: 0x01000–0x02000, 0x02000–0x03000, 0x03000–0x04000

The mapping operation between virtual and physical addresses can be performed in units of a size called a page. The size of a page can be 4 KiB, 16 KiB, 64 KiB, etc., depending on the platform. Consider a situation where 8 KiB of memory is allocated in an environment where the page size is 4 KiB. The virtual addresses are contiguous. The physical addresses, on the other hand, are fragmented, with areas referenced from other locations in between. Nevertheless, this is not a problem, because mapping can be done page by page.

24.

Eliminating Fragmentation

Thus, physical addresses can be scraped together from all over memory. If the physical address is unmapped when memory is released, it can be reused for other purposes. While the physical address space handled by a process is on the order of gigabytes, a 64-bit process has a virtual address space on the order of terabytes. Therefore, there is no need to think too much about fragmentation of the virtual address space.
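The page-by-page scraping described on these two slides can be modeled with a toy page table: virtual pages stay contiguous while the physical pages backing them come from wherever they happen to be free. This is a simulation for illustration only, not a real OS interface:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

constexpr uint64_t kPageSize = 4096;  // 4 KiB pages, as in the example

// Toy model: contiguous virtual pages backed by scattered physical pages.
class PageMapper {
public:
    explicit PageMapper(std::vector<uint64_t> freePhysicalPages)
        : mFreePhys(std::move(freePhysicalPages)) {}

    // Map `size` bytes at a contiguous virtual address, one page at a time,
    // taking physical pages from wherever they are free.
    bool map(uint64_t virtualAddr, uint64_t size) {
        uint64_t pages = (size + kPageSize - 1) / kPageSize;
        if (pages > mFreePhys.size()) return false;  // not enough physical memory
        for (uint64_t i = 0; i < pages; ++i) {
            mPageTable[virtualAddr + i * kPageSize] = mFreePhys.back();
            mFreePhys.pop_back();
        }
        return true;
    }

    // Unmapping returns the physical pages for reuse elsewhere.
    void unmap(uint64_t virtualAddr, uint64_t size) {
        uint64_t pages = (size + kPageSize - 1) / kPageSize;
        for (uint64_t i = 0; i < pages; ++i) {
            auto it = mPageTable.find(virtualAddr + i * kPageSize);
            if (it != mPageTable.end()) {
                mFreePhys.push_back(it->second);
                mPageTable.erase(it);
            }
        }
    }

    uint64_t physicalFor(uint64_t virtualAddr) const {
        return mPageTable.at(virtualAddr);
    }

private:
    std::vector<uint64_t> mFreePhys;          // scattered free physical pages
    std::map<uint64_t, uint64_t> mPageTable;  // virtual page -> physical page
};
```

Even though the two free physical pages are nowhere near each other, the 8 KiB request succeeds because the virtual side is contiguous and each virtual page can point anywhere.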

25.

Segments as Management Concepts

• Size allocation does not imply separation of memory areas
• Effective use of physical memory that is actually free

And it is not only fragmentation that is solved by this. With the separation of virtual and physical addresses, per-segment size allocation no longer implies memory area separation. Even though the segments are numerically separated, allocations can still succeed by picking up available physical memory, so the segment size is effectively variable. Segments are used as a management guide to indicate the budget for what and how much memory can be used.

26.

Growing Demand for Virtual Memory Allocators

As development has become larger and more diverse, there have been many calls for the virtual memory allocator described earlier.

27.

Virtual Memory Allocator

• Added mapping of virtual and physical addresses to the heap allocator
• Unnecessary space is returned to the OS
• Even if physical addresses are fragmented, allocation succeeds if the total area is sufficient
• Concerns: performance impact of TLB cache misses; system call overhead; how granular should returning memory to the OS be?; no expertise in virtual memory allocators; knowledge of VRAM allocators not transferable

Now let us consider the introduction of a virtual memory allocator. In principle, it would be a good idea to add a dynamic mapping function between virtual and physical addresses to the heap allocator. If memory that is no longer needed is returned to the OS, then even if physical addresses are fragmented, allocation will succeed as long as the total space is sufficient. However, there were some concerns. For example, there was little knowledge of the performance impact of TLB cache misses, or of the overhead of the system calls that manipulate the mapping state between virtual and physical addresses. VRAM had already moved to operations that separate virtual and physical addresses, but the granularity of the operations handled was coarser than on the CPU side, so we could not simply reuse that knowledge as-is.

28.

Bottlenecks in Introducing a Virtual Memory Allocator

• Expanding the heap allocator with limited knowledge is risky
• Testing using already-released titles is not enough to dispel concerns
• Upcoming titles will push the hardware harder, which might expose problems
• Decided to incorporate an existing library: suitable for closed-source development with clear licensing; performance should not deviate from the heap allocator; excellent portability; proven track record

Under these circumstances, extending the heap allocator ourselves to implement a virtual memory allocator carried significant risk. The number of titles using RE ENGINE has exceeded 10, and title development teams have begun to create titles that push the performance of the hardware. Therefore, we could not dispel our concerns simply by verifying performance on previously released titles. However, time would run out if we were too cautious. Therefore, we chose to incorporate an existing library. The requirements for the library were as follows: the license must be clear, and it must be easy to incorporate into RE ENGINE, which is a closed-source project; performance should not deviate from the heap allocator; it must have excellent portability; and it must already have some proven track record.

29.

Selection of Virtual Memory Allocator Implementation

• Implemented and tested the most promising of several candidates
• Decided on mimalloc 2.0.3: less memory overhead than the 1.x series; MIT License; lockless memory management strategies that scale to multiple cores; designed for portability

As a result of our testing, mimalloc was selected as the memory allocator that meets our requirements. The exact version is 2.0.3. The reason is that it has smaller memory space overhead than the version 1 series, which is considered the stable version. The license is the MIT License; titles that use RE ENGINE carry the license notice in the web manual or in-game. mimalloc is designed with a lockless strategy that scales to multiple cores. It was also important that mimalloc held its own against the heap allocator in performance studies using microbenchmarks. Specific performance figures will be presented at the end of this presentation.

30.

Portability to Gaming Platforms

• Windows, macOS, POSIX, and wasm implementations exist as standard: proof that portability was taken into account
• Works by implementing functions equivalent to VirtualAlloc/VirtualFree: allocation and release of virtual addresses; mapping/unmapping of physical addresses
• Most modern operating systems provide equivalent APIs

As for portability, implementations exist for Windows, macOS, POSIX, and wasm, making it a highly portable code base. Except for a few minor details, it works by implementing functions equivalent to the Windows APIs VirtualAlloc/VirtualFree. These APIs are responsible for allocating and releasing virtual addresses and for mapping and unmapping physical addresses. Most modern OSs provide equivalent APIs.
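On POSIX systems, such a VirtualAlloc/VirtualFree-equivalent layer can be sketched with mmap/mprotect/munmap: mapping with PROT_NONE reserves virtual address space without backing it, mprotect "commits" a range by making it accessible, and munmap releases the reservation. This is a minimal sketch of the general technique, not RE ENGINE's actual porting layer; on Windows the same shape would use VirtualAlloc(MEM_RESERVE / MEM_COMMIT) and VirtualFree:

```cpp
#include <cstddef>
#include <sys/mman.h>

// Reserve virtual address space only: no physical pages are committed yet,
// and touching the range would fault.
void* reserveAddressSpace(size_t size) {
    void* p = mmap(nullptr, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// "Commit" a sub-range: make it accessible; the OS supplies demand-zeroed
// physical pages on first touch.
bool commitPages(void* addr, size_t size) {
    return mprotect(addr, size, PROT_READ | PROT_WRITE) == 0;
}

// Release the whole reservation back to the OS.
void releaseAddressSpace(void* addr, size_t size) {
    munmap(addr, size);
}
```

The zero-fill behaviour of freshly committed anonymous pages is also what gives Windows its zero-clear guarantee discussed later.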

31.

Challenges Encountered and Solutions

Here are some of the challenges we encountered in implementing mimalloc and how we solved them. This is specifically about how mimalloc was implemented in RE ENGINE, so the circumstances are different from those of general Windows applications. Please keep in mind that this information is about running a moderately allocation-heavy game on a platform with strict memory constraints.

32.

Implementation Issues

• Assumes an environment with sufficient physical memory, or memory management based on memory swapping
• Once allocated, little memory is returned to the OS
• Allocates memory from the OS for each thread
• Large delay in releasing the memory of finished threads
• Large amount of memory space that is practically unavailable (~30%)

The first thing we encountered is that memory management behaves as if physical memory were quite vast, or as if memory swapping could be assumed. Once memory is allocated, it is rarely returned to the OS; memory sharing between threads does not work well because memory is obtained from the OS per thread; and memory deallocation for finished threads is handled quite optimistically. As a result, the amount of wasted memory space was close to 30% of the total. We will look at each of these issues.

33.

Issue: Returns little memory to the OS

• Due to mimalloc's performance optimization strategy: allocating and returning case-by-case places a burden on the OS
• Physical memory is small on game consoles, and memory swapping is not available
• Set options to return as much as possible, and rewrote source code to return memory earlier

// Options provided by mimalloc
mi_option_disable(eager_commit);
mi_option_set(segment_commit_delay, 0);
mi_option_set(reset_delay, 25);

// Extended the following features:
// - Allow segment_cache to be capped
// - Make page_free and page_retire execute immediately
// - Extend scope of page_free_collect

The fact that very little memory is returned to the OS is due to mimalloc's optimization strategy: to gain performance, memory is not returned once it has been obtained. This is not suitable for environments where physical memory is small and memory swapping is not always available, so in addition to setting options to return as much as possible, we modified the source code to encourage earlier return.

34.

Issue: Allocates memory from OS for each thread

• All memory allocation is via mi_heap instances
• An instance of mi_heap is created for each thread
• Memory in a mi_heap is not accessible from other threads
• The allocating thread (mi_heap) monopolizes memory until it is returned to the OS
• Memory freed in a thread is also still unavailable to other threads

In mimalloc, all memory allocation goes through instances of mi_heap, which hand out memory obtained from the OS. An instance of mi_heap is created for each thread; this strategy of keeping memory per thread achieves high parallelism. Now here is the problem. Once memory is in a mi_heap, it is monopolized by the owning thread until it is returned to the OS. Even if there is free space, it cannot be accessed by another thread's mi_heap, creating a gap between the apparent free space and the free space that can actually be used. Since more than several dozen threads run in a game, memory can easily be exhausted if this overhead is not reduced. This problem is solved as follows.

35.

Solution: Logically integrate low-priority threads

• Performance-oriented threads associated with game loops: there are not many threads of this kind; each has its own heap to take advantage of mimalloc's features
• Low-priority worker threads and middleware external threads: managed as one logical thread; locks inserted at lock-free locations as appropriate; memory space efficiency prioritized over execution performance

RE ENGINE has two types of threads: performance-critical threads associated with the game loop, and threads that are not affected even if execution is slightly delayed. The number of performance-oriented threads is small and their count is easily controlled, so a mi_heap instance is provided for each of them. On the other hand, low-priority worker threads and middleware threads are grouped together as one logical thread, and memory is allocated under synchronization. This significantly reduces wasted memory space.
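The "one logical thread" idea boils down to many low-priority threads sharing a single heap behind a mutex, while performance-critical threads keep their own lock-free heaps. A minimal sketch, with a counter-based heap standing in for mi_heap (all names are illustrative):

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Shared heap for all low-priority workers: the lock makes them behave as
// one logical thread toward the allocator, trading a little speed for far
// less per-thread memory overhead.
class SharedLogicalHeap {
public:
    // Called concurrently from many worker threads.
    size_t allocate(size_t size) {
        std::lock_guard<std::mutex> guard(mLock);
        size_t offset = mUsed;
        mUsed += size;  // stand-in for a real mi_heap allocation
        return offset;
    }

    size_t used() const { return mUsed; }

private:
    std::mutex mLock;
    size_t mUsed = 0;
};
```

With N workers funneled into one heap, the per-thread reserve that mimalloc would otherwise hold for each of them collapses into a single shared pool, which is exactly the memory saving described above.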

36.
Issue: Freeing memory of finished threads
Heap information on finished threads is managed in a linked list
Released in a section independent of the game loop
// Callback added to mi_heapfree()
void re_mi_callback_heapfree(mi_heap_t* heap)
{
    // Return all memory that can be freed
    re_mi_heap_disable_delayed_free(heap);
    re_mi_heap_release_unused(heap);

    // If not empty, connect to management list and GC in background
    if (!re_mi_heap_is_empty(heap)) {
        re_mi_heap_enqueue_abandoned(heap);
    }
}

Normally, the memory of the mi_heap associated with a thread that has already terminated is released late, but in RE ENGINE,
the discarded mi_heap is managed in a bidirectional linked list so that it can be released early at a timing independent of the game
loop.
Note that since the introduction of the concept of logical threads as explained in the previous slide, few threads use this process.


37.

Issue: Large memory space overhead Although countermeasures were put in place, mi_heap allocates a lot of memory to begin with We are more concerned about memory space efficiency than performance benefits Adjust memory allocation granularity to a level that balances performance #define MI_SEGMENT_SLICE_SHIFT (11 + MI_INTPTR_SHIFT) // Segment Size: 8MiB #define MI_SEGMENT_SHIFT (8 + MI_SEGMENT_SLICE_SHIFT) // Medium Page: 128KiB #define MI_MEDIUM_PAGE_SHIFT (3+MI_SEGMENT_PAGE_SHIFT) // Small Page: 16KiB #define MI_SEGMENT_PAGE_SHIFT (MI_SEGMENT_SLICE_SHIFT) Even with the measures mentioned so far, on platforms with small physical memory, the problem remained that mi_heap likes to allocate large amount of memory . We adjusted the granularity of the memory allocation from the OS to balance performance. Any areas where bugs arose due to setting these values, we have worked on as well. 34 These values result in the maximum size of Small Objects being limited to 4 KiB, which is set considering that one of RE ENGINE's characteristics is that intensive allocations in the game loop tend to be of sizes of 4 KiB or less. Please do not rely on the numerical settings here, but rather gather information on the memory allocation trends in your environment and consider the specific values needed. ©CAPCOM 37

38.
New Issue: Frequent system calls
System call execution costs are difficult to predict
Spikes are to be avoided as much as possible
In a fighting game that's locked to 60fps, this has a direct effect on product quality
Cache mapped memory with LRU method
LRU = Least Recently Used
About 70% hit rate

struct mapped_t {
    u64 packed_virtual_addr : 44;
    u64 page_count : 16;
    u64 sparse : 1;
    u64 misc : 3;
};
constexpr size_t LRU_Entries = 16;
mapped_t mRecentPages[LRU_Entries];

As a side effect from the measures I just detailed, the number of system calls issued to map memory increased.
The cost of executing system calls is difficult to predict and, if not controlled to some extent, could induce spikes in the game loop.
Spikes in the game loop are a problem that directly affects quality in a fighting game with a fixed frame rate.
Therefore, we decided to cache mapped memory using the LRU (Least Recently Used) method.
As a result, the cache hit ratio rose to about 70%, leading to a reduction in the number of system calls.
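The LRU cache of mapped memory might look roughly like the following: freed ranges are parked in a small fixed table instead of being unmapped immediately, and a later mapping request of a matching size reuses an entry and skips the system call. The entry layout loosely follows the mapped_t struct above, but the eviction policy and all details are assumptions for illustration:

```cpp
#include <cstddef>
#include <cstdint>

// One parked mapping, loosely mirroring the fields of mapped_t.
struct CachedMapping {
    uint64_t virtualAddr = 0;
    uint64_t pageCount = 0;
    uint64_t lastUse = 0;  // for least-recently-used eviction
    bool valid = false;
};

class MappedPageCache {
public:
    static constexpr size_t kEntries = 16;  // matches LRU_Entries above

    // Park a mapping instead of unmapping it immediately; evict the
    // least-recently-used entry if the table is full.
    void insert(uint64_t addr, uint64_t pages) {
        CachedMapping* victim = &mEntries[0];
        for (auto& e : mEntries) {
            if (!e.valid) { victim = &e; break; }
            if (e.lastUse < victim->lastUse) victim = &e;  // oldest entry
        }
        // A real implementation would issue the deferred unmap for *victim here.
        victim->virtualAddr = addr;
        victim->pageCount = pages;
        victim->lastUse = ++mClock;
        victim->valid = true;
    }

    // On a map request, try to reuse a parked mapping of the right size.
    bool lookup(uint64_t pages, uint64_t* addrOut) {
        for (auto& e : mEntries) {
            if (e.valid && e.pageCount == pages) {
                *addrOut = e.virtualAddr;
                e.valid = false;  // taken out of the cache
                return true;
            }
        }
        return false;  // miss: fall through to the real system call
    }

private:
    CachedMapping mEntries[kEntries];
    uint64_t mClock = 0;
};
```

Every hit replaces an unmap/map pair of system calls with two table scans, which is how a ~70% hit rate translates into far fewer spikes.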


39.

Implementation Issues

• Calls assume Windows' VirtualAlloc/VirtualFree: map requests across already-mapped areas; unmapping a portion of a mapped area; mapped memory guaranteed to be zero-cleared

There are other challenges. Windows' VirtualAlloc/VirtualFree API has a rich specification that allows map requests across already-mapped areas and unmapping of portions of mapped areas. In addition, mapped memory is guaranteed to be zero-cleared, and mimalloc is designed with that zero-clear guarantee in mind.

40.
Issue: Rich VirtualAlloc/VirtualFree specification
Resolved by managing virtual and physical memory manually
Simple implementation of both virtual and physical addresses, rounding out every 64 KiB
Mapping status is managed in a simple page table
Reinterpret complex requests from mimalloc
Support memory mapping requests not directly supported by system calls
struct page_t {
    u64 continuous_page_count : 12;
    u64 misc : 4;
    u64 packed_virtual_addr : 48;
};
page_t mMappedPages[MaxPhysicalMem >> 16];

Since it is not expected that the target platform's API has the same specifications as VirtualAlloc/VirtualFree,
we solved this problem by managing virtual and physical memory on our own.
Although physical address alignment and continuity guarantees are provided separately,
the implementation is basically a simple linear 64KiB-by-64KiB cutout.
Mapping status was managed by maintaining a simple page table.


This allows requests issued from mimalloc to be reinterpreted against the simple page table,
and system calls to be issued or canceled separately.
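The 64 KiB rounding plus flat page table can be sketched as follows. The slot layout loosely mirrors the page_t struct above, while the first-fit search and all sizes are assumptions for illustration:

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint64_t kGranule = 64 * 1024;  // every request rounds to 64 KiB
constexpr size_t kMaxSlots = 1024;        // stands in for MaxPhysicalMem >> 16

// One 64 KiB physical slot, loosely mirroring page_t.
struct Slot {
    uint64_t virtualAddr = 0;
    uint32_t continuousCount = 0;  // slots in this request (first slot only)
    bool used = false;
};

inline uint64_t roundToGranule(uint64_t size) {
    return (size + kGranule - 1) / kGranule * kGranule;
}

class SimplePageTable {
public:
    // Returns the first slot index backing the request, or -1 if full.
    int map(uint64_t virtualAddr, uint64_t size) {
        uint32_t need = static_cast<uint32_t>(roundToGranule(size) / kGranule);
        // First-fit search for `need` consecutive free slots.
        for (size_t i = 0; i + need <= kMaxSlots; ++i) {
            size_t run = 0;
            while (run < need && !mSlots[i + run].used) ++run;
            if (run == need) {
                for (size_t j = 0; j < need; ++j) {
                    mSlots[i + j].used = true;
                    mSlots[i + j].virtualAddr = virtualAddr + j * kGranule;
                }
                mSlots[i].continuousCount = need;
                return static_cast<int>(i);
            }
        }
        return -1;  // no room: the real system would fail or fall back
    }

private:
    Slot mSlots[kMaxSlots];
};
```

Because every request is normalized to whole 64 KiB units, a partial-unmap or overlapping-map request from mimalloc becomes a simple range operation over this table, and the system calls to issue or cancel can be decided from the table state.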


41.
Issue: Zero-clear guarantee of mapped memory
Implemented fast per-platform memory clearing
_mm256_stream_si256() + loop unroll
Per-page allocation always satisfies Intrinsic alignment requirements
Non-Temporal writes are valid for mimalloc memory access patterns
Faster than standard memset()
for (size_t i = 0; i < sz >> 5; i += 8) {
    _mm256_stream_si256(&addr256[i + 0], kZero256); //  32
    _mm256_stream_si256(&addr256[i + 1], kZero256); //  64
    _mm256_stream_si256(&addr256[i + 2], kZero256); //  96
    _mm256_stream_si256(&addr256[i + 3], kZero256); // 128
    _mm256_stream_si256(&addr256[i + 4], kZero256); // 160
    _mm256_stream_si256(&addr256[i + 5], kZero256); // 192
    _mm256_stream_si256(&addr256[i + 6], kZero256); // 224
    _mm256_stream_si256(&addr256[i + 7], kZero256); // 256
}

As for the zero-clear guarantee of mapped memory, implementing fast memory clearing per platform was sufficient.
On Windows, a background thread zero-clears and pools free memory pages so that zero-cleared memory can be handed out quickly,
but there was no need for a major effort there.


The code shown in the example is a routine that clears memory using AVX instructions.
It is implemented for each CPU variant of the platforms we port to.
There are two major differences between this code and libc's memset().
The first is that the request address and size are guaranteed to be page-aligned,
so the head-and-tail alignment handling common in memory clear routines is unnecessary.

©CAPCOM 41

42.
[beta]
Issue: Zero-clear guarantee of mapped memory
Implemented fast per-platform memory clearing
_mm256_stream_si256() + loop unroll
Per-page allocation always satisfies intrinsic alignment requirements
Non-temporal writes fit mimalloc's memory access patterns
Faster than standard memset()
for (size_t i = 0; i < sz >> 5; i += 8) {
    _mm256_stream_si256(&addr256[i + 0], kZero256);  // 32
    _mm256_stream_si256(&addr256[i + 1], kZero256);  // 64
    _mm256_stream_si256(&addr256[i + 2], kZero256);  // 96
    _mm256_stream_si256(&addr256[i + 3], kZero256);  // 128
    _mm256_stream_si256(&addr256[i + 4], kZero256);  // 160
    _mm256_stream_si256(&addr256[i + 5], kZero256);  // 192
    _mm256_stream_si256(&addr256[i + 6], kZero256);  // 224
    _mm256_stream_si256(&addr256[i + 7], kZero256);  // 256
}

The second is that it uses non-temporal writes, which do not pollute the CPU cache.
Although mimalloc requests the mapping with the expectation that it will be zero-cleared,
only the first page of the mapped memory area is referenced immediately;
the rest will not be touched until far into the future in terms of CPU cycles.
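For illustration, here is a portable stand-in for the routine above (hypothetical, not the actual engine code). It keeps the same contract, page-aligned address and size, so it shows the structure of the loop, flat unrolling with no head-and-tail fixup. The real routine replaces the plain stores with _mm256_stream_si256() to get the non-temporal behavior; `clear_pages` is an illustrative name.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Zero-clear a region that is guaranteed to be 64 KiB page aligned.
// Because of that guarantee, no head/tail alignment handling is needed:
// the loop can run flat-out, 64 bytes per iteration.
void clear_pages(void* addr, size_t sz) {
    assert((reinterpret_cast<uintptr_t>(addr) & 0xFFFF) == 0);  // 64 KiB aligned
    assert((sz & 0xFFFF) == 0);                                 // whole pages only
    uint64_t* p = static_cast<uint64_t*>(addr);
    for (size_t i = 0; i < sz >> 3; i += 8) {  // 8 x 8-byte stores = 64 bytes
        p[i + 0] = 0; p[i + 1] = 0; p[i + 2] = 0; p[i + 3] = 0;
        p[i + 4] = 0; p[i + 5] = 0; p[i + 6] = 0; p[i + 7] = 0;
    }
}
```

The AVX version additionally needs a store fence (_mm_sfence) after the loop if later readers must observe the non-temporal stores in order.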

©CAPCOM 42

43.

Implementation Issues
Cases where the thread that allocates is different from the thread that releases are treated as "rare"
Put into the release queue; actual release is delayed
GC process call is required
Actually not a rare case in RE ENGINE
Memory initialized in background threads is often released in the game loop
Memory allocated within the game loop is released in another worker thread

In mimalloc, returning memory to one thread's mi_heap from another thread is handled as a special case.
The returned memory is placed in a deallocation queue, and GC processing of the mi_heap is required to release the queued memory.
In RE ENGINE, there are many cases in which the game loop releases memory that was initialized in a background thread, or vice versa.
When and at what granularity GC processing is executed is very important from a performance perspective.

©CAPCOM 43
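A minimal sketch of the deallocation-queue idea described above, assuming a mimalloc-style scheme: threads that do not own the heap push freed blocks onto an atomic list, and the owner later drains it during GC processing. `ReleaseQueue`, `push_foreign_free`, and `collect` are illustrative names, not actual mimalloc APIs.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>

// A block waiting in the release queue; the freed memory itself is reused
// as the list node, so the queue needs no allocations of its own.
struct FreeBlock {
    FreeBlock* next;
};

class ReleaseQueue {
public:
    // Called from any thread that does not own the heap: push the block
    // onto a lock-free singly linked list.
    void push_foreign_free(void* p) {
        FreeBlock* blk = static_cast<FreeBlock*>(p);
        FreeBlock* head = mHead.load(std::memory_order_relaxed);
        do {
            blk->next = head;
        } while (!mHead.compare_exchange_weak(head, blk,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
    }

    // Called by the owner thread (the "GC processing" step): take the whole
    // list at once and actually release the blocks. Returns the count freed.
    size_t collect() {
        FreeBlock* blk = mHead.exchange(nullptr, std::memory_order_acquire);
        size_t freed = 0;
        while (blk) {
            FreeBlock* next = blk->next;
            std::free(blk);  // stand-in for returning the block to the mi_heap
            ++freed;
            blk = next;
        }
        return freed;
    }

private:
    std::atomic<FreeBlock*> mHead{nullptr};
};
```

In the engine, the drain step would return blocks to the owning mi_heap rather than calling free(), and deciding when and how much to drain is exactly the granularity question raised above.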

44.

Issue: Memory released by a separate thread
Unified behavior of all deallocations to use the release queue
Improved instruction cache hit ratio
GC processing is performed incrementally when the owner thread allocates memory
Optimization focused on the speed of the memory release process
Memory allocation is often done gradually, while deallocation is often done all at once

void* allocate(size_t sz, size_t align) {
    mi_heap_t* currentHeap = re_mi_get_default_heap();
    s32 collectionSteps = 100;
    re_mi_heap_collect(currentHeap, collectionSteps);
    mi_heap_malloc(currentHeap, sz);
    // …

The solution to this problem was first to unify the implementation so that the release queue is used regardless of which thread does the deallocating.
This eliminated conditional branches and reduced the code size of the memory release logic, resulting in a higher instruction cache hit ratio.
On top of that, GC processing is performed incrementally when the thread that owns the mi_heap in question allocates memory.
The number of objects processed by GC is counted, and GC is terminated once it reaches a certain level.
This adds extra processing load when allocating memory, but the optimization is based on RE ENGINE's typical behavior:
memory is generally allocated gradually, while deallocation is often done all at once, within the game loop.

©CAPCOM 44
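The step-budgeted GC described above could be sketched as follows. `PendingFrees` and `incremental_collect` are hypothetical names, the queue is simplified to a single-threaded container, and std::malloc/std::free stand in for the mi_heap calls.

```cpp
#include <cassert>
#include <cstdlib>
#include <deque>

// Blocks returned by other threads, waiting to be released by the owner.
// (Simplified: a real release queue would be a thread-safe structure.)
struct PendingFrees {
    std::deque<void*> queue;

    // Release at most `budget` blocks; return the number actually released.
    // Stopping at the budget is what makes the GC "incremental".
    int incremental_collect(int budget) {
        int steps = 0;
        while (steps < budget && !queue.empty()) {
            std::free(queue.front());
            queue.pop_front();
            ++steps;
        }
        return steps;
    }
};

void* allocate(PendingFrees& pending, size_t sz) {
    const int collectionSteps = 100;  // per-allocation GC budget, as on the slide
    pending.incremental_collect(collectionSteps);
    return std::malloc(sz);           // stand-in for mi_heap_malloc()
}
```

Each allocation pays a small, bounded GC cost, which matches the stated allocation pattern: allocations trickle in gradually, so the backlog created by burst deallocations is worked off a little at a time.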

45.

Performance
[Chart: average time per operation in nanoseconds (log scale, 1 to 65536 ns; smaller is better) for VirtualAllocator vs. HeapAllocator, broken down into allocate > 1 MiB, allocate > 4 KiB, allocate <= 4 KiB, free > 1 MiB, free > 4 KiB, free <= 4 KiB. Average of N=10,000,000 operations on PlayStation 5.]

Finally, performance. This was measured using in-game scenes from a previously released title.
The results are divided into sub-4 KiB allocations/deallocations, which occur frequently during game execution, and allocations/deallocations of 1 MiB or more, which are less common.
The speed difference is overwhelming for the low-frequency, large memory allocations. This is due to the overhead of system calls.
On the other hand, for the more common sub-4 KiB allocations, performance is comparable to that of the heap allocator and does not interfere with game execution.
Taking into account the advantages of cross-segment memory management and protection against fragmentation, this is not a bad choice.

©CAPCOM 45
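The figures above come from measuring real game scenes; a much simpler micro-benchmark of the same shape (hypothetical, not the actual harness, and with a far smaller iteration count) might look like this:

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <vector>

// Average the cost of N allocate/free pairs for one size bucket and
// report nanoseconds per operation. A real harness would interleave the
// operations and replay recorded in-game allocation patterns.
double average_alloc_free_ns(size_t sz, int iterations) {
    using clock = std::chrono::steady_clock;
    std::vector<void*> ptrs(iterations);
    auto t0 = clock::now();
    for (int i = 0; i < iterations; ++i) ptrs[i] = std::malloc(sz);
    for (int i = 0; i < iterations; ++i) std::free(ptrs[i]);
    auto t1 = clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    return double(ns) / (2.0 * iterations);  // per single alloc or free
}
```

Bucketing results by request size, as the chart does, matters because the large buckets are dominated by system-call overhead while the small ones exercise only the allocator's fast path.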

46.

Summary and Outlook

I will now summarize today's content.

©CAPCOM 46

47.

Summary
RE ENGINE is designed to meet the needs of game production in a variety of genres, so flexible memory management via support for virtual memory allocators was important
We were able to release a title while accumulating knowledge by incorporating mimalloc
While benefiting from the basic aspects of the system, we extended and optimized it to suit our own needs
Thorough optimization of execution performance and operational stability supports high-quality title production

RE ENGINE now supports a virtual memory allocator to meet the needs of game production in a variety of genres, ushering in an era of flexible memory management.
Until recently, we supported title production only with our in-house heap allocator.
By incorporating mimalloc, we were able to release titles while accumulating knowledge, starting from essentially zero knowledge of virtual memory allocators.
While respecting the benefits gained just by incorporating mimalloc, we extended and optimized it to suit our own operations.
To support high-quality title production, our commitment to execution performance and operational stability has remained unchanged since the days of the heap allocator.

©CAPCOM 47

48.

Future Outlook
We achieved a degree of success, but haven't applied everything we learned from the heap allocator era
There is more that can be done for performance and memory space efficiency
Fully in-house virtual memory allocator
Memory allocation strategy that best matches RE ENGINE
Less memory space overhead
Optimization that makes maximum use of accumulated know-how
Endless technological research and development for better game production

While we achieved a degree of success in terms of title releases, we've only scratched the surface of the know-how accumulated over the heap allocator's long history.
We believe more can still be done in terms of performance and memory space efficiency.
Therefore, we are working on a completely in-house virtual memory allocator implementation.
By investing in a memory allocation strategy that best matches RE ENGINE, we can achieve both lower memory space overhead and a higher level of performance.
At Capcom, we push the limits of technological research and development to achieve better game production.

©CAPCOM 48

49.

Thank you for your attention

©CAPCOM 49