A simple caching allocator for pinned host memory allocations. More...

#include <CachingHostAllocator.h>

Classes
struct	BlockDescriptor

class	TotalBytes

Public Types
typedef std::multiset< BlockDescriptor, Compare >	BusyBlocks
	Set type for live blocks (ordered by ptr) More...

typedef std::multiset< BlockDescriptor, Compare >	CachedBlocks
	Set type for cached blocks (ordered by size) More...

typedef bool(*	Compare) (const BlockDescriptor &, const BlockDescriptor &)
	BlockDescriptor comparator function interface. More...

Public Member Functions
	CachingHostAllocator (bool skip_cleanup=false, bool debug=false)
	Default constructor. More...

	CachingHostAllocator (unsigned int bin_growth, unsigned int min_bin=1, unsigned int max_bin=INVALID_BIN, size_t max_cached_bytes=INVALID_SIZE, bool skip_cleanup=false, bool debug=false)
	Constructor. More...

cudaError_t	FreeAllCached ()
	Frees all cached pinned host allocations. More...

cudaError_t	HostAllocate (void **d_ptr, size_t bytes, cudaStream_t active_stream=nullptr)
	Provides a suitable allocation of pinned host memory for the given size. More...

cudaError_t	HostFree (void *d_ptr)
	Frees a live allocation of pinned host memory, returning it to the allocator. More...

void	NearestPowerOf (unsigned int &power, size_t &rounded_bytes, unsigned int base, size_t value)

void	SetMaxCachedBytes (size_t max_cached_bytes)
	Sets the limit on the number of bytes this allocator is allowed to cache. More...

	~CachingHostAllocator ()
	Destructor. More...

Static Public Member Functions
static unsigned int	IntPow (unsigned int base, unsigned int exp)

Public Attributes
unsigned int	bin_growth
	Geometric growth factor for bin-sizes. More...

CachedBlocks	cached_blocks
	Set of cached pinned host allocations available for reuse. More...

TotalBytes	cached_bytes
	Aggregate cached bytes. More...

bool	debug
	Whether or not to print (de)allocation events to stdout. More...

BusyBlocks	live_blocks
	Set of live pinned host allocations currently in use. More...

unsigned int	max_bin
	Maximum bin enumeration. More...

size_t	max_bin_bytes
	Maximum bin size. More...

size_t	max_cached_bytes
	Maximum aggregate cached bytes. More...

unsigned int	min_bin
	Minimum bin enumeration. More...

size_t	min_bin_bytes
	Minimum bin size. More...

std::mutex	mutex
	Mutex for thread-safety.

const bool	skip_cleanup
	Whether or not to skip a call to FreeAllCached() when the destructor is called. (The CUDA runtime may have already shut down for statically declared allocators.) More...

Static Public Attributes
static const unsigned int	INVALID_BIN = (unsigned int)-1
	Out-of-bounds bin. More...

static const int	INVALID_DEVICE_ORDINAL = -1
	Invalid device ordinal. More...

static const size_t	INVALID_SIZE = (size_t)-1
	Invalid size. More...

Detailed Description

A simple caching allocator for pinned host memory allocations.

Overview: The allocator is thread-safe. CUDA stream-safety is presumably not useful here, because reading from or writing to pinned host memory requires synchronization anyway. The difference with respect to device memory is that, from the CPU, all operations on device memory are scheduled via a CUDA stream, whereas operations on host memory can be performed directly.

The allocator behaves as follows:

Allocations are categorized and cached by bin size. A new allocation request of a given size will only consider cached allocations within the corresponding bin.
Bin limits progress geometrically in accordance with the growth factor bin_growth provided during construction. Unused host allocations within a larger bin cache are not reused for allocation requests that categorize to smaller bin sizes.
Allocation requests below (bin_growth ^ min_bin) are rounded up to (bin_growth ^ min_bin).
Allocations above (bin_growth ^ max_bin) are not rounded up to the nearest bin and are simply freed when they are deallocated instead of being returned to a bin-cache.
If the total storage of cached allocations would exceed max_cached_bytes, allocations are simply freed when they are deallocated instead of being returned to their bin-cache.
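
The binning rules above can be sketched as follows. This is a minimal model rather than the allocator's actual code: `BinChoice`, `classify`, and `kInvalidBin` are hypothetical names introduced for illustration, and the sketch assumes `bin_growth >= 2` and requests small enough that the rounding loop does not overflow.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the allocator's INVALID_BIN marker.
constexpr unsigned int kInvalidBin = static_cast<unsigned int>(-1);

struct BinChoice {
  unsigned int bin;   // exponent of the chosen bin, or kInvalidBin
  std::size_t bytes;  // number of bytes that would actually be allocated
};

// Map a request to a bin following the rules listed above.
inline BinChoice classify(std::size_t request, unsigned int bin_growth,
                          unsigned int min_bin, unsigned int max_bin) {
  assert(bin_growth >= 2);  // the loops below would not terminate otherwise
  unsigned int bin = 0;
  std::size_t rounded = 1;
  while (rounded < request) {  // round up to the next power of bin_growth
    rounded *= bin_growth;
    ++bin;
  }
  if (bin > max_bin) {
    // Too large for any bin: allocate the exact size and never cache it.
    return {kInvalidBin, request};
  }
  while (bin < min_bin) {  // below the smallest bin: round up to it
    rounded *= bin_growth;
    ++bin;
  }
  return {bin, rounded};
}
```

With the default configuration (growth 8, bins 3 to 7), a 1000-byte request would land in bin 4 (4096 bytes), a 10-byte request would be rounded up to the 512-byte bin, and a 3,000,000-byte request would bypass the cache entirely.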

For example, the default-constructed CachingHostAllocator is configured with:

bin_growth = 8
min_bin = 3
max_bin = 7
max_cached_bytes = 6MB - 1B

which delineates five bin-sizes: 512B, 4KB, 32KB, 256KB, and 2MB, and sets a maximum of 6,291,455 cached bytes.
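
The arithmetic behind these defaults can be checked with a short sketch. The helper names `bin_sizes` and `default_max_cached_bytes` are hypothetical and exist only for this example; the formula for the ceiling (three times the largest bin, minus one byte) is the one stated for the default constructor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sizes of all bins from min_bin to max_bin (inclusive) for a given growth factor.
inline std::vector<std::size_t> bin_sizes(unsigned int bin_growth,
                                          unsigned int min_bin,
                                          unsigned int max_bin) {
  assert(min_bin <= max_bin);  // otherwise there would be no bins at all
  std::vector<std::size_t> sizes;
  std::size_t size = 1;  // bin_growth ^ 0
  for (unsigned int bin = 0; bin <= max_bin; ++bin) {
    if (bin >= min_bin)
      sizes.push_back(size);
    size *= bin_growth;
  }
  return sizes;
}

// Default cache ceiling: three times the largest bin, minus one byte.
inline std::size_t default_max_cached_bytes(unsigned int bin_growth,
                                            unsigned int max_bin) {
  return bin_sizes(bin_growth, max_bin, max_bin).back() * 3 - 1;
}
```

For growth 8 and bins 3 to 7 this yields 512, 4096, 32768, 262144, and 2097152 bytes, and a ceiling of 3 * 2097152 - 1 = 6291455 bytes.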

Definition at line 124 of file CachingHostAllocator.h.

Member Typedef Documentation

◆ BusyBlocks

typedef std::multiset<BlockDescriptor, Compare> notcub::CachingHostAllocator::BusyBlocks

Set type for live blocks (ordered by ptr)

Definition at line 195 of file CachingHostAllocator.h.

◆ CachedBlocks

typedef std::multiset<BlockDescriptor, Compare> notcub::CachingHostAllocator::CachedBlocks

Set type for cached blocks (ordered by size)

Definition at line 192 of file CachingHostAllocator.h.

◆ Compare

typedef bool(* notcub::CachingHostAllocator::Compare) (const BlockDescriptor &, const BlockDescriptor &)

BlockDescriptor comparator function interface.

Definition at line 182 of file CachingHostAllocator.h.
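
To illustrate how one multiset type can serve both containers, here is a minimal sketch of a `std::multiset` parameterized by a function-pointer comparator. `Block`, `by_size`, `by_ptr`, and `find_in_bin` are simplified, hypothetical stand-ins; the real `BlockDescriptor` carries more fields, and its comparators may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>

// Simplified stand-in for BlockDescriptor: only the fields needed for ordering.
struct Block {
  void* ptr;
  std::size_t bytes;
  unsigned int bin;
};

// Comparator signature with the same shape as the Compare typedef above.
using Compare = bool (*)(const Block&, const Block&);

inline bool by_size(const Block& a, const Block& b) { return a.bytes < b.bytes; }
inline bool by_ptr(const Block& a, const Block& b) {
  return std::less<const void*>{}(a.ptr, b.ptr);  // total order on pointers
}

// The same container type is used for both sets; only the comparator differs.
using Blocks = std::multiset<Block, Compare>;

// Return the first cached block in the same bin as `key`, or nullptr if none.
inline const Block* find_in_bin(const Blocks& cached, const Block& key) {
  assert(cached.key_comp() == &by_size);  // requires a size-ordered set
  Blocks::const_iterator it = cached.lower_bound(key);
  if (it != cached.end() && it->bin == key.bin)
    return &*it;
  return nullptr;
}
```

A size-ordered set would be constructed as `Blocks cached(by_size)` and a pointer-ordered one as `Blocks live(by_ptr)`, which mirrors how the constructors pass `SizeCompare` and `PtrCompare` to `cached_blocks` and `live_blocks`.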

Constructor & Destructor Documentation

◆ CachingHostAllocator() [1/2]

notcub::CachingHostAllocator::CachingHostAllocator	(	unsigned int	bin_growth,
		unsigned int	min_bin = `1`,
		unsigned int	max_bin = `INVALID_BIN`,
		size_t	max_cached_bytes = `INVALID_SIZE`,
		bool	skip_cleanup = `false`,
		bool	debug = `false`
	)

inline

Constructor.

Parameters

bin_growth	Geometric growth factor for bin-sizes
min_bin	Minimum bin (default is bin_growth ^ 1)
max_bin	Maximum bin (default is no max bin)
max_cached_bytes	Maximum aggregate cached bytes (default is no limit)
skip_cleanup	Whether or not to skip a call to `FreeAllCached()` when the destructor is called (default is to deallocate)
debug	Whether or not to print (de)allocation events to stdout (default is no output)

Definition at line 267 of file CachingHostAllocator.h.

     /**
      * Default constructor: delineates five bin-sizes (512B, 4KB, 32KB, 256KB, and 2MB)
      * and sets a maximum of 6,291,455 cached bytes
      */
     CachingHostAllocator(bool skip_cleanup = false, bool debug = false)
         : bin_growth(8),
           min_bin(3),
           max_bin(7),
           min_bin_bytes(IntPow(bin_growth, min_bin)),
           max_bin_bytes(IntPow(bin_growth, max_bin)),
           max_cached_bytes((max_bin_bytes * 3) - 1),
           skip_cleanup(skip_cleanup),
           debug(debug),
           cached_blocks(BlockDescriptor::SizeCompare),
           live_blocks(BlockDescriptor::PtrCompare) {}

◆ CachingHostAllocator() [2/2]

notcub::CachingHostAllocator::CachingHostAllocator	(	bool	skip_cleanup = `false`,
		bool	debug = `false`
	)

inline

Default constructor.

Configured with:

bin_growth = 8
min_bin = 3
max_bin = 7
max_cached_bytes = ((bin_growth ^ max_bin) * 3) - 1 = 6,291,455 bytes

which delineates five bin-sizes: 512B, 4KB, 32KB, 256KB, and 2MB, and sets a maximum of 6,291,455 cached bytes.

Definition at line 299 of file CachingHostAllocator.h.

◆ ~CachingHostAllocator()

notcub::CachingHostAllocator::~CachingHostAllocator ( )

inline

Destructor.

Definition at line 663 of file CachingHostAllocator.h.

Member Function Documentation

◆ FreeAllCached()

cudaError_t notcub::CachingHostAllocator::FreeAllCached ( )

inline

Frees all cached pinned host allocations.

Definition at line 604 of file CachingHostAllocator.h.

                                                        {
         cudaCheck(error = cudaSetDevice(entrypoint_device));
       }
  
       return error;
     }
  
     ~CachingHostAllocator() {
       if (!skip_cleanup)
         FreeAllCached();
     }
   };
   // end group UtilMgmt
  
 }  // namespace notcub
  
 #endif

Referenced by CUDAService::~CUDAService().

◆ HostAllocate()

cudaError_t notcub::CachingHostAllocator::HostAllocate	(	void **	d_ptr,
		size_t	bytes,
		cudaStream_t	active_stream = `nullptr`
	)

inline

Provides a suitable allocation of pinned host memory for the given size.

Once freed, the allocation becomes available for reuse as soon as the ready event recorded on its associated stream has completed.

Parameters

[out]	d_ptr	Reference to pointer to the allocation
[in]	bytes	Minimum number of bytes for the allocation
[in]	active_stream	The stream to be associated with this allocation

Definition at line 337 of file CachingHostAllocator.h.

              {
         // Search for a suitable cached allocation: lock
         mutex_locker.lock();
  
         if (search_key.bin < min_bin) {
           // Bin is less than minimum bin: round up
           search_key.bin = min_bin;
           search_key.bytes = min_bin_bytes;
         }
  
         // Iterate through the range of cached blocks in the same bin
         CachedBlocks::iterator block_itr = cached_blocks.lower_bound(search_key);
         while ((block_itr != cached_blocks.end()) && (block_itr->bin == search_key.bin)) {
           // To prevent races with reusing blocks returned by the host but still
           // in use for transfers, only consider cached blocks that are from an idle stream
           if (cudaEventQuery(block_itr->ready_event) != cudaErrorNotReady) {
             // Reuse existing cache block.  Insert into live blocks.
             found = true;
             search_key = *block_itr;
             search_key.associated_stream = active_stream;
             if (search_key.device != device) {
               // If "associated" device changes, need to re-create the event on the right device
               cudaCheck(error = cudaSetDevice(search_key.device));
               cudaCheck(error = cudaEventDestroy(search_key.ready_event));
               cudaCheck(error = cudaSetDevice(device));
               cudaCheck(error = cudaEventCreateWithFlags(&search_key.ready_event, cudaEventDisableTiming));
               search_key.device = device;
             }
  
             live_blocks.insert(search_key);
  
             // Remove from free blocks
             cached_bytes.free -= search_key.bytes;
             cached_bytes.live += search_key.bytes;
  
             if (debug)
               printf(
                   "\tHost reused cached block at %p (%lld bytes) for stream %lld, event %lld on device %lld "
                   "(previously associated with stream %lld, event %lld).\n",
                   search_key.d_ptr,
                   (long long)search_key.bytes,
                   (long long)search_key.associated_stream,
                   (long long)search_key.ready_event,
                   (long long)search_key.device,
                   (long long)block_itr->associated_stream,
                   (long long)block_itr->ready_event);
  
             cached_blocks.erase(block_itr);
  
             break;
           }
           block_itr++;
         }
  
         // Done searching: unlock
         mutex_locker.unlock();
       }
  
       // Allocate the block if necessary
       if (!found) {
         // Attempt to allocate
         // TODO: eventually support allocation flags
         if ((error = cudaHostAlloc(&search_key.d_ptr, search_key.bytes, cudaHostAllocDefault)) ==
             cudaErrorMemoryAllocation) {
           // The allocation attempt failed: free all cached blocks on device and retry
           if (debug)
             printf(
                 "\tHost failed to allocate %lld bytes for stream %lld on device %lld, retrying after freeing cached "
                  "allocations.\n",
                 (long long)search_key.bytes,
                 (long long)search_key.associated_stream,
                 (long long)search_key.device);
  
           error = cudaSuccess;  // Reset the error we will return
           cudaGetLastError();   // Reset CUDART's error
  
           // Lock
           mutex_locker.lock();
  
           // Iterate the range of free blocks
           CachedBlocks::iterator block_itr = cached_blocks.begin();
  
           while ((block_itr != cached_blocks.end())) {
             // No need to worry about synchronization with the device: cudaFree is
             // blocking and will synchronize across all kernels executing
             // on the current device
  
             // Free pinned host memory.
             if ((error = cudaFreeHost(block_itr->d_ptr)))
               break;
             if ((error = cudaEventDestroy(block_itr->ready_event)))
               break;
  
             // Reduce balance and erase entry
             cached_bytes.free -= block_itr->bytes;
  
             if (debug)
               printf(
                   "\tHost freed %lld bytes.\n\t\t  %lld available blocks cached (%lld bytes), %lld live blocks (%lld "
                   "bytes) outstanding.\n",
                   (long long)block_itr->bytes,
                   (long long)cached_blocks.size(),
                   (long long)cached_bytes.free,
                   (long long)live_blocks.size(),
                   (long long)cached_bytes.live);
  
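             // erase() returns the iterator following the erased element;
             // incrementing the erased iterator itself would be undefined behavior
             block_itr = cached_blocks.erase(block_itr);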
           }
  
           // Unlock
           mutex_locker.unlock();
  
           // Return under error
           if (error)
             return error;
  
           // Try to allocate again
           cudaCheck(error = cudaHostAlloc(&search_key.d_ptr, search_key.bytes, cudaHostAllocDefault));
         }
  
         // Create ready event
         cudaCheck(error = cudaEventCreateWithFlags(&search_key.ready_event, cudaEventDisableTiming));
  
         // Insert into live blocks
         mutex_locker.lock();
         live_blocks.insert(search_key);
         cached_bytes.live += search_key.bytes;
         mutex_locker.unlock();
  
         if (debug)
           printf(
               "\tHost allocated new host block at %p (%lld bytes associated with stream %lld, event %lld on device "
               "%lld).\n",
               search_key.d_ptr,
               (long long)search_key.bytes,
               (long long)search_key.associated_stream,
               (long long)search_key.ready_event,
               (long long)search_key.device);
       }
  
       // Copy host pointer to output parameter
       *d_ptr = search_key.d_ptr;
  
       if (debug)
         printf("\t\t%lld available blocks cached (%lld bytes), %lld live blocks outstanding(%lld bytes).\n",
                (long long)cached_blocks.size(),
                (long long)cached_bytes.free,
                (long long)live_blocks.size(),
                (long long)cached_bytes.live);
  
       return error;
     }
  
     cudaError_t HostFree(void *d_ptr) {
       int entrypoint_device = INVALID_DEVICE_ORDINAL;
       cudaError_t error = cudaSuccess;
  
       // Lock
       std::unique_lock<std::mutex> mutex_locker(mutex);
  
       // Find corresponding block descriptor
       bool recached = false;
       BlockDescriptor search_key(d_ptr);
       BusyBlocks::iterator block_itr = live_blocks.find(search_key);
       if (block_itr != live_blocks.end()) {
         // Remove from live blocks
         search_key = *block_itr;
         live_blocks.erase(block_itr);
         cached_bytes.live -= search_key.bytes;
  
         // Keep the returned allocation if bin is valid and we won't exceed the max cached threshold
         if ((search_key.bin != INVALID_BIN) && (cached_bytes.free + search_key.bytes <= max_cached_bytes)) {

◆ HostFree()

cudaError_t notcub::CachingHostAllocator::HostFree ( void * d_ptr )

inline

Frees a live allocation of pinned host memory, returning it to the allocator.

Once freed, the allocation becomes available for reuse as soon as the ready event recorded on its associated stream has completed.

Definition at line 522 of file CachingHostAllocator.h.

                                                   {
         cudaCheck(error = cudaSetDevice(search_key.device));
       }
  
       if (recached) {
         // Insert the ready event in the associated stream (must have current device set properly)
         cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
       }
  
       // Unlock
       mutex_locker.unlock();
  
       if (!recached) {
         // Free the allocation from the runtime and cleanup the event.
         cudaCheck(error = cudaFreeHost(d_ptr));
         cudaCheck(error = cudaEventDestroy(search_key.ready_event));
  
         if (debug)
           printf(
               "\tHost freed %lld bytes from associated stream %lld, event %lld on device %lld.\n\t\t  %lld available "
               "blocks cached (%lld bytes), %lld live blocks (%lld bytes) outstanding.\n",
               (long long)search_key.bytes,
               (long long)search_key.associated_stream,
               (long long)search_key.ready_event,
               (long long)search_key.device,
               (long long)cached_blocks.size(),
               (long long)cached_bytes.free,
               (long long)live_blocks.size(),
               (long long)cached_bytes.live);
       }
  
       // Reset device
       if ((entrypoint_device != INVALID_DEVICE_ORDINAL) && (entrypoint_device != search_key.device)) {
         cudaCheck(error = cudaSetDevice(entrypoint_device));
       }
  
       return error;
     }
  
     cudaError_t FreeAllCached() {
       cudaError_t error = cudaSuccess;
       int entrypoint_device = INVALID_DEVICE_ORDINAL;
       int current_device = INVALID_DEVICE_ORDINAL;
  
       std::unique_lock<std::mutex> mutex_locker(mutex);
  
       while (!cached_blocks.empty()) {
         // Get first block
         CachedBlocks::iterator begin = cached_blocks.begin();
  
         // Get entry-point device ordinal if necessary
         if (entrypoint_device == INVALID_DEVICE_ORDINAL) {
           if ((error = cudaGetDevice(&entrypoint_device)))
             break;
         }
  
         // Set current device ordinal if necessary
         if (begin->device != current_device) {
           if ((error = cudaSetDevice(begin->device)))
             break;

◆ IntPow()

static unsigned int notcub::CachingHostAllocator::IntPow	(	unsigned int	base,
		unsigned int	exp
	)

inline static

Integer power function for unsigned base and exponent.

Definition at line 204 of file CachingHostAllocator.h.
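
The body of this function is not reproduced on this page. As a rough illustration, an integer power can be computed by exponentiation by squaring; `int_pow` below is a hypothetical re-implementation written for this example, not the code from the header.

```cpp
// Hypothetical re-implementation: integer power via exponentiation by squaring.
// As with any fixed-width unsigned arithmetic, the result wraps around on overflow.
constexpr unsigned int int_pow(unsigned int base, unsigned int exp) {
  unsigned int result = 1;
  while (exp > 0) {
    if (exp & 1u)
      result *= base;  // fold in the current bit of the exponent
    base *= base;      // square the base for the next bit
    exp >>= 1;
  }
  return result;
}

// Compile-time checks against the default bin limits quoted in this page.
static_assert(int_pow(8, 3) == 512, "smallest default bin");
static_assert(int_pow(8, 7) == 2097152, "largest default bin");
```

The default constructor uses the same quantities when it initializes `min_bin_bytes` and `max_bin_bytes` from `IntPow(bin_growth, min_bin)` and `IntPow(bin_growth, max_bin)`.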

                                     {
         rounded_bytes *= base;
         power++;
       }
     }
  
     //---------------------------------------------------------------------
     // Fields
     //---------------------------------------------------------------------
  

◆ NearestPowerOf()

void notcub::CachingHostAllocator::NearestPowerOf	(	unsigned int &	power,
		size_t &	rounded_bytes,
		unsigned int	base,
		size_t	value
	)

inline

Round value up to the nearest power of base. On return, power holds the exponent and rounded_bytes holds base ^ power.

Definition at line 219 of file CachingHostAllocator.h.
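
A minimal sketch of such a rounding routine, following the documented signature, might look like the code below. `nearest_power_of` is a hypothetical name, the `base >= 2` precondition is an assumption of this sketch, and so is the choice to saturate when the next multiplication would overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical re-implementation: finds the smallest `power` such that
// base ^ power >= value and stores base ^ power in `rounded_bytes`.
inline void nearest_power_of(unsigned int& power, std::size_t& rounded_bytes,
                             unsigned int base, std::size_t value) {
  assert(base >= 2);  // with base 0 or 1 the loop could not make progress
  const std::size_t max_size = std::numeric_limits<std::size_t>::max();
  power = 0;
  rounded_bytes = 1;
  while (rounded_bytes < value) {
    if (rounded_bytes > max_size / base) {
      // The next multiplication would overflow: saturate instead of wrapping
      // (an assumption of this sketch).
      power = static_cast<unsigned int>(sizeof(std::size_t) * 8);
      rounded_bytes = max_size;
      return;
    }
    rounded_bytes *= base;
    ++power;
  }
}
```

For example, with base 8 a value of 1000 would round up to 4096 with power 4, while a value of exactly 512 would stay at 512 with power 3.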

◆ SetMaxCachedBytes()

void notcub::CachingHostAllocator::SetMaxCachedBytes ( size_t max_cached_bytes )

inline

Sets the limit on the number of bytes this allocator is allowed to cache.

Changing the ceiling of cached bytes does not cause any allocations (in-use or cached-in-reserve) to be freed. See FreeAllCached().

Definition at line 317 of file CachingHostAllocator.h.
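
The effect of changing the ceiling can be modeled with a small accounting class. `CacheBudget` is a hypothetical toy written for this example: it tracks byte counts only, allocates no memory, and omits the locking the real allocator performs.

```cpp
#include <cstddef>

// Toy model of the cache accounting only; no memory is allocated and no
// locking is done here.
class CacheBudget {
public:
  explicit CacheBudget(std::size_t max_cached_bytes)
      : max_cached_bytes_(max_cached_bytes) {}

  // Change the ceiling. Bytes that are already cached stay cached, even if
  // they now exceed the new ceiling.
  void set_max_cached_bytes(std::size_t max_cached_bytes) {
    max_cached_bytes_ = max_cached_bytes;
  }

  // Called when a block is released. Returns true if the block would be kept
  // in the cache, false if it would be freed instead.
  bool release(std::size_t bytes) {
    // Written to avoid overflow in cached + bytes and underflow in max - cached.
    if (cached_bytes_ > max_cached_bytes_ ||
        bytes > max_cached_bytes_ - cached_bytes_)
      return false;
    cached_bytes_ += bytes;
    return true;
  }

  std::size_t cached_bytes() const { return cached_bytes_; }

private:
  std::size_t max_cached_bytes_;
  std::size_t cached_bytes_ = 0;
};
```

In this model, lowering the ceiling below the current cached total leaves `cached_bytes()` unchanged; it only causes later releases to be rejected until the cache is emptied, which in the real allocator is the job of FreeAllCached().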

Member Data Documentation

◆ bin_growth

unsigned int notcub::CachingHostAllocator::bin_growth

Geometric growth factor for bin-sizes.

Definition at line 242 of file CachingHostAllocator.h.

◆ cached_blocks

CachedBlocks notcub::CachingHostAllocator::cached_blocks

Set of cached pinned host allocations available for reuse.

Definition at line 255 of file CachingHostAllocator.h.

◆ cached_bytes

TotalBytes notcub::CachingHostAllocator::cached_bytes

Aggregate cached bytes.

Definition at line 254 of file CachingHostAllocator.h.

◆ debug

bool notcub::CachingHostAllocator::debug

Whether or not to print (de)allocation events to stdout.

Definition at line 252 of file CachingHostAllocator.h.

◆ INVALID_BIN

const unsigned int notcub::CachingHostAllocator::INVALID_BIN = (unsigned int)-1

static

Out-of-bounds bin.

Definition at line 131 of file CachingHostAllocator.h.

◆ INVALID_DEVICE_ORDINAL

const int notcub::CachingHostAllocator::INVALID_DEVICE_ORDINAL = -1

static

Invalid device ordinal.

Definition at line 139 of file CachingHostAllocator.h.

◆ INVALID_SIZE

const size_t notcub::CachingHostAllocator::INVALID_SIZE = (size_t)-1

static

Invalid size.

Definition at line 134 of file CachingHostAllocator.h.

◆ live_blocks

BusyBlocks notcub::CachingHostAllocator::live_blocks

Set of live pinned host allocations currently in use.

Definition at line 256 of file CachingHostAllocator.h.

◆ max_bin

unsigned int notcub::CachingHostAllocator::max_bin

Maximum bin enumeration.

Definition at line 244 of file CachingHostAllocator.h.

◆ max_bin_bytes

size_t notcub::CachingHostAllocator::max_bin_bytes

Maximum bin size.

Definition at line 247 of file CachingHostAllocator.h.

◆ max_cached_bytes

size_t notcub::CachingHostAllocator::max_cached_bytes

Maximum aggregate cached bytes.

Definition at line 248 of file CachingHostAllocator.h.

◆ min_bin

unsigned int notcub::CachingHostAllocator::min_bin

Minimum bin enumeration.

Definition at line 243 of file CachingHostAllocator.h.

◆ min_bin_bytes

size_t notcub::CachingHostAllocator::min_bin_bytes

Minimum bin size.

Definition at line 246 of file CachingHostAllocator.h.

◆ mutex

std::mutex notcub::CachingHostAllocator::mutex

Mutex for thread-safety.

Definition at line 240 of file CachingHostAllocator.h.

◆ skip_cleanup

const bool notcub::CachingHostAllocator::skip_cleanup

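Whether or not to skip a call to FreeAllCached() when the destructor is called. (The CUDA runtime may have already shut down for statically declared allocators.)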

Definition at line 251 of file CachingHostAllocator.h.
