A simple caching allocator for device memory allocations.

#include <CachingDeviceAllocator.h>

Classes
struct	BlockDescriptor

Public Types
typedef std::multiset< BlockDescriptor, Compare >	BusyBlocks
	Set type for live blocks (ordered by ptr)

typedef std::multiset< BlockDescriptor, Compare >	CachedBlocks
	Set type for cached blocks (ordered by size)

typedef bool(*	Compare) (const BlockDescriptor &, const BlockDescriptor &)
	BlockDescriptor comparator function interface.

using	GpuCachedBytes = cms::cuda::allocator::GpuCachedBytes
	Map type of device ordinals to the number of bytes cached on each device.

Public Member Functions
GpuCachedBytes	CacheStatus () const

	CachingDeviceAllocator (unsigned int bin_growth, unsigned int min_bin=1, unsigned int max_bin=INVALID_BIN, size_t max_cached_bytes=INVALID_SIZE, bool skip_cleanup=false, bool debug=false)
	Constructor.

	CachingDeviceAllocator (bool skip_cleanup=false, bool debug=false)
	Default constructor.

cudaError_t	DeviceAllocate (int device, void **d_ptr, size_t bytes, cudaStream_t active_stream=nullptr)
	Provides a suitable allocation of device memory for the given size on the specified device.

cudaError_t	DeviceAllocate (void **d_ptr, size_t bytes, cudaStream_t active_stream=nullptr)
	Provides a suitable allocation of device memory for the given size on the current device.

cudaError_t	DeviceFree (int device, void *d_ptr)
	Frees a live allocation of device memory on the specified device, returning it to the allocator.

cudaError_t	DeviceFree (void *d_ptr)
	Frees a live allocation of device memory on the current device, returning it to the allocator.

cudaError_t	FreeAllCached ()
	Frees all cached device allocations on all devices.

void	NearestPowerOf (unsigned int &power, size_t &rounded_bytes, unsigned int base, size_t value)

cudaError_t	SetMaxCachedBytes (size_t max_cached_bytes)
	Sets the limit on the number of bytes this allocator is allowed to cache per device.

	~CachingDeviceAllocator ()
	Destructor.

Static Public Member Functions
static unsigned int	IntPow (unsigned int base, unsigned int exp)

Public Attributes
unsigned int	bin_growth
	Geometric growth factor for bin-sizes.

CachedBlocks	cached_blocks
	Set of cached device allocations available for reuse.

GpuCachedBytes	cached_bytes
	Map of device ordinal to aggregate cached bytes on that device.

bool	debug
	Whether or not to print (de)allocation events to stdout.

BusyBlocks	live_blocks
	Set of live device allocations currently in use.

unsigned int	max_bin
	Maximum bin enumeration.

size_t	max_bin_bytes
	Maximum bin size.

size_t	max_cached_bytes
	Maximum aggregate cached bytes per device.

unsigned int	min_bin
	Minimum bin enumeration.

size_t	min_bin_bytes
	Minimum bin size.

std::mutex	mutex
	Mutex for thread-safety.

const bool	skip_cleanup
	Whether or not to skip a call to FreeAllCached() when the destructor is called. (The CUDA runtime may have already shut down for statically declared allocators.)

Static Public Attributes
static const unsigned int	INVALID_BIN = (unsigned int)-1
	Out-of-bounds bin.

static const int	INVALID_DEVICE_ORDINAL = -1
	Invalid device ordinal.

static const size_t	INVALID_SIZE = (size_t)-1
	Invalid size.

Detailed Description

A simple caching allocator for device memory allocations.

Overview: The allocator is thread-safe and stream-safe and is capable of managing cached device allocations on multiple devices. It behaves as follows:

Allocations from the allocator are associated with an active_stream. Once freed, an allocation becomes available immediately for reuse within the active_stream it was associated with during allocation, and it becomes available for reuse within other streams once all prior work submitted to active_stream has completed.
Allocations are categorized and cached by bin size. A new allocation request of a given size will only consider cached allocations within the corresponding bin.
Bin limits progress geometrically in accordance with the growth factor bin_growth provided during construction. Unused device allocations within a larger bin cache are not reused for allocation requests that categorize to smaller bin sizes.
Allocation requests below (bin_growth ^ min_bin) are rounded up to (bin_growth ^ min_bin).
Allocations above (bin_growth ^ max_bin) are not rounded up to the nearest bin and are simply freed when they are deallocated instead of being returned to a bin-cache.
If the total storage of cached allocations on a given device will exceed max_cached_bytes, allocations for that device are simply freed when they are deallocated instead of being returned to their bin-cache.

For example, the default-constructed CachingDeviceAllocator is configured with:

bin_growth = 8
min_bin = 3
max_bin = 7
max_cached_bytes = 6MB - 1B

which delineates five bin sizes (512B, 4KB, 32KB, 256KB, and 2MB) and sets a maximum of 6,291,455 cached bytes per device.
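The arithmetic behind the default configuration can be checked with a short host-only sketch. The helper `binSizes` and the function `defaultConfigExample` are hypothetical names introduced for illustration; they are not part of CachingDeviceAllocator.h and only reproduce the geometric progression described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of CachingDeviceAllocator.h:
// lists the byte size of every bin from min_bin to max_bin inclusive.
std::vector<std::size_t> binSizes(unsigned int bin_growth, unsigned int min_bin, unsigned int max_bin) {
  std::vector<std::size_t> sizes;
  std::size_t bytes = 1;  // bin_growth ^ 0
  for (unsigned int bin = 0; bin <= max_bin; ++bin) {
    if (bin >= min_bin)
      sizes.push_back(bytes);  // here bytes == bin_growth ^ bin
    bytes *= bin_growth;
  }
  return sizes;
}

// Reproduces the default configuration: bin_growth = 8, min_bin = 3, max_bin = 7.
void defaultConfigExample() {
  const std::vector<std::size_t> sizes = binSizes(8, 3, 7);
  assert(sizes.size() == 5);        // five bins
  assert(sizes.front() == 512);     // 8^3 = 512 B
  assert(sizes.back() == 2097152);  // 8^7 = 2 MB
  // max_cached_bytes = (max_bin_bytes * 3) - 1
  assert(sizes.back() * 3 - 1 == 6291455);
}
```

Calling `defaultConfigExample()` from a test or a `main` function exercises the checks; the intermediate bins come out as 4096, 32768, and 262144 bytes.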

Definition at line 100 of file CachingDeviceAllocator.h.

Member Typedef Documentation

◆ BusyBlocks

typedef std::multiset<BlockDescriptor, Compare> notcub::CachingDeviceAllocator::BusyBlocks

Set type for live blocks (ordered by ptr)

Definition at line 178 of file CachingDeviceAllocator.h.

◆ CachedBlocks

typedef std::multiset<BlockDescriptor, Compare> notcub::CachingDeviceAllocator::CachedBlocks

Set type for cached blocks (ordered by size)

Definition at line 175 of file CachingDeviceAllocator.h.

◆ Compare

typedef bool(* notcub::CachingDeviceAllocator::Compare) (const BlockDescriptor &, const BlockDescriptor &)

BlockDescriptor comparator function interface.

Definition at line 170 of file CachingDeviceAllocator.h.

◆ GpuCachedBytes

using notcub::CachingDeviceAllocator::GpuCachedBytes = cms::cuda::allocator::GpuCachedBytes

Map type of device ordinals to the number of cached bytes cached by each device.

Definition at line 182 of file CachingDeviceAllocator.h.

Constructor & Destructor Documentation

◆ CachingDeviceAllocator() [1/2]

notcub::CachingDeviceAllocator::CachingDeviceAllocator	(	unsigned int	bin_growth,
		unsigned int	min_bin = `1`,
		unsigned int	max_bin = `INVALID_BIN`,
		size_t	max_cached_bytes = `INVALID_SIZE`,
		bool	skip_cleanup = `false`,
		bool	debug = `false`
	)

inline

Constructor.

Parameters

bin_growth	Geometric growth factor for bin-sizes
min_bin	Minimum bin (default is bin_growth ^ 1)
max_bin	Maximum bin (default is no max bin)
max_cached_bytes	Maximum aggregate cached bytes per device (default is no limit)
skip_cleanup	Whether or not to skip a call to `FreeAllCached()` when the destructor is called (default is to deallocate)
debug	Whether or not to print (de)allocation events to stdout (default is no stdout output)

Definition at line 255 of file CachingDeviceAllocator.h.

         : bin_growth(bin_growth),
           min_bin(min_bin),
           max_bin(max_bin),
           min_bin_bytes(IntPow(bin_growth, min_bin)),
           max_bin_bytes(IntPow(bin_growth, max_bin)),
           max_cached_bytes(max_cached_bytes),
           skip_cleanup(skip_cleanup),
           debug(debug),
           cached_blocks(BlockDescriptor::SizeCompare),
           live_blocks(BlockDescriptor::PtrCompare) {}

◆ CachingDeviceAllocator() [2/2]

notcub::CachingDeviceAllocator::CachingDeviceAllocator	(	bool	skip_cleanup = `false`,
		bool	debug = `false`
	)

inline

Default constructor.

Configured with:

bin_growth = 8
min_bin = 3
max_bin = 7
max_cached_bytes = ((bin_growth ^ max_bin) * 3) - 1 = 6,291,455 bytes

which delineates five bin sizes (512B, 4KB, 32KB, 256KB, and 2MB) and sets a maximum of 6,291,455 cached bytes per device.

Definition at line 287 of file CachingDeviceAllocator.h.

         : bin_growth(8),
           min_bin(3),
           max_bin(7),
           min_bin_bytes(IntPow(bin_growth, min_bin)),
           max_bin_bytes(IntPow(bin_growth, max_bin)),
           max_cached_bytes((max_bin_bytes * 3) - 1),
           skip_cleanup(skip_cleanup),
           debug(debug),
           cached_blocks(BlockDescriptor::SizeCompare),
           live_blocks(BlockDescriptor::PtrCompare) {}

◆ ~CachingDeviceAllocator()

notcub::CachingDeviceAllocator::~CachingDeviceAllocator ( )

inline

Destructor.

Definition at line 737 of file CachingDeviceAllocator.h.

References FreeAllCached(), and skip_cleanup.

                               {
       if (!skip_cleanup)
         FreeAllCached();
     }

Member Function Documentation

◆ CacheStatus()

GpuCachedBytes notcub::CachingDeviceAllocator::CacheStatus ( ) const

inline

Definition at line 728 of file CachingDeviceAllocator.h.

References cached_bytes, and mutex.

Referenced by cms::cuda::deviceAllocatorStatus().

                                        {
       std::unique_lock mutex_locker(mutex);
       return cached_bytes;
     }
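CacheStatus() returns its result by value while holding the mutex, so the caller receives a snapshot that later allocations cannot modify. The sketch below illustrates that pattern in isolation. It assumes GpuCachedBytes behaves like a `std::map` from device ordinal to a struct with `free`, `live`, and `liveRequested` counters (the field names are taken from their use in the listings on this page; the real definition lives in cms::cuda::allocator and is not shown here). `TotalBytes`, `CachedBytesMap`, `StatusHolder`, and `snapshotExample` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>

// Assumed shape of one map entry; not the actual CMSSW definition.
struct TotalBytes {
  std::size_t free = 0;
  std::size_t live = 0;
  std::size_t liveRequested = 0;
};
using CachedBytesMap = std::map<int, TotalBytes>;  // stand-in for GpuCachedBytes

// Hypothetical miniature of the locking pattern used by CacheStatus().
class StatusHolder {
public:
  void addLive(int device, std::size_t bytes) {
    std::unique_lock<std::mutex> lock(mutex_);
    cached_bytes_[device].live += bytes;
  }

  // Returns a copy made under the lock. The mutex is declared mutable
  // so that this accessor can stay const.
  CachedBytesMap status() const {
    std::unique_lock<std::mutex> lock(mutex_);
    return cached_bytes_;
  }

private:
  mutable std::mutex mutex_;
  CachedBytesMap cached_bytes_;
};

void snapshotExample() {
  StatusHolder holder;
  holder.addLive(0, 512);
  const CachedBytesMap snapshot = holder.status();
  holder.addLive(0, 4096);  // later updates do not affect the copy
  assert(snapshot.at(0).live == 512);
  assert(holder.status().at(0).live == 4608);
}
```

Returning a copy costs one map copy per call, which is likely acceptable for a status query with one entry per device.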

◆ DeviceAllocate() [1/2]

cudaError_t notcub::CachingDeviceAllocator::DeviceAllocate	(	int	device,
		void **	d_ptr,
		size_t	bytes,
		cudaStream_t	active_stream = `nullptr`
	)

inline

Provides a suitable allocation of device memory for the given size on the specified device.

Once freed, the allocation becomes available immediately for reuse within the active_stream it was associated with during allocation, and it becomes available for reuse within other streams once all prior work submitted to active_stream has completed.

Parameters

[in]	device	Device on which to place the allocation
[out]	d_ptr	Reference to pointer to the allocation
[in]	bytes	Minimum number of bytes for the allocation
[in]	active_stream	The stream to be associated with this allocation

Definition at line 331 of file CachingDeviceAllocator.h.

References notcub::CachingDeviceAllocator::BlockDescriptor::associated_stream, notcub::CachingDeviceAllocator::BlockDescriptor::bin, bin_growth, notcub::CachingDeviceAllocator::BlockDescriptor::bytes, notcub::CachingDeviceAllocator::BlockDescriptor::bytesRequested, cached_blocks, cached_bytes, cudaCheck, notcub::CachingDeviceAllocator::BlockDescriptor::d_ptr, debug, relativeConstraints::error, newFWLiteAna::found, free(), INVALID_BIN, INVALID_DEVICE_ORDINAL, beam_dqm_sourceclient-live_cfg::live, live_blocks, max_bin, min_bin, min_bin_bytes, mutex, NearestPowerOf(), and notcub::CachingDeviceAllocator::BlockDescriptor::ready_event.

Referenced by DeviceAllocate().

     {
       // CMS: use RAII instead of (un)locking explicitly
       std::unique_lock<std::mutex> mutex_locker(mutex, std::defer_lock);
       *d_ptr = nullptr;
       int entrypoint_device = INVALID_DEVICE_ORDINAL;
       cudaError_t error = cudaSuccess;
 
       if (device == INVALID_DEVICE_ORDINAL) {
         // CMS: throw exception on error
         cudaCheck(error = cudaGetDevice(&entrypoint_device));
         device = entrypoint_device;
       }
 
       // Create a block descriptor for the requested allocation
       bool found = false;
       BlockDescriptor search_key(device);
       search_key.bytesRequested = bytes;  // CMS
       search_key.associated_stream = active_stream;
       NearestPowerOf(search_key.bin, search_key.bytes, bin_growth, bytes);
 
       if (search_key.bin > max_bin) {
         // Bin is greater than our maximum bin: allocate the request
         // exactly and give out-of-bounds bin.  It will not be cached
         // for reuse when returned.
         search_key.bin = INVALID_BIN;
         search_key.bytes = bytes;
       } else {
         // Search for a suitable cached allocation: lock
         mutex_locker.lock();
 
         if (search_key.bin < min_bin) {
           // Bin is less than minimum bin: round up
           search_key.bin = min_bin;
           search_key.bytes = min_bin_bytes;
         }
 
         // Iterate through the range of cached blocks on the same device in the same bin
         CachedBlocks::iterator block_itr = cached_blocks.lower_bound(search_key);
         while ((block_itr != cached_blocks.end()) && (block_itr->device == device) &&
                (block_itr->bin == search_key.bin)) {
           // To prevent races with reusing blocks returned by the host but still
           // in use by the device, only consider cached blocks that are
           // either (from the active stream) or (from an idle stream)
           if ((active_stream == block_itr->associated_stream) ||
               (cudaEventQuery(block_itr->ready_event) != cudaErrorNotReady)) {
             // Reuse existing cache block.  Insert into live blocks.
             found = true;
             search_key = *block_itr;
             search_key.associated_stream = active_stream;
             live_blocks.insert(search_key);
 
             // Remove from free blocks
             cached_bytes[device].free -= search_key.bytes;
             cached_bytes[device].live += search_key.bytes;
             cached_bytes[device].liveRequested += search_key.bytesRequested;  // CMS
 
             if (debug)
               // CMS: improved debug message
               // CMS: use raw printf
               printf(
                   "\tDevice %d reused cached block at %p (%lld bytes) for stream %lld, event %lld (previously "
                   "associated with stream %lld, event %lld).\n",
                   device,
                   search_key.d_ptr,
                   (long long)search_key.bytes,
                   (long long)search_key.associated_stream,
                   (long long)search_key.ready_event,
                   (long long)block_itr->associated_stream,
                   (long long)block_itr->ready_event);
 
             cached_blocks.erase(block_itr);
 
             break;
           }
           block_itr++;
         }
 
         // Done searching: unlock
         mutex_locker.unlock();
       }
 
       // Allocate the block if necessary
       if (!found) {
         // Set runtime's current device to specified device (entrypoint may not be set)
         if (device != entrypoint_device) {
           // CMS: throw exception on error
           cudaCheck(error = cudaGetDevice(&entrypoint_device));
           cudaCheck(error = cudaSetDevice(device));
         }
 
         // Attempt to allocate
         // CMS: silently ignore errors and retry or pass them to the caller
         if ((error = cudaMalloc(&search_key.d_ptr, search_key.bytes)) == cudaErrorMemoryAllocation) {
           // The allocation attempt failed: free all cached blocks on device and retry
           if (debug)
             // CMS: use raw printf
             printf(
                 "\tDevice %d failed to allocate %lld bytes for stream %lld, retrying after freeing cached allocations",
                 device,
                 (long long)search_key.bytes,
                 (long long)search_key.associated_stream);
 
           error = cudaSuccess;  // Reset the error we will return
           cudaGetLastError();   // Reset CUDART's error
 
           // Lock
           mutex_locker.lock();
 
           // Iterate the range of free blocks on the same device
           BlockDescriptor free_key(device);
           CachedBlocks::iterator block_itr = cached_blocks.lower_bound(free_key);
 
           while ((block_itr != cached_blocks.end()) && (block_itr->device == device)) {
             // No need to worry about synchronization with the device: cudaFree is
             // blocking and will synchronize across all kernels executing
             // on the current device
 
             // Free device memory and destroy stream event.
             // CMS: silently ignore errors and pass them to the caller
             if ((error = cudaFree(block_itr->d_ptr)))
               break;
             if ((error = cudaEventDestroy(block_itr->ready_event)))
               break;
 
             // Reduce balance and erase entry
             cached_bytes[device].free -= block_itr->bytes;
 
             if (debug)
               // CMS: use raw printf
               printf(
                   "\tDevice %d freed %lld bytes.\n\t\t  %lld available blocks cached (%lld bytes), %lld live blocks "
                   "(%lld bytes) outstanding.\n",
                   device,
                   (long long)block_itr->bytes,
                   (long long)cached_blocks.size(),
                   (long long)cached_bytes[device].free,
                   (long long)live_blocks.size(),
                   (long long)cached_bytes[device].live);
 
             // Erasing invalidates block_itr: continue from the iterator returned by erase()
             block_itr = cached_blocks.erase(block_itr);
           }
 
           // Unlock
           mutex_locker.unlock();
 
           // Return under error
           if (error)
             return error;
 
           // Try to allocate again
           // CMS: throw exception on error
           cudaCheck(error = cudaMalloc(&search_key.d_ptr, search_key.bytes));
         }
 
         // Create ready event
         // CMS: throw exception on error
         cudaCheck(error = cudaEventCreateWithFlags(&search_key.ready_event, cudaEventDisableTiming));
 
         // Insert into live blocks
         mutex_locker.lock();
         live_blocks.insert(search_key);
         cached_bytes[device].live += search_key.bytes;
         cached_bytes[device].liveRequested += search_key.bytesRequested;  // CMS
         mutex_locker.unlock();
 
         if (debug)
           // CMS: improved debug message
           // CMS: use raw printf
           printf("\tDevice %d allocated new device block at %p (%lld bytes associated with stream %lld, event %lld).\n",
                  device,
                  search_key.d_ptr,
                  (long long)search_key.bytes,
                  (long long)search_key.associated_stream,
                  (long long)search_key.ready_event);
 
         // Attempt to revert back to previous device if necessary
         if ((entrypoint_device != INVALID_DEVICE_ORDINAL) && (entrypoint_device != device)) {
           // CMS: throw exception on error
           cudaCheck(error = cudaSetDevice(entrypoint_device));
         }
       }
 
       // Copy device pointer to output parameter
       *d_ptr = search_key.d_ptr;
 
       if (debug)
         // CMS: use raw printf
         printf("\t\t%lld available blocks cached (%lld bytes), %lld live blocks outstanding(%lld bytes).\n",
                (long long)cached_blocks.size(),
                (long long)cached_bytes[device].free,
                (long long)live_blocks.size(),
                (long long)cached_bytes[device].live);
 
       return error;
     }
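The reuse test inside the search loop of the listing above reduces to a two-part predicate: same stream, or ready event already completed. The following host-only sketch models just that predicate. `CachedBlockModel`, `canReuse`, and `reuseExample` are hypothetical names; the integer stream id and the boolean flag are stand-ins for `cudaStream_t` and for the result of `cudaEventQuery(ready_event) != cudaErrorNotReady`, so no CUDA runtime is involved.

```cpp
#include <cassert>

// Hypothetical stand-in for the fields of BlockDescriptor that matter here.
struct CachedBlockModel {
  int associated_stream;      // stream the block was last used on
  bool ready_event_complete;  // models cudaEventQuery(...) != cudaErrorNotReady
};

// A cached block may be handed out again if it was last used on the
// requesting stream, or if all work recorded before its ready event
// has finished.
bool canReuse(const CachedBlockModel& block, int active_stream) {
  return active_stream == block.associated_stream || block.ready_event_complete;
}

void reuseExample() {
  const CachedBlockModel pending = {1, false};  // stream 1 still has work queued
  const CachedBlockModel idle = {1, true};      // stream 1 has drained

  assert(canReuse(pending, 1));   // same stream: stream ordering makes reuse safe
  assert(!canReuse(pending, 2));  // other stream must wait for the event
  assert(canReuse(idle, 2));      // event completed: any stream may reuse
}
```

Reuse on the same stream is safe without waiting because work submitted to one stream executes in submission order, so any new use of the block is queued behind the old one.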

◆ DeviceAllocate() [2/2]

cudaError_t notcub::CachingDeviceAllocator::DeviceAllocate	(	void **	d_ptr,
		size_t	bytes,
		cudaStream_t	active_stream = `nullptr`
	)

inline

Provides a suitable allocation of device memory for the given size on the current device.

Once freed, the allocation becomes available immediately for reuse within the active_stream it was associated with during allocation, and it becomes available for reuse within other streams once all prior work submitted to active_stream has completed.

Parameters

[out]	d_ptr	Reference to pointer to the allocation
[in]	bytes	Minimum number of bytes for the allocation
[in]	active_stream	The stream to be associated with this allocation

Definition at line 541 of file CachingDeviceAllocator.h.

References DeviceAllocate(), and INVALID_DEVICE_ORDINAL.

     {
       return DeviceAllocate(INVALID_DEVICE_ORDINAL, d_ptr, bytes, active_stream);
     }

◆ DeviceFree() [1/2]

cudaError_t notcub::CachingDeviceAllocator::DeviceFree	(	int	device,
		void *	d_ptr
	)

inline

Frees a live allocation of device memory on the specified device, returning it to the allocator.

Once freed, the allocation becomes available immediately for reuse within the active_stream it was associated with during allocation, and it becomes available for reuse within other streams once all prior work submitted to active_stream has completed.

Definition at line 556 of file CachingDeviceAllocator.h.

References notcub::CachingDeviceAllocator::BlockDescriptor::associated_stream, notcub::CachingDeviceAllocator::BlockDescriptor::bin, notcub::CachingDeviceAllocator::BlockDescriptor::bytes, notcub::CachingDeviceAllocator::BlockDescriptor::bytesRequested, cached_blocks, cached_bytes, cudaCheck, debug, relativeConstraints::error, free(), INVALID_BIN, INVALID_DEVICE_ORDINAL, beam_dqm_sourceclient-live_cfg::live, live_blocks, max_cached_bytes, mutex, and notcub::CachingDeviceAllocator::BlockDescriptor::ready_event.

                                                     {
       int entrypoint_device = INVALID_DEVICE_ORDINAL;
       cudaError_t error = cudaSuccess;
       // CMS: use RAII instead of (un)locking explicitly
       std::unique_lock<std::mutex> mutex_locker(mutex, std::defer_lock);
 
       if (device == INVALID_DEVICE_ORDINAL) {
         // CMS: throw exception on error
         cudaCheck(error = cudaGetDevice(&entrypoint_device));
         device = entrypoint_device;
       }
 
       // Lock
       mutex_locker.lock();
 
       // Find corresponding block descriptor
       bool recached = false;
       BlockDescriptor search_key(d_ptr, device);
       BusyBlocks::iterator block_itr = live_blocks.find(search_key);
       if (block_itr != live_blocks.end()) {
         // Remove from live blocks
         search_key = *block_itr;
         live_blocks.erase(block_itr);
         cached_bytes[device].live -= search_key.bytes;
         cached_bytes[device].liveRequested -= search_key.bytesRequested;  // CMS
 
         // Keep the returned allocation if bin is valid and we won't exceed the max cached threshold
         if ((search_key.bin != INVALID_BIN) && (cached_bytes[device].free + search_key.bytes <= max_cached_bytes)) {
           // Insert returned allocation into free blocks
           recached = true;
           cached_blocks.insert(search_key);
           cached_bytes[device].free += search_key.bytes;
 
           if (debug)
             // CMS: improved debug message
             // CMS: use raw printf
             printf(
                 "\tDevice %d returned %lld bytes at %p from associated stream %lld, event %lld.\n\t\t %lld available "
                 "blocks cached (%lld bytes), %lld live blocks outstanding. (%lld bytes)\n",
                 device,
                 (long long)search_key.bytes,
                 d_ptr,
                 (long long)search_key.associated_stream,
                 (long long)search_key.ready_event,
                 (long long)cached_blocks.size(),
                 (long long)cached_bytes[device].free,
                 (long long)live_blocks.size(),
                 (long long)cached_bytes[device].live);
         }
       }
 
       // First set to specified device (entrypoint may not be set)
       if (device != entrypoint_device) {
         // CMS: throw exception on error
         cudaCheck(error = cudaGetDevice(&entrypoint_device));
         cudaCheck(error = cudaSetDevice(device));
       }
 
       if (recached) {
         // Insert the ready event in the associated stream (must have current device set properly)
         // CMS: throw exception on error
         cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
       }
 
       // Unlock
       mutex_locker.unlock();
 
       if (!recached) {
         // Free the allocation from the runtime and cleanup the event.
         // CMS: throw exception on error
         cudaCheck(error = cudaFree(d_ptr));
         cudaCheck(error = cudaEventDestroy(search_key.ready_event));
 
         if (debug)
           // CMS: improved debug message
           printf(
               "\tDevice %d freed %lld bytes at %p from associated stream %lld, event %lld.\n\t\t  %lld available "
               "blocks cached (%lld bytes), %lld live blocks (%lld bytes) outstanding.\n",
               device,
               (long long)search_key.bytes,
               d_ptr,
               (long long)search_key.associated_stream,
               (long long)search_key.ready_event,
               (long long)cached_blocks.size(),
               (long long)cached_bytes[device].free,
               (long long)live_blocks.size(),
               (long long)cached_bytes[device].live);
       }
 
       // Reset device
       if ((entrypoint_device != INVALID_DEVICE_ORDINAL) && (entrypoint_device != device)) {
         // CMS: throw exception on error
         cudaCheck(error = cudaSetDevice(entrypoint_device));
       }
 
       return error;
     }
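The decision whether a freed block is kept in the cache depends on two conditions in the listing above: the bin must be valid, and the cached total on that device must not exceed max_cached_bytes after adding the block. The sketch below models only that bookkeeping. `CacheBudget`, `giveBack`, and `defaultBudgetExample` are hypothetical names, and `free_bytes` plays the role of `cached_bytes[device].free`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the per-device bookkeeping consulted by DeviceFree().
struct CacheBudget {
  std::size_t max_cached_bytes;
  std::size_t free_bytes;  // bytes currently held in the cache

  explicit CacheBudget(std::size_t limit) : max_cached_bytes(limit), free_bytes(0) {}

  // Returns true if the block is kept in the cache, false if the real
  // allocator would release it with cudaFree() instead.
  bool giveBack(std::size_t block_bytes, bool bin_is_valid) {
    if (bin_is_valid && free_bytes + block_bytes <= max_cached_bytes) {
      free_bytes += block_bytes;
      return true;
    }
    return false;
  }
};

// Default limit of 6,291,455 bytes with blocks from the largest (2 MB) bin.
void defaultBudgetExample() {
  CacheBudget budget(6291455);
  const std::size_t two_mb = 2097152;

  const bool first = budget.giveBack(two_mb, true);
  const bool second = budget.giveBack(two_mb, true);
  const bool third = budget.giveBack(two_mb, true);      // would reach 6,291,456
  const bool small = budget.giveBack(512, true);         // still fits
  const bool oversize = budget.giveBack(3000000, false); // INVALID_BIN: never cached

  assert(first && second);
  assert(!third);
  assert(small);
  assert(!oversize);
  assert(budget.free_bytes == 4194816);
}
```

Under these assumptions the "- 1" in the default limit means that at most two blocks from the 2 MB bin can sit in the cache at once; a third is released rather than cached.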

◆ DeviceFree() [2/2]

cudaError_t notcub::CachingDeviceAllocator::DeviceFree ( void * d_ptr )

inline

Frees a live allocation of device memory on the current device, returning it to the allocator.

Once freed, the allocation becomes available immediately for reuse within the active_stream it was associated with during allocation, and it becomes available for reuse within other streams once all prior work submitted to active_stream has completed.

Definition at line 661 of file CachingDeviceAllocator.h.

References DeviceFree(), and INVALID_DEVICE_ORDINAL.

Referenced by DeviceFree().

     { return DeviceFree(INVALID_DEVICE_ORDINAL, d_ptr); }

◆ FreeAllCached()

cudaError_t notcub::CachingDeviceAllocator::FreeAllCached ( )

inline

Frees all cached device allocations on all devices.

Definition at line 666 of file CachingDeviceAllocator.h.

References cached_blocks, cached_bytes, cudaCheck, debug, relativeConstraints::error, free(), INVALID_DEVICE_ORDINAL, beam_dqm_sourceclient-live_cfg::live, live_blocks, and mutex.

Referenced by cms::cuda::allocator::cachingAllocatorsFreeCached(), and ~CachingDeviceAllocator().

                                 {
       cudaError_t error = cudaSuccess;
       int entrypoint_device = INVALID_DEVICE_ORDINAL;
       int current_device = INVALID_DEVICE_ORDINAL;
       // CMS: use RAII instead of (un)locking explicitly
       std::unique_lock<std::mutex> mutex_locker(mutex);
 
       while (!cached_blocks.empty()) {
         // Get first block
         CachedBlocks::iterator begin = cached_blocks.begin();
 
         // Get entry-point device ordinal if necessary
         if (entrypoint_device == INVALID_DEVICE_ORDINAL) {
           // CMS: silently ignore errors and pass them to the caller
           if ((error = cudaGetDevice(&entrypoint_device)))
             break;
         }
 
         // Set current device ordinal if necessary
         if (begin->device != current_device) {
           // CMS: silently ignore errors and pass them to the caller
           if ((error = cudaSetDevice(begin->device)))
             break;
           current_device = begin->device;
         }
 
         // Free device memory
         // CMS: silently ignore errors and pass them to the caller
         if ((error = cudaFree(begin->d_ptr)))
           break;
         if ((error = cudaEventDestroy(begin->ready_event)))
           break;
 
         // Reduce balance and erase entry
         cached_bytes[current_device].free -= begin->bytes;
 
         if (debug)
           printf(
               "\tDevice %d freed %lld bytes.\n\t\t  %lld available blocks cached (%lld bytes), %lld live blocks (%lld "
               "bytes) outstanding.\n",
               current_device,
               (long long)begin->bytes,
               (long long)cached_blocks.size(),
               (long long)cached_bytes[current_device].free,
               (long long)live_blocks.size(),
               (long long)cached_bytes[current_device].live);
 
         cached_blocks.erase(begin);
       }
 
       mutex_locker.unlock();
 
       // Attempt to revert back to entry-point device if necessary
       if (entrypoint_device != INVALID_DEVICE_ORDINAL) {
         // CMS: throw exception on error
         cudaCheck(error = cudaSetDevice(entrypoint_device));
       }
 
       return error;
     }

◆ IntPow()

static unsigned int notcub::CachingDeviceAllocator::IntPow	(	unsigned int	base,
		unsigned int	exp
	)

inline static

Integer pow function for unsigned base and exponent.

Definition at line 191 of file CachingDeviceAllocator.h.

References newFWLiteAna::base, and JetChargeProducer_cfi::exp.

Referenced by cms::cuda::allocator::getCachingDeviceAllocator(), and cms::cuda::allocator::getCachingHostAllocator().

                                                                     {
       unsigned int retval = 1;
       while (exp > 0) {
         if (exp & 1) {
           retval = retval * base;  // multiply the result by the current base
         }
         base = base * base;  // square the base
         exp = exp >> 1;      // divide the exponent in half
       }
       return retval;
     }
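Because the loop above works in unsigned arithmetic, a result that does not fit is reduced modulo 2^N rather than reported as an error. The sketch below repeats the same exponentiation-by-squaring loop with an explicit 32-bit type to make this visible. The fixed width is an assumption (the original uses `unsigned int`, whose width is platform dependent), and `intPow32` and `intPowExample` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Same loop as IntPow(), with an explicit 32-bit type (assumption).
std::uint32_t intPow32(std::uint32_t base, std::uint32_t exp) {
  std::uint32_t retval = 1;
  while (exp > 0) {
    if (exp & 1) {
      retval = retval * base;  // multiply the result by the current base
    }
    base = base * base;  // square the base
    exp = exp >> 1;      // halve the exponent
  }
  return retval;
}

void intPowExample() {
  assert(intPow32(2, 0) == 1);
  assert(intPow32(8, 3) == 512);           // default min_bin_bytes
  assert(intPow32(8, 7) == 2097152);       // default max_bin_bytes
  assert(intPow32(8, 10) == 1073741824u);  // 2^30 still fits in 32 bits
  assert(intPow32(8, 11) == 0);            // 2^33 wraps around to 0
}
```

One consequence, if the same wrap-around applies to `unsigned int`: when the six-argument constructor is used with the default max_bin = INVALID_BIN, the computed max_bin_bytes is probably not a meaningful size and should not be relied upon.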

◆ NearestPowerOf()

void notcub::CachingDeviceAllocator::NearestPowerOf	(	unsigned int &	power,
		size_t &	rounded_bytes,
		unsigned int	base,
		size_t	value
	)

inline

Rounds value up to the nearest power of base, returning the exponent in power and the rounded size in rounded_bytes.

Definition at line 206 of file CachingDeviceAllocator.h.

References newFWLiteAna::base, and cms::alpakatools::detail::power().

Referenced by DeviceAllocate().

                                                                                                      {
       power = 0;
       rounded_bytes = 1;
 
       if (value * base < value) {
         // Overflow
         power = sizeof(size_t) * 8;
         rounded_bytes = size_t(0) - 1;
         return;
       }
 
       while (rounded_bytes < value) {
         rounded_bytes *= base;
         power++;
       }
     }
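How a request size maps to a bin follows from combining this rounding step with the min_bin and max_bin handling at the start of DeviceAllocate(). The sketch below puts the two together in host-only code. `nearestPowerOf` is a copy of the loop above; `BinChoice`, `chooseBin`, `kInvalidBin`, and `chooseBinExample` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>

const unsigned int kInvalidBin = static_cast<unsigned int>(-1);  // mirrors INVALID_BIN

struct BinChoice {
  unsigned int bin;
  std::size_t bytes;  // size that would actually be allocated
};

// Copy of NearestPowerOf(): smallest power of base that is >= value.
void nearestPowerOf(unsigned int& power, std::size_t& rounded_bytes, unsigned int base, std::size_t value) {
  power = 0;
  rounded_bytes = 1;
  if (value * base < value) {  // overflow guard, as in the original
    power = sizeof(std::size_t) * 8;
    rounded_bytes = std::size_t(0) - 1;
    return;
  }
  while (rounded_bytes < value) {
    rounded_bytes *= base;
    power++;
  }
}

// Hypothetical helper: bin selection as performed at the top of DeviceAllocate().
BinChoice chooseBin(std::size_t bytes, unsigned int bin_growth, unsigned int min_bin, unsigned int max_bin) {
  BinChoice choice;
  nearestPowerOf(choice.bin, choice.bytes, bin_growth, bytes);
  if (choice.bin > max_bin) {
    // Too large for any bin: allocate exactly, never cache.
    choice.bin = kInvalidBin;
    choice.bytes = bytes;
  } else if (choice.bin < min_bin) {
    // Too small: round up to the smallest bin (bin_growth ^ min_bin).
    choice.bin = min_bin;
    choice.bytes = 1;
    for (unsigned int i = 0; i < min_bin; ++i)
      choice.bytes *= bin_growth;
  }
  return choice;
}

void chooseBinExample() {
  // Default configuration: bin_growth = 8, min_bin = 3, max_bin = 7.
  assert(chooseBin(1, 8, 3, 7).bytes == 512);     // rounded up to the minimum bin
  assert(chooseBin(512, 8, 3, 7).bin == 3);       // exact fit stays in bin 3
  assert(chooseBin(513, 8, 3, 7).bytes == 4096);  // one byte more moves to bin 4
  assert(chooseBin(3000000, 8, 3, 7).bin == kInvalidBin);
  assert(chooseBin(3000000, 8, 3, 7).bytes == 3000000);
}
```

With a growth factor of 8, a request just above a bin boundary is served by a block up to eight times larger than requested, which is the cost of keeping the number of bins small.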

◆ SetMaxCachedBytes()

cudaError_t notcub::CachingDeviceAllocator::SetMaxCachedBytes ( size_t max_cached_bytes )

inline

Sets the limit on the number of bytes this allocator is allowed to cache per device.

Changing the ceiling of cached bytes does not cause any allocations (in-use or cached-in-reserve) to be freed. See FreeAllCached().

Definition at line 305 of file CachingDeviceAllocator.h.

References debug, max_cached_bytes, and mutex.

                                                            {
       // Lock
       // CMS: use RAII instead of (un)locking explicitly
       std::unique_lock mutex_locker(mutex);
 
       if (debug)
         // CMS: use raw printf
         printf("Changing max_cached_bytes (%lld -> %lld)\n",
                (long long)this->max_cached_bytes,
                (long long)max_cached_bytes);
 
       this->max_cached_bytes = max_cached_bytes;
 
       // Unlock (redundant, kept for style uniformity)
       mutex_locker.unlock();
 
       return cudaSuccess;
     }

Member Data Documentation

◆ bin_growth

unsigned int notcub::CachingDeviceAllocator::bin_growth

Geometric growth factor for bin-sizes.

Definition at line 230 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate().

◆ cached_blocks

CachedBlocks notcub::CachingDeviceAllocator::cached_blocks

Set of cached device allocations available for reuse.

Definition at line 243 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate(), DeviceFree(), and FreeAllCached().

◆ cached_bytes

GpuCachedBytes notcub::CachingDeviceAllocator::cached_bytes

Map of device ordinal to aggregate cached bytes on that device.

Definition at line 242 of file CachingDeviceAllocator.h.

Referenced by CacheStatus(), DeviceAllocate(), DeviceFree(), and FreeAllCached().

◆ debug

bool notcub::CachingDeviceAllocator::debug

Whether or not to print (de)allocation events to stdout.

Definition at line 240 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate(), DeviceFree(), rrapi.RRApi::dprint(), FreeAllCached(), rrapi.RRApi::get(), runTauIdMVA.TauIDEmbedder::loadMVA_WPs_run2_2017(), runTauIdMVA.TauIDEmbedder::runTauID(), and SetMaxCachedBytes().

◆ INVALID_BIN

const unsigned int notcub::CachingDeviceAllocator::INVALID_BIN = (unsigned int)-1

static

Out-of-bounds bin.

Definition at line 106 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate(), and DeviceFree().

◆ INVALID_DEVICE_ORDINAL

const int notcub::CachingDeviceAllocator::INVALID_DEVICE_ORDINAL = -1

static

Invalid device ordinal.

Definition at line 114 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate(), DeviceFree(), and FreeAllCached().

◆ INVALID_SIZE

const size_t notcub::CachingDeviceAllocator::INVALID_SIZE = (size_t)-1

static

Invalid size.

Definition at line 109 of file CachingDeviceAllocator.h.

◆ live_blocks

BusyBlocks notcub::CachingDeviceAllocator::live_blocks

Set of live device allocations currently in use.

Definition at line 244 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate(), DeviceFree(), and FreeAllCached().

◆ max_bin

unsigned int notcub::CachingDeviceAllocator::max_bin

Maximum bin enumeration.

Definition at line 232 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate().

◆ max_bin_bytes

size_t notcub::CachingDeviceAllocator::max_bin_bytes

Maximum bin size.

Definition at line 235 of file CachingDeviceAllocator.h.

◆ max_cached_bytes

size_t notcub::CachingDeviceAllocator::max_cached_bytes

Maximum aggregate cached bytes per device.

Definition at line 236 of file CachingDeviceAllocator.h.

Referenced by DeviceFree(), and SetMaxCachedBytes().

◆ min_bin

unsigned int notcub::CachingDeviceAllocator::min_bin

Minimum bin enumeration.

Definition at line 231 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate().

◆ min_bin_bytes

size_t notcub::CachingDeviceAllocator::min_bin_bytes

Minimum bin size.

Definition at line 234 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate().

◆ mutex

std::mutex notcub::CachingDeviceAllocator::mutex

mutable

Mutex for thread-safety.

Definition at line 228 of file CachingDeviceAllocator.h.

Referenced by CacheStatus(), DeviceAllocate(), DeviceFree(), FreeAllCached(), and SetMaxCachedBytes().

◆ skip_cleanup

const bool notcub::CachingDeviceAllocator::skip_cleanup

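Whether or not to skip a call to FreeAllCached() when the destructor is called. (The CUDA runtime may have already shut down for statically declared allocators.)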

Definition at line 239 of file CachingDeviceAllocator.h.

Referenced by ~CachingDeviceAllocator().
