
notcub::CachingDeviceAllocator Struct Reference

A simple caching allocator for device memory allocations. More...

#include <CachingDeviceAllocator.h>

Classes

struct  BlockDescriptor
 

Public Types

typedef std::multiset< BlockDescriptor, Compare > BusyBlocks
 Set type for live blocks (ordered by ptr) More...
 
typedef std::multiset< BlockDescriptor, Compare > CachedBlocks
 Set type for cached blocks (ordered by size) More...
 
typedef bool(* Compare) (const BlockDescriptor &, const BlockDescriptor &)
 BlockDescriptor comparator function interface. More...
 
using GpuCachedBytes = cms::cuda::allocator::GpuCachedBytes
 Map type of device ordinals to the number of bytes cached by each device. More...
 

Public Member Functions

GpuCachedBytes CacheStatus () const
 
 CachingDeviceAllocator (unsigned int bin_growth, unsigned int min_bin=1, unsigned int max_bin=INVALID_BIN, size_t max_cached_bytes=INVALID_SIZE, bool skip_cleanup=false, bool debug=false)
 Constructor. More...
 
 CachingDeviceAllocator (bool skip_cleanup=false, bool debug=false)
 Default constructor. More...
 
cudaError_t DeviceAllocate (int device, void **d_ptr, size_t bytes, cudaStream_t active_stream=nullptr)
 Provides a suitable allocation of device memory for the given size on the specified device. More...
 
cudaError_t DeviceAllocate (void **d_ptr, size_t bytes, cudaStream_t active_stream=nullptr)
 Provides a suitable allocation of device memory for the given size on the current device. More...
 
cudaError_t DeviceFree (int device, void *d_ptr)
 Frees a live allocation of device memory on the specified device, returning it to the allocator. More...
 
cudaError_t DeviceFree (void *d_ptr)
 Frees a live allocation of device memory on the current device, returning it to the allocator. More...
 
cudaError_t FreeAllCached ()
 Frees all cached device allocations on all devices. More...
 
void NearestPowerOf (unsigned int &power, size_t &rounded_bytes, unsigned int base, size_t value)
 
cudaError_t SetMaxCachedBytes (size_t max_cached_bytes)
 Sets the limit on the number of bytes this allocator is allowed to cache per device. More...
 
 ~CachingDeviceAllocator ()
 Destructor. More...
 

Static Public Member Functions

static unsigned int IntPow (unsigned int base, unsigned int exp)
 

Public Attributes

unsigned int bin_growth
 Geometric growth factor for bin-sizes. More...
 
CachedBlocks cached_blocks
 Set of cached device allocations available for reuse. More...
 
GpuCachedBytes cached_bytes
 Map of device ordinal to aggregate cached bytes on that device. More...
 
bool debug
 Whether or not to print (de)allocation events to stdout. More...
 
BusyBlocks live_blocks
 Set of live device allocations currently in use. More...
 
unsigned int max_bin
 Maximum bin enumeration. More...
 
size_t max_bin_bytes
 Maximum bin size. More...
 
size_t max_cached_bytes
 Maximum aggregate cached bytes per device. More...
 
unsigned int min_bin
 Minimum bin enumeration. More...
 
size_t min_bin_bytes
 Minimum bin size. More...
 
std::mutex mutex
 Mutex for thread-safety. More...
 
const bool skip_cleanup
 Whether or not to skip a call to FreeAllCached() when the destructor is called. (The CUDA runtime may have already shut down for statically declared allocators) More...
 

Static Public Attributes

static const unsigned int INVALID_BIN = (unsigned int)-1
 Out-of-bounds bin. More...
 
static const int INVALID_DEVICE_ORDINAL = -1
 Invalid device ordinal. More...
 
static const size_t INVALID_SIZE = (size_t)-1
 Invalid size. More...
 

Detailed Description

A simple caching allocator for device memory allocations.

Overview
The allocator is thread-safe and stream-safe and is capable of managing cached device allocations on multiple devices. It behaves as follows:
  • Allocations from the allocator are associated with an active_stream. Once freed, the allocation becomes available immediately for reuse within the active_stream with which it was associated during allocation, and it becomes available for reuse within other streams once all prior work submitted to active_stream has completed.
  • Allocations are categorized and cached by bin size. A new allocation request of a given size will only consider cached allocations within the corresponding bin.
  • Bin limits progress geometrically in accordance with the growth factor bin_growth provided during construction. Unused device allocations within a larger bin cache are not reused for allocation requests that categorize to smaller bin sizes.
  • Allocation requests below (bin_growth ^ min_bin) are rounded up to (bin_growth ^ min_bin).
  • Allocations above (bin_growth ^ max_bin) are not rounded up to the nearest bin and are simply freed when they are deallocated instead of being returned to a bin-cache.
  • If the total storage of cached allocations on a given device will exceed max_cached_bytes, allocations for that device are simply freed when they are deallocated instead of being returned to their bin-cache.
For example, the default-constructed CachingDeviceAllocator is configured with:
  • bin_growth = 8
  • min_bin = 3
  • max_bin = 7
  • max_cached_bytes = 6MB - 1B
which delineates five bin-sizes (512B, 4KB, 32KB, 256KB, and 2MB) and sets a maximum of 6,291,455 cached bytes per device. A minimal usage sketch is shown below.
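
The following is an illustrative usage sketch, not part of CachingDeviceAllocator.h; the include path and the chosen bin parameters (bin_growth = 2, min_bin = 8, max_bin = 20) are assumptions made for the example. It constructs an allocator, performs one stream-associated allocate/free cycle, and finally drops the cache.

#include "CachingDeviceAllocator.h"  // include path assumed
#include <cuda_runtime.h>

int main() {
  // bin_growth = 2, min_bin = 8 (256 B), max_bin = 20 (1 MB), cache at most 32 MB per device
  notcub::CachingDeviceAllocator allocator(2, 8, 20, 32 * 1024 * 1024);

  cudaStream_t stream;
  if (cudaStreamCreate(&stream) != cudaSuccess)
    return 1;

  void *d_buf = nullptr;
  // Request 1000 bytes on the current device; the request is rounded up to the 1024 B bin
  if (allocator.DeviceAllocate(&d_buf, 1000, stream) != cudaSuccess)
    return 1;

  // ... enqueue kernels on `stream` that use d_buf ...

  // Return the block: it is immediately reusable on `stream`, and reusable on other
  // streams once the recorded ready event has completed
  allocator.DeviceFree(d_buf);

  allocator.FreeAllCached();  // optional; the destructor does this unless skip_cleanup is set
  cudaStreamDestroy(stream);
  return 0;
}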

Definition at line 100 of file CachingDeviceAllocator.h.

Member Typedef Documentation

◆ BusyBlocks

typedef std::multiset< BlockDescriptor, Compare > notcub::CachingDeviceAllocator::BusyBlocks

Set type for live blocks (ordered by ptr)

Definition at line 178 of file CachingDeviceAllocator.h.

◆ CachedBlocks

typedef std::multiset< BlockDescriptor, Compare > notcub::CachingDeviceAllocator::CachedBlocks

Set type for cached blocks (ordered by size)

Definition at line 175 of file CachingDeviceAllocator.h.

◆ Compare

typedef bool(* notcub::CachingDeviceAllocator::Compare) (const BlockDescriptor &, const BlockDescriptor &)

BlockDescriptor comparator function interface.

Definition at line 170 of file CachingDeviceAllocator.h.

◆ GpuCachedBytes

using notcub::CachingDeviceAllocator::GpuCachedBytes = cms::cuda::allocator::GpuCachedBytes

Map type of device ordinals to the number of bytes cached by each device.

Definition at line 182 of file CachingDeviceAllocator.h.

Constructor & Destructor Documentation

◆ CachingDeviceAllocator() [1/2]

notcub::CachingDeviceAllocator::CachingDeviceAllocator ( unsigned int  bin_growth,
unsigned int  min_bin = 1,
unsigned int  max_bin = INVALID_BIN,
size_t  max_cached_bytes = INVALID_SIZE,
bool  skip_cleanup = false,
bool  debug = false 
)
inline

Constructor.

Parameters
bin_growth  Geometric growth factor for bin-sizes
min_bin  Minimum bin (default is bin_growth ^ 1)
max_bin  Maximum bin (default is no max bin)
max_cached_bytes  Maximum aggregate cached bytes per device (default is no limit)
skip_cleanup  Whether or not to skip a call to FreeAllCached() when the destructor is called (default is to deallocate)
debug  Whether or not to print (de)allocation events to stdout (default is no debug output)

Definition at line 255 of file CachingDeviceAllocator.h.


◆ CachingDeviceAllocator() [2/2]

notcub::CachingDeviceAllocator::CachingDeviceAllocator ( bool  skip_cleanup = false,
bool  debug = false 
)
inline

Default constructor.

Configured with:

  • bin_growth = 8
  • min_bin = 3
  • max_bin = 7
  • max_cached_bytes = ((bin_growth ^ max_bin) * 3) - 1 = 6,291,455 bytes

which delineates five bin-sizes (512B, 4KB, 32KB, 256KB, and 2MB) and sets a maximum of 6,291,455 cached bytes per device; the arithmetic is spelled out in the sketch below.
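
The standalone sketch below (not library code) reproduces the arithmetic behind these defaults: bin sizes are bin_growth^bin for bin = min_bin .. max_bin, and the default cache ceiling is (bin_growth ^ max_bin) * 3 - 1.

#include <cstdio>

int main() {
  const unsigned int bin_growth = 8, min_bin = 3, max_bin = 7;

  unsigned long long bytes = 1;
  for (unsigned int b = 0; b <= max_bin; ++b) {
    if (b >= min_bin)
      std::printf("bin %u -> %llu bytes\n", b, bytes);  // 512, 4096, 32768, 262144, 2097152
    bytes *= bin_growth;
  }

  unsigned long long max_bin_bytes = 1;
  for (unsigned int b = 0; b < max_bin; ++b)
    max_bin_bytes *= bin_growth;  // 8^7 = 2,097,152
  std::printf("max_cached_bytes = %llu\n", max_bin_bytes * 3 - 1);  // 6,291,455
  return 0;
}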

Definition at line 287 of file CachingDeviceAllocator.h.


◆ ~CachingDeviceAllocator()

notcub::CachingDeviceAllocator::~CachingDeviceAllocator ( )
inline

Destructor.

Definition at line 737 of file CachingDeviceAllocator.h.

References FreeAllCached(), and skip_cleanup.

737  {
738  if (!skip_cleanup)
739  FreeAllCached();
740  }

Member Function Documentation

◆ CacheStatus()

GpuCachedBytes notcub::CachingDeviceAllocator::CacheStatus ( ) const
inline

Definition at line 728 of file CachingDeviceAllocator.h.

References cached_bytes, and mutex.

Referenced by cms::cuda::deviceAllocatorStatus().

728  {
729  std::unique_lock mutex_locker(mutex);
730  return cached_bytes;
731  }
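
The short sketch below shows how the returned map might be inspected; it assumes the mapped type exposes the free and live counters that appear in the code listings on this page.

#include <cstdio>

void printCacheStatus(const notcub::CachingDeviceAllocator &allocator) {
  const auto status = allocator.CacheStatus();
  for (const auto &entry : status) {
    // entry.first: device ordinal; entry.second: per-device byte totals (fields assumed)
    std::printf("device %d: %zu bytes cached (free), %zu bytes live\n",
                entry.first,
                entry.second.free,
                entry.second.live);
  }
}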

◆ DeviceAllocate() [1/2]

cudaError_t notcub::CachingDeviceAllocator::DeviceAllocate ( int  device,
void **  d_ptr,
size_t  bytes,
cudaStream_t  active_stream = nullptr 
)
inline

Provides a suitable allocation of device memory for the given size on the specified device.

Once freed, the allocation becomes available immediately for reuse within the active_stream with which it was associated during allocation, and it becomes available for reuse within other streams when all prior work submitted to active_stream has completed.

Parameters
[in]  device  Device on which to place the allocation
[out]  d_ptr  Reference to pointer to the allocation
[in]  bytes  Minimum number of bytes for the allocation
[in]  active_stream  The stream to be associated with this allocation
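
To make the reuse rule concrete, here is a hedged sketch; an existing allocator, streamA, and streamB are assumed, and return codes are ignored for brevity.

void *p1 = nullptr;
allocator.DeviceAllocate(0, &p1, 1024, streamA);  // new or cached block on device 0
allocator.DeviceFree(0, p1);                      // records ready_event on streamA

void *p2 = nullptr;
// Same stream: the block just freed can be handed back immediately, even if
// kernels previously enqueued on streamA are still running.
allocator.DeviceAllocate(0, &p2, 1024, streamA);

void *p3 = nullptr;
// Different stream: the cached block is only considered once its ready_event
// has completed (cudaEventQuery != cudaErrorNotReady); otherwise a new block
// is allocated with cudaMalloc.
allocator.DeviceAllocate(0, &p3, 1024, streamB);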

Definition at line 331 of file CachingDeviceAllocator.h.

References notcub::CachingDeviceAllocator::BlockDescriptor::associated_stream, notcub::CachingDeviceAllocator::BlockDescriptor::bin, bin_growth, notcub::CachingDeviceAllocator::BlockDescriptor::bytes, notcub::CachingDeviceAllocator::BlockDescriptor::bytesRequested, cached_blocks, cached_bytes, cudaCheck, notcub::CachingDeviceAllocator::BlockDescriptor::d_ptr, debug, INVALID_BIN, INVALID_DEVICE_ORDINAL, live_blocks, max_bin, min_bin, min_bin_bytes, mutex, NearestPowerOf(), and notcub::CachingDeviceAllocator::BlockDescriptor::ready_event.

Referenced by DeviceAllocate().

336  {
337  // CMS: use RAII instead of (un)locking explicitly
338  std::unique_lock<std::mutex> mutex_locker(mutex, std::defer_lock);
339  *d_ptr = nullptr;
340  int entrypoint_device = INVALID_DEVICE_ORDINAL;
341  cudaError_t error = cudaSuccess;
342 
343  if (device == INVALID_DEVICE_ORDINAL) {
344  // CMS: throw exception on error
345  cudaCheck(error = cudaGetDevice(&entrypoint_device));
346  device = entrypoint_device;
347  }
348 
349  // Create a block descriptor for the requested allocation
350  bool found = false;
351  BlockDescriptor search_key(device);
352  search_key.bytesRequested = bytes; // CMS
353  search_key.associated_stream = active_stream;
354  NearestPowerOf(search_key.bin, search_key.bytes, bin_growth, bytes);
355 
356  if (search_key.bin > max_bin) {
357  // Bin is greater than our maximum bin: allocate the request
358  // exactly and give out-of-bounds bin. It will not be cached
359  // for reuse when returned.
360  search_key.bin = INVALID_BIN;
361  search_key.bytes = bytes;
362  } else {
363  // Search for a suitable cached allocation: lock
364  mutex_locker.lock();
365 
366  if (search_key.bin < min_bin) {
367  // Bin is less than minimum bin: round up
368  search_key.bin = min_bin;
369  search_key.bytes = min_bin_bytes;
370  }
371 
372  // Iterate through the range of cached blocks on the same device in the same bin
373  CachedBlocks::iterator block_itr = cached_blocks.lower_bound(search_key);
374  while ((block_itr != cached_blocks.end()) && (block_itr->device == device) &&
375  (block_itr->bin == search_key.bin)) {
376  // To prevent races with reusing blocks returned by the host but still
377  // in use by the device, only consider cached blocks that are
378  // either (from the active stream) or (from an idle stream)
379  if ((active_stream == block_itr->associated_stream) ||
380  (cudaEventQuery(block_itr->ready_event) != cudaErrorNotReady)) {
381  // Reuse existing cache block. Insert into live blocks.
382  found = true;
383  search_key = *block_itr;
384  search_key.associated_stream = active_stream;
385  live_blocks.insert(search_key);
386 
387  // Remove from free blocks
388  cached_bytes[device].free -= search_key.bytes;
389  cached_bytes[device].live += search_key.bytes;
390  cached_bytes[device].liveRequested += search_key.bytesRequested; // CMS
391 
392  if (debug)
393  // CMS: improved debug message
394  // CMS: use raw printf
395  printf(
396  "\tDevice %d reused cached block at %p (%lld bytes) for stream %lld, event %lld (previously "
397  "associated with stream %lld, event %lld).\n",
398  device,
399  search_key.d_ptr,
400  (long long)search_key.bytes,
401  (long long)search_key.associated_stream,
402  (long long)search_key.ready_event,
403  (long long)block_itr->associated_stream,
404  (long long)block_itr->ready_event);
405 
406  cached_blocks.erase(block_itr);
407 
408  break;
409  }
410  block_itr++;
411  }
412 
413  // Done searching: unlock
414  mutex_locker.unlock();
415  }
416 
417  // Allocate the block if necessary
418  if (!found) {
419  // Set runtime's current device to specified device (entrypoint may not be set)
420  if (device != entrypoint_device) {
421  // CMS: throw exception on error
422  cudaCheck(error = cudaGetDevice(&entrypoint_device));
423  cudaCheck(error = cudaSetDevice(device));
424  }
425 
426  // Attempt to allocate
427  // CMS: silently ignore errors and retry or pass them to the caller
428  if ((error = cudaMalloc(&search_key.d_ptr, search_key.bytes)) == cudaErrorMemoryAllocation) {
429  // The allocation attempt failed: free all cached blocks on device and retry
430  if (debug)
431  // CMS: use raw printf
432  printf(
433  "\tDevice %d failed to allocate %lld bytes for stream %lld, retrying after freeing cached allocations",
434  device,
435  (long long)search_key.bytes,
436  (long long)search_key.associated_stream);
437 
438  error = cudaSuccess; // Reset the error we will return
439  cudaGetLastError(); // Reset CUDART's error
440 
441  // Lock
442  mutex_locker.lock();
443 
444  // Iterate the range of free blocks on the same device
445  BlockDescriptor free_key(device);
446  CachedBlocks::iterator block_itr = cached_blocks.lower_bound(free_key);
447 
448  while ((block_itr != cached_blocks.end()) && (block_itr->device == device)) {
449  // No need to worry about synchronization with the device: cudaFree is
450  // blocking and will synchronize across all kernels executing
451  // on the current device
452 
453  // Free device memory and destroy stream event.
454  // CMS: silently ignore errors and pass them to the caller
455  if ((error = cudaFree(block_itr->d_ptr)))
456  break;
457  if ((error = cudaEventDestroy(block_itr->ready_event)))
458  break;
459 
460  // Reduce balance and erase entry
461  cached_bytes[device].free -= block_itr->bytes;
462 
463  if (debug)
464  // CMS: use raw printf
465  printf(
466  "\tDevice %d freed %lld bytes.\n\t\t %lld available blocks cached (%lld bytes), %lld live blocks "
467  "(%lld bytes) outstanding.\n",
468  device,
469  (long long)block_itr->bytes,
470  (long long)cached_blocks.size(),
471  (long long)cached_bytes[device].free,
472  (long long)live_blocks.size(),
473  (long long)cached_bytes[device].live);
474 
475  cached_blocks.erase(block_itr);
476 
477  block_itr++;
478  }
479 
480  // Unlock
481  mutex_locker.unlock();
482 
483  // Return under error
484  if (error)
485  return error;
486 
487  // Try to allocate again
488  // CMS: throw exception on error
489  cudaCheck(error = cudaMalloc(&search_key.d_ptr, search_key.bytes));
490  }
491 
492  // Create ready event
493  // CMS: throw exception on error
494  cudaCheck(error = cudaEventCreateWithFlags(&search_key.ready_event, cudaEventDisableTiming));
495 
496  // Insert into live blocks
497  mutex_locker.lock();
498  live_blocks.insert(search_key);
499  cached_bytes[device].live += search_key.bytes;
500  cached_bytes[device].liveRequested += search_key.bytesRequested; // CMS
501  mutex_locker.unlock();
502 
503  if (debug)
504  // CMS: improved debug message
505  // CMS: use raw printf
506  printf("\tDevice %d allocated new device block at %p (%lld bytes associated with stream %lld, event %lld).\n",
507  device,
508  search_key.d_ptr,
509  (long long)search_key.bytes,
510  (long long)search_key.associated_stream,
511  (long long)search_key.ready_event);
512 
513  // Attempt to revert back to previous device if necessary
514  if ((entrypoint_device != INVALID_DEVICE_ORDINAL) && (entrypoint_device != device)) {
515  // CMS: throw exception on error
516  cudaCheck(error = cudaSetDevice(entrypoint_device));
517  }
518  }
519 
520  // Copy device pointer to output parameter
521  *d_ptr = search_key.d_ptr;
522 
523  if (debug)
524  // CMS: use raw printf
525  printf("\t\t%lld available blocks cached (%lld bytes), %lld live blocks outstanding(%lld bytes).\n",
526  (long long)cached_blocks.size(),
527  (long long)cached_bytes[device].free,
528  (long long)live_blocks.size(),
529  (long long)cached_bytes[device].live);
530 
531  return error;
532  }

◆ DeviceAllocate() [2/2]

cudaError_t notcub::CachingDeviceAllocator::DeviceAllocate ( void **  d_ptr,
size_t  bytes,
cudaStream_t  active_stream = nullptr 
)
inline

Provides a suitable allocation of device memory for the given size on the current device.

Once freed, the allocation becomes available immediately for reuse within the active_stream with which it was associated during allocation, and it becomes available for reuse within other streams when all prior work submitted to active_stream has completed.

Parameters
[out]  d_ptr  Reference to pointer to the allocation
[in]  bytes  Minimum number of bytes for the allocation
[in]  active_stream  The stream to be associated with this allocation

Definition at line 541 of file CachingDeviceAllocator.h.

References DeviceAllocate(), and INVALID_DEVICE_ORDINAL.

545  {
546  return DeviceAllocate(INVALID_DEVICE_ORDINAL, d_ptr, bytes, active_stream);
547  }

◆ DeviceFree() [1/2]

cudaError_t notcub::CachingDeviceAllocator::DeviceFree ( int  device,
void *  d_ptr 
)
inline

Frees a live allocation of device memory on the specified device, returning it to the allocator.

Once freed, the allocation becomes available immediately for reuse within the active_stream with which it was associated during allocation, and it becomes available for reuse within other streams when all prior work submitted to active_stream has completed.

Definition at line 556 of file CachingDeviceAllocator.h.

References notcub::CachingDeviceAllocator::BlockDescriptor::associated_stream, notcub::CachingDeviceAllocator::BlockDescriptor::bin, notcub::CachingDeviceAllocator::BlockDescriptor::bytes, notcub::CachingDeviceAllocator::BlockDescriptor::bytesRequested, cached_blocks, cached_bytes, cudaCheck, debug, INVALID_BIN, INVALID_DEVICE_ORDINAL, live_blocks, max_cached_bytes, mutex, and notcub::CachingDeviceAllocator::BlockDescriptor::ready_event.

556  {
557  int entrypoint_device = INVALID_DEVICE_ORDINAL;
558  cudaError_t error = cudaSuccess;
559  // CMS: use RAII instead of (un)locking explicitly
560  std::unique_lock<std::mutex> mutex_locker(mutex, std::defer_lock);
561 
562  if (device == INVALID_DEVICE_ORDINAL) {
563  // CMS: throw exception on error
564  cudaCheck(error = cudaGetDevice(&entrypoint_device));
565  device = entrypoint_device;
566  }
567 
568  // Lock
569  mutex_locker.lock();
570 
571  // Find corresponding block descriptor
572  bool recached = false;
573  BlockDescriptor search_key(d_ptr, device);
574  BusyBlocks::iterator block_itr = live_blocks.find(search_key);
575  if (block_itr != live_blocks.end()) {
576  // Remove from live blocks
577  search_key = *block_itr;
578  live_blocks.erase(block_itr);
579  cached_bytes[device].live -= search_key.bytes;
580  cached_bytes[device].liveRequested -= search_key.bytesRequested; // CMS
581 
582  // Keep the returned allocation if bin is valid and we won't exceed the max cached threshold
583  if ((search_key.bin != INVALID_BIN) && (cached_bytes[device].free + search_key.bytes <= max_cached_bytes)) {
584  // Insert returned allocation into free blocks
585  recached = true;
586  cached_blocks.insert(search_key);
587  cached_bytes[device].free += search_key.bytes;
588 
589  if (debug)
590  // CMS: improved debug message
591  // CMS: use raw printf
592  printf(
593  "\tDevice %d returned %lld bytes at %p from associated stream %lld, event %lld.\n\t\t %lld available "
594  "blocks cached (%lld bytes), %lld live blocks outstanding. (%lld bytes)\n",
595  device,
596  (long long)search_key.bytes,
597  d_ptr,
598  (long long)search_key.associated_stream,
599  (long long)search_key.ready_event,
600  (long long)cached_blocks.size(),
601  (long long)cached_bytes[device].free,
602  (long long)live_blocks.size(),
603  (long long)cached_bytes[device].live);
604  }
605  }
606 
607  // First set to specified device (entrypoint may not be set)
608  if (device != entrypoint_device) {
609  // CMS: throw exception on error
610  cudaCheck(error = cudaGetDevice(&entrypoint_device));
611  cudaCheck(error = cudaSetDevice(device));
612  }
613 
614  if (recached) {
615  // Insert the ready event in the associated stream (must have current device set properly)
616  // CMS: throw exception on error
617  cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
618  }
619 
620  // Unlock
621  mutex_locker.unlock();
622 
623  if (!recached) {
624  // Free the allocation from the runtime and cleanup the event.
625  // CMS: throw exception on error
626  cudaCheck(error = cudaFree(d_ptr));
627  cudaCheck(error = cudaEventDestroy(search_key.ready_event));
628 
629  if (debug)
630  // CMS: improved debug message
631  printf(
632  "\tDevice %d freed %lld bytes at %p from associated stream %lld, event %lld.\n\t\t %lld available "
633  "blocks cached (%lld bytes), %lld live blocks (%lld bytes) outstanding.\n",
634  device,
635  (long long)search_key.bytes,
636  d_ptr,
637  (long long)search_key.associated_stream,
638  (long long)search_key.ready_event,
639  (long long)cached_blocks.size(),
640  (long long)cached_bytes[device].free,
641  (long long)live_blocks.size(),
642  (long long)cached_bytes[device].live);
643  }
644 
645  // Reset device
646  if ((entrypoint_device != INVALID_DEVICE_ORDINAL) && (entrypoint_device != device)) {
647  // CMS: throw exception on error
648  cudaCheck(error = cudaSetDevice(entrypoint_device));
649  }
650 
651  return error;
652  }

◆ DeviceFree() [2/2]

cudaError_t notcub::CachingDeviceAllocator::DeviceFree ( void *  d_ptr)
inline

Frees a live allocation of device memory on the current device, returning it to the allocator.

Once freed, the allocation becomes available immediately for reuse within the active_stream with which it was associated during allocation, and it becomes available for reuse within other streams when all prior work submitted to active_stream has completed.

Definition at line 661 of file CachingDeviceAllocator.h.

References DeviceFree(), and INVALID_DEVICE_ORDINAL.

Referenced by DeviceFree().

661 { return DeviceFree(INVALID_DEVICE_ORDINAL, d_ptr); }

◆ FreeAllCached()

cudaError_t notcub::CachingDeviceAllocator::FreeAllCached ( )
inline

Frees all cached device allocations on all devices.

Definition at line 666 of file CachingDeviceAllocator.h.

References cached_blocks, cached_bytes, cudaCheck, debug, INVALID_DEVICE_ORDINAL, live_blocks, and mutex.

Referenced by cms::cuda::allocator::cachingAllocatorsFreeCached(), and ~CachingDeviceAllocator().

666  {
667  cudaError_t error = cudaSuccess;
668  int entrypoint_device = INVALID_DEVICE_ORDINAL;
669  int current_device = INVALID_DEVICE_ORDINAL;
670  // CMS: use RAII instead of (un)locking explicitly
671  std::unique_lock<std::mutex> mutex_locker(mutex);
672 
673  while (!cached_blocks.empty()) {
674  // Get first block
675  CachedBlocks::iterator begin = cached_blocks.begin();
676 
677  // Get entry-point device ordinal if necessary
678  if (entrypoint_device == INVALID_DEVICE_ORDINAL) {
679  // CMS: silently ignore errors and pass them to the caller
680  if ((error = cudaGetDevice(&entrypoint_device)))
681  break;
682  }
683 
684  // Set current device ordinal if necessary
685  if (begin->device != current_device) {
686  // CMS: silently ignore errors and pass them to the caller
687  if ((error = cudaSetDevice(begin->device)))
688  break;
689  current_device = begin->device;
690  }
691 
692  // Free device memory
693  // CMS: silently ignore errors and pass them to the caller
694  if ((error = cudaFree(begin->d_ptr)))
695  break;
696  if ((error = cudaEventDestroy(begin->ready_event)))
697  break;
698 
699  // Reduce balance and erase entry
700  cached_bytes[current_device].free -= begin->bytes;
701 
702  if (debug)
703  printf(
704  "\tDevice %d freed %lld bytes.\n\t\t %lld available blocks cached (%lld bytes), %lld live blocks (%lld "
705  "bytes) outstanding.\n",
706  current_device,
707  (long long)begin->bytes,
708  (long long)cached_blocks.size(),
709  (long long)cached_bytes[current_device].free,
710  (long long)live_blocks.size(),
711  (long long)cached_bytes[current_device].live);
712 
713  cached_blocks.erase(begin);
714  }
715 
716  mutex_locker.unlock();
717 
718  // Attempt to revert back to entry-point device if necessary
719  if (entrypoint_device != INVALID_DEVICE_ORDINAL) {
720  // CMS: throw exception on error
721  cudaCheck(error = cudaSetDevice(entrypoint_device));
722  }
723 
724  return error;
725  }

◆ IntPow()

static unsigned int notcub::CachingDeviceAllocator::IntPow ( unsigned int  base,
unsigned int  exp 
)
inlinestatic

Integer pow function for unsigned base and exponent

Definition at line 191 of file CachingDeviceAllocator.h.

Referenced by cms::cuda::allocator::getCachingDeviceAllocator(), and cms::cuda::allocator::getCachingHostAllocator().

191  {
192  unsigned int retval = 1;
193  while (exp > 0) {
194  if (exp & 1) {
195  retval = retval * base; // multiply the result by the current base
196  }
197  base = base * base; // square the base
198  exp = exp >> 1; // divide the exponent in half
199  }
200  return retval;
201  }

◆ NearestPowerOf()

void notcub::CachingDeviceAllocator::NearestPowerOf ( unsigned int &  power,
size_t &  rounded_bytes,
unsigned int  base,
size_t  value 
)
inline

Rounds up to the nearest power of base, returning the exponent in power and the rounded byte count in rounded_bytes.

Definition at line 206 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate().

206  {
207  power = 0;
208  rounded_bytes = 1;
209 
210  if (value * base < value) {
211  // Overflow
212  power = sizeof(size_t) * 8;
213  rounded_bytes = size_t(0) - 1;
214  return;
215  }
216 
217  while (rounded_bytes < value) {
218  rounded_bytes *= base;
219  power++;
220  }
221  }
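
Worked example of the rounding, with the values traced from the loop above (an existing allocator instance is assumed):

unsigned int power = 0;
size_t rounded_bytes = 0;
// base 8, request of 1000 bytes: 1 -> 8 -> 64 -> 512 -> 4096, so power == 4, rounded_bytes == 4096
allocator.NearestPowerOf(power, rounded_bytes, 8, 1000);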

◆ SetMaxCachedBytes()

cudaError_t notcub::CachingDeviceAllocator::SetMaxCachedBytes ( size_t  max_cached_bytes)
inline

Sets the limit on the number of bytes this allocator is allowed to cache per device.

Changing the ceiling of cached bytes does not cause any allocations (in-use or cached-in-reserve) to be freed. See FreeAllCached().
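
For example, a hedged sketch of shrinking the cache (an existing allocator is assumed): lowering the ceiling only affects future frees, so already-cached blocks are dropped explicitly.

allocator.SetMaxCachedBytes(16 * 1024 * 1024);  // new per-device ceiling for future frees
allocator.FreeAllCached();                      // explicitly release blocks cached under the old limit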

Definition at line 305 of file CachingDeviceAllocator.h.

References debug, max_cached_bytes, and mutex.

305  {
306  // Lock
307  // CMS: use RAII instead of (un)locking explicitly
308  std::unique_lock mutex_locker(mutex);
309 
310  if (debug)
311  // CMS: use raw printf
312  printf("Changing max_cached_bytes (%lld -> %lld)\n",
313  (long long)this->max_cached_bytes,
314  (long long)max_cached_bytes);
315 
316  this->max_cached_bytes = max_cached_bytes;
317 
318  // Unlock (redundant, kept for style uniformity)
319  mutex_locker.unlock();
320 
321  return cudaSuccess;
322  }

Member Data Documentation

◆ bin_growth

unsigned int notcub::CachingDeviceAllocator::bin_growth

Geometric growth factor for bin-sizes.

Definition at line 230 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate().

◆ cached_blocks

CachedBlocks notcub::CachingDeviceAllocator::cached_blocks

Set of cached device allocations available for reuse.

Definition at line 243 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate(), DeviceFree(), and FreeAllCached().

◆ cached_bytes

GpuCachedBytes notcub::CachingDeviceAllocator::cached_bytes

Map of device ordinal to aggregate cached bytes on that device.

Definition at line 242 of file CachingDeviceAllocator.h.

Referenced by CacheStatus(), DeviceAllocate(), DeviceFree(), and FreeAllCached().

◆ debug

bool notcub::CachingDeviceAllocator::debug

Whether or not to print (de)allocation events to stdout.

Definition at line 240 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate(), DeviceFree(), FreeAllCached(), and SetMaxCachedBytes().

◆ INVALID_BIN

const unsigned int notcub::CachingDeviceAllocator::INVALID_BIN = (unsigned int)-1
static

Out-of-bounds bin.

Definition at line 106 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate(), and DeviceFree().

◆ INVALID_DEVICE_ORDINAL

const int notcub::CachingDeviceAllocator::INVALID_DEVICE_ORDINAL = -1
static

Invalid device ordinal.

Definition at line 114 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate(), DeviceFree(), and FreeAllCached().

◆ INVALID_SIZE

const size_t notcub::CachingDeviceAllocator::INVALID_SIZE = (size_t)-1
static

Invalid size.

Definition at line 109 of file CachingDeviceAllocator.h.

◆ live_blocks

BusyBlocks notcub::CachingDeviceAllocator::live_blocks

Set of live device allocations currently in use.

Definition at line 244 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate(), DeviceFree(), and FreeAllCached().

◆ max_bin

unsigned int notcub::CachingDeviceAllocator::max_bin

Maximum bin enumeration.

Definition at line 232 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate().

◆ max_bin_bytes

size_t notcub::CachingDeviceAllocator::max_bin_bytes

Maximum bin size.

Definition at line 235 of file CachingDeviceAllocator.h.

◆ max_cached_bytes

size_t notcub::CachingDeviceAllocator::max_cached_bytes

Maximum aggregate cached bytes per device.

Definition at line 236 of file CachingDeviceAllocator.h.

Referenced by DeviceFree(), and SetMaxCachedBytes().

◆ min_bin

unsigned int notcub::CachingDeviceAllocator::min_bin

Minimum bin enumeration.

Definition at line 231 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate().

◆ min_bin_bytes

size_t notcub::CachingDeviceAllocator::min_bin_bytes

Minimum bin size.

Definition at line 234 of file CachingDeviceAllocator.h.

Referenced by DeviceAllocate().

◆ mutex

std::mutex notcub::CachingDeviceAllocator::mutex
mutable

Mutex for thread-safety.

◆ skip_cleanup

const bool notcub::CachingDeviceAllocator::skip_cleanup

Whether or not to skip a call to FreeAllCached() when the destructor is called. (The CUDA runtime may have already shut down for statically declared allocators)

Definition at line 239 of file CachingDeviceAllocator.h.

Referenced by ~CachingDeviceAllocator().