TensorFlow优化

2019/11/23 posted in  TensorFlow

这段时间,工作当中杂事比较多,有点像充当产品经理,去给算法业务的同学去安利我们最近完成的一些东西,感觉自己几乎没啥提升,希望这样的日子快点过去,公众号也落下了,主要原因是最近事比较多,又加上还迷上了抖音 什么东北酱,周末一躺就过去了。加上最近网上暴力被裁事情、明星猝死,突然发现中年危机可能就要在眼前,作为最老一批90年后,也马上要30了,但是觉得荒废下去好像也是蛮爽的。哎,迷茫而惆怅的中年危机恐怕是真的要到来了。回到正题,好久没有写过技术文章,最近又开始在读XDL的代码,这次我重点关注在底层,真的觉得质量很高,无愧是阿里妈妈吗的工作,但是无奈基础太差,又加上XDL除了代码开头的license生命几乎没有一行注释, 这里如果有XDL的小伙伴看到,希望把注释补一补,东西是不错,但是也要考虑下我们的水平吧。

本章文章是看到XDL代码之后,发现它自己写了自己的内存管理,这块我之前完全不懂,google一波才明白一些,而TensorFlow本身其实也有一些memory allocation的工作,本章会详细说明下BFC这块的工作。

TensorFlow Memory Manage

BFC是Best fit with coalescing的缩写,下面截图是TensorFlow源码config.proto,其中有关于内存、显存管理的注释文档,BFC是dlmalloc的一个简化版本,DLmalloc是啥,大家不明白没关系,直接理解为内存分配上一个特别硬核的工作,当前很多内存分配的算法很多受其影响。【memory allocate是操作系统下特别复杂的一项工作,如要系统了解,建议读读下OS经典教材memoey manage的几章】。

TensorFlow Memory Allocator

BFCAllocator

TensorFlow针对不同的设备,比如cpu下的mkl的allocator,cuda下的allocator,其实现都是有很大的差异的。本文不可能一一详述,仅针对bfc_allocator相关的逻辑进行描述,由于作者在这块经验较少,有任何发现疑问以及问题的请在评论区留言,大家来一起讨论

Allocator: Abstract interface for allocating and deallocating device memory

class Allocator {
 public:
  static constexpr size_t kAllocatorAlignment = 64;

  virtual ~Allocator();

  virtual string Name() = 0;

  virtual void* AllocateRaw(size_t alignment, size_t num_bytes) = 0;

  virtual void* AllocateRaw(size_t alignment, size_t num_bytes,
                            const AllocationAttributes& allocation_attr) {
    return AllocateRaw(alignment, num_bytes);
  }

  virtual void DeallocateRaw(void* ptr) = 0;

  virtual size_t RequestedSize(const void* ptr) const {
    CHECK(false) << "allocator doesn't track sizes";
    return size_t(0);
  }

  virtual size_t AllocatedSize(const void* ptr) const {
    return RequestedSize(ptr);
  }

  virtual int64 AllocationId(const void* ptr) const { return 0; }
  virtual size_t AllocatedSizeSlow(const void* ptr) const {
    if (TracksAllocationSizes()) {
      return AllocatedSize(ptr);
    }
    return 0;
  }

  virtual absl::optional<AllocatorStats> GetStats() { return absl::nullopt; }

  virtual void ClearStats() {}

  virtual void SetSafeFrontier(uint64 count) {}
};



struct AllocationAttributes {
  AllocationAttributes() = default;

  AllocationAttributes(bool no_retry_on_failure, bool allocation_will_be_logged,
                       std::function<uint64()>* freed_by_func)
      : no_retry_on_failure(no_retry_on_failure),
        allocation_will_be_logged(allocation_will_be_logged),
        freed_by_func(freed_by_func) {}

  bool no_retry_on_failure = false;
  bool allocation_will_be_logged = false;
  std::function<uint64()>* freed_by_func = nullptr;  // Not owned.

  TF_DISALLOW_COPY_AND_ASSIGN(AllocationAttributes);
};


struct AllocatorStats {
  int64 num_allocs;          // Number of allocations.
  int64 bytes_in_use;        // Number of bytes in use.
  int64 peak_bytes_in_use;   // The peak bytes in use.
  int64 largest_alloc_size;  // The largest single allocation seen.

  absl::optional<int64> bytes_limit;

  int64 bytes_reserved;       // Number of bytes reserved.
  int64 peak_bytes_reserved;  // The peak number of bytes reserved.
  absl::optional<int64> bytes_reservable_limit;

  AllocatorStats()
      : num_allocs(0),
        bytes_in_use(0),
        peak_bytes_in_use(0),
        largest_alloc_size(0),
        bytes_reserved(0),
        peak_bytes_reserved(0) {}

  string DebugString() const;
};

TensorFlow 内存分配与回收的抽象接口,封装Name, AllocateRaw, DellocateRaw, TracksAllocationSize, AllocatesOpaqueHandle, RequestedSize, AllocatedSize, AllocationId, AllocatedSizeSlow, GetStats, ClearStats, SetSafeFrontier

这些逻辑作为父类的纯虚接口,由子类去实现,BFCAllocator的详细接口信息如下:

在此之前,BFCAllocator下的两个比较重要的数据结构, Chunk和Bin,两者之间的关系如下图,看起来像一个个糖葫芦,第一个bin size为256<<1, 第二个为256<<2, 一次类推,TF内有21个bin,最后bin size 为256 << 21为512MB,每一个bin下面会接下若干个大于bin size的chunk,整个内存空间由以下的结构来组织,当分配内存大小指定时,系统会遍历bin,找到能够第一次满足chunk > bin_size,每一个bin下的chunk是有序的(Bin下的ChunkComparator)

Chunk

struct Chunk {
    size_t size = 0;
    size_t requested_size = 0;
    int64 allocation_id = -1;
    void* ptr = nullptr;  // pointer to granted subbuffer.
    ChunkHandle prev = kInvalidChunkHandle;
    ChunkHandle next = kInvalidChunkHandle;
    BinNum bin_num = kInvalidBinNum;

    // Optional count when this chunk was most recently made free.
    uint64 freed_at_count = 0;
    bool in_use() const { return allocation_id != -1; }
    string DebugString(BFCAllocator* a,
                       bool recurse) NO_THREAD_SAFETY_ANALYSIS {
      string dbg;
      strings::StrAppend(
          &dbg, "  Size: ", strings::HumanReadableNumBytes(size),
          " | Requested Size: ", strings::HumanReadableNumBytes(requested_size),
          " | in_use: ", in_use(), " | bin_num: ", bin_num);
      if (recurse && prev != BFCAllocator::kInvalidChunkHandle) {
        Chunk* p = a->ChunkFromHandle(prev);
        strings::StrAppend(&dbg, ", prev: ", p->DebugString(a, false));
      }
      if (recurse && next != BFCAllocator::kInvalidChunkHandle) {
        Chunk* n = a->ChunkFromHandle(next);
        strings::StrAppend(&dbg, ", next: ", n->DebugString(a, false));
      }
      return dbg;
    }
  };

Bin

struct Bin {
    // All chunks in this bin have >= bin_size memory.
    size_t bin_size = 0;

    class ChunkComparator {
     public:
      explicit ChunkComparator(BFCAllocator* allocator)
          : allocator_(allocator) {}
      bool operator()(const ChunkHandle ha,
                      const ChunkHandle hb) const NO_THREAD_SAFETY_ANALYSIS {
        const Chunk* a = allocator_->ChunkFromHandle(ha);
        const Chunk* b = allocator_->ChunkFromHandle(hb);
        if (a->size != b->size) {
          return a->size < b->size;
        }
        return a->ptr < b->ptr;
      }

     private:
      BFCAllocator* allocator_;  // The parent allocator
    };

    typedef std::set<ChunkHandle, ChunkComparator> FreeChunkSet;

    FreeChunkSet free_chunks;
    Bin(BFCAllocator* allocator, size_t bs)
        : bin_size(bs), free_chunks(ChunkComparator(allocator)) {}
  };

分配内存

  • rounded_bytes: 保证内存对齐;

  • BinNumForSize(rounded_bytes):找到对应的BinNum;

  • MergeTimestampedChunks: 如果timestamped_chunks_不为空, (required_bytes==0,这里还不是特别理解, 有理解清楚的可以在文章后面评论),则合并;

  • FindChunkPtr:

    • 找到第一个满足rounded_bytes的bin;
    • 从free_chunks中删除大于rounded_bytes的chunk, 从free_chunks移除;
    • 若chunk大小为rounded_bytes的两倍,或者chunk大小比rounded_bytes 大128mb以上, 会将chunk split成满足rounded_bytes和剩余大小的chunk, 然后将后者插入合适bin的free_chunks;
  • 如果FindChunkPtr没找到合适的chunk,则尝试Extend

    • 如果available_bytes<rounded_bytes,则返回false;
    • 如果当前curr_region_allocation_bytes_小于rounded_bytes,则curr_region_allocation_bytes翻倍,直到满足大于rounded_bytes;
    • 调用sub_allocator来分配内存块,分配bytes大小为min(rounded_bytes, curr_region_allocation_bytes), 若未能成功分配,则一直尝试分配0.9*bytes, 若最后也未分配成功,则extend失败;
    • 分配好内存块之后,创建对应的chunk,这个chunk里保存了,内存块地址等信息,并将chunk插入到对应bin;
  • 如果Extend也fail了, 则再次尝试MergeTimestampedChunks来是否满足round_bytes, (这里会聚合最近释放的内存块,直到满足rounded_bytes);

  • 若再次MergeTimestampedChunks之后还是无法分配合适的内存块,系统会再次尝试释放已经free的regions,然后尝试extend来满足是否能满足分配rounded_Bytes, 如还是fail,则返回空指针,Allocate 失败;

      void* BFCAllocator::AllocateRawInternal(size_t unused_alignment,
                                              size_t num_bytes,
                                              bool dump_log_on_failure,
                                              uint64 freed_before) {
        if (num_bytes == 0) {
          VLOG(2) << "tried to allocate 0 bytes";
          return nullptr;
        }
    
        size_t rounded_bytes = RoundedBytes(num_bytes);
        BinNum bin_num = BinNumForSize(rounded_bytes);
      
        mutex_lock l(lock_);
        if (!timestamped_chunks_.empty()) {
          MergeTimestampedChunks(0);
        }
        void* ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
        if (ptr != nullptr) {
          return ptr;
        }
      
        if (Extend(unused_alignment, rounded_bytes)) {
          ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
          if (ptr != nullptr) {
            return ptr;
          }
        }
      
        if ((freed_before == 0) && (!timestamped_chunks_.empty())) {
      
          if (MergeTimestampedChunks(rounded_bytes)) {
            ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
            if (ptr != nullptr) {
              return ptr;
            }
          }
        }
    
        if (DeallocateFreeRegions(rounded_bytes) &&
            Extend(unused_alignment, rounded_bytes)) {
          ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
          if (ptr != nullptr) {
            return ptr;
          }
        }
    

释放内存

  • 首先,判断ptr是否为空指针,如是,则不作后续操作;

  • region_manager_.get_handle(ptr) 找到对应chunkhandle, 然后将handle标记为free,具体有application_id赋值为-1, 若timing_counter_为true,则记录释放内存时间;

  • 调用InsertFreeChunkIntoBin,将已标记为free的chunk插入到合适的Bin中,释放内存完成;

      void BFCAllocator::DeallocateRawInternal(void* ptr) {
        if (ptr == nullptr) {
          VLOG(2) << "tried to deallocate nullptr";
          return;
        }
        mutex_lock l(lock_);
      
        // Find the chunk from the ptr.
        BFCAllocator::ChunkHandle h = region_manager_.get_handle(ptr);
        CHECK(h != kInvalidChunkHandle);
      
        MarkFree(h);
      
        // Consider coalescing it.
        if (timing_counter_) {
          InsertFreeChunkIntoBin(h);
          timestamped_chunks_.push_back(h);
        } else {
          InsertFreeChunkIntoBin(TryToCoalesce(h, false));
        }
      
        if (VLOG_IS_ON(4)) {
          LOG(INFO) << "F: " << RenderOccupancy();
        }
      }
    

读源码的方法

我发现看懂了一部分代码之后,再写出来就显得很简单,这个对于读者不是什么好事情,很多小伙伴很有心地去阅读某些框架的底层源码,但是很多时候理解不了,下面我分享给小伙伴几个方法:

  • 专项学习某些知识点,比如你看的源码部分是memory manager,你可以去找一些国内外教材、教学视频, 先理解清楚其中概念;
  • 至少完整看两遍代码:第一遍,不要嫌麻烦,将每个类的所有的成员函数、成员变量画出来,更给力的是讲类间的关系引用也画出来;第二遍,再逐个函数把逻辑理清楚;
  • 看完之后建议写篇文章来描述下,以后忘记了再稍微看看能搞清楚;

感兴趣的家庭作业

XDL代码里面有其他不同的allocator如buddy_allocator, slab_allocator, slab_buddy_allocator,如果有人感兴趣可以按照我前面提的方法来理解下这几块内容,大家一起学习、讨论。