leveldb skiplist

Data Structure

skiplist

leveldb memtable需要满足以下几种要求：

get 足够快
write 足够快
iterator 足够快 (compact to sst file by ordering)

那么skiplist作为一个同时兼顾list和tree属性的data structure，是非常好的选择。

Concurrency

concurrency

leveldb skiplist 通过采用write lock和memory barrier的方式，保证读读，读写操作不会冲突，提高并发能力。

Implementation Details

skiplist doc:

// Thread safety
// -------------
//
// Writes require external synchronization, most likely a mutex.
// Reads require a guarantee that the SkipList will not be destroyed
// while the read is in progress.  Apart from that, reads progress
// without any internal locking or synchronization.
//
// Invariants:
//
// (1) Allocated nodes are never deleted until the SkipList is
// destroyed.  This is trivially guaranteed by the code since we
// never delete any skip list nodes.
//
// (2) The contents of a Node except for the next/prev pointers are
// immutable after the Node has been linked into the SkipList.
// Only Insert() modifies the list, and it is careful to initialize
// a node and use release-stores to publish the nodes in one or
// more lists.
//
// ... prev vs. next pointer ordering ...

主要以下几点：

写操作需要外部同步，比如mutex
读操作要保证在读的时候， skiplist不要destroy，比如reference count，读操作内部不会用到锁或者其他同步器
skiplist内部分配的node是不会被删除的，除了skiplist被destroy外
node的内容除了next/pre指针外，其他是不会改变的，只有Insert()会修改list，Insert()巧妙的初始化node，并通过release-stores(内存屏障)把node加到skiplist中, 保证对reads的可见性。
不难看出， put, delete, update等方法都是通过insert new entry实现的。

Node

// Implementation details follow
template<typename Key, class Comparator>
struct SkipList<Key,Comparator>::Node {
  explicit Node(const Key& k) : key(k) { }

  Key const key;

  // Accessors/mutators for links.  Wrapped in methods so we can
  // add the appropriate barriers as necessary.
  Node* Next(int n) {
    assert(n >= 0);
    // Use an 'acquire load' so that we observe a fully initialized
    // version of the returned Node.
    return reinterpret_cast<Node*>(next_[n].Acquire_Load());
  }
  void SetNext(int n, Node* x) {
    assert(n >= 0);
    // Use a 'release store' so that anybody who reads through this
    // pointer observes a fully initialized version of the inserted node.
    next_[n].Release_Store(x);
  }

  // No-barrier variants that can be safely used in a few locations.
  Node* NoBarrier_Next(int n) {
    assert(n >= 0);
    return reinterpret_cast<Node*>(next_[n].NoBarrier_Load());
  }
  void NoBarrier_SetNext(int n, Node* x) {
    assert(n >= 0);
    next_[n].NoBarrier_Store(x);
  }

Node的提供了两套set/get next方法，一种是普通的，另一种是memory barrier的，后一种可以保证数据的可见性。这里简单提一下memory barrier, 语义上，内存屏障之前的所有写操作都要写入内存；内存屏障之后的读操作都可以获得同步屏障之前的写操作的结果。

Node内部的数据，除了需要保存key外，还有一个长为1的指针数组，用于存储level 0的next node的地址，那其他level的呢？在node init时，多申请出height-1的地址空间，用于存储其他level的地址。这里如果我设计，要么用linklist，要么设置一个长度为最大长度的数组，要么只存放一个数组指针，这个方法我估计想不到。

Insert()

template<typename Key, class Comparator>
void SkipList<Key,Comparator>::Insert(const Key& key) {
  // TODO(opt): We can use a barrier-free variant of FindGreaterOrEqual()
  // here since Insert() is externally synchronized.

  // prev用于存放FindGreaterOrEqual node的prev node的地址
  Node* prev[kMaxHeight];
  Node* x = FindGreaterOrEqual(key, prev);

  // Our data structure does not allow duplicate insertion
  assert(x == nullptr || !Equal(key, x->key));

  int height = RandomHeight();
  //设置height高出部分的pre node为head
  if (height > GetMaxHeight()) {
    for (int i = GetMaxHeight(); i < height; i++) {
      prev[i] = head_;
    }
    //fprintf(stderr, "Change height from %d to %d\n", max_height_, height);

    // It is ok to mutate max_height_ without any synchronization
    // with concurrent readers.  A concurrent reader that observes
    // the new value of max_height_ will see either the old value of
    // new level pointers from head_ (nullptr), or a new value set in
    // the loop below.  In the former case the reader will
    // immediately drop to the next level since nullptr sorts after all
    // keys.  In the latter case the reader will use the new node.

	// 这里高度对search没有影响， search操作对于这种情况作了考虑， 因此不需要barrier， 
    max_height_.NoBarrier_Store(reinterpret_cast<void*>(height));
  }

  x = NewNode(key, height);
  for (int i = 0; i < height; i++) {
    // NoBarrier_SetNext() suffices since we will add a barrier when
    // we publish a pointer to "x" in prev[i].
	
	// x的next可见不可见不重要， x没有publish， 其他线程现在不能访问到x
    x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));
	// prev next-> x, 强制barrier， 此时对于其他reads是保证可见的
    prev[i]->SetNext(i, x);
  }
}

memory barrier 使用难度比lock要高的多，需要考虑的情况更为复杂，就如瓜帅的传控足球，磨练时可能意外颇多，一旦炉火纯青，就砍瓜切菜一般。

Iterator

在leveldb里， Iterator是一个非常重要的概念，包括block， table， memtable等很多都提供了Iterator， Iterator是访问这些数据结构内部数据的重要工具。SkipList<Key,Comparator>::Iterator::Seek - Advance to the first entry with a key >= target, 这里并没有seek等确切相等的entry，外部实现(memtable)get时，会处理这种不相等的情况。

others

除了search时， nodo的next要使用具有memory barrier的，其他实现细节就比较简单了

总结

外部与leveldb其他部分相关的insert(put update delete)的设计，内部lock(虽然外部同步，但是与内存设计相关)， memory barrier，存储细节的考虑，几百行代码的skiplist，需要细细品味的东西太多了。