boltdb introduction

发表于 2018-05-06 | 分类于 computer science

boltdb是一个设计简单， go实现的kv数据库，代码量虽少，但是功能五脏俱全，而且性能不错，如果打算学习下kv storage engine, boltdb绝对是一个好的开始。

boltdb的所有数据都存储在一个单文件里，mmap映射到内存，以page作为基本数据操作单位， page主要有meta(元数据)， freelist(可使用的page）， branch（index）, leaf（data）四种，数据是以b+ tree组织的，通过cow实现mvcc，即当发生写事务时，会把整个路径的对于page的修改写到新的page里，那些old page当没有读事务关联时，就会有freelist回收再利用，读事务时间不能太长，不然会用太多的old page在内存， meta page有两个，当初始化事务时，会关联txid最大且有效的，当当写事务要写会到page时，会写到txid较小的那个，及时写时失败，也总有一个有效的meta page，以此保证读事务总是可以在任何时候读到最新的commit数。，写事务是相互阻塞的，同时只有一个写事务存在，这样对于事务的维护非常简单，但是写事务的性能不高。由于数据的存储依赖mmap，当数据足够大时，由于os的换页操作，读性能也会严重下降。

skiplist 分段锁

发表于 2018-05-01 | 分类于 computer science

level skiplist

leveldb skiplist 通过采用write lock和memory barrier的方式，保证读读，读写操作不会冲突，但是写写是相互阻塞的，如果对Insert操作分析一下，不难发现，当同时Insert不相邻的enrty时，Insert操作之间是相互不影响的，那么除了write mutex lock，是否可以采用更好的方式提高写写的并发呢？

优化

skiplist 的高度为1时，退化为有序的list，先看一下怎么去提高有序list insert的并发。

找到node entry要insert的位置时，记录下<entry的node pre， * 同时记录下>= entry的node pre的next node即node next
entry.next = next，CAS(pre.next, next, entry), 如果失败，从node pre向后找，找到新的pre， next，继续该操作，知道成功为止。
entry成功插入list

对于skiplist的高度>1时， skiplist不同高度的list相互间是互不影响的，因此，该优化对于skiplist是同样适应的。

如果不采用CAS，使用mutex lock又该如何呢？

找到node entry要insert的位置时，记录下<entry的node pre， * 同时记录下>= entry的node pre的next node即node next
获得锁， check pre.next是否是next，如果不是，释放锁，从pre向后找到新的pre, next，继续该操作，直到成功，然后entry.next=next, pre.next=entry, 释放锁
entry成功插入list

总结

对于write write的数据之间相互不影响时，往往可以采用分段锁来提高并发，比如ConcurrentHashMap

leveldb WriteBatch

发表于 2018-05-01 | 分类于 computer science

Introduction

// WriteBatch holds a collection of updates to apply atomically to a DB.
//
// The updates are applied in the order in which they are added
// to the WriteBatch.  For example, the value of "key" will be "v3"
// after the following batch is written:
//
//    batch.Put("key", "v1");
//    batch.Delete("key");
//    batch.Put("key", "v2");
//    batch.Put("key", "v3");
//
// Multiple threads can invoke const methods on a WriteBatch without
// external synchronization, but if any of the threads may call a
// non-const method, all threads accessing the same WriteBatch must use
// external synchronization.

doc解释的很清楚了，重点a collection of updates，至于保证原子操作，与db write操作有关，保证中间不会插入其他update。

1 2	// WriteBatchInternal provides static methods for manipulating a // WriteBatch that we don't want in the public WriteBatch interface.

对WriteBatch的操作是由WriteBatchInternal提供的，比较重点的是InsertInto，通过这个方法，会把writebatch里的updates写到memtable里

Implementation Details

// WriteBatch::rep_ :=
//    sequence: fixed64
//    count: fixed32
//    data: record[count]
// record :=
//    kTypeValue varstring varstring         |
//    kTypeDeletion varstring
// varstring :=
//    len: varint32
//    data: uint8[len]

WriteBatch内部保存updates的格式

leveldb memtable

发表于 2018-04-30 | 分类于 computer science

Introduction

Memtable 提供了NewIterator, Get支持read操作， Add支持write操作， Ref/Unref支持reference count

Implementation Details

entry format

entry

entry compared order by:

increasing user key (according to user-supplied comparator)
decreasing sequence number
decreasing type (though sequence# should be enough to disambiguate)

Add

将key, value, sequence number, value type组合成entry, 然后insert到skiplist

Get

通过Skiplist::Seek找到entry，然后判断entry里的user key是否匹配， sequence number降序排列，这里就不需要判断了，最后判断value type是否是kTypeValue。

Iterator

Memtable::Iterator 通过调用Skiplist::Iterator对应方法实现

Concurrency

Memtable并没有为skiplist提供write mutex，需要外部保证。

总结

Memtable提供了非常简单的接口，在leveldb中是非常重要的数据结构。

leveldb LRUCache

发表于 2018-04-24 | 分类于 computer science

Data Structure

LRUCache

LRUCache 内部主要数据成员：

hash table，一个简单的hash table的实现
lru list，一个双向list，用于存放未被外部使用的entries，当cache满了后，再向cache加入新的entry，需要从这里找出牺牲品。
in_use list, 一个双向list，用于存放被外部使用的entries

hash table与双向list相结合， hash table实现快速查找，双向list实现快速add， delete。

entry是cache里的数据，要么在lru list中，要么在in_use list中，通过refs管理生命周期, 通过in_cache标记是否在cache中，三种情况entry会被从cache中移除:

Erase()删除
Insert()加入了相同key的entry，把原来的替换掉
cache满了，被牺牲掉

ShardedLRUCache通过分片，讲keys映射到多个LRUCache中，增大并发能力。

Concurrency

mutex保证多线程安全， shard增大并发能力。

Implementation Details

看到hash table的实现，我在想，为啥hashcode连个扰动函数都不用，基本只会用到hashcode低位去map映射，后面看了ShardedLRUCache，原来高位负责这里的map映射。

其他实现简单易懂。

总结

几百行的LRUCache的实现， hash table和双向list向结合， reference count，对于hashcode的合理使用， sharded等各种技术点都有体现，短小精悍。

leveldb skiplist

发表于 2018-04-23 | 分类于 computer science

Data Structure

skiplist

leveldb memtable需要满足以下几种要求：

get 足够快
write 足够快
iterator 足够快 (compact to sst file by ordering)

那么skiplist作为一个同时兼顾list和tree属性的data structure，是非常好的选择。

Concurrency

concurrency

leveldb skiplist 通过采用write lock和memory barrier的方式，保证读读，读写操作不会冲突，提高并发能力。

Implementation Details

skiplist doc:

// Thread safety
// -------------
//
// Writes require external synchronization, most likely a mutex.
// Reads require a guarantee that the SkipList will not be destroyed
// while the read is in progress.  Apart from that, reads progress
// without any internal locking or synchronization.
//
// Invariants:
//
// (1) Allocated nodes are never deleted until the SkipList is
// destroyed.  This is trivially guaranteed by the code since we
// never delete any skip list nodes.
//
// (2) The contents of a Node except for the next/prev pointers are
// immutable after the Node has been linked into the SkipList.
// Only Insert() modifies the list, and it is careful to initialize
// a node and use release-stores to publish the nodes in one or
// more lists.
//
// ... prev vs. next pointer ordering ...

主要以下几点：

写操作需要外部同步，比如mutex
读操作要保证在读的时候， skiplist不要destroy，比如reference count，读操作内部不会用到锁或者其他同步器
skiplist内部分配的node是不会被删除的，除了skiplist被destroy外
node的内容除了next/pre指针外，其他是不会改变的，只有Insert()会修改list，Insert()巧妙的初始化node，并通过release-stores(内存屏障)把node加到skiplist中, 保证对reads的可见性。
不难看出， put, delete, update等方法都是通过insert new entry实现的。

Node

// Implementation details follow
template<typename Key, class Comparator>
struct SkipList<Key,Comparator>::Node {
  explicit Node(const Key& k) : key(k) { }

  Key const key;

  // Accessors/mutators for links.  Wrapped in methods so we can
  // add the appropriate barriers as necessary.
  Node* Next(int n) {
    assert(n >= 0);
    // Use an 'acquire load' so that we observe a fully initialized
    // version of the returned Node.
    return reinterpret_cast<Node*>(next_[n].Acquire_Load());
  }
  void SetNext(int n, Node* x) {
    assert(n >= 0);
    // Use a 'release store' so that anybody who reads through this
    // pointer observes a fully initialized version of the inserted node.
    next_[n].Release_Store(x);
  }

  // No-barrier variants that can be safely used in a few locations.
  Node* NoBarrier_Next(int n) {
    assert(n >= 0);
    return reinterpret_cast<Node*>(next_[n].NoBarrier_Load());
  }
  void NoBarrier_SetNext(int n, Node* x) {
    assert(n >= 0);
    next_[n].NoBarrier_Store(x);
  }

Node的提供了两套set/get next方法，一种是普通的，另一种是memory barrier的，后一种可以保证数据的可见性。这里简单提一下memory barrier, 语义上，内存屏障之前的所有写操作都要写入内存；内存屏障之后的读操作都可以获得同步屏障之前的写操作的结果。

Node内部的数据，除了需要保存key外，还有一个长为1的指针数组，用于存储level 0的next node的地址，那其他level的呢？在node init时，多申请出height-1的地址空间，用于存储其他level的地址。这里如果我设计，要么用linklist，要么设置一个长度为最大长度的数组，要么只存放一个数组指针，这个方法我估计想不到。

Insert()

template<typename Key, class Comparator>
void SkipList<Key,Comparator>::Insert(const Key& key) {
  // TODO(opt): We can use a barrier-free variant of FindGreaterOrEqual()
  // here since Insert() is externally synchronized.

  // prev用于存放FindGreaterOrEqual node的prev node的地址
  Node* prev[kMaxHeight];
  Node* x = FindGreaterOrEqual(key, prev);

  // Our data structure does not allow duplicate insertion
  assert(x == nullptr || !Equal(key, x->key));

  int height = RandomHeight();
  //设置height高出部分的pre node为head
  if (height > GetMaxHeight()) {
    for (int i = GetMaxHeight(); i < height; i++) {
      prev[i] = head_;
    }
    //fprintf(stderr, "Change height from %d to %d\n", max_height_, height);

    // It is ok to mutate max_height_ without any synchronization
    // with concurrent readers.  A concurrent reader that observes
    // the new value of max_height_ will see either the old value of
    // new level pointers from head_ (nullptr), or a new value set in
    // the loop below.  In the former case the reader will
    // immediately drop to the next level since nullptr sorts after all
    // keys.  In the latter case the reader will use the new node.

	// 这里高度对search没有影响， search操作对于这种情况作了考虑， 因此不需要barrier， 
    max_height_.NoBarrier_Store(reinterpret_cast<void*>(height));
  }

  x = NewNode(key, height);
  for (int i = 0; i < height; i++) {
    // NoBarrier_SetNext() suffices since we will add a barrier when
    // we publish a pointer to "x" in prev[i].
	
	// x的next可见不可见不重要， x没有publish， 其他线程现在不能访问到x
    x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));
	// prev next-> x, 强制barrier， 此时对于其他reads是保证可见的
    prev[i]->SetNext(i, x);
  }
}

memory barrier 使用难度比lock要高的多，需要考虑的情况更为复杂，就如瓜帅的传控足球，磨练时可能意外颇多，一旦炉火纯青，就砍瓜切菜一般。

Iterator

在leveldb里， Iterator是一个非常重要的概念，包括block， table， memtable等很多都提供了Iterator， Iterator是访问这些数据结构内部数据的重要工具。SkipList<Key,Comparator>::Iterator::Seek - Advance to the first entry with a key >= target, 这里并没有seek等确切相等的entry，外部实现(memtable)get时，会处理这种不相等的情况。

others

除了search时， nodo的next要使用具有memory barrier的，其他实现细节就比较简单了

总结

外部与leveldb其他部分相关的insert(put update delete)的设计，内部lock(虽然外部同步，但是与内存设计相关)， memory barrier，存储细节的考虑，几百行代码的skiplist，需要细细品味的东西太多了。

pifuant

蚍蜉撼大树可笑不自量

GitHub E-Mail