std::sort是怎么实现的

那天听源神说std::sort并不是简单的快排，而是什么内省排序，我感到很新奇，就研究了一下。

前言

所有本文的讨论基于stl_algo.h

开搞

std::sort

std::sort总共有两个重载的版本，如下所示

template<typename _RandomAccessIterator>
    inline void
    sort(_RandomAccessIterator __first, _RandomAccessIterator __last)

template<typename _RandomAccessIterator, typename _Compare>
    inline void
    sort(_RandomAccessIterator __first, _RandomAccessIterator __last,
     _Compare __comp)

可以看出，一个是带了二元谓词的，一个是没带。没带的用的是

__gnu_cxx::__ops::__iter_less_iter()

这个是什么呢？在predefined_ops.h中line 37-48 可以看到。

struct _Iter_less_iter
  {
    template<typename _Iterator1, typename _Iterator2>
      _GLIBCXX14_CONSTEXPR
      bool
      operator()(_Iterator1 __it1, _Iterator2 __it2) const
      { return *__it1 < *__it2; }
  };

inline _Iter_less_iter
  __iter_less_iter()
  { return _Iter_less_iter(); }

看起来就是比较两个迭代器指向的数据，如果前者小于后者，就返回true，否则返回false。

然后回到std::sort。可见它们在末尾都不约而同地调用了std::__sort

template<typename _RandomAccessIterator, typename _Compare>
    inline void
    __sort(_RandomAccessIterator __first, _RandomAccessIterator __last,
       _Compare __comp)
    {
      if (__first != __last)
    {
      std::__introsort_loop(__first, __last,
                std::__lg(__last - __first) * 2,
                __comp);
      std::__final_insertion_sort(__first, __last, __comp);
    }
    }

可见std::__sort先调用std::__introsort_loop，再调用std::__final_insertion_sort。而且std::__introsort_loop中调用了std::__lg。首先看std::__lg是什么。

std::__lg

inline _GLIBCXX_CONSTEXPR int
  __lg(int __n)
  { return sizeof(int) * __CHAR_BIT__  - 1 - __builtin_clz(__n); }

  inline _GLIBCXX_CONSTEXPR unsigned
  __lg(unsigned __n)
  { return sizeof(int) * __CHAR_BIT__  - 1 - __builtin_clz(__n); }

  inline _GLIBCXX_CONSTEXPR long
  __lg(long __n)
  { return sizeof(long) * __CHAR_BIT__ - 1 - __builtin_clzl(__n); }

  inline _GLIBCXX_CONSTEXPR unsigned long
  __lg(unsigned long __n)
  { return sizeof(long) * __CHAR_BIT__ - 1 - __builtin_clzl(__n); }

  inline _GLIBCXX_CONSTEXPR long long
  __lg(long long __n)
  { return sizeof(long long) * __CHAR_BIT__ - 1 - __builtin_clzll(__n); }

  inline _GLIBCXX_CONSTEXPR unsigned long long
  __lg(unsigned long long __n)
  { return sizeof(long long) * __CHAR_BIT__ - 1 - __builtin_clzll(__n); }

发现std::__lg作用于整数。其中有个__builtin_clz查了下发现是GCC Built-in Function

— Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.

就是在形参x非0时候返回x在二进制表示时前面0的个数，x为0该函数行为未定义。因此std::__lg返回的就是一个整数的二进制最高位。想一想就发现其实这是[log₂x]([x]代表向下取整，下同)。

std::__introsort_loop

好了，这时再来看std::__introsort_loop。

template<typename _RandomAccessIterator, typename _Size, typename _Compare>
    void
    __introsort_loop(_RandomAccessIterator __first,
             _RandomAccessIterator __last,
             _Size __depth_limit, _Compare __comp)
    {
      while (__last - __first > int(_S_threshold))
    {
      if (__depth_limit == 0)
        {
          std::__partial_sort(__first, __last, __last, __comp);
          return;
        }
      --__depth_limit;
      _RandomAccessIterator __cut =
        std::__unguarded_partition_pivot(__first, __last, __comp);
      std::__introsort_loop(__cut, __last, __depth_limit, __comp);
      __last = __cut;
    }
    }

可见std::__lg(__last - __first) * 2，即2log₂(区间长度)，就是__depth_limit。至于为什么选这个作为深度阈值，大概是因为partition的递归树一层是cn，那么只有树高度为O(logN)才会比较稳，要不放任其递归，就可能导致非常糟糕的partition，最终造成Θ(N²)的复杂度。

Theorem 2 Assuming an O(logN) subproblem tree depth limit and an O(NlogN) worst-case time bound for Heapsort , the worst-case computing time for Introsort is Θ(NlogN).

Proof: The time for partitioning is bounded by times the number of levels of partitioning, which is bounded by the subproblem tree depth limit. Hence the total time for partitioning is O(NlogN). Suppose Introsort calls Heapsort times on sequences of length n₁,…,n_j.
Let c be a constant such that cNlog₂N bounds the time for Heapsort on a sequence of length N; then the time for all the calls of Heapsort is bounded by

  j                                         j
∑ cn_ilog₂n_i     ≤     c(log₂N)∑n_i     ≤     cNlog₂N
i=1                                      i=1

or O(NlogN) also. Therefore the total time for Introsort is O(NlogN).
The lower bound is also Ω(NlogN), since a sequence such as an already sorted sequence
produces equal length partitions in all cases, resulting in log₂N levels each taking Ω(N) time.

至于为什么是2 * log₂N，而不是2.333 * log₂N或者6.666 * log₂N，原论文也是有解释的。首先不能太小，否则太频繁的调用堆排序，一点也不比一开始就调堆排序好。然后之所以是2，是根据经验玄学，2会产生更好的结果。

Although any choice for the constant factor in this bound will ensure an overall time bound of O(NlogN), it must not
be so small that the algorithm calls heapsort too frequently (causing it to be little better than using heapsort in the first place). In the following pseudo-code the depth limit is 2[log₂N], since this value empirically produces good results.

然后接着看std::__introsort_loop。需要__last - __first大于int(_S_threshold)才能执行。

enum { _S_threshold = 16 };

毕竟如果区间长度小，再分治就不合算了，改为插入排序，在基本有序情况下插入排序还是很稳的。至于为什么是16，好像原论文里也没说。。目测又是经验值。玄学

如果递归深度__depth_limit达到了，就调用std::__partial_sort，其内部用的是堆排序。

template<typename _RandomAccessIterator, typename _Compare>
    inline void
    __partial_sort(_RandomAccessIterator __first,
           _RandomAccessIterator __middle,
           _RandomAccessIterator __last,
           _Compare __comp)
    {
      std::__heap_select(__first, __middle, __last, __comp);
      std::__sort_heap(__first, __middle, __comp);
    }

每循环一次__depth_limit都会自减1。然后用std::__unguarded_partition_pivot取出一个pivot，对[pivot, last)递归调用std::__introsort_loop，并接着在first与pivot间进行循环。所以说并不是[first,pivot)没算到，而是它在下一次循环算的，每次只递归后面的部分。

template<typename _RandomAccessIterator, typename _Compare>
    inline _RandomAccessIterator
    __unguarded_partition_pivot(_RandomAccessIterator __first,
                _RandomAccessIterator __last, _Compare __comp)
    {
      _RandomAccessIterator __mid = __first + (__last - __first) / 2;
      std::__move_median_to_first(__first, __first + 1, __mid, __last - 1, __comp);
      return std::__unguarded_partition(__first + 1, __last, __first, __comp);
    }

std::__unguarded_partition_pivot里面调了std::__move_median_to_first与std::__unguarded_partition，先看这两个函数。

template<typename _Iterator, typename _Compare>
    void
    __move_median_to_first(_Iterator __result,_Iterator __a, _Iterator __b,
               _Iterator __c, _Compare __comp)
    {
      if (__comp(__a, __b))
    {
      if (__comp(__b, __c))
        std::iter_swap(__result, __b);
      else if (__comp(__a, __c))
        std::iter_swap(__result, __c);
      else
        std::iter_swap(__result, __a);
    }
      else if (__comp(__a, __c))
    std::iter_swap(__result, __a);
      else if (__comp(__b, __c))
    std::iter_swap(__result, __c);
      else
    std::iter_swap(__result, __b);
    }

可以看出std::__move_median_to_first就是把__first + 1, __mid, __last - 1三者指向的值的中位数找出，与__first指向的数交换。

template<typename _RandomAccessIterator, typename _Compare>
    _RandomAccessIterator
    __unguarded_partition(_RandomAccessIterator __first,
              _RandomAccessIterator __last,
              _RandomAccessIterator __pivot, _Compare __comp)
    {
      while (true)
    {
      while (__comp(__first, __pivot))
        ++__first;
      --__last;
      while (__comp(__pivot, __last))
        --__last;
      if (!(__first < __last))
        return __first;
      std::iter_swap(__first, __last);
      ++__first;
    }
    }

大概就是__first从前往后搜，找到比__pivot大的数；然后__last从后往前搜，找到比__pivot小的数。如果此时__first与__last交叉，则结束，返回__first。否则交换__first与__last指向的元素，重复上述步骤。__pivot即上一步std::__move_median_to_first得到的中位数。因此std::__unguarded_partition_pivot返回的就是pivot，的位置，从__first+1到__last以pivot为界，前面比__first指向的值小，后面比__first指向的值大。

std::__final_insertion_sort

又回到最初的起点std::__sort。std::__introsort_loop后调的是std::__final_insertion_sort。

template<typename _RandomAccessIterator, typename _Compare>
    void
    __final_insertion_sort(_RandomAccessIterator __first,
               _RandomAccessIterator __last, _Compare __comp)
    {
      if (__last - __first > int(_S_threshold))
    {
      std::__insertion_sort(__first, __first + int(_S_threshold), __comp);
      std::__unguarded_insertion_sort(__first + int(_S_threshold), __last,
                      __comp);
    }
      else
    std::__insertion_sort(__first, __last, __comp);
    }

这时_S_threshold又出现了。当区间长度小于16，则直接调用std::__insertion_sort，否则对前16个调用std::__insertion_sort，对后面的调用std::__unguarded_insertion_sort。

template<typename _RandomAccessIterator, typename _Compare>
    void
    __insertion_sort(_RandomAccessIterator __first,
             _RandomAccessIterator __last, _Compare __comp)
    {
      if (__first == __last) return;

      for (_RandomAccessIterator __i = __first + 1; __i != __last; ++__i)
    {
      if (__comp(__i, __first))
        {
          typename iterator_traits<_RandomAccessIterator>::value_type
        __val = _GLIBCXX_MOVE(*__i);
          _GLIBCXX_MOVE_BACKWARD3(__first, __i, __i + 1);
          *__first = _GLIBCXX_MOVE(__val);
        }
      else
        std::__unguarded_linear_insert(__i,
                __gnu_cxx::__ops::__val_comp_iter(__comp));
    }
    }

std::__insertion_sort和平常的insertion sort不同在于首先比较第一个元素和当前元素，如果满足__comp的条件，直接把当前元素移到第一个元素处，否则调用std::__unguarded_linear_insert。

template<typename _RandomAccessIterator, typename _Compare>
    void
    __unguarded_linear_insert(_RandomAccessIterator __last,
                  _Compare __comp)
    {
      typename iterator_traits<_RandomAccessIterator>::value_type
    __val = _GLIBCXX_MOVE(*__last);
      _RandomAccessIterator __next = __last;
      --__next;
      while (__comp(__val, __next))
    {
      *__last = _GLIBCXX_MOVE(*__next);
      __last = __next;
      --__next;
    }
      *__last = _GLIBCXX_MOVE(__val);
    }

可以看出，std::__unguarded_linear_insert就是提供了insertion sort中的插入部分。

template<typename _RandomAccessIterator, typename _Compare>
    inline void
    __unguarded_insertion_sort(_RandomAccessIterator __first,
                   _RandomAccessIterator __last, _Compare __comp)
    {
      for (_RandomAccessIterator __i = __first; __i != __last; ++__i)
    std::__unguarded_linear_insert(__i,
                __gnu_cxx::__ops::__val_comp_iter(__comp));
    }

std::__unguarded_insertion_sort就是对从__first到__last调用std::__unguarded_linear_insert(插入操作)。

就这样结束了？

然而有个问题，搞那么多种insertion sort干什么？可以看到，std::__unguarded_linear_insert就是直接往前找位置插入，不管会不会越界。所以std::__insertion_sort调用是毫无问题的，因为std::__insertion_sort会首先和第一个元素比较，保证了不会越界。但是std::__unguarded_insertion_sort并没有检查，凭什么也能调？所以要看看std::__unguarded_insertion_sort在哪被调用的。

template<typename _RandomAccessIterator, typename _Compare>
    void
    __final_insertion_sort(_RandomAccessIterator __first,
               _RandomAccessIterator __last, _Compare __comp)
    {
      if (__last - __first > int(_S_threshold))
    {
      std::__insertion_sort(__first, __first + int(_S_threshold), __comp);
      std::__unguarded_insertion_sort(__first + int(_S_threshold), __last,
                      __comp);
    }
      else
    std::__insertion_sort(__first, __last, __comp);
    }

可以查到，只有std::__final_insertion_sort调了。所以从调用的方式看来，一定能够保证最值会出现在前16个数中。而且如果要区间长度大于16，那么std::__introsort_loop就一定会被调用。那么就由std::__introsort_loop来保证最值出现在前16个数中了。这个函数结束就只有两个条件

递归深度为0
区间长度小16

由于如果要调用std::__unguarded_insertion_sort，一开始一定是区间长度大于16。那么此时递归深度__depth_limit ≥ 2[log₂16] = 8，一定会调用std::__unguarded_insertion_sort，则保证前半部分值和后半部分值由__cut分开。

如果递归深度为0，那么调用堆排序的序列长度大于16。由于保证了最值会出现在__first与__last之间，因此堆排序后最值肯定在__first，自然也就肯定在前16个元素中了。
如果区间长度小于16。由于每次循环都能缩小区间且保证最值出现在下一次循环的区间中，因此一定能满足最小值在该区间(长度小于16)中，自然也就保证在前16个中了。所以实现是没有问题的，一定能够保证不会越界。

那么为什么要搞std::__unguarded_insertion_sort呢？只是insertion sort不好吗？粗略算一下，可能是因为insertion sort慢。。一个insertion sort大概是这样

while j > 0 and comp(A[j-1], A[j])
    swap(A[j-1], A[j])
    j = j - 1

假定j = N，最多需要2N次比较，3N次赋值(swap)，N次自减。std::__unguarded_linear_insert，由于unguarded，而且实现好像是选择排序，且不用判断是否越界，最多N次比较，2N次赋值，N次自减。就算走另一个分支，最多3N次自增，N次赋值，N次比较。所以会比一般的插入排序要优秀。虽然看起来差不多，大数据下也许差距就非常可观了。要不是效率STL的实现者也不会没事搞这么复杂吧。

总结

STL源码有空还是读读，可以学不少东西。

参考资料

Musser D R. Introspective sorting and selection algorithms[J]. Softw., Pract. Exper., 1997, 27(8): 983-993.
GNU STL