文字列探索法の比較

文字列探索のアルゴリズムで有名なものとして、ナイーブな力任せ法とKMP法とBM法とRK法の4つがある。実際問題として多くのケースでは力任せ法が十分に早いし作業空間も最小なので、C++標準ライブラリのstring::findは力任せ法で実装されている。KMP法は理論的な計算量は優れているが処理が複雑になるので現実的には遅く、BM法は前処理の作業空間を多少食う代わりに文字種が多いケースでは早いことが知られている。RK法は多数のパターンを同時に検出したい場合に効率が良い。と、いろんなところに書いてある。でも具体的にどんなデータでどんな違いが出るのか自分でも確かめてみたい。
f:id:fridaynight:20200215160730j:plain

約10万文字の英字テキストの中に適当な20文字のパターンが出現するかどうかを調べるというユースケースを想定してみる。テキストとパターンの文字は小文字の英字をランダムに選択して生成する。文字の種類は、2種類、3種類、4種類、5種類、26種類と変えてテストする。2種類の場合、パターンは baaaababbbbbabbbaaba のようになり、4種類だと caabbdbddcccbddcbacb という感じで、26種類だと pcehhyhwusonmxwnjdqh という感じになる。テキストとパターンの文字種の数は同じにする。探索は10000回繰り返してその合計経過時間を測定する。探索関数は、力任せ法から4種類の実装を用意して、さらにKMP法とBM法も加えた6種類を試す。

まずは、string::find を使った実装。中の実装がどうなっているかは処理系次第なのだが、少なくともGNUの実装では力任せ法に基づいているらしい。

// String searching implementation with the standard string::find.
int find_std(const std::string& text, const std::string& pattern) {
  return text.find(pattern);
}

memchrを使った実装。これも力任せ法だ。パターンの最初の文字に一致するテキスト内の位置判定をmemchrが担当し、それはCより低いレイヤーで最適化されていることが期待できる。

// String searching implementation with memmem.
int find_memchr(const std::string& text, const std::string& pattern) {
  if (pattern.empty()) {
    return 0;
  }
  const int first_pattern_char = pattern.front();
  const char* pattern_data = pattern.data() + 1;
  const int pattern_size = pattern.size() - 1;
  const char* current_pointer = text.data();
  int size = static_cast<int>(text.size()) - static_cast<int>(pattern.size()) + 1;
  while (size > 0) {
    const char* match_pointer =
      static_cast<const char*>(std::memchr(current_pointer, first_pattern_char, size));
    if (match_pointer == nullptr) {
      break;
    }
    if (memcmp(match_pointer + 1, pattern_data, pattern_size) == 0) {
      return match_pointer - text.data();
    }
    match_pointer++;
    size -= match_pointer - current_pointer;
    current_pointer = match_pointer;
  }
  return -1;
}

memmemを使った実装。これも力任せ法だ。memmemは非標準だが多くのUNIX系システムで実装されていて、メモリ上のパターン検索を行ってくれる。これもCより低いレイヤーで最適化されていることが期待できる。

// String searching implementation with memchr.
int find_memmem(const std::string& text, const std::string& pattern) {
  const void* result =
    memmem(text.data(), text.size(), pattern.data(), pattern.size());
  if (result == nullptr) {
    return -1;
  }
  return static_cast<const char*>(result) - text.data();
}

こちらは最も直感的な力任せ法の実装で、自分で二重ループを書いて一致判定をするものだ。

// String searching implementation with naive double loop matching.
int find_naive(const std::string& text, const std::string& pattern) {
  if (pattern.size() > text.size()) {
    return -1;
  }
  const size_t text_end = text.size() - pattern.size() + 1;
  for (size_t text_index = 0; text_index < text_end; text_index++) {
    size_t pattern_index = 0;
    while (pattern_index < pattern.size() &&
           text[text_index + pattern_index] == pattern[pattern_index]) {
      pattern_index++;
    }
    if (pattern_index == pattern.size()) {
      return text_index;
    }
  }
  return -1;
}

KMP法（クヌース・モリス・プラット法）。パターン内のどの位置で不一致が起きたかによって、次の一致判定でパターン内のどの位置に着目するか決める。

// String searching implementation with Knuth–Morris–Pratt algorithm.
int find_kmp(const std::string& text, const std::string& pattern) {
  if (pattern.empty()) {
    return 0;
  }
  std::vector<int> table(pattern.size() + 1);
  table[0] = -1;
  size_t pattern_index = 0;
  int offset = -1;
  while (pattern_index < pattern.size()) {
    while (offset >= 0 && pattern[pattern_index] != pattern[offset]) {
      offset = table[offset];
    }
    pattern_index++;
    offset++;
    table[pattern_index] = offset;
  }
  size_t text_index = 0;
  offset = 0;
  while (text_index < text.size() &&
         offset < static_cast<int>(pattern.size())) {
    while (offset >= 0 && text[text_index] != pattern[offset]) {
      offset = table[offset];
    }
    text_index++;
    offset++;
  }
  if (offset == static_cast<int>(pattern.size())) {
    return text_index - pattern.size();
  }
  return -1;
}

BM法（ボイヤー・ムーア法）。パターンの後ろから一致判定をしていくが、不一致の際に着目点を何個ずらすかを文字毎に決める。

// String searching implementation with Boyer-Moore algorithm.
int find_bm(const std::string& text, const std::string& pattern) {
  if (pattern.empty()) {
    return 0;
  }
  constexpr int table_size = std::numeric_limits<unsigned char>::max() + 1;
  int table[table_size];
  for(int table_index = 0; table_index < table_size; table_index++){
    table[table_index] = pattern.size();
  }
  int shift = pattern.size();
  int pattern_index = 0;
  while (shift > 0){
    const unsigned int table_index =
      static_cast<unsigned char>(pattern[pattern_index++]);
    table[table_index] = --shift;
  }
  int pattern_end = pattern.size() - 1;
  int begin_index = 0;
  int end_index = text.size() - pattern_end;
  while (begin_index < end_index) {
    int pattern_index = pattern_end;
    while (text[begin_index + pattern_index] == pattern[pattern_index]){
      if (pattern_index == 0) {
        return begin_index;
      }
      pattern_index--;
    }
    const unsigned int table_index =
      static_cast<unsigned char>(text[begin_index + pattern_index]);
    const int step = table[table_index] - pattern_end + pattern_index;
    begin_index += step > 0 ? step : 2;
  }
  return -1;
}

RK法（ラビン・カープ法）。パターンのサイズのスライドウィンドウ毎にRolling Hashを使ってハッシュ値を更新してパターンのそれと比較する。

// String searching implementation with Rabin-Karp algorithm.
int find_rk(const std::string& text, const std::string& pattern) {
  const unsigned char* text_p = reinterpret_cast<const unsigned char*>(text.data());
  const unsigned char* pattern_p = reinterpret_cast<const unsigned char*>(pattern.data());
  constexpr int base = 239;
  constexpr int modulo = 1798201;
  int power = 1;
  for (size_t i = 0; i < pattern.size(); i++) {
    power = (power * base) % modulo;
  }
  int pattern_hash = 0;
  for (size_t i = 0; i < pattern.size(); i++) {
    pattern_hash = pattern_hash * base + pattern_p[i];
    pattern_hash %= modulo;
  }
  int text_hash = 0;
  for (size_t i = 0; i < text.size(); i++) {
    text_hash = text_hash * base + text_p[i];
    text_hash %= modulo;
    if (i >= pattern.size()) {
      text_hash -= power * text_p[i - pattern.size()] % modulo;
      if (text_hash < 0) {
        text_hash += modulo;
      }
    }
    if (pattern_hash == text_hash && i >= pattern.size() - 1) {
      const size_t offset = i - (pattern.size() - 1);
      if (std::memcmp(text_p + offset, pattern_p, pattern.size()) == 0) {
        return offset;
      }
    }
  }
  return -1;
}

パフォーマンステスト用のメインルーチンはこんな感じ。

#include <cstring>
#include <iostream>
#include <string>
#include <utility>
#include <vector>
#include <limits>
#include <chrono>
#include <random>

// Above function implementations here.

int main(int argc, char** argv) {
  constexpr int text_size = 100 * 1000;
  constexpr int pattern_size = 20;
  constexpr int num_iterations = 10000;
  const std::vector<std::pair<char, char>> charsets =
    {{'a', 'b'}, {'a', 'c'}, {'a', 'd'}, {'a', 'e'}, {'a', 'z'}};
  const std::vector<std::pair<
    int (*)(const std::string&, const std::string&), const char*>> test_sets = {
    {find_std, "find_std"},
    {find_memchr, "find_memchr"},
    {find_memmem, "find_memmem"},
    {find_naive, "find_naive"},
    {find_kmp, "find_kmp" },
    {find_bm, "find_bm"},
    {find_rk, "find_kr"},
  };
  for (const auto& charset : charsets) {
    std::vector<std::string> texts;
    for (int i = 0; i < 1009; ++i) {
      texts.emplace_back(make_random_text(text_size, i, charset.first, charset.second));
    }
    std::vector<std::string> patterns;
    for (int i = 0; i < 1013; ++i) {
      patterns.emplace_back(make_random_text(pattern_size, texts.size() + i,
                                             charset.first, charset.second));
    }
    std::cout << "first=" << charset.first
              << " seccond=" << charset.second
              << " sample_pattern=" << patterns.front() << std::endl;
    for (const auto& test_set : test_sets) {
      const auto start_time = std::chrono::system_clock::now();
      int hit_count = 0;
      for (int i = 0; i < num_iterations; ++i) {
        const std::string& text = texts[i % texts.size()];
        const std::string& pattern = patterns[i % patterns.size()];
        if (test_set.first(text, pattern) >= 0) {
          hit_count++;
        }
      }
      const auto end_time = std::chrono::system_clock::now();
      const auto elapsed_time =
        std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time);
      std::cout << "func=" << test_set.second
                << " time=" << (elapsed_time.count() / 1000000.0)
                << " hit_count=" << hit_count << std::endl;
    }
  }
  return 0;
}

結果はこうなった。単位は秒。

文字の樹類	2文字	3文字	4文字	5文字	26文字
sring::find	10.9104	7.0804	5.8150	4.0944	1.3216
memchr	10.4196	6.5102	5.1063	3.8017	1.2609
memmem	5.1861	4.7109	3.9979	3.4930	1.5694
二重ループ	9.2288	8.5419	6.9518	5.9132	2.5553
KMP法	10.0218	10.9312	9.3197	8.2337	3.8789
BM法	7.0359	5.0016	3.2147	2.2617	0.4916
RK法	8.7378	9.1713	9.3164	9.2715	9.2196

力任せ法の4実装の中では、memchrを使った実装が安定して早い。string::findが最速ってわけでもないらしい。でもstring::findのパフォーマンス特性とmemchrのそれはかなり似ているので、実装はだいたい同じ感じなんじゃないかな。暇があったら実装を読もうか。memmemを使った実装は文字種が少ない場合にやたら早いというのも興味深い。二重ループは予想通り遅いのだが、文字種が少ない場合にはstring::findより早いというのは面白い。

KMP法は良いところがない。文字種が少ないほど有利になる傾向はあるが、2種類でもまだ力任せ法に明確には勝てない。探索パターンとほぼ同じだがちょっとだけ違うパターンがテキスト内に頻出する場合にのみ力任せ方は遅くなるが、その場合にもKMP法は遅くならないというのが唯一の利点だ。しかしランダムな文字列生成ではその良さを引き出すことはできない。やはり、教科書に書いてある通り、KMP法は理論上は美しいが実用性はなさげだ。

下馬評の高いBM法を見てみよう。のっけの文字2種類から力任せ法より早く、文字種が増えるほどに高速化する。このテストケースのように、テキストがかなり巨大な場合は、パターンの前処理にかかる時間が無視できるので、力任せ法よりもBM法に圧倒的に分がある。逆に、テキストが1000文字以下だったりパターンが10文字以下だったりするとBM法の利点はあまりないだろう。

最後にRK法を見てみよう。アルゴリズムの特性上、文字種やパターンの長さに関わらず一定の速度になる。同じ長さの複数のパターンを調べても文字毎のハッシュ値の更新操作は一回なので、バッチ処理とかで同じ長さのパターン毎にまとめて処理するなら得かもしれない。ただ、一度に20パターンとかを比較しないとBM法には勝てないだろう。

まとめ。やはり、大抵の場合には標準のstring::findを使うのが最善だろう。字種がやたら少ない場合にはmemmemが早いこともあるが、処理系依存の実装を積極的に使うほどの利点でもない。テキストが巨大でパターンが短くない場合にはBM法を積極的に使うべきだ。とはいえ、巨大なデータの探索は実際は低レイヤーのデータ入出力に律速されることが多いので、各アルゴリズムの差が重要になるシーンはそんなにないかも。検索エンジンとかゲームとかなら別だろうけど。

たまに意味のないタスクをやると頭の体操になってよい。string::findよりmemchrの実装が早いっていうのは意外だが、それより意外なのは、一般的なユースケースではmemmemよりmemchrの実装の方が早いということだ。やってみないとわからんことは多い。BM法は予想通り早い。テキストが巨大な場合はこれ一択。

余談。唐突に思い出した話。昔、がんばれゴエモンというゲームがあり、その中で賭場に行くことができた。そこでの丁半勝負は、勝つと所持金が2倍になり、負けると所持金が半分になるというものだった。子供ながらに思ったのは、この勝負の期待値は (2 + 0.5) / 2 = 1.25 なので、めっちゃお得じゃんということだ。ただし、必ず全財産を賭けなければいけないというのがゲーム上のミソだ。対数スケールで考えると期待値は0なので、フェアなゲームとして成り立っていた。とはいえ、儲けた分で他の商品を買ったり、損した分を外のプレーで補填したりすれば対数の枠組みから抜けられるので、その手間さえかければお得な仕組みだった。さて、大人になって思った。株とかFXとかで、価値が半分になったら損切りして倍になったら利確するという取引を繰り返せば、ゴエモン式に儲かるんじゃないかと。しかし、常識的に考えて、そんなことはあり得ない。残念ながら、半減する確率と倍増する確率が同じであるという前提が誤りなのだ。必勝法などない。地道に働こう。

Nikon ミラーレス一眼カメラ Z50 ダブルズームキット NIKKOR Z DX 16-50mm+NIKKOR Z DX 50-250mm付属 Z50WZ ブラック

発売日: 2019/11/22
メディア: Camera

SONY デジタルカメラ DSC-RX100 1.0型センサー F1.8レンズ搭載ブラック Cyber-shot DSC-RX100

発売日: 2012/06/15
メディア: エレクトロニクス

豪鬼メモ

一瞬千撃

文字列探索法の比較