バッチで文字列探索する際の性能

コロナ騒ぎであまり外に出られないこの情勢では、せっかくだから文化系の活動をしようじゃないかと思うわけだ。以前の記事で、息抜きに文字列探索法の比較をしてみた。今回は、バッチ処理の場合の傾向を見てみる。
f:id:fridaynight:20200223120834j:plain

今回は、テキスト内にパターンが最初に現れた位置を探すだけでなく、パターンに一致する全ての部分の位置を探すことにする。また、パターンは複数同時に与えられるものとして、その各々に対して一致したインデックスのリストを得られるものとする。したがって、インターフェイスは以下のようになる。C++17から入ったstring_viewも使ってみる。

std::vector<std::vector<int32_t>> StrSearchBatch(
    std::string_view text, const std::vector<std::string>& patterns);

リファレンス実装として、string::findを使った実装を用意する。まずは一致する全ての位置を返すようにループを書く。パターンが空文字列の場合は、定義上、あらゆる部分文字列が一致するので、全ての文字のインデックスが返される。

std::vector<int32_t> StrSearchWhole(std::string_view text, std::string_view pattern) {
  std::vector<int32_t> result;
  if (pattern.empty()) {
    result.reserve(text.size());
    for (size_t i = 0; i < text.size(); ++i) {
      result.emplace_back(i);
    }
    return result;
  }
  size_t pos = 0;
  while (pos != text.size()) {
    pos = text.find(pattern, pos);
    if (pos == std::string::npos) {
      break;
    }
    result.emplace_back(pos);
    pos++;
  }
  return result;
}

複数のパターンを一気に受け取った場合に最適化できる余地があるのがバッチ処理インターフェイスの良いところだが、string::findを使った実装ではそれは不可能なので普通に逐次処理するコードを書く。

std::vector<std::vector<int32_t>> StrSearchBatch(
    std::string_view text, const std::vector<std::string>& patterns) {
  std::vector<std::vector<int32_t>> result;
  result.resize(patterns.size());
  for (size_t i = 0; i < patterns.size(); i++) {
    result[i] = StrSearchWhole(text, patterns[i]);
  }
  return result;
}

実践ではあまり役に立たないKMP法でも一致を全部返すように実装する。

std::vector<int32_t> StrSearchWholeKMP(std::string_view text, std::string_view pattern) {
  std::vector<int32_t> result;
  if (pattern.empty()) {
    for (size_t i = 0; i < text.size(); ++i) {
      result.emplace_back(i);
    }
    return result;
  }
  std::vector<int32_t> table(pattern.size() + 1);
  table[0] = -1;
  size_t pattern_index = 0;
  int32_t offset = -1;
  while (pattern_index < pattern.size()) {
    while (offset >= 0 && pattern[pattern_index] != pattern[offset]) {
      offset = table[offset];
    }
    pattern_index++;
    offset++;
    table[pattern_index] = offset;
  }
  size_t text_index = 0;
  while (text_index < text.size()) {
    offset = 0;
    while (text_index < text.size() &&
           offset < static_cast<int32_t>(pattern.size())) {
      while (offset >= 0 && text[text_index] != pattern[offset]) {
        offset = table[offset];
      }
      text_index++;
      offset++;
    }
    if (offset == static_cast<int32_t>(pattern.size())) {
      const size_t offset = text_index - pattern.size();
      result.emplace_back(offset);
      text_index = offset + 1;
    }
  }
  return result;
}

自然言語のテキストは、文字の種類が比較的多い。英文小文字だけでも26字で、日本語だと仮名と常用漢字を足して2300字ほどだ（いずれにせよ、UTF-8などで表現してバイト列を比較するなら256字以下になる）。そのような場合、BM法（Boyer-Moore法）による文字列探索が圧倒的に高速だ。不一致が発生した文字によって探索位置をスキップさせるのだが、それはパターン内でその文字が最初に現れた位置で決まる。よって文字種が多いほどスキップ幅の期待値が上がる。BM法で一致を全部返す実装はこんな感じになる。

std::vector<int32_t> StrSearchWholeBM(std::string_view text, std::string_view pattern) {
  std::vector<int32_t> result;
  if (pattern.empty()) {
    for (size_t i = 0; i < text.size(); ++i) {
      result.emplace_back(i);
    }
    return result;
  }
  int32_t table[UINT8MAX];
  for(int32_t table_index = 0; table_index < UINT8MAX; table_index++) {
    table[table_index] = pattern.size();
  }
  int32_t shift = pattern.size();
  int32_t pattern_index = 0;
  while (shift > 0){
    const uint8_t table_index = pattern[pattern_index++];
    table[table_index] = --shift;
  }
  int32_t pattern_end = pattern.size() - 1;
  int32_t begin_index = 0;
  int32_t end_index = text.size() - pattern_end;
  while (begin_index < end_index) {
    int32_t pattern_index = pattern_end;
    bool hit = false;
    while (text[begin_index + pattern_index] == pattern[pattern_index]) {
      if (pattern_index == 0) {
        result.emplace_back(begin_index);
        hit = true;
        break;
      }
      pattern_index--;
    }
    const uint8_t table_index = text[begin_index + pattern_index];
    const int32_t step = table[table_index] - pattern_end + pattern_index;
    begin_index += step > 0 ? step : (hit ? 1 : 2);
  }
  return result;
}

KMP法とBM法でもバッチ処理による最適化は見込めないので、普通に逐次処理するラッパーを書いておく。StrSearchBatchと同じ実装なので割愛。

さて、何でバッチ処理と連呼しているかというと、RK法（Rabin-Karp法）ではバッチ処理による最適化ができるからだ。受け取ったパターンの各々に対して予めハッシュ値を計算しておいて、テキストの各位置を調べる毎に各々のハッシュ値と比較する。ハッシュ値が一致した場合にのみその位置で通常の一致検査をすることで、ハッシュ法による高速化の恩恵を得つつ、偽陽性を回避している。テキスト側のハッシュ値の算出はローリングハッシュ法を用いる。先頭の文字の情報をシフトアウトして次の文字の情報をシフトインすることでハッシュ値の更新ができるため、ハッシュ値の計算はパターン長によらない。また、テキスト側のハッシュ値の計算を1回だけすれば、バッチ内の全パターンと比較できるので、バッチサイズが上がるほどに効率化する。しかし、パターン長が異なるとローリングハッシュ法の一貫性がなくなるので、バッチ処理で最適化できるのは長さが同じパターンが複数ある場合だけである。ということで、パターンの長さ毎に自動的にミニバッチを作ってからRK法による検査を行うような実装を書いてみた。

std::vector<std::vector<int32_t>> StrSearchBatchRK(
    std::string_view text, const std::vector<std::string>& patterns) {
  std::vector<std::vector<int32_t>> result(patterns.size());
  const unsigned char* text_p = reinterpret_cast<const unsigned char*>(text.data());
  std::map<size_t, std::vector<size_t>> batch;
  for (size_t i = 0; i < patterns.size(); i++) {
    batch[patterns[i].size()].emplace_back(i);
  }
  for (const auto& batch_item : batch) {
    const size_t pattern_size = batch_item.first;
    const auto& pattern_indices = batch_item.second;
    if (pattern_size < 1) {
      for (const auto& pattern_index : pattern_indices) {
        for (size_t i = 0; i < text.size(); i++) {
          result[pattern_index].emplace_back(i);
        }
      }
      continue;
    }
    constexpr int32_t base = 239;
    constexpr int32_t modulo = 1798201;
    int32_t power = 1;
    for (size_t i = 0; i < pattern_size; i++) {
      power = (power * base) % modulo;
    }
    std::vector<std::pair<int32_t, size_t>> pattern_hashes;
    for (const auto& pattern_index : pattern_indices) {
      const auto& pattern = patterns[pattern_index];
      const unsigned char* pattern_p = reinterpret_cast<const unsigned char*>(pattern.data());
      int32_t pattern_hash = 0;
      for (size_t i = 0; i < pattern_size; i++) {
        const int32_t c = i < pattern.size() ? pattern_p[i] : 0;
        pattern_hash = pattern_hash * base + c;
        pattern_hash %= modulo;
      }
      pattern_hashes.emplace_back(std::make_pair(pattern_hash, pattern_index));
    }
    int32_t text_hash = 0;
    for (size_t i = 0; i < text.size(); i++) {
      text_hash = text_hash * base + text_p[i];
      text_hash %= modulo;
      if (i >= pattern_size) {
        text_hash -= power * text_p[i - pattern_size] % modulo;
        if (text_hash < 0) {
          text_hash += modulo;
        }
      }
      for (const auto& pattern_hash : pattern_hashes) {
        const auto& pattern = patterns[pattern_hash.second];
        if (pattern_hash.first == text_hash) {
          if (i >= pattern.size() - 1) {
            const size_t offset = i - (pattern.size() - 1);
            if (std::memcmp(text_p + offset, pattern.data(), pattern.size()) == 0) {
              result[pattern_hash.second].emplace_back(offset);
            }
          }
        }
      }
    }
  }
  return result;
}

性能テストを行う。まずはユースケースを想定しよう。英字小文字26文字からランダムに選んだ文字が100,000文字入ったテキストを用意する。100Kの英文書を想定している。それに対して、10文字からなるパターンの全出現位置を取得する。バッチに含まれるパターン数は倍々で増加させる。テキストとパターンは毎回変えつつ、1000回実行してその合計時間を計測する。

バッチサイズ	1	2	4	8	16	32	64	128	256
std::find	0.088	0.176	0.350	0.721	1.406	2.853	5.727	11.393	22.833
KMP法	0.224	0.451	0.898	1.845	3.573	7.251	14.507	29.050	57.973
BM法	0.062	0.124	0.257	0.516	0.994	2.060	3.973	7.957	16.182
RK法	0.460	0.463	0.522	0.692	0.999	2.854	3.245	5.554	8.903

バッチ最適化ができないstd::findとKMPとBMの三者はバッチサイズに比例して時間がかかるのは当然の結果だ。一方でRK法の時間増加は線形よりも穏やかだ。バッチ1では最も遅いRK法である。KMP法にすら負けて、BM法に比べて7倍近い時間がかかっていた。しかし、バッチ4でKMP法を抜き、バッチ8でstd::findを抜き、バッチ64でBM法を抜くという結果になった。

BM法は文字種が多いほど利点があると上に書いたが、じゃあ文字種がもっと少ないとどうだろう。例えばDNAの塩基ATGCの4文字からなるテキストを想定してテストしてみよう。その他の条件は同じにする。

バッチサイズ	1	2	4	8	16	32	64	128	256
std::find	0.306	0.605	1.161	2.353	4.622	9.476	18.820	37.692	75.401
KMP法	0.449	0.491	1.735	3.590	6.953	13.961	28.195	56.482	112.842
BM法	0.247	0.491	0.940	1.899	3.855	7.746	15.340	30.502	60.989
RK法	0.477	0.484	0.513	0.710	1.041	2.798	3.285	5.569	8.855

文字種が減ると部分一致が増えるので大抵の実装は効率が悪化するが、ハッシュ値を使うRK法ではその影響は小さいので、相対的にRK法が有利になる。バッチ1では最弱だが、バッチ2で既に他の全実装を凌駕する結果になった。バッチ256の結果だけ見ると、RK法ってばめちゃめちゃ早いと思うだろう。

文字種26字の時も文字種4字の時も、RK法だとバッチ16とバッチ32の間のギャップが大きく、なぜか線形よりも悪化している。ミニバッチ用の配列のデータがCPUのキャッシュに乗るか乗らないかとかそんな感じの違いがあるような気もするが、実際のところは不明だ。それでも、全体的な傾向を見れば、RK法のバッチ数に対する反応が線形より緩やかであることは明らかだ。

まとめ。インデックスなしの文字列検索と言えばBM法を使うのが定石だが、それが最善でないケースもある。同一サイズのパターンで64個以上のバッチ処理ができる場合はRK法が有利だ。実験する前はバッチ8くらいでRK法がBM法を上回ると予想していたのだが、ミニバッチ処理のオーバーヘッドが思ったより大きいらしく、バッチ64からという塩っぱい結果になった。それが分かったというのは収穫ではある。また、文字種が少ない場合にはRK法が有利になる状況がより増えるということも確認できた。文字4種類ならバッチ2から利得がある。とはいえ、かなり特殊なケースであることは否めないけれど。

同じテキストに対して何度もパターン検索を行う場合、テキストからSuffix Arrayなどのインデックスを作っておいて何度も利用する方が検索処理自体は圧倒的に早いだろう。しかし、テキストが巨大な場合はインデックス構築にかかる時間と空間がうざいので、バッチサイズが数百程度だったら逐次探索で頑張らせた方が便利なことも多い。その際にはBM法とRK法が有力な候補になる。大抵の場合にはBM方が良いが、特殊なケースではRK法も検討に値する。パターン長毎にバッチを作ってから、各スロットで、一定数以上あればRK法、そうでなければBM法を呼ぶように実装すれば最強かも。