An inverted-index search system in Python - from __future_

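I tried porting the code from Tatsuo Yamashita's article "転置インデックスによる検索システムを作ってみよう！" ("Let's Build a Search System with an Inverted Index!") to Python. It needs Python 2.5 to run.
The file format and usage are the same as in the original, but I don't validate the format properly. I wish regular-expression captures were just as easy to write in Python.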
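
For what it's worth, captures in Python go through the match object returned by the re module. The sketch below shows one way the two kinds of index-file line could be validated. It is written for Python 3 rather than the 2.5 used in the scripts, and the pattern names and the parse_index_line helper are made up for illustration.

```python
import re

# Hypothetical patterns for the two kinds of line in the index file.
NUM_LINE = re.compile(r"#NUM=(\d+)$")
ENTRY_LINE = re.compile(r"(?P<bigram>.{2}) (?P<ids>[^,\s]+(?:,[^,\s]+)*)$")

def parse_index_line(line):
    """Return ("num", n), ("entry", bigram, ids), or None for a malformed line."""
    line = line.rstrip("\n")
    m = NUM_LINE.match(line)
    if m:
        return ("num", int(m.group(1)))
    m = ENTRY_LINE.match(line)
    if m:
        # Named groups are read back from the match object by name.
        return ("entry", m.group("bigram"), m.group("ids").split(","))
    return None

print(parse_index_line("#NUM=6"))
print(parse_index_line("ペン 1,3,6"))
print(parse_index_line("broken line"))
```

Returning None for a malformed line leaves it to the caller to skip the line or to stop with an error.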

index.py

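#!/usr/bin/env python
# Build a character-bigram inverted index from "ID text" lines on stdin.
import sys, codecs
from collections import defaultdict
# Python 2: wrap stdout so that unicode bigrams are written out as UTF-8.
sys.stdout = codecs.getwriter("utf-8")(sys.stdout)
index = defaultdict(list)  # bigram -> list of document IDs
num_docs = 0
for line in sys.stdin:
    doc = line.decode("utf-8").strip().split(" ", 1)
    if len(doc) != 2: continue  # skip lines that are not "ID text"
    doc_id, text = doc
    bigram_set = set()  # bigrams already recorded for this document
    for i in xrange(len(text)-1):
        bigram = text[i:i+2]
        if bigram in bigram_set: continue
        index[bigram].append(doc_id)
        bigram_set.add(bigram)
    num_docs += 1
# Header line with the document count, then one "bigram id,id,..." line per bigram.
print "#NUM=%s" % num_docs
for key in sorted(index.keys()):
    print "%s %s" % (key, ",".join(index[key]))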
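
The same indexing step can be written as a function, which makes it easier to try out on a small input. This is a Python 3 sketch, not the script above. The name build_index and the in-memory dict it returns are choices made for the example.

```python
from collections import defaultdict

def build_index(lines):
    """Map each character bigram to the IDs of the documents containing it."""
    index = defaultdict(list)
    num_docs = 0
    for line in lines:
        parts = line.strip().split(" ", 1)
        if len(parts) != 2:
            continue  # not an "ID text" line
        doc_id, text = parts
        # A set yields each distinct bigram of this document exactly once.
        for bigram in {text[i:i + 2] for i in range(len(text) - 1)}:
            index[bigram].append(doc_id)
        num_docs += 1
    return dict(index), num_docs

index, num_docs = build_index(["1 abca", "2 bcd"])
for bigram in sorted(index):
    print(bigram, ",".join(index[bigram]))
```

Here "bc" ends up with both IDs, while "ab", "ca" and "cd" each point at a single document.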

search.py

#!/usr/bin/env python
# Rank the indexed documents against each query line on stdin by bigram TF-IDF.
from __future__ import with_statement
import sys, math
from collections import defaultdict
index = dict()  # bigram -> list of document IDs
num_docs = 0
with open(sys.argv[1]) as f:
    for line in f:
        line = line.decode("utf-8").strip()
        if line.startswith("#NUM="):
            # float, so that the division inside log() below is not int division
            num_docs = float(line.split("=")[-1])
        else:
            bigram, ids = line.split(" ")
            index[bigram] = ids.split(",")
for line in sys.stdin:
    line = line.decode("utf-8").strip()
    score_dict = defaultdict(float)  # document ID -> accumulated score
    tf_dict = defaultdict(int)  # bigram -> number of occurrences in the query
    for i in xrange(len(line)-1):
        tf_dict[line[i:i+2]] += 1
    for bigram, tf in tf_dict.iteritems():
        # Bigrams missing from the index add nothing to any score.
        df = len(index[bigram]) if (bigram in index) else 1
        idf = math.log(num_docs/(df+1))
        tf_idf = tf * idf
        for doc_id in index.get(bigram, ()):
            score_dict[doc_id] += tf_idf
    # Highest score first.
    for doc_id, score in sorted(score_dict.items(), key=lambda item: item[1], reverse=True):
        print "ID:%s SCORE:%s" % (doc_id, score)
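
Stripped of the I/O, the scoring is a sum of tf * log(N / (df + 1)) over the bigrams of the query. The following Python 3 sketch isolates that step. The function name score_query and the toy index are invented for the example.

```python
import math
from collections import Counter, defaultdict

def score_query(query, index, num_docs):
    """Return (doc_id, score) pairs, best first, weighting by tf * log(N / (df + 1))."""
    tf = Counter(query[i:i + 2] for i in range(len(query) - 1))
    scores = defaultdict(float)
    for bigram, count in tf.items():
        postings = index.get(bigram, [])
        if not postings:
            continue  # a bigram absent from the index contributes nothing
        weight = count * math.log(num_docs / (len(postings) + 1))
        for doc_id in postings:
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy index over four imaginary documents.
toy_index = {"ab": ["1"], "bc": ["1", "2"], "cd": ["2", "3", "4"]}
ranking = score_query("abc", toy_index, 4)
print(ranking)
```

One property of this weight worth keeping in mind: a bigram found in N - 1 documents gets weight log(1) = 0, and one found in all N documents gets a slightly negative weight.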

Running it

$ cat test.txt
1 これはペンです
2 最近はどうですか？
3 ペンギン大好き
4 こんにちは。いかがおすごしですか？
5 ここ最近疲れ気味
6 ペンキ塗りたてで気味が悪いです
$ ./index.py < test.txt > test.idx
$ echo '最近ペンギンが好きです' | ./search.py test.idx
ID:3 SCORE:3.70130197411
ID:2 SCORE:0.875468737354
ID:5 SCORE:0.69314718056
ID:1 SCORE:0.587786664902
ID:6 SCORE:0.587786664902
ID:4 SCORE:0.182321556794
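
The top score can be checked by hand. Document 3 shares four bigrams with the query: ペン, which occurs in three of the six documents, and ンギ, ギン and 好き, which occur only in document 3. A short Python 3 calculation (the variable names are only for this example) gives the same figure up to floating-point rounding:

```python
import math

num_docs = 6.0
# ペン has df = 3; ンギ, ギン and 好き each have df = 1.
score_doc3 = math.log(num_docs / (3 + 1)) + 3 * math.log(num_docs / (1 + 1))
print(score_doc3)
```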

At first I did the division in front of log between two ints, which made the scores far too coarse. That is why num_docs is read as a float.
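
To illustrate: in Python 2, / between two ints truncates, so the ratio handed to log loses its fractional part. In Python 3 the same truncation is spelled //, which the sketch below uses to reproduce the effect for a bigram with df = 3 among 6 documents (the figures for ペン in the sample data).

```python
import math

num_docs, df = 6, 3

truncated = math.log(num_docs // (df + 1))    # 6 // 4 == 1, so log(1) == 0.0
exact = math.log(float(num_docs) / (df + 1))  # 6.0 / 4 == 1.5

print(truncated)
print(exact)
```

With truncation, every bigram whose ratio falls between 1 and 2 scores exactly 0, so many documents end up tied.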