Guide for making index-files

Last Modified: 1999-10-20

Yamasita, Tatuo (translated by Kaoru, Yamamoto)

Notation: $SUFARY : Directory of the SUFARY package

Introduction

SUFARY can perform two kinds of text searching described below:

String Search: String Search finds the character positions at which a keyword occurs in a text.
Text Area Search: A "text area" is a string enclosed by a pair of SGML-like tags. For example, the string "Article" enclosed by the tags "<article>" and "</article>", and the string "Title" enclosed by the tags "<title>" and "</title>", are text areas. Text Area Search finds the text areas in which a keyword occurs.

For either kind of search, SUFARY needs an "array file" in addition to the text you want to search. You can build an array file with the "mkary" command (see below).

For a text area search, SUFARY also needs a "DocID file" in addition to the text and the array file. You can build a DocID file with the "mkdid" command (see below).

// Files and Builders //

File	Purpose	Builder
array file	required for String Search and Text Area Search	mkary
DocID file	required for Text Area Search	mkdid

In SUFARY 2.1b3 and earlier, array files and DocID files are CPU-dependent because their byte order follows that of the CPU. From SUFARY 2.1 onward, the byte order is *always* big endian, irrespective of the CPU platform. To convert an array file or a DocID file from little endian to big endian, use the following Perl one-liner, which reverses each group of four bytes:

% perl -e '$_ = join "", <>; s/(.)(.)(.)(.)/$4$3$2$1/gs; print' \
  foo.ary > foo.ary.new
% mv foo.ary.new foo.ary
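
The same conversion can be sketched in Python for readers without Perl. This is only an illustrative sketch: it assumes, as the one-liner implies, that the file consists entirely of 4-byte integers. The names swap32 and convert_file are hypothetical and are not part of SUFARY. Unlike the one-liner, it rejects input whose length is not a multiple of 4 instead of passing the trailing bytes through.

```python
def swap32(data: bytes) -> bytes:
    """Reverse the byte order of every 4-byte group in data."""
    if len(data) % 4 != 0:
        raise ValueError("length is not a multiple of 4 bytes")
    out = bytearray(len(data))
    for i in range(4):
        # Byte i of each group moves to position 3 - i of the same group.
        out[3 - i::4] = data[i::4]
    return bytes(out)


def convert_file(src: str, dst: str) -> None:
    """Read src, swap the byte order of each 4-byte group, write dst."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(swap32(data))
```

Calling convert_file("foo.ary", "foo.ary.new") and then renaming the result mirrors the two shell commands above.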

How to make an array file

Depending on the granularity at which you wish to search, you need to make different kinds of array files.

If you make a character-based array file, you can find every substring of the text. For example, with a character-based array file for samp1.txt below, you can find not only substrings such as "YAMASITA" and "Tatuo" but also "ASITA T" and "st-na".

// samp1.txt //

YAMASITA Tatuo
tatuo-y@is.aist-nara.ac.jp
http://cl.aist-nara.ac.jp/~tatuo-y/
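
The idea behind a character-based array file can be sketched in Python as follows. This is a naive illustration, not mkary's implementation: it assumes that an array file is essentially a list of text positions sorted by the text that starts at each position, and the function names build_array and search are hypothetical.

```python
TEXT = (
    "YAMASITA Tatuo\n"
    "tatuo-y@is.aist-nara.ac.jp\n"
    "http://cl.aist-nara.ac.jp/~tatuo-y/\n"
)


def build_array(text, index_points):
    """Sort the index points by the suffix of text that starts at each one."""
    return sorted(index_points, key=lambda p: text[p:])


def search(text, array, key):
    """Return the text positions in array at which key occurs, in text order."""
    lo, hi = 0, len(array)
    # Binary search for the first suffix whose leading characters are >= key.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[array[mid]:array[mid] + len(key)] < key:
            lo = mid + 1
        else:
            hi = mid
    # All matching suffixes are adjacent in the sorted array.
    hits = []
    while lo < len(array) and text.startswith(key, array[lo]):
        hits.append(array[lo])
        lo += 1
    return sorted(hits)


# Character-based: every position in the text is an index point.
char_array = build_array(TEXT, range(len(TEXT)))
print(search(TEXT, char_array, "ASITA T"))
print(search(TEXT, char_array, "st-na"))
```

Because every position is an index point, any substring can be located with one binary search, at the cost of one array entry per character.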

If you make a line-based array file, you can find only substrings that start at the head of a line. For example, with a line-based array file for samp1.txt above, you can find substrings such as "YAMASITA", "YAM", and "http", but not "aist" or "Tatuo", because these do not start at the head of a line. A line-based array file suits dictionary lookup, as in samp2.txt below. For the same plain text, a line-based array file is smaller than a character-based one.

// samp2.txt //

fish SAKANA
boy OTOKONOKO
girl ONNANOKO
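
The following Python sketch illustrates why a line-based array is smaller and why it finds only strings at the head of a line. It is an illustration under the assumption that a line-based array holds one sorted entry per line head; the function names are hypothetical, and the offsets it returns are plain character offsets, not necessarily the numbers that SUFARY tools print.

```python
from bisect import bisect_left

TEXT = "fish SAKANA\nboy OTOKONOKO\ngirl ONNANOKO\n"


def line_heads(text):
    """Return the offset of the first character of every line."""
    heads = [0] if text else []
    for i, ch in enumerate(text):
        if ch == "\n" and i + 1 < len(text):
            heads.append(i + 1)
    return heads


def build_line_array(text):
    """Sort the line heads by the text that starts at each of them."""
    return sorted(line_heads(text), key=lambda p: text[p:])


def lookup(text, array, key):
    """Return the offsets of the lines that begin with key."""
    # Truncating sorted suffixes to len(key) keeps them in sorted order.
    prefixes = [text[p:p + len(key)] for p in array]
    i = bisect_left(prefixes, key)
    hits = []
    while i < len(array) and prefixes[i] == key:
        hits.append(array[i])
        i += 1
    return sorted(hits)


line_array = build_line_array(TEXT)
print(len(line_array), "entries for", len(TEXT), "characters")
print(lookup(TEXT, line_array, "girl"))    # head of a line: found
print(lookup(TEXT, line_array, "SAKANA"))  # middle of a line: not found
```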

The program that makes array files is $SUFARY/mkary/mkary.

Example

// making a character-based array file //

array% mkary /home/tatuo-y/data/ecoli
Save to "/home/tatuo-y/data/ecoli.ary"
Reading text file "/home/tatuo-y/data/ecoli"
++++++++++++++++++++ 1M
++++++++++++++++++++ 2M
++++++++++++++++++++ 3M
++++++++++++++++++++ 4M
++++++++++++
Sorting...
Saving...
Done.

// making a line-based array file //

array% mkary -l samp2.txt
Save to "samp2.txt.ary"
Reading text file "samp2.txt"
 
Sorting...
Saving...
Done.

A handy search program, sass, which displays search results line by line, can be found in $SUFARY/tools/sass.

% sass girl samp2.txt
19:0:girl ONNANOKO
% sass boy samp2.txt
8:0:boy OTOKONOKO

Usage

mkary [-c|w|l|b] [-#] [-q] [-ns] [-so] [-J] [-o ARRAY_FILE]
        [-M MEGABYTE] TEXT_FILE

TEXT_FILE is the name of the file you wish to search. By default, the corresponding array file is named "TEXT_FILE.ary".

-o ARRAY_FILE

Specifies the file name of the array file. If this option is omitted, defaults to TEXT_FILE.ary.

-c

Makes a character-based array file, which lets you find any substring in the text file. A Japanese character (EUC-JP) is assumed to occupy 2 bytes.

-l

Makes a line-based array file, which lets you find any line prefix (a substring beginning at the head of a line). Suitable for dictionary search.

-w

Makes a word-based array file. A word is defined as a string delimited by a newline, a space, or a tab.

-b

Makes a byte-based array file.

-J

Ignores all characters except Japanese characters (EUC) and '<' when making a character-based array file. As a result, strings that begin with anything other than a Japanese character (EUC) or '<' can no longer be searched. In return, the array file becomes smaller.

-q

Quiet: Displays no messages during execution.

-ns

No Sort: Makes an UNSORTED array file. An unsorted array file cannot be used for string search or text area search.

-so

Sort Only: Sorts an existing array file, which can then be used for string search or text area search.

-#

Ignores lines beginning with "#". Valid only when the -l option is also in use.

-M MEGABYTE

Partitions the text, sorts each partition, and merges the results. MEGABYTE is the size of each partition. Use this option when little memory is available.

Example: mkary -M 3 sample.txt
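
The partition, sort, and merge steps of the -M option can be sketched in Python as follows. This is a hedged illustration only: the function name and the chunk_size parameter (counted in characters, not megabytes) are hypothetical, and the sketch keeps the whole text in memory, so it shows the structure of the method rather than its memory savings.

```python
import heapq


def build_array_partitioned(text, chunk_size):
    """Sort the index points chunk by chunk, then merge the sorted runs."""

    def suffix(p):
        return text[p:]

    runs = []
    for start in range(0, len(text), chunk_size):
        points = range(start, min(start + chunk_size, len(text)))
        # Each partition is sorted on its own.
        runs.append(sorted(points, key=suffix))
    # A k-way merge combines the sorted runs into one sorted array.
    return list(heapq.merge(*runs, key=suffix))


print(build_array_partitioned("abracadabra", 4))
```

The merged result is the same array that a single sort over all positions would produce; only the order of work differs.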

How to make a DocID file

Suppose a text contains multiple articles, each enclosed by the tags <ARTICLE> and </ARTICLE>, as in samp3.txt, and we wish to find the articles that contain the string "自然言語処理".

// samp3.txt //

<ARTICLE>
形態素システム『茶筌』は２０世紀末に奈良先端大で開発された・・・（略）
・・・フリーソフトとして公開・・・（略）・・・
</ARTICLE>
<ARTICLE>
２１世紀初頭の自然言語処理システム開発への過剰な投資により、粗悪製品が
乱造され・・・（略）・・・若者の自然言語処理ばなれが深刻・・・（略）・
・・結局我々人間は歴史から何も学んでいないということを実感させられる。
</ARTICLE>

Array files let us find the positions of the string "自然言語処理" in a text, but with array files alone it is not easy to find which articles contain it. SUFARY offers DocID files for this task. A DocID file records the positions of the start tags (e.g. <ARTICLE>) and the end tags (e.g. </ARTICLE>), which makes text area search efficient. For more information, visit the SUFARY home page at <http://cl.aist-nara.ac.jp/lab/nlt/ss/>.
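
How tag positions turn a string hit into a text area can be sketched in Python. This is a simplified illustration, not the DocID file format: it assumes only that the start and end offsets of each area are known, it uses str.find in place of an array file, and all function names are hypothetical.

```python
from bisect import bisect_right

SAMPLE = (
    "<ARTICLE>\n茶筌\n</ARTICLE>\n"
    "<ARTICLE>\n自然言語処理\n</ARTICLE>\n"
)


def text_areas(text, start_tag, end_tag):
    """Return the (begin, end) offsets of every area enclosed by the tags."""
    areas, pos = [], 0
    while True:
        begin = text.find(start_tag, pos)
        if begin < 0:
            break
        close = text.find(end_tag, begin + len(start_tag))
        if close < 0:
            break
        end = close + len(end_tag)
        areas.append((begin, end))
        pos = end
    return areas


def area_of(areas, position):
    """Return the index of the area that contains position, or None."""
    begins = [begin for begin, _ in areas]
    i = bisect_right(begins, position) - 1
    if i >= 0 and position < areas[i][1]:
        return i
    return None


def area_search(text, areas, key):
    """Return the indexes of the areas in which key occurs."""
    found = set()
    pos = text.find(key)
    while pos >= 0:
        i = area_of(areas, pos)
        if i is not None:
            found.add(i)
        pos = text.find(key, pos + 1)
    return sorted(found)


areas = text_areas(SAMPLE, "<ARTICLE>", "</ARTICLE>")
for i in area_search(SAMPLE, areas, "自然言語処理"):
    begin, end = areas[i]
    print(SAMPLE[begin:end])
```

Because the area offsets are sorted, each hit position is mapped to its enclosing area with one binary search.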

Example

The program that makes DocID files is $SUFARY/mkdid/mkdid.

First, make an array file with the mkary command.

% mkary samp3.txt
Save to "samp3.txt.ary"
Reading text file "samp3.txt"

Sorting...
Saving...
Done.

Then make a DocID file by specifying the tags that delimit a text area. By default, the corresponding DocID file is named "samp3.txt.did".

% mkdid '<ARTICLE>' '</ARTICLE>' samp3.txt
Number of Documents = 2
sorting...
writting...
done.

A handy text area search program, af, which displays the matching text areas, can be found in $SUFARY/tools/af.

% af '自然言語処理' samp3.txt samp3.txt.did
FOUND 1
<ARTICLE>
２１世紀初頭の自然言語処理システム開発への過剰な投資により、粗悪製品が
乱造され・・・（略）・・・若者の自然言語処理ばなれが深刻・・・（略）・
・・結局我々人間は歴史から何も学んでいないということを実感させられる。
</ARTICLE>

However, text areas are not always delimited by a matching pair of start and end tags. For example, the text below has no end tag.

// samp4.txt //

#ID-001
形態素システム『茶筌』は２０世紀末に奈良先端大で開発された・・・（略）
・・・フリーソフトとして公開・・・（略）・・・
#ID-002
２１世紀初頭の自然言語処理システム開発への過剰な投資により、粗悪製品が
乱造され・・・（略）・・・結局我々人間は歴史から何も学んでいないという
ことを実感させられる。
#ID-003
裏自然言語処理研究会のお知らせ：本日午後３時・・・（略）・・・ふるって
御参加下さい。

In such a case, specify only the one tag that appears in the text.

% mkdid '#ID-' samp4.txt
Number of Documents = 3
sorting...
writting...
done.
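
One plausible reading of the single-tag case is that each text area runs from one start tag up to the next start tag, or to the end of the text for the last area. The Python sketch below illustrates that reading only. It is an assumption, not a description of how mkdid actually sets the area boundaries, and the function name is hypothetical.

```python
def areas_from_start_tag(text, start_tag):
    """Return (begin, end) offsets, assuming an area ends where the next begins."""
    begins = []
    pos = text.find(start_tag)
    while pos >= 0:
        begins.append(pos)
        pos = text.find(start_tag, pos + len(start_tag))
    # Assumption: the last area extends to the end of the text.
    ends = begins[1:] + [len(text)]
    return list(zip(begins, ends))


sample = "#ID-001\naaa\n#ID-002\nbbb\n#ID-003\nccc\n"
areas = areas_from_start_tag(sample, "#ID-")
print("Number of areas =", len(areas))
```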

Usage

mkdid [-q] [-o DOCID_FILE] START_TAG [END_TAG] TEXT_FILE

TEXT_FILE is the name of the file you wish to search. The corresponding array file "TEXT_FILE.ary" must also exist. START_TAG and END_TAG are the tags that open and close a text area, respectively. END_TAG is optional.

-o DOCID_FILE: Specifies the file name of the DocID file. If this option is omitted, defaults to TEXT_FILE.did.
-q: Quiet: Displays no messages during execution.

tatuo-y@is.aist-nara.ac.jp