Last Modified: 1999-10-20
Yamasita, Tatuo (translated by Kaoru, Yamamoto)
SUFARY can perform two kinds of text searching: string search and text area search.
In either search, SUFARY needs an "array file" in addition to the text you want to search. You can build an array file using the "mkary" command (see below).
Moreover, for a text area search, SUFARY needs a "DocID file" in addition to the text and the array file. You can build a DocID file using the "mkdid" command (see below).
// Files and Builders //
File       | Purpose                       | Builder |
array file | required for String Search    | mkary   |
DocID file | required for Text Area Search | mkdid   |
For SUFARY versions 2.1b3 or earlier, array files and DocID files are CPU-dependent because their byte order follows that of the machine. From SUFARY version 2.1 on, however, the byte order is *always* big endian regardless of the CPU platform. You can convert an array or DocID file from little endian to big endian with the following Perl one-liner:
% perl -e '$_ = join "", <>; s/(.)(.)(.)(.)/$4$3$2$1/gs; print' \
    foo.ary > foo.ary.new
% mv foo.ary.new foo.ary
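The same conversion can be sketched in Python. This is a minimal illustration, assuming (as the four-byte groups in the one-liner suggest) that the file is a sequence of 4-byte words; the names `swap_words` and `convert_file` are hypothetical and not part of SUFARY.

```python
import struct


def swap_words(data: bytes) -> bytes:
    """Reverse the byte order of every complete 4-byte group in data."""
    usable = len(data) - len(data) % 4          # ignore a trailing partial group
    count = usable // 4
    words = struct.unpack(f"<{count}I", data[:usable])   # read as little endian
    return struct.pack(f">{count}I", *words) + data[usable:]  # write as big endian


def convert_file(src: str, dst: str) -> None:
    """Read src in binary mode, swap its byte order, and write the result to dst."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(swap_words(data))


# Two little-endian words, 1 and 0x12345678, become big endian.
print(swap_words(bytes([1, 0, 0, 0, 0x78, 0x56, 0x34, 0x12])).hex())
# 0000000112345678
```

As with the Perl version, write to a new file first and replace the original only after the conversion has succeeded.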
Depending on the granularity at which you wish to search, you need to make different kinds of array files.
If you make a character-based array file, you can find every substring of the text. For example, if you make a character-based array file for samp1.txt below, you can find not only "YAMASITA" and "Tatuo" but also substrings such as "ASITA T" and "st-na".
// samp1.txt //
YAMASITA Tatuo
tatuo-y@is.aist-nara.ac.jp
http://cl.aist-nara.ac.jp/~tatuo-y/
If you make a line-based array file, you can find only substrings that begin at the head of a line. For example, if you make a line-based array file for samp1.txt above, you can find substrings such as "YAMASITA", "YAM", and "http", but not substrings such as "aist" and "Tatuo", because these do not begin at the head of a line. A line-based array file is suitable for dictionary search, as in samp2.txt below. For the same plain text, a line-based array file is smaller than a character-based one.
// samp2.txt //
fish SAKANA
boy OTOKONOKO
girl ONNANOKO
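The difference between the two granularities can be illustrated with a small Python sketch. It assumes that an array file is, in essence, a list of text positions sorted by the text that follows each position, and that samp1.txt consists of three lines; it is not SUFARY's implementation or file format, and the names `build_index` and `search` are hypothetical.

```python
def build_index(text: str, line_based: bool = False) -> list[int]:
    """Return start positions sorted by the text that follows them."""
    if line_based:
        # Index only the positions at the head of a line.
        positions = [0] + [i + 1 for i, ch in enumerate(text)
                           if ch == "\n" and i + 1 < len(text)]
    else:
        # Index every character position.
        positions = list(range(len(text)))
    return sorted(positions, key=lambda p: text[p:])


def search(text: str, index: list[int], pattern: str) -> list[int]:
    """Binary-search the index for indexed positions where pattern starts."""
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[index[mid]:index[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # All matches are adjacent in the sorted index.
    while lo < len(index) and text.startswith(pattern, index[lo]):
        hits.append(index[lo])
        lo += 1
    return sorted(hits)


samp1 = ("YAMASITA Tatuo\n"
         "tatuo-y@is.aist-nara.ac.jp\n"
         "http://cl.aist-nara.ac.jp/~tatuo-y/\n")
char_index = build_index(samp1)
line_index = build_index(samp1, line_based=True)

print(search(samp1, char_index, "ASITA T"))  # [3]
print(search(samp1, line_index, "YAM"))      # [0]
print(search(samp1, line_index, "aist"))     # []
print(len(line_index))                       # 3
```

The line-based index holds one entry per line rather than one per character, which is why it is smaller but can only match at the head of a line.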
The program that builds array files, mkary, can be found in $SUFARY/mkary/mkary.
// making a character-based array file //
array% mkary /home/tatuo-y/data/ecoli
Save to "/home/tatuo-y/data/ecoli.ary"
Reading text file "/home/tatuo-y/data/ecoli"
++++++++++++++++++++ 1M
++++++++++++++++++++ 2M
++++++++++++++++++++ 3M
++++++++++++++++++++ 4M
++++++++++++
Sorting...
Saving...
Done.
// making a line-based array file //
array% mkary -l samp2.txt
Save to "samp2.txt.ary"
Reading text file "samp2.txt.ary"
Sorting...
Saving...
Done.
A handy search program that displays search results line by line, sass, can be found in $SUFARY/tools/sass.
% sass girl samp2.txt
19:0:girl ONNANOKO
% sass boy samp2.txt
8:0:boy OTOKONOKO
mkary [-c|w|l|b] [-#] [-q] [-ns] [-so] [-J] [-o ARRAY_FILE] [-M MEGABYTE] TEXT_FILE
TEXT_FILE is the name of the file you wish to search. By default, the corresponding array file is named "TEXT_FILE.ary".
Example: mkary -M 3 sample.txt
Suppose that there is a text in which multiple articles are enclosed by the tags <ARTICLE> and </ARTICLE>, as in samp3.txt, and that we wish to find the articles that contain the string "自然言語処理" ("natural language processing").
// samp3.txt //
<ARTICLE>
形態素システム『茶筌』は20世紀末に奈良先端大で開発された・・・(略)
・・・フリーソフトとして公開・・・(略)・・・
</ARTICLE>
<ARTICLE>
21世紀初頭の自然言語処理システム開発への過剰な投資により、粗悪製品が
乱造され・・・(略)・・・若者の自然言語処理ばなれが深刻・・・(略)・
・・結局我々人間は歴史から何も学んでいないということを実感させられる。
</ARTICLE>
Array files enable us to find the positions of the string "自然言語処理" in a text. With array files alone, however, it is not easy to find which articles contain the string. For this task, SUFARY offers DocID files. A DocID file contains the positions of start tags (e.g. <ARTICLE>) and end tags (e.g. </ARTICLE>), which makes text area searching efficient. For more information, visit the SUFARY home page at <http://cl.aist-nara.ac.jp/lab/nlt/ss/>.
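The idea can be illustrated with a short Python sketch that records where each tagged area starts and ends, then maps each match position to the area that encloses it. This is not SUFARY's implementation or the DocID file format: the names `build_regions`, `find_all`, and `regions_containing` are hypothetical, a plain `str.find` scan stands in for the array-file lookup, and the sample text is an invented English stand-in.

```python
from bisect import bisect_right


def build_regions(text: str, start_tag: str, end_tag: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of each tagged area; end is exclusive."""
    regions = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            break
        close = text.find(end_tag, start + len(start_tag))
        if close == -1:
            break
        end = close + len(end_tag)
        regions.append((start, end))
        pos = end
    return regions


def find_all(text: str, pattern: str) -> list[int]:
    """Return every position where pattern occurs (stand-in for an array search)."""
    hits = []
    pos = text.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = text.find(pattern, pos + 1)
    return hits


def regions_containing(text: str, regions: list[tuple[int, int]], pattern: str) -> list[int]:
    """Return the indices of the areas that contain at least one match."""
    starts = [start for start, _ in regions]
    found = []
    for hit in find_all(text, pattern):
        i = bisect_right(starts, hit) - 1      # last area starting at or before hit
        if i >= 0 and hit + len(pattern) <= regions[i][1] and i not in found:
            found.append(i)
    return found


sample = (
    "<ARTICLE>\nChaSen was released as free software.\n</ARTICLE>\n"
    "<ARTICLE>\nInvestment in NLP systems grew; interest in NLP fell.\n</ARTICLE>\n"
)
regions = build_regions(sample, "<ARTICLE>", "</ARTICLE>")
print(len(regions))                                # 2
print(regions_containing(sample, regions, "NLP"))  # [1]
```

Because the area boundaries are sorted, each match is assigned to its area with a binary search instead of a scan over the whole text.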
The program that builds DocID files, mkdid, can be found in $SUFARY/mkdid/mkdid.
First, make an array file with the mkary command.
% mkary samp3.txt
Save to "samp3.txt.ary"
Reading text file "samp3.txt"
Sorting...
Saving...
Done.
Then, make a DocID file by specifying the tags that enclose a text area. By default, the corresponding DocID file is named "samp3.txt.did".
% mkdid '<ARTICLE>' '</ARTICLE>' samp3.txt
Number of Documents = 2
sorting...
writting...
done.
A handy text area search program that displays the matching text areas, af, can be found in $SUFARY/tools/af.
% af '自然言語処理' samp3.txt samp3.txt.did
FOUND 1
<ARTICLE>
21世紀初頭の自然言語処理システム開発への過剰な投資により、粗悪製品が
乱造され・・・(略)・・・若者の自然言語処理ばなれが深刻・・・(略)・
・・結局我々人間は歴史から何も学んでいないということを実感させられる。
</ARTICLE>
However, texts whose areas are delimited by a matching pair of start and end tags are rare. For example, the text below has no end tags.
// samp4.txt //
#ID-001
形態素システム『茶筌』は20世紀末に奈良先端大で開発された・・・(略)
・・・フリーソフトとして公開・・・(略)・・・
#ID-002
21世紀初頭の自然言語処理システム開発への過剰な投資により、粗悪製品が
乱造され・・・(略)・・・結局我々人間は歴史から何も学んでいないという
ことを実感させられる。
#ID-003
裏自然言語処理研究会のお知らせ:本日午後3時・・・(略)・・・ふるって
御参加下さい。
In such a case, specify only the one tag that marks the beginning of each text area.
% mkdid '#ID-' samp4.txt
Number of Documents = 3
sorting...
writting...
done.
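One plausible reading of the single-tag case is that each text area runs from one occurrence of the tag to just before the next occurrence, or to the end of the text. The Python sketch below illustrates that reading only; it is an assumption about the behavior, not taken from mkdid itself, and the name `split_by_start_tag` and the English sample text are hypothetical.

```python
def split_by_start_tag(text: str, start_tag: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of the areas that begin at each start_tag."""
    starts = []
    pos = text.find(start_tag)
    while pos != -1:
        starts.append(pos)
        pos = text.find(start_tag, pos + len(start_tag))
    # Each area ends where the next one begins; the last ends with the text.
    ends = starts[1:] + [len(text)]
    return list(zip(starts, ends))


sample = "#ID-001\nfirst entry\n#ID-002\nsecond entry\n#ID-003\nthird entry\n"
areas = split_by_start_tag(sample, "#ID-")
print(len(areas))  # 3
for start, end in areas:
    print(repr(sample[start:end]))
```

Under this reading, three occurrences of the tag yield three areas, which matches the document count reported for samp4.txt above.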
mkdid [-q] [-o DOCID_FILE] START_TAG [END_TAG] TEXT_FILE
TEXT_FILE is the name of the file you wish to search. Its array file "TEXT_FILE.ary" must also be present. START_TAG and END_TAG are the tags that mark the beginning and the end of a text area, respectively. END_TAG is optional.