You can index just about any type of text file using SWISH-E. You can do incremental indexes as well.
Go to http://swish-e.org/download/ for the source. First, do a trial build and install into a scratch prefix to capture a list of the files that get generated:
me% cd /src/info-retrieval/swish-e-2.4.3
me% find . -print | sort | tail +2 > ORIG
me% mkdir /tmp/local
me% CC=gcc CFLAGS=-O ./configure --enable-daystamp --prefix=/tmp/local
me% make
me% make check
me% make install
Now /tmp/local holds everything installed from SWISH-E. You might want to keep the output from "ls -lR /tmp/local" somewhere.
me% rm -rf /tmp/local
me% echo ./ORIG >> ORIG
me% sort -o ORIG ORIG
me% find . -print | sort | tail +2 > NEW
me% echo ./NEW >> ORIG
me% sort -o ORIG ORIG
ORIG is the original list of files; NEW is the list after installation. To get the files added to the source directory during the build:
me% comm -23 NEW ORIG
./Makefile
./conf/Makefile
./config.log
./config.status
./doc/Makefile
./example/Makefile
./example/search.cgi
./example/swish.cgi
...
./src/txt.lo
./src/txt.o
./src/xml.lo
./src/xml.o
./swish-config
./swish-e.pc
./tests/Makefile
./tests/index.swish-e
./tests/index.swish-e.prop
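To see what comm -23 is doing, here is a minimal sketch that runs the same comparison on two tiny listings. The file names in the listings are made-up placeholders, and the added_files helper is hypothetical. comm expects both inputs to be sorted the same way; -2 drops the lines found only in the second file and -3 drops the lines common to both, which leaves the lines found only in the first file.

```shell
#!/bin/sh
# Sketch: print the entries that appear in an "after" listing but not
# in a "before" listing. All file names below are placeholders.

# added_files BEFORE AFTER: print lines present only in AFTER.
# Both listings are sorted into temporary copies first, because comm
# gives wrong answers on unsorted input.
added_files() {
    before_sorted=$(mktemp) || return 1
    after_sorted=$(mktemp) || return 1
    LC_ALL=C sort "$1" > "$before_sorted"
    LC_ALL=C sort "$2" > "$after_sorted"
    LC_ALL=C comm -23 "$after_sorted" "$before_sorted"
    rm -f "$before_sorted" "$after_sorted"
}

workdir=$(mktemp -d) || exit 1
printf '%s\n' ./a.c ./b.c ./configure > "$workdir/before"
printf '%s\n' ./a.c ./a.o ./b.c ./b.o ./configure ./Makefile > "$workdir/after"

# Lists ./Makefile, ./a.o and ./b.o, one per line.
added_files "$workdir/before" "$workdir/after"
rm -rf "$workdir"
```

Sorting inside the helper costs a little time but removes the need to remember whether a listing was re-sorted after something was appended to it.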
Now clean up and see if anything is left over.
me% make distclean
me% find . -print | sort | tail +2 > out
me% diff out ORIG
me% rm out ORIG NEW
Cleanup works fine, so we don't need ORIG and NEW. If any files had been left over, diff would have shown them.
I rebuilt with pcre, after installing pcre-4.3 from ports. Configure:
me% CC=gcc CFLAGS=-O ./configure --enable-daystamp --with-pcre
checking for a BSD-compatible install... /usr/local/bin/ginstall -c
checking whether build environment is sane... yes
checking for gawk... no
checking for mawk... no
checking for nawk... nawk
checking whether make sets $(MAKE)... yes
checking build system type... i386-unknown-freebsd4.9
checking host system type... i386-unknown-freebsd4.9
checking for style of include used by make... GNU
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ANSI C... none needed
checking dependency style of gcc... gcc
checking for a sed that does not truncate output... /usr/bin/sed
checking for egrep... grep -E
checking for ld used by gcc... /usr/libexec/elf/ld
checking if the linker (/usr/libexec/elf/ld) is GNU ld... yes
checking for /usr/libexec/elf/ld option to reload object files... -r
checking for BSD-compatible nm... /usr/bin/nm -B
checking whether ln -s works... yes
checking how to recognise dependent libraries... pass_all
checking how to run the C preprocessor... gcc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... no
checking for unistd.h... yes
checking dlfcn.h usability... yes
checking dlfcn.h presence... yes
checking for dlfcn.h... yes
checking for g++... g++
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking dependency style of g++... gcc
checking how to run the C++ preprocessor... g++ -E
checking for g77... no
checking for f77... f77
checking whether we are using the GNU Fortran 77 compiler... yes
checking whether f77 accepts -g... yes
checking the maximum length of command line arguments... 16384
checking command to parse /usr/bin/nm -B output from gcc object... ok
checking for objdir... .libs
checking for ar... ar
checking for ranlib... ranlib
checking for strip... strip
checking if gcc static flag works... yes
checking if gcc supports -fno-rtti -fno-exceptions... yes
checking for gcc option to produce PIC... -fPIC
checking if gcc PIC flag -fPIC works... yes
checking if gcc supports -c -o file.o... yes
checking whether the gcc linker (/usr/libexec/elf/ld) supports
shared libraries... yes
checking whether -lc should be explicitly linked in... yes
checking dynamic linker characteristics... freebsd4.9 ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
configure: creating libtool
appending configuration tag "CXX" to libtool
checking for ld used by g++... /usr/libexec/elf/ld
checking if the linker (/usr/libexec/elf/ld) is GNU ld... yes
checking whether the g++ linker (/usr/libexec/elf/ld) supports
shared libraries... yes
checking for g++ option to produce PIC... -fPIC
checking if g++ PIC flag -fPIC works... yes
checking if g++ supports -c -o file.o... yes
checking whether the g++ linker (/usr/libexec/elf/ld) supports
shared libraries... yes
checking dynamic linker characteristics... freebsd4.9 ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
appending configuration tag "F77" to libtool
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking for f77 option to produce PIC... -fPIC
checking if f77 PIC flag -fPIC works... yes
checking if f77 supports -c -o file.o... yes
checking whether the f77 linker (/usr/libexec/elf/ld) supports
shared libraries... yes
checking dynamic linker characteristics... freebsd4.9 ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking whether to enable maintainer-specific portions of
Makefiles... no
checking for BSDgettimeofday... no
checking for gettimeofday... yes
checking whether #! works in shell scripts... yes
checking whether make sets $(MAKE)... (cached) yes
checking for perl... /usr/local/bin/perl
checking for pod2man... pod2man
checking for a BSD-compatible install... /usr/local/bin/ginstall -c
checking for gcc... (cached) gcc
checking whether we are using the GNU C compiler... (cached) yes
checking whether gcc accepts -g... (cached) yes
checking for gcc option to accept ANSI C... (cached) none needed
checking dependency style of gcc... (cached) gcc
checking for vsnprintf in -lsnprintf... no
checking for dirent.h that defines DIR... yes
checking for library containing opendir... none required
checking whether stat file-mode macros are broken... no
checking for ANSI C header files... (cached) yes
checking for unistd.h... (cached) yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking sys/timeb.h usability... yes
checking sys/timeb.h presence... yes
checking for sys/timeb.h... yes
checking windows.h usability... no
checking windows.h presence... no
checking for windows.h... no
checking sys/resource.h usability... yes
checking sys/resource.h presence... yes
checking for sys/resource.h... yes
checking sys/param.h usability... yes
checking sys/param.h presence... yes
checking for sys/param.h... yes
checking for sys/wait.h that is POSIX.1 compatible... yes
checking for an ANSI C-conforming const... yes
checking for pid_t... yes
checking for size_t... yes
checking whether struct tm is in sys/time.h or time.h... time.h
checking for working alloca.h... no
checking for alloca... yes
checking for strftime... yes
checking for vprintf... yes
checking for _doprnt... no
checking for unistd.h... (cached) yes
checking vfork.h usability... no
checking vfork.h presence... no
checking for vfork.h... no
checking for fork... yes
checking for vfork... yes
checking for working fork... yes
checking for working vfork... (cached) yes
checking for re_comp... no
checking for regcomp... yes
checking for strdup... yes
checking for strstr... yes
checking for lstat... yes
checking for setenv... yes
checking for access... yes
checking for strchr... yes
checking for memcpy... yes
checking for clock... yes
checking for times... yes
checking for getrusage... yes
checking for log in -lm... yes
checking for uid_t in sys/types.h... yes
checking type of array argument to getgroups... gid_t
checking for getgroups... yes
checking for working getgroups... yes
checking type of array argument to getgroups... (cached) gid_t
checking for vsnprintf... yes
checking for mkstemp... yes
checking for xml2-config... /usr/local/bin/xml2-config
checking for libxml libraries >= 2.4.3... found version 2.5.11
checking zlib.h usability... yes
checking zlib.h presence... yes
checking for zlib.h... yes
checking for gzread in -lz... yes
checking for pcre-config... /usr/local/bin/pcre-config
checking for libpcre libraries >= 3.4... found version 4.3
configure: Setting libexecdir to ${exec_prefix}/lib/swish-e
checking config option memdebug for setting MEM_DEBUG... no
checking config option memtrace for setting MEM_TRACE... no
checking config option memstats for setting MEM_STATISTICS... no
configure: creating ./config.status
config.status: creating Makefile
config.status: creating html/Makefile
config.status: creating man/Makefile
config.status: creating doc/Makefile
config.status: creating src/Makefile
config.status: creating src/expat/Makefile
config.status: creating src/replace/Makefile
config.status: creating src/snowball/Makefile
config.status: creating rpm/swish-e.spec
config.status: creating tests/Makefile
config.status: creating example/Makefile
config.status: creating prog-bin/Makefile
config.status: creating filters/Makefile
config.status: creating filters/SWISH/Makefile
config.status: creating conf/Makefile
config.status: creating filter-bin/Makefile
config.status: creating swish-e.pc
config.status: creating swish-config
config.status: creating src/acconfig.h
config.status: executing depfiles commands
Build:
me% make
Making all in filters
Making all in SWISH
Making all in prog-bin
Making all in conf
Making all in filter-bin
Making all in example
Making all in html
Making all in man
Making all in src
make all-recursive
Making all in expat
source='xmltok.c' object='xmltok.lo' libtool=yes
depfile='.deps/xmltok.Plo' tmpdepfile='.deps/xmltok.TPlo'
depmode=gcc /usr/local/bin/bash ../../config/depcomp
/usr/local/bin/bash ../../libtool --mode=compile gcc
-DHAVE_CONFIG_H -I. -I. -I../../src -I"./xmlparse"
-I"./xmltok" -O -c -o xmltok.lo `test -f 'xmltok.c' || echo
'./'`xmltok.c
mkdir .libs
gcc -DHAVE_CONFIG_H -I. -I. -I../../src -I./xmlparse -I./xmltok
-O -c xmltok.c -Wp,-MD,.deps/xmltok.TPlo -fPIC -DPIC -o
.libs/xmltok.o
gcc -DHAVE_CONFIG_H -I. -I. -I../../src -I./xmlparse -I./xmltok
-O -c xmltok.c -Wp,-MD,.deps/xmltok.TPlo -o xmltok.o
>/dev/null 2>&1
source='xmlrole.c' object='xmlrole.lo' libtool=yes
depfile='.deps/xmlrole.Plo' tmpdepfile='.deps/xmlrole.TPlo'
depmode=gcc /usr/local/bin/bash ../../config/depcomp
/usr/local/bin/bash ../../libtool --mode=compile gcc
-DHAVE_CONFIG_H -I. -I. -I../../src -I"./xmlparse"
-I"./xmltok" -O -c -o xmlrole.lo `test -f 'xmlrole.c' ||
echo './'`xmlrole.c
...
source='result_output.c' object='result_output.o' libtool=no
depfile='.deps/result_output.Po'
tmpdepfile='.deps/result_output.TPo' depmode=gcc
/usr/local/bin/bash ../config/depcomp gcc -DHAVE_CONFIG_H
-I. -I. -I. -Dlibexecdir=\"/usr/local/lib/swish-e\"
-DPATH_SEPARATOR=\":\" -I/usr/local/include
-I/usr/local/include/libxml2 -I/usr/local/include -Ireplace
-Wall -O -c `test -f 'result_output.c' || echo
'./'`result_output.c
/usr/local/bin/bash ../libtool --mode=link gcc -Wall -O -o
swish-e swish.o keychar_out.o dump.o result_output.o
libswishindex.la libswish-e.la -lm
gcc -Wall -O -o .libs/swish-e swish.o keychar_out.o dump.o
result_output.o ./.libs/libswishindex.a -L/usr/local/lib
-lxml2 -liconv ./.libs/libswish-e.so -lz -lpcreposix -lpcre
-lm -Wl,--rpath -Wl,/usr/local/lib
creating swish-e
Making all in tests
Test:
me% make check
Making check in filters
Making check in SWISH
Making check in prog-bin
Making check in conf
Making check in filter-bin
Making check in example
Making check in html
Making check in man
Making check in src
Making check in expat
Making check in replace
Making check in snowball
Making check in tests
make check-TESTS
PASS: check_index
PASS: check_search
PASS: check_metasearch
==================
All 3 tests passed
==================
Install (to keep the output readable, these common paths are abbreviated below):
$lib = /usr/local/lib/swish-e
$inc = /usr/local/include
$bin = /usr/local/bin
$doc = /usr/local/share/doc/swish-e
root# make install
Making install in filters
Making install in SWISH
$bin/bash ../../config/mkinstalldirs $lib/perl/SWISH
mkdir -p -- $lib/perl/SWISH
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c
Filters/Doc2txt.pm $lib/perl/SWISH/Filters/Doc2txt.pm
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c
Filters/Doc2html.pm $lib/perl/SWISH/Filters/Doc2html.pm
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c
Filters/Pdf2HTML.pm $lib/perl/SWISH/Filters/Pdf2HTML.pm
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c
Filters/ID3toHTML.pm $lib/perl/SWISH/Filters/ID3toHTML.pm
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c
Filters/XLtoHTML.pm $lib/perl/SWISH/Filters/XLtoHTML.pm
$bin/bash ../../config/mkinstalldirs $lib/perl/SWISH
$bin/ginstall -c Filter.pm $lib/perl/SWISH/Filter.pm
$bin/bash ../config/mkinstalldirs $bin
$bin/ginstall -c swish-filter-test $bin/swish-filter-test
$bin/bash ../config/mkinstalldirs $doc/examples/filters
mkdir -p -- $doc/examples/filters
$bin/ginstall -c -m 644 README $doc/examples/filters/README
Making install in prog-bin
$bin/bash ../config/mkinstalldirs $lib
$bin/ginstall -c spider.pl $lib/spider.pl
$bin/ginstall -c DirTree.pl $lib/DirTree.pl
$bin/bash ../config/mkinstalldirs $doc/examples/prog-bin
mkdir -p -- $doc/examples/prog-bin
$bin/ginstall -c -m 644 README $doc/examples/prog-bin/README
$bin/ginstall -c -m 644 file.pl $doc/examples/prog-bin/file.pl
$bin/ginstall -c -m 644 SwishSpiderConfig.pl
$doc/examples/prog-bin/SwishSpiderConfig.pl
$bin/ginstall -c -m 644 MySQL.pl $doc/examples/prog-bin/MySQL.pl
$bin/ginstall -c -m 644 index_hypermail.pl
$doc/examples/prog-bin/index_hypermail.pl
$bin/ginstall -c -m 644 pdf2xml.pm $doc/examples/prog-bin/pdf2xml.pm
$bin/ginstall -c -m 644 pdf2html.pm
$doc/examples/prog-bin/pdf2html.pm
$bin/ginstall -c -m 644 doc2txt.pm $doc/examples/prog-bin/doc2txt.pm
$bin/bash ../config/mkinstalldirs $lib/perl
$bin/ginstall -c doc2txt.pm $lib/perl/doc2txt.pm
$bin/ginstall -c pdf2html.pm $lib/perl/pdf2html.pm
$bin/ginstall -c pdf2xml.pm $lib/perl/pdf2xml.pm
Making install in conf
$bin/bash ../config/mkinstalldirs $doc/examples/conf
mkdir -p -- $doc/examples/conf
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
stopwords/dutch.txt $doc/examples/conf/stopwords/dutch.txt
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
stopwords/english.txt
$doc/examples/conf/stopwords/english.txt
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
stopwords/german.txt $doc/examples/conf/stopwords/german.txt
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
stopwords/spanish.txt
$doc/examples/conf/stopwords/spanish.txt
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
README $doc/examples/conf/README
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
example1.config $doc/examples/conf/example1.config
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
example2.config $doc/examples/conf/example2.config
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
example3.config $doc/examples/conf/example3.config
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
example4.config $doc/examples/conf/example4.config
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
example5.config $doc/examples/conf/example5.config
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
example6.config $doc/examples/conf/example6.config
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
example7.config $doc/examples/conf/example7.config
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
example8.config $doc/examples/conf/example8.config
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
example9.config $doc/examples/conf/example9.config
/src/info-retrieval/swish-e-2.4.3/config/install-sh -c -m 644
example9.pl $doc/examples/conf/example9.pl
Making install in filter-bin
$bin/bash ../config/mkinstalldirs $doc/examples/filter-bin
mkdir -p -- $doc/examples/filter-bin
$bin/ginstall -c -m 644 README $doc/examples/filter-bin/README
$bin/ginstall -c -m 644 swish_filter.pl
$doc/examples/filter-bin/swish_filter.pl
$bin/ginstall -c -m 644 _binfilter.sh
$doc/examples/filter-bin/_binfilter.sh
$bin/ginstall -c -m 644 _pdf2html.pl
$doc/examples/filter-bin/_pdf2html.pl
Making install in example
$bin/bash ../config/mkinstalldirs $lib
$bin/ginstall -c swish.cgi $lib/swish.cgi
$bin/ginstall -c search.cgi $lib/search.cgi
$bin/bash ../config/mkinstalldirs $lib/perl/SWISH
$bin/ginstall -c modules/SWISH/DateRanges.pm
$lib/perl/SWISH/DateRanges.pm
$bin/ginstall -c modules/SWISH/DefaultHighlight.pm
$lib/perl/SWISH/DefaultHighlight.pm
$bin/ginstall -c modules/SWISH/PhraseHighlight.pm
$lib/perl/SWISH/PhraseHighlight.pm
$bin/ginstall -c modules/SWISH/SimpleHighlight.pm
$lib/perl/SWISH/SimpleHighlight.pm
$bin/ginstall -c modules/SWISH/TemplateDefault.pm
$lib/perl/SWISH/TemplateDefault.pm
$bin/ginstall -c modules/SWISH/TemplateDumper.pm
$lib/perl/SWISH/TemplateDumper.pm
$bin/ginstall -c modules/SWISH/TemplateFrame.pm
$lib/perl/SWISH/TemplateFrame.pm
$bin/ginstall -c modules/SWISH/TemplateHTMLTemplate.pm
$lib/perl/SWISH/TemplateHTMLTemplate.pm
$bin/ginstall -c modules/SWISH/TemplateToolkit.pm
$lib/perl/SWISH/TemplateToolkit.pm
$bin/ginstall -c modules/SWISH/ParseQuery.pm
$lib/perl/SWISH/ParseQuery.pm
$bin/bash ../config/mkinstalldirs /usr/local/share/swish-e
mkdir -p -- /usr/local/share/swish-e
$bin/ginstall -c -m 644 swish.tt /usr/local/share/swish-e/swish.tt
$bin/ginstall -c -m 644 swish.tmpl
/usr/local/share/swish-e/swish.tmpl
$bin/bash ../config/mkinstalldirs /usr/local/share/swish-e/templates
mkdir -p -- /usr/local/share/swish-e/templates
$bin/ginstall -c -m 644 templates/search.tt
/usr/local/share/swish-e/templates/search.tt
$bin/ginstall -c -m 644 templates/page_layout
/usr/local/share/swish-e/templates/page_layout
$bin/ginstall -c -m 644 templates/common_header
/usr/local/share/swish-e/templates/common_header
$bin/ginstall -c -m 644 templates/common_footer
/usr/local/share/swish-e/templates/common_footer
$bin/ginstall -c -m 644 templates/style.css
/usr/local/share/swish-e/templates/style.css
$bin/ginstall -c -m 644 templates/markup.css
/usr/local/share/swish-e/templates/markup.css
Making install in html
$bin/bash ../config/mkinstalldirs $doc/html
mkdir -p -- $doc/html
$bin/ginstall -c -m 644 ./CHANGES.html $doc/html/CHANGES.html
$bin/ginstall -c -m 644 ./INSTALL.html $doc/html/INSTALL.html
$bin/ginstall -c -m 644 ./README.html $doc/html/README.html
$bin/ginstall -c -m 644 ./SWISH-3.0.html $doc/html/SWISH-3.0.html
$bin/ginstall -c -m 644 ./SWISH-BUGS.html $doc/html/SWISH-BUGS.html
$bin/ginstall -c -m 644 ./SWISH-CONFIG.html $doc/html/SWISH-CONFIG.html
$bin/ginstall -c -m 644 ./SWISH-FAQ.html $doc/html/SWISH-FAQ.html
$bin/ginstall -c -m 644 ./SWISH-LIBRARY.html
$doc/html/SWISH-LIBRARY.html
$bin/ginstall -c -m 644 ./SWISH-RUN.html $doc/html/SWISH-RUN.html
$bin/ginstall -c -m 644 ./SWISH-SEARCH.html $doc/html/SWISH-SEARCH.html
$bin/ginstall -c -m 644 ./API.html $doc/html/API.html
$bin/ginstall -c -m 644 ./spider.html $doc/html/spider.html
$bin/ginstall -c -m 644 ./swish.html $doc/html/swish.html
$bin/ginstall -c -m 644 ./search.html $doc/html/search.html
$bin/ginstall -c -m 644 ./Filter.html $doc/html/Filter.html
$bin/ginstall -c -m 644 ./style.css $doc/html/style.css
$bin/ginstall -c -m 644 ./index.html $doc/html/index.html
$bin/ginstall -c -m 644 ./index_long.html $doc/html/index_long.html
$bin/ginstall -c -m 644 .htaccess $doc/html/.htaccess
$bin/ginstall -c -m 644 searchdoc.html $doc/html/searchdoc.html
$bin/ginstall -c -m 644 .swishcgi.conf $doc/html/.swishcgi.conf
$bin/ginstall -c -m 644 swish.conf $doc/html/swish.conf
$bin/ginstall -c -m 644 split.pl $doc/html/split.pl
$bin/bash ../config/mkinstalldirs $doc/html/images
mkdir -p -- $doc/html/images
$bin/ginstall -c -m 644 ./images/dotrule1.gif
$doc/html/images/dotrule1.gif
$bin/ginstall -c -m 644 ./images/swish2b.gif
$doc/html/images/swish2b.gif
$bin/ginstall -c -m 644 ./images/swish2.gif
$doc/html/images/swish2.gif
$bin/ginstall -c -m 644 ./images/swishbanner1.gif
$doc/html/images/swishbanner1.gif
$bin/ginstall -c -m 644 ./images/swish.gif $doc/html/images/swish.gif
Making install in man
$bin/bash ../config/mkinstalldirs /usr/local/man/man1
$bin/ginstall -c -m 644 ././swish-e.1 /usr/local/man/man1/swish-e.1
$bin/ginstall -c -m 644 ././SWISH-CONFIG.1
/usr/local/man/man1/SWISH-CONFIG.1
$bin/ginstall -c -m 644 ././SWISH-FAQ.1
/usr/local/man/man1/SWISH-FAQ.1
$bin/ginstall -c -m 644 ././SWISH-LIBRARY.1
/usr/local/man/man1/SWISH-LIBRARY.1
$bin/ginstall -c -m 644 ././SWISH-RUN.1
/usr/local/man/man1/SWISH-RUN.1
Making install in src
Making install in expat
Making install in replace
Making install in snowball
$bin/bash ../config/mkinstalldirs /usr/local/lib
$bin/bash ../libtool --mode=install $bin/ginstall -c
libswish-e.la /usr/local/lib/libswish-e.la
$bin/ginstall -c .libs/libswish-e.so.2 /usr/local/lib/libswish-e.so.2
(cd /usr/local/lib && rm -f libswish-e.so && ln -s
libswish-e.so.2 libswish-e.so)
(cd /usr/local/lib && rm -f libswish-e.so && ln -s
libswish-e.so.2 libswish-e.so)
$bin/ginstall -c .libs/libswish-e.lai /usr/local/lib/libswish-e.la
$bin/ginstall -c .libs/libswish-e.a /usr/local/lib/libswish-e.a
ranlib /usr/local/lib/libswish-e.a
chmod 644 /usr/local/lib/libswish-e.a
----------------------------------------------------------------------
Libraries have been installed in:
/usr/local/lib
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-Wl,--rpath -Wl,LIBDIR' linker flag
See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
$bin/bash ../config/mkinstalldirs $bin
$bin/bash ../libtool --mode=install $bin/ginstall -c swish-e
$bin/swish-e
$bin/ginstall -c .libs/swish-e $bin/swish-e
$bin/bash ../config/mkinstalldirs $lib
$bin/ginstall -c swishspider $lib/swishspider
$bin/bash ../config/mkinstalldirs $inc
$bin/ginstall -c -m 644 swish-e.h $inc/swish-e.h
Making install in tests
$bin/bash ./config/mkinstalldirs $bin
$bin/ginstall -c swish-config $bin/swish-config
$bin/bash ./config/mkinstalldirs $doc
$bin/ginstall -c -m 644 ./INSTALL $doc/INSTALL
$bin/ginstall -c -m 644 ./README $doc/README
$bin/ginstall -c -m 644 README.cvs $doc/README.cvs
$bin/bash ./config/mkinstalldirs /usr/local/lib/pkgconfig
$bin/ginstall -c -m 644 swish-e.pc /usr/local/lib/pkgconfig/swish-e.pc
Next, see how fast this works for all text and html files under my web directory. The config file looks like this:
me% cat swish.conf
# Example configuration file
# Tell Swish-e what to index (same as -i switch above)
IndexDir /doc/html/htdocs
# Only index HTML and text files
IndexOnly .htm .html .txt
# Tell Swish-e that .txt files are to use the text parser.
IndexContents TXT* .txt
# Otherwise, use the HTML parser
DefaultContents HTML*
Now run the indexer:
me% swish-e -c swish.conf
Indexing Data Source: "File-System"
Indexing "/doc/html/htdocs"
Warning: Failed to open dir '/doc/html/htdocs/dforum/forumdata'
:Permission denied
Warning: Failed to open:
'/doc/html/htdocs/forum-samples/wwwboard/passwd.txt':
Permission denied
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 207,978 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
207,978 unique words indexed.
4 properties sorted.
12,658 files indexed. 171,427,982 total bytes. 18,032,526 total words.
Elapsed time: 00:01:19 CPU time: 00:00:50
Indexing done!
me% ls -l
-rw-r--r-- 1 vogelke vogelke 41572867 Aug 16 17:03 index.swish-e
-rw-r--r-- 1 vogelke vogelke 1077692 Aug 16 17:03 index.swish-e.prop
-rw-r--r-- 1 vogelke vogelke 310 Aug 16 17:01 swish.conf
The warnings are appropriate; I have some protected directories under /doc/html/htdocs.
This thing absolutely flies when searching:
me% swish-e -w wheeler
# SWISH format: 2.4.3-2005-08-16
# Search words: wheeler
# Removed stopwords:
# Number of hits: 20
# Search time: 0.000 seconds
# Run time: 0.013 seconds
1000 /doc/html/htdocs/dmoz/categories.txt "categories.txt" 35205447
930 /doc/html/htdocs/security/Secure-Programs-HOWTO.html "Secure
Programming for Linux and Unix HOWTO" 873201
642 /doc/html/htdocs/open-sources/oss-fs-why/oss_fs_why.htm "Why
Open Source Software / Free Software (OSS/FS)? Look at the
Numbers!" 428946
578 /doc/html/htdocs/sample-files/newdir/incompetence.txt
"incompetence.txt" 81070
541 /doc/html/htdocs/open-sources/oss-fs-why/oss-fs-why.txt
"oss-fs-why.txt" 353693
499 /doc/html/htdocs/citing/misc/elcite.txt "elcite.txt" 80325
446 /doc/html/htdocs/bzip/manual_4.html "bzip2 and libbzip2 -
Miscellanea" 17469
385 /doc/html/htdocs/bzip/manual_2.html "bzip2 and libbzip2 -
How to use bzip2" 18306
304 /doc/html/htdocs/ghostscript/History2.htm "History of
Ghostscript versions 1.n" 208829
304 /doc/html/htdocs/blog/post/1078616486.txt "1078616486.txt" 969
304 /doc/html/htdocs/bb/0043.htm "GR Bulletin Board: [MISC]
Annual Birthday Observance Luncheon to" 15246
304 /doc/html/htdocs/linux/LFS/index.htm "Linux From Scratch" 1307494
192 /doc/html/htdocs/sample-files/subdir/secure-linux-programming.txt
"secure-linux-programming.txt" 5317
192 /doc/html/htdocs/scripting/2001/2001-09-23.htm
"2001-09-23.htm" 13638
192 /doc/html/htdocs/scripting/2001/2001-03-03.htm
"2001-03-03.htm" 7294
192 /doc/html/htdocs/politics/reparations/fredreed.txt
"fredreed.txt" 6548
192 /doc/html/htdocs/politics/census/title13.txt "title13.txt" 2405
192 /doc/html/htdocs/open-sources/halloween/0,4164,2160239,00.txt
"0,4164,2160239,00.txt" 14826
192 /doc/html/htdocs/bzip/manual_3.html "bzip2 and libbzip2 -
Programming with libbzip2" 60973
192 /doc/html/htdocs/bzip/manual_1.html "bzip2 and libbzip2 -
Introduction" 1694
.
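Judging from the output above, each hit is a rank, a path, a quoted title and a size, with "#" header lines in front and a lone "." at the end. Here is a minimal sketch for pulling just the rank and path out of that output. The hits function is hypothetical, and it assumes that paths contain no spaces and that each hit arrives on a single line (the listing above looks wrapped for display).

```shell
#!/bin/sh
# Sketch: reduce swish-e search output to "rank path" pairs.
# Assumes no spaces in paths and one hit per line.

# hits: read search output on stdin, print "rank path" for each hit.
hits() {
    awk '
        /^#/      { next }            # header lines
        $0 == "." { next }            # end-of-results marker
        NF >= 2   { print $1, $2 }    # rank and path
    '
}

# Sample lines copied from the search above.
hits <<'EOF'
# SWISH format: 2.4.3-2005-08-16
# Search words: wheeler
# Number of hits: 20
1000 /doc/html/htdocs/dmoz/categories.txt "categories.txt" 35205447
499 /doc/html/htdocs/citing/misc/elcite.txt "elcite.txt" 80325
.
EOF
```

In real use the function would sit at the end of a pipeline, as in "swish-e -w wheeler | hits".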
I want something to find all (or at least most) of my text files and preprocess them for indexing. This is a start, from the prog-bin directory:
#!/usr/bin/perl -w
use strict;

# This is a short example that basically does the same
# thing as the default file system access method.
# This will index only one directory.

my $dir = shift (@ARGV) || '.';
opendir D, $dir or die $!;

# defined() keeps a file named "0" from ending the loop early.
while (defined($_ = readdir D)) {
    next unless -T "$dir/$_";
    my ($size, $mtime) = (stat "$dir/$_")[7, 9];
    open FH, "$dir/$_" or die "$! on $dir/$_";

    # The blank line before EOF ends the headers.
    print <<EOF;
Content-Length: $size
Last-Mtime: $mtime
Path-Name: $dir/$_

EOF
    print <FH>;
    close FH;
}
exit(0);
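The record that file.pl writes for each document is simple: a few header lines, one blank line, then exactly Content-Length bytes of content. Here is a minimal shell sketch of the same record, using a hypothetical emit_doc function. It leaves out Last-Mtime because plain sh has no portable way to read a file's modification time. My understanding is that SWISH-E treats that header as optional, but that is an assumption to check against the SWISH-RUN documentation.

```shell
#!/bin/sh
# Sketch of one record for the indexer's external-program input:
# headers, a blank line, then the document itself.
# Last-Mtime is omitted here (assumed to be optional).

# emit_doc FILE: write one record for FILE to stdout.
emit_doc() {
    file=$1
    # wc -c counts bytes; the arithmetic expansion strips the padding
    # some wc implementations print in front of the number.
    size=$(( $(wc -c < "$file") ))
    printf 'Path-Name: %s\n' "$file"
    printf 'Content-Length: %s\n' "$size"
    printf '\n'
    cat "$file"
}

sample=$(mktemp) || exit 1
printf 'hello swish\n' > "$sample"
emit_doc "$sample"
rm -f "$sample"
```

The size comes from the same file that cat then copies, so the header and the body agree as long as the file does not change in between.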
To index all text files under /etc:
me% cat etc.conf
SwishProgParameters /etc
IndexFile index.etc
IndexDir ./file.pl
me% swish-e -S prog -c etc.conf
Indexing Data Source: "External-Program"
Indexing "./file.pl"
External Program found: ./file.pl
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 9,039 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
9,039 unique words indexed.
4 properties sorted.
90 files indexed. 501,654 total bytes. 92,491 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
me% ls -lF index.etc*
-rw-r--r-- 1 vogelke vogelke 758822 Aug 16 17:56 index.etc
-rw-r--r-- 1 vogelke vogelke 2525 Aug 16 17:56 index.etc.prop
To search the /etc files:
me% swish-e -f index.etc -w krb
# SWISH format: 2.4.3-2005-08-16
# Search words: krb
# Removed stopwords:
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.013 seconds
1000 /etc/services "services" 73490
.
me% grep krb /etc/services
kerberos-sec 88/tcp kerberos # krb5...
kerberos-sec 88/udp kerberos # krb5...
krb_prop 754/tcp krb5_prop...
krbupdate 760/tcp kreg...
krb524 4444/tcp
krb524 4444/udp
# PROBLEM krb524 assigned the port,
This is more flexible if we run find from within the script, so we walk an entire filetree. (Yes, I know about File::Find. I hate it. Sue me.)
Here's find.pl:
#!/usr/bin/perl -w
use strict;
use English qw( -no_match_vars );

my $pid;
my $dir = shift(@ARGV) || '.';
my $program = "find";

# This is a short example based on file.pl.
# It reads from find.

die "cannot fork: $!" unless defined($pid = open(KID, "-|"));

if ($pid) {    # parent
    while (<KID>) {
        chomp;
        next unless -T "$_";
        my ($size, $mtime) = (stat "$_")[7, 9];

        # Three-argument open: an odd filename is never taken
        # as an open mode or a command.
        open(FH, '<', $_) or die "$! on $_";

        # The blank line before EOF ends the headers.
        print <<EOF;
Content-Length: $size
Last-Mtime: $mtime
Path-Name: $_

EOF
        print <FH>;
        close(FH);
    }
    close KID;
}
else {         # child
    $EUID = $UID;
    $EGID = $GID;    # XXX: initgroups() not called
    $ENV{PATH} = "/bin:/usr/bin";
    exec($program, '-x', "$dir", '-print')
        or die "can't exec $program: $!";
}
exit(0);
The -x option is BSD find syntax; elsewhere, put -xdev after the directory instead. Either one keeps find from wandering across mount points. Results:
me% cat etc.conf
SwishProgParameters /etc
IndexFile index.etc
IndexDir ./find.pl
me% swish-e -S prog -c etc.conf
Indexing Data Source: "External-Program"
Indexing "./find.pl"
External Program found: ./find.pl
find: /etc/isdn: Permission denied
find: /etc/uucp: Permission denied
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 12,179 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
12,179 unique words indexed.
4 properties sorted.
196 files indexed. 1,220,794 total bytes. 172,599 total words.
Elapsed time: 00:00:01 CPU time: 00:00:00
Indexing done!
me% swish-e -f index.etc -w queuerun
# SWISH format: 2.4.3-2005-08-16
# Search words: queuerun
# Removed stopwords:
# Number of hits: 2
# Search time: 0.001 seconds
# Run time: 0.013 seconds
1000 /etc/periodic/daily/500.queuerun "500.queuerun" 727
862 /etc/defaults/periodic.conf "periodic.conf" 8215
.
I break up the files on my workstation into a few general search categories:
me% cd /space/swish
me% ls -l
drwxr-xr-x 2 vogelke wheel 512 Sep 8 04:16 home
drwxr-xr-x 2 vogelke wheel 512 Sep 8 04:20 logs
drwxr-xr-x 3 vogelke wheel 512 Aug 21 18:48 mail-saved
drwxr-xr-x 3 vogelke wheel 512 Aug 22 12:31 mail-unread
drwxr-xr-x 2 vogelke wheel 512 Sep 8 04:19 notebook
drwxr-xr-x 2 vogelke wheel 512 Aug 22 16:20 root
drwxr-xr-x 2 vogelke wheel 512 Aug 22 16:47 usr
drwxr-xr-x 2 vogelke wheel 512 Aug 30 17:38 web
me% du -s *
180856 home
6660 logs
232884 mail-saved
1254208 mail-unread
74282 notebook
17424 root
266360 usr
112936 web
me% find . -print | wc -l
159285
Some of these groups change more often than others. For example, I generally save mail in specific project logs rather than under my mail directory, so mail-saved doesn't change much. However, my notebook holds anything I do on a daily basis, so that needs to be re-indexed all the time.
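Since the groups change at different rates, one way to avoid needless work is to check whether anything in a tree is newer than its index before re-indexing. The sketch below is an assumption about how that check might look, not something these notes actually do. The stale function and the demo file names are hypothetical.

```shell
#!/bin/sh
# Sketch: decide whether a tree has changed since its index was built.

# stale INDEX TREE: succeed if INDEX is missing, or if any regular
# file under TREE has a newer modification time than INDEX.
stale() {
    index=$1
    tree=$2
    [ -f "$index" ] || return 0
    newer=$(find "$tree" -type f -newer "$index" -print | head -n 1)
    [ -n "$newer" ]
}

demo=$(mktemp -d) || exit 1
mkdir "$demo/notebook"
: > "$demo/notebook/entry.txt"
: > "$demo/index.notebook"
# Backdate the index so the notebook entry counts as newer.
touch -t 200001010000 "$demo/index.notebook"

if stale "$demo/index.notebook" "$demo/notebook"; then
    echo "notebook needs re-indexing"
fi
rm -rf "$demo"
```

This only notices files that were added or modified; a deleted file leaves nothing newer behind, so an occasional unconditional rebuild is still worthwhile.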
Each directory holds a swish configuration file plus a script used to provide the associated files to the indexer. The home directory looks like this:
me% ls -lR home
total 180854
-rwxr-xr-x 1 vogelke 841 Aug 22 17:31 find.pl
-rw-r--r-- 1 vogelke 66 Aug 22 19:01 home.conf
-rw-r--r-- 1 vogelke 183636154 Sep 8 04:16 index.home
-rw-r--r-- 1 vogelke 1458093 Sep 8 04:16 index.home.prop
me% cat home/home.conf
SwishProgParameters $HOME
IndexFile index.home
IndexDir ./find.pl
me% cat home/find.pl
#!/usr/bin/perl -w
use strict;
use English qw( -no_match_vars );
my $pid;
my $dir = shift(@ARGV) || '.';
my $program = "find";
my $home = $ENV{'HOME'};
# This is a short example based on file.pl.
# It reads from find.
die "cannot fork: $!" unless defined($pid = open(KID, "-|"));
if ($pid) { # parent
    while (<KID>) {
        chomp;
        next if m!$home/notebook/!;
        next unless -T "$_";
        my ($size, $mtime) = (stat "$_")[7, 9];
        # Three-argument open, so odd characters in a filename
        # are never treated as a mode or a pipe.
        open(FH, '<', $_) or die "$! on $_";
        # The headers must end with a blank line before the contents.
        print <<EOF;
Content-Length: $size
Last-Mtime: $mtime
Path-Name: $_

EOF
        print <FH>;
        close(FH);
    }
    close KID;
}
else { # child: run find and feed the filenames to the parent
    $EUID = $UID;
    $EGID = $GID;
    $ENV{PATH} = "/bin:/usr/bin";
    exec($program, '-x', "$dir", '-type', 'f', '-print')
        or die "can't exec $program: $!";
}
exit(0);
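For comparison, here is a minimal sketch of the same feeder idea in plain sh. The function names emit_doc and emit_tree are made up for illustration. It assumes the indexer only needs the Path-Name and Content-Length headers followed by a blank line, so it leaves out Last-Mtime (getting an mtime portably from sh is awkward), and unlike the Perl version it makes no attempt to skip binary files.

```shell
#!/bin/sh
# Sketch only: emit files in the header/blank-line/body layout that
# "swish-e -S prog" reads.  emit_doc and emit_tree are hypothetical names.

# emit_doc FILE: print the headers, a blank line, then the file contents.
emit_doc () {
    [ -r "$1" ] || return 1
    # BSD wc pads its output with spaces, so strip them.
    size=$(wc -c < "$1" | tr -d ' ')
    printf 'Path-Name: %s\nContent-Length: %s\n\n' "$1" "$size"
    cat "$1"
}

# emit_tree DIR: run emit_doc on every regular file below DIR.
# Filenames containing newlines are not handled.
emit_tree () {
    find "$1" -type f -print |
    while IFS= read -r file
    do
        emit_doc "$file"
    done
}
```

To use it as an IndexDir program you would add a final line such as emit_tree "${1:-.}" and make the script executable.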
The index.home* files are generated by the swish indexer. Similar files are present under the other directories:
me% pwd
/space/swish
me% ls -lF *
home:
-rwxr-xr-x 1 vogelke 841 Aug 22 17:31 find.pl*
-rw-r--r-- 1 vogelke 66 Aug 22 19:01 home.conf
-rw-r--r-- 1 vogelke 183636154 Sep 8 04:16 index.home
-rw-r--r-- 1 vogelke 1458093 Sep 8 04:16 index.home.prop
logs:
-rwxr-xr-x 1 vogelke 778 Aug 22 14:51 find.pl*
-rw-r--r-- 1 vogelke 6689138 Sep 8 04:20 index.logs
-rw-r--r-- 1 vogelke 92958 Sep 8 04:20 index.logs.prop
-rw-r--r-- 1 vogelke 65 Aug 22 14:51 logs.conf
mail-saved:
-rwxr-xr-x 1 vogelke 770 Aug 21 18:39 find.pl*
-rw-r--r-- 1 vogelke 61426342 Aug 21 18:44 index.saved
-rw-r--r-- 1 vogelke 1691358 Aug 21 18:44 index.saved.prop
-rw-r--r-- 1 vogelke 90 Aug 21 18:40 saved.conf
mail-unread:
-rwxr-xr-x 1 vogelke 770 Aug 21 18:48 find.pl*
-rw-r--r-- 1 vogelke 295549588 Aug 21 19:12 index.unread
-rw-r--r-- 1 vogelke 9372562 Aug 21 19:12 index.unread.prop
-rw-r--r-- 1 vogelke 92 Aug 21 18:48 unread.conf
notebook:
-rwxr-xr-x 1 vogelke 770 Aug 22 14:08 find.pl*
-rw-r--r-- 1 vogelke 73362429 Sep 8 04:19 index.notebook
-rw-r--r-- 1 vogelke 352000 Sep 8 04:19 index.notebook.prop
-rw-r--r-- 1 vogelke 79 Aug 22 14:07 notebook.conf
root:
-rwxr-xr-x 1 vogelke 809 Aug 22 16:19 find.pl*
-rw-r--r-- 1 root 17775390 Aug 22 16:19 index.root
-rw-r--r-- 1 root 39300 Aug 22 16:19 index.root.prop
-rw-r--r-- 1 vogelke 62 Aug 22 16:17 root.conf
usr:
-rwxr-xr-x 1 vogelke 784 Aug 22 16:21 find.pl*
-rw-r--r-- 1 root 264337503 Aug 22 16:36 index.usr
-rw-r--r-- 1 root 8305594 Aug 22 16:36 index.usr.prop
-rw-r--r-- 1 vogelke 64 Aug 22 16:21 usr.conf
web:
-rwxr-xr-x 1 vogelke 770 Aug 22 14:58 find.pl*
-rw-r--r-- 1 vogelke 113947235 Aug 30 17:38 index.web
-rw-r--r-- 1 vogelke 1617638 Aug 30 17:38 index.web.prop
-rw-r--r-- 1 vogelke 76 Aug 22 14:59 web.conf
The home, logs, and notebook indexes are regenerated every night via cron using this script:
#!/bin/sh
#
# Id: 140-swish-rebuild,v 1.1 2005/08/26 16:44:03 vogelke Exp
# Source: /home/vogelke/etc/periodic/daily/RCS/140-swish-rebuild,v
#
# re-index SWISH files.
PATH=/usr/local/bin:/bin:/sbin:/usr/sbin:/usr/bin
export PATH
umask 022
tag=`basename $0`
for topic in home notebook logs
do
    logger -t $tag "starting $topic"
    ( cd /space/swish/$topic && swish-e -S prog -c $topic.conf )
done
exit 0
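The script above rebuilds all three indexes every night whether or not anything changed. One possible refinement, sketched here under my own assumptions, is to skip a topic when no file in its tree is newer than the existing index. The function name needs_reindex and the idea of passing the tree as a second argument are hypothetical; the configs shown earlier do not record which tree belongs to which index in a form the shell can read directly.

```shell
#!/bin/sh
# Sketch only: decide whether an index is stale before rebuilding it.
# needs_reindex is a hypothetical helper, not part of SWISH-E.

# needs_reindex INDEXFILE DIR
# Succeeds (returns 0) when INDEXFILE is missing or when at least one
# regular file under DIR has been modified more recently than INDEXFILE.
needs_reindex () {
    test -f "$1" || return 0
    newer=$(find "$2" -type f -newer "$1" -print 2>/dev/null | head -n 1)
    test -n "$newer"
}
```

Inside the loop this could be used as: needs_reindex index.$topic /some/tree || continue. Note that deleted files do not make the index look stale, so an occasional unconditional rebuild is still a good idea.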
I use this to search the indexes:
#!/bin/ksh
#
# Id: srch,v 1.4 2005/09/10 21:06:46 vogelke Exp
# Source: /home/vogelke/bin/RCS/srch,v
#
# NAME:
# srch
#
# SYNOPSIS:
# srch [-hlmnrsuw] pattern
# srch -v
#
# DESCRIPTION:
# Look through all SWISH data for "pattern".
# Default behavior (no options) is to search all indexes.
#
# OPTIONS:
# -v print the version and exit
# -h search home index
# -l search logs index
# -m search mail-unread index
# -n search notebook index
# -r search root index
# -s search mail-saved index
# -u search usr index
# -w search web index
#
# AUTHOR:
# Karl Vogel <vogelke@pobox.com>
# Sumaria Systems, Inc.
PATH=/usr/local/bin:/bin:/sbin:/usr/sbin:/usr/bin
export PATH
umask 022
tag=`basename $0`
# ======================== FUNCTIONS =============================
# die: prints an optional argument to stderr and exits.
# A common use for "die" is with a test:
# test -f /etc/passwd || die "no passwd file"
# This works in subshells and loops, but may not exit with
# a code other than 0.
die () {
    echo "$tag: error: $*" 1>&2
    exit 1
}
# usage: prints an optional string plus part of the comment
# header (if any) to stderr, and exits with code 1.
usage () {
    lines=`egrep -n '^# (NAME|AUTHOR)' $0 | sed -e 's/:.*//'`
    (
        case "$#" in
            0) ;;
            *) echo "usage error: $*"; echo ;;
        esac
        case "$lines" in
            "") ;;
            *) set `echo $lines | sed -e 's/ /,/'`
               sed -n ${1}p $0 | sed -e 's/^#//g' |
                   egrep -v AUTHOR:
               ;;
        esac
    ) 1>&2
    exit 1
}
# version: prints the current version to stdout.
version () {
lsedscr='s/RCSfile: //
s/.Date: //
s/,v . .Revision: / v/
s/\$//g'
lrevno='$RCSfile: srch,v $ $Revision: 1.4 $'
lrevdate='$Date: 2005/09/10 21:06:46 $'
echo "$lrevno $lrevdate" | sed -e "$lsedscr"
}
# ======================== MAIN PROGRAM ==========================
top=/space/swish
test -d "$top" || die "$top: not found"
cd $top || die "cd $top failed"
conf=
while getopts ":hlmnrsuvw" opt; do
    case $opt in
        h) conf="$conf home/home.conf" ;;
        l) conf="$conf logs/logs.conf" ;;
        m) conf="$conf mail-unread/unread.conf" ;;
        n) conf="$conf notebook/notebook.conf" ;;
        r) conf="$conf root/root.conf" ;;
        s) conf="$conf mail-saved/saved.conf" ;;
        u) conf="$conf usr/usr.conf" ;;
        w) conf="$conf web/web.conf" ;;
        v) version; exit 0 ;;
        # usage exits with code 1, so nothing is needed after it.
        \?) usage "-$OPTARG: invalid option" ;;
    esac
done
shift $(($OPTIND - 1))
#
# Sanity checks.
#
# "$*" joins all remaining arguments into one string.
case "$*" in
    "") usage "no pattern to search for" ;;
    *) pattern="$*" ;;
esac
case "$conf" in
"") conf="*/*.conf" ;;
*) ;;
esac
#
# Look through the swish config files to see where to search.
#
grep -H IndexFile $conf | sed -e 's!/.* ! !' |
while read dir idx
do
    (
        cd $dir
        echo "=== $dir"
        swish-e -f $idx -w "$pattern" |
            egrep -v '^#|^\.|err: no results' |
            awk '{print $1, $2}'
        echo
    )
done
exit 0
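The grep and sed pipeline near the end of srch turns each "IndexFile" line into a "directory index-file" pair for the read loop. If your grep lacks -H, the same pairs can be produced with awk alone. This is only a sketch: index_pairs is a name I made up, and it assumes config paths of the form topic/name.conf as in the layout above.

```shell
#!/bin/sh
# Sketch only: list "topic-directory index-file" pairs from swish
# config files.  index_pairs is a hypothetical helper name.

# index_pairs CONF...
# Each CONF is expected to look like "home/home.conf" and to contain
# a line such as "IndexFile index.home".
index_pairs () {
    awk '$1 == "IndexFile" {
        split(FILENAME, part, "/")    # part[1] is the topic directory
        print part[1], $2
    }' "$@"
}
```

Run from /space/swish as index_pairs */*.conf, this should print lines such as "home index.home", one per config file.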
To see how fast we can find any mention of Ted Koppel:
me% time srch ted koppel
=== home
1000 /home/vogelke/mail/ADDRS
=== logs
=== mail-saved
1000 /space/swish/mail-saved/data/c0/840641684.humor.309
1000 /space/swish/mail-saved/data/4a/892167530.humor.478
=== mail-unread
1000 /space/swish/mail-unread/data/31/1083877253.2004-05.1109
816 /space/swish/mail-unread/data/a0/1069260000.2003-11.1680
633 /space/swish/mail-unread/data/eb/1074706477.2004-01.1673
633 /space/swish/mail-unread/data/b8/1097697323.2004-10.1757
=== notebook
1000 /home/vogelke/notebook/2004/0530/strength-p1.txt
775 /home/vogelke/notebook/2004/0522/00p7lujs09q00b7y4huzk1
=== root
=== usr
=== web
1000 /doc/html/htdocs/dmoz/categories.txt
317 /doc/html/htdocs/scripting/2005-05-12.htm
srch ted koppel 0.09s user 0.11s system 84% cpu 0.232 total
To find any messages from the ifile-discuss mailing list in my unread mail:
% srch -m ifile-discuss
=== mail-unread
1000 /space/swish/mail-unread/data/41/1064852003.2003-09.3338
983 /space/swish/mail-unread/data/d6/1097041122.2004-10.622
674 /space/swish/mail-unread/data/fe/1030042118.2002-08.2209
658 /space/swish/mail-unread/data/03/1030035341.2002-08.2204
568 /space/swish/mail-unread/data/77/1036675949.2002-11.1471
Swish doesn't work for partial-word matches unless you use wildcards, and they only work at the end of a word. For example, if the word "abbreviation" is somewhere in the notebook files but "abbrev" isn't:
me% cd /space/swish/notebook
me% swish-e -f index.notebook -w abbrev
# SWISH format: 2.4.3-2005-08-16
# Search words: abbrev
# Removed stopwords:
err: no results
.
me% swish-e -f index.notebook -w 'abbrev*'
# SWISH format: 2.4.3-2005-08-16
# Search words: abbrev*
# Removed stopwords:
# Number of hits: 7
# Search time: 0.005 seconds
# Run time: 0.018 seconds
1000 .../2005/0711/LOG "LOG" 8997
1000 .../2005/0803/LOG "LOG" 7526
561 .../2005/0804/LOG "LOG" 3509
355 .../2005/0802/LOG "LOG" 1807
.
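Since a bare word can come back empty while its wildcard form finds hits, a wrapper can retry automatically. The following is a rough sketch, not something from my srch script: search_or_prefix and the SWISH variable are names invented here, and it relies only on the -f and -w options and the "err: no results" message shown above. It is meant for single-word queries, because appending "*" to a phrase only affects the last word.

```shell
#!/bin/sh
# Sketch only: retry a failed single-word search as a prefix search.
# SWISH and search_or_prefix are hypothetical names.

# Command used to run the search; override it for testing.
SWISH=${SWISH:-swish-e}

# search_or_prefix INDEXFILE WORD
# Runs the search as given; if swish reports no results, runs it again
# with a trailing "*" wildcard and prints that output instead.
search_or_prefix () {
    out=$("$SWISH" -f "$1" -w "$2")
    case "$out" in
        *"err: no results"*) "$SWISH" -f "$1" -w "$2*" ;;
        *) printf '%s\n' "$out" ;;
    esac
}
```

For example, search_or_prefix index.notebook abbrev would fall through to the 'abbrev*' query in the session above.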
You can dump the list of words held in a swish index file, which is handy if you want to rig up a closest-match fallback for queries that come back empty. To list every word in the notebook index, dropping 8-bit characters and purely numeric entries:
me% swish-e -f index.notebook -k '*' |
tr ' ' '\012' |
tr -cd '[:print:][:cntrl:]' |
grep -v '^[0-9]*$' |
sort -u > words
me% cat words
#
000000pt
00000101t000000z
0000012d
00001m
00001n
00001z
00002a
00002z
00008b
...
mark
markchar
markdown
markdownoptions
marked
markedly
markels
marker
markers
...
zzzv
zzzvvv
zzzvvvaa
zzzw
zzzwvbbbk
zzzx
zzzyy
zzzz
zzzz8dju72j
zzzznn1
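As one very simple notion of "closest match", the sketch below shortens the query one character at a time until some word in the list starts with it, then prints those words. The function name suggest is hypothetical, the word list is assumed to be a file like "words" above with one word per line, and prefix matching is my own choice here; an edit-distance search would be a more thorough alternative.

```shell
#!/bin/sh
# Sketch only: suggest indexed words sharing the longest possible
# prefix with a query.  suggest is a hypothetical helper name.

# suggest QUERY WORDFILE
# Prints every word in WORDFILE that begins with the longest prefix of
# QUERY matching anything at all.  Returns 1 if nothing matches.
suggest () {
    q=$1
    while [ -n "$q" ]
    do
        # index() does a literal match, so regex characters in the
        # query need no escaping.
        hits=$(awk -v q="$q" 'index($0, q) == 1' "$2")
        if [ -n "$hits" ]
        then
            printf '%s\n' "$hits"
            return 0
        fi
        q=${q%?}    # drop the last character and try again
    done
    return 1
}
```

The word list is rescanned once per character dropped, which is fine for a list of this size but would be worth improving for very large indexes.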