How to Use PyLucene

Michael A. Alcorn
Mar 31, 2018

This blog post briefly outlines how to install PyLucene on Fedora and gives a few examples of using its analyzers to process text.

To begin with, you need to install Apache Ant and Apache Ivy.

sudo dnf install ant ivy

Next, download PyLucene from one of the Apache mirrors linked from the project site (https://lucene.apache.org/pylucene/) and extract the archive.

tar -xzvf pylucene-6.5.0-src.tar.gz

To install PyLucene, we’ll mostly follow the official installation instructions from the PyLucene site. The first step is to install JCC, the code generator PyLucene uses to wrap the Java API.

cd /path/to/pylucene/jcc

Make sure the Java location in jcc/setup.py is correct.

'linux': '/usr/lib/jvm/java-8-oracle' # change this
'linux': '/usr/lib/jvm/java-1.8.0' # mine

Install JCC.

python setup.py build
python setup.py install
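
Before moving on, it’s worth confirming that the build actually produced an importable module — if this import fails, the PyLucene build will likely fail later with a much less obvious error.

# If JCC built and installed correctly, this import succeeds.
import jcc
print(jcc.__file__)  # shows where the module was installed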

Now we’ll start the PyLucene installation process.

cd ..

PyLucene assumes an incorrect name for the Python 3 development library, so I had to create a symbolic link to make compiling work (I brought this to the PyLucene team’s attention and it should be fixed in a future release).

ln -s /path/to/.pyenv/versions/3.5.2/lib/libpython3.5m.so.1.0 /path/to/.pyenv/versions/3.5.2/lib/libpython3.5.so

Next, edit the Makefile so that everything is pointing to the proper locations.

# Linux     (Debian Jessie 64-bit, Python 3.4.2, Oracle Java 1.8)
# Be sure to also set JDK['linux'] in jcc's setup.py to the JAVA_HOME value
# used below for ANT (and rebuild jcc after changing it).
PREFIX_PYTHON=/home/path/to/.pyenv/versions/3.5.2
ANT=JAVA_HOME=/usr/lib/jvm/java-1.8.0 /usr/bin/ant
PYTHON=$(PREFIX_PYTHON)/bin/python3
JCC=$(PYTHON) -m jcc --shared
NUM_FILES=8

Finally, make and install PyLucene.

make
make test
make install
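
Assuming make test passed, a quick sanity check from your own environment confirms everything is wired up (lucene.VERSION holds the bundled Lucene version string):

import lucene

# initVM() starts the embedded JVM; it must be called once per process
# before any Lucene classes are used.
lucene.initVM()
print(lucene.VERSION)  # e.g., "6.5.0"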

PyLucene is designed to mimic the Lucene Java API. The following examples demonstrate how to use several different analyzers to process text.

import lucene

from java.io import StringReader
from org.apache.lucene.analysis.ja import JapaneseAnalyzer
from org.apache.lucene.analysis.standard import StandardAnalyzer, StandardTokenizer
from org.apache.lucene.analysis.tokenattributes import CharTermAttribute

lucene.initVM(vmargs=['-Djava.awt.headless=true'])

# Basic tokenizer example.
test = "This is how we do it."
tokenizer = StandardTokenizer()
tokenizer.setReader(StringReader(test))
charTermAttrib = tokenizer.getAttribute(CharTermAttribute.class_)
tokenizer.reset()
tokens = []
while tokenizer.incrementToken():
    tokens.append(charTermAttrib.toString())

print(tokens)

# StandardAnalyzer example.
analyzer = StandardAnalyzer()
stream = analyzer.tokenStream("", StringReader(test))
stream.reset()
tokens = []
while stream.incrementToken():
    tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())

print(tokens)

# JapaneseAnalyzer example.
analyzer = JapaneseAnalyzer()
test = "寿司が食べたい。"
stream = analyzer.tokenStream("", StringReader(test))
stream.reset()
tokens = []
while stream.incrementToken():
    tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())

print(tokens)
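
One caveat: Lucene’s TokenStream contract also expects end() and close() to be called once you’re done consuming tokens. Here’s a small sketch that wraps the loop above accordingly — the tokenize function is my own convenience wrapper, not part of the PyLucene API:

def tokenize(analyzer, text):
    """Run `text` through `analyzer` and return the resulting tokens."""
    stream = analyzer.tokenStream("", StringReader(text))
    char_term_attrib = stream.getAttribute(CharTermAttribute.class_)
    stream.reset()
    tokens = []
    while stream.incrementToken():
        tokens.append(char_term_attrib.toString())

    # Per the TokenStream contract, signal end-of-stream and release resources.
    stream.end()
    stream.close()
    return tokens

print(tokenize(StandardAnalyzer(), "This is how we do it."))
print(tokenize(JapaneseAnalyzer(), "寿司が食べたい。"))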
