A Sample Library Application, Featuring the Spring Framework,
Lucene-Based Hibernate Search, and JavaServer Faces
August 19, 2009 · Robert Söding
Preface
Primarily, this article and sample application have been
written to study the feasibility of using the Lucene full-text
search engine - or some framework on top of it - in a Spring, and
Hibernate, application.
In contrast to my other recently written articles, this one
has been approached in an ad-hoc way. That is, I had
not read the complete documentation before starting to
code. Likewise, the article is intentionally kept brief.
As for Lucene integration with Hibernate and Spring, there are
a number of frameworks around (see chapter Related Resources). While
the Compass framework might be worth more than a second look,
Hibernate Search has been chosen for the simple reason of JBoss
being an industry leader with Hibernate itself.
Readers are assumed to already have a basic
knowledge of the Spring framework, Hibernate, and (for that matter)
JavaServer Faces (JSF).
Feedback is welcome.
Prerequisites
There are the following software requirements:
The project is Eclipse-based. Both Eclipse 3.4 and 3.5 should
do.
The application has been developed against the MySQL 5.0
database. Other databases will also work.
As a web server, Tomcat 6.0 has been used.
To test the sample application:
1. Download the project's WAR file, which contains all dependencies
and the project's source code. In Eclipse, click File --> Import
--> WAR file to import the project.
2. Edit src/main/resources/jdbc.properties and insert a
valid user name and password.
3. Edit src/main/resources/hibernate.cfg.xml and insert
the folder name of the Lucene indexes to be created.
That should be all there is to consider.
Use Cases
Basically, a user can search the web (Google), download and
add the search results to the library, search the library, view
document details (including extracted plain text), and view an
archived copy of the original media.
Search the Web and Add new Media to the Library
The following image shows the "Add Media" view:
Search the Library
The following image shows the "Search Library" view:
View Media Details and the Archived Document
The following image shows the "Show Media Details" view:
MySQL is used as the RDBMS (Relational Database Management System).
(Most common databases would also do.) Hibernate, an OR/M
(Object-Relational Mapper), is used through the JPA (Java
Persistence API) to persist and retrieve data.
Data access is encapsulated within Data Access Objects, in
this case, one for CRUD (Create-Read-Update-Delete) operations, and
another one for more sophisticated search operations:
Central APIs used include Spring's JpaTemplate
and JpaCallback, Hibernate Search's
FullTextEntityManager and
FullTextQuery, and Lucene's Query
and Analyzer.
The DAOs are exposed as Spring beans; see chapter
Dependency Injection for more information.
Model
Entities
The following image shows the entities and the
MediaFactory:
The MediaFactory is used to create and populate a
Media from a WebSearchResult instance. It
downloads the corresponding document and uses Apache
Tika to extract the document contents.
The entities, Media and their
MetaData, are mapped to the database tables and
fields, as well as Lucene index fields, using Hibernate, Hibernate
Search, and JPA annotations.
Further information is available on these JPA, Hibernate,
and Hibernate Search annotations. You may
also want to read my previously written chapter on XML-based
Hibernate Mappings.
The following image shows a Media's (constructors
and) methods:
Value Objects
Value (or Transfer) Objects are used to transfer specific
information. For clarity, they may expose getter methods only.
The following image shows the value objects used in the
application:
Additionally, there are the WebSearchCriteria and
WebSearchResult value objects, in the
de.metagear.util.web package.
Service Layer
Any central business logic in the sample application is
coordinated by Service Layer methods, which, in this case, mostly
operate on the DAOs' methods. The following image shows the
service layer structure:
The DataQueryService is used to retrieve data
from the database and the Lucene index.
The WebInteractionService is used to search the
web (utilizing the Google Search APIs) and to save the
WebSearchResults retrieved. Implementations of both
are exposed as Spring beans.
The DocumentServlet displays archived
Media documents.
Controllers
The Controllers, implemented as JavaServer Faces Managed
Beans, connect the service and view layer. See
WEB-INF/faces-config.xml and the following image for an
overview:
We could have used the Spring MVC API; however,
the application's complexity does not require that.
Commonly, these controllers provide JSF action
methods, a BackingBean (containing static
properties to be used in the JSF views) and a
CommandBean (containing properties that are to be
edited by the views).
The service layer's Spring beans are dependency-injected,
which is configured in WEB-INF/faces-config.xml, where a
SpringBeanFacesELResolver resolves the Spring beans'
names to JSF.
Glue
Several portions of functionality can be used (and re-used)
independently of concrete applications. In the sample
application, these are organized in the de.metagear.util and
de.metagear.library.util Java packages.
MediaParser
The MediaParser puts the Apache
Tika APIs to work to extract contents (into plain text or HTML)
as well as meta data from documents of a large number of document
formats. Internally, Tika uses Apache POI, PDFBox, and various
other libraries.
QueryTermsProcessor
The QueryTermsProcessor plays quite a central
role in querying the Lucene index. It processes the search terms
entered by the user (see Apache Lucene - Query Parser Syntax) as well as
other query terms (i.e., the requested document format or language)
and assigns them to Lucene index fields. In doing so, the
QueryTermsProcessor also combines terms and term
groups (using Lucene's AND, OR, and
NOT operators) and properly nests them.
A formatted query passed to Lucene's QueryParser
might resemble the following code snippet:
(
    (
        plainText:spring OR plainText:groovy
    )
    OR
    (
        title:spring OR title:groovy
    )
)
AND
(
    (
        languageCode:de
    )
    AND
    (
        mimeType:text/html
    )
)
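The nesting of such a query string can be sketched with plain string handling. The following is a minimal sketch only, not the sample application's QueryTermsProcessor: the class QueryStringSketch and its methods are hypothetical names, and the sketch assumes simple terms (no phrases, wildcards, or escaping of special characters).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not the application's QueryTermsProcessor. */
final class QueryStringSketch {

    /** Joins the given parts with " <operator> " and wraps the result in parentheses. */
    static String group(List<String> parts, String operator) {
        return "(" + String.join(" " + operator + " ", parts) + ")";
    }

    /** Builds one group per index field, e.g. "(title:spring OR title:groovy)". */
    static String fieldGroup(String field, List<String> terms) {
        List<String> parts = new ArrayList<String>();
        for (String term : terms) {
            parts.add(field + ":" + term);
        }
        return group(parts, "OR");
    }

    /** Combines the user's terms (any field may match) with mandatory filters. */
    static String build(List<String> terms, List<String> fields,
            String languageCode, String mimeType) {
        List<String> fieldGroups = new ArrayList<String>();
        for (String field : fields) {
            fieldGroups.add(fieldGroup(field, terms));
        }
        List<String> filters = Arrays.asList(
                "(languageCode:" + languageCode + ")",
                "(mimeType:" + mimeType + ")");
        return group(fieldGroups, "OR") + " AND " + group(filters, "AND");
    }

    public static void main(String[] args) {
        // Prints the same query as the snippet above, on a single line.
        System.out.println(build(Arrays.asList("spring", "groovy"),
                Arrays.asList("plainText", "title"), "de", "text/html"));
    }
}
```

Wrapping every group in parentheses, even a group with a single clause, keeps operator precedence explicit, so the string does not depend on the parser's default operator.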
General-Purpose Libraries
The CollectionUtils, IoUtils and
StringUtils classes provide low-level
functionality.
The PaginationSupport class ("next page",
"previous page", etc.) can be plugged into frameworks of any
type.
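The arithmetic behind such "next page" and "previous page" navigation is framework-independent. The following is a minimal sketch under assumptions: PageWindow is a hypothetical class, not the application's PaginationSupport, and it assumes a zero-based start index plus a fixed page size.

```java
/** Hypothetical pagination helper; not the application's PaginationSupport. */
final class PageWindow {
    private final int startItem;    // zero-based index of the first item shown
    private final int itemsPerPage; // page size, must be positive
    private final int totalItems;   // total number of matches

    PageWindow(int startItem, int itemsPerPage, int totalItems) {
        if (startItem < 0 || itemsPerPage <= 0 || totalItems < 0) {
            throw new IllegalArgumentException("invalid pagination values");
        }
        this.startItem = startItem;
        this.itemsPerPage = itemsPerPage;
        this.totalItems = totalItems;
    }

    boolean hasPrevious() {
        return startItem > 0;
    }

    boolean hasNext() {
        return startItem + itemsPerPage < totalItems;
    }

    /** Start index of the previous page, never below zero. */
    int previousStart() {
        return Math.max(0, startItem - itemsPerPage);
    }

    /** Start index of the next page, or the current one if there is none. */
    int nextStart() {
        return hasNext() ? startItem + itemsPerPage : startItem;
    }

    /** Number of pages needed to show all items (rounded up). */
    int pageCount() {
        return (totalItems + itemsPerPage - 1) / itemsPerPage;
    }

    public static void main(String[] args) {
        PageWindow page = new PageWindow(10, 10, 25);
        System.out.println(page.hasNext() + " " + page.nextStart() + " " + page.pageCount());
    }
}
```

Because the class holds nothing but three integers, it can back a JSF view, a servlet, or a command-line tool alike.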
MySql5InnoDbDialectUTF8
The MySql5InnoDbDialectUTF8 Hibernate dialect
causes Hibernate to create a database with a UTF-8 character
set.
Java Reflection
Classes in the de.metagear.util.reflection
package are used to manipulate Java bean properties, which saves a
lot of code in classes using them.
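To illustrate the kind of code this saves, here is a minimal sketch of reading and writing a bean property by name with the JDK's java.beans.Introspector. BeanPropertySketch and its Book bean are hypothetical and are not taken from the de.metagear.util.reflection package.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.Method;

/** Hypothetical helper; not part of the de.metagear.util.reflection package. */
final class BeanPropertySketch {

    /** Small demo bean following the JavaBeans naming conventions. */
    public static class Book {
        private String title;
        public String getTitle() { return title; }
        public void setTitle(String title) { this.title = title; }
    }

    /** Looks up the descriptor of the named property or fails. */
    private static PropertyDescriptor find(Object bean, String name) {
        try {
            for (PropertyDescriptor pd : Introspector.getBeanInfo(bean.getClass())
                    .getPropertyDescriptors()) {
                if (pd.getName().equals(name)) {
                    return pd;
                }
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
        throw new IllegalArgumentException("No property '" + name + "'");
    }

    /** Reads a property through its getter. */
    static Object getProperty(Object bean, String name) {
        Method getter = find(bean, name).getReadMethod();
        if (getter == null) {
            throw new IllegalArgumentException("Property '" + name + "' is not readable");
        }
        try {
            return getter.invoke(bean);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Writes a property through its setter. */
    static void setProperty(Object bean, String name, Object value) {
        Method setter = find(bean, name).getWriteMethod();
        if (setter == null) {
            throw new IllegalArgumentException("Property '" + name + "' is not writable");
        }
        try {
            setter.invoke(bean, value);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Book book = new Book();
        setProperty(book, "title", "Sample title");
        System.out.println(getProperty(book, "title"));
    }
}
```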
Web and Google Search
The following image shows parts of the WebSearch
APIs and its GoogleSearch implementation:
The GoogleSearch APIs are based on the *wonder*
Google Search APIs (which is not of great scope, BTW).
Currently, Google returns only eight matches per
query. This could be changed by obtaining a client key.
View
While JavaServer Faces 2.0 has been stable in my own
experience (according to its automated tests, however, it is
not yet as stable as JSF 1.2), JavaServer Faces 1.2 has been
chosen for the sample application. That way, the
application can be extended with other JSF frameworks, which are
typically not yet compatible with JSF 2.0.
Facelets
Facelets, a composition and templating framework, is used. The
main template is WEB-INF/templates/masterTemplate.jspx
(I couldn't get Eclipse's code completion to
work with files that have an ".xhtml" extension). See Facelets
Resources for more information.
JavaServer Faces (JSF)
The JSF views' code is pretty straightforward.
For a more thorough discussion, see chapter
Presentation Layer (on JSF 2.0) in my
previously written JEE 6 article.
The sample application's JSF beans are
session scoped. In a larger application, one
would think twice about which properties actually need to
belong to the resource-intensive session scope. See chapter
Bean Declaration and Scope in the
aforementioned article.
Testing
For the tests, the JUnit framework is used.
Currently, the basic libraries are covered relatively
thoroughly, and there is decent coverage of the service methods. DAO
and integration tests (including the web GUI) are missing.
The reason for the missing tests is, of course,
that testing is not within the sample application's scope at
all. Moreover, it is not foreseeable that the tests would need to
be run repeatedly in the future.
Note that the service tests do populate
the database; their automatic transaction rollback is currently
switched off.
Content Analysis and Processing, and Indexing
Content Analysis and Processing
An Analyzer returns a TokenStream by
applying one or more TokenFilters (modifying, accepting, or
discarding tokens) to a Tokenizer (splitting a
character sequence into tokens).
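The tokenizer-plus-filters idea can be illustrated without Lucene. The following plain-Java sketch only mimics the concept: AnalysisPipelineSketch and its methods are hypothetical and do not use Lucene's Analyzer, Tokenizer, or TokenFilter classes. It assumes whitespace tokenization and a caller-supplied stop word set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Conceptual stand-in for an analysis chain; does not use Lucene's classes. */
final class AnalysisPipelineSketch {

    /** "Tokenizer": splits the character sequence at whitespace. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** "Filter" that modifies tokens: converts each one to lower case. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> result = new ArrayList<String>();
        for (String token : tokens) {
            result.add(token.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    /** "Filter" that discards tokens: drops everything in the stop word set. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> result = new ArrayList<String>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    /** "Analyzer": the tokenizer followed by the filters, in a fixed order. */
    static List<String> analyze(String text, Set<String> stopWords) {
        return removeStopWords(lowerCase(tokenize(text)), stopWords);
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<String>(Arrays.asList("the", "and"));
        // Prints [spring, framework, lucene]
        System.out.println(analyze("The Spring Framework and Lucene", stopWords));
    }
}
```

The order of the filters matters: the lower-case step runs first so that "The" matches the lower-case stop word "the".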
The following images show the Analyzer,
Tokenizer, and TokenFilter, type
hierarchies (the optional library lucene-analyzers.jar
being installed).
The StandardAnalyzer (which is used unless
specified otherwise) works with the StandardTokenizer,
to which the filters StandardFilter,
LowerCaseFilter and StopFilter are
applied.
For natural-language texts in particular, stop
words and word stems need to be considered. On the
other hand, if a text is expected to contain, for example, an ISBN
like "978-3-89864-465-5", it needs to be ensured that this
entity is indexed as is, i.e., neither split nor discarded.
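Why the choice of tokenizer matters for such identifiers can be shown with two simple splitting rules. This is a sketch in plain Java, not a statement about how any particular Lucene tokenizer behaves; IsbnTokenSketch and both methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Contrasts two splitting rules; a hypothetical illustration, not Lucene code. */
final class IsbnTokenSketch {

    /** Splits at whitespace only, so hyphenated identifiers stay intact. */
    static List<String> splitOnWhitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    /** Splits at every run of characters that are neither letters nor digits. */
    static List<String> splitOnNonAlphanumerics(String text) {
        return Arrays.asList(text.trim().split("[^\\p{L}\\p{N}]+"));
    }

    public static void main(String[] args) {
        String text = "ISBN 978-3-89864-465-5";
        // Prints [ISBN, 978-3-89864-465-5]
        System.out.println(splitOnWhitespace(text));
        // Prints [ISBN, 978, 3, 89864, 465, 5]
        System.out.println(splitOnNonAlphanumerics(text));
    }
}
```

With the second rule, a search for the complete ISBN could no longer match a single indexed token.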
Note that, in addition to the Hibernate Search APIs
discussed in this chapter, Lucene provides its own index-related
APIs, including the IndexReader and
IndexWriter classes.
Index Creation
The properties of the sample application's entities
are annotated with Hibernate Search's
@Field annotation, as in the following snippet:
...
@Field(index = Index.TOKENIZED, store = Store.YES)
private String title;
...
@Field(index = Index.UN_TOKENIZED, store = Store.YES)
private String mimeType;
The @Field annotation causes the property value
to be indexed in the corresponding Lucene field. The value can be
tokenized, that is, split. The tokenizing
behavior can be specified using the @AnalyzerDef,
@TokenizerDef and @TokenFilterDef
annotations (see chapter Content Analysis and
Processing).
Given the @Field annotation is in place,
Hibernate Search, by default, will automatically index the property
values on JpaTemplate.persist(Object) and
JpaTemplate.merge(Object).
Indexing places a write lock on the index, and only one
index writer can operate at a time.
Therefore, strategies exist to defer index creation. There are
configuration settings for indexing after a given number of
transactions or operations (see Tuning Lucene indexing performance).
Additionally, Hibernate Search provides means to send index change
requests to a JMS (Java Message Service) queue.
Automatic indexing can also be disabled (see Automatic indexing). Index
changes can then be made manually by using the
FullTextSession's <T> void index(T)
and void flushToIndexes() methods (see Manual indexing).
Index Optimization
An optimization consolidates index files into one main file.
Optimization can be set to be performed automatically in Hibernate
Search's configuration (see Automatic Optimization) or manually, by
invoking one of the overloaded optimize(..) methods of
a SearchFactory, which, in turn, can be obtained from
a FullTextSession (see Manual Optimization).
Index Search
A Lucene search in the sample application is implemented as
follows:
public Collection<MediaSearchResultVO> getSearchResults(
        final MediaSearchCriteriaVO criteria) {
    return (Collection<MediaSearchResultVO>) jpaTemplate
            .execute(new JpaCallback() {
                @Override
                public Collection<MediaSearchResultVO> doInJpa(
                        EntityManager em) throws PersistenceException {
                    try {
                        String preProcessedSearchTerms = new QueryTermsProcessor()
                                .processQueryTerms(criteria, new String[] {
                                        "plainText", "title" },
                                        QueryOperator.OR);
                        Query query = new QueryParser("plainText",
                                new WhitespaceAnalyzer())
                                .parse(preProcessedSearchTerms);
                        FullTextEntityManager fullTextEntityManager = Search
                                .getFullTextEntityManager(em);
                        FullTextQuery fullTextQuery = fullTextEntityManager
                                .createFullTextQuery(query, Media.class);
                        fullTextQuery.setProjection(FullTextQuery.SCORE,
                                "id", "title", "mimeType", "languageCode",
                                "lastUpdated");
                        fullTextQuery
                                .setResultTransformer(
                                        new MediaSearchResultVoResultTransformer());
                        fullTextQuery.setFirstResult(criteria
                                .getStartItem());
                        fullTextQuery.setMaxResults(criteria
                                .getNumOfItemsPerPage());
                        if (criteria.getOrderBy() != null) {
                            fullTextQuery.setSort(new Sort(criteria
                                    .getOrderBy()));
                        }
                        return (Collection<MediaSearchResultVO>) fullTextQuery
                                .getResultList();
                    }
                    catch (ParseException e) {
                        throw new MediaSearchException(e);
                    }
                }
            });
}
First of all, the application's
QueryTermsProcessor pre-processes the search terms
(see chapter QueryTermsProcessor).
Next, the Lucene QueryParser further processes
the search terms string (which will be semantically equal,
afterwards) and returns a Query, using a
WhitespaceAnalyzer in this case (see chapter Content Analysis and
Processing).
A FullTextQuery instance is created and provided
with a Projection. This, firstly, specifies and limits
the fields to be returned and, secondly, causes Hibernate Search to
query the Lucene indexes, only, in contrast to querying the
database.
The sample application's
MediaSearchResultVoResultTransformer transforms each
result row's values to a MediaSearchResultVO instance
(of which, after all, a Collection will be
returned).
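With a projection in place, each result row arrives as an Object[] whose elements follow the order of the projected fields. The following plain-Java sketch shows that mapping step in isolation. It is not the application's MediaSearchResultVoResultTransformer: ProjectionRowSketch and SearchHit are hypothetical names, and the column types (Float, Long, String, Date) are assumptions.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

/** Hypothetical row mapper; not the application's result transformer. */
final class ProjectionRowSketch {

    /** Getter-only value object; a stand-in for MediaSearchResultVO. */
    static final class SearchHit {
        private final float score;
        private final Long id;
        private final String title;
        private final String mimeType;
        private final String languageCode;
        private final Date lastUpdated;

        SearchHit(float score, Long id, String title, String mimeType,
                String languageCode, Date lastUpdated) {
            this.score = score;
            this.id = id;
            this.title = title;
            this.mimeType = mimeType;
            this.languageCode = languageCode;
            this.lastUpdated = lastUpdated;
        }

        float getScore() { return score; }
        Long getId() { return id; }
        String getTitle() { return title; }
        String getMimeType() { return mimeType; }
        String getLanguageCode() { return languageCode; }
        Date getLastUpdated() { return lastUpdated; }
    }

    /**
     * Maps one projected row. The index positions must match the projection
     * order: score, id, title, mimeType, languageCode, lastUpdated.
     */
    static SearchHit toHit(Object[] row) {
        return new SearchHit((Float) row[0], (Long) row[1], (String) row[2],
                (String) row[3], (String) row[4], (Date) row[5]);
    }

    /** Maps all rows, preserving their order (and thus the ranking). */
    static List<SearchHit> toHits(List<Object[]> rows) {
        List<SearchHit> hits = new ArrayList<SearchHit>();
        for (Object[] row : rows) {
            hits.add(toHit(row));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<Object[]>();
        rows.add(new Object[] { 1.5f, 42L, "A title", "text/html", "de", new Date(0L) });
        System.out.println(toHits(rows).get(0).getTitle());
    }
}
```

The mapping is purely positional, so changing the projection's field order without adapting the mapper leads to a ClassCastException or to silently swapped values.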
Additionally, pagination and sort properties are set.
Miscellaneous
Luke - Lucene Index Toolbox
Luke is a simple, yet effective, tool to view, query, and
manipulate Lucene indexes. See the following screenshot and the
Luke
Homepage for more information.