PostgreSQLFullText

From Request Tracker Wiki
Jump to: navigation, search

Outdated

This page contains out of date and possibly misleading information or instructions such as installation methods or configuration examples that no longer apply. Please consider this warning when reading the page below.


This article will describe, in short, what I have done to speed up queries which searches inside email bodies/attachments and RTFM articles which contain large amounts of WikiText.

There are 3 things that need to be done:

  • patch SearchBuilder, to change LIKE 'search' to @@ plainto_tsquery('search')
  • add a column to hold the processed .content/.largecontent fields
  • add PostgreSQL Text indexes for Attachments.content and ObjectCustomFieldValues.largecontent

Attached are the file needed todo this, that functionality isn't there anymore ;-( So here they come inline

This is the procedure that was followed to add full text

search support to attachments and RTFM Largecontent fields.

  1. Patch SearchBuilder.pm.
  2. Add a tsvector column to the attachements table to allow searching. ALTER TABLE attachments ADD COLUMN textsearchable tsvector; UPDATE attachments SET textsearchable = to_tsvector('english', coalesce(subject,'') || coalesce(content,'')); This first command failed with the error: ERROR: string is too long for tsvector

So I am adding the tsvectors only to those entries with a size < 500KB:

UPDATE attachments SET textsearchable = to_tsvector('english', substring(coalesce(subject,'') || coalesce(content,''), 1, 500000));

Add the same text search column to objectcustomfieldvalues to index largecontent:

ALTER TABLE objectcustomfieldvalues ADD COLUMN textsearchable tsvector; UPDATE objectcustomfieldvalues SET textsearchable = to_tsvector('english', substring(coalesce(largecontent,''), 1, 500000));

Now add an index on the new column to speed up searches. Note, this can be either a GIST or GIN index. GIN is faster to search but larger and slower to update while GIST is slower to search but the index is smaller and faster to update -- pick your poison:

CREATE INDEX attachments_textsearch ON attachments

  USING gist(textsearchable);

CREATE INDEX largecontent_textsearch ON objectcustomfieldvalues

  USING gist(textsearchable);

Here are the index creation commands using the GIN index type. I have tried both, and unless you are in an extremely update intensive environment you will really want GIN -- very, very fast queries.

CREATE INDEX attachments_textsearch ON attachments

  USING GIN (textsearchable );

CREATE INDEX largecontent_textsearch ON objectcustomfieldvalues

  USING GIN (textsearchable );

Here is the patch to DBIx::SearchBuilder to add the PostgreSQL support:

DBIx> diff -u SearchBuilder.pm_* --- SearchBuilder.pm_ORIG 2009-01-28 09:13:38.000000000 -0600 +++ SearchBuilder.pm_FULLTEXT 2009-02-01 15:36:52.000000000 -0600 @@ -926,6 +926,22 @@ <code><pre> } </pre></code> <ul> <li># Use FULLTEXT for large attachments.content and</li> <li># objectcustomfieldvalues.largecontent in PostgreSQL.</li> <li>if (($QualifiedField =~ /Attachments_\d+.Content/) or</li> <li>($QualifiedField =~ /ObjectCustomFieldValues_\d+.Largecontent/)) {</li> <li>if (($args{'OPERATOR'} eq 'LIKE') or ($args{'OPERATOR'} eq 'ILIKE')) {</li> <li>$QualifiedField =~ s/(?:Content|Largecontent)/textsearchable/i;</li> <li>$args{'OPERATOR'} = '@@';</li> <li>$args{'VALUE'} = "plainto_tsquery($args{'VALUE'})";</li> <li>}</li> <li>if (($args{'OPERATOR'} eq 'NOT LIKE') or ($args{'OPERATOR'} eq 'NOT ILIKE')) {</li> <li>$QualifiedField =~ s/(?:Content|Largecontent)/textsearchable/i;</li> <li>$args{'OPERATOR'} = '@@';</li> <li>$args{'VALUE'} = "(!! plainto_tsquery($args{'VALUE'}))";</li> <li>}</li> <li>}</li> <li> my $clause = { field => $QualifiedField, op => $args{'OPERATOR'},</li> </ul> <hr />

A couple of comments about the approach used. I added a second column to hold the processed data. This is needed because there are certain conditions that cause FULL TEXT indexing to fail and it is easier to work around using a trigger to generate the tsvector column instead of having the index cause the INSERT to fail completely. In this case, it will not be indexed but these a pathological cases that really should not be searched anyway. The final piece is to setup a trigger to update the textsearchable column whenever the attachment.(subject/content) or objectcustomfieldvalues.largcontent are updated to keep the searching accurate.

We also do not bother with stripping the '%' characters or the exit early tests in Handle.pm as the OracleText patches do. The reason is that the plainto_tsquery() will strip them for you so the basic patch is much simpler. Obviously, this search technique can be applied to any arbitrary field.

Okay, here are the two triggers you need to keep the texsearchable columns updated when the attachments.subject/content or objectcustomfieldvalues.largecontent are changed:

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE

ON attachments FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(textsearchable, 'pg_catalog.english', subject, content);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE

ON objectcustomfieldvalues FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(textsearchable, 'pg_catalog.english', largecontent);

The set of triggers above will update the processed document column for every change. If you need more restricting updates, use something like the following which only processes the first 1/2MB of each attachment:

CREATE FUNCTION attachments_trigger() RETURNS trigger AS $$ begin new.textsearchable := to_tsvector('pg_catalog.english', substring(coalesce(new.subject, '') || coalesce(new.content, '') from 1 for 500000)); return new; end $$ LANGUAGE plpgsql; CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON attachments FOR EACH ROW EXECUTE PROCEDURE attachments_trigger(); CREATE FUNCTION objectcustomfieldvalues_trigger() RETURNS trigger AS $$ begin new.textsearchable := to_tsvector('pg_catalog.english', substring(coalesce(new.largecontent, '') from 1 for 500000)); return new; end $$ LANGUAGE plpgsql; CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON objectcustomfieldvalues FOR EACH ROW EXECUTE PROCEDURE objectcustomfieldvalues_trigger();

Please send me any comments or feedback. --Ken