Difference between revisions of "PostgreSQLFullText"

From Request Tracker Wiki
Jump to navigation Jump to search
 
(One intermediate revision by the same user not shown)
Line 1: Line 1:
{{Outdated}}
This article will describe, in short, what I have done to speed up queries which searches inside email bodies/attachments and RTFM articles which contain large amounts of [[WikiText]].
This article will describe, in short, what I have done to speed up queries which searches inside email bodies/attachments and RTFM articles which contain large amounts of [[WikiText]].


Line 18: Line 20:


  <nowiki>ALTER TABLE attachments ADD COLUMN textsearchable tsvector;
  <nowiki>ALTER TABLE attachments ADD COLUMN textsearchable tsvector;
UPDATE attachments SET textsearchable =
  UPDATE attachments SET textsearchable =
  to_tsvector('english', coalesce(subject,'') || coalesce(content,''));
    to_tsvector('english', coalesce(subject,'') || coalesce(content,''));
 
This first command failed with the error:
  This first command failed with the error:
    ERROR:  string is too long for tsvector
      ERROR:  string is too long for tsvector
 
</nowiki>
  </nowiki>


So I am adding the tsvectors only to those entries with a size &lt; 500KB:
So I am adding the tsvectors only to those entries with a size &lt; 500KB:


  <nowiki>UPDATE attachments SET textsearchable = to_tsvector('english',
  <nowiki>UPDATE attachments SET textsearchable = to_tsvector('english',
  substring(coalesce(subject,'') || coalesce(content,''), 1, 500000));
    substring(coalesce(subject,'') || coalesce(content,''), 1, 500000));
 
</nowiki>
  </nowiki>


Add the same text search column to objectcustomfieldvalues to index largecontent:
Add the same text search column to objectcustomfieldvalues to index largecontent:


  <nowiki>ALTER TABLE objectcustomfieldvalues ADD COLUMN textsearchable tsvector;
  <nowiki>ALTER TABLE objectcustomfieldvalues ADD COLUMN textsearchable tsvector;
UPDATE objectcustomfieldvalues SET textsearchable = to_tsvector('english',
  UPDATE objectcustomfieldvalues SET textsearchable = to_tsvector('english',
  substring(coalesce(largecontent,''), 1, 500000));
    substring(coalesce(largecontent,''), 1, 500000));
 
</nowiki>
  </nowiki>


Now add an index on the new column to speed up searches. Note, this can be either a GIST or GIN index. GIN is faster to search but larger and slower to update while GIST is slower to search but the index is smaller and faster to update -- pick your poison:
Now add an index on the new column to speed up searches. Note, this can be either a GIST or GIN index. GIN is faster to search but larger and slower to update while GIST is slower to search but the index is smaller and faster to update -- pick your poison:
Line 48: Line 50:
  CREATE INDEX largecontent_textsearch ON objectcustomfieldvalues
  CREATE INDEX largecontent_textsearch ON objectcustomfieldvalues
   USING gist(textsearchable);
   USING gist(textsearchable);


Here are the index creation commands using the GIN index type. I have tried both, and unless you are in an extremely update intensive environment you will really want GIN -- very, very fast queries.
Here are the index creation commands using the GIN index type. I have tried both, and unless you are in an extremely update intensive environment you will really want GIN -- very, very fast queries.
Line 57: Line 58:
  CREATE INDEX largecontent_textsearch ON objectcustomfieldvalues
  CREATE INDEX largecontent_textsearch ON objectcustomfieldvalues
   USING GIN (textsearchable );
   USING GIN (textsearchable );


Here is the patch to [[DBIx]]::[[SearchBuilder]] to add the [[PostgreSQL]] support:
Here is the patch to [[DBIx]]::[[SearchBuilder]] to add the [[PostgreSQL]] support:


  <nowiki>DBIx&gt; diff -u SearchBuilder.pm_*
  <nowiki>DBIx&gt; diff -u SearchBuilder.pm_*
--- SearchBuilder.pm_ORIG      2009-01-28 09:13:38.000000000 -0600
  --- SearchBuilder.pm_ORIG      2009-01-28 09:13:38.000000000 -0600
+++ SearchBuilder.pm_FULLTEXT  2009-02-01 15:36:52.000000000 -0600
  +++ SearchBuilder.pm_FULLTEXT  2009-02-01 15:36:52.000000000 -0600
@@ -926,6 +926,22 @@
  @@ -926,6 +926,22 @@
 
      }
      }
 
+    # Use FULLTEXT for large attachments.content and
  +    # Use FULLTEXT for large attachments.content and
+    # objectcustomfieldvalues.largecontent in PostgreSQL.
  +    # objectcustomfieldvalues.largecontent in PostgreSQL.
+    if (($QualifiedField =~ /Attachments_\d+\.Content/) or
  +    if (($QualifiedField =~ /Attachments_\d+\.Content/) or
+        ($QualifiedField =~ /ObjectCustomFieldValues_\d+\.Largecontent/)) {
  +        ($QualifiedField =~ /ObjectCustomFieldValues_\d+\.Largecontent/)) {
+        if (($args{'OPERATOR'} eq 'LIKE') or ($args{'OPERATOR'} eq 'ILIKE')) {
  +        if (($args{'OPERATOR'} eq 'LIKE') or ($args{'OPERATOR'} eq 'ILIKE')) {
+            $QualifiedField =~ s/(?:Content|Largecontent)/textsearchable/i;
  +            $QualifiedField =~ s/(?:Content|Largecontent)/textsearchable/i;
+            $args{'OPERATOR'} = '@@';
  +            $args{'OPERATOR'} = '@@';
+            $args{'VALUE'} = "plainto_tsquery($args{'VALUE'})";
  +            $args{'VALUE'} = "plainto_tsquery($args{'VALUE'})";
+        }
  +        }
+        if (($args{'OPERATOR'} eq 'NOT LIKE') or ($args{'OPERATOR'} eq 'NOT ILIKE')) {
  +        if (($args{'OPERATOR'} eq 'NOT LIKE') or ($args{'OPERATOR'} eq 'NOT ILIKE')) {
+            $QualifiedField =~ s/(?:Content|Largecontent)/textsearchable/i;
  +            $QualifiedField =~ s/(?:Content|Largecontent)/textsearchable/i;
+            $args{'OPERATOR'} = '@@';
  +            $args{'OPERATOR'} = '@@';
+            $args{'VALUE'} = "(!! plainto_tsquery($args{'VALUE'}))";
  +            $args{'VALUE'} = "(!! plainto_tsquery($args{'VALUE'}))";
+        }
  +        }
+    }
  +    }
+
  +
    my $clause = {
    my $clause = {
        field =&gt; $QualifiedField,
        field =&gt; $QualifiedField,
        op =&gt; $args{'OPERATOR'},
        op =&gt; $args{'OPERATOR'},
 
--------------------------------------------------------------------
  --------------------------------------------------------------------
 
</nowiki>
  </nowiki>


A couple of comments about the approach used. I added a second column to hold the processed data. This is needed because there are certain conditions that cause FULL TEXT indexing to fail and it is easier to work around using a trigger to generate the tsvector column instead of having the index cause the INSERT to fail completely. In this case, it will not be indexed but these a pathological cases that really should not be searched anyway. The final piece is to setup a trigger to update the textsearchable column whenever the attachment.(subject/content) or objectcustomfieldvalues.largcontent are updated to keep the searching accurate.
A couple of comments about the approach used. I added a second column to hold the processed data. This is needed because there are certain conditions that cause FULL TEXT indexing to fail and it is easier to work around using a trigger to generate the tsvector column instead of having the index cause the INSERT to fail completely. In this case, it will not be indexed but these a pathological cases that really should not be searched anyway. The final piece is to setup a trigger to update the textsearchable column whenever the attachment.(subject/content) or objectcustomfieldvalues.largcontent are updated to keep the searching accurate.
Line 105: Line 105:
  ON objectcustomfieldvalues FOR EACH ROW EXECUTE PROCEDURE
  ON objectcustomfieldvalues FOR EACH ROW EXECUTE PROCEDURE
  tsvector_update_trigger(textsearchable, 'pg_catalog.english', largecontent);
  tsvector_update_trigger(textsearchable, 'pg_catalog.english', largecontent);


The set of triggers above will update the processed document column for every change. If you need more restricting updates, use something like the following which only processes the first 1/2MB of each attachment:
The set of triggers above will update the processed document column for every change. If you need more restricting updates, use something like the following which only processes the first 1/2MB of each attachment:


  <nowiki>CREATE FUNCTION attachments_trigger() RETURNS trigger AS $$
  <nowiki>CREATE FUNCTION attachments_trigger() RETURNS trigger AS $$
begin
  begin
  new.textsearchable :=
    new.textsearchable :=
      to_tsvector('pg_catalog.english', substring(coalesce(new.subject, '') || coalesce(new.content, '') from 1 for 500000));
        to_tsvector('pg_catalog.english', substring(coalesce(new.subject, '') || coalesce(new.content, '') from 1 for 500000));
  return new;
    return new;
end
  end
$$ LANGUAGE plpgsql;
  $$ LANGUAGE plpgsql;
 
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
  CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON attachments FOR EACH ROW EXECUTE PROCEDURE attachments_trigger();
  ON attachments FOR EACH ROW EXECUTE PROCEDURE attachments_trigger();
 
CREATE FUNCTION objectcustomfieldvalues_trigger() RETURNS trigger AS $$
  CREATE FUNCTION objectcustomfieldvalues_trigger() RETURNS trigger AS $$
begin
  begin
  new.textsearchable :=
    new.textsearchable :=
      to_tsvector('pg_catalog.english', substring(coalesce(new.largecontent, '') from 1 for 500000));
        to_tsvector('pg_catalog.english', substring(coalesce(new.largecontent, '') from 1 for 500000));
  return new;
    return new;
end
  end
$$ LANGUAGE plpgsql;
  $$ LANGUAGE plpgsql;
 
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
  CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON objectcustomfieldvalues FOR EACH ROW EXECUTE PROCEDURE
  ON objectcustomfieldvalues FOR EACH ROW EXECUTE PROCEDURE
objectcustomfieldvalues_trigger();
  objectcustomfieldvalues_trigger();
 
</nowiki>
  </nowiki>


Please send me any comments or feedback. --Ken
Please send me any comments or feedback. --Ken

Revision as of 11:08, 9 April 2014



This page 'PostgreSQLFullText' is tagged as OUTDATED
This page contains out of date and possibly misleading information or instructions such as installation methods or configuration examples that no longer apply. Please consider this warning when reading the page below.
If you have checked or updated this page and found the content to be suitable, please remove this notice by editing the page and remove the Outdated template tag.


This article will describe, in short, what I have done to speed up queries which searches inside email bodies/attachments and RTFM articles which contain large amounts of WikiText.

There are 3 things that need to be done:

  • patch SearchBuilder, to change LIKE 'search' to @@ plainto_tsquery('search')
  • add a column to hold the processed .content/.largecontent fields
  • add PostgreSQL Text indexes for Attachments.content and ObjectCustomFieldValues.largecontent

Attached are the file needed todo this, that functionality isn't there anymore ;-( So here they come inline

This is the procedure that was followed to add full text

search support to attachments and RTFM Largecontent fields.

1. Patch SearchBuilder.pm.

2. Add a tsvector column to the attachements table to allow searching.

ALTER TABLE attachments ADD COLUMN textsearchable tsvector;
  UPDATE attachments SET textsearchable =
    to_tsvector('english', coalesce(subject,'') || coalesce(content,''));
  
  This first command failed with the error:
      ERROR:  string is too long for tsvector
  
  

So I am adding the tsvectors only to those entries with a size < 500KB:

UPDATE attachments SET textsearchable = to_tsvector('english',
    substring(coalesce(subject,'') || coalesce(content,''), 1, 500000));
  
  

Add the same text search column to objectcustomfieldvalues to index largecontent:

ALTER TABLE objectcustomfieldvalues ADD COLUMN textsearchable tsvector;
  UPDATE objectcustomfieldvalues SET textsearchable = to_tsvector('english',
    substring(coalesce(largecontent,''), 1, 500000));
  
  

Now add an index on the new column to speed up searches. Note, this can be either a GIST or GIN index. GIN is faster to search but larger and slower to update while GIST is slower to search but the index is smaller and faster to update -- pick your poison:

CREATE INDEX attachments_textsearch ON attachments
  USING gist(textsearchable);

CREATE INDEX largecontent_textsearch ON objectcustomfieldvalues
  USING gist(textsearchable);

Here are the index creation commands using the GIN index type. I have tried both, and unless you are in an extremely update intensive environment you will really want GIN -- very, very fast queries.

CREATE INDEX attachments_textsearch ON attachments
  USING GIN (textsearchable );

CREATE INDEX largecontent_textsearch ON objectcustomfieldvalues
  USING GIN (textsearchable );

Here is the patch to DBIx::SearchBuilder to add the PostgreSQL support:

DBIx> diff -u SearchBuilder.pm_*
  --- SearchBuilder.pm_ORIG       2009-01-28 09:13:38.000000000 -0600
  +++ SearchBuilder.pm_FULLTEXT   2009-02-01 15:36:52.000000000 -0600
  @@ -926,6 +926,22 @@
  
       }
  
  +    # Use FULLTEXT for large attachments.content and
  +    # objectcustomfieldvalues.largecontent in PostgreSQL.
  +    if (($QualifiedField =~ /Attachments_\d+\.Content/) or
  +        ($QualifiedField =~ /ObjectCustomFieldValues_\d+\.Largecontent/)) {
  +        if (($args{'OPERATOR'} eq 'LIKE') or ($args{'OPERATOR'} eq 'ILIKE')) {
  +            $QualifiedField =~ s/(?:Content|Largecontent)/textsearchable/i;
  +            $args{'OPERATOR'} = '@@';
  +            $args{'VALUE'} = "plainto_tsquery($args{'VALUE'})";
  +        }
  +        if (($args{'OPERATOR'} eq 'NOT LIKE') or ($args{'OPERATOR'} eq 'NOT ILIKE')) {
  +            $QualifiedField =~ s/(?:Content|Largecontent)/textsearchable/i;
  +            $args{'OPERATOR'} = '@@';
  +            $args{'VALUE'} = "(!! plainto_tsquery($args{'VALUE'}))";
  +        }
  +    }
  +
     my $clause = {
         field => $QualifiedField,
         op => $args{'OPERATOR'},
  
  --------------------------------------------------------------------
  
  

A couple of comments about the approach used. I added a second column to hold the processed data. This is needed because there are certain conditions that cause FULL TEXT indexing to fail and it is easier to work around using a trigger to generate the tsvector column instead of having the index cause the INSERT to fail completely. In this case, it will not be indexed but these a pathological cases that really should not be searched anyway. The final piece is to setup a trigger to update the textsearchable column whenever the attachment.(subject/content) or objectcustomfieldvalues.largcontent are updated to keep the searching accurate.

We also do not bother with stripping the '%' characters or the exit early tests in Handle.pm as the OracleText patches do. The reason is that the plainto_tsquery() will strip them for you so the basic patch is much simpler. Obviously, this search technique can be applied to any arbitrary field.

Okay, here are the two triggers you need to keep the texsearchable columns updated when the attachments.subject/content or objectcustomfieldvalues.largecontent are changed:

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON attachments FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(textsearchable, 'pg_catalog.english', subject, content);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON objectcustomfieldvalues FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(textsearchable, 'pg_catalog.english', largecontent);

The set of triggers above will update the processed document column for every change. If you need more restricting updates, use something like the following which only processes the first 1/2MB of each attachment:

CREATE FUNCTION attachments_trigger() RETURNS trigger AS $$
  begin
    new.textsearchable :=
        to_tsvector('pg_catalog.english', substring(coalesce(new.subject, '') || coalesce(new.content, '') from 1 for 500000));
    return new;
  end
  $$ LANGUAGE plpgsql;
  
  CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
  ON attachments FOR EACH ROW EXECUTE PROCEDURE attachments_trigger();
  
  CREATE FUNCTION objectcustomfieldvalues_trigger() RETURNS trigger AS $$
  begin
    new.textsearchable :=
        to_tsvector('pg_catalog.english', substring(coalesce(new.largecontent, '') from 1 for 500000));
    return new;
  end
  $$ LANGUAGE plpgsql;
  
  CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
  ON objectcustomfieldvalues FOR EACH ROW EXECUTE PROCEDURE
  objectcustomfieldvalues_trigger();
  
  

Please send me any comments or feedback. --Ken