Ticket #744 (closed defect: fixed)

Opened 4 years ago

Last modified 3 years ago

Revamp text search within Ambra.

Reported by: rich Assigned to: ronald
Priority: critical Milestone:
Component: ambra Version: 0.9-SNAPSHOT
Keywords: advanced_search Cc:
Blocking: Blocked By:

Description (Last modified by amit)

Text search in Ambra is a complicated and painful legacy of the original Topaz design. It needs to be redone to allow graceful rollback on error and to simplify the search code.

Secondly, it would be great to be able to search within a TQL relationship query.

Change History

01/08/08 15:04:50 changed by rich

  • milestone changed from pubApp_0.9.0 to pubApp_0.8.2.1.

01/08/08 15:26:45 changed by rich

  • priority changed from high to critical.

01/22/08 16:51:44 changed by russ

  • priority changed from critical to unassigned.
  • milestone deleted.

If we ever need to search annotations, one path would be to index the annotations with Lucene, which would require some interaction between Mulgara and Lucene.

It is unknown when, or if, this will happen.

01/22/08 17:08:35 changed by amit

Just to add to the discussion: with the check-in on the backend regarding 'blobs', we can potentially mark a blob's content for Lucene indexing and searching. This is one of the many ways we think full-text search could be improved. Not for this release of course, but for future reference.

Lucene Annotations

01/23/08 13:39:53 changed by ronald

I would point to Compass instead - I think that provides more features for us. Also, it's not just the blob but also various fields that need to be indexed by Lucene, so I think it will be more than just blob-triggered.

06/19/08 15:16:39 changed by amit

  • priority changed from unassigned to critical.
  • version set to 0.9-SNAPSHOT.
  • blocking changed.
  • blockedby changed.

Increasing priority for next release.

06/19/08 15:17:03 changed by amit

  • owner deleted.

06/23/08 15:48:39 changed by amit

We also need to merge simple search and advanced search internally into a single function and not two separate functions. From an end-user perspective there still might be two visible ways of searching, but internally it should map to one and the same.

We also need to make this 'journal' aware if necessary. At this stage Rich and I both agree that journal should be one more condition that the user can apply. I am making this the overall ticket for text search and closing the others as duplicates.
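As a rough illustration of the "one internal function" idea above, here is a minimal Java sketch in which both the simple and the advanced entry points delegate to a single query builder, with journal as just one more optional condition. All class, method, and field names (`SearchCriteria`, `UnifiedSearch`, `author`, `journal`) are hypothetical and not taken from the Ambra code base, and the sketch does no escaping of user input.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical criteria object; the field names are illustrative, not Ambra's. */
final class SearchCriteria {
    final String text;     // free-text terms, may be null
    final String author;   // optional author condition, may be null
    final String journal;  // optional journal condition, may be null

    SearchCriteria(String text, String author, String journal) {
        this.text = text;
        this.author = author;
        this.journal = journal;
    }
}

final class UnifiedSearch {
    /** Single internal entry point: builds one Lucene-style query string. */
    static String buildQuery(SearchCriteria c) {
        List<String> parts = new ArrayList<>();
        if (c.text != null && !c.text.isEmpty()) {
            parts.add(c.text);
        }
        if (c.author != null && !c.author.isEmpty()) {
            parts.add("author:\"" + c.author + "\"");
        }
        if (c.journal != null && !c.journal.isEmpty()) {
            parts.add("journal:\"" + c.journal + "\"");
        }
        return String.join(" AND ", parts);
    }

    /** Simple search is just the unified search with only the text set. */
    static String simpleSearch(String text) {
        return buildQuery(new SearchCriteria(text, null, null));
    }

    /** Advanced search fills in more conditions but uses the same builder. */
    static String advancedSearch(String text, String author, String journal) {
        return buildQuery(new SearchCriteria(text, author, journal));
    }
}
```

With this shape, the two user-visible search forms can stay separate while any later change (for example, a new condition) only has to be made in one place.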

09/08/08 16:22:51 changed by dragisak

  • owner set to ronald.
  • milestone set to 0.9.1.

For the next release, the goal is just to remove the complexity related to ingestion.

09/08/08 23:25:27 changed by amit

  • description changed.
  • summary changed from Advanced Search - Lucene / Mulgara to Revamp text search within Ambra..

09/10/08 16:57:34 changed by amit

  • type changed from task to defect.

09/20/08 22:51:37 changed by ronald

  • status changed from new to assigned.

Note: we need to make sure we fix #1030 and that we have the fields mentioned in #723 when we upgrade Ambra to the new search stuff. This may require pulling out more metadata during ingest than we currently do.

10/22/08 15:44:50 changed by rich

There are also a number of fields that are not being indexed during ingest but should be. See Search Use Cases - Additional Fields

11/11/08 17:06:01 changed by ronald

r6667 addresses this ticket.

11/23/08 04:42:52 changed by ronald

(In [6771]) Switch Ambra to use the new OTM full-text search support. On the front-end side things still use and build Lucene queries; these are then parsed and translated into an OQL query. This way nothing really changes from the user's point of view.

The search-server is no longer necessary and has been removed, and bringing up Ambra with Jetty and embedded Mulgara/blob-store now supports searching too. The configuration for using Mulgara's distributed resolver to split Mulgara into two instances, with and without the Lucene queries and storage, has not been done yet.

A couple of things don't quite work yet either: there are some bugs with the -foo (i.e. the "without words") translation in complex cases, and the boost factors are not applied. Almost everything else should work, though.

The hits are evaluated lazily (i.e. the loading of the article info is done only when those hits are viewed). Also, this is done in a way such that multiple clients running the same query (most likely a spider) will queue up as necessary and get those results as soon as they are ready without resorting to re-running the query if things time out.
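The two behaviours described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from [6771]: `LazyHit`, `SharedQueryRunner`, and the `Function` callbacks standing in for the article loader and the search backend are all hypothetical names. The first class defers loading article info until a hit is actually viewed; the second lets concurrent callers with the same query wait on one shared execution instead of each running the query.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical hit whose article info is loaded only on first access. */
final class LazyHit {
    private final String articleId;
    private final Function<String, String> loader; // stand-in for the article lookup
    private String title;                          // null until first viewed

    LazyHit(String articleId, Function<String, String> loader) {
        this.articleId = articleId;
        this.loader = loader;
    }

    synchronized String title() {
        if (title == null) {
            title = loader.apply(articleId); // load on demand, at most once
        }
        return title;
    }
}

/** Hypothetical runner: callers issuing the same query share one execution. */
final class SharedQueryRunner {
    private final ConcurrentMap<String, CompletableFuture<List<String>>> inFlight =
        new ConcurrentHashMap<>();
    private final Function<String, List<String>> backend; // stand-in for the real search

    SharedQueryRunner(Function<String, List<String>> backend) {
        this.backend = backend;
    }

    /** Blocks until the (possibly shared) query finishes, then returns its hit ids. */
    List<String> search(String query) {
        // computeIfAbsent is atomic, so only the first caller starts the query;
        // later callers receive the same future and simply wait on it.
        CompletableFuture<List<String>> future = inFlight.computeIfAbsent(
            query, q -> CompletableFuture.supplyAsync(() -> backend.apply(q)));
        return future.join();
    }
}
```

Note that this sketch never evicts entries from `inFlight`, so it also acts as an unbounded result cache, and a failed query stays in the map; a real implementation would need expiry and error handling.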

Regarding caching, the query results are cached but the cache is never explicitly invalidated; instead it is assumed the TTL on the cache (currently configured to 10 min) will be sufficient. The problem is that, when an article is published, it's basically impossible to figure out whether it would show up in a given search without re-running the query.
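To make the trade-off concrete, here is a minimal sketch of a cache that relies purely on a time-to-live rather than explicit invalidation. It is not Ambra's actual cache; `TtlCache` and its injectable clock are hypothetical and exist only to show the idea: an entry is served until its TTL elapses, so a newly published article can be missing from cached results for at most one TTL period.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical TTL-only cache: entries expire by age and are never invalidated explicitly. */
final class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt; // clock reading at which the entry becomes stale

        Entry(V value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injectable so expiry can be tested

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    void put(K key, V value) {
        entries.put(key, new Entry<>(value, clock.getAsLong() + ttlMillis));
    }

    /** Returns the cached value, or null if it is absent or older than the TTL. */
    V get(K key) {
        Entry<V> e = entries.get(key);
        if (e == null) {
            return null;
        }
        if (clock.getAsLong() >= e.expiresAt) {
            entries.remove(key, e); // drop the stale entry; caller re-runs the query
            return null;
        }
        return e.value;
    }
}
```

In production the clock would typically be `System::currentTimeMillis` and the TTL `10 * 60 * 1000L` to match the 10 minute setting mentioned above.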

Lastly, because the query is filtered just like other OQL queries, the returned results are now already limited to the journal, so the total-results estimate is now more likely to be close to the real value (unpublished articles will still cause the results to be further filtered on demand, though).

Addresses #744.

12/03/08 03:53:57 changed by ronald

  • status changed from assigned to closed.
  • resolution set to fixed.

(In [6854]) Add full support for -foo in search queries. This closes #744.

02/25/09 14:46:46 changed by

  • milestone deleted.

Milestone 0.9.1 deleted