Default job name | COLLECTION_NAME_spell_correction |
Input | Raw signals (the COLLECTION_NAME_signals collection by default) |
Output | Synonyms (the COLLECTION_NAME_query_rewrite_staging collection by default) |
| | query | count_i | type | timestamp_tdt | user_id | doc_id | session_id | fusion_query_id |
|---|---|---|---|---|---|---|---|---|
| Required signals fields: | ✅ | ✅ | ✅ | | | | | |
trainingCollection parameter that can contain signal data or non-signal data. For signal data, select Input is Signal Data (signalDataIndicator). Signals can be raw (from the _signals collection) or aggregated (from the _signals_aggr collection).
fieldToVectorize parameter.
count_i is the field that records the count of raw signals and aggr_count_i is the field that records the count after aggregation.
mainType parameter)
filterType parameter)
click with a minimum count of 0 and the filtering event type to be query with a minimum count of 20, then the job:
dictionaryCollection) and Dictionary Field (dictionaryField) parameters. For example, in an e-commerce use case, you can use the catalog terms as the custom dictionary by specifying the product catalog collection as the dictionary collection and the product description field as the dictionary field.
query_rewrite_staging collection by default; you can change this by setting the outputCollection.
An example record is as follows:
suggested_corrections field, which provides suggestions about using token correction or whole-phrase correction. If the confidence of the correction is not high, then the job labels the pair as “review” in this field. Pay special attention to the output records with the “review” labels.
With the output in a CSV file, you can sort by mis_string_len (descending) and edit_dist (ascending) to position more probable corrections at the top. You can also sort by the ratio of correction traffic over misspelling traffic (the corCount_misCount_ratio field) to only keep high-traffic boosting corrections.
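The sorting and filtering described above can be sketched in Python. This is a minimal illustration, not part of the job itself: the sample rows, the misspelling and correction column names, and the min_ratio threshold are assumptions; only mis_string_len, edit_dist, and corCount_misCount_ratio come from the job output described here.

```python
import csv
import io

# Hypothetical sample of the job output exported as CSV. Only the columns
# used below are shown, and all values are made up for illustration.
SAMPLE_CSV = """misspelling,correction,mis_string_len,edit_dist,corCount_misCount_ratio
surveilance,surveillance,11,1,40.0
xbow,xbox,4,1,3.5
refridgerator,refrigerator,13,1,120.0
blutooth speker,bluetooth speaker,15,2,8.0
"""

def rank_corrections(csv_text, min_ratio=0.0):
    """Keep rows whose traffic ratio is at least min_ratio, then sort by
    mis_string_len (descending) and edit_dist (ascending)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    kept = [r for r in rows if float(r["corCount_misCount_ratio"]) >= min_ratio]
    return sorted(
        kept,
        key=lambda r: (-int(r["mis_string_len"]), int(r["edit_dist"])),
    )

# min_ratio=5.0 is an arbitrary example threshold, not a recommended value.
ranked = rank_corrections(SAMPLE_CSV, min_ratio=5.0)
for row in ranked:
    print(row["misspelling"], "->", row["correction"])
```

In practice you would read the exported file with open() instead of the inline sample and choose the ratio threshold after reviewing your own traffic.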
For phrase misspellings, the misspelled tokens are separated out and put in the token_wise_correction field. If the associated token correction is already included in the one-word correction list, then the collation_check field is labeled as “token correction include.” You can choose to drop those phrase misspellings to reduce duplications.
Fusion counts how many phrase corrections can be solved by the same token correction and puts the number into the token_corr_for_phrase_cnt field. For example, if both “outdoor surveilance” and “surveilance camera” can be solved by correcting “surveilance” to “surveillance”, then this number is 2, which provides some confidence for dropping such phrase corrections and further confirms that correcting “surveilance” to “surveillance” is legitimate.
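To make the counting concrete, here is a small Python sketch that mimics it on a handful of records. The record layout is hypothetical: token_fix is a stand-in for whatever the token_wise_correction field actually contains, and the label string is copied from the text above, so adjust both to match your real output.

```python
from collections import Counter

# Hypothetical phrase-correction records, made up for illustration.
# token_fix is an assumed (misspelled_token, corrected_token) pair.
phrase_records = [
    {"misspelling": "outdoor surveilance",
     "token_fix": ("surveilance", "surveillance"),
     "collation_check": "token correction include"},
    {"misspelling": "surveilance camera",
     "token_fix": ("surveilance", "surveillance"),
     "collation_check": "token correction include"},
    {"misspelling": "xbow controller",
     "token_fix": ("xbow", "xbox"),
     "collation_check": "token correction not included"},
]

def phrases_per_token_fix(records):
    """Count how many phrase corrections share the same token correction."""
    return Counter(r["token_fix"] for r in records)

def drop_covered_phrases(records, covered_label="token correction include"):
    """Drop phrase misspellings already covered by a one-word correction."""
    return [r for r in records if r["collation_check"] != covered_label]

counts = phrases_per_token_fix(phrase_records)
remaining = drop_covered_phrases(phrase_records)
print(counts[("surveilance", "surveillance")])
print([r["misspelling"] for r in remaining])
```

A count of 2 for the shared token correction is the same situation as the example above: two phrase corrections are explained by one token fix, so the phrase-level records are redundant.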
You might also see cases where the token-wise correction is not included in the list. For example, “xbow” to “xbox” is not included in the list because it can be dangerous to allow an edit distance of 1 in a word of length 4. But if multiple phrase corrections can be made by changing this token, then you can add this token correction to the list.
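The risk of short-word corrections is easier to see with numbers. The sketch below uses a plain Levenshtein distance, which is an assumption: the job's own distance measure may differ. The relative_change helper is hypothetical and only illustrates how much of a word a single edit rewrites.

```python
def edit_distance(a, b):
    """Levenshtein distance using the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]

def relative_change(misspelling, correction):
    """Hypothetical helper: edit distance as a fraction of the misspelling's length."""
    return edit_distance(misspelling, correction) / len(misspelling)

print(edit_distance("xbow", "xbox"))
print(relative_change("xbow", "xbox"))
print(relative_change("surveilance", "surveillance"))
```

One edit in “xbow” changes a quarter of the word, while one edit in an 11-character word changes less than a tenth of it, which is why short corrections deserve extra evidence before they are accepted.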
token_corr_for_phrase_cnt and with collation_check labeled as “token correction not included” could be potentially problematic corrections.
correction_types field. If there is a user-provided dictionary to check against, and both spellings are in the dictionary with and without whitespace in the middle, we can treat these pairs as bi-directional synonyms (“combine/break words (bi-direction)” in the correction_types field).
The sound_match and lastChar_match fields also provide useful information.
trainingDataFilterQuery / Data filter query | See Event types above, then adjust this value to reflect the secondary event for your search application. To query all data, set this to *:*. |
minCountFilter / Minimum Filtering Event Count | Lower this value to include less-frequent misspellings based on the data filter query. |
maxDistance / Maximum Edit Distance | Raise this value to increase the number of potentially-related tokens and phrases detected. |
minMispellingLen / Minimum Length of Misspelling | Lower this value to include shorter misspellings (which are harder to correct accurately). |
Query Rewrite Jobs Post-processing Cleanup
delete_lowConf_synonyms.json file.
<your query_rewrite_staging collection name/update> in the uri field. An example URI value for an app called DC_Large would be DC_Large_query_rewrite_staging/update.
id field if applicable.
<your query_rewrite_staging collection name/update> in the ENDPOINT URI field. An example URI value for an app called DC_Large would be DC_Large_query_rewrite_staging/update.
<root><delete><query>type:synonym AND confidence: [0 TO 0.0005]</query></delete><commit/></root>
<root><delete><query>type:synonym</query></delete><commit/></root>
delete_lowConf_phrases.json file.
<your query_rewrite_staging collection name/update> in the uri field. An example URI value for an app called DC_Large would be DC_Large_query_rewrite_staging/update.
<your query_rewrite_staging collection name/update> in the ENDPOINT URI field. An example URI value for an app called DC_Large would be DC_Large_query_rewrite_staging/update.
<root><delete><query>type:phrase AND confidence: [0 TO <insert value>]</query></delete><commit/></root>
<root><delete><query>type:phrase</query></delete><commit/></root>
delete_lowConf_misspellings.json file.
<your query_rewrite_staging collection name/update> in the uri field. An example URI value for an app called DC_Large would be DC_Large_query_rewrite_staging/update.
<your query_rewrite_staging collection name/update> in the ENDPOINT URI field. An example URI value for an app called DC_Large would be DC_Large_query_rewrite_staging/update.
<root><delete><query>type:spell AND confidence: [0 TO 0.5]</query></delete><commit/></root>
<root><delete><query>type:spell</query></delete><commit/></root>
delete_lowConf_headTail.json file.
<your query_rewrite_staging collection name/update> in the uri field. An example URI value for an app called DC_Large would be DC_Large_query_rewrite_staging/update.
<your query_rewrite_staging collection name/update> in the ENDPOINT URI field. An example URI value for an app called DC_Large would be DC_Large_query_rewrite_staging/update.
<root><delete><query>type:tail</query></delete><commit/></root>
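The delete bodies and URI values in this section all follow one pattern, which the following Python sketch reproduces. The function names are hypothetical and the code only builds strings; it does not send anything to Fusion or Solr. The thresholds shown are the ones quoted above, not recommendations.

```python
import xml.etree.ElementTree as ET

def staging_update_uri(app_name):
    """Build the URI value for an app's query_rewrite_staging collection."""
    return f"{app_name}_query_rewrite_staging/update"

def build_delete_body(doc_type, max_confidence=None):
    """Build a delete-by-query body for one rewrite type.

    With max_confidence set, only records with confidence in
    [0, max_confidence] match; without it, every record of that type matches.
    """
    query = f"type:{doc_type}"
    if max_confidence is not None:
        query += f" AND confidence: [0 TO {max_confidence}]"
    root = ET.Element("root")
    delete = ET.SubElement(root, "delete")
    ET.SubElement(delete, "query").text = query
    ET.SubElement(root, "commit")
    return ET.tostring(root, encoding="unicode")

print(staging_update_uri("DC_Large"))
print(build_delete_body("spell", 0.5))
print(build_delete_body("tail"))
```

ElementTree serializes the empty commit element as `<commit />`, which is equivalent XML to the `<commit/>` shown in the examples. For very small thresholds, pass the value as a string (for example "0.00001") so that Python does not format it in scientific notation.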