Spark Jobs

Apache Spark can power a wide variety of data analysis jobs. In Fusion, Spark jobs are especially useful for generating recommendations.

Spark job subtypes

For the Spark job type, the available subtypes are listed below.

SQL Aggregation job
A Spark SQL aggregation job where user-defined parameters are injected into a built-in SQL template at runtime.
Custom Python job
The Custom Python job provides user the ability to run Python code via Fusion. This job supports Python 3.6+ code.
Script
This job lets you run a custom Scala script in Fusion.

See Additional Spark jobs for more information.

Spark job configuration

Spark jobs can be created and modified using the Fusion UI or the Spark Jobs API. They can be scheduled using the Fusion UI or the Spark Jobs API. For the complete list of configuration parameters for all Spark job subtypes, see the Jobs Configuration Reference.

Machine learning jobs

Fusion provides these job types to perform machine learning tasks.

LucidAcademyLucidworks offers free training to help you get started.The Course for Machine Learning in Fusion focuses on the basics about machine learning jobs in Fusion:

Visit the LucidAcademy to see the full training catalog.

Signals analysis

These jobs analyze a collection of signals in order to perform query rewriting, signals aggregation, or experiment analysis.

Ground Truth
Ground truth or gold standard datasets are used in the ground truth jobs and query relevance metrics to define a specific set of documents.

Ground truth jobs estimate ground truth queries using click signals and query signals, with document relevance per query determined using a click/skip formula. Use this job along with the Ranking Metrics job to calculate relevance metrics, such as Normalized Discounted Cumulative Gain (nDCG). To create a ground truth job, sign in to Fusion and click Collections > Jobs. Then click Add+ and in the Experiment Evaluation Jobs section, select Ground Truth. You can enter basic and advanced parameters to configure the job. If the field has a default value, it is populated when you click to add the job.

Query rewriting

These jobs produce data that can be used for query rewriting or to inform updates to the synonyms.txt file.

These jobs are deprecated in Fusion 5.9.15 and will be removed in a future release. Lucidworks recommends using Neural Hybrid Search, which achieves superior relevance compared to legacy machine learning methods.

Head/Tail Analysis
Perform head/tail analysis of queries from collections of raw or aggregated signals, to identify underperforming queries and the reasons. This information is valuable for improving overall conversions, Solr configurations, auto-suggest, product catalogs, and SEO/SEM strategies, in order to improve conversion rates.
Phrase Extraction
Identify multi-word phrases in signals.
Synonym Detection Jobs
Use this job to generate pairs of synonyms and pairs of similar queries. Two words are considered potential synonyms when they are used in a similar context in similar queries.
Token and Phrase Spell Correction
Detect misspellings in queries or documents using the numbers of occurrences of words and phrases.

Signals aggregation

SQL Aggregation
A Spark SQL aggregation job where user-defined parameters are injected into a built-in SQL template at runtime.

Experiment analysis

Ranking Metrics
Use this job to calculate relevance metrics by replaying ground truth queries against catalog data using variants from an experiment. Metrics include Normalized Discounted Cumulative Gain (nDCG) and others.

Collaborative recommenders

These jobs analyze signals and generate matrices used to provide collaborative recommendations.

BPR Recommender
Use this job when you want to compute user recommendations or item similarities using a Bayesian Personalized Ranking (BPR) recommender algorithm.
Query-to-Query Session-Based Similarity jobs
This recommender is based on co-occurrence of queries in the context of clicked documents and sessions. It is useful when your data shows that users tend to search for similar items in a single search session. This method of generating query-to-query recommendations is faster and more reliable than the Query-to-Query Similarity recommender job, and is session-based unlike the similar queries previously generated as part of the Synonym Detection job.

Content-based recommenders

Content-based recommenders create matrices of similar items based on their content.

Content-Based Recommender
Use this job when you want to compute item similarities based on their content, such as product descriptions.

Content analysis

Cluster Labeling
Cluster labeling jobs are run against your data collections, and are used:
When clusters or well-defined document categories already exist
When you want to discover and attach keywords to see representative words within existing clusters
Document Clustering
The Document Clustering job uses an unsupervised machine learning algorithm to group documents into clusters based on similarities in their content. You can enable more efficient document exploration by using these clusters as facets, high-level summaries or themes, or to recommend other documents from the same cluster. The job can automatically group similar documents in all kinds of content, such as clinical trials, legal documents, book reviews, blogs, scientific papers, and products.
Classification job
This job analyzes how your existing documents are categorized and produces a classification model that can be used to predict the categories of new documents at index time.
Outlier Detection
Outlier detection jobs are run against your data collections, and also perform the following actions:
Identify information that significantly differs from other data in the collection
Attach labels to designate each outlier group

Data ingest

The Parallel Bulk Loader (PBL) job enables bulk ingestion of structured and semi-structured data from big data systems, NoSQL databases, and common file formats like Parquet and Avro. Datasources the PBL uses include not only common file formats, but Solr databases, JDBC-compliant databases, MongoDB databases and more. In addition, the PBL distributes the load across the Fusion Spark cluster to optimize performance. And because no parsing is needed, indexing performance is also maximized by writing directly to Solr. For more information about:

Available datasources and key features of the Parallel Bulk Loader, see Parallel Bulk Loader concepts.
Usage and detailed configuration, see Parallel Bulk Loader configuration reference.

Learn more

Troubleshoot Custom Spark Jobs

To troubleshoot problems with a custom job, start by looking for errors in the script-job driver log, /opt/fusion/latest.x*/var/log/api/spark-driver-launcher.log (on Unix) or C:\lucidworks\var\fusion\latest.*x\var\log\api\spark-driver-launcher.log (on Windows).For general Spark job troubleshooting steps, see Spark troubleshooting.

Get Started

Introduction to Fusion

Getting Data In

Getting Data Out

Operations

Reference

Developer Docs

Neural Hybrid Search

Release Notes

Spark job subtypes

Spark job configuration

Machine learning jobs

Signals analysis

Query rewriting

Signals aggregation

Experiment analysis

Collaborative recommenders

Content-based recommenders

Content analysis

Data ingest

Learn more

Get Started

Introduction to Fusion

Getting Data In

Getting Data Out

Operations

Reference

Developer Docs

Neural Hybrid Search

Release Notes

​Spark job subtypes

​Spark job configuration

​Machine learning jobs

​Signals analysis

​Query rewriting

​Signals aggregation

​Experiment analysis

​Collaborative recommenders

​Content-based recommenders

​Content analysis

​Data ingest

​Learn more

Spark job subtypes

Spark job configuration

Machine learning jobs

Signals analysis

Query rewriting

Signals aggregation

Experiment analysis

Collaborative recommenders

Content-based recommenders

Content analysis

Data ingest

Learn more