# Apache Solr

export const LwTemplate = ({title = "Key questions to get you started", icon = "sparkles", cta = "Powered by Agent Studio", linkHref = "https://lucidworks.com/demo/?utm_source=docs&utm_medium=referral&utm_campaign=docs_cta_ai"}) => {
  const [isLoaded, setIsLoaded] = useState(false);
  useEffect(() => {
    const timer = setTimeout(() => {
      setIsLoaded(true);
    }, 500);
    return () => clearTimeout(timer);
  }, []);
  return <div className="lw-template-container">
      <Card title={title} icon={icon}>
        {isLoaded && <span dangerouslySetInnerHTML={{
    __html: `<lw-template id="a029c1a9-28be-427e-b0e1-5d918920246a"></lw-template
            >`
  }} />}
        <Link href={linkHref} className="agent-studio-link text-left text-gray-600 gap-2 dark:text-gray-400 text-sm font-medium flex flex-row items-center hover:text-primary dark:hover:text-primary-light group-hover:text-primary group-hover:dark:text-primary-light">Powered by Lucidworks Agent Studio</Link>
      </Card>
    </div>;
};

Solr is the open source search engine at the core of Fusion. When you [index your data](/docs/5/fusion/getting-data-in/indexing/overview), it is stored in a Solr collection.

<LwTemplate />

## Collections

Your data is organized into collections. When you create an app, Fusion automatically creates a collection with the same name. You can create additional collections in any app.

A primary collection contains the data that your users will search. Every primary collection is associated with a set of auxiliary collections that contain related data, such as signals, aggregations, and more.

Under the hood, a Fusion collection is a distributed index in Solr, defined by a named configuration stored in ZooKeeper, with these properties:

* Number of shards\
  Documents are distributed across this number of partitions.
* Document routing strategy\
  How documents are assigned to shards.
* Replication factor\
  How many copies of each document are kept in the collection.
* Replica placement strategy\
  Where to place replicas in the cluster.
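
As a rough illustration of how these properties map onto Solr, the sketch below builds a Solr Collections API `CREATE` request. The parameters `numShards`, `replicationFactor`, and `router.name` are standard Solr parameters; the Solr URL, collection name, and values are placeholders chosen for this example. Replica placement is not shown, and in a Fusion deployment you would normally let Fusion create the collection rather than call Solr directly.

```bash
#!/usr/bin/env bash
# Build a Solr Collections API CREATE URL from the properties listed above.
# SOLR_URL and all argument values are placeholders, not Fusion defaults.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

solr_create_url() {
  # Arguments: collection name, shard count, replication factor, router name.
  local name="$1" shards="$2" replicas="$3" router="${4:-compositeId}"
  printf '%s/admin/collections?action=CREATE&name=%s&numShards=%s&replicationFactor=%s&router.name=%s\n' \
    "$SOLR_URL" "$name" "$shards" "$replicas" "$router"
}

# Print the request for a 2-shard collection with a replication factor of 2.
solr_create_url "products" 2 2
# To send it: curl -sS "$(solr_create_url products 2 2)"
```

The function only prints the URL, so you can inspect the request before sending it with `curl`.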

If your data is already stored in a Solr instance or cluster, you can manage it
in Fusion by creating a Fusion collection that imports the existing Solr collection.

<Note>
  Collection names are case-insensitive, but Fusion preserves case when displaying collection names.
</Note>

### Auxiliary Collections

Some auxiliary collections are created for every primary collection. Others are created only for the app’s default collection, one per app.

Auxiliary collections are described below:

| Collection                       | Description | Number created   |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------- |
| `APP_NAME_job_reports`           | Output from Fusion [experiments](/docs/5/fusion/getting-data-out/data-analytics/experiments/overview), [Ranking Metrics jobs](/docs/5/fusion/reference/config-ref/jobs/ranking-metrics), and [Head/Tail Analysis jobs](/docs/5/fusion/reference/config-ref/jobs/head-tail-analysis).                                                                                                                                                                                                                                                                                                                                                                                                                        | 1 per app        |
| `APP_NAME_query_rewrite`         | <p>A collection of documents to use for [rewriting queries](/docs/5/fusion/getting-data-out/query-enhancement/query-rewriting), optimized for high‑volume traffic. These documents originate from the `APP_NAME_query_rewrite_staging` collection. Certain Fusion query pipeline stages read from this collection: <br />• [Text Tagger](/docs/5/fusion/reference/config-ref/pipeline-stages/query-stages/text-tagger-query-stage)<br />• [Apply Rules](/docs/5/fusion/reference/config-ref/pipeline-stages/query-stages/query-rules-query-stage)<br />• [Modify Response with Rules](/docs/5/fusion/reference/config-ref/pipeline-stages/query-stages/rules-augment-response-query-stage)<br /></p> | 1 per app        |
| `APP_NAME_query_rewrite_staging` | <p>A collection of documents created by the Rules Editor or by certain [Fusion jobs](/docs/5/fusion/reference/config-ref/jobs/overview), not optimized for production traffic. Documents move from this collection to the `APP_NAME_query_rewrite` collection as follows: <br />• Job output documents with high confidence contain a `review=auto` field and are moved to the `APP_NAME_query_rewrite` collection automatically. <br />• Job output documents with low confidence contain a `review=pending` field. When these are approved by a Fusion user, Fusion copies them to the `APP_NAME_query_rewrite` collection. <br /></p> | 1 per app        |
| `COLLECTION_NAME_signals`        | A collection of search query logs and signals. | 1 per collection |
| `COLLECTION_NAME_signals_aggr`   | A collection for aggregated signals.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | 1 per collection |
| `APP_NAME_user_prefs`            | A collection of data to support App Studio’s social features, such as user‑generated tags, bookmarks, comments, ratings, and so on.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | 1 per app        |

<Note>
  Do not create primary collections with names that end in the suffixes above; these are reserved for Fusion auxiliary collections, which are created and managed by Fusion directly.
</Note>

Fusion maintains a set of Solr collections that store Fusion’s own
log files and other internal information.
These are called [System Collections](#system-collections), described below.

<Note>
  Do not create primary collections named "logs" or beginning with "system\_".
  These names are reserved for Fusion system collections.
</Note>
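
The naming rules in the notes above can be sketched as a small pre-flight check. The helper below is hypothetical and is not part of Fusion; it rejects the reserved auxiliary suffixes from the table and the reserved system names, and compares in lowercase on the assumption that case-insensitive names should be checked that way.

```bash
#!/usr/bin/env bash
# Return 0 if a proposed primary collection name avoids the reserved
# patterns described above, 1 otherwise. Hypothetical helper, not a Fusion tool.
is_allowed_collection_name() {
  # Collection names are case-insensitive, so compare in lowercase.
  local name
  name="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$name" in
    logs|system_*) return 1 ;;   # reserved for system collections
    *_job_reports|*_query_rewrite|*_query_rewrite_staging) return 1 ;;
    *_signals|*_signals_aggr|*_user_prefs) return 1 ;;
    *) return 0 ;;
  esac
}

# Check a few example names.
for candidate in products Products_signals system_notes logs; do
  if is_allowed_collection_name "$candidate"; then
    echo "$candidate: ok"
  else
    echo "$candidate: reserved"
  fi
done
```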

Fusion uses ZooKeeper to register information about all collections,
and the Fusion components and services related to a collection.
The Fusion components associated with a collection include:

* Datasources
* Pipelines
* Profiles
* Signals and aggregations
* Analytics dashboards

### System Collections

Fusion automatically creates some collections that are used for internal purposes and shared across all apps:

* **system\_autocomplete** stores the content that the Fusion UI displays when you use the search bar.
* **system\_blobs** stores [blobs](/docs/5/fusion/getting-data-in/blob-storage) in Solr. This is used to store model files for the NLP components and other binary files used by Fusion components.
* **system\_history** keeps a record of configuration changes, start and stop times for services and experiments, and more.
* **system\_jobs\_history** keeps a record of Fusion [jobs](/docs/5/fusion/operations/jobs-and-scheduling/overview), including start/stop times and status.
* **system\_messages** is used by Fusion’s messaging services.

### Collection Configuration Properties

Collections have the following properties, which you can configure only when you create a collection using the
[Collections API](/api-reference/collections/get-collections-service-status).

| Property   | Description                                                                                                             | Default behavior                                                                                                                                                            |
| ---------- | ----------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| signals\*  | The `signals` property determines whether to create auxiliary collections with suffixes `_signals` and `_signals_aggr`. | When you create a collection in the Fusion UI, `signals` defaults to **true**.  When you create a collection using the Fusion API, this property defaults to **false**.     |
| searchLogs | The `searchLogs` property determines whether to create an auxiliary search query logs collection with suffix `_logs`.   | When you create a collection in the Fusion UI, this property defaults to **true**.  When you create a collection using the Fusion API, this property defaults to **false**. |

\*Signals are events with timestamps that can be used to improve search results.
For more information about signals in Fusion, see [Signals](/docs/5/fusion/getting-data-out/query-enhancement/signals/overview) in the Fusion documentation.

\*\*In schemaless mode, if a document contains a field not currently in the Solr schema, Solr processes the field value to determine what the field type should be defined as, and then adds a new field to the schema with the field name and field type.
This behavior can be convenient during preliminary application development, but it is rarely appropriate in a production environment.

### Using profiles to associate collections with pipelines

Index pipelines and query pipelines are not connected to a specific collection by default. Index profiles and query profiles are configurations that create consistent endpoints for indexing and querying, each with a specific pipeline and collection.

* [Index Profiles](/docs/5/fusion/getting-data-in/indexing/index-pipelines/index-profiles) work with index pipelines for getting content into the system.
* [Query Profiles](/docs/5/fusion/getting-data-out/query-basics/query-pipelines/query-profiles) work with query pipelines for user queries.

### Field Editor UI

The Fusion UI includes a space under Collections to edit fields. Descriptions for these fields can be found in the Field Type Definitions section of the [Solr Reference Guide](/docs/5/fusion/reference/solr-reference-guide) associated with your Fusion release.

Field options displayed in the UI include:

* **Dynamic** checkbox (cannot change via UI)
* **Field Name** (cannot change via UI)
* **Field Type** (a preset value is shown that can be changed using edit mode)
* Checkboxes for **Indexed**, **Stored**, **Multivalued**, **Required**
* Text field to enter a **Default Value**
* **Copy Fields** uses the plus sign to add rows (**static** can copy to `raw_content` or `text`; **dynamic** can copy to any `raw_content`/`text` or any other dynamic field)
* **Advanced** toggles checkboxes for **Doc Values**, **Omit Norms**, **Omit Positions**, **Omit term freq and positions**, **Term Vectors**, **Term Positions**, **Term Offsets**

## Automated Solr backups

You can schedule backups of Solr collections, store the backups for a configurable period of time, and restore these backups into a specified Fusion cluster when needed.

The following guide uses Google Kubernetes Engine (GKE) for examples and assumes you used the setup scripts in the [fusion-cloud-native](https://github.com/lucidworks/fusion-cloud-native) repository to install Fusion.

<AccordionGroup>
  <Accordion title="Solr backups using cloud provider storage options">
    The standard approach of using a provider-specific Persistent Volume Claim (PVC) for storing collection backups ensures consistency in configuration. However, this method does not leverage the unique storage features offered by each cloud provider. For instance, Google Cloud provides Google Cloud Storage, which includes additional features such as access control management, various storage tiers, and other capabilities that are not available when using a PVC. To take advantage of these features, Solr instances running within Fusion require additional provider-specific information.

    Refer to the Solr documentation for detailed information on the [repositories available for configuring collection backups](https://solr.apache.org/guide/solr/latest/deployment-guide/backup-restore.html#backuprestore-storage-repositories). Each repository type comes with specific configuration options and features. Generally, you will need to integrate the provider-specific configuration into the `solr.xml` configuration file and ensure that the appropriate library or module for the provider is included in the Solr classpath. This step is necessary for the repository implementation to resolve correctly at runtime.

    For example, when configuring `GCSBackupRepository` to store backups in Google Cloud Storage (GCS), it is essential to include the corresponding library for the provider in the Solr classpath. Additionally, you will need to add a section to the `solr.xml` file similar to the XML example below to specify the target bucket where backups will be stored:

    ```xml wrap theme={"dark"}
    <backup>
      <repository name="gcs_backup" class="org.apache.solr.gcs.GCSBackupRepository" default="false">
        <str name="gcsBucket">solrBackups</str>
        <str name="gcsCredentialPath">/local/path/to/credential/file</str>
        <str name="location">/default/gcs/backup/location</str>
        <int name="gcsClientMaxRetries">5</int>
        <int name="gcsClientHttpInitialRetryDelayMillis">1500</int>
        <double name="gcsClientHttpRetryDelayMultiplier">1.5</double>
        <int name="gcsClientHttpMaxRetryDelayMillis">10000</int>
      </repository>
    </backup>
    ```

    After configuring the backup provider, you can utilize the standard [Solr backup and restore APIs](https://solr.apache.org/guide/solr/latest/deployment-guide/backup-restore.html#user-managed-clusters-and-single-node-installations) to create new backups or restore from existing ones. Instead of writing to a PVC, backups will be stored in the storage solution specific to the provider.
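
    As a minimal sketch of such a call, the function below builds a Solr Collections API `BACKUP` request that names the repository explicitly, which is needed here because the example repository is declared with `default="false"`. The `repository` value matches the `solr.xml` example above; the Solr URL, collection, and backup name are placeholders. Because no `location` parameter is passed, this sketch relies on the repository's configured default location.

    ```bash
    #!/usr/bin/env bash
    # Build a Solr Collections API BACKUP request that targets a named repository.
    # "gcs_backup" matches the repository name in the solr.xml example above;
    # SOLR_URL, the collection, and the backup name are placeholders.
    SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

    solr_backup_url() {
      # Arguments: collection, backup name, optional repository name.
      local collection="$1" backup_name="$2" repository="${3:-gcs_backup}"
      printf '%s/admin/collections?action=BACKUP&collection=%s&name=%s&repository=%s\n' \
        "$SOLR_URL" "$collection" "$backup_name" "$repository"
    }

    # Print the request instead of sending it.
    solr_backup_url "products" "products-nightly"
    # To send it: curl -sS "$(solr_backup_url products products-nightly)"
    ```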
  </Accordion>

  <Accordion title="Solr backups using Persistent Volume Claim">
    Backups are taken using the Solr collection [BACKUP command](https://solr.apache.org/guide/solr/latest/deployment-guide/collection-management.html#backup). This requires that each Solr node has access to a shared volume or a `ReadWriteMany` volume in Kubernetes. Most cloud providers offer a simple way of creating a shared filestore and exposing it as a `PersistentVolumeClaim` within Kubernetes to mount into the Solr pods. The **setup\_f5\_PROVIDER.sh** scripts in the **fusion-cloud-native** repository include an option to provision this storage.

    The backup action of the script is invoked by a Kubernetes CronJob to run the backup schedule. The backups are saved to a configurable directory with an automatically generated name: `<collection_name>-<timestamp_in_some_format>`.

    A separate CronJob is responsible for cleanup and retention of backups. Cleanup can be disabled if not needed. Setting a series of retention periods can automatically remove backups as they become outdated.

    For example, a cluster that backs up a collection every 3 hours could specify a retention policy that:

    * Keeps all backups for a single day.
    * Keeps a single daily backup for a week.
    * Keeps a single weekly backup for a month.
    * Keeps a single monthly backup for 6 months.
    * Deletes all backups that are older than 6 months.

    All times are configurable as part of the `configmap` for this service.
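
    To make the example policy concrete, the sketch below classifies a backup by its age into the tiers listed above. It is illustrative only and is not the cleanup job's actual logic; the thresholds assume a 30-day month and 180 days for 6 months, whereas the real periods come from the `configmap`.

    ```bash
    #!/usr/bin/env bash
    # Classify a backup by age (in hours) into the example retention tiers above.
    # Thresholds are assumptions for illustration: 24h, 7 days, 30 days, 180 days.
    retention_tier() {
      local age_hours="$1"
      if   [ "$age_hours" -le 24 ];   then echo "keep-all"
      elif [ "$age_hours" -le 168 ];  then echo "keep-one-per-day"
      elif [ "$age_hours" -le 720 ];  then echo "keep-one-per-week"
      elif [ "$age_hours" -le 4320 ]; then echo "keep-one-per-month"
      else                                 echo "delete"
      fi
    }

    # Show the tier for a few sample ages.
    for age in 3 48 400 2000 5000; do
      echo "${age}h -> $(retention_tier "$age")"
    done
    ```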

    Restoring a collection is a manual step: use `kubectl run` to invoke the Solr `RESTORE` action, specifying the target collection and the name of the backup to restore.
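
    One possible shape for that restore step is sketched below. The functions only print the `kubectl run` command so that you can review it first. The pod name, curl image, Solr service URL, namespace, and backup name are all assumptions made for this example; `action=RESTORE` with `collection`, `name`, and `location` are standard Solr Collections API parameters.

    ```bash
    #!/usr/bin/env bash
    # Build the Solr Collections API RESTORE URL.
    # Arguments: Solr base URL, target collection, backup name, backup location.
    solr_restore_url() {
      local solr_url="$1" collection="$2" backup_name="$3" location="$4"
      printf '%s/admin/collections?action=RESTORE&collection=%s&name=%s&location=%s' \
        "$solr_url" "$collection" "$backup_name" "$location"
    }

    # Print (not run) a kubectl command that calls the RESTORE URL from a
    # temporary pod. The pod name and image are placeholders for this sketch.
    print_restore_command() {
      local namespace="$1"; shift
      printf 'kubectl run solr-restore -n %s --rm -i --restart=Never --image=curlimages/curl --command -- curl -sS "%s"\n' \
        "$namespace" "$(solr_restore_url "$@")"
    }

    # Example values only; replace them with the values for your cluster.
    print_restore_command "fusion" "http://solr:8983/solr" "products" "products-backup" "/mnt/solr-backups"
    ```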

    <Note>These instructions are for GKE only. For other platforms, backup and restoration involves copying the collection to the cloud and using Parallel Bulk Loader.</Note>

    ### Install using a PVC with GKE

    The `solr-backup-runner` requires that a `ReadWriteMany` volume is mounted onto all `solr` and `backup-runner` pods so they all back up to the same filesystem.

    The easiest way to install on GKE is by using a GCP Filestore as the `ReadWriteMany` volume.

    1. Create the Filestore.
       ```bash wrap theme={"dark"}
       gcloud --project "${GCLOUD_PROJECT}" filestore instances create "${NFS_NAME}"  --tier=STANDARD --file-share=name="solrbackups,capacity=${SOLR_BACKUP_NFS_GB}GB" --zone="${GCLOUD_ZONE}" --network=name="${network_name}"
       ```

    2. Fetch the IP of the Filestore.
       ```bash wrap theme={"dark"}
       NFS_IP="$(gcloud filestore instances describe "${NFS_NAME}" --project="${GCLOUD_PROJECT}" --zone="${GCLOUD_ZONE}" --format="value(networks.ipAddresses[0])")"
       ```

    3. Create a Persistent Volume in Kubernetes that is backed by this volume.
       ```yaml theme={"dark"}
       cat <<EOF | kubectl -n "${NAMESPACE}" apply -f -
       apiVersion: v1
       kind: PersistentVolume
       metadata:
        name: ${NAMESPACE}-solr-backups
        annotations:
          pv.beta.kubernetes.io/gid: "8983"
       spec:
        capacity:
          storage: ${SOLR_BACKUP_NFS_GB}G
        accessModes:
          - ReadWriteMany
        nfs:
          path: /solrbackups
          server: ${NFS_IP}
       EOF
       ```

    4. Create a Persistent Volume Claim in the namespace that Solr is running in.
       ```yaml theme={"dark"}
       cat <<EOF | kubectl -n "${NAMESPACE}" apply -f -
       apiVersion: v1
       kind: PersistentVolumeClaim
       metadata:
        name: fusion-solr-backup-claim
       spec:
        volumeName: ${NAMESPACE}-solr-backups
        accessModes:
          - ReadWriteMany
        storageClassName: ""
        resources:
          requests:
            storage: ${SOLR_BACKUP_NFS_GB}G
       EOF
       ```

    5. Add the following values to your existing (or a new) Helm values file.
       ```yaml theme={"dark"}
       solr-backup-runner:
        enabled: true
        sharedPersistentVolumeName: fusion-solr-backup-claim
       solr:
        additionalInitContainers:
          - name: chown-backup-directory
            securityContext:
              runAsUser: 0
            image: busybox:latest
            command: ['/bin/sh', '-c', "owner=$(stat -c '%u' /mnt/solr-backups);  if [ ! \"${owner}\" = \"8983\" ]; then chown -R 8983:8983 /mnt/solr-backups; fi "]
            volumeMounts:
              - mountPath: /mnt/solr-backups
                name: solr-backups
        additionalVolumes:
          - name: solr-backups
            persistentVolumeClaim:
              claimName: fusion-solr-backup-claim
        additionalVolumeMounts:
          - name: solr-backups
            mountPath: "/mnt/solr-backups"
       ```

    6. Upgrade the release. Solr backups are now enabled.
  </Accordion>
</AccordionGroup>
