This topic provides answers to some commonly asked questions about the use of indexes.

How many shards and replicas do I need to specify for an index when I create the index?

You can refer to the following guidelines when you specify the number of shards and the number of replicas for an index:

  1. The number of documents in a shard cannot exceed the maximum value of a 32-bit signed integer (INT), which is approximately 2,100,000,000. If the number of documents in a shard exceeds this limit, the underlying Lucene layer overwrites existing documents.
  2. In normal cases, we recommend that you set the replicationFactor parameter to 1 and set the autoAddReplicas property to false. If your business has specific requirements, for example, if your read load is significantly higher than your write load, we recommend that you set the replicationFactor parameter to 2 or 3 to spread the read load across replicas. For how these settings map to the open-source Solr Collections API, see the sketch after this list.
  3. For example, assume that an instance has N service nodes and each service node hosts M shards of an index. The index can then hold at most N × M × 2,100,000,000 documents.
    For an index, first create one shard on each service node. In this case, the value of M is 1. If the total number of documents to be indexed is greater than N × M × 2,100,000,000, set M to 2, 3, or a greater value until the capacity is sufficient. For example, if N is 4 and you need to index 16,000,000,000 documents, M must be at least 16,000,000,000 ÷ (4 × 2,100,000,000) ≈ 1.9, rounded up to 2 shards per node.
    If the required value of M is too large for a service node, for example, greater than the number of CPU cores on the node, you must scale out the cluster or upgrade the instance.
  4. Although a shard can hold up to approximately 2,100,000,000 documents, we recommend that you do not store a very large number of documents in a single shard. For example, if 2,000,000,000 documents are stored in a shard and each document is 1 KB in size, the shard holds approximately 2 TB of data. As a result, the service node may become overloaded and scan operations may fail. In this case, we recommend that you scale out nodes to balance the load that scan operations on large amounts of data generate. The size of the documents in a shard also affects query performance: a shard whose documents are about 10 KB each performs differently from a shard whose documents are about 200 bytes each.
    Note: You can adjust the number of shards and the number of replicas after you create an index. However, we recommend that you specify appropriate values when you create the index.
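For reference, the following sketch shows how the shard and replica settings map to the open-source Solr Collections API when you create a collection. The endpoint, the collection name testcollection, and the parameter values are placeholders; a managed Search service may expose these settings through its console or index creation tools instead.
    curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=testcollection&numShards=4&replicationFactor=1&autoAddReplicas=false"
In this sketch, numShards is the total number of shards across the cluster. With N = 4 service nodes and M = 1 shard per node, numShards is 4, each shard keeps a single replica, and automatic replica creation is disabled, which matches the recommendations above.
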
How long does it take to synchronize data from HBase tables to Search indexes? How long do I need to wait before I can query the synchronized data in Search?
Index synchronization latency = Data synchronization latency + Commit time

If no synchronization tasks are pending, the synchronization latency is determined by the overhead of the framework, which is typically a few milliseconds. If synchronization tasks are queued, the latency may be higher. In this case, add more nodes to reduce the latency. By default, the commit time is 15 seconds. If the write load is light, you can reduce the commit time to 1, 3, or 5 seconds. If the write load is heavy, we recommend that you set the commit time to a larger value. Otherwise, a large number of small files are generated and frequently merged, which degrades system performance.
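In open-source Solr, the commit time described above corresponds to the soft commit interval, which controls how soon newly written documents become visible to queries. The following sketch assumes that your deployment allows you to change this setting through the Solr Config API; the endpoint and the collection name testcollection are placeholders, and a managed Search service may expose the commit time through its console instead.
    curl "http://localhost:8983/solr/testcollection/config" -H "Content-type:application/json" -d '{"set-property":{"updateHandler.autoSoftCommit.maxTime":5000}}'
The value is specified in milliseconds, so 5000 corresponds to a 5-second commit time, and the 15-second default corresponds to 15000.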

How do I use the pre-sorting feature?

When you query a large amount of data, sorting the results consumes system resources and can make response times excessively long. If the data is pre-sorted by a column, data retrieval is significantly faster. This is especially helpful for queries that sort over large amounts of data. To configure the pre-sorting feature, perform the following steps:

  • Modify the MergePolicy configuration options in the solrconfig.xml file. For more information, visit Customizing Merge Policies.
  • Set the segmentTerminateEarly parameter to true when you query data. The following sample code shows how to configure the MergePolicy configuration options:
    <mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
      <str name="sort">timestamp desc</str>
      <str name="wrapped.prefix">inner</str>
      <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
      <int name="inner.maxMergeAtOnce">10</int>
      <int name="inner.segmentsPerTier">10</int>
    </mergePolicyFactory>
    After you specify <str name="sort">timestamp desc</str>, the indexed data is kept sorted by the timestamp column in descending order. Run the following command to query the data:
    curl "http://localhost:8983/solr/testcollection/query?q=*:*&sort=timestamp+desc&rows=10&segmentTerminateEarly=true"
    If you set the segmentTerminateEarly parameter to true, the response time of queries is significantly reduced, especially when the query is performed on terabyte-scale data.
Note
  • The sort value that you specify when you query data must be the same as the sort value that you specify in the MergePolicy configuration. Otherwise, the pre-sorting feature does not take effect.
  • You must set the segmentTerminateEarly parameter to true in the query. Otherwise, all matching rows are sorted.
  • If pre-sorting is used, the numFound value in the response may not be accurate because the query can stop before all segments are fully scanned.