MarkLogic Universal and Range Indexes


28 February, 2020
By Dave Cassel


In NiFi, FlowFiles are the pieces of data that a processor works on. In this case, NiFi calls a MarkLogic query for each FlowFile. The metric shown in the graph is FlowFiles processed by NiFi over 5-minute intervals. There are peaks around 16,000, but throughput was mostly lower, which wasn’t what I needed. After a one-line change, throughput rose to about 80,000 per 5 minutes. The pace was still accelerating when it ran out of data to process, so I’m not sure how high it would have gone. So what did I change?

MarkLogic provides a number of indexes to improve query performance. Among them are the Universal Index (with many options) and Range Indexes.

Two types of queries look like they do very similar things, but they rely on these different indexes: cts:element-value-query() and cts:element-range-query() (with the “=” operator).
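One practical difference between the two: the Universal Index is always there, while cts:element-range-query() only works if a matching range index has been configured on the database. As a rough sketch, a string range index on an "id" element could be added with the Admin API along these lines. The database name and element namespace are placeholders, and the collation is simply MarkLogic's root collation; none of these come from the project described here.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumptions: "my-content-db" and "http://example.com/ns" are placeholders;
   substitute the real database name and the namespace of the id element. :)
let $config := admin:get-configuration()
let $db-id := xdmp:database("my-content-db")
let $index :=
  admin:database-range-element-index(
    "string",                           (: scalar type of the indexed values :)
    "http://example.com/ns",            (: element namespace (placeholder) :)
    "id",                               (: element local name :)
    "http://marklogic.com/collation/",  (: root collation; pick what your data needs :)
    fn:false()                          (: do not store range value positions :)
  )
return
  admin:save-configuration(
    admin:database-add-range-element-index($config, $db-id, $index)
  )
```

Adding the index triggers reindexing of existing content, so on a large database it may take a while before range queries see every document.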

Here’s the original version of the function:

(: Given a URI, check whether there is a corresponding Item. 
 : If there is, delete it. 
 :)
declare function lib:delete-replaced-item($uri as xs:string)
{
  let $item-uri :=
    cts:uris(
      (),
      ("limit=1", "score-zero"),
      cts:and-query((
          cts:collection-query("item"),
          cts:element-value-query(
            xs:QName("es:id"), 
            lib:id-from-uri($uri)
          )
      ))
    )
  return
    if (fn:exists($item-uri)) then
      xdmp:spawn-function(
        function() { xdmp:document-delete($item-uri) }
      )
    else ()
};

Note the cts:element-value-query. This query uses the Universal Index, which captures every term, along with the XML or JSON structure. This makes for rapid lookups of which documents have a particular term.

The cts:element-value-query looks for documents that have the provided input as the entire contents of the target element. In certain cases, this works great. However, in the example above, the return value from lib:id-from-uri($uri) looks something like “a~b~c”. The problem is the “~” characters. As MarkLogic tokenizes the content, it sees “a~b~c” as the sequence of tokens (“a”, “b”, “c”). We can see this using the xdmp:plan function on cts:element-value-query(xs:QName("id"), "a~b~c"). The results include a final-plan element:

<qry:final-plan>
  <qry:and-query>
    <qry:term-query weight="1">
      <qry:key>12776805441528511383</qry:key>
      <qry:annotation>element(id,value("a","b","c"))</qry:annotation>
    </qry:term-query>
  </qry:and-query>
</qry:final-plan>

This query needs to look for not just one value, but three, in the correct order. Let’s compare that with the plan using cts:element-range-query(xs:QName("id"), "=", "a~b~c"):

<qry:final-plan
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <qry:and-query>
    <qry:range-query weight="0" min-occurs="1" max-occurs="4294967295">
      <qry:key>208346549518586783</qry:key>
      <qry:annotation>element(id)</qry:annotation>
      <qry:lower-bound xsi:type="xs:string">a~b~c</qry:lower-bound>
      <qry:upper-bound xsi:type="xs:string">a~b~c</qry:upper-bound>
    </qry:range-query>
  </qry:and-query>
</qry:final-plan>
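Both plans above can be reproduced in Query Console with something like the following sketch. It assumes the element is named "id" with no namespace, as in the plans shown, and that a string range index on that element exists; without the index, the range query is expected to raise an error rather than produce a plan. The cts:tokenize call is included only as a quick way to inspect how the value is split into tokens.

```xquery
xquery version "1.0-ml";

(
  (: Word tokens MarkLogic extracts from the value; punctuation tokens are filtered out. :)
  cts:tokenize("a~b~c")[. instance of cts:word],

  (: Plan for the value query, resolved from the Universal Index. :)
  xdmp:plan(
    cts:search(fn:doc(), cts:element-value-query(xs:QName("id"), "a~b~c"))
  )//*:final-plan,

  (: Plan for the equality range query, resolved from the "id" range index. :)
  xdmp:plan(
    cts:search(fn:doc(), cts:element-range-query(xs:QName("id"), "=", "a~b~c"))
  )//*:final-plan
)
```

The term keys in the output will differ from the ones shown here, since they depend on the database configuration.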

Here, MarkLogic is working with an upper and a lower bound, which are the same value. To find results, MarkLogic uses the “id” range index: it seeks through the list of values to the matching entry and returns the URIs associated with it.
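The revised function is not shown, but given the comparison above, the one-line change was presumably to swap the value query for an equality range query. A sketch of what that would look like, reusing the lib and es prefixes and the lib:id-from-uri helper from the original module (none of which are defined in this post) and assuming a string range index exists on es:id:

```xquery
(: Given a URI, check whether there is a corresponding Item.
 : If there is, delete it.
 : Sketch only: same as the original except for the query constructor.
 :)
declare function lib:delete-replaced-item($uri as xs:string)
{
  let $item-uri :=
    cts:uris(
      (),
      ("limit=1", "score-zero"),
      cts:and-query((
          cts:collection-query("item"),
          (: The changed call: equality lookup against the es:id range index,
             so "a~b~c" is matched as one value instead of three tokens. :)
          cts:element-range-query(
            xs:QName("es:id"),
            "=",
            lib:id-from-uri($uri)
          )
      ))
    )
  return
    if (fn:exists($item-uri)) then
      xdmp:spawn-function(
        function() { xdmp:document-delete($item-uri) }
      )
    else ()
};
```

If the range index uses a collation other than the default, it would need to be passed as a "collation=..." option in the fourth argument of cts:element-range-query.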

MarkLogic provides many different types of indexes. Knowing the right one to use for your query can make a huge difference in the performance.
