MarkLogic Universal and Range Indexes


  • 28 February, 2020
  • By Dave Cassel

In NiFi, FlowFiles are pieces of data that a processor needs to work on. In this case, NiFi is calling a MarkLogic query for each FlowFile. The metric shown in the graph is FlowFiles processed by NiFi over 5-minute intervals. There are peaks around 16,000, but most intervals are lower, which wasn’t the throughput I needed. After a one-line change, throughput went to about 80,000 over 5 minutes. The pace was still accelerating when it ran out of data to process, so I’m not sure how high it would have gone. So what did I change?

MarkLogic provides a number of indexes to improve query performance. Among them are the Universal Index (with many options) and Range Indexes.

Two types of queries look like they do very similar things, but they rely on these different indexes: cts:element-value-query() and cts:element-range-query() (with the “=” operator).

Here’s the original version of the function:

(: Given a URI, check whether there is a corresponding Item. 
 : If there is, delete it. 
 :)
declare function lib:delete-replaced-item($uri as xs:string)
{
  let $item-uri :=
    cts:uris(
      (),
      ("limit=1", "score-zero"),
      cts:and-query((
          cts:collection-query("item"),
          cts:element-value-query(
            xs:QName("es:id"), 
            lib:id-from-uri($uri)
          )
      ))
    )
  return
    if (fn:exists($item-uri)) then
      xdmp:spawn-function(
        function() { xdmp:document-delete($item-uri) }
      )
    else ()
};

Note the cts:element-value-query. This query uses the Universal Index, which captures every term, along with the XML or JSON structure. This makes for rapid lookups of which documents have a particular term.

The cts:element-value-query looks for documents that have the provided input as the entire contents of the target element. In certain cases, this works great. However, in the example above, the return value from lib:id-from-uri($uri) looks something like “a~b~c”. The problem is the “~” characters. As MarkLogic tokenizes the content, it sees “a~b~c” as the sequence of tokens (“a”, “b”, “c”). We can see this using the xdmp:plan function on cts:element-value-query(xs:QName("id"), "a~b~c"). The results include a final-plan element:

<qry:final-plan>
  <qry:and-query>
    <qry:term-query weight="1">
      <qry:key>12776805441528511383</qry:key>
      <qry:annotation>element(id,value("a","b","c"))</qry:annotation>
    </qry:term-query>
  </qry:and-query>
</qry:final-plan>

This query needs to look for not just one value, but three, in the correct order. Let’s compare that with the plan using cts:element-range-query(xs:QName("id"), "=", "a~b~c"):

<qry:final-plan
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <qry:and-query>
    <qry:range-query weight="0" min-occurs="1" max-occurs="4294967295">
      <qry:key>208346549518586783</qry:key>
      <qry:annotation>element(id)</qry:annotation>
      <qry:lower-bound xsi:type="xs:string">a~b~c</qry:lower-bound>
      <qry:upper-bound xsi:type="xs:string">a~b~c</qry:upper-bound>
    </qry:range-query>
  </qry:and-query>
</qry:final-plan>
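
For anyone who wants to reproduce the two plans, a minimal sketch follows. It assumes the queries are wrapped in a cts:search over fn:doc() (the wrapper used for the plans above is not shown) and that a string range index on the “id” element is already configured; without that index, the second call is expected to raise an error rather than return a plan.

```xquery
xquery version "1.0-ml";

(: Sketch only: print the plan for each query type.
 : Assumption: a string element range index on "id" exists
 : in the database this runs against. :)
(
  (: Universal Index lookup: the value is tokenized into "a", "b", "c" :)
  xdmp:plan(
    cts:search(
      fn:doc(),
      cts:element-value-query(xs:QName("id"), "a~b~c")
    )
  ),
  (: Range index lookup: the value is kept whole as "a~b~c" :)
  xdmp:plan(
    cts:search(
      fn:doc(),
      cts:element-range-query(xs:QName("id"), "=", "a~b~c")
    )
  )
)
```

Each call returns a plan document; the qry:final-plan element inside it is the part quoted above.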

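In the range query plan, MarkLogic is working with an upper and lower bound, which are the same value. To find results, MarkLogic uses the “id” range index, seeks through the list of values to the matching entry, and reads the matching URIs from that entry. The value is compared as a whole string, so the “~” characters never split it into separate tokens.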
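
The revised function is not shown here, so the following is a sketch of what the one-line change presumably looks like: the cts:element-value-query is swapped for a cts:element-range-query with the “=” operator. It assumes a string element range index on es:id has been configured, with a collation matching the one the query uses; everything else is copied from the original function.

```xquery
(: Given a URI, check whether there is a corresponding Item.
 : If there is, delete it.
 : Sketch only: assumes a string element range index on es:id.
 :)
declare function lib:delete-replaced-item($uri as xs:string)
{
  let $item-uri :=
    cts:uris(
      (),
      ("limit=1", "score-zero"),
      cts:and-query((
          cts:collection-query("item"),
          (: the changed line: range query with "=" instead of a value query :)
          cts:element-range-query(
            xs:QName("es:id"),
            "=",
            lib:id-from-uri($uri)
          )
      ))
    )
  return
    if (fn:exists($item-uri)) then
      xdmp:spawn-function(
        function() { xdmp:document-delete($item-uri) }
      )
    else ()
};
```

The trade-off is that a range index has to be defined and maintained for the element, whereas the Universal Index is always available.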

MarkLogic provides many different types of indexes. Knowing the right one to use for your query can make a huge difference in performance.
