Fixing 100KB Truncation in Elasticsearch Connector

How to stop the Google Storage Connector from truncating documents at 100KB (99,999 characters) when ingesting into Elasticsearch.

When ingesting large documents from cloud storage into Elasticsearch, you may notice that the document body gets truncated during indexing. The indexed copy is then incomplete and of little use, which is frustrating.

The Problem Statement

By default, long documents ingested into Elasticsearch by the cloud connectors are truncated to exactly 99,999 characters. The limit does not come from Elasticsearch itself, but from how the connectors optimise throughput while syncing and ingesting thousands of documents.
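To check whether your own index is affected, you can look for documents whose extracted text is exactly as long as the cap. The sketch below is only an illustration: it assumes the extracted text lives in a field called `body`, and the two documents are made-up stand-ins for real search hits.

```python
# Length at which bodies were cut off in this post; adjust if your cap differs.
SUSPECT_LENGTH = 99_999


def looks_truncated(doc: dict, field: str = "body", cap: int = SUSPECT_LENGTH) -> bool:
    """Return True when the text field is exactly as long as the cap."""
    text = doc.get(field)
    return isinstance(text, str) and len(text) == cap


# Made-up documents: one cut off at the cap, one short enough to be intact.
docs = [
    {"id": "report.pdf", "body": "x" * 99_999},
    {"id": "notes.txt", "body": "short text"},
]

suspects = [d["id"] for d in docs if looks_truncated(d)]
print(suspects)
```

A body that is exactly this long is only a hint, not proof, but a cluster of documents with the same length is a strong sign that the limit is being hit.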

Documentation Gap

Unfortunately, I have not come across any Elasticsearch documentation or blog post that covers this issue in detail, or even mentions it. The internet is fairly silent on this.

The Solution

The solution involves modifying the connector’s pipeline configuration to remove the character limit. Here’s how:

Steps

Go to your connectors page.

You will see these tabs:

  • Overview
  • Documents
  • Index mappings
  • Sync rules
  • Scheduling
  • Pipelines
  • Configuration

Click on Pipelines.

On the right side you will see the connector's pipeline configuration, including the attachment settings.

[Image: Elasticsearch connector pipeline configuration showing attachment settings]

Notice how the indexed_chars parameter in the JSON is set to 100000. We need to change this value.
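If you have the pipeline JSON at hand (for example, copied from this page), the setting can also be located programmatically. This is a minimal sketch, not the connector's own code: the pipeline body and its field names are illustrative placeholders, and it only looks at top-level processors. The fallback of 100000 is the attachment processor's default when `indexed_chars` is not set.

```python
# Attachment processor default when indexed_chars is absent.
ATTACHMENT_DEFAULT = 100_000


def attachment_limits(pipeline: dict) -> list:
    """Collect the indexed_chars value of every top-level attachment processor."""
    limits = []
    for processor in pipeline.get("processors", []):
        settings = processor.get("attachment")
        if settings is not None:
            limits.append(settings.get("indexed_chars", ATTACHMENT_DEFAULT))
    return limits


# Illustrative pipeline body; the field names are placeholders.
sample_pipeline = {
    "processors": [
        {"attachment": {"field": "_attachment", "indexed_chars": 100000}},
        {"remove": {"field": "_attachment", "ignore_missing": True}},
    ]
}

print(attachment_limits(sample_pipeline))
```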

Click on “View in Stack Management”.

On the Stack Management page, you will see the same JSON on the right side.

[Image: Elasticsearch connector pipeline on Stack Management page]

Click on Edit below.

The JSON editor page will come up.

[Image: Elasticsearch connector JSON editor]

Click on Attachment.

Change Indexed Chars to -1.

Here -1 means no limit. If you would rather keep a cap, set it to any positive integer large enough for your biggest documents.

[Image: Elasticsearch connector JSON editor for Indexed Chars]

Click Update below.

Then run a full sync on the connector.
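The same edit can be sketched outside the UI using Elasticsearch's ingest pipeline endpoint (`PUT _ingest/pipeline/<id>`). Treat the following as a sketch under assumptions rather than a tested procedure: the host, the pipeline name `my-connector-pipeline`, and the pipeline body are hypothetical placeholders, and the request is only built, not sent.

```python
import copy
import json
import urllib.request


def remove_char_limit(pipeline: dict, limit: int = -1) -> dict:
    """Return a copy of the pipeline with indexed_chars set on attachment processors."""
    updated = copy.deepcopy(pipeline)
    for processor in updated.get("processors", []):
        if "attachment" in processor:
            processor["attachment"]["indexed_chars"] = limit
    return updated


def build_update_request(base_url: str, pipeline_id: str, pipeline: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT _ingest/pipeline/<id> request."""
    return urllib.request.Request(
        url=f"{base_url}/_ingest/pipeline/{pipeline_id}",
        data=json.dumps(pipeline).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder pipeline body, host, and pipeline name (all hypothetical).
current = {"processors": [{"attachment": {"field": "_attachment", "indexed_chars": 100000}}]}
updated = remove_char_limit(current)
request = build_update_request("https://localhost:9200", "my-connector-pipeline", updated)

print(request.get_method(), request.full_url)
```

Because a PUT replaces the whole pipeline definition, the body should be the complete definition fetched from your cluster with only `indexed_chars` changed. Actually sending it would need `urllib.request.urlopen(request)` plus whatever authentication header your cluster requires, and the full sync is still needed afterwards.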

Closing Thoughts

This solution was tested specifically with the Google Storage Connector linking a Storage Bucket to Elasticsearch (version 8.15.3), but the same principle should apply to other cloud storage connectors as well. For self-hosted connectors the UI will differ, but you should be able to change the same parameter in the connector's own configuration.

For optimal performance, consider monitoring your system resources after removing the character limit, especially when dealing with very large documents.

If you encounter any issues or need clarification, please share your experience in the comments section.

This post is licensed under CC BY 4.0 by the author.