Fixing 100KB Truncation in Elasticsearch Connector
How to fix Elasticsearch truncating documents at about 100KB (99,999 characters) when using the Google Storage Connector.
When ingesting large documents from cloud storage into Elasticsearch, you may notice that the document body gets truncated during indexing. The indexed copy is then incomplete and of little use for search, which is frustrating.
The Problem Statement
Long documents ingested into Elasticsearch through the cloud connectors are, by default, truncated at 99,999 characters. This is not a hard limit of Elasticsearch itself. It comes from the connector’s pipeline configuration, where the indexed_chars parameter is set to 100000 by default, a choice that favours throughput when syncing and ingesting thousands of documents.
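One quick way to confirm the symptom is to look for documents whose extracted text sits right at the limit. The sketch below is only illustrative: it assumes the documents have already been fetched as plain Python dicts, and the field name body, the id key, and the helper find_suspect_docs are assumptions made for this example, not names defined by the connector.

```python
from typing import Iterable

# Default cap discussed in this article (adjust if your pipeline differs).
DEFAULT_LIMIT = 100_000


def find_suspect_docs(docs: Iterable[dict], field: str = "body",
                      limit: int = DEFAULT_LIMIT, slack: int = 1) -> list:
    """Return the ids of documents whose text length is within `slack`
    characters of `limit`, which hints that the text was cut off there."""
    suspects = []
    for doc in docs:
        text = doc.get(field) or ""  # treat a missing or null field as empty
        if limit - slack <= len(text) <= limit:
            suspects.append(doc.get("id"))
    return suspects


if __name__ == "__main__":
    # Hypothetical sample data: one short document and one cut at 99,999.
    sample = [
        {"id": "short", "body": "x" * 500},
        {"id": "cut", "body": "x" * 99_999},
    ]
    print(find_suspect_docs(sample))
```

A document that happens to be exactly that long would also be flagged, so treat the result as a list of candidates to inspect rather than proof of truncation.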
Documentation Gap
Unfortunately, I have not come across any Elasticsearch documentation or blog post that covers this issue in detail, or even mentions it. The internet is fairly silent on this.
The Solution
The solution involves modifying the connector’s pipeline configuration to remove the character limit. Here’s how:
Steps
Go to your connectors page.
You will see these tabs:
- Overview
- Documents
- Index mappings
- Sync rules
- Scheduling
- Pipelines
- Configuration
Click on Pipelines
The pipeline’s JSON configuration appears on the right side.
Notice how the indexed_chars parameter in the JSON is set to 100000. We need to change this value.
Click on “View in Stack Management”.
On the Stack Management page, you will see the same JSON on the right side.
Click on Edit below
The JSON editor page will come up
Click on Attachment
Change Indexed Chars to -1. Here -1 means unlimited, but feel free to set it to any positive integer instead.
Click Update below.
Then run a full sync on the connector.
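If you would rather script the change than click through the UI, the edit itself is small. The following is a minimal sketch, assuming the pipeline definition has already been loaded into a Python dict with a top-level processors list, which is the shape the ingest pipeline API (GET _ingest/pipeline/<id>) returns for each pipeline. The helper name set_indexed_chars and the example pipeline body are hypothetical, and the sketch only touches top-level attachment processors, not ones nested inside other processors.

```python
import copy
import json


def set_indexed_chars(pipeline: dict, value: int = -1) -> dict:
    """Return a copy of `pipeline` in which every top-level attachment
    processor has indexed_chars set to `value` (-1 removes the limit)."""
    updated = copy.deepcopy(pipeline)  # leave the caller's dict untouched
    for processor in updated.get("processors", []):
        # Each processor is a single-key dict such as {"attachment": {...}}.
        if "attachment" in processor:
            processor["attachment"]["indexed_chars"] = value
    return updated


if __name__ == "__main__":
    # Hypothetical, trimmed-down pipeline body used only for illustration.
    example = {
        "description": "example connector pipeline",
        "processors": [
            {"attachment": {"field": "_attachment", "indexed_chars": 100000}},
            {"remove": {"field": "_attachment", "ignore_missing": True}},
        ],
    }
    print(json.dumps(set_indexed_chars(example), indent=2))
```

The updated body could then be sent back with PUT _ingest/pipeline/<id>, using the id of your own connector pipeline. A full sync is still needed afterwards so that already-indexed documents are processed again.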
Closing Thoughts
This solution was tested specifically with the Google Storage Connector linking a Storage Bucket to Elasticsearch (version 8.15.3), but the same principle should apply to other cloud storage connectors as well. For self-hosted connectors the UI will differ, but you can modify the same configuration parameter in the code of the connector settings.
For optimal performance, consider monitoring your system resources after removing the character limit, especially when dealing with very large documents.
If you encounter any issues or need clarification, please share your experience in the comments section.