-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Add ingest-attachment support for per document indexed_chars
limit
#28977
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ingest-attachment support for per document indexed_chars
limit
#28977
Conversation
We today support a global `indexed_chars` processor parameter. But in some cases, users would like to set this limit depending on the document itself. It used to be supported in mapper-attachments plugin by extracting the limit value from a meta field in the document sent to indexation process. We add an option which reads this limit value from the document itself by adding a setting named `indexed_chars_field`. Which allows running: ``` PUT _ingest/pipeline/attachment { "description" : "Extract attachment information. Used to parse pdf and office files", "processors" : [ { "attachment" : { "field" : "data", "indexed_chars_field" : "size" } } ] } ``` Then index either: ``` PUT index/doc/1?pipeline=attachment { "data": "BASE64" } ``` Which will use the default value (or the one defined by `indexed_chars`) Or ``` PUT index/doc/2?pipeline=attachment { "data": "BASE64", "size": 1000 } ``` Closes elastic#28942
Pinging @elastic/es-core-infra |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dadoonet!
LGTM, assuming the pr build passes.
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=", | ||
"attachment": { | ||
"content_type": "application/rtf", | ||
"language": "ro", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc build fails, because sl
is returned instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. TBH I was expecting that as I did not run the doc check myself but let the CI do it (I was in a taxi when I pushed my PR ;) )...
I'll fix it, run a check locally and let the CI say ok.
Thanks!
@elasticmachine test this please |
@martijnvg Apparently CI is still unhappy but I'm unsure if it's my fault. Could you check it please? |
@dadoonet The test failure looks unrelated to this change. Maybe rebase master and re-run the pr build? |
Hmmm. Sounds like I forgot to backport this PR in 6.3 branch. |
Yes, that makes sense to me. |
We today support a global `indexed_chars` processor parameter. But in some cases, users would like to set this limit depending on the document itself. It used to be supported in mapper-attachments plugin by extracting the limit value from a meta field in the document sent to indexation process. We add an option which reads this limit value from the document itself by adding a setting named `indexed_chars_field`. Which allows running: ``` PUT _ingest/pipeline/attachment { "description" : "Extract attachment information. Used to parse pdf and office files", "processors" : [ { "attachment" : { "field" : "data", "indexed_chars_field" : "size" } } ] } ``` Then index either: ``` PUT index/doc/1?pipeline=attachment { "data": "BASE64" } ``` Which will use the default value (or the one defined by `indexed_chars`) Or ``` PUT index/doc/2?pipeline=attachment { "data": "BASE64", "size": 1000 } ``` Backport of #28977 in 6.x branch (6.4.0)
…31352) We today support a global `indexed_chars` processor parameter. But in some cases, users would like to set this limit depending on the document itself. It used to be supported in mapper-attachments plugin by extracting the limit value from a meta field in the document sent to indexation process. We add an option which reads this limit value from the document itself by adding a setting named `indexed_chars_field`. Which allows running: ``` PUT _ingest/pipeline/attachment { "description" : "Extract attachment information. Used to parse pdf and office files", "processors" : [ { "attachment" : { "field" : "data", "indexed_chars_field" : "size" } } ] } ``` Then index either: ``` PUT index/doc/1?pipeline=attachment { "data": "BASE64" } ``` Which will use the default value (or the one defined by `indexed_chars`) Or ``` PUT index/doc/2?pipeline=attachment { "data": "BASE64", "size": 1000 } ``` Backport of #28977 in 6.x branch (6.4.0)
We today support a global
indexed_chars
processor parameter. But in some cases, users would like to set this limit depending on the document itself.It used to be supported in mapper-attachments plugin by extracting the limit value from a meta field in the document sent to indexation process.
We add an option which reads this limit value from the document itself
by adding a setting named
indexed_chars_field
.Which allows running:
Then index either:
Which will use the default value (or the one defined by
indexed_chars
)Or
Closes #28942.