dadoonet
Contributor

We currently support a global `indexed_chars` processor parameter, but in some cases users would like to set this limit per document.
The mapper-attachments plugin used to support this by extracting the limit value from a meta field in the document sent for indexing.

This change adds a setting named `indexed_chars_field`, which names the document field to read the limit value from.

This allows running:

```
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information. Used to parse pdf and office files",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars_field" : "size"
      }
    }
  ]
}
```

Then index either:

```
PUT index/doc/1?pipeline=attachment
{
  "data": "BASE64"
}
```

This will use the default value (or the one defined by `indexed_chars`).

Or, to take the limit from the document's `size` field:

```
PUT index/doc/2?pipeline=attachment
{
  "data": "BASE64",
  "size": 1000
}
```
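
The fallback order described above can be sketched in a few lines of Python. This is only an illustration, not the processor's actual code: `resolve_indexed_chars` is a hypothetical helper, and the 100000 default is an assumption taken from the attachment processor documentation rather than from this PR.

```python
from typing import Any, Mapping, Optional

# Assumption: documented default of the attachment processor's `indexed_chars` setting.
DEFAULT_INDEXED_CHARS = 100_000


def resolve_indexed_chars(
    document: Mapping[str, Any],
    indexed_chars: int = DEFAULT_INDEXED_CHARS,
    indexed_chars_field: Optional[str] = None,
) -> int:
    """Return the extraction limit that applies to one document (hypothetical helper)."""
    if indexed_chars_field is not None:
        per_document_limit = document.get(indexed_chars_field)
        if per_document_limit is not None:
            # The document carries its own limit, so it takes precedence.
            return int(per_document_limit)
    # No field configured, or the document does not set it:
    # fall back to the processor-level limit.
    return indexed_chars


# The two documents from the examples above.
doc_1 = {"data": "BASE64"}
doc_2 = {"data": "BASE64", "size": 1000}

limit_1 = resolve_indexed_chars(doc_1, indexed_chars_field="size")
limit_2 = resolve_indexed_chars(doc_2, indexed_chars_field="size")
print(limit_1, limit_2)
```

With the pipeline above, document 1 has no `size` field and so gets the processor-level limit, while document 2 is limited to 1000 characters.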

Closes elastic#28942
dadoonet added the `>feature`, `:Data Management/Ingest Node` (Execution or management of Ingest Pipelines including GeoIP), `v7.0.0`, and `v6.3.0` labels on Mar 10, 2018
dadoonet self-assigned this on Mar 10, 2018
dadoonet requested a review from talevy on Mar 10, 2018, 07:49
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

Member

@martijnvg martijnvg left a comment

Thanks @dadoonet!

LGTM, assuming the pr build passes.

"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"content_type": "application/rtf",
"language": "ro",
Member

The doc build fails because `sl` is returned as the language instead of `ro`.

Contributor Author

True. TBH I was expecting that as I did not run the doc check myself but let the CI do it (I was in a taxi when I pushed my PR ;) )...
I'll fix it, run a check locally and let the CI say ok.
Thanks!

@dadoonet
Contributor Author

@elasticmachine test this please

@dadoonet
Contributor Author

@martijnvg Apparently CI is still unhappy but I'm unsure if it's my fault. Could you check it please?

@martijnvg
Member

@dadoonet The test failure looks unrelated to this change. Maybe rebase master and re-run the pr build?

dadoonet merged commit 87553bb into elastic:master on Mar 14, 2018
dadoonet deleted the pr/28942-ingest-indexed_chars_field branch on Mar 14, 2018, 18:22
dadoonet removed the v6.3.0 label on Jun 15, 2018
@dadoonet
Contributor Author

Hmmm. Sounds like I forgot to backport this PR to the 6.3 branch.
So I just removed the 6.3.0 label.
Maybe I can still backport to 6.x (6.4.0).

@martijnvg
Member

> Maybe I can still backport to 6.x (6.4.0).

Yes, that makes sense to me.

dadoonet added a commit that referenced this pull request Jun 15, 2018
Backport of #28977 in 6.x branch (6.4.0)
dadoonet added a commit that referenced this pull request Jun 16, 2018
…31352)

Backport of #28977 in 6.x branch (6.4.0)
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >feature v7.0.0-beta1
4 participants