Accept Callables as Tokenizers for InMemoryDocumentStore #4720
Labels
2.x: Related to Haystack v2.0
Contributions wanted!: Looking for external contributions
type:feature: New feature or request
Discussed in #4695
Originally posted by farhanhubble April 18, 2023
InMemoryDocumentStore currently accepts only a tokenizing pattern, through the argument bm25_tokenization_regex: str = r"(?u)\b\w\w+\b". The underlying BM25 implementation supports a callable, though. Removing this restriction would enable correct tokenization of a larger variety of corpora. I ran into this limitation while trying to index JSON documents that contain key-value pairs.
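To make the request concrete, here is a minimal sketch of how a store could accept either a regex string or a callable. It uses only the Python standard library and is not Haystack's actual API: resolve_tokenizer, json_kv_tokenizer, and the sample document are hypothetical names and data invented for illustration. Only the default pattern is taken from the issue text.

```python
import json
import re
from typing import Callable, List, Union

Tokenizer = Callable[[str], List[str]]

# Default pattern quoted in the issue: words of two or more characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"


def resolve_tokenizer(tokenizer: Union[str, Tokenizer] = DEFAULT_PATTERN) -> Tokenizer:
    """Hypothetical helper: normalize a regex string or a callable to a callable."""
    if callable(tokenizer):
        return tokenizer
    # A pattern without capture groups makes findall return the full matches.
    return re.compile(tokenizer).findall


def json_kv_tokenizer(text: str) -> List[str]:
    """Illustrative tokenizer: one lowercase 'key=value' token per top-level pair."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        # Not a JSON object: fall back to the default regex tokenization.
        return resolve_tokenizer()(text.lower())
    return [f"{key}={value}".lower() for key, value in data.items()]


# Hypothetical document; not the example from the original post.
doc = '{"color": "red", "size": "L"}'

regex_tokens = resolve_tokenizer()(doc)
callable_tokens = resolve_tokenizer(json_kv_tokenizer)(doc)

print(regex_tokens)     # ['color', 'red', 'size']
print(callable_tokens)  # ['color=red', 'size=l']
```

With the default pattern, the single-character value "L" is dropped and the link between each key and its value is lost. A callable can keep each pair together as one token. If the tokenizer were handed to a BM25 backend that tokenizes the corpus in worker processes, it would likely need to be a picklable, module-level function rather than a lambda.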