Skip to content

Text Utilities

Working with Generative AI (and in particular with the text modality), requires a lot of text manipulation. PromptingTools.jl provides a set of utilities to make this process easier and more efficient.

Highlights

The main functions to be aware of are

  • recursive_splitter to split the text into sentences and words (of a desired length max_length)

  • replace_words to mask some sensitive words in your text before sending it to AI

  • wrap_string for wrapping the text into a desired length by adding newlines (eg, to fit some large text into your terminal width)

  • length_longest_common_subsequence to find the length of the longest common subsequence between two strings (eg, to compare the similarity between the context provided and generated text)

  • distance_longest_common_subsequence a companion utility for length_longest_common_subsequence to find the normalized distance between two strings. Always returns a number between 0-1, where 0 means the strings are identical and 1 means they are completely different.

You can import them simply via:

julia
using PromptingTools: recursive_splitter, replace_words, wrap_string, length_longest_common_subsequence, distance_longest_common_subsequence

There are many more (especially in the AgentTools and RAGTools experimental modules)!

RAGTools module contains the following text utilities:

  • split_into_code_and_sentences to split a string into code and sentences

  • tokenize to tokenize a string (eg, a sentence) into words

  • trigrams to generate trigrams from a string (eg, a word)

  • text_to_trigrams to generate trigrams from a larger string (ie, effectively wraps the three functions above)

  • STOPWORDS a set of common stopwords (very brief)

Feel free to open an issue or ask in the #generative-ai channel in the JuliaLang Slack if you have a specific need.

References

# PromptingTools.recursive_splitterFunction.
julia
recursive_splitter(text::String; separator::String=" ", max_length::Int=35000) -> Vector{String}

Split a given string text into chunks of a specified maximum length max_length. This is particularly useful for splitting larger documents or texts into smaller segments, suitable for models or systems with smaller context windows.

There is a method for dispatching on multiple separators, recursive_splitter(text::String, separators::Vector{String}; max_length::Int=35000) -> Vector{String} that mimics the logic of Langchain's RecursiveCharacterTextSplitter.

Arguments

  • text::String: The text to be split.

  • separator::String=" ": The separator used to split the text into minichunks. Defaults to a space character.

  • max_length::Int=35000: The maximum length of each chunk. Defaults to 35,000 characters, which should fit within 16K context window.

Returns

Vector{String}: A vector of strings, each representing a chunk of the original text that is smaller than or equal to max_length.

Notes

  • The function ensures that each chunk is as close to max_length as possible without exceeding it.

  • If the text is empty, the function returns an empty array.

  • The separator is re-added to the text chunks after splitting, preserving the original structure of the text as closely as possible.

Examples

Splitting text with the default separator (" "):

julia
text = "Hello world. How are you?"
chunks = recursive_splitter(text; max_length=13)
length(chunks) # Output: 2

Using a custom separator and custom max_length

julia
text = "Hello,World," ^ 2900 # length 34900 chars
recursive_splitter(text; separator=",", max_length=10000) # for 4K context window
length(chunks[1]) # Output: 4

source

julia
recursive_splitter(text::AbstractString, separators::Vector{String}; max_length::Int=35000) -> Vector{String}

Split a given string text into chunks recursively using a series of separators, with each chunk having a maximum length of max_length (if it's achievable given the separators provided). This function is useful for splitting large documents or texts into smaller segments that are more manageable for processing, particularly for models or systems with limited context windows.

It was previously known as split_by_length.

This is similar to Langchain's RecursiveCharacterTextSplitter. To achieve the same behavior, use separators=["\n\n", "\n", " ", ""].

Arguments

  • text::AbstractString: The text to be split.

  • separators::Vector{String}: An ordered list of separators used to split the text. The function iteratively applies these separators to split the text. Recommend to use ["\n\n", ". ", "\n", " "]

  • max_length::Int: The maximum length of each chunk. Defaults to 35,000 characters. This length is considered after each iteration of splitting, ensuring chunks fit within specified constraints.

Returns

Vector{String}: A vector of strings, where each string is a chunk of the original text that is smaller than or equal to max_length.

Usage Tips

  • I tend to prefer splitting on sentences (". ") before splitting on newline characters ("\n") to preserve the structure of the text.

  • What's the difference between separators=["\n"," ",""] and separators=["\n"," "]? The former will split down to character level (""), so it will always achieve the max_length but it will split words (bad for context!) I prefer to instead set slightly smaller max_length but not split words.

How It Works

  • The function processes the text iteratively with each separator in the provided order. It then measures the length of each chunk and splits it further if it exceeds the max_length. If the chunks is "short enough", the subsequent separators are not applied to it.

  • Each chunk is as close to max_length as possible (unless we cannot split it any further, eg, if the splitters are "too big" / there are not enough of them)

  • If the text is empty, the function returns an empty array.

  • Separators are re-added to the text chunks after splitting, preserving the original structure of the text as closely as possible. Apply strip if you do not need them.

  • The function provides separators as the second argument to distinguish itself from its single-separator counterpart dispatch.

Examples

Splitting text using multiple separators:

julia
text = "Paragraph 1\n\nParagraph 2. Sentence 1. Sentence 2.\nParagraph 3"
separators = ["\n\n", ". ", "\n"] # split by paragraphs, sentences, and newlines (not by words)
chunks = recursive_splitter(text, separators, max_length=20)

Splitting text using multiple separators - with splitting on words:

julia
text = "Paragraph 1\n\nParagraph 2. Sentence 1. Sentence 2.\nParagraph 3"
separators = ["\n\n", ". ", "\n", " "] # split by paragraphs, sentences, and newlines, words
chunks = recursive_splitter(text, separators, max_length=10)

Using a single separator:

julia
text = "Hello,World," ^ 2900  # length 34900 characters
chunks = recursive_splitter(text, [","], max_length=10000)

To achieve the same behavior as Langchain's RecursiveCharacterTextSplitter, use separators=["\n\n", "\n", " ", ""].

julia
text = "Paragraph 1\n\nParagraph 2. Sentence 1. Sentence 2.\nParagraph 3"
separators = ["\n\n", "\n", " ", ""]
chunks = recursive_splitter(text, separators, max_length=10)

source


# PromptingTools.replace_wordsFunction.
julia
replace_words(text::AbstractString, words::Vector{<:AbstractString}; replacement::AbstractString="ABC")

Replace all occurrences of words in words with replacement in text. Useful to quickly remove specific names or entities from a text.

Arguments

  • text::AbstractString: The text to be processed.

  • words::Vector{<:AbstractString}: A vector of words to be replaced.

  • replacement::AbstractString="ABC": The replacement string to be used. Defaults to "ABC".

Example

julia
text = "Disney is a great company"
replace_words(text, ["Disney", "Snow White", "Mickey Mouse"])
# Output: "ABC is a great company"

source


# PromptingTools.wrap_stringFunction.
julia
wrap_string(str::String,
    text_width::Int = 20;
    newline::Union{AbstractString, AbstractChar} = '

')

Breaks a string into lines of a given text_width. Optionally, you can specify the newline character or string to use.

Example:

julia
wrap_string("Certainly, here's a function in Julia that will wrap a string according to the specifications:", 10) |> print

source


# PromptingTools.length_longest_common_subsequenceFunction.
julia
length_longest_common_subsequence(itr1, itr2)

Compute the length of the longest common subsequence between two sequences (ie, the higher the number, the better the match).

Source: https://cn.julialang.org/LeetCode.jl/dev/democards/problems/problems/1143.longest-common-subsequence/

Arguments

  • itr1: The first sequence, eg, a String.

  • itr2: The second sequence, eg, a String.

Returns

The length of the longest common subsequence.

Examples

julia
text1 = "abc-abc----"
text2 = "___ab_c__abc"
longest_common_subsequence(text1, text2)
# Output: 6 (-> "abcabc")

It can be used to fuzzy match strings and find the similarity between them (Tip: normalize the match)

julia
commands = ["product recommendation", "emotions", "specific product advice", "checkout advice"]
query = "Which product can you recommend for me?"
let pos = argmax(length_longest_common_subsequence.(Ref(query), commands))
    dist = length_longest_common_subsequence(query, commands[pos])
    norm = dist / min(length(query), length(commands[pos]))
    @info "The closest command to the query: "$(query)" is: "$(commands[pos])" (distance: $(dist), normalized: $(norm))"
end

But it might be easier to use directly the convenience wrapper distance_longest_common_subsequence!



[source](https://github.com/svilupp/PromptingTools.jl/blob/9e9fd16f6bbf320f1dbe7d136be6719a91c3d3aa/src/utils.jl#L250-L286)

</div>
<br>
<div style='border-width:1px; border-style:solid; border-color:black; padding: 1em; border-radius: 25px;'>
<a id='PromptingTools.distance_longest_common_subsequence-extra_tools-text_utilities_intro' href='#PromptingTools.distance_longest_common_subsequence-extra_tools-text_utilities_intro'>#</a>&nbsp;<b><u>PromptingTools.distance_longest_common_subsequence</u></b> &mdash; <i>Function</i>.




```julia
distance_longest_common_subsequence(
    input1::AbstractString, input2::AbstractString)

distance_longest_common_subsequence(
    input1::AbstractString, input2::AbstractVector{<:AbstractString})

Measures distance between two strings using the length of the longest common subsequence (ie, the lower the number, the better the match). Perfect match is distance = 0.0

Convenience wrapper around length_longest_common_subsequence to normalize the distances to 0-1 range. There is a also a dispatch for comparing a string vs an array of strings.

Notes

  • Use argmin and minimum to find the position of the closest match and the distance, respectively.

  • Matching with an empty string will always return 1.0 (worst match), even if the other string is empty as well (safety mechanism to avoid division by zero).

Arguments

  • input1::AbstractString: The first string to compare.

  • input2::AbstractString: The second string to compare.

Example

You can also use it to find the closest context for some AI generated summary/story:

julia
context = ["The enigmatic stranger vanished as swiftly as a wisp of smoke, leaving behind a trail of unanswered questions.",
    "Beneath the shimmering moonlight, the ocean whispered secrets only the stars could hear.",
    "The ancient tree stood as a silent guardian, its gnarled branches reaching for the heavens.",
    "The melody danced through the air, painting a vibrant tapestry of emotions.",
    "Time flowed like a relentless river, carrying away memories and leaving imprints in its wake."]

story = """
    Beneath the shimmering moonlight, the ocean whispered secrets only the stars could hear.

    Under the celestial tapestry, the vast ocean whispered its secrets to the indifferent stars. Each ripple, a murmured confidence, each wave, a whispered lament. The glittering celestial bodies listened in silent complicity, their enigmatic gaze reflecting the ocean's unspoken truths. The cosmic dance between the sea and the sky, a symphony of shared secrets, forever echoing in the ethereal expanse.
    """

dist = distance_longest_common_subsequence(story, context)
@info "The closest context to the query: "$(first(story,20))..." is: "$(context[argmin(dist)])" (distance: $(minimum(dist)))"

source