Pipelines - Data Preparation Primitives and Options

For most use cases, we recommend you use the aidb preparers to perform bulk pre-processing preparation steps on your data.

However, for testing and developing the configurations that suit your data best, you can use the following primitives. These functions allow you to test operations and their configurations on individual inputs with minimal setup. This is useful for quick experimentation before scaling up with a preparer for bulk data preparation.

Configuration

All data preparation operations can be customized with different options. The API for these options is identical between the primitives and the preparer, so you can prototype options with the aidb.chunk_text() primitive for use with a scalable Preparer that performs the ChunkText operation.

Chunk Text

Call aidb.chunk_text() to break text into smaller chunks.

SELECT
    chunk_id,
    chunk
FROM aidb.chunk_text(
    input => 'This is a significantly longer text example that might require splitting into smaller chunks. The purpose of this function is to partition text data into segments of a specified maximum length, for example, this sentence 145 is characters. This enables processing or storage of data in manageable parts.',
    options = '{"desired_length": 120, "max_length": 150}'
);
  • The desired_length size is the target size for the chunk. In most cases, this will also serve as the maximum size of the chunk. It is always possible that a chunk may be returned that is less than the desired value, as adding the next piece of text may have made it larger than the desired capacity.
  • The max_length size is the maximum possible chunk size that can be generated. By setting this to a larger value than desired, it means that the chunk should be as close to desired as possible, but can be larger if it means staying at a larger semantic level.
Note

This primitive function returns each chunk with a chunk_id for ease of development. However, a Preparer with the ChunkText operation outputs a single text array per input that can then be unnested as desired.

Parse HTML

Call aidb.parse_html() to extract text from HTML.

SELECT * FROM aidb.parse_html(
    html =>
        '<h1>Hello, world!</h1>
        <p>This is my first web page.</p>
        <p>
            It contains some <strong>bold text</strong>, some <em>italic test</em>, and a <a href="https://google.com" target="_blank">link</a>.
        </p>

        <img src="postgres_logo.png" alt="Postgres Logo Image">

        <ol>
            <li>List item</li>
            <li>List item</li>
            <li>List item</li>
        </ol>',
    options => '{"method": "StructuredPlaintext"}' -- Default
);
  • The method determines how the HTML is parsed:
    • StructuredPlaintext (Default) - Algorithmic text extraction to plaintext
    • StructuredMarkdown - Algorithmic text extraction to markdown-esque text that retains some syntactical context

Parse PDF

Call aidb.parse_pdf() to extract text from PDF bytes.

SELECT * FROM aidb.parse_pdf(
    bytes => decode('255044462d312e340a25b89a929d0a312030206f626a3c3c2f547970652f436174616c6f672f50616765732033203020523e3e0a656e646f626a0a322030206f626a3c3c2f50726f64756365722847656d426f782047656d426f782e50646620312e37202831372e302e33352e313034323b202e4e4554204672616d65776f726b29292f4372656174696f6e4461746528443a32303231313032383135313732312b303227303027293e3e0a656e646f626a0a332030206f626a3c3c2f547970652f50616765732f4b6964735b34203020525d2f436f756e7420312f4d65646961426f785b302030203539352e3332203834312e39325d3e3e0a656e646f626a0a342030206f626a3c3c2f547970652f506167652f506172656e742033203020522f5265736f75726365733c3c2f466f6e743c3c2f46302036203020523e3e3e3e2f436f6e74656e74732035203020523e3e0a656e646f626a0a352030206f626a3c3c2f4c656e6774682035393e3e73747265616d0a42540a2f46302031322054660a3120302030203120313030203730322e3733363636363720546d0a2848656c6c6f20576f726c642129546a0a45540a656e6473747265616d0a656e646f626a0a362030206f626a3c3c2f547970652f466f6e742f537562747970652f54797065312f42617365466f6e742f48656c7665746963612f4669727374436861722033322f4c61737443686172203131342f5769647468732037203020522f466f6e7444657363726970746f722038203020523e3e0a656e646f626a0a372030206f626a5b3237382032373820302030203020302030203020302030203020302030203020302030203020302030203020302030203020302030203020302030203020302030203020302030203020302030203020302030203732322030203020302030203020302030203020302030203020302030203020393434203020302030203020302030203020302030203020302030203535362035353620302030203020302030203020323232203020302035353620302030203333335d0a656e646f626a0a382030206f626a3c3c2f547970652f466f6e7444657363726970746f722f466c6167732033322f466f6e744e616d652f48656c7665746963612f466f6e7446616d696c792848656c766574696361292f466f6e74576569676874203530302f4974616c6963416e676c6520302f466f6e7442426f785b2d313636202d3232352031303030203933315d2f436170486569676874203731382f58486569676874203532332f417363656e74203731382f44657363656e74202d3230372f5374656d482037362f5374656d562038383e3e0a656e646f626a0a787265660a3020390a303030303030303030302036353533352066200a30303030303030303135203030303030206e200a30303030303030303539203030303030206e200a30303030303030313739203030303030206e200a30303030303030323537203030303030206e200a30303030303030333436203030303030206e200a30303030303030343531203030303030206e200a30303030303030353733203030303030206e200a30303030303030373733203030303030206e200a747261696c65720a3c3c2f526f6f742031203020522f49445b3c39333932413539463342453742383430383035443632373436453841344632393e3c39333932413539463342453742383430383035443632373436453841344632393e5d2f496e666f2032203020522f53697a6520393e3e0a7374617274787265660a3938380a2525454f460a', 'hex'),
    options => '{"method": "Structured", "allow_partial_parsing": true}' -- Default
);
  • The method determines how the PDF is parsed:
    • Structured (Default) - Algorithmic text extraction
      • The allow_partial_parsing flag determines whether PDFs should still be parsed when the parser encounters errors on one or more pages. Defaults to true.

Summarize Text

Call aidb.summarize_text() to summarize text.

-- Create a model for use in summarization
SELECT aidb.create_model('my_t5_model', 't5_local');

SELECT * FROM aidb.summarize_text(
    input => 'There are times when the night sky glows with bands of color. The bands may begin as cloud shapes and then spread into a great arc across the entire sky. They may fall in folds like a curtain drawn across the heavens. The lights usually grow brighter, then suddenly dim. During this time the sky glows with pale yellow, pink, green, violet, blue, and red. These lights are called the Aurora Borealis. Some people call them the Northern Lights. Scientists have been watching them for hundreds of years. They are not quite sure what causes them. In ancient times Long Beach City College WRSC Page 2 of 2 people were afraid of the Lights. They imagined that they saw fiery dragons in the sky. Some even concluded that the heavens were on fire.',
    options => '{"model": "my_t5_model"}'
);
  • The model is the name of the created model to use for summarization. The model must support the decode_text() and decode_text_batch() model primitives.

Could this page be better? Report a problem or suggest an addition!