Pipelines - Data Preparation Primitives and Options
For most use cases, we recommend using the aidb preparers to run bulk data preparation steps on your data.
However, for testing and developing the configurations that suit your data best, you can use the following primitives. These functions allow you to test operations and their configurations on individual inputs with minimal setup. This is useful for quick experimentation before scaling up with a preparer for bulk data preparation.
Configuration
All data preparation operations can be customized with different options. The API for these options is identical between the primitives and the preparer, so you can prototype options with the aidb.chunk_text() primitive and then reuse them in a scalable Preparer that performs the ChunkText operation.
Chunk Text
Call aidb.chunk_text() to break text into smaller chunks.
SELECT chunk_id, chunk
FROM aidb.chunk_text(
    input => 'This is a significantly longer text example that might require splitting into smaller chunks. The purpose of this function is to partition text data into segments of a specified maximum length, for example, this sentence is 145 characters. This enables processing or storage of data in manageable parts.',
    options => '{"desired_length": 120, "max_length": 150}'
);
- The desired_length size is the target size for the chunk. In most cases, this also serves as the maximum size of the chunk. A returned chunk can always be smaller than desired_length, because adding the next piece of text would have made it larger than the desired size.
- The max_length size is the maximum possible chunk size that can be generated. Setting this to a larger value than desired_length means that each chunk should be as close to desired_length as possible but can be larger if that keeps it at a larger semantic level.
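To see how these two options play out on your own text, it can help to look at the size of each returned chunk. The following is a minimal sketch that reuses the chunk_id and chunk columns from the example above and adds PostgreSQL's built-in length() function. The input string and option values are illustrative, and treating the limits as character counts is an assumption.

```sql
-- Inspect chunk sizes for a given configuration.
-- The input text and the option values are illustrative assumptions.
SELECT chunk_id,
       length(chunk) AS chunk_length,  -- built-in PostgreSQL function (characters)
       chunk
FROM aidb.chunk_text(
    input => 'Postgres stores rows in tables. Each table has columns with fixed types. Indexes speed up lookups on those columns. Vacuum reclaims space left behind by old row versions.',
    options => '{"desired_length": 40, "max_length": 60}'
);
```

Comparing chunk_length with the configured values while varying desired_length and max_length is a quick way to settle on options before moving them into a Preparer.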
Note
This primitive function returns each chunk with a chunk_id for ease of development. However, a Preparer with the ChunkText operation outputs a single text array per input that can then be unnested as desired.
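As a rough illustration of that unnesting step, the sketch below uses a literal text array as a stand-in for Preparer output, since the Preparer's output table and column names are not shown here. unnest() combined with WITH ORDINALITY is standard PostgreSQL and yields one row per array element together with its 1-based position.

```sql
-- Stand-in for a Preparer's text-array output (the array literal is an assumption).
-- WITH ORDINALITY adds a 1-based position column next to each unnested element.
SELECT t.ord AS chunk_id,
       t.chunk
FROM unnest(ARRAY['First chunk.', 'Second chunk.', 'Third chunk.'])
     WITH ORDINALITY AS t(chunk, ord);
```

In a real pipeline, the array literal would be replaced by the array column of the Preparer's destination table.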
Parse HTML
Call aidb.parse_html() to extract text from HTML.
SELECT * FROM aidb.parse_html(
    html => '<h1>Hello, world!</h1> <p>This is my first web page.</p> <p> It contains some <strong>bold text</strong>, some <em>italic test</em>, and a <a href="https://google.com" target="_blank">link</a>. </p> <img src="postgres_logo.png" alt="Postgres Logo Image"> <ol> <li>List item</li> <li>List item</li> <li>List item</li> </ol>',
    options => '{"method": "StructuredPlaintext"}' -- Default
);
- The method determines how the HTML is parsed:
  - StructuredPlaintext (Default) - Algorithmic text extraction to plaintext
  - StructuredMarkdown - Algorithmic text extraction to markdown-esque text that retains some syntactical context
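To compare the two methods, you can run the same call with the non-default StructuredMarkdown value. This is a minimal sketch with a shortened, illustrative HTML snippet; the exact shape of the returned text is not described here, so the query simply selects everything the function returns.

```sql
-- Same primitive as above, switched to the non-default method.
-- The HTML snippet is a shortened, illustrative input.
SELECT * FROM aidb.parse_html(
    html => '<h1>Hello, world!</h1> <p>Some <strong>bold text</strong> and a <a href="https://google.com">link</a>.</p>',
    options => '{"method": "StructuredMarkdown"}'
);
```

Running both variants against a representative page from your own data shows which output suits later steps such as chunking.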
Parse PDF
Call aidb.parse_pdf() to extract text from PDF bytes.
SELECT * FROM aidb.parse_pdf( bytes => decode('255044462d312e340a25b89a929d0a312030206f626a3c3c2f547970652f436174616c6f672f50616765732033203020523e3e0a656e646f626a0a322030206f626a3c3c2f50726f64756365722847656d426f782047656d426f782e50646620312e37202831372e302e33352e313034323b202e4e4554204672616d65776f726b29292f4372656174696f6e4461746528443a32303231313032383135313732312b303227303027293e3e0a656e646f626a0a332030206f626a3c3c2f547970652f50616765732f4b6964735b34203020525d2f436f756e7420312f4d65646961426f785b302030203539352e3332203834312e39325d3e3e0a656e646f626a0a342030206f626a3c3c2f547970652f506167652f506172656e742033203020522f5265736f75726365733c3c2f466f6e743c3c2f46302036203020523e3e3e3e2f436f6e74656e74732035203020523e3e0a656e646f626a0a352030206f626a3c3c2f4c656e6774682035393e3e73747265616d0a42540a2f46302031322054660a3120302030203120313030203730322e3733363636363720546d0a2848656c6c6f20576f726c642129546a0a45540a656e6473747265616d0a656e646f626a0a362030206f626a3c3c2f547970652f466f6e742f537562747970652f54797065312f42617365466f6e742f48656c7665746963612f4669727374436861722033322f4c61737443686172203131342f5769647468732037203020522f466f6e7444657363726970746f722038203020523e3e0a656e646f626a0a372030206f626a5b3237382032373820302030203020302030203020302030203020302030203020302030203020302030203020302030203020302030203020302030203020302030203020302030203020302030203020302030203732322030203020302030203020302030203020302030203020302030203020393434203020302030203020302030203020302030203020302030203535362035353620302030203020302030203020323232203020302035353620302030203333335d0a656e646f626a0a382030206f626a3c3c2f547970652f466f6e7444657363726970746f722f466c6167732033322f466f6e744e616d652f48656c7665746963612f466f6e7446616d696c792848656c766574696361292f466f6e74576569676874203530302f4974616c6963416e676c6520302f466f6e7442426f785b2d313636202d3232352031303030203933315d2f436170486569676874203731382f58486569676874203532332f417363656e74203731382f44657363656e74202d3230372f5374656d482037362f5374656d56203
8383e3e0a656e646f626a0a787265660a3020390a303030303030303030302036353533352066200a30303030303030303135203030303030206e200a30303030303030303539203030303030206e200a30303030303030313739203030303030206e200a30303030303030323537203030303030206e200a30303030303030333436203030303030206e200a30303030303030343531203030303030206e200a30303030303030353733203030303030206e200a30303030303030373733203030303030206e200a747261696c65720a3c3c2f526f6f742031203020522f49445b3c39333932413539463342453742383430383035443632373436453841344632393e3c39333932413539463342453742383430383035443632373436453841344632393e5d2f496e666f2032203020522f53697a6520393e3e0a7374617274787265660a3938380a2525454f460a', 'hex'),
    options => '{"method": "Structured", "allow_partial_parsing": true}' -- Default
);
- The method determines how the PDF is parsed:
  - Structured (Default) - Algorithmic text extraction
- The allow_partial_parsing flag determines whether PDFs should still be parsed when the parser encounters errors on one or more pages. Defaults to true.
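For quick experiments with a real file rather than an inline hex string, one option is to read the bytes with PostgreSQL's built-in pg_read_binary_file() function. The sketch below assumes a hypothetical file path on the database server and a role that is allowed to read server files. It also turns off allow_partial_parsing so that page-level parser errors are not silently tolerated; how such a failure is reported is not covered here.

```sql
-- '/tmp/sample.pdf' is a hypothetical path on the database server.
-- pg_read_binary_file() returns bytea and requires permission to read server files.
SELECT * FROM aidb.parse_pdf(
    bytes => pg_read_binary_file('/tmp/sample.pdf'),
    options => '{"method": "Structured", "allow_partial_parsing": false}'
);
```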
Summarize Text
Call aidb.summarize_text() to summarize text.
-- Create a model for use in summarization
SELECT aidb.create_model('my_t5_model', 't5_local');

SELECT * FROM aidb.summarize_text(
    input => 'There are times when the night sky glows with bands of color. The bands may begin as cloud shapes and then spread into a great arc across the entire sky. They may fall in folds like a curtain drawn across the heavens. The lights usually grow brighter, then suddenly dim. During this time the sky glows with pale yellow, pink, green, violet, blue, and red. These lights are called the Aurora Borealis. Some people call them the Northern Lights. Scientists have been watching them for hundreds of years. They are not quite sure what causes them. In ancient times people were afraid of the Lights. They imagined that they saw fiery dragons in the sky. Some even concluded that the heavens were on fire.',
    options => '{"model": "my_t5_model"}'
);
- The model is the name of the created model to use for summarization. The model must support the decode_text() and decode_text_batch() model primitives.