Extract Text is a transformation that isolates specific parts of a string using regular expressions. It is useful for parsing complex text data, extracting meaningful information, and preparing data for further analysis or processing.


Why Use Text Extraction?
Data Parsing: Extract specific information from structured text.
Data Cleaning: Isolate relevant parts of messy or inconsistent data.
Information Retrieval: Pull out key details from larger text blocks.
Data Transformation: Prepare data for further processing or analysis.
Tip
Before implementing text extraction, analyze your data to identify common patterns and structures that can be targeted with regular expressions.
Applying Extract Text Transformation
Follow these steps to apply the Extract Text transformation:
Select "Extract Text" from the list of transformations.
Enter the "Regular Expression" that matches your target text pattern.
Specify the "Group Number" to extract the desired part of the match.
Click "Save" to apply the transformation.
Tip
Test your regular expressions on a sample of your data to ensure they capture the intended information accurately.
Extract Text Configuration
Two key components are required for text extraction:
Regular Expression
Purpose: Extracts the filename from a complete file path by capturing only the text that appears after the final slash.
Pattern: ^.*/([^/]+)$
Example: When applied to s3://data-pipeline-qa.unifyapps.com/1.pdf, this pattern captures 1.pdf.
How It Works:
^.* greedily matches everything from the start of the string up to the last forward slash
/ matches that last forward slash in the path
([^/]+) captures one or more characters that are not a slash
$ anchors the match to the end of the string
Result: Only the filename is extracted, without any directory information.
Group Number
Purpose:
Identifies which portion of the matched text to extract when your pattern contains multiple capturing groups.
Format:
Enter $1 for the first group, $2 for the second group, and so on.
Example:
If your regex ([A-Z]+)-([0-9]+) matches INV-12345:
$0 returns the entire match: INV-12345
$1 returns just the first group: INV
$2 returns just the second group: 12345
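The two components above can be tried outside the product with a short Python sketch. This is only an illustration: Python's re module stands in for the transformation's regex engine, which may behave differently, and the helper name extract_text is hypothetical. Python takes the group as an integer (0, 1, 2) where the Group Number field takes $0, $1, $2.

```python
import re
from typing import Optional


def extract_text(pattern: str, group: int, text: str) -> Optional[str]:
    """Return the requested capture group, or None if the pattern does not match.

    Hypothetical helper for illustration; not part of the product.
    """
    match = re.search(pattern, text)
    return match.group(group) if match else None


# Filename pattern from the Regular Expression example.
filename = extract_text(r"^.*/([^/]+)$", 1, "s3://data-pipeline-qa.unifyapps.com/1.pdf")
print(filename)  # 1.pdf

# Invoice pattern from the Group Number example.
invoice_pattern = r"([A-Z]+)-([0-9]+)"
print(extract_text(invoice_pattern, 0, "INV-12345"))  # INV-12345 (entire match, like $0)
print(extract_text(invoice_pattern, 1, "INV-12345"))  # INV (first group, like $1)
print(extract_text(invoice_pattern, 2, "INV-12345"))  # 12345 (second group, like $2)
```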


Tip
Use regex testing tools to visualize and refine your pattern matching before applying it to your data.
Testing the Transformation
After configuring the extraction:
Enter sample text in the "Test Transformation" field.
Click the "Test" button to see the extracted output.
Verify that the correct portion of the text is extracted.
Tip
Test with various input formats to ensure your extraction works across different data variations.
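As a rough offline complement to the "Test" button, the sketch below runs the filename pattern against several input shapes. Only the first sample comes from this page; the others are hypothetical, and Python's re module is assumed to be close enough to the product's regex engine for a first check.

```python
import re

FILENAME = re.compile(r"^.*/([^/]+)$")

samples = [
    "s3://data-pipeline-qa.unifyapps.com/1.pdf",  # path from the example above
    "reports/2024/summary.csv",                   # hypothetical nested path
    "1.pdf",                                      # hypothetical: no slash at all
    "s3://bucket/folder/",                        # hypothetical: trailing slash, no filename
]

# Map each sample to the extracted filename, or None when nothing matches.
results = {}
for sample in samples:
    match = FILENAME.search(sample)
    results[sample] = match.group(1) if match else None

for sample, extracted in results.items():
    print(f"{sample!r} -> {extracted!r}")
```

The last two samples produce no match, because the pattern requires at least one slash followed by at least one non-slash character at the end of the string. Checking variations like these before saving the transformation makes such gaps visible early.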
Best Practices for Text Extraction
Specificity: Create regular expressions that are as specific as possible to avoid false matches.
Flexibility: Account for potential variations in your data format.
Documentation: Clearly comment your regular expressions to explain their purpose and logic.
Error Handling: Implement fallback options for cases where the extraction fails.
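The Documentation and Error Handling practices can be sketched in Python as follows. This assumes Python's re.VERBOSE flag for inline comments, and extract_or_default is a hypothetical helper name. Whether the transformation itself supports commented patterns or default values is not stated here, so treat this as a way to prototype and document a pattern rather than as product behavior.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows "#" comments,
# so each part of the expression can be documented in place.
INVOICE = re.compile(
    r"""
    ([A-Z]+)   # group 1: uppercase prefix, e.g. INV
    -          # literal hyphen separator
    ([0-9]+)   # group 2: numeric identifier, e.g. 12345
    """,
    re.VERBOSE,
)


def extract_or_default(text: str, group: int, default: str = "") -> str:
    """Return the requested group, or a fallback value when extraction fails.

    Hypothetical helper for illustration; not part of the product.
    """
    match = INVOICE.search(text)
    return match.group(group) if match else default


print(extract_or_default("INV-12345", 2))                            # 12345
print(extract_or_default("no invoice here", 1, default="UNKNOWN"))   # UNKNOWN
```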
Tip
Regularly review and update your extraction patterns, since your data sources or formats may change over time.