Regex

Extract entities using a regex pattern.

AI_ENTITY_EXTRACTION


Extract entities from text using a regular expression (regex) pattern.


Overview

The Regex feature extracts specific entities from a text string by applying a regular expression pattern. It is well suited to identifying structured data such as product references, email addresses, or codes within unstructured text.


Inputs

Name           ID     Description
Text           text   The input text from which entities are extracted.
Regex Pattern  regex  The regular expression pattern to search for.

Outputs

Name     ID       Description
Success  success  Boolean indicating whether the extraction was successful.
Entity   match    The extracted entity matching the regex pattern.
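
The input/output contract above can be approximated locally with a minimal sketch. It assumes Python's re module as the engine and a hypothetical helper named extract_entity; the source does not state which regex engine the action uses, whether it returns the first match, or how it reports an invalid pattern, so those choices are assumptions made for illustration.

```python
import re
from typing import Optional, Tuple


def extract_entity(text: str, regex: str) -> Tuple[bool, Optional[str]]:
    """Hypothetical stand-in for the action: returns (success, match)."""
    try:
        found = re.search(regex, text)
    except re.error:
        # Assumption: an invalid pattern counts as a failed extraction.
        return False, None
    if found is None:
        return False, None
    # Assumption: the first match in the text is the extracted entity.
    return True, found.group(0)


success, match = extract_entity("Order REF-12345 shipped today", r"REF-\d+")
print(success, match)  # True REF-12345
```

A sketch like this is mainly useful for trying a pattern against sample text before configuring it in the Regex Pattern input.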

Notes

  • Ensure the regex pattern is properly formatted and tested for accuracy.
  • Action ID for this operation: b620331b-db42-4d53-82e2-caaf54c52ecc.

Introduction to Regular Expressions (Regex)

Regular expressions (regex or regexp) are sequences of characters that define a search pattern. They are commonly used for string matching, searching, and text manipulation in programming, text processing, and data validation.


Basic Syntax

Literal Characters

Literal characters match themselves exactly. For example:

  • cat matches the string "cat".

Metacharacters

Metacharacters have special meanings in regex:

Character  Description
.          Matches any single character except newline
^          Matches the start of a string
$          Matches the end of a string
*          Matches 0 or more occurrences of the preceding element
+          Matches 1 or more occurrences of the preceding element
?          Matches 0 or 1 occurrence of the preceding element
{}         Matches a specified number of occurrences
[]         Matches any character in the set
|          Acts as an OR operator
()         Groups patterns and captures matches
\          Escapes metacharacters
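
A few of these metacharacters can be seen in action with a short sketch. It assumes Python's re module; other engines may differ in details, and the sample strings are invented for illustration.

```python
import re

# "." matches any single character except a newline.
dot = re.findall(r"c.t", "cat cot c\nt")            # ['cat', 'cot']

# "?" makes the preceding element optional.
optional = re.findall(r"colou?r", "color colour")   # ['color', 'colour']

# "[]" matches one character from the set.
char_set = re.findall(r"[bc]at", "bat cat rat")     # ['bat', 'cat']

# "|" matches either alternative.
either = re.findall(r"cat|dog", "cat, dog, bird")   # ['cat', 'dog']

# An unescaped "." matches any character; "\." matches a literal dot only.
unescaped = re.findall(r"3.14", "3.14 and 3x14")    # ['3.14', '3x14']
escaped = re.findall(r"3\.14", "3.14 and 3x14")     # ['3.14']
```

The last two lines show why escaping matters: forgetting the backslash silently widens what the pattern accepts.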

Character Classes

Character classes define a set of characters to match:

Pattern  Description
[abc]    Matches 'a', 'b', or 'c'
[^abc]   Matches any character except 'a', 'b', or 'c'
[a-z]    Matches any lowercase letter
[A-Z]    Matches any uppercase letter
[0-9]    Matches any digit
\d       Matches any digit (same as [0-9])
\D       Matches any non-digit character
\w       Matches any word character (alphanumeric + _)
\W       Matches any non-word character
\s       Matches any whitespace character
\S       Matches any non-whitespace character
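
The character classes above can be tried out with a small sketch, assuming Python's re module and an invented sample string. One engine-specific caveat: for Python text strings, \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is set, so the equivalence with [0-9] holds only for ASCII input.

```python
import re

sample = "Order A7 shipped to Bob_99 on 2024-05-01"

digits = re.findall(r"\d+", sample)       # ['7', '99', '2024', '05', '01']
uppercase = re.findall(r"[A-Z]", sample)  # ['O', 'A', 'B']
words = re.findall(r"\w+", sample)
# ['Order', 'A7', 'shipped', 'to', 'Bob_99', 'on', '2024', '05', '01']

# \D removes everything that is not a digit.
digits_only = re.sub(r"\D", "", "2024-05-01")  # '20240501'

# \W finds the non-word characters; \s+ splits on runs of whitespace.
non_word = re.findall(r"\W", "a-b c!")    # ['-', ' ', '!']
pieces = re.split(r"\s+", "a  b\tc")      # ['a', 'b', 'c']
```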

Quantifiers

Quantifiers define how many instances of a character or group to match:

Pattern  Description
*        Matches 0 or more occurrences
+        Matches 1 or more occurrences
?        Matches 0 or 1 occurrence
{n}      Matches exactly n occurrences
{n,}     Matches n or more occurrences
{n,m}    Matches between n and m occurrences
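
The difference between the quantifiers is easiest to see side by side. The sketch below assumes Python's re module; the \b word boundaries are added only so that each counted quantifier is tested against whole runs of "a".

```python
import re

# *, + and ? applied to the letter "b" after an "a".
zero_or_more = re.findall(r"ab*", "a ab abb")  # ['a', 'ab', 'abb']
one_or_more = re.findall(r"ab+", "a ab abb")   # ['ab', 'abb']
zero_or_one = re.findall(r"ab?", "a ab abb")   # ['a', 'ab', 'ab']

# Counted quantifiers applied to whole runs of "a".
runs = "a aa aaa aaaa"
exactly_two = re.findall(r"\ba{2}\b", runs)     # ['aa']
two_or_more = re.findall(r"\ba{2,}\b", runs)    # ['aa', 'aaa', 'aaaa']
two_to_three = re.findall(r"\ba{2,3}\b", runs)  # ['aa', 'aaa']
```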

Anchors

Anchors are used to match positions within a string:

Pattern  Description
^        Matches the start of a string
$        Matches the end of a string
\b       Matches a word boundary
\B       Matches a non-word boundary
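
Because anchors match positions rather than characters, their effect shows up in where a pattern is allowed to match. A brief sketch, assuming Python's re module and invented sample strings:

```python
import re

# ^ and $ tie the pattern to the start or end of the string.
at_start = bool(re.search(r"^cat", "cat nap"))   # True
not_at_start = bool(re.search(r"^cat", "a cat")) # False
at_end = bool(re.search(r"cat$", "bobcat"))      # True

text = "cat concat cat."

# \b...\b finds "cat" only as a whole word.
whole_words = re.findall(r"\bcat\b", text)       # ['cat', 'cat']

# \B requires "cat" to start in the middle of a word (here: "concat").
inside_word = re.search(r"\Bcat\b", text)
inside_word_start = inside_word.start()          # 7
```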

Special Groups

Capturing Groups

Parentheses () are used to create capturing groups:

  • (abc) captures the sequence "abc".

Non-Capturing Groups

Use (?:...) to group without capturing:

  • (?:abc) matches "abc" but does not store the match.
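
The practical difference between the two group types is what ends up in the match result. A minimal sketch, assuming Python's re module and an invented date-like string:

```python
import re

text = "Released 2024-05"

# Two capturing groups: both parts are stored and numbered from 1.
captured = re.search(r"(\d{4})-(\d{2})", text)
full_match = captured.group(0)   # '2024-05'
year = captured.group(1)         # '2024'
month = captured.group(2)        # '05'

# The year is now grouped without capturing, so only the month is stored.
partial = re.search(r"(?:\d{4})-(\d{2})", text)
stored_groups = partial.groups() # ('05',)
```

Non-capturing groups are handy when parentheses are needed only for grouping, for example to apply a quantifier or an alternation to several characters at once.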

Lookaheads and Lookbehinds

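  • Positive Lookahead: (?=...) succeeds only if the text immediately after the current position matches the enclosed pattern.
  • Negative Lookahead: (?!...) succeeds only if the text immediately after the current position does not match the enclosed pattern.
  • Positive Lookbehind: (?<=...) succeeds only if the text immediately before the current position matches the enclosed pattern.
  • Negative Lookbehind: (?<!...) succeeds only if the text immediately before the current position does not match the enclosed pattern.

Lookaheads and lookbehinds check the surrounding text without including it in the match.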
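
Each of the four lookaround forms can be illustrated with one line. The sketch assumes Python's re module, which requires lookbehind patterns to have a fixed width; the sample strings are invented.

```python
import re

# Positive lookahead: digits that are followed by "kg" (the unit is not included).
weights_kg = re.findall(r"\d+(?=kg)", "5kg 10lb 20kg")           # ['5', '20']

# Negative lookahead: "foo" that is not followed by "bar".
foo_not_bar = re.search(r"foo(?!bar)", "foobar foobaz").start()  # 7

# Positive lookbehind: digits that are preceded by "USD".
usd_amounts = re.findall(r"(?<=USD)\d+", "USD100 EUR200 USD300") # ['100', '300']

# Negative lookbehind: whole numbers that are not preceded by "$".
no_dollar = re.findall(r"(?<!\$)\b\d+", "$5 7 $9 12")            # ['7', '12']
```

In every case the text checked by the lookaround is left out of the result, which makes these constructs useful for extracting a value that is identified by its context.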

Examples

Matching an Email Address

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Explanation:

  • ^[a-zA-Z0-9._%+-]+: Matches the username part.
  • @[a-zA-Z0-9.-]+: Matches the @ sign followed by the domain name.
  • \.[a-zA-Z]{2,}$: Matches the top-level domain (e.g., .com, .org).
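
The email pattern above can be exercised with a short sketch, assuming Python's re module. The helper name is_email and the sample addresses are hypothetical. Because of the ^ and $ anchors, the pattern validates a whole string; to pull an address out of a longer text, the sketch uses the same pattern without the anchors.

```python
import re

EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
# Same pattern without ^ and $, for searching inside longer text.
EMAIL_IN_TEXT = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def is_email(candidate: str) -> bool:
    """Return True if the whole string matches the email pattern."""
    return EMAIL_PATTERN.fullmatch(candidate) is not None


valid = is_email("jane.doe@example.com")       # True
missing_tld = is_email("jane.doe@example")     # False

found = EMAIL_IN_TEXT.search("Contact jane.doe@example.com for details")
extracted = found.group(0)                     # 'jane.doe@example.com'
```

This pattern is a practical approximation: it accepts common address shapes but does not cover everything that email standards permit.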

Validating a Phone Number

^\+?[1-9]\d{1,14}$

Explanation:

  • ^\+?: Matches an optional plus sign.
  • [1-9]: Ensures the number doesn't start with 0.
  • \d{1,14}: Matches 1 to 14 further digits, so the number has 2 to 15 digits in total.
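
A quick way to check the phone pattern against edge cases is sketched below, assuming Python's re module. The helper name is_phone and the sample numbers are hypothetical.

```python
import re

PHONE_PATTERN = re.compile(r"^\+?[1-9]\d{1,14}$")


def is_phone(candidate: str) -> bool:
    """Return True if the whole string matches the phone pattern."""
    return PHONE_PATTERN.fullmatch(candidate) is not None


with_plus = is_phone("+14155552671")        # True
without_plus = is_phone("14155552671")      # True
leading_zero = is_phone("0123456")          # False: starts with 0
too_short = is_phone("+1")                  # False: needs at least 2 digits
too_long = is_phone("1234567890123456")     # False: 16 digits
with_spaces = is_phone("+1 415 555 2671")   # False: spaces are not allowed
```

Since the pattern rejects spaces, dashes, and parentheses, formatted numbers would need to be normalized before validation.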

Finding Duplicates

\b(\w+)\b(?=.*\b\1\b)

Explanation:

  • \b(\w+)\b: Captures a word.
  • (?=.*\b\1\b): Ensures the word appears again later.
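
The duplicate-word pattern can be tried on a sample sentence with the sketch below, which assumes Python's re module. Two behaviours are worth knowing: a word is reported once for each occurrence that has a later repeat, and the comparison is case-sensitive. Since "." does not match a newline by default, repeats on a later line are not detected.

```python
import re

DUPLICATE_PATTERN = re.compile(r"\b(\w+)\b(?=.*\b\1\b)")

text = "the cat saw the dog and the cat"

# findall returns the captured word for every occurrence that repeats later.
hits = DUPLICATE_PATTERN.findall(text)   # ['the', 'cat', 'the']

# Collapse the hits into a sorted list of distinct duplicated words.
duplicated_words = sorted(set(hits))     # ['cat', 'the']
```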

Tools for Testing Regex

  1. Regex101 (https://regex101.com)
  2. RegExr (https://regexr.com)
  3. Debuggex (https://www.debuggex.com)

Conclusion

Regular expressions are powerful tools for text processing and validation. Mastering them takes practice, but once understood they can greatly simplify complex string operations.