What Is RegEx?
A RegEx (regular expression) is a way to define a pattern that can be searched for inside a text or string. These patterns are commonly used for data extraction, validation, and transformation.
While the definition itself is simple, regular expressions have a reputation for being complex and hard to read. That reputation exists for a reason: regex is easy to learn, but hard to master.
There is a big difference between searching for a single digit:
\d
And searching for a phone number that supports multiple real-world formats:
^(\+[1-9]{2})?((?<!0)(\(?[1-9]{2}\)?)?|0[1-9]{2})(?:\s*)\d{4,5}-?\d{4}$
The regex above may look intimidating, but it can correctly match all of the following formats:
- +5538994898261
- 038994898261
- (38)994898261
- 994898261
- (38)99489-8261
- 99489-8261
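To check that claim rather than take it on faith, here is a minimal sketch in Python's `re` module. The pattern is copied unchanged from above; the names `PHONE_RE`, `SAMPLES`, and `is_phone` are illustrative and not part of the original article.

```python
import re

# The phone pattern from the text, unchanged.
PHONE_RE = re.compile(
    r"^(\+[1-9]{2})?((?<!0)(\(?[1-9]{2}\)?)?|0[1-9]{2})(?:\s*)\d{4,5}-?\d{4}$"
)

# The six formats listed in the article.
SAMPLES = [
    "+5538994898261",
    "038994898261",
    "(38)994898261",
    "994898261",
    "(38)99489-8261",
    "99489-8261",
]


def is_phone(text: str) -> bool:
    """Return True when the string matches the phone pattern."""
    return PHONE_RE.match(text) is not None


for sample in SAMPLES:
    print(f"{sample!r}: {is_phone(sample)}")
```

Every sample should print `True`, while a string such as `"12345"` is rejected because it has too few digits.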
This article focuses on the fundamental building blocks of regular expressions. With these concepts, you will be able to read, understand, and gradually write more complex patterns with confidence.
Note on regex flavor: All examples assume PCRE-style syntax, as supported by PHP, by Python's re module for the features used here, and by tools like Regex101. Behavior may vary slightly across languages and runtimes.
How to Use Regular Expressions
Regular expressions are available in many text editors and tools such as Vim, VS Code, and Notepad++. If you prefer experimenting directly in the browser, a great option is Regex101.
Regex101 allows you to:
- Define a regex pattern
- Provide a target text
- See matches highlighted in real time
- Inspect detailed explanations of how the pattern is evaluated
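The same pattern/target/matches loop can be reproduced outside the browser. The sketch below assumes Python's `re` module; the sample text and variable names are made up for illustration.

```python
import re

text = "Order 66 shipped 3 items in 2024"  # the target text
pattern = re.compile(r"\d")                # the regex pattern: a single digit

# Each match object carries the matched text and its position in the target,
# which is roughly what a highlighting tool displays.
for m in pattern.finditer(text):
    print(m.group(), m.span())

matches = [m.group() for m in pattern.finditer(text)]
spans = [m.span() for m in pattern.finditer(text)]
```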
At a fundamental level, regex usage revolves around understanding:
- Character classes
- Quantifiers
- Logical operators
- Common abbreviations and anchors
Let's break each one down.
Character Classes
Character classes define a set of possible characters that can match at a given position. They are declared using square brackets [].
For example:
[A-Z]
Matches a single uppercase letter from A to Z. Ranges follow the order of characters in the Unicode table, where uppercase and lowercase letters occupy separate blocks.
To match both uppercase and lowercase letters, you need to explicitly define both ranges:
[A-Za-z]
Examples
| Symbol | Meaning | Regex Example | What It Matches |
|---|---|---|---|
| [] | One character from the set | [ABCD] | One of the first four uppercase letters |
| - | Range indicator | [a-z] | Any lowercase letter |
| ^ | Negation inside a class | [^a] | Any character except a |
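As a quick illustration of the classes above, the following sketch uses Python's `re.findall` to list every single-character match. The sample strings and variable names are arbitrary.

```python
import re

text = "Regex 101!"

# One uppercase letter at a time.
upper = re.findall(r"[A-Z]", text)            # ['R']

# Both ranges are needed to cover upper and lower case.
letters = re.findall(r"[A-Za-z]", text)       # ['R', 'e', 'g', 'e', 'x']

# An explicit set: only A, B, C, or D.
first_four = re.findall(r"[ABCD]", "ABCDE")   # ['A', 'B', 'C', 'D']

# A negated class: anything except a lowercase a.
not_a = re.findall(r"[^a]", "banana")         # ['b', 'n', 'n']

print(upper, letters, first_four, not_a)
```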
Quantifiers
By default, a character class matches exactly one character. Quantifiers define how many times a pattern may repeat.
For example:
[A-Z]{2}
Matches exactly two uppercase letters, which could represent a Brazilian state abbreviation.
Common Quantifiers
| Symbol | Meaning | Regex Example | Example Match |
|---|---|---|---|
| + | One or more | [ABCD]+ | ABCD |
| * | Zero or more | A* | AAAAAA (or an empty string) |
| {2,5} | Between 2 and 5 | [abcd]{2,5} | abcd |
| ? | Zero or one | A? | A (or an empty string) |
Quantifiers are one of the main reasons regex patterns can quickly become hard to read, especially when combined deeply.
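The boundaries of each quantifier are easier to see when the whole string has to match. This sketch assumes Python's `re.fullmatch`; the helper name `full` is illustrative.

```python
import re


def full(pattern: str, text: str) -> bool:
    """Return True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


results = {
    "two letters": full(r"[A-Z]{2}", "MG"),             # exactly two: True
    "three letters": full(r"[A-Z]{2}", "MGS"),          # one too many: False
    "plus on empty": full(r"[ABCD]+", ""),              # needs at least one: False
    "star on empty": full(r"A*", ""),                   # zero is allowed: True
    "range too short": full(r"[abcd]{2,5}", "a"),       # below the minimum: False
    "range in bounds": full(r"[abcd]{2,5}", "abcd"),    # within 2 to 5: True
    "range too long": full(r"[abcd]{2,5}", "abcdab"),   # above the maximum: False
    "optional absent": full(r"A?", ""),                 # zero occurrences: True
    "optional twice": full(r"A?", "AA"),                # more than one: False
}

for name, ok in results.items():
    print(f"{name}: {ok}")
```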
Logical Operators
Regular expressions support logical composition. The most common operator is the OR operator (|).
Example:
A|B
Matches either A or B.
For simplicity, this article only covers |, but regex engines support additional grouping and conditional constructs.
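One detail worth seeing in practice is that | applies to everything on either side of it, so grouping is often needed to limit its scope. The sketch below assumes Python's `re` module; the sample words are arbitrary.

```python
import re

# A|B against whole strings: either letter alone matches, both together do not.
either = [re.fullmatch(r"A|B", s) is not None for s in ("A", "B", "AB")]

# Without a group, this reads as "starts with cat" OR "ends with dog".
loose = re.search(r"^cat|dog$", "catalog") is not None

# A non-capturing group keeps both alternatives between the anchors.
scoped = re.search(r"^(?:cat|dog)$", "catalog") is not None
scoped_dog = re.search(r"^(?:cat|dog)$", "dog") is not None

print(either, loose, scoped, scoped_dog)
```

Here `loose` is true because "catalog" starts with "cat", while `scoped` is false because the grouped version only accepts the exact words "cat" or "dog".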
Abbreviations and Special Characters
Some character classes are used so frequently that regex engines provide shortcuts for them. There are also special anchors that allow you to constrain where a match may occur.
Common Abbreviations
| Symbol | Meaning | Usage Example | Example Match |
|---|---|---|---|
| \d | A digit; same as [0-9] (some engines also accept other Unicode digits) | \d+ | 12345 |
| \w | Letters, digits, underscore | \w* | AaaaA123_ |
| \s | Whitespace characters | ab\s+cd | ab cd |
| ^ | Start of string (or of each line in multiline mode) | ^A | A at the beginning |
| $ | End of string (or of each line in multiline mode) | A$ | A at the end |
| . | Any character except newline | .+ | ABC0123#$%_ |
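The rows above can be exercised directly. This is a small sketch using Python's `re` module; the variable names are illustrative.

```python
import re

digits = re.fullmatch(r"\d+", "12345") is not None            # True
word = re.fullmatch(r"\w*", "AaaaA123_") is not None          # True
spaced = re.fullmatch(r"ab\s+cd", "ab \t cd") is not None     # True: space, tab, space

# Anchors constrain where the match may occur.
starts_with_a = re.search(r"^A", "ABC") is not None           # True
not_at_start = re.search(r"^A", "BA") is not None             # False
ends_with_a = re.search(r"A$", "BA") is not None              # True

# The dot stops at a newline unless a flag such as re.DOTALL is used.
one_line = re.fullmatch(r".+", "ABC0123#$%_") is not None     # True
two_lines = re.fullmatch(r".+", "line1\nline2") is not None   # False

print(digits, word, spaced, starts_with_a, not_at_start, ends_with_a, one_line, two_lines)
```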
Breaking Down a Complex Regex
Instead of treating complex expressions as unreadable "magic spells", it helps to think of them as composed parts.
Here's a simplified conceptual breakdown of the phone number example:
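One possible way to lay the pattern out block by block is Python's `re.VERBOSE` flag, which ignores whitespace and allows `#` comments inside the pattern. The comments below are an interpretation of each block rather than the author's own annotation, and the names `PHONE_COMPACT` and `PHONE_VERBOSE` are illustrative.

```python
import re

# The original one-line pattern, for comparison.
PHONE_COMPACT = re.compile(
    r"^(\+[1-9]{2})?((?<!0)(\(?[1-9]{2}\)?)?|0[1-9]{2})(?:\s*)\d{4,5}-?\d{4}$"
)

# The same pattern, one block per line.
PHONE_VERBOSE = re.compile(
    r"""
    ^                        # start of string
    (\+[1-9]{2})?            # optional country code: plus sign and two digits 1-9
    (
        (?<!0)               # not directly preceded by a 0
        (\(?[1-9]{2}\)?)?    # optional two-digit area code, parentheses optional
      |
        0[1-9]{2}            # or an area code written with a leading 0
    )
    (?:\s*)                  # optional whitespace
    \d{4,5}                  # first part of the number: 4 or 5 digits
    -?                       # optional hyphen
    \d{4}                    # last four digits
    $                        # end of string
    """,
    re.VERBOSE,
)

for sample in ["+5538994898261", "038994898261", "(38)99489-8261", "12345"]:
    same = bool(PHONE_COMPACT.match(sample)) == bool(PHONE_VERBOSE.match(sample))
    print(f"{sample!r}: both agree = {same}")
```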
Each block can be designed, tested, and reasoned about independently. This mindset is critical for writing regex that is maintainable, not just correct.
Readability and Maintainability
Regex is powerful, but it can easily become a maintenance liability if abused.
As patterns grow more complex, consider:
- Splitting validation logic across multiple steps
- Using named capture groups (when supported)
- Adding comments or verbose mode (if your engine allows it)
- Avoiding regex entirely when a parser or structured approach is clearer
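As a sketch of the first two suggestions, the example below splits phone validation into a cleanup step and a matching step, and uses named capture groups. The pattern `DIGITS_ONLY` and the function `parse_phone` are hypothetical and deliberately narrower than the phone regex shown earlier (no country code, no leading 0); they are meant to show the approach, not to replace that pattern.

```python
import re
from typing import Optional

# Hypothetical simplified pattern: digits only, applied after cleanup.
DIGITS_ONLY = re.compile(r"^(?P<area>[1-9]{2})?(?P<number>\d{8,9})$")


def parse_phone(raw: str) -> Optional[dict]:
    """Return the named parts of a phone number, or None when it does not match."""
    # Step 1: drop formatting characters (whitespace, parentheses, hyphens).
    cleaned = re.sub(r"[\s()-]", "", raw)
    # Step 2: match the cleaned digits against the simpler pattern.
    m = DIGITS_ONLY.match(cleaned)
    return m.groupdict() if m else None


print(parse_phone("(38)99489-8261"))
print(parse_phone("99489-8261"))
print(parse_phone("not a phone"))
```

Named groups such as `area` and `number` document intent at the point of use, and the cleanup step removes most of the optional punctuation the single-pattern version has to account for.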
A regex that only one person understands is a future bug waiting to happen.
Closing Thoughts
This article introduced the fundamentals of regular expressions, but this is only the tip of the iceberg. Regex becomes truly powerful when you learn how to compose, debug, and reason about patterns over time.
As a practical exercise, try improving the phone number regex shown earlier. Make it more readable, more explicit, or more adaptable to your own use case.
Good luck, and welcome to the world of regular expressions.