Quick Tools Online

Regular Expressions: A Practical Beginner's Guide

2025-12-03

Regular expressions (regex) are a pattern language for matching text. They let you describe what a string should look like rather than writing imperative code to check it character by character. A regex can find every phone number in a document, validate that an email address has roughly the right format, or replace every occurrence of a pattern in one operation. Nearly every mainstream programming language and many command-line tools support them, though the exact syntax and feature set vary somewhat between engines.

What Is a Regular Expression?

A regex is a string of characters that describes a pattern. When you apply a regex to a piece of text, the regex engine searches for substrings that match the pattern. The simplest regex is a literal string: the pattern cat matches the substring 'cat' anywhere in the text, including inside 'scattered', 'category', and 'cats'. Most of a regex's power comes from special characters that let you describe variable patterns rather than fixed strings.
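As a small illustration of literal matching, here is a sketch using Python's built-in re module. The choice of Python and the extra word 'dog' are assumptions made for the example; the same idea applies in any language with regex support.

```python
import re

# Sample words; 'dog' is added here only to show a non-match.
words = ["scattered", "category", "cats", "dog"]

# re.search scans the whole string for the first place the pattern matches.
for word in words:
    match = re.search(r"cat", word)
    if match:
        print(f"{word!r}: found at index {match.start()}")
    else:
        print(f"{word!r}: no match")

# Keep only the words that contain the literal substring 'cat'.
matched = [w for w in words if re.search(r"cat", w)]
```

Note that the match in 'scattered' starts in the middle of the word: a literal pattern does not care about word boundaries unless you ask it to.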

The Basic Building Blocks

The dot (.) matches any single character except a newline. A character class in square brackets matches any one character from a set: [aeiou] matches any lowercase vowel, [0-9] matches any digit, [a-zA-Z] matches any ASCII letter. A caret as the first character inside the brackets negates the set: [^0-9] matches any character that is not a digit.

  • . — any character (except newline by default)
  • [abc] — any one of a, b, or c
  • [^abc] — any character except a, b, or c
  • [a-z] — any lowercase letter
  • \d — any digit (shorthand for [0-9])
  • \w — any word character (letters, digits, underscore)
  • \s — any whitespace character (space, tab, newline)
  • \D, \W, \S — the negations of \d, \w, \s
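The building blocks above can be tried out with a short sketch like the following, again assuming Python's re module. The sample string is made up for illustration. One engine-specific caveat: for Python text strings, \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is used, so the "[0-9]" equivalence is only exact in some engines.

```python
import re

# Hypothetical sample text for the demonstration.
text = "Room 42b, floor_3!"

# Character class: any one lowercase vowel.
vowels = re.findall(r"[aeiou]", text)        # ['o', 'o', 'o', 'o']

# Range inside a class: any ASCII digit.
digits = re.findall(r"[0-9]", text)          # ['4', '2', '3']

# Shorthand class: \W is any character that is not a letter, digit, or underscore.
non_word = re.findall(r"\W", text)           # [' ', ',', ' ', '!']

# Dot: any single character (except newline), so R..m matches 'Room'.
dotted = re.findall(r"R..m", text)           # ['Room']

# Negated class: everything that is not a digit, joined back into one string.
without_digits = "".join(re.findall(r"[^0-9]", text))   # 'Room b, floor_!'
```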

Quantifiers and Anchors

Quantifiers specify how many times the preceding element must match. The asterisk (*) means zero or more times. The plus (+) means one or more times. The question mark (?) means zero or one time. Curly braces specify exact counts: {3} means exactly three times, {2,5} means between two and five times, {3,} means three or more times.
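To see the quantifiers side by side, here is a minimal sketch assuming Python's re module. The helper name fully_matches and the test strings are hypothetical; re.fullmatch is used so that each pattern has to account for the whole string.

```python
import re

def fully_matches(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text) pairs chosen to show where each quantifier's limits are.
cases = [
    (r"ab*c", "ac"),         # * allows zero b's
    (r"ab*c", "abbbc"),      # ... or many
    (r"ab+c", "ac"),         # + needs at least one b, so this fails
    (r"colou?r", "color"),   # ? makes the u optional
    (r"colou?r", "colour"),
    (r"\d{3}", "12"),        # {3} needs exactly three digits, so this fails
    (r"\d{2,5}", "12345"),   # {2,5} accepts two to five digits
    (r"\d{2,5}", "123456"),  # six digits is one too many
    (r"\d{3,}", "1234567"),  # {3,} has no upper limit
]

for pattern, text in cases:
    print(f"{pattern!r:12} vs {text!r:10} -> {fully_matches(pattern, text)}")
```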

Anchors do not match characters — they match positions. The caret (^) anchors to the start of the string (or line, with multiline mode). The dollar sign ($) anchors to the end. Without anchors, a pattern can match anywhere in the string. With anchors, it must match at the specified position. ^\d+$ means the entire string consists of one or more digits, with no other characters before or after.
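The effect of anchors is easy to check with a short sketch, assuming Python's re module and made-up sample strings. One Python-specific detail worth knowing: $ also matches just before a single trailing newline, so re.fullmatch is the stricter way to require that the whole string matches.

```python
import re

# Without anchors, the digits can be found anywhere in the string.
unanchored = re.search(r"\d+", "abc123def")
print(unanchored.group())                      # 123

# With ^ and $, the whole string has to be digits.
anchored_mixed = re.search(r"^\d+$", "abc123def")    # None
anchored_digits = re.search(r"^\d+$", "123")         # a match

# In multiline mode, ^ and $ match at the start and end of each line.
lines = "12\nab\n34"
per_line = re.findall(r"^\d+$", lines, flags=re.MULTILINE)   # ['12', '34']
whole_only = re.findall(r"^\d+$", lines)                     # []

# Python's $ tolerates one trailing newline; fullmatch does not.
dollar_with_newline = re.search(r"^\d+$", "123\n") is not None    # True
strict_with_newline = re.fullmatch(r"\d+", "123\n") is not None   # False
```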

Five Practical Examples

  1. Email format check: /^[^\s@]+@[^\s@]+\.[^\s@]+$/ — one or more non-whitespace, non-@ characters, then @, then a domain part with at least one dot.
  2. Extract URLs from text: /https?:\/\/[^\s]+/g — http or https, ://, then non-whitespace characters.
  3. Match a date in YYYY-MM-DD format: /\d{4}-\d{2}-\d{2}/ — four digits, hyphen, two digits, hyphen, two digits.
  4. Find US phone numbers: /\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/ — handles (555) 123-4567, 555-123-4567, 5551234567.
  5. Remove leading/trailing whitespace: /^\s+|\s+$/g with an empty replacement — matches whitespace at start or end.
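The five patterns above are written in the slash-delimited style used by JavaScript and similar tools. As a rough sketch of how they might look in Python's re module (an assumption made for illustration, with invented sample inputs): the slashes are dropped, the forward slashes in the URL pattern no longer need escaping, and the g flag is unnecessary because findall and sub already process every match.

```python
import re

EMAIL = re.compile(r"^[^\s@]+@[^\s@]+\.[^\s@]+$")
URL = re.compile(r"https?://[^\s]+")
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
PHONE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
TRIM = re.compile(r"^\s+|\s+$")

# 1. Email format check (shape only, not deliverability).
email_ok = EMAIL.search("user@example.com") is not None    # True
email_bad = EMAIL.search("user@example") is not None       # False, no dot in domain

# 2. Extract URLs from text.
urls = URL.findall("See https://example.com/docs and http://test.org now")

# 3. Find YYYY-MM-DD shaped dates (this does not check that the date is real).
dates = DATE.findall("Released 2025-12-03, updated 2026-01-15.")

# 4. Find US phone numbers in the three formats listed above.
phones = PHONE.findall("(555) 123-4567, 555-123-4567, 5551234567")

# 5. Remove leading and trailing whitespace.
trimmed = TRIM.sub("", "  hello world \n")

print(urls, dates, phones, repr(trimmed), sep="\n")
```

These patterns check format only. For example, the date pattern would also accept 9999-99-99, so anything that matters should be validated further after the regex has done the rough filtering.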

Common Mistakes to Avoid

Forgetting to escape special characters is the most common mistake. The characters . * + ? ^ $ { } [ ] ( ) | \ all have special meaning. To match one of them literally, put a backslash in front of it: \. matches a literal dot and \( matches a literal opening parenthesis. When a regex is not matching as expected, check whether any special characters in your pattern need to be escaped.
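A quick way to see the escaping problem is to compare an unescaped and an escaped dot. The sketch below assumes Python's re module and invented sample strings; re.escape is Python's standard helper for escaping arbitrary literal text before it goes into a pattern.

```python
import re

text = "3.14 3x14"

# Unescaped dot: matches any character, so '3x14' slips through.
loose = re.findall(r"3.14", text)      # ['3.14', '3x14']

# Escaped dot: matches only a literal '.'.
strict = re.findall(r"3\.14", text)    # ['3.14']

# When the literal text comes from elsewhere (for example user input),
# re.escape adds the backslashes for you.
literal = "(a+b)*c?"
escaped = re.escape(literal)
found = re.search(escaped, "solve (a+b)*c? first")
print(escaped, found.group())
```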

Catastrophic backtracking is a performance trap in complex patterns. A pattern with nested quantifiers, such as ^(a+)+$, applied to a long string that almost matches forces a backtracking engine to try exponentially many ways of splitting the input before giving up. If your regex runs against user-supplied input and performance matters, test it against inputs designed to trigger worst-case backtracking; a long string of 'a' characters followed by a character that does not match is a classic test. Rewrite nested quantifiers as flat patterns whenever possible: ^a+$ accepts the same strings as ^(a+)+$ without the blowup.
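The slowdown can be observed directly with a small timing sketch. It assumes Python's re module, which uses a backtracking engine; the helper name seconds_to_reject and the input sizes are hypothetical, and the sizes are kept deliberately small because the nested pattern's running time roughly doubles with each extra 'a'. Actual timings depend on the machine, so none are shown here.

```python
import re
import time

NESTED = re.compile(r"^(a+)+$")   # nested quantifiers: many ways to split a run of 'a'
FLAT = re.compile(r"^a+$")        # accepts the same strings with a single quantifier

def seconds_to_reject(pattern, n):
    """Time one failed match against n 'a' characters followed by '!'."""
    text = "a" * n + "!"
    start = time.perf_counter()
    result = pattern.search(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing '!' prevents any match
    return elapsed

# Keep n small: each additional 'a' roughly doubles the nested pattern's work.
for n in (10, 14, 18):
    nested_time = seconds_to_reject(NESTED, n)
    flat_time = seconds_to_reject(FLAT, n)
    print(f"n={n:2}  nested={nested_time:.6f}s  flat={flat_time:.6f}s")
```

Engines that do not backtrack (for example those based on finite automata) avoid this problem by design, but many standard library engines, including Python's, do backtrack.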