Swift Regex Deep Dive

Processing strings to extract, manipulate, or search data is a core skill that most software engineers need to acquire. Regular expressions, or regex, are the venerable titan of string processing. Regular expressions are an integral part of many command line tools found in Unix, search engines, word processors, and text editors. The downside of regular expressions is that they can be difficult to create, read and debug. Most programming languages support some form of regex — and now Swift does, too, with Swift Regex.

An exciting and new Regex Builder in Swift Regex gives us a programmatic way of creating regular expressions. This innovative approach to creating often complex regular expressions is sure to be an instant winner with the regex neophyte and aficionado alike. We’ll be digging into Regex Builder to discover its wide-reaching capabilities.

Swift Regex brings first-class support for regular expressions to the Swift language, and it aims to mitigate or outright eliminate many of the downsides of regex. The Swift compiler natively supports regex syntax, which gives us compile time errors, syntax highlighting, and strongly typed captures. Regex syntax in Swift is compatible with Perl, Python, Ruby, Java, NSRegularExpression, and many others.

As of this writing, Swift Regex is still in its open beta period. We’ll be using the Swift Regex found in Xcode 14 beta 6.

Creating a Swift Regular Expression

Swift Regex supports creating a regular expression in several different ways, each of which is useful for different scenarios. First, let’s take a look at creating a compile-time regular expression.

Compile Time Regex


let regex = /\d/


This regular expression will match a single digit. As is typical in regular expression syntax, the expression can be found between two forward slashes; “/<expression>/”. As you can see, this regular expression is a first-class type in Swift and can be assigned directly to a variable. As a Swift type, Xcode will also recognize this regex and provide both compile time checks and syntax highlighting.

Swift has added robust support for regex to a number of common APIs, and using this regular expression couldn’t be easier.


let user = "{name: Shane, id: 123, employee_id: 456}"
let regex = /name: \w+/

if let match = user.firstMatch(of: regex) {
    print(match.output)
}


Which gives us the output:

name: Shane

You may be tempted to use the regular expression [a-zA-Z]+ in order to match a word here. However, \w+ matches Unicode word characters rather than only ASCII letters, so names that contain accented or non-Latin letters are matched as well.
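To make the difference concrete, here is a minimal sketch that runs both patterns over a name containing an accented letter. The firstName(in:asciiOnly:) helper and the sample input are assumptions made for illustration. The literals use the extended #/.../# delimiters, which compile without the bare-slash compiler flag. Keep in mind that \w also accepts digits and underscores.

```swift
// Hypothetical helper: extracts the name with either an ASCII-only
// character class or the Unicode-aware \w class.
func firstName(in text: String, asciiOnly: Bool) -> Substring? {
    if asciiOnly {
        // Stops at the first character outside a-z / A-Z.
        return text.firstMatch(of: #/name: ([a-zA-Z]+)/#)?.output.1
    } else {
        // \w matches Unicode word characters, including "é".
        return text.firstMatch(of: #/name: (\w+)/#)?.output.1
    }
}

let sample = "{name: José, id: 7}"
print(firstName(in: sample, asciiOnly: true) ?? "no match")   // Jos
print(firstName(in: sample, asciiOnly: false) ?? "no match")  // José
```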

Runtime Regex

Swift Regex also supports creating regular expressions at runtime, which is useful for editors, command line tools, and search, just to name a few. The expression syntax is the same as a compile time expression. However, they are created in a slightly different manner.


let regex = try Regex(".*\(searchTerm).*")


This regular expression is looking for a specific search term supplied at runtime. Here the regular expression is created by constructing the Regex type with a String representing the regular expression. The try keyword is used since a Regex can throw an error if the supplied regular expression is invalid.
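Because the pattern is only checked when the code runs, it can help to handle the thrown error explicitly. The sketch below assumes a hypothetical makeSearchRegex(for:) helper that returns nil when the interpolated search term produces an invalid pattern, such as an unbalanced parenthesis.

```swift
// Hypothetical helper: builds the runtime regex and reports invalid patterns.
func makeSearchRegex(for searchTerm: String) -> Regex<AnyRegexOutput>? {
    do {
        // The search term is interpolated into the pattern as-is,
        // so regex metacharacters in it are interpreted, not escaped.
        return try Regex(".*\(searchTerm).*")
    } catch {
        print("Invalid pattern for '\(searchTerm)': \(error)")
        return nil
    }
}

if let regex = makeSearchRegex(for: "Sally"),
   let match = "name: Sally, id: 789".firstMatch(of: regex) {
    print(match.output[0].substring ?? "not found")
}
```

Since the term is not escaped here, a search term typed by a user can change the meaning of the pattern; whether that is acceptable depends on the tool being built.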

We can again apply this regex using the firstMatch(of:) function as in our first example. Note that this time our regex captures the line that matches by wrapping the pattern in a capture group, ( and ).


let users = """
[
{name: Shane, id: 123, employee_id: 456},
{name: Sally, id: 789, employee_id: 101},
{name: Sam, id: 453, employee_id: 999}
]
"""

let idToSearch = 789
let regex = try Regex("(.*id: \(idToSearch).*)")

if let match = users.firstMatch(of: regex) {
    print(match.output[1].substring ?? "not found")
}


Running the example gives us the following output:

{name: Sally, id: 789, employee_id: 101},

We can gain access to any data captured by the regex via output on the returned Regex.Match structure. For a regex created at runtime, output is a type-erased AnyRegexOutput collection whose first item, at index 0, is the entire match. Each capture defined in the regex is found at subsequent indexes.
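As a rough sketch of the two ways to read runtime captures, the helpers below (employeeID(in:) and captureStrings(in:), both hypothetical names) use the same pattern twice. The first asks Regex for a typed tuple output through the as: parameter, and the second walks the type-erased output by index.

```swift
// Hypothetical helper: a runtime regex with a statically typed output.
func employeeID(in text: String) throws -> Substring? {
    // Raw string (#"..."#) so the backslash reaches the regex engine untouched.
    let typed = try Regex(#"employee_id: (\d+)"#, as: (Substring, Substring).self)
    return text.firstMatch(of: typed)?.output.1
}

// Hypothetical helper: the same pattern read through AnyRegexOutput.
func captureStrings(in text: String) throws -> [String] {
    let untyped = try Regex(#"employee_id: (\d+)"#)
    guard let match = text.firstMatch(of: untyped) else { return [] }
    // Index 0 is the entire match; captures follow in order.
    return match.output.map { String($0.substring ?? "") }
}

let user = "{name: Sally, id: 789, employee_id: 101}"
print(try employeeID(in: user) ?? "not found")  // 101
print(try captureStrings(in: user))             // ["employee_id: 101", "101"]
```

The typed form throws at construction time if the requested tuple does not line up with the captures in the pattern, so a mismatch surfaces before any matching happens.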

Regex Builder

The innovative and new Regex Builder introduces a declarative approach to composing regular expressions. This incredible new way of creating regular expressions will open the regex door to anyone who finds them difficult to understand, maintain, or create. Regex builder is Swift’s solution to the drawbacks of the regular expression syntax. Regex builder is a DSL for creating regular expressions with type safety while still allowing for ease of use and expressivity. Simply import the new RegexBuilder module, and you’ll have everything you need to create and compose powerful regular expressions.


import RegexBuilder

let regex = Regex {
    One(.digit)
}


This regular expression will match a single digit and is functionally equivalent to our first compile time regex example, /\d/. Here the standard regex syntax is discarded in favor of a declarative approach. All regex operations, including captures, can be represented with RegexBuilder. In addition, when it makes sense, regex literals can be utilized right within the regex builder. This makes for a very expressive and powerful approach to creating regular expressions.
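The following sketch checks that equivalence and shows a regex literal embedded in a builder. The firstDigit(in:) and userID(in:) helpers are hypothetical names used only for illustration, and the literals use the extended #/.../# delimiters so that no compiler flag is needed.

```swift
import RegexBuilder

// Hypothetical helper: runs the literal and the builder form side by side.
func firstDigit(in text: String) -> (fromLiteral: Substring?, fromBuilder: Substring?) {
    let literal = #/\d/#
    let built = Regex {
        One(.digit)
    }
    return (text.firstMatch(of: literal)?.output,
            text.firstMatch(of: built)?.output)
}

// Hypothetical helper: a builder that mixes a string, a capture, and a literal.
func userID(in text: String) -> Substring? {
    let mixed = Regex {
        "id: "
        Capture {
            #/\d+/#   // a regex literal used as a builder component
        }
    }
    return text.firstMatch(of: mixed)?.output.1
}

print(firstDigit(in: "abc7d8"))                                    // both forms find "7"
print(userID(in: "{name: Shane, id: 123, employee_id: 456}") ?? "") // 123
```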

RegexBuilder Example

Let’s take a deeper look into RegexBuilder. In this example, we will use a regex builder to parse and extract information from a Unix top command.


top -l 1 -o mem -n 8 -stats pid,command,pstate,mem | sed 1,12d


For simplicity, we’ll take the output of running this command and assign it to a Swift variable.


// PID    COMMAND       STATE    MEMORY
let top = """
45360  lldb-rpc-server  sleeping 1719M
2098   Google Chrome    sleeping 1679M-
179    WindowServer     sleeping 1406M
106    BDLDaemon        running  1194M
45346  Xcode            running  878M
0      kernel_task      running  741M
2318   Dropbox          sleeping 4760K+
2028   BBEdit           sleeping 94M
"""


As you can see, the top command outputs structured data that is well suited for use with regular expressions. In our example, we will be extracting the name, status, and size of each item. When considering a Regex Builder it is useful to break a larger regex down into smaller component parts which are then concatenated by the builder. First, I’ll present the code, and then we’ll discuss how it works.


// 1
let separator = /\s{1,}/

// 2
let topMatcher = Regex {
    // 3
    OneOrMore(.digit)

    // 4
    separator

    // 5
    Capture(
        OneOrMore(.any)
    )
    separator

    // 6
    Capture(
        ChoiceOf {
            "running"
            "sleeping"
            "stuck"
            "idle"
            "stopped"
            "halted"
            "zombie"
            "unknown"
        }
    )
    separator

    // 7
    Capture {
        OneOrMore(.digit)

        // /M|K|B/
        ChoiceOf {
            "M"
            "K"
            "B"
        }

        Optionally(/\+|-/)
    }
}

// 8
let matches = top.matches(of: topMatcher)
for match in matches {
    // 9
    let (_, name, status, size) = match.output
    print("\(name) \t\t \(status) \t\t \(size)")
}


Running the example gives us the following output:

lldb-rpc-server  		 sleeping 		 1719M
Google Chrome    		 sleeping 		 1679M-
WindowServer     		 sleeping 		 1406M
BDLDaemon        		 running 		 1194M
Xcode            		 running 		 878M
kernel_task      		 running 		 741M
Dropbox          		 sleeping 		 4760K+
BBEdit           		 sleeping 		 94M

Here is a breakdown of what is happening with the code:

  1. From looking at the data, we can see that each column is separated by one or more spaces. Here we define a compile time regex and assign it to the separator variable. We can then use separator within the regex builder in order to match column separators.
  2. Define the regex builder as a trailing closure to Regex and assign it to topMatcher.
  3. A quantifier that matches one or more occurrences of the specified CharacterClass. CharacterClass is a struct that conforms to RegexComponent and is similar in function to a CharacterSet. The .digit CharacterClass defines a numeric digit.
  4. Matches the column separator.
  5. Captures one or more of any character. Regex captures are returned in the Output of the regex and are indexed based on their position within the regex.
  6. A capture of one item from the enclosed list of items. ChoiceOf is equivalent to a regex alternation (the | regex operator) and cannot have an empty block. You can think of this as matching a single value of an Enum. Use it when there is a known list of values to be matched by the regular expression.
  7. Captures one or more digits followed by one item from the known list of “M”, “K”, or “B” optionally followed by a “+” or “-”. Notice that the Optionally component can take a regex literal as its parameter.
  8. Here we pass our regex as a parameter into the matches(of:) function. We assign the returned value to a variable that will allow us to access the regex output and our captured data.
  9. The output property of each returned match contains the entire match followed by any captured data. Here we are unpacking the output tuple by ignoring the first item (the entire match) and assigning each subsequent item to a variable for easy access.
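The steps above can be condensed into a small function that parses a single row. This is a sketch under a few assumptions: parseTopLine is a hypothetical helper, the separator is written with OneOrMore(.whitespace) instead of a literal, only two process states are listed, and the name is captured reluctantly so the trailing column padding is left to the separator.

```swift
import RegexBuilder

// Hypothetical helper: parses one row of the top output shown above.
func parseTopLine(_ line: String) -> (name: String, state: String, memory: String)? {
    // One or more whitespace characters between columns.
    let separator = OneOrMore(.whitespace)

    let row = Regex {
        OneOrMore(.digit)                 // PID, matched but not captured
        separator
        Capture {
            OneOrMore(.any, .reluctant)   // name, as short as possible
        }
        separator
        Capture {
            ChoiceOf {
                "running"
                "sleeping"
            }
        }
        separator
        Capture {
            OneOrMore(.digit)
            ChoiceOf {
                "M"
                "K"
                "B"
            }
            Optionally {
                ChoiceOf {
                    "+"
                    "-"
                }
            }
        }
    }

    guard let match = line.firstMatch(of: row) else { return nil }
    let (_, name, state, memory) = match.output
    return (String(name), String(state), String(memory))
}

if let row = parseTopLine("2098   Google Chrome    sleeping 1679M-") {
    print(row.name, row.state, row.memory)  // Google Chrome sleeping 1679M-
}
```

Returning String values rather than Substring slices keeps the result independent of the input line, at the cost of a copy per field.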

As you can see from this example, the Swift regex builder is a powerful and expressive way to create regular expressions in Swift. This is just a sampling of its capability. So, next, let’s take a deeper look into the Swift regex builder and its strongly typed captures.

Strongly typed captures in Swift RegexBuilder

One of the more unique and compelling features of the Swift regex builder is its strongly typed captures. Rather than simply returning a string match, Swift Regex can return a strong type representing the captured data.

In some cases, especially for performance reasons, we may want to exit early if a regex capture doesn’t meet some additional criteria. TryCapture allows us to do this. The TryCapture Regex Builder component will pass a captured value to a transform closure where we can perform additional validation or value transformation. When the transform closure returns a value, whether the original or a modified version, it is assumed valid, and the value is captured. However, when the transform closure returns nil, the match is signaled to have failed, which causes the regex engine to backtrack and try an alternative path. The TryCapture transform closure therefore actively participates in the matching process. This is a powerful feature and allows for extremely flexible matching.
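Before the full example, here is a minimal sketch of that behavior in isolation. The largePID(in:) helper and its threshold of 1000 are assumptions for illustration: the transform returns nil for small process IDs, so the regex as a whole fails to match them.

```swift
import RegexBuilder

// Hypothetical helper: returns the bracketed PID only when it is 1000 or greater.
func largePID(in text: String) -> Int? {
    let regex = Regex {
        "["
        TryCapture {
            OneOrMore(.digit)
        } transform: { digits -> Int? in
            // Returning nil rejects this capture, which fails the match here.
            guard let pid = Int(digits), pid >= 1000 else { return nil }
            return pid
        }
        "]"
    }
    // The capture is a non-optional Int; the optional comes from firstMatch.
    return text.firstMatch(of: regex)?.output.1
}

print(largePID(in: "Installer Progress[1211]") ?? -1)  // 1211
print(largePID(in: "syslogd[126]") ?? -1)              // -1
```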

Let’s take a look at an example.

In this example, we will use a regex builder to parse and extract information from a Unix syslog command.

syslog -F '$((Time)(ISO8601)) | $((Level)(str)) | $(Sender)[$(PID)] | $Message'

We’ll take the output of running this command and assign it to a Swift variable.


// TIME                 LEVEL     PROCESS[PID]              MESSAGE
let syslog = """
2022-06-09T14:11:52-05 | Notice | Installer Progress[1211] | Ordering windows out
2022-06-09T14:12:18-05 | Notice | Installer Progress[1211] | Unable to quit because there are connected processes
2022-06-09T14:12:30-05 | Critical | Installer Progress[1211] | Process 648 unexpectedly went away
2022-06-09T14:15:31-05 | Alert | syslogd[126] | ASL Sender Statistics
2022-06-09T14:16:43-05 | Error | MobileDeviceUpdater[3978] | tid:231b - Mux ID not found in mapping dictionary
"""


Next, we use Swift Regex to extract this data, including the timestamp and a strongly typed severity level, while filtering out processes with an id of less than 1000.

let separator = " | "

let regex = Regex {
    // 1
    Capture(.iso8601(assuming: .current, dateSeparator: .dash))
    // 2
    "-"
    OneOrMore(.digit)
    separator

    // 3
    TryCapture {
        ChoiceOf {
            "Debug"
            "Informational"
            "Notice"
            "Warning"
            "Error"
            "Critical"
            "Alert"
            "Emergency"
        }
    } transform: {
        // 4
        SeverityLevel(rawValue: String($0))
    }
    separator

    // 5
    OneOrMore(.any, .reluctant)
    "["
    Capture {
        OneOrMore(.digit)
    } transform: { substring -> Int? in
        // 6
        let pid = Int(String(substring))
        if let pid, pid >= 1000 {
            return pid
        }

        return nil
    }
    "]"
    separator

    OneOrMore(.any)
}

// 7
let matches = syslog.matches(of: regex)

print(type(of: matches[0].output))

for match in matches {
    let (_, date, status, pid) = match.output
    // 8
    if let pid {
        print("\(date) \(status) \(pid)")
    }
}

// 9
enum SeverityLevel: String {
    case debug = "Debug"
    case info = "Informational"
    case notice = "Notice"
    case warning = "Warning"
    case error = "Error"
    case critical = "Critical"
    case alert = "Alert"
    case emergency = "Emergency"
}

Running the example gives us the following output:

(Substring, Date, SeverityLevel, Optional<Int>)
2022-06-09 19:11:52 +0000 notice 1211
2022-06-09 19:12:18 +0000 notice 1211
2022-06-09 19:12:30 +0000 critical 1211
2022-06-09 19:16:43 +0000 error 3978

Here’s what is happening with the syslog example.

  1. Here, we are capturing an ISO 8601 formatted date. The iso8601 static function (new in iOS 16) is called on the Date.ISO8601FormatStyle type. This function constructs and returns a date formatter for use by the Swift Regex Capture in converting the captured string into a Date. This Date is then used in the Capture’s output with no further string-to-date conversion necessary.
  2. After the ISO 8601 formatted date, we have a UTC offset timezone component matched by the dash and one or more digits.
  3. Here TryCapture is being used to transform a capture’s type. It will convert the matched value into a non-optional type or fail the match.
  4. The transform closure will be called upon matching the capture. It is passed the matched substring value, which it can then transform to the desired type. In this example, the transform is converting the matched substring into a SeverityLevel enum. The corresponding regex output for this capture becomes the closure’s return type. In the case of a transform on TryCapture this type will be non-optional. For a Capture transform, the output type is exactly what the closure returns, so a closure that returns an optional, as in #6, yields an optional capture.
  5. Swift Regex defines several repetitions, which are OneOrMore, ZeroOrMore, Optionally, and Repeat. The .reluctant repetition behavior will match as few occurrences as possible. The default repetition behavior for all repetitions is .eager.
  6. A transforming capture will transform the matching substring of digits into an optional Int value. If this value is 1000 or greater, then it is returned from the transform and becomes the capture’s output value. Otherwise, it returns nil for this capture’s output.
  7. Assign the matches of the regex to the matches variable.
  8. If the pid capture is not nil then print out the data.
  9. Defines the SeverityLevel enum type, which is used by the transforming capture defined in #3.
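The difference between the .eager and .reluctant behaviors mentioned in step 5 is easiest to see on a short input. The sketch below assumes a hypothetical textBeforeBracket(in:behavior:) helper and captures everything that comes before an opening bracket, using whichever repetition behavior is passed in.

```swift
import RegexBuilder

// Hypothetical helper: captures the text before a "[" with the given behavior.
func textBeforeBracket(in text: String, behavior: RegexRepetitionBehavior) -> Substring? {
    let regex = Regex {
        Capture {
            OneOrMore(.any, behavior)
        }
        "["
    }
    return text.firstMatch(of: regex)?.output.1
}

// Eager takes as much as it can and backs off to the last "[".
print(textBeforeBracket(in: "a[1][2]", behavior: .eager) ?? "")      // a[1]
// Reluctant takes as little as it can and stops at the first "[".
print(textBeforeBracket(in: "a[1][2]", behavior: .reluctant) ?? "")  // a
```

In the syslog regex, the reluctant form is what lets the process name stop at the first bracket instead of swallowing brackets that might appear later in the message.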

Conclusion

Swift Regex is a welcome and powerful addition to Swift. Regex Builder is a go-to solution for all but the simplest of regex needs, and mastering it will be time well spent. The declarative approach of Regex Builder, coupled with compile time regex support that gives us compile time errors, syntax highlighting, and strongly typed captures, makes for a potent combination. A lot of thought has gone into the design of Swift Regex, and it shows. Swift Regex will make a worthy addition to your development toolbox, and taking the time to learn it will pay dividends.

Resources

  1. Meet Swift Regex – WWDC 2022
  2. Swift Regex: Beyond the basics – WWDC 2022
  3. Swift Evolution Proposal 0351 – Regex builder DSL
  4. Swift Evolution Proposal 0354 – Regex Literals
  5. Swift Evolution Proposal 0355 – Regex Syntax and Run-time Construction
  6. Swift Evolution Proposal 0357 – Regex-powered string processing algorithms
  7. Swift Regex DSL Builder

The post Swift Regex Deep Dive appeared first on Big Nerd Ranch.