Rediscovering CVE-2023-36617 (ruby ReDoS) with fuzzing
summary
Two ReDoS bugs existed in the Ruby uri module. Both cause the program to hang and eventually raise an error: URI::InvalidComponentError for the first bug and URI::InvalidURIError for the second.
They affect v0.12.1 and earlier of the gem; v0.12.2 contains the fix.
The fix commit includes tests that help explain what was going on.
The first test:
def test_rfc3986_port_check
  pre = ->(length) {"\t" * length + "a"}
  uri = URI.parse("http://my.example.com")
  assert_linear_performance((1..5).map {|i| 10**i}, pre: pre) do |port|
    assert_raise(URI::InvalidComponentError) do
      uri.port = port
    end
  end
end
It checks that the running time grows roughly linearly with the input length, here 10 to 100,000 tab characters followed by an "a".
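As a rough illustration of what such a check does, the sketch below times a block at several input sizes and tests whether the timings stay under a linear bound. It is not the implementation of assert_linear_performance; the helper names time_at_sizes and roughly_linear?, the slack factor and the noise floor are all assumptions made up for this example.

```ruby
require "benchmark"

# Hypothetical helper: time the block once per input size.
# Returns [[size, seconds], ...].
def time_at_sizes(sizes)
  sizes.map { |n| [n, Benchmark.realtime { yield n }] }
end

# Hypothetical helper: true if every sample stays below a line through the
# first sample, scaled by `slack`, plus a small floor for timer noise.
def roughly_linear?(samples, slack: 10.0, floor: 0.01)
  base_n, base_t = samples.first
  samples.all? { |n, t| t <= slack * base_t * n / base_n + floor }
end

samples = time_at_sizes([10, 100, 1_000]) { |n| ("\t" * n + "a").length }
puts roughly_linear?(samples)
```

A quadratic operation would show up as timings that grow about 100x for every 10x increase in size, which quickly exceeds any fixed slack factor.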
The root cause is a greedy regex match that was first introduced in commit 3e832346:
commit 3e832346a42d9412a0f1df0489ed1365ac8c195c
Author: naruse <naruse@b2dd03c8-39d4-4d8f-98ff-823fe69b080e>
Date: Mon Jun 23 03:18:51 2014 +0000
* lib/uri/generic.rb (check_port): allow strings for port= as
described in rdoc.
* lib/uri/rfc3986_parser.rb (regexp): implementation detail of above.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@46504 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
diff --git a/lib/uri/rfc3986_parser.rb b/lib/uri/rfc3986_parser.rb
index cd95ab8..aa74e11 100644
--- a/lib/uri/rfc3986_parser.rb
+++ b/lib/uri/rfc3986_parser.rb
@@ -84,7 +84,7 @@ module URI
QUERY: /\A(?:%\h\h|[!$&-.0-;=@-Z_a-z~]|[\/?])*\z/,
FRAGMENT: /\A(?:%\h\h|[!$&-.0-;=@-Z_a-z~]|[\/?])*\z/,
OPAQUE: nil,
- PORT: nil,
+ PORT: /\A[\x09\x0a\x0c\x0d ]*\d*[\x09\x0a\x0c\x0d ]*\z/,
}
end
The second bug can be triggered through both URI::RFC2396_Parser#parse and URI::RFC2396_Parser#split.
Second test:
def test_rfc2822_parse_relative_uri
  pre = ->(length) {
    " " * length + "\0"
  }
  parser = URI::RFC2396_Parser.new
  assert_linear_performance((1..5).map {|i| 10**i}, pre: pre) do |uri|
    assert_raise(URI::InvalidURIError) do
      parser.split(uri)
    end
  end
end
This one was introduced in commit d8c414e9:
commit d8c414e99dda6cbb0bf91b9ad5f6a95321e00435
Author: naruse <naruse@b2dd03c8-39d4-4d8f-98ff-823fe69b080e>
Date: Sun Jun 22 00:22:19 2014 +0000
...
diff --git a/lib/uri/rfc2396_parser.rb b/lib/uri/rfc2396_parser.rb
new file mode 100644
index 0000000..50e3ae6
--- /dev/null
+++ b/lib/uri/rfc2396_parser.rb
@@ -0,0 +1,543 @@
...
+ ret[:ABS_URI] = Regexp.new('\A\s*' + pattern[:X_ABS_URI] + '\s*\z', Regexp::EXTENDED)
+ ret[:REL_URI] = Regexp.new('\A\s*' + pattern[:X_REL_URI] + '\s*\z', Regexp::EXTENDED)
...
The bug dates from 2014, so it stayed hidden for a long time.
It does seem hard to trigger, though: the bug lives in URI::RFC2396_Parser and the default parser is URI::RFC3986_Parser. The URI module does have some functions that use RFC2396, but they are marked as deprecated.
def self.extract(str, schemes = nil, &block)
  warn "URI.extract is obsolete", uplevel: 1 if $VERBOSE
  DEFAULT_PARSER.extract(str, schemes, &block)
end
def self.regexp(schemes = nil)
  warn "URI.regexp is obsolete", uplevel: 1 if $VERBOSE
  DEFAULT_PARSER.make_regexp(schemes)
end
Also, I couldn't find any path that would lead from RFC3986 to RFC2396.
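For reference, reaching the second bug means asking for the RFC2396 parser explicitly, as the test above does. A minimal sketch with a harmless input (the example URL is made up):

```ruby
require "uri"

# URI.parse is the everyday entry point and does not need a parser object.
default_uri = URI.parse("http://example.com/a?b#c")

# The RFC2396 parser has to be instantiated by hand.
legacy = URI::RFC2396_Parser.new
parts = legacy.split("http://example.com/a?b#c")

puts default_uri.host
p parts
```

The array returned by split holds the nine components in the order scheme, userinfo, host, port, registry, path, opaque, query, fragment.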
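The core of the problem is something called catastrophic backtracking. If quantified expressions (e.g. [\x09\x0a\x0c\x0d ]*) appear more than once in the same regex and are not mutually exclusive, then any time a backtrack happens the regex engine has to process the same characters multiple times. For the port regex and an input of n tabs followed by an invalid character, the engine tries every way of splitting the tabs between the two whitespace groups before giving up, which takes on the order of n^2 steps.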
A better and more complete explanation is at: explosion explanation
Here is a neat visualization of what it looks like: explosion
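To make the cost concrete, here is a toy backtracking matcher, written only for this illustration, that handles patterns of the shape class* class* class* followed by end-of-string and counts how many times it is called. It is not how Ruby's regex engine is implemented, and the names WS, DIGITS and count_steps are made up.

```ruby
WS     = "\t\n\f\r "
DIGITS = "0123456789"

# Toy greedy matcher for: \A classes[0]* classes[1]* ... \z
# Returns [matched?, number_of_calls].
def count_steps(classes, str)
  steps = 0
  try = lambda do |ci, pos|
    steps += 1
    return pos == str.length if ci == classes.length

    # Greedy: consume as many characters of this class as possible...
    max = pos
    max += 1 while max < str.length && classes[ci].include?(str[max])
    # ...then give them back one at a time when the rest fails.
    max.downto(pos) { |stop| return true if try.call(ci + 1, stop) }
    false
  end
  [try.call(0, 0), steps]
end

[100, 200, 400].each do |n|
  matched, steps = count_steps([WS, DIGITS, WS], "\t" * n + "a")
  puts "n=#{n} matched=#{matched} steps=#{steps}"
end
```

Each doubling of n roughly quadruples the step count, which is the quadratic growth the tests in the fix commit are guarding against.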
To fix that, we force the first quantifier not to backtrack by making it possessive:
A++A+B$
  ^
A possessive quantifier tells the engine not to give back anything it has already matched.
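Ruby's regex syntax supports this directly: *+ and ++ are the possessive forms of * and +. A small sketch of the behavioural difference follows; the port-like patterns are written for this example and are not copied from the fix.

```ruby
# Greedy: a* first takes all three a's, then gives one back so the final
# "a" can match.
greedy = /\Aa*a\z/
# Possessive: a*+ keeps everything it took, so the final "a" never matches.
possessive = /\Aa*+a\z/

puts greedy.match?("aaa")      # true
puts possessive.match?("aaa")  # false

# On a port-like pattern the possessive form accepts the same valid input,
# but on a mismatch it fails without retrying shorter whitespace runs.
port_greedy     = /\A[\t\n\f\r ]*\d*[\t\n\f\r ]*\z/
port_possessive = /\A[\t\n\f\r ]*+\d*[\t\n\f\r ]*+\z/

[" 8080 ", "\t\t\ta"].each do |input|
  puts "#{input.inspect}: #{port_greedy.match?(input)} #{port_possessive.match?(input)}"
end
```

The possessive form is only safe when what follows the quantifier can never need characters from it, which holds here because whitespace and digits do not overlap.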
There was also a fix to the regex used to parse URIs that seems to have the same problem as the other bugs:
- RFC3986_URI = /\A(?<URI>(?<scheme>[A-Za-z][+\-.0-9A-Za-z]*):(?<hier-part>\/\/(?<authority>(?:(?<userinfo>(?:%\h\h|[!$&-.0-;=A-Z_a-z~])*)@)?(?<host>(?<IP-literal>\[(?:(?<IPv6address>(?:\h{1,4}:){6}(?<ls32>\h{1,4}:\h{1,4}|(?<IPv4address>(?<dec-octet>[1-9]\d|1\d{2}|2[0-4]\d|25[0-5]|\d)\.\g<dec-octet>\.\g<dec-octet>\.\g<dec-octet>))|::(?:\h{1,4}:){5}\g<ls32>|\h{1,4}?::(?:\h{1,4}:){4}\g<ls32>|(?:(?:\h{1,4}:)?\h{1,4})?::(?:\h{1,4}:){3}\g<ls32>|(?:(?:\h{1,4}:){,2}\h{1,4})?::(?:\h{1,4}:){2}\g<ls32>|(?:(?:\h{1,4}:){,3}\h{1,4})?::\h{1,4}:\g<ls32>|(?:(?:\h{1,4}:){,4}\h{1,4})?::\g<ls32>|(?:(?:\h{1,4}:){,5}\h{1,4})?::\h{1,4}|(?:(?:\h{1,4}:){,6}\h{1,4})?::)|(?<IPvFuture>v\h+\.[!$&-.0-;=A-Z_a-z~]+))\])|\g<IPv4address>|(?<reg-name>(?:%\h\h|[!$&-.0-9;=A-Z_a-z~])*))(?::(?<port>\d*))?)(?<path-abempty>(?:\/(?<segment>(?:%\h\h|[!$&-.0-;=@-Z_a-z~])*))*)|(?<path-absolute>\/(?:(?<segment-nz>(?:%\h\h|[!$&-.0-;=@-Z_a-z~])+)(?:\/\g<segment>)*)?)|(?<path-rootless>\g<segment-nz>(?:\/\g<segment>)*)|(?<path-empty>))(?:\?(?<query>[^#]*))?(?:\#(?<fragment>(?:%\h\h|[!$&-.0-;=@-Z_a-z~\/?])*))?)\z/
- RFC3986_relative_ref = /\A(?<relative-ref>(?<relative-part>\/\/(?<authority>(?:(?<userinfo>(?:%\h\h|[!$&-.0-;=A-Z_a-z~])*)@)?(?<host>(?<IP-literal>\[(?:(?<IPv6address>(?:\h{1,4}:){6}(?<ls32>\h{1,4}:\h{1,4}|(?<IPv4address>(?<dec-octet>[1-9]\d|1\d{2}|2[0-4]\d|25[0-5]|\d)\.\g<dec-octet>\.\g<dec-octet>\.\g<dec-octet>))|::(?:\h{1,4}:){5}\g<ls32>|\h{1,4}?::(?:\h{1,4}:){4}\g<ls32>|(?:(?:\h{1,4}:){,1}\h{1,4})?::(?:\h{1,4}:){3}\g<ls32>|(?:(?:\h{1,4}:){,2}\h{1,4})?::(?:\h{1,4}:){2}\g<ls32>|(?:(?:\h{1,4}:){,3}\h{1,4})?::\h{1,4}:\g<ls32>|(?:(?:\h{1,4}:){,4}\h{1,4})?::\g<ls32>|(?:(?:\h{1,4}:){,5}\h{1,4})?::\h{1,4}|(?:(?:\h{1,4}:){,6}\h{1,4})?::)|(?<IPvFuture>v\h+\.[!$&-.0-;=A-Z_a-z~]+))\])|\g<IPv4address>|(?<reg-name>(?:%\h\h|[!$&-.0-9;=A-Z_a-z~])+))?(?::(?<port>\d*))?)(?<path-abempty>(?:\/(?<segment>(?:%\h\h|[!$&-.0-;=@-Z_a-z~])*))*)|(?<path-absolute>\/(?:(?<segment-nz>(?:%\h\h|[!$&-.0-;=@-Z_a-z~])+)(?:\/\g<segment>)*)?)|(?<path-noscheme>(?<segment-nz-nc>(?:%\h\h|[!$&-.0-9;=@-Z_a-z~])+)(?:\/\g<segment>)*)|(?<path-empty>))(?:\?(?<query>[^#]*))?(?:\#(?<fragment>(?:%\h\h|[!$&-.0-;=@-Z_a-z~\/?])*))?)\z/
+ RFC3986_URI = /\A(?<URI>(?<scheme>[A-Za-z][+\-.0-9A-Za-z]*+):(?<hier-part>\/\/(?<authority>(?:(?<userinfo>(?:%\h\h|[!$&-.0-;=A-Z_a-z~])*+)@)?(?<host>(?<IP-literal>\[(?:(?<IPv6address>(?:\h{1,4}:){6}(?<ls32>\h{1,4}:\h{1,4}|(?<IPv4address>(?<dec-octet>[1-9]\d|1\d{2}|2[0-4]\d|25[0-5]|\d)\.\g<dec-octet>\.\g<dec-octet>\.\g<dec-octet>))|::(?:\h{1,4}:){5}\g<ls32>|\h{1,4}?::(?:\h{1,4}:){4}\g<ls32>|(?:(?:\h{1,4}:)?\h{1,4})?::(?:\h{1,4}:){3}\g<ls32>|(?:(?:\h{1,4}:){,2}\h{1,4})?::(?:\h{1,4}:){2}\g<ls32>|(?:(?:\h{1,4}:){,3}\h{1,4})?::\h{1,4}:\g<ls32>|(?:(?:\h{1,4}:){,4}\h{1,4})?::\g<ls32>|(?:(?:\h{1,4}:){,5}\h{1,4})?::\h{1,4}|(?:(?:\h{1,4}:){,6}\h{1,4})?::)|(?<IPvFuture>v\h++\.[!$&-.0-;=A-Z_a-z~]++))\])|\g<IPv4address>|(?<reg-name>(?:%\h\h|[!$&-.0-9;=A-Z_a-z~])*+))(?::(?<port>\d*+))?)(?<path-abempty>(?:\/(?<segment>(?:%\h\h|[!$&-.0-;=@-Z_a-z~])*+))*+)|(?<path-absolute>\/(?:(?<segment-nz>(?:%\h\h|[!$&-.0-;=@-Z_a-z~])++)(?:\/\g<segment>)*+)?)|(?<path-rootless>\g<segment-nz>(?:\/\g<segment>)*+)|(?<path-empty>))(?:\?(?<query>[^#]*+))?(?:\#(?<fragment>(?:%\h\h|[!$&-.0-;=@-Z_a-z~\/?])*+))?)\z/
+ RFC3986_relative_ref = /\A(?<relative-ref>(?<relative-part>\/\/(?<authority>(?:(?<userinfo>(?:%\h\h|[!$&-.0-;=A-Z_a-z~])*+)@)?(?<host>(?<IP-literal>\[(?:(?<IPv6address>(?:\h{1,4}:){6}(?<ls32>\h{1,4}:\h{1,4}|(?<IPv4address>(?<dec-octet>[1-9]\d|1\d{2}|2[0-4]\d|25[0-5]|\d)\.\g<dec-octet>\.\g<dec-octet>\.\g<dec-octet>))|::(?:\h{1,4}:){5}\g<ls32>|\h{1,4}?::(?:\h{1,4}:){4}\g<ls32>|(?:(?:\h{1,4}:){,1}\h{1,4})?::(?:\h{1,4}:){3}\g<ls32>|(?:(?:\h{1,4}:){,2}\h{1,4})?::(?:\h{1,4}:){2}\g<ls32>|(?:(?:\h{1,4}:){,3}\h{1,4})?::\h{1,4}:\g<ls32>|(?:(?:\h{1,4}:){,4}\h{1,4})?::\g<ls32>|(?:(?:\h{1,4}:){,5}\h{1,4})?::\h{1,4}|(?:(?:\h{1,4}:){,6}\h{1,4})?::)|(?<IPvFuture>v\h++\.[!$&-.0-;=A-Z_a-z~]++))\])|\g<IPv4address>|(?<reg-name>(?:%\h\h|[!$&-.0-9;=A-Z_a-z~])++))?(?::(?<port>\d*+))?)(?<path-abempty>(?:\/(?<segment>(?:%\h\h|[!$&-.0-;=@-Z_a-z~])*+))*+)|(?<path-absolute>\/(?:(?<segment-nz>(?:%\h\h|[!$&-.0-;=@-Z_a-z~])++)(?:\/\g<segment>)*+)?)|(?<path-noscheme>(?<segment-nz-nc>(?:%\h\h|[!$&-.0-9;=@-Z_a-z~])++)(?:\/\g<segment>)*+)|(?<path-empty>))(?:\?(?<query>[^#]*+))?(?:\#(?<fragment>(?:%\h\h|[!$&-.0-;=@-Z_a-z~\/?])*+))?)\z/
After reading about it, it seems surprising that no one found this bug earlier. It is a textbook catastrophic backtracking regex.
I was also going to try and write a high-effort summary of the inner workings of regex and explosions but fuzzing is just way more interesting.
fuzzing
I'm using afl-ruby, which is built on afl++. Since afl++ is a coverage-guided fuzzer, afl-ruby uses Ruby's TracePoint class to feed it coverage.
So the first thing I tried was adding a trimmed-down version of the crash input to the corpus to see if the fuzzer would find the hang.
"\t" * 10000 + "\0"
AFL_SKIP_BIN_CHECK=1 afl-fuzz -a text -i input -o output -- $(which ruby) fuzz.rb
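The post does not include fuzz.rb. As a rough idea of what the target could look like, the sketch below feeds one input to both code paths and treats the expected URI errors as uninteresting. The afl-ruby setup calls are deliberately left out, and fuzz_one is a made-up name.

```ruby
require "uri"

# Hypothetical fuzz target: run one input through both vulnerable paths.
# URI::Error covers InvalidComponentError and InvalidURIError, which are
# expected for malformed input; a hang is what the fuzzer should flag.
def fuzz_one(data)
  uri = URI.parse("http://my.example.com")
  begin
    uri.port = data
  rescue URI::Error
    # rejected port, nothing to report
  end
  begin
    URI::RFC2396_Parser.new.split(data)
  rescue URI::Error
    # rejected URI, nothing to report
  end
  :done
end

# When driven by a fuzzer the input would come from stdin, e.g.:
#   fuzz_one($stdin.read)
puts fuzz_one(" 80 ")
```

Swallowing the expected exceptions keeps the fuzzer from reporting every malformed input as a crash, so only timeouts remain as a signal.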
After a couple of hours it had found nothing, and the corpus had stopped growing after the first minute. I tried tweaking the options, but nothing changed.
So I tried fuzzing the two different versions of the gem to see if I could find anything interesting. And I did find some quirks between the two versions (v0.12.1, v0.12.2).
The first one was:
legend = [:scheme, :userinfo, :host, :port, :registry, :path, :opaque, :query, :fragment]
component_ary = [nil, ":", nil, nil, nil, "/:", nil, nil, nil] # v0.12.1
component_ary = [nil, ":", "", nil, nil, "/:", nil, nil, nil] # v0.12.2
So the host is now "" instead of nil, but I soon found out that this was expected:
commit 81263c9e94bd67ca01deee238842a88c2c8885f3
Author: NARUSE, Yui <naruse@airemix.jp>
Date: Sun Jan 13 08:58:00 2019 +0900
URI.parse should set empty string in host instead of nil
ruby/ruby@dd5118f8524c425894d4716b787837ad7380bb0d
Very helpful commit message
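Comparing component arrays by eye gets tedious across many fuzzer outputs. A small helper, sketched here with the made-up name changed_components, pairs each slot with its label and keeps only the ones that differ:

```ruby
LEGEND = [:scheme, :userinfo, :host, :port, :registry,
          :path, :opaque, :query, :fragment].freeze

# Hypothetical helper: map each changed component to its [old, new] values.
def changed_components(old, new)
  LEGEND.zip(old, new)
        .reject { |_name, before, after| before == after }
        .to_h { |name, before, after| [name, [before, after]] }
end

old_ary = [nil, ":", nil, nil, nil, "/:", nil, nil, nil] # v0.12.1
new_ary = [nil, ":", "",  nil, nil, "/:", nil, nil, nil] # v0.12.2

p changed_components(old_ary, new_ary)
```

For the two arrays above, the only entry in the result is the host, going from nil to "".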
The second one: to_s now adds double slashes, a change directly related to the commit above.
input = "//:@:/:"
old = ":@/:" # v0.12.1
new = "//:@/:" # v0.12.2
# In v0.12.1 host was nil, so this check was false; with host = "" it passes.
if @host || %w[file postgres].include?(@scheme)
  str << '//'
end
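The behaviour change comes down to Ruby truthiness: nil is falsy, but an empty string is truthy. A stripped-down version of the check, with the made-up helper name authority_prefix:

```ruby
# Hypothetical reduction of the check quoted above.
def authority_prefix(host, scheme)
  (host || %w[file postgres].include?(scheme)) ? "//" : ""
end

p authority_prefix(nil, nil)    # ""
p authority_prefix("", nil)     # "//"
p authority_prefix(nil, "file") # "//"
```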
The third difference was:
input = "//::"
old_parser = nil # Bad URI exception
parser = "//::" # component_ary = [path]
input = "//p:x"
old_parser = nil # Bad URI exception
parser = "//p:x" # component_ary = [path]
input = "//@@?."
old_parser = nil # Bad URI exception
parser = "//@@" # component_ary = [path, query]
input = "//mmai:f#tZ"
old_parser = nil # Bad URI exception
parser = "//mmai:f" # component_ary = [path, fragment]
These are kinda interesting but don't seem to have any security implications.
grammar mutations
After a bit, I moved on from the diffing and tried out grammar mutators, targeting the port regex:
/\A[\x09\x0a\x0c\x0d ]*\d*[\x09\x0a\x0c\x0d ]*\z/
The grammar that I came up with:
{
"<port>": [
["<spaces>", "<digits>", "<spaces>"]
],
"<digits>": [["<digit-1>"]],
"<digit-1>": [[], ["<digit>"], ["<digit>", "<digit-1>"]],
"<digit>": [["0"], ["1"], ["2"], ["3"], ["4"], ["5"], ["6"], ["7"], ["8"]],
"<spaces>": [["<space-1>"]],
"<space-1>": [[], ["<space>"], ["<space>", "<space-1>"]],
"<space>": [["\u0009"], ["\u000a"], ["\u000c"], ["\u000d"], ["\u0000"]]
}
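To see what inputs this grammar describes, a tiny random expander is enough. This is a standalone sketch, not the grammar mutator used with afl++; expand is a made-up helper and the seed is arbitrary.

```ruby
GRAMMAR = {
  "<port>"    => [["<spaces>", "<digits>", "<spaces>"]],
  "<digits>"  => [["<digit-1>"]],
  "<digit-1>" => [[], ["<digit>"], ["<digit>", "<digit-1>"]],
  "<digit>"   => ("0".."8").map { |d| [d] },
  "<spaces>"  => [["<space-1>"]],
  "<space-1>" => [[], ["<space>"], ["<space>", "<space-1>"]],
  "<space>"   => [["\u0009"], ["\u000a"], ["\u000c"], ["\u000d"], ["\u0000"]]
}.freeze

# Hypothetical helper: pick a random rule for a non-terminal and expand it
# recursively; anything that is not a key of the grammar is a terminal.
def expand(symbol, rng)
  rules = GRAMMAR[symbol]
  return symbol unless rules

  rules.sample(random: rng).map { |part| expand(part, rng) }.join
end

rng = Random.new(1234)
5.times { p expand("<port>", rng) }
```

Every generated string has the shape spaces, digits, spaces, so the mutator keeps the input close to what the port regex expects while varying the run lengths.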
This looked like the better option since it generated the biggest corpus, but it also got stuck after a while.
After a while I realized that, since afl-ruby uses TracePoint, coverage won't reach Ruby internals like the regex engine: TracePoint only records that a C function was called, not what happens inside it.
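A quick way to see this limitation, independent of afl-ruby, is to trace :c_call events around a regex match and note that the trace looks the same however much work the engine does. The variable names are made up for this example.

```ruby
calls = []
trace = TracePoint.new(:c_call) { |tp| calls << tp.method_id }

re = /\A[\t\n\f\r ]*\d*[\t\n\f\r ]*\z/
short_input = "\t" * 10 + "a"
long_input  = "\t" * 1_000 + "a"

trace.enable do
  re.match?(short_input)
  re.match?(long_input)
end

# One :match? event per call, whatever happened inside the regex engine.
puts calls.count(:match?)
```

From the fuzzer's point of view both inputs exercise the same code, so nothing rewards it for making the whitespace run longer.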
So I guess I can try different mutations and trust the fuzzer?
it found it
After about 9 hours of fuzzing, 6 of those without any corpus growth, the grammar mutator run found the hang. Could afl also find it without the grammar mutator?
Yes: after about 20 minutes it found the hang.
The command used:
AFL_SKIP_BIN_CHECK=1 afl-fuzz -i input -o output_default -t 500 -P exploit -- $(which ruby) fuzz.rb
The input corpus was a single file containing a single whitespace character.
I think the -P exploit option is what changed the result here.
Conclusion
This bug is interesting for fuzzing: afl found it in 20 minutes using the -P exploit flag. Of course, I already knew that the bug existed and how to find it, but afl also found it with almost nothing in the corpus (just a file with a single whitespace character in it).
On the other hand, the bug seems kinda obvious if you understand how regexes work. But regex is hard and I guess no one checked it.
Notes
It seems that v0.12.1 was an incomplete fix for a similar issue earlier this year, CVE-2023-28755.
Taking a second look at the code, there is the following regex:
/\A(?:[^@,;]+@[^@,;]+(?:\z|[,;]))*\z/
This looks like it would be vulnerable to the same problem, but I could not reproduce it. I think it's because the @ in the middle serves as a checkpoint, so the engine doesn't backtrack through the whole regex.