January 2021 – hack.with(passion)

January 26, 2021 (updated June 27, 2021)

Gem::Version != SemVer

Using dependencies with semantic versioning (also known as SemVer) is a key piece of modern software development, and if we’re writing Ruby code that checks the version of a dependency, we often use Gem::Version for our operations.

The initializer for Gem::Version checks if we’ve passed in a valid version number (such as 1.9.10), and raises an ArgumentError if it’s invalid:

$ irb
3.0.0 :001 > Gem::Version.new('1.9.10')
 => #<Gem::Version "1.9.10"> 
3.0.0 :002 > Gem::Version.new('foo bar car')
ArgumentError (Malformed version number string foo bar car)

If we wanted to test this behaviour in RSpec, it would look something like so:
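The spec itself isn’t reproduced in this copy of the post, so here is a minimal sketch of what such a test might look like. The example descriptions are our own wording, and the defined?(RSpec) guard is an assumption added so that the file also loads under plain ruby:

```ruby
require 'rubygems'

# Only define the examples when the file is loaded by the rspec runner.
if defined?(RSpec)
  RSpec.describe Gem::Version do
    it 'accepts a well-formed version string' do
      expect { described_class.new('1.9.10') }.not_to raise_error
    end

    it 'raises an ArgumentError for a malformed version string' do
      expect { described_class.new('foo bar car') }.to raise_error(ArgumentError)
    end
  end
end
```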

Most dependencies use release numbers with the MAJOR.MINOR.PATCH pattern (eg. 1.9.10), but the SemVer specification allows more complex identifiers than that. For a list of interesting edge cases that illustrate what is and isn’t allowed, the SemVer specification links to a Regex101 page containing a list of test strings.

Let’s pass those into Gem::Version via RSpec and see what happens! In the code below, we pass each valid test string into the initializer for a new Gem::Version, and check that it successfully creates a Gem::Version (ie. doesn’t raise an error).

We actually receive an ArgumentError if we pass in a version number with metadata (ie. everything after the + sign), so we strip that out before passing the version string in. It would be ideal if Gem::Version did that metadata parsing for us, but ignoring that edge case, all of our test cases pass:
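As a rough sketch of that spec, using only a handful of the valid test strings rather than the full list: the constant VALID_SEMVER_SAMPLES and the helper strip_build_metadata are hypothetical names introduced here, and the defined?(RSpec) guard is an assumption so the file also loads under plain ruby.

```ruby
require 'rubygems'

# A small, hand-picked subset of valid SemVer strings (not the full list).
VALID_SEMVER_SAMPLES = %w[
  1.2.3
  10.20.30
  1.0.0-alpha
  1.0.0-alpha.1
  1.0.0-rc.1+build.1
  2.0.0+build.1848
].freeze

# Hypothetical helper: drop the build metadata, ie. the first '+' and
# everything after it.
def strip_build_metadata(version_string)
  version_string.split('+', 2).first
end

if defined?(RSpec)
  RSpec.describe Gem::Version do
    VALID_SEMVER_SAMPLES.each do |version_string|
      it "accepts #{version_string} once metadata is stripped" do
        stripped = strip_build_metadata(version_string)
        expect { described_class.new(stripped) }.not_to raise_error
      end
    end
  end
end
```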

What happens when we look at the invalid test cases? Let’s look at the RSpec below:

We expect to receive an ArgumentError if we pass in an invalid version number, which we check for in our expect statements. We also skip some of our test cases with this snippet:

.reject { |version_string| version_string.include?('+') }

We skip these cases because we already know that Gem::Version doesn’t handle metadata correctly. Let’s run the above RSpec tests:

We see that a lot of version identifiers that are invalid under the SemVer specification don’t actually raise errors in these tests, and can be used to instantiate a Gem::Version. For example, Gem::Version.new('1.2.3.DEV') creates a new object and doesn’t raise an error, even though that’s not a valid version number under SemVer.
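To make this concrete without the full spec, here is a plain Ruby sketch (not RSpec) that sorts a few SemVer-invalid strings into those Gem::Version accepts and those it rejects. The sample list is a small subset we picked for illustration, and gem_version_accepts? is a hypothetical helper:

```ruby
require 'rubygems'

# A few strings that are NOT valid under SemVer (illustrative subset).
INVALID_SEMVER_SAMPLES = %w[
  1
  1.2
  01.1.1
  1.2.3.DEV
  1.2-SNAPSHOT
  alpha
  1.0.0-alpha_beta
  1.0.0-alpha..1
  +justmeta
  9.8.7+meta+meta
].freeze

# Hypothetical helper: true if Gem::Version can be built from the string.
def gem_version_accepts?(version_string)
  Gem::Version.new(version_string)
  true
rescue ArgumentError
  false
end

# Skip strings containing '+', as in the post, then split the rest.
candidates = INVALID_SEMVER_SAMPLES.reject { |s| s.include?('+') }
accepted, rejected = candidates.partition { |s| gem_version_accepts?(s) }

puts "accepted by Gem::Version: #{accepted.join(', ')}"
puts "rejected by Gem::Version: #{rejected.join(', ')}"
```

Strings such as 1, 1.2, 01.1.1 and 1.2.3.DEV end up in the accepted group, while alpha and 1.0.0-alpha_beta are rejected.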

As it turns out, Gem::Version (ie. the versioning used by RubyGems) is much more permissive than the SemVer specification. This means that if we’re trying to adhere to the strictest definition of a SemVer version number, we can’t actually use Gem::Version to determine if a version string is valid due to this permissiveness. We can only use Gem::Version to tell us if a version string is invalid (barring the aforementioned exception around metadata and the + sign).

My favourite behaviour of Gem::Version isn’t actually covered in the negative test cases of the SemVer spec. Gem::Version lets us be arbitrarily granular with the number of dots in our version number. This isn’t valid in the SemVer specification either, but Gem::Version will happily parse it and do version comparisons with it:
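A short sketch of that behaviour, with version strings chosen purely for illustration (sort_versions is a hypothetical helper):

```ruby
require 'rubygems'

# Six-segment versions are not valid SemVer, but Gem::Version parses them.
GRANULAR_VERSIONS = %w[1.2.3.4.5.10 1.2.3.4.5.9 1.2.3.4.6].freeze

# Hypothetical helper: parse, sort, and convert back to strings.
def sort_versions(version_strings)
  version_strings.map { |s| Gem::Version.new(s) }.sort.map(&:to_s)
end

# Segments are compared numerically, so 9 sorts before 10.
puts Gem::Version.new('1.2.3.4.5.9') < Gem::Version.new('1.2.3.4.5.10') # true
puts sort_versions(GRANULAR_VERSIONS).inspect
# ["1.2.3.4.5.9", "1.2.3.4.5.10", "1.2.3.4.6"]
```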

Start your journey towards writing better software, and watch this space for new content.

January 18, 2021 (updated June 27, 2021)

Use RSpec to test for more than just correctness

RSpec is typically used to test the correctness of our Ruby on Rails code, but did you know that we can use it to maintain the readability of our application’s configuration files?

Suppose we have a YAML file containing an alphabetized list of all of our application’s feature flags. It might look something like so:

While our application may function correctly if feature_flags.yml stores these flags in a random order, we usually prefer an alphabetized list of flags for the sake of readability. One approach is to leave a comment for the next developer who modifies this file:

This is adequate, and hopefully our code review process would prevent a developer from adding an unalphabetized flag. But, through the power of RSpec, we can actually enforce this alphabetization (rather than just suggest it with a comment)! Consider the following RSpec test that we could add to our codebase:
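The original file and spec aren’t included in this copy, so the following is a sketch under stated assumptions. The flag names are invented, the YAML is inlined as a heredoc to keep the example self-contained (in a real codebase we would presumably read it with YAML.load_file from wherever feature_flags.yml lives), and the defined?(RSpec) guard only exists so the file also loads under plain ruby:

```ruby
require 'yaml'

# Invented stand-in for the contents of feature_flags.yml.
FEATURE_FLAGS_YAML = <<~YAML
  # Please keep these flags in alphabetical order!
  dark_mode: false
  new_checkout: true
  search_suggestions: true
YAML

# Hypothetical helper: the flag names in the order they appear in the file.
def feature_flag_names(yaml_source)
  YAML.safe_load(yaml_source).keys
end

if defined?(RSpec)
  RSpec.describe 'feature_flags.yml' do
    it 'lists its flags in alphabetical order' do
      flag_names = feature_flag_names(FEATURE_FLAGS_YAML)
      expect(flag_names).to eq(flag_names.sort)
    end
  end
end
```

Comparing against the sorted list with eq (rather than asserting a boolean) means a failure shows both orderings, which makes the misplaced flag easy to spot.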

If another developer tried to submit a pull request containing a non-alphabetized YAML property, this test would fail the next time our tests ran in our continuous integration system. This allows us to preserve the alphabetization of this configuration file – even if the next developer to add a feature flag doesn’t read our comment!

Note that this test executes none of our actual Ruby application code – this is purely testing properties of our YAML file, and doesn’t deal with any of our classes that we expect to see at runtime.

Let’s look at another example! Suppose we have a configuration file that stores a list of UI elements to display, and we associate a position integer with each element (think of something akin to the acts_as_list library):

Again, our application may run correctly if our YAML file stores these elements in a random order, but we (and other developers) will have a far easier time understanding this file if these items are ordered by position. To keep these items ordered, we can add the following test:
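Again as a sketch rather than the original code: the element names and the name/position keys below are invented for illustration, the YAML is inlined as a heredoc instead of being loaded from a file, and the defined?(RSpec) guard is an assumption so the file loads under plain ruby.

```ruby
require 'yaml'

# Invented stand-in for the UI elements configuration file.
UI_ELEMENTS_YAML = <<~YAML
  - name: header
    position: 1
  - name: search_bar
    position: 2
  - name: footer
    position: 3
YAML

# Hypothetical helper: the position values in file order.
def element_positions(yaml_source)
  YAML.safe_load(yaml_source).map { |element| element['position'] }
end

if defined?(RSpec)
  RSpec.describe 'UI elements configuration' do
    it 'lists its elements in order of position' do
      positions = element_positions(UI_ELEMENTS_YAML)
      expect(positions).to eq(positions.sort)
    end
  end
end
```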

If we try to add an element to this file that is out of order, this test will fail until we place it in the correct position.

It’s easy to think of a linter as our sole tool to ensure code quality, and that RSpec’s only role is to test correctness of our application – since this is usually the case! But, in occasional cases like this, we can use RSpec for more than just correctness, and leverage it to ensure high code quality in our codebase.


January 11, 2021 (updated June 27, 2021)

You may not know that Ruby’s ‘puts’ method does this

If you ran this code snippet, what would you expect the output to be? Take a moment to think about the answer before reading the following paragraph.
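The snippet is missing from this copy of the post. Judging from the discussion that follows, it was presumably something along these lines; treat it as a reconstruction rather than the author’s exact code:

```ruby
# A class whose to_s returns a Hash instead of a String.
class Foo
  def to_s
    { bar: 'car' }
  end
end

puts Foo.new
```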

If you had asked me a few months ago, I likely would have answered with '{:bar=>"car"}', the rationale being something like: “When we pass in an object to puts, to_s gets called to do string conversion, and '{:bar=>"car"}' is the string representation of the value returned by to_s.” Seems reasonable, right?

When we actually run the code, we see the following output:

#<Foo:0x00007f7f7f160bb0>

Counterintuitive, right? We typically see the object’s class name and address (ie. #<Foo:0x00007f7f7f160bb0>) in output like that when we haven’t explicitly defined to_s, but we did define that method.

Why are we printing out a reference to the parent Foo, rather than something related to our {:bar=>"car"} hash?

Let’s dig in further and look at a slightly different code example:
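This second snippet is also missing here. Based on the names used later in the post (foo_instance and an explicit to_s call), a plausible reconstruction is:

```ruby
# Same class as before: to_s returns a Hash, not a String.
class Foo
  def to_s
    { bar: 'car' }
  end
end

foo_instance = Foo.new

puts foo_instance      # puts converts the object to a string itself
puts foo_instance.to_s # puts receives the Hash returned by our to_s
```

The exact formatting of the second line depends on the Ruby version: the post’s Ruby 3.0 prints {:bar=>"car"}, while more recent versions print hashes with symbol keys in the {bar: "car"} style.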

If we run the code snippet above, our terminal prints this:

#<Foo:0x00007f843b0f1268>
{:bar=>"car"}

Counterintuitive indeed. When we pass in an instance of Foo to puts, aren’t we expecting to_s to be called under the hood? Why are we getting a different result when we explicitly call to_s?

puts is a Ruby function that’s implemented purely in C, so we can’t just step through it with a debugger like pry or byebug to find out more; puts doesn’t have Ruby code to step into! But, we can read through the Ruby source code on GitHub: the io.c file sounds like a promising place to read about puts, and we find this definition of rb_io_puts there:

C can be difficult to read compared to Ruby code, but this line of code looks promising: line = rb_obj_as_string(argv[i]);. So, let’s read the definition of rb_obj_as_string, which is found in the string.c file:

str = rb_funcall(obj, idTo_s, 0) calls our object’s to_s method (this idTo_type naming pattern is also found elsewhere in Ruby’s C source code for other built-in Ruby types, such as Array and Symbol). We then pass the result of to_s into rb_obj_as_string_result. How is rb_obj_as_string_result defined in string.c?

And this explains it! In the underlying implementation of puts, rb_obj_as_string_result explicitly checks if to_s has returned a string. If we haven’t returned a string, that value is overridden, and we use the return value of rb_any_to_s instead (ie. the function that returns a class name / address string like #<Foo:0x00007f7f7f160bb0>).
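To make that logic easier to follow, here is a rough Ruby model of the conversion described above. This is not the real implementation (which is C): obj_as_string is a hypothetical method, and HashReturner is a made-up class standing in for Foo.

```ruby
# Rough Ruby model of the string conversion puts performs on its argument.
def obj_as_string(obj)
  # Strings are used as-is.
  return obj if obj.is_a?(String)

  # Otherwise call the object's own to_s...
  result = obj.to_s
  return result if result.is_a?(String)

  # ...and if that did not return a String, discard the result and fall
  # back to the default Kernel#to_s (the "#<ClassName:0x...>" form).
  Kernel.instance_method(:to_s).bind(obj).call
end

# Made-up class whose to_s returns a Hash, like Foo in the post.
class HashReturner
  def to_s
    { bar: 'car' }
  end
end

puts obj_as_string(42)               # prints 42
puts obj_as_string(HashReturner.new) # prints something like #<HashReturner:0x...>
```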

This is why we’re printing a reference to Foo, and not anything to do with the actual hash – the value of to_s is discarded because it’s not a string! This also explains the discrepancy between puts foo_instance and puts foo_instance.to_s – in the second case, we pass the hash itself to puts, so rb_obj_as_string calls to_s on {:bar=>"car"}, and Hash does have a definition of to_s that returns a string.

The way Ruby’s puts function overrides a value we explicitly return from to_s can be unexpected if you haven’t seen it before, but upon reflection, I do think that this is sensible language design. The alternative would be for Ruby to recursively call our underlying rb_obj_as_string on the value returned by to_s until we get a string, but this introduces additional complexity for little benefit. At the end of the day, if we want to write clean code, any to_s functions that we write should, well, return a string 🙂


January 4, 2021 (updated June 27, 2021)

Bypass GitHub’s search API rate limit by 27% (with just five lines of code!)

GitHub’s search API is a powerful tool, but its search functionality is heavily rate limited compared to the rest of its API. GitHub’s general rate limit on its API is 5000 requests per hour (roughly 83 requests per minute), while the rate limit for search requests is documented at only 30 requests per minute. This can be restrictive in some use cases, but with just five lines of code, we can increase this limit to over 40 requests per minute!

(At this point, some readers may be concerned that “over 40” divided by “30” is not, in fact, an increase of 27%. Read on to find out the source of this discrepancy!)

To begin, let’s clarify those aforementioned rate limits – these are limits on requests that we’ve associated with an access token connected to our GitHub account, also known as authenticated requests. We can also query the GitHub API using unauthenticated requests (ie. without an access token), but at a much lower rate limit – GitHub only allows 10 unauthenticated search requests per minute.

However, GitHub tracks these authenticated and unauthenticated rate limits separately! This is by design, which I confirmed with GitHub via HackerOne prior to posting. To increase our effective rate limit, we can write our application code to combine our authenticated and unauthenticated API requests. Our application can make an authenticated request, and if that authenticated request fails due to rate limiting, we can retry that request again without authentication. This effectively increases our rate limit by 10 requests per minute.

Let’s illustrate with two separate code snippets – the first using only authenticated requests, and the second using both authenticated and unauthenticated requests. In both of these snippets, we try to make 50 requests in parallel to the GitHub search API via Octokit’s search_repositories method.

In this first snippet, we expect to see 30 requests succeed (returning a Sawyer::Resource) and 20 fail (returning an Octokit error), given the documented rate limit.

Run it, and we see this output:

$ ruby authenticated_only.rb
36 requests succeeded, 14 requests failed

Oddly enough, GitHub does not appear to strictly adhere to its documented rate limit of 30 requests per minute, but our premise still holds – we can’t make all 50 requests due to GitHub’s rate limiting.

Now, let’s run the second snippet, which is five lines of code longer than our previous snippet. In this snippet, if a request using our authenticated client fails, we retry the same request using an unauthenticated client.
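The original scripts aren’t reproduced in this copy, and running them needs the octokit gem, a token and network access. As a stand-in, here is a self-contained simulation of the retry idea using the documented limits (30 authenticated, 10 unauthenticated). QuotaClient, search_with_fallback and simulate are hypothetical names, not part of Octokit:

```ruby
# Hypothetical stand-in for an API client with a fixed request quota.
class QuotaClient
  class RateLimited < StandardError; end

  def initialize(quota)
    @remaining = quota
    @lock = Mutex.new
  end

  def search_repositories(query)
    @lock.synchronize do
      raise RateLimited, 'rate limit exceeded' if @remaining.zero?

      @remaining -= 1
    end
    "results for #{query}"
  end
end

# Try the authenticated client first; if it is rate limited, retry the
# same request once with the unauthenticated client.
def search_with_fallback(query, authenticated:, unauthenticated:, rate_limit_error:)
  authenticated.search_repositories(query)
rescue rate_limit_error
  unauthenticated.search_repositories(query)
end

# Fire the requests in parallel and count successes and failures.
def simulate(requests: 50, authenticated_quota: 30, unauthenticated_quota: 10)
  authenticated = QuotaClient.new(authenticated_quota)
  unauthenticated = QuotaClient.new(unauthenticated_quota)

  outcomes = Array.new(requests) do |i|
    Thread.new do
      search_with_fallback("query #{i}",
                           authenticated: authenticated,
                           unauthenticated: unauthenticated,
                           rate_limit_error: QuotaClient::RateLimited)
      true
    rescue QuotaClient::RateLimited
      false
    end
  end.map(&:value)

  succeeded = outcomes.count(true)
  [succeeded, requests - succeeded]
end

succeeded, failed = simulate
puts "#{succeeded} requests succeeded, #{failed} requests failed"
# 40 requests succeeded, 10 requests failed
```

With real clients, the two arguments would presumably be an Octokit::Client built with an access token and one built without, and the error class would be whichever Octokit error the rate-limited response raises (Octokit::TooManyRequests is our assumption here, so check it against the Octokit version in use).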

We see the following output:

$ ruby authenticated_and_unauthenticated.rb
46 requests succeeded, 4 requests failed

As predicted, we’ve successfully increased our rate limit from 36 to 46 requests per minute, a 27% increase from what we could achieve previously.

I really did expect to put the number 33% in this blog post’s title, not 27% – it’s unclear to me why my authenticated client can make 36 successful requests, when the search API limit is documented at 30. I observed some variation in the output of this script too, ranging from 40 to 46 successful requests.
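For reference, the arithmetic behind the two percentages can be checked quickly (percent_increase is a hypothetical helper):

```ruby
# Percentage increase when going from one request count to another.
def percent_increase(from, to)
  (to - from) * 100.0 / from
end

# Observed: 36 authenticated-only successes vs 46 with the fallback.
puts percent_increase(36, 46).round(1) # 27.8

# Documented limits: 30 authenticated plus 10 unauthenticated.
puts percent_increase(30, 40).round(1) # 33.3
```

The observed figure comes to roughly 27.8%, which the post reports as 27%.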

Going back to our performance gains – is this method effective for every application using the GitHub search API? No, probably not – 10 additional requests per minute is inconsequential in a large production application at scale. In that case, there are other techniques available to avoid hitting the GitHub search API rate limit. Some examples include caching your search results from the GitHub API, or rotating GitHub credentials to multiply your effective rate limit.

However, what if you’re using GitHub’s search API at a small scale? For example, you may be using the search API in a script that runs in your local development environment, or in some sort of internal tooling. In such a scenario, you may just be occasionally hitting the authenticated request limit, but haven’t reached a point where you need a more scalable solution. In that case, these five lines of code may give you a good “bang for your buck” in solving rate limiting issues.
