
SSCG 3.0.8: Brought to you in part by Claude

This past week, as part of a series of experiments that I’ve been running to evaluate the viability of AI-supported coding, I decided to spend some time with the Cursor IDE, powered by the Claude 4 Sonnet large language model. Specifically, I was looking to expand the test suite of my SSCG project (more detail in this blog entry) to exercise more of the functionality at a low level. I figured that this would be a good way to play around with AI code generation, since the output wouldn’t affect the actual execution of the program (and thus shouldn’t be able to introduce any new bugs into it).

The first thing I asked the AI tool to do was to create a unit test for the sscg_init_bignum() function. I gave it no more instruction than that, in order to see what Claude made of that very basic prompt (which, you may note, also did not include any information about where that function was defined, where I wanted the test to be written, or how it should be executed). I was actually quite surprised and impressed by the results. The first thing it did was inspect the rest of the project’s code to familiarize itself with the layout and coding style of SSCG, and then it proceeded to generate an impressively comprehensive set of tests, expanding on the single, very basic test I had in place previously. It created tests to verify initialization to zero; initialization to non-zero numeric values; and initialization to the largest possible valid value; and it verified that none of these calls resulted in memory leaks or other memory errors. In fact, I only needed to make two minor changes to the resulting code:

  1. The whitespace usage and line length were inconsistent and rather ugly. A quick pass through clang-format resolved that trivially.
  2. There was a small ordering problem at the beginning of execution that resulted in the memory-usage check believing it had successfully freed more memory than it had allocated. That was concerning until I realized that Claude had initialized the root memory context before starting the memory-usage recording functions.
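
To give a sense of the shape of the final test after those two fixes, here’s a minimal sketch. To be clear, this is not the generated code: the sscg_init_bignum() signature, the zero-on-success return value, and the use of talloc’s null-tracking counters as the “memory-usage recording functions” are my own assumptions here, so treat it purely as an illustration of the ordering fix.

    #include <assert.h>
    #include <limits.h>
    #include <stdlib.h>
    #include <talloc.h>

    /* Assumed shape of the wrapper and initializer under test; the real
     * declaration lives in SSCG's own headers and may differ. */
    struct sscg_bignum;
    int sscg_init_bignum(TALLOC_CTX *mem_ctx, unsigned long num,
                         struct sscg_bignum **bn);

    int main(void)
    {
        /* Start memory accounting *before* creating the root context, so
         * the baseline reflects everything the test will later free.
         * (Reversing these two steps is the ordering problem described in
         * point 2 above.) */
        talloc_enable_null_tracking();
        size_t baseline = talloc_total_size(NULL);

        TALLOC_CTX *tmp_ctx = talloc_new(NULL);
        assert(tmp_ctx != NULL);

        struct sscg_bignum *bn = NULL;

        /* Initialization to zero */
        assert(sscg_init_bignum(tmp_ctx, 0, &bn) == 0);

        /* Initialization to a non-zero value */
        assert(sscg_init_bignum(tmp_ctx, 0x123456UL, &bn) == 0);

        /* Initialization to the largest value the argument type can carry */
        assert(sscg_init_bignum(tmp_ctx, ULONG_MAX, &bn) == 0);

        /* Freeing the root context should release everything in one shot,
         * bringing usage back down to the recorded baseline. */
        talloc_free(tmp_ctx);
        assert(talloc_total_size(NULL) == baseline);

        return EXIT_SUCCESS;
    }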

All in all, this first experiment was decidedly eye-opening. My initial expectation was that I was going to need to write a much more descriptive prompt in order to get a decent output. I had mostly written the abbreviated one to get a baseline for how poorly it would perform. (I considered making popcorn while I watched it puzzle through.) It was quite a shock to see it come up with such a satisfactory answer from so little direction.

Naturally, I couldn’t resist going a bit further, so I decided to see if it could write some more complicated tests. Specifically, I had a test for create_ca() that I had long been meaning to extend to thoroughly examine the subjectAlternativeName (aka SAN) handling, but I kept putting it off because it was complicated to interact with through the OpenSSL API calls. Sounds like a perfect job for a robot, no? Claude, you’re up!

Unfortunately, I don’t have a record of the original prompt I used here, but it was something on the order of “Extend the test in create_ca_test.c to include a comprehensive test of all values that could be provided to subjectAlternativeName.” Again, quite a vague prompt, but it merrily started turning the crank, analyzing the create_ca() function in SSCG and manufacturing a test that provided at least one SAN in each of the possible formats and then verified both that the produced certificate contained the appropriate subjectAlternativeName fields and that the CA certificate provided the appropriate basicConstraints fields, as required by SSCG’s patented algorithm.
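
I won’t reproduce the generated test here, but to give a flavour of the OpenSSL plumbing this kind of verification needs, here’s a rough sketch (the helper names and exact checks are mine, not Claude’s): pull the subjectAlternativeName extension apart with X509_get_ext_d2i(), walk the GENERAL_NAMES stack, and confirm the basicConstraints with X509_check_ca().

    #include <string.h>
    #include <openssl/x509.h>
    #include <openssl/x509v3.h>

    /* Returns 1 if `cert` carries a SAN of the given type (GEN_DNS,
     * GEN_EMAIL or GEN_URI) whose value exactly matches `expected`. */
    static int cert_has_san(X509 *cert, int type, const char *expected)
    {
        GENERAL_NAMES *sans =
            X509_get_ext_d2i(cert, NID_subject_alt_name, NULL, NULL);
        int found = 0;

        if (sans == NULL) {
            return 0;
        }

        for (int i = 0; i < sk_GENERAL_NAME_num(sans); i++) {
            GENERAL_NAME *gen = sk_GENERAL_NAME_value(sans, i);
            ASN1_IA5STRING *value = NULL;

            if (gen->type != type) {
                continue;
            }

            switch (type) {
            case GEN_DNS:
                value = gen->d.dNSName;
                break;
            case GEN_EMAIL:
                value = gen->d.rfc822Name;
                break;
            case GEN_URI:
                value = gen->d.uniformResourceIdentifier;
                break;
            default:
                continue;
            }

            if (ASN1_STRING_length(value) == (int)strlen(expected) &&
                memcmp(ASN1_STRING_get0_data(value), expected,
                       strlen(expected)) == 0) {
                found = 1;
                break;
            }
        }

        GENERAL_NAMES_free(sans);
        return found;
    }

    /* Returns 1 if `cert` asserts the CA flag via basicConstraints.
     * X509_check_ca() returns 1 for exactly that case. */
    static int cert_is_ca(X509 *cert)
    {
        return X509_check_ca(cert) == 1;
    }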

However, as I watched it “think”, I noticed something peculiar. It went and ran its test… and it failed. At first, I guessed that it had just made a mistake (it seemed to think so, as it re-evaluated its code), but when it didn’t see a logic problem, it expanded its search to the original function it was testing. It turned out that there was a bug in create_ca(): it was improperly stripping out the slash from URI: type SANs, when it should only have been doing that for IP: types. The AI had discovered this… but it made a crucial mistake here. It went back and attempted to rewrite the test in such a way as to work around the bug, rather than properly failing on the error and flagging it up. I probably would have noticed this when I reviewed the code it produced, but I’m still glad I saw it making that incorrect decision in the chatbot output and interrupted to amend the prompt to tell it not to work around test failures.
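
To make that regression concrete: one way a test can pin the behaviour down is to render the SAN extension as text and require that the full URI, path and slash included, survives into the certificate. The helper and the URI value below are hypothetical illustrations of that idea, not SSCG’s actual test code.

    #include <string.h>
    #include <openssl/bio.h>
    #include <openssl/x509.h>
    #include <openssl/x509v3.h>

    /* Returns 1 if the textual rendering of the certificate's SAN
     * extension contains `needle`, 0 otherwise. */
    static int san_text_contains(X509 *cert, const char *needle)
    {
        int found = 0;
        int idx = X509_get_ext_by_NID(cert, NID_subject_alt_name, -1);
        BIO *mem;

        if (idx < 0) {
            return 0;
        }

        mem = BIO_new(BIO_s_mem());
        if (mem == NULL) {
            return 0;
        }

        if (X509V3_EXT_print(mem, X509_get_ext(cert, idx), 0, 0)) {
            char *text = NULL;

            /* NUL-terminate the memory BIO so it can be searched as a
             * C string. */
            BIO_write(mem, "", 1);
            BIO_get_mem_data(mem, &text);
            if (text != NULL && strstr(text, needle) != NULL) {
                found = 1;
            }
        }

        BIO_free(mem);
        return found;
    }

    /* Before the fix, a hypothetical input such as
     * "URI:https://example.com/path" came back with its slash stripped,
     * so an exact check like this failed instead of being papered over:
     *
     *     assert(san_text_contains(cert, "URI:https://example.com/path"));
     */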

That inadvertently led to its second mistake, and I think this one is probably an intentional design choice: when writing tests, Claude wants to write tests that pass, because that way it has a simple mechanism to determine whether what it wrote actually works. Since there was a known failure, when I told Claude not to work around it, the AI then attempted to rewrite the test so that it would store its failures until the end of execution and then print a summary of all of the failed tests rather than failing midway through. I assume it did this so that it could use the expected output text (rather than the error code) as a mechanism to identify whether the tests it wrote were wrong, but it still resulted in a much more (needlessly) complicated test function. Initially, I was going to keep it anyway, since the summary output was fairly nice, but unfortunately it turned out there were also bugs in it: in several places, the “saved” failures were being dropped on the proverbial floor and never reported. A Coverity scan reported these defects, so I amended the prompt yet again to instruct it not to try to save failures for later reporting, and the final set of test code was much cleaner and easier to review.
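
For contrast, the style the final tests ended up much closer to is a plain fail-fast check: the first failed assertion reports where it happened and exits non-zero, so nothing has to be “saved” for later and nothing can be silently dropped. The macro below is only an illustration of that pattern, not SSCG’s real test helper.

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative fail-fast check: report the failing expression and stop
     * immediately, rather than queueing failures for an end-of-run summary
     * where a dropped entry can silently hide a failure. */
    #define CHECK_OK(expr)                                                  \
        do {                                                                \
            if (!(expr)) {                                                  \
                fprintf(stderr, "FAIL: %s at %s:%d\n",                      \
                        #expr, __FILE__, __LINE__);                         \
                exit(EXIT_FAILURE);                                         \
            }                                                               \
        } while (0)

    int main(void)
    {
        /* The first failed check terminates the test with a non-zero exit
         * code, which is all the test harness needs to flag the run. */
        CHECK_OK(1 + 1 == 2);
        return EXIT_SUCCESS;
    }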

All in all, I think this experiment was a success. I won’t go into exhaustive detail on the rest of the 3.0.8 changes, but they followed much the same pattern as the two examples above: the AI produced output surprisingly close to what I needed, and I then examined everything carefully and tweaked it to make sure it worked properly. It vastly shortened the amount of time I needed to spend on generating unit tests and it helped me get much closer to 100% test coverage than I was a week ago. It even helped me identify a real bug, in however roundabout a way. I will definitely be continuing to explore this and I will try to blog about it some more in the future.