
SQL Server and cloud

Back in the day

You would have one database connection string, and that would be it. “The Database” would live in the darkest corner of the office and you would try not to interact with the physical machine that hosted the relational database management server. The Windows Forms app would open a connection and keep it alive forever, with nothing to really threaten it. It would enable Multiple Active Result Sets because there was no grown-up way to query the database: queries were shoved down the shared connection seemingly at random, depending on what the user was doing and what events forced data to be loaded. You would use the Windows credentials of the end user and specifically allow that user to access the server, which fit with the licensing model where the user paid for accessing SQL Server when buying Office products, whilst pure server deployments needed per-CPU-core licensing, which was hysterically expensive.

Cloud

The cloud revolution came and went, whilst we in the Microsoft / Windows / .NET bubble never really heard about it, until Microsoft decided to try their hand at hosting cloud servers with the Windows Azure project. I genuinely think this was extremely useful for Windows, as the prospect of hosting cloud services was so far beyond the capabilities of the bare OS that it forced some stark introspection and engineering effort to overcome some of the worst designs in the kernel.

For .NET Framework, a lot was already too late, and many misfeatures weren’t fixed until the .NET Core reboot effort.

One of the things that did change for the average programmer was that database connections became ephemeral, and pooled. You could no longer hog one connection for all of your needs for all eternity – you created a new connection when you needed something and the framework would pool the connections for you, but you had to allow for connections taking longer to get established and for the fact that they could be evicted at any point. Relatively quickly, database providers built in resilience so that you didn’t have to know or care, but in the early days even the happy-path Microsoft marketing slides – the ones that usually never feature error handling or security – had to include retry mechanisms, or else people simply couldn’t successfully replicate the early samples.
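In modern terms the pattern looks something like this – a minimal sketch, assuming Microsoft.Data.SqlClient and a connection string from configuration (the method, query and table name are made up for illustration): create the connection when you need it, dispose it immediately, and let the provider’s pool worry about reuse.

public async Task<int> CountOrdersAsync(string connectionString, CancellationToken cancellationToken)
{
    // Creating the connection object is cheap - the provider hands back a pooled physical connection on open
    await using var connection = new SqlConnection(connectionString);
    await connection.OpenAsync(cancellationToken);

    await using var command = new SqlCommand("SELECT COUNT(*) FROM dbo.Orders", connection);
    var count = (int) await command.ExecuteScalarAsync(cancellationToken);

    // Disposing returns the connection to the pool rather than tearing it down
    return count;
}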

Security

As people started using the public cloud, they eventually figured out that security was a thing. Maybe we should not have overly powerful credentials lying around in config files on the web server, ready to be exploited by any nefarious visitors? People eventually started using things like Azure KeyVault or AWS Secrets Manager. Sure, it’s bad if people break in and nick your stuff, but it’s worse if they also steal your car to carry away the loot.

Once people had all their secrets backed by hardened credentials services, features like auto-rotating credentials started becoming available. Since the general hosting environment provides most of your authorisation anyway, and you only really need credentials for systems that are incompatible with this approach, why not just automatically update those credentials periodically?

Also – back in the day when the server lived back in the office, if somebody got a bit enthusiastic with the UPDATE statements and forgot to add a WHERE, you could send everybody out for coffee (read tea, dear UK readers) whilst the techiest person in the office would Restore to Point in Time and feel like a hero when the database stopped being in recovery mode and everything was back to the way it was before the Incident.

Today a “restore” means you get to create a new database that contains the world as you knew it, before the Incident. You then need to serially rename databases to achieve the effect of having restored the database. Not a big deal, just not what we used to call restore back in my day.

Back to retry mechanisms

For SQL Server running in Azure, you can now tell Microsoft.Data.SqlClient to connect to the database using the app’s Managed Identity, meaning you are completely free of faff – there are no credentials to store, and access is managed for you automatically.
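For reference, newer versions of Microsoft.Data.SqlClient let you express this straight in the connection string – a sketch with made-up server and database names:

// No password anywhere: the app's Managed Identity is exchanged for an Azure AD token under the hood
var connectionString =
    "Server=tcp:my-server.database.windows.net,1433;" +
    "Database=my-database;" +
    "Authentication=Active Directory Managed Identity;" +
    "Encrypt=True;";

await using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();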

With RDS on AWS you need to use legacy usernames and passwords unless you configure an Active Directory domain up there, but… no. These credentials can, however, be auto-rotated by AWS Secrets Manager. Because of the cost of talking to Secrets Manager, it’s not something you want to do on every request – that piles up over a month.
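That cost is why the code further down goes through a small caching wrapper (_secretsCache). A sketch of what such a wrapper might look like, assuming AWSSDK.SecretsManager and Microsoft.Extensions.Caching.Memory – the class and method names are made up to line up with the usage below, and a real implementation would deserialise the secret JSON into typed objects rather than return the raw string:

// Hypothetical caching wrapper around AWS Secrets Manager, so the expensive
// GetSecretValue call only happens when the cache is cold or we know the secret rotated
public class SecretsCache
{
    private readonly IAmazonSecretsManager _secretsManager;
    private readonly IMemoryCache _cache;

    public SecretsCache(IAmazonSecretsManager secretsManager, IMemoryCache cache)
    {
        _secretsManager = secretsManager;
        _cache = cache;
    }

    // Cheap path: return the cached value if we have one
    public Task<string> GetCachedSecretAsync(string secretId) =>
        _cache.GetOrCreateAsync(secretId, async entry =>
        {
            entry.AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(1);
            return await FetchAsync(secretId);
        });

    // Expensive path: go to Secrets Manager and refresh the cache, used when credentials stop working
    public async Task<string> GetSecretAsync(string secretId)
    {
        var secret = await FetchAsync(secretId);
        _cache.Set(secretId, secret, TimeSpan.FromHours(1));
        return secret;
    }

    private async Task<string> FetchAsync(string secretId)
    {
        var response = await _secretsManager.GetSecretValueAsync(
            new GetSecretValueRequest { SecretId = secretId });
        return response.SecretString;
    }
}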

One of those retry mechanisms from the early Azure days starts to make sense again, and it is easily implemented as a factory method you can call instead of the await using var cn = new SqlConnection(...) you have littered all throughout the code. You’ll still await using the result of that factory method, but the factory can do the setting up of the connection and validating of the credentials, and only spend the dosh fetching the latest from the vault if you get the error code for invalid credentials. This means your bespoke factory method replaces both new SqlConnection and OpenAsync().

Naïve retry mechanism, featuring goto:

// https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-18456-database-engine-error?view=sql-server-ver15
// Error code that indicates invalid credentials, meaning we need to bite the bullet and fetch latest settings from Secrets Manager
private const int INVALID_CREDS_ERROR = 18456;
// Error code that indicates the database requested by the login cannot be opened, e.g. after a restore under a new name
private const int INVALID_DB_NAME = 4060;

public async Task EnsureValidCredentialsAsync(SqlConnection conn, CancellationToken cancellationToken)
{
    var rdsCredential = GetRdsCredential(await _secretsCache.GetCachedSecretAsync(_siteConfiguration.DefaultDatabaseCredentialKey));
    var dbCatalog = await _secretsCache.GetCachedSecretAsync(_siteConfiguration.DefaultCatalogKey);
    int reconnectCount = 0;
reconnect:
    var connectionString = GetConnectionString(rdsCredential, dbCatalog);
    conn.ConnectionString = connectionString;
    try
    {
        await conn.OpenAsync(cancellationToken);
        conn.Close(); //  restore to known state
        return;
    }
    catch (SqlException sqlEx) when (sqlEx.Number == INVALID_CREDS_ERROR)
    {
        // Credentials are incorrect, double check with secrets manager to see what's what - potentially creds have been rotated
        rdsCredential = await _secretsCache.GetSecretAsync<AwsSecretsManagerCredential>(_siteConfiguration.DefaultDatabaseCredentialKey);
    }
    catch (SqlException sqlEx) when (sqlEx.Number == INVALID_DB_NAME)
    {
        // Database identifier is not valid, double check with secrets manager to see what's what (potentially restored db, deprecating old db name)
        dbCatalog = await _secretsCache.GetSecretAsync(_siteConfiguration.DefaultCatalogKey);
    }
    catch (SqlException sqlEx)
    {
        Log.Error(sqlEx, "Could not open default DB connection. Reattempting");
    }
    if (reconnectCount++ < 3) goto reconnect;
    // surrounding code expects an exception if the open fails 
    throw new InvalidOperationException("Could not open database connection");
}
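For completeness, the factory method wrapping this could look something like the sketch below (the name CreateOpenConnectionAsync is made up) – call sites then await using the result instead of newing up and opening a SqlConnection themselves:

public async Task<SqlConnection> CreateOpenConnectionAsync(CancellationToken cancellationToken)
{
    var connection = new SqlConnection();
    // Sets the connection string from cached secrets and proves the credentials work,
    // only going back to the secrets manager when SQL Server says they are invalid
    await EnsureValidCredentialsAsync(connection, cancellationToken);
    await connection.OpenAsync(cancellationToken);
    return connection;
}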

What about Entity Framework?

Looking at the way you officially configure Entity Framework, it seems you can’t get away from having a known connection string up front. Again, that isn’t a problem in Azure as discussed earlier, but in my case I only want to hand the app credentials that I know have been tested.

In my past life inviting speakers to Oredev I once invited Julie Lerman to speak about Entity Framework and DDD, so I pretend that I know her and can call upon her for help in matters of programming. I sent out a tweet linking to a Hail Mary Stack Overflow question I had created, asking how I could dynamically handle connection retries in a similar way, or at least be able to call out and ask for credentials.

Surprisingly, she had time to reply and pointed me to a video where she had addressed this subject, which taught me about something called a DbConnectionInterceptor that had all the things I wanted. It also introduced me to the super elegant solution natively supported by Microsoft.Data.SqlClient for handling this situation in Azure, which I mentioned earlier.

Basically, I created a class that inherits from DbConnectionInterceptor and overrides two methods – one sync and one async version – and calls something like the above function to test the connection EF Core is about to open.

public override async ValueTask<InterceptionResult> ConnectionOpeningAsync(DbConnection connection, ConnectionEventData eventData, InterceptionResult result,
    CancellationToken cancellationToken = new CancellationToken())
{
    var sqlConnection = (SqlConnection) connection;
    // try opening the connection, if it doesn't work - update its params - close it before returning to achieve same state
    await _dbConnectionFactory.EnsureValidCredentialsAsync(sqlConnection, cancellationToken);
    return await base.ConnectionOpeningAsync(connection, eventData, result, cancellationToken);
}     
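The synchronous counterpart looks much the same, except there is no synchronous version of the credential check to call, so it has to block on the async one – something like this:

public override InterceptionResult ConnectionOpening(DbConnection connection, ConnectionEventData eventData, InterceptionResult result)
{
    var sqlConnection = (SqlConnection) connection;
    // There is no sync version of the credential check, so block on the async one here
    _dbConnectionFactory.EnsureValidCredentialsAsync(sqlConnection, CancellationToken.None).GetAwaiter().GetResult();
    return base.ConnectionOpening(connection, eventData, result);
}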

Of course, registering this interceptor is easy as well, after I had googled some more. There is a (synchronous) overload of AddDbContext that gives you access to an instantiated IServiceProvider, as follows:

services.AddSingleton<ResilientRdsConnectionInterceptor>();
services.AddDbContext<ADbContext>((provider, options) =>
{
    var dbConnectionFactory = provider.GetRequiredService<DbConnectionFactory>();
    var connectionString = dbConnectionFactory.GetDefaultConnectionString().GetAwaiter().GetResult(); // So sorry, but there is no async overload
    options.AddInterceptors(provider.GetRequiredService<ResilientRdsConnectionInterceptor>());
    options.UseSqlServer(connectionString);
});

An aside on async/sync: it seems Microsoft will have you rewrite every other line of your codebase to properly support async or your app will fall over. But when you then want to reuse your async code and not have to duplicate everything into similar-but-not-reusable sync and async versions, the tiny changes needed for Microsoft to support async everywhere – like in configuration or DI – are all of a sudden “not needed”. It’s laziness, I tell you.

Anyway I have integration tests in place that verify that the mechanism for calling out for updated credentials actually works. Provided the credentials in the secrets manager/key vault actually work, this works like a charm.

A fitting song. Also a banger.

Why is everybody waterfalling “Agile”?

Just like that rebrand cycle years ago when RUP consultants transitioned over to scrum masters through staged re-titling on LinkedIn and liberal use of search and replace in their CVs, scaled agile frameworks and certified project managers attempt to apply the agile manifesto to large organisations by bringing in processes and tools to manage the individuals and interactions, comprehensive documentation of the working software, contract negotiation to manage the customer collaboration, and plans for how to respond to change. You start seeing concepts like the Agile Release Train, which are – well – absurd.

Why? Do they not see what they are doing? Are they Evil?

No – I think it’s simple – and really really hard, at the same time.

You cannot respond to change quickly if you have delays in the system. These delays could be things like manual regression testing due to a lack of automated test coverage, insufficient or badly configured tooling around the release, or a test stack that is an inverted pyramid, where you rely on a big stack of UI tests to cover the entire feature set of the application because faster, lower-level tests aren’t covering all the features and you have undeniable experience of users finding bugs for you.

Obviously, if these tests are all you have, you need to run them before releasing, or you would alienate customers. If your software stack is highly coupled, it would be irresponsible not to coordinate carefully when making changes to shared components with less-than-stellar contract test coverage. You are stuck with this, and it is easy to just give up. The actual solution is to first shorten the time it takes from deciding you have all the features you want to release until the software is actually live. This means automating everything that isn’t automated (the release itself, the necessary tests, etc.), which could very well be a “let’s just stop developing features and focus all our attention on this until it is in place” type of investment – the gains are that great. After this initial push you need to invest in decoupling software components into bits that can be released independently. This can be done incrementally whilst normal business is progressing.

Once you have reached the minimum bar of being able to release whatever you want, whenever you want, and being confident that each change is small enough to roll back in the unlikely event that the automated tests missed something, then you are in a position to talk about an agile process. Now teams are empowered and independent enough that you only need to coordinate in very special cases, where you can bring in ad hoc product and technical leadership; in the day-to-day, product and engineering together will make very quick progress in independent teams without holding each other up.

When you can release small changes, you suddenly see the value in delivering features in smaller chunks behind feature flags. You understand the value of making 20 small changes in trunk (main, for you zoomers) rather than one massive feature branch, as releases go live several times a day. And because your colleagues see your feature-flagged changes appear from beginning to end, they can work with your massive refactor rather than be surprised when you open a 100-file PR at 16:45 on a Friday.