Mike Cole

A blog about development technologies, with a Microsoft and JavaScript tilt.

Archive: July 2024

I maintain several Chocolatey community packages on GitHub. I use the chocolatey-au tool, running in an AppVeyor scheduled build, to automatically update packages to new versions of the software. A few months ago the Zoom package stopped updating automatically, and it appeared that the AppVeyor build agent was not seeing the latest version in the Zoom version feed. After troubleshooting what I thought was a caching issue, I realized that the build agent was still running on Windows Server 2012. It turns out the Zoom version feed inspects the User-Agent HTTP request header and serves the latest version that your operating system can support. Once I updated the build agent to use the Visual Studio 2022 image, the chocolatey-au tool worked as expected.
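
One way to verify that kind of behavior is to request the feed with different User-Agent headers and compare the responses. Here's a rough PowerShell sketch of that check; the feed URL and user agent strings below are placeholders, not the actual values the Zoom package script uses:

# Hypothetical feed URL -- substitute the URL the package's update script actually checks
$feedUrl = 'https://example.com/zoom/version-feed'
# User-Agent strings approximating an older and a newer Windows build (illustrative only)
$agents = [ordered]@{
    'Windows Server 2012' = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64)'
    'Windows Server 2022' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
foreach ($name in $agents.Keys) {
    # Request the feed while presenting each user agent, then compare what comes back
    $response = Invoke-WebRequest -Uri $feedUrl -UserAgent $agents[$name] -UseBasicParsing
    Write-Host "$name -> $($response.Content)"
}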

#TIL that you can RDP into an AppVeyor build agent to help troubleshoot issues. First, you need to add the following on_finish block to your appveyor.yml file:

on_finish:    
- ps: $blockRdp = $true; iex ((new-object net.webclient).DownloadString('https://raw.githubusercontent.com/appveyor/ci/master/scripts/enable-rdp.ps1'))

This keeps the build from finishing and prints the RDP connection details to the console. Before re-running the job, add an environment variable named APPVEYOR_RDP_PASSWORD to your build settings; its value will be the password for your RDP connection.
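
If you would rather keep that setting in source control instead of the AppVeyor UI, the same variable can be declared in the environment block of your appveyor.yml. This is just a sketch, and it assumes you encrypt the value with AppVeyor's Encrypt YAML tool rather than committing the password in plain text:

environment:
  APPVEYOR_RDP_PASSWORD:
    # Placeholder; paste the value produced by AppVeyor's Encrypt YAML tool here
    secure: <encrypted-password>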

Slot swapping is an easy way to do zero-downtime Azure App Service deployments. The basic workflow is to deploy your app to a staging slot in Azure, then swap that slot with production. Azure magic ensures the application cuts over to the new version without disruption, as long as you've architected your application in a way that supports this. Reading about the az webapp deployment slot swap command, warming up the slot is supposed to be one of the first steps it performs automatically. In reality, what we were seeing was a lengthy operation (20+ minutes) followed by a failure with no helpful details. As part of troubleshooting, I added a step to ping the staging slot after deployment and saw that it returned random errors, mostly 503 Service Unavailable. This was happening on several different systems. I ended up adding a few retries and found that the site would eventually return 200 OK after a minute or so, and then the slot swap step would reliably complete within 2 minutes. It seems that when your site doesn't immediately return 200 OK at the start of a slot swap, the command freaks out.
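
For context, the swap itself boils down to a single CLI call; the resource group, app name, and slot names below are placeholders for whatever your environment uses:

# Swap the warmed-up staging slot into production (placeholder resource names)
az webapp deployment slot swap \
  --resource-group my-resource-group \
  --name my-app-service \
  --slot staging \
  --target-slot production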

In our deployment GitHub Actions jobs, I added the following step to manually warm up the site. The warmup_url should point to a page on your staging slot; we point it at a health checks endpoint so we can verify stability before swapping. --connect-timeout 30 gives curl up to 30 seconds to establish a connection on each attempt, while --retry 4 and --retry-delay 30 retry up to 4 times with a 30 second delay between attempts. The -f argument makes curl treat HTTP error responses (like that 503) as failures, and --retry-all-errors is an aggressive form of retry that retries on pretty much anything that isn't a 200-level success code. If the site still isn't healthy after the retries, the second curl statement, with --fail-with-body, fails the step and prints the response body, which fails the entire job.

- name: Warmup
  if: inputs.warmup_url != ''
  run: |
    curl \
      --connect-timeout 30 \
      --retry 4 \
      --retry-delay 30 \
      -f \
      --retry-all-errors \
      ${{ inputs.warmup_url }}
    curl \
      --fail-with-body \
      ${{ inputs.warmup_url }}
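
The if: inputs.warmup_url != '' condition assumes this step lives in a reusable workflow that exposes warmup_url as an optional input; a minimal sketch of that declaration might look like this:

on:
  workflow_call:
    inputs:
      # Optional health check URL on the staging slot; when empty, the Warmup step is skipped
      warmup_url:
        type: string
        required: false
        default: ''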

After adding this to all of our slot swap deployments, we saw consistently more reliable behavior and no more frustrating 20-minute black-hole failures.