Rethinking the upgrade process of a hybrid cloud system

 

Storage administrators of hybrid cloud systems must periodically upgrade their systems and apply hotfixes (or patches) to address security vulnerabilities. However, conversations with multiple administrators made it clear that this process is complex and can involve significant downtime. Downtime in turn reduces revenue from the cloud system, because multiple applications depend on it to read and write data.

 

 

How can the system help a storage administrator easily plan and execute an upgrade process with minimal downtime?

Methods used

User interviews

Journey mapping

Rapid prototyping

Usability testing

Understanding the existing upgrade process

  • What happens during an upgrade? Upgrading a cloud system involves upgrading every server in the data center. The portion of the upgrade that can be completed within one maintenance window is called a hop.
  • Who upgrades? Storage administrators and professional service consultants.
  • Why upgrade to the latest version? Customers’ internal policies require upgrading to the latest version to pass security scans and pick up bug fixes.
  • When does the upgrade occur? The admin declares a maintenance window for the upgrade. During this time the cloud system cannot be accessed.
  • How? (Refer to the diagram on the left.) Hop to the highest hotfix in the current version (11.2.0.x), then hop to the next major version (11.3), then hop to the latest hotfix in that version (11.3.0.y). Repeat until the system reaches the latest version and hotfix.
Upgrade path from 11.2 to 11.4.0.z
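
The hop sequence described above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the function name `plan_hops`, the `releases` list, and the `latest_hotfix` mapping are hypothetical, and the hotfix identifiers keep the x, y, z placeholders used in the text because the real hotfix numbers are not given.

```python
from typing import Dict, List


def plan_hops(current: str, releases: List[str],
              latest_hotfix: Dict[str, str]) -> List[str]:
    """Return the ordered hops from `current` to the newest release.

    `releases` lists releases in ascending order (hypothetical input).
    `latest_hotfix` maps each release to its highest hotfix (hypothetical input).
    """
    start = releases.index(current)
    hops: List[str] = []
    for offset, release in enumerate(releases[start:]):
        if offset > 0:
            # Hop to the next release itself.
            hops.append(release)
        # Hop to the highest hotfix of that release.
        hops.append(latest_hotfix[release])
    return hops


# Placeholder hotfix identifiers, as written in the text.
latest_hotfix = {"11.2": "11.2.0.x", "11.3": "11.3.0.y", "11.4": "11.4.0.z"}
path = plan_hops("11.2", ["11.2", "11.3", "11.4"], latest_hotfix)
print(path)  # ['11.2.0.x', '11.3', '11.3.0.y', '11.4', '11.4.0.z']
```

In this sketch, moving from 11.2 to 11.4.0.z takes five hops. If hotfix and upgrade hops each need their own maintenance window, that is five windows, which is the cost the later explorations try to reduce.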

Plotting the user journey to identify pain points

Performed heuristic analysis and user interviews to identify pain points. Presented the user journey to the product and engineering leadership teams to determine priority.

Some important gaps include:

        • Because hotfixes and upgrades are two separate flows, they require two different maintenance windows even though they are applied back-to-back.
        • The user switches context repeatedly during the upgrade flow, moving between the UI, the Linux host, the documentation, and the support site.
        • There is no guidance on the next steps in the flow.
        • Users are not aware when a new version or hotfix is available.

Examples of solution explorations

Exploring ways to combine hotfix and upgrade

Explored through participatory design techniques with stakeholders and users. The explorations include:

    • Automatic updates
    • Skipping versions
    • Including hotfix files within the upgrade package
    • Uploading hotfix and upgrade files separately but applying them together

Decreasing context switches and clarifying the upgrade path

  • Explored user flows that integrate API calls to the support servers so that the user does not have to leave the upgrade flow, e.g. the ability to download required files within the flow.
  • Resolved confusion in the upgrade path caused by the different types of hotfixes (required and optional) by suggesting an ideal path to the latest version.

Guiding the user throughout the upgrade process

  • Estimated the time required for the upgrade before it starts, to help plan the maintenance window.
  • Showed important configuration considerations and suggestions that reduce the impact of upgrade maintenance.
  • Introduced recommendations that enable version backtracking within the flow.
  • Removed spurious errors from the primary node upgrade to reduce confusion.

Communicating a new release to the user

  • Explored different channels to notify grid administrators about a new version or hotfix release.
  • Notified users when the current version is no longer supported.

Some of these explorations include:

    • Creating an automated email
    • Leveraging the built-in alert mechanism
    • Using the license information in the header
    • Using the category headers in the side nav

Translating explorations into wireframes

A few key wireframes in the user flow

Testing the prototype

Breaking information blobs into chunks

  • Users suggested that the “Things to know before the upgrade” section would be more readable if it were broken into chunks and arranged in a table rather than presented as a single blob. (Follow the blue arrow.)
  • Since the upgrade time can be calculated only after the pre-check, most of the informational points were moved into a single table.
  • Downloading the recovery package is strongly recommended, so it has its own section. (Follow the red arrow.)

Providing an estimate for Cassandra database upgrade

  • Users asked for a time estimate for the Cassandra upgrade process. Because this was not technically feasible, a graph showing its progress was provided instead.
  • Because the Cassandra upgrade process might take a long time to complete, a step that enables all the features of the new release was added to the flow, so that users can use them as soon as the new version is installed.

Fetching the path automatically and preventing inconsistency

  • The “Check for updates” button was removed to decrease the number of clicks required for an upgrade. Instead, the upgrade path is fetched on page load.
  • To keep the previous and next step controls from shifting up and down between steps, they were pinned to a sticky footer.

Handling edge case scenarios

  • To handle an emergency edge case in which an issue that leads to data loss requires a new hotfix to be applied during an upgrade, the hotfix user flow was retained.
  • The upgrade flow handles new-version upgrades and their hotfixes, whereas the hotfix flow handles hotfixes for the current version.
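
The split between the two flows can be illustrated with a small routing sketch. It assumes version strings of the form 11.x.y.z, where the first two components identify the release; the function names and the numeric hotfix values in the example are hypothetical and not taken from the product.

```python
def release_of(version: str) -> str:
    """Return the release part of a version string (first two components)."""
    return ".".join(version.split(".")[:2])


def choose_flow(current_version: str, target_version: str) -> str:
    """Pick the flow for moving from current_version to target_version.

    Assumption: a target within the same release is a current-version hotfix
    and goes to the hotfix flow; anything else goes to the upgrade flow.
    """
    if release_of(current_version) == release_of(target_version):
        return "hotfix"
    return "upgrade"


# Hypothetical version numbers, for illustration only.
print(choose_flow("11.2.0.1", "11.2.0.3"))  # hotfix
print(choose_flow("11.2.0.1", "11.3.0.2"))  # upgrade
```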

Decluttering the upgrade path

  • It was not clear to users that 11.x.y.z in the upgrade path includes both the upgrade hop and the hotfix hop. The hops in the upgrade path were therefore split to show the upgrade file and the hotfix file separately.
  • All other links and new-tab icons were also removed to declutter the path.

Results and next steps

  • The overhauled flow reduced server maintenance downtime by 30%.
  • This flow can be made more efficient through the following enhancements in the future:
      • Downloading the upgrade file from the internet and then uploading it to the server takes a lot of time. Downloading the file directly to the servers would save time.
      • If the application is installed on a bare-metal server, the user has to deploy containers manually on the Linux server. This touchpoint could be reduced by triggering the commands from the primary server.