So You Want To Build A Platform?

But what does that mean to you and your organization? While the Platform Engineering (PE) practice has made strides as an approach to modern software engineering, it is still hard to pin down. In this article, we discuss the minimal components necessary for a complete platform. Additionally, we reference a burgeoning approach in the field, spearheaded by McKinsey and Humanitec, that provides reference architectures for platforms. Finally, we discuss the buy-versus-build dilemma in the context of platforms and provide references to the available “off-the-shelf” platforms.

You already have a platform

It’s true. You may not have planned it, and it may not be easy to work with, but if you are shipping software, you have a platform. In large organizations, these things tend to develop organically and have parallel work streams that often duplicate work. While this is frustrating and wasteful, it does give you a place to start.

I always say we shouldn’t “throw the baby out with the bathwater” when assessing any platform. There may be good things there, even if they are half-baked and undersupported. So, you need to start by assessing what you already have and looking for inefficiencies and duplicated effort.

Shadow IT is an anti-pattern

Now that you have looked under the hood and found all the issues, make your platform a first-class citizen. Many platform engineering professionals suggest that you “treat your platform as a product,” where the developers are the clients and product owners plan features and roadmaps. I would take it further and say treat it like your factory, with all your stakeholders as your customers. You should build in feedback loops that allow you to evaluate the toil and cognitive load required to maintain and use the platform. But wait, I haven’t told you what the parts are or who the stakeholders are yet. That’s OK, we will get to that. But first, a few more comments.

Don’t stifle innovation

This is where we admit Platform Engineering is an art and can have negative impacts if approached incorrectly. The trick here is to create reusable patterns that allow your stakeholders to innovate in your business domain while accelerating their ability to deliver features, data, insights, and status. As an example, your developers shouldn’t have to rebuild CI/CD pipelines, making sure they have all the correct integration points for SecOps, FinOps, and SRE. These should be part of the platform in a templated manner that can be extended and iterated upon. Another example: Your BI team shouldn’t have to build a data lake to generate analytics reports. You should have a data analysis component built into your platform and allow BI teams to use some dashboarding interface to access the data. 

On the flip side, you also don’t want to stifle innovation. There should be mechanisms in place to enable your R&D teams to look at new tooling or stray from the golden path in order to innovate for the business. These teams should be well aware that straying from the golden path may require more work, and the platform team should facilitate absorbing any new technology targeted for production use. This is really the product evolution of your platform. Engineers require more features to innovate and make the business successful and the platform team is the enabler. This means you have to be flexible and design for growth in the number of features available to the platform.

OK, so we are starting to see parts of the elephant but we really haven’t fleshed out what a platform is yet. I know, I know, cut to the chase. 

Don’t focus on greenfield

How often do you really start a new service? In most cases, this is rare, and you should not focus on this phase of the development cycle. It is much better to make incremental improvements on the processes and tooling used daily to maintain critical pieces of your software stack. Take one of your flagship products/services and ask the team responsible for it what they spend their time on. You may be surprised at the answers. This can provide great insight into where you should start improving the organic monster that grew all on its own.

What is this platform you speak of?

First, let’s clarify the difference between a complete platform and an internal developer platform. It is really as simple as this: an IDP is a component of the entire platform. You definitely need it. It allows your teams to develop features and ship them to production. But there are many more parts to a complete platform. I like to think of it as the ultimate technical manifestation of breaking down silos - with a twist.

With that in mind, start by enumerating your stakeholders and the data you need to make your business work. This is a simple exercise in the front-to-back design methodology. OK, so my stakeholders…let’s see. Well, we have developers who report to our CTO, security folks who report to our CISO, DevOps folks who report to our COO, finance folks who report to our CFO, and finally, business folks who ultimately report to our CEO. Well, that seems like a nice list to start with. At a minimum, we need interfaces for all these stakeholders to get at the data they need for reporting and decision-making. But we should also be wary of the technical expertise of each of these departments and thus provide as much bootstrapping as we can to make the tooling easy to use. As we discuss each of these groups, we try to define the minimal and best practice interfaces for each team. 

As an example, you probably don’t want to have your finance folks looking at raw AWS millisecond billing data. It would be much better to give them access to the AWS billing dashboard through the appropriate RBAC controls. While you're at it, you might also give them access to buy reserved instances so they can reduce spend for static workloads.

In other cases, such as SecOps interfaces, you may not need to build anything. The SecOps organization is often highly technical, with pre-built or purchased systems that just need to be integrated with. Here the job is to ensure that all the data they need gets shipped to them.

SecOps

When looking at a platform, it is vital to ensure that the CISO’s team is getting all the data they need to ensure the security posture of your organization. I would argue that it is the most critical aspect to address in your platform. The argument goes something like, “If we have a breach and share our clients/customers' PII, our reputation would be so tarnished that we will lose customers.” The ramifications of a breach may well extend far beyond just a tarnished reputation; there may be massive legal and financial consequences as well. 

These are the significant losses. But you may also be suffering death by a thousand cuts. Day-to-day attacks may be reducing your performance and costing you money. For these reasons, we start our discussion with the SecOps interface for a platform.

Static 

Two types of data need to be made readily (near real-time) available to your SecOps team. First, we will discuss the static data that needs to be available for gating and for determining where the security posture needs to improve. We should probably call these data “Just in Time” (JIT), as they are real-time during the SDLC process. This Static Security Data (SSD) comes from analyzing the actual source code of applications before the code ever goes live in any environment. This includes software composition analysis (SCA) and static application security testing (SAST) of your source code.

Best practice here is to include these data collection tools in your CI pipelines and ship the data to your SecOps team to visualize in their dashboards. Of course, this data can be visualized somewhere else. Still, the SecOps team is best served with a single pane of glass interface to review the organization's security posture.
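
As a rough illustration of gating on static security data, here is a minimal sketch in Python. It assumes the SAST and SCA tools have already run and their findings have been normalized into records with a severity field. The record format, the threshold values, and the function names are all hypothetical and do not correspond to any particular scanner. In practice, the thresholds would be owned by the SecOps team.

```python
from collections import Counter

# Hypothetical thresholds maintained by SecOps: max allowed findings per severity.
THRESHOLDS = {"critical": 0, "high": 0, "medium": 5}


def count_by_severity(findings):
    """Count findings per severity level (case-insensitive)."""
    return Counter(f["severity"].lower() for f in findings)


def evaluate_gate(findings, thresholds=THRESHOLDS):
    """Return (passed, violations); violations maps severity -> (count, limit)."""
    counts = count_by_severity(findings)
    violations = {
        severity: (counts[severity], limit)
        for severity, limit in thresholds.items()
        if counts[severity] > limit
    }
    return (not violations, violations)


if __name__ == "__main__":
    # Made-up findings, as if normalized from SAST and SCA tool output.
    sample = [
        {"tool": "sast", "severity": "high", "id": "example-1"},
        {"tool": "sca", "severity": "medium", "id": "example-2"},
    ]
    passed, violations = evaluate_gate(sample)
    print("gate passed" if passed else f"gate failed: {violations}")
```

A CI step could fail the build when the gate does not pass and still ship the full findings list to the SecOps dashboards either way.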

Dynamic

Once the code is live in an environment, you must provide the SecOps team with dynamic security data (DSD). This includes but is not limited to DAST, Pen Test results, continuous framework compliance measurements (SOC2, HiTrust, etc.), and network traffic data. This should also be shipped to the SecOps team, much of it ending up in the SIEM for the team to continuously monitor and mitigate any attacks on the system.

Some of this data is collected in the CD phase of your SDLC, while other data points will come from instrumentation in your services or the service containers used by your services. The important takeaway is that you need to enable your SecOps team to understand and react to the current security posture of your organization. If you want to do this efficiently, you should build all of this security boilerplate into your processes. This will reduce the cognitive load on your engineers and the toil your SecOps team spends collecting the data.

SRE (DevOps/Operations)

The next most important aspect of the platform to address for any organization is the operational capabilities. You should be centrally logging and monitoring your system and have the ability to visualize the results. You should also be able to trace your system, even if it is a modern distributed system. If you can’t keep the system running, you need to know where you need to spend your time and effort to make improvements. 

Being able to visualize metrics and debug the system end to end reduces the toil for both your DevOps engineers and your software engineers. It also helps to remove the friction and finger-pointing that often occurs when a service has trouble. Developers point fingers at the platform and DevOps engineers point fingers at the code. The organization is much better served if everyone can look at the same data and determine there is a problem, where it is, and what should be done about it.

A rigorous SRE framework is really an opportunity for business leaders to set guardrails around stability and performance. Do you need a hardening sprint? Should you EOL a service? SRE work really enables actionable decisions based on metrics about where you should ask your engineering teams to spend time. 
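
One common way to turn metrics into this kind of guardrail is an error budget check. The sketch below is only an illustration: the availability target, the request counts, and the rule that triggers a hardening sprint are assumptions made for the example, not recommendations from this article.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent in the window (may be negative)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # No failures are permitted: the budget is intact only if none occurred.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def needs_hardening_sprint(slo_target, total_requests, failed_requests, floor=0.0):
    """Hypothetical rule: prioritize stability work once the budget hits the floor."""
    remaining = error_budget_remaining(slo_target, total_requests, failed_requests)
    return remaining <= floor


if __name__ == "__main__":
    # Assumed numbers: a 99.9% target over one million requests in the window.
    remaining = error_budget_remaining(0.999, 1_000_000, 500)
    print(f"budget remaining: {remaining:.0%}")
    print("hardening sprint:", needs_hardening_sprint(0.999, 1_000_000, 1_500))
```

The value of a rule like this is that both developers and DevOps engineers read the same number, which supports the shared-data point made above.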

FinOps

This one may seem obvious to many folks, as it directly correlates with the overall revenue of the product. While it is perhaps obvious, I often see it handled very poorly in organizations. In most cases, I propose a pattern of “use what the CSP gives you but make sure everyone can see it.” Now, I am not suggesting that every engineer in your organization should have access to your CSP billing dashboard, full stop. Instead, I propose an approach where developers are allowed and trained to understand their billing costs. The goal here is to internalize FinOps within your organization, enabling micro-optimization and reducing the disruptive, sweeping changes that organizations have come to rely on for transformation. The exact separation of access here should be left up to the organization. Keep in mind that a powerful motivator for teams is to see other teams achieve better results.

The trick here is to make sure you know who owns what. We call that attribution of resources. Next, you need to establish Role/Attribute Based Access Controls (R/ABAC) for the granularity of data that will be displayed in the CSP’s cost dashboards. For instance, a developer may only be able to see the resources they own for running a small microservice. An engineering manager may need to see all parts of a service. At the top level, your CFO would probably like to see each product's cost to be able to compare it to the revenue it is generating. You would be surprised how often even large, successful organizations cannot directly attribute their cloud spend to their products.
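
To make the idea of attribution plus scoped visibility concrete, here is a minimal Python sketch. It assumes cost data has already been exported and attributed per resource. The `CostRecord` fields, the three role names, and the filtering rules are hypothetical stand-ins for whatever your CSP's access controls actually provide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRecord:
    resource_id: str
    owner: str       # engineer the resource is attributed to
    service: str
    product: str
    cost_usd: float


def visible_records(records, role, scope=None):
    """Filter records by a hypothetical role: developer, manager, or cfo."""
    if role == "developer":
        return [r for r in records if r.owner == scope]
    if role == "manager":
        return [r for r in records if r.service == scope]
    if role == "cfo":
        return list(records)
    raise ValueError(f"unknown role: {role}")


def cost_by_product(records):
    """Roll up cost per product, for example to compare against revenue."""
    totals = defaultdict(float)
    for record in records:
        totals[record.product] += record.cost_usd
    return dict(totals)


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        CostRecord("i-1", "ana", "checkout", "shop", 10.0),
        CostRecord("db-1", "raj", "checkout", "shop", 30.0),
        CostRecord("i-3", "lee", "ingest", "analytics", 20.0),
    ]
    print(cost_by_product(visible_records(records, "cfo")))
```

Note that the roll-up only works if every record carries an owner, service, and product, which is why attribution has to come first.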

One issue that is often overlooked here is training. The ease of use of the cloud has hidden the cost from the engineer. The fallacious line of thinking goes: if it is easy to use, it must also be cheap. You should be teaching engineers to consider cost as part of any exercise, and I recommend that it be included in standups, retrospectives, and goal setting.

Internal Developer Platform

So, you finally have your security, operations, and finances making sense in the cloud. You feel really great. But we are far from done. You need to grow revenue and retain and add users, and we all know that features are what users crave and what they will pay for. So, how do you get more features deployed faster? The answer, of course, is to optimize your SDLC approach.

The Internal Developer Platform (IDP) is the tool developers use daily to ship your features. Your features make you money. So, it is best for your business if this tool is easy to use for your developers while meeting the needs of your remaining stakeholders. Let me say that again: your SDLC needs to take care of any integrations needed for SecOps, SRE, and FinOps. Let’s give a few examples to hammer this home, as this is one of the biggest wins I see for many organizations.

You should have a standard template for all your CI/CD pipelines. How does this help? The standard template should include all the boilerplate needed to enable your stakeholder integrations. Examples include tasks for SSD in your CI pipeline template with gates set by thresholds maintained by the SecOps organization. Sophisticated large organizations should implement a tracking and exemption system to make sure issues that are discovered by these tools are assigned to be remediated. The DSD is a little more complex, but a great example would be to include requirements in CI for the injection of instrumentation tooling libraries into your source code or, better yet, manage a deployment manifest that runs security daemons as sidecars and requires all services to be derived from these base deployments. Adding compliance tooling, for instance, an admission controller to your orchestration system, helps prevent services from going live that do not meet your minimum compliance requirements.
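
The admission-style compliance check mentioned above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the manifest is a plain dictionary shaped loosely like a Kubernetes Deployment, and the sidecar name `security-agent` and the required labels are hypothetical. A real admission controller would run inside your orchestration system rather than as a standalone function.

```python
REQUIRED_SIDECARS = {"security-agent"}   # hypothetical security sidecar name
REQUIRED_LABELS = {"team", "product"}    # hypothetical attribution labels


def admission_errors(manifest):
    """Return a list of reasons to reject the manifest; an empty list means admit."""
    errors = []

    labels = manifest.get("metadata", {}).get("labels", {})
    missing_labels = REQUIRED_LABELS - set(labels)
    if missing_labels:
        errors.append(f"missing labels: {sorted(missing_labels)}")

    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    container_names = {c.get("name") for c in pod_spec.get("containers", [])}
    missing_sidecars = REQUIRED_SIDECARS - container_names
    if missing_sidecars:
        errors.append(f"missing sidecars: {sorted(missing_sidecars)}")

    return errors


if __name__ == "__main__":
    # A made-up manifest that lacks both the product label and the sidecar.
    manifest = {
        "metadata": {"labels": {"team": "payments"}},
        "spec": {"template": {"spec": {"containers": [{"name": "app"}]}}},
    }
    for error in admission_errors(manifest):
        print("rejected:", error)
```

Deriving every service from a base deployment that already includes the sidecar means most teams never see this check fail.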

You should have standard modules for Infrastructure as Code (IaC). Developers who need a DB shouldn’t be creating a stand-alone DB in the CSP’s GUI. They should be able to quickly add a resource to a system by just declaring what they need (type and size). This can be facilitated by any structured data file (YAML, JSON) that describes the type and size of the IaC templated resources. These workload definitions should also be revision controlled, in the same location as the source code.
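
Here is a minimal sketch of what "declare type and size" could look like, using JSON so the example needs only the Python standard library. The workload file layout, the `SIZE_CATALOG` entries, and the resolved parameter names are hypothetical. A real platform would pass the resolved values to its IaC modules rather than return them.

```python
import json

# Hypothetical size catalog owned by the platform team.
SIZE_CATALOG = {
    "database": {
        "small": {"vcpus": 2, "memory_gb": 8, "storage_gb": 50},
        "large": {"vcpus": 8, "memory_gb": 32, "storage_gb": 500},
    },
}


def resolve_resources(workload_json):
    """Expand declared resources (type and size) into concrete template parameters."""
    workload = json.loads(workload_json)
    resolved = []
    for res in workload["resources"]:
        sizes = SIZE_CATALOG.get(res["type"])
        if sizes is None:
            raise ValueError(f"unsupported resource type: {res['type']}")
        if res["size"] not in sizes:
            raise ValueError(f"unsupported size for {res['type']}: {res['size']}")
        resolved.append({"name": res["name"], "type": res["type"], **sizes[res["size"]]})
    return resolved


if __name__ == "__main__":
    # A made-up workload definition, stored next to the service's source code.
    definition = """
    {
      "service": "checkout",
      "resources": [{"name": "orders-db", "type": "database", "size": "small"}]
    }
    """
    print(resolve_resources(definition))
```

The developer only edits the small definition file, while the platform team controls what each size actually means.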

For FinOps, you should attribute resources to engineers, products, and teams. I should never have to ask, “What is this machine for?” This makes it much easier to make loosely coupled financial decisions without disrupting the day-to-day work of your engineering teams, which enables the business to be more dynamic without the constant in-fighting and self-discovery we often see in large organizations.
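
A simple audit can surface resources that nobody can explain. The sketch below assumes you already have a resource inventory as a list of dictionaries with a `tags` mapping. The tag keys and the inventory format are hypothetical and not tied to any specific CSP API.

```python
REQUIRED_TAGS = ("owner", "team", "product")  # hypothetical attribution tag keys


def unattributed(resources):
    """Map resource id -> list of attribution tags that are missing or empty."""
    report = {}
    for res in resources:
        tags = res.get("tags", {})
        missing = [key for key in REQUIRED_TAGS if not tags.get(key)]
        if missing:
            report[res["id"]] = missing
    return report


if __name__ == "__main__":
    # Made-up inventory for illustration only.
    inventory = [
        {"id": "vm-1", "tags": {"owner": "ana", "team": "payments", "product": "shop"}},
        {"id": "vm-2", "tags": {"owner": "raj"}},
        {"id": "vm-3"},
    ]
    for resource_id, missing in unattributed(inventory).items():
        print(f"{resource_id}: missing {missing}")
```

Running a report like this on a schedule, or enforcing the tags in the IaC modules themselves, keeps attribution from drifting.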

Business Analytics/Intelligence

It is essential to enable your business folks to play along as well. In most cases, these folks are far less technical than the folks building the platform or using the SDLC. It is crucial to provide these teams with data and analysis tools. Some organizations may opt to build a data lake and add tools like Tableau on top of it. This is a great approach, though I might build a data lakehouse instead, as unstructured data has become more and more relevant.

I am not going to be too prescriptive here at the moment because I am not a business analyst, and I don’t work for your organization, so in truth, I don’t know what you need. I will say you need to get the correct data to the correct people to facilitate business decisions: “How do we get more users?”; “Was the new feature a success?”; etc.

Generative AI

The use of LLMs to enable natural-language interfaces to data is almost surely required in any platform. I strongly suggest you consider how to enable this in your platform in a manner that supports your stakeholders. Examples of this include searching for PII, looking for exploitable CVE issues, and NLP-based business analytics. Imagine being able to ask your LLM integration if you should even build a feature!

We may not be there yet but this is the direction I see the field going, so you better prepare for it. 

Ack. Is there some sort of reference available?

I suggest a few references, but in general, the right architecture depends on your business needs and your future plans. I do think the reference work being done by Humanitec and McKinsey is quite interesting, and I have included links to their work below.

  1. https://humanitec.com/reference-architectures
  2. https://github.com/humanitec-architecture

Ack. Can’t I just buy a platform?

Not really. We see a few interesting off-the-shelf IDPs and a few Platform Orchestrators that you can buy, but nothing that seems to fit the model we discussed. I would encourage readers to have a look at the references provided below.

  1. https://www.qovery.com/blog/10-best-internal-developer-platforms-to-consider-in-2023
  2. https://platformengineering.org/blog/why-putting-a-pane-of-glass-on-a-pile-of-shit-doesnt-solve-your-problem
  3. https://platformengineering.org/blog/what-to-build-first-the-house-or-the-front-door

Callaway Cloud Consulting is a technology consulting company that specializes in Platform Engineering, Data Engineering, AI, and Salesforce solutions.

At our core we are innovators, problem-solvers, and data enthusiasts that believe the best solutions are people-led and tech-powered. 

You can schedule a meeting to talk about your platform, data, AI or Salesforce needs.

Visit us at www.callawaycloud.com
