Device Defender - Marion Summerville


Device Defender

AMAZON WEB SERVICES (AWS) INTERNET OF THINGS (IoT)

Summary

My role

As the design lead I directed (but did not manage) a team of 2 designers plus an external user research agency, and was accountable for both the quality and velocity of the design team. I partnered with the Product Management Lead and Engineering Manager throughout the process to define and refine product requirements.

In addition to managing the relationship with the external agency, I was a hands-on contributor to process and requirements definition, interaction design, and UI design.

Internet of Things (IoT) fleets are difficult to secure on an ongoing basis and are an attractive target for hackers. Amazon Web Services (AWS) Device Defender provided tools to identify security issues and deviations from best practices:

Continuous auditing
Real-time detection and alerting
Fast investigation and mitigation

Design team

Design lead (me)
Sr UX designer
Jr UX designer
External agency for user research

TL;DR

WHO: The user

IoT Fleet Managers, Security Operations, and Security Architects responsible for a large fleet of IoT devices.

WHY: The problem

IoT fleets are difficult to secure on an ongoing basis and are an attractive target for hackers. Existing network security tools were insufficient for the specific threat profiles of internet-connected devices.

HOW: The process

Co-design with expert users across Amazon within a Scrum framework. This project was a pilot for a cross-business-unit UX quality control and approval process.

WHAT: The solution

The first two modules (Audit and Detect) of an IoT-native security solution (Device Defender) integrated into the AWS IoT platform UX.

Business problem

IoT fleets often consist of large numbers of devices that have diverse capabilities, are long-lived, and are geographically distributed. These characteristics make fleet setup complex and error-prone. And since devices are often constrained in computational power, memory, and storage capabilities, this limits the use of encryption and other forms of security on the devices themselves. Also, devices often use software with known vulnerabilities. The combination of these factors makes IoT fleets an attractive target for hackers and makes it difficult to secure them on an ongoing basis.

About Device Defender

AWS IoT Device Defender enables customers to secure their IoT fleets on an ongoing basis by providing them the tools to identify security issues and respond to security breaches quickly before they cascade to other devices within their fleet.

Continuous auditing
Monitors device-related policies to ensure proper security settings are in place. Device Defender detects any drift from security best practices or access policies. Customers can run audits on an as-needed basis or schedule them to run periodically.

Fast investigation and mitigation
Enables customers to investigate alerts by providing contextual information about the alert such as device information, diagnostic logs and historical alerts for the device.

Real-time detection and alerting
Detects changes in connection patterns, devices communicating with unauthorized or unrecognized endpoints, and changes in inbound and outbound device traffic patterns.

Device Defender integrates with the AWS IoT Connected Device Management (CDM) service, allowing customers to perform actions such as revoking permissions, rebooting a device, resetting it to factory defaults, or pushing security fixes.

Device Defender: Audit

Audit monitors device-related policies to ensure proper security settings are in place.

Best practices
Device Defender Audit runs a set of pre-defined rules — mapped to common IoT security best practices and vulnerability definitions — against a customer’s fleet.

Focus on noncompliant resources
Security professionals care most about active or emerging issues. Device Defender Audit identifies noncompliant resources and provides context for the scale of the issue.

Continuous auditing
Continuous auditing ensures that the current security posture of the device fleet is known, good, and trusted. Customers can run audits on an as-needed basis or schedule them to run periodically.

Mitigation
Device Defender Audit identifies noncompliant resources and enables the user to investigate down to the individual resource. For each type of compliance issue, Audit provides a suggested mitigation action.

Device Defender: Detect

Detect monitors the fleet and identifies abnormal behavior on devices, and provides device management tools to investigate and mitigate security issues.

Define security profiles
Security profiles contain a set of behaviors that describe how the device should be operating, e.g., “packets out less than 150 bytes in 5 minutes.”

Detect anomalies and publish alerts
Once Detect is configured to monitor a group of devices, it starts recording security-related attributes (e.g., connection attempts, bytes/sec/protocol, set of open ports). When a violation occurs, the system sends an alert via CloudWatch events, SNS notifications, or 3rd party systems.

Assign profiles to groups of devices
Security profiles are attached to new or existing groups of devices. Devices in those groups will be monitored for compliance to the behaviors defined in the attached security profiles. Security profiles may also be attached to the account, where they will apply to all devices in the fleet.
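The attachment rules above can be sketched as a small lookup: a device is monitored against every profile attached to the account plus every profile attached to any of its groups. This is a minimal illustration, not the service's implementation; the function name, the data shapes, and every profile and group name other than traffic-shanghai are hypothetical.

```python
from typing import Dict, List, Set


def effective_profiles(
    device_groups: Set[str],
    group_attachments: Dict[str, List[str]],
    account_profiles: List[str],
) -> List[str]:
    """Return the security profiles a device is monitored against.

    Account-level profiles apply to every device; group-level profiles
    apply only to devices in that group. Duplicates are dropped.
    """
    profiles = list(account_profiles)
    # Sort group names so the result order is deterministic.
    for group in sorted(device_groups):
        for profile in group_attachments.get(group, []):
            if profile not in profiles:
                profiles.append(profile)
    return profiles


# Hypothetical attachments for illustration only.
group_attachments = {
    "humidifiers-shanghai": ["traffic-shanghai"],
    "humidifiers-all": ["ports-baseline"],
}
account_profiles = ["fleet-wide-baseline"]

result = effective_profiles(
    {"humidifiers-shanghai", "humidifiers-all"},
    group_attachments,
    account_profiles,
)
print(result)
```

A device that belongs to no group is still covered by the account-level profile, which is why attaching a profile to the account is a convenient fleet-wide baseline.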

Investigate and mitigate device behavior
Alerts can contain device information, device statistics (e.g. last connection time, number of active connections, data transfer rate), and historical events for the device. Within the console, the operator can both view broad trends and drill down into the details of a specific device.

Personas

There are eight members of the IoT Platform Persona Family; three of them are relevant to Device Defender.

Fleet Manager

The Fleet Manager keeps the fleet of devices updated and operating smoothly.

Security Ops

Security Ops makes sure everything is functioning correctly from a security perspective and sounds the alarm when it’s not.

Security Architect

The Security Architect is responsible for maintaining the security of the systems. They must anticipate all of the moves and tactics that hackers will use to try to gain unauthorized access.

How personas interact

The Fleet Manager is the first line of defense. They monitor all aspects of the fleet and can escalate quickly to Security Ops if they see something concerning.

The relationship between Security Ops and the Security Architect is more complex. The Security Architect sets up the security parameters while Security Ops monitors devices and remediates issues. The Security Architect also serves as an expert consultant and escalation path for issues.

Walkthrough

The Device Defender product is immensely complex. In this case study I elide most of the Audit functionality and focus on only two of the six core tasks of the Detect module.

Glossary

Alert: a notification of an anomaly in device setup (e.g., firmware version), platform action (e.g., status of firmware update), or behavior (e.g., a spike in outgoing packets). Alerts can be delivered via the AWS IoT platform GUI, AWS CloudWatch events, SNS, and integration with 3rd party tools like PagerDuty. Post-alert activities can be automated via AWS Lambda.
Behavior: parameters that describe how a device should behave. A behavior definition includes a metric, an operator, a value, and, depending on the metric, a duration. For example, a behavior could be “bytes out < 200 in 30 minutes”; a device that emits more data than that could be part of a DDoS attack.
Security profile: a set of behaviors plus one or more alert destinations.
Thing: a single IoT device
Thing group: a set of Things that are defined by the user. Thing groups may be based on parameters (dynamic) or a curated set of devices (static). In Device Defender each group should contain devices that are expected to have similar behavior, as security profiles are attached to Thing groups.
Violation: an instance where a metric (cloud-based or device-side) for a device does not match the expected behavior defined by an associated security profile. Violations generate alerts.
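To make the glossary concrete, here is a minimal sketch of how a behavior (metric, operator, value, optional duration) and a violation relate. The class name, field names, and operator spellings are assumptions made for illustration and are not taken from the service.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Hypothetical operator spellings; the real service may name these differently.
OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


@dataclass(frozen=True)
class Behavior:
    metric: str
    op: str
    value: float
    duration_minutes: Optional[int] = None  # only some metrics need a duration

    def is_violated(self, observed: float) -> bool:
        """True when the observed value does NOT match the expected behavior."""
        return not OPERATORS[self.op](observed, self.value)


# The glossary example: "bytes out < 200 in 30 minutes".
bytes_out = Behavior(metric="bytes out", op="<", value=200, duration_minutes=30)

print(bytes_out.is_violated(150))   # within the expected behavior
print(bytes_out.is_violated(5000))  # far more data emitted than expected
```

In this sketch the observed value is assumed to be already aggregated over the behavior's duration; how the service performs that aggregation is outside what this case study describes.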

Audit

This is the system flow for the Audit module. The product is conceptually simple: Security Ops creates an audit by selecting the security checks they want performed and then setting the schedule for the Audit to run.

Detect

There are six core user tasks within the Detect module.

Onboard & configure – Security engineer [primary task]

Task 1: Create a security profile

Monitor & Investigate – Security operations [primary task]

Task 2: Investigate violations

Update configuration – Security engineer [secondary tasks]

Task 3: Edit a security profile
Task 4: Attach a security profile to a Thing group
Task 5: Delete a security profile
Task 6: Duplicate a security profile

For brevity, I’ve included only the primary tasks for the Security Architect and Security Operations.

Task 1: Create a security profile (Security Architect)

Creating a security profile requires the following steps:

Add behaviors
Configure notifications
Name and save
Attach the security profile to one or more Thing groups

The Security Architect’s journey begins on the AWS IoT platform’s dashboard / home page.

They navigate to the Detect module. For this walkthrough this is the first use of Device Defender so the post-tour landing page prompts the user to create a security profile.

They give the profile a name and description, and begin defining their first behavior. They choose the metric (in this case, Bytes out), the operator, the value, and the duration.

This profile requires multiple behaviors, so they click the button to add another.

They can also define behaviors in JSON if they prefer.
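As a rough idea of what a JSON behavior definition can look like, the sketch below builds one in Python and prints it. The field names (name, metric, criteria, comparisonOperator, value.count, durationSeconds) and the metric identifier aws:all-bytes-out are modeled on the publicly documented AWS IoT Device Defender behavior format and should be treated as assumptions to verify against the current API reference; the JSON editor in the console at the time may have differed. The helper function and the behavior name are hypothetical.

```python
import json


def behavior_to_dict(name, metric, comparison, count, duration_seconds):
    """Build one behavior entry (field names are assumptions, see above)."""
    return {
        "name": name,
        "metric": metric,
        "criteria": {
            "comparisonOperator": comparison,
            "value": {"count": count},
            "durationSeconds": duration_seconds,
        },
    }


# "Bytes out less than 200 in 30 minutes", expressed as data.
behaviors = [
    behavior_to_dict("bytes-out-limit", "aws:all-bytes-out", "less-than", 200, 1800),
]

behaviors_json = json.dumps(behaviors, indent=2)
print(behaviors_json)
```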

Once they’re finished with behaviors, they click Next to configure an SNS alert by choosing a topic and an IAM role.

The next step is to attach the profile to either a specific Thing group or to all devices.

The creation of a Thing group has multiple steps (define group attributes, find relevant Things, add Things to the group) so we’ll elide that process and skip to where groups have been selected. The Security Architect confirms the security profile settings, and a new security profile has been created and attached to the 4 selected Thing groups.

They navigate to the Security Profiles hub, from which they can create the remaining security profiles necessary to configure Detect for their needs.

Task 2: Investigate violations (Security Operations)

The specific actions taken to investigate a violation vary widely, but generally there are four steps:

Review the alert summary
Explore violation details
Review the specific device(s)
Take mitigating action(s) on the device(s)

Security Operations’ journey begins on the AWS IoT platform’s dashboard / home page. They see that there are active alerts and expand the Alerts panel. (Alerts did not exist in the IoT platform prior to Device Defender; we added the icon and created the panel as a new interaction pattern in the IoT design system library.) They see that there are 18 violations against the security profile traffic-shanghai and click on the link.

This opens the details page for the security profile, displaying the “Now” tab Violations section. They see 17 Things are in alarm, and hover over the timestamp for the Thing humidifier-88 to see that it has been in alarm for 2 minutes and 45 seconds.

They hover over the behavior to remind themselves of the underlying rule that is being violated. They click on the “History” tab and see a summary of the Violation events for this security profile. They return to the “Now” tab and click on the Thing name.

The Thing details page loads in the same context (Violations > Now). The Thing humidifier-88 is actually violating three behaviors on two different security profiles. They click on the “History” tab and see the prior pattern of violations for that single Thing. It’s important that Security Operations check both the profile that is violated and the individual violating Things, as root causes can be complex. They return to the Violations hub (Violations > History) and filter on the traffic-shanghai security profile.

In this Hub view they can interact with the data either through filters or by clicking on the timeline in the visualization; the system keeps both views in sync.
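One simple way to keep a timeline and a table in sync is to derive both from a single shared view state, so a filter change or a click on the timeline updates the same object. The sketch below illustrates that idea only; it is not how the console was built, and the class names, fields, and sample data (apart from traffic-shanghai and humidifier-88) are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Violation:
    thing: str
    profile: str
    minute: int  # minutes since the start of the displayed period


@dataclass(frozen=True)
class ViewState:
    profile: Optional[str] = None             # set by the table filter
    window: Optional[Tuple[int, int]] = None  # [start, end) set by a timeline click


def visible(violations: List[Violation], state: ViewState) -> List[Violation]:
    """Rows shown by BOTH the timeline and the table for a given state."""
    rows = []
    for v in violations:
        if state.profile is not None and v.profile != state.profile:
            continue
        if state.window is not None and not (state.window[0] <= v.minute < state.window[1]):
            continue
        rows.append(v)
    return rows


violations = [
    Violation("humidifier-88", "traffic-shanghai", 3),
    Violation("humidifier-88", "ports-baseline", 12),
    Violation("humidifier-12", "traffic-shanghai", 42),
]

# Filtering the table on a profile...
by_profile = ViewState(profile="traffic-shanghai")
# ...and then clicking the first 30 minutes of the timeline narrows the same state.
narrowed = replace(by_profile, window=(0, 30))

print([v.thing for v in visible(violations, by_profile)])
print([v.thing for v in visible(violations, narrowed)])
```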

Based on their analysis, Security Operations can take various actions at either the security profile level from the security profile detail page (e.g., upgrade firmware, take all devices offline) or at the individual Thing level (e.g., reboot device) from the Thing detail page.

Process

Discover

Device Defender was a project defined by AWS leadership based on a market opportunity. Foundational user research specific to the security space was conducted by an external agency before a design team was assigned to the project.

Research questions

How do customers manage security today?
What are their mental models around security audits?

Findings

Security professionals think about security as binary: are we secure or not? KPIs and other business metrics are less relevant.
Visualizations should be designed to identify issues that need to be resolved, and must provide sufficient data to make resolution as quick and easy as possible.

Iterate

Co-design and evaluative research

While we (the AWS design team in partnership with the external agency) conducted cycles of co-design and evaluative user research for both Audit and Detect, for brevity I’ve included only some of the findings for the Detect module.

The research questions were simple: Does this service meet your needs? Does it work the way you expect?

Themes

What is normal? When investigating an event, understanding “normal” helps set context for a violation. Identifying a device that goes from 100 packets every 5 minutes to 100 packets a minute before exceeding the threshold of 500 packets over 5 min is only possible if the non-violating data is retained.
Which groups are involved? A blast radius is most commonly measured by Thing group. One or two devices acting up isn’t a big deal. Knowing whether a threat involves 50 devices across 13 Thing groups or 50 devices in 1 Thing group changes the way it is investigated.
When did this last happen? Large gaps between threats are desired and expected. There could be gaps of more than a month between similar violations if security professionals are doing a good job. Reviewing prior similar events is an important component of investigation.
What’s the pattern? Participants wanted to navigate their data in the user interface, and had particular expectations about data visualization. Their focus in this tool is violation triage; mitigation actions could vary widely.
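The “Which groups are involved?” theme comes down to a simple aggregation: count violating devices per Thing group and look at how many groups are touched. The sketch below uses made-up device and group names to contrast the two situations participants described; the function and data shapes are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def blast_radius(
    violating_things: Iterable[str],
    thing_to_groups: Dict[str, Set[str]],
) -> Counter:
    """Count violating devices per Thing group."""
    counts = Counter()
    for thing in violating_things:
        for group in thing_to_groups.get(thing, {"(ungrouped)"}):
            counts[group] += 1
    return counts


# 50 violating devices concentrated in 1 group...
concentrated = {f"thing-{i}": {"group-0"} for i in range(50)}
# ...versus 50 violating devices spread across 13 groups.
spread = {f"thing-{i}": {f"group-{i % 13}"} for i in range(50)}

concentrated_radius = blast_radius(concentrated, concentrated)
spread_radius = blast_radius(spread, spread)

print(len(concentrated_radius), "group(s) affected")
print(len(spread_radius), "group(s) affected")
```

The device count is identical in both cases; only the number of affected groups differs, which is the signal participants said changes how they investigate.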

Applying research findings

We tracked findings and how user feedback changed our solution, and layered these insights and impacts into our persona task matrix.

We also acknowledged larger issues uncovered in user feedback and identified how they will be addressed in future.

Finding: Every participant successfully completed all tasks, and reported minimal friction. The core mental models (security profile, behavior, profile attached to a group, violation event) were well understood.
Impact: Changes to the UX based on research findings were straightforward refinements; most of the feedback will be used to define and prioritize future enhancements of the tool.

Finding: Participants requested additional flexibility in defining behaviors, including conditional operators (AND/OR) and custom metrics and values.
Impact: Custom metrics were already in the backlog. Currently, behaviors are evaluated as independent entities; conditional operators will be further evaluated post-GA.

Finding: In AWS IoT, customers organize their fleets via Thing groups. Participants expected to use Thing groups as a dimension by which to evaluate relationships between violations.
Impact: Given foundational technical constructs of the AWS IoT service, integrating an awareness of Thing groups into Detect would be non-trivial. Deeper investigation is required.

Finding: For every data table they encountered, participants expected to be able to sort, search, and filter by every column.
Impact: Additional APIs to provide filter, sort, and search capabilities are in the post-GA backlog.

Finding: Participants expected long-term access to every byte of data any device in their fleet had emitted.
Impact: Due to operating costs, the Detect service is structured to store only data related to behavior violations, and only for a relatively short period of time. This may be an opportunity for a premium offering. Deeper investigation is required.

Approvals

Device Defender was the first large project to go through the AWS Design Leadership’s QA process, and was specifically selected as a pilot because of my seniority and prior experience at IBM.

There were four checkpoints in the process, with the Design Leadership board enforcing a go / no go release gate based on a final fit and finish review on the live system. Each checkpoint had a structured agenda; while PM and Engineering were encouraged to attend, the Design Lead was accountable for presenting their service to the Board.

The Design Leadership board for any one product was composed of the design system and content leads, plus at least two design managers and two design leads. The same group of people reviewed every step of the process for a product. I served on the board for the initial launch of Elastic Container Service (ECS) and a major enhancement to Athena.

Deep Dive

The Deep Dive presentation was intended to give the Design Leadership board context on the project: its business value, personas, and user value, as well as a summary of any discovery work thus far. The Deep Dive was mandatory only for new Tier 1 products (high priority with significant strategic value), although it was encouraged for major new functionality in existing Tier 1 and Tier 2 products. Device Defender was a Tier 1 product.

Agenda

Anticipated timeline
Product summary
Persona(s)
User research
User journey

UX Sign Off

UX Sign Off was intended to be the gate before significant development work began on the product. In reality, most Tier 1 products were very far along in the development cycle, as design teams tended to be assigned after a proof of concept or alpha product was already built.

In practice, the UX Sign Off typically generated mandatory and optional feedback that required either a follow-up presentation (for major issues) or email responses (for minor issues). All major releases of a product were required to pass a UX Sign Off before the Fit & Finish review was scheduled.

Agenda

Project timeline
PR/FAQ
Persona(s)
User journey
Updates [from research or review feedback]
Demo [of core user tasks, in InVision]

Below are some examples of feedback (re: the Violations Hub) from the Deep Dive that were addressed for UX Sign Off:

Feedback: The list view should have a time window attached to it.
Outcome: The views of violations have been refactored to explicitly identify the time window for the displayed violations.

Feedback: [In the data viz] The default view has “today” highlighted – the highlight state should match the selected time window.
Outcome: User research identified that any interaction with the data viz should be reflected in the data table below (and vice versa), including filters on time and content.

Feedback: Once a user has zeroed in on a specific device or time window, the system should support the transfer of the contextual data as the user navigates away from this page to take action.
Outcome: To the extent possible, the interaction model has been refined to maintain context (e.g., resource details, time windows, filter selections) across pages within the UI. As additional APIs are developed post-GA, transfer of context will be enhanced.

Fit & Finish

The Fit & Finish review was a hard gate on product release. While the Design Lead presented the front matter, the core of the presentation was a demo of the live system on staging by the Engineering Lead or Product Manager. Any issues identified in the Fit & Finish review were either remediated before launch (as proved via video clips) or put in the backlog as mandatory enhancements for the next release.

Agenda

Introduction
Persona(s)
User journey
User tasks
Scope changes after UX Sign-off
Known issues
Live demo of core user tasks