Sunday, 28 March 2021

Apple Thunderbolt NHI kext Big Sur reboot issue fix

 Normal boot

sudo fdesetup disable
fdesetup status => should read FileVault is Off

Boot in recovery mode

csrutil disable
csrutil authenticated-root disable
mount -uw /Volumes/"SYSTEM_HDD_NAME"
cd /Volumes/"SYSTEM_HDD_NAME"/System/Library/Extensions
mv AppleThunderboltNHI.kext AppleThunderboltNHI.kext.BAK
rm -rf /Volumes/"SYSTEM_HDD_NAME"/System/Library/Caches/*
kmutil install -u --force --volume-root /Volumes/"SYSTEM_HDD_NAME"/System/Library/Extensions
bless -folder /Volumes/"SYSTEM_HDD_NAME"/System/Library/CoreServices --bootefi --create-snapshot

Boot normally

Done

Symptoms

  • You are working and suddenly the machine powers off.
  • The Apple logo (and maybe the keyboard backlight) stays on for a few seconds.
  • Then the machine powers off.
  • When you press the power key, the machine boots normally and no error message appears.

If this has happened to you, you are probably at the right place.

What happened?

For certain MacBooks, there is a faulty component on the logic board (see this Stack Exchange post, this MacRumors thread, and this change.org petition) that forces the machine into deep sleep.

Replacing this component requires skill, time and money that not everyone has. If your MacBook is under warranty, I would recommend going to the nearest Apple Store to get it fixed (in some cases that alone has not been enough, and the solution presented here was also required).

Otherwise this guide is for you.

Here, we are going to prevent the file AppleThunderboltNHI.kext from being loaded during boot. This file is the driver for Ethernet over the Thunderbolt connector. It is also faulty on certain MacBooks and contributes to the sudden-sleep behaviour by wrongly changing the voltage of the CPU.

WARNING

This solution permanently requires you to deactivate SIP and FileVault. There is no other way around it until Apple does something about it.

Detailed solution

Here is the cooking recipe for solving the problem under Big Sur:

  1. Boot normally

  2. Play a video in the background, for instance this one, to keep the CPU busy while we work.

  3. Open a Terminal

  4. Deactivate FileVault as follows:
     sudo fdesetup disable
    

    Press the “Enter” key to execute the command.

    You can get your username using the command whoami; your password is the session password.

  5. Now we have to wait for the hard drive to be decrypted. To check on the progress, you may run from time to time:

     fdesetup status
    

    For those using Homebrew, I recommend using watch fdesetup status.
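
    If you don't have the watch utility installed, a plain shell loop works as well. This is just a minimal sketch that polls once a minute until the output matches the "FileVault is Off" message mentioned below:

     while ! fdesetup status | grep -q "FileVault is Off"; do
         fdesetup status   # print the current decryption progress
         sleep 60
     done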

  6. When the decryption process has reached 100% or when you read FileVault is Off, we can move on.

  7. Reboot into recovery mode (hold down "cmd" + "R" keys when the machine powers up)

    You should see the following screen at some point

    Recovery Mode Window

  8. On the top menu bar, go to Utilities > Terminal

  9. Now we are going to turn off SIP:
     csrutil disable
     csrutil authenticated-root disable
    
  10. We are going to mount the root HDD

    If you don’t know what is your root HDD, you can run :

    diskutil apfs list
    

    Your root HDD is the one whose Role in the APFS Volume Disk label reads “(System)”

    Root HDD

    To mount the HDD in write mode, use the following command, substituting SYSTEM_HDD_NAME with the name of your root HDD:

    mount -uw /Volumes/"SYSTEM_HDD_NAME"
    

    In the previous image, SYSTEM_HDD_NAME was “HD_macbook”. To avoid issues with spaces in a name, use quotes “” around the name, for instance “HD Macbook”.

    You can also escape spaces using a backslash -> HD\ Macbook

  11. Now let’s move to the folder /Volumes/”SYSTEM_HDD_NAME”/System/Library/Extensions, deactivate the faulty driver, and clear the caches. Don’t forget to replace SYSTEM_HDD_NAME with the actual name of your HDD:

    cd /Volumes/"SYSTEM_HDD_NAME"/System/Library/Extensions
    mv AppleThunderboltNHI.kext AppleThunderboltNHI.kext.BAK
    rm -rf /Volumes/"SYSTEM_HDD_NAME"/System/Library/Caches/*
    
  12. We have to update the kext cache and select the snapshot for the next boot:
    kmutil install -u --force --volume-root /Volumes/"SYSTEM_HDD_NAME"/System/Library/Extensions
    bless -folder /Volumes/"SYSTEM_HDD_NAME"/System/Library/CoreServices --bootefi --create-snapshot
    
  13. We will not re-enable SIP. Re-enabling SIP after modifying operating system files has proven to create boot loops.

  14. You are done and can reboot in normal mode.

  15. (Optional) reactivate FileVault (may not work)

    sudo fdesetup enable
    

    Or from the System Preferences GUI.

  16. If you install any OS updates, it is very likely that you will need to redo all the steps.

  17. To check that it worked, you can have a look at About This Mac > System Report > Hardware > Thunderbolt, where you should read “No driver loaded”. You can also run the following command

    kextstat | grep NHI
    

    and you should not see any mention of “AppleThunderboltNHI.kext”

Resources

Thursday, 14 January 2021

Multi-Account Log Aggregation in AWS for Observability and Operations - Part 2: Implementation

 



In my previous blog, we discussed three different ways of aggregating and processing logs from multiple accounts within AWS. These methods were:
1. CloudWatch Logs plus Lambda Method
2. CloudWatch Logs plus AWS SQS (Simple Queue Service) Method
3. CloudWatch Logs plus AWS Kinesis Method

After analyzing the pros and cons for different scenarios, we concluded that Method #3 is ideal for most customers with more than two accounts.

In this blog, I will walk through the step-by-step process of setting up Method #3 for aggregating logs.

Overview of Method #3 - Cloudwatch Logs plus AWS Kinesis

Before we start the setup, let’s take a quick look at the architecture for Method #3.

Method_3_with_aws_kinesis

The following resources will be used during the setup:

  1. AWS VPC Flow Logs
  2. AWS CloudTrail
  3. AWS GuardDuty
  4. AWS CloudWatchLogs
  5. Amazon Kinesis Stream
  6. Amazon Kinesis Firehose
  7. AWS Lambda
  8. AWS S3 / RedShift
  9. Amazon ElasticSearch

Steps

Let’s now go through one step at a time. As I demonstrate these steps, I will be using a combination of AWS CLI and the AWS Web Console.

NOTE: Not all features can be configured from the AWS Web Console.

Initial Master Account Setup

Step 1: Create ElasticSearch Cluster (Master/Logging Account)

Refer to this article to setup your ElasticSearch cluster in the MASTER/CENTRALIZED LOGGING account.

Step 2: Create S3 buckets (Master/Logging Account)

Create two S3 buckets in the master account.

The first S3 bucket would be for collecting logs processed by Kinesis Firehose (described later) as well as logs that failed the log processing stage. I will call this bucket demo-logs-s3. DO NOT ATTACH any additional policy to this bucket.

The second S3 bucket would be for backing up/collecting all the cloud-trail logs from member accounts to the Master/Logging account. I will call this bucket cloudtrail-all-accounts-demo.

The cloudtrail-all-accounts-demo bucket needs a bucket policy that allows member accounts to write to this bucket.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AWSCloudTrailAclCheck20131101",
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudtrail.amazonaws.com"
      },
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::cloudtrail-all-accounts-demo"
    },
    {
      "Sid": "AWSCloudTrailWrite20131101",
      "Effect": "Allow",
      "Principal": {
        "Service": "cloudtrail.amazonaws.com"
      },
      "Action": "s3:PutObject",
      "Resource": [
        "arn:aws:s3:::cloudtrail-all-accounts-demo/AWSLogs/MEMBER_ACCOUNT_ID_1/*",
        "arn:aws:s3:::cloudtrail-all-accounts-demo/AWSLogs/MEMBER_ACCOUNT_ID_2/*"
      ],
      "Condition": { 
        "StringEquals": { 
          "s3:x-amz-acl": "bucket-owner-full-control" 
        }
      }
    }
  ]
}
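
A minimal sketch of the same setup from the AWS CLI, assuming the policy above is saved locally as cloudtrail_bucket_policy.json (a file name chosen here for illustration) and that us-west-2 is your region:

# Create the two buckets in the Master/Logging account
aws s3api create-bucket --bucket demo-logs-s3 --region us-west-2 --create-bucket-configuration LocationConstraint=us-west-2
aws s3api create-bucket --bucket cloudtrail-all-accounts-demo --region us-west-2 --create-bucket-configuration LocationConstraint=us-west-2

# Attach the CloudTrail bucket policy shown above to the second bucket
aws s3api put-bucket-policy --bucket cloudtrail-all-accounts-demo --policy file://cloudtrail_bucket_policy.json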

Step 3: Setup Kinesis Data Stream (Master/Logging Account)

In the Master/Logging account, navigate to Services > Kinesis > Data Streams > Create Kinesis Stream .

The number of shards that you need to provision depends on the size of logs being ingested. There is an evaluation tool available on AWS that helps you with estimation.

create_kinesis_stream_1

It’s always a best practice to review the Data Retention Period for the Kinesis Stream. The default retention period is 24 hours and the maximum is 7 days. To modify this, you can edit the stream created above and update it.

To perform the same operation from CLI, run

aws kinesis create-stream --stream-name Demo_Kinesis_Stream --shard-count 4

Use this command to increase the retention period

aws kinesis increase-stream-retention-period --stream-name Demo_Kinesis_Stream --retention-period-hours 168

Step 4: Setup Kinesis Firehose (Master/Logging Account)

Select the Kinesis Data Stream created in Step 3 and click on Connect Kinesis Consumers > Connect Delivery Stream

create_firehose_1

Moving on to the Process records step, you can also set up a data transformation function that will parse the incoming logs so that only the important ones are analyzed. Click on Enabled and choose a Lambda function that will do this transformation.

NOTE: Kinesis Firehose expects a particular data format. Refer here for more info.

create_firehose_2

If you don’t have an existing Lambda function to do this, then click on Create New and select Kinesis Firehose Process Record Streams as source.

Change the Runtime for the Lambda to Python 3.6 and click Next.

Use the code from this repository within your Lambda function.

import base64
import gzip
import json

def cloudwatch_handler(event, context):
    output = []

    for record in event['records']:
        # CloudWatch Logs data arrives base64-encoded and gzip-compressed
        compressed_payload = base64.b64decode(record['data'])
        uncompressed_payload = gzip.decompress(compressed_payload)
        print('uncompressed_payload', uncompressed_payload)
        payload = json.loads(uncompressed_payload)

        # Drop messages of type CONTROL_MESSAGE (sent by CloudWatch Logs to check the subscription)
        if payload.get('messageType') == 'CONTROL_MESSAGE':
            output.append({
                'recordId': record['recordId'],
                'result': 'Dropped'
            })
            continue

        # Do custom processing on the payload here

        # Re-encode the (possibly transformed) payload for Firehose
        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(json.dumps(payload).encode('utf-8')).decode('utf-8')
        })

    print('Successfully processed {} records.'.format(len(event['records'])))

    return {'records': output}

Also, modify the Timeout under Basic Settings as shown below.

create_firehose_3

Now, go back to the Kinesis Firehose setup page and select the lambda function. In this step, we won’t be converting the record format as we want to send logs to ElasticSearch and S3.

Next, we need to select a destination to send these processed logs to. In this example, I will be sending it to Amazon ElasticSearch Service.

From the drop-down, select the ElasticSearch cluster created in Step 1. Also, select the S3 bucket created in Step 2 to back up the logs for future use. This might be necessary for regulatory and compliance reasons.

create_firehose_4

In the final step, complete the ElasticSearch configuration as per your environment and resilience needs, and create a new IAM role that allows Kinesis Firehose to write to the ElasticSearch cluster. Then review the summary and create the delivery stream.
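
For a scripted setup, roughly the same delivery stream can be created from the CLI. This is only a rough sketch with assumed names (Demo_Firehose, firehose_delivery_role, the demo-es-cluster domain and the logs index are placeholders); the Lambda transform configured above would still need to be attached via the ProcessingConfiguration, which is easiest to do in the console:

aws firehose create-delivery-stream \
  --delivery-stream-name Demo_Firehose \
  --delivery-stream-type KinesisStreamAsSource \
  --kinesis-stream-source-configuration "KinesisStreamARN=arn:aws:kinesis:us-west-2:<MASTER_ACCOUNT_ID>:stream/Demo_Kinesis_Stream,RoleARN=arn:aws:iam::<MASTER_ACCOUNT_ID>:role/firehose_delivery_role" \
  --elasticsearch-destination-configuration "RoleARN=arn:aws:iam::<MASTER_ACCOUNT_ID>:role/firehose_delivery_role,DomainARN=arn:aws:es:us-west-2:<MASTER_ACCOUNT_ID>:domain/demo-es-cluster,IndexName=logs,S3BackupMode=AllDocuments,S3Configuration={RoleARN=arn:aws:iam::<MASTER_ACCOUNT_ID>:role/firehose_delivery_role,BucketARN=arn:aws:s3:::demo-logs-s3}"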

Step 5: Create and Set policies in Master/Logging Account to allow data to be sent from Member Accounts

  • Create a trust policy (cwltrustpolicy.json) that allows CloudWatch Logs to assume the role

Note: DO NOT USE IAM CONSOLE TO DO THIS

{
  "Statement": {
    "Effect": "Allow",
    "Principal": { "Service": "logs.region.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }
}

Run this in Master/Logging account. You can use profiles in AWS CLI to manage your credentials for different accounts.

aws iam create-role --role-name cwrole --assume-role-policy-document file://cwltrustpolicy.json
  • Create a policy (cwlpermissions.json) to allow CloudWatchLogs to write to Kinesis
{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "kinesis:PutRecord",
      "Resource": "arn:aws:kinesis:region:<MASTER/LOGGING ACCOUNT ID>:stream/Demo_Kinesis_Stream"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<MASTER/LOGGING ACCOUNT ID>:role/cwrole"
    }
  ]
}
  • Associate the above policy to cwrole (in Master Account)

    aws iam put-role-policy --role-name cwrole --policy-name cwlpolicy --policy-document file://cwlpermissions.json
    
  • Ensure that the policy was associated. (in Master Account)

    aws iam get-role-policy --role-name cwrole --policy-name cwlpolicy
    
  • Create a destination endpoint to which the logs would be sent (in Master Account)

    aws logs put-destination --destination-name "kinesisDest" --target-arn "arn:aws:kinesis:us-west-2:<MASTER ACCOUNT ID>:stream/Demo_Kinesis_Stream" --role-arn "arn:aws:iam::<MASTER ACCOUNT ID>:role/cwrole"
    

Response from CLI

Logs_put_destination

  • Assign a destination policy that allows other AWS accounts to send data to Kinesis.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1571094446639",
      "Action": [
        "logs:PutSubscriptionFilter"
      ],
      "Principal" : {
          "AWS": [
      "MEMBER_ACCOUNT_1_ID",
      "MEMBER_ACCOUNT_2_ID"
      ]
         },
      "Effect": "Allow",
      "Resource": "arn:aws:logs:us-west-2:<MASTER_ACCOUNT_ID>:destination:kinesisDest"
    }
  ]
}

In the Master account run,

aws logs put-destination-policy --destination-name "kinesisDest" --access-policy file://destination_policy.json

Now, let’s set up VPC flow logs in Member Accounts.

Aggregating VPC Flow logs

Step 6: Setup CloudWatch Log Group (Member Account)

The first step is to set up a CloudWatch Log group in all the member account(s). This can be done via the AWS CLI/AWS SDK or the AWS Web Console (a CLI example follows below).

Navigate to Services > CloudWatch > Logs > Create log group
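
From the CLI, a minimal sketch of the same step, using the log group name referenced in the subscription-filter command later and the dev1 profile used elsewhere in this post:

# Create the log group that will receive the VPC flow logs in the member account
aws logs create-log-group --log-group-name cwl_vpc_fl_member_account_1 --profile dev1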

Step 7: Setup VPC Flow Logs (Member Account)

The next step is to enable VPC Flow Logs across all of your VPCs in all of the member accounts. This can be done either via the AWS CLI, the AWS SDK or the web console.

Navigate to Services > VPC > Your VPCs and select the VPC of interest. Then in the bottom pane, click on Flow Logs > Create flow log. I will call it cwl_vpc_fl_member_account_1 (Refers to account 1)

create_flow_log_1

On the next page, Select the Filter. This indicates the type of VPC logs that you want AWS to capture. Choose ALL to log both accepted and rejected traffic. Then select the Destination as CloudWatch Logs.

From the drop-down choose the Destination log group.

create_flow_log_2

Finally, select an IAM role that allows VPC flow logs to be written to the CloudWatch Log group. If this is not already set up, then create a role either by clicking on Set Up Permissions or by going to IAM and adding the policy shown below to a role. (In our case, the role is named Demo_flowlogsrole.)

{
  "Statement": [
    {
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams",
        "logs:PutLogEvents"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

Copy the ARN of the Demo_flowlogsrole and repeat the same steps across the different accounts.

create_flow_logs_final

This will start forwarding VPC flow logs from all VPCs to the CloudWatch log group
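
If you need to script this across many VPCs, the equivalent CLI call is sketched below (vpc-0123456789abcdef0 is a placeholder VPC ID, and Demo_flowlogsrole must trust vpc-flow-logs.amazonaws.com):

# Send ALL traffic (accepted and rejected) for the VPC to the CloudWatch log group
aws ec2 create-flow-logs --resource-type VPC --resource-ids vpc-0123456789abcdef0 \
  --traffic-type ALL --log-group-name cwl_vpc_fl_member_account_1 \
  --deliver-logs-permission-arn arn:aws:iam::<MEMBER_ACCOUNT_ID>:role/Demo_flowlogsrole --profile dev1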

Step 8: Create a subscription filter in Member Account(s) to send data to Kinesis Stream

  • Execute this command to obtain the destination ARN from the Master Account [the --profile master option uses the AWS CLI credentials stored for the Master Account]

    aws logs describe-destinations --profile master
    

It has a format similar to arn:aws:logs:us-west-2:<MASTER_ACCOUNT_ID>:destination:kinesisDest

  • Now, in the member accounts, set up the subscription filter to forward logs to Kinesis

In the command below, use the log group created in step 6.

aws logs put-subscription-filter --log-group-name "cwl_vpc_fl_member_account_1" --destination-arn arn:aws:logs:us-west-2:<MASTER_ACCOUNT_ID>:destination:kinesisDest --filter-name "vpc_flow_logs_filter" --filter-pattern " " --profile dev1

You should now see data in ElasticSearch.

Aggregating CloudTrail Logs

In this section, we will discuss the aggregation of CloudTrail logs. We will use some of the same resources created in the previous step.

  • Enable CloudTrail in all the regions within the Member Accounts.

While enabling this across the organization, ensure that the S3 bucket to which the CloudTrail logs are sent is the bucket in the Logging/Master account created in Step 2. This is useful for long-term storage of the logs. (A CLI sketch follows the screenshot below.)

create_cloudtrail_1
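
If you prefer the CLI, a minimal sketch of the same step (run in each member account; demo-cloud-trail matches the trail name used below, and the bucket is the one from Step 2):

# Create a multi-region trail that delivers logs to the central bucket, then start logging
aws cloudtrail create-trail --name demo-cloud-trail --s3-bucket-name cloudtrail-all-accounts-demo --is-multi-region-trail --profile dev1
aws cloudtrail start-logging --name demo-cloud-trail --profile dev1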

  • Forward CloudTrail to CloudWatch Logs

Navigate to the CloudTrail service and click on the trail created in the previous step (demo-cloud-trail).

Then, go to the section Cloudwatch Logs and click on Configure.

create_ct_cwl_1

Provide the name of the Cloudwatch log group

create_ct_cwl_2

This will then take you to the IAM configuration to create a role that gives CloudTrail permission to write to the CloudWatch log group.

iam_role_ct_cwl
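
The same linkage can be made from the CLI once the log group and role exist. A sketch, assuming the CloudTrail/member_account_1 log group used below and a hypothetical role named CloudTrail_CloudWatchLogs_Role:

# Point the trail at the CloudWatch log group using the role that allows it to write there
aws cloudtrail update-trail --name demo-cloud-trail \
  --cloud-watch-logs-log-group-arn arn:aws:logs:us-west-2:<MEMBER_ACCOUNT_ID>:log-group:CloudTrail/member_account_1:* \
  --cloud-watch-logs-role-arn arn:aws:iam::<MEMBER_ACCOUNT_ID>:role/CloudTrail_CloudWatchLogs_Role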

  • Now, similar to the previous section, create a subscription filter to forward the logs from this CloudWatch log group to Kinesis. (in Member Account)

    aws logs put-subscription-filter --log-group-name "CloudTrail/member_account_1" --destination-arn arn:aws:logs:us-west-2:<MASTER_ACCOUNT_ID>:destination:kinesisDest --filter-name "ct_filter" --filter-pattern " " 
    

Once this is set up, you will start seeing both processed VPC flow logs and CloudTrail events from all the accounts in ElasticSearch.

Aggregating GuardDuty events

  • The first step is to aggregate all GuardDuty events in the MASTER/Logging account. This can be done by sending invitations from the master account to member accounts. All you need is the member account ID and the email address associated with the account.

Navigate to Services > GuardDuty > Enable GuardDuty

Then to add member accounts, go to GuardDuty > Accounts > Add Account

Provide the Account ID(s) of the member account and Email for the account. Once you do that, send an invite.

send_invite_gd

  • Go to the Member Account and accept the invite. Before you do so, enable GuardDuty in the member accounts. (You can do this from Terraform or CFT across member accounts.)

accept_invite_gd

You will notice that the GuardDuty events from all your member accounts are now available in the Master/Logging account. This is useful if you like using the GuardDuty UI.
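
The invitation flow can also be scripted. A rough sketch with the AWS CLI (account IDs, emails and detector IDs are placeholders; the detector IDs come from list-detectors):

# In the Master/Logging account: enable GuardDuty, add a member and send the invite
aws guardduty create-detector --enable
aws guardduty list-detectors
aws guardduty create-members --detector-id <MASTER_DETECTOR_ID> --account-details AccountId=<MEMBER_ACCOUNT_1_ID>,Email=<MEMBER_1_EMAIL>
aws guardduty invite-members --detector-id <MASTER_DETECTOR_ID> --account-ids <MEMBER_ACCOUNT_1_ID>

# In the member account: enable GuardDuty and accept the invitation
aws guardduty create-detector --enable --profile dev1
aws guardduty list-invitations --profile dev1
aws guardduty accept-invitation --detector-id <MEMBER_DETECTOR_ID> --master-id <MASTER_ACCOUNT_ID> --invitation-id <INVITATION_ID> --profile dev1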

  • Now, similar to other aggregations seen in earlier sections, we will forward these GuardDuty events to CloudWatch logs. (MASTER ACCOUNT)

To do this, go to CloudWatch > Events > Create Rule. Add the following details to it. Note, we are forwarding all the GuardDuty events stored in the Master account to the CloudWatch Log group.

create_cw_rule_for_gd
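
The equivalent rule from the CLI might look like the following sketch (the rule name guardduty-to-cwl is a placeholder; the /aws/events/guardduty-demo log group matches the subscription filter used below):

# Create the target log group, a rule matching all GuardDuty findings, and wire them together
aws logs create-log-group --log-group-name /aws/events/guardduty-demo
aws events put-rule --name guardduty-to-cwl --event-pattern '{"source":["aws.guardduty"]}'
aws events put-targets --rule guardduty-to-cwl --targets "Id"="1","Arn"="arn:aws:logs:us-west-2:<MASTER_ACCOUNT_ID>:log-group:/aws/events/guardduty-demo"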

  • Update the destination policy to allow data from this account to be collected.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1571094446639",
      "Action": [
        "logs:PutSubscriptionFilter"
      ],
      "Principal" : {
          "AWS": [
            "MEMBER_ACCOUNT_1_ID",
            "MEMBER_ACCOUNT_2_ID",
            "MASTER_ACCOUNT_ID"
      ]
         },
      "Effect": "Allow",
      "Resource": "arn:aws:logs:us-west-2:<MASTER_ACCOUNT_ID>:destination:kinesisDest"
    }
  ]
}

Run this in Master Account

  aws logs put-destination-policy --destination-name "kinesisDest" --access-policy file://destination_policy.json
  • Finally, we will add a subscription filter to forward these CloudWatch Logs to the Kinesis stream, similar to the previous sections. (Here we add the subscription filter in the MASTER ACCOUNT.)

    aws logs put-subscription-filter --log-group-name "/aws/events/guardduty-demo" --destination-arn arn:aws:logs:us-west-2:<MASTER_ACCOUNT_ID>:destination:kinesisDest --filter-name "gd_filter" --filter-pattern " " 
    

This should be the final state of your CloudWatch Log group

cw_final

Kibana Dashboard

After applying the necessary filters, you will start seeing the data in the Kibana dashboard.

kibana_dashboard

Conclusion

By implementing this architecture, you should get near real-time data in ElasticSearch for analysis (and can send notifications using SNS). You should also see all the logs/events stored as a backup in your S3 bucket. You may choose to set a lifecycle policy so that these logs are archived to Glacier or other long-term storage services.

The aggregated logs from the Master/Logging account can be forwarded to other external systems like Splunk/RedShift for analysis and VMWare Secure State for Cloud Security Posture Management.

 

SOURCE

Multi-Account Log Aggregation in AWS for Observability and Operations - Part 1

Monitoring infrastructure resources and applications within public clouds, like AWS and Azure, is critical for audit, security, and compliance within the accounts. As enterprises grow the number of accounts, the collection of these logs and events becomes more tedious. A common mechanism, and the one AWS recommends, is to use a separate AWS account for collecting all the logs, so that in case of a breach in the other member accounts within an organization, the logs are never compromised.

In AWS, various services generate logs and events. These include:

  • CloudTrail - This service tracks all of the API requests made across your AWS infrastructure. The API requests could be from the SDK, CLI, CloudFormation Template (CFT), Terraform or the AWS Console. This helps in identifying which users and accounts made API calls to AWS, the source IP from where the calls originated and when the calls occurred. It also tracks the changes, if any, that were made with the API request.

  • VPC Flow Logs - The VPC Flow Logs capture the IP traffic to and from the network interfaces within a VPC. This helps in monitoring the traffic that reaches your instances, diagnosing security group rules and determining the direction of traffic from the network interfaces.

  • GuardDuty events - This service detects suspicious activity and unauthorized behavior for users and resources. It focuses primarily on account and network-related events. It uses threat intelligence feeds, such as lists of malicious IPs and domains, and machine learning to identify these threats. For example, it can detect unusual access patterns: a user who has never used an API to request IAM info suddenly doing so will be flagged by GuardDuty, as it learns user access patterns over time using machine learning.

  • CloudWatch - CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, providing you with a unified view of AWS resources, applications, and services that run on AWS.

  • Application Logs - These logs are generated from within the application. They are usually meant to capture errors and traces.

While I will be focusing on infrastructure logs (security events, API calls, Network flows) in this blog, the same method can be used even for application logs.

In this blog, I will review the basic concepts and discuss different ways of aggregating logs from AWS. In particular, I will review:

  1. Need for Log Aggregation and Analysis.
  2. Forwarding logs to a Centralized Logging Account in AWS
  3. Tradeoffs between different methods

In Part 2 of the blog, I will cover the actual implementation steps.

Need for Log Aggregation and Analysis

A common requirement from security teams is to be able to analyze all the data collected across different accounts. The logs that are generated could be from network flows, billing events or even API calls across a large number of cloud accounts. Some services, such as GuardDuty and CloudTrail, are regional, which means that there is no single place where the state and posture of the entire account can be analyzed. The problem compounds when you have more than one account. Thus, aggregating the logs becomes very important.

In addition to this, the SecOps teams may not have the IAM “permissions” to access the member AWS accounts directly, which again necessitates centralized aggregation of logs.

Log analysis has a lot of benefits when implemented properly. Some of these are:

  • Improved security awareness and faster detection of unwanted configuration modifications
  • Identification of resource usage across your accounts
  • Detection of anomalous patterns and IAM behaviors within accounts
  • Demonstration of compliance with regulatory requirements

Let’s look at the steps involved in log analysis:

  1. Aggregate the logs - As mentioned in the previous section.

  2. Parse the logs - To extract essential information from all the different logging services, the logs need to be parsed and fed to ElasticSearch or Splunk. This data transformation is used either to filter out unnecessary logs or to convert the data into formats suitable for ELK, Splunk or RedShift. In some cases, the logs are compressed (.gzip) and sent to the destination in the Central Logging account, so the transformation layer can be used to uncompress the file and extract the individual logs.

  3. Querying the logs - Querying the parsed log data provides greater insight into the data. This is where ElasticSearch, Splunk or Amazon RedShift come into the picture.

  4. Monitoring - Finally, building dashboards to analyze logs and metrics is crucial. Cloudwatch comes with various visualization options that can be built based on a query. Alarms can be set to trigger based on a particular condition.

Forwarding logs to a Centralized Logging Account in AWS

The first step is to understand all the possible services you want to collect information from. Then understand where this information is being stored in AWS or where/how you want to pull this information.

For simplicity, I will discuss the three main “management and governance” services from AWS where information is aggregated, assuming most information is sent into these AWS services as an aggregation point for the different “bits” within AWS.

However, you might also want to pull information straight from the services, e.g. adding Fluent Bit agents on EC2 instances to ship logs into Elasticsearch.

  • GuardDuty, in a Member account, can be configured to send findings directly to the “Master” / “Central” Account. Once the accounts are added in the “Central” account, all the member accounts receive an invitation which they need to accept. This ensures that a trust relationship is established. Moving forward, all GuardDuty events will be sent to this Central account.

But do remember that GuardDuty is a regional service. This meant that I had to enable it in every region within all of my member accounts. This could be a tedious task. In my next blog, I will provide a Terraform template that will make it easier to enable it in member accounts as well as accept an invitation for GuardDuty.

  • CloudTrail is also a regional service. This needs to be enabled in every region within your member accounts. The logs can be forwarded to an S3 bucket for storage. The Cloudtrail events can also be used to trigger notifications if any change is detected. This is achieved by creating an event rule in CloudWatch and then triggering an SNS notification.

  • VPC flow logs need to be enabled on every single VPC. This can be done after VPC creation as well. Similar to the CloudTrail logs, these logs can either be delivered to an S3 bucket for long-term storage or directed to a CloudWatch log group to generate notifications based on specific patterns.

The log aggregation usually serves 2 main purposes and the destination for aggregation is based on the use case:

  • Real-time Observations and Alerting - If the goal is to get real-time alerts from within your different accounts, then the log destination should be a CloudWatch Log Group. CloudWatch can be configured to trigger based on specific events, which can later be processed by Kinesis or SQS (Simple Queue Service).

  • Regulatory requirements and Auditing - If your organization has regulatory requirements for storing the logs for a specific amount of time then S3 is the appropriate destination. These logs can be then archived to S3 Glacier for long term storage. If real-time alerts are not a requirement then logs stored in S3 can be used along with AWS Glue and AWS RedShift for analytics.

When converting these steps to practical implementation, I tried 3 different deployment models.

Method 1 - CloudWatch plus Lambda Method

While implementing the steps described in the previous sections, the first architecture I implemented leveraged the following services:
  • CloudTrail, GuardDuty, VPC Flow Logs - Log generation
  • CloudWatch - Log aggregator
  • Lambda Functions - Parse the logs (uncompress .gzip and extract logs)
  • AWS ElasticSearch (ELK stack) / AWS RedShift - Ingest, analyze, search and visualize data

Method_1

In this workflow, once the logs are collected, the parsing was done using Lambda function(s). Multiple CloudWatch event rules were configured to trigger a specific Lambda based on the type or source of a log, which could be GuardDuty, CloudTrail or VPC Flow logs. The data can then be sent to log analysis tools like Splunk, ElasticSearch or AWS RedShift.

This worked fine with a smaller number of events. But when the number of events increased, I started noticing that some of the events/logs were not sent to ElasticSearch.

After some debugging, I realized that the issue was related to Lambda throttling. AWS limits an account to 1,000 Lambda functions running concurrently per region. This includes all the Lambda functions that you might be using for your other applications plus the Lambda functions used for log processing.

Once this limit is exceeded, Lambda starts returning a 429 error code. Even setting reserved concurrency for the log processing function was not sufficient, because when the number of log processors exceeded the reserved limit, the Lambda function again returned 429.

You can request AWS to increase the concurrency limit within your account but this usually ends up being a catch-up game unless you can exactly predict how many functions you would need in each region.

Method 2 with AWS SQS

To overcome the drawbacks of Method 1, I added SQS between the CloudWatch Logs group and the log processing Lambda functions, as shown below.

  • SQS - Simple Queue Service - It is a fully managed “Message Queue” service from AWS. It allows us to send, receive and store messages without losing them. A message could be any data that your services would like to send/receive from each other.

Method_2

With SQS in place, processing the event/log messages became easier. This was possible because SQS can store messages without losing them and without needing a receiver, like a Lambda function, to be immediately available for processing. This meant that even after the account limit for Lambda was reached, the event messages were still in the queue, and once the number of concurrent Lambda executions decreased, the next available execution of the function would pick up the message and process it.

This method also has some limitations. While SQS has advantages in terms of easy setup and increased read throughput, it does not support multiple consumers or message replay.

What this means is that if, for some reason, the Lambda function takes longer than expected to process the log messages, or crashes due to an unexpected error, the message can be permanently removed from the SQS queue. This would lead to the loss of some logs that would never be processed and won’t be available for analysis.

Method 3 with AWS Kinesis

This was the final method I tried, and it addressed the drawbacks of both Method 1 and Method 2. I used the following additional services in the Central Logging Account:

  • AWS Kinesis Data Stream - It’s a real-time data streaming service that can capture gigabytes of data per second. It can store the messages up to 7 days (the Retention period can be modified). It’s used in real-time analytics and video streaming applications. This data stream can be customized based on the user’s needs.

  • AWS Kinesis Firehose - It’s a fully managed real-time delivery streaming service that can load data into other endpoints such as S3, RedShift or AWS ElasticSearch. It can also transform the data if needed. Firehose does not have a retention period.

Method_3

This method uses AWS Kinesis Data Stream and Firehose Delivery Stream in the log processing workflow as shown above.

Cloudwatch pushes the logs into Kinesis Data Stream (KDS). The Kinesis Firehose delivery stream reads the messages from the Kinesis data stream (KDS) and integrates with lambda for data transformation.

It’s very simple to set up the integration between KDS and Firehose. (Step by step details will be discussed in part 2 of the blog)

connect_kinesis_delivery_stream

create_kinesis_delivery_stream

With this in place, even if the Lambda functions fail halfway through, the messages are retained in the Kinesis data stream and can be picked up and processed by another invocation. Kinesis also keeps track of each consumer's position within the stream, which allows the functions to decide which message should be picked up next. Also, in the case of KDS, the number of shards is the unit of concurrency for Lambda functions. For example, if you have 50 active shards, there will be at most 50 Lambda functions executing. This adds more predictability to the log processing Lambda functions.

Finally, AWS Firehose loads the extracted data into ElasticSearch, RedShift or S3.

Now that we understand why Method 3 is most suitable, I will discuss this in detail with steps on how to implement it in the next blog.

Tradeoffs

Let’s now look at some of the tradeoffs across the different methods.

  • Method 1 - This method is the easiest to set up because of the minimal number of interacting services, but it will not scale when a very large volume of logs is being sent for processing. This method might be sufficient if you have very few accounts and logs.

  • Method 2 - This method requires the setup of SQS. It allows for the decoupling of producers and consumers, which lets the log processing functions scale independently. The cost might be higher than Method 1 because of SQS, along with the increased complexity. Another disadvantage is that messages are removed from the queue once they are read, providing no scope for retention. There is also no continuous monitoring via CloudWatch metrics for SQS; as of writing this blog, CloudWatch metrics for SQS are available only at 5-minute intervals.

  • Method 3 - This method requires the setup of a Kinesis Data Stream and Firehose. It allows for multiple producers and consumers and provides a retention period. But it increases the complexity drastically: modifying the number of shards after provisioning the KDS stream is tricky and requires some advanced knowledge of how streams operate. The cost also increases because of the two additional services added to the log processing workflow.

Conclusion

Central logging is required when the number of cloud accounts starts increasing. This provides the SecOps teams an easier way to analyze data from multiple sources for managing security, compliance and application analytics.

AWS Lambda is usually used for log processing because of its event-driven nature (it does not need a VM or container running when there are no logs to process), but it comes with its limitations.

Since events in the multi-account environment tend to increase exponentially, a scalable and real-time data stream is needed for shorter detection periods. This is provided by services like AWS Kinesis.

 

SOURCE