Power BI architecture for multiple developers

If you feel like setting up a collaborative Power BI environment, this post can give you some ideas.

Working with Power BI at a large enterprise scale usually requires a setup like this, and you will usually have a data warehouse to back you up; if you do, make use of it. However, if you don’t have a proper data warehouse, data lake, or anything like that in your organization, but you still have a Power BI Premium capacity and several Power BI devs working together, this suggestion might work for you.

Premises:

  • You already have your data sources working, with data properly loaded and ready for your Power BI devs to connect to and work with.

Implementing a shared Power BI data model

The goal here is to create an architecture that supports multi-developer work, where models, measures, and data transformations can be reused, creating a centralized single source of truth and reducing or eliminating duplicate copies of datasets.

Here is the suggested solution architecture:

Let me explain this…

Use Dataflows

By loading your data using dataflows you can, in a way, simulate an ETL layer on your end. That means all data transformations can be done using dataflows (Power Query Online), layer by layer, until you have all your data put together for consumption.

The key point here: you cannot write DAX or build any kind of model (relationships) in a dataflow, but you can merge, append, and union data, and do all that good stuff like cleansing your data.

The result of dataflows is a bunch of tables that can later be used in your datamart.

Use Datamarts

If you don’t already have dataflows set up, you can skip them and go straight to datamarts, since datamarts support Power Query Online just like dataflows do. The catch is, if you need multi-layer dataflows, meaning you need multiple dataflows to join different data sources with different credentials and so on, you might want to stick with dataflows.

Either way, your data is now ready: clean, well structured, and normalized or denormalized to your needs. Now you can build a datamart on top of it.

A datamart allows you to create DAX measures and relationships between your tables, and every datamart automatically generates a dataset in the end (with all your stuff inside, of course).

As a result, you have a datamart that you can connect to from Power BI Desktop (PBID).

Connecting to a datamart from Power BI Desktop is like connecting to a dataset in your workspace: you will see your tables and your relationship diagram, but you won’t see the underlying data, you won’t have Power Query available to shape the data, and you cannot pick and choose which tables to load.

Each datamart generates a single dataset, so make sure the dataset is rich and broad enough for people to explore and find insights in. However, if it starts to become too large, it may be a better idea to split it into smaller datamarts/datasets.

Where are we at?

We are here:

We just finished building dataflows and datamarts, so we have proper tables set up, with proper relationships and proper DAX measures.

Power BI Desktop as Data Viz Layer

Your data is curated and your measures are ready to use. All you have to do is open Power BI Desktop, connect to your datamart, and create your visuals.

You can even create your own measures if you want. You cannot create columns, though; if you need new columns, you will have to ask your data modelers to create them at the dataflow level.

Once your report is created, you can publish it to any workspace you have access to.

And yes, you need at least Build permission to access your datamarts’ datasets and create your reports, but your users don’t; just give them Viewer access and distribute reports and dashboards through apps or workspaces.
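
As a side note, workspace access can also be granted programmatically. Below is a minimal, hedged sketch that calls the Power BI REST API endpoint for adding a workspace user with the `requests` library; the workspace ID, the access token, and the sample e-mail address are placeholders you would replace with your own values.

```python
import requests

# Placeholders: replace with your workspace GUID and a valid Azure AD
# access token that has Power BI API permissions.
WORKSPACE_ID = "00000000-0000-0000-0000-000000000000"
ACCESS_TOKEN = "<azure-ad-access-token>"

def grant_viewer_access(user_email: str) -> None:
    """Add a user to a workspace with Viewer access via the Power BI REST API."""
    url = f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}/users"
    payload = {"emailAddress": user_email, "groupUserAccessRight": "Viewer"}
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    response = requests.post(url, json=payload, headers=headers)
    response.raise_for_status()

grant_viewer_access("report.consumer@contoso.com")  # hypothetical user
```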

FAQ

Why don’t you connect Power BI Desktop to Dataflows directly?

Well, you can do that if you want, but that will create a scenario like this:

Notice here that the management of your DAX code and relationships will be scattered across different PBID files, which can make it harder to govern.

OK, you haven’t convinced me; I still want to connect directly to the dataflow using PBID and publish different datasets in different workspaces.

You can do that too, but it might put you in a very difficult position, because it will make it harder to govern all your datasets. Also, how do you manage all the different DAX measures? It becomes much easier for someone to ask: “Why is my total amount different in workspace A than in workspace B?”

Preventing this kind of problem is exactly why we have centralized, governed datasets: so you can make sure that all important measures always match, wherever they are used.
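
If you want to see how scattered things already are, a quick audit helps. Here is a minimal sketch, assuming you already have an Azure AD access token with Power BI API permissions (the token value below is a placeholder), that lists every dataset in every workspace you can see via the Power BI REST API, which makes duplicated models easy to spot.

```python
import requests

ACCESS_TOKEN = "<azure-ad-access-token>"  # placeholder: obtain via Azure AD
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
BASE = "https://api.powerbi.com/v1.0/myorg"

def list_datasets_per_workspace() -> None:
    """Print every dataset in every workspace the caller can access,
    so duplicated models scattered across workspaces stand out."""
    workspaces = requests.get(f"{BASE}/groups", headers=HEADERS).json()["value"]
    for ws in workspaces:
        datasets = requests.get(
            f"{BASE}/groups/{ws['id']}/datasets", headers=HEADERS
        ).json()["value"]
        for ds in datasets:
            print(f"{ws['name']}: {ds['name']}")

list_datasets_per_workspace()
```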

References

Power BI Whitepapers https://learn.microsoft.com/en-us/power-bi/guidance/whitepapers

Lots of RADACAD blogs and videos https://radacad.com/getting-started-with-power-bi-datamart

Cloud Glossary | Azure – AWS – GCP

I have been studying and working with Microsoft Azure, Amazon AWS, and Google Cloud Platform. They are all cloud platforms and they all work in similar ways, so in this post I will try to collect the terms I find that are shared across all of them.

Topics Worth Knowing Well for Any Cloud

Scalability: Scalability refers to the ability of a system, application, or process to handle an increasing amount of work or to accommodate more users. In the context of cloud computing, scalability means that the system can dynamically allocate more resources to meet the changing demands of the application or workload. This helps organizations to avoid downtime and ensure that their applications are always performing at their best.

Flexibility: Flexibility refers to the ability to adapt to change and make changes quickly and easily. In the context of cloud computing, flexibility means that organizations can quickly and easily deploy and manage their applications, as well as experiment with new technologies and features. This helps organizations to respond to changing business needs and take advantage of new opportunities.

Agility: Agility refers to the ability to respond quickly and effectively to changing conditions or requirements. In the context of cloud computing, agility means that organizations can quickly deploy new applications and services, as well as make changes to existing applications, without the constraints of traditional IT infrastructure. This helps organizations to stay competitive and respond to changing customer needs.

Cost Savings: Cost savings refer to the reduction in costs associated with deploying, managing, and scaling applications. In the context of cloud computing, cost savings are achieved by leveraging the economies of scale and the shared infrastructure provided by cloud computing, as well as by reducing the costs associated with managing and maintaining traditional IT infrastructure.

Reliability: Reliability refers to the ability of a system to perform its intended function consistently and without failures. In the context of cloud computing, improved reliability means that applications are designed to be highly available and resilient, with automatic failover and replication capabilities. This helps organizations to avoid downtime and ensure that their applications are always up and running, even in the face of failures or outages.

Security: Security refers to the protection of data and systems from unauthorized access, use, disclosure, disruption, modification, or destruction. In the context of cloud computing, enhanced security means that organizations can benefit from the security features provided by cloud computing, such as encryption, access controls, and threat detection. This helps organizations to protect their applications and data from cyber threats and ensure the confidentiality and integrity of sensitive information.

Productivity: Productivity refers to the efficiency and effectiveness with which work is done. In the context of cloud computing, increased productivity means that organizations can benefit from the platform for collaboration and sharing information provided by cloud computing. This helps organizations to be more productive and achieve their goals more effectively.

Elasticity: Elasticity refers to the ability of a system to automatically scale its resources up or down to meet changing demands. In the context of cloud computing, elasticity means that organizations can automatically allocate more or fewer resources to their applications as needed, without having to manually manage the underlying infrastructure. This helps organizations to avoid over-provisioning or under-provisioning of resources, and ensures that their applications are always performing at their best.

Resiliency: Resiliency refers to the ability of a system to continue functioning even in the face of failures or disruptions. In the context of cloud computing, resiliency means that organizations can benefit from the automatic failover and replication capabilities provided by cloud computing, which help to ensure that their applications are always up and running, even in the face of failures or outages. This helps organizations to avoid downtime and ensure that their applications are always available to their customers.

Managed Instances: In the context of cloud computing, managed instances refer to a category of cloud services where the cloud provider takes care of managing and maintaining the underlying infrastructure, allowing customers to focus on their applications and services that run on top of it.

In the case of managed instances, the provider is responsible for tasks such as patching and updating the operating system, handling security and network configurations, and providing technical support. This frees up the customer from having to perform these tasks themselves, which can be time-consuming and complex, and allows them to focus on their core business operations.

Managed instances can take the form of managed virtual machines, managed databases, managed containers, and more. Examples of cloud providers offering managed instances include Amazon Web Services (AWS) with its EC2 and RDS services, Google Cloud Platform (GCP) with its Compute Engine and Cloud SQL services, and Microsoft Azure with its Azure Virtual Machines and Azure SQL Database services.

Managed Servers: Managed servers, like managed instances, are a type of cloud service where the provider takes care of managing and maintaining the underlying infrastructure. However, the key difference between managed servers and managed instances lies in the level of management provided by the provider.

Managed servers typically offer a higher level of management compared to managed instances. This can include tasks such as software installation and configuration, backup and recovery, and performance tuning, in addition to the management tasks performed for managed instances. Managed servers are often marketed as a fully managed solution, with the provider handling all aspects of server administration, leaving the customer to focus purely on their applications and services.

Managed servers are typically offered as physical servers, which are housed and maintained in the provider’s data centers. The customer rents the server, and the provider takes care of all the underlying infrastructure and management tasks. The customer has full control over the operating system and applications installed on the server, but the provider is responsible for the physical security and maintenance of the hardware.

In summary, managed servers offer a higher level of management compared to managed instances, and are typically offered as physical servers, with the provider taking care of all aspects of server administration. Managed instances, on the other hand, typically refer to cloud-based virtual machines, where the provider takes care of basic management tasks such as patching and updating the operating system, but leaves more advanced tasks, such as software installation and configuration, to the customer.

Managed Servers and IaaS: Managed servers can be considered a type of Infrastructure as a Service (IaaS), but they differ in the level of management provided by the provider.

IaaS refers to a cloud computing service model where the provider offers virtualized computing resources, such as virtual machines, storage, and networking, over the internet. The customer has complete control over the operating system, storage, and deployed applications, and is responsible for managing and maintaining the infrastructure. Examples of IaaS providers include Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).

Managed servers, as described earlier, are a type of IaaS offering, where the provider takes care of certain aspects of server administration, such as software installation and configuration, backup and recovery, and performance tuning. The customer still has control over the operating system and applications installed on the server, but the provider handles the underlying infrastructure and management tasks to a greater extent than in a standard IaaS offering.

So, while managed servers can be considered a type of IaaS, they differ in the level of management provided by the provider and the degree of control the customer has over the underlying infrastructure.

Serverless Computing: Serverless computing is a cloud computing model where the cloud provider manages the infrastructure and automatically allocates resources as needed to run the customer’s applications, without the customer having to provision or manage any servers.

In serverless computing, the customer uploads their code to the cloud provider, which then runs the code in response to events, such as incoming HTTP requests or changes to a database. The provider automatically provisions the necessary resources, such as compute, memory, and storage, to run the code, and charges the customer based on the actual resources used, rather than a pre-allocated amount.

The key advantage of serverless computing is that it allows the customer to focus solely on writing and deploying their code, without having to worry about the underlying infrastructure. The provider handles all aspects of infrastructure management, from provisioning resources to scaling the solution to meet changing demand. This can result in reduced costs and increased agility, as the customer only pays for the resources they actually use and can quickly and easily deploy new applications and services without having to worry about infrastructure management.

Examples of serverless computing platforms include AWS Lambda, Google Cloud Functions, and Microsoft Azure Functions. These platforms provide a way for customers to run their code in a fully managed, scalable, and highly available environment, without having to provision or manage any servers.
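
To make the model concrete, here is a minimal AWS Lambda handler in Python. It is only a sketch: the function name and response shape are illustrative, and in practice you would deploy it through the console, the CLI, or infrastructure-as-code and wire it to an event source.

```python
import json

def lambda_handler(event, context):
    """Minimal AWS Lambda handler: logs and echoes the triggering event.

    The platform invokes this function in response to an event (for
    example an HTTP request via API Gateway or an S3 notification);
    no servers are provisioned or managed by the author of this code.
    """
    print("Received event:", json.dumps(event))
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "Hello from a serverless function"}),
    }
```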

More examples of Serverless computing are:

  1. AWS EventBridge: A serverless event bus that makes it easy to connect applications together using data from your own applications, SaaS applications, and AWS services.
  2. Google Cloud Pub/Sub: A messaging service that allows you to send messages between independent applications asynchronously.
  3. Microsoft Azure Event Grid: A fully managed event routing service that enables event-driven architecture and helps you build dynamic and scalable applications.
  4. AWS Step Functions: A serverless workflow service that makes it easy to coordinate distributed applications and microservices using visual workflows.
  5. Google Cloud Composer: A fully managed workflow orchestration service built on Apache Airflow.
  6. Microsoft Azure Durable Functions: An extension of Azure Functions that lets you write stateful functions in a serverless environment.
  7. AWS S3: A serverless object storage service that allows you to store and retrieve any amount of data, at any time, from anywhere on the web.
  8. Google Cloud Storage: A serverless, highly scalable object storage service that allows you to store and retrieve any amount of data, at any time, from anywhere on the web.
  9. Microsoft Azure Blob Storage: A fully managed, serverless, scalable, and highly available object storage service for unstructured data.

SaaS and Serverless Computing: Software as a Service (SaaS) is not the same as serverless computing, but it can make use of serverless computing resources.

SaaS refers to a software delivery model where customers access software over the internet, without having to install it on their own computers or servers. SaaS is a type of cloud computing that provides customers with access to software applications and services through a web browser or mobile app. Examples of SaaS include customer relationship management (CRM) software, email and productivity tools, and project management software.

Serverless computing, on the other hand, is a type of cloud computing that allows customers to run their code in a fully managed environment, without having to worry about the underlying infrastructure. Serverless computing resources, such as AWS Lambda, Google Cloud Functions, and Microsoft Azure Functions, can be used to build and run backend APIs, event-driven applications, media processing pipelines, real-time analytics, and more.

SaaS applications can make use of serverless computing resources to build and run their underlying infrastructure, without having to manage any servers. For example, a SaaS application could use a serverless function to run backend APIs, process real-time data, or run background tasks. This can result in reduced costs, increased scalability, and the ability to build and deploy new features and functionality quickly and easily.

So, while SaaS is not serverless computing, it can leverage serverless computing resources to build and run its underlying infrastructure.

Fully Managed Environment: A fully managed environment is a type of cloud computing environment where the provider takes care of all the underlying infrastructure and management tasks, leaving the customer free to focus on developing and running their applications. In a fully managed environment, the customer does not have to worry about server provisioning, software updates, security patches, backups, or any other operational tasks. The provider handles all of these tasks, and provides the customer with a fully managed and highly available platform for running their applications.

Managed servers and managed instances, on the other hand, are types of cloud computing environments where the provider still provides some level of management and infrastructure, but the customer has more control and responsibility for the underlying environment. In a managed server or managed instance environment, the customer is typically responsible for configuring and managing the operating system, middleware, and application software. The provider provides the underlying infrastructure and some level of management, but the customer has more control over the environment and is responsible for many of the operational tasks.

Fully managed environments are different from managed servers and managed instances in that they provide a higher level of abstraction and a lower level of customer control and responsibility. With a fully managed environment, the customer can focus on developing and running their applications, without having to worry about the underlying infrastructure. With managed servers and managed instances, the customer has more control and responsibility for the underlying environment, and must perform additional operational tasks in order to maintain and manage the environment.

AWS Certified Solutions Architect Associate SAA-C03 | Well-Architected Framework

Here are some basics links for AWS Well-Architected Framework:

Overview – https://aws.amazon.com/architecture/well-architected/

Docs – https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html

Whitepaper – https://docs.aws.amazon.com/pdfs/wellarchitected/latest/framework/wellarchitected-framework.pdf

AWS Well-Architected Framework Pillars

We must know all pillars, but pay some special attention to Sustainability.

AWS Certified Solutions Architect Associate SAA-C03 | Overview

Below are the basic details about the exam. 

Format: Multiple choice and multiple response questions only

Type: Associate

Delivery method: Pearson VUE testing center or online proctored exam

Number of questions: 65

Time: 130 minutes

Cost: 150 USD

Languages offered: English, French (France), German, Italian, Japanese, Korean, Portuguese (Brazil), Simplified Chinese, Spanish (Latin America), Spanish (Spain), and Traditional Chinese.

For more information about the exam and to schedule your exam, visit the AWS Certified Solutions Architect – Associate page. 

AWS Certified Solutions Architect – Associate (SAA-C03) Exam Guide: https://d1.awsstatic.com/training-and-certification/docs-sa-assoc/AWS-Certified-Solutions-Architect-Associate_Exam-Guide.pdf

Sample questions: https://d1.awsstatic.com/training-and-certification/docs-sa-assoc/AWS-Certified-Solutions-Architect-Associate_Sample-Questions.pdf

AWS training recommendations

Before moving to the next module to prepare for the exam, we recommend that you have completed the following training (or similar courses). We don’t require that you take any specific training before you take an exam.

Whitepapers and FAQs

Get to know AWS services and best practices by exploring AWS whitepapers and FAQs relevant to your exam.

Whitepapers

  • AWS Well-Architected Framework – This whitepaper is my absolute favorite and one of the most important papers that you should study for the exam. It discusses the different pillars that make up a well-architected cloud environment. Expect the scenarios in your exam to be heavily based upon these pillars and dive deep into each pillar of the AWS Well-Architected Framework.
  • AWS Security Best Practices – This whitepaper is also important for the exam. It covers security best practices for designing and operating workloads in the AWS cloud. Expect exam scenarios to draw on these practices, so study them carefully.

FAQs

AWS Certified Solutions Architect Associate SAA-C03 | Glossary

Another certification I’m working on: the AWS Certified Solutions Architect – Associate. You can find more about it here: https://aws.amazon.com/certification/certified-solutions-architect-associate/

Here are some topics and terms (I’m calling this a glossary for simplicity) that I would like to write down, so we can start building a more foundational, basic understanding of what an AWS Solutions Architect should know.

Glossary

What is Amazon EC2?

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides scalable computing capacity in the Amazon Web Services (AWS) cloud. It enables users to launch and manage virtual machines, called instances, with a variety of operating systems and configurations, allowing for flexible and scalable deployment of applications. EC2 is designed to make web-scale cloud computing easier for developers.

EC2 instances can be easily resized, duplicated, or terminated as needed, allowing you to quickly and cost-effectively manage your computing resources. With EC2, you can also take advantage of features such as storage volumes, network interfaces, and security groups to further customize and secure your instances.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html
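
As a quick illustration, here is a hedged boto3 sketch that launches a single small instance; the AMI ID, region, and tag values are placeholders, and your credentials need the usual EC2 permissions.

```python
import boto3

# Placeholder: pick an AMI ID that exists in your region.
AMI_ID = "ami-0123456789abcdef0"

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch one small instance and tag it so it is easy to find later.
response = ec2.run_instances(
    ImageId=AMI_ID,
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[
        {"ResourceType": "instance",
         "Tags": [{"Key": "Name", "Value": "glossary-demo"}]}
    ],
)
print(response["Instances"][0]["InstanceId"])
```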

What is a General Purpose Amazon EC2 instance?

A General Purpose Amazon EC2 instance is a type of virtual machine in the Amazon EC2 cloud computing service. It provides a balanced amount of CPU, memory, and network resources and is suitable for a wide range of applications. The “m5” and “t3” families are examples of General Purpose instances. These instances can be used for a variety of workloads, including web and application servers, small and medium databases, development and testing environments, and many other use cases. General Purpose instances can be easily scaled to meet changing demands, making them a popular choice for many organizations.

https://aws.amazon.com/ec2/instance-types/

What is an EC2 Auto Scaling Group?

An EC2 Auto Scaling Group is a component of the Amazon EC2 Auto Scaling service in Amazon Web Services (AWS). It is used to automatically manage and scale a group of Amazon EC2 instances. An Auto Scaling Group is responsible for ensuring that the desired number of instances is running and available to handle incoming traffic.

With EC2 Auto Scaling, you can set up scaling policies based on criteria such as changes in network traffic or CPU utilization. When demand for your application increases, the Auto Scaling Group will automatically launch new instances to handle the increased load. Similarly, when demand decreases, the Auto Scaling Group will terminate excess instances to save on costs.

This allows you to maintain a consistent and predictable level of performance for your application while maximizing resource utilization and minimizing costs. EC2 Auto Scaling provides a flexible and cost-effective way to manage and scale your Amazon EC2 instances.

https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html
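
For illustration, the boto3 sketch below creates an Auto Scaling group from an existing launch template; the group name, launch template name, region, and subnet IDs are placeholders that must already exist in your account.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Keep between 2 and 6 instances, starting with 2, spread across two subnets.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg-demo",
    LaunchTemplate={"LaunchTemplateName": "web-launch-template",
                    "Version": "$Latest"},
    MinSize=2,
    MaxSize=6,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-0aaa1111,subnet-0bbb2222",
)
```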

What is Amazon EC2 Auto Scaling service?

Amazon EC2 Auto Scaling is a service in Amazon Web Services (AWS) that enables automatic scaling for your Amazon Elastic Compute Cloud (EC2) resources. The service automatically adjusts the number of EC2 instances in a group (referred to as an “Auto Scaling Group”) based on user-defined policies and criteria, such as changes in demand for your application.

With EC2 Auto Scaling, you can ensure that your application has the right number of EC2 instances available to handle the incoming traffic. When demand increases, EC2 Auto Scaling launches additional EC2 instances to handle the load. When demand decreases, EC2 Auto Scaling terminates excess instances to save on costs.

This service provides a simple and effective way to maintain a predictable and optimal level of performance for your application while maximizing resource utilization and minimizing costs. EC2 Auto Scaling is integrated with other AWS services, such as Amazon Elastic Load Balancer (ELB) and Amazon CloudWatch, to provide a complete and flexible solution for scaling your applications in the cloud.

https://aws.amazon.com/ec2/autoscaling/#:~:text=Amazon%20EC2%20Auto%20Scaling%20helps,or%20real%2Dtime%20demand%20patterns.

What are Amazon EC2 Auto Scaling cooldown periods?

Amazon EC2 Auto Scaling cooldown periods are a setting in the Amazon EC2 Auto Scaling service that define the time interval after a scaling activity has completed before the next scaling activity can start. The cooldown period helps to ensure that your application has sufficient time to stabilize after a scaling event before another scaling event occurs.

During the cooldown period, EC2 Auto Scaling will not launch or terminate any additional instances, even if the conditions specified in your scaling policies have been met. This helps to prevent rapid and frequent scaling events, which can have a negative impact on the performance and stability of your application.

You can set the cooldown period for an Auto Scaling Group when you create or update the group. The cooldown period is specified in seconds, and you can choose a value that is appropriate for your application’s requirements. In general, longer cooldown periods can help to reduce the frequency of scaling events, while shorter cooldown periods can allow for more frequent scaling to respond to rapidly changing demand.

https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-scaling-cooldowns.html
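
As a hedged example, a cooldown can be set on a simple scaling policy (in addition to the group's default cooldown). The sketch below attaches a scale-out policy with a 300-second cooldown to the hypothetical group from the earlier example.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Simple scaling policy with an explicit cooldown (placeholder names).
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg-demo",
    PolicyName="scale-out-on-high-cpu",
    PolicyType="SimpleScaling",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,   # add one instance when the policy is triggered
    Cooldown=300,          # wait 5 minutes before the next scaling activity
)
```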

What is a Storage Optimized Amazon EC2 instance?

A Storage Optimized Amazon EC2 instance is a type of virtual machine in the Amazon EC2 cloud computing service that provides high disk throughput and I/O performance. These instances are designed to handle intensive workloads that require frequent access to large amounts of data, such as NoSQL databases, distributed file systems, and data warehousing applications.

Storage Optimized instances provide a high ratio of storage to vCPUs and memory, and offer high-performance local storage options such as NVMe-based SSDs. They also offer low latency and high throughput, making them well-suited for demanding storage workloads.

Examples of Storage Optimized instances include the “d2” and “h1” families. These instances allow you to cost-effectively store and process large amounts of data in the cloud, providing a flexible and scalable solution for your storage needs.

https://aws.amazon.com/blogs/aws/new-storage-optimized-amazon-ec2-instances-im4gn-and-is4gen-powered-by-aws-graviton2-processors/#:~:text=EC2%20storage%2Doptimized%20instances%20are,%2Dvalue%20stores%2C%20and%20more.

What is a Linux-based Amazon EC2 instance on AWS Cloud?

A Linux-based Amazon EC2 instance on AWS Cloud refers to a virtual machine running a Linux operating system in the Amazon Elastic Compute Cloud (EC2) service in Amazon Web Services (AWS). EC2 provides a flexible and scalable way to launch and manage virtual servers in the cloud, and Linux-based instances are a popular choice for many organizations due to the wide range of available open-source software and tools.

With a Linux-based EC2 instance, you can launch a virtual server with your preferred Linux distribution, such as Ubuntu, Amazon Linux, Red Hat Enterprise Linux, or CentOS. You can then install and run any applications and services you need, just as you would on a physical server.

What is Amazon EC2 instance metadata?

Amazon EC2 instance metadata is information about an Amazon Elastic Compute Cloud (EC2) instance that is available from within the instance itself. EC2 instance metadata provides information about the instance, such as its ID, its public hostname, and its Amazon Machine Image (AMI) ID.

Instance metadata can be retrieved by making an HTTP request to a special endpoint, http://169.254.169.254/latest/meta-data/. The information returned by this endpoint is specific to the instance and can be used to configure the instance or to retrieve information needed by scripts and applications running on the instance.

For example, an instance can use the instance metadata to retrieve its Amazon EC2 instance ID and use it as part of a unique identifier. An instance can also use the metadata to retrieve its public hostname, which can be useful for configuring the instance’s network settings.

Instance metadata is available to the instance during its entire lifecycle, so it can be used to configure the instance at launch time, or to retrieve information about the instance at runtime. EC2 instance metadata provides a convenient and flexible way for instances to access information about themselves, making it easier to automate instance configuration and management.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html
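
Here is a small sketch of how an application running on the instance might query that endpoint. It only works from inside an EC2 instance, and it uses the session-token flow (IMDSv2): first request a token, then pass it on the metadata lookup.

```python
import requests

METADATA = "http://169.254.169.254/latest"

# IMDSv2: request a short-lived session token, then use it for lookups.
token = requests.put(
    f"{METADATA}/api/token",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
).text

instance_id = requests.get(
    f"{METADATA}/meta-data/instance-id",
    headers={"X-aws-ec2-metadata-token": token},
).text
print(instance_id)
```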

Which kind of solution is Amazon EC2? IaaS, PaaS, SaaS, FaaS?

Amazon EC2 is a solution in the Infrastructure as a Service (IaaS) category.

IaaS is a cloud computing service model that provides virtualized computing resources, such as virtual machines, storage, and network interfaces, over the internet. With IaaS, customers can rent computing resources on-demand, without having to invest in and maintain physical hardware.

In contrast, Platform as a Service (PaaS) provides a platform for customers to develop, run, and manage their applications, without having to worry about the underlying infrastructure. Software as a Service (SaaS) is a delivery model in which software applications are provided over the internet, on a subscription basis. Function as a Service (FaaS) is a cloud computing model in which functions are executed in response to specific events, without the need to manage the underlying infrastructure.

https://aws.amazon.com/types-of-cloud-computing/

What is Amazon Elastic Block Store (EBS)?

Amazon Elastic Block Store (Amazon EBS) is a block-level storage service provided by Amazon Web Services (AWS) for use with Amazon Elastic Compute Cloud (Amazon EC2) instances. It provides persistent storage for Amazon EC2 instances and can be used as the primary storage for a database, file system, or backup storage.

Amazon EBS provides different volume types to meet different storage performance and cost requirements. You can choose from magnetic volumes for infrequent access, general-purpose SSDs for a balance of cost and performance, and provisioned IOPS SSDs for demanding workloads that require high I/O performance.

Amazon EBS volumes are highly available and durable, and are automatically replicated within an availability zone to protect against data loss. Amazon EBS also provides snapshots, which are point-in-time backups of Amazon EBS volumes that can be used to create new Amazon EBS volumes or to protect data.

Here is a link to the Amazon EBS official documentation: https://aws.amazon.com/ebs/

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html

What is Amazon S3 Bucket?

Amazon S3 (Simple Storage Service) is an object storage service provided by Amazon Web Services (AWS). An S3 bucket is a container for storing objects in S3.

An S3 bucket can be thought of as a folder that can hold any number of files and other folders. Objects stored in an S3 bucket can be up to 5 terabytes in size and can be any type of data, such as text files, images, videos, and more.

S3 provides a scalable and durable storage solution, and it offers a range of features for data management, such as versioning, lifecycle policies, and access controls. S3 is also designed for high availability and data durability, making it an ideal storage solution for critical data and applications.

S3 can be used for a variety of purposes, including storing backups, storing large unstructured data sets, and serving files to users over the internet. S3 is also commonly used as a back-end for other AWS services, such as Amazon EC2 and Amazon DynamoDB.

https://aws.amazon.com/s3/
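
To ground the idea, here is a hedged boto3 sketch that uploads a local file to a bucket and reads it back; the bucket name, file name, and object key are placeholders, and the bucket must already exist.

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-example-bucket"  # placeholder: bucket names are globally unique

# Upload a local file as an object, then read the object back.
s3.upload_file("report.csv", BUCKET, "reports/2023/report.csv")

obj = s3.get_object(Bucket=BUCKET, Key="reports/2023/report.csv")
print(obj["Body"].read()[:100])
```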

What if I want to store more than 5TB of data in an S3 bucket?

If you need to store more than 5TB of data in a single S3 object, you have a few options:

  1. Multipart upload: You can use the Multipart Upload API to upload parts of an object in parallel, which can help you to upload very large objects more efficiently.
  2. Multiple objects: You can store the data across multiple S3 objects by dividing the data into smaller parts. You can use the S3 object naming convention and folder structure to organize these objects and make it easier to work with the data.
  3. Amazon S3 Glacier: If you have data that is infrequently accessed, you can store it in Amazon S3 Glacier, which is a low-cost, long-term archive storage service offered by AWS. You can use S3 Lifecycle policies to automatically transition objects to S3 Glacier as they age, or you can move them manually.
  4. Amazon S3 Intelligent-Tiering: If you have data with unknown or changing access patterns, you can store it in S3 Intelligent-Tiering, which is a new S3 storage class that automatically moves data to the most cost-effective access tier, without performance impact or operational overhead.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html

What is Amazon Multipart Upload for Amazon S3?

Amazon Multipart Upload is a feature of Amazon Simple Storage Service (S3) that enables you to upload large objects, such as videos and images, in parts. Instead of uploading a large object in a single step, you can divide it into parts and upload each part in parallel, which can help you to upload very large objects more efficiently. The parts can be uploaded in any order and in parallel, and then S3 automatically reassembles them into a single object.

This approach has several benefits, such as allowing you to upload objects in parallel, which can increase the overall upload speed and reduce the impact of network latency. Additionally, if the upload is interrupted, you can resume the upload from where it left off, rather than having to start the upload from the beginning.

You can initiate a multipart upload using the AWS Management Console, the AWS CLI, or the S3 API. Once the upload is complete, S3 automatically reassembles the parts and creates a single object from them.

Here is the official AWS documentation on Amazon S3 Multipart Upload: https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
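
The low-level API (create_multipart_upload, upload_part, complete_multipart_upload) is available, but in Python the boto3 transfer manager will split and parallelize the upload for you. A hedged sketch, with the file name, bucket, and threshold values as illustrative placeholders:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Use multipart upload for anything above 100 MB, with 50 MB parts
# uploaded by up to 10 threads in parallel (illustrative values).
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=50 * 1024 * 1024,
    max_concurrency=10,
)

s3.upload_file("big-archive.tar.gz", "my-example-bucket",
               "backups/big-archive.tar.gz", Config=config)
```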

What is Amazon S3 Glacier Instant Retrieval?

Amazon S3 Glacier Instant Retrieval is an archive storage class in Amazon S3, part of the low-cost, long-term S3 Glacier family offered by AWS. It is designed for data that is rarely accessed but still needs to be retrieved immediately, in milliseconds, rather than waiting the minutes or hours required by the other Glacier retrieval options.

Instant Retrieval is useful for data that is needed quickly but infrequently, such as for compliance and auditing purposes. You can retrieve the data the moment you need it while avoiding the cost of keeping it in more expensive storage classes.

To use it, you store objects in (or transition them to) the S3 Glacier Instant Retrieval storage class, and they remain available for immediate access. Storage is cheaper than S3 Standard, though retrievals carry an additional charge, so it is most cost-effective for data that is rarely accessed.

Here is the official AWS documentation on Amazon S3 Glacier Instant Retrieval: https://aws.amazon.com/glacier/instant-retrieval/

What is Amazon S3 Lifecycle policy?

Amazon S3 Lifecycle policy is a feature of Amazon Simple Storage Service (S3) that automatically transitions objects to different storage classes or archives them to Amazon S3 Glacier or Amazon S3 Glacier Deep Archive as they age, based on a set of rules that you define. The lifecycle policy can help you reduce storage costs by automatically moving objects to less expensive storage options as they age, or by automatically deleting them when they are no longer needed.

You can define a lifecycle policy at the bucket level, or for specific objects or prefixes within a bucket. The policy can include one or more rules, each of which defines a transition or deletion action for objects that meet specific conditions, such as their age or the date when they were last modified.

For example, you could define a rule to transition all objects in a bucket to the S3 Standard-Infrequent Access storage class after 30 days, and another rule to transition objects to the S3 One Zone-Infrequent Access storage class after 60 days. After 90 days, you could define a rule to archive objects to S3 Glacier, or after 180 days, you could define a rule to delete objects.

Here is the official AWS documentation on Amazon S3 Lifecycle policy: https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html
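
A hedged boto3 sketch, loosely following the example above (the bucket name and prefix are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under "logs/" to Standard-IA after 30 days,
# to Glacier after 90 days, and delete them after 180 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-out-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 180},
            }
        ]
    },
)
```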

What is the Elastic Load Balancing deregistration process in AWS Cloud?

The Elastic Load Balancer (ELB) deregistration process in the AWS Cloud refers to the process of removing an Amazon Elastic Compute Cloud (EC2) instance from the backend server pool of a load balancer. Deregistration occurs when an instance is terminated, becomes unhealthy, or when you manually deregister it.

The deregistration process is an important aspect of ELB’s automatic instance management, as it helps ensure that traffic is only sent to healthy instances. When an instance is deregistered, ELB stops sending traffic to it and begins distributing incoming traffic to the remaining healthy instances.

In an Auto Scaling group, instances are automatically deregistered and terminated when they are terminated by Auto Scaling, which makes it easier to manage your instances and maintain the desired number of instances in your backend pool.

Here is the official AWS documentation on Elastic Load Balancer deregistration process: https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/deregister-register-instances.html

What is a Provisioned IOPS SSD?

Provisioned IOPS SSD (input/output operations per second) is a storage volume type offered by Amazon Elastic Block Store (EBS) in Amazon Web Services (AWS) that provides high performance and low latency disk I/O for demanding I/O-intensive workloads, such as database applications, big data analytics, and enterprise applications.

With Provisioned IOPS SSD, you specify the number of IOPS you want to provision for a volume, up to the maximum supported by the volume type. This allows you to achieve consistent and predictable performance for your I/O-intensive workloads, regardless of the actual I/O demand.

Provisioned IOPS SSD volumes are backed by solid-state drives (SSDs) and are designed to deliver fast, predictable, and consistent I/O performance, even under heavy load. They are ideal for use cases that require high performance and low latency, such as database applications, enterprise applications, and big data analytics.

Here is the official AWS documentation on Provisioned IOPS SSD: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html
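
A hedged boto3 sketch that provisions such a volume (io1 here; the size, IOPS value, and availability zone are placeholders, and the zone must match the instance you plan to attach it to):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a 200 GiB Provisioned IOPS (io1) volume with 10,000 IOPS.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=200,
    VolumeType="io1",
    Iops=10000,
)
print(volume["VolumeId"])
```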

What is a Virtual Private Gateway on AWS Cloud?

A Virtual Private Gateway (VPG) is an Amazon Web Services (AWS) component that acts as a gateway between your Amazon Virtual Private Cloud (VPC) and your customer gateway. It allows you to establish a secure and private connection from your VPC to your on-premises network using an IPsec VPN connection.

A VPG provides secure communication between your VPC and your data center, enabling you to extend your network into the AWS Cloud and access your cloud-based resources as if they were part of your own data center. With a VPG, you can securely access your cloud resources, such as Amazon Elastic Compute Cloud (EC2) instances and Amazon Simple Storage Service (S3) buckets, from your on-premises network.

Here is the official AWS documentation on Virtual Private Gateway: https://docs.aws.amazon.com/vpn/latest/s2svpn/VPC_VPN.html

What is Amazon Virtual Private Cloud (VPC)?

Amazon Virtual Private Cloud (Amazon VPC) is a secure and scalable virtual network in the AWS Cloud. It enables you to launch AWS resources into a virtual network that you’ve defined. With Amazon VPC, you can define a virtual network topology that closely resembles a traditional network that you’d operate in your own data center.

Amazon VPC provides advanced security features, such as security groups and network access control lists, to enable inbound and outbound filtering at the instance and subnet level. You can easily customize the network configuration for your Amazon VPC. For example, you can create a public-facing subnet for your web servers that has direct access to the Internet, and place your backend systems, such as databases or application servers, in a private-facing subnet with no Internet access.

Here is the official AWS documentation on Amazon Virtual Private Cloud (VPC): https://aws.amazon.com/vpc/
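
A hedged boto3 sketch of the first steps (a VPC plus one subnet); the CIDR blocks are placeholders, and a real network also needs an internet gateway, route tables, and security groups.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a VPC with a /16 range and one subnet inside it.
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")
print(vpc_id, subnet["Subnet"]["SubnetId"])
```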

What is Amazon Account Trust Policy?

An Amazon Account Trust Policy refers to the set of rules or guidelines that govern the trust relationship between AWS customers and AWS. It outlines the responsibilities of both parties in regards to the use of AWS services and the protection of customer data.

The trust policy defines the security controls that AWS has implemented to protect customer data and the measures that customers must take to secure their data within the AWS environment. The trust policy also outlines the responsibilities of AWS in the event of a security breach, such as incident response and notification, and the responsibilities of customers, such as reporting security incidents.

Here is the official AWS documentation on AWS Customer Agreement: https://aws.amazon.com/agreement/

What is Amazon Aurora Serverless?

Amazon Aurora Serverless is a variant of Amazon Aurora, a relational database service that is fully managed and highly available. With Amazon Aurora Serverless, you don’t need to manage any infrastructure, as the service automatically provisions and scales the underlying resources based on the application’s workloads.

Amazon Aurora Serverless provides a serverless relational database solution, which means that the service automatically starts, scales, and shuts down the database engine based on usage, so you only pay for what you use. You can use Aurora Serverless for applications that experience frequent spikes in traffic and require fast performance, as the service can automatically scale resources as needed.

Here is the official AWS documentation on Amazon Aurora Serverless: https://aws.amazon.com/rds/aurora/serverless/

What is Amazon CloudFront?

Amazon CloudFront is a content delivery network (CDN) service provided by Amazon Web Services (AWS). It is used to distribute static and dynamic web content, such as HTML pages, images, videos, and APIs, to end-users with low latency and high data transfer speeds.

CloudFront works by caching content at edge locations around the world, so that when a user requests content, it can be delivered from the nearest edge location, rather than from the origin server. This results in faster content delivery and improved user experience.

CloudFront integrates with other AWS services, such as Amazon S3 and Amazon EC2, as well as with custom origin servers, making it a flexible and scalable solution for distributing content.

Here is the official AWS documentation on Amazon CloudFront: https://aws.amazon.com/cloudfront/

What are edge locations on Amazon AWS?

Edge locations on Amazon AWS are data centers located at the edge of the network, closest to end-users. They are used by Amazon CloudFront, a content delivery network (CDN) service, to cache and distribute static and dynamic web content, such as HTML pages, images, videos, and APIs.

When a user requests content that is being served by CloudFront, the service routes the request to the nearest edge location, rather than to the origin server. This results in faster content delivery and improved user experience.

Edge locations are strategically placed around the world, so that content can be delivered to users with low latency and high data transfer speeds, regardless of their location. The number of edge locations and their locations are subject to change as Amazon continues to expand its network.

Here is the official AWS documentation on Amazon CloudFront Edge Locations: https://aws.amazon.com/cloudfront/features/#Global_Content_Delivery

What is AWS WAF on CloudFront?

AWS WAF (Web Application Firewall) is a managed service that helps protect your web applications from common web exploits that could affect application availability, compromise security, or consume excessive resources. When integrated with Amazon CloudFront, AWS WAF can provide protection for your CloudFront distributions, which are used to deliver content from your origin servers to your end-users through a global network of edge locations.

AWS WAF enables you to create custom rules to block, allow, or count web requests based on conditions like IP addresses, HTTP headers, and content strings. This helps you to identify and block malicious requests and ensure that your applications are only serving legitimate traffic.

With AWS WAF and CloudFront, you can implement protection for your applications at the edge of the network, reducing the latency and network load on your origin servers.

Here is a link to the AWS WAF official documentation: https://aws.amazon.com/waf/

What is the overall concept of content delivery network (CDN)?

A content delivery network (CDN) is a distributed network of servers that are used to deliver content to end-users over the internet. The primary goal of a CDN is to improve the performance, reliability, and availability of content delivery by caching the content at various edge locations that are closer to the end-users.

CDNs are commonly used for delivering static web content, such as images, videos, and HTML pages, as well as for delivering dynamic content, such as APIs and real-time data. They work by copying the content from the origin server and caching it at multiple edge locations around the world. When a user requests content, the CDN routes the request to the nearest edge location, where the cached content is delivered. This reduces the latency and improves the data transfer speeds, resulting in faster content delivery and a better user experience.

CDNs are widely used by websites, online services, and media companies to deliver content to their users. They are also used to offload traffic from the origin server, which can reduce the load and improve the reliability of the origin server.

Here is a link to the Wikipedia article on Content Delivery Network: https://en.wikipedia.org/wiki/Content_delivery_network

What is Amazon CloudWatch?

Amazon CloudWatch is a monitoring service provided by Amazon Web Services (AWS) that provides operational and performance data on AWS resources and applications. With CloudWatch, you can monitor and collect metrics, set alarms, and take automated actions in response to changes in your resources.

CloudWatch provides a variety of metrics and log data from various AWS services, including Amazon EC2, Amazon S3, Amazon RDS, and many others. You can use CloudWatch to monitor performance and troubleshoot issues, track resource utilization and usage trends, and optimize costs.

CloudWatch also provides a centralized log repository that you can use to store and access logs from multiple AWS resources. This can help you to quickly identify and troubleshoot issues with your applications and infrastructure.

Additionally, CloudWatch offers integrations with other AWS services and third-party tools, allowing you to extend its monitoring and alerting capabilities.

Here is a link to the Amazon CloudWatch official documentation: https://aws.amazon.com/cloudwatch/
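
As a concrete example, the hedged boto3 sketch below creates a CPU alarm on a single EC2 instance; the instance ID, account number, and SNS topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when average CPU stays above 80% for two consecutive
# 5-minute periods, notifying an SNS topic.
cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-demo",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```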

What is Amazon Cognito on AWS Cloud?

Amazon Cognito is a service provided by Amazon Web Services (AWS) for managing user authentication and identity management for web and mobile applications. It enables you to create unique identities for your users, authenticate users with your own authentication systems or with AWS, and manage authorization for your users.

Cognito provides a simple and secure way to manage user sign-up, sign-in, and access control, allowing you to focus on building your application instead of managing user identities.

Cognito supports both standard user sign-up and sign-in and social identity providers, such as Amazon, Facebook, Google, and others. You can also add multi-factor authentication to further secure user access.

Cognito also integrates with other AWS services, such as Amazon S3, Amazon API Gateway, and AWS Lambda, to provide a complete solution for building and deploying secure and scalable web and mobile applications.

Here is a link to the Amazon Cognito official documentation: https://aws.amazon.com/cognito/

What is Amazon Elastic Container Registry (ECR)?

Amazon Elastic Container Registry (ECR) is a fully-managed Docker container registry that makes it easy to store, manage, and deploy Docker container images. With Amazon ECR, you can host your Docker images in a highly available and scalable infrastructure, and integrate with other AWS services like Amazon ECS, Amazon EKS, AWS Fargate, and others.

You can use Amazon ECR to store, manage, and deploy Docker images for your applications, and share images across teams within your organization. You can also control access to your images using AWS Identity and Access Management (IAM) policies, and monitor the security and compliance of your images with Amazon ECR Image Scanning.

Here is a link to the Amazon ECR official documentation: https://aws.amazon.com/ecr/

What is Amazon Elastic Container Service (ECS)?

Amazon Elastic Container Service (ECS) is a fully managed container orchestration service provided by AWS. It enables you to run, manage, and scale Docker containers on a cluster of Amazon EC2 instances.

With ECS, you can easily deploy, run, and manage containers and microservices applications, and take advantage of the scalability and availability of the AWS infrastructure. ECS provides a secure and reliable platform for running containers, and helps you manage the infrastructure and operations required to run your containers in production.

Some of the key features of Amazon ECS include:

  • Cluster management: ECS makes it easy to create, manage, and scale a cluster of EC2 instances that run your containers.
  • Task definition: ECS enables you to define and manage the containers, resources, and configuration of your applications in a task definition.
  • Load balancing: ECS integrates with AWS Elastic Load Balancer, enabling you to distribute incoming traffic evenly across your containers.
  • Automated scaling: ECS provides the ability to automatically scale the number of containers running in your cluster based on demand.

Here is a link to the official Amazon ECS documentation: https://aws.amazon.com/ecs/

What is Amazon IAM role?

Amazon IAM (Identity and Access Management) role is a feature in AWS that allows you to manage access to AWS resources. An IAM role defines a set of permissions that determine what actions an identity, such as an IAM user, EC2 instance, or a Lambda function, can perform in your AWS environment. IAM roles can be assumed by AWS services, applications running on EC2 instances, or other AWS users. They can also be assumed cross-account by trusted AWS accounts, enabling a flexible and secure way of granting access to resources in AWS.

https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html
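
For illustration, here is a hedged boto3 sketch that creates a role EC2 instances can assume and attaches an AWS managed read-only S3 policy; the role name is a placeholder.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy: allow the EC2 service to assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="demo-ec2-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Grant the role read-only access to S3 via an AWS managed policy.
iam.attach_role_policy(
    RoleName="demo-ec2-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
```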

What is Amazon Inspector?

Amazon Inspector is a security assessment service that helps improve the security and compliance of applications deployed on AWS. It enables you to automatically identify security vulnerabilities and deviations from best practices in your Amazon EC2 instances, and provides recommendations for remediation. With Amazon Inspector, you can assess the security and compliance of your applications and identify potential security issues before they are exploited. The service automatically runs a set of security checks and generates a report that highlights security findings and provides recommendations for improvement.

Here is the official documentation for Amazon Inspector: https://aws.amazon.com/inspector/

What is Amazon RDS?

Amazon Relational Database Service (Amazon RDS) is a managed relational database service provided by Amazon Web Services (AWS). It makes it easy to set up, operate, and scale a relational database in the cloud. With Amazon RDS, you can choose from popular database engines such as Amazon Aurora, Microsoft SQL Server, MySQL, Oracle, and PostgreSQL, and have the database software and infrastructure managed by AWS.

Amazon RDS handles tasks such as database setup, patching, backup, recovery, and failure detection and repair, freeing you from manual management tasks so that you can focus on application development and business initiatives.

Here is the official documentation for Amazon RDS: https://aws.amazon.com/rds/

What is Amazon RDS DB instances?

Amazon RDS DB instances are database instances managed by Amazon Relational Database Service (RDS). They allow customers to run relational databases in the AWS cloud, freeing up time and resources from database management tasks such as setup, patching, backups, and replication. Amazon RDS supports several database engines including Amazon Aurora, PostgreSQL, MySQL, MariaDB, Microsoft SQL Server, and Oracle.

https://aws.amazon.com/rds/db-instances/

What is Application Load Balancer in AWS Cloud?

Application Load Balancer is a type of load balancer in the AWS Cloud. It operates at the application layer (layer 7) of the OSI model and routes incoming traffic to one or multiple target resources, such as EC2 instances, containers, or IP addresses, based on the content of the request. It provides advanced request routing, content-based routing, SSL/TLS decryption, and container monitoring, among other features.

Here is a link to the official documentation for Amazon Application Load Balancer: https://aws.amazon.com/elasticloadbalancing/applicationloadbalancer/

What is Auto Scaling group in AWS Cloud?

Auto Scaling group is a component of Amazon Web Services (AWS) Auto Scaling service that allows you to automatically increase or decrease the number of EC2 instances in a group based on certain criteria, such as traffic demand or CPU utilization. Auto Scaling groups ensure that your application always has the right number of instances to handle incoming traffic and maintain performance. They also allow you to scale up and down quickly to respond to changes in traffic demand, reducing the risk of downtime or poor performance due to insufficient capacity.

https://docs.aws.amazon.com/en_us/autoscaling/ec2/userguide/AutoScalingGroup.html
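For example, a target tracking policy is one common way to scale on CPU utilization. This boto3 (Python) sketch assumes an Auto Scaling group named demo-web-asg already exists (the name is a placeholder):

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Let the group add or remove instances to keep average CPU around 50%.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="demo-web-asg",
        PolicyName="target-cpu-50",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
            "TargetValue": 50.0,
        },
    )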

What is AWS Fargate?

AWS Fargate is a technology for Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS) that allows users to run containers without having to manage the underlying instances. Fargate abstracts away the infrastructure management, so users can focus on building and operating their applications. With Fargate, users simply define the desired number of tasks and the required resources, and Fargate launches and manages the containers in the cloud.

Reference: https://aws.amazon.com/fargate/

What is AWS Firewall Manager?

AWS Firewall Manager is a security management service in Amazon Web Services (AWS) that provides centralized policy management for Amazon Virtual Private Cloud (VPC) security groups and AWS WAF (Web Application Firewall) rules. It allows you to define your security policies once and apply them across all of your accounts and applications, reducing the risk of security misconfigurations and increasing operational efficiency. With Firewall Manager, you can also automate security updates and simplify compliance by applying pre-defined security templates.

https://aws.amazon.com/firewall-manager/

What is AWS Lambda Function?

AWS Lambda is a serverless computing service offered by Amazon Web Services (AWS) that allows you to run code without having to provision or manage servers. It allows you to run your code in response to events such as changes to data in an Amazon S3 bucket, or updates to a DynamoDB table. Lambda automatically scales your application in response to incoming request traffic and it charges you only for the compute time that you consume.

Reference: https://aws.amazon.com/lambda/
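A minimal Python handler gives a feel for the event-driven model; this sketch assumes the function is wired to S3 "object created" notifications:

    # Minimal AWS Lambda handler (Python runtime) reacting to S3 object-created events.
    def lambda_handler(event, context):
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            print(f"New object uploaded: s3://{bucket}/{key}")
        return {"statusCode": 200}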

What is AWS Shield Advanced policy?

AWS Shield Advanced is a security service offered by Amazon Web Services (AWS) that provides protection against distributed denial of service (DDoS) attacks for applications running on the AWS cloud. It offers advanced features such as traffic filtering, real-time attack visibility, and automatic inline mitigations that can protect your applications even when a large-scale DDoS attack is in progress. AWS Shield Advanced also provides 24/7 access to AWS DDoS response experts who can help you mitigate attacks and restore normal traffic flow to your applications.

https://aws.amazon.com/shield/advanced/

What is AWS Snowball Edge Storage Optimized Device?

Amazon Snowball Edge Storage Optimized device is a data transfer appliance provided by Amazon Web Services (AWS) that helps transfer large amounts of data into and out of AWS. It is designed to be rugged and secure, and it can transfer data at high speeds, making it ideal for transferring large amounts of data in environments with limited or unreliable network connectivity. The device is equipped with high-performance storage and compute capabilities, allowing you to run compute functions and store data locally. The Snowball Edge Storage Optimized device also integrates with AWS Snowball and AWS Snowmobile for large-scale data transfers.

https://aws.amazon.com/snowball-edge/

AWS Snowball

AWS Snowball is a data migration service provided by Amazon Web Services (AWS). It helps to transfer large amounts of data into and out of the AWS Cloud, particularly when the amount of data being transferred is too large to be done over the internet in a reasonable time frame. AWS Snowball uses physical storage devices, called Snowballs, that are shipped to customers to transfer data to and from the AWS Cloud. The data transfer is performed in parallel to increase the speed of data migration and to minimize the impact on the customer’s network.

https://aws.amazon.com/snowball/

AWS Snowmobile

Amazon Snowmobile is a data transfer service provided by AWS that allows you to transfer very large amounts of data (100 PB+) into and out of the AWS Cloud. This service is intended for use cases such as disaster recovery, backup and archive, and migration of large datasets. The Snowmobile is a secure, climate-controlled truck that provides secure transfer of data from your on-premises infrastructure to AWS. Once the data is uploaded, it can be stored in Amazon S3 or Amazon S3 Glacier for long-term retention or further processing.

Reference: https://aws.amazon.com/snowmobile/

What is Cross-site scripting for AWS Cloud?

Cross-site scripting (XSS) is a type of security vulnerability that allows attackers to inject malicious code into a website that is viewed by other users. In an XSS attack, the attacker creates a payload (usually in the form of a script) that is executed by the browser of a victim who visits the targeted website. This can result in sensitive information being stolen or malicious actions being performed on behalf of the victim. In the context of AWS Cloud, protecting against XSS attacks is an important aspect of securing web-based applications and resources. This can be achieved through a combination of secure coding practices, input validation, and using appropriate security controls such as web application firewalls (WAFs) to block XSS attacks.

https://aws.amazon.com/security/security-bulletins/cross-site-scripting-xss/
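On the application side, output encoding is the classic first line of defense. A small Python sketch using only the standard library:

    import html

    def render_comment(user_input: str) -> str:
        # Escape HTML special characters so an injected <script> tag is displayed
        # as text instead of being executed by the browser; a WAF adds another layer.
        return "<p>" + html.escape(user_input) + "</p>"

    print(render_comment('<script>alert("xss")</script>'))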

What is Customer Gateway on AWS Cloud?

A Customer Gateway in AWS Cloud refers to a logical representation of a customer’s on-premises VPN (Virtual Private Network) device. It is used in Amazon Virtual Private Cloud (VPC) to allow communication between an Amazon VPC and a customer’s network. The customer gateway is the device in the customer’s network that routes the data to Amazon VPC over the Internet or a VPN connection.

Reference: https://docs.aws.amazon.com/vpc/latest/userguide/VPC_VPN.html

What is Disaster Recovery in AWS Cloud?

What is DynamoDB with DynamoDB Accelerator (DAX)?

What is DynamoDB?

What is General Purposed SSD Storage in AWS Cloud?

What is IOPS in Amazon Cloud?

What is Load Balancer for AWS Cloud?

What is multiple Availability Zones in AWS Cloud?

What is Network Load Balancer in AWS Cloud?

What is Pilot Light for Disaster Recovery in AWS Cloud?

What is Provisioned IOPS storage in AWS Cloud?

What is single Availability Zone in AWS Cloud?

What is VPC private subnet on AWS Cloud?

What is Warm Standby for Disaster Recovery in AWS Cloud?

What means RTO for Disaster Recovery in AWS Cloud?

Google Cloud Digital Leader Certification – My resume

Google Cloud Digital Leader Certification

I am studying for the Google Cloud Digital Leader Certification; as I understand it, this is like AZ-900 but from Google.

Here is the exam breakdown, listing every topic that may be covered on the exam: https://cloud.google.com/certification/guides/cloud-digital-leader

In this post I will explain all topics in my own words and try to provide links, etc., so this can become a quick guide on how to prepare for each and every topic.

I will keep updating this post as long as I am studying for this exam.

About this certification exam

Length: 90 minutes

Registration fee: $99

Language: English, Japanese

Exam format: 50-60 multiple choice and multiple select questions

Certification Renewal / Recertification: Candidates must recertify in order to maintain their certification status. The Cloud Digital Leader certification is valid for three years from the date of certification. Recertification is accomplished by retaking the exam during the recertification eligibility time period and achieving a passing score. You may attempt recertification starting 60 days prior to your certification expiration date.

Cloud Digital Leader

Certification exam guide

A Cloud Digital Leader can articulate the capabilities of Google Cloud core products and services and how they benefit organizations. The Cloud Digital Leader can also describe common business use cases and how cloud solutions support an enterprise. The Cloud Digital Leader exam is job-role agnostic and does not require hands-on experience with Google Cloud.

The Cloud Digital Leader exam assesses your knowledge in these areas:

  • Section 1: Digital transformation with Google Cloud (~10% of the exam)
  • Section 2: Innovating with data and Google Cloud (~30% of the exam)
  • Section 3: Infrastructure and application modernization (~30% of the exam)
  • Section 4: Google Cloud security and operations (~30% of the exam)

Section 1: Introduction to digital transformation with Google Cloud (approximately 10% of the exam)

This is just 10% of the exam, so no need to spend too much effort here; just know the basics and move on to the other sections, which each carry 30% of the exam.

1.1  Explain why cloud technology is revolutionizing business

Well, cloud is like having unlimited infrastructure over the internet, in multiple regions, with unlimited computing and storage power. Cloud is also about being scalable, flexible, and adaptable; it is easier to maintain, there are no upfront costs, you get pay-as-you-go models, and reliability comes from all those regions, dedicated hosts, and so on.

    ā—  Define key terms such as cloud, cloud technology, data, and digital transformation

Cloud is having computing power and infra available on the internet.

Cloud technology, as I understand it, is what makes up the cloud: cloud computing technology, cloud storage technology, cloud infrastructure, etc.

Data is what users or external sources provide and store within the cloud.

Digital transformation is the process of using digital technologies to create new — or modify existing — business processes, culture, and customer experiences to meet changing business and market requirements. (source)

1.2 Explain why it is critical for businesses to adopt new technology

Adopting new tech is important for businesses so they can keep up with customers' and clients' demands and expectations, and so they can compete against the competition and have (or at least pursue) a technology edge over others.

    ā—  Compare and contrast cloud technology and traditional or on-premises technology

We already know that cloud is elastic, scalable, reliable, flexible, etc. With an on-prem network we can also make things elastic, scalable, reliable, and flexible, but it comes at great cost and is usually not feasible. On-prem usually gives us more customization, especially at the infrastructure level, whereas with cloud we don't have much infrastructure customization, as the cloud providers take care of it for us. In exchange, we get everything else: the power to use as much computing as necessary, to allocate and de-allocate resources at will, and to pay only for what we use.

    ā—  Describe how customer expectations have changed because of cloud technology 

Well, I think that now we are used to systems with near 100% uptime that never slow down, because in the background the cloud is handling the load by increasing or decreasing compute, storage, or RAM. Users now expect their apps/systems to behave like that: to have fail-safes, to recover from disasters, and to never lose data.

    ā—  Identify the business and technical considerations that organizations need to think about when adopting cloud technology, including: infrastructure; application and business platform modernization; the importance of data; security in the cloud

Very broad topic here, but I think they are talking about the shared responsibility model. See the images below.

overview – shared responsibility model for google cloud
compute power – shared responsibility model for google cloud

They might also be talking about Total Cost of Ownership (TCO) and how the CAPEX vs. OPEX models work. In the cloud you are on an OPEX model (operational expenses): you pay for what you use within the cloud. With CAPEX (capital expenses), you spend money upfront to acquire hardware and software and to maintain it all, which means hiring people and paying for power, facilities, etc.

Section 2: Innovating with data and Google Cloud (approximately 30% of the exam)

2.1 Describe the role of data in digital transformation and the importance of a data-driven culture

Those concepts are a little abstract, but Google and Wikipedia can give us some general direction here, so we can describe data-driven as:

The adjective data-driven means that progress in an activity is compelled by data, rather than by intuition or by personal experience.

Therefore, the role of data in digital transformation is pivotal, since without data it would not be possible to achieve data-driven solutions or a data-driven culture.

    ā—  Explain how cloud technology enables data to be applied in new ways

Cloud enables data to be applied in new ways because you can access, ingest, store, and analyze data like never before. With cloud ETL tools, cloud data stores, and the cloud's auto-scaling properties, a new era of big data is at hand.

2.2 Identify common Google Cloud solutions for data management

Google has a great host of data solutions, such as BigQuery, Cloud Spanner, Firestore, Firebase, Cloud Storage, Cloud SQL, etc. See here: https://cloud.google.com/products/databases

    ā—  Recognize examples of structured and unstructured data

Structured data is data organized in a defined schema, like relational database tables, Excel tables, or CSV files.

Unstructured data is data like documents, images, videos, etc.

2.3. Identify common Google Cloud solutions for smart analytics 

Google has resources like Vertex AI, Vision AI, Vision API, BigQuery ML, etc. See more here: https://cloud.google.com/products#section-3

    ā—  Articulate the business benefits of storing data in the cloud

It's resilient, meaning the service will almost always be up and your data will always be there. I don't mean your data will always be hot and immediately available (you can put it in archive mode), but you can pick and choose how and where you store your data, and that is great value for a business.

    ā—  Apply appropriate business use cases for databases, data warehouses, and data lakes 

Big, huge topic here. I think this is more about knowing when to use database A or B, which data warehouse solution to pick, and so on. It's hard to give a simple example, but here is one:

Let's say you have an architecture where you receive lots of streaming data from all over and you want to store that data for analysis. How can you do that? Well, you can use a Pub/Sub topic to gather your data, then use Dataflow to land it in Cloud Storage, persist it in Cloud Bigtable, and use BigQuery to read and analyze it.
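To give a flavor of the ingestion step, here is a small sketch with the google-cloud-pubsub Python client; the project and topic names are placeholders:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "clickstream")

    # Each published event lands in the topic; Dataflow (or another subscriber)
    # picks it up downstream and writes it to storage.
    future = publisher.publish(topic_path, b'{"user": 42, "action": "click"}')
    print(future.result())  # message ID once the publish is acknowledged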

    ā—  Explain the benefits of Google Cloud data products, including: Looker, BigQuery, Cloud Spanner, Cloud SQL, Cloud Storage

Looker: a business intelligence platform to model and analyze data on top of your data warehouse.

BigQuery: serverless; lets you analyze lots of data using SQL, with built-in features like machine learning, geospatial analysis, and business intelligence (see the small query sketch after this list).

Cloud Spanner: Fully managed relational database with unlimited scale, strong consistency, and up to 99.999% availability.

Cloud SQL: Fully managed relational database service for MySQL, PostgreSQL, and SQL Server with rich extension collections, configuration flags, and developer ecosystems.

Cloud Storage: is a managed service for storing unstructured data. Store any amount of data and retrieve it as often as you like.
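Sticking with BigQuery as an example, analyzing data really is just SQL plus a client. A small sketch with the google-cloud-bigquery Python client, querying a public dataset (project and credentials are assumed to come from the environment):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Standard SQL over a public dataset; no clusters or servers to provision.
    sql = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 5
    """
    for row in client.query(sql).result():
        print(row["name"], row["total"])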

2.4. Identify Google Cloud’s solutions for machine learning and AI

Here you can see them all: https://cloud.google.com/products#section-3. But I would focus on:

Vertex AI, Cloud AutoML, BigQueryML, Cloud Natural Language, Vision AI, Vision API

    ā—  Define artificial intelligence (AI) and machine learning (ML) 

Artificial Intelligence (AI) is a branch of computer science that deals with the creation of intelligent machines that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation.

The goal of AI research is to create systems that can perform tasks that normally require human intelligence, such as reasoning, learning, and perception.

Machine Learning (ML) is a subset of Artificial Intelligence (AI) that focuses on the development of algorithms that enable computers to learn and make predictions or decisions based on data, without being explicitly programmed to perform the task. In Machine Learning, algorithms analyze and learn from data, and then make a prediction or classification about new, unseen data.

    ā—  Outline the importance of data quality in ML prediction accuracy

Data quality is crucial to the accuracy of Machine Learning (ML) predictions because the quality of the data determines the quality of the model that is built.

In summary, high-quality data is a necessary foundation for accurate ML predictions. A model can only be as good as the data it is trained on, so it’s important to invest time and effort into data preparation and cleaning to ensure that the data used for ML is of high quality.

    ā—  Describe Google Cloud’s differentiators with regard to AI and machine learning 

Well, as I understand it, the key differentiator is that Google tries to make AI and ML easy to use and accessible, meaning that even people without much training can benefit from Google AI and ML.

    ā—  Recognize the ways customers can use Google Cloud’s AI and ML solutions to create business value

I think the main point is automation of business processes: customers can use Google Cloud's AI and ML solutions to automate repetitive and manual tasks, reducing errors and freeing up employees' time for more strategic work.

Section 3: Infrastructure and application modernization (approximately 30% of the exam)

3.1 Learn what modernizing IT infrastructure with Google Cloud means

Modernizing IT infrastructure with Google Cloud means upgrading and transforming traditional IT systems and processes to take advantage of the scalability, security, and innovation offered by cloud computing.

    ā—  Explain why legacy infrastructure struggles to deliver modern services to customers 

Legacy infra has a hard time upgrading its current workloads; even when an upgrade is possible, it is usually expensive and can't scale well enough, because you will always have limited hardware resources.

    ā—  Explain the benefits of modernizing infrastructure with cloud technology

In comparison, cloud computing provides the scalability, flexibility, and innovation needed to deliver modern services to customers. By leveraging the power of the cloud, organizations can quickly and easily scale their IT resources, integrate new technologies, and keep pace with the latest innovations, all while reducing costs and improving security.

    ā—  Differentiate between hybrid and multicloud infrastructures

Hybrid is when you have Private Cloud + Public Cloud in your Organization.

Multicloud is when you have multiple, different, Public Clouds in your Organization.

    ā—  Differentiate between virtual machines, containers, and serverless computing within business use cases 

VMs are IaaS, so you have control of the infra, but you also need to install and maintain your OS, etc.

Containers are usually PaaS: the OS is already there, and you just focus on developing, deploying, and running your app.

With serverless computing, everything else is abstracted away; you just provide code in the form of functions. Customers only pay for the computing resources they actually use, rather than having to reserve and pay for a fixed amount of computing resources in advance.

In a serverless computing environment, the customer provides their application code in the form of functions, which are triggered by events, such as a user request or a change in data. The cloud provider then executes the code and manages the underlying infrastructure, including provisioning, scaling, and monitoring the resources needed to run the code.
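As a tiny illustration of the function-per-event model, this is what an HTTP-triggered function looks like in the style of Google Cloud Functions (Python runtime); the platform provisions and scales instances per request:

    # Deployed with an HTTP trigger; 'request' is a Flask request object.
    def hello_http(request):
        name = request.args.get("name", "world")
        return f"Hello, {name}!"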

    ā—  Identify the Google Cloud solutions that help businesses modernize their infrastructure

Google Cloud offers a range of solutions to help businesses modernize their infrastructure, including Google Kubernetes Engine, Cloud Functions, Anthos, and Compute Engine virtual machines.

This might help: https://cloud.google.com/products/#section-5

3.2 Understand modernizing applications with Google Cloud 

    ā—  Describe the business drivers for modernizing applications

Business drivers for modernizing applications include increased efficiency, improved user experience, enhanced security, increased agility, better scalability, and improved collaboration. These drivers motivate organizations to invest in modernizing their applications to stay competitive and meet the changing needs of their customers and employees.

    ā—  Describe the benefits of using cloud-native applications

Cloud-native applications are designed to run on cloud computing infrastructure and take advantage of its scalability, reliability, and security features.

    ā—  Apply the appropriate change pattern to different business use cases

In general, applying the appropriate change pattern involves a process of analysis and evaluation, in which the specific requirements of each use case are considered, and the best method for making changes is selected based on those requirements.

    ā—  Explain the benefits of Google Kubernetes Engine, Anthos, and App Engine for application development

Google Kubernetes Engine (GKE): GKE is a fully managed Kubernetes service that makes it easy for organizations to run, manage, and scale containerized applications using Kubernetes.

Anthos is a hybrid and multi-cloud application platform that enables organizations to build, run, and manage applications on any cloud, including Google Cloud, on-premises, or other cloud providers.

Anthos provides a unified platform for managing applications, regardless of where they are deployed, which helps organizations to streamline their operations and increase their productivity.

App Engine is a fully managed platform for building and deploying web applications and mobile backends.

3.3 Understand the value of APIs 

    ā—  Explain how application programming interfaces (APIs) can modernize legacy systems

APIs are a way for different systems to communicate with each other, making them more accessible and easier to use. They help businesses improve security, respond to changing needs, and integrate their systems with new data sources. APIs can modernize legacy systems and make them work better with new technology, resulting in a more efficient and effective business.

    ā—  Describe how APIs can create new business value 

APIs allow businesses to securely share their data and services with others, opening up new opportunities for growth and innovation. By making their services available to others, businesses can reach new customers, create new revenue streams, improve their own services, and drive innovation. APIs can help businesses create new business value by leveraging the creativity and expertise of others.

    ā—  Explain the benefits of Apigee

Google Cloud Apigee is a tool that helps businesses create and manage their APIs in a simple and secure way. Using Apigee, businesses can increase their productivity by quickly creating and deploying APIs, improve security by protecting sensitive data, ensure reliability by monitoring API performance, handle more traffic by automatically scaling, make the process easier for developers, and respond quickly to changing business needs by deploying new APIs. In short, Apigee helps businesses efficiently and effectively create and manage their APIs.

Section 4: Understanding Google Cloud security and operations (approximately 30% of the exam)

4.1 Describe financial governance in the cloud and Google Cloud’s recommended best practices for effective cloud cost management 

    ā—  Explain how adopting cloud technology affects the total cost of ownership (TCO) 

Adopting cloud technology can affect the total cost of ownership (TCO) by reducing costs in several ways. The most significant cost savings come from reducing capital expenses as organizations no longer have to purchase, maintain, and upgrade expensive hardware and software.

Another factor is that cloud providers offer a pay-as-you-go model, where customers only pay for the resources they use, instead of having to make large upfront investments. This helps organizations reduce risk, as they are only paying for what they need, and can also help them manage cash flow more effectively.

Additionally, operating expenses are reduced as cloud providers manage the infrastructure, reducing the need for internal IT staff.

Cloud technology can also increase agility, allowing organizations to quickly respond to changing business needs without incurring additional costs. Furthermore, the ability to scale resources up or down as needed can help optimize costs, avoiding overprovisioning or underutilization of resources.

    ā—  Identify the cost management strategy needed in a given business scenario

Cost management depends on several factors. On Google Cloud you can leverage sustained use discounts, committed use discounts, or resources that can scale to zero, etc.

4.2 Describe a cloud security approach and Google Cloud security benefits

    ā—  Define fundamental cloud security terms, including privacy, availability, security, and control

In the context of cloud technology, privacy is a key concern and organizations must ensure that their cloud provider implements adequate measures to protect sensitive information.

In the context of cloud technology, high availability is a critical requirement to ensure the seamless functioning of applications and services.

In the context of cloud technology, security is a top priority and organizations must ensure that their cloud provider implements robust security measures to keep their data and applications safe.

In the context of cloud technology, control is a key concern for organizations who must ensure that they have the necessary visibility and management capabilities to secure their data and applications in the cloud.

    ā—  Explain what is meant by a shared responsibility model 

The shared responsibility model in Google Cloud refers to the division of security responsibilities between the customer and Google. According to this model, Google is responsible for securing the infrastructure and services provided by the cloud platform, while the customer is responsible for securing the applications and data that run on that infrastructure.

    ā—  Describe the security benefits of using Google Cloud

Google Cloud offers security benefits such as secure infrastructure, data protection, and compliance with industry standards. These features help ensure the safety and privacy of your data and applications.

For the exam, it is important to focus on the following services: Cloud Data Loss Prevention and Security Command Center.

    ā—  Identify today’s top cybersecurity challenges and threats to data privacy

    ā—  Demonstrate how organizations can control and manage access to cloud resources

Organizations can control and manage access to cloud resources through a combination of techniques, such as identity and access management (IAM), network security, and encryption.

4.3 Explain how IT operations need to adapt to thrive in the cloud 

When moving to the cloud, IT operations need to change the way they work to fully leverage the benefits of the cloud. This often involves adopting new processes and tools, such as automation and orchestration, to manage and scale resources dynamically.

    ā—  Differentiate service availability requirements in the cloud versus in on-premises environments

In a cloud environment, the service availability refers to the ability of the users to access and use the services provided by the cloud provider without interruption. In comparison, in an on-premises environment, the service availability refers to the ability of the users to access and use the services provided by the local IT infrastructure without interruption.

One major difference is that in a cloud environment, the cloud provider is responsible for ensuring the availability of the services, while in an on-premises environment, the responsibility falls on the local IT team. Another difference is that cloud providers typically offer higher levels of service availability and reliability, often with Service Level Agreements (SLAs) that guarantee certain levels of uptime. In an on-premises environment, achieving high levels of service availability can be more challenging and requires more investment in hardware, software, and staffing.

    ā—  Describe the operational challenges that DevOps solves

DevOps is a set of practices that combines software development and IT operations to optimize the delivery of software products. It aims to improve the collaboration, communication, and automation of software development processes.

    ā—  Apply the goals of site reliability engineering (SRE) to different business use cases 

Site reliability engineering (SRE) is a software engineering practice that aims to ensure the reliability, availability, and performance of a software system by combining the principles of software development and systems operations. The goals of SRE can be applied to different business use cases in order to improve the overall reliability and performance of their systems.

4.4 Identify Google Cloud solutions for cloud resource monitoring and application performance management 

Google Cloud offers several solutions for cloud resource monitoring and application performance management. Some of them include:

  1. Stackdriver: a comprehensive solution for monitoring, logging, and diagnosing the performance of cloud-powered applications.
  2. Cloud Monitoring: a cloud-native monitoring solution that provides insights into the performance and health of cloud resources.
  3. Cloud Trace: a distributed tracing system that provides visibility into the performance of microservices-based applications.
  4. Cloud Profiler: a performance profiling tool that helps identify performance bottlenecks and optimize the performance of applications running on Google Cloud.
  5. Cloud Debugger: a cloud-based debugging tool that enables developers to quickly find and fix performance issues in production applications.

    ā—  Explain the potential impact of unexpected or prolonged downtime

Unexpected or prolonged downtime can have a significant impact on a business. It can cause disruptions to operations, leading to loss of revenue and customers. Downtime can also harm the reputation of a company and impact customer trust in the brand. Additionally, prolonged downtime can cause long-term damage to a company’s bottom line, as the costs associated with resolving the issue and restoring operations can be high. Moreover, it can result in missed opportunities, such as new business or increased market share. In an increasingly competitive landscape, having reliable and available systems is critical to business success, making the potential impact of unexpected or prolonged downtime a significant concern for organizations of all sizes.

    ā—  Define monitoring, logging, and observability within the context of cloud operations 

Monitoring refers to the process of tracking the performance, availability, and resource utilization of an IT system. It is used to detect and diagnose issues in the system, and to ensure that the system is functioning correctly.

Logging involves collecting, storing, and analyzing log data generated by an IT system. Log data can be used to diagnose problems, track system activity, and support security and compliance requirements.

Observability refers to the ability to understand the behavior of a complex system by analyzing the data it generates. It enables IT operations teams to monitor the performance of their systems, detect and diagnose problems, and understand the root cause of issues. By improving observability, IT operations teams can ensure the reliability and performance of their systems, even as they scale and evolve.
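As a quick illustration of the logging piece, here is a sketch with the google-cloud-logging Python client; the log name is a placeholder and credentials are assumed to come from the environment:

    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client()
    logger = client.logger("demo-app")  # a named log stream for this sketch

    # Structured entries can later be filtered and queried by field.
    logger.log_struct({"event": "checkout_failed", "order_id": 1234})
    logger.log_text("Plain text entry: checkout failed for order 1234")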

    ā—  Identify the Google Cloud resource monitoring and maintenance tools.

Google Cloud offers several tools for monitoring and maintenance of cloud resources, including:

  1. Stackdriver: a comprehensive monitoring, logging, and diagnostics platform
  2. Cloud Monitoring: provides visibility into the performance and health of your cloud resources and applications
  3. Cloud Logging: provides real-time logs and insights into your applications and infrastructure
  4. Cloud Trace: helps you understand the performance and behavior of your cloud-based applications
  5. Cloud Profiler: provides performance profiling data for your applications running in the cloud
  6. Cloud Debugger: helps you diagnose and debug your cloud-based applications in production.

Cloud Digital Leader Sample Questions

The Cloud Digital Leader sample questions will familiarize you with the format of exam questions and example content that may be covered on the exam. These questions are for the latest version of the exam that launched January 26, 2022.

The sample questions do not represent the range of topics or level of difficulty of questions presented on the exam. Performance on the sample questions should not be used to predict your Cloud Digital Leader exam result.

https://docs.google.com/forms/d/e/1FAIpQLSedAmf77MGS7FGEaylFzY51KtBd7kkIZJIMDsV5zSRSmpKIOA/viewform

Sample Questions Answered and Reviewed

Here are some questions I find worth sharing with you.

Question 01

The question here is: what does maturity level mean in Google Cloud, and what is this "transformational" maturity level?

To be able to answer this question we must know about Google Cloud Adoption Framework (GCAF), it is a guide from Google to help customers with cloud adoption. You can check it in completeness here (it’s a pdf).

There are many topics on GCAF, but for this particular question what we need to understand is this:

GCAF Phases (source)

The Google Cloud Adoption Framework also suggests that a business's readiness for success in the cloud is determined by its current business practices in each of these four themes. For each theme, those practices will fall into one of the following phases:

Tactical (short term goals): Individual workloads are in place, but no coherent plan encompassing all of them with a strategy for building out to the future. The focus is on reducing the cost of discrete systems and on getting to the cloud with minimal disruption. The wins are quick, but there is no provision for scale. 

Strategic (mid term goals): A broader vision governs individual workloads, which are designed and developed with an eye to future needs and scale. You have begun to embrace change, and the people and processes portion of the equation are now involved. IT teams are both efficient and effective, increasing the value of harnessing the cloud for your business operations. 

Transformational (long term goals): With cloud operations functioning smoothly, you’ve turned your attention to integrating the data and insights garnered from working now in the cloud. Existing data is transparently shared. New data is collected and analyzed. The predictive and prescriptive analytics of machine learning applied. Your people and processes are being transformed, which further supports the technological changes. IT is no longer a cost center, but has become instead a partner to the business. 

Question 02

I didn't really know what an SLO was, so here is Google Cloud's definition of an SLO, with an example: "Latency can exceed 300 ms in only 5 percent of the requests over a rolling 30-day period". But yeah, lowering the SLO (I was treating SLO as if it were the same as an SLA) wouldn't do any good, so this couldn't be the correct answer.

Move to hybrid is also not it, because the question says the company "has completely migrated to the cloud"; why go back to on-prem? Makes no sense, so no.

Modernize their apps: this is the correct one. Kind of simple and objective, but way too vague, or so I thought, but well, it is what it is.

Manually provision, yeah, this is counter intuitive, we are moving towards the age of fully automatic stuff, so I guess I guessed wrong here.

If you noticed, like I am noticing now, this is that kind of question where you remove all that you know that is wrong and go with the one that can be the answer, even if the correct answer is as vague as it is.

Question 03

Developers want cloud providers to take full control of their apps performance, what, how? Makes no sense.

IT managers want to stop making gradual changes. Well, honestly I thought this could be the answer, but maybe IT managers aren’t the ones doing changes, so I left this one.

IT managers want cloud providers to make infra changes for them. I should have stuck with my thought that IT managers aren't the ones running changes and eliminated this one, but I didn't, and that's why I got it wrong.

Developers want to test ideas and experiment with more ease. In the end this is really the one that makes more sense, as in the cloud you can scale in and out, up and down, and also deploy resources quickly and easily.

Question 04

I was like, ok, they have the hardware, they have their cloud, so now use API to integrate that hardware to your cloud, or public cloud. But nope, so API here is actually being used to connect the hardware to their customer service app, whatever it is. The answer reflects my line of thought, but I still picked the wrong answer.

Question 05

Again, same thing here. No sensitive data goes to the public cloud, apparently. But I was more thinking about push notifications, so dashboards and charts wasn’t really looking like a correct answer to me, that’s why I got it wrong.

Question 06

These questions are so tricky and vague. Ok, you can establish a partnership between sectors, but how does that guarantee that you will lower your cloud costs? Google might have some paper on this; once/if I find it, this will be updated.

Question 07

That’s a good one. I was really like between A and C. I probably already saw that question/answer on some Azure scenario.

Highlights on Google Cloud Resources

Here are some notes I took on my own while studying for this certification:

Cloud Functions

  • Charged only when being used
  • Use for scenarios where you want to eliminate cost of Windows licensing; you need something as fully managed as possible
  • Can be written in many languages; PHP, Java, etc
  • Automatically spins up or down (scale to zero)
  • Use when you want to create an APP that will only be used during business hours
  • Is able to perform event-driven services, build serverless backend, process real-time data

Kubernetes Engine

  • Container Orchestration platform
  • Scalable and flexible configuration
  • Complete control over orchestration, such as network, storage, observability
  • Supports stateful APP use case
  • Open source solution

VMWare Engine

  • Use to lift and shift your on-prem VMware-based apps to the cloud

Data Fusion

  • Scalable data integration
  • Use to clean, prepare, transfer and transform data
  • Has a web UI
  • CDAP open-source pipeline development
  • Can be used as visualization tool for ETL

Root Node

  • Organization is the root node of Google Cloud hierarchy of resources
  • Apply policies on root node when you need those policies to replicate on all levels (folders, projects, resources, billing accounts, etc)

Bare Metal

  • Use to run specialized workloads, meaning, use when there isn’t a proper resource to do it already in the cloud
  • Integrate with GCP with minimal latency

Billing Accounts

  • Google recommends the creation of one central cloud Billing Account for the whole Organization
  • To merge all projects under a single Billing Account move all projects to the root organization

Managed Instance Group for Compute Engine

  • Uses instance template to create identical instances
  • Allows operating on multiple identical VMs

Folders and Projects

  • Creating a folder per department is the standard for multiple departments and allows better organization
  • Creating a project per workload allows each workload to follow specific governance and compliance policies
  • Projects are the base-level organizing unit in GCP (resources are the foundation level)
  • Projects are required to use service-level resources
  • All service-level resources are parented by Projects

IAM

  • Least privilege
  • Resource policy for access and permissions
  • Networking: firewall rules

Service Account

  • Use when resources in a resource group need access to something, like BigQuery
  • Create a Service Account for each functionality-security equivalent resource

Cloud Logging

  • Monitor, analyze, and alert on logging data and events from both clouds (AWS and GCP)

Transfer Appliance Service

  • Use when you want to quickly transfer 50TB of data from on-prem to GCP

Beyond Corp

  • Use when you want to make sure that internal company data is only accessible from corporate devices when being accessed within the corporate offices

Firestore

  • Used for mobile gaming
  • Cloud hosted, NoSQL database
  • Keeps your data in-sync in real time across devices
  • Scalable, low-maintenance, and serverless document database

Directory Sync

  • Use when your organization has a LDAP server and want to allow their users to access GCP resources
  • Sync users and groups from AD/LDAP with GCP

Pub/Sub

  • Use for handling ingestion from many clients simultaneously
  • Gather events from many clients simultaneously

Business Payments Profile

  • Use this if you are paying on behalf of your company
  • Use if you want to give a person the ability to download all possible invoices for year-end tax purposes

Dataproc

  • Use for Apache Spark solutions, also for Presto, Apache Flink, etc

Anthos

  • Manage apps by running containers and other workloads across on-premises and multiple public clouds

Preemptible Virtual Machine (PVM)

  • Cheap, short-lived compute instances
  • Use for batch and fault-tolerant workloads
  • PVM lasts up to 24 hours
  • PVM is the cheapest computation solution

Container Optimized OS

  • Is an OS image for your Compute Engine VM that is optimized for running Docker containers
  • Use to reduce VM boot time to a minimum

Premium Support

  • Enables Customer Aware Support
  • Enables Account Manager

Storage Transfer Service

  • Use to transfer from AWS S3 to GCP cloud store online
  • Allows to quickly import data online into cloud storage
  • Allows repeating schedule for transferring data
  • Use to transfer data within cloud storage from one bucket to another

Cost Table Report

  • Cost detail per invoice per month
  • Gives a tabular view of your monthly costs for a given invoice

Cloud Big Table

  • Use to receive large amount of traffic
  • Reads has very low latency
  • Fully managed, scalable NoSQL database service
  • Ideal to store very large amounts of data in a key-value store
  • Use for large NoSQL analytical workloads
  • Can handle stream with spikes to ingest data
  • Handles large volume of data from sensors, IOT, stream, etc
  • Can work with stream data at rates from 6,000 clicks per minute up to 8,500 clicks per second

Cloud Filestore

  • File share system that can be mounted effectively on several Compute Engine instances for media workloads
  • Uses NFS

APIGEE

  • Use for managing APIs and establish security policies
  • APIGEE is a platform to develop and manage APIs

Cloud Code

  • Dev tool that works as IDE
  • Write, debug and deploy
  • Works with IntelliJ, VS Code or your browser
  • Works with Cloudshell

Cloud Run

  • Abstracts infra management by automatically scaling up and down depending on traffic
  • Only charges for resource used
  • Can build data processing APPs that transform lightweight data, as it arrives, and stores it as structured data

Signed URL

  • Grants temporary access to a google storage resource

Firebase

  • Fully managed platform for web and mobile APP development
  • Is serverless and can scale to zero
  • Is an alternative to Cloud Run when the focus is on building and deploying apps in a highly optimized framework

Vision API

  • Can help identify products of interest within images and visually search product catalogs
  • Offers powerful pre-trained ML models
  • Detect objects and faces, reads printed and handwritten text
  • Find products of interest within images
  • Visually search product catalogs
  • Detect and classify multiple objects, including location, based on image
  • Helps categorization of products using images

Vertex AI

  • Can train, host and use ML models to make predictions at scale
  • Is an integrated suite of products that combines AI platform functions with pre-trained, AutoML and custom tooling
  • Hosts tensorflow models

Data Loss Prevention

  • Protect Sensitive data as you move to the cloud
  • Allows to classify sensitive data in structured and unstructured workloads
  • Use to ensure that PII (personally identifiable information) are de-identified via masking to keep information safe

Cloud Shell and CLoud SDK

  • Administer GCP resources using Google Cloud CLI commands

App Engine Flexible

  • Allows devs to focus on writing code
  • Balances load automatically (note: unlike Standard, the flexible environment does not scale to zero)
  • Best used when apps run in Docker containers
  • For apps that receive constant traffic or experience regular traffic fluctuations

APP Engine Standard

  • Ideal for APPs that experience sudden and extreme spikes of traffic which requires immediate scaling
  • Can scale up in seconds to handle large spikes in demand

Google Cloud VPN

  • Cloud networking solution to establish a secure connection to a VPC (Virtual Private Cloud)
  • Connect peer network to VPC using IPSec VPN connection
  • Use if you need to establish a secure connection to a company VPC
  • Is cost effective
  • Use if low-latency is not a requirement
  • Use if high availability is not a requirement

Cloud SQL

  • Store PII information in table format in the cloud
  • Ideal for ERP, CRM, E-commerce

Cloud Spanner (SQL like)

  • Use to avoid scaling issue with mobile games
  • Use SQL-like database
  • Unlimited scale, strong consistency and 99.999% uptime for multi-region
  • Never needs to pause for patching, backups, etc

BigQuery ML

  • Analyze BigQuery data using Machine Learning
  • Create and run ML models in BigQuery
  • Uses standard SQL queries
  • Can host tensorflow models

Artifact Registry

  • Use if you need to store container images within Google Cloud and support multi-region or regional repositories
  • Enables you to store artifacts and build dependencies
  • Integrates with Cloud Build and other CI/CD systems
  • Provides a single location for storing and managing your packages and Docker container images

Billing Reports

  • Use to perform a break down by region of your GCP costs
  • Cost breakdown by region

Committed Use Discounts

  • Use if your workloads are, and will continue, constant for a long time
  • This is the cheapest option for a 1-3 year commitment

Migrate to Compute Engine

  • Use if you want to migrate your on-prem VMs (the machines themselves, not just the apps) to GCP

Cloud Identity (IDaaS)

  • Manage devices remotely using the company on-prem AD server
  • Use Cloud Identity to federate IDs from different providers with Google

Cloud Storage

  • Can serve as CDN (content delivery network)
  • Use if a specific object in your APP has to be shared across numerous VM instances and zones
  • Archive storage class; lowest cost
  • Archive storage class; good for archiving data if you plan accessing less than once a year
  • Archive storage class; minimum 365-day storage duration
  • Storage Admin; grant this permission if users need to access and manage Cloud Storage buckets and files
  • Storage Admin; can be applied to an individual bucket
  • Coldline storage; very low-cost for storing infrequently accessed data
  • Coldline storage; 90-day minimum storage duration

Cloud Build

  • Automation in building, testing and deploying
  • Creates pipelines to automate deployments
  • Deploy to multiple clouds
  • Is protected by GCP Security

Cloud NAT

  • Do not reveal IP address
  • Allows you to provision your app instances without public IP addresses while still allowing them to access the internet

Resources

https://cloud.google.com/certification/guides/cloud-digital-leader

https://www.youtube.com/watch?v=UGRDM86MBIQ – Google Cloud Digital Leader Certification Course – Pass the Exam!

https://docs.google.com/document/d/10yMTxjhAAriZUTLcnyCeS4zBWXX_8DkAGYYYALQe0VE/edit?usp=sharing – Google Cloud Digital Leader Exam Prep Guide

https://medium.com/@walkrinthecloud/how-to-prepare-for-the-google-cloud-digital-leader-certification-exam-6bf1dcf017d7 – How to Prepare For The Google Cloud Digital Leader Certification Exam

https://www.exampro.co/gcp-cdl – Free and paid material, also questions, lots of questions

After Section 2 I got tired of writing everything and started using ChatGPT to write for me. Meaning, Sections 3 and 4 were generated using ChatGPT.

Google Looker – the basics

First part gives a macro overview on Google Looker, so it’s time to start digging deeper.

Is Looker free?

No, Looker is not free. Looker Studio is free, but that is a different product. I really wish they could change names, remove one of those products or I don’t know…. so much confusion, even harder to find relevant stuff online.

Be mindful that here our focus is Looker. I will probably never talk about Looker Studio, so if you read Looker here, it is Looker, not Looker Studio. Every time I talk about Looker Studio I will call it Studio or Looker Studio.

Know more about Looker here: https://cloud.google.com/looker/docs

What is the difference between Looker and Looker Studio?

I don't really know how to say this, but they might be similar… As far as I understand, Looker Studio is like a free, capped version of Looker, but I might be wrong; check the image below, from Google itself.

They behave differently, in the way you ingest data and so on. In Looker you have LookML, which I think you don't get in Looker Studio. Actually, you can even use Looker as a data source for Looker Studio.

here's what Google points out about Looker vs. Looker Studio

How do I access Looker?

To access Looker you need to go here https://cloud.google.com/resources/looker-free-trial and ask for your free trial.

Another option is to use the Google Cloud Skills Boost labs, here: https://www.cloudskillsboost.google/focuses/21214?parent=catalog. Just start the lab and start messing around. You can start this lab several times per day (10, I think). You don't actually need to complete the lab tasks; it is just a "hack" to give you access to a Looker instance in case your job doesn't provide you one.

How do I get started creating visuals and insights with Looker?

Ok, so you are now inside Looker and want to work with it. You want to get your dataset and start creating visuals, but how?

Let’s take for instance the lab I provided the link before. I’m inside it, but now what?

Looker home, landing page

If what you want is to create visuals, go to Explore and then choose a model and a dataset inside that model. In Looker, this dataset inside a model is called an Explore.

Here I am selecting the Fruit Basket explore.

Looker explore

This is what I get:

insider Looker explore Fruit Basket
Note: if you need more definition on images, right click and open original

What do we have here? It's pretty different from what I am used to with Power BI, I must confess.

The deal here is to use the dimensions and measures in the left pane to create a data result, a table. Let’s take Color dimension and Average Price measure and see the result.

Boring, I got nothing… I mean, we got two columns, like table headers, that are now on our data results pane.

Looker result pane

In order to see the data we need to RUN this. So, let me click on RUN.

Now I have COLOR and AVERAGE PRICE data.

Looker result pane after hitting Run button

What else can I do here? You can create visualizations, add filters, and do other things with those measures and dimensions, like pivoting data and creating calculated custom fields, plus some other options I am not remembering now. Once you get the hang of how an Explore works, it becomes intuitive.
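If you prefer code over clicking, the same Color + Average Price result can be pulled with the Looker Python SDK. This is only a sketch: it assumes API credentials are configured (looker.ini or environment variables), and the model, explore, and field names below are my guesses for the Fruit Basket example, not verified values:

    import looker_sdk
    from looker_sdk import models40 as models

    sdk = looker_sdk.init40()  # reads credentials from looker.ini or env vars

    # Equivalent of picking Color and Average Price in the Explore and hitting RUN.
    result = sdk.run_inline_query(
        result_format="json",
        body=models.WriteQuery(
            model="fruit_basket",   # assumed model name
            view="fruit_basket",    # the Explore
            fields=["fruit_basket.color", "fruit_basket.average_price"],
            limit="50",
        ),
    )
    print(result)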

Here’s a view on some other options expanded in our Explore window.

Looker Explore panes open

Ok, great, but what if I want to add more data? More tables, more measures, more dimensions, how do I create relationships between that data?

How to add more data to my Explore?

So far we have the Fruit Basket explore, but what if I want to add more measures and dimensions to it?

You can do that by creating CUSTOM FIELDS; here is the documentation on custom fields: https://cloud.google.com/looker/docs/custom-fields. Basically, custom fields are like expressions: formulas in Excel, or your DAX measures in Power BI.

custom fields options

Let’s say I want to know the percentage of the total that each Average Price represents.

First, let me add the TOTALS row to this result and RUN it again, so we can have a better understanding of what is going on.

totals enabled

So, if RED is 0.72 against a 0.55 total, then we can do this simple math, where RED minus TOTAL, divided by TOTAL, gives RED's percentage relative to the total. Therefore: (0.72 - 0.55) / 0.55 ≈ 30%.

We can do this by adding a Pie Chart and checking the distributed percentage:

Looker pie chart

We also can add a calculation, by clicking in ADD CALCULATION (near the TOTALS checkbox).

custom fields for table calculation
table calculation field New Calculation2 added

Of course, this was created using the Looker UI, which is great, but you could also have written it out by hand. More on table calculations here: https://cloud.google.com/looker/docs/table-calculations

Here's how to write this very calculation, as the documentation tells us:

Lexp = field_1/sum(field_1)
manually entering a formula expression on Looker
both results matches

So far we have talked about adding more data to enhance your analysis by adding custom fields, but what if we want to add even more data, as in data from the database, the source itself? How?

How to add even more data to your Explore

If custom fields are not enough for you, then your next option is tweaking your views and models.

subtle link gets you to the LookML source

You can access your models and views by clicking on Go to LookML inside your Explore, or you can go back to Looker home and go Develop.

or use Develop option from the menu

Inside the Develop menu you have Projects; check it there. Fruits_basket will be within qwiklabs-flights.

Fruit Basket is a pretty simple example and therefore doesn’t have its own project structure, but it will serve us for now.

So this is the source of our Fruit Basket Explore:

fruit basket model, and explore, and view, they are all bundled together here

As the image shows, we have an "explore: fruit_basket {}" block that defines our Explore name; after that we have the view, and then the dimensions. The image doesn't show it, but we have measures at the end of this file, here:

fruit basket measures

There is also another type of navigation available within the Develop area, the Object Browser. So far we were using the File Browser, which I think is a good place to be when you are starting your model, but if what you want is to modify your already created model, perhaps it is worth checking the Object Browser, like this:

object browser

Let’s say I wanna do something in my model, like, create a new dimension or measure… For that, first, activate DEVELOPMENT MODE.

activate development mode

If you feel like creating a new dimension, read this documentation: https://cloud.google.com/looker/docs/reference/param-dimension-filter-parameter-types, but if you are looking for measures, check this https://cloud.google.com/looker/docs/reference/param-measure-types.

Here I create a new dimension called “price_per_pound_time_weight” (typo: should’ve been “times”, not time).

new dimension I created

This dimension takes the value of price_per_pound and multiplies it by weight. Notice that the variables I'm using here are the ones from the other dimensions, not data taken directly from the table source. This is the result:

price per pound time weight as shown in the Explore

Wrapping up

Let me stop right here, we have covered some very interesting topics and this will give you enough information to cover the basics. Make sure to read the documentation links I left throughout the reading.

That's all I have for now. Here are some topic ideas for part 2, so stay tuned! =)

  • How do I load data into Looker from a database / CSV / Google Sheets?
  • What is a LookML project?
  • What are views in Looker?
  • What are models in Looker?

Reference

https://youtu.be/PuLUcmAzyO8 – Looker – Database to Dashboard Technical Demo

https://youtu.be/fbgva493gb8 – Technical deep dive on Looker: The enterprise BI solution for Google Cloud

Google Looker – First steps

So, I am trying to learn Looker for a new project, and, as usual, I will share here all materials I am using to better understand this new tool.

Let me start by saying that learning a new tool from scratch, with no help from anyone, is really hard; you will usually get lost – and I am – but fear not. Let's do it together and find a structured way to deal with all of that.

Here is what I have figured out, so far, that you need to learn if you want to become a good Looker developer:

  • How views work
  • How explores work
  • How models work
  • How views, models, and explores interact in order to create a new visualization

Looker Connect

Self-paced enablement platform. I am taking the Looker connect learning paths. Here: https://connect.looker.com/

Very boring at the start, but I had my first lab yesterday. If the labs keep coming, this might get very interesting.

Book – Data Storytelling with Google Looker Studio

I first started reading this book thinking it would be about Looker, but it is not. Looker and Looker Studio are two different things. I still need to figure out the exact difference, but they are not the same thing.

Still, this is a very good book to get you going with data storytelling. Worth reading, for sure.

This book can be found here: https://www.oreilly.com/library/view/data-storytelling-with/9781800568761/

Key Features

  • Gain a solid understanding of data visualization principles and learn to apply them effectively
  • Get to grips with the concepts and features of Looker Studio to create powerful data stories
  • Explore the end-to-end process of building dashboards with the help of practical examples

I am always supportive of books; I think we should read more books, and by we I mean me, lol. So, I will definitely read this one. I will let you know what I find worthwhile in it.

Looker paradigm

This, the Looker paradigm, is about what I wrote at the start of this post, on "How views, models, and explores interact in order to create a new visualization".

In the image above you can see that the Database is in fact your data source, and that "LookML code files" is a macro view of a set of views/models, like here:

Looker Develop mode

And the result of what you see in the image above can then be found in Explore mode, like this:

In this Explore we are inside training_ecommerce.model, and in this model we have 2 explores: order_items and events.

Inside each explore we have a set of dimensions and measures. Follow the images below, read the captions, and you will understand.

model showing explore events
explore showing the events from model
putting together models and explore
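
As a rough sketch, the model file behind this would look something like the snippet below. The connection name and include path are assumptions on my part; only the two explore names come from the screenshots:

connection: "bigquery_connection"  # assumed connection name

include: "/views/*.view.lkml"      # assumed include path

explore: order_items {}

explore: events {}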

LookML (Looker modeling language)

LookML is not "Look Machine Learning"; it is terrible that they chose this acronym to represent the Looker modeling language. I would say that LookML is to Looker what DAX or M is to Power BI.

What can you use LookML for? Here are some examples (there is a small sketch right after this list):

  • Restructure data: pre-aggregate values, apply window functions, etc.
  • Apply smart caching to conserve database resources/bandwidth
  • Configure row and column-level security on the underlying data
  • Create dynamic logic for aggregations, filters, joins, etc
  • Apply conditional formatting of colors, styles, etc
  • Link to other areas within or outside of Looker
  • Modularize code for easy reuse and expansion
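
To make a couple of those bullets concrete, here is a minimal sketch of the caching and row-level security bits. The explore, field, and user attribute names are illustrative, not from any real project:

# smart caching: define a datagroup and persist the explore with it
datagroup: daily_cache {
  sql_trigger: SELECT CURRENT_DATE() ;;
  max_cache_age: "24 hours"
}

explore: order_items {
  persist_with: daily_cache

  # row-level security: only return rows matching the user's "region" user attribute
  access_filter: {
    field: order_items.region
    user_attribute: region
  }
}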

Looker how to master

In order to really master Looker, Google suggests that we:

  • Master the science of LookML code.
  • Master the art of designing Explores that are both performant and user friendly.

More resources

Videos on Looker: https://vimeo.com/looker (I haven't watched any, but I've been told there are some good ones there)

The Looker YouTube channel; I did not like it, watch at your own risk, lol: https://www.youtube.com/channel/UC1AIN4pUKbLzzMg8RKprtgQ

Learning badges, get recognized as a Looker person: https://www.cloudskillsboost.google/catalog?keywords=looker This is hot, I am working through it; the UI/UX and structure are kind of messy compared to Microsoft's, but it is what it is.

Speaking of Cloud Skills Boost from Google, this one has been specifically suggested: https://www.cloudskillsboost.google/quests/165

The documentation, the docs, never forget: https://cloud.google.com/looker/docs. I particularly do not like the way the docs are structured, but that’s what we have for now.

The udemy course I’m taking that looks good so far: https://www.udemy.com/course/looker-learning-tutorial-for-beginners/

Azure DP-900: Microsoft Azure Data Fundamentals | Notes and Highlights

Just passed Azure DP-900 exam!

Here is what and how I did it:

And here are all the notes I took during my study process:

Document Data Store

  • All data should be stored in and retrieved from a single data structure
  • Used to store user profile data in a single document, like shipping addresses and user preferences
  • Each user can have a different number of addresses and a different number of preferences
  • A document database consists of entities that have their related data stored in a single document
  • Documents support a flexible schema
  • Documents are usually in JSON format
  • A document data store is a type of non-relational data store

Azure Storage

  • Enable hierarchical namespace to implement folder- and directory-level security in Azure Storage
  • Hierarchical namespace allows you to organize blob containers into folders and directories
  • Hierarchical namespace allows you to define POSIX-compatible permissions on containers
  • Hierarchical namespace allows you to define RBAC on containers

Azure SQL Database

  • Azure SQL Database has a lower cost compared with Azure SQL Managed Instance
  • Azure SQL Database requires minimal administration compared with Azure SQL Managed Instance
  • Azure SQL Database has higher availability ( 99.995% ) compared with Azure SQL Managed Instance ( 99.99% )
  • Azure SQL Database is a relational database for modern cloud apps
  • Azure SQL Database is PaaS
  • Azure SQL Database eliminates the need to manage the SQL Server configuration
  • Azure SQL Database is not as compatible with on-premises SQL Server as Azure SQL Managed Instance is
  • Azure SQL Database does not allow you to choose the SQL Server engine version
  • Azure SQL Database uses the latest stable version of the SQL Server engine
  • Azure SQL Database includes a fully managed backup service
  • Azure SQL Database automatically applies security patches to the underlying operating system

Graph Database

  • A NODE is an instance of a data entity, like a person or a company
  • An EDGE is a relationship between instances of data entities
  • A PROPERTY is an attribute of a data entity, like a person's email or address
  • DIRECTION is a property of an edge; edges can be directed or undirected
  • Graph databases store semi-structured data

Azure HDInsight

  • Azure HDInsight is a big data processing service
  • Azure HDInsight is used to provision and manage clusters of open-source analytics solutions
    • like Apache Spark, Hadoop, Kafka
  • Azure HDInsight is compatible with Apache Spark

Azure Data Explorer

  • Uses KQL language

Azure Table Storage

  • Azure Table Storage keys are composed of a partition key + row key
  • Azure Table Storage: the number of columns may vary from row to row
  • Azure Table Storage is a NoSQL data store that holds semi-structured data
  • Azure Table Storage is a key/value data store
  • Azure Table Storage rows must have a unique key ( the partition key + row key combination )
  • Azure Table Storage partition key: identifies the partition containing the row
  • Azure Table Storage row key: is unique to each row within the same partition

Key / Value Data Store

  • A key/value data store can receive real-time events from multiple sources, which may vary in schema
  • A key/value data store works like a lookup
  • A key/value data store finds rows by using a unique key
  • A key/value data store is essentially a large hash table
  • A key/value data store is optimized for fast data writes
  • In a key/value data store, each data row is referenced by a single key value

Spark Structured Streaming

  • Spark Structured Streaming can be used with: Azure Synapse Analytics, Azure Databricks, Azure HDInsight
  • Spark Structured Streaming is an open-source library for Apache Spark-based services
  • Spark Structured Streaming is a near real-time processing engine

Azure Cosmos DB

  • Azure Cosmos DB allows simpler queries to retrieve data than a relational DB
  • Azure Cosmos DB supports apps with a global user base
  • Azure Cosmos DB is a NoSQL database
  • Azure Cosmos DB supports storing data in a denormalized way
  • Azure Cosmos DB supports multi-region writes
  • Azure Cosmos DB's general resource hierarchy is: Azure Cosmos DB account > database > container > item

Azure AZ-900: Microsoft Azure Fundamentals | Notes and Highlights

Just passed AZ-900 certification!!

I have written a page on all my tips and notes to help you get through as well, check it out here: https://powerbixpert.com/azure-az-900-microsoft-azure-fundamentals/

Notice that not all topics are covered there.

What I suggest, and the best way I see to take advantage of these notes, is:

Have fun, good luck!

The Drunkard’s Walk, a book worth reading twice, or more. A Data Analyst perspective

As you know, if you follow my LinkedIn profile (which, if you don't, you should), I was reading Leonard Mlodinow's "The Drunkard's Walk", a really great book if you want to get a grasp on how chance, or randomness, affects our lives and, even better, how we fail to perceive what is chance and what is not.

This book builds, chapter by chapter, on how probability and statistics came to be a field, and how our understanding of it still trips us up, since we have a really hard time letting go of our intuition.

Here are some key points that I think are worth knowing as a Data Analyst. Please note that these are only the ones I remember by heart right after having read the book; either way, I highly recommend it to anyone who considers finding patterns in data an interesting thing.

  • The probability that two events will both occur can never be greater than the probability that each will occur individually. This is known as The First Law of probability.
  • If two possible events, A and B, are independent, then the probability that both will occur is equal to the product of their individual probabilities. This is known as The Second Law of probability.
  • Girolamo Cardano and the Law of Sample Space
  • Law of Large Numbers
  • Pascal’s triangle
  • Gambler’s Fallacy
  • Bayes’ theorem
  • The bell curve, normal distribution, sample variance
  • Significance testing
  • Confirmation bias, Ambiguous evidence

These are all really complex topics to explore, so if you feel like knowing more about each of them, I highly encourage you to read the book and/or read about them on the internet.

4 Insights you can gather from Michael Lewis’ Moneyball as a Data Analyst

Moneyball book cover

If you follow me on LinkedIn you already know that I'm, or actually was, reading Moneyball. I just finished the book today, so I thought it would be fun to share what I got from it using my perspective, in other words, a data analyst's perspective.

It's no surprise that this book is, or was, a best seller. It's easy, and compelling, to read; the story is engaging for those who like baseball, or who like "think outside the box" approaches, or who like to see underdogs beating favorites, or if you like a story of someone who became relevant for something. That being said, Moneyball does not provide any hard data or numbers; it's not a math or statistics book, it's just a story book, but one can still gather reliable insights by reading it.

So here are four insights gathered from Moneyball:

  1. It’s hard to challenge common wisdom

Billy Beane had to go from being the next baseball superstar to being a baseball failure to understand that the way things were going was wrong.

That being said, it's easy to find people saying that we should think outside the box, but in truth they expect you to think inside their sandbox, and if you challenge their common sense they call you wrong, or even worse. Besides, just thinking is one thing; acting is something else. If you are thinking differently and you start acting differently, it may take a while until you get people to see that your way of doing it is better. Actually, even today you can find people who still do not take numbers and stats into account, because "they know better".

  2. It takes time to build your data stats properly

The guy who really put things together was Bill James, whose first article was published in 1977, and back then it wasn't easy to gather all the data, and not all data was properly gathered. In fact, gathering, preparing, and figuring out which data you actually need is an ongoing process that never ends. James' 1977 studies only started being used by Billy Beane much later, in 1997, so there was a 20-year gap. Furthermore, after all those years of baseball stats being used by Billy Beane's team since '97, they still found new and improved ways to use and gather data in 2001, with Voros McCracken's new way to measure defense data.

So, knowing that, it is important, especially if you are a leader, a manager, or a director, to understand that data literacy takes time, a lot of time. Maybe not 30 years, but it's not a 6-month project that will get you all the sweet, sweet insights you need. It might give you some, but understand that sometimes what you really need has not even been thought about yet.

  3. It's hard to trust data when it tells you something you disagree with

Moneyball introduces us to some players who are only playing in the major leagues because of Billy Beane and Paul DePodesta's stat sheets. All the other teams think Billy's team is going to lose, but they end up winning many games, because of the stats. It doesn't matter if a player looks good, or runs fast, or throws a ball at 95 mph; all that matters is whether, by the baseball metrics, they can deliver wins, and that is what happens. Many times players are doing something that everyone else thinks they should do differently, but common wisdom is usually wrong.

Have you ever seen something like this happen? When a manager checks a report that says he should be selling more in this area and not that one, or that some product should be selling more than another, but he insists on doing something different from what the data is telling him? It's common, and very similar to point 1, because people often think they know better.

  4. The luck factor

In baseball you have many games throughout the season, but only a few in the playoffs. What happens is that the way your data applies changes, because your sample changes. That is why during a season Billy's team would end up high in the ranking, but usually lost in the playoffs.

That shows us that not every KPI or metric should be applied to the same sample, or considered in the same context. Be aware of what you are measuring before taking a metric into account; you may end up with a correct number for the wrong situation.

Conclusion

There are many more insights one can gather from reading this book; those listed here are just some that came off the top of my head as I was writing this blog post. Something else worth noticing is how Billy dealt with his team members. It's a long story, but it follows pretty much the idea you can get from reading another book, First, Break All the Rules – which I also highly recommend, especially if you are in a lead position.

You can buy Moneyball here; I don't make a single dollar by recommending it, but I'm doing it anyway 🙂