
Free Practice · No Signup Required

30 Free AWS DAS-C01 Practice Questions

Real practice questions for the AWS Certified Data Analytics Specialty (DAS-C01) exam, with answers and detailed explanations. Updated 2026.

Free questions

30

Passing score

750 out of 1000

Exam time

180 minutes

Question pool

80+ questions

Below are 30 real practice questions for the AWS Certified Data Analytics Specialty (DAS-C01) exam. Each question is followed by the correct answer and a detailed explanation. Use these to benchmark your readiness: if you score below 70% on these 30 questions, plan for at least 4 more weeks of study before booking.

DAS-C01 Practice Questions

  1. Question 1. A streaming application is reading data from Amazon Kinesis Data Streams and immediately writing the data to an Amazon S3 bucket every 10 seconds. The application is reading data from hundreds of shards. The batch interval cannot be changed due to a separate requirement. The data is being accessed by Amazon Athena. Users are seeing degradation in query performance as time progresses. Which action can help improve query performance?

    • A.Merge the files in Amazon S3 to form larger files.(correct answer)
    • B.Increase the number of shards in Kinesis Data Streams.
    • C.Add more memory and CPU capacity to the streaming application.
    • D.Write the files to multiple S3 buckets.

    Correct answer: A

    Merge the files in Amazon S3 to form larger files.

    Explanation

    Athena query performance on S3 depends on the number and size of the underlying files. Because Athena pays a per-object overhead for listing and opening files, the thousands of small files produced every 10 seconds degrade queries over time; compacting them into larger files (around 100-200 MB) significantly improves query performance.
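
    One common compaction approach is an Athena CTAS query that rewrites the small files into larger Parquet objects. The sketch below runs one via boto3; the bucket, database, and table names are hypothetical placeholders.

    ```python
    import boto3

    athena = boto3.client("athena")

    # CTAS rewrites the small raw files into larger Parquet objects under a
    # new prefix. All names and locations here are placeholders.
    ctas = """
    CREATE TABLE merged_events
    WITH (
        external_location = 's3://example-bucket/merged/',
        format = 'PARQUET'
    ) AS
    SELECT * FROM raw_events
    """

    athena.start_query_execution(
        QueryString=ctas,
        QueryExecutionContext={"Database": "analytics_db"},
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )
    ```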

  2. Question 2. A team of data scientists plans to analyze market trend data for their company's new investment strategy. The trend data comes from five different data sources in large volumes. The team wants to utilize Amazon Kinesis to support their use case. The team uses SQL-like queries to analyze trends and wants to send notifications based on certain significant patterns in the trends. Additionally, the data scientists want to save the data to Amazon S3 for archival and historical reprocessing, and use AWS managed services wherever possible. The team wants to implement the lowest-cost solution. Which solution meets these requirements?

    • A.Publish data to one Kinesis data stream. Deploy a custom application using the Kinesis Client Library (KCL) for analyzing trends, and send notifications using Amazon SNS. Configure Kinesis Data Firehose on the Kinesis data stream to persist data to an S3 bucket.
    • B.Publish data to one Kinesis data stream. Deploy Kinesis Data Analytics to the stream for analyzing trends, and configure an AWS Lambda function as an output to send notifications using Amazon SNS. Configure Kinesis Data Firehose on the Kinesis data stream to persist data to an S3 bucket.(correct answer)
    • C.Publish data to two Kinesis data streams. Deploy Kinesis Data Analytics to the first stream for analyzing trends, and configure an AWS Lambda function as an output to send notifications using Amazon SNS. Configure Kinesis Data Firehose on the second Kinesis data stream to persist data to an S3 bucket.
    • D.Publish data to two Kinesis data streams. Deploy a custom application using the Kinesis Client Library (KCL) to the first stream for analyzing trends, and send notifications using Amazon SNS. Configure Kinesis Data Firehose on the second Kinesis data stream to persist data to an S3 bucket.

    Correct answer: B

    Publish data to one Kinesis data stream. Deploy Kinesis Data Analytics to the stream for analyzing trends, and configure an AWS Lambda function as an output to send notifications using Amazon SNS. Configure Kinesis Data Firehose on the Kinesis data stream to persist data to an S3 bucket.

    Explanation

    Kinesis Data Analytics allows you to use SQL-like queries to analyze streaming data in real-time. It can trigger AWS Lambda as an output to send notifications via Amazon SNS. Kinesis Data Firehose is the standard way to persist data to S3 for archival.
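
    For illustration, a Kinesis Data Analytics SQL application can deliver matched rows to an AWS Lambda output, which then publishes to SNS. A minimal sketch of that handler contract follows; the topic ARN is hypothetical.

    ```python
    import base64
    import json

    import boto3

    sns = boto3.client("sns")
    TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:trend-alerts"  # hypothetical

    def handler(event, context):
        """Receive output rows from a Kinesis Data Analytics SQL application."""
        results = []
        for record in event["records"]:
            payload = json.loads(base64.b64decode(record["data"]))
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject="Significant trend detected",
                Message=json.dumps(payload),
            )
            # Acknowledge delivery so KDA does not retry this record.
            results.append({"recordId": record["recordId"], "result": "Ok"})
        return {"records": results}
    ```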

  3. Question 3. A large company receives files from external parties in Amazon EC2 throughout the day. At the end of the day, the files are combined into a single file, compressed into a gzip file, and uploaded to Amazon S3. The total size of all the files is close to 100 GB daily. Once the files are uploaded to Amazon S3, an AWS Batch program executes a COPY command to load the files into an Amazon Redshift cluster. Which program modification will accelerate the COPY process?

    • A.Upload the individual files to Amazon S3 and run the COPY command as soon as the files become available.
    • B.Split the number of files so they are equal to a multiple of the number of slices in the Amazon Redshift cluster. Gzip and upload the files to Amazon S3. Run the COPY command on the files.(correct answer)
    • C.Split the number of files so they are equal to a multiple of the number of compute nodes in the Amazon Redshift cluster. Gzip and upload the files to Amazon S3. Run the COPY command on the files.
    • D.Apply sharding by breaking up the files so the distkey columns with the same values go to the same file. Gzip and upload the sharded files to Amazon S3. Run the COPY command on the files.

    Correct answer: B

    Split the number of files so they are equal to a multiple of the number of slices in the Amazon Redshift cluster. Gzip and upload the files to Amazon S3. Run the COPY command on the files.

    Explanation

    The Redshift COPY command is designed to load data in parallel across all slices in the cluster. Splitting the input files into a multiple of the number of slices ensures optimal parallel loading and maximum performance.
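
    As a concrete sketch, assuming a hypothetical 4-node cluster with 2 slices per node (8 slices), the snippet below writes rows into 8 gzip parts; a single COPY against the shared key prefix then loads them in parallel.

    ```python
    import gzip
    import math

    SLICES = 8  # hypothetical: 4 nodes x 2 slices per node

    def split_for_copy(lines, n_files=SLICES):
        """Write rows into n_files gzip parts so COPY can load them in parallel."""
        chunk = math.ceil(len(lines) / n_files)
        for i in range(n_files):
            with gzip.open(f"part_{i:02d}.csv.gz", "wt") as f:
                f.writelines(lines[i * chunk:(i + 1) * chunk])

    # After uploading the parts under one prefix, a single COPY loads them all:
    #   COPY daily_events FROM 's3://example-bucket/parts/part_'
    #   IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' GZIP CSV;
    ```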

  4. Question 4. A company has a business unit uploading .csv files to an Amazon S3 bucket. The company's data platform team has set up an AWS Glue crawler to do discovery, and create tables and schemas. An AWS Glue job writes processed data from the created tables to an Amazon Redshift database. The AWS Glue job handles column mapping and creating the Amazon Redshift table appropriately. When the AWS Glue job is rerun for any reason in a day, duplicate records are introduced into the Amazon Redshift table. Which solution will update the Redshift table without duplicates when jobs are rerun?

    • A.Modify the AWS Glue job to copy the rows into a staging table. Add SQL commands to replace the existing rows in the main table as postactions in the DynamicFrameWriter class.(correct answer)
    • B.Load the previously inserted data into a MySQL database in the AWS Glue job. Perform an upsert operation in MySQL, and copy the results to the Amazon Redshift table.
    • C.Use Apache Spark's DataFrame dropDuplicates() API to eliminate duplicates and then write the data to Amazon Redshift.
    • D.Use the AWS Glue ResolveChoice built-in transform to select the most recent value of the column.

    Correct answer: A

    Modify the AWS Glue job to copy the rows into a staging table. Add SQL commands to replace the existing rows in the main table as postactions in the DynamicFrameWriter class.

    Explanation

    To avoid duplicates in Redshift during ETL reruns, the recommended pattern is to load data into a staging table first, then use a transaction to replace existing records in the target table based on a unique identifier (upsert).
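
    A sketch of this staging-table pattern using the Glue Redshift writer's preactions/postactions options; the connection, database, and table names are hypothetical, and the merge assumes an id column.

    ```python
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    # Read the source table from the Data Catalog (names are placeholders).
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="csv_landing", table_name="daily_upload")

    merge_sql = (
        "BEGIN;"
        "DELETE FROM public.sales USING public.sales_staging "
        "WHERE public.sales.id = public.sales_staging.id;"
        "INSERT INTO public.sales SELECT * FROM public.sales_staging;"
        "DROP TABLE public.sales_staging;"
        "END;"
    )

    # Write into the staging table, then run the merge as a postaction so a
    # rerun replaces rows instead of duplicating them.
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=dyf,
        catalog_connection="redshift-conn",
        connection_options={
            "dbtable": "public.sales_staging",
            "database": "dev",
            "preactions": "CREATE TABLE IF NOT EXISTS public.sales_staging (LIKE public.sales);",
            "postactions": merge_sql,
        },
        redshift_tmp_dir="s3://example-bucket/glue-tmp/",
    )
    ```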

  5. Question 5. An airline has .csv-formatted data stored in Amazon S3 with an AWS Glue Data Catalog. Data analysts want to join this data with call center data stored in Amazon Redshift as part of a daily batch process. The Amazon Redshift cluster is already under a heavy load. The solution must be managed, serverless, well-functioning, and minimize the load on the existing Amazon Redshift cluster. The solution should also require minimal effort and development activity. Which solution meets these requirements?

    • A.Unload the call center data from Amazon Redshift to Amazon S3 using an AWS Lambda function. Perform the join with AWS Glue ETL scripts.
    • B.Export the call center data from Amazon Redshift using a Python shell in AWS Glue. Perform the join with AWS Glue ETL scripts.
    • C.Create an external table using Amazon Redshift Spectrum for the call center data and perform the join with Amazon Redshift.(correct answer)
    • D.Export the call center data from Amazon Redshift to Amazon EMR using Apache Sqoop. Perform the join with Apache Hive.

    Correct answer: C

    Create an external table using Amazon Redshift Spectrum for the call center data and perform the join with Amazon Redshift.

    Explanation

    Amazon Redshift Spectrum allows you to query data directly from S3 using your existing Redshift cluster. This minimizes the load on the cluster for joining large S3 datasets with Redshift data, as the intensive compute is handled by the Spectrum layer.

  6. Question 6. A large ride-sharing company has thousands of drivers globally serving millions of unique customers every day. The company has decided to migrate an existing data mart to Amazon Redshift. The existing schema includes the following tables. A trips fact table for information on completed rides. A drivers dimension table for driver profiles. A customers dimension table holding customer profile information. The company analyzes trip details by date and destination to examine profitability by region. The drivers data rarely changes. The customers data frequently changes. What table design provides optimal query performance?

    • A.Use DISTSTYLE KEY (destination) for the trips table and sort by date. Use DISTSTYLE ALL for the drivers and customers tables.
    • B.Use DISTSTYLE EVEN for the trips table and sort by date. Use DISTSTYLE ALL for the drivers table. Use DISTSTYLE EVEN for the customers table.
    • C.Use DISTSTYLE KEY (destination) for the trips table and sort by date. Use DISTSTYLE ALL for the drivers table. Use DISTSTYLE EVEN for the customers table.(correct answer)
    • D.Use DISTSTYLE EVEN for the drivers table and sort by date. Use DISTSTYLE ALL for both fact tables.

    Correct answer: C

    Use DISTSTYLE KEY (destination) for the trips table and sort by date. Use DISTSTYLE ALL for the drivers table. Use DISTSTYLE EVEN for the customers table.

    Explanation

    In Redshift, small dimension tables that rarely change (like 'drivers') are best distributed with DISTSTYLE ALL to avoid network shuffles. Large fact tables (like 'trips') should be distributed by a key used in joins. Frequently changing dimensions (like 'customers') are often best with DISTSTYLE EVEN if no single join key is ideal.
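
    A minimal DDL sketch of that layout, executed here with the redshift_connector driver; connection details and columns are simplified placeholders.

    ```python
    import redshift_connector  # pip install redshift-connector

    ddl_statements = [
        """CREATE TABLE trips (
               trip_id BIGINT,
               destination VARCHAR(64),
               trip_date DATE,
               fare DECIMAL(10,2))
           DISTSTYLE KEY DISTKEY(destination) SORTKEY(trip_date);""",
        "CREATE TABLE drivers (driver_id BIGINT, name VARCHAR(128)) DISTSTYLE ALL;",
        "CREATE TABLE customers (customer_id BIGINT, name VARCHAR(128)) DISTSTYLE EVEN;",
    ]

    # Connection details are placeholders.
    conn = redshift_connector.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        database="dev",
        user="admin",
        password="example-password",
    )
    cursor = conn.cursor()
    for stmt in ddl_statements:
        cursor.execute(stmt)
    conn.commit()
    ```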

  7. Question 7. A software company hosts an application on AWS, and new features are released weekly. As part of the application testing process, a solution must be developed that analyzes logs from each Amazon EC2 instance to ensure that the application is working as expected after each deployment. The collection and analysis solution should be highly available with the ability to display new information with minimal delays. Which method should the company use to collect and analyze the logs?

    • A.Enable detailed monitoring on Amazon EC2, use Amazon CloudWatch agent to store logs in Amazon S3, and use Amazon Athena for fast, interactive log analytics.
    • B.Use the Amazon Kinesis Producer Library (KPL) agent on Amazon EC2 to collect and send data to Kinesis Data Streams to further push the data to Amazon Elasticsearch Service and visualize using Amazon QuickSight.
    • C.Use the Amazon Kinesis Producer Library (KPL) agent on Amazon EC2 to collect and send data to Kinesis Data Firehose to further push the data to Amazon Elasticsearch Service and Kibana.(correct answer)
    • D.Use Amazon CloudWatch subscriptions to get access to a real-time feed of logs and have the logs delivered to Amazon Kinesis Data Streams to further push the data to Amazon Elasticsearch Service and Kibana.

    Correct answer: C

    Use the Amazon Kinesis Producer Library (KPL) agent on Amazon EC2 to collect and send data to Kinesis Data Firehose to further push the data to Amazon Elasticsearch Service and Kibana.

    Explanation

    Kinesis Data Firehose can ingest log data and deliver it directly to Amazon OpenSearch (formerly Elasticsearch) Service with Kibana for visualization, providing a highly available, managed solution for real-time log analysis.

  8. Question 8. A manufacturing company has been collecting IoT sensor data from devices on its factory floor for a year and is storing the data in Amazon Redshift for daily analysis. A data analyst has determined that, at an expected ingestion rate of about 2 TB per day, the cluster will be undersized in less than 4 months. A long-term solution is needed. The data analyst has indicated that most queries only reference the most recent 13 months of data, yet there are also quarterly reports that need to query all the data generated from the past 7 years. The chief technology officer (CTO) is concerned about the costs, administrative effort, and performance of a long-term solution. Which solution should the data analyst use to meet these requirements?

    • A.Create a daily job in AWS Glue to UNLOAD records older than 13 months to Amazon S3 and delete those records from Amazon Redshift. Create an external table in Amazon Redshift to point to the S3 location. Use Amazon Redshift Spectrum to join to data that is older than 13 months.(correct answer)
    • B.Take a snapshot of the Amazon Redshift cluster. Restore the cluster to a new cluster using dense storage nodes with additional storage capacity.
    • C.Execute a CREATE TABLE AS SELECT (CTAS) statement to move records that are older than 13 months to quarterly partitioned data in Amazon Redshift Spectrum backed by Amazon S3.
    • D.Unload all the tables in Amazon Redshift to an Amazon S3 bucket using S3 Intelligent-Tiering. Use AWS Glue to crawl the S3 bucket location to create external tables in an AWS Glue Data Catalog. Create an Amazon EMR cluster using Auto Scaling for any daily analytics needs, and use Amazon Athena for the quarterly reports, with both using the same AWS Glue Data Catalog.

    Correct answer: A

    Create a daily job in AWS Glue to UNLOAD records older than 13 months to Amazon S3 and delete those records from Amazon Redshift. Create an external table in Amazon Redshift to point to the S3 location. Use Amazon Redshift Spectrum to join to data that is older than 13 months.

    Explanation

    Redshift Spectrum is ideal for a 'hot/cold' data strategy. By unloading data older than 13 months to S3 and querying it via Spectrum, you keep the most active data on fast local storage while maintaining access to historical data at a lower cost.

  9. Question 9. A financial company hosts a data lake in Amazon S3 and a data warehouse on an Amazon Redshift cluster. The company uses Amazon QuickSight to build dashboards and wants to secure access from its on-premises Active Directory to Amazon QuickSight. How should the data be secured?

    • A.Use an Active Directory connector and single sign-on (SSO) in a corporate network environment.(correct answer)
    • B.Use a VPC endpoint to connect to Amazon S3 from Amazon QuickSight and an IAM role to authenticate Amazon Redshift.
    • C.Establish a secure connection by creating an S3 endpoint to connect Amazon QuickSight and a VPC endpoint to connect to Amazon Redshift.
    • D.Place Amazon QuickSight and Amazon Redshift in the security group and use an Amazon S3 endpoint to connect Amazon QuickSight to Amazon S3.

    Correct answer: A

    Use an Active Directory connector and single sign-on (SSO) in a corporate network environment.

    Explanation

    Amazon QuickSight supports Active Directory integration via AD Connector or AWS Managed Microsoft AD, allowing users to sign in with their existing corporate credentials (SSO).

  10. Question 10. A US-based sneaker retail company launched its global website. All the transaction data is stored in Amazon RDS and curated historic transaction data is stored in Amazon Redshift in the us-east-1 Region. The business intelligence (BI) team wants to enhance the user experience by providing a dashboard for sneaker trends. The BI team decides to use Amazon QuickSight to render the website dashboards. During development, a team in Japan provisioned Amazon QuickSight in ap-northeast-1. The team is having difficulty connecting Amazon QuickSight from ap-northeast-1 to Amazon Redshift in us-east-1. Which solution will solve this issue and meet the requirements?

    • A.In the Amazon Redshift console, choose to configure cross-Region snapshots and set the destination Region as ap-northeast-1. Restore the Amazon Redshift Cluster from the snapshot and connect to Amazon QuickSight launched in ap-northeast-1.
    • B.Create a VPC endpoint from the Amazon QuickSight VPC to the Amazon Redshift VPC so Amazon QuickSight can access data from Amazon Redshift.
    • C.Create an Amazon Redshift endpoint connection string with Region information in the string and use this connection string in Amazon QuickSight to connect to Amazon Redshift.
    • D.Create a new security group for Amazon Redshift in us-east-1 with an inbound rule authorizing access from the appropriate IP address range for the Amazon QuickSight servers in ap-northeast-1.(correct answer)

    Correct answer: D

    Create a new security group for Amazon Redshift in us-east-1 with an inbound rule authorizing access from the appropriate IP address range for the Amazon QuickSight servers in ap-northeast-1.

    Explanation

    Amazon QuickSight can connect to cross-region data sources. Security groups in the destination region (us-east-1) must allow inbound traffic from the IP ranges used by QuickSight in the source region (ap-northeast-1).
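
    In boto3 terms, the fix is a single ingress rule on the Redshift security group in us-east-1. The group ID and CIDR below are placeholders; substitute the QuickSight IP range that AWS publishes for ap-northeast-1.

    ```python
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",  # placeholder Redshift security group
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 5439,  # default Redshift port
            "ToPort": 5439,
            "IpRanges": [{
                # Placeholder CIDR: use the documented QuickSight range
                # for ap-northeast-1.
                "CidrIp": "203.0.113.0/27",
                "Description": "QuickSight ap-northeast-1",
            }],
        }],
    )
    ```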

  11. Question 11. An insurance company has raw data in JSON format that is sent without a predefined schedule through an Amazon Kinesis Data Firehose delivery stream to an Amazon S3 bucket. An AWS Glue crawler is scheduled to run every 8 hours to update the schema in the data catalog of the tables stored in the S3 bucket. Data analysts analyze the data using Apache Spark SQL on Amazon EMR set up with AWS Glue Data Catalog as the metastore. Data analysts say that, occasionally, the data they receive is stale. A data engineer needs to provide access to the most up-to-date data. Which solution meets these requirements?

    • A.Create an external schema based on the AWS Glue Data Catalog on the existing Amazon Redshift cluster to query new data in Amazon S3 with Amazon Redshift Spectrum.
    • B.Use Amazon CloudWatch Events with the rate(1 hour) expression to execute the AWS Glue crawler every hour.
    • C.Using the AWS CLI, modify the execution schedule of the AWS Glue crawler from 8 hours to 1 minute.
    • D.Run the AWS Glue crawler from an AWS Lambda function triggered by an S3:ObjectCreated:* event notification on the S3 bucket.(correct answer)

    Correct answer: D

    Run the AWS Glue crawler from an AWS Lambda function triggered by an S3:ObjectCreated:* event notification on the S3 bucket.

    Explanation

    Using S3 Event Notifications to trigger a Lambda function that starts an AWS Glue crawler ensures that the Data Catalog is updated as soon as new data arrives, providing the most up-to-date metadata for analysis.
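
    A minimal sketch of that Lambda handler, assuming a hypothetical crawler name:

    ```python
    import boto3

    glue = boto3.client("glue")
    CRAWLER_NAME = "raw-json-crawler"  # hypothetical

    def handler(event, context):
        # Invoked by an s3:ObjectCreated:* notification so the Data Catalog
        # is refreshed as soon as new objects land.
        try:
            glue.start_crawler(Name=CRAWLER_NAME)
        except glue.exceptions.CrawlerRunningException:
            pass  # a run is already in progress and will pick up the new data
    ```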

  12. Question 12. A data analyst is using AWS Glue to organize, cleanse, validate, and format a 200 GB dataset. The data analyst triggered the job to run with the Standard worker type. After 3 hours, the AWS Glue job status is still RUNNING. Logs from the job run show no error codes. The data analyst wants to improve the job execution time without overprovisioning. Which actions should the data analyst take?

    • A.Enable job bookmarks in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the executor-cores job parameter.
    • B.Enable job metrics in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the maximum capacity job parameter.(correct answer)
    • C.Enable job metrics in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the spark.yarn.executor.memoryOverhead job parameter.
    • D.Enable job bookmarks in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the num-executors job parameter.

    Correct answer: B

    Enable job metrics in AWS Glue to estimate the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the maximum capacity job parameter.

    Explanation

    AWS Glue job metrics help identify bottlenecks like under-provisioned DPUs. Increasing the maximum capacity (DPUs) for the job allows for more parallel processing without manually configuring Spark executor details.

  13. Question 13. A company is streaming its high-volume billing data (100 MBps) to Amazon Kinesis Data Streams. A data analyst partitioned the data on account_id to ensure that all records belonging to an account go to the same Kinesis shard and order is maintained. While building a custom consumer using the Kinesis Java SDK, the data analyst notices that, sometimes, the messages arrive out of order for account_id. Upon further investigation, the data analyst discovers the messages that are out of order seem to be arriving from different shards for the same account_id and are seen when a stream resize runs. What is an explanation for this behavior and what is the solution?

    • A.There are multiple shards in a stream and order needs to be maintained in the shard. The data analyst needs to make sure there is only a single shard in the stream and no stream resize runs.
    • B.The hash key generation process for the records is not working correctly. The data analyst should generate an explicit hash key on the producer side so the records are directed to the appropriate shard accurately.
    • C.The records are not being received by Kinesis Data Streams in order. The producer should use the PutRecords API call instead of the PutRecord API call with the SequenceNumberForOrdering parameter.
    • D.The consumer is not processing the parent shard completely before processing the child shards after a stream resize. The data analyst should process the parent shard completely first before processing the child shards.(correct answer)

    Correct answer: D

    The consumer is not processing the parent shard completely before processing the child shards after a stream resize. The data analyst should process the parent shard completely first before processing the child shards.

    Explanation

    When a Kinesis stream is resized (shards split or merged), consumers must finish reading all records from the parent shards before beginning to read from the child shards to preserve data ordering per partition key.
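
    The KCL handles this ordering automatically; a custom SDK consumer must enforce it itself. A minimal sketch that lists a stream's shards and orders parents before children (one resize level):

    ```python
    import boto3

    kinesis = boto3.client("kinesis")

    def shards_parent_first(stream_name):
        """Return the stream's shards with parents ordered before children."""
        shards, token = [], None
        while True:
            kwargs = {"NextToken": token} if token else {"StreamName": stream_name}
            resp = kinesis.list_shards(**kwargs)
            shards.extend(resp["Shards"])
            token = resp.get("NextToken")
            if not token:
                break
        # A shard whose ParentShardId is still listed must be read only
        # after that parent has been fully consumed.
        ids = {s["ShardId"] for s in shards}
        return sorted(shards, key=lambda s: s.get("ParentShardId") in ids)
    ```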

  14. Question 14. A transportation company uses IoT sensors attached to trucks to collect vehicle data for its global delivery fleet. The company currently sends the sensor data in small .csv files to Amazon S3. The files are then loaded into a 10-node Amazon Redshift cluster with two slices per node and queried using both Amazon Athena and Amazon Redshift. The company wants to optimize the files to reduce the cost of querying and also improve the speed of data loading into the Amazon Redshift cluster. Which solution meets these requirements?

    • A.Use AWS Glue to convert all the files from .csv to a single large Apache Parquet file. COPY the file into Amazon Redshift and query the file with Athena from Amazon S3.
    • B.Use Amazon EMR to convert each .csv file to Apache Avro. COPY the files into Amazon Redshift and query the file with Athena from Amazon S3.
    • C.Use AWS Glue to convert the files from .csv to a single large Apache ORC file. COPY the file into Amazon Redshift and query the file with Athena from Amazon S3.
    • D.Use AWS Glue to convert the files from .csv to Apache Parquet to create 20 Parquet files. COPY the files into Amazon Redshift and query the files with Athena from Amazon S3.(correct answer)

    Correct answer: D

    Use AWS Glue to convert the files from .csv to Apache Parquet to create 20 Parquet files. COPY the files into Amazon Redshift and query the files with Athena from Amazon S3.

    Explanation

    Columnar formats like Apache Parquet significantly reduce query costs in Athena because only the required columns are scanned. Splitting the data into a number of files that aligns with the cluster's slices (20 files for 10 nodes with 2 slices each) also optimizes parallel COPY operations into Redshift.

  15. Question 15. A company is migrating its existing on-premises ETL jobs to Amazon EMR. The code consists of a series of jobs written in Java. The company needs to reduce overhead for the system administrators without changing the underlying code. Due to the sensitivity of the data, compliance requires that the company use root device volume encryption on all nodes in the cluster. Corporate standards require that environments be provisioned through AWS CloudFormation when possible. Which solution satisfies these requirements?

    • A.Install open-source Hadoop on Amazon EC2 instances with encrypted root device volumes. Configure the cluster in the CloudFormation template.
    • B.Use a CloudFormation template to launch an EMR cluster. In the configuration section of the cluster, define a bootstrap action to enable TLS.
    • C.Create a custom AMI with encrypted root device volumes. Configure Amazon EMR to use the custom AMI using the CustomAmiId property in the CloudFormation template.(correct answer)
    • D.Use a CloudFormation template to launch an EMR cluster. In the configuration section of the cluster, define a bootstrap action to encrypt the root device volume of every node.

    Correct answer: C

    Create a custom AMI with encrypted root device volumes. Configure Amazon EMR to use the custom AMI using the CustomAmiId property in the CloudFormation template.

    Explanation

    Amazon EMR allows the use of custom AMIs. This is the recommended way to enforce root device volume encryption and other compliance requirements across all cluster nodes while using CloudFormation for provisioning.

  16. Question 16. A hospital uses wearable medical sensor devices to collect data from patients. The hospital is architecting a near-real-time solution that can ingest the data securely at scale. The solution should also be able to remove the patient's protected health information (PHI) from the streaming data and store the data in durable storage. Which solution meets these requirements with the least operational overhead?

    • A.Ingest the data using Amazon Kinesis Data Streams, which invokes an AWS Lambda function using Kinesis Client Library (KCL) to remove all PHI. Write the data in Amazon S3.
    • B.Ingest the data using Amazon Kinesis Data Firehose to write the data to Amazon S3. Have Amazon S3 trigger an AWS Lambda function that parses the sensor data to remove all PHI in Amazon S3.
    • C.Ingest the data using Amazon Kinesis Data Streams to write the data to Amazon S3. Have the data stream launch an AWS Lambda function that parses the sensor data and removes all PHI in Amazon S3.
    • D.Ingest the data using Amazon Kinesis Data Firehose to write the data to Amazon S3. Implement a transformation AWS Lambda function that parses the sensor data to remove all PHI.(correct answer)

    Correct answer: D

    Ingest the data using Amazon Kinesis Data Firehose to write the data to Amazon S3. Implement a transformation AWS Lambda function that parses the sensor data to remove all PHI.

    Explanation

    Kinesis Data Firehose supports inline data transformation using AWS Lambda. This is the most operationally efficient way to remove sensitive PHI data from the stream before it is persisted in durable storage like S3.
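
    A minimal sketch of a Firehose transformation Lambda that strips PHI keys before delivery; the field names are hypothetical.

    ```python
    import base64
    import json

    PHI_FIELDS = {"patient_name", "ssn", "date_of_birth"}  # hypothetical fields

    def handler(event, context):
        # Standard Firehose transformation contract: decode each record,
        # drop PHI keys, re-encode, and return the record with result "Ok".
        output = []
        for record in event["records"]:
            data = json.loads(base64.b64decode(record["data"]))
            clean = {k: v for k, v in data.items() if k not in PHI_FIELDS}
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(json.dumps(clean).encode()).decode(),
            })
        return {"records": output}
    ```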

  17. Question 17. A media company wants to perform machine learning and analytics on the data residing in its Amazon S3 data lake. There are two data transformation requirements that will enable the consumers within the company to create reports: Daily transformations of 300 GB of data with different file formats landing in Amazon S3 at a scheduled time. One-time transformations of terabytes of archived data residing in the S3 data lake. Which combination of solutions cost-effectively meets the company's requirements for transforming the data? (Choose THREE)

    • A.For daily incoming data, use AWS Glue crawlers to scan and identify the schema.(correct answer)
    • B.For daily incoming data, use Amazon Athena to scan and identify the schema.
    • C.For daily incoming data, use Amazon Redshift to perform transformations.
    • D.For daily incoming data, use AWS Glue workflows with AWS Glue jobs to perform transformations.(correct answer)
    • E.For archived data, use Amazon EMR to perform data transformations.(correct answer)
    • F.For archived data, use Amazon SageMaker to perform data transformations.

    Correct answer: A, D, E

    For daily incoming data, use AWS Glue crawlers to scan and identify the schema. / For daily incoming data, use AWS Glue workflows with AWS Glue jobs to perform transformations. / For archived data, use Amazon EMR to perform data transformations.

    Explanation

    For daily transformations of moderate data (300 GB), AWS Glue workflows and crawlers are suitable for schema discovery and ETL. For massive archived data (terabytes), Amazon EMR provides the scaling and performance needed for one-time large-scale transformations.

  18. Question 18. A marketing company wants to improve its reporting and business intelligence capabilities. During the planning phase, the company interviewed the relevant stakeholders and discovered that: The operations team reports are run hourly for the current month's data. The sales team wants to use multiple Amazon QuickSight dashboards to show a rolling view of the last 30 days based on several categories. The sales team also wants to view the data as soon as it reaches the reporting backend. The finance team's reports are run daily for last month's data and once a month for the last 24 months of data. Currently, there is 400 TB of data in the system with an expected additional 100 TB added every month. The company is looking for a solution that is as cost-effective as possible. Which solution meets the company's requirements?

    • A.Store the last 24 months of data in Amazon Redshift. Configure Amazon QuickSight with Amazon Redshift as the data source.
    • B.Store the last 2 months of data in Amazon Redshift and the rest of the months in Amazon S3. Set up an external schema and table for Amazon Redshift Spectrum. Configure Amazon QuickSight with Amazon Redshift as the data source.(correct answer)
    • C.Store the last 24 months of data in Amazon S3 and query it using Amazon Redshift Spectrum. Configure Amazon QuickSight with Amazon Redshift Spectrum as the data source.
    • D.Store the last 2 months of data in Amazon Redshift and the rest of the months in Amazon S3. Use a long-running Amazon EMR with Apache Spark cluster to query the data as needed. Configure Amazon QuickSight with Amazon EMR as the data source.

    Correct answer: B

    Store the last 2 months of data in Amazon Redshift and the rest of the months in Amazon S3. Set up an external schema and table for Amazon Redshift Spectrum. Configure Amazon QuickSight with Amazon Redshift as the data source.

    Explanation

    The most cost-effective and performant solution for large datasets with mixed access patterns is a hybrid approach: keep the most recent two months in Amazon Redshift for the low-latency hourly and rolling 30-day queries, and keep historical data in S3, querying it as needed via Redshift Spectrum.

  19. Question 19. A company wants to improve user satisfaction for its smart home system by adding more features to its recommendation engine. Each sensor asynchronously pushes its nested JSON data into Amazon Kinesis Data Streams using the Kinesis Producer Library (KPL) in Java. Statistics from a set of failed sensors showed that, when a sensor is malfunctioning, its recorded data is not always sent to the cloud. The company needs a solution that offers near-real-time analytics on the data from the most updated sensors. Which solution enables the company to meet these requirements?

    • A.Set the RecordMaxBufferedTime property of the KPL to '-1' to disable the buffering on the sensor side. Use Kinesis Data Analytics to enrich the data based on a company-developed anomaly detection SQL script. Push the enriched data to a fleet of Kinesis data streams and enable the data transformation feature to flatten the JSON file. Instantiate a dense storage Amazon Redshift cluster and use it as the destination for the Kinesis Data Firehose delivery stream.
    • B.Update the sensors code to use the PutRecord/PutRecords call from the Kinesis Data Streams API with the AWS SDK for Java. Use Kinesis Data Analytics to enrich the data based on a company-developed anomaly detection SQL script. Direct the output of the KDA application to a Kinesis Data Firehose delivery stream, enable the data transformation feature to flatten the JSON file, and set the Kinesis Data Firehose destination to an Amazon Elasticsearch Service cluster.(correct answer)
    • C.Set the RecordMaxBufferedTime property of the KPL to '0' to disable the buffering on the sensor side. Connect for each stream a dedicated Kinesis Data Firehose delivery stream and enable the data transformation feature to flatten the JSON file before sending it to an Amazon S3 bucket. Load the S3 data into an Amazon Redshift cluster.
    • D.Update the sensors code to use the PutRecord/PutRecords call from the Kinesis Data Streams API with the AWS SDK for Java. Use AWS Glue to fetch and process data from the stream using the Kinesis Client Library (KCL). Instantiate an Amazon Elasticsearch Service cluster and use AWS Lambda to directly push data into it.

    Correct answer: B

    Update the sensors code to use the PutRecord/PutRecords call from the Kinesis Data Streams API with the AWS SDK for Java. Use Kinesis Data Analytics to enrich the data based on a company-developed anomaly detection SQL script. Direct the output of the KDA application to a Kinesis Data Firehose delivery stream, enable the data transformation feature to flatten the JSON file, and set the Kinesis Data Firehose destination to an Amazon Elasticsearch Service cluster.

    Explanation

    Switching from the KPL to direct PutRecord/PutRecords calls removes the KPL's client-side buffering delay, so data from healthy sensors reaches the stream immediately. Kinesis Data Analytics can then run the anomaly detection SQL in near-real time, and Kinesis Data Firehose can flatten the nested JSON and deliver the results to Amazon Elasticsearch Service for fast analysis and visualization.

  20. Question 20. A large company has a central data lake to run analytics across different departments. Each department uses a separate AWS account and stores its data in an Amazon S3 bucket in that account. Each AWS account uses the AWS Glue Data Catalog as its data catalog. There are different data lake access requirements based on roles. Associate analysts should only have read access to their departmental data. Senior data analysts can have access in multiple departments including theirs, but for a subset of columns only. Which solution achieves these required access patterns to minimize costs and administrative tasks?

    • A.Consolidate all AWS accounts into one account. Create different S3 buckets for each department and move all the data from every account to the central data lake account. Migrate the individual data catalogs into a central data catalog and apply fine-grained permissions to give to each user the required access to tables and databases in AWS Glue and Amazon S3.
    • B.Keep the account structure and the individual AWS Glue catalogs on each account. Add a central data lake account and use AWS Glue to catalog data from various accounts. Configure cross-account access for AWS Glue crawlers to scan the data in each departmental S3 bucket to identify the schema and populate the catalog. Add the senior data analysts into the central account and apply highly detailed access controls in the Data Catalog and Amazon S3.
    • C.Set up an individual AWS account for the central data lake. Use AWS Lake Formation to catalog the cross-account locations. On each individual S3 bucket, modify the bucket policy to grant S3 permissions to the Lake Formation service-linked role. Use Lake Formation permissions to add fine-grained access controls to allow senior analysts to view specific tables and columns.(correct answer)
    • D.Set up an individual AWS account for the central data lake and configure a central S3 bucket. Use an AWS Lake Formation blueprint to move the data from the various buckets into the central S3 bucket. On each individual bucket, modify the bucket policy to grant S3 permissions to the Lake Formation service-linked role. Use Lake Formation permissions to add fine-grained access controls for both associate and senior analysts to view specific tables and columns.

    Correct answer: C

    Set up an individual AWS account for the central data lake. Use AWS Lake Formation to catalog the cross-account locations. On each individual S3 bucket, modify the bucket policy to grant S3 permissions to the Lake Formation service-linked role. Use Lake Formation permissions to add fine-grained access controls to allow senior analysts to view specific tables and columns.

    Explanation

    AWS Lake Formation simplifies security management for cross-account data lakes by providing fine-grained access control (column-level) and centralizing permissions, reducing the overhead of managing individual bucket policies and IAM roles across many accounts.

  21. Question 21. A company developed a new elections reporting website that uses Amazon Kinesis Data Firehose to deliver full logs from AWS WAF to an Amazon S3 bucket. The company is now seeking a low-cost option to perform this infrequent data analysis with visualizations of logs in a way that requires minimal development effort. Which solution meets these requirements?

    • A.Use an AWS Glue crawler to create and update a table in the Glue data catalog from the logs. Use Athena to perform ad-hoc analyses and use Amazon QuickSight to develop data visualizations.(correct answer)
    • B.Create a second Kinesis Data Firehose delivery stream to deliver the log files to Amazon Elasticsearch Service (Amazon ES). Use Amazon ES to perform text-based searches of the logs for ad-hoc analyses and use Kibana for data visualizations.
    • C.Create an AWS Lambda function to convert the logs into .csv format. Then add the function to the Kinesis Data Firehose transformation configuration. Use Amazon Redshift to perform ad-hoc analyses of the logs using SQL queries and use Amazon QuickSight to develop data visualizations.
    • D.Create an Amazon EMR cluster and use Amazon S3 as the data source. Create an Apache Spark job to perform ad-hoc analyses and use Amazon QuickSight to develop data visualizations.

    Correct answer: A

    Use an AWS Glue crawler to create and update a table in the Glue data catalog from the logs. Use Athena to perform ad-hoc analyses and use Amazon QuickSight to develop data visualizations.

    Explanation

    For infrequent analysis with minimal effort, using an AWS Glue crawler to catalog logs, Athena for SQL queries, and QuickSight for visualization is the most cost-effective serverless solution.

  22. Question 22. A data analyst is designing a solution to interactively query datasets with SQL using a JDBC connection. Users will join data stored in Amazon S3 in Apache ORC format with data stored in Amazon Elasticsearch Service (Amazon ES) and Amazon Aurora MySQL. Which solution will provide the MOST up-to-date results?

    • A.Use AWS Glue jobs to ETL data from Amazon ES and Aurora MySQL to Amazon S3. Query the data with Amazon Athena.
    • B.Use Amazon DMS to stream data from Amazon ES and Aurora MySQL to Amazon Redshift. Query the data with Amazon Redshift.
    • C.Query all the datasets in place with Apache Spark SQL running on an AWS Glue developer endpoint.
    • D.Query all the datasets in place with Apache Presto running on Amazon EMR.(correct answer)

    Correct answer: D

    Query all the datasets in place with Apache Presto running on Amazon EMR.

    Explanation

    Apache Presto on Amazon EMR is an open-source distributed SQL query engine designed to query data from multiple sources (S3, OpenSearch, Aurora) in place, providing high performance for complex joins across diverse datasets.

  23. Question 23. A retail company is building its data warehouse solution using Amazon Redshift. As a part of that effort, the company is loading hundreds of files into the fact table created in its Amazon Redshift cluster. The company wants the solution to achieve the highest throughput and optimally use cluster resources when loading data into the company's fact table. How should the company meet these requirements?

    • A.Use multiple COPY commands to load the data into the Amazon Redshift cluster.
    • B.Use S3DistCp to load multiple files into the Hadoop Distributed File System (HDFS) and use an HDFS connector to ingest the data into the Amazon Redshift cluster.
    • C.Use LOAD commands equal to the number of Amazon Redshift cluster nodes and load the data in parallel into each node.
    • D.Use a single COPY command to load the data into the Amazon Redshift cluster.(correct answer)

    Correct answer: D

    Use a single COPY command to load the data into the Amazon Redshift cluster.

    Explanation

    A single COPY command is most efficient for loading many files from S3 into Redshift: the cluster automatically parallelizes the work across all slices, whereas multiple concurrent COPY commands against the same table force serialized loads and extra sorting overhead.

  24. Question 24. Once a month, a company receives a 100 MB .csv file compressed with gzip. The file contains 50,000 property listing records and is stored in Amazon S3 Glacier. The company needs its data analyst to query a subset of the data for a specific vendor. What is the most cost-effective solution?

    • A.Load the data into Amazon S3 and query it with Amazon S3 Select.(correct answer)
    • B.Query the data from Amazon S3 Glacier directly with Amazon Glacier Select.
    • C.Load the data to Amazon S3 and query it with Amazon Athena.
    • D.Load the data to Amazon S3 and query it with Amazon Redshift Spectrum.

    Correct answer: A

    Load the data into Amazon S3 and query it with Amazon S3 Select.

    Explanation

    S3 Select allows you to retrieve a subset of data from an S3 object using simple SQL expressions, which is highly cost-effective and faster than retrieving the entire 100 MB object and then filtering it.
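
    A sketch of such a query with boto3; the bucket, key, and vendor_id column are hypothetical, and the object is assumed to be a gzip-compressed CSV with a header row.

    ```python
    import boto3

    s3 = boto3.client("s3")

    resp = s3.select_object_content(
        Bucket="example-bucket",
        Key="listings/2024-01.csv.gz",
        ExpressionType="SQL",
        Expression="SELECT * FROM s3object s WHERE s.vendor_id = 'V042'",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"},
                            "CompressionType": "GZIP"},
        OutputSerialization={"CSV": {}},
    )

    # The response is an event stream; only the Records events carry rows.
    for event in resp["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode(), end="")
    ```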

  25. Question 25. A company that monitors weather conditions from remote construction sites is setting up a solution to collect temperature data from the following two weather stations. Station A, which has 10 sensors. Station B, which has five sensors. These weather stations were placed by onsite subject-matter experts. Each sensor has a unique ID. The data collected from each sensor will be collected using Amazon Kinesis Data Streams. Based on the total incoming and outgoing data throughput, a single Amazon Kinesis data stream with two shards is created. Two partition keys are created based on the station names. During testing, there is a bottleneck on data coming from Station A, but not from Station B. Upon review, it is confirmed that the total stream throughput is still less than the allocated Kinesis Data Streams throughput. How can this bottleneck be resolved without increasing the overall cost and complexity of the solution, while retaining the data collection quality requirements?

    • A.Increase the number of shards in Kinesis Data Streams to increase the level of parallelism.
    • B.Create a separate Kinesis data stream for Station A with two shards, and stream Station A sensor data to the new stream.
    • C.Modify the partition key to use the sensor ID instead of the station name.(correct answer)
    • D.Reduce the number of sensors in Station A from 10 to 5 sensors.

    Correct answer: C

    Modify the partition key to use the sensor ID instead of the station name.

    Explanation

    Partitioning Kinesis streams by station name can lead to hot shards if one station (like Station A) sends significantly more data. Changing the partition key to a more granular value like sensor_id ensures a more even distribution across shards.
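
    A producer-side sketch of the change, with a hypothetical stream name:

    ```python
    import json

    import boto3

    kinesis = boto3.client("kinesis")

    def send_reading(sensor_id, temperature_c):
        # Keying on the sensor ID spreads all 15 sensors across both shards,
        # instead of funneling Station A's 10 sensors into one hot shard.
        kinesis.put_record(
            StreamName="weather-temperatures",  # hypothetical stream name
            Data=json.dumps({"sensor_id": sensor_id, "temp_c": temperature_c}),
            PartitionKey=sensor_id,
        )
    ```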

  26. Question 26. A company is building a data lake and needs to ingest data from a relational database that has time-series data. The company wants to use managed services to accomplish this. The process needs to be scheduled daily and bring incremental data only from the source into Amazon S3. What is the MOST cost-effective approach to meet these requirements?

    • A.Use AWS Glue to connect to the data source using JDBC Drivers. Ingest incremental records only using job bookmarks.(correct answer)
    • B.Use AWS Glue to connect to the data source using JDBC Drivers. Store the last updated key in an Amazon DynamoDB table and ingest the data using the updated key as a filter.
    • C.Use AWS Glue to connect to the data source using JDBC Drivers and ingest the entire dataset. Use appropriate Apache Spark libraries to compare the dataset, and find the delta.
    • D.Use AWS Glue to connect to the data source using JDBC Drivers and ingest the full data. Use AWS DataSync to ensure the delta only is written into Amazon S3.

    Correct answer: A

    Use AWS Glue to connect to the data source using JDBC Drivers. Ingest incremental records only using job bookmarks.

    Explanation

    AWS Glue Job Bookmarks are the native and most cost-effective way to track state and process only incremental data from a source during scheduled ETL runs, minimizing data transferred and processed.
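
    A skeleton of a bookmark-aware Glue job, enabled with the job argument --job-bookmark-option job-bookmark-enable; the catalog names are hypothetical.

    ```python
    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # With bookmarks enabled, only data added since the last successful run is
    # returned; transformation_ctx is the key the bookmark state is stored under.
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="lake_db",               # hypothetical catalog database
        table_name="source_timeseries",   # hypothetical catalog table
        transformation_ctx="read_source",
    )

    # ... transform dyf and write the increment to Amazon S3 here ...

    job.commit()  # persist the bookmark so the next run starts where this one ended
    ```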

  27. Question 27. A mortgage company has a microservice for accepting payments. This microservice uses the Amazon DynamoDB encryption client with AWS KMS managed keys to encrypt the sensitive data before writing the data to DynamoDB. The finance team should be able to load this data into Amazon Redshift and aggregate the values within the sensitive fields. The Amazon Redshift cluster is shared with other data analysts from different business units. Which steps should a data analyst take to accomplish this task efficiently and securely?

    • A.Create an AWS Lambda function to process the DynamoDB stream. Decrypt the sensitive data using the same KMS key. Save the output to a restricted S3 bucket for the finance team. Create a finance table in Amazon Redshift that is accessible to the finance team only. Use the COPY command to load the data from Amazon S3 to the finance table.
    • B.Create an AWS Lambda function to process the DynamoDB stream. Save the output to a restricted S3 bucket for the finance team. Create a finance table in Amazon Redshift that is accessible to the finance team only. Use the COPY command with the IAM role that has access to the KMS key to load the data from S3 to the finance table.(correct answer)
    • C.Create an Amazon EMR cluster with an EMR_EC2_DefaultRole role that has access to the KMS key. Create Apache Hive tables that reference the data stored in DynamoDB and the finance table in Amazon Redshift. In Hive, select the data from DynamoDB and then insert the output to the finance table in Amazon Redshift.
    • D.Create an Amazon EMR cluster. Create Apache Hive tables that reference the data stored in DynamoDB. Insert the output to the restricted Amazon S3 bucket for the finance team. Use the COPY command with the IAM role that has access to the KMS key to load the data from Amazon S3 to the finance table in Amazon Redshift.

    Correct answer: B

    Create an AWS Lambda function to process the DynamoDB stream. Save the output to a restricted S3 bucket for the finance team. Create a finance table in Amazon Redshift that is accessible to the finance team only. Use the COPY command with the IAM role that has access to the KMS key to load the data from S3 to the finance table.

    Explanation

    An AWS Lambda function processes the DynamoDB stream and saves the records to an S3 bucket restricted to the finance team. The Redshift COPY command, run with an IAM role that has access to the KMS key, then loads the data into a finance table that only the finance team can query, keeping the sensitive fields protected throughout the pipeline.

  28. Question 28. A company has developed several AWS Glue jobs to validate and transform its data from Amazon S3 and load it into Amazon RDS for MySQL in batches once every day. The ETL jobs read the S3 data using a DynamicFrame. Currently, the ETL developers are experiencing challenges in processing only the incremental data on every run, as the AWS Glue job processes all the S3 input data on each run. Which approach would allow the developers to solve the issue with minimal coding effort?

    • A.Have the ETL jobs read the data from Amazon S3 using a DataFrame.
    • B.Enable job bookmarks on the AWS Glue jobs.(correct answer)
    • C.Create custom logic on the ETL jobs to track the processed S3 objects.
    • D.Have the ETL jobs delete the processed objects or data from Amazon S3 after each run.

    Correct answer: B

    Enable job bookmarks on the AWS Glue jobs.

    Explanation

    AWS Glue Job Bookmarks are specifically designed to capture state and track processed data, allowing ETL developers to handle incremental data on every run with minimal coding effort.

  29. Question 29. A media company has been performing analytics on log data generated by its applications. There has been a recent increase in the number of concurrent analytics jobs running, and the overall performance of existing jobs is decreasing as the number of new jobs is increasing. The partitioned data is stored in Amazon S3 One Zone-Infrequent Access (S3 One Zone-IA) and the analytic processing is performed on Amazon EMR clusters using the EMR File System (EMRFS) with consistent view enabled. A data analyst has determined that it is taking longer for the EMR task nodes to list objects in Amazon S3. Which action would MOST likely increase the performance of accessing log data in Amazon S3?

    • A.Use a hash function to create a random string and add that to the beginning of the object prefixes when storing the log data in Amazon S3.
    • B.Use a lifecycle policy to change the S3 storage class to S3 Standard for the log data.(correct answer)
    • C.Increase the read capacity units (RCUs) for the shared Amazon DynamoDB table.
    • D.Redeploy the EMR clusters that are running slowly to a different Availability Zone.

    Correct answer: B

    Use a lifecycle policy to change the S3 storage class to S3 Standard for the log data.

    Explanation

    Listing objects stored in S3 One Zone-IA can be slower than in S3 Standard, especially under heavy concurrent load. Using a lifecycle policy to move the actively analyzed log data to S3 Standard improves listing performance for the EMR task nodes.

  30. Question 30. A financial company uses Apache Hive on Amazon EMR for ad-hoc queries. Users are complaining of sluggish performance. A data analyst notes the following: Approximately 90% of queries are submitted 1 hour after the market opens. Hadoop Distributed File System (HDFS) utilization never exceeds 10%. Which solution would help address the performance issues?

    • A.Create instance fleet configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch CapacityRemainingGB metric. Create an automatic scaling policy to scale in the instance fleet based on the CloudWatch CapacityRemainingGB metric.
    • B.Create instance fleet configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch YARNMemoryAvailablePercentage metric. Create an automatic scaling policy to scale in the instance fleet based on the CloudWatch YARNMemoryAvailablePercentage metric.
    • C.Create instance group configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch CapacityRemainingGB metric. Create an automatic scaling policy to scale in the instance groups based on the CloudWatch CapacityRemainingGB metric.
    • D.Create instance group configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch YARNMemoryAvailablePercentage metric. Create an automatic scaling policy to scale in the instance groups based on the CloudWatch YARNMemoryAvailablePercentage metric.(correct answer)

    Correct answer: D

    Create instance group configurations for core and task nodes. Create an automatic scaling policy to scale out the instance groups based on the Amazon CloudWatch YARNMemoryAvailablePercentage metric. Create an automatic scaling policy to scale in the instance groups based on the CloudWatch YARNMemoryAvailablePercentage metric.

    Explanation

    Scaling EMR instance groups based on the YARNMemoryAvailablePercentage metric is the best way to handle predictable spikes in query volume, as it directly reflects the cluster's capacity to process more Hive queries.

Ready for the full DAS-C01 exam?

Get all 80+ questions, timed simulation, and weak-area analytics. Plans start at $2.99, and credits never expire.

See pricing

Frequently Asked Questions

Are these real DAS-C01 practice questions?
Yes. These 30 questions are taken directly from our 80+ question pool, written and reviewed by certified practitioners. They mirror the style, difficulty, and scope of the official AWS DAS-C01 exam.
Is the DAS-C01 exam hard?
The AWS Certified Data Analytics Specialty (DAS-C01) exam has a passing score of 750 out of 1000. Most candidates need 4–8 weeks of focused preparation. Use these free questions to gauge where you stand before committing to a full study plan.
How many questions are on the real DAS-C01 exam?
The official DAS-C01 exam has 65 questions (50 scored, 15 unscored).
Do I need to sign up to use these questions?
No. These 30 questions are free and require no signup. If you want timed simulation, performance analytics, and access to all 80+ questions, our paid plans start at $2.99 per exam with credits that never expire.

Keep studying

Pass DAS-C01 on your first try

Join candidates using DummyExams to practice with realistic timed exams, detailed explanations, and weak-area analytics.

Start full DAS-C01 practice exam