How to count files in directories in S3
Problem:
We can count files under a path in S3 using s3cmd, a command-line client for S3. The command is:
s3cmd ls --recursive s3://preciselab/sowa/voice/gtts | wc -l
This lists all files recursively, one per line, and then counts the lines (the -l flag of wc).
But we would also like to see how the files are distributed across directories.
Solution:
We can use the following command:
s3cmd ls --recursive s3://preciselab/sowa/voice/gtts | awk '{print $4}' | grep -oP 'gtts/\K[^/]+' | sort | uniq -c | sort -nr
190 pl
111 it
109 en
108 de
104 es
103 fr
100 ar
98 ru
97 nl
97 ja
96 zh
24 nb
Key points for understanding what is happening:
- s3cmd ls --recursive: Lists all files under the given S3 path recursively.
- awk '{print $4}': Extracts the 4th column, which contains the S3 object paths (the first three are date, time and size).
- grep -oP 'gtts/\K[^/]+': Extracts the directory name that directly follows gtts/ using a Perl-compatible regex (\K discards everything matched before it, and -o prints only the matched part).
- sort: Sorts the language codes alphabetically, so that identical codes end up on adjacent lines (uniq only merges adjacent duplicates).
- uniq -c: Counts occurrences of each language code.
- sort -nr: Sorts the counted output in descending numerical order for easy reading.
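The stages above can be tried without access to S3. The sketch below feeds a small made-up listing through the same pipeline; the timestamps, sizes and file names are hypothetical and only mimic the four-column shape of s3cmd ls output. It assumes GNU grep, since the -P flag is not available in every grep implementation.

```shell
#!/usr/bin/env bash
# Hypothetical sample that mimics `s3cmd ls --recursive` output:
# date, time, size, object URI. File names are invented for illustration.
sample_listing() {
  cat <<'EOF'
2024-01-01 10:00      1024   s3://preciselab/sowa/voice/gtts/pl/a.mp3
2024-01-01 10:01      2048   s3://preciselab/sowa/voice/gtts/pl/b.mp3
2024-01-01 10:02       512   s3://preciselab/sowa/voice/gtts/en/a.mp3
EOF
}

# Stage 1: keep only the 4th column (the object URI).
sample_listing | awk '{print $4}'

# Stage 2: keep only the path segment right after gtts/.
sample_listing | awk '{print $4}' | grep -oP 'gtts/\K[^/]+'

# Stage 3: group identical names, count them, order by count (descending).
sample_listing | awk '{print $4}' | grep -oP 'gtts/\K[^/]+' | sort | uniq -c | sort -nr
```

For this sample the last stage reports pl with a count of 2 and en with a count of 1.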
It is worth knowing regular expressions and the awk, grep, sort and uniq commands, because they make this kind of data analysis much faster.
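As one illustration, the extraction and counting can also be done in awk alone, which may help on systems where grep has no -P flag (for example the BSD grep shipped with macOS). This is a minimal sketch, not a replacement for the command above; count_by_dir is a hypothetical helper name and not part of s3cmd.

```shell
#!/usr/bin/env bash
# count_by_dir PREFIX
# Reads `s3cmd ls --recursive`-style lines on stdin and prints "count name"
# for each first-level directory under PREFIX/.
# (Hypothetical helper for illustration; not part of s3cmd.)
count_by_dir() {
  awk -v prefix="$1/" '
    {
      uri = $4                            # 4th column holds the object URI
      pos = index(uri, prefix)            # locate PREFIX/ inside the URI
      if (pos == 0) next                  # line is outside the prefix
      rest = substr(uri, pos + length(prefix))
      slash = index(rest, "/")
      if (slash == 0) next                # file sits directly in PREFIX/
      counts[substr(rest, 1, slash - 1)]++
    }
    END { for (dir in counts) print counts[dir], dir }
  ' | sort -k1,1nr -k2,2                  # by count descending, then by name
}

# Usage (requires s3cmd and access to the bucket):
#   s3cmd ls --recursive s3://preciselab/sowa/voice/gtts | count_by_dir gtts
```

One difference in behavior: this sketch skips files stored directly in the prefix, while the grep pattern would report such a file name as if it were a directory.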