Base64 Assignment Megathread #3
Introduction
In this assignment, you will write a simple base64 encoding utility to familiarize yourself with the
basics of C programming. You will be using, exclusively, a small subset of the C standard library
to accomplish this task. You will practice:
Parsing command-line arguments
Using the standard io library features
Handling errors
Manipulating data and working with arrays
Learning Outcomes
How to write a C program to solve a problem? (Module 1 MLO 4)
How can programs invoke OS services using system calls? (Module 1 MLO 5)
How do you interact with the user in C programs? (Module 1 MLO 6)
How are C programs transformed into an executable form? (Module 1 MLO 7)
Specification
This specification is modeled after the man page for the system base64 utility. You can view
that man page by running the command man 1 base64 on the class server; a shorthand way
to refer to a man page is like base64(1) , where 1 means section 1 of the manual. You can
read more about the manual and its sections by reading the man page for the man command
itself: man(1) (i.e. man 1 man ).
NAME
base64 - Base64 encode data and print to standard output
SYNOPSIS
base64 [FILE]
DESCRIPTION
Base64 encode FILE, or standard input, and output to standard output.
With no FILE argument, or if FILE is - , read from standard input.
The special filename argument - is often used to refer to standard input with utilities that
accept filename arguments. This is a historical convention which is widely adhered to in
system utilities, and there are some good reasons for this feature, such as allowing
placeholders in sequences of filenames. If a file is literally named - , it can still be passed as
an argument by adding path components to its name, such as by prepending the current
working directory to the filename. As an example, base64 ./- would encode the file in the
current directory named "-", rather than read from standard input.
Encoded lines are wrapped every 76 characters
A line is a sequence of characters terminated in a newline '\n' character. The length of a line
does not include the newline character. The above specification means that each line of
output will consist of 76 non-newline characters plus a newline at the end, for a total of 77
characters. Of course, the last line of output may be shorter than this.
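As a worked example of the wrapping rule, the sketch below computes how many bytes of output a given number of encoded characters occupies once newlines are added. The names LINE_WIDTH and wrapped_size are hypothetical, and the sketch assumes the final, possibly shorter, line is also newline-terminated, as the definition of a line above implies.

```c
#include <stddef.h>

#define LINE_WIDTH 76 /* encoded characters per line, from the specification */

/* Total output bytes (characters plus newlines) for n_chars encoded
 * characters: one newline per full or partial line. */
size_t wrapped_size(size_t n_chars)
{
    size_t n_lines = (n_chars + LINE_WIDTH - 1) / LINE_WIDTH; /* round up */
    return n_chars + n_lines;
}
```

For instance, wrapped_size(76) is 77 (one full line plus its newline), while wrapped_size(77) is 79 because the extra character starts a second line.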
The data are encoded according to the standard algorithm and standard base64 alphabet
described in RFC 4648. NOTE: The example implementation in section 11 is off limits. If you copy
it, you'll be reported for plagiarism. It's also much more advanced than what you're learning, so
don't bother with looking at it, because it will only confuse you.
You may wish to search for online resources that describe how base64 encoding works, but
once you have a basic idea, you need to read the actual standard and understand it fully.
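To give a feel for the arithmetic in RFC 4648: every group of up to 3 input bytes becomes exactly 4 output characters, with '=' padding filling out a short final group. The helper below, whose name encoded_chars is made up for illustration, is a minimal sketch of that relationship and does not count newlines.

```c
#include <stddef.h>

/* Number of base64 characters (including '=' padding, excluding newlines)
 * produced for n_bytes of input: 4 characters per started 3-byte group. */
size_t encoded_chars(size_t n_bytes)
{
    return (n_bytes + 2) / 3 * 4;
}
```

Under this arithmetic, 57 input bytes produce 76 characters, which fills one output line exactly.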
STDIN
The standard input shall be used only if no FILE operand is specified, or if the FILE operand is '-'.
STDOUT
The standard output shall contain the base64 encoded output, wrapped to 76 characters.
STDERR
The standard error shall be used only for error reporting.
EXIT STATUS
0, if successful
>0, if an error occurs; an informative error message must be printed to stderr.
Additional Instructions
Your program must compile according to the c99 standard, with variable length arrays (VLAs)
disabled:
$ gcc -std=c99 -Werror=vla -o base64 ...
Constant Space Complexity
In addition to not being allowed to use VLAs, you may not use any memory allocation
functions such as malloc . In other words, your program must have a fixed memory footprint. In
big-O notation, this is called a constant space complexity of O(1). Since the program can accept
an input of any length, it must produce output as it reads input in chunks. You
cannot buffer all of the input and then process it, and you cannot buffer all of the output and
then print it, because this could require infinite resources.
Submissions which violate this requirement will receive a 0.
Library Restrictions
To make this program easier for beginners, you are restricted to the following standard library
functions:
errno and Exxx macros in errno.h
fopen , fclose , feof , ferror , fread , fwrite , putchar , fprintf from stdio.h
strcmp from string.h
You must use the err or errx convenience function from the non-standard err.h for error
reporting and exiting with an error code.
Read the man pages for these functions carefully. Do not use any other functions in your
program.
Getting Started
In this program, you will be working with arbitrary input data as raw bytes of a fixed width
(8 bits), using bitwise arithmetic operations, as specified in the base64 specification. You will want to
read in data from the input file (or stdin) using a buffer (array) of uint8_t , which is a special
unsigned 8-bit type provided by the <stdint.h> header.
In the C programming language, integer types are either signed or unsigned. Bitwise
operations on signed integers are not well defined when they represent negative values, so
we generally use unsigned integers for these types of operations. Generally, the unsigned
char type is used for generic raw bytes, but, interestingly, a char is not required to be 8 bits.
In this case, we will want to explicitly use the uint8_t type, which is an unsigned, 8-bit
integer.
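One related detail is worth seeing concretely: in an expression, a uint8_t operand is promoted to int before a shift happens, so bits shifted left past bit 7 are not discarded automatically. The sketch below (both function names are hypothetical) contrasts an unmasked shift with one masked down to 6 bits.

```c
#include <stdint.h>

/* b is promoted to int before shifting, so the result can exceed 8 bits. */
int shift_unmasked(uint8_t b)
{
    return b << 4;
}

/* Masking with 0x3F keeps only the low 6 bits, i.e. a value from 0 to 63. */
int shift_masked(uint8_t b)
{
    return (b << 4) & 0x3F;
}
```

With b equal to 0xFF, the unmasked result is 0xFF0, whereas the masked result is 0x30.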
You will need to perform bitwise operations on these raw bytes to calculate indices into the
base64 alphabet, and then look up the corresponding characters in the alphabet to produce text
for output.
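As an illustration of the index calculation, here is a minimal sketch that encodes one complete 3-byte group into 4 characters. The function name encode_group and the local array named alphabet are hypothetical, and padding for a short final group is deliberately left out.

```c
#include <stdint.h>

static char const alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "+/";

/* Encode exactly 3 input bytes as 4 base64 characters (no terminator). */
void encode_group(uint8_t const in[3], char out[4])
{
    /* Split the 24 input bits into four 6-bit indices into the alphabet. */
    out[0] = alphabet[in[0] >> 2];
    out[1] = alphabet[(in[0] << 4 | in[1] >> 4) & 0x3Fu];
    out[2] = alphabet[(in[1] << 2 | in[2] >> 6) & 0x3Fu];
    out[3] = alphabet[in[2] & 0x3Fu];
}
```

With the bytes of "foo" as input, this yields Zm9v, which matches the test vector listed in RFC 4648.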
Let's start writing some skeleton code. We will start with including some of the header files we
need. An #include directive tells the preprocessor to insert the referenced file directly into the
source file where it appears. These standard library headers just contain a lot of declarations and
macro definitions for things we want to use in our program:
#include <stdio.h> // Standard input and output
#include <errno.h> // Access to errno and Exxx macros
#include <stdint.h> // Extra fixed-width data types
#include <string.h> // String utilities
#include <err.h> // Convenience functions for error reporting (non-standard)
Next, we need to embed the base64 alphabet somewhere in our program's data, so that we can
translate our input bytes into output text:
static char const b64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "+/";
Let's look at those keywords we used:
static : at file-scope, this makes b64_alphabet be internally linked. This makes it global
with respect to the current source file, but keeps it from being externally linked (global with
respect to other source files, part of an external API). Top-level declarations should always
be specified as static unless you intend to expose that identifier globally.
char : The char type is used exclusively for text. Never ever use char for any other
purpose ( signed char and unsigned char are different from char ).
const : this qualifier tells the compiler that the characters in b64_alphabet can't be
changed. It's always good to const -qualify things that aren't supposed to change.
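Notice we use [] to indicate an array of unknown size -- the compiler deduces the necessary
size based on the data used to initialize it. Adjacent string literals are concatenated, and the
resulting string literal in the initializer is a special shorthand for {'A', 'B', 'C', ..., '\0'} .
Note the terminating null character: the array is 65 elements long, not 64.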
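A small sketch of what the compiler deduces here: because a string literal carries a terminating null character, the array ends up one element longer than the 64-character alphabet. The two helper function names are hypothetical.

```c
#include <stddef.h>

static char const b64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "+/";

/* Size of the array in bytes, including the terminating '\0'. */
size_t alphabet_array_size(void)
{
    return sizeof b64_alphabet;
}

/* Number of usable alphabet characters, excluding the terminator. */
size_t alphabet_length(void)
{
    return sizeof b64_alphabet - 1;
}
```

Valid indices into the alphabet are therefore 0 through 63; index 64 holds the terminator and should never be produced by the encoder.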
Next we need to declare the required main function; we will use the argc/argv version since we
need to inspect the command-line FILE argument:
int main(int argc, char *argv[])
{
...
}
Inside the body of the main function, the first thing we want to do is deal with program
arguments:
if (argc > 2) {
    errno = EINVAL; /* "Invalid Argument" */
    err(1, "Too many arguments");
} else if (argc == 2 && strcmp(argv[1], "-")) {
    ... /* open FILE */
} else {
    ... /* use stdin instead */
}
Now, the whole rest of the program will do the same thing whether you open a file or use stdin --
only the file/stream handle will be different. Past this point, the rest of your program shouldn't
need to think about command-line arguments at all. Think carefully!
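One way to keep the rest of the program independent of the arguments is to reduce them to a single FILE * as early as possible. The following is only a sketch of that idea: select_input is a hypothetical helper, it assumes argc has already been checked to be at most 2, and it returns NULL on failure so that the caller can report the error with err .

```c
#include <stdio.h>
#include <string.h>

/* Return the stream to read from: stdin when there is no FILE argument
 * or the argument is "-", otherwise the named file opened for reading.
 * Returns NULL if fopen fails (fopen sets errno in that case).
 * Assumes the caller has already rejected argc > 2. */
FILE *select_input(int argc, char *argv[])
{
    if (argc == 2 && strcmp(argv[1], "-") != 0) {
        return fopen(argv[1], "rb");
    }
    return stdin;
}
```

After a call like this, later code only ever refers to the returned stream, and cleanup only needs to compare it against stdin before calling fclose .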
Next you'll have a loop. One common beginner mistake is to do something like:
while (!feof(input_file)) { ...
This is almost always a bug, because the eof flag of a file stream is only set once a read has
actually run into the end of the file, so it isn't valid to inspect until after you've attempted to
read from the file in question at least once. Instead, you almost always want to just have an
infinite loop with conditional-breaks inside the loop:
for (;;) {
    uint8_t input_bytes[...] = {0};
    size_t n_read = fread(input_bytes, ...);
    if (n_read != 0) {
        /* Have data */
        int alph_ind[...];
        alph_ind[0] = input_bytes[0] >> 2;
        alph_ind[1] = (input_bytes[0] << 4 | input_bytes[1] >> 4) & 0x3Fu;
        ...
        char output[...];
        output[0] = b64_alphabet[alph_ind[0]];
        ...
        ... do something ...
        size_t n_write = fwrite(output, ...);
        if (ferror(...)) err(...); /* Write error */
    }
    if (n_read < num_requested) {
        /* Got less than expected */
        if (feof(...)) break; /* End of file */
        if (ferror(...)) err(...); /* Read error */
    }
}
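To see this loop shape in a complete, compilable form, the sketch below applies the same read pattern to a simpler job: counting the bytes in a stream with a fixed-size buffer. The name count_bytes and the 3-byte chunk size are illustrative choices only, and errors are returned to the caller rather than reported with err .

```c
#include <stdio.h>
#include <stdint.h>

/* Count the bytes remaining in a stream, reading in fixed-size chunks.
 * Returns (size_t)-1 on a read error. */
size_t count_bytes(FILE *fp)
{
    size_t total = 0;
    for (;;) {
        uint8_t chunk[3];
        size_t n_read = fread(chunk, 1, sizeof chunk, fp);
        total += n_read; /* a short read may still carry data */
        if (n_read < sizeof chunk) {
            if (ferror(fp)) return (size_t)-1; /* read error */
            break; /* short read without error: end of file */
        }
    }
    return total;
}
```

Note that the short final chunk is processed before the loop exits; a loop guarded by feof alone tends to get this ordering wrong.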
After the loop, the program needs to clean up, such as calling fclose(...) on any opened files!
Otherwise you might have a memory leak, and those are bad!
Overall, this is a fairly simple program as far as C programs go. It can be well under 100 lines
long, so take it slow and focus on the fundamentals while you work on it.
Let's put it all together:
#include <stdio.h> // Standard input and output
#include <errno.h> // Access to errno and Exxx macros
#include <stdint.h> // Extra fixed-width data types
#include <string.h> // String utilities
#include <err.h> // Convenience functions for error reporting (non-standard)
static char const b64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "+/";

int main(int argc, char *argv[])
{
    if (argc > 2) {
        fprintf(stderr, "Usage: %s [FILE]\n", argv[0]);
        errx(1, "Too many arguments");
    } else if (argc == 2 && strcmp(argv[1], "-")) {
        ... /* open FILE */
    } else {
        ... /* use stdin instead */
    }

    for (;;) {
        uint8_t input_bytes[...] = {0};
        size_t n_read = fread(input_bytes, ...);
        if (n_read != 0) {
            /* Have data */
            int alph_ind[...];
            alph_ind[0] = input_bytes[0] >> 2;
            alph_ind[1] = (input_bytes[0] << 4 | input_bytes[1] >> 4) & 0x3Fu;
            ...
            char output[...];
            output[0] = b64_alphabet[alph_ind[0]];
            ...
            ... do something ...
            size_t n_write = fwrite(output, ...);
            if (ferror(...)) err(...); /* Write error */
        }
        if (n_read < num_requested) {
            /* Got less than expected */
            if (feof(...)) break; /* End of file */
            if (ferror(...)) err(...); /* Read error */
        }
    }

    if (... != stdin) fclose(...); /* close opened files; */
    /* any other cleanup tasks? */
}
Common Issues
Always error check function calls. Don't assume a request to open a file or read/write data
was successful.
You do not need to manipulate the filename argument, if there is one. You can pass it
straight to a function like fopen .
The only <stdio.h> functions you should use for i/o with fixed-size buffers like in this
assignment are fread and fwrite . Stay away from string-based functions like printf
and puts . You can use putchar to write a single character (hint: '\n' ).
Thoroughly read the man pages for the functions you are using. They're all in section 3, so
$ man 3 ...
Testing your program
There is a system base64 utility with the same name as the program you are writing. This is what the
command base64 will run. To run your program, you must provide an explicit path, such as ./base64 if
it's in the current directory. Be careful not to mix them up!
You should test your program using input redirection (or a file argument). Do not test it
interactively; you will get confused by the peculiarities of how a terminal interface works.
Quick, simple testing
To test standard input:
For short inputs, pipe the output of the printf utility into your program:
$ printf 'foobar' | ./base64
With printf , what you see is what you get; this is not so with other utilities, which may add trailing
newlines. Additionally, you can test non-text bytes with backslash escapes:
$ printf '\x00\x01' | ./base64
For longer inputs, store the input in a file and then use input redirection:
$ ./base64 <filename
To test FILE arguments, just run your program with a filename argument:
$ ./base64 filename
Don't forget to test other things...
$ ./base64 filename extra_filename
$ ./base64 filename <ignored_stdin
$ ./base64 - <not_ignored_stdin
etc...
Comprehensive testing
Once you think your program is working correctly, you can compare it to the system utility. By
default, the system utility wraps at 76 characters, so you can compare your program to it without
any additional arguments.
One good method is to use the cmp utility to compare your outputs:
$ base64 <input >reference
$ ./base64 <input >output
$ cmp reference output
The cmp utility produces no output if the files are the same. If it detects differences, it will report
them. It's important to use something like cmp because it will catch non-printing bytes, and other
issues that you can't see.
Another very powerful tool is to use randomly generated data. You can use the head utility to
grab a certain number of bytes of random data and dump them into a file:
$ head -c 100000 /dev/urandom >100k-random-bytes
$ base64 <100k-random-bytes >reference
$ ./base64 <100k-random-bytes >output
$ cmp reference output
Make sure to test with some other byte counts besides 100000! Speaking of byte counts, what
edge cases can you think of? What about:
0 bytes of input?
57 and 114 bytes of input?
1, 2, and 3 bytes of input?
All of these are common failure points! Can you think of others?
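The 1-, 2-, and 3-byte cases matter because they exercise each possible shape of the final group. As a rough sketch based on RFC 4648 (the helper name padding_chars is invented here), the number of '=' characters at the end of the output depends only on the input length modulo 3.

```c
#include <stddef.h>

/* Number of '=' padding characters at the end of the encoded output:
 * 0 for a full final group, 2 when one byte is left over, 1 when two are. */
size_t padding_chars(size_t n_bytes)
{
    return (3 - n_bytes % 3) % 3;
}
```

This agrees with the RFC 4648 test vectors: "f" encodes to Zg== (two padding characters), "fo" to Zm8= (one), and "foo" to Zm9v (none).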
Error reporting
This one is so special, it gets its own section.
If your program encounters an error, you must:
Print an informative error message to stderr
Exit with a non-zero exit status
If you fail to do so, you will not get credit for those parts of the assignment!
Also, if your program does not encounter an error, it must not print to stderr, and its exit status
must be 0, or you will not get credit!
What errors might you expect to run into?
Wrong number of arguments
File open failure
Read failure
The most common mistakes
Only testing with text input / not testing with random data
Only testing small inputs, not being strategic/thoughtful about input size
Ignoring edge cases
Wrong exit code, printing errors to stdout, printing output to stderr
Not catching errors
Evaluation
To receive any credit, your program must:
Compile
$ gcc -std=c99 -Werror=vla -o base64 ...
Accept input on stdin and as a filename argument
Input provided as a multiple of 3 bytes, 0 < len(input) < 57
Output length must be proportional to input length
Output must not change if re-run with the same input
Output must consist entirely of the base64 alphabet, plus '\n' characters
$ printf 'foo' | ./base64
...some output...
$ printf 'foobar' | ./base64
...2x as much output...
Not use dynamic allocation
malloc , realloc , etc.
sbrk
mmap
etc.
Not write to any files
That is, you may not temporarily buffer output or anything else into a temporary file. This is
equivalent to dynamic allocation. Any other similar tricks that violate the spirit of the O(1)
requirement will earn a 0.
Rubric
[5 points] Compiles strictly
The same command as above, with additional strict compilation flags:
$ gcc -std=c99 -Werror -Wvla -Wall -Wextra -o base64 ...
[5 points] Correct output for 0 - 57 input bytes in multiples of 3
$ head -c $(((RANDOM % 19) * 3)) /dev/random >rand-bytes
$ base64 <rand-bytes >reference
$ ./base64 <rand-bytes >output
$ cmp output reference
... no output ...
[5 points] Correct output for 0 - 57 input bytes
Same as above, but $((RANDOM % 57))
[15 points] Correct formatting for input of any length
Output must consist only of base64 alphabet and '\n' characters.
Output must be wrapped at exactly 76 characters
Output must be the correct length
Must complete in 5 seconds @ 1,000,000 bytes of input
[20 points] Correct output for input of any length that is a multiple of 3
Must exactly match system base64 utility
Must complete in 5 seconds @ 1,000,000 bytes of input
[10 points] Correct output for input of any length
Must exactly match system base64 utility
Must complete in 5 seconds @ 1,000,000 bytes of input
[5 points] Recognizes "-" FILE argument correctly
[5 points] Ignores any input on stdin if FILE argument is provided
[5 points] Incorrect number of arguments error handling
[5 points] Read from input file error handling
[5 points] Failure to open FILE argument error handling
[5 points] No memory/resource leaks
Hint: if you open a file, you'd better close it
[10 points] Never crashes during testing
"Crashed" means:
Killed by a signal (SIGSEGV, etc.)
Timed out
What to submit
Submit your assignment to gradescope. You may submit as many files as you would like. Most of
you will submit a single file, such as base64.c . Some students may prefer to break the
assignment into multiple source files, however. Either is accepted.
Your program will be compiled along the following lines, as described above:
$ gcc ... -o base64 *.c
Therefore, the name of your file(s) is irrelevant, as long as there is a .c file with a definition of
the main function to compile.
Benjamin Anderson 6h Unresolved
I tested the head -c1000000 /dev/random > testfile portion of the rubric and got the following
differences:
My question is: What is the difference between the two that I'm not seeing? Because in terms of
printable characters the two are the exact same. Are there some non-printable characters messing
with something?