README1/10-c代写|学霸联盟

README1/10-c代写

时间：2023-04-27

4/26/23, 10:20 AM README · UCSC-CSE-130/project2-map-reduce Wiki
https://github.com/UCSC-CSE-130/project2-map-reduce/wiki/README 1/10
UCSC-CSE-130 / project2-map-reduce Public template
README
Jump to bottom
Shun Kashiwa edited this page last week · 8 revisions
Project 2: MapReduce
GOAL: This project will provide you with experience building a system that involves multiple threads.
IMPORTANT: Please read SETUP on setting up the development environment.
Introduction to MapReduce
Motivation
Modern computers are equipped with multiple cores, which allows multiple threads to do
independent work in parallel. However, in order to take advantage of parallelism, your program
must use threads and synchronization mechanisms (e.g., mutexes, condition variables) and distribute
work. Writing parallel code is challenging, and developers can easily create bugs like deadlocking.
In the early 2000s, engineers at Google were facing a similar problem in a slightly different context.
Large companies like Google needs to process large amounts of data, such as crawled documents,
web request logs, etc., that is too large to be handled on one machine, and they use a cluster of
hundreds or thousands of machines to complete them in a reasonable amount of time. Distributing
work over computational entities make the program complex, even if the original, sequential
program is simple.
MapReduce is a programming model that makes it easy to think about such programs. It provides a
powerful abstraction that allows developers to express simple computations but hides the complex
details of distributing work.
Programming Model
Code Issues Pull requests Actions Projects Wiki Security Insig
4/26/23, 10:20 AM README · UCSC-CSE-130/project2-map-reduce Wiki
https://github.com/UCSC-CSE-130/project2-map-reduce/wiki/README 2/10
In the programming model of MapReduce, the computation takes a set of input key-value pairs and
produces a set of output key-value pairs. The developer writes the computation as two functions:
Map and Reduce.
Map is a function that takes an input pair and produces a set of intermediate key-value pairs. The
resulting set of grouped by intermediate keys and passed to the Reduce function.
Reduce is a different function that accepts an intermediate key I and a set of all key-value pairs
whose key is equal to I . It merges the values and outputs a set of resulting key-value pairs.
Example: Word-Counting
To understand how MapReduce works, consider the word-counting problem. In this problem, the
program is given a list of text files, and its goal is to count the number of occurrences of each word
in those files.
Suppose that input keys are the file names and values are their contents, like this:
First, let us define our Map function. The Map function splits each value by delimiters (e.g.,
whitespace), and for each word, it produces an intermediate key-value pair ($word, 1) . In Python,
it would look like this:
If you pass the previous example input to this function, it will generate a set of intermediate key-
value pairs like this:
Next, MapReduce groups those pairs by their key. In our example, because we have two occurrences
of "world", they are grouped together.
[("file1", "hello world"), ("file2", "good afternoon world")]
def m(key: str, value: str):
words = value.split()
intermediate_pairs = []
for word in words:
intermediate_pairs.append((word, 1))
return intermediate_pairs
[("hello", 1), ("world", 1), ("good", 1), ("afternoon", 1), ("world", 1)]
{
[("hello", 1)],
4/26/23, 10:20 AM README · UCSC-CSE-130/project2-map-reduce Wiki
https://github.com/UCSC-CSE-130/project2-map-reduce/wiki/README 3/10
Reduce is a function that merges all intermediate key-value pairs with the same key. Because we're
interested in the total number of word occurrences, our reduce function will return the sum of
values for the same key.
We join the lists produced by the reduce function get the result:
Notice that calls to Map and Reduce are independent. For example, two threads can call map on
"file1" and "file2" at the same time without invalidating the result. Similarly, reduce can also be
done with more than one thread for each key.
Project Details
In this project, you will implement a MapReduce-style multi-threaded data processing library. Your
goal is to complete the map_reduce function in mr.c with the following signature:
mapper is a function that performs the map operation.
num_mapper is the number of threads used to execute mapper .
For example, if num_mapper == 8 , map_reduce will spawn eight threads, each processing a
subset of the data
reducer is a function that performs the reduce operation.
num_reducer is the number of threads used to execute reducer .
input is a list of key-value pairs that represent the data to process
[("world", 1), ("world", 1)],
[("good", 1)],
[("afternoon", 1)],
}
def reduce(key: str, pairs: list[tuple[str, int]]) -> list[tuple[str, int]]:
s = 0
for word, count in pairs:
# all `key` is equal to `word`
s += count
return [(key, count)]
[("hello", 1), ("world", 2), ("good", 1), ("afternoon", 1)]
void map_reduce(mapper_t mapper, size_t num_mapper, reducer_t reducer,
size_t num_reducer, kvlist_t* input, kvlist_t* output);
4/26/23, 10:20 AM README · UCSC-CSE-130/project2-map-reduce Wiki
https://github.com/UCSC-CSE-130/project2-map-reduce/wiki/README 4/10
output is a list of key-value pairs that map_reduce writes the results to.
kvlist.h
Before we go into the details of the map_reduce function, let us first take a look at some data
structures provided to implement our MapReduce library.
In MapReduce, everything is a list of key-value pairs. Because arrays in C are hard to work with, we
implemented kvlist.h that provides useful data structures and functions.
First, kvpair_t represents a pair of two strings (key and value). It can be constructed with
kvpair_new and destructed with kvpair_free as follows.
Notice that kvpair_new allocates heap memory for the key and value and makes copies of them.
These memories are freed when kvpair_free is called.
The key of kvpair_t is immutable, i.e., it is constant throughout the lifetime of the object. The
value, however, can be updated with the kvpair_update_value function.
Next, kvlist_t represents a list of kvpair_t s. Internally, kvlist_t is implemented as a singly-
linked list. There are several functions to work with kvlist_t .
kvlist_new and kvlist_free are the constructor and destructor.
kvlist_append takes pointers to list and kvpair_t. It appends the key-value pair at the end of
the list.
kvlist_extend takes pointers to two lists: lst and lst2 . It moves all the elements of lst2
to the end of lst . After this function is called, lst2 becomes empty because nodes are
moved to lst .
kvlist_sort takes a pointer to a list and sorts the list by ascending order of the keys.
kvlist_print prints a list. This is used for printing results of map_reduce and is useful for
debugging.
Finally, kvlist_iterator_t is an iterator for kvlist_t . You can construct an iterator with
kvlist_iterator_new , get the next element in the list using kvlist_iterator_next , and
deconstruct the iterator with kvlist_iterator_free .
kvpair_t* pair = kvpair_new("key", "value"); // create a new key-value pair
pair->key; // -> "key"
pair->value; // -> "value"
kvpair_free(&pair); // free `pair`
4/26/23, 10:20 AM README · UCSC-CSE-130/project2-map-reduce Wiki
https://github.com/UCSC-CSE-130/project2-map-reduce/wiki/README 5/10
For example, the code below creates a list of five key-value pairs and uses the iterator to loop
through.
Note that kvlist.h is not thread-safe. You might need a mechanism to prevent multiple threads
from modifying the same list at the same time.
hash.h
hash.h provides a simple hash function that is useful for distributing the results of map to
reducers. hash.h provides one function hash with the following signature:
It takes a string and returns its hash as a long.
Mapper and Reducer
The mapper ( mapper_t ) and reducer ( reducer_t ) are defined as follows:
// construct a list
kvlist_t* list = kvlist_new();
// append 5 pairs
kvlist_append(list, kvpair_new("key1", "value1"));
kvlist_append(list, kvpair_new("key2", "value2"));
kvlist_append(list, kvpair_new("key3", "value3"));
kvlist_append(list, kvpair_new("key4", "value4"));
kvlist_append(list, kvpair_new("key5", "value5"));
// construct an iterator
kvlist_iterator_t* itor = kvlist_iterator_new(list);
while(true) {
kvpair_t* pair = kvlist_iterator_next(itor);
if(pair == NULL) {
// `kvlist_iterator_next` returns `NULL` at the end of list
break;
}
printf("key = %s, value = %s\n");
}
// cleanup
kvlist_iterator_free(&itor);
kvlist_free(&list); // will free the list and pairs
unsigned long hash(char *str);
typedef void (*mapper_t)(kvpair_t* kv, kvlist_t* output);
typedef void (*reducer_t)(char* key, kvlist_t* list, kvlist_t* output);
4/26/23, 10:20 AM README · UCSC-CSE-130/project2-map-reduce Wiki
https://github.com/UCSC-CSE-130/project2-map-reduce/wiki/README 6/10
mapper_t is a function that takes a key-value pair ( kvpair_t* kv ) and the list to write output
( kvlist_t* output ).
For example, the word-count example can be implemented as follows:
Then, we can pass a reference of this function to map_reduce .
Similarly, reducer_t is a function that takes a key , kvlist_t* lst , and another list to write results
( kvlist_t* output ). As a result of "shuffle" and sorting, every key in lst will be the same as key .
The word-count reducer can be defined as
It goes through the list, reads the values as integers, and adds them up to the sum variable. In the
end, it adds a single key-value pair to output , whose value is the string representation of the sum.
void m(kvpair_t* pair, kvlist_t* output) {
char delim[] = " \t\r\n-,.!?'\";():;";
char* token;
char* saveptr;
token = strtok_r(pair->value, delim, &saveptr);
if (token == NULL) {
return;
}
kvlist_append(output, kvpair_new(token, "1"));
while ((token = strtok_r(NULL, delim, &saveptr)) != NULL) {
kvlist_append(output, kvpair_new(token, "1"));
}
}
void r(char* key, kvlist_t* lst, kvlist_t* output) {
int sum = 0;
kvlist_iterator_t* itor = kvlist_iterator_new(lst);
for (;;) {
kvpair_t* pair = kvlist_iterator_next(itor);
if (pair == NULL) {
break;
}
sum += atoi(pair->value);
}
kvlist_iterator_free(&itor);
char sum_string[32];
sprintf(sum_string, "%d", sum);
kvlist_append(output, kvpair_new(key, sum_string));
}
4/26/23, 10:20 AM README · UCSC-CSE-130/project2-map-reduce Wiki
https://github.com/UCSC-CSE-130/project2-map-reduce/wiki/README 7/10
map_reduce
This section describes the high-level structure of the map_reduce function. Note that there are
many ways to implement this. What's described here is one way to implement this function, but feel
free to change the structure for better performance, lower memory usage, etc.
There are five phases in map_reduce as follows:
Split Phase: Split the input list into num_mapper smaller lists.
Map Phase: Spawn num_mapper threads and execute the provided map function.
Shuffle Phase: Shuffle mapper results to num_reducer lists.
Reduce Phase: Spawn num_reducer threads and execute the provided reduce function.
Concat the resulting lists to get a single list.
Split Phase
In the split phase (also called the partition phase), you split the input list into num_mapper smaller
lists so that each smaller list can be processed by different threads independently.
Map Phase
In the map phase, you create num_mapper threads. Each thread is responsible for a smaller list from
the previous phase and calls the mapper function to obtain a new list.
Shuffle Phase
In the shuffle phase, you create num_reducer independent lists that can be processed in the next
phase. Since you need to provide all pairs with the same key to the reducer function, the same key
must be assigned to the same list. For example, if you have two pairs with the same key, those two
pairs must be assigned to the same list so that they can be passed to reducer together.
4/26/23, 10:20 AM README · UCSC-CSE-130/project2-map-reduce Wiki
https://github.com/UCSC-CSE-130/project2-map-reduce/wiki/README 8/10
Reduce Phase
In the reduce phase, you create num_reducer threads. Each thread is responsible for a smaller list
from the previous phase and calls the reducer function to aggregate results.
When calling reducer , you need to construct a list of all pairs with the same key. There are many
ways to do this, but one way is to use the kvlist_sort function.
Output
You need to store the results in the output list passed as an argument. Use kvlist_extend to
move pairs to output .
Additional Functionality
In addition, your implementation must do the following:
You should not have a main function in mr.c .
make must create mr.o . We will use this object file to link your code with our test programs.
Your code must use POSIX threads ( pthread.h ).
Your code must not cause segfaults.
All source files must be formatted using clang-format. Run make format to format .c and .h
files.
Your map_reduce must not leak memory. Use valgrind to check memory leaks.
Testing with word-count
mr.c is a library and is designed to be used by some other program. We provide a sample
program that uses map_reduce : word-count so you can test your mr.c implementation. While
these programs are useful for testing, we recommend you write your own program that calls
map_reduce for debugging.
word-count is a variant of the word-counting example from the previous section. It is invoked with
three or more arguments: word-count $NUM_MAPPER $NUM_REDUCER file ... .
$NUM_MAPPER is a positive integer that specifies the number of threads used for the map
function.
$NUM_REDUCER is a positive integer that specifies the number of threads used for the reduce
function.
file ... is one or more (text) files.
4/26/23, 10:20 AM README · UCSC-CSE-130/project2-map-reduce Wiki
https://github.com/UCSC-CSE-130/project2-map-reduce/wiki/README 9/10
Pages 3
Find a page…
Home
README
Project 2: MapReduce
Introduction to MapReduce
Motivation
Programming Model
Example: Word-Counting
Project Details
kvlist.h
hash.h
Mapper and Reducer
map_reduce
Split Phase
Map Phase
Shuffle Phase
Reduce Phase
Output
Additional Functionality
Testing with word-count
Suppose that you have a file hello.txt whose content is "hello, world!".
word-count 1 1 hello.txt would count the words in hello.txt using one mapper thread and one
reducer thread and print the results to the standard output. word-count 4 2 hello.txt will do the
same with four mapper threads and two reducer threads.
It is possible that when you change the number of threads, you get the results in different orders.
This is fine, and we will grade based on the correctness of the count.
We will use this program and some other programs that include mr.h to test your map_reduce
function.
$ word-count 1 1 hello.txt
hello,1
world,1
4/26/23, 10:20 AM README · UCSC-CSE-130/project2-map-reduce Wiki
https://github.com/UCSC-CSE-130/project2-map-reduce/wiki/README 10/10
SETUP
Clone this wiki locally
https://github.com/UCSC-CSE-130/project2-map-reduce.wiki.git