COMP30023 Project 2
Web proxy

Release date: 1 May 2025
Due date: No later than 11:59pm Monday 26 May, 2025 AEST
Weight: 15% of the final mark

1 Project Overview

The aim of this project is to familiarize you with socket programming. Your task is to write a caching web proxy for HTTP/1.1 (without persistent connections).

Your code must be written in C or Rust. Submissions that do not compile and run on a cloud VM may receive zero marks.

A web proxy is a process that runs on an internet host and receives web requests for URLs hosted on other hosts. It either serves these requests from a cache, or forwards the requests to the actual hosts.

There are many reasons for using proxies. One reason is to cache content. Web browsers cache content locally, but if multiple computers try to download the same content, they cannot get it from another browser’s cache. If they all use the same nearby proxy, then the proxy can download the content once, and individual computers can download copies from there.

Caching is difficult with HTTPS, which often simply relies on proxies to forward encrypted data without being able to look at the headers. (For the reasons for forcing HTTPS, see https://www.troyhunt.com/heres-why-your-static-website-needs-https and the accompanying video https://youtu.be/gZ1mM6OtXIc.)

A web server accessible from your VM will be provided. If you want to test from your personal machine, you can use the following public sites that do not force an upgrade to HTTPS:

• http://www.washington.edu
• http://yimg.com
• http://icio.us
• http://rs6.net
• http://www.faqs.org/faqs
• http://icanhazip.com
• http://example.com
• http://detectportal.firefox.com
• http://info.cern.ch
• http://anzac.unimelb.edu.au

Another reason for proxying is security. It is common for private networks to use “private” IP addresses, and so hosts cannot make TCP connections to hosts on the global internet. However, they can be configured to download web resources by using an HTTP proxy.
The proxy has two IP addresses: one in the “private” address space and another in the global address space, which can reach the web servers.

2 Project Details

Your task is to design and code a simple caching web proxy, capable of proxying GET requests.

You should create an executable named htproxy, with command line syntax:

    ./htproxy -p listen-port [-c]

If the optional -c is on the command line, then caching should be performed (stages 2–4 and stretch goal). The order of arguments is fixed. Argument listen-port is a TCP port number. You may assume that the input is valid.

2.1 Stage 1: Simple proxy

The first stage is simply to proxy all requests, without caching.

This stage will create a listening TCP socket on the port specified by -p on the command line, listening to all interfaces (including IPv6 interfaces), queueing up to backlog=10 incoming requests in listen(3). For any request it receives on that socket, it should identify the host (the “origin server”) from the Host header, create a TCP connection to that host on port 80 and send the request (unchanged) to that host. It will then read the complete response and send it back to the host that sent the request.

The request is terminated by the first blank line. (That is, requests do not have a body.) All header names are case-insensitive, and you can assume they are followed immediately by a single colon and a single space, and that the rest of the line is the value of that header, which you can assume is case-sensitive.

The length in bytes of the body of the response is specified by the Content-Length header. Note that there is no limit on the maximum size of the response; you should not need to read the entire response into memory before starting to send it. However, you may choose to truncate responses longer than 100 kiB, with a penalty of only 0.5 marks. Long responses should not cause your code to abort.

The program should log the line:

    Accepted

to stdout once a connection socket is created.
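
The header rules above (a case-insensitive name followed by a single colon and a single space, with the rest of the line as the value) can be turned into a small lookup routine. The following is only a sketch, not a required design: the name find_header and its signature are hypothetical, and it assumes the header block has already been read into a NUL-terminated buffer whose lines end in "\r\n".

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/*
 * Hypothetical helper: find the value of header `name` in a NUL-terminated
 * header block whose lines end in "\r\n". On success, returns a pointer to
 * the first byte of the value and stores the value length (excluding the
 * trailing "\r\n") in *len. Returns NULL if the header is absent.
 */
const char *find_header(const char *headers, const char *name, size_t *len)
{
    size_t name_len = strlen(name);
    const char *line = headers;

    while (*line != '\0') {
        const char *eol = strstr(line, "\r\n");
        if (eol == NULL || eol == line) {
            break;                  /* unterminated tail or blank line: stop */
        }
        size_t line_len = (size_t)(eol - line);
        /* The name must match case-insensitively and be followed by ": ". */
        if (line_len >= name_len + 2 &&
            strncasecmp(line, name, name_len) == 0 &&
            line[name_len] == ':' && line[name_len + 1] == ' ') {
            *len = line_len - name_len - 2;
            return line + name_len + 2;
        }
        line = eol + 2;             /* move past "\r\n" to the next line */
    }
    return NULL;
}
```

The same routine could serve for both the Host header of the request and the Content-Length header of the response; because the returned value is not NUL-terminated, the caller has to use the reported length.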
If socket creation fails (client or server), the server may discard this request and return to the loop waiting for a new request.

The program should log the last line (only) of the header in the format:

    Request tail last line

to stdout once the request has been read. Omit the trailing \r\n from last line before printing.

Whenever a request is forwarded to the origin server, the program should log a line:

    GETting host request-URI

to stdout, where the request-URI is the second value specified on the first line of the request (request-line), separated by a single space from the GET and by a single space from the HTTP/1.1. There should be a single space between host and request-URI in the log line.

On receiving the response, the program should log the Content-Length header value:

    Response body length content-length

to stdout.

All lines logged to stdout must be terminated by a LF character. Flush stdout after each write, such as by using fflush(3), or ensure that stdout is line buffered.

Remember to check the server for both IPv4 and IPv6 addresses.

After serving one request, the proxy should close the connection socket (that is, not support persistent connections) but keep listening for the next request. To kill it, use CTRL-C (SIGINT).

You may notice that a port and interface which has been bound to a socket sometimes cannot be reused until after a timeout. To make your testing and our marking easier, please override this behaviour by placing the following lines before the bind() call:

    int enable = 1;
    if (setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(int)) < 0) {
        perror("setsockopt");
        exit(1);
    }

You can assume that:

1. The initial request and the server’s response conform to the RFCs.
2. There will be no requests to ports other than 80.
3. Each header entry is a single line (no line folding).
4. All characters in header names and values are printable ASCII.
5.
There will be exactly one occurrence of the Host header in the request, with no commas in the field value.
6. There will be exactly one occurrence of the Content-Length header in the response, with a single decimal non-negative integer field value.

You do not need a timeout; if the server doesn’t respond, your program is allowed to hang. If you implement a timeout, it should not be less than 30 seconds.

2.2 Stage 2: Naive caching

The second stage is to keep a copy of all requests and their responses. Allocate a cache with 10 entries, each 100 kiB in size.

This stage is not required to distinguish between cacheable and non-cacheable requests. Its behaviour on non-cacheable requests is undefined. Until you attempt stage 3, it can simply cache all requests.

The cache is a key-value store, with the key being the entire request. For this project, two requests match if they are the exact same byte string. (In practice, fields can be reordered, and some fields are case-insensitive.)

For every request received, if the request is less than 2000 bytes, look in the cache to see if you have received this request before. If you have, then reply with the response that you received last time. If you do not have the entry in the cache, evict the least recently used (LRU) element of the cache if there is no empty slot. Then fetch the response from the actual host, and if the request is less than 2000 bytes and the response is 100 kiB or less, place it in the cache. The eviction should occur even if the request is not cached.

Whenever a response is sent from the cache, the program should log a line:

    Serving host request-URI from cache

to stdout, instead of logging the GET command.

Whenever an entry is evicted, the program should log a line:

    Evicting host request-URI from cache

to stdout.

2.3 Stage 3: Valid caching

Not all responses can be cached. For this stage, only responses to commands that can be cached should be cached.
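
As an aside on the Stage 2 cache described in section 2.2, the LRU bookkeeping can be sketched with a fixed array of 10 slots and a logical clock. This is one possible layout under stated assumptions, not a prescribed design: every name here (struct cache, cache_lookup, cache_pick_slot, cache_store) is hypothetical, and the sketch leaves the logging of the Serving and Evicting lines to the caller.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS   10
#define MAX_REQUEST   2000          /* only requests shorter than this are cached */
#define MAX_RESPONSE  (100 * 1024)  /* 100 KiB per entry */

struct cache_entry {
    int in_use;
    unsigned long last_used;        /* logical clock value of last hit or store */
    size_t request_len, response_len;
    char request[MAX_REQUEST];      /* key: the entire request, byte for byte */
    char response[MAX_RESPONSE];
};

struct cache {
    unsigned long clock;
    struct cache_entry slot[CACHE_SLOTS];
};

/* Return the index of the entry whose key equals the request bytes, or -1. */
int cache_lookup(struct cache *c, const char *req, size_t req_len)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct cache_entry *e = &c->slot[i];
        if (e->in_use && e->request_len == req_len &&
            memcmp(e->request, req, req_len) == 0) {
            e->last_used = ++c->clock;   /* a hit makes it most recently used */
            return i;
        }
    }
    return -1;
}

/* On a miss: return an empty slot if one exists, otherwise free and return
 * the least recently used slot. *evicted is set to 1 in the latter case; the
 * old key is left in place so the caller can still log the eviction. */
int cache_pick_slot(struct cache *c, int *evicted)
{
    int lru = 0;
    *evicted = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!c->slot[i].in_use) {
            return i;
        }
        if (c->slot[i].last_used < c->slot[lru].last_used) {
            lru = i;
        }
    }
    c->slot[lru].in_use = 0;
    *evicted = 1;
    return lru;
}

/* Store a request/response pair in slot i (an index returned by
 * cache_pick_slot). Returns 0 on success, -1 if either part is too large. */
int cache_store(struct cache *c, int i, const char *req, size_t req_len,
                const char *resp, size_t resp_len)
{
    if (req_len >= MAX_REQUEST || resp_len > MAX_RESPONSE) {
        return -1;
    }
    struct cache_entry *e = &c->slot[i];
    memcpy(e->request, req, req_len);
    memcpy(e->response, resp, resp_len);
    e->request_len = req_len;
    e->response_len = resp_len;
    e->last_used = ++c->clock;
    e->in_use = 1;
    return 0;
}
```

Calling cache_pick_slot before the fetch matches the rule that eviction happens even when the new response turns out not to be cached. The structure is about 1 MB, so it is better declared static or allocated on the heap than placed on the stack.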
Responses that contain Cache-Control headers should be respected. If a Cache-Control header contains private, no-store, no-cache, max-age=0, must-revalidate or proxy-revalidate then do not cache that response.

The ABNF for the Cache-Control header (according to RFC 9111 and RFC 9110) is as follows:

    token           = 1*tchar
    tchar           = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
                      "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
    OWS             = *( SP / HTAB )
    Cache-Control   = [ cache-directive *( OWS "," OWS cache-directive ) ]
    cache-directive = token [ "=" ( token / quoted-string ) ]

RFC 9111 5.2: Cache directives are identified by a token, to be compared case-insensitively.

Whenever an item is fetched but not put in the cache due to the Cache-Control header, the program should log a line:

    Not caching host request-URI

to stdout, after logging the GET command and response body length.

2.4 Stage 4: Expiration

The Cache-Control header can specify a max-age=xxx field. This specifies how many seconds the response is valid for. You can assume xxx is a positive integer that fits in uint32. If it is cached, then the cache entry should become “stale” after that time.

You may assume that there will be at most one Cache-Control header, with at most one max-age directive.

For this stage, the code should not respond with stale cache entries. Instead, it should fetch a fresh copy. If it fits the criteria for caching then the fresh copy should be cached. Otherwise, the entry should be evicted, and logged as such after the GET command and response body length, and if applicable, after Not caching from Stage 3.

Whenever a stale cache entry matches, the program should log a line:

    Stale entry for host request-URI

to stdout, before logging the GET command.

2.5 Stretch goal: Checking for updates

If the cache entry is stale, it is still not always necessary to download the document again.
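
Looking back at the Cache-Control handling of Stages 3 and 4, the ABNF above suggests a directive-by-directive scan: split on commas, skip optional whitespace, compare names case-insensitively, and step over quoted strings so that a comma or directive name inside quotes is not misread. The following is a sketch under assumptions, not a reference implementation: struct cc_result and parse_cache_control are hypothetical names, and treating a directive as forbidding caching regardless of any argument it carries is one interpretation of "contains".

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

struct cc_result {
    int cacheable;      /* 0 if any directive forbids caching */
    int has_max_age;    /* 1 if a max-age directive was seen */
    uint32_t max_age;   /* seconds; meaningful only if has_max_age is 1 */
};

/* Case-insensitive comparison of a length-delimited token with a literal. */
static int name_is(const char *s, size_t n, const char *lit)
{
    return strlen(lit) == n && strncasecmp(s, lit, n) == 0;
}

/* Scan a Cache-Control field value of `len` bytes (no trailing "\r\n"). */
struct cc_result parse_cache_control(const char *v, size_t len)
{
    struct cc_result r = { 1, 0, 0 };
    size_t i = 0;

    while (i < len) {
        /* Skip optional whitespace and the commas between directives. */
        while (i < len && (v[i] == ' ' || v[i] == '\t' || v[i] == ',')) i++;

        size_t name = i;
        while (i < len && v[i] != '=' && v[i] != ',' &&
               v[i] != ' ' && v[i] != '\t') i++;
        size_t name_len = i - name;

        size_t val = i, val_len = 0;
        if (i < len && v[i] == '=') {
            i++;
            if (i < len && v[i] == '"') {
                /* quoted-string: run to the closing quote, honouring escapes */
                val = ++i;
                while (i < len && v[i] != '"') {
                    if (v[i] == '\\' && i + 1 < len) i++;
                    i++;
                }
                val_len = i - val;
                if (i < len) i++;           /* step over the closing quote */
            } else {
                val = i;
                while (i < len && v[i] != ',' && v[i] != ' ' && v[i] != '\t') i++;
                val_len = i - val;
            }
        }

        if (name_is(v + name, name_len, "private") ||
            name_is(v + name, name_len, "no-store") ||
            name_is(v + name, name_len, "no-cache") ||
            name_is(v + name, name_len, "must-revalidate") ||
            name_is(v + name, name_len, "proxy-revalidate")) {
            r.cacheable = 0;
        } else if (name_is(v + name, name_len, "max-age")) {
            uint32_t age = 0;               /* spec: value fits in uint32 */
            for (size_t k = 0; k < val_len &&
                 v[val + k] >= '0' && v[val + k] <= '9'; k++) {
                age = age * 10 + (uint32_t)(v[val + k] - '0');
            }
            r.has_max_age = 1;
            r.max_age = age;
            if (age == 0) r.cacheable = 0;  /* max-age=0 means do not cache */
        }
    }
    return r;
}
```

For Stage 4, a cached entry could record the time it was stored together with max_age, and be treated as stale once the current time exceeds their sum; how the time is obtained is left open here.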
Instead, it is possible for the proxy to insert the If-Modified-Since header to request that the page be downloaded only if it is newer than the cached version. Note that this is no longer simply forwarding the request as-is. This allows the origin server to reply with status code 304 (Not Modified) if the cached entry is still valid. In this case, return the contents of the cache. The max-age argument will not be updated, and so the next request to this URL should again query with If-Modified-Since.

This information can also be obtained using a HEAD request, although that is less efficient as it requires two HTTP requests.

For simplicity, use the Date: header to determine the argument for If-Modified-Since. (In practice, the Last-Modified header is preferable, but neither is a required field, and so a real proxy would have multiple if-thens to determine a time.)

Whenever a stale cache entry is served this way, without being downloaded again, the program should log a line:

    Entry for host request-URI unmodified

to stdout, after logging the presence of the stale entry and after logging Serving...from cache.

The key value for the cache entry should be the header received from the client, without the If-Modified-Since line.

The marks allocated to this stretch goal are deliberately not worth the effort it will take. It should only be attempted by those realistically hoping to get 15/15. If the project seems too big, then do not attempt this stretch goal.

3 Development and Testing

Your code will be marked using curl, and so you should use curl for testing. In addition, you should test that it works in the following contexts:

1. Access the provided server (or one of the HTTP websites listed above) using

    telnet [hostname] 80

Note that pressing
Backspace in telnet sends the backspace character, rather than erasing the last character you typed. It may be better to cut-and-paste the query from a text document. You should try pasting part of a line at a time. (What bugs will this highlight?)

2. Access the site using a browser. You can use lynx on your VM. If you are testing on your local machine, it may help to install a second browser (for example, if you use Chrome, install Firefox). That way you can keep your main browser working while your second browser has its proxy setting set to use your proxy.

4 Marking Criteria

The marks are broken down as follows:

    Task # and description                Marks
    1. Correctly proxy requests           4
    2. Naive caching                      3
    3. Valid caching                      2
    4. Expiration                         1
    5. Safety                             1
    6. Build quality                      1
    7. Quality of software practices      2
    Stretch goal: Checking for updates    1

Code that does not compile and run on cloud VM will usually be awarded zero marks for parts 1–5. Use the GitHub CI infrastructure to ensure your submission is valid.

Your submission will be tested and marked with the following criteria:

Task 1. Correctly proxy requests

Your code correctly
• opens a socket (logs Accepted at the right time) (0.5 marks)
• receives a request, which may come as multiple packets (0.5 marks)
• logs the Content-Length of the response (1 mark)
• sends the (correct) response to the client (1 mark)
• continues to process requests after the first is served (0.5 marks)
• processes replies longer than 100 kiB (0.5 marks)

Task 2. Naive caching

Your code correctly
• serves the second and later requests from the cache. (This will only be tested for pages that should be cached; valid stage 3 code will pass.) (2 marks)
• serves pages too large for the cache (0.5 marks)
• evicts entries correctly (0.5 marks)

Task 3.
Valid caching

Your code correctly
• doesn’t cache requests whose headers require them not to be cached, with simple Cache-Control header values (1 mark)
• doesn’t cache requests whose headers require them not to be cached, with complex Cache-Control header values (1 mark)

(To do well on Task 2, the code must cache responses with no Cache-Control header; don’t break that by attempting Task 3.)

Task 4. Expiration

Your code correctly
• re-loads stale entries after the expiration time (0.5 marks)
• expires pages, even if the Cache-Control header is complex (0.2 marks)
• handles different expiration times for different cache entries (0.3 marks)

(To do well on Task 2, the code must serve from cache until the expiration time; don’t break that by attempting Task 4.)

Task 5. Safety

Network code should never crash with a segmentation fault, even if the hosts on the other side behave poorly. The sorts of bugs that cause segmentation faults (memory errors) also introduce security vulnerabilities. It is OK to print an error message to stderr and abandon the request (or, if necessary, call exit() with an error code).

Task 5 covers segmentation faults, but code that crashes with a segmentation fault may be marked down in other tasks too.

Task 6. Build quality

• The repository must contain a Makefile that produces an executable named “htproxy”, along with all source files required to compile the executable. Place the Makefile at the root of your repository, and ensure that running make places the executable there too.
• Running make clean && make -B && ./htproxy should execute the submission.
• Compiling using “-Wall” should yield no warnings (C). Compiling using “rustc” should yield no warnings (Rust). Do not suppress any default warnings inline.
• Running make clean should remove all object code and executables.
• Do not commit htproxy or other executable files. Scripts (with .sh extension) are exempted.

Test this by committing regularly, and checking the CI feedback; the CI will tell you the mark that you get for this section. (If you need help, ask on the forum.)

Task 7. Quality of software practices

• Proper use of version control, based on the regularity of commit and push events, their content and associated commit messages (e.g., repositories with a single commit and/or non-informative commit messages will lose marks).
• Quality of code, based on the choice of variable names, comments, formatting (e.g. consistent indentation and spacing), and structure (e.g. abstraction, modularity).
• Proper memory management, based on the absence of memory errors and memory leaks. Code will be tested with Valgrind to ensure no memory errors, as these are a security risk. Avoid memory leaks, but you should not catch SIGINT to clean up memory when terminating.

Further deductions may be applied to inappropriate submissions, e.g. catching segmentation faults, hard-coding the output into the code.

Stretch goal

As stated in the instructions.

5 Submission

All code must be written in C or Rust (e.g., it should not be a C wrapper over code in another language) and cannot use any external libraries, except standard libraries as noted below. You must not use or adapt any code or libraries relating to HTTP. Rust submissions must be compiled with stable rustc, with no external crates or build scripts.

You can reuse the code that you wrote for your other individual projects if you clearly specify when and for what purpose you have written it (e.g., the code and the name of the subject, project description and the date, that can be verified if needed). You may use standard libraries (e.g., to create sockets, send, receive data etc.).

Your code must compile and run on the provided VMs.

The repository must contain a Makefile which produces an executable htproxy along with all source files required to compile the executable.
Place the Makefile at the root of your repository, and ensure that running make places the executable there too. Make sure that all source code is committed and pushed. Executable files (that is, all files with the executable bit which are in your repository) will be removed before marking, and cause loss of marks. Hence, ensure that none of your source files have the executable flag set. (You can verify this by cloning your repo onto your VM, and using ls -l.)

If you import code from somewhere else, within the collaboration policy, there should be a commit that does nothing but import that code, with a commit message saying “importing code from [reference]”. You should then customise the imported code in later commits.

GitHub

The use of GitHub is mandatory. Your submission will be assessed using the code in your Project 2 repository (proj2-〈usernames...〉) under the subject’s organization.

We strongly encourage you to commit your code at least once per day. Be sure to push after you commit. This is important not only to maintain a backup of your code, but also because the git history may be considered for matters such as special consideration, extensions and potential plagiarism. Proper use of git will have a positive effect on the mark you get for quality of software practices.

Submission

To submit your project, please follow these steps carefully:

1. Push your code to the repository named proj2-〈usernames...〉 under the subject’s organization, https://github.com/feit-comp30023-2025. Ensure your code compiles and runs on the provided VMs. Code that does not compile or produce correct output on VMs will typically receive very low or 0 marks.
2. Submit the full 40-digit SHA1 hash of the commit you want us to mark to the Project 2 Assignment on the LMS. You are allowed to update your chosen commit by resubmitting the LMS assignment as many times as desired.
However, only the last commit hash submitted to the LMS before the deadline (or approved extension) will be marked without a late penalty.
3. Ensure that the commit that you submitted to the LMS is correct and accessible from a fresh clone of your repository. An example of how to do this is as follows:

    git clone git@github.com:feit-comp30023-2025/proj2- proj2
    cd proj2 && git checkout

Please be aware that we will only mark the commit submitted via the LMS. It is your responsibility to ensure that the submission is correct and corresponds to the commit you want us to mark.

Late submissions will incur a deduction of 2 marks per day (or part thereof).

We strongly encourage you to allow sufficient time to follow the submission process outlined above. Leaving it to the last minute usually results in a submission that is a few minutes to a few hours late, or in the submission of the incorrect commit hash. Either case leads to late penalties.

The submission date is determined solely by the date on which the LMS assignment was submitted. Forgetting to submit via the LMS or submitting the wrong commit hash will result in a late penalty that will apply regardless of the commit date.

We will not give partial marks or allow code edits for either known or hidden cases without applying a late penalty (calculated from the deadline).

Extension policy

For extensions of 1–3 business days, you must:

1. Have an AAP or fill in FEIT’s short extension declaration form before the project’s deadline.
2. Submit an extension request via the form in the Project Module on the LMS.

For extensions of more than 3 business days, you must:

1. Apply for an extension via the special consideration portal before the assessment deadline.
2. Receive a successful outcome for your application.
3. Submit the outcome of your application via the form in the Project Module on the LMS.

Further details are available on the “FEIT Extensions and Special consideration” page on Canvas (under the Welcome module).

6 Testing

You will have access to several test cases (via a HTTP server – see Ed) and their expected outputs. However, these test cases are far from exhaustive; they are mainly to avoid misinterpretation of the specification. Designing and running your own tests is a part of this project.

Your code will be assessed on these cases and on other cases that you haven’t seen before. The unseen cases are not “trick” cases, but are chosen to reflect the fact that real world programming tasks do not come with an exhaustive list of test cases.

Project 2 Repository: The project skeleton and sample outputs are available from: feit-comp30023-2025/project2.

Continuous Integration Testing: To provide you with feedback on your progress before the deadline, we will set up a Continuous Integration (CI) pipeline on GitHub with the same set of test cases. Though you are strongly encouraged to use this service, the usage of CI is not assessed, i.e., we do not require CI tasks to complete for a submission to be considered for marking.

The requisite ci.yml file has been provisioned and placed in your repository, but is also available from the .github/workflows directory of the project2 repository linked above.

7 Teamwork

Both team members are expected to contribute equally to the project. If this is not the case, please approach the head tutor or lecturer to discuss your situation. In cases in which a student’s contribution is deemed inadequate, the student’s mark for the project will be adjusted to reflect their lack of contribution. We will look at git history when making such an assessment.

8 Collaboration and Plagiarism

This is a pair project. Please keep a log of your group interactions in a GIT file called collab.txt or collab.tex. This should include things like who agreed to do what at which meeting, and any changes of plan.

There are no marks allocated to this file, but it will be used in cases where either party wants marks to be allocated unequally between the two partners.
We will look at the GIT history of this file, so please update it as soon as an issue arises, such as if one of you is unable to attend a meeting. Please check this file regularly to confirm that you are happy with what your partner may have written.

Even if you do not expect problems, it is good practice to keep minutes of meetings, and this file is a suitable place for that. If you want to keep a formatted document, you can either use LaTeX, or keep a separate word processor document and export a plain text version for GIT.

Collaboration outside your group

You may discuss this project abstractly with your classmates but what gets typed into your program must be individual work, not copied from anyone else. Do not share your code and do not ask others to give you their programs. The best way to help your friends in this regard is to say a very firm “no” if they ask to see your program, and to point out that your “no”, and their acceptance of that decision, are the only way to preserve your friendship. See https://academicintegrity.unimelb.edu.au for more information.

Note also that solicitation of solutions via posts to online forums, whether or not there is payment involved, is also Academic Misconduct. You should not post your code to any public location (e.g., github.com) until final subject marks are released.

If you use a small amount of code not written by you, you must attribute that code to the source you got it from (e.g., a book or Stack Exchange) in both the comments and the git commit messages.

Do not post your code on the subject’s discussion board Ed, except in a Private thread.

Plagiarism policy: You are reminded that all submitted project work in this subject is to be your own individual work. Automated similarity checking software will be used to compare submissions. It is University policy that cheating by students in any form is not permitted, and that work submitted for assessment purposes must be the independent work of the student concerned.

Using git properly is an important step in the verification of authorship. We should see the stages of your code being written, not just the finished product.

AI software such as ChatGPT can generate code, but it will not earn you marks. You are allowed to use tools like ChatGPT, but if you do then you must strictly adhere to the following rules.

1. Have a file called AI.txt
2. That file must state the query you gave to the AI, and the response it gave
3. You will only be marked on the differences between your final submission and the AI output. If the AI has built you something that gains you points for Task 1, then you will not get points for Task 1; the AI will get all those points. If the AI has built you something that gains no marks by itself, but you only need to modify five lines to get something that works, then you will get credit for identifying and modifying those five lines.
4. If you ask a generic question like “How do I convert an integer to network byte order?” or “What does the error ‘implicit declaration of function rpc_close_server’ mean?” then you will not lose any marks for using its answer, but please report it in your AI.txt file.

If these rules seem too strict, then do not use the AI tools.

These issues are new, and this may not be the best policy, but it is this year’s policy. If you have suggestions for better rules for future years, please mention them on the forum.

Good luck!