Background
Many websites that host spam ads and links are either:
- allowing unauthenticated visitors to create content without an account, or
- allowing only authenticated accounts to create content, but letting visitors fully self-register for an account.
Popular examples of this are web forums, blogs with comment threads, and wikis. Unless you are restrictive or selective about who is allowed to create content on your website, the site will get spammed repeatedly. CAPTCHAs can reduce the occurrence of spam, but they can't fully eliminate it.
The Problem
Recently, we've seen an uptick in the number of websites on our campus showing up in Google searches as hosting spam content.
We are far from the only university experiencing this issue, and we've seen these kinds of spam links on .edu websites for at least the past year, if not longer. The spammed websites all seem to share the following characteristics:
- All of these sites are running either WordPress or Drupal, and all of them have comments and public account registration turned off.
- The spam is always in the form of PDF documents, and most of the PDFs refer to online viewing of movies or television shows/events.
- The sites all allow users to submit information via a form (or set of forms) that:
  - allows the upload of a PDF file (e.g. a form that lets visitors submit an application and include a resume/CV as a PDF file), and
  - is protected by a modern CAPTCHA meant to prevent automated spamming attacks.
So, if the sites are locked down and the forms all have CAPTCHAs, how are the spammers getting the PDFs onto the server? Have they figured out a way to break the CAPTCHAs? Are they paying for human labor to solve the CAPTCHA challenges?
Finding Spammed Sites
One way to find out if this is affecting websites in your domain is to Google dork for some of the common phrases that are being used in the ad PDFs. For example, the vast majority of the ads seem to be for video streaming (either live events or movies). The following search looks for PDFs mentioning the word "watch" in common upload paths (replace domain with your own domain):
site:domain (inurl:sites OR inurl:files) filetype:pdf intitle:watch
You can use site:edu instead to see a listing of results across all .edu sites.
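If you want to script this check for one or more domains, here is a minimal sketch of how the equivalent Google search URL could be built (Python 3, standard library only; the example domain and the fallback to all of .edu are just illustrations):

    from urllib.parse import urlencode

    def spam_dork_url(domain: str = "edu") -> str:
        """Build a Google search URL that looks for spammy 'watch ...' PDFs
        under common CMS upload paths for the given domain."""
        query = "site:{} (inurl:sites OR inurl:files) filetype:pdf intitle:watch".format(domain)
        return "https://www.google.com/search?" + urlencode({"q": query})

    # Example: check a single campus domain; with no argument it searches all of .edu.
    print(spam_dork_url("gatech.edu"))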
If you see spammy results, don't be thrown off if none of the pages in the results can be found once you click through to the URLs. The mere presence of spam entries in Google's search results means that, at some point, Google crawled one or more of your websites and was served the spam PDFs. That should raise enough concern for you to investigate further.
The (Apparent) Cause
The problem seems to stem from the way that the forms on these sites are configured…
First, all of these forms allow PDFs to be uploaded before the form itself is submitted.
These uploads take place via AJAX. Here is an example of the web server log entries generated when a file is uploaded to a Drupal site (via a Webform module form):
10.0.0.1 - - [05/Sep/2017:15:51:16 -0400] "GET /misc/throbber-active.gif HTTP/1.1" 200 1963 "https://somewebsite.gatech.edu/sites/default/files/css/css_xE-rWrJf-fncB6ztZfd2huxqgxu4WO-qwma6Xer30m4.css" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:54.0) Gecko/20100101 Firefox/55.0"
10.0.0.1 - - [05/Sep/2017:15:51:16 -0400] "POST /file/ajax/submitted/file/form-SDqIHCt_95oHHFep01lI8kcK1hbVRzuF4DkXTyxeBZM HTTP/1.1" 200 3776 "https://somewebsite.gatech.edu/form/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:54.0) Gecko/20100101 Firefox/55.0"
This may make for a "better" user experience, but it has the side effect of bypassing any CAPTCHA protection the form may have. The CAPTCHA protects the submission of the form proper, but it does not protect the AJAX upload performed by the file submission fields.
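To make the bypass concrete: an automated spammer never has to render the form or solve the CAPTCHA at all; it only has to replay the AJAX upload request. A rough sketch of such a request, using Python's requests library, might look like the following (the endpoint path is copied from the expired example in the log above, and the multipart field name is an assumption that varies by CMS, module, and form configuration):

    import requests

    # Hypothetical illustration only: the form token in this URL came from the
    # log entry above and is long expired; real automation would scrape a
    # fresh token from the target form's HTML first.
    upload_url = ("https://somewebsite.gatech.edu"
                  "/file/ajax/submitted/file/form-SDqIHCt_95oHHFep01lI8kcK1hbVRzuF4DkXTyxeBZM")

    with open("spam.pdf", "rb") as pdf:
        # The field name "files[submitted_file]" is a guess and differs per form.
        response = requests.post(
            upload_url,
            files={"files[submitted_file]": ("spam.pdf", pdf, "application/pdf")},
        )

    # A 200 response with the stored file's URL in the body means the upload
    # landed on the server, and the CAPTCHA was never involved.
    print(response.status_code)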
Second, the uploaded files are placed on the server in a location, and with permissions, that allow them to be downloaded without authentication. The URL of an uploaded file is knowable or guessable by the spammers ahead of time, so it's easy for them to upload PDF files to your server and then submit the resulting URLs to Google for crawling.
Though CMSs and form plug-ins periodically check for uploads and move the newly uploaded files to a private/protected directory, the files generally stick around long enough for the spammers to submit a file's URL to Google and for Google to retrieve, index, and cache the PDF. For this reason, almost all of the spam links you find that were generated by this method return a 404 within a day or two of the spam being uploaded. This might seem like "problem solved," but the SEO/reputation damage to your site and domain has already been done; it's in your best interest to prevent these uploaded spam PDFs from being crawled in the first place. This is such a problem with Drupal websites that back in 2016, the Drupal team issued a PSA about it.
The Solution(s)
So, how do we stop Google from indexing the uploaded files? There are at least four basic strategies:
- Configure the CMS, CMS plug-in/module, or form element to upload files into an area that is not publicly accessible. For example, when using the Drupal Webform module, set the "Upload destination" of every file field to "Private files" (this option requires the site's private file system path to be configured).
- Use an .htaccess file or an equivalent web-server-configuration method to block retrieval of the uploaded files, perhaps allowing access only to a limited audience (see the sketch after this list).
- Use a robots.txt entry to ask Google (and other well-behaved crawlers) not to crawl URLs in the upload path of form submissions (see the example after this list). The downside of this is that spammers may not recognize your defense and will continue to upload spam files to your site without realizing that Google won't crawl them.
- Stop allowing file uploads in forms via AJAX. This may not be possible given your CMS/plug-in/module, but if your form is protected by an effective CAPTCHA, allowing the file upload only as part of the form submission action would stop all but the most aggressive spammers.
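For the .htaccess approach, a minimal sketch might look like the following (this assumes Apache 2.4 and that your form uploads land in a directory you can drop an .htaccess file into; the PDF-only match and the example internal network are assumptions to adapt to your site):

    # .htaccess placed in the directory where form uploads are stored.
    # Deny direct retrieval of uploaded PDFs...
    <FilesMatch "\.pdf$">
        Require all denied
        # ...or, instead of denying everyone, allow only an internal review network:
        # Require ip 10.0.0.0/8
    </FilesMatch>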
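For the robots.txt approach, the entry only needs to cover the paths your forms actually upload into. The paths below are examples based on common Drupal/WordPress upload locations, not a definitive list:

    User-agent: *
    # Example upload paths -- adjust to wherever your CMS writes form uploads.
    Disallow: /sites/default/files/webform/
    Disallow: /files/
    Disallow: /wp-content/uploads/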
Further Notes
- This kind of website spam attack is somewhat unusual, as it doesn't depend on weak credentials or outdated software. It depends, instead, on the specific configuration of CMSs and their form-related plug-ins/modules (OWASP Top 10 category "A5 – Security Misconfiguration"). For this reason, this kind of issue is not easily scanned for with most vulnerability scanners (e.g. Nessus, Qualys, etc.). We have attempted to address the problem through education (presentations to user groups) and monitoring (via Google Alerts).
- Another observation about this type of attack: Google dorking to find spammed sites within your domain seems to be broken with regard to date limits/sorting by date. For example, when doing a simple Google search on our domain, we can see the site physicsreu.gatech.edu listed in the results as having been crawled last on August 31, 2017. Looking at logs on the server that hosts that site confirms that Google successfully crawled a PDF file at that URL on that date. However, when we limit the same search to the "Past month" or even "Past year," that entry is bizarrely absent. For that reason, we don't recommend relying on date limits in your Google searches when looking for spammed sites within your domain.