
ADR-011 - Improving the Anti-virus

Date: 2022-03-03

Status

⌛️ Superseded

Context

GitHub repo for fb-av

The main reason for streaming the file rather than adding it to a queue is that the user uploading the document receives a real-time message that the file has been uploaded successfully or that there is a problem with it (malware found).

The current flow for a file in the User Filestore:

  1. File is uploaded and given an arbitrary name
  2. Saved to disk in tmp/quarantine within the User Filestore
  3. The file size is checked
  4. The file type is checked
  5. Streamed to fb-av
  6. When the size, type and scan results are correct, the file is encrypted and sent to the S3 bucket
  7. File in quarantine is deleted
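
The flow above can be sketched roughly as follows. This is an illustration only, not the User Filestore's actual code: the size limit, the allowed types, and the `scan`, `encrypt` and `store` callables are hypothetical stand-ins for the fb-av call, the encryption step and the S3 upload. Deleting the quarantined file on rejection as well as on success is also an assumption.

```python
import os
import uuid
from typing import Callable

# Hypothetical limits; the real values are not stated in this ADR.
MAX_SIZE_BYTES = 10 * 1024 * 1024
ALLOWED_TYPES = {"application/pdf", "image/png", "image/jpeg"}


def process_upload(
    data: bytes,
    content_type: str,
    quarantine_dir: str,
    scan: Callable[[str], bool],          # stand-in for streaming to fb-av; True means clean
    encrypt: Callable[[bytes], bytes],    # stand-in for the encryption step
    store: Callable[[str, bytes], None],  # stand-in for the S3 upload
) -> str:
    """Run one upload through the quarantine flow and return a status string."""
    # Steps 1-2: give the file an arbitrary name and save it to quarantine.
    name = uuid.uuid4().hex
    path = os.path.join(quarantine_dir, name)
    with open(path, "wb") as fh:
        fh.write(data)
    try:
        # Step 3: size check.
        if os.path.getsize(path) > MAX_SIZE_BYTES:
            return "rejected: too large"
        # Step 4: type check.
        if content_type not in ALLOWED_TYPES:
            return "rejected: type not allowed"
        # Step 5: scan synchronously so the user gets an immediate answer.
        if not scan(path):
            return "rejected: malware found"
        # Step 6: encrypt and send to storage.
        store(name, encrypt(data))
        return "accepted"
    finally:
        # Step 7: remove the quarantined copy (assumed here to happen on every outcome).
        os.remove(path)
```

Because the scan in step 5 is a blocking call, the status string can be returned to the user in the same request, which is the real-time behaviour a queue would take away.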

Performance Problem

Because documents uploaded by an end user must be scanned in real time, a queue cannot be added in front of the antivirus. Scaling the fb-av pods horizontally did not work as expected: files are streamed for scanning one at a time to a single pod, not to the other running pods.

There is also a networking issue that stems from how Kubernetes works. Because only one pod is running, when it is recycled Kubernetes creates a new pod and, once that is ready, kills the running pod, which is expected behaviour. If a file is being streamed to the pod when it is killed, we get an error, as it seems an in-progress stream to the pod is not detectable.

Another performance issue has arisen: ClamAV restricts traffic if calls to update the full virus definitions come from the same IP address. This causes problems with the overnight acceptance tests, as the update process locks the scanning and the tests fail.

Options

  1. Remove the Virus Check: We can’t - see the following guidance. https://ministryofjustice.github.io/security-guidance/malware-protection-guidance-defensive-layer-2/#malware-protection-guide-defensive-layer-2
  2. Do Nothing: Hope that file uploads remain low in number and accept that we have a bottleneck, failing acceptance tests and network issues.
  3. Adding a Queue: The majority of files will be small (under 7 MB) and take seconds to upload/stream, scan and respond. Adding a queue would be resilient but could give a poor user experience if many uploads happen at once.
  4. 3rd Party Offering: There are many services available, from paid-for services through the Digital Marketplace to the well-known companies. For the usage we currently expect, the cost is not justified. Having the service sit outside the MOJ platform would also require encrypting the file for transmission outside our boundary, which would make scanning more complicated to build and scale.
  5. Reuse: Looking at the virus checking in other MOJ repos, they seem to have the same issues that ours does. We also considered GDS Notify, https://github.com/alphagov/notifications-antivirus; after reviewing their documentation, it does incorporate a queue.
  6. Build something: This is the best route, as the current method meets our needs except for scalability. The three options are:
    1. Adding a controller application which orchestrates documents to independently running ClamAV pods for scanning and reports back the response (virus found or not). The application will need to manage which pods are being used so it can direct traffic correctly.
    2. Utilising a Lambda function to scan the file when it is added to the S3 bucket; this tags the file and a response is sent back to the user dialog. The downside is that the file will be encrypted, so either it must be decrypted, breaking our principle that everything is encrypted, or the file is scanned within our boundary before encryption.
    3. Scanning the file from within the user-filestore. The file is already stored there, so this removes streaming to another service and can scale horizontally.
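
As a rough illustration of the bookkeeping option 6.1 would need, the sketch below tracks which scanner pods are idle and hands each file to one of them. It is a minimal, in-memory sketch under assumptions: the `ScannerPool` class, the `scan_with_pool` helper and the injected `scan` callable are hypothetical names, and a real controller would also have to handle pods being added, recycled or failing.

```python
import threading
from typing import Callable, List, Optional, Set


class ScannerPool:
    """Tracks which scanner pods are busy so each file goes to an idle pod."""

    def __init__(self, pod_urls: List[str]) -> None:
        self._idle: List[str] = list(pod_urls)
        self._busy: Set[str] = set()
        self._lock = threading.Lock()

    def acquire(self) -> Optional[str]:
        """Return an idle pod and mark it busy, or None if every pod is in use."""
        with self._lock:
            if not self._idle:
                return None
            pod = self._idle.pop(0)
            self._busy.add(pod)
            return pod

    def release(self, pod: str) -> None:
        """Mark a pod as idle again once its scan has finished."""
        with self._lock:
            if pod in self._busy:
                self._busy.remove(pod)
                self._idle.append(pod)


def scan_with_pool(pool: ScannerPool, path: str, scan: Callable[[str, str], bool]) -> bool:
    """Scan one file on an idle pod; `scan(pod, path)` is a hypothetical scanner call."""
    pod = pool.acquire()
    if pod is None:
        raise RuntimeError("no idle scanner pod available")
    try:
        return scan(pod, path)
    finally:
        # Always hand the pod back, even if the scan raised.
        pool.release(pod)
```

When every pod is busy the caller has to decide whether to wait or fail fast; waiting reintroduces a queue in all but name, which is the user-experience trade-off noted under option 3.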

Options 6.1 and 6.3 will require us to set up a mirror of the ClamAV definitions, to avoid being slowed down or temporarily blocked when obtaining the definitions.

Options for setting up a mirror:

A. Reuse another mirror on the estate. On searching GitHub, we do not have another mirror running within MOJ.

B. Basic setup: download the definitions, store them in a read-only S3 bucket and update all instances of the User Filestore from that.

C. Use the method ClamAV suggests in its documentation - https://github.com/Cisco-Talos/cvdupdate.

Decision

After more research, an option has presented itself that was not available when the original fb-av was built.

  • ClamAV now offers a Docker image which contains the definitions
  • Adding a second replica pod

Consequences

  • Using the ClamAV image does not require downloading the entire database when building the fb-av image. This removes the need for a mirror or for copying the database files from somewhere else
  • With a replica, virus checking will not be interrupted while pods are recycled
  • Deploying the pods is no longer halted by definition updates from ClamAV

Constraints

  • We still need to add monitoring and load test how many files can be streamed to a single fb-av pod