As the world becomes more and more data-driven, secure handling of user data is more critical than ever.
As developers, our jobs are hard enough already: we deal with highly complex and fragile systems with multiple failure points while translating fleeting human wishes into UIs and backends. Added to that is an emerging and essential consideration: data security. And for good reason: we as customers are enraged if our data gets misused (so it’s only fair we give our users a secure and enjoyable experience), and governments and enterprises demand it for compliance.
Data security as passing-the-buck
What makes security harder is that it has several layers and becomes the everybody’s-responsibility-is-nobody’s-responsibility thing. In a modern cloud team, multiple roles directly control the ingress/egress of data: developers, database admins, sysadmins (DevOps folks, if you will), privileged back-office users, and so on. Each of these roles/teams can easily close its eyes and think of data security as the others’ problem. The reality is that each has its own world to take care of: a database admin cannot control the app side of security, a DevOps person can do absolutely nothing about back-office access, and so on.
Developers and data security
All that said, developers have the largest surface area of access when it comes to data: they build every part of the app; they connect to various backend services; they ferry access tokens back and forth; they have the whole database cluster to read/write from at their command; the apps they write have unquestioned access to all parts of the system (for instance, a Django app in production has all the privileges to dump or wipe the entire S3 collection of the last ten years), and so on. As a result, the highest chance of sloppiness or oversight in terms of security exists at the source code level and is the developer’s direct responsibility.
Now, data security is a bottomless rabbit hole, and there’s no way I can even scratch the surface in a single post. However, I want to cover the essential terminology that developers must know to keep their apps secure. Think of it as App Data Security 101.
Let’s get started!
Hashing
If you want a highly rigorous definition, there’s always Wikipedia, but in simple terms, hashing is the process of converting data into another form in which the original information is unreadable. To get a feel for this kind of scrambling, take the well-known Base64 format (strictly speaking an encoding rather than a hash, which is exactly why it is so insecure): the string “Is my secret safe with you?” becomes “SXMgbXkgc2VjcmV0IHNhZmUgd2l0aCB5b3U/”. If you start writing your personal diary in Base64 format, for example, there’s no way your family can read your secrets (unless they know how to decode from Base64)!
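To see why Base64 offers no real protection, here is a minimal Python sketch (Python is my choice here, and the variable names are only for illustration). It scrambles the string above and then unscrambles it straight away, with no secret of any kind involved.

```python
import base64

secret = "Is my secret safe with you?"

# Scramble: turn the text into its Base64 representation.
encoded = base64.b64encode(secret.encode("utf-8")).decode("ascii")

# Unscramble: anyone can do this, because no key or password is needed.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded)
```

Since decoding needs nothing but the scrambled text itself, an attacker who gets hold of Base64 data effectively has the original.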
This idea of scrambling the data is used when storing passwords, credit card numbers, etc., in web apps (actually, it should be used in all types of apps). The idea, of course, is that in the event of a data breach, the attacker shouldn’t be able to use the passwords, credit card numbers, etc., to do actual damage. Highly robust and sophisticated algorithms are used to perform this hashing; something like Base64 would be a joke and would be undone instantly by any attacker.
Password hashing uses a cryptographic technique known as one-way hashing, which means that while it’s possible to scramble the data, it’s not possible to unscramble it. Then how does the app know it’s your password when you log in? Well, it uses the same process and compares the scrambled form of what you just entered as the password to the scrambled form stored in the database; if they match, you’re allowed to log in!
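The login check described above can be sketched in a few lines of Python. Treat it as an illustration under assumptions rather than production code: plain SHA-256 stands in for a real password-hashing algorithm, and `scramble`, `stored_hash`, and `can_log_in` are hypothetical names.

```python
import hashlib
import hmac


def scramble(password: str) -> str:
    """One-way hash of the password (SHA-256, hex-encoded)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# At sign-up, only the scrambled form is kept (a stand-in for a database row).
stored_hash = scramble("superman009")


def can_log_in(entered_password: str) -> bool:
    # Scramble the attempt the same way and compare the two scrambled forms.
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(scramble(entered_password), stored_hash)


print(can_log_in("superman009"))
print(can_log_in("batman123"))
```

Note that the app never needs the original password again after sign-up; it only ever compares scrambled forms.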
While we’re on the topic of hashes, here’s something interesting. If you ever download software or files from the Internet, you might have been told to verify the files before using them. For instance, if you want to download the Ubuntu Linux ISO, the download page will show you an option to verify your download; if you click it, a popup will open:
The popup tells you to run a command that hashes the entire file you just downloaded and compares the result to the hash string you see on the download page: 5fdebc435ded46ae99136ca875afc6f05bde217be7dd018e1841924f71db46b5. This conversion is performed using the SHA-256 algorithm, which you can see mentioned in the final part of the command: shasum -a 256 --check.
The idea is that if the hash produced through your check is different, the file is not the one the publisher released: either it was corrupted during download, or someone has meddled with it and supplied you with a compromised file instead.
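The same verification can be done without the shasum tool. Here is a rough Python sketch, assuming you have the downloaded file on disk and the published hash copied from the download page; the function names are hypothetical. It reads the file in chunks so that a multi-gigabyte ISO doesn’t have to fit in memory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_intact(path: str, published_hash: str) -> bool:
    """Compare the file's digest with the hash string from the download page."""
    return sha256_of_file(path) == published_hash.strip().lower()
```

A call such as `download_is_intact("ubuntu.iso", "5fdebc43...")` (with the full hash string, and a file name that is only an example) would return `False` for a tampered or corrupted file.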
Some familiar names you will hear in the domain of hashing are MD5 (insecure and now defunct), SHA-1 (no longer considered safe either), SHA-2 (a family of algorithms, of which SHA-256 and SHA-512 are members), scrypt, bcrypt, etc.
Salting
All types of security are a cat-and-mouse game: the thief learns the current system and comes up with a novel crack, which gets noticed, and the lock makers improve their game, and so on and so on. Cryptography is no exception. While converting hashes back to passwords is practically impossible, attackers over time have developed sophisticated techniques that combine intelligent guesswork with sheer computation power; as a result, they can often work out the correct password for a weak or common choice, given just the hash.
As a result, the technique of salting has developed. All it means is that the hash of a password (or any data) is calculated from a combination of two things: the data itself and a random string, generated fresh for each password, that the attacker cannot predict in advance. So, with salting, if we want to hash the password superman009, we’d first select a random string as a “salt,” say, bCQC6Z2LlbAsqj77, and then perform the hash calculation on superman009-bCQC6Z2LlbAsqj77. The salt is stored alongside the hash so that the login check can repeat the calculation. Because every password gets its own salt, identical passwords no longer produce identical hashes, and precomputed tables of hashes for common passwords become useless, vastly reducing the scope for intelligent reverse engineering or guesswork.
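Here is a minimal sketch of salted password hashing using only Python’s standard library. The assumptions are mine: scrypt is picked from the algorithms named earlier, the cost parameters are common illustrative values, and `hash_password` and `verify_password` are hypothetical helper names. Note that scrypt takes the salt as a separate input instead of gluing it to the password with a hyphen, but the idea is the same.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is generated when none is given."""
    if salt is None:
        salt = secrets.token_bytes(16)
    # Illustrative cost parameters; tune them for your own hardware.
    digest = hashlib.scrypt(password.encode("utf-8"), salt=salt, n=2**14, r=8, p=1)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-run the calculation with the stored salt and compare the results."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


# Two users who pick the same password still end up with different hashes.
salt_a, hash_a = hash_password("superman009")
salt_b, hash_b = hash_password("superman009")
print(hash_a != hash_b)
print(verify_password("superman009", salt_a, hash_a))
```

In a real app you would normally let your framework or a dedicated password library do this for you; the sketch only shows what happens underneath.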
Both hashing and salting are incredibly complicated domains that are constantly evolving. So, as application developers, we’d never deal directly with them. But it’d help us greatly if we knew about them and could make better decisions. For instance, if you maintain an old PHP framework and happen to see that it uses MD5 hashes for passwords, you know it’s time to swap in a modern password-hashing library in the user account creation process.
Keys
You’ll come across the term “keys” often in the context of encryption. So far, we have been covering password hashing, a one-way scrambling in which we convert the data irreversibly and destroy the original form. This is a bad idea for everyday practical usage: a document written and emailed so securely that it can never be read is of no use! Thus, we want to encrypt data so that the information is open to the sender and the receiver, but unreadable while it is being transferred or stored.
For this, the concept of a “key” exists in cryptography. It’s exactly what it sounds like: the key to a lock. The person who owns the information scrambles it using some secret called a key. Anyone who doesn’t have this key, whether the receiver or an attacker, will find it practically impossible to unscramble the data, no matter how sophisticated their algorithms might be.
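To make the lock-and-key idea tangible, here is a deliberately tiny Python sketch. It uses a one-time pad (XOR with a random key as long as the message), which is my choice purely because it fits in a few lines of standard-library code; real applications should rely on a vetted encryption library and never reuse a key this way. The name `xor_bytes` is hypothetical.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the matching byte of the key."""
    if len(key) != len(data):
        raise ValueError("a one-time pad key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


message = b"Is my secret safe with you?"

# The key is the shared secret: random bytes, used for this one message only.
key = secrets.token_bytes(len(message))

ciphertext = xor_bytes(message, key)    # scramble with the key
recovered = xor_bytes(ciphertext, key)  # the same key unscrambles it

print(recovered == message)
```

Unlike hashing, this is reversible, but only for someone holding the key; trying to unscramble with any other key just produces different garbage.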
Key Rotation
While keys make encryption possible and reliable, they carry the same risks passwords do: once someone knows the key, the whole game is up. Imagine a scenario in which somebody hacks some part of a service like GitHub (even if for a few seconds) and gets hold of 20-year-old code. Inside the code, they also find the cryptographic keys used to encrypt the company’s data (it’s a horrible practice to store keys along with source code, but you’d be surprised how often this happens!). If the company hasn’t bothered to change its keys (just like passwords), the same key can be used to wreak havoc.
As a result, the practice of changing keys frequently has evolved. This is called key rotation, and if you’re using any respectable cloud PaaS provider, it should be available as an automated service.
For instance, AWS has a dedicated service for this called AWS Key Management Service (KMS). An automated service saves you the hassle of changing and distributing keys among all the servers and is a no-brainer these days when it comes to large deployments.
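The bookkeeping behind rotation can be sketched with a small in-memory keyring. Everything here is an assumption made for illustration: the `KeyRing` class is hypothetical, it is not how AWS KMS works internally, and it signs messages with HMAC instead of encrypting them, only because Python’s standard library ships no cipher. The point is the versioning: new work uses the newest key, older keys stay around for a while so existing data can still be checked, and retired keys become useless to an attacker.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Tuple


class KeyRing:
    """Hypothetical in-memory keyring holding versioned secret keys."""

    def __init__(self) -> None:
        self._keys: Dict[int, bytes] = {}
        self.current_version = 0
        self.rotate()  # start with version 1

    def rotate(self) -> int:
        """Create a fresh key and make it the current one."""
        self.current_version += 1
        self._keys[self.current_version] = secrets.token_bytes(32)
        return self.current_version

    def retire(self, version: int) -> None:
        """Forget an old key once nothing depends on it any more."""
        if version == self.current_version:
            raise ValueError("cannot retire the current key")
        self._keys.pop(version, None)

    def sign(self, message: bytes) -> Tuple[int, bytes]:
        """Sign with the current key; return the key version with the tag."""
        key = self._keys[self.current_version]
        return self.current_version, hmac.new(key, message, hashlib.sha256).digest()

    def verify(self, message: bytes, version: int, tag: bytes) -> bool:
        """Check a tag against the key version it was created with."""
        key = self._keys.get(version)
        if key is None:
            return False  # the key was retired, so the old tag is worthless
        expected = hmac.new(key, message, hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)
```

A managed service takes care of the hard parts this sketch ignores, such as storing the keys safely and distributing them to every server.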
Public Key Cryptography
If all the previous talk about encryption and keys makes you think it’s highly cumbersome, you’re right. Keeping keys safe and passing them around so that only the receiver can see the data runs into logistical issues that, on their own, wouldn’t have allowed today’s secure communications to prosper. But thanks to public-key cryptography, we can safely communicate or make purchases online.
This type of cryptography was a major mathematical breakthrough, and it’s the sole reason the Internet isn’t falling apart in fear and mistrust. The details of the algorithm are intricate and highly mathematical, so I can only explain it conceptually here.
Public Key Cryptography relies on the use of two keys to process information. One of the keys is called the Private Key and is supposed to stay private with you and never be shared with anyone; the other is called the Public Key (which is where the method gets its name) and is supposed to be published publicly. If I’m sending data to you, I first need to get your public key, encrypt the data with it, and send it to you; at your end, you can decrypt the data using your private key. As long as you don’t accidentally reveal your private key, I can send encrypted data to you that only you can open.
The beauty of the system is that I don’t need to know your private key, and anybody who intercepts the message can do nothing to read it even though they have your public key. If you’re wondering how this is even possible, the shortest and most non-technical answer comes from the properties of multiplication of prime numbers:
It’s tough for computers to factorize the product of two large prime numbers. So, if the primes behind the key are very large, you can be reasonably sure that, with today’s computers, the message can’t be decrypted even in thousands of years.
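To make the two-key idea concrete, here is a toy version of RSA, one well-known public-key algorithm built on the prime-number property just described (this post doesn’t name a specific algorithm, so treat that choice as my assumption). The primes are absurdly small so the arithmetic is easy to follow; real keys use primes hundreds of digits long, plus padding schemes this sketch leaves out.

```python
# Toy numbers only: anyone could factor n = 3233 by hand.
p, q = 61, 53
n = p * q                  # the modulus, part of both keys
phi = (p - 1) * (q - 1)    # computable only if you know p and q
e = 17                     # public exponent, chosen coprime with phi
d = pow(e, -1, phi)        # private exponent: modular inverse (Python 3.8+)

public_key = (e, n)        # published for everyone
private_key = (d, n)       # never shared


def encrypt(m: int, key: tuple) -> int:
    exponent, modulus = key
    return pow(m, exponent, modulus)


def decrypt(c: int, key: tuple) -> int:
    exponent, modulus = key
    return pow(c, exponent, modulus)


message = 65                                 # a message encoded as a number below n
ciphertext = encrypt(message, public_key)    # anyone can do this
recovered = decrypt(ciphertext, private_key) # only the key owner can do this

print(recovered == message)
```

An eavesdropper sees `n` and `e`, but to derive `d` they would need `phi`, which means finding `p` and `q` from `n`. With toy numbers that’s trivial; with very large primes it’s the hard problem the whole scheme rests on.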
Transport Layer Security (TLS)
You now know how Public Key Cryptography works. This mechanism is what’s behind all the HTTPS popularity and is what causes Chrome to say, “This site is secure.” What’s happening is that the browser uses the server’s public key (delivered in the server’s certificate) to verify the server’s identity while the two sides agree on a temporary shared key; that shared key then encrypts the HTTP traffic (remember, web pages are very long strings of text that browsers can interpret) in both directions, resulting in Secure HTTP (HTTPS).
It’s interesting to note that the encryption doesn’t happen on the Transport Layer as such; the OSI model says nothing about encrypting data. It’s just that data is encrypted by the application (in this case, the browser) before it’s handed off to the Transport Layer, which later drops it off at its destination, where it’s decrypted. However, the process involves the Transport Layer, and at the end of the day, it all results in secure transport of data, so the loose term “transport” layer security has stuck around.
You might even come across the term Secure Sockets Layer (SSL) in some cases. It’s the same concept as TLS, except that SSL originated much earlier and is now deprecated in favor of TLS.
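From a developer’s seat, using TLS mostly means asking a library to do it properly. Below is a small Python sketch using the standard `ssl` module; the helper names `make_tls_context` and `fetch_tls_version` are hypothetical, and the choice of TLS 1.2 as the minimum version is my assumption about a sensible floor.

```python
import socket
import ssl
from typing import Optional


def make_tls_context() -> ssl.SSLContext:
    """Client-side settings: verify the server's certificate and hostname."""
    ctx = ssl.create_default_context()
    # Refuse SSL and early TLS versions outright.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def fetch_tls_version(host: str, port: int = 443) -> Optional[str]:
    """Connect to a server and report the protocol version that was negotiated."""
    ctx = make_tls_context()
    with socket.create_connection((host, port), timeout=10) as raw_sock:
        # The handshake (certificate check and key agreement) happens here.
        with ctx.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            return tls_sock.version()
```

Calling `fetch_tls_version("example.com")` needs network access and would typically return a string such as a TLS version name; if the certificate can’t be verified, the connection attempt raises an error instead of silently continuing.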
Full Disk Encryption
Sometimes security needs are so intense that nothing can be left to chance. For example, government servers where all biometric data of a country is stored cannot be provisioned and run like normal application servers as the risk is too high. It’s not enough for these needs that data be encrypted only when being transferred; it has to be encrypted when at rest, too. For this, full disk encryption is used to encrypt the entirety of a hard disk to ensure data is secure even when physically breached.
It’s important to note that Full Disk Encryption has to work below the operating system, often at the hardware level. That is so because if we encrypt the entire disk, the operating system is also encrypted and cannot run when the machine starts. So, the hardware (or a small unencrypted boot component) has to understand that the disk contents are encrypted and must perform decryption on the fly as requested disk blocks are passed to the operating system. Because of this extra work, Full Disk Encryption can result in slower reads/writes, which the developers of such systems must keep in mind.
End-to-End Encryption
With the ongoing privacy and security nightmares of large social networks these days, no one is unaware of the term “end-to-end encryption,” even if they have nothing to do with making or maintaining apps.
We saw earlier how Full Disk Encryption provides the ultimate bullet-proof strategy, but for the everyday user, it’s not convenient. I mean, imagine that Facebook wants the data it generates and stores on your phone to be secure, but it can’t be allowed to encrypt your entire phone and lock out everything else in the process.
For this reason, these companies have adopted end-to-end encryption, which means data is encrypted when it’s created, stored, or transferred by the app. In other words, the data stays fully encrypted all the way to the recipient and is readable only on the recipient’s phone.
Note that how much End-to-End (E2E) encryption protects you depends on who actually holds the keys. In a sound design, the keys live only on the users’ devices; if the key is instead stored with the business, it’s just standard encryption, and your messages are only as safe as the business decides.
You’ve likely heard of most of these terms already. Maybe even all of them. If so, I’d encourage you to revisit your understanding of these concepts, as well as perform an evaluation of how seriously you take them. Remember, app data security is a war you need to win every time (and not just once), as even a single breach is enough to destroy entire industries, careers, and even lives!