SvenSvensonov
The references, due to their formatting and excess use of hyperlinks and strange symbols, tend to cause moderation issues, so I've omitted them for now; I'll try to include them after tweaking their formatting a bit.
In the wake of the NSA spying revelations, the German BND decided to return to typewriters instead of computer-based documents and communications. It won’t make a difference.
Here's why
INTRODUCTION
This paper reports on recovering keystrokes typed on a keyboard from a sound recording of the user typing. Emanations produced by electronic devices have long been a topic of concern in the security and privacy communities. Both electromagnetic and optical emanations have been used as sources for attacks. For example, Kuhn was able to recover the display on CRT and LCD monitors using indirectly reflected optical emanations. Acoustic emanations are another source of data for attacks. Researchers have shown that acoustic emanations of matrix printers carry substantial information about the printed text, and some researchers suggest it may be possible to discover CPU operations from acoustic emanations. In ground-breaking research, Asonov and Agrawal showed that it is possible to recover text from the acoustic emanations of typing on a keyboard.
Most emanations, including acoustic keyboard emanations, are not uniform across different instances, even when the same device model is used, and they are affected by the environment. Different users on a single keyboard, or different keyboards (even of the same model), emit different sounds, making reliable recognition hard. Asonov and Agrawal achieved a relatively high recognition rate (approximately 80 percent) when they trained neural networks with text-labeled sound samples of the same user typing on the same keyboard. Their attack is analogous to a known-plaintext attack on a cipher: the cryptanalyst has a sample of plaintext (the keys typed) and the corresponding ciphertext (the recording of acoustic emanations). This labeled-training-sample requirement suggests a limited attack, because the attacker needs to obtain training samples of significant length. Presumably these could be obtained from video surveillance or network sniffing. However, video surveillance in most cases should render the acoustic attack irrelevant, because even if passwords are masked on the screen, a video shot of the keyboard could directly reveal the keys being typed.
In this paper we argue that a labeled training sample is unnecessary for an attacker. This implies keyboard emanation attacks are more serious than previous work suggests. The key insight in our work is that typed text is often not random. When one types English text, the finite number of commonly used English words limits the possible temporal combinations of keys, and English grammar limits word combinations. One can first cluster (using unsupervised methods) keystrokes into a number of acoustic classes based on their sound. Given sufficient (unlabeled) training samples, a most-likely mapping between these acoustic classes and the actual typed characters can be established using these language constraints.
THE ATTACK
We take a recording of a user typing English text on a keyboard and produce a recognizer that can, with high accuracy, determine subsequent keystrokes from sound recordings, provided the text is typed by the same person, with the same keyboard, under the same recording conditions. These conditions can easily be satisfied by, for example, placing a wireless microphone in the user’s work area or by using parabolic or laser microphones from a distance. Although we do not necessarily know in advance whether a user is typing English text, in practice we can record continuously, try to apply the attack, and see if meaningful text is recovered.
The attack consists of the following steps:
Feature extraction
We use cepstrum features, a technique developed by researchers in voice recognition. As we discuss below, cepstrum features give better results than FFT.
Unsupervised key recognition
Using unlabeled training data, we cluster each keystroke into one of K acoustic classes, using standard data-clustering methods. K is chosen to be slightly larger than the number of keys on the keyboard. If these acoustic classes corresponded exactly to different keys in a one-to-one mapping, we could easily determine the mapping between keys and acoustic classes. However, clustering algorithms are imprecise: keystrokes of the same key are sometimes placed in different acoustic classes and, conversely, keystrokes of different keys can be placed in the same acoustic class. We therefore let the acoustic class be a random variable conditioned on the actual key typed, so a particular key falls in each acoustic class with a certain probability. In well-clustered data, the probabilities of one or a few acoustic classes will dominate for each key. Once the conditional distributions of the acoustic classes are determined, we try to find the most likely sequence of keys given the sequence of acoustic classes for each keystroke. Naively, one might think that picking the letter with the highest probability for each keystroke yields the best estimate and that our job is done; the language constraints below let us do better.
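The clustering step above can be sketched in a few lines. This is a minimal k-means in NumPy standing in for the "standard data clustering methods" the text mentions; the feature vectors here are synthetic stand-ins for real keystroke features, and the farthest-point initialization is an illustrative choice, not the paper's.

```python
import numpy as np

def kmeans(features, k, iters=20):
    # Farthest-point initialization keeps the initial centroids spread out.
    centroids = [features[0]]
    for _ in range(k - 1):
        dists = np.min(
            [np.linalg.norm(features - c, axis=1) for c in centroids], axis=0)
        centroids.append(features[dists.argmax()])
    centroids = np.array(centroids)
    for _ in range(iters):
        # Assign each keystroke to its nearest centroid ...
        d = np.linalg.norm(features[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # ... then move each centroid to the mean of its acoustic class.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    return labels

# Synthetic demo: three well-separated "keys", ten keystrokes each.
rng = np.random.default_rng(1)
keys = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
features = np.vstack([key + 0.1 * rng.standard_normal((10, 2)) for key in keys])
labels = kmeans(features, k=3)
# Keystrokes of the same key should land in the same acoustic class.
assert all(len(set(labels[i * 10:(i + 1) * 10])) == 1 for i in range(3))
```

Real data is messier, which is exactly why the text treats the acoustic class as a random variable conditioned on the key rather than a one-to-one mapping.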
Spelling and grammar checking
We use dictionary-based spelling correction and a simple statistical model of English grammar. These two approaches, spelling and grammar, are combined in a single Hidden Markov Model. This increases the character accuracy rate to over 70 percent, yielding a word accuracy rate of about 50 percent or more. At this point, the text is quite readable.
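The Hidden Markov Model step can be illustrated with a tiny Viterbi decoder that combines per-keystroke acoustic evidence with English bigram statistics. The probabilities below are invented for the example, not taken from the paper, and the three-letter alphabet is a drastic simplification.

```python
import math

def viterbi(emissions, bigram, chars):
    # emissions: per keystroke, dict char -> P(acoustic evidence | char)
    # bigram: dict (prev, cur) -> P(cur | prev); chars: the alphabet
    best = {c: math.log(emissions[0].get(c, 1e-9)) for c in chars}
    back = []
    for em in emissions[1:]:
        nxt, ptr = {}, {}
        for c in chars:
            # Best predecessor under acoustic score plus bigram score.
            p, prev = max(
                (best[q] + math.log(bigram.get((q, c), 1e-9))
                 + math.log(em.get(c, 1e-9)), q) for q in chars)
            nxt[c], ptr[c] = p, prev
        best, back = nxt, back + [ptr]
    # Trace back the most likely character sequence.
    last = max(best, key=best.get)
    out = [last]
    for ptr in reversed(back):
        out.append(ptr[out[-1]])
    return "".join(reversed(out))

chars = "teq"
# Keystroke 2 is acoustically ambiguous between 'e' and 'q' ...
emissions = [{"t": 0.9, "e": 0.05, "q": 0.05},
             {"t": 0.1, "e": 0.45, "q": 0.45}]
# ... but English bigram statistics strongly favour "te" over "tq".
bigram = {("t", "e"): 0.8, ("t", "q"): 0.01, ("t", "t"): 0.19}
assert viterbi(emissions, bigram, chars) == "te"
```

The language model breaks the tie that the acoustic evidence alone cannot: this is the sense in which English structure substitutes for labeled training data.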
Feedback-based training
Feedback-based training produces a keystroke acoustic classifier that does not require an English spelling and grammar model, enabling random-text recognition, including password recognition. In this step, we use the previously obtained corrected results as labeled training samples. Note that our corrected results are not 100 percent correct, so we use heuristics to select the words that are more likely to be correct. For example, a word that is not spell-corrected, or one that changes only slightly during correction in the last step, is more likely to be correct than one that had more changes. In our experiments, we pick out those words with fewer than one-fourth of their characters corrected and use them as labeled samples to train an acoustic classifier. The recognition phase then recognizes the training samples again; this second recognition typically yields a higher keystroke accuracy rate. We use the number of corrections made in the spelling and grammar correction step as a quality indicator: fewer corrections indicate better results. The same feedback procedure is repeated until no significant improvement is seen; in our experiments, we perform three feedback cycles. Our experiments indicate that both linear classification and Gaussian mixtures perform well as classification algorithms, and both are better than the neural networks used in previous work. Character accuracy rates (without a final spelling and grammar correction step) reach up to 92 percent.
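The word-selection heuristic above, keeping words with fewer than one-fourth of their characters corrected, can be sketched as follows. Comparing character positions directly is a simplification of a real edit-distance check, used here only to make the threshold concrete.

```python
# Sketch of the feedback-training sample filter: keep a word as a
# "probably correct" labeled sample only if spelling correction changed
# fewer than one-fourth of its characters.
def keep_for_training(recognized, corrected):
    if len(recognized) != len(corrected):
        return False  # insertions/deletions: treat as heavily corrected
    changed = sum(a != b for a, b in zip(recognized, corrected))
    return changed < len(corrected) / 4

# "securitp" -> "security": 1 of 8 characters changed, so it is kept.
assert keep_for_training("securitp", "security") is True
# "xqcurxty" -> "security": 3 of 8 characters changed, so it is rejected.
assert keep_for_training("xqcurxty", "security") is False
```

Filtering this way trades sample quantity for label quality, which is what lets the noisy corrected output bootstrap a better classifier on each cycle.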
The second phase, the recognition phase, uses the trained keystroke acoustic classifier to recognize new sound recordings. If the text consists of random strings, such as passwords, the result is output directly. For English text, the spelling and grammar language model described above is used to further correct the result. To distinguish between the two types of input, random or English, we apply the correction and see if reasonable text is produced. In practice, a human attacker can typically determine whether text is random. An attacker can also identify occasions when the user types user names and passwords; for example, password entry typically follows a URL for a password-protected website. Meaningful text recovered from the recognition phase during an attack can also be fed back to the first phase. These new samples, along with the existing samples, can be used to further increase the accuracy of the keystroke classifier.
Keystroke Extraction
Typical users can type up to about 300 characters per minute. Keystrokes consist of a push and a release. Our experiments confirm Asonov and Agrawal’s observation that the period from push to release is typically about 100 milliseconds, and there is usually more than 100 milliseconds between consecutive keystrokes, which is enough to distinguish them. We need to detect the start of a keystroke, which is essentially the start of the push peak in the keystroke’s acoustic signal.
We distinguish between keystrokes and silence using energy levels in time windows. In particular, we calculate the windowed discrete Fourier transform of the signal and use the sum of all FFT coefficients as the energy, applying a threshold to detect the start of each keystroke.
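The energy-thresholding idea can be sketched as below. The synthetic signal, window size, and threshold are all illustrative assumptions; a real recording would need tuned values.

```python
import numpy as np

def keystroke_starts(signal, win=64, threshold=5.0):
    starts, in_stroke = [], False
    for i in range(0, len(signal) - win, win):
        # Energy = sum of FFT coefficient magnitudes in this window.
        energy = np.abs(np.fft.rfft(signal[i:i + win])).sum()
        if energy > threshold and not in_stroke:
            starts.append(i)  # rising edge: a keystroke push begins here
        in_stroke = energy > threshold
    return starts

# Silence with two bursts of noise standing in for two keystroke pushes.
rng = np.random.default_rng(0)
signal = np.zeros(1024)
signal[200:300] = rng.standard_normal(100)
signal[700:800] = rng.standard_normal(100)
assert len(keystroke_starts(signal)) == 2
```

Tracking the rising edge (quiet window followed by a loud one) rather than raw loudness is what separates two distinct keystrokes from one sustained one.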
Features: Cepstrum vs. FFT
Given the start of each keystroke, features are extracted from the audio signal during the period from wav_position to wav_position + Δt. Our experiments compared two different types of features. First we used FFT features, as in previous work. This time period roughly corresponds to the touch peak of the keystroke, which is when the finger touches the key. An alternative would be to use the hit peak, when the key hits the supporting plate. The hit peak is harder to pinpoint in the signal, so our experiments used the touch peak.
Next, we used cepstrum features. Cepstrum features are widely used in speech analysis and recognition, and have been empirically verified to be more effective than plain FFT coefficients for voice signals. In particular, we used Mel-Frequency Cepstral Coefficients (MFCCs).
Asonov and Agrawal observed that high-frequency acoustic data provides limited value, so we ignore data above 12 kHz. After feature extraction, each keystroke is represented as a vector of features (FFT coefficients or MFCCs).
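Turning one keystroke's audio slice into a feature vector with the 12 kHz cutoff can be sketched as below. The sample rate and slice length are illustrative assumptions; MFCCs, which the text prefers, would post-process a similar spectrum through mel filterbanks rather than use the raw coefficients.

```python
import numpy as np

def fft_features(audio_slice, sample_rate=44_100, cutoff_hz=12_000):
    # Magnitude spectrum of the slice around the keystroke's touch peak.
    spectrum = np.abs(np.fft.rfft(audio_slice))
    freqs = np.fft.rfftfreq(len(audio_slice), d=1.0 / sample_rate)
    # Discard coefficients above the 12 kHz cutoff, which carry little value.
    return spectrum[freqs <= cutoff_hz]

# A 10 ms slice (441 samples at 44.1 kHz) of a 440 Hz tone as a stand-in.
audio_slice = np.sin(2 * np.pi * 440 * np.arange(441) / 44_100)
features = fft_features(audio_slice)
# rfftfreq spacing is 44100/441 = 100 Hz, so bins 0, 100, ..., 12000 remain.
assert len(features) == 121
```

Each keystroke then becomes a fixed-length vector, which is what the clustering and classification steps operate on.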
Defenses
Since our attack is based on passively eavesdropping on acoustic signals, it is more difficult to detect than active attacks, in which attackers interact with their victims. Here are some preliminary areas for potential defenses:
Reduce the possibility of leaking acoustic signals. Soundproofing may help, but given the effectiveness of modern parabolic and laser microphones, the standards are very high.
Quieter keyboards, as suggested by Asonov and Agrawal, may reduce vulnerability. However, the two so-called “quiet” keyboards we used in our experiments proved ineffective against the attack. Asonov and Agrawal also suggest that keyboard makers could produce keyboards whose keys sound so similar that they are not easily distinguishable. They claim that one reason keys sound different today is that the plate underneath the keys makes different sounds when hit at different places. If this is true, using a more uniform plate may alleviate the attack. However, it is not clear whether such keyboards are commercially viable. There is also the possibility that more subtle differences between keys could still be captured by an attacker, and keyboards may develop distinct keystroke sounds after months of use.
Another approach is to reduce the quality of the acoustic signal that attackers can acquire. We could add masking noise while typing; however, masking noise might be easily separated out. As we discussed above, an array of directional microphones may be able to record and separate sound into multiple channels according to the locations of the sound sources, so this defense could also be ineffective when attackers are able to collect more data. Reducing the annoyance of masking is also an issue. Perhaps a short window of noise could be added at every predicted push peak; this may be more acceptable to users than continuous masking noise. Alternatively, we could randomly insert noise windows that sound like the push peaks of keystrokes.
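The targeted-masking idea, injecting short noise bursts only at predicted push peaks rather than masking continuously, can be sketched as below. The peak positions, window length, and amplitude are all illustrative assumptions.

```python
import numpy as np

def mask_at_peaks(signal, peaks, win=100, amplitude=1.0, seed=0):
    rng = np.random.default_rng(seed)
    masked = signal.copy()
    for p in peaks:
        # Overlay a short burst of noise around each predicted push peak.
        end = min(p + win, len(masked))
        masked[p:end] += amplitude * rng.standard_normal(end - p)
    return masked

signal = np.zeros(1000)
masked = mask_at_peaks(signal, peaks=[100, 600])
# Noise appears only inside the masked windows; the rest is untouched.
assert np.any(masked[100:200] != 0) and np.all(masked[300:600] == 0)
```

As the text cautions, whether such bursts can be separated out by an attacker with enough data, or whether the predicted peaks can be timed accurately, remains an open question.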
The practice of relying only on typed passwords, or even long passphrases, should be reexamined. One alternative is two-factor authentication, which combines passwords or passphrases with smart cards, one-time-password tokens, or biometric authentication. However, two-factor authentication does not solve all our problems: typed text other than passwords is also valuable to attackers.
CONCLUSION
Our new attack on keyboard emanations needs only an acoustic recording of typing on a keyboard and recovers the typed content. Compared to previous work, which requires clear-text labeled training data, our attack is more general and more serious. More importantly, the techniques we use to exploit inherent statistical constraints in the input and to perform feedback training can be applied to other emanations with similar properties.
@Oscar @Slav Defence @Jungibaaz @Gufi @Nihonjin1051 @AMDR @AUSTERLITZ @levina