How do you generate a random data set in Python?
Watch Now: This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: Generating Random Data in Python
How random is random? This is a weird question to ask, but it is one of paramount importance in cases where information security is concerned. Whenever you’re generating random data, strings, or numbers in Python, it’s a good idea to have at least a rough idea of how that data was generated.

Here, you’ll cover a handful of different options for generating random data in Python, and then build up to a comparison of each in terms of its level of security, versatility, purpose, and speed.

I promise that this tutorial will not be a lesson in mathematics or cryptography, which I wouldn’t be well equipped to lecture on in the first place. You’ll get into just as much math as needed, and no more.

How Random Is Random?

First, a prominent disclaimer is necessary. Most random data generated with Python is not fully random in the scientific sense of the word. Rather, it is pseudorandom: generated with a pseudorandom number generator (PRNG), which is essentially any algorithm for generating seemingly random but still reproducible data.

“True” random numbers can be generated by, you guessed it, a true random number generator (TRNG). One example is to repeatedly pick up a die off the floor, toss it in the air, and let it land how it may. Assuming that your toss is unbiased, you have truly no idea what number the die will land on. Rolling a die is a crude form of using hardware to generate a number that is not deterministic whatsoever. (Or, you can have the dice-o-matic do this for you.) TRNGs are out of the scope of this article but worth a mention nonetheless for comparison’s sake.

PRNGs, usually done with software rather than hardware, work slightly differently. Here’s a concise description: a PRNG starts with a number, known as the seed, and then uses a deterministic algorithm to generate a pseudorandom sequence of values based on it.
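You’ve likely been told to “read the docs!” at some point. Well, those people are not wrong. Here’s a particularly notable snippet from the random module’s documentation: “Warning: The pseudo-random generators of this module should not be used for security purposes.”

You’ve probably seen calls such as random.seed(1234) in Python code. Such a call seeds the underlying generator, which makes every subsequent call deterministic: the same seed always produces the same output.

Perhaps the terms “random” and “deterministic” seem like they cannot exist next to each other. To make that clearer, picture an extremely trimmed down version of a random number generator: it stores a single number as its state, and each call computes the next “random” number from that state with a fixed arithmetic rule.

Don’t take this example too literally, as it’s meant mainly to illustrate the concept. If you use the seed value 1234, the subsequent sequence of calls to the generator will always be identical.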
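The idea of a seeded, deterministic generator can be sketched in a few lines. This is a toy under stated assumptions, not how Python's random module works internally: the class name ToyGenerator and the update rule (multiply by 3, take the remainder modulo 19) are made up purely for illustration.

```python
class ToyGenerator:
    """A deliberately tiny, insecure generator that only illustrates seeding."""

    def __init__(self, seed=3):
        self.state = seed

    def seed(self, value):
        # Reset the internal state; everything that follows depends on it.
        self.state = value

    def random(self):
        # Hypothetical update rule chosen for illustration only.
        self.state = (self.state * 3) % 19
        return self.state


gen = ToyGenerator()

gen.seed(1234)
first_run = [gen.random() for _ in range(5)]

gen.seed(1234)  # same seed again
second_run = [gen.random() for _ in range(5)]

print(first_run)
print(first_run == second_run)  # True: same seed, same sequence
```

Reseeding with the same value replays the same sequence, which is the sense in which “random” and “deterministic” coexist.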
You’ll see a more serious illustration of this shortly.

What Is “Cryptographically Secure?”

If you haven’t had enough with the “RNG” acronyms, let’s throw one more into the mix: a CSPRNG, or cryptographically secure PRNG. CSPRNGs are suitable for generating sensitive data such as passwords, authenticators, and tokens. Given a random string, there is realistically no way for Malicious Joe to determine what string came before or after that string in a sequence of random strings.

One other term that you may see is entropy. In a nutshell, this refers to the amount of randomness introduced or desired. For example, one Python module that you’ll cover here, secrets, defines DEFAULT_ENTROPY = 32, the number of bytes to return by default.

A key point about CSPRNGs is that they are still pseudorandom. They are engineered in some way that is internally deterministic, but they add some other variable or have some property that makes them “random enough” to prohibit backing into whatever function enforces determinism.

What You’ll Cover Here

In practical terms, this means that you should use plain PRNGs for statistical modeling, simulation, and to make random data reproducible. They’re also significantly faster than CSPRNGs, as you’ll see later on. Use CSPRNGs for security and cryptographic applications where data sensitivity is imperative.

In addition to expanding on the use cases above, in this tutorial, you’ll delve into Python tools for using both PRNGs and CSPRNGs: random and numpy.random on the PRNG side, and os.urandom(), secrets, and uuid on the CSPRNG side.
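You’ll touch on all of the above and wrap up with a high-level comparison.

PRNGs in Python

The random Module

Probably the most widely known tool for generating random data in Python is its random module, which uses the Mersenne Twister PRNG algorithm as its core generator.

Earlier, you touched briefly on random.seed(), and now is a good time to see how it works. Without seeding, random.random() returns a random float in the interval [0.0, 1.0), and a different one on each run.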
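If you call random.random() a few times yourself, I’ll bet my life savings that the numbers returned on your machine will be different. The default when you don’t seed the generator is to use your current system time or a “randomness source” from your OS if one is available.

With random.seed(), you can make results reproducible: the chain of calls made after random.seed() produces the same trail of data every time.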
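A minimal sketch of that behavior, using only the standard library. The seed value 444 is arbitrary; any fixed value works the same way.

```python
import random

# Unseeded: these values differ from run to run and machine to machine.
unseeded = [random.random() for _ in range(3)]

# Seeded: the same seed always yields the same sequence.
random.seed(444)
first = [random.random() for _ in range(3)]

random.seed(444)
second = [random.random() for _ in range(3)]

print(unseeded)
print(first == second)  # True
```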
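After seeding twice with the same value, notice the repetition of “random” numbers. The sequence of random numbers becomes deterministic, or completely determined by the seed value, such as 444.

Let’s take a look at some more basic functionality of random. To generate a random integer between two endpoints, use random.randint(), which spans the full [x, y] interval and may return either endpoint.

With random.randrange(), you can exclude the right-hand side of the interval, meaning the generated number always lies within [x, y) and is always smaller than the right endpoint.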
If you need to generate random floats that lie within a specific [x, y] interval, you can use random.uniform(), which draws from the continuous uniform distribution.
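As a quick illustration of drawing integers and floats from an interval, here is a small sketch with the standard library. The endpoints are arbitrary example values.

```python
import random

# randint() includes both endpoints: the result lies in [0, 10].
whole = random.randint(0, 10)

# randrange() excludes the right endpoint: the result lies in [1, 10).
half_open = random.randrange(1, 10)

# uniform() returns a float between the two endpoints.
real = random.uniform(20, 30)

print(whole, half_open, real)
```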
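To pick a random element from a non-empty sequence (like a list or a tuple), you can use random.choice(). There is also random.choices() for choosing multiple elements from a sequence with replacement, so duplicates are possible.

To mimic sampling without replacement, use random.sample().

You can randomize a sequence in-place using random.shuffle(). This modifies the sequence object itself and returns None.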
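The sequence helpers in random can be sketched together as follows. The list contents are placeholder values, and the shuffle is applied to a copy so the original list is left untouched.

```python
import random

items = ['one', 'two', 'three', 'four', 'five']

one_item = random.choice(items)                 # a single element
with_replacement = random.choices(items, k=3)   # duplicates are possible
without_replacement = random.sample(items, 4)   # no element is picked twice

shuffled = items.copy()    # copy first so `items` keeps its order
random.shuffle(shuffled)   # shuffles in place and returns None

print(one_item, with_replacement, without_replacement, shuffled)
```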
If you’d rather not mutate the original list, you’ll need to make a copy first and then shuffle the copy. You can create copies of Python lists with the copy module, or just x[:] or x.copy(), where x is the list.

Before moving on to generating random data with NumPy, let’s look at one more slightly involved application: generating a sequence of unique random strings of uniform length.

It can help to think about the design of the function first. You need to choose from a “pool” of characters such as letters, numbers, and/or punctuation, combine these into a single string, and then check that this string has not already been generated. A Python set works well for this type of membership testing: because a set cannot contain duplicates, you can keep adding newly generated strings to it until it holds the number of unique strings you want.
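One way to write such a function is sketched below. The function name unique_strings and its parameters (k, ntokens, pool) are assumptions for illustration, and the approach only terminates if the pool can produce at least ntokens distinct strings of length k.

```python
import random
import string


def unique_strings(k, ntokens, pool=string.ascii_letters):
    """Return a set of `ntokens` unique strings, each of length `k`."""
    seen = set()
    while len(seen) < ntokens:
        # Join k characters drawn (with replacement) from the pool.
        token = ''.join(random.choices(pool, k=k))
        # Adding a duplicate to a set is a no-op, so the loop keeps going
        # until enough distinct strings have been collected.
        seen.add(token)
    return seen


tokens = unique_strings(k=4, ntokens=5)
print(tokens)
```

Because this relies on random, the strings are reproducible with a seed but are not suitable as security tokens.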
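For a fine-tuned version of this function, this Stack Overflow answer uses generator functions, name binding, and some other advanced tricks to make a faster, cryptographically secure version.

PRNGs for Arrays: numpy.random

One thing you might have noticed is that a majority of the functions from random return a scalar value (a single int, float, or other object). If you wanted to generate a sequence of random numbers, one way to achieve that would be with a Python list comprehension.

But there is another option that is specifically designed for this. You can think of NumPy’s own numpy.random package as being like the standard library’s random, but for NumPy arrays.

Take note that numpy.random uses its own PRNG that is separate from plain old random. A call to Python’s own random.seed() does not make NumPy’s output reproducible; you need to seed NumPy’s generator separately, for example with np.random.seed().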
Without further ado, here are a few examples to whet your appetite:
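The following sketch uses NumPy's long-standing np.random functions. It assumes NumPy is installed; the seed and array shapes are arbitrary.

```python
import numpy as np

# Seed NumPy's own generator; random.seed() would have no effect here.
np.random.seed(444)

samples = np.random.randn(5)             # 5 draws from the standard normal
grid = np.random.randn(3, 4)             # 3 rows, 4 columns
ints = np.random.randint(0, 10, size=5)  # integers in [0, 10)

# Reseeding reproduces the first draw exactly.
np.random.seed(444)
repeat = np.random.randn(5)

print(samples)
print(grid.shape)
print(np.array_equal(samples, repeat))  # True
```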
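In the syntax for np.random.randn(d0, d1, ..., dn), the parameters d0, d1, ..., dn are optional and indicate the shape of the returned array. For example, np.random.randn(3, 4) creates a 2D array with 3 rows and 4 columns of samples from the standard normal distribution.

Another common operation is to create a sequence of random Boolean values, True or False. One way to do this is with np.random.choice([True, False]); another is to draw integers from (0, 1) and view-cast them to Booleans.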
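Both approaches to random Booleans can be sketched like this, assuming NumPy is installed. The array length of 25 is arbitrary, and no claim is made here about which variant is faster on your machine.

```python
import numpy as np

# Option 1: choose directly from the two Boolean values.
flags_choice = np.random.choice([True, False], size=25)

# Option 2: draw 8-bit integers that are 0 or 1, then reinterpret the same
# memory as Booleans (both types are one byte wide, so no copy is needed).
flags_view = np.random.randint(0, 2, size=25, dtype=np.uint8).view(bool)

print(flags_choice)
print(flags_view)
```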
What about generating correlated data? Let’s say you want to simulate two correlated time series. One way of going about this is with NumPy’s multivariate_normal() function, which takes the covariance matrix into account.

To sample from the multivariate normal distribution, you specify the means and covariance matrix, and you end up with multiple, correlated series of data that are each approximately normally distributed.

However, rather than covariance, correlation is a measure that is more familiar and intuitive to most. It’s the covariance normalized by the product of standard deviations, and so you can also define covariance in terms of correlation and standard deviation:

$$\operatorname{cov}(x, y) = \operatorname{corr}(x, y)\,\sigma_x\,\sigma_y$$

So, could you draw random samples from a multivariate normal distribution by specifying a correlation matrix and standard deviations? Yes, but you’ll need to get the above into matrix form first. Here, S is a vector of the standard deviations, P is their correlation matrix, and C is the resulting (square) covariance matrix:

$$C = \operatorname{diag}(S)\,P\,\operatorname{diag}(S)$$

This can be expressed in NumPy with np.diag() and matrix multiplication.
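A minimal sketch of that conversion, assuming NumPy is installed. The function name corr2cov and the example correlation and standard deviation values are illustrative choices.

```python
import numpy as np


def corr2cov(p, s):
    """Build a covariance matrix from correlation matrix `p` and stdevs `s`."""
    d = np.diag(s)      # place the standard deviations on the diagonal
    return d @ p @ d    # C = diag(S) P diag(S)


corr = np.array([[1.0, -0.40],
                 [-0.40, 1.0]])
stdev = np.array([6.0, 1.0])

cov = corr2cov(corr, stdev)
print(cov)
```

Each off-diagonal entry is the correlation multiplied by the two standard deviations, and each diagonal entry is a variance.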
Now, you can generate two time series that are correlated but still random:
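Here is a hedged sketch of the sampling step, assuming NumPy is installed. The means, standard deviations, correlation, seed, and sample size are all example values; the covariance is built with np.outer(), which is an elementwise equivalent of the matrix form above.

```python
import numpy as np

corr = np.array([[1.0, -0.40],
                 [-0.40, 1.0]])
stdev = np.array([6.0, 1.0])
mean = np.array([2.0, 0.5])

# Elementwise form of C = diag(S) P diag(S).
cov = np.outer(stdev, stdev) * corr

np.random.seed(0)
data = np.random.multivariate_normal(mean=mean, cov=cov, size=500)

# Sanity check: the sample statistics should roughly recover the inputs.
sample_corr = np.corrcoef(data, rowvar=False)[0, 1]
print(data.shape)
print(sample_corr)
print(data.mean(axis=0), data.std(axis=0))
```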
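You can think of the result as a set of paired observations, one column per series. As a sanity check, the sample means, standard deviations, and correlation computed from it should approximate the inputs.

Before we move on to CSPRNGs, it might be helpful to summarize some random functions and their numpy.random counterparts: random() corresponds to rand(), randrange() to randint(), uniform() to uniform(), choice() and choices() to choice(), sample() to choice() with replace=False, and shuffle() to shuffle().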
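Now that you’ve covered two fundamental options for PRNGs, let’s move onto a few more secure adaptations.

CSPRNGs in Python

os.urandom(): About as Random as It Gets

Python’s os.urandom() function is used by both secrets and uuid (both of which you’ll see here in a moment). Without getting into too much detail, os.urandom() returns random bytes from an operating-system-specific randomness source, and those bytes are suitable for cryptographic use. There is no concept of manually seeding it.

With os.urandom(), the only argument is the number of bytes to return.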
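A short sketch of calling os.urandom(). The byte count of 6 is arbitrary, and the output differs on every call.

```python
import os

x = os.urandom(6)   # 6 random bytes from the operating system

print(x)            # a bytes object, often full of \x escapes
print(type(x))
print(len(x))
print(list(x))      # the same data viewed as integers from 0 to 255
```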
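Before we go any further, this might be a good time to delve into a mini-lesson on character encoding. Many people, including myself, have some type of allergic reaction when they see bytes objects and a long line of \x characters. However, it’s useful to know how such sequences eventually get turned into strings or numbers.

os.urandom() returns a sequence of single bytes. But how does this eventually get turned into a Python str or sequence of numbers?

First, recall one of the fundamental concepts of computing, which is that a byte is made up of 8 bits. You can think of a bit as a single digit that is either 0 or 1. A byte effectively chooses between 0 and 1 eight times, so both 01101100 and 11110000 could represent bytes. You can list every such pattern by formatting each integer in range(256) as an 8-digit binary string, which is equivalent to calling bin() on each integer, dropping the 0b prefix, and padding with zeros.

Where does that leave us? Using range(256) is not a random choice. (No pun intended.) Given that we are allowed 8 bits, each with 2 choices, there are 2 ** 8 == 256 possible byte “combinations.”

This means that each byte maps to an integer between 0 and 255. In other words, we would need more than 8 bits to express the integer 256. You can verify this by checking that len(f'{256:0>8b}') is 9, not 8.

Okay, now let’s get back to the bytes data type, by constructing the sequence of bytes that corresponds to the integers 0 through 255, for example with bytes(range(256)).

If you call list() on that bytes object, you get back a Python list that runs from 0 to 255. But if you just print it, you get an ugly-looking sequence littered with backslashes.

These backslashes are escape sequences, and \xhh represents the byte with hex value hh. Bytes that correspond to printable ASCII characters (such as letters, digits, and punctuation) are displayed literally, while most others are displayed as escapes.

If you need a refresher on hexadecimal, Charles Petzold’s Code: The Hidden Language is a great place for that. Hex is a base-16 numbering system that, instead of using 0 through 9, uses 0 through 9 and a through f as its basic digits.

Finally, let’s get back to where you started, with the sequence of random bytes returned by os.urandom(). Calling .hex() on a bytes object gives a str of hexadecimal digits, with each pair of digits corresponding to a decimal number from 0 through 255.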
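The relationship between bits, bytes, integers, and hex can be checked with a few lines of standard-library Python. The three-byte example value is arbitrary and chosen only so that the output is easy to read.

```python
# Every possible byte as an 8-digit binary string.
bits = [f'{i:0>8b}' for i in range(256)]
print(bits[0], bits[1], bits[255])

# 256 does not fit in 8 bits: its binary form needs 9 digits.
print(len(f'{256:0>8b}'))

# A bytes object holding every value from 0 through 255.
all_bytes = bytes(range(256))
print(list(all_bytes) == list(range(256)))

# .hex() always uses two hex digits per byte, so 1 becomes '01'.
small = bytes([1, 255, 16])
print(small.hex())
print(len(small), len(small.hex()))
```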
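One last question: how is the hex string twice as long as the number of bytes? This is because two hexadecimal digits correspond precisely to a single byte. Even if the byte (such as \x01) does not need a full 8 bits to be represented, .hex() always uses two hex digits per byte, so the number 1 is represented as 01 rather than just 1.

With that under your belt, let’s touch on a recently introduced module, secrets, which makes generating secure tokens much more user-friendly.

Python’s Best Kept secrets

Introduced in Python 3.6 by one of the more colorful PEPs out there, the secrets module is intended for generating cryptographically secure random numbers, bytes, and strings.

You can check out the source code for the module, which is short and sweet at about 25 lines of code.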
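The module's main functions can be sketched as follows. The sizes passed in are arbitrary, and every call returns different output each time.

```python
import secrets

n = secrets.randbelow(10)              # integer in [0, 10)
letter = secrets.choice('rain')        # one element of a sequence

raw = secrets.token_bytes(8)           # 8 random bytes
as_hex = secrets.token_hex(16)         # 16 random bytes as 32 hex digits
url_safe = secrets.token_urlsafe(16)   # URL-safe text from 16 random bytes

print(n, letter)
print(raw)
print(as_hex)
print(url_safe)
```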
Now, how about a concrete example? You’ve probably used URL shortener services like tinyurl.com or bit.ly that turn an unwieldy URL into something like https://bit.ly/2IcCp9u. Most shorteners don’t do any complicated hashing from input to output; they just generate a random string, make sure that string has not already been generated previously, and then tie that back to the input URL. Let’s say that after taking a look at the Root Zone Database, you’ve registered the site short.ly. Here’s a function to get you started with your service:
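A minimal sketch of such a function follows. The names DATABASE and shorten, the default of 5 random bytes, and the example URL are assumptions for illustration; the in-memory dictionary stands in for real storage.

```python
from secrets import token_urlsafe

# Hypothetical in-memory store mapping short extensions to full URLs.
DATABASE = {}


def shorten(url, nbytes=5):
    """Return a short.ly link for `url`, backed by a unique random token."""
    ext = token_urlsafe(nbytes=nbytes)
    # Regenerate in the unlikely event that the token is already taken.
    while ext in DATABASE:
        ext = token_urlsafe(nbytes=nbytes)
    DATABASE[ext] = url
    return f'short.ly/{ext}'


link = shorten('https://realpython.com/')
print(link)
print(DATABASE)
```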
Is this a full-fledged real illustration? No. I would wager that bit.ly does things in a slightly more advanced way than storing its gold mine in a global Python dictionary that is not persistent between sessions. However, it’s roughly accurate conceptually.
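The bottom line here is that, while secrets is really just a wrapper around existing Python functions, it can be your go-to when security is your foremost concern.

One Last Candidate: uuid

One last option for generating a random token is the uuid4() function from Python’s uuid module. A UUID is a Universally Unique IDentifier, a 128-bit value that is usually written as 32 hexadecimal digits. uuid4() generates a random UUID.

The nice thing is that all of uuid’s functions produce an instance of the UUID class, which encapsulates the ID and has properties like .int, .bytes, and .hex.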
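A brief sketch of generating a random UUID and inspecting it with the standard library. The variable name tok is arbitrary, and the value differs on every run.

```python
import uuid

tok = uuid.uuid4()   # a random (version 4) UUID

print(tok)           # canonical form with hyphens
print(tok.hex)       # 32 hexadecimal digits
print(tok.int)       # the same value as a single integer
print(tok.bytes)     # the same value as 16 raw bytes
print(tok.version)   # 4
```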
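You may also have seen some other variations: uuid1(), uuid3(), and uuid5(). The key difference between these and uuid4() is that they derive the ID from some form of input (the host ID and current time for uuid1(), or a namespace and a name for uuid3() and uuid5()), so they don’t meet the definition of “random” to the extent that a Version 4 UUID does.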
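Hopefully, by now you have a good idea of the distinction between different “types” of random data and how to create them. However, one other issue that might come to mind is that of collisions.

In this case, a collision would simply refer to generating two matching UUIDs. What is the chance of that? Well, it is technically not zero, but perhaps it is close enough: of the 128 bits in a version 4 UUID, 6 are fixed to mark the version and variant, so there are 2 ** 122 (roughly 5.3 undecillion) possible uuid4 values.

One common use of uuid is in Django, which has a UUIDField that is often used as a primary key in a model’s underlying relational database.

Why Not Just “Default to” SystemRandom?

In addition to the secure modules discussed here such as secrets, Python’s random module has a little-used class called SystemRandom that uses os.urandom().

At this point, you might be asking yourself why you wouldn’t just “default to” this version? Why not “always be safe” rather than defaulting to the deterministic random functions that aren’t cryptographically secure?

I’ve already mentioned one reason: sometimes you want your data to be deterministic and reproducible for others to follow along with.

But the second reason is that CSPRNGs, at least in Python, tend to be meaningfully slower than PRNGs. Let’s test that with a short script that compares the PRNG and CSPRNG versions of randint() using Python’s timeit.repeat(), executed from the shell.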
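A timing comparison along those lines might look like the sketch below. The function names, the randint() bounds, and the repetition counts are assumptions, and the measured ratio depends on your machine and Python version.

```python
import random
import timeit

# SystemRandom draws from os.urandom() instead of the Mersenne Twister.
_sysrand = random.SystemRandom()


def prng():
    random.randint(0, 95)


def csprng():
    _sysrand.randint(0, 95)


def best_time(func, number=10_000, repeat=3):
    """Return the fastest total time, in seconds, across `repeat` runs."""
    return min(timeit.repeat(func, number=number, repeat=repeat))


prng_time = best_time(prng)
csprng_time = best_time(csprng)

print(f'PRNG:   {prng_time:.4f} s')
print(f'CSPRNG: {csprng_time:.4f} s')
print(f'ratio:  {csprng_time / prng_time:.1f}x')
```

Taking the minimum over several repeats reduces the influence of other processes running at the same time.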
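A 5x timing difference is certainly a valid consideration in addition to cryptographic security when choosing between the two.

Odds and Ends: Hashing

One concept that hasn’t received much attention in this tutorial is that of hashing, which can be done with Python’s hashlib module.

A hash is designed to be a one-way mapping from an input value to a fixed-size string that is virtually impossible to reverse engineer. As such, while the result of a hash function may “look like” random data, it doesn’t really qualify under the definition here.

Recap

You’ve covered a lot of ground in this tutorial. To recap, here is a high-level comparison of the options available to you for engineering randomness in Python:

random: fast and easy random data; not cryptographically secure
numpy.random: like random, but for NumPy arrays; not cryptographically secure
os: contains urandom(), the base of the other secure functions covered here; cryptographically secure
secrets: designed for generating secure random numbers, bytes, and strings; cryptographically secure
uuid: functions for building 128-bit identifiers; uuid4() draws on os.urandom()

Feel free to leave some totally random comments below, and thanks for reading.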
How do you generate random data?
In Excel, the RANDBETWEEN function generates a set of random integers between two specified numbers. Select the cell in which you want the random numbers, enter =RANDBETWEEN(1,100) in the active cell, then hold the Control key and press Enter.

How do you create a dataset in Python?
You can create a pandas DataFrame either by typing the values in Python itself, or by importing the values from a file (such as a CSV file) and building the DataFrame from the imported values.

Is it possible to generate random numbers using Python?
Yes. Python’s random module defines a set of functions for generating and manipulating random numbers. They rely on the pseudorandom function random(), which generates a random float greater than or equal to 0.0 and less than 1.0.

What is random() in Python?
The random module is a built-in Python module for generating random numbers. These numbers are pseudorandom, meaning that they are not truly random. The module can be used to perform random actions such as generating random numbers or picking a random value from a list or string.