How do you split a file into chunks in Python?
Here is a Python script you can use for splitting large files:
You can call it externally:
You can also import it. The issue with this approach is high memory usage. Here is another pure-Python way of doing this; I haven't tested it on huge files, and it is going to be slower but leaner on memory:
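As a minimal sketch of the memory-lean idea (not the original answer's code), the generator below reads a file in fixed-size blocks so that only one block is held in memory at a time. The name `read_in_chunks` and the 64 KiB default are assumptions made for illustration.

```python
from functools import partial


def read_in_chunks(path, chunk_size=64 * 1024):
    """Yield successive blocks of at most chunk_size bytes from path."""
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size)
        # until it returns b"", which signals end of file.
        for chunk in iter(partial(f.read, chunk_size), b""):
            yield chunk
```

A caller can loop over `read_in_chunks(some_path)` and hand each block to whatever function expects chunked input. The file stays open only while the generator is being consumed.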
Here is another example using
The readlines example demonstrates how to chunk your data so that you can pass the chunks to a function that expects them. Unfortunately, readlines reads the whole file into memory, so the reader example is the better choice for performance. If what you need fits easily into memory and you only need to process it in chunks, though, readlines should suffice.

Splitting and Joining Files

Like most kids, mine spend a lot of time on the Internet. As far as I can tell, it’s the thing to do these days. Among this latest generation, computer geeks and gurus seem to be held in the same sort of esteem that rock stars once were by mine. When kids disappear into their rooms, chances are good that they are hacking on computers, not mastering guitar riffs. It’s probably healthier than some of the diversions of my own misspent youth, but that’s a topic for another kind of book.

But if you have teenage kids and computers, or know someone who does, you probably know that it’s not a bad idea to keep tabs on what those kids do on the Web. Type your favorite four-letter word in almost any web search engine and you’ll understand the concern; it’s much better stuff than I could get during my teenage career. To sidestep the issue, only a few of the machines in my house have Internet feeds.

Now, while they’re on one of these machines, my kids download lots of games. To avoid infecting our Very Important Computers with viruses from public-domain games, though, my kids usually have to download games on a computer with an Internet feed, and transfer them to their own computers to install. The problem is that game files are not small; they are usually much too big to fit on a floppy (and burning a CD takes away valuable game playing time).

If all the machines in my house ran Linux, this would be a nonissue. There are standard command-line programs on Unix for chopping a file into pieces small enough to fit on a floppy (split), and others for putting the pieces back together to recreate the original file (cat).
Because we have all sorts of different machines in the house, though, we needed a more portable solution.

Splitting Files Portably

Since all the computers in my house run Python, a simple portable Python script came to the rescue. The Python program in Example 4-1 distributes a single file’s contents among a set of part files, and stores those part files in a directory.

Example 4-1. PP2E\System\Filetools\split.py

    #!/usr/bin/python
    #########################################################
    # split a file into a set of portions; join.py puts them
    # back together; this is a customizable version of the
    # standard unix split command-line utility; because it
    # is written in Python, it also works on Windows and can
    # be easily tweaked; because it exports a function, it
    # can also be imported and reused in other applications;
    #########################################################

    import sys, os
    kilobytes = 1024
    megabytes = kilobytes * 1000
    chunksize = int(1.4 * megabytes)                   # default: roughly a floppy

    def split(fromfile, todir, chunksize=chunksize):
        if not os.path.exists(todir):                  # caller handles errors
            os.mkdir(todir)                            # make dir, read/write parts
        else:
            for fname in os.listdir(todir):            # delete any existing files
                os.remove(os.path.join(todir, fname))
        partnum = 0
        input = open(fromfile, 'rb')                   # use binary mode on Windows
        while 1:                                       # eof=empty string from read
            chunk = input.read(chunksize)              # get next part <= chunksize
            if not chunk: break
            partnum = partnum+1
            filename = os.path.join(todir, ('part%04d' % partnum))
            fileobj = open(filename, 'wb')
            fileobj.write(chunk)
            fileobj.close()                            # or simply open().write()
        input.close()
        assert partnum <= 9999                         # join sort fails if 5 digits
        return partnum

    if __name__ == '__main__':
        if len(sys.argv) == 2 and sys.argv[1] == '-help':
            print 'Use: split.py [file-to-split target-dir [chunksize]]'
        else:
            if len(sys.argv) < 3:
                interactive = 1
                fromfile = raw_input('File to be split? ')       # input if clicked
                todir = raw_input('Directory to store part files? ')
            else:
                interactive = 0
                fromfile, todir = sys.argv[1:3]                  # args in cmdline
                if len(sys.argv) == 4: chunksize = int(sys.argv[3])
            absfrom, absto = map(os.path.abspath, [fromfile, todir])
            print 'Splitting', absfrom, 'to', absto, 'by', chunksize

            try:
                parts = split(fromfile, todir, chunksize)
            except:
                print 'Error during split:'
                print sys.exc_type, sys.exc_value
            else:
                print 'Split finished:', parts, 'parts are in', absto
            if interactive: raw_input('Press Enter key')         # pause if clicked

By default, this script splits the input file into chunks that are roughly the size of a floppy disk, which is perfect for moving big files between electronically isolated machines. Most important, because this is all portable Python code, this script will run on just about any machine, even ones without a file splitter of their own. All it requires is an installed Python. Here it is at work splitting the Python 1.5.2 self-installer executable on Windows:

    C:\temp>

Each of these four generated part files represents one binary chunk of the original file.

Operation modes

This script is designed to input its parameters in either interactive or command-line modes; it checks the number of command-line arguments to know in which mode it is being used. In command-line mode, you list the file to be split and the output directory on the command line, and can optionally override the default part file size with a third command-line argument. In interactive mode, the script asks for a filename and output directory at the console window.

Binary file access

This code is careful to open both input and output files in binary mode ('rb' and 'wb').

Manually closing files

This script also goes out of its way to manually close its files.
For instance:

    fileobj = open(partname, 'wb')
    fileobj.write(chunk)
    fileobj.close()

As we also saw in Chapter 2, these three lines can usually be replaced with this single line:

    open(partname, 'wb').write(chunk)

This shorter form relies on the fact that the current Python implementation automatically closes files for you when file objects are reclaimed (i.e., when they are garbage collected, because there are no more references to the file object). In this line, the file object would be reclaimed immediately, because nothing keeps a reference to it once the expression finishes.

As I was writing this chapter, though, there was some possibility that this automatic-close behavior may go away in the future. Moreover, the JPython Java-based Python implementation does not reclaim unreferenced objects as immediately as the standard Python. If you care about the Java port (or one possible future), if your script may create many files in a short amount of time, and if it may run on a machine that limits the number of open files per program, then close manually.

Joining Files Portably

Back to moving big files around the house. After downloading a big game program file, my kids generally run the previous splitter script by clicking on its name in Windows Explorer and typing filenames. After a split, they simply copy each part file onto its own floppy, walk the floppies upstairs, and recreate the split output directory on their target computer by copying files off the floppies. Finally, the script in Example 4-2 is clicked or otherwise run to put the parts back together.

Example 4-2. PP2E\System\Filetools\join.py

    #!/usr/bin/python
    ##########################################################
    # join all part files in a dir created by split.py.
    # This is roughly like a 'cat fromdir/* > tofile' command
    # on unix, but is a bit more portable and configurable,
    # and exports the join operation as a reusable function.
    # Relies on sort order of file names: must be same length.
    # Could extend split/join to popup Tkinter file selectors.
    ##########################################################

    import os, sys
    readsize = 1024

    def join(fromdir, tofile):
        output = open(tofile, 'wb')
        parts = os.listdir(fromdir)
        parts.sort()
        for filename in parts:
            filepath = os.path.join(fromdir, filename)
            fileobj = open(filepath, 'rb')
            while 1:
                filebytes = fileobj.read(readsize)
                if not filebytes: break
                output.write(filebytes)
            fileobj.close()
        output.close()

    if __name__ == '__main__':
        if len(sys.argv) == 2 and sys.argv[1] == '-help':
            print 'Use: join.py [from-dir-name to-file-name]'
        else:
            if len(sys.argv) != 3:
                interactive = 1
                fromdir = raw_input('Directory containing part files? ')
                tofile = raw_input('Name of file to be recreated? ')
            else:
                interactive = 0
                fromdir, tofile = sys.argv[1:]
            absfrom, absto = map(os.path.abspath, [fromdir, tofile])
            print 'Joining', absfrom, 'to make', absto

            try:
                join(fromdir, tofile)
            except:
                print 'Error joining files:'
                print sys.exc_type, sys.exc_value
            else:
                print 'Join complete: see', absto
            if interactive: raw_input('Press Enter key')         # pause if clicked

The join script simply uses os.listdir to collect the part files in a directory, sorts the list of filenames, and writes each part to the output file in turn. Some of this process is still manual, of course (I haven’t quite figured out how to script the “walk the floppies upstairs” bit yet).

Reading by blocks or files

Before we move on, there are a couple of details worth underscoring in the join script’s code. First of all, notice that this script deals with files in binary mode, but also reads each part file in blocks of 1K bytes each. In fact, the inner read loop could be replaced by code that reads each part file all at once:

    filebytes = open(filepath, 'rb').read()
    output.write(filebytes)

The downside to this scheme is that it really does load all of a file into memory at once. For example, reading a 1.4M part file into memory all at once with the file object’s read method creates a 1.4M string in memory.

Sorting filenames

If you study this script’s code closely, you may also notice that the join scheme it uses relies completely on the sort order of filenames in the parts directory. Because it simply calls the list sort method on the filenames returned by os.listdir, it requires that all part filenames have the same length and format. The splitter guarantees this by zero-padding part numbers with the 'part%04d' format, much like the names in this list:

    >>> list = ['xx008', 'xx010', 'xx006', 'xx009', 'xx011', 'xx111']
    >>> list.sort()
    >>> list
    ['xx006', 'xx008', 'xx009', 'xx010', 'xx011', 'xx111']

When sorted, the leading zero characters in small numbers guarantee that part files are ordered for joining correctly. Without the leading zeroes, names are compared character by character, so the first digit dominates:

    >>> list = ['xx8', 'xx10', 'xx6', 'xx9', 'xx11', 'xx111']
    >>> list.sort()
    >>> list
    ['xx10', 'xx11', 'xx111', 'xx6', 'xx8', 'xx9']

Because the list sort method accepts a comparison function as an optional argument, we could instead compare the numeric parts of the names:

    >>> list = ['xx8', 'xx10', 'xx6', 'xx9', 'xx11', 'xx111']
    >>> list.sort(lambda x, y: cmp(int(x[2:]), int(y[2:])))
    >>> list
    ['xx6', 'xx8', 'xx9', 'xx10', 'xx11', 'xx111']

But that still implies that filenames all must start with the same length substring, so this doesn’t quite remove the file naming dependency between the split and join scripts.

Usage Variations

Let’s run a few more experiments with these Python system utilities to demonstrate other usage modes. When run without full command-line arguments, both scripts prompt for their parameters interactively:

    C:\temp>

When these program files are double-clicked in a file explorer GUI, they work the same way (there usually are no command-line arguments when launched this way). In this mode, absolute path displays help clarify where files really are. Remember, the current working directory is the script’s home directory when clicked like this.

    [in a popup DOS console box when split is clicked]
    File to be split?
Because these scripts package their core logic up in functions, though, it’s just as easy to reuse their code by importing and calling from another Python component:

    C:\temp>

A word about performance: split can take longer to finish, but only if the part file’s size is set small enough to generate thousands of part files. Splitting into 1006 parts works, but runs slower (on my computer this split and join take about five and two seconds, respectively, depending on what other programs are open):

    C:\temp>

Finally, the splitter is also smart enough to create the output directory if it doesn’t yet exist, or clear out any old files there if it does exist. Because the joiner combines whatever files exist in the output directory, this is a nice ergonomic touch: if the output directory was not cleared before each split, it would be too easy to forget that a prior run’s files are still there. Given that my kids are running these scripts, they need to be as forgiving as possible; your user base may vary, but probably not by much.

    C:\temp>

From Programming Python, Second Edition (O’Reilly).

How do I split a file into chunks?

To split a file into pieces, you simply use the Unix split command. By default, the split command uses a very simple naming scheme. The file chunks will be named xaa, xab, xac, etc., and, presumably, if you break up a file that is sufficiently large, you might even get chunks named xza and xzz.
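The naming scheme described above can be imitated in Python 3. The following is a minimal sketch, not the split command's own implementation and not the book's split.py: the function names `suffixes`, `split_file`, and `join_files` are hypothetical, and the choice to raise an error when the two-letter suffixes run out is an assumption of this sketch.

```python
import itertools
import os
import shutil
import string


def suffixes(width=2):
    """Yield 'aa', 'ab', ..., 'zz' in alphabetical order."""
    for letters in itertools.product(string.ascii_lowercase, repeat=width):
        yield "".join(letters)


def split_file(path, out_dir, chunk_size, prefix="x"):
    """Write path's bytes into out_dir as xaa, xab, ...; return the part names."""
    os.makedirs(out_dir, exist_ok=True)
    names = []
    with open(path, "rb") as src:
        for suffix in suffixes():
            chunk = src.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            name = prefix + suffix
            with open(os.path.join(out_dir, name), "wb") as part:
                part.write(chunk)
            names.append(name)
        else:
            # All suffixes were used; fail if unread data remains (assumption).
            if src.read(1):
                raise ValueError("too many parts for two-letter suffixes")
    return names


def join_files(in_dir, out_path):
    """Concatenate every file in in_dir, in sorted name order, into out_path."""
    with open(out_path, "wb") as dst:
        for name in sorted(os.listdir(in_dir)):
            with open(os.path.join(in_dir, name), "rb") as part:
                shutil.copyfileobj(part, dst)
```

Because every suffix has the same width, a plain sort puts the parts back in the right order, which is the same property the zero-padded part names rely on in the scripts above. Unlike split.py, this sketch does not clear old files from the output directory, and `join_files` expects `out_path` to live outside `in_dir`.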
How do I split a large file into multiple smaller pieces?

Open the Zip file. Open the Tools tab. Click the Split Size dropdown button and select the appropriate size for each of the parts of the split Zip file. If you choose Custom Size in the Split Size dropdown list, another small window will open and allow you to enter a custom size specified in megabytes.