File globbing (wildcard expansion)

Glob, yet another *-ix jargon word, refers to wildcard expansion on the command line. Its purpose is to reduce the amount of typing needed to specify the files to work on, although it can be confusing at first.

Patterns with wildcards

The following examples will use the asterisk (*) as the wildcard, since it is the most common.

The asterisk (*) can produce a match of any length.

A pattern with a wildcard can take four forms:

  • wildcard at the beginning of a filename
  • wildcard at the end of a filename
  • wildcard in the middle of a filename
  • only a wildcard.

Any combination of the above is also valid.

The most useful way to test wildcards is with the ls command, since file globbing is about files.

The last form matches anything, so it lists all the files in your directory (by default, hidden files whose names begin with a dot are not matched).

$ ls *

In a directory with files a.txt and b.txt, the above is equivalent to typing the matches on the command line:

$ ls a.txt b.txt
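
It can help to see that the shell, not ls, performs the expansion: ls only ever receives the resulting filenames. The following is a minimal sketch, assuming a scratch directory made with mktemp (the directory and the variable names are illustrative, not part of the original example). It uses echo in place of ls to print what the pattern expands to.

```shell
# Work in a throwaway directory so no real files are touched
# (assumption: mktemp is available on your system).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.txt b.txt

# The shell replaces * with the sorted list of matching names before
# the command runs, so echo receives the same arguments ls would.
expanded=$(echo *)
echo "$expanded"
```

With only these two files present, echo prints a.txt b.txt, the same argument list that ls receives.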

Wildcard at the beginning of a filename

You've seen this form before, when looking for files with a given extension, say .txt.

This example produces the same result as the previous one, since both files end with the .txt extension.

$ ls *.txt

Is equivalent to:

$ ls a.txt b.txt

…when the directory contains only files that end with .txt. Files with a different extension would not be matched.

For example, suppose there exists a third file, c.gff.

$ ls *

Equivalent to:

$ ls a.txt b.txt c.gff

Whereas,

$ ls *.txt

Is still equivalent to:

$ ls a.txt b.txt
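
The contrast between the two patterns can also be checked by counting matches. This is a sketch under the same assumptions as the three-file example above, run in a scratch directory; the variable names are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.txt b.txt c.gff

# set -- stores the expansion as positional parameters;
# $# then holds the number of names that matched.
set -- *
all_count=$#
set -- *.txt
txt_count=$#

echo "all: $all_count, .txt only: $txt_count"
```

Here the bare * matches three files while *.txt matches two, because c.gff does not end in .txt.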

Wildcard at the end of a filename

By convention, UNIX/Linux filenames have the form prefix.suffix, or equivalently prefix.extension.

Suppose we add a fourth file, c.txt. We might want to see all of the files that have the same prefix but different extensions (suffixes).

$ ls c.*

resolves to

$ ls c.gff c.txt

Wildcard in the middle of a filename

The wildcard character can be placed anywhere inside a filename to match names whose surrounding parts are constant.

Suppose we have multiple file types in a pipeline that use the same prefix. For example, say there are three genes, which we have named gene1, gene2, and gene3. Suppose our pipeline produces two files at step 1: a text file (.txt) and a GFF file (.gff). We have labelled these files with a middle file part, the step number.

gene1.step1.txt
gene1.step1.gff
gene2.step1.txt
gene2.step1.gff
gene3.step1.txt
gene3.step1.gff

Suppose step 2 produces a new GFF (.gff) file from the step 1 GFF file, and a new text (.txt) file. At the end of step 2 we have:

gene1.step1.gff
gene1.step1.txt
gene1.step2.gff
gene1.step2.txt
gene2.step1.gff
gene2.step1.txt
gene2.step2.gff
gene2.step2.txt
gene3.step1.gff
gene3.step1.txt
gene3.step2.gff
gene3.step2.txt
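
To try the patterns in this section yourself, the listing above can be recreated with empty files. This is only a sketch: the scratch directory and the loop variable names are assumptions, and a real pipeline would of course write real content.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One empty file per gene, step, and extension: 3 x 2 x 2 = 12 files.
for gene in gene1 gene2 gene3; do
    for step in step1 step2; do
        for ext in gff txt; do
            touch "$gene.$step.$ext"
        done
    done
done

# Count the files by letting * expand into the positional parameters.
set -- *
file_count=$#
echo "$file_count"
```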

The middle wildcard can provide some useful subsets for us:

List all text (.txt) files for the gene1 steps:

$ ls gene1.*.txt

Produces:

gene1.step1.txt
gene1.step2.txt

List all the step 1 (.step1), text (.txt) files for all genes:

$ ls gene*.step1.txt

Produces:

gene1.step1.txt
gene2.step1.txt
gene3.step1.txt

Combinations

Suppose we want all filetypes for step 1:

$ ls *.step1.*

Produces:

gene1.step1.gff
gene1.step1.txt
gene2.step1.gff
gene2.step1.txt
gene3.step1.gff
gene3.step1.txt

Multiple patterns

It's equally common to specify a number of patterns to match. Suppose we want to do the same as above for genes 1 and 3, but not 2.

$ ls gene1.step1.* gene3.step1.*

Produces:

gene1.step1.gff
gene1.step1.txt
gene3.step1.gff
gene3.step1.txt
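
Each pattern on the command line is expanded on its own, left to right, and the results are joined into one argument list. ls then sorts the names it is given, which hides that order. The sketch below (a scratch directory holding only the step 1 files, with an illustrative variable name) uses echo, which does not sort, and deliberately puts gene3 first.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch gene1.step1.gff gene1.step1.txt gene2.step1.gff gene2.step1.txt \
      gene3.step1.gff gene3.step1.txt

# echo keeps the arguments in the order the patterns were written;
# ls would sort the same names alphabetically.
expanded=$(echo gene3.step1.* gene1.step1.*)
echo "$expanded"
```

The gene3 files are printed before the gene1 files, and no gene2 file appears.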

Less-used wildcards and patterns

In practice, most people use only the asterisk (*), but there are other constructs. The example above lists multiple patterns; the same result can be obtained from a single pattern with a different construct:

Match any one character in a set [abc]

$ ls gene[13].step1.*

This is not the number thirteen (13), but a construct to mean “match 1 or 3”. It produces:

gene1.step1.gff
gene1.step1.txt
gene3.step1.gff
gene3.step1.txt

Match any one character in a range [a-z]

As above, the range can consist of digits instead of letters.

$ ls gene[1-2].step*.gff gene[2-3].step1.txt

Produces:

gene1.step1.gff
gene1.step2.gff
gene2.step1.gff
gene2.step1.txt
gene2.step2.gff
gene3.step1.txt

Since our ranges only span two entries, they could have been specified as sets [12] and [23].
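
The equivalence of a two-entry range and the corresponding set can be checked directly. This is a small sketch in a scratch directory with three of the GFF files; the variable names are illustrative.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch gene1.step1.gff gene2.step1.gff gene3.step1.gff

# A range [1-2] and a set [12] describe the same two characters,
# so both patterns expand to the same filenames.
by_range=$(echo gene[1-2].step1.gff)
by_set=$(echo gene[12].step1.gff)

echo "$by_range"
echo "$by_set"
```

Both lines print gene1.step1.gff gene2.step1.gff, and gene3.step1.gff is left out in each case.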

Match any single character with '?'

So far, we've used the asterisk (*) to match single-digit numbers and file extensions. If you need to restrict the length of your match, you can use the question mark (?), which matches exactly one character.

$ ls gene?.*

This matches every filename that has exactly one character, here the single-digit gene number, between gene and the following dot.
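
The difference from the asterisk only shows up once a longer gene number exists. The sketch below assumes a hypothetical tenth gene, gene10, which is not part of the example pipeline, and counts the matches for each pattern.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# gene10 is a made-up file added only to show the difference.
touch gene1.step1.txt gene2.step1.txt gene10.step1.txt

# ? must match exactly one character, so gene10 is excluded.
set -- gene?.*
question_count=$#

# * matches any length, so gene10 is included.
set -- gene*.*
star_count=$#

echo "gene?.* matched $question_count, gene*.* matched $star_count"
```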

More examples

Match files with exactly 3-character file extensions.

$ ls *.???

In our example, this lists all the .txt and .gff files.
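
To see what the *.??? pattern leaves out, the sketch below adds two hypothetical files with shorter and longer extensions, notes.md and data.json. Neither is part of the original example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# notes.md and data.json are made-up names with 2- and 4-character extensions.
touch a.txt c.gff notes.md data.json

# Each ? stands for exactly one character, so only names ending in
# a dot followed by three characters are matched.
three_char=$(echo *.???)
echo "$three_char"
```

Only a.txt and c.gff are printed.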

Match files with a prefix of exactly one character.

$ ls ?.*

Since I still have a.txt, b.txt, and c.gff in my directory, they are matched.

wiki/2018fileglobbing.txt · Last modified: 2018/08/29 08:13 by david