Saturday, May 25, 2013

Use awk to Extract a Column from a Text File


Here is an example of how to use awk to extract a column from a text file.

The default field separator is a space, so this will work with some text files:

   cat somefile.txt | awk '{print $2}'

But, if the fields are separated by a delimiter, such as tab, or comma, just specify the field separator by using "-F" before the print statement

   cat somefile.txt | awk -F"\t" '{print $2}'

 Unix and linux distributions generally come with sed and awk, among other things.

The problem.

You want to get a list of companies to research before investing in them, and paste only the stock symbols into a Yahoo finance portfolio.   You run a stock screen on a site such as  magicformulainvesting.com, or investors.com but if you try to paste the text into a spreadsheet it may show up as a single row, with no way to extract the column.

Here is the process

1.)  Run a Stock Screen

Get output similar to this:

Company Name (in alphabetical order) Ticker Market Cap ($ Millions) Price From Most Recent Quarter Data
Almost Family Inc     AFAM     201.49     09/24     06/30
Apollo Group Inc     APOL     3,317.35     09/24     05/31
Argan Inc     AGX     231.55     09/24     07/31
Block (H&R) Inc.     HRB     4,635.98     09/24     07/31
Body Central Corp     BODY     179.63     09/24     06/30
CACI International Inc.     CACI     1,333.66     09/24     06/30
Capella Education Co     CPLA     478.40     09/24     06/30
CF Industries Holdings Inc     CF     13,579.97     09/24     06/30
Cisco Systems Inc     CSCO     99,628.89     09/24     07/31
Dell Inc     DELL     17,586.96     09/24     07/31
Deluxe Corp     DLX     1,567.52     09/24     06/30
Dice Holdings Inc     DHX     522.27     09/24     06/30
Dolby Laboratories Inc     DLB     3,681.33     09/24     06/30
Express Inc     EXPR     1,313.50     09/24     07/31
GameStop Corp.     GME     2,755.19     09/24     07/31
Great Northern Iron Ore Properties     GNI     129.44     09/24     06/30
GT Advanced Technologies Inc     GTAT     703.72     09/24     06/30
Iconix Brand Group Inc     ICON     1,269.35     09/24     06/30
Intersections Inc     INTX     188.20     09/24     06/30
ITT Educational Services Inc     ESI     839.48     09/24     06/30
j2 Global Inc     JCOM     1,435.93     09/24     06/30
KLA-Tencor Corp     KLAC     7,857.04     09/24     06/30
Kulicke and Soffa Industries Inc     KLIC     763.82     09/24     06/30
Lender Processing Services Inc     LPS     2,435.63     09/24     06/30
LML Payment Systems Inc     LMLP     96.04     09/24     06/30
ManTech International Corp     MANT     900.87     09/24     06/30
Maxygen Inc     MAXY     73.00     09/24     06/30
Metabolix Inc     MBLX     54.72     09/24     06/30
Microsoft Corp     MSFT     257,967.20     09/24     06/30
Momenta Pharmaceuticals Inc     MNTA     754.55     09/24     06/30
Nature's Sunshine Products Inc     NATR     254.74     09/24     06/30
Nevsun Resources Ltd     NSU     913.88     09/24     06/30
Nu Skin Enterprises Inc.     NUS     2,263.52     09/24     06/30
PDL BioPharma Inc     PDLI     1,086.98     09/24     06/30
PetMed Express Inc     PETS     201.72     09/24     06/30
Pozen Inc     POZN     194.06     09/24     06/30
Questcor Pharmaceuticals Inc.     QCOR     1,138.54     09/24     06/30
Raytheon Co.     RTN     19,209.52     09/24     06/30
Seagate Technology Plc     STX     12,175.83     09/24     06/30
Spirit Airlines Inc     SAVE     1,272.78     09/24     06/30
Strayer Education Inc     STRA     792.35     09/24     06/30
TeleNav Inc     TNAV     250.19     09/24     06/30
Ubiquiti Networks Inc     UBNT     1,103.50     09/24     06/30
Unisys Corp     UIS     939.02     09/24     06/30
USA Mobility Inc     USMO     263.79     09/24     06/30
Utstarcom Holdings Corp     UTSI     159.41     09/24     06/30
Vaalco Energy Inc     EGY     505.50     09/24     06/30
Veeco Instruments Inc     VECO     1,229.09     09/24     06/30
Vonage Holdings Corp     VG     514.25     09/24     06/30
Warner Chilcott Plc     WCRX     3,334.05     09/24     06/30

2.)  Copy and paste to a text file


    vi magic_20120925_50M_50.txt

3.)  Extract the column

 Use awk with a field separator of "\t" for tab, and print the second column:

    cat magic_20120925_50M_50.txt | awk -F"\t" '{print $2}'

AFAM
APOL
AGX
HRB
BODY
CACI
CPLA
CF
CSCO
DELL
DLX
DHX
DLB
EXPR
GME
GNI
GTAT
ICON
INTX
ESI
JCOM
KLAC
KLIC
LPS
LMLP
MANT
MAXY
MBLX
MSFT
MNTA
NATR
NSU
NUS
PDLI
PETS
POZN
QCOR
RTN
STX
SAVE
STRA
TNAV
UBNT
UIS
USMO
UTSI
EGY
VECO
VG
WCRX

4.)  Then just copy the text, and paste it into your portfolio.


That's all there is to it. 

For more advanced processing, you may have to use printf, which follows the standard C format.

Disclaimer: We do not provide investing advice or recommend stocks.  We may or may not have positions in some of these stocks.  To understand what to do with the information from the stock screener, it is best to read the book listed on the site.

More awk Examples

We will use a one-liner using echo, so the incoming data and the output are both visible.

Here it is in action on a comma separated list:

  $ echo "Almost Family Inc,     AFAM,     201.49,     09/24,     06/30" | awk -F, '{print $2}'
       AFAM
 

Where have we seen comma separated lists before?  Oh yeah, spreadsheets can be saved as .csv files.
 
Here we can print any number of fields:

$ echo "Almost Family Inc,     AFAM,     201.49,     09/24,     06/30" | awk -F, '{print $2,$3}'
     AFAM      201.49


Now suppose we want columns one and two, but in reverse order?  We try this:

$ echo "Almost Family Inc,     AFAM,     201.49,     09/24,     06/30" | awk -F, '{print $2 $1}'
     AFAMAlmost Family Inc


Not quite what we want.  Or should we say Almost..., no pun intended?  What we really want is column two, followed by a space, and then column one.  Separating the fields by a comma may give us what we want.

$ echo "Almost Family Inc,AFAM,201.49,09/24,06/30" | awk -F, '{print $2,$1}'
AFAM Almost Family Inc


Much better, but if we want exact control of the output, we could use printf instead of print.

$ echo "Almost Family Inc,     AFAM,     201.49,     09/24,     06/30" | awk -F, '{printf("%s %s\n",$2,$1)}'
     AFAM Almost Family Inc


Notice how printf follows the C or perl syntax, and requires a newline character "\n", unless you want all the output on the same line.  We use "%s" for string, and anything between the quotes including spaces, and special characters such as tab \t, and newline \n will give us the format we want.  Then after the close quote we give it a comma separated list of variables.  The number of variables must match the format in the quotes.  In the example above we want to print two variables with a space in between, and a newline at the end.  The first part of printf is for the format, and the second is for the variables.