The Tao of Programming: perl

Showing posts with label perl. Show all posts

Thursday, 1 April 2010

Songs list from files list

I had a list of mp3 files like:

ANBE_VAA_ARUGILE.MP3
ANBU_THAYAE.MP3
ANDANVAN_THEANAKTRU.MP3
ANNANENNA.MP3

and wanted the list of song names from this… What shall we do?

* Open Cygwin (Unix emulator within Windows), navigate to that folder, do:
      ls | cut -d. -f1 | sed –e 's/_/ /g'
   list the files in this directory, cut and extract the 1st field with delimiter as . character, then substitute space character in place of _ character globally. In other words, get the actual filename removing the .mp3 extension, then replace underscores with spaces.

* Save the above output to a file
     Actually this too should be a part of the above step, like:
      ls | cut -d. -f1 | sed –e 's/_/ /g' > ./songnames.txt
     The file songnames.txt now has:

ANBE VAA ARUGILE
ANBU THAYAE
ANDANVAN THEANAKTRU
ANNANENNA

* Open the file in Vim editor, and select all the text by pressing ggVG. Then, press : to enter command mode, and Vim will show :’<,’> and wait for your input. Type this:
    !perl –lpe "@words = map { ucfirst(lc($_)) } split; $_ = join(\" \", @words)"
and press enter, and voila, the list has become:

Anbe Vaa Arugile
Anbu Thayae
Andanvan Theanaktru
Annanenna

Beautiful, isn't it?
Thank God, Unix exists! :)

Sunday, 4 May 2008

Largest prime factor of a number

Long live the programming overlords! Today is the day I have begun serious programming again... 'serious' is perhaps a bit too much - programming is never actually serious, it's a sort of fun activity that happens to appear boring and difficult and hence obtains good paychecks from the HR types... ;)

Anyway, I happened to come across the Project Euler and registered in the site. The registration was painless and I fell in love with the site immediately. Problems 1 and 2 were trivial (finding the sum of the multiples of 3 or 5 under 1000, finding the sum of even numbers in the Fibonacci series which are less than 4 million).

Problem 3 was a little more involved. I had to find the largest prime factor of 600851475143. It isn't particularly difficult, but it does call for some thinking. The first method I thought of was to

divide the number by successive integers upto sqrt(n), and
when the remainder is 0, check whether the divisor is prime,
if so, store it as answer overwriting the previous value, if any.

An obvious improvement was to come down from sqrt(n) so that the first number to satisfy conditions 2 and 3 would be the answer.
However, by this time, something seemed fishy in this method. Particularly, it occurred to me that a prime number checking was costly and unnecessary here. Instead, you can go about it using a recursive solution:

FindLargestPrimeFactor(n):

divide the number by successive integers (each denoted by i) upto sqrt(n),
when the remainder is 0, return the maximum among i and FindLargestPrimeFactor(n/i).

It might not be clear at first sight that the algorithm works correctly. There seems to be no check for primality. However, this algorithm exploits the fact that the smallest factor sf of a number n will necessarily be prime. This is because, if it were not prime, its factors would be factors of n and hence sf would not be the smallest factor. Thus, we need only find the smallest factor to have a prime factor. Then, we can divide away this factor and again find the smallest factor. This way, we can have all the prime factors of a number.

In the algorithm, we find a prime factor of n, compare it with the largest prime factor of n/i, and return the larger number. Thus, finally, we'll have the largest prime factor of the number n itself.
Obviously, we need a limit for the recursion. In this case, this limit occurs when i overflows sn. This means that n is prime, and hence we must return n itself as the largest prime number.

This was the logic of my program as I coded and finished it. I entered the answer into the Project Euler website (which is projecteuler.net, by the way). After that, I saw the forums in the website and saw that many had implemented my original algorithm which used prime testing. This more efficient recursive solution had missed the neurons of many... With that boost to my hubris, I turned to the pdf containing the 'overview' of the problem.

Wow, the algorithm described is very close to mine. One improvement is that the recursion has instead become an iteration. This is one area where I stumble every time - converting a recursive program into an iterative one. So I bookmarked this in my mind and moved on.

The next improvement was one that made me feel stupid for not thinking of it earlier. Since I'm going to find only prime factors, I need only use 2 and odd numbers in step 1. So, I can use a separate step for 2 and then start with 3 and increment i by 2. This way, we'll save about half of the iterations...

UPDATE: There are quite a number of other improvements that must be done to this algorithm. For the actual algorithm for this problem (which develops on the algorithm here), see Finding the Largest prime factor efficiently.

The final perl program I wrote was:

use strict;
use warnings;

my $magic = 600851475143;

sub largestprimef($);
sub max($$);

print largestprimef($magic);
sub largestprimef($)
{
my $n = shift;

my $i;
return largestprimef(max(2, $n/2)) if($n % 2 == 0);

my $sn = int(sqrt($n));

for ($i = 3; $i <= $sn; $i += 2)  {

  last if($n % $i == 0);

if($i > $sn) #loop ran over, means the number is prime
{
  return $n;
}
else
{
return max($i, largestprimef($n/$i));
}
}

sub max($$)
{
return (sort { $a <=> $b }(@_))[1];
}

A few notes about the program here:

not 'use'-ing the strict and warning pragmas gave a speed improvement of a few milliseconds
first I implemented max in the good old comparison way - if i > j, return i else return j. Turns out perl's sort implementation gives ~~faster results than this~~ equivalent speed. I just tried it for the heck of it ~~, and was surprised to see a few milliseconds improvement. Probably the speed improvement came from the removal of 'shift's to get the numbers from @_ to $i and $j.~~(that was a temporary illusion; in repeated testing both turned out to be equivalent.
Since I did the program in Linux, the age old newline problem cropped up. When I pasted the code into blogger, it showed the code as one continuous mess. I had to use http://www.hashemall.com/ for converting the newlines. The site dutifully hashed the program, and thankfully returned my original text with MS Windows newlines in place of Unix ones.. (The hash value under sha-1 was b8b7188ad3440084e5db7f44fd70d599f897e33a, in case you are interested... :) )
The final runtime was around 14 ms.
UPDATE: The code highlighting is courtesy of Online syntax highlighter 'tohtml' which is an excellent service by the way.

Wednesday, 7 November 2007

Law of small numbers and such...

The first strong law of small numbers (Gardner 1980, Guy 1988, Guy 1990) states "There aren't enough small numbers to meet the many demands made of them."
The second strong law of small numbers (Guy 1990) states that "When two numbers look equal, it ain't necessarily so." Guy (1988a) gives 35 examples of this statement, and 40 more in Guy (1990). For example, example 35 notes that the first few values of the interpolating polynomial [...] for , 2, ... are 1, 2, 4, 8, 16, .... Thus, the polynomial appears to give the powers of 2, but then continues 31, 57, 99, ... (Sloane's A000127).

I just read this at Wolfram Mathworld; and something flashed within me. What if the differences between this poly and the actual powers of two were significant? Well, being terribly bored and having a lot of time to kill, I set out to explore that. I was stuck at the first step itself - I was in Linux, and didn't know any way to do math explorations like this. Had I been in Windows, I would have used Matlab (damn good and surprisingly easy to use). I'm sure there's some 'popular' Free Software for this kind of job, but I was not in a mood to search for software packages and get into dependency hell just yet. So, I just asked myself, can I do it with what I do know to do in Linux?
Writing C programs. Perl scripts. Shell scripts. For this kind of job, Perl somehow seemed more appropriate. So, I typed out a Perl script to find the differences (here f(n) is the 4th order polynomial given in the above article):

n	f(n)	2^n	diff
6	31	32	1
7	57	64	7
8	99	128	29
9	163	256	93
10	256	512	256
11	386	1024	638
12	562	2048	1486
13	794	4096	3302
14	1093	8192	7099
15	1471	16384	14913
16	1941	32768	30827
17	2517	65536	63019
18	3214	131072	127858
19	4048	262144	258096
20	5036	524288	519252
21	6196	1048576	1042380
22	7547	2097152	2089605
23	9109	4194304	4185195
24	10903	8388608	8377705

No obvious pattern emerged. Then, on second thoughts, there did seem to be some sort of pattern to it. The differences are all increasing. But each one increased by a lesser amount than the previous one. 7/1 is (quite obviously) 7, 29/7 is slightly greater than 4, 93/29 is slightly greater than 3, etc. As the numbers increased, it seemed that they increased by successively smaller factors. The numbers had some kind of geometrical pattern - sort of like a geometric series where you get successive terms by multiplying each term by a constant ratio. However, in this case, the ratio wasn't exactly constant. Instead, it seemed to be tending towards some constant.

To check this out, I then printed the quotient obtained by dividing each difference by its predecessor. Lo and behold, suddenly, there was Light:

n	f(n)	2^n	diff	quotient
6	31	32	1	~
7	57	64	7	7
8	99	128	29	4.14285714285714
9	163	256	93	3.20689655172414
10	256	512	256	2.75268817204301
11	386	1024	638	2.4921875
12	562	2048	1486	2.32915360501567
13	794	4096	3302	2.22207267833109
14	1093	8192	7099	2.14990914597214
15	1471	16384	14913	2.10071841104381
16	1941	32768	30827	2.06712264467243
17	2517	65536	63019	2.04427936549129
18	3214	131072	127858	2.02888017899364
19	4048	262144	258096	2.01861440035039
20	5036	524288	519252	2.01185605356147
21	6196	1048576	1042380	2.00746458367036
22	7547	2097152	2089605	2.00464801703793
23	9109	4194304	4185195	2.00286417767951
24	10903	8388608	8377705	2.00174782775952

So, there you have it: the quotient of dividing each successive difference between 2^n and this f(n) is approaching 2. For n=70 and higher, it was so close to 2 that Perl gave it up and printed it as just 2. So, the quotients are converging towards 2 as far as we can see. Each successive difference, then, is obtained by multiplying the previous difference by a number very close to 2.
That is the kind of beauty math has. You obtain a sequence that looks like powers of 2 for small numbers, you find that it deviates from the actual value for higher powers, then you find that the deviation itself is increasing in powers of two. This thing has some inexplicable beauty to it... :)
So, now that we've found that this thing seems to converge to 2, how do we interpret it? That's the kind of thing for which you need mathematical training. If I had had some good math course on series and sequences, may be I would know what to do with this. May be I'd theoretically prove that the quotients actually converge to 2 as n tends to infinity. May be we can show that this f(n) can indeed be used to find powers of 2.
For now, let's get into the nitty gritty of this number crunching. First, I wrote this cute little program (any claims to its uncuteness are by Agents of Satan; avoid them like plague) :

#!/usr/bin/perl

use warnings;
use strict;

my @range = (6 .. 24); #6 is where the differences start to be non-zero
my $i;
my @diffs;

foreach $i (@range)
{
print $i,"\t\t";
print (( $i**4 - 6*$i**3 + 23*$i**2 - 18*$i + 24 ) / 24, "\t\t");
print 2**($i - 1), "\t\t";
push @diffs, (2**($i - 1) - (( $i**4 - 6*$i**3 + 23*$i**2 - 18*$i + 24 ) / 24));
print $diffs[$#diffs], "\t\t";
$#diffs && print $diffs[$#diffs]/$diffs[$#diffs-1];
print "\n";
}

Then, I ran the output through txt2html with --make_tables option. I got the HTML code for a table, which you can see with naked eye by right clicking here and choosing 'View source' (or whatever option reads close to that). Using the code, I made a blog post named 'Law of small numbers and such...'.

You just finished reading it!