Friday, September 12, 2014

Help me switch completely to console Vim on OSX

I have decided to stop using MacVim. There is no one specific reason. A whole bunch of little pinpricks have made me uncomfortable enough that I deleted it from my Applications folder, cleaned my Open-With menu, and I am using a custom-compiled Vim from iTerm2 now.

The only thing I am missing is the ability to open a file in a Finder window by right-clicking and selecting something like "Edit with Vim", and having the file opened either as a buffer in a currently running Vim instance in an iTerm window, or in a new instance. This is obviously not essential, as I can navigate within Vim, especially using the wonderful CtrlP plugin.

However, it is bothering me that I don't know how to do this, and if anyone has already found a way, I would appreciate hearing about it.

Thursday, September 11, 2014

Scraping PDF documents without losing your sanity

The epiphany came when I was trying to extract usable information from a bunch of documents.

Some people insist on distributing essential information in PDF format, making it very hard to make use of said information.

Now, I have never really made it past the table of contents of Adobe's PDF Reference, and I can't really figure out many of the available Perl modules dealing with PDFs. I know what goes into a PDF document (basically boxes with coordinates), but, just as I have never written a web server in Postscript either, I haven't been able to go into this in depth.

One of the problems with utilities that naively convert PDF to text is that they usually do a straightforward translation of the layout, which does a number on the order in which the text comes out. The location of an object on the page and its position in the object stream don't correspond very reliably to each other.

Thanks to Thomas Levine, I found out about pdftohtml.

At first, I was very frustrated … Then, I realized the value of the -xml option.

With this option, the PDF document is output as <page> and <text> elements. For example:

<page number="6" position="absolute"
  top="0" left="0" height="918" width="1188">
…
<text top="176" left="109" width="125"
  height="15" font="2">DATA RECORD </text>

This is extremely useful when trying to extract information. First, if the entity producing the document used consistent styling, the font attribute of the text elements can be used to select items of interest. However, multi-column documents are still a pain.
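Here is a minimal sketch in Python of selecting text elements by their font attribute. The sample document is made up in the shape of the excerpt above: the <pdf2xml> root element, the second <text> element, and the function name texts_with_font are assumptions for illustration, not something taken from a real document.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of the excerpt above. The <pdf2xml> root
# and the second <text> element are assumptions for illustration.
SAMPLE = """<pdf2xml>
<page number="6" position="absolute" top="0" left="0" height="918" width="1188">
<text top="176" left="109" width="125" height="15" font="2">DATA RECORD </text>
<text top="200" left="109" width="300" height="12" font="0">Some body text</text>
</page>
</pdf2xml>"""


def texts_with_font(xml_text, font_id):
    """Return (page, top, left, text) for every <text> using the given font."""
    root = ET.fromstring(xml_text)
    hits = []
    for page in root.iter("page"):
        for el in page.iter("text"):
            if el.get("font") == font_id:
                # itertext() also picks up text nested in child elements
                content = "".join(el.itertext()).strip()
                hits.append((int(page.get("number")), int(el.get("top")),
                             int(el.get("left")), content))
    return hits


print(texts_with_font(SAMPLE, "2"))
```

Which font number marks the items of interest depends entirely on the document, so that has to be found by looking at the XML first.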

The key to my epiphany lies in sorting the text elements using a lexicographic ordering: Text on page 5 should come before text on page 7. Text in column one comes before text in column three. Text on line five in column two comes before text on line two in column three … See what I did there?

At first you might think it is OK to define columns using the left attributes of text elements. The problem is when some attributes for the data you want to extract are defined in section headers that can appear in the middle of a column. People will usually center the text in those headers (for visual aesthetic reasons), and therefore they will appear to be in a later column than the data items that follow.

This may seem obvious right now, but the solution came to me only after looking at the following plot:

That is, I need a mapping of ranges of left margins to columns.

Once that mapping is defined, text elements can be sorted into a natural reading order, and information can be extracted using usual methods.
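The idea can be sketched in Python as follows. Everything specific here is an assumption for illustration: the column boundaries in COLUMN_STARTS, the <pdf2xml> root element, and the sample text elements are made up, and real boundaries have to be read off the distribution of left values in the document at hand.

```python
import bisect
import xml.etree.ElementTree as ET

# Hypothetical mapping of left-margin ranges to columns:
# [0, 400) -> column 0, [400, 800) -> column 1, [800, ...) -> column 2
COLUMN_STARTS = [0, 400, 800]

# Made-up sample, deliberately out of reading order. "Header A" is
# centered, so its left value is larger than that of the items below it.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="918" width="1188">
<text top="300" left="420" width="50" height="15" font="1">B2</text>
<text top="120" left="60" width="50" height="15" font="1">A1</text>
<text top="100" left="180" width="90" height="15" font="2">Header A</text>
<text top="100" left="420" width="50" height="15" font="1">B1</text>
<text top="140" left="60" width="50" height="15" font="1">A2</text>
</page>
<page number="2" position="absolute" top="0" left="0" height="918" width="1188">
<text top="50" left="60" width="50" height="15" font="1">C1</text>
</page>
</pdf2xml>"""


def column_of(left):
    """Map a left margin to the index of the range it falls into."""
    return bisect.bisect_right(COLUMN_STARTS, left) - 1


def reading_order(xml_text):
    """Sort text elements lexicographically by (page, column, top, left)."""
    root = ET.fromstring(xml_text)
    items = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for el in page.iter("text"):
            left, top = int(el.get("left")), int(el.get("top"))
            key = (page_no, column_of(left), top, left)
            items.append((key, "".join(el.itertext()).strip()))
    items.sort(key=lambda item: item[0])
    return [text for _, text in items]


print(reading_order(SAMPLE))
```

Because the centered header falls in the same left-margin range as the items under it, it sorts into the same column and ends up ahead of them by virtue of its smaller top value.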

Monday, September 1, 2014

Stop your Mac from keeping a perpetual connection to Apple

I had done this some time ago on my laptop, but had to try to remember once again while helping someone else. I am just noting it here so it is not as difficult to remember the next time :-)

Basically, the problem is:

$ netstat -an
tcp4       0      0  192.168.xxx.xxx.52623      17.172.233.127.5223    ESTABLISHED
tcp4       0      0  192.168.xxx.xxx.52622      17.172.232.9.5223      ESTABLISHED

These connections are established as soon as the user logs in, and maintained perpetually.
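If you want to check for these without eyeballing the whole listing, here is a rough Python sketch that filters netstat -an output by remote port. It assumes the six-column layout shown above, with the port attached to the address by a dot. The function name connections_to_port and the third sample line (port 443) are made up for illustration.

```python
def connections_to_port(netstat_output, port):
    """Return (local, remote_host, state) for TCP lines whose remote port matches."""
    matches = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local, foreign, state
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        # The port is the part after the last dot of the address
        host, _, remote_port = foreign.rpartition(".")
        if remote_port == str(port):
            matches.append((local, host, state))
    return matches


# First two lines are from the listing above; the third is a made-up example.
SAMPLE = """\
tcp4       0      0  192.168.xxx.xxx.52623      17.172.233.127.5223    ESTABLISHED
tcp4       0      0  192.168.xxx.xxx.52622      17.172.232.9.5223      ESTABLISHED
tcp4       0      0  192.168.xxx.xxx.52700      93.184.216.34.443      ESTABLISHED
"""

for local, host, state in connections_to_port(SAMPLE, 5223):
    print(local, "->", host, state)
```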

$ lsof -i 4tcp
apsd    334 root    8u  IPv4 0x…      0t0  TCP 192.…:52622->17.172.232.9:5223 (ESTABLISHED)
apsd    334 root   11u  IPv4 0x…      0t0  TCP 192.…:52622->17.172.232.9:5223 (ESTABLISHED)
apsd    334 root   12u  IPv4 0x…      0t0  TCP 192.…:52623->17.172.233.127:5223 (ESTABLISHED)
apsd    334 root   14u  IPv4 0x…      0t0  TCP 192.…:52623->17.172.233.127:5223 (ESTABLISHED)

Seriously annoying.

apsd is not a rogue process or anything, but here's what the man page says:

apsd
ApplePushService daemon for Apple Push Notification service.
This is part of the ApplePushService framework.

There are no configuration options to apsd.
Users should not run apsd manually.

Well, alrighty then.

apple.stackexchange to the rescue:

$ sudo launchctl unload -w \
/System/Library/LaunchDaemons/com.apple.apsd.plist

turns it off, and,

$ sudo launchctl load -w \
/System/Library/LaunchDaemons/com.apple.apsd.plist

turns it back on.

HTH

Sunday, August 31, 2014

Context dependence in Turkish

So, I tried to improve the Turkish translation of Gabor's Scalar and List context in Perl, the size of an array. Given that my heart is not really in it, I am not going to make a habit of this, but I do hope that the new version is useful.

In thinking of a corresponding Turkish example to Gabor's example of context dependence in English, the first thing that popped into my head was karı-koca versus karı küredim which is OK, but not very impressive.

I like my example much better:

  1. Çivi çakmak
  2. Çakmak çakmak
  3. Matematik'ten çakmak
  4. Beşlik çakmak

where çakmak çakmak and Matematik'ten çakmak both have at least two meanings each depending on the context.

Friday, August 29, 2014

Replacing hash keys with values does not a translation make

<rant>

Some time ago, Gabor and I had a disagreement regarding the value of translating programming articles to languages other than English.

In a nutshell, having actually worked as a translator (Danish↔Turkish and English↔Turkish, including a stint translating, for CTW producers, Sesame Street episodes written by Turkish writers for the Turkish version), I am quite familiar with what happens when people attempt to translate meaning by looking up terms in a dictionary.

I am afraid, Perlde scalar ve list bağlam, bir dizinin boyutu (English original) forms a good example of why such translations are not only not useful, but also harmful.

I am sure Kadir Beyazlı put in a lot of good work into the translation, but the result is an abomination.

The grammar error in the first word of the title is repeated throughout the body of the post.

More importantly, look at that title again:

Perlde scalar ve list bağlam, bir dizinin boyutu

Is that Turkish or English?

Having failed to find good terms to replace scalar and list, the translator decided to keep using the English words. How does that help a person who supposedly doesn't understand English, and, therefore, would be reading this translation? For all she knows, we could have used pony and rabbit instead of scalar and list, and, so long as the substitution was consistent, she would get the same benefit out of reading this.

If you look at the English-Turkish Math Dictionary, scalar is translated as sayısal which actually means numeric, which kinda works when we are talking about an array in scalar context, but then fails when we say a reference is a scalar.

Let me state this unequivocally: The Turkish language has been impoverished over the past century by the blind culling of words of Arabic, Farsi, and in some instances Hebrew origin (although, at least Eylül is still Eylül) in some blind drive towards purification of the language following Atatürk's reforms. Blind importation of words from English and French did not help either (Turks cannot distinguish among the meanings of the word economy in the Turkish Economy, the Economics Department, and economy class, which saddens me. Hint: Türkiye Ekonomisi, İktisat Bölümü, and ucuz bölüm).

I am happy I did not have to learn to write Turkish in the Arabic alphabet, but every time I think of my grandfather Dr. Şinasi Kıpçak's vocabulary and mastery of the language, I am filled with both nostalgia and envy.

Coming back to how to translate scalar and list context to Turkish …

Here, as in many cases, the translator must think about what words express the meanings of those phrases most consistently and usefully.

To me, the answer is clear: Scalar context in Perl refers to situations where something is interpreted as just one thing.

So, a translation of that title that actually conveys the meaning instead of doing a simple hash lookup might be:

Perl programlarında tekli ve çoklu bağlam: Bir dizinin elemanlarının sayısı

whose literal translation back to English would be Scalar and list context in Perl programs: The number of elements in an array. I believe such a translation conveys a whole lot more meaning to a person who actually does not speak English.

Moving on, we have:

Mesela "left" kelimesi birçok anlam içerir:

I left the building.

I turned left at the building.

Why use English examples to explain how we can deduce the meaning of homonyms from context? How does someone who does not speak English get anything out of that?

Why not use a simple Turkish example?

Karı küredim.

Karı-koca.

I am willing to bet the sentence Çözümü SCALAR bağlamda veri döndüren scalar() fonksiyonunu kullanmaktır does not make any more sense to a Turkish speaker who speaks no English than The solution is to use the scalar() function that will create SCALAR context for its parameter.

Translation and hash-lookup are different things. If you want to convey meaning, you have to have a command of both languages, and the subject matter. Without that, you are only going to add to the word soup. Translating big event as büyük okazyon helps no one.

I am sure both Kadir and Gabor had the best of intentions with these translations. I just happened to notice that their collaboration happened to produce a translation that highlights everything that a non-English speaking aspiring programmer has to fight with.

In my experience, trying to learn programming from translated technical writings is a fool's errand. One would be far better off picking up a little English, watching movies with subtitles, and reading a great book such as Learning Perl. When doing so, consult mostly an English-English dictionary. Stick with it for about six months, through thick and thin, and you'll be amazed how much better your results will be through that process rather than fighting through:

Şu an bir dizinin SCALAR bağlamdaki değerinin eleman sayısı olduğunu biliyoruz. Ayrıca eğer ki dizi boş ise bu değerin 0 (that is FALSE) olduğunu, 1 veya daha fazla eleman içeriyor ise pozitif bir sayı (that is TRUE) olduğunu da biliyoruz.

For the love of God, what are you doing, my dear fellow??? The Turkish equivalent of "that is" in that context is "yani". Also, why "şu an"? And the word in the title should be "Perl'de".

</rant>

Friday, August 22, 2014

Convert multi-page PDF to individual PNG images using GraphicsMagick

I had to look this up … A lot of hits by Google show ancient syntax. Here's what works:

$ gm convert 'document.pdf[12-45]' +adjoin output-%03d.png

HTH

Monday, August 18, 2014

File::Which comes with its own 'multiwhich'

I uploaded App::multiwhich, based on a script I have been using for many years, in observance of #CPANDAY. While I honestly thought it was a cute, useful little utility which I could improve by fixing edge-cases, I just realized that there is no reason for you to use it ;-)

File::Which comes with its own command-line utility called pwhich. For example:

$ pwhich -a perl vim doesnotexist
/Users/auser/perl/5.20.0/bin/perl
/opt/local/bin/perl
/usr/bin/perl
/opt/local/bin/vim
/usr/bin/vim
pwhich: no doesnotexist in PATH
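For the curious, the basic idea behind listing every match can be sketched in a few lines of Python. This is not how File::Which or pwhich is implemented, just an illustration of the search under simplifying assumptions: POSIX-style executables, no handling of file extensions, and empty PATH entries skipped. The function name which_all is made up.

```python
import os


def which_all(name, path=None):
    """Return every executable called `name` found along PATH, in PATH order."""
    if path is None:
        path = os.environ.get("PATH", "")
    found = []
    for directory in path.split(os.pathsep):
        if not directory:
            # Simplification: skip empty entries instead of treating
            # them as the current directory
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            found.append(candidate)
    return found


for program in ("perl", "vim", "doesnotexist"):
    hits = which_all(program)
    if hits:
        print("\n".join(hits))
    else:
        print(f"no {program} in PATH")
```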

The module definitely predates my foray into Perl. I cannot fathom how I missed the pwhich utility.

So, don't use App::multiwhich. Use pwhich. I'll make the requisite changes in the module distribution.