Benutzer:Andreas Plank/Sed

Remember the rules:

sed works through its input line by line, not through its script command by command. That is, it will:

read one input line first and
apply all the commands to just this line
read the next input line
apply all the commands to just this line
…

What it does not do is:

read one command line
apply this to all input lines
read the next command line
apply this to all input lines
…

This gets complicated if you do multi-line work on that input line. In that case I recommend writing the multi-line commands first and the commands that do not span multiple lines after them. If you do it the other way round and write the single-line commands first and the multi-line commands after them, keep in mind that

sed reads one input line and runs the single-line commands on it alone; then it applies e.g. the command N (that is: it joins the lines as current-input-line\nnext-input-line) and applies all subsequent sed commands to this joined line;
the next input line to be processed is then not this «next-input-line» but the line after it!
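Both rules can be checked with two one-liners. This is only a small sketch, assuming GNU sed and a POSIX shell; the sample input and the variable names are made up for illustration.

```shell
# Rule 1: all commands run on the current line before the next line is read.
# The output is interleaved (a, [a], b, [b]); a command-by-command
# tool would print a, b first and [a], [b] afterwards.
per_line=$(printf 'a\nb\n' | sed -n -e 'p' -e 's/.*/[&]/p')
printf '%s\n' "$per_line"

# Rule 2: N pulls the next input line into the current cycle,
# so the following cycle starts with the line after it.
joined=$(printf '1\n2\n3\n4\n' | sed 'N; s/\n/+/')
printf '%s\n' "$joined"
```

The second command prints 1+2 and 3+4: line 2 never starts a cycle of its own.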

Contents

1 Text snippets for the sed running under Linux
2 Sed help
3 Replacements in a MediaWiki XML export
4 Split a file into separate single files according to its content: multiple {{Metadata | … | … }}

Text snippets for the sed running under Linux

#### file options
# -e, --expression=… → execute
# -f, --file= → file: script file
# -i, --in-place → insert into file: edit file in place
# -l 40, --line-length 40 → specify the desired line-wrap length for the “l” command
# -n, --quiet, --silent → nothing, i.e. quiet: do not print each line automatically, only what “p” prints
# -r, --regexp-extended → extended regular expressions
# -s, --separate → separate: consider files as separate
# -u, --unbuffered → unbuffered = load minimal amounts of data from the input files and flush the output buffers more often

#### actions 
# a → append action (after)
#     $a append after last line
# c → change: You can replace the current line with the ‘c’ action
# d → delete action 
# i → insert action (before)
#     1i insert before 1st line
# n → next: print the current line (unless -n is used), then read the next input line
# p → print action
# q → quit immediately without further processing

#### actions in detail
# d and D → delete
#   d command deletes the current pattern space and starts a new "cycle": the remaining commands are skipped, the next line is read into the pattern space, and execution starts again at the first sed command.
#   D command deletes the first portion of the pattern space, up to and including the first newline character, and restarts the cycle with the rest, without reading a new input line. Without a newline in the pattern space it acts like d.
# n and N → next
#   n command prints the current pattern space (unless the "--quiet, --silent or -n" flag is used), then replaces it with the next line of input.
#   N command does *not* print the current pattern space and does *not* empty it. It reads the next line and appends it to the pattern space, separated by a newline character:
#     first-whole-line   ┬→ first-whole-line\nsecond-whole-line
#     second-whole-line  ┘
# h and H → hold
#   h command copies the pattern space into the hold space, overwriting what was held. The pattern space is unchanged.
#   H command appends the pattern space to the hold space, with a "\n" between the entries, much as "N" appends to the pattern space. You can collect several lines in the hold space this way and print them only if a particular pattern is found later.



#### addresses for instance with p → print
  '1,10p' # line 1 to 10
  '/beginRE/,/endRE/p' # reg. expr: beginRE to endRE
  '10~2p' # line 10, then every 2nd line after it (GNU extension)
  '$='    # at the last line “$” print its line number “=”, i.e. count the lines
#### examples
# delete duplicate lines (the input must be sorted, only adjacent duplicates are found)
  sed '$!N; /^\(.*\)\n\1$/!P; D' temp2.txt > temp3.txt
# delete: 
  sed -e '1,10d' # line 1-10
  sed -e '11,$d' # line 11 to end of file
  sed -e '10~2d' # delete every 2nd line starting from 10
# extract:
  sed -n -e '1,10p'
# quit 
  sed -e '10q' # quit after line 10, i.e. print lines 1 to 10 only
# commands on multiple lines or with -e again:
  sed -e '1,4d  
    6,9d'
  sed -e '1,4d' -e '6,9d'
# delete lines with debug + print lines with foo
  sed -n -e '/debug/d' -e '/foo/p'
# print only the lines that contain neither “[” nor “]”
  echo -e "line one\nli[ne] two" | sed --silent --expression '/[][]\+/!{ p; }'
  echo -e "line one\nli[ne] two" | sed -ne '/[][]\+/!{ p; }'
# pipes
  gcc sourcefile.c 2>&1 | sed -n -e '/warning:/,/error:/p'
  gcc sourcefile.c 2>&1 | sed --silent --expression '/warning:/,/error:/p'
# upper case to lower case and vice versa
  echo qWeRtzzuiPÜ | sed --regexp-extended --expression 's@([[:lower:]]?)([[:upper:]]?)@\U\1\L\2@g' # or
  echo qWeRtzzuiPÜ | sed -re 's@([[:lower:]]?)([[:upper:]]?)@\U\1\L\2@g' # or
  echo qWeRtzzuiPÜ | sed  --expression 's@\([[:lower:]]\?\)\([[:upper:]]\?\)@\U\1\L\2@g'
  echo qWeRtzzuiPÜ | sed  -e 's@\([[:lower:]]\?\)\([[:upper:]]\?\)@\U\1\L\2@g'
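The difference between n and N and the use of the hold space are easier to see on a tiny input. The following sketch assumes GNU sed; the input and variable names are made up, and the G command (append the hold space to the pattern space) is not in the list above.

```shell
# n (with -n) drops the current line and reads the next one,
# so only every second line reaches the p command.
every_second=$(printf '1\n2\n3\n4\n' | sed -n 'n;p')
printf '%s\n' "$every_second"

# Reverse the line order with the hold space:
#   1!G  on every line except the first, append the hold space to the pattern space
#   h    copy the result back into the hold space
#   $p   on the last line, print everything that was collected
reversed=$(printf 'a\nb\nc\n' | sed -n '1!G;h;$p')
printf '%s\n' "$reversed"
```

The first command prints 2 and 4, the second prints c, b, a.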

Sed help

### Command line syntax ###############
#   file ↘
#    sed -f sed_replacements.sed old_file.txt > new_file.txt
# insert ↘    ↙ file
#    sed -i -f sed_replacements.sed overwritten.txt

### Regular expressions ###############
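# note: by default sed uses basic regular expressions, which differ from the extended syntax of most other tools !!!
# in short: a plain + ? ( ) { } | is a literal character; it becomes an operator only when escaped (or when -r is used)
#   extended → basic (default)
#   (..) → \(..\)  group, referenced as \1, \2, …
#   ?    → \?      0 or 1
#   .+   → .\+     1 or many
#   .*   → .*      0 or many
#   [..] → [..]    character class or range
#   {..} → \{..\}  repetition count
#   |    → \|      means “or”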

### Search and replace ################
#  ↙ search          ↙ global scope
# s/search/replace/g;
# s/\(search\)/\U\1\E/g; # finds “search” → “SEARCH” in \U (upper case) \E stops transformation
# s/\(SEARCH\)/\L\1\E/g; # finds “SEARCH” → “search” in \L (lower case) \E stops transformation

### Search and replace (address) ######
#   /address/s/search/replace/g
# NOT-matched or everything except “address”
#   /address/!s/search/replace/g
# address can be: 1,$ (1st line to the end) or a search pattern

### multiline search with “address” ###
#               “address”
#┌─────────────────┴─────────────────────┐
#   first line   append \n  second line
#┌───────┴────────┐ ↓  ┌────────┴────────┐ 
/first line pattern/N;/second line pattern/{
  # do something with the found pattern
  #     search                replace       global scope
  #       ↓                      ↓          ↓
  s@first line\nsecond line@replace pattern@g
}
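The escaping rules and the multi-line address can be tried out directly on the command line. A minimal sketch, assuming GNU sed; the sample strings are invented.

```shell
# Basic syntax (default): the escaped \+ is the "1 or many" operator ...
bre=$(echo 'aab' | sed 's/a\+/X/')
# ... while a plain + is a literal plus sign.
literal=$(echo 'a+b aab' | sed 's/a+/X/')
# Extended syntax (-r): the plain + is the operator.
ere=$(echo 'aab' | sed -r 's/a+/X/')
printf '%s\n' "$bre" "$literal" "$ere"

# Multi-line search: append the line after "alpha", then replace across the \n.
multi=$(printf 'alpha\nbeta\ngamma\n' | sed '/alpha/N;/beta/{
  s@alpha\nbeta@alpha+beta@
}')
printf '%s\n' "$multi"
```

This prints Xb, Xb aab and Xb, followed by alpha+beta and gamma.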

Replacements in a MediaWiki XML export

See Sed help.

 ######### MediaWiki Export ##########
 #  don't replace these escaped entities with their plain form!!
 #  &amp;nbsp; → &nbsp;
 #  &lt;br /&gt; → <br/>
 #####################################
 
 # set username
 # syntax explained
 #       first line           and append \n and second line
 #┌───────────┴───────────────────┐ ↓  ┌───────────┴──────────────┐ 
 /\(<username>\).\+\(<\/username>\)/N;/\(.\+<id>\)[0-9]\+\(<\/id>\)/{
   # do something with the found pattern
   # replace user name and user ID now with "\n"
   s@\(<username>\).\+\(</username>\n.\+<id>\)[0-9]\+\(</id>\)@\1Your Name\2123\3@g
 }
 
 # comment
 /<comment>/{
   :label_add_newlines
   # as long as the pattern space does not (!) contain '</comment>',
   # append the (N)ext line as '\n…' and go (b)ack to label_add_newlines
   /<\/comment>/!{
     N
     b label_add_newlines
   }
   s@\(<comment>\).\+\(</comment>\)@\1new comment\2@g;
 }
 
 # delete the revision ID
 /\(<revision>\)/N;/\( \+<id>\)[0-9]\+\(<\/id>\)/{
 s@\(<revision>\n \+<id>\)[0-9]\+\(<\/id>\)@\1\2@g
 }
 
 # timestamp  (-2 hours)
 s@\(<timestamp>\).\+\(</timestamp>\)@\12011-09-08T10:12:11Z\2@g
 
 # do your replacements here
 /^|\(Stichworte\|Ort\) *=.\+$/{
   s@,@;@g
 }
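The comment replacement can be tested on a small sample before it is run on a real export. A minimal sketch, assuming GNU sed; the sample text and the label name join are made up.

```shell
sample='<comment>line one
line two</comment>
<text>unchanged</text>'

result=$(printf '%s\n' "$sample" | sed '/<comment>/{
  :join
  # as long as the closing tag is missing, append the next line and loop
  /<\/comment>/!{
    N
    b join
  }
  s@\(<comment>\).*\(</comment>\)@\1new comment\2@
}')
printf '%s\n' "$result"
```

The two comment lines are joined in the pattern space, so the single substitution can replace the whole element; the <text> line passes through untouched.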

All of the instructions shown above work only if the search pattern can find its match within a single line, because sed matches one line at a time. To search across several lines, you have to append all of those lines to one another with N, so that they become a single line in the so-called pattern space. Then you carry out the replacements and let sed print the result. This is done with labels that the script can branch back to. If, in the case of <text…</text>, you want to create a redirect that needs the page title, it looks like this:

# beware!! it corrupts the history!!! 
# create a REDIRECT based on the title
/<title>.*<\/title>/ h # save the found title to the hold space

/<text/ { # start at <text
:labelTextStart # set a marker/label to cycle back later on <text
  N  # append new lines
  /<\/text>/!b labelTextStart # if it is not </text> cycle back to labelTextStart
  # all <text…</text> is now in ONE single line!!
  s@<text.*</text>@@ # replace it
  x # exchange hold space and pattern space (get the title)
  s@<title>\(.*\)<\/title>@<text xml:space="preserve">#REDIRECT [[New page \1]]</text>@
}
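To see how the hold space carries the title over to the <text> element, the script can be run on a two-element sample. This is a sketch only, assuming GNU sed; the sample page, the label name collect and the redirect target prefix "New page" are placeholders, and the redirect is written in MediaWiki's #REDIRECT [[…]] form.

```shell
export_sample='    <title>Foo</title>
      <text xml:space="preserve">old first line
old second line</text>'

redirected=$(printf '%s\n' "$export_sample" | sed '
/<title>.*<\/title>/h
/<text/{
  :collect
  /<\/text>/!{
    N
    b collect
  }
  s@<text.*</text>@@
  x
  s@.*<title>\(.*\)</title>@<text xml:space="preserve">#REDIRECT [[New page \1]]</text>@
}')
printf '%s\n' "$redirected"
```

The title line is printed unchanged in its own cycle; the joined <text> element is emptied, swapped against the held title with x, and rebuilt as the redirect.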

Split a file into separate single files according to its content: multiple {{Metadata | … | … }}

Bear in mind that the file should have Unix line ends (“\n” only), not Windows line ends (“\r\n”, carriage return plus line feed), because sed treats the “\r” as part of the line, so that patterns anchored with $ no longer match.
You can strip the carriage return at the end of every line with a (s)ubstitution:
sed 's@\r$@@' sourcefile.txt
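The effect of the carriage return on an anchored pattern can be shown with a two-line sample. A minimal sketch, assuming GNU sed (which understands \r in a pattern); the temporary file and its content are made up.

```shell
# Build a two-line sample with Windows line ends.
crlf_sample=$(mktemp)
printf 'first\r\nsecond\r\n' > "$crlf_sample"

# With the \r still attached, the anchored pattern "second$" finds nothing.
before=$(sed -n '/second$/p' "$crlf_sample" | wc -l)
# After stripping the \r at each line end it matches once.
after=$(sed 's/\r$//' "$crlf_sample" | sed -n '/second$/p' | wc -l)
echo "matches before: $before, after: $after"
rm -f "$crlf_sample"
```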

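Let’s say we want to split all {{Metadata | … | … }} templates contained in a single file into separate files, named after data found inside each template. So we look for the following pieces of text, which delimit and name each template: the opening line {{Metadata, the file name at the end of the Best Quality URI line, and the closing line }}: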

{{Metadata
| Type = StillImage
| Description ={{Metadata/Description | language code=en | content=Drabkina 1999: A selection of 114 drawings, redrawn by M. Drabkina after the publication by Punithalingam (1988): E. Punithalingam: Ascochyta. II. Species on monocotyledons (excluding grasses), cryptogams and gymnosperms. Mycological Papers, 159 (1988), pp. 1–235.}}
| Provider Page = Julius Kühn Institute – Federal Research Institute for Cultivated Plants
| Title = Ascochyta acori
| Resource ID = 362574154
| Copyright Statement = © M. Drabkina 1999
| License Statement = Creative Commons non-commercial, by attribution, share-alike license (version 2.5)
| Creators = M. Drabkina;
| Metadata Creator = G. Hagedorn
| Language = zxx
| Metadata Language = en
| Original Creation Date = 12.10.2001
| Country Codes = global
| Subject Category = Deuteromycota
| Taxonomic Coverage = Ascochyta
| Scientific Names = Ascochyta acori Oudem.
| Taxon Count = 1
| Vernacular Names = 
| General Keywords = 
| Best Quality URI = http://212.201.100.117/storage/Fungi/Coelos/Punithalingam%20Ascochyta/edt/AscP88-002.png
| Best Quality Availability = online (free)
}}

{{Metadata
| Type = StillImage
| Description ={{Metadata/Description | language code=en | content=Drabkina 1999: A selection of 114 drawings, redrawn by M. Drabkina after the publication by Punithalingam (1988): E. Punithalingam: Ascochyta. II. Species on monocotyledons (excluding grasses), cryptogams and gymnosperms. Mycological Papers, 159 (1988), pp. 1–235.}}
| Provider Page = Julius Kühn Institute – Federal Research Institute for Cultivated Plants
| Title = Ascochyta acori
| Resource ID = 601700422
| Copyright Statement = © M. Drabkina 1999
| License Statement = Creative Commons non-commercial, by attribution, share-alike license (version 2.5)
| Creators = M. Drabkina;
| Metadata Creator = G. Hagedorn
| Language = zxx
| Metadata Language = en
| Original Creation Date = 12.10.2001
| Country Codes = global
| Subject Category = Deuteromycota
| Taxonomic Coverage = Ascochyta
| Scientific Names = Ascochyta acori Oudem.
| Taxon Count = 1
| Vernacular Names = 
| General Keywords = 
| Best Quality URI = http://212.201.100.117/storage/Fungi/Coelos/Punithalingam%20Ascochyta/edt/AscP88-001.png
| Best Quality Availability = online (free)
}}

Now we extract these pieces of text, modify them a little, and finally build a sed command that writes exactly the lines found (by line number) into a file. This sed command should look like:

sed "42,64w filename.txt" # from line 42 to (,) line 64 (w)rite to file “filename.txt”
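The w command can be tried on a few lines first. A small sketch, assuming GNU sed; the temporary directory and the file name part.txt are made up.

```shell
workdir=$(mktemp -d)
# Write only lines 2 to 4 of the input to a file;
# --silent suppresses the normal output on the terminal.
printf 'one\ntwo\nthree\nfour\nfive\n' | sed --silent "2,4w $workdir/part.txt"
written=$(cat "$workdir/part.txt")
printf '%s\n' "$written"
rm -r "$workdir"
```

The file then holds the lines two, three and four. Everything after the w up to the end of the line is taken as the file name, so w has to be the last command on its line.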

Note that BASH allows you to chain commands with a pipe: command | next-command | next-command passes the output of each command on as input to the next one. So several modifications can be chained. With sed it can be done as follows:

# store the source file name and the output directory
source_metadata_file="Drabkina 1999: selected drawings after Punithalingam (1988): Ascochyta on monocotyledons.mwt";
dataset_directory="Drabkina_1999"
mkdir -p "$dataset_directory" # the (w)rite command of sed does not create missing directories
# proceed with reading the source file, apply sed commands to it and process its output further (after |) ...
sed --regexp-extended --silent '
/^\{\{Metadata$/,/^\}\}$/ { # find lines of only “<line-start>{{Metadata<line-end>” to  “<line-start>}}<line-end>”
  /^\{\{Metadata$/{   # at line “<line-start>{{Metadata<line-end>”
    =; # print line number (=) 
  };
  /Best Quality URI/{ # at line with Best Quality URI
    # extract the file basename and (s)ubstitute its extension with that of a MediaWiki text file (.mwt)...
    s@.*Best Quality URI.*/([^/]+)\.[a-z]{3,4}$@\1.mwt@g; 
    p; # (p)rint result (new file basename)
  };
  /^\}\}$/{# at line  “<line-start>}}<line-end>” 
    =; # print the line number (=)
  };
  ######################
  # we should get an output like:
  # 18
  # AscP88-002.mwt
  # 40
  # 42
  # AscP88-001.mwt
  # 64
  # ...
  ######################
  # now this output can be processed further: chain the next BASH command with |
}' "$source_metadata_file" | sed --regexp-extended --silent "# process the preceding output further
/^[0-9]+\$/{ # at “<line-start>any number/numbers<line-end>”
  N; # append (N)ext line by \n
  N; # append (N)ext line by \n
  # substitute “<line-start>any-numbers (\n)ewline anything (\n)ewline any-numbers<line-end>”
  # by a comment what it will do and the very sed command ...
  s@^([0-9]+)\n(.+)\n([0-9]+)\$@echo 'from line \1 to \3 write to file ./$dataset_directory/\2'\nsed --silent '\1,\3w ./$dataset_directory/\2' '$source_metadata_file'@;
  p; # (p)rint
};" > split_file_at_metadata.sh
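The first sed stage can be checked on its own with a shortened sample template. This is a sketch under assumptions: GNU sed, a temporary file, and an invented example.org URL standing in for the real one.

```shell
sample_file=$(mktemp)
cat > "$sample_file" <<'EOF'
some text before
{{Metadata
| Title = Ascochyta acori
| Best Quality URI = http://example.org/storage/AscP88-002.png
| Best Quality Availability = online (free)
}}
EOF

stage_one=$(sed --regexp-extended --silent '
/^\{\{Metadata$/,/^\}\}$/ {
  /^\{\{Metadata$/=
  /Best Quality URI/{
    s@.*Best Quality URI.*/([^/]+)\.[a-z]{3,4}$@\1.mwt@
    p
  }
  /^\}\}$/=
}' "$sample_file")
printf '%s\n' "$stage_one"
rm -f "$sample_file"
```

For this sample the stage prints 2, AscP88-002.mwt and 6: the start line, the new file name and the end line of the template.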

Check the file split_file_at_metadata.sh. Its content should look like:

######################
head split_file_at_metadata.sh
######################
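echo 'from line 18 to 40 write to file ./Drabkina_1999/AscP88-002.mwt'
sed --silent '18,40w ./Drabkina_1999/AscP88-002.mwt' 'Drabkina 1999: selected drawings after Punithalingam (1988): Ascochyta on monocotyledons.mwt'
echo 'from line 42 to 64 write to file ./Drabkina_1999/AscP88-001.mwt'
sed --silent '42,64w ./Drabkina_1999/AscP88-001.mwt' 'Drabkina 1999: selected drawings after Punithalingam (1988): Ascochyta on monocotyledons.mwt'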

Make the file split_file_at_metadata.sh executable and run it:

chmod ugo+x split_file_at_metadata.sh
./split_file_at_metadata.sh # execute above commands
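The second stage and the final run can be rehearsed in a scratch directory before touching real data. A sketch only, assuming GNU sed; the names sample.mwt, out and split_sample.sh are hypothetical, and the number/name triples are typed in by hand in place of the first stage's output.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
source_metadata_file="sample.mwt"   # hypothetical source file
dataset_directory="out"             # hypothetical output directory
mkdir -p "$dataset_directory"
printf '%s\n' 'intro' '{{Metadata' '| Title = one' '}}' \
  '{{Metadata' '| Title = two' '}}' > "$source_metadata_file"

# Triples as the first sed stage would print them: start line, file name, end line.
printf '%s\n' 2 one.mwt 4 5 two.mwt 7 | sed --regexp-extended --silent "
/^[0-9]+\$/{
  N
  N
  s@^([0-9]+)\n(.+)\n([0-9]+)\$@sed --silent '\1,\3w ./$dataset_directory/\2' '$source_metadata_file'@
  p
}" > split_sample.sh

cat split_sample.sh   # show the generated commands
sh split_sample.sh    # run them
second_part=$(cat "$dataset_directory/two.mwt")
printf '%s\n' "$second_part"
```

Each triple becomes one sed call with a w command, and out/two.mwt ends up holding exactly lines 5 to 7 of the sample file.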
